Commit Graph

318 Commits

Author SHA1 Message Date
Georgi Gerganov
9da243b36a
Revert "llava : add support for moondream vision language model (#6899)"
This reverts commit 46e12c4692.
2024-05-08 22:14:39 +03:00
Jeximo
c780e75305
Further tidy on Android instructions README.md (#7077)
* Further tidy on Android instructions README.md

Fixed some logic when following readme direction

* Clean up redundent information

A new user arriving will see simple directions on llama.cpp homepage

* corrected puncuation

Period after cmake, colon after termux

* re-word for clarity

method seems to be more correct, instead of alternative in this context

* Organized required packages per build type

building llama.cpp with NDK on a pc doesn't require installing clang, cmake, git, or wget in termux.

* README.md

corrected title

* fix trailing whitespace
2024-05-08 02:26:43 +02:00
Georgi Gerganov
53d6c52e22
readme : update hot topics 2024-05-07 21:43:13 +03:00
Georgi Gerganov
bcdee0daa7
minor : fix trailing whitespace 2024-05-06 09:31:30 +03:00
Lyle Dean
ca36326020
readme : add note that LLaMA 3 is not supported with convert.py (#7065) 2024-05-05 08:21:46 +03:00
Jeximo
cf768b7e71
Tidy Android Instructions README.md (#7016)
* Tidy Android Instructions README.md

Remove CLBlast instructions(outdated), added OpenBlas.

* don't assume git is installed

Added apt install git, so that git clone works

* removed OpenBlas

Linked to Linux build instructions

* fix typo

Remove word "run"

* correct style

Co-authored-by: slaren <slarengh@gmail.com>

* correct grammar

Co-authored-by: slaren <slarengh@gmail.com>

* delete reference to Android API

* remove Fdroid reference, link directly to Termux

Fdroid is not required

Co-authored-by: slaren <slarengh@gmail.com>

* Update README.md

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-04 18:10:15 +02:00
Olivier Chafik
b8a7a5a90f
build(cmake): simplify instructions (cmake -B build && cmake --build build ...) (#6964)
* readme: cmake . -B build && cmake --build build

* build: fix typo

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* build: drop implicit . from cmake config command

* build: remove another superfluous .

* build: update MinGW cmake commands

* Update README-sycl.md

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

* build: reinstate --config Release as not the default w/ some generators + document how to build Debug

* build: revert more --config Release

* build: nit / remove -H from cmake example

* build: reword debug instructions around single/multi config split

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2024-04-29 17:02:45 +01:00
Georgi Gerganov
24affa7db3
readme : update hot topics 2024-04-29 17:06:19 +03:00
vik
46e12c4692
llava : add support for moondream vision language model (#6899)
* add support for moondream vision language model

This required making the following changes to the CLIP model:

1. Support for patch embedding bias.
2. Make class embedding and pre-layernorm optional.
3. Add support for post-layernorm.

* Update examples/llava/clip.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-25 22:38:31 +03:00
BarfingLemurs
3fe0596c18
readme : update model list (#6908)
* Update README.md

* missing space

* llama3 !
2024-04-25 16:52:28 +03:00
Johannes Gäßler
784e11dea1
README: add graphic for matrix multiplication (#6881) 2024-04-24 21:29:13 +02:00
Georgi Gerganov
40f74e4d73
llama : add option to render special/control tokens (#6807)
* make : fix common dep on llama.h

* llama : add option to render special tokens

* readme : add API change notice

ggml-ci

* swift : fix build
2024-04-21 18:36:45 +03:00
Jan Boon
e8d35f47cb
doc : add link to falcon (#6789) 2024-04-21 15:35:40 +03:00
Mohammadreza Hendiani
2cca09d509
readme : add Fedora instructions (#6783)
* added fedora to list of distros that may need the package (the packages have the same name on Fedora)

* how to add clblast that is avalible in the fedora repos
2024-04-21 15:32:05 +03:00
nopperl
9958c81b79
Implement the OLMo architecture (#6741)
* implement olmo architecture

* remove unused variable

* remove unused moe branch

* remove check for weight

* remove superfluous moe, bias and rope tensors

* clarified comment

* fix clamp_kqv setting

* remove obsolete parameter name filter
2024-04-19 11:35:54 +02:00
Yaroslav
8dd1ec8b3f
readme : add UI (#6724)
* Update README.md

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-17 15:47:50 +03:00
Pierrick Hymbert
4bd0f93e4a
model: support arch DbrxForCausalLM (#6515)
* model: dbrx convert to gguf
#6344

* llama: support dbrx
#6344

* doc: dbrx: add the model as supported

* scripts: get-wikitext-2 add unzip

* llama: increase maximum experts allowed

* llama: factorize moe graph implementation between grok, mixtral and dbrx


---------

Co-authored-by: Megha Agarwal <16129366+megha95@users.noreply.github.com>
2024-04-13 11:33:52 +02:00
Rene Leonhardt
5c4d767ac0
chore: Fix markdown warnings (#6625) 2024-04-12 10:52:36 +02:00
Pierrick Hymbert
67fac4b95f
docs : how to add a model (#6565)
* docs: how to add a model

* docs: model: typo and docs

* docs: model: add prevision on RoPE

* docs: model: rephrasing README.md

* docs: model: rephrasing README.md

* docs: model: README.md fix trailing spaces

* docs : some fixes

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-10 09:58:48 +03:00
Artem Zinnatullin
29122d32ac
readme : fix ROCm link (#6579) 2024-04-10 09:49:12 +03:00
sjxx
b231b37b09
readme : update UI list (#6560) 2024-04-10 09:34:00 +03:00
Jiří Sejkora
ba5e134e07
readme: fix typo in amdgpu target name (#6573) 2024-04-10 00:23:02 +02:00
Jan Boon
beea6e1b16
llama : save and restore kv cache for single seq id (#6341)
* llama : save and restore kv cache for single seq id

* remove trailing whitespace

* respond error in case there's no space in the kv cache

* add kv seq save restore to test case

* add --slot-save-path arg to enable save restore and restrict save location

* Returning 0 for some cases, instead of asserting.

* cleanup error cases

* rename sequence state functions

* rename state get set functions

* add previous function names back in with DEPRECATED notice

* update doc

* adjust endpoints to preferred style

* fix restoring zero cell count

* handle seq rm return value

* unused param

* keep in the size check

* fix return types

* add server test case for slot save restore

* cleanup

* add cake

* cleanup style

* add special

* removing a whole sequence never fails

* move sequence state file functionality from server to llama to match session api and add version tags

* catch exceptions on save as well

* error log messages

* check types for stricter restore

* update server doc

* readme : update API changes date

* strict filename validation

* move include, reject bom as well

* also reject empty filename

* reject whitespace and trailing dot

---------

Co-authored-by: Martin Evans <martindevans@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-08 15:43:30 +03:00
Firat
d752327c33
Adding KodiBot to UI list (#6535)
KodiBot is free and open source ai chat app released under the GNU General Public License.
2024-04-08 09:48:29 +02:00
Mark Fairbairn
855f54402e
Change Windows AMD example to release build to make inference much faster. (#6525) 2024-04-07 20:52:19 +02:00
DAN™
e0717e751e
Add GritLM as supported models. (#6513) 2024-04-07 19:33:59 +02:00
Hoang Nguyen
d0f5deebf8
readme : update UI list (#6503)
* Add MindMac to UI list

* Update proprietary description

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-04-05 21:39:43 +03:00
alexpinel
a307375c02
readme : add Dot to UI list (#6487) 2024-04-04 13:22:50 -04:00
Jun Jie
b660a5729e
readme : fix typo (#6481) 2024-04-04 13:16:37 -04:00
bryanSwk
bb43cf7e9d
llama : add SEA-LION support (#6448)
* initial commit for sealion support

* add sealion support

* minor fix

* q/k ln and pos_embd only if required

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* minor : clear whitespaces

---------

Co-authored-by: bryan <bryansiow@aisingapore.org>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-03 21:05:10 +03:00
Francisco Melo
154d4ee39c
readme : add feature-rich rust bindings (#6465) 2024-04-03 20:53:37 +03:00
Georgi Gerganov
076b08649e
readme : update hot topics 2024-04-03 16:11:15 +03:00
Georgi Gerganov
c50a82ce0f
readme : update hot topics 2024-03-31 11:56:30 +03:00
0cc4m
ba0c7c70ab
Vulkan k-quant mmq and ggml-backend offload functionality (#6155)
* Fix Vulkan no kv offload incoherence

* Add k-quant mul mat mat shaders

* Rework working buffer allocation, reduces vram use noticeably

Clean up cpu assist code, replaced with ggml-backend offload function

* Default to all dedicated GPUs

* Add fallback for integrated GPUs if no dedicated GPUs are found

* Add debug info which device is allocating memory

* Fix Intel dequant issue

Fix validation issue

* Fix Vulkan GGML_OP_GET_ROWS implementation

* Clean up merge artifacts

* Remove Vulkan warning
2024-03-29 17:29:21 +01:00
hxer7963
069574775c
[Model] Add support for xverse (#6301)
* Support xverse model convert to gguf format.

* 1. Convert xverse models to gguf;
2. Add LLM_ARCH_XVERSE inference in llama.cpp;
3. Add xverse item in Supported models in README.md;

* * gguf-py: remove redundant logs
* llama: remove the init_mapping_prefetch custom parameter

* llama.cpp: Include the changes from #6122 to exclude the unused outputs of the last layers.

* - Fix format issues
- Remove duplicate set kqv_out to llm_build_kv

* Update llama.cpp

---------

Co-authored-by: willhe <willhe@xverse.cn>
Co-authored-by: willhe <hexin@xverse.cn>
2024-03-29 14:37:03 +01:00
zhouwg
b910287954
readme : add project (#6356)
* readme: add Android UI binding

* Update README.md
2024-03-29 09:33:46 +02:00
Georgi Gerganov
bfe7dafc9c
readme : add notice for UI list 2024-03-28 22:56:03 +02:00
Mateusz Charytoniuk
1740d6dd4e
readme : add php api bindings (#6326)
* add php bindings to readme

* readme : add link to PR

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-27 09:08:59 +02:00
compilade
557410b8f0
llama : greatly reduce output buffer memory usage (#6122)
* llama : greatly reduce logits memory usage

* llama : more compact state saving and reloading

* llama : fix lctx.n_outputs not being set before building graph

* perplexity : adapt to the logits API changes

* perplexity : fix Winogrande, use correct logits for second choice start

The first logits used to evaluate the second choice were not from
the end of the common prefix; instead, they were the logits from the end
of the first choice. This has been corrected.

The previous implementation sometimes had outliers in the scores of
choices for some tasks, and the logic to skip choices words
in the log-likelihood evaluation probably was an attempt to reduce those,
but it was complex and didn't quite seem to be the right thing.

This is simpler now, and the outlier scores aren't there anymore.

* perplexity : normalize spaces and punctuation in Winogrande sentences

* llama : fix embedding conditions

* llama : fix llama_get_embeddings_ith when the resulting id is 0

* llama : fix wrong n_outputs in llama_set_inputs

A mismatch happened when using a smaller n_ubatch than n_batch and then using
llama_batch_get_one(). The decision of what n_outputs should be now almost
fully depends on how lctx.n_outputs is set in llama_decode_internal.
The conditions are simpler this way.

* llama : when saving the state, recalculate n_outputs

This ensures the correct number of outputs for the entire previous batch
is stored in the session file, even when n_ubatch is smaller than n_batch.

* llama : fix not-skipping outputs of non-causal models

* llama : fix running a batch with n_outputs == 0

It previously worked because lctx.inp_out_ids was not initialized,
so it pointed to some garbage address which was somehow still valid when I
ran my tests.

* llama : keep same graph topology even when n_outputs == 0

* ggml : saner ggml_can_repeat with empty tensors

*  ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1

* ggml : do not multi-thread ops returning empty tensors

* ggml : make ggml_is_empty public and work with views

* llama : use a vector for ctx->output_ids

* llama : rework reallocation logic for llama_output_reserve

Now comparing the actual size with the new total size of the output buffer
to allow more efficient enabling and disabling of the embeddings
and/or logits output in the future.

* ggml : skip empty tensors in all backends

* llama : fix llama_output_reserve nullptr deref when new_size is 0

* perplexity : make Winogrande work as it does on master

The problems with the Winogrande implementation will
need to be fixed in a separate PR to ease review.

* llama : clearer error messages for invalid logits or embeddings ids

* llama : assert all models that can have inp_out_ids

Since the graph topology is now constant, this presence check
can be done even when there are no outputs.

* llama : assert logits and embd buffers exist before writing to them

* llama : handle errors from llama_output_reserve at call sites

* perplexity : make hellaswag and multiple-choice outputs identical to master

Due to how the KV cache is updated, the logprobs for tokens in a batch
are very slightly affected by the other tokens present in the batch,
so to make hellaswag and multiple-choice return exactly the same results
as on master, the last token of each sequence needs to be evaluated
even though its output is not used at all.

This will probably be changed back in the future to make these benchmarks
a tiny bit faster.

* perplexity : fix division by zero when using less than 100 multiple-choice tasks

* llama : allow loading state saved with a different ctx size

When loading a session file, the context size is now only required to be
at least enough to load the KV cells contained in that session file,
instead of requiring to use exactly the same context size as when saving.

Doing this enables the use-case of extending or shrinking the context size
of a saved session.

This breaks existing session files because the meaning of kv_buf_size
is slightly changed (previously it was the size of the whole KV cache,
now it's only the size of the saved part of it). This allows for
finer-grained sanity checks when loading in an effort to keep kv_buf_size
useful even when the kv_size is changed.

* llama : minor

ggml-ci

* readme : update recent API changes, and warn about Vulkan

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-26 16:46:41 +02:00
slaren
280345968d
cuda : rename build flag to LLAMA_CUDA (#6299) 2024-03-26 01:16:01 +01:00
Pierrick Hymbert
dba1af6129
llama_model_loader: support multiple split/shard GGUFs (#6187)
* split: support in llama_model_loader

* avoid copying the entire vector

Co-authored-by: slaren <slarengh@gmail.com>

* split: move llama_tensor_offset to llama_model_loader

* llama_model_loader: PR feedbacks:
 - use only one gguf_context for metadata only
 - store all ggml_context in a vector as the files and mappings
 - store all weights in a vector along with the source tensor
 - rename ctx_gguf to meta
 - rename ctx_meta to contexts

* avoid copying the entire vector

* Simplify this by making these optional, switch some layer creation tensor optional

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Handle optional tensors

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama_model_loader: fail if backend cannot allocate buffer

* fix mmap buffer management

* llama_model_loader: map file to backend buffer if the allocation succeeds only

* llama_model_loader: only map tensors included in the context

* llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast

* llama_model_loader: fail if any of backend buffer cannot be allocated

* spacing

Co-authored-by: slaren <slarengh@gmail.com>

* fix loop over pointer

Co-authored-by: slaren <slarengh@gmail.com>

* llama_model_loader: if n_tensors declared not equals to loaded tensors in split, throw an exception instead of asserting

* llama_model_loader: ensure mappings vector has the expected size

* llama_model_loader:  use at instead of operator[] if this should never add to the map.

* llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size.

* llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer

* llama_model_loader: fix map -> unordered map

* llama_split_prefix: use a clearer version, not pass split path len but dest max len.

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* llama : minor

ggml-ci

* llama : introduce some typedef helpers

* docs: add model shard in hot topic

* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated

Co-authored-by: slaren <slarengh@gmail.com>

* fix llama_split_prefix

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-03-22 19:00:01 +01:00
Xiaoyi Chen
29ab270e65
readme : add RecurseChat to the list of UIs (#6219) 2024-03-22 13:29:49 +02:00
Georgi Gerganov
b3e94f26ba
metal : proper assert for mat-mat memory alignment (#6225)
* metal : proper assert for mat-mat memory alignment

ggml-ci

* readme : add notice about the bug fix

* metal : fix the fix

ggml-ci
2024-03-22 11:35:53 +02:00
Xuan Son Nguyen
dfbfdd60f9
readme : add wllama as a wasm binding (#6100) 2024-03-16 17:42:08 +02:00
Andrew Canis
12247f4c69
llama : add Command-R support (#6033)
Information about the Command-R 35B model (128k context) can be found at:
	https://huggingface.co/CohereForAI/c4ai-command-r-v01

Based on the llama2 model with a few changes:

1) New hyper parameter to scale output logits (logit_scale)
2) Uses LayerNorm instead of RMSNorm
3) Transfomer layers have a single shared LayerNorm that feeds into both the
   self-attention and FFN layers in parallel. There is no post-attention LayerNorm.
4) No support for Rotary Position Embeddings (RoPE) scaling
5) No biases used

Find GGUF files here:
	https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF

To convert model to GGUF format yourself:

1) Download Command-R Hugging Face safetensors:
	git lfs install
	git clone https://huggingface.co/CohereForAI/c4ai-command-r-v01

2) Run:
	python3 convert-hf-to-gguf.py --outtype f16 ./c4ai-command-r-v01
2024-03-15 22:41:22 +02:00
Linwei Wang
19885d205e
readme : update details about running llama in Termux on Android (#6039) 2024-03-13 20:34:40 +02:00
Georgi Gerganov
76a936c893
readme : update API changes and hot topics 2024-03-13 20:33:56 +02:00
Georgi Gerganov
05b06210c9
llama : more consistent names of count variables (#5994)
* llama : more consistent names of count variables

ggml-ci

* llama : n_parallel -> n_seq_max

* common : fix param name

* examples : fix param name
2024-03-11 17:49:47 +02:00
Georgi Gerganov
d9f65c97c3
readme : update hot topics 2024-03-10 20:58:26 +02:00
Georgi Gerganov
098dbaab44
readme : update hot topics 2024-03-09 18:14:13 +02:00