llama.cpp/examples
Justine Tunney 8cc91dc63c
ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.
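The reason prompt evaluation benefits so much is that it is a true matrix-matrix product (the weight matrix times an activation matrix that is as wide as the prompt is long), whereas single-token generation degenerates to a matrix-vector product. A minimal reference sketch of the operation being optimized, assuming row-major storage (this is not the tinyBLAS code itself):

```cpp
#include <cstddef>

// Reference SGEMM: C(MxN) = A(MxK) * B(KxN), row-major.
// For prompt evaluation, N is the number of prompt tokens, so tiled
// matrix-matrix kernels pay off; for token generation N == 1 and the
// same product is just a GEMV, which is memory-bound.
void sgemm_ref(const float *A, const float *B, float *C,
               std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```

The optimized kernels compute the same result while reordering the loops and blocking for registers and cache.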

This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp's llava-cli processes an image prompt at 305 tokens/second
using the Q4_K and Q4_0 types, which have always been faster than f16
LLaVA weights, which at HEAD run at 188 tokens/second. With
this change, f16 LLaVA performance leapfrogs to 464 tokens/second.
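Why the larger AVX512 register file matters: an outer-product kernel keeps a tile of accumulators live in registers for the entire inner loop over K. A portable sketch of the idea (scalar C++, not the actual intrinsics code; with AVX512's 32 zmm registers the same pattern can use much larger tiles):

```cpp
#include <cstddef>

// 4x4 outer-product micro-kernel: C[4][4] += A(4xK) * B(Kx4).
// The 16 accumulators stay in registers across the whole K loop;
// each iteration performs one rank-1 (outer product) update. A wider
// register file permits bigger tiles, hence more arithmetic per byte
// loaded from memory.
void microkernel_4x4(const float *A, const float *B, float *C,
                     std::size_t K,
                     std::size_t lda, std::size_t ldb, std::size_t ldc) {
    float acc[4][4] = {};
    for (std::size_t k = 0; k < K; ++k) {
        for (int i = 0; i < 4; ++i) {
            for (int j = 0; j < 4; ++j) {
                acc[i][j] += A[i * lda + k] * B[k * ldb + j];
            }
        }
    }
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            C[i * ldc + j] += acc[i][j];
        }
    }
}
```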

On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215-token prompt runs at 13 tok/sec; with this change it runs at
52 tok/sec. That's mostly thanks to my vectorized outer-product
kernels, but also because I added support for correctly counting the
number of cores on Alder Lake, so the default thread count now
discounts Intel's new efficiency cores. Core counting is currently
implemented only on Linux.
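On hybrid Alder Lake parts, Linux exposes the performance cores via sysfs cpulist files such as /sys/devices/cpu_core/cpus. A minimal sketch of reading that (a hypothetical helper, not the exact function in this patch), falling back to the total hardware thread count when the file is absent:

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <thread>

// Count the CPUs named by a Linux cpulist string such as "0-7,16-23".
int count_cpulist(const std::string &list) {
    int n = 0;
    std::stringstream ss(list);
    std::string range;
    while (std::getline(ss, range, ',')) {
        if (range.empty()) continue;
        std::size_t dash = range.find('-');
        if (dash == std::string::npos) {
            n += 1;  // single CPU, e.g. "3"
        } else {     // inclusive range, e.g. "0-7"
            int lo = std::stoi(range.substr(0, dash));
            int hi = std::stoi(range.substr(dash + 1));
            n += hi - lo + 1;
        }
    }
    return n;
}

// Prefer the P-core count from sysfs (present on hybrid Intel CPUs
// under recent Linux kernels); otherwise use every hardware thread.
int default_thread_count() {
    std::ifstream f("/sys/devices/cpu_core/cpus");
    std::string list;
    if (f && std::getline(f, list)) {
        int n = count_cpulist(list);
        if (n > 0) return n;
    }
    return (int) std::thread::hardware_concurrency();
}
```

On non-Linux systems (and non-hybrid CPUs) the sysfs file does not exist, so the fallback path is taken.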

This work was sponsored by Mozilla, which has given permission to
change the license of this code from Apache 2.0 to MIT. To read more
about what's improved and how it works, see: https://justine.lol/matmul/
2024-04-16 21:55:30 +03:00
baby-llama code : normalize enum names (#5697) 2024-02-25 12:09:09 +02:00
batched metal : pad n_ctx by 32 (#6177) 2024-03-22 09:36:03 +02:00
batched-bench bench : make n_batch and n_ubatch configurable in Batched bench (#6500) 2024-04-05 21:34:53 +03:00
batched.swift ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
beam-search ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
benchmark ggml : remove old quantization functions (#5942) 2024-03-09 15:53:59 +02:00
convert-llama2c-to-ggml llama2c : open file as binary (#6332) 2024-03-27 09:16:02 +02:00
embedding BERT tokenizer fixes (#6498) 2024-04-09 13:44:08 -04:00
eval-callback model: support arch DbrxForCausalLM (#6515) 2024-04-13 11:33:52 +02:00
export-lora ci : add an option to fail on compile warning (#3952) 2024-02-17 23:03:14 +02:00
finetune code : normalize enum names (#5697) 2024-02-25 12:09:09 +02:00
gbnf-validator grammars: 1.5x faster inference w/ complex grammars (vector reserves / reuses) (#6609) 2024-04-11 19:47:34 +01:00
gguf gguf : add option to not check tensor data (#6582) 2024-04-10 21:16:48 +03:00
gguf-split Fix --split-max-size (#6655) 2024-04-14 13:12:59 +02:00
gritlm gritlm : add --outdir option to hf.sh script (#6699) 2024-04-16 09:34:06 +03:00
imatrix imatrix : remove invalid assert (#6632) 2024-04-12 11:49:58 +03:00
infill infill : add download instructions for model (#6626) 2024-04-12 15:11:46 +03:00
jeopardy parallel : add option to load external prompt file (#3416) 2023-10-06 16:16:38 +03:00
llama-bench ggml : add llamafile sgemm (#6414) 2024-04-16 21:55:30 +03:00
llama.android android : fix utf8 decoding error (#5935) 2024-03-10 22:03:17 +02:00
llama.swiftui llama : add pipeline parallelism support (#6017) 2024-03-13 18:54:21 +01:00
llava chore: Fix markdown warnings (#6625) 2024-04-12 10:52:36 +02:00
lookahead BERT tokenizer fixes (#6498) 2024-04-09 13:44:08 -04:00
lookup BERT tokenizer fixes (#6498) 2024-04-09 13:44:08 -04:00
main main: add --json-schema / -j flag (#6659) 2024-04-15 18:35:21 +01:00
main-cmake-pkg cuda : rename build flag to LLAMA_CUDA (#6299) 2024-03-26 01:16:01 +01:00
parallel llama : greatly reduce output buffer memory usage (#6122) 2024-03-26 16:46:41 +02:00
passkey llama : fix defrag bugs + add parameter (#5735) 2024-02-27 14:35:51 +02:00
perplexity perplexity : require positive --ctx-size arg (#6695) 2024-04-16 09:28:33 +03:00
quantize chore: Fix markdown warnings (#6625) 2024-04-12 10:52:36 +02:00
quantize-stats refactor : switch to emplace_back to avoid extra object (#5291) 2024-02-03 13:23:37 +02:00
retrieval examples : add "retrieval" (#6193) 2024-03-25 09:38:22 +02:00
save-load-state llama : save and restore kv cache for single seq id (#6341) 2024-04-08 15:43:30 +03:00
server main: add --json-schema / -j flag (#6659) 2024-04-15 18:35:21 +01:00
simple ggml : add numa options (#5377) 2024-02-16 11:31:07 +02:00
speculative BERT tokenizer fixes (#6498) 2024-04-09 13:44:08 -04:00
sycl fix memcpy() crash, add missed cmd in guide, fix softmax (#6622) 2024-04-14 10:42:29 +08:00
tokenize BERT tokenizer fixes (#6498) 2024-04-09 13:44:08 -04:00
train-text-from-scratch gguf : fix resource leaks (#6061) 2024-03-14 20:29:32 +02:00
alpaca.sh alpaca.sh : update model file name (#2074) 2023-07-06 19:17:50 +03:00
base-translate.sh examples : improve base-translate.sh script (#4783) 2024-01-06 11:40:24 +02:00
chat-13B.bat Create chat-13B.bat (#592) 2023-03-29 20:21:09 +03:00
chat-13B.sh examples : read chat prompts from a template file (#1196) 2023-05-03 20:58:11 +03:00
chat-persistent.sh llama : fix session saving/loading (#3400) 2023-10-03 21:04:01 +03:00
chat-vicuna.sh examples : add chat-vicuna.sh (#1854) 2023-06-15 21:05:53 +03:00
chat.sh main : log file (#2748) 2023-08-30 09:29:32 +03:00
CMakeLists.txt eval-callback: Example how to use eval callback for debugging (#6576) 2024-04-11 14:51:07 +02:00
gpt4all.sh examples : add -n to alpaca and gpt4all scripts (#706) 2023-04-13 16:03:39 +03:00
json_schema_to_grammar.py JSON schema conversion: faster repetitions, min/maxLength for strings, cap number length (#6555) 2024-04-12 19:43:38 +01:00
json-schema-pydantic-example.py json-schema-to-grammar improvements (+ added to server) (#5978) 2024-03-21 11:50:43 +00:00
llama2-13b.sh gitignore : changes for Poetry users + chat examples (#2284) 2023-07-21 13:53:27 +03:00
llama2.sh gitignore : changes for Poetry users + chat examples (#2284) 2023-07-21 13:53:27 +03:00
llama.vim llama.vim : added api key support (#5090) 2024-01-23 08:51:27 +02:00
llm.vim llm.vim : stop generation at multiple linebreaks, bind to <F2> (#2879) 2023-08-30 09:50:55 +03:00
make-ggml.py make-ggml.py : compatibility with more models and GGUF (#3290) 2023-09-27 19:25:12 +03:00
Miku.sh MIKU MAYHEM: Upgrading the Default Model for Maximum Fun 🎉 (#2287) 2023-07-21 11:13:18 +03:00
pydantic_models_to_grammar.py examples : make pydantic scripts pass mypy and support py3.8 (#5099) 2024-01-25 14:51:24 -05:00
pydantic-models-to-grammar-examples.py examples : make pydantic scripts pass mypy and support py3.8 (#5099) 2024-01-25 14:51:24 -05:00
reason-act.sh chmod : make scripts executable (#2675) 2023-08-23 17:29:09 +03:00
regex-to-grammar.py JSON schema conversion: faster repetitions, min/maxLength for strings, cap number length (#6555) 2024-04-12 19:43:38 +01:00
server-embd.py server : refactor (#5882) 2024-03-07 11:41:53 +02:00
server-llama2-13B.sh chmod : make scripts executable (#2675) 2023-08-23 17:29:09 +03:00
ts-type-to-grammar.sh JSON schema conversion: faster repetitions, min/maxLength for strings, cap number length (#6555) 2024-04-12 19:43:38 +01:00