llama.cpp/examples
Justine Tunney 3855416027
ggml : introduce bfloat16 support (#6412)
* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32

The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16
which in practice ends up being 99.71% of Mistral 7b v0.2's weights
however there is currently no way other than fp32 to get the others

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16

This change fixes that, by adding a bf16 data type to GGML. Support
for CPU inference has been implemented along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves somewhere around -0.0024 to -0.0046 compared to using fp16

* Remove GGML code that's not needed

* Minimize the GGML API surface area for BF16

* Remove bf16 luts

* Make the GGML header look nicer

* Fix documentation

* Apply ggerganov's fixes for test-backend-ops

* Add BF16 code for new ggml_validate_row_data() function
2024-05-08 09:30:09 +03:00
..
baby-llama code : normalize enum names (#5697) 2024-02-25 12:09:09 +02:00
batched llama : support Llama 3 HF conversion (#6745) 2024-04-21 14:50:41 +03:00
batched-bench ggml : add Flash Attention (#5021) 2024-04-30 12:16:08 +03:00
batched.swift llama : add option to render special/control tokens (#6807) 2024-04-21 18:36:45 +03:00
beam-search llama : support Llama 3 HF conversion (#6745) 2024-04-21 14:50:41 +03:00
benchmark ggml : remove old quantization functions (#5942) 2024-03-09 15:53:59 +02:00
convert-llama2c-to-ggml llama2c : open file as binary (#6332) 2024-03-27 09:16:02 +02:00
embedding BERT tokenizer fixes (#6498) 2024-04-09 13:44:08 -04:00
eval-callback model: support arch DbrxForCausalLM (#6515) 2024-04-13 11:33:52 +02:00
export-lora ci : add an option to fail on compile warning (#3952) 2024-02-17 23:03:14 +02:00
finetune ggml : introduce bfloat16 support (#6412) 2024-05-08 09:30:09 +03:00
gbnf-validator grammars: 1.5x faster inference w/ complex grammars (vector reserves / reuses) (#6609) 2024-04-11 19:47:34 +01:00
gguf gguf : add option to not check tensor data (#6582) 2024-04-10 21:16:48 +03:00
gguf-split gguf-split: add --no-tensor-first-split (#7072) 2024-05-04 18:56:22 +02:00
gritlm gritlm : add --outdir option to hf.sh script (#6699) 2024-04-16 09:34:06 +03:00
imatrix Fixed save_imatrix to match old behaviour for MoE (#7099) 2024-05-08 02:24:16 +02:00
infill llama : support Llama 3 HF conversion (#6745) 2024-04-21 14:50:41 +03:00
jeopardy parallel : add option to load external prompt file (#3416) 2023-10-06 16:16:38 +03:00
llama-bench Adding support for the --numa argument for llama-bench. (#7080) 2024-05-05 14:17:47 +02:00
llama.android llama : support Llama 3 HF conversion (#6745) 2024-04-21 14:50:41 +03:00
llama.swiftui llama : add option to render special/control tokens (#6807) 2024-04-21 18:36:45 +03:00
llava docs: fix typos (#7124) 2024-05-07 18:20:33 +03:00
lookahead llama : support Llama 3 HF conversion (#6745) 2024-04-21 14:50:41 +03:00
lookup Server: fix seed for multiple slots (#6835) 2024-04-24 11:08:36 +02:00
main main : update log text (EOS to EOG) (#7104) 2024-05-07 20:51:31 +03:00
main-cmake-pkg build(cmake): simplify instructions (cmake -B build && cmake --build build ...) (#6964) 2024-04-29 17:02:45 +01:00
parallel llama : support Llama 3 HF conversion (#6745) 2024-04-21 14:50:41 +03:00
passkey llama : support Llama 3 HF conversion (#6745) 2024-04-21 14:50:41 +03:00
perplexity perplexity: more statistics, added documentation (#6936) 2024-04-30 23:36:27 +02:00
quantize ggml : introduce bfloat16 support (#6412) 2024-05-08 09:30:09 +03:00
quantize-stats Improve usability of --model-url & related flags (#6930) 2024-04-30 00:52:50 +01:00
retrieval examples : add "retrieval" (#6193) 2024-03-25 09:38:22 +02:00
save-load-state llama : save and restore kv cache for single seq id (#6341) 2024-04-08 15:43:30 +03:00
server server: fix incorrectly reported token probabilities (#7125) 2024-05-07 23:07:58 +02:00
simple llama : support Llama 3 HF conversion (#6745) 2024-04-21 14:50:41 +03:00
speculative llama : support Llama 3 HF conversion (#6745) 2024-04-21 14:50:41 +03:00
sycl docs: fix typos (#7124) 2024-05-07 18:20:33 +03:00
tokenize BERT tokenizer fixes (#6498) 2024-04-09 13:44:08 -04:00
train-text-from-scratch train : add general name (#6752) 2024-04-19 10:16:45 +03:00
alpaca.sh alpaca.sh : update model file name (#2074) 2023-07-06 19:17:50 +03:00
base-translate.sh examples : improve base-translate.sh script (#4783) 2024-01-06 11:40:24 +02:00
chat-13B.bat Create chat-13B.bat (#592) 2023-03-29 20:21:09 +03:00
chat-13B.sh examples : read chat prompts from a template file (#1196) 2023-05-03 20:58:11 +03:00
chat-persistent.sh llama : fix session saving/loading (#3400) 2023-10-03 21:04:01 +03:00
chat-vicuna.sh examples : add chat-vicuna.sh (#1854) 2023-06-15 21:05:53 +03:00
chat.sh main : log file (#2748) 2023-08-30 09:29:32 +03:00
CMakeLists.txt eval-callback: Example how to use eval callback for debugging (#6576) 2024-04-11 14:51:07 +02:00
gpt4all.sh examples : add -n to alpaca and gpt4all scripts (#706) 2023-04-13 16:03:39 +03:00
json_schema_to_grammar.py JSON schema conversion: ️ faster repetitions, min/maxLength for strings, cap number length (#6555) 2024-04-12 19:43:38 +01:00
json-schema-pydantic-example.py json-schema-to-grammar improvements (+ added to server) (#5978) 2024-03-21 11:50:43 +00:00
llama2-13b.sh gitignore : changes for Poetry users + chat examples (#2284) 2023-07-21 13:53:27 +03:00
llama2.sh gitignore : changes for Poetry users + chat examples (#2284) 2023-07-21 13:53:27 +03:00
llama.vim llama.vim : added api key support (#5090) 2024-01-23 08:51:27 +02:00
llm.vim llm.vim : stop generation at multiple linebreaks, bind to <F2> (#2879) 2023-08-30 09:50:55 +03:00
make-ggml.py make-ggml.py : compatibility with more models and GGUF (#3290) 2023-09-27 19:25:12 +03:00
Miku.sh MIKU MAYHEM: Upgrading the Default Model for Maximum Fun 🎉 (#2287) 2023-07-21 11:13:18 +03:00
pydantic_models_to_grammar.py examples : make pydantic scripts pass mypy and support py3.8 (#5099) 2024-01-25 14:51:24 -05:00
pydantic-models-to-grammar-examples.py examples : make pydantic scripts pass mypy and support py3.8 (#5099) 2024-01-25 14:51:24 -05:00
reason-act.sh chmod : make scripts executable (#2675) 2023-08-23 17:29:09 +03:00
regex-to-grammar.py JSON schema conversion: ️ faster repetitions, min/maxLength for strings, cap number length (#6555) 2024-04-12 19:43:38 +01:00
server-embd.py server : refactor (#5882) 2024-03-07 11:41:53 +02:00
server-llama2-13B.sh chmod : make scripts executable (#2675) 2023-08-23 17:29:09 +03:00
ts-type-to-grammar.sh JSON schema conversion: ️ faster repetitions, min/maxLength for strings, cap number length (#6555) 2024-04-12 19:43:38 +01:00