llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-24 10:24:35 +00:00


Justine Tunney 3855416027 ggml : introduce bfloat16 support (#6412)		2024-05-08 09:30:09 +03:00

* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That makes conversion relatively straightforward, even in the absence of hardware support. For example, converting brain16 to binary32 means simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32

The issue is that converting bf16 to fp16 can result in information loss. Only 13% of bf16 numbers can be precisely represented in fp16 which in practice ends up being 99.71% of Mistral 7b v0.2's weights however there is currently no way other than fp32 to get the others

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16

This change fixes that, by adding a bf16 data type to GGML. Support for CPU inference has been implemented along with optimizations for the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2 improves somewhere around -0.0024 to -0.0046 compared to using fp16

* Remove GGML code that's not needed
* Minimize the GGML API surface area for BF16
* Remove bf16 luts
* Make the GGML header look nicer
* Fix documentation
* Apply ggerganov's fixes for test-backend-ops
* Add BF16 code for new ggml_validate_row_data() function
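
The conversion described in the commit message can be illustrated with a small sketch. This is not the GGML implementation: the function names are hypothetical, and the float32-to-bf16 direction here simply truncates the low 16 bits, which is a simplification (a production conversion would normally round and handle NaN explicitly). It only shows that widening bf16 to binary32 is a 16-bit left shift, and how one might check whether a given bf16 value survives a trip through IEEE binary16.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Widen a 16-bit bfloat16 pattern to binary32 by shifting it 16 bits left."""
    u32 = (bits & 0xFFFF) << 16
    return struct.unpack("<f", struct.pack("<I", u32))[0]


def float_to_bf16_bits(value: float) -> int:
    """Narrow a binary32 value to bfloat16 bits by truncation.

    Assumption for illustration only: the low 16 mantissa bits are dropped
    without rounding.
    """
    (u32,) = struct.unpack("<I", struct.pack("<f", value))
    return u32 >> 16


def fits_in_fp16(bits: int) -> bool:
    """Return True if the bf16 value is exactly representable in IEEE binary16."""
    value = bf16_bits_to_float(bits)
    try:
        # The "e" format is IEEE binary16; packing and unpacking rounds to half precision.
        (back,) = struct.unpack("<e", struct.pack("<e", value))
    except OverflowError:
        # Magnitude exceeds the binary16 range.
        return False
    return back == value
```

For instance, `bf16_bits_to_float(0x3F80)` yields 1.0, while `fits_in_fp16(0x7F00)` is false because 2^127 is far outside the binary16 range. Looping `fits_in_fp16` over `range(1 << 16)` would give a rough count of the bf16 bit patterns that fp16 can hold exactly.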
..
baby-llama	code : normalize enum names (#5697)	2024-02-25 12:09:09 +02:00
batched	llama : support Llama 3 HF conversion (#6745)	2024-04-21 14:50:41 +03:00
batched-bench	ggml : add Flash Attention (#5021)	2024-04-30 12:16:08 +03:00
batched.swift	llama : add option to render special/control tokens (#6807)	2024-04-21 18:36:45 +03:00
beam-search	llama : support Llama 3 HF conversion (#6745)	2024-04-21 14:50:41 +03:00
benchmark	ggml : remove old quantization functions (#5942)	2024-03-09 15:53:59 +02:00
convert-llama2c-to-ggml	llama2c : open file as binary (#6332)	2024-03-27 09:16:02 +02:00
embedding	BERT tokenizer fixes (#6498)	2024-04-09 13:44:08 -04:00
eval-callback	model: support arch `DbrxForCausalLM` (#6515)	2024-04-13 11:33:52 +02:00
export-lora	ci : add an option to fail on compile warning (#3952)	2024-02-17 23:03:14 +02:00
finetune	ggml : introduce bfloat16 support (#6412)	2024-05-08 09:30:09 +03:00
gbnf-validator	grammars: 1.5x faster inference w/ complex grammars (vector reserves / reuses) (#6609)	2024-04-11 19:47:34 +01:00
gguf	gguf : add option to not check tensor data (#6582)	2024-04-10 21:16:48 +03:00
gguf-split	gguf-split: add --no-tensor-first-split (#7072)	2024-05-04 18:56:22 +02:00
gritlm	gritlm : add --outdir option to hf.sh script (#6699)	2024-04-16 09:34:06 +03:00
imatrix	Fixed save_imatrix to match old behaviour for MoE (#7099)	2024-05-08 02:24:16 +02:00
infill	llama : support Llama 3 HF conversion (#6745)	2024-04-21 14:50:41 +03:00
jeopardy	parallel : add option to load external prompt file (#3416)	2023-10-06 16:16:38 +03:00
llama-bench	Adding support for the --numa argument for llama-bench. (#7080)	2024-05-05 14:17:47 +02:00
llama.android	llama : support Llama 3 HF conversion (#6745)	2024-04-21 14:50:41 +03:00
llama.swiftui	llama : add option to render special/control tokens (#6807)	2024-04-21 18:36:45 +03:00
llava	docs: fix typos (#7124)	2024-05-07 18:20:33 +03:00
lookahead	llama : support Llama 3 HF conversion (#6745)	2024-04-21 14:50:41 +03:00
lookup	Server: fix seed for multiple slots (#6835)	2024-04-24 11:08:36 +02:00
main	main : update log text (EOS to EOG) (#7104)	2024-05-07 20:51:31 +03:00
main-cmake-pkg	build(cmake): simplify instructions (`cmake -B build && cmake --build build ...`) (#6964)	2024-04-29 17:02:45 +01:00
parallel	llama : support Llama 3 HF conversion (#6745)	2024-04-21 14:50:41 +03:00
passkey	llama : support Llama 3 HF conversion (#6745)	2024-04-21 14:50:41 +03:00
perplexity	perplexity: more statistics, added documentation (#6936)	2024-04-30 23:36:27 +02:00
quantize	ggml : introduce bfloat16 support (#6412)	2024-05-08 09:30:09 +03:00
quantize-stats	Improve usability of --model-url & related flags (#6930)	2024-04-30 00:52:50 +01:00
retrieval	examples : add "retrieval" (#6193)	2024-03-25 09:38:22 +02:00
save-load-state	llama : save and restore kv cache for single seq id (#6341)	2024-04-08 15:43:30 +03:00
server	server: fix incorrectly reported token probabilities (#7125)	2024-05-07 23:07:58 +02:00
simple	llama : support Llama 3 HF conversion (#6745)	2024-04-21 14:50:41 +03:00
speculative	llama : support Llama 3 HF conversion (#6745)	2024-04-21 14:50:41 +03:00
sycl	docs: fix typos (#7124)	2024-05-07 18:20:33 +03:00
tokenize	BERT tokenizer fixes (#6498)	2024-04-09 13:44:08 -04:00
train-text-from-scratch	train : add general name (#6752)	2024-04-19 10:16:45 +03:00
alpaca.sh	alpaca.sh : update model file name (#2074)	2023-07-06 19:17:50 +03:00
base-translate.sh	examples : improve base-translate.sh script (#4783)	2024-01-06 11:40:24 +02:00
chat-13B.bat	Create chat-13B.bat (#592)	2023-03-29 20:21:09 +03:00
chat-13B.sh	examples : read chat prompts from a template file (#1196)	2023-05-03 20:58:11 +03:00
chat-persistent.sh	llama : fix session saving/loading (#3400)	2023-10-03 21:04:01 +03:00
chat-vicuna.sh	examples : add chat-vicuna.sh (#1854)	2023-06-15 21:05:53 +03:00
chat.sh	main : log file (#2748)	2023-08-30 09:29:32 +03:00
CMakeLists.txt	eval-callback: Example how to use eval callback for debugging (#6576)	2024-04-11 14:51:07 +02:00
gpt4all.sh	examples : add -n to alpaca and gpt4all scripts (#706)	2023-04-13 16:03:39 +03:00
json_schema_to_grammar.py	JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length (#6555)	2024-04-12 19:43:38 +01:00
json-schema-pydantic-example.py	json-schema-to-grammar improvements (+ added to server) (#5978)	2024-03-21 11:50:43 +00:00
llama2-13b.sh	gitignore : changes for Poetry users + chat examples (#2284)	2023-07-21 13:53:27 +03:00
llama2.sh	gitignore : changes for Poetry users + chat examples (#2284)	2023-07-21 13:53:27 +03:00
llama.vim	llama.vim : added api key support (#5090)	2024-01-23 08:51:27 +02:00
llm.vim	llm.vim : stop generation at multiple linebreaks, bind to <F2> (#2879)	2023-08-30 09:50:55 +03:00
make-ggml.py	make-ggml.py : compatibility with more models and GGUF (#3290)	2023-09-27 19:25:12 +03:00
Miku.sh	MIKU MAYHEM: Upgrading the Default Model for Maximum Fun 🎉 (#2287)	2023-07-21 11:13:18 +03:00
pydantic_models_to_grammar.py	examples : make pydantic scripts pass mypy and support py3.8 (#5099)	2024-01-25 14:51:24 -05:00
pydantic-models-to-grammar-examples.py	examples : make pydantic scripts pass mypy and support py3.8 (#5099)	2024-01-25 14:51:24 -05:00
reason-act.sh	chmod : make scripts executable (#2675)	2023-08-23 17:29:09 +03:00
regex-to-grammar.py	JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length (#6555)	2024-04-12 19:43:38 +01:00
server-embd.py	server : refactor (#5882)	2024-03-07 11:41:53 +02:00
server-llama2-13B.sh	chmod : make scripts executable (#2675)	2023-08-23 17:29:09 +03:00
ts-type-to-grammar.sh	JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length (#6555)	2024-04-12 19:43:38 +01:00