llama.cpp/examples
John: Work in progress.
Added falcon_main and a falcon library based on llama.cpp
CPU inference works (~260 ms/token on the 16-bit Falcon 7B)
Tested with the 16-bit 7B model and the two Shakespeare models (both in 16-bit precision only)

TODO/WIP:
1) quantization runs and creates a ggjt v3 file, but something is wrong with the quantized model binary
- even quantization from 16 -> 16 fails; something is wrong in the tensors produced
2) mmap should work with quantized binaries once 1) is solved
3) CUDA support is mostly there but currently disabled (everything runs on the CPU backend)
4) memory/context calculations are off, and the GPU memory calculations are wrong as well
5) the python conversion script predates GGML v1 (it writes tokens without scores; see the sketch after this list)
6) some code is still named "llama"; parts should be renamed to something generic, as it works for both models
7) the GGML file produced by the current python script uses an old ftype encoding (also covered in the sketch below)
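
For 5) and 7), a minimal sketch of what the newer formats add, assuming they match upstream llama.cpp; the real writer is the python script, this just shows the on-disk layout with hypothetical helper names:

    /* Hypothetical illustration of points 5 and 7; not this fork's actual code. */
    #include <stdint.h>
    #include <stdio.h>

    /* 5) the first versioned format (ggmf v1 in upstream llama.cpp) stores a
          float score per token; the pre-GGML-1 script writes only (len, bytes) */
    void write_token(FILE * fout, const char * text, uint32_t len, float score) {
        fwrite(&len,   sizeof(uint32_t), 1, fout);
        fwrite(text,   1, len, fout);
        fwrite(&score, sizeof(float), 1, fout); /* missing in the old script */
    }

    /* 7) newer ggml folds the quantization version into ftype:
          ftype = GGML_QNT_VERSION * GGML_QNT_VERSION_FACTOR + raw_ftype,
          with GGML_QNT_VERSION_FACTOR defined as 1000 in ggml.h */
    uint32_t encode_ftype(uint32_t qnt_version, uint32_t raw_ftype) {
        return qnt_version * 1000 + raw_ftype;
    }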

Makefiles:
cmake on Windows with the build tools works
the Makefile for Linux/MSYS was adjusted blind but not tested yet; something may have been missed
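
As a sketch, the intended invocations follow the usual llama.cpp pattern (the falcon_* make targets below are assumptions based on the directory names above, not verified against this Makefile):

    # Windows (works): CMake + build tools
    mkdir build && cd build
    cmake ..
    cmake --build . --config Release

    # Linux/MSYS (blind adjusted, untested)
    make falcon_main falcon_quantize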

Changes to the codebase:
* repeat2 has been added to ggml (jploski - https://github.com/ggerganov/ggml/pull/231), including the backward variant (untested, probably fails); a usage sketch follows below
* minor changes to work with falcon (name length)
* libfalcon is the previous "llama.cpp" and falcon_main is the previous main.cpp
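
A minimal usage sketch of the assumed repeat2 semantics: the signature is taken to mirror ggml_repeat (broadcast a to the shape of b), and the element ordering shown is my reading of the PR, not verified against it:

    #include "ggml.h"

    /* Hypothetical sketch; not this fork's actual code. */
    void repeat_demo(struct ggml_context * ctx) {
        struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 2);
        struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);

        /* ggml_repeat tiles the whole tensor: [x y] -> [x y x y] */
        struct ggml_tensor * r1 = ggml_repeat(ctx, a, t);

        /* ggml_repeat2 repeats elements consecutively: [x y] -> [x x y y],
           the broadcast falcon's multi-query attention needs to share one
           KV head across all query heads */
        struct ggml_tensor * r2 = ggml_repeat2(ctx, a, t);

        (void) r1; (void) r2;
    }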
2023-06-16 16:31:02 +02:00
Name | Last commit | Date
baby-llama | baby-llama : fix operator!= (#1821) | 2023-06-13 22:37:54 +03:00
benchmark | llama : add llama_init_backend() API (close #1527) | 2023-05-20 11:06:37 +03:00
embedding | llama : add llama_init_backend() API (close #1527) | 2023-05-20 11:06:37 +03:00
falcon | Work in progress. | 2023-06-16 16:31:02 +02:00
falcon_quantize | Work in progress. | 2023-06-16 16:31:02 +02:00
jeopardy | examples : add Jeopardy example (#1168) | 2023-04-28 19:13:33 +03:00
main | llama : do a warm-up eval at start for better timings (#1824) | 2023-06-13 20:20:07 +03:00
metal | llama : Metal inference (#1642) | 2023-06-04 23:34:30 +03:00
perplexity | llama : add llama_init_backend() API (close #1527) | 2023-05-20 11:06:37 +03:00
quantize | Allow "quantizing" to f16 and f32 (#1787) | 2023-06-13 04:23:23 -06:00
quantize-stats | ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684) | 2023-06-05 22:56:18 +03:00
save-load-state | Remove unused n_parts parameter (#1509) | 2023-05-17 22:12:01 +00:00
server | Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703) | 2023-06-06 21:33:23 +02:00
train-text-from-scratch | train : improved training-from-scratch example (#1652) | 2023-06-13 22:04:40 +03:00
alpaca.sh | examples : Improve Alpaca Default Repeat Penalty: Better Match Alpaca.cpp Experience (#1107) | 2023-04-22 09:54:33 +03:00
chat-13B.bat | Create chat-13B.bat (#592) | 2023-03-29 20:21:09 +03:00
chat-13B.sh | examples : read chat prompts from a template file (#1196) | 2023-05-03 20:58:11 +03:00
chat-persistent.sh | chat-persistent.sh : use bracket expressions in grep (#1564) | 2023-05-24 09:16:22 +03:00
chat.sh | If n_predict == -1, generate forever | 2023-03-25 21:51:41 +02:00
CMakeLists.txt | Work in progress. | 2023-06-16 16:31:02 +02:00
common.cpp | Fix issue where interactive mode crashes when input exceeds ctx size (#1789) | 2023-06-11 08:19:17 -06:00
common.h | Fix issue where interactive mode crashes when input exceeds ctx size (#1789) | 2023-06-11 08:19:17 -06:00
falcon_common.cpp | Work in progress. | 2023-06-16 16:31:02 +02:00
falcon_common.h | Work in progress. | 2023-06-16 16:31:02 +02:00
gpt4all.sh | examples : add -n to alpaca and gpt4all scripts (#706) | 2023-04-13 16:03:39 +03:00
Miku.sh | examples : various prompt and example fixes (#1298) | 2023-05-03 18:26:47 +03:00
reason-act.sh | add example of re-act pattern (#583) | 2023-03-29 10:10:24 -05:00