llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-26 03:14:35 +00:00

Author	SHA1	Message	Date
Stephan Walter	36d19a603b	Remove Q4_3 which is no better than Q5 (#1218 )	2023-04-28 23:10:43 +00:00
CRD716	5fba3c016b	examples : add Jeopardy example (#1168 ) * Basic Setup * Prevent Results.txt from coming up * Prefixes, Line separators, etc * editorcheck * introduction to give more consistent results * Basic graph thing * Grading, ready for testing! * Y'all ready to get funky? * fix column removal stuff * missed a few	2023-04-28 19:13:33 +03:00
Evan Jones	1481a9cf25	llama : add session file format and saved sessions in main (#1169 )	2023-04-28 18:59:37 +03:00
Georgi Gerganov	574406dc7e	ggml : add Q5_0 and Q5_1 quantization (#1187 ) * ggml : add Q5_0 quantization (cuBLAS only) * ggml : fix Q5_0 qh -> uint32_t * ggml : fix q5_0 histogram stats * ggml : q5_0 scalar dot product * ggml : q5_0 ARM NEON dot * ggml : q5_0 more efficient ARM NEON using uint64_t masks * ggml : rename Q5_0 -> Q5_1 * ggml : adding Q5_0 mode * quantize : add Q5_0 and Q5_1 to map * ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195) --------- Co-authored-by: Stephan Walter <stephan@walter.name>	2023-04-26 23:14:13 +03:00
Pavol Rusnak	859fee6dfb	quantize : use `map` to assign quantization type from `string` (#1191 ) instead of `int` (while `int` option still being supported) This allows the following usage: `./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0` instead of: `./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`	2023-04-26 18:43:27 +02:00
Georgi Gerganov	7a32fcb3b2	ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179 ) * ggml : add Q8_0 quantization format (rename the old one to Q8_1) * tests : fix test-quantize-fns * ggml : finalize Q8_0 implementation * ggml : use q4_0_q8_0 and q4_2_q8_0 * ggml : fix Q8_0 dot product bug (ARM) * ggml : Q8_0 unroll x2 * ggml : fix bug - using wrong block type * ggml : extend quantize_fns_t with "vec_dot_type" * ggml : fix Q8_0 to use 255 values out of 256 * ggml : fix assert using wrong QK4_2 instead of QK4_3	2023-04-25 23:40:51 +03:00
xaedes	0c5692345d	examples : add save_load_state example (#1150 ) * add save_load_state example * use <cstdio> instead of <iostream> and fprintf / printf instead of cout * renamed save-load-state example files replacing underscores by dashes	2023-04-24 19:23:31 +03:00
mgroeber9110	9b0a4d4214	examples/main README improvements and some light refactoring (#1131 )	2023-04-24 15:45:32 +00:00
slaren	1d78fecdab	Fix LoRA acronym (#1145 )	2023-04-23 23:03:44 +02:00
DannyDaemonic	edce63baa9	Added README.md for main with examples and explanations (#1139 )	2023-04-23 15:37:02 +00:00
Stephan Walter	c50b628810	Fix CI: ARM NEON, quantization unit tests, editorconfig (#1122 )	2023-04-22 10:54:13 +00:00
wbpxre150	36b4f7e064	llama : print timings on ctrl+c exit (#1021 ) * print timings on ctrl+c exit * remove redundant free memory call. * add global pointer to ctx.	2023-04-22 11:56:35 +03:00
eiery	10f19c1121	llama : have n_batch default to 512 (#1091 ) * set default n_batch to 512 when using BLAS * spacing * alternate implementation of setting different n_batch for BLAS * set n_batch to 512 for all cases	2023-04-22 11:27:05 +03:00
Clint Herron	e9a9cb0c54	examples : Improve Alpaca Default Repeat Penalty: Better Match Alpaca.cpp Experience (#1107 ) * Moving parameters to separate lines for readability. * Increasing repeat_penalty to 1.1 to make alpaca more usable by default. * Adding trailing newline.	2023-04-22 09:54:33 +03:00
Alex Klinkhamer	9411288271	main : evaluate tokens in batches after swapping context (#1014 ) * examples : evaluate tokens in batches after swapping context * Update examples/main/main.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-21 21:18:09 +03:00
slaren	3d59769c3b	Show perplexity ETA in hours and minutes (#1096 )	2023-04-21 14:57:57 +02:00
Kawrakow	38de86a711	llama : multi-threaded quantization (#1075 ) * Multi-threading quantization. Not much gain for simple quantizations, but it will be important for quantizations that require more CPU cycles. * Multi-threading for quantize-stats It now does the job in ~14 seconds on my Mac for Q4_0, Q4_1 and Q4_2. Single-threaded it was taking more than 2 minutes after adding the more elaborate version of Q4_2. * Reviewer comments * Avoiding compiler confusion After changing chunk_size to const int as suggested by @ggerganov, clang and GCC started to warn me that I don't need to capture it in the lambda. So, I removed it from the capture list. But that makes the MSVC build fail. So, making it a constexpr to make every compiler happy. * Still fighting with lambda captures in MSVC --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-20 20:42:27 +03:00
Georgi Gerganov	e0305ead3a	ggml : add Q4_3 quantization (#1082 )	2023-04-20 20:35:53 +03:00
Georgi Gerganov	77a73403ca	ggml : add new Q4_2 quantization (ARM only) (#1046 ) * ggml : Q4_2 ARM * ggml : add ggml_is_quantized() * llama : update llama_type_name() with Q4_2 entry * ggml : speed-up q4_2 - 4 threads: ~100ms -> ~90ms - 8 threads: ~55ms -> ~50ms * ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32	2023-04-18 23:54:57 +03:00
slaren	315a95a4d3	Add LoRA support (#820 )	2023-04-17 17:28:55 +02:00
Georgi Gerganov	eb17a026fd	quantize-stats : fix bug in --type argument	2023-04-17 17:31:06 +03:00
Pavol Rusnak	489537e6cf	examples: add missing <ctime> include for time() (#1011 )	2023-04-16 10:13:00 +00:00
Ivan Komarov	c12b14b77f	benchmark : fix result validation in benchmark-q4_0-matmult (#987 )	2023-04-15 08:51:54 +03:00
Pavol Rusnak	c85e03d12e	Revert "main : alternative instruct mode (Vicuna support, etc.) (#863 )" (#982 ) This reverts commit `f4d277ae17`.	2023-04-14 22:58:43 +03:00
Pavol Rusnak	c56b715269	Expose type name from ggml (#970 ) Avoid duplication of type names in utils Co-authored-by: Håkon H. Hitland <haakon@likedan.net>	2023-04-14 20:05:37 +02:00
Tomáš Pazdiora	f4d277ae17	main : alternative instruct mode (Vicuna support, etc.) (#863 ) * Add support for configs, add configurable prefixes / suffixes, deprecate instruct mode, add stop prompt * Add multiline mode, update text input. * bugfix * update implementation * typos * Change --multiline implementation to be toggled by EOF. * bugfix * default multiline mode * add more configs * update formating * update formatting * apply suggestions	2023-04-14 18:19:17 +03:00
Gary Linscott	be87b6ed20	perplexity : add support for batch size to `--perplexity` (#407 ) * Add support to batch size for perplexity * Revert "Fix memory allocation issues and seg faults" This reverts commit `4870e455b3`. * update from merge * Remove perplexity from main * updates * Update batch size for efficiency	2023-04-14 00:50:42 +03:00
CRD716	0e07e6a839	common : remove unnecessary includes (#947 )	2023-04-13 18:39:25 +03:00
Georgi Gerganov	9190e8eac8	llama : merge llama_internal.h into llama.h Hide it behind an #ifdef	2023-04-13 18:04:45 +03:00
CRD716	8cda5c981d	fix whitespace (#944 )	2023-04-13 16:03:57 +02:00
niansa/tuxifan	107980d970	examples : add -n to alpaca and gpt4all scripts (#706 )	2023-04-13 16:03:39 +03:00
SebastianApel	95ea26f6e9	benchmark : add tool for timing q4_0 matrix multiplication (#653 ) * Initial version of q4_0 matrix multiplication benchmark * Bugfix: Added dependency to ggml.o to benchmark * Reviewer requests: added parameter for threads, switched to ggml_time_us() * Reviewer input: removed rtsc, use epsilon for check * Review comment: Removed set_locale * Feature: Param for number of iterations, Bugfix for use of parameter threads * Reviewer suggestion: Moved to examples * Reviewer feedback: Updated clean: and benchmark: sections --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-13 15:46:23 +03:00
Pavol Rusnak	8b679987cd	Fix whitespace, add .editorconfig, add GitHub workflow (#883 )	2023-04-11 19:45:44 +00:00
Stephan Walter	3e6e70d8e8	Add enum llama_ftype, sync ggml_type to model files (#709 )	2023-04-11 15:03:51 +00:00
comex	2663d2c678	Windows fixes (#890 ) Mostly for msys2 and mingw64 builds, which are different from each other and different from standard Visual Studio builds. Isn't Windows fun? - Define _GNU_SOURCE in more files (it's already used in ggml.c for Linux's sake). - Don't use PrefetchVirtualMemory if not building for Windows 8 or later (mingw64 doesn't by default). But warn the user about this situation since it's probably not intended. - Check for NOMINMAX already being defined, which it is on mingw64. - Actually use the `increment` variable (bug in my `pizza` PR). - Suppress unused variable warnings in the fake pthread_create and pthread_join implementations for Windows. - (not Windows-related) Remove mention of `asprintf` from comment; `asprintf` is no longer used. Fixes #871.	2023-04-11 15:19:54 +02:00
comex	f963b63afa	Rewrite loading code to try to satisfy everyone: - Support all three formats (ggml, ggmf, ggjt). (However, I didn't include the hack needed to support GPT4All files without conversion. Those can still be used after converting them with convert.py from my other PR.) - Support both mmap and read (mmap is used by default, but can be disabled with `--no-mmap`, and is automatically disabled for pre-ggjt files or on platforms where mmap is not supported). - Support multi-file models like before, but automatically determine the number of parts rather than requiring `--n_parts`. - Improve validation and error checking. - Stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields. This has no immediate benefit, but makes it easier to experiment with different formats, and should make it easier to support the new GPTQ-for-LLaMa models in the future (I have some work in progress on that front). - Support VirtualLock on Windows (using the same `--mlock` option as on Unix). - Indicate loading progress when using mmap + mlock. (Which led me to the interesting observation that on my Linux machine, with a warm file cache, mlock actually takes some time, whereas mmap without mlock starts almost instantly...) - To help implement this, move mlock support from ggml to the loading code. - madvise/PrefetchVirtualMemory support (based on #740) - Switch from ifstream to the `fopen` family of functions to avoid unnecessary copying and, when mmap is enabled, allow reusing the same file descriptor for both metadata reads and mmap (whereas the existing implementation opens the file a second time to mmap). - Quantization now produces a single-file output even with multi-file inputs (not really a feature as much as 'it was easier this way'). Implementation notes: I tried to factor the code into more discrete pieces than before. 
Regarding code style: I tried to follow the code style, but I'm naughty and used a few advanced C++ features repeatedly: - Destructors to make it easier to ensure everything gets cleaned up. - Exceptions. I don't even usually use exceptions when writing C++, and I can remove them if desired... but here they make the loading code much more succinct while still properly handling a variety of errors, ranging from API calls failing to integer overflow and allocation failure. The exceptions are converted to error codes at the API boundary.) Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)	2023-04-10 01:10:46 +02:00
Tomáš Pazdiora	aaf3b23deb	fix for windows utf-8 input (#840 ) Use UTF-16 as input on Windows, since UTF-8 does not work and reads multibyte characters as zeros	2023-04-08 17:49:39 +02:00
unbounded	62cfc54f77	Add quantize-stats command for testing quantization (#728 ) Command that calculates some statistics over the errors introduced by quantization, like mean square error, max error and some percentile errors for layer weights. Should be useful for testing quantization improvements. Exposes some internal state from ggml and llama for testing	2023-04-08 00:09:18 +02:00
Sergey Alirzaev	cc9cee8e9e	Do not crash when it has nothing to say. (#796 ) Otherwise observing this in the interactive mode: /usr/lib/gcc/x86_64-pc-linux-gnu/12/include/g++-v12/bits/stl_vector.h:1230: reference std::vector<int>::back() [_Tp = int, _Alloc = std::allocator<int>]: Assertion '!this->empty()' failed.	2023-04-06 17:59:11 +02:00
at8u	ff05d05c96	miku.sh : add executable bit (#780 )	2023-04-05 18:59:13 +03:00
at8u	88ed5761b8	examples : add Miku.sh (#724 ) * Add Miku.sh to examples * Add missing line to prompt in Miku.sh * Add --keep param to Miku.sh * Remove '[end_of_conversation]' line from Miku.sh No longer is necessary.	2023-04-05 17:32:42 +03:00
mgroeber9110	53dbba7695	Windows: reactivate sigint handler after each Ctrl-C (#736 )	2023-04-03 18:00:55 +02:00
Leonardo Neumann	6e7801d08d	examples : add gpt4all script (#658 )	2023-04-02 10:56:20 +03:00
Murilo Santana	5b70e7de4c	fix default params for examples/main (#697 )	2023-04-02 04:41:12 +02:00
Slaren	0d054e292e	Show error message when -f fails	2023-04-01 16:08:40 +02:00
Slaren	64bde3ffd4	Fix ggml_init_params in quantize	2023-03-30 12:28:25 -07:00
Thérence	d9ad104440	Create chat-13B.bat (#592 ) * Create chat-13B.bat Same script as chat-13B.sh, but for Windows users. Tested and working on Windows 10/11 v 22H2 * Apply suggestions from code review --------- Co-authored-by: anzz1 <anzz1@live.com>	2023-03-29 20:21:09 +03:00
Tobias Lütke	a6956b25a1	add example of re-act pattern (#583 ) * add example of re-act pattern * spelling... * fixed whitespace in reverse prompt issue	2023-03-29 10:10:24 -05:00
anzz1	7f4c5c6651	llama : fix linkage with mingw (#551 ) * Revert `7e53955` (#542) Still needs to be fixed properly * Fix linking on mingw32	2023-03-28 21:23:09 +03:00
Stephan Walter	436e561931	all : be more strict about converting float to double (#458 ) * Be more strict about converting float to double * Test equivalence of round, SILU implementations Test module is commented out in CMakeLists.txt because the tests may take a long time, depending on how much the compiler optimizes. * Fix softmax in perplexity.cpp * all : prefer float over double where appropriate * perplexity : add <cmath> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-03-28 19:48:20 +03:00
Stephan Walter	c1f885067c	ggml : introduce structs for the q4 data blocks (#356 ) * Introduce structs for the q4 data blocks * ggml : rename quant struct variables + fix ARM_NEON --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-03-28 18:56:03 +03:00
anzz1	7b8dbcb78b	main.cpp fixes, refactoring (#571 ) - main: entering empty line passes back control without new input in interactive/instruct modes - instruct mode: keep prompt fix - instruct mode: duplicate instruct prompt fix - refactor: move common console code from main->common	2023-03-28 17:09:55 +03:00
Marco Matthies	7e5395575a	Fix missing ggml link in cmake for examples/* on w64-mingw32 (#542 )	2023-03-27 07:55:26 +03:00
Stephan Walter	b391579db9	Update README and comments for standalone perplexity tool (#525 )	2023-03-26 16:14:01 +03:00
anzz1	7a87d31f4f	[main] fix infinite generation (-n == -1) (#523 )	2023-03-26 16:06:10 +03:00
Harald Fernengel	33e35b8fe8	Exit from interactive mode if input stream is bad (#491 ) Allow exiting the interactive prompt also with CTRL-D on Unix and CTRL-Z on Windows.	2023-03-26 08:25:46 +03:00
anzz1	34ab526843	(Windows) Set console to UTF-8 on init (#420 ) Sets console codepage to 65001 (CP_UTF8) on start for both input and output, should fix problems with UTF-8 characters.	2023-03-25 22:29:22 +02:00
Georgi Gerganov	c2b25b6912	Fix colors enabling on WIN32	2023-03-25 21:53:39 +02:00
Georgi Gerganov	79b2b266db	If n_predict == -1, generate forever	2023-03-25 21:51:41 +02:00
Georgi Gerganov	e2d490dafd	Infinite generation via context swapping (#71 )	2023-03-25 21:36:22 +02:00
Georgi Gerganov	03f7e33560	Cleanup STL headers + fix embedding examples + minor stuff	2023-03-25 20:51:14 +02:00
Georgi Gerganov	55ad42af84	Move chat scripts into "./examples"	2023-03-25 20:37:09 +02:00
Georgi Gerganov	a316a425d0	Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something !	2023-03-25 20:26:40 +02:00
Georgi Gerganov	04c6f5ed6f	Immediately start processing the prompt before user input has been provided (#476 )	2023-03-24 23:17:58 +02:00
Mathieu Nayrolles	3f9c6135e4	fix typo in chatLLaMa (#368 ) The prompt contains a typo where 'alound' is used instead of 'aloud'.	2023-03-21 22:52:27 +02:00
Jean-Christophe Hoelt	3ab3e6582f	Add chatLLaMa script (#198 ) * Add chatLLaMa script * Fix shellcheck errors and do some cleanup * Move chatLLaMa script to `examples` directory * Reduce chatLLaMa context size to 2048 Ref `d7def1a752` * Include n_predict to 2048 in examples/chatLLaMa	2023-03-21 18:23:15 +02:00

666 Commits
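
Note on commit 859fee6dfb (quantize : use `map` to assign quantization type from `string`): the message describes accepting a type name such as `q4_0` while still supporting the older integer form such as `2`. The following is only a minimal sketch of that idea, not the code from the commit. The enum `QuantType`, the function `parse_quant_type`, and the value assigned to `Q4_1` are hypothetical and chosen for illustration; only the pairing of `q4_0` with `2` comes from the commit message.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the real quantization type enum.
// Q4_0 = 2 follows the commit message; Q4_1 = 3 is an assumption.
enum class QuantType { Q4_0 = 2, Q4_1 = 3 };

// Resolve a command-line argument to a quantization type.
// Accepts a type name ("q4_0") or the legacy integer form ("2").
// Returns false when the argument matches neither form.
bool parse_quant_type(const std::string & arg, QuantType & out) {
    static const std::map<std::string, QuantType> by_name = {
        {"q4_0", QuantType::Q4_0},
        {"q4_1", QuantType::Q4_1},
    };

    // Preferred form: look the name up in the map.
    const auto it = by_name.find(arg);
    if (it != by_name.end()) {
        out = it->second;
        return true;
    }

    // Legacy form: the whole argument must parse as a known integer value.
    try {
        std::size_t pos = 0;
        const int value = std::stoi(arg, &pos);
        if (pos != arg.size()) {
            return false; // trailing characters, e.g. "2x"
        }
        for (const auto & kv : by_name) {
            if (static_cast<int>(kv.second) == value) {
                out = kv.second;
                return true;
            }
        }
    } catch (const std::exception &) {
        // std::stoi throws on non-numeric or out-of-range input.
    }
    return false;
}
```

With this sketch, `parse_quant_type("q4_0", t)` and `parse_quant_type("2", t)` both set `t` to `QuantType::Q4_0`, and an unknown argument such as `"bogus"` is rejected instead of being silently mapped to a default.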