llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-27 03:44:35 +00:00

Author	SHA1	Message	Date
Tobias Lütke	31cfbb1013	Expose generation timings from server & update completions.js (#2116 ) * use javascript generators as much cleaner API Also add ways to access completion as promise and EventSource * export llama_timings as struct and expose them in server * update readme, update baked includes * llama : uniform variable names + struct init --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-05 16:51:13 -04:00
Jesse Jojo Johnson	983b555e9d	Update Server Instructions (#2113 ) * Update server instructions for web front end * Update server README * Remove duplicate OAI instructions * Fix duplicate text --------- Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>	2023-07-05 21:03:19 +03:00
Stephan Walter	1b107b8550	ggml : generalize `quantize_fns` for simpler FP16 handling (#1237 ) * Generalize quantize_fns for simpler FP16 handling * Remove call to ggml_cuda_mul_mat_get_wsize * ci : disable FMA for mac os actions --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-05 19:13:06 +03:00
Jesse Jojo Johnson	8567c76b53	Update server instructions for web front end (#2103 ) Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>	2023-07-05 18:13:35 +03:00
Nigel Bosch	7f0e9a775e	embd-input: Fix input embedding example unsigned int seed (#2105 )	2023-07-05 07:33:33 +08:00
jwj7140	f257fd2550	Add an API example using server.cpp similar to OAI. (#2009 ) * add api_like_OAI.py * add evaluated token count to server * add /v1/ endpoints binding	2023-07-04 21:06:12 +03:00
Tobias Lütke	7ee76e45af	Simple webchat for server (#1998 ) * expose simple web interface on root domain * embed index and add --path for choosing static dir * allow server to multithread because web browsers send a lot of garbage requests we want the server to multithread when serving 404s for favicon's etc. To avoid blowing up llama we just take a mutex when it's invoked. * let's try this with the xxd tool instead and see if msvc is happier with that * enable server in Makefiles * add /completion.js file to make it easy to use the server from js * slightly nicer css * rework state management into session, expose historyTemplate to settings --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-04 16:05:27 +02:00
Henri Vasserman	1cf14ccef1	fix server crashes (#2076 )	2023-07-04 00:05:23 +03:00
WangHaoranRobin	d7d2e6a0f0	server: add option to output probabilities for completion (#1962 ) * server: add option to output probabilities for completion * server: fix issue when handling probability output for incomplete tokens for multibyte character generation * server: fix llama_sample_top_k order * examples/common.h: put all bool variables in gpt_params together	2023-07-03 00:38:44 +03:00
Georgi Gerganov	79f634a19d	embd-input : fix returning ptr to temporary	2023-07-01 18:46:00 +03:00
Georgi Gerganov	04606a1599	train : fix compile warning	2023-07-01 18:45:44 +03:00
Howard Su	b8c8dda75f	Use unsigned for random seed (#2006 ) * Use unsigned for random seed. Keep -1 as the value to use a time based seed. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-29 06:15:15 -07:00
Johannes Gäßler	7f9753fa12	CUDA GPU acceleration for LoRAs + f16 models (#1970 )	2023-06-28 18:35:54 +02:00
ningshanwutuobang	cfa0750bc9	llama : support input embeddings directly (#1910 ) * add interface for float input * fixed inpL shape and type * add examples of input floats * add test example for embd input * fixed sampling * add free for context * fixed add end condition for generating * add examples for llava.py * add READMD for llava.py * add READMD for llava.py * add example of PandaGPT * refactor the interface and fixed the styles * add cmake build for embd-input * add cmake build for embd-input * Add MiniGPT-4 example * change the order of the args of llama_eval_internal * fix ci error	2023-06-28 18:53:37 +03:00
Howard Su	0be54f75a6	baby-llama : fix build after ggml_rope change (#2016 )	2023-06-27 08:07:13 +03:00
Georgi Gerganov	181e8d9755	llama : fix rope usage after ChatGLM change	2023-06-27 00:37:33 +03:00
David Yang	eaa6ca5a61	ggml : increase max tensor name + clean up compiler warnings in train-text (#1988 ) * Clean up compiler warnings in train-text Some brackets to disambiguate order of operations * Increase GGML_MAX_NAME Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues	2023-06-26 22:45:32 +03:00
zrm	b853d45601	ggml : add NUMA support (#1556 ) * detect NUMA systems and pin work threads to nodes (linux) * disable mmap prefetch/readahead for NUMA systems * avoid sending finalize op to thread pool if it does nothing * silence robot * fix args * make --numa a param * recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement * lower synchronization overhead * statically allocate * move numa state to g_state * add description for --numa * ggml : minor style changes * ggml : minor style + try fix sanitizer build * llama : allow to initialize backend with NUMA support * llama : avoid ggml include in llama-util.h * ggml : style / formatting * ggml : fix handling of ops with n_threads > n_tasks > 1 * server : utilize numa parameter --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-26 20:57:59 +03:00
anon998	c2a08f87b8	fix server sampling: top k sampler first (#1977 ) Co-authored-by: anon <anon@example.org>	2023-06-25 10:48:36 +02:00
Didzis Gosko	527b6fba1d	llama : make model stateless and context stateful (llama_state) (#1797 ) * llama : make model stateless and context stateful * llama : minor cleanup * llama : update internal API declaration * Apply suggestions from code review fix style Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Missing model memory release * Fix style * Add deprecated warning for public API function llama_init_from_file * Update public API use cases: move away from deprecated llama_init_from_file * Deprecate public API function llama_apply_lora_from_file --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-24 11:47:58 +03:00
Henri Vasserman	20568fe60f	[Fix] Reenable server embedding endpoint (#1937 ) * Add back embedding feature * Update README	2023-06-20 01:12:39 +03:00
Kawrakow	90cc59d6ab	examples : fix examples/metal (#1920 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-06-18 10:52:10 +03:00
Georgi Gerganov	4f9c43e3bd	minor : warning fixes	2023-06-17 20:24:11 +03:00
Johannes Gäßler	2c9380dd2f	Only one CUDA stream per device for async compute (#1898 )	2023-06-17 19:15:02 +02:00
Georgi Gerganov	051e1b0e6a	llama : fix kv_cache `n` init (close #1903 )	2023-06-17 19:31:20 +03:00
Randall Fitzgerald	794db3e7b9	Server Example Refactor and Improvements (#1570 ) A major rewrite for the server example. Note that if you have built something on the previous server API, it will probably be incompatible. Check out the examples for how a typical chat app could work. This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing. Summary of the changes: - adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos - applies missing top k sampler - removes interactive mode/terminal-like behavior, removes exclude parameter - moves threads and batch size to server command-line parameters - adds LoRA loading and matches command line parameters with main example - fixes stopping on EOS token and with the specified token amount with n_predict - adds server timeouts, host, and port settings - adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text - sets defaults for unspecified parameters between requests - removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming - adds CORS headers to responses - adds request logging, exception printing and optional verbose logging - adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string - adds printing an error when it can't bind to the host/port specified - fixes multi-byte character handling and replaces invalid UTF-8 characters on responses - prints timing and build info on startup - adds logit bias to request parameters - removes embedding mode - updates documentation; adds streaming Node.js and Bash examples - fixes code formatting - sets server threads to 1 since the current global state doesn't work well with simultaneous requests - adds truncation of the input prompt and better context reset - removes token limit from the input prompt - significantly simplified the logic and removed a lot of variables --------- Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com> Co-authored-by: Henri Vasserman <henv@hot.ee> Co-authored-by: Felix Hellmann <privat@cirk2.de> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>	2023-06-17 14:53:04 +03:00
Jiří Podivín	5ddf7ea1fb	hooks : setting up flake8 and pre-commit hooks (#1681 ) Small, non-functional changes were made to non-compliant files. These include breaking up long lines, whitespace sanitation and unused import removal. Maximum line length in python files was set to a generous 125 chars, in order to minimize number of changes needed in scripts and general annoyance. The "txt" prompts directory is excluded from the checks as it may contain oddly formatted files and strings for a good reason. Signed-off-by: Jiri Podivin <jpodivin@gmail.com>	2023-06-17 13:32:48 +03:00
David Yang	92f20d9942	train : get raw text instead of page with html (#1905 ) We probably want to train using just the text of Shakespeare instead of the html of the page displaying his work.	2023-06-17 09:51:54 +03:00
SuperUserNameMan	b41b4cad6f	examples : add "simple" (#1840 ) * Create `simple.cpp` * minimalist example `CMakeLists.txt` * Update Makefile for minimalist example * remove 273: Trailing whitespace * removed trailing white spaces simple.cpp * typo and comments simple.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-16 21:58:09 +03:00
FrankHB	5b9ccaf104	Fixed possible macro redefinition (#1892 ) MinGW libstdc++ may define `NOMINMAX` unconditionally. This fixes the case when it is already defined.	2023-06-16 21:25:01 +03:00
Borislav Stanimirov	9cbf50c041	build : fix and ignore MSVC warnings (#1889 )	2023-06-16 21:23:53 +03:00
yangli2	c36e81da62	examples : add chat-vicuna.sh (#1854 ) Co-authored-by: Yang Li <yangliyl@google.com>	2023-06-15 21:05:53 +03:00
Srinivas Billa	9dda13e5e1	readme : server compile flag (#1874 ) Explicitly include the server make instructions for C++ noobsl like me ;)	2023-06-15 20:36:38 +03:00
Johannes Gäßler	6b8312e797	Better error when using both LoRA + GPU layers (#1861 )	2023-06-15 19:06:46 +02:00
Johannes Gäßler	254a7a7a5f	CUDA full GPU acceleration, KV cache in VRAM (#1827 ) * Fixed CUDA RoPE * ggml_cuda_mul_mat_vec_p021 * ggml_cuda_scale * ggml_cuda_diag_mask_inf * ggml_is_permuted * ggml_cuda_cpy * flatten rows for ggml_cuda_op * Added a --low-vram option * Fixed Windows performance * Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM	2023-06-14 19:47:19 +02:00
0xspringtime	9254920265	baby-llama : fix operator!= (#1821 ) * Update baby-llama.cpp Seems to be an error in the implementation of the operator!= function. It attempts to compare the this pointer (a llama_hparams_lora object) with the other pointer (a llama_hparams object) using memcmp. This can lead to incorrect results because the sizes of the objects being compared (sizeof(llama_hparams) and sizeof(llama_hparams_lora)) are different, should now be able to compare two llama_hparams_lora objects for inequality. * Update baby-llama.cpp * Update baby-llama.cpp	2023-06-13 22:37:54 +03:00
xaedes	e32089b2c2	train : improved training-from-scratch example (#1652 ) * add python wrapper https://gist.github.com/abetlen/2b90e5f153f6efd00931d098de5c73ce * fix decoding error. adds errors=ignore parameter * add python bindings for functions to get and set the whole llama state (rng, logits, embedding and kv_cache) * update python bindings * add text generating baby-llama from scratch example * fix race condition bug in ggml_compute_forward_diag_mask_f32 * implement ggml_soft_max_back for more performant backward pass of soft_max avoids creating big intermediate matrices of size n_embd x n_embd for llama layers and n_vocab x n_vocab for cross entropy loss * improve softmax backward pass go from quadratic runtime to linear runtime by simplifying the formulas * fix race condition bug in non-inplace ggml_compute_forward_diag_mask_f32 memcpy needs to be synchronized across threads to avoid race conditions. => do it in INIT phase * fix bug in ggml_compute_forward_soft_max_back_f32 on DEBUG build * improve performance of mul_mat backward pass avoid transpose by using mul_mat with swapped arguments * avoid printing too much newlines in baby-llama-text * activate threading in baby-llama-text * add ggml_out_prod and use it for mul_mat backward pass for improved performance performance stats report improvement from 37 seconds to 16 seconds runtime during my training tests * better weight initialization improves training convergence at start * better weight initialization improves training convergence at start * improve ggml_out_prod performance - change iteration order (>15s -> 10s runtime) - parallelize over one more dimension: over dst matrix rows (10s -> <5s runtime) * add llama sampler, shuffle samples and constrain sampling to tokens occurring in train data * fix get_samples call, add model tensor names, increase model size, start training samples after newline * save train trained model to checkpoint and load model to be trained from checkpoint * use inplace functions where possible * initialize rng with srand * use different arguments for input and output checkpoint * ggml fixes to support backward pass on inplace operations * remove duplicate include * fix cross entropy loss - add target probabilities for each sample which is then used in cross entropy loss * print used memory before and after optimization * sample with non-greedy sampling parameters at the end of training * add cmake target for baby-llama-text * add ggml_add1_inplace to header * enable gradient propagation for inplace add1 and scale operations those functions backward passes don't need the original src0, so they also work when forward is inplace * implement AdamW in ggml_opt_adam by adding weight decay parameter (default 0.001f) also add a schedule parameter (default 1.0f) that can be used to scale alpha and decay according to learning schedule. setting the decay parameter to zero disables AdamW resulting in normal Adam optimizer. since the difference between Adam and AdamW is minimal it is not implemented as another optimizer, but integrated into the existing Adam optimizer. * use inplace operations in cross_entropy_loss * fix random weight initialization scale * add missing default parameters for adam optimizer * add ggml_opt_context, so that we can properly resume training otherwise the optimizer states, tracking statistics about the error function and its derivates, will reset to zero each time ggml_opt is called, hindering convergence on resumed training. now the optimizer context and all its memory is stored in a separate struct. * fix bug in llama_sample_token_mirostat_v2 when all candidates are filtered out through mu threshold, the following soft_max operation will fail. so keep at least one. * add forward function without using cache, for more performant training during training on whole samples no cache is required. removing the cache and simplifying the remaining code results in performance and memory usage improvement. * print suppressed newline tokens as string "\n" printing too much actual newlines is suppressed to avoid flooding the console. * store optimizer state in training checkpoint and add learning schedule persistent optimizer state allows to resume training without resetting the optimizer learning schedule consists of linear warmup ramp followed by cosine decay with restarts * remove unused functions * fix bug in get_samples which corrupted training targets * save checkpoint only when it was trained * simplify code * remove trailing whitespace * simplify backward pass for SQRT * replace inefficient repeat backward pass with dedicated repeat_back operation * add ggml_cross_entropy_loss with backward pass for faster training cross entropy loss can also be implemented using softmax and log, but as dedicated operation it is faster and especially avoids unnecessary memory overhead. * add tests for cross_entropy_loss backward pass finite differences regularly results in estimated gradient of zero, despite the backward pass giving non zero gradient. _probably_ the finite differences fails due to numerical issues * use ggml_cross_entropy_loss in text training example * remove trailing whitespace * slightly improve how cross entropy loss is compute btw: directly implemented cross entropy loss seems to have way lower magnitudes than when implemented with softmax and log. probably the input to log gets closer to zero due to float numerics. maybe the multiplication by (1.0-eps)/sum is more accurate.. * add llama_get_vocab to get the vocabulary as output parameters * set default model.type for unknown models with few layers * add export of training checkpoint to llama compatible model file * get vocabulary for exporting training checkpoint to llama compatible model file * implement backward pass of flash attention * bugfixes for backward pass of flash attention * test flash attention backward pass need to set loose error bounds to pass. the finitie differences are close to numeric limits and often return quite different values than the backward pass. reducing eps further lets the gradients vanish completely. likewise setting eps to big results in wronger values. the softmax in the middle of the function is probably the most responsible for the numeric issues using finite differences. * add option to train with flash attention and move options to the top of the main function training from scratch also works with flash attention training convergence and generation results after fix number of iterations are worse than when not using flash attention. maybe there still lingers a bug in the flash attention backward pass? but training works, just with slower convergence. flash attention is still worth to use, because it requires way less memory and is faster with high n_ctx * add train_params and command line option parser * remove unnecessary comments * add train params to specify memory size * remove python bindings * rename baby-llama-text to train-text-from-scratch * replace auto parameters in lambda function * add #include <climits> * add explicit cast to fix compile error "error: non-constant-expression cannot be narrowed from type 'int64_t' (aka 'long long') to 'uint32_t' (aka 'unsigned int') in initializer list [-Wc++11-narrowing]" * remove trailing whitespace * add ggml_opt_resume_g which accepts forward and backward cgraphs * fix formulas in comments * bug fix for ggml_compute_forward_get_rows_back_f32 the result should be set to zero, not to whatever data is in opt0 * improve training memory usage with scratch buffers instead of relying on the automatic backward pass, we manually create the graph for the backward pass. it turns out that all backward pass operations need only temporary memory which can be reused after each layer. will compute backward pass for ALL model parameters * add option to use scratch buffers in training or not make it configurable because currently training with scratch buffers implies flash attention and optimization over all parameters. * ci : disable temporary * store view offset and permute axes in opt[0] instead of storing it in padding use memcpy to store offset, because offset is of type size_t. when storing it as int32_t offset would have to be smaller than 2^31 which is not necessarily true. * minor : fix compile warnings + minor style changes * fix bug in threaded indices calculation of ggml_compute_forward_flash_attn_back_f32 * store view offset like in master branch * bug fix in forward_batch_wo_cache_flash_attn_train * scratch buffer bug fixes in forward_batch_wo_cache_flash_attn_train data of permute and reshape is the same as their input. if we want to preserve the output of permute/reshape, we also need to preserve their inputs. replace reshape(src0, src1) with reshape_nd calls so that we don't need src1. replace (temporary) t03 with ggml_repeat(ctx0, layer.attention_norm, t02). in the future we could also use the new broadcasting ggml_mul to avoid these repeat calls. for this we need backward pass of broadcasting ggml_mul. * remove unnecessary scratch buffer 0 buf 0 is persistent memory, so we can just disable scratch for this by using buf -1 * avoid creating unnecessary grad tensors previously we need to create grads for model parameters, so that expand(..) correctly populates cgraph->leafs & cgraph->grads this wasted memory, because unnecessary grad for each op were automatically created: the automatically generated grad was unnecessary because we later manually set the grad (e.g. t35->grad = expand(gb, ...) ). this discarded the automatically generated grad resulting in wasted memory. improved this by changing expand(..) to not use ggml_build_forward_expand. expand set cgraph->nodes but not the leafs. cgraph->leafs & cgraph->grads are set in another pass after the last expand call. * print used training seed * zero initialize gfbuf and gbbuf * ci : re-enable workflows + add README for training --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-13 22:04:40 +03:00
Georgi Gerganov	2347e45e7b	llama : do a warm-up eval at start for better timings (#1824 )	2023-06-13 20:20:07 +03:00
Kerfuffle	74d4cfa343	Allow "quantizing" to f16 and f32 (#1787 ) * Allow "quantizing" to f16 and f32 Fix an issue where quantizing didn't respect LLAMA_NO_K_QUANTS Add brief help to the list of quantization types in the quantize tool Ignore case for quantization type arguments in the quantize tool	2023-06-13 04:23:23 -06:00
Kerfuffle	fa84c4b3e8	Fix issue where interactive mode crashes when input exceeds ctx size (#1789 ) * Fix issue where interactive mode in the main example crashes when input exceeds ctx size * Ensure the context size is at least 8 tokens in the main example. Closes #1768	2023-06-11 08:19:17 -06:00
Kerfuffle	4f0154b0ba	llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691 ) * Add support for quantizing already quantized models * Threaded dequantizing and f16 to f32 conversion * Clean up thread blocks with spares calculation a bit * Use std::runtime_error exceptions.	2023-06-10 10:59:17 +03:00
Willy Tarreau	35a84916fb	main: add the possibility to open the prompt cache read-only (#1640 ) The prompt cache constitutes a nice speed up when using the same prompt prefix across multiple evaluations, but when using it, it will also be updated, which is not always desirable. One use case is to have a large prompt containing some context and usage rules, and a second part containing variable data of the problem being studied. In this case it's desirable to be able to save the first part once, and to always reuse it as-is without updating it with the second part. The new argument --prompt-cache-ro enables this read-only mode on the prompt cache. The prompt's contents that match the cache are loaded from the cache but the rest is not modified. This allowed to reduce a total analysis time from 112s to 49.7s here, without having to backup and restore a copy of the prompt, which takes significant time at 500 MB. Signed-off-by: Willy Tarreau <w@1wt.eu>	2023-06-06 22:10:17 -04:00
Johannes Gäßler	17366df842	Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703 ) * CUDA multi GPU + scratch ggml_cuda_compute_forward Tensor parallelism ggml_cuda_add ggml_cuda_rms_norm ggml_cuda_silu CUDA scratch buffer --main-gpu CLI option	2023-06-06 21:33:23 +02:00
Kawrakow	99009e72f8	ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684 ) * Starting to add k-quantization to ggml I think it is better to have quantization separate from ggml. For now just adding the k-quants there, but it would be better to also factor out the existing ggml quantizations. * Adding Q3_K and Q8_K (de)-quantization * Q3_K now working on CUDA and AVX2/scalar CUDA is not ideal - ~50% slower than Q4_0 for single token prediction, about the same in batch mode (perplexity). CPU single token is ~55 ms (on Ryzen 7950X). * Some improvement for Q3_K on CUDA It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0. * Some more CUDA optimizations for Q3_K Single token is now 20.5 ms/token (~20% slower than Q4_0). Perplexity is on par with Q4_0. * Adding Q4_K - scalar, AVX2, CUDA Performance is the same or perhaps very slightly better than Q4_0 on the CPU. On the GPU, single token prediction is ~10% better than Q4_0, batch mode (perplexity is about the same). * Adding Q6_K - scalar, AVX2, CUDA Performance is ~40% lower compared to Q4_K on the CPU. This is to be expected, considering that we are memory bound on the CPU and the 6-bit model is ~44% larger than the 4-bit. On the GPU, single token prediction is ~6% lower than Q4_0, batch mode (perplexity) is even closer (but still slower). * Adding Q5_K - scalar, AVX2, CUDA Performance is ~20% lower compared to Q4_K on the CPU. This is to be expected, considering that we are memory bound on the CPU and the 5-bit model is ~22% larger than the 4-bit. On the GPU, single token prediction is about the same as Q4_0 for both, single token and batch prediction. * Per convention, all QX_K quantizations use Q5_K for output.weight * Adding quantization mixes * Quantization mixes: didn't quite get what I wanted in the last commit * Q4_K dot product for ARM_NEON * Q6_K dot product for ARM_NEON * Q5_K dot product for ARM_NEON * Adding Q3_K dot for ARM_NEON It is 22% slower than Q4_K, despite the smaller model size. On x86_64, where we are memory bound, the Q3_K model is quite a bit faster than Q4_K. * A very slightly faster ARM_NEON Q3_K dot * Adding Q2_K - just CUDA for now Token prediction is pretty good - about 15.5 ms on a RTX 4080. Perplexity is about the same as Q4_K. * Adding scalar and AVX2 Q2_K dot * Adding ARM_NEON Q2_K dot About the same performance as Q4_K. * A slightly faster ARM_NEON Q2_K dot Single token prediction is now ~36 ms on M2 Max. The code is much simpler too. * Fixed bug in Q2_K CUDA dot product kernel Stranegly enough, for the few prompts I tried with the 7B model the responses looked perfectly reasonable. Only realized something is not quite right when I tried the larger models and started getting nonse back. In any case, Q2_K single token evaluation time on an RTX 4080 in a Ryzen7950X box iusing CUDA and model fully loaded on the GPU are ~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B. The max number of layers that fit in VRAM for The 65B is 32. With that, we get ~330 ms per token, which is not that much faster than just running on the CPU (~470 ms per token). * Don't print zeros/NaNs when no count histogram has been collected * A 10% faster CUDA vector dot kernel for Q3_K Q3_K is now running at ~18.5 ms / token on CUDA, so the gap to Q4_0 is only 10%. It seems memory acccess pattern is more important for performance than the amount of computation the kernel does. * A slightly daster Q4_K AVX2 dot product For perplexity, where we are less memory bound, time per pass drops by ~5%. Barely measurable difference for single token prediction. * A slightly faster ARM_NEON A4_K dot product * Minor * Fix quantization error test We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit quantization variants. * Fix docker build I have been sloppy with vector reinterpret casts on ARM_NEON. It seems clang is very forgiving in that regard. * Added forgotten ggml.o dependence on k_quants.h to the Makefile * Had unintentionally committed the Makefile with -Ofast enabled * ggml : rename k_quants -> ggml-quants-k, use lowercase in code --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-05 22:56:18 +03:00
Georgi Gerganov	ecb217db4f	llama : Metal inference (#1642 ) * mtl : export the LLaMA computation graph * ci : disable temporary * mtl : adapt the MNIST example as starter * mtl : no need for mtl-export tool, add cli arg for main instead * mtl : export just a small part of the graph for now to make it easier * mtl : move MSL code into separate file for easy editing * mtl : initial get_rows_q4_0 kernel * mtl : confirmed get_rows_q4_0 is working correctly * mtl : add rms_norm kernel + confirm working * mtl : add mul kernel + confirm working * mtl : initial mul_mat Q4 kernel (wrong results) * mtl : mul_mat fixes (still wrong) * mtl : another mul_mat Q4 (still does not work) * mtl : working mul_mat q4 * ggml : fix handling of "view" ops in ggml_graph_import() * mtl : add rope kernel * mtl : add reshape and transpose handling * ggml : store offset as opt arg for ggml_view_xd() operators * mtl : add cpy kernel + handle view ops * mtl : confirm f16 x f32 attention mul mat * mtl : add scale kernel * mtl : add diag_mask_inf kernel * mtl : fix soft_max kernel * ggml : update ggml_nbytes() to handle non-contiguous tensors * mtl : verify V tensor contents * mtl : add f32 -> f32 cpy kernel * mtl : add silu kernel * mtl : add non-broadcast mul kernel * mtl : full GPU inference of the computation graph * mtl : optimize rms_norm and soft_max kernels * mtl : add f16 mat x f32 vec multiplication kernel * mtl : fix bug in f16 x f32 mul mat + speed-up computation * mtl : faster mul_mat_q4_0_f32 kernel * mtl : fix kernel signature + roll inner loop * mtl : more threads for rms_norm + better timing * mtl : remove printfs from inner loop * mtl : simplify implementation * mtl : add save/load vocab to ggml file * mtl : plug Metal inference into llama.cpp (very quick-n-dirty) * mtl : make it work with main example Lots of hacks but at least now it generates text * mtl : preparing for merge * mtl : clean-up ggml mtl interface + suport scratch / inplace * mtl : remove temp / debug code * metal : final refactoring and simplification * Revert "ci : disable temporary" This reverts commit `98c267fc77`. * metal : add comments * metal : clean-up stuff, fix typos * readme : add Metal instructions * readme : add example for main	2023-06-04 23:34:30 +03:00
Evan Jones	136476e898	Fix prompt cache saving and chat-persistent rollover (#1678 ) * Fix prompt cache saving and chat-persistent rollover (fixes #1670) * clang-tidy Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2023-06-03 07:28:45 -04:00
DannyDaemonic	248367605e	Work around for recalculating logits in cached prompts (Fixes #1585 ) (#1609 ) * Work around for recalculating logits in cached prompts	2023-05-29 05:13:40 -07:00
Kerfuffle	1b78ed2081	Only show -ngl option when relevant + other doc/arg handling updates (#1625 ) 1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS) 2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible. 3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation 4. Update `main` and `server` examples documentation to use the new style dash separator argument format 5. Update the `server` example to use dash separators for its arguments and adds `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility. 6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356	2023-05-28 11:48:57 -06:00
Vladimir Zorin	337aea1139	examples : add --alias option to gpt_params to set use friendly model name (#1614 )	2023-05-28 20:14:24 +03:00
Kerfuffle	0df7d63e5b	Include server in releases + other build system cleanups (#1610 ) Set `LLAMA_BUILD_SERVER` in workflow so the `server` example gets build. This currently only applies to Windows builds because it seems like only Windows binary artifacts are included in releases. Add `server` example target to `Makefile` (still uses `LLAMA_BUILD_SERVER` define and does not build by default) Fix issue where `vdot` binary wasn't removed when running `make clean`. Fix compile warnings in `server` example. Add `.hpp` files to trigger workflow (the server example has one).	2023-05-27 11:04:14 -06:00
Kerfuffle	66874d4fbc	Some improvements to loading the session with --prompt-cache (#1550 ) Improvements to loading the session with `--prompt-cache` in the `main` example. 1. Fix an issue where the `--seed` parameter was ignored when loading a cached prompt. 2. When loading a cached prompt, you previously had to specify the saved prompt (or a prefix of it) again. This pull changes that behavior to default to the prompt that was cached if a prompt wasn't specified by the user.	2023-05-25 20:18:01 -06:00
Senemu	1359b6aba5	chat-persistent.sh : use bracket expressions in grep (#1564 )	2023-05-24 09:16:22 +03:00
Steward Garcia	7e4ea5beff	examples : add server example with REST API (#1443 ) * Added httplib support * Added readme for server example * fixed some bugs * Fix the build error on Macbook * changed json11 to nlohmann-json * removed some whitespaces * remove trailing whitespace * added support custom prompts and more functions * some corrections and added as cmake option	2023-05-21 20:51:18 +03:00
Georgi Gerganov	ec2e10c444	llama : add llama_init_backend() API (close #1527 )	2023-05-20 11:06:37 +03:00
DannyDaemonic	d2c59b8ba4	Fix for mingw (#1462 )	2023-05-20 00:40:02 -07:00
Evan Jones	943e6081cc	examples : add persistent chat (#1495 ) * examples : add persistent chat * examples : fix whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-05-19 20:39:51 +03:00
Jason McCartney	7694b52b9a	main : make reverse prompt option act as a stop token in non-interactive mode (#1032 ) * Make reverse prompt option act as a stop token in non-interactive scenarios * Making requested review changes * Update gpt_params_parse and fix a merge error * Revert "Update gpt_params_parse and fix a merge error" This reverts commit `2bb2ff1748`. * Update gpt_params_parse and fix a merge error take 2	2023-05-19 20:24:59 +03:00
Georgi Gerganov	4b7e245adf	minor : fix compile warnings	2023-05-19 20:14:51 +03:00
DannyDaemonic	ee9654138a	Fixes #1511 lambda issue for w64devkit (mingw) (#1513 ) * Fix for w64devkit and mingw	2023-05-18 19:30:40 +02:00
Stephan Walter	dc271c52ed	Remove unused n_parts parameter (#1509 )	2023-05-17 22:12:01 +00:00
rankaiyx	c238b5873a	benchmark-matmul: Print the average of the test results (#1490 )	2023-05-17 16:47:58 +02:00
András Salamon	9560655409	define default model path once, sync path with readme (#1366 )	2023-05-16 17:46:34 +02:00
zrm	63d20469b8	fix get_num_physical_cores() (#1436 ) * fix get_num_physical_cores() had been broken on complex topologies because "cpu cores" in /proc/cpuinfo is per-"physical id" * Add spaces to maintain consistent formatting --------- Co-authored-by: slaren <ddevesa@gmail.com>	2023-05-15 04:25:42 +02:00
slaren	b5c9295eef	benchmark-matmul: fix clang-tidy issues, report results in GFLOPS (#1458 ) * benchmark-matmul: fix command line parsing, replace macros with functions, report results in GFLOPS	2023-05-14 22:46:00 +02:00
Johannes Gäßler	905d87b70a	ggml : GPU-accelerated token generation (#1412 ) * CUDA kernel for q4_0 dequant. + mat. vec. mult. * Added q4_1 via template * Added missing __syncthreads(); * --gpu_layers -> --gpu-layers * Shorter dequantize_mul_mat_vec line * q5_0 dequantize_mul_mat kernel * More readable dequantize_mul_mat_vec logic * dequantize_mul_mat_vec kernels for q5_1, q8_0, f16 * llama : offload "output" tensor to GPU too + coding style fixes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-05-13 16:38:36 +03:00
xaedes	f954edda93	ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360 ) * implement 8 of 14 missing backward pass operations used by llama - GGML_OP_ADD_AT - GGML_OP_CPY - GGML_OP_MUL_MAT (src0.grad) - GGML_OP_PERMUTE - GGML_OP_RESHAPE - GGML_OP_SCALE - GGML_OP_TRANSPOSE - GGML_OP_VIEW implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW. this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset). the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0. still missing backward passes for llama: - GGML_OP_DIAG_MASK_INF - GGML_OP_GET_ROWS - GGML_OP_RMS_NORM - GGML_OP_ROPE - GGML_OP_SILU - GGML_OP_SOFT_MAX * implement 5 of 6 missing backward pass operations used by llama - GGML_OP_DIAG_MASK_INF - GGML_OP_GET_ROWS - GGML_OP_RMS_NORM - GGML_OP_SILU - GGML_OP_SOFT_MAX add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1. GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know... GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF. Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants. staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and functions with "_inplace" are added which are inplace. in llama we need to call the inplace variants so that it is implemented as before. for llama backward pass we need to use the non-inplace variants. still not completely implemented backward passes for llama: - GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK - GGML_OP_GET_ROWS: only necessary for tokenizer * norm & rms_norm can not be threaded: after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees. * remove already resolved TODO * implement backward pass of ggml_rope and ggml_rope_back * implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back * add test-grad0.c * use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console * test both gradients of mul_mat * disable graph dot export as it floods console * bug fixes for silu_back * successfully test silu backward * bug fix for scale backward pass use sum instead of mean for gradient of scalar scale parameter * successfully test scale backward * improve performance of sum backward pass use add1(x,y) instead of add(x,repeat(y,x)) * improve performance of sqr backward pass use scale(x,y) instead of mul(x,repeat(y,x)) * successfully test rope backward * bug fix for cpy backward pass * successfully test cpy backward * bug fix for reshape backward pass * successfully test reshape backward * add test-opt.c this uses ggml_opt to train a,b for minimal e=sum(sqr(c - ab)) for random initial a,b,c correctly implement softmax backward pass using new operation ggml_diag ggml_diag constructs diagonal matrices with entries. ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d] * successfully test soft_max backward * align shape annotations * add shape annotations for llama * de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type. with this we can duplicate tensor of any typ as long as they are contiguous. * fix ggml_compute_forward_dup_same_cont for when nelements < nthreads when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy * bug fix for add_at forward required for view backward pass src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function. * successfully test view backward * minor code format improvement * fix ggml_forward_add functions to work correctly with transposed tensors uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions. this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32. * fix ggml_forward_add1 functions to work correctly with transposed tensors uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions. this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32. * test-grad0.c : add print_elements to help with debugging * successfully test permute backward * some minor test-grad0 fixes * fix sub, mul and div functions to work correctly with transposed tensors uses the same logic as in add * implement ggml_cont backward pass * successfully test transpose backward and permute for all permutations also test sub, mul and div up to max n_dims * test-grad0.c add TODO for view_2d and view_3d add_at (required for view backward pass) is a bit tricky for n_dims > 1. * fix comments * successfully test diag_mask_inf and diag_mask_zero backward * test-grad0 : fix test for div nargs and ndims was swapped, corrupting the stack * fix diag_mask to work with non-inplace input * move dup call into the actual add_at functions * fix get rows backward pass * successfully test get_rows backward * fix view backward pass add nb parameters to add_at like in view. together with offset they define how to view dst and src0 during the add_at operation. * successfully test backward pass of view_1d, view_2d and view_3d * fix backward pass for rms_norm I would have used formulas from other frameworks, but they differed so I could not decide which is correct. Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification. * successfully test backward pass of rms_norm some tests may fail when gradients are large. could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds. when looking at the values the "failed" tests look actually ok. for example: rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324 it is due to the test logic in check_gradients that they fail. * add todos for llama backward pass - implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required) - repeat is not yet tested and looks like it only works for single element src0 inputs. * add operation ggml_sum_rows ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d] * add missing GGML_OP_SUM_ROWS * fix backward pass for repeat requires ggml_sum_rows * successfully test backward pass of repeat * update quantization types in switch-case of add_at and add1 * add baby-llama example training a very small llama model from scratch to output a sinusoidal wave. had to increase maximum number of optimization parameters to train from scratch. * fix softmax in baby-llama example * switching from training with adam to lbfgs produces much better results in the baby-llama example * train with two examples, creating new tensors each time.. * fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed. so we need to keep the original gradients and make dups for opt * train on multiple examples, generate & print tokens with trained model afterwards ctx0 for evaluation and optimization is renewed for each sample * add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d * fix soft_max backward pass for input->ne[1] != 1 * add ggml_log operation necessary for cross entropy loss * add test for ggml_log gradients * implement backward pass for ggml_sum_rows, necessary for cross entropy loss * implement ggml_repeat support for rank > 2 tensors * add test for ggml_sum_rows gradients * fix training get_example_targets predict the next token, not the current token! * add square_error_loss and cross_entropy_loss functions * optimize loss over multiple samples this increases computation graph, need parallel batched forward for more efficiency. * fix backward pass for add_at and change arguments to have same order as in view * add ggml_set(ctx, a, b) to set b in view of a and return modified a necessary to set values into kv_self cache and properly propagate the gradients * fix kv_self gradients for training use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients * replace inplace operations for training with copying operations to allow gradient propagation * add GGML_ASSERT to catch ggml_rope and back value errors * add trainable lora-only model with all big matrices C split into A,B with AB=C this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices. training this instead of the normal model resulted in much worse results though... vastly improve training results instead of logit targets 0 and 1 use -1 and +1. * shorten code using a variable * change name of GGML_OP_ADD_AT to GGML_OP_ACC * smaller default values for baby llama model parameters * update static assert of GGML_OP_COUNT * remove shape annotations in llama_eval_internal * revert disabling of threading for rms_norm and norm * rename print functions in baby-llama example * fix call to ggml_set_name * add missing include for strcmp, etc * remove trailing whitespace * reduce number of test-grad0 iterations avoid exceeding timeout of automated tests * remove busy loop that was used as sleep for slower sinus wave generation * disable slow tests grad0 and opt to avoid exceeding timeouts * c++ in baby-llama example use c++ includes instead of c includes use std::min, std::max instead of MIN, MAX macros * c++ in baby-llama example use c++ includes instead of c includes use std::min, std::max instead of MIN, MAX macros * ggml : fix compiler warnings + cosmetic changes * ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back * swap arguments to vDSP_vdiv call documentation for vDSP_vdiv states: "Note that B comes before A!" * swap arguments to vDSP_vdiv call documentation for vDSP_vdiv states: "Note that B comes before A!" * ggml : swap vDSP_vsub args as per documentation * add parallel batched forward function for baby-llama training * cleanup code for batched training * remove trailing whitespace * minor : fix compiler warnings + indentation style * ggml : fix null ptr deref in backward pass * ggml : remove Q4_2 remnants * ggml : fix clang-tidy warnings * baby-llama : couple of clang-tidy warnings --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-05-13 15:56:40 +03:00
Rinne	6456a4eb9f	embedding : remove unused code (#1426 )	2023-05-13 10:24:20 +03:00
Georgi Gerganov	fb62f92433	llama : fix --mtest option (close #1414 )	2023-05-12 21:44:20 +03:00
Johannes Gäßler	773ee249fb	CLI args use - instead of _, backwards compatible (#1416 )	2023-05-12 14:34:55 +00:00
Georgi Gerganov	b9fd7eee57	ggml : remove bit shuffling (#1405 ) * ggml : remove Q4_0 bit shufling (ARM NEON) * ggml : remove Q4_1 bit shuffling (ARM NEON + reference) * ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON) * ggml : remove Q4_2 bit shuffling (WIP, BROKEN) * ggml : remove Q5_0 bit shuffling (ARM NEON) * ggml : 2x faster scalar implementations * ggml : remove Q5_1 bit shuffling (ARM NEON + scalar) * ggml : simplify scalar dot * ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit * ggml : fix Q4_1 quantization * ggml : update cuBLAS + normalize variable names * ggml : remove Q4_2 mode * ggml : minor formatting * ggml : fix Q5_0 quantization * scripts : add script for measuring the time per token * AVX implementations (#1370) * ggml : uniform 5th bit extraction * llama : produce error upon loading old model files * llama : fix model magic/version write * ggml : speed-up Q5_0 + Q5_1 at 4 threads * ggml : preserve old Q4 and Q5 formats * ggml : simplify Q8_1 - no need for low / high sums anymore * ggml : fix Q8_0 and Q8_1 rounding * Revert "AVX implementations (#1370)" This reverts commit `948d124837`. * ggml : fix AVX2 implementation * sha : update hashes for 7B and 13B * readme : update timings + remove warning banner * llama : update v2 PR number to 1405 * ggml : fix WASM comments * ggml : back to original bit order * readme : add note that Q4 and Q5 have been changed * llama : fix return for unknown version --------- Co-authored-by: Stephan Walter <stephan@walter.name>	2023-05-12 00:23:08 +03:00
Evan Jones	cf348a60e0	main : add option to save full output to session (#1338 ) * main : add option to save full output to session * split behavior into --session and --prompt-cache * restore original implementation with new names * PR comments * move the check for incompatible parameters to gpt_params_parse * Fix whitespace Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com> --------- Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>	2023-05-10 11:37:14 -04:00
DannyDaemonic	e6a46b0ed1	Locale fix for Windows (#1379 )	2023-05-09 19:53:28 +02:00
DannyDaemonic	41654efea8	Interface improvements and `--multiline-input` (previously `--author-mode`) (#1040 ) * Interface improvements * Multiline input * Track character width * Works with all characters and control codes + Windows console fixes	2023-05-08 19:45:48 -07:00
Georgi Gerganov	f9a6364912	llama : require first token to be BOS (#1303 ) * llama : require first token to be BOS * scripts : add ppl-run-all.sh * perplexity : add BOS for each chunk * readme : update perplexity values after BOS fix * perplexity : add clarifying comments	2023-05-08 17:41:54 +03:00
Johannes Gäßler	1f48b0abcf	Documented CUDA reproducibility, added warning (#1346 )	2023-05-08 02:42:01 +02:00
Jed Fox	3924088512	Remove default arguments from sampling functions (#1343 )	2023-05-06 17:01:47 -04:00
slaren	94c5652fc0	quantize: make output filename optional, default to ggml-model-<ftype>.bin (#1301 )	2023-05-05 00:58:56 +02:00
44670	2edbdb0f99	main : add --in-suffix option (#1318 ) * adding --in-suffix option * print input suffix before generation	2023-05-04 18:41:12 +03:00
DannyDaemonic	db1080876a	Only escape prompts when used with `-e` (#1311 )	2023-05-04 05:08:25 -07:00
DannyDaemonic	c65a7fbfa9	Update main's README.md with new features (#1296 )	2023-05-04 03:02:59 -07:00
Tomas	f647ce040f	fix #1224 reverse prompt and multi line (#1297 ) * fix reverse prompt and multi line * Code Formatting Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-05-04 03:02:30 -07:00
khimaros	6daa09d879	examples : read chat prompts from a template file (#1196 )	2023-05-03 20:58:11 +03:00
CRD716	a8a2efdc81	examples : various prompt and example fixes (#1298 ) * fix dan.txt * miku prompt improvements * use common characters	2023-05-03 18:26:47 +03:00
DannyDaemonic	2485d7a4d3	Process escape sequences given in prompts (#1173 )	2023-05-02 18:46:20 -07:00
DannyDaemonic	13b0c68ed7	Handle signals properly on Windows (#1123 )	2023-05-02 18:01:57 -07:00
slaren	bf4b22ffe4	fix missing parameters in `llama_init_from_gpt_params` (#1293 )	2023-05-03 01:36:45 +02:00
Ron Evans	67c77799e0	examples : add llama_init_from_gpt_params() common function (#1290 ) Signed-off-by: deadprogram <ron@hybridgroup.com>	2023-05-02 23:39:51 +03:00
Georgi Gerganov	0e6cbff1b7	llama : fix compile warnings	2023-05-02 23:09:08 +03:00
Ron Evans	8c9be35ff9	examples : improve vertical alignment of a few variables (#1286 ) Signed-off-by: deadprogram <ron@hybridgroup.com>	2023-05-02 20:53:52 +03:00
Robert Brisita	2bb992f034	llama : allow 0 as a seed number. (#1275 )	2023-05-02 19:23:44 +03:00
Ron Evans	e2cd506999	main : switch input_noecho to input_echo to remove negation (#979 ) Signed-off-by: deadprogram <ron@hybridgroup.com>	2023-05-02 19:13:26 +03:00
DannyDaemonic	f4cef87edf	Add git-based build information for better issue tracking (#1232 ) * Add git-based build information for better issue tracking * macOS fix * "build (hash)" and "CMAKE_SOURCE_DIR" changes * Redo "CMAKE_CURRENT_SOURCE_DIR" and clearer build messages * Fix conditional dependency on missing target * Broke out build-info.cmake, added find_package fallback, and added build into to all examples, added dependencies to Makefile * 4 space indenting for cmake, attempt to clean up my mess in Makefile * Short hash, less fancy Makefile, and don't modify build-info.h if it wouldn't change it	2023-05-01 18:23:47 +02:00
Georgi Gerganov	70269cae37	llama : fix session load / save (#1263 )	2023-05-01 14:54:59 +03:00
jon-chuang	a5d30b1f53	common : better default number of threads (#934 ) * commit * fix * try-catch * apply code review * improve * improve * add macos headers * done * remove color * fix windows * minor * fix * Apply suggestions from code review Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com> * remove * minor * minor --------- Co-authored-by: jon-chuang <jon-chuang@users.noreply.github.com> Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>	2023-04-30 21:41:35 +03:00
Stephan Walter	f0d70f147d	Various fixes to mat_mul benchmark (#1253 )	2023-04-30 12:32:37 +00:00
Georgi Gerganov	305eb5afd5	build : fix reference to old llama_util.h	2023-04-29 13:53:12 +03:00
Georgi Gerganov	84ca9c2ecf	examples : fix save-load-state + rename llama-util.h	2023-04-29 13:48:11 +03:00
Georgi Gerganov	334637e43e	common : change default parameters to pre-#1126 (#1223 )	2023-04-29 09:51:06 +03:00
Ivan Stepanov	dd7eff57d8	llama : new sampling algorithms (#1126 ) * Sample interface, new samplers. New samplers: - locally typical sampling - tail free sampling - frequency and presence penalty - mirostat Ignore EOS fix: -inf should be used. * mirostat * Added --logit-bias and --no-penalize-nl, removed std::span * Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and k) Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and k) * Save and load example adjust * Tests * Windows build fix * Windows test fix	2023-04-29 08:34:41 +03:00
Stephan Walter	36d19a603b	Remove Q4_3 which is no better than Q5 (#1218 )	2023-04-28 23:10:43 +00:00
CRD716	5fba3c016b	examples : add Jeopardy example (#1168 ) * Basic Setup * Prevent Results.txt from coming up * Prefixes, Line separators, etc * editorcheck * introduction to give more consistent results * Basic graph thing * Grading, ready for testing! * Y'all ready to get funky? * fix column removal stuff * missed a few	2023-04-28 19:13:33 +03:00
Evan Jones	1481a9cf25	llama : add session file format and saved sessions in main (#1169 )	2023-04-28 18:59:37 +03:00
Georgi Gerganov	574406dc7e	ggml : add Q5_0 and Q5_1 quantization (#1187 ) * ggml : add Q5_0 quantization (cuBLAS only) * ggml : fix Q5_0 qh -> uint32_t * ggml : fix q5_0 histogram stats * ggml : q5_0 scalar dot product * ggml : q5_0 ARM NEON dot * ggml : q5_0 more efficient ARM NEON using uint64_t masks * ggml : rename Q5_0 -> Q5_1 * ggml : adding Q5_0 mode * quantize : add Q5_0 and Q5_1 to map * ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195) --------- Co-authored-by: Stephan Walter <stephan@walter.name>	2023-04-26 23:14:13 +03:00
Pavol Rusnak	859fee6dfb	quantize : use `map` to assign quantization type from `string` (#1191 ) instead of `int` (while `int` option still being supported) This allows the following usage: `./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0` instead of: `./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`	2023-04-26 18:43:27 +02:00
Georgi Gerganov	7a32fcb3b2	ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179 ) * ggml : add Q8_0 quantization format (rename the old one to Q8_1) * tests : fix test-quantize-fns * ggml : finalize Q8_0 implementation * ggml : use q4_0_q8_0 and q4_2_q8_0 * ggml : fix Q8_0 dot product bug (ARM) * ggml : Q8_0 unroll x2 * ggml : fix bug - using wrong block type * ggml : extend quantize_fns_t with "vec_dot_type" * ggml : fix Q8_0 to use 255 values out of 256 * ggml : fix assert using wrong QK4_2 instead of QK4_3	2023-04-25 23:40:51 +03:00
xaedes	0c5692345d	examples : add save_load_state example (#1150 ) * add save_load_state example * use <cstdio> instead of <iostream> and fprintf / printf instead of cout * renamed save-load-state example files replacing underscores by dashes	2023-04-24 19:23:31 +03:00
mgroeber9110	9b0a4d4214	examples/main README improvements and some light refactoring (#1131 )	2023-04-24 15:45:32 +00:00
slaren	1d78fecdab	Fix LoRA acronym (#1145 )	2023-04-23 23:03:44 +02:00
DannyDaemonic	edce63baa9	Added README.md for main with examples and explanations (#1139 )	2023-04-23 15:37:02 +00:00
Stephan Walter	c50b628810	Fix CI: ARM NEON, quantization unit tests, editorconfig (#1122 )	2023-04-22 10:54:13 +00:00
wbpxre150	36b4f7e064	llama : print timings on ctrl+c exit (#1021 ) * print timings on ctrl+c exit * remove redundant free memory call. * add global pointer to ctx.	2023-04-22 11:56:35 +03:00
eiery	10f19c1121	llama : have n_batch default to 512 (#1091 ) * set default n_batch to 512 when using BLAS * spacing * alternate implementation of setting different n_batch for BLAS * set n_batch to 512 for all cases	2023-04-22 11:27:05 +03:00
Clint Herron	e9a9cb0c54	examples : Improve Alpaca Default Repeat Penalty: Better Match Alpaca.cpp Experience (#1107 ) * Moving parameters to separate lines for readability. * Increasing repeate_penalty to 1.1 to make alpaca more usable by default. * Adding trailing newline.	2023-04-22 09:54:33 +03:00
Alex Klinkhamer	9411288271	main : evaluate tokens in batches after swapping context (#1014 ) * examples : evaluate tokens in batches after swapping context * Update examples/main/main.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-21 21:18:09 +03:00
slaren	3d59769c3b	Show perplexity ETA in hours and minutes (#1096 )	2023-04-21 14:57:57 +02:00
Kawrakow	38de86a711	llama : multi-threaded quantization (#1075 ) * Multi-threading quantization. Not much gain for simple quantizations, bit it will be important for quantizations that require more CPU cycles. * Multi-threading for quantize-stats It now does the job in ~14 seconds on my Mac for Q4_0, Q4_1 and Q4_2. Single-threaded it was taking more than 2 minutes after adding the more elaborate version of Q4_2. * Reviewer comments * Avoiding compiler confusion After changing chunk_size to const int as suggested by @ggerganov, clang and GCC starting to warn me that I don't need to capture it in the lambda. So, I removed it from the capture list. But that makes the MSVC build fail. So, making it a constexpr to make every compiler happy. * Still fighting with lambda captures in MSVC --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-20 20:42:27 +03:00
Georgi Gerganov	e0305ead3a	ggml : add Q4_3 quantization (#1082 )	2023-04-20 20:35:53 +03:00
Georgi Gerganov	77a73403ca	ggml : add new Q4_2 quantization (ARM only) (#1046 ) * ggml : Q4_2 ARM * ggml : add ggml_is_quantized() * llama : update llama_type_name() with Q4_2 entry * ggml : speed-up q4_2 - 4 threads: ~100ms -> ~90ms - 8 threads: ~55ms -> ~50ms * ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32	2023-04-18 23:54:57 +03:00
slaren	315a95a4d3	Add LoRA support (#820 )	2023-04-17 17:28:55 +02:00
Georgi Gerganov	eb17a026fd	quantize-stats : fix bug in --type argument	2023-04-17 17:31:06 +03:00
Pavol Rusnak	489537e6cf	examples: add missing <ctime> include for time() (#1011 )	2023-04-16 10:13:00 +00:00
Ivan Komarov	c12b14b77f	benchmark : fix result validation in benchmark-q4_0-matmult (#987 )	2023-04-15 08:51:54 +03:00
Pavol Rusnak	c85e03d12e	Revert "main : alternative instruct mode (Vicuna support, etc.) (#863 )" (#982 ) This reverts commit `f4d277ae17`.	2023-04-14 22:58:43 +03:00
Pavol Rusnak	c56b715269	Expose type name from ggml (#970 ) Avoid duplication of type names in utils Co-authored-by: Håkon H. Hitland <haakon@likedan.net>	2023-04-14 20:05:37 +02:00
Tomáš Pazdiora	f4d277ae17	main : alternative instruct mode (Vicuna support, etc.) (#863 ) * Add support for configs, add configurable prefixes / suffixes, deprecate instruct mode, add stop prompt * Add multiline mode, update text input. * bugfix * update implementation * typos * Change --multiline implementation to be toggled by EOF. * bugfix * default multiline mode * add more configs * update formating * update formatting * apply suggestions	2023-04-14 18:19:17 +03:00
Gary Linscott	be87b6ed20	perplexity : add support for batch size to `--perplexity` (#407 ) * Add support to batch size for perplexity * Revert "Fix memory allocation issues and seg faults" This reverts commit `4870e455b3`. * update from merge * Remove perplexity from main * updates * Update batch size for efficiency	2023-04-14 00:50:42 +03:00
CRD716	0e07e6a839	common : remove unnecessary includes (#947 )	2023-04-13 18:39:25 +03:00
Georgi Gerganov	9190e8eac8	llama : merge llama_internal.h into llama.h Hide it behind an #ifdef	2023-04-13 18:04:45 +03:00
CRD716	8cda5c981d	fix whitespace (#944 )	2023-04-13 16:03:57 +02:00
niansa/tuxifan	107980d970	examples : add -n to alpaca and gpt4all scripts (#706 )	2023-04-13 16:03:39 +03:00
SebastianApel	95ea26f6e9	benchmark : add tool for timing q4_0 matrix multiplication (#653 ) * Initial version of q4_0 matrix multiplication benchmark * Bugfix: Added dependency to ggml.o to benchmark * Reviewer requests: added parameter for threads, switched to ggml_time_us() * Reviewer input: removed rtsc, use epsilon for check * Review comment: Removed set_locale * Feature: Param for numer of iterations, Bugfix for use of parameter threads * Reviewer suggestion: Moved to examples * Reviewer feedback: Updated clean: and benchmark: sections --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-13 15:46:23 +03:00
Pavol Rusnak	8b679987cd	Fix whitespace, add .editorconfig, add GitHub workflow (#883 )	2023-04-11 19:45:44 +00:00
Stephan Walter	3e6e70d8e8	Add enum llama_ftype, sync ggml_type to model files (#709 )	2023-04-11 15:03:51 +00:00
comex	2663d2c678	Windows fixes (#890 ) Mostly for msys2 and mingw64 builds, which are different from each other and different from standard Visual Studio builds. Isn't Windows fun? - Define _GNU_SOURCE in more files (it's already used in ggml.c for Linux's sake). - Don't use PrefetchVirtualMemory if not building for Windows 8 or later (mingw64 doesn't by default). But warn the user about this situation since it's probably not intended. - Check for NOMINMAX already being defined, which it is on mingw64. - Actually use the `increment` variable (bug in my `pizza` PR). - Suppress unused variable warnings in the fake pthread_create and pthread_join implementations for Windows. - (not Windows-related) Remove mention of `asprintf` from comment; `asprintf` is no longer used. Fixes #871.	2023-04-11 15:19:54 +02:00
comex	f963b63afa	Rewrite loading code to try to satisfy everyone: - Support all three formats (ggml, ggmf, ggjt). (However, I didn't include the hack needed to support GPT4All files without conversion. Those can still be used after converting them with convert.py from my other PR.) - Support both mmap and read (mmap is used by default, but can be disabled with `--no-mmap`, and is automatically disabled for pre-ggjt files or on platforms where mmap is not supported). - Support multi-file models like before, but automatically determine the number of parts rather than requiring `--n_parts`. - Improve validation and error checking. - Stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields. This has no immediate benefit, but makes it easier to experiment with different formats, and should make it easier to support the new GPTQ-for-LLaMa models in the future (I have some work in progress on that front). - Support VirtualLock on Windows (using the same `--mlock` option as on Unix). - Indicate loading progress when using mmap + mlock. (Which led me to the interesting observation that on my Linux machine, with a warm file cache, mlock actually takes some time, whereas mmap without mlock starts almost instantly...) - To help implement this, move mlock support from ggml to the loading code. - madvise/PrefetchVirtualMemory support (based on #740) - Switch from ifstream to the `fopen` family of functions to avoid unnecessary copying and, when mmap is enabled, allow reusing the same file descriptor for both metadata reads and mmap (whereas the existing implementation opens the file a second time to mmap). - Quantization now produces a single-file output even with multi-file inputs (not really a feature as much as 'it was easier this way'). Implementation notes: I tried to factor the code into more discrete pieces than before. Regarding code style: I tried to follow the code style, but I'm naughty and used a few advanced C++ features repeatedly: - Destructors to make it easier to ensure everything gets cleaned up. - Exceptions. I don't even usually use exceptions when writing C++, and I can remove them if desired... but here they make the loading code much more succinct while still properly handling a variety of errors, ranging from API calls failing to integer overflow and allocation failure. The exceptions are converted to error codes at the API boundary.) Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)	2023-04-10 01:10:46 +02:00
Tomáš Pazdiora	aaf3b23deb	fix for windows utf-8 input (#840 ) Use UTF-16 as input on Windows, since UTF-8 does not work and reads multibyte characters as zeros	2023-04-08 17:49:39 +02:00
unbounded	62cfc54f77	Add quantize-stats command for testing quantization (#728 ) Command that calculates some statistics over the errors introduced by quantization, like mean square error, max error and some percentile errors for layer weights. Should be useful for testing quantization improvements. Exposes some internal state from ggml and llama for testing	2023-04-08 00:09:18 +02:00
Sergey Alirzaev	cc9cee8e9e	Do not crash when it has nothing to say. (#796 ) Otherwise observing this in the interactive mode: /usr/lib/gcc/x86_64-pc-linux-gnu/12/include/g++-v12/bits/stl_vector.h:1230: reference std::vector<int>::back() [_Tp = int, _Alloc = std::allocator<int>]: Assertion '!this->empty()' failed.	2023-04-06 17:59:11 +02:00
at8u	ff05d05c96	miku.sh : add executable bit (#780 )	2023-04-05 18:59:13 +03:00
at8u	88ed5761b8	examples : add Miku.sh (#724 ) * Add Miku.sh to examples * Add missing line to prompt in Miku.sh * Add --keep param to Miku.sh * Remove '[end_of_conversation]' line from Miku.sh No longer is necessary.	2023-04-05 17:32:42 +03:00
mgroeber9110	53dbba7695	Windows: reactive sigint handler after each Ctrl-C (#736 )	2023-04-03 18:00:55 +02:00
Leonardo Neumann	6e7801d08d	examples : add gpt4all script (#658 )	2023-04-02 10:56:20 +03:00
Murilo Santana	5b70e7de4c	fix default params for examples/main (#697 )	2023-04-02 04:41:12 +02:00
Slaren	0d054e292e	Show error message when -f fails	2023-04-01 16:08:40 +02:00
Slaren	64bde3ffd4	Fix ggml_init_params in quantize	2023-03-30 12:28:25 -07:00
Thérence	d9ad104440	Create chat-13B.bat (#592 ) * Create chat-13B.bat Same script than chat-13B.sh, but for windows users. Tested and working on windows 10/11 v 22H2 * Apply suggestions from code review --------- Co-authored-by: anzz1 <anzz1@live.com>	2023-03-29 20:21:09 +03:00
Tobias Lütke	a6956b25a1	add example of re-act pattern (#583 ) * add example of re-act pattern * spelling... * fixed whitespace in reverse prompt issue	2023-03-29 10:10:24 -05:00
anzz1	7f4c5c6651	llama : fix linkage with mingw (#551 ) * Revert `7e53955` (#542) Still needs to be fixed properly * Fix linking on mingw32	2023-03-28 21:23:09 +03:00
Stephan Walter	436e561931	all : be more strict about converting float to double (#458 ) * Be more strict about converting float to double * Test equivalence of round, SILU implementations Test module is commented out in CMakeLists.txt because the tests may take a long time, depending on how much the compiler optimizes. * Fix softmax in perplexity.cpp * all : prefer float over double where appropriate * perplexity : add <cmath> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-03-28 19:48:20 +03:00
Stephan Walter	c1f885067c	ggml : introduce structs for the q4 data blocks (#356 ) * Introduce structs for the q4 data blocks * ggml : rename quant struct variables + fix ARM_NEON --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-03-28 18:56:03 +03:00
anzz1	7b8dbcb78b	main.cpp fixes, refactoring (#571 ) - main: entering empty line passes back control without new input in interactive/instruct modes - instruct mode: keep prompt fix - instruct mode: duplicate instruct prompt fix - refactor: move common console code from main->common	2023-03-28 17:09:55 +03:00
Marco Matthies	7e5395575a	Fix missing ggml link in cmake for examples/* on w64-mingw32 (#542 )	2023-03-27 07:55:26 +03:00
Stephan Walter	b391579db9	Update README and comments for standalone perplexity tool (#525 )	2023-03-26 16:14:01 +03:00
anzz1	7a87d31f4f	[main] fix infinite generation (-n == -1) (#523 )	2023-03-26 16:06:10 +03:00
Harald Fernengel	33e35b8fe8	Exit from interactive mode if input stream is bad (#491 ) Allow exiting the interactive prompt also with CTRL-D on Unix and CTRL-Z on Windows.	2023-03-26 08:25:46 +03:00
anzz1	34ab526843	(Windows) Set console to UTF-8 on init (#420 ) Sets console codepage to 65001 (CP_UTF8) on start for both input and output, should fix problems with UTF-8 characters.	2023-03-25 22:29:22 +02:00
Georgi Gerganov	c2b25b6912	Fix colors enabling on WIN32	2023-03-25 21:53:39 +02:00
Georgi Gerganov	79b2b266db	If n_predict == -1, generate forever	2023-03-25 21:51:41 +02:00
Georgi Gerganov	e2d490dafd	Inifinite generation via context swapping (#71 )	2023-03-25 21:36:22 +02:00
Georgi Gerganov	03f7e33560	Cleanup STL headers + fix embedding examples + minor stuff	2023-03-25 20:51:14 +02:00
Georgi Gerganov	55ad42af84	Move chat scripts into "./examples"	2023-03-25 20:37:09 +02:00
Georgi Gerganov	a316a425d0	Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something !	2023-03-25 20:26:40 +02:00
Georgi Gerganov	04c6f5ed6f	Immediately start processing the prompt before user input has been provided (#476 )	2023-03-24 23:17:58 +02:00
Mathieu Nayrolles	3f9c6135e4	fix typo in chatLLaMa (#368 ) The prompt contains a typo where 'alound' is used instead of 'aloud'.	2023-03-21 22:52:27 +02:00
Jean-Christophe Hoelt	3ab3e6582f	Add chatLLaMa script (#198 ) * Add chatLLaMa script * Fix shellcheck errors and do some cleanup * Move chatLLaMa script to `examples` directory * Reduce chatLLaMa context size to 2048 Ref `d7def1a752` * Include n_predict to 2048 in examples/chatLLaMa	2023-03-21 18:23:15 +02:00

... 16 17 18 19 20 ...

1015 Commits