* reserve space for codepoints
* improve handling of the appended 0
* use precomputed token text for grammar sampling
* reserve candidates_decoded
* reserve candidates_grammar
* remove candidates_decoded
* Revert "remove candidates_decoded"
This reverts commit 3773328080.
* changed decode_utf8 to take src by ref
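Taken together, the commits above remove per-token allocations from the grammar sampling hot path. Below is a minimal sketch of the pattern; the types are simplified stand-ins, and the real decode_utf8 in llama.cpp also carries partial-UTF-8 state and handles multi-byte sequences, both elided here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for the real llama.cpp type.
struct llama_grammar_candidate {
    size_t           index;
    const uint32_t * code_points;
};

// Taking src by const reference (instead of by value) avoids copying the
// token text on every call -- one of the changes in this series.
static std::vector<uint32_t> decode_utf8(const std::string & src) {
    std::vector<uint32_t> code_points;
    code_points.reserve(src.size() + 1); // reserve space for the codepoints
    for (unsigned char c : src) {        // (real decoder handles multi-byte sequences)
        code_points.push_back(c);
    }
    code_points.push_back(0);            // the appended 0 terminator
    return code_points;
}

void sample_grammar_sketch(const std::vector<std::string> & token_texts) {
    std::vector<std::vector<uint32_t>>   candidates_decoded;
    std::vector<llama_grammar_candidate> candidates_grammar;

    // Reserving up front avoids reallocation while the lists grow below.
    candidates_decoded.reserve(token_texts.size());
    candidates_grammar.reserve(token_texts.size());

    for (size_t i = 0; i < token_texts.size(); ++i) {
        candidates_decoded.push_back(decode_utf8(token_texts[i]));
        candidates_grammar.push_back({ i, candidates_decoded.back().data() });
    }
}
```

Reserving candidates_decoded is a correctness matter as much as a speed one: candidates_grammar stores pointers into its elements, and a mid-loop reallocation would invalidate them.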
* feat: Allow overriding GGUF metadata when loading model
* Fix the one time GCC is stricter than clang about something
* Step 1
* Refactor... basically everything!
* Nuke obsolete GetArrayLen struct
* simplify std::string specialization
* Various cleanups
Add informational output when overrides are applied
Warn user when an override with the wrong type is specified
* Fix broken logic for parsing bool KV overrides
Fix issue where overrides didn't apply when key missing in GGUF metadata
Resolve merge changes
* llama : rearrange model params
* Update new GET_KEY call
Add note that metadata KV overrides aren't reflected in initial metadata KV info dump
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
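For context, the feature above drives overrides through an array of key/type/value records consulted at model-load time. The sketch below is a hedged illustration; the struct, enum, and field names are simplified assumptions based on the commit messages, not copied from llama.h.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative shape of a metadata override record.
enum kv_override_type { KV_OVERRIDE_INT, KV_OVERRIDE_FLOAT, KV_OVERRIDE_BOOL };

struct kv_override {
    char             key[128]; // GGUF metadata key to replace
    kv_override_type tag;      // expected value type
    union {
        int64_t int_value;
        double  float_value;
        bool    bool_value;
    };
};

int main() {
    std::vector<kv_override> overrides;

    kv_override o = {};
    std::strncpy(o.key, "tokenizer.ggml.add_bos_token", sizeof(o.key) - 1);
    o.tag        = KV_OVERRIDE_BOOL;
    o.bool_value = false;
    overrides.push_back(o);

    // A loader given overrides.data() would consult this list before
    // falling back to the value stored in the GGUF file itself.
    return 0;
}
```

Per the commits above, an override whose type does not match the key produces a warning, applied overrides are reported in the informational output, and the overridden values are not reflected in the initial metadata KV info dump.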
* metal : implement soft_max_ext
* cuda : implement soft_max_ext
* ggml : implement soft_max_ext (CPU)
* batched-bench : print threads
ggml-ci
* metal : simplify soft_max encoding
ggml-ci
* cuda : use 512 threads for soft_max instead of 32
* ggml : update soft max cpu
* cuda : do warp-based block reduce
* cuda : increase max block size to 1024
* cuda : fix warp reduction initialization of shared mem
* metal : warp-based reduction for soft max kernel
* metal : warp-based reduce for rms_norm
* metal : simplify soft max kernel
ggml-ci
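soft_max_ext extends the plain softmax by fusing in the attention scale and an additive mask, removing the separate scale/add nodes that previously preceded the softmax in the graph. A scalar CPU sketch of the per-row computation follows; the real CPU, CUDA, and Metal kernels vectorize this and, as the commits above note, perform the max and sum reductions warp- or SIMD-group-cooperatively.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One row of soft_max_ext: y = softmax(x*scale + mask).
// Assumes x and mask have the same length.
std::vector<float> soft_max_ext_row(const std::vector<float> & x,
                                    const std::vector<float> & mask,
                                    float scale) {
    std::vector<float> y(x.size());

    // max-reduction for numerical stability
    float vmax = -INFINITY;
    for (size_t i = 0; i < x.size(); ++i) {
        vmax = std::max(vmax, x[i]*scale + mask[i]);
    }

    // exponentiate and sum-reduce
    float sum = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i]*scale + mask[i] - vmax);
        sum += y[i];
    }

    // normalize
    for (float & v : y) {
        v /= sum;
    }
    return y;
}
```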
* alloc : fix build with debug
* cmake : fix joining of REAL_GIT_DIR
* fix includes with help from include-what-you-use
* make : remove unneeded deps and add test-rope target
* fix C includes in C++ source files
* Revert "fix includes with help from include-what-you-use"
This reverts commit 635e9fadfd.
* llama: fix alignment of general.name in print meta
This commit fixes the alignment of the general.name field in the
llm_load_print_meta function.
Currently the output looks like this:
```console
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name   = LLaMA v2
```
And with this commit it looks like this:
```console
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name     = LLaMA v2
```
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* llama: fix alignment of special tokens
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
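Both alignment fixes come down to keeping the `=` column consistent across llm_load_print_meta lines. A minimal illustration of the idea; the width specifier and field width are this sketch's choice, not necessarily the mechanism used in llama.cpp's format strings.

```cpp
#include <cstdio>

// Pad the field name so every '=' lands in the same column.
// The width of 16 is arbitrary for this sketch.
void print_meta_field(const char * name, const char * value) {
    std::printf("llm_load_print_meta: %-16s = %s\n", name, value);
}
```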
Typical sampling was broken: after copying new_candidates into candidates, the "sorted" flag was left set to true, but the new data is no longer sorted by probability. The patch sets "sorted" to false (see the sketch below).
Test: generating with temp=0.0001 (approximately argmax) should produce the same sequence at typical >= 1.0 and at typical = 0.9999 (approximately disabled, but it still enters the typical sampling code path).
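A sketch of that fix, using simplified stand-ins for the llama.cpp sampling types (the field names mirror llama_token_data and llama_token_data_array; the surrounding code is illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct llama_token_data { int id; float logit; float p; };

struct llama_token_data_array {
    llama_token_data * data;
    size_t             size;
    bool               sorted; // cached: true iff data is sorted by descending p
};

void apply_typical_filter(llama_token_data_array * candidates,
                          const std::vector<llama_token_data> & new_candidates) {
    // new_candidates is a filtered subset of the original array, so it fits.
    std::copy(new_candidates.begin(), new_candidates.end(), candidates->data);
    candidates->size = new_candidates.size();

    // The fix: new_candidates is ordered by typicality, not probability,
    // so the cached flag must be cleared or a later sampler will skip a
    // re-sort it actually needs.
    candidates->sorted = false;
}
```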
* ggml : use blas even if src0 is not F32
* llama : use n_threads_batch only when n_tokens >= 32
ggml-ci
* llama : revert n_threads_batch logic
ggml-ci
* llama : keep track of used KV cells + better KV cache management
* llama : zero KV cache used upon clear
ggml-ci
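A rough sketch of the bookkeeping these commits describe; every name here is illustrative rather than verbatim from llama.cpp.

```cpp
#include <cstdint>
#include <vector>

struct kv_cell {
    int32_t pos = -1;             // -1 marks an empty cell
    std::vector<int32_t> seq_ids; // sequences referencing this cell
};

struct kv_cache {
    std::vector<kv_cell> cells;
    uint32_t used = 0;            // count of non-empty cells, so scans and
                                  // slot searches can stop early

    void clear() {
        for (auto & c : cells) {
            c.pos = -1;
            c.seq_ids.clear();
        }
        used = 0;                 // the follow-up fix: reset the counter on clear
    }
};
```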
* llama : allow exporting a view of the KV cache (#4180)
* Allow exporting a view of the KV cache
* Allow dumping the sequences per cell in common
* Track max contiguous cells value and position as well
* Fix max contiguous empty cells index calculation
Make dump functions better handle lengths or sequence counts > 10
* Fix off-by-one error in dump_kv_cache_view
* Add doc comments for KV cache view functions
Eliminate cell sequence struct; use llama_seq_id directly
Minor cleanups
* common : add -dkvc arg for enabling kv cache dumps
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
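A hedged usage sketch of the view API from #4180. The function names follow the commit descriptions, but treat the exact signatures as assumptions and consult llama.h for the authoritative declarations.

```cpp
#include "llama.h"

void inspect_cache(llama_context * ctx) {
    // n_max_seq bounds how many sequence ids are recorded per cell.
    llama_kv_cache_view view = llama_kv_cache_view_init(ctx, /*n_max_seq =*/ 4);

    llama_kv_cache_view_update(ctx, &view); // refresh the snapshot

    // The view exposes aggregate counts (used cells, the largest contiguous
    // free run and its position) plus per-cell sequence ids -- the data the
    // -dkvc dump helpers in common print.

    llama_kv_cache_view_free(&view);
}
```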
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.
* Respect add_bos_token GGUF metadata value
* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time