llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-15 07:19:53 +00:00

Author	SHA1	Message	Date
Justine Tunney	436787f170	llama : fix time complexity of string replacement (#9163 ) This change fixes a bug where replacing text in a very long string could cause llama.cpp to hang indefinitely. This is because the algorithm used was quadratic, due to memmove() when s.replace() is called in a loop. It seems most search results and LLM responses actually provide the O(n**2) algorithm, which is a great tragedy. Using a builder string fixes things	2024-08-26 09:09:53 +03:00
Johannes Gäßler	f91fc5639b	CUDA: fix Gemma 2 numerical issues for FA (#9166 )	2024-08-25 22:11:48 +02:00
Johannes Gäßler	e11bd856d5	CPU/CUDA: Gemma 2 FlashAttention support (#8542 ) * CPU/CUDA: Gemma 2 FlashAttention support * apply logit_softcap to scale in kernel * disable logit softcapping tests on Metal * remove metal check	2024-08-24 21:34:59 +02:00
piDack	a07c32ea54	llama : use F32 precision in GLM4 attention and no FA (#9130 )	2024-08-23 10:27:17 +03:00
compilade	a1631e53f6	llama : simplify Mamba with advanced batch splits (#8526 ) * llama : advanced batch splits This includes equal-sequence-length batch splits which are useful to simplify recurrent model operators. * llama : always make recurrent state slots contiguous * ggml : simplify mamba operators * llama : fix integer signedness mixing * llama : logits_all has priority over batch->logits Otherwise, the server embeddings tests failed. This was likely an existing problem but was only detected here because of an additional assertion. * llama : apply suggestions Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * llama : fix t5 segfault * llama : fix Mamba session save and restore * llama : minor cosmetic changes * llama : rename llama_reorder_outputs to llama_output_reorder Also move it closer to llama_output_reserve. * llama : fix pooled embeddings when using batches with equal_seqs * minor : add struct members for clarity ggml-ci * llama : fix T5 segfault again * llama : fix Mamba pooled embeddings with multiple sequences Until the pooled embeddings are refactored to allow splitting across ubatches for causal embeddings, recurrent models can only process a single sequence per ubatch when calculating pooled embeddings. * llama : add llama_model_is_recurrent to simplify figuring that out This will make it easier to more cleanly support RWKV-v6 and Mamba-2. * llama : fix simple splits when the batch contains embeddings --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-21 17:58:11 -04:00
Younes Belkada	b40eb84895	llama : support for `falcon-mamba` architecture (#9074 ) * feat: initial support for llama.cpp * fix: lint * refactor: better refactor * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * fix: address comments * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * fix: add more cleanup and harmonization * fix: lint * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * fix: change name * Apply suggestions from code review Co-authored-by: compilade <git@compilade.net> * add in operator * fix: add `dt_b_c_rms` in `llm_load_print_meta` * fix: correct printf format for bool * fix: correct print format * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * llama : quantize more Mamba tensors * llama : use f16 as the fallback of fallback quant types --------- Co-authored-by: compilade <git@compilade.net>	2024-08-21 11:06:36 +03:00
Daniel Bevenius	8455340b87	llama : std::move llm_bigram_bpe from work_queue (#9062 ) * llama : std::move llm_bigram_bpe from work_queue This commit updates the retrieval of llm_bigram_bpe objects from work_queue.top() by using std::move. The motivation for this is to avoid the copying of the std::string `text` member of the llm_bigram_bpe struct. * squash! llama : std::move llm_bigram_bpe from work_queue Introduced a MovablePriorityQueue class to allow moving elements out of the priority queue for llm_bigram_bpe. * squash! llama : std::move llm_bigram_bpe from work_queue Rename MovablePriorityQueue to lama_priority_queue. * squash! llama : std::move llm_bigram_bpe from work_queue Rename lama_priority_queue -> llama_priority_queue.	2024-08-21 10:32:58 +03:00
Yoshi Suhara	2fb9267887	Fix incorrect use of ctx_split for bias tensors (#9063 )	2024-08-17 15:34:21 +02:00
Minsoo Cheong	c679e0cb5c	llama : add EXAONE model support (#9025 ) * add exaone model support * add chat template * fix whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add ftype * add exaone pre-tokenizer in `llama-vocab.cpp` Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com> * fix lint Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com> * add `EXAONE` to supported models in `README.md` * fix space Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <113953597+compilade@users.noreply.github.com> Co-authored-by: compilade <git@compilade.net>	2024-08-16 09:35:18 +03:00
Yoshi Suhara	2a24c8caa6	Add Nemotron/Minitron GGUF Conversion & Inference Support (#8922 ) * Add nemotron GGUF conversion & inference support * Fix formatting issues * Remove unnecessary write_tensors() * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * Address comments by @compilade * Replace ggml_mul_mat()->llm_build_lora_mm() * Remove mutable variable * Use for bias tensors * Cover corner case for role_scaling not in config.json --------- Co-authored-by: compilade <git@compilade.net>	2024-08-16 04:23:33 +02:00
Zhenwei Jin	4af8420afb	common : remove duplicate function llama_should_add_bos_token (#8778 )	2024-08-15 10:23:23 +03:00
Esko Toivonen	6bda7ce6c3	llama : add pre-tokenizer regexes for BLOOM and gpt3-finnish (#8850 )	2024-08-15 10:17:12 +03:00
Nico Bosshard	0fd93cdef5	llama : model-based max number of graph nodes calculation (#8970 ) * llama : model-based max number of graph nodes calculation * Update src/llama.cpp --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-08-12 17:13:59 +02:00
Liu Jia	2589292cde	Fix a spelling mistake (#9001 )	2024-08-12 11:46:03 +02:00
fairydreaming	33309f661a	llama : check all graph nodes when searching for result_embd_pooled (#8956 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-08-11 10:35:26 +02:00
Xuan Son Nguyen	7eb23840ed	llama : default n_swa for phi-3 (#8931 ) * default n_swa for phi-3 * fix * double check swa	2024-08-10 13:04:40 +02:00
fairydreaming	7c3f55c100	Add support for encoder-only T5 models (#8900 ) * gguf-py : add T5ENCODER model architecture * common : call llama_decode() during warmup only if the model has decoder * convert-hf : add T5EncoderModel * llama : add llama_model_has_decoder() API function * llama : split build_t5() into build_t5_encoder() and build_t5_decoder() * llama : add support for LLM_ARCH_T5ENCODER * llama-embedding : add support for LLAMA_POOLING_TYPE_NONE * llama-embedding : add support for encoder-only models --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-08-10 11:43:26 +02:00
fairydreaming	6afd1a99dc	llama : add support for lora adapters in T5 model (#8938 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-08-09 18:53:09 +02:00
Georgi Gerganov	45a55b91aa	llama : better replace_all (cont) (#8926 ) * llama : better replace_all (cont) ggml-ci * code : deduplicate replace_all ggml-ci	2024-08-09 18:23:52 +03:00
Daniel Bevenius	6f6496bb09	llama : fix typo in llama_tensor_get_type comment [no ci] (#8937 )	2024-08-09 09:32:23 +03:00
compilade	345a686d82	llama : reduce useless copies when saving session (#8916 ) * llama : avoid useless copies in dummy session writer * llama : avoid double tensor copy when saving session to buffer	2024-08-08 23:54:00 -04:00
Douglas Hanley	cdd1889de6	convert : add support for XLMRoberta embedding models (#8658 ) * add conversion for bge-m3; small fix in unigram tokenizer * clean up and simplify XLMRoberta conversion	2024-08-06 10:20:54 +03:00
fairydreaming	d3f0c7166a	Stop the generation when <\|eom_id\|> token is encountered - needed for Llama 3.1 tool call support (#8858 ) * gguf-py, llama : add constants and methods related to Llama-3.1 <\|eom_id\|> token * llama : find Llama-3.1 <\|eom_id\|> token id during vocab loading * llama-vocab : add Llama-3.1 <\|eom_id\|> token to the set of tokens stopping the generation --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-08-05 09:38:01 +02:00
Georgi Gerganov	f1ea5146d7	llama : better replace_all (#8852 )	2024-08-05 08:53:39 +03:00
pculliton	398ede5efe	Adding Gemma 2 2B configs (#8784 ) * Adding Gemma 2 2B configs Updates to Q scaling and Gemma 2 model sizes to match v2 2B model. * Update src/llama.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-07-31 17:12:10 +02:00
compilade	4c676c85e5	llama : refactor session file management (#8699 ) * llama : refactor session file management * llama : saving and restoring state checks for overflow The size of the buffers should now be given to the functions working with them, otherwise a truncated file could cause out of bound reads. * llama : stream from session file instead of copying into a big buffer Loading session files should no longer cause a memory usage spike. * llama : llama_state_get_size returns the actual size instead of max This is a breaking change, but makes that function much easier to keep up to date, and it also makes it reflect the behavior of llama_state_seq_get_size. * llama : share code between whole and seq_id-specific state saving Both session file types now use a more similar format. * llama : no longer store all hparams in session files Instead, the model arch name is stored. The layer count and the embedding dimensions of the KV cache are still verified when loading. Storing all the hparams is not necessary. * llama : fix uint64_t format type * llama : various integer type cast and format string fixes Some platforms use "%lu" and others "%llu" for uint64_t. Not sure how to handle that, so casting to size_t when displaying errors. * llama : remove _context suffix for llama_data_context * llama : fix session file loading llama_state_get_size cannot be used to get the max size anymore. * llama : more graceful error handling of invalid session files * llama : remove LLAMA_MAX_RNG_STATE It's no longer necessary to limit the size of the RNG state, because the max size of session files is not estimated anymore. * llama : cast seq_id in comparison with unsigned n_seq_max	2024-07-28 00:42:05 -04:00
Jeffrey Morgan	b5e95468b1	llama : add support for llama 3.1 rope scaling factors (#8676 ) * Add llama 3.1 rope scaling factors to llama conversion and inference This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope oepration, improving results for context windows above 8192 * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * address comments * address comments * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: compilade <git@compilade.net>	2024-07-27 15:03:45 +03:00
Georgi Gerganov	92090eca21	llama : add function for model-based max number of graph nodes (#8622 ) * llama : model-based max number of graph nodes ggml-ci * llama : disable 405B max_nodes path due to lack of complaints ggml-ci	2024-07-27 14:59:29 +03:00
slaren	2b1f616b20	ggml : reduce hash table reset cost (#8698 ) * ggml : reduce hash table reset cost * fix unreachable code warnings after GGML_ASSERT(false) * GGML_ASSERT(false) -> GGML_ABORT("fatal error") * GGML_ABORT use format string	2024-07-27 04:41:55 +02:00
Judd	01245f5b16	llama : fix order of parameters (#8706 ) usage of `aclrtGetMemInfo` is correct: https://www.hiascend.com/doc_center/source/zh/canncommercial/63RC2/inferapplicationdev/aclcppdevg/aclcppdevg_03_0103.html Co-authored-by: Judd <foldl@boxvest.com>	2024-07-26 11:38:12 +03:00
Georgi Gerganov	4226a8d10e	llama : fix build + fix fabs compile warnings (#8683 ) ggml-ci	2024-07-25 19:57:31 +03:00
Chen Xi	ed67bcb24f	[SYCL] fix multi-gpu issue on sycl (#8554 ) --------- Signed-off-by: Chen Xi <xi2chen@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>	2024-07-25 19:45:18 +08:00
Georgi Gerganov	eddcb5238b	ggml : add and use ggml_cpu_has_llamafile() (#8664 )	2024-07-25 12:37:42 +03:00
Fan Shupei	8a4bad50a8	llama: use sliding window for phi3 (#8627 ) * use sliding window for phi3 * fix typo, "data_swa" -> "data" * [conver_hf_to_gguf.py] add phi3 sliding window	2024-07-25 10:21:09 +03:00
Xuan Son Nguyen	b115105f05	add llama_lora_adapter_clear (#8653 )	2024-07-24 11:25:19 +02:00
Georgi Gerganov	938943cdbf	llama : move vocab, grammar and sampling into separate files (#8508 ) * llama : move sampling code into llama-sampling ggml-ci * llama : move grammar code into llama-grammar ggml-ci * cont ggml-ci * cont : pre-fetch rules * cont ggml-ci * llama : deprecate llama_sample_grammar * llama : move tokenizers into llama-vocab ggml-ci * make : update llama.cpp deps [no ci] * llama : redirect external API to internal APIs ggml-ci * llama : suffix the internal APIs with "_impl" ggml-ci * llama : clean-up	2024-07-23 13:10:17 +03:00
Keke Han	081fe431aa	llama : fix codeshell support (#8599 ) * llama : fix codeshell support * llama : move codeshell after smollm below to respect the enum order	2024-07-22 19:43:43 +03:00
Jason Stillerman	d94c6e0ccb	llama : add support for SmolLm pre-tokenizer (#8609 ) * Adding SmolLM Pre Tokenizer * Update convert_hf_to_gguf_update.py Co-authored-by: compilade <git@compilade.net> * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * handle regex * removed .inp and out .out ggufs --------- Co-authored-by: compilade <git@compilade.net>	2024-07-22 17:43:01 +03:00
Georgi Gerganov	6f11a83e4e	llama : allow overrides for tokenizer flags (#8614 ) ggml-ci	2024-07-22 13:33:22 +03:00
Douglas Hanley	50e05353e8	llama : add Mistral Nemo inference support (#8604 )	2024-07-22 11:06:17 +03:00
Michael Coppola	940362224d	llama : add support for Tekken pre-tokenizer (#8579 ) * llama : Added support for Tekken pre-tokenizer (#8577) Removed uneeded `vocab.tokenizer_clean_spaces` assignment * llama : fix order of pre-tokenizers * * Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces * Updated chkhsh for Tekken tokenizer --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-20 16:43:51 +03:00
Georgi Gerganov	d197545530	llama : bump max layers from 256 to 512 (#8530 ) * llama : bump max layers from 256 to 512 * llama : replace asserts with exceptions	2024-07-19 16:50:47 +03:00
Frank Mai	f299aa98ec	fix: typo of chatglm4 chat tmpl (#8586 ) Signed-off-by: thxCode <thxcode0824@gmail.com>	2024-07-19 11:44:41 +02:00
hipudding	1bdd8ae19f	[CANN] Add Ascend NPU backend (#6035 ) * [CANN] Add Ascend NPU backend Ascend is a full-stack AI computing infrastructure for industry applications and services based on Huawei Ascend processors and software. CANN (Compute Architecture of Neural Networks), developped by Huawei, is a heterogeneous computing architecture for AI. Co-authored-by: wangshuai09 <391746016@qq.com> * delete trailing whitespaces * Modify the code based on review comment * Rename LLAMA_CANN to GGML_CANN * Make ggml-common.h private * add ggml_cann prefix for acl funcs * Add logging for CANN backend * Delete Trailing whitespace --------- Co-authored-by: wangshuai09 <391746016@qq.com>	2024-07-17 14:23:50 +03:00
Georgi Gerganov	d65a8361fe	llama : disable context-shift for DeepSeek v2 (#8501 )	2024-07-17 10:32:59 +03:00
Georgi Gerganov	0efec57787	llama : valign + remove unused ftype (#8502 )	2024-07-16 10:00:30 +03:00
Xuan Son Nguyen	97bdd26eee	Refactor lora adapter support (#8332 ) * lora: load to devide buft * add patch tensor function * correct tensor patch * llama_lora_adapter_apply * correct ggml_backend_tensor_copy * add llm_build_mm * fix auto merge * update based on review comments * add convert script * no more transpose A * add f16 convert * add metadata check * add sanity check * fix ftype * add requirements * fix requirements * fix outfile * conversion: only allow selected models * fix types * cuda : do not use dmmv if the tensor does not have enough cols * llama : lora fixes * do not disable mmap with lora Co-authored-by: slaren <slarengh@gmail.com> * llm_build_lora_mm_id * convert_lora : MoE LoRA conversion support * convert_lora : prefer safetensors, similarly to convert_hf * convert_hf : simplify modify_tensors for InternLM2 * convert_lora : lazy conversion * llama : load and use alpha from LoRA adapters * llama : use llm_build_lora_mm in most model graphs * auto scale * Revert "auto scale" This reverts commit `42415a4874`. * remove redundant params * Apply suggestions from code review Co-authored-by: slaren <slarengh@gmail.com> * change kv metadata * move add_type to __init__ * convert_hf : move add_type to main() * convert_lora : use the GGUFWriter from Model instead of overwriting it --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Francis Couture-Harpin <git@compilade.net>	2024-07-15 20:50:47 +02:00
Georgi Gerganov	3dfda05956	llama : de-duplicate deepseek2 norm	2024-07-15 14:10:39 +03:00
Georgi Gerganov	73cf442e7b	llama : fix Gemma-2 Query scaling factors (#8473 ) * 9B - query_pre_attn_scalar = 256 not 224 See `03e657582d` Gemma 9b should use 256 and not 224 (self.config.hidden_size // self.config.num_attention_heads) * llama : fix Gemma-2 Query scaling factor ggml-ci --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2024-07-14 14:05:09 +03:00
compilade	fa79495bb4	llama : fix pre-tokenization of non-special added tokens (#8228 ) * llama : fix mpt and olmo pre-tokenizer * llama : pre-tokenize non-special user-defined tokens first * llama : fix detection of control-like user-defined tokens * convert_hf : identify which user-defined tokens are control tokens Only used in _set_vocab_gpt2() for now. * convert_hf : identify more added control tokens for SPM tokenziers This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly, including HTML tags and consecutive spaces, but it unfortunately requires model re-conversion. There seems to be a weird behavior of the HF tokenizer for Gemma, which prefers to use the 16-space token over more lengthy space tokens, while using the SentencePiece tokenizer does not do this. (the implementation in llama.cpp has the same behavior as SentencePiece) * llama : fix wrong pre-tokenization of byte tokens * llama : fix Viking pre-tokenizer regex The order was previously wrong, which caused errors in some tests. * llama : fix command-r detokenization * convert_hf : reduce usages of the UNKNOWN token type * llama : add UNKNOWN tokens in the special tokens cache * convert_hf : reduce usages of UNKNOWN for InternLM2 This makes the changes from #8321 more consistent with the other changes made here. * test-tokenizer-random : reduce potential confilcts with #8379 * test-tokenizer-random : add a failing edge case for falcon	2024-07-13 23:35:10 -04:00
Daniel Bevenius	f53226245f	llama : suppress unary minus operator warning (#8448 ) This commit updates the _try_copy lambda and moves the unary minus operator to after the cast to int32_t. The motivation for this that currently the following warning is generated on windows: ```console llama.cpp\src\llama.cpp(21147,30): warning C4146: unary minus operator applied to unsigned type, result still unsigned ```	2024-07-12 12:05:21 +03:00
Chen Xi	b549a1bbef	[SYCL] fix the mul_mat_id ut issues (#8427 ) * fix part of mul_mat_id * skip the bfloat 16 sycl ut Signed-off-by: Chen Xi <xi2chen@intel.com> --------- Signed-off-by: Chen Xi <xi2chen@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> Co-authored-by: Chen Xi <xi2chen@intel.com>	2024-07-12 08:52:04 +08:00
Georgi Gerganov	7a221b672e	llama : use F32 precision in Qwen2 attention and no FA (#8412 )	2024-07-11 10:21:30 +03:00
Dibakar Gope	0f1a39f343	ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (#5780 ) * Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization * Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions * Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions * Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions * Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions * Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files * Arm AArch64: minor code refactoring for rebase * Arm AArch64: minor code refactoring for resolving a build issue with cmake * Arm AArch64: minor code refactoring to split the Q4_0_AARC64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8 * Arm AArch64: minor code change for resolving a build issue with server-windows * retrigger checks * Arm AArch64: minor code changes for rebase * Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits * Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig * Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8 * Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8 * Arm AArch64: minor code refactoring * Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat * Arm AArch64: minimize changes in ggml_compute_forward_mul_mat * Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types * Arm AArch64: minor code refactoring * Arm AArch64: minor code refactoring * Arm AArch64: minor code refactoring * rebase on the latest master commit `3fd62a6` and adapt to the new directory structure * Arm AArch64: remove a redundant comment * Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off * Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels * Arm AArch64: update docs/build.md README to include compile time flags for buiilding the Q4_0_4_4 quant type	2024-07-10 15:14:51 +03:00
Borislav Stanimirov	cc61948b1f	llama : C++20 compatibility for u8 strings (#8408 )	2024-07-10 14:45:44 +03:00
Borislav Stanimirov	7a80710d93	msvc : silence codecvt c++17 deprecation warnings (#8395 )	2024-07-10 14:40:53 +03:00
fairydreaming	a8be1e6f59	llama : add assert about missing llama_encode() call (#8400 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-07-10 14:38:58 +03:00
toyer	905942abdb	llama : support glm3 and glm4 (#8031 ) * add chatglm3-6b model support huggingface model: https://hf-mirror.com/THUDM/chatglm3-6b Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * remove .rotary_pos_emb.inv_freq and unuse code for chatglm3 model Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * fix lint error Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * optimize convert-hf-to-gguf.py for chatglm model Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * support glm-4-9b-chat Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * fix eos tokens to glm4 * remove unused log * add preprocess to chatglm3 and chatglm4 * add eos_id_list to llama.cpp * fix code style * fix code style * fix conflicts * fix conflicts * Revert "add eos_id_list to llama.cpp" This reverts commit `3a4d5790bf`. * set <\|endoftext\|> as eos and <\|user\|> as eot * fix chat template bug * add comment to glm prefix and suffix * fix conflicts and add rope_ratio & ChatGLMForConditionalGeneration * fix chat template bug * fix codestyle * fix conflicts * modified the general name of glm model * fix conflicts * remove prefix and suffix * use normal glm4 chattempalte & use LLM_FFN_SWIGLU in phi3 * fix: resolve Flake8 errors in `convert-hf-to-gguf.py` - Fix E302 by adding two blank lines before top-level function definitions - Replace print statements to fix NP100 - Fix E303 by ensuring only one blank line between lines of code * fix rope ratio to solve incorrect answers * fix by comments --------- Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> Co-authored-by: XingXing Qiao <qiaoxx@dingdao.com> Co-authored-by: Umpire2018 <138990495+Umpire2018@users.noreply.github.com>	2024-07-07 15:52:10 +03:00
Georgi Gerganov	b5040086d4	llama : fix n_rot default (#8348 ) ggml-ci	2024-07-07 14:59:02 +03:00
Daniel Bevenius	87e25a1d1b	llama : add early return for empty range (#8327 ) * llama : add early return for empty range This commit adds an early return to the llama_kv_cache_seq_add and llama_kv_cache_seq_div functions. The motivation for adding this is to avoid looping over the cache when the range is empty. I ran into this when using the self-extend feature in main.cpp. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * llama : add static_cast to fix CI warning/error This commit attempts to fix the following warning/error: ```console src/llama.cpp:7271:31: error: comparison of integer expressions of different signedness: ‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare] 7271 \| if (i < hparams.n_layer_dense_lead) { \| ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` This can be reproduced locally by setting -Wsign-compare in the Makefile. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * squash! llama : add early return for empty range Remove the setting of cache.head to 0 when the range is empty. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * Update src/llama.cpp --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-06 10:22:16 +03:00
jaime-m-p	213701b51a	Detokenizer fixes (#8039 ) * Add llama_detokenize(): - Update header files location - UNKNOWN and CONTROL are 'special pieces' - Remove space after UNKNOWN and CONTROL - Refactor llama_token_to_piece() - Add flag: clean_up_tokenization_spaces - Symmetric params for llama_tokenize() and llama_detokenize() * Update and fix tokenizer tests: - Using llama_detokenize() - Unexpected vocab type as test fail instead of error - Useful when automating tests: - If you don't know in advance the vocab type - Differenciate other loading errors - Skip unicode surrogaes and undefined - Gracefully exit threads - Using exit() is throwing random exceptions - Clean old known problematic codepoints - Minor: confusing hexadecimal codepoint * Update bruteforce random tests - Add detokenizer checks - New generator: ascii_lr_strip - New generator: apostrophe - Add more vocabs files - Detokenize special tokens. - Replace errors with '\uFFFD' when detokenizing to 'utf-8' - More edge cases - Better detokenization results check * Fix add_space_prefix, set false by default * Better leading space removal * Do not remove space when decoding special tokens * Bugfix: custom regexs splits undefined unicode codepoints * 'viking' detokenizer clean spaces	2024-07-05 19:01:35 +02:00
Georgi Gerganov	7ed03b8974	llama : fix compile warning (#8304 )	2024-07-05 17:32:09 +03:00
Georgi Gerganov	2cccbaa008	llama : minor indentation during tensor loading (#8304 ) * llama : minor indentation during tensor loading ggml-ci * llama : use int for layer iterators [no ci]	2024-07-05 10:15:24 +03:00
Douglas Hanley	d12f781074	llama : streamline embeddings from "non-embedding" models (#8087 )	2024-07-05 10:05:56 +03:00
Georgi Gerganov	aa5898dc53	llama : prefer n_ over num_ prefix (#8308 )	2024-07-05 09:10:03 +03:00
Icecream95	d7fd29fff1	llama : add OpenELM support (#7359 ) * Initial OpenELM support (270M only so far) * Fill out missing entries in llama_model_type_name * fixup! Initial OpenELM support (270M only so far) Fix formatting * llama : support all OpenELM models * llama : add variable GQA and variable FFN sizes Some metadata keys can now also be arrays to support setting their value per-layer for models like OpenELM. * llama : minor spacing changes Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * llama : use std::array for per-layer hparams * llama : fix save/load state * llama : do not print hparams for vocab-only models * llama : handle n_head == 0 * llama : use const ref for print_f and fix division by zero * llama : fix t5 uses of n_head and n_ff * llama : minor comment --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-04 20:14:21 +03:00
fairydreaming	807b0c49ff	Inference support for T5 and FLAN-T5 model families (#5763 ) * llama : add inference support and model types for T5 and FLAN-T5 model families * llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token() * common, llama-cli, llama-batched : add support for encoder-decoder models * convert-hf : handle shared token embeddings tensors in T5Model * convert-hf : add support for SentencePiece BPE tokenizer in T5Model (for Pile-T5 models) * convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model * convert : add t5 tokenizer tests, use "slow" HF tokenizer for t5 --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-04 15:46:11 +02:00
Daniel Bevenius	f8c4c0738d	tests : add _CRT_SECURE_NO_WARNINGS for WIN32 (#8231 ) This commit adds the compile definition `_CRT_SECURE_NO_WARNINGS` to the root cmake subproject. The motivation for this is that currently the following warnings are displayed when compiling the tests and common cmake subprojects: ```console test-llama-grammar.cpp C:\llama.cpp\src\.\llama.cpp(1406,77): warning C4996: 'strerror': This function or variable may be unsafe. Consider using strerror_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [C:\llama.cpp\build\tests\test-llama-grammar.vcxproj] ... ``` This compile definition is currently set for the `src` subproject and this change moves into the root cmake project so that it is applied to all cmake subprojects.	2024-07-04 13:53:42 +03:00
Daniel Bevenius	402d6feffa	llama : suppress unref var in Windows MSVC (#8150 ) * llama : suppress unref var in Windows MSVC This commit suppresses two warnings that are currently generated for src/llama.cpp when building on Windows MSVC ```console C:\llama.cpp\src\llama.cpp(14349,45): warning C4101: 'ex': unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj] C:\llama.cpp\src\llama.cpp(19285,44): warning C4101: 'e': unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj] ``` * Update src/llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-04 13:50:57 +03:00
Clint Herron	07a3fc0608	Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. (#8258 )	2024-07-02 12:18:10 -04:00
Faisal Zaghloul	968967376d	Add `JAIS` model(s) (#8118 ) * Add `JAIS` model(s) * cleanup * address review comments * remove hack * un-hardcode max-alibi-bias * minor tweaks --------- Co-authored-by: fmz <quic_fzaghlou@quic.com>	2024-07-02 16:36:00 +02:00
Xuan Son Nguyen	49122a873f	gemma2: add sliding window mask (#8227 ) * gemma2: add sliding window mask * fix data_swa uninitialized * better naming * add co-author Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com> * replace list with single tensor * update * llama : minor styling * convert : add sanity check for query_pre_attn_scalar * fix small typo in README --------- Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-01 18:48:34 +02:00
Andrei	1c5eba6f8e	llama: Add attention and final logit soft-capping, update scaling factor to Gemma2 (#8197 ) * Add attention and final logit softcapping. * fix * Add custom add_ functions * Disable flash attention for Gemma2 * Update src/llama.cpp Co-authored-by: slaren <slarengh@gmail.com> * Add default value for attention and final logit softcap value * Add custom kq scaling from Gemma2Attention * Remove custom pre attention scaling and use computed value instead. --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-29 23:44:08 -04:00
Xuan Son Nguyen	26a39bbd6b	Add MiniCPM, Deepseek V2 chat template + clean up `llama_chat_apply_template_internal` (#8172 ) * tmp_contains * minicpm chat template * add DeepSeek Lite template * change deepseek-lite to deepseek2 * correct code comment * correct code from master branch	2024-06-28 15:11:44 +02:00
pculliton	e57dc62057	llama: Add support for Gemma2ForCausalLM (#8156 ) * Inference support for Gemma 2 model family * Update convert-hf-to-gguf.py, constants, and tensor mappings * cleanup * format fix * Fix special token vocab bug * Don't add space prefix * fix deleted lines * Update src/llama.cpp Co-authored-by: slaren <slarengh@gmail.com> * Add model type names * Add control vector * Fix model type identification --------- Co-authored-by: Andrei Betlen <abetlen@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-06-27 21:00:43 -07:00
Sigbjørn Skjæret	6030c61281	Add Qwen2MoE 57B-A14B model identifier (#8158 ) * Add Qwen2MoE 57B-A14B * Add Qwen2MoE 57B-A14B	2024-06-27 16:27:41 +02:00
kustaaya	f675b20a3b	Added support for Viking pre-tokenizer (#8135 ) Co-authored-by: kustaaya <kustaaya@protonmail.com>	2024-06-27 10:58:54 +02:00
Sigbjørn Skjæret	911e35bb8b	llama : fix CodeLlama FIM token checks (#8144 ) * account for space prefix character * use find instead	2024-06-27 10:46:41 +03:00
Georgi Gerganov	f3f65429c4	llama : reorganize source code + improve CMake (#8006 ) * scripts : update sync [no ci] * files : relocate [no ci] * ci : disable kompute build [no ci] * cmake : fixes [no ci] * server : fix mingw build ggml-ci * cmake : minor [no ci] * cmake : link math library [no ci] * cmake : build normal ggml library (not object library) [no ci] * cmake : fix kompute build ggml-ci * make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE ggml-ci * move public backend headers to the public include directory (#8122) * move public backend headers to the public include directory * nix test * spm : fix metal header --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * scripts : fix sync paths [no ci] * scripts : sync ggml-blas.h [no ci] --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-26 18:33:02 +03:00
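
Note on commit 436787f170 (fix time complexity of string replacement): the message describes replacing a loop of in-place s.replace() calls, which is quadratic because each call may memmove() the tail of the string, with a builder string. The following is a minimal sketch of that idea, not the code from the commit itself; the function name `replace_all_linear` is hypothetical and chosen only for this illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper (name is an assumption, not taken from the repository).
// Replaces every occurrence of `search` in `s` with `replace`.
// Each character of `s` is copied into `builder` once, so the copying cost is
// linear in the input plus output size, instead of shifting the tail of `s`
// on every match as an in-place s.replace() loop would.
static void replace_all_linear(std::string & s, const std::string & search, const std::string & replace) {
    if (search.empty()) {
        return; // an empty pattern matches everywhere; treat it as a no-op
    }
    std::string builder;
    builder.reserve(s.length());
    std::string::size_type pos = 0;
    std::string::size_type last_pos = 0;
    while ((pos = s.find(search, last_pos)) != std::string::npos) {
        builder.append(s, last_pos, pos - last_pos); // unchanged text before the match
        builder.append(replace);                     // the replacement itself
        last_pos = pos + search.length();            // resume scanning after the match
    }
    builder.append(s, last_pos, std::string::npos);  // remaining tail after the last match
    s = std::move(builder);
}
```

Because scanning resumes after each match in the original string, replacement text is never re-scanned, so a replacement that itself contains the search pattern cannot cause an endless loop.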


129 Commits