llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-13 20:14:29 +00:00

Author	SHA1	Message	Date
Daniel Bevenius	2739a71e4b	convert : sort print supported models [no ci] (#11179 ) This commit sorts the list of supported models when printing them out. The motivation for this change is to make it easier to find a specific model in the list of supported models. For example: ```console $ ./convert_hf_to_gguf.py --print-supported-models Supported models: - ArcticForCausalLM - BaiChuanForCausalLM - BaichuanForCausalLM - BertForMaskedLM - BertModel - BitnetForCausalLM - BloomForCausalLM - BloomModel - CamembertModel - ChameleonForCausalLM - ChameleonForConditionalGeneration - ChatGLMForConditionalGeneration - ChatGLMModel - CodeShellForCausalLM - Cohere2ForCausalLM - CohereForCausalLM - DbrxForCausalLM - DeciLMForCausalLM - DeepseekForCausalLM - DeepseekV2ForCausalLM - DeepseekV3ForCausalLM - ExaoneForCausalLM - FalconForCausalLM - FalconMambaForCausalLM - GPT2LMHeadModel - GPTBigCodeForCausalLM - GPTNeoXForCausalLM - GPTRefactForCausalLM - Gemma2ForCausalLM - GemmaForCausalLM - GraniteForCausalLM - GraniteMoeForCausalLM - GrokForCausalLM - InternLM2ForCausalLM - JAISLMHeadModel - JinaBertForMaskedLM - JinaBertModel - LLaMAForCausalLM - LlamaForCausalLM - LlavaStableLMEpochForCausalLM - MPTForCausalLM - MT5ForConditionalGeneration - MambaForCausalLM - MambaLMHeadModel - MiniCPM3ForCausalLM - MiniCPMForCausalLM - MistralForCausalLM - MixtralForCausalLM - NemotronForCausalLM - NomicBertModel - OLMoForCausalLM - Olmo2ForCausalLM - OlmoForCausalLM - OlmoeForCausalLM - OpenELMForCausalLM - OrionForCausalLM - Phi3ForCausalLM - PhiForCausalLM - PhiMoEForCausalLM - PlamoForCausalLM - QWenLMHeadModel - Qwen2ForCausalLM - Qwen2MoeForCausalLM - Qwen2VLForConditionalGeneration - RWForCausalLM - RWKV6Qwen2ForCausalLM - RobertaModel - Rwkv6ForCausalLM - StableLMEpochForCausalLM - StableLmForCausalLM - Starcoder2ForCausalLM - T5EncoderModel - T5ForConditionalGeneration - T5WithLMHeadModel - UMT5ForConditionalGeneration - WavTokenizerDec - XLMRobertaForSequenceClassification - XLMRobertaModel - XverseForCausalLM ```	2025-01-11 05:50:33 +01:00
Daniel Bevenius	ff3fcabc72	convert : add --print-supported-models option (#11172 ) Some checks are pending Python check requirements.txt / check-requirements (push) Waiting to run Details flake8 Lint / Lint (push) Waiting to run Details Python Type-Check / pyright type-check (push) Waiting to run Details * convert : add --print-supported-models option This commit adds a new option to the convert_hf_to_gguf.py script to print the supported models. The motivation for this is that it can be useful to know which models are supported by the script without having to look at the code. Example usage: ```console $ ./convert_hf_to_gguf.py --print-supported-models Supported models: - GPTNeoXForCausalLM - BloomForCausalLM - BloomModel - MPTForCausalLM - OrionForCausalLM - BaichuanForCausalLM - BaiChuanForCausalLM - XverseForCausalLM - FalconForCausalLM - RWForCausalLM - GPTBigCodeForCausalLM - GPTRefactForCausalLM - StableLmForCausalLM - StableLMEpochForCausalLM - LlavaStableLMEpochForCausalLM - LLaMAForCausalLM - LlamaForCausalLM - MistralForCausalLM - MixtralForCausalLM - DeciLMForCausalLM - BitnetForCausalLM - GrokForCausalLM - DbrxForCausalLM - MiniCPMForCausalLM - MiniCPM3ForCausalLM - QWenLMHeadModel - Qwen2ForCausalLM - Qwen2VLForConditionalGeneration - WavTokenizerDec - Qwen2MoeForCausalLM - GPT2LMHeadModel - PhiForCausalLM - Phi3ForCausalLM - PhiMoEForCausalLM - PlamoForCausalLM - CodeShellForCausalLM - InternLM2ForCausalLM - BertModel - BertForMaskedLM - CamembertModel - RobertaModel - NomicBertModel - XLMRobertaModel - XLMRobertaForSequenceClassification - GemmaForCausalLM - Gemma2ForCausalLM - Starcoder2ForCausalLM - Rwkv6ForCausalLM - RWKV6Qwen2ForCausalLM - MambaForCausalLM - MambaLMHeadModel - FalconMambaForCausalLM - CohereForCausalLM - Cohere2ForCausalLM - OLMoForCausalLM - OlmoForCausalLM - Olmo2ForCausalLM - OlmoeForCausalLM - JinaBertModel - JinaBertForMaskedLM - OpenELMForCausalLM - ArcticForCausalLM - DeepseekForCausalLM - DeepseekV3ForCausalLM - DeepseekV2ForCausalLM - UMT5ForConditionalGeneration - MT5ForConditionalGeneration - T5ForConditionalGeneration - T5WithLMHeadModel - T5EncoderModel - JAISLMHeadModel - ChatGLMModel - ChatGLMForConditionalGeneration - NemotronForCausalLM - ExaoneForCausalLM - GraniteForCausalLM - GraniteMoeForCausalLM - ChameleonForCausalLM - ChameleonForConditionalGeneration ``` * squash! convert : add --print-supported-models option Fix flake8 error.	2025-01-10 11:30:53 +01:00
Molly Sophia	ee7136c6d1	llama: add support for QRWKV6 model architecture (#11001 ) Some checks are pending Python check requirements.txt / check-requirements (push) Waiting to run Details flake8 Lint / Lint (push) Waiting to run Details Python Type-Check / pyright type-check (push) Waiting to run Details llama: add support for QRWKV6 model architecture (#11001) * WIP: Add support for RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Some graph simplification Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add support for RWKV6Qwen2 with cpu and cuda GLA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix some typos Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix wkv test & add gla test Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix cuda warning Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update README.md Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update ggml/src/ggml-cuda/gla.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Fix fused lerp weights loading with RWKV6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * better sanity check skipping for QRWKV6 in llama-quant thanks @compilade Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: compilade <git@compilade.net> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <git@compilade.net>	2025-01-10 09:58:08 +08:00
Pierrick Hymbert	f8feb4b01a	model: Add support for PhiMoE arch (#11003 ) * model: support phimoe * python linter * doc: minor Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com> * doc: minor Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com> * doc: add phimoe as supported model ggml-ci --------- Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>	2025-01-09 11:21:41 +01:00
fairydreaming	9394bbd484	llama : Add support for DeepSeek V3 (#11049 ) * convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type * vocab : add DeepSeek V3 pre-tokenizer regexes * unicode : handle ACCENT_MARK and SYMBOL categories in regex * llama : add DeepSeek V3 chat template, handle new model parameters and tensor types --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2025-01-04 21:06:11 +01:00
DAN™	46be942214	llama : add support for the cohere2 model architecture (#10900 ) Some checks are pending Python check requirements.txt / check-requirements (push) Waiting to run Details flake8 Lint / Lint (push) Waiting to run Details Python Type-Check / pyright type-check (push) Waiting to run Details	2025-01-04 16:33:31 +02:00
ymcki	bc7b1f8632	convert : fix Llama-3_1-Nemotron-51B rope settings (#11008 ) * conflict resolution * move comments after bracket to its own line * DeciLMCausalModel now reads rope_theta from config.json properly	2024-12-31 13:04:48 +02:00
Yun Dou	b92a14a841	llama : support InfiniAI Megrez 3b (#10893 ) Some checks failed flake8 Lint / Lint (push) Waiting to run Details Python Type-Check / pyright type-check (push) Waiting to run Details Python check requirements.txt / check-requirements (push) Has been cancelled Details * Support InfiniAI Megrez 3b * Fix tokenizer_clean_spaces for megrez	2024-12-23 01:35:44 +01:00
ymcki	6f0c9e034b	llama : support for Llama-3_1-Nemotron-51B (#10669 ) * conflict resolution * move comments after bracket to its own line	2024-12-23 01:22:33 +01:00
Billel Mokeddem	7ae33a616f	llama : add Falcon3 support (#10883 ) * Add Falcon3 model support * Add fix for adding bos to added special tokens * Add comment explaining the logic behind the if statement * Add a log message to better track the when the following line of code is triggered * Update log to only print when input and output characters are different * Fix handling pre-normalized tokens * Refactoring	2024-12-23 00:09:58 +02:00
Georgi Gerganov	5cd85b5e00	convert : add BertForMaskedLM (#10919 ) Some checks failed Python check requirements.txt / check-requirements (push) Has been cancelled Details flake8 Lint / Lint (push) Has been cancelled Details Python Type-Check / pyright type-check (push) Has been cancelled Details	2024-12-21 10:10:18 +02:00
Molly Sophia	0a11f8b7b5	convert : fix RWKV v6 model conversion (#10913 ) * Enable --no-context-shift for llama-perplexity example Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV 6: Fix error in ggml_cuda_op_bin_bcast Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-12-20 11:44:58 +02:00
Sukriti Sharma	2fffc52b50	llama : fix Roberta embeddings (#10856 ) * fix: Use gpt2 tokenizer for roberta and add eos/bos tokens Branch: RobertaTokenizer Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fixes to position embeddings Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * map roberta-bpe to gpt-2 Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * fix linting Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>	2024-12-19 15:04:51 +02:00
fairydreaming	7585edbdeb	convert : Add support for Microsoft Phi-4 model (#10817 ) * convert : use GPT2 vocab for Phi-4 model * convert : use null value of sliding_window to distinguish Phi-4 from other PHI3-based models * llama : do not use sliding window attention mask for Phi-4 model --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-12-19 10:37:12 +01:00
Georgi Gerganov	0bf2d10c55	tts : add OuteTTS support (#10784 ) * server : add "tokens" output ggml-ci * server : output embeddings for all tokens when pooling = none ggml-ci * server : be explicit about the pooling type in the tests ggml-ci * server : do not normalize embeddings when there is no pooling ggml-ci * llama : add OuteTTS support (wip) * wip * extract features * first conv * group norm * resnet conv * resnet * attn * pos net * layer norm * convnext * head * hann window * fix n_embd + remove llama.cpp hacks * compute hann window * fft * spectrum processing * clean-up * tts : receive input text and generate codes * clip : fix new conv name * tts : minor fix * tts : add header + minor fixes ggml-ci * tts : add matchematical constant ggml-ci * tts : fix sampling + cut initial noise * tts : fixes * tts : update default samplers ggml-ci * tts : text pre-processing * tts : outetts-voc -> wavtokenizer-dec * tts : remove hardcoded constants ggml-ci * tts : fix tensor shapes * llama : refactor wavtokenizer tensors ggml-ci * cont ggml-ci * cont [no ci] * llama : update WavTokenizer to non-causal attn * llama : handle no-vocab detokenization * tts : add Python example for OuteTTS (wip) * tts : extend python example to generate spectrogram ggml-ci * server : fix rebase artifacts * tts : enable "return_tokens" in Python example ggml-ci * tts : minor fixes * common : support HF download for vocoder	2024-12-18 19:27:21 +02:00
Diego Devesa	4da69d1abd	Revert "llama : add Falcon3 support (#10864 )" (#10876 ) Some checks are pending Python check requirements.txt / check-requirements (push) Waiting to run Details flake8 Lint / Lint (push) Waiting to run Details Python Type-Check / pyright type-check (push) Waiting to run Details This reverts commit `382bc7f2e8`.	2024-12-18 01:36:46 +01:00
Billel Mokeddem	382bc7f2e8	llama : add Falcon3 support (#10864 )	2024-12-17 17:24:56 +02:00
Valentin Mamedov	a0974156f3	llama : add Deepseek MoE v1 & GigaChat models (#10827 ) * Add deepseek v1 arch & gigachat template * improve template code * add readme * delete comments * remove comment * fix format * lint llama.cpp * fix order of deepseek and deepseek2, move gigachat temlate to the end of func * fix order of deepseek and deepseek2 in constants; mark shared exp as deepseek arch need * remove comments * move deepseek above deepseek2 * change placement of gigachat chat template	2024-12-15 19:02:46 +02:00
HimariO	ba1cb19cdd	llama : add Qwen2VL support + multimodal RoPE (#10361 ) Some checks failed flake8 Lint / Lint (push) Waiting to run Details Python Type-Check / pyright type-check (push) Waiting to run Details Python check requirements.txt / check-requirements (push) Has been cancelled Details * Barebone Qwen2VL LLM convertor * Add Qwen2VL cli entrypoint * [WIP] add qwen2vl arch * Verify m-rope output * Add vl-rope/2d-rope support for qwen2vl ViT * update qwen2vl cli tool * update 5D tensor op workaround * [WIP] qwen2vl vision model * make batch and clip utils compatible with qwen2vl * [WIP] create inference workflow, gguf convert script but fix * correcting vision-rope behavior, add the missing last layer back to ViT * add arg parser to qwen2vl_surgery * replace variable size array with vector * cuda-gdb cmake preset * add fp32 mrope, vision rope kernel * add fp16 support for qwen2vl and m-rope * add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION` * fix rope op mode switching, out dated func args * update `llama_hparams` * update to keep up stream changes * resolve linter, test errors * add makefile entry, update speical image padding token * add mrope unit test, fix few compiler warnings * rename `mrope` related function, params * minor updates on debug util, bug fixs * add `m-rope` testcase to `test-backend-ops` * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix traililng whitespce * store `llama_hparams.rope_sections` with fixed size array * update position id tensor size check in GGML_OP_ROPE * minor updates * update `ggml_backend__supports_op` of unsupported backends remote old `rope_section` compare operator --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-14 14:43:46 +02:00
Robert Collins	62e84d9848	llama : add 128k yarn context for Qwen (#10698 ) Some checks failed flake8 Lint / Lint (push) Waiting to run Details Python Type-Check / pyright type-check (push) Waiting to run Details Python check requirements.txt / check-requirements (push) Has been cancelled Details * add 128k yarn context for Qwen * added property for model tensors * removing useless line	2024-12-07 23:12:27 +02:00
Sukriti Sharma	784a14aa49	convert : add support for Roberta embeddings (#10695 )	2024-12-07 09:02:14 +02:00
Riccardo Orlando	6fe6247831	llama : add Minerva 7B model support (#10673 ) * Support for Minerva 7B * Update convert_hf_to_gguf_update.py	2024-12-05 20:30:59 +02:00
JFLFY2255	8d0cfd554a	llama: Support MiniCPM-1B (with & w/o longrope) (#10559 )	2024-12-04 11:42:50 +02:00
Shane A	80acb7b430	Rename Olmo1124 to Olmo2 (#10500 )	2024-11-25 19:36:09 +01:00
Gabe Goodhart	9336db462c	convert : XLMRoberta Type Vocab Size (#10458 ) Some checks failed Nix CI / nix-eval (macos-latest) (push) Waiting to run Details Nix CI / nix-eval (ubuntu-latest) (push) Waiting to run Details Nix CI / nix-build (macos-latest) (push) Waiting to run Details Nix CI / nix-build (ubuntu-latest) (push) Waiting to run Details flake8 Lint / Lint (push) Waiting to run Details Python check requirements.txt / check-requirements (push) Has been cancelled Details Python Type-Check / pyright type-check (push) Has been cancelled Details This matches the key in common bert-based embedding models and may have a value other than 1 in it. Branch: XLMRobertaTypeVocabSize Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2024-11-24 11:02:34 +02:00
Shane A	a88ad007de	llama : add OLMo November 2024 support (#10394 ) Some checks failed Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full-cuda.Dockerfile platforms:linux/amd64 tag:full-cuda]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full-musa.Dockerfile platforms:linux/amd64 tag:full-musa]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full.Dockerfile platforms:linux/amd64,linux/arm64 tag:full]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-cuda.Dockerfile platforms:linux/amd64 tag:light-cuda]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-intel.Dockerfile platforms:linux/amd64 tag:light-intel]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-musa.Dockerfile platforms:linux/amd64 tag:light-musa]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli.Dockerfile platforms:linux/amd64,linux/arm64 tag:light]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-cuda.Dockerfile platforms:linux/amd64 tag:server-cuda]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-intel.Dockerfile platforms:linux/amd64 tag:server-intel]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-musa.Dockerfile platforms:linux/amd64 tag:server-musa]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server.Dockerfile platforms:linux/amd64,linux/arm64 tag:server]) (push) Waiting to run Details Nix CI / nix-eval (macos-latest) (push) Waiting to run Details Nix CI / nix-eval (ubuntu-latest) (push) Waiting to run Details Nix CI / nix-build (macos-latest) (push) Waiting to run Details Nix CI / nix-build (ubuntu-latest) (push) Waiting to run Details flake8 Lint / Lint (push) Waiting to run Details Python check requirements.txt / check-requirements (push) Has been cancelled Details Python Type-Check / pyright type-check (push) Has been cancelled Details * Add OLMo November 2024 constants * Add OLMo November 2024 converter * Add loading of OLMo November 2024 tensors and hyper parameters * Add building of OLMo November 2024 model	2024-11-19 11:04:08 +02:00
Faisal Zaghloul	60e17ce23c	Remove identical wte/etw logic for jais (#10203 )	2024-11-07 08:46:12 -08:00
Xuan Son Nguyen	7554aa4655	convert-lora : make `--base` optional (#10110 ) * convert-lora : make `--base` optional * lint * handle case where base_model_name_or_path is invalid * do not include metadata from base model * clarify unspecified --base * add small comment [no ci] * trigger ci	2024-11-02 12:53:17 +01:00
Georgi Gerganov	bc5ba007b2	server : check that the prompt fits in the slot's context (#10030 ) ggml-ci	2024-10-25 10:13:46 +03:00
Molly Sophia	11d47057a5	Rwkv chat template fix (#10001 ) * llama: remove useless template matching for rwkv-world Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * converter: Add comment about the hack for rwkv models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update src/llama.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-10-22 15:22:26 +02:00
Molly Sophia	4ff7fe1fb3	llama : add chat template for RWKV-World + fix EOT (#9968 ) * Add chat template for RWKV-World Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Fix the chat template not being used Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV v6: Set EOT token to ``\n\n`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * readme: add rwkv into supported model list Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-10-22 13:33:37 +03:00
compilade	1927378bcc	convert : refactor rope_freqs generation (#9396 ) * convert : refactor rope_freqs generation This should also fix vocab-only conversion for Phi-3. * convert : adapt MiniCPM3 to separate rope_freqs insertion MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid having to run its custom Python code which mixes tokenization in the same file as tool calls. gguf-py : add long and short RoPE factors to tensor mappings Empty, but the key names are used to populate the mappings.	2024-10-01 09:31:36 +03:00
nopperl	f99d3f8367	py : add model class for Chameleon conversion (#9683 )	2024-09-29 15:02:06 +03:00
Georgi Gerganov	f4d2b8846a	llama : add reranking support (#9510 ) * py : add XLMRobertaForSequenceClassification [no ci] * py : fix scalar-tensor conversion [no ci] * py : fix position embeddings chop [no ci] * llama : read new cls tensors [no ci] * llama : add classigication head (wip) [no ci] * llama : add "rank" pooling type ggml-ci * server : add rerank endpoint ggml-ci * llama : aboud ggml_repeat during classification * rerank : cleanup + comments * server : accept /rerank endpoint in addition to /v1/rerank [no ci] * embedding : parse special tokens * jina : support v1 reranker * vocab : minor style ggml-ci * server : initiate tests for later ggml-ci * server : add docs * llama : add comment [no ci] * llama : fix uninitialized tensors * ci : add rerank tests ggml-ci * add reranking test * change test data * Update examples/server/server.cpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * add `--reranking` argument * update server docs * llama : fix comment [no ci] ggml-ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-09-28 17:42:03 +03:00
nopperl	9a913110cf	llama : add support for Chameleon (#8543 ) * convert chameleon hf to gguf * add chameleon tokenizer tests * fix lint * implement chameleon graph * add swin norm param * return qk norm weights and biases to original format * implement swin norm * suppress image token output * rem tabs * add comment to conversion * fix ci * check for k norm separately * adapt to new lora implementation * fix layer input for swin norm * move swin_norm in gguf writer * add comment regarding special token regex in chameleon pre-tokenizer * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * fix punctuation regex in chameleon pre-tokenizer (@compilade) Co-authored-by: compilade <git@compilade.net> * fix lint * trigger ci --------- Co-authored-by: compilade <git@compilade.net>	2024-09-28 15:08:43 +03:00
Gabe Goodhart	3d6bf6919f	llama : add IBM Granite MoE architecture (#9438 ) * feat(gguf-py): Add granitemoe architecture This includes the addition of new tensor names for the new moe layers. These may not be correct at this point due to the need for the hack in gguf_writer.py to double-check the length of the shape for these layers. Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert_hf_to_gguf): Add GraniteMoeModel GraniteMoe has the same configuration deltas as Granite Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(granitemoe convert): Split the double-sized input layer into gate and up After a lot of staring and squinting, it's clear that the standard mixtral expert implementation is equivalent to the vectorized parallel experts in granite. The difference is that in granite, the w1 and w3 are concatenated into a single tensor "input_linear." Rather than reimplementing all of the math on the llama.cpp side, the much simpler route is to just split this tensor during conversion and follow the standard mixtral route. Branch: GraniteMoE Co-Authored-By: alex.brooks@ibm.com Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(granitemoe): Implement granitemoe GraniteMoE follows the mixtral architecture (once the input_linear layers are split into gate_exps/up_exps). The main delta is the addition of the same four multipliers used in Granite. Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * Typo fix in docstring Co-Authored-By: ggerganov@gmail.com Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(conversion): Simplify tensor name mapping in conversion Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert): Remove unused tensor name mappings Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert): Sanity check on merged FFN tensor sizes Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Allow "output" layer in granite moe architecture (convert and cpp) Branch: GraniteMoE Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(granite): Add missing 'output' tensor for Granite This is a fix for the previous `granite` architecture PR. Recent snapshots have included this (`lm_head.weights`) as part of the architecture Branch: GraniteMoE Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-25 10:06:52 +03:00
Gabe Goodhart	0d2ec43833	llama : support IBM Granite architecture (#9412 ) * feat(gguf-py): Add Granite model and params to gguf-py Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert_hf_to_gguf): Add registration and param setup for Granite Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(llama.cpp): Add config parsing for Granite multiplier params Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(llama.cpp): First pass at full port of granite deviations from llama Something is still not working right since the results are mostly terrible, but on occasion it's producing relevant results at this point, so _something_ is working. Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(llama.cpp): Determine granite language 3b instruct by vocab size Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert_hf_to_gguf): Use LlamaModel as base for GraniteModel The defaults in LlamaModel are needed for Granite as well Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(llama.cpp): Switch Granite param names to use _scale for consistency Other scalar multipliers are called _scale, so this provides a more consistent naming convention. Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale The transformers names with _multiplier will now be converted to the _scale equivalent during conversion. Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(llama.cpp): Use separate switch clause for granite in llm_load_hparams Branch: GraniteLM Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2024-09-17 09:44:58 +03:00
compilade	d54c21df7e	convert : identify missing model files (#9397 ) Some checks failed Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full-cuda.Dockerfile platforms:linux/amd64 tag:full-cuda]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full.Dockerfile platforms:linux/amd64,linux/arm64 tag:full]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-cuda.Dockerfile platforms:linux/amd64 tag:light-cuda]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-intel.Dockerfile platforms:linux/amd64 tag:light-intel]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli.Dockerfile platforms:linux/amd64,linux/arm64 tag:light]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-cuda.Dockerfile platforms:linux/amd64 tag:server-cuda]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-intel.Dockerfile platforms:linux/amd64 tag:server-intel]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server.Dockerfile platforms:linux/amd64,linux/arm64 tag:server]) (push) Waiting to run Details Nix CI / nix-eval (macos-latest) (push) Waiting to run Details Nix CI / nix-eval (ubuntu-latest) (push) Waiting to run Details Nix CI / nix-build (macos-latest) (push) Waiting to run Details Nix CI / nix-build (ubuntu-latest) (push) Waiting to run Details Python check requirements.txt / check-requirements (push) Waiting to run Details flake8 Lint / Lint (push) Waiting to run Details Python Type-Check / pyright type-check (push) Waiting to run Details Nix aarch64 builds / nix-build-aarch64 (push) Has been cancelled Details	2024-09-16 10:30:22 +03:00
Shane A	0aadac10c7	llama : support OLMoE (#9462 )	2024-09-16 09:47:37 +03:00
CarryFun	95ca85168b	llama : support MiniCPM3 (#9322 ) Co-authored-by: 范睿凯 <fanruikai@modelbest.cn>	2024-09-16 09:45:20 +03:00
Csaba Kecskemeti	3c7989fd29	py : add "LLaMAForCausalLM" conversion support (#9485 ) Some checks are pending Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full-cuda.Dockerfile platforms:linux/amd64 tag:full-cuda]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full.Dockerfile platforms:linux/amd64,linux/arm64 tag:full]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-cuda.Dockerfile platforms:linux/amd64 tag:light-cuda]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-intel.Dockerfile platforms:linux/amd64 tag:light-intel]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli.Dockerfile platforms:linux/amd64,linux/arm64 tag:light]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-cuda.Dockerfile platforms:linux/amd64 tag:server-cuda]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-intel.Dockerfile platforms:linux/amd64 tag:server-intel]) (push) Waiting to run Details Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server.Dockerfile platforms:linux/amd64,linux/arm64 tag:server]) (push) Waiting to run Details Nix CI / nix-eval (macos-latest) (push) Waiting to run Details Nix CI / nix-eval (ubuntu-latest) (push) Waiting to run Details Nix CI / nix-build (macos-latest) (push) Waiting to run Details Nix CI / nix-build (ubuntu-latest) (push) Waiting to run Details Python check requirements.txt / check-requirements (push) Waiting to run Details flake8 Lint / Lint (push) Waiting to run Details Python Type-Check / pyright type-check (push) Waiting to run Details Co-authored-by: Csaba Kecskemeti <csabakecskemeti@Csabas-Mac-Pro.local>	2024-09-15 10:48:25 +03:00
daminho	c837981bba	py : add Phi-1.5/Phi-2 tokenizer (#9361 ) * add phi2 tokenizer * add phi name to convert_hf_to_gguf_update.py * make tokenizer_pre consistent; llama.cpp work	2024-09-12 14:28:20 +03:00
Molly Sophia	39f852f440	py : add special tokens in hf_converter for RWKV v6 (#9428 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-12 14:25:16 +03:00
Molly Sophia	0b4ac75772	RWKV v6: Add time_mix_decay_w1/w2 in quant exclusion list (#9387 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-10 10:02:30 +03:00
compilade	9bc6db28d0	ggml-quants : ternary packing for TriLMs and BitNet b1.58 (#8151 ) * ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b * ggml-quants : faster 1.625 bpw AVX2 vec_dot Not using a lookup table anymore makes it match q4_0 speed. * gguf-py : fix formatting * llama : remove spaces on empty line * ggml-quants : subtract 1 when back in epi8 This makes the 1.625 bpw type go faster than q4_0. Still not the fastest. * ggml-quants : Q2_2 now faster than Q4_K on with AVX2 * ggml-quants : cleanup Q1_3 code formatting * ggml-quants : ARM NEON vec_dot for q2_2 and q1_3 * ggml-quants : use ceiling division when quantizing q1_3 * convert-hf : simplify BitNet pre-quantization This still results in the exact same tensor weights and scales, but it reveals some weirdness in the current algorithm. * convert-hf : allow converting the weird BitNet 1.3B Its FFN size is 5460 which is not convenient. The offending tensors are kept in F16, which makes the final model 5.01 bpw. * bitnet : replace 1.58b with b1.58, as in the paper * ggml-quants : fix build failure on Windows * ggml-quants : attempt to fix Arm 32-bit support * ggml : add some informative comments in q1_3 vec_dot * ggml : add TQ1_0 and TQ2_0 ternary quantization types * ggml : even faster TQ2_0 * ggml : also faster TQ1_0 Same optimization as for TQ2_0 by offsetting the sum instead of the weights. This makes TQ1_0 almost as fast as Q8_0 on AVX2. * ggml : fix build issues in certain environments * ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0 * ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat The compiler seems smart enough to use the same instruction even when using vget_high_s8 instead. * ggml : remove q1_3 and q2_2 No more 1.625 bpw and 2.000 bpw, now instead using 1.6875 bpw and 2.0625 bpw with TQ1_0 and TQ2_0, respectively. * llama : remove the separate scale tensors of BitNet b1.58 They won't be needed, since the remaining ternary quant types have built-in scales. * ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency * ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot Not yet tested on hardware which supports it, might not work or might not even compile. But also it might. It should make the performance better on recent ARM CPUs. * ggml-quants : remove comment about possible format change of TQ2_0 Making it slightly more convenient for AVX512 but less convenient for everything else is not worth the trouble. * gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0 * ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0 This does not change anything for ternary models, since their values should never end up being in halfway cases anyway. * convert : allow direct conversion to TQ1_0 and TQ2_0 The token embeddings and output tensors are kept in F16 to allow quantizing them to Q4_K and Q6_K with llama-quantize. * llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0 Q4_0 is not completely symmetric (so not lossless for ternary models), but it should be good enough. * ggml-quants : allow using ARM dot product instructions for TQ1_0 * ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support * ggml : remove unused ggml_mul special case It would otherwise conflict with the more general optimization coming with Mamba-2. * ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators * test-backend-ops : add TQ1_0 and TQ2_0 comments for later Not yet adding uncommented, because some backends like SYCL and Metal do not properly handle unknown types in supports_op for GGML_OP_MUL_MAT. (and Metal also doesn't handle it with GGML_OP_GET_ROWS) Support for TQ1_0 and TQ2_0 for other backends than CPU will be added in follow-up pull requests.	2024-09-05 21:48:47 -04:00
Molly Sophia	8f1d81a0b6	llama : support RWKV v6 models (#8980 ) * convert_hf_to_gguf: Add support for RWKV v6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add RWKV tokenization * Fix build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Do not use special tokens when matching in RWKV tokenizer * Fix model loading * Add (broken) placeholder graph builder for RWKV * Add workaround for kv cache * Add logits conversion to rwkv5 * Add rwkv5 layer norms * Add time mix KVRG & correct merge mistake * Add remaining time mix parameters * Add time mix output loading * Add placeholder llm_build_time_mix * Fix build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Load more tensors for rwkv v6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix rwkv tokenizer Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: Add unary operator Exp Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV v6 graph building Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add ``rescale_every_n_layers`` parameter Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add ``wkv.head_size`` key for RWKV so it doesn't reuse Mamba ssm parameters Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix offloading layers to CUDA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix parallel inferencing for RWKV Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Remove trailing whitespaces Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * build_rwkv: Avoid using inplace operations Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * convert_hf_to_gguf: rwkv: Avoid using ``eval`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * convert_hf_to_gguf: rwkv tokenizer: Don't escape sequences manually Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * ggml: Add backward computation for unary op ``exp`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * Use MODEL_ARCH.RWKV6 instead of MODEL_ARCH.RWKV Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * build_rwkv6: Simplify graph Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Detect model.type Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Fix tensor loading for 7B/14B models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Fix group_norm assertion failure with Metal Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Clean up Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Add quantization tensor exclusion Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Use the new advanced batch splits Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * llama: rwkv6: Use ``ggml_norm`` instead of ``ggml_group_norm`` Co-authored-by: compilade <git@compilade.net> * llama: rwkv6: Apply code style and misc changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * converter: Use class name ``Rwkv6Model`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Make use of key ``feed_forward_length`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Add kv ``time_mix_extra_dim`` and ``time_decay_extra_dim`` Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * converter: Match ``new_name`` instead of ``name`` for float32 explicit tensors Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Keep ``time_mix_w1/w2`` as F32 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Remove unused nodes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Apply code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: rwkv6: Add lora for some supported tensors Currently att.key/receptance/value/gate/output, ffn.receptance/key/value, as well as head.weight Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * rwkv : speed-up tokenization using trie * minor : style + indentation * llama: rwkv6: Avoid division by zero Co-authored-by: compilade <git@compilade.net> * ggml: rwkv_wkv: Avoid copying the state Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Layl Bongers <3094382+LaylBongers@users.noreply.github.com> Co-authored-by: compilade <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-01 17:38:17 +03:00
Carsten Kragelund Jørgensen	75e1dbbaab	llama : fix llama3.1 rope_freqs not respecting custom head_dim (#9141 ) * fix: llama3.1 rope_freqs not respecting custom head_dim * fix: use potential head_dim for Exaone	2024-08-27 09:53:40 +03:00
Xuan Son Nguyen	3ba780e2a8	lora : fix llama conversion script with ROPE_FREQS (#9117 )	2024-08-23 12:58:53 +02:00
Younes Belkada	b40eb84895	llama : support for `falcon-mamba` architecture (#9074 ) * feat: initial support for llama.cpp * fix: lint * refactor: better refactor * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * fix: address comments * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * fix: add more cleanup and harmonization * fix: lint * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * fix: change name * Apply suggestions from code review Co-authored-by: compilade <git@compilade.net> * add in operator * fix: add `dt_b_c_rms` in `llm_load_print_meta` * fix: correct printf format for bool * fix: correct print format * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * llama : quantize more Mamba tensors * llama : use f16 as the fallback of fallback quant types --------- Co-authored-by: compilade <git@compilade.net>	2024-08-21 11:06:36 +03:00
Minsoo Cheong	c679e0cb5c	llama : add EXAONE model support (#9025 ) * add exaone model support * add chat template * fix whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add ftype * add exaone pre-tokenizer in `llama-vocab.cpp` Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com> * fix lint Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com> * add `EXAONE` to supported models in `README.md` * fix space Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <113953597+compilade@users.noreply.github.com> Co-authored-by: compilade <git@compilade.net>	2024-08-16 09:35:18 +03:00

1 2

80 Commits