llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-18 00:39:52 +00:00

Author	SHA1	Message	Date
Michael Coppola	940362224d	llama : add support for Tekken pre-tokenizer (#8579) * llama : Added support for Tekken pre-tokenizer (#8577) Removed unneeded `vocab.tokenizer_clean_spaces` assignment * llama : fix order of pre-tokenizers * Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces * Updated chkhsh for Tekken tokenizer --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-20 16:43:51 +03:00
Huifeng Ou	69b9945b44	llama.swiftui: fix end of generation bug (#8268 ) * fix continuing generating blank lines after getting EOT token or EOS token from LLM * change variable name to is_done (variable name suggested by ggerganov) * minor : fix trailing whitespace * minor : add space --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-20 16:09:37 +03:00
Brian	c3776cacab	gguf_dump.py: fix markdown kv array print (#8588) * gguf_dump.py: fix markdown kv array print * Update gguf-py/scripts/gguf_dump.py Co-authored-by: compilade <git@compilade.net> * gguf_dump.py: refactor kv array string handling * gguf_dump.py: escape backticks inside of strings * gguf_dump.py: inline code markdown escape handler added >>> escape_markdown_inline_code("hello world") '`hello world`' >>> escape_markdown_inline_code("hello ` world") '``hello ` world``' * gguf_dump.py: handle edge case about backticks on start or end of a string --------- Co-authored-by: compilade <git@compilade.net>	2024-07-20 17:35:25 +10:00
slaren	87e397d00b	ggml : fix quant dot product with odd number of blocks (#8549 ) * ggml : fix iq4_nl dot product with odd number of blocks * ggml : fix odd blocks for ARM_NEON (#8556) * ggml : fix iq4_nl dot product with odd number of blocks * ggml : fix q4_1 * ggml : fix q5_0 * ggml : fix q5_1 * ggml : fix iq4_nl metal ggml-ci * ggml : fix q4_0 * ggml : fix q8_0 ggml-ci * ggml : remove special Q4_0 code for first 2 blocks * ggml : fix sumf redefinition --------- Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-19 17:17:27 +02:00
Brian	57b1d4f9eb	convert-*.py: remove add_name from ChatGLMModel class (#8590 )	2024-07-20 00:04:38 +10:00
Georgi Gerganov	d197545530	llama : bump max layers from 256 to 512 (#8530 ) * llama : bump max layers from 256 to 512 * llama : replace asserts with exceptions	2024-07-19 16:50:47 +03:00
Georgi Gerganov	be0cfb4175	readme : fix server badge	2024-07-19 14:34:55 +03:00
Clint Herron	b57eb9ca4f	ggml : add friendlier error message to fopen errors (#8575 ) * Add additional error information when model files fail to load. * Adding additional error information to most instances of fopen.	2024-07-19 14:05:45 +03:00
Frank Mai	f299aa98ec	fix: typo of chatglm4 chat tmpl (#8586 ) Signed-off-by: thxCode <thxcode0824@gmail.com>	2024-07-19 11:44:41 +02:00
Brian	3d0e4367d9	convert-*.py: add general.name kv override (#8571 )	2024-07-19 17:51:51 +10:00
Johannes Gäßler	a15ef8f8a0	CUDA: fix partial offloading for ne0 % 256 != 0 (#8572 )	2024-07-18 23:48:47 +02:00
65a	705b7ecf60	cmake : install all ggml public headers (#8480 ) Co-authored-by: 65a <65a@65a.invalid>	2024-07-18 17:47:12 +03:00
Eric Zhang	0d2c7321e9	server: use relative routes for static files in new UI (#8552 ) * server: public: fix api_url on non-index pages * server: public: use relative routes for static files in new UI	2024-07-18 12:43:49 +02:00
Brian	672a6f1018	convert-.py: GGUF Naming Convention Refactor and Metadata Override Refactor (#7499) Main thing is that the default output filename will take this form {name}{parameters}{finetune}{version}{encoding}{kind} In addition this adds and removes some entries in the KV store and adds a metadata class with automatic heuristics capability to derive some values based on model card content No Change: - Internal GGUF Spec - `general.architecture` - `general.quantization_version` - `general.alignment` - `general.file_type` - General Model Details - `general.name` - `general.author` - `general.version` - `general.description` - Licensing details - `general.license` - Typically represents the converted GGUF repo (Unless made from scratch) - `general.url` - Model Source during conversion - `general.source.url` * Removed: - Model Source during conversion - `general.source.huggingface.repository` * Added: - General Model Details - `general.organization` - `general.finetune` - `general.basename` - `general.quantized_by` - `general.size_label` - Licensing details - `general.license.name` - `general.license.link` - Typically represents the converted GGUF repo (Unless made from scratch) - `general.doi` - `general.uuid` - `general.repo_url` - Model Source during conversion - `general.source.doi` - `general.source.uuid` - `general.source.repo_url` - Base Model Source - `general.base_model.count` - `general.base_model.{id}.name` - `general.base_model.{id}.author` - `general.base_model.{id}.version` - `general.base_model.{id}.organization` - `general.base_model.{id}.url` (Model Website/Paper) - `general.base_model.{id}.doi` - `general.base_model.{id}.uuid` - `general.base_model.{id}.repo_url` (Model Source Repository (git/svn/etc...)) - Array based KV stores - `general.tags` - `general.languages` - `general.datasets` --------- Co-authored-by: compilade <git@compilade.net> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-07-18 20:40:15 +10:00
RunningLeon	3807c3de04	server : respect `--special` cli arg (#8553 )	2024-07-18 11:06:22 +03:00
Johannes Gäßler	e02b597be3	lookup: fibonacci hashing, fix crashes (#8548 )	2024-07-17 23:35:44 +02:00
Al Mochkin	b3283448ce	build : Fix docker build warnings (#8535 ) (#8537 )	2024-07-17 20:21:55 +02:00
Brian	30f80ca0bc	CONTRIBUTING.md : remove mention of noci (#8541 )	2024-07-17 17:57:06 +03:00
hipudding	1bdd8ae19f	[CANN] Add Ascend NPU backend (#6035) * [CANN] Add Ascend NPU backend Ascend is a full-stack AI computing infrastructure for industry applications and services based on Huawei Ascend processors and software. CANN (Compute Architecture of Neural Networks), developed by Huawei, is a heterogeneous computing architecture for AI. Co-authored-by: wangshuai09 <391746016@qq.com> * delete trailing whitespaces * Modify the code based on review comment * Rename LLAMA_CANN to GGML_CANN * Make ggml-common.h private * add ggml_cann prefix for acl funcs * Add logging for CANN backend * Delete Trailing whitespace --------- Co-authored-by: wangshuai09 <391746016@qq.com>	2024-07-17 14:23:50 +03:00
Masaya, Kato	da3913d8f9	batched: fix n_predict parameter (#8527 )	2024-07-17 10:34:28 +03:00
Georgi Gerganov	d65a8361fe	llama : disable context-shift for DeepSeek v2 (#8501 )	2024-07-17 10:32:59 +03:00
Johannes Gäßler	5e116e8dd5	make/cmake: add missing force MMQ/cuBLAS for HIP (#8515 )	2024-07-16 21:20:59 +02:00
Brian	1666f92dcd	gguf-hash : update clib.json to point to original xxhash repo (#8491 ) * Update clib.json to point to Cyan4973 original xxhash Convinced Cyan4973 to add clib.json directly to his repo, so can now point the clib package directly to him now. Previously pointed to my fork with the clib.json package metadata https://github.com/Cyan4973/xxHash/pull/954 * gguf-hash: readme update to point to Cyan4973 xxHash repo [no ci]	2024-07-16 10:14:16 +03:00
Steve Bonds	37b12f92ab	export-lora : handle help argument (#8497 ) The --help option on export-lora isn't accepted as valid. The help still gets displayed by default, but the script exits with an error message and nonzero status.	2024-07-16 10:04:45 +03:00
Georgi Gerganov	0efec57787	llama : valign + remove unused ftype (#8502 )	2024-07-16 10:00:30 +03:00
compilade	7acfd4e8d5	convert_hf : faster lazy safetensors (#8482 ) * convert_hf : faster lazy safetensors This makes '--dry-run' much, much faster. * convert_hf : fix memory leak in lazy MoE conversion The '_lazy' queue was sometimes self-referential, which caused reference cycles of objects old enough to avoid garbage collection until potential memory exhaustion.	2024-07-15 23:13:10 -04:00
Xuan Son Nguyen	97bdd26eee	Refactor lora adapter support (#8332) * lora: load to device buft * add patch tensor function * correct tensor patch * llama_lora_adapter_apply * correct ggml_backend_tensor_copy * add llm_build_mm * fix auto merge * update based on review comments * add convert script * no more transpose A * add f16 convert * add metadata check * add sanity check * fix ftype * add requirements * fix requirements * fix outfile * conversion: only allow selected models * fix types * cuda : do not use dmmv if the tensor does not have enough cols * llama : lora fixes * do not disable mmap with lora Co-authored-by: slaren <slarengh@gmail.com> * llm_build_lora_mm_id * convert_lora : MoE LoRA conversion support * convert_lora : prefer safetensors, similarly to convert_hf * convert_hf : simplify modify_tensors for InternLM2 * convert_lora : lazy conversion * llama : load and use alpha from LoRA adapters * llama : use llm_build_lora_mm in most model graphs * auto scale * Revert "auto scale" This reverts commit `42415a4874`. * remove redundant params * Apply suggestions from code review Co-authored-by: slaren <slarengh@gmail.com> * change kv metadata * move add_type to __init__ * convert_hf : move add_type to main() * convert_lora : use the GGUFWriter from Model instead of overwriting it --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Francis Couture-Harpin <git@compilade.net>	2024-07-15 20:50:47 +02:00
Xuan Son Nguyen	4db8f60fe7	fix ci (#8494 )	2024-07-15 19:23:10 +02:00
Daniel Bevenius	8fac431b06	ggml : suppress unknown pragma 'GCC' on windows (#8460 ) This commit adds a macro guard to pragma GCC to avoid the following warning on windows: ```console C:\llama.cpp\ggml\src\ggml-aarch64.c(17,9): warning C4068: unknown pragma 'GCC' [C:\lama.cpp\build\ggml\src\ggml.vcxproj] ```	2024-07-15 15:48:17 +03:00
M-A	f17f39ff9c	server: update README.md with llama-server --help output [no ci] (#8472) The README.md had stale information. In particular, the --ctx-size "defaults to 512" confused me and I had to check the code to confirm this was false. Since the server is evolving rapidly, it's probably better to keep the source of truth in a single place (in the source) and generate the README.md based on that. Did: make llama-server ./llama-server --help > t.txt vimdiff t.txt examples/server/README.md I copied the content inside a backquote block. I would have preferred proper text but it would require a fair amount of surgery to make the current output compatible with markdown. A follow-up could be to automate this process with a script. No functional change.	2024-07-15 15:04:56 +03:00
Georgi Gerganov	9104bc20ed	common : add --no-cont-batching arg (#6358 )	2024-07-15 14:54:58 +03:00
NikolaiLyssogor	fc690b018e	docs: fix links in development docs [no ci] (#8481 ) Fixes a few links to within the repo that were broken in the reorganization of the documentation in #8325.	2024-07-15 14:46:39 +03:00
Meng, Hengyu	16bdfa42ac	[SYCL] add concat through dim 1/2 (#8483 ) * add concat through dim 1/2	2024-07-15 19:32:15 +08:00
Georgi Gerganov	3dfda05956	llama : de-duplicate deepseek2 norm	2024-07-15 14:10:39 +03:00
0cc4m	bda62d7999	Vulkan MMQ Fix (#8479 ) * Fix incoherence by adding missing LOAD_VEC_A parameter * Fix Vulkan op result checker build error	2024-07-15 09:38:52 +02:00
compilade	090fca7a07	pydantic : replace uses of __annotations__ with get_type_hints (#8474 ) * pydantic : replace uses of __annotations__ with get_type_hints * pydantic : fix Python 3.9 and 3.10 support	2024-07-14 19:51:21 -04:00
Georgi Gerganov	aaab2419ea	flake.lock: Update (#8475 ) Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03) → 'github:NixOS/nixpkgs/7e7c39ea35c5cdd002cd4588b03a3fb9ece6fad9?narHash=sha256-EYekUHJE2gxeo2pM/zM9Wlqw1Uw2XTJXOSAO79ksc4Y%3D' (2024-07-12) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-07-14 08:54:02 -07:00
Georgi Gerganov	73cf442e7b	llama : fix Gemma-2 Query scaling factors (#8473 ) * 9B - query_pre_attn_scalar = 256 not 224 See `03e657582d` Gemma 9b should use 256 and not 224 (self.config.hidden_size // self.config.num_attention_heads) * llama : fix Gemma-2 Query scaling factor ggml-ci --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2024-07-14 14:05:09 +03:00
Brian	e236528e76	gguf_hash.py: Add sha256 (#8470 ) * gguf_hash.py: Add sha256 * gguf_hash.py: rename string UUIDv5 --> uuid * Apply suggestions from code review Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: compilade <git@compilade.net>	2024-07-14 16:47:14 +10:00
compilade	fa79495bb4	llama : fix pre-tokenization of non-special added tokens (#8228) * llama : fix mpt and olmo pre-tokenizer * llama : pre-tokenize non-special user-defined tokens first * llama : fix detection of control-like user-defined tokens * convert_hf : identify which user-defined tokens are control tokens Only used in _set_vocab_gpt2() for now. * convert_hf : identify more added control tokens for SPM tokenizers This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly, including HTML tags and consecutive spaces, but it unfortunately requires model re-conversion. There seems to be a weird behavior of the HF tokenizer for Gemma, which prefers to use the 16-space token over more lengthy space tokens, while using the SentencePiece tokenizer does not do this. (the implementation in llama.cpp has the same behavior as SentencePiece) * llama : fix wrong pre-tokenization of byte tokens * llama : fix Viking pre-tokenizer regex The order was previously wrong, which caused errors in some tests. * llama : fix command-r detokenization * convert_hf : reduce usages of the UNKNOWN token type * llama : add UNKNOWN tokens in the special tokens cache * convert_hf : reduce usages of UNKNOWN for InternLM2 This makes the changes from #8321 more consistent with the other changes made here. * test-tokenizer-random : reduce potential conflicts with #8379 * test-tokenizer-random : add a failing edge case for falcon	2024-07-13 23:35:10 -04:00
bandoti	17eb6aa8a9	vulkan : cmake integration (#8119 ) * Add Vulkan to CMake pkg * Add Sycl to CMake pkg * Add OpenMP to CMake pkg * Split generated shader file into separate translation unit * Add CMake target for Vulkan shaders * Update README.md * Add make target for Vulkan shaders * Use pkg-config to locate vulkan library * Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow * Clean up tabs * Move sudo to apt-key invocation * Forward GGML_EXTRA_LIBS to CMake config pkg * Update vulkan obj file paths * Add shaderc to nix pkg * Add python3 to Vulkan nix build * Link against ggml in cmake pkg * Remove Python dependency from Vulkan build * code review changes * Remove trailing newline * Add cflags from pkg-config to fix w64devkit build * Update README.md * Remove trailing whitespace * Update README.md * Remove trailing whitespace * Fix doc heading * Make glslc required Vulkan component * remove clblast from nix pkg	2024-07-13 18:12:39 +02:00
Georgi Gerganov	c917b67f06	metal : template-ify some of the kernels (#8447 ) ggml-ci	2024-07-13 18:32:33 +03:00
Georgi Gerganov	4e24cffd8c	server : handle content array in chat API (#8449 ) * server : handle content array in chat API * Update examples/server/utils.hpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-07-12 14:48:15 +03:00
Georgi Gerganov	6af51c0d96	main : print error on empty input (#8456 )	2024-07-12 14:48:04 +03:00
Daniel Bevenius	f53226245f	llama : suppress unary minus operator warning (#8448) This commit updates the _try_copy lambda and moves the unary minus operator to after the cast to int32_t. The motivation for this is that currently the following warning is generated on windows: ```console llama.cpp\src\llama.cpp(21147,30): warning C4146: unary minus operator applied to unsigned type, result still unsigned ```	2024-07-12 12:05:21 +03:00
Douglas Hanley	c3ebcfa148	server : ensure batches are either all embed or all completion (#8420 ) * make sure batches are all embed or all non-embed * non-embedding batch for sampled tokens; fix unused params warning	2024-07-12 11:14:12 +03:00
Armen Kaleshian	8a4441ea1a	docker : fix filename for convert-hf-to-gguf.py in tools.sh (#8441 ) Commit `b0a4699` changed the name of this script from convert-hf-to-gguf.py to convert_hf_to_gguf.py breaking how convert is called from within a Docker container.	2024-07-12 11:08:19 +03:00
Jiří Podivín	5aefbce27a	convert : remove fsep token from GPTRefactForCausalLM (#8237 ) The <filename> token used by Refact doesn't serve the same purpose as the <file_separator> from CodeGemma. Signed-off-by: Jiri Podivin <jpodivin@redhat.com>	2024-07-12 11:06:33 +03:00
Georgi Gerganov	71c1121d11	examples : sprintf -> snprintf (#8434 ) * examples : sprintf -> snprintf ggml-ci * examples : use sizeof() instead of hardcoded constants	2024-07-12 10:46:14 +03:00
Georgi Gerganov	370b1f7e7a	ggml : minor naming changes (#8433 ) * ggml : minor naming changes ggml-ci * ggml : use PRId64 [no ci] * ggml : revert FA K/Q names	2024-07-12 10:46:02 +03:00
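
The `gguf_dump.py` entry in the log (c3776cacab) quotes two doctest results for `escape_markdown_inline_code` and mentions an edge case for backticks at the start or end of a string. The following is a minimal sketch of how such an escape helper could work, not the code from that commit: it assumes the usual Markdown rule that an inline code span must be delimited by a backtick run longer than any run inside it, and that a padding space is needed when the content itself begins or ends with a backtick. The function body and the padding behaviour are assumptions; only the function name and the first two results come from the commit message.

```python
import re


def escape_markdown_inline_code(value: str) -> str:
    """Wrap `value` in a Markdown inline code span (illustrative sketch)."""
    # Length of the longest run of consecutive backticks inside the value.
    longest_run = max((len(run) for run in re.findall(r"`+", value)), default=0)

    # The delimiter must be one backtick longer than any run it encloses.
    delimiter = "`" * (longest_run + 1)

    # Assumed edge-case handling: pad with spaces so a leading or trailing
    # backtick in the content does not merge with the delimiter.
    if value.startswith("`") or value.endswith("`"):
        value = f" {value} "

    return f"{delimiter}{value}{delimiter}"


if __name__ == "__main__":
    print(escape_markdown_inline_code("hello world"))
    print(escape_markdown_inline_code("hello ` world"))
    print(escape_markdown_inline_code("`quoted`"))
```

The first two calls reproduce the results quoted in the commit message; the third shows the assumed padding for content that starts and ends with a backtick.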


3876 Commits