llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-27 03:44:35 +00:00

Author	SHA1	Message	Date
Borislav Stanimirov	7a80710d93	msvc : silence codecvt c++17 deprecation warnings (#8395 )	2024-07-10 14:40:53 +03:00
fairydreaming	a8be1e6f59	llama : add assert about missing llama_encode() call (#8400 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-07-10 14:38:58 +03:00
RunningLeon	e4dd31ff89	py : fix converter for internlm2 (#8321 ) * update internlm2 * remove unused file * fix lint	2024-07-10 14:26:40 +03:00
laik	8f0fad42b9	py : fix extra space in convert_hf_to_gguf.py (#8407 )	2024-07-10 14:19:10 +03:00
Clint Herron	a59f8fdc85	Server: Enable setting default sampling parameters via command-line (#8402 ) * Load server sampling parameters from the server context by default. * Wordsmithing comment	2024-07-09 18:26:40 -04:00
Andy Salerno	fd560fe680	Update README.md to fix broken link to docs (#8399 ) Update the "Performance troubleshooting" doc link to be correct - the file was moved into a dir called 'development'	2024-07-09 14:58:44 -04:00
Clint Herron	e500d6135a	Deprecation warning to assist with migration to new binary names (#8283 ) * Adding a simple program to provide a deprecation warning that can exist to help people notice the binary name change from #7809 and migrate to the new filenames. * Build legacy replacement binaries only if they already exist. Check for their existence every time so that they are not ignored.	2024-07-09 11:54:43 -04:00
Johannes Gäßler	a03e8dd99d	make/cmake: LLAMA_NO_CCACHE -> GGML_NO_CCACHE (#8392 )	2024-07-09 17:11:07 +02:00
Alberto Cabrera Pérez	5b0b8d8cfb	sycl : Reenabled mmvq path for the SYCL Nvidia Backend (#8372 ) * SYCL : Reenabled mmvq path for the SYCL Nvidia Backend * Reduced verbosity of comment	2024-07-09 22:03:15 +08:00
Borislav Stanimirov	9925ca4087	cmake : allow external ggml (#8370 )	2024-07-09 11:38:00 +03:00
daghanerdonmez	9beb2dda03	readme : fix typo [no ci] (#8389 ) Bakus-Naur --> Backus-Naur	2024-07-09 09:16:00 +03:00
compilade	7d0e23d72e	gguf-py : do not use internal numpy types (#7472 )	2024-07-09 01:04:49 -04:00
Georgi Gerganov	7fdb6f73e3	flake.lock: Update (#8342 ) Flake lock file updates: • Updated input 'flake-parts': 'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01) → 'github:hercules-ci/flake-parts/9227223f6d922fee3c7b190b2cc238a99527bbb7?narHash=sha256-pQMhCCHyQGRzdfAkdJ4cIWiw%2BJNuWsTX7f0ZYSyz0VY%3D' (2024-07-03) • Updated input 'flake-parts/nixpkgs-lib': '`eb9ceca17d`.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01) → '`5daf051448`.tar.gz?narHash=sha256-Fm2rDDs86sHy0/1jxTOKB1118Q0O3Uc7EC0iXvXKpbI%3D' (2024-07-01) • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/b2852eb9365c6de48ffb0dc2c9562591f652242a?narHash=sha256-C8e9S7RzshSdHB7L%2Bv9I51af1gDM5unhJ2xO1ywxNH8%3D' (2024-06-27) → 'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-07-08 15:36:38 -07:00
Alberto Cabrera Pérez	a130eccef4	labeler : updated sycl to match docs and code refactor (#8373 )	2024-07-08 22:35:17 +02:00
b4b4o	c4dd11d1d3	readme : fix web link error [no ci] (#8347 )	2024-07-08 17:19:24 +03:00
Alberto Cabrera Pérez	2ec846d558	sycl : fix powf call in device code (#8368 )	2024-07-08 14:22:41 +01:00
Georgi Gerganov	3f2d538b81	scripts : fix sync for sycl	2024-07-08 13:51:31 +03:00
Georgi Gerganov	2ee44c9a18	sync : ggml ggml-ci	2024-07-08 12:23:00 +03:00
Georgi Gerganov	6847d54c4f	tests : fix whitespace (#0 )	2024-07-08 12:23:00 +03:00
John Balis	fde13b3bb9	feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854) * conv transpose 1d passing test for 1d input and kernel * working for different input and output channel counts, added test for variable stride * initial draft appears to work with stride other than 1 * working with all old and new conv1d tests * added a test for large tensors * removed use cuda hardcoding * restored test-conv-transpose.c * removed unused arugments, and fixed bug where test failure would cause subsequent tests to fail * fixed accumulator bug * added test to test-backend-ops * fixed mistake * addressed review * fixed includes * removed blank lines * style and warning fixes * return failure when test fails * fix supports_op --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-07-08 12:23:00 +03:00
Kevin Wang	470939d483	common : preallocate sampling token data vector (#8363 ) `emplace_back` repeatedly-called is slower than preallocating the vector to the vocab size and directly inserting the data. Some rudimentary profiling with `chrono` improves the performance of this block of code from ~500us/op to ~40us/op. Overall, this slightly improves the sampling performance which has a more substantial impact for the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.	2024-07-08 10:26:53 +03:00
Georgi Gerganov	6f0dbf6ab0	infill : assert prefix/suffix tokens + remove old space logic (#8351 )	2024-07-08 09:34:35 +03:00
Kevin Wang	ffd00797d8	common : avoid unnecessary logits fetch (#8358 )	2024-07-08 09:31:55 +03:00
toyer	04ce3a8b19	readme : add supported glm models (#8360 )	2024-07-08 08:57:19 +03:00
compilade	3fd62a6b1c	py : type-check all Python scripts with Pyright (#8341 ) * py : type-check all Python scripts with Pyright * server-tests : use trailing slash in openai base_url * server-tests : add more type annotations * server-tests : strip "chat" from base_url in oai_chat_completions * server-tests : model metadata is a dict * ci : disable pip cache in type-check workflow The cache is not shared between branches, and it's 250MB in size, so it would become quite a big part of the 10GB cache limit of the repo. * py : fix new type errors from master branch * tests : fix test-tokenizer-random.py Apparently, gcc applies optimisations even when pre-processing, which confuses pycparser. * ci : only show warnings and errors in python type-check The "information" level otherwise has entries from 'examples/pydantic_models_to_grammar.py', which could be confusing for someone trying to figure out what failed, considering that these messages can safely be ignored even though they look like errors.	2024-07-07 15:04:39 -04:00
Denis Spasyuk	a8db2a9ce6	Update llama-cli documentation (#8315 ) * Update README.md * Update README.md * Update README.md fixed llama-cli/main, templates on some cmds added chat template sections and fixed typos in some areas * Update README.md * Update README.md * Update README.md	2024-07-07 17:08:28 +02:00
Alex Tuddenham	4090ea5501	ci : add checks for cmake,make and ctest in ci/run.sh (#8200 ) * Added checks for cmake,make and ctest * Removed erroneous whitespace	2024-07-07 17:59:14 +03:00
Andy Tai	f1948f1e10	readme : update bindings list (#8222 ) * adding guile_llama_cpp to binding list * fix formatting * fix formatting	2024-07-07 16:21:37 +03:00
Brian	f7cab35ef9	gguf-hash: model wide and per tensor hashing using xxhash and sha1 (#8048 ) CLI to hash GGUF files to detect difference on a per model and per tensor level The hash type we support is: - `--xxh64`: use xhash 64bit hash mode (default) - `--sha1`: use sha1 - `--uuid`: use uuid - `--sha256`: use sha256 While most POSIX systems already have hash checking programs like sha256sum, it is designed to check entire files. This is not ideal for our purpose if we want to check for consistency of the tensor data even if the metadata content of the gguf KV store has been updated. This program is designed to hash a gguf tensor payload on a 'per tensor layer' in addition to a 'entire tensor model' hash. The intent is that the entire tensor layer can be checked first but if there is any detected inconsistencies, then the per tensor hash can be used to narrow down the specific tensor layer that has inconsistencies. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-07 22:58:43 +10:00
toyer	905942abdb	llama : support glm3 and glm4 (#8031 ) * add chatglm3-6b model support huggingface model: https://hf-mirror.com/THUDM/chatglm3-6b Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * remove .rotary_pos_emb.inv_freq and unuse code for chatglm3 model Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * fix lint error Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * optimize convert-hf-to-gguf.py for chatglm model Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * support glm-4-9b-chat Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> * fix eos tokens to glm4 * remove unused log * add preprocess to chatglm3 and chatglm4 * add eos_id_list to llama.cpp * fix code style * fix code style * fix conflicts * fix conflicts * Revert "add eos_id_list to llama.cpp" This reverts commit `3a4d5790bf`. * set <\|endoftext\|> as eos and <\|user\|> as eot * fix chat template bug * add comment to glm prefix and suffix * fix conflicts and add rope_ratio & ChatGLMForConditionalGeneration * fix chat template bug * fix codestyle * fix conflicts * modified the general name of glm model * fix conflicts * remove prefix and suffix * use normal glm4 chattempalte & use LLM_FFN_SWIGLU in phi3 * fix: resolve Flake8 errors in `convert-hf-to-gguf.py` - Fix E302 by adding two blank lines before top-level function definitions - Replace print statements to fix NP100 - Fix E303 by ensuring only one blank line between lines of code * fix rope ratio to solve incorrect answers * fix by comments --------- Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com> Co-authored-by: XingXing Qiao <qiaoxx@dingdao.com> Co-authored-by: Umpire2018 <138990495+Umpire2018@users.noreply.github.com>	2024-07-07 15:52:10 +03:00
Georgi Gerganov	b5040086d4	llama : fix n_rot default (#8348 ) ggml-ci	2024-07-07 14:59:02 +03:00
compilade	d39130a398	py : use cpu-only torch in requirements.txt (#8335 )	2024-07-07 14:23:38 +03:00
standby24x7	b81ba1f96b	finetune: Rename command name in README.md (#8343 ) Rename an old command name "finetune" to "llama-finetune" in README.md Signed-off-by: Masanari Iida <standby24x7@gmail.com>	2024-07-07 13:38:02 +03:00
standby24x7	210eb9ed0a	finetune: Rename an old command name in finetune.sh (#8344 ) This patch replaces an old commad "main" with "llama-cli" in finetune.sh. The part that I fixed is comment, so it doesn't change the script. Signed-off-by: Masanari Iida <standby24x7@gmail.com>	2024-07-07 13:37:47 +03:00
Bjarke Viksøe	cb4d86c4d7	server: Retrieve prompt template in /props (#8337 ) * server: Retrieve prompt template in /props This PR adds the following: - Expose the model's Jinja2 prompt template from the model in the /props endpoint. - Change log-level from Error to Warning for warning about template mismatch. The front-end stands a better chance of actually executing the Jinja template format correctly. Server is currently just guessing it. Ideally this should have been inside a JSON block that expose the same key/value pairs as listed during startup in "llm_load_print_meta" function. * Make string buffer dynamic * Add doc and better string handling * Using chat_template naming convention * Use intermediate vector for string assignment	2024-07-07 11:10:38 +02:00
Derrick T. Woolworth	86e7299ef5	added support for Authorization Bearer tokens when downloading model (#8307 ) * added support for Authorization Bearer tokens * removed auth_token, removed set_ function, other small fixes * Update common/common.cpp --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-07-06 22:32:04 +02:00
Xuan Son Nguyen	60d83a0149	update main readme (#8333 )	2024-07-06 19:01:23 +02:00
Daniel Bevenius	87e25a1d1b	llama : add early return for empty range (#8327 ) * llama : add early return for empty range This commit adds an early return to the llama_kv_cache_seq_add and llama_kv_cache_seq_div functions. The motivation for adding this is to avoid looping over the cache when the range is empty. I ran into this when using the self-extend feature in main.cpp. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * llama : add static_cast to fix CI warning/error This commit attempts to fix the following warning/error: ```console src/llama.cpp:7271:31: error: comparison of integer expressions of different signedness: ‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare] 7271 \| if (i < hparams.n_layer_dense_lead) { \| ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` This can be reproduced locally by setting -Wsign-compare in the Makefile. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * squash! llama : add early return for empty range Remove the setting of cache.head to 0 when the range is empty. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * Update src/llama.cpp --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-06 10:22:16 +03:00
jaime-m-p	213701b51a	Detokenizer fixes (#8039 ) * Add llama_detokenize(): - Update header files location - UNKNOWN and CONTROL are 'special pieces' - Remove space after UNKNOWN and CONTROL - Refactor llama_token_to_piece() - Add flag: clean_up_tokenization_spaces - Symmetric params for llama_tokenize() and llama_detokenize() * Update and fix tokenizer tests: - Using llama_detokenize() - Unexpected vocab type as test fail instead of error - Useful when automating tests: - If you don't know in advance the vocab type - Differenciate other loading errors - Skip unicode surrogaes and undefined - Gracefully exit threads - Using exit() is throwing random exceptions - Clean old known problematic codepoints - Minor: confusing hexadecimal codepoint * Update bruteforce random tests - Add detokenizer checks - New generator: ascii_lr_strip - New generator: apostrophe - Add more vocabs files - Detokenize special tokens. - Replace errors with '\uFFFD' when detokenizing to 'utf-8' - More edge cases - Better detokenization results check * Fix add_space_prefix, set false by default * Better leading space removal * Do not remove space when decoding special tokens * Bugfix: custom regexs splits undefined unicode codepoints * 'viking' detokenizer clean spaces	2024-07-05 19:01:35 +02:00
Xuan Son Nguyen	be20e7f49d	Reorganize documentation pages (#8325 ) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections	2024-07-05 18:08:32 +02:00
Georgi Gerganov	7ed03b8974	llama : fix compile warning (#8304 )	2024-07-05 17:32:09 +03:00
Natsu	1d894a790e	cmake : add GGML_BUILD and GGML_SHARED macro definitions (#8281 )	2024-07-05 17:29:35 +03:00
Ouadie EL FAROUKI	1f3e1b66e2	Enabled more data types for oneMKL gemm_batch (#8236 )	2024-07-05 13:23:25 +01:00
Georgi Gerganov	148ec970b6	convert : remove AWQ remnants (#8320 )	2024-07-05 10:15:36 +03:00
Georgi Gerganov	2cccbaa008	llama : minor indentation during tensor loading (#8304 ) * llama : minor indentation during tensor loading ggml-ci * llama : use int for layer iterators [no ci]	2024-07-05 10:15:24 +03:00
Johannes Gäßler	8e558309dc	CUDA: MMQ support for iq4_nl, iq4_xs (#8278 )	2024-07-05 09:06:31 +02:00
Daniele	0a423800ff	CUDA: revert part of the RDNA1 optimizations (#8309 ) The change on the launch_bounds was causing a small performance drop in perplexity of 25 t/s	2024-07-05 09:06:09 +02:00
Douglas Hanley	d12f781074	llama : streamline embeddings from "non-embedding" models (#8087 )	2024-07-05 10:05:56 +03:00
Johannes Gäßler	bcefa03bc0	CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (#8311 )	2024-07-05 09:05:34 +02:00
Pieter Ouwerkerk	5a7447c569	readme : fix minor typos [no ci] (#8314 )	2024-07-05 09:58:41 +03:00

... 8 9 10 11 12 ...

3812 Commits