llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-12 03:31:46 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	56f20aa25d	scripts : sync ggml-aarch64 sources	2024-07-27 18:07:33 +03:00
Georgi Gerganov	345c8c0c87	ggml : add missing semicolon (#0 ) ggml-ci	2024-07-27 17:43:44 +03:00
Georgi Gerganov	ae7985cd7b	sync : ggml ggml-ci	2024-07-27 17:43:44 +03:00
Mahesh Madhav	a05ca93697	ggml : loop tiling optimizations for scalar path (ggml/898) Apply a loop tiling technique to the generic path, which provides performance upside for ISAs with enough registers to take advantage of it. Also helps the compiler optimize this path.	2024-07-27 17:43:44 +03:00
Ivan Filipov	9f77d899b7	ggml: add support for float16 input tensors in pooling operations (ggml/895) * Add support for float16 tensors in 1d pooling operations * Add support for float16 input tensors in 2d pooling operations * code cleanup remove unnecessary casting during srow ptr initialization --------- Co-authored-by: vanaka11 <vanaka1189@gmail.com>	2024-07-27 17:43:44 +03:00
Tony Wasserka	203b7f1531	vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (ggml/893) This prevents invalid frees when destroying a partially initialized vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer when running out of device memory. Co-authored-by: Tony Wasserka <neobrain@users.noreply.github.com>	2024-07-27 17:43:44 +03:00
Borislav Stanimirov	d2b851bfa1	cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (ggml/885)	2024-07-27 17:43:44 +03:00
Daniel Bevenius	c12b6e8ee7	ggml : remove unnecessary UNUSED macro call (ggml/880) This commit removes an UNUSED macro call that is not needed as the variable n0 is used in the code and will not produce a warning. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-07-27 17:43:44 +03:00
Jeffrey Morgan	b5e95468b1	llama : add support for llama 3.1 rope scaling factors (#8676 ) * Add llama 3.1 rope scaling factors to llama conversion and inference This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope oepration, improving results for context windows above 8192 * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * address comments * address comments * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: compilade <git@compilade.net>	2024-07-27 15:03:45 +03:00
Georgi Gerganov	92090eca21	llama : add function for model-based max number of graph nodes (#8622 ) * llama : model-based max number of graph nodes ggml-ci * llama : disable 405B max_nodes path due to lack of complaints ggml-ci	2024-07-27 14:59:29 +03:00
Daniel Bevenius	9d03d085dd	common : add --no-warmup option for main/llama-cli (#8712 ) This commit adds a --no-warmup option for llama-cli. The motivation for this is that it can be convenient to skip the warmup llama_decode call when debugging. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-07-27 13:45:02 +03:00
wangshuai09	bfb4c74981	cann: Fix Multi-NPU execution error (#8710 ) * cann: fix multi-npu exec error * cann: update comment for ggml_backend_cann_supports_buft	2024-07-27 16:36:44 +08:00
slaren	2b1f616b20	ggml : reduce hash table reset cost (#8698 ) * ggml : reduce hash table reset cost * fix unreachable code warnings after GGML_ASSERT(false) * GGML_ASSERT(false) -> GGML_ABORT("fatal error") * GGML_ABORT use format string	2024-07-27 04:41:55 +02:00
Judd	01245f5b16	llama : fix order of parameters (#8706 ) usage of `aclrtGetMemInfo` is correct: https://www.hiascend.com/doc_center/source/zh/canncommercial/63RC2/inferapplicationdev/aclcppdevg/aclcppdevg_03_0103.html Co-authored-by: Judd <foldl@boxvest.com>	2024-07-26 11:38:12 +03:00
Yaiko	01aec4a631	server : add Speech Recognition & Synthesis to UI (#8679 ) * server : add Speech Recognition & Synthesis to UI * server : add Speech Recognition & Synthesis to UI (fixes)	2024-07-26 00:10:16 +02:00
Xuan Son Nguyen	41cd47caab	examples : export-lora : fix issue with quantized base models (#8687 )	2024-07-25 23:49:39 +02:00
DavidKorczynski	49ce0ab6d4	ggml: handle ggml_init failure to fix NULL pointer deref (#8692 ) `ggml_init` can fail if no unused context is found. In that case, a NULL-pointer deref will happen later in the code during a call to `ggml_set_on_alloc`. This fixes it by bailing out if no context is found.	2024-07-25 23:23:05 +02:00
Georgi Gerganov	4226a8d10e	llama : fix build + fix fabs compile warnings (#8683 ) ggml-ci	2024-07-25 19:57:31 +03:00
Andreas (Andi) Kunar	bf5a81df37	ggml : fix build on Windows with Snapdragon X (#8531 ) * Improvements for Windows with Snapdragon X * Revert "Improvements for Windows with Snapdragon X" This reverts commit `bf21397ae5`. * Improvements for Windows with Snapdragon X * WOA build clarifications * WIndows on ARM build clarifications * cmake build for Windows clarifications * Update docs/build.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: AndreasKunar <andreaskmsn.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-25 19:01:00 +03:00
Georgi Gerganov	88954f7fbd	tests : fix printfs (#8068 )	2024-07-25 18:58:04 +03:00
Chen Xi	ed67bcb24f	[SYCL] fix multi-gpu issue on sycl (#8554 ) --------- Signed-off-by: Chen Xi <xi2chen@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>	2024-07-25 19:45:18 +08:00
Georgi Gerganov	eddcb5238b	ggml : add and use ggml_cpu_has_llamafile() (#8664 )	2024-07-25 12:37:42 +03:00
Xuan Son Nguyen	be6d7c0791	examples : remove `finetune` and `train-text-from-scratch` (#8669 ) * examples : remove finetune and train-text-from-scratch * fix build * update help message * fix small typo for export-lora	2024-07-25 10:39:04 +02:00
Ujjawal Panchal	4b0eff3df5	docs : Quantum -> Quantized (#8666 ) * docfix: imatrix readme, quantum models -> quantized models. * docfix: server readme: quantum models -> quantized models.	2024-07-25 11:13:27 +03:00
Fan Shupei	8a4bad50a8	llama: use sliding window for phi3 (#8627 ) * use sliding window for phi3 * fix typo, "data_swa" -> "data" * [conver_hf_to_gguf.py] add phi3 sliding window	2024-07-25 10:21:09 +03:00
MorganRO8	68504f0970	readme : update games list (#8673 ) Added link to game I made that depends on llama	2024-07-24 19:48:00 +03:00
Joe Todd	f19bf99c01	Build Llama SYCL Intel with static libs (#8668 ) Ensure SYCL CI builds both static & dynamic libs for testing purposes Signed-off-by: Joe Todd <joe.todd@codeplay.com>	2024-07-24 14:36:00 +01:00
Thorsten Sommer	3a7ac5300a	readme : update UI list [no ci] (#8505 )	2024-07-24 15:52:30 +03:00
Xuan Son Nguyen	96952e7181	llama : fix `llama_chat_format_single` for mistral (#8657 ) * fix `llama_chat_format_single` for mistral * fix typo * use printf	2024-07-24 13:48:46 +02:00
Joe Todd	79167d9e49	Re-add erroneously removed -fsycl from GGML_EXTRA_LIBS (#8667 )	2024-07-24 11:55:26 +01:00
Xuan Son Nguyen	b115105f05	add llama_lora_adapter_clear (#8653 )	2024-07-24 11:25:19 +02:00
Xuan Son Nguyen	de280085e7	examples : Fix `llama-export-lora` example (#8607 ) * fix export-lora example * add more logging * reject merging subset * better check * typo	2024-07-23 23:48:37 +02:00
Vali Malinoiu	b841d07408	server : fix URL.parse in the UI (#8646 )	2024-07-23 17:37:42 +03:00
Joe Todd	64cf50a0ed	sycl : Add support for non-release DPC++ & oneMKL (#8644 ) * Update cmake to support nvidia hardware & open-source compiler --------- Signed-off-by: Joe Todd <joe.todd@codeplay.com>	2024-07-23 14:58:37 +01:00
Georgi Gerganov	938943cdbf	llama : move vocab, grammar and sampling into separate files (#8508 ) * llama : move sampling code into llama-sampling ggml-ci * llama : move grammar code into llama-grammar ggml-ci * cont ggml-ci * cont : pre-fetch rules * cont ggml-ci * llama : deprecate llama_sample_grammar * llama : move tokenizers into llama-vocab ggml-ci * make : update llama.cpp deps [no ci] * llama : redirect external API to internal APIs ggml-ci * llama : suffix the internal APIs with "_impl" ggml-ci * llama : clean-up	2024-07-23 13:10:17 +03:00
0cc4m	751fcfc6c3	Vulkan IQ4_NL Support (#8613 ) * Fix Vulkan matmul tests compile errors * Add Vulkan IQ4_NL support * Fix Vulkan DeepSeek-Coder-V2-Lite MoE support	2024-07-23 10:56:49 +02:00
Jeroen Mostert	46e47417aa	Allow all RDNA2 archs to use sdot4 intrinsic (#8629 ) The check gating the use of `__builtin_amdgc_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, let's use it.	2024-07-23 10:50:40 +02:00
Georgi Gerganov	e7e6487ba0	contrib : clarify PR squashing + module names (#8630 ) * contrib : clarify PR squashing * contrib : fix typo + add list of modules	2024-07-23 11:28:38 +03:00
luoyu-intel	063d99ad11	[SYCL] fix scratch size of softmax (#8642 )	2024-07-23 15:43:28 +08:00
Keke Han	081fe431aa	llama : fix codeshell support (#8599 ) * llama : fix codeshell support * llama : move codeshell after smollm below to respect the enum order	2024-07-22 19:43:43 +03:00
Jason Stillerman	d94c6e0ccb	llama : add support for SmolLm pre-tokenizer (#8609 ) * Adding SmolLM Pre Tokenizer * Update convert_hf_to_gguf_update.py Co-authored-by: compilade <git@compilade.net> * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * handle regex * removed .inp and out .out ggufs --------- Co-authored-by: compilade <git@compilade.net>	2024-07-22 17:43:01 +03:00
Jiří Podivín	566daa5a5b	.py: Stylistic adjustments for python (#8233 ) Superflous parens in conditionals were removed. * Unused args in function were removed. * Replaced unused `idx` var with `_` * Initializing file_format and format_version attributes * Renaming constant to capitals * Preventing redefinition of the `f` var Signed-off-by: Jiri Podivin <jpodivin@redhat.com>	2024-07-22 23:44:53 +10:00
Georgi Gerganov	6f11a83e4e	llama : allow overrides for tokenizer flags (#8614 ) ggml-ci	2024-07-22 13:33:22 +03:00
Georgi Gerganov	e093dd2382	tests : re-enable tokenizer tests (#8611 ) * models : remove duplicated gpt-2 vocab * models : remove old stablelm vocab * tests : re-enable MPT tokenizer tests * tests : re-enable DeepSeek tokenizer tests * cmake : sort ggml-ci	2024-07-22 13:32:49 +03:00
Douglas Hanley	50e05353e8	llama : add Mistral Nemo inference support (#8604 )	2024-07-22 11:06:17 +03:00
Jan Boon	628154492a	server : update doc to clarify n_keep when there is bos token (#8619 )	2024-07-22 11:02:09 +03:00
Mark Zhuang	04bab6b7da	ggml: fix compile error for RISC-V (#8623 )	2024-07-22 10:56:45 +03:00
devojony	b7c11d36e6	examples: fix android example cannot be generated continuously (#8621 ) When generation ends `completion_loop()` should return a NULL, not the empty string	2024-07-22 09:54:42 +03:00
Georgi Gerganov	45f2c19cc5	flake.lock: Update (#8610 )	2024-07-21 06:45:10 -07:00
M-A	22f281aa16	examples : Rewrite pydantic_models_to_grammar_examples.py (#8493 ) Changes: - Move each example into its own function. This makes the code much easier to read and understand. - Make the program easy to only run one test by commenting out function calls in main(). - Make the output easy to parse by indenting the output for each example. - Add shebang and +x bit to make it clear it's an executable. - Make the host configurable via --host with a default 127.0.0.1:8080. - Make the code look in the tools list to call the registered tool, instead of hardcoding the returned values. This makes the code more copy-pastable. - Add error checking, so that the program exits 1 if the LLM didn't returned expected values. It's super useful to check for correctness. Testing: - Tested with Mistral-7B-Instruct-v0.3 in F16 and Q5_K_M and Meta-Llama-3-8B-Instruct in F16 and Q5_K_M. - I did not observe a failure even once in Mistral-7B-Instruct-v0.3. - Llama-3 failed about a third of the time in example_concurrent: it only returned one call instead of 3. Even for F16. Potential follow ups: - Do not fix the prompt encoding yet. Surprisingly it mostly works even if the prompt encoding is not model optimized. - Add chained answer and response. Test only change.	2024-07-20 22:09:17 -04:00

1 2 3 4 5 ...

3480 Commits