llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-11 19:21:46 +00:00

Author	SHA1	Message	Date
Daniel Bevenius	db68c93b57	ggml : improve inputs log sched_print_assignments (ggml/1053) This commit attempts to improve the log message for the inputs of the splits in the sched_print_assignments function. The motivation for this change is that currently even if there are no inputs a colon is displayed at the end of the line, which can make it a little confusing when reading the output as it could be interpreted as the line below are inputs when they are in fact nodes. With this change the colon will only be printed if there actually are inputs.	2025-01-04 16:09:53 +02:00
Gilad S.	c31fc8b966	fix: Vulkan shader gen binary path (#11037 )	2025-01-04 09:17:31 +01:00
Molly Sophia	4b0c638b9a	common : disable KV cache shifting automatically for unsupported models (#11053 ) * Disable KV cache shifting automatically for unsupported models instead of exiting directly Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update common/common.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-01-03 14:13:18 +02:00
Georgi Gerganov	e7da954ecc	metal : avoid uint (#11019 )	2025-01-03 11:26:14 +02:00
Georgi Gerganov	f66f582927	llama : refactor `src/llama.cpp` (#10902 ) * llama : scatter llama.cpp into multiple modules (wip) * llama : control-vector -> adapter * llama : arch * llama : mmap ggml-ci * ci : remove BUILD_SHARED_LIBS=OFF ggml-ci * llama : arch (cont) ggml-ci * llama : chat ggml-ci * llama : model ggml-ci * llama : hparams ggml-ci * llama : adapter ggml-ci * examples : fix ggml-ci * rebase ggml-ci * minor * llama : kv cache ggml-ci * llama : impl ggml-ci * llama : batch ggml-ci * cont ggml-ci * llama : context ggml-ci * minor * llama : context (cont) ggml-ci * llama : model loader ggml-ci * common : update lora ggml-ci * llama : quant ggml-ci * llama : quant (cont) ggml-ci * minor [no ci]	2025-01-03 10:18:53 +02:00
Pierrick Hymbert	2f0ee84b9b	server: bench: minor fixes (#10765 ) * server/bench: - support openAI streaming standard output with [DONE]\n\n - export k6 raw results in csv - fix too many tcp idle connection in tcp_wait - add metric time to emit first token * server/bench: - fix when prometheus not started - wait for server to be ready before starting bench	2025-01-02 18:06:12 +01:00
Xuan Son Nguyen	0da5d86026	server : allow using LoRA adapters per-request (#10994 ) * slot.can_batch_with * lora per request * test: force disable cache prompt * move can_batch_with check * fix condition * add slow test with llama 8b * update docs * move lora change task to queue * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * lora_base * remove redundant check --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-01-02 15:05:18 +01:00
Benson Wong	a45433ba20	readme : add llama-swap to infrastructure section (#11032 ) * list llama-swap under tools in README * readme: add llama-swap to Infrastructure	2025-01-02 09:14:54 +02:00
Srihari-mcw	0827b2c1da	ggml : fixes for AVXVNNI instruction set with MSVC and Clang (#11027 ) * Fixes for clang AVX VNNI * enable AVX VNNI and alder lake build for MSVC * Apply suggestions from code review --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-12-31 15:23:33 +01:00
Xuan Son Nguyen	45095a61bf	server : clean up built-in template detection (#11026 ) * server : clean up built-in template detection * fix compilation * add chat template test * fix condition	2024-12-31 15:22:01 +01:00
Xuan Son Nguyen	5896c65232	server : add OAI compat for /v1/completions (#10974 ) * server : add OAI compat for /v1/completions * add test * add docs * better docs	2024-12-31 12:34:13 +01:00
ymcki	bc7b1f8632	convert : fix Llama-3_1-Nemotron-51B rope settings (#11008 ) * conflict resolution * move comments after bracket to its own line * DeciLMCausalModel now reads rope_theta from config.json properly	2024-12-31 13:04:48 +02:00
Peter	6e1531aca5	common, examples, ggml : fix MSYS2 GCC compiler errors and warnings when building with LLAMA_CURL=ON and GGML_OPENCL=ON (#11013 ) In common/common.cpp: * Convert usage of stat() function call to check if file exists to standard library function std::filesystem::exists (error unable to match to correct function signature) * Additional conditions to check if PATH_MAX is already defined in WIN32 environment (warning it is already defined in MSYS2) In examples/run/run.cpp: * Add io.h header inclusion (error cannot find function _get_osfhandle) * Change initialisers for OVERLAPPED to empty struct (warning about uninitialised members) * Add initialiser for hFile (warning it may be uninitialised) * Add cast for curl_off_t percentage value to long int in generate_progress_prefix function (warning that curl_off_t is long long int) In ggml/src/ggml-opencl/ggml-opencl.cpp: * Initialise certain declared cl_mem variables to nullptr for greater safety (warning about B_d variable possibly used unassigned)	2024-12-31 01:46:06 +01:00
Jeff Bolz	716bd6dec3	vulkan: optimize mul_mat for small values of N (#10991 ) Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better. Share some code for reducing the result values to memory in mul_mat_vec_base.	2024-12-30 18:27:11 +01:00
ag2s20150909	c250ecb315	android : fix llama_batch free (#11014 )	2024-12-30 14:35:13 +02:00
Jeff Bolz	a813badbbd	vulkan: im2col and matmul optimizations for stable diffusion (#10942 ) * tests: Add im2col perf tests * vulkan: optimize im2col, more elements per thread * vulkan: increase small tile size for NV_coopmat2 * vulkan: change im2col to 512 elements per workgroup	2024-12-29 10:16:34 +01:00
Jeff Bolz	fdd2188912	vulkan: Use push constant offset to handle misaligned descriptors (#10987 )	2024-12-29 09:35:11 +01:00
Isaac McFadyen	f865ea149d	server: added more docs for response_fields field (#10995 )	2024-12-28 16:09:19 +01:00
Alexey Parfenov	16cdce7b68	server : fix token duplication when streaming with stop strings (#10997 )	2024-12-28 16:08:54 +01:00
Eve	d79d8f39b4	vulkan: multi-row k quants (#10846 ) * multi row k quant shaders! * better row selection * more row choices * readjust row selection * rm_kq=2 by default	2024-12-26 16:54:44 +01:00
Peter	d283d02bf2	examples, ggml : fix GCC compiler warnings (#10983 ) Warning types fixed (observed under MSYS2 GCC 14.2.0): * format '%ld' expects argument of type 'long int', but argument has type 'size_t' * llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers] (emitted for all struct field except first)	2024-12-26 14:59:11 +01:00
Reza Kakhki	9ba399dfa7	server : add support for "encoding_format": "base64" to the /embeddings endpoints (#10967 ) add support for base64 * fix base64 test * improve test --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-12-24 21:33:04 +01:00
Djip007	2cd43f4900	ggml : more perfo with llamafile tinyblas on x86_64 (#10714 ) * more perfo with llamafile tinyblas on x86_64. - add bf16 suport - change dispache strategie (thanks: https://github.com/ikawrakow/ik_llama.cpp/pull/71 ) - reduce memory bandwidth simple tinyblas dispache and more cache freindly * tinyblas dynamic dispaching * sgemm: add M blocs. * - git 2.47 use short id of len 9. - show-progress is not part of GNU Wget2 * remove not stable test	2024-12-24 18:54:49 +01:00
NeverLucky	09fe2e7613	server: allow filtering llama server response fields (#10940 ) * llama_server_response_fields * llama_server_response_fields_fix_issues * params fixes * fix * clarify docs * change to "response_fields" --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-12-24 17:39:49 +01:00
Georgi Gerganov	30caac3a68	llama : the WPM vocabs use the CLS token as BOS (#10930 ) * llama : the WPM vocabs use the CLS token as BOS ggml-ci * llama : add comment	2024-12-24 09:44:20 +02:00
Diego Devesa	60cfa728e2	ggml : use wstring for backend search paths (#10960 ) ggml-ci	2024-12-24 04:05:27 +01:00
Diego Devesa	3327bb0f8d	ggml : fix arm enabled features check (#10961 )	2024-12-24 04:05:17 +01:00
Diego Devesa	32d6ee6385	ggml : fix const usage in SSE path (#10962 )	2024-12-23 20:25:52 +01:00
Xuan Son Nguyen	14b699ecde	server : fix missing model id in /model endpoint (#10957 ) * server : fix missing model id in /model endpoint * fix ci	2024-12-23 12:52:25 +01:00
Xuan Son Nguyen	485dc01214	server : add system_fingerprint to chat/completion (#10917 ) * server : add system_fingerprint to chat/completion * update README	2024-12-23 12:02:44 +01:00
Radoslav Gerganov	86bf31cfe6	rpc-server : add support for the SYCL backend (#10934 )	2024-12-23 10:39:30 +02:00
Yun Dou	b92a14a841	llama : support InfiniAI Megrez 3b (#10893 ) * Support InfiniAI Megrez 3b * Fix tokenizer_clean_spaces for megrez	2024-12-23 01:35:44 +01:00
ymcki	6f0c9e034b	llama : support for Llama-3_1-Nemotron-51B (#10669 ) * conflict resolution * move comments after bracket to its own line	2024-12-23 01:22:33 +01:00
Eric Curtin	dab76c92cc	llama-run : include temperature option (#10899 ) This commit updates the `examples/run/README.md` file to include a new option for setting the temperature and updates the `run.cpp` file to parse this option. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-12-23 01:21:40 +01:00
yuri@FreeBSD	7024d59e6a	ggml : fix run-time on FreeBSD in get_executable_path() (#10948 )	2024-12-23 01:20:11 +01:00
Rudi Servo	7c0e285858	devops : add docker-multi-stage builds (#10832 )	2024-12-22 23:22:58 +01:00
Billel Mokeddem	7ae33a616f	llama : add Falcon3 support (#10883 ) * Add Falcon3 model support * Add fix for adding bos to added special tokens * Add comment explaining the logic behind the if statement * Add a log message to better track the when the following line of code is triggered * Update log to only print when input and output characters are different * Fix handling pre-normalized tokens * Refactoring	2024-12-23 00:09:58 +02:00
Jeff Bolz	ebdee9478c	vulkan: build fixes for 32b (#10927 ) * vulkan: build fixes for 32b Should fix #10923 * vulkan: initialize some buffer/offset variables	2024-12-22 10:44:01 +01:00
Georgi Gerganov	5cd85b5e00	convert : add BertForMaskedLM (#10919 )	2024-12-21 10:10:18 +02:00
Jeff Bolz	a91a41364b	vulkan: optimize coopmat2 dequant functions (#10855 ) Change the code to do 16b loads when possible and extract the appropriate component late, so the code is effectively decoding a pair of elements and then selecting one. This can allow more commoning to happen in the compiler when neighboring elements are loaded.	2024-12-21 08:04:45 +01:00
Adrien Gallouët	e34c5af43f	ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0() (#10874 ) * ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0() Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml-cpu: format code Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2024-12-21 00:33:37 +01:00
Akarshan Biswas	eb5c3dc64b	SYCL: Migrate away from deprecated ggml_tensor->backend (#10840 ) * Migrate to tensor->buffer for checking backend buffer type: 1 * SYCL: common.cpp try to migrate away from tensor->backend * SYCL: fix assertions and add proper comments * SYCL: remove extra space * SYCL: Add back static to ggml_backend_buffer_is_sycl_split function * SYCL: Add pragma directive to suppress warning spam * SYCL: Integrate debug logs with GGML_LOG and other fixes * Revert "SYCL: Integrate debug logs with GGML_LOG and other fixes" This reverts commit `2607b7de0f`. Let's keep the current SYCL specific logging mechanism for now * SYCL: Use GGML_SYCL_DEBUG after reverting * SYCL: reg_get_proc_address func, update to the current func signature * SYCL: Refactor SYCL buffer checks in ggml_sycl_cpy_tensor_2d	2024-12-20 23:31:28 +08:00
Xuan Son Nguyen	0ca416c91a	server : (UI) fix copy to clipboard function (#10916 )	2024-12-20 14:12:06 +01:00
Diego Devesa	21ae3b9be8	ggml : add test for SVE and disable when it fails (#10906 )	2024-12-20 13:31:28 +01:00
Molly Sophia	0a11f8b7b5	convert : fix RWKV v6 model conversion (#10913 ) * Enable --no-context-shift for llama-perplexity example Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV 6: Fix error in ggml_cuda_op_bin_bcast Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-12-20 11:44:58 +02:00
Georgi Gerganov	d408bb9268	clip : disable GPU support (#10896 ) ggml-ci	2024-12-19 18:47:15 +02:00
Georgi Gerganov	5cab3e4aaa	llama : minor grammar refactor (#10897 ) ggml-ci	2024-12-19 17:42:13 +02:00
Georgi Gerganov	36319dec5d	tts : small QoL for easy model fetch (#10903 )	2024-12-19 17:35:15 +02:00
Xuan Son Nguyen	57bb2c40cd	server : fix logprobs, make it OAI-compatible (#10783 ) * server : fix logprobs, make it openai-compatible * update docs * add std::log * return pre-sampling p * sort before apply softmax * add comment * fix test * set p for sampled token * update docs * add --multi-token-probs * update docs * add `post_sampling_probs` option * update docs [no ci] * remove --multi-token-probs * "top_probs" with "post_sampling_probs" * resolve review comments * rename struct token_prob to prob_info * correct comment placement * fix setting prob for sampled token	2024-12-19 15:40:08 +01:00
Adrien Gallouët	a3c33b1dce	ggml: fix arm build with gcc (#10895 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2024-12-19 14:20:41 +01:00

4462 Commits