llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-30 21:34:36 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	044ec4b2a5	embedding : add EOS token if not present (#899 )	2024-03-14 15:14:14 +02:00
Georgi Gerganov	77178eedc8	gguf-py : fix dtype check (#6045 )	2024-03-14 13:32:14 +02:00
Jian Liao	15a333260a	readme : improve readme for Llava-1.6 example (#6044 ) Co-authored-by: Jian Liao <jianliao@adobe.com>	2024-03-14 13:18:23 +02:00
Pierrick Hymbert	43241adf22	server: disable debug release type sanitizer, simplify trigger (#6047 ) - increase time out for server - do not fail fast	2024-03-14 13:15:39 +02:00
Georgi Gerganov	a44bc969e4	llama : fix typo	2024-03-14 13:13:06 +02:00
Michael Podvitskiy	2c4fb69246	llama : optimize defrag moves + fix fragmentation calculation (#6037 ) * attempt to reduce the impact of a worst-case scenario * fragmentation calculation fix * Update llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-14 12:56:48 +02:00
Ondřej Čertík	3ca23481dd	gguf-py : add support for I8, I16 and I32 (#6045 ) * Refactor dtype handling to be extensible This code is equivalent as before, but now it is prepared to easily add more NumPy dtypes. * Add support for I8, I16 and I32 These types are allowed in the GGUF specification. * Add support for I8, I16 and I32 to gguf_writer * Add support for I8, I16, I32 to gguf_reader	2024-03-14 12:40:14 +02:00
Georgi Gerganov	3fe8d7a17f	ggml : designate enum vals for integer types (#6050 )	2024-03-14 12:38:37 +02:00
Georgi Gerganov	68265ebfc6	embedding : print all resulting embeddings (#899 )	2024-03-14 12:37:20 +02:00
Georgi Gerganov	381da2d9f0	metal : build metallib + fix embed path (#6015 ) * metal : build metallib + fix embed path ggml-ci * metal : fix embed build + update library load logic ggml-ci * metal : fix embedded library build ggml-ci * ci : fix iOS builds to use embedded library	2024-03-14 11:55:23 +02:00
Georgi Gerganov	0fd6c1f015	embedding : print cosine similarity (#899 )	2024-03-14 10:12:29 +02:00
Linwei Wang	19885d205e	readme : update details about running llama in Termux on Android (#6039 )	2024-03-13 20:34:40 +02:00
Georgi Gerganov	76a936c893	readme : update API changes and hot topics	2024-03-13 20:33:56 +02:00
Clint Herron	463628372d	grammar : handle missing "root" node (#6004 )	2024-03-13 20:10:40 +02:00
slaren	f30ea47a87	llama : add pipeline parallelism support (#6017 ) * llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs ggml-ci * server : add -ub, --ubatch-size parameter * fix server embedding test * llama : fix Mamba inference for pipeline parallelism Tested to work correctly with both `main` and `parallel` examples. * llama : limit max batch size to n_batch * add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism default increase to 4 (from 2) changing this value may improve performance for some systems, but increases memory usage * fix hip build * fix sycl build (disable cpy_tensor_async) * fix hip build * llama : limit n_batch and n_ubatch to n_ctx during context creation * llama : fix norm backend * batched-bench : sync after decode * swiftui : sync after decode * ggml : allow ggml_get_rows to use multiple threads if they are available * check n_ubatch >= n_tokens with non-causal attention * llama : do not limit n_batch to n_ctx with non-causal attn * server : construct batch with size of llama_n_batch * ggml_backend_cpu_graph_compute : fix return value when alloc fails * llama : better n_batch and n_ubatch comment * fix merge * small fix * reduce default n_batch to 2048 --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-13 18:54:21 +01:00
slaren	d8fd0ccf6a	test-backend-ops : skip CPU backend by default (#6028 )	2024-03-13 15:58:30 +02:00
AidanBeltonS	b3d978600f	Update get version (#6025 )	2024-03-13 18:47:54 +05:30
Xuan Son Nguyen	99b71c068f	Server: Use multi-task for embeddings endpoint (#6001 ) * use multitask for embd endpoint * specify types * remove redundant {"n_predict", 0}	2024-03-13 11:39:11 +01:00
slaren	306d34be7a	ci : remove tidy-review (#6021 )	2024-03-12 17:55:19 +02:00
Georgi Gerganov	8030da7afe	ggml : reuse quantum structs across backends (#5943 ) * ggml : reuse quant blocks across backends ggml-ci * ggml : define helper constants only for CUDA and SYCL ggml-ci * ggml : define helper quantum constants for SYCL ggml-ci	2024-03-12 14:27:20 +02:00
Georgi Gerganov	184215e783	ggml : fix UB in IQ2_S and IQ3_S (#6012 )	2024-03-12 13:49:55 +02:00
Georgi Gerganov	48358b2e5b	sycl : update IQ1_S kernels (WIP - not working!) (#5995 ) * sycl : try to fix after IQ1_S changes * sycl : iq1s_grid -> iq1s_grid_gpu * sycl : fix grid type	2024-03-12 11:15:05 +02:00
gliptic	5cdb371731	grammar : fix unnecessarily retained pointer to rules (#6003 )	2024-03-11 21:59:03 +02:00
Kawrakow	44ca159faf	1.5 bit: we can do even better (#5999 ) * iq1_s: we can do even better Spent one of the 4 scale bits on the sign of a 0.125 shift. I.e., quants are now -1 + delta, delta, 1 + delta, where delta is +/- 0.125. CUDA works, same performance as before. PPL(LLaMA-v2-7B) is now 11.85! * iq1_s: make scalar and AVX2 work with the new version * iq1_s: make Neon work with new version. ~10% drop in performance, so will need some more work. * iq1_s: make Metal work with new version * iq1_s: very slightly faster dequantize on Metal * iq1_s: fix dequantize on the CPU --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-11 17:53:15 +02:00
Georgi Gerganov	05b06210c9	llama : more consistent names of count variables (#5994 ) * llama : more consistent names of count variables ggml-ci * llama : n_parallel -> n_seq_max * common : fix param name * examples : fix param name	2024-03-11 17:49:47 +02:00
Georgi Gerganov	83796e62bc	llama : refactor unicode stuff (#5992 ) * llama : refactor unicode stuff ggml-ci * unicode : names * make : fix c++ compiler * unicode : names * unicode : straighten tables * zig : fix build * unicode : put nfd normalization behind API ggml-ci * swift : fix build * unicode : add BOM * unicode : add <cstdint> ggml-ci * unicode : pass as cpts as const ref	2024-03-11 17:47:47 +02:00
Jakub N	828defefb6	Update server docker image URLs (#5997 )	2024-03-11 14:40:42 +01:00
Xuan Son Nguyen	caa106d4e0	Server: format error to json (#5961 ) * server: format error to json * server: do not crash on grammar error * fix api key test case * revert limit max n_predict * small fix * correct coding style * update completion.js * launch_slot_with_task * update docs * update_slots * update webui * update readme	2024-03-11 10:56:41 +01:00
Michael Podvitskiy	3202361c5b	ggml, ci : Windows ARM runner and build fixes (#5979 ) * windows arm ci * fix `error C2078: too many initializers` with ggml_vld1q_u32 macro for MSVC ARM64 * fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned` * fix `error C2065: '__fp16': undeclared identifier`	2024-03-11 11:28:51 +02:00
Minsoo Cheong	332bdfd798	server : maintain chat completion id for streaming responses (#5988 ) * server: maintain chat completion id for streaming responses * Update examples/server/utils.hpp * Update examples/server/utils.hpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-11 10:09:32 +02:00
Gilad S	ecab1c75de	cmake : fix subdir for `LLAMA_METAL_EMBED_LIBRARY` (#5985 )	2024-03-11 10:00:08 +02:00
Georgi Gerganov	ee35600b90	llama : fix F16/F32 downcast + improve names (#5980 )	2024-03-11 09:56:47 +02:00
Kawrakow	be858f6205	Better 1.5 bit quantization (#5971 ) * Trying blocks of 16 for IQ1_S - seems slightly better * iq1s_blocks16: Adjust scale fudge factor to 1.125 * iq1s_blocks16: going to blocks of 32 with 2048 lattice points, so same bpw. This is even better than blocks of 16. Should I try blocks of 64? But to keep the same bpw, when I go to 4096 lattice points, I need to remove blocks altogether and just have superblocks of 256 weights. * iq1s_blocks16: Use 2<x^2> as sigma2 in weight adjustment iq1s_blocks16: scalar and AVX2 dot products * iq1s_blocks16: CUDA dot product * iq1s_blocks16: Metal works, Neon does not Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s). Not seeing the bug in the Neon implementation for now. * iq1s_blocks16: fixed Neon * iq1s_blocks16: very slightly faster TG on Metal Still pathetic at 37 t/s * iq1s_blocks16: speedup Metal by packing codebook into uint32_t's * Formatting * iq1s_blocks16: uint32_t codebook is also better in CUDA TG-128 is now 204 t/s up from 194 t/s. PP-512 is 5890 t/s, so significantly better than other quants * iq1s_blocks16: slightly faster Neon dot product * iq1s_blocks16: faster AVX2 dot product * iq1s_blocks16: adjust to ggml-common.h --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-11 07:51:49 +01:00
Abhilash Majumder	ef3ced26a3	[SYCL] Add q3_s and q1_s (#5886 ) * Add q3_s and q1_s * fix compilation * fix build * fix build * fix build * enable ops * rm macro * increase grid space	2024-03-11 10:27:56 +05:30
AidanBeltonS	3814a07392	[SYCL] Add support for SYCL Nvidia target (#5738 ) * Add support for nvidia target in CMake * Update sycl read-me for Nvidia target * Fix errors	2024-03-11 09:13:57 +08:00
Georgi Gerganov	bb6d00bbf9	metal : move mm_id indices to shared mem (#5982 )	2024-03-10 23:12:48 +02:00
Dean	7ab7b733bb	android : fix utf8 decoding error (#5935 ) * examples: fix utf8 decoding error some models have a tokenizer that decodes an id into an incomplete utf8 sequence, need to validate and wait for next token one example would be: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_0.gguf and an example of the token is 18137 * android : minor --------- Co-authored-by: zhangfuwen <zhangfuwen@foxmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-10 22:03:17 +02:00
Georgi Gerganov	d9f65c97c3	readme : update hot topics	2024-03-10 20:58:26 +02:00
Georgi Gerganov	b838b53ad6	sync : ggml	2024-03-10 20:10:46 +02:00
Georgi Gerganov	df4dc3e7cb	ggml : try fix 32-bit arm compat (whisper/1938) * ggml : try fix 32-bit arm compat * ggml : fix cont	2024-03-10 20:10:39 +02:00
Georgi Gerganov	bf47a5eefc	ggml : remove __constant__ specifier for CUDA tables (#5940 )	2024-03-10 20:09:24 +02:00
Pierrick Hymbert	fa8a809a91	server: ci: windows build and tests (#5968 ) * server: ci: windows build and tests * server: ci: remove tmp push branch * server: ci: EOF EOL * Use builti Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * server: tests: server graceful shutdown, then kill, then hard kill * server: tests: remove python2 unicode string * server: tests: remove wrong comment on server starting, close_fds is always true * server: tests: server kill, if pid exists * server: tests: remove dependency to killall * server: tests: ci windows: pid exists better handling --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-03-10 18:17:47 +01:00
DAN™	bcebd7dbf6	llama : add support for GritLM (#5959 ) * add gritlm example * gritlm results match * tabs to spaces * comment out debug printing * rebase to new embed * gritlm embeddings are back babeee * add to gitignore * allow to toggle embedding mode * Clean-up GritLM sample code. * Fix types. * Flush stdout and output ending newline if streaming. * mostly style fixes; correct KQ_mask comment * add causal_attn flag to llama_cparams * gritml : minor * llama : minor --------- Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-10 17:56:30 +02:00
Clint Herron	2960eae847	grammar : verify parsed state (#5950 )	2024-03-10 17:17:43 +02:00
Georgi Gerganov	c78541479c	nix: update flake.lock (#5969 ) Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29) → 'github:NixOS/nixpkgs/9df3e30ce24fd28c7b3e2de0d986769db5d6225d' (2024-03-06) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-03-10 16:43:08 +02:00
Pierrick Hymbert	621e86b331	server: benchmark: chat/completions scenario and other llm servers comparison (#5941 ) * server: bench: Init a bench scenario with K6 See #5827 * server: bench: EOL EOF * server: bench: PR feedback and improved k6 script configuration * server: bench: remove llamacpp_completions_tokens_seconds as it include prompt processing time and it's misleading server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS server: bench: increase truncated rate to 80% before failing * server: bench: fix doc * server: bench: change gauge custom metrics to trend * server: bench: change gauge custom metrics to trend server: bench: add trend custom metrics for total tokens per second average * server: bench: doc add an option to debug http request * server: bench: filter dataset too short and too long sequences * server: bench: allow to filter out conversation in the dataset based on env variable * server: bench: fix assistant message sent instead of user message * server: bench: fix assistant message sent instead of user message * server : add defrag thold parameter * server: bench: select prompts based on the current iteration id not randomly to make the bench more reproducible --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-09 23:41:49 +01:00
Georgi Gerganov	77d1ac7e00	server : print chat template info	2024-03-09 22:04:00 +02:00
slaren	d894f352bf	perplexity : support using multiple sequences to allow larger batch sizes (#5946 ) * perplexity : support using multiple sequences to allow larger batch sizes ggml-ci * set cparams.n_parallel to the number of sequences * print tested n_ctx, add assert	2024-03-09 19:55:54 +01:00
Georgi Gerganov	098dbaab44	readme : update hot topics	2024-03-09 18:14:13 +02:00
Georgi Gerganov	8380ecfb21	ggml : fix unnecessary f32 -> f16 -> f32 casts (mmla) (#5951 )	2024-03-09 17:36:20 +02:00

2643 Commits