llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-02 14:54:35 +00:00

Author	SHA1	Message	Date
Sang-Kil Park	f68664ac24	convert : fix TypeError on GPT-2 vocab.json (#5288)	2024-02-06 23:28:00 -05:00
Alexey Parfenov	213d1439fa	server : remove model.json endpoint (#5371)	2024-02-06 20:08:38 +02:00
Johannes Gäßler	17c97fb062	CUDA: mul_mat_vec_q max. batch size 8 -> 4 (#5370)	2024-02-06 19:43:06 +02:00
Kawrakow	b08f22c882	Update README.md (#5366) Add some links to quantization related PRs	2024-02-06 19:00:16 +02:00
Kawrakow	f57fadc009	Slight quantization improvement for Q4_K and Q5_K (#5361) * Q4_K: slightly better quantization * Q5_K: slightly better quantization --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-06 17:28:02 +02:00
BarfingLemurs	2e9c0bd6b3	readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)	2024-02-06 16:06:48 +02:00
Johannes Gäßler	2c516611f1	CUDA: mul_mat_vec_q for batch sizes > 1 (#5351)	2024-02-06 14:44:06 +01:00
Justin Parker	8a79c591de	server : include total "num_slots" in props endpoint (#5349)	2024-02-06 11:20:59 +02:00
Michael Coppola	31e7903221	server : add `dynatemp_range` and `dynatemp_exponent` (#5352) * server: added `dynatemp_range` and `dynatemp_exponent` * Update README.md --------- Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2024-02-06 11:20:00 +02:00
Niall Coates	4ffc7a17d4	server : various fixes for the prompt field in /completion (#5300) server : fix deadlock when prompt array contains strings and numbers server : removed an unnecessary generation when generating multi-prompts server : removed an unnecessary assert	2024-02-06 10:16:23 +02:00
Georgi Gerganov	906cff55c2	py : handle byte tokens in `get_token_type` (#5341) * py : handle byte tokens in `get_token_type` * py : fix empty bytes arg	2024-02-06 07:47:22 +02:00
Johannes Gäßler	098f6d737b	make: Use ccache for faster compilation (#5318) * make: Use ccache for faster compilation	2024-02-05 19:33:00 +01:00
Johannes Gäßler	78b00dda6c	README: updated introduction (#5343) * README: updated introduction * readme : update --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-05 15:55:10 +01:00
Kawrakow	c6b395535a	ggml : make use of ggml-quants.h possible in C++ code (#5338) * Make use of ggml-quants.h possible in C++ code * One cannot possibly be defining static_assert in a C++ compilation --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-05 14:09:47 +02:00
Dr. Tom Murphy VII Ph.D	abb61944a5	ggml : avoid duplicating function calls using MIN/MAX macros (#5325) * Avoid duplicating function calls when using MIN/MAX macros. Since these copy "a" and "b" they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice. By explicitly evaluating at the expression we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer: https://godbolt.org/z/Ee4KMrvKh Code behaves exactly the same. * Update ggml.c --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-05 13:13:57 +02:00
Kawrakow	89503dcb5f	iq3_xxs: guards for the no-imatrix situation (#5334) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-05 12:32:27 +02:00
Guoteng	7e1ae372f3	py : fix internlm2-hf convert to gguf (#5305) * py : fix internlm2-hf convert to gguf * ggml-ci	2024-02-05 11:04:06 +02:00
Kawrakow	6fdfa2ecc6	iq2_xxs: tune quantization (#5320) We get slightly better PPL, and we cut quantization time in nearly half. The trick is to 1st quantize without forcing points onto the E8-lattice. We can then use a narrower search range around the block scale that we got that way. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-05 10:46:06 +02:00
Alexey Parfenov	a2d60c9158	server : allow to get default generation settings for completion (#5307)	2024-02-05 10:10:22 +02:00
l3utterfly	e6f8177532	common : add dynamic temperature parameters to main example cli (#5295) * added dynamic temp params in main * added help text	2024-02-05 10:00:47 +02:00
Georgi Gerganov	30679d438d	scripts : fix typos, cleanup (#5303)	2024-02-05 09:48:03 +02:00
Нияз Гарифзянов	4be04c8965	scripts : add non-interactive server-llm.sh (#5303) * Update server-llm.sh Add flag --non-interactive that allows running the script without asking for permission * Update scripts/server-llm.sh --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-05 09:43:57 +02:00
chiranko	5d55b0cd82	readme : add CodeShell models to the supported models list (#5330)	2024-02-05 09:41:38 +02:00
AidanBeltonS	4833ac209d	[SYCL] Fix cpy with dims of 3 (#5289) * Fix cpy with dims of 3 * rm asserts --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-02-05 12:38:24 +05:30
github-actions[bot]	9392ebd49e	flake.lock: Update Flake lock file updates: • Updated input 'flake-parts': 'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11) → 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01) • Updated input 'flake-parts/nixpkgs-lib': 'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30) → 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29) • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25) → 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)	2024-02-04 08:45:35 -08:00
Georgi Gerganov	1846e92a90	cuda : minor	2024-02-04 11:01:01 +02:00
Kawrakow	5ed26e1fc9	Adding some imatrix tools (#5302) * imatrix: adding --combine and --continue-from * imatrix: be able to start from a specific chunk --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-04 10:39:58 +02:00
Welby Seely	277fad30c6	cmake : use set() for LLAMA_WIN_VER (#5298) option() is specifically for booleans. Fixes #5158	2024-02-03 23:18:51 -05:00
Johannes Gäßler	3c0d25c475	make: add nvcc info print (#5310)	2024-02-03 20:15:13 +01:00
Johannes Gäßler	3cc5ed353c	make: fix nvcc optimization flags for host code (#5309)	2024-02-03 20:14:59 +01:00
Martin Schwaighofer	60ecf099ed	add Vulkan support to Nix flake	2024-02-03 13:13:07 -06:00
0cc4m	e920ed393d	Vulkan Intel Fixes, Optimizations and Debugging Flags (#5301) * Fix Vulkan on Intel ARC Optimize matmul for Intel ARC Add Vulkan dequant test * Add Vulkan debug and validate flags to Make and CMakeLists.txt * Enable asynchronous transfers in Vulkan backend * Fix flake8 * Disable Vulkan async backend functions for now * Also add Vulkan run tests command to Makefile and CMakeLists.txt	2024-02-03 18:15:00 +01:00
Georgi Gerganov	ef68fac2a8	cuda : fix matrix names	2024-02-03 18:36:58 +02:00
Georgi Gerganov	cfd9732b2e	cuda : simplify softmax	2024-02-03 18:31:55 +02:00
Georgi Gerganov	e04ff39181	cuda : fix -INF block check	2024-02-03 16:57:46 +02:00
Georgi Gerganov	5b263dd83a	cuda : unroll Q*K^T loop	2024-02-03 16:12:20 +02:00
Georgi Gerganov	3b1c4e7673	cuda : speed-up reduce part of the kernel	2024-02-03 15:36:05 +02:00
Georgi Gerganov	a7b471569b	cuda : switch to 1 warp for bs > 16	2024-02-03 15:17:49 +02:00
Georgi Gerganov	b958151e3f	cuda : use half2 in softmax	2024-02-03 15:00:25 +02:00
Georgi Gerganov	c51f27c0db	cuda : avoid __hisinf branches	2024-02-03 14:27:36 +02:00
Georgi Gerganov	92472ea22c	cuda : unroll some of the loops	2024-02-03 14:10:01 +02:00
Georgi Gerganov	1f8a592482	cuda : make loops use the same loop values Thanks Johannes again for the tip	2024-02-03 14:01:32 +02:00
Georgi Gerganov	7c34655b36	cuda : use int instead of int64_t Noticeably improves performance (thanks to Johannes)	2024-02-03 13:39:46 +02:00
Michael Klimenko	52bb63c708	refactor : switch to emplace_back to avoid extra object (#5291)	2024-02-03 13:23:37 +02:00
Jared Van Bortel	1ec3332ade	YaRN : store rope scaling type as int32_t in memory (#5285) * YaRN : store rope scaling type as int32_t in memory * llama : store mapped names as const char *	2024-02-03 13:22:06 +02:00
BADR	6a66c5071a	readme : add tenere in the ui tools list (#5284)	2024-02-03 13:20:26 +02:00
Georgi Gerganov	b150abe83e	cuda : avoid warp_reduce for smax	2024-02-03 13:17:47 +02:00
AidanBeltonS	a305dba8ff	Fix im2col with 32fp (#5286)	2024-02-03 16:11:37 +08:00
kalomaze	191221178f	perplexity : fix KL divergence calculations on Windows (#5273)	2024-02-02 16:15:30 +02:00
Georgi Gerganov	b68a112204	cuda : fix __hisinf() result check	2024-02-02 15:12:28 +02:00


2194 Commits