Alexey Parfenov
213d1439fa
server : remove model.json endpoint ( #5371 )
2024-02-06 20:08:38 +02:00
Johannes Gäßler
17c97fb062
CUDA: mul_mat_vec_q max. batch size 8 -> 4 ( #5370 )
2024-02-06 19:43:06 +02:00
Kawrakow
b08f22c882
Update README.md ( #5366 )
...
Add some links to quantization related PRs
2024-02-06 19:00:16 +02:00
Kawrakow
f57fadc009
Slight quantization improvement for Q4_K and Q5_K ( #5361 )
...
* Q4_K: slightly better quantization
* Q5_K: slightly better quantization
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-06 17:28:02 +02:00
BarfingLemurs
2e9c0bd6b3
readme : add phi, orion 14b, internlm2, and yi-VL to readme ( #5362 )
2024-02-06 16:06:48 +02:00
Johannes Gäßler
2c516611f1
CUDA: mul_mat_vec_q for batch sizes > 1 ( #5351 )
2024-02-06 14:44:06 +01:00
Justin Parker
8a79c591de
server : include total "num_slots" in props endpoint ( #5349 )
2024-02-06 11:20:59 +02:00
Michael Coppola
31e7903221
server : add `dynatemp_range` and `dynatemp_exponent` ( #5352 )
...
* server: added `dynatemp_range` and `dynatemp_exponent`
* Update README.md
---------
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-02-06 11:20:00 +02:00
Niall Coates
4ffc7a17d4
server : various fixes for the prompt field in /completion ( #5300 )
...
server : fix deadlock when prompt array contains strings and numbers
server : removed an unnecessary generation when generating multi-prompts
server : removed an unnecessary assert
2024-02-06 10:16:23 +02:00
Georgi Gerganov
906cff55c2
py : handle byte tokens in `get_token_type` ( #5341 )
...
* py : handle byte tokens in `get_token_type`
* py : fix empty bytes arg
2024-02-06 07:47:22 +02:00
Johannes Gäßler
098f6d737b
make: Use ccache for faster compilation ( #5318 )
...
* make: Use ccache for faster compilation
2024-02-05 19:33:00 +01:00
Johannes Gäßler
78b00dda6c
README: updated introduction ( #5343 )
...
* README: updated introduction
* readme : update
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-05 15:55:10 +01:00
Kawrakow
c6b395535a
ggml : make use of ggml-quants.h possible in C++ code ( #5338 )
...
* Make use of ggml-quants.h possible in C++ code
* One cannot possibly be defining static_assert in a C++ compilation
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-05 14:09:47 +02:00
Dr. Tom Murphy VII Ph.D
abb61944a5
ggml : avoid duplicating function calls using MIN/MAX macros ( #5325 )
...
* Avoid duplicating function calls when using MIN/MAX macros.
Since these copy "a" and "b" they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice.
By explicitly evaluating the expression first, we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:
https://godbolt.org/z/Ee4KMrvKh
Code behaves exactly the same.
* Update ggml.c
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-05 13:13:57 +02:00
Kawrakow
89503dcb5f
iq3_xxs: guards for the no-imatrix situation ( #5334 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-05 12:32:27 +02:00
Guoteng
7e1ae372f3
py : fix internlm2-hf convert to gguf ( #5305 )
...
* py : fix internlm2-hf convert to gguf
* ggml-ci
2024-02-05 11:04:06 +02:00
Kawrakow
6fdfa2ecc6
iq2_xxs: tune quantization ( #5320 )
...
We get slightly better PPL, and we cut quantization time in
nearly half.
The trick is to first quantize without forcing points onto the E8-lattice.
We can then use a narrower search range around the block scale that we
got that way.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-05 10:46:06 +02:00
Alexey Parfenov
a2d60c9158
server : allow to get default generation settings for completion ( #5307 )
2024-02-05 10:10:22 +02:00
l3utterfly
e6f8177532
common : add dynamic temperature parameters to main example cli ( #5295 )
...
* added dynamic temp params in main
* added help text
2024-02-05 10:00:47 +02:00
Georgi Gerganov
30679d438d
scripts : fix typos, cleanup ( #5303 )
2024-02-05 09:48:03 +02:00
Нияз Гарифзянов
4be04c8965
scripts : add non-interactive server-llm.sh ( #5303 )
...
* Update server-llm.sh
Add a --non-interactive flag that allows running the script without asking for permission
* Update scripts/server-llm.sh
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-05 09:43:57 +02:00
chiranko
5d55b0cd82
readme : add CodeShell models to the supported models list ( #5330 )
2024-02-05 09:41:38 +02:00
AidanBeltonS
4833ac209d
[SYCL] Fix cpy with dims of 3 ( #5289 )
...
* Fix cpy with dims of 3
* rm asserts
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-05 12:38:24 +05:30
github-actions[bot]
9392ebd49e
flake.lock: Update
...
Flake lock file updates:
• Updated input 'flake-parts':
'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
→ 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
• Updated input 'flake-parts/nixpkgs-lib':
'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
→ 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
→ 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
2024-02-04 08:45:35 -08:00
Georgi Gerganov
1846e92a90
cuda : minor
2024-02-04 11:01:01 +02:00
Kawrakow
5ed26e1fc9
Adding some imatrix tools ( #5302 )
...
* imatrix: adding --combine and --continue-from
* imatrix: be able to start from a specific chunk
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-04 10:39:58 +02:00
Welby Seely
277fad30c6
cmake : use set() for LLAMA_WIN_VER ( #5298 )
...
option() is specifically for booleans.
Fixes #5158
2024-02-03 23:18:51 -05:00
Johannes Gäßler
3c0d25c475
make: add nvcc info print ( #5310 )
2024-02-03 20:15:13 +01:00
Johannes Gäßler
3cc5ed353c
make: fix nvcc optimization flags for host code ( #5309 )
2024-02-03 20:14:59 +01:00
Martin Schwaighofer
60ecf099ed
add Vulkan support to Nix flake
2024-02-03 13:13:07 -06:00
0cc4m
e920ed393d
Vulkan Intel Fixes, Optimizations and Debugging Flags ( #5301 )
...
* Fix Vulkan on Intel ARC
Optimize matmul for Intel ARC
Add Vulkan dequant test
* Add Vulkan debug and validate flags to Make and CMakeLists.txt
* Enable asynchronous transfers in Vulkan backend
* Fix flake8
* Disable Vulkan async backend functions for now
* Also add Vulkan run tests command to Makefile and CMakeLists.txt
2024-02-03 18:15:00 +01:00
Georgi Gerganov
ef68fac2a8
cuda : fix matrix names
2024-02-03 18:36:58 +02:00
Georgi Gerganov
cfd9732b2e
cuda : simplify softmax
2024-02-03 18:31:55 +02:00
Georgi Gerganov
e04ff39181
cuda : fix -INF block check
2024-02-03 16:57:46 +02:00
Georgi Gerganov
5b263dd83a
cuda : unroll Q*K^T loop
2024-02-03 16:12:20 +02:00
Georgi Gerganov
3b1c4e7673
cuda : speed-up reduce part of the kernel
2024-02-03 15:36:05 +02:00
Georgi Gerganov
a7b471569b
cuda : switch to 1 warp for bs > 16
2024-02-03 15:17:49 +02:00
Georgi Gerganov
b958151e3f
cuda : use half2 in softmax
2024-02-03 15:00:25 +02:00
Georgi Gerganov
c51f27c0db
cuda : avoid __hisinf branches
2024-02-03 14:27:36 +02:00
Georgi Gerganov
92472ea22c
cuda : unroll some of the loops
2024-02-03 14:10:01 +02:00
Georgi Gerganov
1f8a592482
cuda : make loops use the same loop values
...
Thanks Johannes again for the tip
2024-02-03 14:01:32 +02:00
Georgi Gerganov
7c34655b36
cuda : use int instead of int64_t
...
Noticeably improves performance (thanks to Johannes)
2024-02-03 13:39:46 +02:00
Michael Klimenko
52bb63c708
refactor : switch to emplace_back to avoid extra object ( #5291 )
2024-02-03 13:23:37 +02:00
Jared Van Bortel
1ec3332ade
YaRN : store rope scaling type as int32_t in memory ( #5285 )
...
* YaRN : store rope scaling type as int32_t in memory
* llama : store mapped names as const char *
2024-02-03 13:22:06 +02:00
BADR
6a66c5071a
readme : add tenere in the ui tools list ( #5284 )
2024-02-03 13:20:26 +02:00
Georgi Gerganov
b150abe83e
cuda : avoid warp_reduce for smax
2024-02-03 13:17:47 +02:00
AidanBeltonS
a305dba8ff
Fix im2col with 32fp ( #5286 )
2024-02-03 16:11:37 +08:00
kalomaze
191221178f
perplexity : fix KL divergence calculations on Windows ( #5273 )
2024-02-02 16:15:30 +02:00
Georgi Gerganov
b68a112204
cuda : fix __hisinf() result check
2024-02-02 15:12:28 +02:00
Georgi Gerganov
e437b37fd0
scripts : parse wtype in server-llm.sh ( #5167 )
...
* scripts : parse wtype in server-llm.sh
* scripts : fix check for wfile
2024-02-02 14:23:40 +02:00