Commit Graph

3140 Commits

Author SHA1 Message Date
0cc4m
ee1628bdfe
Basic Vulkan Multi-GPU implementation (#5321)
* Initial Vulkan multi-gpu implementation

Move most global variables into backend context

* Add names to backend device functions

* Add further missing cleanup code

* Reduce code duplication in tensor split layer assignment

* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h

* Only do device info print in the beginning and initialize one backend for cpu assist

Add missing cleanup code

* Rework backend memory management to make sure devices and buffers get properly allocated and freed

* Rename cpu assist free function

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-07 07:54:50 +01:00
Eve
ed0bf32290
readme : modernize (#5379)
* first cleanup, update everything to Llama 2 and remove outdated content

* Delete SHA256SUMS

* make build instructions generic

* recommend Q4_K_M quantization method

* Update README.md
2024-02-07 08:21:30 +02:00
Ben Williams
9a697d842b
readme : update ui list (#5354) 2024-02-07 08:16:48 +02:00
runfuture
316c7faf77
llama : add MiniCPM support (#5346)
* support minicpm arch.

* fix tab/space typo.

* convert minicpm model via convert-hf-gguf.py

* try to make tokenizer work

* fix bug for quantize minicpm

* fix for flake8 lint

* remove convert-minicpm.py

* fix for editorconfig

* correct minicpm model type (size)

* constants expanded for minicpm

* Minor change of the constant names for minicpm
2024-02-07 08:15:56 +02:00
Justin Parker
f3e2b4fa3f
server : update /props with "total_slots" value (#5373)
* include total "num_slots" in default_generation_settings_for_props

* cleanup total_slots return value in /props endpoint

* update /props endpoint docs with total_slots

* remove num_slots from default_generation_settings_for_props

* update /props endpoint section
2024-02-07 08:15:19 +02:00
Sang-Kil Park
f68664ac24
convert : fix TypeError on GPT-2 vocab.json (#5288) 2024-02-06 23:28:00 -05:00
Alexey Parfenov
213d1439fa
server : remove model.json endpoint (#5371) 2024-02-06 20:08:38 +02:00
Johannes Gäßler
17c97fb062
CUDA: mul_mat_vec_q max. batch size 8 -> 4 (#5370) 2024-02-06 19:43:06 +02:00
Kawrakow
b08f22c882
Update README.md (#5366)
Add some links to quantization related PRs
2024-02-06 19:00:16 +02:00
Kawrakow
f57fadc009
Slight quantization improvement for Q4_K and Q5_K (#5361)
* Q4_K: slightly better quantization

* Q5_K: slightly better quantization

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-06 17:28:02 +02:00
BarfingLemurs
2e9c0bd6b3
readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362) 2024-02-06 16:06:48 +02:00
Johannes Gäßler
2c516611f1
CUDA: mul_mat_vec_q for batch sizes > 1 (#5351) 2024-02-06 14:44:06 +01:00
Justin Parker
8a79c591de
server : include total "num_slots" in props endpoint (#5349) 2024-02-06 11:20:59 +02:00
Michael Coppola
31e7903221
server : add dynatemp_range and dynatemp_exponent (#5352)
* server: added `dynatemp_range` and `dynatemp_exponent`

* Update README.md

---------

Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-02-06 11:20:00 +02:00
Niall Coates
4ffc7a17d4
server : various fixes for the prompt field in /completion (#5300)
server : fix deadlock when prompt array contains strings and numbers

server : removed an unnecessary generation when generating multi-prompts

server : removed an unnecessary assert
2024-02-06 10:16:23 +02:00
Georgi Gerganov
906cff55c2
py : handle byte tokens in get_token_type (#5341)
* py : handle byte tokens in `get_token_type`

* py : fix empty bytes arg
2024-02-06 07:47:22 +02:00
Johannes Gäßler
098f6d737b
make: Use ccache for faster compilation (#5318)
* make: Use ccache for faster compilation
2024-02-05 19:33:00 +01:00
Johannes Gäßler
78b00dda6c
README: updated introduction (#5343)
* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-05 15:55:10 +01:00
Kawrakow
c6b395535a
ggml : make use of ggml-quants.h possible in C++ code (#5338)
* Make use of ggml-quants.h possible in C++ code

* One cannot possibly be defining static_assert in a C++ compilation

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-05 14:09:47 +02:00
Dr. Tom Murphy VII Ph.D
abb61944a5
ggml : avoid duplicating function calls using MIN/MAX macros (#5325)
* Avoid duplicating function calls when using MIN/MAX macros.

Since these copy "a" and "b" they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice.
By explicitly evaluating the expression first, we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:

https://godbolt.org/z/Ee4KMrvKh

Code behaves exactly the same.

* Update ggml.c

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-05 13:13:57 +02:00
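The double evaluation described in the MIN/MAX commit above (#5325) can be reproduced with a small sketch. This is not the code from ggml.c: `expensive`, `max_via_macro`, `max_via_local`, and `count_calls` are hypothetical names invented for illustration, and the `MAX` macro is assumed to be the usual ternary form.

```c
#include <assert.h>

/* Classic ternary macro: the argument that wins the comparison is
   expanded, and therefore evaluated, a second time. */
#define MAX(a, b) ((a) > (b) ? (a) : (b))

static int call_count = 0;

/* Hypothetical stand-in for a function call with non-trivial cost. */
static float expensive(float x) {
    call_count++;
    return x * 2.0f;
}

/* The call is passed straight to the macro, so it can run twice. */
float max_via_macro(float x) {
    return MAX(0.0f, expensive(x));
}

/* The call is evaluated once into a local before the macro is used,
   which is the approach the commit message describes. */
float max_via_local(float x) {
    const float v = expensive(x);
    return MAX(0.0f, v);
}

/* Returns how many times expensive() ran during one call of f. */
int count_calls(float (*f)(float), float x) {
    call_count = 0;
    (void) f(x);
    return call_count;
}
```

With a positive input, `count_calls(max_via_macro, 1.0f)` reports two calls while `count_calls(max_via_local, 1.0f)` reports one; both variants return the same value.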
Kawrakow
89503dcb5f
iq3_xxs: guards for the no-imatrix situation (#5334)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-05 12:32:27 +02:00
Guoteng
7e1ae372f3
py : fix internlm2-hf convert to gguf (#5305)
* py : fix internlm2-hf convert to gguf

* ggml-ci
2024-02-05 11:04:06 +02:00
Kawrakow
6fdfa2ecc6
iq2_xxs: tune quantization (#5320)
We get slightly better PPL, and we cut quantization time in
nearly half.

The trick is to 1st quantize without forcing points onto the E8-lattice.
We can then use a narrower search range around the block scale that we
got that way.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-05 10:46:06 +02:00
Alexey Parfenov
a2d60c9158
server : allow to get default generation settings for completion (#5307) 2024-02-05 10:10:22 +02:00
l3utterfly
e6f8177532
common : add dynamic temperature parameters to main example cli (#5295)
* added dynamic temp params in main

* added help text
2024-02-05 10:00:47 +02:00
Georgi Gerganov
30679d438d
scripts : fix typos, cleanup (#5303) 2024-02-05 09:48:03 +02:00
Нияз Гарифзянов
4be04c8965
scripts : add non-interactive server-llm.sh (#5303)
* Update server-llm.sh

Add a --non-interactive flag that allows running the script without asking for permission

* Update scripts/server-llm.sh

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-05 09:43:57 +02:00
chiranko
5d55b0cd82
readme : add CodeShell models to the supported models list (#5330) 2024-02-05 09:41:38 +02:00
AidanBeltonS
4833ac209d
[SYCL] Fix cpy with dims of 3 (#5289)
* Fix cpy with dims of 3

* rm asserts

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-05 12:38:24 +05:30
github-actions[bot]
9392ebd49e
flake.lock: Update
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
  → 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
  → 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
  → 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
2024-02-04 08:45:35 -08:00
Kawrakow
5ed26e1fc9
Adding some imatrix tools (#5302)
* imatrix: adding --combine and --continue-from

* imatrix: be able to start from a specific chunk

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-04 10:39:58 +02:00
Welby Seely
277fad30c6
cmake : use set() for LLAMA_WIN_VER (#5298)
option() is specifically for booleans.

Fixes #5158
2024-02-03 23:18:51 -05:00
Johannes Gäßler
3c0d25c475
make: add nvcc info print (#5310) 2024-02-03 20:15:13 +01:00
Johannes Gäßler
3cc5ed353c
make: fix nvcc optimization flags for host code (#5309) 2024-02-03 20:14:59 +01:00
Martin Schwaighofer
60ecf099ed
add Vulkan support to Nix flake 2024-02-03 13:13:07 -06:00
0cc4m
e920ed393d
Vulkan Intel Fixes, Optimizations and Debugging Flags (#5301)
* Fix Vulkan on Intel ARC

Optimize matmul for Intel ARC

Add Vulkan dequant test

* Add Vulkan debug and validate flags to Make and CMakeLists.txt

* Enable asynchronous transfers in Vulkan backend

* Fix flake8

* Disable Vulkan async backend functions for now

* Also add Vulkan run tests command to Makefile and CMakeLists.txt
2024-02-03 18:15:00 +01:00
Michael Klimenko
52bb63c708
refactor : switch to emplace_back to avoid extra object (#5291) 2024-02-03 13:23:37 +02:00
Jared Van Bortel
1ec3332ade
YaRN : store rope scaling type as int32_t in memory (#5285)
* YaRN : store rope scaling type as int32_t in memory

* llama : store mapped names as const char *
2024-02-03 13:22:06 +02:00
BADR
6a66c5071a
readme : add tenere in the ui tools list (#5284) 2024-02-03 13:20:26 +02:00
AidanBeltonS
a305dba8ff
Fix im2col with 32fp (#5286) 2024-02-03 16:11:37 +08:00
kalomaze
191221178f
perplexity : fix KL divergence calculations on Windows (#5273) 2024-02-02 16:15:30 +02:00
Georgi Gerganov
e437b37fd0
scripts : parse wtype in server-llm.sh (#5167)
* scripts : parse wtype in server-llm.sh

* scripts : fix check for wfile
2024-02-02 14:23:40 +02:00
Mirror Azure
2d40085c26
py : add check for '.attn.masked_bias' layers to GPT2model (#5281) 2024-02-02 13:39:09 +02:00
AidanBeltonS
b05102fe8c
Tidy ggml-sycl (#5261)
* Tidy some code in ggml-sycl

* Remove blank space

* Remove std::printf comments

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-02 16:39:48 +08:00
Xuan Son Nguyen
6b91b1e0a9
docker : add build for SYCL, Vulkan + update readme (#5228)
* add vulkan dockerfile

* intel dockerfile: compile sycl by default

* fix vulkan dockerfile

* add docs for vulkan

* docs: sycl build in docker

* docs: remove trailing spaces

* docs: sycl: add docker section

* docs: clarify install vulkan SDK outside docker

* sycl: use intel/oneapi-basekit docker image

* docs: correct TOC

* docs: correct docker image for Intel oneMKL
2024-02-02 09:56:31 +02:00
Meng, Hengyu
e805f0fa99
[SYCL] get MAX_MEM_ALLOC from device property (#5270)
* get max alloc size from device prop

* fix macro typo
2024-02-02 15:54:14 +08:00
Neo Zhang Jianyu
af3ba5d946
[SYCL] update guide of SYCL backend (#5254)
* update guide for make installation, memory, gguf model link, rm todo for windows build

* add vs install requirement

* update for gpu device check

* update help of llama-bench

* fix grammar issues
2024-02-02 15:53:27 +08:00
Ian Bull
e1e721094d
llama : fix memory leak in llama_batch_free (#5252)
The llama_batch_init allocates memory for a fixed number of tokens.
However, the llama_batch_free only frees memory for the number of
tokens that were added to the batch.

This change-set uses a null terminated array for the batch seq_id, and
frees all the elements until the nullptr is reached. This change-set
also changes the name of the first parameter from `n_tokens` to
`n_tokens_alloc` to more clearly indicate that this value is the number
of tokens allocated to the batch, not the number of tokens in the batch.
2024-02-02 09:20:13 +02:00
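The fix described in the llama_batch_free commit above (#5252) can be sketched as follows. This is a simplified illustration, not the actual llama.cpp code: `toy_batch`, `toy_batch_init`, `toy_batch_free`, and the allocation counter are hypothetical, and only the per-token `seq_id` arrays are modeled. Allocation sizes are assumed to be positive.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified batch: only the per-token seq_id arrays. */
struct toy_batch {
    int32_t   n_tokens; /* tokens currently in use; may be fewer than allocated */
    int32_t **seq_id;   /* n_tokens_alloc entries followed by a NULL terminator */
};

static int live_allocs = 0; /* crude leak counter, for illustration only */

static void *counted_malloc(size_t size) {
    void *p = malloc(size);
    if (p == NULL) {
        abort(); /* keep the sketch simple: no recovery path */
    }
    live_allocs++;
    return p;
}

static void counted_free(void *p) {
    if (p != NULL) {
        live_allocs--;
        free(p);
    }
}

/* Allocates room for n_tokens_alloc tokens and NULL-terminates the array. */
struct toy_batch toy_batch_init(int32_t n_tokens_alloc, int32_t n_seq_max) {
    struct toy_batch batch = { 0, NULL };
    batch.seq_id = (int32_t **) counted_malloc(sizeof(int32_t *) * (size_t) (n_tokens_alloc + 1));
    for (int32_t i = 0; i < n_tokens_alloc; ++i) {
        batch.seq_id[i] = (int32_t *) counted_malloc(sizeof(int32_t) * (size_t) n_seq_max);
    }
    batch.seq_id[n_tokens_alloc] = NULL;
    return batch;
}

/* Frees every allocated entry by walking to the terminator, instead of
   stopping at batch.n_tokens (which would leak the unused entries). */
void toy_batch_free(struct toy_batch batch) {
    if (batch.seq_id != NULL) {
        for (int32_t i = 0; batch.seq_id[i] != NULL; ++i) {
            counted_free(batch.seq_id[i]);
        }
        counted_free(batch.seq_id);
    }
}
```

Because the free routine no longer depends on `n_tokens`, a batch allocated for 8 tokens but filled with only 3 still releases all of its allocations.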
Neo Zhang Jianyu
128dcbd3c9
add --no-mmap in llama-bench (#5257)
* add --no-mmap, show sycl backend

* fix conflict

* fix code format, change print for --no-mmap

* ren no_mmap to mmap, show mmap when not default value in printer

* update guide for mmap

* mv position to reduce model reload
2024-02-01 20:48:53 +01:00
0cc4m
4d0924a890
Vulkan Phi Fix for AMD Proprietary Drivers (#5260)
* Replace tanh to avoid NaN in gelu shader on AMD proprietary driver

* Fix another Vulkan CPY buffer size bug
2024-02-01 19:25:24 +01:00