Commit Graph

1670 Commits

Author SHA1 Message Date
Georgi Gerganov
3c8d6b160b
Update ggml-cuda.cu
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-18 14:21:22 +02:00
Georgi Gerganov
18c67bdd84
ggml : add ggml_mul_mat_set_prec
ggml-ci
2023-12-18 13:31:23 +02:00
Georgi Gerganov
a8d2a6f3ef
Merge branch 'master' into HEAD 2023-12-18 10:17:55 +02:00
Jared Van Bortel
2994f0c5a2
decode : fix logits_valid for legacy API (#4516) 2023-12-17 19:39:02 -05:00
Georgi Gerganov
b1306c4394
readme : update hot topics 2023-12-17 20:16:23 +02:00
Georgi Gerganov
800a489e4a
llama.swiftui : add bench functionality (#4483)
* llama.swiftui : add bench button

* llama.swiftui : initial bench functionality

* force to use n_gpu_layers on simulator

* add download buttons & expose llamaState.loadModel

* update project.pbxproj

* comment #Preview & fix editorconfig check

* gitignore : xcode stuff

* llama.swiftui : UX improvements

* llama.swiftui : avoid data copy via "downloadTask"

* llama.swiftui : remove model from project

* llama : remove "mostly" from model infos

* llama.swiftui : improve bench

---------

Co-authored-by: jhen <developer@jhen.me>
2023-12-17 19:38:41 +02:00
Jared Van Bortel
f7f468a97d
gguf-py : fail fast on nonsensical special token IDs (#4489) 2023-12-17 10:45:46 -05:00
Matheus Gabriel Alves Silva
919c40660f
build : Check the ROCm installation location (#4485)
* build : Check the ROCm installation location

* more generic approach

* fixup! It was returning the path instead of the command output

* fixup! Trailing whitespace
2023-12-17 17:23:33 +02:00
slaren
45668633fd
finetune : keep allocs alive until all allocations are done (#4486) 2023-12-17 16:05:56 +01:00
olexiyb
0ffc92d2d2
server : disable llm logs if SERVER_VERBOSE is off (#3792) 2023-12-17 17:02:16 +02:00
AdithyanI
8edd2b40fd
server : fix grammar being ignored (#4494)
Fix bug in identifying the grammar.
2023-12-17 16:57:56 +02:00
Alexey Parfenov
eb16dae7e7
server : fix possible ambiguity in content type charset (#4501) 2023-12-17 16:56:09 +02:00
mzcu
62bd52b7bf
server : allow requests larger than 8K (#4500) 2023-12-17 16:54:37 +02:00
Bach Le
5daa5f54fd
Link to cublas dynamically on Windows even with LLAMA_STATIC (#4506) 2023-12-17 11:57:33 +01:00
Georgi Gerganov
42e9525884
cuda : less diff in the rope_neox kernel 2023-12-17 09:14:29 +02:00
Georgi Gerganov
f703ca8a3c
ggml : fix NeoX rope to rotate just first n_dims 2023-12-17 08:40:51 +02:00
slaren
c6c4fc081c
lora : add support for non-llama models (#3333)
* lora : add support for non-llama models

ggml-ci

* avoid leaking ggml_context on failure
cleanup

ggml-ci

* lora : allow 1d tensors

* lora : include embd and output layers in size calculation

* fix style
2023-12-16 18:58:46 +01:00
Georgi Gerganov
0644c3be51
phi-2 : scale Q instead of KQ for better precision 2023-12-16 18:01:08 +02:00
Georgi Gerganov
0b6ffa580c
convert : revert "added_tokens_decoder" change 2023-12-16 16:05:35 +02:00
Georgi Gerganov
a878be4cb1
convert : phi don't add BOS token 2023-12-16 11:20:11 +02:00
Georgi Gerganov
5469d82d5a
llama : fix meta KV override bug 2023-12-16 11:19:56 +02:00
Georgi Gerganov
7500fa2f07
py : whitespaces 2023-12-16 11:01:02 +02:00
Georgi Gerganov
aa5c881adb
phi-2 : use layer norm eps 2023-12-16 10:54:10 +02:00
Georgi Gerganov
a2a3d2c8d7
phi-2 : various fixes 2023-12-16 10:46:18 +02:00
Jared Van Bortel
8a5be3bd58
llama : sanity checks for access to logits (#4274)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-15 22:16:15 -05:00
Ebey Abraham
e20765534d fix breaking change 2023-12-16 00:41:06 +00:00
Ebey Abraham
12cc80cb89 phi2 implementation 2023-12-15 20:56:57 +00:00
ShadovvBeast
88ae8952b6
server : add optional API Key Authentication example (#4441)
* Add API key authentication for enhanced server-client security

* server : to snake_case

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-15 13:49:01 +02:00
slaren
ee4725a686
ggml : group mul_mat_id rows by matrix (cpu only) (#4480)
* ggml : group mul_mat_id rows by matrix (cpu only)

* remove mmid parameters from mm forward

* store row groups in wdata and calculate only once in GGML_TASK_INIT

ggml-ci
2023-12-15 12:45:50 +01:00
slaren
6744dbe924
ggml : use ggml_row_size where possible (#4472)
* ggml : use ggml_row_size where possible

ggml-ci

* ggml : move ggml_nbytes_split to ggml-cuda.cu
2023-12-14 20:05:21 +01:00
slaren
cafcd4f895
ggml : remove n_dims from ggml_tensor (#4469)
ggml-ci
2023-12-14 16:52:08 +01:00
wonjun Jang
c50e400163
py : add protobuf dependency (#4466) 2023-12-14 14:44:49 +02:00
LostRuins
20a68a7030
ggml : add ggml_row_size() (fixes llama out of space) (#4461)
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values

* do not cast to size_t, instead just use doubles

* ggml : add ggml_row_size(), deprecate ggml_type_sizef()

* ggml : fix row size compute to avoid overflows

* tests : fix sizey -> sizez

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-14 14:13:33 +02:00
Georgi Gerganov
55e87c3749
ggml : fix OpenCL broadcast requirement for ggml_mul (close #4453) 2023-12-14 10:35:29 +02:00
wonjun Jang
873637afc7
convert : support loading vocab from fast tokenizer config (#3633)
* Add HFVocab into convert.py

* Update convert.py

* Update convert.py

* add bytes_to_unicode function

* change add_meta_vocab fucntion

* remove debug code

* remove byte_encoder

* Add newline between classes

* Check tokenizer.json when tokenizer.model is not exist.

* Move transformers dependency to local code

* Add error context with 'raise from'

* Add fast tokenizer option to BpeVocab

* Update convert.py

* Add VocabLoader and remove *Vocab class

* Add transformers dependency

* remove added tokens and check newline token to decide spm or bpe

* Update convert.py

* Add special token type

* Update convert.py

* Update convert.py

* Update convert.py

* Fix typo in convert.py

* Fix when params.n_vocab < tokenizer vocab size

* update vocab class

* change funtion name

* Remove unused variable/functions, add types to class variable and methods, delete blank liens

* fix flake8 warnings

* code style cleanup

* make mypy happy

* change exception

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2023-12-14 10:09:34 +02:00
BarfingLemurs
0353a18401
readme : update supported model list (#4457) 2023-12-14 09:38:49 +02:00
shibe2
948ff137ec
server : fix handling of characters that span multiple tokens when streaming (#4446) 2023-12-13 21:57:15 +02:00
Georgi Gerganov
4d98d9a656
sync : ggml (SD ops, tests, kernels) (#4444)
* sync : ggml (SD ops, tests, kernels)

ggml-ci

* cuda : restore im2col

ggml-ci

* metal : fix accuracy of dequantization kernels

ggml-ci

* cuda : restore correct im2col

ggml-ci

* metal : try to fix moe test by reducing expert size

ggml-ci

* cuda : fix bin bcast when src1 and dst have different types

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-13 21:54:54 +02:00
Jared Van Bortel
70f806b821
build : detect host compiler and cuda compiler separately (#4414) 2023-12-13 12:10:10 -05:00
Siwen Yu
9fb13f9584
common : add --version option to show build info in CLI (#4433) 2023-12-13 14:50:14 +02:00
Georgi Gerganov
113f9942fc
readme : update hot topics 2023-12-13 14:05:38 +02:00
slaren
799a1cb13b
llama : add Mixtral support (#4406)
* convert : support Mixtral as LLAMA arch

* convert : fix n_ff typo

* llama : model loading

* ggml : sync latest ggml_mul_mat_id

* llama : update graph to support MoE

* llama : fix cur -> cur_expert

* llama : first working version

* llama : fix expert weighting in the FFN

* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)

* ggml : add n_as argument to ggml_mul_mat_id

* ggml : fix ggml_get_rows to take into account ne02 / ne11

* metal : add more general support for ggml_get_rows + tests

* llama : add basic support for offloading moe with CUDA

* metal : add/mul/div use general kernel when src1 not cont

* metal : reduce the kernel launches for ggml_mul_mat_id

* ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D

* ggml : update get_rows f16 and q

* cuda : support non-contiguous src1 in get_rows

* llama : offload missing ffn_moe_silu

* metal : fix ggml_get_rows to work with non-cont src1

* metal : add indirect mat-vec kernels for all quantization types

* llama : do not quantize expert gating tensors

* llama : add n_expert and n_expert_used to hparams + change quants

* test-backend-ops : add moe test

* cuda : fix get_rows when ncols is odd

* convert : determine n_ctx correctly

* metal : fix ggml_mul_mat_id for F32

* test-backend-ops : make experts more evenly probable (test_moe)

* test-backend-ops : cleanup, add moe test for batches

* test-backend-ops : add cpy from f32 -> all types test

* test-backend-ops : fix dequantize block offset

* llama : fix hard-coded number of experts

* test-backend-ops : simplify and disable slow tests to avoid CI timeout

* test-backend-ops : disable MOE test with thread sanitizer

* cuda : fix mul_mat_id with multi gpu

* convert : use 1e6 rope_freq_base for mixtral

* convert : fix style

* convert : support safetensors format

* gguf-py : bump version

* metal : add cpy f16 -> f32 kernel

* metal : fix binary ops for ne10 % 4 != 0

* test-backend-ops : add one more sum_rows test

* ggml : do not use BLAS with ggml_mul_mat_id

* convert-hf : support for mixtral-instruct (#4428)

* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct

* convert : use sentencepiece tokenizer for Mixtral-instruct

* convert : make flake8 happy

* metal : fix soft_max kernels

ref: 1914017863

* metal : limit kernels to not use more than the allowed threads

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
2023-12-13 14:04:25 +02:00
kalomaze
fecac45658
server : tweak default sampling parameters (#4367)
* Set a more typical Top P setting as the default

* Update temp max
2023-12-12 12:12:35 +02:00
Richard Kiss
9494d7c477
english : use typos to fix comments and logs (#4354) 2023-12-12 11:53:36 +02:00
Jared Van Bortel
6138963fb2
build : target Windows 8 for standard mingw-w64 (#4405)
* build : target Windows 8 for standard mingw-w64

* make : fix missing console.o deps

This was causing a link error with `make all` on Windows.
2023-12-12 11:27:26 +02:00
crasm
6391817cd1
llama : document logits_all deprecation (#4418)
llama_context_params.logits_all is a parameter for controlling
llama_eval. This documents that logits_all should not be used with
llama_decode and llama_batch.
2023-12-12 11:25:57 +02:00
Vladimir Zorin
d9d4cfef64
server : fix local model name in server (#4420) 2023-12-12 11:25:29 +02:00
Taikono-Himazin
41a11aaf99
ggml : increased GGML_MAX_PARAMS to allow finetuning of 70b models (#4424) 2023-12-12 11:24:32 +02:00
Yueh-Po Peng
8a7b2fa528
Update README.md (#4388)
Fix small typo.
2023-12-10 23:27:38 +01:00
Xiang (Kevin) Li
e18f7345a3
grammar : revert the replacement of llama_token_to_piece with id_to_token (#4396) 2023-12-09 23:29:27 +02:00