Johannes Gäßler
9154494808
CUDA: mul_mat_id always on GPU for batches >= 32 ( #4553 )
2023-12-21 18:42:59 +01:00
Georgi Gerganov
c083718c89
readme : update coding guidelines
2023-12-21 19:27:14 +02:00
howlger
880e352277
py : open merges file as 'utf-8' ( #4566 )
...
Without this change, converting bling-phi-2-v0 (<https://huggingface.co/llmware/bling-phi-2-v0>) on Windows via convert-hf-to-gguf.py fails with the error below, because open() without an explicit encoding falls back to the locale codepage (cp1252) instead of UTF-8:
```
Traceback (most recent call last):
File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 1061, in <module>
model_instance.set_vocab()
File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 52, in set_vocab
self._set_vocab_gpt2()
File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 264, in _set_vocab_gpt2
special_vocab = gguf.SpecialVocab(dir_model, load_merges=True)
File "C:\Users\User\git\gguf\gguf\vocab.py", line 33, in __init__
self._load(Path(path))
File "C:\Users\User\git\gguf\gguf\vocab.py", line 81, in _load
self._try_load_merges_txt(path)
File "C:\Users\User\git\gguf\gguf\vocab.py", line 95, in _try_load_merges_txt
for line in fp:
File "C:\Users\User\miniconda3\envs\gguf\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1415: character maps to <undefined>
```
2023-12-21 19:07:34 +02:00
bobqianic
66f35a2f48
cuda : better error message for ggml_get_rows ( #4561 )
...
* Update ggml-cuda.cu
* Update ggml-cuda.cu
* Update ggml-cuda.cu
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-21 19:06:44 +02:00
slaren
1398823922
cuda : replace asserts in wrong architecture checks with __trap ( #4556 )
...
* cuda : replace asserts in wrong architecture checks with __trap
* make bad_arch noreturn, remove returns
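A hedged sketch of the pattern (names follow the commit description; this is not the verbatim patch):
```
#include <cstdio>

// noreturn device-side failure path for wrong-architecture builds
[[noreturn]] static __device__ void bad_arch() {
    printf("ERROR: ggml-cuda was compiled without support for the current GPU architecture.\n");
    __trap(); // device-side abort; unlike assert(), it also fires in release builds
}

static __device__ float some_op(float x) {
#if __CUDA_ARCH__ >= 610
    return x; // ... real implementation for supported architectures ...
#else
    bad_arch(); // noreturn, so no dummy return value is needed here
#endif
}
```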
2023-12-21 18:02:30 +01:00
Johannes Gäßler
d3223afdad
llama : disable per-tensor info prints on model load ( #4562 )
2023-12-21 18:34:17 +02:00
LoganDark
1d7a1912ce
Fix access violation in ggml_cuda_free_data if tensor->extra is NULL ( #4554 )
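A minimal sketch of the guard, with the surrounding cleanup elided (only the NULL check reflects this fix):
```
void ggml_cuda_free_data(struct ggml_tensor * tensor) {
    // a tensor that was never offloaded carries no CUDA-side payload, so
    // unconditionally dereferencing tensor->extra was an access violation
    if (tensor == NULL || tensor->extra == NULL) {
        return;
    }
    // ... free the device buffers referenced by tensor->extra ...
}
```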
2023-12-21 10:59:27 +01:00
Johannes Gäßler
799fc22689
CUDA: Faster Mixtral prompt processing ( #4538 )
...
* CUDA: make MoE tensors contiguous for batch size>1
* Update ggml-cuda.cu
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
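A hedged sketch of the first bullet's idea using the public ggml API (illustrative helper, not the actual CUDA-side patch):
```
#include "ggml.h"

// for batch size > 1 the rows routed to one expert form a strided view;
// copying them into a contiguous tensor lets one big batched matmul run
// instead of many per-row products
static struct ggml_tensor * ensure_cont(struct ggml_context * ctx, struct ggml_tensor * t) {
    return ggml_is_contiguous(t) ? t : ggml_cont(ctx, t);
}
```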
2023-12-20 15:41:22 +01:00
Eric Sommerlade
328b83de23
ggml : fixed check for _MSC_VER ( #4535 )
...
Co-authored-by: Eric Sommerlade <ersomme@microsoft.com>
2023-12-19 18:17:01 +02:00
arlo-phoenix
a7aee47b98
ggml-cuda: Fix HIP build ( #4528 )
...
Fixes a regression from #4490.
Adds defines for two new datatypes:
cublasComputeType_t and cudaDataType_t.
Currently uses the deprecated hipblasDatatype_t, since the newer HIP equivalents are too recent.
2023-12-18 22:33:45 +01:00
Georgi Gerganov
0e18b2e7d0
llama.swiftui : add tinyllama 1.1B F16
2023-12-18 20:17:43 +02:00
Georgi Gerganov
6ff39b129d
llama.swiftui : add more models
2023-12-18 20:05:12 +02:00
Ebey Abraham
b9e74f9bca
llama : add phi-2 + fix NeoX rope + ggml_mul_mat_set_prec ( #4490 )
...
* phi2 implementation
* fix breaking change
* phi-2 : various fixes
* phi-2 : use layer norm eps
* py : whitespaces
* llama : fix meta KV override bug
* convert : phi don't add BOS token
* convert : revert "added_tokens_decoder" change
* phi-2 : scale Q instead of KQ for better precision
* ggml : fix NeoX rope to rotate just first n_dims
* cuda : less diff in the rope_neox kernel
* ggml : add ggml_mul_mat_set_prec
ggml-ci
* Update ggml-cuda.cu
Co-authored-by: slaren <slarengh@gmail.com>
* Update ggml-cuda.cu
Co-authored-by: slaren <slarengh@gmail.com>
* cuda : ggml_cuda_op_mul_mat_cublas support F32 precision
* cuda : remove obsolete comment
---------
Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
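Of the bullets above, ggml_mul_mat_set_prec is the API-visible piece; a hedged sketch of its intended use, based on the commit description (graph-building fragment; ctx, k, q come from the surrounding model code):
```
// request F32 accumulation for an attention matmul whose F16 intermediates
// can overflow (the same concern behind scaling Q instead of KQ)
struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
```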
2023-12-18 19:27:47 +02:00
hankcs
3c04bf6da8
llama : fix try_override for bool_value, which always returns true ( #4519 )
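The shape of the bug, sketched with approximate names (the real code is llama.cpp's KV-override handling): the helper reported success even when no override matched.
```
// before (buggy): always returns true, so callers cannot tell
// whether the boolean override was actually applied
static bool try_override_bool(bool & dst, const struct llama_model_kv_override * ovrd) {
    if (ovrd && ovrd->tag == LLAMA_KV_OVERRIDE_BOOL) {
        dst = ovrd->bool_value;
        return true;
    }
    return true; // BUG: should be `return false;`
}
```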
2023-12-18 15:14:58 +02:00
Jared Van Bortel
2994f0c5a2
decode : fix logits_valid for legacy API ( #4516 )
2023-12-17 19:39:02 -05:00
Georgi Gerganov
b1306c4394
readme : update hot topics
2023-12-17 20:16:23 +02:00
Georgi Gerganov
800a489e4a
llama.swiftui : add bench functionality ( #4483 )
...
* llama.swiftui : add bench button
* llama.swiftui : initial bench functionality
* force use of n_gpu_layers on simulator
* add download buttons & expose llamaState.loadModel
* update project.pbxproj
* comment #Preview & fix editorconfig check
* gitignore : xcode stuff
* llama.swiftui : UX improvements
* llama.swiftui : avoid data copy via "downloadTask"
* llama.swiftui : remove model from project
* llama : remove "mostly" from model infos
* llama.swiftui : improve bench
---------
Co-authored-by: jhen <developer@jhen.me>
2023-12-17 19:38:41 +02:00
Jared Van Bortel
f7f468a97d
gguf-py : fail fast on nonsensical special token IDs ( #4489 )
2023-12-17 10:45:46 -05:00
Matheus Gabriel Alves Silva
919c40660f
build : Check the ROCm installation location ( #4485 )
...
* build : Check the ROCm installation location
* more generic approach
* fixup! It was returning the path instead of the command output
* fixup! Trailing whitespace
2023-12-17 17:23:33 +02:00
slaren
45668633fd
finetune : keep allocs alive until all allocations are done ( #4486 )
2023-12-17 16:05:56 +01:00
olexiyb
0ffc92d2d2
server : disable llm logs if SERVER_VERBOSE is off ( #3792 )
2023-12-17 17:02:16 +02:00
AdithyanI
8edd2b40fd
server : fix grammar being ignored ( #4494 )
...
Fix bug in identifying the grammar.
2023-12-17 16:57:56 +02:00
Alexey Parfenov
eb16dae7e7
server : fix possible ambiguity in content type charset ( #4501 )
2023-12-17 16:56:09 +02:00
mzcu
62bd52b7bf
server : allow requests larger than 8K ( #4500 )
2023-12-17 16:54:37 +02:00
Bach Le
5daa5f54fd
Link to cublas dynamically on Windows even with LLAMA_STATIC ( #4506 )
2023-12-17 11:57:33 +01:00
slaren
c6c4fc081c
lora : add support for non-llama models ( #3333 )
...
* lora : add support for non-llama models
ggml-ci
* avoid leaking ggml_context on failure
cleanup
ggml-ci
* lora : allow 1d tensors
* lora : include embd and output layers in size calculation
* fix style
2023-12-16 18:58:46 +01:00
Jared Van Bortel
8a5be3bd58
llama : sanity checks for access to logits ( #4274 )
...
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-15 22:16:15 -05:00
Jared Van Bortel
8072706210
kompute : always destroy Manager via the destructor
2023-12-15 16:23:24 -05:00
Jared Van Bortel
2d2c76acc4
vulkan : fix free of stack addr in llama_buffer
2023-12-15 16:22:18 -05:00
Jared Van Bortel
f58f581ca8
refactor llama.cpp modifications
2023-12-15 13:38:54 -05:00
ShadovvBeast
88ae8952b6
server : add optional API Key Authentication example ( #4441 )
...
* Add API key authentication for enhanced server-client security
* server : to snake_case
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
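A hedged sketch of the check itself (the example server is built on cpp-httplib; the helper name is illustrative):
```
#include <string>
#include "httplib.h"

// with no key configured, authentication is disabled; otherwise every
// request must carry a matching "Authorization: Bearer <key>" header
static bool api_key_ok(const httplib::Request & req, const std::string & api_key) {
    if (api_key.empty()) {
        return true;
    }
    return req.get_header_value("Authorization") == "Bearer " + api_key;
}
```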
2023-12-15 13:49:01 +02:00
slaren
ee4725a686
ggml : group mul_mat_id rows by matrix (cpu only) ( #4480 )
...
* ggml : group mul_mat_id rows by matrix (cpu only)
* remove mmid parameters from mm forward
* store row groups in wdata and calculate only once in GGML_TASK_INIT
ggml-ci
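A hedged sketch of the grouping with hypothetical names (per the last bullet, ggml stores the groups in wdata and computes them once during GGML_TASK_INIT):
```
// bucket each output row by the expert matrix its id selects, so every
// expert runs one matmul over its whole bucket instead of one product per row
void group_rows_by_expert(const int * row_to_expert, int n_rows, int n_experts,
                          int * groups /* n_experts * n_rows */, int * group_size) {
    for (int e = 0; e < n_experts; e++) {
        group_size[e] = 0;
    }
    for (int r = 0; r < n_rows; r++) {
        const int e = row_to_expert[r];
        groups[e * n_rows + group_size[e]++] = r;
    }
}
```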
2023-12-15 12:45:50 +01:00
slaren
6744dbe924
ggml : use ggml_row_size where possible ( #4472 )
...
* ggml : use ggml_row_size where possible
ggml-ci
* ggml : move ggml_nbytes_split to ggml-cuda.cu
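The call-site pattern being replaced, as a hedged before/after (illustrative):
```
// before: float-based helper, imprecise for quantized types
// const size_t row_bytes = (size_t)(ggml_type_sizef(type) * ne0);

// after: exact integer byte count for a row of ne0 elements
const size_t row_bytes = ggml_row_size(type, ne0);
```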
2023-12-14 20:05:21 +01:00
Jared Van Bortel
c8fd4ba846
ggml : restore 'static' specifiers
2023-12-14 13:18:14 -05:00
slaren
cafcd4f895
ggml : remove n_dims from ggml_tensor ( #4469 )
...
ggml-ci
2023-12-14 16:52:08 +01:00
wonjun Jang
c50e400163
py : add protobuf dependency ( #4466 )
2023-12-14 14:44:49 +02:00
LostRuins
20a68a7030
ggml : add ggml_row_size() (fixes llama out of space) ( #4461 )
...
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values
* do not cast to size_t, instead just use doubles
* ggml : add ggml_row_size(), deprecate ggml_type_sizef()
* ggml : fix row size compute to avoid overflows
* tests : fix sizey -> sizez
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
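For reference, a hedged sketch of what the new helper computes (integer math end to end; see ggml.c for the real definition):
```
size_t ggml_row_size(enum ggml_type type, int64_t ne) {
    // bytes per block times blocks per row; no float rounding, which is
    // what made the old ggml_type_sizef() estimates drift on some models
    return ggml_type_size(type) * ne / ggml_blck_size(type);
}
```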
2023-12-14 14:13:33 +02:00
Georgi Gerganov
55e87c3749
ggml : fix OpenCL broadcast requirement for ggml_mul ( close #4453 )
2023-12-14 10:35:29 +02:00
wonjun Jang
873637afc7
convert : support loading vocab from fast tokenizer config ( #3633 )
...
* Add HFVocab into convert.py
* Update convert.py
* Update convert.py
* add bytes_to_unicode function
* change add_meta_vocab function
* remove debug code
* remove byte_encoder
* Add newline between classes
* Check tokenizer.json when tokenizer.model does not exist.
* Move transformers dependency to local code
* Add error context with 'raise from'
* Add fast tokenizer option to BpeVocab
* Update convert.py
* Add VocabLoader and remove *Vocab class
* Add transformers dependency
* remove added tokens and check newline token to decide spm or bpe
* Update convert.py
* Add special token type
* Update convert.py
* Update convert.py
* Update convert.py
* Fix typo in convert.py
* Fix when params.n_vocab < tokenizer vocab size
* update vocab class
* change function name
* Remove unused variables/functions, add types to class variables and methods, delete blank lines
* fix flake8 warnings
* code style cleanup
* make mypy happy
* change exception
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2023-12-14 10:09:34 +02:00
BarfingLemurs
0353a18401
readme : update supported model list ( #4457 )
2023-12-14 09:38:49 +02:00
Jared Van Bortel
f7cb0a65ef
remove script with unclear purpose
2023-12-13 17:55:41 -05:00
Jared Van Bortel
9af7f58b7b
move kompute to a submodule
2023-12-13 17:54:35 -05:00
Jared Van Bortel
b906e126ca
kompute : fix compile warnings
2023-12-13 17:49:45 -05:00
Jared Van Bortel
747e1eafcf
Merge commit '81bc9214a389362010f7a57f4cbc30e5f83a2d28' into nomic-vulkan
2023-12-13 17:49:45 -05:00
Jared Van Bortel
27631dbb6e
separate shaders from kompute itself
2023-12-13 17:49:45 -05:00
Jared Van Bortel
3e09e127eb
rename ggml-vulkan -> ggml-kompute
2023-12-13 17:49:45 -05:00
Jared Van Bortel
56430c3209
relicense Vulkan backend as MIT
2023-12-13 17:49:19 -05:00
shibe2
948ff137ec
server : fix handling of characters that span multiple tokens when streaming ( #4446 )
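The underlying issue: one character can span several tokens, so a naive streamer emits invalid partial UTF-8. A hedged sketch of the kind of check involved (hypothetical helper, not the server source):
```
#include <string>

// count the bytes at the end of `s` that form an incomplete UTF-8
// sequence, so the streamer can hold them back until the next token
static size_t incomplete_utf8_suffix(const std::string & s) {
    for (size_t i = 1; i <= 4 && i <= s.size(); i++) {
        const unsigned char c = s[s.size() - i];
        if ((c & 0xC0) == 0x80) continue;             // continuation byte: scan further back
        if ((c & 0xE0) == 0xC0) return i < 2 ? i : 0; // lead of a 2-byte sequence
        if ((c & 0xF0) == 0xE0) return i < 3 ? i : 0; // lead of a 3-byte sequence
        if ((c & 0xF8) == 0xF0) return i < 4 ? i : 0; // lead of a 4-byte sequence
        return 0;                                     // ASCII (or invalid): nothing pending
    }
    return 0;
}
```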
2023-12-13 21:57:15 +02:00
Georgi Gerganov
4d98d9a656
sync : ggml (SD ops, tests, kernels) ( #4444 )
...
* sync : ggml (SD ops, tests, kernels)
ggml-ci
* cuda : restore im2col
ggml-ci
* metal : fix accuracy of dequantization kernels
ggml-ci
* cuda : restore correct im2col
ggml-ci
* metal : try to fix moe test by reducing expert size
ggml-ci
* cuda : fix bin bcast when src1 and dst have different types
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-12-13 21:54:54 +02:00
Jared Van Bortel
70f806b821
build : detect host compiler and cuda compiler separately ( #4414 )
2023-12-13 12:10:10 -05:00