Commit Graph

2093 Commits

Author SHA1 Message Date
Paul Tsochantaris
6dd3c28c9c
metal : remove unused n_buffers and buffers (#5129) 2024-01-26 14:16:07 +02:00
Riceball LEE
38b431de23
gguf : fix "general.alignment" type in gguf_reader.py (#5136) 2024-01-26 11:10:28 +02:00
Georgi Gerganov
aad0b01d73
readme : update hot topics 2024-01-26 10:52:33 +02:00
Kawrakow
1182cf4d4f
Another bucket sort (#5109)
* Initial bucket sort

* Bucket sort: slightly better version

* Bucket sort: another minor improvement

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-26 09:14:39 +02:00
XiaotaoChen
fe54033b69
readme : add MobileVLM 1.7B/3B to the supported models list (#5107)
Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
2024-01-25 22:14:32 +02:00
l3utterfly
5eaf9964fc
llama : dynamic temperature sampling (#4972)
* implemented dynamic temperature sampling from koboldcpp

* removed trailing whitespace

* removed unused temp parameter in llama_sample_entropy

* exposed exponent_val in dynamic temp sampler

* added debug check for printf statements

* use nullptr in llama_sample_softmax call during llama_sample_entropy

this avoids counting the time taken stats twice

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* return earlier if there is only 1 candiate (i.e. max_entropy == 0)

* reformat 't' case in llama_sample_queue

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* check for one or zero candidates case in llama_sample_entropy

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-01-25 22:06:22 +02:00
Jared Van Bortel
d292f4f204
examples : make pydantic scripts pass mypy and support py3.8 (#5099) 2024-01-25 14:51:24 -05:00
Valentin Konovalov
256d1bb0dd
android : use release cmake build type by default (#5123) 2024-01-25 19:05:51 +02:00
Georgi Gerganov
6fea843b24
metal : add parallel reduce version (disabled) 2024-01-25 18:09:30 +02:00
Kawrakow
faa3526a1e
Fix Q3_K_XS for MoE models (#5113)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-25 17:58:53 +02:00
Georgi Gerganov
f9ca5dcbe8
llama : avoid ggml_cast, use F32 query 2024-01-25 17:46:07 +02:00
Georgi Gerganov
40ea8cd1ac
metal : fix comment 2024-01-25 16:31:39 +02:00
Georgi Gerganov
432ad04ffa
metal : scale and mask in matrix form 2024-01-25 15:47:52 +02:00
Georgi Gerganov
d917746ddb
metal : avoid redundant loads of the attention 2024-01-25 15:00:49 +02:00
Georgi Gerganov
1446a12b29
metal : efficient flash_attn_f16 implementation 2024-01-25 13:40:31 +02:00
Georgi Gerganov
ddc5a5033f
metal : show compile log messages 2024-01-25 11:26:17 +02:00
Engininja2
cd4fddb29f
cuda : fix 2-bit quants on amd hip (#5105)
* cuda : fix 2-bit quants on amd hip

* use __low2float intrinsic function for new quants
2024-01-24 23:18:15 +01:00
Michael Hueschen
c9b316c78f nix-shell: use addToSearchPath
thx to @SomeoneSerge for the suggestion!
2024-01-24 12:39:29 +00:00
Michael Hueschen
bf63d695b8 nix: add cc to devShell LD_LIBRARY_PATH
this fixes the error I encountered when trying to run the convert.py
script in a venv:

```
$ nix develop

[...]$ source .venv/bin/activate
(.venv)
[...]$ pip3 install -r requirements.txt
<... clipped ...>
[...]$ python3 ./convert.py
Traceback (most recent call last):
  File "/home/mhueschen/projects-reference/llama.cpp/./convert.py", line 40, in <module>
    from sentencepiece import SentencePieceProcessor
  File "/home/mhueschen/projects-reference/llama.cpp/.venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 13, in <module>
    from . import _sentencepiece
ImportError: libstdc++.so.6: cannot open shared object file: No such file or directory
```

however, I am not sure this is the cleanest way to address this linker
issue...
2024-01-24 12:39:29 +00:00
slaren
1387ea2117
llama : pre-allocate input tensors in a separate buffer (#5100) 2024-01-24 12:48:14 +01:00
Georgi Gerganov
26d607608d
metal : disable support for MUL_MAT F32 x F16 2024-01-23 15:50:56 +02:00
Kawrakow
44879ee885
Additional KL-divergence statistics (#5081)
* perplexity: add top-token probability

* perplexity: add additional KL-divergence statistics

* perplexity: a better organized KL-divergence statistics output

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-23 15:17:20 +02:00
Johannes Gäßler
9ecdd12e95
CUDA: more info when no device code (#5088) 2024-01-23 13:31:56 +01:00
Georgi Gerganov
89758723c7
minor : clean-up some warnings and style (#5094)
* minor : clean-up some warnings and style

ggml-ci

* ggml : add comment
2024-01-23 14:12:57 +02:00
Xuan Son Nguyen
2bed4aa3f3
devops : add intel oneapi dockerfile (#5068)
Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>
2024-01-23 09:11:39 +02:00
Michael Coppola
125d03a503
llama.vim : added api key support (#5090)
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-01-23 08:51:27 +02:00
slaren
011e8ec577
llama : fix not enough space in buffer with Qwen (#5086) 2024-01-22 23:42:41 +01:00
Kawrakow
6f9939d119
KL-divergence (#5076)
* kl-divergence: be able to save all logits to a file

* Add ability to compute KL-divergence

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-22 16:10:14 +02:00
Reinforce-II
780e24a22e
ggml : parallelize FP32 conversion when using BLAS (#5045)
* make GGML_TASK_INIT phase can be run in multithread

* multithreaded dequantize in mul_mat when using blas library

* minor fixes

* update outdated comment
* fix coding style

* simplify code

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-22 15:15:08 +02:00
XiaotaoChen
3ce7e8f8e7
llava : MobileVLM support (#4954)
* MobileVLM native implementation

* delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake

* move android script to example/llava directory

* Fix the editor config checks

---------

Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
2024-01-22 15:09:35 +02:00
Someone Serge
b2d80e105a flake.nix: add a comment about flakes vs nix 2024-01-22 12:19:30 +00:00
Someone Serge
28603cd283 nix: add a comment on the many nixpkgs-with-cuda instances 2024-01-22 12:19:30 +00:00
Someone Serge
5e97ec91ae nix: add a comment about makeScope 2024-01-22 12:19:30 +00:00
Someone Serge
7251870780 nix: refactor the cleanSource rules 2024-01-22 12:19:30 +00:00
Someone Serge
fe8b3c0d4b workflows: nix-ci: drop the redundant "paths" filter 2024-01-22 12:19:30 +00:00
Someone Serge
f4dd059259 workflows: nix-build-aarch64: rate limit 2024-01-22 12:19:30 +00:00
Someone Serge
f7276f7500 workflows: nix-ci: rebuild on flake.lock updates 2024-01-22 12:19:30 +00:00
Kawrakow
15bceec2d7
imatrix : keep intermediate imatrix results (#5077)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-22 14:18:43 +02:00
compilade
d6bd4d46dd
llama : support StableLM 2 1.6B (#5052)
* llama : support StableLM 2 1.6B

* convert : fix Qwen's set_vocab wrongly naming all special tokens [PAD{id}]

* convert : refactor Qwen's set_vocab to use it for StableLM 2 too

* nix : add tiktoken to llama-python-extra

* convert : use presence of tokenizer.json to determine StableLM tokenizer loader

It's a less arbitrary heuristic than the vocab size.
2024-01-22 13:21:52 +02:00
Daniel Bevenius
152d9d05e0
finetune : print sample-start/include-sample-start (#5072)
This commit adds `--sample-start` and `--include-sample-start` to the
output from the main function in finetune.cpp.

The motivation for this is that even though these are set explicitly by
the user via the command line, if one forgets to set them then it is
useful to have their values printed out. Otherwise it is possible to go
through the whole training process before realizing that the values are
not what one expected.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-22 13:11:01 +02:00
Kawrakow
66d575c45c
llama : add Q3_K_XS (#5060)
* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S

* Q3_K_XS: quanize first 1/8 of ffn_down layers with Q4_K

Together with an importance matrix, this brings perplexity
for LLaMA-v2-70B below the perplexity of the former Q2_K
with a 800 MB smaller quantized model size.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-22 12:43:33 +02:00
bobqianic
57744932c6
ci : fix Windows CI by updating Intel SDE version (#5053) 2024-01-22 10:55:05 +02:00
Shijie
3466c6ebcf
llama : add more qwen2 models (#5071) 2024-01-22 09:33:19 +02:00
iSma
504dc37be8
Revert LLAMA_NATIVE to OFF in flake.nix (#5066) 2024-01-21 21:37:13 +00:00
Georgi Gerganov
17720fad66
metal : parallel reduce across heads 2024-01-21 23:01:46 +02:00
Georgi Gerganov
77d08f3272
metal : parallelize across KV size 2024-01-21 22:26:45 +02:00
Georgi Gerganov
a4b6341c7b
wip : template for rows per warp 2024-01-21 19:06:30 +02:00
kuronekosaiko
05490fad7f
add safetensors support to convert-lora-to-ggml.py (#5062)
* add safetensors support to convert-lora-to-ggml.py

* Update convert-lora-to-ggml.py

Remove white space in line 69.
2024-01-21 17:28:14 +01:00
Georgi Gerganov
f31955f5d1
wip : 4 rows per simd group 2024-01-21 18:01:28 +02:00
Georgi Gerganov
8cde449b8b
wip : 8 rows per simd group 2024-01-21 17:37:24 +02:00