Commit Graph

2060 Commits

Author SHA1 Message Date
Someone Serge
7251870780 nix: refactor the cleanSource rules 2024-01-22 12:19:30 +00:00
Someone Serge
fe8b3c0d4b workflows: nix-ci: drop the redundant "paths" filter 2024-01-22 12:19:30 +00:00
Someone Serge
f4dd059259 workflows: nix-build-aarch64: rate limit 2024-01-22 12:19:30 +00:00
Someone Serge
f7276f7500 workflows: nix-ci: rebuild on flake.lock updates 2024-01-22 12:19:30 +00:00
Kawrakow
15bceec2d7
imatrix : keep intermediate imatrix results (#5077)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-22 14:18:43 +02:00
compilade
d6bd4d46dd
llama : support StableLM 2 1.6B (#5052)
* llama : support StableLM 2 1.6B

* convert : fix Qwen's set_vocab wrongly naming all special tokens [PAD{id}]

* convert : refactor Qwen's set_vocab to use it for StableLM 2 too

* nix : add tiktoken to llama-python-extra

* convert : use presence of tokenizer.json to determine StableLM tokenizer loader

It's a less arbitrary heuristic than the vocab size.
2024-01-22 13:21:52 +02:00
Daniel Bevenius
152d9d05e0
finetune : print sample-start/include-sample-start (#5072)
This commit adds `--sample-start` and `--include-sample-start` to the
output from the main function in finetune.cpp.

The motivation for this is that even though these are set explicitly by
the user via the command line, if one forgets to set them then it is
useful to have their values printed out. Otherwise it is possible to go
through the whole training process before realizing that the values are
not what one expected.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-22 13:11:01 +02:00
Kawrakow
66d575c45c
llama : add Q3_K_XS (#5060)
* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S

* Q3_K_XS: quanize first 1/8 of ffn_down layers with Q4_K

Together with an importance matrix, this brings perplexity
for LLaMA-v2-70B below the perplexity of the former Q2_K
with a 800 MB smaller quantized model size.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-22 12:43:33 +02:00
bobqianic
57744932c6
ci : fix Windows CI by updating Intel SDE version (#5053) 2024-01-22 10:55:05 +02:00
Shijie
3466c6ebcf
llama : add more qwen2 models (#5071) 2024-01-22 09:33:19 +02:00
iSma
504dc37be8
Revert LLAMA_NATIVE to OFF in flake.nix (#5066) 2024-01-21 21:37:13 +00:00
Georgi Gerganov
17720fad66
metal : parallel reduce across heads 2024-01-21 23:01:46 +02:00
Georgi Gerganov
77d08f3272
metal : parallelize across KV size 2024-01-21 22:26:45 +02:00
Georgi Gerganov
a4b6341c7b
wip : template for rows per warp 2024-01-21 19:06:30 +02:00
kuronekosaiko
05490fad7f
add safetensors support to convert-lora-to-ggml.py (#5062)
* add safetensors support to convert-lora-to-ggml.py

* Update convert-lora-to-ggml.py

Remove white space in line 69.
2024-01-21 17:28:14 +01:00
Georgi Gerganov
f31955f5d1
wip : 4 rows per simd group 2024-01-21 18:01:28 +02:00
Georgi Gerganov
8cde449b8b
wip : 8 rows per simd group 2024-01-21 17:37:24 +02:00
bobqianic
6c5629d4d2
add #include <string> to unicode.h (#5051)
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-01-21 10:17:35 -05:00
Kawrakow
7dcbe39d36
Add ability to evauate multiple choice tasks (#5047)
* TruthfulQA: 1st attempt, does not look like it is working

The same implementation can be used for HellaSwag as well,
so I converted a HellaSwag validation dataset to the binary
format used here and tested with that. The score is only
around 50, so something is not quite right.

* TruthfulQA: works but the result is bad

I know it works because if I convert the HellaSwag validation
data to the binary format used in the truthful_qa_score() function
I get the exact same result as from the hellaswag_score() function.
But I guess, the questions are tricky and the way I have done
the combination of question + answer is very likely not the best.
The TruthfulQA validation dataset contains 817 questions, with
random chance result around 19%. With this version I get
29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2.
The HF leader board results for these two models are
42.2% and 68.3%, respectively.

* TruthfulQA: fix random sample

* TruthfulQA: prepare tasks in parallel for large test datasets

* Rename truthful_qa to multiple_choice

* Make MSVC happy

I had forgotten that MSVC does not make constexpr's available
inside a lambda.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-21 14:42:44 +02:00
Georgi Gerganov
b97325800a
metal : specialize for head size 2024-01-21 12:01:55 +02:00
Georgi Gerganov
52ae085750
metal : reduce branches 2024-01-21 11:59:09 +02:00
Georgi Gerganov
528da7515e
metal : f16 precision 2024-01-21 11:13:24 +02:00
Georgi Gerganov
1173f49c3b
metal : initial implementation 2024-01-21 10:15:02 +02:00
Kawrakow
726c0fa9a2
Slightly faster imatrix (#5050)
* imatrix: speedup by avoiding unnecessary allocations and copies

* imatrix: add --no-ppl option to skip PPL calculations altogether

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-21 08:01:20 +02:00
Georgi Gerganov
942c0107a7
flake.lock: Update (#5054)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/9b19f5e77dd906cb52dade0b7bd280339d2a1f3d' (2024-01-13)
  → 'github:NixOS/nixpkgs/bbe7d8f876fbbe7c959c90ba2ae2852220573261' (2024-01-19)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-01-21 03:17:27 +00:00
Jared Van Bortel
b43ebde3b0
convert : partially revert PR #4818 (#5041) 2024-01-20 18:14:18 -05:00
Jared Van Bortel
97c1549808
perplexity : fix MSVC build after #5020 (#5043)
* perplexity : fix MSVC build after #5020

* try a differerent fix
2024-01-20 17:08:08 +02:00
slaren
6df465a91d
llama : run all KQV ops on the CPU with no KV offload (#5049)
ggml-ci
2024-01-20 17:05:49 +02:00
Georgi Gerganov
a9681febd6
ggml : online attention (CPU) 2024-01-20 16:45:41 +02:00
Georgi Gerganov
c3cdfffa88
Merge branch 'master' into gg/flash-attn 2024-01-20 10:12:07 +02:00
Herman Semenov
77bc1bbd05
cmake : add support for ccache (#5002)
* Added support ccache for speedup recompilation

* cmake : option to disable ccache

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-20 10:11:31 +02:00
adel boussaken
48e2b13372
Add a dart/flutter binding to README.md (#4882) 2024-01-20 03:05:43 -05:00
Kylin
cca894f16a
cuda : fix compile error in jetson platform (#4975)
* cuda: fix compile error in jetson platform

* cuda: update comment in ggml-cuda.cu

* cuda: update ggml-cuda.cu comment
2024-01-20 09:01:46 +02:00
Uzo Nweke
381ee19572
finetune : fix ggml_allocr lifetimes (tmp workaround) (#5033)
* Fix issue with alloc causing max_compute_size to be calculated

* remove ggml_allocr_free as suggested in issue #4791
2024-01-19 20:20:50 +02:00
Georgi Gerganov
fa7ebcca99 ggml : fix GQA support in ggml_flash_attn_ext 2024-01-19 20:06:26 +02:00
Georgi Gerganov
a5cacb22b2
imatrix : add README.md 2024-01-19 15:24:47 +02:00
Shijie
9b75cb2b3c
llama : support upcoming Qwen2 (#5037) 2024-01-19 13:53:13 +02:00
Georgi Gerganov
de9a147df1 py : fix flake8 lint 2024-01-19 13:52:22 +02:00
Kawrakow
7051aacfac
winogrande: evaluate log-probs in parallel (#5036)
This is a relatively minor performance tweak resulting in
~10% speedup on my system.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-19 11:39:11 +02:00
chiranko
2b3b999cac
llama : add CodeShell support (#5016)
* llama: add codeshell support

* llama.cpp: fix codeshell with NeoX rope

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-19 11:07:27 +02:00
Kawrakow
993fba8180
perplexity: avoid unnecessary alloocations and logit copies (#5035)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-19 11:02:39 +02:00
Georgi Gerganov
8b20858e5e
perplexity : faster Winogrande via batching (#5024)
* perplexity : faster Winogrande via batching

ggml-ci

* perplexity : remove unused function

* perplexity : only tokenize selected tasks for Winogrande
2024-01-19 10:45:06 +02:00
John
57e2a7a52a
llama : fix falcon arch for tied output embeddings (#4978)
* falcon arch fix for tied output embeddings

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update llama.cpp

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-19 00:12:15 +02:00
Georgi Gerganov
9b6ea4263a
cmake : add ggml public headers (#5011) 2024-01-18 23:36:07 +02:00
Xuan Son Nguyen
821f0a271e
server : defer tasks when "slot unavailable" (#5018)
* server: defer task when no slot is available

* remove unnecessary log

---------

Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>
2024-01-18 22:33:05 +02:00
slaren
96d7f56d29
llama : fix mlock with no-mmap with Metal (#5025) 2024-01-18 21:12:15 +01:00
Georgi Gerganov
2d5419d08a
imatrix : fix assert for src0 non-cont check 2024-01-18 21:45:51 +02:00
Georgi Gerganov
d391ae9b49
perplexity : fix winogrande N tasks option 2024-01-18 20:49:00 +02:00
Georgi Gerganov
e9240cdfa0
scripts : add get-winogrande.sh 2024-01-18 20:45:39 +02:00
David Sommers
b46757735d
convert.py : fix llama/llama2 conversion due to vocab_size=-1 (#5019)
PR #4818 (merged last week) reintroduced a config check for vocab_size that was addressed in PR #4258 (merged 2023-11-30).

Without the fix, llama2 models can't be converted. The error is:

`ValueError: The model's vocab size is set to -1 in params.json. Please update it manually. Maybe 32000?`
2024-01-18 19:20:59 +02:00