AidanBeltonS
fadde67135
Dequant improvements rebase ( #8255 )
...
* Single load for half2
* Store scales in local mem
* Vec load quantized values
2024-07-03 09:55:34 +08:00
MistApproach
a27152b602
fix: add missing short command line argument -mli for multiline-input ( #8261 )
2024-07-02 22:56:46 +02:00
Clint Herron
3e2618bc7b
Adding step to clean
target to remove legacy binary names to reduce upgrade / migration confusion arising from #7809 . ( #8257 )
2024-07-02 13:19:56 -04:00
Clint Herron
07a3fc0608
Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. ( #8258 )
2024-07-02 12:18:10 -04:00
Faisal Zaghloul
968967376d
Add JAIS
model(s) ( #8118 )
...
* Add `JAIS` model(s)
* cleanup
* address review comments
* remove hack
* un-hardcode max-alibi-bias
* minor tweaks
---------
Co-authored-by: fmz <quic_fzaghlou@quic.com>
2024-07-02 16:36:00 +02:00
Daniel Bevenius
023b8807e1
convert-hf : print output file name when completed ( #8181 )
...
* convert-hf : print output file name when completed
This commit adds the output file name to the log message when the
conversion is completed.
The motivation for this change is that when `--outfile` option is not
specified it migth not be obvious where the output file is written.
With this change the output of running the script will be something like
the following:
```console
INFO:hf-to-gguf:Model successfully exported to models/gemma-2-9b-it.gguf.
```
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* squash! convert-hf : print output file name when completed
Updates the output of to support printing the directory if the output is
split into multiple files. Also the output file name is now retrieved
from the model_instance object.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* squash! convert-hf : print output file name when completed
Use parent attribute of Path object and string interpolation.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* squash! convert-hf : print output file name when completed
Use os.sep instead of hardcoding the path separator.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-02 09:40:49 +03:00
slaren
0e0590adab
cuda : update supports_op for matrix multiplication ( #8245 )
2024-07-02 09:39:38 +03:00
luoyu-intel
a9f3b10215
[SYCL] Fix win build conflict of math library ( #8230 )
...
* fix win build conflict of math library
* fix the condition: !(win32 & SYCL)
* revert warp_size=16
2024-07-02 12:50:07 +08:00
luoyu-intel
d08c20edde
[SYCL] Fix the sub group size of Intel ( #8106 )
...
* use warp_size macro for all sycl kernels
* fix mask of permute_sub_group_by_xor
* fix rms_norm with correct warp number
* fix rms_norm_f32/group_norm_f32
* move norm to norm.cpp file
* fix quantize bug
* fix mmvq's batch size
2024-07-02 10:16:00 +08:00
Xuan Son Nguyen
5fac350b9c
Fix gemma2 tokenizer convert ( #8244 )
...
* fix gemma2 tokenizer convert
* remove scores
* improve code, fix new line issue
2024-07-02 01:07:23 +02:00
Johannes Gäßler
cb5fad4c6c
CUDA: refactor and optimize IQ MMVQ ( #8215 )
...
* CUDA: refactor and optimize IQ MMVQ
* uint -> uint32_t
* __dp4a -> ggml_cuda_dp4a
* remove MIN_CC_DP4A checks
* change default
* try CI fix
2024-07-01 20:39:06 +02:00
Mateusz Charytoniuk
dae57a1ebc
readme: add Paddler to the list of projects ( #8239 )
2024-07-01 20:13:22 +03:00
Xuan Son Nguyen
49122a873f
gemma2: add sliding window mask ( #8227 )
...
* gemma2: add sliding window mask
* fix data_swa uninitialized
* better naming
* add co-author
Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>
* replace list with single tensor
* update
* llama : minor styling
* convert : add sanity check for query_pre_attn_scalar
* fix small typo in README
---------
Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-01 18:48:34 +02:00
Roni
0ddeff1023
readme : update tool list ( #8209 )
...
* Added gppm to Tool list in README
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-01 15:48:16 +03:00
Michael Francis
3840b6f593
nix : enable curl ( #8043 )
...
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-01 14:47:04 +03:00
Georgi Gerganov
257f8e41e2
nix : remove OpenCL remnants ( #8235 )
...
* nix : remove OpenCL remnants
* minor : remove parentheses
2024-07-01 14:46:18 +03:00
iacore
694c59cb42
Document BERT support. ( #8205 )
...
* Update README.md
document BERT support
* Update README.md
2024-07-01 13:40:58 +02:00
zhentaoyu
197fe6c1d7
[SYCL] Update SYCL-Rope op and Refactor ( #8157 )
...
* align with rope.cu and move sycl-op to a single file
2024-07-01 19:39:06 +08:00
Georgi Gerganov
d0a7145ba9
flake.lock: Update ( #8218 )
2024-06-30 16:09:34 -07:00
Xuan Son Nguyen
9ef0780062
Fix new line issue with chat template, disable template when in-prefix/suffix is set ( #8203 )
...
* preserve new line llama_chat_format_single
* disable chat template if in-prefix/suffix is set
* remove redundant change
2024-06-30 20:27:13 +02:00
Andrei
1c5eba6f8e
llama: Add attention and final logit soft-capping, update scaling factor to Gemma2 ( #8197 )
...
* Add attention and final logit softcapping.
* fix
* Add custom add_ functions
* Disable flash attention for Gemma2
* Update src/llama.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* Add default value for attention and final logit softcap value
* Add custom kq scaling from Gemma2Attention
* Remove custom pre attention scaling and use computed value instead.
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-29 23:44:08 -04:00
Francis Couture-Harpin
8fbd59308b
ggml-quants : attempt to fix Arm 32-bit support
2024-06-28 22:52:57 -04:00
Francis Couture-Harpin
ec50944bf6
ggml-quants : fix build failure on Windows
2024-06-28 20:41:13 -04:00
Francis Couture-Harpin
bfd2f21fb4
bitnet : replace 1.58b with b1.58, as in the paper
2024-06-28 20:38:12 -04:00
Xuan Son Nguyen
72272b83a3
fix code typo in llama-cli ( #8198 )
2024-06-29 00:14:20 +02:00
Olivier Chafik
8748d8ac6f
json: attempt to skip slow tests when running under emulator ( #8189 )
2024-06-28 18:02:05 +01:00
Xuan Son Nguyen
26a39bbd6b
Add MiniCPM, Deepseek V2 chat template + clean up llama_chat_apply_template_internal
( #8172 )
...
* tmp_contains
* minicpm chat template
* add DeepSeek Lite template
* change deepseek-lite to deepseek2
* correct code comment
* correct code from master branch
2024-06-28 15:11:44 +02:00
Sigbjørn Skjæret
38373cfbab
Add SPM infill support ( #8016 )
...
* add --spm-infill option
* support --spm-infill
* support --spm-infill
2024-06-28 12:53:43 +02:00
slaren
b851b3fba0
cmake : allow user to override default options ( #8178 )
2024-06-28 12:37:45 +02:00
Olivier Chafik
139cc621e9
json
: restore default additionalProperties to false, fix some pattern escapes (#8180 )
...
* json: expand ESCAPED_IN_REGEXPS_BUT_NOT_IN_LITERALS charset
* json: revert default of additionalProperties to false
* Update README.md
2024-06-28 09:26:45 +01:00
pculliton
e57dc62057
llama: Add support for Gemma2ForCausalLM ( #8156 )
...
* Inference support for Gemma 2 model family
* Update convert-hf-to-gguf.py, constants, and tensor mappings
* cleanup
* format fix
* Fix special token vocab bug
* Don't add space prefix
* fix deleted lines
* Update src/llama.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* Add model type names
* Add control vector
* Fix model type identification
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-27 21:00:43 -07:00
Xuan Son Nguyen
a27aa50ab7
Add missing items in makefile ( #8177 )
2024-06-28 02:19:11 +02:00
Olivier Chafik
cb0b06a8a6
json
: update grammars/README w/ examples & note about additionalProperties (#8132 )
...
* json: update grammars/README
* mention broken prefixItems
* add mention to llama-gbnf-validator
* json: explicit type: object for nested items object in cli example
2024-06-27 22:08:42 +01:00
loonerin
558f44bf83
CI: fix release build (Ubuntu+Mac) ( #8170 )
...
* CI: fix release build (Ubuntu)
PR #8006 changes defaults to build shared libs. However, CI for releases
expects static builds.
* CI: fix release build (Mac)
---------
Co-authored-by: loonerin <loonerin@users.noreply.github.com>
2024-06-27 21:01:23 +02:00
slaren
8172ee9da9
cmake : fix deprecated option names not working ( #8171 )
...
* cmake : fix deprecated option names not working
* remove LlAMA_OPENMP
2024-06-27 20:04:39 +02:00
Xuan Son Nguyen
16791b8f0b
Add chatml fallback for cpp llama_chat_apply_template
( #8160 )
...
* add chatml fallback for cpp `llama_chat_apply_template`
* remove redundant code
2024-06-27 18:14:19 +02:00
Georgi Gerganov
ab3679112d
flake.lock: Update ( #8071 )
...
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/e9ee548d90ff586a6471b4ae80ae9cfcbceb3420?narHash=sha256-4Zu0RYRcAY/VWuu6awwq4opuiD//ahpc2aFHg2CWqFY%3D' (2024-06-13)
→ 'github:NixOS/nixpkgs/d603719ec6e294f034936c0d0dc06f689d91b6c3?narHash=sha256-k3JqJrkdoYwE3fHE6xGDY676AYmyh4U2Zw%2B0Bwe5DLU%3D' (2024-06-20)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Philip Taron <philip.taron@gmail.com>
2024-06-27 08:37:29 -07:00
jukofyork
97877eb10b
Control vector loading fixes ( #8137 )
...
* Fixed leak in llama_control_vector_load_one() and allow llama_control_vector_load() to grow
* refactored `llama_control_vector_load_one()`
* allow multiple directions for same layer in same file
* llama_control_vector_load_one() and llama_control_vector_load() now break on error
* removed unnecessary ggml_free() call
2024-06-27 16:48:07 +02:00
Raj Hammeer Singh Hada
387952651a
Delete examples/llama.android/llama/CMakeLists.txt ( #8165 )
...
* Delete examples/llama.android/llama/CMakeLists.txt
https://github.com/ggerganov/llama.cpp/pull/8145#issuecomment-2194534244
This file is not being used for building on Android. `llama.cpp/examples/llama.android/llama/src/main/cpp/CMakeLists.txt` is being used instead.
* Update CMakeLists.txt
Pick local llama.cpp files instead of fetching content from git
2024-06-27 16:39:29 +02:00
Sigbjørn Skjæret
6030c61281
Add Qwen2MoE 57B-A14B model identifier ( #8158 )
...
* Add Qwen2MoE 57B-A14B
* Add Qwen2MoE 57B-A14B
2024-06-27 16:27:41 +02:00
Johannes Gäßler
85a267daaa
CUDA: fix MMQ stream-k for --split-mode row ( #8167 )
2024-06-27 16:26:05 +02:00
kustaaya
f675b20a3b
Added support for Viking pre-tokenizer ( #8135 )
...
Co-authored-by: kustaaya <kustaaya@protonmail.com>
2024-06-27 10:58:54 +02:00
Sigbjørn Skjæret
911e35bb8b
llama : fix CodeLlama FIM token checks ( #8144 )
...
* account for space prefix character
* use find instead
2024-06-27 10:46:41 +03:00
Francis Couture-Harpin
0996149911
convert-hf : allow converting the weird BitNet 1.3B
...
Its FFN size is 5460 which is not convenient.
The offending tensors are kept in F16,
which makes the final model 5.01 bpw.
2024-06-27 02:06:28 -04:00
Francis Couture-Harpin
961e293833
convert-hf : simplify BitNet pre-quantization
...
This still results in the exact same tensor weights and scales,
but it reveals some weirdness in the current algorithm.
2024-06-27 02:06:28 -04:00
Francis Couture-Harpin
89dc3b254c
ggml-quants : use ceiling division when quantizing q1_3
2024-06-27 02:06:28 -04:00
Francis Couture-Harpin
9465ec6e12
ggml-quants : ARM NEON vec_dot for q2_2 and q1_3
2024-06-27 02:06:28 -04:00
Francis Couture-Harpin
638ad52f87
ggml-quants : cleanup Q1_3 code formatting
2024-06-27 02:06:28 -04:00
Francis Couture-Harpin
ef1e345c85
ggml-quants : Q2_2 now faster than Q4_K on with AVX2
2024-06-27 02:06:28 -04:00
Francis Couture-Harpin
48b73b8498
ggml-quants : substract 1 when back in epi8
...
This makes the 1.625 bpw type go faster than q4_0. Still not the fastest.
2024-06-27 02:06:28 -04:00