Aaron Miller
ff4212d20f
q8 mat*vec
2023-11-03 17:22:21 -04:00
Aaron Miller
9db90cbe12
f16 mv broadcasting fix (gqa fix)
2023-11-03 17:22:21 -04:00
Cebtenzzre
3d850db767
kompute : remove Q6_K from list of supported quant types
2023-11-03 17:22:21 -04:00
Cebtenzzre
24a4a5956a
kompute : only try to use Vulkan for LLaMA itself
2023-11-03 17:22:21 -04:00
Adam Treat
bc4b5ed1cb
Fixes for subgroup size to bring AMD and NVIDIA in line with each other for all kernels.
2023-11-03 17:22:21 -04:00
Adam Treat
de589ced7c
Change this back to be in agreement with metal and our previous softmax kernel.
2023-11-03 17:22:21 -04:00
Adam Treat
6ac39752bf
Fix up the upstream CMakeLists.txt so we can build just llama.cpp with our branch.
2023-11-03 17:22:21 -04:00
Adam Treat
32289aa447
Fixes for norm.
2023-11-03 17:22:21 -04:00
Adam Treat
06d4b21598
Fix offset into the qh; we now have working Vulkan acceleration for GGUF'd LLaMA.
2023-11-03 17:22:21 -04:00
Adam Treat
f1c9bc1821
Add q6_k getrows and mul*vec kernel.
2023-11-03 17:22:21 -04:00
Adam Treat
4b223ec432
Refactor getrows to use common code and get ready for q6_k.
2023-11-03 17:22:21 -04:00
Adam Treat
5509f74318
Minor cleanup.
2023-11-03 17:22:21 -04:00
Adam Treat
601905e75e
Move the subgroups and printf into common.
2023-11-03 17:22:21 -04:00
Adam Treat
93306f16d0
Consolidate code for mat x vec kernels and use subgroups more extensively.
2023-11-03 17:22:21 -04:00
Adam Treat
77135a3bf5
Add common boilerplate code via include and eliminate copy-paste.
2023-11-03 17:22:21 -04:00
Adam Treat
9e4f8b4acc
Upload immediately to device.
2023-11-03 17:22:21 -04:00
Cebtenzzre
6b6c73a9e3
kompute : don't fail build because of -Warray-bounds
There are some warnings in debug builds that are likely to be false
positives.
2023-11-03 17:22:21 -04:00
Adam Treat
1b1416d7b7
Support for gguf.
2023-11-03 17:22:20 -04:00
Peter Sugihara
d9b33fe95b
metal : round up to 16 to fix MTLDebugComputeCommandEncoder assertion (#3938)
2023-11-03 21:18:18 +02:00
Xiao-Yong Jin
5ba3746171
ggml-metal: fix yarn rope (#3937)
2023-11-03 14:00:31 -04:00
slaren
abb77e7319
ggml-cuda : move row numbers to x grid dim in mmv kernels (#3921)
2023-11-03 12:13:09 +01:00
Georgi Gerganov
8f961abdc4
speculative : change default p_accept to 0.5 + CLI args (#3919)
ggml-ci
2023-11-03 09:41:56 +02:00
Georgi Gerganov
05816027d6
common : YAYF (yet another YARN fix) (#3925)
ggml-ci
2023-11-03 09:24:00 +02:00
cebtenzzre
3fdbe6b66b
llama : change yarn_ext_factor placeholder to -1 (#3922)
2023-11-03 08:31:58 +02:00
Kerfuffle
629f917cd6
cuda : add ROCM aliases for CUDA pool stuff (#3918)
2023-11-02 21:58:22 +02:00
Andrei
51b2fc11f7
cmake : fix relative path to git submodule index (#3915)
2023-11-02 21:40:31 +02:00
Georgi Gerganov
224e7d5b14
readme : add notice about #3912
2023-11-02 20:44:12 +02:00
Georgi Gerganov
c7743fe1c1
cuda : fix const ptrs warning causing ROCm build issues (#3913)
2023-11-02 20:32:11 +02:00
Oleksii Maryshchenko
d6069051de
cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903)
* Use CUDA memory pools for async alloc/dealloc.
* If the CUDA device doesn't support memory pools, then use the old implementation.
* Removed redundant cublasSetStream
---------
Co-authored-by: Oleksii Maryshchenko <omaryshchenko@dtis.com>
2023-11-02 19:10:39 +02:00
Georgi Gerganov
4ff1046d75
gguf : print error for GGUFv1 files (#3908)
2023-11-02 16:22:30 +02:00
slaren
21958bb393
cmake : disable LLAMA_NATIVE by default (#3906)
2023-11-02 14:10:33 +02:00
Georgi Gerganov
2756c4fbff
gguf : remove special-case code for GGUFv1 (#3901)
ggml-ci
2023-11-02 11:20:21 +02:00
Georgi Gerganov
1efae9b7dc
llm : prevent 1-D tensors from being GPU split (#3697)
2023-11-02 09:54:44 +02:00
cebtenzzre
b12fa0d1c1
build : link against build info instead of compiling against it (#3879)
* cmake : fix build when .git does not exist
* cmake : simplify BUILD_INFO target
* cmake : add missing dependencies on BUILD_INFO
* build : link against build info instead of compiling against it
* zig : make build info a .cpp source instead of a header
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* cmake : revert change to CMP0115
---------
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
2023-11-02 08:50:16 +02:00
Georgi Gerganov
4d719a6d4e
cuda : check if this fixes Pascal card regression (#3882)
2023-11-02 08:35:10 +02:00
Georgi Gerganov
183b3fac6c
metal : fix build errors and kernel sig after #2268 (#3898)
2023-11-02 08:33:37 +02:00
cebtenzzre
2fffa0d61f
cuda : fix RoPE after #2268 (#3897)
2023-11-02 07:49:44 +02:00
cebtenzzre
0eb332a10f
llama : fix llama_context_default_params after #2268 (#3893)
2023-11-01 19:29:14 -04:00
slaren
d02e98cde0
ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel (#3891)
* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel
* fix warnings
2023-11-01 23:10:09 +01:00
cebtenzzre
898aeca90a
llama : implement YaRN RoPE scaling (#2268)
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 18:04:33 -04:00
Georgi Gerganov
c43c2da8af
llm : fix llm_build_kqv taking unused tensor (benign, #3837)
2023-11-01 23:08:30 +02:00
Georgi Gerganov
523e49b111
llm : fix falcon norm after refactoring (#3837)
2023-11-01 23:00:50 +02:00
Georgi Gerganov
e16b9fa4ba
metal : multi-simd softmax (#3710)
ggml-ci
2023-11-01 21:25:00 +02:00
Georgi Gerganov
ff8f9a88da
common : minor (#3715)
2023-11-01 21:15:55 +02:00
Georgi Gerganov
50337961a6
llm : add llm_build_context (#3881)
* llm : add llm_build_context
* llm : deduce norm eps based on type + explicit max_alibi_bias, clamp_kqv
* llm : restore the non-graph llm_build_ functional API
ggml-ci
* llm : cleanup + comments
2023-11-01 20:11:02 +02:00
bandoti
0e40806c1c
common : allow caller to handle help/argument exceptions (#3715)
* Allow caller to handle help/argument exceptions
* Prepend newline to usage output
* Add new gpt_params_parse_ex function to hide arg-parse impl
* Fix issue blocking success case
* exit instead of returning false
* Update common/common.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/common.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-01 19:42:01 +02:00
staviq
a2758d08e4
log : make generating separate log files optional (#3787)
* impl --log-new, --log-append
* Update common/log.h
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Update common/log.h
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Apply suggestions from code review
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 16:18:27 +02:00
l3utterfly
e75dfdd31b
sampling : null grammar field after reset (#3885)
2023-11-01 15:40:43 +02:00
Georgi Gerganov
9a3b4f6c86
ggml : fix UNUSED macro (#3762)
2023-11-01 13:50:45 +02:00
Andrew Godfrey
73bdcb395e
finetune : add -ngl parameter (#3762)
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning following the README, I got an assert here.
This probably isn't an important case, because inference later gives a warning saying you should use f16 or f32 instead when using LoRA.
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 13:49:04 +02:00