Georgi Gerganov
8c2b881281
cuda : poc for norm quants (only -b 1 works)
2023-08-30 21:42:28 +03:00
Georgi Gerganov
253eab8ae1
ggml : poc for normalizing weights for better quantization (metal)
2023-08-30 19:05:36 +03:00
Johannes Gäßler
92b1bbd2ec
CUDA: fix RoPE asserts, block sizes ( #2833 )
2023-08-28 14:23:55 +03:00
Georgi Gerganov
eaa13a48ff
falcon : fix CUDA inference by making K and Q contiguous ( #2830 )
...
* falcon : fix CUDA inference by making K and Q contiguous
ggml-ci
* cuda : add assert to guard from non-cont ropes
2023-08-27 16:40:48 +03:00
Kawrakow
a6d1189fdd
k_quants tuning for Falcon-7b ( #2816 )
...
* Make ggml-cuda.cu build with QK_K = 64
Using LLAMA_CUDA_FORCE_DMMV = ON and -nommq it runs and produces
a meaningful result.
* k_quants tuning for Falcon-7b
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-27 15:19:59 +03:00
Henri Vasserman
6bbc598a63
ROCm Port ( #1087 )
...
* use hipblas based on cublas
* Update Makefile for the Cuda kernels
* Expand arch list and make it overrideable
* Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5 )
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non CUDA/HIP
---------
Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>
2023-08-25 12:09:42 +03:00
Georgi Gerganov
3f460a2b72
cuda : add RoPE kernel for mode == 2 (NeoX) ( #2760 )
...
* cuda : add RoPE kernel for mode == 2 (NeoX)
* falcon : do not offload the embeddings layer
2023-08-25 11:55:59 +03:00
Georgi Gerganov
cf658adc83
llm : add Falcon support ( #2717 )
...
* llama : refactor GGUF constants into static maps
* llama : check if model architecture is known
* llama : refactor llama_model_load_internal()
* gguf : add KV constant maps
* llm : read arch-specific KVs
* convert : add dummy scores + types
* falcon : load tensor data (CPU only)
* llama : fix loading progress bar
* llama : add arch member to llama_model
* falcon : CPU inference working
* falcon : support non-40B models
* falcon : minor
* llama : minor updates
ggml-ci
* convert-falcon-hf-to-gguf.py : fix special token mapping
* llama.cpp : llama default UNK token = id 0
* llama.cpp : fix bpe tokenizer
* llama.cpp : fix the fix of bpe tokenizer
* ggml : pass eps to ggml_norm
* metal : implement RoPE (mode = 2) + avoid ggml_repeat
* ggml : ggml_repeat always creates new tensor
* falcon : copy-paste self-attention from LLaMA
* metal : print extra compute pipeline info
* falcon : minor changes (still chasing the Metal problem)
* llama.cpp : fix linefeed token
* metal : fix GELU kernel numerical stability by using precise::tanh
* metal : temporary workaround for the concurrency optimization bug
* falcon : add CUDA offloading (#2739 )
* llama : better model naming and size reporting
* llama : prep new tokenizer support
* llama : advanced BPE tokenizer based on ggllm.cpp imlpementation
* llama : remove oboslete comment
ggml-ci
* common : remove obsolete BPE API + disable test-tokenizer-1
* llama : revert BPE special-case in llama_byte_to_token()
* cuda : add TODOs for RoPE NeoX implementation
* llama : default special tokens based on vocab type
* perplexity : add log for start of tokenization
---------
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
2023-08-23 23:08:04 +03:00
Johannes Gäßler
c63bb1d16a
CUDA: use mul_mat_q kernels by default ( #2683 )
2023-08-22 22:47:05 +02:00
Jiahao Li
800c9635b4
Fix CUDA softmax by subtracting max value before exp ( #2665 )
2023-08-22 20:27:06 +02:00
slaren
1123f7fbdf
ggml-cuda : use graph allocator ( #2684 )
...
use a different function for no_alloc to avoid breaking backwards compat, fixes lora
remove 512 n_batch limit
fixed 2048 batch size
cleanup
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-08-22 15:25:19 +02:00
Georgi Gerganov
ef3f333d37
ggml : sync latest (SAM + SD operators, CUDA alibi) ( #2709 )
...
* ggml : sync latest (SAM + SD operators, CUDA alibi)
ggml-ci
* ggml : fix tabs
2023-08-22 14:22:08 +03:00
slaren
097e121e2f
llama : add benchmark example ( #2626 )
...
* llama : add benchmark example
* add to examples CMakeLists.txt
* fix msvc build
* add missing include
* add Bessel's correction to stdev calculation
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* improve markdown formatting
* add missing include
* print warning is NDEBUG is not defined
* remove n_prompt and n_gen from the matrix, use each value separately instead
* better checks for non-optimized builds
* llama.cpp : fix MEM_REQ_SCRATCH0 reusing the value of n_ctx of the first call
* fix json formatting
* add sql output
* add basic cpu and gpu info (linx/cuda only)
* markdown: also show values that differ from the default
* markdown: add build id
* cleanup
* improve formatting
* formatting
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-08-18 12:44:58 +02:00
Johannes Gäßler
1cd06fa25e
CUDA: launch_bounds, small q4_K, q5_K mmq refactor ( #2596 )
2023-08-14 10:41:22 +02:00
Johannes Gäßler
f64d44a9b9
CUDA: Fixed OpenLLaMA 3b mmq, reduced compile time ( #2590 )
2023-08-13 00:24:45 +02:00
Johannes Gäßler
25d43e0eb5
CUDA: tuned mul_mat_q kernels ( #2546 )
2023-08-09 09:42:34 +02:00
Johannes Gäßler
f514d1b306
CUDA: faster k-quant mul_mat_q kernels ( #2525 )
2023-08-05 18:20:44 +02:00
Cebtenzzre
4329d1acb0
CUDA: use min compute capability of GPUs actually used ( #2506 )
2023-08-04 17:35:22 +02:00
Cebtenzzre
02f9d96a86
CUDA: check if event is NULL before cudaStreamWaitEvent ( #2505 )
...
Fixes #2503
2023-08-04 17:34:32 +02:00
Johannes Gäßler
468ea24fb4
CUDA: faster non k-quant mul_mat_q kernels ( #2483 )
2023-08-02 18:04:04 +02:00
Johannes Gäßler
4f6b60c776
CUDA: Fix models with output size != 32000 ( #2480 )
2023-08-02 16:48:10 +02:00
Johannes Gäßler
0728c5a8b9
CUDA: mmq CLI option, fixed mmq build issues ( #2453 )
2023-07-31 15:44:35 +02:00
Johannes Gäßler
1215ed7d5c
CUDA: Implemented row flattening for non-glm RoPE ( #2468 )
2023-07-31 14:32:30 +02:00
Johannes Gäßler
2dbf518911
CUDA: fewer memory bank conflicts for mul_mat_q ( #2458 )
2023-07-31 13:18:51 +02:00
Johannes Gäßler
11f3ca06b8
CUDA: Quantized matrix matrix multiplication ( #2160 )
...
* mmq implementation for non k-quants
* q6_K
* q2_K
* q3_k
* q4_K
* vdr
* q5_K
* faster q8_1 loading
* loop unrolling
* add __restrict__
* q2_K sc_high
* GGML_CUDA_MMQ_Y
* Updated Makefile
* Update Makefile
* DMMV_F16 -> F16
* Updated README, CMakeLists
* Fix CMakeLists.txt
* Fix CMakeLists.txt
* Fix multi GPU out-of-bounds
2023-07-29 23:04:44 +02:00
Johannes Gäßler
9baf9ef304
CUDA: faster multi GPU synchronization ( #2448 )
2023-07-29 23:04:10 +02:00
Kawrakow
129d844c87
Fix Q4_K and Q5_K for QK_K = 64 on CUDA ( #2359 )
...
* Fix Q4_K and Q5_K for QK_K = 64
* Very slightly better Q5_K bit fiddling
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-25 13:48:04 +03:00
slaren
41c674161f
make rms_norm_eps a parameter ( #2374 )
...
* make rms_norm_eps a parameter
* add rms_norm_eps to command line
* fix baby llama, test-grad0
* use scientific notation for eps param in the help
ggml-ci
2023-07-24 17:57:12 +02:00
Georgi Gerganov
5b2b2dc6ae
ggml : sync (unary ops refactor, static-correctness) ( #2370 )
...
* ggml : sync (unary ops, tests)
ggml-ci
* tests : remove unnecessary funcs
2023-07-24 14:46:21 +03:00
Kawrakow
2f9cf974a0
Some more Q4_K and Q5_K speedup on CUDA ( #2346 )
...
* Faster Q5_K on CUDA
* Small Q5_K improvement on older GPUs
* Spped up Q4_K on CUDA
GTX1660: 29.5 ms/t -> 25.6 ms/t
RTX4080: 8.40 ms/t -> 8.25 ms/t
* Spped up Q4_K on CUDA
GTX1660: 36.7 ms/t -> 35.6 ms/t
RTX4080: 9.8 ms/t -> 9.5 ms/t
* Address PR comments
* Add some comments to satisfy PR reviewer
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-24 00:19:47 +03:00
slaren
95a6c595e7
ggml: move op parameters from tensors to ggml_tensor::op_params ( #2333 )
...
* ggml: move op parameters from tensors to ggml_tensor::op_params
* alibi: use memcpy for float params
* remove `src[1] = NULL` in ops
2023-07-23 14:36:02 +02:00
Georgi Gerganov
e76d630df1
llama : grouped-query attention + LLaMAv2 70B support ( #2276 )
...
* CUDA: GQA implementation
* llama : support for GQA and LLaMAv2 70B
ggml-ci
* py : fix hparams parsing (if-else blocks)
ggml-ci
* py : oh boy ..
ggml-ci
* help : fix gqa value for 70B
ggml-ci
---------
Co-authored-by: JohannesGaessler <johannesg@5d6.de>
2023-07-23 15:09:47 +03:00
Kawrakow
d2a43664f9
Speed up Q4_K ( #2322 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-23 08:49:20 +03:00
Johannes Gäßler
b9b7d94fc1
CUDA: Fixed 7b q3_K_S with mul_mat_vec_q ( #2313 )
2023-07-22 21:27:34 +02:00
Kawrakow
d924522a46
Custom RoPE + bettter memory management for CUDA ( #2295 )
...
* Custom RoPE + bettter memory management for CUDA
* Adjusted look ahead in ggml_cuda_pool_malloc to 5%
This is sufficient it seems.
We end up using about 200 MB less VRAM that way when running
the 13B model with context 8192.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21 17:27:51 +03:00
Georgi Gerganov
ae178ab46b
llama : make tensor_split ptr instead of array ( #2272 )
2023-07-21 13:10:51 +03:00
Jiahao Li
7568d1a2b2
Support dup & cont ops on CUDA ( #2242 )
2023-07-17 20:39:29 +03:00
Bach Le
7cdd30bf1f
cuda : allocate all temporary ggml_tensor_extra_gpu from a fixed-size buffer ( #2220 )
2023-07-14 22:00:58 +03:00
Jiahao Li
206e01de11
cuda : support broadcast add & mul ( #2192 )
...
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-14 21:38:24 +03:00
Johannes Gäßler
4304bd3cde
CUDA: mul_mat_vec_q kernels for k-quants ( #2203 )
2023-07-14 19:44:08 +02:00
Georgi Gerganov
697966680b
ggml : sync (ggml_conv_2d, fix mul_mat bug, CUDA GLM rope)
2023-07-14 16:36:41 +03:00
Howard Su
ff5d58faec
Fix compile error on Windows CUDA ( #2207 )
2023-07-13 21:58:09 +08:00
Georgi Gerganov
680e6f9177
cuda : add gelu support
2023-07-12 20:32:15 +03:00
Johannes Gäßler
2b5eb72e10
Fixed __dp4a compute capability: 6.0 -> 6.1 ( #2189 )
2023-07-12 10:38:52 +02:00
Georgi Gerganov
f7d278faf3
ggml : revert CUDA broadcast changes from #2183 ( #2191 )
2023-07-12 10:54:19 +03:00
Georgi Gerganov
20d7740a9b
ggml : sync (abort callback, mul / add broadcast, fix alibi) ( #2183 )
2023-07-11 22:53:34 +03:00
Spencer Sutton
5bf2a27718
ggml : remove src0 and src1 from ggml_tensor and rename opt to src ( #2178 )
...
* Add ggml changes
* Update train-text-from-scratch for change
* mpi : adapt to new ggml_tensor->src
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-11 19:31:10 +03:00
Johannes Gäßler
64639555ff
Fixed OpenLLaMA 3b CUDA mul_mat_vec_q ( #2144 )
2023-07-08 20:01:44 +02:00
Johannes Gäßler
061f5f8d21
CUDA: add __restrict__ to mul mat vec kernels ( #2140 )
2023-07-08 00:25:15 +02:00
Johannes Gäßler
924dd22fd3
Quantized dot products for CUDA mul mat vec ( #2067 )
2023-07-05 14:19:42 +02:00