llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-28 12:24:35 +00:00

Author	SHA1	Message	Date
JohannesGaessler	e7b9d97bae	More int mult, less float mult, worse performance	2023-05-12 09:11:47 +02:00
JohannesGaessler	d882d1c2fe	Performance no longer terrible	2023-05-11 23:27:06 +02:00
JohannesGaessler	4b12881329	WAKE ME UP	2023-05-11 22:47:38 +02:00
JohannesGaessler	3ed4588e22	Store layers in VRAM	2023-05-09 11:05:58 +02:00
JohannesGaessler	d052a0ed4c	Faster than CPU without 80% runtime memcpy	2023-05-09 09:47:55 +02:00
JohannesGaessler	229aa1f504	Works but slower than CPU	2023-05-09 09:47:55 +02:00
Johannes Gäßler	1f48b0abcf	Documented CUDA reproducibility, added warning (#1346)	2023-05-08 02:42:01 +02:00
slaren	58b367c2d7	cuBLAS: refactor and optimize f16 mat mul performance (#1259) * cuBLAS: refactor, convert fp16 to fp32 on device * cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16 * fix build * cuBLAS: update block_q5_1	2023-05-01 18:11:07 +02:00
slaren	b925f1f1b0	cuBLAS: fall back to pageable memory if pinned alloc fails (#1233) * cuBLAS: fall back to pageable memory if pinned alloc fails * cuBLAS: do not use pinned memory if env variable GGML_CUDA_NO_PINNED is set	2023-05-01 13:32:22 +02:00
slaren	7fc50c051a	cuBLAS: use host pinned memory and dequantize while copying (#1207) * cuBLAS: dequantize simultaneously while copying memory * cuBLAS: use host pinned memory * cuBLAS: improve ggml_compute_forward_mul_mat_f16_f32 with pinned memory * cuBLAS: also pin kv cache * fix rebase	2023-04-29 02:04:18 +02:00
Henri Vasserman	b1ee8f59b4	cuBLAS: non-contiguous tensor support (#1215) * Cuda: non-contiguous tensor support * remove extra stuff * rename * fix error * more fixes, now OpenBLAS and CLBlast build too * now then?	2023-04-29 01:31:56 +02:00
Stephan Walter	36d19a603b	Remove Q4_3 which is no better than Q5 (#1218)	2023-04-28 23:10:43 +00:00
Georgi Gerganov	574406dc7e	ggml : add Q5_0 and Q5_1 quantization (#1187) * ggml : add Q5_0 quantization (cuBLAS only) * ggml : fix Q5_0 qh -> uint32_t * ggml : fix q5_0 histogram stats * ggml : q5_0 scalar dot product * ggml : q5_0 ARM NEON dot * ggml : q5_0 more efficient ARM NEON using uint64_t masks * ggml : rename Q5_0 -> Q5_1 * ggml : adding Q5_0 mode * quantize : add Q5_0 and Q5_1 to map * ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195) --------- Co-authored-by: Stephan Walter <stephan@walter.name>	2023-04-26 23:14:13 +03:00
Georgi Gerganov	7a32fcb3b2	ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179) * ggml : add Q8_0 quantization format (rename the old one to Q8_1) * tests : fix test-quantize-fns * ggml : finalize Q8_0 implementation * ggml : use q4_0_q8_0 and q4_2_q8_0 * ggml : fix Q8_0 dot product bug (ARM) * ggml : Q8_0 unroll x2 * ggml : fix bug - using wrong block type * ggml : extend quantize_fns_t with "vec_dot_type" * ggml : fix Q8_0 to use 255 values out of 256 * ggml : fix assert using wrong QK4_2 instead of QK4_3	2023-04-25 23:40:51 +03:00
slaren	50cb666b8a	Improve cuBLAS performance by using a memory pool (#1094) * Improve cuBLAS performance by using a memory pool * Move cuda specific definitions to ggml-cuda.h/cu * Add CXX flags to nvcc * Change memory pool synchronization mechanism to a spin lock General code cleanup	2023-04-21 21:59:17 +02:00
slaren	2005469ea1	Add Q4_3 support to cuBLAS (#1086)	2023-04-20 20:49:53 +02:00
slaren	02d6988121	Improve cuBLAS performance by dequantizing on the GPU (#1065)	2023-04-20 03:14:14 +02:00

17 Commits