llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-13 04:00:16 +00:00

History

Georgi Gerganov 1a1a1c3845 llama : support quantum K cache (#4312 ) * llama : support quantum K cache (wip) * metal : add F32 -> Q8_0 copy kernel * cuda : add F32 -> Q8_0 copy kernel ggml-ci * cuda : use mmv kernel for quantum cache ops * llama : pass KV cache type through API * llama : fix build ggml-ci * metal : add F32 -> Q4_0 copy kernel * metal : add F32 -> Q4_1 copy kernel * cuda : wip * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels * llama-bench : support type_k/type_v * metal : use mm kernel only for quantum KV cache * cuda : add comment * llama : remove memory_f16 and kv_f16 flags --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-06 13:30:20 +02:00
..
CMakeLists.txt	build : link against build info instead of compiling against it (#3879 )	2023-11-02 08:50:16 +02:00
quantize-stats.cpp	llama : support quantum K cache (#4312 )	2023-12-06 13:30:20 +02:00

Georgi Gerganov 1a1a1c3845

llama : support quantum K cache (#4312 )

* llama : support quantum K cache (wip)

* metal : add F32 -> Q8_0 copy kernel

* cuda : add F32 -> Q8_0 copy kernel

ggml-ci

* cuda : use mmv kernel for quantum cache ops

* llama : pass KV cache type through API

* llama : fix build

ggml-ci

* metal : add F32 -> Q4_0 copy kernel

* metal : add F32 -> Q4_1 copy kernel

* cuda : wip

* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels

* llama-bench : support type_k/type_v

* metal : use mm kernel only for quantum KV cache

* cuda : add comment

* llama : remove memory_f16 and kv_f16 flags

---------

Co-authored-by: slaren <slarengh@gmail.com>

2023-12-06 13:30:20 +02:00

CMakeLists.txt

build : link against build info instead of compiling against it (#3879 )

2023-11-02 08:50:16 +02:00

quantize-stats.cpp

llama : support quantum K cache (#4312 )

2023-12-06 13:30:20 +02:00