llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-14 06:49:54 +00:00


Kawrakow 1bfc153e2f ggml : a faster version for Q4_1 x Q8_0 dot products (#1083)	2023-04-21 18:18:26 +03:00

* A faster version for Q4_1 x Q8_0 dot products

The idea behind this change is that Q8_0 quantized values get used many times in the matrix multiplications where they are involved. In the current implementations, every dot product has to compute the sum of the quants in the Q8_0 vector, so the same operation is repeated many times. Here we pre-compute the sum during Q8_0 quantization, store it in the now modified block_q8_0 struct, and then reuse this result in the subsequent dot products.

In a synthetic benchmark (just compute a bunch of dot products), this change speeds up the Q4_1 * Q8_0 dot product by 80%, making the performance identical to Q4_0 * Q8_0.

In practice, I see a ~15% gain in speed for token prediction on M2, and a ~5% gain on Ryzen 7950X. The speed gain in prompt evaluation is much bigger (around 50%).

I have only made the change for the scalar version, ARM_NEON, and AVX2, so we still need an AVX implementation.

* Cleaning up

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
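The pre-computation described in the commit message can be illustrated with a small scalar sketch. It is not the code from the commit: the struct names `BlockQ4_1` and `BlockQ8`, the field `s`, the block size of 32, and the nibble layout (low nibble first) are all assumptions made for illustration. It relies on the identity that, if a Q4_1 value decodes as `d4*q4 + m` and a Q8_0 value as `d8*q8`, the block dot product is `d4*d8*sum(q4*q8) + m*d8*sum(q8)`, where the last factor depends only on the Q8_0 block and can be cached at quantization time.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK = 32;  // values per block (assumed for this sketch)

// Hypothetical Q4_1-style block: x[i] ~ d * q[i] + m, with q[i] in [0, 15].
struct BlockQ4_1 {
    float   d;           // scale
    float   m;           // offset (minimum)
    uint8_t qs[QK / 2];  // two 4-bit quants per byte, low nibble first (assumed)
};

// Hypothetical Q8_0-style block with the cached sum: y[i] ~ d * qs[i].
struct BlockQ8 {
    float  d;       // scale
    float  s;       // d * sum(qs[i]), computed once during quantization
    int8_t qs[QK];  // signed 8-bit quants
};

// Quantize QK floats and cache the scaled sum of the quants.
BlockQ8 quantize_q8(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) amax = std::max(amax, std::fabs(x[i]));

    BlockQ8 b{};
    b.d = amax / 127.0f;
    const float id = (b.d != 0.0f) ? 1.0f / b.d : 0.0f;

    int sum = 0;
    for (int i = 0; i < QK; ++i) {
        b.qs[i] = static_cast<int8_t>(std::lround(x[i] * id));
        sum += b.qs[i];
    }
    b.s = b.d * static_cast<float>(sum);  // the pre-computed term
    return b;
}

// Baseline: re-derives sum(q8) inside every dot product.
float dot_naive(const BlockQ4_1& a, const BlockQ8& b) {
    int sumi = 0, sumq8 = 0;
    for (int i = 0; i < QK / 2; ++i) {
        const int lo = a.qs[i] & 0x0F;
        const int hi = a.qs[i] >> 4;
        sumi  += lo * b.qs[2 * i] + hi * b.qs[2 * i + 1];
        sumq8 += b.qs[2 * i] + b.qs[2 * i + 1];
    }
    return a.d * b.d * sumi + a.m * b.d * sumq8;
}

// Variant that reuses the cached b.s instead of summing the quants again.
float dot_cached(const BlockQ4_1& a, const BlockQ8& b) {
    int sumi = 0;
    for (int i = 0; i < QK / 2; ++i) {
        const int lo = a.qs[i] & 0x0F;
        const int hi = a.qs[i] >> 4;
        sumi += lo * b.qs[2 * i] + hi * b.qs[2 * i + 1];
    }
    return a.d * b.d * sumi + a.m * b.s;
}
```

Both functions should agree up to floating-point rounding. The saving comes from matrix multiplication reusing each Q8_0 block across many rows, so the sum is paid for once per block rather than once per dot product.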
CMakeLists.txt	ggml : a faster version for Q4_1 x Q8_0 dot products (#1083)	2023-04-21 18:18:26 +03:00
q8dot.cpp	ggml : a faster version for Q4_1 x Q8_0 dot products (#1083)	2023-04-21 18:18:26 +03:00
vdot.cpp	Adding a simple program to measure speed of dot products (#1041)	2023-04-18 19:00:14 +00:00