2024-06-08 12:42:01 +00:00
|
|
|
set(TARGET llama-vdot)
|
2023-04-18 19:00:14 +00:00
|
|
|
add_executable(${TARGET} vdot.cpp)
|
|
|
|
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
|
|
|
|
target_compile_features(${TARGET} PRIVATE cxx_std_11)
|
ggml : a faster version for Q4_1 x Q8_0 dot products (#1083)
* A faster version for Q4_1 x Q8_0 dot products
The idea nehind being that Q8_0 quantized
values get used many times in the matrix multiplications
where they are involved. In the current implementations,
when we are evaluating the dot products, we need to compute
the sum of the quants in the Q8_0 vector, so the same
operation is repeated many times. Here we pre-compute
the sum during Q8_0 quantization, store it in the
now modified block_q8_0 struct, and then reuse this
result in the subsequent dot products.
In a synthetic benchmark (just compute a bunch of dot
products), this change speeds up the Q4_1 * Q8_0 dot
product by 80%, making the performance identical to
Q4_0 * Q8_0.
In practical application, I see a ~15% gain in speed for
token prediction on M2, and ~5% gain on Ryzen 7950X.
The speed gain in the prompt evaluation is much bigger
(around 50%).
I have only done the change for the scalar version,
ARM_NEON, and AVX2, so we still need an AVX implementation.
* Cleaning up
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-04-21 15:18:26 +00:00
|
|
|
|
2024-06-08 12:42:01 +00:00
|
|
|
set(TARGET llama-q8dot)
|
ggml : a faster version for Q4_1 x Q8_0 dot products (#1083)
* A faster version for Q4_1 x Q8_0 dot products
The idea nehind being that Q8_0 quantized
values get used many times in the matrix multiplications
where they are involved. In the current implementations,
when we are evaluating the dot products, we need to compute
the sum of the quants in the Q8_0 vector, so the same
operation is repeated many times. Here we pre-compute
the sum during Q8_0 quantization, store it in the
now modified block_q8_0 struct, and then reuse this
result in the subsequent dot products.
In a synthetic benchmark (just compute a bunch of dot
products), this change speeds up the Q4_1 * Q8_0 dot
product by 80%, making the performance identical to
Q4_0 * Q8_0.
In practical application, I see a ~15% gain in speed for
token prediction on M2, and ~5% gain on Ryzen 7950X.
The speed gain in the prompt evaluation is much bigger
(around 50%).
I have only done the change for the scalar version,
ARM_NEON, and AVX2, so we still need an AVX implementation.
* Cleaning up
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-04-21 15:18:26 +00:00
|
|
|
add_executable(${TARGET} q8dot.cpp)
|
|
|
|
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
|
|
|
|
target_compile_features(${TARGET} PRIVATE cxx_std_11)
|