# quantize

TODO

## Llama 2 7B

| Quantization | Bits per Weight (BPW) |
|--------------|-----------------------|
| Q2_K         | 3.35                  |
| Q3_K_S       | 3.50                  |
| Q3_K_M       | 3.91                  |
| Q3_K_L       | 4.27                  |
| Q4_K_S       | 4.58                  |
| Q4_K_M       | 4.84                  |
| Q5_K_S       | 5.52                  |
| Q5_K_M       | 5.68                  |
| Q6_K         | 6.56                  |

## Llama 2 13B

| Quantization | Bits per Weight (BPW) |
|--------------|-----------------------|
| Q2_K         | 3.34                  |
| Q3_K_S       | 3.48                  |
| Q3_K_M       | 3.89                  |
| Q3_K_L       | 4.26                  |
| Q4_K_S       | 4.56                  |
| Q4_K_M       | 4.83                  |
| Q5_K_S       | 5.51                  |
| Q5_K_M       | 5.67                  |
| Q6_K         | 6.56                  |

## Llama 2 70B

| Quantization | Bits per Weight (BPW) |
|--------------|-----------------------|
| Q2_K         | 3.40                  |
| Q3_K_S       | 3.47                  |
| Q3_K_M       | 3.85                  |
| Q3_K_L       | 4.19                  |
| Q4_K_S       | 4.53                  |
| Q4_K_M       | 4.80                  |
| Q5_K_S       | 5.50                  |
| Q5_K_M       | 5.65                  |
| Q6_K         | 6.56                  |
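
The BPW figures above map directly to an on-disk size estimate: bytes ≈ parameter count × BPW / 8. The sketch below is not part of llama.cpp; the nominal `7e9` parameter count and the handful of rows it uses are assumptions copied from the Llama 2 7B table, so treat the output as a rough approximation.

```cpp
// bpw_size.cpp -- back-of-the-envelope model size estimate from BPW.
// Illustrative only: the parameter count and quant entries below are
// assumptions taken from the Llama 2 7B table, not read from a model file.
#include <cstdio>

int main() {
    struct Entry { const char * name; double bpw; };
    const Entry quants[] = {
        { "Q2_K",   3.35 },
        { "Q4_K_S", 4.58 },
        { "Q4_K_M", 4.84 },
        { "Q6_K",   6.56 },
    };
    const double n_params = 7e9; // nominal Llama 2 7B parameter count

    for (const Entry & q : quants) {
        // bytes = parameters * (bits per weight) / (8 bits per byte)
        const double bytes = n_params * q.bpw / 8.0;
        printf("%-7s %4.2f BPW -> ~%.1f GiB\n",
               q.name, q.bpw, bytes / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```

For example, Q4_K_M at 4.84 BPW works out to roughly 3.9 GiB for a nominal 7e9-parameter model. Actual quantized files differ slightly, since a single BPW figure averages over per-tensor type choices and does not include file metadata.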