* ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b
* ggml-quants : faster 1.625 bpw AVX2 vec_dot
Not using a lookup table anymore makes it match q4_0 speed.
* gguf-py : fix formatting
* llama : remove spaces on empty line
* ggml-quants : subtract 1 when back in epi8
This makes the 1.625 bpw type go faster than q4_0. Still not the fastest.
* ggml-quants : Q2_2 now faster than Q4_K on with AVX2
* ggml-quants : cleanup Q1_3 code formatting
* ggml-quants : ARM NEON vec_dot for q2_2 and q1_3
* ggml-quants : use ceiling division when quantizing q1_3
* convert-hf : simplify BitNet pre-quantization
This still results in the exact same tensor weights and scales,
but it reveals some weirdness in the current algorithm.
* convert-hf : allow converting the weird BitNet 1.3B
Its FFN size is 5460 which is not convenient.
The offending tensors are kept in F16,
which makes the final model 5.01 bpw.
* bitnet : replace 1.58b with b1.58, as in the paper
* ggml-quants : fix build failure on Windows
* ggml-quants : attempt to fix Arm 32-bit support
* ggml : add some informative comments in q1_3 vec_dot
* ggml : add TQ1_0 and TQ2_0 ternary quantization types
* ggml : even faster TQ2_0
* ggml : also faster TQ1_0
Same optimization as for TQ2_0 by offsetting the sum instead of the weights.
This makes TQ1_0 almost as fast as Q8_0 on AVX2.
* ggml : fix build issues in certain environments
* ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0
* ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat
The compiler seems smart enough to use the same instruction
even when using vget_high_s8 instead.
* ggml : remove q1_3 and q2_2
No more 1.625 bpw and 2.000 bpw,
now instead using 1.6875 bpw and 2.0625 bpw
with TQ1_0 and TQ2_0, respectively.
* llama : remove the separate scale tensors of BitNet b1.58
They won't be needed, since the remaining ternary quant types have
built-in scales.
* ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency
* ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot
Not yet tested on hardware which supports it,
might not work or might not even compile. But also it might.
It should make the performance better on recent ARM CPUs.
* ggml-quants : remove comment about possible format change of TQ2_0
Making it slightly more convenient for AVX512
but less convenient for everything else is not worth the trouble.
* gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0
* ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0
This does not change anything for ternary models,
since their values should never end up being in halfway cases anyway.
* convert : allow direct conversion to TQ1_0 and TQ2_0
The token embeddings and output tensors are kept in F16
to allow quantizing them to Q4_K and Q6_K with llama-quantize.
* llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0
Q4_0 is not completely symmetric (so not lossless for ternary models),
but it should be good enough.
* ggml-quants : allow using ARM dot product instructions for TQ1_0
* ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support
* ggml : remove unused ggml_mul special case
It would otherwise conflict with the more general
optimization coming with Mamba-2.
* ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators
* test-backend-ops : add TQ1_0 and TQ2_0 comments for later
Not yet adding uncommented, because some backends like SYCL and Metal
do not properly handle unknown types in supports_op for GGML_OP_MUL_MAT.
(and Metal also doesn't handle it with GGML_OP_GET_ROWS)
Support for TQ1_0 and TQ2_0 for other backends than CPU
will be added in follow-up pull requests.
* Adding IQ2_S and IQ2_M as a single cumulative commit
* Update examples/quantize/quantize.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* Resurrecting iq3_xs
After all the experimentation, nothing was better than this.
* Minor PPL improvement via a block scale fudge factor
* Minor improvement via 3 neighbours
* iq3_xs: working scalar and AVX2 dot products
* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)
* iq3_xs: working Metal implementation
* Adding IQ3_M - IQ3_XS mix with mostly Q4_K
* iiq3_xs: a 3.4375 bpw variant
* iq3_xs: make CUDA work for new version
* iq3_xs: make scalar and AVX2 work for new version
* iq3_s: make ARM_NEON work with new version
* iq3_xs: make new version work on metal
Performance is very similar to Q3_K_S
* iq3_xs: tiny Metal speed improvement
* iq3_xs: tiny Metal speed improvement
* Fix stupid warning
* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS
* iq3_xs: rename to iq3_s
* iq3_s: make tests pass
* Move Q3_K_XS mix to 3.25 bpw
* Attempt to fix failing tests
* Another attempt to fix the Windows builds
* Attempt to fix ROCm
* ROCm again
* iq3_s: partial fix for QK_K = 64
* iq3_s: make it work on metal for QK_K = 64
Pleasent surprise: the coding was super-block size independent,
so all it took was to delete some QK_K == 256 guards.
* Will this fix ROCm?
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm
armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"
On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.
* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm
armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"
On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.
* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm
armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"
On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.
* ggml: update unit tests for the new vec_dot interface
* llama.cpp: add MATMUL_INT8 capability to system_info
* iq3_xxs: quantize/dequantize
RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.
* iq3_xxs: CUDA dequantize works
* iq2_xxs: tuning quantization
* iq3_xxs: starting to look better
PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717
This is better than Q3_K_XS, with a 5% reduction in quantized model
size.
* iq3_xxs: CUDA dot product
We have
PP-512: 5891 t/s
TG-128: 143.9 t/s
* iq3_xxs: scalar and AVX2 dot products
* iq3_xxs: ARM_NEON and Metal
Metal performance is decent, ARM_NEON is pathetic
* iq3_xxs: slightly better grid points
* Faster iq3_xxs and iq2_xs dot products on CUDA
* iq3_xxs: add some quant mix
* iq3_xxs: fix failing quantization test
Dot product still fails. Is this real?
* iq3_xxs: hopefully fix ROCm
* iq3_xxs: failing tests
This time the dot product accuracy did find an actual bug
in the AVX2 implementation.
* Add IQ3_XXS to test-backend-ops
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* iq2_xs: basics
* iq2_xs: this should have been in the basics
* iq2_xs: CUDA and scalar CPU works
* iq2_xs: WIP Metal
* iq2_xs: Metal now works
* iq2_xs: working, but dog slow, ARM_NEON dot product
* iq2_xs: better ARM_NEON dot product
We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.
* iq2_xs: AVX2 dot product - 19.5 t/s
* iq2_xs: faster AVX2 dit product
21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.
* iq2_xs: had forgotten to delete iq2-data.h
* Add llama enum for IQ2_XS
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* iq2_xxs: basics
* iq2_xxs: scalar and AVX2 dot products
Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change later, for now this is what we have.
* iq2_xxs: ARM_NEON dot product
Somehow strangely slow (112 ms/token).
* iq2_xxs: WIP Metal
Dequantize works, something is still wrong with the
dot product.
* iq2_xxs: Metal dot product now works
We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s
Not the greatest performance, but not complete garbage either.
* iq2_xxs: slighty faster dot product
TG-128 is now 48.4 t/s
* iq2_xxs: slighty faster dot product
TG-128 is now 50.9 t/s
* iq2_xxs: even faster Metal dot product
TG-128 is now 54.1 t/s.
Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.
* iq2_xxs: dequantize CUDA kernel - fix conflict with master
* iq2_xxs: quantized CUDA dot product (MMVQ)
We get TG-128 = 153.1 t/s
* iq2_xxs: slightly faster CUDA dot product
TG-128 is now at 155.1 t/s.
* iq2_xxs: add to llama ftype enum
* iq2_xxs: fix MoE on Metal
* Fix missing MMQ ops when on hipBLAS
I had put the ggml_supports_mmq call at the wrong place.
* Fix bug in qequantize_row_iq2_xxs
The 0.25f factor was missing.
Great detective work by @ggerganov!
* Fixing tests
* PR suggestion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Generalize quantize_fns for simpler FP16 handling
* Remove call to ggml_cuda_mul_mat_get_wsize
* ci : disable FMA for mac os actions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Starting to add k-quantization to ggml
I think it is better to have quantization separate from
ggml. For now just adding the k-quants there, but it would be
better to also factor out the existing ggml quantizations.
* Adding Q3_K and Q8_K (de)-quantization
* Q3_K now working on CUDA and AVX2/scalar
CUDA is not ideal - ~50% slower than Q4_0 for
single token prediction, about the same in batch
mode (perplexity). CPU single token is ~55 ms
(on Ryzen 7950X).
* Some improvement for Q3_K on CUDA
It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.
* Some more CUDA optimizations for Q3_K
Single token is now 20.5 ms/token (~20% slower than Q4_0).
Perplexity is on par with Q4_0.
* Adding Q4_K - scalar, AVX2, CUDA
Performance is the same or perhaps very slightly better than Q4_0 on the CPU.
On the GPU, single token prediction is ~10% better than Q4_0,
batch mode (perplexity is about the same).
* Adding Q6_K - scalar, AVX2, CUDA
Performance is ~40% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 6-bit model is ~44% larger than the 4-bit.
On the GPU, single token prediction is ~6% lower than Q4_0,
batch mode (perplexity) is even closer (but still slower).
* Adding Q5_K - scalar, AVX2, CUDA
Performance is ~20% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 5-bit model is ~22% larger than the 4-bit.
On the GPU, single token prediction is about the same as Q4_0
for both, single token and batch prediction.
* Per convention, all QX_K quantizations use Q5_K for output.weight
* Adding quantization mixes
* Quantization mixes: didn't quite get what I wanted in the last commit
* Q4_K dot product for ARM_NEON
* Q6_K dot product for ARM_NEON
* Q5_K dot product for ARM_NEON
* Adding Q3_K dot for ARM_NEON
It is 22% slower than Q4_K, despite the smaller model size.
On x86_64, where we are memory bound, the Q3_K model is
quite a bit faster than Q4_K.
* A very slightly faster ARM_NEON Q3_K dot
* Adding Q2_K - just CUDA for now
Token prediction is pretty good - about 15.5 ms on a RTX 4080.
Perplexity is about the same as Q4_K.
* Adding scalar and AVX2 Q2_K dot
* Adding ARM_NEON Q2_K dot
About the same performance as Q4_K.
* A slightly faster ARM_NEON Q2_K dot
Single token prediction is now ~36 ms on M2 Max.
The code is much simpler too.
* Fixed bug in Q2_K CUDA dot product kernel
Stranegly enough, for the few prompts I tried with the 7B model
the responses looked perfectly reasonable. Only realized something
is not quite right when I tried the larger models and started getting
nonse back.
In any case, Q2_K single token evaluation time on an RTX 4080 in a Ryzen7950X
box iusing CUDA and model fully loaded on the GPU are
~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B.
The max number of layers that fit in VRAM for The 65B is 32.
With that, we get ~330 ms per token, which is not that much faster
than just running on the CPU (~470 ms per token).
* Don't print zeros/NaNs when no count histogram has been collected
* A 10% faster CUDA vector dot kernel for Q3_K
Q3_K is now running at ~18.5 ms / token on CUDA,
so the gap to Q4_0 is only 10%.
It seems memory acccess pattern is more important for
performance than the amount of computation the kernel
does.
* A slightly daster Q4_K AVX2 dot product
For perplexity, where we are less memory bound, time per
pass drops by ~5%. Barely measurable difference for single
token prediction.
* A slightly faster ARM_NEON A4_K dot product
* Minor
* Fix quantization error test
We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit
quantization variants.
* Fix docker build
I have been sloppy with vector reinterpret casts on ARM_NEON.
It seems clang is very forgiving in that regard.
* Added forgotten ggml.o dependence on k_quants.h to the Makefile
* Had unintentionally committed the Makefile with -Ofast enabled
* ggml : rename k_quants -> ggml-quants-k, use lowercase in code
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Unit test for quantization functions
Use the ggml_internal_get_quantize_fn function to loop through all
quantization formats and run a sanity check on the result.
Also add a microbenchmark that times these functions directly without
running the rest of the GGML graph.
* test-quantize-fns: CI fixes
Fix issues uncovered in CI
- need to use sizes divisible by 32*8 for loop unrolling
- use intrinsic header that should work on Mac
* test-quantize: remove
Per PR comment, subsumed by test-quantize-fns
* test-quantize: fix for q8_0 intermediates