Mirror of https://github.com/ggerganov/llama.cpp.git (synced 2025-01-11 19:21:46 +00:00)
Arm AArch64: Documentation updates (#9321)
* Arm AArch64: Documentation updates
* Update docs/build.md to include information on how to enable the Arm optimized gemm/gemv kernels
* Update examples/quantize/README.md with information on the Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 formats
* Add newline to the end of docs/build.md
parent daa9623ab0
commit b2e89a3274
@@ -380,3 +380,9 @@ For detailed info, such as model/device supports, CANN install, please refer to

### Android

To read documentation for how to build on Android, [click here](./android.md)

### Arm CPU optimized mulmat kernels

llama.cpp includes a set of optimized mulmat kernels for the Arm architecture, leveraging Arm® Neon™, int8mm and SVE instructions. These kernels are enabled at build time through the appropriate compiler CPU-type flags, such as `-DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+sve`. Note that these optimized kernels require the model to be quantized into one of the following formats: `Q4_0_4_4` (Arm Neon), `Q4_0_4_8` (int8mm) or `Q4_0_8_8` (SVE). The SVE mulmat kernel specifically requires a vector width of 256 bits. On devices with a different vector width, the `Q4_0_4_8` (int8mm) or `Q4_0_4_4` (Arm Neon) formats are recommended for better performance. Refer to [examples/quantize/README.md](../examples/quantize/README.md) for more information on the quantization formats.

To support `Q4_0_4_4`, you must build with `GGML_NO_LLAMAFILE=1` (`make`) or `-DGGML_LLAMAFILE=OFF` (`cmake`).
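
As a rough sketch, the flags mentioned above could be combined into a single CMake configure step. The `build` directory name and the `Release` configuration are assumptions made for illustration, and the `-march` string is copied from the text above; adjust it to the features your CPU and compiler actually support.

```shell
# Configure: pass the Arm feature flags to the C compiler and turn off
# llamafile so that Q4_0_4_4 is supported (per the note above).
# "build" is an illustrative directory name, not a requirement.
cmake -B build \
    -DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+sve \
    -DGGML_LLAMAFILE=OFF

# Compile; "Release" is an assumed configuration.
cmake --build build --config Release
```

With `make`, the counterpart of the second option is `GGML_NO_LLAMAFILE=1`, as stated above.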

@@ -54,6 +54,8 @@ As the models are currently fully loaded into memory, you will need adequate dis

Several quantization methods are supported. They differ in the resulting model disk size and inference speed.

The quantization formats `Q4_0_4_4`, `Q4_0_4_8` and `Q4_0_8_8` are block-interleaved variants of the `Q4_0` format, providing a data layout that is better suited to specific implementations of optimized mulmat kernels. Since these formats differ from `Q4_0` only in data layout, they have the same quantized size as the `Q4_0` format.
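
To illustrate why the sizes match, here is a small arithmetic sketch. It assumes the usual `Q4_0` block of 32 weights stored as one 2-byte scale plus 16 bytes of packed 4-bit values (18 bytes per block), and it models the interleaved formats as groups of 4 or 8 such blocks stored together. The function names are hypothetical, and this is a toy model rather than the ggml implementation.

```shell
#!/bin/sh
# Assumption: one Q4_0 block = 32 weights = 2-byte scale + 16 bytes of 4-bit values.

# Bytes needed for N weights in plain Q4_0 (N assumed to be a multiple of 32).
q4_0_bytes() {
    n_weights="$1"
    echo $(( n_weights / 32 * 18 ))
}

# Bytes needed when blocks are stored in interleaved groups.
# $2 is the assumed group size in blocks (4 or 8). A group holds the same
# scales and 4-bit values as its member blocks, only reordered.
interleaved_bytes() {
    n_weights="$1"
    group="$2"
    n_blocks=$(( n_weights / 32 ))
    echo $(( n_blocks / group * (group * 2 + group * 16) ))
}

q4_0_bytes 4096            # prints 2304
interleaved_bytes 4096 4   # prints 2304
interleaved_bytes 4096 8   # prints 2304
```

Under these assumptions the group size cancels out, so the byte count depends only on the number of weights.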

*(outdated)*

| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |