* metal : improve decoding speed for batches of 2-16
* metal : rename kernels mul_mat_ to mul_mv_
* metal : indentations
* minor
* metal : print more GPU info + disable mul_mm for MTLGPUFamiliy < Apple7
* metal : relax conditions on fast matrix multiplication kernel
* metal : revert the concurrnecy change because it was wrong
* llama : remove experimental stuff
* Minor speed gains for all quantization types
* metal: faster kernel_scale via float4
* Various other speedups for "small" kernels
* metal: faster soft_max vial float4
* metal: faster diagonal infinity
Although, to me it looks like one should simply
fuse scale + diagnonal infinity + soft_max on the
KQtensor.
* Another faster f16 x f32 matrix multiply kernel
* Reverting the diag infinity change
It does work for PP, but somehow it fails for TG.
Need to look more into it.
* metal: add back faster diagonal infinity
This time more carefully
* metal : minor (readibility)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Metal support for Swift
* update
* add a toggle for arm/arm64
* set minimum versions for all platforms
* update to use newLibraryWithURL
* bump version
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
---------
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
* llama : use posix_madvise() instead of madvise() derived from BSD
sed -i 's,\<madvise\>,posix_&,g;s,\<MADV_,POSIX_&,g' llama.cpp
* ggml : use sysconf(_SC_PAGESIZE) instead of getpagesize() derived from BSD
sed -i 's,getpagesize(),sysconf(_SC_PAGESIZE),g' ggml.c
* metal : use sysconf(_SC_PAGESIZE) instead of getpagesize() derived from BSD
sed -i 's,getpagesize(),sysconf(_SC_PAGESIZE),g' ggml-metal.m
* Very minor speedup via simd-group synchronization in f16 x f32
* Another very minor speedup on metal
* Quite significant PP speedup on metal
* Another attempt
* Minor
* Massive improvement for TG for fp16
* ~4-5% improvement for Q8_0 TG on metal
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml_metal_init: Show all Metal device instances in the system
Also show the default Metal device that was picked.
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Somewhat faster f16 x f32 matrix multiply kernel
* Better use 32 thread groups for f16 x f32
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* metal : fix memory leak
* metal : fix encoders memory leak
* metal : clean up more memory resources
* metal : fix more leaks
* metal : reuse dispatch queue + autoreleasepool
* metal : reuse array for command buffers and encoders
* ggml : assert for odd number of blocks on ARM
15M tinyllama is an example
* metal: matrix-matrix multiplication kernel
This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.
* metal: fix performance degradation from gqa
Integers are slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.
* metal: fix bugs for GQA and perplexity test.
I mixed up ne02 and nb02 in previous commit.
* metal: concurrently dispatch commands
Function `ggml_metal_graph_find_concurrency` will run and write
commands that can be issued concurrently to metal context `concur_list`
array, when `ggml_metal_graph_compute` is called for the first time.
* metal: don't call find_concurrency automatically.
* metal : code style changes
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* make rms_norm_eps a parameter
* add rms_norm_eps to command line
* fix baby llama, test-grad0
* use scientific notation for eps param in the help
ggml-ci
* Faster Q2_K on Metal
* Deleting unnoticed and dangereous trailing white space
* Fixed bug in new metal Q2_K implementation
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* metal: use uint16_t instead of uint8_t.
Apple GPU doesn't like uint8_t. For every operation on uint8_t
the gpu need to copy the uint8_t to an empty 16 bit register, then
it can issue other instructions.
For the matrix-vector multiplication kernel only, we observed a
340~350 GB/s memory read speed on M1 Max after this commit, which is
very close to the reported hardware limit.
* metal: update rms_norm kernel
This commit double the speed of rms_norm operations by using 512 threads
per threadgroup, combining with SIMD primitives to minimize the need for
thread group barriers.
* metal: use template to reduce size
Revert modifications on block_q4_0 and block_q4_1.
* Implement customizable RoPE
The original RoPE has pre-defined parameters
theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]
Our customizable RoPE, ggml_rope_custom_inplace, uses
theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]
with the default matches the original
scale = 1.0
base = 10000
The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameter.
Recent researches show changing these two parameters extends the context limit with minimal loss.
1. Extending Context to 8K
kaiokendev
https://kaiokendev.github.io/til#extending-context-to-8k
2. Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
https://arxiv.org/abs/2306.15595
3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
https://www.reddit.com/user/bloc97https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
For the bold, try adding the following command line parameters to your favorite model:
-c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5
* ggml-metal: fix custom rope
* common: fix argument names in help
* llama: increase MEM_REQ_EVAL for MODEL_3B
It avoids crashing for quantized weights on CPU.
Better ways to calculate the required buffer size would be better.
* llama: make MEM_REQ_EVAL depend on n_ctx
* server: use proper Content-Type in curl examples
Without the header Content-Type: application/json, curl will POST with
Content-Type: application/x-www-form-urlencoded
Though our simple server doesn't care, the httplib.h used has a limit
with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192
With Content-Type: application/json, we can send large json data.
* style : minor fixes, mostly indentations
* ggml : fix asserts
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* 3-5% faster Q4_0 on Metal
* 7-25% faster Q4_1 on Metal
* Oops, forgot to delete the original Q4_1 kernel
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* MPI support, first cut
* fix warnings, update README
* fixes
* wrap includes
* PR comments
* Update CMakeLists.txt
* Add GH workflow, fix test
* Add info to README
* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)
* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()
* mpi : move all MPI logic into ggml-mpi
Not tested yet
* mpi : various fixes - communication now works but results are wrong
* mpi : fix output tensor after MPI compute (still not working)
* mpi : fix inference
* mpi : minor
* Add OpenMPI to GH action
* [mpi] continue-on-error: true
* mpi : fix after master merge
* [mpi] Link MPI C++ libraries to fix OpenMPI
* tests : fix new llama_backend API
* [mpi] use MPI_INT32_T
* mpi : factor out recv / send in functions and reuse
* mpi : extend API to allow usage with outer backends (e.g. Metal)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml_graph_compute: deprecate using ggml_context, try resolve issue #287
* rewrite: no longer consider backward compitability; plan and make_plan
* minor: rename ctx as plan; const
* remove ggml_graph_compute from tests/test-grad0.c, but current change breaks backward
* add static ggml_graph_compute_sugar()
* minor: update comments
* reusable buffers
* ggml : more consistent naming + metal fixes
* ggml : fix docs
* tests : disable grad / opt + minor naming changes
* ggml : add ggml_graph_compute_with_ctx()
- backwards compatible API
- deduplicates a lot of copy-paste
* ci : enable test-grad0
* examples : factor out plan allocation into a helper function
* llama : factor out plan stuff into a helper function
* ci : fix env
* llama : fix duplicate symbols + refactor example benchmark
* ggml : remove obsolete assert + refactor n_tasks section
* ggml : fix indentation in switch
* llama : avoid unnecessary bool
* ggml : remove comments from source file and match order in header
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* k_quants: WIP super-blocks with 64 weights
* k_quants: WIP super-blocks with 64 weights
Q6_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q4_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower
than the scalar implementation)
* k_quants: WIP super-blocks with 64 weights
Q3_K scalar and AVX2 works.
* k_quants: WIP super-blocks with 64 weights
Q5_K scalar and AVX2 works, and with that all
k_quants are done on AVX2 and scalar
* k_quants: WIP super-blocks with 64 weights
Q6_K working on CUDA. Cannot make it run quite as gast as
with super-blocks with 256 weigths: 8% slower on 4080,
20% slower on the 1660 (but there we fit 1 less layer on the
GPU because pf the larger model size), so some fraction of
these 20% is due to that,
* k_quants: WIP super-blocks with 64 weights
Q4_K working on CUDA. ~10% slower on GTX-1660,
16% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q2_K working on CUDA. ~3% slower on GTX-1660,
10% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q3_K working on CUDA.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on CUDA, and with this CUDA is done.
* k_quants: WIP super-blocks with 64 weights
Q6_K working on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Q4_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q2_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.
With that, we have full support for ARM_NEON, although
performance is not quite there.
* k_quants: WIP super-blocks with 64 weights
Slightly more efficient Q3_K and Q5_K
* k_quants: WIP super-blocks with 64 weights
Another small improvement for Q3_K and Q5_K on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Yet another speedup for Q5_K on ARM_NEON.
We are now within 10% of the QK_K = 256 version.
* k_quants: WIP super-blocks with 64 weights
* We are able to pass preprocessor macros to the Metal
compiler
* Q6_K works and is actually slightly more efficient than
the QK_K = 256 version (25.2 ms vs 25.8 ms)
* k_quants: WIP super-blocks with 64 weights
Q4_K works on Metal and is actually slightly faster
than QK_K = 256 (21.95 ms vs 24.0 ms).
* k_quants: WIP super-blocks with 64 weights
Q2_K works on Metal and is very slightly faster
than QK_K = 256 (23.8 ms vs 24.2 ms).
* k_quants: WIP super-blocks with 64 weights
Q3_K works on Metal and is slightly faster
than QK_K = 256 (26.6 ms vs 28.3 ms).
* k_quants: WIP super-blocks with 64 weights
Q5_K works on Metal and is slightly faster
than QK_K = 256 (23.7 ms vs 26.3 ms).
* k_quants: call them _K, not _k, also on Metal
* k_quants: correctly define QK_K in llama.cpp
* Fixed bug in q4_K quantization added with the 64-block addition
* Simplify via lambda
* k_quants: swicth Q3_K to 4-bit scales when QK_K = 64
Otherwise there isn't much benefit from this
quantization type. There is some very slight loss
in accuracy, but we reduce size by ~7%.
E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
8.6131 with 8-bit scales and 8.6352 with 4-bit,
while file size decreases from 1.53G to 1.44G.
* k_quants: switch Q4_K to 4-bit scales when QK_K = 64
Here the loss in accuracy is greater than for Q3_K,
but the Q4_K points still move further to the left on
the perplexity vs size curve.
* k_quants: forgot to add the Metal changes in last commit
* k_quants: change Q5_K to be type 0 when QK_K = 64
Still needs AVX2 implementation
* k_quants: AVX2 implementation for new 64-weight Q5_K
* k_quants: 10% faster ARM_NEON Q5_K dot product
* k_quants: fixed issue caused by merging with master
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* metal : handle buffers larger than device's maxBufferLength
* metal : print more verbose device info + handle errors
* metal : fix prints for overlapping views
* metal : minimize view overlap to try to utilize device memory better