Ivan
116efee0ee
cuda: add q8_0->f32 cpy operation ( #9571 )
...
llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.
2024-09-24 02:14:24 +02:00
R0CKSTAR
c35e586ea5
musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) ( #9526 )
...
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full-cuda.Dockerfile platforms:linux/amd64 tag:full-cuda]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full.Dockerfile platforms:linux/amd64,linux/arm64 tag:full]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-cuda.Dockerfile platforms:linux/amd64 tag:light-cuda]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-intel.Dockerfile platforms:linux/amd64 tag:light-intel]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli.Dockerfile platforms:linux/amd64,linux/arm64 tag:light]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-cuda.Dockerfile platforms:linux/amd64 tag:server-cuda]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-intel.Dockerfile platforms:linux/amd64 tag:server-intel]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server.Dockerfile platforms:linux/amd64,linux/arm64 tag:server]) (push) Waiting to run
Nix CI / nix-eval (macos-latest) (push) Waiting to run
Nix CI / nix-eval (ubuntu-latest) (push) Waiting to run
Nix CI / nix-build (macos-latest) (push) Waiting to run
Nix CI / nix-build (ubuntu-latest) (push) Waiting to run
flake8 Lint / Lint (push) Waiting to run
* mtgpu: add mp_21 support
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* mtgpu: enable unified memory
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-09-22 16:55:49 +02:00
Johannes Gäßler
a5b57b08ce
CUDA: enable Gemma FA for HIP/Pascal ( #9581 )
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full-cuda.Dockerfile platforms:linux/amd64 tag:full-cuda]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full.Dockerfile platforms:linux/amd64,linux/arm64 tag:full]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-cuda.Dockerfile platforms:linux/amd64 tag:light-cuda]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-intel.Dockerfile platforms:linux/amd64 tag:light-intel]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli.Dockerfile platforms:linux/amd64,linux/arm64 tag:light]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-cuda.Dockerfile platforms:linux/amd64 tag:server-cuda]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-intel.Dockerfile platforms:linux/amd64 tag:server-intel]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server.Dockerfile platforms:linux/amd64,linux/arm64 tag:server]) (push) Waiting to run
Nix CI / nix-eval (macos-latest) (push) Waiting to run
Nix CI / nix-eval (ubuntu-latest) (push) Waiting to run
Nix CI / nix-build (macos-latest) (push) Waiting to run
Nix CI / nix-build (ubuntu-latest) (push) Waiting to run
flake8 Lint / Lint (push) Waiting to run
2024-09-22 09:34:52 +02:00
Molly Sophia
2a63caaa69
RWKV v6: RWKV_WKV op CUDA implementation ( #9454 )
...
* ggml: CUDA unary op EXP
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* ggml: rwkv_wkv op CUDA impl
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-09-22 04:29:12 +02:00
agray3
41f477879f
Update CUDA graph on scale change plus clear nodes/params ( #9550 )
...
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full-cuda.Dockerfile platforms:linux/amd64 tag:full-cuda]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/full.Dockerfile platforms:linux/amd64,linux/arm64 tag:full]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-cuda.Dockerfile platforms:linux/amd64 tag:light-cuda]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli-intel.Dockerfile platforms:linux/amd64 tag:light-intel]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-cli.Dockerfile platforms:linux/amd64,linux/arm64 tag:light]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-cuda.Dockerfile platforms:linux/amd64 tag:server-cuda]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server-intel.Dockerfile platforms:linux/amd64 tag:server-intel]) (push) Waiting to run
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/llama-server.Dockerfile platforms:linux/amd64,linux/arm64 tag:server]) (push) Waiting to run
Nix CI / nix-eval (macos-latest) (push) Waiting to run
Nix CI / nix-eval (ubuntu-latest) (push) Waiting to run
Nix CI / nix-build (macos-latest) (push) Waiting to run
Nix CI / nix-build (ubuntu-latest) (push) Waiting to run
flake8 Lint / Lint (push) Waiting to run
* Avoid using saved CUDA graph if scale changes and reset nodes/params on update
Fixes https://github.com/ggerganov/llama.cpp/issues/9451
* clear before resize
2024-09-21 02:41:07 +02:00
Georgi Gerganov
d13edb17ed
ggml : fix builds ( #0 )
...
ggml-ci
2024-09-20 21:15:05 +03:00
Johannes Gäßler
424c5d00a9
ggml/examples: add backend support for numerical optimization (ggml/949)
...
* CUDA eval works
* stochastic gradient descent op
* Adam except decay
* CUDA CROSS_ENTROPY_LOSS_BACK
* CUDA mnist-fc training works
* backend CLI arg
* refactor gguf load
* remove sched from opt_step_adam
* implement l1 regularization (weight decay)
* extra call to add optimizer
* initialize gradients with ggml_graph_reset
* gradient accumulation
* increment iter per eval instead of epoch
* adjust backend interfaces
* fix ggml_graph_reset without backend
* fix ggml graph export/import
* fixup
* rename
* revert ggml_opt changes
* more general CUDA repeat_back
* update documentation, fix CNN
* validation split
* add clarifying comment
* optimize PyTorch training
* adjust buffer size, thread count
* fix 0.0f validation split
* Update examples/mnist/mnist-common.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix gradient accumulation
* tensor flag for accumulators -> tensor hash set
* Update include/ggml.h
Co-authored-by: slaren <slarengh@gmail.com>
* Update tests/test-backend-ops.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* Update tests/test-backend-ops.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* fix test prints
* Update src/ggml-backend.c
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* better CUDA support for noncontiguous out_prod
* add comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-09-20 21:15:05 +03:00
Johannes Gäßler
5cb12f6839
CUDA: fix sum.cu compilation for CUDA < 11.7 ( #9562 )
2024-09-20 18:35:35 +02:00
Johannes Gäßler
5af118efda
CUDA: fix --split-mode row race condition ( #9413 )
2024-09-11 10:22:40 +02:00
R0CKSTAR
b34e023480
musa: remove Clang builtins mapping ( #9421 )
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-09-11 03:46:55 +02:00
Johannes Gäßler
8e6e2fbe14
CUDA: fix variable name conflict for Windows build ( #9382 )
2024-09-09 14:22:53 +02:00
Georgi Gerganov
e079bffb66
cuda : fix FA Q src index (1 -> 0) ( #9374 )
2024-09-08 22:01:02 +03:00
Johannes Gäßler
202084d31d
tests: add gradient tests for all backends (ggml/932)
...
* tests: add gradient checking to test-backend-ops
* remove old comment
* reorder includes
* adjust SIN/COS parameters
* add documentation, use supports_op if possible
2024-09-08 11:05:55 +03:00
slaren
4db04784f9
cuda : fix defrag with quantized KV ( #9319 )
2024-09-05 11:13:11 +02:00
Georgi Gerganov
231cff5f6f
sync : ggml
2024-08-27 22:41:27 +03:00
Johannes Gäßler
e11bd856d5
CPU/CUDA: Gemma 2 FlashAttention support ( #8542 )
...
* CPU/CUDA: Gemma 2 FlashAttention support
* apply logit_softcap to scale in kernel
* disable logit softcapping tests on Metal
* remove metal check
2024-08-24 21:34:59 +02:00
Daniel Bevenius
06943a69f6
ggml : move rope type enum to ggml.h ( #8949 )
...
* ggml : move rope type enum to ggml.h
This commit moves the `llama_rope_type` enum from `llama.h` to
`ggml.h` and changes its name to `ggml_rope_type`.
The motivation for this change is to address the TODO in `llama.h` and
use the enum in ggml.
Note: This commit does not change the `mode` parameter to be of type
`enum ggml_rope_type`. The name `mode` and its usage suggest that it
might be more generic and possibly used as a bit field for multiple
flags. Further investigation/discussion may be needed to determine
if `mode` should be restricted to RoPE types.
* squash! ggml : move rope type enum to ggml.h
This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from
ggml.h, and back the llama_rope_type enum.
I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is
safe to remove it yet.
* squash! ggml : move rope type enum to ggml.h
This commit removes the enum ggml_rope_type from ggml.h and replaces it
with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to
check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has
been updated to reflect this change.
* squash! ggml : move rope type enum to ggml.h
This commit contains a suggestion enable the GGML_ROPE_TYPE_NEOX
macro/define to be passed to the shader compiler.
* squash! ggml : move rope type enum to ggml.h
This commit fixes the editorconfig-checker warnings.
* squash! ggml : move rope type enum to ggml.h
Update comment for ggml_rope function.
* Revert "squash! ggml : move rope type enum to ggml.h"
This reverts commit 6261222bd0
.
* squash! ggml : move rope type enum to ggml.h
Add GGML_ROPE_TYPE_NEOX to rope_common.comp.
* remove extra line
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-13 21:13:15 +02:00
Molly Sophia
2d5dd7bb3f
ggml : add epsilon as a parameter for group_norm ( #8818 )
...
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-08-06 10:26:46 +03:00
slaren
7a11eb3a26
cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X ( #8800 )
...
* cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X
* update asserts
* only use dmmv for supported types
* add test
2024-08-01 15:26:22 +02:00
R0CKSTAR
439b3fc75a
cuda : organize vendor-specific headers into vendors directory ( #8746 )
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-07-29 14:56:12 +02:00
R0CKSTAR
e54c35e4fb
feat: Support Moore Threads GPU ( #8383 )
...
* Update doc for MUSA
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Add GGML_MUSA in Makefile
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Add GGML_MUSA in CMake
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* CUDA => MUSA
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* MUSA adds support for __vsubss4
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Fix CI build failure
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-07-28 01:41:25 +02:00
slaren
2b1f616b20
ggml : reduce hash table reset cost ( #8698 )
...
* ggml : reduce hash table reset cost
* fix unreachable code warnings after GGML_ASSERT(false)
* GGML_ASSERT(false) -> GGML_ABORT("fatal error")
* GGML_ABORT use format string
2024-07-27 04:41:55 +02:00
Jeroen Mostert
46e47417aa
Allow all RDNA2 archs to use sdot4 intrinsic ( #8629 )
...
The check gating the use of `__builtin_amdgc_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, let's use it.
2024-07-23 10:50:40 +02:00
Johannes Gäßler
69c487f4ed
CUDA: MMQ code deduplication + iquant support ( #8495 )
...
* CUDA: MMQ code deduplication + iquant support
* 1 less parallel job for CI build
2024-07-20 22:25:26 +02:00
Daniel Bevenius
b078c619aa
cuda : suppress 'noreturn' warn in no_device_code ( #8414 )
...
* cuda : suppress 'noreturn' warn in no_device_code
This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:
```console
/ggml/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
346 | }
| ^
```
The motivation for this is to reduce the number of warnings when
compilng with GGML_HIPBLAS=ON.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* squash! cuda : suppress 'noreturn' warn in no_device_code
Update __trap macro instead of using a while loop to suppress the
warning.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-11 17:53:42 +02:00
Johannes Gäßler
808aba3916
CUDA: optimize and refactor MMQ ( #8416 )
...
* CUDA: optimize and refactor MMQ
* explicit q8_1 memory layouts, add documentation
2024-07-11 16:47:47 +02:00
John Balis
fde13b3bb9
feat: cuda implementation for ggml_conv_transpose_1d
(ggml/854)
...
* conv transpose 1d passing test for 1d input and kernel
* working for different input and output channel counts, added test for variable stride
* initial draft appears to work with stride other than 1
* working with all old and new conv1d tests
* added a test for large tensors
* removed use cuda hardcoding
* restored test-conv-transpose.c
* removed unused arugments, and fixed bug where test failure would cause subsequent tests to fail
* fixed accumulator bug
* added test to test-backend-ops
* fixed mistake
* addressed review
* fixed includes
* removed blank lines
* style and warning fixes
* return failure when test fails
* fix supports_op
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-07-08 12:23:00 +03:00
Johannes Gäßler
8e558309dc
CUDA: MMQ support for iq4_nl, iq4_xs ( #8278 )
2024-07-05 09:06:31 +02:00
Daniele
0a423800ff
CUDA: revert part of the RDNA1 optimizations ( #8309 )
...
The change on the launch_bounds was causing a small performance drop in perplexity of 25 t/s
2024-07-05 09:06:09 +02:00
Johannes Gäßler
bcefa03bc0
CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 ( #8311 )
2024-07-05 09:05:34 +02:00
Daniele
d23287f122
Define and optimize RDNA1 ( #8085 )
2024-07-04 01:02:58 +02:00
Clint Herron
07a3fc0608
Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. ( #8258 )
2024-07-02 12:18:10 -04:00
Johannes Gäßler
cb5fad4c6c
CUDA: refactor and optimize IQ MMVQ ( #8215 )
...
* CUDA: refactor and optimize IQ MMVQ
* uint -> uint32_t
* __dp4a -> ggml_cuda_dp4a
* remove MIN_CC_DP4A checks
* change default
* try CI fix
2024-07-01 20:39:06 +02:00
Johannes Gäßler
85a267daaa
CUDA: fix MMQ stream-k for --split-mode row ( #8167 )
2024-06-27 16:26:05 +02:00
Georgi Gerganov
f3f65429c4
llama : reorganize source code + improve CMake ( #8006 )
...
* scripts : update sync [no ci]
* files : relocate [no ci]
* ci : disable kompute build [no ci]
* cmake : fixes [no ci]
* server : fix mingw build
ggml-ci
* cmake : minor [no ci]
* cmake : link math library [no ci]
* cmake : build normal ggml library (not object library) [no ci]
* cmake : fix kompute build
ggml-ci
* make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE
ggml-ci
* move public backend headers to the public include directory (#8122 )
* move public backend headers to the public include directory
* nix test
* spm : fix metal header
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* scripts : fix sync paths [no ci]
* scripts : sync ggml-blas.h [no ci]
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-26 18:33:02 +03:00