Commit Graph

2100 Commits

Author SHA1 Message Date
Georgi Gerganov
49a483e0f2
wip 2024-02-04 12:34:36 +02:00
Georgi Gerganov
1846e92a90
cuda : minor 2024-02-04 11:01:01 +02:00
Georgi Gerganov
ef68fac2a8
cuda : fix matrix names 2024-02-03 18:36:58 +02:00
Georgi Gerganov
cfd9732b2e
cuda : simplify softmax 2024-02-03 18:31:55 +02:00
Georgi Gerganov
e04ff39181
cuda : fix -INF block check 2024-02-03 16:57:46 +02:00
Georgi Gerganov
5b263dd83a
cuda : unroll Q*K^T loop 2024-02-03 16:12:20 +02:00
Georgi Gerganov
3b1c4e7673
cuda : speed-up reduce part of the kernel 2024-02-03 15:36:05 +02:00
Georgi Gerganov
a7b471569b
cuda : switch to 1 warp for bs > 16 2024-02-03 15:17:49 +02:00
Georgi Gerganov
b958151e3f
cuda : use half2 in softmax 2024-02-03 15:00:25 +02:00
Georgi Gerganov
c51f27c0db
cuda : avoid __hisinf branches 2024-02-03 14:27:36 +02:00
Georgi Gerganov
92472ea22c
cuda : unroll some of the loops 2024-02-03 14:10:01 +02:00
Georgi Gerganov
1f8a592482
cuda : make loops use the same loop values
Thanks Johannes again for the tip
2024-02-03 14:01:32 +02:00
Georgi Gerganov
7c34655b36
cuda : use int instead of int64_t
Noticeably improves performance (thanks to Johannes)
2024-02-03 13:39:46 +02:00
Georgi Gerganov
b150abe83e
cuda : avoid warp_reduce for smax 2024-02-03 13:17:47 +02:00
Georgi Gerganov
b68a112204
cuda : fix __hisinf() result check 2024-02-02 15:12:28 +02:00
Georgi Gerganov
12eaa22628
tests : update dims 2024-02-02 11:55:38 +02:00
Georgi Gerganov
db1f3c482e
cuda : avoid zeroing fragments 2024-02-01 22:08:37 +02:00
Georgi Gerganov
c6769b9422
tests : minor fix 2024-02-01 21:24:26 +02:00
Georgi Gerganov
cda5a60a41
metal : optimize softmax 2024-02-01 21:05:31 +02:00
Georgi Gerganov
56e45a239e
metal : optimize softmax for C > 32 2024-02-01 20:16:32 +02:00
Georgi Gerganov
41d136b602
Merge branch 'master' into gg/flash-attn 2024-02-01 19:51:41 +02:00
Georgi Gerganov
5a19a9f6d0
cuda : add flash_attn kernel (wip) 2024-02-01 19:50:23 +02:00
slaren
8ca511cade
cuda : fix LLAMA_CUDA_F16 (#5262) 2024-02-01 18:30:17 +01:00
Ali Nehzat
d71ac90985
make : generate .a library for static linking (#5205) 2024-02-01 17:18:53 +02:00
Georgi Gerganov
2e46013749
cuda : fix soft_max to use correct mask size 2024-02-01 16:47:20 +02:00
Georgi Gerganov
910b15bb40
ggml : fix ggml_soft_max mask requirement 2024-02-01 16:41:02 +02:00
Guoteng
ce32060198
llama : support InternLM2 (#5184)
* support InternLM2 inference
  * add add_space_prefix KV pair
2024-02-01 11:19:51 +02:00
Eve
1cfb5372cf
Fix broken Vulkan Cmake (properly) (#5230)
* build vulkan as object

* vulkan ci
2024-01-31 20:21:55 +01:00
Georgi Gerganov
8ad92dc1ec
ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext 2024-01-31 20:39:29 +02:00
Georgi Gerganov
2ddc9bbef1
Merge branch 'master' into gg/flash-attn 2024-01-31 18:49:43 +02:00
Georgi Gerganov
d3bac7d584
llama : reorder build_orion() at correct place (#5118) 2024-01-31 18:47:10 +02:00
Georgi Gerganov
5cb04dbc16
llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240)
* llama : remove LLAMA_MAX_DEVICES from llama.h

ggml-ci

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* server : remove LLAMA_MAX_DEVICES

ggml-ci

* llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD

ggml-ci

* train : remove LLAMA_SUPPORTS_GPU_OFFLOAD

* readme : add deprecation notice

* readme : change deprecation notice to "remove" and fix url

* llama : remove gpu includes from llama.h

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-31 17:30:17 +02:00
Georgi Gerganov
efb7bdbbd0
metal : add im2col F32 dst support (#5132) 2024-01-31 15:35:41 +02:00
JidongZhang-THU
15606309a0
llava : add MobileVLM support (#5132)
* New Feature:
    1. Sum_Rows:
        fix cuda kernel overflow
        fix block shape error when nrows too big
    2. Im2Col:
        Support Batch in cuda
        Support f32 to f32 both in cpu && cuda
    3. DepthWiseConv:
        Support by Im2Col && MulMat
    4. Pool_2d:
        Support avg pooling in cuda
    5. HardSigmoid:
        Imp in cuda
    6. HardSwish:
        Imp in cuda

* fix tabs instead of spaces

* code clean

* CUDA POOL2D

* ADD POOL2D test case in test-backend-ops.cpp

* code clean

* fix pool2d_kernel

nits

* fix bug in pool2d kernel

* fix avg pooling, count_include_pad

nits

* test-backend-ops : add more pool_2d tests

* cuda : fix warnings and formatting

* ggml : check types in release builds too in pool_2d

* test-backend-ops : remove f16 pool_2d tests

* cuda : more style fixes

* Add assert in ggml_cuda_op_pool2d

* pool2d float padding fallback

* test-backend-ops : add dst_type to im2col

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-31 15:10:15 +02:00
Neo Zhang Jianyu
b2b9f025e7
format license text, restore apache license by legal suggestion (#5233) 2024-01-31 18:34:46 +05:30
slaren
dabcc5b471
ggml : limit n_threads to the max n_tasks (#5238) 2024-01-31 13:43:03 +01:00
0cc4m
f8e9140cb4
Vulkan Fixes (#5223)
* Fix Vulkan F16 models

* Fix Vulkan context shift crash

* Add Vulkan to common.cpp dump_non_result_info_yaml function

* Fix bug in Vulkan CPY op

* Fix small matrix multiplication errors in AMD GPUs on Windows or with amdvlk

Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>

---------

Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
2024-01-31 11:44:19 +01:00
Yiming Cui
d62520eb2c
Fix typos of IQ2_XXS and IQ3_XXS in llama.cpp (#5231) 2024-01-30 22:04:21 -05:00
Neo Zhang Jianyu
01684139c3
support SYCL backend windows build (#5208)
* support SYCL backend windows build

* add windows build in CI

* add for win build CI

* correct install oneMKL

* fix install issue

* fix ci

* fix install cmd

* fix install cmd

* fix install cmd

* fix install cmd

* fix install cmd

* fix win build

* fix win build

* fix win build

* restore other CI part

* restore as base

* rm no new line

* fix no new line issue, add -j

* fix grammar issue

* allow to trigger manually, fix format issue

* fix format

* add newline

* fix format

* fix format

* fix format issue

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-01-31 08:08:07 +05:30
Jared Van Bortel
e8dc55d006
kompute : llama-bench support and ggml_cpu_has_kompute() (#5226) 2024-01-30 19:04:37 -05:00
Georgi Gerganov
3d03bcb7af
Merge branch 'master' into gg/flash-attn 2024-01-30 21:49:13 +02:00
Georgi Gerganov
78df5527e4
tests : ifdef 2024-01-30 21:46:49 +02:00
Georgi Gerganov
d073e4f933
metal : fix array initialization 2024-01-30 21:45:32 +02:00
Georgi Gerganov
e0085fdf7c
Revert "server : change deps.sh xxd files to string literals (#5221)"
This reverts commit 4003be0e5f.
2024-01-30 21:19:26 +02:00
Georgi Gerganov
e6f291d158
server : fix context shift (#5195)
* server : fix context shift + simplify self-extend

* server : take system_tokens into account

* server : more n_past fixes

* server : revert n_past_se changes
2024-01-30 20:17:30 +02:00
JohnnyB
4003be0e5f
server : change deps.sh xxd files to string literals (#5221)
* Changed ugly xxd to literals.

HPP files are much more readable as multiline literals rather than hex arrays.

* Dashes in literal variable names.

Replace . and - with _ in file names -> variable names.

* Comment on removing xxd.

XXD-> string literals

* XXD to string literals.

Replaced these unreadable headers with string literal versions using new deps.sh.
2024-01-30 20:15:05 +02:00
Kawrakow
fea4fd4ba7
ggml : fix IQ3_XXS on Metal (#5219)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-30 19:15:28 +02:00
Georgi Gerganov
8f8ddfcfad
sync : ggml (#0) 2024-01-30 16:21:57 +02:00
Georgi Gerganov
6fb50ebbf0
gguf : fix comparison (ggml/715)
ggml-ci
2024-01-30 16:20:25 +02:00
John Balis
625a699b54
ggml_cuda_cpy support for 4d tensors and float16->float32 upcasting (ggml/686)
* added cuda float16->float32 upcasting to ggml_cuda_cpy

* added ability to copy 4d tensors with the cuda backend

* added tests for float16->float32 upcast and 4d tensor cuda copies

* added 4d copy test for float32->float16 copy

* applied patch suggested by @iamlemec

* simplify cpy tests

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-30 16:20:25 +02:00