Iwan Kawrakow
9bfcb16fd3
Add llama enum for IQ2_XS
2024-01-11 18:24:12 +02:00
Iwan Kawrakow
a1610b05b2
iq2_xs: had forgotten to delete iq2-data.h
2024-01-10 13:47:42 +02:00
Iwan Kawrakow
8299b03a99
iq2_xs: faster AVX2 dot product
...
21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x faster than the previous version.
2024-01-10 11:33:23 +02:00
Iwan Kawrakow
3198e94f00
iq2_xs: AVX2 dot product - 19.5 t/s
2024-01-10 08:49:38 +02:00
Iwan Kawrakow
52ea3f7930
iq2_xs: better ARM_NEON dot product
...
We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.
2024-01-09 19:43:39 +01:00
Iwan Kawrakow
ff49d876c6
iq2_xs: working, but dog slow, ARM_NEON dot product
2024-01-09 18:36:45 +01:00
Iwan Kawrakow
55e2cae83f
iq2_xs: Metal now works
2024-01-09 18:22:20 +01:00
Iwan Kawrakow
0aacd55159
iq2_xs: WIP Metal
2024-01-09 17:46:27 +01:00
Iwan Kawrakow
9b6e38d8c0
iq2_xs: CUDA and scalar CPU works
2024-01-09 18:19:02 +02:00
Iwan Kawrakow
9f21b82e4b
iq2_xs: this should have been in the basics
2024-01-09 17:34:08 +02:00
Iwan Kawrakow
3569fa3fe3
iq2_xs: basics
2024-01-09 17:34:08 +02:00
Georgi Gerganov
d9653894df
scripts : script to get Paul Graham essays in txt format ( #4838 )
2024-01-09 16:23:05 +02:00
Behnam M
128de3585b
server : update readme about token probs ( #4777 )
...
* updated server readme to reflect the gg/server-token-probs-4088 commit
Added an explanation of the API's completion result, which now includes `completion_probabilities`, along with a JSON schema showing its type/structure.
* simplified the `completion_probabilities` JSON schema
It is now easier to understand the structure of `completion_probabilities`.
* minor : fix trailing whitespace
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-09 12:02:05 +02:00
Zsapi
8c58330318
server : add api-key flag to documentation ( #4832 )
...
Document the api-key flag added to server in https://github.com/ggerganov/llama.cpp/pull/4441
2024-01-09 11:12:43 +02:00
Georgi Gerganov
18c2e1752c
ggml : fix vld1q_s8_x4 32-bit compat ( #4828 )
...
* ggml : fix vld1q_s8_x4 32-bit compat
ggml-ci
* ggml : fix 32-bit ARM compat (cont)
ggml-ci
2024-01-09 10:42:06 +02:00
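The fix above concerns a NEON structured-load intrinsic that some 32-bit ARM toolchains do not provide. A minimal compat shim of the kind such a fix typically introduces might look like the following; the wrapper name is illustrative, not the actual ggml symbol:

```c
#include <arm_neon.h>
#include <stdint.h>

// Emulate the 64-byte structured load vld1q_s8_x4 with four
// plain 16-byte loads, for toolchains that lack the intrinsic.
static inline int8x16x4_t compat_vld1q_s8_x4(const int8_t * ptr) {
    int8x16x4_t res;
    res.val[0] = vld1q_s8(ptr +  0);
    res.val[1] = vld1q_s8(ptr + 16);
    res.val[2] = vld1q_s8(ptr + 32);
    res.val[3] = vld1q_s8(ptr + 48);
    return res;
}
```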
Johannes Gäßler
8f900abfc0
CUDA: faster softmax via shared memory + fp16 math ( #4742 )
2024-01-09 08:58:55 +01:00
howlger
1fc2f265ff
common : fix the short form of `--grp-attn-w`, not `-gat` ( #4825 )
...
See https://github.com/ggerganov/llama.cpp/blob/master/common/common.cpp#L230C53-L230C57
2024-01-08 21:05:53 +02:00
Georgi Gerganov
a9a8c5de3d
readme : add link to SOTA models
2024-01-08 20:25:17 +02:00
Kawrakow
dd5ae06405
SOTA 2-bit quants ( #4773 )
...
* iq2_xxs: basics
* iq2_xxs: scalar and AVX2 dot products
Needed to change Q8_K to have quants in the -127...127 range, otherwise the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead; perhaps I'll change it later, but for now this is what we have.
(See the sketch after this entry.)
* iq2_xxs: ARM_NEON dot product
Strangely slow (112 ms/token).
* iq2_xxs: WIP Metal
Dequantize works, something is still wrong with the
dot product.
* iq2_xxs: Metal dot product now works
We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s
Not the greatest performance, but not complete garbage either.
* iq2_xxs: slightly faster dot product
TG-128 is now 48.4 t/s
* iq2_xxs: slightly faster dot product
TG-128 is now 50.9 t/s
* iq2_xxs: even faster Metal dot product
TG-128 is now 54.1 t/s.
Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than putting
the grid values into shared memory.
* iq2_xxs: dequantize CUDA kernel - fix conflict with master
* iq2_xxs: quantized CUDA dot product (MMVQ)
We get TG-128 = 153.1 t/s
* iq2_xxs: slightly faster CUDA dot product
TG-128 is now at 155.1 t/s.
* iq2_xxs: add to llama ftype enum
* iq2_xxs: fix MoE on Metal
* Fix missing MMQ ops when on hipBLAS
I had put the ggml_supports_mmq call in the wrong place.
* Fix bug in quantize_row_iq2_xxs
The 0.25f factor was missing.
Great detective work by @ggerganov!
* Fixing tests
* PR suggestion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-08 16:02:32 +01:00
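As a hedged illustration of the Q8_K range note in the entry above: quantizing to a symmetric -127...127 range (rather than the full -128...127 int8 range) means negating a quant can never overflow, which keeps SIMD dot products against sign-masked 2-bit grids simple. A minimal sketch, not the actual Q8_K code; the function name is hypothetical:

```c
#include <math.h>
#include <stdint.h>

// Quantize n floats to int8 in the symmetric range [-127, 127],
// returning the per-block scale. Illustrative only.
static void quantize_symmetric_q8(const float * x, int8_t * q, int n, float * scale) {
    float amax = 0.0f; // absolute maximum of the block
    for (int i = 0; i < n; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    *scale = amax/127.0f;
    const float id = amax > 0.0f ? 127.0f/amax : 0.0f;
    for (int i = 0; i < n; ++i) {
        q[i] = (int8_t) roundf(x[i]*id); // stays within [-127, 127]
    }
}
```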
Georgi Gerganov
668b31fc7d
swift : exclude ggml-metal.metal from the package ( #4822 )
2024-01-08 16:40:51 +02:00
Georgi Gerganov
42ea63c5a3
llama.swiftui : update readme
2024-01-08 15:57:36 +02:00
Georgi Gerganov
52531fdff8
main : add self-extend support ( #4815 )
...
* examples : add passkey test
* passkey : better prints
* passkey : select pass key pos from CLI
* passkey : simplify n_past logic
* llama : "self-extend"-like context extension
* passkey : add comment
* main : add Self-Extend support (see the sketch after this entry)
* llama : add comment about llama_kv_cache_seq_div
2024-01-08 11:18:32 +02:00
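A hedged sketch of the "self-extend" idea in the entry above: token positions beyond a local window are shared in groups, so n cached tokens occupy only about w + (n - w)/g distinct positions and stay within the model's trained context. This is a simplification; the actual implementation rewrites KV-cache positions (e.g. via llama_kv_cache_seq_div), and the names and parameters here are illustrative:

```c
// Map an absolute position to its "self-extended" position.
// n_cur = current position, w = ungrouped local window, g = group size.
static int self_extend_pos(int pos, int n_cur, int w, int g) {
    if (n_cur - pos <= w) {
        return pos; // recent tokens keep their true positions
    }
    return pos/g;   // older tokens are collapsed into groups of g
}
```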
Georgi Gerganov
b0034d93ce
examples : add passkey test ( #3856 )
...
* examples : add passkey test
* passkey : better prints
* passkey : select pass key pos from CLI
* passkey : simplify n_past logic
* make : add passkey target
* passkey : add "self-extend"-like context extension ( #4810 )
* llama : "self-extend"-like context extension
* passkey : add comment
* passkey : add readme
2024-01-08 11:14:04 +02:00
Lars Grammel
b7e7982953
readme : add lgrammel/modelfusion JS/TS client for llama.cpp ( #4814 )
2024-01-07 22:24:11 +02:00
slaren
226460cc0d
llama-bench : add no-kv-offload parameter ( #4812 )
2024-01-07 17:59:01 +01:00
Johannes Gäßler
d5a410e855
CUDA: fixed redundant value dequantization ( #4809 )
2024-01-07 17:24:08 +01:00
Georgi Gerganov
9dede37d81
llama : remove unused vars ( #4796 )
2024-01-07 14:29:36 +02:00
Georgi Gerganov
3c36213df8
llama : remove redundant GQA check ( #4796 )
2024-01-07 11:21:53 +02:00
Alex Azarov
72d8407b36
llama.swiftui : use llama.cpp as SPM package ( #4804 )
2024-01-07 10:20:50 +02:00
Georgi Gerganov
d117d4dc5d
llama : print tensor meta for debugging
2024-01-07 09:51:12 +02:00
Alex Azarov
3418c03ecc
llama.swiftui : add visionOS target ( #4805 )
2024-01-07 09:46:55 +02:00
Konstantin Zhuravlyov
63ee677efd
ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 ( #4787 )
2024-01-07 08:52:42 +02:00
Georgi Gerganov
67984921a7
server : fix n_predict check ( #4798 )
2024-01-07 08:45:26 +02:00
Daniel Illescas Romero
c75ca5d96f
llama.swiftui : use correct pointer for llama_token_eos ( #4797 )
2024-01-06 17:12:59 +02:00
Georgi Gerganov
96e80dabc6
examples : improve base-translate.sh script ( #4783 )
2024-01-06 11:40:24 +02:00
a-n-n-a-l-e-e
eec22a1c63
cmake : check for openblas64 ( #4134 )
...
The 64-bit pkg-config file for OpenBLAS v0.3.22 is named openblas64.pc
https://github.com/OpenMathLib/OpenBLAS/issues/3790
2024-01-05 18:04:40 +02:00
Ikko Eltociear Ashimine
be36bb946a
flake.nix : fix typo ( #4700 )
...
betwen -> between
2024-01-05 18:02:44 +02:00
Georgi Gerganov
91d38876df
metal : switch back to default.metallib (ggml/681)
...
ggml-ci
2024-01-05 18:02:06 +02:00
Georgi Gerganov
d061bf9405
ggml : fix q2_k bpw in comments (ggml/680)
2024-01-05 18:02:06 +02:00
Finn Voorhees
1bf681f90e
ggml : add error handling to graph_compute (whisper/1714)
2024-01-05 18:02:06 +02:00
Georgi Gerganov
c1d7cb28d3
ggml : do not sched_yield when calling BLAS ( #4761 )
...
* ggml : do not sched_yield when calling BLAS
ggml-ci
* ggml : fix do_yield logic
ggml-ci
* ggml : simplify do_yield logic
ggml-ci
2024-01-05 15:18:21 +02:00
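A hedged sketch of the conditional-yield idea in the commit above (illustrative, not the actual ggml scheduler code): worker threads spin-wait for the next graph node, and whether they call sched_yield() in the loop depends on the work in flight; per the title, yielding is skipped while a BLAS call is running:

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

// Spin until work is ready; optionally yield the CPU each iteration.
static void wait_for_work(atomic_int * node_ready, bool do_yield) {
    while (atomic_load(node_ready) == 0) {
        if (do_yield) {
            sched_yield();
        }
        // else: pure busy-wait, as when a BLAS call is in flight
    }
}
```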
Georgi Gerganov
3681f22443
examples : add few-shot translation example ( #4783 )
2024-01-05 15:11:10 +02:00
Daniel Bevenius
b3a7c20b5c
finetune : remove unused includes ( #4756 )
...
This commit removes unused includes from finetune.cpp.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-04 21:45:37 +02:00
Georgi Gerganov
012cf349ae
server : send token probs for "stream == false" ( #4714 )
2024-01-04 19:56:33 +02:00
Johannes Gäßler
a91928014f
Print backend name on test-backend-ops failure ( #4751 )
2024-01-04 09:43:23 +01:00
singularity
3c0b585561
llama.swiftui : support loading custom model from file picker ( #4767 )
...
* swiftui: support loading a model from the file picker
* swiftui: remove trailing whitespace
2024-01-04 10:22:38 +02:00
Michael Coppola
e5804313a1
server : fix options in README.md ( #4765 )
...
* fix examples/server/README.md
* minor : fix whitespace
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-04 10:17:09 +02:00
Georgi Gerganov
dc891b7f7a
ggml : include stdlib.h before intrin.h ( #4736 )
2024-01-04 10:12:26 +02:00
singularity
46cea79e1f
llama.swiftui : fix build of ggml.metallib ( #4754 )
...
* metal: fix metal backend init failure in swiftui
* metal: build ggml.metallib instead of copy src
* llama.swift : remove debug flags from metallib build
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-04 09:58:16 +02:00
Daniel Bevenius
cb1e2818e0
train : fix typo in overlapping-samples help msg ( #4758 )
...
This commit fixes a typo in the help message for the
--overlapping-samples option.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-03 19:53:40 +02:00