Georgi Gerganov
e9b12c332e
perplexity : more meaningful ETA number - 2 decimal points
2023-08-18 12:48:55 +03:00
Georgi Gerganov
dea5be61d7
editorconfig : fix whitespaces
2023-08-18 12:42:38 +03:00
Georgi Gerganov
e35f8c744e
tests : update vocab file with new magic
2023-08-18 12:39:22 +03:00
Georgi Gerganov
856afff746
Merge branch 'master' into gguf
2023-08-18 12:38:05 +03:00
Georgi Gerganov
aa3efe87c8
llama : print number of tensors per type + print arch + style
2023-08-18 10:36:45 +03:00
klosax
b275de745d
llama.cpp : get special token kv and linefeed token id
2023-08-18 03:34:30 +02:00
Evan Jones
604b8bdfa6
Fix unicode in grammars ( fixes #2501 ) ( #2553 )
...
* Fix unicode in grammars (fixes #2501 )
* add more comments
* fix test-llama-grammar
2023-08-17 19:54:44 -04:00
staviq
10151bee2e
server : support for saving templates in browser LocalStorage ( #2486 )
...
* support for templates in browser LocalStorage
* sync accepted #2409 fix from upstream
* convert autosave invocation to useEffect
* Apply suggestions from code review
* Regen index.html.cpp, suggested from code review
---------
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
2023-08-18 07:34:01 +08:00
klosax
306070c896
llama.cpp : print kv general.name
2023-08-18 01:06:27 +02:00
Johannes Gäßler
0992a7b8b1
README: fix LLAMA_CUDA_MMV_Y documentation ( #2647 )
2023-08-17 23:57:59 +02:00
klosax
d9e6890a51
test-tokenizer-0.cpp : fix warning
2023-08-17 23:34:21 +02:00
klosax
147a99bd3a
gguf.py : reverse GGUF_MAGIC
2023-08-17 23:24:04 +02:00
klosax
c20ae49b59
ggml.h : reverse GGUF_MAGIC
2023-08-17 23:23:17 +02:00
Henri Vasserman
6ddeefad9b
[Zig] Fixing Zig build and improvements ( #2554 )
...
* Fix zig after console.o was split
* Better include and flag management
* Change LTO to option
2023-08-17 23:11:18 +03:00
klosax
3c1b7217a9
convert-llama-7b-pth-to-gguf.py : fixes
2023-08-17 21:44:34 +02:00
klosax
9e2d4dd48e
convert-llama-hf-to-gguf.py : fixes
2023-08-17 21:43:48 +02:00
klosax
640ddc4259
gguf.py : gptneox mapping
2023-08-17 21:43:10 +02:00
klosax
b668cd3296
convert-gptneox-hf-to-gguf.py : fixes
2023-08-17 21:42:26 +02:00
M. Yusuf Sarıgöz
fc3a523211
gguf.py : write tensors in a single pass ( #2644 )
...
* gguf : single pass for writing tensors + refactoring writer
* gguf : style fixes in simple conversion script
* gguf : refactor gptneox conversion script
* gguf : rename h5 to hf (for HuggingFace)
* gguf : refactor pth to gguf conversion script
* gguf : rm file_type key and method
* gguf.py : fix vertical alignment
* gguf.py : indentation
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-17 21:57:39 +03:00
Georgi Gerganov
5484737d58
llama : fix tensor name grepping during quantization
...
ggml-ci
2023-08-17 21:40:51 +03:00
Georgi Gerganov
57eaadb853
llama : throw error if gguf fails to init from file
...
ggml-ci
2023-08-17 21:32:14 +03:00
klosax
b3cc182990
llama.cpp : typo
2023-08-17 20:27:50 +02:00
Georgi Gerganov
acaa98234a
convert.py : fix HF tensor permuting / unpacking
...
ggml-ci
2023-08-17 21:06:45 +03:00
klosax
78e1e57862
quantize-stats.cpp : .bin --> .gguf
2023-08-17 19:18:24 +02:00
klosax
fb11dd3f92
common.h : .bin --> .gguf
2023-08-17 19:16:35 +02:00
Georgi Gerganov
e72c8c2124
ggml : fix bug in gguf_set_kv
...
ggml-ci
2023-08-17 20:13:48 +03:00
Georgi Gerganov
899f9a5350
llama : fix lambda capture
...
ggml-ci
2023-08-17 19:49:45 +03:00
Georgi Gerganov
93f285bdf1
gptneox : move as a WIP example
2023-08-17 19:49:45 +03:00
Georgi Gerganov
81a2c2a6f4
llama : fix llama_model_loader memory leak
2023-08-17 19:49:02 +03:00
Georgi Gerganov
dd9e2fc988
ci : update ".bin" to ".gguf" extension
...
ggml-ci
2023-08-17 19:32:14 +03:00
Georgi Gerganov
c3b739374e
editorconfig : ignore models folder
...
ggml-ci
2023-08-17 19:17:25 +03:00
Georgi Gerganov
6d66ef96eb
Merge branch 'master' into gguf
2023-08-17 19:04:59 +03:00
Georgi Gerganov
11bf4366c2
llama : sync with recent PRs on master
2023-08-17 19:03:15 +03:00
Georgi Gerganov
8ace03ad3d
convert.py : better always have n_head_kv and default it to n_head
2023-08-17 18:47:06 +03:00
klosax
d646c4efce
convert.py : n_head_kv optional and .gguf file extension
2023-08-17 17:20:36 +02:00
Georgi Gerganov
dd016cc246
Revert "ci : disable CI temporary to not waste energy"
...
This reverts commit 7e82d25f40.
2023-08-17 17:23:16 +03:00
Georgi Gerganov
2ddd9681d6
convert.py : update to support GGUF output
2023-08-17 17:22:43 +03:00
Georgi Gerganov
e0429d38e4
convert-new.py : output gguf ( #2635 )
...
* convert-new.py : output gguf (WIP)
* convert-new.py : add gguf key-value pairs
* llama : add hparams.ctx_train + no longer print ftype
* convert-new.py : minor fixes
* convert-new.py : vocab-only option should work now
* llama : fix tokenizer to use llama_char_to_byte
* tests : add new ggml-vocab-llama.gguf
* convert-new.py : tensor name mapping
* convert-new.py : add map for skipping tensor serialization
* convert-new.py : convert script now works
* gguf.py : pick some of the refactoring from #2644
* convert-new.py : minor fixes
2023-08-17 17:19:52 +03:00
Kerfuffle
8dae7ce684
Add --cfg-negative-prompt-file option for examples ( #2591 )
...
Add --cfg-negative-prompt-file option for examples
2023-08-17 07:29:44 -06:00
klosax
d6fd53afd6
llama.cpp : use ggml_elements()
2023-08-17 15:24:35 +02:00
klosax
5a0a2c5685
llama.cpp : print actual model size
2023-08-17 15:18:16 +02:00
Georgi Gerganov
a73ccf1aa3
llama : replace (permute + reshape + view_1d) with (view_3d) ( #2538 )
...
ggml-ci
2023-08-17 10:47:09 +03:00
drbh
7cf54e1f74
tests : adds simple llama grammar tests ( #2618 )
...
* adds simple llama grammar tests
* fix lint and add Makefile
* 0 terminate code_points
* avoid dangling pointers in candidate cleanup
* cleanup grammar at end of test
2023-08-17 10:41:01 +03:00
Shouzheng Liu
a872a2b28e
ggml-alloc : fix discrepancy between measure&eval ( #2639 )
...
The GGML memory allocator consistently places a tensor within the
optimal-fit memory block, which is the smallest block capable of
accommodating the tensor's size. During the measurement phase, the final
block is generously sized, ensuring it never qualifies as the
optimal-fit block as long as there exists another block capable of
accommodating the tensor. Nevertheless, in the evaluation phase, the
last block is constrained in size and could potentially qualify as the
optimal-fit block. Consequently, there exists the possibility of a
tensor being allocated to a different region during evaluation, leading
to more memory fragmentation in our scratch buffer.
This commit guarantees uniform behavior of the allocator across
both the measurement and evaluation phases, eliminating discrepancies
between the two.
2023-08-17 10:35:53 +03:00
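The best-fit behavior described in a872a2b28e above can be illustrated with a small sketch. This is not the ggml-alloc code: the function names, the block sizes, and the "treat the last block as a last resort" rule are all assumptions chosen for illustration, and the real fix may work differently. The sketch only shows why a generously sized last block (measure) and a constrained last block (eval) can make a plain best-fit search pick different blocks for the same tensor.

```python
from typing import List, Optional


def best_fit(free_blocks: List[int], size: int) -> Optional[int]:
    """Return the index of the smallest free block that can hold `size`, or None."""
    best = None
    for i, block_size in enumerate(free_blocks):
        if block_size >= size and (best is None or block_size < free_blocks[best]):
            best = i
    return best


def best_fit_last_resort(free_blocks: List[int], size: int) -> Optional[int]:
    """Hypothetical variant: use the last block only if no other block fits.

    This is one possible way to make both phases agree; it is an assumption,
    not necessarily what the commit implements.
    """
    idx = best_fit(free_blocks[:-1], size)
    if idx is not None:
        return idx
    if free_blocks and free_blocks[-1] >= size:
        return len(free_blocks) - 1
    return None


# Illustrative free-block sizes in bytes (made-up numbers).
measure_blocks = [64, 256, 10**12]  # measure phase: last block is effectively unbounded
eval_blocks = [64, 256, 128]        # eval phase: last block is constrained
tensor_size = 100

# Plain best fit picks different blocks in the two phases.
plain_measure = best_fit(measure_blocks, tensor_size)  # index 1 (256-byte block)
plain_eval = best_fit(eval_blocks, tensor_size)        # index 2 (128-byte block)

# The last-resort variant picks the same block in both phases.
fixed_measure = best_fit_last_resort(measure_blocks, tensor_size)
fixed_eval = best_fit_last_resort(eval_blocks, tensor_size)
```

With the plain search the tensor lands in a different region during evaluation than during measurement, which is the source of the extra fragmentation the commit message describes.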
M. Yusuf Sarıgöz
42f8fe1927
examples/gguf : no need to keep q option for quantization any more
2023-08-17 08:56:42 +03:00
Kolen Cheung
0919a0f73d
cmake : install ggml-meta.metal if LLAMA_METAL ( #2449 )
2023-08-16 23:09:49 +03:00
Jhen-Jie Hong
ed53db86c3
metal : print error of load pipeline state ( #2564 )
...
* metal : print error of load pipeline state
* metal : return null if load pipeline failed
2023-08-16 23:09:03 +03:00
Shouzheng Liu
fc8ef549e5
metal : enable ggml-alloc ( #2627 )
...
* metal: enable ggml-alloc
Make ggml-alloc work with concurrent dispatch.
* style-fix
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-16 23:08:28 +03:00
Shouzheng Liu
bf83bff674
metal : matrix-matrix multiplication kernel ( #2615 )
...
* metal: matrix-matrix multiplication kernel
This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.
* metal: fix performance degradation from gqa
Integer operations are slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduced a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limit should be
sufficient for the near future.
* metal: fix bugs for GQA and perplexity test.
I mixed up ne02 and nb02 in the previous commit.
2023-08-16 23:07:04 +03:00
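The ~4GB limit mentioned in bf83bff674 above follows from 32-bit arithmetic: an unsigned 32-bit value can address 2^32 bytes, which is 4 GiB. The sketch below is not the Metal kernel code. The helper names are hypothetical, the head counts (64 query heads, 8 key/value heads) are used only as an example GQA configuration, and the assumption that the divide maps a query head to its shared KV head is mine rather than something the commit message states.

```python
U32_LIMIT = 2**32  # number of distinct values in an unsigned 32-bit integer


def kv_head_for(q_head: int, n_head: int, n_head_kv: int) -> int:
    """Map a query head to the KV head it shares under grouped-query attention."""
    group_size = n_head // n_head_kv  # query heads per KV head
    return q_head // group_size


def divide_u32(value: int, divisor: int) -> int:
    """Integer divide after truncating `value` to 32 bits, as a 32-bit register would."""
    return (value & (U32_LIMIT - 1)) // divisor


# Example GQA configuration (illustrative values).
n_head, n_head_kv = 64, 8
first_kv = kv_head_for(0, n_head, n_head_kv)   # 0
last_kv = kv_head_for(63, n_head, n_head_kv)   # 7

# Below 4 GiB the 32-bit divide matches the full-width divide.
small_offset = 3 * 1024**3
small_ok = divide_u32(small_offset, 8) == small_offset // 8

# At or above 4 GiB the truncated value wraps and the result is wrong.
large_offset = 5 * 1024**3
large_ok = divide_u32(large_offset, 8) == large_offset // 8
```

This is why the cheaper divide is only safe while every byte offset within a single matrix stays below 2^32.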
Georgi Gerganov
5ec18934ad
convert-new.py : pick #2427 for HF 70B support
2023-08-16 20:16:15 +03:00