Commit Graph

69 Commits

Author SHA1 Message Date
Howard Su
94ddd6204c Simplify the logic of scheduling 2023-04-10 22:37:37 +08:00
Howard Su
6d18c6ea3e Fix the number of forward looking nodes 2023-04-10 22:37:10 +08:00
Howard Su
6f2a61eb4f Rework scheduling algorithm. 2023-04-10 22:24:27 +08:00
Howard Su
2035a3cc29 avoid changing ggml_task_type 2023-04-09 22:11:24 +08:00
Howard Su
3b03df5c05 look forward more 2023-04-08 19:55:29 +08:00
Howard Su
921296c0d5 avoid malloc/free in critical path 2023-04-08 00:47:19 +08:00
Howard Su
455f6f79bc Try to find another single-threaded operator to run 2023-04-08 00:34:05 +08:00
Howard Su
43dde039b0 Run second operator when possible 2023-04-07 23:51:46 +08:00
Howard Su
c640d2a4bd Remove finalizer 2023-04-07 22:24:14 +08:00
Howard Su
b8c9b27452 Merge remote-tracking branch 'tp/Pithikos-C-Thread-Pool2' into tp_schedule 2023-04-07 21:31:07 +08:00
Georgi Gerganov
eeaa7b0492
ggml : multi-thread ggml_rope() (~3-4 times faster on M1) (#781) 2023-04-05 22:11:03 +03:00
Georgi Gerganov
986b6ce9f9
ggml, llama : avoid heavy V transpose + improvements (#775)
ggml :

- added ggml_view_3d()
- ggml_view_tensor() now inherits the stride too
- reimplement ggml_cpy() to account for dst stride
- no longer require tensor->data to be memory aligned

llama :

- compute RoPE on 32-bit tensors (should be more accurate)
- store RoPE-ed K in the KV cache
- store transposed V in the KV cache (significant speed-up)
- avoid unnecessary Q copy
2023-04-05 22:07:33 +03:00
SebastianApel
437e77855a
10+% performance improvement of ggml_vec_dot_q4_0 on AVX2 (#654)
* Performance improvement of AVX2 code
* Fixed problem with MSVC compiler
* Reviewer comments: removed double semicolon, deleted empty line 1962
2023-04-03 09:52:28 +02:00
Marian Cepok
c0bb1d3ce2
ggml : change ne to int64_t (#626) 2023-04-02 13:21:31 +03:00
Vladimir
d3bc4df97d fix windows build 2023-04-01 20:18:04 +02:00
Vladimir
a65d37ad36 using github Pithikos/C-Thread-Pool for threading 2023-04-01 20:18:04 +02:00
Stephan Walter
3525899277
Enable -std= for cmake builds, fix warnings (#598) 2023-03-31 19:19:16 +00:00
slaren
1d08882afa
Optimize AVX2 ggml_vec_dot_q4_0 (#642) 2023-03-31 15:55:52 +00:00
perserk
02c5b27e91
Add AVX acceleration (#617)
* ggml : add AVX quantize_row_q4_0()

* ggml : add AVX ggml_vec_dot_q4_0()

* ggml : refactor AVX part of ggml_vec_dot_q4_0()

https://github.com/ggerganov/llama.cpp/pull/617#issuecomment-1489985645
2023-03-31 13:55:44 +02:00
Justine Tunney
6f23ba5ee2 Ensure --mlock works properly with mmap() support 2023-03-30 12:28:25 -07:00
Slaren
c03ae8dca1 Add mmap support for model files 2023-03-30 12:28:25 -07:00
Casey Primozic
a4755cf288
Remove unused variable (#607)
* It seems some new warnings were added recently that exposed this. I wrote the code that originally included this unused variable, and it is indeed not needed.
2023-03-30 17:53:35 +00:00
Georgi Gerganov
77efdf5a50
ggml : fix NEON signs (close #620, #622) 2023-03-30 20:27:32 +03:00
slaren
ed3c680bcd
Fix GGML_F32Cx8_STORE in AVX without F16C path (#619) 2023-03-30 11:16:30 +02:00
Georgi Gerganov
b51c717d5c
ggml : init time on first ggml_init() call 2023-03-29 22:15:34 +03:00
Georgi Gerganov
cea1c85948
ggml : add ARM_NEON dequantize_row_q4_1() 2023-03-29 22:10:01 +03:00
Georgi Gerganov
f202ada131
ggml : add ARM_NEON quantize_row_q4_1() 2023-03-29 22:03:07 +03:00
Georgi Gerganov
3b44d30d9b
ggml : add ARM_NEON ggml_vec_dot_q4_1() 2023-03-29 22:03:07 +03:00
anzz1
83df5639eb
Fix GCC warning about binary literal (#595)
0b10101010 -> 0xAA /* 0b10101010 */
2023-03-29 13:20:07 +00:00
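The change above swaps a `0b` binary literal, which is a compiler extension in C before C23 and draws a GCC warning in pedantic builds, for the equivalent hex constant with the bit pattern kept in a comment. A minimal sketch of that pattern follows; the names `ALTERNATING_MASK` and `bits_to_uint` are hypothetical and do not come from the repository.

```c
#include <assert.h>
#include <stddef.h>

/* Portable spelling: a hex constant, with the bit pattern kept as a comment.
 * ALTERNATING_MASK is a hypothetical name used only for this illustration. */
#define ALTERNATING_MASK 0xAA /* 0b10101010 */

/* Hypothetical helper: fold a string of '0'/'1' characters into an unsigned
 * value, so a hex constant can be checked against its intended bit pattern. */
static unsigned bits_to_uint(const char *bits) {
    assert(bits != NULL);
    unsigned value = 0;
    for (size_t i = 0; bits[i] != '\0'; i++) {
        assert(bits[i] == '0' || bits[i] == '1');
        /* Shift in one bit at a time, most significant bit first. */
        value = (value << 1) | (unsigned)(bits[i] - '0');
    }
    return value;
}
```

Comparing `bits_to_uint("10101010")` with `ALTERNATING_MASK` is one way to confirm that the hex spelling matches the pattern recorded in the comment.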
anzz1
5a5f8b1501
Enable Fused-Multiply-Add (FMA) and F16C/CVT16 vector extensions on MSVC (#375)
* Enable Fused-Multiply-Add (FMA) instructions on MSVC

__FMA__ macro does not exist in MSVC

* Enable F16C/CVT16 vector extensions on MSVC

__F16C__ macro does not exist in MSVC, but is implied with AVX2/AVX512

* MSVC cvt intrinsics

* Add __SSE3__ macro for MSVC too because why not

even though it's not currently used for anything when AVX is defined
2023-03-28 22:44:29 +03:00
slaren
2a98bc18ea
ggml : add AVX2 implementation of quantize_row_q4_1 (#515)
* Add AVX2 implementation of quantize_row_q4_1

* Actually use AVX2

* Make quantize_row_q4_1 static

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-28 21:06:03 +03:00
Stephan Walter
99c5b27654
ggml : refactor quantized processing functions (#509)
* Refactor quantized processing functions

* ggml : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-28 20:13:01 +03:00
Stephan Walter
436e561931
all : be more strict about converting float to double (#458)
* Be more strict about converting float to double

* Test equivalence of round, SILU implementations

Test module is commented out in CMakeLists.txt because the tests may
take a long time, depending on how much the compiler optimizes.

* Fix softmax in perplexity.cpp

* all : prefer float over double where appropriate

* perplexity : add <cmath>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-28 19:48:20 +03:00
Stephan Walter
c1f885067c
ggml : introduce structs for the q4 data blocks (#356)
* Introduce structs for the q4 data blocks

* ggml : rename quant struct variables + fix ARM_NEON

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-28 18:56:03 +03:00
slaren
a6bdc47cba
Fix usage of F16C intrinsics in AVX code (#563)
* Fix usage of F16C intrinsics in AVX code when F16C is not defined
2023-03-28 17:26:55 +03:00
Stephan Walter
939ad2d3a5
Fix undefined variables in debug build, remove unused variables (#531) 2023-03-26 15:34:02 +00:00
slaren
459e93cce0
Add AVX2 implementation of dequantize_row_q4_1 (#505) 2023-03-25 20:31:48 +02:00
Georgi Gerganov
a316a425d0
Overhaul the examples structure
- main -> examples
- utils -> examples (renamed to "common")
- quantize -> examples
- separate tools for "perplexity" and "embedding"

Hope I didn't break something !
2023-03-25 20:26:40 +02:00
Georgi Gerganov
ecbe466a36
Retire the ggml_mul_mat() branch for transposed src0 (#500)
* Retire the ggml_mul_mat() for transposed src0

- It can always be made contiguous with ggml_cpy()
- The code is now simplified
- The results are deterministic with respect to the number of threads

* SIMD-ify dequantize_row_q4_0() for ARM_NEON (#502)

* Attempt to SIMD-ify dequantize_row_q4_0() for ARM_NEON

* Fix dequantization - forgot to interleave the quants
2023-03-25 19:47:21 +02:00
slaren
09aecbf628
Add AVX2 implementation of dequantize_row_q4_0 (#467) 2023-03-25 17:06:49 +02:00
Georgi Gerganov
6b6dbc8910
Remove obsolete assert and fix compiler warning 2023-03-25 16:22:05 +02:00
Georgi Gerganov
2a2e63ce05
Fix nasty bug in ggml_compute_forward_mul_mat_f32() and reenable BLAS 2023-03-25 16:10:14 +02:00
Georgi Gerganov
8520fc310e
Disable BLAS altogether - the bug is not just for quantized mat mul 2023-03-24 23:47:06 +02:00
Georgi Gerganov
b3f460e941
Disable BLAS branch in mul_mat - seems there is a bug 2023-03-24 23:39:17 +02:00
Georgi Gerganov
7a9b6c3a8b
Reduce memory usage and allocate enough memory for largest context (#473)
* Reduce memory usage and allocate enough memory for large contexts

* Simpler scratch buffer usage

* Reenable BLAS for quantized mul_mat

* Fix number of layers in 30B and 65B

* Fix KV cache size for F32
2023-03-24 23:17:37 +02:00
Cameron Kaiser
481044d50c
additional optimizations for POWER9 (#454) 2023-03-24 17:19:26 +02:00
comex
563cdc391d
Support calling mlock() on loaded model data on Linux and macOS (#453)
* Support calling mlock() on loaded model data on Linux and macOS

This is enabled by a new --mlock command line option.

Using mlock() disables swapping and memory compression for the model
data.  Doing so can be useful on systems where the model takes up a
large fraction of system RAM.  In my experience, macOS is quite eager to
start compressing llama.cpp's memory, which then makes it halt for a few
seconds while it decompresses, even with a model that uses "only" 25GB
out of 32GB.

Of course, this comes at the cost of forcing the system to swap or
compress other processes' memory instead, so it needs to be used with
care and shouldn't be enabled by default.

In theory it should be possible to support this on Windows as well using
VirtualLock(), but I'm not much of a Windows user.

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-24 17:19:05 +02:00
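The commit message above describes pinning model data in RAM with mlock() as an opt-in step. A minimal sketch of that idea on Linux or macOS follows. It is not the code from llama.cpp: the helper names `alloc_model_buffer` and `free_model_buffer` are hypothetical, and an anonymous mapping stands in for the loaded model data. Because mlock() can fail when the locked-memory limit (RLIMIT_MEMLOCK) is too low, the sketch treats a failed lock as non-fatal and carries on with an unlocked buffer.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical helper (not llama.cpp's implementation): map `len` bytes of
 * anonymous memory as a stand-in for model data and, if requested, try to
 * pin it in RAM. *locked is set to 1 if mlock() succeeded, otherwise 0.
 * Returns NULL if the mapping itself fails. */
static void *alloc_model_buffer(size_t len, int want_mlock, int *locked) {
    assert(len > 0 && locked != NULL);
    *locked = 0;

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

    if (want_mlock) {
        if (mlock(buf, len) == 0) {
            *locked = 1;     /* pages in this range will not be swapped out */
        } else {
            perror("mlock"); /* e.g. limit exceeded; continue without the lock */
        }
    }
    return buf;
}

/* Unmapping the range also releases any lock held on it. */
static void free_model_buffer(void *buf, size_t len) {
    munmap(buf, len);
}
```

Keeping the lock behind a `want_mlock` flag mirrors the commit's point that locking pushes memory pressure onto other processes and should not be enabled by default.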
Stephan Walter
69c92298a9
Deduplicate q4 quantization functions (#383)
* Deduplicate q4 quantization functions

* Use const; add basic test

* Re-enable quantization test

* Disable AVX2 flags in CI

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-22 19:29:06 +02:00
Valentyn Bezshapkin
97940520e8
fix: add POSIX functionality for Linux compilation (#51)
* fix: add POSIX functionality for Linux compilation

* fix: older standard for compatibility
2023-03-22 19:20:25 +02:00
Georgi Gerganov
f5a77a629b
Introduce C-style API (#370)
* Major refactoring - introduce C-style API

* Clean up

* Add <cassert>

* Add <iterator>

* Add <algorithm> ....

* Fix timing reporting and accumulation

* Measure eval time only for single-token calls

* Change llama_tokenize return meaning
2023-03-22 07:32:36 +02:00