Georgi Gerganov
7923b70cb8
llama : add llm_build_inp_embd helper
2023-10-31 16:43:08 +02:00
Georgi Gerganov
2073347e3b
llama : remove extra ; + deduplicate gate_b logic
2023-10-31 16:28:09 +02:00
Georgi Gerganov
fc5a26aade
llama : enable warning about not offloaded tensors
2023-10-31 08:57:10 +02:00
Georgi Gerganov
0bfdcdd0f8
llama : normalize tensor names
ggml-ci
2023-10-31 08:48:37 +02:00
Georgi Gerganov
6669cd8329
llama : update offload functions for KQ tensors
2023-10-31 08:24:07 +02:00
Georgi Gerganov
2926ef63b1
llama : fix input allocation logic
2023-10-31 08:23:43 +02:00
Georgi Gerganov
a3f80013ad
llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
2023-10-30 12:14:23 +02:00
Georgi Gerganov
792d1a1b16
llama : minor
2023-10-30 11:34:47 +02:00
Georgi Gerganov
f39e6075cf
llama : add llm_build_kqv helper
ggml-ci
2023-10-29 22:45:03 +02:00
Georgi Gerganov
c9121fdd0f
llama : remove obsolete comments in build graphs
2023-10-29 21:44:19 +02:00
Georgi Gerganov
a104abea48
llama : simplify falcon Q, K, V computation
2023-10-29 21:24:25 +02:00
Georgi Gerganov
31a12f3d03
llama : fix llm_build_k_shift to use n_head_kv instead of n_head
2023-10-29 21:17:46 +02:00
Georgi Gerganov
5990861938
llama : remove obsolete offload names
2023-10-29 21:11:20 +02:00
Georgi Gerganov
3e0462594b
llama : add llm_build_kv_store helper
ggml-ci
2023-10-29 21:09:34 +02:00
Georgi Gerganov
909d64471b
llama : fix offloading after recent changes
2023-10-29 20:38:49 +02:00
Georgi Gerganov
38728a0be0
llama : add llm_build_k_shift helper
ggml-ci
2023-10-29 19:23:07 +02:00
Georgi Gerganov
dbf836bb64
llama : add llm_build_ffn helper function ( #3849 )
ggml-ci
2023-10-29 18:47:46 +02:00
Georgi Gerganov
7db9c96d8a
llama : add llm_build_norm helper function
ggml-ci
2023-10-29 15:48:48 +02:00
Georgi Gerganov
210e6e5d02
llama : remove obsolete map for layer counting
2023-10-29 13:39:04 +02:00
Georgi Gerganov
79ad734417
llama : comment
ggml-ci
2023-10-29 13:27:53 +02:00
Georgi Gerganov
761087932b
llama : add functional header
2023-10-29 13:26:32 +02:00
Georgi Gerganov
8925cf9ef8
llama : add layer index to all tensor names
2023-10-29 13:22:15 +02:00
Georgi Gerganov
1e9c5443c2
llama : refactor tensor offloading as callback
2023-10-29 13:05:10 +02:00
Georgi Gerganov
da936188d8
llama : move refact in correct place + optimize graph input
2023-10-29 11:48:58 +02:00
Georgi Gerganov
739b85c985
llama : try to fix build
2023-10-29 11:25:32 +02:00
Georgi Gerganov
25cfbf6776
llama : fix non-CUDA build
2023-10-29 11:12:03 +02:00
Georgi Gerganov
b4ad03b3a7
llama : try to optimize offloading code
2023-10-29 10:33:11 +02:00
Georgi Gerganov
79617902ea
llama : fix res_norm offloading
2023-10-29 09:20:35 +02:00
Georgi Gerganov
e14aa46151
llama : do tensor offload only with CUDA
2023-10-29 08:03:46 +02:00
Georgi Gerganov
0dc05b8433
llama : factor graph input into a function
2023-10-29 07:52:43 +02:00
Georgi Gerganov
4e98897ede
llama : support offloading result_norm + comments
2023-10-29 07:36:07 +02:00
Georgi Gerganov
51c4f9ee9f
llama : comments
2023-10-28 22:50:08 +03:00
Georgi Gerganov
3af8771389
llama : update offload log messages to print node index
2023-10-28 22:36:44 +03:00
Georgi Gerganov
83d2c43791
llama : offload rest of the models
ggml-ci
2023-10-28 22:30:54 +03:00
Georgi Gerganov
38aca9e1ab
llama : factor out tensor offloading outside the build call (wip)
ggml-ci
2023-10-28 21:22:31 +03:00
Georgi Gerganov
5946d98fc8
metal : disable kernel load log
2023-10-28 21:22:01 +03:00
Georgi Gerganov
8b2420d249
llama : factor out ggml-alloc from graph build functions
ggml-ci
2023-10-28 19:54:28 +03:00
Erik Scholz
ff3bad83e2
flake : update flake.lock for newer transformers version + provide extra dev shell ( #3797 )
* flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts)
2023-10-28 16:41:07 +02:00
Aarni Koskela
82a6646e02
metal : try cwd for ggml-metal.metal if bundle lookup fails ( #3793 )
* Try cwd for ggml-metal if bundle lookup fails
When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
`server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of
passing `null` as a path.
Follows up on #1782
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-28 15:43:01 +03:00
Georgi Gerganov
ba231e8a6d
issues : change label from bug to bug-unconfirmed ( #3748 )
2023-10-28 15:35:26 +03:00
Georgi Gerganov
8a2f2fea29
convert : ignore tokens if their IDs are within [0, vocab_size) ( #3831 )
2023-10-28 06:25:15 -06:00
Kerfuffle
bd6d9e2059
llama : allow quantizing k-quants to fall back when tensor size incompatible ( #3747 )
* Allow quantizing k-quants to fall back when tensor size incompatible
* quantizing: Add warning when tensors were incompatible with k-quants
Clean up k-quants state passing a bit
2023-10-28 14:54:24 +03:00
Georgi Gerganov
ee1a0ec9cb
llama : add option for greedy sampling with probs ( #3813 )
* llama : add option for greedy sampling with probs
* llama : add comment about llama_sample_token_greedy() missing probs
* sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs
2023-10-28 14:23:11 +03:00
Henk Poley
177461104b
common : print that one line of the syntax help *also* to standard output ( #3823 )
2023-10-28 13:16:33 +03:00
Georgi Gerganov
fdee152e4e
starcoder : add GPU offloading ( #3827 )
* starcoder : do not GPU split 1D bias tensors
* starcoder : offload layers to GPU
ggml-ci
2023-10-28 12:06:08 +03:00
Kerfuffle
41aee4df82
speculative : ensure draft and target model vocab matches ( #3812 )
* speculative: Ensure draft and target model vocab matches
* Tolerate small differences when checking dft vs tgt vocab
2023-10-28 00:40:07 +03:00
cebtenzzre
6d459cbfbe
llama : correctly report GGUFv3 format ( #3818 )
2023-10-27 17:33:53 -04:00
Thibault Terrasson
c8d6a1f34a
simple : fix batch handling ( #3803 )
2023-10-27 08:37:41 -06:00
Georgi Gerganov
2f9ec7e271
cuda : improve text-generation and batched decoding performance ( #3776 )
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
2023-10-27 17:01:23 +03:00
Georgi Gerganov
34b2a5e1ee
server : do not release slot on image input ( #3798 )
2023-10-26 22:54:17 +03:00