llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-26 03:14:35 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	7923b70cb8	llama : add llm_build_inp_embd helper	2023-10-31 16:43:08 +02:00
Georgi Gerganov	2073347e3b	llama : remove extra ; + deduplicate gate_b logic	2023-10-31 16:28:09 +02:00
Georgi Gerganov	fc5a26aade	llama : enable warning about not offloaded tensors	2023-10-31 08:57:10 +02:00
Georgi Gerganov	0bfdcdd0f8	llama : normalize tensor names ggml-ci	2023-10-31 08:48:37 +02:00
Georgi Gerganov	6669cd8329	llama : update offload functions for KQ tensors	2023-10-31 08:24:07 +02:00
Georgi Gerganov	2926ef63b1	llama : fix input allocation logic	2023-10-31 08:23:43 +02:00
Georgi Gerganov	a3f80013ad	llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading	2023-10-30 12:14:23 +02:00
Georgi Gerganov	792d1a1b16	llama : minor	2023-10-30 11:34:47 +02:00
Georgi Gerganov	f39e6075cf	llama : add llm_build_kqv helper ggml-ci	2023-10-29 22:45:03 +02:00
Georgi Gerganov	c9121fdd0f	llama : remove obsolete comments in build graphs	2023-10-29 21:44:19 +02:00
Georgi Gerganov	a104abea48	llama : simplify falcon Q, K, V computation	2023-10-29 21:24:25 +02:00
Georgi Gerganov	31a12f3d03	llama : fix llm_build_k_shift to use n_head_kv instead of n_head	2023-10-29 21:17:46 +02:00
Georgi Gerganov	5990861938	llama : remove obsolete offload names	2023-10-29 21:11:20 +02:00
Georgi Gerganov	3e0462594b	llama : add llm_build_kv_store helper ggml-ci	2023-10-29 21:09:34 +02:00
Georgi Gerganov	909d64471b	llama : fix offloading after recent changes	2023-10-29 20:38:49 +02:00
Georgi Gerganov	38728a0be0	llama : add llm_build_k_shift helper ggml-ci	2023-10-29 19:23:07 +02:00
Georgi Gerganov	dbf836bb64	llama : add llm_build_ffn helper function (#3849) ggml-ci	2023-10-29 18:47:46 +02:00
Georgi Gerganov	7db9c96d8a	llama : add llm_build_norm helper function ggml-ci	2023-10-29 15:48:48 +02:00
Georgi Gerganov	210e6e5d02	llama : remove obsolete map for layer counting	2023-10-29 13:39:04 +02:00
Georgi Gerganov	79ad734417	llama : comment ggml-ci	2023-10-29 13:27:53 +02:00
Georgi Gerganov	761087932b	llama : add functional header	2023-10-29 13:26:32 +02:00
Georgi Gerganov	8925cf9ef8	llama : add layer index to all tensor names	2023-10-29 13:22:15 +02:00
Georgi Gerganov	1e9c5443c2	llama : refactor tensor offloading as callback	2023-10-29 13:05:10 +02:00
Georgi Gerganov	da936188d8	llama : move refact in correct place + optimize graph input	2023-10-29 11:48:58 +02:00
Georgi Gerganov	739b85c985	llama : try to fix build	2023-10-29 11:25:32 +02:00
Georgi Gerganov	25cfbf6776	llama : fix non-CUDA build	2023-10-29 11:12:03 +02:00
Georgi Gerganov	b4ad03b3a7	llama : try to optimize offloading code	2023-10-29 10:33:11 +02:00
Georgi Gerganov	79617902ea	llama : fix res_norm offloading	2023-10-29 09:20:35 +02:00
Georgi Gerganov	e14aa46151	llama : do tensor offload only with CUDA	2023-10-29 08:03:46 +02:00
Georgi Gerganov	0dc05b8433	llama : factor graph input into a function	2023-10-29 07:52:43 +02:00
Georgi Gerganov	4e98897ede	llama : support offloading result_norm + comments	2023-10-29 07:36:07 +02:00
Georgi Gerganov	51c4f9ee9f	llama : comments	2023-10-28 22:50:08 +03:00
Georgi Gerganov	3af8771389	llama : update offload log messages to print node index	2023-10-28 22:36:44 +03:00
Georgi Gerganov	83d2c43791	llama : offload rest of the models ggml-ci	2023-10-28 22:30:54 +03:00
Georgi Gerganov	38aca9e1ab	llama : factor out tensor offloading outside the build call (wip) ggml-ci	2023-10-28 21:22:31 +03:00
Georgi Gerganov	5946d98fc8	metal : disable kernel load log	2023-10-28 21:22:01 +03:00
Georgi Gerganov	8b2420d249	llama : factor out ggml-alloc from graph graph build functions ggml-ci	2023-10-28 19:54:28 +03:00
Erik Scholz	ff3bad83e2	flake : update flake.lock for newer transformers version + provide extra dev shell (#3797) * flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts)	2023-10-28 16:41:07 +02:00
Aarni Koskela	82a6646e02	metal : try cwd for ggml-metal.metal if bundle lookup fails (#3793) * Try cwd for ggml-metal if bundle lookup fails When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`, `server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]` returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of passing `null` as a path. Follows up on #1782 * Update ggml-metal.m --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-28 15:43:01 +03:00
Georgi Gerganov	ba231e8a6d	issues : change label from bug to bug-unconfirmed (#3748)	2023-10-28 15:35:26 +03:00
Georgi Gerganov	8a2f2fea29	convert : ignore tokens if their IDs are within [0, vocab_size) (#3831)	2023-10-28 06:25:15 -06:00
Kerfuffle	bd6d9e2059	llama : allow quantizing k-quants to fall back when tensor size incompatible (#3747) * Allow quantizing k-quants to fall back when tensor size incompatible * quantizing: Add warning when tensors were incompatible with k-quants Clean up k-quants state passing a bit	2023-10-28 14:54:24 +03:00
Georgi Gerganov	ee1a0ec9cb	llama : add option for greedy sampling with probs (#3813) * llama : add option for greedy sampling with probs * llama : add comment about llama_sample_token_greedy() missing probs * sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs	2023-10-28 14:23:11 +03:00
Henk Poley	177461104b	common : print that one line of the syntax help also to standard output (#3823)	2023-10-28 13:16:33 +03:00
Georgi Gerganov	fdee152e4e	starcoder : add GPU offloading (#3827) * starcoder : do not GPU split 1D bias tensors * starcoder : offload layers to GPU ggml-ci	2023-10-28 12:06:08 +03:00
Kerfuffle	41aee4df82	speculative : ensure draft and target model vocab matches (#3812) * speculative: Ensure draft and target model vocab matches * Tolerate small differences when checking dft vs tgt vocab	2023-10-28 00:40:07 +03:00
cebtenzzre	6d459cbfbe	llama : correctly report GGUFv3 format (#3818)	2023-10-27 17:33:53 -04:00
Thibault Terrasson	c8d6a1f34a	simple : fix batch handling (#3803)	2023-10-27 08:37:41 -06:00
Georgi Gerganov	2f9ec7e271	cuda : improve text-generation and batched decoding performance (#3776) * cuda : prints wip * cuda : new cublas gemm branch for multi-batch quantized src0 * cuda : add F32 sgemm branch * cuda : fine-tune >= VOLTA params + use MMQ only for small batches * cuda : remove duplicated cuBLAS GEMM code * cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros * build : add compile option to force use of MMQ kernels	2023-10-27 17:01:23 +03:00
Georgi Gerganov	34b2a5e1ee	server : do not release slot on image input (#3798)	2023-10-26 22:54:17 +03:00

1478 Commits