Commit Graph

1343 Commits

Author SHA1 Message Date
Adam Treat
4ed25b2f88 Sync from device back to host at the beginning of a new prompt. 2023-10-05 13:39:18 -04:00
Adam Treat
bd5f6399bb Don't try and install kompute artifacts. 2023-10-05 13:39:18 -04:00
Aaron Miller
8bea719879 vulkan: disambiguate gpus with the same name 2023-10-05 13:39:18 -04:00
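The disambiguation commit above deals with systems that expose two or more GPUs reporting the same Vulkan `deviceName`. A minimal sketch of one way to do this, using the standard enumeration calls and appending the enumeration index to the name; this is illustration only, not necessarily the exact approach taken in the commit:

```cpp
#include <vulkan/vulkan.h>
#include <string>
#include <vector>

// Illustration only: give identically named GPUs distinct labels by
// appending the enumeration index to the Vulkan deviceName.
static std::vector<std::string> list_gpu_labels(VkInstance instance) {
    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    std::vector<std::string> labels;
    for (uint32_t i = 0; i < count; ++i) {
        VkPhysicalDeviceProperties props = {};
        vkGetPhysicalDeviceProperties(devices[i], &props);
        // deviceName alone can collide (e.g. two identical cards), so add the
        // index; props.deviceID or a PCI bus info extension would also work.
        labels.push_back(std::string(props.deviceName) + " #" + std::to_string(i));
    }
    return labels;
}
```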
Adam Treat
68cf1df6fb Throw an exception when allocation fails for vulkan. 2023-10-05 13:39:18 -04:00
Aaron Miller
beee57266f Make kompute actually include external SDK headers when requested 2023-10-05 13:39:18 -04:00
Adam Treat
b7e2e691d4 Completely revamp how we do object management with the vulkan backend and
stop using so many static objects, so we can tear down and bring up vulkan
on new devices in the same runtime.
2023-10-05 13:39:18 -04:00
Adam Treat
45c8778b49 Switch to a dynamic dispatch table instead of linking hard against libvulkan. 2023-10-05 13:39:18 -04:00
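The dynamic-dispatch change above replaces a hard link against libvulkan with entry points resolved at runtime. The sketch below shows the general pattern, assuming a Linux `dlopen`/`dlsym` loader and the standard `vkGetInstanceProcAddr` bootstrap; the actual backend may use Vulkan-Hpp's dynamic dispatcher, so treat this as an outline rather than the project's code.

```cpp
#define VK_NO_PROTOTYPES        // we resolve every entry point ourselves
#include <vulkan/vulkan.h>
#include <dlfcn.h>

// Hypothetical loader: resolve Vulkan entry points at runtime instead of
// linking against libvulkan at build time.
struct vk_dispatch {
    void                      *lib                 = nullptr;
    PFN_vkGetInstanceProcAddr  getInstanceProcAddr = nullptr;
    PFN_vkCreateInstance       createInstance      = nullptr;

    bool load() {
        lib = dlopen("libvulkan.so.1", RTLD_NOW | RTLD_LOCAL);
        if (!lib) return false;
        getInstanceProcAddr = reinterpret_cast<PFN_vkGetInstanceProcAddr>(
            dlsym(lib, "vkGetInstanceProcAddr"));
        if (!getInstanceProcAddr) return false;
        // Global-level functions are resolved with a null instance; instance-
        // and device-level functions are fetched the same way once created.
        createInstance = reinterpret_cast<PFN_vkCreateInstance>(
            getInstanceProcAddr(nullptr, "vkCreateInstance"));
        return createInstance != nullptr;
    }
};
```

Fetching functions through `vkGetInstanceProcAddr` also makes it possible to fail gracefully at runtime when no Vulkan loader is installed, instead of failing at dynamic-link time.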
Aaron Miller
8563fa001f remove dynamic deps from kompute build
should no longer have new external deps other than libvulkan

```
ubuntu@ip-172-31-1-24:~/repo/gpt4all/gpt4all-backend/build$ ldd ./libllamamodel-mainline-avxonly.so
        linux-vdso.so.1 (0x00007ffcb53bb000)
        libvulkan.so.1 => /lib/x86_64-linux-gnu/libvulkan.so.1 (0x00007f239dab5000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f239d800000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f239d719000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f239da95000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f239d400000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f239dd1d000)
```
2023-10-05 13:39:18 -04:00
Adam Treat
48a45ea435 Remove warning which fails on windows. 2023-10-05 13:39:18 -04:00
niansa
ba15dfd0be Nomic vulkan backend licensed under the Software for Open Models License (SOM), version 1.0. 2023-10-05 13:39:18 -04:00
Georgi Gerganov
ec893798b7
llama : custom attention mask + parallel decoding + no context swaps (#3228)
* tests : verify that RoPE is "additive"

* llama : replace ggml_diag_mask_inf with ggml_add (custom -inf mask)

* ggml : ggml_rope now takes a vector with positions instead of n_past

* metal : add rope_f16 kernel + optimize cpy kernels

* llama : unified KV cache + batch inference API

* llama : add new llama_decode() API that works with llama_batch

* llama : add cell_max heuristic for more efficient kv_cache

* llama : extend llama_kv_cache API

* llama : more robust cell_max heuristic + wip shift

* metal : disable concurrency optimization

* llama : add llama_kv_cache_shift_seq + no more context swaps

* llama : apply K-cache roping for Falcon and Baichuan

* speculative : fix KV cache management

* parallel : example for serving multiple users in parallel

* parallel : disable hot-plug to avoid cache fragmentation

* fixes : speculative KV cache + llama worst-case graph

* llama : extend batch API to select which logits to output

* llama : fix worst case graph build

* ggml-cuda : update rope implementation for parallel decoding (#3254)

* ggml-cuda : update rope implementation for parallel decoding

* better solution for p0 computation

* fix rope

* simpler rope implementation

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* make : add parallel to build + fix static functions in llama.cpp

* simple : fix token counting

* parallel : various improvements

* llama : fix cell_max logic + rename functions

* parallel : try smaller batches when the KV cache is fragmented

* parallel : fix sequence termination criteria

* llama : silence KV cache errors

* parallel : remove new line from prompt

* parallel : process system prompt once + configurable parameters + llama API

* parallel : remove question with short answers

* parallel : count cache misses

* parallel : print misses on each request

* parallel : minor

* llama : fix n_kv to never become 0

* parallel : rename hot-plug to continuous-batching

* llama : improve llama_batch API + simplify parallel example

* simple : add parallel decoding support

* simple : improve comments + free batch

* ggml-cuda : add rope f16, restore performance with parallel decoding (#3272)

* ggml-cuda : add rope f16, restore performance

* offload KQ_mask with all models

* fix rope shift

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : disable MPI for now

ggml-ci

* train : make KQ_pos memory buffer permanent via dummy scale op

* ggml : revert change to ggml_cpy, add ggml_cont_Nd instead (#3275)

ggml-ci

* parallel : fix bug (extra BOS) + smaller token_prev array

* parallel : fix cases where the input prompts can overflow the batch

* parallel : add disabled experimental batch chunking in powers of two

* llama : llama.h formatting + comments

* simple : add README.md

* llama : fix kv cache heuristic when context is less than 32

* parallel : fix crash when `-n -1`

* llama : simplify returns if/else branches

* metal : use mm kernels for batch size > 2

* examples : utilize new llama_get_logits_ith()

* examples : add example for batched decoding

* examples : do not eval prompt 2 times (close #3348)

* server : clear the KV cache beyond n_past before llama_decode

* server : avoid context swaps by shifting the KV cache

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-09-28 19:04:36 +03:00
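The #3228 change above replaces the single `n_past` view of the KV cache with a unified cache shared by many sequences, which is what enables parallel decoding and removes the old context swaps. Below is a conceptual sketch with hypothetical types (not the llama.cpp structs): each cache cell records its position and the set of sequence ids it belongs to, and the custom additive attention mask only lets a token attend to cells of its own sequence at earlier positions.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical types for illustration only -- not the llama.cpp structs.
using pos_t = int32_t;
using seq_t = int32_t;

struct kv_cell {
    pos_t           pos = -1; // position of the cached token within its sequence
    std::set<seq_t> seqs;     // sequences this cell belongs to (empty = free)
};

struct unified_kv_cache {
    std::vector<kv_cell> cells;

    explicit unified_kv_cache(size_t n_ctx) : cells(n_ctx) {}

    // Claim a free cell for a new (seq, pos) entry. The real implementation
    // also maintains a cell_max / head heuristic so it rarely scans everything.
    int alloc(seq_t seq, pos_t pos) {
        for (size_t i = 0; i < cells.size(); ++i) {
            if (cells[i].seqs.empty()) {
                cells[i].pos = pos;
                cells[i].seqs.insert(seq);
                return (int) i;
            }
        }
        return -1; // cache full: the caller must evict or try a smaller batch
    }

    // The custom additive attention mask encodes exactly this rule: a token of
    // sequence `seq` at position `pos` may only attend to cells of the same
    // sequence at earlier-or-equal positions; everything else gets -inf.
    bool visible(const kv_cell & c, seq_t seq, pos_t pos) const {
        return c.seqs.count(seq) > 0 && c.pos <= pos;
    }
};
```

With positions stored per cell rather than implied by a single `n_past`, each token in a batch carries its own position; this is why `ggml_rope` now takes a vector of positions, and why shifting a sequence in place can replace the old context swap.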
Kevin Ji
45855b3f1c
docs : mark code as Bash (#3375) 2023-09-28 09:11:32 -04:00
Pierre Alexandre SCHEMBRI
4aea3b846e
readme : add Mistral AI release 0.1 (#3362) 2023-09-28 15:13:37 +03:00
slaren
da0400344b
ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (#3370)
* ggml-cuda : perform cublas fp16 matrix multiplication as fp16

* try to fix rocm build

* restrict fp16 mat mul to volta and up
2023-09-28 13:08:28 +03:00
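For the fp16 cuBLAS change above, the relevant call is `cublasGemmEx` with half-precision input, output, and compute types. A hedged sketch of such a call, not the ggml-cuda code itself:

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Hypothetical helper, not the ggml-cuda code: C = A * B with fp16 inputs,
// fp16 output and fp16 accumulation, which is the mode this commit enables.
static cublasStatus_t gemm_f16(cublasHandle_t handle,
                               int m, int n, int k,
                               const half * A, int lda,
                               const half * B, int ldb,
                               half       * C, int ldc) {
    // With CUBLAS_COMPUTE_16F the scaling factors are passed as half as well.
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);
    return cublasGemmEx(handle,
                        CUBLAS_OP_N, CUBLAS_OP_N,
                        m, n, k,
                        &alpha,
                        A, CUDA_R_16F, lda,
                        B, CUDA_R_16F, ldb,
                        &beta,
                        C, CUDA_R_16F, ldc,
                        CUBLAS_COMPUTE_16F,
                        CUBLAS_GEMM_DEFAULT);
}
```

The "volta and up" restriction from the follow-up bullet would typically be a `cudaGetDeviceProperties` check for compute capability 7.0 or higher before selecting this path.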
Zhang Peiyuan
e519621010
convert : remove bug in convert.py permute function (#3364) 2023-09-27 20:45:20 +02:00
Richard Roberson
ac43576124
make-ggml.py : compatibility with more models and GGUF (#3290)
* Resync my fork with new llama.cpp commits

* examples : rename to use dash instead of underscore

* New model conversions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-27 19:25:12 +03:00
Cebtenzzre
20c7e1e804
gguf : fix a few general keys (#3341) 2023-09-27 12:18:07 -04:00
Rickard Hallerbäck
dc6897404e
metal : reusing llama.cpp logging (#3152)
* metal : reusing llama.cpp logging

* cmake : build fix

* metal : logging callback

* metal : logging va_args memory fix

* metal : minor cleanup

* metal : setting function like logging macro to capital letters

* llama.cpp : trailing whitespace fix

* ggml : log level enum used by llama

* Makefile : cleanup ggml-metal recipe

* ggml : ggml_log_callback typedef

* ggml : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-27 18:48:33 +03:00
Jag Chadha
527e57cfd8
build : add ACCELERATE_NEW_LAPACK to fix warning on macOS Sonoma (#3342) 2023-09-27 18:34:32 +03:00
BarfingLemurs
ffe88a36a9
readme : add some recent perplexity and bpw measurements to READMES, link for k-quants (#3340)
* Update README.md

* Update README.md

* Update README.md with k-quants bpw measurements
2023-09-27 18:30:36 +03:00
DAN™
99115f3fa6
cmake : fix build-info.h on MSVC (#3309) 2023-09-25 18:45:33 -04:00
2f38b454
1726f9626f
docs: Fix typo in CLBlast_DIR var. (#3330) 2023-09-25 20:24:52 +02:00
Erik Scholz
a98b1633d5
nix : add cuda, use a symlinked toolkit for cmake (#3202) 2023-09-25 13:48:30 +02:00
slaren
c091cdfb24
llama-bench : add README (#3317)
* llama-bench : add README

* minor edit
2023-09-23 21:48:24 +02:00
Cebtenzzre
51a7cf5c6e
examples : fix RoPE defaults to match PR #3240 (#3315) 2023-09-23 12:28:50 +03:00
Kevin Ji
bedb92b603
scripts : use /usr/bin/env in shebang (#3313) 2023-09-22 23:52:23 -04:00
Lee Drake
bc9d3e3971
Update README.md (#3289)
* Update README.md

* Update README.md

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-09-21 21:00:24 +02:00
shibe2
36b904e200
ggml-opencl.cpp: Make private functions static (#3300) 2023-09-21 14:10:26 -04:00
Edward Taylor
324f3403d5
zig : fix for updated c lib (#3259) 2023-09-21 12:08:20 +03:00
yuiseki
f56c418ab0
embedding : update README.md (#3224) 2023-09-21 11:57:40 +03:00
Johannes Gäßler
8185710a80
CUDA: use only 1 thread if fully offloaded (#2915) 2023-09-21 11:43:53 +03:00
Georgi Gerganov
7eb41179ed
readme : update hot topics 2023-09-20 20:48:22 +03:00
Cebtenzzre
a5661d7e71
llama : allow gguf RoPE keys to be overridden with defaults (#3240) 2023-09-20 12:12:47 -04:00
Cebtenzzre
65c2c1c5ab
benchmark-matmult : do not use integer abs() on a float (#3277) 2023-09-20 12:06:08 -04:00
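The benchmark-matmult fix above addresses a classic pitfall: the integer `abs()` silently truncates a floating-point argument, so small error terms collapse to zero. A minimal illustration (not the benchmark code):

```cpp
#include <cmath>
#include <cstdio>
#include <cstdlib>

int main() {
    const float diff = -0.4f;

    // Integer abs() truncates the argument to int first, so a small error
    // like -0.4 becomes 0 and goes unnoticed -- the bug class fixed above.
    const int   wrong = std::abs(static_cast<int>(diff)); // 0
    const float right = std::fabs(diff);                  // 0.4

    std::printf("integer abs: %d, fabs: %f\n", wrong, right);
    return 0;
}
```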
kang
80834daecf
flake : Restore default package's buildInputs (#3262) 2023-09-20 15:48:22 +02:00
Alon
a40f2b656f
CI: FreeBSD fix (#3258)
* freebsd ci: use qemu
2023-09-20 14:06:36 +02:00
Georgi Gerganov
d119c04c15
examples : fix benchmark-matmult (#1554)
The precision for Q4_0 has degraded since #1508
2023-09-20 10:02:39 +03:00
Cebtenzzre
8781013ef6
make : restore build-info.h dependency for several targets (#3205) 2023-09-18 10:03:53 -04:00
Erik Scholz
7ddf185537
ci : switch cudatoolkit install on windows to networked (#3236) 2023-09-18 02:21:47 +02:00
Johannes Gäßler
ee66942d7e
CUDA: fix peer access logic (#3231) 2023-09-17 23:35:20 +02:00
Johannes Gäßler
111163e246
CUDA: enable peer access between devices (#2470) 2023-09-17 16:37:53 +02:00
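Peer access, as enabled in #2470 above, lets one GPU read another GPU's memory directly over NVLink/PCIe instead of staging through the host. A hedged sketch of the usual enablement loop with the CUDA runtime API (illustrative, not the ggml-cuda implementation):

```cpp
#include <cuda_runtime.h>

// Illustrative sketch, not the ggml-cuda implementation: enable P2P access
// between every pair of devices that supports it.
static void enable_peer_access() {
    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);

    for (int i = 0; i < n_devices; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < n_devices; ++j) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            if (can_access) {
                cudaError_t err = cudaDeviceEnablePeerAccess(j, 0 /* flags must be 0 */);
                if (err == cudaErrorPeerAccessAlreadyEnabled) {
                    cudaGetLastError(); // clear the sticky error and carry on
                }
            }
        }
    }
}
```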
slaren
8b428c9bc8
llama.cpp : show model size and BPW on load (#3223) 2023-09-17 14:33:28 +02:00
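Bits per weight (BPW), as now reported on load by the commit above, is simply the model file size in bits divided by the parameter count. A tiny worked example with hypothetical numbers:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical numbers: a ~6.74B-parameter model stored in a ~3.83 GiB file.
    const uint64_t file_bytes = 4113187840ull;
    const uint64_t n_params   = 6738415616ull;
    const double   bpw        = 8.0 * (double) file_bytes / (double) n_params;
    std::printf("model size = %.2f GiB, %.2f BPW\n",
                file_bytes / (1024.0 * 1024.0 * 1024.0), bpw);
    return 0;
}
```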
Johannes Gäßler
578d8c8f5c
CUDA: fix scratch malloced on non-main device (#3220) 2023-09-17 14:16:22 +02:00
IsaacDynamo
b541b4f0b1
Enable BUILD_SHARED_LIBS=ON on all Windows builds (#3215) 2023-09-16 19:35:25 +02:00
Vlad
5dbc2b3213
Enable build with CUDA 11.0 (make) (#3132)
* CUDA 11.0 fixes

* Cleaner CUDA/host flags separation

Also renamed GGML_ASSUME into GGML_CUDA_ASSUME
2023-09-16 16:55:43 +02:00
goerch
b08e75baea
Fixing the last deviations from sentencepiece indicated by test-tokenizer-1 (#3170)
* Fix for #2721

* Reenable tokenizer test for LLaMa

* Add `console.cpp` dependency

* Fix dependency to `common`

* Fixing wrong fix.

* Make console usage platform specific

Work on compiler warnings.

* Adapting makefile

* Remove trailing whitespace

* Adapting the other parts of the makefile

* Fix typo.

* Fixing the last deviations from sentencepiece indicated by test-tokenizer-1

* Simplify logic

* Add missing change...

* Fix ugly compiler warning

* llama_tokenize should accept strings containing NUL now

* Adding huichen's test case
2023-09-16 13:41:33 +02:00
Cebtenzzre
e6616cf0db
examples : add compiler version and target to build info (#2998) 2023-09-15 16:59:49 -04:00
Cebtenzzre
3aefaab9e5
check C++ code with -Wmissing-declarations (#3184) 2023-09-15 15:38:27 -04:00
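`-Wmissing-declarations`, enabled by the commit above, warns when a function with external linkage is defined without a prior declaration; the usual fixes are to declare it in a header or, for file-local helpers, give it internal linkage:

```cpp
// Defining a function with external linkage and no prior declaration trips
// -Wmissing-declarations:
//
//     int helper(int x) { return x * 2; }   // warning
//
// Either declare it in a header included by this file, or make it file-local:
static int helper(int x) { return x * 2; }
```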
Cebtenzzre
69eb67e282
fix build numbers by setting fetch-depth=0 (#3197) 2023-09-15 15:18:15 -04:00
Meng Zhang
4fe09dfe66
llama : add support for StarCoder model architectures (#3187)
* add placeholder of starcoder in gguf / llama.cpp

* support convert starcoder weights to gguf

* convert MQA to MHA

* fix ffn_down name

* add LLM_ARCH_STARCODER to llama.cpp

* set head_count_kv = 1

* load starcoder weight

* add max_position_embeddings

* set n_positions to max_position_embeddings

* properly load all starcoder params

* fix head count kv

* fix comments

* fix vram calculation for starcoder

* store mqa directly

* add input embeddings handling

* add TBD

* working on cpu, metal buggy

* cleanup useless code

* metal : fix out-of-bounds access in soft_max kernels

* llama : make starcoder graph build more consistent with others

* refactor: cleanup comments a bit

* add other starcoder models: 3B, 7B, 15B

* support-mqa-directly

* fix: remove max_position_embeddings, use n_train_ctx

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix: switch to space from tab

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-15 22:02:13 +03:00
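One step in the StarCoder port above is converting multi-query attention (a single shared K/V head) into the multi-head layout the existing graph expected, before a later bullet switches to storing MQA directly. A hypothetical sketch of what that replication amounts to (illustration only, not the convert script):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration of "convert MQA to MHA": a multi-query model
// stores one K/V head; to reuse a standard multi-head attention graph, the
// single shared head is replicated n_head times.
std::vector<float> mqa_to_mha(const std::vector<float> & kv_head, // flattened shared head
                              size_t n_head) {
    std::vector<float> out;
    out.reserve(kv_head.size() * n_head);
    for (size_t h = 0; h < n_head; ++h) {
        out.insert(out.end(), kv_head.begin(), kv_head.end()); // copy the shared head
    }
    return out;
}
```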