llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-15 23:39:52 +00:00

Author	SHA1	Message	Date
slaren	dd047b476c	disable docker CI on pull requests (#8110 )	2024-06-25 19:20:06 +02:00
joecryptotoo	925c30956d	Add healthchecks to llama-server containers (#8081 ) * added healthcheck * added healthcheck * added healthcheck * added healthcheck * added healthcheck * moved curl to base * moved curl to base	2024-06-25 17:13:27 +02:00
Brian	c8ad35955a	Gguf dump start data offset via --data-offset and some extra refactor (#8054 ) * gguf-dump: add --data-offset * gguf-dump: add tensor data offset table * gguf-dump: refactor GGUFReader for clarity * gguf-dump: add --data-alignment * gguf-dump.py: Rename variables and adjust comments start_data_offset --> data_offset _build_tensors_info_fields --> _build_tensor_info	2024-06-25 22:03:25 +10:00
Xuan Son Nguyen	49c03c79cd	cvector: better prompt handling, add "mean vector" method (#8069 ) * remove completions file * fix inverted vector * add mean method * code style * remove inverted pca hotfix	2024-06-25 13:59:54 +02:00
Xuan Son Nguyen	48e6b92cc3	Add chat template support for llama-cli (#8068 ) * add chat template support for llama-cli * add help message * server: simplify format_chat * more consistent naming * improve * add llama_chat_format_example * fix server * code style * code style * Update examples/main/main.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-25 21:56:49 +10:00
HanishKVC	3791ad2193	SimpleChat v3.1: Boolean chat request options in Settings UI, cache_prompt (#7950 ) * SimpleChat: Allow for chat req bool options to be user controlled * SimpleChat: Allow user to control cache_prompt flag in request * SimpleChat: Add sample GUI images to readme file Show the chat screen and the settings screen * SimpleChat:Readme: Add quickstart block, title to image, cleanup * SimpleChat: RePosition contents of the Info and Settings UI Make it more logically structured and flow through. * SimpleChat: Rename to apiRequestOptions from chatRequestOptions So that it is not wrongly assumed that these request options are used only for chat/completions endpoint. Rather these are used for both the end points, so rename to match semantic better. * SimpleChat: Update image included with readme wrt settings ui * SimpleChat:ReadMe: Switch to webp screen image to reduce size	2024-06-25 21:27:35 +10:00
HatsuneMikuUwU33	f702a90e24	Update control vector help (#8104 )	2024-06-25 10:44:48 +02:00
Meng, Hengyu	083bacce14	[SYCL] Re-enabled mul_mat_batched_sycl (#8095 )	2024-06-25 10:19:20 +08:00
Johannes Gäßler	2df373ac40	CUDA: fix matrix multiplication algorithm choice (#8102 )	2024-06-25 01:22:33 +02:00
Johannes Gäßler	3b099bcd9c	CUDA: fix MMQ writeback for int8 tensor cores (#8100 )	2024-06-24 22:15:33 +02:00
Johannes Gäßler	a818f3028d	CUDA: use MMQ instead of cuBLAS by default (#8075 )	2024-06-24 17:43:42 +02:00
fairydreaming	d62e4aaa02	gguf-py : fix tensor groups for encoder-decoder models in gguf-dump.py (#8090 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Brian <mofosyne@gmail.com>	2024-06-24 14:13:39 +02:00
Johannes Gäßler	9a590c8226	CUDA: optimize MMQ int8 tensor core performance (#8062 ) * CUDA: optimize MMQ int8 tensor core performance * only a single get_mma_tile_x_k function * simplify code, make functions constexpr	2024-06-24 12:41:23 +02:00
Christian Zhou-Zheng	52fc8705a0	Option to split during conversion (#6942 ) * support splits in convert.py * Support split by size and dry run to write estimated shards/filesizes * Move split functionality to new GGUFManager class * fix improper function signature * tentative push of convert-hf-to-gguf support * resolve merge + SplitArguments for easier parsing * Fix eager tensor memory leak and remove convert.py changes Removed a memory leak caused by unexpected reference retention to eager tensors. Also removed GGUFManager functionality in convert.py in favor of specializing for convert-hf-to-gguf.py. * refactor SplitStrategy to be a deque Instead of having SplitStrategy have a `data` field that is a deque, just have SplitStrategy be a subclass of deque itself. * fix Q8 quantization * remove unnecessary imports in gguf_manager * fix final? merge issue * fix gguf_writer placement and remove comments * oops, actually fix gguf_writer placement * reduce duplicated code from gguf_writer * further simplify GGUFManager * simplify even further and standardize with GGUFWriter * reduce diffs with master * form shards while adding tensors, SHA256 sums agree with master * re-add type hint Co-authored-by: compilade <git@compilade.net> * GGUFWriter compatibility fix Co-authored-by: compilade <git@compilade.net> * Shard dataclass and un-negative dont_add_architecture * type consistency in format_n_bytes_to_str * move kv keys to constants.py * make pathlib explicit * base-1024 bytes to base-1000 * rename GGUFManager to GGUFWriterSplit * Update gguf-py/gguf/constants.py Co-authored-by: compilade <git@compilade.net> * fix convert-hf-to-gguf.py permissions * fix line endings * Update gguf-py/gguf/gguf_writer_split.py Co-authored-by: compilade <git@compilade.net> * convert-hf : restore executable file permission * examples/convert-legacy-llama.py: restore executable file permission * reinstate original gguf package import and fix type annotation * attempt to appease the linter * attempt 2 to appease the linter * attempt 3 to appease the linter * comma consistency * Update convert-hf-to-gguf.py Co-authored-by: compilade <git@compilade.net> * edit cmd line args * use simplification from #7827 * kv/ti data are still wrong * try to refactor kv data (still fails) * fix ti data messiness * tidy up * fix linting * actually make the linter happy * cleanup round 1 * remove SplitStrategy, SplitArguments * appease linter * fix typing and clean up * fix linting * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * progress bar, fix split logic * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * catch oversights * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * swap bar orders * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * compatibility fix * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <git@compilade.net> * Update convert-hf-to-gguf.py Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: Brian <mofosyne@gmail.com> Co-authored-by: compilade <git@compilade.net>	2024-06-24 19:42:03 +10:00
slaren	8cb508d0d5	disable publishing the full-rocm docker image (#8083 )	2024-06-24 08:36:11 +03:00
Yann Follet	646ef4a9cf	embedding : more cli arguments (#7458 ) * add parameters for embeddings --embd-normalize --embd-output-format --embd-separator description in the README.md * Update README.md fix tipo * Trailing whitespace * fix json generation, use " not ' * fix merge master * fix code formating group of parameters // embedding print usage for embedding parameters --------- Co-authored-by: Brian <mofosyne@gmail.com>	2024-06-24 08:30:24 +03:00
fairydreaming	de0d6a68ac	gguf-py, convert-hf : model conversion support for T5 and FLAN-T5 model variants (#5763 ) * gguf-py : add T5 model architecture * gguf-py : add separate tensors for encoder and decoder * gguf-py : add new model header parameters: decoder_start_token_id, attention.relative_buckets_count, tokenizer.ggml.remove_extra_whitespaces, tokenizer.ggml.precompiled_charsmap * convert-hf : add model conversion support for T5ForConditionalGeneration and T5WithLMHeadModel --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-06-24 07:06:05 +02:00
slaren	95f57bb5d5	ggml : remove ggml_task_type and GGML_PERF (#8017 ) * ggml : remove ggml_task_type and GGML_PERF * check abort_callback on main thread only * vulkan : remove usage of ggml_compute_params * remove LLAMA_PERF	2024-06-24 03:07:59 +02:00
Eddie-Wang	e112b610a1	llama : add support for BitnetForCausalLM (#7931 ) * hf bitnet v1 * hf bitnet e2e v2 * finish bitnet e2e * finish f16 hf bitnet e2e * remove unsed * finish bitnet i2 e2e * move i2s to quantize v1 * move i2 to quantize * clean code * clean code 2 * fix codestyle * fix code * fix * fix code * fix merge * remove unused * change table name * fix whitespace * delete redundant * i2_s to absmax * finish i2_s/i8_s vec_dot x86 simd * i2s->q22 * fix code * remove block scale * add dequantize * fix seq * update avx2 * remove q2_2 * remove q22_grid * fix whitespace * reuse llm_build_kv * fix bo --------- Co-authored-by: root <root@wangjinheng>	2024-06-23 21:27:57 +03:00
Aarni Koskela	6a2f298bd7	server : fix JSON-Scheme typo (#7975 )	2024-06-23 11:03:08 -04:00
Daniel Bevenius	11318d9aa1	Fix typo in llama_set_embeddings comment (#8077 )	2024-06-23 15:39:45 +02:00
slaren	b6b9a8e606	fix CI failures (#8066 ) * test-backend-ops : increase cpy max nmse * server ci : disable thread sanitizer	2024-06-23 13:14:45 +02:00
0cc4m	45c0e2e4c1	Refactor Vulkan backend to allow multiple contexts (#7961 ) * Refactor Vulkan backend to allow multiple contexts * Fix too many shader groups called validation error in llama3 on AMD and Intel GPUs * Fix Vulkan debug build error	2024-06-23 10:21:25 +02:00
Clint Herron	b5a5f34efa	Removing extra blank lines that were breaking Lint. (#8067 )	2024-06-22 14:28:18 -04:00
Xuan Son Nguyen	3e58b0ee35	cvector: fix CI + correct help message (#8064 ) * cvector: fix CI + correct help message * also correct --pca-iter	2024-06-22 18:11:30 +02:00
HatsuneMikuUwU33	adf480c3ab	cvector-generator: Moe Moe Fixie-Fixie for Lots of Formats~! ♡(ᐢ ᴥ ᐢ)♡ (#8052 ) * Update negative.txt * Update positive.txt * Update cvector-generator.cpp * Update cvector-generator.cpp	2024-06-22 17:19:37 +02:00
0xspringtime	3aa184a8c7	convert-hf : change assert to exception (#8015 )	2024-06-22 15:37:41 +02:00
ddh0	5b48cd53a8	Update llama-quantize ppl/file size output from LLaMA-v1 to Llama-3 values (#8058 ) Uses the values computed by @JohannesGaessler in PR #7413	2024-06-22 15:16:10 +02:00
Clint Herron	c5a8d4b749	JSON Schema to GBNF integration tests (#7790 ) * Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars. * Adding additional examples as documented in #7789 . Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program. * Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs. * Merging improved schema test methods added by @ochafik in #7797 * Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework. * Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings. * Fixing grammar indentation to be consistent throughout file.	2024-06-21 23:18:36 -04:00
k.h.lai	557b653dc9	vulkan: detect multiple devices by deviceUUID instead of deviceID (#8022 ) * vulkan: detect multiple devices by deviceUUID instead of deviceID * vulkan: remove unneeded variables * vulkan: fix id query	2024-06-21 10:28:20 +02:00
Eve	7d5e8777ae	ggml : AVX IQ quants (#7845 ) * initial iq4_xs * fix ci * iq4_nl * iq1_m * iq1_s * iq2_xxs * iq3_xxs * iq2_s * iq2_xs * iq3_s before sllv * iq3_s * iq3_s small fix * iq3_s sllv can be safely replaced with sse multiply	2024-06-21 08:57:36 +03:00
Georgi Gerganov	a927b0f3dd	llama : optimize long word tokenization with WPM (#8034 ) ggml-ci	2024-06-21 08:51:28 +03:00
Douglas Hanley	80ea089d77	llama : allow pooled embeddings on any model (#7477 ) * create append_pooling operation; allow to specify attention_type; add last token pooling; update examples * find result_norm/result_embd tensors properly; update output allocation logic * only use embd output for pooling_type NONE * get rid of old causal_attn accessor * take out attention_type; add in llama_set_embeddings * bypass logits when doing non-NONE pooling	2024-06-21 08:38:22 +03:00
Shuichi Tsutsumi	0e64591e82	swiftui : enable stream updating (#7754 )	2024-06-21 08:30:58 +03:00
Hamdoud Hakem	b1ef562bc1	requirements : Bump torch and numpy for python3.12 (#8041 )	2024-06-20 22:01:15 +02:00
Hamdoud Hakem	17b291a6a5	convert-hf : Fix the encoding in the convert-hf-to-gguf-update.py (#8040 )	2024-06-20 21:59:59 +02:00
Johannes Gäßler	abd894ad96	common: fix warning (#8036 ) * common: fix warning * Update common/common.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-20 16:40:13 +02:00
luoyu-intel	de391e4c80	[SYCL] Fix windows build and inference (#8003 ) * add sycl preset * fix debug link error. fix windows crash * update README	2024-06-20 21:19:05 +08:00
Johannes Gäßler	d50f8897a7	CUDA: stream-k decomposition for MMQ (#8018 ) * CUDA: stream-k decomposition for MMQ * fix undefined memory reads for small matrices	2024-06-20 14:39:21 +02:00
Michael de Gans	2075a66a96	metal : fix `ggml_metal_supports_op` for BF16 (#8021 ) Currently the Metal backend does not support BF16. `ggml_metal_supports_op` was returning true in these cases, leading to a crash with models converted with `--leave-output-tensor`. This commit checks if the first few sources types are BF16 and returns false if that's the case.	2024-06-20 08:32:01 +03:00
sasha0552	ba58993152	server : fix smart slot selection (#8020 )	2024-06-20 09:57:10 +10:00
Michael de Gans	a7854743c5	un-ignore `build-info.cmake` and `build-info.sh` (#7996 ) * un-ignore `build-info.cmake` and `build-info.sh` I am assuming that ignoring them was unintentional. If they are ignored, some tools, like cargo, will consider the files inexistent, even if they're comitted, for the purpose of publishing. This leads to the build failing in such cases. * un-ignore `build-info.cpp.in` For the same reason as the previous two files. * Reorganize `.gitignore` * Add exceptions for files mentioned by @slaren I did leave .clang-tidy since it was explicitly ignored before. * Add comments for organization * Sort some lines for pretty * Test with `make` and `cmake` builds to ensure no build artifacts might be comitted * Remove `.clang-tidy` from `.gitignore` Per comment by @ggerganov * Remove `IDEWorkspaceChecks.plist` from root-level `.gitignore`	2024-06-19 22:10:42 +02:00
slaren	9c77ec1d74	ggml : synchronize threads using barriers (#7993 )	2024-06-19 15:04:15 +02:00
Georgi Gerganov	a04a953cab	codecov : remove (#8004 )	2024-06-19 13:04:36 +03:00
Meng, Hengyu	623494a478	[SYCL] refactor (#6408 ) * seperate lower precision GEMM from the main files * fix workgroup size hardcode	2024-06-19 09:11:51 +08:00
jaime-m-p	37bef89433	tokenizer : BPE fixes (#7530 ) * Random test: add_bos_token, add_eos_token * Random test: add BPE models for testing * Custom regex split fails with codepoint 0 * Fix falcon punctuation regex * Refactor llm_tokenizer_bpe: move code to constructor * Move 'add_special_bos/eos' logic to llm_tokenizer_bpe * Move tokenizer flags to vocab structure. * Default values for special_add_bos/eos * Build vocab.special_tokens_cache using vocab token types * Generalize 'jina-v2' per token attributes * Fix unicode whitespaces (deepseek-coder, deepseek-llm) * Skip missing byte tokens (falcon) * Better unicode data generation * Replace char32_t with uint32_t	2024-06-18 18:40:52 +02:00
Sigbjørn Skjæret	91c188d6c2	Only use FIM middle token if it exists (#7648 ) * Only use FIM middle if it exists * Only use FIM middle if it exists	2024-06-18 22:19:45 +10:00
jojorne	84f6de17f6	Fix no gcc pragma on Windows (#7751 )	2024-06-18 22:18:32 +10:00
Ulrich Drepper	61665277af	Allow compiling with CUDA without CUDA runtime installed (#7989 ) On hosts which are not prepared/dedicated to execute code using CUDA it is still possible to compile llama.cpp with CUDA support by just installing the development packages. Missing are the runtime libraries like /usr/lib64/libcuda.so* and currently the link step will fail. The development environment is prepared for such situations. There are stub libraries for all the CUDA libraries available in the $(CUDA_PATH)/lib64/stubs directory. Adding this directory to the end of the search path will not change anything for environments which currently work fine but will enable compiling llama.cpp also in case the runtime code is not available.	2024-06-18 14:00:14 +02:00
Frank Mai	b96f9afb0d	chore: clean useless beam search param (#7985 ) Signed-off-by: thxCode <thxcode0824@gmail.com>	2024-06-18 10:11:40 +03:00

3976 Commits