llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-29 04:44:34 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	5c977221d2	iq1_s: slightly faster dot product	2024-02-13 15:18:27 +02:00
Iwan Kawrakow	f604a17994	iq1_s: Tests	2024-02-13 15:11:23 +02:00
Iwan Kawrakow	425c6bbb6c	iq1_s: Metal works, but quite slow As usual, Apple Silicon does not like the code I write.	2024-02-13 14:37:16 +02:00
Iwan Kawrakow	020b548ec3	iq1_s: Metal basics Dequantize works, but not dot product	2024-02-13 14:16:30 +02:00
Iwan Kawrakow	4be44b7c33	iq1_s: use IQ2_XXS for attn_output At a cost of 0.04 extra bpw this gives a big improvement in PPL.	2024-02-13 13:17:48 +02:00
Iwan Kawrakow	307c5f617a	iq1_s: better grid	2024-02-13 13:17:48 +02:00
Iwan Kawrakow	773014926f	iq1_s: ARM_NEON dot product. Works, but not very fast	2024-02-13 13:17:48 +02:00
Iwan Kawrakow	2ffb05acc8	iq1_s: AVX2 finally works	2024-02-13 13:17:48 +02:00
Iwan Kawrakow	67e7c4238e	Fix after merge with latest master	2024-02-13 13:17:48 +02:00
Iwan Kawrakow	dc0b14bebb	Fix shadow warnings	2024-02-13 13:17:48 +02:00
Iwan Kawrakow	5574533a72	Fix tests	2024-02-13 13:17:48 +02:00
Iwan Kawrakow	592b3b26bb	iq1_s: WIP AVX2 dot product - something is not right	2024-02-13 13:17:48 +02:00
Iwan Kawrakow	d94139bf27	iq1_s: scalar CPU dot product	2024-02-13 13:17:48 +02:00
Iwan Kawrakow	a9d48e9718	iq1_s: CUDA is working	2024-02-13 13:17:46 +02:00
Iwan Kawrakow	80cd5bae99	iq1_s: WIP basics	2024-02-13 13:16:52 +02:00
Georgi Gerganov	49cc1f7d67	bert : add tests + fix quantization (#5475 ) * llama : do not quantize pos embd and token type tensors * ci : add BERT tests ggml-ci * ci : do not do BERT tests on low-perf nodes ggml-ci	2024-02-13 13:01:29 +02:00
Georgi Gerganov	99b8b43d7b	tests : disable moe test (#5473 )	2024-02-13 11:20:24 +02:00
Kawrakow	895407f31b	ggml-quants : fix compiler warnings (shadow variable) (#5472 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-13 09:07:57 +02:00
Georgi Gerganov	099afc6274	llama : fix quantization when tensors are missing (#5423 )	2024-02-12 20:14:39 +02:00
Georgi Gerganov	df334a1125	swift : package no longer use ggml dependency (#5465 ) * Revert "swift : update Package.swift to use ggml as dependency (#4691)" This reverts commit `ece9a45e8f`. * spm : add ggml headers	2024-02-12 19:54:29 +02:00
Lee	dbd8828eb0	py : fix persimmon `n_rot` conversion (#5460 ) * convert : fix persimmon official weight conversion to write correct n_rot. * Update convert-persimmon-to-gguf.py --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-12 19:29:57 +02:00
Abhilash Majumder	43fe07c1a4	ggml-sycl: Replace 3d ops with macro (#5458 ) * use macro * use macro * fix format	2024-02-12 20:22:05 +05:30
Daniel Bevenius	4a46d2b792	llava : remove prog parameter from ArgumentParser (#5457 ) * llava: remove prog parameter from ArgumentParser This commit removes the `prog` parameter from `ArgumentParser` so that it uses the default value which is the name of the script. The motivation for this change is that currently the usage output looks like this: ```console $ python examples/llava/convert-image-encoder-to-gguf.py --help usage: convert_hf_to_gguf.py [-h] ... ``` And with this change it will look like this: ```console $ python examples/llava/convert-image-encoder-to-gguf.py --help usage: convert-image-encoder-to-gguf.py [-h] ... ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * ci: add W503 to flake8 ignore list This commit adds W503 to the ignore list for flake8. This is done to avoid the following error: W503 line break before binary operator Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-12 10:38:44 +02:00
Georgi Gerganov	3b169441df	sync : ggml (#5452 ) * ggml-alloc : v3 (ggml/727) * ggml-alloc v3 ggml-ci * fix ci ggml-ci * whisper : check for backend buffer allocation failures * whisper : avoid leaks when initialization fails * cleanup ggml-ci * style fixes ggml-ci * sync : ggml * update llama.cpp, clip.cpp, export-lora.cpp * update finetune.cpp, train-text-from-scratch.cpp ggml-ci * ggml-backend : reduce alignment to 32 to match gguf and fix mmap --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-12 09:16:06 +02:00
Johannes Gäßler	3bdc4cd0f5	CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434 ) * CUDA: mul_mat_vec_q tiling, refactor mul mat logic Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-11 19:08:39 +01:00
Douglas Hanley	2891c8aa9a	Add support for BERT embedding models (#5423 ) * BERT model graph construction (build_bert) * WordPiece tokenizer (llm_tokenize_wpm) * Add flag for non-causal attention models * Allow for models that only output embeddings * Support conversion of BERT models to GGUF * Based on prior work by @xyzhang626 and @skeskinen --------- Co-authored-by: Jared Van Bortel <jared@nomic.ai> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-11 11:21:38 -05:00
github-actions[bot]	97a336507e	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31) → 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)	2024-02-11 07:50:41 -08:00
Sergio López	c88c74f967	vulkan: only use M-sized matmul on Apple GPUs (#5412 ) * vulkan: refactor guess_matmul_pipeline for vendor Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor conditionals. Signed-off-by: Sergio Lopez <slp@redhat.com> * vulkan: only use M-sized matmul on Apple GPUs L-sized and S-sized matmuls are broken on Apple GPUs, force using M-size with this vendor. Signed-off-by: Sergio Lopez <slp@redhat.com> --------- Signed-off-by: Sergio Lopez <slp@redhat.com>	2024-02-11 15:12:00 +01:00
Alexey Parfenov	a803333a4e	common : use enums for sampler types (#5418 ) * common: use enums for sampler types * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * minor : spaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-11 15:43:31 +02:00
Alexey Parfenov	684780141a	server : allow to specify tokens as strings in logit_bias (#5003 ) * server: allow to specify tokens as strings in logit_bias * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-11 15:38:14 +02:00
Georgi Gerganov	85910c5b30	main : ctrl+C print timing in non-interactive mode (#3873 )	2024-02-11 15:35:50 +02:00
Georgi Gerganov	139b62a839	common : fix compile warning	2024-02-11 15:33:43 +02:00
Georgi Gerganov	0f2411f154	ggml : fix compile warnings (unused vars) (#4966 )	2024-02-11 15:33:01 +02:00
snadampal	a07d0fee1f	ggml : add mmla kernels for quantized GEMM (#4966 ) * ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm armv8.2-a and above support MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q8_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted in up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm armv8.2-a and above support MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted in up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm armv8.2-a and above support MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_1_q8_1 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted in up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: update unit tests for the new vec_dot interface * llama.cpp: add MATMUL_INT8 capability to system_info	2024-02-11 15:22:33 +02:00
Johannes Gäßler	e4640d8fdf	lookup: add print for drafting performance (#5450 )	2024-02-11 12:44:51 +01:00
Xuan Son Nguyen	907e08c110	server : add llama2 chat template (#5425 ) * server: add mistral chat template * server: fix typo * server: rename template mistral to llama2 * server: format_llama2: remove BOS * server: validate "--chat-template" argument * server: clean up using_chatml variable Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-02-11 12:16:22 +02:00
Ian Bull	f026f8120f	metal : use autoreleasepool to avoid memory leaks (#5437 ) There appears to be a known memory leak when using the `MTLCommandBuffer`. It is suggested to use `@autoreleasepool` in [1,2] [1] https://developer.apple.com/forums/thread/662721 [2] https://forums.developer.apple.com/forums/thread/120931 This change-set wraps the `ggml_metal_graph_compute` in a `@autoreleasepool`. This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436	2024-02-10 12:53:28 +02:00
Georgi Gerganov	cd9aea63b5	scripts : update sync scripts with new backends	2024-02-10 09:53:05 +02:00
Georgi Gerganov	43b65f5eb8	sync : ggml	2024-02-10 09:30:36 +02:00
Michael Podvitskiy	4633d93af0	ggml : add abort_callback for cpu backend (ggml/725) * a way to use abort_callback with the cpu backend * whisper update	2024-02-10 09:29:21 +02:00
Neuman Vong	4b7b38bef5	vulkan: Set limit for task concurrency (#5427 ) A common default for the maximum number of open files is 256, which can lead to `asyncio.gather(tasks)` failing with Too many open files. $ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc ggml_vulkan: Generating and compiling shaders to SPIR-V Traceback (most recent call last): File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module> asyncio.run(main()) File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete return future.result() File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main await asyncio.gather(tasks) [...snip...] OSError: [Errno 24] Too many open files This change sets a reasonable concurrency limit for tasks (and therefore open files), without significant impact on run time.	2024-02-09 19:30:19 +01:00
Daniel Bevenius	e00d2a62dd	llava : add requirements.txt and update README.md (#5428 ) * llava: add requirements.txt and update README.md This commit adds a `requirements.txt` file to the `examples/llava` directory. This file contains the required Python packages to run the scripts in the `examples/llava` directory. The motivation for this is to make it easier for users to run the scripts in `examples/llava`. This spares users from running into missing-package errors when the packages are not installed on their system. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * llava: fix typo in llava-surgery.py output Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-09 15:00:59 +02:00
Riley Stewart	7c777fcd5d	server : fix prompt caching for repeated prompts (#5420 )	2024-02-09 12:49:49 +02:00
Paul Tsochantaris	e5ca3937c6	llama : do not cap thread count when MoE on CPU (#5419 ) * Not capping thread count when MoE inference is running on CPU * Whitespace	2024-02-09 12:48:06 +02:00
Marko Tasic	e4124c2477	readme : add JavaScript/Wasm repo (#5415 )	2024-02-09 12:17:00 +02:00
Michael Podvitskiy	b2f87cb64d	ggml : fix `error C2078: too many initializers` for MSVC ARM64 (#5404 )	2024-02-09 11:56:43 +02:00
0cc4m	44fbe34360	Fix Vulkan crash on APUs with very little device memory (#5424 ) * Fix Vulkan crash on APUs with very little device memory * Fix debug output function names	2024-02-09 06:52:33 +01:00
Johannes Gäßler	8e6a9d2de0	CUDA: more warps for mmvq on NVIDIA (#5394 )	2024-02-08 21:56:40 +01:00
slaren	41f308f58e	llama : do not print "offloading layers" message in CPU-only builds (#5416 )	2024-02-08 21:33:03 +01:00
Abhilash Majumder	6e99f2a04f	Fix f16_sycl cpy call from Arc (#5411 ) * fix f16_sycl cpy call * rm old logic * add fp16 build CI * use macro * format fix	2024-02-08 22:39:10 +05:30
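
Commit 4b7b38bef5 above describes capping task concurrency so that `asyncio.gather` does not exhaust the open-file limit. The log does not show how the cap was implemented, so the following is only a minimal sketch of one common approach, assuming an `asyncio.Semaphore`. The names `run_limited`, `fake_compile`, and `MAX_CONCURRENT` are hypothetical and do not come from `ggml_vk_generate_shaders.py`.

```python
import asyncio

# Hypothetical cap; the value used by the actual commit is not stated in the log.
MAX_CONCURRENT = 4


async def run_limited(jobs, limit):
    """Run zero-argument coroutine factories with at most `limit` active at once."""
    semaphore = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def guarded(job):
        nonlocal running, peak
        # Only `limit` coroutines can be inside this block at the same time.
        async with semaphore:
            running += 1
            peak = max(peak, running)
            try:
                return await job()
            finally:
                running -= 1

    # gather preserves the order of its arguments in the result list.
    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_compile(index):
    """Stand-in for a task that would hold a file or subprocess open."""
    await asyncio.sleep(0.01)
    return index * 2


def main():
    jobs = [lambda i=i: fake_compile(i) for i in range(20)]
    return asyncio.run(run_limited(jobs, MAX_CONCURRENT))
```

Calling `main()` returns the results in submission order together with the highest number of jobs that were active at once, which should never exceed `MAX_CONCURRENT`.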

2152 Commits