llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-28 12:24:35 +00:00

Author	SHA1	Message	Date
Francis Couture-Harpin	cb6d9962c4	Merge branch 'master' into compilade/bitnet-ternary	2024-08-22 16:42:24 -04:00
Akarshan Biswas	11b84eb457	[SYCL] Add a space to suppress a cmake warning (#9133 )	2024-08-22 22:09:47 +08:00
luoyu-intel	1731d4238f	[SYCL] Add oneDNN primitive support (#9091 ) * add onednn * add sycl_f16 * add dnnl stream * add engine map * use dnnl for intel only * use fp16fp16fp16 * update doc	2024-08-22 12:50:10 +08:00
compilade	a1631e53f6	llama : simplify Mamba with advanced batch splits (#8526 ) * llama : advanced batch splits This includes equal-sequence-length batch splits which are useful to simplify recurrent model operators. * llama : always make recurrent state slots contiguous * ggml : simplify mamba operators * llama : fix integer signedness mixing * llama : logits_all has priority over batch->logits Otherwise, the server embeddings tests failed. This was likely an existing problem but was only detected here because of an additional assertion. * llama : apply suggestions Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * llama : fix t5 segfault * llama : fix Mamba session save and restore * llama : minor cosmetic changes * llama : rename llama_reorder_outputs to llama_output_reorder Also move it closer to llama_output_reserve. * llama : fix pooled embeddings when using batches with equal_seqs * minor : add struct members for clarity ggml-ci * llama : fix T5 segfault again * llama : fix Mamba pooled embeddings with multiple sequences Until the pooled embeddings are refactored to allow splitting across ubatches for causal embeddings, recurrent models can only process a single sequence per ubatch when calculating pooled embeddings. * llama : add llama_model_is_recurrent to simplify figuring that out This will make it easier to more cleanly support RWKV-v6 and Mamba-2. * llama : fix simple splits when the batch contains embeddings --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-21 17:58:11 -04:00
Changyeon Kim	2f3c1466ff	llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. (#8984 ) * llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. - The CLIP model now prioritizes the Vulkan backend over the CPU when vulkan available. - A GGML_OP_ACC shader has been added. - The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU. Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com> * fix-up coding style. Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com> * Fix-up the missing initial parameter to resolve the compilation warning. Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com> * [fix] Add missing parameters. Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com> * [fix] Use nb1 and nb2 for dst. Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com> * Fix check results ggml_acc call --------- Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com> Co-authored-by: 0cc4m <picard12@live.de>	2024-08-20 21:00:00 +02:00
Meng, Hengyu	50addec9a5	[SYCL] fallback mmvq (#9088 ) * fallback mmvq to mul_mat * mmvq in cuda path * Update ggml/src/ggml-sycl.cpp Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com> --------- Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>	2024-08-20 23:50:17 +08:00
zhentaoyu	4f8d19ff17	[SYCL] Fix SYCL `im2col` and `convert` Overflow with Large Dims (#9052 ) * sycl: fix im2col overflow and sync with cuda Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: fix convert overflow Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: fix convert and dequantize Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: fix ib in dmmv Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl:refine convert Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: move downsample global_range into common Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * test: add im2col and convert test cases Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * test: make new cases only in sycl Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * test: comment new test_cases for only local testing Signed-off-by: zhentaoyu <zhentao.yu@intel.com> --------- Signed-off-by: zhentaoyu <zhentao.yu@intel.com>	2024-08-20 23:06:51 +08:00
Radoslav Gerganov	1b6ff90ff8	rpc : print error message when failed to connect endpoint (#9042 )	2024-08-19 10:11:45 +03:00
Radoslav Gerganov	18eaf29f4c	rpc : prevent crashes on invalid input (#9040 ) Add more checks which prevent RPC server from crashing if invalid input is received from client	2024-08-19 10:10:21 +03:00
Nico Bosshard	e3f6fd56b1	ggml : dynamic ggml_sched_max_splits based on graph_size (#9047 ) * ggml : Dynamic ggml_sched_max_splits based on graph_size * Fixed and readded debug code for causes	2024-08-16 04:22:55 +02:00
0cc4m	5fd89a70ea	Vulkan Optimizations and Fixes (#8959 ) * Optimize Vulkan REPEAT performance * Use Vulkan GLSL fused multiply-add instruction where possible * Add GGML_VULKAN_PERF option to output performance data per operator * Rework and fix Vulkan descriptor set and descriptor pool handling * Fix float32 concat f16 shader validation error * Add Vulkan GROUP_NORM eps parameter * Fix validation error with transfer queue memory barrier flags * Remove trailing whitespaces	2024-08-14 18:32:53 +02:00
Georgi Gerganov	43bdd3ce18	cmake : remove unused option GGML_CURL (#9011 )	2024-08-14 09:14:49 +03:00
Francis Couture-Harpin	35cc5567c8	ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support	2024-08-13 18:00:06 -04:00
Francis Couture-Harpin	82b240406d	Merge branch 'master' into compilade/bitnet-ternary	2024-08-13 17:36:09 -04:00
Francis Couture-Harpin	69f772682e	ggml-quants : allow using ARM dot product instructions for TQ1_0	2024-08-13 17:21:19 -04:00
Daniel Bevenius	06943a69f6	ggml : move rope type enum to ggml.h (#8949 ) * ggml : move rope type enum to ggml.h This commit moves the `llama_rope_type` enum from `llama.h` to `ggml.h` and changes its name to `ggml_rope_type`. The motivation for this change is to address the TODO in `llama.h` and use the enum in ggml. Note: This commit does not change the `mode` parameter to be of type `enum ggml_rope_type`. The name `mode` and its usage suggest that it might be more generic and possibly used as a bit field for multiple flags. Further investigation/discussion may be needed to determine if `mode` should be restricted to RoPE types. * squash! ggml : move rope type enum to ggml.h This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from ggml.h, and back the llama_rope_type enum. I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is safe to remove it yet. * squash! ggml : move rope type enum to ggml.h This commit removes the enum ggml_rope_type from ggml.h and replaces it with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has been updated to reflect this change. * squash! ggml : move rope type enum to ggml.h This commit contains a suggestion enable the GGML_ROPE_TYPE_NEOX macro/define to be passed to the shader compiler. * squash! ggml : move rope type enum to ggml.h This commit fixes the editorconfig-checker warnings. * squash! ggml : move rope type enum to ggml.h Update comment for ggml_rope function. * Revert "squash! ggml : move rope type enum to ggml.h" This reverts commit `6261222bd0`. * squash! ggml : move rope type enum to ggml.h Add GGML_ROPE_TYPE_NEOX to rope_common.comp. * remove extra line --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-08-13 21:13:15 +02:00
DavidKorczynski	df5478fbea	ggml: fix div-by-zero (#9003 ) Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724 In order to access the above bug you need to login using one of the emails in https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5 Signed-off-by: David Korczynski <david@adalogics.com>	2024-08-12 14:21:41 +02:00
Francis Couture-Harpin	3a0bf17d57	gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0 * ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0 This does not change anything for ternary models, since their values should never end up being in halfway cases anyway.	2024-08-12 00:06:48 -04:00
Francis Couture-Harpin	d911cd1f13	Merge branch 'master' into compilade/bitnet-ternary	2024-08-11 15:52:29 -04:00
Markus Tavenrath	7c5bfd57f8	Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (#8943 ) * Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. - Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove. - ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader read/writes and transfers seems to be sufficient looking at the code which either launches compute kernels or copies tensors. * Fix small typo --------- Co-authored-by: 0cc4m <picard12@live.de>	2024-08-11 10:09:09 +02:00
slaren	6e02327e8b	metal : fix uninitialized abort_callback (#8968 )	2024-08-10 15:42:10 +02:00
Georgi Gerganov	b72942fac9	Merge commit from fork	2024-08-09 23:03:21 +03:00
Matt Stephenson	70c0ea3560	whisper : use vulkan as gpu backend when available (whisper/2302) * ggml: use vulkan as gpu backend when available Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com> * whisper: enable using vk as default buffer type Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com> --------- Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>	2024-08-09 10:03:44 +03:00
Borislav Stanimirov	f93d49ab1e	ggml : ignore more msvc warnings (ggml/906)	2024-08-08 13:19:31 +03:00
Georgi Gerganov	5b33ea1ee7	metal : fix struct name (ggml/912) ggml-ci	2024-08-08 13:19:31 +03:00
Conrad Kramer	85fca8deb6	metal : add abort callback (ggml/905)	2024-08-08 13:19:30 +03:00
Francis Couture-Harpin	96b3d411e0	ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot Not yet tested on hardware which supports it, might not work or might not even compile. But also it might. It should make the performance better on recent ARM CPUs. * ggml-quants : remove comment about possible format change of TQ2_0 Making it slightly more convenient for AVX512 but less convenient for everything else is not worth the trouble.	2024-08-07 15:08:41 -04:00
slaren	be55695eff	ggml-backend : fix async copy from CPU (#8897 ) * ggml-backend : fix async copy from CPU * cuda : more reliable async copy, fix stream used when the devices are the same	2024-08-07 13:29:02 +02:00
Ouadie EL FAROUKI	0478174d59	[SYCL] Updated SYCL device filtering (#8901 ) * Updated device filter to depend on default_selector (fixes non-intel device issues) * Small related update to example/sycl Readme	2024-08-07 11:25:36 +01:00
Johannes Gäßler	a8dbc6f753	CUDA/HIP: fix tests/test-backend-ops (#8896 )	2024-08-07 09:07:52 +02:00
Johannes Gäßler	641f5dd2a6	CUDA: fix padding logic for FP16/FP32 (#8884 )	2024-08-06 17:13:55 +02:00
Jaeden Amero	db20f50cf4	cmake : Link vulkan-shaders-gen with pthreads (#8835 ) When using CMake to build with Vulkan support, compiling vulkan-shaders-gen fails due to missing a CMakeLists.txt specification to link vulkan-shaders-gen with the threading library, resulting in the following error. [5/172] Linking CXX executable bin/vulkan-shaders-gen FAILED: bin/vulkan-shaders-gen : && /usr/bin/c++ ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o -o bin/vulkan-shaders-gen && : ld: error: undefined symbol: pthread_create >>> referenced by vulkan-shaders-gen.cpp >>> ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o:(std::__1::__libcpp_thread_create[abi:se180100](pthread*, >>> void ()(void), void*)) c++: error: linker command failed with exit code 1 (use -v to see invocation) [6/172] Generating build details from Git -- Found Git: /usr/local/bin/git (found version "2.45.2") ninja: build stopped: subcommand failed. Add the CMakeLists.txt specification to link vulkan-shaders-gen with the threading library and fix the above error. Fixes #8834	2024-08-06 15:21:47 +02:00
MaggotHATE	efda90c93a	[Vulkan] Fix compilation of `vulkan-shaders-gen` on w64devkit after `e31a4f6` (#8880 ) * Fix compilation issue in `vulkan-shaders-gen` `e31a4f6797` broke compilation on w64devkit. Including `algorithm` seems to fix that. * Guard it under `#ifdef _WIN32`	2024-08-06 13:32:03 +02:00
Molly Sophia	2d5dd7bb3f	ggml : add epsilon as a parameter for group_norm (#8818 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-08-06 10:26:46 +03:00
Mengqing Cao	c21a896405	[CANN]: Fix ggml_backend_cann_buffer_get_tensor (#8871 ) * cann: fix ggml_backend_cann_buffer_get_tensor 1. fix data ptr offset 2. enable the acquisition of incomplete tensors * fix backend cann set_tensor	2024-08-06 12:42:42 +08:00
wangshuai09	bc0f887e15	cann: fix buffer_num and slow runtime speed error (#8865 )	2024-08-05 21:10:37 +08:00
Justine Tunney	b9dfc25ca3	ggml : fix overflows in elu function (#8866 ) It's helpful to use expm1f(x), because expf(x)-1 will result in overflow for 25% of single-precision floating point numbers.	2024-08-05 15:43:40 +03:00
stduhpf	e31a4f6797	cmake: fix paths for vulkan shaders compilation on Windows (#8573 ) * Vulkan-shaders: attempt to fix compilation on Windows * fix mismatched parenthesis	2024-08-05 08:18:27 +02:00
0cc4m	064cdc265f	vulkan : fix Quantized Mat-Vec Mul on AMD GPUs for ncols < 64 (#8855 ) * Fix Vulkan mul mat vec invalid results when ncols < warp size * Only run backend ops mul mat vec block size test if block size not already covered	2024-08-05 08:52:55 +03:00
0cc4m	a3738b2fa7	vulkan : implement Stable Diffusion operators (ggml/904) * Fix Vulkan repeat op * Implement Vulkan concat op * Delete old Vulkan shader generator * Implement Vulkan im2col op * Implement Vulkan unary gelu_quick op * Implement Vulkan group_norm op * Implement Vulkan timestep_embedding op * Implement Vulkan upscale op * Fix Vulkan vk_context tensor extra index issue * Fix Vulkan matmul shader parameter bug * Properly fix Vulkan matmul shader parameter bug * Add Vulkan ADD f16 + f32 -> f16 operator support * Implement Vulkan tanh op * Fix Vulkan group count too large Validation error on non-Nvidia GPUs * Throw error when too much memory is requested * Fix another Vulkan group count too large Validation error on non-Nvidia GPUs * Fix matmul MMQ condition * Implement Vulkan pad op * Fix Vulkan crash when tensor is used multiple times in a compute graph * Add Vulkan CONCAT f16 + f16 -> f16 op * Add Vulkan LEAKY_RELU op	2024-08-05 08:50:57 +03:00
Daniel Bevenius	655858ace0	ggml : move c parameter comment to ggml_rope_ext (ggml/901) This commit moves the comment for the c parameter from ggml_rope to ggml_rope_ext. The comment is currently incorrect as ggml_rope does not have a c parameter (freq_factors tensor). Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-08-05 08:50:57 +03:00
wangshuai09	c02b0a8a4d	cann: support q4_0 model (#8822 )	2024-08-05 12:22:30 +08:00
Francis Couture-Harpin	f034aa1bb1	ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency	2024-08-03 16:22:04 -04:00
jdomke	76614f352e	ggml : reading the runtime sve config of the cpu (#8709 ) * ggml : reading the runtime sve config of the cpu * change to one time init to prevent performance drop * prefix variable to avoid possible conflicts * revert xxhash fix and add brackets --------- Co-authored-by: domke <673751-domke@users.noreply.gitlab.com>	2024-08-03 18:34:41 +02:00
Francis Couture-Harpin	04eec58112	ggml : remove q1_3 and q2_2 * llama : remove the separate scale tensors of BitNet b1.58 They won't be needed, since the remaining ternary quant types have built-in scales.	2024-08-02 20:16:26 -04:00
Sigbjørn Skjæret	b72c20b85c	Fix conversion of unnormalized BF16->BF16 weights (#7843 ) * add truncate_bf16 * truncate intermediate fp32 if converting bf16 to bf16 * fix masking in __compute_fp32_to_bf16 * np.int16 no longer used * missing cast and additional numpy 2.x fix * ggml-impl : do not flush bf16 subnormals to zero * ggml : add reference fp32 to bf16 conversion The fast version is no longer equivalent for all platforms because of the handling of subnormal values. * gguf-py : remove flush to zero for bf16 subnormals * gguf-py : remove float32 truncation to bf16 Rounding achieves the same thing in the cases where this was used. * missed prototype update in merge * merge cleanup --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net>	2024-08-02 15:11:39 -04:00
Mengqing Cao	e09a800f9a	cann: Fix ggml_cann_im2col for 1D im2col (#8819 ) * fix ggml_cann_im2col for 1D im2col * fix build warning	2024-08-02 16:50:53 +08:00
Ouadie EL FAROUKI	0fbbd88458	[SYCL] Fixing wrong VDR iq4nl value (#8812 )	2024-08-02 08:55:17 +08:00
matteo	afbb4c1322	ggml-cuda: Adding support for unified memory (#8035 ) * Adding support for unified memory * adding again the documentation about unified memory * refactoring: Moved the unified memory code in the correct location. * Fixed compilation error when using hipblas * cleaning up the documentation * Updating the documentation Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * adding one more case where the PR should not be enabled --------- Co-authored-by: matteo serva <matteo.serva@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-08-01 23:28:28 +02:00
Alex O'Connell	b7a08fd5e0	Build: Only include execinfo.h on linux systems that support it (#8783 ) * Only enable backtrace on GLIBC linux systems * fix missing file from copy * use glibc macro instead of defining a custom one	2024-08-01 18:53:46 +02:00

142 Commits