llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-11 21:39:52 +00:00

Author	SHA1	Message	Date
Meng, Hengyu	b864b50ce5	[SYCL] Align GEMM dispatch (#7566 ) * align GEMM dispatch	2024-05-29 07:00:24 +08:00
Masaya, Kato	faa0e6979a	ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (#7433 ) * Add SVE support for q4_0_q8_0 q8_0_q8_0 * remove ifdef	2024-05-25 11:42:31 +03:00
Georgi Gerganov	e84b71c2c6	ggml : drop support for QK_K=64 (#7473 ) * ggml : drop support for QK_K=64 ggml-ci * opencl : restore QK_K=256 define	2024-05-23 10:00:21 +03:00
k.h.lai	fcda1128bc	vulkan: add workaround for iterator boundary check to fix clang-cl debug build (#7426 )	2024-05-22 14:53:21 +02:00
junchao-loongson	65c58207ec	ggml : add loongarch lsx and lasx support (#6454 ) * add loongarch lsx and lasx optimize code * Add loongarch compilation support to makefile * revert stb_image.h * opt bytes_from_nibbles_32 and sum_i16_pairs_float * fix undeclared * format code * update * update 2 --------- Co-authored-by: Jinyang He <hejinyang@loongson.cn>	2024-05-20 10:19:21 +03:00
Srihari-mcw	33c8d50acc	Add provisions for windows support for BF16 code including CMake provision for enabling AVX512_BF16 (#7258 )	2024-05-20 12:18:39 +10:00
slaren	d359f30921	llama : remove MPI backend (#7395 )	2024-05-20 01:17:03 +02:00
Georgi Gerganov	059031b8c4	ci : re-enable sanitizer runs (#7358 ) * Revert "ci : temporary disable sanitizer builds (#6128)" This reverts commit `4f6d1337ca`. * ci : trigger	2024-05-18 18:55:54 +03:00
Engininja2	ef277de2ad	cmake : fix typo in AMDGPU_TARGETS (#7356 )	2024-05-18 02:39:25 +02:00
Gavin Zhao	82ca83db3c	ROCm: use native CMake HIP support (#5966 ) Supercedes #4024 and #4813. CMake's native HIP support has become the recommended way to add HIP code into a project (see [here](https://rocm.docs.amd.com/en/docs-6.0.0/conceptual/cmake-packages.html#using-hip-in-cmake)). This PR makes the following changes: 1. The environment variable `HIPCXX` or CMake option `CMAKE_HIP_COMPILER` should be used to specify the HIP compiler. Notably this shouldn't be `hipcc`, but ROCm's clang, which usually resides in `$ROCM_PATH/llvm/bin/clang`. Previously this was control by `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER`. Note that since native CMake HIP support is not yet available on Windows, on Windows we fall back to the old behavior. 2. CMake option `CMAKE_HIP_ARCHITECTURES` is used to control the GPU architectures to build for. Previously this was controled by `GPU_TARGETS`. 3. Updated the Nix recipe to account for these new changes. 4. The GPU targets to build against in the Nix recipe is now consistent with the supported GPU targets in nixpkgs. 5. Added CI checks for HIP on both Linux and Windows. On Linux, we test both the new and old behavior. The most important part about this PR is the separation of the HIP compiler and the C/C++ compiler. This allows users to choose a different C/C++ compiler if desired, compared to the current situation where when building for ROCm support, everything must be compiled with ROCm's clang. ~~Makefile is unchanged. Please let me know if we want to be consistent on variables' naming because Makefile still uses `GPU_TARGETS` to control architectures to build for, but I feel like setting `CMAKE_HIP_ARCHITECTURES` is a bit awkward when you're calling `make`.~~ Makefile used `GPU_TARGETS` but the README says to use `AMDGPU_TARGETS`. For consistency with CMake, all usage of `GPU_TARGETS` in Makefile has been updated to `AMDGPU_TARGETS`. Thanks to the suggestion of @jin-eld, to maintain backwards compatibility (and not break too many downstream users' builds), if `CMAKE_CXX_COMPILER` ends with `hipcc`, then we still compile using the original behavior and emit a warning that recommends switching to the new HIP support. Similarly, if `AMDGPU_TARGETS` is set but `CMAKE_HIP_ARCHITECTURES` is not, then we forward `AMDGPU_TARGETS` to `CMAKE_HIP_ARCHITECTURES` to ease the transition to the new HIP support. Signed-off-by: Gavin Zhao <git@gzgz.dev>	2024-05-17 17:03:03 +02:00
Max Krasnyansky	13ad16af12	Add support for properly optimized Windows ARM64 builds with LLVM and MSVC (#7191 ) * logging: add proper checks for clang to avoid errors and warnings with VA_ARGS * build: add CMake Presets and toolchian files for Windows ARM64 * matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings * ci: add support for optimized Windows ARM64 builds with MSVC and LLVM * matmul-int8: fixed typos in q8_0_q8_0 matmuls Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * matmul-int8: remove unnecessary casts in q8_0_q8_0 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-16 12:47:36 +10:00
Radoslav Gerganov	5e31828d3e	ggml : add RPC backend (#6829 ) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos	2024-05-14 14:27:19 +03:00
Georgi Gerganov	6f1b63606f	cmake : fix version cmp (#7227 )	2024-05-12 18:30:23 +03:00
slaren	b228aba91a	remove convert-lora-to-ggml.py (#7204 )	2024-05-12 02:29:33 +02:00
Jared Van Bortel	4426e2987b	cmake : fix typo (#7151 )	2024-05-08 19:55:32 -04:00
agray3	bc4bba364f	Introduction of CUDA Graphs to LLama.cpp (#6766 ) * DRAFT: Introduction of CUDA Graphs to LLama.cpp * FIx issues raised in comments * Tidied to now only use CUDA runtime (not mixed with driver calls) * disable for multi-gpu and batch size > 1 * Disable CUDA graphs for old GPU arch and with env var * added missing CUDA_CHECKs * Addressed comments * further addressed comments * limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake * Added more comprehensive graph node checking * With mechanism to fall back if graph capture fails * Revert "With mechanism to fall back if graph capture fails" This reverts commit `eb9f15fb6f`. * Fall back if graph capture fails and address other comments * - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS - rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS - updated Makefile build to enable CUDA graphs - removed graph capture failure checking in ggml_cuda_error using a global variable to track this is not thread safe, but I am also not safistied with checking an error by string if this is necessary to workaround some issues with graph capture with eg. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context - fixed several resource leaks - fixed issue with zero node graphs - changed fixed size arrays to vectors - removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed - removed the check for multiple devices so that it is still possible to use a single device, instead checks for split buffers to disable cuda graphs with -sm row - changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX - code style fixes - things to look into - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes * fix build without cuda graphs * remove outdated comment * replace minimum cc value with a constant --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-05-08 22:55:49 +02:00
William Tambellini	858f6b73f6	Add an option to build without CUDA VMM (#7067 ) Add an option to build ggml cuda without CUDA VMM resolves https://github.com/ggerganov/llama.cpp/issues/6889 https://forums.developer.nvidia.com/t/potential-nvshmem-allocated-memory-performance-issue/275416/4	2024-05-06 20:12:14 +02:00
Georgi Gerganov	dba497e0c1	cmake : restore LLAMA_LLAMAFILE_DEFAULT	2024-04-25 21:37:27 +03:00
Georgi Gerganov	fa0b4ad252	cmake : remove obsolete ANDROID check	2024-04-25 18:59:51 +03:00
Justine Tunney	192090bae4	llamafile : improve sgemm.cpp (#6796 ) * llamafile : improve sgemm.cpp - Re-enable by default - Fix issue described in #6716 - Make code more abstract, elegant, and maintainable - Faster handling of weirdly shaped `m` an `n` edge cases * Address review comments * Help clang produce fma instructions * Address review comments	2024-04-22 22:00:36 +03:00
Georgi Gerganov	3b8f1ec4b1	llamafile : tmp disable + build sgemm.o when needed (#6716 ) * build : sgemm.o only when needed ggml-ci * llamafile : tmp disable due to MoE bug ggml-ci	2024-04-17 23:58:26 +03:00
Georgi Gerganov	666867b799	ggml : fix llamafile sgemm wdata offsets (#6710 ) ggml-ci	2024-04-16 23:50:22 +03:00
Justine Tunney	8cc91dc63c	ggml : add llamafile sgemm (#6414 ) This change upstreams llamafile's cpu matrix multiplication kernels which improve image and prompt evaluation speed. For starters, Q4_0 and Q8_0 weights should go ~40% faster on CPU. The biggest benefits are with data types like f16 / f32, which process prompts 2x faster thus making them faster than quantized data types for prompt evals. This change also introduces bona fide AVX512 support since tinyBLAS is able to exploit the larger register file. For example, on my CPU llama.cpp llava-cli processes an image prompt at 305 tokens/second, using the Q4_K and Q4_0 types, which has always been faster than if we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With this change, f16 LLaVA performance leap frogs to 464 tokens/second. On Intel Core i9-14900K this change improves F16 prompt perf by 5x. For example, using llama.cpp at HEAD with Mistral 7b f16 to process a 215 token prompt will go 13 tok/sec. This change has fixes making it go 52 tok/sec. It's mostly thanks to my vectorized outer product kernels but also because I added support for correctly counting the number of cores on Alderlake, so the default thread count discounts Intel's new efficiency cores. Only Linux right now can count cores. This work was sponsored by Mozilla who's given permission to change the license of this code from Apache 2.0 to MIT. To read more about what's improved, and how it works, see: https://justine.lol/matmul/	2024-04-16 21:55:30 +03:00
Matt Clayton	8093987090	cmake : add explicit metal version options (#6370 ) * cmake: add explicit metal version options * Update CMakeLists.txt --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-29 09:27:42 +02:00
Jared Van Bortel	32c8486e1f	wpm : portable unicode tolower (#6305 ) Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.	2024-03-26 17:46:21 -04:00
Joseph Stahl	e190f1fca6	nix: make `xcrun` visible in Nix sandbox for precompiling Metal shaders (#6118 ) * Symlink to /usr/bin/xcrun so that `xcrun` binary is usable during build (used for compiling Metal shaders) Fixes https://github.com/ggerganov/llama.cpp/issues/6117 * cmake - copy default.metallib to install directory When metal files are compiled to default.metallib, Cmake needs to add this to the install directory so that it's visible to llama-cpp Also, update package.nix to use absolute path for default.metallib (it's not finding the bundle) * add `precompileMetalShaders` flag (defaults to false) to disable precompilation of metal shader Precompilation requires Xcode to be installed and requires disable sandbox on nix-darwin	2024-03-25 17:51:46 -07:00
slaren	280345968d	cuda : rename build flag to LLAMA_CUDA (#6299 )	2024-03-26 01:16:01 +01:00
slaren	ae1f211ce2	cuda : refactor into multiple files (#6269 )	2024-03-25 13:50:23 +01:00
slaren	2f0e81e053	cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy (#6208 ) * cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy * add LLAMA_CUDA_NO_PEER_COPY to HIP build	2024-03-22 14:05:31 +01:00
Pierrick Hymbert	d01b3c4c32	common: llama_load_model_from_url using --model-url (#6098 ) * common: llama_load_model_from_url with libcurl dependency Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-17 19:12:37 +01:00
Georgi Gerganov	381da2d9f0	metal : build metallib + fix embed path (#6015 ) * metal : build metallib + fix embed path ggml-ci * metal : fix embed build + update library load logic ggml-ci * metal : fix embeded library build ggml-ci * ci : fix iOS builds to use embedded library	2024-03-14 11:55:23 +02:00
slaren	f30ea47a87	llama : add pipeline parallelism support (#6017 ) * llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs ggml-ci * server : add -ub, --ubatch-size parameter * fix server embedding test * llama : fix Mamba inference for pipeline parallelism Tested to work correctly with both `main` and `parallel` examples. * llama : limit max batch size to n_batch * add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism default increase to 4 (from 2) changing this value may improve performance for some systems, but increases memory usage * fix hip build * fix sycl build (disable cpy_tensor_async) * fix hip build * llama : limit n_batch and n_ubatch to n_ctx during context creation * llama : fix norm backend * batched-bench : sync after decode * swiftui : sync after decode * ggml : allow ggml_get_rows to use multiple threads if they are available * check n_ubatch >= n_tokens with non-casual attention * llama : do not limit n_batch to n_ctx with non-casual attn * server : construct batch with size of llama_n_batch * ggml_backend_cpu_graph_compute : fix return value when alloc fails * llama : better n_batch and n_ubatch comment * fix merge * small fix * reduce default n_batch to 2048 --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-13 18:54:21 +01:00
Georgi Gerganov	83796e62bc	llama : refactor unicode stuff (#5992 ) * llama : refactor unicode stuff ggml-ci * unicode : names * make : fix c++ compiler * unicode : names * unicode : straighten tables * zig : fix build * unicode : put nfd normalization behind API ggml-ci * swift : fix build * unicode : add BOM * unicode : add <cstdint> ggml-ci * unicode : pass as cpts as const ref	2024-03-11 17:47:47 +02:00
Gilad S	ecab1c75de	cmake : fix subdir for `LLAMA_METAL_EMBED_LIBRARY` (#5985 )	2024-03-11 10:00:08 +02:00
AidanBeltonS	3814a07392	[SYCL] Add support for SYCL Nvidia target (#5738 ) * Add support for nvidia target in CMake * Update sycl read-me for Nvidia target * Fix errors	2024-03-11 09:13:57 +08:00
Georgi Gerganov	8a3012a4ad	ggml : add ggml-common.h to deduplicate shared code (#5940 ) * ggml : add ggml-common.h to shared code ggml-ci * scripts : update sync scripts * sycl : reuse quantum tables ggml-ci * ggml : minor * ggml : minor * sycl : try to fix build	2024-03-09 12:47:57 +02:00
Radosław Gryta	1289408817	cmake : fix compilation for Android armeabi-v7a (#5702 )	2024-02-25 12:53:11 +02:00
Haoxiang Fei	8dbbd75754	metal : add build system support for embedded metal library (#5604 ) * add build support for embedded metal library * Update Makefile --------- Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-20 11:58:36 +02:00
Georgi Gerganov	d0e3ce51f4	ci : enable -Werror for CUDA builds (#5579 ) * cmake : pass -Werror through -Xcompiler ggml-ci * make, cmake : enable CUDA errors on warnings ggml-ci	2024-02-19 14:45:41 +02:00
Abhilash Majumder	13e2c771aa	cmake : remove obsolete sycl compile flags (#5581 ) * rm unwanted sycl compile options * fix bug * fix bug * format fix	2024-02-19 11:15:18 +02:00
Jared Van Bortel	a0c2dad9d4	build : pass all warning flags to nvcc via -Xcompiler (#5570 ) * build : pass all warning flags to nvcc via -Xcompiler * make : fix apparent mis-merge from #3952 * make : fix incorrect GF_CC_VER for CUDA host compiler	2024-02-18 16:21:52 -05:00
Georgi Gerganov	f3f28c5395	cmake : fix GGML_USE_SYCL typo (#5555 )	2024-02-18 19:17:00 +02:00
Ananta Bastola	6e4e973b26	ci : add an option to fail on compile warning (#3952 ) * feat(ci): add an option to fail on compile warning * Update CMakeLists.txt * minor : fix compile warnings ggml-ci * ggml : fix unreachable code warnings ggml-ci * ci : disable fatal warnings for windows, ios and tvos * ggml : fix strncpy warning * ci : disable fatal warnings for MPI build * ci : add fatal warnings to ggml-ci ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-17 23:03:14 +02:00
Georgi Gerganov	5bf2b94dd4	cmake : fix VULKAN and ROCm builds (#5525 ) * cmake : fix VULKAN and ROCm builds * cmake : fix (cont) * vulkan : fix compile warnings ggml-ci * cmake : fix ggml-ci * cmake : minor ggml-ci	2024-02-16 19:05:56 +02:00
Michael Podvitskiy	8084d55440	cmake : ARM intrinsics detection for MSVC (#5401 )	2024-02-14 10:49:01 +02:00
Michael Podvitskiy	c4fbb6717c	CMAKE_OSX_ARCHITECTURES for MacOS cross compilation (#5393 ) Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-02-07 16:39:23 -05:00
Johannes Gäßler	098f6d737b	make: Use ccache for faster compilation (#5318 ) * make: Use ccache for faster compilation	2024-02-05 19:33:00 +01:00
Welby Seely	277fad30c6	cmake : use set() for LLAMA_WIN_VER (#5298 ) option() is specifically for booleans. Fixes #5158	2024-02-03 23:18:51 -05:00
0cc4m	e920ed393d	Vulkan Intel Fixes, Optimizations and Debugging Flags (#5301 ) * Fix Vulkan on Intel ARC Optimize matmul for Intel ARC Add Vulkan dequant test * Add Vulkan debug and validate flags to Make and CMakeLists.txt * Enable asynchronous transfers in Vulkan backend * Fix flake8 * Disable Vulkan async backend functions for now * Also add Vulkan run tests command to Makefile and CMakeLists.txt	2024-02-03 18:15:00 +01:00
Eve	1cfb5372cf	Fix broken Vulkan Cmake (properly) (#5230 ) * build vulkan as object * vulkan ci	2024-01-31 20:21:55 +01:00

187 Commits