llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-24 10:24:35 +00:00

agray3 13dca2a54a Vectorize load instructions in dmmv f16 CUDA kernel (#9816)		2024-10-14 02:49:08 +02:00

* Vectorize load instructions in dmmv f16 CUDA kernel
  Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
* addressed comment
* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
cmake	llama : reorganize source code + improve CMake (#8006)	2024-06-26 18:33:02 +03:00
include	rpc : add backend registry / device interfaces (#9812)	2024-10-10 20:14:55 +02:00
src	Vectorize load instructions in dmmv f16 CUDA kernel (#9816)	2024-10-14 02:49:08 +02:00
.gitignore	vulkan : cmake integration (#8119)	2024-07-13 18:12:39 +02:00
CMakeLists.txt	cmake : do not hide GGML options + rename option (#9465)	2024-09-16 10:27:50 +03:00