llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-26 11:24:35 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	23fc5c219a	cmake : fix trailing whitespaces	2023-06-19 18:18:34 +03:00
Howard Su	1e3abfcef0	cmake : fix build shared ggml when CUDA is enabled (#1929 ) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-19 18:10:37 +03:00
Johannes Gäßler	16b9cd1939	Convert vector to f16 for dequantize mul mat vec (#1913 ) * Convert vector to f16 for dmmv * compile option * Added compilation option description to README * Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"	2023-06-19 10:23:56 +02:00
Howard Su	57cd69460f	cmake : add CUDA_ARCHITECTURES to new target ggml_static (#1917 )	2023-06-18 07:29:47 +03:00
Kerfuffle	b4c6f46f17	Allow cmake to build ggml as a library (#1896 ) * Allow cmake to build ggml as a library * A ggml_static library will be created * When BUILD_SHARED_LIBS is enabled, ggml_shared will also be built	2023-06-17 01:49:42 -06:00
Zenix	13fe9d2d84	cmake : add auto detection of BLAS_INCLUDE_DIRS (#1886 )	2023-06-16 21:53:04 +03:00
Kawrakow	3d01122610	CUDA : faster k-quant dot kernels (#1862 ) * cuda : faster k-quant dot kernels * Improve Q2_K dot kernel on older GPUs We now have a K_QUANTS_PER_ITERATION macro, which should be set to 1 on older and to 2 on newer GPUs. With this, we preserve the performance of the original PR on RTX-4080, and are faster compared to master on GTX-1660. * Improve Q6_K dot kernel on older GPUs Using the same K_QUANTS_PER_ITERATION macro as last commit, we preserve performance on RTX-4080 and speed up Q6_K on a GTX-1660. * Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile Allowed values are 1 or 2. 2 gives the best performance on modern GPUs and is set as default. On older GPUs 1 may work better. * PR comments --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-06-16 20:08:44 +03:00
Georgi Gerganov	bed9275617	cmake : remove whitespaces	2023-06-15 21:56:50 +03:00
Igor Okulist	3559433fec	cmake : set include path for OpenBlas (#1830 )	2023-06-15 20:51:26 +03:00
Georgi Gerganov	4de0334f5c	cmake : fix Metal build (close #1791 )	2023-06-10 22:56:53 +03:00
Andrei	303f5809f1	metal : fix issue with ggml-metal.metal path. Closes #1769 (#1782 ) * Fix issue with ggml-metal.metal path * Add ggml-metal.metal as a resource for llama target * Update flake.nix metal kernel substitution	2023-06-10 17:47:34 +03:00
johnson442	0035858273	k-quants : add missing compile definition to CMakeLists (#1748 )	2023-06-08 10:02:48 +03:00
Georgi Gerganov	5c64a0952e	k-quants : allow to optionally disable at compile time (#1734 ) * k-quants : put behind optional compile flag LLAMA_K_QUANTS * build : enable k-quants by default	2023-06-07 10:59:52 +03:00
Kawrakow	99009e72f8	ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684 ) * Starting to add k-quantization to ggml I think it is better to have quantization separate from ggml. For now just adding the k-quants there, but it would be better to also factor out the existing ggml quantizations. * Adding Q3_K and Q8_K (de)-quantization * Q3_K now working on CUDA and AVX2/scalar CUDA is not ideal - ~50% slower than Q4_0 for single token prediction, about the same in batch mode (perplexity). CPU single token is ~55 ms (on Ryzen 7950X). * Some improvement for Q3_K on CUDA It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0. * Some more CUDA optimizations for Q3_K Single token is now 20.5 ms/token (~20% slower than Q4_0). Perplexity is on par with Q4_0. * Adding Q4_K - scalar, AVX2, CUDA Performance is the same or perhaps very slightly better than Q4_0 on the CPU. On the GPU, single token prediction is ~10% better than Q4_0, batch mode (perplexity is about the same). * Adding Q6_K - scalar, AVX2, CUDA Performance is ~40% lower compared to Q4_K on the CPU. This is to be expected, considering that we are memory bound on the CPU and the 6-bit model is ~44% larger than the 4-bit. On the GPU, single token prediction is ~6% lower than Q4_0, batch mode (perplexity) is even closer (but still slower). * Adding Q5_K - scalar, AVX2, CUDA Performance is ~20% lower compared to Q4_K on the CPU. This is to be expected, considering that we are memory bound on the CPU and the 5-bit model is ~22% larger than the 4-bit. On the GPU, single token prediction is about the same as Q4_0 for both, single token and batch prediction. * Per convention, all QX_K quantizations use Q5_K for output.weight * Adding quantization mixes * Quantization mixes: didn't quite get what I wanted in the last commit * Q4_K dot product for ARM_NEON * Q6_K dot product for ARM_NEON * Q5_K dot product for ARM_NEON * Adding Q3_K dot for ARM_NEON It is 22% slower than Q4_K, despite the smaller model size. 
On x86_64, where we are memory bound, the Q3_K model is quite a bit faster than Q4_K. * A very slightly faster ARM_NEON Q3_K dot * Adding Q2_K - just CUDA for now Token prediction is pretty good - about 15.5 ms on a RTX 4080. Perplexity is about the same as Q4_K. * Adding scalar and AVX2 Q2_K dot * Adding ARM_NEON Q2_K dot About the same performance as Q4_K. * A slightly faster ARM_NEON Q2_K dot Single token prediction is now ~36 ms on M2 Max. The code is much simpler too. * Fixed bug in Q2_K CUDA dot product kernel Strangely enough, for the few prompts I tried with the 7B model the responses looked perfectly reasonable. Only realized something is not quite right when I tried the larger models and started getting nonsense back. In any case, Q2_K single token evaluation time on an RTX 4080 in a Ryzen7950X box using CUDA and model fully loaded on the GPU are ~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B. The max number of layers that fit in VRAM for the 65B is 32. With that, we get ~330 ms per token, which is not that much faster than just running on the CPU (~470 ms per token). * Don't print zeros/NaNs when no count histogram has been collected * A 10% faster CUDA vector dot kernel for Q3_K Q3_K is now running at ~18.5 ms / token on CUDA, so the gap to Q4_0 is only 10%. It seems memory access pattern is more important for performance than the amount of computation the kernel does. * A slightly faster Q4_K AVX2 dot product For perplexity, where we are less memory bound, time per pass drops by ~5%. Barely measurable difference for single token prediction. * A slightly faster ARM_NEON Q4_K dot product * Minor * Fix quantization error test We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit quantization variants. * Fix docker build I have been sloppy with vector reinterpret casts on ARM_NEON. It seems clang is very forgiving in that regard. 
* Added forgotten ggml.o dependence on k_quants.h to the Makefile * Had unintentionally committed the Makefile with -Ofast enabled * ggml : rename k_quants -> ggml-quants-k, use lowercase in code --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-05 22:56:18 +03:00
Georgi Gerganov	ecb217db4f	llama : Metal inference (#1642 ) * mtl : export the LLaMA computation graph * ci : disable temporary * mtl : adapt the MNIST example as starter * mtl : no need for mtl-export tool, add cli arg for main instead * mtl : export just a small part of the graph for now to make it easier * mtl : move MSL code into separate file for easy editing * mtl : initial get_rows_q4_0 kernel * mtl : confirmed get_rows_q4_0 is working correctly * mtl : add rms_norm kernel + confirm working * mtl : add mul kernel + confirm working * mtl : initial mul_mat Q4 kernel (wrong results) * mtl : mul_mat fixes (still wrong) * mtl : another mul_mat Q4 (still does not work) * mtl : working mul_mat q4 * ggml : fix handling of "view" ops in ggml_graph_import() * mtl : add rope kernel * mtl : add reshape and transpose handling * ggml : store offset as opt arg for ggml_view_xd() operators * mtl : add cpy kernel + handle view ops * mtl : confirm f16 x f32 attention mul mat * mtl : add scale kernel * mtl : add diag_mask_inf kernel * mtl : fix soft_max kernel * ggml : update ggml_nbytes() to handle non-contiguous tensors * mtl : verify V tensor contents * mtl : add f32 -> f32 cpy kernel * mtl : add silu kernel * mtl : add non-broadcast mul kernel * mtl : full GPU inference of the computation graph * mtl : optimize rms_norm and soft_max kernels * mtl : add f16 mat x f32 vec multiplication kernel * mtl : fix bug in f16 x f32 mul mat + speed-up computation * mtl : faster mul_mat_q4_0_f32 kernel * mtl : fix kernel signature + roll inner loop * mtl : more threads for rms_norm + better timing * mtl : remove printfs from inner loop * mtl : simplify implementation * mtl : add save/load vocab to ggml file * mtl : plug Metal inference into llama.cpp (very quick-n-dirty) * mtl : make it work with main example Lots of hacks but at least now it generates text * mtl : preparing for merge * mtl : clean-up ggml mtl interface + support scratch / inplace * mtl : remove temp / debug code * metal : final refactoring and simplification * Revert "ci : disable temporary" This reverts commit `98c267fc77`. * metal : add comments * metal : clean-up stuff, fix typos * readme : add Metal instructions * readme : add example for main	2023-06-04 23:34:30 +03:00
Henri Vasserman	0ecb1bbbeb	[CI] Fix openblas (#1613 ) * Fix OpenBLAS build * Fix `LLAMA_BLAS_VENDOR` CMake variable that should be a string and not a boolean.	2023-05-27 17:24:06 +03:00
Johannes Gäßler	1fcdcc28b1	cuda : performance optimizations (#1530 ) * xor hack * block y dim * loop unrolling * Fixed cmake LLAMA_CUDA_BY option * Removed hipblas compatibility code * Define GGML_CUDA_DMMV_BLOCK_Y if not defined * Fewer iters, more ops per iter * Renamed DMMV X/Y compilation options	2023-05-26 00:07:29 +03:00
0cc4m	2e6cd4b025	OpenCL Token Generation Acceleration (#1459 ) * Move back to C++ for OpenCL * Refactor OpenCL code to work more like the CUDA code, add missing functions * Deduplicate dequant kernels * Add OpenCL compile options * Use compile args for preprocessing constants * Restore default platform + device selection by id behavior --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Henri Vasserman <henv@hot.ee>	2023-05-23 00:33:24 +03:00
Steward Garcia	7e4ea5beff	examples : add server example with REST API (#1443 ) * Added httplib support * Added readme for server example * fixed some bugs * Fix the build error on Macbook * changed json11 to nlohmann-json * removed some whitespaces * remove trailing whitespace * added support custom prompts and more functions * some corrections and added as cmake option	2023-05-21 20:51:18 +03:00
Zenix	b8ee340abe	feature : support blis and other blas implementation (#1536 ) * feature: add blis support * feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927 * fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake * Fix typo in INTEGER Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Fix: blas changes on ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-05-20 17:58:31 +03:00
Georgi Gerganov	ea600071cb	Revert "feature : add blis and other BLAS implementation support (#1502 )" This reverts commit `07e9ace0f9`.	2023-05-20 12:03:48 +03:00
Zenix	07e9ace0f9	feature : add blis and other BLAS implementation support (#1502 ) * feature: add blis support * feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927 * fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake * Fix typo in INTEGER Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-05-20 12:02:48 +03:00
kuvaus	9daff419f6	fix build-info.h for git submodules (#1289 ) * make git build info work with submodules --------- Co-authored-by: Green Sky <green@g-s.xyz>	2023-05-03 02:43:43 +02:00
Marvin Gießing	cc0bb7235c	ggml : fix ppc64le build error and make cmake detect Power processors (#1284 ) * Fix ppc64le build issue * Added support to detect ppc64* processors	2023-05-02 19:42:16 +03:00
DannyDaemonic	f4cef87edf	Add git-based build information for better issue tracking (#1232 ) * Add git-based build information for better issue tracking * macOS fix * "build (hash)" and "CMAKE_SOURCE_DIR" changes * Redo "CMAKE_CURRENT_SOURCE_DIR" and clearer build messages * Fix conditional dependency on missing target * Broke out build-info.cmake, added find_package fallback, and added build info to all examples, added dependencies to Makefile * 4 space indenting for cmake, attempt to clean up my mess in Makefile * Short hash, less fancy Makefile, and don't modify build-info.h if it wouldn't change it	2023-05-01 18:23:47 +02:00
Pavol Rusnak	6f79699286	build: add armv{6,7,8} support to cmake (#1251 ) - flags copied from Makefile - updated comments in both CMakeLists.txt and Makefile to match reality	2023-04-30 20:48:38 +02:00
Georgi Gerganov	305eb5afd5	build : fix reference to old llama_util.h	2023-04-29 13:53:12 +03:00
0cc4m	7296c961d9	ggml : add CLBlast support (#1164 ) * Allow use of OpenCL GPU-based BLAS using ClBlast instead of OpenBLAS for context processing * Improve ClBlast implementation, avoid recreating buffers, remove redundant transfers * Finish merge of ClBlast support * Move CLBlast implementation to separate file Add buffer reuse code (adapted from slaren's cuda implementation) * Add q4_2 and q4_3 CLBlast support, improve code * Double CLBlast speed by disabling OpenBLAS thread workaround Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com> Co-authored-by: slaren <2141330+slaren@users.noreply.github.com> * Fix device selection env variable names * Fix cast in opencl kernels * Add CLBlast to CMakeLists.txt * Replace buffer pool with static buffers a, b, qb, c Fix compile warnings * Fix typos, use GGML_TYPE defines, improve code * Improve btype dequant kernel selection code, add error if type is unsupported * Improve code quality * Move internal stuff out of header * Use internal enums instead of CLBlast enums * Remove leftover C++ includes and defines * Make event use easier to read Co-authored-by: Henri Vasserman <henv@hot.ee> * Use c compiler for opencl files * Simplify code, fix include * First check error, then release event * Make globals static, fix indentation * Rename dequant kernels file to conform with other file names * Fix import cl file name --------- Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com> Co-authored-by: slaren <2141330+slaren@users.noreply.github.com> Co-authored-by: Henri Vasserman <henv@hot.ee> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-28 17:57:16 +03:00
Georgi Gerganov	0e018fe008	ggml : fix Q4_3 cuBLAS	2023-04-22 16:32:07 +03:00
Howard Su	7e312f165c	cmake : fix build under Windows when enable BUILD_SHARED_LIBS (#1100 ) * Fix build under Windows when enable BUILD_SHARED_LIBS * Make AVX512 test on Windows to build the shared libs	2023-04-22 11:18:20 +03:00
源文雨	018f2279f5	cmake : link threads publicly to ggml (#1042 ) * fix: ld link test-tokenizer-0 error ``` cmake3 --build . --config Release [ 5%] Built target ggml [ 16%] Built target llama [ 22%] Linking CXX executable ../bin/test-tokenizer-0 ../libllama.a(ggml.c.o): In function ‘ggml_graph_compute’: ggml.c:(.text+0xf2db): undefined reference to ‘pthread_create’ ggml.c:(.text+0xf9d4): undefined reference to ‘pthread_join’ collect2: error: ld returned 1 exit status gmake[2]: * [bin/test-tokenizer-0] Error 1 gmake[1]: * [tests/CMakeFiles/test-tokenizer-0.dir/all] Error 2 gmake: *** [all] Error 2 ``` * Update CMakeLists.txt * Update CMakeLists.txt * Update CMakeLists.txt	2023-04-21 21:27:06 +03:00
slaren	02d6988121	Improve cuBLAS performance by dequantizing on the GPU (#1065 )	2023-04-20 03:14:14 +02:00
Stephan Walter	f3d4edf504	ggml : Q4 cleanup - remove 4-bit dot product code (#1061 ) * Q4 cleanup * Remove unused AVX512 Q4_0 code	2023-04-19 19:06:37 +03:00
slaren	8944a13296	Add NVIDIA cuBLAS support (#1044 )	2023-04-19 11:22:45 +02:00
Kawrakow	5ecff35151	Adding a simple program to measure speed of dot products (#1041 ) On my Mac, the direct Q4_1 product is marginally slower (~69 vs ~55 us for Q4_0). The SIMD-ified ggml version is now almost 2X slower (~121 us). On a Ryzen 7950X CPU, the direct product for Q4_1 quantization is faster than the AVX2 implementation (~60 vs ~62 us). --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-04-18 19:00:14 +00:00
Ivan Komarov	f266259ad9	Speedup the AVX-512 implementation of ggml_vec_dot_q4_0() (#933 )	2023-04-17 15:10:57 +02:00
katsu560	106faaf297	cmake : add finding the OpenBLAS header file (#992 )	2023-04-15 08:51:11 +03:00
Georgi Gerganov	9190e8eac8	llama : merge llama_internal.h into llama.h Hide it behind an #ifdef	2023-04-13 18:04:45 +03:00
anzz1	585d91a156	cmake : add explicit F16C option (x86) (#576 ) Fixes building for x86 processors missing F16C featureset MSVC not included, as in MSVC F16C is implied with AVX2/AVX512	2023-04-13 15:48:21 +03:00
comex	f963b63afa	Rewrite loading code to try to satisfy everyone: - Support all three formats (ggml, ggmf, ggjt). (However, I didn't include the hack needed to support GPT4All files without conversion. Those can still be used after converting them with convert.py from my other PR.) - Support both mmap and read (mmap is used by default, but can be disabled with `--no-mmap`, and is automatically disabled for pre-ggjt files or on platforms where mmap is not supported). - Support multi-file models like before, but automatically determine the number of parts rather than requiring `--n_parts`. - Improve validation and error checking. - Stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields. This has no immediate benefit, but makes it easier to experiment with different formats, and should make it easier to support the new GPTQ-for-LLaMa models in the future (I have some work in progress on that front). - Support VirtualLock on Windows (using the same `--mlock` option as on Unix). - Indicate loading progress when using mmap + mlock. (Which led me to the interesting observation that on my Linux machine, with a warm file cache, mlock actually takes some time, whereas mmap without mlock starts almost instantly...) - To help implement this, move mlock support from ggml to the loading code. - madvise/PrefetchVirtualMemory support (based on #740) - Switch from ifstream to the `fopen` family of functions to avoid unnecessary copying and, when mmap is enabled, allow reusing the same file descriptor for both metadata reads and mmap (whereas the existing implementation opens the file a second time to mmap). - Quantization now produces a single-file output even with multi-file inputs (not really a feature as much as 'it was easier this way'). Implementation notes: I tried to factor the code into more discrete pieces than before. 
Regarding code style: I tried to follow the code style, but I'm naughty and used a few advanced C++ features repeatedly: - Destructors to make it easier to ensure everything gets cleaned up. - Exceptions. I don't even usually use exceptions when writing C++, and I can remove them if desired... but here they make the loading code much more succinct while still properly handling a variety of errors, ranging from API calls failing to integer overflow and allocation failure. (The exceptions are converted to error codes at the API boundary.) Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)	2023-04-10 01:10:46 +02:00
eiery	f2d1c47294	cmake should link openblas properly with -lopenblas like how it's done in the makefile (#839 )	2023-04-08 11:15:17 +00:00
Stephan Walter	3525899277	Enable -std= for cmake builds, fix warnings (#598 )	2023-03-31 19:19:16 +00:00
Stephan Walter	3bcc129ba8	cmake : properly invoke CTest (#629 )	2023-03-30 20:56:59 +03:00
Georgi Gerganov	d502bc7c9d	tests : free llama context at the end of the test	2023-03-28 19:51:55 +03:00
Stephan Walter	436e561931	all : be more strict about converting float to double (#458 ) * Be more strict about converting float to double * Test equivalence of round, SILU implementations Test module is commented out in CMakeLists.txt because the tests may take a long time, depending on how much the compiler optimizes. * Fix softmax in perplexity.cpp * all : prefer float over double where appropriate * perplexity : add <cmath> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-03-28 19:48:20 +03:00
anzz1	2f7bf7dd7c	CMake / CI additions (#497 ) * CMake: Add AVX512 option * CI: Add AVX/AVX512 builds (Windows) (AVX512 tests can only be run when the worker happens to support it, building works anyway) * CMake: Fix sanitizer linkage ( merged #468 ) * CI: Add sanitizer builds (Ubuntu) * CI: Fix release tagging (change @zendesk/action-create-release to @anzz1/action-create-release until upstream PR Added commitish as input zendesk/action-create-release#32 is merged)	2023-03-25 23:38:11 +02:00
Georgi Gerganov	a316a425d0	Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something !	2023-03-25 20:26:40 +02:00
nusu-github	ad072fc5ad	Generate library with CMake (#430 ) * Generate library with CMake BUILD_SHARED_LIBS to allow llama library to be generated. * Turn ON PIC when BUILD_SHARED_LIBS is ON	2023-03-23 21:16:48 +01:00
Erik Scholz	4122dffff9	cmake: make llama an actual library (#392 )	2023-03-22 18:37:10 +02:00
Georgi Gerganov	f5a77a629b	Introduce C-style API (#370 ) * Major refactoring - introduce C-style API * Clean up * Add <cassert> * Add <iterator> * Add <algorithm> .... * Fix timing reporting and accumulation * Measure eval time only for single-token calls * Change llama_tokenize return meaning	2023-03-22 07:32:36 +02:00

55 Commits