llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-14 06:49:54 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	f048af0230	ggml : sync alibi fix from ggml repo	2023-05-13 11:54:33 +03:00
3ooabkhxtn	ac0cd259d5	Adding SSE instructions to ggml_vec_dot_q4_0_q8_0 (#1413 )	2023-05-13 08:43:33 +00:00
Georgi Gerganov	b9fd7eee57	ggml : remove bit shuffling (#1405 ) * ggml : remove Q4_0 bit shuffling (ARM NEON) * ggml : remove Q4_1 bit shuffling (ARM NEON + reference) * ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON) * ggml : remove Q4_2 bit shuffling (WIP, BROKEN) * ggml : remove Q5_0 bit shuffling (ARM NEON) * ggml : 2x faster scalar implementations * ggml : remove Q5_1 bit shuffling (ARM NEON + scalar) * ggml : simplify scalar dot * ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit * ggml : fix Q4_1 quantization * ggml : update cuBLAS + normalize variable names * ggml : remove Q4_2 mode * ggml : minor formatting * ggml : fix Q5_0 quantization * scripts : add script for measuring the time per token * AVX implementations (#1370) * ggml : uniform 5th bit extraction * llama : produce error upon loading old model files * llama : fix model magic/version write * ggml : speed-up Q5_0 + Q5_1 at 4 threads * ggml : preserve old Q4 and Q5 formats * ggml : simplify Q8_1 - no need for low / high sums anymore * ggml : fix Q8_0 and Q8_1 rounding * Revert "AVX implementations (#1370)" This reverts commit `948d124837`. * ggml : fix AVX2 implementation * sha : update hashes for 7B and 13B * readme : update timings + remove warning banner * llama : update v2 PR number to 1405 * ggml : fix WASM comments * ggml : back to original bit order * readme : add note that Q4 and Q5 have been changed * llama : fix return for unknown version --------- Co-authored-by: Stephan Walter <stephan@walter.name>	2023-05-12 00:23:08 +03:00
Sami Farin	9f8dbc4787	use pause asm insn in busyloop to run the CPU (13600K) 10 °C cooler (#1314 ) * use pause asm insn in busyloop to run the CPU (13600K) 10 °C cooler Tested with a 13B model. * use _mm_pause() in busyloop * use _mm_pause() in busyloop on x86_64 to reduce power consumption	2023-05-09 14:29:20 +02:00
swittk	1b0fd45465	ggml : Allow usage of CLBlast alongside Accelerate.framework (#1336 ) Minor edit in ggml.c which originally would prevent OpenCL from loading completely if GGML_USE_ACCELERATE was defined. Minor speedup in prompt eval time.	2023-05-06 23:03:23 -04:00
Ron Jailall	20fbf2a2a0	ggml : change immintrin.h to intrin.h for compatibility (#1307 ) * change immintrin.h to intrin.h for compatibility Building on Windows 11 ARM throws an error on this line. Seems like using intrin.h covers x86 and ARM * conditional def of intrin.h * fix typo in ggml.c	2023-05-04 18:05:59 +03:00
Georgi Gerganov	799fdc1b5d	ggml : vectorize Q8_0 quantization https://github.com/ggerganov/ggml/pull/127#issuecomment-1533648531	2023-05-03 23:24:20 +03:00
Georgi Gerganov	5d5817ca60	ggml : fix 32-bit ARM	2023-05-02 22:14:50 +03:00
Marvin Gießing	cc0bb7235c	ggml : fix ppc64le build error and make cmake detect Power processors (#1284 ) * Fix ppc64le build issue * Added support to detect ppc64* processors	2023-05-02 19:42:16 +03:00
slaren	2d099e5193	ggml: add names to tensors (#1268 ) * ggml: add names to tensors * minor improvements to dot file formatting	2023-05-02 16:03:00 +02:00
slaren	58b367c2d7	cuBLAS: refactor and optimize f16 mat mul performance (#1259 ) * cuBLAS: refactor, convert fp16 to fp32 on device * cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16 * fix build * cuBLAS: update block_q5_1	2023-05-01 18:11:07 +02:00
Kerfuffle	2bdc09646d	ggml : fix ggml_used_mem() (#1264 )	2023-05-01 14:56:07 +03:00
Georgi Gerganov	7ff0dcd320	ggml : fix UB (int << 31)	2023-04-30 22:28:51 +03:00
Georgi Gerganov	6bc4400e67	ggml : add Q5 WASM SIMD + GGML_FTYPE	2023-04-30 19:07:43 +03:00
Georgi Gerganov	3e5aa8a1c4	ggml : fix labels for GGML_OP_ALIBI	2023-04-30 10:25:46 +03:00
Georgi Gerganov	c3ca7a5f05	ggml : fix 32-bit ARM NEON	2023-04-29 21:34:23 +03:00
Georgi Gerganov	e8c051611a	ggml : use vzip instead of vuzp for consistency	2023-04-29 21:12:56 +03:00
Georgi Gerganov	0b5a935099	ggml : fix visibility and unused warnings	2023-04-29 19:28:36 +03:00
Georgi Gerganov	ec728e44d7	ggml : fix #if for f32_f32 mul_mat (CLBlast) (#1229 )	2023-04-29 18:43:42 +03:00
Georgi Gerganov	214b6a3570	ggml : adjust mul_mat_f16 work memory (#1226 ) * llama : minor - remove explicit int64_t cast * ggml : reduce memory buffer for F16 mul_mat when not using cuBLAS * ggml : add asserts to guard for incorrect wsize	2023-04-29 18:43:28 +03:00
slaren	7fc50c051a	cuBLAS: use host pinned memory and dequantize while copying (#1207 ) * cuBLAS: dequantize simultaneously while copying memory * cuBLAS: use host pinned memory * cuBLAS: improve ggml_compute_forward_mul_mat_f16_f32 with pinned memory * cuBLAS: also pin kv cache * fix rebase	2023-04-29 02:04:18 +02:00
Henri Vasserman	b1ee8f59b4	cuBLAS: non-contiguous tensor support (#1215 ) * Cuda: non-contiguous tensor support * remove extra stuff * rename * fix error * more fixes, now OpenBLAS and CLBlast build too * now then?	2023-04-29 01:31:56 +02:00
Stephan Walter	36d19a603b	Remove Q4_3 which is no better than Q5 (#1218 )	2023-04-28 23:10:43 +00:00
Georgi Gerganov	55390bcaf2	ggml : sync ggml (ggml_alibi)	2023-04-28 20:51:05 +03:00
Georgi Gerganov	11d902364b	ggml : add helper debug printf in soft_max	2023-04-28 17:59:08 +03:00
0cc4m	7296c961d9	ggml : add CLBlast support (#1164 ) * Allow use of OpenCL GPU-based BLAS using ClBlast instead of OpenBLAS for context processing * Improve ClBlast implementation, avoid recreating buffers, remove redundant transfers * Finish merge of ClBlast support * Move CLBlast implementation to separate file Add buffer reuse code (adapted from slaren's cuda implementation) * Add q4_2 and q4_3 CLBlast support, improve code * Double CLBlast speed by disabling OpenBLAS thread workaround Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com> Co-authored-by: slaren <2141330+slaren@users.noreply.github.com> * Fix device selection env variable names * Fix cast in opencl kernels * Add CLBlast to CMakeLists.txt * Replace buffer pool with static buffers a, b, qb, c Fix compile warnings * Fix typos, use GGML_TYPE defines, improve code * Improve btype dequant kernel selection code, add error if type is unsupported * Improve code quality * Move internal stuff out of header * Use internal enums instead of CLBlast enums * Remove leftover C++ includes and defines * Make event use easier to read Co-authored-by: Henri Vasserman <henv@hot.ee> * Use c compiler for opencl files * Simplify code, fix include * First check error, then release event * Make globals static, fix indentation * Rename dequant kernels file to conform with other file names * Fix import cl file name --------- Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com> Co-authored-by: slaren <2141330+slaren@users.noreply.github.com> Co-authored-by: Henri Vasserman <henv@hot.ee> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-28 17:57:16 +03:00
Yann Follet	04aaae1d79	add avx2 for dot_q8_0_q8_0, 2x faster than scalar (#1211 )	2023-04-28 11:59:48 +00:00
Stephan Walter	0b2da20538	ggml : slightly faster AVX2 implementation for Q5 (#1197 )	2023-04-26 23:26:42 +03:00
Georgi Gerganov	574406dc7e	ggml : add Q5_0 and Q5_1 quantization (#1187 ) * ggml : add Q5_0 quantization (cuBLAS only) * ggml : fix Q5_0 qh -> uint32_t * ggml : fix q5_0 histogram stats * ggml : q5_0 scalar dot product * ggml : q5_0 ARM NEON dot * ggml : q5_0 more efficient ARM NEON using uint64_t masks * ggml : rename Q5_0 -> Q5_1 * ggml : adding Q5_0 mode * quantize : add Q5_0 and Q5_1 to map * ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195) --------- Co-authored-by: Stephan Walter <stephan@walter.name>	2023-04-26 23:14:13 +03:00
Georgi Gerganov	7a32fcb3b2	ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179 ) * ggml : add Q8_0 quantization format (rename the old one to Q8_1) * tests : fix test-quantize-fns * ggml : finalize Q8_0 implementation * ggml : use q4_0_q8_0 and q4_2_q8_0 * ggml : fix Q8_0 dot product bug (ARM) * ggml : Q8_0 unroll x2 * ggml : fix bug - using wrong block type * ggml : extend quantize_fns_t with "vec_dot_type" * ggml : fix Q8_0 to use 255 values out of 256 * ggml : fix assert using wrong QK4_2 instead of QK4_3	2023-04-25 23:40:51 +03:00
unbounded	dd0eabc049	ggml : use full range for Q4_0 and Q4_2 quantization (#729 ) * Use full range for q4_0 quantization By keeping the sign of the highest magnitude, we can make sure the highest value maps to -8, which is currently unused. This is a bit of a freebie since it is fully backwards compatible with the current format. * Update quantize_row_q4_0 for AVX/AVX2 * Update quantize_row_q4_0 for WASM Untested * Update quantize_row_q4_0 for Arm NEON * Update quantize_row_q4_0 for PowerPC Untested * Use full range for q4_2 quantization	2023-04-25 20:20:46 +03:00
xaedes	54bb60e268	ggml : fix bug in ggml_compute_forward_sum_f32 (#1162 ) The sum over all rows is now computed instead of just the last row	2023-04-24 23:02:02 +02:00
Stephan Walter	2ec83428de	Fix build for gcc 8 and test in CI (#1154 )	2023-04-24 15:38:26 +00:00
Georgi Gerganov	ec9cdb6752	ggml : do not print perf ops that have not been used at all	2023-04-23 18:32:52 +03:00
Georgi Gerganov	e4422e299c	ggml : better PERF prints + support "LLAMA_PERF=1 make"	2023-04-23 18:15:39 +03:00
Stephan Walter	53c8434398	Improve AVX2 for vec_dot_q4_3_q8_0 (#1138 )	2023-04-23 11:01:03 +00:00
Yishuo Wang	c9e2c26f41	A better `packNibbles` and `mul_sum_i8_pairs_float` implementation using AVX512 (#1119 )	2023-04-23 07:57:05 +00:00
Georgi Gerganov	0e018fe008	ggml : fix Q4_3 cuBLAS	2023-04-22 16:32:07 +03:00
Stephan Walter	c50b628810	Fix CI: ARM NEON, quantization unit tests, editorconfig (#1122 )	2023-04-22 10:54:13 +00:00
Georgi Gerganov	872c365a91	ggml : fix AVX build + update to new Q8_0 format	2023-04-22 11:08:12 +03:00
Georgi Gerganov	955ef9a5d5	ggml : alternative Q4_3 implementation using modified Q8_0 (#1109 ) * ggml : prefer vzip to vuzp This way we always use the same type of instruction across all quantizations * ggml : alternative Q4_3 implementation using modified Q8_0 * ggml : fix Q4_3 scalar implementation * ggml : slight improvement of Q4_3 - no need for loop unrolling * ggml : fix AVX paths for Q8_0 quantization	2023-04-22 10:55:35 +03:00
Stephan Walter	c5aa5e5777	ggml : AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring (#1099 ) * AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring * finish AVX vectorization of quantize_row_q8_0 * Rename hsum_int_8 to hsum_i32_8	2023-04-22 10:37:05 +03:00
slaren	50cb666b8a	Improve cuBLAS performance by using a memory pool (#1094 ) * Improve cuBLAS performance by using a memory pool * Move cuda specific definitions to ggml-cuda.h/cu * Add CXX flags to nvcc * Change memory pool synchronization mechanism to a spin lock General code cleanup	2023-04-21 21:59:17 +02:00
Kawrakow	1bfc153e2f	ggml : a faster version for Q4_1 x Q8_0 dot products (#1083 ) * A faster version for Q4_1 x Q8_0 dot products The idea behind being that Q8_0 quantized values get used many times in the matrix multiplications where they are involved. In the current implementations, when we are evaluating the dot products, we need to compute the sum of the quants in the Q8_0 vector, so the same operation is repeated many times. Here we pre-compute the sum during Q8_0 quantization, store it in the now modified block_q8_0 struct, and then reuse this result in the subsequent dot products. In a synthetic benchmark (just compute a bunch of dot products), this change speeds up the Q4_1 * Q8_0 dot product by 80%, making the performance identical to Q4_0 * Q8_0. In practical application, I see a ~15% gain in speed for token prediction on M2, and ~5% gain on Ryzen 7950X. The speed gain in the prompt evaluation is much bigger (around 50%). I have only done the change for the scalar version, ARM_NEON, and AVX2, so we still need an AVX implementation. * Cleaning up --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-04-21 18:18:26 +03:00
Georgi Gerganov	12b5900dbc	ggml : sync ggml (add GPT-NeoX RoPE implementation)	2023-04-20 23:32:59 +03:00
Georgi Gerganov	9ff334f3c9	ggml : fix bug in ggml_compute_forward_dup_f32()	2023-04-20 21:58:38 +03:00
Georgi Gerganov	8a1756abdf	ggml : do not break cuBLAS build (Q4_3 is not yet implemented)	2023-04-20 21:43:50 +03:00
Georgi Gerganov	66aab46079	ggml : fix Q4_3 quantization Broke it during conflict resolution in last PR	2023-04-20 20:44:05 +03:00
Kawrakow	38de86a711	llama : multi-threaded quantization (#1075 ) * Multi-threading quantization. Not much gain for simple quantizations, but it will be important for quantizations that require more CPU cycles. * Multi-threading for quantize-stats It now does the job in ~14 seconds on my Mac for Q4_0, Q4_1 and Q4_2. Single-threaded it was taking more than 2 minutes after adding the more elaborate version of Q4_2. * Reviewer comments * Avoiding compiler confusion After changing chunk_size to const int as suggested by @ggerganov, clang and GCC started to warn me that I don't need to capture it in the lambda. So, I removed it from the capture list. But that makes the MSVC build fail. So, making it a constexpr to make every compiler happy. * Still fighting with lambda captures in MSVC --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-20 20:42:27 +03:00
Georgi Gerganov	e0305ead3a	ggml : add Q4_3 quantization (#1082 )	2023-04-20 20:35:53 +03:00
Stephan Walter	c8c2c52482	AVX2 optimization for vec_dot_q4_2_q8_0 (#1068 )	2023-04-20 08:45:41 +02:00
slaren	02d6988121	Improve cuBLAS performance by dequantizing on the GPU (#1065 )	2023-04-20 03:14:14 +02:00
Kawrakow	f7d05095b4	Q4_2 quantization with rmse-optimized scale and quants (#1062 ) * Q4_2 quantization with rmse-optimized scale and quants For quantize-stats we get q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012 For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks. Quantization is slow (~90 seconds on my Mac for 7B) as not multi-threaded as in PR #896. * ggml : satisfy the sanitizer builds Not sure why this makes them fail * Better follow ggml conventions for function names * Fixed type as per reviewer comment --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-19 20:20:14 +02:00
Georgi Gerganov	884e7d7a2b	ggml : use 8-bit precision for Q4_1 intermediate results (#1047 ) * ggml : use 8-bit precision for Q4_1 intermediate results (ARM) * ggml : optimize ggml_vec_dot_q4_1_q8_0() via vmalq_n_f32 56 ms/token with Q4_1 ! * ggml : AVX2 implementation of ggml_vec_dot_q4_1_q8_0 (#1051) * gitignore : ignore ppl-*.txt files --------- Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>	2023-04-19 20:10:08 +03:00
Stephan Walter	f3d4edf504	ggml : Q4 cleanup - remove 4-bit dot product code (#1061 ) * Q4 cleanup * Remove unused AVX512 Q4_0 code	2023-04-19 19:06:37 +03:00
slaren	8944a13296	Add NVIDIA cuBLAS support (#1044 )	2023-04-19 11:22:45 +02:00
slaren	6667401238	Multi-threaded ggml_cpy (#1035 ) * Multi-threaded ggml_cpy * Update ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Also fix wdata offset in ggml_compute_forward_add_q_f32 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-04-19 00:53:24 +02:00
Georgi Gerganov	77a73403ca	ggml : add new Q4_2 quantization (ARM only) (#1046 ) * ggml : Q4_2 ARM * ggml : add ggml_is_quantized() * llama : update llama_type_name() with Q4_2 entry * ggml : speed-up q4_2 - 4 threads: ~100ms -> ~90ms - 8 threads: ~55ms -> ~50ms * ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32	2023-04-18 23:54:57 +03:00
Georgi Gerganov	50a8a2af97	ggml : scratch that - vmlaq_n_f32 is always better Had a background process that was messing with the timings	2023-04-18 23:11:23 +03:00
Georgi Gerganov	dcdd65e296	ggml : optimize ggml_vec_dot_q4_0_q8_0() using vectorized accumulators	2023-04-18 22:59:17 +03:00
slaren	315a95a4d3	Add LoRA support (#820 )	2023-04-17 17:28:55 +02:00
Georgi Gerganov	69b740289f	ggml : avoid using ggml_fp16_to_fp32() and ggml_fp32_to_fp16() in ggml.c	2023-04-17 16:16:23 +03:00
Ivan Komarov	f266259ad9	Speedup the AVX-512 implementation of ggml_vec_dot_q4_0() (#933 )	2023-04-17 15:10:57 +02:00
Stephan Walter	2f7c8e014e	Fix potential int8 overflow in non-SIMD vec_dot (#986 )	2023-04-15 18:28:56 +00:00
Stephan Walter	0ad964631f	Refactor ggml.c for future tensor types (#1001 )	2023-04-15 16:25:38 +00:00
Georgi Gerganov	e95b6554b4	ggml : add Q8_0 quantization for intermediate results (#951 ) * ggml : add Q8_0 quantization for intermediate results * quantize-stats : fix test + add it to Makefile default * Q8: use int8_t, AVX/AVX2 optimizations * ggml : fix quantize_row_q8_0() ARM_NEON rounding * minor : updates after rebase to latest master * quantize-stats : delete obsolete strings * ggml : fix q4_1 dot func --------- Co-authored-by: Stephan Walter <stephan@walter.name>	2023-04-15 17:53:22 +03:00
Georgi Gerganov	aa485cee33	ggml : use posix_memalign on non-Windows env	2023-04-15 14:25:45 +03:00
Pavol Rusnak	c56b715269	Expose type name from ggml (#970 ) Avoid duplication of type names in utils Co-authored-by: Håkon H. Hitland <haakon@likedan.net>	2023-04-14 20:05:37 +02:00
Kerfuffle	c9a59b70a5	ggml : add unary and binary map operations (#874 ) * GGML map ops proof of concept. * Various cleanups. Add handling for task setting. Add handling for ggml_compute_backward. Rename functions to ggml_map_unary_f32 and ggml_map_binary_f32 Fix compiler warnings related to casting function pointers and `void ` Reorder functions and definitions based on the GGML op number. Use typedefs for map op function pointer types. Fix position of map ops cases in ggml_compute_forward	2023-04-14 17:43:55 +03:00
Georgi Gerganov	1623a6e9b4	ggml : minor	2023-04-14 13:31:29 +03:00
Georgi Gerganov	c14e0d2f23	ggml : always allocate buffers with size multiple of GGML_MEM_ALIGN	2023-04-14 13:31:15 +03:00
Georgi Gerganov	0f07cacb05	ggml : fix q4_1 dot product types	2023-04-14 09:45:42 +03:00
Howard Su	c5d70f5c9e	ggml : optimize rope function to avoid call powf in the tight loop (#807 )	2023-04-14 09:24:52 +03:00
Georgi Gerganov	a3a2a0eda8	ggml : add GGML_DEFAULT_N_THREADS	2023-04-13 18:36:48 +03:00
Georgi Gerganov	d990e3fffc	ggml : speed-up ggml_vec_dot_q4_1() ARM_NEON + 32-bit ARM support (#900 ) * ggml : speed-up q4_1 ARM_NEON by ~5% * ggml : implement vaddvq when missing * ggml : implement vminvq and vmaxvq when missing * ggml : implement vzip when missing * ggml : fix comment * ggml : try to use correct ifdef	2023-04-13 18:32:36 +03:00
Stephan Walter	6232f2d7fd	ggml : optimize non-SIMD Q4_0 vector dot product (#703 )	2023-04-13 17:59:50 +03:00
Pavol Rusnak	6c248707f5	ggml : introduce GGML_ALIGNED_MALLOC/GGML_ALIGNED_FREE macros (#884 ) which allows us to use aligned_alloc or _aligned_malloc functions	2023-04-13 17:08:32 +03:00
Vladimir	8c3ffc2f04	ggml : update cblas_sgemm columns var to be more reasonable (#838 )	2023-04-13 16:24:30 +03:00
Pavol Rusnak	8b679987cd	Fix whitespace, add .editorconfig, add GitHub workflow (#883 )	2023-04-11 19:45:44 +00:00
Stephan Walter	3e6e70d8e8	Add enum llama_ftype, sync ggml_type to model files (#709 )	2023-04-11 15:03:51 +00:00
comex	2663d2c678	Windows fixes (#890 ) Mostly for msys2 and mingw64 builds, which are different from each other and different from standard Visual Studio builds. Isn't Windows fun? - Define _GNU_SOURCE in more files (it's already used in ggml.c for Linux's sake). - Don't use PrefetchVirtualMemory if not building for Windows 8 or later (mingw64 doesn't by default). But warn the user about this situation since it's probably not intended. - Check for NOMINMAX already being defined, which it is on mingw64. - Actually use the `increment` variable (bug in my `pizza` PR). - Suppress unused variable warnings in the fake pthread_create and pthread_join implementations for Windows. - (not Windows-related) Remove mention of `asprintf` from comment; `asprintf` is no longer used. Fixes #871.	2023-04-11 15:19:54 +02:00
Georgi Gerganov	461ba9e66e	ggml : fix WASM build	2023-04-10 23:20:01 +03:00
Georgi Gerganov	c3ac702e5e	ggml : add ggml_cont() + optimize ggml_cpy() for contiguous dst	2023-04-10 22:42:28 +03:00
Georgi Gerganov	9d634ef452	ggml : remove trailing whitespaces	2023-04-10 22:42:28 +03:00
Marco Matthies	d9a239c410	Simplify to include lower-case windows.h always, fix compile on mingw32 (#747 )	2023-04-10 19:57:59 +02:00
Georgi Gerganov	684da25926	ggml : fix quantize_row_q4_1() ARM_NEON (close #876 )	2023-04-10 19:29:48 +03:00
comex	f963b63afa	Rewrite loading code to try to satisfy everyone: - Support all three formats (ggml, ggmf, ggjt). (However, I didn't include the hack needed to support GPT4All files without conversion. Those can still be used after converting them with convert.py from my other PR.) - Support both mmap and read (mmap is used by default, but can be disabled with `--no-mmap`, and is automatically disabled for pre-ggjt files or on platforms where mmap is not supported). - Support multi-file models like before, but automatically determine the number of parts rather than requiring `--n_parts`. - Improve validation and error checking. - Stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields. This has no immediate benefit, but makes it easier to experiment with different formats, and should make it easier to support the new GPTQ-for-LLaMa models in the future (I have some work in progress on that front). - Support VirtualLock on Windows (using the same `--mlock` option as on Unix). - Indicate loading progress when using mmap + mlock. (Which led me to the interesting observation that on my Linux machine, with a warm file cache, mlock actually takes some time, whereas mmap without mlock starts almost instantly...) - To help implement this, move mlock support from ggml to the loading code. - madvise/PrefetchVirtualMemory support (based on #740) - Switch from ifstream to the `fopen` family of functions to avoid unnecessary copying and, when mmap is enabled, allow reusing the same file descriptor for both metadata reads and mmap (whereas the existing implementation opens the file a second time to mmap). - Quantization now produces a single-file output even with multi-file inputs (not really a feature as much as 'it was easier this way'). Implementation notes: I tried to factor the code into more discrete pieces than before. 
Regarding code style: I tried to follow the code style, but I'm naughty and used a few advanced C++ features repeatedly: - Destructors to make it easier to ensure everything gets cleaned up. - Exceptions. I don't even usually use exceptions when writing C++, and I can remove them if desired... but here they make the loading code much more succinct while still properly handling a variety of errors, ranging from API calls failing to integer overflow and allocation failure. (The exceptions are converted to error codes at the API boundary.) Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)	2023-04-10 01:10:46 +02:00
unbounded	62cfc54f77	Add quantize-stats command for testing quantization (#728 ) Command that calculates some statistics over the errors introduced by quantization, like mean square error, max error and some percentile errors for layer weights. Should be useful for testing quantization improvements. Exposes some internal state from ggml and llama for testing	2023-04-08 00:09:18 +02:00
Georgi Gerganov	eeaa7b0492	ggml : multi-thread ggml_rope() (~3-4 times faster on M1) (#781 )	2023-04-05 22:11:03 +03:00
Georgi Gerganov	986b6ce9f9	ggml, llama : avoid heavy V transpose + improvements (#775 ) ggml : - added ggml_view_3d() - ggml_view_tensor() now inherits the stride too - reimplement ggml_cpy() to account for dst stride - no longer require tensor->data to be memory aligned llama : - compute RoPE on 32-bit tensors (should be more accurate) - store RoPE-ed K in the KV cache - store transposed V in the KV cache (significant speed-up) - avoid unnecessary Q copy	2023-04-05 22:07:33 +03:00
SebastianApel	437e77855a	10+% performance improvement of ggml_vec_dot_q4_0 on AVX2 (#654 ) * Performance improvement of AVX2 code * Fixed problem with MSVC compiler * Reviewer comments: removed double semicolon, deleted empty line 1962	2023-04-03 09:52:28 +02:00
Marian Cepok	c0bb1d3ce2	ggml : change ne to int64_t (#626 )	2023-04-02 13:21:31 +03:00
Stephan Walter	3525899277	Enable -std= for cmake builds, fix warnings (#598 )	2023-03-31 19:19:16 +00:00
slaren	1d08882afa	Optimize AVX2 ggml_vec_dot_q4_0 (#642 )	2023-03-31 15:55:52 +00:00
perserk	02c5b27e91	Add AVX acceleration (#617 ) * ggml : add AVX quantize_row_q4_0() * ggml : add AVX ggml_vec_dot_q4_0() * ggml : refactor AVX part of ggml_vec_dot_q4_0() https://github.com/ggerganov/llama.cpp/pull/617#issuecomment-1489985645	2023-03-31 13:55:44 +02:00
Justine Tunney	6f23ba5ee2	Ensure --mlock works properly with mmap() support	2023-03-30 12:28:25 -07:00
Slaren	c03ae8dca1	Add mmap support for model files	2023-03-30 12:28:25 -07:00
Casey Primozic	a4755cf288	Remove unused variable (#607 ) * It seems some new warning were added recently that exposed this. I wrote the code that included this unused variable originally and it is indeed not needed.	2023-03-30 17:53:35 +00:00
Georgi Gerganov	77efdf5a50	ggml : fix NEON signs (close #620 , #622 )	2023-03-30 20:27:32 +03:00
slaren	ed3c680bcd	Fix GGML_F32Cx8_STORE in AVX without F16C path (#619 )	2023-03-30 11:16:30 +02:00
Georgi Gerganov	b51c717d5c	ggml : init time on first ggml_init() call	2023-03-29 22:15:34 +03:00
Georgi Gerganov	cea1c85948	ggml : add ARM_NEON dequantize_row_q4_1()	2023-03-29 22:10:01 +03:00
Georgi Gerganov	f202ada131	ggml : add ARM_NEON quantize_row_q4_1()	2023-03-29 22:03:07 +03:00
Georgi Gerganov	3b44d30d9b	ggml : add ARM_NEON ggml_vec_dot_q4_1()	2023-03-29 22:03:07 +03:00
anzz1	83df5639eb	Fix GCC warning about binary literal (#595 ) 0b10101010 -> 0xAA /* 0b10101010 */	2023-03-29 13:20:07 +00:00
anzz1	5a5f8b1501	Enable Fused-Multiply-Add (FMA) and F16C/CVT16 vector extensions on MSVC (#375 ) * Enable Fused-Multiply-Add (FMA) instructions on MSVC __FMA__ macro does not exist in MSVC * Enable F16C/CVT16 vector extensions on MSVC __F16C__ macro does not exist in MSVC, but is implied with AVX2/AVX512 * MSVC cvt intrinsics * Add __SSE3__ macro for MSVC too because why not even though it's not currently used for anything when AVX is defined	2023-03-28 22:44:29 +03:00
slaren	2a98bc18ea	ggml : add AVX2 implementation of quantize_row_q4_1 (#515 ) * Add AVX2 implementation of quantize_row_q4_1 * Actually use AVX2 * Make quantize_row_q4_1 static Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-03-28 21:06:03 +03:00
Stephan Walter	99c5b27654	ggml : refactor quantized processing functions (#509 ) * Refactor quantized processing functions * ggml : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-03-28 20:13:01 +03:00
Stephan Walter	436e561931	all : be more strict about converting float to double (#458 ) * Be more strict about converting float to double * Test equivalence of round, SILU implementations Test module is commented out in CMakeLists.txt because the tests may take a long time, depending on how much the compiler optimizes. * Fix softmax in perplexity.cpp * all : prefer float over double where appropriate * perplexity : add <cmath> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-03-28 19:48:20 +03:00
Stephan Walter	c1f885067c	ggml : introduce structs for the q4 data blocks (#356 ) * Introduce structs for the q4 data blocks * ggml : rename quant struct variables + fix ARM_NEON --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-03-28 18:56:03 +03:00
slaren	a6bdc47cba	Fix usage of F16C intrinsics in AVX code (#563 ) * Fix usage of F16C intrinsics in AVX code when F16C is not defined	2023-03-28 17:26:55 +03:00
Stephan Walter	939ad2d3a5	Fix undefined variables in debug build, remove unused variables (#531 )	2023-03-26 15:34:02 +00:00
slaren	459e93cce0	Add AVX2 implementation of dequantize_row_q4_1 (#505 )	2023-03-25 20:31:48 +02:00
Georgi Gerganov	a316a425d0	Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something !	2023-03-25 20:26:40 +02:00
Georgi Gerganov	ecbe466a36	Retire the ggml_mul_mat() branch for transposed src0 (#500 ) * Retire the ggml_mul_mat() for transposed src0 - It can always be made contiguous with ggml_cpy() - The code is now simplified - The results are deterministic in respect to num threads * SIMD-ify dequantize_row_q4_0() for ARM_NEON (#502) * Attempt to SIMD-ify dequantize_row_q4_0() for ARM_NEON * Fix dequantization - forgot to interleave the quants	2023-03-25 19:47:21 +02:00
slaren	09aecbf628	Add AVX2 implementation of dequantize_row_q4_0 (#467 )	2023-03-25 17:06:49 +02:00
Georgi Gerganov	6b6dbc8910	Remove obsolete assert and fix compiler warning	2023-03-25 16:22:05 +02:00
Georgi Gerganov	2a2e63ce05	Fix nasty bug in ggml_compute_forward_mul_mat_f32() and reenable BLAS	2023-03-25 16:10:14 +02:00
Georgi Gerganov	8520fc310e	Disable BLAS altogether - the bug is not just for quantized mat mul	2023-03-24 23:47:06 +02:00
Georgi Gerganov	b3f460e941	Disable BLAS branch in mul_mat - seems there is a bug	2023-03-24 23:39:17 +02:00
Georgi Gerganov	7a9b6c3a8b	Reduce memory usage and allocate enough memory for largest context (#473 ) * Reduce memory usage and allocate enough memory for large contexts * Simpler scratch buffer usage * Reenable BLAS for quantized mul_mat * Fix number of layers in 30B and 65B * Fix KV cache size for F32	2023-03-24 23:17:37 +02:00
Cameron Kaiser	481044d50c	additional optimizations for POWER9 (#454 )	2023-03-24 17:19:26 +02:00
comex	563cdc391d	Support calling mlock() on loaded model data on Linux and macOS (#453 ) * Support calling mlock() on loaded model data on Linux and macOS This is enabled by a new --mlock command line option. Using mlock() disables swapping and memory compression for the model data. Doing so can be useful on systems where the model takes up a large fraction of system RAM. In my experience, macOS is quite eager to start compressing llama.cpp's memory, which then makes it halt for a few seconds while it decompresses, even with a model that uses "only" 25GB out of 32GB. Of course, this comes at the cost of forcing the system to swap or compress other processes' memory instead, so it needs to be used with care and shouldn't be enabled by default. In theory it should be possible to support this on Windows as well using VirtualLock(), but I'm not much of a Windows user. * Update llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-03-24 17:19:05 +02:00
Stephan Walter	69c92298a9	Deduplicate q4 quantization functions (#383 ) * Deduplicate q4 quantization functions * Use const; add basic test * Re-enable quantization test * Disable AVX2 flags in CI --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-03-22 19:29:06 +02:00
Valentyn Bezshapkin	97940520e8	fix: add POSIX functionality for Linux compilation (#51 ) * fix: add POSIX functionality for Linux compilation * fix: older standard for compatibility	2023-03-22 19:20:25 +02:00
Georgi Gerganov	f5a77a629b	Introduce C-style API (#370 ) * Major refactoring - introduce C-style API * Clean up * Add <cassert> * Add <iterator> * Add <algorithm> .... * Fix timing reporting and accumulation * Measure eval time only for single-token calls * Change llama_tokenize return meaning	2023-03-22 07:32:36 +02:00
Kevin Lo	715d292ee0	Add OpenBSD support (#314 )	2023-03-21 17:50:09 +02:00
Casey Primozic	2e664f1ff4	Add initial AVX512 support for dot product on Linux (#320 ) * Update Makefile to detect AVX512 support and add compiler flags if it's available * Based on existing AVX2 implementation, dot product on one 32-value block of 4-bit quantized ints at a time * Perform 8 bit -> 16 bit sign extension and multiply+add on 32 values at time instead of 16 * Use built-in AVX512 horizontal reduce add to get sum at the end * Manual unrolling on inner dot product loop to reduce loop counter overhead	2023-03-21 15:35:42 +01:00
Georgi Gerganov	22213a17b5	Change RMSNorm eps to 1e-6 (#173 ) I think this is what is used in the Python code	2023-03-19 17:30:00 +02:00
Stephan Walter	367946c668	Don't tell users to use a bad number of threads (#243 ) The readme tells people to use the command line option "-t 8", causing 8 threads to be started. On systems with fewer than 8 cores, this causes a significant slowdown. Remove the option from the example command lines and use /proc/cpuinfo on Linux to determine a sensible default.	2023-03-17 19:47:35 +02:00
Matvey Soloviev	904d2a8d6a	Q4_1 quantization (#193 ) * Add AVX2 version of ggml_vec_dot_q4_1 * Small optimisations to q4_1 dot product (@Const-me) * Rearrange Q4_1 quantization to work for multipart models. (Fix #152) * Fix ggml_vec_mad_q4_1 too * Fix non-vectorised q4_1 vec mul	2023-03-17 06:48:39 +02:00
Nebula	9b4a15b17d	Fix RMS norm in GGML (#191 )	2023-03-15 19:29:25 -04:00
hoangmit	6eac39ba95	Add RMS norm and use it (#187 ) * add ggml_rms_norm * update op num	2023-03-16 00:41:38 +02:00
hoangmit	113e685d18	inline -> static inline for "bytesFromNibbles" (#161 ) Without "static" prefix, it fails to compile in clang	2023-03-15 21:05:14 +02:00
Ronsor	47857e564c	Don't use vdotq_s32 if it's not available (#139 ) * Don't use vdotq_s32 if it's not available `dotprod` extensions aren't available on some ARM CPUs (e.g. Raspberry Pi 4), so check for them and only use them if they're available. Reintroduces the code removed in `84d9015` if `__ARM_FEATURE_DOTPROD` isn't defined. * Update ggml.c --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-03-14 21:34:37 +02:00
Thomas Klausner	41be0a3b3d	Add NetBSD support. (#90 )	2023-03-13 18:40:54 +02:00
Georgi Gerganov	84d9015c4a	Use vdotq_s32 to improve performance (#67 ) * 10% performance boost on ARM * Back to original change	2023-03-13 18:36:44 +02:00
Georgi Gerganov	c80e2a8f2a	Revert "10% performance boost on ARM" This reverts commit `113a9e83eb`. There are some reports for illegal instruction. Moved this stuff to vdotq_s32 branch until resolve	2023-03-13 01:28:08 +02:00
Georgi Gerganov	54a0e66ea0	Check for vdotq_s32 availability	2023-03-13 01:21:03 +02:00
Georgi Gerganov	543c57e991	Amend to previous commit - forgot to update non-QRDMX branch	2023-03-13 01:05:24 +02:00
Georgi Gerganov	113a9e83eb	10% performance boost on ARM	2023-03-13 00:56:10 +02:00
Sebastián A	eb062bb012	Windows fixes (#31 ) * Apply fixes suggested to build on windows Issue: https://github.com/ggerganov/llama.cpp/issues/22 * Remove unsupported VLAs * MSVC: Remove features that are only available on MSVC C++20. * Fix zero initialization of the other fields. * Change the use of vector for stack allocations.	2023-03-12 22:15:00 +02:00
Georgi Gerganov	f1eaff4721	Add AVX2 support for x86 architectures thanks to @Const-me !	2023-03-11 18:04:25 +02:00
Georgi Gerganov	007a8f6f45	Support all LLaMA models + change Q4_0 quantization storage	2023-03-11 11:28:30 +02:00
Georgi Gerganov	26c0846629	Initial release	2023-03-10 20:56:40 +02:00
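
Note on commit dd0eabc049 (#729): the message says that keeping the sign of the highest-magnitude value lets that value map to -8, a code that was previously unused. The following is a minimal scalar sketch of that idea, not the code from ggml.c. The names `block_sketch`, `quantize_row_full_range` and `dequantize_value` are hypothetical, the block size of 32 is an assumption, and the 4-bit codes are stored one per byte instead of being packed into nibbles.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

#define QK 32 /* values per block (assumed for this sketch) */

/* Hypothetical block layout: one scale plus QK unpacked 4-bit codes. */
typedef struct {
    float   d;     /* scale; may be negative, see below */
    uint8_t q[QK]; /* codes in 0..15, representing the integers -8..7 */
} block_sketch;

/* Quantize n floats (n must be a multiple of QK) into n / QK blocks. */
static void quantize_row_full_range(const float *x, block_sketch *y, int n) {
    assert(n % QK == 0);
    for (int b = 0; b < n / QK; b++) {
        const float *xb = x + b * QK;

        /* Find the value with the largest magnitude and keep its sign. */
        float amax = 0.0f;
        float max  = 0.0f;
        for (int i = 0; i < QK; i++) {
            const float a = fabsf(xb[i]);
            if (a > amax) {
                amax = a;
                max  = xb[i];
            }
        }

        /* Dividing by -8 makes the largest-magnitude value map to -8. */
        const float d  = max / -8.0f;
        const float id = (d != 0.0f) ? 1.0f / d : 0.0f;
        y[b].d = d;

        for (int i = 0; i < QK; i++) {
            const float v = xb[i] * id;   /* lies in [-8, 8] */
            int q = (int)(v + 8.5f);      /* round and shift to 0..16 */
            if (q > 15) {
                q = 15;                   /* +8 is not representable */
            }
            y[b].q[i] = (uint8_t)q;
        }
    }
}

/* Recover an approximate float from code i of one block. */
static float dequantize_value(const block_sketch *blk, int i) {
    return (float)((int)blk->q[i] - 8) * blk->d;
}
```

In this sketch the largest-magnitude input of each block is reproduced exactly after dequantization, because it lands on the -8 end of the range. A value of equal magnitude and opposite sign is clamped to 7, which is the cost of the asymmetric integer range.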


345 Commits