llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-15 23:39:52 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	5b33ea1ee7	metal : fix struct name (ggml/912) ggml-ci	2024-08-08 13:19:31 +03:00
Conrad Kramer	85fca8deb6	metal : add abort callback (ggml/905)	2024-08-08 13:19:30 +03:00
Pablo Duboue	ebd541a570	make : clean llamafile objects (#8923 ) `ggml/src/llamafile/sgemm.o` was not deleted on `make clean`	2024-08-08 11:44:51 +03:00
slaren	15fa07a5c5	make : use C compiler to build metal embed object (#8899 ) * make : use C compiler to build metal embed object * use rm + rmdir to avoid -r flag in rm	2024-08-07 18:24:05 +02:00
slaren	be55695eff	ggml-backend : fix async copy from CPU (#8897 ) * ggml-backend : fix async copy from CPU * cuda : more reliable async copy, fix stream used when the devices are the same	2024-08-07 13:29:02 +02:00
Ouadie EL FAROUKI	0478174d59	[SYCL] Updated SYCL device filtering (#8901 ) * Updated device filter to depend on default_selector (fixes non-intel device issues) * Small related update to example/sycl Readme	2024-08-07 11:25:36 +01:00
Johannes Gäßler	a8dbc6f753	CUDA/HIP: fix tests/test-backend-ops (#8896 )	2024-08-07 09:07:52 +02:00
Zhenwei Jin	506122d854	llama-bench : add support for getting cpu info on Windows (#8824 ) * Add support for getting cpu info on Windows for llama_bench * refactor --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-08-07 03:01:06 +02:00
Daniel Bevenius	725e3d9437	quantize : update usage comment in quantize.cpp (#8889 ) This commit updates the usage comment in quantize.cpp to reflect the new name of the executable, which is llama-quantize.	2024-08-07 01:43:00 +02:00
Nexes the Old	31958546c3	typo correction (#8891 )	2024-08-07 01:41:54 +02:00
Xuan Son Nguyen	1e6f6554aa	server : add lora hotswap endpoint (WIP) (#8857 ) * server : add lora hotswap endpoint * handle lora_no_apply * fix build * updae docs * clean up struct def * fix build * add LoRA test * fix style	2024-08-06 17:33:39 +02:00
Johannes Gäßler	641f5dd2a6	CUDA: fix padding logic for FP16/FP32 (#8884 )	2024-08-06 17:13:55 +02:00
Daniel Bevenius	5f4dcb1e60	simple : update name of executable to llama-simple (#8885 ) This commit updates the name of the executable in README.md from `simple` to `llama-simple`.	2024-08-06 16:44:35 +02:00
Jaeden Amero	db20f50cf4	cmake : Link vulkan-shaders-gen with pthreads (#8835 ) When using CMake to build with Vulkan support, compiling vulkan-shaders-gen fails due to missing a CMakeLists.txt specification to link vulkan-shaders-gen with the threading library, resulting in the following error. [5/172] Linking CXX executable bin/vulkan-shaders-gen FAILED: bin/vulkan-shaders-gen : && /usr/bin/c++ ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o -o bin/vulkan-shaders-gen && : ld: error: undefined symbol: pthread_create >>> referenced by vulkan-shaders-gen.cpp >>> ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o:(std::__1::__libcpp_thread_create[abi:se180100](pthread*, >>> void ()(void), void*)) c++: error: linker command failed with exit code 1 (use -v to see invocation) [6/172] Generating build details from Git -- Found Git: /usr/local/bin/git (found version "2.45.2") ninja: build stopped: subcommand failed. Add the CMakeLists.txt specification to link vulkan-shaders-gen with the threading library and fix the above error. Fixes #8834	2024-08-06 15:21:47 +02:00
MaggotHATE	efda90c93a	[Vulkan] Fix compilation of `vulkan-shaders-gen` on w64devkit after `e31a4f6` (#8880 ) * Fix compilation issue in `vulkan-shaders-gen` `e31a4f6797` broke compilation on w64devkit. Including `algorithm` seems to fix that. * Guard it under `#ifdef _WIN32`	2024-08-06 13:32:03 +02:00
Georgi Gerganov	0bf16de07b	contributing : add note about write access	2024-08-06 11:48:01 +03:00
Molly Sophia	2d5dd7bb3f	ggml : add epsilon as a parameter for group_norm (#8818 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-08-06 10:26:46 +03:00
Douglas Hanley	cdd1889de6	convert : add support for XLMRoberta embedding models (#8658 ) * add conversion for bge-m3; small fix in unigram tokenizer * clean up and simplify XLMRoberta conversion	2024-08-06 10:20:54 +03:00
Mengqing Cao	c21a896405	[CANN]: Fix ggml_backend_cann_buffer_get_tensor (#8871 ) * cann: fix ggml_backend_cann_buffer_get_tensor 1. fix data ptr offset 2. enable the acquisition of incomplete tensors * fix backend cann set_tensor	2024-08-06 12:42:42 +08:00
Neo Zhang	d4ff847153	[SYCL] correct cmd name (#8877 )	2024-08-06 09:09:12 +08:00
Liu Jia	0a4ce78681	common : Changed tuple to struct (TODO fix) (#8823 ) * common : Changed tuple to struct (TODO fix) Use struct `llama_init_result` to replace the previous std::tuple<struct llama_model *, struct llama_context *> * delete llama_init_default_params() * delete the extra whitespace	2024-08-05 18:14:10 +02:00
wangshuai09	bc0f887e15	cann: fix buffer_num and runtime speed slowly error (#8865 )	2024-08-05 21:10:37 +08:00
Eric Curtin	b42978e7e4	readme : add ramalama to the availables UI (#8811 ) ramalama is a repo agnostic boring CLI tool that supports pulling from ollama, huggingface and oci registries. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-08-05 15:45:01 +03:00
Justine Tunney	b9dfc25ca3	ggml : fix overflows in elu function (#8866 ) It's helpful to use expm1f(x), because expf(x)-1 will result in overflow for 25% of single-precision floating point numbers.	2024-08-05 15:43:40 +03:00
Brian	1ef14b3007	py: Add more authorship metadata from model card (#8810 ) * py: add more authorship metadata from model card * fixup! py: add more authorship metadata from model card	2024-08-05 21:15:28 +10:00
fairydreaming	d3f0c7166a	Stop the generation when <|eom_id|> token is encountered - needed for Llama 3.1 tool call support (#8858 ) * gguf-py, llama : add constants and methods related to Llama-3.1 <|eom_id|> token * llama : find Llama-3.1 <|eom_id|> token id during vocab loading * llama-vocab : add Llama-3.1 <|eom_id|> token to the set of tokens stopping the generation --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-08-05 09:38:01 +02:00
stduhpf	e31a4f6797	cmake: fix paths for vulkan shaders compilation on Windows (#8573 ) * Vulkan-shaders: attempt fix compilation on windows * fix miss-matched parenthesis	2024-08-05 08:18:27 +02:00
BarfingLemurs	400ae6f65f	readme : update model list (#8851 )	2024-08-05 08:54:10 +03:00
Georgi Gerganov	f1ea5146d7	llama : better replace_all (#8852 )	2024-08-05 08:53:39 +03:00
0cc4m	064cdc265f	vulkan : fix Qantized Mat-Vec Mul on AMD GPUs for ncols < 64 (#8855 ) * Fix Vulkan mul mat vec invalid results when ncols < warp size * Only run backend ops mul mat vec block size test if block size not already covered	2024-08-05 08:52:55 +03:00
Georgi Gerganov	5587e57a76	sync : ggml ggml-ci	2024-08-05 08:50:57 +03:00
0cc4m	a3738b2fa7	vulkan : implement Stable Diffusion operators (ggml/904) * Fix Vulkan repeat op * Implement Vulkan concat op * Delete old Vulkan shader generator * Implement Vulkan im2col op * Implement Vulkan unary gelu_quick op * Implement Vulkan group_norm op * Implement Vulkan timestep_embedding op * Implement Vulkan upscale op * Fix Vulkan vk_context tensor extra index issue * Fix Vulkan matmul shader parameter bug * Properly fix Vulkan matmul shader parameter bug * Add Vulkan ADD f16 + f32 -> f16 operator support * Implement Vulkan tanh op * Fix Vulkan group count too large Validation error on non-Nvidia GPUs * Throw error when too much memory is requested * Fix another Vulkan group count too large Validation error on non-Nvidia GPUs * Fix matmul MMQ condition * Implement Vulkan pad op * Fix Vulkan crash when tensor is used multiple times in a compute graph * Add Vulkan CONCAT f16 + f16 -> f16 op * Add Vulkan LEAKY_RELU op	2024-08-05 08:50:57 +03:00
Daniel Bevenius	655858ace0	ggml : move c parameter comment to ggml_rope_ext (ggml/901) This commit moves the comment for the c parameter from ggml_rope to ggml_rope_ext. The comment is currently incorrect as ggml_rope does not have a c parameter (freq_factors tensor). Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-08-05 08:50:57 +03:00
wangshuai09	c02b0a8a4d	cann: support q4_0 model (#8822 )	2024-08-05 12:22:30 +08:00
Brandon Squizzato	0d6fb52be0	Install curl in runtime layer (#8693 )	2024-08-04 20:17:16 +02:00
ardfork	978ba3d83d	Server: Don't ignore llama.cpp params (#8754 ) * Don't ignore llama.cpp params * Add fallback for max_tokens	2024-08-04 20:16:23 +02:00
Brian Cunnie	ecf6b7f23e	batched-bench : handle empty `-npl` (#8839 ) * [example] batched-bench "segmentation fault" When `llama-batched-bench` is invoked _without_ setting `-npl`, "number of parallel prompts", it segfaults. The segfault is caused by invoking `max_element()` on a zero-length vector, `n_pl` This commit addresses that by first checking to see if the number of parallel prompts is zero, and if so sets the maximum sequence size to 1; otherwise, sets it to the original, the result of `max_element()`. Fixes, when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf` ``` * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0) frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28 69 llama_context_params ctx_params = llama_context_params_from_gpt_params(params); 70 71 // ensure enough sequences are available -> 72 ctx_params.n_seq_max = std::max_element(n_pl.begin(), n_pl.end()); ``` Update examples/batched-bench/batched-bench.cpp Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <git@compilade.net>	2024-08-04 13:55:03 +03:00
Daniel Bevenius	01aae2b497	baby-llama : remove duplicate vector include	2024-08-04 13:24:59 +03:00
Georgi Gerganov	4b77ea95f5	flake.lock: Update (#8847 )	2024-08-03 19:53:20 -07:00
jdomke	76614f352e	ggml : reading the runtime sve config of the cpu (#8709 ) * ggml : reading the runtime sve config of the cpu * change to one time init to prevent performance drop * prefix variable to avoid possible conflicts * revert xxhash fix and add brackets --------- Co-authored-by: domke <673751-domke@users.noreply.gitlab.com>	2024-08-03 18:34:41 +02:00
Sigbjørn Skjæret	b72c20b85c	Fix conversion of unnormalized BF16->BF16 weights (#7843 ) * add truncate_bf16 * truncate intermediate fp32 if converting bf16 to bf16 * fix masking in __compute_fp32_to_bf16 * np.int16 no longer used * missing cast and additional numpy 2.x fix * ggml-impl : do not flush bf16 subnormals to zero * ggml : add reference fp32 to bf16 conversion The fast version is no longer equivalent for all platforms because of the handling of subnormal values. * gguf-py : remove flush to zero for bf16 subnormals * gguf-py : remove float32 truncation to bf16 Rounding achieves the same thing in the cases where this was used. * missed prototype update in merge * merge cleanup --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net>	2024-08-02 15:11:39 -04:00
Mengqing Cao	e09a800f9a	cann: Fix ggml_cann_im2col for 1D im2col (#8819 ) * fix ggml_cann_im2col for 1D im2col * fix build warning	2024-08-02 16:50:53 +08:00
Ouadie EL FAROUKI	0fbbd88458	[SYCL] Fixing wrong VDR iq4nl value (#8812 )	2024-08-02 08:55:17 +08:00
matteo	afbb4c1322	ggml-cuda: Adding support for unified memory (#8035 ) * Adding support for unified memory * adding again the documentation about unified memory * refactoring: Moved the unified memory code in the correct location. * Fixed compilation error when using hipblas * cleaning up the documentation * Updating the documentation Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * adding one more case where the PR should not be enabled --------- Co-authored-by: matteo serva <matteo.serva@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-08-01 23:28:28 +02:00
Alex O'Connell	b7a08fd5e0	Build: Only include execinfo.h on linux systems that support it (#8783 ) * Only enable backtrace on GLIBC linux systems * fix missing file from copy * use glibc macro instead of defining a custom one	2024-08-01 18:53:46 +02:00
slaren	7a11eb3a26	cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X (#8800 ) cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X update asserts * only use dmmv for supported types * add test	2024-08-01 15:26:22 +02:00
wangshuai09	c8a0090922	cann: support q8_0 for Ascend backend (#8805 )	2024-08-01 10:39:05 +08:00
Igor Okulist	afbbcf3c04	server : update llama-server embedding flag documentation (#8779 ) Fixes #8763	2024-07-31 19:59:09 -04:00
Clint Herron	ed9d2854c9	Build: Fix potential race condition (#8781 ) * Fix potential race condition as pointed out by @fairydreaming in #8776 * Reference the .o rather than rebuilding every time. * Adding in CXXFLAGS and LDFLAGS * Removing unnecessary linker flags.	2024-07-31 15:51:06 -04:00
pculliton	398ede5efe	Adding Gemma 2 2B configs (#8784 ) * Adding Gemma 2 2B configs Updates to Q scaling and Gemma 2 model sizes to match v2 2B model. * Update src/llama.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-07-31 17:12:10 +02:00

4095 Commits