llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2025-01-11 03:01:45 +00:00

Author	SHA1	Message	Date
mqy	5feefb32b3	threading: add suspend/resume APIs, so it's possible to run a thread pool at session level	2023-06-18 18:57:33 +08:00
mqy	5abb8aefea	fix warning	2023-06-18 18:55:44 +08:00
mqy	0ec4dab864	fixed break and assertion from select; try to fix cuda link error	2023-06-18 14:59:44 +08:00
mqy	2193ab6281	fix cuda build error	2023-06-18 14:27:56 +08:00
mqy	67bb367962	typos	2023-06-18 14:27:56 +08:00
mqy	06b00827a0	bulk refactoring task profile and related to run CL GPU offloading. * removed ggml_task_backend, in favour of ggml_task_profile.runner and newly added id and name. * extracted mul_mat blas codes into ggml_compute_forward_mul_mat_blas, thus align with CUDA/CL a bit more and make it easier to fix profile and run tune. * rewrote task profile and update/add some cuda/cl codes, finally made CL GPU offloading work. * misc minor fix/update to tune, the data format was changed.	2023-06-18 14:27:56 +08:00
mqy	6b83a3e16f	try to make CL run w/o tuning, but -ngl gets stuck with no output. had to add task runner and profile id, many changes, see the f codes	2023-06-18 14:27:56 +08:00
mqy	5342dc075f	tuning: support k_quants; disabled rope shapes (workaround); make cache thread safe; fixed shape comparison	2023-06-18 14:27:56 +08:00
mqy	21e9379707	tuning: add f16, todo: f32 failed with CL	2023-06-18 14:27:56 +08:00
mqy	7c05049f8b	tuning: check GPU offloading before loading model	2023-06-18 14:27:56 +08:00
mqy	bb590f1482	Workaround to set node->backend	2023-06-18 14:27:56 +08:00
mqy	9106232260	threading test: At github, Windows can take more than 20 seconds to start 15 threads. Let's silently ignore when we see two adjacent slowdowns.	2023-06-18 14:27:56 +08:00
mqy	48016f685c	bulk refactored task profile to support complete fallback; enable tune by default for ease of dev	2023-06-18 14:27:56 +08:00
mqy	1b041d7737	threading test: improve readability of both code and output	2023-06-18 14:27:56 +08:00
mqy	213f133701	initial	2023-06-18 14:27:53 +08:00
Georgi Gerganov	ce2c7d72e2	metal : handle buffers larger than device's maxBufferLength (#1826) * metal : handle buffers larger than device's maxBufferLength * metal : print more verbose device info + handle errors * metal : fix prints for overlapping views * metal : minimize view overlap to try to utilize device memory better	2023-06-18 09:09:47 +03:00
Howard Su	57cd69460f	cmake : add CUDA_ARCHITECTURES to new target ggml_static (#1917)	2023-06-18 07:29:47 +03:00
Georgi Gerganov	b2416493ab	make : do not print help for simple example	2023-06-17 20:55:03 +03:00
Georgi Gerganov	4f9c43e3bd	minor : warning fixes	2023-06-17 20:24:11 +03:00
Johannes Gäßler	2c9380dd2f	Only one CUDA stream per device for async compute (#1898)	2023-06-17 19:15:02 +02:00
Georgi Gerganov	051e1b0e6a	llama : fix kv_cache `n` init (close #1903)	2023-06-17 19:31:20 +03:00
DaniAndTheWeb	86c7571864	make : update for latest Arch (#1701) With the upcoming change to the openblas package in arch the Makefile workaround is no longer needed.	2023-06-17 19:17:22 +03:00
Howard Su	3d59ec5935	ggml : fix warnings under MSVC (#1908)	2023-06-17 18:46:15 +03:00
Aaron Miller	0711a5f6dc	metal : add norm, cpy f16->f16, alibi kernels (#1823)	2023-06-17 17:37:49 +03:00
Faez Shakil	fc45a81bc6	exposed modules so that they can be invoked by nix run github:ggerganov/llama.cpp#server etc (#1863)	2023-06-17 14:13:05 +02:00
Randall Fitzgerald	794db3e7b9	Server Example Refactor and Improvements (#1570) A major rewrite for the server example. Note that if you have built something on the previous server API, it will probably be incompatible. Check out the examples for how a typical chat app could work. This took a lot of effort, there are 24 PR's closed in the submitter's repo alone, over 160 commits and a lot of comments and testing. Summary of the changes: - adds missing generation parameters: tfs_z, typical_p, repeat_last_n, repeat_penalty, presence_penalty, frequency_penalty, mirostat, penalize_nl, seed, ignore_eos - applies missing top k sampler - removes interactive mode/terminal-like behavior, removes exclude parameter - moves threads and batch size to server command-line parameters - adds LoRA loading and matches command line parameters with main example - fixes stopping on EOS token and with the specified token amount with n_predict - adds server timeouts, host, and port settings - adds expanded generation complete response; adds generation settings, stop reason, prompt truncated, model used, and final text - sets defaults for unspecified parameters between requests - removes /next-token endpoint and as_loop parameter, adds stream parameter and server-sent events for streaming - adds CORS headers to responses - adds request logging, exception printing and optional verbose logging - adds better stopping words handling when matching multiple tokens and while streaming, or when it finishes on a partial stop string - adds printing an error when it can't bind to the host/port specified - fixes multi-byte character handling and replaces invalid UTF-8 characters on responses - prints timing and build info on startup - adds logit bias to request parameters - removes embedding mode - updates documentation; adds streaming Node.js and Bash examples - fixes code formatting - sets server threads to 1 since the current global state doesn't work well with simultaneous requests - adds truncation of the input prompt and better context reset - removes token limit from the input prompt - significantly simplified the logic and removed a lot of variables --------- Co-authored-by: anon998 <131767832+anon998@users.noreply.github.com> Co-authored-by: Henri Vasserman <henv@hot.ee> Co-authored-by: Felix Hellmann <privat@cirk2.de> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Lesaun Harvey <Lesaun@gmail.com>	2023-06-17 14:53:04 +03:00
Jiří Podivín	5ddf7ea1fb	hooks : setting up flake8 and pre-commit hooks (#1681) Small, non-functional changes were made to non-compliant files. These include breaking up long lines, whitespace sanitation and unused import removal. Maximum line length in python files was set to a generous 125 chars, in order to minimize number of changes needed in scripts and general annoyance. The "txt" prompts directory is excluded from the checks as it may contain oddly formatted files and strings for a good reason. Signed-off-by: Jiri Podivin <jpodivin@gmail.com>	2023-06-17 13:32:48 +03:00
Gustavo Rocha Dias	bac19927c3	readme : alternative way to build for Android with CLBlast. (#1828)	2023-06-17 12:01:06 +03:00
Kerfuffle	b4c6f46f17	Allow cmake to build ggml as a library (#1896) * Allow cmake to build ggml as a library * A ggml_static library will be created * When BUILD_SHARED_LIBS is enabled, ggml_shared will also be built	2023-06-17 01:49:42 -06:00
David Yang	92f20d9942	train : get raw text instead of page with html (#1905) We probably want to train using just the text of Shakespeare instead of the html of the page displaying his work.	2023-06-17 09:51:54 +03:00
0cc4m	d411968e99	opencl : support k-quants (#1836) * Porting q2_k kernel to OpenCL * Set global and local sizes for kernel calls for dequantizing k-quants * Added q6_k kernel * Fix q4_k opencl struct order * Replace uchar with uint8_t * Finish dequant kernels * Added OpenCL DMMV kernels * Fix q2_k, improve code * Fix q3_k * Shorten switch statements * Improve code formatting --------- Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>	2023-06-16 21:59:49 +03:00
SuperUserNameMan	b41b4cad6f	examples : add "simple" (#1840) * Create `simple.cpp` * minimalist example `CMakeLists.txt` * Update Makefile for minimalist example * remove 273: Trailing whitespace * removed trailing white spaces simple.cpp * typo and comments simple.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-16 21:58:09 +03:00
Zenix	13fe9d2d84	cmake : add auto detection of BLAS_INCLUDE_DIRS (#1886)	2023-06-16 21:53:04 +03:00
Johannes Gäßler	ac3b886953	llama : fix embd when offloading non-repeating layers (#1891)	2023-06-16 21:25:51 +03:00
FrankHB	5b9ccaf104	Fixed possible macro redefinition (#1892) MinGW libstdc++ may define `NOMINMAX` unconditionally. This fixes the case when it is already defined.	2023-06-16 21:25:01 +03:00
Borislav Stanimirov	9cbf50c041	build : fix and ignore MSVC warnings (#1889)	2023-06-16 21:23:53 +03:00
Kawrakow	3d01122610	CUDA : faster k-quant dot kernels (#1862) * cuda : faster k-quant dot kernels * Improve Q2_K dot kernel on older GPUs We now have a K_QUANTS_PER_ITERATION macro, which should be set to 1 on older and to 2 on newer GPUs. With this, we preserve the performance of the original PR on RTX-4080, and are faster compared to master on GTX-1660. * Improve Q6_K dot kernel on older GPUs Using the same K_QUANTS_PER_ITERATION macro as last commit, we preserve performance on RTX-4080 and speed up Q6_K on a GTX-1660. * Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile Allowed values are 1 or 2. 2 gives the best performance on modern GPUs and is set as default. On older GPUs 1 may work better. * PR comments --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-06-16 20:08:44 +03:00
Borislav Stanimirov	602c748863	gitignore : add several entries specific to Visual Studio (#1888)	2023-06-16 09:58:11 +03:00
Johannes Gäßler	a09f9195be	Fixed CUDA runtime version check (#1879)	2023-06-15 21:49:08 +02:00
Georgi Gerganov	bed9275617	cmake : remove whitespaces	2023-06-15 21:56:50 +03:00
yangli2	c36e81da62	examples : add chat-vicuna.sh (#1854) Co-authored-by: Yang Li <yangliyl@google.com>	2023-06-15 21:05:53 +03:00
Igor Okulist	3559433fec	cmake : set include path for OpenBlas (#1830)	2023-06-15 20:51:26 +03:00
Frederik Vogel	69b34a0e80	swift : Package compile breaks due to ggml-metal.metal (#1831) * Ignore metal file in spm * Add ggml.h to spm public Headers --------- Co-authored-by: Vogel Frederik <vogel.frederik@linecorp.com>	2023-06-15 20:47:04 +03:00
daboe01	cf267d1c71	make : add train-text-from-scratch (#1850) * make finetuning example accessible * fixed: target was in wrong line * fixed: name of executable was wrong * fixed: naming of binary * fixed: model path was wrong * fixed clean target * Update examples/train-text-from-scratch/README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-15 20:42:48 +03:00
Srinivas Billa	9dda13e5e1	readme : server compile flag (#1874) Explicitly include the server make instructions for C++ noobs like me ;)	2023-06-15 20:36:38 +03:00
sandyiscool	37e257c48e	make : clean *.so files (#1857)	2023-06-15 20:36:06 +03:00
Howard Su	64cc19b4fe	Fix the validation of main device (#1872)	2023-06-15 19:29:59 +02:00
Georgi Gerganov	4bfcc855ab	metal : parallel command buffer encoding (#1860) * metal : parallel command buffer encoding * metal : determine number of command buffers based on gf->n_threads	2023-06-15 20:29:48 +03:00
Johannes Gäßler	6b8312e797	Better error when using both LoRA + GPU layers (#1861)	2023-06-15 19:06:46 +02:00
Johannes Gäßler	254a7a7a5f	CUDA full GPU acceleration, KV cache in VRAM (#1827) * Fixed CUDA RoPE * ggml_cuda_mul_mat_vec_p021 * ggml_cuda_scale * ggml_cuda_diag_mask_inf * ggml_is_permuted * ggml_cuda_cpy * flatten rows for ggml_cuda_op * Added a --low-vram option * Fixed Windows performance * Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM	2023-06-14 19:47:19 +02:00

719 Commits