llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-27 20:04:35 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	d2b6ca13ad	gguf : add array support	2023-07-27 14:53:07 +03:00
Georgi Gerganov	d89533dff6	gguf : expose the gguf_type enum through the API for now	2023-07-27 11:10:34 +03:00
Georgi Gerganov	d8491fc7e3	gguf : add comments	2023-07-26 23:00:24 +03:00
Georgi Gerganov	5628ec7163	gguf : read / write sample models	2023-07-26 22:40:45 +03:00
Georgi Gerganov	d313c0fa33	gguf : simplify gguf_get_val	2023-07-26 18:53:57 +03:00
Georgi Gerganov	cb871fa022	gguf : do not support passing existing ggml_context to gguf_init	2023-07-26 18:48:52 +03:00
Georgi Gerganov	860c9c63ce	gguf : add gguf_get_tensor_name()	2023-07-26 18:21:14 +03:00
Georgi Gerganov	78b226a959	gguf : initial model loading - not tested	2023-07-26 18:21:14 +03:00
Georgi Gerganov	d91b985d2d	gguf : read tensor info	2023-07-26 18:21:13 +03:00
Georgi Gerganov	8d6acfec12	gguf : read header + meta data	2023-07-26 18:21:13 +03:00
Georgi Gerganov	6873148771	gguf : first API pass	2023-07-26 18:21:13 +03:00
slaren	5488fb789e	ggml : allocate graphs in a context (#2392 ) * ggml : graph allocation in contexts * allocate work buffer as a ggml_object in ggml_graph_compute_with_ctx * llama.cpp : allocate graph in the context * add GGML_PAD --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-26 15:56:53 +02:00
slaren	07aaa0f63f	ggml : fix ggml_flash_attn to use op_params (#2387 ) * ggml : fix ggml_flash_attn to use op_params	2023-07-25 16:20:12 +02:00
Jiahao Li	875086bdb9	ggml : relax contiguous constraints in activation function (#2371 )	2023-07-25 15:58:32 +03:00
slaren	da1889834a	ggml : improve graph build time via hash table lookup (#2329 ) * improve graph build time * ggml_tensor : use 1 bit per flag * use a hash table instead	2023-07-25 15:32:20 +03:00
slaren	41c674161f	make rms_norm_eps a parameter (#2374 ) * make rms_norm_eps a parameter * add rms_norm_eps to command line * fix baby llama, test-grad0 * use scientific notation for eps param in the help ggml-ci	2023-07-24 17:57:12 +02:00
Georgi Gerganov	5b2b2dc6ae	ggml : sync (unary ops refactor, static-correctness) (#2370 ) * ggml : sync (unary ops, tests) ggml-ci * tests : remove unnecessary funcs	2023-07-24 14:46:21 +03:00
slaren	3602ac4255	fix n_tasks (#2342 ) ggml-ci	2023-07-23 15:19:39 +02:00
slaren	95a6c595e7	ggml: move op parameters from tensors to ggml_tensor::op_params (#2333 ) * ggml: move op parameters from tensors to ggml_tensor::op_params * alibi: use memcpy for float params * remove `src[1] = NULL` in ops	2023-07-23 14:36:02 +02:00
Georgi Gerganov	0db14fef06	ggml : fix the rope fix (`513f861953`)	2023-07-21 15:16:55 +03:00
Georgi Gerganov	513f861953	ggml : fix rope args order + assert (#2054 )	2023-07-21 14:51:34 +03:00
Qingyou Meng	672dda10e4	ggml : fixed runtime bugs and compile errors related to GGML_PERF and GGML_DEBUG (#2219 ) * fixed runtime bugs and compile errors related to GGML_PERF and GGML_DEBUG * remove ifdef GGML_PERF; update fmt	2023-07-16 22:57:28 +03:00
Xiao-Yong Jin	6e7cca4047	llama : add custom RoPE (#2054 ) * Implement customizable RoPE The original RoPE has pre-defined parameters theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2] Our customizable RoPE, ggml_rope_custom_inplace, uses theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2] with the default matches the original scale = 1.0 base = 10000 The new command line arguments --rope-freq-base --rope-freq-scale set the two new RoPE parameter. Recent researches show changing these two parameters extends the context limit with minimal loss. 1. Extending Context to 8K kaiokendev https://kaiokendev.github.io/til#extending-context-to-8k 2. Extending Context Window of Large Language Models via Positional Interpolation Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian https://arxiv.org/abs/2306.15595 3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/user/bloc97 https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/ For the bold, try adding the following command line parameters to your favorite model: -c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5 * ggml-metal: fix custom rope * common: fix argument names in help * llama: increase MEM_REQ_EVAL for MODEL_3B It avoids crashing for quantized weights on CPU. Better ways to calculate the required buffer size would be better. * llama: make MEM_REQ_EVAL depend on n_ctx * server: use proper Content-Type in curl examples Without the header Content-Type: application/json, curl will POST with Content-Type: application/x-www-form-urlencoded Though our simple server doesn't care, the httplib.h used has a limit with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192 With Content-Type: application/json, we can send large json data. * style : minor fixes, mostly indentations * ggml : fix asserts --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-15 13:34:16 +03:00
Evan Miller	e8035f141e	ggml : fix static_assert with older compilers #2024 (#2218 )	2023-07-14 21:55:56 +03:00
Georgi Gerganov	697966680b	ggml : sync (ggml_conv_2d, fix mul_mat bug, CUDA GLM rope)	2023-07-14 16:36:41 +03:00
Georgi Gerganov	975221e954	ggml : broadcast mul_mat + conv batch support (#2199 ) * ggml : broadcast mul_mat + conv batch support * ggml : apply mul_mat broadcast fix by @jploski	2023-07-12 20:51:29 +03:00
Georgi Gerganov	4523d10d0c	ggml : add ggml_pool_1d and ggml_pool_2d	2023-07-12 20:32:15 +03:00
Georgi Gerganov	20d7740a9b	ggml : sync (abort callback, mul / add broadcast, fix alibi) (#2183 )	2023-07-11 22:53:34 +03:00
Spencer Sutton	5bf2a27718	ggml : remove src0 and src1 from ggml_tensor and rename opt to src (#2178 ) * Add ggml changes * Update train-text-from-scratch for change * mpi : adapt to new ggml_tensor->src --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-11 19:31:10 +03:00
clyang	3bbc1a11f0	ggml : fix buidling with Intel MKL but ask for "cblas.h" issue (#2104 ) (#2115 ) * Fix buidling with Intel MKL but ask for "cblas.h" issue * Use angle brackets to indicate the system library	2023-07-09 11:12:20 +03:00
Qingyou Meng	1d656d6360	ggml : change ggml_graph_compute() API to not require context (#1999 ) * ggml_graph_compute: deprecate using ggml_context, try resolve issue #287 * rewrite: no longer consider backward compitability; plan and make_plan * minor: rename ctx as plan; const * remove ggml_graph_compute from tests/test-grad0.c, but current change breaks backward * add static ggml_graph_compute_sugar() * minor: update comments * reusable buffers * ggml : more consistent naming + metal fixes * ggml : fix docs * tests : disable grad / opt + minor naming changes * ggml : add ggml_graph_compute_with_ctx() - backwards compatible API - deduplicates a lot of copy-paste * ci : enable test-grad0 * examples : factor out plan allocation into a helper function * llama : factor out plan stuff into a helper function * ci : fix env * llama : fix duplicate symbols + refactor example benchmark * ggml : remove obsolete assert + refactor n_tasks section * ggml : fix indentation in switch * llama : avoid unnecessary bool * ggml : remove comments from source file and match order in header --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-07 19:24:01 +03:00
Georgi Gerganov	7242140283	ggml : remove sched_yield() call in ggml_graph_compute_thread() (#2134 )	2023-07-07 18:37:10 +03:00
Georgi Gerganov	ec326d350c	ggml : fix bug introduced in #1237	2023-07-05 20:44:11 +03:00
Stephan Walter	1b107b8550	ggml : generalize `quantize_fns` for simpler FP16 handling (#1237 ) * Generalize quantize_fns for simpler FP16 handling * Remove call to ggml_cuda_mul_mat_get_wsize * ci : disable FMA for mac os actions --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-07-05 19:13:06 +03:00
Georgi Gerganov	ed9a54e512	ggml : sync latest (new ops, macros, refactoring) (#2106 ) - add ggml_argmax() - add ggml_tanh() - add ggml_elu() - refactor ggml_conv_1d() and variants - refactor ggml_conv_2d() and variants - add helper macros to reduce code duplication in ggml.c	2023-07-04 21:54:11 +03:00
Georgi Gerganov	46088f7231	ggml : fix build with OpenBLAS (close #2066 )	2023-07-02 09:46:46 +03:00
Qingyou Meng	b1ca8f36a9	ggml : disable GGML_TASK_INIT and GGML_TASK_FINALIZE by default (#1995 ) Will not be scheduled unless explicitly enabled.	2023-07-01 18:42:43 +03:00
Erik Scholz	9d23589d63	fix pthreads setaffinity usage on android (#2020 )	2023-06-27 19:06:33 +02:00
Georgi Gerganov	d9779021bd	ggml : add support for ChatGLM RoPE	2023-06-27 00:06:51 +03:00
Georgi Gerganov	c824d2e368	ggml : avoid conv 2d kernel round up	2023-06-26 21:03:59 +03:00
zrm	b853d45601	ggml : add NUMA support (#1556 ) * detect NUMA systems and pin work threads to nodes (linux) * disable mmap prefetch/readahead for NUMA systems * avoid sending finalize op to thread pool if it does nothing * silence robot * fix args * make --numa a param * recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement * lower synchronization overhead * statically allocate * move numa state to g_state * add description for --numa * ggml : minor style changes * ggml : minor style + try fix sanitizer build * llama : allow to initialize backend with NUMA support * llama : avoid ggml include in llama-util.h * ggml : style / formatting * ggml : fix handling of ops with n_threads > n_tasks > 1 * server : utilize numa parameter --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-26 20:57:59 +03:00
Georgi Gerganov	bd34cdde38	ggml : sync latest ggml (custom operators)	2023-06-25 14:25:08 +03:00
Robyn	5ec8dd5a3c	#1869 Fix null reference errors when training from scratch with CUDA (#1907 ) * #1869 Fix null reference errors when training from scratch with CUDA build Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly. * ggml : do not dereference src0 if NULL --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-24 20:10:29 +02:00
slaren	f2c754e1c3	ggml : improve ggml_graph_dump_dot, add ggml_format_name (#1978 ) * Improve ggml_graph_dump_dot, add ggml_format_name * add more automatic names to view ops * fix name of copies	2023-06-24 13:57:18 +03:00
Georgi Gerganov	18b35625c3	ggml : fix bug in LBFGS optimizer (found by ggml tests)	2023-06-19 20:43:30 +03:00
Georgi Gerganov	b97ca431db	ggml : sync latest ggml repo (#1924 ) * ggml : sync latest ggml repo * ggml : remove unused comments * ggml : asserts	2023-06-19 18:12:33 +03:00
l3utterfly	8596af4277	ggml : fix bug in ggml_compute_forward_add_q_f32 (#1918 )	2023-06-18 14:19:16 +03:00
Georgi Gerganov	ce2c7d72e2	metal : handle buffers larger than device's maxBufferLength (#1826 ) * metal : handle buffers larger than device's maxBufferLength * metal : print more verbose device info + handle errors * metal : fix prints for overlapping views * metal : minimize view overlap to try to utilize device memory better	2023-06-18 09:09:47 +03:00
Borislav Stanimirov	9cbf50c041	build : fix and ignore MSVC warnings (#1889 )	2023-06-16 21:23:53 +03:00
Johannes Gäßler	254a7a7a5f	CUDA full GPU acceleration, KV cache in VRAM (#1827 ) * Fixed CUDA RoPE * ggml_cuda_mul_mat_vec_p021 * ggml_cuda_scale * ggml_cuda_diag_mask_inf * ggml_is_permuted * ggml_cuda_cpy * flatten rows for ggml_cuda_op * Added a --low-vram option * Fixed Windows performance * Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM	2023-06-14 19:47:19 +02:00


229 Commits