llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-11 21:39:52 +00:00

Author	SHA1	Message	Date
slaren	6744dbe924	ggml : use ggml_row_size where possible (#4472 ) * ggml : use ggml_row_size where possible ggml-ci * ggml : move ggml_nbytes_split to ggml-cuda.cu	2023-12-14 20:05:21 +01:00
slaren	cafcd4f895	ggml : remove n_dims from ggml_tensor (#4469 ) ggml-ci	2023-12-14 16:52:08 +01:00
LostRuins	20a68a7030	ggml : add ggml_row_size() (fixes llama out of space) (#4461 ) * Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values * do not cast to size_t, instead just use doubles * ggml : add ggml_row_size(), deprecate ggml_type_sizef() * ggml : fix row size compute to avoid overflows * tests : fix sizey -> sizez --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-14 14:13:33 +02:00
Georgi Gerganov	55e87c3749	ggml : fix OpenCL broadcast requirement for ggml_mul (close #4453 )	2023-12-14 10:35:29 +02:00
Georgi Gerganov	4d98d9a656	sync : ggml (SD ops, tests, kernels) (#4444 ) * sync : ggml (SD ops, tests, kernels) ggml-ci * cuda : restore im2col ggml-ci * metal : fix accuracy of dequantization kernels ggml-ci * cuda : restore correct im2col ggml-ci * metal : try to fix moe test by reducing expert size ggml-ci * cuda : fix bin bcast when src1 and dst have different types ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-13 21:54:54 +02:00
slaren	799a1cb13b	llama : add Mixtral support (#4406 ) * convert : support Mixtral as LLAMA arch * convert : fix n_ff typo * llama : model loading * ggml : sync latest ggml_mul_mat_id * llama : update graph to support MoE * llama : fix cur -> cur_expert * llama : first working version * llama : fix expert weighting in the FFN * ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) * ggml : add n_as argument to ggml_mul_mat_id * ggml : fix ggml_get_rows to take into account ne02 / ne11 * metal : add more general support for ggml_get_rows + tests * llama : add basic support for offloading moe with CUDA * metal : add/mul/div use general kernel when src1 not cont * metal : reduce the kernel launches for ggml_mul_mat_id * ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D * ggml : update get_rows f16 and q * cuda : support non-contiguous src1 in get_rows * llama : offload missing ffn_moe_silu * metal : fix ggml_get_rows to work with non-cont src1 * metal : add indirect mat-vec kernels for all quantization types * llama : do not quantize expert gating tensors * llama : add n_expert and n_expert_used to hparams + change quants * test-backend-ops : add moe test * cuda : fix get_rows when ncols is odd * convert : determine n_ctx correctly * metal : fix ggml_mul_mat_id for F32 * test-backend-ops : make experts more evenly probable (test_moe) * test-backend-ops : cleanup, add moe test for batches * test-backend-ops : add cpy from f32 -> all types test * test-backend-ops : fix dequantize block offset * llama : fix hard-coded number of experts * test-backend-ops : simplify and disable slow tests to avoid CI timeout * test-backend-ops : disable MOE test with thread sanitizer * cuda : fix mul_mat_id with multi gpu * convert : use 1e6 rope_freq_base for mixtral * convert : fix style * convert : support safetensors format * gguf-py : bump version * metal : add cpy f16 -> f32 kernel * metal : fix binary ops for ne10 % 4 != 0 * test-backend-ops : add one more sum_rows test * ggml : do not use BLAS with ggml_mul_mat_id * convert-hf : support for mixtral-instruct (#4428) * convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct * convert : use sentencepiece tokenizer for Mixtral-instruct * convert : make flake8 happy * metal : fix soft_max kernels ref: `1914017863` * metal : limit kernels to not use more than the allowed threads --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Radek Pilar <github@mrkva.eu>	2023-12-13 14:04:25 +02:00
Richard Kiss	9494d7c477	english : use `typos` to fix comments and logs (#4354 )	2023-12-12 11:53:36 +02:00
Georgi Gerganov	fe680e3d10	sync : ggml (new ops, tests, backend, etc.) (#4359 ) * sync : ggml (part 1) * sync : ggml (part 2, CUDA) * sync : ggml (part 3, Metal) * ggml : build fixes ggml-ci * cuda : restore lost changes * cuda : restore lost changes (StableLM rope) * cmake : enable separable compilation for CUDA ggml-ci * ggml-cuda : remove device side dequantize * Revert "cmake : enable separable compilation for CUDA" This reverts commit `09e35d04b1`. * cuda : remove assert for rope * tests : add test-backend-ops * ggml : fix bug in ggml_concat * ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()` * ci : try to fix macOS * ggml-backend : remove backend self-registration * ci : disable Metal for macOS cmake build ggml-ci * metal : fix "supports family" call * metal : fix assert * metal : print resource path ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-07 22:26:54 +02:00
Georgi Gerganov	fbbc42827b	ggml : reuse ggml_get_n_tasks() in ggml_graph_plan() (#4308 ) * ggml : fix soft max out-of-bounds access ggml-ci * ggml : reuse ggml_get_n_tasks() in ggml_graph_plan() ggml-ci	2023-12-03 15:56:35 +02:00
Georgi Gerganov	adf3de4f69	ggml : fix soft max out-of-bounds access (#4307 ) ggml-ci	2023-12-03 15:56:22 +02:00
Georgi Gerganov	ef47ec18da	ggml : add ggml_soft_max_ext (#4256 ) * metal : implement soft_max_ext * cuda : implement soft_max_ext * ggml : implement soft_max_ext (CPU) * batched-bench : print threads ggml-ci * metal : simplify soft_max encoding ggml-ci * cuda : use 512 threads for soft_max instead of 32 * ggml : update soft max cpu * cuda : do warp-based block reduce * cuda : increase max block size to 1024 * cuda : fix warp reduction initialization of shared mem * metal : warp-based reduction for soft max kernel * metal : warp-based reduce for rms_norm * metal : simplify soft max kernel ggml-ci * alloc : fix build with debug	2023-12-01 10:51:24 +02:00
Georgi Gerganov	8406b0924b	ggml : re-enable BLAS for CPU when src0 != F32 + remove redundant full offload checks in llama.cpp (#4240 ) * ggml : use blas even if src0 is not F32 * llama : use n_threads_batch only when n_tokens >= 32 ggml-ci * llama : revert n_threads_batch logic ggml-ci	2023-11-28 10:32:03 +02:00
Jared Van Bortel	f3b269813f	ggml : fix -Warray-bounds warning with gcc (#4231 )	2023-11-26 22:58:43 -05:00
slaren	e85bb1a8e7	llama : add functions to get the model's metadata (#4013 ) * llama : add functions to get the model's metadata * format -> std::to_string * better documentation	2023-11-17 17:17:37 +02:00
gwjr	3e916a07ac	finetune : speed-up ggml_compute_forward_out_prod_f32 via BLAS (#4079 ) * Remove logically superfluous assertions and order by dimension * Use cblas_sgemm() to implement ggml_compute_forward_out_prod() * Remove ggml_compute_forward_out_prod_use_blas(), fix compiling errors on cmake/zig, remove trailing whitespace * Add openBLAS support for sgemm() in compute_forward_out_prod()	2023-11-17 16:48:19 +02:00
texmex76	8da46278e1	gguf : fix potential infinite loops while parsing (#4100 ) Co-authored-by: Bernhard Gstrein <gstrein@cs.uni-freiburg.de>	2023-11-16 17:01:48 +02:00
Georgi Gerganov	3d68f364f1	ggml : sync (im2col, GPU conv, 32-bit arm compat) (#4060 ) ggml-ci	2023-11-13 16:55:52 +02:00
Georgi Gerganov	4760e7cc0b	sync : ggml (backend v2) (#3912 ) * sync : ggml (backend v2) (wip) * sync : migrate examples and llama.cpp to dynamic graphs (wip) * sync : update tests + fix max op params to 64 ggml-ci * sync : ggml-cuda ggml-ci * llama : fix save/load state context size ggml-ci * sync : try to fix build on tvOS * sync : pass custom graph sizes in training examples * sync : update graph copies to new ggml API * sync : update sync-ggml.sh with new files * scripts : fix header in sync script * train : fix context size calculations * llama : increase inference graph size up to 4096 nodes * train : allocate grads for backward graphs * train : allocate grads for gb_tmp	2023-11-13 14:16:23 +02:00
xaedes	e9c1cecb9d	ggml : fix backward rope after YaRN (#3974 ) * fix backward process of rope rope backward process was broken after YaRN RoPE (#2268) implementation, due to missing changes in backward functions. the code for the backward process is nearly identically to the forward process: the only difference is the sign of the sin-values. to avoid future regressions remove the near-duplicate backward functions and reuse the forward code: for this a new function argument `bool forward` was added to `ggml_compute_forward_rope_f32` and `ggml_compute_forward_rope_f16`. the sin-values will be negated when forward is false. * fix finetune rope call to use correct default attn_factor of 1.0f * remove unused `ggml_rope_xpos_back` it is better to have only one `ggml_rope_back` function that accepts all rope parameters, so that `ggml_compute_backward` can propagate all parameters without having to switch between different rope_back variants. * fix comments explaining the sinus sign in ggml_forward_rope * add missing function arguments in declaration * fix function argument type in declaration	2023-11-07 10:04:51 +02:00
Georgi Gerganov	4ff1046d75	gguf : print error for GGUFv1 files (#3908 )	2023-11-02 16:22:30 +02:00
Georgi Gerganov	2756c4fbff	gguf : remove special-case code for GGUFv1 (#3901 ) ggml-ci	2023-11-02 11:20:21 +02:00
cebtenzzre	898aeca90a	llama : implement YaRN RoPE scaling (#2268 ) Co-authored-by: cebtenzzre <cebtenzzre@gmail.com> Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>	2023-11-01 18:04:33 -04:00
Andrew Godfrey	73bdcb395e	finetune : add -ngl parameter (#3762 ) * Add '-ngl' support to finetune.cpp * Add fprintf in ggml_cuda_op_add When I tried CUDA offloading during finetuning following the readme, I got an assert here. This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora * Add 'finetune.sh', which currently fails when using GPU "error: operator (): Finetuning on tensors with type 'f16' is not yet supported" * tweak finetune.sh * Suppress some warnings in ggml.c * Add f16 implementation to ggml_compute_forward_add_f16_f32 * Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs * finetune.sh: Edit comments * Add "add_f16_f32_f32_cuda" * Tweak an error message * finetune.sh: Add an optional LLAMA_MODEL_DIR variable * finetune.sh: Add an optional LLAMA_TRAINING_DIR variable * train : minor * tabs to spaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>	2023-11-01 13:49:04 +02:00
Georgi Gerganov	207b51900e	ggml : move FP16 <-> FP32 code to ggml-impl.h (#3861 ) * ggml : move FP16 <-> FP32 stuff to ggml-impl.h ggml-ci * tests : fix ARM build * ggml : explicitly initialize deprecated type traits * ggml : add math.h to ggml-impl.h * ggml : remove duplicate static assert macros * ggml : prefix lookup tables with ggml_ ggml-ci * ggml-impl : move extern "C" to start of file	2023-10-30 19:19:15 +02:00
Georgi Gerganov	d69d777c02	ggml : quantization refactoring (#3833 ) * ggml : factor all quantization code in ggml-quants ggml-ci * ggml-quants : fix Zig and Swift builds + quantize tool ggml-ci * quantize : --pure option for disabling k-quant mixtures --------- Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>	2023-10-29 18:32:28 +02:00
Georgi Gerganov	b2f7e04bd3	sync : ggml (conv ops + cuda MSVC fixes) (#3765 ) ggml-ci	2023-10-24 21:51:20 +03:00
Georgi Gerganov	2b4ea35e56	cuda : add batched cuBLAS GEMM for faster attention (#3749 ) * cmake : add helper for faster CUDA builds * batched : add NGL arg * ggml : skip nops in compute_forward * cuda : minor indentation * cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops) * Apply suggestions from code review These changes plus: ```c++ #define cublasGemmBatchedEx hipblasGemmBatchedEx ``` are needed to compile with ROCM. I haven't done performance testing, but it seems to work. I couldn't figure out how to propose a change for lines outside what the pull changed, also this is the first time trying to create a multi-part review so please forgive me if I mess something up. * cuda : add ROCm / hipBLAS cublasGemmBatchedEx define * cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases * cuda : reduce mallocs in cublasGemmBatchedEx branch * cuda : add TODO for calling cublas from kernel + using mem pool --------- Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>	2023-10-24 16:48:37 +03:00
Qin Yue Chen	8cf19d60dc	gguf : support big endian platform (#3552 ) * check whether platform is 390x if yes->do not import immintrin.h * support s390x big endian * support --bigendian option for s390x 1. verified with baichuan7b-chat with float 16 on s390x 2. verified with baichuan7b-chat 3. verified with chinese-alpaca-2-13b-f16 * update format based on editor-config checker result * Update convert-baichuan-hf-to-gguf.py * 1. check in ggml.c if endianess is not match 2. update GGUF version 3. change get_pack_prefix to property 4. update information log * always use "GGUF" as beginng of GGUF file * Compare "GGUF" with file header char by char 1. Set GGUF_MAGIC to "GGUF" string instead of int value 2. Compare "GGUF" char by char to ensure its byte order 3. Move bytes swap code from convert.py to gguf.py write_tensor_data --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-20 14:19:40 +03:00
Herman Semenov	f439e506e8	ggml : fix rope + llama minor optimizations (#3560 ) * Minor fixes and fixed memleak * Using const auto references in range-based loop C++17	2023-10-20 13:02:12 +03:00
slaren	424b6381c4	ggml : add context enumeration functions (#3605 ) finetune : fix assert failure in ggml-alloc	2023-10-13 12:23:10 +02:00
M. Yusuf Sarıgöz	370359e5ba	examples: support LLaVA v1.5 (multimodal model) (#3436 ) * WIP: start implementing LLaVA * rm scratch buf for now, will revert after cleanup * LLaVA image encoder is working. will combine with llama * Add llava inference code, but it's buggy. debugging * LLaVA is working e2e, needs to optimize memory allocation + cleanup * Use ggml_allocr + rm unnecessary code * fix: crlf -> lf * fix: new line at EoF * fix: trailing whitespace * Add readme * Update readme * Some cleanup * Are you happy editorconfig? * rm unused batch image preprocessing * rm unused import * fix: rm designated initializers * introduce pad-to-square mode for non-square images * are you happy editorconfig? * gitignore /llava * Handle cases where image file does not exist * add llava target to Makefile * add support for 13b model variant * Maybe seed is unlucky? * Check if apples are compared to apples * are you happy editorconfig? * Use temperature = 0.1 by default * command line: use gpt_params_parse() * minor * handle default n_predict * fix typo * llava : code formatting, rename files, fix compile warnings * do not use Wno-cast-qual for MSVC --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-12 18:23:18 +03:00
Jan Ploski	f5f9121de1	llm : add MPT support (#3417 ) * CUDA: added support for ggml_clamp (see also: https://github.com/ggerganov/ggml/issues/545) * mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt * mpt : protect against "clip_qkv": null in mpt-7b * mpt : quick fix to avoid "Strange model" warning when quantizing MPT models * mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?) * mpt : standardized all tensor names to follow GGUF spec * mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code * mpt : fixed comment s/gptneox/mpt/ * mpt : remove tabs, trailing whitespace * mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt * mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252 * comment out n_past instead of marking it unused * mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"] * mpt : remove unused tokenizer_json in convert script * ggml : remove obsolete n_past assert in ggml_alibi * llama : print clam_kqv and max_alibi_bias hparams --------- Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-10 10:50:23 +03:00
Georgi Gerganov	fcca0a7004	refact : fix convert script + zero out KV cache to avoid nans (#3523 ) * refact : fix convert script + zero out KV cache to avoid nans * ggml : silu(-inf) should never happen * metal : assert various kernel requirements	2023-10-09 14:32:17 +03:00
Georgi Gerganov	db3abcc114	sync : ggml (ggml-backend) (#3548 ) * sync : ggml (ggml-backend) ggml-ci * zig : add ggml-backend to the build	2023-10-08 20:19:14 +03:00
Georgi Gerganov	0d152b37fe	ggml : fix build after #3329	2023-10-04 16:25:41 +03:00
ds5t5	f8c90cdbaa	llm : add Refact model (#3329 ) * add refact model * resolve comments * rebase to the latest * solve alibi cpu error --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-04 16:23:39 +03:00
Georgi Gerganov	f93af02488	sync : ggml (conv 1d + 2d updates, UB fixes) (#3468 ) * sync : ggml (conv 1d + 2d updates) ggml-ci * ggml : fix UB in q5_0 and q5_1 quantize code ggml.c:1033:39: runtime error: left shift of 1 by 31 places cannot be represented in type 'int' SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ggml.c:1081:39: runtime error: left shift of 1 by 31 places cannot be represented in type 'int' SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ggml-ci * tests : fix UB in test-quantize-perf	2023-10-04 15:29:58 +03:00
Tameem	79f34abddb	ggml : add RISC-V Vector Support for K-Quants and improved the existing intrinsics (#3453 ) * Added RVV intrinsics support for Q8 quantize row and also improved the existing dot product function for risc-v. The RVV intrinsics is added for the following quantize row functions quantize_row_q8_0 quantize_row_q8_1 The following dot product functions have also been optimized by using LMUL = 1/2 instead of LMUL = 1 ggml_vec_dot_q4_0_q8_0 ggml_vec_dot_q4_1_q8_1 ggml_vec_dot_q5_0_q8_0 ggml_vec_dot_q5_1_q8_1 And vector initialization in Q5 by temporary array is also replaced by the vid intrinsics Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai> * Added RVV intrinsics support for k_quants This adds RISC-V Vector intrinsics support for the following K_quants functions for both QKK = 256 and QKK = 64 ggml_vec_dot_q2_K_q8_K ggml_vec_dot_q3_K_q8_K ggml_vec_dot_q4_K_q8_K ggml_vec_dot_q5_K_q8_K ggml_vec_dot_q6_K_q8_K Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai> --------- Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>	2023-10-03 21:38:19 +03:00
shibe2	665018c749	CLBlast: Add broadcast support for matrix multiplication (#3402 ) Broadcast src0 into src1 across dimensions 2 and 3 when needed. This is required for models that use GQA.	2023-10-02 21:26:15 +02:00
Cebtenzzre	bc39553c90	build : enable more non-default compiler warnings (#3200 )	2023-09-28 17:41:44 -04:00
Qu Zongfu	7f1a0fe709	ggml : release the requested thread pool resource (#3292 ) * Release the requested thread pool resource * Release the requested thread pool resource 2 --------- Co-authored-by: Zongfu ZF3 Qu <quzf3@Lenovo.com>	2023-09-28 22:51:52 +03:00
xaedes	0e76a8992c	train : finetune LORA (#2632 ) * fix track_max_mem in forward_batch_wo_cache_flash_attn_train * remove unnecessary Adam(W) optimizer tensors. reduces optimizer memory overhead from 7modelsize to 2modelsize. additionally allows to optimize models with more than 2^31 parameters by replacing int with int64_t. bumps training checkpoint file version, but old checkpoints can still be read. new version with less tensors is saved. * add gradient clipping to AdamW * Fix reset of unused g->nodes and g->grads to NULL * implement gradient checkpointing for training reduces memory overhead from O(n_layer) to O(sqrt(n_layer)) as explained in readme of https://github.com/cybertronai/gradient-checkpointing * remove unused compute buffer 3 * add and use function ggml_build_backward_expand to avoid stack overflows with large maximum number of nodes GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep); * change AdamW decay parameter to work like the torch AdamW decay parameter It is now relative to Adam learning rate `alphasched`. Before that it was relative to `sched` only. `alpha` being the maximum learning rate and `sched` being a scaling parameter in [0..1] change default AdamW weight decay parameter used in training to 0.1 as used in nanoGPT * change default AdamW weight decay parameter defined in ggml to 0.0, making Adam default instead of AdamW btw: the default weight decay parameter for torch.optim.AdamW is 0.01 * bug fixes for cross entropy loss ggml_cross_entropy_loss: sums where not correctly added in workload of each thread ggml_cross_entropy_loss_back: simplify backward process, reducing numerical issues guard usage of exp f16 lookup in cross entropy by #define GGML_CROSS_ENTROPY_EXP_FP16 cross entropy loss is only used once during training, but it is quite sensitive to numerical errors introduced by exp-f16-lookup. so exp-f16-lookup for cross entropy loss is disabled by default, trading better gradients for very slightly worse runtime performance. * fix test-grad0 for cross_entropy_loss the second argument to cross_entropy_loss must sum up to 1 for each row * fix test-grad0 for soft_max dont use only sum as aggregation, because sum of softmax is always 1 -> finite differences should not work instead use sum(log(soft_max()(1-eps)+eps)); use eps to avoid log(0) improve finite differences of test-grad0 by using double instead of float * change cross_entropy_loss to output average over all rows this helps keeping the loss and gradients in a sane range * improve gradient checkpointing sqrt(n_layers) is only the best checkpoint step when mem size of checkpoints and mem size of layers are equal. since layers require more memory than the single-tensor-checkpoint we use, the optimal values are compute different: ``` given: n, u, v objective: minimize(au+bv) where ab=n, a>0, b>0 b=n/a minimize(au+vn/a) diff(au+vn/a, a) = u - (vn/a)/a diff(au+vn/a, a) == 0 u - (vn/a)/a == 0 u == vn/(aa) uaa = vn aa = vn/u a = sqrt(nv/u) ``` this change results in more checkpoints, requiring less layers to store between checkpoints, overall improving memory usage. disable gradient checkpointing debug output * llama : fix rope usage in train-text-from-scratch after ChatGLM change * add more training parameters: --enable-restart N Only for Adam optimizer. Enable restarts of cos-decay --disable-restart N Only for Adam optimizer. Disable restarts of cos-decay --opt-past N Number of optimization iterations to track for delta convergence test. Disabled when zero. --opt-delta N Maximum delta for delta convergence test. Disabled when <= zero. --opt-max-no-improvement N Maximum number of optimization iterations with no improvement. Disabled when <= zero. --adam-epsf N AdamW epsilon for convergence test. Disabled when <= zero. --adam-min-alpha N Adam minimum learning rate alpha, usually 0.1 * alpha * replace memcpy with reshape operation so that the graph is not cut at the input this makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it * remove unused function argument from get_example_targets_batch * measure and print total training time * add optimization callback to ggml_opt_resume_g this callback is called before each iteration with custom data and pointer to learning schedule parameter (only used in Adam(W)). can be used for dynamic learning schedule and setting input data for batches before each iteration * use optimization callback in training allows dynamic learning schedule and different batch data for each iteration without relying on low n_iter and high n_examples parameters reduces runtime by avoiding restart of optimization function and improves training convergence by providing a different batch for each iteration * add minimum number of tensor dimensions to apply weight decay (default 2) this allows to not apply weight decay to bias parameters * rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup * fix increase of model.train_samples and model.train_tokens now that each optimizer iteration gets its own batch we need to multiply by number of opt iterations * change sampling parameters for prediction after training to defaults of common.h and clarify what is context for prediction and what are generated tokens * tighten abs error bounds for cross_entropy_loss in test-grad0 * add conditional compilation of using F16 exp in flash attention uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable usage of f16 exp in flash attention * tighten abs error bounds for flash_attn in test-grad0 * tighten abs error bounds for sqrt in test-grad0 * remove out-commented vectorized code of opt_adam the vectorized code might be bit faster for low number of parameters, but it had a big memory usage overhead * ggml : update ggml_rms_norm_back with configurable eps * llama training : fix ggml_rms_norm_back calls to pass configurable eps * remove trailing whitespace * add train function using automatic gradient checkpointing backward pass and allocator * in train function replace add_inplace by regular add because using add_inplace seems to result in different gradients * don't use allocate hash_map on context because the context has no_alloc=True when using memory allocator resulting in NULL data pointers * correctly clone reshape and permute operations by also cloning tensor->nb values * fix variable name and add missing type cast * terminate recursive tensor cloning when reaching tensor without src tensors * correctly clone view tensors by setting data pointers without this the checkpointing would only work when being used together with memory allocator * fix variable names * swap arguments to commutative ops to be the same as in `forward_batch_wo_cache_flash_attn` * add input tensors as checkpoints so that recursive tensor cloning of gradient checkpointing terminates on input tensors * fix variable name and add missing boolean negation * make sure some tensors are not reallocated by inserting new temporary nodes depending on them: output and parameter gradient tensors need to be available at the end of the graph execution parameter gradient tensors also need to be available before the graph execution because they are set to zero before each optimizer iteration checkpoint tensors are allocated all together to reduce memory allocator fragmentation afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs * fix ASSERT to work with zero layers * add training options whether to use allocator and/or unified training function * integrate unified training function which may use memory allocator the unified training function also supports arguments whether to use flash attention and/or gradient checkpointing * format name of cloned tensors with " (clone)" suffix * set names for tensors in unified train function for easier debugging * allocate graph on context using ggml_new_graph * remove handwritten training functions * remove unused training parameters "use_scratch" and "use_unified" * remove trailing whitespace * remove unused train params: mem_compute1_gb & mem_compute2_gb mem_compute_gb is used for compute when automatic memory allocator is not enabled, otherwise it can be very small to only hold the tensor definitions mem_compute0_gb is used for automatic memory allocator (as long as measurement of max required size is not implemented) * remove unused forward_batch function * add debug asserts in ggml_allocr_alloc to some common pitfalls when using this function directly * only use ggml_allocr_alloc when tensor has NULL data and is no view * fix test when to create temporary backward graph temporary backward graph is only necessary when using checkpointing * fix memory "leak" in optimizers each iteration a new cplan with new memory for work data was allocated. now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data. * reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator with this loop order gradient checkpointing with allocator on 16 layer model saves 13% memory; 2 layer memory it saves 2% memory. the computation results are the same * add API functions to access llama model tensors * add stub example for finetuning, based on train-text-from-scratch * move and remove code * add API functions to access remaining model parameters: mult, head and rot * first draft for LORA finetune training * remove const model and layer arguments in API functions for accessing model tensors * bug fixes to make finetune compile automatic allocator does not work yet * add debug prints for training memory improvements * fix names of lora tensors * avoid stack overflow resulting from big ggml_cgraph replace stack allocation and ggml_build_forward by ggml_new_graph in combination with ggml_build_forward_expand * replace llama API functions to get model tensors by one function to get model tensor by name LLAMA_API struct ggml_tensor * llama_get_model_tensor(struct llama_model * model, const char * name); * remove unused call to not existing llama_get_layer_from_model * implement ggml_compute_forward_out_prod_q_f32 * remove trailing whitespace * add lora finetune support on quantized base model tensors * add ggml_add_cast API function this function works like ggml_add, but accepts a data type for the resulting tensor. only supported for quantized src0 input. * use ggml_add_cast in finetuning lora-applied weights will now have data type F32, which improves gradients when finetuning quantized base models * bug fix: actually use result type passed to ggml_add_cast * make sure base model tensors data cannot be used in viewable operations memory allocator would try to make lora application inplace on base model tensors. since those are memory mapped this will result in memory access violations * fix bug in ggml_out_prod which resulted in wrong n_dims of result tensors * avoid keeping in memory ALL of the gradients The problem here stems from ggml_graph_reset. This function is called in the optimization function, before each graph computation, to reset the gradients to zero. This required a unique memory slot for each gradient: allocating memory from a previosly freed memory location might lead to non-zero input gradients. During ggml_compute_backward the gradients are build stepwise by adding or substracting new values, starting from a OP_NONE tensor which needs to contain zero-values. This requires the graph reset. To avoid this I now remember in ggml_build_backward_expand the original OP_NONE gradient tensors in a hash table, which is passed to ggml_compute_backward. There instead of using add (or sub or similar) I test whether the existing gradient to be changed is a zero-valued-tensor by looking up its existence in the hash table. When it is such a zero-tensor it will not be modified, but replaced by the value to be added, otherwise the regular add (not inplace, allocator will take care of this) will be used. This way none of those zero-tensor values will be necessary in the final backward graph and more importantly they won't need a unique memory slot, just to make them zero. * remove trailing whitespace * remove debug prints and function to compute tensor data hash * improve optimization iteration prints * adjust maximal values to support finetuning 3B models * change default finetune params lora_r and lora_alpha to match the n_rank parameters of 4 * bug fix: make sure finetune input gradient is allocated at begin and kept until end * remove unnecessary src tensor from ggml_get_rows_back we don't need data of src[2] for computation, only to setup the correct output shape. remove dependency on src[2], so that allocator can work more freely. the computational graph is still completely determined, because the output shape is naturally included. this is similar to how ggml_reshape does it. * remove unnecessary src tensor from ggml_repeat & ggml_repeat_back we don't need data of src[1] for computation, only to setup the correct output shape. remove dependency on src[1], so that allocator can work more freely. the computational graph is still completely determined, because the output shape is naturally included * resolve todo allocator will only make it inplace when they are of the same type * mixing multiple LORA adapters is now possible pass more than one '--lora FNAME' argument to apply more than one LORA. use '--lora-scaled FNAME S' when you want to specify a user-defined scale for an adapter. * add option to save finetune output every N iterations * also save latest finetune output with ITERATION="LATEST" and print where files are saved saving with LATEST makes it easier to resume training from the latest checkpoint the string "LATEST" can be configured with command line option "--fn-latest STR" * update checkpoint train stats before saving via "--save-every" * add command line option `--rank-wo N` for rank of wo tensor * update finetune README * fix dump_non_result_info_yaml to output multiple lora adapters * bug fix: replace GGML_TYPE_SIZE[t] by ggml_type_size(t) * replace llama_n_mult by llama_n_ff * finetune bug fixes to compile with merged in code from master * remove prediction related code to reduce duplicated code with main use main instead * reduce large memory overhead in train-text-from-scratch all gradients had to be pinned so that graph_reset works correctly. this is no longer necessary with the changes to ggml_compute_backward introduced in this PR. * add comment explaining why finetune checkpoints are allocated in one block * make default value of float member a float literal * handle rms_norm and rope parameters the same as in train-text-from-scratch * remove unused code * remove vocab related code as it is unnecessary * add LLM_KV_TRAINING_TYPE to train-text-from-scratch checkpoints so that they can be differentiated from lora finetune checkpoints * add gguf constants and load/save functions from train-text-from-scratch * add load & save lora finetune checkpoints via gguf * add python script to convert old finetune checkpoint files to gguf * remove old checkpoint save & load code * remove code to print data checksums which was used to verify correctness of new gguf code * omit tokenization when training is disabled, only save llama lora adapter training can be disabled by passing '-n 0' to finetune * remove trailing whitespace * update README.md * implement ggml_compute_forward_repeat_f16 * avoid stack overflow of large cgraphs in test-grad0 * add ggml API functions ggml_unravel_index, ggml_get_i32_nd and its analogs for set and for f32 ggml_get_i32_1d, ggml_set_i32_1d, ggml_get_f32_1d, ggml_set_f32_1d now support non-contiguous tensors. in case of non-contiguous tensor, the 1d index is unraveled into a multi index using ggml_unravel_index to be passed to '_nd' function equivalent. this fixes a bug in test-grad0 which happens due to ggml_build_backward not building purely contiguous tensors anymore * increase test-grad0 context mem size to accommodate for bigger cgraph * add sanity check to ggml_compute_backward, asserting the correct shape of gradients * fix ggml_acc_or_set to return tensor of correct shape * remove unused 'inplace' argument from ggml_compute_backward function inplace operations to add gradients are no longer created by ggml_compute_backward use allocator to automatically make inplace operations * add missing argument 'int i0' to ggml_get_i32_nd & ggml_set_i32_nd header declarations * fix error message in ggml_allocr_alloc to display actual max_avail * fix check_gradient ggml_build_backward_expand was previously replaced by ggml_build_backward, but the assignment of forward graph to backward graph missing * use tensor->view_src instead of ggml_is_view and get_view_source * move gradient checkpointing code into ggml, new API function: // build gradient checkpointing backward graph gb for gf using provided checkpoints // gb_tmp will contain original backward graph with rewritten backward process nodes, // but without the second forward pass nodes. GGML_API void ggml_build_backward_gradient_checkpointing( struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, struct ggml_cgraph * gb_tmp, struct ggml_tensor * * checkpoints, int n_checkpoints); * replace custom data getters and setters by ggml functions * train-text-from-scratch can train (full finetune) gguf models just pass the gguf model via `--checkpoint-in FN`. after this, to continue training, pass the generated checkpoint instead of the original gguf model. tested with smaller models, bigger models may exceed available memory. use (LORA) finetune for those. * remove trailing whitespace * add option to save train-text-from-scratch output every N iterations * update README.md * fix warnings * fix warnings * remove finetune option to disable allocator the allocator should always be used. by making sure that it is always used it gets easier to implement automatic memory requirements computation * add tensor checkpoints only when gradient checkpointing is enabled * initialize opt ggml context if none was provided * add ggml-alloc API function 'ggml_allocr_max_size' to get max size of alloc GGML_API size_t ggml_allocr_max_size(struct ggml_allocr * alloc); * finetune: automatically allocate all memory and changes to command line options remove '--n_examples N' parameter, as it no longer makes sense to call optimization process multiple times in a loop. add '--only_write_lora' command line option: will skip tokenization and training, to only write a llama.cpp comptabile LORA adapter. remove memory buffer related command line options. improve iteration console output. * add finetune to Makefile * update README.md * print time per iteration and estimate remaining time * increase measured alloc size by tensor_alignment ggml_allocr_reset will reduce the given size by up to tensor_alignment-1 * fix README.md * add some more allocator debug prints * bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue * revert last commit "bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue" "alloc was freeing an externally allocated tensor, because it calculated the end of allocator memory as alloc->data + alloc->max_size instead of alloc->data + alloc->size." This is intentional to reduce the risk of freeing external tensors when measuring. Unless max_size is not properly calculated, I don't see why this is an issue. * remove unnecessary "0x" before "%p" output * move measurement memory segment to upper region of the address space * update README.md * fix printf format warnings * add missing gguf_free in load_checkpoint_lora_file * load default rms_norm and rope parameters from base model * add gradient accumulation specify number accumulation steps with '--grad-acc N'. this will simulate a bigger batch size of grad_accbatch. fix tracking of train_samples and train_tokens * build : fix compile warnings * ggml : fix L-BFGS linesearch loop * improve finetune time measurement fix printf warnings on system where int64_t is (long int). change time datatypes to double because values get big with long training times. exclude file saving from time measurement. converge faster to actual time per iteration by removing very small first duration before first iteration was performed. fix bug in output of total training time, the reported value was 1000 times to small. * specify default lora rank with '--lora-r N' '--lora-r N' will specify default rank for all tensors '--rank-wq N', etc. will override this default rank for specific tensor types. * fix gradient accumulation bug where the same batch was used for each microstep * fix gradient accumulation bug where the same batch was used for each microstep * support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back k and v can now be repeated in q along ne[2] in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2. in backard pass this won't work as easy, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3]. so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads. in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irepnek2. since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor. additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned. we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions. this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous. since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous. change test-grad0 to also test for repeated k/v in q. this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable. added a note to explain this. add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'. * fix finetune to support grouped-query-attention (using flash-attention) note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention. * support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b) * test broadcasting mul_mat backward pass * decouple random number generator of each operation test when changing one test the rng of others tests is not influenced anymore * add comment briefly describing what ggml_repeat_back does * simplify broadcasting mul_mat backward using ggml_repeat_back * add cgraph evaluation order member and corresponding enum type this controls in which order ggml_build_forward visits source nodes. by default the nodes are visited left to right, i.e. src[0] first. in some cases it is beneficial for ggml-alloc to visit in a different order. two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last). * measure max compute size for each cgraph eval order and use best order this can bring huge memory savings: e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8mb down to 4627.6 MB * remove unused command line options * add sample start patterns and options to force new or by default resume last shuffling * update shuffle rng state on reshuffle * exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32 * remove probably unnecessary exception type flags from stringstream * pass correct max number of tokens to llama_tokenize * account for possible leading whitespace that will be added by tokenizer e.g. '\t' will be tokenized by llama spm tokenizer to [29871, 12] * use unrolled vec_mad in out_prod y is vec_mad result vec. x is vec_mad input vec. v is vec_mad input scalar. ggml_vec_mad_f32_unroll will internally loop over x and v with same y. GGML_VEC_MAD_UNROLL is by default defined to 32. This value is empirical optimized using performance test runs of out-prod in openllama-3b finetune with 256 context length and batch size 1. It gives 23% performance boost for out_prod. Full measurements of out-prod runtime in ms: unroll_xv unroll_yv 1 67014.643 87826.469 2 77117.552 89077.656 4 72091.311 109121.657 8 61077.543 88678.334 16 56914.67 79514.947 24 59024.595 84350.254 28 55952.446 83368.73 32 51476.658 85177.745 36 55973.792 84659.92 40 55139.616 93844.738 48 60736.392 93330.267 64 99856.878 116994.99 Second column is when unrollying yv instead of xv * set lora_alpha to value of lora_r if it is not set via command line otherwise only changing lora_r will change scaling of lora adapter used in prediction * reshuffle original sample order instead of the previous shuffled order otherwise resumed reshuffle will not result in same sample order * block tiling for out-prod inspired by mul-mat block sizes are empirically optimized roughly doubles the flops of out-prod * exclude some more known zero values from computations in flash_attn_f32 & flash_attn_back_f32 * add static keywords * remove outcommented old code * update train-text-from-scratch with tokenization, sample selection and shuffling from finetune * remove lbfgs related train parameters * move common train functions into common/train.[h\|cpp] * move train state into struct train_state * move train data saving code into callback to unify code of opt_callback train_params are still different in finetune and train-text-from-scratch, so it can't yet be moved to train.h\|cpp * move common train params into common/train * move common opt_callback into common/train * fix consume_common_train_arg * save and load head_count_kv in lora checkpoints * increase train_samples by used_samples instead of number of batches on batch can contain more than one sample when option "fill_with_next_samples" is used * fix usage of llama_tokenize * remove static from process_escape since we need it exposed in header * fix code formating of long function declarations * fix condition in load_train_state_gguf * use die("msg") instead of replace GGML_ASSERT(!"msg") or throw std::runtime_error("msg") * fix saving and loading of training type * remove terminating '\0' from tokenization (llama_tokenize is now passed the string length instead of relying on terminating '\0') * fix compile warnings * fix compile warnings * use new/delete for train_state instead of malloc/free using malloc may result in seg faults when trying to assign string fields * assert that sample_count > 0, avoiding division by zero * fix frand to return value in interval [0,1) * add train option "--sample-random-offsets" Use samples beginning at random offsets. The offset is only applied to the first sample in each batch context window. Together with "--fill-with-next-samples" this may help for training endless text generation. For example given a dataset containing samples "abcd", "ABCD", "0123". With context size of 8 and options "--fill-with-next-samples", "--no-separate-with-eos", "--no-separate-with-bos", the context windows of batches could only be filled with "abcdABCD", "ABCDabcd", "0123abcd", etc. With "--sample-random-offsets" it can also be filled with "23abcdAB", "bcd0123A", etc. * deduplicate code into function * remove n_rot hparam, as it must always be hparam.n_embd_head() * align code * assert correct base model tensor shapes * move some params from lora hparams into model hparams and load model params from gguf this equalizes the model definition in finetune and text-from-scratch and removes the need for additional llama api functions to get model parameters * remove now unnecessary llama API functions to get model params that where added by this PR * train-text-from-scratch: automatically allocate model tensors, remove option '--mem-model N' * train-text-from-scratch: automatically allocate opt context * train-text-from-scratch: automatically allocate input tensors * train-text-from-scratch: automatically allocate compute memory * remove unused options and equalize train-text-from-scratch with finetune * initialize opt->loss_after with zero * add export-lora program * remove trailing whitespace * add export-lora build in Makefile * remove unused struct tensor_info from export-lora * add export-lora build dependency to llama because it depends on common, which depends on llama * update finetune README.md * cancel optimization when specified number of epochs is completed * improve handling of export-lora arguments print errors and warnings when files could not be read or created * Fix export-lora.cpp "not enough space in the context's memory pool" (#1) * Fix export-lora.cpp "not enough space in the context's memory pool" Without this patch, export-lora would sometimes error with "not enough space in the context's memory pool (needed 656784, available 656800)". * increase required context size by 5GGML_MEM_ALIGN instead of plain 16 --------- Co-authored-by: xaedes <xaedes@gmail.com> improve handling of not yet supported tensor types --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: meatbag-18a <145869052+meatbag-18a@users.noreply.github.com>	2023-09-28 21:40:11 +03:00
Cebtenzzre	2db94d98ed	gguf : basic type checking in gguf_get_* (#3346 )	2023-09-28 14:30:31 -04:00
Georgi Gerganov	ec893798b7	llama : custom attention mask + parallel decoding + no context swaps (#3228 ) * tests : verify that RoPE is "additive" * llama : replace ggml_diag_mask_inf with ggml_add (custom -inf mask) * ggml : ggml_rope now takes a vector with positions instead of n_past * metal : add rope_f16 kernel + optimize cpy kernels * llama : unified KV cache + batch inference API * llama : add new llama_decode() API that works with llama_batch * llama : add cell_max heuristic for more efficient kv_cache * llama : extend llama_kv_cache API * llama : more robust cell_max heuristic + wip shift * metal : disable concurrency optimization * llama : add llama_kv_cache_shift_seq + no more context swaps * llama : apply K-cache roping for Falcon and Baichuan * speculative : fix KV cache management * parallel : example for serving multiple users in parallel * parallel : disable hot-plug to avoid cache fragmentation * fixes : speculative KV cache + llama worst-case graph * llama : extend batch API to select which logits to output * llama : fix worst case graph build * ggml-cuda : update rope implementation for parallel decoding (#3254) * ggml-cuda : update rope implementation for parallel decoding * better solution for p0 computation * fix rope * simpler rope implementation --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * make : add parallel to build + fix static functions in llama.cpp * simple : fix token counting * parallel : various improvements * llama : fix cell_max logic + rename functions * parallel : try smaller batches when the KV cache is fragmented * parallel : fix sequence termination criteria * llama : silence errors KV cache errors * parallel : remove new line from prompt * parallel : process system prompt once + configurable paramters + llama API * parallel : remove question with short answers * parallel : count cache misses * parallel : print misses on each request * parallel : minor * llama : fix n_kv to never become 0 * parallel : rename hot-plug to continuous-batching * llama : improve llama_batch API + simplify parallel example * simple : add parallel decoding support * simple : improve comments + free batch * ggml-cuda : add rope f16, restore performance with parallel decoding (#3272) * ggml-cuda : add rope f16, restore performance * offload KQ_mask with all models * fix rope shift --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * llama : disable MPI for now ggml-ci * train : make KQ_pos memory buffer permanent via dummy scale op * ggml : revert change to ggml_cpy, add ggml_cont_Nd instead (#3275) ggml-ci * parallel : fix bug (extra BOS) + smaller token_prev array * parallel : fix cases where the input prompts can overflow the batch * parallel : add disabled experimental batch chunking in powers of two * llama : llama.h formatting + comments * simple : add README.md * llama : fix kv cache heuristic when context is less than 32 * parallel : fix crash when `-n -1` * llama : simplify returns if/else branches * metal : use mm kernels for batch size > 2 * examples : utilize new llama_get_logits_ith() * examples : add example for batched decoding * examples : do not eval prompt 2 times (close #3348) * server : clear the KV cache beyond n_past before llama_decode * server : avoid context swaps by shifting the KV cache --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-09-28 19:04:36 +03:00
Georgi Gerganov	8c00b7a6ff	sync : ggml (Metal F32 support + reduce ggml-alloc size) (#3192 ) * sync : ggml (Metal F32 support + reduce ggml-alloc size) ggml-ci * llama-bench : fix ggml_cpu_has_metal() duplicate function ggml-ci	2023-09-15 19:06:03 +03:00
Georgi Gerganov	a51b687657	metal : relax conditions on fast matrix multiplication kernel (#3168 ) * metal : relax conditions on fast matrix multiplication kernel * metal : revert the concurrnecy change because it was wrong * llama : remove experimental stuff	2023-09-15 11:09:24 +03:00
Eric Sommerlade	b52b29ab9d	arm64 support for windows (#3007 ) Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>	2023-09-12 21:54:20 -04:00
Georgi Gerganov	b3e9852e47	sync : ggml (CUDA GLM RoPE + POSIX) (#3082 ) ggml-ci	2023-09-08 17:58:07 +03:00
Przemysław Pawełczyk	cb6c44c5e0	build : do not use _GNU_SOURCE gratuitously (#2035 ) * Do not use _GNU_SOURCE gratuitously. What is needed to build llama.cpp and examples is availability of stuff defined in The Open Group Base Specifications Issue 6 (https://pubs.opengroup.org/onlinepubs/009695399/) known also as Single Unix Specification v3 (SUSv3) or POSIX.1-2001 + XSI extensions, plus some stuff from BSD that is not specified in POSIX.1. Well, that was true until NUMA support was added recently, so enable GNU libc extensions for Linux builds to cover that. Not having feature test macros in source code gives greater flexibility to those wanting to reuse it in 3rd party app, as they can build it with FTMs set by Makefile here or other FTMs depending on their needs. It builds without issues in Alpine (musl libc), Ubuntu (glibc), MSYS2. * make : enable Darwin extensions for macOS to expose RLIMIT_MEMLOCK * make : enable BSD extensions for DragonFlyBSD to expose RLIMIT_MEMLOCK * make : use BSD-specific FTMs to enable alloca on BSDs * make : fix OpenBSD build by exposing newer POSIX definitions * cmake : follow recent FTM improvements from Makefile	2023-09-08 15:09:21 +03:00
Kunshang Ji	7f412dab9c	enable CPU HBM (#2603 ) * add cpu hbm support * add memalign 0 byte check * Update ggml.c * Update llama.cpp * ggml : allow ggml_init with 0 size * retrigger ci * fix code style --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-09-08 03:46:56 +02:00

1 2 3 4 5 ...

289 Commits