Commit Graph

248 Commits

Author SHA1 Message Date
Georgi Gerganov
e72c8c2124
ggml : fix bug in gguf_set_kv
ggml-ci
2023-08-17 20:13:48 +03:00
Georgi Gerganov
88b5769487
gguf : deduplicate (#2629)
* gguf : better type names

* dedup : CPU + Metal is working

* ggml : fix warnings about unused results

* llama.cpp : fix line feed and compiler warning

* llama : fix strncpy warning + note token_to_str does not write null

* llama : restore the original load/save session implementation

Will migrate this to GGUF in the future

* convert-llama-h5-to-gguf.py : support alt ctx param name

* ggml : assert when using ggml_mul with non-F32 src1

* examples : dedup simple

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
2023-08-16 19:25:29 +03:00
Georgi Gerganov
758ff1bbb5
llama : refactor model loading code (#2620)
* llama : style formatting + remove helper methods

* llama : fix quantization using gguf tool

* llama : simplify gguf_file_saver

* llama : fix method names

* llama : simplify write_header()

* llama : no need to pass full file loader to the file saver

just gguf_ctx

* llama : gguf_file_saver write I32

* llama : refactor tensor names (#2622)

* gguf: update tensor names searched in quantization

* gguf : define tensor names as constants

* gguf : initial write API (not tested yet)

* gguf : write to file API (not tested)

* gguf : initial write API ready + example

* gguf : fix header write

* gguf : fixes + simplify example + add ggml_nbytes_pad()

* gguf : minor

* llama : replace gguf_file_saver with new gguf write API

* gguf : streaming support when writing files

* gguf : remove oboslete write methods

* gguf : remove obosolete gguf_get_arr_xxx API

* llama : simplify gguf_file_loader

* llama : move hparams and vocab from gguf_file_loader to llama_model_loader

* llama : merge gguf-util.h in llama.cpp

* llama : reorder definitions in .cpp to match .h

* llama : minor simplifications

* llama : refactor llama_model_loader (WIP)

wip : remove ggml_ctx from llama_model_loader

wip : merge gguf_file_loader in llama_model_loader

* llama : fix shape prints

* llama : fix Windows build + fix norm_rms_eps key

* llama : throw error on missing KV paris in model meta data

* llama : improve printing + log meta data

* llama : switch print order of meta data

---------

Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
2023-08-16 14:34:03 +03:00
M. Yusuf Sarıgöz
d09fd10713 gguf : write metadata in gguf_file_saver 2023-08-11 20:07:43 +03:00
M. Yusuf Sarıgöz
e3a4960953 gguf : add gguf_get_kv_type 2023-08-11 13:03:23 +03:00
M. Yusuf Sarıgöz
eb8ca6996f gguf : add gguf_get_kv_type 2023-08-11 12:24:08 +03:00
Georgi Gerganov
8083ae347a gguf : minor stuff 2023-08-07 19:02:18 +03:00
Georgi Gerganov
1da82c551f Merge branch 'master' into gguf 2023-08-07 18:53:03 +03:00
Georgi Gerganov
93356bdb7a
ggml : mul mat tweaks (#2372)
* ggml : mul mat wip

ggml-ci

* ggml : alternative thread distribution for mul_mat

ggml-ci

* ggml : mul_mat block tiling attempt

* ggml : mul_mat threads yield

ggml-ci
2023-08-07 14:25:58 +03:00
Georgi Gerganov
60baff7c85
ggml : pad result of ggml_nbytes() 2023-08-07 14:24:42 +03:00
Georgi Gerganov
9082b5dfbf
ggml : change params pointer (style change) (#2539)
ggml-ci
2023-08-07 13:55:18 +03:00
Georgi Gerganov
99d29c0094
ggml : sync (custom ops) (#2537)
ggml-ci
2023-08-07 13:20:09 +03:00
M. Yusuf Sarıgöz
b26f5b2e43 gguf : fix typo in function call 2023-07-31 16:23:54 +03:00
M. Yusuf Sarıgöz
7aa0a0e7f7 gguf : support custom alignment value 2023-07-31 09:59:36 +03:00
slaren
a113689571
ggml : add graph tensor allocator (#2411)
* ggml : add graph tensor allocator

* ggml : don't calculate data pointer of unallocated tensors when creating a view with an offset

* ggml : refactor ggml_view_Nd into ggml_view_tensor_offset
2023-07-30 15:58:01 +02:00
klosax
b19c11750b
ggml.c : add gguf_get_arr_n 2023-07-30 14:58:50 +02:00
klosax
2c22e3bcdb
ggml.c : get arr str and f32 2023-07-29 20:37:47 +02:00
klosax
3492f848d7
gguf : add gguf_find_key (#2438)
* gguf.cpp : find key example

* ggml.h : add gguf_find_key

* ggml.c : add gguf_find_key
2023-07-28 23:45:24 +03:00
Georgi Gerganov
d2b6ca13ad
gguf : add array support 2023-07-27 14:53:07 +03:00
Georgi Gerganov
d89533dff6
gguf : expose the gguf_type enum through the API for now 2023-07-27 11:10:34 +03:00
slaren
b5472ea0ad
ggml : fix assert in ggml_set_unary_op (#2410) 2023-07-26 23:57:23 +02:00
Georgi Gerganov
d8491fc7e3
gguf : add comments 2023-07-26 23:00:24 +03:00
Georgi Gerganov
5628ec7163
gguf : read / write sample models 2023-07-26 22:40:45 +03:00
Georgi Gerganov
d313c0fa33
gguf : simplify gguf_get_val 2023-07-26 18:53:57 +03:00
Georgi Gerganov
cb871fa022
gguf : do not support passing existing ggml_context to gguf_init 2023-07-26 18:48:52 +03:00
Georgi Gerganov
860c9c63ce
gguf : add gguf_get_tensor_name() 2023-07-26 18:21:14 +03:00
Georgi Gerganov
78b226a959
gguf : initial model loading - not tested 2023-07-26 18:21:14 +03:00
Georgi Gerganov
d91b985d2d
gguf : read tensor info 2023-07-26 18:21:13 +03:00
Georgi Gerganov
8d6acfec12
gguf : read header + meta data 2023-07-26 18:21:13 +03:00
Georgi Gerganov
6873148771
gguf : first API pass 2023-07-26 18:21:13 +03:00
slaren
5488fb789e
ggml : allocate graphs in a context (#2392)
* ggml : graph allocation in contexts

* allocate work buffer as a ggml_object in ggml_graph_compute_with_ctx

* llama.cpp : allocate graph in the context

* add GGML_PAD

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-26 15:56:53 +02:00
slaren
07aaa0f63f
ggml : fix ggml_flash_attn to use op_params (#2387)
* ggml : fix ggml_flash_attn to use op_params
2023-07-25 16:20:12 +02:00
Jiahao Li
875086bdb9
ggml : relax contiguous constraints in activation function (#2371) 2023-07-25 15:58:32 +03:00
slaren
da1889834a
ggml : improve graph build time via hash table lookup (#2329)
* improve graph build time

* ggml_tensor : use 1 bit per flag

* use a hash table instead
2023-07-25 15:32:20 +03:00
slaren
41c674161f
make rms_norm_eps a parameter (#2374)
* make rms_norm_eps a parameter

* add rms_norm_eps to command line

* fix baby llama, test-grad0

* use scientific notation for eps param in the help

ggml-ci
2023-07-24 17:57:12 +02:00
Georgi Gerganov
5b2b2dc6ae
ggml : sync (unary ops refactor, static-correctness) (#2370)
* ggml : sync (unary ops, tests)

ggml-ci

* tests : remove unnecessary funcs
2023-07-24 14:46:21 +03:00
slaren
3602ac4255
fix n_tasks (#2342)
ggml-ci
2023-07-23 15:19:39 +02:00
slaren
95a6c595e7
ggml: move op parameters from tensors to ggml_tensor::op_params (#2333)
* ggml: move op parameters from tensors to ggml_tensor::op_params

* alibi: use memcpy for float params

* remove `src[1] = NULL` in ops
2023-07-23 14:36:02 +02:00
Georgi Gerganov
0db14fef06
ggml : fix the rope fix (513f861953) 2023-07-21 15:16:55 +03:00
Georgi Gerganov
513f861953
ggml : fix rope args order + assert (#2054) 2023-07-21 14:51:34 +03:00
Qingyou Meng
672dda10e4
ggml : fixed runtime bugs and compile errors related to GGML_PERF and GGML_DEBUG (#2219)
* fixed runtime bugs and compile errors related to GGML_PERF and GGML_DEBUG

* remove ifdef GGML_PERF; update fmt
2023-07-16 22:57:28 +03:00
Xiao-Yong Jin
6e7cca4047
llama : add custom RoPE (#2054)
* Implement customizable RoPE

The original RoPE has pre-defined parameters

theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]

Our customizable RoPE, ggml_rope_custom_inplace, uses

theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]

with the default matches the original

scale = 1.0
base = 10000

The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameter.

Recent researches show changing these two parameters extends the context limit with minimal loss.

1. Extending Context to 8K
   kaiokendev
   https://kaiokendev.github.io/til#extending-context-to-8k

2. Extending Context Window of Large Language Models via Positional Interpolation
   Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
   https://arxiv.org/abs/2306.15595

3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
   https://www.reddit.com/user/bloc97
   https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

For the bold, try adding the following command line parameters to your favorite model:
-c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5

* ggml-metal: fix custom rope

* common: fix argument names in help

* llama: increase MEM_REQ_EVAL for MODEL_3B

It avoids crashing for quantized weights on CPU.
Better ways to calculate the required buffer size would be better.

* llama: make MEM_REQ_EVAL depend on n_ctx

* server: use proper Content-Type in curl examples

Without the header Content-Type: application/json, curl will POST with
Content-Type: application/x-www-form-urlencoded

Though our simple server doesn't care, the httplib.h used has a limit
with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192

With Content-Type: application/json, we can send large json data.

* style : minor fixes, mostly indentations

* ggml : fix asserts

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-15 13:34:16 +03:00
Evan Miller
e8035f141e
ggml : fix static_assert with older compilers #2024 (#2218) 2023-07-14 21:55:56 +03:00
Georgi Gerganov
697966680b
ggml : sync (ggml_conv_2d, fix mul_mat bug, CUDA GLM rope) 2023-07-14 16:36:41 +03:00
Georgi Gerganov
975221e954
ggml : broadcast mul_mat + conv batch support (#2199)
* ggml : broadcast mul_mat + conv batch support

* ggml : apply mul_mat broadcast fix by @jploski
2023-07-12 20:51:29 +03:00
Georgi Gerganov
4523d10d0c ggml : add ggml_pool_1d and ggml_pool_2d 2023-07-12 20:32:15 +03:00
Georgi Gerganov
20d7740a9b
ggml : sync (abort callback, mul / add broadcast, fix alibi) (#2183) 2023-07-11 22:53:34 +03:00
Spencer Sutton
5bf2a27718
ggml : remove src0 and src1 from ggml_tensor and rename opt to src (#2178)
* Add ggml changes

* Update train-text-from-scratch for change

* mpi : adapt to new ggml_tensor->src

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-11 19:31:10 +03:00
clyang
3bbc1a11f0
ggml : fix buidling with Intel MKL but ask for "cblas.h" issue (#2104) (#2115)
* Fix buidling with Intel MKL but ask for "cblas.h" issue

* Use angle brackets to indicate the system library
2023-07-09 11:12:20 +03:00
Qingyou Meng
1d656d6360
ggml : change ggml_graph_compute() API to not require context (#1999)
* ggml_graph_compute: deprecate using ggml_context, try resolve issue #287

* rewrite: no longer consider backward compitability; plan and make_plan

* minor: rename ctx as plan; const

* remove ggml_graph_compute from tests/test-grad0.c, but current change breaks backward

* add static ggml_graph_compute_sugar()

* minor: update comments

* reusable buffers

* ggml : more consistent naming + metal fixes

* ggml : fix docs

* tests : disable grad / opt + minor naming changes

* ggml : add ggml_graph_compute_with_ctx()

- backwards compatible API
- deduplicates a lot of copy-paste

* ci : enable test-grad0

* examples : factor out plan allocation into a helper function

* llama : factor out plan stuff into a helper function

* ci : fix env

* llama : fix duplicate symbols + refactor example benchmark

* ggml : remove obsolete assert + refactor n_tasks section

* ggml : fix indentation in switch

* llama : avoid unnecessary bool

* ggml : remove comments from source file and match order in header

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-07 19:24:01 +03:00