Commit Graph

134 Commits

Author SHA1 Message Date
Jared Van Bortel
3773e1afe7 Merge branch 'master' of https://github.com/ggerganov/llama.cpp into ceb/nomic-vulkan 2024-01-10 12:26:30 -05:00
Georgi Gerganov
52531fdff8
main : add self-extend support (#4815)
* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* llama : "self-extend"-like context extension

* passkey : add comment

* main : add Self-Extend support

* llama : add comment about llama_kv_cache_seq_div
2024-01-08 11:18:32 +02:00
Jared Van Bortel
747e1eafcf Merge commit '81bc9214a389362010f7a57f4cbc30e5f83a2d28' into nomic-vulkan 2023-12-13 17:49:45 -05:00
Jared Van Bortel
3e09e127eb rename ggml-vulkan -> ggml-kompute 2023-12-13 17:49:45 -05:00
MaggotHATE
52c8bc3cf3
sampling : custom samplers order (#4285)
* Samplers sequence order w parameter

* Cleaned commented code

* Fixed formatting

* Rewrote with unordered_map

* Revert and rewrite, too many problems and safeguards would be needed

* Fixed code style

* Code style fixes according to review

* More readable samplers input string, fixed help

* Style fix in sampler_queue

* Formatting fixes

* Fixing whitespaces
2023-12-05 12:05:51 +02:00
Andrew Godfrey
8efa0f6ebe
main : pass LOG_TEE callback to llama.cpp log (#4033)
* main : Call llama_log_set to use LOG_TEE

* tabs to spaces
2023-11-30 23:56:19 +02:00
Jared Van Bortel
9ae88baf38 Merge remote-tracking branch 'upstream/master' into nomic-vulkan-redo 2023-11-23 17:22:09 -05:00
Jared Van Bortel
1829f1d7be Merge commit '4760e7cc0b68570d58f55e8dda469805d1759d0d~' into nomic-vulkan 2023-11-23 17:22:08 -05:00
Jared Van Bortel
fe26e6adff Merge commit 'e16b9fa4baa8a09c6619b116159830e898050942' into nomic-vulkan 2023-11-23 17:22:04 -05:00
Jared Van Bortel
2a41ba7258 Merge commit '469c9addef75893e6be12edda852d12e840bf064' into nomic-vulkan 2023-11-23 17:22:00 -05:00
Jared Van Bortel
f194e1b6a6 Merge commit 'fcca0a700487999d52a525c96d6661e9f6a8703a' into nomic-vulkan 2023-11-23 17:21:59 -05:00
Seb C
881800d1f0
main : Add ChatML functionality to main example (#4046)
Co-authored-by: Sebastian Cramond <sebby37@users.noreply.github.com>
2023-11-20 14:56:59 +01:00
Kerfuffle
91f6499393
Respect tokenizer.ggml.add_bos_token value when tokenizing (#4040)
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.

* Respect add_bos_token GGUF metadata value

* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
2023-11-16 19:14:37 -07:00
Richard Kiss
532dd74e38
Fix some documentation typos/grammar mistakes (#4032)
* typos

* Update examples/parallel/README.md

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>

---------

Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
2023-11-11 23:04:58 -07:00
Jared Van Bortel
af00cca08e Merge commit 'ec893798b7a2a803466cc8f063051499ec3d96f7' into HEAD 2023-11-08 16:36:00 -05:00
cebtenzzre
b12fa0d1c1
build : link against build info instead of compiling against it (#3879)
* cmake : fix build when .git does not exist

* cmake : simplify BUILD_INFO target

* cmake : add missing dependencies on BUILD_INFO

* build : link against build info instead of compiling against it

* zig : make build info a .cpp source instead of a header

Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>

* cmake : revert change to CMP0115

---------

Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
2023-11-02 08:50:16 +02:00
kalomaze
238657db23
samplers : Min-P sampler implementation [alternative to Top P/Top K] (#3841)
* Introduce the new Min-P sampler by @kalomaze
   The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.

* Min-P enabled and set to 0.05 default

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-31 20:44:49 +01:00
Kerfuffle
6e08281e58
Extend llama_kv_cache_seq_rm to allow matching any sequence (#3843)
* Extend llama_kv_cache_seq_rm to allow matichng any sequence

* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear

Use llama_kv_cache_clear for cache clearing

Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
2023-10-29 11:31:40 -06:00
Marcus Dunn
5be6c803fa
llama : remove token functions with context args in favor of model (#3720)
* added `llama_model_token_*` variants to all the `llama_token_*` functions.

* added `LLAMA_API`

* formatting

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* removed old `llama_token` functions

* changed 3 more functions to take in model

- `llama_token_get_text`
- `llama_token_get_score`
- `llama_token_get_type`

* added back docs

* fixed main.cpp

* changed token functions to use new model variants

* changed token functions to use new model variants

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-23 22:40:03 +03:00
vvhg1
d3956aea53
main : escape prompt for cfg_negative_prompt and consecutive inputs in main with interactive (#3623)
* infill tokens correction

* serverinfill tokens correction

* removing any leading whitespace from infill suffix and removing leeading space token from suffix when params.escape

* removing any leading whitespace from infill suffix and removing leeading space token from suffix when params.escape

* only rm when params.escape, rm space if possible which is added back or rm added space token

* only rm when params.escape, rm space if possible which is added back or rm added space token

* Revert "only rm when params.escape, rm space if possible which is added back or rm added space token"

This reverts commit 63ba0b621f.

* fix interactive prompt escaping and fix server infill leading space handling

* rm unnecessary bool check

* process escapes for neg prompt and interactive consec prompts

* removed unneccessary static string escape
2023-10-22 21:09:51 +03:00
Georgi Gerganov
d1031cf49c
sampling : refactor init to use llama_sampling_params (#3696)
* sampling : refactor init to use llama_sampling_params

* llama : combine repetition, frequency and presence penalties in 1 call

* examples : remove embd-input and gptneox-wip

* sampling : rename penalty params + reduce size of "prev" vector

* sampling : add llama_sampling_print helper

* sampling : hide prev behind API and apply #3661

ggml-ci
2023-10-20 21:07:23 +03:00
Georgi Gerganov
0e89203b51
speculative : add tree-based sampling example (#3624)
* sampling : one sequence per sampling context

ggml-ci

* speculative : add tree-based sampling support

ggml-ci

* speculative : reuse the n_parallel CLI param

* speculative : refactor sampling

* examples : fix build after sampling refactoring

ggml-ci

* batched : fix n_seq_id

* sampling : fix malloc

ggml-ci

* swift : fix build

ggml-ci

* swift : try to fix build

ggml-ci

* prompts : add assistant.txt

* common : add llama_batch_add() and llama_batch_clear() helpers

* speculative : minor refactor

ggml-ci

* minor : comments + rename

ggml-ci

* speculative : fix off-by-one for n_drafted

* speculative : fix the n_drafted fix + p constants
2023-10-18 16:21:57 +03:00
Georgi Gerganov
e1675d133c
llama : avoid fprintf in favor of LLAMA_LOG (#3538) 2023-10-17 22:34:26 +03:00
staviq
1a159553f9
tokenizer : special token handling (#3538)
* Rewrite special token handling from #1931

* shorten param name, add st verification by type

* use offsets instead of copy by substr

* formatting, remove copying iterator on delete

* llama : normalize code-style

* swift fix

* print pfx/sfx if verb, main: split pfx input sfx

* dont add space when using special tokens

* minor : comment + spacing

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-17 18:11:01 +03:00
Georgi Gerganov
b8fe4b5cc9
main : fix session loading bug (#3400) 2023-10-11 23:55:41 +03:00
Kerfuffle
70c29da118
common : fix mirostat state when using multiple sequences (#3543)
* Fix mirostat state when using multiple sequences

* Fix mirostat by completely refactoring sampling!

* Try to fix zig build.

* Export function to fetch/create default sampler states

Code formatting cleanups and add some comments

Silence a warning about id not being used when logging is disabled

* Apply some renaming suggestions.

Fix comments that were out of sync with the pull.

* Use more consistant naming convention for sampling contexts
2023-10-11 22:35:46 +03:00
niansa
ba15dfd0be Nomic vulkan backend licensed under the Software for Open Models License (SOM), version 1.0. 2023-10-05 13:39:18 -04:00
h-h-h-h
8186242b6d
main : consistent prefix/suffix coloring (#3425)
* Typo

* No `--in-prefix` coloring

The `--in-prefix` text was inconsistently colored. Now, it's never colored, just like the `--in-suffix` text.
2023-10-03 21:16:15 +03:00
Georgi Gerganov
ac2219fef3
llama : fix session saving/loading (#3400)
* llama : fix session saving/loading

* llama : temp fix for clearing "future" tokens from the KV cache

* llama : fix handling of "future" tokens when loading sessions

* llama : fix comments for llama_kv_cache API
2023-10-03 21:04:01 +03:00
Cebtenzzre
bc39553c90
build : enable more non-default compiler warnings (#3200) 2023-09-28 17:41:44 -04:00
slaren
16bc66d947
llama.cpp : split llama_context_params into model and context params (#3301)
* llama.cpp : split llama_context_params into model and context params

ggml-ci

* fix metal build

* fix freq_base/scale default to model value

* llama-bench : keep the same model between tests when possible

* move n_threads to llama_context_params, add n_threads_batch

* fix mpi build

* remove kv_size(), cuda scratch fixes

* remove low-vram option

* add n_threads_batch to system info, refactor to get_system_info()

* add documentation about --threads-batch to the READMEs

* llama-bench fix

* main : fix rope freq/scale warning

* llama.cpp : add llama_get_model
common : add llama_tokenize from model

* remove duplicated ctx/model functions

ggml-ci

* cuda : print total VRAM used
2023-09-28 22:42:38 +03:00
Georgi Gerganov
ec893798b7
llama : custom attention mask + parallel decoding + no context swaps (#3228)
* tests : verify that RoPE is "additive"

* llama : replace ggml_diag_mask_inf with ggml_add (custom -inf mask)

* ggml : ggml_rope now takes a vector with positions instead of n_past

* metal : add rope_f16 kernel + optimize cpy kernels

* llama : unified KV cache + batch inference API

* llama : add new llama_decode() API that works with llama_batch

* llama : add cell_max heuristic for more efficient kv_cache

* llama : extend llama_kv_cache API

* llama : more robust cell_max heuristic + wip shift

* metal : disable concurrency optimization

* llama : add llama_kv_cache_shift_seq + no more context swaps

* llama : apply K-cache roping for Falcon and Baichuan

* speculative : fix KV cache management

* parallel : example for serving multiple users in parallel

* parallel : disable hot-plug to avoid cache fragmentation

* fixes : speculative KV cache + llama worst-case graph

* llama : extend batch API to select which logits to output

* llama : fix worst case graph build

* ggml-cuda : update rope implementation for parallel decoding (#3254)

* ggml-cuda : update rope implementation for parallel decoding

* better solution for p0 computation

* fix rope

* simpler rope implementation

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* make : add parallel to build + fix static functions in llama.cpp

* simple : fix token counting

* parallel : various improvements

* llama : fix cell_max logic + rename functions

* parallel : try smaller batches when the KV cache is fragmented

* parallel : fix sequence termination criteria

* llama : silence errors KV cache errors

* parallel : remove new line from prompt

* parallel : process system prompt once + configurable paramters + llama API

* parallel : remove question with short answers

* parallel : count cache misses

* parallel : print misses on each request

* parallel : minor

* llama : fix n_kv to never become 0

* parallel : rename hot-plug to continuous-batching

* llama : improve llama_batch API + simplify parallel example

* simple : add parallel decoding support

* simple : improve comments + free batch

* ggml-cuda : add rope f16, restore performance with parallel decoding (#3272)

* ggml-cuda : add rope f16, restore performance

* offload KQ_mask with all models

* fix rope shift

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : disable MPI for now

ggml-ci

* train : make KQ_pos memory buffer permanent via dummy scale op

* ggml : revert change to ggml_cpy, add ggml_cont_Nd instead (#3275)

ggml-ci

* parallel : fix bug (extra BOS) + smaller token_prev array

* parallel : fix cases where the input prompts can overflow the batch

* parallel : add disabled experimental batch chunking in powers of two

* llama : llama.h formatting + comments

* simple : add README.md

* llama : fix kv cache heuristic when context is less than 32

* parallel : fix crash when `-n -1`

* llama : simplify returns if/else branches

* metal : use mm kernels for batch size > 2

* examples : utilize new llama_get_logits_ith()

* examples : add example for batched decoding

* examples : do not eval prompt 2 times (close #3348)

* server : clear the KV cache beyond n_past before llama_decode

* server : avoid context swaps by shifting the KV cache

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-09-28 19:04:36 +03:00
Cebtenzzre
e6616cf0db
examples : add compiler version and target to build info (#2998) 2023-09-15 16:59:49 -04:00
Cebtenzzre
3aefaab9e5
check C++ code with -Wmissing-declarations (#3184) 2023-09-15 15:38:27 -04:00
Roland
2d770505a8
llama : remove mtest (#3177)
* Remove mtest

* remove from common/common.h and examples/main/main.cpp
2023-09-15 10:28:45 +03:00
Cebtenzzre
e64f5b5578
examples : make n_ctx warning work again (#3066)
This was broken by commit e36ecdcc ("build : on Mac OS enable Metal by
default (#2901)").
2023-09-08 11:43:35 -04:00
Przemysław Pawełczyk
cb6c44c5e0
build : do not use _GNU_SOURCE gratuitously (#2035)
* Do not use _GNU_SOURCE gratuitously.

What is needed to build llama.cpp and examples is availability of
stuff defined in The Open Group Base Specifications Issue 6
(https://pubs.opengroup.org/onlinepubs/009695399/) known also as
Single Unix Specification v3 (SUSv3) or POSIX.1-2001 + XSI extensions,
plus some stuff from BSD that is not specified in POSIX.1.

Well, that was true until NUMA support was added recently,
so enable GNU libc extensions for Linux builds to cover that.

Not having feature test macros in source code gives greater flexibility
to those wanting to reuse it in 3rd party app, as they can build it with
FTMs set by Makefile here or other FTMs depending on their needs.

It builds without issues in Alpine (musl libc), Ubuntu (glibc), MSYS2.

* make : enable Darwin extensions for macOS to expose RLIMIT_MEMLOCK

* make : enable BSD extensions for DragonFlyBSD to expose RLIMIT_MEMLOCK

* make : use BSD-specific FTMs to enable alloca on BSDs

* make : fix OpenBSD build by exposing newer POSIX definitions

* cmake : follow recent FTM improvements from Makefile
2023-09-08 15:09:21 +03:00
Cebtenzzre
00d62adb79
fix some warnings from gcc and clang-tidy (#3038)
Co-authored-by: xaedes <xaedes@gmail.com>
2023-09-07 13:22:29 -04:00
Georgi Gerganov
e36ecdccc8
build : on Mac OS enable Metal by default (#2901)
* build : on Mac OS enable Metal by default

* make : try to fix build on Linux

* make : move targets back to the top

* make : fix target clean

* llama : enable GPU inference by default with Metal

* llama : fix vocab_only logic when GPU is enabled

* common : better `n_gpu_layers` assignment

* readme : update Metal instructions

* make : fix merge conflict remnants

* gitignore : metal
2023-09-04 22:26:24 +03:00
Georgi Gerganov
47068e5170
speculative : PoC for speeding-up inference via speculative sampling (#2926)
* speculative : initial example

* speculative : print encoding speed

* speculative : add --draft CLI arg
2023-09-03 15:12:08 +03:00
Georgi Gerganov
8f429fa511
perplexity : fix ETA by warming up the model with an empty run 2023-09-03 13:43:17 +03:00
ZHAOKAI WANG
69fdbb9abc
readme : quick start command fix (#2908)
* quick start command fix

* quick start win command fix
2023-09-01 17:06:44 +03:00
staviq
8341a25957
main : log file (#2748)
* initial, base LOG macro

* add *.log to .gitignore

* added basic log file handler

* reverted log auto endline to better mimic printf

* remove atomics and add dynamic log target

* log_enable/disable, LOG_TEE, basic usage doc

* update .gitignore

* mv include to common, params, help msg

* log tostring helpers, token vectors pretty prints

* main: replaced fprintf/LOG_TEE, some trace logging

* LOG_DISABLE_LOGS compile flag, wrapped f in macros

* fix LOG_TEELN and configchecker

* stub LOG_DUMP_CMDLINE for WIN32 for now

* fix msvc

* cleanup main.cpp:273

* fix stray whitespace after master sync

* log : fix compile warnings

- do not use C++20 stuff
- use PRIu64 to print uint64_t
- avoid string copies by using const ref
- fix ", ##__VA_ARGS__" warnings
- compare strings with == and !=

* log : do not append to existing log + disable file line func by default

* log : try to fix Windows build

* main : wip logs

* main : add trace log

* review: macro f lowercase, str append to sstream

* review: simplify ifs and str comparisons

* fix MSVC, formatting, FMT/VAL placeholders

* review: if/else cleanup

* review: if/else cleanup (2)

* replace _ prefix with _impl suffix

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-30 09:29:32 +03:00
Johannes Gäßler
6b73ef1201
YAML result logging + preset script (#2657) 2023-08-28 17:59:39 +02:00
Georgi Gerganov
edd4c14817
llama : more tokenizer fixes (#2810)
* tests : write a Python tokenizer test (wip)

* llama : prefix input text for tokenization with whitespace

* llama : distinguish pieces from decoded text + fix detokenization

* common : add comments

* examples : no longer manually add leading space when tokenizing

* tests : use Python to generate tokenizer tests for C++

* tests : add option to tokenize text files

ggml-ci

* tests : add test-tokenizer-1.py

* llama.cpp : fix LF token

* hellaswag : move the concat space for clarity

* tests : add falcon tests (py + cpp, currently do not pass Unicode)

ggml-ci

* common : temporary separate llama_detokenize calls for SPM and BPE

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
2023-08-27 14:19:19 +03:00
Dr. Tom Murphy VII Ph.D
72f895c923
main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (#1528)
* Fix bug in main.cpp where penalize_nl=false has no effect. It modifies the underlying logits array, but at this point we are already working on the candidates copy.

* Suppress redefinition warning for NOMINMAX on mingw. In my installation, this macro is already defined by /usr/lib/gcc/x86_64-w64-mingw32/11/include/c++/x86_64-w64-mingw32/bits/os_defines.h:45.

* main : fix indentation

* main : pass ctx to llama_token_nl()

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-26 21:12:56 +03:00
klosax
2ba83c8685
Fix spm whitespaces (#2806)
* llama.cpp : fix spm whitespace escaping + clean up

* main.cpp : spm - add whitespace in front of prompt

* test-tokenizer-0.cpp : spm - add whitespace in front of prompt
2023-08-26 13:45:53 +02:00
Kerfuffle
7694adda8d
Fix for main example getting stuck when -n -2 and --interactive (#2767)
* Fix for main example getting stuck when -n -2 and --interactive

* Add a comment so future generations may suffer less.
2023-08-24 10:11:13 -06:00
Georgi Gerganov
cf658adc83
llm : add Falcon support (#2717)
* llama : refactor GGUF constants into static maps

* llama : check if model architecture is known

* llama : refactor llama_model_load_internal()

* gguf : add KV constant maps

* llm : read arch-specific KVs

* convert : add dummy scores + types

* falcon : load tensor data (CPU only)

* llama : fix loading progress bar

* llama : add arch member to llama_model

* falcon : CPU inference working

* falcon : support non-40B models

* falcon : minor

* llama : minor updates

ggml-ci

* convert-falcon-hf-to-gguf.py : fix special token mapping

* llama.cpp : llama default UNK token = id 0

* llama.cpp : fix bpe tokenizer

* llama.cpp : fix the fix of bpe tokenizer

* ggml : pass eps to ggml_norm

* metal : implement RoPE (mode = 2) + avoid ggml_repeat

* ggml : ggml_repeat always creates new tensor

* falcon : copy-paste self-attention from LLaMA

* metal : print extra compute pipeline info

* falcon : minor changes (still chasing the Metal problem)

* llama.cpp : fix linefeed token

* metal : fix GELU kernel numerical stability by using precise::tanh

* metal : temporary workaround for the concurrency optimization bug

* falcon : add CUDA offloading (#2739)

* llama : better model naming and size reporting

* llama : prep new tokenizer support

* llama : advanced BPE tokenizer based on ggllm.cpp imlpementation

* llama : remove oboslete comment

ggml-ci

* common : remove obsolete BPE API + disable test-tokenizer-1

* llama : revert BPE special-case in llama_byte_to_token()

* cuda : add TODOs for RoPE NeoX implementation

* llama : default special tokens based on vocab type

* perplexity : add log for start of tokenization

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
2023-08-23 23:08:04 +03:00
klosax
5290c38e6e
main : insert bos if no tokens (#2727)
* main.cpp : insert bos if no tokens

* Update examples/main/main.cpp

* Update examples/main/main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-23 16:46:03 +02:00