llama.cpp/scripts
jaime-m-p 43248e5594
llama3 custom regex split (#6965)
* merged the changes from deepseeker models to main branch

* Moved regex patterns to unicode.cpp and updated unicode.h

* Moved header files

* Resolved issues

* added and refactored unicode_regex_split and related functions

* Updated/merged the deepseek coder pr

* Refactored code

* Adding unicode regex mappings

* Adding unicode regex function

* Added needed functionality, testing remains

* Fixed issues

* Fixed issue with gpt2 regex custom preprocessor

* unicode : fix? unicode_wstring_to_utf8

* lint : fix whitespaces

* tests : add tokenizer tests for numbers

* unicode : remove redundant headers

* tests : remove and rename tokenizer test scripts

* tests : add sample usage

* gguf-py : reader prints warnings on duplicate keys

* llama : towards llama3 tokenization support (wip)

* unicode : shot in the dark to fix tests on Windows

* unicode : first try custom implementations

* convert : add "tokenizer.ggml.pre" GGUF KV (wip)

* llama : use new pre-tokenizer type

* convert : fix pre-tokenizer type writing

* lint : fix

* make : add test-tokenizer-0-llama-v3

* wip

* models : add llama v3 vocab file

* llama : adapt punctuation regex + add llama 3 regex

* minor

* unicode : set bomb

* unicode : set bomb

* unicode : always use std::wregex

* unicode : support \p{N}, \p{L} and \p{P} natively

* unicode : try fix windows

* unicode : category support via std::regex

* unicode : clean-up

* unicode : simplify

* llama3 custom regex split

* convert : add convert-hf-to-gguf-update.py

ggml-ci

* lint : update

* convert : add falcon

ggml-ci

* unicode : normalize signatures

* lint : fix

* lint : fix

* convert : remove unused functions

* convert : add comments

* convert : exercise contractions

ggml-ci

* Using char32_t for codepoints

* lint : fix

* already exists unicode_tolower()

* Typing

* Restore BOM

* cmake : refactor test targets

* tests : refactor vocab tests

ggml-ci

* tests : add more vocabs and tests

ggml-ci

* unicode : cleanup

* scripts : ignore new update script in check-requirements.sh

* Fix merge

* models : add phi-3, mpt, gpt-2, starcoder

* tests : disable obsolete

ggml-ci

* tests : use faster bpe test

ggml-ci

* llama : more prominent warning for old BPE models

* tests : disable test-tokenizer-1-bpe due to slowness

ggml-ci

* Move unused variable value

* GPT2 custom regex split

* Add alternative regex for custom split llama3

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Style

* Add bruteforce random tests for token encoding

* wip: fixing unicode codepoint ranges

* Fix merge

* Unicode tables: separator, lowercase, uppercase and whitespace

* llama3 custom regex split: fix \s

* Restore BOM

* Style

* wip: generate NDF table

* Ignore special tokens for testing

* Clean gen-unicode-data.py

* Refactor random tokenizer test

* lint : fix

* tests : add fail test for llama-bpe

---------

Co-authored-by: Jaggzh <jaggz.h@gmail.com>
Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: jaime-m-p <>
2024-05-09 23:30:44 +10:00
build-info.cmake cmake : fix issue with version info not getting baked into LlamaConfig.cmake (#3970) 2023-11-27 21:25:42 +02:00
build-info.sh build : link against build info instead of compiling against it (#3879) 2023-11-02 08:50:16 +02:00
check-requirements.sh llama : fix BPE pre-tokenization (#6920) 2024-04-29 16:58:41 +03:00
ci-run.sh ci : add model tests + script wrapper (#4586) 2024-01-26 14:18:00 +02:00
compare-commits.sh ggml : group all experts in a single ggml_mul_mat_id (#6505) 2024-04-18 15:18:48 +02:00
compare-llama-bench.py compare-llama-bench.py: add missing basicConfig (#7138) 2024-05-08 10:54:39 +02:00
convert-gg.sh scripts : helper convert script 2023-08-27 15:24:58 +03:00
gen-authors.sh license : update copyright notice + add AUTHORS (#6405) 2024-04-09 09:23:19 +03:00
gen-build-info-cpp.cmake cmake : fix issue with version info not getting baked into LlamaConfig.cmake (#3970) 2023-11-27 21:25:42 +02:00
gen-unicode-data.py llama3 custom regex split (#6965) 2024-05-09 23:30:44 +10:00
get-flags.mk build : pass all warning flags to nvcc via -Xcompiler (#5570) 2024-02-18 16:21:52 -05:00
get-hellaswag.sh scripts : add get-winogrande.sh 2024-01-18 20:45:39 +02:00
get-pg.sh scripts : improve get-pg.sh (#4838) 2024-01-09 19:21:13 +02:00
get-wikitext-2.sh model: support arch DbrxForCausalLM (#6515) 2024-04-13 11:33:52 +02:00
get-wikitext-103.sh lookup: complement data from context with general text statistics (#5479) 2024-03-23 01:24:36 +01:00
get-winogrande.sh scripts : add get-winogrande.sh 2024-01-18 20:45:39 +02:00
hf.sh scripts : add --outdir option to hf.sh (#6600) 2024-04-11 16:22:47 +03:00
install-oneapi.bat support SYCL backend windows build (#5208) 2024-01-31 08:08:07 +05:30
LlamaConfig.cmake.in cuda : rename build flag to LLAMA_CUDA (#6299) 2024-03-26 01:16:01 +01:00
pod-llama.sh cuda : rename build flag to LLAMA_CUDA (#6299) 2024-03-26 01:16:01 +01:00
qnt-all.sh scripts : add pipefail 2023-08-29 10:50:30 +03:00
run-all-perf.sh scripts : add pipefail 2023-08-29 10:50:30 +03:00
run-all-ppl.sh scripts : add pipefail 2023-08-29 10:50:30 +03:00
run-with-preset.py convert.py : add python logging instead of print() (#6511) 2024-05-03 22:36:41 +03:00
server-llm.sh cuda : rename build flag to LLAMA_CUDA (#6299) 2024-03-26 01:16:01 +01:00
sync-ggml-am.sh license : update copyright notice + add AUTHORS (#6405) 2024-04-09 09:23:19 +03:00
sync-ggml.last sync : ggml 2024-04-09 20:29:06 +03:00
sync-ggml.sh license : update copyright notice + add AUTHORS (#6405) 2024-04-09 09:23:19 +03:00
verify-checksum-models.py convert.py : add python logging instead of print() (#6511) 2024-05-03 22:36:41 +03:00
xxd.cmake build: generate hex dump of server assets during build (#6661) 2024-04-21 18:48:53 +01:00