llama.cpp/common
goerch ff5a3f0c09
Work on the BPE tokenizer (#3252)
* Work on the BPE tokenizer

Tokenizer tests work for Falcon-7B

* Try to fix build problem

* Fix debug assertion failure

* Fix MSVC Unicode BOM problem

* Cleanup and an improvement

* Fix compiler warning

* Cleanup

* Test doesn't work over the full range of Unicodes

* Update .gitignore and Makefile

* Another Makefile rule

* Testing Aquila

* Moving byte decoding back to `token_to_piece` ...

... because everyone is using it.

* Guarding some unusable code pathes

* Streamlining code and adding some more assertions

Important change: I'm classifying added tokens as control tokens now for BPE.

* Adding a comment

* Adding another assertion

* Fixed vocabulary guarding assertions

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fix PR for recent change

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fixes for more compiler warnings

* Remove unused code

* Fix initialization of static maps

* Add scores and token types back, adapt gptneox

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update unicode.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update unicode.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Ported Starcoder and added some assertions

* Fix coding style

* Apply @jploski 's fix for missing tokens

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-03 09:16:26 +02:00
..
CMakeLists.txt train : finetune LORA (#2632) 2023-09-28 21:40:11 +03:00
common.cpp Work on the BPE tokenizer (#3252) 2023-10-03 09:16:26 +02:00
common.h infill : add new example + extend server API (#3296) 2023-10-02 10:42:02 +03:00
console.cpp check C++ code with -Wmissing-declarations (#3184) 2023-09-15 15:38:27 -04:00
console.h gguf : new file format with flexible meta data (beta) (#2398) 2023-08-21 23:07:43 +03:00
grammar-parser.cpp check C++ code with -Wmissing-declarations (#3184) 2023-09-15 15:38:27 -04:00
grammar-parser.h gguf : new file format with flexible meta data (beta) (#2398) 2023-08-21 23:07:43 +03:00
log.h build : enable more non-default compiler warnings (#3200) 2023-09-28 17:41:44 -04:00
train.cpp llama.cpp : split llama_context_params into model and context params (#3301) 2023-09-28 22:42:38 +03:00
train.h train : finetune LORA (#2632) 2023-09-28 21:40:11 +03:00