llama.cpp/examples/finetune
Justine Tunney 3855416027
ggml : introduce bfloat16 support (#6412)
* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32
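
The shift described above can be sketched in a few lines. This is a minimal illustration using only Python's `struct` module, not GGML's actual conversion routine; the helper names are hypothetical, and the narrowing direction uses plain truncation as a simplification (production converters typically round and treat NaN specially).

```python
import struct

def bf16_bits_to_float(bits):
    """Widen a 16-bit brain16 pattern to binary32 by shifting it 16 bits left."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def float_to_bf16_bits_truncate(x):
    """Narrow binary32 to brain16 by dropping the low 16 bits.

    Truncation only: a simplification for illustration, not rounding.
    """
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

print(bf16_bits_to_float(0x3F80))   # sign 0, exponent 127, mantissa 0
print(bf16_bits_to_float(0x4049))   # closest brain16 value below pi
print(hex(float_to_bf16_bits_truncate(1.0)))
```

Because both formats share the same sign and exponent layout, widening never changes the value; only narrowing can lose mantissa bits.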

The issue is that converting bf16 to fp16 can lose information. Only
13% of bf16 numbers can be represented exactly in fp16. In practice,
that covers 99.71% of Mistral 7b v0.2's weights, but there is
currently no way other than fp32 to preserve the remaining ones.

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16
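
One way to see the loss is to walk every 16-bit pattern and check whether it round-trips through binary16. The sketch below is an illustration only: the helper names are hypothetical, it relies on the `e` (half precision) format of Python's `struct` module, and it counts NaN patterns as not surviving, which may differ from how the figure quoted in the commit message was computed.

```python
import struct

def bf16_bits_to_float(bits):
    """Widen a brain16 bit pattern to a Python float via the 16-bit shift."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def survives_fp16(bits):
    """True if the brain16 value is unchanged after a trip through binary16."""
    value = bf16_bits_to_float(bits)
    try:
        half = struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        # Finite magnitude beyond the binary16 range.
        return False
    # NaN compares unequal to itself, so NaN patterns count as lost here.
    return half == value

exact = sum(survives_fp16(bits) for bits in range(1 << 16))
print(f"{exact} of {1 << 16} patterns survive ({100.0 * exact / (1 << 16):.2f}%)")
```

Values fail for two reasons: the exponent is outside the narrower binary16 range, or the value falls in the binary16 subnormal range where fewer mantissa bits are available.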

This change fixes that by adding a bf16 data type to GGML. Support
for CPU inference has been implemented, along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves (drops) by roughly 0.0024 to 0.0046 compared to using fp16.

* Remove GGML code that's not needed

* Minimize the GGML API surface area for BF16

* Remove bf16 luts

* Make the GGML header look nicer

* Fix documentation

* Apply ggerganov's fixes for test-backend-ops

* Add BF16 code for new ggml_validate_row_data() function
2024-05-08 09:30:09 +03:00
CMakeLists.txt train : finetune LORA (#2632) 2023-09-28 21:40:11 +03:00
convert-finetune-checkpoint-to-gguf.py py : remove superfluous import statements (#4076) 2023-11-17 17:20:53 +02:00
finetune.cpp ggml : introduce bfloat16 support (#6412) 2024-05-08 09:30:09 +03:00
finetune.sh finetune : add -ngl parameter (#3762) 2023-11-01 13:49:04 +02:00
README.md finetune : rename feed-forward tensors (w1/w2/w3) (#4839) 2024-02-13 15:15:42 +02:00

finetune

Basic usage instructions:

# get training data
wget https://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/shakespeare.txt

# finetune LORA adapter
./bin/finetune \
        --model-base open-llama-3b-v2-q8_0.gguf \
        --checkpoint-in  chk-lora-open-llama-3b-v2-q8_0-shakespeare-LATEST.gguf \
        --checkpoint-out chk-lora-open-llama-3b-v2-q8_0-shakespeare-ITERATION.gguf \
        --lora-out lora-open-llama-3b-v2-q8_0-shakespeare-ITERATION.bin \
        --train-data "shakespeare.txt" \
        --save-every 10 \
        --threads 6 --adam-iter 30 --batch 4 --ctx 64 \
        --use-checkpointing

# predict
./bin/main -m open-llama-3b-v2-q8_0.gguf --lora lora-open-llama-3b-v2-q8_0-shakespeare-LATEST.bin

Only llama-based models are supported! Output files are saved every N iterations (configured with --save-every N). The pattern 'ITERATION' in the output filenames is replaced with the iteration number, and with 'LATEST' for the most recent output. So in the example above, after 10 iterations these files will be written:

  • chk-lora-open-llama-3b-v2-q8_0-shakespeare-10.gguf
  • chk-lora-open-llama-3b-v2-q8_0-shakespeare-LATEST.gguf
  • lora-open-llama-3b-v2-q8_0-shakespeare-10.bin
  • lora-open-llama-3b-v2-q8_0-shakespeare-LATEST.bin

After 10 more iterations:

  • chk-lora-open-llama-3b-v2-q8_0-shakespeare-20.gguf
  • chk-lora-open-llama-3b-v2-q8_0-shakespeare-LATEST.gguf
  • lora-open-llama-3b-v2-q8_0-shakespeare-20.bin
  • lora-open-llama-3b-v2-q8_0-shakespeare-LATEST.bin
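
The naming rule above can be sketched as a small helper. This is only an illustration of the described behaviour, not the code in finetune.cpp; `output_names` is a hypothetical name.

```python
def output_names(pattern, iteration):
    """Expand one output pattern into its numbered and 'LATEST' filenames."""
    return [
        pattern.replace("ITERATION", str(iteration)),
        pattern.replace("ITERATION", "LATEST"),
    ]

patterns = [
    "chk-lora-open-llama-3b-v2-q8_0-shakespeare-ITERATION.gguf",
    "lora-open-llama-3b-v2-q8_0-shakespeare-ITERATION.bin",
]
save_every = 10

# Simulate 20 iterations and list what would be written at each save point.
for iteration in range(1, 21):
    if iteration % save_every == 0:
        for pattern in patterns:
            for name in output_names(pattern, iteration):
                print(f"iteration {iteration}: {name}")
```

Note that the 'LATEST' files are overwritten at every save point, while the numbered files accumulate.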

Checkpoint files (--checkpoint-in FN, --checkpoint-out FN) store the training state. When the input checkpoint file does not exist, finetune begins training a new, randomly initialized adapter.
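
The resume-or-start-fresh rule can be sketched as follows. The function name is hypothetical and the sketch only mirrors the rule as stated here; it does not inspect the checkpoint contents.

```python
import os

def starting_point(checkpoint_in):
    """Resume if the input checkpoint exists, otherwise start a new adapter."""
    if os.path.exists(checkpoint_in):
        return "resume from " + checkpoint_in
    return "start a new randomly initialized adapter"

print(starting_point("chk-lora-open-llama-3b-v2-q8_0-shakespeare-LATEST.gguf"))
```

Pointing --checkpoint-in at the 'LATEST' checkpoint, as the usage example does, means the same command line works for both the first run and later resumed runs.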

llama.cpp-compatible LORA adapters are saved under the filename specified by --lora-out FN. These LORA adapters can then be used by main together with the base model, as in the 'predict' example command above.

In main you can also load multiple LORA adapters, which will then be mixed together.

For example, if you have two LORA adapters, lora-open-llama-3b-v2-q8_0-shakespeare-LATEST.bin and lora-open-llama-3b-v2-q8_0-bible-LATEST.bin, you can mix them together like this:

./bin/main -m open-llama-3b-v2-q8_0.gguf \
  --lora lora-open-llama-3b-v2-q8_0-shakespeare-LATEST.bin \
  --lora lora-open-llama-3b-v2-q8_0-bible-LATEST.bin

You can change how strongly each LORA adapter is applied to the base model by using --lora-scaled FN SCALE instead of --lora FN.

For example, to apply 40% of the 'shakespeare' LORA adapter, 80% of the 'bible' LORA adapter and 100% of yet another one:

./bin/main -m open-llama-3b-v2-q8_0.gguf \
  --lora-scaled lora-open-llama-3b-v2-q8_0-shakespeare-LATEST.bin 0.4 \
  --lora-scaled lora-open-llama-3b-v2-q8_0-bible-LATEST.bin 0.8 \
  --lora lora-open-llama-3b-v2-q8_0-yet-another-one-LATEST.bin

The scale numbers don't need to add up to one, and you can also use numbers greater than 1 to further increase the influence of an adapter. However, values that are too large will sometimes produce worse output, so experiment to find good values.
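
To build intuition for why the scales are independent, the mixing can be modelled as adding each adapter's weight update, multiplied by its scale, on top of the base weights. The sketch below assumes that simple additive model and uses tiny made-up matrices; `mix_adapters` is a hypothetical helper, and the real code in main also folds in the adapter's own rank and alpha scaling.

```python
def mix_adapters(base, adapters):
    """Return base + sum(scale * delta) for each (delta, scale) pair.

    Matrices are lists of rows; every delta must have the shape of base.
    """
    mixed = [row[:] for row in base]
    for delta, scale in adapters:
        for i, row in enumerate(delta):
            for j, d in enumerate(row):
                mixed[i][j] += scale * d
    return mixed

# Toy 2x2 weights, chosen only for illustration.
base        = [[1.0, 0.0], [0.0, 1.0]]
shakespeare = [[1.0, 0.0], [0.0, 0.0]]
bible       = [[0.0, 1.0], [0.0, 0.0]]
another     = [[0.0, 0.0], [0.0, 1.0]]

mixed = mix_adapters(base, [(shakespeare, 0.4), (bible, 0.8), (another, 1.0)])
for row in mixed:
    print(row)
```

Under this model, nothing forces the scales to sum to one: each scale only stretches its own adapter's contribution.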

Gradient checkpointing reduces the memory requirements by ~50% but increases the runtime. If you have enough RAM, you can make finetuning a bit faster by disabling checkpointing with --no-checkpointing.

The default LORA rank can be specified with --lora-r N. The LORA rank can be configured for each model tensor type separately with these command line options:

  --lora-r N                 LORA r: default rank. Also specifies resulting scaling together with lora-alpha. (default 4)
  --rank-att-norm N          LORA rank for attention norm tensor (default 1)
  --rank-ffn-norm N          LORA rank for feed-forward norm tensor (default 1)
  --rank-out-norm N          LORA rank for output norm tensor (default 1)
  --rank-tok-embd N          LORA rank for token embeddings tensor (default 4)
  --rank-out N               LORA rank for output tensor (default 4)
  --rank-wq N                LORA rank for wq tensor (default 4)
  --rank-wk N                LORA rank for wk tensor (default 4)
  --rank-wv N                LORA rank for wv tensor (default 4)
  --rank-wo N                LORA rank for wo tensor (default 4)
  --rank-ffn_gate N          LORA rank for ffn_gate tensor (default 4)
  --rank-ffn_down N          LORA rank for ffn_down tensor (default 4)
  --rank-ffn_up N            LORA rank for ffn_up tensor (default 4)

The LORA rank of 'norm' tensors should always be 1.
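
As a rough guide to what the rank options cost, a rank-r LoRA pair for a d_out x d_in tensor holds r * (d_in + d_out) trainable values. The sketch below applies that standard formula to made-up shapes; `lora_param_count` and the width 3200 are assumptions for illustration, not values read from a model file.

```python
def lora_param_count(d_out, d_in, rank):
    """Values in a rank-r pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical square projection tensor, chosen only for illustration.
n_embd = 3200
full = n_embd * n_embd
for rank in (1, 4, 16):
    added = lora_param_count(n_embd, n_embd, rank)
    print(f"rank {rank:2d}: {added:7d} trainable values vs {full} in the full tensor")

# A norm tensor is a vector (n_embd x 1), so it can never exceed rank 1.
print(lora_param_count(n_embd, 1, 1))
```

This is one plausible reading of why rank 1 is recommended for 'norm' tensors: a vector-shaped tensor has rank at most 1, so a larger rank would only add redundant parameters.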

To see all available options use finetune --help.