llama.cpp/examples/tts

This example demonstrates the text-to-speech (TTS) feature. It uses a model from OuteAI.

Quickstart

If you have built llama.cpp with -DLLAMA_CURL=ON, you can simply run the following command and the required models will be downloaded automatically:

$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav

For details about the models and how to convert them to the required format see the following sections.

Model conversion

Check out or download the LLM model:

$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull
$ popd

Convert the model to .gguf format:

(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \
    --outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16

The generated model will be models/outetts-0.2-0.5B-f16.gguf.

We can optionally quantize this to Q8_0 using the following command:

$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \
    models/outetts-0.2-0.5B-q8_0.gguf q8_0

The quantized model will be models/outetts-0.2-0.5B-q8_0.gguf.

Next we do something similar for the audio decoder. First, download or check out the voice decoder model:

$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token
$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull
$ popd

This model file is a PyTorch checkpoint (.ckpt), so we first need to convert it to Hugging Face format:

(venv) python examples/tts/convert_pt_to_hf.py \
    models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt
...
Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors
Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json
Config has been saved to models/WavTokenizer-large-speech-75tokenconfig.json
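
To verify that the conversion produced a readable weights file, the safetensors container can be inspected with the Python standard library alone. This is an illustrative sketch, not part of the example: it assumes the safetensors layout of an 8-byte little-endian header length followed by a JSON header mapping tensor names to metadata, and the model path produced above:

```python
import json
import os
import struct

def safetensors_keys(path):
    """Return the tensor names stored in a .safetensors file.

    The file begins with an 8-byte little-endian header length,
    followed by a JSON header keyed by tensor name.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional entry that is not a tensor
    return sorted(k for k in header if k != "__metadata__")

if __name__ == "__main__":
    path = "models/WavTokenizer-large-speech-75token/model.safetensors"
    if os.path.exists(path):
        print(safetensors_keys(path)[:5])
```

If the header parses and tensor names are listed, the conversion step completed sanely.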

Then we can convert the Hugging Face format to GGUF:

(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \
    --outfile models/wavtokenizer-large-75-f16.gguf --outtype f16
...
INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf
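
To sanity-check an exported file before running it, the GGUF header can be read with a few lines of Python. This is a hedged sketch, not part of llama.cpp: it assumes the GGUF v2/v3 header layout (4-byte magic b"GGUF", uint32 version, uint64 tensor count, uint64 metadata key/value count, all little-endian):

```python
import os
import struct
import sys

def gguf_header(path):
    """Read the GGUF file header (v2/v3 layout)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file: %r" % magic)
        # uint32 version, uint64 tensor count, uint64 metadata KV count
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "models/wavtokenizer-large-75-f16.gguf"
    if os.path.exists(path):
        print(gguf_header(path))
```

A wrong magic or an implausible tensor count is a quick sign that the conversion did not produce a valid GGUF file.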

Running the example

With both models generated, the LLM and the voice decoder, we can run the example:

$ build/bin/llama-tts -m ./models/outetts-0.2-0.5B-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75-f16.gguf \
    -p "Hello world"
...
main: audio written to file 'output.wav'

The output.wav file contains the audio for the prompt and can be played with any media player. On Linux, the following command will play it:

$ aplay output.wav
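
To check the generated audio programmatically (for example, when batching many prompts), Python's standard wave module can report the basic parameters. This is a small illustrative sketch; output.wav is the file written by llama-tts above:

```python
import os
import wave

def wav_info(path):
    """Return sample rate, channel count and duration (seconds) of a WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return {
            "rate": rate,
            "channels": w.getnchannels(),
            "seconds": w.getnframes() / rate,
        }

if __name__ == "__main__":
    if os.path.exists("output.wav"):
        print(wav_info("output.wav"))
```

A zero-length or zero-frame result indicates that generation failed even though the file was written.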