# llama.cpp/example/tts
This example demonstrates the Text To Speech feature. It uses a
[model](https://www.outeai.com/blog/outetts-0.2-500m) from
[outeai](https://www.outeai.com/).

## Quickstart
If you have built llama.cpp with `-DLLAMA_CURL=ON` you can simply run the
following command and the required models will be downloaded automatically:
```console
$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav
```
For details about the models and how to convert them to the required format
see the following sections.

### Model conversion
Check out or download the repository that contains the LLM model:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull
$ popd
```
Convert the model to .gguf format:
```console
(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \
    --outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16
```
The generated model will be `models/outetts-0.2-0.5B-f16.gguf`.

We can optionally quantize this to Q8_0 using the following command:
```console
$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \
    models/outetts-0.2-0.5B-q8_0.gguf q8_0
```
The quantized model will be `models/outetts-0.2-0.5B-q8_0.gguf`.
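
As a rough guide to what quantization saves, the sketch below estimates the weight storage for F16 and Q8_0. It assumes the Q8_0 layout used by ggml (blocks of 32 weights stored as 32 `int8` values plus one 16-bit scale, i.e. 34 bytes per block) and takes the nominal 500M parameter count from the model name as an assumption. Real files also hold metadata and may keep some tensors unquantized, so actual sizes will differ.

```python
# Back-of-the-envelope size estimate; the helper names are illustrative only.
Q8_0_BLOCK_WEIGHTS = 32      # weights per Q8_0 block
Q8_0_BLOCK_BYTES = 2 + 32    # one 16-bit scale + 32 int8 values


def f16_bytes(n_weights: int) -> int:
    """Bytes needed to store n_weights as 16-bit floats."""
    return n_weights * 2


def q8_0_bytes(n_weights: int) -> int:
    """Bytes needed to store n_weights as Q8_0 blocks (rounded up to whole blocks)."""
    n_blocks = -(-n_weights // Q8_0_BLOCK_WEIGHTS)  # ceiling division
    return n_blocks * Q8_0_BLOCK_BYTES


def q8_0_ratio(n_weights: int) -> float:
    """Estimated Q8_0 size as a fraction of the F16 size."""
    return q8_0_bytes(n_weights) / f16_bytes(n_weights)
```

For example, `q8_0_ratio(500_000_000)` evaluates to about 0.53, so under these assumptions the quantized weights take a little over half the space of the F16 weights.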

Next we do something similar for the audio decoder. First download or check out
the model for the voice decoder:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token
$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull
$ popd
```
This model file is a PyTorch checkpoint (.ckpt), so we first need to convert it
to Hugging Face format:
```console
(venv) python examples/tts/convert_pt_to_hf.py \
    models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt
...
Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors
Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json
Config has been saved to models/WavTokenizer-large-speech-75tokenconfig.json
```
Then we can convert the Hugging Face format to GGUF:
```console
(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \
    --outfile models/wavtokenizer-large-75-f16.gguf --outtype f16
...
INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf
```
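
Before moving on, it can be useful to confirm that the conversion steps produced GGUF files. The following is a minimal sketch that relies only on the fact that a GGUF file begins with the four ASCII bytes `GGUF`; the function name `looks_like_gguf` is hypothetical and not part of llama.cpp, and the check does not validate the rest of the file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path: str) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC
```

Calling `looks_like_gguf("models/wavtokenizer-large-75-f16.gguf")` should return `True` once the conversion above has completed.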

### Running the example

With both of the models generated, the LLM model and the voice decoder model,
we can run the example:
```console
$ build/bin/llama-tts -m ./models/outetts-0.2-0.5B-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75-f16.gguf \
    -p "Hello world"
...
main: audio written to file 'output.wav'
```
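
If you want to drive the example from a script, the command above can be wrapped as follows. This is a sketch that only uses the `-m`, `-mv` and `-p` options shown in this README; the function names and the default binary path are assumptions for illustration.

```python
import subprocess


def build_tts_command(llm_model, voice_decoder, prompt,
                      binary="build/bin/llama-tts"):
    """Build the llama-tts argument list using the options from this README."""
    return [binary, "-m", llm_model, "-mv", voice_decoder, "-p", prompt]


def run_tts(llm_model, voice_decoder, prompt, binary="build/bin/llama-tts"):
    """Run llama-tts and raise CalledProcessError if it exits with an error."""
    cmd = build_tts_command(llm_model, voice_decoder, prompt, binary)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes. A call such as `run_tts("models/outetts-0.2-0.5B-q8_0.gguf", "models/wavtokenizer-large-75-f16.gguf", "Hello world")` mirrors the console command above.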
The `output.wav` file contains the synthesized speech for the prompt. It can be
played with any media player. On Linux the following command will play the
audio:
```console
$ aplay output.wav
```
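
On systems without `aplay`, or to check the file without playing it, the generated audio can be inspected with Python's standard `wave` module. This sketch assumes `output.wav` is an uncompressed PCM WAV file, which is the only kind the `wave` module reads; `wav_info` is a hypothetical helper name.

```python
import wave


def wav_info(path):
    """Return basic properties of a PCM WAV file as a dictionary."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_width_bytes": w.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }
```

Running `print(wav_info("output.wav"))` after the example has finished shows the channel count, sample width, sample rate and duration of the generated speech.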