llama.cpp/examples/server/bench/README.md

### Server benchmark tools

Benchmark is using [k6](https://k6.io/).

##### Install k6 and sse extension

SSE is not supported by default in k6, you have to build k6 with the [xk6-sse](https://github.com/phymbert/xk6-sse) extension.

Example:
```shell
go install go.k6.io/xk6/cmd/xk6@latest
xk6 build master \
--with github.com/phymbert/xk6-sse
```

#### Download a dataset

This dataset was originally proposed in [vLLM benchmarks](https://github.com/vllm-project/vllm/blob/main/benchmarks/README.md).

```shell
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

#### Download a model
Example for PHI-2

```shell
../../../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf
```

#### Start the server
The server must answer OAI Chat completion requests on `http://localhost:8080/v1` or according to the environment variable `SERVER_BENCH_URL`.

Example:
```shell
server --host localhost --port 8080 \
  --model ggml-model-q4_0.gguf \
  --cont-batching \
  --metrics \
  --parallel 8 \
  --batch-size 512 \
  --ctx-size 4096 \
  -ngl 33
```

#### Run the benchmark

For 500 chat completions request with 8 concurrent users during maximum 10 minutes, run:
```shell
./k6 run script.js --duration 10m --iterations 500 --vus 8
```

The benchmark values can be overridden with:
- `SERVER_BENCH_URL` server url prefix for chat completions, default `http://localhost:8080/v1`
- `SERVER_BENCH_N_PROMPTS` total prompts to randomly select in the benchmark, default `480`
- `SERVER_BENCH_MODEL_ALIAS` model alias to pass in the completion request, default `my-model`
- `SERVER_BENCH_MAX_TOKENS` max tokens to predict, default: `512`
- `SERVER_BENCH_DATASET` path to the benchmark dataset file
- `SERVER_BENCH_MAX_PROMPT_TOKENS` maximum prompt tokens to filter out in the dataset: default `1024`
- `SERVER_BENCH_MAX_CONTEXT` maximum context size of the completions request to filter out in the dataset: prompt + predicted tokens, default `2048`

Note: the local tokenizer is just a string space split, real number of tokens will differ.

Or with [k6 options](https://k6.io/docs/using-k6/k6-options/reference/):

```shell
SERVER_BENCH_N_PROMPTS=500 k6 run script.js --duration 10m --iterations 500 --vus 8
```

To [debug http request](https://k6.io/docs/using-k6/http-debugging/) use `--http-debug="full"`.

#### Metrics

Following metrics are available computed from the OAI chat completions response `usage`:
- `llamacpp_tokens_second` Trend of `usage.total_tokens / request duration`
- `llamacpp_prompt_tokens` Trend of `usage.prompt_tokens`
- `llamacpp_prompt_tokens_total_counter` Counter of `usage.prompt_tokens`
- `llamacpp_completion_tokens` Trend of `usage.completion_tokens`
- `llamacpp_completion_tokens_total_counter` Counter of `usage.completion_tokens`
- `llamacpp_completions_truncated_rate` Rate of completions truncated, i.e. if `finish_reason === 'length'`
- `llamacpp_completions_stop_rate` Rate of completions stopped by the model, i.e. if `finish_reason === 'stop'`

The script will fail if too many completions are truncated, see `llamacpp_completions_truncated_rate`.

K6 metrics might be compared against [server metrics](../README.md), with:

```shell
curl http://localhost:8080/metrics
```

### Using the CI python script
The `bench.py` script does several steps:
- start the server
- define good variable for k6
- run k6 script
- extract metrics from prometheus

It aims to be used in the CI, but you can run it manually:

```shell
LLAMA_SERVER_BIN_PATH=../../../cmake-build-release/bin/llama-server python bench.py \
              --runner-label local \
              --name local \
              --branch `git rev-parse --abbrev-ref HEAD` \
              --commit `git rev-parse HEAD` \
              --scenario script.js \
              --duration 5m \
              --hf-repo ggml-org/models	 \
              --hf-file phi-2/ggml-model-q4_0.gguf \
              --model-path-prefix models \
              --parallel 4 \
              -ngl 33 \
              --batch-size 2048 \
              --ubatch-size	256 \
              --ctx-size 4096 \
              --n-prompts 200 \
              --max-prompt-tokens 256 \
              --max-tokens 256
```
server: benchmark: chat/completions scenario and other llm servers comparison (#5941) * server: bench: Init a bench scenario with K6 See #5827 * server: bench: EOL EOF * server: bench: PR feedback and improved k6 script configuration * server: bench: remove llamacpp_completions_tokens_seconds as it include prompt processing time and it's misleading server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS server: bench: increase truncated rate to 80% before failing * server: bench: fix doc * server: bench: change gauge custom metrics to trend * server: bench: change gauge custom metrics to trend server: bench: add trend custom metrics for total tokens per second average * server: bench: doc add an option to debug http request * server: bench: filter dataset too short and too long sequences * server: bench: allow to filter out conversation in the dataset based on env variable * server: bench: fix assistant message sent instead of user message * server: bench: fix assistant message sent instead of user message * server : add defrag thold parameter * server: bench: select prompts based on the current iteration id not randomly to make the bench more reproducible --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2024-03-09 22:41:49 +00:00			`### Server benchmark tools`

			`Benchmark is using [k6](https://k6.io/).`

ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495) * ci: bench: support sse and fix prompt processing time server: add tokens usage in stream mode * ci: bench: README.md EOL * ci: bench: remove total pp and tg as it is not accurate * ci: bench: fix case when there is no token generated * ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics * ci: bench: fix finish reason rate 2024-04-06 03:40:47 +00:00			`##### Install k6 and sse extension`
server: benchmark: chat/completions scenario and other llm servers comparison (#5941) * server: bench: Init a bench scenario with K6 See #5827 * server: bench: EOL EOF * server: bench: PR feedback and improved k6 script configuration * server: bench: remove llamacpp_completions_tokens_seconds as it include prompt processing time and it's misleading server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS server: bench: increase truncated rate to 80% before failing * server: bench: fix doc * server: bench: change gauge custom metrics to trend * server: bench: change gauge custom metrics to trend server: bench: add trend custom metrics for total tokens per second average * server: bench: doc add an option to debug http request * server: bench: filter dataset too short and too long sequences * server: bench: allow to filter out conversation in the dataset based on env variable * server: bench: fix assistant message sent instead of user message * server: bench: fix assistant message sent instead of user message * server : add defrag thold parameter * server: bench: select prompts based on the current iteration id not randomly to make the bench more reproducible --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2024-03-09 22:41:49 +00:00
ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495) * ci: bench: support sse and fix prompt processing time server: add tokens usage in stream mode * ci: bench: README.md EOL * ci: bench: remove total pp and tg as it is not accurate * ci: bench: fix case when there is no token generated * ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics * ci: bench: fix finish reason rate 2024-04-06 03:40:47 +00:00			`SSE is not supported by default in k6, you have to build k6 with the [xk6-sse](https://github.com/phymbert/xk6-sse) extension.`
server: benchmark: chat/completions scenario and other llm servers comparison (#5941) * server: bench: Init a bench scenario with K6 See #5827 * server: bench: EOL EOF * server: bench: PR feedback and improved k6 script configuration * server: bench: remove llamacpp_completions_tokens_seconds as it include prompt processing time and it's misleading server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS server: bench: increase truncated rate to 80% before failing * server: bench: fix doc * server: bench: change gauge custom metrics to trend * server: bench: change gauge custom metrics to trend server: bench: add trend custom metrics for total tokens per second average * server: bench: doc add an option to debug http request * server: bench: filter dataset too short and too long sequences * server: bench: allow to filter out conversation in the dataset based on env variable * server: bench: fix assistant message sent instead of user message * server: bench: fix assistant message sent instead of user message * server : add defrag thold parameter * server: bench: select prompts based on the current iteration id not randomly to make the bench more reproducible --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2024-03-09 22:41:49 +00:00
ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495) * ci: bench: support sse and fix prompt processing time server: add tokens usage in stream mode * ci: bench: README.md EOL * ci: bench: remove total pp and tg as it is not accurate * ci: bench: fix case when there is no token generated * ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics * ci: bench: fix finish reason rate 2024-04-06 03:40:47 +00:00			`Example:`
server: benchmark: chat/completions scenario and other llm servers comparison (#5941) * server: bench: Init a bench scenario with K6 See #5827 * server: bench: EOL EOF * server: bench: PR feedback and improved k6 script configuration * server: bench: remove llamacpp_completions_tokens_seconds as it include prompt processing time and it's misleading server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS server: bench: increase truncated rate to 80% before failing * server: bench: fix doc * server: bench: change gauge custom metrics to trend * server: bench: change gauge custom metrics to trend server: bench: add trend custom metrics for total tokens per second average * server: bench: doc add an option to debug http request * server: bench: filter dataset too short and too long sequences * server: bench: allow to filter out conversation in the dataset based on env variable * server: bench: fix assistant message sent instead of user message * server: bench: fix assistant message sent instead of user message * server : add defrag thold parameter * server: bench: select prompts based on the current iteration id not randomly to make the bench more reproducible --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2024-03-09 22:41:49 +00:00			```shell
ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495) * ci: bench: support sse and fix prompt processing time server: add tokens usage in stream mode * ci: bench: README.md EOL * ci: bench: remove total pp and tg as it is not accurate * ci: bench: fix case when there is no token generated * ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics * ci: bench: fix finish reason rate 2024-04-06 03:40:47 +00:00			`go install go.k6.io/xk6/cmd/xk6@latest`
			`xk6 build master \`
			`--with github.com/phymbert/xk6-sse`
server: benchmark: chat/completions scenario and other llm servers comparison (#5941) * server: bench: Init a bench scenario with K6 See #5827 * server: bench: EOL EOF * server: bench: PR feedback and improved k6 script configuration * server: bench: remove llamacpp_completions_tokens_seconds as it include prompt processing time and it's misleading server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS server: bench: increase truncated rate to 80% before failing * server: bench: fix doc * server: bench: change gauge custom metrics to trend * server: bench: change gauge custom metrics to trend server: bench: add trend custom metrics for total tokens per second average * server: bench: doc add an option to debug http request * server: bench: filter dataset too short and too long sequences * server: bench: allow to filter out conversation in the dataset based on env variable * server: bench: fix assistant message sent instead of user message * server: bench: fix assistant message sent instead of user message * server : add defrag thold parameter * server: bench: select prompts based on the current iteration id not randomly to make the bench more reproducible --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2024-03-09 22:41:49 +00:00			```

			`#### Download a dataset`

			`This dataset was originally proposed in [vLLM benchmarks](https://github.com/vllm-project/vllm/blob/main/benchmarks/README.md).`

			```shell
			`wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
			```

			`#### Download a model`
			`Example for PHI-2`

			```shell
			`../../../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf`
			```

			`#### Start the server`
			The server must answer OAI Chat completion requests on `http://localhost:8080/v1` or according to the environment variable `SERVER_BENCH_URL`.

			`Example:`
			```shell
			`server --host localhost --port 8080 \`
			`--model ggml-model-q4_0.gguf \`
			`--cont-batching \`
			`--metrics \`
			`--parallel 8 \`
			`--batch-size 512 \`
			`--ctx-size 4096 \`
			`-ngl 33`
			```

			`#### Run the benchmark`

			`For 500 chat completions request with 8 concurrent users during maximum 10 minutes, run:`
			```shell
ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495) * ci: bench: support sse and fix prompt processing time server: add tokens usage in stream mode * ci: bench: README.md EOL * ci: bench: remove total pp and tg as it is not accurate * ci: bench: fix case when there is no token generated * ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics * ci: bench: fix finish reason rate 2024-04-06 03:40:47 +00:00			`./k6 run script.js --duration 10m --iterations 500 --vus 8`
server: benchmark: chat/completions scenario and other llm servers comparison (#5941) * server: bench: Init a bench scenario with K6 See #5827 * server: bench: EOL EOF * server: bench: PR feedback and improved k6 script configuration * server: bench: remove llamacpp_completions_tokens_seconds as it include prompt processing time and it's misleading server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS server: bench: increase truncated rate to 80% before failing * server: bench: fix doc * server: bench: change gauge custom metrics to trend * server: bench: change gauge custom metrics to trend server: bench: add trend custom metrics for total tokens per second average * server: bench: doc add an option to debug http request * server: bench: filter dataset too short and too long sequences * server: bench: allow to filter out conversation in the dataset based on env variable * server: bench: fix assistant message sent instead of user message * server: bench: fix assistant message sent instead of user message * server : add defrag thold parameter * server: bench: select prompts based on the current iteration id not randomly to make the bench more reproducible --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2024-03-09 22:41:49 +00:00			```

			`The benchmark values can be overridden with:`
			- `SERVER_BENCH_URL` server url prefix for chat completions, default `http://localhost:8080/v1`
			- `SERVER_BENCH_N_PROMPTS` total prompts to randomly select in the benchmark, default `480`
			- `SERVER_BENCH_MODEL_ALIAS` model alias to pass in the completion request, default `my-model`
			- `SERVER_BENCH_MAX_TOKENS` max tokens to predict, default: `512`
			- `SERVER_BENCH_DATASET` path to the benchmark dataset file
			- `SERVER_BENCH_MAX_PROMPT_TOKENS` maximum prompt tokens to filter out in the dataset: default `1024`
			- `SERVER_BENCH_MAX_CONTEXT` maximum context size of the completions request to filter out in the dataset: prompt + predicted tokens, default `2048`

			`Note: the local tokenizer is just a string space split, real number of tokens will differ.`

			`Or with [k6 options](https://k6.io/docs/using-k6/k6-options/reference/):`

			```shell
			`SERVER_BENCH_N_PROMPTS=500 k6 run script.js --duration 10m --iterations 500 --vus 8`
			```

			To [debug http request](https://k6.io/docs/using-k6/http-debugging/) use `--http-debug="full"`.

			`#### Metrics`

			Following metrics are available computed from the OAI chat completions response `usage`:
			- `llamacpp_tokens_second` Trend of `usage.total_tokens / request duration`
			- `llamacpp_prompt_tokens` Trend of `usage.prompt_tokens`
			- `llamacpp_prompt_tokens_total_counter` Counter of `usage.prompt_tokens`
			- `llamacpp_completion_tokens` Trend of `usage.completion_tokens`
			- `llamacpp_completion_tokens_total_counter` Counter of `usage.completion_tokens`
			- `llamacpp_completions_truncated_rate` Rate of completions truncated, i.e. if `finish_reason === 'length'`
			- `llamacpp_completions_stop_rate` Rate of completions stopped by the model, i.e. if `finish_reason === 'stop'`

			The script will fail if too many completions are truncated, see `llamacpp_completions_truncated_rate`.

			`K6 metrics might be compared against [server metrics](../README.md), with:`

			```shell
			`curl http://localhost:8080/metrics`
			```
ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495) * ci: bench: support sse and fix prompt processing time server: add tokens usage in stream mode * ci: bench: README.md EOL * ci: bench: remove total pp and tg as it is not accurate * ci: bench: fix case when there is no token generated * ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics * ci: bench: fix finish reason rate 2024-04-06 03:40:47 +00:00
			`### Using the CI python script`
			The `bench.py` script does several steps:
			`- start the server`
			`- define good variable for k6`
			`- run k6 script`
			`- extract metrics from prometheus`

			`It aims to be used in the CI, but you can run it manually:`

			```shell
`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) * `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew * server: update refs -> llama-server gitignore llama-server * server: simplify nix package * main: update refs -> llama fix examples/main ref * main/server: fix targets * update more names * Update build.yml * rm accidentally checked in bins * update straggling refs * Update .gitignore * Update server-llm.sh * main: target name -> llama-cli * Prefix all example bins w/ llama- * fix main refs * rename {main->llama}-cmake-pkg binary * prefix more cmake targets w/ llama- * add/fix gbnf-validator subfolder to cmake * sort cmake example subdirs * rm bin files * fix llama-lookup-* Makefile rules * gitignore /llama-* * rename Dockerfiles * rename llama\|main -> llama-cli; consistent RPM bin prefixes * fix some missing -cli suffixes * rename dockerfile w/ llama-cli * rename(make): llama-baby-llama * update dockerfile refs * more llama-cli(.exe) * fix test-eval-callback * rename: llama-cli-cmake-pkg(.exe) * address gbnf-validator unused fread warning (switched to C++ / ifstream) * add two missing llama- prefixes * Updating docs for eval-callback binary to use new `llama-` prefix. * Updating a few lingering doc references for rename of main to llama-cli * Updating `run-with-preset.py` to use new binary names. Updating docs around `perplexity` binary rename. * Updating documentation references for lookup-merge and export-lora * Updating two small `main` references missed earlier in the finetune docs. * Update apps.nix * update grammar/README.md w/ new llama-* names * update llama-rpc-server bin name + doc * Revert "update llama-rpc-server bin name + doc" This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930. * add hot topic notice to README.md * Update README.md * Update README.md * rename gguf-split & quantize bins refs in **/tests.sh --------- Co-authored-by: HanClinto <hanclinto@gmail.com> 2024-06-12 23:41:52 +00:00			`LLAMA_SERVER_BIN_PATH=../../../cmake-build-release/bin/llama-server python bench.py \`
ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495) * ci: bench: support sse and fix prompt processing time server: add tokens usage in stream mode * ci: bench: README.md EOL * ci: bench: remove total pp and tg as it is not accurate * ci: bench: fix case when there is no token generated * ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics * ci: bench: fix finish reason rate 2024-04-06 03:40:47 +00:00			`--runner-label local \`
			`--name local \`
			--branch `git rev-parse --abbrev-ref HEAD` \
			--commit `git rev-parse HEAD` \
			`--scenario script.js \`
			`--duration 5m \`
			`--hf-repo ggml-org/models \`
			`--hf-file phi-2/ggml-model-q4_0.gguf \`
			`--model-path-prefix models \`
			`--parallel 4 \`
			`-ngl 33 \`
			`--batch-size 2048 \`
			`--ubatch-size 256 \`
			`--ctx-size 4096 \`
			`--n-prompts 200 \`
			`--max-prompt-tokens 256 \`
			`--max-tokens 256`
			```