# Mulmat Benchmark and Tuning

Apart from the standalone tool `mulmat-tune`, mulmat tune is also integrated into
`main` and `perplexity`. To avoid adding too many new CLI options, I added only two.
To keep the built-in bench fast, `m_num` is set to 8 (so the maximum M is 128) and
`n_pass` is set to 1.
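
As a rough illustration of how `m_num` relates to the largest benchmarked M: the sketch below assumes M doubles at each step starting from 1. This is an inference from the sample bench data later in this README (10 rows covering M = 1 to 512), not something confirmed from the source code, and `bench_m_values` is a hypothetical helper name.

```python
from typing import List


def bench_m_values(m_num: int) -> List[int]:
    """Assumed list of benchmarked M values: 1, 2, 4, ..., 2^(m_num - 1)."""
    if m_num < 1:
        raise ValueError("m_num must be at least 1")
    return [1 << i for i in range(m_num)]


if __name__ == "__main__":
    # With m_num = 8 the largest value is 2^7.
    print(bench_m_values(8))
```

Under this assumption, `m_num = 8` gives a maximum M of 128, which matches the statement above.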

With the newly added CLI options, `main` and `perplexity` can be used in the
following three ways:

* bench and run:  --tune
* bench and exit: --tune --tune-file <FILE>
* load  and run:  --tune-file <FILE>
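
The mapping from the two options to the three modes can be sketched as a small decision function. This is only an illustration of the list above, not the actual argument handling in `main` or `perplexity`; `tune_mode` and the `"no tune"` label are hypothetical names.

```python
from typing import Optional


def tune_mode(tune: bool, tune_file: Optional[str]) -> str:
    """Map the two CLI options to the run mode described in the list above."""
    if tune and tune_file:
        return "bench and exit"  # --tune --tune-file <FILE>
    if tune:
        return "bench and run"   # --tune
    if tune_file:
        return "load and run"    # --tune-file <FILE>
    return "no tune"             # neither option given (label is an assumption)


if __name__ == "__main__":
    print(tune_mode(True, None))
    print(tune_mode(True, "tune.txt"))
    print(tune_mode(False, "tune.txt"))
```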

The `load` mode reads an existing data file. This is convenient because the bench
can be run ahead of time (saving tens of seconds), but it has two shortcomings:
- the bench has to be re-run when the file format changes. This is OK because we
  get notified of the change.
- the most subtle problem is that the algorithm may change silently while we keep
  using an outdated data file. So I integrated mulmat tune into `main` and
  `perplexity` as a complementary solution.

## Build into main and perplexity

Makefile:
```
make clean && make
```

CMake (with BLAS):
```
cmake --build . --target clean
cmake .. -DLLAMA_BLAS=ON
cmake --build . --config Release
```

Run examples:

```
# bench and run:

./main -m ./models/3B/open-llama-3b-q4-0.bin -c 512 -b 1024 -n 256 --keep 48 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt -t 4 --tune

# bench then exit:
./main -m ./models/3B/open-llama-3b-q4-0.bin --tune --tune-file <FILE>

# load and run

./main -m ./models/3B/open-llama-3b-q4-0.bin -c 512 -b 1024 -n 256 --keep 48 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt -t 4 --tune-file <FILE>
```

# Build the standalone `mulmat-tune`

Makefile:
```
make clean && make
```

CMake (with BLAS):
```
cmake --build . --target clean
cmake .. -DLLAMA_BLAS=ON
cmake --build . --config Release
```

Run examples:

```
./mulmat-tune -h

# run with default params (7B, Q4_0, ...)
./mulmat-tune

# set model
./mulmat-tune --model 13B

# set ggml ftype, 2 for Q4_0, 3 for Q4_1, run `mulmat-tune -h` for help.
./mulmat-tune --ftype 3

# customized m_num
./mulmat-tune --m_num 8

# customized n_pass: run 1 pass only instead of the default 3.
./mulmat-tune --n_pass 1

# customized n_threads instead of the default 1.
./mulmat-tune --n_threads 4

# save to file
./mulmat-tune --file <FILE>

# save to file, always overwrite if it exists (CAUTION!)
./mulmat-tune --file <FILE> -y

```

# End to End Test

## Compare With Master

You may want to run the following commands. Make sure the tune result file is
set up properly.

General steps:

1. run `./mulmat-tune -h` to see how to build for misc vendors.
   To enable debug output, comment out `-DGGML_TUNE_NDEBUG` in the Makefile, then run:

   ```
   make clean; make
   ```

   On `macOS`, `ACCELERATE` is enabled by default. When `ACCELERATE` is built along
   with `CUDA` or `CL`, you may not see `CUDA` or `CL` in the debug output because
   `CPU` or `CPU_BLAS` is faster (according to the estimation from mulmat tune). In
   that case, you could try running with `-t 1`.
2. create a small prompt file:

   ```
   head -n 5 ./models/wikitext-2-raw/wiki.valid.raw > ./models/wiki.valid-5.raw
   ```

3. run any of the following example commands.

   ```
   ./perplexity -m models/7B/ggml-model-q4_0.bin -f ./models/wiki.valid-5.raw -c 128 --mlock -t 1 -b 32
   ./perplexity -m models/7B/ggml-model-q4_0.bin -f ./models/wiki.valid-5.raw -c 128 --mlock -t 4 -b 64
   ```

   * `--mlock` is recommended for `macOS`; elsewhere you may not want to use it.
   * don't change `-c 128`: a context size that is too large results in 0 perplexity chunks.
   * `-t` is the number of threads; `1`, `2`, `4` or `6` is recommended.
   * you can change the batch size (`-b`) between `1` and `128`.
   * you may want to add other CLI options.
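
To see why a large context size can leave nothing to evaluate on such a small prompt file, here is a minimal sketch. It assumes the perplexity tool evaluates only whole context windows, that is `floor(n_tokens / n_ctx)` chunks; this is an assumption about the tool, and both `perplexity_chunks` and the token count are hypothetical.

```python
def perplexity_chunks(n_tokens: int, n_ctx: int) -> int:
    """Assumed number of chunks evaluated: whole context windows that fit in the prompt."""
    return n_tokens // n_ctx


if __name__ == "__main__":
    # Hypothetical token count for the 5-line prompt file;
    # the real count depends on the tokenizer and the file content.
    n_tokens = 700
    for n_ctx in (128, 512, 1024):
        print(n_ctx, perplexity_chunks(n_tokens, n_ctx))
```

With these made-up numbers, a context size larger than the token count yields zero chunks, which is the failure the `-c 128` advice guards against.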

The following results were generated with a build that has Accelerate enabled.

### 1 thread

**Master (2d43387d)**

```
|  M  | perplexity (seconds per pass) | prompt eval time (ms per token) |
| --- | ----------------------------- | ------------------------------- |
|  8  |  43.53 | 339.95 |
|  16 |  44.31 | 346.12 |
|  24 |  43.14 | 336.90 |
|  32 |  33.59 | 262.25 |
|  40 |  27.64 | 215.77 |
|  48 |  24.52 | 191.42 |
```

**This branch (tune)**

```
|  M  | perplexity (seconds per pass) | prompt eval time (ms per token) |
| --- | ----------------------------- | ------------------------------- |
|  8  |  43.78 | 341.96 |
|  16 |  42.88 | 334.93 |
|  24 |  42.06 | 328.42 |
|  32 |  33.07 | 258.25 |
|  40 |  28.69 | 223.98 |
|  48 |  25.65 | 200.19 |
```

### 4 threads

**Master (2d43387d)**

```
|  M  | perplexity (seconds per pass) | prompt eval time (ms per token) |
| --- | ----------------------------- | ------------------------------- |
|   8 |  12.43 |  96.99 |
|  16 |  12.10 |  94.44 |
|  24 |  12.81 |  99.95 |
|  32 |  31.64 | 247.04 |
|  48 |  24.55 | 191.63 |
|  64 |  17.56 | 137.09 |
|  96 |  17.59 | 137.25 |
| 128 |  10.73 |  83.74 |
```

**This branch (no tune)**

```
|  M  | perplexity (seconds per pass) | prompt eval time (ms per token) |
| --- | ----------------------------- | ------------------------------- |
|   8 |  12.31 |  96.07 |
|  16 |  12.00 |  93.63 |
|  24 |  12.07 |  94.15 |
|  32 |  20.34 | 158.76 |
|  48 |  15.86 | 123.73 |
|  64 |  10.98 |  85.69 |
|  96 |  11.24 |  87.66 |
| 128 |   7.53 |  58.77 |
```

**This branch (tune)**

```
|  M  | perplexity (seconds per pass) | prompt eval time (ms per token) |
| --- | ----------------------------- | ------------------------------- |
|   8 |  12.48 |  97.37 |
|  16 |  12.26 |  95.70 |
|  24 |  12.25 |  95.53 |
|  32 |  11.98 |  93.58 |
|  48 |  12.57 |  98.12 |
|  64 |  11.28 |  88.05 |
|  96 |   9.55 |  74.53 |
| 128 |   7.51 |  58.61 |
```
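
To make the 4-thread comparison easier to read, the ratio between two tables can be computed per M. The sketch below copies the "seconds per pass" column from the Master and tune tables above; `speedups` is a hypothetical helper, and the ratio is just `baseline / candidate`.

```python
# Seconds per pass with 4 threads, copied from the tables above.
master = {8: 12.43, 16: 12.10, 24: 12.81, 32: 31.64,
          48: 24.55, 64: 17.56, 96: 17.59, 128: 10.73}
tune = {8: 12.48, 16: 12.26, 24: 12.25, 32: 11.98,
        48: 12.57, 64: 11.28, 96: 9.55, 128: 7.51}


def speedups(baseline: dict, candidate: dict) -> dict:
    """Return baseline / candidate per M (values above 1 mean the candidate is faster)."""
    return {m: baseline[m] / candidate[m] for m in baseline if m in candidate}


if __name__ == "__main__":
    for m, s in speedups(master, tune).items():
        print(f"M={m:>3}: {s:.2f}x")
```

For small M the two builds are roughly on par, while from M = 32 upward the tuned build is noticeably faster in these measurements.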

# Bench Data Format

**Example**

```
5 3B 2 6 1

3200 3200  2 0 3 10
16 0 0 0  16 1 0 1   0 0 0 0
16 1 0 2  17 0 1 0   0 0 0 0
 0 0 0 0  34 0 1 0   0 0 0 0
   1        1      793 0     9103     2102 0 0     6014 0
   2        2     1591 0     8034     2305 0 0    30982 0
   4        4     2236 0     6476     2484 0 0    31388 0
   8        7     4161 0     6623     2389 0 0    29204 0
  16       15     8339 0     6434     2752 0 0    34303 0
  32       32    16919 0     6915     3651 0 0    42511 0
  64      200    34270 0     6574     4528 0 0    68212 0
 128      188    69400 0     6325     6839 0 0    74437 0
 256      303   134597 0     6168    11544 0 0   110180 0
 512      687   279685 0     6337    29712 0 0   159728 0

3200 8640  2 0 2 10

 ...

 ```

**Informal Explanation**

```
head
groups+

head := version model ggml_ftype n_shapes n_threads
shape+

# head
version: 1
model: "3B" | "7B" | "13B" | "30B" | "65B"
ggml_ftype: 0 - 4, 7 - 14
n_shapes: number of shapes
n_threads: number of threads

shape := N K  m_num n_profiles
task_conf_profile+
bench_item+

task_conf_profile: stage_conf(init) stage_conf(compute) stage_conf(finalize)
stage_conf: backend parallel wait
backend: 0 (NONE) | 16 (CPU) | 17 (CPU_BLAS) | 32 (GPU) | 33 (GPU_CUDA) | 34 (GPU_CL)
parallel: 0 (false) | 1 (true)
wait: 0 (false) | 1 (true)

bench_item: M profile_time+
profile_time := stage_time[3]
stage_time[3]: init_time, compute_time, finalize_time
```

A task stage is invalid if its backend equals `GGML_TASK_BACKEND_NONE`.
The time unit is `us` (microseconds). A column is all zeros when that stage does not exist.
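
As a reading aid for the `bench_item` rows, the following sketch parses one row into per-profile stage times and picks the profile with the smallest total. Summing the three stage times is a simplification made here for illustration; it is not necessarily how mulmat tune estimates the best profile. `parse_bench_item` and `fastest_profile` are hypothetical names.

```python
from typing import List, Tuple


def parse_bench_item(line: str, n_profiles: int) -> Tuple[int, List[Tuple[int, ...]]]:
    """Parse 'M profile_time+'; each profile_time is (init, compute, finalize) in us."""
    fields = [int(x) for x in line.split()]
    m, times = fields[0], fields[1:]
    if len(times) != 3 * n_profiles:
        raise ValueError(f"expected {3 * n_profiles} stage times, got {len(times)}")
    profiles = [tuple(times[i:i + 3]) for i in range(0, len(times), 3)]
    return m, profiles


def fastest_profile(profiles: List[Tuple[int, ...]]) -> int:
    """Index of the profile with the smallest summed stage time (a simplification)."""
    totals = [sum(p) for p in profiles]
    return totals.index(min(totals))


if __name__ == "__main__":
    # Two bench_item rows copied from the example above (3 profiles each).
    for line in ("1 1 793 0 9103 2102 0 0 6014 0",
                 "512 687 279685 0 6337 29712 0 0 159728 0"):
        m, profiles = parse_bench_item(line, n_profiles=3)
        print(m, fastest_profile(profiles))
```

Under this summing assumption, the first profile wins for M = 1 and the second profile (whose compute stage uses backend 17, `CPU_BLAS`, in the example) wins for M = 512.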

# NOTE

1. "3B" is [open-llama 3B](https://github.com/ggerganov/llama.cpp/pull/1588).
2. Model names are subject to change: we may support something like X-3B, Y-4B, ...
3. As of Jun 1, this tool is still at an early stage and will change frequently in
   the coming days (or weeks).