# Mulmat Benchmark and Tuning

Apart from the standalone tool `mulmat-tune`, mulmat tune is also integrated into `main` and `perplexity`. To avoid adding too many new cli options, I added just two. To make the bench run faster, `m_num` is set to 8 (so the max M is 128) and `n_pass` is set to 1.

With the newly added cli options, we can use `main` and `perplexity` in the following three ways:

* bench and run: `--tune`
* bench and exit: `--tune --tune-file`
* load and run: `--tune-file`

The `load` mode reads an existing data file. This is fine because we can run the bench ahead of time (saving tens of seconds at startup), but it has two shortcomings:

- the bench has to be re-run whenever the file format changes; this is acceptable because we are made aware of it.
- the subtler problem is that the algorithm may have changed silently while we keep loading data in the outdated format.

So I integrated mulmat tune into `main` and `perplexity` as a complementary solution.

## Build into main and perplexity

Makefile:

```
make clean && make
```

CMake (with BLAS):

```
cmake --build . --target clean
cmake .. -DLLAMA_BLAS=ON
cmake --build . --config Release
```

Run examples:

```
# bench and run:
./main -m ./models/3B/open-llama-3b-q4-0.bin -c 512 -b 1024 -n 256 --keep 48 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt -t 4 --tune

# bench and exit:
./main -m ./models/3B/open-llama-3b-q4-0.bin --tune --tune-file

# load and run:
./main -m ./models/3B/open-llama-3b-q4-0.bin -c 512 -b 1024 -n 256 --keep 48 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt -t 4 --tune-file
```

# Build the standalone `mulmat-tune`

Makefile:

```
make clean && make
```

CMake (with BLAS):

```
cmake --build . --target clean
cmake .. -DLLAMA_BLAS=ON
cmake --build . --config Release
```

Run examples:

```
./mulmat-tune -h

# run with default params (7B, Q4_0, ...)
./mulmat-tune

# set model
./mulmat-tune --model 13B

# set ggml ftype, 2 for Q4_0, 3 for Q4_1; run `mulmat-tune -h` for help.
./mulmat-tune --ftype 3

# customized m_num
./mulmat-tune --m_num 8

# customized n_pass: run 1 pass only instead of the default 3.
./mulmat-tune --n_pass 1

# customized n_threads instead of the default 1.
./mulmat-tune --n_threads 4

# save to file
./mulmat-tune --file

# save to file, always overwrite if it exists (CAUTION!)
./mulmat-tune --file -y
```

# End-to-End Test

## Compare With Master

You may want to run the following commands. Make sure the tune result file is set up properly.

General steps:

1. Run `./mulmat-tune -h` to see how to build for various vendors. To enable debug output, comment out `-DGGML_TUNE_NDEBUG` in the Makefile, then run:

   ```
   make clean; make
   ```

   On `macOS`, `ACCELERATE` is enabled by default. When `ACCELERATE` is built along with `CUDA` or `CL`, you may not see `CUDA` or `CL` in the debug output because `CPU` or `CPU_BLAS` is faster (according to the mulmat tune estimation); try running with `-t 1`.

2. Create a small prompt file:

   ```
   head -n 5 ./models/wikitext-2-raw/wiki.valid.raw > ./models/wiki.valid-5.raw
   ```

3. Run any of the following example commands:

   ```
   ./perplexity -m models/7B/ggml-model-q4_0.bin -f ./models/wiki.valid-5.raw -c 128 --mlock -t 1 -b 32
   ./perplexity -m models/7B/ggml-model-q4_0.bin -f ./models/wiki.valid-5.raw -c 128 --mlock -t 4 -b 64
   ```

Notes:

* `--mlock` is recommended on `macOS`; on other platforms you may not want to use it.
* don't change `-c 128`: a context size that is too large causes a 0-perplexity chunk.
* `-t` is the number of threads; `1`, `2`, `4` or `6` are recommended.
* you can change the batch size (`-b`) between `1` and `128` (see the sweep sketch after this list).
* you may want to add other cli options.
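To reproduce tables like the ones in the next section, the batch size can be swept in a loop. This is a minimal sketch assuming the 7B model and the prompt file created in the steps above; adjust the paths and the thread count to your setup:

```
#!/bin/sh
# Sweep batch sizes and collect perplexity timings (one run per -b value).
# Paths assume the model and the 5-line prompt file from the steps above.
for b in 8 16 24 32 48 64 96 128; do
  echo "=== batch size $b ==="
  ./perplexity -m models/7B/ggml-model-q4_0.bin -f ./models/wiki.valid-5.raw \
      -c 128 --mlock -t 4 -b "$b"
done
```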
The following results were generated with Accelerate enabled.

### 1 thread

**Master (2d43387d)**

| M | perplexity (seconds per pass) | prompt eval time (ms per token) |
| --- | --- | --- |
| 8 | 43.53 | 339.95 |
| 16 | 44.31 | 346.12 |
| 24 | 43.14 | 336.90 |
| 32 | 33.59 | 262.25 |
| 40 | 27.64 | 215.77 |
| 48 | 24.52 | 191.42 |

**This branch (tune)**

| M | perplexity (seconds per pass) | prompt eval time (ms per token) |
| --- | --- | --- |
| 8 | 43.78 | 341.96 |
| 16 | 42.88 | 334.93 |
| 24 | 42.06 | 328.42 |
| 32 | 33.07 | 258.25 |
| 40 | 28.69 | 223.98 |
| 48 | 25.65 | 200.19 |

### 4 threads

**Master (2d43387d)**

| M | perplexity (seconds per pass) | prompt eval time (ms per token) |
| --- | --- | --- |
| 8 | 12.43 | 96.99 |
| 16 | 12.10 | 94.44 |
| 24 | 12.81 | 99.95 |
| 32 | 31.64 | 247.04 |
| 48 | 24.55 | 191.63 |
| 64 | 17.56 | 137.09 |
| 96 | 17.59 | 137.25 |
| 128 | 10.73 | 83.74 |

**This branch (no tune)**

| M | perplexity (seconds per pass) | prompt eval time (ms per token) |
| --- | --- | --- |
| 8 | 12.31 | 96.07 |
| 16 | 12.00 | 93.63 |
| 24 | 12.07 | 94.15 |
| 32 | 20.34 | 158.76 |
| 48 | 15.86 | 123.73 |
| 64 | 10.98 | 85.69 |
| 96 | 11.24 | 87.66 |
| 128 | 7.53 | 58.77 |

**This branch (tune)**

| M | perplexity (seconds per pass) | prompt eval time (ms per token) |
| --- | --- | --- |
| 8 | 12.48 | 97.37 |
| 16 | 12.26 | 95.70 |
| 24 | 12.25 | 95.53 |
| 32 | 11.98 | 93.58 |
| 48 | 12.57 | 98.12 |
| 64 | 11.28 | 88.05 |
| 96 | 9.55 | 74.53 |
| 128 | 7.51 | 58.61 |

# Bench Data Format

**Example**

```
[tune] done, elapsed time: 0 seconds.
10 xB 12 4 2

1024 1024 12 0 2 4
100 110 000 1 CPU
110 101 000 2 BLAS
  1   11  309    0 1234   90    0
  2   23  654    0 1359  215    0
  4   44 1283    0 1362  421    0
  8   85 2341    0 1357  347    0

1024 2048 12 0 2 4
...
```

**Informal Explanation**

```
head shape+

# head
head := version model ggml_ftype n_shapes n_threads

version:    1
model:      "3B" | "7B" | "13B" | "30B" | "65B"
ggml_ftype: 0 - 3, 7 - 14
n_shapes:   number of shapes
n_threads:  number of threads

# shape
shape := N K src0_ggml_type src1_ggml_type n_profiles m_num task_profile+ bench_item+

task_profile := stage_conf(init) stage_conf(compute) stage_conf(finalize) id name

stage_conf (bitmap) := valid parallel wait
  valid:    0 (false) | 1 (true)
  parallel: 0 (false) | 1 (true)
  wait:     0 (false) | 1 (true)

bench_item   := M profile_time+
profile_time := stage_time[3]
stage_time[3]: init_time, compute_time, finalize_time
```

A task stage is invalid if its backend equals `GGML_TASK_BACKEND_NONE`. The time unit is `us` (microseconds). A column is all zeros when that stage does not exist. (A small sketch for inspecting a saved file follows the NOTE section below.)

# NOTE

1. "3B" is [open-llama 3B](https://github.com/ggerganov/llama.cpp/pull/1588).
2. Model names are subject to change: we may support something like X-3B, Y-4B, ...
3. As of Jun 1, this tool is still at an early stage and will change frequently over the coming days (or weeks).
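As a quick sanity check of a saved bench file, the head fields described in the Bench Data Format section can be printed with a minimal shell sketch. This assumes the head occupies the first line of the file; `tune.txt` is only a placeholder for whatever file `--tune-file` points at:

```
# Print the head fields of a bench data file.
# Field order per the informal grammar: version model ggml_ftype n_shapes n_threads
# NOTE: tune.txt is a placeholder file name, not a fixed default.
read -r version model ggml_ftype n_shapes n_threads < tune.txt
echo "version=$version model=$model ggml_ftype=$ggml_ftype"
echo "n_shapes=$n_shapes n_threads=$n_threads"
```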