mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-25 02:44:36 +00:00


llama.cpp/examples/batched-bench

Benchmark the batched decoding performance of llama.cpp

Usage

There are 2 modes of operation:

prompt not shared - each batch has a separate prompt of size PP (i.e. N_KV = B*(PP + TG)); this is the default
prompt is shared - there is a common prompt of size PP used by all batches (i.e. N_KV = PP + B*TG); enabled with -pps
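
The two N_KV formulas above can be written out as a small helper. This is a minimal Python sketch for illustration only; the function name `required_kv` and its parameters are hypothetical and are not part of llama.cpp.

```python
def required_kv(pp: int, tg: int, b: int, shared_prompt: bool) -> int:
    """Return N_KV, the KV cache size (in tokens) needed for one run.

    pp: prompt tokens per batch (PP)
    tg: generated tokens per batch (TG)
    b:  number of batches (B)
    """
    if shared_prompt:
        # One common prompt, plus TG generated tokens for each batch.
        return pp + b * tg
    # Each batch holds its own prompt and its own generated tokens.
    return b * (pp + tg)


if __name__ == "__main__":
    for shared in (False, True):
        n_kv = required_kv(pp=128, tg=128, b=8, shared_prompt=shared)
        print(f"shared={shared}: N_KV={n_kv}")
```

For PP = 128, TG = 128 and B = 8 this gives 2048 tokens without a shared prompt and 1152 tokens with one, which is why the shared mode fits more batches into the same context size.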

./llama-batched-bench -m model.gguf -c 2048 -b 2048 -ub 512 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32 [-pps]

# LLaMA 7B, F16, N_KV_MAX = 16384 (8GB), prompt not shared
./llama-batched-bench -m ./models/llama-7b/ggml-model-f16.gguf -c 16384 -b 2048 -ub 512 -ngl 99

# LLaMA 7B, Q8_0, N_KV_MAX = 16384 (8GB), prompt is shared
./llama-batched-bench -m ./models/llama-7b/ggml-model-q8_0.gguf -c 16384 -b 2048 -ub 512 -ngl 99 -pps

# custom set of batches
./llama-batched-bench -m ./models/llama-7b/ggml-model-q8_0.gguf -c 2048 -b 512 -ub 512 -ngl 999 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32
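
The comma-separated values of -npp, -ntg and -npl describe a grid of (PP, TG, B) combinations. The Python sketch below enumerates that grid for the custom command above and keeps only the combinations whose N_KV fits into the context size given by -c. It assumes that a combination is usable only when N_KV does not exceed the context size; the helper name `runnable_combinations` is hypothetical, and how the tool itself treats oversized combinations is not stated here.

```python
from itertools import product


def runnable_combinations(npp, ntg, npl, n_ctx, shared_prompt=False):
    """Yield (PP, TG, B, N_KV) for every combination with N_KV <= n_ctx."""
    for pp, tg, b in product(npp, ntg, npl):
        # Same N_KV formulas as in the two modes of operation.
        n_kv = pp + b * tg if shared_prompt else b * (pp + tg)
        if n_kv <= n_ctx:
            yield pp, tg, b, n_kv


if __name__ == "__main__":
    # Values taken from the "custom set of batches" command (-c 2048, no -pps).
    fits = list(
        runnable_combinations(
            npp=[128, 256, 512],
            ntg=[128, 256],
            npl=[1, 2, 4, 8, 16, 32],
            n_ctx=2048,
        )
    )
    print(f"{len(fits)} of 36 combinations fit into the context")
    for pp, tg, b, n_kv in fits:
        print(pp, tg, b, n_kv)
```

Under that assumption, a context of 2048 tokens leaves room for only part of the grid, so larger values of -npl need a larger -c (as in the 16384-token examples) or a shared prompt.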

Sample results

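PP - prompt tokens per batch (set with -npp)
TG - generated tokens per batch (set with -ntg)
B - number of batches (set with -npl)
N_KV - required KV cache size, in tokens
T_PP - prompt processing time in seconds (i.e. time to first token)
S_PP - prompt processing speed in tokens per second ((B*PP)/T_PP when the prompt is not shared, PP/T_PP when it is shared)
T_TG - time in seconds to generate all batches
S_TG - text generation speed in tokens per second ((B*TG)/T_TG)
T - total time in seconds (T_PP + T_TG)
S - total speed in tokens per second (i.e. all tokens / total time)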

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
128	128	1	256	0.108	1186.64	3.079	41.57	3.187	80.32
128	128	2	512	0.198	1295.19	5.029	50.90	5.227	97.95
128	128	4	1024	0.373	1373.96	6.878	74.44	7.251	141.23
128	128	8	2048	0.751	1363.27	7.344	139.43	8.095	252.99
128	128	16	4096	1.570	1304.68	8.455	242.23	10.024	408.60
128	128	32	8192	3.408	1201.73	8.801	465.40	12.209	670.96
128	256	1	384	0.107	1196.70	6.329	40.45	6.436	59.67
128	256	2	768	0.194	1317.45	10.239	50.00	10.433	73.61
128	256	4	1536	0.366	1399.03	13.960	73.35	14.326	107.22
128	256	8	3072	0.751	1363.92	15.110	135.54	15.861	193.69
128	256	16	6144	1.569	1304.93	18.073	226.64	19.642	312.80
128	256	32	12288	3.409	1201.35	19.223	426.15	22.633	542.93
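
As a sanity check, the speed columns can be recomputed from the timing columns using the definitions listed before the table. The Python sketch below does this for the row PP = 128, TG = 128, B = 8; the function name `speeds` is hypothetical. Because the table shows timings rounded to three decimals, the recomputed values are expected to match the printed ones only approximately.

```python
def speeds(pp, tg, b, t_pp, t_tg, shared_prompt=False):
    """Recompute S_PP, S_TG, T and S from one row of the results table."""
    # With a shared prompt only PP tokens are processed, otherwise B*PP.
    prompt_tokens = pp if shared_prompt else b * pp
    generated_tokens = b * tg
    total_time = t_pp + t_tg
    return {
        "S_PP": prompt_tokens / t_pp,
        "S_TG": generated_tokens / t_tg,
        "T": total_time,
        "S": (prompt_tokens + generated_tokens) / total_time,
    }


if __name__ == "__main__":
    # Row from the table: PP=128, TG=128, B=8, T_PP=0.751 s, T_TG=7.344 s
    row = speeds(pp=128, tg=128, b=8, t_pp=0.751, t_tg=7.344)
    for name, value in row.items():
        print(f"{name} = {value:.2f}")
```

The table lists S_PP = 1363.27, S_TG = 139.43, T = 8.095 and S = 252.99 for this row, so the recomputed figures should land close to those values.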