llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-09-23 13:36:20 +00:00

History

Georgi Gerganov 9c67c2773d ggml : add Flash Attention (#5021 ) * ggml : add ggml_flash_attn_ext API * ggml : fix GQA support in ggml_flash_attn_ext * ggml : online attention (CPU) * metal : initial implementation * metal : f16 precision * metal : reduce branches * metal : specialize for head size * wip : 8 rows per simd group * wip : 4 rows per simd group * wip : template for rows per warp * metal : parallelize across KV size * metal : parallel reduce across heads * metal : efficient flash_attn_f16 implementation * metal : avoid redundant loads of the attention * metal : scale and mask in matrix form * metal : fix comment * llama : avoid ggml_cast, use F32 query * metal : add parallel reduce version (disabled) * metal : move output into local memory + optimize - the result from each simdgroup now stays in the registers - significantly reduced SRAM usage - more efficient skipping of -INF blocks - avoid simdgroup barrier in hot loop - add comments * metal : add tests, fix scaling, support C > 32 * metal : improve precision * ggml : fix f16 mad * metal : minor * metal : support Q > 8 * tests : add ATTN tests * metal : disable buffer allocation logs * tests : more * metal : faster inner loop for C == 32 * metal : fix array initialization * tests : ifdef * ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext * ggml : fix ggml_soft_max mask requirement * cuda : fix soft_max to use correct mask size * cuda : add flash_attn kernel (wip) * metal : optimize softmax for C > 32 * metal : optimize softmax * tests : minor fix * cuda : avoid zeroing fragments * tests : update dims * cuda : fix __hisinf() result check * cuda : avoid warp_reduce for smax * cuda : use int instead of int64_t Noticeably improves performance (thanks to Johannes) * cuda : make loops use the same loop values Thanks Johannes again for the tip * cuda : unroll some of the loops * cuda : avoid __hisinf branches * cuda : use half2 in softmax * cuda : switch to 1 warp for bs > 16 * cuda : speed-up reduce part of the kernel * cuda : unroll QK^T loop cuda : fix -INF block check * cuda : simplify softmax * cuda : fix matrix names * cuda : minor * llama : adapt to F16 KQ_pos * llama : adapt new models to F16 KQ_mask * ggml : fix F16 store (ARM NEON) * llama : fix type of KQ_mask and KQ_pos * ggml : fix CPU soft_max * tests : add hs=256 * cuda : fix build * metal : improve perf via smaller int registers * cuda : adapt soft_max to F16 mask and pos * CUDA: faster FlashAttention, kernel for bs == 1 * 16 cols for Phi-2 * no vec for hs, no hs==256 ncols==32 for Volta * adjust kernel selection logic * 4 warps, 256 stride for all D * no ncols == 64 * Multiple parallel blocks for batch size 1 * fix compile warnings * fix excessive KQ_b loads * fix cmake build * fix KV cache padding, NaN from INFINITY (#6438) * llama : flash_attn cparam + fix defrag * server: support flash_attn param * server: bench: enable flash_attn param * CUDA: refactor host code, dyn. par. blocks * fix flash_attn_vec_f16 race condition * flush softmax exp below threshold to 0 * store temp KQ in registers * Calculate KQ as FP32 if KQV has GGML_PREC_F32 * Add __hgt2_mask implementation for CUDA 11 * fix KQ FP32 precision fpr parallel_blocks > 1 * llama-bench : add -fa,--flash-attn arg * metal : add BS=1 kernel for flash attention (#6508) * metal : add BS=1 kernel for flash attention (wip) * metal : support more than 1 warps * metal : opts * metal : opt * metal : switch to parallel reduce * metal : reduce registers * metal : simplify * metal : initial FA vec kernel * metal : use F32 attention accumulators * batched-bench : add fattn arg * llama : simplify llama_build_kv_store ggml-ci * llama : adapt build_olmo to changes * ggml : fix arm fp16 store on windows * metal : clean-up * metal : clean-up kernel code * metal : minor * tests : remove benchmarks ggml-ci * ggml : fix avx512 const correctness ggml-ci * ggml : fix soft_max with bias on CPU ggml-ci * common : print --flash-attn in help * ggml : fix num dimensions in ggml_flash_attn_ext * llama : force disable flash attention for incompatible models * ggml : ggml_soft_max support F16/F32 mask/pos ggml-ci * cuda : uint -> uint32_t * cuda : "constexpr dim3" -> "const dim3" ggml-ci * cuda : try to fix __hgt2_mask ggml-ci * ggml : add TODO's for F16/F32 mask/pos support in other backends * llama : replace bool need_kq_pos with use_alibi * llama : prep ALiBi support for BERT models ggml-ci * llama : fix n_batch requirements ggml-ci * cont * server : add help for --flash-attn arg * llama : disable FA for AMD * tests : remove TMP_ATTN_BENCH ggml-ci * llama : support save/load state with FA enabled ggml-ci * ci : add CUDA save-load-state tests ggml-ci * llama : llama_kv_cache_clear zeroes data + fix save-load seq ggml-ci * llama : fix copy-paste errors, add TODO * llama : disallow incompatible states * llama : update llama_state_get_size after v_trans field * metal : remove tmp log * llama : add static reminder for llama_state_get_size * metal : fix max nsg ggml-ci * ci : fix arg order ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>		2024-04-30 12:16:08 +03:00
..
bench.py	ggml : add Flash Attention (#5021 )	2024-04-30 12:16:08 +03:00
prometheus.yml	server: continuous performance monitoring and PR comment (#6283 )	2024-03-27 20:26:49 +01:00
README.md	ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495 )	2024-04-06 05:40:47 +02:00
requirements.txt	server: continuous performance monitoring and PR comment (#6283 )	2024-03-27 20:26:49 +01:00
script.js	bench: server add stop word for PHI-2 (#6916 )	2024-04-26 09:26:16 +02:00

README.md

Server benchmark tools

Benchmark is using k6.

Install k6 and sse extension

SSE is not supported by default in k6, you have to build k6 with the xk6-sse extension.

Example:

go install go.k6.io/xk6/cmd/xk6@latest
xk6 build master \
--with github.com/phymbert/xk6-sse

Download a dataset

This dataset was originally proposed in vLLM benchmarks.

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

Download a model

Example for PHI-2

../../../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf

Start the server

The server must answer OAI Chat completion requests on http://localhost:8080/v1 or according to the environment variable SERVER_BENCH_URL.

Example:

server --host localhost --port 8080 \
  --model ggml-model-q4_0.gguf \
  --cont-batching \
  --metrics \
  --parallel 8 \
  --batch-size 512 \
  --ctx-size 4096 \
  --log-format text \
  -ngl 33

Run the benchmark

For 500 chat completions request with 8 concurrent users during maximum 10 minutes, run:

./k6 run script.js --duration 10m --iterations 500 --vus 8

The benchmark values can be overridden with:

SERVER_BENCH_URL server url prefix for chat completions, default http://localhost:8080/v1
SERVER_BENCH_N_PROMPTS total prompts to randomly select in the benchmark, default 480
SERVER_BENCH_MODEL_ALIAS model alias to pass in the completion request, default my-model
SERVER_BENCH_MAX_TOKENS max tokens to predict, default: 512
SERVER_BENCH_DATASET path to the benchmark dataset file
SERVER_BENCH_MAX_PROMPT_TOKENS maximum prompt tokens to filter out in the dataset: default 1024
SERVER_BENCH_MAX_CONTEXT maximum context size of the completions request to filter out in the dataset: prompt + predicted tokens, default 2048

Note: the local tokenizer is just a string space split, real number of tokens will differ.

Or with k6 options:

SERVER_BENCH_N_PROMPTS=500 k6 run script.js --duration 10m --iterations 500 --vus 8

To debug http request use --http-debug="full".

Metrics

Following metrics are available computed from the OAI chat completions response usage:

llamacpp_tokens_second Trend of usage.total_tokens / request duration
llamacpp_prompt_tokens Trend of usage.prompt_tokens
llamacpp_prompt_tokens_total_counter Counter of usage.prompt_tokens
llamacpp_completion_tokens Trend of usage.completion_tokens
llamacpp_completion_tokens_total_counter Counter of usage.completion_tokens
llamacpp_completions_truncated_rate Rate of completions truncated, i.e. if finish_reason === 'length'
llamacpp_completions_stop_rate Rate of completions stopped by the model, i.e. if finish_reason === 'stop'

The script will fail if too many completions are truncated, see llamacpp_completions_truncated_rate.

K6 metrics might be compared against server metrics, with:

curl http://localhost:8080/metrics

Using the CI python script

The bench.py script does several steps:

start the server
define good variable for k6
run k6 script
extract metrics from prometheus

It aims to be used in the CI, but you can run it manually:

LLAMA_SERVER_BIN_PATH=../../../cmake-build-release/bin/server python bench.py \
              --runner-label local \
              --name local \
              --branch `git rev-parse --abbrev-ref HEAD` \
              --commit `git rev-parse HEAD` \
              --scenario script.js \
              --duration 5m \
              --hf-repo ggml-org/models	 \
              --hf-file phi-2/ggml-model-q4_0.gguf \
              --model-path-prefix models \
              --parallel 4 \
              -ngl 33 \
              --batch-size 2048 \
              --ubatch-size	256 \
              --ctx-size 4096 \
              --n-prompts 200 \
              --max-prompt-tokens 256 \
              --max-tokens 256