mirror of https://github.com/ggerganov/llama.cpp.git (synced 2025-01-14 04:24:30 +00:00)
docs : remove obsolete make references, scripts, examples
ggml-ci
parent c536c07e1e
commit 328ded353b
@@ -27,13 +27,6 @@ We recommend using openmp since it's easier to modify the cores being used.
 
 ### llama.cpp compilation
 
-Makefile:
-
-```bash
-make GGML_BLIS=1 -j
-# make GGML_BLIS=1 llama-benchmark-matmult
-```
-
 CMake:
 
 ```bash
@@ -18,7 +18,6 @@ In order to build llama.cpp you have four different options.
 
 **Notes**:
 
-- For `Q4_0_4_4` quantization type build, add the `-DGGML_LLAMAFILE=OFF` cmake option. For example, use `cmake -B build -DGGML_LLAMAFILE=OFF`.
 - For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
 - For faster repeated compilation, install [ccache](https://ccache.dev/).
 - For debug builds, there are two cases:
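The `-j` note in the hunk above hard-codes 8 jobs. A minimal sketch of deriving the job count from the machine instead is shown below. `build_cmd` is a hypothetical helper (not part of llama.cpp), and it only prints the command rather than running it.

```bash
#!/bin/bash
# Print (do not run) a parallel CMake build command.
# build_cmd is a hypothetical helper used only for illustration.
build_cmd() {
    local jobs
    # getconf reports the number of online CPUs on Linux and macOS;
    # fall back to a single job if the query fails.
    jobs="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
    echo "cmake --build build --config Release -j ${jobs}"
}

build_cmd
```

Printing the command keeps the sketch safe to run anywhere; pass the output to `sh` or paste it into a shell inside a configured checkout to actually build.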
@@ -337,9 +336,3 @@ For detailed info, such as model/device supports, CANN install, please refer to
 ### Android
 
 To read documentation for how to build on Android, [click here](./android.md)
-
-### Arm CPU optimized mulmat kernels
-
-Llama.cpp includes a set of optimized mulmat kernels for the Arm architecture, leveraging Arm® Neon™, int8mm and SVE instructions. These kernels are enabled at build time through the appropriate compiler cpu-type flags, such as `-DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+sve`. Note that these optimized kernels require the model to be quantized into one of the formats: `Q4_0_4_4` (Arm Neon), `Q4_0_4_8` (int8mm) or `Q4_0_8_8` (SVE). The SVE mulmat kernel specifically requires a vector width of 256 bits. When running on devices with a different vector width, it is recommended to use the `Q4_0_4_8` (int8mm) or `Q4_0_4_4` (Arm Neon) formats for better performance. Refer to [examples/quantize/README.md](../examples/quantize/README.md) for more information on the quantization formats.
-
-To support `Q4_0_4_4`, you must build with `GGML_NO_LLAMAFILE=1` (`make`) or `-DGGML_LLAMAFILE=OFF` (`cmake`).
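The paragraph removed in the hunk above ties each quantization format to a CPU feature. The selection rule it describes can be sketched as follows. `pick_q4_format` and its two arguments (a space-separated feature list and the SVE vector width in bits) are hypothetical, the sketch assumes Neon is always available as the fallback, and the commit treats this guidance as obsolete.

```bash
#!/bin/bash
# pick_q4_format FEATURES [SVE_BITS]
# Hypothetical helper: FEATURES is a space-separated list of CPU feature
# names (for example "neon i8mm sve"); SVE_BITS is the SVE vector width.
pick_q4_format() {
    local features=" $1 " sve_bits="${2:-0}"
    if [[ "$features" == *" sve "* && "$sve_bits" -eq 256 ]]; then
        echo "Q4_0_8_8"   # SVE kernel, requires 256-bit vectors
    elif [[ "$features" == *" i8mm "* ]]; then
        echo "Q4_0_4_8"   # int8mm kernel
    else
        echo "Q4_0_4_4"   # Arm Neon kernel (assumed always present)
    fi
}

# SVE is present but not 256 bits wide, so the int8mm format is chosen.
pick_q4_format "neon i8mm sve" 128
```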
@@ -1,61 +0,0 @@
-#!/bin/bash
-#
-# Few-shot translation example.
-# Requires a base model (i.e. no fine-tuned or instruct models).
-#
-# Usage:
-#
-# cd llama.cpp
-# make -j
-#
-# ./examples/base-translate.sh <model-base> "<text>" [extra-main-args]
-#
-
-if [ $# -lt 2 ]; then
-    echo "Usage: ./base-translate.sh <model-base> \"<text>\" [extra-main-args]"
-    exit 1
-fi
-
-eargs=""
-if [ $# -gt 2 ]; then
-    eargs="${@:3}"
-fi
-
-ftmp="__llama.cpp_example_tmp__.txt"
-trap "rm -f $ftmp" EXIT
-
-echo "Translate from English to French:
-
-===
-
-sea otter, peppermint, plush girafe:
-
-sea otter => loutre de mer
-peppermint => menthe poivrée
-plush girafe => girafe peluche
-
-===
-
-violin
-
-violin => violon
-
-===
-
-phone, computer, mouse, keyboard:
-
-phone => téléphone
-computer => ordinateur
-mouse => souris
-keyboard => clavier
-
-===
-" > $ftmp
-
-echo "$2
-" >> $ftmp
-
-model=$1
-
-# generate the most likely continuation until the string "===" is found
-./llama-cli -m $model -f $ftmp -n 64 --temp 0 --repeat-penalty 1.0 --no-penalize-nl -r "===" $eargs
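The removed script writes its few-shot prompt to a fixed file name in the current directory and deletes it with an `EXIT` trap. A minimal sketch of the same idea using `mktemp` follows. `write_prompt` is a hypothetical helper, the prompt is shortened to one example pair, and no model is invoked.

```bash
#!/bin/bash
# Hypothetical helper: write a short few-shot prompt, then the text to translate.
write_prompt() {
    local out="$1" text="$2"
    {
        printf 'Translate from English to French:\n\n===\n\n'
        printf 'violin\n\nviolin => violon\n\n===\n\n'
        printf '%s\n\n' "$text"
    } > "$out"
}

ftmp="$(mktemp)"             # unique name instead of a fixed file in the cwd
trap 'rm -f "$ftmp"' EXIT    # single quotes: "$ftmp" is expanded when the trap runs
write_prompt "$ftmp" "sea otter"
cat "$ftmp"
```

The resulting file could then be passed to `llama-cli` with `-f`, as the removed script did, with `-r "==="` stopping generation at the next separator.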
@@ -2,11 +2,8 @@
 
 This example reads weights from project [llama2.c](https://github.com/karpathy/llama2.c) and saves them in ggml compatible format. The vocab that is available in `models/ggml-vocab.bin` is used by default.
 
-To convert the model first download the models from the [llama2.c](https://github.com/karpathy/llama2.c) repository:
+To convert the model first download the models from the [llama2.c](https://github.com/karpathy/llama2.c) repository.
 
-`$ make -j`
-
-After successful compilation, following usage options are available:
 ```
 usage: ./llama-convert-llama2c-to-ggml [options]
 
@@ -25,8 +25,6 @@ For faster computation, make sure to use GPU offloading via the `-ngl` argument
 ## Example
 
 ```bash
-GGML_CUDA=1 make -j
-
 # generate importance matrix (imatrix.dat)
 ./llama-imatrix -m ggml-model-f16.gguf -f train-data.txt -ngl 99
 
@@ -188,12 +188,6 @@ services:
 
 `llama-server` is built alongside everything else from the root of the project
 
-- Using `make`:
-
-  ```bash
-  make llama-server
-  ```
-
 - Using `CMake`:
 
   ```bash
@@ -207,15 +201,6 @@ services:
 
 `llama-server` can also be built with SSL support using OpenSSL 3
 
-- Using `make`:
-
-  ```bash
-  # NOTE: For non-system openssl, use the following:
-  # CXXFLAGS="-I /path/to/openssl/include"
-  # LDFLAGS="-L /path/to/openssl/lib"
-  make LLAMA_SERVER_SSL=true llama-server
-  ```
-
 - Using `CMake`:
 
   ```bash
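With the `make` instructions gone, the two server hunks above leave CMake as the only documented route. A hedged sketch of the corresponding commands is given below. `server_build_cmds` is a hypothetical helper that only prints them, and the `-DLLAMA_SERVER_SSL=ON` option and `-t llama-server` target are assumed to match the CMake section that these hunks cut off, so check the server README before relying on them.

```bash
#!/bin/bash
# Print (do not run) the CMake commands for building llama-server.
# Usage: server_build_cmds [ssl]
# server_build_cmds is a hypothetical helper used only for illustration.
server_build_cmds() {
    local flags=""
    if [ "$1" = "ssl" ]; then
        # Assumed CMake counterpart of the removed LLAMA_SERVER_SSL=true
        flags=" -DLLAMA_SERVER_SSL=ON"
    fi
    echo "cmake -B build${flags}"
    echo "cmake --build build --config Release -t llama-server"
}

server_build_cmds ssl
```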
@@ -1,212 +0,0 @@
-#!/bin/bash
-#
-# Use this script only on fresh pods (runpod.io)!
-# Otherwise, it can break your environment!
-#
-
-if [ -z "$1" ]; then
-    echo "Usage: $0 <data>"
-    echo " 0: no models"
-    echo " 1: tinyllama-1b"
-    echo " 2: codellama-7b"
-    echo " 3: codellama-13b"
-    echo " 4: codellama-34b"
-    echo " 5: codellama-7b-instruct"
-    echo " 6: codellama-13b-instruct"
-    echo " 7: codellama-34b-instruct"
-
-    exit 1
-fi
-
-set -x
-
-# setup deps
-apt-get update
-apt-get install -y git-lfs cmake cmake-curses-gui vim ruby
-git-lfs install
-
-if [ ! -d "/workspace" ]; then
-    ln -sfn $(pwd) /workspace
-fi
-
-# download data
-cd /workspace
-
-# this is useful to git clone repos without doubling the disk size due to .git
-git clone https://github.com/iboB/git-lfs-download
-ln -sfn /workspace/git-lfs-download/git-lfs-download /usr/local/bin/git-lfs-download
-
-# llama.cpp
-cd /workspace
-git clone https://github.com/ggerganov/llama.cpp
-
-cd llama.cpp
-
-GGML_CUDA=1 make -j
-
-ln -sfn /workspace/TinyLlama-1.1B-Chat-v0.3 ./models/tinyllama-1b
-ln -sfn /workspace/CodeLlama-7b-hf ./models/codellama-7b
-ln -sfn /workspace/CodeLlama-13b-hf ./models/codellama-13b
-ln -sfn /workspace/CodeLlama-34b-hf ./models/codellama-34b
-ln -sfn /workspace/CodeLlama-7b-Instruct-hf ./models/codellama-7b-instruct
-ln -sfn /workspace/CodeLlama-13b-Instruct-hf ./models/codellama-13b-instruct
-ln -sfn /workspace/CodeLlama-34b-Instruct-hf ./models/codellama-34b-instruct
-
-pip install -r requirements.txt
-
-# cmake
-cd /workspace/llama.cpp
-
-mkdir build-cublas
-cd build-cublas
-
-cmake -DGGML_CUDA=1 ../
-make -j
-
-if [ "$1" -eq "0" ]; then
-    exit 0
-fi
-
-# more models
-if [ "$1" -eq "1" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.3
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/tinyllama-1b --outfile ./models/tinyllama-1b/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/tinyllama-1b/ggml-model-f16.gguf ./models/tinyllama-1b/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/tinyllama-1b/ggml-model-f16.gguf ./models/tinyllama-1b/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/tinyllama-1b/ggml-model-f16.gguf ./models/tinyllama-1b/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "2" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-7b-hf --without *safetensors*
-    rm -v ./CodeLlama-7b-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-7b --outfile ./models/codellama-7b/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-7b/ggml-model-f16.gguf ./models/codellama-7b/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-7b/ggml-model-f16.gguf ./models/codellama-7b/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-7b/ggml-model-f16.gguf ./models/codellama-7b/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "3" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-13b-hf --without *safetensors*
-    rm -v ./CodeLlama-13b-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-13b --outfile ./models/codellama-13b/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-13b/ggml-model-f16.gguf ./models/codellama-13b/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-13b/ggml-model-f16.gguf ./models/codellama-13b/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-13b/ggml-model-f16.gguf ./models/codellama-13b/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "4" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-34b-hf --without *safetensors*
-    rm -v ./CodeLlama-34b-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-34b --outfile ./models/codellama-34b/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-34b/ggml-model-f16.gguf ./models/codellama-34b/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-34b/ggml-model-f16.gguf ./models/codellama-34b/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-34b/ggml-model-f16.gguf ./models/codellama-34b/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "5" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf --without *safetensors*
-    rm -v ./CodeLlama-7b-Instruct-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-7b-instruct --outfile ./models/codellama-7b-instruct/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-7b-instruct/ggml-model-f16.gguf ./models/codellama-7b-instruct/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-7b-instruct/ggml-model-f16.gguf ./models/codellama-7b-instruct/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-7b-instruct/ggml-model-f16.gguf ./models/codellama-7b-instruct/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "6" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf --without *safetensors*
-    rm -v ./CodeLlama-13b-Instruct-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-13b-instruct --outfile ./models/codellama-13b-instruct/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-13b-instruct/ggml-model-f16.gguf ./models/codellama-13b-instruct/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-13b-instruct/ggml-model-f16.gguf ./models/codellama-13b-instruct/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-13b-instruct/ggml-model-f16.gguf ./models/codellama-13b-instruct/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "7" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf --without *safetensors*
-    rm -v ./CodeLlama-34b-Instruct-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-34b-instruct --outfile ./models/codellama-34b-instruct/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-34b-instruct/ggml-model-f16.gguf ./models/codellama-34b-instruct/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-34b-instruct/ggml-model-f16.gguf ./models/codellama-34b-instruct/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-34b-instruct/ggml-model-f16.gguf ./models/codellama-34b-instruct/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "1" ]; then
-    # perf + perplexity
-    cd /workspace/llama.cpp/build-cublas
-
-    make -j && ../scripts/run-all-perf.sh tinyllama-1b "f16" "-ngl 99 -t 1 -p 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64,128,256,512,1024,2048 -n 128"
-
-    ../scripts/get-wikitext-2.sh
-    unzip wikitext-2-raw-v1.zip
-
-    make -j && ./bin/llama-perplexity -m ../models/tinyllama-1b/ggml-model-f16.gguf -f ./wikitext-2-raw/wiki.test.raw -ngl 100 --chunks 32
-
-    # batched
-    cd /workspace/llama.cpp
-
-    GGML_CUDA=1 make -j && ./llama-batched ./models/tinyllama-1b/ggml-model-f16.gguf "Hello, my name is" 8 128 999
-
-    # batched-bench
-    cd /workspace/llama.cpp
-
-    GGML_CUDA=1 make -j && ./llama-batched-bench ./models/tinyllama-1b/ggml-model-f16.gguf 4608 1 99 0 512 128 1,2,3,4,5,6,7,8,16,32
-
-    # parallel
-    cd /workspace/llama.cpp
-
-    GGML_CUDA=1 make -j && ./llama-parallel -m ./models/tinyllama-1b/ggml-model-f16.gguf -t 1 -ngl 100 -c 4096 -b 512 -s 1 -np 8 -ns 128 -n 100 -cb
-
-fi
-
-# speculative
-#if [ "$1" -eq "7" ]; then
-# cd /workspace/llama.cpp
-#
-# GGML_CUDA=1 make -j && ./llama-speculative -m ./models/codellama-34b-instruct/ggml-model-f16.gguf -md ./models/codellama-7b-instruct/ggml-model-q4_0.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -ngl 999 -ngld 999 -t 4 -n 512 -c 4096 -s 21 --draft 16 -np 1 --temp 0.0
-#fi
-
-# more benches
-#GGML_CUDA=1 make -j && ./llama-batched-bench ./models/codellama-7b/ggml-model-q4_k.gguf 4096 1 99 1 512,3200 128,128,800 1
-#GGML_CUDA=1 make -j && ./llama-batched-bench ./models/codellama-13b/ggml-model-q4_k.gguf 4096 1 99 1 512,3200 128,128,800 1
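The removed script repeats the same three `llama-quantize` calls for every model. As a rough sketch of how that repetition could be folded into a loop, the hypothetical helper below prints the commands for one model directory. `quantize_cmds` is not part of llama.cpp, the paths follow the removed script, and nothing is executed.

```bash
#!/bin/bash
# Print (do not run) the llama-quantize commands for one model directory.
# quantize_cmds is a hypothetical helper; paths follow the removed script.
quantize_cmds() {
    local dir="./models/$1" qt
    for qt in q4_0 q4_k q8_0; do
        echo "./llama-quantize ${dir}/ggml-model-f16.gguf ${dir}/ggml-model-${qt}.gguf ${qt}"
    done
}

quantize_cmds tinyllama-1b
```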
@@ -1,418 +0,0 @@
-#!/bin/bash
-#
-# Helper script for deploying llama.cpp server with a single Bash command
-#
-# - Works on Linux and macOS
-# - Supports: CPU, CUDA, Metal
-# - Can run all GGUF models from HuggingFace
-# - Can serve requests in parallel
-# - Always builds latest llama.cpp from GitHub
-#
-# Limitations
-#
-# - Chat templates are poorly supported (base models recommended)
-# - Might be unstable!
-#
-# Usage:
-# ./server-llm.sh [--port] [--repo] [--wtype] [--backend] [--gpu-id] [--n-parallel] [--n-kv] [--verbose] [-non-interactive]
-#
-# --port: port number, default is 8888
-# --repo: path to a repo containing GGUF model files
-# --wtype: weights type (f16, q8_0, q4_0, q4_1), default is user-input
-# --backend: cpu, cuda, metal, depends on the OS
-# --gpu-id: gpu id, default is 0
-# --n-parallel: number of parallel requests, default is 8
-# --n-kv: KV cache size, default is 4096
-# --verbose: verbose output
-# --non-interactive: run without asking a permission to run
-#
-# Example:
-#
-# bash -c "$(curl -s https://ggml.ai/server-llm.sh)"
-#
-
-set -e
-
-# required utils: curl, git, make
-if ! command -v curl &> /dev/null; then
-    printf "[-] curl not found\n"
-    exit 1
-fi
-if ! command -v git &> /dev/null; then
-    printf "[-] git not found\n"
-    exit 1
-fi
-if ! command -v make &> /dev/null; then
-    printf "[-] make not found\n"
-    exit 1
-fi
-
-# parse arguments
-is_interactive=1
-port=8888
-repo=""
-wtype=""
-backend="cpu"
-
-# if macOS, use metal backend by default
-if [[ "$OSTYPE" == "darwin"* ]]; then
-    backend="metal"
-elif command -v nvcc &> /dev/null; then
-    backend="cuda"
-fi
-
-gpu_id=0
-n_parallel=8
-n_kv=4096
-verbose=0
-
-function print_usage {
-    printf "Usage:\n"
-    printf " ./server-llm.sh [--port] [--repo] [--wtype] [--backend] [--gpu-id] [--n-parallel] [--n-kv] [--verbose] [-non-interactive]\n\n"
-    printf " --port: port number, default is 8888\n"
-    printf " --repo: path to a repo containing GGUF model files\n"
-    printf " --wtype: weights type (f16, q8_0, q4_0, q4_1), default is user-input\n"
-    printf " --backend: cpu, cuda, metal, depends on the OS\n"
-    printf " --gpu-id: gpu id, default is 0\n"
-    printf " --n-parallel: number of parallel requests, default is 8\n"
-    printf " --n-kv: KV cache size, default is 4096\n"
-    printf " --verbose: verbose output\n\n"
-    printf " --non-interactive: run without asking a permission to run\n"
-    printf "Example:\n\n"
-    printf ' bash -c "$(curl -s https://ggml.ai/server-llm.sh)"\n\n'
-}
-
-while [[ $# -gt 0 ]]; do
-    key="$1"
-    case $key in
-        --non-interactive)
-            is_interactive=0
-            shift
-            ;;
-        --port)
-            port="$2"
-            shift
-            shift
-            ;;
-        --repo)
-            repo="$2"
-            shift
-            shift
-            ;;
-        --wtype)
-            wtype="$2"
-            shift
-            shift
-            ;;
-        --backend)
-            backend="$2"
-            shift
-            shift
-            ;;
-        --gpu-id)
-            gpu_id="$2"
-            shift
-            shift
-            ;;
-        --n-parallel)
-            n_parallel="$2"
-            shift
-            shift
-            ;;
-        --n-kv)
-            n_kv="$2"
-            shift
-            shift
-            ;;
-        --verbose)
-            verbose=1
-            shift
-            ;;
-        --help)
-            print_usage
-            exit 0
-            ;;
-        *)
-            echo "Unknown argument: $key"
-            print_usage
-            exit 1
-            ;;
-    esac
-done
-
-# available weights types
-wtypes=("F16" "Q8_0" "Q4_0" "Q4_1" "Q5_0" "Q5_1" "Q6_K" "Q5_K_M" "Q5_K_S" "Q4_K_M" "Q4_K_S" "Q3_K_L" "Q3_K_M" "Q3_K_S" "Q2_K")
-
-wfiles=()
-for wt in "${wtypes[@]}"; do
-    wfiles+=("")
-done
-
-# map wtype input to index
-if [[ ! -z "$wtype" ]]; then
-    iw=-1
-    is=0
-    for wt in "${wtypes[@]}"; do
-        # uppercase
-        uwt=$(echo "$wt" | tr '[:lower:]' '[:upper:]')
-        if [[ "$uwt" == "$wtype" ]]; then
-            iw=$is
-            break
-        fi
-        is=$((is+1))
-    done
-
-    if [[ $iw -eq -1 ]]; then
-        printf "[-] Invalid weight type: %s\n" "$wtype"
-        exit 1
-    fi
-
-    wtype="$iw"
-fi
-
-# sample repos
-repos=(
-    "https://huggingface.co/TheBloke/Llama-2-7B-GGUF"
-    "https://huggingface.co/TheBloke/Llama-2-13B-GGUF"
-    "https://huggingface.co/TheBloke/Llama-2-70B-GGUF"
-    "https://huggingface.co/TheBloke/CodeLlama-7B-GGUF"
-    "https://huggingface.co/TheBloke/CodeLlama-13B-GGUF"
-    "https://huggingface.co/TheBloke/CodeLlama-34B-GGUF"
-    "https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF"
-    "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF"
-    "https://huggingface.co/TheBloke/OpenHermes-2-Mistral-7B-GGUF"
-    "https://huggingface.co/TheBloke/CausalLM-7B-GGUF"
-)
-if [ $is_interactive -eq 1 ]; then
-    printf "\n"
-    printf "[I] This is a helper script for deploying llama.cpp's server on this machine.\n\n"
-    printf " Based on the options that follow, the script might download a model file\n"
-    printf " from the internet, which can be a few GBs in size. The script will also\n"
-    printf " build the latest llama.cpp source code from GitHub, which can be unstable.\n"
-    printf "\n"
-    printf " Upon success, an HTTP server will be started and it will serve the selected\n"
-    printf " model using llama.cpp for demonstration purposes.\n"
-    printf "\n"
-    printf " Please note:\n"
-    printf "\n"
-    printf " - All new data will be stored in the current folder\n"
-    printf " - The server will be listening on all network interfaces\n"
-    printf " - The server will run with default settings which are not always optimal\n"
-    printf " - Do not judge the quality of a model based on the results from this script\n"
-    printf " - Do not use this script to benchmark llama.cpp\n"
-    printf " - Do not use this script in production\n"
-    printf " - This script is only for demonstration purposes\n"
-    printf "\n"
-    printf " If you don't know what you are doing, please press Ctrl-C to abort now\n"
-    printf "\n"
-    printf " Press Enter to continue ...\n\n"
-
-    read
-fi
-
-if [[ -z "$repo" ]]; then
-    printf "[+] No repo provided from the command line\n"
-    printf " Please select a number from the list below or enter an URL:\n\n"
-
-    is=0
-    for r in "${repos[@]}"; do
-        printf " %2d) %s\n" $is "$r"
-        is=$((is+1))
-    done
-
-    # ask for repo until index of sample repo is provided or an URL
-    while [[ -z "$repo" ]]; do
-        printf "\n Or choose one from: https://huggingface.co/models?sort=trending&search=gguf\n\n"
-        read -p "[+] Select repo: " repo
-
-        # check if the input is a number
-        if [[ "$repo" =~ ^[0-9]+$ ]]; then
-            if [[ "$repo" -ge 0 && "$repo" -lt ${#repos[@]} ]]; then
-                repo="${repos[$repo]}"
-            else
-                printf "[-] Invalid repo index: %s\n" "$repo"
-                repo=""
-            fi
-        elif [[ "$repo" =~ ^https?:// ]]; then
-            repo="$repo"
-        else
-            printf "[-] Invalid repo URL: %s\n" "$repo"
-            repo=""
-        fi
-    done
-fi
-
-# remove suffix
-repo=$(echo "$repo" | sed -E 's/\/tree\/main$//g')
-
-printf "[+] Checking for GGUF model files in %s\n" "$repo"
-
-# find GGUF files in the source
-# TODO: better logic
-model_tree="${repo%/}/tree/main"
-model_files=$(curl -s "$model_tree" | grep -i "\\.gguf</span>" | sed -E 's/.*<span class="truncate group-hover:underline">(.*)<\/span><\/a>/\1/g')
-
-# list all files in the provided git repo
-printf "[+] Model files:\n\n"
-for file in $model_files; do
-    # determine iw by grepping the filename with wtypes
-    iw=-1
-    is=0
-    for wt in "${wtypes[@]}"; do
-        # uppercase
-        ufile=$(echo "$file" | tr '[:lower:]' '[:upper:]')
-        if [[ "$ufile" =~ "$wt" ]]; then
-            iw=$is
-            break
-        fi
-        is=$((is+1))
-    done
-
-    if [[ $iw -eq -1 ]]; then
-        continue
-    fi
-
-    wfiles[$iw]="$file"
-
-    have=" "
-    if [[ -f "$file" ]]; then
-        have="*"
-    fi
-
-    printf " %2d) %s %s\n" $iw "$have" "$file"
-done
-
-wfile="${wfiles[$wtype]}"
-
-# ask for weights type until provided and available
-while [[ -z "$wfile" ]]; do
-    printf "\n"
-    read -p "[+] Select weight type: " wtype
-    wfile="${wfiles[$wtype]}"
-
-    if [[ -z "$wfile" ]]; then
-        printf "[-] Invalid weight type: %s\n" "$wtype"
-        wtype=""
-    fi
-done
-
-printf "[+] Selected weight type: %s (%s)\n" "$wtype" "$wfile"
-
-url="${repo%/}/resolve/main/$wfile"
-
-# check file if the model has been downloaded before
-chk="$wfile.chk"
-
-# check if we should download the file
-# - if $wfile does not exist
-# - if $wfile exists but $chk does not exist
-# - if $wfile exists and $chk exists but $wfile is newer than $chk
-# TODO: better logic using git lfs info
-
-do_download=0
-
-if [[ ! -f "$wfile" ]]; then
-    do_download=1
-elif [[ ! -f "$chk" ]]; then
-    do_download=1
-elif [[ "$wfile" -nt "$chk" ]]; then
-    do_download=1
-fi
-
-if [[ $do_download -eq 1 ]]; then
-    printf "[+] Downloading weights from %s\n" "$url"
-
-    # download the weights file
-    curl -o "$wfile" -# -L "$url"
-
-    # create a check file if successful
-    if [[ $? -eq 0 ]]; then
-        printf "[+] Creating check file %s\n" "$chk"
-        touch "$chk"
-    fi
-else
-    printf "[+] Using cached weights %s\n" "$wfile"
-fi
-
-# get latest llama.cpp and build
-
-printf "[+] Downloading latest llama.cpp\n"
-
-llama_cpp_dir="__llama_cpp_port_${port}__"
-
-if [[ -d "$llama_cpp_dir" && ! -f "$llama_cpp_dir/__ggml_script__" ]]; then
-    # if the dir exists and there isn't a file "__ggml_script__" in it, abort
-    printf "[-] Directory %s already exists\n" "$llama_cpp_dir"
-    printf "[-] Please remove it and try again\n"
-    exit 1
-elif [[ -d "$llama_cpp_dir" ]]; then
-    printf "[+] Directory %s already exists\n" "$llama_cpp_dir"
-    printf "[+] Using cached llama.cpp\n"
-
-    cd "$llama_cpp_dir"
-    git reset --hard
-    git fetch
-    git checkout origin/master
-
-    cd ..
-else
-    printf "[+] Cloning llama.cpp\n"
-
-    git clone https://github.com/ggerganov/llama.cpp "$llama_cpp_dir"
-fi
-
-# mark that that the directory is made by this script
-touch "$llama_cpp_dir/__ggml_script__"
-
-if [[ $verbose -eq 1 ]]; then
-    set -x
-fi
-
-# build
-cd "$llama_cpp_dir"
-
-make clean
-
-log="--silent"
-if [[ $verbose -eq 1 ]]; then
-    log=""
-fi
-
-if [[ "$backend" == "cuda" ]]; then
-    printf "[+] Building with CUDA backend\n"
-    GGML_CUDA=1 make -j llama-server $log
-elif [[ "$backend" == "cpu" ]]; then
-    printf "[+] Building with CPU backend\n"
-    make -j llama-server $log
-elif [[ "$backend" == "metal" ]]; then
-    printf "[+] Building with Metal backend\n"
-    make -j llama-server $log
-else
-    printf "[-] Unknown backend: %s\n" "$backend"
-    exit 1
-fi
-
-# run the server
-
-printf "[+] Running server\n"
-
-args=""
-if [[ "$backend" == "cuda" ]]; then
-    export CUDA_VISIBLE_DEVICES=$gpu_id
-    args="-ngl 999"
-elif [[ "$backend" == "cpu" ]]; then
-    args="-ngl 0"
-elif [[ "$backend" == "metal" ]]; then
-    args="-ngl 999"
-else
-    printf "[-] Unknown backend: %s\n" "$backend"
-    exit 1
-fi
-
-if [[ $verbose -eq 1 ]]; then
-    args="$args --verbose"
-fi
-
-./llama-server -m "../$wfile" --host 0.0.0.0 --port "$port" -c $n_kv -np "$n_parallel" $args
-
-exit 0
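Two pieces of logic in the removed script are easy to isolate: mapping a weight-type name to its index, and deciding whether the weights file must be downloaded again. The sketch below restates both as small functions. `wtype_index` and `needs_download` are hypothetical names, and the type list is shortened for illustration. One deliberate difference: the removed script upper-cases the list entry and compares it with the raw user input, so a lowercase value such as `q4_0` (the spelling its own usage text shows) would not match, whereas this sketch upper-cases the user input instead.

```bash
#!/bin/bash
# Hypothetical helpers modelled on the removed script's logic.

wtypes=("F16" "Q8_0" "Q4_0" "Q4_1")   # shortened list for illustration

# Print the index of a weight type, or return 1 if it is unknown.
# The user's input is upper-cased so that "q4_0" and "Q4_0" both match.
wtype_index() {
    local want i
    want="$(echo "$1" | tr '[:lower:]' '[:upper:]')"
    for i in "${!wtypes[@]}"; do
        if [[ "${wtypes[$i]}" == "$want" ]]; then
            echo "$i"
            return 0
        fi
    done
    return 1
}

# Succeed (exit status 0) when the weights file should be downloaded:
# it is missing, its check file is missing, or it is newer than the check file.
needs_download() {
    local wfile="$1" chk="$1.chk"
    [[ ! -f "$wfile" || ! -f "$chk" || "$wfile" -nt "$chk" ]]
}

wtype_index q4_0
```

Returning an exit status from `needs_download` lets callers write `if needs_download "$wfile"; then ...` instead of tracking a separate `do_download` flag.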