docs : remove obsolete make references, scripts, examples

ggml-ci

This commit is contained in:
    parent c536c07e1e
    commit 328ded353b
@@ -27,13 +27,6 @@ We recommend using openmp since it's easier to modify the cores being used.

 ### llama.cpp compilation

-Makefile:
-
-```bash
-make GGML_BLIS=1 -j
-# make GGML_BLIS=1 llama-benchmark-matmult
-```
-
 CMake:

 ```bash
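With the Makefile path gone, only the CMake instructions remain in the BLIS build doc. As a rough sketch of what the surviving CMake build looks like (the `GGML_BLAS`/`GGML_BLAS_VENDOR` option names are my assumption here, since the doc's own CMake block is cut off by this hunk):

```bash
# Sketch only: CMake-based BLIS build; option names assumed, check the BLIS backend doc
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=FLAME
cmake --build build --config Release -j
```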
@@ -18,7 +18,6 @@ In order to build llama.cpp you have four different options.

 **Notes**:

-- For `Q4_0_4_4` quantization type build, add the `-DGGML_LLAMAFILE=OFF` cmake option. For example, use `cmake -B build -DGGML_LLAMAFILE=OFF`.
 - For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
 - For faster repeated compilation, install [ccache](https://ccache.dev/).
 - For debug builds, there are two cases:
@@ -337,9 +336,3 @@ For detailed info, such as model/device supports, CANN install, please refer to
 ### Android

 To read documentation for how to build on Android, [click here](./android.md)
-
-### Arm CPU optimized mulmat kernels
-
-Llama.cpp includes a set of optimized mulmat kernels for the Arm architecture, leveraging Arm® Neon™, int8mm and SVE instructions. These kernels are enabled at build time through the appropriate compiler cpu-type flags, such as `-DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+sve`. Note that these optimized kernels require the model to be quantized into one of the formats: `Q4_0_4_4` (Arm Neon), `Q4_0_4_8` (int8mm) or `Q4_0_8_8` (SVE). The SVE mulmat kernel specifically requires a vector width of 256 bits. When running on devices with a different vector width, it is recommended to use the `Q4_0_4_8` (int8mm) or `Q4_0_4_4` (Arm Neon) formats for better performance. Refer to [examples/quantize/README.md](../examples/quantize/README.md) for more information on the quantization formats.
-
-To support `Q4_0_4_4`, you must build with `GGML_NO_LLAMAFILE=1` (`make`) or `-DGGML_LLAMAFILE=OFF` (`cmake`).
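The removed paragraph enables the Arm kernels through compiler cpu-type flags. For readers who still want that codegen with the CMake workflow referenced throughout this commit, a hedged sketch (the exact `-march` string depends on the target core, and passing the same flags to the C++ side is my assumption):

```bash
# Sketch: passing Arm cpu-type flags at configure time, as in the removed paragraph.
# armv8.2a+i8mm+sve is illustrative; use only the features your CPU actually has.
cmake -B build \
    -DCMAKE_C_FLAGS="-march=armv8.2a+i8mm+sve" \
    -DCMAKE_CXX_FLAGS="-march=armv8.2a+i8mm+sve"
cmake --build build --config Release -j
```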
@@ -1,61 +0,0 @@
-#!/bin/bash
-#
-# Few-shot translation example.
-# Requires a base model (i.e. no fine-tuned or instruct models).
-#
-# Usage:
-#
-#   cd llama.cpp
-#   make -j
-#
-#   ./examples/base-translate.sh <model-base> "<text>" [extra-main-args]
-#
-
-if [ $# -lt 2 ]; then
-  echo "Usage: ./base-translate.sh <model-base> \"<text>\" [extra-main-args]"
-  exit 1
-fi
-
-eargs=""
-if [ $# -gt 2 ]; then
-  eargs="${@:3}"
-fi
-
-ftmp="__llama.cpp_example_tmp__.txt"
-trap "rm -f $ftmp" EXIT
-
-echo "Translate from English to French:
-
-===
-
-sea otter, peppermint, plush girafe:
-
-sea otter => loutre de mer
-peppermint => menthe poivrée
-plush girafe => girafe peluche
-
-===
-
-violin
-
-violin => violon
-
-===
-
-phone, computer, mouse, keyboard:
-
-phone => téléphone
-computer => ordinateur
-mouse => souris
-keyboard => clavier
-
-===
-" > $ftmp
-
-echo "$2
-" >> $ftmp
-
-model=$1
-
-# generate the most likely continuation until the string "===" is found
-./llama-cli -m $model -f $ftmp -n 64 --temp 0 --repeat-penalty 1.0 --no-penalize-nl -r "===" $eargs
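The deleted `base-translate.sh` example assumed a `make -j` build with `llama-cli` sitting in the repository root. A minimal sketch of running the same few-shot translation idea against a CMake build instead; the `build/bin` path and the shortened prompt are illustrative, while the sampling flags come from the deleted script:

```bash
# Sketch: few-shot English -> French translation with a CMake-built llama-cli.
# Requires a base (non-instruct) GGUF model, as the deleted script notes.
cmake -B build && cmake --build build --config Release -j

cat > prompt.txt << 'EOF'
Translate from English to French:

===

violin

violin => violon

===

cheese
EOF

# generate the most likely continuation until the next "===" delimiter
./build/bin/llama-cli -m model-base.gguf -f prompt.txt -n 64 --temp 0 --repeat-penalty 1.0 -r "==="
```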
@@ -2,11 +2,8 @@

 This example reads weights from project [llama2.c](https://github.com/karpathy/llama2.c) and saves them in ggml compatible format. The vocab that is available in `models/ggml-vocab.bin` is used by default.

-To convert the model first download the models from the [llama2.c](https://github.com/karpathy/llama2.c) repository:
+To convert the model first download the models from the [llama2.c](https://github.com/karpathy/llama2.c) repository.

-`$ make -j`
-
-After successful compilation, following usage options are available:
 ```
 usage: ./llama-convert-llama2c-to-ggml [options]

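Since the `$ make -j` step is dropped here, building just this converter now goes through CMake. A sketch under the assumption that the CMake target carries the same name as the binary:

```bash
# Sketch: build and run the llama2.c converter with CMake (target name assumed)
cmake -B build
cmake --build build --config Release -j --target llama-convert-llama2c-to-ggml
./build/bin/llama-convert-llama2c-to-ggml --help   # expected to print the usage text shown in the hunk above
```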
@@ -25,8 +25,6 @@ For faster computation, make sure to use GPU offloading via the `-ngl` argument
 ## Example

 ```bash
-GGML_CUDA=1 make -j
-
 # generate importance matrix (imatrix.dat)
 ./llama-imatrix -m ggml-model-f16.gguf -f train-data.txt -ngl 99

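The dropped `GGML_CUDA=1 make -j` line maps directly onto the CMake option of the same name; roughly:

```bash
# Sketch: CUDA-enabled CMake build in place of the removed make invocation
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# then generate the importance matrix as in the example above (binary path assumed)
./build/bin/llama-imatrix -m ggml-model-f16.gguf -f train-data.txt -ngl 99
```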
@@ -188,12 +188,6 @@ services:

 `llama-server` is built alongside everything else from the root of the project

-- Using `make`:
-
-  ```bash
-  make llama-server
-  ```
-
 - Using `CMake`:

   ```bash
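Only the CMake bullet survives in the server README after this hunk. For completeness, a sketch of building just the server target (the explicit `--target` spelling is an assumption; the README's own CMake block is truncated here):

```bash
# Sketch: build only llama-server with CMake
cmake -B build
cmake --build build --config Release -j --target llama-server
```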
@@ -207,15 +201,6 @@ services:

 `llama-server` can also be built with SSL support using OpenSSL 3

-- Using `make`:
-
-  ```bash
-  # NOTE: For non-system openssl, use the following:
-  #   CXXFLAGS="-I /path/to/openssl/include"
-  #   LDFLAGS="-L /path/to/openssl/lib"
-  make LLAMA_SERVER_SSL=true llama-server
-  ```
-
 - Using `CMake`:

   ```bash
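The SSL-enabled `make` recipe removed here has a CMake counterpart built around an option of the same name; a hedged sketch (option name assumed, and pointing CMake at a non-system OpenSSL is shown via the standard FindOpenSSL variable):

```bash
# Sketch: SSL-enabled server build via CMake (LLAMA_SERVER_SSL assumed to be the option name)
# For a non-system OpenSSL, add something like: -DOPENSSL_ROOT_DIR=/path/to/openssl
cmake -B build -DLLAMA_SERVER_SSL=ON
cmake --build build --config Release -j --target llama-server
```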
@@ -1,212 +0,0 @@
-#!/bin/bash
-#
-# Use this script only on fresh pods (runpod.io)!
-# Otherwise, it can break your environment!
-#
-
-if [ -z "$1" ]; then
-    echo "Usage: $0 <data>"
-    echo "  0: no models"
-    echo "  1: tinyllama-1b"
-    echo "  2: codellama-7b"
-    echo "  3: codellama-13b"
-    echo "  4: codellama-34b"
-    echo "  5: codellama-7b-instruct"
-    echo "  6: codellama-13b-instruct"
-    echo "  7: codellama-34b-instruct"
-
-    exit 1
-fi
-
-set -x
-
-# setup deps
-apt-get update
-apt-get install -y git-lfs cmake cmake-curses-gui vim ruby
-git-lfs install
-
-if [ ! -d "/workspace" ]; then
-    ln -sfn $(pwd) /workspace
-fi
-
-# download data
-cd /workspace
-
-# this is useful to git clone repos without doubling the disk size due to .git
-git clone https://github.com/iboB/git-lfs-download
-ln -sfn /workspace/git-lfs-download/git-lfs-download /usr/local/bin/git-lfs-download
-
-# llama.cpp
-cd /workspace
-git clone https://github.com/ggerganov/llama.cpp
-
-cd llama.cpp
-
-GGML_CUDA=1 make -j
-
-ln -sfn /workspace/TinyLlama-1.1B-Chat-v0.3 ./models/tinyllama-1b
-ln -sfn /workspace/CodeLlama-7b-hf ./models/codellama-7b
-ln -sfn /workspace/CodeLlama-13b-hf ./models/codellama-13b
-ln -sfn /workspace/CodeLlama-34b-hf ./models/codellama-34b
-ln -sfn /workspace/CodeLlama-7b-Instruct-hf ./models/codellama-7b-instruct
-ln -sfn /workspace/CodeLlama-13b-Instruct-hf ./models/codellama-13b-instruct
-ln -sfn /workspace/CodeLlama-34b-Instruct-hf ./models/codellama-34b-instruct
-
-pip install -r requirements.txt
-
-# cmake
-cd /workspace/llama.cpp
-
-mkdir build-cublas
-cd build-cublas
-
-cmake -DGGML_CUDA=1 ../
-make -j
-
-if [ "$1" -eq "0" ]; then
-    exit 0
-fi
-
-# more models
-if [ "$1" -eq "1" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.3
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/tinyllama-1b --outfile ./models/tinyllama-1b/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/tinyllama-1b/ggml-model-f16.gguf ./models/tinyllama-1b/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/tinyllama-1b/ggml-model-f16.gguf ./models/tinyllama-1b/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/tinyllama-1b/ggml-model-f16.gguf ./models/tinyllama-1b/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "2" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-7b-hf --without *safetensors*
-    rm -v ./CodeLlama-7b-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-7b --outfile ./models/codellama-7b/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-7b/ggml-model-f16.gguf ./models/codellama-7b/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-7b/ggml-model-f16.gguf ./models/codellama-7b/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-7b/ggml-model-f16.gguf ./models/codellama-7b/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "3" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-13b-hf --without *safetensors*
-    rm -v ./CodeLlama-13b-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-13b --outfile ./models/codellama-13b/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-13b/ggml-model-f16.gguf ./models/codellama-13b/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-13b/ggml-model-f16.gguf ./models/codellama-13b/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-13b/ggml-model-f16.gguf ./models/codellama-13b/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "4" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-34b-hf --without *safetensors*
-    rm -v ./CodeLlama-34b-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-34b --outfile ./models/codellama-34b/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-34b/ggml-model-f16.gguf ./models/codellama-34b/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-34b/ggml-model-f16.gguf ./models/codellama-34b/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-34b/ggml-model-f16.gguf ./models/codellama-34b/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "5" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf --without *safetensors*
-    rm -v ./CodeLlama-7b-Instruct-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-7b-instruct --outfile ./models/codellama-7b-instruct/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-7b-instruct/ggml-model-f16.gguf ./models/codellama-7b-instruct/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-7b-instruct/ggml-model-f16.gguf ./models/codellama-7b-instruct/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-7b-instruct/ggml-model-f16.gguf ./models/codellama-7b-instruct/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "6" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf --without *safetensors*
-    rm -v ./CodeLlama-13b-Instruct-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-13b-instruct --outfile ./models/codellama-13b-instruct/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-13b-instruct/ggml-model-f16.gguf ./models/codellama-13b-instruct/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-13b-instruct/ggml-model-f16.gguf ./models/codellama-13b-instruct/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-13b-instruct/ggml-model-f16.gguf ./models/codellama-13b-instruct/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "7" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf --without *safetensors*
-    rm -v ./CodeLlama-34b-Instruct-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-34b-instruct --outfile ./models/codellama-34b-instruct/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-34b-instruct/ggml-model-f16.gguf ./models/codellama-34b-instruct/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-34b-instruct/ggml-model-f16.gguf ./models/codellama-34b-instruct/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-34b-instruct/ggml-model-f16.gguf ./models/codellama-34b-instruct/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "1" ]; then
-    # perf + perplexity
-    cd /workspace/llama.cpp/build-cublas
-
-    make -j && ../scripts/run-all-perf.sh tinyllama-1b "f16" "-ngl 99 -t 1 -p 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64,128,256,512,1024,2048 -n 128"
-
-    ../scripts/get-wikitext-2.sh
-    unzip wikitext-2-raw-v1.zip
-
-    make -j && ./bin/llama-perplexity -m ../models/tinyllama-1b/ggml-model-f16.gguf -f ./wikitext-2-raw/wiki.test.raw -ngl 100 --chunks 32
-
-    # batched
-    cd /workspace/llama.cpp
-
-    GGML_CUDA=1 make -j && ./llama-batched ./models/tinyllama-1b/ggml-model-f16.gguf "Hello, my name is" 8 128 999
-
-    # batched-bench
-    cd /workspace/llama.cpp
-
-    GGML_CUDA=1 make -j && ./llama-batched-bench ./models/tinyllama-1b/ggml-model-f16.gguf 4608 1 99 0 512 128 1,2,3,4,5,6,7,8,16,32
-
-    # parallel
-    cd /workspace/llama.cpp
-
-    GGML_CUDA=1 make -j && ./llama-parallel -m ./models/tinyllama-1b/ggml-model-f16.gguf -t 1 -ngl 100 -c 4096 -b 512 -s 1 -np 8 -ns 128 -n 100 -cb
-
-fi
-
-# speculative
-#if [ "$1" -eq "7" ]; then
-#    cd /workspace/llama.cpp
-#
-#    GGML_CUDA=1 make -j && ./llama-speculative -m ./models/codellama-34b-instruct/ggml-model-f16.gguf -md ./models/codellama-7b-instruct/ggml-model-q4_0.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -ngl 999 -ngld 999 -t 4 -n 512 -c 4096 -s 21 --draft 16 -np 1 --temp 0.0
-#fi
-
-# more benches
-#GGML_CUDA=1 make -j && ./llama-batched-bench ./models/codellama-7b/ggml-model-q4_k.gguf 4096 1 99 1 512,3200 128,128,800 1
-#GGML_CUDA=1 make -j && ./llama-batched-bench ./models/codellama-13b/ggml-model-q4_k.gguf 4096 1 99 1 512,3200 128,128,800 1
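The deleted pod script builds the tree twice, once with `GGML_CUDA=1 make -j` and once with CMake into `build-cublas`. Its convert-and-quantize steps are unaffected by the build system; a short sketch of the tinyllama-1b step driven purely by the CMake build (the binary location under `build-cublas/bin` is my assumption):

```bash
# Sketch: CMake-only variant of the script's tinyllama-1b convert + quantize step
cd /workspace/llama.cpp
cmake -B build-cublas -DGGML_CUDA=ON
cmake --build build-cublas --config Release -j

python3 examples/convert_legacy_llama.py ./models/tinyllama-1b \
    --outfile ./models/tinyllama-1b/ggml-model-f16.gguf --outtype f16

./build-cublas/bin/llama-quantize ./models/tinyllama-1b/ggml-model-f16.gguf \
    ./models/tinyllama-1b/ggml-model-q4_0.gguf q4_0
```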
@@ -1,418 +0,0 @@
-#!/bin/bash
-#
-# Helper script for deploying llama.cpp server with a single Bash command
-#
-# - Works on Linux and macOS
-# - Supports: CPU, CUDA, Metal
-# - Can run all GGUF models from HuggingFace
-# - Can serve requests in parallel
-# - Always builds latest llama.cpp from GitHub
-#
-# Limitations
-#
-# - Chat templates are poorly supported (base models recommended)
-# - Might be unstable!
-#
-# Usage:
-#   ./server-llm.sh [--port] [--repo] [--wtype] [--backend] [--gpu-id] [--n-parallel] [--n-kv] [--verbose] [-non-interactive]
-#
-#   --port:            port number, default is 8888
-#   --repo:            path to a repo containing GGUF model files
-#   --wtype:           weights type (f16, q8_0, q4_0, q4_1), default is user-input
-#   --backend:         cpu, cuda, metal, depends on the OS
-#   --gpu-id:          gpu id, default is 0
-#   --n-parallel:      number of parallel requests, default is 8
-#   --n-kv:            KV cache size, default is 4096
-#   --verbose:         verbose output
-#   --non-interactive: run without asking a permission to run
-#
-# Example:
-#
-#   bash -c "$(curl -s https://ggml.ai/server-llm.sh)"
-#
-
-set -e
-
-# required utils: curl, git, make
-if ! command -v curl &> /dev/null; then
-    printf "[-] curl not found\n"
-    exit 1
-fi
-if ! command -v git &> /dev/null; then
-    printf "[-] git not found\n"
-    exit 1
-fi
-if ! command -v make &> /dev/null; then
-    printf "[-] make not found\n"
-    exit 1
-fi
-
-# parse arguments
-is_interactive=1
-port=8888
-repo=""
-wtype=""
-backend="cpu"
-
-# if macOS, use metal backend by default
-if [[ "$OSTYPE" == "darwin"* ]]; then
-    backend="metal"
-elif command -v nvcc &> /dev/null; then
-    backend="cuda"
-fi
-
-gpu_id=0
-n_parallel=8
-n_kv=4096
-verbose=0
-
-function print_usage {
-    printf "Usage:\n"
-    printf "  ./server-llm.sh [--port] [--repo] [--wtype] [--backend] [--gpu-id] [--n-parallel] [--n-kv] [--verbose] [-non-interactive]\n\n"
-    printf "  --port:            port number, default is 8888\n"
-    printf "  --repo:            path to a repo containing GGUF model files\n"
-    printf "  --wtype:           weights type (f16, q8_0, q4_0, q4_1), default is user-input\n"
-    printf "  --backend:         cpu, cuda, metal, depends on the OS\n"
-    printf "  --gpu-id:          gpu id, default is 0\n"
-    printf "  --n-parallel:      number of parallel requests, default is 8\n"
-    printf "  --n-kv:            KV cache size, default is 4096\n"
-    printf "  --verbose:         verbose output\n\n"
-    printf "  --non-interactive: run without asking a permission to run\n"
-    printf "Example:\n\n"
-    printf '  bash -c "$(curl -s https://ggml.ai/server-llm.sh)"\n\n'
-}
-
-while [[ $# -gt 0 ]]; do
-    key="$1"
-    case $key in
-        --non-interactive)
-            is_interactive=0
-            shift
-            ;;
-        --port)
-            port="$2"
-            shift
-            shift
-            ;;
-        --repo)
-            repo="$2"
-            shift
-            shift
-            ;;
-        --wtype)
-            wtype="$2"
-            shift
-            shift
-            ;;
-        --backend)
-            backend="$2"
-            shift
-            shift
-            ;;
-        --gpu-id)
-            gpu_id="$2"
-            shift
-            shift
-            ;;
-        --n-parallel)
-            n_parallel="$2"
-            shift
-            shift
-            ;;
-        --n-kv)
-            n_kv="$2"
-            shift
-            shift
-            ;;
-        --verbose)
-            verbose=1
-            shift
-            ;;
-        --help)
-            print_usage
-            exit 0
-            ;;
-        *)
-            echo "Unknown argument: $key"
-            print_usage
-            exit 1
-            ;;
-    esac
-done
-
-# available weights types
-wtypes=("F16" "Q8_0" "Q4_0" "Q4_1" "Q5_0" "Q5_1" "Q6_K" "Q5_K_M" "Q5_K_S" "Q4_K_M" "Q4_K_S" "Q3_K_L" "Q3_K_M" "Q3_K_S" "Q2_K")
-
-wfiles=()
-for wt in "${wtypes[@]}"; do
-    wfiles+=("")
-done
-
-# map wtype input to index
-if [[ ! -z "$wtype" ]]; then
-    iw=-1
-    is=0
-    for wt in "${wtypes[@]}"; do
-        # uppercase
-        uwt=$(echo "$wt" | tr '[:lower:]' '[:upper:]')
-        if [[ "$uwt" == "$wtype" ]]; then
-            iw=$is
-            break
-        fi
-        is=$((is+1))
-    done
-
-    if [[ $iw -eq -1 ]]; then
-        printf "[-] Invalid weight type: %s\n" "$wtype"
-        exit 1
-    fi
-
-    wtype="$iw"
-fi
-
-# sample repos
-repos=(
-    "https://huggingface.co/TheBloke/Llama-2-7B-GGUF"
-    "https://huggingface.co/TheBloke/Llama-2-13B-GGUF"
-    "https://huggingface.co/TheBloke/Llama-2-70B-GGUF"
-    "https://huggingface.co/TheBloke/CodeLlama-7B-GGUF"
-    "https://huggingface.co/TheBloke/CodeLlama-13B-GGUF"
-    "https://huggingface.co/TheBloke/CodeLlama-34B-GGUF"
-    "https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF"
-    "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF"
-    "https://huggingface.co/TheBloke/OpenHermes-2-Mistral-7B-GGUF"
-    "https://huggingface.co/TheBloke/CausalLM-7B-GGUF"
-)
-if [ $is_interactive -eq 1 ]; then
-    printf "\n"
-    printf "[I] This is a helper script for deploying llama.cpp's server on this machine.\n\n"
-    printf "    Based on the options that follow, the script might download a model file\n"
-    printf "    from the internet, which can be a few GBs in size. The script will also\n"
-    printf "    build the latest llama.cpp source code from GitHub, which can be unstable.\n"
-    printf "\n"
-    printf "    Upon success, an HTTP server will be started and it will serve the selected\n"
-    printf "    model using llama.cpp for demonstration purposes.\n"
-    printf "\n"
-    printf "    Please note:\n"
-    printf "\n"
-    printf "    - All new data will be stored in the current folder\n"
-    printf "    - The server will be listening on all network interfaces\n"
-    printf "    - The server will run with default settings which are not always optimal\n"
-    printf "    - Do not judge the quality of a model based on the results from this script\n"
-    printf "    - Do not use this script to benchmark llama.cpp\n"
-    printf "    - Do not use this script in production\n"
-    printf "    - This script is only for demonstration purposes\n"
-    printf "\n"
-    printf "    If you don't know what you are doing, please press Ctrl-C to abort now\n"
-    printf "\n"
-    printf "    Press Enter to continue ...\n\n"
-
-    read
-fi
-
-if [[ -z "$repo" ]]; then
-    printf "[+] No repo provided from the command line\n"
-    printf "    Please select a number from the list below or enter an URL:\n\n"
-
-    is=0
-    for r in "${repos[@]}"; do
-        printf "    %2d) %s\n" $is "$r"
-        is=$((is+1))
-    done
-
-    # ask for repo until index of sample repo is provided or an URL
-    while [[ -z "$repo" ]]; do
-        printf "\n    Or choose one from: https://huggingface.co/models?sort=trending&search=gguf\n\n"
-        read -p "[+] Select repo: " repo
-
-        # check if the input is a number
-        if [[ "$repo" =~ ^[0-9]+$ ]]; then
-            if [[ "$repo" -ge 0 && "$repo" -lt ${#repos[@]} ]]; then
-                repo="${repos[$repo]}"
-            else
-                printf "[-] Invalid repo index: %s\n" "$repo"
-                repo=""
-            fi
-        elif [[ "$repo" =~ ^https?:// ]]; then
-            repo="$repo"
-        else
-            printf "[-] Invalid repo URL: %s\n" "$repo"
-            repo=""
-        fi
-    done
-fi
-
-# remove suffix
-repo=$(echo "$repo" | sed -E 's/\/tree\/main$//g')
-
-printf "[+] Checking for GGUF model files in %s\n" "$repo"
-
-# find GGUF files in the source
-# TODO: better logic
-model_tree="${repo%/}/tree/main"
-model_files=$(curl -s "$model_tree" | grep -i "\\.gguf</span>" | sed -E 's/.*<span class="truncate group-hover:underline">(.*)<\/span><\/a>/\1/g')
-
-# list all files in the provided git repo
-printf "[+] Model files:\n\n"
-for file in $model_files; do
-    # determine iw by grepping the filename with wtypes
-    iw=-1
-    is=0
-    for wt in "${wtypes[@]}"; do
-        # uppercase
-        ufile=$(echo "$file" | tr '[:lower:]' '[:upper:]')
-        if [[ "$ufile" =~ "$wt" ]]; then
-            iw=$is
-            break
-        fi
-        is=$((is+1))
-    done
-
-    if [[ $iw -eq -1 ]]; then
-        continue
-    fi
-
-    wfiles[$iw]="$file"
-
-    have=" "
-    if [[ -f "$file" ]]; then
-        have="*"
-    fi
-
-    printf "    %2d) %s %s\n" $iw "$have" "$file"
-done
-
-wfile="${wfiles[$wtype]}"
-
-# ask for weights type until provided and available
-while [[ -z "$wfile" ]]; do
-    printf "\n"
-    read -p "[+] Select weight type: " wtype
-    wfile="${wfiles[$wtype]}"
-
-    if [[ -z "$wfile" ]]; then
-        printf "[-] Invalid weight type: %s\n" "$wtype"
-        wtype=""
-    fi
-done
-
-printf "[+] Selected weight type: %s (%s)\n" "$wtype" "$wfile"
-
-url="${repo%/}/resolve/main/$wfile"
-
-# check file if the model has been downloaded before
-chk="$wfile.chk"
-
-# check if we should download the file
-# - if $wfile does not exist
-# - if $wfile exists but $chk does not exist
-# - if $wfile exists and $chk exists but $wfile is newer than $chk
-# TODO: better logic using git lfs info
-
-do_download=0
-
-if [[ ! -f "$wfile" ]]; then
-    do_download=1
-elif [[ ! -f "$chk" ]]; then
-    do_download=1
-elif [[ "$wfile" -nt "$chk" ]]; then
-    do_download=1
-fi
-
-if [[ $do_download -eq 1 ]]; then
-    printf "[+] Downloading weights from %s\n" "$url"
-
-    # download the weights file
-    curl -o "$wfile" -# -L "$url"
-
-    # create a check file if successful
-    if [[ $? -eq 0 ]]; then
-        printf "[+] Creating check file %s\n" "$chk"
-        touch "$chk"
-    fi
-else
-    printf "[+] Using cached weights %s\n" "$wfile"
-fi
-
-# get latest llama.cpp and build
-
-printf "[+] Downloading latest llama.cpp\n"
-
-llama_cpp_dir="__llama_cpp_port_${port}__"
-
-if [[ -d "$llama_cpp_dir" && ! -f "$llama_cpp_dir/__ggml_script__" ]]; then
-    # if the dir exists and there isn't a file "__ggml_script__" in it, abort
-    printf "[-] Directory %s already exists\n" "$llama_cpp_dir"
-    printf "[-] Please remove it and try again\n"
-    exit 1
-elif [[ -d "$llama_cpp_dir" ]]; then
-    printf "[+] Directory %s already exists\n" "$llama_cpp_dir"
-    printf "[+] Using cached llama.cpp\n"
-
-    cd "$llama_cpp_dir"
-    git reset --hard
-    git fetch
-    git checkout origin/master
-
-    cd ..
-else
-    printf "[+] Cloning llama.cpp\n"
-
-    git clone https://github.com/ggerganov/llama.cpp "$llama_cpp_dir"
-fi
-
-# mark that that the directory is made by this script
-touch "$llama_cpp_dir/__ggml_script__"
-
-if [[ $verbose -eq 1 ]]; then
-    set -x
-fi
-
-# build
-cd "$llama_cpp_dir"
-
-make clean
-
-log="--silent"
-if [[ $verbose -eq 1 ]]; then
-    log=""
-fi
-
-if [[ "$backend" == "cuda" ]]; then
-    printf "[+] Building with CUDA backend\n"
-    GGML_CUDA=1 make -j llama-server $log
-elif [[ "$backend" == "cpu" ]]; then
-    printf "[+] Building with CPU backend\n"
-    make -j llama-server $log
-elif [[ "$backend" == "metal" ]]; then
-    printf "[+] Building with Metal backend\n"
-    make -j llama-server $log
-else
-    printf "[-] Unknown backend: %s\n" "$backend"
-    exit 1
-fi
-
-# run the server
-
-printf "[+] Running server\n"
-
-args=""
-if [[ "$backend" == "cuda" ]]; then
-    export CUDA_VISIBLE_DEVICES=$gpu_id
-    args="-ngl 999"
-elif [[ "$backend" == "cpu" ]]; then
-    args="-ngl 0"
-elif [[ "$backend" == "metal" ]]; then
-    args="-ngl 999"
-else
-    printf "[-] Unknown backend: %s\n" "$backend"
-    exit 1
-fi
-
-if [[ $verbose -eq 1 ]]; then
-    args="$args --verbose"
-fi
-
-./llama-server -m "../$wfile" --host 0.0.0.0 --port "$port" -c $n_kv -np "$n_parallel" $args
-
-exit 0
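Most of what makes this helper obsolete is its `make`-based build section. A rough, manual CMake equivalent of its final build-and-serve steps, reusing the script's own defaults (port 8888, 8 parallel slots, 4096-token KV cache); the backend flag and the `build/bin` path are assumptions:

```bash
# Sketch: manual CMake equivalent of the deleted helper's build + serve tail
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

cmake -B build -DGGML_CUDA=ON            # omit the flag for a CPU or Metal build
cmake --build build --config Release -j --target llama-server

# serve a previously downloaded GGUF on all network interfaces
./build/bin/llama-server -m ../model.gguf --host 0.0.0.0 --port 8888 -c 4096 -np 8 -ngl 999
```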