From 328ded353bfa5028ba78f5a03bd027e6ee5e0126 Mon Sep 17 00:00:00 2001
From: Georgi Gerganov
Date: Mon, 2 Dec 2024 10:24:54 +0200
Subject: [PATCH] docs : remove obsolete make references, scripts, examples

ggml-ci
---
 docs/backend/BLIS.md                       |   7 -
 docs/build.md                              |   7 -
 examples/base-translate.sh                 |  61 ---
 examples/convert-llama2c-to-ggml/README.md |   5 +-
 examples/imatrix/README.md                 |   2 -
 examples/server/README.md                  |  15 -
 scripts/pod-llama.sh                       | 212 -----------
 scripts/server-llm.sh                      | 418 ---------------------
 8 files changed, 1 insertion(+), 726 deletions(-)
 delete mode 100755 examples/base-translate.sh
 delete mode 100644 scripts/pod-llama.sh
 delete mode 100644 scripts/server-llm.sh

diff --git a/docs/backend/BLIS.md b/docs/backend/BLIS.md
index 35d06bd0f..904548577 100644
--- a/docs/backend/BLIS.md
+++ b/docs/backend/BLIS.md
@@ -27,13 +27,6 @@ We recommend using openmp since it's easier to modify the cores being used.
 
 ### llama.cpp compilation
 
-Makefile:
-
-```bash
-make GGML_BLIS=1 -j
-# make GGML_BLIS=1 llama-benchmark-matmult
-```
-
 CMake:
 
 ```bash
diff --git a/docs/build.md b/docs/build.md
index 6c8b95586..9cf7814da 100644
--- a/docs/build.md
+++ b/docs/build.md
@@ -18,7 +18,6 @@ In order to build llama.cpp you have four different options.
 
     **Notes**:
 
-    - For `Q4_0_4_4` quantization type build, add the `-DGGML_LLAMAFILE=OFF` cmake option. For example, use `cmake -B build -DGGML_LLAMAFILE=OFF`.
     - For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
     - For faster repeated compilation, install [ccache](https://ccache.dev/).
     - For debug builds, there are two cases:
@@ -337,9 +336,3 @@ For detailed info, such as model/device supports, CANN install, please refer to
 ### Android
 
 To read documentation for how to build on Android, [click here](./android.md)
-
-### Arm CPU optimized mulmat kernels
-
-Llama.cpp includes a set of optimized mulmat kernels for the Arm architecture, leveraging Arm® Neon™, int8mm and SVE instructions. These kernels are enabled at build time through the appropriate compiler cpu-type flags, such as `-DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+sve`. Note that these optimized kernels require the model to be quantized into one of the formats: `Q4_0_4_4` (Arm Neon), `Q4_0_4_8` (int8mm) or `Q4_0_8_8` (SVE). The SVE mulmat kernel specifically requires a vector width of 256 bits. When running on devices with a different vector width, it is recommended to use the `Q4_0_4_8` (int8mm) or `Q4_0_4_4` (Arm Neon) formats for better performance. Refer to [examples/quantize/README.md](../examples/quantize/README.md) for more information on the quantization formats.
-
-To support `Q4_0_4_4`, you must build with `GGML_NO_LLAMAFILE=1` (`make`) or `-DGGML_LLAMAFILE=OFF` (`cmake`).
diff --git a/examples/base-translate.sh b/examples/base-translate.sh
deleted file mode 100755
index 103a52f55..000000000
--- a/examples/base-translate.sh
+++ /dev/null
@@ -1,61 +0,0 @@
-#!/bin/bash
-#
-# Few-shot translation example.
-# Requires a base model (i.e. no fine-tuned or instruct models).
-#
-# Usage:
-#
-# cd llama.cpp
-# make -j
-#
-# ./examples/base-translate.sh "" [extra-main-args]
-#
-
-if [ $# -lt 2 ]; then
-    echo "Usage: ./base-translate.sh \"\" [extra-main-args]"
-    exit 1
-fi
-
-eargs=""
-if [ $# -gt 2 ]; then
-    eargs="${@:3}"
-fi
-
-ftmp="__llama.cpp_example_tmp__.txt"
-trap "rm -f $ftmp" EXIT
-
-echo "Translate from English to French:
-
-===
-
-sea otter, peppermint, plush girafe:
-
-sea otter => loutre de mer
-peppermint => menthe poivrée
-plush girafe => girafe peluche
-
-===
-
-violin
-
-violin => violon
-
-===
-
-phone, computer, mouse, keyboard:
-
-phone => téléphone
-computer => ordinateur
-mouse => souris
-keyboard => clavier
-
-===
-" > $ftmp
-
-echo "$2
-" >> $ftmp
-
-model=$1
-
-# generate the most likely continuation until the string "===" is found
-./llama-cli -m $model -f $ftmp -n 64 --temp 0 --repeat-penalty 1.0 --no-penalize-nl -r "===" $eargs
diff --git a/examples/convert-llama2c-to-ggml/README.md b/examples/convert-llama2c-to-ggml/README.md
index 5774ac83c..46a42da69 100644
--- a/examples/convert-llama2c-to-ggml/README.md
+++ b/examples/convert-llama2c-to-ggml/README.md
@@ -2,11 +2,8 @@
 
 This example reads weights from project [llama2.c](https://github.com/karpathy/llama2.c) and saves them in ggml compatible format. The vocab that is available in `models/ggml-vocab.bin` is used by default.
 
-To convert the model first download the models from the [llama2.c](https://github.com/karpathy/llama2.c) repository:
+To convert the model first download the models from the [llama2.c](https://github.com/karpathy/llama2.c) repository.
 
-`$ make -j`
-
-After successful compilation, following usage options are available:
 ```
 usage: ./llama-convert-llama2c-to-ggml [options]
 
diff --git a/examples/imatrix/README.md b/examples/imatrix/README.md
index bb5faec94..9c056986b 100644
--- a/examples/imatrix/README.md
+++ b/examples/imatrix/README.md
@@ -25,8 +25,6 @@ For faster computation, make sure to use GPU offloading via the `-ngl` argument
 ## Example
 
 ```bash
-GGML_CUDA=1 make -j
-
 # generate importance matrix (imatrix.dat)
 ./llama-imatrix -m ggml-model-f16.gguf -f train-data.txt -ngl 99
 
diff --git a/examples/server/README.md b/examples/server/README.md
index 877768c8b..1476bd97b 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -188,12 +188,6 @@ services:
 
 `llama-server` is built alongside everything else from the root of the project
 
-- Using `make`:
-
-  ```bash
-  make llama-server
-  ```
-
 - Using `CMake`:
 
   ```bash
@@ -207,15 +201,6 @@ services:
 
 `llama-server` can also be built with SSL support using OpenSSL 3
 
-- Using `make`:
-
-  ```bash
-  # NOTE: For non-system openssl, use the following:
-  # CXXFLAGS="-I /path/to/openssl/include"
-  # LDFLAGS="-L /path/to/openssl/lib"
-  make LLAMA_SERVER_SSL=true llama-server
-  ```
-
 - Using `CMake`:
 
   ```bash
diff --git a/scripts/pod-llama.sh b/scripts/pod-llama.sh
deleted file mode 100644
index 6e56e1ed0..000000000
--- a/scripts/pod-llama.sh
+++ /dev/null
@@ -1,212 +0,0 @@
-#!/bin/bash
-#
-# Use this script only on fresh pods (runpod.io)!
-# Otherwise, it can break your environment!
-#
-
-if [ -z "$1" ]; then
-    echo "Usage: $0 "
-    echo " 0: no models"
-    echo " 1: tinyllama-1b"
-    echo " 2: codellama-7b"
-    echo " 3: codellama-13b"
-    echo " 4: codellama-34b"
-    echo " 5: codellama-7b-instruct"
-    echo " 6: codellama-13b-instruct"
-    echo " 7: codellama-34b-instruct"
-
-    exit 1
-fi
-
-set -x
-
-# setup deps
-apt-get update
-apt-get install -y git-lfs cmake cmake-curses-gui vim ruby
-git-lfs install
-
-if [ ! -d "/workspace" ]; then
-    ln -sfn $(pwd) /workspace
-fi
-
-# download data
-cd /workspace
-
-# this is useful to git clone repos without doubling the disk size due to .git
-git clone https://github.com/iboB/git-lfs-download
-ln -sfn /workspace/git-lfs-download/git-lfs-download /usr/local/bin/git-lfs-download
-
-# llama.cpp
-cd /workspace
-git clone https://github.com/ggerganov/llama.cpp
-
-cd llama.cpp
-
-GGML_CUDA=1 make -j
-
-ln -sfn /workspace/TinyLlama-1.1B-Chat-v0.3 ./models/tinyllama-1b
-ln -sfn /workspace/CodeLlama-7b-hf ./models/codellama-7b
-ln -sfn /workspace/CodeLlama-13b-hf ./models/codellama-13b
-ln -sfn /workspace/CodeLlama-34b-hf ./models/codellama-34b
-ln -sfn /workspace/CodeLlama-7b-Instruct-hf ./models/codellama-7b-instruct
-ln -sfn /workspace/CodeLlama-13b-Instruct-hf ./models/codellama-13b-instruct
-ln -sfn /workspace/CodeLlama-34b-Instruct-hf ./models/codellama-34b-instruct
-
-pip install -r requirements.txt
-
-# cmake
-cd /workspace/llama.cpp
-
-mkdir build-cublas
-cd build-cublas
-
-cmake -DGGML_CUDA=1 ../
-make -j
-
-if [ "$1" -eq "0" ]; then
-    exit 0
-fi
-
-# more models
-if [ "$1" -eq "1" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.3
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/tinyllama-1b --outfile ./models/tinyllama-1b/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/tinyllama-1b/ggml-model-f16.gguf ./models/tinyllama-1b/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/tinyllama-1b/ggml-model-f16.gguf ./models/tinyllama-1b/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/tinyllama-1b/ggml-model-f16.gguf ./models/tinyllama-1b/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "2" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-7b-hf --without *safetensors*
-    rm -v ./CodeLlama-7b-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-7b --outfile ./models/codellama-7b/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-7b/ggml-model-f16.gguf ./models/codellama-7b/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-7b/ggml-model-f16.gguf ./models/codellama-7b/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-7b/ggml-model-f16.gguf ./models/codellama-7b/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "3" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-13b-hf --without *safetensors*
-    rm -v ./CodeLlama-13b-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-13b --outfile ./models/codellama-13b/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-13b/ggml-model-f16.gguf ./models/codellama-13b/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-13b/ggml-model-f16.gguf ./models/codellama-13b/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-13b/ggml-model-f16.gguf ./models/codellama-13b/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "4" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-34b-hf --without *safetensors*
-    rm -v ./CodeLlama-34b-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-34b --outfile ./models/codellama-34b/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-34b/ggml-model-f16.gguf ./models/codellama-34b/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-34b/ggml-model-f16.gguf ./models/codellama-34b/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-34b/ggml-model-f16.gguf ./models/codellama-34b/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "5" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf --without *safetensors*
-    rm -v ./CodeLlama-7b-Instruct-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-7b-instruct --outfile ./models/codellama-7b-instruct/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-7b-instruct/ggml-model-f16.gguf ./models/codellama-7b-instruct/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-7b-instruct/ggml-model-f16.gguf ./models/codellama-7b-instruct/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-7b-instruct/ggml-model-f16.gguf ./models/codellama-7b-instruct/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "6" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf --without *safetensors*
-    rm -v ./CodeLlama-13b-Instruct-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-13b-instruct --outfile ./models/codellama-13b-instruct/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-13b-instruct/ggml-model-f16.gguf ./models/codellama-13b-instruct/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-13b-instruct/ggml-model-f16.gguf ./models/codellama-13b-instruct/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-13b-instruct/ggml-model-f16.gguf ./models/codellama-13b-instruct/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "7" ]; then
-    cd /workspace
-
-    git-lfs-download https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf --without *safetensors*
-    rm -v ./CodeLlama-34b-Instruct-hf/*safetensors*
-
-    cd /workspace/llama.cpp
-
-    python3 examples/convert_legacy_llama.py ./models/codellama-34b-instruct --outfile ./models/codellama-34b-instruct/ggml-model-f16.gguf --outtype f16
-
-    ./llama-quantize ./models/codellama-34b-instruct/ggml-model-f16.gguf ./models/codellama-34b-instruct/ggml-model-q4_0.gguf q4_0
-    ./llama-quantize ./models/codellama-34b-instruct/ggml-model-f16.gguf ./models/codellama-34b-instruct/ggml-model-q4_k.gguf q4_k
-    ./llama-quantize ./models/codellama-34b-instruct/ggml-model-f16.gguf ./models/codellama-34b-instruct/ggml-model-q8_0.gguf q8_0
-fi
-
-if [ "$1" -eq "1" ]; then
-    # perf + perplexity
-    cd /workspace/llama.cpp/build-cublas
-
-    make -j && ../scripts/run-all-perf.sh tinyllama-1b "f16" "-ngl 99 -t 1 -p 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64,128,256,512,1024,2048 -n 128"
-
-    ../scripts/get-wikitext-2.sh
-    unzip wikitext-2-raw-v1.zip
-
-    make -j && ./bin/llama-perplexity -m ../models/tinyllama-1b/ggml-model-f16.gguf -f ./wikitext-2-raw/wiki.test.raw -ngl 100 --chunks 32
-
-    # batched
-    cd /workspace/llama.cpp
-
-    GGML_CUDA=1 make -j && ./llama-batched ./models/tinyllama-1b/ggml-model-f16.gguf "Hello, my name is" 8 128 999
-
-    # batched-bench
-    cd /workspace/llama.cpp
-
-    GGML_CUDA=1 make -j && ./llama-batched-bench ./models/tinyllama-1b/ggml-model-f16.gguf 4608 1 99 0 512 128 1,2,3,4,5,6,7,8,16,32
-
-    # parallel
-    cd /workspace/llama.cpp
-
-    GGML_CUDA=1 make -j && ./llama-parallel -m ./models/tinyllama-1b/ggml-model-f16.gguf -t 1 -ngl 100 -c 4096 -b 512 -s 1 -np 8 -ns 128 -n 100 -cb
-
-fi
-
-# speculative
-#if [ "$1" -eq "7" ]; then
-# cd /workspace/llama.cpp
-#
-# GGML_CUDA=1 make -j && ./llama-speculative -m ./models/codellama-34b-instruct/ggml-model-f16.gguf -md ./models/codellama-7b-instruct/ggml-model-q4_0.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -ngl 999 -ngld 999 -t 4 -n 512 -c 4096 -s 21 --draft 16 -np 1 --temp 0.0
-#fi
-
-# more benches
-#GGML_CUDA=1 make -j && ./llama-batched-bench ./models/codellama-7b/ggml-model-q4_k.gguf 4096 1 99 1 512,3200 128,128,800 1
-#GGML_CUDA=1 make -j && ./llama-batched-bench ./models/codellama-13b/ggml-model-q4_k.gguf 4096 1 99 1 512,3200 128,128,800 1
diff --git a/scripts/server-llm.sh b/scripts/server-llm.sh
deleted file mode 100644
index 802592a3e..000000000
--- a/scripts/server-llm.sh
+++ /dev/null
@@ -1,418 +0,0 @@
-#!/bin/bash
-#
-# Helper script for deploying llama.cpp server with a single Bash command
-#
-# - Works on Linux and macOS
-# - Supports: CPU, CUDA, Metal
-# - Can run all GGUF models from HuggingFace
-# - Can serve requests in parallel
-# - Always builds latest llama.cpp from GitHub
-#
-# Limitations
-#
-# - Chat templates are poorly supported (base models recommended)
-# - Might be unstable!
-#
-# Usage:
-# ./server-llm.sh [--port] [--repo] [--wtype] [--backend] [--gpu-id] [--n-parallel] [--n-kv] [--verbose] [-non-interactive]
-#
-# --port: port number, default is 8888
-# --repo: path to a repo containing GGUF model files
-# --wtype: weights type (f16, q8_0, q4_0, q4_1), default is user-input
-# --backend: cpu, cuda, metal, depends on the OS
-# --gpu-id: gpu id, default is 0
-# --n-parallel: number of parallel requests, default is 8
-# --n-kv: KV cache size, default is 4096
-# --verbose: verbose output
-# --non-interactive: run without asking a permission to run
-#
-# Example:
-#
-# bash -c "$(curl -s https://ggml.ai/server-llm.sh)"
-#
-
-set -e
-
-# required utils: curl, git, make
-if ! command -v curl &> /dev/null; then
-    printf "[-] curl not found\n"
-    exit 1
-fi
-if ! command -v git &> /dev/null; then
-    printf "[-] git not found\n"
-    exit 1
-fi
-if ! command -v make &> /dev/null; then
-    printf "[-] make not found\n"
-    exit 1
-fi
-
-# parse arguments
-is_interactive=1
-port=8888
-repo=""
-wtype=""
-backend="cpu"
-
-# if macOS, use metal backend by default
-if [[ "$OSTYPE" == "darwin"* ]]; then
-    backend="metal"
-elif command -v nvcc &> /dev/null; then
-    backend="cuda"
-fi
-
-gpu_id=0
-n_parallel=8
-n_kv=4096
-verbose=0
-
-function print_usage {
-    printf "Usage:\n"
-    printf " ./server-llm.sh [--port] [--repo] [--wtype] [--backend] [--gpu-id] [--n-parallel] [--n-kv] [--verbose] [-non-interactive]\n\n"
-    printf " --port: port number, default is 8888\n"
-    printf " --repo: path to a repo containing GGUF model files\n"
-    printf " --wtype: weights type (f16, q8_0, q4_0, q4_1), default is user-input\n"
-    printf " --backend: cpu, cuda, metal, depends on the OS\n"
-    printf " --gpu-id: gpu id, default is 0\n"
-    printf " --n-parallel: number of parallel requests, default is 8\n"
-    printf " --n-kv: KV cache size, default is 4096\n"
-    printf " --verbose: verbose output\n\n"
-    printf " --non-interactive: run without asking a permission to run\n"
-    printf "Example:\n\n"
-    printf ' bash -c "$(curl -s https://ggml.ai/server-llm.sh)"\n\n'
-}
-
-while [[ $# -gt 0 ]]; do
-    key="$1"
-    case $key in
-        --non-interactive)
-            is_interactive=0
-            shift
-            ;;
-        --port)
-            port="$2"
-            shift
-            shift
-            ;;
-        --repo)
-            repo="$2"
-            shift
-            shift
-            ;;
-        --wtype)
-            wtype="$2"
-            shift
-            shift
-            ;;
-        --backend)
-            backend="$2"
-            shift
-            shift
-            ;;
-        --gpu-id)
-            gpu_id="$2"
-            shift
-            shift
-            ;;
-        --n-parallel)
-            n_parallel="$2"
-            shift
-            shift
-            ;;
-        --n-kv)
-            n_kv="$2"
-            shift
-            shift
-            ;;
-        --verbose)
-            verbose=1
-            shift
-            ;;
-        --help)
-            print_usage
-            exit 0
-            ;;
-        *)
-            echo "Unknown argument: $key"
-            print_usage
-            exit 1
-            ;;
-    esac
-done
-
-# available weights types
-wtypes=("F16" "Q8_0" "Q4_0" "Q4_1" "Q5_0" "Q5_1" "Q6_K" "Q5_K_M" "Q5_K_S" "Q4_K_M" "Q4_K_S" "Q3_K_L" "Q3_K_M" "Q3_K_S" "Q2_K")
-
-wfiles=()
-for wt in "${wtypes[@]}"; do
-    wfiles+=("")
-done
-
-# map wtype input to index
-if [[ ! -z "$wtype" ]]; then
-    iw=-1
-    is=0
-    for wt in "${wtypes[@]}"; do
-        # uppercase
-        uwt=$(echo "$wt" | tr '[:lower:]' '[:upper:]')
-        if [[ "$uwt" == "$wtype" ]]; then
-            iw=$is
-            break
-        fi
-        is=$((is+1))
-    done
-
-    if [[ $iw -eq -1 ]]; then
-        printf "[-] Invalid weight type: %s\n" "$wtype"
-        exit 1
-    fi
-
-    wtype="$iw"
-fi
-
-# sample repos
-repos=(
-    "https://huggingface.co/TheBloke/Llama-2-7B-GGUF"
-    "https://huggingface.co/TheBloke/Llama-2-13B-GGUF"
-    "https://huggingface.co/TheBloke/Llama-2-70B-GGUF"
-    "https://huggingface.co/TheBloke/CodeLlama-7B-GGUF"
-    "https://huggingface.co/TheBloke/CodeLlama-13B-GGUF"
-    "https://huggingface.co/TheBloke/CodeLlama-34B-GGUF"
-    "https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF"
-    "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF"
-    "https://huggingface.co/TheBloke/OpenHermes-2-Mistral-7B-GGUF"
-    "https://huggingface.co/TheBloke/CausalLM-7B-GGUF"
-)
-if [ $is_interactive -eq 1 ]; then
-    printf "\n"
-    printf "[I] This is a helper script for deploying llama.cpp's server on this machine.\n\n"
-    printf " Based on the options that follow, the script might download a model file\n"
-    printf " from the internet, which can be a few GBs in size. The script will also\n"
-    printf " build the latest llama.cpp source code from GitHub, which can be unstable.\n"
-    printf "\n"
-    printf " Upon success, an HTTP server will be started and it will serve the selected\n"
-    printf " model using llama.cpp for demonstration purposes.\n"
-    printf "\n"
-    printf " Please note:\n"
-    printf "\n"
-    printf " - All new data will be stored in the current folder\n"
-    printf " - The server will be listening on all network interfaces\n"
-    printf " - The server will run with default settings which are not always optimal\n"
-    printf " - Do not judge the quality of a model based on the results from this script\n"
-    printf " - Do not use this script to benchmark llama.cpp\n"
-    printf " - Do not use this script in production\n"
-    printf " - This script is only for demonstration purposes\n"
-    printf "\n"
-    printf " If you don't know what you are doing, please press Ctrl-C to abort now\n"
-    printf "\n"
-    printf " Press Enter to continue ...\n\n"
-
-    read
-fi
-
-if [[ -z "$repo" ]]; then
-    printf "[+] No repo provided from the command line\n"
-    printf " Please select a number from the list below or enter an URL:\n\n"
-
-    is=0
-    for r in "${repos[@]}"; do
-        printf " %2d) %s\n" $is "$r"
-        is=$((is+1))
-    done
-
-    # ask for repo until index of sample repo is provided or an URL
-    while [[ -z "$repo" ]]; do
-        printf "\n Or choose one from: https://huggingface.co/models?sort=trending&search=gguf\n\n"
-        read -p "[+] Select repo: " repo
-
-        # check if the input is a number
-        if [[ "$repo" =~ ^[0-9]+$ ]]; then
-            if [[ "$repo" -ge 0 && "$repo" -lt ${#repos[@]} ]]; then
-                repo="${repos[$repo]}"
-            else
-                printf "[-] Invalid repo index: %s\n" "$repo"
-                repo=""
-            fi
-        elif [[ "$repo" =~ ^https?:// ]]; then
-            repo="$repo"
-        else
-            printf "[-] Invalid repo URL: %s\n" "$repo"
-            repo=""
-        fi
-    done
-fi
-
-# remove suffix
-repo=$(echo "$repo" | sed -E 's/\/tree\/main$//g')
-
-printf "[+] Checking for GGUF model files in %s\n" "$repo"
-
-# find GGUF files in the source
-# TODO: better logic
-model_tree="${repo%/}/tree/main"
-model_files=$(curl -s "$model_tree" | grep -i "\\.gguf" | sed -E 's/.*(.*)<\/span><\/a>/\1/g')
-
-# list all files in the provided git repo
-printf "[+] Model files:\n\n"
-for file in $model_files; do
-    # determine iw by grepping the filename with wtypes
-    iw=-1
-    is=0
-    for wt in "${wtypes[@]}"; do
-        # uppercase
-        ufile=$(echo "$file" | tr '[:lower:]' '[:upper:]')
-        if [[ "$ufile" =~ "$wt" ]]; then
-            iw=$is
-            break
-        fi
-        is=$((is+1))
-    done
-
-    if [[ $iw -eq -1 ]]; then
-        continue
-    fi
-
-    wfiles[$iw]="$file"
-
-    have=" "
-    if [[ -f "$file" ]]; then
-        have="*"
-    fi
-
-    printf " %2d) %s %s\n" $iw "$have" "$file"
-done
-
-wfile="${wfiles[$wtype]}"
-
-# ask for weights type until provided and available
-while [[ -z "$wfile" ]]; do
-    printf "\n"
-    read -p "[+] Select weight type: " wtype
-    wfile="${wfiles[$wtype]}"
-
-    if [[ -z "$wfile" ]]; then
-        printf "[-] Invalid weight type: %s\n" "$wtype"
-        wtype=""
-    fi
-done
-
-printf "[+] Selected weight type: %s (%s)\n" "$wtype" "$wfile"
-
-url="${repo%/}/resolve/main/$wfile"
-
-# check file if the model has been downloaded before
-chk="$wfile.chk"
-
-# check if we should download the file
-# - if $wfile does not exist
-# - if $wfile exists but $chk does not exist
-# - if $wfile exists and $chk exists but $wfile is newer than $chk
-# TODO: better logic using git lfs info
-
-do_download=0
-
-if [[ ! -f "$wfile" ]]; then
-    do_download=1
-elif [[ ! -f "$chk" ]]; then
-    do_download=1
-elif [[ "$wfile" -nt "$chk" ]]; then
-    do_download=1
-fi
-
-if [[ $do_download -eq 1 ]]; then
-    printf "[+] Downloading weights from %s\n" "$url"
-
-    # download the weights file
-    curl -o "$wfile" -# -L "$url"
-
-    # create a check file if successful
-    if [[ $? -eq 0 ]]; then
-        printf "[+] Creating check file %s\n" "$chk"
-        touch "$chk"
-    fi
-else
-    printf "[+] Using cached weights %s\n" "$wfile"
-fi
-
-# get latest llama.cpp and build
-
-printf "[+] Downloading latest llama.cpp\n"
-
-llama_cpp_dir="__llama_cpp_port_${port}__"
-
-if [[ -d "$llama_cpp_dir" && ! -f "$llama_cpp_dir/__ggml_script__" ]]; then
-    # if the dir exists and there isn't a file "__ggml_script__" in it, abort
-    printf "[-] Directory %s already exists\n" "$llama_cpp_dir"
-    printf "[-] Please remove it and try again\n"
-    exit 1
-elif [[ -d "$llama_cpp_dir" ]]; then
-    printf "[+] Directory %s already exists\n" "$llama_cpp_dir"
-    printf "[+] Using cached llama.cpp\n"
-
-    cd "$llama_cpp_dir"
-    git reset --hard
-    git fetch
-    git checkout origin/master
-
-    cd ..
-else
-    printf "[+] Cloning llama.cpp\n"
-
-    git clone https://github.com/ggerganov/llama.cpp "$llama_cpp_dir"
-fi
-
-# mark that that the directory is made by this script
-touch "$llama_cpp_dir/__ggml_script__"
-
-if [[ $verbose -eq 1 ]]; then
-    set -x
-fi
-
-# build
-cd "$llama_cpp_dir"
-
-make clean
-
-log="--silent"
-if [[ $verbose -eq 1 ]]; then
-    log=""
-fi
-
-if [[ "$backend" == "cuda" ]]; then
-    printf "[+] Building with CUDA backend\n"
-    GGML_CUDA=1 make -j llama-server $log
-elif [[ "$backend" == "cpu" ]]; then
-    printf "[+] Building with CPU backend\n"
-    make -j llama-server $log
-elif [[ "$backend" == "metal" ]]; then
-    printf "[+] Building with Metal backend\n"
-    make -j llama-server $log
-else
-    printf "[-] Unknown backend: %s\n" "$backend"
-    exit 1
-fi
-
-# run the server
-
-printf "[+] Running server\n"
-
-args=""
-if [[ "$backend" == "cuda" ]]; then
-    export CUDA_VISIBLE_DEVICES=$gpu_id
-    args="-ngl 999"
-elif [[ "$backend" == "cpu" ]]; then
-    args="-ngl 0"
-elif [[ "$backend" == "metal" ]]; then
-    args="-ngl 999"
-else
-    printf "[-] Unknown backend: %s\n" "$backend"
-    exit 1
-fi
-
-if [[ $verbose -eq 1 ]]; then
-    args="$args --verbose"
-fi
-
-./llama-server -m "../$wfile" --host 0.0.0.0 --port "$port" -c $n_kv -np "$n_parallel" $args
-
-exit 0