diff --git a/docs/development/HOWTO-add-model.md b/docs/development/HOWTO-add-model.md
index 2712b66c1..04c5ccbbe 100644
--- a/docs/development/HOWTO-add-model.md
+++ b/docs/development/HOWTO-add-model.md
@@ -9,15 +9,15 @@ Adding a model requires few steps:
 After following these steps, you can open PR.
 
 Also, it is important to check that the examples and main ggml backends (CUDA, METAL, CPU) are working with the new architecture, especially:
-- [main](../examples/main)
-- [imatrix](../examples/imatrix)
-- [quantize](../examples/quantize)
-- [server](../examples/server)
+- [main](/examples/main/)
+- [imatrix](/examples/imatrix/)
+- [quantize](/examples/quantize/)
+- [server](/examples/server/)
 
 ### 1. Convert the model to GGUF
 
 This step is done in python with a `convert` script using the [gguf](https://pypi.org/project/gguf/) library.
-Depending on the model architecture, you can use either [convert_hf_to_gguf.py](../convert_hf_to_gguf.py) or [examples/convert_legacy_llama.py](../examples/convert_legacy_llama.py) (for `llama/llama2` models in `.pth` format).
+Depending on the model architecture, you can use either [convert_hf_to_gguf.py](/convert_hf_to_gguf.py) or [examples/convert_legacy_llama.py](/examples/convert_legacy_llama.py) (for `llama/llama2` models in `.pth` format).
 
 The convert script reads the model configuration, tokenizer, tensor names+data and converts them to GGUF metadata and tensors.
 
@@ -31,7 +31,7 @@ class MyModel(Model):
     model_arch = gguf.MODEL_ARCH.GROK
 ```
 
-2. Define the layout of the GGUF tensors in [constants.py](../gguf-py/gguf/constants.py)
+2. Define the layout of the GGUF tensors in [constants.py](/gguf-py/gguf/constants.py)
 
 Add an enum entry in `MODEL_ARCH`, the model human friendly name in `MODEL_ARCH_NAMES` and the GGUF tensor names in `MODEL_TENSORS`.
 
@@ -54,7 +54,7 @@ Example for `falcon` model:
 
 As a general rule, before adding a new tensor name to GGUF, be sure the equivalent naming does not already exist.
 
-Once you have found the GGUF tensor name equivalent, add it to the [tensor_mapping.py](../gguf-py/gguf/tensor_mapping.py) file.
+Once you have found the GGUF tensor name equivalent, add it to the [tensor_mapping.py](/gguf-py/gguf/tensor_mapping.py) file.
 
 If the tensor name is part of a repetitive layer/block, the key word `bid` substitutes it.
 
@@ -100,7 +100,7 @@ Have a look at existing implementation like `build_llama`, `build_dbrx` or `buil
 
 When implementing a new graph, please note that the underlying `ggml` backends might not support them all, support for missing backend operations can be added in another PR.
 
-Note: to debug the inference graph: you can use [llama-eval-callback](../examples/eval-callback).
+Note: to debug the inference graph: you can use [llama-eval-callback](/examples/eval-callback/).
 
 ## GGUF specification
 
diff --git a/docs/development/token_generation_performance_tips.md b/docs/development/token_generation_performance_tips.md
index c0840cad5..41b7232c9 100644
--- a/docs/development/token_generation_performance_tips.md
+++ b/docs/development/token_generation_performance_tips.md
@@ -1,7 +1,7 @@
 # Token generation performance troubleshooting
 
 ## Verifying that the model is running on the GPU with CUDA
-Make sure you compiled llama with the correct env variables according to [this guide](../README.md#CUDA), so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running llama, you may configure `N` to be very large, and llama will offload the maximum possible number of layers to the GPU, even if it's less than the number you configured. For example:
+Make sure you compiled llama with the correct env variables according to [this guide](/docs/build.md#cuda), so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running llama, you may configure `N` to be very large, and llama will offload the maximum possible number of layers to the GPU, even if it's less than the number you configured. For example:
 ```shell
 ./llama-cli -m "path/to/model.gguf" -ngl 200000 -p "Please sir, may I have some "
 ```
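
For readers skimming this diff, the `MyModel` lines visible in the second hunk's context come from the conversion-class example in HOWTO-add-model.md. Below is a minimal sketch of that registration pattern, assuming the `Model.register` decorator provided by `convert_hf_to_gguf.py`; `MyModelForCausalLM` is a placeholder architecture name, and `gguf.MODEL_ARCH.GROK` is kept only because it is the value the doc's own snippet uses.

```python
# Minimal sketch of the conversion-class registration referenced in the hunk
# context above. This code is assumed to live inside convert_hf_to_gguf.py,
# which defines the `Model` base class; "MyModelForCausalLM" is a placeholder
# architecture string, not a real one.
import gguf


@Model.register("MyModelForCausalLM")
class MyModel(Model):
    # A real port would point this at the new enum entry added to
    # gguf-py/gguf/constants.py; GROK is reused here from the doc's example.
    model_arch = gguf.MODEL_ARCH.GROK
```

The remaining steps the doc describes (constants.py, tensor_mapping.py, the `build_*` graph function) are untouched by this PR, which only rewrites relative links as root-relative ones.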