Mirror of https://github.com/ggerganov/llama.cpp.git (synced 2024-12-25 02:44:36 +00:00)
docs: fix links in development docs [no ci] (#8481)
Fixes a few links to within the repo that were broken in the reorganization of the documentation in #8325.
parent 16bdfa42ac
commit fc690b018e
@@ -9,15 +9,15 @@ Adding a model requires few steps:
 After following these steps, you can open PR.
 
 Also, it is important to check that the examples and main ggml backends (CUDA, METAL, CPU) are working with the new architecture, especially:
-- [main](../examples/main)
-- [imatrix](../examples/imatrix)
-- [quantize](../examples/quantize)
-- [server](../examples/server)
+- [main](/examples/main/)
+- [imatrix](/examples/imatrix/)
+- [quantize](/examples/quantize/)
+- [server](/examples/server/)
 
 ### 1. Convert the model to GGUF
 
 This step is done in python with a `convert` script using the [gguf](https://pypi.org/project/gguf/) library.
-Depending on the model architecture, you can use either [convert_hf_to_gguf.py](../convert_hf_to_gguf.py) or [examples/convert_legacy_llama.py](../examples/convert_legacy_llama.py) (for `llama/llama2` models in `.pth` format).
+Depending on the model architecture, you can use either [convert_hf_to_gguf.py](/convert_hf_to_gguf.py) or [examples/convert_legacy_llama.py](/examples/convert_legacy_llama.py) (for `llama/llama2` models in `.pth` format).
 
 The convert script reads the model configuration, tokenizer, tensor names+data and converts them to GGUF metadata and tensors.
 
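Reviewer note on the context above: the "configuration to GGUF metadata" step can be pictured with a small stdlib-only sketch. It is not the code in `convert_hf_to_gguf.py`; the mapping table, the function name `config_to_gguf_metadata`, and the example values are assumptions chosen for illustration, and the real script handles many more fields plus the tokenizer and tensors.

```python
# Hypothetical mapping from Hugging Face config.json keys to GGUF metadata
# key templates. Illustration only: the real convert scripts cover far more.
HF_TO_GGUF_KEYS = {
    "hidden_size": "{arch}.embedding_length",
    "num_hidden_layers": "{arch}.block_count",
    "num_attention_heads": "{arch}.attention.head_count",
}


def config_to_gguf_metadata(arch: str, hf_config: dict) -> dict:
    """Translate the known config fields into GGUF-style key/value pairs."""
    metadata = {"general.architecture": arch}
    for hf_key, gguf_template in HF_TO_GGUF_KEYS.items():
        if hf_key in hf_config:
            # Fill the architecture prefix into the key template.
            metadata[gguf_template.format(arch=arch)] = hf_config[hf_key]
    return metadata


# Made-up example configuration, not taken from any real checkpoint.
example_config = {"hidden_size": 4096, "num_hidden_layers": 32, "num_attention_heads": 32}
metadata = config_to_gguf_metadata("llama", example_config)
```

Fields missing from the input config are skipped rather than defaulted, which keeps the sketch from inventing values.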
@@ -31,7 +31,7 @@ class MyModel(Model):
     model_arch = gguf.MODEL_ARCH.GROK
 ```
 
-2. Define the layout of the GGUF tensors in [constants.py](../gguf-py/gguf/constants.py)
+2. Define the layout of the GGUF tensors in [constants.py](/gguf-py/gguf/constants.py)
 
 Add an enum entry in `MODEL_ARCH`, the model human friendly name in `MODEL_ARCH_NAMES` and the GGUF tensor names in `MODEL_TENSORS`.
 
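Reviewer note on the context above: the three registrations it names (`MODEL_ARCH`, `MODEL_ARCH_NAMES`, `MODEL_TENSORS`) fit together roughly as in this standalone sketch. It is a stand-in, not the contents of `constants.py`; the `MYMODEL` entry and the short tensor list are hypothetical.

```python
from enum import IntEnum, auto


class MODEL_ARCH(IntEnum):
    LLAMA = auto()
    MYMODEL = auto()  # hypothetical new architecture being added


class MODEL_TENSOR(IntEnum):
    TOKEN_EMBD = auto()
    OUTPUT_NORM = auto()
    ATTN_NORM = auto()


# Human-friendly name for each architecture.
MODEL_ARCH_NAMES = {
    MODEL_ARCH.LLAMA: "llama",
    MODEL_ARCH.MYMODEL: "mymodel",
}

# Tensors each architecture is expected to provide (shortened, illustrative).
MODEL_TENSORS = {
    MODEL_ARCH.MYMODEL: [
        MODEL_TENSOR.TOKEN_EMBD,
        MODEL_TENSOR.OUTPUT_NORM,
        MODEL_TENSOR.ATTN_NORM,
    ],
}
```

Keying both dictionaries by the enum member means a new architecture that is missing from either table fails with a `KeyError` at lookup time rather than silently.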
@@ -54,7 +54,7 @@ Example for `falcon` model:
 
 As a general rule, before adding a new tensor name to GGUF, be sure the equivalent naming does not already exist.
 
-Once you have found the GGUF tensor name equivalent, add it to the [tensor_mapping.py](../gguf-py/gguf/tensor_mapping.py) file.
+Once you have found the GGUF tensor name equivalent, add it to the [tensor_mapping.py](/gguf-py/gguf/tensor_mapping.py) file.
 
 If the tensor name is part of a repetitive layer/block, the key word `bid` substitutes it.
 
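Reviewer note on the context above: the `bid` substitution can be sketched as plain string templating, where one template per repeated block is expanded once per block index. The mapping table and the helper `expand_block_mappings` below are hypothetical and are not the API of `tensor_mapping.py`.

```python
# Hypothetical table: GGUF tensor name template -> source checkpoint name
# templates. "{bid}" stands for the index of the repeated layer/block.
BLOCK_MAPPINGS = {
    "blk.{bid}.attn_norm": (
        "transformer.h.{bid}.ln_1",
        "model.layers.{bid}.input_layernorm",
    ),
}


def expand_block_mappings(n_blocks: int) -> dict:
    """Resolve every template for each block index into a flat lookup."""
    resolved = {}
    for gguf_template, source_templates in BLOCK_MAPPINGS.items():
        for bid in range(n_blocks):
            gguf_name = gguf_template.format(bid=bid)
            for source_template in source_templates:
                resolved[source_template.format(bid=bid)] = gguf_name
    return resolved


lookup = expand_block_mappings(n_blocks=2)
```

With this shape, adding support for another checkpoint naming scheme is a one-line change to the tuple of source templates.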
@@ -100,7 +100,7 @@ Have a look at existing implementation like `build_llama`, `build_dbrx` or `buil
 
 When implementing a new graph, please note that the underlying `ggml` backends might not support them all, support for missing backend operations can be added in another PR.
 
-Note: to debug the inference graph: you can use [llama-eval-callback](../examples/eval-callback).
+Note: to debug the inference graph: you can use [llama-eval-callback](/examples/eval-callback/).
 
 ## GGUF specification
 
@@ -1,7 +1,7 @@
 # Token generation performance troubleshooting
 
 ## Verifying that the model is running on the GPU with CUDA
-Make sure you compiled llama with the correct env variables according to [this guide](../README.md#CUDA), so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running llama, you may configure `N` to be very large, and llama will offload the maximum possible number of layers to the GPU, even if it's less than the number you configured. For example:
+Make sure you compiled llama with the correct env variables according to [this guide](/docs/build.md#cuda), so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running llama, you may configure `N` to be very large, and llama will offload the maximum possible number of layers to the GPU, even if it's less than the number you configured. For example:
 ```shell
 ./llama-cli -m "path/to/model.gguf" -ngl 200000 -p "Please sir, may I have some "
 ```
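Reviewer note on the context above: the behaviour described for an oversized `-ngl N` amounts to clamping the request to what the model can actually offload. The helper below is a model of that description only, not code from llama.cpp; the function name and the `offloadable` parameter are assumptions, and the real offloadable count depends on the model and the build.

```python
def layers_offloaded(requested: int, offloadable: int) -> int:
    """Return how many layers end up on the GPU for a given -ngl request.

    `offloadable` is a hypothetical stand-in for the number of layers the
    loaded model exposes for offloading.
    """
    # A request larger than the model is capped; a negative one is treated as 0.
    return max(0, min(requested, offloadable))


# Mirrors the command in the diff: -ngl 200000 against a made-up 33-layer model.
effective = layers_offloaded(requested=200000, offloadable=33)
```

This is why passing a very large `N` is a convenient way to ask for "as many layers as possible" without knowing the model's layer count.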