Mirror of https://github.com/ggerganov/llama.cpp.git (synced 2024-12-24 10:24:35 +00:00)
docs: fix typos (#7124)
* fix typo
* fix typos
* fix typo
* fix typos
* fix typo
* fix typos
This commit is contained in:
parent 947d3ad27d
commit 04976db7a8
@@ -23,7 +23,7 @@ Install BLIS:
 sudo make install
 ```
 
-We recommend using openmp since it's easier to modify the cores been used.
+We recommend using openmp since it's easier to modify the cores being used.
 
 ### llama.cpp compilation
 
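As a usage sketch of the OpenMP recommendation above (the binary name and model path are placeholders, not part of the diff): with an OpenMP-based build, the number of cores used can normally be limited through the standard `OMP_NUM_THREADS` environment variable.

```sh
# hypothetical invocation: restrict the OpenMP-based BLIS backend to 8 threads
OMP_NUM_THREADS=8 ./main -m ./models/model.gguf -p "Hello"
```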
@@ -96,9 +96,9 @@ NOTE: The dimensions in `ggml` are typically in the reverse order of the `pytorc
 
 This is the funniest part, you have to provide the inference graph implementation of the new model architecture in `llama_build_graph`.
 
-Have a look to existing implementation like `build_llama`, `build_dbrx` or `build_bert`.
+Have a look at existing implementation like `build_llama`, `build_dbrx` or `build_bert`.
 
-When implementing a new graph, please note that the underlying `ggml` backends might not support them all, support of missing backend operations can be added in another PR.
+When implementing a new graph, please note that the underlying `ggml` backends might not support them all, support for missing backend operations can be added in another PR.
 
 Note: to debug the inference graph: you can use [eval-callback](../examples/eval-callback).
 
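For orientation only: the real graph builders live inside llama.cpp and rely on its internal helpers, but a minimal standalone ggml program shows the general pattern a `build_*` function follows: chain tensor operations, then expand the final tensor into a compute graph. All names and sizes below are illustrative, not the actual `llama_build_graph` code.

```c
// Illustrative sketch only -- not llama.cpp's internal build_* helpers.
// It shows the general ggml pattern: create tensors, chain ops, and
// expand the result into a compute graph.
#include "ggml.h"
#include <stdio.h>

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16u * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // toy "weight" and "input" tensors (a real model loads these from GGUF)
    struct ggml_tensor * w   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 8);
    struct ggml_tensor * inp = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 1);

    // graph building: ops are recorded into the context, not executed here
    struct ggml_tensor * cur = ggml_mul_mat(ctx, w, inp);
    cur = ggml_rms_norm(ctx, cur, 1e-5f);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, cur);

    printf("toy graph built\n");

    ggml_free(ctx);
    return 0;
}
```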
@@ -56,7 +56,7 @@ python ./examples/llava/convert-image-encoder-to-gguf.py -m ../clip-vit-large-pa
 python ./convert.py ../llava-v1.5-7b --skip-unknown
 ```
 
-Now both the LLaMA part and the image encoder is in the `llava-v1.5-7b` directory.
+Now both the LLaMA part and the image encoder are in the `llava-v1.5-7b` directory.
 
 ## LLaVA 1.6 gguf conversion
 1) First clone a LLaVA 1.6 model:
@@ -143,7 +143,7 @@ The `--ctx-size` option allows you to set the size of the prompt context used by
 
 ### Extended Context Size
 
-Some fine-tuned models have extended the context length by scaling RoPE. For example, if the original pre-trained model have a context length (max sequence length) of 4096 (4k) and the fine-tuned model have 32k. That is a scaling factor of 8, and should work by setting the above `--ctx-size` to 32768 (32k) and `--rope-scale` to 8.
+Some fine-tuned models have extended the context length by scaling RoPE. For example, if the original pre-trained model has a context length (max sequence length) of 4096 (4k) and the fine-tuned model has 32k. That is a scaling factor of 8, and should work by setting the above `--ctx-size` to 32768 (32k) and `--rope-scale` to 8.
 
 - `--rope-scale N`: Where N is the linear scaling factor used by the fine-tuned model.
 
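Concretely, for the 4k to 32k example above (4096 × 8 = 32768), the two flags are combined in one invocation; the binary name and model path here are placeholders.

```sh
# hypothetical example: run a fine-tuned 32k-context model whose base model had 4k
./main -m ./models/model-32k.gguf --ctx-size 32768 --rope-scale 8 -p "Once upon a time"
```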
@@ -286,7 +286,7 @@ These options help improve the performance and memory usage of the LLaMA models.
 
 - `--numa distribute`: Pin an equal proportion of the threads to the cores on each NUMA node. This will spread the load amongst all cores on the system, utilitizing all memory channels at the expense of potentially requiring memory to travel over the slow links between nodes.
 - `--numa isolate`: Pin all threads to the NUMA node that the program starts on. This limits the number of cores and amount of memory that can be used, but guarantees all memory access remains local to the NUMA node.
-- `--numa numactl`: Pin threads to the CPUMAP that is passed to the program by starting it with the numactl utility. This is the most flexible mode, and allow arbitraty core usage patterns, for example a map that uses all the cores on one NUMA nodes, and just enough cores on a second node to saturate the inter-node memory bus.
+- `--numa numactl`: Pin threads to the CPUMAP that is passed to the program by starting it with the numactl utility. This is the most flexible mode, and allow arbitrary core usage patterns, for example a map that uses all the cores on one NUMA nodes, and just enough cores on a second node to saturate the inter-node memory bus.
 
 These flags attempt optimizations that help on some systems with non-uniform memory access. This currently consists of one of the above strategies, and disabling prefetch and readahead for mmap. The latter causes mapped pages to be faulted in on first access instead of all at once, and in combination with pinning threads to NUMA nodes, more of the pages end up on the NUMA node where they are used. Note that if the model is already in the system page cache, for example because of a previous run without this option, this will have little effect unless you drop the page cache first. This can be done by rebooting the system or on Linux by writing '3' to '/proc/sys/vm/drop_caches' as root.
 
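A sketch of how the last two points fit together (the binary, model path, and CPU list are placeholders that depend on your system): drop the page cache as root, then start the program under `numactl` so `--numa numactl` can pick up the CPU map.

```sh
# drop the page cache first (as described above), otherwise a previously
# cached model keeps its old NUMA placement
echo 3 | sudo tee /proc/sys/vm/drop_caches

# hypothetical: bind to the cores of node 0 plus a few cores of a second node,
# then tell llama.cpp to honor that map
numactl --physcpubind=0-15,32-35 ./main -m ./models/model.gguf --numa numactl
```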
@@ -1,6 +1,6 @@
 # llama.cpp/example/sycl
 
-This example program provide the tools for llama.cpp for SYCL on Intel GPU.
+This example program provides the tools for llama.cpp for SYCL on Intel GPU.
 
 ## Tool
 
@@ -51,7 +51,7 @@ single-line ::= [^\n]+ "\n"`
 
 ## Sequences and Alternatives
 
-The order of symbols in a sequence matter. For example, in `"1. " move " " move "\n"`, the `"1. "` must come before the first `move`, etc.
+The order of symbols in a sequence matters. For example, in `"1. " move " " move "\n"`, the `"1. "` must come before the first `move`, etc.
 
 Alternatives, denoted by `|`, give different sequences that are acceptable. For example, in `move ::= pawn | nonpawn | castle`, `move` can be a `pawn` move, a `nonpawn` move, or a `castle`.
 
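Putting sequences and alternatives together, a toy grammar might read as follows; the `pawn`, `nonpawn`, and `castle` expansions are simplified placeholders, not the chess grammar shipped with llama.cpp.

```
# simplified sketch, not the grammar from the README
root    ::= turn+
turn    ::= "1. " move " " move "\n"   # a sequence: order matters
move    ::= pawn | nonpawn | castle    # alternatives: any one of these
pawn    ::= [a-h] [1-8]                # e.g. "e4"
nonpawn ::= [KQRBN] [a-h] [1-8]        # e.g. "Nf3"
castle  ::= "O-O" | "O-O-O"
```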