From 532dd74e38c29e16ea1cfc4e7eedb4f2fab3f3cd Mon Sep 17 00:00:00 2001 From: Richard Kiss Date: Sat, 11 Nov 2023 22:04:58 -0800 Subject: [PATCH] Fix some documentation typos/grammar mistakes (#4032) * typos * Update examples/parallel/README.md Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com> --------- Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com> --- README.md | 2 +- docs/token_generation_performance_tips.md | 2 +- examples/main/README.md | 2 +- examples/parallel/README.md | 2 +- grammars/README.md | 4 ++-- 5 files changed, 6 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 9c9e36ad0..af39e8c0e 100644 --- a/README.md +++ b/README.md @@ -424,7 +424,7 @@ Building the program with BLAS support may lead to some performance improvements ``` The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used. - If your GPU is not officialy supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 or 11.0.0 on RDNA3. + If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 or 11.0.0 on RDNA3. The following compilation options are also available to tweak performance (yes, they refer to CUDA, not HIP, because it uses the same code as the cuBLAS version above): | Option | Legal values | Default | Description | diff --git a/docs/token_generation_performance_tips.md b/docs/token_generation_performance_tips.md index c9acff7d4..d7e863dff 100644 --- a/docs/token_generation_performance_tips.md +++ b/docs/token_generation_performance_tips.md @@ -17,7 +17,7 @@ llama_model_load_internal: [cublas] total VRAM used: 17223 MB If you see these lines, then the GPU is being used. ## Verifying that the CPU is not oversaturated -llama accepts a `-t N` (or `--threads N`) parameter. It's extremely important that this parameter is not too large. If your token generation is extremely slow, try setting this number to 1. If this significantly improves your token generation speed, then your CPU is being oversaturated and you need to explicitly set this parameter to the number of the physicial CPU cores on your machine (even if you utilize a GPU). If in doubt, start with 1 and double the amount until you hit a performance bottleneck, then scale the number down. +llama accepts a `-t N` (or `--threads N`) parameter. It's extremely important that this parameter is not too large. If your token generation is extremely slow, try setting this number to 1. If this significantly improves your token generation speed, then your CPU is being oversaturated and you need to explicitly set this parameter to the number of the physical CPU cores on your machine (even if you utilize a GPU). If in doubt, start with 1 and double the amount until you hit a performance bottleneck, then scale the number down. # Example of runtime flags effect on inference speed benchmark These runs were tested on the following machine: diff --git a/examples/main/README.md b/examples/main/README.md index a3428b487..c7997f665 100644 --- a/examples/main/README.md +++ b/examples/main/README.md @@ -142,7 +142,7 @@ The `--ctx-size` option allows you to set the size of the prompt context used by ### Extended Context Size -Some fine-tuned models have extened the context length by scaling RoPE. For example, if the original pretrained model have a context length (max sequence length) of 4096 (4k) and the fine-tuned model have 32k. That is a scaling factor of 8, and should work by setting the above `--ctx-size` to 32768 (32k) and `--rope-scale` to 8. +Some fine-tuned models have extended the context length by scaling RoPE. For example, if the original pre-trained model have a context length (max sequence length) of 4096 (4k) and the fine-tuned model have 32k. That is a scaling factor of 8, and should work by setting the above `--ctx-size` to 32768 (32k) and `--rope-scale` to 8. - `--rope-scale N`: Where N is the linear scaling factor used by the fine-tuned model. diff --git a/examples/parallel/README.md b/examples/parallel/README.md index 4d0fe5cef..df0456733 100644 --- a/examples/parallel/README.md +++ b/examples/parallel/README.md @@ -1,3 +1,3 @@ # llama.cpp/example/parallel -Simplified simluation for serving incoming requests in parallel +Simplified simulation of serving incoming requests in parallel diff --git a/grammars/README.md b/grammars/README.md index 7f3b11ca5..e1383fa5c 100644 --- a/grammars/README.md +++ b/grammars/README.md @@ -55,7 +55,7 @@ The order of symbols in a sequence matter. For example, in `"1. " move " " move Alternatives, denoted by `|`, give different sequences that are acceptable. For example, in `move ::= pawn | nonpawn | castle`, `move` can be a `pawn` move, a `nonpawn` move, or a `castle`. -Parentheses `()` can be used to group sequences, which allows for embedding alternatives in a larger rule or applying repetition and optptional symbols (below) to a sequence. +Parentheses `()` can be used to group sequences, which allows for embedding alternatives in a larger rule or applying repetition and optional symbols (below) to a sequence. ## Repetition and Optional Symbols @@ -67,7 +67,7 @@ Parentheses `()` can be used to group sequences, which allows for embedding alte Comments can be specified with `#`: ``` -# defines optional whitspace +# defines optional whitespace ws ::= [ \t\n]+ ```