From 3ad1e3f1a10c1f66b4f1cd7510e0977fadbc0dfd Mon Sep 17 00:00:00 2001
From: coezbek
Date: Tue, 17 Oct 2023 18:51:02 +0200
Subject: [PATCH 1/7] server : documentation of JSON return value of /completion endpoint (#3632)

* Added documentation of JSON return value of /completion endpoint

* Update examples/server/README.md

---------

Co-authored-by: Georgi Gerganov
---
 examples/server/README.md | 42 +++++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/examples/server/README.md b/examples/server/README.md
index 8a079ae26..9737010d3 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -106,25 +106,25 @@ node index.js

 ## API Endpoints

-- **POST** `/completion`: Given a prompt, it returns the predicted completion.
+- **POST** `/completion`: Given a `prompt`, it returns the predicted completion.

     *Options:*

+    `prompt`: Provide the prompt for this completion as a string or as an array of strings or numbers representing tokens. Internally, the prompt is compared to the previous completion and only the "unseen" suffix is evaluated. If the prompt is a string or an array with the first element given as a string, a `bos` token is inserted in the front like `main` does.
+
     `temperature`: Adjust the randomness of the generated text (default: 0.8).

     `top_k`: Limit the next token selection to the K most probable tokens (default: 40).

     `top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.95).

-    `n_predict`: Set the number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. (default: -1, -1 = infinity).
+    `n_predict`: Set the maximum number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. (default: -1, -1 = infinity).

-    `n_keep`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context.
-    By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.
+    `n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded.
+    By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the prompt.

     `stream`: It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to `true`.

-    `prompt`: Provide a prompt as a string, or as an array of strings and numbers representing tokens. Internally, the prompt is compared, and it detects if a part has already been evaluated, and the remaining part will be evaluate. If the prompt is a string, or an array with the first element given as a string, a space is inserted in the front like main.cpp does.
-
     `stop`: Specify a JSON array of stopping strings.
     These words will not be included in the completion, so make sure to add them to the prompt for the next iteration (default: []).

@@ -158,6 +158,36 @@ node index.js

     `n_probs`: If greater than 0, the response also contains the probabilities of top N tokens for each generated token (default: 0)

+    *Result JSON:*
+
+    Note: When using streaming mode (`stream`) only `content` and `stop` will be returned until end of completion.
+
+    `content`: Completion result as a string (excluding `stopping_word` if any). In case of streaming mode, will contain the next token as a string.
+
+    `stop`: Boolean for use with `stream` to check whether the generation has stopped (Note: This is not related to stopping words array `stop` from input options)
+
+    `generation_settings`: The provided options above excluding `prompt` but including `n_ctx`, `model`
+
+    `model`: The path to the model loaded with `-m`
+
+    `prompt`: The provided `prompt`
+
+    `stopped_eos`: Indicating whether the completion has stopped because it encountered the EOS token
+
+    `stopped_limit`: Indicating whether the completion stopped because `n_predict` tokens were generated before stop words or EOS was encountered
+
+    `stopped_word`: Indicating whether the completion stopped due to encountering a stopping word from `stop` JSON array provided
+
+    `stopping_word`: The stopping word encountered which stopped the generation (or "" if not stopped due to a stopping word)
+
+    `timings`: Hash of timing information about the completion such as the number of tokens `predicted_per_second`
+
+    `tokens_cached`: Number of tokens from the prompt which could be re-used from previous completion (`n_past`)
+
+    `tokens_evaluated`: Number of tokens evaluated in total from the prompt
+
+    `truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens predicted`) exceeded the context size (`n_ctx`)
+
 - **POST** `/tokenize`: Tokenize a given text.

     *Options:*

From e74c705e15cd228ad696c4a3cdea6d6fb4ff434c Mon Sep 17 00:00:00 2001
From: Georgi Gerganov
Date: Tue, 17 Oct 2023 19:52:53 +0300
Subject: [PATCH 2/7] editorconfig : remove trailing spaces

---
 examples/server/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/examples/server/README.md b/examples/server/README.md
index 9737010d3..9f0ace3d7 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -164,7 +164,7 @@ node index.js

     `content`: Completion result as a string (excluding `stopping_word` if any). In case of streaming mode, will contain the next token as a string.

-    `stop`: Boolean for use with `stream` to check whether the generation has stopped (Note: This is not related to stopping words array `stop` from input options)
+    `stop`: Boolean for use with `stream` to check whether the generation has stopped (Note: This is not related to stopping words array `stop` from input options)

     `generation_settings`: The provided options above excluding `prompt` but including `n_ctx`, `model`

@@ -186,7 +186,7 @@ node index.js

     `tokens_evaluated`: Number of tokens evaluated in total from the prompt

-    `truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens predicted`) exceeded the context size (`n_ctx`)
+    `truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens predicted`) exceeded the context size (`n_ctx`)

 - **POST** `/tokenize`: Tokenize a given text.
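
Editorial sketch for patches 1 and 2: the documented result fields can be collapsed into a single stop reason on the client side. This is a minimal C++ illustration under stated assumptions, not code from the server. `CompletionResult` and `stop_reason` are hypothetical names invented here, and the struct is assumed to have been filled from the response JSON by whatever parser the client already uses. Only the field names are taken from the README text.

```cpp
#include <string>

// Hypothetical client-side mirror of a subset of the documented result JSON.
// Only the field names come from the README; the struct itself is an assumption.
struct CompletionResult {
    std::string content;         // completion text, excluding any stopping word
    bool stopped_eos   = false;  // generation ended on the EOS token
    bool stopped_limit = false;  // n_predict tokens were generated first
    bool stopped_word  = false;  // a string from the `stop` input array was hit
    std::string stopping_word;   // "" unless stopped_word is true
    bool truncated     = false;  // context size (n_ctx) was exceeded
};

// Hypothetical helper: collapse the documented flags into one label.
inline std::string stop_reason(const CompletionResult & r) {
    if (r.stopped_eos)   return "eos";
    if (r.stopped_word)  return "word:" + r.stopping_word;
    if (r.stopped_limit) return "limit";
    return "unknown";  // e.g. an intermediate chunk in streaming mode
}
```

In streaming mode the README states that only `content` and `stop` are present until the end of the completion, so a client would presumably call a helper like this only on the final chunk.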
From a5e8c1d8c71f01d98ae2ec63a57c118664f9764d Mon Sep 17 00:00:00 2001
From: slaren
Date: Tue, 17 Oct 2023 19:00:58 +0200
Subject: [PATCH 3/7] train-text-from-scratch : fix assert failure in ggml-alloc (#3618)

---
 .../train-text-from-scratch.cpp | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/examples/train-text-from-scratch/train-text-from-scratch.cpp b/examples/train-text-from-scratch/train-text-from-scratch.cpp
index be693b3ac..1ce6cef29 100644
--- a/examples/train-text-from-scratch/train-text-from-scratch.cpp
+++ b/examples/train-text-from-scratch/train-text-from-scratch.cpp
@@ -253,13 +253,14 @@ static void init_model(struct my_llama_model * model) {
     set_param_model(model);

     // measure data size
-    struct ggml_allocr * alloc = NULL;
-    alloc = ggml_allocr_new_measure(tensor_alignment);
-    alloc_model(alloc, model);
+    size_t size = 0;
+    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
+        size += GGML_PAD(ggml_nbytes(t), tensor_alignment);
+    }

     // allocate data
-    model->data.resize(ggml_allocr_max_size(alloc) + tensor_alignment);
-    ggml_allocr_free(alloc);
+    struct ggml_allocr * alloc = NULL;
+    model->data.resize(size + tensor_alignment);
     alloc = ggml_allocr_new(model->data.data(), model->data.size(), tensor_alignment);
     alloc_model(alloc, model);
     ggml_allocr_free(alloc);
@@ -1094,11 +1095,9 @@ int main(int argc, char ** argv) {
     struct ggml_tensor * target_probs = ggml_new_tensor_3d(ctx_input, GGML_TYPE_F32, n_vocab, n_tokens, n_batch);

     // measure required memory for input tensors
-    alloc = ggml_allocr_new_measure(tensor_alignment);
-    ggml_allocr_alloc(alloc, tokens_input);
-    ggml_allocr_alloc(alloc, target_probs);
-    size_t max_input_size = ggml_allocr_max_size(alloc) + tensor_alignment;
-    ggml_allocr_free(alloc);
+    size_t max_input_size = GGML_PAD(ggml_nbytes(tokens_input), tensor_alignment) +
+                            GGML_PAD(ggml_nbytes(target_probs), tensor_alignment) +
+                            tensor_alignment;
     printf("%s: input_size = %zu bytes (%.1f MB)\n", __func__, max_input_size, (float) max_input_size / (1024.0f*1024.0f));

     // allocate input tensors

From 40e5ce054f4c4fa555e4510ea5f760bb29185332 Mon Sep 17 00:00:00 2001
From: shibe2
Date: Wed, 11 Oct 2023 21:30:06 +0400
Subject: [PATCH 4/7] CLBlast: Fix temporary buffer size for f16 conversion (wsize)

Fix buffer overflow.
Reduce the size to fit just one 2D slice.
Assert sufficient size.
---
 ggml-opencl.cpp | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/ggml-opencl.cpp b/ggml-opencl.cpp
index 33d0691eb..22fd0e3a7 100644
--- a/ggml-opencl.cpp
+++ b/ggml-opencl.cpp
@@ -1568,7 +1568,7 @@ static void ggml_cl_mul_mat_f32(const ggml_tensor * src0, const ggml_tensor * sr
     ggml_cl_pool_free(d_D, d_size);
 }

-static void ggml_cl_mul_mat_f16(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, void * wdata, size_t /* wsize */) {
+static void ggml_cl_mul_mat_f16(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, void * wdata, size_t wsize) {
     GGML_ASSERT(fp16_support);

     const int64_t ne00 = src0->ne[0];
@@ -1598,6 +1598,10 @@ static void ggml_cl_mul_mat_f16(const ggml_tensor * src0, const ggml_tensor * sr
     const int y_ne = ne11 * ne10;
     const int d_ne = ne11 * ne01;

+    GGML_ASSERT(wsize >= sizeof(ggml_fp16_t) * y_ne);
+    GGML_ASSERT(wsize >= sizeof(ggml_fp16_t) * d_ne);
+    ggml_fp16_t * const tmp = (ggml_fp16_t *) wdata;
+
     size_t x_size;
     size_t y_size;
     size_t d_size;
@@ -1634,7 +1638,6 @@ static void ggml_cl_mul_mat_f16(const ggml_tensor * src0, const ggml_tensor * sr

                 // convert src1 to fp16
                 // TODO: use multiple threads
-                ggml_fp16_t * const tmp = (ggml_fp16_t *) wdata + (ne11 * ne10) * (i13 * ne12 + i12);
                 char * src1i = (char *) src1->data + i13*nb13 + i12*nb12;
                 if (src1_cont_rows) {
                     if (src1_cont_cols) {
@@ -1897,8 +1900,8 @@ void ggml_cl_mul_mat(const struct ggml_tensor * src0, const struct ggml_tensor *
 }

 size_t ggml_cl_mul_mat_get_wsize(const struct ggml_tensor * src0, const struct ggml_tensor * src1, struct ggml_tensor * dst) {
-    if (ggml_cl_mul_mat_use_f16(src0, src1, dst)) {
-        return ggml_nelements(src1) * sizeof(ggml_fp16_t);
+    if (src0->type == GGML_TYPE_F16 && ggml_cl_mul_mat_use_f16(src0, src1, dst)) {
+        return sizeof(ggml_fp16_t) * std::max(src1->ne[0] * src1->ne[1], dst->ne[0] * dst->ne[1]);
     }
     return 0;
 }

From 8402566a7c436bfbde8e7b0461faee50298106a0 Mon Sep 17 00:00:00 2001
From: BarfingLemurs <128182951+BarfingLemurs@users.noreply.github.com>
Date: Tue, 17 Oct 2023 14:13:21 -0400
Subject: [PATCH 5/7] readme : update hot-topics & models, detail windows release in usage (#3615)

* Update README.md

* Update README.md

* Update README.md

* move "Running on Windows" section below "Prepare data and run"

---------

Co-authored-by: Georgi Gerganov
---
 README.md | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 56372865b..4fd4bd427 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++

 ### Hot topics

-
+- ‼️ BPE tokenizer update: existing Falcon and Starcoder `.gguf` models will need to be reconverted: [#3252](https://github.com/ggerganov/llama.cpp/pull/3252)
 - ‼️ Breaking change: `rope_freq_base` and `rope_freq_scale` must be set to zero to use the model default values: [#3401](https://github.com/ggerganov/llama.cpp/pull/3401)
 - Parallel decoding + continuous batching support added: [#3228](https://github.com/ggerganov/llama.cpp/pull/3228) \
   **Devs should become familiar with the new API**
@@ -89,16 +89,17 @@ as the main playground for developing new features for the [ggml](https://github
 - [X] [Vicuna](https://github.com/ggerganov/llama.cpp/discussions/643#discussioncomment-5533894)
 - [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
 - [X] [OpenBuddy 🐶 (Multilingual)](https://github.com/OpenBuddy/OpenBuddy)
-- [X] [Pygmalion 7B / Metharme 7B](#using-pygmalion-7b--metharme-7b)
+- [X] [Pygmalion/Metharme](#using-pygmalion-7b--metharme-7b)
 - [X] [WizardLM](https://github.com/nlpxucan/WizardLM)
-- [X] [Baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B) and its derivations (such as [baichuan-7b-sft](https://huggingface.co/hiyouga/baichuan-7b-sft))
-- [X] [Aquila-7B](https://huggingface.co/BAAI/Aquila-7B) / [AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B)
-- [X] [Aquila2-7B](https://huggingface.co/BAAI/Aquila2-7B) / [AquilaChat2-7B](https://huggingface.co/BAAI/AquilaChat2-7B) / [AquilaChat2-34B](https://huggingface.co/BAAI/AquilaChat2-34B) / [Aquila2-34B](https://huggingface.co/BAAI/Aquila2-34B)
+- [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
+- [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
 - [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
 - [X] [Mistral AI v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
 - [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
-- [X] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
+- [X] [Persimmon 8B](https://github.com/ggerganov/llama.cpp/pull/3410)
 - [X] [MPT](https://github.com/ggerganov/llama.cpp/pull/3417)
+- [X] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
+

 **Bindings:**

@@ -207,7 +208,7 @@ https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8

 ## Usage

-Here are the steps for the LLaMA-7B model.
+Here are the end-to-end binary build and model conversion steps for the LLaMA-7B model.

 ### Get the Code

@@ -574,6 +575,18 @@ python3 convert.py models/7B/

 When running the larger models, make sure you have enough disk space to store all the intermediate files.

+### Running on Windows with prebuilt binaries
+
+You will find prebuilt Windows binaries on the release page.
+
+Simply download and extract the latest zip package of choice: (e.g. `llama-b1380-bin-win-avx2-x64.zip`)
+
+From the unzipped folder, open a terminal/cmd window here and place a pre-converted `.gguf` model file. Test out the main example like so:
+
+```
+.\main -m llama-2-7b.Q4_0.gguf -n 128
+```
+
 ### Memory/Disk Requirements

 As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

From e1675d133c31e1c8de2f06be7164e12c0ba6cf2c Mon Sep 17 00:00:00 2001
From: Georgi Gerganov
Date: Tue, 17 Oct 2023 22:34:26 +0300
Subject: [PATCH 6/7] llama : avoid fprintf in favor of LLAMA_LOG (#3538)

---
 examples/main/main.cpp | 2 +-
 llama.cpp              | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index a5fb65548..7313d06a0 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -799,7 +799,7 @@ int main(int argc, char ** argv) {
                     }

                     const auto line_pfx = ::llama_tokenize(ctx, params.input_prefix, false, true);
-                    const auto line_inp = ::llama_tokenize(ctx, buffer, false, false);
+                    const auto line_inp = ::llama_tokenize(ctx, buffer,              false, false);
                     const auto line_sfx = ::llama_tokenize(ctx, params.input_suffix, false, true);
                     LOG("input tokens: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, line_inp));

diff --git a/llama.cpp b/llama.cpp
index 82b7638ae..37df88779 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -2327,13 +2327,13 @@ static void llm_load_vocab(
         }

         if (special_tokens_definition_mismatch || special_tokens_count_from_verification != special_tokens_count_by_type) {
-            fprintf(stderr, "%s: warning: Mismatch in special tokens definition ( %u/%zu vs %u/%zu ).\n",
+            LLAMA_LOG_WARN("%s: mismatch in special tokens definition ( %u/%zu vs %u/%zu ).\n",
                 __func__,
                 special_tokens_count_from_verification, vocab.id_to_token.size(),
                 special_tokens_count_by_type, vocab.id_to_token.size()
             );
         } else {
-            fprintf(stderr, "%s: Special tokens definition check successful ( %u/%zu ).\n",
+            LLAMA_LOG_INFO("%s: special tokens definition check successful ( %u/%zu ).\n",
                 __func__,
                 special_tokens_count_from_verification, vocab.id_to_token.size()
             );

From cb33f43a2a9f5a5a5f8d290dd97c625d9ba97a2f Mon Sep 17 00:00:00 2001
From: slaren
Date: Tue, 17 Oct 2023 22:24:50 +0200
Subject: [PATCH 7/7] fix embeddings when using CUDA (#3657)

---
 llama.cpp | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/llama.cpp b/llama.cpp
index 37df88779..04a779e04 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -5903,6 +5903,13 @@ static int llama_decode_internal(

     ggml_allocr_alloc_graph(lctx.alloc, gf);

+    struct ggml_tensor * res = gf->nodes[gf->n_nodes - 1];
+    struct ggml_tensor * embeddings = gf->nodes[gf->n_nodes - 2];
+
+    GGML_ASSERT(strcmp(res->name, "result_output") == 0);
+    GGML_ASSERT(strcmp(embeddings->name, "result_norm") == 0);
+
+
 #ifdef GGML_USE_CUBLAS
     for (int i = 0; i < gf->n_leafs; i++) {
         ggml_tensor * node = gf->leafs[i];
@@ -5920,6 +5927,12 @@ static int llama_decode_internal(
     }

     ggml_cuda_set_mul_mat_q(cparams.mul_mat_q);
+
+    // HACK: ggml-alloc may change the tensor backend when reusing a parent, so force output to be on the CPU here if needed
+    if (!lctx.embedding.empty()) {
+        embeddings->backend = GGML_BACKEND_CPU;
+    }
+    res->backend = GGML_BACKEND_CPU;
 #endif

     // LLAMA_LOG_INFO("graph build time: %.3f ms (%d nodes, %d leafs)\n", (ggml_time_us() - t_start_us)/1000.0, gf->n_nodes, gf->n_leafs);
@@ -5944,12 +5957,6 @@ static int llama_decode_internal(
         n_threads = 1;
     }

-    struct ggml_tensor * res = gf->nodes[gf->n_nodes - 1];
-    struct ggml_tensor * embeddings = gf->nodes[gf->n_nodes - 2];
-
-    GGML_ASSERT(strcmp(res->name, "result_output") == 0);
-    GGML_ASSERT(strcmp(embeddings->name, "result_norm") == 0);
-
 #if GGML_USE_MPI
     const int64_t n_layer = hparams.n_layer;
     ggml_mpi_graph_compute_pre(lctx.ctx_mpi, gf, n_layer);
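
Editorial sketch for patch 3: the measurement change there replaces a dry run of the allocator with a direct sum of padded tensor sizes. As a rough illustration only, the arithmetic can be written in standalone C++. `pad_to` and `padded_total` are hypothetical stand-ins invented for this example (the patch itself uses `GGML_PAD` and `ggml_nbytes`), and the alignment is assumed to be a power of two.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the rounding the patch performs with GGML_PAD:
// round x up to the next multiple of n. Assumes n is a power of two.
inline std::size_t pad_to(std::size_t x, std::size_t n) {
    return (x + n - 1) & ~(n - 1);
}

// Sum the padded byte sizes of all tensors, then add one extra alignment of
// slack, mirroring the shape of `size + tensor_alignment` in the patch.
inline std::size_t padded_total(const std::vector<std::size_t> & tensor_bytes,
                                std::size_t alignment) {
    std::size_t size = 0;
    for (std::size_t nbytes : tensor_bytes) {
        size += pad_to(nbytes, alignment);
    }
    return size + alignment;
}
```

With a 32-byte alignment, tensors of 100, 32 and 1 bytes pad to 128, 32 and 32 bytes, so `padded_total({100, 32, 1}, 32)` yields 224 once the trailing slack is added.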