llama.cpp/examples/embedding/embedding.cpp

#include "common.h"
#include "llama.h"

#include <ctime>

#if defined(_MSC_VER)
#pragma warning(disable: 4244 4267) // possible loss of data
#endif

int main(int argc, char ** argv) {
    gpt_params params;

    if (!gpt_params_parse(argc, argv, params)) {
        return 1;
    }

    params.embedding = true;

    print_build_info();

    if (params.seed == LLAMA_DEFAULT_SEED) {
        params.seed = time(NULL);
    }

    fprintf(stderr, "%s: seed  = %u\n", __func__, params.seed);

    std::mt19937 rng(params.seed);
    if (params.random_prompt) {
        params.prompt = gpt_random_prompt(rng);
    }

    llama_backend_init(params.numa);

    llama_model * model;
    llama_context * ctx;

    // load the model
    std::tie(model, ctx) = llama_init_from_gpt_params(params);
    if (model == NULL) {
        fprintf(stderr, "%s: error: unable to load model\n", __func__);
        return 1;
    }

    const int n_ctx_train = llama_n_ctx_train(model);
    const int n_ctx = llama_n_ctx(ctx);

    if (n_ctx > n_ctx_train) {
        fprintf(stderr, "%s: warning: model was trained on only %d context tokens (%d specified)\n",
                __func__, n_ctx_train, n_ctx);
    }

    // print system information
    {
        fprintf(stderr, "\n");
        fprintf(stderr, "%s\n", get_system_info(params).c_str());
    }

    int n_past = 0;

    // tokenize the prompt
    auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);

    if (params.verbose_prompt) {
        fprintf(stderr, "\n");
        fprintf(stderr, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());
        fprintf(stderr, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
        for (int i = 0; i < (int) embd_inp.size(); i++) {
            fprintf(stderr, "%6d -> '%s'\n", embd_inp[i], llama_token_to_piece(ctx, embd_inp[i]).c_str());
        }
        fprintf(stderr, "\n");
    }

    if (embd_inp.size() > (size_t)n_ctx) {
        fprintf(stderr, "%s: error: prompt is longer than the context window (%zu tokens, n_ctx = %d)\n",
                __func__, embd_inp.size(), n_ctx);
        return 1;
    }

    while (!embd_inp.empty()) {
        int n_tokens = std::min(params.n_batch, (int) embd_inp.size());
        if (llama_decode(ctx, llama_batch_get_one(embd_inp.data(), n_tokens, n_past, 0))) {
            fprintf(stderr, "%s : failed to eval\n", __func__);
            return 1;
        }
        n_past += n_tokens;
        embd_inp.erase(embd_inp.begin(), embd_inp.begin() + n_tokens);
    }

    const int n_embd = llama_n_embd(model);
    auto * embeddings = llama_get_embeddings(ctx);

    // l2-normalize embeddings
    float norm = 0;
    for (int i = 0; i < n_embd; i++) {
        norm += embeddings[i] * embeddings[i];
    }
    norm = sqrt(norm);
    for (int i = 0; i < n_embd; i++) {
        embeddings[i] /= norm;
    }

    for (int i = 0; i < n_embd; i++) {
        printf("%f ", embeddings[i]);
    }
    printf("\n");

    llama_print_timings(ctx);
    llama_free(ctx);
    llama_free_model(model);

    llama_backend_free();

    return 0;
}
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00			`#include "common.h"`
			`#include "llama.h"`

examples: add missing <ctime> include for time() (#1011) 2023-04-16 10:13:00 +00:00			`#include <ctime>`

build : fix and ignore MSVC warnings (#1889) 2023-06-16 18:23:53 +00:00			`#if defined(_MSC_VER)`
			`#pragma warning(disable: 4244 4267) // possible loss of data`
			`#endif`

Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00			`int main(int argc, char ** argv) {`
			`gpt_params params;`

fix some warnings from gcc and clang-tidy (#3038) Co-authored-by: xaedes <xaedes@gmail.com> 2023-09-07 17:22:29 +00:00			`if (!gpt_params_parse(argc, argv, params)) {`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00			`return 1;`
			`}`

			`params.embedding = true;`

examples : add compiler version and target to build info (#2998) 2023-09-15 20:59:49 +00:00			`print_build_info();`
Add git-based build information for better issue tracking (#1232) * Add git-based build information for better issue tracking * macOS fix * "build (hash)" and "CMAKE_SOURCE_DIR" changes * Redo "CMAKE_CURRENT_SOURCE_DIR" and clearer build messages * Fix conditional dependency on missing target * Broke out build-info.cmake, added find_package fallback, and added build into to all examples, added dependencies to Makefile * 4 space indenting for cmake, attempt to clean up my mess in Makefile * Short hash, less fancy Makefile, and don't modify build-info.h if it wouldn't change it 2023-05-01 16:23:47 +00:00
Use unsigned for random seed (#2006) * Use unsigned for random seed. Keep -1 as the value to use a time based seed. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2023-06-29 13:15:15 +00:00			`if (params.seed == LLAMA_DEFAULT_SEED) {`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00			`params.seed = time(NULL);`
			`}`

Use unsigned for random seed (#2006) * Use unsigned for random seed. Keep -1 as the value to use a time based seed. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2023-06-29 13:15:15 +00:00			`fprintf(stderr, "%s: seed = %u\n", __func__, params.seed);`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00
			`std::mt19937 rng(params.seed);`
			`if (params.random_prompt) {`
			`params.prompt = gpt_random_prompt(rng);`
			`}`

mpi : add support for distributed inference via MPI (#2099) * MPI support, first cut * fix warnings, update README * fixes * wrap includes * PR comments * Update CMakeLists.txt * Add GH workflow, fix test * Add info to README * mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099) * mpi : add names for layer inputs + prep ggml_mpi_graph_compute() * mpi : move all MPI logic into ggml-mpi Not tested yet * mpi : various fixes - communication now works but results are wrong * mpi : fix output tensor after MPI compute (still not working) * mpi : fix inference * mpi : minor * Add OpenMPI to GH action * [mpi] continue-on-error: true * mpi : fix after master merge * [mpi] Link MPI C++ libraries to fix OpenMPI * tests : fix new llama_backend API * [mpi] use MPI_INT32_T * mpi : factor out recv / send in functions and reuse * mpi : extend API to allow usage with outer backends (e.g. Metal) --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2023-07-10 15:49:56 +00:00			`llama_backend_init(params.numa);`
llama : add llama_init_backend() API (close #1527) 2023-05-20 08:06:11 +00:00
llama : make model stateless and context stateful (llama_state) (#1797) * llama : make model stateless and context stateful * llama : minor cleanup * llama : update internal API declaration * Apply suggestions from code review fix style Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Missing model memory release * Fix style * Add deprecated warning for public API function llama_init_from_file * Update public API use cases: move away from deprecated llama_init_from_file * Deprecate public API function llama_apply_lora_from_file --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2023-06-24 08:47:58 +00:00			`llama_model * model;`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00			`llama_context * ctx;`

			`// load the model`
llama : make model stateless and context stateful (llama_state) (#1797) * llama : make model stateless and context stateful * llama : minor cleanup * llama : update internal API declaration * Apply suggestions from code review fix style Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Missing model memory release * Fix style * Add deprecated warning for public API function llama_init_from_file * Update public API use cases: move away from deprecated llama_init_from_file * Deprecate public API function llama_apply_lora_from_file --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2023-06-24 08:47:58 +00:00			`std::tie(model, ctx) = llama_init_from_gpt_params(params);`
			`if (model == NULL) {`
examples : add llama_init_from_gpt_params() common function (#1290) Signed-off-by: deadprogram <ron@hybridgroup.com> 2023-05-02 20:39:51 +00:00			`fprintf(stderr, "%s: error: unable to load model\n", __func__);`
			`return 1;`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00			`}`

llama.cpp : split llama_context_params into model and context params (#3301) * llama.cpp : split llama_context_params into model and context params ggml-ci * fix metal build * fix freq_base/scale default to model value * llama-bench : keep the same model between tests when possible * move n_threads to llama_context_params, add n_threads_batch * fix mpi build * remove kv_size(), cuda scratch fixes * remove low-vram option * add n_threads_batch to system info, refactor to get_system_info() * add documentation about --threads-batch to the READMEs * llama-bench fix * main : fix rope freq/scale warning * llama.cpp : add llama_get_model common : add llama_tokenize from model * remove duplicated ctx/model functions ggml-ci * cuda : print total VRAM used 2023-09-28 19:42:38 +00:00			`const int n_ctx_train = llama_n_ctx_train(model);`
			`const int n_ctx = llama_n_ctx(ctx);`

			`if (n_ctx > n_ctx_train) {`
examples : make n_ctx warning work again (#3066) This was broken by commit e36ecdcc ("build : on Mac OS enable Metal by default (#2901)"). 2023-09-08 15:43:35 +00:00			`fprintf(stderr, "%s: warning: model was trained on only %d context tokens (%d specified)\n",`
llama.cpp : split llama_context_params into model and context params (#3301) * llama.cpp : split llama_context_params into model and context params ggml-ci * fix metal build * fix freq_base/scale default to model value * llama-bench : keep the same model between tests when possible * move n_threads to llama_context_params, add n_threads_batch * fix mpi build * remove kv_size(), cuda scratch fixes * remove low-vram option * add n_threads_batch to system info, refactor to get_system_info() * add documentation about --threads-batch to the READMEs * llama-bench fix * main : fix rope freq/scale warning * llama.cpp : add llama_get_model common : add llama_tokenize from model * remove duplicated ctx/model functions ggml-ci * cuda : print total VRAM used 2023-09-28 19:42:38 +00:00			`__func__, n_ctx_train, n_ctx);`
examples : make n_ctx warning work again (#3066) This was broken by commit e36ecdcc ("build : on Mac OS enable Metal by default (#2901)"). 2023-09-08 15:43:35 +00:00			`}`

Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00			`// print system information`
			`{`
			`fprintf(stderr, "\n");`
llama.cpp : split llama_context_params into model and context params (#3301) * llama.cpp : split llama_context_params into model and context params ggml-ci * fix metal build * fix freq_base/scale default to model value * llama-bench : keep the same model between tests when possible * move n_threads to llama_context_params, add n_threads_batch * fix mpi build * remove kv_size(), cuda scratch fixes * remove low-vram option * add n_threads_batch to system info, refactor to get_system_info() * add documentation about --threads-batch to the READMEs * llama-bench fix * main : fix rope freq/scale warning * llama.cpp : add llama_get_model common : add llama_tokenize from model * remove duplicated ctx/model functions ggml-ci * cuda : print total VRAM used 2023-09-28 19:42:38 +00:00			`fprintf(stderr, "%s\n", get_system_info(params).c_str());`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00			`}`

			`int n_past = 0;`

			`// tokenize the prompt`
			`auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);`

			`if (params.verbose_prompt) {`
			`fprintf(stderr, "\n");`
			`fprintf(stderr, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());`
			`fprintf(stderr, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());`
			`for (int i = 0; i < (int) embd_inp.size(); i++) {`
llama : more tokenizer fixes (#2810) * tests : write a Python tokenizer test (wip) * llama : prefix input text for tokenization with whitespace * llama : distinguish pieces from decoded text + fix detokenization * common : add comments * examples : no longer manually add leading space when tokenizing * tests : use Python to generate tokenizer tests for C++ * tests : add option to tokenize text files ggml-ci * tests : add test-tokenizer-1.py * llama.cpp : fix LF token * hellaswag : move the concat space for clarity * tests : add falcon tests (py + cpp, currently do not pass Unicode) ggml-ci * common : temporary separate llama_detokenize calls for SPM and BPE --------- Co-authored-by: klosax <131523366+klosax@users.noreply.github.com> 2023-08-27 11:19:19 +00:00			`fprintf(stderr, "%6d -> '%s'\n", embd_inp[i], llama_token_to_piece(ctx, embd_inp[i]).c_str());`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00			`}`
			`fprintf(stderr, "\n");`
			`}`

llama.cpp : split llama_context_params into model and context params (#3301) * llama.cpp : split llama_context_params into model and context params ggml-ci * fix metal build * fix freq_base/scale default to model value * llama-bench : keep the same model between tests when possible * move n_threads to llama_context_params, add n_threads_batch * fix mpi build * remove kv_size(), cuda scratch fixes * remove low-vram option * add n_threads_batch to system info, refactor to get_system_info() * add documentation about --threads-batch to the READMEs * llama-bench fix * main : fix rope freq/scale warning * llama.cpp : add llama_get_model common : add llama_tokenize from model * remove duplicated ctx/model functions ggml-ci * cuda : print total VRAM used 2023-09-28 19:42:38 +00:00			`if (embd_inp.size() > (size_t)n_ctx) {`
embedding : evaluate prompt in batches (#2713) 2023-08-22 14:03:12 +00:00			`fprintf(stderr, "%s: error: prompt is longer than the context window (%zu tokens, n_ctx = %d)\n",`
llama.cpp : split llama_context_params into model and context params (#3301) * llama.cpp : split llama_context_params into model and context params ggml-ci * fix metal build * fix freq_base/scale default to model value * llama-bench : keep the same model between tests when possible * move n_threads to llama_context_params, add n_threads_batch * fix mpi build * remove kv_size(), cuda scratch fixes * remove low-vram option * add n_threads_batch to system info, refactor to get_system_info() * add documentation about --threads-batch to the READMEs * llama-bench fix * main : fix rope freq/scale warning * llama.cpp : add llama_get_model common : add llama_tokenize from model * remove duplicated ctx/model functions ggml-ci * cuda : print total VRAM used 2023-09-28 19:42:38 +00:00			`__func__, embd_inp.size(), n_ctx);`
embedding : evaluate prompt in batches (#2713) 2023-08-22 14:03:12 +00:00			`return 1;`
			`}`

			`while (!embd_inp.empty()) {`
			`int n_tokens = std::min(params.n_batch, (int) embd_inp.size());`
llama.cpp : split llama_context_params into model and context params (#3301) * llama.cpp : split llama_context_params into model and context params ggml-ci * fix metal build * fix freq_base/scale default to model value * llama-bench : keep the same model between tests when possible * move n_threads to llama_context_params, add n_threads_batch * fix mpi build * remove kv_size(), cuda scratch fixes * remove low-vram option * add n_threads_batch to system info, refactor to get_system_info() * add documentation about --threads-batch to the READMEs * llama-bench fix * main : fix rope freq/scale warning * llama.cpp : add llama_get_model common : add llama_tokenize from model * remove duplicated ctx/model functions ggml-ci * cuda : print total VRAM used 2023-09-28 19:42:38 +00:00			`if (llama_decode(ctx, llama_batch_get_one(embd_inp.data(), n_tokens, n_past, 0))) {`
embedding : evaluate prompt in batches (#2713) 2023-08-22 14:03:12 +00:00			`fprintf(stderr, "%s : failed to eval\n", __func__);`
			`return 1;`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00			`}`
embedding : evaluate prompt in batches (#2713) 2023-08-22 14:03:12 +00:00			`n_past += n_tokens;`
			`embd_inp.erase(embd_inp.begin(), embd_inp.begin() + n_tokens);`
			`}`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00
llama.cpp : split llama_context_params into model and context params (#3301) * llama.cpp : split llama_context_params into model and context params ggml-ci * fix metal build * fix freq_base/scale default to model value * llama-bench : keep the same model between tests when possible * move n_threads to llama_context_params, add n_threads_batch * fix mpi build * remove kv_size(), cuda scratch fixes * remove low-vram option * add n_threads_batch to system info, refactor to get_system_info() * add documentation about --threads-batch to the READMEs * llama-bench fix * main : fix rope freq/scale warning * llama.cpp : add llama_get_model common : add llama_tokenize from model * remove duplicated ctx/model functions ggml-ci * cuda : print total VRAM used 2023-09-28 19:42:38 +00:00			`const int n_embd = llama_n_embd(model);`
Add support for BERT embedding models (#5423) * BERT model graph construction (build_bert) * WordPiece tokenizer (llm_tokenize_wpm) * Add flag for non-causal attention models * Allow for models that only output embeddings * Support conversion of BERT models to GGUF * Based on prior work by @xyzhang626 and @skeskinen --------- Co-authored-by: Jared Van Bortel <jared@nomic.ai> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2024-02-11 16:21:38 +00:00			`auto * embeddings = llama_get_embeddings(ctx);`

			`// l2-normalize embeddings`
			`float norm = 0;`
			`for (int i = 0; i < n_embd; i++) {`
			`norm += embeddings[i] * embeddings[i];`
			`}`
			`norm = sqrt(norm);`
			`for (int i = 0; i < n_embd; i++) {`
			`embeddings[i] /= norm;`
			`}`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00
embedding : evaluate prompt in batches (#2713) 2023-08-22 14:03:12 +00:00			`for (int i = 0; i < n_embd; i++) {`
			`printf("%f ", embeddings[i]);`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00			`}`
embedding : evaluate prompt in batches (#2713) 2023-08-22 14:03:12 +00:00			`printf("\n");`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00
			`llama_print_timings(ctx);`
			`llama_free(ctx);`
llama : make model stateless and context stateful (llama_state) (#1797) * llama : make model stateless and context stateful * llama : minor cleanup * llama : update internal API declaration * Apply suggestions from code review fix style Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Missing model memory release * Fix style * Add deprecated warning for public API function llama_init_from_file * Update public API use cases: move away from deprecated llama_init_from_file * Deprecate public API function llama_apply_lora_from_file --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2023-06-24 08:47:58 +00:00			`llama_free_model(model);`
Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00
mpi : add support for distributed inference via MPI (#2099) * MPI support, first cut * fix warnings, update README * fixes * wrap includes * PR comments * Update CMakeLists.txt * Add GH workflow, fix test * Add info to README * mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099) * mpi : add names for layer inputs + prep ggml_mpi_graph_compute() * mpi : move all MPI logic into ggml-mpi Not tested yet * mpi : various fixes - communication now works but results are wrong * mpi : fix output tensor after MPI compute (still not working) * mpi : fix inference * mpi : minor * Add OpenMPI to GH action * [mpi] continue-on-error: true * mpi : fix after master merge * [mpi] Link MPI C++ libraries to fix OpenMPI * tests : fix new llama_backend API * [mpi] use MPI_INT32_T * mpi : factor out recv / send in functions and reuse * mpi : extend API to allow usage with outer backends (e.g. Metal) --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2023-07-10 15:49:56 +00:00			`llama_backend_free();`

Overhaul the examples structure - main -> examples - utils -> examples (renamed to "common") - quantize -> examples - separate tools for "perplexity" and "embedding" Hope I didn't break something ! 2023-03-25 18:26:40 +00:00			`return 0;`
			`}`