@llama.cpp
@embeddings
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model url https://huggingface.co/ggml-org/models/resolve/main/bert-bge-small/ggml-model-f16.gguf
    And   a model file bert-bge-small.gguf
    And   a model alias bert-bge-small
    And   42 as server seed
    And   2 slots
    # the bert-bge-small model has context size of 512
    # since the generated prompts are as big as the batch size, we need to set the batch size to 512
    # ref: https://huggingface.co/BAAI/bge-small-en-v1.5/blob/5c38ec7c405ec4b44b94cc5a9bb96e735b38267a/config.json#L20
    And   512 as batch size
    And   512 as ubatch size
    And   2048 KV cache size
    And   embeddings extraction
    Then  the server is starting
    Then  the server is healthy

  Scenario: Embedding
    When embeddings are computed for:
      """
      What is the capital of Bulgaria ?
      """
    Then embeddings are generated

  Scenario: OAI Embeddings compatibility
    Given a model bert-bge-small
    When an OAI compatible embeddings computation request for:
      """
      What is the capital of Spain ?
      """
    Then embeddings are generated

  Scenario: OAI Embeddings compatibility with multiple inputs
    Given a model bert-bge-small
    Given a prompt:
      """
      In which country Paris is located ?
      """
    And a prompt:
      """
      Is Madrid the capital of Spain ?
      """
    When an OAI compatible embeddings computation request for multiple inputs
    Then embeddings are generated

  Scenario: Multi users embeddings
    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another very long music lyrics.
      """
    And a prompt:
      """
      Write a very long poem.
      """
    And a prompt:
      """
      Write a very long joke.
      """
    Given concurrent embedding requests
    Then the server is busy
    Then the server is idle
    Then all embeddings are generated

  Scenario: Multi users OAI compatibility embeddings
    Given a prompt:
      """
      In which country Paris is located ?
      """
    And a prompt:
      """
      Is Madrid the capital of Spain ?
      """
    And a prompt:
      """
      What is the biggest US city ?
      """
    And a prompt:
      """
      What is the capital of Bulgaria ?
      """
    And   a model bert-bge-small
    Given concurrent OAI embedding requests
    Then the server is busy
    Then the server is idle
    Then all embeddings are generated

  Scenario: All embeddings should be the same
    Given 10 fixed prompts
    And   a model bert-bge-small
    Given concurrent OAI embedding requests
    Then all embeddings are the same
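
# For reference, the Background above corresponds roughly to launching the server
# manually with flags such as the following. This is an illustrative sketch only;
# flag names and defaults vary between llama.cpp versions, so treat it as an
# assumption rather than the exact command used by the test harness:
#
#   ./server --host localhost --port 8080 \
#            -m bert-bge-small.gguf -a bert-bge-small \
#            --seed 42 -np 2 -b 512 -ub 512 -c 2048 \
#            --embedding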
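
# The plain "embeddings are computed for" steps target the server's native embeddings
# endpoint. A minimal request, assuming the endpoint and payload shape used by recent
# llama.cpp server builds, looks roughly like:
#
#   curl http://localhost:8080/embedding \
#        -H "Content-Type: application/json" \
#        -d '{"content": "What is the capital of Bulgaria ?"}'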
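
# The "OAI compatible" steps exercise the OpenAI-style endpoint instead. A sketch of
# such a request, assuming the /v1/embeddings route and the standard OpenAI payload
# shape (model + input, where input may be a string or a list of strings):
#
#   curl http://localhost:8080/v1/embeddings \
#        -H "Content-Type: application/json" \
#        -d '{"model": "bert-bge-small", "input": ["In which country Paris is located ?", "Is Madrid the capital of Spain ?"]}'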