# llama.cpp/example/server-parallel

This example demonstrates a proof-of-concept (PoC) HTTP API server that handles simultaneous requests. Long prompts are not supported.

## Quick Start

To get started right away, run the following command, making sure to use the correct path for the model you have:

### Unix-based systems (Linux, macOS, etc.):

```bash
./server-parallel -m models/7B/ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching
```

### Windows:

```powershell
server-parallel.exe -m models\7B\ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching
```

The above command will start a server that by default listens on `127.0.0.1:8080`.

## API Endpoints

- **GET** `/props`: Returns the user and assistant names used to generate the prompt.

Response:

```json
{
    "user_name": "User:",
    "assistant_name": "Assistant:"
}
```
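
For example, assuming the server is running on the default `127.0.0.1:8080` address shown above, this endpoint can be queried with `curl`:

```bash
# Fetch the user/assistant names used for prompt formatting
curl http://127.0.0.1:8080/props
```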

- **POST** `/completion`: Given a prompt, it returns the predicted completion (streaming mode only).

Options:

- `temperature`: Adjust the randomness of the generated text (default: 0.1).
- `prompt`: Provide a prompt as a string. It should be a coherent continuation of the system prompt.
- `system_prompt`: Provide a system prompt as a string.
- `anti_prompt`: Provide the name of the user, consistent with the system prompt.
- `assistant_name`: Provide the name of the assistant, consistent with the system prompt.

Example request:

```json
{
    "system_prompt": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nHuman: Hello\nAssistant: Hi, how may I help you?\nHuman:",
    "anti_prompt": "Human:",
    "assistant_name": "Assistant:",
    "prompt": "When is the day of independency of US?",
    "temperature": 0.2
}
```
Response:

```json
{
    "content": "<token_str>"
}
```
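
As a minimal sketch, assuming the server is running on the default `127.0.0.1:8080` address and the example request above has been saved to a file named `request.json` (a filename chosen here for illustration), the endpoint can be called with `curl`; the predicted text is streamed back in chunks shaped like the response object above:

```bash
# Send the example request and print the streamed completion as it arrives.
# --no-buffer tells curl to flush output immediately instead of buffering it.
curl --no-buffer -X POST http://127.0.0.1:8080/completion \
     -H "Content-Type: application/json" \
     -d @request.json
```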