mirror of https://github.com/ggerganov/llama.cpp.git (synced 2025-01-12 03:31:46 +00:00)
server : improve README (#5209)
parent ceebbb5b21
commit 6685cc41c2
@ -32,6 +32,7 @@ Command line options:
|
|||||||
- `--mmproj MMPROJ_FILE`: Path to a multimodal projector file for LLaVA.
|
- `--mmproj MMPROJ_FILE`: Path to a multimodal projector file for LLaVA.
|
||||||
- `--grp-attn-n`: Set the group attention factor to extend context size through self-extend(default: 1=disabled), used together with group attention width `--grp-attn-w`
|
- `--grp-attn-n`: Set the group attention factor to extend context size through self-extend(default: 1=disabled), used together with group attention width `--grp-attn-w`
|
||||||
- `--grp-attn-w`: Set the group attention width to extend context size through self-extend(default: 512), used together with group attention factor `--grp-attn-n`
|
- `--grp-attn-w`: Set the group attention width to extend context size through self-extend(default: 512), used together with group attention factor `--grp-attn-n`
|
||||||
|
|
||||||
## Build
|
## Build
|
||||||
|
|
||||||
server is build alongside everything else from the root of the project
|
server is build alongside everything else from the root of the project
|
||||||
@ -52,21 +53,23 @@ server is build alongside everything else from the root of the project
|
|||||||
|
|
||||||
To get started right away, run the following command, making sure to use the correct path for the model you have:
|
To get started right away, run the following command, making sure to use the correct path for the model you have:
|
||||||
|
|
||||||
### Unix-based systems (Linux, macOS, etc.):
|
### Unix-based systems (Linux, macOS, etc.)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
./server -m models/7B/ggml-model.gguf -c 2048
|
./server -m models/7B/ggml-model.gguf -c 2048
|
||||||
```
|
```
|
||||||
|
|
||||||
### Windows:
|
### Windows
|
||||||
|
|
||||||
```powershell
|
```powershell
|
||||||
server.exe -m models\7B\ggml-model.gguf -c 2048
|
server.exe -m models\7B\ggml-model.gguf -c 2048
|
||||||
```
|
```
|
||||||
|
|
||||||
The above command will start a server that by default listens on `127.0.0.1:8080`.
|
The above command will start a server that by default listens on `127.0.0.1:8080`.
|
||||||
You can consume the endpoints with Postman or NodeJS with axios library. You can visit the web front end at the same url.
|
You can consume the endpoints with Postman or NodeJS with axios library. You can visit the web front end at the same url.
|
||||||
|
|
||||||
### Docker:
|
### Docker
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker run -p 8080:8080 -v /path/to/models:/models ggerganov/llama.cpp:server -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080
|
docker run -p 8080:8080 -v /path/to/models:/models ggerganov/llama.cpp:server -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080
|
||||||
|
|
||||||
@ -120,6 +123,7 @@ node index.js
|
|||||||
```
|
```
|
||||||
|
|
||||||
## API Endpoints
|
## API Endpoints
|
||||||
|
|
||||||
- **GET** `/health`: Returns the current state of the server:
|
- **GET** `/health`: Returns the current state of the server:
|
||||||
- `{"status": "loading model"}` if the model is still being loaded.
|
- `{"status": "loading model"}` if the model is still being loaded.
|
||||||
- `{"status": "error"}` if the model failed to load.
|
- `{"status": "error"}` if the model failed to load.
|
||||||
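Note on the `/health` lines in this hunk: the status strings can be checked from a script. A minimal sketch, assuming the server runs on the default `127.0.0.1:8080`; the helper names `parse_health` and `fetch_health` are hypothetical and not part of the server or the README.

```python
import json
import urllib.error
import urllib.request

# Assumption: the server was started with the default host and port.
BASE_URL = "http://127.0.0.1:8080"


def parse_health(body: str) -> str:
    """Return the value of the "status" field from a /health response body."""
    return json.loads(body)["status"]


def fetch_health(base_url: str = BASE_URL, timeout: float = 5.0) -> str:
    """GET /health and return the reported status string."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return parse_health(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # A non-2xx response may still carry a JSON body with a status field.
        return parse_health(err.read().decode("utf-8"))


# Example (needs a running server):
# print(fetch_health())
```

A caller could poll `fetch_health()` until it stops returning `"loading model"` before sending completion requests.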
@ -189,14 +193,13 @@ node index.js
|
|||||||
|
|
||||||
`system_prompt`: Change the system prompt (initial prompt of all slots), this is useful for chat applications. [See more](#change-system-prompt-on-runtime)
|
`system_prompt`: Change the system prompt (initial prompt of all slots), this is useful for chat applications. [See more](#change-system-prompt-on-runtime)
|
||||||
|
|
||||||
### Result JSON:
|
### Result JSON
|
||||||
|
|
||||||
* Note: When using streaming mode (`stream`) only `content` and `stop` will be returned until end of completion.
|
|
||||||
|
|
||||||
|
- Note: When using streaming mode (`stream`) only `content` and `stop` will be returned until end of completion.
|
||||||
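Note on the streaming line in this hunk: since only `content` and `stop` arrive per chunk, a client has to join the pieces itself. A rough sketch, assuming each chunk arrives as a server-sent-event line of the form `data: {...}`; the function name `collect_stream` is hypothetical.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Concatenate `content` from streamed chunks until a chunk has `stop` set."""
    parts = []
    for line in lines:
        line = line.strip()
        # Assumption: chunks are framed as "data: {json}"; other lines are skipped.
        if not line.startswith("data:"):
            continue
        chunk = json.loads(line[len("data:"):])
        parts.append(chunk.get("content", ""))
        if chunk.get("stop"):
            break
    return "".join(parts)
```

The function takes any iterable of text lines, so it can be fed from a decoded HTTP response or from a recorded transcript.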
|
|
||||||
- `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has the following structure:
|
- `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has the following structure:
|
||||||
|
|
||||||
```
|
```json
|
||||||
{
|
{
|
||||||
"content": "<the token selected by the model>",
|
"content": "<the token selected by the model>",
|
||||||
"probs": [
|
"probs": [
|
||||||
@ -212,6 +215,7 @@ node index.js
|
|||||||
]
|
]
|
||||||
},
|
},
|
||||||
```
|
```
|
||||||
|
|
||||||
Notice that each `probs` is an array of length `n_probs`.
|
Notice that each `probs` is an array of length `n_probs`.
|
||||||
|
|
||||||
- `content`: Completion result as a string (excluding `stopping_word` if any). In case of streaming mode, will contain the next token as a string.
|
- `content`: Completion result as a string (excluding `stopping_word` if any). In case of streaming mode, will contain the next token as a string.
|
||||||
@ -290,6 +294,7 @@ Notice that each `probs` is an array of length `n_probs`.
|
|||||||
|
|
||||||
print(completion.choices[0].message)
|
print(completion.choices[0].message)
|
||||||
```
|
```
|
||||||
|
|
||||||
... or raw HTTP requests:
|
... or raw HTTP requests:
|
||||||
|
|
||||||
```shell
|
```shell
|
||||||
@ -311,6 +316,40 @@ Notice that each `probs` is an array of length `n_probs`.
|
|||||||
}'
|
}'
|
||||||
```
|
```
|
||||||
|
|
||||||
|
- **POST** `/v1/embeddings`: OpenAI-compatible embeddings API.
|
||||||
|
|
||||||
|
*Options:*
|
||||||
|
|
||||||
|
See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-reference/embeddings).
|
||||||
|
|
||||||
|
*Examples:*
|
||||||
|
|
||||||
|
- input as string
|
||||||
|
|
||||||
|
```shell
|
||||||
|
curl http://localhost:8080/v1/embeddings \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-H "Authorization: Bearer no-key" \
|
||||||
|
-d '{
|
||||||
|
"input": "hello",
|
||||||
|
"model":"GPT-4",
|
||||||
|
"encoding_format": "float"
|
||||||
|
}'
|
||||||
|
```
|
||||||
|
|
||||||
|
- `input` as string array
|
||||||
|
|
||||||
|
```shell
|
||||||
|
curl http://localhost:8080/v1/embeddings \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-H "Authorization: Bearer no-key" \
|
||||||
|
-d '{
|
||||||
|
"input": ["hello", "world"],
|
||||||
|
"model":"GPT-4",
|
||||||
|
"encoding_format": "float"
|
||||||
|
}'
|
||||||
|
```
|
||||||
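Note on the `/v1/embeddings` examples added in this hunk: the same request can be built from Python's standard library. This is only a sketch. The helper names are hypothetical, and the response shape read by `extract_vectors` is assumed to follow the OpenAI embeddings format, which this diff does not show.

```python
import json
import urllib.request

# Assumption: the server listens on localhost:8080, as in the curl examples.
EMBEDDINGS_URL = "http://localhost:8080/v1/embeddings"


def build_embeddings_request(text_or_texts, url: str = EMBEDDINGS_URL) -> urllib.request.Request:
    """Build the same POST request as the curl examples.

    `text_or_texts` may be a single string or a list of strings.
    """
    payload = {"input": text_or_texts, "model": "GPT-4", "encoding_format": "float"}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer no-key"},
        method="POST",
    )


def extract_vectors(response_body: str) -> list:
    """Assumed OpenAI-style shape: {"data": [{"embedding": [...]}, ...]}."""
    return [item["embedding"] for item in json.loads(response_body)["data"]]


# Example (needs a running server):
# with urllib.request.urlopen(build_embeddings_request(["hello", "world"])) as resp:
#     vectors = extract_vectors(resp.read().decode("utf-8"))
```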
|
|
||||||
## More examples
|
## More examples
|
||||||
|
|
||||||
### Change system prompt on runtime
|
### Change system prompt on runtime
|
||||||
@ -362,6 +401,7 @@ python api_like_OAI.py
|
|||||||
```
|
```
|
||||||
|
|
||||||
After running the API server, you can use it in Python by setting the API base URL.
|
After running the API server, you can use it in Python by setting the API base URL.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
openai.api_base = "http://<Your api-server IP>:port"
|
openai.api_base = "http://<Your api-server IP>:port"
|
||||||
```
|
```
|
||||||
|