server : improve README (#5209)

Wu Jian Ping 2024-01-30 17:11:46 +08:00 committed by GitHub
parent ceebbb5b21
commit 6685cc41c2


@@ -32,6 +32,7 @@ Command line options:
- `--mmproj MMPROJ_FILE`: Path to a multimodal projector file for LLaVA.
- `--grp-attn-n`: Set the group attention factor to extend context size through self-extend (default: 1 = disabled), used together with group attention width `--grp-attn-w`
- `--grp-attn-w`: Set the group attention width to extend context size through self-extend (default: 512), used together with group attention factor `--grp-attn-n` (see the example below)
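For example, a hypothetical invocation enabling self-extend might look like this (illustrative values only, not tuned recommendations):
```bash
# illustrative: extend the usable context via self-extend
# note: --grp-attn-w is expected to be a multiple of --grp-attn-n
./server -m models/7B/ggml-model.gguf -c 8192 --grp-attn-n 4 --grp-attn-w 512
```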
## Build
The server is built alongside everything else from the root of the project.
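One way this typically looks from the repository root (a minimal sketch assuming the standard `make` target; a CMake build works as well):
```bash
# run from the root of the llama.cpp repository
make server
```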
@@ -52,21 +53,23 @@ server is build alongside everything else from the root of the project
To get started right away, run the following command, making sure to use the correct path for the model you have:
### Unix-based systems (Linux, macOS, etc.)
```bash
./server -m models/7B/ggml-model.gguf -c 2048
```
### Windows
```powershell
server.exe -m models\7B\ggml-model.gguf -c 2048
```
The above command will start a server that by default listens on `127.0.0.1:8080`.
You can consume the endpoints with Postman, or from Node.js with the axios library. You can also visit the web front end at the same URL.
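For a quick smoke test from the command line, something like the following should work (a minimal sketch assuming the default host/port and the `/completion` endpoint described in the API section):
```bash
# send a simple completion request to the running server
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128}'
```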
### Docker
```bash
docker run -p 8080:8080 -v /path/to/models:/models ggerganov/llama.cpp:server -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080
@@ -120,6 +123,7 @@ node index.js
```
## API Endpoints
- **GET** `/health`: Returns the current state of the server:
  - `{"status": "loading model"}` if the model is still being loaded.
  - `{"status": "error"}` if the model failed to load.
@@ -189,14 +193,13 @@ node index.js
`system_prompt`: Change the system prompt (initial prompt of all slots); this is useful for chat applications. [See more](#change-system-prompt-on-runtime)
### Result JSON
- Note: When using streaming mode (`stream`) only `content` and `stop` will be returned until end of completion.
- `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has the following structure:
```json
{
  "content": "<the token selected by the model>",
  "probs": [
@@ -212,6 +215,7 @@ node index.js
  ]
},
```
Notice that each `probs` is an array of length `n_probs`.
- `content`: Completion result as a string (excluding `stopping_word` if any). In case of streaming mode, will contain the next token as a string.
@@ -290,6 +294,7 @@ Notice that each `probs` is an array of length `n_probs`.
print(completion.choices[0].message)
```
... or raw HTTP requests:
```shell
@@ -311,6 +316,40 @@ Notice that each `probs` is an array of length `n_probs`.
}'
```
- **POST** `/v1/embeddings`: OpenAI-compatible embeddings API.
*Options:*
See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-reference/embeddings).
*Examples:*
- `input` as string
```shell
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"input": "hello",
"model":"GPT-4",
"encoding_format": "float"
}'
```
- `input` as string array
```shell
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"input": ["hello", "world"],
"model":"GPT-4",
"encoding_format": "float"
}'
```
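Assuming the response follows the standard OpenAI embeddings schema (a `data` array whose items carry an `embedding` vector), one quick way to inspect the embedding dimension is to pipe the output through `jq` (installed separately):
```shell
# count the number of floats in the first returned embedding
curl -s http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{"input": "hello", "model": "GPT-4", "encoding_format": "float"}' \
    | jq '.data[0].embedding | length'
```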
## More examples
### Change system prompt on runtime
@@ -362,6 +401,7 @@ python api_like_OAI.py
```
After running the API server, you can use it in Python by setting the API base URL.
```python
openai.api_base = "http://<Your api-server IP>:port"
```