* server : refactor middleware and /health endpoint
* move "fail_on_no_slot" to /slots
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix server tests
* fix CI
* update server docs
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* common : Changed tuple to struct (TODO fix)
Use struct `llama_init_result` to replace the previous
std::tuple<struct llama_model *, struct llama_context *>
* delete llama_init_default_params()
* delete the extra whitespace
The README.md had stale information. In particular, the --ctx-size
"defaults to 512" confused me and I had to check the code to confirm
this was false. Since the server is evolving rapidly, it's probably
better to keep the source of truth in a single place (the source) and
generate the README.md from that.
Did:
make llama-server
./llama-server --help > t.txt
vimdiff t.txt examples/server/README.md
I copied the content inside a backquote block. I would have preferred
proper text but it would require a fair amount of surgery to make the
current output compatible with markdown. A follow-up could be to
automate this process with a script.
No functional change.
* server : handle content array in chat API
* Update examples/server/utils.hpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
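
For illustration, a minimal sketch of what accepting a content array could look like, assuming OpenAI-style message parts of the form {"type": "text", "text": "..."}. The helper name flatten_content and the choice to join parts with newlines are assumptions made for this sketch, not the code in examples/server/utils.hpp.

```python
from typing import Any, Dict, List, Union


def flatten_content(content: Union[str, List[Dict[str, Any]]]) -> str:
    """Return a chat message's content as one string (hypothetical helper)."""
    # A plain string is already in the form the rest of the pipeline expects.
    if isinstance(content, str):
        return content

    texts = []
    for part in content:
        # Assumption: this sketch only understands text parts.
        if part.get("type") != "text":
            raise ValueError(f"unsupported content part type: {part.get('type')!r}")
        texts.append(part["text"])

    # Assumption: parts are joined with a newline.
    return "\n".join(texts)
```

Both a plain string and a list of text parts then reach the prompt template as a single string.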
* py : type-check all Python scripts with Pyright
* server-tests : use trailing slash in openai base_url
* server-tests : add more type annotations
* server-tests : strip "chat" from base_url in oai_chat_completions
* server-tests : model metadata is a dict
* ci : disable pip cache in type-check workflow
The cache is not shared between branches, and it's 250MB in size,
so it would become quite a big part of the 10GB cache limit of the repo.
* py : fix new type errors from master branch
* tests : fix test-tokenizer-random.py
Apparently, gcc applies optimisations even when pre-processing,
which confuses pycparser.
* ci : only show warnings and errors in python type-check
The "information" level otherwise has entries
from 'examples/pydantic_models_to_grammar.py',
which could be confusing for someone trying to figure out what failed,
considering that these messages can safely be ignored
even though they look like errors.
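
On the base_url trailing slash above: under standard relative URL resolution, a base without a trailing slash loses its last path segment. The sketch below shows this with urllib.parse.urljoin; the host, port and paths are made up, and whether the openai client resolves paths exactly this way is an assumption.

```python
from urllib.parse import urljoin

# Hypothetical server address, for illustration only.
with_slash = urljoin("http://localhost:8080/v1/", "chat/completions")
without_slash = urljoin("http://localhost:8080/v1", "chat/completions")

print(with_slash)     # http://localhost:8080/v1/chat/completions
print(without_slash)  # http://localhost:8080/chat/completions
```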
* server: Retrieve prompt template in /props
This PR adds the following:
- Expose the model's Jinja2 prompt template in the /props endpoint.
- Change the log level from Error to Warning for the template-mismatch message.
The front-end stands a better chance of actually executing the Jinja template format correctly. The server is currently just guessing it.
Ideally this should have been inside a JSON block that exposes the same key/value pairs as listed during startup by the "llm_load_print_meta" function.
* Make string buffer dynamic
* Add doc and better string handling
* Using chat_template naming convention
* Use intermediate vector for string assignment
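
A minimal client-side sketch of reading the template from a /props response body. It assumes the field is named chat_template, per the naming bullet above, and that the body is a JSON object; the helper name and the sample payload are hypothetical.

```python
import json
from typing import Optional


def extract_chat_template(props_body: str) -> Optional[str]:
    """Return the chat template from a /props JSON body, or None if absent."""
    props = json.loads(props_body)
    # Assumption: the template is exposed under the "chat_template" key.
    template = props.get("chat_template")
    if isinstance(template, str) and template:
        return template
    return None


# Hypothetical payload, for illustration only.
sample_body = json.dumps({"chat_template": "{{ messages }}"})
print(extract_chat_template(sample_body))
```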
* json: default additionalProperty to true
* json: don't force additional props after normal properties!
* json: allow space after enum/const
* json: update pydantic example to set additionalProperties: false
* json: prevent additional props from redefining a typed prop
* port not_strings to python, add trailing space
* fix not_strings & port to js+py
* Update json-schema-to-grammar.cpp
* fix _not_strings for substring overlaps
* json: fix additionalProperties default, uncomment tests
* json: add integ. test case for additionalProperties
* json: nit: simplify condition
* reformat grammar integ tests w/ R"""()""" strings where there are escapes
* update # tokens in server test: consts can now have trailing space
* SimpleChat: Allow for chat req bool options to be user controlled
* SimpleChat: Allow user to control cache_prompt flag in request
* SimpleChat: Add sample GUI images to readme file
Show the chat screen and the settings screen
* SimpleChat:Readme: Add quickstart block, title to image, cleanup
* SimpleChat: RePosition contents of the Info and Settings UI
Make it more logically structured so that it flows better.
* SimpleChat: Rename to apiRequestOptions from chatRequestOptions
So that it is not wrongly assumed that these request options are
used only for the chat/completions endpoint. They are used
for both endpoints, so rename to match the semantics better.
* SimpleChat: Update image included with readme wrt settings ui
* SimpleChat:ReadMe: Switch to webp screen image to reduce size
* server : Smart selection of available slot using Longest Common Substring
* add usage
* remove trailing whitespaces
* Use Longest Common Prefix (LCP) instead of LCS
* Rename argument
* avoid getting the prompt in infill mode and embedding mode
* remove embedding mode
* refactor format
---------
Co-authored-by: wudexiang <wudexiang@bytedance.com>
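
A minimal sketch of slot selection by longest common prefix, to illustrate the idea: pick the slot whose cached prompt shares the most leading tokens with the incoming prompt, so that the most cached work can be reused. The function names, the use of token-ID lists and the "return None when nothing matches" fallback are assumptions for this sketch, not the server.cpp implementation.

```python
from typing import List, Optional, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading elements a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_slot(cached_prompts: List[List[int]], prompt: Sequence[int]) -> Optional[int]:
    """Return the index of the slot with the longest shared prefix (hypothetical)."""
    best_idx: Optional[int] = None
    best_len = 0
    for idx, cached in enumerate(cached_prompts):
        shared = common_prefix_len(cached, prompt)
        if shared > best_len:
            best_idx, best_len = idx, shared
    # None means no slot shares a prefix; the caller would fall back to
    # some other policy (an assumption of this sketch).
    return best_idx
```

A prefix match is the relevant measure here because a cached prompt can only be reused from its beginning; a shared substring in the middle of two prompts does not allow reuse.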