johannes
7006dd784c
server: Propagate standby_timeout after it has been initialized
2024-12-11 08:41:51 +01:00
johannes
4fd58a8013
server: Initialize standby_timeout over constructor instead of passing as argument
2024-12-11 08:33:24 +01:00
johannes
acbac00f0d
server: Return shutdown_handler to its initial state and use running = false for termination
2024-12-11 08:32:12 +01:00
johannes
a4108f59bd
server: Adhere to naming conventions for shutdown_reasons
2024-12-09 23:55:51 +01:00
johannes
4fd985af91
server: Update README to include standby-timeout
2024-12-09 23:55:36 +01:00
johannes
9a8df14d5c
server: Add standby-timeout
...
Add standby-timeout: a timeout that automatically terminates the server
after it has been unused for a certain amount of time
2024-12-09 22:56:27 +01:00
Xuan Son Nguyen
ce8784bdb1
server : fix format_infill ( #10724 )
...
* server : fix format_infill
* fix
* rename
* update test
* use another model
* update test
* update test
* test_invalid_input_extra_req
2024-12-08 23:04:29 +01:00
Xuan Son Nguyen
e52522b869
server : bring back info of final chunk in stream mode ( #10722 )
...
* server : bring back info to final chunk in stream mode
* clarify a bit
* trailing space
2024-12-08 20:38:51 +01:00
Xuan Son Nguyen
3573fa8e7b
server : (refactor) no more json in server_task input ( #10691 )
...
* server : (refactor) no more json in server_task input
* add test for slots endpoint
* add tests for /props and /slots
* remove task inf_type
* fix CI by adding safe_json_to_str
* add "model_path" to /props
* update readme
2024-12-07 20:21:09 +01:00
Georgi Gerganov
ce4a7b8493
server : various fixes ( #10704 )
...
* server : various fixes
ggml-ci
* server : show current seed in slot_params
ggml-ci
* fix /slots endpoint
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : reflect endpoint response changes in the readme
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-07 18:02:05 +02:00
Georgi Gerganov
c2a16c0bdb
server : fix free of spec context and batch ( #10651 )
...
ggml-ci
2024-12-07 11:52:44 +02:00
Xuan Son Nguyen
6c5bc0625f
server : (refactoring) do not rely on JSON internally ( #10643 )
...
* server : (refactoring) reduce usage of json internally
* move all response types to struct
* wip [no ci]
* many fixes
* add virtual function
* fix index
* minor style fix
* add std::move
* refactor handle_completions_generic
* add virtual functions
* remove server.hpp
* clarify server_sent_event RFC specs
* apply review comments
* fix model_alias and completion_probabilities
* small clean up
* remove virtual for to_json_oai_compat()
* naming oai_compat --> oaicompat
* fix unwanted recursive call
* update docs
2024-12-06 11:14:32 +01:00
Plamen Minev
7736837d62
fix(server) : not show alert when DONE is received ( #10674 )
2024-12-05 22:36:41 +01:00
Georgi Gerganov
1da7b76569
server : fix speculative decoding with context shift ( #10641 )
...
* server : fix speculative decoding with context shift
ggml-ci
* server : take into account speculative limits
ggml-ci
* server : add tests
2024-12-04 22:38:20 +02:00
Xuan Son Nguyen
91c36c269b
server : (web ui) Various improvements, now use vite as bundler ( #10599 )
...
* hide buttons in dropdown menu
* use npm as deps manager and vite as bundler
* fix build
* fix build (2)
* fix responsive on mobile
* fix more problems on mobile
* sync build
* (test) add CI step for verifying build
* fix ci
* force rebuild .hpp files
* cmake: clean up generated files pre build
2024-12-03 19:38:44 +01:00
Nikolaos Pothitos
82bca2257b
readme : add option, update default value, fix formatting ( #10271 )
...
* readme : document --no-display-prompt
* readme : update default prompt context size
* readme : remove unnecessary indentation
Indenting a line with four spaces makes Markdown treat that section as
plain text.
* readme : indent commands under bullets
* readme : indent commands in lettered list
2024-12-03 12:50:08 +02:00
Georgi Gerganov
70b98fadbc
server : fix default draft model parameters ( #10586 )
...
* server : force F16 KV cache for the draft model
ggml-ci
* server : fix draft params
ggml-ci
* server : various params fixes
ggml-ci
2024-12-03 11:20:00 +02:00
Xuan Son Nguyen
642330ac7c
llama : add enum for built-in chat templates ( #10623 )
...
* llama : add enum for supported chat templates
* use "built-in" instead of "supported"
* arg: print list of built-in templates
* fix test
* update server README
2024-12-02 22:10:19 +01:00
Georgi Gerganov
8648c52101
make : deprecate ( #10514 )
...
* make : deprecate
ggml-ci
* ci : disable Makefile builds
ggml-ci
* docs : remove make references [no ci]
* ci : disable swift build
ggml-ci
* docs : remove obsolete make references, scripts, examples
ggml-ci
* basic fix for compare-commits.sh
* update build.md
* more build.md updates
* more build.md updates
* more build.md updates
* Update Makefile
Co-authored-by: Diego Devesa <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-12-02 21:22:53 +02:00
haopeng
64ed2091b2
server: Add "tokens per second" information in the backend ( #10548 )
...
* add cmake rvv support
* add timings
* remove space
* update readme
* fix
* fix code
* remove empty line
* add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-02 14:45:54 +01:00
alek3y
86dc11c5bc
server : bind to any port when specified ( #10590 )
2024-12-01 13:33:12 +02:00
Diego Devesa
7cc2d2c889
ggml : move AMX to the CPU backend ( #10570 )
...
* ggml : move AMX to the CPU backend
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-29 21:54:58 +01:00
Xuan Son Nguyen
b782e5c7d4
server : add more test cases ( #10569 )
...
* server : add split model test
* add test speculative
* add invalid cases
2024-11-29 21:48:56 +01:00
Xuan Son Nguyen
6c59567689
server : (tests) don't use thread for capturing stdout/stderr, bump openai client library ( #10568 )
...
* server : (tests) don't use thread for capturing stdout/stderr
* test: bump openai to 1.55.2
* bump openai to 1.55.3
2024-11-28 19:17:49 +01:00
Xuan Son Nguyen
9f912511bc
common : fix duplicated file name with hf_repo and hf_file ( #10550 )
2024-11-27 22:30:52 +01:00
Xuan Son Nguyen
45abe0f74e
server : replace behave with pytest ( #10416 )
...
* server : replace behave with pytest
* fix test on windows
* misc
* add more tests
* more tests
* styling
* log less, fix embd test
* added all sequential tests
* fix coding style
* fix save slot test
* add parallel completion test
* fix parallel test
* remove feature files
* update test docs
* no cache_prompt for some tests
* add test_cache_vs_nocache_prompt
2024-11-26 16:20:18 +01:00
Georgi Gerganov
84e1c33cde
server : fix parallel speculative decoding ( #10513 )
...
ggml-ci
2024-11-26 13:36:40 +02:00
Georgi Gerganov
47f931c8f9
server : enable cache_prompt by default ( #10501 )
...
ggml-ci
2024-11-25 21:50:07 +02:00
Diego Devesa
10bce0450f
llama : accept a list of devices to use to offload a model ( #10497 )
...
* llama : accept a list of devices to use to offload a model
* accept `--dev none` to completely disable offloading
* fix dev list with dl backends
* rename env parameter to LLAMA_ARG_DEVICE for consistency
2024-11-25 19:30:06 +01:00
brucepro
a9a678a6b2
Add download chat feature to server chat ( #10481 )
...
* Add download chat feature to server chat
Add a download feature next to the delete chat feature in the server vue chat interface.
* code style
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-25 17:11:55 +01:00
Georgi Gerganov
9ca2e67762
server : add speculative decoding support ( #10455 )
...
* server : add speculative decoding support
ggml-ci
* server : add helper function slot.can_speculate()
ggml-ci
2024-11-25 16:31:38 +02:00
Georgi Gerganov
d9d54e498d
speculative : refactor and add a simpler example ( #10362 )
...
* speculative : refactor and add a simpler example
ggml-ci
* speculative : clean-up and add comments and TODOs [no ci]
* speculative : manage context in common_speculative
ggml-ci
* speculative : simplify
ggml-ci
* speculative : simplify (cont)
ggml-ci
* speculative : add --draft-min CLI arg
* speculative : minor fixup
* make : build fixes
* speculative : do not redraft previous drafts
ggml-ci
* speculative : fix the draft sampling
ggml-ci
* speculative : fix compile warning
* common : refactor args
ggml-ci
* common : change defaults [no ci]
* common : final touches
ggml-ci
2024-11-25 09:58:41 +02:00
Johannes Gäßler
4e54be0ec6
llama/ex: remove --logdir argument ( #10339 )
2024-11-16 23:00:41 +01:00
MaggotHATE
bcdb7a2386
server: (web UI) Add samplers sequence customization ( #10255 )
...
* Samplers sequence: simplified and input field.
* Removed unused function
* Modify and use `settings-modal-short-input`
* rename "name" --> "label"
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-16 14:26:54 +01:00
Xuan Son Nguyen
9901068ac7
server : (web UI) add copy button for code block, fix api key ( #10242 )
...
* server : (web ui) add copy btn for code blocks
* fix problem with api key
* use settings-modal-short-input component
* always show copy btn for code snippet
2024-11-15 10:48:49 +01:00
Alexey Parfenov
ff7fb670d0
server : add missing docs ( #10269 )
2024-11-13 13:16:30 +02:00
Jhen-Jie Hong
0e712a5acb
server : fix incorrect res in validate_model_chat_template ( #10272 )
...
* server : fix validate_model_chat_template
* server : fix chat res
2024-11-13 13:15:23 +02:00
Georgi Gerganov
b141e5f6ef
server : enable KV cache defrag by default ( #10233 )
...
ggml-ci
2024-11-11 08:38:43 +02:00
MaggotHATE
505f33274d
server : (web UI) Add back sampler settings ( #10239 )
...
* Add back samplers to server
* Added tooltips with basic information
* Fixed stretching of input fields.
* use component for settings input, move help msg to tooltips
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-10 15:42:25 -04:00
Xuan Son Nguyen
76c6e7f105
server : minor UI fix ( #10207 )
2024-11-07 18:44:38 -04:00
Xuan Son Nguyen
a71d81cf8c
server : revamp chat UI with vuejs and daisyui ( #10175 )
...
* server : simple chat UI with vuejs and daisyui
* move old files to legacy folder
* embed deps into binary
* basic markdown support
* add conversation history, save to localStorage
* fix bg-base classes
* save theme preferences
* fix tests
* regenerate, edit, copy buttons
* small fixes
* docs: how to use legacy ui
* better error handling
* make CORS preflight more explicit
* add GET method for CORS
* fix tests
* clean up a bit
* better auto scroll
* small fixes
* use collapse-arrow
* fix closeAndSaveConfigDialog
* small fix
* remove console.log
* fix style for <pre> element
* lighter bubble color (less distracting when reading)
2024-11-07 17:31:10 -04:00
Georgi Gerganov
b11f9ba9b8
server : remove hack for extra parallel slot ( #10187 )
...
ggml-ci
2024-11-06 13:29:01 +02:00
Xuan Son Nguyen
9e0ecfb697
server : clarify /slots endpoint, add is_processing ( #10162 )
...
* server : clarify /slots endpoint, add is_processing
* fix tests
2024-11-04 16:33:29 +01:00
sasha0552
42cadc74bd
server : fix slot selection by lru ( #10126 )
...
* server : fix slot selection by lru, migrate lcs to `size_t`
* minor debug log fix
2024-11-02 18:34:56 +02:00
Georgi Gerganov
45950415ed
server : fix endpoint checks ( #10135 )
...
ggml-ci
2024-11-02 18:34:00 +02:00
sasha0552
d865d1478c
server : fix smart selection of available slot ( #10120 )
...
* Fix smart selection of available slot
* minor fix
* replace vectors of tokens with shorthands
2024-11-01 14:33:14 +01:00
Kevin Gibbons
0a683e8088
server : include scheme when printing URL ( #10106 )
2024-10-31 14:02:35 +01:00
Georgi Gerganov
8d8ff71536
llama : remove Tail-Free sampling ( #10071 )
...
ggml-ci
2024-10-29 10:42:05 +02:00
Georgi Gerganov
8125e6cbfc
server : don't overfill the batch during infill ( #10018 )
...
ggml-ci
2024-10-28 08:49:32 +02:00
wwoodsTM
ff252ea48e
llama : add DRY sampler ( #9702 )
...
* sampling : add DRY sampler (post-refactor)
* DRY: Trying to fix coauthors, removed unneeded line
* DRY: Fixed redundant code
* DRY: Fixed crash issue due to DRY being in chain but uninitialized
---------
Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com>
Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
2024-10-25 19:07:34 +03:00