llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-12-27 20:04:35 +00:00

Author	SHA1	Message	Date
Xuan Son Nguyen	1b9ae5189c	common : refactor arg parser (#9308 ) * (wip) argparser v3 * migrated * add test * handle env * fix linux build * add export-docs example * fix build (2) * skip build test-arg-parser on windows * update server docs * bring back missing --alias * bring back --n-predict * clarify test-arg-parser * small correction * add comments * fix args with 2 values * refine example-specific args * no more lamba capture Co-authored-by: slaren@users.noreply.github.com * params.sparams * optimize more * export-docs --> gen-docs	2024-09-07 20:43:51 +02:00
Georgi Gerganov	df270ef745	llama : refactor sampling v2 (#9294 ) - Add `struct llama_sampler` and `struct llama_sampler_i` - Add `llama_sampler_` API - Add `llama_sampler_chain_` API for chaining multiple samplers - Remove `LLAMA_API_INTERNAL` - Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`	2024-09-07 15:16:19 +03:00
Xuan Son Nguyen	9b2c24c099	server : simplify state machine for slot (#9283 ) * server : simplify state machine for slot * add SLOT_STATE_DONE_PROMPT * pop_deferred_task * add missing notify_one * fix passkey test * metrics : add n_busy_slots_per_decode * fix test step * add test * maybe fix AddressSanitizer? * fix deque ? * missing lock * pop_deferred_task: also notify * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-06 23:21:29 +02:00
Xuan Son Nguyen	4a1411b4f1	server : fix missing lock (#9334 )	2024-09-06 14:06:04 +02:00
Xuan Son Nguyen	48baa61ecc	server : test script : add timeout for all requests (#9282 )	2024-09-02 22:08:38 +02:00
Xuan Son Nguyen	6e7d133a5f	server : refactor multitask handling (#9274 ) * server : remove multitask from server_task * refactor completions handler * fix embeddings * use res_ok everywhere * small change for handle_slots_action * use unordered_set everywhere * (try) fix test * no more "mutable" lambda * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * use deque --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-02 17:11:51 +02:00
Faisal Zaghloul	42c76d1358	Threadpool: take 2 (#8672 ) * Introduce ggml_compute_threadpool - OpenMP functional: check - Vanilla ggml functional: Check - ggml w/threadpool functional: Check - OpenMP no regression: No glaring problems - Vanilla ggml no regression: No glaring problems - ggml w/threadpool no regression: No glaring problems * Minor fixes * fixed use after release bug * fixed a harmless race condition * Fix Android bulid issue * fix more race conditions * fix deadlock for cases where cgraph.n_nodes == 1 and fix --poll case * threadpool: use cpu_get_num_math to set the default number of threadpool threads This way we avoid using E-Cores and Hyperthreaded siblings. * bench: create fresh threadpool for each test For benchmarking it's better to start a fresh pool for each test with the exact number of threads needed for that test. Having larger pools is suboptimal (causes more load, etc). * atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior. * threadpool: make polling the default to match openmp behavior All command line args now allow for setting poll to 0 (false). * threadpool: do not wakeup threads in already paused threadpool * fix potential race condition in check_for_work * threadpool: do not create two threadpools if their params are identical * threadpool: reduce pause/resume/wakeup overhead in common cases We now start threadpool in paused state only if we have two. The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead. * threadpool: add support for hybrid polling poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var. poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ... The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms. We can tune this further as things evolve. * threadpool: reduce the number of barrier required New work is now indicated with an atomic counter that is incremented for each new graph that needs to be computed. This removes the need for extra barrier for clearing the "new_work" and removes the special case for trivial graphs. * threadpool: remove special-casing for disposable threadpools With the efficient hybrid polling there is no need to make disposable pools any different. This simplifies the overall logic and reduces branching. Include n_threads in debug print for disposable threadpool. Declare pause and stop flags as atomic_bool This doesn't actually generate any memory barriers and simply informs the thread sanitizer that these flags can be written & read by different threads without locking. * threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs) This fixes the race condition with very small graphs where the main thread happens to start a new graph while the workers are just about to exit from barriers. * threadpool: use relaxed order for chunk sync Full memory barrier is an overkill for this since each thread works on different chunk * threadpool: remove abort_callback from threadpool state * threadpool: better naming for thread/cpumask releated functions * threadpool: consistent use of int type for n_threads params * threadpool: add support for ggml_threadpool_params_default/init Also removes the need for explicit mask_specified param. all-zero cpumask means use default (usually inherited) cpu affinity mask. * threadpool: move typedef into ggml.h * threadpool: fix apply_priority() function name * threadpool: fix swift wrapper errors due to n_threads int type cleanup * threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled * threadpool: replace checks for compute_thread ret code with proper status check * threadpool: simplify threadpool init logic and fix main thread affinity application Most of the init code is now exactly the same between threadpool and openmp. * threadpool: update threadpool resume/pause function names * threadpool: enable openmp by default for now * threadpool: don't forget to free workers state when omp is enabled * threadpool: avoid updating process priority on the platforms that do not require it On Windows we need to change overall process priority class in order to set thread priorities, but on Linux, Mac, etc we do not need to touch the overall process settings. * threadpool: update calling thread prio and affinity only at start/resume This avoids extra syscalls for each graph_compute() * llama-bench: turn threadpool params into vectors, add output headers, etc * llama-bench: add support for cool off between tests --delay This helps for long running tests on platforms that are thermally limited (phones, laptops, etc). --delay (disabled by default) introduces the sleep for N seconds before starting each test. * threadpool: move process priority setting into the apps (bench and cli) This avoids changing the overall process priority on Windows for the apps that use ggml/llama.cpp directy. * threadpool: move all pause/resume logic into ggml * threadpool: futher api cleanup and prep for future refactoring All threadpool related functions and structs use ggml_threadpool prefix. * threadpool: minor indent fixes * threadpool: improve setprioty error message * Update examples/llama-bench/llama-bench.cpp Co-authored-by: slaren <slarengh@gmail.com> * threadpool: fix indent in set_threadpool call * use int32_t for n_thread type in public llama.cpp API * threadpool: use _new and _free instead of _create and _release * fix two more public APIs to use int32_t for n_threads * build: set _GNU_SOURCE for Adroid --------- Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com> Co-authored-by: fmz <quic_fzaghlou@quic.com> Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-08-30 01:20:53 +02:00
Jan Boon	9f7d4bcf5c	server : fix crash when error handler dumps invalid utf-8 json (#9195 )	2024-08-30 07:15:26 +08:00
Xuan Son Nguyen	a77feb5d71	server : add some missing env variables (#9116 ) * server : add some missing env variables * add LLAMA_ARG_HOST to server dockerfile * also add LLAMA_ARG_CONT_BATCHING	2024-08-27 11:07:01 +02:00
Georgi Gerganov	e5edb210cd	server : update deps (#9183 )	2024-08-26 12:16:57 +03:00
Xuan Son Nguyen	fc54ef0d1c	server : support reading arguments from environment variables (#9105 ) * server : support reading arguments from environment variables * add -fa and -dt * readme : specify non-arg env var	2024-08-21 11:04:34 +02:00
Xuan Son Nguyen	8b3befc0e2	server : refactor middleware and /health endpoint (#9056 ) * server : refactor middleware and /health endpoint * move "fail_on_no_slot" to /slots * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix server tests * fix CI * update server docs --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-16 17:19:05 +02:00
Riceball LEE	37501d9c79	server : fix duplicated n_predict key in the generation_settings (#8994 )	2024-08-15 10:28:05 +03:00
Zhenwei Jin	4af8420afb	common : remove duplicate function llama_should_add_bos_token (#8778 )	2024-08-15 10:23:23 +03:00
Jiří Podivín	234b30676a	server : init stop and error fields of the result struct (#9026 ) Signed-off-by: Jiri Podivin <jpodivin@redhat.com>	2024-08-15 09:21:57 +03:00
compilade	98a532d474	server : fix segfault on long system prompt (#8987 ) * server : fix segfault on long system prompt * server : fix parallel generation with very small batch sizes * server : fix typo in comment	2024-08-14 09:51:02 +03:00
Georgi Gerganov	5ef07e25ac	server : handle models with missing EOS token (#8997 ) ggml-ci	2024-08-12 10:21:50 +03:00
Mathieu Geli	daef3ab233	server : add one level list nesting for embeddings (#8936 )	2024-08-09 09:32:02 +03:00
Xuan Son Nguyen	1e6f6554aa	server : add lora hotswap endpoint (WIP) (#8857 ) * server : add lora hotswap endpoint * handle lora_no_apply * fix build * updae docs * clean up struct def * fix build * add LoRA test * fix style	2024-08-06 17:33:39 +02:00
Liu Jia	0a4ce78681	common : Changed tuple to struct (TODO fix) (#8823 ) * common : Changed tuple to struct (TODO fix) Use struct `llama_init_result` to replace the previous std::tuple<struct llama_model , struct llama_context > * delete llama_init_default_params() * delete the extra whitespace	2024-08-05 18:14:10 +02:00
ardfork	978ba3d83d	Server: Don't ignore llama.cpp params (#8754 ) * Don't ignore llama.cpp params * Add fallback for max_tokens	2024-08-04 20:16:23 +02:00
Igor Okulist	afbbcf3c04	server : update llama-server embedding flag documentation (#8779 ) Fixes #8763	2024-07-31 19:59:09 -04:00
Yaiko	01aec4a631	server : add Speech Recognition & Synthesis to UI (#8679 ) * server : add Speech Recognition & Synthesis to UI * server : add Speech Recognition & Synthesis to UI (fixes)	2024-07-26 00:10:16 +02:00
Ujjawal Panchal	4b0eff3df5	docs : Quantum -> Quantized (#8666 ) * docfix: imatrix readme, quantum models -> quantized models. * docfix: server readme: quantum models -> quantized models.	2024-07-25 11:13:27 +03:00
Vali Malinoiu	b841d07408	server : fix URL.parse in the UI (#8646 )	2024-07-23 17:37:42 +03:00
Jan Boon	628154492a	server : update doc to clarify n_keep when there is bos token (#8619 )	2024-07-22 11:02:09 +03:00
Eric Zhang	0d2c7321e9	server: use relative routes for static files in new UI (#8552 ) * server: public: fix api_url on non-index pages * server: public: use relative routes for static files in new UI	2024-07-18 12:43:49 +02:00
RunningLeon	3807c3de04	server : respect `--special` cli arg (#8553 )	2024-07-18 11:06:22 +03:00
Xuan Son Nguyen	4db8f60fe7	fix ci (#8494 )	2024-07-15 19:23:10 +02:00
M-A	f17f39ff9c	server: update README.md with llama-server --help output [no ci] (#8472 ) The README.md had a stale information. In particular, the --ctx-size "defaults to 512" confused me and I had to check the code to confirm this was false. This the server is evolving rapidly, it's probably better to keep the source of truth at a single place (in the source) and generate the README.md based on that. Did: make llama-server ./llama-server --help > t.txt vimdiff t.txt examples/server/README.md I copied the content inside a backquote block. I would have preferred proper text but it would require a fair amount of surgery to make the current output compatible with markdown. A follow up could be to automate this process with a script. No functional change.	2024-07-15 15:04:56 +03:00
Georgi Gerganov	4e24cffd8c	server : handle content array in chat API (#8449 ) * server : handle content array in chat API * Update examples/server/utils.hpp Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-07-12 14:48:15 +03:00
Douglas Hanley	c3ebcfa148	server : ensure batches are either all embed or all completion (#8420 ) * make sure batches are all embed or all non-embed * non-embedding batch for sampled tokens; fix unused params warning	2024-07-12 11:14:12 +03:00
Clint Herron	278d0e1846	Initialize default slot sampling parameters from the global context. (#8418 )	2024-07-10 20:08:17 -04:00
Clint Herron	a59f8fdc85	Server: Enable setting default sampling parameters via command-line (#8402 ) * Load server sampling parameters from the server context by default. * Wordsmithing comment	2024-07-09 18:26:40 -04:00
compilade	3fd62a6b1c	py : type-check all Python scripts with Pyright (#8341 ) * py : type-check all Python scripts with Pyright * server-tests : use trailing slash in openai base_url * server-tests : add more type annotations * server-tests : strip "chat" from base_url in oai_chat_completions * server-tests : model metadata is a dict * ci : disable pip cache in type-check workflow The cache is not shared between branches, and it's 250MB in size, so it would become quite a big part of the 10GB cache limit of the repo. * py : fix new type errors from master branch * tests : fix test-tokenizer-random.py Apparently, gcc applies optimisations even when pre-processing, which confuses pycparser. * ci : only show warnings and errors in python type-check The "information" level otherwise has entries from 'examples/pydantic_models_to_grammar.py', which could be confusing for someone trying to figure out what failed, considering that these messages can safely be ignored even though they look like errors.	2024-07-07 15:04:39 -04:00
Bjarke Viksøe	cb4d86c4d7	server: Retrieve prompt template in /props (#8337 ) * server: Retrieve prompt template in /props This PR adds the following: - Expose the model's Jinja2 prompt template from the model in the /props endpoint. - Change log-level from Error to Warning for warning about template mismatch. The front-end stands a better chance of actually executing the Jinja template format correctly. Server is currently just guessing it. Ideally this should have been inside a JSON block that expose the same key/value pairs as listed during startup in "llm_load_print_meta" function. * Make string buffer dynamic * Add doc and better string handling * Using chat_template naming convention * Use intermediate vector for string assignment	2024-07-07 11:10:38 +02:00
Pieter Ouwerkerk	5a7447c569	readme : fix minor typos [no ci] (#8314 )	2024-07-05 09:58:41 +03:00
Clint Herron	07a3fc0608	Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. (#8258 )	2024-07-02 12:18:10 -04:00
Sigbjørn Skjæret	38373cfbab	Add SPM infill support (#8016 ) * add --spm-infill option * support --spm-infill * support --spm-infill	2024-06-28 12:53:43 +02:00
Olivier Chafik	139cc621e9	`json`: restore default additionalProperties to false, fix some pattern escapes (#8180 ) * json: expand ESCAPED_IN_REGEXPS_BUT_NOT_IN_LITERALS charset * json: revert default of additionalProperties to false * Update README.md	2024-06-28 09:26:45 +01:00
Georgi Gerganov	f3f65429c4	llama : reorganize source code + improve CMake (#8006 ) * scripts : update sync [no ci] * files : relocate [no ci] * ci : disable kompute build [no ci] * cmake : fixes [no ci] * server : fix mingw build ggml-ci * cmake : minor [no ci] * cmake : link math library [no ci] * cmake : build normal ggml library (not object library) [no ci] * cmake : fix kompute build ggml-ci * make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE ggml-ci * move public backend headers to the public include directory (#8122) * move public backend headers to the public include directory * nix test * spm : fix metal header --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * scripts : fix sync paths [no ci] * scripts : sync ggml-blas.h [no ci] --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-26 18:33:02 +03:00
Olivier Chafik	9b2f16f805	`json`: better support for "type" unions (e.g. nullable arrays w/ typed items) (#7863 ) * json: better suport for "type" arrays (e.g. `{"type": ["array", "null"], "items": {"type": "string"}}`) * json: add test for type: [array, null] fix * update tests	2024-06-26 01:46:35 +01:00
Olivier Chafik	6777c544bd	`json`: fix additionalProperties, allow space after enum/const (#7840 ) * json: default additionalProperty to true * json: don't force additional props after normal properties! * json: allow space after enum/const * json: update pydantic example to set additionalProperties: false * json: prevent additional props to redefine a typed prop * port not_strings to python, add trailing space * fix not_strings & port to js+py * Update json-schema-to-grammar.cpp * fix _not_strings for substring overlaps * json: fix additionalProperties default, uncomment tests * json: add integ. test case for additionalProperties * json: nit: simplify condition * reformat grammar integ tests w/ R"""()""" strings where there's escapes * update # tokens in server test: consts can now have trailing space	2024-06-26 01:45:58 +01:00
Olivier Chafik	84631fe150	`json`: support integer minimum, maximum, exclusiveMinimum, exclusiveMaximum (#7797 ) * json: support minimum for positive integer values * json: fix min 0 * json: min + max integer constraints * json: handle negative min / max integer bounds * json: fix missing paren min/max bug * json: proper paren fix * json: integration test for schemas * json: fix bounds tests * Update json-schema-to-grammar.cpp * json: fix negative max * json: fix negative min (w/ more than 1 digit) * Update test-grammar-integration.cpp * json: nit: move string rules together * json: port min/max integer support to Python & JS * nit: move + rename _build_min_max_int * fix min in [1, 9] * Update test-grammar-integration.cpp * add C++11-compatible replacement for std::string_view * add min/max constrained int field to pydantic json schema example * fix merge * json: add integration tests for min/max bounds * reshuffle/merge min/max integ test cases * nits / cleanups * defensive code against string out of bounds (apparently different behaviour of libstdc++ vs. clang's libc++, can't read final NULL char w/ former)	2024-06-25 20:06:20 +01:00
Xuan Son Nguyen	48e6b92cc3	Add chat template support for llama-cli (#8068 ) * add chat template support for llama-cli * add help message * server: simplify format_chat * more consistent naming * improve * add llama_chat_format_example * fix server * code style * code style * Update examples/main/main.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-25 21:56:49 +10:00
HanishKVC	3791ad2193	SimpleChat v3.1: Boolean chat request options in Settings UI, cache_prompt (#7950 ) * SimpleChat: Allow for chat req bool options to be user controlled * SimpleChat: Allow user to control cache_prompt flag in request * SimpleChat: Add sample GUI images to readme file Show the chat screen and the settings screen * SimpleChat:Readme: Add quickstart block, title to image, cleanup * SimpleChat: RePosition contents of the Info and Settings UI Make it more logically structured and flow through. * SimpleChat: Rename to apiRequestOptions from chatRequestOptions So that it is not wrongly assumed that these request options are used only for chat/completions endpoint. Rather these are used for both the end points, so rename to match semantic better. * SimpleChat: Update image included with readme wrt settings ui * SimpleChat:ReadMe: Switch to webp screen image to reduce size	2024-06-25 21:27:35 +10:00
Aarni Koskela	6a2f298bd7	server : fix JSON-Scheme typo (#7975 )	2024-06-23 11:03:08 -04:00
sasha0552	ba58993152	server : fix smart slot selection (#8020 )	2024-06-20 09:57:10 +10:00
Sigbjørn Skjæret	91c188d6c2	Only use FIM middle token if it exists (#7648 ) * Only use FIM middle if it exists * Only use FIM middle if it exists	2024-06-18 22:19:45 +10:00
Olivier Chafik	1c641e6aac	`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809 ) * `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew * server: update refs -> llama-server gitignore llama-server * server: simplify nix package * main: update refs -> llama fix examples/main ref * main/server: fix targets * update more names * Update build.yml * rm accidentally checked in bins * update straggling refs * Update .gitignore * Update server-llm.sh * main: target name -> llama-cli * Prefix all example bins w/ llama- * fix main refs * rename {main->llama}-cmake-pkg binary * prefix more cmake targets w/ llama- * add/fix gbnf-validator subfolder to cmake * sort cmake example subdirs * rm bin files * fix llama-lookup-* Makefile rules * gitignore /llama-* * rename Dockerfiles * rename llama\|main -> llama-cli; consistent RPM bin prefixes * fix some missing -cli suffixes * rename dockerfile w/ llama-cli * rename(make): llama-baby-llama * update dockerfile refs * more llama-cli(.exe) * fix test-eval-callback * rename: llama-cli-cmake-pkg(.exe) * address gbnf-validator unused fread warning (switched to C++ / ifstream) * add two missing llama- prefixes * Updating docs for eval-callback binary to use new `llama-` prefix. * Updating a few lingering doc references for rename of main to llama-cli * Updating `run-with-preset.py` to use new binary names. Updating docs around `perplexity` binary rename. * Updating documentation references for lookup-merge and export-lora * Updating two small `main` references missed earlier in the finetune docs. * Update apps.nix * update grammar/README.md w/ new llama-* names * update llama-rpc-server bin name + doc * Revert "update llama-rpc-server bin name + doc" This reverts commit `e474ef1df4`. * add hot topic notice to README.md * Update README.md * Update README.md * rename gguf-split & quantize bins refs in **/tests.sh --------- Co-authored-by: HanClinto <hanclinto@gmail.com>	2024-06-13 00:41:52 +01:00
Georgi Gerganov	704a35b183	server : restore numeric prompts (#7883 )	2024-06-12 14:42:29 +03:00
Olivier Chafik	b61eb9644d	json: refine constraint for whitespace to avoid runaways yet allow pretty print (#7866 )	2024-06-11 02:22:57 +01:00
Olivier Chafik	396b18dfec	`json`: document schema conversion in GBNF readme, align manual grammar examples & converters (#7841 ) * json: fix char pattern in grammar converters * json: prevent number precision & whitespace runaways in example grammars * json: add doc to grammar readme	2024-06-11 01:00:30 +01:00
Georgi Gerganov	d9da0e4986	server : improve "prompt" handling (#7847 )	2024-06-10 14:59:55 +03:00
mgroeber9110	3e2ee44315	server: do not remove whitespace at the start of a completion chunk (#7830 )	2024-06-09 20:50:35 +10:00
sasha0552	7a16ce7db2	server : smart slot selection using Longest Common Prefix (#7728 ) * server : Smart selection of available slot using Longest Common Substring * add usage * remove trailing whitespaces * Use Longest Common Prefix (LCP) instead of LCS * Rename argument	2024-06-08 10:50:31 +03:00
Johannes Gäßler	7027b27d76	server: update cache_prompt documentation [no ci] (#7745 )	2024-06-07 11:15:49 +02:00
woodx	a5cabd7649	server : do not get prompt in infill mode (#7286 ) * avoid to get prompt in infill mode and embedding mode * remove embedding mode * refactor format --------- Co-authored-by: wudexiang <wudexiang@bytedance.com>	2024-06-07 10:09:45 +03:00
Georgi Gerganov	f83351f9a6	imatrix : migrate to gpt_params (#7771 ) * imatrix : migrate to gpt_params ggml-ci * imatrix : add --save-frequency cli arg * common : fix --no-ppl	2024-06-06 16:30:58 +03:00
Olivier Chafik	55b2d0849d	grammars: x{min,max} repetition operator (#6640 ) * grammars: x{min,max} repetition operator + tweak +//? to avoid duplication of original over alternates grammars: handle `x{n}` and fix `x{n,n}` * grammars: document new repetition operators * grammars: uniform use of int for min & max * grammars: refactor parser test * grammar: parsing tests w/ natural pretty print of updated expectations * grammars: much prettier print of expectations (+ TEST_GRAMMAR_PARSER_PRINT_ALL=1 to force all) * grammars: improve test pretty print again * grammars: pretty print rules and chars * grammars: fix copy rule skipping * grammars: disallow `a{,}` (not allowed in regexps) * Update common/grammar-parser.cpp Co-authored-by: Clint Herron <hanclinto@gmail.com> * grammars: fix copy rule skipping (again) & display of expectations * grammars: more test cases * grammars: update reps parsing to bring ? / * / + closer to before * json: use new GBNF repetitions{m,n} syntax * grammars: update performance gotchas w/ repetition advice * Update examples/json_schema_to_grammar.py Co-authored-by: Clint Herron <hanclinto@gmail.com> * Update examples/server/public/json-schema-to-grammar.mjs Co-authored-by: Clint Herron <hanclinto@gmail.com> * grammars: comment on rule repetitions * grammars: ensure unambiguous number alternatives * grammar: nit typo switched error msgs * grammar: nit numbering in comment * json: update numeric rule to be unambiguous * Apply suggestions from code review Co-authored-by: Clint Herron <hanclinto@gmail.com> * Update examples/server/public/json-schema-to-grammar.mjs Co-authored-by: Clint Herron <hanclinto@gmail.com> * json: fix integral-part * grammar: add repetition tests --------- Co-authored-by: Clint Herron <hanclinto@gmail.com>	2024-06-06 10:07:06 +01:00
Georgi Gerganov	1442677f92	common : refactor cli arg parsing (#7675 ) * common : gpt_params_parse do not print usage * common : rework usage print (wip) * common : valign * common : rework print_usage * infill : remove cfg support * common : reorder args * server : deduplicate parameters ggml-ci * common : add missing header ggml-ci * common : remote --random-prompt usages ggml-ci * examples : migrate to gpt_params ggml-ci * batched-bench : migrate to gpt_params * retrieval : migrate to gpt_params * common : change defaults for escape and n_ctx * common : remove chatml and instruct params ggml-ci * common : passkey use gpt_params	2024-06-04 21:23:39 +03:00
Yazan Agha-Schrader	2e666832e6	server : new UI (#7633 ) * ic * migrate my eary work * add the belonging stuff: css,favicon etc * de prompts * chore: Update HTML meta tags in index.html file * add api-key css classes * some necessary fixes * Add API key CSS classes and update styling in style.css * clean the code * move API to the top, rearrange param sliders. update css * add tooltips to the parameters with comprehensible explanations * fix FloatField and BoolField tooltips * fix grammar field width * use template literales for promptFormats.js * update const ModelGenerationInfo * remove ms per token, since not relevant for most webui users and use cases * add phi-3 prompt template * add phi3 to dropdown * add css class * update forgotten css theme * add user message suffix * fix chatml & add llama3 format * fix llama3 prompt template * more prompt format fixes * add more comon stop tokens * add missing char * do not separate with new line or comma * move prompt style * add hacky llama2 prompt solution, reduce redundancy in promptFormats.js * fix toggle state localstorage * add cmd-r prompt et reduce redundancy * set default prompt to empty * move files, clean code * fix css path * add a button to the new ui * move new ui to "/public" due to otherwise problematic CORS behaviour * include new ui in cpp * fix wrong link to old ui * renaming to ensure consistency * fix typos "prompt-format" -> "prompt-formats" * use correct indent * add new ui files to makefile * fix typo	2024-06-01 22:31:48 +03:00
HanishKVC	2ac95c9d56	SimpleChat: Simple histogram/repeatMatching driven garbageTrimming, Settings UI, Streaming mode, OpenAi Compat (Model, Authorization Bearer), Save/Restore session, Auto Settings UI (#7548 ) * SimpleChat:DU:BringIn local helper js modules using importmap Use it to bring in a simple trim garbage at end logic, which is used to trim received response. Also given that importmap assumes esm / standard js modules, so also global variables arent implicitly available outside the modules. So add it has a member of document for now * SimpleChat:DU: Add trim garbage at end in loop helper * SimpleChat:DU:TrimGarbage if unable try skip char and retry * SimpleChat:DU: Try trim using histogram based info TODO: May have to add max number of uniq chars in histogram at end of learning phase. * SimpleChat:DU: Switch trim garbage hist based to maxUniq simple Instead of blindly building histogram for specified substring length, and then checking if any new char within specified min garbage length limit, NOW exit learn state when specified maxUniq chars are found. Inturn there should be no new chars with in the specified min garbage length required limit. TODO: Need to track char classes like alphabets, numerals and special/other chars. * SimpleChat:DU: Bring in maxType to the mix along with maxUniq Allow for more uniq chars, but then ensure that a given type of char ie numerals or alphabets or other types dont cross the specified maxType limit. This allows intermixed text garbage to be identified and trimmed. * SimpleChat:DU: Cleanup debug log messages * SimpleChat:UI: Move html ui base helpers into its own module * SimpleChat:DU:Avoid setting frequence/Presence penalty Some models like llama3 found to try to be over intelligent by repeating garbage still, but by tweaking the garbage a bit so that it is not exactly same. So avoid setting these penalties and let the model's default behaviour work out, as is. Also the simple minded histogram based garbage trimming from end, works to an extent, when the garbage is more predictable and repeatative. * SimpleChat:UI: Add and use a para-create-append helper Also update the config params dump to indicate that now one needs to use document to get hold of gMe global object, this is bcas of moving to module type js. Also add ui.mjs to importmap * SimpleChat:UI: Helper to create bool button and use it wrt settings * SimpleChat:UI: Add Select helper and use it wrt ChatHistoryInCtxt * SimpleChat:UI:Select: dict-name-value, value wrt default, change Take a dict/object of name-value pairs instead of just names. Inturn specify the actual value wrt default, rather than the string representing that value. Trap the needed change event rather than click wrt select. * SimpleChat:UI: Add Div wrapped label+element helpers Move settings related elements to use the new div wrapped ones. * SimpleChat:UI:Add settings button and bring in settings ui * SimpleChat:UI:Settings make boolean button text show meaning * SimpleChat: Update a bit wrt readme and notes in du * SimpleChat: GarbageTrim enable/disable, show trimmed part ifany * SimpleChat: highlight trim, garbage trimming bitmore aggressive Make it easy for end user to identified the trimmed text. Make garbage trimming logic, consider a longer repeat garbage substring. * SimpleChat: Cleanup a bit wrt Api end point related flow Consolidate many of the Api end point related basic meta data into ApiEP class. Remove the hardcoded ApiEP/Mode settings from html+js, instead use the generic select helper logic, inturn in the settings block. Move helper to generate the appropriate request json string based on ApiEP into SimpleChat class itself. * SimpleChat:Move extracting assistant response to SimpleChat class so also the trimming of garbage. * SimpleChat:DU: Bring in both trim garbage logics to try trim * SimpleChat: Cleanup readme a bit, add one more chathistory length * SimpleChat:Stream:Initial handshake skeleton Parse the got stream responses and try extract the data from it. It allows for a part read to get a single data line or multiple data line. Inturn extract the json body and inturn the delta content/message in it. * SimpleChat: Move handling oneshot mode server response Move handling of the oneshot mode server response into SimpleChat. Also add plumbing for moving multipart server response into same. * SimpleChat: Move multi part server response handling in * SimpleChat: Add MultiPart Response handling, common trimming Add logic to call into multipart/stream server response handling. Move trimming of garbage at the end into the common handle_response helper. Add new global flag to control between oneshot and multipart/stream mode of fetching response. Allow same to be controlled by user. If in multipart/stream mode, send the stream flag to the server. * SimpleChat: show streamed generative text as it becomes available Now that the extracting of streamed generated text is implemented, add logic to show the same on the screen. * SimpleChat:DU: Add NewLines helper class To work with an array of new lines. Allow adding, appending, shifting, ... * SimpleChat:DU: Make NewLines shift more robust and flexible * SimpleChat:HandleResponseMultiPart using NewLines helper Make handle_response_multipart logic better and cleaner. Now it allows for working with the situation, where the delta data line got from server in stream mode, could be split up when recving, but still the logic will handle it appropriately. ALERT: Rather except (for now) for last data line wrt a request's response. * SimpleChat: Disable console debug by default by making it dummy Parallely save a reference to the original func. * SimpleChat:MultiPart/Stream flow cleanup Dont try utf8-decode and newlines-add_append if no data to work on. If there is no more data to get (ie done is set), then let NewLines instance return line without newline at end, So that we dont miss out on any last-data-line without newline kind of scenario. Pass stream flag wrt utf-8 decode, so that if any multi-byte char is only partly present in the passed buffer, it can be accounted for along with subsequent buffer. At sametime, bcas of utf-8's characteristics there shouldnt be any unaccounted bytes at end, for valid block of utf8 data split across chunks, so not bothering calling with stream set to false at end. LATER: Look at TextDecoder's implementation, for any over intelligence, it may be doing.. If needed, one can use done flag to account wrt both cases. * SimpleChat: Move baseUrl to Me and inturn gMe This should allow easy updating of the base url at runtime by the end user. * SimpleChat:UI: Add input element helper * SimpleChat: Add support for changing the base url This ensures that if the user is running the server with a different port or wants to try connect to server on a different machine, then this can be used. * SimpleChat: Move request headers into Me and gMe Inturn allow Authorization to be sent, if not empty. * SimpleChat: Rather need to use append to insert headers * SimpleChat: Allow Authorization header to be set by end user * SimpleChat:UI+: Return div and element wrt creatediv helpers use it to set placeholder wrt Authorization header. Also fix copy-paste oversight. * SimpleChat: readme wrt authorization, maybe minimal openai testing * SimpleChat: model request field for openai/equivalent compat May help testing with openai/equivalent web services, if they require this field. * SimpleChat: readme stream-utf-8 trim-english deps, exception2error * Readme: Add a entry for simplechat in the http server section * SimpleChat:WIP:Collate internally, Stream mode Trap exceptions This can help ensure that data fetched till that point, can be made use of, rather than losing it. On some platforms, the time taken wrt generating a long response, may lead to the network connection being broken when it enters some user-no-interaction related power saving mode. * SimpleChat:theResp-origMsg: Undo a prev change to fix non trim When the response handling was moved into SimpleChat, I had changed a flow bit unnecessarily and carelessly, which resulted in the non trim flow, missing out on retaining the ai assistant response. This has been fixed now. * SimpleChat: Save message internally in handle_response itself This ensures that throwing the caught exception again for higher up logic, doesnt lose the response collated till that time. Go through theResp.assistant in catch block, just to keep simple consistency wrt backtracing just in case. Update the readme file. * SimpleChat:Cleanup: Add spacing wrt shown req-options * SimpleChat:UI: CreateDiv Divs map to GridX2 class This allows the settings ui to be cleaner structured. * SimpleChat: Show Non SettingsUI config field by default * SimpleChat: Allow for multiline system prompt Convert SystemPrompt into a textarea with 2 rows. Reduce user-input-textarea to 2 rows from 3, so that overall vertical space usage remains same. Shorten usage messages a bit, cleanup to sync with settings ui. * SimpleChat: Add basic skeleton for saving and loading chat Inturn when ever a chat message (system/user/model) is added, the chat will be saved into browser's localStorage. * SimpleChat:ODS: Add a prefix to chatid wrt ondiskstorage key * SimpleChat:ODS:WIP:TMP: Add UI to load previously saved chat This is a temporary flow * SimpleChat:ODS:Move restore/load saved chat btn setup to Me This also allows being able to set the common system prompt ui element to loaded chat's system prompt. * SimpleChat:Readme updated wrt save and restore chat session info * SimpleChat:Show chat session restore button, only if saved session * SimpleChat: AutoCreate ChatRequestOptions settings to an extent * SimpleChat: Update main README wrt usage with server	2024-06-02 02:20:18 +10:00
Georgi Gerganov	a323ec60af	server : update js (#7670 )	2024-05-31 22:23:04 +03:00
mgroeber9110	9335b969e8	server: do not remove whitespace at the start of a completion chunk (#7524 )	2024-05-28 14:55:51 +10:00
Nathan Epstein	c41767154e	Markdownish code block fix (#7571 ) * markdownish codeblock fix * updating regexes	2024-05-28 14:41:14 +10:00
HanishKVC	b9adcbbf92	SimpleChat Completion Mode flexibility and cleanup, Settings gMe, Optional sliding window (#7480 ) * SimpleChat: A placeholder system prompt, Use usage msg in code Just have a alert msg wrt needing javascript enabled in html. And have usage message from js file. Update the usage message a bit. So also enable switch session wrt setup_ui call. Add a possible system prompt as a placeholder for the system-input. * SimpleChat:CompletionMode: Allow control of Role: prefix * SimpleChat:Completion: Avoid Role: prefix; Newline only in between In completion mode * avoid inserting Role: prefix before each role's message * avoid inserting newline at the begin and end of the prompt message. However if there are multiple role messages, then insert newline when going from one role's message to the next role's message. * SimpleChat:CompletionMode: Update readme/usage, trim textarea newline Readme update wrt completion mode behavior. Usage help updated wrt completion mode behavior. When changing from input to textarea elment wrt user input, the last newline at the end of the user input wrt textarea, was forgotten to be filtered, this is fixed now. However if user wants to have a explicit newline they can using shift+enter to insert a newline, that wont be removed. The extra newline removal logic uses substring and keyup to keep things simple and avoid some previously noted bugs wrt other events in the key path as well as IME composition etal. * SimpleChat:SC: Ensure proper clearing/reseting previous logic would have cleared/reset the xchat, without doing the same wrt iLastSys, thus leading to it pointing to a now non existent role-content entry. So if a user set a system prompt and used completion mode, it would have done the half stupid clear, after the model response was got. Inturn when user tries to send a new completion query, it would inturn lead to handle_user_submit trying to add/update system prompt if any, which will fail, bcas iLastSys will be still pointing to a non existant entry. This is fixed now, by having a proper clear helper wrt SC class. * SimpleChat: Update usage note and readme a bit * SimpleChat:Completion: clear any prev chat history at begining Previously any chat history including model response to a completion query would have got cleared, after showing the same to the user, at the end of handle_user_submit, rather than at the begining. This gave the flexibility that user could switch from chat mode to completion mode and have the chat history till then sent to the ai model, as part of the completion query. However this flow also had the issue that, if user switches between different chat sessions, after getting a completion response, they can no longer see the completion query and its response that they had just got. The new flow changes the clearing of chat history wrt completion mode to the begining of handle_user_submit, so that user doesnt lose the last completion mode query and response, till a new completion mode query is sent to the model, even if they were to switch between the chat sessions. At the same time the loss of flexibility wrt converting previous chat history into being part of the completion query implicitly doesnt matter, because now the end user can enter multiline queries. * SimpleChat:Try read json early, if available For later the server flow doesnt seem to be sending back data early, atleast for the request (inc options) that is currently sent. if able to read json data early on in future, as and when ai model is generating data, then this helper needs to indirectly update the chat div with the recieved data, without waiting for the overall data to be available. * SimpleChat: Rename the half asleep mis-spelled global var * SimpleChat: Common chat request options from a global object * SimpleChat: Update title, usage and readme a bit Keep the title simple so that print file name doesnt have chars that need to be removed. Update readme wrt some of the new helpers and options. Change Usage list to a list of lists, add few items and style it to reduce the margin wrt lists. * SimpleChat:ChatRequestOptions: max_tokens As some times based on the query from the user, the ai model may get into a run away kind of generation with repeatations etal, so adding max_tokens to try and limit this run away behaviour, if possible. * SimpleChat: Reduce max_tokens to be small but still sufficient * SimpleChat: Consolidate global vars into gMe, Display to user This allows the end user to see the settings used by the logic, as well as allows users to change/update the settings if they want to by using devel-tools/console * SimpleChat:SlidingWindow: iRecentUserMsgCnt to limit context load This is disabled by default. However if enabled, then in addition to latest system message, only the last N user messages, after the latest system message and its reponses from the ai model will be sent to the ai-model, when querying for a new response. This specified N also includes the latest user query. * SimpleChat: placeholder based usage hint for user-in textarea * SimpleChat: Try make user experience better, if possible Reduce chat history context sent to the server/ai-model to be just the system-prompt, prev-user-request-and-ai-response and cur-user-request, instead of the previous full chat history. This way if there is any response with garbage/repeatation, it doesnt mess with things beyond the next question, in some ways. Increase max_tokens to 1024, so that a relatively large previous reponse doesnt eat up the space available wrt next query-response. However dont forget that the server when started should also be started with a model context size of 1k or more, to be on safe side. Add frequency and presence penalty fields set to 1.2 to the set of fields sent to server along with the user query. So that the model is partly set to try avoid repeating text in its response. * SimpleChat:Add n_predict (equiv max_tokens) for llamacpp server The /completions endpoint of examples/server doesnt take max_tokens, instead it takes the internal n_predict, for now add the same on the client side, maybe later add max_tokens to /completions endpoint handling. * SimpleChat: Note about trying to keep things simple yet flexible	2024-05-26 10:56:34 +10:00
HanishKVC	1e374365d1	SimpleChat: a simple and dumb web front end for testing /chat/completions and /completions end points and try chat (#7350 ) * SimpleChat: Add a skeletal html page Contains a div placeholder for showing chat messages till now a text-input for allowing user to enter next chat message/query to the model. a submit button to allow sending of the user entered message and chat till now to the model. * SimpleChat: A js skeleton with SimpleChat class Allows maintaining an array of chat message. Allows adding chat message (from any of the roles be it system, user, assistant, ...) Allows showing chat messages till now, in a given div element. * SimpleChat: request_json, globals, startme * SimpleChatJS: Roles Class, submitClick Define Role class with static members corresponding to the roles. Update startme to * Get hold of the ui elements. * Attach a click handler to submit button, which adds the user input to xchats array and shows the chat messages till now in chat div element. Trap DOMContentLoaded to trigger startme * SimpleChat:HTML: Bring in the js file * SimpleChat: Rather value wrt input text element * SimpleChat: Also add completions related prompt * SimpleChat: Use common helper logic wrt json data * SimpleChat: Move handling of submit request into its own func * SimpleChat: Try handshake with llm over its web service endpoint * SimpleChat:JS: Extract model response and show to user * SimpleChat:JS: Messages/Prompt, indicate working to end user * SimpleChat: Try keep input element in view * SimpleChat: Diff user/assistant msgs, Make input wider Also show a default message to user Also add some metas * SimpleChat: Move into its own sub directory to avoid confusion * SimpleChat:sh: Add simple shell script to run python3 http.server So one needs to run the llm server locally then run this script and access it using a local browser * SimpleChat:JS: Try trap enter key press wrt input text field So user can either press submit button or press enter key * SimpleChat: Allow user to select chat or completion mode * SimpleChat: Dont submit if already submitted and waiting Also make chat the default selection wrt mode * SimpleChat:JS: Handle difference in response Try read the assistance response from appropriate field in the response got. Also examples/server seems to return the response in a slightly different field, so try account for that also. * SimpleChat:JS: Force completion mode be single message by default * SimpleChat: Add a simple readme file * SimpleChat:HTML: Cleanup/structure UI a bit, Add input for system * SimpleChat:Allow system prompt to be set, if provided before user * SimpleChat: Ignore empty user input, without trimming * SimpleChat:Alert user if they provide sysprompt late or change it * SimpleChat: Move handling systemprompt into its own func * SimpleChat:HTML: Add a style for system role message * SimpleChat: Update the readme file * SimpleChat:CSS: Move style info into its own css file To keep it simple, clean and seperate so that things are not unnecessarily cluttered. * SimpleChat:CSS: Allow for chat div to be scrollable * SimpleChat:JS: Try ensure the last entry in chat is visible Needed because now only the chat div is scrollable and not the full page. In last commit the chat div size was fixed to 75% vertical height, so the full page no longer scrolls, so the old bring user-input element to view wont work, instead now the last element in the chat div should be brought into view. * SimpleChat:JS: bottom of element visible, Set focus to user input As the generated text could be multiple lines and occupy more space that the full scrollable div's vertical space, make the bottom of the last element (which can be such a generated text) in the div visible by scrolling. Ensure that the user input box has focus * SimpleChat: Update notes a bit. Try keep browser happy Avoid browser quirk mode with DOCTYPE. Help with accessibility a bit by specifying the language explicitly. Specify the char encoding explicitly, inturn utf-8 is a safe bet, even with intermixing of languages if reqd in future. Add a cache-control http-equiv meta tag, which in all probability will be ignored. Defer js loading and execution, just for fun and future, not that critical here as it stands now. * SimpleChat:HTML:Group user input+btn together; Note about multichat * SimpleChat:JS: Allow for changing system prompt anytime for future * SimpleChat:Readme: Note about handle_systemprompt begin/anytime * SimpleChat:HTML: Add viewport meta for better mobile friendliness Without this the page content may look too small. * SimpleChat:HtmlCss: Cleanup UI flow set margin wrt vmin rather than vw or vh so portrait/landscape ok. Use flex and flex-grow to put things on the same line as well as distribute available space as needed. Given two main elements/line so it remains simple. In each line have one element with grows and one sits with a basic comfortably fixed size. * SimpleChat: textarea for multiline user chat, inturn shift+enter 4 enter * SimpleChat: Make vertical layout better responsive (flex based) Also needed to make things cleaner and properly usable whether landscape or portrait, after changing to multiline textarea rather than single line user input. Avoid hardcoding the chat-till-now display area height, instead make it a flex-growable within a flex column of ui elements within a fixed vertical area. * SimpleChat: Rename simplechat.html to index.html, update readme Instead of providing a seperate shell script, update the readme wrt how to run/use this web front end. * SimpleChat: Screen fixed view and scrolling, Printing full * SimpleChat:JS:CI: Avoid space at end of jsdoc param line * SimpleChat:JS: MultiChat initial skeleton Will help maintain multiple independent chats in future * SimpleChat:JS: Move system prompt begin/anytime into SimpleChat * SimpleChat:JS:Keep MultiChatUI simple for now Worry about different chats with different servers for later. * SimpleChat:JS: Move handle submit into MultiChat, build on same Create an instance of MultiChatUI and inturn a instance of chat session, which is what the UI will inturn work on. * SimpleChat:JS: Move to dictionary of SimpleChat, instead of array * SimpleChat: Move ui elements into MultiChatUI, Update el IDs Move ui elements into MultiChatUI, so that current handleUserSubmit doesnt need to take the element arguments. Also in future, when user is allowed to switch between different chat sessions, the UI can be updated as needed by using the elements in UI already known to MultiChatUI instance. Rename the element ids' so that they follow a common convention, as well as one can identify what the element represents in a more consistant manner. * SimpleChat:MCUI:Show available chat sessions, try switch btw them Previous commits brought in / consolidated existing logic into MultiChatUI class. Now start adding logic towards multichat support * show buttons indicating available chat sessions * on sessin button click, try switch to that session * SimpleChat:MCUI: Store and use current chat session id Also allow to switch chat session optionally, wrt some of the related helpers. setup for two chat sessions by default. * SimpleChat:MCUI: Delay enabling user-input to avoid race Re-enable user-input, only after response to a user query has been updated to the chat-div. This ensures that if user tries to switch chat session, it wont be allowed till chat-request-response flow is done. * SimpleChat: Take care of system prompt Helper to get the latest system prompt and inturn use same to set the system prompt ui, when switching. Ensure that system prompt is set if and when enter key is pressed. * SimpleChat:GetSystemLatest, fix a oversight. * SimpleChat:MCUI: Allow selected chat-session btn to be highlighted Also have a general helper for setting class of children. * SimpleChat:Cleanup corners Show system prompt in chat space, when it is set by pressing enter, as a feedback to user. Alert user, if they try to switch chat session in the middle of waiting for a response from the ai model. * SimpleChat:MCUI: Ensure req-resp failure doesnt lock up things * SimpleChat:MCUI: Support for new chat sessions Also a general create button helper. * SimpleChat:MCUI: CreateSessionBtn helper, use wrt NewChat Also fix a oversight wrt using stale data wrt the list of chat sessions. * SimpleChat:MCUI: NewChat btn first before existing chat sessions * SimpleChat:MCUI:CornerCases:Skip new chat, show only if current Skip NewChat if user cancels or if one waiting for response from the ai model. Dont show a chat with newly got ai model response, if current chat session has changed, some how. Chat session shouldnt be allowed to change, if there is a pending response, but still as a additional sanity check. * SimpleChat: Update readme, title, show usage if no chat to show * SimpleChat: Cleanup the log/dialog messages a bit	2024-05-23 03:53:21 +10:00
Georgi Gerganov	6ff13987ad	common : normalize naming style (#7462 ) * common : normalize naming style ggml-ci * common : match declaration / definition order * zig : try to fix build	2024-05-22 20:04:20 +03:00
jaime-m-p	d7e852c1bc	Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425 ) * Update brute force test: add_special * Update brute force test: default values for add_bos_token and add_eos_token * Enable rtrim when pre-inserting BOS Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Revert "server : fix test regexes"	2024-05-21 14:39:48 +02:00
jaime-m-p	917dc8cfa6	Tokenizer SPM fixes for phi-3 and llama-spm (#7375 ) * Update brute force test: special tokens * Fix added tokens - Try to read 'added_tokens.json'. - Try to read 'tokenizer_config.json'. - Try to read 'tokenizer.json'. * Fix special tokens rtrim Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server : fix test regexes	2024-05-20 20:15:57 +02:00
Georgi Gerganov	3bc10cb485	server : fix temperature + disable some tests (#7409 ) * server : fix temperature * server : disable tests relying on parallel determinism * ci : change server Debug -> RelWithDebInfo	2024-05-20 22:10:03 +10:00
Georgi Gerganov	1cc0155d04	server : tuning tests (#7388 ) * server : don't pass temperature as string * server : increase timeout * tests : fix the fix 0.8f -> 0.8 ggml-ci * tests : set explicit temperature	2024-05-20 10:16:41 +03:00
Georgi Gerganov	e932094d58	server : return error on too large embedding input (#7389 )	2024-05-20 08:56:05 +03:00
Johannes Gäßler	1b01f06db0	server: add test for token probs (#7347 )	2024-05-19 16:26:02 +02:00
Johannes Gäßler	41858392e1	server: fix seed being reported back (#7382 )	2024-05-19 17:06:33 +03:00
Johannes Gäßler	cb42c29427	server: correct --threads documentation [no ci] (#7362 )	2024-05-18 11:10:47 +02:00
Radoslav Gerganov	ee94172d33	server : add support for the RPC backend (#7305 ) ref: #7292	2024-05-17 10:00:17 +03:00
Leon Knauer	9c4fdcbec8	[Server] Added --verbose option to README [no ci] (#7335 )	2024-05-17 10:11:03 +10:00
Pierrick Hymbert	24ecb58168	Revert "server bench: fix bench not waiting for model load (#7284 )" (#7334 ) This reverts commit `583fd6b000`.	2024-05-16 20:43:45 +02:00
Johannes Gäßler	583fd6b000	server bench: fix bench not waiting for model load (#7284 )	2024-05-15 08:44:16 +02:00
Steve Grubb	4f0263633b	server: free sampling contexts on exit (#7264 ) * server: free sampling contexts on exit This cleans up last leak found by the address sanitizer. * fix whitespace * fix whitespace	2024-05-14 16:11:24 +02:00
Ryuei	27f65d6267	docs: Fix typo and update description for --embeddings flag (#7026 ) - Change '--embedding' to '--embeddings' in the README - Update the description to match the latest --help output - Added a caution about defining physical batch size	2024-05-14 15:20:47 +10:00
Benjamin Findley	e586ee4259	change default temperature of OAI compat API from 0 to 1 (#7226 ) * change default temperature of OAI compat API from 0 to 1 * make tests explicitly send temperature to OAI API	2024-05-13 12:40:08 +10:00
Xuan Son Nguyen	72c177c1f6	fix system prompt handling (#7153 )	2024-05-11 17:28:10 +02:00
Steve Grubb	988631335a	server : free llama_batch on exit (#7212 ) * [server] Cleanup a memory leak on exit There are a couple memory leaks on exit of the server. This hides others. After cleaning this up, you can see leaks on slots. But that is another patch to be sent after this. * make tab into spaces	2024-05-11 11:13:02 +03:00
Johannes Gäßler	5ae3426b0b	server: fix reported top tokens for temperature 0 (#7203 )	2024-05-11 10:11:28 +02:00
compilade	f98eb31c51	convert-hf : save memory with lazy evaluation (#7075 ) * convert-hf : begin refactoring write_tensor * convert : upgrade to sentencepiece v0.2.0 * convert-hf : remove unused n_dims in extra__tensors convert-hf : simplify MoE weights stacking * convert-hf : flake8 linter doesn't like semicolons * convert-hf : allow unusual model part names For example, loading `model-00001-of-00001.safetensors` now works. * convert-hf : fix stacking MoE expert tensors `torch.stack` and `torch.cat` don't do the same thing. * convert-hf : fix Mamba conversion Tested to work even with a SentencePiece-based tokenizer. * convert : use a string for the SentencePiece tokenizer path * convert-hf : display tensor shape * convert-hf : convert norms to f32 by default * convert-hf : sort model part names `os.listdir` is said to list files in arbitrary order. Sorting the file names should let "model-00009-of-00042.safetensors" be loaded before "model-00010-of-00042.safetensors". * convert-hf : use an ABC for Model again It seems Protocol can't be used as a statically type-checked ABC, because its subclasses also can't be instantiated. (why did it seem to work?) At least there's still a way to throw an error when forgetting to define the `model_arch` property of any registered Model subclasses. * convert-hf : use a plain class for Model, and forbid direct instantiation There are no abstract methods used anyway, so using ABC isn't really necessary. * convert-hf : more consistent formatting of cmdline args * convert-hf : align the message logged for converted tensors * convert-hf : fix Refact conversion * convert-hf : save memory with lazy evaluation * convert-hf : flake8 doesn't like lowercase L as a variable name * convert-hf : remove einops requirement for InternLM2 * convert-hf : faster model parts loading Instead of pre-loading them all into a dict, iterate on the tensors in the model parts progressively as needed in Model.write_tensors Conversion for some architectures relies on checking for the presence of specific tensor names, so for multi-part models, the weight map is read from the relevant json file to quickly get these names up-front. * convert-hf : minor changes for consistency * gguf-py : add tqdm as a dependency It's small, and used for a progress bar in GGUFWriter.write_tensors_to_file	2024-05-08 18:16:38 -04:00
Johannes Gäßler	c12452c7ae	JSON: [key] -> .at(key), assert() -> GGML_ASSERT (#7143 )	2024-05-08 21:53:08 +02:00
JohnnyB	bd1871fa2b	server : add themes + favicon (#6848 ) * Added themes support with two sample themes and a favicon. * Newline * Newline * Newline * Trailing whitespace * Increased opacity for contrast * Increase opacity. Check actions cancelled for some other priority job and I can't seem to manually re-run them, so MOAR OPACITY * Opacity action trigger. Trying to re-trigger the cancelled action. * One more opacity adjustment This Actions pipeline is failing for random issues. * Delete examples/server/themes/buttons_top/completion.js This will be served from the static string built-in to server. * Delete examples/server/themes/buttons_top/index.js This will be served from the static string built-in to server. * Delete examples/server/themes/wild/completion.js This will be served from the static string built-in to server. * Delete examples/server/themes/buttons_top/json-schema-to-grammar.mjs This will be served from the static string built-in to server. * Delete examples/server/themes/wild/index.js This will be served from the static string built-in to server. * Delete examples/server/themes/wild/json-schema-to-grammar.mjs This will be served from the static string built-in to server. * Replaced underscore.	2024-05-08 22:12:06 +03:00
Johan	911b3900dd	server : add_special option for tokenize endpoint (#7059 )	2024-05-08 15:27:58 +03:00
Xuan Son Nguyen	1fd9c1741d	clean up json_value & server_log (#7142 )	2024-05-08 13:24:14 +02:00
Johannes Gäßler	af0a5b6163	server: fix incorrectly reported token probabilities (#7125 ) * server: normalize token probabilities * fix temperature == 0.0f	2024-05-07 23:07:58 +02:00
Kyle Mistele	260b7c6529	server : update readme with undocumented options (#7013 )	2024-05-07 21:44:29 +03:00
maor-ps	03fb8a002d	If first token generated from the server is the stop word the server will crash (#7038 ) This will reproduce the issue in llama13b { 'prompt': 'Q: hello world \nA: ', 'stop': ['\n'], 'temperature': 0.0, 'n_predict': 10, 'cache_prompt': True, 'n_probs': 10 }	2024-05-04 11:06:40 +02:00
Johannes Gäßler	3ea0d36000	Server: add tests for batch size, different seeds (#6950 )	2024-05-01 17:52:55 +02:00
Georgi Gerganov	9c67c2773d	ggml : add Flash Attention (#5021 ) * ggml : add ggml_flash_attn_ext API * ggml : fix GQA support in ggml_flash_attn_ext * ggml : online attention (CPU) * metal : initial implementation * metal : f16 precision * metal : reduce branches * metal : specialize for head size * wip : 8 rows per simd group * wip : 4 rows per simd group * wip : template for rows per warp * metal : parallelize across KV size * metal : parallel reduce across heads * metal : efficient flash_attn_f16 implementation * metal : avoid redundant loads of the attention * metal : scale and mask in matrix form * metal : fix comment * llama : avoid ggml_cast, use F32 query * metal : add parallel reduce version (disabled) * metal : move output into local memory + optimize - the result from each simdgroup now stays in the registers - significantly reduced SRAM usage - more efficient skipping of -INF blocks - avoid simdgroup barrier in hot loop - add comments * metal : add tests, fix scaling, support C > 32 * metal : improve precision * ggml : fix f16 mad * metal : minor * metal : support Q > 8 * tests : add ATTN tests * metal : disable buffer allocation logs * tests : more * metal : faster inner loop for C == 32 * metal : fix array initialization * tests : ifdef * ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext * ggml : fix ggml_soft_max mask requirement * cuda : fix soft_max to use correct mask size * cuda : add flash_attn kernel (wip) * metal : optimize softmax for C > 32 * metal : optimize softmax * tests : minor fix * cuda : avoid zeroing fragments * tests : update dims * cuda : fix __hisinf() result check * cuda : avoid warp_reduce for smax * cuda : use int instead of int64_t Noticeably improves performance (thanks to Johannes) * cuda : make loops use the same loop values Thanks Johannes again for the tip * cuda : unroll some of the loops * cuda : avoid __hisinf branches * cuda : use half2 in softmax * cuda : switch to 1 warp for bs > 16 * cuda : speed-up reduce part of the kernel * cuda : unroll QK^T loop cuda : fix -INF block check * cuda : simplify softmax * cuda : fix matrix names * cuda : minor * llama : adapt to F16 KQ_pos * llama : adapt new models to F16 KQ_mask * ggml : fix F16 store (ARM NEON) * llama : fix type of KQ_mask and KQ_pos * ggml : fix CPU soft_max * tests : add hs=256 * cuda : fix build * metal : improve perf via smaller int registers * cuda : adapt soft_max to F16 mask and pos * CUDA: faster FlashAttention, kernel for bs == 1 * 16 cols for Phi-2 * no vec for hs, no hs==256 ncols==32 for Volta * adjust kernel selection logic * 4 warps, 256 stride for all D * no ncols == 64 * Multiple parallel blocks for batch size 1 * fix compile warnings * fix excessive KQ_b loads * fix cmake build * fix KV cache padding, NaN from INFINITY (#6438) * llama : flash_attn cparam + fix defrag * server: support flash_attn param * server: bench: enable flash_attn param * CUDA: refactor host code, dyn. par. blocks * fix flash_attn_vec_f16 race condition * flush softmax exp below threshold to 0 * store temp KQ in registers * Calculate KQ as FP32 if KQV has GGML_PREC_F32 * Add __hgt2_mask implementation for CUDA 11 * fix KQ FP32 precision fpr parallel_blocks > 1 * llama-bench : add -fa,--flash-attn arg * metal : add BS=1 kernel for flash attention (#6508) * metal : add BS=1 kernel for flash attention (wip) * metal : support more than 1 warps * metal : opts * metal : opt * metal : switch to parallel reduce * metal : reduce registers * metal : simplify * metal : initial FA vec kernel * metal : use F32 attention accumulators * batched-bench : add fattn arg * llama : simplify llama_build_kv_store ggml-ci * llama : adapt build_olmo to changes * ggml : fix arm fp16 store on windows * metal : clean-up * metal : clean-up kernel code * metal : minor * tests : remove benchmarks ggml-ci * ggml : fix avx512 const correctness ggml-ci * ggml : fix soft_max with bias on CPU ggml-ci * common : print --flash-attn in help * ggml : fix num dimensions in ggml_flash_attn_ext * llama : force disable flash attention for incompatible models * ggml : ggml_soft_max support F16/F32 mask/pos ggml-ci * cuda : uint -> uint32_t * cuda : "constexpr dim3" -> "const dim3" ggml-ci * cuda : try to fix __hgt2_mask ggml-ci * ggml : add TODO's for F16/F32 mask/pos support in other backends * llama : replace bool need_kq_pos with use_alibi * llama : prep ALiBi support for BERT models ggml-ci * llama : fix n_batch requirements ggml-ci * cont * server : add help for --flash-attn arg * llama : disable FA for AMD * tests : remove TMP_ATTN_BENCH ggml-ci * llama : support save/load state with FA enabled ggml-ci * ci : add CUDA save-load-state tests ggml-ci * llama : llama_kv_cache_clear zeroes data + fix save-load seq ggml-ci * llama : fix copy-paste errors, add TODO * llama : disallow incompatible states * llama : update llama_state_get_size after v_trans field * metal : remove tmp log * llama : add static reminder for llama_state_get_size * metal : fix max nsg ggml-ci * ci : fix arg order ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>	2024-04-30 12:16:08 +03:00
Olivier Chafik	8843a98c2b	Improve usability of --model-url & related flags (#6930 ) * args: default --model to models/ + filename from --model-url or --hf-file (or else legacy models/7B/ggml-model-f16.gguf) * args: main & server now call gpt_params_handle_model_default * args: define DEFAULT_MODEL_PATH + update cli docs * curl: check url of previous download (.json metadata w/ url, etag & lastModified) * args: fix update to quantize-stats.cpp * curl: support legacy .etag / .lastModified companion files * curl: rm legacy .etag file support * curl: reuse regex across headers callback calls * curl: unique_ptr to manage lifecycle of curl & outfile * curl: nit: no need for multiline regex flag * curl: update failed test (model file collision) + gitignore *.gguf.json	2024-04-30 00:52:50 +01:00
Olivier Chafik	b8a7a5a90f	build(cmake): simplify instructions (`cmake -B build && cmake --build build ...`) (#6964 ) * readme: cmake . -B build && cmake --build build * build: fix typo Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * build: drop implicit . from cmake config command * build: remove another superfluous . * build: update MinGW cmake commands * Update README-sycl.md Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> * build: reinstate --config Release as not the default w/ some generators + document how to build Debug * build: revert more --config Release * build: nit / remove -H from cmake example * build: reword debug instructions around single/multi config split --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2024-04-29 17:02:45 +01:00
Pierrick Hymbert	b7368332e2	ci: server: tests python env on github container ubuntu latest / fix n_predict (#6935 ) * ci: server: fix python env * ci: server: fix server tests after #6638 * ci: server: fix windows is not building PR branch	2024-04-27 17:50:48 +02:00
Pierrick Hymbert	0c4d489e29	quantize: add imatrix and dataset metadata in GGUF (#6658 ) * imatrix: save the dataset file used in the output file * llama: support kv overrides type string string * common: factorize KV Overrides parsing between common and server * quantize: add imatrix n entries and dataset KV metadata quantize: factorize KV Overrides parsing between common #6656 * llama: remove kv override str_value initialization as it does not compile on some toolchain * quantize: add imatrix m_last_call as `quantize.imatrix.chunks_count` * quantize: add imatrix filename in KV * llama: add llama_model_kv_override_free * common: add llama_model_kv_override_free common: free kv override if used after model loading * llama: finally move the string KV override value to the stack * llama : minor * no need to add a NUL to the std::vector, std::string can be initialized from a pair of iterators. Co-authored-by: slaren <slarengh@gmail.com> * kv override: ensure string termination --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-04-26 20:06:33 +02:00
Pierrick Hymbert	7f5ff558ee	server: stop generation at `n_ctx_train` if `n_predict` is not set (#6638 ) * server: cap n_predict if not set to n_ctx_train * server: fix infinite loop * server: infinite loop, move in process_token server: infinite loop: set stop limit to true * minor: spaces * minor: spaces * server: include prompt tokens in the EOS limit	2024-04-26 12:15:30 +02:00
Pierrick Hymbert	5790c8dac1	bench: server add stop word for PHI-2 (#6916 )	2024-04-26 09:26:16 +02:00
Georgi Gerganov	aa750c1ede	tests : minor bash stuff (#6902 ) * tests : minor bash stuff ggml-ci * llama : fix build ggml-ci * tests : fix CUR_DIR -> ROOT_DIR ggml-ci * tests : fix fname ggml-ci	2024-04-25 14:27:20 +03:00
mgroeber9110	3fe847b574	server : do not apply Markdown formatting in code sections (#6850 )	2024-04-24 13:54:24 +03:00
Kyle Mistele	37246b1031	common : revert showing control tokens by default for server (#6860 ) * fix: revert showing control tokens by default * feat: revert changes to default behavior of llama_token_to_piece; provide overridden declaration to receive "bool special" param to toggle showing control tokens * feat: use the overridden declaration of llama_token_to_piece from common/common.cpp to specify "false" so that control tokens are not shown in chat completion responses" * common : simplify --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-24 13:15:29 +03:00
Johannes Gäßler	28103f4832	Server: fix seed for multiple slots (#6835 ) * Server: add tests for consistent results * sampling: separate rng per sampling context	2024-04-24 11:08:36 +02:00
Olivier Chafik	5cf5e7d490	`build`: generate hex dump of server assets during build (#6661 ) * `build`: generate hex dumps of server assets on the fly * build: workaround lack of -n on gnu xxd * build: don't use xxd in cmake * build: don't call xxd from build.zig * build: more idiomatic hexing * build: don't use xxd in Makefile (od hackery instead) * build: avoid exceeding max cmd line limit in makefile hex dump * build: hex dump assets at cmake build time (not config time)	2024-04-21 18:48:53 +01:00
Pedro Cuenca	b97bc3966e	llama : support Llama 3 HF conversion (#6745 ) * Support Llama 3 conversion The tokenizer is BPE. * style * Accept suggestion Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com> * llama : add llama_token_is_eog() ggml-ci * llama : auto-detect more EOT tokens when missing in KV data * convert : replacing EOS token is a hack * llama : fix codegemma EOT token + add TODOs * llama : fix model type string for 8B model --------- Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-21 14:50:41 +03:00
Jan Boon	b8109bc013	doc : server tests require llama to be built with curl enabled (#6788 )	2024-04-20 18:29:50 +02:00
Pierrick Hymbert	637e9a86c2	server: static: upstream upgrade (#6765 )	2024-04-19 13:19:01 +02:00
Olivier Chafik	7593639ce3	`main`: add --json-schema / -j flag (#6659 ) * main: add --json-schema / -j * json: move json-schema-to-grammar to common lib * json: fix zig build	2024-04-15 18:35:21 +01:00
Pierrick Hymbert	3272896d79	server : revert "minor layout improvements" (#6684 ) This reverts commit `b3a96f27f0`.	2024-04-15 15:18:47 +03:00
Olivier Chafik	ab9a3240a9	JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length (#6555 ) * json: rename python schema converter to make import easier * server: skip null json_schema / grammar fields * json: deps management for primitive rules (+ allow null values) * json: optimize repetitions for minItems/maxItems and regexps: `a{,3}` goes from `"a"? "a"? "a"?` (explosive combos) to `(a (a (a)?)?)?` * grammars: add troubleshooting section to readme * json: cap length of numbers to 15 digits before/after decimal point (avoids infinite gen, e.g. "one third" -> `0.333333333333...`) * json: unify all repetition code (w/ or w/o sep) * json: support string minLength/maxLength * server+json: update server/README w/ result_format * nits * json: fix type error w/ python 3.8 * json: fix server/README (json_schema in /completion vs. result_format in /v1/chat/completions) * json: simplify DOT `{"type": "string", "pattern": "^.$"}` * json: remove recursion in opt_repetitions (avoids Python stack overflow) * json: rm dead code * json: rm useless assert & ggml.h import	2024-04-12 19:43:38 +01:00
Pierrick Hymbert	24ee66ed0d	server : coherent log output for KV cache full (#6637 )	2024-04-12 14:49:21 +03:00
Ralph Soika	b3a96f27f0	minor layout improvements (#6572 ) * minor layout improvements * added missing file, run deps.sh locally	2024-04-10 19:18:25 +02:00
Jared Van Bortel	1b67731e18	BERT tokenizer fixes (#6498 ) Key changes: * BERT conversion: fix abuse of LlamaHfVocab, do not set BOS or EOS * Nomic Embed conversion: pad vocab instead of slicing embedding tensor * llama_tokenize: handle added special tokens like HF does	2024-04-09 13:44:08 -04:00
Ed Lee	400d5d722d	server : detect search query to start webchat (#6554 )	2024-04-09 10:31:47 +02:00
Jan Boon	beea6e1b16	llama : save and restore kv cache for single seq id (#6341 ) * llama : save and restore kv cache for single seq id * remove trailing whitespace * respond error in case there's no space in the kv cache * add kv seq save restore to test case * add --slot-save-path arg to enable save restore and restrict save location * Returning 0 for some cases, instead of asserting. * cleanup error cases * rename sequence state functions * rename state get set functions * add previous function names back in with DEPRECATED notice * update doc * adjust endpoints to preferred style * fix restoring zero cell count * handle seq rm return value * unused param * keep in the size check * fix return types * add server test case for slot save restore * cleanup * add cake * cleanup style * add special * removing a whole sequence never fails * move sequence state file functionality from server to llama to match session api and add version tags * catch exceptions on save as well * error log messages * check types for stricter restore * update server doc * readme : update API changes date * strict filename validation * move include, reject bom as well * also reject empty filename * reject whitespace and trailing dot --------- Co-authored-by: Martin Evans <martindevans@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-08 15:43:30 +03:00
Pierrick Hymbert	75cd4c7729	ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495 ) * ci: bench: support sse and fix prompt processing time server: add tokens usage in stream mode * ci: bench: README.md EOL * ci: bench: remove total pp and tg as it is not accurate * ci: bench: fix case when there is no token generated * ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics * ci: bench: fix finish reason rate	2024-04-06 05:40:47 +02:00
Shakhar Dasgupta	2e66913e5f	server: allow penalizing repetition of newlines on server webpage (#6431 )	2024-04-04 17:03:00 +02:00
Pierrick Hymbert	7a2c92637a	ci: bench: add more ftype, fix triggers and bot comment (#6466 ) * ci: bench: change trigger path to not spawn on each PR * ci: bench: add more file type for phi-2: q8_0 and f16. - do not show the comment by default * ci: bench: add seed parameter in k6 script * ci: bench: artefact name perf job * Add iteration in the commit status, reduce again the autocomment * ci: bench: add per slot metric in the commit status * Fix trailing spaces	2024-04-04 12:57:58 +03:00
Georgi Gerganov	4399f13fb9	server : remove obsolete --memory-f32 option	2024-04-04 09:34:58 +03:00
Xiao-Yong Jin	1a43c7254e	server : add option to disable KV offload (#6468 )	2024-04-04 09:33:48 +03:00
Fattire	5fb1574c81	A few small fixes to server's README docs (#6428 ) * Typo fix to server's README.md Fix minor typo ("tonen") in server README. * server readme grammar/style fixes. Quickly went through this file to look for inconsistencies in presentation of defaults, flag options, and looked for typos and grammar issues. Not perfect, but hopefully improved. * Update README.md Remove an extra space before newline.	2024-04-03 22:22:57 +02:00
JH23X	60cdf40cc3	server : handle exception on wrong type in request (#6452 ) Co-authored-by: Jonas Holzner <jonas.holzner.external@hensoldt.net>	2024-04-03 21:09:52 +03:00
Eric Zhang	6902cb7f2e	server : stop gracefully on SIGTERM (#6348 )	2024-03-28 09:50:48 +01:00
Pierrick Hymbert	a016026a3a	server: continuous performance monitoring and PR comment (#6283 ) * server: bench: init * server: bench: reduce list of GPU nodes * server: bench: fix graph, fix output artifact * ci: bench: add mermaid in case of image cannot be uploaded * ci: bench: more resilient, more metrics * ci: bench: trigger build * ci: bench: fix duration * ci: bench: fix typo * ci: bench: fix mermaid values, markdown generated * typo on the step name Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> * ci: bench: trailing spaces * ci: bench: move images in a details section * ci: bench: reduce bullet point size --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-03-27 20:26:49 +01:00
Eric Zhang	0642b22cd1	server: public: use relative routes for static files (#6325 ) server: public: support custom `api_url`, default to relative base path	2024-03-27 06:55:29 +01:00
compilade	557410b8f0	llama : greatly reduce output buffer memory usage (#6122 ) * llama : greatly reduce logits memory usage * llama : more compact state saving and reloading * llama : fix lctx.n_outputs not being set before building graph * perplexity : adapt to the logits API changes * perplexity : fix Winogrande, use correct logits for second choice start The first logits used to evaluate the second choice were not from the end of the common prefix; instead, they were the logits from the end of the first choice. This has been corrected. The previous implementation sometimes had outliers in the scores of choices for some tasks, and the logic to skip choices words in the log-likelihood evaluation probably was an attempt to reduce those, but it was complex and didn't quite seem to be the right thing. This is simpler now, and the outlier scores aren't there anymore. * perplexity : normalize spaces and punctuation in Winogrande sentences * llama : fix embedding conditions * llama : fix llama_get_embeddings_ith when the resulting id is 0 * llama : fix wrong n_outputs in llama_set_inputs A mismatch happened when using a smaller n_ubatch than n_batch and then using llama_batch_get_one(). The decision of what n_outputs should be now almost fully depends on how lctx.n_outputs is set in llama_decode_internal. The conditions are simpler this way. * llama : when saving the state, recalculate n_outputs This ensures the correct number of outputs for the entire previous batch is stored in the session file, even when n_ubatch is smaller than n_batch. * llama : fix not-skipping outputs of non-causal models * llama : fix running a batch with n_outputs == 0 It previously worked because lctx.inp_out_ids was not initialized, so it pointed to some garbage address which was somehow still valid when I ran my tests. * llama : keep same graph topology even when n_outputs == 0 * ggml : saner ggml_can_repeat with empty tensors * ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1 * ggml : do not multi-thread ops returning empty tensors * ggml : make ggml_is_empty public and work with views * llama : use a vector for ctx->output_ids * llama : rework reallocation logic for llama_output_reserve Now comparing the actual size with the new total size of the output buffer to allow more efficient enabling and disabling of the embeddings and/or logits output in the future. * ggml : skip empty tensors in all backends * llama : fix llama_output_reserve nullptr deref when new_size is 0 * perplexity : make Winogrande work as it does on master The problems with the Winogrande implementation will need to be fixed in a separate PR to ease review. * llama : clearer error messages for invalid logits or embeddings ids * llama : assert all models that can have inp_out_ids Since the graph topology is now constant, this presence check can be done even when there are no outputs. * llama : assert logits and embd buffers exist before writing to them * llama : handle errors from llama_output_reserve at call sites * perplexity : make hellaswag and multiple-choice outputs identical to master Due to how the KV cache is updated, the logprobs for tokens in a batch are very slightly affected by the other tokens present in the batch, so to make hellaswag and multiple-choice return exactly the same results as on master, the last token of each sequence needs to be evaluated even though its output is not used at all. This will probably be changed back in the future to make these benchmarks a tiny bit faster. * perplexity : fix division by zero when using less than 100 multiple-choice tasks * llama : allow loading state saved with a different ctx size When loading a session file, the context size is now only required to be at least enough to load the KV cells contained in that session file, instead of requiring to use exactly the same context size as when saving. Doing this enables the use-case of extending or shrinking the context size of a saved session. This breaks existing session files because the meaning of kv_buf_size is slightly changed (previously it was the size of the whole KV cache, now it's only the size of the saved part of it). This allows for finer-grained sanity checks when loading in an effort to keep kv_buf_size useful even when the kv_size is changed. * llama : minor ggml-ci * readme : update recent API changes, and warn about Vulkan --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-26 16:46:41 +02:00
Jan Boon	3d032ece8e	server : add `n_discard` parameter (#6300 )	2024-03-26 10:47:43 +02:00
slaren	280345968d	cuda : rename build flag to LLAMA_CUDA (#6299 )	2024-03-26 01:16:01 +01:00
Xuan Son Nguyen	ad3a0505e3	Server: clean up OAI params parsing function (#6284 ) * server: clean up oai parsing function * fix response_format * fix empty response_format * minor fixes * add TODO for logprobs * update docs	2024-03-25 09:42:17 +01:00
Pierrick Hymbert	f482bb2e49	common: llama_load_model_from_url split support (#6192 ) * llama: llama_split_prefix fix strncpy does not include string termination common: llama_load_model_from_url: - fix header name case sensitive - support downloading additional split in parallel - hide password in url * common: EOL EOF * common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition * common: change max url max length * common: minor comment * server: support HF URL options * llama: llama_model_loader fix log * common: use a constant for max url length * common: clean up curl if file cannot be loaded in gguf * server: tests: add split tests, and HF options params * common: move llama_download_hide_password_in_url inside llama_download_file as a lambda * server: tests: enable back Release test on PR * spacing Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * spacing Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * spacing Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-23 18:07:00 +01:00
Pierrick Hymbert	1997577d5e	server: docs: `--threads` and `--threads`, `--ubatch-size`, `--log-disable` (#6254 )	2024-03-23 18:00:38 +01:00
Pierrick Hymbert	1b26aebe4d	server: flush stdout after logging in both text and json layout (#6253 )	2024-03-23 13:18:45 +01:00
Olivier Chafik	72114edf06	json-schema-to-grammar : fix order of props + non-str const/enum (#6232 ) * json: ordered json in server/schema converter to respect orig order * json: ws nits * json: support non-string const / enums	2024-03-22 15:07:44 +02:00
Jan Boon	6b8bb3a31d	server : fix n_keep always showing as 0 in response (#6211 )	2024-03-22 13:12:05 +02:00
Georgi Gerganov	68e210b354	server : enable continuous batching by default (#6231 )	2024-03-22 13:08:28 +02:00
Jan Boon	be07a03217	server : update readme doc from `slot_id` to `id_slot` (#6213 )	2024-03-21 23:41:24 +01:00
Olivier Chafik	5b7b0ac8df	json-schema-to-grammar improvements (+ added to server) (#5978 ) * json: fix arrays (disallow `[,1]`) * json: support tuple types (`[number, string]`) * json: support additionalProperties (`{[k: string]: [string,number][]}`) * json: support required / optional properties * json: add support for pattern * json: resolve $ref (and support https schema urls) * json: fix $ref resolution * join: support union types (mostly for nullable types I think) * json: support allOf + nested anyOf * json: support any (`{}` or `{type: object}`) * json: fix merge * json: temp fix for escapes * json: spaces in output and unrestricted output spaces * json: add typings * json:fix typo * Create ts-type-to-grammar.sh * json: fix _format_literal (json.dumps already escapes quotes) * json: merge lit sequences and handle negatives {"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"} * json: handle pattern repetitions * Update json-schema-to-grammar.mjs * Create regex-to-grammar.py * json: extract repeated regexp patterns to subrule * Update json-schema-to-grammar.py * Update json-schema-to-grammar.py * Update json-schema-to-grammar.py * json: handle schema from pydantic Optional fields * Update json-schema-to-grammar.py * Update json-schema-to-grammar.py * Update ts-type-to-grammar.sh * Update ts-type-to-grammar.sh * json: simplify nullable fields handling * json: accept duplicate identical rules * json: revert space to 1 at most * json: reuse regexp pattern subrules * json: handle uuid string format * json: fix literal escapes * json: add --allow-fetch * json: simplify range escapes * json: support negative ranges in patterns * Delete commit.txt * json: custom regex parser, adds dot support & JS-portable * json: rm trailing spaces * Update json-schema-to-grammar.mjs * json: updated server & chat `( cd examples/server && ./deps.sh )` * json: port fixes from mjs to python * Update ts-type-to-grammar.sh * json: support prefixItems alongside array items * json: add date format + fix uuid * json: add date, time, date-time formats * json: preserve order of props from TS defs * json: port schema converter to C++, wire in ./server * json: nits * Update json-schema-to-grammar.cpp * Update json-schema-to-grammar.cpp * Update json-schema-to-grammar.cpp * json: fix mjs implementation + align outputs * Update json-schema-to-grammar.mjs.hpp * json: test C++, JS & Python versions * json: nits + regen deps * json: cleanup test * json: revert from c++17 to 11 * json: nit fixes * json: dirty include for test * json: fix zig build * json: pass static command to std::system in tests (fixed temp files) * json: fix top-level $refs * json: don't use c++20 designated initializers * nit * json: basic support for reserved names `{number:{number:{root:number}}}` * Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test) * json: re-ran server deps.sh * json: simplify test * json: support mix of additional props & required/optional * json: add tests for some expected failures * json: fix type=const in c++, add failure expectations for non-str const&enum * json: test (& simplify output of) empty schema * json: check parsing in test + fix value & string refs * json: add server tests for OAI JSON response_format * json: test/fix top-level anyOf * json: improve grammar parsing failures * json: test/fix additional props corner cases * json: fix string patterns (was missing quotes) * json: ws nit * json: fix json handling in server when there's no response_format * json: catch schema conversion errors in server * json: don't complain about unknown format type in server if unset * json: cleaner build of test * json: create examples/json-schema-pydantic-example.py * json: fix date pattern * json: move json.hpp & json-schema-to-grammar.{cpp,h} to common * json: indent 4 spaces * json: fix naming of top-level c++ function (+ drop unused one) * json: avoid using namespace std * json: fix zig build * Update server.feature * json: iostream -> fprintf * json: space before & refs for consistency * json: nits	2024-03-21 11:50:43 +00:00
Xuan Son Nguyen	91f8ad167d	Server: version bump for httplib and json (#6169 ) * server: version bump for httplib and json * fix build * bring back content_length	2024-03-20 13:30:36 +01:00
Georgi Gerganov	bc0baab2ea	server : allow to override -ngl in tests (#6170 )	2024-03-20 14:14:32 +02:00
Karthick	47cc7a7bf9	Server: Handle n_keep parameter in the request (#6174 )	2024-03-20 12:02:34 +01:00
Jared Van Bortel	bd60d82d0c	server tests : more pythonic process management; fix bare `except:` (#6146 ) * server tests : remove seemingly redundant newlines in print() * server tests : use built-in subprocess features, not os.kill and psutil * server tests : do not catch e.g. SystemExit; use print_exc * server tests: handle TimeoutExpired exception * server tests: fix connect on dual-stack systems * server: tests: add new tokens regex on windows generated following new repeat penalties default changed in (#6127) * server: tests: remove the hack on windows since now we get the good socket family * server: tests: add new tokens regex following new repeat penalties default changed in (#6127) * server: tests: add new tokens regex following new repeat penalties default changed in (#6127) --------- Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>	2024-03-20 06:33:49 +01:00
Pierrick Hymbert	d01b3c4c32	common: llama_load_model_from_url using --model-url (#6098 ) * common: llama_load_model_from_url with libcurl dependency Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-17 19:12:37 +01:00
Pierrick Hymbert	43241adf22	server: disable debug release type sanitizer, simplify trigger (#6047 ) - increase time out for server - do not fail fast	2024-03-14 13:15:39 +02:00
slaren	f30ea47a87	llama : add pipeline parallelism support (#6017 ) * llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs ggml-ci * server : add -ub, --ubatch-size parameter * fix server embedding test * llama : fix Mamba inference for pipeline parallelism Tested to work correctly with both `main` and `parallel` examples. * llama : limit max batch size to n_batch * add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism default increase to 4 (from 2) changing this value may improve performance for some systems, but increases memory usage * fix hip build * fix sycl build (disable cpy_tensor_async) * fix hip build * llama : limit n_batch and n_ubatch to n_ctx during context creation * llama : fix norm backend * batched-bench : sync after decode * swiftui : sync after decode * ggml : allow ggml_get_rows to use multiple threads if they are available * check n_ubatch >= n_tokens with non-casual attention * llama : do not limit n_batch to n_ctx with non-casual attn * server : construct batch with size of llama_n_batch * ggml_backend_cpu_graph_compute : fix return value when alloc fails * llama : better n_batch and n_ubatch comment * fix merge * small fix * reduce default n_batch to 2048 --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-13 18:54:21 +01:00
Xuan Son Nguyen	99b71c068f	Server: Use multi-task for embeddings endpoint (#6001 ) * use multitask for embd endpoint * specify types * remove redundant {"n_predict", 0}	2024-03-13 11:39:11 +01:00
Jakub N	828defefb6	Update server docker image URLs (#5997 )	2024-03-11 14:40:42 +01:00

1 2 3 4 5 ...

482 Commits