llama.cpp

mirror of https://github.com/ggerganov/llama.cpp.git synced 2024-11-11 21:39:52 +00:00

Author	SHA1	Message	Date
Cuong Trinh Manh	97bbca6e85	cmake : fix ld warning duplicate libraries libllama.a (#4671 ) * fix "ld: warning: ignoring duplicate libraries: '../libllama.a'" * fix warning in example.	2023-12-29 16:39:15 +02:00
Justine Tunney	db49ff8ed7	server : replace sleep with condition variables (#4673 ) The server currently schedules tasks using a sleep(5ms) busy loop. This adds unnecessary latency since most sleep implementations do a round up to the system scheduling quantum (usually 10ms). Other libc sleep impls spin for smaller time intervals which results in the server's busy loop consuming all available cpu. Having the explicit notify() / wait() code also helps aid in the readability of the server code. See mozilla-Ocho/llamafile@711344b	2023-12-29 16:24:12 +02:00
SakuraUmi	60f55e888c	server : fix OpenAI server sampling w.r.t. penalty. (#4675 )	2023-12-29 16:22:44 +02:00
Karthik Sethuraman	b93edd22f5	server : allow to generate multimodal embeddings (#4681 )	2023-12-29 16:22:10 +02:00
Justine Tunney	65e5f6dadb	Fix OpenAI server sampling w.r.t. temp and seed (#4668 ) The default values for tfs_z and typical_p were being set to zero, which caused the token candidates array to get shrunk down to one element thus preventing any sampling. Note this only applies to OpenAI API compatible HTTP server requests. The solution is to use the default values that OpenAI documents, as well as ensuring we use the llama.cpp defaults for the rest. I've tested this change still ensures deterministic output by default. If a "temperature" greater than 0 is explicitly passed, then output is unique each time. If "seed" is specified in addition to "temperature" then the output becomes deterministic once more. See mozilla-Ocho/llamafile#117 See mozilla-Ocho/llamafile@9e4bf29	2023-12-28 15:20:00 -04:00
Alexey Parfenov	6123979952	server : allow to specify custom prompt for penalty calculation (#3727 )	2023-12-23 11:31:49 +02:00
olexiyb	0ffc92d2d2	server : disable llm logs if SERVER_VERBOSE is off (#3792 )	2023-12-17 17:02:16 +02:00
AdithyanI	8edd2b40fd	server : fix grammar being ignored (#4494 ) Fix bug in identifying the grammar.	2023-12-17 16:57:56 +02:00
Alexey Parfenov	eb16dae7e7	server : fix possible ambiguity in content type charset (#4501 )	2023-12-17 16:56:09 +02:00
mzcu	62bd52b7bf	server : allow requests larger than 8K (#4500 )	2023-12-17 16:54:37 +02:00
ShadovvBeast	88ae8952b6	server : add optional API Key Authentication example (#4441 ) * Add API key authentication for enhanced server-client security * server : to snake_case --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-15 13:49:01 +02:00
shibe2	948ff137ec	server : fix handling of characters that span multiple tokens when streaming (#4446 )	2023-12-13 21:57:15 +02:00
kalomaze	fecac45658	server : tweak default sampling parameters (#4367 ) * Set a more typical Top P setting as the default * Update temp max	2023-12-12 12:12:35 +02:00
Richard Kiss	9494d7c477	english : use `typos` to fix comments and logs (#4354 )	2023-12-12 11:53:36 +02:00
Vladimir Zorin	d9d4cfef64	server : fix local model name in server (#4420 )	2023-12-12 11:25:29 +02:00
Yueh-Po Peng	8a7b2fa528	Update README.md (#4388 ) Fix small typo.	2023-12-10 23:27:38 +01:00
Georgi Gerganov	bcc0eb4591	llama : per-layer KV cache + quantum K cache (#4309 ) * per-layer KV * remove unnecessary copies * less code duplication, offload k and v separately * llama : offload KV cache per-layer * llama : offload K shift tensors * llama : offload for rest of the model arches * llama : enable offload debug temporarily * llama : keep the KV related layers on the device * llama : remove mirrors, perform Device -> Host when partial offload * common : add command-line arg to disable KV cache offloading * llama : update session save/load * llama : support quantum K cache (#4312) * llama : support quantum K cache (wip) * metal : add F32 -> Q8_0 copy kernel * cuda : add F32 -> Q8_0 copy kernel ggml-ci * cuda : use mmv kernel for quantum cache ops * llama : pass KV cache type through API * llama : fix build ggml-ci * metal : add F32 -> Q4_0 copy kernel * metal : add F32 -> Q4_1 copy kernel * cuda : wip * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels * llama-bench : support type_k/type_v * metal : use mm kernel only for quantum KV cache * cuda : add comment * llama : remove memory_f16 and kv_f16 flags --------- Co-authored-by: slaren <slarengh@gmail.com> * readme : add API change notice --------- Co-authored-by: slaren <slarengh@gmail.com>	2023-12-07 13:03:17 +02:00
Georgi Gerganov	05cd6e5036	server : recognize cache_prompt parameter in OAI API (#4347 )	2023-12-06 20:21:59 +02:00
Ed Lee	33e171d1e9	server : fix OpenAI API `stop` field to be optional (#4299 ) (cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84)	2023-12-03 11:10:43 +02:00
Rickard Edén	6949b50df5	py : add grammar to oai like api (#4294 )	2023-12-03 11:03:25 +02:00
Georgi Gerganov	d5a1cbde60	llama : support optional tensors (#4283 )	2023-12-01 20:35:47 +02:00
Ziad Ben Hadj-Alouane	1d144112c0	server : add --log-disable to disable logging to file (#4260 ) * * add --log-disable to disable logging to file in the server example * * typo fix	2023-12-01 00:25:49 +02:00
Ziad Ben Hadj-Alouane	f43f09366d	server : add single-client multi-prompt support (#4232 ) * * add multiprompt support * * cleanup * * more cleanup * * remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests * * remove all references to mutex_multitasks * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * * change to set --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2023-12-01 00:25:04 +02:00
rhjdvsgsgks	e2bd725f4b	py : fix oai proxy (#3972 ) * fix oai proxy fix generation not stoped while bot stop talking in chat mode fix possible `slot_id` not exist response for cors (and pre flight) * oai proxy: workaround for some client (such as Chatbox) * use stop as separator to replace hardcoded `\n`	2023-11-30 22:50:40 +02:00
Georgi Gerganov	af19d35734	server : OAI API compatibility (#4198 ) * Add openai-compatible POST /v1/chat/completions API endpoint to server example * fix code style * Update server README.md * Improve server README.md * Fix server.cpp code style according to review * server : some style changes * server : indentation * server : enable special tokens during tokenization by default * server : minor code style * server : change random string generator * straightforward /v1/models endpoint --------- Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com> Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>	2023-11-25 11:29:06 +02:00
Haohui Mai	55978ce09b	Fix incorrect format strings and uninitialized variables. (#4133 ) * Fix incorrect format strings and uninitialized variables. * Address comments * Add the missing include statement	2023-11-23 22:56:53 +01:00
SoftwareRenderer	936c79b227	server : relay error messages (#4131 )	2023-11-19 18:54:10 +02:00
Kerfuffle	91f6499393	Respect tokenizer.ggml.add_bos_token value when tokenizing (#4040 ) * gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode. * Respect add_bos_token GGUF metadata value * gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time	2023-11-16 19:14:37 -07:00
Alexey Parfenov	d96ca7ded7	server : fix crash when prompt exceeds context size (#3996 )	2023-11-10 23:48:21 -06:00
Jhen-Jie Hong	4a4fd3eefa	server : allow continue edit on completion mode (#3950 ) * server : allow continue edit on completion mode * server : handle abort case in runCompletion * server : style improvement	2023-11-10 16:49:33 -06:00
Mihai	57ad015dc3	server : add min_p param (#3877 ) * Update server.cpp with min_p after it was introduced in https://github.com/ggerganov/llama.cpp/pull/3841 * Use spaces instead of tabs * Update index.html.hpp after running deps.sh * Fix test - fix line ending	2023-11-08 20:00:34 -06:00
Damian Stewart	381efbf480	llava : expose as a shared library for downstream projects (#3613 ) * wip llava python bindings compatibility * add external llava API * add base64 in-prompt image support * wip refactor image loading * refactor image load out of llava init * cleanup * further cleanup; move llava-cli into its own file and rename * move base64.hpp into common/ * collapse clip and llava libraries * move llava into its own subdir * wip * fix bug where base64 string was not removed from the prompt * get libllava to output in the right place * expose llava methods in libllama.dylib * cleanup memory usage around clip_image_* * cleanup and refactor again * update headerdoc * build with cmake, not tested (WIP) * Editorconfig * Editorconfig * Build with make * Build with make * Fix cyclical depts on Windows * attempt to fix build on Windows * attempt to fix build on Windows * Upd TODOs * attempt to fix build on Windows+CUDA * Revert changes in cmake * Fix according to review comments * Support building as a shared library * address review comments --------- Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com> Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2023-11-07 00:36:23 +03:00
Thái Hoàng Tâm	bb60fd0bf6	server : fix typo for --alias shortcut from -m to -a (#3958 )	2023-11-05 18:15:27 +02:00
cebtenzzre	b12fa0d1c1	build : link against build info instead of compiling against it (#3879 ) * cmake : fix build when .git does not exist * cmake : simplify BUILD_INFO target * cmake : add missing dependencies on BUILD_INFO * build : link against build info instead of compiling against it * zig : make build info a .cpp source instead of a header Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com> * cmake : revert change to CMP0115 --------- Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>	2023-11-02 08:50:16 +02:00
cebtenzzre	898aeca90a	llama : implement YaRN RoPE scaling (#2268 ) Co-authored-by: cebtenzzre <cebtenzzre@gmail.com> Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>	2023-11-01 18:04:33 -04:00
Adrian Hesketh	ca190bca8e	server : re-enable completion and embedded at the same time (#3876 )	2023-11-01 11:28:28 +02:00
Kerfuffle	6e08281e58	Extend llama_kv_cache_seq_rm to allow matching any sequence (#3843 ) * Extend llama_kv_cache_seq_rm to allow matichng any sequence * Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear Use llama_kv_cache_clear for cache clearing Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality	2023-10-29 11:31:40 -06:00
Georgi Gerganov	34b2a5e1ee	server : do not release slot on image input (#3798 )	2023-10-26 22:54:17 +03:00
cebtenzzre	ad93962657	server : add parameter -tb N, --threads-batch N (#3584 ) (#3768 ) Co-authored-by: Michael Coppola <m18coppola@gmail.com> Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2023-10-24 23:10:43 +03:00
Georgi Gerganov	1717521cdb	server : do not block system prompt update (#3767 ) * server : do not block system prompt update * server : update state machine logic to process system prompts * server : minor	2023-10-24 23:08:20 +03:00
Marcus Dunn	5be6c803fa	llama : remove token functions with `context` args in favor of `model` (#3720 ) * added `llama_model_token_` variants to all the `llama_token_` functions. * added `LLAMA_API` * formatting Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * removed old `llama_token` functions * changed 3 more functions to take in model - `llama_token_get_text` - `llama_token_get_score` - `llama_token_get_type` * added back docs * fixed main.cpp * changed token functions to use new model variants * changed token functions to use new model variants --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-23 22:40:03 +03:00
Georgi Gerganov	438c2ca830	server : parallel decoding and multimodal (#3677 ) * implementing parallel decoding in server example * crash fixed * save dev progress * refactored sampling function * completion endpoint working * multiple client support * grammar + no stream completion * cached prompt support * chat.mjs support cached prompt + some fixes * server ui now support multiple clients * unused change reverted * fixed timings per slot * add context swap * add changes to README.md * llava multimodal integration * fixed tokens probs * add multimodal input - alfa * refactor code + remove unused comments + improved README.md * fix compilation errors with llvm * notify the user from server ui that multimodality is unavialable * some ci fixes * fix ci make build undefined ref errors * fix long prompt than ctx proposed in #3639 * fixed premature end due stop word * context shift fixed * fix llava implementation * sync README.md changes * readme change * update api like OpenAI * multimodal support enabled by default * fix make bui;d errors * fix multiple clients * fix zig build * new sampling API * latest changes of sampling API * server : coding-style normalization * server : coding-style normalization (part 2) * server : remove beam-search functionality * server : bug fix in ingest_images n_tokens is incremented internally by llama_batch_add * server : use refs + use llama_batch_clear() * server : snake case * server : minor sync * added thread safe pipeline * server : bach has to be allocated for n_parallel sequences * server : no need for atomic int - already using mutex * server : logs + minor code style * server : fix multibyte handle in partial response (#3706) * fix image load + view image in chat * make : silence stb warnings * clip : link to ggml, not to llama * server : fix switch fallthrough * server : fix crash in Debug on macOS (I have no idea why this fixes it!?) * server : refactor ctx_sampling init + n_ctx + names * server : bug fix for prompt caching * Do not save/load image_data to localStorage * editorconfig : new line in index.html * server : completion requests remember slot_id * Update readme to document multimodal in server * server : minor style * Update readme to document multimodal in server * server : hide ctx_sampling->prev behind API (#3696) * server : apply fix from #3722 * server : fix slot reuse * server : add comment about changing slot_state to bool --------- Co-authored-by: FSSRepo <go778sgt@gmail.com> Co-authored-by: Damian Stewart <d@damianstewart.com> Co-authored-by: Steward Garcia <57494570+FSSRepo@users.noreply.github.com> Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com> Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>	2023-10-22 22:53:08 +03:00
Georgi Gerganov	d1031cf49c	sampling : refactor init to use llama_sampling_params (#3696 ) * sampling : refactor init to use llama_sampling_params * llama : combine repetition, frequency and presence penalties in 1 call * examples : remove embd-input and gptneox-wip * sampling : rename penalty params + reduce size of "prev" vector * sampling : add llama_sampling_print helper * sampling : hide prev behind API and apply #3661 ggml-ci	2023-10-20 21:07:23 +03:00
Georgi Gerganov	a0edf73bda	server : fix uninitialized sampling context (close #3685 )	2023-10-20 13:06:10 +03:00
Georgi Gerganov	0e89203b51	speculative : add tree-based sampling example (#3624 ) * sampling : one sequence per sampling context ggml-ci * speculative : add tree-based sampling support ggml-ci * speculative : reuse the n_parallel CLI param * speculative : refactor sampling * examples : fix build after sampling refactoring ggml-ci * batched : fix n_seq_id * sampling : fix malloc ggml-ci * swift : fix build ggml-ci * swift : try to fix build ggml-ci * prompts : add assistant.txt * common : add llama_batch_add() and llama_batch_clear() helpers * speculative : minor refactor ggml-ci * minor : comments + rename ggml-ci * speculative : fix off-by-one for n_drafted * speculative : fix the n_drafted fix + p constants	2023-10-18 16:21:57 +03:00
Georgi Gerganov	e74c705e15	editorconfig : remove trailing spaces	2023-10-17 19:52:53 +03:00
coezbek	3ad1e3f1a1	server : documentation of JSON return value of /completion endpoint (#3632 ) * Added documentation of JSON return value of /completion endpoint * Update examples/server/README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-17 19:51:02 +03:00
Aarni Koskela	b016596d90	server : add completion mode (no chat) (#3582 )	2023-10-12 09:51:53 +03:00
Georgi Gerganov	57dd55e2c7	server : fix kv cache management (#3588 )	2023-10-12 09:29:04 +03:00
Michael Coppola	a8bdd65525	server : add parameter -tb N, --threads-batch N (#3584 ) Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2023-10-11 22:42:22 +03:00

1 2 3

129 Commits