Eric Curtin
dc7cef9f37
llama-run : fix context size ( #11094 )
...
Set `n_ctx` equal to `n_batch` in `Opt` class. Now context size is
a more reasonable 2048.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-06 23:45:28 +01:00
Georgi Gerganov
e6e7c75d94
server : fix extra BOS in infill endpoint ( #11106 )
...
* server : fix extra BOS in infill endpoint
ggml-ci
* server : update infill tests
2025-01-06 15:36:08 +02:00
Georgi Gerganov
47182dd03f
llama : update llama_model API names ( #11063 )
...
* llama : deprecate llama_free_model, add llama_model_free
ggml-ci
* llama : change `llama_load_model_from_file` -> `llama_model_load_from_file`
ggml-ci
2025-01-06 10:55:18 +02:00
Georgi Gerganov
3e6e7a6bc2
tokenize : escape the prompt ( #11058 )
...
* tokenize : escape the prompt
* tokenize : update help
2025-01-06 10:54:25 +02:00
Georgi Gerganov
727368c60f
llama : use LLAMA_TOKEN_NULL ( #11062 )
...
ggml-ci
2025-01-06 10:52:15 +02:00
Georgi Gerganov
f66f582927
llama : refactor src/llama.cpp ( #10902 )
...
* llama : scatter llama.cpp into multiple modules (wip)
* llama : control-vector -> adapter
* llama : arch
* llama : mmap
ggml-ci
* ci : remove BUILD_SHARED_LIBS=OFF
ggml-ci
* llama : arch (cont)
ggml-ci
* llama : chat
ggml-ci
* llama : model
ggml-ci
* llama : hparams
ggml-ci
* llama : adapter
ggml-ci
* examples : fix
ggml-ci
* rebase
ggml-ci
* minor
* llama : kv cache
ggml-ci
* llama : impl
ggml-ci
* llama : batch
ggml-ci
* cont
ggml-ci
* llama : context
ggml-ci
* minor
* llama : context (cont)
ggml-ci
* llama : model loader
ggml-ci
* common : update lora
ggml-ci
* llama : quant
ggml-ci
* llama : quant (cont)
ggml-ci
* minor [no ci]
2025-01-03 10:18:53 +02:00
Pierrick Hymbert
2f0ee84b9b
server: bench: minor fixes ( #10765 )
...
* server/bench:
- support openAI streaming standard output with [DONE]\n\n
- export k6 raw results in csv
- fix too many tcp idle connection in tcp_wait
- add metric time to emit first token
* server/bench:
- fix when prometheus not started
- wait for server to be ready before starting bench
2025-01-02 18:06:12 +01:00
Xuan Son Nguyen
0da5d86026
server : allow using LoRA adapters per-request ( #10994 )
...
* slot.can_batch_with
* lora per request
* test: force disable cache prompt
* move can_batch_with check
* fix condition
* add slow test with llama 8b
* update docs
* move lora change task to queue
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* lora_base
* remove redundant check
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-02 15:05:18 +01:00
Xuan Son Nguyen
45095a61bf
server : clean up built-in template detection ( #11026 )
...
* server : clean up built-in template detection
* fix compilation
* add chat template test
* fix condition
2024-12-31 15:22:01 +01:00
Xuan Son Nguyen
5896c65232
server : add OAI compat for /v1/completions ( #10974 )
...
* server : add OAI compat for /v1/completions
* add test
* add docs
* better docs
2024-12-31 12:34:13 +01:00
Peter
6e1531aca5
common, examples, ggml : fix MSYS2 GCC compiler errors and warnings when building with LLAMA_CURL=ON and GGML_OPENCL=ON ( #11013 )
...
In common/common.cpp:
* Convert usage of stat() function call to check if file exists to standard library function std::filesystem::exists (error unable to match to correct function signature)
* Additional conditions to check if PATH_MAX is already defined in WIN32 environment (warning it is already defined in MSYS2)
In examples/run/run.cpp:
* Add io.h header inclusion (error cannot find function _get_osfhandle)
* Change initialisers for OVERLAPPED to empty struct (warning about uninitialised members)
* Add initialiser for hFile (warning it may be uninitialised)
* Add cast for curl_off_t percentage value to long int in generate_progress_prefix function (warning that curl_off_t is long long int)
In ggml/src/ggml-opencl/ggml-opencl.cpp:
* Initialise certain declared cl_mem variables to nullptr for greater safety (warning about B_d variable possibly used unassigned)
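The first item above replaces a `stat()`-based existence check with `std::filesystem::exists`. A minimal sketch of that pattern follows; the helper name `file_exists` is hypothetical and is not taken from `common/common.cpp`, and the non-throwing overload is a choice made here, not something the commit message specifies.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper: portable "does this path exist?" check that avoids
// the platform-specific stat()/_stat() signatures mentioned in the message.
bool file_exists(const std::string & path) {
    std::error_code ec;
    // The error_code overload does not throw; on failure it returns false
    // and stores the reason in ec, which this sketch ignores.
    return std::filesystem::exists(std::filesystem::path(path), ec);
}
```

This requires C++17. Calling `file_exists(".")` should return true in any working directory, while a path that is not present returns false.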
2024-12-31 01:46:06 +01:00
ag2s20150909
c250ecb315
android : fix llama_batch free ( #11014 )
2024-12-30 14:35:13 +02:00
Isaac McFadyen
f865ea149d
server: added more docs for response_fields field ( #10995 )
2024-12-28 16:09:19 +01:00
Alexey Parfenov
16cdce7b68
server : fix token duplication when streaming with stop strings ( #10997 )
2024-12-28 16:08:54 +01:00
Peter
d283d02bf2
examples, ggml : fix GCC compiler warnings ( #10983 )
...
Warning types fixed (observed under MSYS2 GCC 14.2.0):
* format '%ld' expects argument of type 'long int', but argument has type 'size_t'
* llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers] (emitted for all struct field except first)
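The first warning type arises when a `size_t` value is passed to a `%ld` conversion. The message does not say how each call site was changed; one common remedy, sketched below under that assumption, is the `%zu` conversion, which is defined for `size_t`. The function name `describe_size` is hypothetical.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a size_t without a format/argument mismatch.
std::string describe_size(std::size_t n) {
    char buf[64];
    // "%zu" matches size_t on every conforming C99/C++11 implementation,
    // whereas "%ld" is only correct where size_t happens to be long.
    std::snprintf(buf, sizeof(buf), "size = %zu", n);
    return std::string(buf);
}
```

An alternative with the same effect is to cast the value explicitly to the type the conversion expects.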
2024-12-26 14:59:11 +01:00
Reza Kakhki
9ba399dfa7
server : add support for "encoding_format": "base64" to the */embeddings endpoints ( #10967 )
...
* add support for base64
* fix base64 test
* improve test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
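As an illustration of what `"encoding_format": "base64"` implies, the sketch below encodes the raw bytes of a float vector with standard base64 (RFC 4648 alphabet, `=` padding). It assumes the embedding is serialized as packed 32-bit floats in host byte order; the commit message does not state the layout, and both function names are hypothetical and are not the server's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Standard base64 over an arbitrary byte buffer.
std::string base64_encode(const unsigned char * data, std::size_t len) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((len + 2) / 3) * 4);
    for (std::size_t i = 0; i < len; i += 3) {
        const std::size_t rem = len - i; // bytes left in the input, at least 1
        std::uint32_t v = static_cast<std::uint32_t>(data[i]) << 16;
        if (rem > 1) v |= static_cast<std::uint32_t>(data[i + 1]) << 8;
        if (rem > 2) v |= static_cast<std::uint32_t>(data[i + 2]);
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += rem > 1 ? tbl[(v >> 6) & 63] : '=';
        out += rem > 2 ? tbl[v & 63] : '=';
    }
    return out;
}

// Assumption: the embedding is sent as packed float32 values in host byte order.
std::string embedding_to_base64(const std::vector<float> & emb) {
    return base64_encode(reinterpret_cast<const unsigned char *>(emb.data()),
                         emb.size() * sizeof(float));
}
```

A client would reverse the process: decode the string to bytes and reinterpret every four bytes as one float.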
2024-12-24 21:33:04 +01:00
Djip007
2cd43f4900
ggml : more perfo with llamafile tinyblas on x86_64 ( #10714 )
...
* more perfo with llamafile tinyblas on x86_64.
- add bf16 support
- change dispatch strategy (thanks: https://github.com/ikawrakow/ik_llama.cpp/pull/71 )
- reduce memory bandwidth
simple tinyblas dispatch and more cache friendly
* tinyblas dynamic dispatching
* sgemm: add M blocks.
* - git 2.47 use short id of len 9.
- show-progress is not part of GNU Wget2
* remove not stable test
2024-12-24 18:54:49 +01:00
NeverLucky
09fe2e7613
server: allow filtering llama server response fields ( #10940 )
...
* llama_server_response_fields
* llama_server_response_fields_fix_issues
* params fixes
* fix
* clarify docs
* change to "response_fields"
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-24 17:39:49 +01:00
Xuan Son Nguyen
14b699ecde
server : fix missing model id in /model endpoint ( #10957 )
...
* server : fix missing model id in /model endpoint
* fix ci
2024-12-23 12:52:25 +01:00
Xuan Son Nguyen
485dc01214
server : add system_fingerprint to chat/completion ( #10917 )
...
* server : add system_fingerprint to chat/completion
* update README
2024-12-23 12:02:44 +01:00
Radoslav Gerganov
86bf31cfe6
rpc-server : add support for the SYCL backend ( #10934 )
2024-12-23 10:39:30 +02:00
Eric Curtin
dab76c92cc
llama-run : include temperature option ( #10899 )
...
This commit updates the `examples/run/README.md` file to include a new
option for setting the temperature and updates the `run.cpp` file to
parse this option.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2024-12-23 01:21:40 +01:00
Xuan Son Nguyen
0ca416c91a
server : (UI) fix copy to clipboard function ( #10916 )
2024-12-20 14:12:06 +01:00
Georgi Gerganov
d408bb9268
clip : disable GPU support ( #10896 )
...
ggml-ci
2024-12-19 18:47:15 +02:00
Georgi Gerganov
5cab3e4aaa
llama : minor grammar refactor ( #10897 )
...
ggml-ci
2024-12-19 17:42:13 +02:00
Xuan Son Nguyen
57bb2c40cd
server : fix logprobs, make it OAI-compatible ( #10783 )
...
* server : fix logprobs, make it openai-compatible
* update docs
* add std::log
* return pre-sampling p
* sort before apply softmax
* add comment
* fix test
* set p for sampled token
* update docs
* add --multi-token-probs
* update docs
* add `post_sampling_probs` option
* update docs [no ci]
* remove --multi-token-probs
* "top_probs" with "post_sampling_probs"
* resolve review comments
* rename struct token_prob to prob_info
* correct comment placement
* fix setting prob for sampled token
2024-12-19 15:40:08 +01:00
Eric Curtin
7909e8588d
llama-run : improve progress bar ( #10821 )
...
Set default width to whatever the terminal is. Also fixed a small bug around
default n_gpu_layers value.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2024-12-19 03:58:00 +01:00
Diego Devesa
9177484f58
ggml : fix arm build ( #10890 )
...
* ggml: GGML_NATIVE uses -mcpu=native on ARM
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml: Show detected features with GGML_NATIVE
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* remove msvc support, add GGML_CPU_ARM_ARCH option
* disable llamafile in android example
* march -> mcpu, skip adding feature macros
ggml-ci
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Adrien Gallouët <angt@huggingface.co>
2024-12-18 23:21:42 +01:00
Georgi Gerganov
0bf2d10c55
tts : add OuteTTS support ( #10784 )
...
* server : add "tokens" output
ggml-ci
* server : output embeddings for all tokens when pooling = none
ggml-ci
* server : be explicit about the pooling type in the tests
ggml-ci
* server : do not normalize embeddings when there is no pooling
ggml-ci
* llama : add OuteTTS support (wip)
* wip
* extract features
* first conv
* group norm
* resnet conv
* resnet
* attn
* pos net
* layer norm
* convnext
* head
* hann window
* fix n_embd + remove llama.cpp hacks
* compute hann window
* fft
* spectrum processing
* clean-up
* tts : receive input text and generate codes
* clip : fix new conv name
* tts : minor fix
* tts : add header + minor fixes
ggml-ci
* tts : add mathematical constant
ggml-ci
* tts : fix sampling + cut initial noise
* tts : fixes
* tts : update default samplers
ggml-ci
* tts : text pre-processing
* tts : outetts-voc -> wavtokenizer-dec
* tts : remove hardcoded constants
ggml-ci
* tts : fix tensor shapes
* llama : refactor wavtokenizer tensors
ggml-ci
* cont
ggml-ci
* cont [no ci]
* llama : update WavTokenizer to non-causal attn
* llama : handle no-vocab detokenization
* tts : add Python example for OuteTTS (wip)
* tts : extend python example to generate spectrogram
ggml-ci
* server : fix rebase artifacts
* tts : enable "return_tokens" in Python example
ggml-ci
* tts : minor fixes
* common : support HF download for vocoder
2024-12-18 19:27:21 +02:00
Gaetan Bisson
7bbb5acf12
server: avoid overwriting Authorization header ( #10878 )
...
* server: avoid overwriting Authorization header
If no API key is set, leave the Authorization header as is. It may be
used by another part of the Web stack, such as an authenticating proxy.
Fixes https://github.com/ggerganov/llama.cpp/issues/10854
* rebuild
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-18 15:00:07 +01:00
Georgi Gerganov
152610eda9
server : output embeddings for all tokens when pooling = none ( #10861 )
...
* server : add "tokens" output
ggml-ci
* server : output embeddings for all tokens when pooling = none
ggml-ci
* server : update readme [no ci]
* server : fix spacing [no ci]
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : be explicit about the pooling type in the tests
ggml-ci
* server : update /embeddings and /v1/embeddings endpoints
ggml-ci
* server : do not normalize embeddings when there is no pooling
ggml-ci
* server : update readme
ggml-ci
* server : fixes
* tests : update server tests
ggml-ci
* server : update readme [no ci]
* server : remove rebase artifact
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-18 13:01:41 +02:00
Georgi Gerganov
0e70ba686e
server : add "tokens" output ( #10853 )
...
* server : add "tokens" output
ggml-ci
* server : update readme
ggml-ci
* server : return tokens ids only if requested
ggml-ci
* tests : improve "tokens" type check
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : remove "tokens" from the OAI endpoint
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-18 11:05:29 +02:00
Xuan Son Nguyen
46828872c3
server : (embeddings) using same format for "input" and "content" ( #10872 )
...
* server : (embeddings) using same format for "input" and "content"
* fix test case
* handle empty input case
* fix test
2024-12-18 10:55:09 +02:00
krystiancha
05c3a444b8
server : fill usage info in embeddings and rerank responses ( #10852 )
...
* server : fill usage info in embeddings response
* server : fill usage info in reranking response
2024-12-17 18:00:24 +02:00
Xuan Son Nguyen
227d7c5a7f
server : (UI) fix missing async generator on safari ( #10857 )
...
* server : (UI) fix missing async generator on safari
* fix
2024-12-17 09:52:09 +01:00
Georgi Gerganov
644fd71b44
sampling : refactor + optimize penalties sampler ( #10803 )
...
* sampling : refactor + optimize penalties sampler
ggml-ci
* common : apply ignore_eos as logit bias
ggml-ci
* batched : remove penalties sampler
* params : allow penalty_last_n == -1 to be equal to context size
ggml-ci
* common : by default, move the penalties at the end of the sampling chain
ggml-ci
* common : ignore all EOG tokens
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* common : move back the penalties at the front of the sampling chain
ggml-ci
* readme : restore hint about --ignore-eos flag [no ci]
* llama : minor
ggml-ci
* webui : update
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
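Two of the items above lend themselves to a small sketch: treating `ignore_eos` as a logit bias, and letting `penalty_last_n == -1` mean "use the context size". The code below is an illustration under assumptions, not the code from this commit; the `logit_bias` struct and all three function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical (token id, additive bias) pair.
struct logit_bias {
    std::int32_t token;
    float        bias;
};

// Ignoring end-of-generation tokens can be expressed as a bias of -infinity:
// after softmax those tokens have probability zero and are never sampled.
std::vector<logit_bias> make_ignore_eog_bias(const std::vector<std::int32_t> & eog_tokens) {
    std::vector<logit_bias> out;
    for (std::int32_t t : eog_tokens) {
        out.push_back({t, -std::numeric_limits<float>::infinity()});
    }
    return out;
}

// Adds each bias to the matching logit; out-of-range token ids are skipped.
void apply_bias(std::vector<float> & logits, const std::vector<logit_bias> & biases) {
    for (const logit_bias & b : biases) {
        if (b.token >= 0 && static_cast<std::size_t>(b.token) < logits.size()) {
            logits[static_cast<std::size_t>(b.token)] += b.bias;
        }
    }
}

// -1 is a sentinel meaning "penalize over the whole context".
std::int32_t resolve_penalty_last_n(std::int32_t penalty_last_n, std::int32_t n_ctx) {
    return penalty_last_n == -1 ? n_ctx : penalty_last_n;
}
```

Expressing the flag as a bias means the sampler chain needs no special case for it: the same mechanism that serves user-supplied biases also suppresses the end-of-generation tokens.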
2024-12-16 12:31:14 +02:00
Bartowski
4ddd199f6f
llava : Allow locally downloaded models for QwenVL ( #10833 )
...
* Allow locally downloaded models for QwenVL
* Define model_path
* rm trailing space
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-15 21:43:25 +01:00
Vinesh Janarthanan
5478bbcd17
server: (UI) add syntax highlighting and latex math rendering ( #10808 )
...
* add code highlighting and math formatting
* code cleanup
* build public/index.html
* rebuild public/index.html
* fixed coding style
* fixed coding style
* style fixes
* highlight: smaller bundle size, fix light & dark theme
* remove katex
* add bundle size check
* add more languages
* add php
* reuse some langs
* use gzip
* Revert "remove katex"
This reverts commit c0e5046acc.
* use better maintained @vscode/markdown-it-katex
* fix gzip non deterministic
* ability to add a demo conversation for dev
* fix latex rendering
* add comment
* latex codeblock as code
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-15 12:55:54 +01:00
Michelle Tan
89d604f2c8
server: Fix has_next_line in JSON response ( #10818 )
...
* Update server JSON response.
* Add unit test to check `has_new_line` JSON response
* Remove `has_new_line` unit test changes.
* Address code review comment: type check for `has_new_line` in unit test
2024-12-14 23:29:45 +01:00
HimariO
ba1cb19cdd
llama : add Qwen2VL support + multimodal RoPE ( #10361 )
...
* Barebone Qwen2VL LLM convertor
* Add Qwen2VL cli entrypoint
* [WIP] add qwen2vl arch
* Verify m-rope output
* Add vl-rope/2d-rope support for qwen2vl ViT
* update qwen2vl cli tool
* update 5D tensor op workaround
* [WIP] qwen2vl vision model
* make batch and clip utils compatible with qwen2vl
* [WIP] create inference workflow, gguf convert script but fix
* correcting vision-rope behavior, add the missing last layer back to ViT
* add arg parser to qwen2vl_surgery
* replace variable size array with vector
* cuda-gdb cmake preset
* add fp32 mrope, vision rope kernel
* add fp16 support for qwen2vl and m-rope
* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`
* fix rope op mode switching, outdated func args
* update `llama_hparams`
* update to keep up stream changes
* resolve linter, test errors
* add makefile entry, update special image padding token
* add mrope unit test, fix few compiler warnings
* rename `mrope` related function, params
* minor updates on debug util, bug fixes
* add `m-rope` testcase to `test-backend-ops`
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix trailing whitespace
* store `llama_hparams.rope_sections` with fixed size array
* update position id tensor size check in GGML_OP_ROPE
* minor updates
* update `ggml_backend_*_supports_op` of unsupported backends
* remove old `rope_section` compare operator
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-14 14:43:46 +02:00
cduk
56eea0781c
Removes spurious \r in output that causes logging in journalctl to treat lines as binary and therefore hidden by default ( #10771 )
...
Signed-off-by: Charles Darke <s.cduk@toodevious.com>
Co-authored-by: Charles Darke <s.cduk@toodevious.com>
2024-12-13 23:21:49 +01:00
Eric Curtin
c27ac678dd
Opt class for positional argument handling ( #10508 )
...
Added support for positional arguments `model` and `prompt`. Added
functionality to download via strings like:
llama-run llama3
llama-run ollama://granite-code
llama-run ollama://granite-code:8b
llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
llama-run https://example.com/some-file1.gguf
llama-run some-file2.gguf
llama-run file://some-file3.gguf
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
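The model strings listed above differ mainly in their scheme prefix. The sketch below shows one way such a string could be split into a scheme and a remainder. It is not the `Opt` class from this commit: the function name `split_model_ref` is hypothetical, the prefix list is copied from the examples in the message, and how a bare name such as `llama3` is resolved is not stated there, so the sketch simply reports an empty scheme for it.

```cpp
#include <string>
#include <utility>

// Returns {scheme, rest}. The scheme is empty when no known prefix matches,
// which covers both plain file names and bare model names.
std::pair<std::string, std::string> split_model_ref(const std::string & ref) {
    static const char * const prefixes[] = {
        "ollama://", "hf://", "huggingface://", "https://", "file://",
    };
    for (const char * p : prefixes) {
        const std::string prefix(p);
        if (ref.compare(0, prefix.size(), prefix) == 0) {
            // Drop the trailing "://" from the scheme name.
            return {prefix.substr(0, prefix.size() - 3), ref.substr(prefix.size())};
        }
    }
    return {"", ref};
}
```

For instance, `split_model_ref("ollama://granite-code:8b")` yields the scheme `ollama` and the remainder `granite-code:8b`, and a caller could then dispatch on the scheme to choose a download or local-file path.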
2024-12-13 19:34:25 +01:00
Xuan Son Nguyen
adffa6ffd5
common : improve -ctv -ctk CLI arguments ( #10806 )
...
* common : improve ctv ctk cli argument
* regenerate docs
* even better approach
* use std::vector
2024-12-12 22:53:05 +01:00
Diego Devesa
cb13ef85a4
remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS ( #10797 )
...
other windows build fixes
2024-12-12 19:02:49 +01:00
CentricStorm
5555c0c1f6
docs: update server streaming mode documentation ( #9519 )
...
Provide more documentation for streaming mode.
2024-12-11 23:40:40 +01:00
Xuan Son Nguyen
235f6e14bf
server : (UI) add tok/s, get rid of completion.js ( #10786 )
...
* get rid of completion.js
* extract chat bubble to a component
* add tok/s info
* sync
* fix BASE_URL
* only extract timings when it's enabled
* fix auto scroll
2024-12-11 20:52:14 +01:00
qingy1337
1a31d0dc00
Update README.md ( #10772 )
2024-12-11 16:16:32 +01:00
kallewoof
484d2f31ae
bug-fix: snprintf prints NULL in place of the last character ( #10419 )
...
* bug-fix: snprintf prints NULL in place of the last character
We need to give snprintf enough space to print the last character and the null character, thus we allocate one extra byte and then ignore it when converting to std::string.
* add comment about extra null-term byte requirement
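The off-by-one described above can be illustrated with a small sketch. This is not the code from the commit; the helper name `format_double` is hypothetical. `std::snprintf` writes at most `size - 1` characters plus a terminating null, so a buffer sized to exactly the formatted length loses its last character.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: formats one double into a std::string.
std::string format_double(const char * fmt, double value) {
    // First pass: a null buffer with size 0 only reports the required
    // length, not counting the terminating '\0'.
    const int n = std::snprintf(nullptr, 0, fmt, value);
    assert(n >= 0);
    // Allocate one extra byte so the last character is not replaced by '\0'.
    std::vector<char> buf(static_cast<std::size_t>(n) + 1);
    std::snprintf(buf.data(), buf.size(), fmt, value);
    // Ignore the extra null byte when converting to std::string.
    return std::string(buf.data(), static_cast<std::size_t>(n));
}
```

With a buffer of only `n` bytes, the second call would truncate the output by one character, which matches the symptom named in the commit title.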
2024-12-11 14:48:04 +01:00
CentricStorm
4b4d92b098
docs: fix server documentation formatting ( #10776 )
2024-12-11 11:47:43 +01:00
Yüg
a86ad841f1
server : add flag to disable the web-ui ( #10762 ) ( #10751 )
...
Co-authored-by: eugenio.segala <esegala@deloitte.co.uk>
2024-12-10 18:22:34 +01:00