- Add `struct llama_sampler` and `struct llama_sampler_i`
- Add `llama_sampler_` API
- Add `llama_sampler_chain_` API for chaining multiple samplers
- Remove `LLAMA_API_INTERNAL`
- Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`
* Introduce ggml_compute_threadpool
- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android bulid issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1
and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...
The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.
* threadpool: reduce the number of barrier required
New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.
Include n_threads in debug print for disposable threadpool.
Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
Full memory barrier is an overkill for this since each thread works on different chunk
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask releated functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.
* threadpool: move all pause/resume logic into ggml
* threadpool: futher api cleanup and prep for future refactoring
All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setprioty error message
* Update examples/llama-bench/llama-bench.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Adroid
---------
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
This change fixes a bug where replacing text in a very long string could
cause llama.cpp to hang indefinitely. This is because the algorithm used
was quadratic, due to memmove() when s.replace() is called in a loop. It
seems most search results and LLM responses actually provide the O(n**2)
algorithm, which is a great tragedy. Using a builder string fixes things
* llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model.
- The CLIP model now prioritizes the Vulkan backend over the CPU when vulkan available.
- A GGML_OP_ACC shader has been added.
- The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU.
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* fix-up coding style.
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* Fix-up the missing initial parameter to resolve the compilation warning.
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* [fix] Add missing parameters.
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* [fix] Use nb1 and nb2 for dst.
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* Fix check results ggml_acc call
---------
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
Co-authored-by: 0cc4m <picard12@live.de>
* [CANN] Add Ascend NPU backend
Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.
CANN (Compute Architecture of Neural Networks), developped by
Huawei, is a heterogeneous computing architecture for AI.
Co-authored-by: wangshuai09 <391746016@qq.com>
* delete trailing whitespaces
* Modify the code based on review comment
* Rename LLAMA_CANN to GGML_CANN
* Make ggml-common.h private
* add ggml_cann prefix for acl funcs
* Add logging for CANN backend
* Delete Trailing whitespace
---------
Co-authored-by: wangshuai09 <391746016@qq.com>
* py : type-check all Python scripts with Pyright
* server-tests : use trailing slash in openai base_url
* server-tests : add more type annotations
* server-tests : strip "chat" from base_url in oai_chat_completions
* server-tests : model metadata is a dict
* ci : disable pip cache in type-check workflow
The cache is not shared between branches, and it's 250MB in size,
so it would become quite a big part of the 10GB cache limit of the repo.
* py : fix new type errors from master branch
* tests : fix test-tokenizer-random.py
Apparently, gcc applies optimisations even when pre-processing,
which confuses pycparser.
* ci : only show warnings and errors in python type-check
The "information" level otherwise has entries
from 'examples/pydantic_models_to_grammar.py',
which could be confusing for someone trying to figure out what failed,
considering that these messages can safely be ignored
even though they look like errors.
* clip : suppress unused variable warnings
This commit suppresses unused variable warnings for the variables e in
the catch blocks.
The motivation for this change is to suppress the warnings that are
generated on Windows when using the MSVC compiler. The warnings are
not displayed when using GCC because GCC will mark all catch parameters
as used.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* squash! clip : suppress unused variable warnings
Remove e (/*e*/) instead instead of using GGML_UNUSED.
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* common : gpt_params_parse do not print usage
* common : rework usage print (wip)
* common : valign
* common : rework print_usage
* infill : remove cfg support
* common : reorder args
* server : deduplicate parameters
ggml-ci
* common : add missing header
ggml-ci
* common : remote --random-prompt usages
ggml-ci
* examples : migrate to gpt_params
ggml-ci
* batched-bench : migrate to gpt_params
* retrieval : migrate to gpt_params
* common : change defaults for escape and n_ctx
* common : remove chatml and instruct params
ggml-ci
* common : passkey use gpt_params
The llama.cpp grammar parser had a bug where forgetting to add a closing
quotation mark to strings would cause parsing to crash. Anyone running a
server on a public endpoint is advised to upgrade. To reproduce this bug
./llamafile -m foo.gguf -p bar --grammar 'root::="'
Credit for discovering and reporting this issue goes to Eclypsium
Security Researcher Richard Johnson <Richard.johnson@eclypsium.com>.
* Revert "Revert "llava : add support for moondream vision language model (#6899)""
This reverts commit 9da243b36a.
* Fix num_positions and embeddings initialization
* add support for moondream vision language model
This required making the following changes to the CLIP model:
1. Support for patch embedding bias.
2. Make class embedding and pre-layernorm optional.
3. Add support for post-layernorm.
* Update examples/llava/clip.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit renamesthe lerp (linear interpolation) function in clip.cpp
to avoid a conflict with the lerp function in the <cmath> standard C++
library when using c++20.
The motivation for this change is to enable projects that use c++20 to
be able to compile clip.cpp without having to resort to patching it. The
lerp function was added to cmath in version C++20 (202002L) and is why
this is not causing any issue at the moment as C++11/C++17 is currently
used by llama.cpp.
I realize that llama.cpp uses either C++11 (or C++17 in the case for
SYCL) but wanted to ask if this would be an acceptable change just the
same.
Refs: https://en.cppreference.com/w/cpp/numeric/lerp
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Support Llama 3 conversion
The tokenizer is BPE.
* style
* Accept suggestion
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
* llama : add llama_token_is_eog()
ggml-ci
* llama : auto-detect more EOT tokens when missing in KV data
* convert : replacing EOS token is a hack
* llama : fix codegemma EOT token + add TODOs
* llama : fix model type string for 8B model
---------
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Key changes:
* BERT conversion: fix abuse of LlamaHfVocab, do not set BOS or EOS
* Nomic Embed conversion: pad vocab instead of slicing embedding tensor
* llama_tokenize: handle added special tokens like HF does
There several places where a gguf context is allocated. A call to gguf_free
is missing in some error paths. Also on linux, llama-bench was missing a
fclose.