* more perfo with llamafile tinyblas on x86_64.
- add bf16 suport
- change dispache strategie (thanks:
https://github.com/ikawrakow/ik_llama.cpp/pull/71 )
- reduce memory bandwidth
simple tinyblas dispache and more cache freindly
* tinyblas dynamic dispaching
* sgemm: add M blocs.
* - git 2.47 use short id of len 9.
- show-progress is not part of GNU Wget2
* remove not stable test
This commit updates the `examples/run/README.md` file to include a new
option for setting the temperature and updates the `run.cpp` file to
parse this option.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
* Add Falcon3 model support
* Add fix for adding bos to added special tokens
* Add comment explaining the logic behind the if statement
* Add a log message to better track the when the following line of code is triggered
* Update log to only print when input and output characters are different
* Fix handling pre-normalized tokens
* Refactoring
Change the code to do 16b loads when possible and extract the appropriate
component late, so the code is effectively decoding a pair of elements and
then selecting one. This can allow more commoning to happen in the compiler
when neighboring elements are loaded.
* Migrate to tensor->buffer for checking backend buffer type: 1
* SYCL: common.cpp try to migrate away from tensor->backend
* SYCL: fix assertions and add proper comments
* SYCL: remove extra space
* SYCL: Add back static to ggml_backend_buffer_is_sycl_split function
* SYCL: Add pragma directive to suppress warning spam
* SYCL: Integrate debug logs with GGML_LOG and other fixes
* Revert "SYCL: Integrate debug logs with GGML_LOG and other fixes"
This reverts commit 2607b7de0f.
Let's keep the current SYCL specific logging mechanism for now
* SYCL: Use GGML_SYCL_DEBUG after reverting
* SYCL: reg_get_proc_address func, update to the current func signature
* SYCL: Refactor SYCL buffer checks in ggml_sycl_cpy_tensor_2d
* convert : use GPT2 vocab for Phi-4 model
* convert : use null value of sliding_window to distinguish Phi-4 from other PHI3-based models
* llama : do not use sliding window attention mask for Phi-4 model
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Set default width to whatever the terminal is. Also fixed a small bug around
default n_gpu_layers value.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
* server: avoid overwriting Authorization header
If no API key is set, leave the Authorization header as is. It may be
used by another part of the Web stack, such as an authenticating proxy.
Fixes https://github.com/ggerganov/llama.cpp/issues/10854
* rebuild
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server : add "tokens" output
ggml-ci
* server : output embeddings for all tokens when pooling = none
ggml-ci
* server : update readme [no ci]
* server : fix spacing [no ci]
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : be explicit about the pooling type in the tests
ggml-ci
* server : update /embeddings and /v1/embeddings endpoints
ggml-ci
* server : do not normalize embeddings when there is no pooling
ggml-ci
* server : update readme
ggml-ci
* server : fixes
* tests : update server tests
ggml-ci
* server : update readme [no ci]
* server : remove rebase artifact
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : add "tokens" output
ggml-ci
* server : update readme
ggml-ci
* server : return tokens ids only if requested
ggml-ci
* tests : improve "tokens" type check
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* server : remove "tokens" from the OAI endpoint
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Related to #10524 / be0e350c references to hipBLAS have been removed
across the repository. This fixes the link from the repositories
`README.md`.
Signed-off-by: Brian 'redbeard' Harrington <redbeard@dead-city.org>
This commit removes the return statement from ggml_gallocr_allocate_node
function.
The motivation behind this change is to make the code more readable and
consistent.