Commit Graph

4393 Commits

Author SHA1 Message Date
Georgi Gerganov
2ffcbb19ff
Merge 4ff0831ce6 into 9ba399dfa7 2024-12-24 18:02:55 -06:00
Reza Kakhki
9ba399dfa7
server : add support for "encoding_format": "base64" to the */embeddings endpoints (#10967)
* add support for base64

* fix base64 test

* improve test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-24 21:33:04 +01:00
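
A minimal client-side sketch of the new option: request base64-packed embeddings and decode them back to floats. The host, port, endpoint path, and response shape shown here are assumptions for illustration, not taken from the commit.

```python
import base64
import struct

import requests

# Ask a locally running llama-server for base64-packed embeddings
# (OpenAI-style endpoint path and response layout assumed).
resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={"input": "hello world", "encoding_format": "base64"},
)
payload = resp.json()["data"][0]["embedding"]

# The base64 payload decodes to a flat array of little-endian float32 values.
raw = base64.b64decode(payload)
embedding = struct.unpack(f"<{len(raw) // 4}f", raw)
print(len(embedding), embedding[:4])
```
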
Djip007
2cd43f4900
ggml : more perfo with llamafile tinyblas on x86_64 (#10714)
* more perf with llamafile tinyblas on x86_64.

- add bf16 support
- change dispatch strategy (thanks:
https://github.com/ikawrakow/ik_llama.cpp/pull/71 )
- reduce memory bandwidth

simple tinyblas dispatch and more cache friendly

* tinyblas dynamic dispatching

* sgemm: add M blocks.

* - git 2.47 uses short ids of length 9.
- show-progress is not part of GNU Wget2

* remove unstable test
2024-12-24 18:54:49 +01:00
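
The entry above is about dispatch strategy and cache behaviour in tinyblas. As a loose illustration of the general blocking idea only (not the actual llamafile/tinyblas code), here is a cache-blocked GEMM sketch with illustrative block sizes.

```python
import numpy as np

def blocked_matmul(A, B, block_m=64, block_n=64, block_k=64):
    """Naive cache-blocked GEMM: process tiles so that each tile of A and B
    is reused while it is still resident in cache. Block sizes are illustrative."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, block_m):
        for j in range(0, N, block_n):
            for k in range(0, K, block_k):
                C[i:i + block_m, j:j + block_n] += (
                    A[i:i + block_m, k:k + block_k] @ B[k:k + block_k, j:j + block_n]
                )
    return C

A = np.random.rand(256, 128).astype(np.float32)
B = np.random.rand(128, 192).astype(np.float32)
assert np.allclose(blocked_matmul(A, B), A @ B, atol=1e-3)
```
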
NeverLucky
09fe2e7613
server: allow filtering llama server response fields (#10940)
* llama_server_response_fields

* llama_server_response_fields_fix_issues

* params fixes

* fix

* clarify docs

* change to "response_fields"

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-24 17:39:49 +01:00
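
A minimal sketch of the new request parameter: ask the completion endpoint to return only selected fields. The endpoint path, host, and the specific field names listed are assumptions.

```python
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "The capital of France is",
        "n_predict": 8,
        # Return only these fields in the response (names are illustrative).
        "response_fields": ["content", "tokens_predicted"],
    },
)
print(resp.json())
```
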
Georgi Gerganov
30caac3a68
llama : the WPM vocabs use the CLS token as BOS (#10930)
* llama : the WPM vocabs use the CLS token as BOS

ggml-ci

* llama : add comment
2024-12-24 09:44:20 +02:00
Diego Devesa
60cfa728e2
ggml : use wstring for backend search paths (#10960)
ggml-ci
2024-12-24 04:05:27 +01:00
Diego Devesa
3327bb0f8d
ggml : fix arm enabled features check (#10961) 2024-12-24 04:05:17 +01:00
Diego Devesa
32d6ee6385
ggml : fix const usage in SSE path (#10962) 2024-12-23 20:25:52 +01:00
Xuan Son Nguyen
14b699ecde
server : fix missing model id in /model endpoint (#10957)
* server : fix missing model id in /model endpoint

* fix ci
2024-12-23 12:52:25 +01:00
Xuan Son Nguyen
485dc01214
server : add system_fingerprint to chat/completion (#10917)
* server : add system_fingerprint to chat/completion

* update README
2024-12-23 12:02:44 +01:00
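
With the commit above, chat completion responses include a `system_fingerprint` field, mirroring the OpenAI schema. A small sketch of reading it; host, port, and model name are assumptions.

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Say hi"}],
    },
)
body = resp.json()
# Identifies the backend build/configuration that produced the reply.
print(body.get("system_fingerprint"))
```
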
Radoslav Gerganov
86bf31cfe6
rpc-server : add support for the SYCL backend (#10934) 2024-12-23 10:39:30 +02:00
Yun Dou
b92a14a841
llama : support InfiniAI Megrez 3b (#10893)
* Support InfiniAI Megrez 3b

* Fix tokenizer_clean_spaces for megrez
2024-12-23 01:35:44 +01:00
ymcki
6f0c9e034b
llama : support for Llama-3_1-Nemotron-51B (#10669)
* conflict resolution

* move comments after bracket to their own line
2024-12-23 01:22:33 +01:00
Eric Curtin
dab76c92cc
llama-run : include temperature option (#10899)
This commit updates the `examples/run/README.md` file to include a new
option for setting the temperature and updates the `run.cpp` file to
parse this option.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2024-12-23 01:21:40 +01:00
yuri@FreeBSD
7024d59e6a
ggml : fix run-time on FreeBSD in get_executable_path() (#10948) 2024-12-23 01:20:11 +01:00
Rudi Servo
7c0e285858
devops : add docker-multi-stage builds (#10832) 2024-12-22 23:22:58 +01:00
Billel Mokeddem
7ae33a616f
llama : add Falcon3 support (#10883)
* Add Falcon3 model support

* Add fix for adding bos to added special tokens

* Add comment explaining the logic behind the if statement

* Add a log message to better track when the following line of code is triggered

* Update log to only print when input and output characters are different

* Fix handling pre-normalized tokens

* Refactoring
2024-12-23 00:09:58 +02:00
Jeff Bolz
ebdee9478c
vulkan: build fixes for 32b (#10927)
* vulkan: build fixes for 32b

Should fix #10923

* vulkan: initialize some buffer/offset variables
2024-12-22 10:44:01 +01:00
Georgi Gerganov
5cd85b5e00
convert : add BertForMaskedLM (#10919)
2024-12-21 10:10:18 +02:00
Jeff Bolz
a91a41364b
vulkan: optimize coopmat2 dequant functions (#10855)
Change the code to do 16b loads when possible and extract the appropriate
component late, so the code is effectively decoding a pair of elements and
then selecting one. This can allow more commoning to happen in the compiler
when neighboring elements are loaded.
2024-12-21 08:04:45 +01:00
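
As a rough scalar illustration of the "decode a pair, select late" idea described above (not the actual Vulkan/GLSL shader code, and with an assumed 4-bit packing layout):

```python
def dequant_select(packed16: int, idx: int, scale: float) -> float:
    """A 16-bit load holds four 4-bit quants; decode the pair containing
    element `idx`, then select one of the two. Packing order is illustrative."""
    byte = (packed16 >> (8 * (idx // 2))) & 0xFF
    lo = (byte & 0x0F) - 8           # first quant of the pair
    hi = ((byte >> 4) & 0x0F) - 8    # second quant of the pair
    pair = (lo * scale, hi * scale)  # decode both, sharing the common work
    return pair[idx % 2]

# Four quants packed into one 16-bit value, scale 0.25
print([dequant_select(0x4FA3, i, 0.25) for i in range(4)])
```
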
Adrien Gallouët
e34c5af43f
ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0() (#10874)
* ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml-cpu: format code

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-12-21 00:33:37 +01:00
Akarshan Biswas
eb5c3dc64b
SYCL: Migrate away from deprecated ggml_tensor->backend (#10840)
* Migrate to tensor->buffer for checking backend buffer type: 1

* SYCL: common.cpp try to migrate away from tensor->backend

* SYCL: fix assertions and add proper comments

* SYCL: remove extra space

* SYCL: Add back static to ggml_backend_buffer_is_sycl_split function

* SYCL: Add pragma directive to suppress warning spam

* SYCL: Integrate debug logs with GGML_LOG and other fixes

* Revert "SYCL: Integrate debug logs with GGML_LOG and other fixes"

This reverts commit 2607b7de0f.
Let's keep the current SYCL specific logging mechanism for now

* SYCL: Use GGML_SYCL_DEBUG after reverting

* SYCL: reg_get_proc_address func, update to the current func signature

* SYCL: Refactor SYCL buffer checks in ggml_sycl_cpy_tensor_2d
2024-12-20 23:31:28 +08:00
Xuan Son Nguyen
0ca416c91a
server : (UI) fix copy to clipboard function (#10916) 2024-12-20 14:12:06 +01:00
Diego Devesa
21ae3b9be8
ggml : add test for SVE and disable when it fails (#10906) 2024-12-20 13:31:28 +01:00
Molly Sophia
0a11f8b7b5
convert : fix RWKV v6 model conversion (#10913)
* Enable --no-context-shift for llama-perplexity example

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV 6: Fix error in ggml_cuda_op_bin_bcast

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-12-20 11:44:58 +02:00
Georgi Gerganov
d408bb9268
clip : disable GPU support (#10896)
ggml-ci
2024-12-19 18:47:15 +02:00
Georgi Gerganov
5cab3e4aaa
llama : minor grammar refactor (#10897)
ggml-ci
2024-12-19 17:42:13 +02:00
Georgi Gerganov
36319dec5d
tts : small QoL for easy model fetch (#10903) 2024-12-19 17:35:15 +02:00
Xuan Son Nguyen
57bb2c40cd
server : fix logprobs, make it OAI-compatible (#10783)
* server : fix logprobs, make it openai-compatible

* update docs

* add std::log

* return pre-sampling p

* sort before apply softmax

* add comment

* fix test

* set p for sampled token

* update docs

* add --multi-token-probs

* update docs

* add `post_sampling_probs` option

* update docs [no ci]

* remove --multi-token-probs

* "top_probs" with "post_sampling_probs"

* resolve review comments

* rename struct token_prob to prob_info

* correct comment placement

* fix setting prob for sampled token
2024-12-19 15:40:08 +01:00
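
A minimal sketch of requesting per-token probabilities with the new `post_sampling_probs` option; the host, endpoint path, and the `n_probs` parameter name are assumptions not spelled out in the commit message.

```python
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Once upon a time",
        "n_predict": 4,
        "n_probs": 3,                 # top alternatives per generated token (assumed name)
        "post_sampling_probs": True,  # report probabilities after sampling filters
    },
)
print(resp.json())
```
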
Adrien Gallouët
a3c33b1dce
ggml: fix arm build with gcc (#10895)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-12-19 14:20:41 +01:00
Sukriti Sharma
2fffc52b50
llama : fix Roberta embeddings (#10856)
* fix: Use gpt2 tokenizer for roberta and add eos/bos tokens

Branch: RobertaTokenizer

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fixes to position embeddings

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* map roberta-bpe to gpt-2

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* fix linting

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-19 15:04:51 +02:00
fairydreaming
7585edbdeb
convert : Add support for Microsoft Phi-4 model (#10817)
* convert : use GPT2 vocab for Phi-4 model

* convert : use null value of sliding_window to distinguish Phi-4 from other PHI3-based models

* llama : do not use sliding window attention mask for Phi-4 model

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-12-19 10:37:12 +01:00
Johannes Gäßler
cd920d0ac3
tests: disable GGUF test for bad value size (#10886) 2024-12-19 08:53:58 +01:00
Eric Curtin
7909e8588d
llama-run : improve progress bar (#10821)
Set the default width to the current terminal width. Also fixed a small bug around the
default n_gpu_layers value.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2024-12-19 03:58:00 +01:00
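
llama-run itself is C++, but as a loose illustration of the same default-to-terminal-width idea, a short Python sketch:

```python
import shutil

# Fall back to 80 columns when the size cannot be determined (e.g. no TTY).
columns = shutil.get_terminal_size(fallback=(80, 24)).columns
bar_width = max(10, columns - 20)  # leave room for a percentage label
print(f"progress bar width: {bar_width}")
```
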
Diego Devesa
9177484f58
ggml : fix arm build (#10890)
* ggml: GGML_NATIVE uses -mcpu=native on ARM

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml: Show detected features with GGML_NATIVE

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* remove msvc support, add GGML_CPU_ARM_ARCH option

* disable llamafile in android example

* march -> mcpu, skip adding feature macros

ggml-ci

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Adrien Gallouët <angt@huggingface.co>
2024-12-18 23:21:42 +01:00
Georgi Gerganov
0bf2d10c55
tts : add OuteTTS support (#10784)
* server : add "tokens" output

ggml-ci

* server : output embeddings for all tokens when pooling = none

ggml-ci

* server : be explicit about the pooling type in the tests

ggml-ci

* server : do not normalize embeddings when there is no pooling

ggml-ci

* llama : add OuteTTS support (wip)

* wip

* extract features

* first conv

* group norm

* resnet conv

* resnet

* attn

* pos net

* layer norm

* convnext

* head

* hann window

* fix n_embd + remove llama.cpp hacks

* compute hann window

* fft

* spectrum processing

* clean-up

* tts : receive input text and generate codes

* clip : fix new conv name

* tts : minor fix

* tts : add header + minor fixes

ggml-ci

* tts : add mathematical constant

ggml-ci

* tts : fix sampling + cut initial noise

* tts : fixes

* tts : update default samplers

ggml-ci

* tts : text pre-processing

* tts : outetts-voc -> wavtokenizer-dec

* tts : remove hardcoded constants

ggml-ci

* tts : fix tensor shapes

* llama : refactor wavtokenizer tensors

ggml-ci

* cont

ggml-ci

* cont [no ci]

* llama : update WavTokenizer to non-causal attn

* llama : handle no-vocab detokenization

* tts : add Python example for OuteTTS (wip)

* tts : extend python example to generate spectrogram

ggml-ci

* server : fix rebase artifacts

* tts : enable "return_tokens" in Python example

ggml-ci

* tts : minor fixes

* common : support HF download for vocoder
2024-12-18 19:27:21 +02:00
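
Several of the steps listed above (Hann window, FFT, spectrum processing) are standard signal-processing building blocks of a vocoder's decoder path. A small NumPy sketch of those pieces, purely as background and not the llama.cpp implementation:

```python
import numpy as np

def hann_window(n: int) -> np.ndarray:
    # w[k] = 0.5 * (1 - cos(2*pi*k / (n - 1)))
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(n) / (n - 1)))

def magnitude_spectrogram(signal: np.ndarray, frame: int = 256, hop: int = 128) -> np.ndarray:
    """Frame the signal, apply a Hann window, and take the FFT magnitude per frame."""
    window = hann_window(frame)
    frames = [
        signal[i:i + frame] * window
        for i in range(0, len(signal) - frame + 1, hop)
    ]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

t = np.linspace(0, 1, 4000, endpoint=False)
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (n_frames, frame // 2 + 1)
```
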
Gaetan Bisson
7bbb5acf12
server: avoid overwriting Authorization header (#10878)
* server: avoid overwriting Authorization header

If no API key is set, leave the Authorization header as is. It may be
used by another part of the Web stack, such as an authenticating proxy.

Fixes https://github.com/ggerganov/llama.cpp/issues/10854

* rebuild

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-18 15:00:07 +01:00
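
The fix above makes the built-in web UI leave the Authorization header untouched when no API key is configured, so an authenticating proxy in front of the server keeps working. A client-side sketch of the same conditional-header pattern (host and endpoint are assumptions):

```python
import requests

def query_server(prompt: str, api_key: str | None = None) -> dict:
    # Only set Authorization when a key is actually configured; otherwise leave
    # the header alone so a proxy in front of the server can inject its own.
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    resp = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": prompt, "n_predict": 16},
        headers=headers,
    )
    return resp.json()

print(query_server("Hello"))
```
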
Georgi Gerganov
152610eda9
server : output embeddings for all tokens when pooling = none (#10861)
* server : add "tokens" output

ggml-ci

* server : output embeddings for all tokens when pooling = none

ggml-ci

* server : update readme [no ci]

* server : fix spacing [no ci]

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* server : be explicit about the pooling type in the tests

ggml-ci

* server : update /embeddings and /v1/embeddings endpoints

ggml-ci

* server : do not normalize embeddings when there is no pooling

ggml-ci

* server : update readme

ggml-ci

* server : fixes

* tests : update server tests

ggml-ci

* server : update readme [no ci]

* server : remove rebase artifact

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-18 13:01:41 +02:00
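
A minimal sketch of the per-token embeddings behaviour described above, assuming the server was started with pooling disabled (e.g. `--pooling none`); host, endpoint path, and response shape are assumptions:

```python
import requests

# With pooling disabled, the endpoint is expected to return one embedding per
# input token instead of a single pooled vector.
resp = requests.post(
    "http://localhost:8080/embeddings",
    json={"content": "The quick brown fox"},
)
print(resp.json())
```
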
Georgi Gerganov
0e70ba686e
server : add "tokens" output (#10853)
* server : add "tokens" output

ggml-ci

* server : update readme

ggml-ci

* server : return tokens ids only if requested

ggml-ci

* tests : improve "tokens" type check

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* server : remove "tokens" from the OAI endpoint

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-18 11:05:29 +02:00
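
A minimal sketch of requesting the new "tokens" output; the `return_tokens` parameter name appears elsewhere in this log (in the OuteTTS commit notes) and is treated as an assumption here, as are host and endpoint:

```python
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "1 + 1 =",
        "n_predict": 4,
        "return_tokens": True,  # ask the server to include the generated token ids
    },
)
body = resp.json()
print(body.get("content"), body.get("tokens"))
```
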
Xuan Son Nguyen
46828872c3
server : (embeddings) using same format for "input" and "content" (#10872)
* server : (embeddings) using same format for "input" and "content"

* fix test case

* handle empty input case

* fix test
2024-12-18 10:55:09 +02:00
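
A brief sketch of the two accepted shapes, a single string or an array of strings, for the embeddings request body; the host and response handling are assumptions:

```python
import requests

url = "http://localhost:8080/v1/embeddings"

# Single string
one = requests.post(url, json={"input": "hello"}).json()

# Batch of strings in a single request
many = requests.post(url, json={"input": ["hello", "world"]}).json()

print(len(one["data"]), len(many["data"]))
```
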
redbeard
6b064c92b4
docs: Fix HIP (née hipBLAS) in README (#10880)
Related to #10524 / be0e350c, references to hipBLAS have been removed
across the repository. This fixes the link in the repository's
`README.md`.

Signed-off-by: Brian 'redbeard' Harrington <redbeard@dead-city.org>
2024-12-18 10:35:00 +02:00
Diego Devesa
4da69d1abd
Revert "llama : add Falcon3 support (#10864)" (#10876)
This reverts commit 382bc7f2e8.
2024-12-18 01:36:46 +01:00
DAN™
d62b532c52
Use model->gguf_kv for loading the template instead of using the C API. (#10868)
* Bump model_template to 16384 bytes to support larger chat templates.

* Use `model->gguf_kv` for efficiency.
2024-12-17 23:24:22 +01:00
Johannes Gäßler
081b29bd2a
tests: add tests for GGUF (#10830) 2024-12-17 19:09:35 +01:00
Georgi Gerganov
5437d4aaf5
sync : ggml 2024-12-17 18:36:02 +02:00
Georgi Gerganov
78f766768d
cmake : fix "amd64" processor string (whisper/2638) 2024-12-17 18:35:49 +02:00
gn64
8dd19a4812
vulkan : fix soft_max.comp division by zero (whisper/2633)
This change prevents a division by zero error when p.KY is 0.
2024-12-17 18:35:49 +02:00
Daniel Bevenius
130d0c90bd
ggml : remove return from ggml_gallocr_allocate_node (ggml/1048)
This commit removes the return statement from the ggml_gallocr_allocate_node
function.

The motivation behind this change is to make the code more readable and
consistent.
2024-12-17 18:35:49 +02:00
Daniel Bevenius
3919da8e33
ggml : add check for grad_accs (ggml/1046)
* ggml : add check for grad_accs

This commit adds a check for grad_accs in ggml_graph_get_grad and
ggml_graph_get_grad_acc functions. This is necessary to avoid segfaults
when grad_accs is not initialized.

The motivation for this change is that I find it nice to be able to
print out a computation graph using ggml_graph_print but this function
segfaults when grad_accs is not initialized:
```console
(gdb) p g1
$2 = (ggml_cgraph *) 0x7ffff66004b0
(gdb) p *g1
$3 = {size = 2048, n_nodes = 1, n_leafs = 2, nodes = 0x7ffff6600500,
grads = 0x0, grad_accs = 0x0, leafs = 0x7ffff6604500,
visited_hash_set = {size = 4099, used = 0x7ffff6610518,
keys = 0x7ffff6608500}, order = GGML_CGRAPH_EVAL_ORDER_LEFT_TO_RIGHT}
(gdb) p ggml_graph_print(g1)
=== GRAPH ===
n_nodes = 1

Program received signal SIGSEGV, Segmentation fault.
0x0000555555579775 in ggml_graph_get_grad
(cgraph=0x7ffff66004b0,node=0x7ffff6600340)
    at /ggml/ggml/src/ggml.c:5990
5990  return igrad != GGML_HASHSET_FULL &&
          ggml_bitset_get(cgraph->visited_hash_set.used, igrad) ?
          cgraph->grads[igrad] : NULL;
```

* squash! ggml : add check for grad_accs

Fix the check in ggml_graph_get_grad. The check was incorrectly using
cgraph->grad_accs instead of cgraph->grads.
2024-12-17 18:35:48 +02:00
Georgi Gerganov
0006f5a74a
ggml : update ggml_backend_cpu_device_supports_op (#10867)
* ggml : fix cpy op for IQ-quants to use reference impl

ggml-ci

* ggml : disable tests involving i-matrix quantization

* ggml : update ggml_backend_cpu_device_supports_op

ggml-ci
2024-12-17 18:35:42 +02:00