Commit Graph

1031 Commits

Author SHA1 Message Date
Mathijs Henquet
42fb6707e8 Add example of token splitting 2024-08-22 00:41:44 +02:00
Mathijs Henquet
b11e63ce43 Handle case if tokenizer splits along utf8 continuation bytes 2024-08-22 00:32:28 +02:00
Mathijs Henquet
198daa4e34 server : Add tokenize with pieces tests to server.feature 2024-08-22 00:04:21 +02:00
Mathijs Henquet
a2d4d1913c server : added with_pieces functionality to /tokenize endpoint 2024-08-20 23:28:06 +02:00
Changyeon Kim
2f3c1466ff
llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. (#8984)
* llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model.

- The CLIP model now prioritizes the Vulkan backend over the CPU when vulkan available.
- A GGML_OP_ACC shader has been added.
- The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* fix-up coding style.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* Fix-up the missing initial parameter to resolve the compilation warning.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Add missing parameters.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Use nb1 and nb2 for dst.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* Fix check results ggml_acc call

---------

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
Co-authored-by: 0cc4m <picard12@live.de>
2024-08-20 21:00:00 +02:00
Xuan Son Nguyen
8b3befc0e2
server : refactor middleware and /health endpoint (#9056)
* server : refactor middleware and /health endpoint

* move "fail_on_no_slot" to /slots

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix server tests

* fix CI

* update server docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-16 17:19:05 +02:00
tc-mb
d565bb2fd5
llava : support MiniCPM-V-2.6 (#8967)
* init

* rename

* add run android for termux in readme

* add android readme

* add instructions in readme

* change name in readme

* Update README.md

* fixed line

* add result in readme

* random pos_embed

* add positions index

* change for ollama

* change for ollama

* better pos_embed in clip

* support ollama

* updata cmakelist

* updata cmakelist

* rename wrapper

* clear code

* replace and organize code

* add link

* sync master

* fix warnings

* fix warnings

* fix bug in bicubic resize when need resize iamge smaller

* receive review comments and modify

* receive review comments and modify

* put all code into llava dir

* fix quality problem in pr code

* change n_layer

* add space in "-1"

* imitate reshape bug of python code

* fix bug in clip

* fix issues for merging

* fix llama-minicpmv-cli in cmake file

* change pr readme

* fix code review

* remove in line 33 directory in the /cmakelists.txt (not in example, in the main dir

* fix cmakefile

* add warn

* fix KEY_HAS_MINICPMV_PROJ

* remove load_image_size into clip_ctx

* remove the extern "C", MINICPMV_API

* fix uhd code for review comment

* delete minicpmv-wrapper in pr

* remove uhd_image_embed

* Modify 2 notes

* support minicpmv2.6

* modify convert script of minicpmv

* modify convert

* modify convert

* add readme

* add resampler of v2.6

* modify clip

* modify readme

* fix type-check

* fix type-check

* fix type-check

* fix type-check

* modify convert script and readme

* fix convert script and readme

* fix convert

* fix num in convert

* fix type-check

---------

Co-authored-by: Hongji Zhu <fireyoucan@gmail.com>
Co-authored-by: harvestingmoon <leewenyeong@gmail.com>
2024-08-16 16:34:41 +03:00
Aisuko
c8ddce8560
Fix inference example lacks required parameters (#9035)
Signed-off-by: Aisuko <urakiny@gmail.com>
2024-08-16 11:08:59 +02:00
gtygo
4b9afbbe90
retrieval : fix memory leak in retrieval query handling (#8955)
* retrieval

* Reuse querybatch to reduce frequent memory allocation

* delete unused white space
2024-08-15 10:40:12 +03:00
Riceball LEE
37501d9c79
server : fix duplicated n_predict key in the generation_settings (#8994) 2024-08-15 10:28:05 +03:00
Zhenwei Jin
4af8420afb
common : remove duplicate function llama_should_add_bos_token (#8778) 2024-08-15 10:23:23 +03:00
Jiří Podivín
234b30676a
server : init stop and error fields of the result struct (#9026)
Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-08-15 09:21:57 +03:00
compilade
98a532d474
server : fix segfault on long system prompt (#8987)
* server : fix segfault on long system prompt

* server : fix parallel generation with very small batch sizes

* server : fix typo in comment
2024-08-14 09:51:02 +03:00
Xuan Son Nguyen
828d6ff7d7
export-lora : throw error if lora is quantized (#9002) 2024-08-13 11:41:14 +02:00
Georgi Gerganov
d3ae0ee8d7
py : fix requirements check '==' -> '~=' (#8982)
* py : fix requirements check '==' -> '~='

* cont : fix the fix

* ci : run on all requirements.txt
2024-08-12 11:02:01 +03:00
Georgi Gerganov
5ef07e25ac
server : handle models with missing EOS token (#8997)
ggml-ci
2024-08-12 10:21:50 +03:00
fairydreaming
7c3f55c100
Add support for encoder-only T5 models (#8900)
* gguf-py : add T5ENCODER model architecture

* common : call llama_decode() during warmup only if the model has decoder

* convert-hf : add T5EncoderModel

* llama : add llama_model_has_decoder() API function

* llama : split build_t5() into build_t5_encoder() and build_t5_decoder()

* llama : add support for LLM_ARCH_T5ENCODER

* llama-embedding : add support for LLAMA_POOLING_TYPE_NONE

* llama-embedding : add support for encoder-only models

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-10 11:43:26 +02:00
Georgi Gerganov
b72942fac9
Merge commit from fork 2024-08-09 23:03:21 +03:00
Georgi Gerganov
45a55b91aa
llama : better replace_all (cont) (#8926)
* llama : better replace_all (cont)

ggml-ci

* code : deduplicate replace_all

ggml-ci
2024-08-09 18:23:52 +03:00
tc-mb
3071c0a5f2
llava : support MiniCPM-V-2.5 (#7599)
* init

* rename

* add run android for termux in readme

* add android readme

* add instructions in readme

* change name in readme

* Update README.md

* fixed line

* add result in readme

* random pos_embed

* add positions index

* change for ollama

* change for ollama

* better pos_embed in clip

* support ollama

* updata cmakelist

* updata cmakelist

* rename wrapper

* clear code

* replace and organize code

* add link

* sync master

* fix warnings

* fix warnings

* fix bug in bicubic resize when need resize iamge smaller

* receive review comments and modify

* receive review comments and modify

* put all code into llava dir

* fix quality problem in pr code

* change n_layer

* add space in "-1"

* imitate reshape bug of python code

* fix bug in clip

* fix issues for merging

* fix llama-minicpmv-cli in cmake file

* change pr readme

* fix code review

* remove in line 33 directory in the /cmakelists.txt (not in example, in the main dir

* fix cmakefile

* add warn

* fix KEY_HAS_MINICPMV_PROJ

* remove load_image_size into clip_ctx

* remove the extern "C", MINICPMV_API

* fix uhd code for review comment

* delete minicpmv-wrapper in pr

* remove uhd_image_embed

* Modify 2 notes

* clip : style changes

* del common.h in clip

* fix Type-Check error

* fix Type-Check error

* fix Type-Check error

* fix Type-Check error

* fix makefile error

* fix ubuntu-make error

* try fix clip

* try fix 1

---------

Co-authored-by: Hongji Zhu <fireyoucan@gmail.com>
Co-authored-by: harvestingmoon <leewenyeong@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-09 13:33:53 +03:00
Daniel Bevenius
5b2c04f492
embedding : add --pooling option to README.md [no ci] (#8934)
This commit adds the `--pooling` option to the README.md file in the
`examples/embedding` directory.

The motivation for adding this options is that currently if the model
used does not specify a pooling type the embedding example will fail
with the following error message:
```console
main: error: pooling type NONE not supported
```

This commit also updates the name of the executable in the examples
section.
2024-08-09 09:33:30 +03:00
Mathieu Geli
daef3ab233
server : add one level list nesting for embeddings (#8936) 2024-08-09 09:32:02 +03:00
Ouadie EL FAROUKI
0478174d59
[SYCL] Updated SYCL device filtering (#8901)
* Updated device filter to depend on default_selector (fixes non-intel device issues)
* Small related update to example/sycl Readme
2024-08-07 11:25:36 +01:00
Zhenwei Jin
506122d854
llama-bench : add support for getting cpu info on Windows (#8824)
* Add support for getting cpu info on Windows for llama_bench

* refactor

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-08-07 03:01:06 +02:00
Daniel Bevenius
725e3d9437
quantize : update usage comment in quantize.cpp (#8889)
This commit updates the usage comment in quantize.cpp to reflect the
new name of the executable, which is llama-quantize.
2024-08-07 01:43:00 +02:00
Xuan Son Nguyen
1e6f6554aa
server : add lora hotswap endpoint (WIP) (#8857)
* server : add lora hotswap endpoint

* handle lora_no_apply

* fix build

* updae docs

* clean up struct def

* fix build

* add LoRA test

* fix style
2024-08-06 17:33:39 +02:00
Daniel Bevenius
5f4dcb1e60
simple : update name of executable to llama-simple (#8885)
This commit updates the name of the executable in README.md from
`simple` to `llama-simple`.
2024-08-06 16:44:35 +02:00
Neo Zhang
d4ff847153
[SYCL] correct cmd name (#8877) 2024-08-06 09:09:12 +08:00
Liu Jia
0a4ce78681
common : Changed tuple to struct (TODO fix) (#8823)
* common : Changed tuple to struct (TODO fix)

Use struct `llama_init_result` to replace the previous
std::tuple<struct llama_model *, struct llama_context *>

* delete llama_init_default_params()

* delete the extra whitespace
2024-08-05 18:14:10 +02:00
ardfork
978ba3d83d
Server: Don't ignore llama.cpp params (#8754)
* Don't ignore llama.cpp params

* Add fallback for max_tokens
2024-08-04 20:16:23 +02:00
Brian Cunnie
ecf6b7f23e
batched-bench : handle empty -npl (#8839)
* [example] batched-bench "segmentation fault"

When `llama-batched-bench` is invoked _without_ setting `-npl`, "number
of parallel prompts", it segfaults.

The segfault is caused by invoking `max_element()` on a zero-length
vector, `n_pl`

This commit addresses that by first checking to see if the number of
parallel prompts is zero, and if so sets the maximum sequence size to 1;
otherwise, sets it to the original, the result of `max_element()`.

Fixes, when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
   69  	    llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
   70
   71  	    // ensure enough sequences are available
-> 72  	    ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```

* Update examples/batched-bench/batched-bench.cpp

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2024-08-04 13:55:03 +03:00
Daniel Bevenius
01aae2b497 baby-llama : remove duplicate vector include 2024-08-04 13:24:59 +03:00
Igor Okulist
afbbcf3c04
server : update llama-server embedding flag documentation (#8779)
Fixes #8763
2024-07-31 19:59:09 -04:00
compilade
4c676c85e5
llama : refactor session file management (#8699)
* llama : refactor session file management

* llama : saving and restoring state checks for overflow

The size of the buffers should now be given to the functions working
with them, otherwise a truncated file could cause out of bound reads.

* llama : stream from session file instead of copying into a big buffer

Loading session files should no longer cause a memory usage spike.

* llama : llama_state_get_size returns the actual size instead of max

This is a breaking change, but makes that function *much* easier
to keep up to date, and it also makes it reflect the behavior
of llama_state_seq_get_size.

* llama : share code between whole and seq_id-specific state saving

Both session file types now use a more similar format.

* llama : no longer store all hparams in session files

Instead, the model arch name is stored.
The layer count and the embedding dimensions of the KV cache
are still verified when loading.
Storing all the hparams is not necessary.

* llama : fix uint64_t format type

* llama : various integer type cast and format string fixes

Some platforms use "%lu" and others "%llu" for uint64_t.
Not sure how to handle that, so casting to size_t when displaying errors.

* llama : remove _context suffix for llama_data_context

* llama : fix session file loading

llama_state_get_size cannot be used to get the max size anymore.

* llama : more graceful error handling of invalid session files

* llama : remove LLAMA_MAX_RNG_STATE

It's no longer necessary to limit the size of the RNG state,
because the max size of session files is not estimated anymore.

* llama : cast seq_id in comparison with unsigned n_seq_max
2024-07-28 00:42:05 -04:00
slaren
2b1f616b20
ggml : reduce hash table reset cost (#8698)
* ggml : reduce hash table reset cost

* fix unreachable code warnings after GGML_ASSERT(false)

* GGML_ASSERT(false) -> GGML_ABORT("fatal error")

* GGML_ABORT use format string
2024-07-27 04:41:55 +02:00
Yaiko
01aec4a631
server : add Speech Recognition & Synthesis to UI (#8679)
* server : add Speech Recognition & Synthesis to UI

* server : add Speech Recognition & Synthesis to UI (fixes)
2024-07-26 00:10:16 +02:00
Xuan Son Nguyen
41cd47caab
examples : export-lora : fix issue with quantized base models (#8687) 2024-07-25 23:49:39 +02:00
Xuan Son Nguyen
be6d7c0791
examples : remove finetune and train-text-from-scratch (#8669)
* examples : remove finetune and train-text-from-scratch

* fix build

* update help message

* fix small typo for export-lora
2024-07-25 10:39:04 +02:00
Ujjawal Panchal
4b0eff3df5
docs : Quantum -> Quantized (#8666)
* docfix: imatrix readme, quantum models -> quantized models.

* docfix: server readme: quantum models -> quantized models.
2024-07-25 11:13:27 +03:00
Xuan Son Nguyen
96952e7181
llama : fix llama_chat_format_single for mistral (#8657)
* fix `llama_chat_format_single` for mistral

* fix typo

* use printf
2024-07-24 13:48:46 +02:00
Xuan Son Nguyen
de280085e7
examples : Fix llama-export-lora example (#8607)
* fix export-lora example

* add more logging

* reject merging subset

* better check

* typo
2024-07-23 23:48:37 +02:00
Vali Malinoiu
b841d07408
server : fix URL.parse in the UI (#8646) 2024-07-23 17:37:42 +03:00
Georgi Gerganov
938943cdbf
llama : move vocab, grammar and sampling into separate files (#8508)
* llama : move sampling code into llama-sampling

ggml-ci

* llama : move grammar code into llama-grammar

ggml-ci

* cont

ggml-ci

* cont : pre-fetch rules

* cont

ggml-ci

* llama : deprecate llama_sample_grammar

* llama : move tokenizers into llama-vocab

ggml-ci

* make : update llama.cpp deps [no ci]

* llama : redirect external API to internal APIs

ggml-ci

* llama : suffix the internal APIs with "_impl"

ggml-ci

* llama : clean-up
2024-07-23 13:10:17 +03:00
Jan Boon
628154492a
server : update doc to clarify n_keep when there is bos token (#8619) 2024-07-22 11:02:09 +03:00
devojony
b7c11d36e6
examples: fix android example cannot be generated continuously (#8621)
When generation ends `completion_loop()` should return a NULL, not the empty string
2024-07-22 09:54:42 +03:00
M-A
22f281aa16
examples : Rewrite pydantic_models_to_grammar_examples.py (#8493)
Changes:

- Move each example into its own function. This makes the code much
  easier to read and understand.
- Make the program easy to only run one test by commenting out function
  calls in main().
- Make the output easy to parse by indenting the output for each example.
- Add shebang and +x bit to make it clear it's an executable.
- Make the host configurable via --host with a default 127.0.0.1:8080.
- Make the code look in the tools list to call the registered tool,
  instead of hardcoding the returned values. This makes the code more
  copy-pastable.
- Add error checking, so that the program exits 1 if the LLM didn't
  returned expected values. It's super useful to check for correctness.

Testing:

- Tested with Mistral-7B-Instruct-v0.3 in F16 and Q5_K_M and
  Meta-Llama-3-8B-Instruct in F16 and Q5_K_M.
  - I did not observe a failure even once in Mistral-7B-Instruct-v0.3.
  - Llama-3 failed about a third of the time in example_concurrent: it
    only returned one call instead of 3. Even for F16.

Potential follow ups:

- Do not fix the prompt encoding yet. Surprisingly it mostly works even
  if the prompt encoding is not model optimized.
- Add chained answer and response.

Test only change.
2024-07-20 22:09:17 -04:00
Georgi Gerganov
07283b1a90
gguf : handle null name during init (#8587) 2024-07-20 17:15:42 +03:00
Huifeng Ou
69b9945b44
llama.swiftui: fix end of generation bug (#8268)
* fix continuing generating blank lines after getting EOT token or EOS token from LLM

* change variable name to is_done (variable name suggested by ggerganov)

* minor : fix trailing whitespace

* minor : add space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-20 16:09:37 +03:00
Eric Zhang
0d2c7321e9
server: use relative routes for static files in new UI (#8552)
* server: public: fix api_url on non-index pages

* server: public: use relative routes for static files in new UI
2024-07-18 12:43:49 +02:00
Brian
672a6f1018
convert-*.py: GGUF Naming Convention Refactor and Metadata Override Refactor (#7499)
Main thing is that the default output filename will take this form

{name}{parameters}{finetune}{version}{encoding}{kind}

In addition this add and remove some entries in the KV store and adds a metadata class with automatic heuristics capability to derive some values based on model card content

* No Change:
  - Internal GGUF Spec
    - `general.architecture`
    - `general.quantization_version`
    - `general.alignment`
    - `general.file_type`
  - General Model Details
    - `general.name`
    - `general.author`
    - `general.version`
    - `general.description`
  - Licensing details
    - `general.license`
  - Typically represents the converted GGUF repo (Unless made from scratch)
    - `general.url`
  - Model Source during conversion
    - `general.source.url`

* Removed:
  - Model Source during conversion
    - `general.source.huggingface.repository`

* Added:
  - General Model Details
    - `general.organization`
    - `general.finetune`
    - `general.basename`
    - `general.quantized_by`
    - `general.size_label`
  - Licensing details
    - `general.license.name`
    - `general.license.link`
  - Typically represents the converted GGUF repo (Unless made from scratch)
    - `general.doi`
    - `general.uuid`
    - `general.repo_url`
  - Model Source during conversion
    - `general.source.doi`
    - `general.source.uuid`
    - `general.source.repo_url`
  - Base Model Source
    - `general.base_model.count`
    - `general.base_model.{id}.name`
    - `general.base_model.{id}.author`
    - `general.base_model.{id}.version`
    - `general.base_model.{id}.organization`
    - `general.base_model.{id}.url` (Model Website/Paper)
    - `general.base_model.{id}.doi`
    - `general.base_model.{id}.uuid`
    - `general.base_model.{id}.repo_url` (Model Source Repository (git/svn/etc...))
  - Array based KV stores
    - `general.tags`
    - `general.languages`
    - `general.datasets`

---------

Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-18 20:40:15 +10:00