Commit Graph

418 Commits

Author SHA1 Message Date
b4b4o
c4dd11d1d3
readme : fix web link error [no ci] (#8347) 2024-07-08 17:19:24 +03:00
toyer
04ce3a8b19
readme : add supported glm models (#8360) 2024-07-08 08:57:19 +03:00
Andy Tai
f1948f1e10
readme : update bindings list (#8222)
* adding guile_llama_cpp  to binding list

* fix formatting

* fix formatting
2024-07-07 16:21:37 +03:00
Xuan Son Nguyen
60d83a0149
update main readme (#8333) 2024-07-06 19:01:23 +02:00
Xuan Son Nguyen
be20e7f49d
Reorganize documentation pages (#8325)
* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
2024-07-05 18:08:32 +02:00
Georgi Gerganov
6c05752c50
contributing : update guidelines (#8316) 2024-07-05 09:09:47 +03:00
Georgi Gerganov
e235b267a2
py : switch to snake_case (#8305)
* py : switch to snake_case

ggml-ci

* cont

ggml-ci

* cont

ggml-ci

* cont : fix link

* gguf-py : use snake_case in scripts entrypoint export

* py : rename requirements for convert_legacy_llama.py

Needed for scripts/check-requirements.sh

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-07-05 07:53:33 +03:00
Mateusz Charytoniuk
dae57a1ebc
readme: add Paddler to the list of projects (#8239) 2024-07-01 20:13:22 +03:00
Xuan Son Nguyen
49122a873f
gemma2: add sliding window mask (#8227)
* gemma2: add sliding window mask

* fix data_swa uninitialized

* better naming

* add co-author

Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>

* replace list with single tensor

* update

* llama : minor styling

* convert : add sanity check for query_pre_attn_scalar

* fix small typo in README

---------

Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-01 18:48:34 +02:00
Roni
0ddeff1023
readme : update tool list (#8209)
* Added gppm to Tool list in README

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-01 15:48:16 +03:00
iacore
694c59cb42
Document BERT support. (#8205)
* Update README.md

document BERT support

* Update README.md
2024-07-01 13:40:58 +02:00
Georgi Gerganov
a95631ee97
readme : update API notes 2024-06-26 19:26:13 +03:00
Georgi Gerganov
f3f65429c4
llama : reorganize source code + improve CMake (#8006)
* scripts : update sync [no ci]

* files : relocate [no ci]

* ci : disable kompute build [no ci]

* cmake : fixes [no ci]

* server : fix mingw build

ggml-ci

* cmake : minor [no ci]

* cmake : link math library [no ci]

* cmake : build normal ggml library (not object library) [no ci]

* cmake : fix kompute build

ggml-ci

* make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE

ggml-ci

* move public backend headers to the public include directory (#8122)

* move public backend headers to the public include directory

* nix test

* spm : fix metal header

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* scripts : fix sync paths [no ci]

* scripts : sync ggml-blas.h [no ci]

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-26 18:33:02 +03:00
Isaac McFadyen
8854044561
Clarify default MMQ for CUDA and LLAMA_CUDA_FORCE_MMQ flag (#8115)
* Add message about int8 support

* Add suggestions from review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-06-26 08:29:28 +02:00
Johannes Gäßler
a818f3028d
CUDA: use MMQ instead of cuBLAS by default (#8075) 2024-06-24 17:43:42 +02:00
Abheek Gulati
1193778105
readme : update UI list (#7943) 2024-06-18 09:57:41 +03:00
Bryan Honof
b473e95084
Add Nix and Flox install instructions (#7899) 2024-06-17 09:37:55 -06:00
hopkins385
6fe1c62741
readme : update UI list [no ci] (#7958) 2024-06-16 14:51:18 +03:00
Galunid
a55eb1bf0f
readme : Remove outdated instructions from README.md (#7914) [no ci] 2024-06-13 09:42:41 +02:00
Olivier Chafik
1c641e6aac
build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)
* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew

* server: update refs -> llama-server

gitignore llama-server

* server: simplify nix package

* main: update refs -> llama

fix examples/main ref

* main/server: fix targets

* update more names

* Update build.yml

* rm accidentally checked in bins

* update straggling refs

* Update .gitignore

* Update server-llm.sh

* main: target name -> llama-cli

* Prefix all example bins w/ llama-

* fix main refs

* rename {main->llama}-cmake-pkg binary

* prefix more cmake targets w/ llama-

* add/fix gbnf-validator subfolder to cmake

* sort cmake example subdirs

* rm bin files

* fix llama-lookup-* Makefile rules

* gitignore /llama-*

* rename Dockerfiles

* rename llama|main -> llama-cli; consistent RPM bin prefixes

* fix some missing -cli suffixes

* rename dockerfile w/ llama-cli

* rename(make): llama-baby-llama

* update dockerfile refs

* more llama-cli(.exe)

* fix test-eval-callback

* rename: llama-cli-cmake-pkg(.exe)

* address gbnf-validator unused fread warning (switched to C++ / ifstream)

* add two missing llama- prefixes

* Updating docs for eval-callback binary to use new `llama-` prefix.

* Updating a few lingering doc references for rename of main to llama-cli

* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.

* Updating documentation references for lookup-merge and export-lora

* Updating two small `main` references missed earlier in the finetune docs.

* Update apps.nix

* update grammar/README.md w/ new llama-* names

* update llama-rpc-server bin name + doc

* Revert "update llama-rpc-server bin name + doc"

This reverts commit e474ef1df4.

* add hot topic notice to README.md

* Update README.md

* Update README.md

* rename gguf-split & quantize bins refs in **/tests.sh

---------

Co-authored-by: HanClinto <hanclinto@gmail.com>
2024-06-13 00:41:52 +01:00
Patrice Ferlet
f2b5764beb
Fix a typo and add Fedora 40 pacakge to install for Vulkan (#7794) [no ci]
Fix "appropiate" to "appropriate" and add Fedora 40 packages to install to compile with Vulkan support
2024-06-12 11:18:16 +10:00
Georgi Gerganov
c28a83902c
examples : remove --instruct remnants (#7846) 2024-06-10 15:00:15 +03:00
Mattheus Chediak
a143c04375
README minor fixes (#7798) [no ci]
derievatives --> derivatives
2024-06-06 22:17:54 +10:00
Georgi Gerganov
554c247caf
ggml : remove OpenCL (#7735)
ggml-ci
2024-06-04 21:23:20 +03:00
Georgi Gerganov
5ca0944a15
readme : remove obsolete Zig instructions (#7471) 2024-06-04 19:43:01 +03:00
HanishKVC
2ac95c9d56
SimpleChat: Simple histogram/repeatMatching driven garbageTrimming, Settings UI, Streaming mode, OpenAi Compat (Model, Authorization Bearer), Save/Restore session, Auto Settings UI (#7548)
* SimpleChat:DU:BringIn local helper js modules using importmap

Use it to bring in a simple trim garbage at end logic, which is
used to trim received response.

Also given that importmap assumes esm / standard js modules, so
also global variables arent implicitly available outside the
modules. So add it has a member of document for now

* SimpleChat:DU: Add trim garbage at end in loop helper

* SimpleChat:DU:TrimGarbage if unable try skip char and retry

* SimpleChat:DU: Try trim using histogram based info

TODO: May have to add max number of uniq chars in histogram at
end of learning phase.

* SimpleChat:DU: Switch trim garbage hist based to maxUniq simple

Instead of blindly building histogram for specified substring
length, and then checking if any new char within specified min
garbage length limit, NOW exit learn state when specified maxUniq
chars are found. Inturn there should be no new chars with in
the specified min garbage length required limit.

TODO: Need to track char classes like alphabets, numerals and
special/other chars.

* SimpleChat:DU: Bring in maxType to the mix along with maxUniq

Allow for more uniq chars, but then ensure that a given type of
char ie numerals or alphabets or other types dont cross the
specified maxType limit. This allows intermixed text garbage
to be identified and trimmed.

* SimpleChat:DU: Cleanup debug log messages

* SimpleChat:UI: Move html ui base helpers into its own module

* SimpleChat:DU:Avoid setting frequence/Presence penalty

Some models like llama3 found to try to be over intelligent by
repeating garbage still, but by tweaking the garbage a bit so that
it is not exactly same. So avoid setting these penalties and let
the model's default behaviour work out, as is.

Also the simple minded histogram based garbage trimming from end,
works to an extent, when the garbage is more predictable and
repeatative.

* SimpleChat:UI: Add and use a para-create-append helper

Also update the config params dump to indicate that now one needs
to use document to get hold of gMe global object, this is bcas of
moving to module type js.

Also add ui.mjs to importmap

* SimpleChat:UI: Helper to create bool button and use it wrt settings

* SimpleChat:UI: Add Select helper and use it wrt ChatHistoryInCtxt

* SimpleChat:UI:Select: dict-name-value, value wrt default, change

Take a dict/object of name-value pairs instead of just names.
Inturn specify the actual value wrt default, rather than the
string representing that value.

Trap the needed change event rather than click wrt select.

* SimpleChat:UI: Add Div wrapped label+element helpers

Move settings related elements to use the new div wrapped ones.

* SimpleChat:UI:Add settings button and bring in settings ui

* SimpleChat:UI:Settings make boolean button text show meaning

* SimpleChat: Update a bit wrt readme and notes in du

* SimpleChat: GarbageTrim enable/disable, show trimmed part ifany

* SimpleChat: highlight trim, garbage trimming bitmore aggressive

Make it easy for end user to identified the trimmed text.

Make garbage trimming logic, consider a longer repeat garbage
substring.

* SimpleChat: Cleanup a bit wrt Api end point related flow

Consolidate many of the Api end point related basic meta data into
ApiEP class.

Remove the hardcoded ApiEP/Mode settings from html+js, instead use
the generic select helper logic, inturn in the settings block.

Move helper to generate the appropriate request json string based
on ApiEP into SimpleChat class itself.

* SimpleChat:Move extracting assistant response to SimpleChat class

so also the trimming of garbage.

* SimpleChat:DU: Bring in both trim garbage logics to try trim

* SimpleChat: Cleanup readme a bit, add one more chathistory length

* SimpleChat:Stream:Initial handshake skeleton

Parse the got stream responses and try extract the data from it.

It allows for a part read to get a single data line or multiple
data line. Inturn extract the json body and inturn the delta
content/message in it.

* SimpleChat: Move handling oneshot mode server response

Move handling of the oneshot mode server response into SimpleChat.

Also add plumbing for moving multipart server response into same.

* SimpleChat: Move multi part server response handling in

* SimpleChat: Add MultiPart Response handling, common trimming

Add logic to call into multipart/stream server response handling.

Move trimming of garbage at the end into the common handle_response
helper.

Add new global flag to control between oneshot and multipart/stream
mode of fetching response. Allow same to be controlled by user.

If in multipart/stream mode, send the stream flag to the server.

* SimpleChat: show streamed generative text as it becomes available

Now that the extracting of streamed generated text is implemented,
add logic to show the same on the screen.

* SimpleChat:DU: Add NewLines helper class

To work with an array of new lines. Allow adding, appending,
shifting, ...

* SimpleChat:DU: Make NewLines shift more robust and flexible

* SimpleChat:HandleResponseMultiPart using NewLines helper

Make handle_response_multipart logic better and cleaner. Now it
allows for working with the situation, where the delta data line
got from server in stream mode, could be split up when recving,
but still the logic will handle it appropriately.

ALERT: Rather except (for now) for last data line wrt a request's
response.

* SimpleChat: Disable console debug by default by making it dummy

Parallely save a reference to the original func.

* SimpleChat:MultiPart/Stream flow cleanup

Dont try utf8-decode and newlines-add_append if no data to work on.

If there is no more data to get (ie done is set), then let NewLines
instance return line without newline at end, So that we dont miss
out on any last-data-line without newline kind of scenario.

Pass stream flag wrt utf-8 decode, so that if any multi-byte char
is only partly present in the passed buffer, it can be accounted
for along with subsequent buffer. At sametime, bcas of utf-8's
characteristics there shouldnt be any unaccounted bytes at end,
for valid block of utf8 data split across chunks, so not bothering
calling with stream set to false at end. LATER: Look at TextDecoder's
implementation, for any over intelligence, it may be doing..
If needed, one can use done flag to account wrt both cases.

* SimpleChat: Move baseUrl to Me and inturn gMe

This should allow easy updating of the base url at runtime by the
end user.

* SimpleChat:UI: Add input element helper

* SimpleChat: Add support for changing the base url

This ensures that if the user is running the server with a
different port or wants to try connect to server on a different
machine, then this can be used.

* SimpleChat: Move request headers into Me and gMe

Inturn allow Authorization to be sent, if not empty.

* SimpleChat: Rather need to use append to insert headers

* SimpleChat: Allow Authorization header to be set by end user

* SimpleChat:UI+: Return div and element wrt creatediv helpers

use it to set placeholder wrt Authorization header.

Also fix copy-paste oversight.

* SimpleChat: readme wrt authorization, maybe minimal openai testing

* SimpleChat: model request field for openai/equivalent compat

May help testing with openai/equivalent web services, if they
require this field.

* SimpleChat: readme stream-utf-8 trim-english deps, exception2error

* Readme: Add a entry for simplechat in the http server section

* SimpleChat:WIP:Collate internally, Stream mode Trap exceptions

This can help ensure that data fetched till that point, can be
made use of, rather than losing it.

On some platforms, the time taken wrt generating a long response,
may lead to the network connection being broken when it enters
some user-no-interaction related power saving mode.

* SimpleChat:theResp-origMsg: Undo a prev change to fix non trim

When the response handling was moved into SimpleChat, I had changed
a flow bit unnecessarily and carelessly, which resulted in the non
trim flow, missing out on retaining the ai assistant response.

This has been fixed now.

* SimpleChat: Save message internally in handle_response itself

This ensures that throwing the caught exception again for higher
up logic, doesnt lose the response collated till that time.

Go through theResp.assistant in catch block, just to keep simple
consistency wrt backtracing just in case.

Update the readme file.

* SimpleChat:Cleanup: Add spacing wrt shown req-options

* SimpleChat:UI: CreateDiv Divs map to GridX2 class

This allows the settings ui to be cleaner structured.

* SimpleChat: Show Non SettingsUI config field by default

* SimpleChat: Allow for multiline system prompt

Convert SystemPrompt into a textarea with 2 rows. Reduce
user-input-textarea to 2 rows from 3, so that overall
vertical space usage remains same.

Shorten usage messages a bit, cleanup to sync with settings ui.

* SimpleChat: Add basic skeleton for saving and loading chat

Inturn when ever a chat message (system/user/model) is added,
the chat will be saved into browser's localStorage.

* SimpleChat:ODS: Add a prefix to chatid wrt ondiskstorage key

* SimpleChat:ODS:WIP:TMP: Add UI to load previously saved chat

This is a temporary flow

* SimpleChat:ODS:Move restore/load saved chat btn setup to Me

This also allows being able to set the common system prompt
ui element to loaded chat's system prompt.

* SimpleChat:Readme updated wrt save and restore chat session info

* SimpleChat:Show chat session restore button, only if saved session

* SimpleChat: AutoCreate ChatRequestOptions settings to an extent

* SimpleChat: Update main README wrt usage with server
2024-06-02 02:20:18 +10:00
Johannes Gäßler
9b596417af
CUDA: quantized KV support for FA vec (#7527)
* CUDA: quantized KV support for FA vec

* try CI fix

* fix commented-out kernel variants

* add q8_0 q4_0 tests

* fix nwarps > batch size

* split fattn compile via extern templates

* fix flake8

* fix metal tests

* fix cmake

* make generate_cu_files.py executable

* add autogenerated .cu files

* fix AMD

* error if type_v != FP16 and not flash_attn

* remove obsolete code
2024-06-01 08:44:14 +02:00
Georgi Gerganov
16926dff92
readme : link homebrew discussion 2024-05-31 15:04:58 +03:00
Galunid
2e32f874e6
Somehow '**' got lost (#7663) 2024-05-31 18:24:41 +10:00
Galunid
1af511fc22
Add convert.py removal to hot topics (#7662) 2024-05-31 10:09:20 +02:00
Sertaç Özercan
0541f06296
[no ci] docs: add aikit to readme (#7650)
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
2024-05-31 09:57:16 +10:00
Martin Delille
5dcdf94676
Fix conan badge display [no ci] (#7645) 2024-05-31 01:07:39 +10:00
Manuel
2e2340de17
Add brew installation instruction to README [no ci] (#7616) 2024-05-31 00:58:15 +10:00
Martin Delille
7846540bd2
readme : add Conan badge (#7638) 2024-05-30 15:52:50 +03:00
Galunid
9c4c9cc83f
Move convert.py to examples/convert-legacy-llama.py (#7430)
* Move convert.py to examples/convert-no-torch.py

* Fix CI, scripts, readme files

* convert-no-torch -> convert-legacy-llama

* Move vocab thing to vocab.py

* Fix convert-no-torch -> convert-legacy-llama

* Fix lost convert.py in ci/run.sh

* Fix imports

* Fix gguf not imported correctly

* Fix flake8 complaints

* Fix check-requirements.sh

* Get rid of ADDED_TOKENS_FILE, FAST_TOKENIZER_FILE

* Review fixes
2024-05-30 21:40:00 +10:00
Johannes Gäßler
972b555ab9
README: explain parallel build [no ci] (#7618) 2024-05-30 09:52:39 +02:00
Meng, Hengyu
b864b50ce5
[SYCL] Align GEMM dispatch (#7566)
* align GEMM dispatch
2024-05-29 07:00:24 +08:00
Aarni Koskela
9146d36fe7
Readme: add akx/ggify to tools (#1484) 2024-05-26 22:09:42 +10:00
Georgi Gerganov
74f33adf5f
readme : remove trailing space (#7469) 2024-05-23 17:43:18 +03:00
Raj Hammeer Singh Hada
8b94e799df
readme : add Bunny in supported models [no ci] (#7469) 2024-05-23 15:30:13 +03:00
Victor Nogueira
dacfcebd60
readme : add GPT-NeoX + Pythia to the list of supported models (#7491) 2024-05-23 15:12:43 +03:00
Georgi Gerganov
fabf30b4c4
llama : remove Persimmon (#7408)
* llama : remove Persimmon

* requirements : remove
2024-05-21 02:35:28 +10:00
Bingan
26cd4237bc
Update README.md (#7410) 2024-05-20 11:55:34 +02:00
slaren
d359f30921
llama : remove MPI backend (#7395) 2024-05-20 01:17:03 +02:00
Gavin Zhao
82ca83db3c
ROCm: use native CMake HIP support (#5966)
Supercedes #4024 and #4813.

CMake's native HIP support has become the
recommended way to add HIP code into a project (see
[here](https://rocm.docs.amd.com/en/docs-6.0.0/conceptual/cmake-packages.html#using-hip-in-cmake)).
This PR makes the following changes:

1. The environment variable `HIPCXX` or CMake option
`CMAKE_HIP_COMPILER` should be used to specify the HIP
compiler. Notably this shouldn't be `hipcc`, but ROCm's clang,
which usually resides in `$ROCM_PATH/llvm/bin/clang`. Previously
this was control by `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER`.
Note that since native CMake HIP support is not yet available on
Windows, on Windows we fall back to the old behavior.

2. CMake option `CMAKE_HIP_ARCHITECTURES` is used to control the
GPU architectures to build for. Previously this was controled by
`GPU_TARGETS`.

3. Updated the Nix recipe to account for these new changes.

4. The GPU targets to build against in the Nix recipe is now
consistent with the supported GPU targets in nixpkgs.

5. Added CI checks for HIP on both Linux and Windows. On Linux, we test
both the new and old behavior.

The most important part about this PR is the separation of the
HIP compiler and the C/C++ compiler. This allows users to choose
a different C/C++ compiler if desired, compared to the current
situation where when building for ROCm support, everything must be
compiled with ROCm's clang.

~~Makefile is unchanged. Please let me know if we want to be
consistent on variables' naming because Makefile still uses
`GPU_TARGETS` to control architectures to build for, but I feel
like setting `CMAKE_HIP_ARCHITECTURES` is a bit awkward when you're
calling `make`.~~ Makefile used `GPU_TARGETS` but the README says
to use `AMDGPU_TARGETS`. For consistency with CMake, all usage of
`GPU_TARGETS` in Makefile has been updated to `AMDGPU_TARGETS`.

Thanks to the suggestion of @jin-eld, to maintain backwards
compatibility (and not break too many downstream users' builds), if
`CMAKE_CXX_COMPILER` ends with `hipcc`, then we still compile using
the original behavior and emit a warning that recommends switching
to the new HIP support. Similarly, if `AMDGPU_TARGETS` is set but
`CMAKE_HIP_ARCHITECTURES` is not, then we forward `AMDGPU_TARGETS`
to `CMAKE_HIP_ARCHITECTURES` to ease the transition to the new
HIP support.

Signed-off-by: Gavin Zhao <git@gzgz.dev>
2024-05-17 17:03:03 +02:00
Vaibhav Srivastav
ad52d5c259
doc: add references to hugging face GGUF-my-repo quantisation web tool. (#7288)
* chore: add references to the quantisation space.

* fix grammer lol.

* Update README.md

Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Update README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-16 15:38:43 +10:00
Daniel Bevenius
8f7080bf48
readme : remove stray double quote (#7310)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-05-15 23:41:03 +02:00
Andrei
d11afd6652
llava : fix moondream support (#7163)
* Revert "Revert "llava : add support for moondream vision language model (#6899)""

This reverts commit 9da243b36a.

* Fix num_positions and embeddings initialization
2024-05-10 09:41:10 +03:00
Georgi Gerganov
d46dbc76f8
readme : add scheduled server workflow status badge 2024-05-09 16:40:42 +03:00
l3utterfly
0961d86604
readme : add app (#6371)
* added Layla to supported UIs

* Update README.md
2024-05-09 16:32:40 +03:00
Georgi Gerganov
9da243b36a
Revert "llava : add support for moondream vision language model (#6899)"
This reverts commit 46e12c4692.
2024-05-08 22:14:39 +03:00
Jeximo
c780e75305
Further tidy on Android instructions README.md (#7077)
* Further tidy on Android instructions README.md

Fixed some logic when following readme direction

* Clean up redundent information

A new user arriving will see simple directions on llama.cpp homepage

* corrected puncuation

Period after cmake, colon after termux

* re-word for clarity

method seems to be more correct, instead of alternative in this context

* Organized required packages per build type

building llama.cpp with NDK on a pc doesn't require installing clang, cmake, git, or wget in termux.

* README.md

corrected title

* fix trailing whitespace
2024-05-08 02:26:43 +02:00
Georgi Gerganov
53d6c52e22
readme : update hot topics 2024-05-07 21:43:13 +03:00
Georgi Gerganov
bcdee0daa7
minor : fix trailing whitespace 2024-05-06 09:31:30 +03:00
Lyle Dean
ca36326020
readme : add note that LLaMA 3 is not supported with convert.py (#7065) 2024-05-05 08:21:46 +03:00
Jeximo
cf768b7e71
Tidy Android Instructions README.md (#7016)
* Tidy Android Instructions README.md

Remove CLBlast instructions(outdated), added OpenBlas.

* don't assume git is installed

Added apt install git, so that git clone works

* removed OpenBlas

Linked to Linux build instructions

* fix typo

Remove word "run"

* correct style

Co-authored-by: slaren <slarengh@gmail.com>

* correct grammar

Co-authored-by: slaren <slarengh@gmail.com>

* delete reference to Android API

* remove Fdroid reference, link directly to Termux

Fdroid is not required

Co-authored-by: slaren <slarengh@gmail.com>

* Update README.md

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-04 18:10:15 +02:00
Olivier Chafik
b8a7a5a90f
build(cmake): simplify instructions (cmake -B build && cmake --build build ...) (#6964)
* readme: cmake . -B build && cmake --build build

* build: fix typo

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* build: drop implicit . from cmake config command

* build: remove another superfluous .

* build: update MinGW cmake commands

* Update README-sycl.md

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

* build: reinstate --config Release as not the default w/ some generators + document how to build Debug

* build: revert more --config Release

* build: nit / remove -H from cmake example

* build: reword debug instructions around single/multi config split

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2024-04-29 17:02:45 +01:00
Georgi Gerganov
24affa7db3
readme : update hot topics 2024-04-29 17:06:19 +03:00
vik
46e12c4692
llava : add support for moondream vision language model (#6899)
* add support for moondream vision language model

This required making the following changes to the CLIP model:

1. Support for patch embedding bias.
2. Make class embedding and pre-layernorm optional.
3. Add support for post-layernorm.

* Update examples/llava/clip.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-25 22:38:31 +03:00
BarfingLemurs
3fe0596c18
readme : update model list (#6908)
* Update README.md

* missing space

* llama3 !
2024-04-25 16:52:28 +03:00
Johannes Gäßler
784e11dea1
README: add graphic for matrix multiplication (#6881) 2024-04-24 21:29:13 +02:00
Georgi Gerganov
40f74e4d73
llama : add option to render special/control tokens (#6807)
* make : fix common dep on llama.h

* llama : add option to render special tokens

* readme : add API change notice

ggml-ci

* swift : fix build
2024-04-21 18:36:45 +03:00
Jan Boon
e8d35f47cb
doc : add link to falcon (#6789) 2024-04-21 15:35:40 +03:00
Mohammadreza Hendiani
2cca09d509
readme : add Fedora instructions (#6783)
* added fedora to list of distros that may need the package (the packages have the same name on Fedora)

* how to add clblast that is avalible in the fedora repos
2024-04-21 15:32:05 +03:00
nopperl
9958c81b79
Implement the OLMo architecture (#6741)
* implement olmo architecture

* remove unused variable

* remove unused moe branch

* remove check for weight

* remove superfluous moe, bias and rope tensors

* clarified comment

* fix clamp_kqv setting

* remove obsolete parameter name filter
2024-04-19 11:35:54 +02:00
Yaroslav
8dd1ec8b3f
readme : add UI (#6724)
* Update README.md

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-17 15:47:50 +03:00
Pierrick Hymbert
4bd0f93e4a
model: support arch DbrxForCausalLM (#6515)
* model: dbrx convert to gguf
#6344

* llama: support dbrx
#6344

* doc: dbrx: add the model as supported

* scripts: get-wikitext-2 add unzip

* llama: increase maximum experts allowed

* llama: factorize moe graph implementation between grok, mixtral and dbrx


---------

Co-authored-by: Megha Agarwal <16129366+megha95@users.noreply.github.com>
2024-04-13 11:33:52 +02:00
Rene Leonhardt
5c4d767ac0
chore: Fix markdown warnings (#6625) 2024-04-12 10:52:36 +02:00
Pierrick Hymbert
67fac4b95f
docs : how to add a model (#6565)
* docs: how to add a model

* docs: model: typo and docs

* docs: model: add prevision on RoPE

* docs: model: rephrasing README.md

* docs: model: rephrasing README.md

* docs: model: README.md fix trailing spaces

* docs : some fixes

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-10 09:58:48 +03:00
Artem Zinnatullin
29122d32ac
readme : fix ROCm link (#6579) 2024-04-10 09:49:12 +03:00
sjxx
b231b37b09
readme : update UI list (#6560) 2024-04-10 09:34:00 +03:00
Jiří Sejkora
ba5e134e07
readme: fix typo in amdgpu target name (#6573) 2024-04-10 00:23:02 +02:00
Jan Boon
beea6e1b16
llama : save and restore kv cache for single seq id (#6341)
* llama : save and restore kv cache for single seq id

* remove trailing whitespace

* respond error in case there's no space in the kv cache

* add kv seq save restore to test case

* add --slot-save-path arg to enable save restore and restrict save location

* Returning 0 for some cases, instead of asserting.

* cleanup error cases

* rename sequence state functions

* rename state get set functions

* add previous function names back in with DEPRECATED notice

* update doc

* adjust endpoints to preferred style

* fix restoring zero cell count

* handle seq rm return value

* unused param

* keep in the size check

* fix return types

* add server test case for slot save restore

* cleanup

* add cake

* cleanup style

* add special

* removing a whole sequence never fails

* move sequence state file functionality from server to llama to match session api and add version tags

* catch exceptions on save as well

* error log messages

* check types for stricter restore

* update server doc

* readme : update API changes date

* strict filename validation

* move include, reject bom as well

* also reject empty filename

* reject whitespace and trailing dot

---------

Co-authored-by: Martin Evans <martindevans@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-08 15:43:30 +03:00
Firat
d752327c33
Adding KodiBot to UI list (#6535)
KodiBot is free and open source ai chat app released under the GNU General Public License.
2024-04-08 09:48:29 +02:00
Mark Fairbairn
855f54402e
Change Windows AMD example to release build to make inference much faster. (#6525) 2024-04-07 20:52:19 +02:00
DAN™
e0717e751e
Add GritLM as supported models. (#6513) 2024-04-07 19:33:59 +02:00
Hoang Nguyen
d0f5deebf8
readme : update UI list (#6503)
* Add MindMac to UI list

* Update proprietary description

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-04-05 21:39:43 +03:00
alexpinel
a307375c02
readme : add Dot to UI list (#6487) 2024-04-04 13:22:50 -04:00
Jun Jie
b660a5729e
readme : fix typo (#6481) 2024-04-04 13:16:37 -04:00
bryanSwk
bb43cf7e9d
llama : add SEA-LION support (#6448)
* initial commit for sealion support

* add sealion support

* minor fix

* q/k ln and pos_embd only if required

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* minor : clear whitespaces

---------

Co-authored-by: bryan <bryansiow@aisingapore.org>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-03 21:05:10 +03:00
Francisco Melo
154d4ee39c
readme : add feature-rich rust bindings (#6465) 2024-04-03 20:53:37 +03:00
Georgi Gerganov
076b08649e
readme : update hot topics 2024-04-03 16:11:15 +03:00
Georgi Gerganov
c50a82ce0f
readme : update hot topics 2024-03-31 11:56:30 +03:00
0cc4m
ba0c7c70ab
Vulkan k-quant mmq and ggml-backend offload functionality (#6155)
* Fix Vulkan no kv offload incoherence

* Add k-quant mul mat mat shaders

* Rework working buffer allocation, reduces vram use noticeably

Clean up cpu assist code, replaced with ggml-backend offload function

* Default to all dedicated GPUs

* Add fallback for integrated GPUs if no dedicated GPUs are found

* Add debug info which device is allocating memory

* Fix Intel dequant issue

Fix validation issue

* Fix Vulkan GGML_OP_GET_ROWS implementation

* Clean up merge artifacts

* Remove Vulkan warning
2024-03-29 17:29:21 +01:00
hxer7963
069574775c
[Model] Add support for xverse (#6301)
* Support xverse model convert to gguf format.

* 1. Convert xverse models to gguf;
2. Add LLM_ARCH_XVERSE inference in llama.cpp;
3. Add xverse item in Supported models in README.md;

* * gguf-py: remove redundant logs
* llama: remove the init_mapping_prefetch custom parameter

* llama.cpp: Include the changes from #6122 to exclude the unused outputs of the last layers.

* - Fix format issues
- Remove duplicate set kqv_out to llm_build_kv

* Update llama.cpp

---------

Co-authored-by: willhe <willhe@xverse.cn>
Co-authored-by: willhe <hexin@xverse.cn>
2024-03-29 14:37:03 +01:00
zhouwg
b910287954
readme : add project (#6356)
* readme: add Android UI binding

* Update README.md
2024-03-29 09:33:46 +02:00
Georgi Gerganov
bfe7dafc9c
readme : add notice for UI list 2024-03-28 22:56:03 +02:00
Mateusz Charytoniuk
1740d6dd4e
readme : add php api bindings (#6326)
* add php bindings to readme

* readme : add link to PR

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-27 09:08:59 +02:00
compilade
557410b8f0
llama : greatly reduce output buffer memory usage (#6122)
* llama : greatly reduce logits memory usage

* llama : more compact state saving and reloading

* llama : fix lctx.n_outputs not being set before building graph

* perplexity : adapt to the logits API changes

* perplexity : fix Winogrande, use correct logits for second choice start

The first logits used to evaluate the second choice were not from
the end of the common prefix; instead, they were the logits from the end
of the first choice. This has been corrected.

The previous implementation sometimes had outliers in the scores of
choices for some tasks, and the logic to skip choices words
in the log-likelihood evaluation probably was an attempt to reduce those,
but it was complex and didn't quite seem to be the right thing.

This is simpler now, and the outlier scores aren't there anymore.

* perplexity : normalize spaces and punctuation in Winogrande sentences

* llama : fix embedding conditions

* llama : fix llama_get_embeddings_ith when the resulting id is 0

* llama : fix wrong n_outputs in llama_set_inputs

A mismatch happened when using a smaller n_ubatch than n_batch and then using
llama_batch_get_one(). The decision of what n_outputs should be now almost
fully depends on how lctx.n_outputs is set in llama_decode_internal.
The conditions are simpler this way.

* llama : when saving the state, recalculate n_outputs

This ensures the correct number of outputs for the entire previous batch
is stored in the session file, even when n_ubatch is smaller than n_batch.

* llama : fix not-skipping outputs of non-causal models

* llama : fix running a batch with n_outputs == 0

It previously worked because lctx.inp_out_ids was not initialized,
so it pointed to some garbage address which was somehow still valid when I
ran my tests.

* llama : keep same graph topology even when n_outputs == 0

* ggml : saner ggml_can_repeat with empty tensors

*  ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1

* ggml : do not multi-thread ops returning empty tensors

* ggml : make ggml_is_empty public and work with views

* llama : use a vector for ctx->output_ids

* llama : rework reallocation logic for llama_output_reserve

Now comparing the actual size with the new total size of the output buffer
to allow more efficient enabling and disabling of the embeddings
and/or logits output in the future.

* ggml : skip empty tensors in all backends

* llama : fix llama_output_reserve nullptr deref when new_size is 0

* perplexity : make Winogrande work as it does on master

The problems with the Winogrande implementation will
need to be fixed in a separate PR to ease review.

* llama : clearer error messages for invalid logits or embeddings ids

* llama : assert all models that can have inp_out_ids

Since the graph topology is now constant, this presence check
can be done even when there are no outputs.

* llama : assert logits and embd buffers exist before writing to them

* llama : handle errors from llama_output_reserve at call sites

* perplexity : make hellaswag and multiple-choice outputs identical to master

Due to how the KV cache is updated, the logprobs for tokens in a batch
are very slightly affected by the other tokens present in the batch,
so to make hellaswag and multiple-choice return exactly the same results
as on master, the last token of each sequence needs to be evaluated
even though its output is not used at all.

This will probably be changed back in the future to make these benchmarks
a tiny bit faster.

* perplexity : fix division by zero when using less than 100 multiple-choice tasks

* llama : allow loading state saved with a different ctx size

When loading a session file, the context size is now only required to be
at least enough to load the KV cells contained in that session file,
instead of requiring to use exactly the same context size as when saving.

Doing this enables the use-case of extending or shrinking the context size
of a saved session.

This breaks existing session files because the meaning of kv_buf_size
is slightly changed (previously it was the size of the whole KV cache,
now it's only the size of the saved part of it). This allows for
finer-grained sanity checks when loading in an effort to keep kv_buf_size
useful even when the kv_size is changed.

* llama : minor

ggml-ci

* readme : update recent API changes, and warn about Vulkan

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-26 16:46:41 +02:00
slaren
280345968d
cuda : rename build flag to LLAMA_CUDA (#6299) 2024-03-26 01:16:01 +01:00
Pierrick Hymbert
dba1af6129
llama_model_loader: support multiple split/shard GGUFs (#6187)
* split: support in llama_model_loader

* avoid copying the entire vector

Co-authored-by: slaren <slarengh@gmail.com>

* split: move llama_tensor_offset to llama_model_loader

* llama_model_loader: PR feedbacks:
 - use only one gguf_context for metadata only
 - store all ggml_context in a vector as the files and mappings
 - store all weights in a vector along with the source tensor
 - rename ctx_gguf to meta
 - rename ctx_meta to contexts

* avoid copying the entire vector

* Simplify this by making these optional, switch some layer creation tensor optional

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Handle optional tensors

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama_model_loader: fail if backend cannot allocate buffer

* fix mmap buffer management

* llama_model_loader: map file to backend buffer if the allocation succeeds only

* llama_model_loader: only map tensors included in the context

* llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast

* llama_model_loader: fail if any of backend buffer cannot be allocated

* spacing

Co-authored-by: slaren <slarengh@gmail.com>

* fix loop over pointer

Co-authored-by: slaren <slarengh@gmail.com>

* llama_model_loader: if n_tensors declared not equals to loaded tensors in split, throw an exception instead of asserting

* llama_model_loader: ensure mappings vector has the expected size

* llama_model_loader:  use at instead of operator[] if this should never add to the map.

* llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size.

* llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer

* llama_model_loader: fix map -> unordered map

* llama_split_prefix: use a clearer version, not pass split path len but dest max len.

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* llama : minor

ggml-ci

* llama : introduce some typedef helpers

* docs: add model shard in hot topic

* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated

Co-authored-by: slaren <slarengh@gmail.com>

* fix llama_split_prefix

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-03-22 19:00:01 +01:00
Xiaoyi Chen
29ab270e65
readme : add RecurseChat to the list of UIs (#6219) 2024-03-22 13:29:49 +02:00
Georgi Gerganov
b3e94f26ba
metal : proper assert for mat-mat memory alignment (#6225)
* metal : proper assert for mat-mat memory alignment

ggml-ci

* readme : add notice about the bug fix

* metal : fix the fix

ggml-ci
2024-03-22 11:35:53 +02:00
Xuan Son Nguyen
dfbfdd60f9
readme : add wllama as a wasm binding (#6100) 2024-03-16 17:42:08 +02:00
Andrew Canis
12247f4c69
llama : add Command-R support (#6033)
Information about the Command-R 35B model (128k context) can be found at:
	https://huggingface.co/CohereForAI/c4ai-command-r-v01

Based on the llama2 model with a few changes:

1) New hyper parameter to scale output logits (logit_scale)
2) Uses LayerNorm instead of RMSNorm
3) Transfomer layers have a single shared LayerNorm that feeds into both the
   self-attention and FFN layers in parallel. There is no post-attention LayerNorm.
4) No support for Rotary Position Embeddings (RoPE) scaling
5) No biases used

Find GGUF files here:
	https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF

To convert model to GGUF format yourself:

1) Download Command-R Hugging Face safetensors:
	git lfs install
	git clone https://huggingface.co/CohereForAI/c4ai-command-r-v01

2) Run:
	python3 convert-hf-to-gguf.py --outtype f16 ./c4ai-command-r-v01
2024-03-15 22:41:22 +02:00
Linwei Wang
19885d205e
readme : update details about running llama in Termux on Android (#6039) 2024-03-13 20:34:40 +02:00
Georgi Gerganov
76a936c893
readme : update API changes and hot topics 2024-03-13 20:33:56 +02:00
Georgi Gerganov
05b06210c9
llama : more consistent names of count variables (#5994)
* llama : more consistent names of count variables

ggml-ci

* llama : n_parallel -> n_seq_max

* common : fix param name

* examples : fix param name
2024-03-11 17:49:47 +02:00
Georgi Gerganov
d9f65c97c3
readme : update hot topics 2024-03-10 20:58:26 +02:00
Georgi Gerganov
098dbaab44
readme : update hot topics 2024-03-09 18:14:13 +02:00