Commit Graph

292 Commits

Author SHA1 Message Date
Johannes Gäßler
9b596417af
CUDA: quantized KV support for FA vec (#7527)
* CUDA: quantized KV support for FA vec

* try CI fix

* fix commented-out kernel variants

* add q8_0 q4_0 tests

* fix nwarps > batch size

* split fattn compile via extern templates

* fix flake8

* fix metal tests

* fix cmake

* make generate_cu_files.py executable

* add autogenerated .cu files

* fix AMD

* error if type_v != FP16 and not flash_attn

* remove obsolete code
2024-06-01 08:44:14 +02:00
Daniele
30e238b246
Improve HIP compatibility (#7672) 2024-05-31 16:00:29 +02:00
Johannes Gäßler
10b1e45876
make: add --device-debug to NVCC debug flags (#7542) 2024-05-27 19:34:40 +02:00
Georgi Gerganov
e84b71c2c6
ggml : drop support for QK_K=64 (#7473)
* ggml : drop support for QK_K=64

ggml-ci

* opencl : restore QK_K=256 define
2024-05-23 10:00:21 +03:00
junchao-loongson
65c58207ec
ggml : add loongarch lsx and lasx support (#6454)
* add loongarch lsx and lasx optimize code

* Add loongarch compilation support to makefile

* revert stb_image.h

* opt bytes_from_nibbles_32 and sum_i16_pairs_float

* fix undeclared

* format code

* update

* update 2

---------

Co-authored-by: Jinyang He <hejinyang@loongson.cn>
2024-05-20 10:19:21 +03:00
slaren
d359f30921
llama : remove MPI backend (#7395) 2024-05-20 01:17:03 +02:00
Gavin Zhao
82ca83db3c
ROCm: use native CMake HIP support (#5966)
Supercedes #4024 and #4813.

CMake's native HIP support has become the
recommended way to add HIP code into a project (see
[here](https://rocm.docs.amd.com/en/docs-6.0.0/conceptual/cmake-packages.html#using-hip-in-cmake)).
This PR makes the following changes:

1. The environment variable `HIPCXX` or CMake option
`CMAKE_HIP_COMPILER` should be used to specify the HIP
compiler. Notably this shouldn't be `hipcc`, but ROCm's clang,
which usually resides in `$ROCM_PATH/llvm/bin/clang`. Previously
this was control by `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER`.
Note that since native CMake HIP support is not yet available on
Windows, on Windows we fall back to the old behavior.

2. CMake option `CMAKE_HIP_ARCHITECTURES` is used to control the
GPU architectures to build for. Previously this was controled by
`GPU_TARGETS`.

3. Updated the Nix recipe to account for these new changes.

4. The GPU targets to build against in the Nix recipe is now
consistent with the supported GPU targets in nixpkgs.

5. Added CI checks for HIP on both Linux and Windows. On Linux, we test
both the new and old behavior.

The most important part about this PR is the separation of the
HIP compiler and the C/C++ compiler. This allows users to choose
a different C/C++ compiler if desired, compared to the current
situation where when building for ROCm support, everything must be
compiled with ROCm's clang.

~~Makefile is unchanged. Please let me know if we want to be
consistent on variables' naming because Makefile still uses
`GPU_TARGETS` to control architectures to build for, but I feel
like setting `CMAKE_HIP_ARCHITECTURES` is a bit awkward when you're
calling `make`.~~ Makefile used `GPU_TARGETS` but the README says
to use `AMDGPU_TARGETS`. For consistency with CMake, all usage of
`GPU_TARGETS` in Makefile has been updated to `AMDGPU_TARGETS`.

Thanks to the suggestion of @jin-eld, to maintain backwards
compatibility (and not break too many downstream users' builds), if
`CMAKE_CXX_COMPILER` ends with `hipcc`, then we still compile using
the original behavior and emit a warning that recommends switching
to the new HIP support. Similarly, if `AMDGPU_TARGETS` is set but
`CMAKE_HIP_ARCHITECTURES` is not, then we forward `AMDGPU_TARGETS`
to `CMAKE_HIP_ARCHITECTURES` to ease the transition to the new
HIP support.

Signed-off-by: Gavin Zhao <git@gzgz.dev>
2024-05-17 17:03:03 +02:00
agray3
bc4bba364f
Introduction of CUDA Graphs to LLama.cpp (#6766)
* DRAFT: Introduction of CUDA Graphs to LLama.cpp

* FIx issues raised in comments

* Tidied to now only use CUDA runtime (not mixed with driver calls)

* disable for multi-gpu and batch size > 1

* Disable CUDA graphs for old GPU arch and with env var

* added missing CUDA_CHECKs

* Addressed comments

* further addressed comments

* limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake

* Added more comprehensive graph node checking

* With mechanism to fall back if graph capture fails

* Revert "With mechanism to fall back if graph capture fails"

This reverts commit eb9f15fb6f.

* Fall back if graph capture fails and address other comments

* - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS

- rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS

- updated Makefile build to enable CUDA graphs

- removed graph capture failure checking in ggml_cuda_error
  using a global variable to track this is not thread safe, but I am also not safistied with checking an error by string
  if this is necessary to workaround some issues with graph capture with eg. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context

- fixed several resource leaks

- fixed issue with zero node graphs

- changed fixed size arrays to vectors

- removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed

- removed the check for multiple devices so that it is still possible to use a single device, instead checks for split buffers to disable cuda graphs with -sm row

- changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX

- code style fixes

- things to look into
  - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional
  - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes

* fix build without cuda graphs

* remove outdated comment

* replace minimum cc value with a constant

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-08 22:55:49 +02:00
Georgi Gerganov
92139b90af
tests : add test-tokenizer-0.sh + fix some tokenizers (#7036)
* tests : add test-tokenizer-0.sh

* unicode : add all unicode number ranges

* starcoder : fix pre-tokenizer

* tests : add test that fails with DeepSeek tokenizers

* falcon : fix regex

* unicode : regenerate unicode tables

* refact : add tokenizer model

* lint : fix

* tests : disable failing tests

ggml-ci

* refact : add tests files

ggml-ci

* convert : print -> logging

ggml-ci

* lint : fix

* unicode : digit -> number

* phi-3 : update
2024-05-04 08:32:32 +03:00
Georgi Gerganov
f4ab2a4147
llama : fix BPE pre-tokenization (#6920)
* merged the changes from deepseeker models to main branch

* Moved regex patterns to unicode.cpp and updated unicode.h

* Moved header files

* Resolved issues

* added and refactored unicode_regex_split and related functions

* Updated/merged the deepseek coder pr

* Refactored code

* Adding unicode regex mappings

* Adding unicode regex function

* Added needed functionality, testing remains

* Fixed issues

* Fixed issue with gpt2 regex custom preprocessor

* unicode : fix? unicode_wstring_to_utf8

* lint : fix whitespaces

* tests : add tokenizer tests for numbers

* unicode : remove redundant headers

* tests : remove and rename tokenizer test scripts

* tests : add sample usage

* gguf-py : reader prints warnings on duplicate keys

* llama : towards llama3 tokenization support (wip)

* unicode : shot in the dark to fix tests on Windows

* unicode : first try custom implementations

* convert : add "tokenizer.ggml.pre" GGUF KV (wip)

* llama : use new pre-tokenizer type

* convert : fix pre-tokenizer type writing

* lint : fix

* make : add test-tokenizer-0-llama-v3

* wip

* models : add llama v3 vocab file

* llama : adapt punctuation regex + add llama 3 regex

* minor

* unicode : set bomb

* unicode : set bomb

* unicode : always use std::wregex

* unicode : support \p{N}, \p{L} and \p{P} natively

* unicode : try fix windows

* unicode : category support via std::regex

* unicode : clean-up

* unicode : simplify

* convert : add convert-hf-to-gguf-update.py

ggml-ci

* lint : update

* convert : add falcon

ggml-ci

* unicode : normalize signatures

* lint : fix

* lint : fix

* convert : remove unused functions

* convert : add comments

* convert : exercise contractions

ggml-ci

* lint : fix

* cmake : refactor test targets

* tests : refactor vocab tests

ggml-ci

* tests : add more vocabs and tests

ggml-ci

* unicode : cleanup

* scripts : ignore new update script in check-requirements.sh

* models : add phi-3, mpt, gpt-2, starcoder

* tests : disable obsolete

ggml-ci

* tests : use faster bpe test

ggml-ci

* llama : more prominent warning for old BPE models

* tests : disable test-tokenizer-1-bpe due to slowness

ggml-ci

---------

Co-authored-by: Jaggzh <jaggz.h@gmail.com>
Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com>
2024-04-29 16:58:41 +03:00
Przemysław Pawełczyk
577277ffd2
make : change GNU make default CXX from g++ to c++ (#6966) 2024-04-29 16:08:20 +03:00
Pierrick Hymbert
0c4d489e29
quantize: add imatrix and dataset metadata in GGUF (#6658)
* imatrix: save the dataset file used in the output file

* llama: support kv overrides type string string

* common: factorize KV Overrides parsing between common and server

* quantize: add imatrix n entries and dataset KV metadata
quantize: factorize KV Overrides parsing between common
#6656

* llama: remove kv override str_value initialization as it does not compile on some toolchain

* quantize: add imatrix m_last_call as `quantize.imatrix.chunks_count`

* quantize: add imatrix filename in KV

* llama: add llama_model_kv_override_free

* common: add llama_model_kv_override_free
common: free kv override if used after model loading

* llama: finally move the string KV override value to the stack

* llama : minor

* no need to add a NUL to the std::vector, std::string can be initialized from a pair of iterators.

Co-authored-by: slaren <slarengh@gmail.com>

* kv override: ensure string termination

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-04-26 20:06:33 +02:00
Justine Tunney
192090bae4
llamafile : improve sgemm.cpp (#6796)
* llamafile : improve sgemm.cpp

- Re-enable by default
- Fix issue described in #6716
- Make code more abstract, elegant, and maintainable
- Faster handling of weirdly shaped `m` an `n` edge cases

* Address review comments

* Help clang produce fma instructions

* Address review comments
2024-04-22 22:00:36 +03:00
Olivier Chafik
5cf5e7d490
build: generate hex dump of server assets during build (#6661)
* `build`: generate hex dumps of server assets on the fly

* build: workaround lack of -n on gnu xxd

* build: don't use xxd in cmake

* build: don't call xxd from build.zig

* build: more idiomatic hexing

* build: don't use xxd in Makefile (od hackery instead)

* build: avoid exceeding max cmd line limit in makefile hex dump

* build: hex dump assets at cmake build time (not config time)
2024-04-21 18:48:53 +01:00
Georgi Gerganov
40f74e4d73
llama : add option to render special/control tokens (#6807)
* make : fix common dep on llama.h

* llama : add option to render special tokens

* readme : add API change notice

ggml-ci

* swift : fix build
2024-04-21 18:36:45 +03:00
Georgi Gerganov
3b8f1ec4b1
llamafile : tmp disable + build sgemm.o when needed (#6716)
* build : sgemm.o only when needed

ggml-ci

* llamafile : tmp disable due to MoE bug

ggml-ci
2024-04-17 23:58:26 +03:00
Georgi Gerganov
666867b799
ggml : fix llamafile sgemm wdata offsets (#6710)
ggml-ci
2024-04-16 23:50:22 +03:00
Justine Tunney
8cc91dc63c
ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.

This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-04-16 21:55:30 +03:00
Olivier Chafik
7593639ce3
main: add --json-schema / -j flag (#6659)
* main: add --json-schema / -j

* json: move json-schema-to-grammar to common lib

* json: fix zig build
2024-04-15 18:35:21 +01:00
Nikolas
a474f50ebb
Refactor Error Handling for CUDA (#6575)
* Refactor Error Handling for CUDA

Add guidance for setting CUDA_DOCKER_ARCH to match GPU compute capability for CUDA versions < 11.7. Include link to NVIDIA's CUDA GPUs documentation for compute capability reference.

* Update Makefile

Improved wording

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-04-11 21:56:29 +02:00
Pierrick Hymbert
b804b1ef77
eval-callback: Example how to use eval callback for debugging (#6576)
* gguf-debug: Example how to use ggml callback for debugging

* gguf-debug: no mutex, verify type, fix stride.

* llama: cv eval: move cb eval field in common gpt_params

* ggml_debug: use common gpt_params to pass cb eval.
Fix get tensor SIGV random.

* ggml_debug: ci: add tests

* ggml_debug: EOL in CMakeLists.txt

* ggml_debug: Remove unused param n_batch, no batching here

* ggml_debug: fix trailing spaces

* ggml_debug: fix trailing spaces

* common: fix cb_eval and user data not initialized

* ci: build revert label

* ggml_debug: add main test label

* doc: add a model: add a link to ggml-debug

* ggml-debug: add to make toolchain

* ggml-debug: tests add the main label

* ggml-debug: ci add test curl label

* common: allow the warmup to be disabled in llama_init_from_gpt_params

* ci: add curl test

* ggml-debug: better tensor type support

* gitignore : ggml-debug

* ggml-debug: printing also the sum of each tensor

* ggml-debug: remove block size

* eval-callback: renamed from ggml-debug

* eval-callback: fix make toolchain

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-11 14:51:07 +02:00
Clint Herron
57dd02c44b
Tests: Added integration tests for GBNF parser (#6472)
* Added integration tests for GBNF parser to validate correctness of parsing, as well as correctness of string matching. Intended for use to pin behavior while working on performance improvements.

* Fixing whitespace errors and cleaning error message alert to be clearer.

* Removing hacky include to llama.cpp from grammar integration test now that needed functions are available via internal API.

* Comment cleanup.

* Reorganizing tests for readability.

* Cleaning up debug message to make a bit more sense.
2024-04-06 10:31:33 -04:00
Clint Herron
9b84ae1806
examples : add GBNF validator program (#5948)
* Revising GBNF validator program to be much simpler.

* Changing from streams to using cstdio

* Adding final newline character.
2024-04-04 10:44:28 +03:00
Georgi Gerganov
3a0345970e
make : whitespace 2024-03-27 15:02:49 +02:00
Jared Van Bortel
32c8486e1f
wpm : portable unicode tolower (#6305)
Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.
2024-03-26 17:46:21 -04:00
slaren
280345968d
cuda : rename build flag to LLAMA_CUDA (#6299) 2024-03-26 01:16:01 +01:00
slaren
ae1f211ce2
cuda : refactor into multiple files (#6269) 2024-03-25 13:50:23 +01:00
Minsoo Cheong
64e7b47c69
examples : add "retrieval" (#6193)
* add `retrieval` example

* add README

* minor fixes

* cast filepos on print

* remove use of variable sized array

* store similarities in separate vector

* print error on insufficient batch size

* fix error message printing

* assign n_batch value to n_ubatch

* fix param definitions

* define retrieval-only parameters in retrieval.cpp

* fix `--context-file` option to be provided multiple times for multiple files

* use vector for `query_emb`

* add usage description in README

* fix merge conflict

* fix usage printing

* remove seed setting

* fix lint

* increase file read buffer size

* retrieval : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-25 09:38:22 +02:00
Pierrick Hymbert
21cad01b6e
split: add gguf-split in the make build target (#6262) 2024-03-23 17:18:13 +01:00
Johannes Gäßler
50ccaf5eac
lookup: complement data from context with general text statistics (#5479)
* lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens

* fixup! lookup: evaluation tools, use corpus/previous gens
2024-03-23 01:24:36 +01:00
slaren
2f0e81e053
cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy (#6208)
* cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy

* add LLAMA_CUDA_NO_PEER_COPY to HIP build
2024-03-22 14:05:31 +01:00
Olivier Chafik
5b7b0ac8df
json-schema-to-grammar improvements (+ added to server) (#5978)
* json: fix arrays (disallow `[,1]`)

* json: support tuple types (`[number, string]`)

* json: support additionalProperties (`{[k: string]: [string,number][]}`)

* json: support required / optional properties

* json: add support for pattern

* json: resolve $ref (and support https schema urls)

* json: fix $ref resolution

* join: support union types (mostly for nullable types I think)

* json: support allOf + nested anyOf

* json: support any (`{}` or `{type: object}`)

* json: fix merge

* json: temp fix for escapes

* json: spaces in output and unrestricted output spaces

* json: add typings

* json:fix typo

* Create ts-type-to-grammar.sh

* json: fix _format_literal (json.dumps already escapes quotes)

* json: merge lit sequences and handle negatives

{"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"}

* json: handle pattern repetitions

* Update json-schema-to-grammar.mjs

* Create regex-to-grammar.py

* json: extract repeated regexp patterns to subrule

* Update json-schema-to-grammar.py

* Update json-schema-to-grammar.py

* Update json-schema-to-grammar.py

* json: handle schema from pydantic Optional fields

* Update json-schema-to-grammar.py

* Update json-schema-to-grammar.py

* Update ts-type-to-grammar.sh

* Update ts-type-to-grammar.sh

* json: simplify nullable fields handling

* json: accept duplicate identical rules

* json: revert space to 1 at most

* json: reuse regexp pattern subrules

* json: handle uuid string format

* json: fix literal escapes

* json: add --allow-fetch

* json: simplify range escapes

* json: support negative ranges in patterns

* Delete commit.txt

* json: custom regex parser, adds dot support & JS-portable

* json: rm trailing spaces

* Update json-schema-to-grammar.mjs

* json: updated server & chat `( cd examples/server && ./deps.sh )`

* json: port fixes from mjs to python

* Update ts-type-to-grammar.sh

* json: support prefixItems alongside array items

* json: add date format + fix uuid

* json: add date, time, date-time formats

* json: preserve order of props from TS defs

* json: port schema converter to C++, wire in ./server

* json: nits

* Update json-schema-to-grammar.cpp

* Update json-schema-to-grammar.cpp

* Update json-schema-to-grammar.cpp

* json: fix mjs implementation + align outputs

* Update json-schema-to-grammar.mjs.hpp

* json: test C++, JS & Python versions

* json: nits + regen deps

* json: cleanup test

* json: revert from c++17 to 11

* json: nit fixes

* json: dirty include for test

* json: fix zig build

* json: pass static command to std::system in tests (fixed temp files)

* json: fix top-level $refs

* json: don't use c++20 designated initializers

* nit

* json: basic support for reserved names `{number:{number:{root:number}}}`

* Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test)

* json: re-ran server deps.sh

* json: simplify test

* json: support mix of additional props & required/optional

* json: add tests for some expected failures

* json: fix type=const in c++, add failure expectations for non-str const&enum

* json: test (& simplify output of) empty schema

* json: check parsing in test + fix value & string refs

* json: add server tests for OAI JSON response_format

* json: test/fix top-level anyOf

* json: improve grammar parsing failures

* json: test/fix additional props corner cases

* json: fix string patterns (was missing quotes)

* json: ws nit

* json: fix json handling in server when there's no response_format

* json: catch schema conversion errors in server

* json: don't complain about unknown format type in server if unset

* json: cleaner build of test

* json: create examples/json-schema-pydantic-example.py

* json: fix date pattern

* json: move json.hpp & json-schema-to-grammar.{cpp,h} to common

* json: indent 4 spaces

* json: fix naming of top-level c++ function (+ drop unused one)

* json: avoid using namespace std

* json: fix zig build

* Update server.feature

* json: iostream -> fprintf

* json: space before & refs for consistency

* json: nits
2024-03-21 11:50:43 +00:00
Pierrick Hymbert
d0d5de42e5
gguf-split: split and merge gguf per batch of tensors (#6135)
* gguf-split: split and merge gguf files per tensor

* gguf-split: build with make toolchain

* gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split

* split : minor style + fix compile warnings

* gguf-split: remove --upload not implemented

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-19 12:05:44 +01:00
Pierrick Hymbert
d01b3c4c32
common: llama_load_model_from_url using --model-url (#6098)
* common: llama_load_model_from_url with libcurl dependency

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-17 19:12:37 +01:00
Georgi Gerganov
131b058409
make : ggml-metal.o depends on ggml.h 2024-03-15 11:38:40 +02:00
Georgi Gerganov
381da2d9f0
metal : build metallib + fix embed path (#6015)
* metal : build metallib + fix embed path

ggml-ci

* metal : fix embed build + update library load logic

ggml-ci

* metal : fix embeded library build

ggml-ci

* ci : fix iOS builds to use embedded library
2024-03-14 11:55:23 +02:00
slaren
f30ea47a87
llama : add pipeline parallelism support (#6017)
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
default increase to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-casual attention

* llama : do not limit n_batch to n_ctx with non-casual attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-13 18:54:21 +01:00
Georgi Gerganov
83796e62bc
llama : refactor unicode stuff (#5992)
* llama : refactor unicode stuff

ggml-ci

* unicode : names

* make : fix c++ compiler

* unicode : names

* unicode : straighten tables

* zig : fix build

* unicode : put nfd normalization behind API

ggml-ci

* swift : fix build

* unicode : add BOM

* unicode : add <cstdint>

ggml-ci

* unicode : pass as cpts as const ref
2024-03-11 17:47:47 +02:00
DAN™
bcebd7dbf6
llama : add support for GritLM (#5959)
* add gritlm example

* gritlm results match

* tabs to spaces

* comment out debug printing

* rebase to new embed

* gritlm embeddings are back babeee

* add to gitignore

* allow to toggle embedding mode

* Clean-up GritLM sample code.

* Fix types.

* Flush stdout and output ending newline if streaming.

* mostly style fixes; correct KQ_mask comment

* add causal_attn flag to llama_cparams

* gritml : minor

* llama : minor

---------

Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-10 17:56:30 +02:00
Georgi Gerganov
8a3012a4ad
ggml : add ggml-common.h to deduplicate shared code (#5940)
* ggml : add ggml-common.h to shared code

ggml-ci

* scripts : update sync scripts

* sycl : reuse quantum tables

ggml-ci

* ggml : minor

* ggml : minor

* sycl : try to fix build
2024-03-09 12:47:57 +02:00
Gabe Goodhart
e1fa9569ba
server : add SSL support (#5926)
* add cmake build toggle to enable ssl support in server

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* add flags for ssl key/cert files and use SSLServer if set

All SSL setup is hidden behind CPPHTTPLIB_OPENSSL_SUPPORT in the same
way that the base httlib hides the SSL support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Update readme for SSL support in server

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Add LLAMA_SERVER_SSL variable setup to top-level Makefile

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-03-09 11:57:09 +02:00
Georgi Gerganov
2002bc96bf
server : refactor (#5882)
* server : refactoring (wip)

* server : remove llava/clip objects from build

* server : fix empty prompt handling + all slots idle logic

* server : normalize id vars

* server : code style

* server : simplify model chat template validation

* server : code style

* server : minor

* llama : llama_chat_apply_template support null buf

* server : do not process embedding requests when disabled

* server : reorganize structs and enums + naming fixes

* server : merge oai.hpp in utils.hpp

* server : refactor system prompt update at start

* server : disable cached prompts with self-extend

* server : do not process more than n_batch tokens per iter

* server: tests: embeddings use a real embeddings model (#5908)

* server, tests : bump batch to fit 1 embedding prompt

* server: tests: embeddings fix build type Debug is randomly failing (#5911)

* server: tests: embeddings, use different KV Cache size

* server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings

* server: tests: embeddings, no need to wait for server idle as it can timout

* server: refactor: clean up http code (#5912)

* server : avoid n_available var

ggml-ci

* server: refactor: better http codes

* server : simplify json parsing + add comment about t_last

* server : rename server structs

* server : allow to override FQDN in tests

ggml-ci

* server : add comments

---------

Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
2024-03-07 11:41:53 +02:00
le.chang
cbbd1efa06
Makefile: use variables for cublas (#5689)
* make: use arch variable for cublas

* fix UNAME_M

* check opt first

---------

Co-authored-by: lindeer <le.chang118@gmail.com>
2024-02-27 03:03:06 +01:00
kwin1412
f1a98c5254
make : fix nvcc version is empty (#5713)
fix nvcc version is empty
2024-02-25 18:46:49 +02:00
CJ Pais
6560bed3f0
server : support llava 1.6 (#5553)
* server: init working 1.6

* move clip_image to header

* remove commented code

* remove c++ style from header

* remove todo

* expose llava_image_embed_make_with_clip_img

* fix zig build
2024-02-20 21:07:22 +02:00
slaren
06bf2cf8c4
make : fix debug build with CUDA (#5616) 2024-02-20 20:06:17 +01:00
Haoxiang Fei
8dbbd75754
metal : add build system support for embedded metal library (#5604)
* add build support for embedded metal library

* Update Makefile

---------

Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-20 11:58:36 +02:00
Jared Van Bortel
f24ed14ee0
make : pass CPPFLAGS directly to nvcc, not via -Xcompiler (#5598) 2024-02-19 15:54:12 -05:00
Georgi Gerganov
d0e3ce51f4
ci : enable -Werror for CUDA builds (#5579)
* cmake : pass -Werror through -Xcompiler

ggml-ci

* make, cmake : enable CUDA errors on warnings

ggml-ci
2024-02-19 14:45:41 +02:00
Georgi Gerganov
68a6b98b3c
make : fix CUDA build (#5580) 2024-02-19 13:41:51 +02:00
Xuan Son Nguyen
11b12de39b
llama : add llama_chat_apply_template() (#5538)
* llama: add llama_chat_apply_template

* test-chat-template: remove dedundant vector

* chat_template: do not use std::string for buffer

* add clarification for llama_chat_apply_template

* llama_chat_apply_template: add zephyr template

* llama_chat_apply_template: correct docs

* llama_chat_apply_template: use term "chat" everywhere

* llama_chat_apply_template: change variable name to "tmpl"
2024-02-19 10:23:37 +02:00
Jared Van Bortel
a0c2dad9d4
build : pass all warning flags to nvcc via -Xcompiler (#5570)
* build : pass all warning flags to nvcc via -Xcompiler
* make : fix apparent mis-merge from #3952
* make : fix incorrect GF_CC_VER for CUDA host compiler
2024-02-18 16:21:52 -05:00
Ananta Bastola
6e4e973b26
ci : add an option to fail on compile warning (#3952)
* feat(ci): add an option to fail on compile warning

* Update CMakeLists.txt

* minor : fix compile warnings

ggml-ci

* ggml : fix unreachable code warnings

ggml-ci

* ci : disable fatal warnings for windows, ios and tvos

* ggml : fix strncpy warning

* ci : disable fatal warnings for MPI build

* ci : add fatal warnings to ggml-ci

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-17 23:03:14 +02:00
Johannes Gäßler
ad014bba97
make: add error message for bad CUDA version (#5444)
* make: add error message for bad CUDA version

* Update Makefile

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-02-13 12:38:37 +01:00
Johannes Gäßler
098f6d737b
make: Use ccache for faster compilation (#5318)
* make: Use ccache for faster compilation
2024-02-05 19:33:00 +01:00
Johannes Gäßler
3c0d25c475
make: add nvcc info print (#5310) 2024-02-03 20:15:13 +01:00
Johannes Gäßler
3cc5ed353c
make: fix nvcc optimization flags for host code (#5309) 2024-02-03 20:14:59 +01:00
0cc4m
e920ed393d
Vulkan Intel Fixes, Optimizations and Debugging Flags (#5301)
* Fix Vulkan on Intel ARC

Optimize matmul for Intel ARC

Add Vulkan dequant test

* Add Vulkan debug and validate flags to Make and CMakeLists.txt

* Enable asynchronous transfers in Vulkan backend

* Fix flake8

* Disable Vulkan async backend functions for now

* Also add Vulkan run tests command to Makefile and CMakeLists.txt
2024-02-03 18:15:00 +01:00
Ali Nehzat
d71ac90985
make : generate .a library for static linking (#5205) 2024-02-01 17:18:53 +02:00
0cc4m
2307523d32
ggml : add Vulkan backend (#2059)
* Vulkan loader code

* Fix matmul kernel, continue implementation

* Continue implementation

* Vulkan memory management

* Vulkan development

* Matmul call

* Add aligned malloc and free for VMA

* Continue implementation

* First matmul success

* GEMM Kernel optimization

* 1D Blocktiling

* 2D Blocktiling

* Write coalescing

* Continue vulkan implementation and optimization

* First FP16 attempt, disabled for now

* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel

* Enable device extensions properly, restore fp16 matmul op

* Fix mulmat_f16

* Output FP32 in fp16 matmul shader

* Fix f16_to_f32 kernel

* dequant_q4_0 kernel

* Add VMA library

* Avoid requesting dedicated memory, VMA can decide that by itself

* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly

* add cmake commands

* Add 2d write operation, profiling code

* Fix 2d write

* Fix queue selection for AMD RADV

* Fix trailing whitespace in vk_mem_alloc.h

* Add WIP warp tile mat mul shaders

* Disable glslc optimization

* Disable glslc optimization for CMake

* Optimize warptile matmul shader, replace blocktile with it

* Add split-k optimization for small matrix multiplication

Use semaphores for synchronization instead of fences or waitidle

Rework async write/read for synchronization

* Fix validation errors, improve compatibility with AMD GPUs

* Rework command buffer handling

* Variable matmul kernel using specialization constants

* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints

* Reuse semaphores

* Handle stage flags during command buffer submission properly

* Increase matmul test runs for consistent results

* Fix F32 matmul

* Add vectorized loading and zeropadding for matrix multiplication

* Use pinned memory for f16 preprocessing

* Don't force aligned matmul

* Don't free before queue done

* Replace VMA library with native Vulkan buffer management

* Basic offloading support with mul_f32 and dmmv for q4_0

* Run glslc commands in parallel

* Unroll loops in dmmv shader

* Reduce usage of waitIdle

* Reuse pinned allocation for f16 conversion

* Handle devices with only a single queue

* Fix trailing whitespace in CMakeLists.txt

* Allow parallel execution of kernels, parallelize third and fourth dimension calls

* Add fallback for devices only supporting one DescriptorSet per DescriptorPool

* Move to graph function similar to CUDA implementation

* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function

* Add F32 dmmv shaders

* Batch submissions

* Add .spv to gitignore

* Split off matrix vector multiplication for separate optimization

* Use single command buffer for matrix vector multiplication ops

* Reduce overhead of mul_f32 calls by using a single command buffer

* Add submission batching to mul_f32

* Fix tests

* Add missing barrier

* Add further missing barrier

* Add further ops

* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions

* Remove unnecessary cblas link

* Fix descriptor set pre-allocation assert

* Add runtime shader compilation, start transferring shaders to this approach

* Transfer remaining shaders to header and compile on runtime

* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16

* Add support for q4_1, q5_0, q5_1 and q8_0

* Remove unnecessary scalar layout extension

* Parse graph early to pre-record command buffers

* Add q6_k support

* Add multi-submit for command buffers

* Fix q6_k dequant shader for AMD

* Fix q6_k for GPUs without fp16 support

* Simplify q6_k fp16 fix

* Minor fixes

* Fix wg_denom of m-mulmat shaders

* Add Python-based Vulkan shader generator

* Replace shaderc dependency with precompiled shaders

Fix python script to generate shaders

* Clean up code

* Fix shader generator script Windows compatibility

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>

* Close file before deletion

* Fix vulkan shader fp32 name

* Add q2_k and q3_k support

Add validation check to compare shader results to cpu results

* Add q4_k support

* Add q5_k support

* Bake SPIR-V bytecode into the library instead of loading shaders from file

* Switch to signal semaphores for flexibility

Prepare broadcasting support for mul mat

* Finish broadcasting mul mat support for GQA

* Clean up unused functions

Add repeat op

* Add further ops, not yet enabled. Improve semaphore code

* Reduce number of used semaphores by utilizing timelines more properly

* Remove queue information

* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations

* Add Vulkan to llama-bench

* Remove cblas dependency

* Fix matmul k-split bug

* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader

* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug

* Fix issues with float16 overflows in shaders

* Fix issues with older Vulkan headers on Ubuntu 22.04

* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers

* Implement further ops, rework op_f32 calls, fix bugs

* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code

* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders

* Merge upstream changes, fix conflicts, adapt soft_max op

* Fix Python and shader header format

* Free model gpu buffers on exit

* Use single queue per device to simplify code

* Add matmul shader support for running multiple calculations in parallel

* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible

* Fix missing event cast

* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity

* Fix warning about empty C function parameters

* Fix compiler warnings

* Properly implement Vulkan backend buffer handling

* Fix oversized host staging buffers

* Simplify barrier synchronization calls

* Fix gcc warnings

* Implement max_size for backend buffer types to limit the size of a single allocation

* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size

* refactor multi buf

* Disable unsupported ops to fix tests

* Check for maintenance4 support before using it

* Handle devices with only a single queue

* Fix single queue logic

* propagate buffer usage in multi buffers

* Implement rope_neox op

* Cleanup header and other files

* Simplify gpu_extras by removing events and putting staging memcpys into contexts

* Move queue into context

Add not-yet-enabled async backend ops

* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization

* Add get_max_size to SYCL backend.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix trailing whitespace

---------

Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-28 19:03:59 +02:00
Xuan Son Nguyen
48c857aa10
server : refactored the task processing logic (#5065)
* server: add llama_server_queue struct

* server: add llama_server_response_event

* server: add comments

* server: move all mutexes away from server.cpp

* server: correct multitask response

* server: only add back deferred tasks when one slot is available

* server: fix a race condition cause by "request_completion"
2024-01-26 14:42:20 +02:00
crasm
413e7b0559
ci : add model tests + script wrapper (#4586)
* scripts : add lib.sh and lib_test.sh

* scripts : stub out new ci-run.sh script

* scripts : switch to PascalCase for functions

This looks a little odd at first, but I find it very useful as a
convention to know if a command is part of our code vs a builtin.

* scripts : add some fancy conversion from snake_case to PascalCase

* Add venv to ci/run.sh

* Revert scripts work

* scripts : add wrapper script for local use of ci/run.sh

* Simplify .gitignore for tests, clang-tidy fixes

* Label all ctest tests

* ci : ctest uses -L main

* Attempt at writing ctest_with_model

* Update test-model-load-cancel

* ci : add ctest_with_model for debug and release

ggml-ci

* Fix gg_get_model function

ggml-ci

* got stuck on CMake

* Add get_model.cpp to tests/CMakeLists.txt

ggml-ci

* Fix README.md output for ctest_with_model

ggml-ci

* workflows : use `-L main` for all ctest

ggml-ci

* Fixes

* GG_RUN_CTEST_MODELFILE => LLAMACPP_TESTMODELFILE
* Always show warning rather than failing if model file variable is not
  set

* scripts : update usage text for ci-run.sh
2024-01-26 14:18:00 +02:00
Georgi Gerganov
c918fe8dca
metal : create autorelease pool during library build (#4970)
* metal : create autorelease pool during library build

ggml-ci

* test : simplify

ggml-ci
2024-01-17 18:38:39 +02:00
Georgi Gerganov
4be5ef556d
metal : remove old API (#4919)
ggml-ci
2024-01-13 20:45:45 +02:00
Kawrakow
326b418b59
Importance Matrix calculation (#4861)
* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 06:59:57 +01:00
Georgi Gerganov
b0034d93ce
examples : add passkey test (#3856)
* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* make : add passkey target

* passkey : add "self-extend"-like context extension (#4810)

* llama : "self-extend"-like context extension

* passkey : add comment

* passkey : add readme
2024-01-08 11:14:04 +02:00
slaren
5bf3953d7e
cuda : improve cuda pool efficiency using virtual memory (#4606)
* cuda : improve cuda pool efficiency using virtual memory

* fix mixtral

* fix cmake build

* check for vmm support, disable for hip

ggml-ci

* fix hip build

* clarify granularity

* move all caps to g_device_caps

* refactor error checking

* add cuda_pool_alloc, refactor most pool allocations

ggml-ci

* fix hip build

* CUBLAS_TF32_TENSOR_OP_MATH is not a macro

* more hip crap

* llama : fix msvc warnings

* ggml : fix msvc warnings

* minor

* minor

* cuda : fallback to CPU on host buffer alloc fail

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* ensure allocations are always aligned

* act_size -> actual_size

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-12-24 14:34:22 +01:00
LeonEricsson
7082d24cec
lookup : add prompt lookup decoding example (#4484)
* initial commit, going through initializations

* main loop finished, starting to debug

* BUG: generates gibberish/repeating tokens after a while

* kv_cache management

* Added colors to distinguish drafted tokens (--color). Updated README

* lookup : fix token positions in the draft batch

* lookup : use n_draft from CLI params

* lookup : final touches

---------

Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-22 18:05:56 +02:00
FantasyGmm
a55876955b
cuda : fix jetson compile error (#4560)
* fix old jetson compile error

* Update Makefile

* update jetson detect and cuda version detect

* update cuda marco define

* update makefile and cuda,fix some issue

* Update README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update Makefile

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-22 17:11:12 +02:00
Michael Kesper
28cb35a0ec
make : add LLAMA_HIP_UMA option (#4587)
NB: LLAMA_HIP_UMA=1 (or any value) adds MK_CPPFLAG -DGGML_HIP_UMA
2023-12-22 10:03:25 +02:00
Georgi Gerganov
32259b2dad
gguf : simplify example dependencies 2023-12-21 23:08:14 +02:00
slaren
d232aca5a7
llama : initial ggml-backend integration (#4520)
* llama : initial ggml-backend integration

* add ggml-metal

* cuda backend can be used though ggml-backend with LLAMA_GGML_BACKEND_CUDA_TEST
access all tensor data with ggml_backend_tensor_get/set

* add ggml_backend_buffer_clear
zero-init KV cache buffer

* add ggml_backend_buffer_is_hos, used to avoid copies if possible when accesing tensor data

* disable gpu backends with ngl 0

* more accurate mlock

* unmap offloaded part of the model

* use posix_fadvise64(.., POSIX_FADV_SEQUENTIAL) to improve performance with mmap

* update quantize and lora

* update session copy/set to use ggml-backend

ggml-ci

* use posix_fadvise instead of posix_fadvise64

* ggml_backend_alloc_ctx_tensors_from_buft : remove old print

* llama_mmap::align_offset : use pointers instead of references for out parameters

* restore progress_callback behavior

* move final progress_callback call to load_all_data

* cuda : fix fprintf format string (minor)

* do not offload scales

* llama_mmap : avoid unmapping the same fragments again in the destructor

* remove unnecessary unmap

* metal : add default log function that prints to stderr, cleanup code

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-21 21:07:46 +01:00
Matheus Gabriel Alves Silva
919c40660f
build : Check the ROCm installation location (#4485)
* build : Check the ROCm installation location

* more generic approach

* fixup! It was returning the path instead of the command output

* fixup! Trailing whitespace
2023-12-17 17:23:33 +02:00
Jared Van Bortel
70f806b821
build : detect host compiler and cuda compiler separately (#4414) 2023-12-13 12:10:10 -05:00
slaren
799a1cb13b
llama : add Mixtral support (#4406)
* convert : support Mixtral as LLAMA arch

* convert : fix n_ff typo

* llama : model loading

* ggml : sync latest ggml_mul_mat_id

* llama : update graph to support MoE

* llama : fix cur -> cur_expert

* llama : first working version

* llama : fix expert weighting in the FFN

* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)

* ggml : add n_as argument to ggml_mul_mat_id

* ggml : fix ggml_get_rows to take into account ne02 / ne11

* metal : add more general support for ggml_get_rows + tests

* llama : add basic support for offloading moe with CUDA

* metal : add/mul/div use general kernel when src1 not cont

* metal : reduce the kernel launches for ggml_mul_mat_id

* ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D

* ggml : update get_rows f16 and q

* cuda : support non-contiguous src1 in get_rows

* llama : offload missing ffn_moe_silu

* metal : fix ggml_get_rows to work with non-cont src1

* metal : add indirect mat-vec kernels for all quantization types

* llama : do not quantize expert gating tensors

* llama : add n_expert and n_expert_used to hparams + change quants

* test-backend-ops : add moe test

* cuda : fix get_rows when ncols is odd

* convert : determine n_ctx correctly

* metal : fix ggml_mul_mat_id for F32

* test-backend-ops : make experts more evenly probable (test_moe)

* test-backend-ops : cleanup, add moe test for batches

* test-backend-ops : add cpy from f32 -> all types test

* test-backend-ops : fix dequantize block offset

* llama : fix hard-coded number of experts

* test-backend-ops : simplify and disable slow tests to avoid CI timeout

* test-backend-ops : disable MOE test with thread sanitizer

* cuda : fix mul_mat_id with multi gpu

* convert : use 1e6 rope_freq_base for mixtral

* convert : fix style

* convert : support safetensors format

* gguf-py : bump version

* metal : add cpy f16 -> f32 kernel

* metal : fix binary ops for ne10 % 4 != 0

* test-backend-ops : add one more sum_rows test

* ggml : do not use BLAS with ggml_mul_mat_id

* convert-hf : support for mixtral-instruct (#4428)

* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct

* convert : use sentencepiece tokenizer for Mixtral-instruct

* convert : make flake8 happy

* metal : fix soft_max kernels

ref: 1914017863

* metal : limit kernels to not use more than the allowed threads

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
2023-12-13 14:04:25 +02:00
Jared Van Bortel
6138963fb2
build : target Windows 8 for standard mingw-w64 (#4405)
* build : target Windows 8 for standard mingw-w64

* make : fix missing console.o deps

This was causing a link error with `make all` on Windows.
2023-12-12 11:27:26 +02:00
Georgi Gerganov
fe680e3d10
sync : ggml (new ops, tests, backend, etc.) (#4359)
* sync : ggml (part 1)

* sync : ggml (part 2, CUDA)

* sync : ggml (part 3, Metal)

* ggml : build fixes

ggml-ci

* cuda : restore lost changes

* cuda : restore lost changes (StableLM rope)

* cmake : enable separable compilation for CUDA

ggml-ci

* ggml-cuda : remove device side dequantize

* Revert "cmake : enable separable compilation for CUDA"

This reverts commit 09e35d04b1.

* cuda : remove assert for rope

* tests : add test-backend-ops

* ggml : fix bug in ggml_concat

* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`

* ci : try to fix macOS

* ggml-backend : remove backend self-registration

* ci : disable Metal for macOS cmake build

ggml-ci

* metal : fix "supports family" call

* metal : fix assert

* metal : print resource path

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 22:26:54 +02:00
Jared Van Bortel
511f52c334
build : enable libstdc++ assertions for debug builds (#4275) 2023-12-01 20:18:35 +02:00
WillCorticesAI
d2809a3ba2
make : fix Apple clang determination bug (#4272)
Co-authored-by: Will Findley <findley@gmail.com>
2023-12-01 00:23:44 +02:00
Jared Van Bortel
15f5d96037
build : fix build info generation and cleanup Makefile (#3920)
* cmake : fix joining of REAL_GIT_DIR

* fix includes with help from include-what-you-use

* make : remove unneeded deps and add test-rope target

* fix C includes in C++ source files

* Revert "fix includes with help from include-what-you-use"

This reverts commit 635e9fadfd.
2023-12-01 00:23:08 +02:00
Georgi Gerganov
922754a8d6
lookahead : add example for lookahead decoding (#4207)
* lookahead : init

* lookahead : generate and store n-grams

* lookahead : use loop instead recursion to generate n-grams

* lookahead : initial working implementation

* lookahead : filter repeating n-grams

* lookahead : use deterministic init

* lookahead : add to Makefile

* lookahead : fix a bug in the seq_id of the lookahead tokens

* lookahead : add comments

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-11-26 20:33:07 +02:00
Kerfuffle
28a2e6e7d4
tokenize example: Respect normal add BOS token behavior (#4126)
Allow building with Makefile
2023-11-18 14:48:17 -07:00
Roger Meier
8e9361089d
build : support ppc64le build for make and CMake (#3963)
* build: support ppc64le build for make and CMake

* build: keep __POWER9_VECTOR__ ifdef and extend with __powerpc64__

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-17 18:11:23 +02:00
Michael Potter
6bb4908a17
Fix MacOS Sonoma model quantization (#4052)
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-11-14 12:34:41 -05:00
Georgi Gerganov
413503d4b9
make : do not add linker flags when compiling static llava lib (#3977) 2023-11-07 20:25:32 +03:00
Damian Stewart
381efbf480
llava : expose as a shared library for downstream projects (#3613)
* wip llava python bindings compatibility

* add external llava API

* add base64 in-prompt image support

* wip refactor image loading

* refactor image load out of llava init

* cleanup

* further cleanup; move llava-cli into its own file and rename

* move base64.hpp into common/

* collapse clip and llava libraries

* move llava into its own subdir

* wip

* fix bug where base64 string was not removed from the prompt

* get libllava to output in the right place

* expose llava methods in libllama.dylib

* cleanup memory usage around clip_image_*

* cleanup and refactor *again*

* update headerdoc

* build with cmake, not tested (WIP)

* Editorconfig

* Editorconfig

* Build with make

* Build with make

* Fix cyclical depts on Windows

* attempt to fix build on Windows

* attempt to fix build on Windows

* Upd TODOs

* attempt to fix build on Windows+CUDA

* Revert changes in cmake

* Fix according to review comments

* Support building as a shared library

* address review comments

---------

Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2023-11-07 00:36:23 +03:00
cebtenzzre
b12fa0d1c1
build : link against build info instead of compiling against it (#3879)
* cmake : fix build when .git does not exist

* cmake : simplify BUILD_INFO target

* cmake : add missing dependencies on BUILD_INFO

* build : link against build info instead of compiling against it

* zig : make build info a .cpp source instead of a header

Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>

* cmake : revert change to CMP0115

---------

Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
2023-11-02 08:50:16 +02:00
cebtenzzre
2046eb4345
make : remove unnecessary dependency on build-info.h (#3842) 2023-10-29 18:33:47 +02:00
Georgi Gerganov
d69d777c02
ggml : quantization refactoring (#3833)
* ggml : factor all quantization code in ggml-quants

ggml-ci

* ggml-quants : fix Zig and Swift builds + quantize tool

ggml-ci

* quantize : --pure option for disabling k-quant mixtures

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-10-29 18:32:28 +02:00
Georgi Gerganov
2f9ec7e271
cuda : improve text-generation and batched decoding performance (#3776)
* cuda : prints wip

* cuda : new cublas gemm branch for multi-batch quantized src0

* cuda : add F32 sgemm branch

* cuda : fine-tune >= VOLTA params + use MMQ only for small batches

* cuda : remove duplicated cuBLAS GEMM code

* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros

* build : add compile option to force use of MMQ kernels
2023-10-27 17:01:23 +03:00
Georgi Gerganov
e3932593d4
Revert "make : add optional CUDA_NATIVE_ARCH (#2482)"
This reverts commit 96981f37b1.

See:

https://github.com/ggerganov/llama.cpp/pull/2482#issuecomment-1775975866
2023-10-23 23:46:05 +03:00
Alex
96981f37b1
make : add optional CUDA_NATIVE_ARCH (#2482)
Use the environment variable `CUDA_NATIVE_ARCH` if present to set NVCC arch. Otherwise, use `native`.
2023-10-22 22:56:53 +03:00
Georgi Gerganov
438c2ca830
server : parallel decoding and multimodal (#3677)
* implementing parallel decoding in server example

* crash fixed

* save dev progress

* refactored sampling function

* completion endpoint working

* multiple client support

* grammar + no stream completion

* cached prompt support

* chat.mjs support cached prompt + some fixes

* server ui now support multiple clients

* unused change reverted

* fixed timings per slot

* add context swap

* add changes to README.md

* llava multimodal integration

* fixed tokens probs

* add multimodal input - alfa

* refactor code + remove unused comments + improved README.md

* fix compilation errors with llvm

* notify the user from server ui that multimodality is unavialable

* some ci fixes

* fix ci make build undefined ref errors

* fix long prompt than ctx proposed in #3639

* fixed premature end due stop word

* context shift fixed

* fix llava implementation

* sync README.md changes

* readme change

* update api like OpenAI

* multimodal support enabled by default

* fix make bui;d errors

* fix multiple clients

* fix zig build

* new sampling API

* latest changes of sampling API

* server : coding-style normalization

* server : coding-style normalization (part 2)

* server : remove beam-search functionality

* server : bug fix in ingest_images

n_tokens is incremented internally by llama_batch_add

* server : use refs + use llama_batch_clear()

* server : snake case

* server : minor sync

* added thread safe pipeline

* server : bach has to be allocated for n_parallel sequences

* server : no need for atomic int - already using mutex

* server : logs + minor code style

* server : fix multibyte handle in partial response (#3706)

* fix image load + view image in chat

* make : silence stb warnings

* clip : link to ggml, not to llama

* server : fix switch fallthrough

* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)

* server : refactor ctx_sampling init + n_ctx + names

* server : bug fix for prompt caching

* Do not save/load image_data to localStorage

* editorconfig : new line in index.html

* server : completion requests remember slot_id

* Update readme to document multimodal in server

* server : minor style

* Update readme to document multimodal in server

* server : hide ctx_sampling->prev behind API (#3696)

* server : apply fix from #3722

* server : fix slot reuse

* server : add comment about changing slot_state to bool

---------

Co-authored-by: FSSRepo <go778sgt@gmail.com>
Co-authored-by: Damian Stewart <d@damianstewart.com>
Co-authored-by: Steward Garcia <57494570+FSSRepo@users.noreply.github.com>
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
2023-10-22 22:53:08 +03:00
Georgi Gerganov
d1031cf49c
sampling : refactor init to use llama_sampling_params (#3696)
* sampling : refactor init to use llama_sampling_params

* llama : combine repetition, frequency and presence penalties in 1 call

* examples : remove embd-input and gptneox-wip

* sampling : rename penalty params + reduce size of "prev" vector

* sampling : add llama_sampling_print helper

* sampling : hide prev behind API and apply #3661

ggml-ci
2023-10-20 21:07:23 +03:00
Georgi Gerganov
0e89203b51
speculative : add tree-based sampling example (#3624)
* sampling : one sequence per sampling context

ggml-ci

* speculative : add tree-based sampling support

ggml-ci

* speculative : reuse the n_parallel CLI param

* speculative : refactor sampling

* examples : fix build after sampling refactoring

ggml-ci

* batched : fix n_seq_id

* sampling : fix malloc

ggml-ci

* swift : fix build

ggml-ci

* swift : try to fix build

ggml-ci

* prompts : add assistant.txt

* common : add llama_batch_add() and llama_batch_clear() helpers

* speculative : minor refactor

ggml-ci

* minor : comments + rename

ggml-ci

* speculative : fix off-by-one for n_drafted

* speculative : fix the n_drafted fix + p constants
2023-10-18 16:21:57 +03:00
M. Yusuf Sarıgöz
370359e5ba
examples: support LLaVA v1.5 (multimodal model) (#3436)
* WIP: start implementing LLaVA

* rm scratch buf for now, will revert after cleanup

* LLaVA image encoder is working. will combine with llama

* Add llava inference code, but it's buggy. debugging

* LLaVA is working e2e, needs to optimize memory allocation + cleanup

* Use ggml_allocr + rm unnecessary code

* fix: crlf -> lf

* fix: new line at EoF

* fix: trailing whitespace

* Add readme

* Update readme

* Some cleanup

* Are you happy editorconfig?

* rm unused batch image preprocessing

* rm unused import

* fix: rm designated initializers

* introduce pad-to-square mode for non-square images

* are you happy editorconfig?

* gitignore /llava

* Handle cases where image file does not exist

* add llava target to Makefile

* add support for 13b model variant

* Maybe seed is unlucky?

* Check if apples are compared to apples

* are you happy editorconfig?

* Use temperature = 0.1 by default

* command line: use gpt_params_parse()

* minor

* handle default n_predict

* fix typo

* llava : code formatting, rename files, fix compile warnings

* do not use Wno-cast-qual for MSVC

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-12 18:23:18 +03:00
Kerfuffle
70c29da118
common : fix mirostat state when using multiple sequences (#3543)
* Fix mirostat state when using multiple sequences

* Fix mirostat by completely refactoring sampling!

* Try to fix zig build.

* Export function to fetch/create default sampler states

Code formatting cleanups and add some comments

Silence a warning about id not being used when logging is disabled

* Apply some renaming suggestions.

Fix comments that were out of sync with the pull.

* Use more consistant naming convention for sampling contexts
2023-10-11 22:35:46 +03:00
Georgi Gerganov
8c70a5ff25
batched : add bench tool (#3545)
* batched : add bench tool

* batched : minor fix table

* batched-bench : add readme + n_kv_max is now configurable

* batched-bench : init warm-up batch

* batched-bench : pass custom set of PP, TG and PL

* batched-bench : add mmq CLI arg
2023-10-11 21:25:33 +03:00
Zane Shannon
24ba3d829e
examples : add batched.swift + improve CI for swift (#3562) 2023-10-11 06:14:05 -05:00
Georgi Gerganov
db3abcc114
sync : ggml (ggml-backend) (#3548)
* sync : ggml (ggml-backend)

ggml-ci

* zig : add ggml-backend to the build
2023-10-08 20:19:14 +03:00