* feat: add Dockerfiles for each platform that user ./server instead of ./main
* feat: update .github/workflows/docker.yml to build server-first docker containers
* doc: add information about running the server with Docker to README.md
* doc: add information about running with docker to the server README
* doc: update n-gpu-layers to show correct GPU usage
* fix(doc): update container tag from `server` to `server-cuda` for README example on running server container with CUDA
* Support for Yi-VL, templating fix for mobileVLM
* ws
* Update examples/llava/clip.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update llava-cli.cpp
* Update clip.cpp
bugfix for new conversions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: add llama_server_queue struct
* server: add llama_server_response_event
* server: add comments
* server: move all mutexes away from server.cpp
* server: correct multitask response
* server: only add back deferred tasks when one slot is available
* server: fix a race condition cause by "request_completion"
* kl-divergence: be able to save all logits to a file
* Add ability to compute KL-divergence
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* MobileVLM native implementation
* delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake
* move android script to example/llava directory
* Fix the editor config checks
---------
Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
This commit adds `--sample-start` and `--include-sample-start` to the
output from the main function in finetune.cpp.
The motivation for this is that even though these are set explicitly by
the user via the command line, if one forgets to set them then it is
useful to have their values printed out. Otherwise it is possible to go
through the whole training process before realizing that the values are
not what one expected.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S
* Q3_K_XS: quanize first 1/8 of ffn_down layers with Q4_K
Together with an importance matrix, this brings perplexity
for LLaMA-v2-70B below the perplexity of the former Q2_K
with a 800 MB smaller quantized model size.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* TruthfulQA: 1st attempt, does not look like it is working
The same implementation can be used for HellaSwag as well,
so I converted a HellaSwag validation dataset to the binary
format used here and tested with that. The score is only
around 50, so something is not quite right.
* TruthfulQA: works but the result is bad
I know it works because if I convert the HellaSwag validation
data to the binary format used in the truthful_qa_score() function
I get the exact same result as from the hellaswag_score() function.
But I guess, the questions are tricky and the way I have done
the combination of question + answer is very likely not the best.
The TruthfulQA validation dataset contains 817 questions, with
random chance result around 19%. With this version I get
29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2.
The HF leader board results for these two models are
42.2% and 68.3%, respectively.
* TruthfulQA: fix random sample
* TruthfulQA: prepare tasks in parallel for large test datasets
* Rename truthful_qa to multiple_choice
* Make MSVC happy
I had forgotten that MSVC does not make constexpr's available
inside a lambda.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
For Mistral-7B and fp16, time on my system goes down from 536 seconds
to 423 seconds for the full evaluation dataset (10042 tasks).
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* winogrande: simple implementation
It doesn't look like it is working - why?
For Mistral-7B it is barely better than
random chance (score ~60% for 1267 tasks), while I see
Mistral-7B scoring 78.4% on the HF leader board.
1-sigma statistical uncertainty for 1267 tasks is ~1.4,
so no way the difference is due to statistics.
* winogrande: somewhat better
Score for Mistrali7-B is now 68.9 on the validation set of
winogrande_debiased. Still far from the reported 78.4, but
better than what I had before.
* winogrande: improving
Mistral-7B score is now 73.56.
Still not quite 78.4 but getting there.
We are also getting a lower score on HellaSwag
compared to HF leader board, so I'm not expecting
we will get up to 78.4 anyway.
It looks like it is better to skip the choice word(s)
when evaluating the average log-likelihood. This kind of
makes sense because a more common word (in Winogrande this is
often a name) will have a higher probability without knowing
about the follow up context, and this will skew the log-likelihood
towards the more common word. We can only do this if the
choice words are not last in the sentence.
It also looks like it is better to skip the punctuation at the
end of the sentence, provided the choice words are not last.
* winogrande: add dataset instructions
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* backend : add eval callback
ggml-ci
* backend : group nodes in a single compute when user don't need them
* backend : clean-up the implementation
ggml-ci
* simple : do not perform tensor data copy if not needed
* simple : fix
* imatrix : offload to GPU support
* imatrix : fix ggml_mul_mat_id hanlding
ggml-ci
* ci : add imatrix test
ggml-ci
* ci : rearrange output
ggml-ci
This commit adds the name of the training data file to the log message
printed when the training data is tokenized.
The motivation for this change is that it can be useful to show which
file is being tokenized when running the finetune example.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Introduce starter project for Android
Based on examples/llama.swiftui.
* Add github workflow
* Set NDK version
* Only build arm64-v8a in CI
* Sync bench code
* Rename CI prop to skip-armeabi-v7a
* Remove unused tests
This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in
finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* examples : save-load-state: save only required state
* llama : only reserve n_vocab * n_batch at most for logits
llama_decode asserts that only n_batch tokens are passed each call, and
n_ctx is expected to be bigger than n_batch.
* llama : always reserve n_vocab * n_batch for logits
llama_context de-serialization breaks if the contexts have differing
capacity for logits and llama_decode will at maximum resize to
n_vocab * n_batch.
* llama : only save and restore used logits
for batch sizes of 512 this reduces save state in the best case by
around 62 MB, which can be a lot if planning to save on each message
to allow regenerating messages.
* llama : use ostringstream and istringstream for save and load
* llama : serialize rng into minimum amount of space required
* llama : break session version due to serialization changes
* add the parameter : --no-display-prompt , combine with --log-disable it will display only the generated tokens
* remove empty line
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : ggml-backend integration
* ggml-backend : add names to buffers
* fix unmap after loading
* batched-bench : add tensor_split param
* llama : check for null tensor_split
* ggml-backend : increase GGML_MAX_BACKENDS
* improve graph splitting, partial fix for --no-kv-offload
* cuda : add ggml-backend split buffer support
* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)
* ggml : fix null backend dereference (#4807)
* ggml : fix null backend dereference
* ggml : also check ggml_backend_is_cpu
* test-backend-ops : check buffer allocation failures
* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)
* ggml : fix mul_mat_id work size
* llama : rewrite session kv load/set without graphs
* minor
* llama : only initialize used backends, free backends on context free
* llama : abort ctx if cuda backend init fails
* llama : rewrite lora with ggml-backend and compute on CPU
ggml-ci
* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer
* opencl : add ggml-backend buffer type
* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
* llama : on Metal, by default offload the full model
ggml-ci
* metal : page align the data ptr (#4854)
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cuda : fix split buffer free
* address review comments
* llama-bench : add split-mode parameter
* fix whitespace
* opencl : fix double initialization
* server : add --split-mode parameter
* use async copy and compute to improve multi-gpu performance
ggml-ci
* use async memcpys to copy the graph outputs to the CPU
* fix opencl
* use a host buffer for the cpu compute buffer for faster copies to the gpu
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
This commit replaces the magic number used in export-lora.cpp with
the one defined in llama.h, which is indirectly included via common.h.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Updated Models Layout
- Added a models drawer
- Added downloading directly from Hugging Face
- Load custom models from local folder
- Delete models by swiping left
* trimmed trailing white space
* Updated Models Layout