In these codes, we want to retain the value that they previously held
when mask[i] is false. So we should use undisturbed. With the default
agnostic policy of rvv intrinsic, these values can be held or be
written with 1s.
Co-authored-by: carter.li <carter.li@starfivetech.com>
* chore: Fix compiler warnings, add help text, improve CLI options
* Add prototypes for function definitions
* Invert logic of --no-clean option to be more intuitive
* Provide a new help prompt with clear instructions
* chore : Add ignore rule for vulkan shader generator
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
* Update ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp
Co-authored-by: 0cc4m <picard12@live.de>
* chore : Remove void and apply C++ style empty parameters
* chore : Remove void and apply C++ style empty parameters
---------
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: 0cc4m <picard12@live.de>
* llama : refactor session file management
* llama : saving and restoring state checks for overflow
The size of the buffers should now be given to the functions working
with them, otherwise a truncated file could cause out of bound reads.
* llama : stream from session file instead of copying into a big buffer
Loading session files should no longer cause a memory usage spike.
* llama : llama_state_get_size returns the actual size instead of max
This is a breaking change, but makes that function *much* easier
to keep up to date, and it also makes it reflect the behavior
of llama_state_seq_get_size.
* llama : share code between whole and seq_id-specific state saving
Both session file types now use a more similar format.
* llama : no longer store all hparams in session files
Instead, the model arch name is stored.
The layer count and the embedding dimensions of the KV cache
are still verified when loading.
Storing all the hparams is not necessary.
* llama : fix uint64_t format type
* llama : various integer type cast and format string fixes
Some platforms use "%lu" and others "%llu" for uint64_t.
Not sure how to handle that, so casting to size_t when displaying errors.
* llama : remove _context suffix for llama_data_context
* llama : fix session file loading
llama_state_get_size cannot be used to get the max size anymore.
* llama : more graceful error handling of invalid session files
* llama : remove LLAMA_MAX_RNG_STATE
It's no longer necessary to limit the size of the RNG state,
because the max size of session files is not estimated anymore.
* llama : cast seq_id in comparison with unsigned n_seq_max
* Update doc for MUSA
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Add GGML_MUSA in Makefile
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Add GGML_MUSA in CMake
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* CUDA => MUSA
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* MUSA adds support for __vsubss4
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Fix CI build failure
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Apply a loop tiling technique to the generic path, which provides
performance upside for ISAs with enough registers to take advantage
of it. Also helps the compiler optimize this path.
* Add support for float16 tensors in 1d pooling operations
* Add support for float16 input tensors in 2d pooling operations
* code cleanup
remove unnecessary casting during srow ptr initialization
---------
Co-authored-by: vanaka11 <vanaka1189@gmail.com>
This prevents invalid frees when destroying a partially initialized
vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer
when running out of device memory.
Co-authored-by: Tony Wasserka <neobrain@users.noreply.github.com>
This commit removes an UNUSED macro call that is not needed as the
variable n0 is used in the code and will not produce a warning.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Add llama 3.1 rope scaling factors to llama conversion and inference
This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope oepration, improving results for context windows above 8192
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <git@compilade.net>
* address comments
* address comments
* Update src/llama.cpp
Co-authored-by: compilade <git@compilade.net>
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <git@compilade.net>
---------
Co-authored-by: compilade <git@compilade.net>
This commit adds a --no-warmup option for llama-cli.
The motivation for this is that it can be convenient to skip the
warmup llama_decode call when debugging.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
`ggml_init` can fail if no unused context is found. In that case, a NULL-pointer deref will happen later in the code during a call to `ggml_set_on_alloc`.
This fixes it by bailing out if no context is found.
* Improvements for Windows with Snapdragon X
* Revert "Improvements for Windows with Snapdragon X"
This reverts commit bf21397ae5.
* Improvements for Windows with Snapdragon X
* WOA build clarifications
* WIndows on ARM build clarifications
* cmake build for Windows clarifications
* Update docs/build.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: AndreasKunar <andreaskmsn.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The check gating the use of `__builtin_amdgc_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, let's use it.