Mirror of https://github.com/ggerganov/llama.cpp.git (synced 2025-01-12 11:40:17 +00:00)

Commit daeaeb1222: Merge remote-tracking branch 'origin/master' into bins
.github/PULL_REQUEST_TEMPLATE/pull_request_template.md (vendored, new file, +5)
@@ -0,0 +1,5 @@
+- Self Reported Review Complexity:
+- [ ] Review Complexity : Low
+- [ ] Review Complexity : Medium
+- [ ] Review Complexity : High
+- [ ] I have read the [contributing guidelines](CONTRIBUTING.md)
.github/workflows/server.yml (vendored, 6 changed lines)
@@ -16,11 +16,9 @@ on:
     branches:
       - master
     paths: ['.github/workflows/server.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*']
-  pull_request_target:
+  pull_request:
     types: [opened, synchronize, reopened]
     paths: ['.github/workflows/server.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*']
-  schedule:
-    - cron: '2 4 * * *'

 concurrency:
   group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}
@@ -115,7 +113,7 @@ jobs:
   server-windows:
-    runs-on: windows-latest
+    runs-on: windows-2019

     steps:
       - name: Clone
CONTRIBUTING.md (new file, +14)
@@ -0,0 +1,14 @@
+# Contributing Guidelines
+
+## Checklist
+
+* Make sure your PR follows the [coding guidelines](https://github.com/ggerganov/llama.cpp/blob/master/README.md#coding-guidelines)
+* Test your changes using the commands in the [`tests`](tests) folder. For instance, running the `./tests/test-backend-ops` command tests different backend implementations of the GGML library
+* Execute [the full CI locally on your machine](ci/README.md) before publishing
+
+## PR formatting
+
+* Please rate the complexity of your PR (i.e. `Review Complexity : Low`, `Review Complexity : Medium`, `Review Complexity : High`). This makes it easier for maintainers to triage the PRs.
+  - The PR template has a series of review complexity checkboxes `[ ]` that you can mark as `[X]` for your convenience. Refer to [About task lists](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/about-task-lists) for more information.
+* If the pull request only contains documentation changes (e.g., updating READMEs, adding new wiki pages), please add `[no ci]` to the commit title. This will skip unnecessary CI checks and help reduce build times.
+* When squashing multiple commits on merge, use the following format for your commit title: `<module> : <commit title> (#<issue_number>)`. For example: `utils : Fix typo in utils.py (#1234)`
README.md (29 changed lines)
@@ -53,7 +53,6 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
     <li><a href="#quantization">Quantization</a></li>
     <li><a href="#interactive-mode">Interactive mode</a></li>
     <li><a href="#constrained-output-with-grammars">Constrained output with grammars</a></li>
-    <li><a href="#instruct-mode">Instruct mode</a></li>
     <li><a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a></li>
     <li><a href="#seminal-papers-and-background-on-the-models">Seminal papers and background on the models</a></li>
     <li><a href="#perplexity-measuring-model-quality">Perplexity (measuring model quality)</a></li>
@@ -769,34 +768,6 @@ The `grammars/` folder contains a handful of sample grammars. To write your own,
 For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.

-### Instruct mode
-
-1. First, download and place the `ggml` model into the `./models` folder
-2. Run the `main` tool like this:
-
-```
-./examples/alpaca.sh
-```
-
-Sample run:
-
-```
-== Running in interactive mode. ==
-- Press Ctrl+C to interject at any time.
-- Press Return to return control to LLaMA.
-- If you want to submit another line, end your input in '\'.
-
-Below is an instruction that describes a task. Write a response that appropriately completes the request.
-
-> How many letters are there in the English alphabet?
-There 26 letters in the English Alphabet
-> What is the most common way of transportation in Amsterdam?
-The majority (54%) are using public transit. This includes buses, trams and metros with over 100 lines throughout the city which make it very accessible for tourists to navigate around town as well as locals who commute by tram or metro on a daily basis
-> List 5 words that start with "ca".
-cadaver, cauliflower, cabbage (vegetable), catalpa (tree) and Cailleach.
->
-```
-
 ### Obtaining and using the Facebook LLaMA 2 model

 - Refer to [Facebook's LLaMA download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) if you want to access the model data.
@@ -200,19 +200,13 @@ void gpt_params_handle_model_default(gpt_params & params) {
             }
             params.hf_file = params.model;
         } else if (params.model.empty()) {
-            std::string cache_directory = fs_get_cache_directory();
-            const bool success = fs_create_directory_with_parents(cache_directory);
-            if (!success) {
-                throw std::runtime_error("failed to create cache directory: " + cache_directory);
-            }
-            params.model = cache_directory + string_split(params.hf_file, '/').back();
+            params.model = fs_get_cache_file(string_split(params.hf_file, '/').back());
         }
     } else if (!params.model_url.empty()) {
         if (params.model.empty()) {
            auto f = string_split(params.model_url, '#').front();
            f = string_split(f, '?').front();
-           f = string_split(f, '/').back();
-           params.model = "models/" + f;
+           params.model = fs_get_cache_file(string_split(f, '/').back());
        }
    } else if (params.model.empty()) {
        params.model = DEFAULT_MODEL_PATH;
@@ -2279,6 +2273,16 @@ std::string fs_get_cache_directory() {
     return ensure_trailing_slash(cache_directory);
 }

+std::string fs_get_cache_file(const std::string & filename) {
+    GGML_ASSERT(filename.find(DIRECTORY_SEPARATOR) == std::string::npos);
+    std::string cache_directory = fs_get_cache_directory();
+    const bool success = fs_create_directory_with_parents(cache_directory);
+    if (!success) {
+        throw std::runtime_error("failed to create cache directory: " + cache_directory);
+    }
+    return cache_directory + filename;
+}
+
 //
 // Model utils
@@ -277,6 +277,7 @@ bool fs_validate_filename(const std::string & filename);
 bool fs_create_directory_with_parents(const std::string & path);

 std::string fs_get_cache_directory();
+std::string fs_get_cache_file(const std::string & filename);

 //
 // Model utils
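These two hunks replace the inline cache-directory handling in `gpt_params_handle_model_default` with a reusable `fs_get_cache_file` helper that both the Hugging Face and URL download paths now call. A minimal usage sketch follows; it assumes the declarations above are reachable through llama.cpp's `common.h`, and the model file name is purely illustrative.

```cpp
// Sketch: resolve a bare file name to a path inside the per-user cache directory.
// fs_get_cache_file() creates the cache directory on demand, throws std::runtime_error
// if creation fails, and asserts that the name contains no directory separator.
#include <cstdio>
#include <stdexcept>
#include <string>

#include "common.h"

int main() {
    try {
        const std::string path = fs_get_cache_file("my-model.Q4_K_M.gguf"); // illustrative name
        std::printf("model would be cached at: %s\n", path.c_str());
    } catch (const std::runtime_error & err) {
        std::fprintf(stderr, "cache setup failed: %s\n", err.what());
        return 1;
    }
    return 0;
}
```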
@@ -47,11 +47,12 @@ class Model:
     _model_classes: dict[str, type[Model]] = {}

     dir_model: Path
-    ftype: int
+    ftype: gguf.LlamaFileType
     is_big_endian: bool
     endianess: gguf.GGUFEndian
     use_temp_file: bool
     lazy: bool
+    model_name: str | None
     part_names: list[str]
     is_safetensors: bool
     hparams: dict[str, Any]
@@ -64,7 +65,7 @@ class Model:
     # subclasses should define this!
     model_arch: gguf.MODEL_ARCH

-    def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path, is_big_endian: bool, use_temp_file: bool, eager: bool):
+    def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path, is_big_endian: bool, use_temp_file: bool, eager: bool, model_name: str | None):
         if type(self) is Model:
             raise TypeError(f"{type(self).__name__!r} should not be directly instantiated")
         self.dir_model = dir_model
@@ -73,10 +74,11 @@ class Model:
         self.endianess = gguf.GGUFEndian.BIG if is_big_endian else gguf.GGUFEndian.LITTLE
         self.use_temp_file = use_temp_file
         self.lazy = not eager
-        self.part_names = Model.get_model_part_names(self.dir_model, ".safetensors")
+        self.model_name = model_name
+        self.part_names = Model.get_model_part_names(self.dir_model, "model", ".safetensors")
         self.is_safetensors = len(self.part_names) > 0
         if not self.is_safetensors:
-            self.part_names = Model.get_model_part_names(self.dir_model, ".bin")
+            self.part_names = Model.get_model_part_names(self.dir_model, "pytorch_model", ".bin")
         self.hparams = Model.load_hparams(self.dir_model)
         self.block_count = self.find_hparam(["n_layers", "num_hidden_layers", "n_layer"])
         self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)
@@ -94,7 +96,7 @@ class Model:
             ftype_lw: str = ftype_up.lower()
             # allow templating the file name with the output ftype, useful with the "auto" ftype
             self.fname_out = fname_out.parent / fname_out.name.format(ftype_lw, outtype=ftype_lw, ftype=ftype_lw, OUTTYPE=ftype_up, FTYPE=ftype_up)
-        self.gguf_writer = gguf.GGUFWriter(self.fname_out, gguf.MODEL_ARCH_NAMES[self.model_arch], endianess=self.endianess, use_temp_file=self.use_temp_file)
+        self.gguf_writer = gguf.GGUFWriter(path=None, arch=gguf.MODEL_ARCH_NAMES[self.model_arch], endianess=self.endianess, use_temp_file=self.use_temp_file)

     @classmethod
     def __init_subclass__(cls):
@@ -182,7 +184,7 @@ class Model:
         return new_name

     def set_gguf_parameters(self):
-        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
         self.gguf_writer.add_block_count(self.block_count)

         if (n_ctx := self.find_hparam(["max_position_embeddings", "n_ctx"], optional=True)) is not None:
@@ -324,21 +326,21 @@ class Model:

     def write(self):
         self.write_tensors()
-        self.gguf_writer.write_header_to_file()
+        self.gguf_writer.write_header_to_file(self.fname_out)
         self.gguf_writer.write_kv_data_to_file()
         self.gguf_writer.write_tensors_to_file(progress=True)
         self.gguf_writer.close()

     def write_vocab(self):
-        self.gguf_writer.write_header_to_file()
+        self.gguf_writer.write_header_to_file(self.fname_out)
         self.gguf_writer.write_kv_data_to_file()
         self.gguf_writer.close()

     @staticmethod
-    def get_model_part_names(dir_model: Path, suffix: str) -> list[str]:
+    def get_model_part_names(dir_model: Path, prefix: str, suffix: str) -> list[str]:
         part_names: list[str] = []
         for filename in os.listdir(dir_model):
-            if filename.endswith(suffix):
+            if filename.startswith(prefix) and filename.endswith(suffix):
                 part_names.append(filename)

         part_names.sort()
@@ -665,7 +667,7 @@ class GPTNeoXModel(Model):
     def set_gguf_parameters(self):
         block_count = self.hparams["num_hidden_layers"]

-        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
         self.gguf_writer.add_context_length(self.hparams["max_position_embeddings"])
         self.gguf_writer.add_embedding_length(self.hparams["hidden_size"])
         self.gguf_writer.add_block_count(block_count)
@@ -798,7 +800,7 @@ class MPTModel(Model):

     def set_gguf_parameters(self):
         block_count = self.hparams["n_layers"]
-        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
         self.gguf_writer.add_context_length(self.hparams["max_seq_len"])
         self.gguf_writer.add_embedding_length(self.hparams["d_model"])
         self.gguf_writer.add_block_count(block_count)
@@ -850,7 +852,7 @@ class OrionModel(Model):
            raise ValueError("gguf: can not find ctx length parameter.")

        self.gguf_writer.add_file_type(self.ftype)
-        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
        self.gguf_writer.add_source_hf_repo(hf_repo)
        self.gguf_writer.add_tensor_data_layout("Meta AI original pth")
        self.gguf_writer.add_context_length(ctx_length)
@@ -887,7 +889,7 @@ class BaichuanModel(Model):
        else:
            raise ValueError("gguf: can not find ctx length parameter.")

-        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
        self.gguf_writer.add_source_hf_repo(hf_repo)
        self.gguf_writer.add_tensor_data_layout("Meta AI original pth")
        self.gguf_writer.add_context_length(ctx_length)
@@ -1010,7 +1012,7 @@ class XverseModel(Model):
        else:
            raise ValueError("gguf: can not find ctx length parameter.")

-        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
        self.gguf_writer.add_source_hf_repo(hf_repo)
        self.gguf_writer.add_tensor_data_layout("Meta AI original pth")
        self.gguf_writer.add_context_length(ctx_length)
@@ -1206,7 +1208,7 @@ class StableLMModel(Model):
        hparams = self.hparams
        block_count = hparams["num_hidden_layers"]

-        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
        self.gguf_writer.add_context_length(hparams["max_position_embeddings"])
        self.gguf_writer.add_embedding_length(hparams["hidden_size"])
        self.gguf_writer.add_block_count(block_count)
@@ -1681,7 +1683,7 @@ class GPT2Model(Model):
    model_arch = gguf.MODEL_ARCH.GPT2

    def set_gguf_parameters(self):
-        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
        self.gguf_writer.add_block_count(self.hparams["n_layer"])
        self.gguf_writer.add_context_length(self.hparams["n_ctx"])
        self.gguf_writer.add_embedding_length(self.hparams["n_embd"])
@@ -2248,7 +2250,7 @@ class GemmaModel(Model):
        hparams = self.hparams
        block_count = hparams["num_hidden_layers"]

-        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
        self.gguf_writer.add_context_length(hparams["max_position_embeddings"])
        self.gguf_writer.add_embedding_length(hparams["hidden_size"])
        self.gguf_writer.add_block_count(block_count)
@@ -2348,7 +2350,7 @@ class MambaModel(Model):
        # Fail early for models which don't have a block expansion factor of 2
        assert d_inner == 2 * d_model

-        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
        self.gguf_writer.add_context_length(2**20) # arbitrary value; for those who use the default
        self.gguf_writer.add_embedding_length(d_model)
        self.gguf_writer.add_feed_forward_length(0) # unused, but seemingly required when loading
@@ -2852,7 +2854,7 @@ def main() -> None:
        logger.error(f"Model {hparams['architectures'][0]} is not supported")
        sys.exit(1)

-    model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian, args.use_temp_file, args.no_lazy)
+    model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian, args.use_temp_file, args.no_lazy, args.model_name)

    logger.info("Set model parameters")
    model_instance.set_gguf_parameters()
@@ -1,19 +0,0 @@
-#!/bin/bash
-
-#
-# Temporary script - will be removed in the future
-#
-
-cd `dirname $0`
-cd ..
-
-./llama-cli -m ./models/alpaca.13b.ggmlv3.q8_0.bin \
-    --color \
-    -f ./prompts/alpaca.txt \
-    --ctx_size 2048 \
-    -n -1 \
-    -ins -b 256 \
-    --top_k 10000 \
-    --temp 0.2 \
-    --repeat_penalty 1.1 \
-    -t 7
@@ -1,15 +0,0 @@
-#!/bin/bash
-
-#
-# Temporary script - will be removed in the future
-#
-
-cd `dirname $0`
-cd ..
-
-./llama-cli --color --instruct --threads 4 \
-    --model ./models/gpt4all-7B/gpt4all-lora-quantized.bin \
-    --file ./prompts/alpaca.txt \
-    --batch_size 8 --ctx_size 2048 -n -1 \
-    --repeat_last_n 64 --repeat_penalty 1.3 \
-    --n_predict 128 --temp 0.1 --top_k 40 --top_p 0.95
@@ -218,20 +218,64 @@ void IMatrixCollector::save_imatrix(int ncall) const {
         fname += std::to_string(ncall);
     }

+    // avoid writing imatrix entries that do not have full data
+    // this can happen with MoE models where some of the experts end up not being exercised by the provided training data
+
+    int n_entries = 0;
+    std::vector<std::string> to_store;
+
+    bool is_first = true; // for printing
+    for (const auto & kv : m_stats) {
+        const int n_all = kv.second.counts.size();
+
+        if (n_all == 0) {
+            continue;
+        }
+
+        int n_zeros = 0;
+        for (const int c : kv.second.counts) {
+            if (c == 0) {
+                n_zeros++;
+            }
+        }
+
+        if (n_zeros != 0 && is_first) {
+            fprintf(stderr, "\n");
+            is_first = false;
+        }
+
+        if (n_zeros == n_all) {
+            fprintf(stderr, "%s: entry '%40s' has no data - skipping\n", __func__, kv.first.c_str());
+            continue;
+        }
+
+        if (n_zeros > 0) {
+            fprintf(stderr, "%s: entry '%40s' has partial data (%.2f%%) - skipping\n", __func__, kv.first.c_str(), 100.0f * (n_all - n_zeros) / n_all);
+            continue;
+        }
+
+        n_entries++;
+        to_store.push_back(kv.first);
+    }
+
+    if (to_store.size() < m_stats.size()) {
+        fprintf(stderr, "%s: warning: storing only %zu out of %zu entries\n", __func__, to_store.size(), m_stats.size());
+    }
+
     std::ofstream out(fname, std::ios::binary);
-    int n_entries = m_stats.size();
     out.write((const char *) &n_entries, sizeof(n_entries));
-    for (const auto & p : m_stats) {
-        int len = p.first.size();
+    for (const auto & name : to_store) {
+        const auto & stat = m_stats.at(name);
+        int len = name.size();
         out.write((const char *) &len, sizeof(len));
-        out.write(p.first.c_str(), len);
-        out.write((const char *) &p.second.ncall, sizeof(p.second.ncall));
-        int nval = p.second.values.size();
+        out.write(name.c_str(), len);
+        out.write((const char *) &stat.ncall, sizeof(stat.ncall));
+        int nval = stat.values.size();
         out.write((const char *) &nval, sizeof(nval));
         if (nval > 0) {
             std::vector<float> tmp(nval);
             for (int i = 0; i < nval; i++) {
-                tmp[i] = (p.second.values[i] / static_cast<float>(p.second.counts[i])) * static_cast<float>(p.second.ncall);
+                tmp[i] = (stat.values[i] / static_cast<float>(stat.counts[i])) * static_cast<float>(stat.ncall);
             }
             out.write((const char*)tmp.data(), nval*sizeof(float));
         }
     }
@@ -1,18 +0,0 @@
-#!/bin/bash
-
-#
-# Temporary script - will be removed in the future
-#
-
-cd `dirname $0`
-cd ..
-
-./llama-cli -m models/available/Llama2/13B/llama-2-13b.ggmlv3.q4_0.bin \
-    --color \
-    --ctx_size 2048 \
-    -n -1 \
-    -ins -b 256 \
-    --top_k 10000 \
-    --temp 0.2 \
-    --repeat_penalty 1.1 \
-    -t 8
@@ -1,18 +0,0 @@
-#!/bin/bash
-
-#
-# Temporary script - will be removed in the future
-#
-
-cd `dirname $0`
-cd ..
-
-./llama-cli -m models/available/Llama2/7B/llama-2-7b.ggmlv3.q4_0.bin \
-    --color \
-    --ctx_size 2048 \
-    -n -1 \
-    -ins -b 256 \
-    --top_k 10000 \
-    --temp 0.2 \
-    --repeat_penalty 1.1 \
-    -t 8
@@ -6,10 +6,6 @@
 #include "ggml-metal.h"
 #endif

-#ifdef GGML_USE_SYCL
-#include "ggml-sycl.h"
-#endif
-
 #include "ggml-rpc.h"
 #ifdef _WIN32
 #  include <windows.h>
@@ -83,12 +79,6 @@ static ggml_backend_t create_backend() {
     if (!backend) {
         fprintf(stderr, "%s: ggml_backend_metal_init() failed\n", __func__);
     }
-#elif GGML_USE_SYCL
-    fprintf(stderr, "%s: using SYCL backend\n", __func__);
-    backend = ggml_backend_sycl_init(0); // init device 0
-    if (!backend) {
-        fprintf(stderr, "%s: ggml_backend_sycl_init() failed\n", __func__);
-    }
 #endif

     // if there aren't GPU Backends fallback to CPU backend
@@ -416,7 +416,7 @@
             message = html`<${Probabilities} data=${data} />`
         } else {
             const text = isArrayMessage ?
-                data.map(msg => msg.content).join('').replace(/^\s+/, '') :
+                data.map(msg => msg.content).join('') :
                 data;
             message = isCompletionMode ?
                 text :
@@ -147,7 +147,7 @@ struct server_slot {
     int32_t n_prompt_tokens = 0;
     int32_t n_prompt_tokens_processed = 0;

-    json prompt;
+    std::string prompt;

     // when a task is submitted, we first tokenize the prompt and store it here
     std::vector<llama_token> prompt_tokens;
@@ -822,13 +822,8 @@ struct server_context {
                 continue;
             }

-            // skip the slot if it does not contains prompt
-            if (!slot.prompt.is_string()) {
-                continue;
-            }
-
             // current slot's prompt
-            std::string slot_prompt = slot.prompt.get<std::string>();
+            std::string slot_prompt = slot.prompt;

             // length of the current slot's prompt
             int slot_prompt_len = slot_prompt.size();
@@ -958,13 +953,16 @@ struct server_context {
         if (!task.infill) {
             const auto & prompt = data.find("prompt");
             if (prompt == data.end()) {
-                send_error(task, "Either \"prompt\" or \"messages\" must be provided", ERROR_TYPE_INVALID_REQUEST);
+                send_error(task, "\"prompt\" must be provided", ERROR_TYPE_INVALID_REQUEST);
                 return false;
-            } else {
-                slot.prompt = *prompt;
             }
-            if (slot.prompt.is_array() && slot.prompt.size() == 0) {
-                send_error(task, "\"prompt\" cannot be an empty array", ERROR_TYPE_INVALID_REQUEST);
+
+            if (prompt->is_string()) {
+                slot.prompt = prompt->get<std::string>();
+            } else if (prompt->is_array() && prompt->size() == 1 && prompt->at(0).is_string()) {
+                slot.prompt = prompt->at(0).get<std::string>();
+            } else {
+                send_error(task, "\"prompt\" must be a string or an array of strings", ERROR_TYPE_INVALID_REQUEST);
                 return false;
             }
         }
@@ -1582,14 +1580,18 @@ struct server_context {
         switch (task.type) {
             case SERVER_TASK_TYPE_COMPLETION:
                 {
-                    int id_slot = json_value(task.data, "id_slot", -1);
-                    std::string prompt = json_value(task.data, "prompt", std::string());
+                    const int id_slot = json_value(task.data, "id_slot", -1);

                     server_slot * slot;

                     if (id_slot != -1) {
                         slot = get_slot_by_id(id_slot);
                     } else {
+                        std::string prompt;
+                        if (task.data.contains("prompt") && task.data.at("prompt").is_string()) {
+                            json_value(task.data, "prompt", std::string());
+                        }
+
                         slot = get_available_slot(prompt);
                     }
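Taken together, the server hunks above make slot prompts plain `std::string`s and tighten request validation: `"prompt"` is now accepted only as a JSON string or as a one-element array holding a single string. A standalone sketch of that acceptance rule, assuming the `nlohmann::json` type the server already uses:

```cpp
// Sketch only: mirrors the acceptance rule from the hunk above for illustration.
#include <iostream>
#include <string>

#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Returns true and fills `out` if `prompt` has one of the accepted shapes.
static bool extract_prompt(const json & prompt, std::string & out) {
    if (prompt.is_string()) {
        out = prompt.get<std::string>();
        return true;
    }
    if (prompt.is_array() && prompt.size() == 1 && prompt.at(0).is_string()) {
        out = prompt.at(0).get<std::string>();
        return true;
    }
    return false; // anything else (empty arrays, token arrays, multi-prompt batches) is rejected
}

int main() {
    std::string p;
    std::cout << extract_prompt(json("hello"), p) << "\n";           // 1: plain string
    std::cout << extract_prompt(json::array({"hello"}), p) << "\n";  // 1: one-element string array
    std::cout << extract_prompt(json::array({"a", "b"}), p) << "\n"; // 0: rejected
    return 0;
}
```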
@@ -20,11 +20,11 @@
     },
     "nixpkgs": {
       "locked": {
-        "lastModified": 1716948383,
-        "narHash": "sha256-SzDKxseEcHR5KzPXLwsemyTR/kaM9whxeiJohbL04rs=",
+        "lastModified": 1717786204,
+        "narHash": "sha256-4q0s6m0GUcN7q+Y2DqD27iLvbcd1G50T2lv08kKxkSI=",
         "owner": "NixOS",
         "repo": "nixpkgs",
-        "rev": "ad57eef4ef0659193044870c731987a6df5cf56b",
+        "rev": "051f920625ab5aabe37c920346e3e69d7d34400e",
         "type": "github"
       },
       "original": {
ggml-cuda.cu (88 changed lines)
@@ -1347,10 +1347,30 @@ static void ggml_cuda_set_peer_access(const int n_tokens, int main_device) {
     GGML_UNUSED(main_device);
 }

+static cudaError_t ggml_cuda_Memcpy2DPeerAsync(
+    void * dst, int dstDevice, size_t dpitch, void * src, int srcDevice, size_t spitch, size_t width, size_t height, cudaStream_t stream) {
+
+#if !defined(GGML_USE_HIPBLAS)
+    // cudaMemcpy2DAsync may fail with copies between vmm pools of different devices
+    cudaMemcpy3DPeerParms p = {};
+    p.dstDevice = dstDevice;
+    p.dstPtr = make_cudaPitchedPtr(dst, dpitch, dpitch, height);
+    p.srcDevice = srcDevice;
+    p.srcPtr = make_cudaPitchedPtr(src, spitch, spitch, height);
+    p.extent = make_cudaExtent(width, height, 1);
+    return cudaMemcpy3DPeerAsync(&p, stream);
+#else
+    // HIP does not support cudaMemcpy3DPeerAsync or vmm pools
+    GGML_UNUSED(dstDevice);
+    GGML_UNUSED(srcDevice);
+    return cudaMemcpy2DAsync(dst, dpitch, src, spitch, width, height, cudaMemcpyDeviceToDevice, stream);
+#endif // !defined(GGML_USE_HIPBLAS)
+}
+
 static void ggml_cuda_op_mul_mat(
     ggml_backend_cuda_context & ctx,
     const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, ggml_cuda_op_mul_mat_t op,
-    const bool convert_src1_to_q8_1) {
+    quantize_cuda_t quantize_src1) {

     const int64_t ne00 = src0->ne[0];
     const int64_t ne01 = src0->ne[1];
@@ -1407,7 +1427,9 @@ static void ggml_cuda_op_mul_mat(
     }

     struct dev_data {
-        ggml_cuda_pool_alloc<char> src0_dd_alloc;
+        int cc;
+
+        ggml_cuda_pool_alloc<char> src0_dd_alloc;
         ggml_cuda_pool_alloc<float> src1_ddf_alloc;
         ggml_cuda_pool_alloc<char> src1_ddq_alloc;
         ggml_cuda_pool_alloc<float> dst_dd_alloc;
@@ -1426,6 +1448,8 @@ static void ggml_cuda_op_mul_mat(
     int used_devices = 0;

     for (int id = 0; id < ggml_backend_cuda_get_device_count(); ++id) {
+        dev[id].cc = ggml_cuda_info().devices[id].cc;
+
         // by default, use all rows
         dev[id].row_low = 0;
         dev[id].row_high = ne01;
@@ -1476,11 +1500,15 @@ static void ggml_cuda_op_mul_mat(
             dev[id].src1_ddf = dev[id].src1_ddf_alloc.alloc(ctx.pool(id), ggml_nelements(src1));
         }

-        if (convert_src1_to_q8_1) {
-            dev[id].src1_ddq = dev[id].src1_ddq_alloc.alloc(ctx.pool(id), nrows1*src1_padded_col_size*q8_1_ts/q8_1_bs);
+        if (quantize_src1) {
+            size_t src_1_ddq_size = nrows1*src1_padded_col_size*q8_1_ts/q8_1_bs;
+            if (quantize_src1 == quantize_mmq_q8_1_cuda) {
+                src_1_ddq_size += get_mmq_x_max_host(dev[id].cc)*sizeof(block_q8_1_mmq);
+            }
+            dev[id].src1_ddq = dev[id].src1_ddq_alloc.alloc(ctx.pool(id), src_1_ddq_size);

             if (src1_on_device && src1_is_contiguous) {
-                quantize_row_q8_1_cuda(dev[id].src1_ddf, dev[id].src1_ddq, ne10, nrows1, src1_padded_col_size, stream);
+                quantize_src1(dev[id].src1_ddf, dev[id].src1_ddq, ne10, ne11, ne12*ne13, src1_padded_col_size, src0->type, stream);
                 CUDA_CHECK(cudaGetLastError());
             }
         }
@@ -1526,7 +1554,12 @@ static void ggml_cuda_op_mul_mat(
                 const int64_t i03 = i0 / ne12;
                 const int64_t i02 = i0 % ne12;

-                const size_t src1_ddq_i_offset = (i0*ne11 + src1_col_0) * src1_padded_col_size*q8_1_ts/q8_1_bs;
+                size_t src1_ddq_i_offset = i0*ne11 * src1_padded_col_size*q8_1_ts/q8_1_bs;
+                if (quantize_src1 == quantize_mmq_q8_1_cuda) {
+                    src1_ddq_i_offset += src1_col_0 * sizeof(block_q8_1_mmq);
+                } else {
+                    src1_ddq_i_offset += src1_col_0 * src1_padded_col_size*q8_1_ts/q8_1_bs;
+                }

                 // for split tensors the data begins at i0 == i0_offset_low
                 char * src0_dd_i = dev[id].src0_dd + (i0/i02_divisor) * (ne01*ne00*src0_ts)/src0_bs;
@@ -1543,10 +1576,17 @@ static void ggml_cuda_op_mul_mat(
                 // copy src0, src1 to device if necessary
                 if (src1_is_contiguous) {
                     if (id != ctx.device) {
-                        if (convert_src1_to_q8_1) {
+                        if (quantize_src1) {
                             char * src1_ddq_i_source = dev[ctx.device].src1_ddq + src1_ddq_i_offset;
-                            CUDA_CHECK(cudaMemcpyPeerAsync(src1_ddq_i, id, src1_ddq_i_source, ctx.device,
-                                src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs, stream));
+                            if (quantize_src1 == quantize_mmq_q8_1_cuda) {
+                                const size_t pitch = ne11*sizeof(block_q8_1_mmq);
+                                const size_t width = src1_ncols*sizeof(block_q8_1_mmq);
+                                const size_t height = src1_padded_col_size/(4*QK8_1);
+                                CUDA_CHECK(ggml_cuda_Memcpy2DPeerAsync(src1_ddq_i, id, pitch, src1_ddq_i_source, ctx.device, pitch, width, height, stream));
+                            } else {
+                                CUDA_CHECK(cudaMemcpyPeerAsync(
+                                    src1_ddq_i, id, src1_ddq_i_source, ctx.device, src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs, stream));
+                            }
                         } else {
                             float * src1_ddf_i_source = (float *) src1->data;
                             src1_ddf_i_source += (i0*ne11 + src1_col_0) * ne10;
@@ -1561,8 +1601,8 @@ static void ggml_cuda_op_mul_mat(
                     GGML_ASSERT(false);
                 }

-                if (convert_src1_to_q8_1 && !src1_is_contiguous) {
-                    quantize_row_q8_1_cuda(src1_ddf_i, src1_ddq_i, ne10, src1_ncols, src1_padded_col_size, stream);
+                if (quantize_src1 && !src1_is_contiguous) {
+                    quantize_src1(src1_ddf_i, src1_ddq_i, ne10, src1_ncols, 1, src1_padded_col_size, src0->type, stream);
                     CUDA_CHECK(cudaGetLastError());
                 }
@@ -1587,22 +1627,8 @@ static void ggml_cuda_op_mul_mat(
                     float * dhf_dst_i = (float *) ((char *) dst_off_device + i02*nb2 + i03*nb3);
                     GGML_ASSERT(dst->nb[1] == ne0*sizeof(float));
                     dhf_dst_i += src1_col_0*ne0 + dev[id].row_low;
-#if !defined(GGML_USE_HIPBLAS)
-                    // cudaMemcpy2DAsync may fail with copies between vmm pools of different devices
-                    cudaMemcpy3DPeerParms p = {};
-                    p.dstDevice = ctx.device;
-                    p.dstPtr = make_cudaPitchedPtr(dhf_dst_i, ne0*sizeof(float), row_diff, src1_ncols);
-                    p.srcDevice = id;
-                    p.srcPtr = make_cudaPitchedPtr(dst_dd_i, row_diff*sizeof(float), row_diff, src1_ncols);
-                    p.extent = make_cudaExtent(row_diff*sizeof(float), src1_ncols, 1);
-                    CUDA_CHECK(cudaMemcpy3DPeerAsync(&p, stream));
-#else
-                    // HIP does not support cudaMemcpy3DPeerAsync or vmm pools
-                    CUDA_CHECK(cudaMemcpy2DAsync(dhf_dst_i, ne0*sizeof(float),
-                                                 dst_dd_i, row_diff*sizeof(float),
-                                                 row_diff*sizeof(float), src1_ncols,
-                                                 cudaMemcpyDeviceToDevice, stream));
-#endif
+                    CUDA_CHECK(ggml_cuda_Memcpy2DPeerAsync(
+                        dhf_dst_i, ctx.device, ne0*sizeof(float), dst_dd_i, id, row_diff*sizeof(float), row_diff*sizeof(float), src1_ncols, stream));
                 } else {
                     float * dhf_dst_i = (float *) ((char *) dst_off_device + i02*nb2 + i03*nb3);
                     GGML_ASSERT(dst->nb[1] == ne0*sizeof(float));
@@ -1941,13 +1967,13 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
         // KQ + KQV multi-batch
         ggml_cuda_mul_mat_batched_cublas(ctx, src0, src1, dst);
     } else if (use_dequantize_mul_mat_vec) {
-        ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_dequantize_mul_mat_vec, false);
+        ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_dequantize_mul_mat_vec, nullptr);
     } else if (use_mul_mat_vec_q) {
-        ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_vec_q, true);
+        ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_vec_q, quantize_row_q8_1_cuda);
     } else if (use_mul_mat_q) {
-        ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_q, true);
+        ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_q, quantize_mmq_q8_1_cuda);
     } else {
-        ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_cublas, false);
+        ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_cublas, nullptr);
     }
 }
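Across the ggml-cuda.cu hunks above, the boolean `convert_src1_to_q8_1` argument of `ggml_cuda_op_mul_mat` becomes a `quantize_src1` callback: the cuBLAS and dequantize paths now pass `nullptr`, while the MMVQ and MMQ paths pass `quantize_row_q8_1_cuda` and the new `quantize_mmq_q8_1_cuda`. The standalone sketch below shows the general shape of that flag-to-callback refactor; the function-pointer type and names are illustrative, not the real ggml-cuda signatures.

```cpp
// Sketch of the refactor's shape: a boolean flag becomes a callback, so the caller
// picks the quantization routine (or none) and the worker no longer hard-codes it.
#include <cstdio>

using quantize_fn = void (*)(const float * src, char * dst, int n);

static void quantize_q8_1_stub(const float * /*src*/, char * /*dst*/, int n) {
    std::printf("quantizing %d values to q8_1\n", n);
}

static void run_mul_mat(quantize_fn quantize_src1) {
    if (quantize_src1) {              // previously: if (convert_src1_to_q8_1)
        quantize_src1(nullptr, nullptr, 128);
    } else {
        std::printf("using float src1 directly\n");
    }
}

int main() {
    run_mul_mat(quantize_q8_1_stub);  // MMQ / MMVQ style paths pass a quantization kernel
    run_mul_mat(nullptr);             // cuBLAS / dequantize style paths pass nullptr
    return 0;
}
```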
@@ -139,6 +139,7 @@
 #define CC_PASCAL 600
 #define MIN_CC_DP4A 610 // minimum compute capability for __dp4a, an intrinsic for byte-wise dot products
 #define CC_VOLTA 700
+#define CC_TURING 750
 #define CC_AMPERE 800
 #define CC_OFFSET_AMD 1000000
 #define CC_RDNA1 (CC_OFFSET_AMD + 1010)
@@ -326,9 +327,17 @@ static __device__ __forceinline__ half2 __shfl_xor(half2 var, int laneMask, int
 #endif // defined(__HIP_PLATFORM_AMD__) && HIP_VERSION < 50600000
 #endif // defined(GGML_USE_HIPBLAS)

-#define FP16_AVAILABLE (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) || __CUDA_ARCH__ >= CC_PASCAL
+#if (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) || __CUDA_ARCH__ >= CC_PASCAL
+#define FP16_AVAILABLE
+#endif // (defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) || __CUDA_ARCH__ >= CC_PASCAL

-#define FP16_MMA_AVAILABLE !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA
+#if !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA
+#define FP16_MMA_AVAILABLE
+#endif // !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_VOLTA
+
+#if !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_TURING
+#define INT8_MMA_AVAILABLE
+#endif // !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && __CUDA_ARCH__ >= CC_TURING

 static bool fast_fp16_available(const int cc) {
     return cc >= CC_PASCAL && cc != 610;
@@ -338,6 +347,10 @@ static bool fp16_mma_available(const int cc) {
     return cc < CC_OFFSET_AMD && cc >= CC_VOLTA;
 }

+static bool int8_mma_available(const int cc) {
+    return cc < CC_OFFSET_AMD && cc >= CC_TURING;
+}
+
 [[noreturn]]
 static __device__ void no_device_code(
     const char * file_name, const int line, const char * function_name, const int arch, const char * arch_list) {
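One detail of the hunk above is worth spelling out: the old `FP16_AVAILABLE` and `FP16_MMA_AVAILABLE` macros expanded to expressions containing `defined(...)`, and evaluating a `defined` that only appears after macro expansion is formally undefined behavior in the C and C++ preprocessor. The header now defines each flag conditionally, and the later hunks test the flags with `#ifdef` instead of `#if`. A minimal sketch of that guard pattern, with an illustrative flag name:

```cpp
// Sketch of the guard pattern: define the flag only when the feature test passes,
// then check it with #ifdef at every use site. MY_FAST_PATH_AVAILABLE is illustrative.
#include <cstdio>

#if defined(__AVX2__) || defined(__ARM_NEON)
#define MY_FAST_PATH_AVAILABLE
#endif // defined(__AVX2__) || defined(__ARM_NEON)

int main() {
#ifdef MY_FAST_PATH_AVAILABLE
    std::printf("fast path compiled in\n");
#else
    std::printf("generic path compiled in\n");
#endif // MY_FAST_PATH_AVAILABLE
    return 0;
}
```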
@@ -379,7 +392,7 @@ static __device__ __forceinline__ float2 warp_reduce_sum(float2 a) {
 }

 static __device__ __forceinline__ half2 warp_reduce_sum(half2 a) {
-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE

 #if defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
 #pragma unroll
@@ -412,7 +425,7 @@ static __device__ __forceinline__ float warp_reduce_max(float x) {
 }

 static __device__ __forceinline__ half ggml_cuda_hmax(const half a, const half b) {
-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE

 #if !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)) && CUDART_VERSION < CUDART_HMAX
     return __float2half(fmaxf(__half2float(a), __half2float(b)));
@@ -74,7 +74,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q4_0(

     const int sumi = __dp4a(v, u, 0);

-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     if (std::is_same<T, half>::value) {
         const half2 * Q_ds = (const half2 *) Q_ds_v;

@@ -122,7 +122,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q4_1(

     const int sumi = __dp4a(v, u, 0);

-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     if (std::is_same<T, half>::value) {
         const half2 * Q_ds = (const half2 *) Q_ds_v;

@@ -181,7 +181,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q5_0(

     const int sumi = __dp4a(v, u, 0);

-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     if (std::is_same<T, half>::value) {
         const half2 * Q_ds = (const half2 *) Q_ds_v;

@@ -236,7 +236,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_q5_1(

     const int sumi = __dp4a(v, u, 0);

-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     if (std::is_same<T, half>::value) {
         const half2 * Q_ds = (const half2 *) Q_ds_v;

@@ -314,7 +314,7 @@ static __device__ __forceinline__ T vec_dot_fattn_vec_KQ_f16(
     GGML_UNUSED(Q_q8);
     GGML_UNUSED(Q_ds_v);

-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     if (std::is_same<T, half>::value) {
         const half2 * Q_h2 = (const half2 *) Q_v;

@@ -407,7 +407,7 @@ static __device__ __forceinline__ T dequantize_1_q4_0(const void * __restrict__
     const int q0 = x[ib].qs[iqs];
     const int q = ((q0 >> (4*shift)) & 0x0F) - 8;

-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     if (std::is_same<T, half>::value) {
         return ((half) d)*((half) q);
     }
@@ -428,7 +428,7 @@ static __device__ __forceinline__ T dequantize_1_q4_1(const void * __restrict__
     const int q0 = x[ib].qs[iqs];
     const int q = ((q0 >> (4*shift)) & 0x0F);

-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     if (std::is_same<T, half>::value) {
         return __low2half(dm)*((half) q) + __high2half(dm);
     }
@@ -453,7 +453,7 @@ static __device__ __forceinline__ T dequantize_1_q5_0(const void * __restrict__
     const int qh = ((qh0 >> idq) << 4) & 0x10;
     const int q = (ql | qh) - 16;

-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     if (std::is_same<T, half>::value) {
         return ((half) d)*((half) q);
     }
@@ -478,7 +478,7 @@ static __device__ __forceinline__ T dequantize_1_q5_1(const void * __restrict__
     const int qh = ((qh0 >> idq) << 4) & 0x10;
     const int q = (ql | qh);

-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     if (std::is_same<T, half>::value) {
         return __low2half(dm)*((half) q) + __high2half(dm);
     }
@@ -497,7 +497,7 @@ static __device__ __forceinline__ T dequantize_1_q8_0(const void * __restrict__
     const T d = x[ib].d;
     const int q = x[ib].qs[iqs];

-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     if (std::is_same<T, half>::value) {
         return ((half) d)*((half) q);
     }
@@ -43,7 +43,7 @@ static __global__ void flash_attn_tile_ext_f16(
         const int ne1,
         const int ne2,
         const int ne3) {
-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     //In this kernel Q, K, V are matrices while i, j, k are matrix indices.

     const int ic0 = (blockIdx.x / parallel_blocks) * ncols; // Index of the Q/QKV column to work on.
@@ -40,7 +40,7 @@ static __global__ void flash_attn_vec_ext_f16(
         const int ne1,
         const int ne2,
         const int ne3) {
-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     //In this kernel Q, K, V are matrices while i, j, k are matrix indices.

     constexpr vec_dot_KQ_f16_t vec_dot_KQ = get_vec_dot_KQ_f16<D>(type_K);
@@ -1,9 +1,9 @@
 #include "common.cuh"
 #include "fattn-common.cuh"

-#if FP16_MMA_AVAILABLE
+#ifdef FP16_MMA_AVAILABLE
 #include <mma.h>
-#endif
+#endif // FP16_MMA_AVAILABLE

 // D == head size, VKQ_stride == num VKQ rows calculated in parallel:
 template<int D, int ncols, int nwarps, int VKQ_stride, int parallel_blocks, typename KQ_acc_t>
@@ -45,7 +45,7 @@ static __global__ void flash_attn_ext_f16(
         const int ne1,
         const int ne2,
         const int ne3) {
-#if FP16_MMA_AVAILABLE
+#ifdef FP16_MMA_AVAILABLE
     //In this kernel Q, K, V are matrices while i, j, k are matrix indices.

     const int ic0 = ncols*(blockIdx.x / parallel_blocks); // Index of the first Q/QKV column to work on.
95
ggml-cuda/mma.cuh
Normal file
@@ -0,0 +1,95 @@
+#include "common.cuh"
+
+struct mma_int_A_I16K8 {
+    static constexpr int I = 16;
+    static constexpr int K = 8;
+    static constexpr int ne = 4;
+
+    int x[ne] = {0};
+
+    static __device__ __forceinline__ int get_i(const int l) {
+        const int ret = (l%2) * (I/2) + threadIdx.x / (K/2);
+        GGML_CUDA_ASSUME(ret >= 0);
+        GGML_CUDA_ASSUME(ret < I);
+        return ret;
+    }
+
+    static __device__ __forceinline__ int get_k(const int l) {
+        const int ret = (l/2) * (K/2) + threadIdx.x % (K/2);
+        GGML_CUDA_ASSUME(ret >= 0);
+        GGML_CUDA_ASSUME(ret < K);
+        return ret;
+    }
+};
+
+struct mma_int_B_J8K8 {
+    static constexpr int J = 8;
+    static constexpr int K = 8;
+    static constexpr int ne = 2;
+
+    int x[ne] = {0};
+
+    static __device__ __forceinline__ int get_j(const int /* l */) {
+        const int ret = threadIdx.x / (K/2);
+        GGML_CUDA_ASSUME(ret >= 0);
+        GGML_CUDA_ASSUME(ret < J);
+        return ret;
+    }
+
+    static __device__ __forceinline__ int get_k(const int l) {
+        const int ret = l * (K/2) + threadIdx.x % (K/2);
+        GGML_CUDA_ASSUME(ret >= 0);
+        GGML_CUDA_ASSUME(ret < K);
+        return ret;
+    }
+};
+
+struct mma_int_C_I16J8 {
+    static constexpr int I = 16;
+    static constexpr int J = 8;
+    static constexpr int ne = 4;
+
+    int x[ne] = {0};
+
+    static __device__ __forceinline__ int get_i(const int l) {
+        const int ret = (l/2) * (I/2) + threadIdx.x / (J/2);
+        GGML_CUDA_ASSUME(ret >= 0);
+        GGML_CUDA_ASSUME(ret < I);
+        return ret;
+    }
+
+    static __device__ __forceinline__ int get_j(const int l) {
+        const int ret = 2 * (threadIdx.x % (J/2)) + l%2;
+        GGML_CUDA_ASSUME(ret >= 0);
+        GGML_CUDA_ASSUME(ret < J);
+        return ret;
+    }
+
+    __device__ __forceinline__ void mma_K8(const mma_int_A_I16K8 & mma_A, const mma_int_B_J8K8 & mma_B) {
+#ifdef INT8_MMA_AVAILABLE
+#if __CUDA_ARCH__ >= CC_AMPERE
+        asm("mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 {%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%0, %1, %2, %3};"
+            : "+r"(x[0]), "+r"(x[1]), "+r"(x[2]), "+r"(x[3])
+            : "r"(mma_A.x[0]), "r"(mma_A.x[1]), "r"(mma_A.x[2]), "r"(mma_A.x[3]), "r"(mma_B.x[0]), "r"(mma_B.x[1]));
+#else
+        // On Turing m16n8k32 mma is not available, use 4x m8n8k16 mma instead:
+        asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 {%0, %1}, {%2}, {%3}, {%0, %1};"
+            : "+r"(x[0]), "+r"(x[1])
+            : "r"(mma_A.x[0]), "r"(mma_B.x[0]));
+        asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 {%0, %1}, {%2}, {%3}, {%0, %1};"
+            : "+r"(x[2]), "+r"(x[3])
+            : "r"(mma_A.x[1]), "r"(mma_B.x[0]));
+        asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 {%0, %1}, {%2}, {%3}, {%0, %1};"
+            : "+r"(x[0]), "+r"(x[1])
+            : "r"(mma_A.x[2]), "r"(mma_B.x[1]));
+        asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 {%0, %1}, {%2}, {%3}, {%0, %1};"
+            : "+r"(x[2]), "+r"(x[3])
+            : "r"(mma_A.x[3]), "r"(mma_B.x[1]));
+#endif // __CUDA_ARCH__ >= CC_AMPERE
+#else
+        GGML_UNUSED(mma_A);
+        GGML_UNUSED(mma_B);
+        NO_DEVICE_CODE;
+#endif // INT8_MMA_AVAILABLE
+    }
+};
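The get_i/get_j/get_k helpers above encode how the 32 threads of a warp share the elements of the 16x8 A, 8x8 B and 16x8 C tiles consumed by the PTX mma instructions. A host-side sketch (not part of the patch) that mirrors the mma_int_C_I16J8 formulas, with a plain lane variable standing in for threadIdx.x, makes the per-thread layout easy to print and check:

// layout_check.cpp - host-side sketch of the mma_int_C_I16J8 index mapping above;
// "lane" stands in for threadIdx.x, so this runs on the CPU.
#include <cstdio>

int main() {
    const int I = 16, J = 8, ne = 4;
    for (int lane = 0; lane < 32; ++lane) {
        for (int l = 0; l < ne; ++l) {
            const int i = (l/2) * (I/2) + lane / (J/2); // same formula as get_i(l)
            const int j = 2 * (lane % (J/2)) + l%2;     // same formula as get_j(l)
            printf("lane %2d, x[%d] -> C[%2d][%d]\n", lane, l, i, j);
        }
    }
    return 0;
}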
@@ -11,6 +11,7 @@ void ggml_cuda_op_mul_mat_q(
     const int64_t nb01 = src0->nb[1];

     const int64_t ne10 = src1->ne[0];
+    const int64_t ne11 = src1->ne[1];
     GGML_ASSERT(ne10 % QK8_1 == 0);

     const int64_t ne0 = dst->ne[0];
@@ -25,7 +26,7 @@ void ggml_cuda_op_mul_mat_q(
     // nrows_dst == nrows of the matrix that the kernel writes into
     const int64_t nrows_dst = id == ctx.device ? ne0 : row_diff;

-    const mmq_args args = {src0_dd_i, src1_ddq_i, dst_dd_i, ne00, row_diff, stride00, src1_padded_row_size, src1_ncols, nrows_dst};
+    const mmq_args args = {src0_dd_i, src1_ddq_i, dst_dd_i, ne00, row_diff, stride00, src1_padded_row_size, src1_ncols, ne11, nrows_dst};

     switch (src0->type) {
         case GGML_TYPE_Q4_0:
File diff suppressed because it is too large
@@ -1,22 +1,23 @@
 #include "quantize.cuh"
+#include <cstdint>

-static __global__ void quantize_q8_1(const float * __restrict__ x, void * __restrict__ vy, const int64_t kx, const int64_t kx_padded) {
-    const int64_t ix = (int64_t)blockDim.x*blockIdx.x + threadIdx.x;
+static __global__ void quantize_q8_1(const float * __restrict__ x, void * __restrict__ vy, const int64_t kx, const int64_t kx0_padded) {
+    const int64_t ix0 = (int64_t)blockDim.x*blockIdx.x + threadIdx.x;

-    if (ix >= kx_padded) {
+    if (ix0 >= kx0_padded) {
         return;
     }

-    const int64_t iy = (int64_t)blockDim.y*blockIdx.y + threadIdx.y;
+    const int64_t ix1 = blockIdx.y;

-    const int64_t i_padded = (int64_t)iy*kx_padded + ix;
+    const int64_t i_padded = ix1*kx0_padded + ix0;

     block_q8_1 * y = (block_q8_1 *) vy;

     const int64_t ib = i_padded / QK8_1; // block index
     const int64_t iqs = i_padded % QK8_1; // quant index

-    const float xi = ix < kx ? x[iy*kx + ix] : 0.0f;
+    const float xi = ix0 < kx ? x[ix1*kx + ix0] : 0.0f;
     float amax = fabsf(xi);
     float sum = xi;

@@ -36,10 +37,76 @@ static __global__ void quantize_q8_1(const float * __restrict__ x, void * __rest
     reinterpret_cast<half&>(y[ib].ds.y) = sum;
 }

-void quantize_row_q8_1_cuda(const float * x, void * vy, const int64_t kx, const int64_t ky, const int64_t kx_padded, cudaStream_t stream) {
-    const int64_t block_num_x = (kx_padded + CUDA_QUANTIZE_BLOCK_SIZE - 1) / CUDA_QUANTIZE_BLOCK_SIZE;
-    const dim3 num_blocks(block_num_x, ky, 1);
-    const dim3 block_size(CUDA_QUANTIZE_BLOCK_SIZE, 1, 1);
-    quantize_q8_1<<<num_blocks, block_size, 0, stream>>>(x, vy, kx, kx_padded);
+template <bool need_sum>
+static __global__ void quantize_mmq_q8_1(
+    const float * __restrict__ x, void * __restrict__ vy, const int64_t kx0, const int64_t kx1, const int64_t kx0_padded) {
+
+    const int64_t ix0 = (int64_t)blockDim.x*blockIdx.x + threadIdx.x;
+
+    if (ix0 >= kx0_padded) {
+        return;
+    }
+
+    const int64_t ix1 = kx1*blockIdx.z + blockIdx.y;
+
+    block_q8_1_mmq * y = (block_q8_1_mmq *) vy;
+
+    const int64_t ib0 = blockIdx.z*(gridDim.y*gridDim.x*blockDim.x/(4*QK8_1)); // first block of channel
+    const int64_t ib  = ib0 + (ix0 / (4*QK8_1))*kx1 + blockIdx.y;              // block index in channel
+    const int64_t iqs = ix0 % (4*QK8_1);                                       // quant index in block
+
+    const float xi = ix0 < kx0 ? x[ix1*kx0 + ix0] : 0.0f;
+    float amax = fabsf(xi);
+
+    amax = warp_reduce_max(amax);
+
+    float sum;
+    if (need_sum) {
+        sum = warp_reduce_sum(xi);
+    }
+
+    const float d = amax / 127;
+    const int8_t q = amax == 0.0f ? 0 : roundf(xi / d);
+
+    y[ib].qs[iqs] = q;
+
+    if (iqs % QK8_1 != 0) {
+        return;
+    }
+
+    if (need_sum) {
+        y[ib].ds[iqs/QK8_1] = make_half2(d, sum);
+    } else {
+        ((float *) y[ib].ds)[iqs/QK8_1] = d;
+    }
 }

+void quantize_row_q8_1_cuda(
+    const float * x, void * vy, const int64_t kx0, const int64_t kx1, const int64_t channels,
+    const int64_t kx0_padded, const ggml_type type_x, cudaStream_t stream) {
+
+    GGML_ASSERT(kx0_padded % QK8_1 == 0);
+
+    const int64_t block_num_x = (kx0_padded + CUDA_QUANTIZE_BLOCK_SIZE - 1) / CUDA_QUANTIZE_BLOCK_SIZE;
+    const dim3 num_blocks(block_num_x, kx1*channels, 1);
+    const dim3 block_size(CUDA_QUANTIZE_BLOCK_SIZE, 1, 1);
+    quantize_q8_1<<<num_blocks, block_size, 0, stream>>>(x, vy, kx0, kx0_padded);
+
+    GGML_UNUSED(type_x);
+}
+
+void quantize_mmq_q8_1_cuda(
+    const float * x, void * vy, const int64_t kx0, const int64_t kx1, const int64_t channels,
+    const int64_t kx0_padded, const ggml_type type_x, cudaStream_t stream) {
+
+    GGML_ASSERT(kx0_padded % (4*QK8_1) == 0);
+
+    const int64_t block_num_x = (kx0_padded + CUDA_QUANTIZE_BLOCK_SIZE - 1) / CUDA_QUANTIZE_BLOCK_SIZE;
+    const dim3 num_blocks(block_num_x, kx1, channels);
+    const dim3 block_size(CUDA_QUANTIZE_BLOCK_SIZE, 1, 1);
+    if (mmq_need_sum(type_x)) {
+        quantize_mmq_q8_1<true><<<num_blocks, block_size, 0, stream>>>(x, vy, kx0, kx1, kx0_padded);
+    } else {
+        quantize_mmq_q8_1<false><<<num_blocks, block_size, 0, stream>>>(x, vy, kx0, kx1, kx0_padded);
+    }
+}
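Compared with quantize_q8_1, the new quantize_mmq_q8_1 kernel is launched on a 3D grid (columns x rows-per-channel x channels, as set up in quantize_mmq_q8_1_cuda above) and packs 4*QK8_1 values per block_q8_1_mmq. A small host-side sketch (not part of the patch, assuming QK8_1 == 32 and a 256-thread block as in the CUDA sources) reproduces the index arithmetic for one element:

// mmq_index_sketch.cpp - host-side walk-through of the ib0/ib/iqs arithmetic above
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t QK8_1 = 32, BLOCK = 256;           // assumed to match the CUDA sources
    const int64_t kx0_padded = 512;                  // example padded row length
    const int64_t kx1        = 3;                    // example rows per channel
    const int64_t grid_x     = (kx0_padded + BLOCK - 1) / BLOCK;

    // one sample element: channel (blockIdx.z) 1, row (blockIdx.y) 2, column ix0 = 300
    const int64_t bz = 1, by = 2, ix0 = 300;

    const int64_t ib0 = bz*(kx1*grid_x*BLOCK/(4*QK8_1));   // first block of the channel
    const int64_t ib  = ib0 + (ix0 / (4*QK8_1))*kx1 + by;  // block index within the channel
    const int64_t iqs = ix0 % (4*QK8_1);                   // quant index inside the block

    printf("ib0 = %lld, ib = %lld, iqs = %lld\n", (long long) ib0, (long long) ib, (long long) iqs);
    return 0;
}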
@@ -1,5 +1,20 @@
+#pragma once
+
 #include "common.cuh"
+#include "mmq.cuh"
+
+#include <cstdint>

 #define CUDA_QUANTIZE_BLOCK_SIZE 256

-void quantize_row_q8_1_cuda(const float * x, void * vy, const int64_t kx, const int64_t ky, const int64_t kx_padded, cudaStream_t stream);
+typedef void (*quantize_cuda_t)(
+    const float * x, void * vy, const int64_t kx0, const int64_t kx1, const int64_t channels, const int64_t kx0_padded,
+    const ggml_type type_x, cudaStream_t stream);
+
+void quantize_row_q8_1_cuda(
+    const float * x, void * vy, const int64_t kx0, const int64_t kx1, const int64_t channels, const int64_t kx0_padded,
+    const ggml_type type_x, cudaStream_t stream);
+
+void quantize_mmq_q8_1_cuda(
+    const float * x, void * vy, const int64_t kx0, const int64_t kx1, const int64_t channels, const int64_t kx0_padded,
+    const ggml_type type_x, cudaStream_t stream);
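The header now exposes both launchers behind the shared quantize_cuda_t signature, so the caller can pick the q8_1 layout once and invoke it through a single function pointer. The actual dispatch lives in ggml-cuda.cu and is not part of this diff; the stand-alone sketch below (with placeholder types and stub functions instead of the real CUDA launchers) only illustrates the shape of that call:

// dispatch_sketch.cpp - illustrative only; ggml_type and cudaStream_t are replaced by
// placeholders so the sketch compiles without the CUDA backend.
#include <cstdint>
#include <cstdio>

typedef int  ggml_type_placeholder;
typedef void * stream_placeholder;

typedef void (*quantize_cuda_t)(
    const float * x, void * vy, int64_t kx0, int64_t kx1, int64_t channels, int64_t kx0_padded,
    ggml_type_placeholder type_x, stream_placeholder stream);

static void quantize_row_stub(const float *, void *, int64_t, int64_t, int64_t, int64_t,
                              ggml_type_placeholder, stream_placeholder) {
    puts("plain row-major q8_1 layout");
}

static void quantize_mmq_stub(const float *, void *, int64_t, int64_t, int64_t, int64_t,
                              ggml_type_placeholder, stream_placeholder) {
    puts("MMQ-specific q8_1 layout");
}

int main() {
    const bool use_mul_mat_q = true; // would come from the backend's dispatch logic
    quantize_cuda_t quantize_src1 = use_mul_mat_q ? quantize_mmq_stub : quantize_row_stub;
    quantize_src1(nullptr, nullptr, 0, 0, 0, 0, 0, nullptr);
    return 0;
}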
@@ -13089,10 +13089,12 @@ void *ggml_sycl_host_malloc(size_t size) try {
         return nullptr;
     }

+    ggml_sycl_set_device(g_main_device);
+    dpct::queue_ptr main_stream = g_syclStreams[g_main_device][0];
+
     void * ptr = nullptr;
-    //allow to use dpct::get_in_order_queue() for host malloc
     dpct::err0 err = CHECK_TRY_ERROR(
-        ptr = (void *)sycl::malloc_host(size, dpct::get_in_order_queue()));
+        ptr = (void *)sycl::malloc_host(size, *main_stream));

     if (err != 0) {
         // clear the error
@@ -13113,8 +13115,9 @@ catch (sycl::exception const &exc) {
 }

 void ggml_sycl_host_free(void *ptr) try {
-    //allow to use dpct::get_in_order_queue() for host malloc
-    SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(ptr, dpct::get_in_order_queue())));
+    ggml_sycl_set_device(g_main_device);
+    dpct::queue_ptr main_stream = g_syclStreams[g_main_device][0];
+    SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(ptr, *main_stream)));
 }
 catch (sycl::exception const &exc) {
     std::cerr << exc.what() << "Exception caught at file:" << __FILE__
@@ -5,6 +5,7 @@ import os
 import shutil
 import struct
 import tempfile
+from dataclasses import dataclass
 from enum import Enum, auto
 from io import BufferedWriter
 from typing import IO, Any, Sequence, Mapping
@@ -30,17 +31,36 @@ from .quants import quant_shape_from_byte_shape
 logger = logging.getLogger(__name__)


+@dataclass
+class TensorInfo:
+    shape: Sequence[int]
+    dtype: GGMLQuantizationType
+    nbytes: int
+    tensor: np.ndarray[Any, Any] | None = None
+
+
+@dataclass
+class GGUFValue:
+    value: Any
+    type: GGUFValueType
+
+
 class WriterState(Enum):
+    NO_FILE = auto()
     EMPTY = auto()
     HEADER = auto()
     KV_DATA = auto()
     TI_DATA = auto()
+    WEIGHTS = auto()


 class GGUFWriter:
-    fout: BufferedWriter
+    fout: BufferedWriter | None
+    path: os.PathLike[str] | str | None
     temp_file: tempfile.SpooledTemporaryFile[bytes] | None
-    tensors: list[np.ndarray[Any, Any]]
+    tensors: dict[str, TensorInfo]
+    kv_data: dict[str, GGUFValue]
+    state: WriterState
     _simple_value_packing = {
         GGUFValueType.UINT8: "B",
         GGUFValueType.INT8: "b",
@@ -56,141 +76,140 @@ class GGUFWriter:
     }

     def __init__(
-        self, path: os.PathLike[str] | str, arch: str, use_temp_file: bool = True,
+        self, path: os.PathLike[str] | str | None, arch: str, use_temp_file: bool = False,
         endianess: GGUFEndian = GGUFEndian.LITTLE,
     ):
-        self.fout = open(path, "wb")
+        self.fout = None
+        self.path = path
         self.arch = arch
         self.endianess = endianess
-        self.offset_tensor = 0
         self.data_alignment = GGUF_DEFAULT_ALIGNMENT
-        self.kv_data = bytearray()
-        self.kv_data_count = 0
-        self.ti_data = bytearray()
-        self.ti_data_count = 0
-        self.ti_names = set()
         self.use_temp_file = use_temp_file
         self.temp_file = None
-        self.tensors = []
+        self.tensors = dict()
+        self.kv_data = dict()
         logger.info("gguf: This GGUF file is for {0} Endian only".format(
             "Big" if self.endianess == GGUFEndian.BIG else "Little",
         ))
-        self.state = WriterState.EMPTY
+        self.state = WriterState.NO_FILE

         self.add_architecture()

-    def write_header_to_file(self) -> None:
+    def open_output_file(self, path: os.PathLike[str] | str | None = None) -> None:
+        if self.state is WriterState.EMPTY and self.fout is not None and (path is None or path == self.path):
+            # allow calling this multiple times as long as the path is the same
+            return
+        if self.state is not WriterState.NO_FILE:
+            raise ValueError(f'Expected output file to be not yet opened, got {self.state}')
+
+        if path is not None:
+            self.path = path
+
+        if self.path is not None:
+            if self.fout is not None:
+                self.fout.close()
+            self.fout = open(self.path, "wb")
+            self.state = WriterState.EMPTY
+
+    def write_header_to_file(self, path: os.PathLike[str] | str | None = None) -> None:
+        self.open_output_file(path)
+
         if self.state is not WriterState.EMPTY:
             raise ValueError(f'Expected output file to be empty, got {self.state}')

         self._write_packed("<I", GGUF_MAGIC, skip_pack_prefix = True)
         self._write_packed("I", GGUF_VERSION)
-        self._write_packed("Q", self.ti_data_count)
-        self._write_packed("Q", self.kv_data_count)
+        self._write_packed("Q", len(self.tensors))
+        self._write_packed("Q", len(self.kv_data))
         self.flush()
         self.state = WriterState.HEADER

     def write_kv_data_to_file(self) -> None:
         if self.state is not WriterState.HEADER:
             raise ValueError(f'Expected output file to contain the header, got {self.state}')
+        assert self.fout is not None
+
+        kv_data = bytearray()

-        self.fout.write(self.kv_data)
+        for key, val in self.kv_data.items():
+            kv_data += self._pack_val(key, GGUFValueType.STRING, add_vtype=False)
+            kv_data += self._pack_val(val.value, val.type, add_vtype=True)
+
+        self.fout.write(kv_data)
         self.flush()
         self.state = WriterState.KV_DATA

     def write_ti_data_to_file(self) -> None:
         if self.state is not WriterState.KV_DATA:
             raise ValueError(f'Expected output file to contain KV data, got {self.state}')
+        assert self.fout is not None
+
+        ti_data = bytearray()
+        offset_tensor = 0
+
+        for name, ti in self.tensors.items():
+            ti_data += self._pack_val(name, GGUFValueType.STRING, add_vtype=False)
+            n_dims = len(ti.shape)
+            ti_data += self._pack("I", n_dims)
+            for i in range(n_dims):
+                ti_data += self._pack("Q", ti.shape[n_dims - 1 - i])
+            ti_data += self._pack("I", ti.dtype)
+            ti_data += self._pack("Q", offset_tensor)
+            offset_tensor += GGUFWriter.ggml_pad(ti.nbytes, self.data_alignment)

-        self.fout.write(self.ti_data)
+        self.fout.write(ti_data)
         self.flush()
         self.state = WriterState.TI_DATA

-    def add_key(self, key: str) -> None:
-        self.add_val(key, GGUFValueType.STRING, add_vtype=False)
+    def add_key_value(self, key: str, val: Any, vtype: GGUFValueType) -> None:
+        if key in self.kv_data:
+            raise ValueError(f'Duplicated key name {key!r}')
+
+        self.kv_data[key] = GGUFValue(value=val, type=vtype)

     def add_uint8(self, key: str, val: int) -> None:
-        self.add_key(key)
-        self.add_val(val, GGUFValueType.UINT8)
+        self.add_key_value(key,val, GGUFValueType.UINT8)

     def add_int8(self, key: str, val: int) -> None:
-        self.add_key(key)
-        self.add_val(val, GGUFValueType.INT8)
+        self.add_key_value(key, val, GGUFValueType.INT8)

     def add_uint16(self, key: str, val: int) -> None:
-        self.add_key(key)
-        self.add_val(val, GGUFValueType.UINT16)
+        self.add_key_value(key, val, GGUFValueType.UINT16)

     def add_int16(self, key: str, val: int) -> None:
-        self.add_key(key)
-        self.add_val(val, GGUFValueType.INT16)
+        self.add_key_value(key, val, GGUFValueType.INT16)

     def add_uint32(self, key: str, val: int) -> None:
-        self.add_key(key)
-        self.add_val(val, GGUFValueType.UINT32)
+        self.add_key_value(key, val, GGUFValueType.UINT32)

     def add_int32(self, key: str, val: int) -> None:
-        self.add_key(key)
-        self.add_val(val, GGUFValueType.INT32)
+        self.add_key_value(key, val, GGUFValueType.INT32)

     def add_float32(self, key: str, val: float) -> None:
-        self.add_key(key)
-        self.add_val(val, GGUFValueType.FLOAT32)
+        self.add_key_value(key, val, GGUFValueType.FLOAT32)

     def add_uint64(self, key: str, val: int) -> None:
-        self.add_key(key)
-        self.add_val(val, GGUFValueType.UINT64)
+        self.add_key_value(key, val, GGUFValueType.UINT64)

     def add_int64(self, key: str, val: int) -> None:
-        self.add_key(key)
-        self.add_val(val, GGUFValueType.INT64)
+        self.add_key_value(key, val, GGUFValueType.INT64)

     def add_float64(self, key: str, val: float) -> None:
-        self.add_key(key)
-        self.add_val(val, GGUFValueType.FLOAT64)
+        self.add_key_value(key, val, GGUFValueType.FLOAT64)

     def add_bool(self, key: str, val: bool) -> None:
-        self.add_key(key)
-        self.add_val(val, GGUFValueType.BOOL)
+        self.add_key_value(key, val, GGUFValueType.BOOL)

     def add_string(self, key: str, val: str) -> None:
         if not val:
             return
-        self.add_key(key)
-        self.add_val(val, GGUFValueType.STRING)
+        self.add_key_value(key, val, GGUFValueType.STRING)

     def add_array(self, key: str, val: Sequence[Any]) -> None:
         if not isinstance(val, Sequence):
             raise ValueError("Value must be a sequence for array type")

-        self.add_key(key)
-        self.add_val(val, GGUFValueType.ARRAY)
-
-    def add_val(self, val: Any, vtype: GGUFValueType | None = None, add_vtype: bool = True) -> None:
-        if vtype is None:
-            vtype = GGUFValueType.get_type(val)
-
-        if add_vtype:
-            self.kv_data += self._pack("I", vtype)
-            self.kv_data_count += 1
-
-        pack_fmt = self._simple_value_packing.get(vtype)
-        if pack_fmt is not None:
-            self.kv_data += self._pack(pack_fmt, val, skip_pack_prefix = vtype == GGUFValueType.BOOL)
-        elif vtype == GGUFValueType.STRING:
-            encoded_val = val.encode("utf-8") if isinstance(val, str) else val
-            self.kv_data += self._pack("Q", len(encoded_val))
-            self.kv_data += encoded_val
-        elif vtype == GGUFValueType.ARRAY and isinstance(val, Sequence) and val:
-            ltype = GGUFValueType.get_type(val[0])
-            if not all(GGUFValueType.get_type(i) is ltype for i in val[1:]):
-                raise ValueError("All items in a GGUF array should be of the same type")
-            self.kv_data += self._pack("I", ltype)
-            self.kv_data += self._pack("Q", len(val))
-            for item in val:
-                self.add_val(item, add_vtype=False)
-        else:
-            raise ValueError("Invalid GGUF metadata value type or value")
+        self.add_key_value(key, val, GGUFValueType.ARRAY)

     @staticmethod
     def ggml_pad(x: int, n: int) -> int:
@@ -200,16 +219,12 @@
         self, name: str, tensor_shape: Sequence[int], tensor_dtype: np.dtype,
         tensor_nbytes: int, raw_dtype: GGMLQuantizationType | None = None,
     ) -> None:
-        if self.state is not WriterState.EMPTY:
-            raise ValueError(f'Expected output file to be empty, got {self.state}')
+        if self.state is not WriterState.NO_FILE:
+            raise ValueError(f'Expected output file to be not yet opened, got {self.state}')

-        if name in self.ti_names:
-            raise ValueError(f'Duplicated tensor name {name}')
-        self.ti_names.add(name)
+        if name in self.tensors:
+            raise ValueError(f'Duplicated tensor name {name!r}')

-        encoded_name = name.encode("utf-8")
-        self.ti_data += self._pack("Q", len(encoded_name))
-        self.ti_data += encoded_name
         if raw_dtype is None:
             if tensor_dtype == np.float16:
                 dtype = GGMLQuantizationType.F16
@@ -231,14 +246,8 @@
             dtype = raw_dtype
             if tensor_dtype == np.uint8:
                 tensor_shape = quant_shape_from_byte_shape(tensor_shape, raw_dtype)
-        n_dims = len(tensor_shape)
-        self.ti_data += self._pack("I", n_dims)
-        for i in range(n_dims):
-            self.ti_data += self._pack("Q", tensor_shape[n_dims - 1 - i])
-        self.ti_data += self._pack("I", dtype)
-        self.ti_data += self._pack("Q", self.offset_tensor)
-        self.offset_tensor += GGUFWriter.ggml_pad(tensor_nbytes, self.data_alignment)
-        self.ti_data_count += 1
+
+        self.tensors[name] = TensorInfo(shape=tensor_shape, dtype=dtype, nbytes=tensor_nbytes)

     def add_tensor(
         self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None,
@@ -252,10 +261,10 @@
             self.temp_file = fp

         shape: Sequence[int] = raw_shape if raw_shape is not None else tensor.shape
-        self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype = raw_dtype)
+        self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype=raw_dtype)

         if self.temp_file is None:
-            self.tensors.append(tensor)
+            self.tensors[name].tensor = tensor
             return

         tensor.tofile(self.temp_file)
|
|||||||
fp.write(bytes([0] * pad))
|
fp.write(bytes([0] * pad))
|
||||||
|
|
||||||
def write_tensor_data(self, tensor: np.ndarray[Any, Any]) -> None:
|
def write_tensor_data(self, tensor: np.ndarray[Any, Any]) -> None:
|
||||||
if self.state is not WriterState.TI_DATA:
|
if self.state is not WriterState.TI_DATA and self.state is not WriterState.WEIGHTS:
|
||||||
raise ValueError(f'Expected output file to contain tensor info, got {self.state}')
|
raise ValueError(f'Expected output file to contain tensor info or weights, got {self.state}')
|
||||||
|
assert self.fout is not None
|
||||||
|
|
||||||
if self.endianess == GGUFEndian.BIG:
|
if self.endianess == GGUFEndian.BIG:
|
||||||
tensor.byteswap(inplace=True)
|
tensor.byteswap(inplace=True)
|
||||||
@@ -276,50 +286,51 @@
         tensor.tofile(self.fout)
         self.write_padding(self.fout, tensor.nbytes)

+        self.state = WriterState.WEIGHTS
+
     def write_tensors_to_file(self, *, progress: bool = False) -> None:
         self.write_ti_data_to_file()

+        assert self.fout is not None
+
         self.write_padding(self.fout, self.fout.tell())

         if self.temp_file is None:
-            self.tensors.reverse() # to pop from the "beginning" in constant time
+            bar = None

             if progress:
                 from tqdm import tqdm

-                total_bytes = sum(t.nbytes for t in self.tensors)
+                total_bytes = sum(t.nbytes for t in self.tensors.values())

                 bar = tqdm(desc="Writing", total=total_bytes, unit="byte", unit_scale=True)

-                while True:
-                    try:
-                        tensor = self.tensors.pop()
-                    except IndexError:
-                        break
-                    tensor.tofile(self.fout)
-                    bar.update(tensor.nbytes)
-                    self.write_padding(self.fout, tensor.nbytes)
-                return
-            while True:
-                try:
-                    tensor = self.tensors.pop()
-                except IndexError:
-                    break
-                tensor.tofile(self.fout)
-                self.write_padding(self.fout, tensor.nbytes)
-            return
-
-        self.temp_file.seek(0)
-
-        shutil.copyfileobj(self.temp_file, self.fout)
-        self.flush()
-        self.temp_file.close()
+            # relying on the fact that Python dicts preserve insertion order (since 3.7)
+            for ti in self.tensors.values():
+                assert ti.tensor is not None # can only iterate once over the tensors
+                assert ti.tensor.nbytes == ti.nbytes
+                ti.tensor.tofile(self.fout)
+                if bar is not None:
+                    bar.update(ti.nbytes)
+                self.write_padding(self.fout, ti.nbytes)
+                ti.tensor = None
+        else:
+            self.temp_file.seek(0)
+
+            shutil.copyfileobj(self.temp_file, self.fout)
+            self.flush()
+            self.temp_file.close()
+
+        self.state = WriterState.WEIGHTS

     def flush(self) -> None:
+        assert self.fout is not None
         self.fout.flush()

     def close(self) -> None:
-        self.fout.close()
+        if self.fout is not None:
+            self.fout.close()
+            self.fout = None

     def add_architecture(self) -> None:
         self.add_string(Keys.General.ARCHITECTURE, self.arch)
@@ -449,7 +460,7 @@ class GGUFWriter:
     def add_rope_scaling_factor(self, value: float) -> None:
         self.add_float32(Keys.Rope.SCALING_FACTOR.format(arch=self.arch), value)

-    def add_rope_scaling_attn_factors(self, value: Sequence[float]) -> None:
+    def add_rope_scaling_attn_factors(self, value: float) -> None:
         self.add_float32(Keys.Rope.SCALING_ATTN_FACTOR.format(arch=self.arch), value)

     def add_rope_scaling_orig_ctx_len(self, value: int) -> None:
@@ -571,5 +582,32 @@
         pack_prefix = '<' if self.endianess == GGUFEndian.LITTLE else '>'
         return struct.pack(f'{pack_prefix}{fmt}', value)

+    def _pack_val(self, val: Any, vtype: GGUFValueType, add_vtype: bool) -> bytes:
+        kv_data = bytearray()
+
+        if add_vtype:
+            kv_data += self._pack("I", vtype)
+
+        pack_fmt = self._simple_value_packing.get(vtype)
+        if pack_fmt is not None:
+            kv_data += self._pack(pack_fmt, val, skip_pack_prefix = vtype == GGUFValueType.BOOL)
+        elif vtype == GGUFValueType.STRING:
+            encoded_val = val.encode("utf-8") if isinstance(val, str) else val
+            kv_data += self._pack("Q", len(encoded_val))
+            kv_data += encoded_val
+        elif vtype == GGUFValueType.ARRAY and isinstance(val, Sequence) and val:
+            ltype = GGUFValueType.get_type(val[0])
+            if not all(GGUFValueType.get_type(i) is ltype for i in val[1:]):
+                raise ValueError("All items in a GGUF array should be of the same type")
+            kv_data += self._pack("I", ltype)
+            kv_data += self._pack("Q", len(val))
+            for item in val:
+                kv_data += self._pack_val(item, ltype, add_vtype=False)
+        else:
+            raise ValueError("Invalid GGUF metadata value type or value")
+
+        return kv_data
+
     def _write_packed(self, fmt: str, value: Any, skip_pack_prefix: bool = False) -> None:
+        assert self.fout is not None
         self.fout.write(self._pack(fmt, value, skip_pack_prefix))
@@ -101,8 +101,7 @@ def copy_with_new_metadata(reader: gguf.GGUFReader, writer: gguf.GGUFWriter, new
             logger.debug(f'Copying {field.name}')

             if val.value is not None:
-                writer.add_key(field.name)
-                writer.add_val(val.value, val.type)
+                writer.add_key_value(field.name, val.value, val.type)

     if gguf.Keys.Tokenizer.CHAT_TEMPLATE in new_metadata:
         logger.debug('Adding chat template(s)')
@@ -111,8 +110,7 @@ def copy_with_new_metadata(reader: gguf.GGUFReader, writer: gguf.GGUFWriter, new

     for key, val in new_metadata.items():
         logger.debug(f'Adding {key}: "{val.value}" {val.description}')
-        writer.add_key(key)
-        writer.add_val(val.value, val.type)
+        writer.add_key_value(key, val.value, val.type)

     total_bytes = 0

|
Loading…
Reference in New Issue
Block a user