llama.cpp/gguf-py/scripts/gguf_hash.py

#!/usr/bin/env python3
from __future__ import annotations

import uuid
import hashlib

import logging
import argparse
import os
import sys
from pathlib import Path

from tqdm import tqdm

# Necessary to load the local gguf package
if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent / 'gguf-py').exists():
    sys.path.insert(0, str(Path(__file__).parent.parent))

from gguf import GGUFReader  # noqa: E402


logger = logging.getLogger("gguf-hash")

# UUID_NAMESPACE_LLAMA_CPP = uuid.uuid5(uuid.NAMESPACE_URL, 'en.wikipedia.org/wiki/Llama.cpp')
UUID_NAMESPACE_LLAMA_CPP = uuid.UUID('ef001206-dadc-5f6d-a15f-3359e577d4e5')


# For more information about what field.parts and field.data represent,
# please see the comments in the modify_gguf.py example.
def gguf_hash(reader: GGUFReader, filename: str, disable_progress_bar: bool, no_layer: bool) -> None:
    sha1 = hashlib.sha1()
    sha256 = hashlib.sha256()
    uuidv5_sha1 = hashlib.sha1()
    uuidv5_sha1.update(UUID_NAMESPACE_LLAMA_CPP.bytes)

    # Total Weight Calculation For Progress Bar
    total_weights = 0
    for tensor in reader.tensors:

        # We don't need these
        if tensor.name.endswith((".attention.masked_bias", ".attention.bias", ".rotary_emb.inv_freq")):
            continue

        # Calculate Tensor Volume
        sum_weights_in_tensor = 1
        for dim in tensor.shape:
            sum_weights_in_tensor *= dim
        total_weights += sum_weights_in_tensor

    # Hash Progress Bar
    bar = tqdm(desc="Hashing", total=total_weights, unit="weights", unit_scale=True, disable=disable_progress_bar)

    # Hashing Process
    for tensor in reader.tensors:

        # We don't need these
        if tensor.name.endswith((".attention.masked_bias", ".attention.bias", ".rotary_emb.inv_freq")):
            continue

        # Progressbar
        sum_weights_in_tensor = 1
        for dim in tensor.shape:
            sum_weights_in_tensor *= dim
        bar.update(sum_weights_in_tensor)

        if not no_layer:

            sha1_layer = hashlib.sha1()
            sha1_layer.update(tensor.data.data)
            print("sha1      {0}  {1}:{2}".format(sha1_layer.hexdigest(), filename, tensor.name)) # noqa: NP100

            sha256_layer = hashlib.sha256()
            sha256_layer.update(tensor.data.data)
            print("sha256    {0}  {1}:{2}".format(sha256_layer.hexdigest(), filename, tensor.name)) # noqa: NP100

        sha1.update(tensor.data.data)
        sha256.update(tensor.data.data)
        uuidv5_sha1.update(tensor.data.data)

    # Flush Hash Progress Bar
    bar.close()

    # Display Hash Output
    print("sha1      {0}  {1}".format(sha1.hexdigest(), filename)) # noqa: NP100
    print("sha256    {0}  {1}".format(sha256.hexdigest(), filename)) # noqa: NP100
    print("uuid      {0}  {1}".format(uuid.UUID(bytes=uuidv5_sha1.digest()[:16], version=5), filename)) # noqa: NP100


def main() -> None:
    parser = argparse.ArgumentParser(description="Hash GGUF tensor data (sha1, sha256, uuid)")
    parser.add_argument("model",         type=str,            help="GGUF format model filename")
    parser.add_argument("--no-layer",    action="store_true", help="exclude per-layer hashes")
    parser.add_argument("--verbose",     action="store_true", help="increase output verbosity")
    parser.add_argument("--progressbar", action="store_true", help="enable progressbar")
    args = parser.parse_args(None if len(sys.argv) > 1 else ["--help"])
    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)
    reader = GGUFReader(args.model, 'r')
    gguf_hash(reader, args.model, not args.progressbar, args.no_layer)


if __name__ == '__main__':
    main()


# Revision history:
#
# 2024-07-07  gguf-hash: model wide and per tensor hashing using xxhash and sha1 (#8048)
#             Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
#     CLI to hash GGUF files and detect differences at the per-model and
#     per-tensor level. Programs like sha256sum check entire files, which is
#     not ideal for verifying tensor data when the metadata in the GGUF KV
#     store has been updated. The tensor payload is therefore hashed per
#     tensor in addition to a whole-model hash: check the whole-model hash
#     first, then use the per-tensor hashes to narrow down the tensor with
#     inconsistencies.
#     Hash types listed in that commit: --xxh64 (default), --sha1, --uuid, --sha256.
# 2024-07-07  py : type-check all Python scripts with Pyright (#8341)
# 2024-07-14  gguf_hash.py: Add sha256 (#8470); rename string UUIDv5 --> uuid
#             Co-authored-by: compilade <git@compilade.net>