llama.cpp/ggml
Sigbjørn Skjæret b72c20b85c
Fix conversion of unnormalized BF16->BF16 weights (#7843)
* add truncate_bf16

* truncate intermediate fp32 if converting bf16 to bf16

* fix masking in __compute_fp32_to_bf16

* np.int16 no longer used

* missing cast and additional numpy 2.x fix

* ggml-impl : do not flush bf16 subnormals to zero

* ggml : add reference fp32 to bf16 conversion

The fast version is no longer equivalent for all platforms
because of the handling of subnormal values.

* gguf-py : remove flush to zero for bf16 subnormals

* gguf-py : remove float32 truncation to bf16

Rounding achieves the same thing in the cases where this was used.

* missed prototype update in merge

* merge cleanup
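
The rounding behavior these notes describe can be sketched in C as follows. This is a minimal illustration, not the code from the commit: it assumes round-to-nearest-even on the upper 16 bits, quiets NaNs, and passes subnormals through instead of flushing them to zero. The helper names (fp32_to_bits, fp32_from_bits, fp32_to_bf16_rne, bf16_to_fp32) are hypothetical and are not taken from ggml.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper names for illustration; not the identifiers used in ggml. */

/* Reinterpret a float as its raw IEEE 754 bit pattern. */
static uint32_t fp32_to_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret a raw 32-bit pattern as a float. */
static float fp32_from_bits(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* fp32 -> bf16 with round-to-nearest-even. Subnormals are rounded like any
 * other value (assumption: no flush to zero, as the notes above describe). */
static uint16_t fp32_to_bf16_rne(float f) {
    uint32_t u = fp32_to_bits(f);
    if ((u & 0x7fffffffu) > 0x7f800000u) {
        /* NaN: keep the top bits and set the quiet bit so the payload
         * cannot collapse into an infinity after truncation. */
        return (uint16_t)((u >> 16) | 0x0040u);
    }
    /* Add 0x7fff plus the lowest kept bit: ties round to the even result. */
    uint32_t lsb = (u >> 16) & 1u;
    return (uint16_t)((u + 0x7fffu + lsb) >> 16);
}

/* bf16 -> fp32 is exact: the 16 bits become the high half of the float. */
static float bf16_to_fp32(uint16_t h) {
    return fp32_from_bits((uint32_t)h << 16);
}
```

Under these assumptions, a value that is already exactly representable in bf16 (low 16 bits zero) rounds to itself, which is consistent with the note that rounding covers the cases where truncation was previously used.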

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-08-02 15:11:39 -04:00
cmake llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
include Fix conversion of unnormalized BF16->BF16 weights (#7843) 2024-08-02 15:11:39 -04:00
src Fix conversion of unnormalized BF16->BF16 weights (#7843) 2024-08-02 15:11:39 -04:00
.gitignore vulkan : cmake integration (#8119) 2024-07-13 18:12:39 +02:00
CMakeLists.txt cann: update cmake (#8765) 2024-07-30 12:37:35 +02:00