This minor (though time-consuming) change:
1) Moves the models/ggml-vocab.bin file into the test folder.
2) Changes the order in which information is presented to the user.
3) Recommends using symlinks to link training data into the right place in the repo.
4) Adds some clarification around the importance of the model weights.
1 is handy because it enables 'automation' towards 3: e.g. `rm -r models/` can
be run safely and a symlink to the training data put in its place,
and the commands to do so are clearly listed and described in the README.md.
2 is ultimately the only important aspect of this change. The README
currently has to be read in full by the user, held in mind, and then returned to
in order to follow all of the steps in the documentation.
3 is (I think) handy because these files are pretty huge and not exclusive
to this repo. Symlinks shine here: many symlinks can be created
across multiple projects, all pointing to the same source location.
If researchers were copying/pasting these files into each project, it would get
out of hand fast.
4 seems valuable; the AI world looks really opaque to people just getting started.
I did my best to be accurate with my statements in the hope that it helps
humans become more aware of this technology and
what's happening to the internet and the world.
It appears this file is only used during tests as of now.
Removing it from the models folder gives users more flexibility
in how they load their model data into the project
(e.g. are they using Docker bind mounts, are they using
symlinks, are they downloading models directly into this folder?).
By moving this file, the instructions for getting started can be
safely simplified to:
$ rm models/.gitkeep
$ rm -r models
$ ln -s /mnt/c/ai/models/LLaMA $(pwd)/models
I think it's a good idea because the model files are quite large and can
be useful across multiple projects, so symlinks shine in this use case
without creating too much confusion for the onboardee.
The llama_set_state_data function restores the rng state to what it
was at the time llama_copy_state_data was called. But users may want
to restore the state and proceed with a different seed.
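A rough usage sketch of that distinction, assuming the llama.h state functions added in this series plus a llama_set_rng_seed call; setup and error handling are omitted:

```cpp
// Minimal sketch: snapshot the full state, do some work, roll back, reseed.
// Assumes llama_get_state_size / llama_copy_state_data / llama_set_state_data
// and llama_set_rng_seed from llama.h.
#include <cstdint>
#include <vector>
#include "llama.h"

void snapshot_then_reseed(llama_context * ctx, int new_seed) {
    // capture the whole state: rng, logits, embedding and kv_cache
    std::vector<uint8_t> state(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, state.data());

    // ... evaluate and sample some tokens here ...

    // roll back to the snapshot, then override the restored rng with a new seed
    llama_set_state_data(ctx, state.data());
    llama_set_rng_seed(ctx, new_seed);
}
```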
* Updated build information
First update to the build instructions to include BLAS.
* Update README.md
* Update information about BLAS
* Better BLAS explanation
Adding a clearer BLAS explanation and a link to download the CUDA toolkit.
* Better BLAS explanation
* BLAS for Mac
Specifying that BLAS is already supported on Macs using the Accelerate Framework.
* Clarify the effect of BLAS
* Windows Make instructions
Added the instructions to build with Make on Windows
* Fixing typo
* Fix trailing whitespace
instead of `int` (while the `int` option is still supported)
This allows the following usage:
`./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0`
instead of:
`./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`
* Use full range for q4_0 quantization
By keeping the sign of the highest-magnitude value, we can make sure that
value maps to -8, which is currently unused.
This is a bit of a freebie since it is fully backwards compatible with
the current format.
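A rough scalar sketch of the idea for a single block (block size, rounding and storage layout here are simplified assumptions, not the actual ggml kernel):

```cpp
// Full-range idea: find the signed value with the largest magnitude, derive the
// scale so that this value maps exactly to -8, and clamp the rest to [-8, 7].
#include <algorithm>
#include <cmath>
#include <cstdint>

void quantize_block_q4_0_sketch(const float * x, uint8_t * qs /*16 bytes*/, float * d_out) {
    float amax = 0.0f; // largest magnitude
    float vmax = 0.0f; // signed value with the largest magnitude
    for (int i = 0; i < 32; ++i) {
        if (std::fabs(x[i]) > amax) { amax = std::fabs(x[i]); vmax = x[i]; }
    }

    const float d  = vmax / -8.0f;               // vmax/d == -8 regardless of sign
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    for (int i = 0; i < 16; ++i) {
        const int v0 = std::clamp((int) std::nearbyint(x[2*i + 0]*id), -8, 7);
        const int v1 = std::clamp((int) std::nearbyint(x[2*i + 1]*id), -8, 7);
        qs[i] = (uint8_t)((v0 + 8) | ((v1 + 8) << 4)); // two 4-bit values per byte
    }
    *d_out = d;
}
```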
* Update quantize_row_q4_0 for AVX/AVX2
* Update quantize_row_q4_0 for WASM
Untested
* Update quantize_row_q4_0 for Arm NEON
* Update quantize_row_q4_0 for PowerPC
Untested
* Use full range for q4_2 quantization
* add save_load_state example
* use <cstdio> instead of <iostream> and fprintf / printf instead of cout
* renamed save-load-state example files, replacing underscores with dashes
* Unit test for quantization functions
Use the ggml_internal_get_quantize_fn function to loop through all
quantization formats and run a sanity check on the result.
Also add a microbenchmark that times these functions directly without
running the rest of the GGML graph.
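A rough C++ sketch of such a round-trip check (the quantize_fns_t field names and signatures are assumptions about the ggml internals, not the actual test code):

```cpp
// Quantize -> dequantize a buffer and report the RMSE against the source.
// quantize_row_q / dequantize_row_q field names are assumptions.
#include <cmath>
#include <cstdint>
#include <vector>
#include "ggml.h"

static float round_trip_rmse(const quantize_fns_t & fns, const std::vector<float> & src) {
    // an f32-sized scratch buffer is always large enough for the quantized data
    std::vector<uint8_t> q(src.size() * sizeof(float));
    std::vector<float>   out(src.size());

    fns.quantize_row_q  (src.data(), q.data(),   (int) src.size());
    fns.dequantize_row_q(q.data(),   out.data(), (int) src.size());

    double err = 0.0;
    for (size_t i = 0; i < src.size(); ++i) {
        const double diff = src[i] - out[i];
        err += diff*diff;
    }
    return (float) std::sqrt(err / src.size());
}

// usage idea: loop over formats with ggml_internal_get_quantize_fn(i) and fail
// the test when the error exceeds a per-format threshold
```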
* test-quantize-fns: CI fixes
Fix issues uncovered in CI
- need to use sizes divisible by 32*8 for loop unrolling
- use intrinsic header that should work on Mac
* test-quantize: remove
Per PR comment, subsumed by test-quantize-fns
* test-quantize: fix for q8_0 intermediates
* set default n_batch to 512 when using BLAS
* spacing
* alternate implementation of setting different n_batch for BLAS
* set n_batch to 512 for all cases
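A sketch of the earlier conditional default (the struct and the non-BLAS value are assumptions; GGML_USE_OPENBLAS / GGML_USE_CUBLAS are the existing build defines):

```cpp
// Sketch only: pick a larger default batch size when a BLAS backend is built in,
// since BLAS pays off on big matrix multiplications. The non-BLAS value of 8 is
// an assumption; the series later settles on 512 for all cases.
struct params_sketch {
#if defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS)
    int n_batch = 512;
#else
    int n_batch = 8;
#endif
};
```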
* ggml : prefer vzip to vuzp
This way we always use the same type of instruction across all quantizations
* ggml : alternative Q4_3 implementation using modified Q8_0
* ggml : fix Q4_3 scalar implementation
* ggml : slight improvement of Q4_3 - no need for loop unrolling
* ggml : fix AVX paths for Q8_0 quantization
* Moving parameters to separate lines for readability.
* Increasing repeat_penalty to 1.1 to make alpaca more usable by default.
* Adding trailing newline.
* reserve correct size for logits
* add functions to get and set the whole llama state:
including rng, logits, embedding and kv_cache
* remove unused variables
* remove trailing whitespace
* fix comment
* Improve cuBLAS performance by using a memory pool
* Move CUDA-specific definitions to ggml-cuda.h/cu
* Add CXX flags to nvcc
* Change memory pool synchronization mechanism to a spin lock
General code cleanup
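A rough sketch of the two ideas together, a small device-buffer pool guarded by a spin lock (names, pool size and eviction policy are illustrative, not the actual ggml-cuda code):

```cpp
// Reuse previously allocated device buffers instead of calling cudaMalloc per
// operation; guard the pool with a std::atomic_flag spin lock instead of a mutex.
#include <atomic>
#include <cstddef>
#include <cuda_runtime.h>

struct cuda_buffer { void * ptr = nullptr; size_t size = 0; };

static cuda_buffer      g_pool[16];
static std::atomic_flag g_pool_lock = ATOMIC_FLAG_INIT;

static void * pool_malloc(size_t size) {
    while (g_pool_lock.test_and_set(std::memory_order_acquire)) { /* spin */ }
    for (auto & b : g_pool) {
        if (b.ptr != nullptr && b.size >= size) {   // reuse a cached buffer
            void * p = b.ptr;
            b.ptr = nullptr;
            g_pool_lock.clear(std::memory_order_release);
            return p;
        }
    }
    g_pool_lock.clear(std::memory_order_release);
    void * p = nullptr;
    cudaMalloc(&p, size);                           // fall back to a fresh allocation
    return p;
}

static void pool_free(void * ptr, size_t size) {
    while (g_pool_lock.test_and_set(std::memory_order_acquire)) { /* spin */ }
    for (auto & b : g_pool) {
        if (b.ptr == nullptr) {                     // keep the buffer for later reuse
            b.ptr  = ptr;
            b.size = size;
            g_pool_lock.clear(std::memory_order_release);
            return;
        }
    }
    g_pool_lock.clear(std::memory_order_release);
    cudaFree(ptr);                                  // pool is full: return it to the driver
}
```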