* lora: load to devide buft
* add patch tensor function
* correct tensor patch
* llama_lora_adapter_apply
* correct ggml_backend_tensor_copy
* add llm_build_mm
* fix auto merge
* update based on review comments
* add convert script
* no more transpose A
* add f16 convert
* add metadata check
* add sanity check
* fix ftype
* add requirements
* fix requirements
* fix outfile
* conversion: only allow selected models
* fix types
* cuda : do not use dmmv if the tensor does not have enough cols
* llama : lora fixes
* do not disable mmap with lora
Co-authored-by: slaren <slarengh@gmail.com>
* llm_build_lora_mm_id
* convert_lora : MoE LoRA conversion support
* convert_lora : prefer safetensors, similarly to convert_hf
* convert_hf : simplify modify_tensors for InternLM2
* convert_lora : lazy conversion
* llama : load and use alpha from LoRA adapters
* llama : use llm_build_lora_mm in most model graphs
* auto scale
* Revert "auto scale"
This reverts commit 42415a4874.
* remove redundant params
* Apply suggestions from code review
Co-authored-by: slaren <slarengh@gmail.com>
* change kv metadata
* move add_type to __init__
* convert_hf : move add_type to main()
* convert_lora : use the GGUFWriter from Model instead of overwriting it
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
* python: add check-requirements.sh and GitHub workflow
This script and workflow forces package versions to remain compatible
across all convert*.py scripts, while allowing secondary convert scripts
to import dependencies not wanted in convert.py.
* Move requirements into ./requirements
* Fail on "==" being used for package requirements (but can be suppressed)
* Enforce "compatible release" syntax instead of ==
* Update workflow
* Add upper version bound for transformers and protobuf
* improve check-requirements.sh
* small syntax change
* don't remove venvs if nocleanup is passed
* See if this fixes docker workflow
* Move check-requirements.sh into ./scripts/
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Current status: Working, except for the latest GPTQ-for-LLaMa format
that includes `g_idx`. This turns out to require changes to GGML, so
for now it only works if you use the `--outtype` option to dequantize it
back to f16 (which is pointless except for debugging).
I also included some cleanup for the C++ code.
This script is meant to replace all the existing conversion scripts
(including the ones that convert from older GGML formats), while also
adding support for some new formats. Specifically, I've tested with:
- [x] `LLaMA` (original)
- [x] `llama-65b-4bit`
- [x] `alpaca-native`
- [x] `alpaca-native-4bit`
- [x] LLaMA converted to 'transformers' format using
`convert_llama_weights_to_hf.py`
- [x] `alpaca-native` quantized with `--true-sequential --act-order
--groupsize 128` (dequantized only)
- [x] same as above plus `--save_safetensors`
- [x] GPT4All
- [x] stock unversioned ggml
- [x] ggmh
There's enough overlap in the logic needed to handle these different
cases that it seemed best to move to a single script.
I haven't tried this with Alpaca-LoRA because I don't know where to find
it.
Useful features:
- Uses multiple threads for a speedup in some cases (though the Python
GIL limits the gain, and sometimes it's disk-bound anyway).
- Combines split models into a single file (both the intra-tensor split
of the original and the inter-tensor split of 'transformers' format
files). Single files are more convenient to work with and more
friendly to future changes to use memory mapping on the C++ side. To
accomplish this without increasing memory requirements, it has some
custom loading code which avoids loading whole input files into memory
at once.
- Because of the custom loading code, it no longer depends in PyTorch,
which might make installing dependencies slightly easier or faster...
although it still depends on NumPy and sentencepiece, so I don't know
if there's any meaningful difference. In any case, I also added a
requirements.txt file to lock the dependency versions in case of any
future breaking changes.
- Type annotations checked with mypy.
- Some attempts to be extra user-friendly:
- The script tries to be forgiving with arguments, e.g. you can
specify either the model file itself or the directory containing
it.
- The script doesn't depend on config.json / params.json, just in
case the user downloaded files individually and doesn't have those
handy. But you still need tokenizer.model and, for Alpaca,
added_tokens.json.
- The script tries to give a helpful error message if
added_tokens.json is missing.