diff --git a/examples/llava/README.md b/examples/llava/README.md index e2ef0eff1..1d5374f2a 100644 --- a/examples/llava/README.md +++ b/examples/llava/README.md @@ -1,10 +1,12 @@ # LLaVA -Currently this implementation supports [llava-v1.5](https://huggingface.co/liuhaotian/llava-v1.5-7b) variants. +Currently this implementation supports [llava-v1.5](https://huggingface.co/liuhaotian/llava-v1.5-7b) variants, +as well as llava-1.6 [llava-v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2) variants. The pre-converted [7b](https://huggingface.co/mys/ggml_llava-v1.5-7b) and [13b](https://huggingface.co/mys/ggml_llava-v1.5-13b) models are available. +For llava-1.6 a variety of prepared gguf models are available as well [7b-34b](https://huggingface.co/cmp-nct/llava-1.6-gguf) After API is confirmed, more models will be supported / uploaded. @@ -18,6 +20,7 @@ After building, run: `./llava-cli` to see the usage. For example: ``` **note**: A lower temperature like 0.1 is recommended for better quality. add `--temp 0.1` to the command to do so. +**note**: For GPU offloading ensure to use the `-ngl` flag just like usual ## LLaVA 1.5 @@ -55,11 +58,46 @@ python ./convert.py ../llava-v1.5-7b Now both the LLaMA part and the image encoder is in the `llava-v1.5-7b` directory. -## LLaVA 1.6 +## LLaVA 1.6 gguf conversion + +1) Backup your pth/safetensor model files as llava-surgery modifies them +2) Use `python llava-surgery-v2.py -C -m /path/to/hf-model` which also supports llava-1.5 variants pytorch as well as safetensor models: +- you will find a llava.projector and a llava.clip file in your model directory +3) Copy the llava.clip file into a subdirectory (like vit), rename it to pytorch_model.bin and add a fitting vit configuration to the directory (https://huggingface.co/cmp-nct/llava-1.6-gguf/blob/main/config.json) +4) Create the visual gguf model: `python ./examples/llava/convert-image-encoder-to-gguf.py -m ../path/to/vit --llava-projector ../path/to/llava.projector --output-dir ../path/to/output --clip_model_is_vision` +- This is similar to llava-1.5, the difference is that we tell the encoder that we are working with the pure vision model part of CLIP +5) Everything else as usual: convert.py the hf model, quantize as needed +**note** llava-1.6 needs more context than llava-1.5, at least 3000 is needed (just run it at -c 4096) +**note** llava-1.6 greatly benefits from batched prompt processing (defaults work) + +## llava-cli templating and llava-1.6 prompting + +llava-1.5 models all use the same vicuna prompt, here you can just add your image question like `-p "Provide a full description."` +For llava-1.5 models which are not vicuna (mistral and Yi) you need to adapt system prompt as well as user prompt, for this purpose llava-cli has a basic templating system: + +**For Mistral and using llava-cli binary:** +Add this: `-p "\nUSER:\nProvide a full description.\nASSISTANT:\n"` +The mistral template for llava-1.6 seems to be no system print and a USER/ASSISTANT role + +**For the 34B this should work:** +Add this: `-e -p <|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n\nProvide a full description.<|im_end|><|im_start|>assistant\n` + + +## How to know if you are running in llava-1.5 or llava-1.6 mode + +When running llava-cli you will see a visual information right before the prompt is being processed: + +**Llava-1.5:** +`encode_image_with_clip: image embedding created: 576 tokens` + +**Llava-1.6 (anything above 576):** +`encode_image_with_clip: image embedding created: 2880 tokens` + + +Alternatively just pay notice to how many "tokens" have been used for your prompt, it will also show 1000+ tokens for llava-1.6 + -- Use `llava-surgery-v2.py` -- TODO: add detailed instructions ## TODO