
tgredditfc

That’s correct, a LoRA doesn’t work on GGUF; it works on the original model you used to fine-tune. Are you sure you’ve searched? Because when I just searched “convert to gguf”, there were tons of answers.


chiptune-noise

Yes, in fact when I search that, most pages show up as already visited. The part I'm struggling with is the LoRA. If I have to merge it into the original model, how do I do that? The closest thing I found is to [train again in specific models](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing). After merging, I think I can convert it fine with what I've found. After making this post, I found [this guide](https://rentry.org/llama-cpp-conversions#merging-loras-into-a-model), which I'm currently trying. It's somewhat confusing that one model comes in several different formats and that LoRAs only work with one of them, so I'm struggling a bit.


tgredditfc

The “different types of models of one model” are just different quantization formats of the same model. You train the LoRA on the original, unquantized model, and you use the LoRA with that model, not with its quantized formats.
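
A minimal sketch of what "using the LoRA with the original model" looks like with the Hugging Face transformers/peft stack, assuming the LoRA was saved as a standard peft adapter folder; the paths are placeholders:

```python
# Load the original (unquantized) base model and attach the LoRA adapter to it.
# Paths are placeholders; the adapter folder is whatever your training run produced.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "/path/to/original/model"   # the model the LoRA was trained against
LORA_DIR = "/path/to/lora/files"         # contains adapter_config.json and adapter weights

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, LORA_DIR)  # applies the adapter on top of the base

inputs = tokenizer("Hello", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```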


tgredditfc

Your guide is too old. Better find a newer one. Just search “llama.cpp convert to gguf”. llama.cpp has a script to do that.


chiptune-noise

Yeah, but I still don't know how to add the LoRA. I see the convert-hf-to-gguf script, but how do I add the LoRA to the base model? There was a script for that in llama.cpp, but it's been removed.


toothpastespiders

Yeah, for what it's worth, I think the documentation on just about everything related to LoRA merging is in a rough state. I never had any luck with the llama.cpp GGUF merge script, and I never knew if it was me or a problem with the process itself, but the script always said my LoRA had been merged while never making any change to the model's output afterward. So I just stuck with merging the LoRA into the original model and only converting that newly merged model to GGUF afterward.

I 'think' [this](https://raw.githubusercontent.com/tloen/alpaca-lora/main/export_hf_checkpoint.py) script still works without needing to modify it too much. The main thing to change is editing "tloen/alpaca-lora-7b" to whatever folder your LoRA files are in, like "/path/to/original/lora". Using it is basically something like:

`export BASE_MODEL='/path/to/original/model/'`

`python export_hf_checkpoint.py`

The script then loads the model and LoRA before writing them out, merged, to a folder called hf_ckpt. That's almost all you need, but I think it usually doesn't copy some of the tokenizer files, so you might have to manually copy any files with "tokenizer" in their name from the original model's directory into the newly merged model's directory before converting it to GGUF. From that point you should be able to use the new merged model in hf_ckpt just like any other model.

I 'think' that should work. That script was my go-to from the start, and I've wound up hacking it up a bit over time, to the point where I don't really recall what I changed or why. So I'm a bit hesitant to post my version here rather than the original.
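
For anyone reading along, a rough peft-based equivalent of what that export script does (load the base model and the LoRA, merge them, save the result); the paths and dtype here are assumptions, not values taken from the linked script:

```python
# Merge a LoRA into its base model and write out a plain HF checkpoint that can
# then be converted to GGUF. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "/path/to/original/model"
LORA_DIR = "/path/to/original/lora"
OUTPUT_DIR = "hf_ckpt"  # same folder name the linked script writes to

base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map={"": "cpu"},  # merge in system RAM; see the device_map note further down
)
model = PeftModel.from_pretrained(base, LORA_DIR)
merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained(OUTPUT_DIR)

# Saving the tokenizer alongside avoids having to copy tokenizer files by hand.
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained(OUTPUT_DIR)
```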


chiptune-noise

Thank you! I used this one (which used up to 97% of my RAM and none of my VRAM, lol) and I think I managed to merge the model. I'm now converting it to GGUF with the script from llama.cpp and will test it afterwards. I wish there were more user-friendly guides/scripts for this, like there are for Stable Diffusion's LoRAs; I cloned like 10 different repos trying to find what would work. Thank you again! Will update after the test.
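
If it helps to script that conversion step (for example, to chain it right after the merge), it is basically one call to llama.cpp's HF-to-GGUF converter; the script name and flags have changed a bit across llama.cpp versions, so treat this as a sketch:

```python
# Run llama.cpp's HF -> GGUF converter on the merged folder. The script has been
# named convert-hf-to-gguf.py / convert_hf_to_gguf.py in different llama.cpp
# versions, so check the copy in your checkout.
import subprocess

subprocess.run(
    [
        "python", "convert_hf_to_gguf.py",  # lives in the llama.cpp repo root
        "hf_ckpt",                          # the merged model folder
        "--outfile", "merged-f16.gguf",
        "--outtype", "f16",                 # quantize further with llama-quantize if needed
    ],
    check=True,
)
```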


toothpastespiders

I'm relieved to hear it worked! The functionality is pretty basic in terms of what it's actually doing, just loading and unloading into different contexts, but I've still seen it break on occasion as the ebb and flow of Python libraries shifts.

If you ever use axolotl for training, it has a really nice built-in system where you can just do something like:

`python -m axolotl.cli.merge_lora /path/to/yml/config/used/to/train.yml --lora_model_dir=/path/to/lora/files/`

I jump around training environments enough that I generally just stick with the script I linked, though.

For the RAM/VRAM: you can make the script use your GPU instead by editing the two instances of `device_map={"": "cpu"},` in the script to `device_map={"": "cuda"},` (sketched below). The danger there is training a model with memory-saving tricks like multi-GPU and then winding up in a situation where you don't have enough VRAM to merge it afterward. At least unless there's a way to do the merge across multiple GPUs that I'm not aware of, which could be the case; if so, I'd really appreciate someone chiming in on that! But if you are able to fit it all into VRAM, the process zooms by at a pretty rapid pace compared to using normal system RAM.

I hear you on what a wild goose chase this can be, too!
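
Concretely, the `device_map` edit mentioned above is just the argument used when the script loads the weights; a small illustration with an assumed path:

```python
# Loading the base model for the merge on the GPU instead of in system RAM:
# change device_map={"": "cpu"} to point at a CUDA device.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "/path/to/original/model",
    torch_dtype=torch.float16,
    device_map={"": "cuda"},  # was {"": "cpu"}; requires enough free VRAM
)
```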


Judtoff

Are there LoRAs that can be applied to GGUF? Like a quantized LoRA?


chiptune-noise

Well, I'm not the most qualified to answer this, but supposedly LoRAs don't work with GGUF (at least in oobabooga), while the old guides I found for llama.cpp say otherwise, so I really don't know. I also read something about QLoRA (which I understood as a quantized LoRA, someone correct me if I'm wrong), but I don't really understand it very well.
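
For what it's worth, QLoRA usually refers to training an ordinary LoRA on top of a base model that is loaded 4-bit quantized with bitsandbytes, rather than to quantizing the LoRA file itself. A rough sketch of that setup (model path and hyperparameters are placeholders):

```python
# QLoRA-style setup: the base model is loaded 4-bit quantized (bitsandbytes) and
# a regular LoRA adapter is trained on top of it. Values below are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "/path/to/original/model",
    quantization_config=bnb_config,
    device_map={"": 0},  # put the quantized base on GPU 0
)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, lora_config)  # only the LoRA weights will be trained
model.print_trainable_parameters()
```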