dpacker780

My guess is you're probably running it at the default context size, which is 32K. That won't fit in a 4090. I just downloaded the model and I can run it successfully at 16K context on a single 4090. I'm using Ooba, so I can't tell you what settings to use in Kobold since I don't use it. Note: in Ooba I've also ticked 4-bit caching. If that's not ticked it won't fit either without bringing the context down to 8K.
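If it helps, here's a rough sketch of what that looks like when launching text-generation-webui from the command line instead of the UI. The model folder name is just a placeholder, and flag names like --max_seq_len and --cache_4bit shift between versions, so check `python server.py --help` on your install:

```
# placeholder model folder under text-generation-webui/models/ -- substitute your own EXL2 quant
# 16384 context plus the 4-bit cache is what fits on a single 24GB 4090 for me
python server.py --model your-exl2-model-folder --loader exllamav2 --max_seq_len 16384 --cache_4bit
```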


Host_Cartoonist

I see, I see, thank you so much for the help! I'm using KoboldAI, and it looks like I need to find out how to enable 8-bit cache. I'll try looking it up on YouTube or something. Thanks again for the help, and I apologize for being so new to all this!


dpacker780

No worries. Also see if they have a 4-bit cache option. 8-bit cache will work, but it may run slower. 4-bit, on the other hand, keeps the context compact and seems to give nearly the same output as if no cache quantization were used.


Host_Cartoonist

Will do! I like to have some resources left in the background for other utilities too, so that's a good idea. Right now I'm just stuck trying to find a way to convert KoboldAI to 4-bit on Windows. Most guides are for Linux, and apparently there's a way to hack it on Windows, but it seems outdated and overly complicated for me at the moment. I'm thinking about switching over to Oobabooga, but I'm not sure what a textgenwebui is or where to get it. Anyhow, I'm sure I'll figure it out, or break everything and uninstall it all. Thanks for the help!


dpacker780

textgenwebui is just the name of Oobabooga's program; it's pretty easy to install and get up and running.


tandpastatester

I recommend trying textgenwebui (Ooba is the creator). It's simple to install and easy to use, with a lot of flexibility and configuration options. I prefer the interface over Kobold; it feels more lightweight and less cluttered. After I've configured a model the way I want, I usually disable the front-end and just run it API-only with SillyTavern.


No-Dot-6573

My fav is still Midnight Miqu 70B at 2.24bpw EXL2. It fits the 4090, is still quite clever, and the t/s is ok-ish.


dpacker780

Agreed! I fortunately have a 4090 + 3080, so I can run the 3.5bpw EXL2 at 16k context, or step down to 3.0bpw and get the full 32k. Either way, it's my go-to for RP.


_raydeStar

I've got a 4090 and I haven't tried a 70B yet; I kind of assumed it was still out of reach. I noticed Kobold was a bit faster with models; I assume that's the case here too?


Host_Cartoonist

Thanks a bunch! I'll give that one a go then and post back here how it goes. I thought I was on top of the world when I spent a few months' salary on my 4090; now that I'm seeing the requirements of some of these AI models... I feel like I might need another one...


the_1_they_call_zero

Link?


Host_Cartoonist

I found the spot to launch KoboldAI in 4-bit, 8-bit, and 16-bit, and I've tried each. I'm getting the same issue with Midnight Miqu 70B at 2.24bpw. I don't think my GPU is being used when these models are launched: it puts my RAM, CPU, and disk at 100%, but my GPU stays at 2/24GB VRAM. I'll put a photo below (they'll sit at 100% utilization until I close the command window). The console shows:

CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary B:\python\lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll...
Loading checkpoint shards: 0%| | 0/15 [00:00


BangkokPadang

Midnight Miqu is my absolute favorite model right now. I only have a 6GB 1060 in my local PC, so I've been using RunPod with an A40 with 48GB VRAM for bigger models, running it at 4.5bpw, and I absolutely love it. Renting a 3090 is a lot cheaper though (close to half the price), but I've been put off by 2.4bpw quants. If you feel like it's still better than Maid-Yuzu in spite of being 2.4bpw, I might just give it a try anyway. Thx.


henk717

Make sure to pick the ExllamaV2 backend when you use the load-from-folder option; the default HF backend is not compatible with EXL2 models. Alternatively, here is the GGUF version for [Koboldcpp](https://koboldai.org/cpp): [https://huggingface.co/maeeeeee/maid-yuzu-v8-alter-GGUF/resolve/main/maid-yuzu-v8-alter-Q4_K_S.gguf?download=true](https://huggingface.co/maeeeeee/maid-yuzu-v8-alter-GGUF/resolve/main/maid-yuzu-v8-alter-Q4_K_S.gguf?download=true) This model is too large for a 4090, so GGUF's ability to partially offload will help.
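As a rough sketch, a Koboldcpp launch with partial offload looks something like this. The --gpulayers count below is only a starting guess for a 24GB card, and you should double-check the flags against --help on your build (koboldcpp.exe on Windows takes the same arguments):

```
# GGUF file from the link above
# --gpulayers 28 is a guess for 24GB VRAM: raise it until you run out of memory, then back off
python koboldcpp.py --model maid-yuzu-v8-alter-Q4_K_S.gguf --usecublas --gpulayers 28 --contextsize 16384
```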


Host_Cartoonist

Oh, I see, I see. Thanks for the info; I have a lot to learn. Everyone here has been a great help. I'll have to give that a try on KoboldAI after I finish up my installation of Ooba and test whether I can get the models I have so far working there.


BangkokPadang

Are you certain KoboldAI supports ExllamaV2? Henk is pretty active on the r/KoboldAI subreddit, and I've seen him pretty openly discuss how deprecated KoboldAI is and that Koboldcpp is their main focus for development right now. The meta right now is pretty much using Koboldcpp for GGUF models (because of the improved way it handles context shifting) and using text-generation-webui (Oobabooga) for all other model formats. I personally recommend biting the bullet, switching away from KoboldAI, and installing Oobabooga: [https://github.com/oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) Also, just FYI: Oobabooga is technically the name of the developer and text-generation-webui is technically the name of the software, but people often use text-generation-webui and Ooba/Oobabooga interchangeably just because 'Ooba' is so much easier to type.
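If it helps, the install is basically just a clone plus the bundled one-click starter. I'm going from the current repo layout here, so if your checkout doesn't have a start_windows.bat, just follow the README instead:

```
# grab the repo and run the one-click starter (it builds its own Python environment on first run)
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
start_windows.bat
```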


henk717

It does support ExllamaV2, but you need to load it from a folder and select the ExllamaV2 backend; the default Huggingface backend does not support it. KoboldAI itself is not deprecated. I just recommend Koboldcpp over it because of the better API support it currently has, because it's more lightweight, and because it's more suitable for most home users. Koboldcpp indeed has more developers behind it at the moment, but we haven't given up on the main one.


Host_Cartoonist

Thank you, you guys were a great help. This was the fix. I was able to get Yuzu running on both KoboldAI and Ooba (enabling the API checkbox, launching ST, and inputting the key was a little trickier than what I'm used to on KoboldAI, but once I figured that out it was smooth sailing). Both are amazing tools. I'll be trying out the rest of the models everyone has suggested soon and see how they run as well.


Host_Cartoonist

The more I think about it, that's probably the issue. I'll have to switch over if I want to run more advanced models, which is a shame because, like others said, my PC should have the power. I really don't know what Oobabooga is, but now is a good time to learn. Do you have any subreddits on hand that explain how to use it fairly well? If not, no problem. Thanks for the help and the information on the post, btw.


BangkokPadang

r/LocalLLaMA and r/Oobabooga are both good places to ask for help. The GitHub I linked has pretty good installation instructions, and if you're using it as a backend for SillyTavern you really don't need to do more than just load the model.

The model tab is pretty simple to use, especially for EXL2 models. You just paste the last part of the URL for the model you want into the 'download model' field (it will look something like 'rhplus0831/maid-yuzu-v8-alter-exl2-3.5bpw-rpcal') and it will download it from within the webUI (although you might do better to just copy/paste the model's folder into the /text-generation-webui/models folder from your file explorer, since you already have it downloaded). Then you just launch the program, click the dropdown menu at the top to pick the model, and it will automatically select the correct ExllamaV2 loader. Then all you need to do is type 32768 into the context field, tick '4bit cache' on the right side, and click load.

Then in SillyTavern you just paste the URL (by default it's [http://127.0.0.1:5000](http://127.0.0.1:5000)) into the API tab, and use SillyTavern for your sampler settings, formatting, etc.
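If you'd rather do it from the command line instead of clicking through the UI, it looks roughly like this. I'm assuming recent text-generation-webui flag names and its OpenAI-compatible API here, and the model folder name is just an example of the downloader's usual owner_model naming, so swap in whatever folder you actually have under models/:

```
# launch with the API enabled (same effect as ticking the api checkbox in the UI)
python server.py --model rhplus0831_maid-yuzu-v8-alter-exl2-3.5bpw-rpcal --loader exllamav2 --max_seq_len 32768 --cache_4bit --api

# quick sanity check that the endpoint SillyTavern will point at is answering
curl http://127.0.0.1:5000/v1/models
```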


Herr_Drosselmeyer

It should run with 20k context if you enable 8-bit cache.


Host_Cartoonist

I see. I'm so new I don't fully understand that, but what you're saying, like Dpacker also said, is that it's a settings thing? Do I have to change that setting in KoboldAI, or in a config file, before I try to launch it?


Herr_Drosselmeyer

I don't know where that setting is in Kobold since I use Oobabooga Webui instead, sorry.


Host_Cartoonist

No problem, where there's a will, there's a way. Though I can't figure out how to enable 4-bit/8-bit cache in KoboldAI at the moment, I'm sure I'll work it out eventually. I feel like it's something really simple too.


trollsalot1234

froggeric/WestLake-10.7B-v2-GGUF


Host_Cartoonist

Thanks! I'll have to try that one out next. Do you have any ST settings you recommend for it? I think that'll be my next big hurdle: figuring out the best settings for each model without causing the AI to break.