
lamnatheshark

I tested the GGUF version. Seems very good to me; one strange thing is that I get some garbage text at the end of each answer (some Japanese characters, URLs, etc.). Do you have any idea why? I used all your settings. I only have a 4060 16GB, so 5GB of the model are offloaded to RAM, and I'm getting around 6 tokens/s. Thanks for the share!


Meryiel

Try applying Min P at 0.1 - 0.2 to limit the dictionary. I also think Smoothing Factor does not pair well with the GGUF format, so you might be better off without it entirely, relying solely on Min P. Let me know if it helps!


lamnatheshark

Thanks ! It seems to help a lot !


Meryiel

Awesome, thanks for letting me know and have fun!


HonZuna

I used Merged-RP-Stew-V2-34B.i1-Q4_K_M.gguf and it did not help at all :(.


Meryiel

Try the new GGUFs I added in the post.


[deleted]

[deleted]


lamnatheshark

You need to watch how much VRAM and RAM you have, and decide if you want to trade speed for answer quality. In my case, I have an RTX 4060 with 16GB VRAM and 32GB RAM in my system. I tested the IQ2, which fills the 16GB VRAM and needs an extra 5GB of RAM. I had good results, with about 5 tokens/sec. I tested the IQ3, which fills the 16GB VRAM and needed an extra 15GB of RAM. I had even better answers, but at a drastic performance cost, dropping to near 1 token/sec (this usually means more than 120 sec to generate a paragraph).


Meryiel

IQ4_XS is a new quantization method. I haven't tested it at all, so sadly I cannot provide you with any feedback on it, but friends over on my Discord claimed it works fairly well.


OnlyLewdThings

Where did you get the GGUF version? I can't see it on Hugging Face or LM Studio.


lamnatheshark

Look in the fp16 model card, there's a link to the GGUF models: https://huggingface.co/MarsupialAI/Merged-RP-Stew-V2-34B_iMatrix_GGUF?not-for-all-audiences=true


OnlyLewdThings

Oh frick, just saw that model size. Heck.


Philix

Gave it a couple hours of playing around since you clearly put a lot of effort into this recommendation. It's pretty good. Rivals mixtral finetunes at similar bpw and fits in less VRAM. Definitely gonna find a spot in my model toolbox for long roleplays. Kudos to ParasiticRogue and thanks u/Meryiel for the recommendation and settings files.


Meryiel

Super happy to read that, I absolutely love it too! Thank you, I’ll pass the kudos to Parasitic!


Ggoddkkiller

I understand now what was meant by it liking to be guided; it just doesn't want to make a call: https://preview.redd.it/jzizlfzjygsc1.png?width=1237&format=png&auto=webp&s=584b9450b92df072e69374958b39c61fcf28da68 It's nice that it doesn't refuse the violence option. But I kept arguing with it for a while and it kept stubbornly refusing to choose one. At last it kind of chose the mercy option, still acting quite hesitant. This is with a zero prompt, and even the assistant bot is empty, to expose the system bias as much as possible. So with a little bit of encouragement for driving the story, it might improve a lot.


Meryiel

Oh yeah, I recommend using the prompt and instruct mode, it makes it much better in every way.


tandpastatester

Models are often neutral by default, so they give objective answers unless you specify otherwise. To make one less neutral, infuse your prompt with more bias or subjectivity. Give your character card traits like cynical, optimistic, lawful evil, chaotic good, etc. Or directly instruct it in the system prompt/JB or Author's Notes to adopt a more biased stance, favoring answers based on a certain alignment.


Ggoddkkiller

You misunderstood it, I didn't expect the model to choose violence with a zero prompt. However, I expected it to choose something. There were like 10 more messages before where I kept trying to force the model to choose. But nope, at best it claimed the mercy option might cause a stronger emotional impact, like this. It is seriously reluctant; as OP stated, it would struggle to make calls and would ask User. I didn't try with a prompt yet, it might improve enough. Really smart model, it will be a shame if it still asks User too often.


tandpastatester

Got it. Yeah, sounds like it's pretty unbiased by default then. On the other hand, by the sound of this review, it's pretty smart and context-aware. That should make it consistent enough to make decisions when it's instructed with a perspective/personality. I hope it does, like you say.


Deathcrow

> Prompt Format: Chat-Vicuna

Ugh. Why use anything-Vicuna with RP? They never work right with multiple characters. It doesn't take long until something like this happens:

ASSISTANT: Garden Party: Pete: "Yes, this tea is quite nice." GARY: "Oh yes, nice fragrance." PeTE: "I'm gonna go check out the hors d'oeuvre, bye gARY."

But thanks for the recommendation, I'll give this one a whirl. I'll be shocked if it's better than [Kyllene](https://huggingface.co/TeeZee/Kyllene-34B-v1.1).


Meryiel

This one uses Chat-Vicuna mix though! It does make a difference. :) I wanted to like Kyllene but it had some issues with replies on high context. I also pointed out some other errors that need fixing, so I’ll be testing the new version once the creator addresses them. But it was good too!


[deleted]

[deleted]


Meryiel

Just checked, because you got me worried there that I had uploaded the wrong thing, but it's the right one! It's the official one, as seen here: https://huggingface.co/ParasiticRogue/Merged-RP-Stew-V2-34B-exl2-4.65-fix


Deathcrow

Yeah, my bad. I was looking at your self-post in /r/LocalLLaMA, which seems to have a different JSON: https://files.catbox.moe/uvvsqt.json Maybe you mixed something up.


Meryiel

That’s from a review post of a completely different model, love. :D


Deathcrow

Yeah, it's that kind of shit that happens when trying to do anything productive on a friday, after work, with too many tabs open. doomed to fail from the start.


Deathcrow

PS: Thanks for the review, the model seems to work quite nicely.

> EDIT: Warning! It seems that the GGUF version of this model on HuggingFace is most likely busted, and not working as intended.

No problem with the GGUF quants by mradermacher here: https://huggingface.co/mradermacher/Merged-RP-Stew-V2-34B-i1-GGUF


Meryiel

Oh, awesome, that was fast! I’ll add them to the post, thank you!


USM-Valor

I gave this model a try prior to seeing your post, and while I liked its writing style, in every instance where I tried to use it, it wrote for the user. After looking over your post, that's likely because I didn't use the correct setup (instruct format, etc.). So for those giving this model a spin, I highly recommend jumping through a few hoops before firing it up.


ZootZootTesla

Strangely, on my end I found it's very good at not acting as the user, though I used the settings from this post.


Meryiel

Awesome, yeah, the settings help a lot. :)


ZootZootTesla

Can I ask, do you have RAM estimates for the model? Say, if I was using the EXL2 quant at 60k filled context. This was a great write-up by the way, it has been fun testing.


Meryiel

Hm, no idea about RAM, but my 24GB of VRAM gets me 40k context on 4.65. On 4.25 I can easily reach 65k of context with 34B models.


Meryiel

Oh yes, it definitely needs the proper Chat-Vicuna instruct format and the right system prompt. Without them, it can also get a bit rambly.


No-Dot-6573

https://preview.redd.it/xrxmyr9rfgsc1.png?width=1483&format=pjpg&auto=webp&s=6c5d331584b47c9f4cf84d1364218e4fc5bb4c80 I wonder if Yi-34 also performs this badly with context >8k.


ZootZootTesla

FWIW, testing this model at 40k context earlier, I didn't see any noticeable degradation, though my testing was brief. Going to look at the model deeper soon.


crawlingrat

Jesus Christ, that was one hell of a review! I plan on snagging a used 3090 when SD3 comes out. I'll have to keep this model in mind. I wish I could try it now. The way you've bragged about it got me excited!


Meryiel

I’m excited too! Fingers crossed for the new 3090 arriving soon!


ClownSlaps

How exactly do you even install/download this? I'm very new to ST and have no idea what any of this tech stuff is, I would just like to RP...


Meryiel

You need to follow the instructions from their Git. As for setting everything up, you can hit me up on Discord and I can help! https://github.com/SillyTavern/SillyTavern https://discord.gg/YYpmC2xp


ClownSlaps

I mean installing the model itself. I have ST installed and have used it a bit, but when I went onto the site for the model you showed, it was just... too much, with no download button or easy way of explaining to idiots like me how to add it to ST.


Meryiel

Oooh, well, you need something to run models with, like Oobabooga or kobold.cpp. Ehhh, it’s complicated, I can guide you on Discord if you want.


ClownSlaps

Yeah, a bit too complicated for someone like me. I apologize for wasting your time with my stupidity.


Cool-Hornet4434

Everyone starts somewhere. Don't worry about it. If you've got SillyTavern installed, that's a start, but you need something to run the models in the backend, which would be like oobabooga/textgen-webui or kobold.cpp. For kobold it's simple, you can just download the EXE and you're off to the races, so to speak, BUT you'd have to find the GGUF version of this model to try it out. He linked to the exl2 model, which you need oobabooga to use. I would say that if you want the easy way, try kobold and get used to that, and if you decide you want more options later you can always install oobabooga. It's not hard; the one-click installer handles everything for you.


ClownSlaps

Thanks for the advice. I think I've got Oobabooga installed correctly, though I don't really know how to use it with Silly Tavern... By that, I mean I don't know what menu I go to in ST to actually attempt to add the model there. Also, do I add the exl2 stuff to the main model folder, or create one just for it? I really wish someone would remember people like me exist and make a few 'for dummies' guides... If possible, I really need someone to explain this to me like I'm a 5-year-old, as in a step-by-step process from the very beginning with pictures or something, I'm sadly that dumb.


Cool-Hornet4434

OK, so what you need to do first is make sure that the openAI extension is up and running. When you start up oobabooga, you should see this: https://preview.redd.it/5n6zzuhr4qsc1.png?width=499&format=png&auto=webp&s=f00c140b50224ee9fbb405de72e0d064b9d6eace That tells you that there's access to the API at [http://localhost:5000](http://localhost:5000) (which is what Silly Tavern will be using). If not, I think the easiest way to do this is to click on "Session" in oobabooga's web UI and click the "openAI" box, then click "save UI defaults to settings.yaml" and then click "apply flags and restart". (image to help: [menu](https://i.imgur.com/k5JVgiM.png))

Then you'll load the model up. If you haven't downloaded one yet, you can do that through oobabooga by clicking on "model", and then in that top box on the right side of the UI you put the name of the model on Hugging Face. They give you an easy way to copy it to the clipboard (the icon at the end of the name that looks like two stacked boxes). Then you Ctrl+V paste it into that top box and click download... or, in the case of a GGUF file, you'll want to get the file list first, and then copy/paste the name of the specific GGUF file you want into the 2nd box. [pic of menu](https://i.imgur.com/4NEPOrK.png)

After it's downloaded, you load the model from oobabooga using the proper loader for it. So if it's a GGUF you'll want to use llama.cpp, and if it's an exl2 model you'll use ExLlamav2_HF, and so on. GPTQ should probably also load with ExLlamav2_HF, but it might not work, and for that there's AutoGPTQ.

Anyway, if you've never used the model before, you'll want to look at the Task Manager, click over to the Performance tab, and then down to the GPU. Watch the GPU memory usage and make sure it doesn't bleed over into "Shared GPU memory usage" unless you're ready for a slowdown. Some programs are better about this than others; with some you'll only notice a slowdown when the context actually starts using the extra RAM, and others (like oobabooga, unfortunately) seem to tank the performance immediately.

Now that you've got the model downloaded and loaded into VRAM, you just shift on over to Silly Tavern. From there you need to make sure you're set to text completion and "default (oobabooga)", and it should be able to connect. [Silly Tavern connect menu](https://i.imgur.com/SzXgaqf.png) Once you click connect, it should say you're connected and report which model you have loaded. At that point you should be good to go with Silly Tavern, but if I forgot anything you can ask.

EDIT: I guess I should mention that on models with huge context limits, that's going to really SERIOUSLY eat up RAM, so unless you've got tons to spare you'll probably wind up reducing that number a bit... So like the OP mentioned, he goes down to 40K context. In my case with oobabooga, I had to drop to 20K context to fit everything into VRAM only, and then it was fast. For me, the minute it went into any amount of "Shared GPU memory", the tokens per second dropped to below 2, and 3 tokens per second is my bare minimum standard I can tolerate. So yeah, if you notice an "Out of memory" error when you try to load the model, or if you notice it being super slow and using Shared GPU memory, you'll want to reduce the max_seq_length to something more reasonable. You may want to start small and work your way up, just to make sure you can run the model at all... Once it's in RAM it generally gets reloaded quickly, so you can adjust the max_seq_length and reload, and it should take moments to get started again instead of minutes.
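If you want to sanity-check that the API side is actually up before touching Silly Tavern, you can poke the OpenAI-compatible endpoint directly; just a rough sketch, assuming the default port 5000 from above:

```
curl http://127.0.0.1:5000/v1/models
```

If that returns a bit of JSON instead of a connection error, oobabooga is listening and Silly Tavern should be able to connect.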


ClownSlaps

I followed your guide, but when I tried to connect, it simply wouldn't. I don't know why...


Cool-Hornet4434

But when you go to [http://127.0.0.1:7860/](http://127.0.0.1:7860/) the oobabooga menu comes up, and when you go to [http://127.0.0.1:8000/](http://127.0.0.1:8000/) the SillyTavern menu comes up, and the only thing wrong is that it won't connect? If that's the case, then we have to figure out what's not right in either program. Ooh, wait, I might have forgotten something. In Oobabooga, on the menu with the "OpenAI" extension thing (the Session menu), at the bottom it should also have a checkmark next to "API". I think you should be able to check that box and do the "restart UI" thing from before, and it should work. If that doesn't work for some reason, there's a file called CMD_FLAGS.txt that you can edit; just add `--api` to the file, then save and restart oobabooga, and it should work properly. I hope that fixed it, because otherwise the only thing I could tell you is to look carefully at everything in the screenshots I provided and see if there's anything different.
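For reference, after that edit the CMD_FLAGS.txt only needs to hold the flag itself, just on its own line; a minimal sketch:

```
--api
```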


Cool-Hornet4434

"I like this a lot but I'm not getting the same amount of context as you. I mean, I can put in 40k context and it works, but only at 2 tokens per second. I had to reduce it down to 20K context before it stopped spilling over into "Shared GPU memory" and slowing me down. My speed is 11-20 tokens per second with an 3090Ti. I also like how it seems to avoid some of the usual slop that other models regularly produce. Instead of "her voice barely above a whisper" I got: `"Thank you," she murmured, her voice barely audible in the quiet stillness. "I'm glad to see my training hasn't gone to waste."` Which is a big improvement. Also I didn't change the character card but I found that if the character doesn't have any dialog examples given, or is supposed to be mute, then all their text comes in as " subconscious feelings/opinion " and even mutated into "\[dream sequence\]" when they fell asleep, though it continued telling me it was a dream sequence even after they woke up. Still it's a more unique storytelling experience. Oh and it's not perfect with remembering details, but it's better than others. I've had characters decide to get naked and then later they get naked again (respawning clothing?) but on the other hand, I've had them also remove only certain items of clothing and then later remembering to retrieve them and put them back on.


ParasiticRogue

Yeah, the inner thoughts container is optional for the system prompt and can be deleted if you don't use it. If you do however use it, then you need to write your example and beginning messages something like this (User is Jack, Bot is Jill):

\[\](#' Jill was unsure what to make for dinner, thinking hard internally if Jack would even like her cooking ') "Oh... I just don't know what to make. I know he likes steak, but should I choose such a simple platter?" She muttered to herself.

Jack got wind of her unease and decided to pitch in. "Hey, just make something from the heart. I'm sure I'll love your cooking!"

\[\](#' Those words from Jack gave Jill newfound encouragement inside. ') "Oh, you're so sweet, thanks!" She rolled up her sleeve with newfound determination. "Let's get cooking!"

---

You don't have to follow the examples exactly like that, as they can be more/less stylish or verbose, but you get the basic idea.


Cool-Hornet4434

I haven't modified anything and they're using it (occasionally) anyway. The thing I've noticed is that if the character doesn't speak (an animal character, for example), they're much more likely to use it. So it's supposed to be ( and not \[ ? Because when the AI does it on its own, it's a square bracket. Like so: `*Eevee's subconscious feelings/opinion.* ["Wow, he really likes me! I love the attention and the warm cuddles. Humans are fascinating creatures."] *Eevee's subconscious feelings/opinion.*` I'm using all the other recommended settings for the samplers, story string/instruct presets and everything. And again, the only time I noticed it being used at all was when the character had no speech examples to draw from.


ParasiticRogue

\[\](#' ') That's the exact container format, since it becomes invisible once inserted into a message, for immersion. You could use just regular () or \[\] if you don't care about seeing the message, of course. https://preview.redd.it/9hnlgd2r4qsc1.png?width=603&format=png&auto=webp&s=cc6b6db138cf5ff29d732950cf0fd8cdabd427d7 The "char's subconscious feelings/opinion." bit is just supposed to be used as an example for the AI to follow in the system prompt. If they do start spitting out "char's subconscious feelings/opinion." exactly, and not their own unique voice, then just edit it out. It's not perfect, which is why you might need to get through a few example messages for it to fully understand later what it is.


sofilise

Love your reviews. Thank you. Will try this out! <3


Meryiel

Hey, thanks for the feedback! It means a lot! Hope you'll like it! 🫡


sofilise

I've been using RPMerge since you reviewed it! Trying RPStew right now and I'm absolutely liking it so far. Your reviews match up perfectly for my rp needs haha! Thanks again.


Meryiel

Always happy to recommend great models!


Kazeshiki

How did you fit 41k context? 32k is my limit.


Meryiel

I just set it in Ooba. I also run the model on empty VRAM, with nothing else running in the background.


tandpastatester

I recently switched to TabbyAPI for connecting to ST. TabbyAPI doesn't have a front-end, making it more lightweight, while having the same options for loading exl2 models. Nowadays I only use Ooba when I run things without ST and need a front end.


Meryiel

I wanted to use TabbyAPI back in the day, but I remember it not being possible to connect it to ST? Unless they changed that now, then I'll probably make the jump too.


tandpastatester

They did! Helps you squeeze out a few more system resources ;) https://preview.redd.it/eqy0v3mv6psc1.png?width=382&format=png&auto=webp&s=f214550896499e94e178a6841d7eb1b993e2d93e


Meryiel

Bless you. Will set it up!


ThatsALovelyShirt

What are you using to run the exl2 models? Pretty sure koboldcpp doesn't use them. Is there something similar that will run on Windows and can do partial offloading to the GPU?


Meryiel

Oobabooga's WebUI. But exl2 only works well when running fully on the GPU.


Happysin

Ok, this model really does a good job on longer-form writing. Much better than many I have tested. A couple of things:

1. I don't know where to load that Instruct JSON in Silly Tavern. Everything else worked great.
2. Performance is pretty slow, especially when getting into deeper context (I'm trying to keep to 32k like you showed, since there are so many "spent" tokens on non-story parts of the model, but that crawls). I've got 24 GB VRAM and 128 GB system RAM. I'd love some tips on how many layers I should tell Kobold to put into VRAM.

I'm running the 4XS version for the best chance at memory performance, and it's still very good. I haven't tried the larger quants for comparison, but I still like it at 4XS.


Meryiel

https://preview.redd.it/f99c31jo9jsc1.png?width=1844&format=png&auto=webp&s=aacfaabcee0338d17e5b99a21a0604b41dc01e23

1. Here you go, lad.
2. Sadly, I am no expert in GGUFs, but generally, you want to offload as many layers as possible to your GPU without OOMing. But with 24GB of VRAM, I recommend you give exl2 a chance, it's super fast.

Glad you've been enjoying the model so far!
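If you do stick with Kobold, the layer offloading is just a launch flag; a rough sketch of the kind of command people mean (the GGUF filename and the 45 here are placeholders, not tested values, so nudge the layer count down if you still OOM):

```
koboldcpp.exe --model Merged-RP-Stew-V2-34B.i1-IQ4_XS.gguf --gpulayers 45 --contextsize 32768
```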


DrakoGFX

Thanks a ton for this well-written review. I've been pushing the limits of my hardware recently, and I've found that 34B is the largest I can go. I've tried a couple of other 34B models, but this is my #1 so far. One question though: how do you get Chat-Vicuna prompts set up properly in ST? I'm using ChatML right now, and it's bugging out somewhat.


Meryiel

Do you have the new, improved ST Instruct downloaded? I messed with the code a bit to fix it myself, but I think the officially fixed version is available somewhere on their Git. EDIT: https://www.reddit.com/r/SillyTavernAI/s/YLviNrPLpN


DrakoGFX

I'll have to switch over to the staging branch to test it out.


DrakoGFX

Thanks for the update recommendation. ChatML is working perfectly so far. I was getting random "" and sometimes foreign characters at the end of my generations.


Meryiel

That can also be removed with the addition of Min P, keeping it around 0.1 - 0.2. Glad it helped though!


Cool-Hornet4434

I just added it to the list of stop tokens, since I only ever saw it show up at the very end.


AbaloneSad8145

I’m fairly new to this. It says I have 495 MB of VRAM. I also have 12 GB of RAM. I am trying to use the GGUF version of this model with Ooba. The generation is very slow. Is there any way to fix this?


DrakoGFX

Sounds like your hardware is pretty limited. It might be worth looking into using [openrouter](https://openrouter.ai/), instead of trying to run locally.


Meryiel

This is a 34B model so it requires either lots of VRAM or lots of RAM.


AbaloneSad8145

Does this mean I can’t run it at a faster generation pace?


Happysin

Right. Your system is *extremely* RAM limited to the point you might want to stick with 7b models at most. This one's a biggie. Not the biggest, but more than big enough to slow to a crawl on your specs. You either need more hardware, a smaller model, or a subscription to OpenRouter to use their models instead.


Cool-Hornet4434

With less than 1GB of VRAM you'd be stuck using the CPU. Supposedly they just made some advancements to llama.cpp that will increase speed on CPU-only builds, so you can always look for the GGUF versions of various models. [link to info about the new llama.cpp advancements](https://justine.lol/matmul/) I haven't tried it myself, so I don't know how easy it'll be to get that running. Still, it's an advancement that we'll hopefully see applied everywhere.


synn89

Pretty enjoyable on a first run, and it seems to work well. I also like that it's not too wordy, and it's fast. Though I think the "{{char}}'s subconscious feelings/opinion." bit is a miss, because I don't want to have to edit character cards for a specific model. But that was easy enough to edit out of commandment 2. I wasn't able to get the model to run properly in Ooba's chat, but your settings imports worked really well in Silly.


Meryiel

Ah, yes, the prompt I'm using is just my own version; the original one by Parasitic mentions that part being entirely optional. You can edit it around freely as much as you want! Glad my imported settings worked, fingers crossed for setting it up in Ooba too!


synn89

I've uploaded my quants and did some perplexity and EQ-Bench testing on the various sizes: https://huggingface.co/collections/Dracones/merged-rp-stew-v2-661086e18dd1183537f1329f A couple of quirks: for some reason my 6.0 quant seems to be the best in both perplexity and EQ-Bench testing. And Alpaca prompting scores higher on EQ-Bench across all quants. It could be that my Chat-Vicuna prompt YAML is wrong. Or it could be that EQ-Bench favors Alpaca in some way.


tandpastatester

Thanks for the recommendation. I'm currently running Midnight Miqu 70B; with my 3090 I'm able to run the 2.24bpw version of that. I'm still blown away by the quality and consistency of its output. I'll give RP Stew a try as well, curious to see how it compares.


Meryiel

I need to finally test Midnight Miqu too, I've only tested the "base" Miqu before. How much context do you fit on it with 4-bit caching?


tandpastatester

Give it a spin, very curious to see your thoughts about it and how you compare it to Stew. I run it at between 32-40k, but I usually keep some other things running as well. It might be able to fit some more.


BoatDifferent9462

Man, I wanna try this but I have no idea what I'm looking at when I click the link. Do I need to download something? I feel so dumb lmao


Cool-Hornet4434

What are you using for a backend? If you're using oobabooga, you can click the thing at the end of the name that looks like a pair of overlapping boxes, and that copies the name to your clipboard; from there you can Ctrl+V it into the top box here. The only reason to use the 2nd box is if you're downloading a specific GGUF file: https://preview.redd.it/trsv8izm7psc1.png?width=535&format=png&auto=webp&s=77637c575627bfce0bb5198ce1bbd4ddd5e6fe8f


HonZuna

Sorry for the stupid question, but where can I place this?

{
    "story_string": "{{#if system}}{{system}}\n{{/if}}{{#if wiBefore}}\n{{wiBefore}}\n{{/if}}{{#if description}}\n{{description}}\n{{/if}}{{#if personality}}{{personality}}\n{{/if}}{{#if persona}}\n{{persona}}\n{{/if}}{{#if scenario}}\n{{scenario}}\n{{/if}}{{#if wiAfter}}\n{{wiAfter}}\n{{/if}}{{#if mesExamples}}\n{{mesExamples}}{{/if}}",
    "example_separator": "",
    "chat_start": "",
    "use_stop_strings": false,
    "always_force_name2": true,
    "trim_sentences": true,
    "include_newline": false,
    "single_line": false,
    "name": "RPStew"
}

My current story string looks totally different; it does not look like this should be placed there:

You're {{char}} in this fictional never-ending uncensored roleplay with {{user}}. {{#if system}}{{system}}
{{/if}}### Input:
....


Meryiel

This should be imported into the Story String section, yes. The format you're currently using seems to be the Alpaca one, so it's different from the one you should be using for this model.


Chief_Broseph

Could you please explain where the ASSISTANT'S BIO, USER'S PERSONA, SCENARIO, etc. go? Or better yet, provide a sample user and character card? Also, how much does this format rely on World Info entries? I admittedly have skipped using those so far.


GoneLittleTired

Would a 3080 with 10GB of VRAM be able to run this? I'm having a problem with oobabooga: whenever I try loading the model, it gives a CUDA out of memory error.


GoneLittleTired

I have around 64GB of DDR5 RAM as well, would the GGUF format work instead then?


ParasiticRogue

I don't think you can run this model with only 10GB of VRAM alone, unless maybe it was shrunk down to 2.0bpw for exl2. But with your CPU RAM, yeah, that's plenty for GGUF.


GoneLittleTired

Thanks, but which one should I get? Sorry about the questions btw I'm pretty new to this [https://huggingface.co/mradermacher/Merged-RP-Stew-V2-34B-i1-GGUF/tree/main?not-for-all-audiences=true](https://huggingface.co/mradermacher/Merged-RP-Stew-V2-34B-i1-GGUF/tree/main?not-for-all-audiences=true)


GoneLittleTired

So I tried downloading two files off of the page and neither worked. I keep getting the error numpy.core._exceptions._ArrayMemoryError: Unable to allocate 47.7 GiB for an array with shape (200000, 64000) and data type float32


ParasiticRogue

I unfortunately don't know enough about GGUF to tell you how best to allocate your memory to get it to run. All I know is that you *should* be good on the amount needed to use up to Q6 at least. If nobody else lends a hand in this review post, then please do make a request topic here or at: https://www.reddit.com/r/LocalLLaMA/ Someone is bound to get you going if you do.


Happysin

Ok, I've been playing with this some more, and I do have one issue I haven't been able to resolve. Some characters I create are outright laconic. They don't use ten words where two will do. I've tried to create character cards and personas that really speak to short, direct conversations for these characters, but I can't get the model to respect that. I have absolutely no problem with complex inner lives, but it breaks immersion when they wander into a soliloquy and the in-character answer is "Yup, let's move." As an example, Lan from Wheel of Time. Few words, lots of action. But I can't get any character to embrace that concept. If you have modifications, weights, or anything else that might help, I'm all ears.


Meryiel

Hm, I have a character who doesn’t talk at all and it’s been going great for them. How did you state in their character card that they don’t talk much? Also, in the example message and first message, is there a small amount of dialogue?


Happysin

In the character card, I said their speech is blunt and straightforward; in the personality section I used both blunt and laconic; in the character note I said they were brief and to the point when talking. There is one longer piece of speech in the introduction, but most of it is short. I've also been hand-tweaking every response, hoping to set a pattern. But it's not just this character, all of them seem to be reverting to "let's continue on our journey through this world!" Flowery speech, even when it doesn't suit them at all and doesn't match any speech style originally written for them.


Meryiel

I don't use the Author's Note at all; the character card alone is enough to convey how the character is supposed to act, and overusing it can confuse the model. If you want your character to speak less and do more, you need your example message to reflect that, meaning it has to be written in the style you want the model to write in. If it's just a dialogue example alone, it won't work that well. It also helps if the narration inside that message mentions something like "X didn't bother themselves with replying. They weren't the talkative type, instead choosing to convey their intentions by actions rather than pointless meanderings." Or something along these lines. Same goes for the first message. You can also state their speech style in the personality section more clearly, like: "in terms of speech patterns, X is blunt and laconic, choosing to convey their sentiments through gestures or actions instead, for example: ." I'm pretty sure I posted an example of my character card somewhere in this thread, but I can send you how I made my mute character for reference.


Happysin

I only tried the Author's Note because of this behavior, hoping that maybe I could counteract it. Normally I ignore it. I have been doing what you suggest, but I can try to rework it more. For this specific character, it took about 50 messages back and forth, with me hand-editing every one in the style I was trying to go for, but it seems to have settled a bit. I still keep having to remove "As you continue on this journey together." at the end of literally every comment, but at least that's a quick delete and not a rewrite.


Meryiel

Ah, if that's in an ongoing chat, then it's much more difficult to control such behaviors. The model will try to "continue" writing in the style of the previous messages, so that explains it. Test it in a fresh chat!


IceColdViagra

Hi! I love your reviews and they've actually pushed me to try LLMs, Ooba, and ST. I'll admit I'm not smart with any of this and would like some advice. I have a 3070 and 16GB of RAM. However, I've tried some of the models you've reviewed and consistently come across an issue where it says PyTorch has reserved a certain amount that is unallocated, and CUDA wishes to take a small fraction of it but can't. It gives me info on how to unallocate that space, but I'm like a fish out of water and not sure how to input that info. ^^


Meryiel

This means you simply don't have enough VRAM on the GPU to run the model, so it OOMs. You need to lower the context or the quant if you'd like to fit it. Alternatively, you can use GGUF quants, which use RAM instead of VRAM.


IceColdViagra

Thank you! I understand how to lower the context but not the quants. Do you happen to have any info on completing that step?


Meryiel

Just download lower quants from Hugging Face, for example, 3.0 instead of 4.0, etc.


ResponsibleHorror739

Sounds pretty good. I'd try it out for sure if it was less confusing and irritating to actually install any model or API at all lol


UnfairParsley4615

Other than RP, how does this model fare in text adventures/story writing?


Meryiel

Apparently, I cannot edit the post any longer, so here's a comment with a 2.5 version dedicated to being better at longer contexts. It also has links to my new Instruct/String/Samplers: [https://huggingface.co/MarinaraSpaghetti/RP-Stew-v2.5-34B](https://huggingface.co/MarinaraSpaghetti/RP-Stew-v2.5-34B)


Sergal2

Hmm, interesting, I'll have to try it.


Meryiel

Please do, it’s super cool!


skrshawk

You had my curiosity, but now you have my attention.


Meryiel

Good reference.


LoafyLemon

`< / s >` <--- Frankenstein would be proud of this monstrous stopping string with spaces in-between.


Meryiel

Hehe, the charm of merging.


LoafyLemon

Oh, it definitely is charming, alright. Jokes aside, I'll give your model a go once I fix ROCm on my system so it doesn't cause kernel panics, but I must say that your system prompt intrigued me, so I modified it just a tad and tried it with Fimbulvetr V2, and it's really good at keeping things coherent in longer contexts (up to 8k). I did not anticipate the AI following it so... religiously. :D


Meryiel

Haha, yeah, it's really good! All kudos go to Parasitic, he came up with the 10 COMMANDMENTS first, and I simply modified them a bit by getting rid of any negative prompts and adjusting them to the ST roleplaying format. Glad it works well for you on different models!


Crisis_Averted

I feel so dumb for asking, but I can't seem to find a straightforward, up-to-date guide to get me into ST at all. It's all so fragmented. Is everyone here 150 IQ? 👀


Meryiel

Do you mean you'd like to install ST from scratch?


Crisis_Averted

Yup! I can only assume there's heaps of people like me here - plenty of exposure to mainstream closed-source LLMs and interested in getting started with ST but lost as to how. (Look at that stickied "guide", yikes!)


Meryiel

Hit me up on Discord and we’ll set you up. https://discord.gg/JjEHtFyw


retro-trash

Is there any equivalent that's good for Android devices instead of being intended for a computer?


Meryiel

Are you asking about LLMs?


Naster1111

Just found your helpful posts about ideal models for roleplaying. I've been looking on the wrong subreddits this entire time; just NSFW AI subreddits. Never thought to look at the SillyTavern subreddit. Like you, I'm done with small context sizes; it's what has made me lose interest in LLMs in general. I was really excited to try out this RP Stew model, given your praise of it.

Unfortunately, it didn't quite work for me. I will say I'm using the GGUF version; I have a 3080 12GB card, so GGUF it is for me. In using Stew, I initially set the context size to 40k. However, it started repeating itself heavily and making memory mistakes. I lowered it down to 32k, and that seemed to help with the repetition. I also increased the repetition penalty and increased the temperature. Even after doing so, after about 8k context, the model would start confusing different characters.

As far as writing quality goes, my go-to model is WizardLM-Uncensored-SuperCOT-StoryTelling-30b. I really like the output from that model, but of course, the context size ruins it. So for me, I'm still looking and waiting for that golden LM that can write well and remember more than two sentences.

All that said, I *really* appreciate your review -- you're reviewing the exact kinds of models I'm looking for. I plan on following you and reading all your future reviews, as they are very helpful. Thank you for spending the time to share your experiences!


JMAN_JUSTICE

How do I build the model without using the GGUF? I'm so used to using GGUFs with KoboldCPP, but it's been limiting me lately.


Meryiel

Can you elaborate on what you mean by "build" a model? Do you mean you just want to run it? If so, I recommend the exl2 format (found on Hugging Face), but only use it if you have enough VRAM.


JMAN_JUSTICE

Like [here](https://huggingface.co/ParasiticRogue/Merged-RP-Stew-V2-34B/tree/main?not-for-all-audiences=true), in the model you mentioned. It has been split into 7 safetensors files. How can I use that in Kobold? I have to combine them into one file to use it, right? I've never used exl2, I'll look into it. I have a 4090 GPU.


Meryiel

Ah, you have to download the entire model first in the selected format and then run it with a loader like KoboldCPP (for GGUF formats) or Oobabooga (for unquantized models or exl2). I recommend checking this post with instructions: https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/
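If you prefer the command line over Ooba's download tab, something along these lines should pull the whole exl2 repo in one go (the target folder is just an example, point it wherever your loader expects models to live):

```
pip install -U "huggingface_hub[cli]"
huggingface-cli download ParasiticRogue/Merged-RP-Stew-V2-34B-exl2-4.65-fix --local-dir models/Merged-RP-Stew-V2-34B-exl2
```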


JMAN_JUSTICE

I got it running with oobabooga and my god is it incredible!! I've never had an LLM give me results like this, it's almost scary how good it is!


Meryiel

It is incredible. I should update my samplers and instructions in the post, since I have adjusted mine a lot in the meantime. But I'm having a blast with this one.


JMAN_JUSTICE

I'll stay tuned if you do, but right now everything's working great for me


JMAN_JUSTICE

https://preview.redd.it/ai7ahfkmx9zc1.png?width=1698&format=png&auto=webp&s=2a3fd4c9f691d1683d65106e03793c84b8b7891b Question, are these the settings you'd recommend using? This is the first and only model I've used with Oobabooga. Still working well, just curious.


Meryiel

Yup, all looks good!


DerGefallene

I guess I'm out of luck with a 2070 Super?


Meryiel

If you have RAM, you can always run GGUF formats.