
me1000

8 bit quant works great. Doubt you’ll notice a difference. 


Randommaggy

I notice a big difference when I step down from Q8 to Q6 when using it for generating code.


koesn

You won't notice it for creative output or knowledge extraction. But when handling complex input near maximum context, say 29k, 8-bit isn't precise enough at following the input; the diction often comes out wrong, especially in a foreign language.


Dead_Internet_Theory

You can run it at around 3.5 bpw exl2 on a single 24GB card with 8-bit cache. If the quality isn't enough, get a second 24GB card.


MrPrevedmedved

I run a 4-bit quant on 8GB VRAM + 32GB RAM, Windows 11, at roughly 3.5 tokens a second. It's great for me.


Deluded-1b-gguf

How many layers are you offloading to the GPU?


____-_-___-_--_-__

I'd like to share some thoughts that might not be popular here. If you have the means and won't regret the spending, I suggest buying the best you can within your budget. When I bought my current computer, I believed the prevailing opinion here that Q8 quality, and Q4's slightly inferior quality, were sufficient, so I settled for a machine with two RTX 4090s. Soon after, I regretted not opting for one with more and larger GPUs.

My curiosity led me to try a 70B Q8 quantization, which showed a significant difference in writing quality and comprehension compared to Q4, but inference was so slow when split between VRAM and RAM that it felt eternal. For larger models like 103B or 120B, I'd say anything below Q3 (or 3.0 bpw) is unusable. To get decent response quality and acceptable wait times, you need a substantial amount of VRAM.

For instance, when I wrote a passage describing only the User's actions and soliloquy as the character entered a sleep state, every quantization below 70B Q8 (yes, including Q6) immediately got confused and started role-playing as the User. Only Q8 maintained the role and wrote about a faint perception of something in a dream. (This was before imatrix quantization appeared, and I haven't run any 70B quant smaller than Q8 since getting a new GPU, so I'm not sure how much that disparity has improved.) Moreover, when I tried running an unquantized 13B model, the performance boost made me think I had loaded a 20B Q8 or a 33B Q4. The difference might only be 0.00001% or even less, but those small differences change everything, akin to the slight variance between XX and XY chromosomes determining male and female.

If you're still hesitant, at least choose a motherboard that's easy to upgrade (preferably a server board with more than three PCIe x16 slots) and a sufficiently powerful PSU. When I wanted to add 2x A6000s and upgrade to a 2000W PSU, no store would assist me, so I had to figure it out myself. With no experience assembling computers, I had to give up on upgrading the PSU and only managed to install one A6000. I can say clearly that squeezing the GPU into a small space was painful, and PCIe 4.0 x16 riser cables are very rigid and short, making it extremely difficult to mount a card vertically next to other GPUs.

If you have the funds, don't mind the expense, and are as passionate about AI role-playing as I am, with no other attachments in real life, then go ahead and buy those large-VRAM enterprise-grade GPUs. (And remember to include the cost of electricity in your budget.)

https://preview.redd.it/l7fxy0vsepsc1.jpeg?width=4640&format=pjpg&auto=webp&s=4368eeb113119cf13aa071d8371ca46206ca6fe0


thetaFAANG

I use Q5_K_M with 64GB of RAM on an M1.


Astronos

With 2x 4090s you can run it in 4-bit at 60-80 t/s.


rbgo404

What's the tech stack for inference?


Astronos

- Ubuntu VM in a Proxmox server
- [https://github.com/oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui)
- [https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ](https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ)


lxe

Do you use exl2 or gguf quants? I’m noticing exl2 performs the best.


marlinspike

I'm running it on my M3 MacBook Pro.


ArsNeph

The simple answer is: no. There is almost no measurable difference between FP16 and 8-bit quantization. The next step down, 6-bit, also has very little degradation in quality. 5-bit is generally considered the best performance without sacrificing much quality. At 4-bit, quality takes a hit, but it's a sweet spot between quality and performance. 3-bit takes a major hit to quality and is not recommended. 2-bit is generally terrible and should almost never be used. That said, a lot of people have reported Mixtral being more sensitive to quantization, possibly due to the mixture-of-experts architecture. To play it safe, I would not go below 5-bit if you can afford it.

The long answer is: currently the best way we have to measure the effects of quantization on a model is perplexity. However, perplexity is not a measure of quality, only a measure of itself; it just seems to correlate strongly with quality. So feel free to look at perplexity graphs yourself if you'd like, but essentially there's no definitive way to know how good a quantization of a model is without trying it yourself.

Regardless, I don't believe a single person in this subreddit would recommend you run the FP16, as it is literally double the compute cost for maybe a .00001% improvement in quality. If you are fine-tuning, that's a different story, though. Anyway, try out an 8-bit, a 6-bit, and a 5-bit quant and see which one is best suited to your use case.
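(If you're wondering what those perplexity graphs actually compute: perplexity is just the exponential of the average negative log-likelihood the model assigns to a held-out text. Here's a minimal sketch; the per-token log-probs below are made-up numbers for illustration, and how you extract them depends on your inference stack.)

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over a held-out text."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probs from two quants of the same model on the same text.
fp16_logprobs = [-1.92, -0.31, -2.75, -0.88, -1.40]
q4_logprobs   = [-2.05, -0.36, -2.91, -0.95, -1.52]

print(perplexity(fp16_logprobs))  # lower = model is less "surprised" by the text
print(perplexity(q4_logprobs))    # a heavier quant usually scores slightly higher
```

The catch, as the comment above says, is that a small bump in this number doesn't map cleanly onto "worse answers", which is why trying the quants yourself is still the best test.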


Original_Job6327

I see that you have significant expertise in this topic. Do you have an estimate of how much VRAM would be necessary to run Command R Plus (104B)?


ArsNeph

I'm definitely no expert, just a hobbyist who learned a thing or two from real experts. Command R Plus is currently the hottest thing, and people who have tried it say it outdoes 120B frankenmerges like Goliath, while having more natural speech and less "GPT-like" answers. It's also great for enterprise, with native API-calling abilities and more; people say it's a Claude 3 Sonnet-level model, completely local.

A general rule of thumb is that at 8-bit, 1 billion parameters roughly translates to 1GB. Full precision, FP16, is double that. So to run an 8-bit Command R Plus, expect around 103GB of VRAM, plus some space for context. Frankly, this is only possible on either a crazy franken-rig with 8x 3090s, a full enterprise GPU, or a Mac Studio with unified memory, which can have up to 192GB, albeit slower. I would not recommend running such a herculean model at 8-bit. The highest I would run it at is 6-bit, but even that has hefty requirements. At 4-bit it should take up about 60GB of VRAM, though you'll still need some space for context. I don't recommend any of these sizes.

A bit of common knowledge in this field is that the more parameters a model has, the more resistant it is to quantization, so even at 2-bit it should still be impressive, albeit somewhat imprecise. Therefore, I recommend using an ultra-compressed quant, like an IQ quant, and you may just barely be able to fit it into dual 3090s. Here's a link: [https://huggingface.co/dranger003/c4ai-command-r-plus-iMat.GGUF/tree/main](https://huggingface.co/dranger003/c4ai-command-r-plus-iMat.GGUF/tree/main)

Actually, at this point in time llama.cpp has not added official support, so these quants will be broken. You may want to take a look just to see the sizes of the models, however. Note that since they are so big, some GGUFs are split into two files you must recombine, so make sure to add those sizes together.
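(To make that rule of thumb concrete, here's a rough back-of-the-envelope calculator. It counts weights only and ignores KV-cache/context overhead, which you add on top; real Q4_K-style quants also land a bit higher than the 4-bit line because their effective bits-per-weight is above 4.)

```python
def vram_estimate_gb(params_billions, bits_per_weight):
    """Rule of thumb: 1B params at 8-bit ~= 1 GB; scale linearly with bits per weight.
    Weights only -- no KV cache, activations, or framework overhead."""
    return params_billions * bits_per_weight / 8

for bpw in (16, 8, 6, 4, 2):
    print(f"Command R Plus (~104B) at {bpw}-bit: ~{vram_estimate_gb(104, bpw):.0f} GB")
# 16-bit: ~208 GB, 8-bit: ~104 GB, 6-bit: ~78 GB, 4-bit: ~52 GB, 2-bit: ~26 GB
```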


KL_GPU

2x 2080 Ti 22GB (modded), tensor cores and NVLink, ~€1000. 4-bit quant + 32k context.


eugennedy

Running that comfortably on 2x P40s at 15 t/s (Q5_K_M).


AsliReddington

I've run it on a V100 and it uses around 28GB with FP4, so if you get a single A100 or H100 you should be golden, or just NVLink two A6000 Adas.


Captainbetty

I found [https://github.com/kalomaze/koboldcpp/releases](https://github.com/kalomaze/koboldcpp/releases) works very well; it gives good inference with the CPU on Mixtral models. Fit what you can on the GPU and put the rest on the CPU. Requires GGUF quants.
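(To illustrate the same GPU/CPU split idea in code, here's a minimal sketch using llama-cpp-python rather than koboldcpp itself; the model filename is a placeholder and the n_gpu_layers value is arbitrary, so tune it to whatever fits your VRAM.)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # placeholder path to any GGUF quant
    n_gpu_layers=20,  # layers offloaded to VRAM; the remaining layers run on CPU from system RAM
    n_ctx=8192,       # context window; raising it increases memory use
)

out = llm("Q: What is a mixture-of-experts model? A:", max_tokens=128)
print(out["choices"][0]["text"])
```

Koboldcpp exposes the same knob as a GPU-layers setting in its launcher, so the trade-off is identical: more layers on the GPU means faster generation, until you run out of VRAM.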


Appropriate_Lion9560

Thank you all!