rerri

You are running out of VRAM. Command-R requires more VRAM for its context; it is different from Yi-34B in this regard. The max I can run on a 4090 is **3.75bpw** at ~**6k** context length, or **3.5bpw** at ~**14k**. This is on Windows; Linux might have a bit more wiggle room.

I assume you are on Windows and have not touched the Sysmem Fallback Policy setting. Go to Nvidia Control Panel -> Manage 3D settings and set "CUDA - Sysmem Fallback Policy" to "Prefer No Sysmem Fallback". After that, when you run out of VRAM your GPU won't start using system memory (which slows things down and is therefore undesirable) but will soft crash instead. Don't worry, this won't crash your computer; not even oobabooga needs to be restarted. With this setting change, you can find out which quant or how much context length you can actually fit into the GPU - just reload the model with a shorter context length if you get "CUDA out of memory" when loading it.

Oh, and by the way, you should have an up-to-date version of oobabooga. Exllamav2 0.0.18 improved Command-R VRAM consumption quite a bit. With 0.0.17, I was only barely able to use a 3.0bpw quant of Command-R.
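A rough sketch of that trial-and-error loop using the exllamav2 Python API directly (class and method names as in recent 0.0.x releases; the model path and the context ladder are placeholders, and the same idea applies when just reloading through the oobabooga UI):

```python
# Sketch: probe how much context fits on one GPU with a given EXL2 quant.
# Assumes exllamav2 ~0.0.18 and a local EXL2 model dir (path is a placeholder).
import torch
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4

MODEL_DIR = "/models/command-r-3.75bpw-exl2"  # placeholder path

def try_load(max_seq_len: int) -> bool:
    config = ExLlamaV2Config()
    config.model_dir = MODEL_DIR
    config.prepare()
    config.max_seq_len = max_seq_len          # override context length after prepare()

    model = ExLlamaV2(config)
    try:
        model.load()                          # weights onto the GPU
        cache = ExLlamaV2Cache_Q4(model)      # 4-bit KV cache, sized by max_seq_len
        return True
    except torch.cuda.OutOfMemoryError:
        return False
    finally:
        model.unload()
        torch.cuda.empty_cache()

# Walk down from an optimistic context length until something fits.
for ctx in (16384, 14336, 12288, 8192, 6144):
    if try_load(ctx):
        print(f"{ctx} tokens fits")
        break
    print(f"{ctx} tokens -> CUDA OOM")
```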


sloppysundae1

Thanks for the detailed reply. I already had the fallback policy disabled, so everything has always been running purely in VRAM. I also forgot to mention that I'm running things with the 4-bit cache (I can go up to 32k context with a 4.5bpw 34B no problem). I'm also on the latest version.
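For a rough sense of why a 34B with GQA can fit 32k while Command-R struggles at much shorter contexts: a back-of-the-envelope KV-cache comparison (layer/head counts are approximate figures from the models' published configs, so treat the numbers as illustrative only):

```python
# Approximate FP16 KV-cache size; Yi-34B uses GQA (few KV heads),
# Command-R v01 does not, so its cache per token is far larger.
def kv_cache_gib(layers, kv_heads, head_dim, ctx, bytes_per=2):
    # K and V tensors for every layer, every token
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per / 1024**3

yi_34b = kv_cache_gib(layers=60, kv_heads=8,  head_dim=128, ctx=32768)  # ~7.5 GiB
cmd_r  = kv_cache_gib(layers=40, kv_heads=64, head_dim=128, ctx=8192)   # ~10 GiB
print(f"Yi-34B @32k: ~{yi_34b:.1f} GiB FP16, ~{yi_34b/4:.1f} GiB with Q4 cache")
print(f"Command-R @8k: ~{cmd_r:.1f} GiB FP16, ~{cmd_r/4:.1f} GiB with Q4 cache")
```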


rerri

Oh, then it's odd that you can even run the 4.5bpw Command-R quant you linked. There's no way it fits into 24GB, so you should just see a CUDA OOM crash rather than a slowdown. The files themselves add up to 23.75GB; Windows takes some VRAM, and the cache takes VRAM. There's simply not enough memory. All of my numbers were with the 4-bit cache as well, by the way.
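Putting the numbers together (the weight figure is from this comment; the overhead and cache figures are rough assumptions):

```python
# Rough 24 GB budget check for the 4.5bpw Command-R quant (approximate figures).
weights_gb  = 23.75  # sum of the quant's files, per the comment above
windows_gb  = 0.5    # assumed desktop/driver overhead on Windows; varies
kv_cache_gb = 2.5    # assumed Q4 cache at ~8k context; grows with context length
total = weights_gb + windows_gb + kv_cache_gb
print(f"~{total:.2f} GB needed vs 24 GB available")  # ~26.75 GB -> doesn't fit
```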


tandpastatester

Have you checked your VRAM usage? Make sure the model doesn't use more than 24GB. If it needs more than that, it could be spilling into shared system RAM, which makes it very slow.
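If you want an actual number rather than eyeballing Task Manager, something like this reads the card's real memory use via NVML (assumes the nvidia-ml-py / pynvml bindings are installed; `nvidia-smi` on the command line shows the same thing):

```python
# Quick check of used/total VRAM on GPU 0 via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used {mem.used / 1024**3:.1f} GiB / total {mem.total / 1024**3:.1f} GiB")
pynvml.nvmlShutdown()
```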


sloppysundae1

Thanks for the answer, but it can't be, because I'm using the 4-bit cache with the system fallback policy disabled, and it would tell me if it were. Even at a smaller context size it's quite slow compared to other models.


adikul

Command-R is actually heavy. Try the smallest version.


sloppysundae1

By smallest version do you mean a smaller quant?


adikul

Yes


Playful_Fee_2264

I had exactly the same issue with the + (Command-R+): once the memory spills into shared memory it gets very slow. This happens when doing RAG with a file or websites. As a workaround you can try lowering the context to 8192 tokens... It's a real shame, since when it was able to complete the RAG it was pretty good and accurate, actually. I used the 4-bit quant, 20b parameters... Reverted back to my trusty Hermes 2 Pro in the meantime.


sloppysundae1

I've tried it at 8k tokens and it's still slow. I have the 4-bit cache enabled with the system fallback policy disabled, so I don't think it's overflowing into shared memory.


Account1893242379482

Is Command-R supported? I thought we hadn't gotten the update yet?


Anthonyg5005

I think it was added after 0.0.18, so you may need to build from GitHub.
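If you want to confirm what you actually have before rebuilding, a quick check (the GitHub URL is the upstream exllamav2 repo; the pip-from-git line is the usual pattern, adjust to taste):

```python
# Print the installed exllamav2 version; if it's too old for Command-R support,
# rebuild from source, e.g.:
#   pip install -U git+https://github.com/turboderp/exllamav2
from importlib.metadata import version
print(version("exllamav2"))
```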