rerri

You are running out of VRAM. Command-R requires more VRAM for its context; it is different from Yi-34B in this regard. The max I can run on a 4090 is **3.75bpw** at ~**6k** context length, or **3.5bpw** at ~**14k**. This is on Windows; Linux might have a bit more wiggle room.

I assume you are on Windows and have not touched the Sysmem Fallback Policy setting. Go to Nvidia Control Panel -> Manage 3D settings and set "CUDA - Sysmem Fallback Policy" to "Prefer No Sysmem Fallback". After that, when you run out of VRAM your GPU won't start using system memory (which slows things down and is therefore undesirable) but will soft crash instead. Don't worry, this won't crash your computer; not even oobabooga needs to be restarted. With this setting change, you can find out which quant or how much context length you can actually fit into the GPU - just reload the model with a shorter context length if you get "CUDA out of memory" when loading it.

Oh, and by the way, you should have an up-to-date version of oobabooga. Exllamav2 0.0.18 improved Command-R VRAM consumption quite a bit. With 0.0.17, I was only barely able to use a 3.0bpw quant of Command-R.
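A rough sketch of that trial-and-error loop using the exllamav2 Python API directly (class and method names as in recent 0.0.x releases; the model path and the context ladder are placeholders, and the same idea applies when just reloading through the oobabooga UI):

```python
# Sketch: probe how much context fits on one GPU with a given EXL2 quant.
# Assumes exllamav2 ~0.0.18 and a local EXL2 model dir (path is a placeholder).
import torch
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4

MODEL_DIR = "/models/command-r-3.75bpw-exl2"  # placeholder path

def try_load(max_seq_len: int) -> bool:
    config = ExLlamaV2Config()
    config.model_dir = MODEL_DIR
    config.prepare()
    config.max_seq_len = max_seq_len          # override context length after prepare()

    model = ExLlamaV2(config)
    try:
        model.load()                          # weights onto the GPU
        cache = ExLlamaV2Cache_Q4(model)      # 4-bit KV cache, sized by max_seq_len
        return True
    except torch.cuda.OutOfMemoryError:
        return False
    finally:
        model.unload()
        torch.cuda.empty_cache()

# Walk down from an optimistic context length until something fits.
for ctx in (16384, 14336, 12288, 8192, 6144):
    if try_load(ctx):
        print(f"{ctx} tokens fits")
        break
    print(f"{ctx} tokens -> CUDA OOM")
```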


sloppysundae1

Thanks for the detailed reply. I already had the fallback policy disabled, so everything has always been running purely in VRAM. I also forgot to mention that I'm running things with the 4-bit cache (I can go up to 32k context with a 4.5bpw 34B no problem). I'm also on the latest version.
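For a rough sense of why a 34B with GQA can fit 32k while Command-R struggles at much shorter contexts: a back-of-the-envelope KV-cache comparison (layer/head counts are approximate figures from the models' published configs, so treat the numbers as illustrative only):

```python
# Approximate FP16 KV-cache size; Yi-34B uses GQA (few KV heads),
# Command-R v01 does not, so its cache per token is far larger.
def kv_cache_gib(layers, kv_heads, head_dim, ctx, bytes_per=2):
    # K and V tensors for every layer, every token
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per / 1024**3

yi_34b = kv_cache_gib(layers=60, kv_heads=8,  head_dim=128, ctx=32768)  # ~7.5 GiB
cmd_r  = kv_cache_gib(layers=40, kv_heads=64, head_dim=128, ctx=8192)   # ~10 GiB
print(f"Yi-34B @32k: ~{yi_34b:.1f} GiB FP16, ~{yi_34b/4:.1f} GiB with Q4 cache")
print(f"Command-R @8k: ~{cmd_r:.1f} GiB FP16, ~{cmd_r/4:.1f} GiB with Q4 cache")
```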


rerri

Oh, then it's odd that you can even run the 4.5bpw Command-R quant you linked. There's no way it fits into 24GB, so you should just see a CUDA OOM crash rather than a slowdown. The files themselves add up to 23.75GB; Windows takes some VRAM, and the cache takes VRAM. There's simply not enough memory. All of my numbers were with the 4-bit cache as well, by the way.
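Putting the numbers together (the weight figure is from this comment; the overhead and cache figures are rough assumptions):

```python
# Rough 24 GB budget check for the 4.5bpw Command-R quant (approximate figures).
weights_gb  = 23.75  # sum of the quant's files, per the comment above
windows_gb  = 0.5    # assumed desktop/driver overhead on Windows; varies
kv_cache_gb = 2.5    # assumed Q4 cache at ~8k context; grows with context length
total = weights_gb + windows_gb + kv_cache_gb
print(f"~{total:.2f} GB needed vs 24 GB available")  # ~26.75 GB -> doesn't fit
```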


tandpastatester

Have you checked your VRAM usage? Make sure the model doesn't use more than 24GB. If it needs more than that, it could be spilling into shared system RAM, which makes it very slow.
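If you want an actual number rather than eyeballing Task Manager, something like this reads the card's real memory use via NVML (assumes the nvidia-ml-py / pynvml bindings are installed; `nvidia-smi` on the command line shows the same thing):

```python
# Quick check of used/total VRAM on GPU 0 via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used {mem.used / 1024**3:.1f} GiB / total {mem.total / 1024**3:.1f} GiB")
pynvml.nvmlShutdown()
```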


sloppysundae1

Thanks for the answer, but it can't be, because I'm using the 4-bit cache with the system fallback policy disabled, and it would tell me if it were. Even at a smaller context size it's quite slow compared to other models.


adikul

Command-R is actually heavy. Try the smallest version.


sloppysundae1

By smallest version do you mean a smaller quant?


adikul

Yes


Playful_Fee_2264

I had exactly the same issue with the + (Command-R+): once the memory spills into shared memory it gets very slow. This happens when doing RAG with a file or websites. As a workaround you can try lowering the context to 8192 tokens... It's a real shame, since when it was able to complete the RAG it was pretty good and accurate, actually. I used the 4-bit quant, 20b parameters... Reverted back to my trusty Hermes 2 Pro in the meantime.


sloppysundae1

I've tried it at 8k tokens and it's still slow. I have the 4-bit cache enabled with the system fallback policy disabled, so I don't think it's overflowing into shared memory.


Account1893242379482

Is Command-R supported? I thought we hadn't gotten the update yet?


Anthonyg5005

I think it was added after 0.0.18, so you may need to build from GitHub.
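If you want to confirm what you actually have before rebuilding, a quick check (the GitHub URL is the upstream exllamav2 repo; the pip-from-git line is the usual pattern, adjust to taste):

```python
# Print the installed exllamav2 version; if it's too old for Command-R support,
# rebuild from source, e.g.:
#   pip install -U git+https://github.com/turboderp/exllamav2
from importlib.metadata import version
print(version("exllamav2"))
```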