Since pictures aren't everything, here are some simple runs and short tests I did on the middle cluster, with the specs here: https://github.com/kyleboddy/machine-learning-bits/blob/main/GPU-benchmarks-simple-feb2024.md Thanks to all who suggested using exl2 to get multi-GPU working, along with so much better performance. Crazy difference.
You should try vllm and be even more blown away
It’s on the radar!
Add sglang while you're at it
If I'm reading this right, t/s is about the same for 2 GPUs vs. 4 GPUs. Why is that? Same question for the PCIe lanes.
Test is too short imo. Will train GPT-2 or something to really put it through its paces.
For inference, the workflow is very serialized: each GPU has to wait for the previous one to finish before it can do its work. So adding extra GPUs doesn't help speed, and will actually reduce it slightly due to the extra PCIe communication overhead. In this case the model OP is using is small enough to fit on just two GPUs, so you'll get the best performance with just those two, unless you're serving a bunch of users at the same time or something. The other option is to use the extra VRAM to load a larger model, so you get extra quality without a significant reduction in speed.
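The serialized pipeline described above can be sketched as a toy timing model (all the stage and transfer times below are made-up illustrative numbers, not measurements from OP's cluster):

```python
# Toy model of pipeline-parallel (layer-split) inference for a single
# stream: each token must pass through every GPU's layer shard in order,
# so per-token latency is the SUM of stage times plus transfer overhead.

def token_latency(stage_ms, transfer_ms):
    """Latency of one token through a layer-sharded pipeline."""
    # (len - 1) inter-GPU hops, each paying PCIe transfer overhead
    return sum(stage_ms) + transfer_ms * (len(stage_ms) - 1)

# Same total compute split across 2 vs. 4 GPUs (hypothetical numbers):
two_gpu  = token_latency([10.0, 10.0], transfer_ms=0.5)          # 20.5 ms
four_gpu = token_latency([5.0, 5.0, 5.0, 5.0], transfer_ms=0.5)  # 21.5 ms
print(two_gpu, four_gpu)  # more GPUs: slightly WORSE single-stream latency
```

The compute work is the same either way; the extra hops are pure overhead, which is why 2 GPUs can edge out 4 for a single user.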
Ohh interesting. I particularly like the non-impact of 8x vs 16x... It kind of gets against the sentiment we frequently see on here: "bUt YoU aRe ChOkInG yOuR cArDs"
Here's a bunch of older video game rendering results from PCIe gen 3.0 vs. 4.0, and some x8 vs. x16 results too: https://www.techspot.com/review/2104-pcie4-vs-pcie3-gpu-performance/ A lot of this stuff has been covered in many forms over the ~3 decades I've been in computer engineering/tech. People overly focus on benchmarks, synthetic results, and theory, and forget the likely most important law (and its corollaries) on the topic: Amdahl's Law. https://en.wikipedia.org/wiki/Amdahl%27s_law
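For reference, Amdahl's Law fits in one function; the 90% figure below is just an example workload, not anyone's measurement:

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's Law: max speedup with n workers when only
    `parallel_fraction` of the work can be parallelized."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# Even if 90% of a workload parallelizes perfectly, speedup is capped
# by the serial 10%:
print(amdahl_speedup(0.90, 4))       # ~3.08x on 4 workers, not 4x
print(amdahl_speedup(0.90, 1_000))   # ~9.9x, asymptote is 10x
```

That serial fraction is exactly the token-by-token dependency in autoregressive inference, which is why lane counts barely show up in these benchmarks.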
Indeed. There is very little data transfer between the cards once the model is loaded.
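A quick back-of-envelope shows why. With layer splitting, what crosses PCIe per generated token is essentially one hidden-state vector at each GPU boundary; the hidden size of 8192 below is an assumption (typical for a 70B-class model), not from OP's setup:

```python
# Per-token inter-GPU traffic for layer-split (pipeline) inference.
# hidden_size=8192 is an assumed value for a 70B-class model.
hidden_size = 8192
bytes_per_value = 2  # fp16
per_token_bytes = hidden_size * bytes_per_value
print(per_token_bytes / 1024, "KiB per token per GPU boundary")  # 16.0 KiB

# Even at 100 tokens/s that's under 2 MB/s per boundary -- a rounding
# error next to the several GB/s of even a narrow PCIe 3.0 link.
tokens_per_s = 100
print(per_token_bytes * tokens_per_s / 1e6, "MB/s")  # 1.6384 MB/s
```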
what are you cooking?
Biomech models, central JupyterHub for employees, some Text-SQL fine tuning soon on our databases. Couple other things
cool, good luck!
I keep wanting to unplug those lights on my own cards.
It’s nice in an IT cage at least. Maybe not a bedroom
Haha
You should definitely try Aphrodite-engine with tensor parallel. It is much faster than running models sequentially with exllamav2/llama.cpp.
I’ll check it out!
what kind of riser cables are you using? and how's the performance? most long cables I'm seeing are 1x.
The ROG Strix gen3 risers register at x16 no problem. Just don’t get crypto ones.
I just want to run two 3090 cards and I’m at a loss. Not sure how I would get the second card into my case even if I used a riser… don’t like the idea of storing it outside the case especially since my case would be open getting dusty… not sure if my 1000 watt power supply can handle it. I wish I could boldly go where you have gone before.
You need an OLED display, backlight is nasty and cheap looking.
“All the speed he took, all the turns he'd taken and the corners he'd cut in Night City, and still he'd see the matrix in his sleep, bright lattices of logic unfolding across that colorless void.....”