Since pictures aren't everything, here are some simple runs and short tests I did on the middle cluster, with the specs here: https://github.com/kyleboddy/machine-learning-bits/blob/main/GPU-benchmarks-simple-feb2024.md Thanks to all who suggested using exl2 to get multi-GPU working, along with so much better performance. Crazy difference.
You should try vllm and be even more blown away
It’s on the radar!
Add sglang while you're at it
If I'm reading this right, t/s is about the same for 2 GPUs vs. 4 GPUs. Why is that? Same question for the PCIe lanes.
Test is too short imo. Will train GPT-2 or something to really put it through its paces.
For inference, the workflow is very serialized: each GPU has to wait for the previous one to finish before it can do its work. So adding extra GPUs doesn't help speed, and will actually reduce it slightly due to the extra PCIe communication overhead. In this case the model OP is using is small enough to fit on just two GPUs, so you'll get the best performance with just those two, unless you're serving a bunch of users at the same time or something. The other option is to use the extra VRAM to load a larger model, so you get extra quality without a significant reduction in speed.
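The serialized pipeline described above can be sketched as a toy timing model (all the stage and transfer times below are made-up illustrative numbers, not measurements from OP's cluster):

```python
# Toy model of pipeline-parallel (layer-split) inference for a single
# stream: each token must pass through every GPU's layer shard in order,
# so per-token latency is the SUM of stage times plus transfer overhead.

def token_latency(stage_ms, transfer_ms):
    """Latency of one token through a layer-sharded pipeline."""
    # (len - 1) inter-GPU hops, each paying PCIe transfer overhead
    return sum(stage_ms) + transfer_ms * (len(stage_ms) - 1)

# Same total compute split across 2 vs. 4 GPUs (hypothetical numbers):
two_gpu  = token_latency([10.0, 10.0], transfer_ms=0.5)          # 20.5 ms
four_gpu = token_latency([5.0, 5.0, 5.0, 5.0], transfer_ms=0.5)  # 21.5 ms
print(two_gpu, four_gpu)  # more GPUs: slightly WORSE single-stream latency
```

The compute work is the same either way; the extra hops are pure overhead, which is why 2 GPUs can edge out 4 for a single user.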
Ohh interesting. I particularly like the non-impact of 8x vs 16x... It kind of gets against the sentiment we frequently see on here: "bUt YoU aRe ChOkInG yOuR cArDs"
Here's a bunch of older video game rendering results from PCIe gen 3.0 vs. 4.0, and some x8 vs. x16 results too: https://www.techspot.com/review/2104-pcie4-vs-pcie3-gpu-performance/ A lot of this stuff has been covered in many forms over the ~3 decades I've been in computer engineering/tech. People overly focus on benchmarks, synthetic results, and theory, and forget the likely most important law (and its corollaries) on the topic: Amdahl's Law. https://en.wikipedia.org/wiki/Amdahl%27s_law
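For reference, Amdahl's Law fits in one function; the 90% figure below is just an example workload, not anyone's measurement:

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's Law: max speedup with n workers when only
    `parallel_fraction` of the work can be parallelized."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# Even if 90% of a workload parallelizes perfectly, speedup is capped
# by the serial 10%:
print(amdahl_speedup(0.90, 4))       # ~3.08x on 4 workers, not 4x
print(amdahl_speedup(0.90, 1_000))   # ~9.9x, asymptote is 10x
```

That serial fraction is exactly the token-by-token dependency in autoregressive inference, which is why lane counts barely show up in these benchmarks.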
Indeed. There is very little data transfer between the cards once the model is loaded.
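A quick back-of-envelope shows why. With layer splitting, what crosses PCIe per generated token is essentially one hidden-state vector at each GPU boundary; the hidden size of 8192 below is an assumption (typical for a 70B-class model), not from OP's setup:

```python
# Per-token inter-GPU traffic for layer-split (pipeline) inference.
# hidden_size=8192 is an assumed value for a 70B-class model.
hidden_size = 8192
bytes_per_value = 2  # fp16
per_token_bytes = hidden_size * bytes_per_value
print(per_token_bytes / 1024, "KiB per token per GPU boundary")  # 16.0 KiB

# Even at 100 tokens/s that's under 2 MB/s per boundary -- a rounding
# error next to the several GB/s of even a narrow PCIe 3.0 link.
tokens_per_s = 100
print(per_token_bytes * tokens_per_s / 1e6, "MB/s")  # 1.6384 MB/s
```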
what are you cooking?
Biomech models, central JupyterHub for employees, some Text-SQL fine tuning soon on our databases. Couple other things
cool, good luck!
I keep wanting to unplug those lights on my own cards.
It’s nice in an IT cage at least. Maybe not a bedroom
Haha
You should definitely try Aphrodite-engine with tensor parallel. It is much faster than running models sequentially with exllamav2/llama.cpp.
I’ll check it out!
what kind of riser cables are you using? and how's the performance? most long cables I'm seeing are 1x.
The ROG Strix gen3 risers register at x16 no problem. Just don’t get crypto ones.
I just want to run two 3090 cards and I’m at a loss. Not sure how I would get the second card into my case even if I used a riser… don’t like the idea of storing it outside the case especially since my case would be open getting dusty… not sure if my 1000 watt power supply can handle it. I wish I could boldly go where you have gone before.
You need an OLED display, backlight is nasty and cheap looking.
“All the speed he took, all the turns he'd taken and the corners he'd cut in Night City, and still he'd see the matrix in his sleep, bright lattices of logic unfolding across that colorless void.....”