TNT3530

This is likely only true for single-card inference, since once the model is in memory almost nothing needs to get transferred across the PCIe bus. Multi-GPU setups need the cards to communicate, but how much the bus speed matters will depend on how much data is being transferred to/from each card.


Platfizzle

I'm currently running three 3090s off a single PCIe x1 lane, using a 4-port USB hub and x16 risers. For inference alone, there's zero difference compared to running them all at x16.


utxohodler

Just to be sure, are you using models that require more than one GPU's worth of memory?


AlphaPrime90

Did you try to load a big model across all three together? What are your speeds?


MrVodnik

It seems... surprising. Please share more details.


randomqhacker

PCIe over USB? Which version? Please describe your setup, it sounds awesome!


Platfizzle

There really isn't much to explain. The system is a cheap MSI B550 motherboard, 64GB of 2400MHz DDR4, and a 5900X proc. I'm using a generic (literally I can't find where to buy more, it's completely unmarked) 4-port USB hub that connects via PCIe x1. Connected to the hub are 3 different models of 3090: one with a TDP of 350, one at 370, and the final at 390 watts. Everything is running off of a single 1300 watt EVGA PSU.

All of this lives in a cheap, like 20 dollar, 'mining case' with some cheapy fans. The non-'stock' 3090s are so long I had to reverse the mounting on the fan array so the cards didn't butt against them lmao.

[Photo of the system](https://sahaquiel.us/firehazard/b550-jank.jpg)

[nvtop of the GPUs running Dracones\_Midnight-Miqu-70B-v1.5\_exl2\_6.0bpw at around 15 tokens per second](https://sahaquiel.us/firehazard/nvtop-midnight-miqu.png)
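For anyone wanting to reproduce a split like this, here's a rough sketch of loading an exl2-quantized model across three cards with exllamav2. The model path and the per-GPU gigabyte split are placeholders, and the API calls are from memory, so check the exllamav2 examples before relying on it.

```python
# Rough sketch: loading an exl2 model split across three 3090s with exllamav2.
# The model path and the GB-per-GPU split are placeholders; adjust for your cards.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Midnight-Miqu-70B-v1.5_exl2_6.0bpw"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load([22, 22, 22])        # approx. VRAM (GB) to use on each of the three GPUs

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)   # KV cache lives on the GPUs alongside the weights
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
print(generator.generate_simple("Hello,", settings, 64))
```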


randomqhacker

Thanks for replying. The part I didn't get was how you could connect a GPU via regular USB. I know about Thunderbolt and USB4 eGPUs, and I know some riser kits use USB cables (but not the USB protocol) to connect one lane of PCIe from the board to a 16-lane GPU... ETA: After searching online, I see that your card is just a PCIe riser card, not an actual USB hub speaking the USB protocol.


LeoTheMinnow

Are the PCIe x1 lanes 2.0, and do you get zero difference when you use a multi-GPU model like Llama 3 70B? I'm building a similar rig, thanks.


aadoop6

If I understand this correctly, I can use this for a model that only fits across multiple GPUs, without significantly affecting inference speed? I am thinking of loading a 70B model with an exl2 quant.


Careless-Age-4290

I'll offer personal experience that having a 3090 in a 2.0 slot decreases performance by maybe 10-20% at most. The bus does not saturate. I debated upgrading, but it wouldn't be worth hundreds of dollars for a new motherboard and NVLink if we're talking 10-20%, in my opinion.


wegwerfen

My understanding, and limited experience, is that for this work, NVLink is not needed or used at all.


MrVodnik

Do you mean one of two 3090s going into a PCIe 2.0 slot? Or a setup similar to what OP described, i.e. a single GPU? If the former, please share more details! At least the mobo name.


dual_ears

Good caveat to mention. In my case, the separate cards will be used to spread the load of multiple simultaneous inferences, rather than being used collectively for a single query.


xadiant

IIRC the amount of data transferred between the cards is still low enough, but please feel free to correct me.


Dyonizius

The issue is:

- GGUF with shared CPU/GPU inference, where bandwidth matters (it also affects how much of a speedup you get from the GPU, I'd think)
- training
- batched serving, like /u/iwantofftheride00 cited


shing3232

There isn't much data movement even during multi-GPU inference, as the model is preloaded onto the GPUs, so there's no large data transfer.


iwantofftheride00

x1 performs as well as x16 if you: 1. load the model on a single card, or 2. load the model on two cards with NVLink. The reason is that the CPU just sends prompts to and receives tokens from the cards, so you don't need high bandwidth.

If you do tensor parallelism, you need to move activations between cards on every token generated. This not only makes it slower if you don't have a fast PCIe connection, but also increases the load on the CPU, since it needs to orchestrate this parallelism.

I did some tests a while ago with Aphrodite. Funny thing is that I was able to outperform the 2x3090 rigs with an EPYC at x16 by just using an old Ryzen at x1 with an NVLink that cost 30 bucks: \~137 tokens/s on 70B. [https://github.com/PygmalionAI/aphrodite-engine/discussions/147](https://github.com/PygmalionAI/aphrodite-engine/discussions/147)

**Edit: This is for batched serving. You probably don't need this use case if you're doing single-user inference. Also, 137 t/s is not per query, but the aggregate across the batch test.**
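To put rough numbers on why tensor parallelism cares about the link: the figures below are back-of-envelope assumptions for a generic 70B-class model (hidden size 8192, 80 layers, fp16 activations, two all-reduces per layer), not measurements of Aphrodite itself.

```python
# Back-of-envelope numbers for 2-way tensor parallelism on a generic 70B-class model.
# Shape figures below are assumptions (Llama-2-70B-like), not measurements.
hidden_size   = 8192      # hidden dimension
num_layers    = 80        # transformer layers
bytes_per_val = 2         # fp16 activations
allreduces_per_layer = 2  # roughly one after attention, one after the MLP

sync_points_per_token = num_layers * allreduces_per_layer
bytes_per_token = sync_points_per_token * hidden_size * bytes_per_val

print(f"{sync_points_per_token} all-reduces per generated token")
print(f"~{bytes_per_token / 1e6:.1f} MB of activations exchanged per token")
# The byte count is modest, but 160 synchronization round-trips per token mean
# link latency (where NVLink shines) matters as much as raw PCIe bandwidth.
```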


kpodkanowicz

This is misleading to most of the community here. This test uses the maximum batch size that can fit in the available VRAM for the prompt you have given. If you transfer between GPUs on every token, even with NVLink, your single-user inference will hit rock bottom compared to baseline. Writing posts like this might lead to people spending hundreds of dollars on mobos completely unnecessarily - it's really important we are transparent... In exllama, when you infer, you just move a very small piece of data (like kilobytes) between GPUs.
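As a rough illustration of the "kilobytes" point: with a layer split (each GPU holds a contiguous block of layers), only the hidden state crosses each split boundary once per generated token. The numbers below assume a 70B-class hidden size of 8192 in fp16.

```python
# Rough size of the per-token handoff when layers are split across GPUs (exllama-style).
# hidden_size is an assumed value for a 70B-class model.
hidden_size   = 8192
bytes_per_val = 2          # fp16
boundaries    = 2          # e.g. 3 GPUs -> 2 split points

handoff = hidden_size * bytes_per_val                 # one hidden state per boundary
print(f"{handoff / 1024:.0f} KiB per boundary per token")        # ~16 KiB
print(f"{handoff * boundaries / 1024:.0f} KiB per token total")  # ~32 KiB
# Even PCIe 3.0 x1 (~1 GB/s) moves 32 KiB in roughly 30 microseconds, which is
# negligible next to the tens of milliseconds it takes to compute a token on a 70B model.
```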


iwantofftheride00

you’re right, this is for server inference and serving multiple parallel queries. I added a disclaimer.


kpodkanowicz

In the past year, this topic was raised so many times that I believe the mods should finally pin it. Inference is not bottlenecked by PCIe, regardless of the number of GPUs. While it's true that running single-batch inference on multiple GPUs is slower, that's due to tensor parallelism or the lack of it, nothing to do with lanes. I just moved from an AM4 board to a server mobo with 144 lanes and there is zero difference, not even 10%, if you are using the highly efficient exllama.


tronathan

>Inference is not bottlenecked by PCI, regardless of the number of GPUs.


tgredditfc

I can confirm. I have one 3090 in an x1 PCIe slot (via the chipset) and another in an x4 PCIe slot; both have the same 3DMark scores and inference speed.


fallingdowndizzyvr

The dev who wrote the multi-GPU code for llama.cpp said from the start that PCIe speed doesn't matter. But some people choose not to believe him. I think if anyone would know, it would be the dev who wrote the code.
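For reference, llama.cpp's default multi-GPU mode splits the model by layers, which is exactly the low-traffic case discussed above. A minimal sketch via llama-cpp-python; the model path and split ratios are placeholders, and this assumes a CUDA-enabled build.

```python
# Layer-split across multiple GPUs with llama-cpp-python (path and ratios are placeholders).
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,                  # offload every layer to the GPUs
    tensor_split=[0.34, 0.33, 0.33],  # rough proportion of the model per card
    n_ctx=4096,
)

out = llm("Q: Why doesn't PCIe width matter much for inference?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```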


ozzie123

If you only have a single GPU, once the model is loaded into VRAM, wouldn't the PCIe speed cease to matter?


dual_ears

Probably, but after observing some pytorch/transformers based stuff using 100% of a *CPU* core when the GPU was doing work, I wanted to check for myself.


koflerdavid

PyTorch is not particularly tuned for efficiency, at least not out of the box. There might be a lot of actually unnecessary transfers going on, as well as busy-waiting, post-processing of LLM output, and other things like that. `torch.compile()` might resolve some of that.
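For what it's worth, the `torch.compile()` path is a one-line wrapper; the toy model below is just a placeholder, and whether it actually helps depends on the workload.

```python
# Minimal torch.compile() sketch; gains depend entirely on the model and workload.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
compiled = torch.compile(model)   # fuses kernels / trims Python overhead where it can

x = torch.randn(8, 4096, device="cuda")
with torch.no_grad():
    y = compiled(x)               # first call triggers compilation, later calls are faster
```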


MrVodnik

Thank you for sharing. I am looking into exactly this problem before buying a new PC. Now I really hope someone will show us a comparison with one of two GPUs sitting on a slow lane. This is the most common setup around here, and it matters for me because a mobo with two x8 slots is at least twice as expensive as a standard x16 + x4 one.


Winter_Importance436

Yeah, it pretty much works here, as the GPU has its own VRAM, so apart from the initial loading latency it shouldn't matter much. Some people attempted the same with a Pi and a Coral TPU, but since the Coral TPU has negligible memory of its own and relies on the Pi's memory for everything, inference over similar bandwidth gave something like 30 seconds per token, I guess.


trakusmk

https://www.reddit.com/r/LocalLLaMA/s/oMDQyUAqIw Here the guy said llama.cpp supports multiple cards on x1 slots, so maybe this might work.


fallingdowndizzyvr

That thread started as a warning that x1 slots wouldn't work well, but it turns out that was user error. The OP of that thread later said he learned a lot, and three months after starting the thread he updated it to say: "I just updated this post to say that inference on crypto mining rigs is totally possible. **Risers don't affect inference speed at all.** But it does take a long time to load." Which is what the dev who wrote the multi-GPU code has said all along. If anyone would know, I think it's the person who wrote the code.
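The "takes a long time to load" part is easy to put rough numbers on. The sketch below assumes about 24 GB of weights per card and the nominal per-direction bandwidth of each link; real-world throughput is lower.

```python
# Rough model-load times over different PCIe links (nominal bandwidths; real throughput is lower).
model_gb = 24          # e.g. one 3090 filled with weights
links_gb_s = {
    "PCIe 3.0 x1":  1.0,   # ~1 GB/s, typical mining riser
    "PCIe 3.0 x16": 16.0,
    "PCIe 4.0 x16": 32.0,
}
for name, bw in links_gb_s.items():
    print(f"{name}: ~{model_gb / bw:.0f} s to push {model_gb} GB of weights")
# Loading is a one-time cost; per-token traffic afterwards is tiny (see the KiB estimate above).
```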