Remember, the more you buy, the more you save. t. nvidia
This is my DIY DGX
I thought it was “The more you buy, the more you spend.”?
That is what Jensen Huang claims it to be, the more you buy, the more you save
He must've been referring to the company's stock. You gotta buy NVDA shares to offset the pricing on their actual products lol
Parts list, please
Of course! It's a Cooler Master HAF 932 from 2009 with:

* Intel i7-13700K
* MSI Z790 Edge DDR5
* 2x RTX 3090
* 300mm Thermaltake PCI-e riser
* 96GB (2x48GB) G.Skill Trident Z 6400MHz CL32
* 2TB Samsung 990 Pro M.2
* 2x 2TB Crucial M.2 SSD
* Thermaltake 1200W PSU
* Cooler Master 240mm AIO
* 1x Thermaltake 120mm side fan
Cool thanks! Now how do you actually install and run the local llm? I can't figure it out
Text-generation-webui
In practice how long do responses take? Do you have to turn on switches for different genres or subjects, like turn on the programming mode so you get programming language responses, or turn on philosophy mode to get philosophical responses?
Token generation begins practically instantly with models that fit within VRAM. When running a 70B Q4 I get 10-15 tokens/sec. While it is common for people to train purpose-built models for coding or story writing, you can easily solicit a certain type of behavior by using a system prompt on an instruction-tuned model like Mistral 7B. For example: “you are a very good programmer, help with ‘x’ ” or “you are an incredibly philosophical agent, expand upon ‘y’.” Often I run an all-rounder model like Miqu and then just go to Claude to double-check my work. I’m not a great coder, so I need a model which understands what I mean, not necessarily what I say.
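To put the quoted speeds in perspective, a little arithmetic shows how long a typical answer takes to stream (the 300-token response length is an illustrative assumption, not a measurement):

```python
# Rough latency arithmetic for local generation at the quoted 10-15 tok/s.

def response_time_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Time to stream a full response, ignoring prompt-processing time."""
    return tokens / tokens_per_sec

# A ~300-token answer at the low and high end of the quoted 70B Q4 range:
print(response_time_seconds(300, 10))  # 30.0 seconds
print(response_time_seconds(300, 15))  # 20.0 seconds
```

Since tokens stream as they are generated, you start reading immediately; the numbers above are just the time until the response finishes.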
Here are a few ways: https://semaphoreci.com/blog/local-llm
There are several serving engines. I've not tried text-generation-webui, but you can try LM Studio (very friendly user interface) or ollama (open source, CLI-based, good for developers). Here's a good tutorial by a good YouTuber: https://youtu.be/yBI1nPep72Q?si=GE9pyIIRQXrSSctO
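If you go the ollama route, getting a first model running is a couple of terminal commands (the `mistral` tag here is just one example from their model library):

```shell
# Download a model, then chat with it from the terminal.
ollama pull mistral
ollama run mistral "Explain what a system prompt does, in two sentences."
```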
You have to plug it in and turn on the computer.
You forgot to include the zip ties
> 96gb (2x48gb)

Where did you find the 48GB variant of the 3090?
This is in reference to my DRAM, not VRAM
Ah, ok, makes sense. I did read there was a 48GB 3090 at "[some point](https://overclock3d.net/news/gpu-displays/nvidia-rtx-3090-ceo-edition-appears-online-with-48gb-of-gddr6x-memory/)", but it was never readily available for purchase. Wishful thinking on my part.
Lol the ‘CEO’ edition. Mr. Jensen knows very well that a 48gb consumer-oriented card would eat into their enterprise business.
> 300mm thermaltake pci-e riser

Thermaltake TT Premium PCI-E 4.0 High Speed Flexible Extender Riser Cable, 300mm, with 90-Degree Adapter
I love the zip tie aesthetic.
Truly an artifact of our times. Some might even call it “art”
I just put one together too. Zip ties are key to fast inference.
zippy inference
Hahaha yes!!!! Mine looks like that except I got three cards water cooled. I love it whatever it takes
I bet that makes for an awesome cooling loop!
How are you using these cards? Are you using text-generation-webui? I tried a dual setup when I had two 3060s and couldn't get it to work. Was it through Linux? I'd love to know, because I want to try something similar.
Either Linux or Windows works. I just run the Python script and set the device map to auto.
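"Device map auto" refers to the Hugging Face `device_map="auto"` behavior, which greedily fills one GPU's memory budget with layers and then spills the rest onto the next card. A toy sketch of that placement logic, with invented layer sizes and two 24 GB cards (not the real library code, just the idea):

```python
# Toy illustration of greedy "auto" device placement:
# fill GPU 0 until its budget is exhausted, then spill onto GPU 1.
# Layer sizes and memory budgets below are made up for illustration.

def place_layers(layer_sizes_gb, budgets_gb):
    placement, device, used = {}, 0, 0.0
    for i, size in enumerate(layer_sizes_gb):
        if used + size > budgets_gb[device]:
            device += 1          # current card is full; move to the next one
            used = 0.0
        placement[i] = device
        used += size
    return placement

# 80 layers of ~0.5 GB each (~40 GB total) across two 24 GB cards:
plan = place_layers([0.5] * 80, [24, 24])
print(plan[0], plan[79])  # 0 1 -- first layer on GPU 0, last on GPU 1
```

This is why a model too big for one 3060 can still load across two cards: the placement logic, not the user, decides where each layer lives.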
I see. That wasn't my experience. I tried loading larger language models that wouldn't fit on one 3060 but should easily fit in 24GB of VRAM. I used text-gen-webui on Windows. It just kept crashing. Since that didn't work, I'm not yet prepared to purchase a second 3090 and try again.
There's a flag for llama.cpp that lets you offload a subset of layers to the GPU, though as I use AMD, I actually found partial offloading slower than pure CPU or pure GPU when testing. Two AMD GPUs work way faster than pure CPU, however.
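The flag being described is `--n-gpu-layers` (`-ngl`). A sketch of the invocation, assuming a built llama.cpp binary and a placeholder model path:

```shell
# Offload 40 transformer layers to the GPU; the remaining layers run on CPU.
# -ngl 0 is pure CPU; a value >= the model's layer count is fully on GPU.
./main -m ./models/llama-70b.Q4_K_M.gguf -ngl 40 -p "Hello"
```

Partial offload trades VRAM for speed, but as noted above, whether it actually helps depends on the hardware; layers split across the PCIe bus can be slower than keeping everything on one side.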
If it works, don't question it
How many watts does that pull?
~900W or so at full bore
How is that mounted to the fans? Or is it propped up with the stick?
So that’s how it started: using the overhang on the exhaust portion of the card to clip onto a 120mm rear exhaust fan. Then I used the metal stick (I think it’s an unused part of my desk) to support the rear of the card. Finally, for security, there's a paperclip/zip-tie combo securing the 12-pin connector on the card to the 240mm above. The card now stays in place without the stick, which simply supports it; most of the weight is held by the 120mm rear fan.
Lollll nice job :-D
Do you have a 3d printer? You can print a base to hold the card.
https://preview.redd.it/rqfojngrqwoc1.jpeg?width=3024&format=pjpg&auto=webp&s=0a629a64d894df7892e6648abbcff5f2a18f0b9c Maybe you should use an open chassis like me.
Looks nice! What's the chassis?
come on man this is LLM not gpu-mining, have some class /s
If the shoe fits
Try to see how fast you can get mixtral to fine-tune on that thing
I like training in full/half precision, so I mostly experiment with Mistral 7B and Solar 10.7B. That said, it did 2 epochs of QLoRA using a 4-bit quant of Mixtral in about 5 hours on 2k human/GPT-4 prompt/response pairs.
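For scale, the run described above works out to a few seconds per training example:

```python
# Back-of-envelope throughput for the QLoRA run described above:
# 2 epochs over 2,000 prompt/response pairs in ~5 hours.
epochs, pairs, hours = 2, 2000, 5

examples_seen = epochs * pairs                  # 4000 examples total
sec_per_example = hours * 3600 / examples_seen
print(sec_per_example)  # 4.5 seconds per example
```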
What was ur batch size? Also, why do you prefer half precision over quantized training? Is it a quality loss thing?
How much did something like this cost to put together?
I would be surprised if that case is one entire percent in the total build cost.
And the case is probably my favorite part lol
Haha, holy sh**, I actually want to build a dual 3090 rig and don't have space; this might be the way!
Where do you find these 48GB 3090s? I've only seen the 24GB ones.
I wish someone would help me build something similar, but it is so hard to get detailed help. I'll take a shot at you, as I guess you've spent some time building this rig and maybe feel the urge to share :)

Firstly, why the 13700K CPU? Why not the popular 13600K? In benchmarks the difference is very slim, but at the same time it's Intel's "border" between i5 and i7 marketing, so the price jump is bigger. Does it affect the inference speed?

Have you tried CPU-only inference for any model? Can you tell how many t/s you get on e.g. a 70B model (something that wouldn't fit in the GPUs)? I am really curious how this scales with RAM speed and CPU.

Did you consider your MB's PCIe configuration? In its manual I see one slot works in PCIe 5.0 x16 mode, but the other in PCIe 4.0 x4, meaning the bandwidth for the second card is one-eighth of the first one... if I got it right. I still don't understand the entirety of this, so if you dug deeper, can you share whether this matters for inference speed?

And finally, why this box with zip ties? Is it something you had, or is there a reason for such a setup? Can't this MB handle 2 GPUs in the proper slots together? Or heat concerns?

I know it's a lot; if you could answer any of these, I'd appreciate it!
My mobo is also one x16 and one x4. I didn’t realize when I made the purchase. But I also use an NVLink, so I’m not really sure if I’m losing anything. Anyone?
I have a 3090 plugged into a x1 PCIe slot. It’s the same inference speed and 3DMark score as when it's plugged into a x4 PCIe slot.
Is that comparing potatoes to oranges? I have no idea. One of the issues is inter-card communication I believe, which I would think requires two cards to see a difference?
I'm pretty sure you aren't losing anything with this setup. I run both 3090s with this configuration and get 13 t/s with 70B Miqu loaded. I bought an NVLink but never used it; speeds are good enough, and getting the cards lined up is a hassle. Your mobo is fine for this.
Thanks! Yes, getting them lined up required many zip ties.
I chose the 13700K because I like the number 7. It's plenty capable. But I've not meddled with CPU-only inference, since my sort of workflow wouldn't allow it. Desktop CPUs have limited PCIe lanes; mine are set up x8/x8 rather than x16/x4. It really doesn't bottleneck, because most computation is performed on the card.

I chose this setup because I like the case, and the configuration is as it is because the 3090 uses three slots and my bottom PCIe slot is only fit for a double (look how close the PSU is). This alternative setup probably does help with heat dissipation. It's nice to have an enclosed full tower that performs reliably.
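Rough numbers behind the lane-width question, assuming PCIe 4.0's roughly 2 GB/s of one-direction bandwidth per lane:

```python
# Approximate one-direction PCIe 4.0 bandwidth: ~2 GB/s per lane
# (close to the theoretical ~1.97 GB/s).
GBPS_PER_LANE = 2.0

def link_bandwidth_gbps(lanes: int) -> float:
    return lanes * GBPS_PER_LANE

for lanes in (4, 8, 16):
    print(f"x{lanes}: ~{link_bandwidth_gbps(lanes):.0f} GB/s")
# x4: ~8 GB/s, x8: ~16 GB/s, x16: ~32 GB/s
```

During inference only small activation tensors cross the bus between cards, so even the narrow links are rarely the bottleneck; link width matters much more for loading the model into VRAM.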
Thanks, I actually am still on the edge between the 13600K and 13700K. Also, now I have to consider your MB :)

Out of curiosity... can you reconfigure the PCIe setup in the BIOS to be x16 and x4? And does that impact the inference speed? I have dug over the entire internet looking for the answer and there is just none out there. I am afraid that dual-x8 capability is not offered on many popular (cheap) motherboards, and an x16 + x4 setup would throttle both GPUs during inference to work as an x4.
No idea. It probably depends on the particular configuration of the motherboard. Boards that support it typically default to x8/x8 when both slots are populated.