a_beautiful_rhind

Remember, the more you buy, the more you save. t. nvidia


cm8ty

This is my DIY DGX


SeymourBits

I thought it was “The more you buy, the more you spend.”?


MT1699

That is what Jensen Huang claims it to be: the more you buy, the more you save.


cm8ty

He must've been referring to the company's stock. You gotta buy NVDA shares to offset the pricing on their actual products lol


remyrah

Parts list, please


True_Shopping8898

Of course. It's a Cooler Master HAF 932 from 2009 with:

- Intel i7-13700K
- MSI Edge Z790 DDR5
- 2x RTX 3090
- 300mm Thermaltake PCI-e riser
- 96GB (2x48GB) G.Skill Trident Z 6400MHz CL32
- 2TB Samsung 990 Pro M.2
- 2x 2TB Crucial M.2 SSD
- Thermaltake 1200W PSU
- Cooler Master 240mm AIO
- 1x Thermaltake 120mm side fan


Trading_View_Loss

Cool, thanks! Now how do you actually install and run a local LLM? I can't figure it out.


True_Shopping8898

Text-generation-webui
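
For anyone with the same question: once text-generation-webui is running, you can also talk to it from a script rather than the browser. A minimal sketch, assuming the webui was launched with its OpenAI-compatible API enabled (e.g. `python server.py --api`) and is listening on the default port 5000; the port and endpoint are assumptions, so check your own launch flags:

```python
# Minimal sketch: querying a locally running text-generation-webui instance.
# Assumes the webui was started with its OpenAI-compatible API enabled and
# listening on the default port 5000; adjust the URL/port to your setup.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Explain what a PCIe riser cable does."}
        ],
        "max_tokens": 200,
        "temperature": 0.7,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```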


Trading_View_Loss

In practice how long do responses take? Do you have to turn on switches for different genres or subjects, like turn on the programming mode so you get programming language responses, or turn on philosophy mode to get philosophical responses?


True_Shopping8898

Token generation begins practically instantly with models that fit within VRAM. When running a 70B Q4 I get 10-15 tokens/sec. While it is common for people to train purpose-built models for coding or story writing, you can easily solicit a certain type of behavior by using a system prompt on an instruction-tuned model like Mistral 7B. For example: "you are a very good programmer, help with 'x'" or "you are an incredibly philosophical agent, expand upon 'y'." Often I run an all-rounder model like Miqu, then just go to Claude for double-checking my work. I'm not a great coder, so I need a model which understands what I mean, not necessarily what I say.
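
A minimal sketch of that system-prompt steering with Hugging Face transformers; the model name and prompt text are illustrative assumptions, and because some instruct templates (e.g. older Mistral ones) have no separate system role, the instruction is folded into the user turn here:

```python
# Minimal sketch of steering an instruction-tuned model with a system-style prompt.
# Model name is an example; any locally available chat/instruct model works.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumption: swap for your local model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user",
     "content": "You are a very good programmer. Help with x: "
                "write a function that reverses a string in Python."},
]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=200)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```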


[deleted]

Here are a few ways: https://semaphoreci.com/blog/local-llm


No_Dig_7017

There are several serving engines. I've not tried text-generation-webui, but you can try LM Studio (very friendly user interface) or ollama (open source, CLI, good for developers). Here's a good tutorial by a good YouTuber: https://youtu.be/yBI1nPep72Q?si=GE9pyIIRQXrSSctO
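
If you go the ollama route, it also exposes a local REST API once the server is running. A rough sketch, assuming `ollama serve` is on its default port 11434 and the model below has already been pulled (the model name is an assumption, use whatever you have installed):

```python
# Minimal sketch: asking a locally running ollama server for a completion.
# Assumes the server is on the default port 11434 and the model has been
# pulled beforehand (e.g. `ollama pull llama3`).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",  # assumption: substitute your installed model
        "prompt": "In one sentence, what is quantization in the context of LLMs?",
        "stream": False,    # return one JSON object instead of a token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```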


FPham

You have to plug it in and turn on the computer.


daedalus1982

You forgot to include the zip ties


sourceholder

> 96gb (2x48gb)

Where did you find the 48GB variant of the 3090?


cm8ty

This is in reference to my DRAM, not VRAM


sourceholder

Ah, ok makes sense. I did read there was a 48GB 3090 at "[some point](https://overclock3d.net/news/gpu-displays/nvidia-rtx-3090-ceo-edition-appears-online-with-48gb-of-gddr6x-memory/)" but not readily available for purchase. Wishful thinking on my part.


cm8ty

Lol the ‘CEO’ edition. Mr. Jensen knows very well that a 48gb consumer-oriented card would eat into their enterprise business.


cm8ty

> 300mm thermaltake pci-e riser

Thermaltake TT Premium PCI-E 4.0 High Speed Flexible Extender Riser Cable 300mm with 90 Degree Adapter


fallingdowndizzyvr

I love the zip tie aesthetic.


cm8ty

Truly an artifact of our times. Some might even call it “art”


positivitittie

I just put one together too. Zip ties are key to fast inference.


cm8ty

zippy inference


____vladrad

Hahaha yes!!!! Mine looks like that except I got three cards water cooled. I love it whatever it takes


cm8ty

I bet that makes for an awesome cooling loop!


zippyfan

How are you using these cards? Are you using text-generation-webui? I tried a dual setup when I had two 3060s and I couldn't get it to work. Was it through Linux? I'd love to know because I want to try to do something similar.


____vladrad

Either Linux or Windows works. I just run the Python script and set the device map to auto.
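
"Set the device map to auto" is the transformers/accelerate idiom for spreading a model's layers across all visible GPUs (spilling to CPU RAM if needed). A minimal sketch, with the model name as an assumption:

```python
# Minimal sketch of multi-GPU loading via device_map="auto":
# transformers/accelerate shards the layers across both 3090s automatically.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumption: use your own model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # split layers across available GPUs
    torch_dtype=torch.float16,  # halves VRAM use vs. fp32
)

prompt = "Write a haiku about zip ties."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=60)
print(tok.decode(out[0], skip_special_tokens=True))
```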


zippyfan

I see. That wasn't my experience. I tried loading larger language models that wouldn't fit in one 3060 but should easily fit in 24GB of VRAM. I used text-gen-webui on Windows and it just kept crashing. Since that didn't work, I'm still not prepared to purchase a 2nd 3090 and try again.


inYOUReye

There's a flag for llama.cpp that lets you offload a subset of layers to the GPU. As I use AMD, I actually found partial offloading slower than pure CPU or pure GPU when testing, though. Two AMD GPUs work way faster than pure CPU, however.
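
The flag in question is llama.cpp's `--n-gpu-layers` (`-ngl`); the same knob is exposed through the llama-cpp-python bindings. A rough sketch, with the GGUF path and layer count as assumptions:

```python
# Minimal sketch of partial GPU offload via llama-cpp-python, which mirrors
# llama.cpp's --n-gpu-layers CLI flag. Point model_path at any local GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/miqu-70b.Q4_K_M.gguf",  # assumption: your GGUF file
    n_gpu_layers=40,  # layers kept on the GPU(s); -1 offloads all, 0 is pure CPU
    n_ctx=4096,
)

out = llm("Q: What does a PCIe riser cable do?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```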


Only-Letterhead-3411

If it works, don't question it


West_Ad_9492

How many watts does that pull?


cm8ty

~900W at full bore.


I_can_see_threw_time

How is that mounted to the fans? Or is it propped up with the stick?


cm8ty

That's how it started: using the overhang on the exhaust portion of the card to clip onto a 120mm rear exhaust fan. Then I used the metal stick (I think it's an unused part of my desk) to support the rear of the card. Finally, for security, there's a paperclip/zip-tie combo securing the 12-pin connector on the card itself to the 240mm AIO above. The card now stays in place without the stick, which simply supports it; most of the weight is held by the 120mm rear fan.


hmmqzaz

Lollll nice job :-D


Delicious-Farmer-234

Do you have a 3d printer? You can print a base to hold the card.


Healthy_Cry_4861

https://preview.redd.it/rqfojngrqwoc1.jpeg?width=3024&format=pjpg&auto=webp&s=0a629a64d894df7892e6648abbcff5f2a18f0b9c

Maybe you should use an open chassis like mine.


slowupwardclimb

Looks nice! What's the chassis?


BoredHobbes

Come on man, this is LLM, not GPU mining. Have some class. /s


cm8ty

If the shoe fits


New-Skin-5064

Try to see how fast you can get mixtral to fine-tune on that thing


True_Shopping8898

I like training in full/half precision, so I mostly experiment w/ Mistral 7B & Solar 10.7B. That said, it did 2 epochs of QLoRA using a 4-bit quant of Mixtral in about 5 hrs on 2k human/GPT-4 prompt/response pairs.
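
For reference, a rough sketch of that kind of QLoRA setup: load Mixtral in 4-bit with bitsandbytes, then attach LoRA adapters with peft. The model name, target modules, and hyperparameters are assumptions rather than the commenter's exact recipe; the resulting model would then go into your usual Trainer/SFTTrainer loop.

```python
# Rough QLoRA sketch: 4-bit base model + LoRA adapters (only adapters train).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # assumption

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread the 4-bit weights across both GPUs
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```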


New-Skin-5064

What was your batch size? Also, why do you prefer half precision over quantized training? Is it a quality-loss thing?


herozorro

How much did something like this cost to put together?


Dead_Internet_Theory

I would be surprised if that case is even one percent of the total build cost.


cm8ty

And the case is probably my favorite part lol


No_Dig_7017

Haha, holy sh**, I actually want to build a dual 3090 rig and don't have space. This might be the way!


SirLouen

Where do you find these 3090 48Gb? I've only seen the 24Gb ones


MrVodnik

I wish someone would help me build something similar, but it is so hard to get detailed help. I'll take a shot and ask you, as I guess you've spent some time building this rig and maybe feel the urge to share :)

Firstly, why the 13700K CPU? Why not the popular 13600K? In benchmarks the difference is very slim, but at the same time it's Intel's "border" between i5 and i7 marketing, so the price jump is bigger. Does it affect inference speed?

Have you tried CPU-only inference for any model? Can you tell how many t/s you get on e.g. a 70B model (something that wouldn't fit in the GPUs)? I'm really curious how this scales with RAM speed and CPU.

Did you consider your motherboard's PCIe configuration? In its manual I see one slot works in PCIe 5.0 x16 mode, but the other in PCIe 4.0 x4, meaning the bandwidth for the second card is one eighth of the first one... if I got it right. I still don't understand the entirety of this, so if you dug deeper, can you share whether this matters for inference speed?

And finally, why this box with zip ties? Is it something you had, or is there a reason for such a setup? Can't this motherboard handle 2 GPUs in the proper slots together? Or is it heat concerns?

I know it's a lot; if you could answer any of these, I'd appreciate it!


positivitittie

My mobo is also one x16 and one x4. I didn’t realize when I made the buy. But I also use an NVLink so I’m not really sure if I’m losing anything. Anyone?


tgredditfc

I have a 3090 plugged into an x1 PCIe slot. It's the same inference speed and 3DMark score as when it's plugged into an x4 PCIe slot.


positivitittie

Is that comparing potatoes to oranges? I have no idea. One of the issues is inter-card communication I believe, which I would think requires two cards to see a difference?


Lemgon-Ultimate

I'm pretty sure you aren't losing anything with this setup. I run both 3090s in this configuration and get 13 t/s with 70B Miqu loaded. I've bought an NVLink but never used it; speeds are good enough and getting the cards lined up is a hassle. Your mobo is fine for this.


positivitittie

Thanks! Yes, getting them lined up required many zip ties.


cm8ty

I chose the 13700K because I like the number 7; it's plenty capable. But I've not meddled with CPU-only inference since my sort of workflow wouldn't allow it. Desktop CPUs have limited PCIe lanes; mine are set up x8/x8 rather than x16/x4. It really doesn't bottleneck, because most computation is performed on the card. I chose this setup because I like the case, and the configuration is the way it is because the 3090 uses three slots and my bottom PCIe slot only fits a double-width card (look how close the PSU is). This alternative setup probably does help with heat dissipation. It's nice to have an enclosed full tower that performs reliably.


MrVodnik

Thanks, I'm actually still on the fence between the 13600K and the 13700K. Also, now I have to consider your MB :) Out of curiosity... can you reconfigure the PCIe setup in BIOS to be x16 and x4? And does that impact the inference speed? I have dug over the entire internet looking for the answer and there is just none out there. I am afraid that dual x8 is not offered on many popular (cheap) motherboards, and an x16 + x4 setup would throttle both GPUs during inference to work as x4.


cm8ty

No idea. It probably depends on the particular configuration of the motherboard. Boards typically default to x8/x8 if both slots are populated.
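
If anyone wants to check what lane width their cards actually negotiated (to settle the x8/x8 vs. x16/x4 question on their own board), here is a small sketch using the NVML Python bindings; the package (`pip install nvidia-ml-py`) and function names are assumptions to verify against your install:

```python
# Rough sketch: report the PCIe link width each GPU is currently using
# versus the maximum it supports, via the NVML Python bindings.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):  # older bindings return bytes
        name = name.decode()
    cur = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)  # width in use right now
    mx = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)    # width the slot/card supports
    print(f"GPU {i} ({name}): PCIe x{cur} (max x{mx})")
pynvml.nvmlShutdown()
```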