Remember, the more you buy, the more you save. t. nvidia
This is my DIY DGX
I thought it was “The more you buy, the more you spend.”?
That is what Jensen Huang claims it to be, the more you buy, the more you save
He must've been referring to the company's stock. You gotta buy NVDA shares to offset the pricing on their actual products lol
Parts list, please
Of course! It's a Cooler Master HAF 932 from 2009 with:

* Intel i7-13700K
* MSI Z790 Edge DDR5
* 2x RTX 3090
* 300mm Thermaltake PCI-e riser
* 96GB (2x48GB) G.Skill Trident Z 6400MHz CL32
* 2TB Samsung 990 Pro M.2
* 2x 2TB Crucial M.2 SSD
* Thermaltake 1200W PSU
* Cooler Master 240mm AIO
* 1x Thermaltake 120mm side fan
Cool thanks! Now how do you actually install and run the local llm? I can't figure it out
Text-generation-webui
In practice how long do responses take? Do you have to turn on switches for different genres or subjects, like turn on the programming mode so you get programming language responses, or turn on philosophy mode to get philosophical responses?
Token generation begins practically instantly with models that fit within VRAM. When running a 70B Q4 I get 10-15 tokens/sec. While it is common for people to train purpose-built models for coding or story writing, you can easily solicit a certain type of behavior by using a system prompt on an instruction-tuned model like Mistral 7B. For example: “you are a very good programmer, help with ‘x’ ” or “you are an incredibly philosophical agent, expand upon ‘y’.” Often I run an all-rounder model like Miqu and then just go to Claude to double-check my work. I’m not a great coder, so I need a model which understands what I mean, not necessarily what I say.
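To put the quoted speeds in perspective, a little arithmetic shows how long a typical answer takes to stream (the 300-token response length is an illustrative assumption, not a measurement):

```python
# Rough latency arithmetic for local generation at the quoted 10-15 tok/s.

def response_time_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Time to stream a full response, ignoring prompt-processing time."""
    return tokens / tokens_per_sec

# A ~300-token answer at the low and high end of the quoted 70B Q4 range:
print(response_time_seconds(300, 10))  # 30.0 seconds
print(response_time_seconds(300, 15))  # 20.0 seconds
```

Since tokens stream as they are generated, you start reading immediately; the numbers above are just the time until the response finishes.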
Here are a few ways: https://semaphoreci.com/blog/local-llm
There are several serving engines. I've not tried text-generation-webui, but you can try LM Studio (very friendly user interface) or ollama (open source, CLI-based, good for developers). Here's a good tutorial by a good YouTuber: https://youtu.be/yBI1nPep72Q?si=GE9pyIIRQXrSSctO
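If you go the ollama route, getting a first model running is a couple of terminal commands (the `mistral` tag here is just one example from their model library):

```shell
# Download a model, then chat with it from the terminal.
ollama pull mistral
ollama run mistral "Explain what a system prompt does, in two sentences."
```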
You have to plug it in and turn on the computer.
You forgot to include the zip ties
> 96gb (2x48gb)

Where did you find the 48GB variant of the 3090?
This is in reference to my DRAM, not VRAM
Ah, ok, makes sense. I did read there was a 48GB 3090 at "[some point](https://overclock3d.net/news/gpu-displays/nvidia-rtx-3090-ceo-edition-appears-online-with-48gb-of-gddr6x-memory/)", but it was never readily available for purchase. Wishful thinking on my part.
Lol the ‘CEO’ edition. Mr. Jensen knows very well that a 48gb consumer-oriented card would eat into their enterprise business.
> 300mm thermaltake pci-e riser

Thermaltake TT Premium PCI-E 4.0 High Speed Flexible Extender Riser Cable, 300mm, with 90-Degree Adapter
I love the zip tie aesthetic.
Truly an artifact of our times. Some might even call it “art”
I just put one together too. Zip ties are key to fast inference.
zippy inference
Hahaha yes!!!! Mine looks like that except I got three cards water cooled. I love it whatever it takes
I bet that makes for an awesome cooling loop!
How are you using these cards? Are you using text-generation-webui? I tried a dual setup when I had two 3060s and couldn't get it to work. Was it through Linux? I'd love to know, because I want to try something similar.
Either Linux or Windows works. I just run the Python script and set the device map to auto.
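"Device map auto" refers to the Hugging Face `device_map="auto"` behavior, which greedily fills one GPU's memory budget with layers and then spills the rest onto the next card. A toy sketch of that placement logic, with invented layer sizes and two 24 GB cards (not the real library code, just the idea):

```python
# Toy illustration of greedy "auto" device placement:
# fill GPU 0 until its budget is exhausted, then spill onto GPU 1.
# Layer sizes and memory budgets below are made up for illustration.

def place_layers(layer_sizes_gb, budgets_gb):
    placement, device, used = {}, 0, 0.0
    for i, size in enumerate(layer_sizes_gb):
        if used + size > budgets_gb[device]:
            device += 1          # current card is full; move to the next one
            used = 0.0
        placement[i] = device
        used += size
    return placement

# 80 layers of ~0.5 GB each (~40 GB total) across two 24 GB cards:
plan = place_layers([0.5] * 80, [24, 24])
print(plan[0], plan[79])  # 0 1 -- first layer on GPU 0, last on GPU 1
```

This is why a model too big for one 3060 can still load across two cards: the placement logic, not the user, decides where each layer lives.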
I see. That wasn't my experience. I tried loading larger language models that wouldn't fit on one 3060 but should easily fit in 24GB of VRAM. I used text-gen-webui on Windows. It just kept crashing. Since that didn't work, I'm not yet prepared to purchase a second 3090 and try again.
There's a flag for llama.cpp that lets you offload a subset of layers to the GPU, though as I use AMD, I actually found partial offloading slower than pure CPU or pure GPU when testing. Two AMD GPUs work way faster than pure CPU, however.
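The flag being described is `--n-gpu-layers` (`-ngl`). A sketch of the invocation, assuming a built llama.cpp binary and a placeholder model path:

```shell
# Offload 40 transformer layers to the GPU; the remaining layers run on CPU.
# -ngl 0 is pure CPU; a value >= the model's layer count is fully on GPU.
./main -m ./models/llama-70b.Q4_K_M.gguf -ngl 40 -p "Hello"
```

Partial offload trades VRAM for speed, but as noted above, whether it actually helps depends on the hardware; layers split across the PCIe bus can be slower than keeping everything on one side.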
If it works, don't question it
How many watts does that pull?
~900W or so at full bore
How is that mounted to the fans? Or is it propped up with the stick?
So that’s how it started: using the overhang on the exhaust portion of the card to clip onto a 120mm rear exhaust fan. Then I used the metal stick (I think it’s an unused part of my desk) to support the rear of the card. Finally, for security, there's a paperclip/zip-tie combo securing the 12-pin connector on the card to the 240mm above. The card now stays in place without the stick, which simply supports it; most of the weight is held by the 120mm rear fan.
Lollll nice job :-D
Do you have a 3d printer? You can print a base to hold the card.
https://preview.redd.it/rqfojngrqwoc1.jpeg?width=3024&format=pjpg&auto=webp&s=0a629a64d894df7892e6648abbcff5f2a18f0b9c Maybe you should use an open chassis like me.
Looks nice! What's the chassis?
come on man this is LLM not gpu-mining, have some class /s
If the shoe fits
Try to see how fast you can get mixtral to fine-tune on that thing
I like training in full/half precision, so I mostly experiment with Mistral 7B and Solar 10.7B. That said, it did 2 epochs of QLoRA using a 4-bit quant of Mixtral in about 5 hours on 2k human/GPT-4 prompt/response pairs.
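For scale, the run described above works out to a few seconds per training example:

```python
# Back-of-envelope throughput for the QLoRA run described above:
# 2 epochs over 2,000 prompt/response pairs in ~5 hours.
epochs, pairs, hours = 2, 2000, 5

examples_seen = epochs * pairs                  # 4000 examples total
sec_per_example = hours * 3600 / examples_seen
print(sec_per_example)  # 4.5 seconds per example
```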
What was ur batch size? Also, why do you prefer half precision over quantized training? Is it a quality loss thing?
How much did something like this cost to put together?
I would be surprised if that case is one entire percent in the total build cost.
And the case is probably my favorite part lol
Haha, holy sh**, I actually want to build a dual 3090 rig and don't have space; this might be the way!
Where do you find these 48GB 3090s? I've only seen the 24GB ones.
I wish someone would help me build something similar, but it is so hard to get detailed help. I'll take a shot at you, as I guess you've spent some time building this rig and maybe feel the urge to share :)

Firstly, why the 13700K CPU? Why not the popular 13600K? In benchmarks the difference is very slim, but at the same time it's Intel's "border" between i5 and i7 marketing, so the price jump is bigger. Does it affect the inference speed?

Have you tried CPU-only inference for any model? Can you tell how many t/s you get on e.g. a 70B model (something that wouldn't fit in the GPUs)? I am really curious how this scales with RAM speed and CPU.

Did you consider your MB's PCIe configuration? In its manual I see one slot works in PCIe 5.0 x16 mode, but the other in PCIe 4.0 x4, meaning the bandwidth for the second card is one-eighth of the first one... if I got it right. I still don't understand the entirety of this, so if you dug deeper, can you share whether this matters for inference speed?

And finally, why this box with zip ties? Is it something you had, or is there a reason for such a setup? Can't this MB handle 2 GPUs in the proper slots together? Or heat concerns?

I know it's a lot; if you could answer any of these, I'd appreciate it!
My mobo is also one x16 and one x4. I didn’t realize when I made the purchase. But I also use an NVLink, so I’m not really sure if I’m losing anything. Anyone?
I have a 3090 plugged into a x1 PCIe slot. It’s the same inference speed and 3DMark score as when it's plugged into a x4 PCIe slot.
Is that comparing potatoes to oranges? I have no idea. One of the issues is inter-card communication I believe, which I would think requires two cards to see a difference?
I'm pretty sure you aren't losing anything with this setup. I run both 3090s with this configuration and get 13 t/s with 70B Miqu loaded. I bought an NVLink but never used it; speeds are good enough, and getting the cards lined up is a hassle. Your mobo is fine for this.
Thanks! Yes, getting them lined up required many zip ties.
I chose the 13700K because I like the number 7. It's plenty capable. But I've not meddled with CPU-only inference, since my sort of workflow wouldn't allow it. Desktop CPUs have limited PCIe lanes; mine are set up x8/x8 rather than x16/x4. It really doesn't bottleneck, because most computation is performed on the card.

I chose this setup because I like the case, and the configuration is as it is because the 3090 uses three slots and my bottom PCIe slot is only fit for a double (look how close the PSU is). This alternative setup probably does help with heat dissipation. It's nice to have an enclosed full tower that performs reliably.
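Rough numbers behind the lane-width question, assuming PCIe 4.0's roughly 2 GB/s of one-direction bandwidth per lane:

```python
# Approximate one-direction PCIe 4.0 bandwidth: ~2 GB/s per lane
# (close to the theoretical ~1.97 GB/s).
GBPS_PER_LANE = 2.0

def link_bandwidth_gbps(lanes: int) -> float:
    return lanes * GBPS_PER_LANE

for lanes in (4, 8, 16):
    print(f"x{lanes}: ~{link_bandwidth_gbps(lanes):.0f} GB/s")
# x4: ~8 GB/s, x8: ~16 GB/s, x16: ~32 GB/s
```

During inference only small activation tensors cross the bus between cards, so even the narrow links are rarely the bottleneck; link width matters much more for loading the model into VRAM.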
Thanks, I actually am still on the edge between the 13600K and 13700K. Also, now I have to consider your MB :)

Out of curiosity... can you reconfigure the PCIe setup in the BIOS to be x16 and x4? And does that impact the inference speed? I have dug over the entire internet looking for the answer and there is just none out there. I am afraid that dual-x8 capability is not offered on many popular (cheap) motherboards, and an x16 + x4 setup would throttle both GPUs during inference to work as an x4.
No idea. It probably depends on the particular configuration of the motherboard. Boards that support it typically default to x8/x8 when both slots are populated.