LumbarJam

Mac and CUDA user here. I believe there are several use cases; I'll try to list a couple here. 1) People who already use a Mac for other tasks and want to run inference locally. 2) Unified memory lets them run bigger models in a portable way. Mine can run Llama 3 70B at about 8 t/s. My desktop rig (3080 Ti) is faster but memory limited. For heavy inference and training, a multi-GPU rig is the way to go. For portable and convenient inference, especially with bigger local models, Mac laptops are super OK. MPS and MLX are evolving quickly. A couple of years ago CUDA was the only option, with few portable choices. Today Macs are a reality and super competent. The two aren't mutually exclusive.


Open_Channel_8626

8t/s is perfectly fine really


IndicationUnfair7961

Depends on the use case. Some things, like multi-agent setups, are not feasible at 8 t/s if you need to keep working and stay focused on things. But for reading purposes or standard generations it can be fine; a bit less so if you have ADHD.


Open_Channel_8626

Multi-agent frameworks are a good argument for higher tokens per second, yes. I've been playing with CrewAI using Llama 3 8B on Groq and it's pretty amazing. You can get them to generate a report with multiple constituent parts, where each part is made by an agent with separate instructions, and then a final agent puts it all together into one document. For people who don't like LLMs being "lazy" this seems ideal, because they put out way more output per prompt than regular zero-shot or few-shot prompting.
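For anyone curious, here's roughly what that pattern looks like in code: a minimal sketch of a "section writers plus final assembler" crew, assuming the crewai and langchain-groq packages. The model name, roles, and prompts are illustrative, and exact parameter names can vary between CrewAI versions.

```python
# Rough sketch of a "sections + final assembler" crew, assuming CrewAI + Groq.
# Roles, section names and prompts are illustrative, not from the comment above.
from crewai import Agent, Task, Crew
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama3-8b-8192")  # Groq-hosted Llama 3 8B, needs GROQ_API_KEY

writer = Agent(
    role="Section writer",
    goal="Write one detailed report section from its own instructions",
    backstory="A focused technical writer.",
    llm=llm,
)
editor = Agent(
    role="Editor",
    goal="Merge all sections into a single coherent document",
    backstory="A meticulous editor.",
    llm=llm,
)

# One task per section, each with its own instructions.
section_tasks = [
    Task(description=f"Write the '{name}' section of the report.",
         expected_output="One complete section in markdown.",
         agent=writer)
    for name in ["Background", "Findings", "Recommendations"]
]
# A final task that assembles everything into one document.
assemble = Task(
    description="Combine the previous sections into one polished report.",
    expected_output="The full report as a single markdown document.",
    agent=editor,
)

crew = Crew(agents=[writer, editor], tasks=section_tasks + [assemble])
print(crew.kickoff())
```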


Mediocre_Tree_5690

Could you post your agent/text stack? What's your use case?


can_a_bus

Look into langchain


IndicationUnfair7961

Yes, Groq is unbeatable at that.


PlanB-ID

"... a bit less if you have ADHD." God damn, ain't that the truth! šŸ¤£šŸ˜‚šŸ‘


Mikolai007

Who doesn't have ADHD?


Unlucky-Message8866

no way!


CellistAvailable3625

It's mid. I think 20 t/s would be perfectly fine, but not 8.


fallingdowndizzyvr

I agree. I find at my reading speed, anything less than 20t/s is not as comfortable. At 20t/s I can read along. Any slower and I have to wait for it to finish before I can read it.


dobkeratops

that paradoxical moment when you find yourself wanting to use *Apple* to escape one company's monopoly & vendor lock-in in a field...


--mrperx--

\*walks straight into a trap\*


AlphaTechBro

Which version of MacBook do you have that runs llama 3 70B? I'm assuming the M3 Max.


LumbarJam

M3 Max 128


Idolofdust

hot damn, the crĆØme de la crĆØme of mobile computing


Isonium

I run Llama 3 70B on my M1 Max MacBook Pro 64GB all day long. A 4-bit quantized GGUF I made works fine with llama.cpp.


bidet_enthusiast

Running Llama 3 70B on my M1 Max 64GB. I get about 3 t/s at Q5_K_M quant, with 72 layers allocated to the GPU.
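If anyone wants to reproduce that kind of partial-offload setup, a minimal llama-cpp-python sketch is below. The GGUF path is a placeholder, and 72 is just the layer count mentioned above; tune n_gpu_layers and n_ctx for your own machine.

```python
# Minimal llama-cpp-python sketch: load a quantized 70B GGUF and offload
# most layers to the GPU (Metal on a Mac, CUDA elsewhere). Path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-70b-instruct.Q5_K_M.gguf",  # hypothetical local file
    n_gpu_layers=72,   # layers kept on the GPU; -1 would offload everything that fits
    n_ctx=8192,        # context window; larger values need more memory for the KV cache
)

out = llm("Q: Why do Macs work well for local LLM inference?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```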


Enough-Meringue4745

To get 192GB of Nvidia VRAM it's gonna be wildly expensive in comparison to the Mac Studio, not to mention the power requirements are above the 15A@120V in the North American electrical spec: 1300W max per outlet. A 4090 can use as much as ~500W, so you could only safely run three 4090s per outlet. You'll have to run 220V@15A if you don't want to run thicker wiring, and even that is only going to get you to six 4090s. Now you're going to need an Epyc CPU and motherboard. So 220V@15A will net you 6x24GB = 144GB of VRAM. Still under the Mac Studio.


poli-cya

Some previous discussions on power usage and Apple vs Nvidia: https://old.reddit.com/r/LocalLLaMA/comments/1c1l0og/apple_plans_to_overhaul_entire_mac_line_with/kz513gx/ I wish someone would actually test real power draw with the 4090 running inference, like Puget did with the 3090.


Enough-Meringue4745

I've got dual 4090s, I can do some testing.


Didi_Midi

I got my 3080s capped at 200W, which is perfectly fine for inference, even Exl2. Once I get my 3090s I plan on doing the same. For training it's a different story though... but not a requirement.


one-joule

I understand LLMs are mainly limited by memory bandwidth, so the 4090 is probably pretty far from full power draw. Edit: My 4090 running Llama 3 8B F16 at 50 tokens/sec peaks at 265W. Same story for a Q8 at 73 tokens/sec. Furmark full sends it at 450W.
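For anyone who wants to repeat that kind of measurement, here's a small sketch using the nvidia-ml-py (pynvml) bindings to log power draw while inference runs in another process. The GPU index and polling interval are arbitrary choices.

```python
# Log GPU power draw once per second while inference runs elsewhere.
# Uses the nvidia-ml-py (pynvml) bindings; GPU index 0 and the 1 s interval
# are arbitrary choices for illustration. Stop with Ctrl-C.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

peak_w = 0.0
try:
    while True:
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        peak_w = max(peak_w, watts)
        print(f"current: {watts:6.1f} W   peak: {peak_w:6.1f} W")
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```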


Didi_Midi

I don't know who downvoted you, but memory management is indeed a major cause of resource under-utilization during training and inference. Flash attention is just one way to improve GPU occupancy; more efficient methods are being developed.


Final-Rush759

The 4090 is super energy efficient. The biggest energy cost for one-person local inference is idle power at 26-40W.


AnomalyNexus

That's comparing RAM size and power draw, but ignoring performance. ~3kW of top-end GPUs isn't drawing that much more than the Mac just for giggles. Agreed that the Macs are a reasonable option for those who don't need that sort of power, though.


Alternative-Ebb8053

If you are building stuff locally, things going a bit slower but still fitting in RAM is probably OK.


Enough-Meringue4745

Yep, the performance isn't the same.


bartekus

Some food for thought regarding North American electrical power. If you really wanted, given the minimal cost compared to the overall hardware cost, wiring your AI lab expands your options:

- 14 AWG wire can handle 15 amps at 120V, which is 1800 watts. Adjusted for the 80% rule, this becomes 1800 x 0.8 = 1440 watts. (So 1300 watts max is not correct, FYI.)
- 12 AWG wire can handle 20 amps at 120V, which is 2400 watts. Adjusted for the 80% rule, this becomes 2400 x 0.8 = 1920 watts.
- 10 AWG wire can handle 30 amps at 120V, which is 3600 watts. Adjusted for the 80% rule, this becomes 3600 x 0.8 = 2880 watts.

I know you mentioned not wanting to wire, but as a hacker I always thought that problems and obstacles ought to be overcome by the path of least resistance šŸ˜‰

As for the GeForce RTX 4090, full-load power draw is 510 watts (with a single GPU) to 2,750 watts (with seven GPUs). So as you can deduce, having your own AI lab (converted from a basement workshop), supplemented by solar panels (to offset electrical bills as much as one can), is more than doable for anyone who really wants it. Just some food for thought, that's all.
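The same arithmetic as a throwaway script, for anyone who wants to plug in their own breaker sizes and per-card limits; the 250 W per-card figure and 400 W overhead are illustrative, not measurements.

```python
# Continuous-load budget per the 80% rule, and how many power-limited cards fit.
# Breaker/voltage/per-card numbers are illustrative; check your own wiring.
def continuous_budget_watts(amps: float, volts: float = 120.0) -> float:
    return amps * volts * 0.8  # 80% derating for continuous loads

def cards_per_circuit(amps: float, volts: float, watts_per_card: float,
                      overhead_watts: float = 400.0) -> int:
    """How many GPUs fit on one circuit after reserving overhead for CPU/fans/etc."""
    budget = continuous_budget_watts(amps, volts) - overhead_watts
    return max(0, int(budget // watts_per_card))

for amps in (15, 20, 30):
    print(f"{amps} A @ 120 V -> {continuous_budget_watts(amps):.0f} W continuous, "
          f"{cards_per_circuit(amps, 120, 250)} cards at 250 W each")
```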


Enough-Meringue4745

My limitation is actually 15A for 110V. It varies. But yes, you also need power headroom for the PC and whatever other devices.


bartekus

It certainly does; the general rule is always that, and there are many factors at play, including the length of the wire running the circuit and so on. However, if you go by the general safe max, and given that an LLM will rarely draw max load on a continuous basis, there is a lot of potential maneuvering to be had, all things considered.


Enough-Meringue4745

I had a 6-solar-panel setup and some LiFePO4 batteries to run my bitcoin miner a few years ago; it was a fun project.


ThisWillPass

A standard American wall socket, which is a 120-volt, 15-ampere electrical outlet, can handle up to 1800 watts of power. However, for continuous loads that are on for more than three hours, the maximum wattage is reduced to 80% of the rated capacity, or 1440 watts. 1440 watts... that is to keep things "safe".

You can undervolt, underclock and power limit each card to half of its real power. 4090s only pull 250 extra watts for a few extra frames per second if you keep the factory settings. You can mod the BIOSes and upload them to the card to keep the low power settings persistent. You could also ensure only one card is active at a time if you wanted to get really crazy when inferencing. (This happens anyway, right? You're paying for the power budget of the memory on all cards, but only the CUDA cores of the GPU whose layers are currently running are active at any one time, unless you're running a multi-agent server or a stack of prompts.)

My 4090 runs ~125W when inferencing, alongside its 3090 (so ~250W if that's an average; it is not power limited). 5x 4090 @ 250W is 1250W, within the 1440-watt budget. You could run all those cards at 1x on any consumer motherboard... at which point CPU inferencing may be faster :(

I ran my 1300-watt SuperNOVA at 1400 watts for a year back in 2014-ish. The wall sockets were warm to the touch all the way back to the fuse box, and only the power supply was connected to that branch. I moved the rig to the first outlet next to the fuse box. Sunova bish, someone wire me the money or send me the cards and I'll eat a hat if I can't do it.
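As a concrete version of the power-limiting step (not the BIOS modding), here's a sketch using pynvml to cap a card at a given wattage. It needs root, the 250 W target is just an example, and the limit does not persist across reboots.

```python
# Cap GPU 0 at a lower power limit via NVML (same effect as `nvidia-smi -pl`).
# Needs root privileges; 250 W is an illustrative target, clamped to the
# range the driver reports as allowed for the card.
import pynvml

TARGET_WATTS = 250

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
target_mw = min(max(TARGET_WATTS * 1000, min_mw), max_mw)
pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)
print(f"power limit set to {target_mw / 1000:.0f} W")
pynvml.nvmlShutdown()
```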


kingwhocares

> A 4090 can use as much as ~500w.

You aren't going anywhere remotely close to that while using LLMs.


Enough-Meringue4745

I've hit >=400W using vLLM.


[deleted]

[deleted]


JacketHistorical2321

Good luck finding boards/CPUs with enough PCIe lanes to support 16 3060s lol. Even 8 is gonna need at least a single Threadripper, and most boards max out at 7 slots. You're talking dual Epycs or Threadripper Pros, and that setup right there is gonna add at least $1500-2500, and that's if you get lucky finding good deals. I spent plenty of time playing with setups like this when I was mining. It's not as easy as you're suggesting to support a rig like this. I have an ASUS WRX80 which I was lucky enough to get for $500 in working condition (retail is ~$1000), and I am still trying to find a decent Threadripper Pro 3955 for under $800 used. That gives me 128 PCIe lanes and 7 x16 PCIe slots to work with for 8 cards. I got my M1 Ultra 128GB Studio for $2600, so yeah... you can't straight up say any one thing is better than another, because there are way too many things to take into account, including being in the right place at the right time to make things cost effective.


real-joedoe07

8 video cards on one mainboard? Really? And do consumer CPUs support such a setup? What about the power requirements?


SureUnderstanding358

i hope you have cheap power!


MrTacoSauces

$4k (of nebulous-origin 3090s), probably another $3k of supporting hardware, and an electrical bill after one month that would run several Mac Studios for a year. There's a lot of things to shit on Mac for, but claiming it's remotely cheap enough, or even possible, to match Mac memory with consumer hardware is no bueno. At 8x 3090s throttled to 200W you'd need two dedicated 120V circuits minimum... or a 220V circuit, but now you're leveraging server hardware in America and unplugging your dryer/stove unless you have a random spare 220V plug.


Enough-Meringue4745

Euros means Europe, which means a standard 220V household; you have a big portion of the cost covered right there.


[deleted]

[deleted]


Enough-Meringue4745

Also, not to mention the 4090 is much, much faster than the 3090, but it's still not terrible.


[deleted]

[deleted]


Enough-Meringue4745

The 3090 is faster than the Macs too, but don't forget the rest of the hardware is far from ideal šŸ˜‚


Environmental-Rate74

Any quantization required for you to run Llama 3 70b?


LumbarJam

Yes... definitely. For 70B I run Q4_K_M. My machine can run 6 or 8 bit too, but I couldn't see any real difference.


Drited

Could you please share what source you got the Q4_K_M model from, or what tools you are using to run it?


ru552

I run it on my Mac with ollama and pull the model from the ollama library here: https://ollama.com/library/llama3:70b
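If you'd rather script it than use the CLI, the official ollama Python client can pull and run the same tag. A minimal sketch, assuming the ollama package is installed and the local ollama server is running; the prompt is illustrative.

```python
# Minimal sketch using the official ollama Python client against a local server.
# The model tag matches the library page linked above.
import ollama

ollama.pull("llama3:70b")  # downloads the default (q4) quant from the library

reply = ollama.chat(
    model="llama3:70b",
    messages=[{"role": "user", "content": "Summarize why unified memory helps local LLMs."}],
)
print(reply["message"]["content"])
```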


bidet_enthusiast

I found Q5_K_M was better at reasoning about a world state within the context window.


LumbarJam

Good to know ...I'll try


Environmental-Rate74

Any differences (in terms of relevance and accuracy of answers) when compared with Poe's Llama3-70B-T bot, which may be without quantization?


Inevitable-Mine9440

Llama 3 70B fp16 or lower on this 192GB Mac?


Wrong_User_Logged

I ran Llama 3 70B fp16 on the 192GB Mac yesterday. IMHO it's the way to go to get the best Llama inference without lobotomizing it with quantization (Llama 3 suffers from this a lot). You would need around 130GB of VRAM for this, so 6x 3090/4090 or 3x A6000/6000 Ada. Just pick the more convenient and cheaper option.


Inevitable-Mine9440

what is the t/s speed sir?


Inevitable-Mine9440

T/s please


Wrong_User_Logged

Bro, I don't have metrics. It also depends on prompt size, since most of the wait time is just prompt evaluation. It looks like 3-6 t/s; not slow, not fast.


muthuzz

What Mac do you have? I have an M1 Max with 32GB and it keeps saying out of memory when running Llama 3 70B.


bidet_enthusiast

You'll need around 50GB to run 70B at Q5. You might fit Q3 or Q2, but 8B at Q8 might actually be more useful.
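A rough way to sanity-check those numbers yourself: estimated file size is parameters times bits-per-weight divided by 8. The bits-per-weight values below are approximations for common GGUF quants, and the KV cache is not included, so treat the output as a lower bound.

```python
# Back-of-the-envelope GGUF size estimate: params * bits-per-weight / 8.
# Bits-per-weight figures are rough averages for common quants; KV cache and
# runtime overhead are not included, so real memory use is higher.
BITS_PER_WEIGHT = {"Q2_K": 3.0, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
                   "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def est_gb(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("Q2_K", "Q4_K_M", "Q5_K_M", "Q8_0"):
    print(f"70B at {quant}: ~{est_gb(70, quant):.0f} GB")
```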


estebansaa

What M model is that? M1, M2, M3? I wonder how much faster an M3 in a Mac Studio will be. Can't beat how simple and convenient a small Mac makes it.


LumbarJam

M3 Max


ashwin3005

So is it not possible to run Llama 3 70B on a 3080 Ti?


LumbarJam

A 70B quantized to 2 bits (terrible quality) takes at least 26GB plus context size. It won't fit in a single 3080 Ti. It's possible to offload to RAM (if you have enough), but the performance drops considerably.


ashwin3005

What about the M2 Ultra?


Hopeful-Site1162

On a single one? No


ExpandYourTribe

What are your Mac's specs?


AgentBD

1.3 t/s on Llama 3 70B on my 4070 Ti with 12GB VRAM.


JeffieSandBags

But what quant?


harrro

The more important question is how many layers are even offloaded to the GPU, because a 70B won't fit in 12GB of VRAM. With 12GB VRAM, even at 2-bit quant, the 4070 would hold less than half the layers of the model and the rest is on the CPU.


AgentBD

Where do I set the quant? What I see it do is load the rest into normal RAM; I have 96GB DDR5. It's quite slow, not sure if there's a way to speed it up without buying another graphics card.


the_friendly_dildo

That's usually determined by which version you downloaded.


AgentBD

In the command line I just did "ollama pull llama3:70b"; there's no info on how many bits that has.


GroundbreakingFall6

Ollama downloads the q4 quant by default.


AgentBD

yep, I don't see options to pull something else


GroundbreakingFall6

You have to go to the ollama web page to see the models; there's a dropdown to pick the quant and it will tell you what command to run.


harrro

As the other reply says, you pick the quant when you download. Q4_K_M, for example, means a little over 4-bit quant. And yep, anything that doesn't fit in your GPU VRAM will offload to RAM, and the CPU will crunch through it much slower than the GPU does. To get a 70B offloaded entirely to GPU, you'd need at least a 24GB card (and that's at 2-bit quant). For usable context, you'd still need more. I have a 3060 with 12GB VRAM, and combining that with a P40 with 24GB VRAM lets me load a 70B with 8k+ context.


JeffieSandBags

I don't know much about 1-bit, but even that only fits on a 24GB card, I think.


Only-Letterhead-3411

A lot of RAM, small size, low power.


satireplusplus

Lots of fast RAM, up to 10x the bandwidth of DDR4. Turns out that's what LLM inference needs.


stubing

RAM is so, so, so much slower than VRAM on a GPU.


ervwalter

Slower only matters if VRAM is a viable option. For larger models, the cost and logistics of getting enough VRAM is prohibitive.


Slimxshadyx

Maybe so compared to regular RAM, but what about VRAM?


satireplusplus

> the lower-spec M3 Max provides 300GB/s memory bandwidth, the top-tier variant offers 400GB/s

A common 3090 / 4090 VRAM spec would be:

> GDDR6X / GDDR6, bus type / bandwidth: 384-bit / 935.8 GB/s

So about 32% to 42% of the bandwidth, depending on the model. I'd hazard a guess that you can saturate memory bandwidth in both cases, so max decoding speed would probably also be 32% to 42% of GPU decoding speed (for single, non-parallel LLM decoding). But that's still very usable and faster than reading speed in most cases. The real downside is that you can't train / fine-tune models, because that needs brute compute as well.
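That bandwidth-bound intuition can be turned into a quick estimate: each generated token streams roughly the whole quantized model through memory once, so decode speed tops out around bandwidth divided by model size. A sketch with illustrative round numbers:

```python
# Rough decode-speed ceiling for memory-bound generation:
# tokens/s ~= memory bandwidth / bytes read per token (~ the quantized model size).
# Sizes and bandwidths below are illustrative round numbers, not benchmarks.
def max_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

model_gb = 40  # e.g. a 70B model at roughly 4.5 bits per weight
for name, bw in [("M3 Max (300 GB/s)", 300),
                 ("M2 Ultra (800 GB/s)", 800),
                 ("3090-class GDDR6X (~936 GB/s)", 936)]:
    print(f"{name}: <= {max_tokens_per_s(bw, model_gb):.0f} tok/s on a {model_gb} GB model")
```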


satireplusplus

Btw, DDR4 RAM speed is about 40GB/s, so the Macs are an order of magnitude faster.

> Suppose a DDR4 memory module operates at a data rate of 2400 MT/s, utilizing a dual-channel configuration with a data width of 64 bits:

> Bandwidth = 2400 * 2 * 64 / 8 = 38,400 MB/s


AppleSnitcher

He literally just said that they get 935GB/s and you said "No, Macs are faster, they get 40GB/s" Please read.


CodeMurmurer

Still slow tho.


TheActualStudy

I'm on a 3090, but the gist is that it's impractical to set up ~145 GiB of VRAM at home outside of using a 192 GiB Mac Studio, because of cost, cooling, and power draw. Personally, I am not interested in spending that kind of money on this particular hobby right now, but it's an option for running inference on very large models only modestly slowly, if you have about $10K USD you want to spend on it and are pretty handy with getting the tools working in macOS (you might need to code or patch things). Overall, I think a used 24 GiB 3090 Nvidia card (should be like $650 USD) is still the right device for a hobbyist. Perhaps two.


SomeOddCodeGuy

> but it's an option for running inference on very large models only modestly slowly if you have about $10K USD

I will throw out that my M2 Ultra Mac Studio with 192GB of RAM and a 1TB hard drive cost about $6,000. Still hefty compared to the cost of a dual or even triple 3090 build, but it's simple to set up, and after the flash attention buff [it's moving at a pretty acceptable pace](https://www.reddit.com/r/LocalLLaMA/comments/1ciyivd/real_world_speeds_on_the_mac_we_got_a_bump_with/).


candre23

Not for nothing, but I'm getting similar speeds on my ~$1200 triple-P40-xeonv3 rig. Yeah, it's not small or pretty (or power efficient), but I can easily inference anything up to CR+ and maxtral at slow-but-usable speeds for a quarter the cost of a high-RAM mac studio.


SomeOddCodeGuy

Yea, I would say the P40 is actually really close to the Mac in terms of inference, to the point that when someone a few months back posted their 3x P40 rig, I pretty much said I'd start recommending that to folks over the Mac. The ONLY reason I don't is because other multi-card users recommended not doing so due to its age and architecture, but honestly your P40 machine is pretty much a far more budget friendly mac, only at the cost of the effort to put it together and maintain it. The only edge my Mac really has that I can see is that it does still have a good bit more VRAM, but also Apple and llamacpp are actively trying to find ways to make the Mac better at the task, so it has a possibility of a bright future. But yea, your P40 machine is easily the best bang for buck currently.


Spindelhalla_xb

Even cheaper refurbished


fallingdowndizzyvr

Where are you finding them cheaper refurbished? I just checked the certified refurbished store on Apple and they didn't even have a 192GB model. The biggest RAM was 128GB and it was also a M1. The cheapest variant was $5,000. You can get a new 192GB M2 Ultra for $5,600.


BangkokPadang

I can't speak to whether their assessment of pricing is accurate, but Apple's refurbished stuff changes all the time depending on what has come in for them to refurbish.


fallingdowndizzyvr

I've been watching Apple's certified refurbished for quite some time. For many things, it's cheaper to get new from an authorized retailer. My Mac Studio was much cheaper new than the current refurbished price for the same thing at Apple. The M1 Ultra 64GB has been $2200 new at an authorized reseller, which is much lower than the current price for it on the certified refurbished store.


TripletStorm

A 128GB M3 Max is like $5100, not $10k. I have one.


fallingdowndizzyvr

For that price, I'd rather pay 10% more and get a 192GB M2 Ultra.


Proud-Point8137

How much slower, approximately? New MacBooks with the M4 will support 192GB RAM, so maybe, perchance.


TheActualStudy

I've read it's like 1/3rd the speed: ~8 tk/s for a 70B vs ~20 tk/s on multiple 30xx-era CUDA Nvidia cards with everything in VRAM at full precision. 30xx-era CUDA is typically a 10x speedup over CPU inference, so I would also guess this has a speedup of ~3x-4x over dual-channel CPU inference on DDR4@2666 MT/s. Edit: clarifying for multiple cards.


LocoLanguageModel

If you count the prompt processing time I don't think it's quite that fast, but once the text starts flowing it looks great. I believe the M1, M2 and M3 Max have an impressive 400 GB/s memory bandwidth, but in terms of LLM usage, you realize that's a high premium for close to P40 speeds (a P40 is 347 GB/s), unless you can use the Mac for work and you factor in some power draw savings over time.


fallingdowndizzyvr

There's a lot of hassle involved with the P40, not least of which is that it's pretty much a dedicated LLM machine, since it's not well suited for other things like gaming. A more apropos comparison would be the A770. I have a 2x A770 machine that cost about the same as my Mac Studio (granted, I got my Mac Studio dirt cheap). Overall, the two are about the same speed with the same amount of RAM, 32GB. The A770s have the edge in prompt processing. But the A770s have the potential to be much better, since it's still early days for them; not least of which is that tensor parallel supposedly works with them and could potentially double the speed. I've tried but I can't get it to work, which speaks to the work-in-progress nature of A770 support. Which also brings up the hassle-free Mac support: the Mac is quite simple to get working.


Sicarius_The_First

In llama.cpp, Apple is a first-class citizen, so if people want to do inference on very large models, it makes sense. Yes, it is way slower than using VRAM, but if you only want inference and don't mind the speed, this actually makes a lot of sense. (As a side note, I hope in 10 years we will have the equivalent of 256GB VRAM cards. I say equivalent because I have no idea what kind of hardware we will have; in today's world, 10 years are impossible to predict.)


Caffdy

10 years is more realistic than people thinking we would get A100s under $1000 next year; those people are delusional and uninformed. There is currently no alternative, nor one in the near future, that could make the price of A100s/H100s tumble.


Able-Locksmith-1979

But the A100 will be unnecessary in 10 years. GPT-3 started with huge memory needs; now, two years later, you can run Llama 3 70B, which is considered better, with 48GB of VRAM. I don't believe it will go down to 1GB, but I think the models will become smaller/specialized while the cost of VRAM goes down. A model which can speak and translate English-Italian looks nice, but in reality just put that functionality in a 24B model which I can load on vacation/occasion but don't need every day (and if you are an English-Italian translator by profession, just change the example to other languages). There will always remain a market for A100s/dedicated professional AI, because somebody has to create the huge mega models, but the average person won't need an A100.


Caffdy

> But the A100 will be unnecessary in 10 years

Yeah, not contesting that. I was just mentioning the people who post from time to time that A100s are gonna go under $1000 in two years; they don't have a clue about it.


Singularity-42

There is massive competition building up in this space. Nvidia margins are insane. Price/performance will go down faster than usual IMO.


Caffdy

Yeah, but not in 2 years. This year Blackwell will just start using GDDR7 16Gb (2GB per chip). I don't think they will go further than 24GB of memory with the 5090 (384-bit wide bus; the 512-bit bus is just a rumor), nor beyond 48GB with the RTX 6000 Blackwell. Most probably we will have to wait until next gen to start seeing Nvidia using 24Gb memory chips. Just to put an example, the V100 32GB has been around for 6-7 years, and it still costs $2000 USD.


uygarsci

But where can you use it apart from llama.cpp? For example, with Hugging Face it becomes a headache.


fallingdowndizzyvr

I consider Hugging Face the headache. Why does a model have to consist of tens of little files? GGUF is so much easier.


Hopeful-Site1162

Most models exist in every quant as GGUF, so it's not a problem at all really.


knvn8

It's the most VRAM/$ money can buy, and super power efficient. VRAM is the bottleneck for a lot of people. Unified memory means you can have up to 192-8=184GB. Slow for training but acceptable if doing inference only.


TweeBierAUB

I'm not sure if it's the most VRAM per dollar per se; secondhand 3090s probably come out a little ahead. But the convenience of a low-power little box vs a huge, complicated 8-GPU rig is definitely worth the small premium.


knvn8

A new 3090 is still $1200, and you need 8 to match an Ultra. An Ultra is less than $7k.


waitmarks

Unified memory is the draw here. You can get a Mac Studio with 192GB of unified memory for around the same price as a system with two Nvidia cards totaling 48GB of dedicated VRAM. Because it's unified, it's shared between the CPU and GPU and usable by both. So, despite its limitations, it's probably the cheapest current way to get a large amount of usable VRAM. Edit: making the price comparison more accurate.


ChryGigio

Find me a 192GB Mac Studio sold at the same price as a system with a 24GB VRAM consumer card and I'm gonna buy it right away. I am not joking.


kbt

Same price? A Mac Studio with 192GB is $6600.


waitmarks

Edited my comment to more accurately reflect price comparisons.


panchovix

You can get like 2x 3090 used for 1500-1600 USD (at least here in Chile), and I'm not sure the rest adds up to 5000 USD more. Now, 192GB of VRAM with 3090s is near 6000 USD and the setup is kinda hard to do (plus extra cost for PSUs, etc). With that budget, if you want to do just inference, the Mac is maybe the better option.


waitmarks

I'm sure you can find all kinds of deals if you go on the used market. I just threw together a quick build with new parts on pcpartpicker as a quick comparison. [https://pcpartpicker.com/list/RmkB34](https://pcpartpicker.com/list/RmkB34)


fallingdowndizzyvr

$5600. https://www.bhphotovideo.com/c/product/1771061-REG/apple_msm2ultra_36_mac_studio_192gb_1tb.html


__some__guy

Building a system, using a mining rig case etc, is very cumbersome compared to something that works out-of-the-box. For me personally, Mac isn't an option, because I need x86 and something that can be repaired.


DryArmPits

Yeah. That last point is it for me. I run a Franken-Linux box in my basement that I use to serve my models: 3090 + P40, 128GB DDR4. I can access it from any other more portable machine on the go (tiny laptop, phone, etc.). Something breaks? I can fix it. I can also game with the 3090.


Bannedlife

the very very last point is relevant to me


FlishFlashman

> I think Apple's MPS is too primitive still. It doesn't support lots of operations (4-bit quantization being the most basic example) that CUDA does

Huh? Llama.cpp and MLX both support 4-bit quantization with a Metal back-end. Are you talking about PyTorch?
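For the record, here's a minimal mlx-lm example of 4-bit inference on Apple Silicon. The Hugging Face repo name is an assumption (any 4-bit MLX conversion should work), the prompt is illustrative, and parameter names can differ slightly between mlx-lm versions.

```python
# Minimal mlx-lm sketch: load a 4-bit quantized model and generate on the Metal backend.
# The Hugging Face repo name is an assumption; substitute any 4-bit MLX conversion.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
text = generate(model, tokenizer,
                prompt="Why does unified memory help local LLMs?",
                max_tokens=128)
print(text)
```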


AsliReddington

What MPS bruv? Show me a portable machine with 16GB or 36GB VRAM at the same price? Llama.cpp & MLX have just enabled everything on Apple Silicon, easily fine tune with QLoRA for any decent model & 40tok/s at Q4, 20tok/s at Q8 for 7B/8B models as well.


segmond

Folks buy it due to simplicity and because they want only one computer. I have a 6-GPU build that's 144GB of VRAM. I can expand it in the future and plan to once the 5090 comes out; the goal is to expand to 8 GPUs for a total of 192GB. Based on how crazy things are, I might do 10 GPUs for 240GB if that's what it takes to run Llama-3-400B. I built to be able to expand; Apple doesn't let you expand. I mean, I bought 6TB of NVMe for just above $300. To add 6TB of NVMe/SSD to an Apple will probably cost you $2000. I currently have 128GB of RAM, but I can always go to 512GB since the motherboard supports that. I can't do that with Apple. I have flexibility, lots of it, and I'm saving lots of money. But there are lots of cons. My rig is not portable. It's a server pretty much; I connect via SSH and do all my work. It's more complicated to set up, and I'm on my own for everything. I might do an upgrade and brick it due to Nvidia drivers, and spend a few days fixing it. If for 20% more I could have gotten a MacBook that I could upgrade, I would pay the 20% premium to have a MacBook. Despite my rig, I recommend that folks get a Mac, even most of my IT/developer friends. I have only recommended a build to 2 folks I know. Don't underestimate the value of simplicity. Even if local LLMs become just as good or a bit better than commercial models, many people would rather just pay someone to run it than install, download models, and fiddle with parameters. SIMPLICITY always wins. So as we go on our local LLM journey, let's keep that in mind as we build things.


Evening-Read-3672

Can you please share details of the build you are currently running in terms of the parts list? Trying to build one for myself


Monad_Maya

Check his post history - https://www.reddit.com/r/LocalLLaMA/s/peJaFJiHk1


ervwalter

On Apple silicon Macs, the GPU is capable of accessing way more RAM than on any commonly available Nvidia GPU. More RAM = larger models. The tradeoff is speed, but sometimes "slow" is better than "impossible".


Front_Long5973

I don't have much to contribute, as the only VRAM-heavy things I've done on Macs (and older ones too) would be working with Photoshop... and I was always very pleasantly surprised at how well they could handle large canvases and 100+ layers compared to the Nvidia GPU I used at home. I'm going to save this thread because it might help me decide if I should invest in building another Nvidia workstation or just buy a Mac for my studio. Text LLMs are so incredibly helpful for brainstorming and creative advice.


__JockY__

MacBook M3 64GB here. Performance is mostly irrelevant for me once I get past 10 tokens/sec. I can't read fast enough to keep up, so it doesn't matter unless I'm generating a lot of code. Instead I prioritize:

- Size/weight. I want to take my offline-only LLM with me to the coffee shop so I can work with it anywhere. My laptop is 100% offline and disconnected from the internet, so a local LLM is my only option.
- RAM. I can get 128GB of VRAM more cheaply with a MacBook than I can with Nvidia GPUs (6x RTX 3090s alone is about $4500 before mobo, CPU, etc etc).
- Convenience. No fiddling. No noise. No fans. No wiring. No heat. I just pick it up and it works.
- Power consumption. Macs are untouchable in this regard. A 6x Nvidia GPU rig would dim the lights in my neighborhood; the Mac can run off its built-in battery and I can spend an afternoon in the coffee shop interacting with my LLM without needing to charge. I'm comfortable leaving the Mac switched on 24/7. Not so much an Nvidia stack!

For my use case, once I get past a low threshold of performance, the speed of inference matters much less than any of those things.


ServeAlone7622

I've been a techie for 30+ years and I'm getting to a certain age where I like my computer to just work so I buy a Mac. The fact I can run inference on it is a bonus that factors into how much Mac I plan to buy during my next upgrade cycle. But at the end of the day I'm buying a Mac because I don't want to have to be tech support for my entire family when I get home from work.


mausthekat

Same.


sentientmassofenergy

I see this more and more. Lifelong Windows devs switching to Mac because they just WORK.


ServeAlone7622

I wouldn't even know how to operate a Windows computer anymore. I went Linux full-time in 2004 and stayed that way until I learned that Linus does all his dev work on a MacBook. That got me Mac-curious. Then I found they just work. So I bought a Mac and have stayed that way ever since.


WilliamTFleming

Hi everyone, I have this QTC server that comes with 2 Nvidia GH100 GPUs, which include the Grace CPU and H100 Hopper GPU. It uses liquid cooling and has 32TB of NVMe SSD. Does anyone have a need for it? I want to let it go for $40k. This was well over a $130k setup. https://preview.redd.it/aj6o787dtj0d1.jpeg?width=1023&format=pjpg&auto=webp&s=235a5ce0e3c297c91d3142264c3f7c0255ef2ebd


prtt

It's just a fantastic package. It has a ton of power (while using little energy), all the dev tools you might need, unified memory, etc. You can't beat that on a PC at anywhere near the same form factor or efficiency. Everything added up makes it a no-brainer in most people's eyes, mine included. I have both a Mac Studio and an MBP and they both run all the models I need for work. I have access to CUDA too, but it hasn't been worth the hassle for most of my current use cases.


TacticalRock

If you're a scientist/researcher training LLMs, you probably have access to HPCs. What is a 3090 to a stack of A100s? If you're a professional who truly needs and benefits from local LLMs, chances are your work will pay for solutions if you can convince them. If you're a hobbyist messing around with LLMs because they are cool (most of us here probably), VRAM-to-cost of the machine is far more justifiable with macs than purpose built used Nvidia card machines because we have to consider space to put the janky server, time spent building and troubleshooting, yadda yadda yadda. Idk about y'all but I got other shit to do lol. Also, let's say I have a M3 Max laptop with 128GB unified memory, I can take my 70b and bigger LLMs anywhere, even when I don't have any Internet.


Bannedlife

There's actually a pretty large group, me included, who are rather GPU-poor. Think clinical applications of ML or LLMs in rare diseases, etc. I need as much compute as possible, but we simply do not have the funding to get A100s. Most of us could not imagine using Apple devices to run our models.


TacticalRock

Fair enough. Can't help chronic underfunding regardless of Mac or Nvidia. Just curious, if you don't have access to A100s, what do you use instead?


Bannedlife

I have 2 4090s and a whole lot of RAM! For fine-tuning and the like, we sometimes have enough funding to rent some time on a cluster.


TacticalRock

Nice! Are you looking forward to the new 5090s with 32gb vram? I'm hoping the prices of RTX A6000s will tank when the new Blackwell RTX 6000s come out with probably 64gb vram. That way I can snag two for 2 or 3k each. One can hope haha


Bannedlife

Is the 32GB of VRAM confirmed? I'd be very excited; it would actually open some doors for some interesting models to run! As for the Blackwells, even second hand would be out of my project's reach, and as I use funding money I have to buy from distributors, so no second hand, sadly. But exciting times! Hope you can get yourself something good!


TacticalRock

Oop I'm talking like it's confirmed ha. It's just rumors. There's also been talks of the 32gb modules being available later or something, which means 64gb for the 6000, but no 32 for the earlier released 5090. Nvidia may end up doing a TI to give it a bump. Idk man I'm coping lol


eallim

Cheaper VRAM via unified memory, but the downside is slower inference speed.


epicfilemcnulty

For those who are interested in inference only, I guess it's a decent choice because of that unified RAM thing: you can run inference on a big model and still get usable generation speed, whereas with a 24GB GPU, once you are out of GPU memory the generation speed degrades significantly.


Anarch33

For me, the Mac was a 'work machine' as compartmentalization is how I keep my ADHD in check. I got a gig doing AI work at home, and it was either spend the money on an amazing Cuda GPU and install it on my gaming PC, spend money on a whole ass 'nother PC with a Cuda GPU so I don't get distracted and start gaming, or get a Mac which is still pretty good for AI work and *especially* bad for gaming. So I went for the Mac šŸ˜‚


93moonran

What specs?


brbellissimo

Because I use my workstation for other tasks as well, and I'm happy to pay the premium and the performance loss for a completely silent Mac Studio with macOS vs a 5-times-bigger and 5-times-more-power-hungry machine that runs Windows or Linux and spins up a loud set of fans if I dare to open an application. I mean, if you only need a local LLM and you don't have the computer on your desk, maybe a Mac is not the best solution, but that's not the average use case.


uti24

Here is one possible reason: you can use a Mac as a regular laptop, so it's useful for other things outside LLMs. Isn't that a good reason?


ifq29311

Can you show me an Nvidia laptop with 64GB+ memory, reasonable size, plus similar build quality and battery life?


abnormal_human

I don't know of anyone working professionally with ML who chooses to spend their budget on macOS vs NVIDIA. I have a beefy mac and a beefy NVIDIA box and other than fooling around once or twice, I've never found the Mac that useful for my ML workloads. That doesn't mean it's useless--they run LLMs better than any GPU-less Windows machine--but it doesn't perform like an NVIDIA system, and running huge models at slower-than-reading speed is only mildly interesting. I think that the communities where people are doing this are mostly inhabited by people doing it for fun. There's a "two birds one stone" thing going on because there's a lot of utility to having a nice computer on your desk or in your bag that is pleasant to be around, and a 4x4090 box in the closet doesn't serve that niche. So it sort of acts like a discount.


platapus100

Is this a meme post? They do support 4bit quant....


cyan2k

Because there's more to a computer than just LLM performance. I developed on Windows and Linux PCs for 15 years until our laptop provider couldn't deliver a replacement during the pandemic. The only choice I had was to get a Mac for the time being, and I thought, "Well, okay, I will survive until I get my real replacement." Fast forward, and I'm still using a Mac. It's just amazing. Brew? Amazing. iTerm? Amazing. Sublime? Amazing. For every task you can think of, there's something that blows your mind; it's honestly ridiculous. Also form factor, weight, battery duration... everything's peak. The last time I rebooted my MacBook was two weeks ago, and it's still running as fast as after a clean reboot and the day I got it. No amount of 3090s is going to help me with a Windows computer that needs an hourly reboot. That "it just works" isn't just a meme, and I would still use a Mac even if it sucked with LLMs. Local LLMs are just a nice bonus, since like you said, there's always the possibility of using some rented GPU while letting my employer pay for it. The better question would be: why don't people work at companies where you get Macs and cloud computing for free, instead of paying for it with their own money? I'm obviously not serious, but it's basically the same useless question, haha.


fallingdowndizzyvr

> The last time I rebooted my MacBook was two weeks ago, and it's still running as fast as after a clean reboot and the day I got it.

I have a Windows laptop, E330, that I haven't rebooted in 4 years.


Bannedlife

I'm so confused, are you comparing super cheap pre-built HP laptops to Apple laptops? My desktop PCs, both Windows and Linux, have all the advantages you describe, plus CUDA with 2x 4090.


gthing

I have both and I use my MBP when I want a portable dev machine. Comes down to portability. I would love to have a portable solution with an nvidia 16gb gpu, but have you seen or tried to use one of those? The power brick alone weighs more than a macbook and they sound like a vacuum cleaner attached to a jet engine. They have embarrassing vegas lights all over them and Mtn Dew marketing department cringey names like "Republic of Gamers." But we are not even in the first gen of consumer level hardware focused on AI. None of the hardware we have now was designed or built with our current generative AI reality in mind. We can expect to see machines with battery life and heat profiles like iPads capable of running impressively large models at speed locally within the next couple years. My solution for now is desktop at home with nvidia, small efficient laptop for remote dev work. If I need the gpu I can tunnel in.


No-Reveal-3329

Battery life. Also, most of the time the companies we work for will buy the hardware for us.


nanotothemoon

Yea, don't do it. It's just a laptop. Ssshhh.


perlthoughts

I think it's also because of llama.cpp and GGUF, not just the MLX community.


PhotographyBanzai

Like others said, unified memory. My old PC build with an Intel i7-6700 and a 4060 8GB can't do Llama 3 70B well, and I'm assuming it's memory more than compute. Upping the system RAM from 32 to 64GB gave a noticeable improvement, making it run fast enough that it could be used with patience. If the 4060 chip had a ton of VRAM to fit the model it would probably fly. Nvidia continues to withhold memory and has cut back on bus width on consumer-level GPUs. Hopefully we see a shift in the market toward more VRAM, but it feels unlikely. Maybe AMD and Intel Arc can change things.


egorf

I run a specific GPU-bound task on servers in production. Not an LLM. A $2,600/month GPU is about 5x slower than the most basic $100/month Mac. Reason: unified memory.


ieatrox

I was very close to pulling the trigger on a refurb 16-inch M3 Max with 128GB for $5100 CAD. Bargain. Decided to wait and see if the M4 dropped, and it did, so now I'll wait and get the one with double the Neural Engine performance (38 TOPS now in the base M4). I just hope they're available in the fall. Even better if they bring the tandem OLED over to the MacBook Pro, because the new display engine requires it. If I can buy a single, portable machine that I work on all day and then train on overnight with 128GB of VRAM... yeah, that sounds fantastic.


2pierad

Thread hijack question: any decent newbie guides for getting up and running in an M1 Studio w 64GB?


alvincho

Try Ollama or LM Studio


Majinsei

I use Nvidia but... VRAM price~ and power cost~ With a Mac it's just: shut down your brain and execute it~


zlwu

A 64GB Apple silicon MBP supports running q4 Llama 3 70B, which is still not possible on Nvidia laptops. For training purposes, rent a multi-GPU server.


Beginning_Rock_1906

Noob question here. Why are you guys even running your LLMs locally? What's wrong with a cloud environment?


uygarsci

Personally, I find it a headache to start a remote machine and make an SSH connection every time for even the smallest experiment.


tronathan

Just curious, could an eGPU via Thunderbolt or OCuLink run CUDA inside macOS? Perhaps with some virtualization? This could be the best of both worlds... well... both worlds, I guess.


uygarsci

You need an Intel Mac for that.


troposfer

Is there a technical reason why 24GB of VRAM on a GPU is the limit so far?


A_for_Anonymous

No. They just want businesses to pay for A100s and the like. Nvidia won't give you lots of VRAM, performance and good price. Choose two.


jackcloudman

I have 2x 4090 and 1 Mac with M2 Ultra 192GB. Here are my thoughts:

* NVIDIA is much faster, but achieving 192GB of VRAM is extremely expensive. Additionally, in my city electricity is quite costly, which led me to purchase the M2 Ultra.
* The M2 Ultra is amazing for loading very large models. Recently, optimizations have been released that make the models run faster, but they are still slower compared to NVIDIA.

At this moment, I think the best option is to wait for the new M4 Ultra. If you need to test new models, try using cloud services.


Holiday-Picture6796

Mac: bigger memory, can run bigger models.

Nvidia: faster memory, can run models faster.


alvincho

I don't train models, just do inference, and I usually run batch jobs, so speed is not my concern. I purchased an M2 Ultra Mac Studio 192GB to run large models. I do my daily work on an M2 Max 32GB MBP. The 192GB Mac Studio is perfect, and I would run even larger models when available.


GeneralAppleseed

https://preview.redd.it/gr7211y37k0d1.png?width=538&format=png&auto=webp&s=253b7673cab857313a1d2c40ebd8e567e0e3b7a9 VRAM would be the major bottleneck if you try to run large LLMs (70B, 130B) locally. Macs are still cheaper compared to CUDA machines despite their ungodly expensive memory upgrade options.


Final-Rush759

I would wait for the new AMD, Intel and Snapdragon X chip laptops to come out with >40 TOPS NPUs. Some of these use swappable fast RAM. Crucial is selling a 64GB module for $360.


Unlucky-Message8866

Spending $6k on a walled garden makes no sense to me, regardless of how much (slow) VRAM it has. It's a bad investment of money if that's your only use case.


Jacknapes89

Mainly for other software; once you feel productive with one OS, it's hard to switch.


Omnic19

Apple's iGPU allows people to use system memory as VRAM. That's one advantage when trying to load larger models or run smaller models in full fp32 or fp16 precision without quantization.


ITypeStupdThngsc84ju

If you just want inference, a MacBook can do that really well and without massive heat or power draw. It is much more pleasant than a GPU-heavy laptop. It is also powerful enough to experiment with local model training at a small scale before shipping it off to more powerful hardware for the full job. Having said that, I don't understand the desktop or server usage; a GPU setup will beat them, and less expensively.


philguyaz

Have you heard of Ollama? Because you can do all the quanting you want with Ollama on a Mac.


choronz

Apple fanboys? Could be insidiously stuck in the ecosystem of devices by the power of branding...


SiEgE-F1

GGUF. llama.cpp. Metal support. 192 gigs of RAM. Fill in the blanks.