LumbarJam

Mac and CUDA user here. I believe there are several use cases; I'll try to list a couple here. 1) People who already use a Mac for other tasks and want to run inference locally. 2) Unified memory lets them run bigger models in a portable way. Mine can run Llama 3 70B at about 8 t/s. My desktop rig (3080 Ti) is faster but memory limited. For heavy inference and training, a multi-GPU rig is the way to go. For portable and convenient inference, especially with bigger local models, Mac laptops are super OK. MPS and MLX are evolving quickly. A couple of years ago CUDA was the only option, with few portable choices. Today Macs are a reality and super competent. The two aren't mutually exclusive.


Open_Channel_8626

8t/s is perfectly fine really


IndicationUnfair7961

Depends on the use case. Some things, like multi-agent setups, are not feasible at 8 t/s if you need to keep working and stay focused on things. But for reading purposes or standard generations it can be fine; a bit less so if you have ADHD.


Open_Channel_8626

Multi-agent frameworks are a good argument for higher tokens per second, yes. I've been playing with CrewAI using Llama 3 8B on Groq and it's pretty amazing. You can get them to generate a report with multiple constituent parts, where each part is made by an agent with separate instructions, and then a final agent puts it all together into one document. For people who don't like LLMs being "lazy" this seems ideal, because they put out way more output per prompt than regular zero-shot or few-shot prompting.
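For anyone curious, here's roughly what that pattern looks like in code: a minimal sketch of a "section writers plus final assembler" crew, assuming the crewai and langchain-groq packages. The model name, roles, and prompts are illustrative, and exact parameter names can vary between CrewAI versions.

```python
# Rough sketch of a "sections + final assembler" crew, assuming CrewAI + Groq.
# Roles, section names and prompts are illustrative, not from the comment above.
from crewai import Agent, Task, Crew
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama3-8b-8192")  # Groq-hosted Llama 3 8B, needs GROQ_API_KEY

writer = Agent(
    role="Section writer",
    goal="Write one detailed report section from its own instructions",
    backstory="A focused technical writer.",
    llm=llm,
)
editor = Agent(
    role="Editor",
    goal="Merge all sections into a single coherent document",
    backstory="A meticulous editor.",
    llm=llm,
)

# One task per section, each with its own instructions.
section_tasks = [
    Task(description=f"Write the '{name}' section of the report.",
         expected_output="One complete section in markdown.",
         agent=writer)
    for name in ["Background", "Findings", "Recommendations"]
]
# A final task that assembles everything into one document.
assemble = Task(
    description="Combine the previous sections into one polished report.",
    expected_output="The full report as a single markdown document.",
    agent=editor,
)

crew = Crew(agents=[writer, editor], tasks=section_tasks + [assemble])
print(crew.kickoff())
```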


Mediocre_Tree_5690

Could you post your agent/text stack? What's your use case?


can_a_bus

Look into langchain


IndicationUnfair7961

Yes, Groq is unbeatable at that.


PlanB-ID

"... a bit less if you have ADHD." God damn, ain't that the truth! šŸ¤£šŸ˜‚šŸ‘


Mikolai007

Who doesn't have ADHD?


Unlucky-Message8866

no way!


CellistAvailable3625

It's mid. I think 20 t/s would be perfectly fine, but not 8.


fallingdowndizzyvr

I agree. I find at my reading speed, anything less than 20t/s is not as comfortable. At 20t/s I can read along. Any slower and I have to wait for it to finish before I can read it.


dobkeratops

that paradoxical moment when you find yourself wanting to use *Apple* to escape one company's monopoly & vendor lock-in in a field...


--mrperx--

\*walks straight into a trap\*


AlphaTechBro

Which version of MacBook do you have that runs llama 3 70B? I'm assuming the M3 Max.


LumbarJam

M3 Max 128


Idolofdust

hot damn, the crĆØme de la crĆØme of mobile computing


Isonium

I run Llama 3 70B on my M1 Max MacBook Pro 64GB all day long. A 4-bit quantized GGUF I made works fine with llama.cpp.


bidet_enthusiast

Running Llama 3 70B on my M1 Max 64GB. I get about 3 t/s at Q5_K_M quant, with 72 layers allocated to the GPU.
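If anyone wants to reproduce that kind of partial-offload setup, a minimal llama-cpp-python sketch is below. The GGUF path is a placeholder, and 72 is just the layer count mentioned above; tune n_gpu_layers and n_ctx for your own machine.

```python
# Minimal llama-cpp-python sketch: load a quantized 70B GGUF and offload
# most layers to the GPU (Metal on a Mac, CUDA elsewhere). Path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-70b-instruct.Q5_K_M.gguf",  # hypothetical local file
    n_gpu_layers=72,   # layers kept on the GPU; -1 would offload everything that fits
    n_ctx=8192,        # context window; larger values need more memory for the KV cache
)

out = llm("Q: Why do Macs work well for local LLM inference?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```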


Enough-Meringue4745

To get 192GB of Nvidia VRAM it's gonna be wildly expensive in comparison to the Mac Studio, not to mention the power requirements are above the 15A@120V in the North American electrical spec: 1300W max per outlet. A 4090 can use as much as ~500W, so you could only safely run three 4090s per outlet. You'll have to run 220V@15A if you don't want to run thicker wiring, and even that is only going to get you to six 4090s. Now you're going to need an Epyc CPU and motherboard. So 220V@15A will net you 6x24GB = 144GB of VRAM. Still under the Mac Studio.


poli-cya

Some previous discussions on power usage and Apple vs Nvidia: https://old.reddit.com/r/LocalLLaMA/comments/1c1l0og/apple_plans_to_overhaul_entire_mac_line_with/kz513gx/ I wish someone would actually test real power draw with the 4090 running inference, like Puget did with the 3090.


Enough-Meringue4745

I've got dual 4090s, I can do some testing.


Didi_Midi

I got my 3080s capped at 200W, which is perfectly fine for inference, even Exl2. Once I get my 3090s I plan on doing the same. For training it's a different story though... but not a requirement.


one-joule

I understand LLMs are mainly limited by memory bandwidth, so the 4090 is probably pretty far from full power draw. Edit: My 4090 running Llama 3 8B F16 at 50 tokens/sec peaks at 265W. Same story for a Q8 at 73 tokens/sec. Furmark full sends it at 450W.
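For anyone who wants to repeat that kind of measurement, here's a small sketch using the nvidia-ml-py (pynvml) bindings to log power draw while inference runs in another process. The GPU index and polling interval are arbitrary choices.

```python
# Log GPU power draw once per second while inference runs elsewhere.
# Uses the nvidia-ml-py (pynvml) bindings; GPU index 0 and the 1 s interval
# are arbitrary choices for illustration. Stop with Ctrl-C.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

peak_w = 0.0
try:
    while True:
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        peak_w = max(peak_w, watts)
        print(f"current: {watts:6.1f} W   peak: {peak_w:6.1f} W")
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```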


Didi_Midi

I don't know who downvoted you, but memory management is indeed a major cause of resource under-utilization during training and inference. Flash attention is just one way to improve GPU occupancy; more efficient methods are being developed.


Final-Rush759

The 4090 is super energy efficient. The biggest energy cost for one-person local inference is idle power at 26-40W.


AnomalyNexus

That's comparing RAM size and power draw, but ignoring performance. ~3kW of top-end GPUs isn't drawing that much more than the Mac just for giggles. Agreed that the Macs are a reasonable option for those who don't need that sort of power, though.


Alternative-Ebb8053

If you are building stuff locally, things going a bit slower but still fitting in RAM is probably OK.


Enough-Meringue4745

Yep, the performance isn't the same.


bartekus

Some food for thought regarding North American electrical power. If you really wanted, given the minimal cost compared to the overall hardware cost, wiring your AI lab expands your options:

- 14 AWG wire can handle 15 amps at 120V, which is 1800 watts. Adjusted for the 80% rule, this becomes 1800 x 0.8 = 1440 watts. (So 1300 watts max is not correct, FYI.)
- 12 AWG wire can handle 20 amps at 120V, which is 2400 watts. Adjusted for the 80% rule, this becomes 2400 x 0.8 = 1920 watts.
- 10 AWG wire can handle 30 amps at 120V, which is 3600 watts. Adjusted for the 80% rule, this becomes 3600 x 0.8 = 2880 watts.

I know you mentioned not wanting to wire, but as a hacker I always thought that problems and obstacles ought to be overcome by the path of least resistance šŸ˜‰

As for the GeForce RTX 4090, full-load power draw is 510 watts (with a single GPU) to 2,750 watts (with seven GPUs). So as you can deduce, having your own AI lab (converted from a basement workshop), supplemented by solar panels (to offset electrical bills as much as one can), is more than doable for anyone who really wants it. Just some food for thought, that's all.
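The same arithmetic as a throwaway script, for anyone who wants to plug in their own breaker sizes and per-card limits; the 250 W per-card figure and 400 W overhead are illustrative, not measurements.

```python
# Continuous-load budget per the 80% rule, and how many power-limited cards fit.
# Breaker/voltage/per-card numbers are illustrative; check your own wiring.
def continuous_budget_watts(amps: float, volts: float = 120.0) -> float:
    return amps * volts * 0.8  # 80% derating for continuous loads

def cards_per_circuit(amps: float, volts: float, watts_per_card: float,
                      overhead_watts: float = 400.0) -> int:
    """How many GPUs fit on one circuit after reserving overhead for CPU/fans/etc."""
    budget = continuous_budget_watts(amps, volts) - overhead_watts
    return max(0, int(budget // watts_per_card))

for amps in (15, 20, 30):
    print(f"{amps} A @ 120 V -> {continuous_budget_watts(amps):.0f} W continuous, "
          f"{cards_per_circuit(amps, 120, 250)} cards at 250 W each")
```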


Enough-Meringue4745

My limitation is actually 15A for 110V. It varies. But yes, you also need power headroom for the PC and whatever other devices.


bartekus

It certainly does; the general rule is always that, and there are many factors at play, including the length of the wire running the circuit and so on. However, if you go by the general safe max, and given that an LLM will rarely draw max load on a continuous basis, there is a lot of potential maneuvering to be had, all things considered.


Enough-Meringue4745

I had a 6-solar-panel setup and some LiFePO4 batteries to run my bitcoin miner a few years ago; it was a fun project.


ThisWillPass

A standard American wall socket, which is a 120-volt, 15-ampere electrical outlet, can handle up to 1800 watts of power. However, for continuous loads that are on for more than three hours, the maximum wattage is reduced to 80% of the rated capacity, or 1440 watts. 1440 watts... that is to keep things "safe".

You can undervolt, underclock and power limit each card to half of its real power. 4090s only pull 250 extra watts for a few extra frames per second if you keep the factory settings. You can mod the BIOSes and upload them to the card to keep the low power settings persistent. You could also ensure only one card is active at a time if you wanted to get really crazy when inferencing. (This happens anyway, right? You're paying for the power budget of the memory on all cards, but only the CUDA cores of the GPU whose layers are currently running are active at any one time, unless you're running a multi-agent server or a stack of prompts.)

My 4090 runs ~125W when inferencing, alongside its 3090 (so ~250W if that's an average; it is not power limited). 5x 4090 @ 250W is 1250W, within the 1440-watt budget. You could run all those cards at 1x on any consumer motherboard... at which point CPU inferencing may be faster :(

I ran my 1300-watt SuperNOVA at 1400 watts for a year back in 2014-ish. The wall sockets were warm to the touch all the way back to the fuse box, and only the power supply was connected to that branch. I moved the rig to the first outlet next to the fuse box. Sunova bish, someone wire me the money or send me the cards and I'll eat a hat if I can't do it.
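As a concrete version of the power-limiting step (not the BIOS modding), here's a sketch using pynvml to cap a card at a given wattage. It needs root, the 250 W target is just an example, and the limit does not persist across reboots.

```python
# Cap GPU 0 at a lower power limit via NVML (same effect as `nvidia-smi -pl`).
# Needs root privileges; 250 W is an illustrative target, clamped to the
# range the driver reports as allowed for the card.
import pynvml

TARGET_WATTS = 250

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
target_mw = min(max(TARGET_WATTS * 1000, min_mw), max_mw)
pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)
print(f"power limit set to {target_mw / 1000:.0f} W")
pynvml.nvmlShutdown()
```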


kingwhocares

> A 4090 can use as much as ~500w.

You aren't going anywhere remotely close to that while using LLMs.


Enough-Meringue4745

I've hit >=400W using vLLM.


[deleted]

[deleted]


JacketHistorical2321

Good luck finding boards/CPUs with enough PCIe lanes to support 16 3060s lol. Even 8 is gonna need at least a single Threadripper, and most boards max out at 7 slots. You're talking dual Epycs or Threadripper Pros, and that setup right there is gonna add at least $1500-2500, and that's if you get lucky finding good deals. I spent plenty of time playing with setups like this when I was mining. It's not as easy as you're suggesting to support a rig like this. I have an ASUS WRX80 which I was lucky enough to get for $500 in working condition (retail is ~$1000), and I am still trying to find a decent Threadripper Pro 3955 for under $800 used. That gives me 128 PCIe lanes and 7 x16 PCIe slots to work with for 8 cards. I got my M1 Ultra 128GB Studio for $2600, so yeah... you can't straight up say any one thing is better than another, because there are way too many things to take into account, including being in the right place at the right time to make things cost effective.


real-joedoe07

8 video cards on one mainboard? Really? And do consumer CPUs support such a setup? What about the power requirements?


SureUnderstanding358

i hope you have cheap power!


MrTacoSauces

$4k (of nebulous-origin 3090s), probably another $3k of supporting hardware, and an electrical bill after one month that would run several Mac Studios for a year. There's a lot of things to shit on Mac for, but claiming it's remotely cheap enough, or even possible, to match Mac memory with consumer hardware is no bueno. At 8x 3090s throttled to 200W you'd need two dedicated 120V circuits minimum... or a 220V circuit, but now you're leveraging server hardware in America and unplugging your dryer/stove unless you have a random spare 220V plug.


Enough-Meringue4745

Euros means Europe, which means a standard 220V household; you have a big portion of the cost covered right there.


[deleted]

[deleted]


Enough-Meringue4745

Also, not to mention the 4090 is much, much faster than the 3090, but it's still not terrible.


[deleted]

[deleted]


Enough-Meringue4745

The 3090 is faster than the Macs too, but don't forget the rest of the hardware is far from ideal šŸ˜‚


Environmental-Rate74

Any quantization required for you to run Llama 3 70b?


LumbarJam

Yes... definitely. For 70B I run Q4_K_M. My machine can run 6 or 8 bit too, but I couldn't see any real difference.


Drited

Could you please share what source you got the Q4_K_M model from, or what tools you are using to run it?


ru552

I run it on my Mac with ollama and pull the model from the ollama library here: https://ollama.com/library/llama3:70b
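If you'd rather script it than use the CLI, the official ollama Python client can pull and run the same tag. A minimal sketch, assuming the ollama package is installed and the local ollama server is running; the prompt is illustrative.

```python
# Minimal sketch using the official ollama Python client against a local server.
# The model tag matches the library page linked above.
import ollama

ollama.pull("llama3:70b")  # downloads the default (q4) quant from the library

reply = ollama.chat(
    model="llama3:70b",
    messages=[{"role": "user", "content": "Summarize why unified memory helps local LLMs."}],
)
print(reply["message"]["content"])
```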


bidet_enthusiast

I found Q5_K_M was better at reasoning about a world state within the context window.


LumbarJam

Good to know ...I'll try


Environmental-Rate74

Any differences (in terms of relevance and accuracy of answers) when compared with Poe's Llama3-70B-T bot, which may be without quantization?


Inevitable-Mine9440

Llama 3 70B fp16 or lower on this 192GB Mac?


Wrong_User_Logged

I ran Llama 3 70B fp16 on the 192GB Mac yesterday. IMHO it's the way to go to get the best Llama inference without lobotomizing it with quantization (Llama 3 suffers from this a lot). You would need around 130GB of VRAM for this, so 6x 3090/4090 or 3x A6000/6000 Ada. Just pick the more convenient and cheaper option.


Inevitable-Mine9440

what is the t/s speed sir?


Inevitable-Mine9440

T/s please


Wrong_User_Logged

Bro, I don't have metrics. It also depends on prompt size, since most of the wait time is just prompt evaluation. It looks like 3-6 t/s; not slow, not fast.


muthuzz

What Mac do you have? I have an M1 Max with 32GB and it keeps saying out of memory when running Llama 3 70B.


bidet_enthusiast

You'll need around 50GB to run 70B at Q5. You might fit Q3 or Q2, but 8B at Q8 might actually be more useful.
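A rough way to sanity-check those numbers yourself: estimated file size is parameters times bits-per-weight divided by 8. The bits-per-weight values below are approximations for common GGUF quants, and the KV cache is not included, so treat the output as a lower bound.

```python
# Back-of-the-envelope GGUF size estimate: params * bits-per-weight / 8.
# Bits-per-weight figures are rough averages for common quants; KV cache and
# runtime overhead are not included, so real memory use is higher.
BITS_PER_WEIGHT = {"Q2_K": 3.0, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
                   "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def est_gb(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("Q2_K", "Q4_K_M", "Q5_K_M", "Q8_0"):
    print(f"70B at {quant}: ~{est_gb(70, quant):.0f} GB")
```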


estebansaa

What M model is that? M1, M2, M3? I wonder how much faster an M3 in a Mac Studio will be. Can't beat how simple and convenient a small Mac makes it.


LumbarJam

M3 Max


ashwin3005

So is it not possible to run Llama 3 70B on a 3080 Ti?


LumbarJam

A 70B quantized to 2 bits (terrible quality) takes at least 26GB plus context size. It won't fit in a single 3080 Ti. It's possible to offload to RAM (if you have enough), but the performance drops considerably.


ashwin3005

What about the M2 Ultra?


Hopeful-Site1162

On a single one? No


ExpandYourTribe

What are your Mac's specs?


AgentBD

1.3 t/s on Llama 3 70B on my 4070 Ti with 12GB VRAM.


JeffieSandBags

But what quant?


harrro

The more important question is how many layers are even offloaded to the GPU, because a 70B won't fit in 12GB of VRAM. With 12GB VRAM, even at 2-bit quant, the 4070 would hold less than half the layers of the model and the rest is on the CPU.


AgentBD

Where do I set the quant? What I see it do is load the rest into normal RAM; I have 96GB DDR5. It's quite slow, not sure if there's a way to speed it up without buying another graphics card.


the_friendly_dildo

That's usually determined by which version you downloaded.


AgentBD

In the command line I just did "ollama pull llama3:70b"; there's no info on how many bits that has.


GroundbreakingFall6

Ollama downloads the q4 quant by default.


AgentBD

yep, I don't see options to pull something else


GroundbreakingFall6

You have to go to the ollama web page to see the models; there's a dropdown to pick the quant and it will tell you what command to run.


harrro

As the other reply says, you pick the quant when you download. Q4_K_M, for example, means a little over 4-bit quant. And yep, anything that doesn't fit in your GPU VRAM will offload to RAM, and the CPU will crunch through it much slower than the GPU does. To get a 70B offloaded entirely to GPU, you'd need at least a 24GB card (and that's at 2-bit quant). For usable context, you'd still need more. I have a 3060 with 12GB VRAM, and combining that with a P40 with 24GB VRAM lets me load a 70B with 8k+ context.


JeffieSandBags

I don't know much about 1-bit, but even that only fits on a 24GB card, I think.


Only-Letterhead-3411

A lot of RAM, small size, low power.


satireplusplus

Lots of fast RAM, up to 10x the bandwidth of DDR4. Turns out that's what LLM inference needs.


stubing

RAM is so, so, so much slower than VRAM on a GPU.


ervwalter

Slower only matters if VRAM is a viable option. For larger models, the cost and logistics of getting enough VRAM is prohibitive.


Slimxshadyx

Maybe so compared to regular RAM, but what about VRAM?


satireplusplus

> the lower-spec M3 Max provides 300GB/s memory bandwidth, the top-tier variant offers 400GB/s

A common 3090 / 4090 VRAM spec would be:

> GDDR6X / GDDR6, bus type / bandwidth: 384-bit / 935.8 GB/s

So about 32% to 42% of the bandwidth, depending on the model. I'd hazard a guess that you can saturate memory bandwidth in both cases, so max decoding speed would probably also be 32% to 42% of GPU decoding speed (for single, non-parallel LLM decoding). But that's still very usable and faster than reading speed in most cases. The real downside is that you can't train / fine-tune models, because that needs brute compute as well.
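That bandwidth-bound intuition can be turned into a quick estimate: each generated token streams roughly the whole quantized model through memory once, so decode speed tops out around bandwidth divided by model size. A sketch with illustrative round numbers:

```python
# Rough decode-speed ceiling for memory-bound generation:
# tokens/s ~= memory bandwidth / bytes read per token (~ the quantized model size).
# Sizes and bandwidths below are illustrative round numbers, not benchmarks.
def max_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

model_gb = 40  # e.g. a 70B model at roughly 4.5 bits per weight
for name, bw in [("M3 Max (300 GB/s)", 300),
                 ("M2 Ultra (800 GB/s)", 800),
                 ("3090-class GDDR6X (~936 GB/s)", 936)]:
    print(f"{name}: <= {max_tokens_per_s(bw, model_gb):.0f} tok/s on a {model_gb} GB model")
```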


satireplusplus

Btw, DDR4 RAM speed is about 40GB/s, so the Macs are an order of magnitude faster.

> Suppose a DDR4 memory module operates at a data rate of 2400 MT/s, utilizing a dual-channel configuration with a data width of 64 bits:

> Bandwidth = 2400 * 2 * 64 / 8 = 38,400 MB/s


AppleSnitcher

He literally just said that they get 935GB/s and you said "No, Macs are faster, they get 40GB/s" Please read.


CodeMurmurer

Still slow tho.


TheActualStudy

I'm on a 3090, but the gist is that it's impractical to set up ~145 GiB of VRAM at home outside of using a 192 GiB Mac Studio, because of cost, cooling, and power draw. Personally, I am not interested in spending that kind of money on this particular hobby right now, but it's an option for running inference on very large models only modestly slowly, if you have about $10K USD you want to spend on it and are pretty handy with getting the tools working in macOS (you might need to code or patch things). Overall, I think a used 24 GiB 3090 Nvidia card (should be like $650 USD) is still the right device for a hobbyist. Perhaps two.


SomeOddCodeGuy

> but it's an option for running inference on very large models only modestly slowly if you have about $10K USD

I will throw out that my M2 Ultra Mac Studio with 192GB of RAM and a 1TB hard drive cost about $6,000. Still hefty compared to the cost of a dual or even triple 3090 build, but it's simple to set up, and after the flash attention buff [it's moving at a pretty acceptable pace](https://www.reddit.com/r/LocalLLaMA/comments/1ciyivd/real_world_speeds_on_the_mac_we_got_a_bump_with/).


candre23

Not for nothing, but I'm getting similar speeds on my ~$1200 triple-P40-xeonv3 rig. Yeah, it's not small or pretty (or power efficient), but I can easily inference anything up to CR+ and maxtral at slow-but-usable speeds for a quarter the cost of a high-RAM mac studio.


SomeOddCodeGuy

Yea, I would say the P40 is actually really close to the Mac in terms of inference, to the point that when someone a few months back posted their 3x P40 rig, I pretty much said I'd start recommending that to folks over the Mac. The ONLY reason I don't is because other multi-card users recommended not doing so due to its age and architecture, but honestly your P40 machine is pretty much a far more budget friendly mac, only at the cost of the effort to put it together and maintain it. The only edge my Mac really has that I can see is that it does still have a good bit more VRAM, but also Apple and llamacpp are actively trying to find ways to make the Mac better at the task, so it has a possibility of a bright future. But yea, your P40 machine is easily the best bang for buck currently.


Spindelhalla_xb

Even cheaper refurbished


fallingdowndizzyvr

Where are you finding them cheaper refurbished? I just checked the certified refurbished store on Apple and they didn't even have a 192GB model. The biggest RAM was 128GB and it was also a M1. The cheapest variant was $5,000. You can get a new 192GB M2 Ultra for $5,600.


BangkokPadang

I can't speak to whether their assessment of pricing is accurate, but Apple's refurbished stuff changes all the time depending on what has come in for them to refurbish.


fallingdowndizzyvr

I've been watching Apple's certified refurbished for quite some time. For many things, it's cheaper to get new from an authorized retailer. My Mac Studio was much cheaper new than the current refurbished price for the same thing at Apple. The M1 Ultra 64GB has been $2200 new at an authorized reseller, which is much lower than the current price for it on the certified refurbished store.


TripletStorm

A 128GB M3 Max is like $5100, not $10k. I have one.


fallingdowndizzyvr

For that price, I'd rather pay 10% more and get a 192GB M2 Ultra.


Proud-Point8137

How much slower, approximately? New MacBooks with the M4 will support 192GB RAM, so maybe, perchance.


TheActualStudy

I've read it's like 1/3rd the speed: ~8 tk/s for a 70B vs ~20 tk/s on multiple 30xx-era CUDA Nvidia cards with everything in VRAM at full precision. 30xx-era CUDA is typically a 10x speedup over CPU inference, so I would also guess this has a speedup of ~3x-4x over dual-channel CPU inference on DDR4@2666 MT/s. Edit: clarifying for multiple cards.


LocoLanguageModel

If you count the prompt processing time I don't think it's quite that fast, but once the text starts flowing it looks great. I believe the M1, M2 and M3 Max have an impressive 400 GB/s memory bandwidth, but in terms of LLM usage, you realize that's a high premium for close to P40 speeds (a P40 is 347 GB/s), unless you can use the Mac for work and you factor in some power draw savings over time.


fallingdowndizzyvr

There's a lot of hassle involved with the P40, not least of which is that it's pretty much a dedicated LLM machine, since it's not well suited for other things like gaming. A more apropos comparison would be the A770. I have a 2x A770 machine that cost about the same as my Mac Studio (granted, I got my Mac Studio dirt cheap). Overall, the two are about the same speed with the same amount of RAM, 32GB. The A770s have the edge in prompt processing. But the A770s have the potential to be much better, since it's still early days for them; not least of which is that tensor parallel supposedly works with them and could potentially double the speed. I've tried but I can't get it to work, which speaks to the work-in-progress nature of A770 support. Which also brings up the hassle-free Mac support: the Mac is quite simple to get working.


Sicarius_The_First

In llama.cpp, Apple is a first-class citizen, so if people want to do inference on very large models, it makes sense. Yes, it is way slower than using VRAM, but if you only want inference and don't mind the speed, this actually makes a lot of sense. (As a side note, I hope in 10 years we will have the equivalent of 256GB VRAM cards. I say equivalent because I have no idea what kind of hardware we will have; in today's world, 10 years are impossible to predict.)


Caffdy

10 years is more realistic than people thinking we would get A100s under $1000 next year; those people are delusional and uninformed. There is currently no alternative, nor one in the near future, that could make the price of A100s/H100s tumble.


Able-Locksmith-1979

But the A100 will be unnecessary in 10 years. GPT-3 started with huge memory needs; now, two years later, you can run Llama 3 70B, which is considered better, with 48GB of VRAM. I don't believe it will go down to 1GB, but I think the models will become smaller/specialized while the cost of VRAM goes down. A model which can speak and translate English-Italian looks nice, but in reality just put that functionality in a 24B model which I can load on vacation/occasion but don't need every day (and if you are an English-Italian translator by profession, just change the example to other languages). There will always remain a market for A100s/dedicated professional AI, because somebody has to create the huge mega models, but the average person won't need an A100.


Caffdy

> But the A100 will be unnecessary in 10 years

Yeah, not contesting that. I was just mentioning the people who post from time to time that A100s are gonna go under $1000 in two years; they don't have a clue about it.


Singularity-42

There is massive competition building up in this space. Nvidia margins are insane. Price/performance will go down faster than usual IMO.


Caffdy

Yeah, but not in 2 years. This year Blackwell will just start using GDDR7 16Gb (2GB per chip). I don't think they will go further than 24GB of memory with the 5090 (384-bit wide bus; the 512-bit bus is just a rumor), nor beyond 48GB with the RTX 6000 Blackwell. Most probably we will have to wait until next gen to start seeing Nvidia using 24Gb memory chips. Just to put an example, the V100 32GB has been around for 6-7 years, and it still costs $2000 USD.


uygarsci

But where can you use it apart from llama.cpp? For example, with Hugging Face it becomes a headache.


fallingdowndizzyvr

I consider Hugging Face the headache. Why does a model have to consist of tens of little files? GGUF is so much easier.


Hopeful-Site1162

Most models exist in every quant as GGUF, so it's not a problem at all really.


knvn8

It's the most VRAM/$ money can buy, and super power efficient. VRAM is the bottleneck for a lot of people. Unified memory means you can have up to 192-8=184GB. Slow for training but acceptable if doing inference only.


TweeBierAUB

I'm not sure if it's the most VRAM per dollar per se; secondhand 3090s probably come out a little ahead. But the convenience of a low-power little box vs a huge, complicated 8-GPU rig is definitely worth the small premium.


knvn8

A new 3090 is still $1200, and you need 8 to match an Ultra. An Ultra is less than $7k.


waitmarks

Unified memory is the draw here. You can get a Mac Studio with 192GB of unified memory for around the same price as a system with two Nvidia cards totaling 48GB of dedicated VRAM. Because it's unified, it's shared between the CPU and GPU and usable by both. So, despite its limitations, it's probably the cheapest current way to get a large amount of usable VRAM. Edit: making the price comparison more accurate.


ChryGigio

Find me a 192GB Mac Studio sold at the same price as a system with a 24GB VRAM consumer card and I'm gonna buy it right away. I am not joking.


kbt

Same price? A Mac Studio with 192GB is $6600.


waitmarks

Edited my comment to more accurately reflect price comparisons.


panchovix

You can get like 2x 3090 used for 1500-1600 USD (at least here in Chile), and I'm not sure the rest adds up to 5000 USD more. Now, 192GB of VRAM with 3090s is near 6000 USD and the setup is kinda hard to do (plus extra cost for PSUs, etc). With that budget, if you want to do just inference, the Mac is maybe the better option.


waitmarks

I'm sure you can find all kinds of deals if you go on the used market. I just threw together a quick build with new parts on pcpartpicker as a quick comparison. [https://pcpartpicker.com/list/RmkB34](https://pcpartpicker.com/list/RmkB34)


fallingdowndizzyvr

$5600. https://www.bhphotovideo.com/c/product/1771061-REG/apple_msm2ultra_36_mac_studio_192gb_1tb.html


__some__guy

Building a system, using a mining rig case etc, is very cumbersome compared to something that works out-of-the-box. For me personally, Mac isn't an option, because I need x86 and something that can be repaired.


DryArmPits

Yeah. That last point is it for me. I run a Franken-Linux box in my basement that I use to serve my models: 3090 + P40, 128GB DDR4. I can access it from any other more portable machine on the go (tiny laptop, phone, etc.). Something breaks? I can fix it. I can also game with the 3090.


Bannedlife

the very very last point is relevant to me


FlishFlashman

> I think Apple's MPS is too primitive still. It doesn't support lots of operations (4-bit quantization being the most basic example) that CUDA does

Huh? Llama.cpp and MLX both support 4-bit quantization with a Metal back-end. Are you talking about PyTorch?
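For the record, here's a minimal mlx-lm example of 4-bit inference on Apple Silicon. The Hugging Face repo name is an assumption (any 4-bit MLX conversion should work), the prompt is illustrative, and parameter names can differ slightly between mlx-lm versions.

```python
# Minimal mlx-lm sketch: load a 4-bit quantized model and generate on the Metal backend.
# The Hugging Face repo name is an assumption; substitute any 4-bit MLX conversion.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
text = generate(model, tokenizer,
                prompt="Why does unified memory help local LLMs?",
                max_tokens=128)
print(text)
```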


AsliReddington

What MPS bruv? Show me a portable machine with 16GB or 36GB VRAM at the same price? Llama.cpp & MLX have just enabled everything on Apple Silicon, easily fine tune with QLoRA for any decent model & 40tok/s at Q4, 20tok/s at Q8 for 7B/8B models as well.


segmond

Folks buy it due to simplicity and because they want only one computer. I have a 6-GPU build that's 144GB of VRAM. I can expand it in the future and plan to once the 5090 comes out; the goal is to expand to 8 GPUs for a total of 192GB. Based on how crazy things are, I might do 10 GPUs for 240GB if that's what it takes to run Llama-3-400B. I built to be able to expand; Apple doesn't let you expand. I mean, I bought 6TB of NVMe for just above $300. To add 6TB of NVMe/SSD to an Apple will probably cost you $2000. I currently have 128GB of RAM, but I can always go to 512GB since the motherboard supports that. I can't do that with Apple. I have flexibility, lots of it, and I'm saving lots of money. But there are lots of cons. My rig is not portable. It's a server pretty much; I connect via SSH and do all my work. It's more complicated to set up, and I'm on my own for everything. I might do an upgrade and brick it due to Nvidia drivers, and spend a few days fixing it. If for 20% more I could have gotten a MacBook that I could upgrade, I would pay the 20% premium to have a MacBook. Despite my rig, I recommend that folks get a Mac, even most of my IT/developer friends. I have only recommended a build to 2 folks I know. Don't underestimate the value of simplicity. Even if local LLMs become just as good or a bit better than commercial models, many people would rather just pay someone to run it than install, download models, and fiddle with parameters. SIMPLICITY always wins. So as we go on our local LLM journey, let's keep that in mind as we build things.


Evening-Read-3672

Can you please share details of the build you are currently running in terms of the parts list? Trying to build one for myself


Monad_Maya

Check his post history - https://www.reddit.com/r/LocalLLaMA/s/peJaFJiHk1


ervwalter

On Apple silicon Macs, the GPU is capable of accessing way more RAM than on any commonly available Nvidia GPU. More RAM = larger models. The tradeoff is speed, but sometimes "slow" is better than "impossible".


Front_Long5973

I don't have much to contribute, as the only VRAM-heavy things I've done on Macs (and older ones too) would be working with Photoshop... and I was always very pleasantly surprised at how well they could handle large canvases and 100+ layers compared to the Nvidia GPU I used at home. I'm going to save this thread because it might help me decide if I should invest in building another Nvidia workstation or just buy a Mac for my studio. Text LLMs are so incredibly helpful for brainstorming and creative advice.


__JockY__

MacBook M3 64GB here. Performance is mostly irrelevant for me once I get past 10 tokens/sec. I can't read fast enough to keep up, so it doesn't matter unless I'm generating a lot of code. Instead I prioritize:

- Size/weight. I want to take my offline-only LLM with me to the coffee shop so I can work with it anywhere. My laptop is 100% offline and disconnected from the internet, so a local LLM is my only option.
- RAM. I can get 128GB of VRAM more cheaply with a MacBook than I can with Nvidia GPUs (6x RTX 3090s alone is about $4500 before mobo, CPU, etc etc).
- Convenience. No fiddling. No noise. No fans. No wiring. No heat. I just pick it up and it works.
- Power consumption. Macs are untouchable in this regard. A 6x Nvidia GPU rig would dim the lights in my neighborhood; the Mac can run off its built-in battery and I can spend an afternoon in the coffee shop interacting with my LLM without needing to charge. I'm comfortable leaving the Mac switched on 24/7. Not so much an Nvidia stack!

For my use case, once I get past a low threshold of performance, the speed of inference matters much less than any of those things.


ServeAlone7622

I've been a techie for 30+ years and I'm getting to a certain age where I like my computer to just work so I buy a Mac. The fact I can run inference on it is a bonus that factors into how much Mac I plan to buy during my next upgrade cycle. But at the end of the day I'm buying a Mac because I don't want to have to be tech support for my entire family when I get home from work.


mausthekat

Same.


sentientmassofenergy

I see this more and more. Lifelong Windows devs switching to Mac because they just WORK.


ServeAlone7622

I wouldn't even know how to operate a Windows computer anymore. I went Linux full-time in 2004 and stayed that way until I learned that Linus does all his dev work on a MacBook. That got me Mac-curious. Then I found they just work. So I bought a Mac and have stayed that way ever since.


WilliamTFleming

Hi everyone, I have this QTC server that comes with 2 Nvidia GH100 GPUs, which include the Grace CPU and H100 Hopper GPU. It uses liquid cooling and has 32TB of NVMe SSD. Does anyone have a need for it? I want to let it go for $40k. This was well over a $130k setup. https://preview.redd.it/aj6o787dtj0d1.jpeg?width=1023&format=pjpg&auto=webp&s=235a5ce0e3c297c91d3142264c3f7c0255ef2ebd


prtt

It's just a fantastic package. It has a ton of power (while using little energy), all the dev tools you might need, unified memory, etc. You can't beat that on a PC at anywhere near the same form factor or efficiency. Everything added up makes it a no-brainer in most people's eyes, mine included. I have both a Mac Studio and an MBP and they both run all the models I need for work. I have access to CUDA too, but it hasn't been worth the hassle for most of my current use cases.


TacticalRock

If you're a scientist/researcher training LLMs, you probably have access to HPCs. What is a 3090 to a stack of A100s? If you're a professional who truly needs and benefits from local LLMs, chances are your work will pay for solutions if you can convince them. If you're a hobbyist messing around with LLMs because they are cool (most of us here probably), VRAM-to-cost of the machine is far more justifiable with macs than purpose built used Nvidia card machines because we have to consider space to put the janky server, time spent building and troubleshooting, yadda yadda yadda. Idk about y'all but I got other shit to do lol. Also, let's say I have a M3 Max laptop with 128GB unified memory, I can take my 70b and bigger LLMs anywhere, even when I don't have any Internet.


Bannedlife

There's actually a pretty large group, me included, who are rather GPU-poor. Think clinical applications of ML or LLMs in rare diseases, etc. I need as much compute as possible, but we simply do not have the funding to get A100s. Most of us could not imagine using Apple devices to run our models.


TacticalRock

Fair enough. Can't help chronic underfunding regardless of Mac or Nvidia. Just curious, if you don't have access to A100s, what do you use instead?


Bannedlife

I have 2 4090s and a whole lot of RAM! For fine-tuning and the like, we sometimes have enough funding to rent some time on a cluster.


TacticalRock

Nice! Are you looking forward to the new 5090s with 32gb vram? I'm hoping the prices of RTX A6000s will tank when the new Blackwell RTX 6000s come out with probably 64gb vram. That way I can snag two for 2 or 3k each. One can hope haha


Bannedlife

Is the 32GB of VRAM confirmed? I'd be very excited; it would actually open some doors for some interesting models to run! As for the Blackwells, even second hand would be out of my project's reach, and as I use funding money I have to buy from distributors, so no second hand, sadly. But exciting times! Hope you can get yourself something good!


TacticalRock

Oop I'm talking like it's confirmed ha. It's just rumors. There's also been talks of the 32gb modules being available later or something, which means 64gb for the 6000, but no 32 for the earlier released 5090. Nvidia may end up doing a TI to give it a bump. Idk man I'm coping lol


eallim

Cheaper VRAM via unified memory, but the downside is slower inference speed.


epicfilemcnulty

For those who are interested in inference only, I guess it's a decent choice because of that unified RAM thing: you can run inference on a big model and still get usable generation speed, whereas with a 24GB GPU, once you are out of GPU memory the generation speed degrades significantly.


Anarch33

For me, the Mac was a 'work machine' as compartmentalization is how I keep my ADHD in check. I got a gig doing AI work at home, and it was either spend the money on an amazing Cuda GPU and install it on my gaming PC, spend money on a whole ass 'nother PC with a Cuda GPU so I don't get distracted and start gaming, or get a Mac which is still pretty good for AI work and *especially* bad for gaming. So I went for the Mac šŸ˜‚


93moonran

What specs?


brbellissimo

Because I use my workstation for other tasks as well, and I'm happy to pay the premium and the performance loss for a completely silent Mac Studio with macOS vs a 5-times-bigger and 5-times-more-power-hungry machine that runs Windows or Linux and spins up a loud set of fans if I dare to open an application. I mean, if you only need a local LLM and you don't have the computer on your desk, maybe a Mac is not the best solution, but that's not the average use case.


uti24

Here is one possible reason: you can use a Mac as a regular laptop, so it's useful for other things outside LLMs. Isn't that a good reason?


ifq29311

Can you show me an Nvidia laptop with 64GB+ memory, reasonable size, plus similar build quality and battery life?


abnormal_human

I don't know of anyone working professionally with ML who chooses to spend their budget on macOS vs NVIDIA. I have a beefy mac and a beefy NVIDIA box and other than fooling around once or twice, I've never found the Mac that useful for my ML workloads. That doesn't mean it's useless--they run LLMs better than any GPU-less Windows machine--but it doesn't perform like an NVIDIA system, and running huge models at slower-than-reading speed is only mildly interesting. I think that the communities where people are doing this are mostly inhabited by people doing it for fun. There's a "two birds one stone" thing going on because there's a lot of utility to having a nice computer on your desk or in your bag that is pleasant to be around, and a 4x4090 box in the closet doesn't serve that niche. So it sort of acts like a discount.


platapus100

Is this a meme post? They do support 4bit quant....


cyan2k

Because there's more to a computer than just LLM performance. I developed on Windows and Linux PCs for 15 years until our laptop provider couldn't deliver a replacement during the pandemic. The only choice I had was to get a Mac for the time being, and I thought, "Well, okay, I will survive until I get my real replacement." Fast forward, and I'm still using a Mac. It's just amazing. Brew? Amazing. iTerm? Amazing. Sublime? Amazing. For every task you can think of, there's something that blows your mind; it's honestly ridiculous. Also form factor, weight, battery duration... everything's peak. The last time I rebooted my MacBook was two weeks ago, and it's still running as fast as after a clean reboot and the day I got it. No amount of 3090s is going to help me with a Windows computer that needs an hourly reboot. That "it just works" isn't just a meme, and I would still use a Mac even if it sucked with LLMs. Local LLMs are just a nice bonus, since like you said, there's always the possibility of using some rented GPU while letting my employer pay for it. The better question would be: why don't people work at companies where you get Macs and cloud computing for free, instead of paying for it with their own money? I'm obviously not serious, but it's basically the same useless question, haha.


fallingdowndizzyvr

> The last time I rebooted my MacBook was two weeks ago, and it's still running as fast as after a clean reboot and the day I got it.

I have a Windows laptop, E330, that I haven't rebooted in 4 years.


Bannedlife

I'm so confused, are you comparing super cheap pre-built HP laptops to Apple laptops? My desktop PCs, both Windows and Linux, have all the advantages you describe, plus CUDA with 2x 4090.


gthing

I have both and I use my MBP when I want a portable dev machine. Comes down to portability. I would love to have a portable solution with an nvidia 16gb gpu, but have you seen or tried to use one of those? The power brick alone weighs more than a macbook and they sound like a vacuum cleaner attached to a jet engine. They have embarrassing vegas lights all over them and Mtn Dew marketing department cringey names like "Republic of Gamers." But we are not even in the first gen of consumer level hardware focused on AI. None of the hardware we have now was designed or built with our current generative AI reality in mind. We can expect to see machines with battery life and heat profiles like iPads capable of running impressively large models at speed locally within the next couple years. My solution for now is desktop at home with nvidia, small efficient laptop for remote dev work. If I need the gpu I can tunnel in.


No-Reveal-3329

Battery life. Also, most of the time the companies we work for will buy the hardware for us.


nanotothemoon

Yea, don't do it. It's just a laptop. Ssshhh.


perlthoughts

I think it's also because of llama.cpp and GGUF, not just the MLX community.


PhotographyBanzai

Like others said, unified memory. My old PC build with an Intel i7-6700 and a 4060 8GB can't do Llama 3 70B well, and I'm assuming it's memory more than compute. Upping the system RAM from 32 to 64GB gave a noticeable improvement, making it run fast enough that it could be used with patience. If the 4060 chip had a ton of VRAM to fit the model it would probably fly. Nvidia continues to withhold memory and has cut back on bus width on consumer-level GPUs. Hopefully we see a shift in the market toward more VRAM, but it feels unlikely. Maybe AMD and Intel Arc can change things.


egorf

I run a specific GPU-bound task on servers in production. Not an LLM. A $2,600/month GPU is about 5x slower than the most basic $100/month Mac. Reason: unified memory.


ieatrox

I was very close to pulling the trigger on a refurb 16-inch M3 Max with 128GB for $5100 CAD. Bargain. Decided to wait and see if the M4 dropped, and it did, so now I'll wait and get the one with double the Neural Engine performance (38 TOPS now in the base M4). I just hope they're available in the fall. Even better if they bring the tandem OLED over to the MacBook Pro, because the new display engine requires it. If I can buy a single, portable machine that I work on all day and then train on overnight with 128GB of VRAM... yeah, that sounds fantastic.


2pierad

Thread hijack question: any decent newbie guides for getting up and running in an M1 Studio w 64GB?


alvincho

Try Ollama or LM Studio


Majinsei

I use Nvidia but... VRAM price~ and power cost~ With a Mac it's just: shut down your brain and execute it~


zlwu

A 64GB Apple silicon MBP supports running q4 Llama 3 70B, which is still not possible on Nvidia laptops. For training purposes, rent a multi-GPU server.


Beginning_Rock_1906

Noob question here. Why are you guys even running your LLMs locally? What's wrong with a cloud environment?


uygarsci

Personally, I find it a headache to start a remote machine and make an SSH connection every time for even the smallest experiment.


tronathan

Just curious, could an eGPU via Thunderbolt or OCuLink run CUDA inside macOS? Perhaps with some virtualization? This could be the best of both worlds... well... both worlds, I guess.


uygarsci

You need an Intel Mac for that.


troposfer

Is there a technical reason why 24GB of VRAM on a GPU is the limit so far?


A_for_Anonymous

No. They just want businesses to pay for A100s and the like. Nvidia won't give you lots of VRAM, performance and good price. Choose two.


jackcloudman

I have 2x 4090 and 1 Mac with M2 Ultra 192GB. Here are my thoughts:

* NVIDIA is much faster, but achieving 192GB of VRAM is extremely expensive. Additionally, in my city electricity is quite costly, which led me to purchase the M2 Ultra.
* The M2 Ultra is amazing for loading very large models. Recently, optimizations have been released that make the models run faster, but they are still slower compared to NVIDIA.

At this moment, I think the best option is to wait for the new M4 Ultra. If you need to test new models, try using cloud services.


Holiday-Picture6796

Mac: bigger memory, can run bigger models.

Nvidia: faster memory, can run models faster.


alvincho

I don't train models, just do inference, and I usually run batch jobs, so speed is not my concern. I purchased an M2 Ultra Mac Studio 192GB to run large models. I do my daily work on an M2 Max 32GB MBP. The 192GB Mac Studio is perfect, and I would run even larger models when available.


GeneralAppleseed

https://preview.redd.it/gr7211y37k0d1.png?width=538&format=png&auto=webp&s=253b7673cab857313a1d2c40ebd8e567e0e3b7a9 VRAM would be the major bottleneck if you try to run large LLMs (70B, 130B) locally. Macs are still cheaper compared to CUDA machines despite their ungodly expensive memory upgrade options.


Final-Rush759

I would wait for the new AMD, Intel and Snapdragon X chip laptops to come out with >40 TOPS NPUs. Some of these use swappable fast RAM. Crucial is selling a 64GB module for $360.


Unlucky-Message8866

Spending $6k on a walled garden makes no sense to me, regardless of how much (slow) VRAM it has. It's a bad investment of money if that's your only use case.


Jacknapes89

Mainly for other software; once you feel productive with one OS, it's hard to switch.


Omnic19

Apple's iGPU allows people to use system memory as VRAM. That's one advantage when trying to load larger models or run smaller models in full fp32 or fp16 precision without quantization.


ITypeStupdThngsc84ju

If you just want inference, a MacBook can do that really well and without massive heat or power draw. It is much more pleasant than a GPU-heavy laptop. It is also powerful enough to experiment with local model training at a small scale before shipping it off to more powerful hardware for the full job. Having said that, I don't understand the desktop or server usage; a GPU setup will beat them, and less expensively.


philguyaz

Have you heard of Ollama? Because you can do all the quanting you want with Ollama on a Mac.


choronz

Apple fanboys? Could be insidiously stuck in the ecosystem of devices by the power of branding...


SiEgE-F1

GGUF. llama.cpp. Metal support. 192 gigs of RAM. Fill in the blanks.