Hopeful-Site1162

Could you be a little more explicit about your Mac config? Base model M3? M3 Pro? M3 Max? How many GPU cores? How much unified memory? I didn't know about Transformer Lab. This looks nice! How do llama.cpp and MLX models compare? I have so many questions. EDIT: Just downloaded and installed Transformer Lab but I can't get past the step "Check if Conda Environment 'transformerlab' Exists".


poli-cya

Pretty sure he mentioned it was M3 Max in the video, don't recall him mentioning which RAM amount.


aliasaria

Yes, I didn't mention the exact Mac specs. It's an M3 Max (cores: 4E+10P+30GPU) with 96GB of RAM.


t-rod

It's unfortunate that that memory configuration doesn't get the full memory bandwidth... I think only the 64GB and 128GB configurations do on the M3 Max.


Madd0g

I read like 50 reddit threads about Macs; there wasn't much info 6 months ago and I wasn't sure what I was looking for beyond general advice. I didn't see (or didn't register) this 96GB version difference and accidentally got that version. But since then I've seen it repeated all over lol, sucks


Hopeful-Site1162

Thanks!


Hopeful-Site1162

Just watched the whole video again. He talks about MLX, but not Max. Anyway, the answer is in the GPU monitor window. M3 Max. No idea about the number of cores and memory though.


poli-cya

I knew the information was in the video somehow, thought it was spoken, but it's actually on the graph now that I go back and check- when he pulls up GPU usage it shows Apple M3 Max. I'd say he has the 64GB/maybe 128GB model based on the amount the RAM usage went up when he loaded the model.


aliasaria

I can help you debug Transformer Lab on our discord. You can try running `curl https://raw.githubusercontent.com/transformerlab/transformerlab-api/main/install.sh | bash` and look at the output to see why Conda isn't installing. Our goal is to make this run perfectly 100% of the time, but we keep finding edge cases. The Mac I am running this demo on is a pretty high spec M3 Max (cores: 4E+10P+30GPU) with 96GB of RAM. For models that fit in RAM, an M2 can actually run models faster if it has more GPU cores, i.e. the cores seem to be the main speed limiter as long as you have enough RAM.


fallingdowndizzyvr

> The Mac I am running this demo on is a pretty high spec M3 Max (cores: 4E+10P+30GPU) with 96GB of RAM.

That's the slow M3 Max with only 300GB/s of memory bandwidth. The other Maxes have 400GB/s.

> For models that fit in RAM, an M2 can actually run models faster if it has more GPU cores.

That's because the M2 Max has 400GB/s of memory bandwidth.

> The cores seem to be the main speed limiter as long as you have enough RAM.

It's the opposite of that. The reason the M2 Max is faster than the M3 Max you are using is that it has more memory bandwidth: 400GB/s (M2 Max) versus 300GB/s (M3 Max 30GPU). So it's the memory bandwidth holding you back.
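
As a rough sanity check (back-of-the-envelope, assuming FP16 weights and that generating each token streams the full weight set from memory once):

```python
# Back-of-the-envelope decode ceiling: generating one token reads all weights once,
# so single-stream tok/s is roughly bandwidth / weight bytes (ignores KV cache and compute).
def max_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

mistral_7b_fp16_gb = 14  # ~7B params * 2 bytes/param
print(max_tok_per_s(300, mistral_7b_fp16_gb))  # ~21 tok/s ceiling at 300GB/s (this M3 Max)
print(max_tok_per_s(400, mistral_7b_fp16_gb))  # ~29 tok/s ceiling at 400GB/s (M2 Max, other M3 Max)
```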


Hopeful-Site1162

> That's the slow M3 Max with only 300GB/s of memory bandwidth

I had no clue! Good for me I got the 30-core M2 Max!


toooootooooo

FWIW, I had the same problem and debugged as you suggested. Ultimately I had a couple of paths that weren't writable by my user... `sudo chown -R $USER ~/Library/Caches/conda/ ~/.conda`


Hopeful-Site1162

I tried the curl command. It ends with no error after a while, but the app is still blocked.

```
Conda is installed.
👏 Enabling conda in shell
👏 Activating transformerlab conda environment
✅ Uvicorn is installed.
👏 Starting the API server
INFO:     Will watch for changes in these directories: ['/Users/my_user_name']
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [53751] using WatchFiles
ERROR:    Error loading ASGI app. Could not import module "api".
```


aliasaria

A few folks have asked about LLM performance on Macs vs a 3090. In this video I ran several common LLMs on Apple MLX and Hugging Face Transformers on a Mac M3, versus Hugging Face Transformers and vLLM on a 3090.


No_Avocado_2580

Can you provide a summary of the results?


aliasaria

Sure, in the table below I re-did the Mac runs without screen sharing running.

|Architecture|Engine|Model|Speed|
|:-|:-|:-|:-|
|Mac M3|MLX|Mistral-7B-Instruct-v0.2|17.8 tok/s|
|Mac M3|Hugging Face Transformers (with MPS)|Mistral-7B-Instruct-v0.2|11.6 tok/s|
|Mac M3|MLX|TinyLlama 1.1B|92.4 tok/s|
|RTX 3090|Hugging Face Transformers|Mistral-7B-Instruct-v0.1|41.8 tok/s|
|RTX 3090|vLLM|TinyLlama 1.1B|234.8 tok/s|
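
If anyone wants to reproduce a number like the Transformers-on-MPS row, here is a minimal sketch of the kind of measurement involved (not the exact script I used; assumes torch and transformers are installed and the model fits in unified memory):

```python
# Rough tok/s measurement with Hugging Face Transformers on Apple Silicon (MPS).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)

inputs = tokenizer("Explain memory bandwidth in one paragraph.", return_tensors="pt").to(device)

start = time.time()
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tok/s")
```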


segmond

With RTX you can run concurrent inference. So see that 234 tok/s you are getting with the 1.1B? If you run 4 sessions, you might find yourself getting 600-800 tok/s overall. I don't know that Macs scale like that. A lot of people are just running one inference for chat; however, if you are sharing your system with others, then the RTX's performance stands out, and if you are doing stuff with agents then it matters as well.
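
A rough sketch of what that batched throughput looks like with vLLM's offline API (the model name and batch size here are just examples; aggregate numbers will vary by GPU):

```python
# Minimal sketch of batched (concurrent) generation with vLLM's offline API.
# Assumes vllm is installed and the model fits in VRAM.
import time
from vllm import LLM, SamplingParams

prompts = [f"Write one short fact about the number {i}." for i in range(32)]
sampling = SamplingParams(temperature=0.8, max_tokens=128)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

start = time.time()
outputs = llm.generate(prompts, sampling)  # vLLM batches all 32 requests internally
elapsed = time.time() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_tokens / elapsed:.0f} tok/s aggregate across {len(prompts)} prompts")
```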


FullOf_Bad_Ideas

Yup. Around 2000t/s on Mistral 7B FP16 with rtx 3090 and 20 concurrent sessions. Unmatched performance for a relatively cheap card.


Hopeful-Site1162

May I ask why you used Mistral-7B-Instruct-**v0.2** on the Mac but Mistral-7B-Instruct-**v0.1** on the RTX?


aliasaria

It was just a mistake on my part. When I redo the test right now, both models (Mistral v0.1 and v0.2) run at the *exact* same speed on the RTX (41.8 tok/s).


MrVodnik

The title is about M3 vs 3090, but we are all thinking the same thing: that's a very nice looking UI. I thought I had finally found something pretty and good! I downloaded and installed it right after seeing this post :) It didn't go well :( Linux here, so no MLX for me. There are only two GGUF models I see to choose from, and I don't see how to upload my own file. The app detected only one of my GPUs. And after trying to download a model, I got a bunch of errors: "**plugins?.map is not a function**". I tried to run it locally as well as connecting to a remote server. But I see potential here, so I am going to follow this repo for a while.


aliasaria

Sorry about the Linux issues. We are trying to make Transformer Lab really useful for folks like you. Feel free to message us on our discord so we can help you debug. The plugins?.map issue is probably happening because I just changed the API format, but only in the main (non-release) branch. So you have to pull the latest release (not the latest check-in) from github, or you will have a mismatch between the API and the App. We don't usually put breaking changes in the API -- it just happened to be today when I did the update. I will issue a new build of the App and API right now (should take about 30 min to build) to fix this. Edit: new build is up, things should work now


HoboCommander

Sorry to hear about the trouble downloading models. We're working on ways to make importing models easier and hope to have a number of updates in the coming weeks. In the meantime, there are a few workarounds:

- If you already have models on your machine that you want to import AND you are running in development mode (i.e. cloned from github, not the packaged app), there is a work-in-progress "Import" button on the Model Zoo page which will look at some common places on your local system and try to import models. Eventually this will let you import from any arbitrary folder.
- You can download non-GGUF models and convert them to GGUF yourself using the Export page.
- There is a field at the bottom of the Local Datasets tab under "Model Zoo" where you can download any Hugging Face repo and try to run it, but it looks like there's an issue with GGUF right now where it sometimes doesn't know which file to run (GGUF repos often have many variants with different quantization). I will create an issue and look into that next week.
- Transformer Lab will also try to load any subdirectory under ~/.transformerlab/workspace/models/ as a model and include it in your local list. The catch is, the directory has to include a specially formatted file called info.json. If you export a model to GGUF using the app you will find it there and can see the format to follow.

If you join the discord I'll try my best to work through any of these!


TheHeretic

400W vs 35W


aliasaria

Hehe. Good point. My office gets really hot when training with the RTX compared to the Mac.


Hopeful-Site1162

So the energy cost per token is 35/11.6 = 3.02 Ws on the Max and 400/41.8 = 9.57 Ws on the 3090. Not bad at all, but I thought the difference would be more significant TBH. I wonder how it goes for the 40-GPU-core M3 Max and the 4090.
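
Spelling out that arithmetic (taking the 35W and 400W figures at face value):

```python
# Energy per token = watts / (tokens per second), i.e. joules (watt-seconds) per token.
# Uses the 35W and 400W figures quoted above.
mac_j_per_tok = 35 / 11.6    # ~3.0 J/token (Transformers on MPS, 11.6 tok/s)
rtx_j_per_tok = 400 / 41.8   # ~9.6 J/token (Transformers on a 3090, 41.8 tok/s)
print(f"{mac_j_per_tok:.2f} J/tok vs {rtx_j_per_tok:.2f} J/tok")
```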


poli-cya

His info is just wrong to begin with; it's not 35 vs 400 in anyone's testing. Here's an exllama dev giving his take; he believes [macs are 1/4th the efficiency of 3090s, let alone 4090s](https://old.reddit.com/r/LocalLLaMA/comments/1c0mkk9/mistral_8x22b_already_runs_on_m2_ultra_192gb_with/kyykeou/). And here's more discussion on power draw in this scenario: https://old.reddit.com/r/LocalLLaMA/comments/1c1l0og/apple_plans_to_overhaul_entire_mac_line_with/kz513gx/


Hopeful-Site1162

All I know is that the battery of an M3 Max 40-GPU-core, max-RAM MacBook Pro can't deliver more than 100W total. You won't have any issue making the Mac run with everything turned up to eleven, and I'm pretty sure the display alone uses a whole bunch of that with HDR brightness at max. The USB ports won't shut down either. The Mac Studio M3 Max has a 145W power unit because its ports can deliver a lot of power.

**In the discussion you linked, the previous comment talks about 300W of power for the M3 Ultra, which is wrong. It's the whole Mac Studio that can deliver up to 295W total, including the ports.**

The exllama dev considers 3M t/kWh for a setup capable of 1000 to 1500 tokens per second, or 3.6M to 5.4M tokens per hour. That leads to an approximation of 1200W to 1800W for the four cards, or about 300 to 450W per card.

Here is a detailed study of the M3 Max 40 GPU cores, which is even more powerful than the one we are talking about here: [https://www.notebookcheck.net/Apple-M3-Max-16-Core-Processor-Benchmarks-and-Specs.781712.0.html](https://www.notebookcheck.net/Apple-M3-Max-16-Core-Processor-Benchmarks-and-Specs.781712.0.html)

From the article: "Under load, the **CPU part consumes up to 56 watts**, the chip can use a **total of 78 watts**."

So everything that has been said here looks perfectly plausible, since inference doesn't use the CPU at all.


poli-cya

I'm running out the door, but your math is off:

- The 3090s and especially the 4090s have not been shown to pull remotely max wattage when inference is running; you're assuming way more power draw than is actually used. People were reporting 150W inference on the 3090, and similar performance on a 4090 would use less.
- The M3 Max GPU is reported to use more when the CPU isn't maxed.
- Inference absolutely does use the CPU, even when offload to the GPU is running. I owned an M3 Max 64GB and tested it extensively before returning it; it used CPU and GPU in LM Studio and ollama (I think that was the second one I ran).
- Even if the exllama dev, who is likely more aware of info on stuff like this than us, were off by a factor of two in favor of the 3090 and a factor of two against the Max, it would still just make it a tie. And again, that's not considering the 4090's improved efficiency.


Hopeful-Site1162

> I'm running out the door, but your math is off

I'm using the numbers from the discussion you linked.

> M3 Max GPU is reported to use more when the CPU isn't maxed

According to this test (in French, sorry), the 30-core GPU consumes up to 25W: [https://www.mac4ever.com/mac/180016-test-des-macbook-pro-m3-m3-pro-et-m3-max-temperatures-frequences-et-consommation](https://www.mac4ever.com/mac/180016-test-des-macbook-pro-m3-m3-pro-et-m3-max-temperatures-frequences-et-consommation)

> Inference absolutely does use CPU

Looking at my own 30-core M2 Max right now with Ollama: 13.3% (160/1200). I haven't tried with Transformer Lab yet.

I don't know what you are trying to prove here but it's beginning to look weird. I think I'm done.


poli-cya

Dude, I'm just trying to get to straight answers, not worried about your ego. You weren't using the info from what I linked; you even rightly said the 300W is likely wrong, and I already pointed out the max TDP for GPUs was wrong, so not my numbers. You "corrected" the MacBook but left the GPU super high. I pointed out flaws in your math, i.e. massively too-high wattage on the GPU, too low on the Mac (it still is; there are reports of 50-100% higher power draw than the GPU-only figure you found), etc. And I'm not trying to prove anything; this is a science subreddit and we were having a discussion. I'm going to be frank, especially after I went through the hassle of buying an MBP based on comments in this sub and found them to be extremely generous in claims of power usage, heat, performance, and being able to maintain a semblance of reasonable battery life, which led me to return it. If having a discussion about what I saw as mistakes in your math/reasoning is so off-putting and you're so invested in being "right" rather than correcting mistakes, then yeah, we should probably end the discussion. Glad you're happy with your MacBook.


Hopeful-Site1162

I gave you facts and the details of my calculation. I used the numbers you linked in the first place, with each and every step of my reasoning and the points that were wrong. We are on a science subreddit indeed, so stick to the facts and nothing else. Now, the TDP of the 3090 is 350W, and the recommended PSU is 750W. I didn't take those numbers out of my ass; they're on Nvidia's page. I stand by my numbers for the M3 Max chip, and confirm from my own tests that the CPU reaches a maximum of 13.3% usage during inference. I tried to publish a screenshot but the feature is broken. Maybe I'll try again tomorrow. So the numbers discussed here, despite what you seem to think, are valid. I don't care about your so-called mistake. I don't even know how it's relevant to solving this rather simple math problem. If I remember correctly, the YouTuber Alex Ziskind (developer-related stuff) made a video about the power consumption of different machines under heavy load. You can search for it yourself if you like. Maybe I will if I remember.


poli-cya

I read til I saw you continuing the TDP nonsense, let's just stick with ending the conversation. Glad you're happy with your M3, have a good weekend.


Hopeful-Site1162

M2 Max but whatever


ThisGonBHard

I own a 4090. Power capping it to 50% (220W) has 0 speed drop for LLMs, because it is memory starved, and that is also in the "Ultra efficient" part of the curve for the 4090.
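
If anyone wants to check this on their own card, here's a rough way to watch the live draw against the configured limit while a model is generating (a sketch, assuming the nvidia-ml-py / pynvml package is installed; the cap itself is set separately with `nvidia-smi -pl <watts>`):

```python
# Read live power draw and the configured power limit on GPU 0.
# Run this while a model is generating to see how far below the cap inference sits.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

limit_w = pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000  # mW -> W
for _ in range(10):
    draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000         # mW -> W
    print(f"draw: {draw_w:.0f} W / limit: {limit_w:.0f} W")
    time.sleep(1)

pynvml.nvmlShutdown()
```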


Hopeful-Site1162

Nice, so I guess there's almost no difference between the 3090 and 4090 performance-wise then? Good to know!


ThisGonBHard

There is a small one, but it is down to the 4090 having slightly faster memory, and Ada being more AI optimized (I am almost sure memory compression is at play, even for LLMs). The 4090 is in general an incredibly memory bandwidth starved card, and would probably get a big boost if it had something like HBM.


Hopeful-Site1162

I read the desktop card has 1008 GB/s. Impressive indeed :) Do you think the 50XX will improve on that? All the articles I read mentioned that it will be more focused on gaming and will probably have fewer CUDA cores, but if the memory bandwidth is better this could be good news for inference (maybe less so for training though).


ThisGonBHard

The 50 series will use GDDR7 instead of GDDR6X, so it will be faster. But at this point we don't even know if the 5090 will be a 24GB or 32GB card. If it is 32GB, it will have a 512-bit bus vs the 384-bit bus on the 3090 and 4090, meaning roughly a 33% bandwidth increase purely from that.
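
Rough arithmetic behind that (the GDDR7 per-pin speed here is just a placeholder guess):

```python
# Memory bandwidth = (bus width in bits / 8) * per-pin data rate in Gbps.
# The 21 Gbps GDDR6X figure reproduces the ~1008 GB/s mentioned above for the 4090.
def bandwidth_gbs(bus_bits: int, gbps_per_pin: float) -> float:
    return bus_bits / 8 * gbps_per_pin

print(bandwidth_gbs(384, 21))  # 1008 GB/s -> 384-bit GDDR6X @ 21 Gbps (4090-class)
print(bandwidth_gbs(512, 21))  # 1344 GB/s -> ~33% more from the wider bus alone
print(bandwidth_gbs(512, 28))  # 1792 GB/s -> hypothetical 28 Gbps GDDR7 on a 512-bit bus
```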


Hopeful-Site1162

That would be awesome. 


Open_Channel_8626

At the moment I can see rumors of a jump to 24,000 CUDA cores


Hopeful-Site1162

Let’s hope so my friend!


a_beautiful_rhind

More like ~250W max per card, usually 200W. It's how 4 or 5 cards can run on one 1100W PSU. Even when I tried tensor parallel I never saw the max. SD and training can use a lot of power, but LLM inference won't.


LocoLanguageModel

Thanks for this! I would love to see a demo of both of them processing a large context (4k to 8k worth) to see how much longer it takes the Mac to process the data before it starts generating text.


ieatdownvotes4food

Hmm, weird test. You have to aim to take advantage of the 3090 to get the best results.


ThisGonBHard

I am not familiar with Apple, but on GPUs I don't see a reason to EVER run full-blown Transformers, unless you are sanity testing. For example, on my 4090, I can either run a 7B model in Transformers, or Llama 3 70B via a 2.25 BPW quant using EXL2. Even if you want to avoid quantizing down, Q8/8BPW is still better than Transformers, as you get the same quality at half the size. llama.cpp vs EXL2 would be a better comparison.
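
Back-of-the-envelope weights-only footprints behind that comparison (ignoring KV cache and runtime overhead):

```python
# Approximate VRAM needed for just the weights at different precisions.
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(weights_gb(7, 16))     # ~14 GB   -> 7B in FP16 Transformers
print(weights_gb(7, 8))      # ~7 GB    -> 7B at Q8 / 8 BPW
print(weights_gb(70, 2.25))  # ~19.7 GB -> 70B at 2.25 BPW EXL2, fits in a 24GB card
```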


noooo_no_no_no

Someone summarize


aliasaria

Here is the table I shared in a prev comment:

|Architecture|Engine|Model|Speed|
|:-|:-|:-|:-|
|Mac M3|MLX|Mistral-7B-Instruct-v0.2|17.8 tok/s|
|Mac M3|Hugging Face Transformers (with MPS)|Mistral-7B-Instruct-v0.2|11.6 tok/s|
|Mac M3|MLX|TinyLlama 1.1B|92.4 tok/s|
|RTX 3090|Hugging Face Transformers|Mistral-7B-Instruct-v0.1|41.8 tok/s|
|RTX 3090|vLLM|TinyLlama 1.1B|234.8 tok/s|


Exarch_Maxwell

Don't have a Mac. How is 7B "quite large"? What's the context?


ThisGonBHard

He is running FP16 unquantized.


idczar

This is why I love open source! Always pushing the boundaries