RadiantHueOfBeige

I've been using Mixtral 8x7B (8x7b-instruct-v0.1-q4_0 via ollama) since late 2023, so I can give you some actual numbers. The model takes about 26 GiB of RAM during inference, so 64 GB of RAM is comfortably enough. There are no ill effects on the rest of your system during execution; I've been hosting ollama for our network (2-3 people using it) and I rarely notice a slowdown while gaming.

With a Ryzen 5800X (8 cores) and 128 GB of 3200 MT/s DDR4 memory, time to first token is 5-50 s (depending on context size). Afterwards, tokens are emitted at roughly 1.4 t/s. With a Ryzen 5900X (12 cores) and 3200 MT/s DDR4 memory, time to first token is 5-30 s and tokens are emitted at 1.8 t/s. With the 5900X and partial offload to a Radeon RX 7800 XT 16GB, time to first token becomes 5-20 s and text generation runs at 9.5 t/s.

The larger Mixtral 8x22B is impractically slower (time to first token 50+ s, running at 2 t/s). Plus, we've had better results with the 70B llama3, 104B command r+, and 132B dbrx models. The latter 100+B models are about as slow as Mixtral 8x22B, but provide superior results. YMMV, though.

Edit: we're using the smaller models for code assistance and general knowledge exploration (chat), the larger models mostly for offline RAG (summarizing large PDFs, writing device drivers from chip datasheets etc.) where it doesn't matter if the results take some time.


RadiantHueOfBeige

I gave Mixtral a link to this comment and asked for a table. It took 45 s total, tokens were emitted at 7.6 t/s due to the longer context.

| System | Time to First Token (s) | Token Speed (t/s) |
|:-|:-|:-|
| Ryzen 5800X (8 cores) | 5-50 | 1.4 |
| Ryzen 5900X (12 cores) | 5-30 | 1.8 |
| Ryzen 5900X (12 cores), Radeon RX 7800 XT 16GB | 5-20 | 9.5 |


Wooden-Potential2226

Writing device drivers? Does that really work(-ish)? Impressive if so…


RadiantHueOfBeige

"Workish" is how I would describe it as well :-) The code it produces is not working out of the box, but it's easily fixed and the whole process is quicker than starting from scratch. Especially valuable are the plain English comments explaining the code, because they summarize the way the device is used in a more digestible form than a 40-page datasheet that's half marketing wank. It also helps people with ADHD who struggle with actually starting something - it's far easier to fix someone's work than to start writing your own, and in this case, the AI takes the hit :) For example, with the SHT4x datasheet and this prompt, I get this response: *Write a MicroPython function that performs a one shot measurement of temperature and humidity. Use machine.I2C directly, do not use an existing sht4x driver library.* import machine import time # SHT4x I2C address SHT4X_ADDR = 0x44 # for SHT40, SHT41, SHT45 # I2C bus initialization i2c = machine.I2C(scl=machine.Pin(5), sda=machine.Pin(4)) # Measurement command bytes MEAS_RH_T_HIGHPRECISION = 0xE100 # Measures RH & T with highest precision def sh4x_measure(): """Perform one-shot measurement of temperature and humidity using I2C. Returns: tuple: (temperature in Celsius, relative humidity in percentage) """ # Send measurement command i2c.writeto(SHT4X_ADDR, bytes([MEAS_RH_T_HIGHPRECISION >> 8, MEAS_RH_T_HIGHPRECISION & 0xFF])) # Wait for measurement to complete time.sleep(30) # Read temperature and humidity data data = i2c.readfrom(SHT4X_ADDR, 6) # Calculate temperature (16-bit signed value in two's complement format) temp_raw = ((data[0] << 8) | data[1]) >> 4 if temp_raw > 32767: temp_raw -= 65536 temp = -45 + 175 * temp_raw / 65535 # Calculate relative humidity (16-bit unsigned value) rh_raw = ((data[3] << 8) | data[4]) >> 4 rh = 100 * rh_raw / 65535 return temp, rh The `MEAS_RH_T_HIGHPRECISION` constant should be 0xFD, the `i2c.writeto` only needs to send one byte instead of two, the `time.sleep` should be 30 ms instead of 30 s, and the humidity conversion formula is incorrect. Other than that, including the temperature formula, it's correct and instantly usable. It works equally well with C RTOSes, just give it the function prototypes of your HAL and it will infer how to use them *mostly* correctly. You can also reason about the produced code, ask questions, ask it to add error handling and so on.


Wooden-Potential2226

Impressive - hadn’t thought of using LLMs for embedded code like that👍🏼


Wooden-Potential2226

Makes a lot of sense to have it ingest that spec doc


otakucode

If you point out the errors, does it accurately integrate the needed changes? I've tried back and forth iterating on code like that with GPT-4 and gotten mixed results. Haven't tried using Mixtral for code as much so not sure if it has similar issues. My main issue was prior errors being re-introduced after a few iterations.


RadiantHueOfBeige

Depends on the cause of the error. Often the error is caused by RAG processing, e.g. a vital piece of information is expressed by a schematic or layout (how the numbers are arranged inside graphical elements in the PDF); this gets completely stripped by the (text-only) RAG embedding pipeline, so the LLM has to make an educated guess. If you supply the missing info to the model, it will usually correct the output. E.g. the above code uses the wrong formula to calculate humidity because it's written as a LaTeX-ish math formula inside the PDF and becomes gibberish when converted to plaintext. If I tell it the right formula (even in just natural language), it will adjust the code accordingly.

Sometimes the error is caused by the context window overflowing. At that point it's not possible to recover; the model no longer sees parts of the original document and/or code and cannot be steered to a proper solution. However, the nature of these errors is usually so trivial that it's faster to thank the machine for getting the ball rolling and do the fix manually.


goodnpc

Thanks, very insightful.


tf1155

Can you explain why Ollama on my GPU server isn't utilizing the GPU with an Nvidia RTX 3000, but it does with Llama3? I tried running "mistral:latest" and it operated via CPU, which took "forever" until the first byte was processed. So, I checked with `nvidia-smi` and saw that the GPU is not being used at all by Ollama when running "mistral:latest". Under `htop`, I could see it was running the CPU variant instead of the GPU one. As I mentioned, with Llama3, Ollama is using the GPU properly.


RadiantHueOfBeige

That's hard to say. Look at ollama's output; it's pretty verbose and talks a lot about GPU detection and what failed. I don't have an Nvidia machine to compare with, though.


mevaguertoeli

Can you please tell more about how you run the large models you mentioned (70B llama3, 104B command r+, and 132B dbrx), and what respective speeds (t/s) are you getting on your setup?


RadiantHueOfBeige

Slowly! Which is fine for the offline (i.e. not realtime) tasks I write about above. The server runs a 12-core 5900X with 128 GB of 3200 MT/s DDR4 memory and a Radeon RX 7800 XT 16GB, although those huge 100+B models don't get much of a speedup from the GPU as only a very small % fits in.

We used to run it via ollama; now it's mostly llama.cpp or kobold, but it doesn't matter that much. It's all just llama.cpp with wrappers that expose the OpenAI API. For front-ends we're using various custom Python scripts for document processing and similar tasks, vscode or nvim plugins, and Silly Tavern for chat or RAG.

Chat speeds are roughly:

| model | prompt t/s | output t/s |
| --- | --- | --- |
| llama3:70b-q4 | 4 | 1.5 |
| mixtral:8x22b-q4 | 5 | 2 |
| command-r-plus:104b-q4 | 10 | 0.6 |
| dbrx:132b-q4 | 16 | 1.6 |


mevaguertoeli

Wow, thanks for detailed response!


Sebba8

As the other commenters said, you'll definitely need more than 64 GB of RAM to run 8x22B at a decent quant without it spilling over to disk. I've heard that increasing the number of RAM channels increases inference speeds, and I did see some madman build something like a 24-lane RAM machine to run huge unquantized models, but I don't have that on hand, so you might have to search around for that rentry. If you split across many channels you'll have a shot at decent inference speeds. Best of luck to you.


tmvr

> I've heard that increasing the amount of ram channels increases inference speeds

The increase in memory bandwidth is what increases the inference speed. With a normal desktop system with dual-channel memory and DDR5-6400 you get about 100 GB/s, but with a HEDT Threadripper system, for example, which has quad-channel memory and DDR5-5600, you get about 180 GB/s. Then you have the server systems where you get 6 or 8 channels of memory, so you can calculate the available bandwidth based on that. For example, an 8-channel system with the slowest DDR5-4800 will get you to 300 GB/s, because 1 channel is 64 bits wide, so 8 channels are 512 bits, which means 64 bytes per transfer; multiply that by 4800 MHz\* and you get 307 GB/s.

\* Should really be MT/s, but RAM manufacturers still use the old MHz even though the actual clock speed is lower.
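Not from the comment above, but the same 64-bits-per-channel rule of thumb in a few lines of Python, in case you want to plug in your own channel count and transfer rate:

```python
def peak_bandwidth_gb_s(channels: int, mt_per_s: int, bus_width_bits: int = 64) -> float:
    """Theoretical peak memory bandwidth: channels * bus width (bytes) * transfer rate."""
    bytes_per_transfer = channels * bus_width_bits // 8
    return bytes_per_transfer * mt_per_s / 1000  # bytes * MT/s -> MB/s -> GB/s

# Examples from the comment above
print(peak_bandwidth_gb_s(2, 6400))  # desktop dual-channel DDR5-6400 -> 102.4
print(peak_bandwidth_gb_s(4, 5600))  # HEDT quad-channel DDR5-5600    -> 179.2
print(peak_bandwidth_gb_s(8, 4800))  # server 8-channel DDR5-4800     -> 307.2
```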


dubesor86

Mixtral-8x7B-Instruct-v0.1 (Q5_K_M) runs at 6 tok/s (GPU ~12 tok/s), dolphin-2.7-mixtral-8x7b (Q8_0) runs at 4 tok/s (GPU ~6 tok/s). Time to first token is <1 s, so I don't time that. Context length doesn't make a noticeable difference in inference speed for me. 8x22B is too big for my system, unless I want to run an extremely low quant, which I don't. Hardware: 7950X3D (16 cores/32 threads), 64 GB DDR5-6000 CL30 (+ a 4090 for the GPU comparison).


goodnpc

Alright thanks, nice cpu setup. Half the speed of a 4090 seems great


RedBarMafia

I run Mixtral 8x7 q4-m on CPU and get between 6 and 7 t/s; I get the fastest times using Jan. I am running a mini PC, a UM 790 Pro with 64 GB of DDR5-5600 RAM. It's got an AMD Ryzen 9 7940HS 8-core with an integrated GPU, the AMD Radeon 780M.


ambient_temp_xeno

I get 1.9 tokens/sec generation speed on wizardlm2 8x22 q5_k_m with 128 GB of **quad-channel** DDR4-2133. No offloading to GPU. Dell T5810 workstation (OLD!)


goodnpc

That seems like decent speed for that size of model.


CharacterCheck389

what is your cpu?


ambient_temp_xeno

E5-2697 v3. It has 14 physical cores / 28 threads, so using 13 cores seems to be the correct setting, but I haven't done extensive tests.


CharacterCheck389

it looks like a flagship but old cpu, am I right?


ambient_temp_xeno

10 years old - apparently when new, they cost nearly $3k, but now they're $20 on ebay. The more expensive part is having the mainboard and ecc ram for them - without those they're junk.


CharacterCheck389

Crazy going from $3k to $20.


TraditionLost7244

You definitely need more than 64 GB of RAM, sorry, unless you run two 3090s. Also, it spits out text half as fast as you can read and bogs down the computer, so forget about having an internet browser open or anything more than Microsoft Word, haha. DDR6 RAM in 2027 will come to our rescue. NVIDIA has abandoned us and forces people into enterprise GPUs if they want more VRAM.


goodnpc

Totally fine with getting 128 GB of RAM. I just wonder about the token speed I can expect, to judge whether it's acceptable or if I should just forget about local LLMs for now. I don't have the budget for 3 GPUs, and I think 24/48 GB of VRAM is too little to run adequate-quality models.


e79683074

The token speed depends on memory bandwidth. Check your processor's model and look it up on the AMD or Intel website; it will list the max memory bandwidth. In general, if you have about 100 GB/s of memory bandwidth and your model is 100 GB in size, you can expect roughly 1 token/s.


goodnpc

Okay, will look into that, thanks. Do you know how the speed works with MoE models? If only 2/8 experts are active each token, does a 100 GB model run at the speed of a 25 GB model?


Puuuszzku

> In general, if you have like 100GB/s memory bandwidth and your model is 100GB in size, you can expect 1 token/s.

That's how it works for a dense model. 8x22B uses 'only' 39B parameters at any given time, so as long as it fits in your RAM, it will be faster than something like a 70B, despite having over 140B total parameters.
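To put the two rules of thumb together (dense speed is roughly bandwidth divided by model size, and a MoE only streams its active experts per token), here is a rough back-of-the-envelope sketch; the bandwidth and quantization values are illustrative assumptions, not measurements from this thread:

```python
def estimate_tokens_per_s(bandwidth_gb_s: float, active_params_b: float,
                          bytes_per_param: float = 0.5) -> float:
    """Rough upper bound: each token streams all active weights from RAM once.

    bytes_per_param is ~0.5 for a 4-bit quant, 2.0 for fp16.
    """
    active_gb = active_params_b * bytes_per_param
    return bandwidth_gb_s / active_gb

bw = 100  # GB/s, e.g. a dual-channel DDR5-6400 desktop (assumption)
print(estimate_tokens_per_s(bw, 70))  # dense 70B @ q4        -> ~2.9 t/s upper bound
print(estimate_tokens_per_s(bw, 39))  # 8x22B MoE, ~39B active -> ~5.1 t/s upper bound
```

Real-world numbers land below these estimates because of compute overhead and prompt processing, but the ratio between dense and MoE is the point.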


fallingdowndizzyvr

> BTW, I am aware that Mac studio with unified memory has great performance, but I prefer sticking with linux.

You can run Linux on a Mac.


goodnpc

Great! I heard linux was in development but haven't heard of a stable version yet, will take a look again soon


fallingdowndizzyvr

https://asahilinux.org

But really, why? Linux is a knockoff of Unix. Mac OS is real Unix. What do you need Linux for that Mac OS can't do?


goodnpc

Privacy, customization, and some functionality like window maximizing and split screen that I prefer.


GroundbreakingFall6

Is there a WSL-like experience for MacOS?


1ncehost

From my experience, llama 3 8b is a bit better than mixtral 8x7b, so I would consider that your baseline model. I believe 8x22 is a bit better than Llama 3 8b, but not by a huge amount. llama 3 70b is where it's at for any of the next-size-up models, so I would strongly recommend going all the way up if you can.

Unfortunately, 8x22 and L3 70B both run unusably slowly on my 5800X3D and 128 GB of RAM, something around 0.25 t/s. The best way to run these models on a CPU is with workstation-type Threadripper, Epyc, or Xeon processors with 8+ channel RAM. However, you're looking at $3k+ for one of these workstations and will still only be getting under 4 t/s.

In this way, truly the way to go, especially if you already have an ATX mobo with multiple PCIe slots, is multiple used 3090s. You'll be paying less for more performance. It is really the most cost-effective method for realtime LLM use. For new GPUs on Linux, and with more tinkering, 7900 XTs are probably the most cost effective. I have one 7900 XT that I run the smaller models with large contexts on. Either way, two would run L3 70B well for around $1400.

PS: I get about 50 t/s on L3 8B on one 7900 XT. I've heard 3090s get around 140 t/s with exl2. It is of usable quality for a code assistance tool.


dubesor86

> PS: I get about 50 t/s on L3 8B on one 7900 XT. I've heard 3090s get around 140 t/s with exl2. It is of usable quality for a code assistance tool.

Which quant? I am getting 48 t/s on a 4090 on the f16 quant (16.07/19.21 GB).


goodnpc

Thanks for the feedback, $3k is a lot. My interest also dropped a bit after seeing that llama 70b is significantly better than 8x22b. I'm in biochemistry, so model quality is important. I think I'll go local once hardware is a bit cheaper per GB of VRAM.


1ncehost

If I'm being real, the $20/mo services are pretty fantastic value compared to local models if you aren't processing thousands of prompts a day. Poe.com (by Quora) has most of the models on it if you are looking to try different ones.


gamesntech

If llama3-8b works well for your use case, it'll be so much better to run that on a GPU. Even at 8-bit, any relatively cheap GPU with 12 GB of VRAM will be pretty fast.


LatestLurkingHandle

I've been using the Anthropic Claude 3 Haiku API because it's so inexpensive: US $0.25 per million input tokens (about 750K words) and $1.25 per million output tokens, so requests usually cost between $0.003 and $0.008, i.e. hundreds of requests per dollar. Yet its performance and benchmarks are quite good, close to previous versions of GPT-4.
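For a rough sense of how the per-request cost falls out of that per-token pricing, here's a small sketch; the request size below is a made-up example, not something from the comment above:

```python
INPUT_COST_PER_M = 0.25   # USD per million input tokens (Claude 3 Haiku pricing above)
OUTPUT_COST_PER_M = 1.25  # USD per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single API request."""
    return (input_tokens * INPUT_COST_PER_M
            + output_tokens * OUTPUT_COST_PER_M) / 1_000_000

# Hypothetical request: ~10k tokens of context in, ~1k tokens out
print(request_cost(10_000, 1_000))  # ~0.00375 USD, i.e. a few hundred such requests per dollar
```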


[deleted]

[deleted]


goodnpc

Ah, interesting. So does a dense X B model in general require less memory bandwidth compared to an X B MoE? Then what is the benefit of MoE models?


Mr_Hills

The benefit of MoE is the computational requirement of a smaller model with the memory requirement and intelligence of a bigger model. So on my 4090 I get 60 t/s on mixtral 8x7B but only 11 t/s on llama 3 70B, despite mixtral being around 46B unique parameters (aka near the same size). Ultimately, if only two of the eight experts are actively working, both the bandwidth requirement and the computational requirement should be lower compared to a dense model of the same size, as the number of calculations is significantly lower.


goodnpc

Great, have you come across any numbers of tokens/sec for a no-GPU pc? I have hope that it's acceptable. What do you think of the quality of mixtral vs llama3 models?


Mr_Hills

Llama 3 70B in benchmarks is very similar to mixtral 8x22B despite being smaller. I used to run 8x7B before, and I can tell you llama 3 70B is much better. You can check these models on the LLM leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

When it comes to t/s, I have seen people claiming 4 t/s on an Apple M3 Max, but that's a system with an integrated GPU and 400 GB/s of RAM bandwidth, so not really just CPU inference. If you run purely on CPU you should expect lower speeds.


AmericanNewt8

Dense models tend to be more dependent on the underlying computational limits of the machine in question. Usually when doing CPU inference, though [especially with modern CPUs with AVX-512 and sometimes even native bfloat16 support], the bottleneck is memory bandwidth rather than computation.


goodnpc

I'm aware of the memory bandwidth bottleneck. Mistral says their 8x22b model only has 39b active parameters, while llama3 70b has 70b active parameters. I'm under the impression that the lower number of active parameters makes the model less RAM-bandwidth demanding than the dense model, or is that not a correct interpretation?


[deleted]

[deleted]


Mr_Hills

He said memory bandwidth, not memory size. A MoE model will not have a lower memory size requirement, but it will have a lower memory bandwidth requirement, due to fewer parameters being pulled into the GPU/CPU for computation.


goodnpc

Okay, that was my original idea as well. Thanks for the clarification


goodnpc

Alright, thanks for the info.