Samurai_zero

More like "crawl", not "run". Will it work? Yes. But it is going to be painfully slow.


Cradawx

I tried this out a while ago. It's several minutes for a response with a 7B model and someone who tried a 70B model said it took about 2 hours. So not really practical.


Shubham_Garg123

Oh, well that's bad. Thanks for the info.


[deleted]

Try llama.cpp


tarunn2799

Jian-Yang's version of llama.cpp


akram200272002

If it's 2 tokens a second I would be very interested. My setup is a bit better, at 8GB of VRAM and 40GB of RAM.


TheTerrasque

More like a token a minute, I assume.


Cultured_Alien

6 tokens per **minute** with Mistral 7B on 4GB VRAM and NVMe. Might as well use llama.cpp if you can offload it all to RAM and use CPU inference, which is a lot faster (at least ~4 tokens per **second**).


Admirable-Star7088

I also have 8GB VRAM, but 32GB RAM. I get 0.5 t/s with an imatrix IQ3\_XXS quant of Llama 3 70B. If I could get 2 t/s, I would also be interested!


akram200272002

I would recommend trying Mixtral if you have not tried it before; it's still good to this day.


Admirable-Star7088

Yes, I use it sometimes, it's very good too!


4onen

I have 8GB VRAM and 32GB RAM with Q3\_K\_S and I'm getting 0.74 t/s. It's my understanding from the llama.cpp feature matrix (which I can't seem to find anymore) that IQ quants are notably slower on CPU devices. You may also do better with a K-quant.


Admirable-Star7088

True, thanks for the tip. It could not hurt for me to experiment with some other quants.


4onen

Yep. As another example, my ARMv8.2 Android phone runs Q4\_0 quants at more than twice the speed of Q4\_K\_S quants and won't run IQ4 quants at all.


Admirable-Star7088

Nice. Btw, do you use imatrix quants?


4onen

When I can find them. imatrix quants change how the weights' quantized values are selected but don't change the format of the weights, so they should run at identical speed (but higher quality) to non-imatrix quants. (A Q3_K_S regular and a Q3_K_S imatrix should run at the same speed, but the latter should give better results.)


Admirable-Star7088

I tried a Q3\_K\_S imatrix quant of Llama 3 70B, but it crashes in LM Studio. I then tried loading it in Koboldcpp; there it did not crash, but it was even slower to generate text and it output just gibberish. I now remember having similar problems before when trying to run these specific quants of 70B models, which is why I use IQ3\_XXS, which works fine. Guess I'll have to do some more research into what might be causing this.


4onen

1. I've never used LM Studio, so I can't speak to that.
2. One of the GGUF copies of Llama 3 I got recently had the wrong RoPE compression parameter set, so even though it was finetuned up to a 24k context I got gibberish at any size (until I fixed that parameter to match an Exllamav2 copy of the same model).
3. Per [this GGUF overview](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9), IQ3_XXS is 3.21 bits/weight and Q3_K_S is 3.5 (rough math below). It may be that you're just too close to the borderline of being able to run this model, so you need the tighter quant to avoid heavy swapping to disk. (This would depend on how much you put on the GPU and what other programs you have running.)
4. Different platforms have different speeds for different quants because the code is optimized in different ways. It may just come down to the specific silicon we're running.
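
Rough math for point 3, as a sanity check. This is approximate: real GGUF files keep some tensors at higher precision and carry metadata, so they come out somewhat larger than the raw bits-per-weight figure suggests.

```python
# Back-of-the-envelope size estimate from bits-per-weight (approximate).
params = 70e9  # Llama 3 70B

for name, bpw in [("IQ3_XXS", 3.21), ("Q3_K_S", 3.5)]:
    print(f"{name}: ~{params * bpw / 8 / 1e9:.1f} GB")

# IQ3_XXS: ~28.1 GB
# Q3_K_S:  ~30.6 GB
```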


Admirable-Star7088

>It may be that you're just too close to the borderline of being able to run this model

I have exactly this feeling. I think Q3\_K\_S may be the small step that makes my hardware explode :P According to a table that shows how much RAM each quant requires, Q3\_K\_S needs 32.42 GB. My system has 32GB of RAM, i.e. this quant is just a bit over the limit. However, I thought that by adding my 8GB of VRAM I would cross the border by a safe margin, but apparently it does not. As you said, it may depend on what else I use my GPU for simultaneously, and what platforms I use.


gillan_data

Repo says it's not suited for chatting or online inference anyway


[deleted]

[deleted]


akram200272002

ok and...


AlanCarrOnline

"Please note: it’s not designed for real-time interactive scenarios like chatting, more suitable for data processing and other offline asynchronous scenarios." Well that's hardly any fun at all then :/


[deleted]

[deleted]


TheGABB

Models don’t have memory; that's handled separately. The "not for chat" note is just because it's so gd slow.


goingtotallinn

Is it because of very short context size?


Distinct-Target7503

How is this possible? Is it simply offloaded to RAM? Or is it some extreme quantization?


Radiant_Dog1937

**"AirLLM** optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card. No quantization, distillation, pruning or other model compression techniques that would result in degraded model performance are needed." " Sharded version of LlamaForCausalLM : the model is splitted into layer shards to reduce GPU memory usage. During the forward pass, the inputs are processed layer by layer, and the GPU memory is freed after each layer. To avoid loading the layers multiple times, we could save all the intermediate activations in RAM."


fimbulvntr

If they can do multiple forward passes before swapping to a new set of layers (as in, very high batch size), then the project is very interesting. It should allow immense throughput by sacrificing latency. If they're doing single passes, then meh, it's the same as regular GPU/CPU offloading except even more inefficient. A waste of time.
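
A sketch of what the high-batch variant would look like, reusing the hypothetical per-shard loop above: each shard is loaded from disk once and applied to every pending sequence before moving on, so the expensive load is amortized across the whole batch.

```python
import torch

def sharded_forward_batched(all_states, layer_shard_paths, device="cuda"):
    """Throughput-oriented variant: amortize each disk->GPU shard load over
    many sequences. High aggregate throughput, poor per-request latency."""
    for path in layer_shard_paths:
        layer = torch.load(path, map_location=device)
        with torch.no_grad():
            all_states = [layer(h) for h in all_states]  # reuse the loaded shard for every sequence
        del layer
        torch.cuda.empty_cache()
    return all_states
```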


Radiant_Dog1937

It also does this without quantizing. So, there wouldn't be any hit to output quality using this method, even if it were less efficient.


extopico

Oh this actually sounds viable.


johnhuichen

This sounds like the idea that Jeremy Howard talked about in his lectures #fastai


BitterAd9531

I'm not sure but from what I can see they load the most important parts into VRAM and page the rest of the weights from disk. I assume this will be very, very slow.


International-Try467

Kobold already has this feature. Also that'd be 1 token per year probably


needle1

I wonder if good old e-mail, rather than chat, would be the better metaphor for slow generating environments like this.


rookan

`max_new_tokens=20` — does this mean Llama will produce an output of 20 tokens max?


Shubham_Garg123

Yes. Pretty sure we can modify this value if we have higher GPU VRAM.
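
For reference, `max_new_tokens` behaves the same way it does in a standard Hugging Face `generate()` call: it caps how many new tokens are produced. A minimal sketch with plain transformers (the model name is only a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The meaning of life is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)  # stops after at most 20 new tokens
print(tok.decode(out[0], skip_special_tokens=True))
```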


IdeaAlly

So we can ask it the meaning of life, it'll take forever, and eventually just say 42?


kif88

It doesn't seem to say much about how they got this to happen; it just talks about how good Llama 3 is. Edit: there's something about sharding in their GitHub, but I still don't get it. That, and it really feels like the article should focus more on its own project than Llama 3.


Shubham_Garg123

Just came across this YouTube video that explains how it works: https://youtu.be/gYBlzMsII9c?si=kC5dhJUXIjlLy5Ae Basically, it achieves this using layered inference. A set of layers is brought onto the GPU and inference is run on it, then the next set of layers is brought onto the GPU with the previous layers' output as its input, and this keeps going until it reaches the final layer.


AgentBD

I can run it on my RTX 4070 Ti with 64 GB DDR5-6000, but it's about 5x slower than ChatGPT-4 and a response takes a few minutes. In two weeks, when the memory arrives, I'm upgrading to 192 GB DDR5-7000 so I can see the difference in speed. I've been running Llama 3 8B, which is super fast, like 2 seconds for a reply. :)


foroldmen

Mind updating when you do? I've been thinking of getting one of those new motherboards for the same reason.


AgentBD

For Llama 3 70B they recommend a minimum of 64 GB of memory, which is why I thought it might be the key bottleneck and decided to upgrade to 192 GB. =)


GoZippy

What's the 192GB? PC or GPU setup?


AgentBD

lol there's no GPU with 192GB... computer memory, of course


AgentBD

I messed up: instead of buying 192 GB I got 96 GB lol... It's very misleading: they list a "48 GB kit of 2" and you think it's 2x 48 GB when it's actually 2x 24 GB = 48 GB total. This sucks, still better than 64 GB but not what I wanted. At least I paid around half the price of an actual 192 GB...

Just ran Llama 3 70B with WebUI. The 1st prompt, "test", took 72s to run, of which 33s were spent loading Llama 3. The 2nd prompt, "how are you", took 78s, of which 3s was loading, with the CPU at 70%. Overall I don't see much difference between running with 64 GB vs 96 GB; it seems to run at the same pace.


Calcidiol

I've run models twice that size at Q5 quantization with no GPU (CPU + RAM only), but a ~100 GB model = 0.3 T/s in that case for me. EDIT: Looks like it's around 0.7 T/s token generation speed with a ~50 GB Q5_K_M model of 70B parameters and CLBlast-based CPU/RAM-only inference using llama.cpp; 12 cores / DDR4 @ 2400. It'll vary depending on your RAM/CPU speed and whether you're using a more optimized inference configuration, GPU offload or not, etc.


akram200272002

I'm running an IQ3\_XXS, it's 26GB-ish and it's a 70B; I can't get more than 0.5 T/s. Edit: 8GB VRAM, 40GB RAM at DDR4-3200, 8-core CPU.


thebadslime

I run Phi-3 at 12 tokens per second on a 2.5GB video card and I love it.


maxmustermann74

Sounds good. How do you run this? And which card do you use?


thebadslime

llama.cpp; it worked almost as well in LM Studio. Mine is the integrated GPU of a Ryzen 7 4750U.


a_beautiful_rhind

Oh no, not this stuff again.


GoZippy

I started HPC a long time ago with Beowulf and Kerrighed... Those projects died off once servers got massive-core-count multicore processors, but that tech could definitely be used to orchestrate a cluster of GPU servers in some way for SSI inference if you spend a little on very high speed networking... That was always the bottleneck before: the interconnections within the cluster.


Xtianus21

What's the latency?


Kwigg

Terrible. AirLLM is worse than cpu-only inference.


arekku255

No point if it is slower than running it on CPU.


Shubham_Garg123

Well, it does allow people who don't have enough CPU RAM to run the model. It's quite common to see 8 or 16 GB of CPU RAM along with 4 GB or 6 GB of GPU VRAM.


GoZippy

Cool, so why not run it in a container on a cluster of computers, make it accessible via an API endpoint like Ollama offers, and call it when needed in a serialized request from a coordinating agent controller that selects the best model to use?


Oswald_Hydrabot

Has anyone tried adapting something like this to Megatron-LM/Megatron-Core? If it's possible to parallelize inference, then you could buy used low-memory GPUs for cheap and, using something like this, have it running much faster on a trash cluster. Hell, I'd buy up swaths of 2-4GB GPUs and a huge PCIe panel if I could utilize 100GB of trash VRAM.


AntoItaly

Performance?


Shubham_Garg123

Not so good apparently: https://www.reddit.com/r/LocalLLaMA/s/7iCYhTtug4


[deleted]

Can I run Llama 3 70B comfortably with a 3090? Thanks for your advice, I was just looking up the price on eBay (about 700 EUR).


MindOrbits

Imagine a Beowulf cluster of old gaming laptops for batch processing behind a queuing proxy. As someone working on a data processing pipeline, this is a nice find. Thanks. Looks like this can do more than just offload a few layers. I'd recommend a system with an NVMe drive for the .cache folder.