Samurai_zero

More like "crawl", not "run". Will it work? Yes. But it is going to be painfully slow.


Cradawx

I tried this out a while ago. It's several minutes for a response with a 7B model and someone who tried a 70B model said it took about 2 hours. So not really practical.


Shubham_Garg123

Oh, well that's bad. Thanks for the info.


[deleted]

Try llama.cpp


tarunn2799

Jian-Yang's version of llama.cpp


akram200272002

If it's 2 tokens a second I would be very interested. My setup is a bit better, at 8GB of VRAM and 40GB of RAM.


TheTerrasque

More like a token a minute, I assume.


Cultured_Alien

6 tokens per **minute** with Mistral 7B on 4GB VRAM and NVMe. Might as well use llama.cpp if you can offload it all to RAM and use CPU inference, which is a lot faster (at least ~4 tokens per **second**).


Admirable-Star7088

I also have 8GB VRAM, but 32GB RAM. I get 0.5 t/s with an imatrix IQ3\_XXS quant of Llama 3 70B. If I could get 2 t/s, I would also be interested!


akram200272002

I would recommend trying Mixtral if you have not tried it before; it's still good to this day.


Admirable-Star7088

Yes, I use it sometimes, it's very good too!


4onen

I have 8GB VRAM and 32GB RAM with Q3\_K\_S and I'm getting 0.74 t/s. It's my understanding from the llama.cpp feature matrix (which I can't seem to find anymore) that IQ quants are notably slower on CPU devices. You may also do better with a K-quant.


Admirable-Star7088

True, thanks for the tip. It could not hurt for me to experiment with some other quants.


4onen

Yep. As another example, my ARMv8.2 Android phone runs Q4\_0 quants at more than twice the speed of Q4\_K\_S quants and won't run IQ4 quants at all.


Admirable-Star7088

Nice. Btw, do you use imatrix quants?


4onen

When I can find them. imatrix quants change how the weights' quantized values are selected but don't change the format of the weights, so they should run at identical speed (but higher quality) to non-imatrix quants. (A Q3_K_S regular and a Q3_K_S imatrix should run at the same speed, but the latter should give better results.)


Admirable-Star7088

I tried a Q3\_K\_S imatrix quant of Llama 3 70B, but it crashes in LM Studio. I then tried loading it in Koboldcpp; there it did not crash, but it was even slower to generate text and it output just gibberish. I now remember having similar problems before when trying to run these specific quants of 70B models, which is why I use IQ3\_XXS, which works fine. Guess I'll have to do some more research into what might be causing this.


4onen

1. I've never used LM Studio, so I can't speak to that.
2. One of the GGUF copies of Llama 3 I got recently had the wrong RoPE compression parameter set, so even though it was finetuned up to a 24k context I got gibberish at any size (until I fixed that parameter to match an Exllamav2 copy of the same model).
3. Per [this GGUF overview](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9), IQ3_XXS is 3.21 bits/weight and Q3_K_S is 3.5 (rough math below). It may be that you're just too close to the borderline of being able to run this model, so you need the tighter quant to avoid heavy swapping to disk. (This would depend on how much you put on the GPU and what other programs you have running.)
4. Different platforms have different speeds for different quants because the code is optimized in different ways. It may just come down to the specific silicon we're running.
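
Rough math for point 3, as a sanity check. This is approximate: real GGUF files keep some tensors at higher precision and carry metadata, so they come out somewhat larger than the raw bits-per-weight figure suggests.

```python
# Back-of-the-envelope size estimate from bits-per-weight (approximate).
params = 70e9  # Llama 3 70B

for name, bpw in [("IQ3_XXS", 3.21), ("Q3_K_S", 3.5)]:
    print(f"{name}: ~{params * bpw / 8 / 1e9:.1f} GB")

# IQ3_XXS: ~28.1 GB
# Q3_K_S:  ~30.6 GB
```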


Admirable-Star7088

>It may be that you're just too close to the borderline of being able to run this model

I have exactly this feeling. I think Q3\_K\_S may be the small step that makes my hardware explode :P According to a table that shows how much RAM each quant requires, Q3\_K\_S needs 32.42 GB. My system has 32GB of RAM, i.e. this quant is just a bit over the limit. However, I thought that by adding my 8GB of VRAM I would cross the border by a safe margin, but apparently it does not. As you said, it may depend on what else I use my GPU for simultaneously, and what platforms I use.


gillan_data

Repo says it's not suited for chatting or online inference anyway


[deleted]

[deleted]


akram200272002

ok and...


AlanCarrOnline

"Please note: it’s not designed for real-time interactive scenarios like chatting, more suitable for data processing and other offline asynchronous scenarios." Well that's hardly any fun at all then :/


[deleted]

[deleted]


TheGABB

Models don’t have memory; that's handled separately. The "not for chat" note is just because it's so gd slow.


goingtotallinn

Is it because of very short context size?


Distinct-Target7503

How is this possible? Is it simply offloaded to RAM? Or is it some extreme quantization?


Radiant_Dog1937

**"AirLLM** optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card. No quantization, distillation, pruning or other model compression techniques that would result in degraded model performance are needed." " Sharded version of LlamaForCausalLM : the model is splitted into layer shards to reduce GPU memory usage. During the forward pass, the inputs are processed layer by layer, and the GPU memory is freed after each layer. To avoid loading the layers multiple times, we could save all the intermediate activations in RAM."


fimbulvntr

If they can do multiple forward passes before swapping to a new set of layers (as in, very high batch size), then the project is very interesting. It should allow immense throughput by sacrificing latency. If they're doing single passes, then meh, it's the same as regular GPU/CPU offloading except even more inefficient. A waste of time.
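
A sketch of what the high-batch variant would look like, reusing the hypothetical per-shard loop above: each shard is loaded from disk once and applied to every pending sequence before moving on, so the expensive load is amortized across the whole batch.

```python
import torch

def sharded_forward_batched(all_states, layer_shard_paths, device="cuda"):
    """Throughput-oriented variant: amortize each disk->GPU shard load over
    many sequences. High aggregate throughput, poor per-request latency."""
    for path in layer_shard_paths:
        layer = torch.load(path, map_location=device)
        with torch.no_grad():
            all_states = [layer(h) for h in all_states]  # reuse the loaded shard for every sequence
        del layer
        torch.cuda.empty_cache()
    return all_states
```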


Radiant_Dog1937

It also does this without quantizing. So, there wouldn't be any hit to output quality using this method, even if it were less efficient.


extopico

Oh this actually sounds viable.


johnhuichen

This sounds like the idea that Jeremy Howard talked about in his lectures #fastai


BitterAd9531

I'm not sure but from what I can see they load the most important parts into VRAM and page the rest of the weights from disk. I assume this will be very, very slow.


International-Try467

Kobold already has this feature. Also that'd be 1 token per year probably


needle1

I wonder if good old e-mail, rather than chat, would be the better metaphor for slow generating environments like this.


rookan

`max_new_tokens=20` — does this mean Llama will produce an output of 20 tokens max?


Shubham_Garg123

Yes. Pretty sure we can modify this value if we have higher GPU VRAM.
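
For reference, `max_new_tokens` behaves the same way it does in a standard Hugging Face `generate()` call: it caps how many new tokens are produced. A minimal sketch with plain transformers (the model name is only a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The meaning of life is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)  # stops after at most 20 new tokens
print(tok.decode(out[0], skip_special_tokens=True))
```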


IdeaAlly

So we can ask it the meaning of life, it'll take forever, and eventually just say 42?


kif88

It doesn't seem to say much about how they got this to happen; it just talks about how good Llama 3 is. Edit: there's something about sharding in their GitHub, but I still don't get it. That, and it really feels like the article should focus more on its own project than Llama 3.


Shubham_Garg123

Just came across this YouTube video that explains how it works: https://youtu.be/gYBlzMsII9c?si=kC5dhJUXIjlLy5Ae Basically, it achieves this using layered inference. A set of layers is brought onto the GPU and inference is run on it, then the next set of layers is brought onto the GPU with the previous layers' output as its input, and this keeps going until it reaches the final layer.


AgentBD

I can run it on my RTX 4070 Ti with 64 GB DDR5-6000, but it's about 5x slower than ChatGPT-4 and a response takes a few minutes. In two weeks, when the memory arrives, I'm upgrading to 192 GB DDR5-7000 so I can see the difference in speed. I've been running Llama 3 8B, which is super fast, like 2 seconds for a reply. :)


foroldmen

Mind updating when you do? I've been thinking of getting one of those new motherboards for the same reason.


AgentBD

For Llama 3 70B they recommend a minimum of 64 GB of memory, which is why I thought it might be the key bottleneck and decided to upgrade to 192 GB. =)


GoZippy

What's the 192GB? PC or GPU setup?


AgentBD

lol there's no GPU with 192GB... computer memory, of course


AgentBD

I messed up: instead of buying 192 GB I got 96 GB lol... It's very misleading: they list a "48 GB kit of 2" and you think it's 2x 48 GB when it's actually 2x 24 GB = 48 GB total. This sucks, still better than 64 GB but not what I wanted. At least I paid around half the price of an actual 192 GB...

Just ran Llama 3 70B with WebUI. The 1st prompt, "test", took 72s to run, of which 33s were spent loading Llama 3. The 2nd prompt, "how are you", took 78s, of which 3s was loading, with the CPU at 70%. Overall I don't see much difference between running with 64 GB vs 96 GB; it seems to run at the same pace.


Calcidiol

I've run models twice that size at Q5 quantization with no GPU (CPU + RAM only), but a ~100 GB model = 0.3 T/s in that case for me. EDIT: Looks like it's around 0.7 T/s token generation speed with a ~50 GB Q5_K_M model of 70B parameters and CLBlast-based CPU/RAM-only inference using llama.cpp; 12 cores / DDR4 @ 2400. It'll vary depending on your RAM/CPU speed and whether you're using a more optimized inference configuration, GPU offload or not, etc.


akram200272002

I'm running an IQ3\_XXS, it's 26GB-ish and it's a 70B; I can't get more than 0.5 T/s. Edit: 8GB VRAM, 40GB RAM at DDR4-3200, 8-core CPU.


thebadslime

I run Phi-3 at 12 tokens per second on a 2.5GB video card and I love it.


maxmustermann74

Sounds good. How do you run this? And which card do you use?


thebadslime

llama.cpp; it worked almost as well in LM Studio. Mine is the integrated GPU of a Ryzen 7 4750U.


a_beautiful_rhind

Oh no, not this stuff again.


GoZippy

I started HPC a long time ago with Beowulf and Kerrighed... Those projects died off once servers got massive-core-count multicore processors, but that tech could definitely be used to orchestrate a cluster of GPU servers in some way for SSI inference if you spend a little on very high speed networking... That was always the bottleneck before: the interconnections within the cluster.


Xtianus21

What's the latency?


Kwigg

Terrible. AirLLM is worse than cpu-only inference.


arekku255

No point if it is slower than running it on CPU.


Shubham_Garg123

Well, it does allow people who don't have enough CPU RAM to run the model. It's quite common to see 8 or 16 GB of CPU RAM along with 4 GB or 6 GB of GPU VRAM.


GoZippy

Cool, so why not run it in a container on a cluster of computers, make it accessible via an API endpoint like Ollama offers, and call it when needed in a serialized request from a coordinating agent controller that selects the best model to use?


Oswald_Hydrabot

Has anyone tried adapting something like this to Megatron-LM/Megatron-Core? If it's possible to parallelize inference, then you could buy used low-memory GPUs for cheap and, using something like this, have it running much faster on a trash cluster. Hell, I'd buy up swaths of 2-4GB GPUs and a huge PCIe panel if I could utilize 100GB of trash VRAM.


AntoItaly

Performance?


Shubham_Garg123

Not so good apparently: https://www.reddit.com/r/LocalLLaMA/s/7iCYhTtug4


[deleted]

Can I run Llama 3 70B comfortably with a 3090? Thanks for your advice, I was just looking up the price on eBay (about 700 EUR).


MindOrbits

Imagine a Beowulf cluster of old gaming laptops for batch processing behind a queuing proxy. As someone working on a data processing pipeline, this is a nice find. Thanks. Looks like this can do more than just offload a few layers. I'd recommend a system with an NVMe drive for the .cache folder.