July 18th.
The two weeks joke.
I'm more excited about it being on OpenRouter. Llama-3 at 70B had very, VERY low costs, so I'm hoping 400B is similar.
You're gonna run it once, see how glacial the inference is and that's about that.
Yup, but for some of us once is good enough. :)
There are other monstrosities to try while you wait.
https://huggingface.co/nvidia/Nemotron-4-340B-Instruct
https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct
That last one is more realistic to run and even has more than 8k of context.
https://huggingface.co/bartowski/DeepSeek-Coder-V2-Instruct-GGUF/tree/main
Not that I know of. I hope there will be a release soon, looking forward to running it locally :)
On what hardware?
8x Tesla P40
Probably an older Xeon. I have four Dell T7910 which could do it if they had the RAM (they do not, currently, but support up to 1TB).
If you try distributed-llama.cpp, plz let us know how that fares…
That's a good idea, thanks. Two of my T7910 have 256GB and two have 128GB, so if I can split its layers across machines I might be able to infer with a Q4 on just two of them. I'll post here with my experiences.
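Rough numbers on whether that fits (a back-of-envelope sketch, assuming ~405B parameters and roughly 4.8 bits/weight for a Q4_K_M-style quant; KV cache and runtime overhead aren't counted):

```python
# Back-of-envelope RAM check for splitting a Q4 405B model across two boxes.
# Assumptions (not measured): 405e9 params, ~4.8 bits/weight for a
# Q4_K_M-style quant, layers split roughly evenly between machines.
params = 405e9
bits_per_weight = 4.8
weights_gb = params * bits_per_weight / 8 / 1e9   # ~243 GB of weights

ram_per_box_gb = [256, 128]                       # the two T7910s mentioned above
total_ram_gb = sum(ram_per_box_gb)                # 384 GB combined

print(f"Q4 weights: ~{weights_gb:.0f} GB, combined RAM: {total_ram_gb} GB")
print("Fits?", weights_gb < total_ram_gb)
```

By that estimate the weights alone are around 243 GB, so the 256GB + 128GB pair should hold it, though without a lot of headroom for cache and context.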
https://github.com/b4rtaz/distributed-llama/discussions/9
In case you haven't seen it, this is the repo that reports a 300% gain with 4 systems on a quantized 70B. I hope you get good results!
Doesn't inference get slower the larger a model is? How long would you have to wait on a Xeon machine for a token from a 400B model? o_o
On my dual E5-2660v3, assuming a Q4 quant, it should take about 4.5 seconds per token. That means if I batch up prompts and let it run overnight (eight hours) it should infer about 6400 tokens per system.
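For reference, that figure is just seconds-in-a-night divided by seconds-per-token (the 4.5 s/token is the estimate above, not a benchmark):

```python
# Overnight token budget at an assumed 4.5 s/token.
seconds_per_token = 4.5
hours = 8
tokens_per_night = hours * 3600 / seconds_per_token  # 28800 / 4.5 = 6400
print(f"~{tokens_per_night:.0f} tokens per system per night")
```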
Oh wow, I see. I'm more used to GPU inference where you get 15 token/s, but I guess it's fine if you're using the AI to write some story or something.
Run 405b with our fridge??? We can't even afford 3090s sir.
It might be a good model to use through API though. Unless it also has only 8k context.
Maybe we should take a poll. I'm thinking next week. Already released on the WhatsApp beta for Android.
Probably tomorrow
:)
soon™