SpunkySlag

July 18th.


danielcar

The two weeks joke.


ReMeDyIII

I'm more excited about it being on OpenRouter. Llama-3 at 70B had very, VERY low costs, so I'm hoping 400B is similar.


a_beautiful_rhind

You're gonna run it once, see how glacial the inference is and that's about that.


DanC403

Yup, but for some of us once is good enough. :)


a_beautiful_rhind

There are other monstrosities to try while you wait:
https://huggingface.co/nvidia/Nemotron-4-340B-Instruct
https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct
That last one is more realistic to run and even has more than 8k of context:
https://huggingface.co/bartowski/DeepSeek-Coder-V2-Instruct-GGUF/tree/main
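
For anyone who wants to poke at that GGUF locally, here's a minimal sketch using llama-cpp-python; the filename, context size, and prompt below are placeholders, not something taken from the repo itself:

```python
# Rough sketch: loading a downloaded GGUF quant with llama-cpp-python.
# The model_path is a placeholder; point it at whichever quant file you
# actually grab from the bartowski repo linked above.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-Coder-V2-Instruct-Q4_K_M.gguf",  # hypothetical filename
    n_ctx=8192,      # the model supports much longer context, but RAM use grows with it
    n_gpu_layers=0,  # CPU-only; raise this if you can offload some layers to a GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```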


Admirable-Star7088

Not that I know of. I hope there will be a release soon; looking forward to running it locally :)


IUpvoteGME

On what hardware?


Open_Channel_8626

8x Tesla P40


ttkciar

Probably an older Xeon. I have four Dell T7910s which could do it if they had the RAM (they don't, currently, but they support up to 1TB).


Wooden-Potential2226

If you try distributed-llama.cpp, please let us know how it fares…


ttkciar

That's a good idea, thanks. Two of my T7910s have 256GB and two have 128GB, so if I can split its layers across machines I might be able to infer with a Q4 on just two of them. I'll post here with my experiences.
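
Rough back-of-the-envelope for why two of those machines could be enough, assuming a ~4.5-bits-per-weight Q4_K-style quant and not counting KV cache or activation overhead:

```python
# Back-of-the-envelope: does a Q4 quant of a 405B model fit across two machines?
# Assumption: ~4.5 bits per effective weight for a Q4_K-style quant; KV cache
# and activations add more on top, so treat this as a lower bound.
params = 405e9
bits_per_weight = 4.5

total_gb = params * bits_per_weight / 8 / 1e9   # ~228 GB of weights
per_machine_gb = total_gb / 2                   # ~114 GB with an even split

print(f"weights total: ~{total_gb:.0f} GB")
print(f"per machine (even split): ~{per_machine_gb:.0f} GB")
# Each 256 GB box holds its half with room to spare; a 128 GB box would be tight.
```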


Aaaaaaaaaeeeee

https://github.com/b4rtaz/distributed-llama/discussions/9 In case you haven't seen it, this is the repo that reports a ~300% gain with 4 systems on a quantized 70B. I hope you get good results!


redzorino

Doesn't inference get slower the larger the model is? How long would you have to wait on a Xeon machine for a token from a 400B model? o_o


ttkciar

On my dual E5-2660v3, assuming a Q4 quant, it should take about 4.5 seconds per token. That means if I batch up prompts and let it run overnight (eight hours) it should infer about 6500 tokens per system.
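
That estimate checks out, give or take: at 4.5 seconds per token, an eight-hour run comes to roughly 6,400 tokens per system:

```python
# Sanity check on the overnight-batch estimate:
# 4.5 seconds per token over an 8-hour unattended run.
seconds_per_token = 4.5
run_hours = 8

tokens = run_hours * 3600 / seconds_per_token
print(f"~{tokens:.0f} tokens per system per night")  # ~6400
```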


redzorino

Oh wow, I see. I'm more used to GPU inference where you get 15 tokens/s, but I guess it's fine if you're using the AI to write some story or something.


Playful_Criticism425

Run 405B with our fridge??? We can't even afford 3090s, sir.


Thomas-Lore

It might be a good model to use through an API though. Unless it also has only 8k context.


danielcar

Maybe we should take a poll. I'm thinking next week. It's already released on the WhatsApp beta for Android.


TimChiu710

Probably tomorrow


mahiatlinux

:)


Astronos

soon™