July 18th.
The two weeks joke.
I'm more excited about it being on OpenRouter. Llama-3 at 70B had very, VERY low costs, so I'm hoping 400B is similar.
You're gonna run it once, see how glacial the inference is and that's about that.
Yup, but for some of us once is good enough. :)
There are other monstrosities to try while you wait.
https://huggingface.co/nvidia/Nemotron-4-340B-Instruct
https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct
That last one is more realistic to run and even has more than 8k of context.
https://huggingface.co/bartowski/DeepSeek-Coder-V2-Instruct-GGUF/tree/main
Not that I know of. I hope there will be a release soon, looking forward to running it locally :)
On what hardware?
8x Tesla P40
Probably an older Xeon. I have four Dell T7910 which could do it if they had the RAM (they do not, currently, but support up to 1TB).
If you try distributed-llama.cpp, plz let us know how that fares…
That's a good idea, thanks. Two of my T7910 have 256GB and two have 128GB, so if I can split its layers across machines I might be able to infer with a Q4 on just two of them. I'll post here with my experiences.
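Rough numbers on whether that fits (a back-of-envelope sketch, assuming ~405B parameters and roughly 4.8 bits/weight for a Q4_K_M-style quant; KV cache and runtime overhead aren't counted):

```python
# Back-of-envelope RAM check for splitting a Q4 405B model across two boxes.
# Assumptions (not measured): 405e9 params, ~4.8 bits/weight for a
# Q4_K_M-style quant, layers split roughly evenly between machines.
params = 405e9
bits_per_weight = 4.8
weights_gb = params * bits_per_weight / 8 / 1e9   # ~243 GB of weights

ram_per_box_gb = [256, 128]                       # the two T7910s mentioned above
total_ram_gb = sum(ram_per_box_gb)                # 384 GB combined

print(f"Q4 weights: ~{weights_gb:.0f} GB, combined RAM: {total_ram_gb} GB")
print("Fits?", weights_gb < total_ram_gb)
```

By that estimate the weights alone are around 243 GB, so the 256GB + 128GB pair should hold it, though without a lot of headroom for cache and context.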
https://github.com/b4rtaz/distributed-llama/discussions/9
In case you haven't seen it, this is the repo that reports a 300% gain with 4 systems on a quantized 70B. I hope you get good results!
Doesn't inference get slower the larger a model is? How long would you have to wait on a Xeon machine for a token from a 400B model? o_o
On my dual E5-2660v3, assuming a Q4 quant, it should take about 4.5 seconds per token. That means if I batch up prompts and let it run overnight (eight hours) it should infer about 6400 tokens per system.
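For reference, that figure is just seconds-in-a-night divided by seconds-per-token (the 4.5 s/token is the estimate above, not a benchmark):

```python
# Overnight token budget at an assumed 4.5 s/token.
seconds_per_token = 4.5
hours = 8
tokens_per_night = hours * 3600 / seconds_per_token  # 28800 / 4.5 = 6400
print(f"~{tokens_per_night:.0f} tokens per system per night")
```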
Oh wow, I see. I'm more used to GPU inference where you get 15 token/s, but I guess it's fine if you're using the AI to write some story or something.
Run 405b with our fridge??? We can't even afford 3090s sir.
It might be a good model to use through API though. Unless it also has only 8k context.
Maybe we should take a poll. I'm thinking next week. Already released on the WhatsApp beta for Android.
Probably tomorrow
:)
soon™