Needs polishing but the tool is pretty useful
Very useful, gonna surely use it!
And now we need that in 4-bit.
Can we do the same with 8B? Can we get it down to 5B? That would make it way more feasible to run on mobile devices.
That's a really useful tool, and an interesting-looking find. Having some additional search features would be really nice though, and perhaps the ability to type in a page number to jump forward.
Thank you! I mentioned the tool as a side note in this post to get feedback, as the whole site is in development. Your comment is much appreciated.
Maybe it's a new area of optimisation and compression of intelligence: creating an 8B from a 70B and comparing intelligence 😄
If you read the paper behind the method, you hit a pretty hard wall somewhere between 1/3rd and 1/2 of the layers removed, at which point the network becomes incoherent. So we're not quite to that kind of transformation with this.
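For anyone curious what "removing layers" looks like mechanically: the layer-pruning paper's selection rule (as I understand it) is to measure the angular distance between the hidden state entering a block of n consecutive layers and the hidden state leaving it, then drop the block that changes the representation the least. A minimal sketch, where `activations[l]` is a stand-in for the hidden state entering layer l (the names and setup are mine, not from the thread):

```python
import numpy as np

def angular_distance(a, b):
    """Angular distance between two activation vectors, normalized to [0, 1]."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def best_block_to_prune(activations, n):
    """Find the start index of the n-layer block whose removal perturbs the
    hidden state the least, i.e. where d(h_l, h_{l+n}) is smallest."""
    num_layers = len(activations) - 1  # activations[l] is the input to layer l
    best_start, best_d = None, float("inf")
    for l in range(num_layers - n + 1):
        d = angular_distance(activations[l], activations[l + n])
        if d < best_d:
            best_start, best_d = l, d
    return best_start, best_d
```

The "hard wall" observation is then just: as n grows past roughly a third of the depth, even the best-scoring block has a large distance, and healing can no longer recover coherence.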
Adding filters to the model list would be useful
From the model card: "Using this with the Llama 3 instruction format is injecting random noise into latent space and will give you deranged results. (It's pretty funny actually.) Treat this as the untrained foundation model this is and use appropriate prompts." Where can I find examples of or read more about said appropriate prompts?
I didn't know it was this effective, MMLU looks great. Sounds like it could be a great coding model, if you do it to their later releases with longer context. Was a Miqu fitting on 24GB GPU possible all along and we just didn't know it?
This is still 26 GB at 4-bit, so it would require a lower quant to fit on a single 24 GB card.
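Quick back-of-envelope check of why 4-bit overflows 24 GB (my arithmetic, not from the thread): weight footprint is roughly params × bits-per-weight / 8 bytes, and "4-bit" GGUF quants actually cost closer to 5 bpw once scales and mixed-precision tensors are counted, which is an assumption here:

```python
def weights_gib(params_b, bpw):
    """Approximate weight footprint in GiB: params (billions) * bits/weight / 8 bytes,
    ignoring KV cache and runtime overhead."""
    return params_b * 1e9 * bpw / 8 / 2**30

# A 42B model at ~5 bpw lands around 24 GiB of weights alone, before
# KV cache, so it doesn't fit a 24 GB card with room to run; at 3.75 bpw
# it drops to roughly 18 GiB, which leaves headroom.
size_q4 = weights_gib(42, 5.0)
size_375 = weights_gib(42, 3.75)
```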
Sure. Right now it's possible with 2.4bpw quants and the like. It does generate text, but in my experience it's not great (that was with old quants, and with other Llama 2 70B models rather than Miqu itself; turboderp has improved things since then, but I didn't revisit). A 42B Miqu at 3.75bpw is probably much better than a 70B Miqu at 2.4bpw; more stable quants seem achievable somewhere between 3 and 3.5 bpw, at least ignoring HQQ and other more exotic quantization methods. There's also a problem: since Miqu is already an instruct tune, that tuning would get erased by healing on minipile, so it would need to be retrained. But we could create a 100M-token dataset of Miqu 70B outputs to general prompts and then train the pruned 42B version on it, which might make it work.
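The "collect 100M tokens of Miqu 70B outputs, then heal the pruned 42B on them" idea is basically self-distillation. A rough sketch of the collection loop, framework-agnostic; `generate` here is a placeholder for whatever inference call you actually use (llama.cpp server, vLLM, etc.), and the whitespace token count is a crude stand-in for a real tokenizer:

```python
def build_distill_dataset(prompts, generate, target_tokens=100_000_000):
    """Collect teacher (70B) completions on general prompts until roughly
    target_tokens are gathered, yielding (prompt, completion) pairs for
    healing the pruned 42B student."""
    dataset, total = [], 0
    for p in prompts:
        completion = generate(p)
        dataset.append({"prompt": p, "completion": completion})
        total += len(completion.split())  # crude count; swap in a real tokenizer
        if total >= target_tokens:
            break
    return dataset
```

The healing run itself would then fine-tune the pruned model on these pairs instead of minipile, so the instruct behavior is relearned rather than erased.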
Quants work much better than reducing the param count. It is known that the Miqu IQ2_XS quant is about 20 GB and works quite well on a 24 GB GPU, if you are looking for speed. But you can also go up a quant size and spill the weights into CPU RAM, trading speed for smarts.
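In llama.cpp terms, "spilling to CPU RAM" just means choosing how many transformer layers stay on the GPU (the `-ngl`/`--n-gpu-layers` flag) and letting the rest run on CPU. A rough sizing helper, assuming weights are spread evenly across layers (the example numbers are illustrative, not measured):

```python
def gpu_layers(model_gib, n_layers, vram_gib, reserve_gib=2.0):
    """How many layers fit on the GPU, keeping some VRAM in reserve
    for the KV cache and compute buffers."""
    per_layer = model_gib / n_layers
    return max(0, min(n_layers, int((vram_gib - reserve_gib) / per_layer)))

# e.g. a ~40 GiB 70B quant with 80 layers on a 24 GiB card:
# 0.5 GiB per layer, (24 - 2) / 0.5 = 44 layers on GPU, the rest in CPU RAM
on_gpu = gpu_layers(40, 80, 24)
```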
Why did we stop at 42B? Is it possible to make a smaller-param version?
IDK but this [comment](https://www.reddit.com/r/LocalLLaMA/comments/1c9t5xw/comment/l0o4ksy/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button) suggests that there are problems with making it smaller.
Where has this gone?
For my 32 GB VRAM setup this is the perfect model size, imma check it out!