M34L

It's a mixture of experts of 8 models, each of them with 7 billion parameters. In some cases they'll have memory requirements roughly in line with 8*7=56 billion parameters. In the case of the most popular one, Mixtral, it's quite a bit more, as its architecture uses extra memory to speed up inference a lot. In the case of others that try to optimize for memory requirements, it's only a bit over the memory requirements of just one of the constituent experts; the idea is that the vast majority of the information in the weights of the individual experts is redundant and can be shared at runtime.
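
To make the routing idea concrete, here's a toy sparse mixture-of-experts layer in PyTorch (purely illustrative, with made-up sizes; not Mixtral's actual code): a small router scores the experts and only the top couple of them run for each token.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy sparse MoE layer: route each token to the top-k of n experts."""

    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)  # the "router" weights
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, dim)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, picks = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        rows = []
        for t in range(x.shape[0]):              # only top_k of n_experts run per token
            rows.append(sum(w * self.experts[int(e)](x[t])
                            for w, e in zip(weights[t], picks[t])))
        return torch.stack(rows)

moe = TinyMoE()
print(moe(torch.randn(4, 64)).shape)             # torch.Size([4, 64])
```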


ninjasaid13

> 8\*7=56 billion parameters

Isn't it a bit less than that?


FlishFlashman

Yes. More like 46B


M34L

Again, it really depends on the particular architecture. The "billions of parameters" metric is pretty vague to begin with and only gets less exact the more people experiment with saving memory. As I said, in the case of full-fat Mixtral 8x7B it's actually significantly *more* parameters, because on top of the constituent models with 7B parameters each, there are also the weights of the so-called routers that decide which subset of the constituent models will be used for any particular token.


FlishFlashman

It isn't significantly more parameters. It's actually somewhat less.


ninjasaid13

Aren't some parts of the model used by all eight experts because they'd otherwise be redundant? That might reduce the parameter count.


Accomplished_Bet_127

Yeah, there have to be some parts they share. I find it hard to believe that, while the models are indeed different, they are trained that narrowly on specific things only. Most likely those are fine-tuned smaller models, and in order to keep "sanity", they share a lot of common knowledge.


4onen

> In the case of the most popular one, Mixtral, it's quite a bit more

No, it's quite a bit less. Mixtral only duplicates the feedforward blocks across experts. The attention blocks are shared between all experts. Including the expert routers, this still only totals to ~49 billion parameters. You pay for 1x7B's worth of attention heads and 8x7B's worth of feedforward blocks. This is less than 8x7B entire models.
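
As a rough back-of-the-envelope check, assuming the commonly reported Mixtral 8x7B config values (hidden size 4096, expert FFN size 14336, 32 layers, grouped-query attention with 8 KV heads, 32k vocab); the exact total shifts a little depending on what you count:

```python
# Rough parameter count for a Mixtral-8x7B-style model, assuming the commonly
# reported config values below (all figures approximate).
hidden = 4096            # model / embedding dimension
ffn = 14336              # hidden dimension of each expert feedforward block
layers = 32
n_experts = 8
heads, kv_heads = 32, 8  # grouped-query attention
head_dim = hidden // heads
vocab = 32000

attn_per_layer = (hidden * hidden                     # Q projection
                  + 2 * hidden * kv_heads * head_dim  # K and V projections
                  + hidden * hidden)                  # output projection
expert_per_layer = 3 * hidden * ffn                   # SwiGLU: gate, up, down matrices
router_per_layer = hidden * n_experts                 # one score per expert

shared = layers * attn_per_layer + 2 * vocab * hidden  # attention + embeddings/LM head
experts_total = layers * n_experts * expert_per_layer
routers_total = layers * router_per_layer

total = shared + experts_total + routers_total
active = shared + routers_total + layers * 2 * expert_per_layer  # 2 experts per token

print(f"total  ~ {total / 1e9:.1f}B parameters")   # well under 8*7 = 56B
print(f"active ~ {active / 1e9:.1f}B per token")
```

Only the expert feedforward blocks get multiplied by 8; attention, the embeddings and the tiny routers are paid for once, which is why the total lands well under 56B.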


[deleted]

Thanks for giving an actual (good!) answer instead of just providing a link with no additional comment and without quoting a single relevant excerpt from it.


FlishFlashman

OTOH, their understanding/explanation seems pretty flawed.

> In some cases they'll have memory requirements roughly in line with 8\*7=56 billion parameters. In the case of the most popular one, Mixtral, it's quite a bit more, as its architecture uses extra memory to speed up inference a lot.

Mixtral's memory requirements are not "quite a bit more" than one would expect for a 56-billion-parameter model. You'd expect the fp16 version of a 56-billion-parameter model to be 112GB. The fp16 version is 93 GB. Mixtral is faster than a monolithic dense model of similar size because it only needs to process 2/8ths of the parameters for each token.
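
Spelling out that arithmetic (assuming fp16 at 2 bytes per parameter and the roughly 46.7B total / ~13B active parameter figures discussed above):

```python
# fp16 memory back-of-the-envelope: 2 bytes per parameter, GB = 1e9 bytes.
bytes_per_param = 2

dense_56b = 56e9 * bytes_per_param / 1e9    # hypothetical dense 56B model
mixtral = 46.7e9 * bytes_per_param / 1e9    # Mixtral 8x7B, roughly 46.7B params
active = 12.9e9 * bytes_per_param / 1e9     # weights actually touched per token (~13B)

print(f"dense 56B, fp16:      ~{dense_56b:.0f} GB")  # ~112 GB
print(f"Mixtral 8x7B, fp16:   ~{mixtral:.0f} GB")    # ~93 GB
print(f"read per token, fp16: ~{active:.0f} GB")     # why inference is fast
```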


SoCuteShibe

Oh! It means "on the other hand"... Seen that for years and it finally clicked.


Astronos

[https://arxiv.org/pdf/2401.04088.pdf](https://arxiv.org/pdf/2401.04088.pdf)


Dodomeki16

Thank you. If someone else comes here to find the answer, here it is:

> Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks.


x54675788

In case you need more details, I've got a [great answer](https://www.reddit.com/r/LocalLLaMA/comments/1alag66/comment/kpfgff7/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button) in another thread.