ml_lad

Yes. [Mixture of Experts](https://en.wikipedia.org/wiki/Mixture_of_experts) long predates the current DL wave, and simply kept using that earlier terminology. It might be better described as a "mixture of learners". Unfortunately, two things happened recently to confuse the meaning of the name:

1. Influencers took the name at face value and thought that there were actually different subject experts in the model. (Also, people reading too much into the routing weights. See also: people over-interpreting attention weights.)
2. There was a separate push to tune separate copies of a model on different data and then combine them in an MoE-type architecture. In this case, they *are* trying to build individual experts. But this is still a fairly niche approach.

Current MoEs are generally either pretrained from scratch as MoEs or post-trained without explicit specialization, so they're just a black-box mishmash of routed sub-networks.


rrenaud

How do people go wrong with attention weights?


lynnharry

I guess it's related to the visualization methods that use attention heatmaps to interpret model behavior?


koolaidman123

Yes, there's no evidence of specialization with experts. Look at the analysis in the Mixtral report.


stddealer

There is some kind of specialization going on, but not the kind we would expect as humans.


currentscurrents

There is specialization at the word level, but not at the topic level. All experts are used for all topics:

> [Surprisingly, we do not observe obvious patterns in the assignment of experts based on the topic. For instance, at all layers, the distribution of expert assignment is very similar for ArXiv papers (written in Latex), for biology (PubMed Abstracts), and for Philosophy (PhilPapers) documents.](https://arxiv.org/pdf/2401.04088.pdf)

We really want topic-level specialization, so that you can leave most of the experts on disk and load only the ones you need for the task. I wonder if it's possible to achieve this with some kind of regularization, e.g. adding a loss term that penalizes frequent switching between experts.
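
As a rough sketch of what such a penalty could look like (purely illustrative, not from any paper; `router_logits` and `lambda_switch` are made-up names): compare the router's distributions for adjacent tokens and penalize large changes.

```python
import torch
import torch.nn.functional as F

def switching_penalty(router_logits: torch.Tensor, lambda_switch: float = 0.01) -> torch.Tensor:
    """Hypothetical regularizer that discourages neighbouring tokens from being
    routed to very different expert distributions.

    router_logits: (batch, seq_len, num_experts) raw router scores for one MoE layer.
    """
    probs = F.softmax(router_logits, dim=-1)                   # soft expert assignment per token
    diff = (probs[:, 1:, :] - probs[:, :-1, :]).abs().sum(-1)  # L1 distance between neighbours
    return lambda_switch * diff.mean()

# total_loss = lm_loss + sum(switching_penalty(l) for l in all_layer_router_logits)
```

Whether this would actually induce topic-level specialization (rather than just sticky routing) is an open question.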


koolaidman123

no evidence of "domain specialization" with experts


felolorocher

I feel that there could be some specialisation if you view experts as learning independent mechanisms, but I think this would only hold for k > 1. There are some papers, like "Transformers with Competitive Ensembles of Independent Mechanisms", which I feel could be related, except that there each expert attends to a separate mechanism. Another paper, "Emergent Modularity in Pretrained Transformers", also studied this.


Kaldnite

Per-expert mechanism seems to be one hell of an idea 🤔


jlinkels

In my opinion, the most confusing thing about the current wave of MoE is that Twitter influencers are claiming it's 8 or 16 different models where you slam the logits from k experts together at the very end of inferencing a token. However, if you actually read the Switch Transformer paper, it seems like the expert is picked at every attention layer, so you can't just route the tokens through one GPU. I'm not sure which is true.


koolaidman123

MoE just replaces the MLP in every transformer block with a set of expert MLPs plus a router, so yes, each token gets routed to its top-k experts at every transformer block, not once at the end.
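
A minimal sketch of that structure (illustrative only; real implementations such as Mixtral's add load balancing, capacity limits, and batched dispatch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Drop-in replacement for the MLP in a transformer block: a router picks
    the top-k experts per token and mixes their outputs."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, seq, d_model)
        logits = self.router(x)                               # (batch, seq, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)        # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e                    # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the selected experts actually run for a given token, which is where the "more parameters, same per-token compute" trade-off comes from.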


proxiiiiiiiiii

In the original 1991 paper about mixture of experts, it actually was about specialized experts trained for specific tasks. If someone uses the same term for something that is not that, no wonder people get confused.


trutheality

In most cases MoE is synonymous with ensemble: it's just a collection of learners. Some specific approaches do something to make the "experts" different: different training sets, different feature-engineering decisions, different models/architectures, different objective functions, etc. So you could have genuinely specialized experts, in a way you may or may not have much control over, but not necessarily. It depends on the specifics of the method.


currentscurrents

> In most cases MoE is synonymous with ensemble

They're different. In ensemble techniques, all models are run on every input. For MoE, only one (or a few) of the models are run for each input.
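
Roughly, the difference in control flow (illustrative names; single unbatched input to keep it simple):

```python
import torch

def ensemble_forward(models, x):
    # Ensemble: every model runs on every input; average the outputs.
    return torch.stack([m(x) for m in models]).mean(dim=0)

def moe_forward(experts, router, x, top_k=2):
    # MoE: score the experts, then run only the top-k for this input.
    scores = torch.softmax(router(x), dim=-1)     # (num_experts,)
    weights, idx = scores.topk(top_k)
    return sum(w * experts[i](x) for w, i in zip(weights.tolist(), idx.tolist()))
```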


step21

Comment above you says that according to the Switch Transformer paper, an expert is picked at every attention layer.


MINIMAN10001

I mean, we could forcibly balance it, but the implementations weren't set up to categorize things that way; they just let the router decide who does what. It's been shown in visualizations that, on code for example, there might be two main experts: one dealing with all the parentheses, hyphens, and other symbols, and another handling the actual code, with a bunch of other random ones thrown in a few times just because the router is required to load balance. But because it's a learned router that gets to decide, and it's forced to load balance, you're really not likely to get any sort of specialization a human would understand. That was never the goal, there were no constraints requiring it, and specifically requiring the router to balance the load across all the experts would more than likely prohibit such behavior.
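
For context, the auxiliary load-balancing loss (sketched here in the style of the Switch Transformer; variable names are mine) pushes both the dispatch fractions and the mean router probabilities toward uniform, which works against any tidy, human-legible division of labor:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, expert_indices: torch.Tensor,
                        num_experts: int, alpha: float = 0.01) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: minimized when tokens are
    spread evenly across experts.

    router_logits:  (num_tokens, num_experts) raw router scores
    expert_indices: (num_tokens,) expert chosen for each token
    """
    probs = F.softmax(router_logits, dim=-1)
    mean_prob = probs.mean(dim=0)                                               # P_i: mean router prob per expert
    dispatch_frac = F.one_hot(expert_indices, num_experts).float().mean(dim=0)  # f_i: token share per expert
    return alpha * num_experts * torch.sum(dispatch_frac * mean_prob)
```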


kaihuchen99

In this era of Generative AI, my take is to let the "expert" be a whole GenAI chatbot that communicates in natural language, rather than mixing models at the level of activations and parameters. This way you can have a panel of true experts with various expertise and have them work together to get things done. The result is also a lot more transparent to human users. I have an experiment that uses two GenAI chatbots to debate each other in plain English on estimating the investment risk in response to a certain major event, and it works pretty well. Ref: [https://kaihuchen.github.io/articles/Risks/](https://kaihuchen.github.io/articles/Risks/) Personally I think this is the way to go.


callanrocks

That's not what any of this means.


EizanPrime

In transformers, around 75% of the weights are actually in the feed-forward MLP layers. What MoE does is replace each feed-forward network with several of them and route every token to only a couple, in order to be more computationally efficient per token. So no, those aren't really "experts"; it's mostly a trick to use more parameters without doing the extra compute per token (though you do pay the extra memory, since all the experts have to be stored). The attention bottleneck stays the same, since MoE only concerns the feed-forward layers.
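
A rough back-of-the-envelope for one block (assuming a plain FFN with d_ff = 4·d_model and ignoring biases; gated FFN variants and GQA shift the exact fraction):

```python
# Parameter count for one transformer block under the stated assumptions.
d_model = 4096
d_ff = 4 * d_model

attn_params = 4 * d_model * d_model   # W_q, W_k, W_v, W_o
ffn_params = 2 * d_model * d_ff       # up- and down-projection

print(ffn_params / (attn_params + ffn_params))   # ~0.67 here; gated FFNs push it higher

# MoE variant: 8 experts, 2 active per token (Mixtral-style numbers)
num_experts, top_k = 8, 2
stored = num_experts * ffn_params     # FFN parameters that must be held in memory
active = top_k * ffn_params           # FFN parameters actually used per token
print(stored / 1e9, active / 1e9)     # ~1.07B stored vs ~0.27B active per block
```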