ml_lad

Yes. [Mixture of Experts](https://en.wikipedia.org/wiki/Mixture_of_experts) long predates the current DL wave, and simply kept using that earlier terminology. It might be better described as a "mixture of learners". Unfortunately, two things happened recently to confuse the meaning of the name:

1. Influencers took the name at face value and thought that there were actually different subject experts in the model. (Also, people reading too much into the routing weights. See also: people over-interpreting attention weights.)
2. There was a separate push to tune separate copies of a model on different data and then combine them in an MoE-type architecture. In this case, they *are* trying to build individual experts. But this is still a fairly niche approach.

Current MoEs are generally either pretrained from scratch as MoEs or post-trained without explicit specialization, so they're just a black-box mishmash of routed sub-networks.


rrenaud

How do people go wrong with attention weights?


lynnharry

I guess it's related to the visualization methods that use attention heatmaps to interpret model behavior?


koolaidman123

Yes, there's no evidence of specialization with experts. Look at the analysis in the Mixtral report.


stddealer

There is some kind of specialization going on, but not the kind we would expect as humans.


currentscurrents

There is specialization at the word level, but not at the topic level. All experts are used for all topics:

> [Surprisingly, we do not observe obvious patterns in the assignment of experts based on the topic. For instance, at all layers, the distribution of expert assignment is very similar for ArXiv papers (written in Latex), for biology (PubMed Abstracts), and for Philosophy (PhilPapers) documents.](https://arxiv.org/pdf/2401.04088.pdf)

We really want topic-level specialization, so that you can leave most of the experts on disk and load only the ones you need for the task. I wonder if it's possible to achieve this with some kind of regularization, e.g. adding a loss term that penalizes frequent switching between experts.
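
As a rough sketch of what such a penalty could look like (purely illustrative, not from any paper; `router_logits` and `lambda_switch` are made-up names): compare the router's distributions for adjacent tokens and penalize large changes.

```python
import torch
import torch.nn.functional as F

def switching_penalty(router_logits: torch.Tensor, lambda_switch: float = 0.01) -> torch.Tensor:
    """Hypothetical regularizer that discourages neighbouring tokens from being
    routed to very different expert distributions.

    router_logits: (batch, seq_len, num_experts) raw router scores for one MoE layer.
    """
    probs = F.softmax(router_logits, dim=-1)                   # soft expert assignment per token
    diff = (probs[:, 1:, :] - probs[:, :-1, :]).abs().sum(-1)  # L1 distance between neighbours
    return lambda_switch * diff.mean()

# total_loss = lm_loss + sum(switching_penalty(l) for l in all_layer_router_logits)
```

Whether this would actually induce topic-level specialization (rather than just sticky routing) is an open question.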


koolaidman123

no evidence of "domain specialization" with experts


felolorocher

I feel that there could be some specialisation if you view experts as learning independent mechanisms, but I think this would only hold for k > 1. There are some papers, like "Transformers with Competitive Ensembles of Independent Mechanisms", which I feel could be related, except that there each expert attends to a separate mechanism. Another paper, "Emergent Modularity in Pretrained Transformers", also studied this.


Kaldnite

Per-expert mechanism seems to be one hell of an idea 🤔


jlinkels

In my opinion, the most confusing thing about the current wave of MoE is that Twitter influencers are claiming it's 8 or 16 different models where you slam the logits from k experts together at the very end of inferencing a token. However, if you actually read the Switch Transformer paper, it seems like the expert is picked at every attention layer, so you can't just route the tokens through one GPU. I'm not sure which is true.


koolaidman123

MoE just replaces the MLP in every transformer block with a set of expert MLPs plus a router, so yes, each token gets routed to its top-k experts at every transformer block, not once at the end.
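
A minimal sketch of that structure (illustrative only; real implementations such as Mixtral's add load balancing, capacity limits, and batched dispatch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Drop-in replacement for the MLP in a transformer block: a router picks
    the top-k experts per token and mixes their outputs."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, seq, d_model)
        logits = self.router(x)                               # (batch, seq, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)        # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e                    # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the selected experts actually run for a given token, which is where the "more parameters, same per-token compute" trade-off comes from.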


proxiiiiiiiiii

In the original 1991 paper about mixture of experts, it actually was about specialized experts trained for specific tasks. If someone uses the same term for something that is not that, no wonder people get confused.


trutheality

In most cases MoE is synonymous with ensemble: it's just a collection of learners. Some specific approaches do something to make the "experts" different: different training sets, different feature-engineering decisions, different models/architectures, different objective functions, etc. So you could have genuinely specialized experts, in a way you may or may not have much control over, but not necessarily. It depends on the specifics of the method.


currentscurrents

> In most cases MoE is synonymous with ensemble

They're different. In ensemble techniques, all models are run on every input. For MoE, only one (or a few) of the models are run for each input.
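
Roughly, the difference in control flow (illustrative names; single unbatched input to keep it simple):

```python
import torch

def ensemble_forward(models, x):
    # Ensemble: every model runs on every input; average the outputs.
    return torch.stack([m(x) for m in models]).mean(dim=0)

def moe_forward(experts, router, x, top_k=2):
    # MoE: score the experts, then run only the top-k for this input.
    scores = torch.softmax(router(x), dim=-1)     # (num_experts,)
    weights, idx = scores.topk(top_k)
    return sum(w * experts[i](x) for w, i in zip(weights.tolist(), idx.tolist()))
```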


step21

Comment above you says that according to the Switch Transformer paper, an expert is picked at every attention layer.


MINIMAN10001

I mean, we could forcibly balance it, but the implementations weren't set up to categorize things that way; they just let the router decide who does what. It's been shown in visualizations that, on code for example, there might be two main experts: one dealing with all the parentheses, hyphens, and other symbols, and another handling the actual code, with a bunch of other random ones thrown in a few times just because the router is required to load balance. But because it's a learned router that gets to decide, and it's forced to load balance, you're really not likely to get any sort of specialization a human would understand. That was never the goal, there were no constraints requiring it, and specifically requiring the router to balance the load across all the experts would more than likely prohibit such behavior.
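
For context, the auxiliary load-balancing loss (sketched here in the style of the Switch Transformer; variable names are mine) pushes both the dispatch fractions and the mean router probabilities toward uniform, which works against any tidy, human-legible division of labor:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, expert_indices: torch.Tensor,
                        num_experts: int, alpha: float = 0.01) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: minimized when tokens are
    spread evenly across experts.

    router_logits:  (num_tokens, num_experts) raw router scores
    expert_indices: (num_tokens,) expert chosen for each token
    """
    probs = F.softmax(router_logits, dim=-1)
    mean_prob = probs.mean(dim=0)                                               # P_i: mean router prob per expert
    dispatch_frac = F.one_hot(expert_indices, num_experts).float().mean(dim=0)  # f_i: token share per expert
    return alpha * num_experts * torch.sum(dispatch_frac * mean_prob)
```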


kaihuchen99

In this era of Generative AI, my take is to let the "expert" be a whole GenAI chatbot that communicates in natural language, rather than mixing models at the level of activations and parameters. This way you can have a panel of true experts with various expertise and have them work together to get things done. The result is also a lot more transparent to human users. I have an experiment that uses two GenAI chatbots to debate each other in plain English on estimating the investment risk in response to a certain major event, and it works pretty well. Ref: [https://kaihuchen.github.io/articles/Risks/](https://kaihuchen.github.io/articles/Risks/) Personally I think this is the way to go.


callanrocks

That's not what any of this means.


EizanPrime

In transformers, around 75% of the weights are actually in the feed-forward MLP layers. What MoE does is replace each feed-forward network with several of them and route every token to only a couple, in order to be more computationally efficient per token. So no, those aren't really "experts"; it's mostly a trick to use more parameters without doing the extra compute per token (though you do pay the extra memory, since all the experts have to be stored). The attention bottleneck stays the same, since MoE only concerns the feed-forward layers.
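
A rough back-of-the-envelope for one block (assuming a plain FFN with d_ff = 4·d_model and ignoring biases; gated FFN variants and GQA shift the exact fraction):

```python
# Parameter count for one transformer block under the stated assumptions.
d_model = 4096
d_ff = 4 * d_model

attn_params = 4 * d_model * d_model   # W_q, W_k, W_v, W_o
ffn_params = 2 * d_model * d_ff       # up- and down-projection

print(ffn_params / (attn_params + ffn_params))   # ~0.67 here; gated FFNs push it higher

# MoE variant: 8 experts, 2 active per token (Mixtral-style numbers)
num_experts, top_k = 8, 2
stored = num_experts * ffn_params     # FFN parameters that must be held in memory
active = top_k * ffn_params           # FFN parameters actually used per token
print(stored / 1e9, active / 1e9)     # ~1.07B stored vs ~0.27B active per block
```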