The model is trained by performing gradient descent on the mean-squared error loss
In their original publication, they were solving the problem of classifying phonemes in a speech signal from six different Japanese speakers, two female and four male.
Each expert simply predicts a Gaussian distribution and ignores the input entirely.
The mixture of experts predicts that the output is distributed according to the probability density function:
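\[
f(y \mid x) \;=\; \sum_i w_i(x)\, \mathcal{N}\!\left(y \mid \mu_i, \Sigma_i\right),
\]
where $w_i(x)$ is the weight that the weighting (gating) function assigns to expert $i$ for input $x$, and $\mu_i$ and $\Sigma_i$ are the mean and covariance of the Gaussian predicted by expert $i$ (notation chosen here for concreteness).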
This encourages the weighting function to learn to select only the experts that make the right predictions for each input.
After that happens, the lesser expert is unable to obtain a strong gradient signal and becomes even worse at predicting that kind of input. Conversely, the lesser expert can become better at predicting other kinds of input, and is increasingly pulled away into another region of the input space.
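This feedback loop can be made precise. If the model is fit by maximizing the likelihood of the mixture density above (a standard mixture-model calculation, shown here for illustration), then writing $N_i = \mathcal{N}(y \mid \mu_i, \Sigma_i)$ and $L = -\ln \sum_i w_i(x)\, N_i$, the gradient with respect to expert $i$'s mean is
\[
\frac{\partial L}{\partial \mu_i} \;=\; -\,\frac{w_i(x)\, N_i}{\sum_j w_j(x)\, N_j}\; \Sigma_i^{-1}\,(y - \mu_i),
\]
so each expert's update is scaled by its posterior responsibility $w_i(x) N_i / \sum_j w_j(x) N_j$: an expert that is rarely selected, or that assigns low density to $y$, receives a correspondingly small gradient.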
Hierarchical mixtures of experts[7][8] use multiple levels of gating in a tree.
In words, each expert learns to do linear regression, with a learnable uncertainty estimate.
This was later generalized to multi-class classification, with multinomial logistic regression experts.
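As a concrete illustration of a linear-regression expert with a learnable uncertainty estimate, consider the following minimal NumPy sketch; the parameter names, shapes, and the use of a log-variance parameter are assumptions made here, not taken from the cited papers.

```python
import numpy as np

def expert_nll(x, y, W, b, log_var):
    """Negative log-likelihood of one linear-regression expert.

    The expert predicts y ~ Normal(W @ x + b, exp(log_var)), so the
    noise variance exp(log_var) is itself a learnable parameter
    (the "learnable uncertainty estimate").
    """
    mean = W @ x + b                      # linear prediction
    var = np.exp(log_var)                 # keeps the variance positive
    residual = y - mean
    # Gaussian negative log-likelihood, summed over output dimensions.
    return 0.5 * np.sum(residual ** 2 / var + np.log(2 * np.pi * var))

# Hypothetical usage: one expert on 3-dimensional inputs, scalar outputs.
rng = np.random.default_rng(0)
W, b, log_var = rng.normal(size=(1, 3)), np.zeros(1), np.zeros(1)
x, y = rng.normal(size=3), rng.normal(size=1)
print(expert_nll(x, y, W, b, log_var))
```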
[15] One paper proposed the mixture of softmaxes for autoregressive language modelling.
is a probability distribution obtained by a linear-softmax operation on the activations of the hidden neurons within the model.
The original paper demonstrated its effectiveness for recurrent neural networks.
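For concreteness, a mixture-of-softmaxes output layer can be sketched as follows; the parameter names and shapes are illustrative assumptions, not taken from the cited paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - np.max(z, axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def mixture_of_softmaxes(h, mix_W, out_Ws):
    """Next-token distribution as a weighted mixture of K softmaxes.

    h:      hidden state of the model, shape (d,)
    mix_W:  mixing-weight matrix, shape (K, d)
    out_Ws: per-component output matrices, shape (K, vocab, d)
    """
    pi = softmax(mix_W @ h)                 # mixing weights, shape (K,)
    comps = softmax(out_Ws @ h, axis=-1)    # K softmax distributions, shape (K, vocab)
    return pi @ comps                       # mixture, shape (vocab,)

# Hypothetical usage with a 16-dim hidden state, 4 components, 100-word vocabulary.
rng = np.random.default_rng(0)
h = rng.normal(size=16)
p = mixture_of_softmaxes(h, rng.normal(size=(4, 16)), rng.normal(size=(4, 100, 16)))
print(p.shape, p.sum())  # (100,) and approximately 1.0
```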
With the rise of deep learning, MoE found applications in running the largest models, as a simple way to perform conditional computation: only parts of the model are used, and which parts are used is chosen according to the input.
The key goal when using MoE in deep learning is to reduce computing cost.
The sparsely-gated MoE layer,[20] published by researchers from Google Brain, uses feedforward networks as experts, and linear-softmax gating.
[22] Table 3 shows that the MoE models used less inference-time compute, despite having 30x more parameters.
To encourage the gate to select each expert with equal frequency (proper load balancing) within each batch, each MoE layer has two auxiliary loss functions.
[24] They also proposed an "auxiliary-loss-free load balancing strategy", which balances expert load without using an auxiliary loss.
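As an illustration of the auxiliary-loss approach, the following is a minimal sketch of one common form of load-balancing loss (similar in spirit to the Switch Transformer loss); it is not necessarily either of the two losses referred to above.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_index, num_experts):
    """One common form of auxiliary load-balancing loss.

    router_probs: softmax routing probabilities, shape (num_tokens, num_experts)
    expert_index: chosen expert per token, shape (num_tokens,)
    Returns num_experts * sum_i f_i * P_i, where f_i is the fraction of tokens
    routed to expert i and P_i is the mean routing probability for expert i;
    the loss is minimized when both are uniform across experts.
    """
    f = np.bincount(expert_index, minlength=num_experts) / len(expert_index)
    P = router_probs.mean(axis=0)
    return num_experts * np.sum(f * P)

# Hypothetical usage: 8 tokens routed among 4 experts with top-1 routing.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
chosen = probs.argmax(axis=1)
print(load_balancing_loss(probs, chosen, 4))
```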
The capacity factor is sometimes used to enforce a hard constraint on load balancing.
[26] In the original sparsely-gated MoE, only the top-k experts are queried, and their outputs are combined in a weighted sum.
Instead, its vector representation simply passes through the feedforward layer without change.
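The routing described in the last few sentences can be sketched as follows: top-k selection of experts, a weighted sum of their outputs, and a capacity limit under which tokens whose experts are full pass through unchanged. This is a minimal NumPy illustration with assumed shapes and names; production implementations batch the dispatch across devices and differ in many details.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(tokens, gate_W, experts, k=2, capacity_factor=1.25):
    """Sparsely-gated MoE layer: top-k routing with a per-expert capacity.

    tokens:  shape (num_tokens, d)
    gate_W:  gating matrix, shape (num_experts, d)
    experts: list of callables, each mapping a (d,) vector to a (d,) vector
    """
    num_tokens, _ = tokens.shape
    num_experts = len(experts)
    # Hard limit on how many tokens each expert may process in this batch.
    capacity = int(capacity_factor * num_tokens * k / num_experts)

    gates = softmax(tokens @ gate_W.T)          # (num_tokens, num_experts)
    out = np.zeros_like(tokens)
    load = np.zeros(num_experts, dtype=int)     # tokens accepted per expert so far

    for t in range(num_tokens):
        processed = False
        for e in np.argsort(gates[t])[::-1][:k]:    # top-k experts for this token
            if load[e] >= capacity:                 # expert is full: drop this assignment
                continue
            load[e] += 1
            processed = True
            out[t] += gates[t, e] * experts[e](tokens[t])   # weighted sum of expert outputs
        if not processed:
            out[t] = tokens[t]      # dropped token passes through unchanged
    return out

# Hypothetical usage: 6 tokens of dimension 8 routed among 4 toy linear experts.
rng = np.random.default_rng(0)
d = 8
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(4)]
print(moe_layer(rng.normal(size=(6, d)), rng.normal(size=(4, d)), experts).shape)
```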
[28] Other approaches include solving it as a constrained linear programming problem[29] and using reinforcement learning to train the routing algorithm (since picking an expert is a discrete action, as in RL).
[30] The token-expert match may involve no learning ("static routing"): it can be done by a deterministic hash function[31] or a random number generator.
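A static routing rule needs no trainable parameters at all; for example (an illustrative sketch, not the exact scheme of the cited works):

```python
import hashlib

def hash_route(token_id: int, num_experts: int) -> int:
    """Deterministic, learning-free routing: hash the token id to an expert."""
    digest = hashlib.sha256(str(token_id).encode()).hexdigest()
    return int(digest, 16) % num_experts

print([hash_route(t, 8) for t in [17, 42, 17]])  # same token id -> same expert
```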
This is because the feedforward layers take up an increasing portion of the computing cost as models grow larger.
[34] There are many design choices involved in Transformer MoE models that affect training stability and final performance.
Later, GLaM[39] demonstrated a language model with 1.2 trillion parameters, with each MoE layer using the top 2 out of 64 experts.
At the first level, the gating function chooses either to use a "shared" feedforward layer or to use the experts.
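One way to read this two-level gating, as a purely illustrative sketch rather than the exact architecture of the cited work, is to relax the hard choice into a soft weighting between the shared feedforward layer and the expert branch:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def two_level_moe(x, first_gate_W, shared_ffn, expert_gate_W, experts):
    """First-level gate weighs a shared feedforward layer against the expert
    branch; the second-level gate weighs the individual experts."""
    a = softmax(first_gate_W @ x)      # (2,): weight on shared FFN vs. experts
    g = softmax(expert_gate_W @ x)     # (num_experts,)
    expert_out = sum(g[i] * experts[i](x) for i in range(len(experts)))
    return a[0] * shared_ffn(x) + a[1] * expert_out

# Hypothetical usage with a 4-dim input, one shared FFN and 3 experts.
rng = np.random.default_rng(0)
shared = lambda x, W=rng.normal(size=(4, 4)): np.tanh(W @ x)
experts = [lambda x, W=rng.normal(size=(4, 4)): np.tanh(W @ x) for _ in range(3)]
print(two_level_moe(rng.normal(size=4), rng.normal(size=(2, 4)),
                    shared, rng.normal(size=(3, 4)), experts))
```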
[41] MoE large language models can be adapted for downstream tasks by instruction tuning.
[42] In December 2023, Mistral AI released Mixtral 8x7B under the Apache 2.0 license.