Mixture of Experts in Transformers (MoE)

MoE replaces the dense FFN layer in each transformer block with multiple expert FFNs and a router that selects which experts process each token.

In a standard transformer, every token passes through the same FFN. In MoE, a router network first scores each expert based on the input token, producing a probability distribution via softmax. The top-k experts (typically k=1 or k=2) are selected, and their outputs are combined using the softmax probabilities as weights.

Each expert is just a standard FFN—the same architecture as in a dense transformer. The router is typically a simple linear projection from the token embedding to a vector of size num_experts.
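The routing step described above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation for a single token; the function and variable names are my own, not from any particular library, and real implementations batch tokens and dispatch them to experts in parallel:

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    # Router: a linear projection from the token embedding to one
    # score per expert, turned into probabilities with a softmax.
    logits = x @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Select the top-k experts and renormalize their probabilities
    # (some implementations skip the renormalization).
    topk = np.argsort(probs)[-k:]
    weights = probs[topk] / probs[topk].sum()

    # Only the selected experts run; their outputs are combined
    # using the softmax-derived weights.
    return sum(w * experts[i](x) for w, i in zip(weights, topk))
```

Here each entry of `experts` would be an ordinary FFN; with 8 experts and k=2, only a quarter of the expert parameters are exercised per token.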

You get N× more parameters while keeping compute roughly constant. This matters because larger models generally perform better, but compute is expensive. MoE lets you scale parameters without proportionally scaling FLOPs.


In practice, models like Mixtral 8x7B have 8 experts with top-2 routing—46.7B total parameters but only ~13B active per token, performing comparably to dense models with much higher compute budgets.
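A rough back-of-the-envelope check reproduces these numbers from Mixtral's published dimensions (d_model 4096, SwiGLU FFN hidden size 14336, 32 layers, 8 experts, top-2 routing); the 46.7B total is taken as given, and the figures are approximate:

```python
# Approximate parameter accounting for Mixtral 8x7B.
d_model, d_ffn, layers, experts, top_k = 4096, 14336, 32, 8, 2

per_expert = 3 * d_model * d_ffn             # one SwiGLU expert (3 matrices), one layer
all_experts = per_expert * experts * layers  # ~45.1B parameters across all experts
shared = 46.7e9 - all_experts                # attention, embeddings, norms, router
active = shared + per_expert * top_k * layers
print(f"active per token: {active/1e9:.1f}B")  # ~12.9B
```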

Usually, an auxiliary load-balancing loss is added to the model objective to ensure every expert is selected often enough to receive a training signal; otherwise routing can easily collapse onto a single expert.
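One common form is the auxiliary loss from the Switch Transformer, which couples the fraction of tokens dispatched to each expert with the mean router probability for that expert; it is minimized when both spread evenly. A NumPy sketch (names are illustrative):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignments, num_experts):
    # f_i: fraction of tokens actually dispatched to expert i.
    f = np.bincount(expert_assignments, minlength=num_experts) / len(expert_assignments)
    # P_i: mean router probability assigned to expert i.
    P = router_probs.mean(axis=0)
    # Scaled so a perfectly uniform router gives a loss of 1.0.
    return num_experts * float(np.dot(f, P))
```

Because `f` counts hard assignments, gradients reach the router only through `P`; the loss nudges probability mass toward underused experts.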

Training Challenges from the Discrete Routing Operation

The top-k selection is discrete and non-differentiable, so gradients cannot flow through the selection itself. Instead, the gradients that flow through the softmax weights of the selected experts act as a proxy for the router's gradients.

Standard backprop: "If I had changed this weight, how would the loss change?" — exact gradient, counterfactual reasoning.

MoE routing: "I sampled expert 2, it did well, so increase P(expert 2)" — learning from what you happened to try, with no counterfactual. This is closer to REINFORCE (a score-function estimator):

  • Sample an action (expert selection)
  • Observe reward (loss)
  • Reinforce actions that led to good outcomes

This still works for LLMs because:

  • Massive scale: billions of samples average out the noise
  • Low-variance setting: experts are similar enough that "wrong" selections aren't catastrophic
  • Auxiliary routing losses force exploration
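The proxy-gradient behavior is easy to check numerically. In the toy example below (scalar expert outputs, renormalized top-2 weights), finite differences show a nonzero gradient of the output with respect to the selected experts' router logits, but an exactly zero gradient for the unselected expert, whose logit drops out after renormalization — the router only gets learning signal for experts it actually tried:

```python
import numpy as np

def topk_mix(logits, expert_outputs, k=2):
    # Renormalized top-k softmax combination, as in MoE routing.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    sel = np.argsort(p)[-k:]
    w = p[sel] / p[sel].sum()
    return float(np.dot(w, expert_outputs[sel]))

logits = np.array([2.0, 1.0, -3.0])  # expert 2 is never selected
outs = np.array([1.0, 4.0, 9.0])     # scalar outputs of three toy experts

eps = 1e-6
grads = []
for i in range(3):
    bumped = logits.copy()
    bumped[i] += eps
    grads.append((topk_mix(bumped, outs) - topk_mix(logits, outs)) / eps)
# grads[0] and grads[1] are nonzero; grads[2] is zero.
```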

Several approaches have been proposed to tackle this in a more principled way: