Mixture of Experts Architecture
The Specialist vs. Generalist Trade-Off
Picture a law firm. You could hire one attorney who knows a little about everything - contracts, criminal law, tax, immigration, intellectual property. That attorney can handle any case that walks in the door, but they'll never be as sharp on any specific topic as a specialist would be. Or you could hire 20 specialist attorneys. Now you have vastly more total knowledge in the firm, but on any given case you only need one or two lawyers actively working. The other 18 are at their desks, not billing.
This is the core intuition behind Mixture of Experts. Instead of one dense network that handles everything, you have many specialized networks (experts). For any given input, only a small number of experts are activated - the most relevant ones. The others don't participate and cost no compute for this particular forward pass, although their weights still occupy memory.
The result is a model with much greater total capacity (many specialists) at a compute cost similar to a much smaller model (only a few specialists active at once). This trade-off - total capacity for active compute - is the fundamental promise of MoE architectures.
Why This Exists - The Limits of Dense Scaling
When you scale a dense transformer, every parameter participates in every forward pass. Double the parameters and you roughly double the FLOPs per token. Serve a 70B model instead of a 7B model and every token costs roughly 10x as much compute at inference.
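As a back-of-the-envelope check, the sketch below uses the common approximation that a dense forward pass costs about 2 FLOPs per parameter per token (one multiply and one add per weight). It ignores the attention term that depends on context length, and the function name is purely illustrative.

```python
def dense_forward_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass FLOPs per token for a dense model.

    Assumes ~2 FLOPs (multiply + add) per parameter and ignores the
    context-length-dependent attention term.
    """
    return 2.0 * n_params


flops_7b = dense_forward_flops_per_token(7e9)
flops_70b = dense_forward_flops_per_token(70e9)

print(f"7B dense:  {flops_7b:.2e} FLOPs/token")
print(f"70B dense: {flops_70b:.2e} FLOPs/token")
print(f"Ratio:     {flops_70b / flops_7b:.0f}x")
```

Under this approximation, cost per token tracks parameter count directly, which is the coupling MoE is designed to break.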
This is not a law of nature - it's a consequence of the architecture. In a dense model, the same set of weights processes legal text, Python code, Japanese poetry, and organic chemistry questions. There's no specialization: every parameter has to learn to be useful across all domains.
The observation that motivated MoE research: a lot of what's in a large dense model is domain-specific knowledge that's wasted on most inputs. The fraction of the model's capacity that's relevant to any given token is small. Why activate the parameters that encode knowledge of ancient Greek when you're processing a Python function?
MoE architectures make this specialization explicit and structural. Different experts learn to handle different types of inputs, and the router learns to direct each token to the experts best suited for it.
Historical Context - From 1991 to Modern Transformers
The idea of mixture of experts is older than deep learning. Jacobs, Jordan, Nowlan, and Hinton (1991) introduced "Adaptive Mixtures of Local Experts" - a framework where multiple expert networks are combined via a gating network that learns to assign different inputs to different experts.
Jordan and Jacobs (1994) extended this with the Hierarchical Mixture of Experts, a recursive structure where experts can themselves be mixtures of more specialized experts.
The modern large-scale application to language modeling began with Shazeer et al. (2017) "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." This paper introduced the key innovation of sparsely-activated experts: using a gating network that selects only the top-$k$ experts per token, making the effective compute per token bounded regardless of total expert count.
The 2017 paper showed that MoE layers could be stacked inside LSTM language models to create models with up to 137 billion parameters, at a time when typical LSTM language models were far smaller. Performance improved substantially.
The application to transformers came later: GShard (Lepikhin et al., 2021) applied MoE to translation models with 600B parameters. Switch Transformer (Fedus et al., 2022) proposed top-1 routing for simplicity and scaling. GLaM (Du et al., 2022) applied MoE to GPT-style models. Mixtral (Jiang et al., 2024) brought the paradigm to the open-source community with strong results.
The Core Architecture - MoE as a Drop-In Replacement
In a standard transformer, each layer has two sub-layers:
- Multi-head self-attention
- A dense feed-forward network (FFN): two linear projections with a nonlinearity
In a Mixture of Experts transformer, the FFN is replaced by a MoE layer that contains $N$ expert FFNs plus a router:

$$\text{MoE}(x) = \sum_{i \in \mathcal{T}} g_i \, E_i(x)$$

where:
- $E_i(x)$ is the output of expert $i$ applied to input $x$
- $g_i$ is the routing weight assigned to expert $i$ for this token
- The sum is only over $\mathcal{T}$, the set of top-$k$ selected experts
The attention layers remain dense - they process every token with all attention heads. Only the FFN layers become sparse MoE.
The Router - Deciding Which Experts Handle What
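The router is a learned linear transformation followed by a softmax:

$$s = \text{softmax}(W_r \, x)$$

where $W_r \in \mathbb{R}^{N \times d_{\text{model}}}$ is the router weight matrix and $s \in \mathbb{R}^{N}$ contains scores for all $N$ experts.
Top-$k$ selection takes the $k$ experts with the highest scores:

$$\mathcal{T} = \text{TopK}(s, k)$$

The routing weights are normalized to sum to 1 over the selected experts:

$$g_i = \frac{s_i}{\sum_{j \in \mathcal{T}} s_j}, \quad i \in \mathcal{T}$$

The final MoE output is the weighted sum of selected expert outputs:

$$y = \sum_{i \in \mathcal{T}} g_i \, E_i(x)$$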
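The routing steps above (softmax, top-k selection, renormalization, weighted sum) can be traced by hand for a single token. The sketch below is only a toy: the router logits are made-up values, and each "expert" is a stand-in that scales its input by a constant rather than a real FFN.

```python
import torch
import torch.nn.functional as F

# Hypothetical router logits for one token over N = 4 experts
router_logits = torch.tensor([2.0, 1.0, 0.0, -1.0])

# Softmax over all experts gives the router scores
scores = F.softmax(router_logits, dim=-1)

# Top-k selection with k = 2 keeps the two highest-scoring experts
top_k_scores, top_k_indices = torch.topk(scores, k=2)

# Renormalize so the selected weights sum to 1 (approximately [0.731, 0.269])
weights = top_k_scores / top_k_scores.sum()

# Stand-in experts: expert i simply multiplies its input by (i + 1)
x = torch.tensor([1.0, -1.0])
expert_outputs = torch.stack([(i + 1) * x for i in top_k_indices.tolist()])

# Weighted sum of the selected experts' outputs
y = (weights.unsqueeze(-1) * expert_outputs).sum(dim=0)

print("selected experts:", top_k_indices.tolist())
print("routing weights: ", weights.tolist())
print("output:          ", y.tolist())
```

Experts 2 and 3 never run for this token, which is where the compute saving comes from.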
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple


class ExpertFFN(nn.Module):
    """
    A single expert: a standard FFN with its own independent weights.
    """

    def __init__(self, d_model: int, d_ff: int, activation: str = "silu"):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # Gate projection (SwiGLU)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # Down projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # Up (value) projection
        activations = {"silu": F.silu, "relu": F.relu, "gelu": F.gelu}
        self.activation = activations[activation]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU activation (used by Mixtral and DeepSeek)
        # FFN(x) = (SiLU(xW1) * xW3) @ W2
        gate = self.activation(self.w1(x))
        value = self.w3(x)
        return self.w2(gate * value)


class MoELayer(nn.Module):
    """
    Mixture of Experts layer that replaces a dense FFN in a transformer.
    Each token is routed to the top-k experts; the outputs are combined
    as a weighted sum using the router's normalized scores.
    """

    def __init__(
        self,
        d_model: int,
        d_ff: int,
        n_experts: int = 8,
        top_k: int = 2,
        activation: str = "silu",
        load_balance_alpha: float = 0.01,
    ):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        self.load_balance_alpha = load_balance_alpha
        # Router: projects token representation to expert scores
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # N independent expert FFNs
        self.experts = nn.ModuleList([
            ExpertFFN(d_model, d_ff, activation)
            for _ in range(n_experts)
        ])

    def forward(
        self,
        x: torch.Tensor,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass through the MoE layer.

        Args:
            x: Token representations [batch, seq_len, d_model]
        Returns:
            output: [batch, seq_len, d_model]
            aux_loss: Load balancing auxiliary loss (scalar)
        """
        batch, seq_len, d_model = x.shape
        # Flatten to [batch*seq_len, d_model] for routing
        # (reshape also handles non-contiguous inputs, unlike view)
        x_flat = x.reshape(-1, d_model)  # [T, d_model] where T = batch * seq_len
        T = x_flat.shape[0]

        # Step 1: Compute router logits and softmax scores
        router_logits = self.router(x_flat)  # [T, n_experts]
        router_scores = F.softmax(router_logits, dim=-1)  # [T, n_experts]

        # Step 2: Select top-k experts for each token
        top_k_scores, top_k_indices = torch.topk(
            router_scores, self.top_k, dim=-1
        )  # Both [T, top_k]

        # Step 3: Normalize routing weights (sum to 1 over selected experts)
        top_k_weights = top_k_scores / top_k_scores.sum(dim=-1, keepdim=True)
        # [T, top_k]

        # Step 4: Compute auxiliary load balancing loss
        aux_loss = self._load_balance_loss(router_scores, top_k_indices, T)

        # Step 5: Route tokens through selected experts
        output = torch.zeros_like(x_flat)  # [T, d_model]
        for expert_idx in range(self.n_experts):
            # Find all tokens that route to this expert
            # token_mask: [T, top_k] -> find where top_k_indices == expert_idx
            token_mask = (top_k_indices == expert_idx)  # [T, top_k]
            # Which tokens use this expert at any position?
            any_mask = token_mask.any(dim=-1)  # [T]
            if not any_mask.any():
                continue  # No tokens route to this expert

            # Get the tokens that use this expert
            expert_inputs = x_flat[any_mask]  # [T_e, d_model]
            # Run expert
            expert_output = self.experts[expert_idx](expert_inputs)  # [T_e, d_model]

            # Get routing weights for this expert at each position
            # For tokens that use this expert at position j, get weight at j
            weights_for_expert = torch.zeros(T, device=x.device, dtype=x.dtype)
            for k_idx in range(self.top_k):
                k_mask = token_mask[:, k_idx]
                weights_for_expert += k_mask.to(x.dtype) * top_k_weights[:, k_idx]
            expert_weights = weights_for_expert[any_mask].unsqueeze(-1)  # [T_e, 1]

            # Accumulate weighted expert outputs
            output[any_mask] += expert_weights * expert_output

        # Reshape back to [batch, seq_len, d_model]
        output = output.view(batch, seq_len, d_model)
        return output, aux_loss

    def _load_balance_loss(
        self,
        router_scores: torch.Tensor,
        top_k_indices: torch.Tensor,
        T: int,
    ) -> torch.Tensor:
        """
        Auxiliary load balancing loss to prevent expert collapse.
        From Switch Transformer (Fedus et al., 2022):
            L_aux = alpha * n_experts * sum_i(f_i * P_i)
        where:
        - f_i = fraction of tokens dispatched to expert i
        - P_i = fraction of router probability allocated to expert i
        """
        # Compute f_i: fraction of tokens sent to each expert
        # Create a multi-hot encoding of which experts were selected
        expert_usage = torch.zeros(
            T, self.n_experts,
            device=router_scores.device, dtype=router_scores.dtype,
        )
        for k_idx in range(self.top_k):
            expert_usage.scatter_(1, top_k_indices[:, k_idx:k_idx + 1], 1.0)
        # Note: with top_k > 1 each token counts toward top_k experts,
        # so f sums to top_k rather than 1
        f = expert_usage.mean(dim=0)  # [n_experts]

        # Compute P_i: mean routing probability for each expert
        P = router_scores.mean(dim=0)  # [n_experts]

        # Load balance loss: minimized when all experts get an equal share
        aux_loss = self.load_balance_alpha * self.n_experts * (f * P).sum()
        return aux_loss
```
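To see why this auxiliary term discourages collapse, the sketch below re-implements the same formula as a standalone function (a hypothetical helper, not part of the class above) and evaluates it on two hand-built top-1 routing patterns over 4 experts: perfectly balanced and fully collapsed. The routing data is invented for illustration.

```python
import torch


def load_balance_loss(router_scores, top_k_indices, n_experts, alpha=0.01):
    """alpha * n_experts * sum_i(f_i * P_i), as in the Switch Transformer loss."""
    T = router_scores.shape[0]
    usage = torch.zeros(T, n_experts)
    usage.scatter_(1, top_k_indices, 1.0)   # mark the selected experts per token
    f = usage.mean(dim=0)                   # fraction of tokens per expert
    P = router_scores.mean(dim=0)           # mean router probability per expert
    return alpha * n_experts * (f * P).sum()


n_experts, T = 4, 8

# Balanced: uniform router probabilities, tokens spread evenly across experts
balanced_scores = torch.full((T, n_experts), 1.0 / n_experts)
balanced_idx = torch.tensor([[0], [1], [2], [3], [0], [1], [2], [3]])

# Collapsed: all probability mass and every token go to expert 0
collapsed_scores = torch.zeros(T, n_experts)
collapsed_scores[:, 0] = 1.0
collapsed_idx = torch.zeros(T, 1, dtype=torch.long)

balanced = load_balance_loss(balanced_scores, balanced_idx, n_experts)
collapsed = load_balance_loss(collapsed_scores, collapsed_idx, n_experts)

print(f"balanced:  {balanced.item():.4f}")   # 0.0100
print(f"collapsed: {collapsed.item():.4f}")  # 0.0400
```

With top-1 routing the balanced case gives exactly `alpha`, while full collapse gives `alpha * n_experts`, so the penalty grows with the number of experts being ignored.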
Parameter Counts - Why MoE Is Efficient
Let's work through the numbers for a concrete example.
Mixtral 8x7B:
- 32 transformer layers
- At each layer: multi-head attention + MoE layer (8 experts, top-2)
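- Each expert is an FFN with the same shape as the FFN in Mistral 7B (d_model = 4096, d_ff = 14336), not a full 7B model, which is why the total is well below 8 x 7B = 56B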
- Total parameters: ~47 billion
- Active parameters per token: ~13 billion (attention params are shared + 2 of 8 experts)
Compare to a dense model with 47B parameters:
- Every forward pass requires computing with all 47B parameters
- Inference cost is proportional to 47B
Mixtral's inference cost is proportional to ~13B parameters - because attention heads are dense but the FFN (the majority of parameters in a transformer) is sparse.
```python
from typing import Optional


def compute_moe_parameter_counts(
    n_layers: int,
    d_model: int,
    n_heads: int,
    d_ff: int,
    n_experts: int,
    top_k: int,
    vocab_size: int = 32000,
    n_kv_heads: Optional[int] = None,
) -> dict:
    """
    Compute total and active parameter counts for a MoE transformer.
    Returns both total (all parameters) and active (per-token forward pass).
    """
    # Attention layer parameters (no biases)
    # Q and output projections are d_model x d_model. With grouped-query
    # attention, K and V project to n_kv_heads * head_dim instead.
    if n_kv_heads is None:
        n_kv_heads = n_heads  # Standard multi-head attention
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    attn_params_per_layer = 2 * d_model * d_model + 2 * d_model * kv_dim

    # MoE layer parameters
    # Each expert: 3 weight matrices for SwiGLU (w1, w2, w3)
    expert_params = 3 * d_model * d_ff
    total_expert_params_per_layer = n_experts * expert_params
    active_expert_params_per_layer = top_k * expert_params  # Only top-k active

    # Router parameters (small)
    router_params_per_layer = d_model * n_experts

    # Per-layer totals
    total_params_per_layer = (
        attn_params_per_layer +
        total_expert_params_per_layer +
        router_params_per_layer +
        2 * d_model  # Norm weights (per layer, approximately)
    )
    active_params_per_layer = (
        attn_params_per_layer +
        active_expert_params_per_layer +
        router_params_per_layer
    )

    # Full model: layers + input embedding + untied output head
    total_params = n_layers * total_params_per_layer + vocab_size * d_model * 2
    active_params = n_layers * active_params_per_layer + vocab_size * d_model * 2

    return {
        "total_parameters": total_params,
        "active_parameters_per_token": active_params,
        "active_fraction": active_params / total_params,
        "total_params_B": total_params / 1e9,
        "active_params_B": active_params / 1e9,
        "moe_efficiency_ratio": (total_params / active_params),  # Capacity mult.
    }


# Example: Mixtral 8x7B configuration
mixtral_counts = compute_moe_parameter_counts(
    n_layers=32,
    d_model=4096,
    n_heads=32,
    d_ff=14336,  # Mixtral's FFN dimension
    n_experts=8,
    top_k=2,
    vocab_size=32000,
    n_kv_heads=8,  # Mixtral uses grouped-query attention with 8 KV heads
)
print("Mixtral 8x7B:")
print(f"  Total parameters:  {mixtral_counts['total_params_B']:.1f}B")
print(f"  Active parameters: {mixtral_counts['active_params_B']:.1f}B")
print(f"  Active fraction:   {mixtral_counts['active_fraction']:.1%}")
print(f"  Capacity ratio:    {mixtral_counts['moe_efficiency_ratio']:.1f}x")
# Output:
# Mixtral 8x7B:
#   Total parameters:  46.7B
#   Active parameters: 12.9B
#   Active fraction:   27.6%
#   Capacity ratio:    3.6x
```
How MoE Fits Into the Transformer
Not all layers in a transformer need to be MoE layers. In practice, there are two common approaches:
All FFN layers as MoE (Mixtral, Grok): every transformer layer's FFN is replaced with a MoE layer. This maximizes capacity gains.
Alternating layers (some models): alternate between dense FFN layers and MoE layers. The dense layers handle general processing while MoE layers provide specialized capacity.
The attention layers are universally kept dense. The intuition: attention learns to identify relationships between positions (which tokens to attend to), which is a universal operation needed for all inputs. The FFN layers learn content-level transformations (how to process the content of each position), which benefits more from specialization.
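The two placement strategies above can be expressed as a per-layer schedule. The helper below is a hypothetical sketch (the name `build_ffn_schedule` and the `moe_every` parameter are illustrative, not taken from any particular codebase) that labels each layer's FFN as dense or MoE.

```python
from typing import List


def build_ffn_schedule(n_layers: int, moe_every: int = 1) -> List[str]:
    """Label each transformer layer's FFN as 'moe' or 'dense'.

    moe_every=1 -> every layer uses a MoE FFN
    moe_every=2 -> every second layer uses a MoE FFN, the rest stay dense
    """
    if moe_every < 1:
        raise ValueError("moe_every must be >= 1")
    return [
        "moe" if (layer_idx + 1) % moe_every == 0 else "dense"
        for layer_idx in range(n_layers)
    ]


all_moe = build_ffn_schedule(n_layers=4, moe_every=1)
alternating = build_ffn_schedule(n_layers=4, moe_every=2)

print(all_moe)      # ['moe', 'moe', 'moe', 'moe']
print(alternating)  # ['dense', 'moe', 'dense', 'moe']
```

In both schedules the attention sub-layer is unchanged; only the FFN slot differs from layer to layer.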
What Do Experts Actually Specialize In?
This is one of the most interesting empirical questions in MoE research: do experts actually learn domain-specific specializations, or are they more uniform?
Research findings (from analysis of Mixtral and other MoE models):
- Syntactic specialization: some experts preferentially process specific syntactic contexts (prepositional phrases, verb arguments, etc.)
- Domain specialization: in some analyses, experts grouped by the types of text they most often handle cluster around code, natural language, or specific domains (math, medical, legal). This effect is usually weaker than the syntactic one; the Mixtral authors report no obvious topic-level pattern in expert assignment.
- Layer-dependent specialization: early-layer experts tend to show more syntactic specialization; later-layer experts show more semantic specialization
- Imperfect specialization: experts are not cleanly specialized - there's significant overlap. The specialization is statistical, not categorical.
The practical implication: MoE works because different inputs genuinely benefit from different computation paths, even if those paths aren't cleanly labeled by human-interpretable categories.
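One way to probe for this kind of statistical specialization is to count how often each expert is selected for tokens from each category. The sketch below assumes you have already collected the router's top-k indices for one layer and a category label per token; the function name and the toy data are hypothetical.

```python
import torch


def expert_usage_by_group(
    top_k_indices: torch.Tensor,  # [T, top_k] expert ids chosen per token
    group_ids: torch.Tensor,      # [T] category id per token (e.g. 0=code, 1=prose)
    n_experts: int,
    n_groups: int,
) -> torch.Tensor:
    """Return a [n_groups, n_experts] table of routing frequencies per group."""
    top_k = top_k_indices.shape[1]
    # Repeat each token's group id once per selected expert
    flat_groups = group_ids.unsqueeze(1).expand(-1, top_k).reshape(-1)
    flat_experts = top_k_indices.reshape(-1)
    # Encode each (group, expert) pair as one integer and count occurrences
    pair_index = flat_groups * n_experts + flat_experts
    counts = torch.bincount(pair_index, minlength=n_groups * n_experts)
    counts = counts.view(n_groups, n_experts).float()
    # Normalize each row so it sums to 1 (guarding against empty groups)
    return counts / counts.sum(dim=1, keepdim=True).clamp(min=1.0)


# Toy data: 4 tokens, top-2 routing, 3 experts, 2 groups
top_k_indices = torch.tensor([[0, 1], [0, 2], [2, 1], [2, 1]])
group_ids = torch.tensor([0, 0, 1, 1])

usage = expert_usage_by_group(top_k_indices, group_ids, n_experts=3, n_groups=2)
print(usage)
```

Rows that differ sharply from one another suggest specialization; near-identical rows suggest the experts are being used uniformly across categories.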
Production Engineering Notes
Memory Layout
In a MoE model, all expert weights must be loadable into memory even though only a fraction are active at any time. For Mixtral 8x7B:
- Total weight size at FP16: ~94 GB
- Active weights per token: ~26 GB
- To avoid spilling experts to CPU: need 94 GB total GPU VRAM
This is the key infrastructure challenge with MoE: the memory requirement is driven by total parameters, not active parameters. On a single 80 GB A100, Mixtral 8x7B doesn't fit at FP16, even before accounting for the KV cache and activations. You need either 2x A100 (with tensor parallelism) or quantization.
Quantization
MoE models quantize well. At 4-bit quantization (GPTQ or AWQ):
- Mixtral 8x7B: ~24 GB - fits on a single A100 80GB
- Mixtral 8x22B (~141B total parameters): ~70 GB - fits on 2x A100 80GB with headroom for the KV cache
- DeepSeek-V3 671B: ~335 GB - needs 5-8 A100 80GB GPUs even quantized
The quality degradation from 4-bit quantization of MoE models is generally similar to or slightly less than for dense models of equivalent active parameters.
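The Mixtral 8x7B figures above follow from simple arithmetic on parameter counts. The sketch below is a weight-only estimate: it ignores the KV cache, activations, and any per-group quantization metadata, so real deployments need somewhat more. The helper name is illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 46.7e9   # Mixtral 8x7B total parameters
ACTIVE_PARAMS = 12.9e9  # Mixtral 8x7B active parameters per token
A100_VRAM_GB = 80.0

fp16_total = weight_memory_gb(TOTAL_PARAMS, 16)
fp16_active = weight_memory_gb(ACTIVE_PARAMS, 16)
int4_total = weight_memory_gb(TOTAL_PARAMS, 4)

print(f"FP16, all experts:    {fp16_total:.1f} GB")
print(f"FP16, active weights: {fp16_active:.1f} GB")
print(f"4-bit, all experts:   {int4_total:.1f} GB")
print(f"Fits one A100 at FP16?  {fp16_total <= A100_VRAM_GB}")
print(f"Fits one A100 at 4-bit? {int4_total <= A100_VRAM_GB}")
```

Note that the active-weight figure is not a usable memory budget: which experts are active changes from token to token, so all of them have to stay resident.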
:::danger Common Mistake: Confusing Total and Active Parameters
"Mixtral is a 47B model" is misleading for compute cost analysis. Its per-token inference compute is closer to that of a 13B model. When comparing compute costs across dense and MoE models, compare active parameters, not total parameters: a Mixtral 8x7B call costs roughly the same compute as a 13B dense model call, despite having 47B total parameters. Memory cost, by contrast, still scales with total parameters.
:::
:::warning Expert Collapse Without Load Balancing
Without an auxiliary load balancing loss, MoE training almost always converges to expert collapse: a few experts receive nearly all tokens, and the remaining experts receive almost none. The neglected experts never develop meaningful specializations, and the model degenerates toward a smaller dense model. Always include the auxiliary load balance loss during training.
:::
:::tip Choosing Top-K
Most production MoE models use top-2 routing. Using top-1 (Switch Transformer) is simpler and reduces compute but sacrifices robustness - a wrong routing decision for a token means no correction. Using top-4 or higher gives more robustness but increases compute proportionally. Top-2 is the empirical sweet spot for most configurations.
:::
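To make "increases compute proportionally" concrete, the sketch below counts the expert parameters touched per token per layer for different values of k, using the Mixtral-style dimensions from earlier in this article (d_model = 4096, d_ff = 14336, SwiGLU experts). It counts only expert weights and leaves out attention and the router; the function name is illustrative.

```python
D_MODEL = 4096
D_FF = 14336


def active_expert_params_per_layer(
    top_k: int, d_model: int = D_MODEL, d_ff: int = D_FF
) -> int:
    """Expert parameters used per token in one MoE layer.

    Each SwiGLU expert has three d_model x d_ff weight matrices.
    """
    return top_k * 3 * d_model * d_ff


baseline = active_expert_params_per_layer(top_k=1)
for k in (1, 2, 4):
    params = active_expert_params_per_layer(top_k=k)
    print(f"top-{k}: {params / 1e6:.1f}M expert params/layer "
          f"({params / baseline:.0f}x top-1)")
```

The expert share of per-token compute scales linearly in k, while total parameter count (and therefore memory) does not change at all.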
Interview Questions and Answers
Q1: Explain the core idea of Mixture of Experts and why it's useful for LLMs.
Mixture of Experts replaces the dense FFN layers in a transformer with $N$ specialized expert networks plus a learned router. For each token, the router selects only the top-$k$ experts (typically $k = 2$) to process it, and the outputs are combined as a weighted sum. This creates a model with much larger total capacity (many experts, many parameters) at a compute cost similar to a model with only $k$ active experts' worth of compute. The result: Mixtral 8x7B has 47B total parameters but costs roughly the same to run as a 13B dense model, while achieving quality closer to a 70B dense model. This is the fundamental efficiency gain.
Q2: Where in the transformer architecture does the MoE layer go, and why?
The MoE layer replaces the FFN (feed-forward network) in each transformer layer. The self-attention layers remain dense. This is because attention computes relationships between tokens - a universally useful operation that benefits from seeing the entire model's attention capacity. The FFN layers, by contrast, transform each token's representation independently and are where most content-level processing happens. FFNs account for the majority of parameters in a transformer (typically ~2/3 of total params), so making them sparse yields the largest efficiency gains. Attention is kept dense because routing individual heads to different experts is architecturally much more complex and the efficiency gains are smaller.
Q3: What is top-k routing and how does it work mathematically?
Top-k routing: for each token representation $x$, compute router scores via a linear transformation and softmax: $s = \text{softmax}(W_r \, x)$ where $s \in \mathbb{R}^{N}$. Select the $k$ experts with the highest scores. Normalize the selected scores to sum to 1. Compute the MoE output as $y = \sum_{i \in \mathcal{T}} g_i \, E_i(x)$, where $\mathcal{T}$ is the set of selected experts and $g_i$ is the normalized weight for expert $i$. Top-2 routing ($k = 2$) is most common - it uses two experts per token, providing a mix of two specializations for each input. The top-k selection is a discrete operation (not differentiable through expert selection), but gradients flow through the router weights via the routing scores.
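The last point, that gradients reach the router only through the scores of the selected experts, can be checked on a toy example. In the sketch below the router weights are set by hand so the selection is predictable, and each "expert" is a stand-in that scales its input by a constant; none of this reflects a real model's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Router over 3 experts, with hand-set weights so the top-2 choice is known
router = nn.Linear(4, 3, bias=False)
with torch.no_grad():
    router.weight.copy_(torch.tensor([
        [0.1, 0.0, 0.0, 0.0],
        [0.0, 0.1, 0.0, 0.0],
        [0.0, 0.0, 0.1, 0.0],
    ]))

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
scores = F.softmax(router(x), dim=-1)          # logits are [0.1, 0.2, 0.3]
top_scores, top_idx = torch.topk(scores, k=2)  # selects experts 2 and 1
weights = top_scores / top_scores.sum()        # renormalize over the top-2

# Stand-in experts: expert i multiplies its input by (i + 1)
expert_outs = torch.stack([(i + 1) * x for i in top_idx.tolist()])
y = (weights.unsqueeze(-1) * expert_outs).sum(dim=0)

# Backpropagate a simple scalar loss and inspect the router's gradient
y.sum().backward()
grad = router.weight.grad  # [3, 4], one row per expert

print("selected experts:", top_idx.tolist())
print("per-expert router gradient norm:", grad.norm(dim=1).tolist())
```

The rows for the two selected experts receive a nonzero gradient, while the row for the unselected expert is zero up to floating-point error: after renormalization, the output no longer depends on its score.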
Q4: Explain the difference between total parameters and active parameters in MoE models.
Total parameters: all parameters in the model, including every expert, even though the experts are never all active at once. For Mixtral 8x7B, this is ~47B. Active parameters: the parameters actually used for a single token's forward pass - attention weights (always active) plus the top-k expert weights. For Mixtral 8x7B with top-2 routing, this is ~13B. The inference compute cost is proportional to active parameters, not total parameters. Memory requirement is proportional to total parameters (all experts must be loaded into VRAM). This creates a key trade-off: MoE is efficient in compute but expensive in memory.
Q5: How does expert specialization emerge, and what do experts actually learn?
Expert specialization is not explicitly trained - it emerges from the optimization process. Each expert starts with random weights. During training, if a certain type of input is consistently routed to expert 3, the gradients from the next-token prediction loss on those inputs update expert 3's weights to handle that input type better. As expert 3 improves on those inputs, the router's score for sending them to expert 3 increases, reinforcing the routing decision. This creates a feedback loop that leads to specialization. Empirically, experts in production MoE models show statistical specialization by syntax, domain, and language, but the specialization is gradual and overlapping, not clean categorical separation.
