DeepSeek MoE Architecture

The $6 Million Question

In late 2024, DeepSeek announced that they had trained DeepSeek-V3 - a 671 billion parameter model that matched GPT-4o on many benchmarks - for approximately $5.576 million. Not $500 million. Not $50 million. $5.576 million.

The AI industry reacted with a mix of disbelief, excitement, and concern. OpenAI, Anthropic, and Google were widely believed to have spent far more on comparable capabilities. DeepSeek had done it with aggressive engineering, smart architecture choices, and a MoE design that squeezed maximum quality out of every compute dollar.

The core of their efficiency story was a series of MoE innovations that went beyond what Mixtral had done. Fine-grained experts. Shared experts. Multi-token prediction auxiliary objectives. New parallelism strategies. This lesson covers what DeepSeek invented and why it worked.


Why This Exists - The Limits of Standard MoE

The standard MoE approach (as in Mixtral) has well-understood limitations that DeepSeek set out to address:

Expert knowledge sharing is expensive: when multiple tokens benefit from similar information, different experts must independently encode that information. Knowledge that's useful for many types of inputs gets replicated across all experts that receive those inputs. This is wasteful.

Coarse expert granularity: Mixtral has 8 experts, each of which is very large (about 7B parameters). The routing decision is coarse - the model picks 2 of 8 very large, general experts. Finer-grained specialization might be better: 64 small experts instead of 8 large ones.

Router instability with many experts: the more experts you have, the harder it is to maintain stable routing. With 64 or 128 experts, the auxiliary load balancing problem becomes more complex.
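To make the load balancing problem concrete, here is a minimal sketch of a Switch-Transformer-style auxiliary balance loss, $N \sum_i f_i P_i$, in plain Python. It is a generic formulation chosen for illustration, not DeepSeek's exact loss; the function name `load_balance_loss` and the toy router outputs are assumptions.

```python
from collections import Counter


def load_balance_loss(router_probs, top_k):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: one probability list per token (each sums to 1).
    f_i: fraction of routing slots assigned to expert i under top-k selection.
    P_i: mean router probability given to expert i.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    counts = Counter()
    for probs in router_probs:
        # Indices of the top_k most probable experts for this token
        chosen = sorted(range(n_experts), key=lambda i: probs[i], reverse=True)[:top_k]
        counts.update(chosen)
    f = [counts[i] / (n_tokens * top_k) for i in range(n_experts)]
    p = [sum(probs[i] for probs in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Hypothetical router outputs for 4 tokens and 4 experts
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4  # every token prefers expert 0

print(f"balanced:  {load_balance_loss(balanced, top_k=1):.2f}")
print(f"collapsed: {load_balance_loss(collapsed, top_k=1):.2f}")
```

The loss is lowest when tokens spread evenly across experts and grows as routing collapses onto a few of them. With more experts, each one sees a smaller and noisier share of every batch, which is part of why balancing gets harder.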

DeepSeek-MoE (Dai et al., 2024) addressed the knowledge sharing problem. DeepSeek-V2 and V3 scaled the approach to production scale with additional innovations.


The DeepSeek-MoE Core Innovation - Fine-Grained + Shared Experts

DeepSeek's key architectural contributions:

Fine-Grained Experts

Instead of 8 large experts (as in Mixtral), DeepSeek uses many small experts. The total parameter count in the expert pool remains similar, but each expert is smaller and more specialized.

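Standard approach (Mixtral-style): $N = 8$ experts, each with $d_{\text{ff}} = 14336$. Top-$k = 2$ selected per token.

DeepSeek fine-grained approach: $N = 64$ experts, each with $d_{\text{ff}} = 2048$ (one-seventh the size). Top-$k = 6$ selected per token. DeepSeek-MoE's recipe for holding compute constant is to split each expert into $m$ smaller experts and activate $m$ times as many, so the active FFN width $k \times d_{\text{ff}}$ stays fixed. The illustrative numbers here are not exactly compute-matched (6 × 2048 = 12,288 active width versus 2 × 14336 = 28,672), but the principle is the same: the model combines 6 specialized sub-experts instead of 2 general ones.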

The intuition: a token processing a technical English sentence about thermodynamics benefits from:

  1. Technical vocabulary expert
  2. English syntax expert
  3. Physics domain expert
  4. Temperature/heat concepts expert
  5. Formal writing style expert
  6. Scientific reasoning expert

With 8 coarse experts, the model might combine a "scientific English" expert and a "formal reasoning" expert. With 64 fine-grained experts, it can combine six much more specific capabilities.

The finer granularity allows more precise, diversified knowledge utilization.
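The compute-matched version of this trade can be sketched with a few lines of arithmetic. The helper `segment_experts` and the split factor `m = 4` are assumptions for illustration: each expert is split into `m` smaller ones and `m` times as many are activated, so the active FFN width is unchanged while the number of possible expert combinations grows sharply.

```python
from math import comb


def segment_experts(n_experts: int, d_ff: int, top_k: int, m: int):
    """Split each expert into m smaller experts and activate m times as many."""
    assert d_ff % m == 0, "d_ff must be divisible by the split factor"
    return n_experts * m, d_ff // m, top_k * m


coarse = (8, 14336, 2)                 # Mixtral-style: 8 experts, top-2
fine = segment_experts(*coarse, m=4)   # 32 experts of width 3584, top-8

for name, (n, d_ff, k) in [("coarse", coarse), ("fine", fine)]:
    print(
        f"{name}: {n} experts x d_ff={d_ff}, top-{k}, "
        f"active width={k * d_ff}, combinations={comb(n, k):,}"
    )
```

Both configurations activate a width of 28,672 per token, but the fine-grained one can choose among roughly 10.5 million expert subsets instead of 28.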

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional


class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward network: w2(silu(w1(x)) * w3(x))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # down projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # up projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class FineGrainedMoELayer(nn.Module):
    """
    DeepSeek-MoE style fine-grained expert layer.

    Uses many small experts with higher top-k instead of
    few large experts with low top-k.
    """

    def __init__(
        self,
        d_model: int = 4096,
        n_experts: int = 64,               # Many small experts
        top_k: int = 6,                    # Activate more of them
        expert_size_ratio: float = 0.25,   # Each expert is 1/4 the standard FFN size
        n_shared_experts: int = 2,
        shared_expert_ratio: float = 1.0,  # Shared experts are full-size
    ):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        self.n_shared_experts = n_shared_experts

        # Compute dimensions
        d_ff_standard = d_model * 4                             # Standard FFN dimension
        d_ff_expert = int(d_ff_standard * expert_size_ratio)    # Small expert
        d_ff_shared = int(d_ff_standard * shared_expert_ratio)  # Full-size shared

        # Router for routed experts
        self.gate = nn.Linear(d_model, n_experts, bias=False)

        # Many small routed experts
        self.routed_experts = nn.ModuleList(
            [SwiGLUFFN(d_model, d_ff_expert) for _ in range(n_experts)]
        )

        # Shared experts (always active, full-size)
        self.shared_experts = nn.ModuleList(
            [SwiGLUFFN(d_model, d_ff_shared) for _ in range(n_shared_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: [batch, seq_len, d_model]

        Returns:
            output: [batch, seq_len, d_model]
        """
        B, T, D = x.shape
        x_flat = x.reshape(-1, D)  # [B*T, D]; reshape also handles non-contiguous input

        # 1. Always compute shared expert outputs
        shared_output = sum(
            expert(x_flat) for expert in self.shared_experts
        )  # [B*T, d_model]

        # 2. Route to top-k routed experts
        router_logits = self.gate(x_flat)  # [B*T, n_experts]
        routing_weights, selected_experts = torch.topk(
            F.softmax(router_logits, dim=-1),
            self.top_k,
            dim=-1,
        )  # Both [B*T, top_k]

        # Normalize routing weights so the selected experts' weights sum to 1
        routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)

        # Compute routed expert outputs (simple loop; not optimized for speed)
        routed_output = torch.zeros_like(x_flat)

        for expert_idx in range(self.n_experts):
            # [B*T, top_k] mask of the top-k slots that picked this expert
            slot_mask = selected_experts == expert_idx
            token_mask = slot_mask.any(dim=-1)

            if not token_mask.any():
                continue

            expert_out = self.routed_experts[expert_idx](x_flat[token_mask])

            # Top-k indices are distinct per token, so at most one slot matches;
            # summing over slots yields this expert's weight for each token.
            weights = (routing_weights * slot_mask).sum(dim=-1)  # [B*T]

            routed_output[token_mask] += (
                weights[token_mask].unsqueeze(-1) * expert_out
            )

        # 3. Combine: shared + routed
        total_output = shared_output + routed_output

        return total_output.view(B, T, D)

Shared Experts - The Knowledge Hub

The second DeepSeek innovation is shared experts: a small set of expert FFNs that are always activated for every token, regardless of routing.

The motivation: some knowledge is universally useful - basic English grammar, general mathematical reasoning, common sense facts. Standard MoE models encode this knowledge redundantly across all experts (each expert needs it to function correctly). Shared experts encode this common knowledge once, freeing the routed experts to specialize more aggressively.

In DeepSeek-MoE:

  • $K_s$ shared experts: always active, contribute to every token's output
  • $N$ routed experts: standard top-k selection

$$\text{MoE\_output}(x) = \underbrace{\sum_{i=1}^{K_s} E_i^{\text{shared}}(x)}_{\text{shared contribution}} + \underbrace{\sum_{j \in \text{TopK}} g_j \cdot E_j^{\text{routed}}(x)}_{\text{specialized contribution}}$$

DeepSeek-MoE 16B uses $K_s = 2$ shared experts (always active) plus 64 routed experts with top-6 selection. DeepSeek-V2 keeps 2 shared experts and top-6 routing but scales to 160 routed experts.
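A rough parameter count shows how shared and routed experts split a layer's capacity. This is a sketch under stated assumptions: the sizes mirror the defaults of the `FineGrainedMoELayer` example in this lesson (not any released DeepSeek checkpoint), each SwiGLU FFN is counted as three weight matrices, and the helper names are hypothetical.

```python
def swiglu_params(d_model: int, d_ff: int) -> int:
    """A SwiGLU FFN has three bias-free projections: w1, w2, w3."""
    return 3 * d_model * d_ff


def moe_layer_params(
    d_model: int = 4096,
    n_routed: int = 64,
    top_k: int = 6,
    d_ff_routed: int = 4096,    # 0.25 x the standard 4 * d_model
    n_shared: int = 2,
    d_ff_shared: int = 16384,   # full-size shared experts
):
    routed_each = swiglu_params(d_model, d_ff_routed)
    shared_each = swiglu_params(d_model, d_ff_shared)
    router = d_model * n_routed
    total = n_routed * routed_each + n_shared * shared_each + router
    # Per token: only top_k routed experts run, but every shared expert runs
    active = top_k * routed_each + n_shared * shared_each + router
    return total, active


total, active = moe_layer_params()
print(f"total params per layer:  {total:,}")
print(f"active params per token: {active:,} ({active / total:.1%} of total)")
```

With these illustrative sizes, the two always-on shared experts account for more than half of the active parameters, which shows why their size has to be budgeted together with top-k.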


DeepSeek-V2 - Scaling to 236B Parameters

DeepSeek-V2 (2024) applied these innovations at scale:

| Parameter | Value |
| --- | --- |
| Total parameters | 236B |
| Active parameters per token | 21B |
| Experts per MoE layer | 162 (2 shared + 160 routed) |
| Top-k for routed experts | 6 |
| Shared experts | 2 (always active) |
| Context length | 128K tokens |
| Attention | Multi-head Latent Attention (MLA) |

The 128K context window came from another DeepSeek innovation: Multi-head Latent Attention (MLA), which compresses the KV cache through low-rank projection, enabling long context without proportional memory growth.

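Performance: DeepSeek-V2 was competitive with GPT-4-class models on many benchmarks. Compared with the dense DeepSeek 67B, the V2 paper reports 42.5% lower training cost, a 93.3% smaller KV cache, and up to 5.76x higher maximum generation throughput. Training cost: an estimated $5.2 million, in the same range as estimates for Mixtral 8x22B.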


DeepSeek-V3 - The 671B Model That Changed Everything

DeepSeek-V3 represents the current frontier of MoE design efficiency. Released December 2024.

| Parameter | Value |
| --- | --- |
| Total parameters | 671B |
| Active parameters per token | 37B |
| Experts per MoE layer | 257 (1 shared + 256 routed) |
| Top-k for routed experts | 8 |
| Shared experts | 1 (always active) |
| Dense FFN layers | First 3 layers (all later layers use MoE) |
| Context length | 128K tokens |
| Training tokens | 14.8 trillion |
| Reported training cost | ~$5.576M |

The Training Cost Breakdown

DeepSeek-V3's reported training cost covers GPU time for the final training run only, not prior research or ablation experiments. It was low for several reasons:

  1. FP8 mixed precision training: DeepSeek developed custom FP8 (8-bit floating point) training infrastructure that reduces memory and compute compared to BF16/FP16, with careful handling of numerical stability

  2. DualPipe parallelism: a custom pipeline parallelism scheme that overlaps computation and communication more efficiently than standard pipeline parallelism

  3. All-to-all communication optimization: custom kernels for the all-to-all communication in expert dispatch, running on NVLink (GPU-to-GPU) and InfiniBand

  4. Memory optimization without tensor parallelism: the memory footprint was tuned carefully enough to avoid costly tensor parallelism, with recomputation limited to cheap operations such as RMSNorm and the MLA up-projections

  5. Efficient hardware utilization: an estimated ~57% Model FLOP Utilization (MFU) on H800 GPUs, which would be near the top of published MFU numbers
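The headline cost figure is simple arithmetic on GPU hours. The sketch below assumes the breakdown given in the DeepSeek-V3 technical report (2.664M H800 GPU hours for pre-training, 119K for context extension, 5K for post-training) and the report's assumed rental price of $2 per GPU hour; the variable names are illustrative.

```python
# H800 GPU hours by training stage (figures from the V3 technical report)
gpu_hours = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,
}
PRICE_PER_GPU_HOUR = 2.00       # USD, assumed rental price
PRETRAINING_TOKENS_T = 14.8     # trillions of tokens

total_hours = sum(gpu_hours.values())
cost_usd = total_hours * PRICE_PER_GPU_HOUR
hours_per_trillion = gpu_hours["pre-training"] / PRETRAINING_TOKENS_T

print(f"total GPU hours: {total_hours:,}")
print(f"estimated cost:  ${cost_usd:,.0f}")
print(f"pre-training GPU hours per trillion tokens: {hours_per_trillion:,.0f}")
```

The result is a rental-price estimate for the final run, so it should not be read as the full cost of developing the model.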


Multi-Token Prediction - An Auxiliary Training Objective

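DeepSeek-V3 uses Multi-Token Prediction (MTP) as an auxiliary training objective. Standard LLM training predicts only the next token. MTP additionally trains the model to predict tokens further ahead. DeepSeek-V3 itself uses an MTP depth of 1, meaning each position predicts one extra token beyond the next one, and it makes that prediction sequentially so the causal chain is preserved. The simplified code in this lesson instead uses independent heads for the next 1 to 3 tokens.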

The motivation: predicting multiple future tokens requires the model to maintain more information about the global context and forces it to plan ahead. This is similar to how n-gram objectives in early NLP models encouraged the model to understand slightly longer-range dependencies than pure bigram models.

class MultiTokenPredictionHead(nn.Module):
    """
    Auxiliary training objective: predict the next N tokens simultaneously.

    Added on top of the standard next-token prediction head.
    Simplified sketch with independent parallel heads; DeepSeek-V3's own
    MTP modules are sequential and share the embedding and output head.
    """

    def __init__(
        self,
        d_model: int,
        vocab_size: int,
        n_predict_ahead: int = 3,  # Predict next 1, 2, and 3 tokens
    ):
        super().__init__()
        self.n_predict_ahead = n_predict_ahead

        # Separate prediction heads for t+1, t+2, ..., t+N
        # Each is a lightweight transformation + projection
        self.future_heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_model, bias=False),
                nn.GELU(),
                nn.Linear(d_model, vocab_size, bias=False),
            )
            for _ in range(n_predict_ahead)
        ])

    def forward(
        self,
        hidden_states: torch.Tensor,  # [B, T, d_model]
        targets: torch.Tensor,        # [B, T] - token ids
    ) -> torch.Tensor:
        """
        Compute auxiliary MTP loss.

        For each position t, predict tokens t+1, t+2, ..., t+N
        using hidden_states at position t.

        Args:
            hidden_states: Transformer output
            targets: Target token sequence

        Returns:
            Averaged MTP auxiliary loss
        """
        B, T, D = hidden_states.shape
        total_aux_loss = torch.tensor(0.0, device=hidden_states.device)
        n_valid_heads = 0

        for k_ahead in range(self.n_predict_ahead):
            # This head predicts the token at position t + shift
            # from the hidden state at position t
            shift = k_ahead + 1
            if T - shift <= 0:
                continue  # Sequence too short for this head

            # Tokens we're predicting (shifted targets)
            future_targets = targets[:, shift:]  # [B, T - shift]

            # Hidden states we predict from
            prediction_states = hidden_states[:, : T - shift]  # [B, T - shift, D]

            # Predict future tokens
            future_logits = self.future_heads[k_ahead](prediction_states)
            # [B, T - shift, vocab_size]

            # Cross-entropy loss for this prediction head
            loss = F.cross_entropy(
                future_logits.reshape(-1, future_logits.shape[-1]),
                future_targets.reshape(-1),
                reduction='mean',
            )

            total_aux_loss = total_aux_loss + loss
            n_valid_heads += 1

        # Average across prediction heads
        return total_aux_loss / max(n_valid_heads, 1)


def combined_training_loss(
    main_logits: torch.Tensor,    # [B, T, vocab_size]
    mtp_head: MultiTokenPredictionHead,
    hidden_states: torch.Tensor,  # [B, T, d_model]
    targets: torch.Tensor,        # [B, T]
    mtp_weight: float = 0.3,
) -> dict:
    """
    Combine main next-token prediction loss with MTP auxiliary loss.
    """
    # Main next-token prediction loss
    main_loss = F.cross_entropy(
        main_logits[:, :-1].reshape(-1, main_logits.shape[-1]),
        targets[:, 1:].reshape(-1),
        reduction='mean',
    )

    # Auxiliary MTP loss
    mtp_loss = mtp_head(hidden_states, targets)

    # Combined loss
    total_loss = main_loss + mtp_weight * mtp_loss

    return {
        "total_loss": total_loss,          # Tensor: call .backward() on this
        "main_loss": main_loss.item(),     # Floats for logging only
        "mtp_loss": mtp_loss.item(),
    }

MTP improves performance in two ways: (1) it provides richer training signal at each position (the model must plan further ahead), and (2) at inference time, the additional prediction heads can be used for speculative decoding - generating draft tokens rapidly, then verifying them with the main model.
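The speculative decoding benefit can be estimated with a short calculation. This sketch assumes each draft token is accepted independently with the same probability, which is a simplification; the function name and the acceptance rates are illustrative inputs, and real wall-clock gains are lower because drafting and verification have their own overhead.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int = 1) -> float:
    """Expected tokens emitted per verification pass of the main model.

    One token always comes from the main model. Draft token i is kept only
    if drafts 1..i are all accepted, which has probability accept_rate ** i.
    """
    return sum(accept_rate ** i for i in range(n_draft + 1))


for p in (0.5, 0.85, 0.9):
    print(f"accept={p:.2f}: {expected_tokens_per_step(p):.2f} tokens per step")
```

With a single draft token, the upper bound is 2 tokens per step, so the achievable speedup from one MTP head stays below 2x.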


DeepSeek's Parameter Efficiency

How does DeepSeek-V3 match GPT-4 with only 37B active parameters?

The answer is not just the MoE architecture - it's a combination of:

| Factor | Impact |
| --- | --- |
| MoE: 671B total capacity at 37B active cost | Core efficiency |
| Fine-grained experts (256 instead of 8) | Better specialization |
| Shared expert for common knowledge | Less redundancy |
| 14.8T training tokens | Thoroughly trained |
| FP8 training stability | Allowed larger batch sizes |
| MTP auxiliary objective | Better training signal |
| MLA (Multi-head Latent Attention) | Efficient long context |

The compound effect of multiple innovations is what enables the efficiency.
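To put the "active versus total" idea in numbers, the following sketch computes the fraction of parameters each model uses per token. The parameter counts are the rounded figures quoted in this lesson; the dictionary layout is just for illustration.

```python
# (total parameters, active parameters per token), in billions
models = {
    "Mixtral 8x7B": (47, 13),
    "DeepSeek-V2": (236, 21),
    "DeepSeek-V3": (671, 37),
}

active_fraction = {
    name: active / total for name, (total, active) in models.items()
}

for name, frac in active_fraction.items():
    total, active = models[name]
    print(f"{name}: {active}B of {total}B active per token ({frac:.1%})")
```

DeepSeek-V3 touches roughly one parameter in eighteen per token, compared with more than one in four for Mixtral 8x7B.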


Multi-head Latent Attention (MLA) - Efficient Long Context

Beyond the MoE innovations, DeepSeek-V2 introduced Multi-head Latent Attention (MLA) to handle 128K context windows efficiently. This is separate from the MoE design but works synergistically with it.

Standard Multi-Head Attention (MHA) requires a KV cache that grows as:

$$\text{KV\_cache\_size} = 2 \times L \times H_{KV} \times D_{\text{head}} \times T_{\text{context}}$$

where $L$ is the number of layers, $H_{KV}$ is the number of KV heads, $D_{\text{head}}$ is the head dimension, and $T_{\text{context}}$ is the context length. For a 128K context window, this is enormous - tens of gigabytes per user session.

MLA compresses the KV representation through a low-rank joint projection:

  1. The key and value tensors are projected to a low-dimensional "latent" vector
  2. This latent vector is cached (much smaller than full KV)
  3. At attention time, full K and V are recovered from the latent through a learned up-projection

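The latent dimension is typically much smaller than the full KV dimension - for DeepSeek-V2, the latent dimension is 512 while the full K and V together would take 32,768 values per token per layer (2 × 128 heads × 128 head dim). This is a ~64x compression, or about 57x once the small decoupled RoPE key (64 dims) that MLA also caches is included.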
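The compression ratio follows directly from counting cached values per token per layer. The sketch below assumes the same dimensions as the `MultiHeadLatentAttention` defaults in this lesson (128 heads, head dim 128, latent dim 512, RoPE dim 64); the variable names are illustrative.

```python
N_HEADS = 128
D_HEAD = 128
D_LATENT_KV = 512
D_ROPE = 64

# Values cached per token, per layer
mha_values = 2 * N_HEADS * D_HEAD        # full K and V for every head
mla_values = D_LATENT_KV                 # compressed latent only
mla_values_with_rope = D_LATENT_KV + D_ROPE  # latent plus shared RoPE key

print(f"MHA: {mha_values} values per token per layer")
print(f"MLA: {mla_values} values ({mha_values / mla_values:.0f}x smaller)")
print(
    f"MLA + RoPE key: {mla_values_with_rope} values "
    f"({mha_values / mla_values_with_rope:.1f}x smaller)"
)
```

The comparison here is against full multi-head attention; against a GQA baseline with few KV heads, the saving is smaller.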

class MultiHeadLatentAttention(nn.Module):
    """
    Multi-head Latent Attention (MLA) from DeepSeek-V2 (simplified: RoPE omitted).

    Reduces KV cache size through low-rank latent compression.
    Standard MHA KV cache: 2 * L * n_kv_heads * d_head * T elements
    MLA KV cache: L * d_latent * T elements (where d_latent << n_kv_heads * d_head)
    """

    def __init__(
        self,
        d_model: int = 5120,      # DeepSeek-V2 hidden dim
        n_heads: int = 128,       # Query heads
        d_head: int = 128,
        d_latent_kv: int = 512,   # Low-rank latent for KV (huge compression)
        d_latent_q: int = 1536,   # Latent for Q (less critical)
        d_rope: int = 64,         # RoPE head dimension for positional encoding
    ):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_head
        self.d_latent_kv = d_latent_kv

        # Down-projections to latent space
        self.kv_down = nn.Linear(d_model, d_latent_kv, bias=False)
        self.q_down = nn.Linear(d_model, d_latent_q, bias=False)

        # Up-projections from latent to full KV/Q
        self.k_up = nn.Linear(d_latent_kv, n_heads * d_head, bias=False)
        self.v_up = nn.Linear(d_latent_kv, n_heads * d_head, bias=False)
        self.q_up = nn.Linear(d_latent_q, n_heads * (d_head + d_rope), bias=False)

        # Output projection
        self.o_proj = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(
        self,
        x: torch.Tensor,                                 # [B, T, d_model]
        latent_kv_cache: Optional[torch.Tensor] = None,  # [B, T_past, d_latent_kv]
    ) -> tuple:
        B, T, _ = x.shape

        # Compute compressed KV latent for the new tokens
        kv_latent = self.kv_down(x)  # [B, T, d_latent_kv]

        # Append to cache
        if latent_kv_cache is not None:
            kv_latent_all = torch.cat([latent_kv_cache, kv_latent], dim=1)
        else:
            kv_latent_all = kv_latent

        # Recover full K and V from latent (at attention time)
        T_all = kv_latent_all.shape[1]
        k = self.k_up(kv_latent_all).view(B, T_all, self.n_heads, self.d_head)
        v = self.v_up(kv_latent_all).view(B, T_all, self.n_heads, self.d_head)

        # Compute Q
        q_latent = self.q_down(x)  # [B, T, d_latent_q]
        q = self.q_up(q_latent).view(B, T, self.n_heads, -1)

        # Standard attention (simplified)
        k = k.transpose(1, 2)  # [B, n_heads, T_all, d_head]
        v = v.transpose(1, 2)
        q = q[..., :self.d_head].transpose(1, 2)  # Use non-RoPE portion only

        scale = self.d_head ** -0.5
        attn = (q @ k.transpose(-2, -1)) * scale  # [B, n_heads, T, T_all]

        # Causal mask: the query at absolute position i may only see keys <= i
        q_pos = torch.arange(T_all - T, T_all, device=x.device).unsqueeze(-1)
        k_pos = torch.arange(T_all, device=x.device).unsqueeze(0)
        attn = attn.masked_fill(k_pos > q_pos, float("-inf"))

        attn = F.softmax(attn, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, -1)
        out = self.o_proj(out)

        return out, kv_latent_all  # Return the updated latent cache


def compare_kv_cache_sizes():
    """Compare KV cache sizes: MHA vs GQA vs MLA."""
    T_context = 128_000  # 128K tokens
    n_layers = 60
    bytes_per_fp16 = 2

    # Standard MHA (32 heads, 128 dim per head, no sharing)
    mha_size_GB = 2 * n_layers * 32 * 128 * T_context * bytes_per_fp16 / (1024**3)

    # GQA (8 KV heads, as in Mixtral)
    gqa_size_GB = 2 * n_layers * 8 * 128 * T_context * bytes_per_fp16 / (1024**3)

    # MLA (512-dim latent, as in DeepSeek-V2; no factor of 2, K and V share the latent)
    mla_size_GB = n_layers * 512 * T_context * bytes_per_fp16 / (1024**3)

    print(f"KV Cache sizes at 128K context ({n_layers} layers):")
    print(f" MHA (32 heads): {mha_size_GB:.1f} GB")
    print(f" GQA (8 KV heads): {gqa_size_GB:.1f} GB")
    print(f" MLA (512 latent): {mla_size_GB:.1f} GB")
    print(f" MLA vs GQA savings: {(1 - mla_size_GB/gqa_size_GB):.0%} reduction")


compare_kv_cache_sizes()
# Output:
# KV Cache sizes at 128K context (60 layers):
#  MHA (32 heads): 117.2 GB
#  GQA (8 KV heads): 29.3 GB
#  MLA (512 latent): 7.3 GB
#  MLA vs GQA savings: 75% reduction

MLA's 75% KV cache reduction compared to this GQA baseline (and roughly 94% compared to the 32-head MHA baseline) is critical for DeepSeek-V2 and V3's 128K context window. Without it, a 128K context window would require roughly 29 GB of KV cache per concurrent user session even with GQA, making long-context serving economically infeasible at scale.


DeepSeek's Parallelism Innovations - DualPipe

For training DeepSeek-V3, the team developed DualPipe, a custom pipeline parallelism strategy that reduces pipeline bubbles (idle GPU time caused by sequential dependencies).

Standard pipeline parallelism (e.g., GPipe) creates "bubbles" - periods where GPUs are idle waiting for activations from the previous pipeline stage. The bubble fraction is:

$$\text{bubble fraction} = \frac{p - 1}{m + p - 1}$$

where $p$ is the number of pipeline stages and $m$ is the number of micro-batches.
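A quick calculation shows how the bubble fraction responds to the number of micro-batches. The stage and micro-batch counts below are illustrative assumptions, not DeepSeek's actual training configuration.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of a GPipe-style pipeline with p stages and m micro-batches."""
    return (p - 1) / (m + p - 1)


for m in (8, 32, 128):
    print(f"p=8 stages, m={m:>3} micro-batches: {bubble_fraction(8, m):.1%} idle")
```

More micro-batches shrink the bubble, but they also increase activation memory and scheduling complexity, which is the gap that schedules like DualPipe try to close by overlapping work instead.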

DualPipe overlaps forward passes for one micro-batch with backward passes for another micro-batch, reducing effective bubble time. The expert parallelism all-to-all communications are also carefully overlapped with computation. Together these give high hardware utilization: the estimated ~57% Model FLOP Utilization (MFU) would be excellent for a 671B model.


Comparison: Mixtral vs. DeepSeek MoE Approaches

| Aspect | Mixtral 8x7B | DeepSeek-V3 |
| --- | --- | --- |
| Total parameters | 47B | 671B |
| Active parameters | 13B | 37B |
| Number of experts | 8 | 257 (256 routed + 1 shared) |
| Expert size | Large (14336 dim) | Small (2048 dim) |
| Shared experts | None | 1 always-active |
| Top-k | 2 | 8 |
| Auxiliary objective | Standard load balance | MTP + load balance |
| Context | 32K | 128K |
| Training cost | ~$5M (estimated) | ~$5.6M (reported) |
| Quality tier | GPT-3.5 class | GPT-4 class |

:::danger Common Mistake: Assuming More Experts Is Always Better

More experts with fine-grained routing works for DeepSeek because they carefully balanced expert size, top-k, and shared expert capacity. Naively increasing from 8 to 64 experts without adjusting these other parameters often hurts performance - load balancing becomes harder, individual experts receive too little training signal, and routing becomes noisier. The combination of fine-grained + shared experts + appropriate top-k is what makes it work.

:::

:::warning FP8 Training Requires Custom Infrastructure

DeepSeek's FP8 training is one of the keys to their cost efficiency, but FP8 training is not plug-and-play. It requires careful handling of numerical precision issues (some operations need higher precision), custom CUDA kernels, and loss scaling strategies. Don't attempt FP8 training without significant infrastructure investment and expertise. BF16 (the standard today) is stable and well-supported; FP8 is a frontier technique.

:::

:::tip The Shared Expert Insight Is Underrated

DeepSeek's shared expert concept is a simple but powerful idea that should be considered for any MoE implementation. The intuition is clean: some knowledge (basic syntax, common reasoning patterns, general world knowledge) is useful for every token. Encoding this in dedicated always-active experts frees the routed experts to develop cleaner, more specialized representations. The implementation is straightforward - just add a few always-on FFN layers alongside the routing mechanism.

:::


Interview Questions and Answers

Q1: What are fine-grained experts and why does DeepSeek use them instead of the Mixtral approach?

Fine-grained experts are smaller, more specialized expert FFNs. Mixtral uses 8 large experts (each with ~14336-dimensional FFN). DeepSeek uses 64–256 much smaller experts, while activating more of them per token (top-6 or top-8 instead of top-2). The total active compute per token is similar, but fine-grained routing enables more precise combination of specializations. A token processing technical scientific text might benefit from 6 very specific sub-experts (vocabulary, domain, syntax, formality, reasoning type, notation) rather than 2 general experts. Empirically, fine-grained experts with higher top-k outperform coarse experts with lower top-k at equal compute.

Q2: What is the purpose of shared experts in DeepSeek's architecture?

Shared experts are always-active expert FFNs that every token passes through, regardless of routing. Their purpose is to encode knowledge that's universally useful - basic syntax, common reasoning patterns, general world knowledge - in a single location. Without shared experts, this common knowledge must be duplicated across all routed experts, which is wasteful. By centralizing common knowledge in shared experts, the routed experts can focus on specialized knowledge, developing cleaner specializations. This reduces knowledge redundancy and improves parameter efficiency. DeepSeek-V2 uses 2 shared experts; DeepSeek-V3 uses 1 larger shared expert.

Q3: How did DeepSeek train DeepSeek-V3 for approximately $6 million when comparable models cost 10x more?

Several compounding factors: (1) MoE architecture - 671B total parameters but only 37B active per token, so training FLOPs are proportional to 37B, not 671B. (2) FP8 mixed precision - 8-bit floats reduce memory and compute vs. standard BF16, enabling larger effective batch sizes. (3) Custom communication kernels - all-to-all operations for expert dispatch were optimized to minimize overhead. (4) DualPipe parallelism - a custom pipeline parallelism scheme that overlaps compute and communication more efficiently. (5) Careful memory optimization - the footprint was small enough to avoid costly tensor parallelism, with recomputation limited to cheap operations. (6) High hardware utilization - an estimated ~57% MFU. The result: ~2.8 million H800 GPU hours for 14.8T tokens, at roughly $2/GPU-hour. Note that this figure covers the final training run only, not prior research or ablation experiments.

Q4: What is multi-token prediction and what are its benefits?

Multi-token prediction (MTP) is an auxiliary training objective where the model predicts not just the next token but also one or more tokens beyond it, using lightweight extra prediction modules. DeepSeek-V3 uses a depth of 1, so each position predicts the next 2 tokens. Benefits: (1) Training signal quality - predicting multiple future tokens forces the model to maintain more information about upcoming context and plan further ahead, providing richer gradient signal. (2) Speculative decoding - the MTP modules can generate draft tokens rapidly at inference time, which the main model then verifies. The V3 report cites an 85 to 90% acceptance rate for the second predicted token and about 1.8x higher decoding throughput. (3) Better long-range dependencies - predicting token t+2 or t+3 from position t requires capturing dependencies across longer spans, improving the model's long-range reasoning.

Q5: How does DeepSeek-V3 achieve GPT-4 class performance with only 37B active parameters?

It's a combination of factors: (1) 671B total parameters - even at 37B active, the router selects among 671B worth of specialized knowledge, providing enormous capacity. (2) Fine-grained experts + high top-k - selecting 8 of 256 experts provides much more precise knowledge combination than selecting 2 of 8 large experts. (3) Shared expert for common knowledge - reduces redundancy, allowing routed experts to specialize more aggressively. (4) 14.8T training tokens - extremely thoroughly trained, squeezing maximum knowledge into the parameters. (5) MTP auxiliary objective - better training signal at each position. (6) MLA for long context - efficient 128K context without excessive KV cache memory. The combination means each of the 37B active parameters is doing more useful work than the equivalent in a less carefully architected model.


© 2026 EngineersOfAI. All rights reserved.