What is DeepSeek MoE?

DeepSeek's innovations in mixture of experts - fine-grained experts, shared experts, DeepSeek-V2 and V3, multi-token prediction, and training for $6M.

DeepSeek MoE Architecture

The $6 Million Question

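In late 2024, DeepSeek announced that it had trained DeepSeek-V3, a 671-billion-parameter model that matched GPT-4o on many benchmarks, for approximately $5.576 million. Not $500 million. Not $50 million. $5.576 million. That figure covers GPU time for the official training run at an assumed rental price of $2 per H800 GPU-hour; it excludes prior research, ablation experiments, and hardware purchases.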

The AI industry reacted with a mix of disbelief, excitement, and concern. OpenAI, Anthropic, and Google were widely believed to have spent far more on comparable capabilities. DeepSeek had done it with aggressive engineering, smart architecture choices, and a MoE design that squeezed maximum quality out of every compute dollar.

The core of their efficiency story was a series of MoE innovations that went beyond what Mixtral had done. Fine-grained experts. Shared experts. Multi-token prediction auxiliary objectives. New parallelism strategies. This lesson covers what DeepSeek invented and why it worked.

Why This Exists - The Limits of Standard MoE

The standard MoE approach (as in Mixtral) has well-understood limitations that DeepSeek set out to address:

Expert knowledge sharing is expensive: when multiple tokens benefit from similar information, different experts must independently encode that information. Knowledge that's useful for many types of inputs gets replicated across all experts that receive those inputs. This is wasteful.

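Coarse expert granularity: Mixtral 8x7B has 8 experts per layer, each a large FFN with hidden dimension 14336. (Despite the name, the experts are not 7B parameters each: the model totals about 47B parameters because the attention weights are shared across experts.) The routing decision is coarse - the model picks 2 of 8 large, general experts. Finer-grained specialization might be better: 64 small experts instead of 8 large ones.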

Router instability with many experts: the more experts you have, the harder it is to maintain stable routing. With 64 or 128 experts, the auxiliary load balancing problem becomes more complex.
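To make the load balancing problem concrete, here is a minimal sketch of the Switch-Transformer-style auxiliary loss generalized to top-k routing. The function name `load_balance_loss` is hypothetical, and this is not DeepSeek's exact formulation; it only illustrates the quantity a router is usually penalized on.

```python
import torch
import torch.nn.functional as F


def load_balance_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Auxiliary balance loss: n_experts * sum_i(f_i * p_i).

    f_i: fraction of (token, slot) assignments dispatched to expert i
    p_i: mean router probability assigned to expert i
    The value is 1.0 when routing is uniform and grows as routing collapses.
    """
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)               # [tokens, experts]
    _, selected = torch.topk(probs, top_k, dim=-1)         # [tokens, top_k]

    # One-hot dispatch counts per token, summed over the top-k slots
    dispatch = F.one_hot(selected, n_experts).sum(dim=1).float()  # [tokens, experts]

    f = dispatch.mean(dim=0) / top_k   # sums to 1 over experts
    p = probs.mean(dim=0)              # sums to 1 over experts
    return n_experts * torch.sum(f * p)


# Uniform router: every expert gets the same probability
uniform = load_balance_loss(torch.zeros(16, 8), top_k=2)

# Collapsed router: every token strongly prefers expert 0
skewed_logits = torch.zeros(16, 8)
skewed_logits[:, 0] = 10.0
skewed = load_balance_loss(skewed_logits, top_k=2)

print(f"uniform: {uniform.item():.2f}, collapsed: {skewed.item():.2f}")
```

The loss is differentiable only through `p` (the top-k selection itself has no gradient), which is one reason balancing gets harder to tune as the expert count grows.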

DeepSeek-MoE (Dai et al., 2024) addressed the first two problems with fine-grained expert segmentation and shared-expert isolation. DeepSeek-V2 and V3 scaled the approach to production with additional innovations.

The DeepSeek-MoE Core Innovation - Fine-Grained + Shared Experts

DeepSeek's key architectural contributions:

Fine-Grained Experts

Instead of 8 large experts (as in Mixtral), DeepSeek uses many small experts. The total parameter count in the expert pool remains similar, but each expert is smaller and more specialized.

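Standard approach (Mixtral-style): $N = 8$ experts, each with $d_{\text{ff}} = 14336$. Top-$k = 2$ selected per token.

DeepSeek fine-grained approach: $N = 64$ experts, each with $d_{\text{ff}} = 2048$ (about one-seventh the size). Top-$k = 6$ selected per token. With these illustrative numbers the active FFN width per token is 6 × 2048 = 12,288, compared with 2 × 14336 = 28,672 for the Mixtral-style layer, so the two are not compute-matched. A strictly compute-matched split divides each expert into $m$ pieces and multiplies top-$k$ by the same $m$: with $m = 4$, that is 32 experts of $d_{\text{ff}} = 3584$ with top-8, and 8 × 3584 = 28,672. Either way, the model combines several specialized sub-experts instead of 2 general ones.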
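A quick way to check whether a fine-grained configuration is compute-matched to a coarse one is to compare the active FFN width (top-k times the expert hidden dimension). The sketch below uses hypothetical helper names (`split_experts`, `active_ffn_width`) and also counts how many distinct expert combinations the router can choose from, which is the flexibility argument for fine-grained experts.

```python
import math


def active_ffn_width(top_k: int, d_ff: int) -> int:
    """Total FFN hidden width activated per token."""
    return top_k * d_ff


def split_experts(n_experts: int, top_k: int, d_ff: int, m: int):
    """Split every expert into m equal pieces and activate m times as many.

    Keeps total expert parameters and active compute unchanged.
    """
    if d_ff % m != 0:
        raise ValueError("d_ff must be divisible by m")
    return n_experts * m, top_k * m, d_ff // m


coarse = (8, 2, 14336)             # Mixtral-style: 8 experts, top-2
fine = split_experts(*coarse, m=4)  # (32, 8, 3584)

coarse_width = active_ffn_width(coarse[1], coarse[2])
fine_width = active_ffn_width(fine[1], fine[2])
print(f"coarse: {coarse_width}, fine: {fine_width}")  # both 28672

# Number of distinct expert subsets the router can pick per token
print(f"coarse combinations: {math.comb(coarse[0], coarse[1])}")  # 28
print(f"fine combinations:   {math.comb(fine[0], fine[1])}")      # 10518300
```

Same compute, but the router goes from 28 possible expert subsets to more than ten million.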

The intuition: a token processing a technical English sentence about thermodynamics benefits from:

Technical vocabulary expert
English syntax expert
Physics domain expert
Temperature/heat concepts expert
Formal writing style expert
Scientific reasoning expert

With 8 coarse experts, the model might combine a "scientific English" expert and a "formal reasoning" expert. With 64 fine-grained experts, it can combine six much more specific capabilities.

The finer granularity allows more precise, diversified knowledge utilization.

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional


class FineGrainedMoELayer(nn.Module):
    """
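    DeepSeek-MoE style fine-grained expert layer (illustrative sketch).

    Uses many small experts with higher top-k instead of
    few large experts with low top-k. The per-expert loop in forward()
    favors readability over speed, and the full-size shared experts are a
    choice made for this sketch, not DeepSeek's published configuration.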
    """

    def __init__(
        self,
        d_model: int = 4096,
        n_experts: int = 64,    # Many small experts
        top_k: int = 6,         # Activate more of them
        expert_size_ratio: float = 0.25,  # Each expert is 1/4 the standard FFN size
        n_shared_experts: int = 2,
        shared_expert_ratio: float = 1.0,  # Shared experts are full-size
    ):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        self.n_shared_experts = n_shared_experts

        # Compute dimensions
        d_ff_standard = d_model * 4  # Standard FFN dimension
        d_ff_expert = int(d_ff_standard * expert_size_ratio)  # Small expert
        d_ff_shared = int(d_ff_standard * shared_expert_ratio)  # Full-size shared

        # Router for routed experts
        self.gate = nn.Linear(d_model, n_experts, bias=False)

        # Many small routed experts
        self.routed_experts = nn.ModuleList([
            self._make_swiglu_ffn(d_model, d_ff_expert)
            for _ in range(n_experts)
        ])

        # Shared experts (always active, full-size)
        self.shared_experts = nn.ModuleList([
            self._make_swiglu_ffn(d_model, d_ff_shared)
            for _ in range(n_shared_experts)
        ])

    def _make_swiglu_ffn(self, d_model: int, d_ff: int) -> nn.Module:
        """SwiGLU feed-forward network."""
        class SwiGLUFFN(nn.Module):
            def __init__(self):
                super().__init__()
                self.w1 = nn.Linear(d_model, d_ff, bias=False)
                self.w2 = nn.Linear(d_ff, d_model, bias=False)
                self.w3 = nn.Linear(d_model, d_ff, bias=False)

            def forward(self, x):
                return self.w2(F.silu(self.w1(x)) * self.w3(x))

        return SwiGLUFFN()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: [batch, seq_len, d_model]

        Returns:
            output: [batch, seq_len, d_model]
        """
        B, T, D = x.shape
        x_flat = x.reshape(-1, D)  # [B*T, D]; reshape also handles non-contiguous inputs

        # 1. Always compute shared expert outputs
        shared_output = sum(
            expert(x_flat) for expert in self.shared_experts
        )  # [B*T, d_model]

        # 2. Route to top-k routed experts
        router_logits = self.gate(x_flat)  # [B*T, n_experts]
        routing_weights, selected_experts = torch.topk(
            F.softmax(router_logits, dim=-1),
            self.top_k,
            dim=-1,
        )  # Both [B*T, top_k]

        # Normalize routing weights
        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)

        # Compute routed expert outputs
        routed_output = torch.zeros_like(x_flat)

        for expert_idx in range(self.n_experts):
            expert_mask = (selected_experts == expert_idx).any(dim=-1)

            if not expert_mask.any():
                continue

            expert_input = x_flat[expert_mask]
            expert_out = self.routed_experts[expert_idx](expert_input)

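            # Get this expert's routing weight for each token
            # (match the routing dtype so bf16/fp16 inputs do not hit a dtype mismatch)
            weights = torch.zeros(
                x_flat.shape[0], device=x.device, dtype=routing_weights.dtype
            )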
            for k_idx in range(self.top_k):
                k_mask = (selected_experts[:, k_idx] == expert_idx)
                weights[k_mask] += routing_weights[k_mask, k_idx]

            routed_output[expert_mask] += (
                weights[expert_mask].unsqueeze(-1) * expert_out
            )

        # 3. Combine: shared + routed
        total_output = shared_output + routed_output

        return total_output.view(B, T, D)

Shared Experts - The Knowledge Hub

The second DeepSeek innovation is shared experts: a small set of expert FFNs that are always activated for every token, regardless of routing.

The motivation: some knowledge is universally useful - basic English grammar, general mathematical reasoning, common sense facts. Standard MoE models encode this knowledge redundantly across all experts (each expert needs it to function correctly). Shared experts encode this common knowledge once, freeing the routed experts to specialize more aggressively.

In DeepSeek-MoE:

$K_s$ shared experts: always active, contribute to every token's output
$N$ routed experts: standard top-k selection

$\text{MoE\_output}(x) = \underbrace{\sum_{i=1}^{K_s} E_i^{\text{shared}}(x)}_{\text{shared contribution}} + \underbrace{\sum_{j \in \text{TopK}} g_j \cdot E_j^{\text{routed}}(x)}_{\text{specialized contribution}}$

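DeepSeekMoE 16B uses $K_s = 2$ shared experts (always active) plus 64 routed experts with top-6 selection. DeepSeek-V2 keeps the 2 shared experts and scales to 160 routed experts per layer, still with top-6 selection.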

DeepSeek-V2 - Scaling to 236B Parameters

DeepSeek-V2 (2024) applied these innovations at scale:

Parameter	Value
Total parameters	236B
Active parameters per token	21B
Total experts per MoE layer	162 (2 shared + 160 routed)
Top-k for routed experts	6
Shared experts	2 (always active)
Context length	128K tokens
Attention	Multi-head Latent Attention (MLA)

The 128K context window is made practical by another DeepSeek innovation: Multi-head Latent Attention (MLA), which compresses the KV cache through low-rank projection. The cache still grows linearly with context length, but with a much smaller per-token footprint.

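Performance: the DeepSeek-V2 report compares against the dense DeepSeek 67B model and claims 42.5% lower training cost, a 93.3% smaller KV cache, and up to 5.76x higher maximum generation throughput, with benchmark results competitive with leading open models of the time. A training cost of about $5.2 million is sometimes quoted, but it should be treated as an estimate.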

DeepSeek-V3 - The 671B Model That Changed Everything

DeepSeek-V3, released in December 2024, pushed MoE design efficiency further.

Parameter	Value
Total parameters	671B
Active parameters per token	37B
Total experts per MoE layer	257 (1 shared + 256 routed)
Top-k for routed experts	8
Shared experts	1 (always active)
Dense FFN layers	First 3 of 61 layers (the rest are MoE)
Context length	128K tokens
Training tokens	14.8 trillion
Reported training cost	~$5.576M
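The gap between total and active parameters can be sanity-checked with a few lines of arithmetic. This is a rough sketch: the hidden size (7168), expert FFN width (2048), and count of 58 MoE layers are assumptions recalled from the DeepSeek-V3 technical report, and the calculation covers expert FFN weights only.

```python
# Assumed DeepSeek-V3 dimensions (illustrative, not authoritative)
D_MODEL = 7168        # hidden size
D_FF_EXPERT = 2048    # per-expert FFN hidden width
N_ROUTED = 256        # routed experts per MoE layer
N_SHARED = 1          # always-active shared experts
TOP_K = 8             # routed experts activated per token
N_MOE_LAYERS = 58     # 61 layers minus the first 3 dense ones


def swiglu_params(d_model: int, d_ff: int) -> int:
    """A SwiGLU FFN has three weight matrices of size d_model x d_ff."""
    return 3 * d_model * d_ff


per_expert = swiglu_params(D_MODEL, D_FF_EXPERT)

# Every expert counts toward total parameters...
total_expert_params = N_MOE_LAYERS * (N_ROUTED + N_SHARED) * per_expert
# ...but each token only touches top-k routed experts plus the shared one
active_expert_params = N_MOE_LAYERS * (TOP_K + N_SHARED) * per_expert

print(f"per expert:        {per_expert / 1e6:.1f}M")
print(f"all expert params: {total_expert_params / 1e9:.1f}B")
print(f"active per token:  {active_expert_params / 1e9:.1f}B")
```

Under these assumptions the experts account for roughly 656B of the 671B total but only about 23B of the 37B active parameters; the remainder comes from attention, embeddings, and the dense layers.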

The Training Cost Breakdown

DeepSeek-V3's training cost was so low for several reasons:

FP8 mixed precision training: DeepSeek developed custom FP8 (8-bit floating point) training infrastructure that reduces memory and compute compared to BF16/FP16, with careful handling of numerical stability
DualPipe parallelism: a custom pipeline parallelism scheme that overlaps computation and communication more efficiently than standard pipeline parallelism
All-to-all communication optimization: custom kernel for the all-to-all communication in expert dispatch, running on NVLink (GPU-to-GPU) and InfiniBand
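Selective recomputation instead of tensor parallelism: memory optimizations, including recomputing cheap operations such as RMSNorm and the MLA up-projections during the backward pass, let them train without costly tensor parallelism
Efficient hardware utilization: the technical report states about 180K H800 GPU-hours per trillion training tokens; a figure of ~57% Model FLOP Utilization (MFU) is sometimes quoted, but the report itself does not state MFU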
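The headline cost is simple arithmetic over GPU-hours. The sketch below reproduces it from the GPU-hour breakdown in the DeepSeek-V3 technical report as I recall it (pre-training, context extension, post-training), with the report's assumed rental price of $2 per H800 GPU-hour. The 2048-GPU cluster size used for the wall-clock estimate is likewise taken from the report; treat all constants as assumptions to verify.

```python
# GPU-hours by training stage (H800), as recalled from the technical report
GPU_HOURS = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,
}
PRICE_PER_GPU_HOUR = 2.00     # assumed rental price in USD
TRAINING_TOKENS_T = 14.8      # trillions of pre-training tokens
CLUSTER_GPUS = 2048           # assumed cluster size

total_hours = sum(GPU_HOURS.values())
total_cost = total_hours * PRICE_PER_GPU_HOUR

# Throughput and wall-clock time for the pre-training stage
hours_per_trillion = GPU_HOURS["pre-training"] / TRAINING_TOKENS_T
pretrain_days = GPU_HOURS["pre-training"] / CLUSTER_GPUS / 24

print(f"total GPU-hours:        {total_hours:,}")
print(f"total cost:             ${total_cost / 1e6:.3f}M")
print(f"GPU-hours per T tokens: {hours_per_trillion:,.0f}")
print(f"pre-training wall time: {pretrain_days:.1f} days on {CLUSTER_GPUS} GPUs")
```

The cost scales linearly with the assumed GPU-hour price, so a different rental rate shifts the headline number proportionally.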

Multi-Token Prediction - An Auxiliary Training Objective

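DeepSeek-V3 adopted Multi-Token Prediction (MTP) as an auxiliary training objective. Standard LLM training predicts the next single token. MTP additionally trains the model to predict tokens further ahead. DeepSeek-V3 uses a prediction depth of 1 (one extra token beyond the next one) and predicts it with a sequential MTP module that preserves the causal chain. The parallel-head variant of Gloeckle et al. (2024), which predicts several future tokens from independent heads, is the simpler design sketched in the code that follows.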

The motivation: predicting multiple future tokens requires the model to maintain more information about the global context and forces it to plan ahead. This is similar to how n-gram objectives in early NLP models encouraged the model to understand slightly longer-range dependencies than pure bigram models.

class MultiTokenPredictionHead(nn.Module):
    """
    Auxiliary training objective: predict the next N tokens simultaneously.

    Added on top of the standard next-token prediction head.
    This is the simplified parallel-head form of MTP; DeepSeek-V3 itself
    uses sequential MTP modules, so treat this as an illustration only.
    """

    def __init__(
        self,
        d_model: int,
        vocab_size: int,
        n_predict_ahead: int = 3,  # Predict next 1, 2, and 3 tokens
    ):
        super().__init__()
        self.n_predict_ahead = n_predict_ahead

        # Separate prediction heads for t+1, t+2, ..., t+N
        # Each is a lightweight transformation + projection
        self.future_heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_model, bias=False),
                nn.GELU(),
                nn.Linear(d_model, vocab_size, bias=False),
            )
            for _ in range(n_predict_ahead)
        ])

    def forward(
        self,
        hidden_states: torch.Tensor,  # [B, T, d_model]
        targets: torch.Tensor,        # [B, T] - token ids
    ) -> torch.Tensor:
        """
        Compute auxiliary MTP loss.

        For each position t, predict tokens t+1, t+2, ..., t+N
        using hidden_states at position t.

        Args:
            hidden_states: Transformer output
            targets: Target token sequence

        Returns:
            Averaged MTP auxiliary loss
        """
        B, T, D = hidden_states.shape
        total_aux_loss = torch.tensor(0.0, device=hidden_states.device)
        n_valid_heads = 0

        for k_ahead in range(self.n_predict_ahead):
            # Predict token at position t + k_ahead + 1
            # From hidden state at position t
            if T - k_ahead - 1 <= 0:
                continue

            # Tokens we're predicting (shifted targets)
            future_targets = targets[:, k_ahead + 1:]  # [B, T - k_ahead - 1]

            # Hidden states we predict from
            prediction_states = hidden_states[:, :-k_ahead - 1]  # [B, T - k_ahead - 1, D]

            # Predict future tokens
            future_logits = self.future_heads[k_ahead](prediction_states)
            # [B, T - k_ahead - 1, vocab_size]

            # Cross-entropy loss for this prediction head
            loss = F.cross_entropy(
                future_logits.reshape(-1, future_logits.shape[-1]),
                future_targets.reshape(-1),
                reduction='mean',
            )

            total_aux_loss = total_aux_loss + loss
            n_valid_heads += 1

        # Average across prediction heads
        return total_aux_loss / max(n_valid_heads, 1)


def combined_training_loss(
    main_logits: torch.Tensor,    # [B, T, vocab_size]
    mtp_head: MultiTokenPredictionHead,
    hidden_states: torch.Tensor,  # [B, T, d_model]
    targets: torch.Tensor,        # [B, T]
    mtp_weight: float = 0.3,
) -> dict:
    """
    Combine main next-token prediction loss with MTP auxiliary loss.
    """
    # Main next-token prediction loss
    main_loss = F.cross_entropy(
        main_logits[:, :-1].reshape(-1, main_logits.shape[-1]),
        targets[:, 1:].reshape(-1),
        reduction='mean',
    )

    # Auxiliary MTP loss
    mtp_loss = mtp_head(hidden_states, targets)

    # Combined loss
    total_loss = main_loss + mtp_weight * mtp_loss

    return {
        "total_loss": total_loss,
        "main_loss": main_loss.item(),
        "mtp_loss": mtp_loss.item(),
    }

MTP improves performance in two ways: (1) it provides richer training signal at each position (the model must plan further ahead), and (2) at inference time, the MTP module can be reused for speculative decoding: it drafts tokens cheaply, and the main model verifies them. DeepSeek reports an 85–90% acceptance rate for the second predicted token and roughly 1.8x decoding throughput.
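The link between draft acceptance rate and decoding speedup can be estimated with a small calculation. This is a simplified model, assuming each draft token is accepted independently and ignoring the cost of producing drafts; the function name `expected_tokens_per_step` is hypothetical.

```python
from typing import Sequence


def expected_tokens_per_step(acceptance_rates: Sequence[float]) -> float:
    """Expected tokens emitted per verification step of the main model.

    The main model always yields one token. Draft i only counts if it and
    every earlier draft were accepted, so its contribution is the running
    product of acceptance rates.
    """
    expected = 1.0
    p_all_accepted = 1.0
    for p in acceptance_rates:
        if not 0.0 <= p <= 1.0:
            raise ValueError("acceptance rates must be in [0, 1]")
        p_all_accepted *= p
        expected += p_all_accepted
    return expected


# One draft token (MTP depth 1) at the low and high end of an 85-90% range
for p in (0.85, 0.90):
    print(f"acceptance {p:.0%}: {expected_tokens_per_step([p]):.2f} tokens/step")
```

With a single draft token this gives an upper bound of 1.85 to 1.90 tokens per step, consistent with a measured speedup somewhat below that once drafting overhead is included.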

DeepSeek's Parameter Efficiency

How does DeepSeek-V3 match GPT-4 with only 37B active parameters?

The answer is not just the MoE architecture - it's a combination of:

Factor	Impact
MoE: 671B total capacity at 37B active cost	Core efficiency
Fine-grained experts (256 instead of 8)	Better specialization
Shared expert for common knowledge	Less redundancy
14.8T training tokens	Thoroughly trained
FP8 training stability	Allowed larger batch sizes
MTP auxiliary objective	Better training signal
MLA (Multi-head Latent Attention)	Efficient long context

The compound effect of multiple innovations is what enables the efficiency.

Multi-head Latent Attention (MLA) - Efficient Long Context

Beyond the MoE innovations, DeepSeek-V2 introduced Multi-head Latent Attention (MLA) to handle 128K context windows efficiently. This is separate from the MoE design but works synergistically with it.

Standard Multi-Head Attention (MHA) requires a KV cache that grows as:

$\text{KV\_cache\_size} = 2 \times L \times H_{KV} \times D_{\text{head}} \times T_{\text{context}}$

where $L$ is layers, $H_{KV}$ is KV heads, $D_{\text{head}}$ is head dimension, and $T_{\text{context}}$ is context length. For a 128K context window, this is enormous - tens of gigabytes per user session.

MLA compresses the KV representation through a low-rank joint projection:

The key and value tensors are projected to a low-dimensional "latent" vector
This latent vector is cached (much smaller than full KV)
At attention time, full K and V are recovered from the latent through a learned up-projection

The latent dimension is much smaller than the full KV dimension: for DeepSeek-V2, the latent dimension is 512, while full keys and values would take 32,768 values per token per layer (2 × 128 heads × 128 head dim). This is a ~64x compression, or about 57x once the separately cached 64-dimensional RoPE key is included.

class MultiHeadLatentAttention(nn.Module):
    """
    Multi-head Latent Attention (MLA) from DeepSeek-V2.

    Reduces KV cache size through low-rank latent compression.
    Standard MHA KV cache: 2 * L * n_kv_heads * d_head * T cached values
    MLA KV cache: L * d_latent * T cached values (where d_latent << n_kv_heads * d_head)
    """

    def __init__(
        self,
        d_model: int = 5120,         # DeepSeek-V2 hidden dim
        n_heads: int = 128,          # Query heads
        d_head: int = 128,
        d_latent_kv: int = 512,      # Low-rank latent for KV (huge compression)
        d_latent_q: int = 1536,      # Latent for Q (less critical)
        d_rope: int = 64,            # RoPE head dimension for positional encoding
    ):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_head
        self.d_latent_kv = d_latent_kv

        # Down-projections to latent space
        self.kv_down = nn.Linear(d_model, d_latent_kv, bias=False)
        self.q_down = nn.Linear(d_model, d_latent_q, bias=False)

        # Up-projections from latent to full KV/Q
        self.k_up = nn.Linear(d_latent_kv, n_heads * d_head, bias=False)
        self.v_up = nn.Linear(d_latent_kv, n_heads * d_head, bias=False)
        self.q_up = nn.Linear(d_latent_q, n_heads * (d_head + d_rope), bias=False)

        # Output projection
        self.o_proj = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(
        self,
        x: torch.Tensor,            # [B, T, d_model]
        latent_kv_cache: torch.Tensor = None,  # [B, T_past, d_latent_kv]
    ) -> tuple:
        B, T, _ = x.shape

        # Compute compressed KV latent
        kv_latent = self.kv_down(x)  # [B, T, d_latent_kv]

        # Append to cache
        if latent_kv_cache is not None:
            kv_latent_all = torch.cat([latent_kv_cache, kv_latent], dim=1)
        else:
            kv_latent_all = kv_latent

        # Recover full K and V from latent (at attention time)
        T_all = kv_latent_all.shape[1]
        k = self.k_up(kv_latent_all).view(B, T_all, self.n_heads, self.d_head)
        v = self.v_up(kv_latent_all).view(B, T_all, self.n_heads, self.d_head)

        # Compute Q
        q_latent = self.q_down(x)   # [B, T, d_latent_q]
        q = self.q_up(q_latent).view(B, T, self.n_heads, -1)

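        # Standard attention (simplified: RoPE is omitted)
        k = k.transpose(1, 2)  # [B, n_heads, T_all, d_head]
        v = v.transpose(1, 2)
        q = q[..., :self.d_head].transpose(1, 2)  # Use non-RoPE portion

        scale = self.d_head ** -0.5
        attn = (q @ k.transpose(-2, -1)) * scale  # [B, n_heads, T, T_all]

        # Causal mask: the query at absolute position (T_all - T + i)
        # may only attend to keys at positions <= its own
        causal = torch.ones(T, T_all, dtype=torch.bool, device=x.device)
        causal = causal.tril(diagonal=T_all - T)
        attn = attn.masked_fill(~causal, float("-inf"))

        attn = F.softmax(attn, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, -1)
        out = self.o_proj(out)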

        return out, kv_latent  # Return new KV latent for caching


def compare_kv_cache_sizes():
    """Compare KV cache sizes: MHA vs MLA vs GQA."""
    T_context = 128_000  # 128K tokens
    n_layers = 60
    bytes_per_fp16 = 2

    # Standard MHA (32 heads, 128 dim per head, no sharing)
    mha_size_GB = 2 * n_layers * 32 * 128 * T_context * bytes_per_fp16 / (1024**3)

    # GQA (8 KV heads, as in Mixtral)
    gqa_size_GB = 2 * n_layers * 8 * 128 * T_context * bytes_per_fp16 / (1024**3)

    # MLA (512-dim latent, as in DeepSeek-V2)
    mla_size_GB = n_layers * 512 * T_context * bytes_per_fp16 / (1024**3)

    print(f"KV Cache sizes at 128K context ({n_layers} layers):")
    print(f"  MHA (32 heads):     {mha_size_GB:.1f} GB")
    print(f"  GQA (8 KV heads):   {gqa_size_GB:.1f} GB")
    print(f"  MLA (512 latent):   {mla_size_GB:.1f} GB")
    print(f"  MLA vs GQA savings: {(1 - mla_size_GB/gqa_size_GB):.0%} reduction")

compare_kv_cache_sizes()
# Output:
# KV Cache sizes at 128K context (60 layers):
#   MHA (32 heads):     117.2 GB
#   GQA (8 KV heads):   29.3 GB
#   MLA (512 latent):   7.3 GB
#   MLA vs GQA savings: 75% reduction

MLA's ~75% KV cache reduction compared to GQA in this illustrative configuration (and ~94% compared to 32-head MHA) is critical for DeepSeek-V2 and V3's 128K context window. Without it, a 128K context window would require roughly 29 GB of KV cache per concurrent user session even with GQA, making long-context serving economically difficult at scale.

DeepSeek's Parallelism Innovations - DualPipe

For training DeepSeek-V3, the team developed DualPipe, a custom pipeline parallelism strategy that reduces pipeline bubbles (idle GPU time caused by sequential dependencies).

Standard pipeline parallelism (e.g., GPipe) creates "bubbles" - periods where GPUs are idle waiting for activations from the previous pipeline stage. The bubble fraction is:

$\text{bubble fraction} = \frac{p - 1}{m + p - 1}$

where $p$ is the number of pipeline stages and $m$ is the number of micro-batches.

DualPipe overlaps forward passes for one micro-batch with backward passes for another micro-batch, reducing effective bubble time. It also overlaps the expert-parallel all-to-all communication with computation, so most of the communication cost is hidden. A figure of ~57% Model FLOP Utilization (MFU) is sometimes quoted for the 671B model, though the technical report does not state MFU directly.
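The GPipe bubble formula above is easy to explore numerically. This sketch covers only the standard GPipe schedule, not DualPipe itself, whose schedule is more involved; the function name `gpipe_bubble_fraction` is hypothetical.

```python
def gpipe_bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of a GPipe schedule: (p - 1) / (m + p - 1).

    p: number of pipeline stages
    m: number of micro-batches per step
    """
    if p < 1 or m < 1:
        raise ValueError("p and m must be positive integers")
    return (p - 1) / (m + p - 1)


# More micro-batches amortize the fixed fill/drain cost of the pipeline
stages = 16
for micro_batches in (4, 16, 64, 256):
    frac = gpipe_bubble_fraction(stages, micro_batches)
    print(f"p={stages}, m={micro_batches:>3}: bubble = {frac:.1%}")
```

Raising the micro-batch count shrinks the bubble but increases activation memory, which is why schedules that overlap forward and backward work are attractive at this scale.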

Comparison: Mixtral vs. DeepSeek MoE Approaches

Aspect	Mixtral 8x7B	DeepSeek-V3
Total parameters	47B	671B
Active parameters	13B	37B
Number of experts	8	257 (256 routed + 1 shared)
Expert size	Large (14336 dim)	Small (2048 dim)
Shared experts	None	1 always-active
Top-k	2	8
Auxiliary objective	Standard load-balancing loss	MTP + auxiliary-loss-free load balancing
Context	32K	128K
Training cost	~$5M (estimated)	~$5.6M (reported)
Quality tier	GPT-3.5 class	GPT-4 class

:::danger Common Mistake: Assuming More Experts Is Always Better

More experts with fine-grained routing works for DeepSeek because they carefully balanced expert size, top-k, and shared expert capacity. Naively increasing from 8 to 64 experts without adjusting these other parameters often hurts performance - load balancing becomes harder, individual experts receive too little training signal, and routing becomes noisier. The combination of fine-grained + shared experts + appropriate top-k is what makes it work.

:::

:::warning FP8 Training Requires Custom Infrastructure

DeepSeek's FP8 training is one of the keys to their cost efficiency, but FP8 training is not plug-and-play. It requires careful handling of numerical precision issues (some operations need higher precision), custom CUDA kernels, and loss scaling strategies. Don't attempt FP8 training without significant infrastructure investment and expertise. BF16 (the standard today) is stable and well-supported; FP8 is a frontier technique.

:::

:::tip The Shared Expert Insight Is Underrated

DeepSeek's shared expert concept is a simple but powerful idea that should be considered for any MoE implementation. The intuition is clean: some knowledge (basic syntax, common reasoning patterns, general world knowledge) is useful for every token. Encoding this in dedicated always-active experts frees the routed experts to develop cleaner, more specialized representations. The implementation is straightforward - just add a few always-on FFN layers alongside the routing mechanism.

:::

Interview Questions and Answers

Q1: What are fine-grained experts and why does DeepSeek use them instead of the Mixtral approach?

Fine-grained experts are smaller, more specialized expert FFNs. Mixtral uses 8 large experts (each an FFN with hidden dimension 14336). DeepSeek uses 64–256 much smaller experts, while activating more of them per token (top-6 or top-8 instead of top-2). When the split is compute-matched, the total active compute per token stays the same, but fine-grained routing enables a more precise combination of specializations. A token processing technical scientific text might benefit from 6 very specific sub-experts (vocabulary, domain, syntax, formality, reasoning type, notation) rather than 2 general experts. In the DeepSeekMoE paper's ablations, fine-grained experts with higher top-k outperform coarse experts with lower top-k at equal compute.

Q2: What is the purpose of shared experts in DeepSeek's architecture?

Shared experts are always-active expert FFNs that every token passes through, regardless of routing. Their purpose is to encode knowledge that's universally useful - basic syntax, common reasoning patterns, general world knowledge - in a single location. Without shared experts, this common knowledge must be duplicated across the routed experts, which is wasteful. By centralizing common knowledge in shared experts, the routed experts can focus on specialized knowledge. This reduces knowledge redundancy and improves parameter efficiency. DeepSeek-V2 uses 2 shared experts; DeepSeek-V3 uses 1 shared expert.

Q3: How did DeepSeek train DeepSeek-V3 for approximately $6 million when comparable models cost 10x more?

Several compounding factors: (1) MoE architecture - 671B total parameters but only 37B active per token, so training FLOPs are proportional to 37B, not 671B. (2) FP8 mixed precision - 8-bit floats reduce memory and compute vs. standard BF16. (3) Custom communication kernels - all-to-all operations for expert dispatch were optimized to minimize overhead. (4) DualPipe parallelism - a custom pipeline parallelism scheme that overlaps compute and communication more efficiently. (5) Memory optimizations - selective recomputation of cheap operations let them avoid costly tensor parallelism. (6) High hardware utilization - about 180K H800 GPU-hours per trillion tokens. The result: ~2.8 million H800 GPU hours for 14.8T tokens, at an assumed rental price of $2/GPU-hour, which gives the ~$5.6M figure. That number covers the final training run only, not prior research or ablations.

Q4: What is multi-token prediction and what are its benefits?

Multi-token prediction (MTP) is an auxiliary training objective where the model predicts not just the next token but also tokens further ahead. The original formulation (Gloeckle et al., 2024) uses separate lightweight heads for several future tokens; DeepSeek-V3 predicts one additional token with a sequential MTP module. Benefits: (1) Training signal quality - predicting multiple future tokens forces the model to maintain more information about upcoming context and plan further ahead, providing richer gradient signal. (2) Speculative decoding - the MTP module can generate draft tokens at inference time, which the main model then verifies. DeepSeek reports roughly 1.8x decoding throughput from this. (3) Better long-range dependencies - predicting a token several positions ahead from position t requires capturing dependencies across longer spans.

Q5: How does DeepSeek-V3 achieve GPT-4 class performance with only 37B active parameters?

It's a combination of factors: (1) 671B total parameters - even at 37B active, the router selects among 671B worth of specialized knowledge, providing enormous capacity. (2) Fine-grained experts + high top-k - selecting 8 of 256 experts provides much more precise knowledge combination than selecting 2 of 8 large experts. (3) Shared expert for common knowledge - reduces redundancy, allowing routed experts to specialize more aggressively. (4) 14.8T training tokens - extremely thoroughly trained, squeezing maximum knowledge into the parameters. (5) MTP auxiliary objective - better training signal at each position. (6) MLA for long context - efficient 128K context without excessive KV cache memory. The combination means each of the 37B active parameters is doing more useful work than the equivalent in a less carefully architected model.
