Why MoE gives more capacity per FLOP than dense models, the memory vs. compute trade-off, training efficiency, inference complexity, and when to choose each architecture.

Sparse vs Dense Models - Trade-offs

The Two Ways to Be Smarter

If you want to build a smarter AI system, there are fundamentally two strategies. The first is to make the system work harder on every input - give it more neurons, more layers, more computation per token. This is the dense approach. The second is to give the system more total knowledge but apply only the relevant part to each input - more specialists, each activated only when relevant. This is the MoE approach.

Neither is unconditionally better. They make different bets about the structure of the problems the model will encounter and about the costs that matter most in your deployment environment.

A medical AI company deploying a model for rare disease diagnosis might desperately want the total capacity of a 700B parameter model, but can only afford the inference cost of a 70B model. MoE gives them the parameters of the former at roughly the compute cost of the latter. A startup deploying a code assistant that processes millions of requests per day might prioritize minimizing per-request cost above all else - a dense 7B model might serve them better than a 47B MoE model, provided its quality is good enough for the task.

This lesson is about understanding those trade-offs deeply enough to make the right architectural choice for your use case.

Why This Matters - The Economics of LLM Deployment

The cost of running an LLM has two major components:

Compute cost (FLOPs): proportional to model size times number of tokens. For MoE, this is active parameters times tokens.
Memory cost (VRAM): determined by total model size (all parameters must be loaded), not active size. For MoE, this is total parameters.

Dense models have equivalent compute and memory costs (both scale with total parameters). MoE models have divergent compute and memory costs: compute scales with active parameters, memory scales with total parameters.

This divergence creates the fundamental MoE trade-off: cheaper to run but more expensive to store.

Why MoE Gives More Capacity Per FLOP

The key empirical insight, formalized in Fedus et al. (2022) and Zoph et al. (2022): for a fixed FLOPs budget, MoE models achieve lower perplexity than dense models.

The argument: with a fixed compute budget per training step, you can either:

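Train one dense model with $d$ parameters (all active)
Train one MoE model with $N$ experts of size $d$ each, so roughly $N \cdot d$ total parameters, of which only about $k \cdot d$ are active per token under top-$k$ routing (with $k = 1$, the same per-token FLOPs as the dense model)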

With the same FLOPs, the MoE model has $N$ times more total capacity. The model can store more knowledge, more diverse representations, and more specialized patterns - it just doesn't apply all of them at once.
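To make the capacity argument concrete, the sketch below counts total and active parameters for a simplified MoE made of a shared part (attention, embeddings, router) plus $N$ experts. The function name `moe_param_counts` and the example sizes are illustrative assumptions, not figures for any specific model.

```python
def moe_param_counts(shared_params: float, expert_params: float,
                     num_experts: int, top_k: int) -> dict:
    """Count total vs. active parameters for a simplified MoE.

    shared_params: parameters every token uses (attention, embeddings, router).
    expert_params: parameters in ONE expert, summed over all layers.
    """
    total = shared_params + num_experts * expert_params
    active = shared_params + top_k * expert_params
    return {
        "total": total,
        "active": active,
        "capacity_per_active_param": total / active,
    }


# Hypothetical sizes in billions, chosen only for illustration.
counts = moe_param_counts(shared_params=2.0, expert_params=5.0,
                          num_experts=8, top_k=2)
print(counts)
```

Because the shared part is always active, the active fraction ends up somewhat larger than $k/N$: here 12B of 42B (about 0.29) rather than 2/8 = 0.25.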

The question is whether all that additional parameter capacity translates to capability. The answer depends heavily on the diversity of the training data. If you're training on data covering many domains, languages, and task types, MoE significantly outperforms dense models at the same compute cost. If you're training on a narrow domain (e.g., only code), the benefits are smaller.

def moe_efficiency_analysis(
    dense_params_B: float,      # Dense model parameter count (billions)
    moe_total_params_B: float,  # MoE total parameters (billions)
    moe_active_params_B: float, # MoE active parameters (billions)
    tokens_per_day: float,      # Training tokens per day (not used in the estimate below)
    gpu_flop_per_second: float = 312e12,  # A100 FP16: ~312 TFLOPS
    gpu_count: int = 64,
) -> dict:
    """
    Compare training efficiency of dense vs MoE models.

    Compute is measured in FLOPs; memory in GB.
    """
    # FLOPs per token (approximate): 6 * parameters for forward+backward
    # (2 * params per forward, 4 * params for backward in standard training)
    flops_per_token_dense = 6 * dense_params_B * 1e9
    flops_per_token_moe = 6 * moe_active_params_B * 1e9  # Only active params

    # Total FLOP budget per day
    total_flops_per_day = gpu_flop_per_second * gpu_count * 86400
    efficiency = 0.45  # ~45% MFU is good for large models

    effective_flops_per_day = total_flops_per_day * efficiency

    # Tokens that can be trained per day
    tokens_dense = effective_flops_per_day / flops_per_token_dense
    tokens_moe = effective_flops_per_day / flops_per_token_moe

    # Memory requirements (FP16 = 2 bytes per parameter)
    memory_dense_GB = dense_params_B * 1e9 * 2 / (1024**3)
    memory_moe_GB = moe_total_params_B * 1e9 * 2 / (1024**3)

    print(f"=== Training Efficiency Comparison ===")
    print(f"Dense model ({dense_params_B}B params):")
    print(f"  FLOPs/token: {flops_per_token_dense:.2e}")
    print(f"  Tokens/day: {tokens_dense:.2e}")
    print(f"  Memory (FP16): {memory_dense_GB:.0f} GB")
    print()
    print(f"MoE model ({moe_total_params_B}B total, {moe_active_params_B}B active):")
    print(f"  FLOPs/token: {flops_per_token_moe:.2e}")
    print(f"  Tokens/day: {tokens_moe:.2e}")
    print(f"  Memory (FP16): {memory_moe_GB:.0f} GB")
    print()
    print(f"=== Trade-off ===")
    print(f"  Compute advantage (MoE processes {tokens_moe/tokens_dense:.1f}x more tokens/day)")
    print(f"  Memory penalty (MoE needs {memory_moe_GB/memory_dense_GB:.1f}x more memory)")

    return {
        "dense_tokens_per_day": tokens_dense,
        "moe_tokens_per_day": tokens_moe,
        "compute_speedup": tokens_moe / tokens_dense,
        "memory_dense_GB": memory_dense_GB,
        "memory_moe_GB": memory_moe_GB,
        "memory_overhead": memory_moe_GB / memory_dense_GB,
    }


# Example: Dense 70B vs Mixtral 8x7B (47B total, 13B active)
moe_efficiency_analysis(
    dense_params_B=70,
    moe_total_params_B=47,
    moe_active_params_B=13,
    tokens_per_day=1e12,
    gpu_count=64,
)

The Compute Equivalence Principle

Fedus et al. (2022) established a key finding: a sparse MoE model costs about as much compute per token as a dense model with a similar active parameter count, but reaches the quality level associated with a larger dense model when both are trained for the same number of steps.

More precisely: a sparse MoE model requires fewer training FLOPs to reach a given quality level compared to a dense model.

This can be visualized as a shift in the scaling curve: the MoE model's quality-vs-FLOPs curve is shifted to the left compared to the dense model, meaning it achieves any given quality level faster.

The magnitude of this shift depends on:

Number of experts: more experts means more total capacity, but with diminishing returns
Top-k routing: more experts active per token gives better quality but reduces the compute savings
Data diversity: MoE benefits most from diverse training data where different experts can genuinely specialize

The Memory Bandwidth Bottleneck

At small batch sizes, dense models at inference time are constrained by memory bandwidth, not compute. The GPU computes fast, but can't read parameters from HBM (High Bandwidth Memory) fast enough to keep the compute units busy.

For a transformer processing a single token:

Forward pass requires reading all model parameters once
Reading 70B parameters at FP16 requires reading 140 GB
A100's HBM bandwidth: ~2 TB/s
Time to read 140 GB: ~70 ms just for memory reads

This is the memory bandwidth bottleneck: inference is slow not because the GPU can't compute fast enough, but because it can't read model weights fast enough.
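The arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes every weight is streamed from HBM exactly once per forward pass and ignores KV-cache reads and any overlap with compute; the helper name `weight_read_time_ms` is hypothetical.

```python
def weight_read_time_ms(params_billion: float,
                        bytes_per_param: float = 2.0,
                        bandwidth_tb_per_s: float = 2.0) -> float:
    """Lower bound on one forward pass: time to stream every weight once."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    seconds = total_bytes / (bandwidth_tb_per_s * 1e12)
    return seconds * 1000.0


dense_70b_ms = weight_read_time_ms(70)          # 70B parameters at FP16
dense_70b_int4_ms = weight_read_time_ms(70, bytes_per_param=0.5)
print(f"FP16: {dense_70b_ms:.1f} ms, 4-bit: {dense_70b_int4_ms:.1f} ms")
```

The same helper shows why quantization helps latency in this regime: fewer bytes per parameter means proportionally less time spent reading weights.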

MoE models change this picture in a specific way. For a single token, only $k$ of the $N$ experts in each layer are activated, so only those experts' weights (plus the shared attention and router weights) have to be read. The router itself is small, but it must run before any expert can, and every expert still has to be resident in memory because the router can select any of them. When experts are spread across devices, each token also has to be sent to wherever its selected experts live.

However, for large batches, MoE models behave much better relative to dense models:

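Dense 70B: processes 1 token or 1000 tokens at the same memory read cost (~70 ms either way)
MoE 47B (13B active): a single token reads only its ~13B active parameters (26 GB at FP16, ~13 ms); a 1000-token batch touches nearly every expert, so reads approach the full 94 GB (~47 ms), still below the dense model's ~70 ms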

At small batch sizes, both dense and MoE are memory-bandwidth-bound. At large batch sizes, compute starts to matter more, and MoE's FLOPs advantage becomes a real throughput advantage.
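How quickly a batch ends up reading every expert can be estimated with a short calculation. The sketch below assumes each token picks its top-$k$ experts uniformly at random and independently of other tokens, which real routers do not do exactly; the function name is hypothetical.

```python
def expected_experts_touched(num_experts: int, top_k: int,
                             batch_tokens: int) -> float:
    """Expected number of distinct experts hit by a batch in one MoE layer.

    Assumes uniform, independent routing: each token selects top_k distinct
    experts, so a given expert is missed by one token with probability
    1 - top_k / num_experts.
    """
    p_miss_one_token = 1.0 - top_k / num_experts
    p_never_hit = p_miss_one_token ** batch_tokens
    return num_experts * (1.0 - p_never_hit)


for batch in (1, 4, 16, 64):
    touched = expected_experts_touched(num_experts=8, top_k=2,
                                       batch_tokens=batch)
    print(f"batch={batch:3d}: ~{touched:.2f} of 8 experts read")
```

Under this assumption, a batch of 16 tokens already touches nearly all 8 experts in a layer, so per-step weight traffic moves from the active parameter count toward the total parameter count as the batch grows.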

def estimate_inference_latency(
    model_params_B: float,
    active_params_B: float,  # For dense models: same as model_params_B
    batch_size: int,
    seq_len: int,
    hbm_bandwidth_TB_s: float = 2.0,  # A100 HBM2e bandwidth
    tflops: float = 312.0,            # A100 FP16 TFLOPS
) -> dict:
    """
    Estimate inference latency for dense vs MoE models.

    In the memory-bandwidth-limited regime (small batch),
    latency is dominated by memory reads.

    In the compute-limited regime (large batch),
    latency is dominated by FLOPs.
    """
    # Memory read time: assumes every parameter is read once per step, whatever the
    # batch size (an upper bound for MoE at batch 1, where only routed experts are read)
    memory_bytes = model_params_B * 1e9 * 2  # FP16 = 2 bytes
    memory_read_time_s = memory_bytes / (hbm_bandwidth_TB_s * 1e12)

    # Compute time: proportional to active parameters * tokens
    flops_per_token = 2 * active_params_B * 1e9  # 2 FLOPs per param per token
    total_tokens = batch_size * seq_len
    total_flops = flops_per_token * total_tokens

    compute_time_s = total_flops / (tflops * 1e12)

    # Inference is bottlenecked by whichever is larger
    total_time_s = max(memory_read_time_s, compute_time_s)
    bottleneck = "memory" if memory_read_time_s > compute_time_s else "compute"

    throughput_tokens_per_sec = total_tokens / total_time_s

    return {
        "memory_read_time_ms": memory_read_time_s * 1000,
        "compute_time_ms": compute_time_s * 1000,
        "total_time_ms": total_time_s * 1000,
        "bottleneck": bottleneck,
        "throughput_tokens_per_sec": throughput_tokens_per_sec,
    }


print("=== Dense 70B Inference ===")
for bs in [1, 8, 32, 128]:
    result = estimate_inference_latency(70, 70, bs, 1)
    print(f"  Batch={bs:4d}: {result['total_time_ms']:.1f}ms, {result['bottleneck']:8s} bound, "
          f"{result['throughput_tokens_per_sec']:.0f} tok/s")

print()
print("=== MoE 47B total / 13B active ===")
for bs in [1, 8, 32, 128]:
    result = estimate_inference_latency(47, 13, bs, 1)
    print(f"  Batch={bs:4d}: {result['total_time_ms']:.1f}ms, {result['bottleneck']:8s} bound, "
          f"{result['throughput_tokens_per_sec']:.0f} tok/s")
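The batch size at which the bottleneck flips from memory to compute can also be solved for directly. The sketch below reuses the same simplified roofline model and the same A100-style constants as the estimate above; `crossover_batch_size` is a hypothetical helper, and the model ignores KV-cache reads, attention FLOPs, and multi-GPU communication.

```python
def crossover_batch_size(total_params_billion: float,
                         active_params_billion: float,
                         bytes_per_param: float = 2.0,
                         bandwidth_tb_per_s: float = 2.0,
                         peak_tflops: float = 312.0) -> float:
    """Tokens per step at which compute time equals weight-read time."""
    read_time_s = (total_params_billion * 1e9 * bytes_per_param
                   / (bandwidth_tb_per_s * 1e12))
    # 2 FLOPs per active parameter per token (forward pass only)
    compute_time_per_token_s = (2 * active_params_billion * 1e9
                                / (peak_tflops * 1e12))
    return read_time_s / compute_time_per_token_s


dense_crossover = crossover_batch_size(70, 70)
moe_crossover = crossover_batch_size(47, 13)
print(f"Dense 70B:     ~{dense_crossover:.0f} tokens per step")
print(f"MoE 47B / 13B: ~{moe_crossover:.0f} tokens per step")
```

In this simplified model, batch sizes up to 128 remain memory-bound for both architectures, and the MoE stays memory-bound up to a larger batch than the dense model. Its throughput edge at those sizes comes from reading fewer total bytes per step; the FLOPs advantage only shows up beyond the crossover point.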

When to Choose Dense vs MoE

Choose Dense When

Single-GPU or memory-constrained deployment: dense models pack more quality per GB of VRAM. A quantized 34B dense model fits in 20 GB and performs very well. A 47B MoE model needs roughly 24 GB at 4-bit and 94 GB at FP16, because every expert must be resident in memory no matter how few are active per token.

Fine-tuning is a priority: fine-tuning MoE models is significantly harder than fine-tuning dense models. The router needs to be updated along with expert weights, and fine-tuning on domain-specific data can cause the router to collapse domain specialization. Dense models fine-tune more predictably.

Low batch size inference: if your deployment processes one request at a time (interactive single-user applications), both dense and MoE are memory-bandwidth-bound. The advantage of MoE's lower active parameters only manifests at larger batches.

Small model size: MoE benefits are largest at scale. A 7B MoE model (e.g., 8x1B) typically underperforms a well-trained dense 7B model. MoE specialization requires sufficient total capacity for experts to develop meaningful differences.

Training on a narrow domain: if your training data is a single domain (code-only, math-only), MoE's advantage from expert specialization is reduced. Dense models may converge faster and to better quality on narrow datasets.

Choose MoE When

Training budget is the constraint: if you have a fixed GPU budget for training and want maximum quality, MoE is almost always better. Same training FLOPs → higher quality model.

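Serving with large batches: in high-throughput serving (many concurrent users, server-side deployment), large batch sizes are natural. Once the batch is large enough that decoding shifts from memory-bound toward compute-bound, MoE's compute advantage becomes real throughput. The exact threshold depends on the hardware and the model.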

Multi-GPU serving is already planned: MoE models naturally map to multi-GPU deployment via expert parallelism. If you're already planning tensor parallelism for a large dense model, switching to expert parallelism for MoE has minimal additional infrastructure cost.

Maximum model quality is the goal: many of the largest publicly described models are MoE. DeepSeek-V3 (671B total, 37B active), Grok-1 (314B MoE), and (reportedly) GPT-4 are MoE models. At the absolute frontier of capability, MoE dominates.

Fine-Tuning MoE Models - The Hard Part

Fine-tuning a pre-trained MoE model is substantially harder than fine-tuning a dense model:

Router drift: when fine-tuning on a new domain, the router can shift its routing decisions dramatically. Tokens that used to go to "language expert A" might now all go to "code expert B" because the fine-tuning data is code. This can break the pre-trained specializations.

Expert imbalance during fine-tuning: fine-tuning datasets are typically much smaller and less diverse than pre-training data. With less diversity, load balancing breaks down quickly.

Memory requirements: fine-tuning requires storing gradients and optimizer states for all expert weights - even the ones that rarely activate on your fine-tuning data. For LoRA-style fine-tuning, this means applying LoRA adapters to all expert FFN weights, which can still be substantial.
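A rough estimate shows how large the full fine-tuning footprint gets. The sketch assumes a commonly used mixed-precision Adam layout of 16 bytes per parameter (FP16 weights and gradients, plus FP32 master weights and two Adam moments) and ignores activations; the helper name and the byte breakdown are assumptions for illustration.

```python
def full_finetune_memory_gb(total_params_billion: float,
                            bytes_weights: int = 2,     # FP16 weights
                            bytes_grads: int = 2,       # FP16 gradients
                            bytes_optimizer: int = 12,  # FP32 master copy + Adam m, v
                            ) -> float:
    """Rough memory for weights + gradients + optimizer state, in GB (1e9 bytes)."""
    bytes_per_param = bytes_weights + bytes_grads + bytes_optimizer
    # billions of parameters * bytes per parameter = gigabytes
    return total_params_billion * bytes_per_param


moe_gb = full_finetune_memory_gb(47)    # MoE: sized by TOTAL parameters
dense_gb = full_finetune_memory_gb(13)  # dense model with the same active size
print(f"MoE 47B total: ~{moe_gb:.0f} GB, dense 13B: ~{dense_gb:.0f} GB")
```

The MoE pays for every expert's gradients and optimizer state even though only about 13B parameters are active per token, which is the main reason parameter-efficient methods are attractive here.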

# Fine-tuning MoE with LoRA: attach adapters to all experts
# (Simplified LoRA implementation for MoE)

import torch
import torch.nn as nn


class LoRAExpertAdapter(nn.Module):
    """
    LoRA adapter for a single expert FFN.
    Adds low-rank adaptation to w1 (w2 would be wrapped the same way).
    """

    def __init__(
        self,
        expert: nn.Module,
        rank: int = 16,
        alpha: float = 32.0,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.expert = expert
        self.rank = rank
        self.scale = alpha / rank

        d_model = expert.w1.in_features
        d_ff = expert.w1.out_features

        # Freeze original expert weights
        for param in expert.parameters():
            param.requires_grad = False

        # LoRA adapters for w1
        self.lora_A_w1 = nn.Linear(d_model, rank, bias=False)
        self.lora_B_w1 = nn.Linear(rank, d_ff, bias=False)
        self.dropout = nn.Dropout(dropout)

        # Initialize A with normal, B with zeros (so delta starts at 0)
        nn.init.normal_(self.lora_A_w1.weight, std=0.02)
        nn.init.zeros_(self.lora_B_w1.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original expert output
        original_output = self.expert(x)

        # LoRA delta for w1
        lora_delta = self.lora_B_w1(
            self.dropout(self.lora_A_w1(x))
        ) * self.scale

        # The full adapted w1 output (simplified - real SwiGLU is more complex)
        # In practice you'd inject lora_delta into the expert's forward pass
        return original_output  # Placeholder - real impl modifies expert internals


def apply_lora_to_moe(moe_layer, rank: int = 16) -> nn.Module:
    """Apply LoRA adapters to all experts in a MoE layer."""
    adapted_experts = nn.ModuleList()
    for expert in moe_layer.experts:
        adapted_experts.append(
            LoRAExpertAdapter(expert, rank=rank)
        )
    moe_layer.experts = adapted_experts

    # Also allow gradients for the router during fine-tuning
    # (router needs to adapt to new data distribution)
    for param in moe_layer.router.parameters():
        param.requires_grad = True

    return moe_layer


# Count trainable parameters (LoRA only)
def count_trainable_params(model: nn.Module) -> tuple:
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable
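The adapter count can be estimated without building a model. The sketch below assumes Mixtral-like shapes (hidden size 4096, FFN size 14336, three projection matrices per expert, 8 experts per layer, 32 layers) and LoRA rank 16; the function names are hypothetical and the shapes should be checked against the actual model config.

```python
def lora_params_per_linear(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to one linear layer."""
    return rank * (d_in + d_out)


def lora_ffn_params(d_model: int, d_ff: int, rank: int,
                    matrices_per_ffn: int, ffns_per_layer: int,
                    num_layers: int) -> int:
    """Total LoRA parameters when every FFN projection gets an adapter."""
    # Each FFN projection maps between d_model and d_ff, so the adapter
    # size is the same in either direction.
    per_matrix = lora_params_per_linear(d_model, d_ff, rank)
    return per_matrix * matrices_per_ffn * ffns_per_layer * num_layers


# Assumed Mixtral-like shapes; verify against the real config before relying on them.
moe_lora = lora_ffn_params(4096, 14336, rank=16, matrices_per_ffn=3,
                           ffns_per_layer=8, num_layers=32)
dense_lora = lora_ffn_params(4096, 14336, rank=16, matrices_per_ffn=3,
                             ffns_per_layer=1, num_layers=32)
print(f"MoE: {moe_lora:,} LoRA params; one FFN per layer: {dense_lora:,}")
```

With these assumptions the MoE needs 8x the FFN adapter parameters of a model with one FFN per layer, which is small next to the base weights but still multiplies the optimizer state for the adapters.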

Comparative Benchmarks

How do sparse MoE and dense models compare on standard LLM benchmarks?

| Model | Type | Total Params | Active Params | MMLU | HumanEval | HellaSwag |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 2 70B | Dense | 70B | 70B | 69.8% | 29.3% | 87.3% |
| Mixtral 8x7B | Sparse MoE | 47B | 13B | 70.6% | 40.2% | 89.1% |
| Llama 3.1 70B | Dense | 70B | 70B | 83.6% | 80.5% | 94.0% |
| Mixtral 8x22B | Sparse MoE | 141B | 39B | 77.8% | 53.1% | 91.0% |
| DeepSeek-V3 | Sparse MoE | 671B | 37B | 88.5% | 89.3% | ~95% |

Key observations:

Mixtral 8x7B matches or slightly exceeds Llama 2 70B with roughly 5x fewer active parameters (13B vs. 70B)
At equal active parameters, MoE generally wins on quality
Very large MoE models (DeepSeek-V3) exceed the quality of dense models with similar per-token compute

:::danger Common Mistake: Comparing Total Params When Choosing a Model

Saying "Mixtral 8x7B is a 47B model, which is smaller than Llama 70B" misframes the comparison. For inference cost, compare active parameters: Mixtral's 13B active vs. Llama's 70B. For memory requirement, compare total parameters: Mixtral's 47B vs. Llama's 70B. For quality, compare empirical benchmarks. Never make architectural decisions based on "total parameters" alone.

:::

:::warning MoE Fine-Tuning Is Not the Same as Dense Fine-Tuning

If you're planning to fine-tune a model on your proprietary data, MoE models require more careful handling. Router drift can degrade pre-trained specializations. LoRA on MoE requires adapters for every expert (significantly more total LoRA parameters than for a dense model of equivalent active size). Start with a dense model if fine-tuning quality is critical and you're not prepared for MoE-specific fine-tuning techniques.

:::

:::tip Quantization + MoE Is an Effective Combination

MoE models quantize well - the per-expert structure means quantization noise is more localized than in dense models. GPTQ or AWQ 4-bit quantization reduces Mixtral 8x7B from ~94 GB to ~24 GB, fitting comfortably on a single 40 GB A100. For production serving of MoE models, 4-bit quantization is almost always worth the modest quality cost.

:::
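The ~94 GB and ~24 GB figures follow from simple arithmetic on the total parameter count. This is a weight-only sketch that ignores the KV cache, activations, and the small overhead of quantization scales and zero points; the helper name is hypothetical.

```python
def weight_footprint_gb(total_params_billion: float,
                        bits_per_param: float) -> float:
    """Weight-only footprint in GB (1e9 bytes) at a given precision."""
    return total_params_billion * bits_per_param / 8


for bits in (16, 8, 4):
    print(f"{bits:2d}-bit: {weight_footprint_gb(47, bits):.1f} GB")
```

Note that the footprint depends on total parameters (47B), not on the 13B that are active per token.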

Interview Questions and Answers

Q1: Why is MoE more compute-efficient than a dense model of the same parameter count?

Because only a fraction of experts activate per token. A dense model with $P$ parameters uses all $P$ parameters for every token forward pass. A MoE model with $P$ total parameters and $k/N$ expert activation ratio uses only about $P \cdot k/N$ parameters per token (slightly more in practice, because attention weights are shared by every token). For Mixtral 8x7B: 47B total parameters, 13B active - each token forward pass uses 13B parameters worth of compute, not 47B. This means you can train an MoE model with 3–4x more total capacity than a dense model for the same compute budget, leading to higher quality at equal training cost.

Q2: What is the memory bandwidth bottleneck and how does it affect MoE inference?

At small batch sizes (typical for interactive, single-user applications), LLM inference is bottlenecked by memory bandwidth - the speed at which GPU VRAM can supply weights to the compute units. The GPU can compute far faster than it can read weights from memory. For a dense 70B model: ~140 GB must be read per forward pass, taking ~70ms at A100's 2 TB/s bandwidth, regardless of batch size. For MoE, all expert weights must be resident in memory, so the footprint is set by total parameters, while compute is set by active parameters. MoE's advantage (lower active parameters) only materializes at larger batch sizes where compute starts to bottleneck rather than memory bandwidth. For single-token interactive serving, both architectures are memory-bandwidth-bound.

Q3: When would you choose a dense model over a MoE model?

Choose dense when: (1) Memory is highly constrained - dense models have more quality per GB of VRAM. (2) Fine-tuning is critical - dense models fine-tune more reliably; MoE fine-tuning risks router drift. (3) Batch size is small - at batch size 1–4, the memory bandwidth bottleneck dominates and MoE's compute advantage doesn't manifest. (4) Training data is narrow-domain - MoE specialization requires diverse data; on narrow domains, dense models may converge better. (5) Model scale is small (under 7B active) - MoE benefits require sufficient scale for expert specialization to emerge meaningfully.

Q4: Explain the concept of compute equivalence in MoE training.

Compute equivalence refers to the empirical finding that a sparse MoE model achieves the quality of a dense model that is significantly larger, when both are trained with the same compute budget (FLOPs). In other words, MoE shifts the quality-vs-FLOPs curve to the left: you achieve a given quality level with fewer training FLOPs using MoE than using a dense model. This happens because the MoE model has more total capacity (more parameters) that can be trained efficiently. The same compute budget that trains a dense model with about 13B parameters to quality Q can train a 47B-total MoE model (13B active) to a higher quality, because the extra experts add capacity without adding per-token FLOPs.

Q5: How does fine-tuning a MoE model differ from fine-tuning a dense model?

Key differences: (1) Router drift - the router may shift dramatically when exposed to new domain data, reassigning tokens from established expert specializations to whatever expert happens to work for the new data. This degrades pre-trained capabilities. Mitigation: freeze router weights or use very low learning rates for the router. (2) Scale of LoRA adapters - applying LoRA to a MoE model requires adapters for every expert FFN, not just one FFN per layer. For Mixtral with 8 experts per layer and 32 layers, that's 256 expert FFNs to adapt. (3) Load imbalance - fine-tuning datasets are small and narrow, causing severe imbalance. The auxiliary loss weights may need to be increased during fine-tuning. (4) Memory requirements - storing gradients and optimizer states for all experts is expensive even with LoRA.
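Router drift and load imbalance can both be monitored by comparing per-expert load on a fixed set of probe tokens before and after fine-tuning. The sketch below uses total variation distance as one possible drift measure; the routing decisions are made-up data and the function names are hypothetical.

```python
from collections import Counter


def expert_load(assignments: list[int], num_experts: int) -> list[float]:
    """Fraction of routed tokens that each expert received."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def total_variation(p: list[float], q: list[float]) -> float:
    """Total variation distance between two load distributions (0 = identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Made-up top-1 routing decisions for the same probe tokens.
before = [0, 1, 2, 3, 0, 1, 2, 3]   # balanced across 4 experts
after = [1, 1, 1, 1, 1, 1, 2, 3]    # expert 1 now absorbs most tokens

load_before = expert_load(before, num_experts=4)
load_after = expert_load(after, num_experts=4)
drift = total_variation(load_before, load_after)
print(load_before, load_after, drift)
```

A drift value that keeps rising during fine-tuning, or a load vector dominated by one expert, is a signal to lower the router learning rate, freeze the router, or raise the auxiliary loss weight.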
