Module 11: Mixture of Experts

The fundamental challenge in scaling dense LLMs is that every parameter participates in every token's forward pass, so capacity and inference cost grow together: a model with twice the parameters costs roughly twice the compute per token. Mixture of Experts (MoE) breaks this coupling. By using many specialized "expert" networks but activating only a small fraction of them per token, MoE models can have vastly more total parameters - and therefore more capacity - while keeping per-token compute roughly constant.

Mixtral 8x7B has about 47 billion total parameters but activates only about 13 billion for any given token. DeepSeek-V3 has 671 billion total parameters but uses about 37 billion active parameters per token. This is how MoE models aim for the quality of a much larger dense model at the per-token compute cost of a much smaller one.

Module Map

Lessons at a Glance

| # | Lesson | What You Will Learn |
|---|---|---|
| 01 | MoE Architecture | Dense vs sparse models, the MoE layer, top-k routing, how experts fit into the transformer |
| 02 | Router Mechanisms | Linear routers, expert choice routing, auxiliary load balancing loss, noisy top-k gating |
| 03 | Sparse vs Dense | Why MoE gives more capacity per FLOP, memory bottlenecks, when to choose each |
| 04 | Training MoE Models | Expert parallelism, token dropping, gradient flow, GShard at 600B parameters |
| 05 | Mixtral Deep Dive | 8x7B architecture, sliding window attention, performance vs. Llama 2 70B |
| 06 | DeepSeek MoE | Fine-grained experts, shared experts, DeepSeek-V2/V3, multi-token prediction |
| 07 | Inference Optimization | Expert caching, CPU offloading, vLLM support, tensor vs. expert parallelism |

Key Concepts

Sparse MoE - a model where each token is processed by only k out of N total expert networks, keeping FLOPs per token low while total parameters remain large.

Router - a learned linear layer that, given a token's representation, computes scores for all experts and selects the top-k to activate.
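The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the function name `top_k_route`, the toy weights, and the choice to apply the softmax only over the selected logits (one common convention; other designs apply it over all experts before selecting) are assumptions made for the example.

```python
import math

def top_k_route(token, router_weights, k):
    """Score every expert for one token and keep the k best.

    token: list of d floats (the token's hidden state).
    router_weights: N rows of d floats, one row per expert
        (a bias-free linear layer; hypothetical toy values below).
    Returns (expert_index, gate_weight) pairs whose weights sum to 1.
    """
    # One logit per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Indices of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for stability).
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

token = [1.0, 0.0, -1.0]
router_weights = [
    [0.5, 0.1, 0.0],   # expert 0 -> logit 0.5
    [2.0, 0.0, 0.0],   # expert 1 -> logit 2.0
    [0.0, 0.3, 1.0],   # expert 2 -> logit -1.0
    [1.0, 0.0, -0.5],  # expert 3 -> logit 1.5
]
routes = top_k_route(token, router_weights, k=2)
```

With these toy values, experts 1 and 3 are selected and expert 1 receives the larger gate weight. Experts 0 and 2 are never evaluated for this token.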

Expert - one of N feedforward networks (FFNs) in the MoE layer. Each expert has the same architecture but different learned weights, specializing in different input patterns.
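To make the role of the experts concrete, the sketch below combines the outputs of the selected experts into a single MoE layer output. It assumes the routing decision is already available as `(expert_index, gate_weight)` pairs, and it uses a tiny two-layer ReLU FFN as the expert, which is a simplification (production models often use gated variants). The helper names `make_expert` and `moe_forward` are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(matrix, v):
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]

def make_expert(w_in, w_out):
    """Build a tiny two-layer FFN expert: w_out @ relu(w_in @ x)."""
    def expert(x):
        return matvec(w_out, relu(matvec(w_in, x)))
    return expert

def moe_forward(x, experts, routes):
    """Weighted sum of the selected experts' outputs.

    routes: (expert_index, gate_weight) pairs produced by a router.
    Experts not listed in routes are never called, which is where
    the per-token compute saving comes from.
    """
    out = [0.0] * len(x)  # assumes experts preserve the input dimension
    for idx, gate in routes:
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Four experts with the same architecture but different weights (toy values):
# expert i applies an identity input layer, then scales by s.
identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    make_expert(identity, [[s, 0.0], [0.0, s]])
    for s in (1.0, 2.0, 3.0, 4.0)
]
x = [1.0, 2.0]
y = moe_forward(x, experts, routes=[(1, 0.75), (3, 0.25)])
```

Here expert 1 doubles the input and expert 3 quadruples it, so the result is `0.75 * [2, 4] + 0.25 * [4, 8] = [2.5, 5.0]`. Only two of the four experts run.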

Load balancing - ensuring tokens are distributed roughly evenly across experts during training. Without it, the router can collapse onto a few experts that receive most of the tokens, leaving the remaining experts undertrained.
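One common way to encourage balance is an auxiliary loss added to the training objective. The sketch below follows the form used in the Switch Transformer paper for top-1 routing, N * sum_i(f_i * P_i); it is offered as an illustration of the idea and is not necessarily the loss used by the models named in this module. The function name and the toy probability tables are assumptions.

```python
def load_balance_loss(router_probs):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    router_probs: one row per token, each row a probability
    distribution over the N experts.
    Returns N * sum_i f_i * P_i, where f_i is the fraction of tokens
    whose top choice is expert i and P_i is the mean probability the
    router assigns to expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    f = [0.0] * num_experts  # fraction of tokens dispatched to each expert
    p = [0.0] * num_experts  # mean router probability per expert
    for probs in router_probs:
        top = max(range(num_experts), key=lambda i: probs[i])
        f[top] += 1.0 / num_tokens
        for i, pi in enumerate(probs):
            p[i] += pi / num_tokens
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Tokens alternate between the two experts: evenly spread load.
balanced = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
# Every token prefers expert 0: the router has collapsed.
collapsed = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]

loss_balanced = load_balance_loss(balanced)
loss_collapsed = load_balance_loss(collapsed)
```

The balanced case evaluates to about 1.0 and the collapsed case to about 1.8, so minimizing this term pushes the router toward an even spread. In real training the dispatch fractions f_i are not differentiable, so the gradient flows through the mean probabilities P_i, and the term is usually scaled by a small coefficient.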

Expert parallelism - a parallelism strategy where different experts live on different devices, enabling MoE models to be served across multiple GPUs/nodes.

Active parameters - the number of parameters actually used for a given forward pass. For MoE, this is much smaller than total parameters (e.g., 13B active out of 47B total for Mixtral 8x7B).
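As a quick check on what "much smaller" means, the snippet below computes the active share for the two models cited in this module, using the rounded parameter counts quoted above. The dictionary and function names are illustrative only.

```python
# (total parameters, active parameters per token), as quoted in this module.
MODELS = {
    "Mixtral 8x7B": (47e9, 13e9),
    "DeepSeek-V3": (671e9, 37e9),
}

def active_fraction(total_params, active_params):
    """Share of the model's parameters used in one token's forward pass."""
    return active_params / total_params

fractions = {name: active_fraction(t, a) for name, (t, a) in MODELS.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

This works out to roughly 27.7% for Mixtral 8x7B and 5.5% for DeepSeek-V3. Mixtral routes each token to 2 of its 8 experts, yet the active share is above 2/8 because the attention layers and embeddings are used by every token. Note that the saving applies to compute, not memory: all parameters still have to be stored to serve the model.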

© 2026 EngineersOfAI. All rights reserved.