Module 11: Mixture of Experts
The fundamental challenge in scaling LLMs is that a dense model uses every parameter for every token, so adding capacity makes the model proportionally more expensive to run. Mixture of Experts (MoE) breaks this relationship. By using many specialized "expert" networks but activating only a small fraction of them per token, MoE models can have vastly more total parameters - and therefore more capacity - while keeping per-token compute roughly constant.
Mixtral 8x7B has about 47 billion total parameters but activates only about 13 billion for any given token. DeepSeek-V3 has 671 billion total parameters but runs on 37 billion active parameters per token. This is how MoE models aim for the capability of a very large model at the per-token compute cost of a much smaller dense one.
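To make the ratio concrete, the short sketch below computes the fraction of parameters that are active per token, using only the rounded figures quoted above. The `active_fraction` helper and the dictionary layout are illustrative names, not part of any library.

```python
# Parameter counts as quoted in the text above, in billions (rounded).
models = {
    "Mixtral 8x7B": {"total_b": 47, "active_b": 13},
    "DeepSeek-V3": {"total_b": 671, "active_b": 37},
}


def active_fraction(total_b, active_b):
    """Share of the model's parameters used to process a single token."""
    return active_b / total_b


for name, p in models.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

Note that this ratio describes per-token compute only. All parameters still have to be stored somewhere, so memory requirements follow the total count, not the active count.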
Module Map
Lessons at a Glance
| # | Lesson | What You Will Learn |
|---|---|---|
| 01 | MoE Architecture | Dense vs sparse models, the MoE layer, top-k routing, how experts fit into the transformer |
| 02 | Router Mechanisms | Linear routers, expert choice routing, auxiliary load balancing loss, noisy top-k gating |
| 03 | Sparse vs Dense | Why MoE gives more capacity per FLOP, memory bottlenecks, when to choose each |
| 04 | Training MoE Models | Expert parallelism, token dropping, gradient flow, GShard at 600B parameters |
| 05 | Mixtral Deep Dive | 8x7B architecture, sliding window attention, performance vs. Llama 2 70B |
| 06 | DeepSeek MoE | Fine-grained experts, shared experts, DeepSeek-V2/V3, multi-token prediction |
| 07 | Inference Optimization | Expert caching, CPU offloading, vLLM support, tensor vs. expert parallelism |
Key Concepts
Sparse MoE - a model where each token is processed by only k out of N total expert networks, keeping FLOPs per token low while total parameters remain large.
Router - a learned linear layer that, given a token's representation, computes scores for all N experts and selects the top-k to activate.
Expert - one of the N feedforward networks (FFNs) in the MoE layer. Each expert has the same architecture but different learned weights, specializing in different input patterns.
Load balancing - ensuring tokens are distributed roughly evenly across experts during training. Without it, the router tends to collapse onto a few experts that receive most of the tokens, while the remaining experts are rarely selected and stay undertrained.
Expert parallelism - a parallelism strategy where different experts live on different devices, enabling MoE models to be served across multiple GPUs/nodes.
Active parameters - the number of parameters actually used to process a given token in a forward pass. For MoE, this is much smaller than total parameters (e.g., 13B active out of 47B total for Mixtral 8x7B).
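The router, expert, and load-balancing concepts above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the implementation of any particular model: the dimensions, the random weights, the ReLU expert FFN, and the choice to apply softmax over only the selected top-k scores are all assumptions made for the example, and every function name here is hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def top_k_route(router_W, x, k):
    """Score all experts with a linear router and keep the k best."""
    scores = matvec(router_W, x)  # one score per expert
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    gates = softmax([scores[i] for i in top])  # normalise over selected experts
    return top, gates


def expert_ffn(expert, x):
    """Two-layer feedforward network with ReLU (assumed for simplicity)."""
    W1, W2 = expert
    hidden = [max(0.0, v) for v in matvec(W1, x)]
    return matvec(W2, hidden)


def moe_layer(router_W, experts, x, k):
    """Run only the k selected experts and mix their outputs by gate weight."""
    top, gates = top_k_route(router_W, x, k)
    out = [0.0] * len(x)
    for i, g in zip(top, gates):
        y = expert_ffn(experts[i], x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_hidden, n_experts, k = 4, 8, 8, 2  # toy sizes: N = 8 experts, k = 2
router_W = rand_matrix(n_experts, d_model, rng)
experts = [
    (rand_matrix(d_hidden, d_model, rng), rand_matrix(d_model, d_hidden, rng))
    for _ in range(n_experts)
]

# Route a batch of random tokens and count how often each expert is chosen.
tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(64)]
counts = [0] * n_experts
for x in tokens:
    _, chosen = moe_layer(router_W, experts, x, k)
    for i in chosen:
        counts[i] += 1

print("tokens routed to each expert:", counts)
```

Each token touches only k of the N expert FFNs, which is where the gap between active and total parameters comes from. The per-expert counts are generally uneven even with an untrained random router, which hints at why training adds an explicit load-balancing term.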
