Module 11: Mixture of Experts

The fundamental challenge in scaling LLMs is that, in a dense model, every parameter participates in every token's forward pass, so adding capacity makes the model proportionally more expensive to run. Mixture of Experts (MoE) breaks this relationship. By using many specialized "expert" networks but activating only a small fraction per token, MoE models can have vastly more total parameters - and therefore more capacity - while keeping per-token compute roughly constant.

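Mixtral 8x7B has roughly 47 billion total parameters but activates only about 13 billion for any given token. DeepSeek-V3 has 671 billion total parameters but uses about 37 billion active parameters per token. This is how an MoE model can offer the capacity of a very large model at a per-token compute cost closer to that of a much smaller dense model. All parameters must still be held in memory, so the savings are in compute rather than in memory footprint.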

Module Map

Lessons at a Glance

#	Lesson	What You Will Learn
01	MoE Architecture	Dense vs sparse models, the MoE layer, top-k routing, how experts fit into the transformer
02	Router Mechanisms	Linear routers, expert choice routing, auxiliary load balancing loss, noisy top-k gating
03	Sparse vs Dense	Why MoE gives more capacity per FLOP, memory bottlenecks, when to choose each
04	Training MoE Models	Expert parallelism, token dropping, gradient flow, GShard at 600B parameters
05	Mixtral Deep Dive	8x7B architecture, sliding window attention, performance vs. Llama 2 70B
06	DeepSeek MoE	Fine-grained experts, shared experts, DeepSeek-V2/V3, multi-token prediction
07	Inference Optimization	Expert caching, CPU offloading, vLLM support, tensor vs. expert parallelism

Key Concepts

Sparse MoE - a model where each token is processed by only $k$ out of $N$ total expert networks, keeping FLOPs per token low while total parameters remain large.

Router - a learned linear layer that, given a token's representation, computes scores for all experts and selects the top-$k$ to activate.
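
To make the router definition concrete, here is a minimal sketch of top-$k$ routing in plain Python. The names `route_top_k` and `router_weights` are illustrative, and normalizing the gate weights with a softmax over only the selected logits is one common choice assumed here; other designs apply the softmax over all experts before selecting.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token, router_weights, k):
    """Score every expert with a linear layer, keep the k best.

    token:          list of floats (the token's hidden representation)
    router_weights: N rows, one weight vector per expert (assumed layout)
    Returns a list of (expert_index, gate_weight), highest score first.
    """
    # One logit per expert: dot product of the expert's row with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: gate weights are a softmax over the selected logits only.
    gates = softmax([logits[i] for i in top])
    return list(zip(top, gates))

token = [1.0, 0.0, -1.0]
router_weights = [
    [0.5, 0.1, 0.0],   # expert 0 -> logit 0.5
    [0.0, 0.2, -2.0],  # expert 1 -> logit 2.0
    [-1.0, 0.0, 0.0],  # expert 2 -> logit -1.0
    [1.0, 0.0, 0.0],   # expert 3 -> logit 1.0
]
selected = route_top_k(token, router_weights, k=2)
print(selected)  # experts 1 and 3, with gate weights that sum to 1
```

Only the two selected experts would then run for this token; the other two contribute no compute.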

Expert - one of $N$ feedforward networks (FFNs) in the MoE layer. Each expert has the same architecture but different learned weights, specializing in different input patterns.
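
As a rough illustration of how experts share an architecture but differ in weights, the sketch below builds four tiny FFNs and combines the outputs of the selected ones, weighted by their gate values. The helper names (`make_expert`, `moe_forward`), the toy dimensions, and the hard-coded router output are all assumptions made for the example, not a reference implementation.

```python
def make_expert(w_in, w_out):
    """Build a tiny two-layer FFN: x -> relu(W_in x) -> W_out h."""
    def expert(x):
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return expert

def moe_forward(x, experts, selected):
    """Gate-weighted sum of the outputs of the selected experts only.

    selected: list of (expert_index, gate_weight) from the router.
    Experts that are not selected are never called.
    """
    out = [0.0] * len(x)
    for idx, gate in selected:
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Four experts with the same shape (2 -> 2 -> 2) but different weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    make_expert([[s, 0.0], [0.0, s]], identity)
    for s in (1.0, 2.0, 3.0, 4.0)
]

x = [1.0, 2.0]
selected = [(1, 0.75), (3, 0.25)]  # assumed router output for this token
y = moe_forward(x, experts, selected)
print(y)  # [2.5, 5.0]
```

Expert 1 doubles the input and expert 3 quadruples it, so the result is 0.75 * [2, 4] + 0.25 * [4, 8].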

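Load balancing - ensuring tokens are distributed roughly evenly across experts during training. Without it, the router tends to collapse onto a few experts: those experts receive most of the tokens and keep improving, while the rest receive too few tokens to learn anything useful.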
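
One common way to encourage balance is an auxiliary loss added to the training objective. The sketch below follows the Switch Transformer style formulation, $N \sum_i f_i P_i$, where $f_i$ is the fraction of tokens dispatched to expert $i$ and $P_i$ is the mean router probability for expert $i$. It assumes top-1 routing and omits the scaling coefficient; the function name and inputs are illustrative.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary load balancing loss (Switch Transformer style, unscaled).

    router_probs: one probability vector over experts per token
    assignments:  the expert index each token was dispatched to (top-1)
    """
    n_tokens = len(assignments)
    # f[i]: fraction of tokens actually sent to expert i.
    f = [0.0] * num_experts
    for e in assignments:
        f[e] += 1.0 / n_tokens
    # p[i]: average probability the router gave to expert i.
    p = [sum(probs[i] for probs in router_probs) / n_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced: uniform probabilities, one token per expert.
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], 4)

# Collapsed: every token goes to expert 0 with probability 1.
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], 4)

print(balanced, collapsed)  # 1.0 4.0
```

The loss is at its minimum of 1 when routing is uniform and grows toward $N$ as routing collapses onto a single expert, which gives the router a gradient signal to spread tokens out.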

Expert parallelism - a parallelism strategy where different experts live on different devices, enabling MoE models to be served across multiple GPUs/nodes.
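
The dispatch step behind expert parallelism can be sketched as grouping routed tokens by the device that hosts their expert. The layout below (contiguous blocks of experts per device) and the name `dispatch` are assumptions for illustration; in a real system this grouping is carried out as an all-to-all exchange between devices.

```python
from collections import defaultdict

def dispatch(assignments, num_experts, num_devices):
    """Group (token_id, expert_id) pairs by the device hosting the expert.

    Assumes experts are placed in contiguous blocks: with 8 experts on
    4 devices, experts 0-1 live on device 0, experts 2-3 on device 1, etc.
    """
    experts_per_device = num_experts // num_devices
    buckets = defaultdict(list)
    for token_id, expert_id in assignments:
        device = expert_id // experts_per_device
        buckets[device].append((token_id, expert_id))
    return dict(buckets)

# Three tokens, each routed to two of eight experts (top-2 routing).
assignments = [(0, 1), (0, 6), (1, 3), (1, 2), (2, 7), (2, 0)]
buckets = dispatch(assignments, num_experts=8, num_devices=4)

for device in sorted(buckets):
    print(device, buckets[device])
```

In this toy batch device 2 receives no tokens at all, which shows why uneven routing translates directly into idle hardware.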

Active parameters - the number of parameters actually used to process a single token. For MoE, this is much smaller than total parameters (e.g., 13B active out of 47B total for Mixtral 8x7B), because each token passes through the shared layers plus only its $k$ selected experts.
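
The relationship between total and active parameters can be worked through as back-of-envelope arithmetic. The sketch below assumes a simple split into shared parameters (attention, embeddings, and so on) and per-expert parameters summed over all layers, and plugs in the rounded Mixtral 8x7B figures quoted above with 8 experts and top-2 routing. The resulting split is an estimate derived from those rounded numbers, not an official breakdown.

```python
def param_counts(shared, per_expert, num_experts, k):
    """Total and active parameter counts under the simple split model."""
    total = shared + num_experts * per_expert
    active = shared + k * per_expert
    return total, active

def solve_split(total, active, num_experts, k):
    """Invert param_counts: recover shared and per-expert parameters."""
    per_expert = (total - active) / (num_experts - k)
    shared = active - k * per_expert
    return shared, per_expert

# Rounded figures from the text: 47B total, 13B active, 8 experts, top-2.
shared, per_expert = solve_split(47e9, 13e9, num_experts=8, k=2)
print(f"shared ~ {shared / 1e9:.2f}B, per expert ~ {per_expert / 1e9:.2f}B")

# Round trip back to the original totals.
total, active = param_counts(shared, per_expert, num_experts=8, k=2)
print(f"total ~ {total / 1e9:.0f}B, active ~ {active / 1e9:.0f}B")
```

Under these assumptions each expert accounts for a little under 6B parameters and the shared layers for a little under 2B, which is also why "8x7B" comes to about 47B in total rather than 56B.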
