01 Module 11: Mixture of Experts
How sparse MoE models achieve massive capacity at lower compute cost - routing mechanisms, load balancing, Mixtral, and DeepSeek's innovations.

02 Mixture of Experts Architecture
The architecture of sparse MoE models - how expert networks replace dense FFN layers, top-k routing, and how parameter count relates to active compute.

03 Router Mechanisms - How Tokens Get Assigned to Experts
The algorithms that decide which experts process which tokens - linear routing, expert choice, auxiliary load-balancing loss, noisy top-k gating, and the Switch Transformer approach.

04 Sparse vs. Dense Models - Trade-offs
Why MoE gives more capacity per FLOP than dense models, the memory vs. compute trade-off, training efficiency, inference complexity, and when to choose each architecture.

05 Training MoE Models
How to train Mixture of Experts models at scale - expert parallelism, capacity factors, token dropping, load imbalance, training instability, and the GShard approach to 600B parameters.

06 Mixtral 8x7B - Architecture Deep Dive
Mistral AI's Mixtral 8x7B architecture - 8 experts with top-2 routing, sliding window attention, multilingual training, performance vs. Llama 2 70B, and serving requirements.

07 DeepSeek MoE Architecture
DeepSeek's innovations in mixture of experts - fine-grained experts, shared experts, DeepSeek-V2 and V3, multi-token prediction, and training for roughly $6M.

08 Inference Optimization for MoE Models
Production techniques for serving MoE models efficiently - expert caching, CPU offloading, vLLM support, tensor vs. expert parallelism, batch size sensitivity, and quantization strategies.
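
To make the routing pattern covered in lessons 02 and 03 concrete, here is a minimal sketch of a sparse MoE feed-forward layer with a linear router and top-k gating, written in PyTorch. It is an illustration under assumed names and dimensions (the class `NaiveMoELayer` is made up for this example), not the implementation used by Mixtral, DeepSeek, or any other model named above.

```python
# Minimal sketch of a sparse MoE feed-forward layer with top-k routing.
# Names and dimensions are illustrative; this is not any model's real code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # A single linear layer scores every token against every expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an ordinary two-layer FFN, like the dense block it replaces.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])               # (batch*seq, d_model)
        logits = self.router(tokens)                      # (tokens, n_experts)
        topk_logits, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_logits, dim=-1)          # renormalize over the chosen experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Find the tokens that picked expert e in any of their top-k slots.
            token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                                  # this expert is idle for the batch
            contrib = weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
            out.index_add_(0, token_idx, contrib)
        return out.reshape(x.shape)

moe = NaiveMoELayer(d_model=64, d_ff=256)
y = moe(torch.randn(2, 10, 64))   # each token is processed by only 2 of the 8 experts
```

Mixtral 8x7B (lesson 06) uses this same top-2-of-8 routing pattern, just at much larger hidden dimensions and with a production-grade expert dispatch instead of the per-expert loop above.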
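Lessons 03 and 05 both mention the auxiliary load-balancing loss. The sketch below follows the Switch Transformer style of formulation, penalizing the product of each expert's dispatched-token fraction and its mean routing probability so that routing collapse onto a few experts is discouraged; the function name and batch shapes are assumptions for the example.

```python
# Sketch of an auxiliary load-balancing loss in the style of the Switch Transformer.
# Function name and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, n_experts: int) -> torch.Tensor:
    probs = F.softmax(router_logits, dim=-1)       # (tokens, n_experts)
    top1 = probs.argmax(dim=-1)                    # hard top-1 assignment per token
    # f_e: fraction of tokens dispatched to each expert.
    f = F.one_hot(top1, n_experts).float().mean(dim=0)
    # P_e: mean router probability assigned to each expert.
    p = probs.mean(dim=0)
    # Equals 1 when both are uniform; grows as routing collapses onto a few experts.
    return n_experts * torch.sum(f * p)

logits = torch.randn(512, 8)                       # 512 tokens, 8 experts
aux = load_balancing_loss(logits, 8)               # added to the LM loss with a small weight
```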
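Lesson 04's capacity-per-FLOP argument reduces to simple arithmetic: stored parameters grow with the number of experts, while per-token compute grows only with top-k. The numbers below are illustrative and chosen only to resemble common MoE layer sizes, not to reproduce any released model.

```python
# Back-of-the-envelope arithmetic for stored vs. active parameters in one MoE layer.
# All dimensions are illustrative, not taken from a specific model.
d_model, d_ff = 4096, 14336
n_experts, top_k = 8, 2

ffn_params = 2 * d_model * d_ff              # one two-matrix FFN expert
total_ffn_params = n_experts * ffn_params    # parameters that must be stored in memory
active_ffn_params = top_k * ffn_params       # parameters actually multiplied per token

print(f"stored: {total_ffn_params / 1e6:.0f}M, active per token: {active_ffn_params / 1e6:.0f}M")
# stored: 940M, active per token: 235M
# -> 8x the parameters of a dense FFN at only 2x its per-token compute,
#    which is the memory-vs-compute trade-off discussed in lesson 04.
```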