8 docs tagged with "mixture-of-experts"

DeepSeek MoE Architecture

DeepSeek's innovations in mixture of experts - fine-grained experts, shared experts, DeepSeek-V2 and V3, multi-token prediction, and a reported training cost of about $6M.

Inference Optimization for MoE Models

Production techniques for serving MoE models efficiently - expert caching, CPU offloading, vLLM support, tensor vs. expert parallelism, batch size sensitivity, and quantization strategies.

Mixtral 8x7B - Architecture Deep Dive

Mistral AI's Mixtral 8x7B architecture - 8 experts with top-2 routing, sliding window attention, multilingual training, performance vs. Llama 2 70B, and serving requirements.

Mixture of Experts Architecture

The architecture of sparse MoE models - how expert networks replace dense FFN layers, top-k routing, and how parameter count relates to active compute.

Module 11: Mixture of Experts

How sparse MoE models achieve massive capacity at lower compute cost - routing mechanisms, load balancing, Mixtral, and DeepSeek's innovations.

Router Mechanisms - How Tokens Get Assigned to Experts

The algorithms that decide which experts process which tokens - linear routing, expert choice, auxiliary load balancing loss, noisy top-k gating, and the Switch Transformer approach.

Sparse vs Dense Models - Trade-offs

Why MoE gives more capacity per FLOP than dense models, the memory vs. compute trade-off, training efficiency, inference complexity, and when to choose each architecture.

Training MoE Models

How to train Mixture of Experts models at scale - expert parallelism, capacity factors, token dropping, load imbalance, training instability, and the GShard approach to 600B parameters.