Distributed Training - Scaling from One GPU to a Thousand
Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Infra Eng, Research Engineer, Applied Scientist
The Real Interview Moment
You are in a Google Brain (now DeepMind) interview. The interviewer draws a single GPU on the whiteboard with a model that does not fit in memory. She says: "This model has 70 billion parameters. An A100 has 80GB of memory. Walk me through, step by step, how you would train this model across a cluster of 256 GPUs. What parallelism strategies do you use? What are the communication bottlenecks? How do you decide the optimal configuration?"
This question separates ML practitioners who have trained models at scale from those who have only trained on a single GPU. The interviewer is not looking for buzzwords like "data parallelism" or "DeepSpeed." She wants you to reason about memory budgets, communication costs, and the tradeoffs between different parallelism strategies. She wants you to know that a 70B model requires approximately 140GB in FP16 just for weights - before optimizer states, gradients, or activations - and that you need to partition these across GPUs intelligently.
Candidates who can only describe data parallelism get a "lean no-hire." Candidates who can explain the full stack - data, tensor, pipeline parallelism, ZeRO stages, communication patterns, and how to configure them - get a "strong hire."
What You Will Master
- Calculate the memory requirements for training a model of any size
- Compare data parallelism, tensor parallelism, and pipeline parallelism with precise tradeoffs
- Explain ZeRO stages 1, 2, and 3 and the memory savings of each
- Describe AllReduce, ring-allreduce, and gradient compression algorithms
- Apply FSDP (Fully Sharded Data Parallel) and explain how it relates to ZeRO
- Derive scaling laws relating loss to compute, data, and parameters
- Explain Chinchilla optimal training and why it changed LLM training
- Design a training configuration for a given model size and GPU cluster
Self-Assessment: Where Are You Now?
| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Calculate model memory requirements | ___ | |||||
| Explain data parallelism + AllReduce | ___ | |||||
| Explain tensor parallelism | ___ | |||||
| Explain pipeline parallelism | ___ | |||||
| Describe ZeRO stages 1/2/3 | ___ | |||||
| Explain ring-allreduce algorithm | ___ | |||||
| State and interpret scaling laws | ___ | |||||
| Design a multi-GPU training config | ___ |
Target: All 4s and 5s before your interview.
Part 1 - Memory Anatomy: Why One GPU Is Not Enough
Before discussing parallelism, you must understand what consumes GPU memory during training.
Memory Breakdown for Training
For a model with parameters trained with Adam optimizer in mixed precision:
| Component | Memory (bytes) | For 7B params | For 70B params |
|---|---|---|---|
| Model weights (FP16) | 14 GB | 140 GB | |
| Gradients (FP16) | 14 GB | 140 GB | |
| Adam optimizer state: momentum (FP32) | 28 GB | 280 GB | |
| Adam optimizer state: variance (FP32) | 28 GB | 280 GB | |
| Master weights (FP32) | 28 GB | 280 GB | |
| Total (no activations) | 112 GB | 1120 GB | |
| Activations (varies) | ~10-50 GB | ~50-500 GB |
A single A100 (80GB) cannot even hold the weights of a 70B model in FP16 (140GB), let alone the optimizer states. A 7B model fits its weights on one GPU but not the full training state (112GB > 80GB).
"Training memory has four main components: model weights, gradients, optimizer states, and activations. For a model with P parameters using Adam in mixed precision, you need approximately 16P bytes - 2P for FP16 weights, 2P for FP16 gradients, and 12P for FP32 optimizer states and master weights. A 7B model needs ~112GB just for the static state, before activations. A single 80GB A100 cannot fit this, so we need to distribute across multiple GPUs using some combination of data parallelism, tensor parallelism, and pipeline parallelism."
Activation Memory
Activations are the intermediate tensors saved during the forward pass for use in backpropagation. For a transformer with layers, hidden dimension , sequence length , and batch size :
Activation checkpointing (gradient checkpointing) trades compute for memory: discard activations during the forward pass, recompute them during the backward pass. This reduces activation memory from to at the cost of ~33% more compute.
Part 2 - Data Parallelism
The Basic Idea
Replicate the entire model on GPUs. Split each batch into micro-batches. Each GPU processes its micro-batch independently, then all GPUs synchronize gradients.
Requirement: Each GPU must hold the full model, full optimizer state, and gradients. Data parallelism does not reduce per-GPU memory for the model - it only divides the activation memory (smaller micro-batch per GPU).
Scaling: Throughput scales nearly linearly with the number of GPUs, as long as communication does not become the bottleneck.
AllReduce: The Communication Primitive
AllReduce computes the sum (or average) of gradients across all GPUs and distributes the result to every GPU.
Naive AllReduce: Every GPU sends its gradients to every other GPU. Communication cost: . Terrible at scale.
Ring-AllReduce: GPUs are arranged in a logical ring. The algorithm has two phases:
Phase 1 - Scatter-Reduce (N-1 steps): Each GPU sends a chunk of its gradients to the next GPU in the ring. Each GPU accumulates the incoming chunk with its own. After steps, each GPU holds the fully reduced result for one chunk.
Phase 2 - All-Gather (N-1 steps): Each GPU sends its fully reduced chunk around the ring until every GPU has all chunks.
Total communication: Each GPU sends and receives exactly data. The cost is independent of - it does not grow with the number of GPUs.
Do NOT say "AllReduce is O(N) in communication." Ring-allreduce has total per-GPU bandwidth cost of regardless of . The latency (number of steps) is , but the bandwidth cost is constant. Interviewers at Google and Meta expect you to understand this distinction between latency and bandwidth costs.
Gradient Compression
Even with ring-allreduce, communication can be a bottleneck. Gradient compression techniques reduce the data transferred:
| Technique | Compression Ratio | How It Works |
|---|---|---|
| FP16 gradients | 2x | Cast FP32 gradients to FP16 before communication |
| Top-K sparsification | 10-1000x | Only communicate the K largest gradient values |
| Random sparsification | 10-100x | Randomly select gradient values to communicate |
| Quantization (1-bit) | 32x | Send only the sign of each gradient |
| PowerSGD | 10-100x | Low-rank approximation of gradient matrix |
| Error feedback | - | Accumulate compression error locally, add to next step |
Gradient compression is a hot topic at Meta AI and Google Research. Meta's work on 1-bit Adam and Google's work on distributed optimization are frequently discussed. If you are interviewing for infrastructure/systems ML roles, be prepared to discuss the tradeoffs between compression ratio and convergence rate.
Part 3 - Model Parallelism
When a model does not fit on a single GPU even without optimizer states (inference-only), you must split the model across GPUs.
Tensor Parallelism (Intra-layer)
Split individual layers across GPUs. Each GPU computes a portion of the layer's output.
Example: splitting a linear layer where
Split column-wise across 2 GPUs:
GPU 0 computes , GPU 1 computes . The full output is .
For a transformer MLP block (two linear layers with a nonlinearity):
Megatron-LM approach (Shoeybi et al., 2020):
- Split column-wise: GPU computes - no communication needed because GeLU is element-wise
- Split row-wise: GPU computes - each GPU has a partial result
- AllReduce to sum partial results:
Result: One AllReduce per MLP block and one per attention block - total of 2 AllReduces per transformer layer.
Best for: GPUs connected with high-bandwidth interconnects (NVLink: 900 GB/s on H100). Tensor parallelism requires very frequent communication (every layer), so it only works within a single node.
Pipeline Parallelism (Inter-layer)
Assign different layers to different GPUs. GPU 0 has layers 1-10, GPU 1 has layers 11-20, etc.
Naive approach problem - the bubble:
Most GPUs are idle most of the time. With pipeline stages, the bubble fraction is .
GPipe (Huang et al., 2019): Split the mini-batch into micro-batches. Pipeline them through stages so that multiple micro-batches are in flight simultaneously.
Bubble fraction: . With , the bubble becomes negligible.
1F1B schedule (Narayanan et al., 2021 - PipeDream): Interleave forward and backward passes to further reduce the bubble and reduce peak activation memory. Each stage alternates between forward passes on new micro-batches and backward passes on completed micro-batches.
Comparing Parallelism Strategies
| Dimension | Data Parallel | Tensor Parallel | Pipeline Parallel |
|---|---|---|---|
| What is split | Data (batches) | Individual layers | Groups of layers |
| Memory savings | None (model replicated) | Proportional to TP degree | Proportional to PP degree |
| Communication | AllReduce after backward | AllReduce every layer | Point-to-point between stages |
| Bandwidth needs | Moderate | Very high (NVLink) | Low |
| Latency sensitivity | Low | High | Moderate |
| Bubble overhead | None | None | |
| Ideal placement | Across nodes | Within node (NVLink) | Across nodes |
Never say "just use data parallelism" for a model that does not fit in a single GPU's memory. Data parallelism replicates the model - it makes the memory problem worse (each GPU needs the full model plus activations). If the model does not fit, you must use model parallelism (tensor or pipeline) or ZeRO to shard the state.
Part 4 - ZeRO: Eliminating Redundancy
The Redundancy Problem in Data Parallelism
In standard data parallelism, every GPU holds a complete copy of:
- Model weights ( bytes in FP16)
- Gradients ( bytes)
- Optimizer states ( bytes for Adam with FP32 master weights)
With 64 GPUs, you have 64 copies of everything - only the activations differ. ZeRO (Zero Redundancy Optimizer) by Rajbhandari et al. (2020) eliminates this redundancy.
ZeRO Stage 1: Partition Optimizer States
Each GPU stores only of the optimizer states. After computing gradients, each GPU:
- Reduces gradients for its partition via reduce-scatter
- Updates its partition of the optimizer states
- Broadcasts updated weights via all-gather
Memory savings: Optimizer states drop from to per GPU. Communication: Same as data parallelism (AllReduce = reduce-scatter + all-gather).
ZeRO Stage 2: Partition Gradients
In addition to Stage 1, each GPU stores only of the gradients.
Memory savings: Gradients drop from to . Combined with Stage 1, static memory drops from to .
Communication: Same as data parallelism - reduce-scatter suffices since each GPU only needs its own gradient partition.
ZeRO Stage 3: Partition Parameters
Each GPU stores only of the model parameters. When a layer needs the full parameter for forward or backward pass, it performs an all-gather to reconstruct the parameter, uses it, then discards the non-local portion.
Memory savings: Everything is partitioned. Per-GPU memory: .
Communication cost: Additional all-gather operations during forward and backward passes. Total communication increases by 1.5x compared to standard data parallelism.
FSDP: PyTorch's ZeRO Stage 3
Fully Sharded Data Parallel (FSDP) is PyTorch's native implementation of ZeRO Stage 3. Key features:
- Shards parameters, gradients, and optimizer states across all GPUs
- Uses all-gather to reconstruct parameters on-demand during forward/backward
- Supports mixed precision with per-parameter sharding
- Can be combined with activation checkpointing
- Simpler API than DeepSpeed for PyTorch users
FSDP vs DeepSpeed ZeRO:
| Feature | FSDP (PyTorch) | DeepSpeed ZeRO |
|---|---|---|
| Framework | Native PyTorch | Library on top of PyTorch |
| ZeRO stages | Stage 3 equivalent | Stage 1, 2, 3, 3+ (offload) |
| CPU offloading | Limited | Full support (ZeRO-Infinity) |
| NVMe offloading | No | Yes (ZeRO-Infinity) |
| Ease of use | Better for PyTorch users | More configuration options |
| Mixed precision | Native AMP integration | Own mixed precision pipeline |
| Community | Growing | Mature |
When asked about distributed training, interviewers want to see that you can reason about when to use each strategy, not just what they are. A strong answer: "For a 7B model on 8 A100s, I'd use ZeRO Stage 2 - it partitions optimizer states and gradients, bringing per-GPU memory from 112GB to about 30GB, which fits in 80GB A100. I wouldn't need Stage 3 because the weights alone (14GB) fit on each GPU. For a 70B model, I'd use ZeRO Stage 3 combined with tensor parallelism within each node."
Part 5 - DeepSpeed: The Full Stack
DeepSpeed (Microsoft) provides a comprehensive distributed training framework:
Key Components
| Component | What It Does |
|---|---|
| ZeRO Stage 1/2/3 | Partition optimizer/gradient/parameter memory |
| ZeRO-Offload | Offload optimizer states to CPU RAM |
| ZeRO-Infinity | Offload to CPU RAM + NVMe SSD |
| 3D Parallelism | Combine data + tensor + pipeline parallelism |
| Sparse Attention | Efficient attention for long sequences |
| Mixture of Experts | MoE training support |
| Compression | Gradient and communication compression |
ZeRO-Infinity: Training on a Budget
ZeRO-Infinity offloads parameters, gradients, and optimizer states to CPU memory or even NVMe SSDs. This allows training models that are orders of magnitude larger than GPU memory:
- GPU memory: Used only for active computations (forward/backward of current layer)
- CPU memory: Stores parameters, gradients, optimizer states that are not currently needed
- NVMe SSD: Overflow from CPU memory
Tradeoff: Significantly slower due to PCIe bandwidth bottleneck (CPU-GPU: 32 GB/s on PCIe 4.0 vs GPU-GPU NVLink: 900 GB/s on H100). Training throughput can drop by 10-50x compared to all-GPU training.
Part 6 - 3D Parallelism: Combining Everything
For training the largest models (100B+ parameters), you combine all three parallelism strategies:
Rule of thumb for assigning parallelism:
- Tensor parallelism: Within a node (NVLink bandwidth). Degree = 2, 4, or 8 (number of GPUs per node).
- Pipeline parallelism: Across nodes (lower bandwidth OK). Degree = number of pipeline stages.
- Data parallelism: Across the remaining GPUs. Degree = total GPUs / (TP degree x PP degree).
Example: 70B model on 256 A100 GPUs (32 nodes x 8 GPUs/node):
- TP = 4 (split each layer across 4 GPUs within a node)
- PP = 8 (8 pipeline stages across 8 groups of nodes)
- DP = 256 / (4 x 8) = 8 (8-way data parallelism)
- Effective batch size = DP x micro-batch size x gradient accumulation steps
Part 7 - Scaling Laws
The Kaplan Scaling Laws (OpenAI, 2020)
Kaplan et al. discovered that language model loss follows power laws:
Where:
- = number of parameters (excluding embeddings)
- = dataset size (tokens)
- = compute budget (FLOPs)
- , ,
Key insights:
- Loss decreases as a power law with N, D, and C - no diminishing returns within the studied range
- Larger models are more sample-efficient - they achieve the same loss with less data per parameter
- Model size matters more than dataset size for a fixed compute budget (Kaplan's conclusion - later challenged by Chinchilla)
The Chinchilla Scaling Laws (Hoffmann et al., 2022)
DeepMind's Chinchilla paper challenged Kaplan's conclusion. They found:
Kaplan conclusion: For a fixed compute budget, scale the model as large as possible and use whatever data fits.
Chinchilla conclusion: For a fixed compute budget, scale model size and dataset size equally. The optimal allocation is approximately:
The rule of thumb: train on approximately 20 tokens per parameter.
| Model | Parameters | Kaplan-optimal tokens | Chinchilla-optimal tokens | Actual tokens |
|---|---|---|---|---|
| GPT-3 | 175B | 300B (what they used) | 3.5T | 300B |
| Chinchilla | 70B | - | 1.4T | 1.4T |
| Llama 2 | 70B | - | 1.4T | 2T (overtrained) |
| Llama 3 | 70B | - | 1.4T | 15T (massively overtrained) |
The Chinchilla paper is one of the most important results in LLM training. When asked about scaling laws, an interviewer at a top AI lab expects you to know: (1) power law relationship between loss and compute/data/params, (2) Chinchilla's "20 tokens per parameter" rule, (3) why recent models like Llama 3 are deliberately "overtrained" relative to Chinchilla - inference cost matters, because a smaller model trained on more data is cheaper to deploy than a larger model trained on less data.
Beyond Chinchilla: Inference-Aware Scaling
Chinchilla optimizes for training compute - minimizing loss for a fixed training FLOP budget. But in practice, inference cost matters too. A model that runs 1 billion inference calls should be optimized differently than one that runs 1000.
Inference-aware scaling trades more training compute (overtrain a smaller model) for lower inference cost:
This is why Llama 3 (70B, 15T tokens) is trained far beyond Chinchilla-optimal - the extra training cost is amortized over billions of inference calls.
Do NOT say "Chinchilla proved that GPT-3 was too large." Chinchilla showed that GPT-3 was undertrained for its size - it should have been trained on more data, not that it should have been smaller. The implication is that for 300B tokens of training data, the optimal model is ~10B parameters (not 175B). But if you have enough data, a 175B model will eventually outperform a 10B model.
The Compute Formula
For transformer language models, the compute required for training is approximately:
Where is in FLOPs, is number of parameters, and is number of training tokens. The factor of 6 comes from: 2 FLOPs per multiply-add in the forward pass per parameter per token, times 3 (forward + backward, where backward is approximately 2x forward).
This formula is essential for compute budgeting. To train a 70B model on 2T tokens:
On 1024 A100 GPUs at 40% MFU (Model FLOPs Utilization):
Part 8 - Practical Guide: Scaling from 1 GPU to 1000
Decision Tree
Step-by-Step Scaling Guide
Level 1: 1 GPU
- Standard single-GPU training
- Mixed precision (BF16) to double effective memory
- Gradient accumulation for larger effective batch size
- Activation checkpointing if memory-tight
Level 2: 2-8 GPUs (single node)
- PyTorch DDP (DistributedDataParallel)
- NVLink interconnect for fast communication
- Linear throughput scaling expected
- If model does not fit: ZeRO Stage 2 (FSDP)
Level 3: 8-64 GPUs (multi-node)
- DDP across nodes with InfiniBand/RoCE
- Communication becomes significant - overlap gradient communication with backward computation
- Consider ZeRO Stage 2/3 for memory-constrained setups
- Gradient compression if network bandwidth is limited
Level 4: 64-512 GPUs
- Tensor parallelism within nodes (TP=4 or TP=8)
- Data parallelism across nodes
- Careful batch size scaling with learning rate warmup
- Monitor GPU utilization - aim for >50% MFU
Level 5: 512-10000+ GPUs
- Full 3D parallelism (TP + PP + DP)
- Pipeline parallelism across node groups
- Failure recovery: checkpoint frequently, use elastic training
- Network topology awareness: place communicating GPUs on same switch
- Typically only frontier model training (GPT-4, Gemini, Llama 3 scale)
Model FLOPs Utilization (MFU)
MFU measures how efficiently you use the GPU's theoretical peak FLOPS:
| Setup | Typical MFU |
|---|---|
| Single GPU, well-optimized | 50-60% |
| 8 GPUs, DDP within node | 45-55% |
| 64 GPUs, multi-node DDP | 35-50% |
| 256 GPUs, 3D parallelism | 30-45% |
| 1000+ GPUs, frontier training | 25-40% |
Why MFU is never 100%: Communication overhead, pipeline bubbles, memory bandwidth bottlenecks, kernel launch overhead, idle time during synchronization.
Practice Problems
Problem 1: Memory Budget Calculation
You have a 13B parameter model and 8 A100 GPUs (80GB each). Calculate the memory requirements and determine the minimum ZeRO stage needed. Assume Adam optimizer, BF16 mixed precision, and no activation checkpointing.
Hint 1 - Direction
Calculate the per-GPU memory for each ZeRO stage. Remember: weights (2P), gradients (2P), optimizer states (12P) for mixed precision Adam.
Hint 2 - Insight
Total static memory: . Per GPU with no sharding: 208GB (does not fit in 80GB). With ZeRO Stage 1: optimizer states sharded, so per GPU = . Activations will add more.
Hint 3 - Full Solution + Rubric
Memory per component:
- Weights (BF16):
- Gradients (BF16):
- Adam momentum (FP32):
- Adam variance (FP32):
- Master weights (FP32):
- Total static: 208GB
Per-GPU memory by ZeRO stage (8 GPUs):
| Stage | Weights | Gradients | Optimizer | Total Static | Fits in 80GB? |
|---|---|---|---|---|---|
| None (DDP) | 26 GB | 26 GB | 156 GB | 208 GB | No |
| Stage 1 | 26 GB | 26 GB | 19.5 GB | 71.5 GB | Tight with activations |
| Stage 2 | 26 GB | 3.25 GB | 19.5 GB | 48.75 GB | Yes |
| Stage 3 | 3.25 GB | 3.25 GB | 19.5 GB | 26 GB | Yes, generous |
Answer: ZeRO Stage 1 gives 71.5GB of static memory, leaving only ~8.5GB for activations - likely insufficient. ZeRO Stage 2 gives 48.75GB, leaving ~31GB for activations - comfortable for moderate batch sizes. Stage 2 is the minimum viable stage.
If larger batch sizes or longer sequences are needed, use Stage 3 or add activation checkpointing.
Scoring Rubric:
- Strong Hire: Correct calculation for all components, identifies Stage 2 as minimum, accounts for activation memory, mentions activation checkpointing as an option.
- Lean Hire: Gets the right answer but makes arithmetic errors or forgets a memory component.
- No Hire: Cannot calculate memory requirements or does not know what ZeRO stages do.
Problem 2: Parallelism Configuration
You need to train a 70B parameter model on 128 A100 GPUs (16 nodes, 8 GPUs per node, NVLink within node, InfiniBand across nodes). Design the parallelism configuration.
Hint 1 - Direction
Start by checking if the model fits within a single node using tensor parallelism. Then decide pipeline and data parallelism degrees.
Hint 2 - Insight
70B model weights in BF16: 140GB. 8 GPUs per node x 80GB = 640GB per node. With TP=8, weights per GPU = 17.5GB - fits easily. But you also need optimizer states. With TP=8 within a node, you still need the full optimizer state per TP group. ZeRO Stage 1 or 2 across the TP group can help.
Hint 3 - Full Solution + Rubric
Configuration:
Tensor parallelism (TP) = 4 within each node:
- Weights per GPU: 140GB / 4 = 35GB
- Uses NVLink (900 GB/s) for the frequent intra-layer communication
- TP=4 instead of TP=8 to leave room for data parallelism within the node
Pipeline parallelism (PP) = 4 across nodes:
- Model split into 4 stages of ~17.5B params each
- Each stage spans 4 nodes (one TP group per stage per DP rank)
- Inter-node communication via InfiniBand - pipeline parallelism is latency-tolerant
Data parallelism (DP) = 128 / (4 x 4) = 8:
- 8 data-parallel replicas
- ZeRO Stage 1 within the DP group to shard optimizer states
Memory per GPU:
- Weights (TP=4): ~35GB
- Optimizer states (ZeRO-1, DP=8): ~4.9GB
- Gradients: ~8.75GB (sharded across TP group)
- Total static: ~49GB, leaving ~31GB for activations
Pipeline configuration:
- Micro-batches: 32 (to minimize bubble overhead)
- Bubble fraction:
- Use 1F1B schedule to reduce peak activation memory
Alternative valid configurations:
- TP=8, PP=2, DP=8 - more TP, less pipeline overhead
- TP=4, PP=8, DP=4 - more pipeline stages, less data parallelism
- TP=4, PP=1, DP=32 with ZeRO Stage 3 - pure sharded data parallelism
Scoring Rubric:
- Strong Hire: Provides a specific configuration with TP/PP/DP degrees, justifies each choice (NVLink for TP, InfiniBand tolerance for PP), calculates memory, addresses pipeline bubble, mentions alternatives.
- Lean Hire: Suggests a reasonable configuration but cannot justify the choices or calculate memory.
- No Hire: Cannot combine multiple parallelism strategies or suggests only data parallelism for a 70B model.
Problem 3: Scaling Law Application
Your company has a fixed training compute budget of FLOPs. Using Chinchilla scaling laws, determine the optimal model size and dataset size. Then explain why your company might choose to deviate from the optimal.
Hint 1 - Direction
Chinchilla rule of thumb: 20 tokens per parameter. The compute is approximately for transformers (6 FLOPs per parameter per token).
Hint 2 - Insight
From and : , so . With : parameters. Dataset: tokens. But you might want a smaller model trained on more data for inference efficiency.
Hint 3 - Full Solution + Rubric
Chinchilla-optimal calculation:
Given FLOPs, using and :
Reasons to deviate:
-
Inference cost dominance: If the model serves millions of users, a smaller model (e.g., 1B) trained on more data (e.g., 1.7T tokens) may have higher total cost-efficiency because inference cost = per request.
-
Data availability: You may not have 60B high-quality tokens. If limited to 20B tokens, a smaller model (~1B) that is not overtrained on repeated data may perform better.
-
Latency requirements: A 3B model may not meet latency SLAs for real-time applications. A 1B model might, even if it has slightly higher loss.
-
Emergent capabilities: Some capabilities only appear at certain model sizes. If you need chain-of-thought reasoning, you might need a larger model even if Chinchilla says otherwise.
-
Fine-tuning plans: If you plan to fine-tune for specific tasks, a larger pre-trained model often fine-tunes better even with the same pre-training compute.
Scoring Rubric:
- Strong Hire: Correct Chinchilla calculation, provides 3+ valid reasons to deviate, mentions inference cost as the primary modern reason, connects to real examples (Llama 3 strategy).
- Lean Hire: Gets the calculation right but provides only 1-2 reasons to deviate.
- No Hire: Cannot perform the calculation or does not know what Chinchilla scaling laws are.
Problem 4: Communication Bottleneck Analysis
You are training with 64 GPUs across 8 nodes, using DDP with ring-allreduce. Your model has 7B parameters in BF16 gradients. The inter-node network bandwidth is 100 Gbps per node. Calculate the time for one AllReduce operation and determine if communication is the bottleneck.
Hint 1 - Direction
Ring-allreduce per-GPU bandwidth cost is approximately bytes regardless of . But the bottleneck is the slowest link in the ring. With 8 nodes, the ring must cross inter-node links.
Hint 2 - Insight
Data to transfer per GPU: (the factor of 2 is from the two phases of ring-allreduce). Network bandwidth: 100 Gbps = 12.5 GB/s. Time: seconds. If a forward+backward pass takes about 4.5 seconds, communication is 33% of total time - a significant bottleneck.
Hint 3 - Full Solution + Rubric
Calculation:
Gradient size:
Ring-allreduce total per-GPU transfer: (for )
Inter-node bandwidth:
Time for AllReduce:
Is this a bottleneck?
Estimated forward+backward time for 7B model on A100:
- A100 BF16 peak: 312 TFLOPS
- Approximate compute per step with batch size 8 and seq_len 2048: ~4.5 seconds
- Communication fraction:
This is a significant bottleneck. Mitigation strategies:
- Overlap communication with computation: Start allreduce for earlier layers while computing gradients for later layers. Can hide 50-80% of communication time.
- Gradient compression: Top-K sparsification or PowerSGD can reduce transfer by 10-100x.
- Gradient accumulation: Accumulate gradients over multiple micro-batches before allreduce, amortizing the communication cost.
- Faster network: 400 Gbps InfiniBand (standard in modern GPU clusters) would reduce communication to 0.56 seconds.
Scoring Rubric:
- Strong Hire: Correct calculation of AllReduce time, determines communication fraction, proposes mitigation strategies including overlap and compression, mentions real network bandwidth numbers.
- Lean Hire: Gets the approximate time right but cannot analyze whether it is a bottleneck or propose mitigations.
- No Hire: Cannot calculate ring-allreduce communication cost or does not understand the concept.
Problem 5: Design Challenge
You are the tech lead for training a new 30B parameter multilingual language model. You have 512 H100 GPUs for 3 months. Design the complete distributed training setup: parallelism strategy, compute budget, and failure recovery.
Hint 1 - Direction
Start with the compute budget. 512 H100s for 3 months at ~40% MFU. How many FLOPs? Then use scaling laws to determine if 30B is well-allocated. Then design the parallelism.
Hint 2 - Insight
Compute budget: 512 GPUs x ~1000 TFLOPS (H100 BF16) x 0.4 MFU x 90 days x 86400 s/day = ~ FLOPs. Chinchilla-optimal for this budget: . Training a 30B model means you are overtraining by ~4x, which is fine for inference efficiency. Data needed: .
Hint 3 - Full Solution + Rubric
Compute budget:
- 512 H100 GPUs x 989 TFLOPS (BF16 peak) x 0.40 MFU = ~202,752 TFLOPS effective
- Training time: 90 days x 86,400 s/day = 7,776,000 seconds
- Total FLOPs: ~ FLOPs
Scaling law analysis:
- Chinchilla-optimal N for this budget:
- We are training 30B - about 3.8x smaller than Chinchilla-optimal
- This is deliberate: 30B is a practical deployment size, and overtraining improves inference efficiency
- Required tokens:
Parallelism configuration (512 H100s = 64 nodes x 8 GPUs):
- TP = 4 (within node via NVLink)
- PP = 2 (across 2 groups of nodes, minimal bubble)
- DP = 512 / (4 x 2) = 64 (64-way data parallelism)
- ZeRO Stage 1 within DP groups (partition optimizer states)
Memory per GPU:
- Weights (TP=4): 30B x 2 / 4 = 15GB
- Optimizer (ZeRO-1, DP=64): ~1.4GB
- Gradients (TP=4): ~15GB
- Total static: ~32GB, leaving ~48GB for activations on 80GB H100
Failure recovery:
- Checkpoint every 1000 steps (~75 minutes)
- Async checkpoint to shared NFS / object storage
- Elastic training: handle up to 5% GPU failures without restart
- Automatic restart from latest checkpoint on node failure
- Watchdog: alert if loss spikes >2x or MFU drops below 30%
Scoring Rubric:
- Strong Hire: Complete compute budget, scaling law analysis, specific parallelism configuration with justification, memory calculation, failure recovery plan.
- Lean Hire: Provides a reasonable parallelism configuration but missing scaling law analysis or failure recovery.
- No Hire: Cannot design a multi-dimensional parallelism configuration or does not understand the scale of the problem.
Interview Cheat Sheet
| Concept | Key Formula/Number | One-Liner | Red Flag |
|---|---|---|---|
| Training memory (Adam, mixed prec.) | bytes total | Weights + grads + optimizer + master weights | "Training memory = model size" |
| Data parallelism | Replicate model, split data | AllReduce gradients after backward | "DP reduces model memory" |
| Ring-allreduce bandwidth | bytes per GPU | Constant cost regardless of GPU count | "AllReduce is O(N)" |
| Tensor parallelism | Split layers across GPUs | Requires NVLink, intra-node only | "TP works across nodes" |
| Pipeline parallelism | Split layer groups across GPUs | Bubble = | "No bubble with enough GPUs" |
| ZeRO Stage 1 | Shard optimizer states | Memory: | "ZeRO removes all redundancy" (only Stage 3) |
| ZeRO Stage 3 / FSDP | Shard everything | Memory: , 1.5x comm cost | "FSDP is free" |
| Chinchilla rule | 20 tokens per parameter | Scale data and params equally | "Bigger model always better" |
| Compute formula | 6 FLOPs per param per token | Not knowing the 6ND formula | |
| MFU | Actual FLOPS / Peak FLOPS | 30-50% is typical at scale | "MFU should be 90%+" |
| Activation checkpointing | Recompute vs store | ~33% more compute, memory | "Free memory savings" |
Spaced Repetition Checkpoints
Day 0 - Initial Learning
- Read this entire page
- Calculate memory requirements for a 7B and 70B model from memory
- Draw the ring-allreduce algorithm on paper
- Write out the ZeRO stages and memory savings for each
- State the Chinchilla scaling law and the 20-tokens-per-parameter rule
Day 3 - First Recall
- Without notes, calculate memory for a 13B model with Adam mixed precision
- Explain data vs tensor vs pipeline parallelism in 2 minutes, aloud
- Give the "60-Second Answer" for model memory out loud, timed
- Derive the Chinchilla-optimal model size for FLOPs
Day 7 - Connections
- Design a parallelism configuration for 70B on 256 GPUs (write it out)
- Explain why ring-allreduce bandwidth is independent of GPU count
- Do Practice Problem 1 (memory budget) without looking at hints
- Explain inference-aware scaling and how it changes the Chinchilla recommendation
Day 14 - Application
- Do Practice Problem 2 (parallelism configuration) under timed conditions (15 minutes)
- Do Practice Problem 4 (communication bottleneck) with full calculations
- Design a training setup for a model size and GPU count of your choice
- Review concepts you hesitated on
Day 21 - Mock Interview
- Have someone ask: "How would you train a 70B model on 512 GPUs? Walk me through everything."
- Time yourself: full answer should take 8-12 minutes covering memory, parallelism, scaling, and failure recovery
- Do all 5 practice problems in sequence under timed conditions (60 minutes total)
- Can you whiteboard the 3D parallelism diagram from memory?
Key Takeaways
-
Memory is the first constraint. Before choosing a parallelism strategy, calculate the exact memory requirements. The formula for Adam mixed precision is your starting point.
-
Different parallelism strategies serve different purposes. Data parallelism scales throughput. Tensor parallelism fits layers on multiple GPUs. Pipeline parallelism fits models across nodes. ZeRO eliminates memory redundancy. Most large-scale training uses a combination of all four.
-
Communication is the scaling bottleneck. Ring-allreduce is bandwidth-optimal but latency-limited. Tensor parallelism requires NVLink. Pipeline parallelism tolerates lower bandwidth. Understanding these tradeoffs is what makes you a strong distributed training engineer.
-
Chinchilla changed the game. The 20-tokens-per-parameter rule optimizes for training compute. Inference-aware scaling (overtraining smaller models) optimizes for deployment cost. Know both perspectives and when each applies.
-
Failure recovery is not optional at scale. With 1000 GPUs, hardware failures are frequent (multiple per day). Checkpoint frequently, use elastic training, and monitor aggressively.
