Distributed Training - Scaling from One GPU to a Thousand

Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Infra Eng, Research Engineer, Applied Scientist

The Real Interview Moment

You are in a Google Brain (now DeepMind) interview. The interviewer draws a single GPU on the whiteboard with a model that does not fit in memory. She says: "This model has 70 billion parameters. An A100 has 80GB of memory. Walk me through, step by step, how you would train this model across a cluster of 256 GPUs. What parallelism strategies do you use? What are the communication bottlenecks? How do you decide the optimal configuration?"

This question separates ML practitioners who have trained models at scale from those who have only trained on a single GPU. The interviewer is not looking for buzzwords like "data parallelism" or "DeepSpeed." She wants you to reason about memory budgets, communication costs, and the tradeoffs between different parallelism strategies. She wants you to know that a 70B model requires approximately 140GB in FP16 just for weights - before optimizer states, gradients, or activations - and that you need to partition these across GPUs intelligently.

Candidates who can only describe data parallelism get a "lean no-hire." Candidates who can explain the full stack - data, tensor, pipeline parallelism, ZeRO stages, communication patterns, and how to configure them - get a "strong hire."

What You Will Master

Calculate the memory requirements for training a model of any size
Compare data parallelism, tensor parallelism, and pipeline parallelism with precise tradeoffs
Explain ZeRO stages 1, 2, and 3 and the memory savings of each
Describe AllReduce, ring-allreduce, and gradient compression algorithms
Apply FSDP (Fully Sharded Data Parallel) and explain how it relates to ZeRO
Derive scaling laws relating loss to compute, data, and parameters
Explain Chinchilla optimal training and why it changed LLM training
Design a training configuration for a given model size and GPU cluster

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Explain	4 - Can Derive	5 - Can Teach	Your Score
Calculate model memory requirements						___
Explain data parallelism + AllReduce						___
Explain tensor parallelism						___
Explain pipeline parallelism						___
Describe ZeRO stages 1/2/3						___
Explain ring-allreduce algorithm						___
State and interpret scaling laws						___
Design a multi-GPU training config						___

Target: All 4s and 5s before your interview.

Part 1 - Memory Anatomy: Why One GPU Is Not Enough

Before discussing parallelism, you must understand what consumes GPU memory during training.

Memory Breakdown for Training

For a model with $P$ parameters trained with Adam optimizer in mixed precision:

Component	Memory (bytes)	For 7B params	For 70B params
Model weights (FP16)	$2P$	14 GB	140 GB
Gradients (FP16)	$2P$	14 GB	140 GB
Adam optimizer state: momentum (FP32)	$4P$	28 GB	280 GB
Adam optimizer state: variance (FP32)	$4P$	28 GB	280 GB
Master weights (FP32)	$4P$	28 GB	280 GB
Total (no activations)	$16P$	112 GB	1120 GB
Activations (varies)	$\sim 2 \times \text{batch} \times \text{seq} \times \text{hidden}$	~10-50 GB	~50-500 GB

A single A100 (80GB) cannot even hold the weights of a 70B model in FP16 (140GB), let alone the optimizer states. A 7B model fits its weights on one GPU but not the full training state (112GB > 80GB).

60-Second Answer

"Training memory has four main components: model weights, gradients, optimizer states, and activations. For a model with P parameters using Adam in mixed precision, you need approximately 16P bytes - 2P for FP16 weights, 2P for FP16 gradients, and 12P for FP32 optimizer states and master weights. A 7B model needs ~112GB just for the static state, before activations. A single 80GB A100 cannot fit this, so we need to distribute across multiple GPUs using some combination of data parallelism, tensor parallelism, and pipeline parallelism."

Activation Memory

Activations are the intermediate tensors saved during the forward pass for use in backpropagation. For a transformer with $L$ layers, hidden dimension $H$ , sequence length $S$ , and batch size $B$ :

$\text{Activation memory} \approx L \times B \times S \times H \times 2 \text{ bytes (FP16)}$

Activation checkpointing (gradient checkpointing) trades compute for memory: discard activations during the forward pass, recompute them during the backward pass. This reduces activation memory from $O(L)$ to $O(\sqrt{L})$ at the cost of ~33% more compute.

Part 2 - Data Parallelism

The Basic Idea

Replicate the entire model on $N$ GPUs. Split each batch into $N$ micro-batches. Each GPU processes its micro-batch independently, then all GPUs synchronize gradients.

Data Parallelism - Replicate Model, Split Data, AllReduce Gradients

Requirement: Each GPU must hold the full model, full optimizer state, and gradients. Data parallelism does not reduce per-GPU memory for the model - it only divides the activation memory (smaller micro-batch per GPU).

Scaling: Throughput scales nearly linearly with the number of GPUs, as long as communication does not become the bottleneck.

AllReduce: The Communication Primitive

AllReduce computes the sum (or average) of gradients across all GPUs and distributes the result to every GPU.

Naive AllReduce: Every GPU sends its gradients to every other GPU. Communication cost: $O(N^2 \times P)$ . Terrible at scale.

Ring-AllReduce: GPUs are arranged in a logical ring. The algorithm has two phases:

Phase 1 - Scatter-Reduce (N-1 steps): Each GPU sends a chunk of its gradients to the next GPU in the ring. Each GPU accumulates the incoming chunk with its own. After $N-1$ steps, each GPU holds the fully reduced result for one chunk.

Phase 2 - All-Gather (N-1 steps): Each GPU sends its fully reduced chunk around the ring until every GPU has all chunks.

Total communication: Each GPU sends and receives exactly $2(N-1)/N \times P \approx 2P$ data. The cost is independent of $N$ - it does not grow with the number of GPUs.

Common Trap

Do NOT say "AllReduce is O(N) in communication." Ring-allreduce has total per-GPU bandwidth cost of $2P$ regardless of $N$ . The latency (number of steps) is $2(N-1)$ , but the bandwidth cost is constant. Interviewers at Google and Meta expect you to understand this distinction between latency and bandwidth costs.

Gradient Compression

Even with ring-allreduce, communication can be a bottleneck. Gradient compression techniques reduce the data transferred:

Technique	Compression Ratio	How It Works
FP16 gradients	2x	Cast FP32 gradients to FP16 before communication
Top-K sparsification	10-1000x	Only communicate the K largest gradient values
Random sparsification	10-100x	Randomly select gradient values to communicate
Quantization (1-bit)	32x	Send only the sign of each gradient
PowerSGD	10-100x	Low-rank approximation of gradient matrix
Error feedback	-	Accumulate compression error locally, add to next step

Company Variation

Gradient compression is a hot topic at Meta AI and Google Research. Meta's work on 1-bit Adam and Google's work on distributed optimization are frequently discussed. If you are interviewing for infrastructure/systems ML roles, be prepared to discuss the tradeoffs between compression ratio and convergence rate.

Part 3 - Model Parallelism

When a model does not fit on a single GPU even without optimizer states (inference-only), you must split the model across GPUs.

Tensor Parallelism (Intra-layer)

Split individual layers across GPUs. Each GPU computes a portion of the layer's output.

Example: splitting a linear layer $Y = XW$ where $W \in \mathbb{R}^{d \times d}$

Split $W$ column-wise across 2 GPUs:

$W = [W_1 | W_2], \quad W_1 \in \mathbb{R}^{d \times d/2}, \quad W_2 \in \mathbb{R}^{d \times d/2}$

GPU 0 computes $Y_1 = XW_1$ , GPU 1 computes $Y_2 = XW_2$ . The full output is $Y = [Y_1 | Y_2]$ .

For a transformer MLP block (two linear layers with a nonlinearity):

$\text{MLP}(X) = \text{GeLU}(XA) \cdot B$

Megatron-LM approach (Shoeybi et al., 2020):

Split $A$ column-wise: GPU $i$ computes $\text{GeLU}(XA_i)$ - no communication needed because GeLU is element-wise
Split $B$ row-wise: GPU $i$ computes $\text{GeLU}(XA_i) B_i$ - each GPU has a partial result
AllReduce to sum partial results: $Y = \sum_i \text{GeLU}(XA_i) B_i$

Result: One AllReduce per MLP block and one per attention block - total of 2 AllReduces per transformer layer.

Tensor Parallelism - Split Matrix Multiplications Across GPUs (Megatron-LM)

Best for: GPUs connected with high-bandwidth interconnects (NVLink: 900 GB/s on H100). Tensor parallelism requires very frequent communication (every layer), so it only works within a single node.

Pipeline Parallelism (Inter-layer)

Assign different layers to different GPUs. GPU 0 has layers 1-10, GPU 1 has layers 11-20, etc.

Naive approach problem - the bubble:

Most GPUs are idle most of the time. With $P$ pipeline stages, the bubble fraction is $(P-1)/P$ .

GPipe (Huang et al., 2019): Split the mini-batch into $M$ micro-batches. Pipeline them through stages so that multiple micro-batches are in flight simultaneously.

Bubble fraction: $(P-1)/(P-1+M)$ . With $M \gg P$ , the bubble becomes negligible.

1F1B schedule (Narayanan et al., 2021 - PipeDream): Interleave forward and backward passes to further reduce the bubble and reduce peak activation memory. Each stage alternates between forward passes on new micro-batches and backward passes on completed micro-batches.

Pipeline Parallelism - GPipe Micro-batches Reduce the Bubble

Comparing Parallelism Strategies

Dimension	Data Parallel	Tensor Parallel	Pipeline Parallel
What is split	Data (batches)	Individual layers	Groups of layers
Memory savings	None (model replicated)	Proportional to TP degree	Proportional to PP degree
Communication	AllReduce after backward	AllReduce every layer	Point-to-point between stages
Bandwidth needs	Moderate	Very high (NVLink)	Low
Latency sensitivity	Low	High	Moderate
Bubble overhead	None	None	$(P-1)/(P-1+M)$
Ideal placement	Across nodes	Within node (NVLink)	Across nodes

Instant Rejection

Never say "just use data parallelism" for a model that does not fit in a single GPU's memory. Data parallelism replicates the model - it makes the memory problem worse (each GPU needs the full model plus activations). If the model does not fit, you must use model parallelism (tensor or pipeline) or ZeRO to shard the state.

Part 4 - ZeRO: Eliminating Redundancy

The Redundancy Problem in Data Parallelism

In standard data parallelism, every GPU holds a complete copy of:

Model weights ( $2P$ bytes in FP16)
Gradients ( $2P$ bytes)
Optimizer states ( $12P$ bytes for Adam with FP32 master weights)

With 64 GPUs, you have 64 copies of everything - only the activations differ. ZeRO (Zero Redundancy Optimizer) by Rajbhandari et al. (2020) eliminates this redundancy.

ZeRO Stage 1: Partition Optimizer States

Each GPU stores only $1/N$ of the optimizer states. After computing gradients, each GPU:

Reduces gradients for its partition via reduce-scatter
Updates its partition of the optimizer states
Broadcasts updated weights via all-gather

Memory savings: Optimizer states drop from $12P$ to $12P/N$ per GPU. Communication: Same as data parallelism (AllReduce = reduce-scatter + all-gather).

ZeRO Stage 2: Partition Gradients

In addition to Stage 1, each GPU stores only $1/N$ of the gradients.

Memory savings: Gradients drop from $2P$ to $2P/N$ . Combined with Stage 1, static memory drops from $16P$ to $2P + 2P/N + 12P/N = 2P + 14P/N$ .

Communication: Same as data parallelism - reduce-scatter suffices since each GPU only needs its own gradient partition.

ZeRO Stage 3: Partition Parameters

Each GPU stores only $1/N$ of the model parameters. When a layer needs the full parameter for forward or backward pass, it performs an all-gather to reconstruct the parameter, uses it, then discards the non-local portion.

Memory savings: Everything is partitioned. Per-GPU memory: $(2P + 2P + 12P)/N = 16P/N$ .

Communication cost: Additional all-gather operations during forward and backward passes. Total communication increases by 1.5x compared to standard data parallelism.

ZeRO Optimizer Stages - Eliminating Redundancy in Data Parallelism

FSDP: PyTorch's ZeRO Stage 3

Fully Sharded Data Parallel (FSDP) is PyTorch's native implementation of ZeRO Stage 3. Key features:

Shards parameters, gradients, and optimizer states across all GPUs
Uses all-gather to reconstruct parameters on-demand during forward/backward
Supports mixed precision with per-parameter sharding
Can be combined with activation checkpointing
Simpler API than DeepSpeed for PyTorch users

FSDP vs DeepSpeed ZeRO:

Feature	FSDP (PyTorch)	DeepSpeed ZeRO
Framework	Native PyTorch	Library on top of PyTorch
ZeRO stages	Stage 3 equivalent	Stage 1, 2, 3, 3+ (offload)
CPU offloading	Limited	Full support (ZeRO-Infinity)
NVMe offloading	No	Yes (ZeRO-Infinity)
Ease of use	Better for PyTorch users	More configuration options
Mixed precision	Native AMP integration	Own mixed precision pipeline
Community	Growing	Mature

Interviewer's Perspective

When asked about distributed training, interviewers want to see that you can reason about when to use each strategy, not just what they are. A strong answer: "For a 7B model on 8 A100s, I'd use ZeRO Stage 2 - it partitions optimizer states and gradients, bringing per-GPU memory from 112GB to about 30GB, which fits in 80GB A100. I wouldn't need Stage 3 because the weights alone (14GB) fit on each GPU. For a 70B model, I'd use ZeRO Stage 3 combined with tensor parallelism within each node."

Part 5 - DeepSpeed: The Full Stack

DeepSpeed (Microsoft) provides a comprehensive distributed training framework:

Key Components

Component	What It Does
ZeRO Stage 1/2/3	Partition optimizer/gradient/parameter memory
ZeRO-Offload	Offload optimizer states to CPU RAM
ZeRO-Infinity	Offload to CPU RAM + NVMe SSD
3D Parallelism	Combine data + tensor + pipeline parallelism
Sparse Attention	Efficient attention for long sequences
Mixture of Experts	MoE training support
Compression	Gradient and communication compression

ZeRO-Infinity: Training on a Budget

ZeRO-Infinity offloads parameters, gradients, and optimizer states to CPU memory or even NVMe SSDs. This allows training models that are orders of magnitude larger than GPU memory:

GPU memory: Used only for active computations (forward/backward of current layer)
CPU memory: Stores parameters, gradients, optimizer states that are not currently needed
NVMe SSD: Overflow from CPU memory

Tradeoff: Significantly slower due to PCIe bandwidth bottleneck (CPU-GPU: 32 GB/s on PCIe 4.0 vs GPU-GPU NVLink: 900 GB/s on H100). Training throughput can drop by 10-50x compared to all-GPU training.

Part 6 - 3D Parallelism: Combining Everything

For training the largest models (100B+ parameters), you combine all three parallelism strategies:

3D Parallelism - Tensor + Pipeline + Data Parallelism Combined

Rule of thumb for assigning parallelism:

Tensor parallelism: Within a node (NVLink bandwidth). Degree = 2, 4, or 8 (number of GPUs per node).
Pipeline parallelism: Across nodes (lower bandwidth OK). Degree = number of pipeline stages.
Data parallelism: Across the remaining GPUs. Degree = total GPUs / (TP degree x PP degree).

Example: 70B model on 256 A100 GPUs (32 nodes x 8 GPUs/node):

TP = 4 (split each layer across 4 GPUs within a node)
PP = 8 (8 pipeline stages across 8 groups of nodes)
DP = 256 / (4 x 8) = 8 (8-way data parallelism)
Effective batch size = DP x micro-batch size x gradient accumulation steps

Part 7 - Scaling Laws

The Kaplan Scaling Laws (OpenAI, 2020)

Kaplan et al. discovered that language model loss follows power laws:

$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$

Where:

$N$ = number of parameters (excluding embeddings)
$D$ = dataset size (tokens)
$C$ = compute budget (FLOPs)
$\alpha_N \approx 0.076$ , $\alpha_D \approx 0.095$ , $\alpha_C \approx 0.050$

Key insights:

Loss decreases as a power law with N, D, and C - no diminishing returns within the studied range
Larger models are more sample-efficient - they achieve the same loss with less data per parameter
Model size matters more than dataset size for a fixed compute budget (Kaplan's conclusion - later challenged by Chinchilla)

The Chinchilla Scaling Laws (Hoffmann et al., 2022)

DeepMind's Chinchilla paper challenged Kaplan's conclusion. They found:

Kaplan conclusion: For a fixed compute budget, scale the model as large as possible and use whatever data fits.

Chinchilla conclusion: For a fixed compute budget, scale model size and dataset size equally. The optimal allocation is approximately:

$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$

The rule of thumb: train on approximately 20 tokens per parameter.

Model	Parameters	Kaplan-optimal tokens	Chinchilla-optimal tokens	Actual tokens
GPT-3	175B	300B (what they used)	3.5T	300B
Chinchilla	70B	-	1.4T	1.4T
Llama 2	70B	-	1.4T	2T (overtrained)
Llama 3	70B	-	1.4T	15T (massively overtrained)

Interviewer's Perspective

The Chinchilla paper is one of the most important results in LLM training. When asked about scaling laws, an interviewer at a top AI lab expects you to know: (1) power law relationship between loss and compute/data/params, (2) Chinchilla's "20 tokens per parameter" rule, (3) why recent models like Llama 3 are deliberately "overtrained" relative to Chinchilla - inference cost matters, because a smaller model trained on more data is cheaper to deploy than a larger model trained on less data.

Beyond Chinchilla: Inference-Aware Scaling

Chinchilla optimizes for training compute - minimizing loss for a fixed training FLOP budget. But in practice, inference cost matters too. A model that runs 1 billion inference calls should be optimized differently than one that runs 1000.

Inference-aware scaling trades more training compute (overtrain a smaller model) for lower inference cost:

$\text{Total cost} = C_{\text{train}} + N_{\text{infer}} \times C_{\text{infer per token}} \times N_{\text{tokens}}$

This is why Llama 3 (70B, 15T tokens) is trained far beyond Chinchilla-optimal - the extra training cost is amortized over billions of inference calls.

Common Trap

Do NOT say "Chinchilla proved that GPT-3 was too large." Chinchilla showed that GPT-3 was undertrained for its size - it should have been trained on more data, not that it should have been smaller. The implication is that for 300B tokens of training data, the optimal model is ~10B parameters (not 175B). But if you have enough data, a 175B model will eventually outperform a 10B model.

The Compute Formula

For transformer language models, the compute required for training is approximately:

$C \approx 6ND$

Where $C$ is in FLOPs, $N$ is number of parameters, and $D$ is number of training tokens. The factor of 6 comes from: 2 FLOPs per multiply-add in the forward pass per parameter per token, times 3 (forward + backward, where backward is approximately 2x forward).

This formula is essential for compute budgeting. To train a 70B model on 2T tokens:

$C = 6 \times 70 \times 10^9 \times 2 \times 10^{12} = 8.4 \times 10^{23} \text{ FLOPs}$

On 1024 A100 GPUs at 40% MFU (Model FLOPs Utilization):

$\text{Time} = \frac{8.4 \times 10^{23}}{1024 \times 312 \times 10^{12} \times 0.4} \approx 6.6 \times 10^{6} \text{ seconds} \approx 76 \text{ days}$

Part 8 - Practical Guide: Scaling from 1 GPU to 1000

Decision Tree

Scaling Decision Tree - 1 GPU to 1000+ GPUs

Step-by-Step Scaling Guide

Level 1: 1 GPU

Standard single-GPU training
Mixed precision (BF16) to double effective memory
Gradient accumulation for larger effective batch size
Activation checkpointing if memory-tight

Level 2: 2-8 GPUs (single node)

PyTorch DDP (DistributedDataParallel)
NVLink interconnect for fast communication
Linear throughput scaling expected
If model does not fit: ZeRO Stage 2 (FSDP)

Level 3: 8-64 GPUs (multi-node)

DDP across nodes with InfiniBand/RoCE
Communication becomes significant - overlap gradient communication with backward computation
Consider ZeRO Stage 2/3 for memory-constrained setups
Gradient compression if network bandwidth is limited

Level 4: 64-512 GPUs

Tensor parallelism within nodes (TP=4 or TP=8)
Data parallelism across nodes
Careful batch size scaling with learning rate warmup
Monitor GPU utilization - aim for >50% MFU

Level 5: 512-10000+ GPUs

Full 3D parallelism (TP + PP + DP)
Pipeline parallelism across node groups
Failure recovery: checkpoint frequently, use elastic training
Network topology awareness: place communicating GPUs on same switch
Typically only frontier model training (GPT-4, Gemini, Llama 3 scale)

Model FLOPs Utilization (MFU)

MFU measures how efficiently you use the GPU's theoretical peak FLOPS:

$\text{MFU} = \frac{\text{Actual training FLOPS}}{\text{Theoretical peak FLOPS of hardware}}$

Setup	Typical MFU
Single GPU, well-optimized	50-60%
8 GPUs, DDP within node	45-55%
64 GPUs, multi-node DDP	35-50%
256 GPUs, 3D parallelism	30-45%
1000+ GPUs, frontier training	25-40%

Why MFU is never 100%: Communication overhead, pipeline bubbles, memory bandwidth bottlenecks, kernel launch overhead, idle time during synchronization.

Practice Problems

Problem 1: Memory Budget Calculation

You have a 13B parameter model and 8 A100 GPUs (80GB each). Calculate the memory requirements and determine the minimum ZeRO stage needed. Assume Adam optimizer, BF16 mixed precision, and no activation checkpointing.

Hint 1 - Direction

Calculate the per-GPU memory for each ZeRO stage. Remember: weights (2P), gradients (2P), optimizer states (12P) for mixed precision Adam.

Hint 2 - Insight

Total static memory: $16 \times 13\text{B} = 208\text{GB}$ . Per GPU with no sharding: 208GB (does not fit in 80GB). With ZeRO Stage 1: optimizer states sharded, so per GPU = $2P + 2P + 12P/8 = 4P + 1.5P = 71.5\text{GB}$ . Activations will add more.

Hint 3 - Full Solution + Rubric

Memory per component:

Weights (BF16): $2 \times 13\text{B} = 26\text{GB}$
Gradients (BF16): $2 \times 13\text{B} = 26\text{GB}$
Adam momentum (FP32): $4 \times 13\text{B} = 52\text{GB}$
Adam variance (FP32): $4 \times 13\text{B} = 52\text{GB}$
Master weights (FP32): $4 \times 13\text{B} = 52\text{GB}$
Total static: 208GB

Per-GPU memory by ZeRO stage (8 GPUs):

Stage	Weights	Gradients	Optimizer	Total Static	Fits in 80GB?
None (DDP)	26 GB	26 GB	156 GB	208 GB	No
Stage 1	26 GB	26 GB	19.5 GB	71.5 GB	Tight with activations
Stage 2	26 GB	3.25 GB	19.5 GB	48.75 GB	Yes
Stage 3	3.25 GB	3.25 GB	19.5 GB	26 GB	Yes, generous

Answer: ZeRO Stage 1 gives 71.5GB of static memory, leaving only ~8.5GB for activations - likely insufficient. ZeRO Stage 2 gives 48.75GB, leaving ~31GB for activations - comfortable for moderate batch sizes. Stage 2 is the minimum viable stage.

If larger batch sizes or longer sequences are needed, use Stage 3 or add activation checkpointing.

Scoring Rubric:

Strong Hire: Correct calculation for all components, identifies Stage 2 as minimum, accounts for activation memory, mentions activation checkpointing as an option.
Lean Hire: Gets the right answer but makes arithmetic errors or forgets a memory component.
No Hire: Cannot calculate memory requirements or does not know what ZeRO stages do.

Problem 2: Parallelism Configuration

You need to train a 70B parameter model on 128 A100 GPUs (16 nodes, 8 GPUs per node, NVLink within node, InfiniBand across nodes). Design the parallelism configuration.

Hint 1 - Direction

Start by checking if the model fits within a single node using tensor parallelism. Then decide pipeline and data parallelism degrees.

Hint 2 - Insight

70B model weights in BF16: 140GB. 8 GPUs per node x 80GB = 640GB per node. With TP=8, weights per GPU = 17.5GB - fits easily. But you also need optimizer states. With TP=8 within a node, you still need the full optimizer state per TP group. ZeRO Stage 1 or 2 across the TP group can help.

Hint 3 - Full Solution + Rubric

Configuration:

Tensor parallelism (TP) = 4 within each node:

Weights per GPU: 140GB / 4 = 35GB
Uses NVLink (900 GB/s) for the frequent intra-layer communication
TP=4 instead of TP=8 to leave room for data parallelism within the node

Pipeline parallelism (PP) = 4 across nodes:

Model split into 4 stages of ~17.5B params each
Each stage spans 4 nodes (one TP group per stage per DP rank)
Inter-node communication via InfiniBand - pipeline parallelism is latency-tolerant

Data parallelism (DP) = 128 / (4 x 4) = 8:

8 data-parallel replicas
ZeRO Stage 1 within the DP group to shard optimizer states

Memory per GPU:

Weights (TP=4): ~35GB
Optimizer states (ZeRO-1, DP=8): ~4.9GB
Gradients: ~8.75GB (sharded across TP group)
Total static: ~49GB, leaving ~31GB for activations

Pipeline configuration:

Micro-batches: 32 (to minimize bubble overhead)
Bubble fraction: $(4-1)/(4-1+32) \approx 8.6\%$
Use 1F1B schedule to reduce peak activation memory

Alternative valid configurations:

TP=8, PP=2, DP=8 - more TP, less pipeline overhead
TP=4, PP=8, DP=4 - more pipeline stages, less data parallelism
TP=4, PP=1, DP=32 with ZeRO Stage 3 - pure sharded data parallelism

Scoring Rubric:

Strong Hire: Provides a specific configuration with TP/PP/DP degrees, justifies each choice (NVLink for TP, InfiniBand tolerance for PP), calculates memory, addresses pipeline bubble, mentions alternatives.
Lean Hire: Suggests a reasonable configuration but cannot justify the choices or calculate memory.
No Hire: Cannot combine multiple parallelism strategies or suggests only data parallelism for a 70B model.

Problem 3: Scaling Law Application

Your company has a fixed training compute budget of $10^{22}$ FLOPs. Using Chinchilla scaling laws, determine the optimal model size and dataset size. Then explain why your company might choose to deviate from the optimal.

Hint 1 - Direction

Chinchilla rule of thumb: 20 tokens per parameter. The compute is approximately $C \approx 6ND$ for transformers (6 FLOPs per parameter per token).

Hint 2 - Insight

From $C = 6ND$ and $D = 20N$ : $C = 120N^2$ , so $N = \sqrt{C/120}$ . With $C = 10^{22}$ : $N = \sqrt{10^{22}/120} \approx 2.9 \times 10^{9} \approx 3\text{B}$ parameters. Dataset: $D = 20 \times 3\text{B} = 60\text{B}$ tokens. But you might want a smaller model trained on more data for inference efficiency.

Hint 3 - Full Solution + Rubric

Chinchilla-optimal calculation:

Given $C = 10^{22}$ FLOPs, using $C \approx 6ND$ and $D \approx 20N$ :

$C = 6N \times 20N = 120N^2$ $N = \sqrt{\frac{C}{120}} = \sqrt{\frac{10^{22}}{120}} \approx 2.9 \times 10^9 \approx 3\text{B parameters}$ $D = 20N \approx 60\text{B tokens}$

Reasons to deviate:

Inference cost dominance: If the model serves millions of users, a smaller model (e.g., 1B) trained on more data (e.g., 1.7T tokens) may have higher total cost-efficiency because inference cost = $O(N)$ per request.
Data availability: You may not have 60B high-quality tokens. If limited to 20B tokens, a smaller model (~1B) that is not overtrained on repeated data may perform better.
Latency requirements: A 3B model may not meet latency SLAs for real-time applications. A 1B model might, even if it has slightly higher loss.
Emergent capabilities: Some capabilities only appear at certain model sizes. If you need chain-of-thought reasoning, you might need a larger model even if Chinchilla says otherwise.
Fine-tuning plans: If you plan to fine-tune for specific tasks, a larger pre-trained model often fine-tunes better even with the same pre-training compute.

Scoring Rubric:

Strong Hire: Correct Chinchilla calculation, provides 3+ valid reasons to deviate, mentions inference cost as the primary modern reason, connects to real examples (Llama 3 strategy).
Lean Hire: Gets the calculation right but provides only 1-2 reasons to deviate.
No Hire: Cannot perform the calculation or does not know what Chinchilla scaling laws are.

Problem 4: Communication Bottleneck Analysis

You are training with 64 GPUs across 8 nodes, using DDP with ring-allreduce. Your model has 7B parameters in BF16 gradients. The inter-node network bandwidth is 100 Gbps per node. Calculate the time for one AllReduce operation and determine if communication is the bottleneck.

Hint 1 - Direction

Ring-allreduce per-GPU bandwidth cost is approximately $2P$ bytes regardless of $N$ . But the bottleneck is the slowest link in the ring. With 8 nodes, the ring must cross inter-node links.

Hint 2 - Insight

Data to transfer per GPU: $2 \times 7\text{B} \times 2 \text{ bytes} = 28\text{GB}$ (the factor of 2 is from the two phases of ring-allreduce). Network bandwidth: 100 Gbps = 12.5 GB/s. Time: $28 / 12.5 \approx 2.2$ seconds. If a forward+backward pass takes about 4.5 seconds, communication is 33% of total time - a significant bottleneck.

Hint 3 - Full Solution + Rubric

Calculation:

Gradient size: $7\text{B} \times 2 \text{ bytes (BF16)} = 14\text{GB}$

Ring-allreduce total per-GPU transfer: $\frac{2(N-1)}{N} \times 14\text{GB} \approx 2 \times 14 = 28\text{GB}$ (for $N=64 \gg 1$ )

Inter-node bandwidth: $100 \text{ Gbps} = 12.5 \text{ GB/s}$

Time for AllReduce: $28\text{GB} / 12.5\text{ GB/s} = 2.24 \text{ seconds}$

Is this a bottleneck?

Estimated forward+backward time for 7B model on A100:

A100 BF16 peak: 312 TFLOPS
Approximate compute per step with batch size 8 and seq_len 2048: ~4.5 seconds
Communication fraction: $2.24 / (4.5 + 2.24) \approx 33\%$

This is a significant bottleneck. Mitigation strategies:

Overlap communication with computation: Start allreduce for earlier layers while computing gradients for later layers. Can hide 50-80% of communication time.
Gradient compression: Top-K sparsification or PowerSGD can reduce transfer by 10-100x.
Gradient accumulation: Accumulate gradients over multiple micro-batches before allreduce, amortizing the communication cost.
Faster network: 400 Gbps InfiniBand (standard in modern GPU clusters) would reduce communication to 0.56 seconds.

Scoring Rubric:

Strong Hire: Correct calculation of AllReduce time, determines communication fraction, proposes mitigation strategies including overlap and compression, mentions real network bandwidth numbers.
Lean Hire: Gets the approximate time right but cannot analyze whether it is a bottleneck or propose mitigations.
No Hire: Cannot calculate ring-allreduce communication cost or does not understand the concept.

Problem 5: Design Challenge

You are the tech lead for training a new 30B parameter multilingual language model. You have 512 H100 GPUs for 3 months. Design the complete distributed training setup: parallelism strategy, compute budget, and failure recovery.

Hint 1 - Direction

Start with the compute budget. 512 H100s for 3 months at ~40% MFU. How many FLOPs? Then use scaling laws to determine if 30B is well-allocated. Then design the parallelism.

Hint 2 - Insight

Compute budget: 512 GPUs x ~1000 TFLOPS (H100 BF16) x 0.4 MFU x 90 days x 86400 s/day = ~ $1.6 \times 10^{24}$ FLOPs. Chinchilla-optimal for this budget: $N \approx \sqrt{C/120} \approx 115\text{B}$ . Training a 30B model means you are overtraining by ~4x, which is fine for inference efficiency. Data needed: $C/(6N) \approx 8.9\text{T tokens}$ .

Hint 3 - Full Solution + Rubric

Compute budget:

512 H100 GPUs x 989 TFLOPS (BF16 peak) x 0.40 MFU = ~202,752 TFLOPS effective
Training time: 90 days x 86,400 s/day = 7,776,000 seconds
Total FLOPs: ~ $1.58 \times 10^{24}$ FLOPs

Scaling law analysis:

Chinchilla-optimal N for this budget: $\sqrt{1.58 \times 10^{24} / 120} \approx 115\text{B params}$
We are training 30B - about 3.8x smaller than Chinchilla-optimal
This is deliberate: 30B is a practical deployment size, and overtraining improves inference efficiency
Required tokens: $C / (6N) = 1.58 \times 10^{24} / (6 \times 30 \times 10^9) \approx 8.8\text{T tokens}$

Parallelism configuration (512 H100s = 64 nodes x 8 GPUs):

TP = 4 (within node via NVLink)
PP = 2 (across 2 groups of nodes, minimal bubble)
DP = 512 / (4 x 2) = 64 (64-way data parallelism)
ZeRO Stage 1 within DP groups (partition optimizer states)

Memory per GPU:

Weights (TP=4): 30B x 2 / 4 = 15GB
Optimizer (ZeRO-1, DP=64): ~1.4GB
Gradients (TP=4): ~15GB
Total static: ~32GB, leaving ~48GB for activations on 80GB H100

Failure recovery:

Checkpoint every 1000 steps (~75 minutes)
Async checkpoint to shared NFS / object storage
Elastic training: handle up to 5% GPU failures without restart
Automatic restart from latest checkpoint on node failure
Watchdog: alert if loss spikes >2x or MFU drops below 30%

Scoring Rubric:

Strong Hire: Complete compute budget, scaling law analysis, specific parallelism configuration with justification, memory calculation, failure recovery plan.
Lean Hire: Provides a reasonable parallelism configuration but missing scaling law analysis or failure recovery.
No Hire: Cannot design a multi-dimensional parallelism configuration or does not understand the scale of the problem.

Interview Cheat Sheet

Concept	Key Formula/Number	One-Liner	Red Flag
Training memory (Adam, mixed prec.)	$16P$ bytes total	Weights + grads + optimizer + master weights	"Training memory = model size"
Data parallelism	Replicate model, split data	AllReduce gradients after backward	"DP reduces model memory"
Ring-allreduce bandwidth	$2P$ bytes per GPU	Constant cost regardless of GPU count	"AllReduce is O(N)"
Tensor parallelism	Split layers across GPUs	Requires NVLink, intra-node only	"TP works across nodes"
Pipeline parallelism	Split layer groups across GPUs	Bubble = $(P-1)/(P-1+M)$	"No bubble with enough GPUs"
ZeRO Stage 1	Shard optimizer states	Memory: $4P + 12P/N$	"ZeRO removes all redundancy" (only Stage 3)
ZeRO Stage 3 / FSDP	Shard everything	Memory: $16P/N$ , 1.5x comm cost	"FSDP is free"
Chinchilla rule	20 tokens per parameter	Scale data and params equally	"Bigger model always better"
Compute formula	$C \approx 6ND$	6 FLOPs per param per token	Not knowing the 6ND formula
MFU	Actual FLOPS / Peak FLOPS	30-50% is typical at scale	"MFU should be 90%+"
Activation checkpointing	Recompute vs store	~33% more compute, $O(\sqrt{L})$ memory	"Free memory savings"

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Read this entire page
Calculate memory requirements for a 7B and 70B model from memory
Draw the ring-allreduce algorithm on paper
Write out the ZeRO stages and memory savings for each
State the Chinchilla scaling law and the 20-tokens-per-parameter rule

Day 3 - First Recall

Without notes, calculate memory for a 13B model with Adam mixed precision
Explain data vs tensor vs pipeline parallelism in 2 minutes, aloud
Give the "60-Second Answer" for model memory out loud, timed
Derive the Chinchilla-optimal model size for $10^{23}$ FLOPs

Day 7 - Connections

Design a parallelism configuration for 70B on 256 GPUs (write it out)
Explain why ring-allreduce bandwidth is independent of GPU count
Do Practice Problem 1 (memory budget) without looking at hints
Explain inference-aware scaling and how it changes the Chinchilla recommendation

Day 14 - Application

Do Practice Problem 2 (parallelism configuration) under timed conditions (15 minutes)
Do Practice Problem 4 (communication bottleneck) with full calculations
Design a training setup for a model size and GPU count of your choice
Review concepts you hesitated on

Day 21 - Mock Interview

Have someone ask: "How would you train a 70B model on 512 GPUs? Walk me through everything."
Time yourself: full answer should take 8-12 minutes covering memory, parallelism, scaling, and failure recovery
Do all 5 practice problems in sequence under timed conditions (60 minutes total)
Can you whiteboard the 3D parallelism diagram from memory?

Key Takeaways

Memory is the first constraint. Before choosing a parallelism strategy, calculate the exact memory requirements. The $16P$ formula for Adam mixed precision is your starting point.
Different parallelism strategies serve different purposes. Data parallelism scales throughput. Tensor parallelism fits layers on multiple GPUs. Pipeline parallelism fits models across nodes. ZeRO eliminates memory redundancy. Most large-scale training uses a combination of all four.
Communication is the scaling bottleneck. Ring-allreduce is bandwidth-optimal but latency-limited. Tensor parallelism requires NVLink. Pipeline parallelism tolerates lower bandwidth. Understanding these tradeoffs is what makes you a strong distributed training engineer.
Chinchilla changed the game. The 20-tokens-per-parameter rule optimizes for training compute. Inference-aware scaling (overtraining smaller models) optimizes for deployment cost. Know both perspectives and when each applies.
Failure recovery is not optional at scale. With 1000 GPUs, hardware failures are frequent (multiple per day). Checkpoint frequently, use elastic training, and monitor aggressively.

The Real Interview Moment​

What You Will Master​

Self-Assessment: Where Are You Now?​

Part 1 - Memory Anatomy: Why One GPU Is Not Enough​

Memory Breakdown for Training​

Activation Memory​

Part 2 - Data Parallelism​

The Basic Idea​

AllReduce: The Communication Primitive​

Gradient Compression​

Part 3 - Model Parallelism​

Tensor Parallelism (Intra-layer)​

Pipeline Parallelism (Inter-layer)​

Comparing Parallelism Strategies​

Part 4 - ZeRO: Eliminating Redundancy​

The Redundancy Problem in Data Parallelism​

ZeRO Stage 1: Partition Optimizer States​

ZeRO Stage 2: Partition Gradients​

ZeRO Stage 3: Partition Parameters​

FSDP: PyTorch's ZeRO Stage 3​

Part 5 - DeepSpeed: The Full Stack​

Key Components​

ZeRO-Infinity: Training on a Budget​

Part 6 - 3D Parallelism: Combining Everything​

Part 7 - Scaling Laws​

The Kaplan Scaling Laws (OpenAI, 2020)​

The Chinchilla Scaling Laws (Hoffmann et al., 2022)​

Beyond Chinchilla: Inference-Aware Scaling​

The Compute Formula​

Part 8 - Practical Guide: Scaling from 1 GPU to 1000​

Decision Tree​

Step-by-Step Scaling Guide​

Model FLOPs Utilization (MFU)​

Practice Problems​

Problem 1: Memory Budget Calculation​

Problem 2: Parallelism Configuration​

Problem 3: Scaling Law Application​

Problem 4: Communication Bottleneck Analysis​

Problem 5: Design Challenge​

Interview Cheat Sheet​

Spaced Repetition Checkpoints​

Day 0 - Initial Learning​

Day 3 - First Recall​

Day 7 - Connections​

Day 14 - Application​

Day 21 - Mock Interview​

Key Takeaways​

The Real Interview Moment

What You Will Master

Self-Assessment: Where Are You Now?

Part 1 - Memory Anatomy: Why One GPU Is Not Enough

Memory Breakdown for Training

Activation Memory

Part 2 - Data Parallelism

The Basic Idea

AllReduce: The Communication Primitive

Gradient Compression

Part 3 - Model Parallelism

Tensor Parallelism (Intra-layer)

Pipeline Parallelism (Inter-layer)

Comparing Parallelism Strategies

Part 4 - ZeRO: Eliminating Redundancy

The Redundancy Problem in Data Parallelism

ZeRO Stage 1: Partition Optimizer States

ZeRO Stage 2: Partition Gradients

ZeRO Stage 3: Partition Parameters

FSDP: PyTorch's ZeRO Stage 3

Part 5 - DeepSpeed: The Full Stack

Key Components

ZeRO-Infinity: Training on a Budget

Part 6 - 3D Parallelism: Combining Everything

Part 7 - Scaling Laws

The Kaplan Scaling Laws (OpenAI, 2020)

The Chinchilla Scaling Laws (Hoffmann et al., 2022)

Beyond Chinchilla: Inference-Aware Scaling

The Compute Formula

Part 8 - Practical Guide: Scaling from 1 GPU to 1000

Decision Tree

Step-by-Step Scaling Guide

Model FLOPs Utilization (MFU)

Practice Problems

Problem 1: Memory Budget Calculation

Problem 2: Parallelism Configuration

Problem 3: Scaling Law Application

Problem 4: Communication Bottleneck Analysis

Problem 5: Design Challenge

Interview Cheat Sheet

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Day 3 - First Recall

Day 7 - Connections

Day 14 - Application

Day 21 - Mock Interview

Key Takeaways