Skip to main content

Distributed Training - Scaling from One GPU to a Thousand

Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Infra Eng, Research Engineer, Applied Scientist

The Real Interview Moment

You are in a Google Brain (now DeepMind) interview. The interviewer draws a single GPU on the whiteboard with a model that does not fit in memory. She says: "This model has 70 billion parameters. An A100 has 80GB of memory. Walk me through, step by step, how you would train this model across a cluster of 256 GPUs. What parallelism strategies do you use? What are the communication bottlenecks? How do you decide the optimal configuration?"

This question separates ML practitioners who have trained models at scale from those who have only trained on a single GPU. The interviewer is not looking for buzzwords like "data parallelism" or "DeepSpeed." She wants you to reason about memory budgets, communication costs, and the tradeoffs between different parallelism strategies. She wants you to know that a 70B model requires approximately 140GB in FP16 just for weights - before optimizer states, gradients, or activations - and that you need to partition these across GPUs intelligently.

Candidates who can only describe data parallelism get a "lean no-hire." Candidates who can explain the full stack - data, tensor, pipeline parallelism, ZeRO stages, communication patterns, and how to configure them - get a "strong hire."

What You Will Master

  • Calculate the memory requirements for training a model of any size
  • Compare data parallelism, tensor parallelism, and pipeline parallelism with precise tradeoffs
  • Explain ZeRO stages 1, 2, and 3 and the memory savings of each
  • Describe AllReduce, ring-allreduce, and gradient compression algorithms
  • Apply FSDP (Fully Sharded Data Parallel) and explain how it relates to ZeRO
  • Derive scaling laws relating loss to compute, data, and parameters
  • Explain Chinchilla optimal training and why it changed LLM training
  • Design a training configuration for a given model size and GPU cluster

Self-Assessment: Where Are You Now?

Skill1 - Cannot2 - Vaguely3 - Can Explain4 - Can Derive5 - Can TeachYour Score
Calculate model memory requirements___
Explain data parallelism + AllReduce___
Explain tensor parallelism___
Explain pipeline parallelism___
Describe ZeRO stages 1/2/3___
Explain ring-allreduce algorithm___
State and interpret scaling laws___
Design a multi-GPU training config___

Target: All 4s and 5s before your interview.

Part 1 - Memory Anatomy: Why One GPU Is Not Enough

Before discussing parallelism, you must understand what consumes GPU memory during training.

Memory Breakdown for Training

For a model with PP parameters trained with Adam optimizer in mixed precision:

ComponentMemory (bytes)For 7B paramsFor 70B params
Model weights (FP16)2P2P14 GB140 GB
Gradients (FP16)2P2P14 GB140 GB
Adam optimizer state: momentum (FP32)4P4P28 GB280 GB
Adam optimizer state: variance (FP32)4P4P28 GB280 GB
Master weights (FP32)4P4P28 GB280 GB
Total (no activations)16P16P112 GB1120 GB
Activations (varies)2×batch×seq×hidden\sim 2 \times \text{batch} \times \text{seq} \times \text{hidden}~10-50 GB~50-500 GB

A single A100 (80GB) cannot even hold the weights of a 70B model in FP16 (140GB), let alone the optimizer states. A 7B model fits its weights on one GPU but not the full training state (112GB > 80GB).

60-Second Answer

"Training memory has four main components: model weights, gradients, optimizer states, and activations. For a model with P parameters using Adam in mixed precision, you need approximately 16P bytes - 2P for FP16 weights, 2P for FP16 gradients, and 12P for FP32 optimizer states and master weights. A 7B model needs ~112GB just for the static state, before activations. A single 80GB A100 cannot fit this, so we need to distribute across multiple GPUs using some combination of data parallelism, tensor parallelism, and pipeline parallelism."

Activation Memory

Activations are the intermediate tensors saved during the forward pass for use in backpropagation. For a transformer with LL layers, hidden dimension HH, sequence length SS, and batch size BB:

Activation memoryL×B×S×H×2 bytes (FP16)\text{Activation memory} \approx L \times B \times S \times H \times 2 \text{ bytes (FP16)}

Activation checkpointing (gradient checkpointing) trades compute for memory: discard activations during the forward pass, recompute them during the backward pass. This reduces activation memory from O(L)O(L) to O(L)O(\sqrt{L}) at the cost of ~33% more compute.

Part 2 - Data Parallelism

The Basic Idea

Replicate the entire model on NN GPUs. Split each batch into NN micro-batches. Each GPU processes its micro-batch independently, then all GPUs synchronize gradients.

Data Parallelism - Replicate Model, Split Data, AllReduce Gradients

Requirement: Each GPU must hold the full model, full optimizer state, and gradients. Data parallelism does not reduce per-GPU memory for the model - it only divides the activation memory (smaller micro-batch per GPU).

Scaling: Throughput scales nearly linearly with the number of GPUs, as long as communication does not become the bottleneck.

AllReduce: The Communication Primitive

AllReduce computes the sum (or average) of gradients across all GPUs and distributes the result to every GPU.

Naive AllReduce: Every GPU sends its gradients to every other GPU. Communication cost: O(N2×P)O(N^2 \times P). Terrible at scale.

Ring-AllReduce: GPUs are arranged in a logical ring. The algorithm has two phases:

Phase 1 - Scatter-Reduce (N-1 steps): Each GPU sends a chunk of its gradients to the next GPU in the ring. Each GPU accumulates the incoming chunk with its own. After N1N-1 steps, each GPU holds the fully reduced result for one chunk.

Phase 2 - All-Gather (N-1 steps): Each GPU sends its fully reduced chunk around the ring until every GPU has all chunks.

Total communication: Each GPU sends and receives exactly 2(N1)/N×P2P2(N-1)/N \times P \approx 2P data. The cost is independent of NN - it does not grow with the number of GPUs.

Common Trap

Do NOT say "AllReduce is O(N) in communication." Ring-allreduce has total per-GPU bandwidth cost of 2P2P regardless of NN. The latency (number of steps) is 2(N1)2(N-1), but the bandwidth cost is constant. Interviewers at Google and Meta expect you to understand this distinction between latency and bandwidth costs.

Gradient Compression

Even with ring-allreduce, communication can be a bottleneck. Gradient compression techniques reduce the data transferred:

TechniqueCompression RatioHow It Works
FP16 gradients2xCast FP32 gradients to FP16 before communication
Top-K sparsification10-1000xOnly communicate the K largest gradient values
Random sparsification10-100xRandomly select gradient values to communicate
Quantization (1-bit)32xSend only the sign of each gradient
PowerSGD10-100xLow-rank approximation of gradient matrix
Error feedback-Accumulate compression error locally, add to next step
Company Variation

Gradient compression is a hot topic at Meta AI and Google Research. Meta's work on 1-bit Adam and Google's work on distributed optimization are frequently discussed. If you are interviewing for infrastructure/systems ML roles, be prepared to discuss the tradeoffs between compression ratio and convergence rate.

Part 3 - Model Parallelism

When a model does not fit on a single GPU even without optimizer states (inference-only), you must split the model across GPUs.

Tensor Parallelism (Intra-layer)

Split individual layers across GPUs. Each GPU computes a portion of the layer's output.

Example: splitting a linear layer Y=XWY = XW where WRd×dW \in \mathbb{R}^{d \times d}

Split WW column-wise across 2 GPUs:

W=[W1W2],W1Rd×d/2,W2Rd×d/2W = [W_1 | W_2], \quad W_1 \in \mathbb{R}^{d \times d/2}, \quad W_2 \in \mathbb{R}^{d \times d/2}

GPU 0 computes Y1=XW1Y_1 = XW_1, GPU 1 computes Y2=XW2Y_2 = XW_2. The full output is Y=[Y1Y2]Y = [Y_1 | Y_2].

For a transformer MLP block (two linear layers with a nonlinearity):

MLP(X)=GeLU(XA)B\text{MLP}(X) = \text{GeLU}(XA) \cdot B

Megatron-LM approach (Shoeybi et al., 2020):

  1. Split AA column-wise: GPU ii computes GeLU(XAi)\text{GeLU}(XA_i) - no communication needed because GeLU is element-wise
  2. Split BB row-wise: GPU ii computes GeLU(XAi)Bi\text{GeLU}(XA_i) B_i - each GPU has a partial result
  3. AllReduce to sum partial results: Y=iGeLU(XAi)BiY = \sum_i \text{GeLU}(XA_i) B_i

Result: One AllReduce per MLP block and one per attention block - total of 2 AllReduces per transformer layer.

Tensor Parallelism - Split Matrix Multiplications Across GPUs (Megatron-LM)

Best for: GPUs connected with high-bandwidth interconnects (NVLink: 900 GB/s on H100). Tensor parallelism requires very frequent communication (every layer), so it only works within a single node.

Pipeline Parallelism (Inter-layer)

Assign different layers to different GPUs. GPU 0 has layers 1-10, GPU 1 has layers 11-20, etc.

Naive approach problem - the bubble:

Most GPUs are idle most of the time. With PP pipeline stages, the bubble fraction is (P1)/P(P-1)/P.

GPipe (Huang et al., 2019): Split the mini-batch into MM micro-batches. Pipeline them through stages so that multiple micro-batches are in flight simultaneously.

Bubble fraction: (P1)/(P1+M)(P-1)/(P-1+M). With MPM \gg P, the bubble becomes negligible.

1F1B schedule (Narayanan et al., 2021 - PipeDream): Interleave forward and backward passes to further reduce the bubble and reduce peak activation memory. Each stage alternates between forward passes on new micro-batches and backward passes on completed micro-batches.

Pipeline Parallelism - GPipe Micro-batches Reduce the Bubble

Comparing Parallelism Strategies

DimensionData ParallelTensor ParallelPipeline Parallel
What is splitData (batches)Individual layersGroups of layers
Memory savingsNone (model replicated)Proportional to TP degreeProportional to PP degree
CommunicationAllReduce after backwardAllReduce every layerPoint-to-point between stages
Bandwidth needsModerateVery high (NVLink)Low
Latency sensitivityLowHighModerate
Bubble overheadNoneNone(P1)/(P1+M)(P-1)/(P-1+M)
Ideal placementAcross nodesWithin node (NVLink)Across nodes
Instant Rejection

Never say "just use data parallelism" for a model that does not fit in a single GPU's memory. Data parallelism replicates the model - it makes the memory problem worse (each GPU needs the full model plus activations). If the model does not fit, you must use model parallelism (tensor or pipeline) or ZeRO to shard the state.

Part 4 - ZeRO: Eliminating Redundancy

The Redundancy Problem in Data Parallelism

In standard data parallelism, every GPU holds a complete copy of:

  • Model weights (2P2P bytes in FP16)
  • Gradients (2P2P bytes)
  • Optimizer states (12P12P bytes for Adam with FP32 master weights)

With 64 GPUs, you have 64 copies of everything - only the activations differ. ZeRO (Zero Redundancy Optimizer) by Rajbhandari et al. (2020) eliminates this redundancy.

ZeRO Stage 1: Partition Optimizer States

Each GPU stores only 1/N1/N of the optimizer states. After computing gradients, each GPU:

  1. Reduces gradients for its partition via reduce-scatter
  2. Updates its partition of the optimizer states
  3. Broadcasts updated weights via all-gather

Memory savings: Optimizer states drop from 12P12P to 12P/N12P/N per GPU. Communication: Same as data parallelism (AllReduce = reduce-scatter + all-gather).

ZeRO Stage 2: Partition Gradients

In addition to Stage 1, each GPU stores only 1/N1/N of the gradients.

Memory savings: Gradients drop from 2P2P to 2P/N2P/N. Combined with Stage 1, static memory drops from 16P16P to 2P+2P/N+12P/N=2P+14P/N2P + 2P/N + 12P/N = 2P + 14P/N.

Communication: Same as data parallelism - reduce-scatter suffices since each GPU only needs its own gradient partition.

ZeRO Stage 3: Partition Parameters

Each GPU stores only 1/N1/N of the model parameters. When a layer needs the full parameter for forward or backward pass, it performs an all-gather to reconstruct the parameter, uses it, then discards the non-local portion.

Memory savings: Everything is partitioned. Per-GPU memory: (2P+2P+12P)/N=16P/N(2P + 2P + 12P)/N = 16P/N.

Communication cost: Additional all-gather operations during forward and backward passes. Total communication increases by 1.5x compared to standard data parallelism.

ZeRO Optimizer Stages - Eliminating Redundancy in Data Parallelism

FSDP: PyTorch's ZeRO Stage 3

Fully Sharded Data Parallel (FSDP) is PyTorch's native implementation of ZeRO Stage 3. Key features:

  • Shards parameters, gradients, and optimizer states across all GPUs
  • Uses all-gather to reconstruct parameters on-demand during forward/backward
  • Supports mixed precision with per-parameter sharding
  • Can be combined with activation checkpointing
  • Simpler API than DeepSpeed for PyTorch users

FSDP vs DeepSpeed ZeRO:

FeatureFSDP (PyTorch)DeepSpeed ZeRO
FrameworkNative PyTorchLibrary on top of PyTorch
ZeRO stagesStage 3 equivalentStage 1, 2, 3, 3+ (offload)
CPU offloadingLimitedFull support (ZeRO-Infinity)
NVMe offloadingNoYes (ZeRO-Infinity)
Ease of useBetter for PyTorch usersMore configuration options
Mixed precisionNative AMP integrationOwn mixed precision pipeline
CommunityGrowingMature
Interviewer's Perspective

When asked about distributed training, interviewers want to see that you can reason about when to use each strategy, not just what they are. A strong answer: "For a 7B model on 8 A100s, I'd use ZeRO Stage 2 - it partitions optimizer states and gradients, bringing per-GPU memory from 112GB to about 30GB, which fits in 80GB A100. I wouldn't need Stage 3 because the weights alone (14GB) fit on each GPU. For a 70B model, I'd use ZeRO Stage 3 combined with tensor parallelism within each node."

Part 5 - DeepSpeed: The Full Stack

DeepSpeed (Microsoft) provides a comprehensive distributed training framework:

Key Components

ComponentWhat It Does
ZeRO Stage 1/2/3Partition optimizer/gradient/parameter memory
ZeRO-OffloadOffload optimizer states to CPU RAM
ZeRO-InfinityOffload to CPU RAM + NVMe SSD
3D ParallelismCombine data + tensor + pipeline parallelism
Sparse AttentionEfficient attention for long sequences
Mixture of ExpertsMoE training support
CompressionGradient and communication compression

ZeRO-Infinity: Training on a Budget

ZeRO-Infinity offloads parameters, gradients, and optimizer states to CPU memory or even NVMe SSDs. This allows training models that are orders of magnitude larger than GPU memory:

  • GPU memory: Used only for active computations (forward/backward of current layer)
  • CPU memory: Stores parameters, gradients, optimizer states that are not currently needed
  • NVMe SSD: Overflow from CPU memory

Tradeoff: Significantly slower due to PCIe bandwidth bottleneck (CPU-GPU: 32 GB/s on PCIe 4.0 vs GPU-GPU NVLink: 900 GB/s on H100). Training throughput can drop by 10-50x compared to all-GPU training.

Part 6 - 3D Parallelism: Combining Everything

For training the largest models (100B+ parameters), you combine all three parallelism strategies:

3D Parallelism - Tensor + Pipeline + Data Parallelism Combined

Rule of thumb for assigning parallelism:

  1. Tensor parallelism: Within a node (NVLink bandwidth). Degree = 2, 4, or 8 (number of GPUs per node).
  2. Pipeline parallelism: Across nodes (lower bandwidth OK). Degree = number of pipeline stages.
  3. Data parallelism: Across the remaining GPUs. Degree = total GPUs / (TP degree x PP degree).

Example: 70B model on 256 A100 GPUs (32 nodes x 8 GPUs/node):

  • TP = 4 (split each layer across 4 GPUs within a node)
  • PP = 8 (8 pipeline stages across 8 groups of nodes)
  • DP = 256 / (4 x 8) = 8 (8-way data parallelism)
  • Effective batch size = DP x micro-batch size x gradient accumulation steps

Part 7 - Scaling Laws

The Kaplan Scaling Laws (OpenAI, 2020)

Kaplan et al. discovered that language model loss follows power laws:

L(N)=(NcN)αN,L(D)=(DcD)αD,L(C)=(CcC)αCL(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}

Where:

  • NN = number of parameters (excluding embeddings)
  • DD = dataset size (tokens)
  • CC = compute budget (FLOPs)
  • αN0.076\alpha_N \approx 0.076, αD0.095\alpha_D \approx 0.095, αC0.050\alpha_C \approx 0.050

Key insights:

  1. Loss decreases as a power law with N, D, and C - no diminishing returns within the studied range
  2. Larger models are more sample-efficient - they achieve the same loss with less data per parameter
  3. Model size matters more than dataset size for a fixed compute budget (Kaplan's conclusion - later challenged by Chinchilla)

The Chinchilla Scaling Laws (Hoffmann et al., 2022)

DeepMind's Chinchilla paper challenged Kaplan's conclusion. They found:

Kaplan conclusion: For a fixed compute budget, scale the model as large as possible and use whatever data fits.

Chinchilla conclusion: For a fixed compute budget, scale model size and dataset size equally. The optimal allocation is approximately:

NoptC0.5,DoptC0.5N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}

The rule of thumb: train on approximately 20 tokens per parameter.

ModelParametersKaplan-optimal tokensChinchilla-optimal tokensActual tokens
GPT-3175B300B (what they used)3.5T300B
Chinchilla70B-1.4T1.4T
Llama 270B-1.4T2T (overtrained)
Llama 370B-1.4T15T (massively overtrained)
Interviewer's Perspective

The Chinchilla paper is one of the most important results in LLM training. When asked about scaling laws, an interviewer at a top AI lab expects you to know: (1) power law relationship between loss and compute/data/params, (2) Chinchilla's "20 tokens per parameter" rule, (3) why recent models like Llama 3 are deliberately "overtrained" relative to Chinchilla - inference cost matters, because a smaller model trained on more data is cheaper to deploy than a larger model trained on less data.

Beyond Chinchilla: Inference-Aware Scaling

Chinchilla optimizes for training compute - minimizing loss for a fixed training FLOP budget. But in practice, inference cost matters too. A model that runs 1 billion inference calls should be optimized differently than one that runs 1000.

Inference-aware scaling trades more training compute (overtrain a smaller model) for lower inference cost:

Total cost=Ctrain+Ninfer×Cinfer per token×Ntokens\text{Total cost} = C_{\text{train}} + N_{\text{infer}} \times C_{\text{infer per token}} \times N_{\text{tokens}}

This is why Llama 3 (70B, 15T tokens) is trained far beyond Chinchilla-optimal - the extra training cost is amortized over billions of inference calls.

Common Trap

Do NOT say "Chinchilla proved that GPT-3 was too large." Chinchilla showed that GPT-3 was undertrained for its size - it should have been trained on more data, not that it should have been smaller. The implication is that for 300B tokens of training data, the optimal model is ~10B parameters (not 175B). But if you have enough data, a 175B model will eventually outperform a 10B model.

The Compute Formula

For transformer language models, the compute required for training is approximately:

C6NDC \approx 6ND

Where CC is in FLOPs, NN is number of parameters, and DD is number of training tokens. The factor of 6 comes from: 2 FLOPs per multiply-add in the forward pass per parameter per token, times 3 (forward + backward, where backward is approximately 2x forward).

This formula is essential for compute budgeting. To train a 70B model on 2T tokens:

C=6×70×109×2×1012=8.4×1023 FLOPsC = 6 \times 70 \times 10^9 \times 2 \times 10^{12} = 8.4 \times 10^{23} \text{ FLOPs}

On 1024 A100 GPUs at 40% MFU (Model FLOPs Utilization):

Time=8.4×10231024×312×1012×0.46.6×106 seconds76 days\text{Time} = \frac{8.4 \times 10^{23}}{1024 \times 312 \times 10^{12} \times 0.4} \approx 6.6 \times 10^{6} \text{ seconds} \approx 76 \text{ days}

Part 8 - Practical Guide: Scaling from 1 GPU to 1000

Decision Tree

Scaling Decision Tree - 1 GPU to 1000+ GPUs

Step-by-Step Scaling Guide

Level 1: 1 GPU

  • Standard single-GPU training
  • Mixed precision (BF16) to double effective memory
  • Gradient accumulation for larger effective batch size
  • Activation checkpointing if memory-tight

Level 2: 2-8 GPUs (single node)

  • PyTorch DDP (DistributedDataParallel)
  • NVLink interconnect for fast communication
  • Linear throughput scaling expected
  • If model does not fit: ZeRO Stage 2 (FSDP)

Level 3: 8-64 GPUs (multi-node)

  • DDP across nodes with InfiniBand/RoCE
  • Communication becomes significant - overlap gradient communication with backward computation
  • Consider ZeRO Stage 2/3 for memory-constrained setups
  • Gradient compression if network bandwidth is limited

Level 4: 64-512 GPUs

  • Tensor parallelism within nodes (TP=4 or TP=8)
  • Data parallelism across nodes
  • Careful batch size scaling with learning rate warmup
  • Monitor GPU utilization - aim for >50% MFU

Level 5: 512-10000+ GPUs

  • Full 3D parallelism (TP + PP + DP)
  • Pipeline parallelism across node groups
  • Failure recovery: checkpoint frequently, use elastic training
  • Network topology awareness: place communicating GPUs on same switch
  • Typically only frontier model training (GPT-4, Gemini, Llama 3 scale)

Model FLOPs Utilization (MFU)

MFU measures how efficiently you use the GPU's theoretical peak FLOPS:

MFU=Actual training FLOPSTheoretical peak FLOPS of hardware\text{MFU} = \frac{\text{Actual training FLOPS}}{\text{Theoretical peak FLOPS of hardware}}

SetupTypical MFU
Single GPU, well-optimized50-60%
8 GPUs, DDP within node45-55%
64 GPUs, multi-node DDP35-50%
256 GPUs, 3D parallelism30-45%
1000+ GPUs, frontier training25-40%

Why MFU is never 100%: Communication overhead, pipeline bubbles, memory bandwidth bottlenecks, kernel launch overhead, idle time during synchronization.

Practice Problems

Problem 1: Memory Budget Calculation

You have a 13B parameter model and 8 A100 GPUs (80GB each). Calculate the memory requirements and determine the minimum ZeRO stage needed. Assume Adam optimizer, BF16 mixed precision, and no activation checkpointing.

Hint 1 - Direction

Calculate the per-GPU memory for each ZeRO stage. Remember: weights (2P), gradients (2P), optimizer states (12P) for mixed precision Adam.

Hint 2 - Insight

Total static memory: 16×13B=208GB16 \times 13\text{B} = 208\text{GB}. Per GPU with no sharding: 208GB (does not fit in 80GB). With ZeRO Stage 1: optimizer states sharded, so per GPU = 2P+2P+12P/8=4P+1.5P=71.5GB2P + 2P + 12P/8 = 4P + 1.5P = 71.5\text{GB}. Activations will add more.

Hint 3 - Full Solution + Rubric

Memory per component:

  • Weights (BF16): 2×13B=26GB2 \times 13\text{B} = 26\text{GB}
  • Gradients (BF16): 2×13B=26GB2 \times 13\text{B} = 26\text{GB}
  • Adam momentum (FP32): 4×13B=52GB4 \times 13\text{B} = 52\text{GB}
  • Adam variance (FP32): 4×13B=52GB4 \times 13\text{B} = 52\text{GB}
  • Master weights (FP32): 4×13B=52GB4 \times 13\text{B} = 52\text{GB}
  • Total static: 208GB

Per-GPU memory by ZeRO stage (8 GPUs):

StageWeightsGradientsOptimizerTotal StaticFits in 80GB?
None (DDP)26 GB26 GB156 GB208 GBNo
Stage 126 GB26 GB19.5 GB71.5 GBTight with activations
Stage 226 GB3.25 GB19.5 GB48.75 GBYes
Stage 33.25 GB3.25 GB19.5 GB26 GBYes, generous

Answer: ZeRO Stage 1 gives 71.5GB of static memory, leaving only ~8.5GB for activations - likely insufficient. ZeRO Stage 2 gives 48.75GB, leaving ~31GB for activations - comfortable for moderate batch sizes. Stage 2 is the minimum viable stage.

If larger batch sizes or longer sequences are needed, use Stage 3 or add activation checkpointing.

Scoring Rubric:

  • Strong Hire: Correct calculation for all components, identifies Stage 2 as minimum, accounts for activation memory, mentions activation checkpointing as an option.
  • Lean Hire: Gets the right answer but makes arithmetic errors or forgets a memory component.
  • No Hire: Cannot calculate memory requirements or does not know what ZeRO stages do.

Problem 2: Parallelism Configuration

You need to train a 70B parameter model on 128 A100 GPUs (16 nodes, 8 GPUs per node, NVLink within node, InfiniBand across nodes). Design the parallelism configuration.

Hint 1 - Direction

Start by checking if the model fits within a single node using tensor parallelism. Then decide pipeline and data parallelism degrees.

Hint 2 - Insight

70B model weights in BF16: 140GB. 8 GPUs per node x 80GB = 640GB per node. With TP=8, weights per GPU = 17.5GB - fits easily. But you also need optimizer states. With TP=8 within a node, you still need the full optimizer state per TP group. ZeRO Stage 1 or 2 across the TP group can help.

Hint 3 - Full Solution + Rubric

Configuration:

Tensor parallelism (TP) = 4 within each node:

  • Weights per GPU: 140GB / 4 = 35GB
  • Uses NVLink (900 GB/s) for the frequent intra-layer communication
  • TP=4 instead of TP=8 to leave room for data parallelism within the node

Pipeline parallelism (PP) = 4 across nodes:

  • Model split into 4 stages of ~17.5B params each
  • Each stage spans 4 nodes (one TP group per stage per DP rank)
  • Inter-node communication via InfiniBand - pipeline parallelism is latency-tolerant

Data parallelism (DP) = 128 / (4 x 4) = 8:

  • 8 data-parallel replicas
  • ZeRO Stage 1 within the DP group to shard optimizer states

Memory per GPU:

  • Weights (TP=4): ~35GB
  • Optimizer states (ZeRO-1, DP=8): ~4.9GB
  • Gradients: ~8.75GB (sharded across TP group)
  • Total static: ~49GB, leaving ~31GB for activations

Pipeline configuration:

  • Micro-batches: 32 (to minimize bubble overhead)
  • Bubble fraction: (41)/(41+32)8.6%(4-1)/(4-1+32) \approx 8.6\%
  • Use 1F1B schedule to reduce peak activation memory

Alternative valid configurations:

  • TP=8, PP=2, DP=8 - more TP, less pipeline overhead
  • TP=4, PP=8, DP=4 - more pipeline stages, less data parallelism
  • TP=4, PP=1, DP=32 with ZeRO Stage 3 - pure sharded data parallelism

Scoring Rubric:

  • Strong Hire: Provides a specific configuration with TP/PP/DP degrees, justifies each choice (NVLink for TP, InfiniBand tolerance for PP), calculates memory, addresses pipeline bubble, mentions alternatives.
  • Lean Hire: Suggests a reasonable configuration but cannot justify the choices or calculate memory.
  • No Hire: Cannot combine multiple parallelism strategies or suggests only data parallelism for a 70B model.

Problem 3: Scaling Law Application

Your company has a fixed training compute budget of 102210^{22} FLOPs. Using Chinchilla scaling laws, determine the optimal model size and dataset size. Then explain why your company might choose to deviate from the optimal.

Hint 1 - Direction

Chinchilla rule of thumb: 20 tokens per parameter. The compute is approximately C6NDC \approx 6ND for transformers (6 FLOPs per parameter per token).

Hint 2 - Insight

From C=6NDC = 6ND and D=20ND = 20N: C=120N2C = 120N^2, so N=C/120N = \sqrt{C/120}. With C=1022C = 10^{22}: N=1022/1202.9×1093BN = \sqrt{10^{22}/120} \approx 2.9 \times 10^{9} \approx 3\text{B} parameters. Dataset: D=20×3B=60BD = 20 \times 3\text{B} = 60\text{B} tokens. But you might want a smaller model trained on more data for inference efficiency.

Hint 3 - Full Solution + Rubric

Chinchilla-optimal calculation:

Given C=1022C = 10^{22} FLOPs, using C6NDC \approx 6ND and D20ND \approx 20N:

C=6N×20N=120N2C = 6N \times 20N = 120N^2 N=C120=10221202.9×1093B parametersN = \sqrt{\frac{C}{120}} = \sqrt{\frac{10^{22}}{120}} \approx 2.9 \times 10^9 \approx 3\text{B parameters} D=20N60B tokensD = 20N \approx 60\text{B tokens}

Reasons to deviate:

  1. Inference cost dominance: If the model serves millions of users, a smaller model (e.g., 1B) trained on more data (e.g., 1.7T tokens) may have higher total cost-efficiency because inference cost = O(N)O(N) per request.

  2. Data availability: You may not have 60B high-quality tokens. If limited to 20B tokens, a smaller model (~1B) that is not overtrained on repeated data may perform better.

  3. Latency requirements: A 3B model may not meet latency SLAs for real-time applications. A 1B model might, even if it has slightly higher loss.

  4. Emergent capabilities: Some capabilities only appear at certain model sizes. If you need chain-of-thought reasoning, you might need a larger model even if Chinchilla says otherwise.

  5. Fine-tuning plans: If you plan to fine-tune for specific tasks, a larger pre-trained model often fine-tunes better even with the same pre-training compute.

Scoring Rubric:

  • Strong Hire: Correct Chinchilla calculation, provides 3+ valid reasons to deviate, mentions inference cost as the primary modern reason, connects to real examples (Llama 3 strategy).
  • Lean Hire: Gets the calculation right but provides only 1-2 reasons to deviate.
  • No Hire: Cannot perform the calculation or does not know what Chinchilla scaling laws are.

Problem 4: Communication Bottleneck Analysis

You are training with 64 GPUs across 8 nodes, using DDP with ring-allreduce. Your model has 7B parameters in BF16 gradients. The inter-node network bandwidth is 100 Gbps per node. Calculate the time for one AllReduce operation and determine if communication is the bottleneck.

Hint 1 - Direction

Ring-allreduce per-GPU bandwidth cost is approximately 2P2P bytes regardless of NN. But the bottleneck is the slowest link in the ring. With 8 nodes, the ring must cross inter-node links.

Hint 2 - Insight

Data to transfer per GPU: 2×7B×2 bytes=28GB2 \times 7\text{B} \times 2 \text{ bytes} = 28\text{GB} (the factor of 2 is from the two phases of ring-allreduce). Network bandwidth: 100 Gbps = 12.5 GB/s. Time: 28/12.52.228 / 12.5 \approx 2.2 seconds. If a forward+backward pass takes about 4.5 seconds, communication is 33% of total time - a significant bottleneck.

Hint 3 - Full Solution + Rubric

Calculation:

Gradient size: 7B×2 bytes (BF16)=14GB7\text{B} \times 2 \text{ bytes (BF16)} = 14\text{GB}

Ring-allreduce total per-GPU transfer: 2(N1)N×14GB2×14=28GB\frac{2(N-1)}{N} \times 14\text{GB} \approx 2 \times 14 = 28\text{GB} (for N=641N=64 \gg 1)

Inter-node bandwidth: 100 Gbps=12.5 GB/s100 \text{ Gbps} = 12.5 \text{ GB/s}

Time for AllReduce: 28GB/12.5 GB/s=2.24 seconds28\text{GB} / 12.5\text{ GB/s} = 2.24 \text{ seconds}

Is this a bottleneck?

Estimated forward+backward time for 7B model on A100:

  • A100 BF16 peak: 312 TFLOPS
  • Approximate compute per step with batch size 8 and seq_len 2048: ~4.5 seconds
  • Communication fraction: 2.24/(4.5+2.24)33%2.24 / (4.5 + 2.24) \approx 33\%

This is a significant bottleneck. Mitigation strategies:

  1. Overlap communication with computation: Start allreduce for earlier layers while computing gradients for later layers. Can hide 50-80% of communication time.
  2. Gradient compression: Top-K sparsification or PowerSGD can reduce transfer by 10-100x.
  3. Gradient accumulation: Accumulate gradients over multiple micro-batches before allreduce, amortizing the communication cost.
  4. Faster network: 400 Gbps InfiniBand (standard in modern GPU clusters) would reduce communication to 0.56 seconds.

Scoring Rubric:

  • Strong Hire: Correct calculation of AllReduce time, determines communication fraction, proposes mitigation strategies including overlap and compression, mentions real network bandwidth numbers.
  • Lean Hire: Gets the approximate time right but cannot analyze whether it is a bottleneck or propose mitigations.
  • No Hire: Cannot calculate ring-allreduce communication cost or does not understand the concept.

Problem 5: Design Challenge

You are the tech lead for training a new 30B parameter multilingual language model. You have 512 H100 GPUs for 3 months. Design the complete distributed training setup: parallelism strategy, compute budget, and failure recovery.

Hint 1 - Direction

Start with the compute budget. 512 H100s for 3 months at ~40% MFU. How many FLOPs? Then use scaling laws to determine if 30B is well-allocated. Then design the parallelism.

Hint 2 - Insight

Compute budget: 512 GPUs x ~1000 TFLOPS (H100 BF16) x 0.4 MFU x 90 days x 86400 s/day = ~1.6×10241.6 \times 10^{24} FLOPs. Chinchilla-optimal for this budget: NC/120115BN \approx \sqrt{C/120} \approx 115\text{B}. Training a 30B model means you are overtraining by ~4x, which is fine for inference efficiency. Data needed: C/(6N)8.9T tokensC/(6N) \approx 8.9\text{T tokens}.

Hint 3 - Full Solution + Rubric

Compute budget:

  • 512 H100 GPUs x 989 TFLOPS (BF16 peak) x 0.40 MFU = ~202,752 TFLOPS effective
  • Training time: 90 days x 86,400 s/day = 7,776,000 seconds
  • Total FLOPs: ~1.58×10241.58 \times 10^{24} FLOPs

Scaling law analysis:

  • Chinchilla-optimal N for this budget: 1.58×1024/120115B params\sqrt{1.58 \times 10^{24} / 120} \approx 115\text{B params}
  • We are training 30B - about 3.8x smaller than Chinchilla-optimal
  • This is deliberate: 30B is a practical deployment size, and overtraining improves inference efficiency
  • Required tokens: C/(6N)=1.58×1024/(6×30×109)8.8T tokensC / (6N) = 1.58 \times 10^{24} / (6 \times 30 \times 10^9) \approx 8.8\text{T tokens}

Parallelism configuration (512 H100s = 64 nodes x 8 GPUs):

  • TP = 4 (within node via NVLink)
  • PP = 2 (across 2 groups of nodes, minimal bubble)
  • DP = 512 / (4 x 2) = 64 (64-way data parallelism)
  • ZeRO Stage 1 within DP groups (partition optimizer states)

Memory per GPU:

  • Weights (TP=4): 30B x 2 / 4 = 15GB
  • Optimizer (ZeRO-1, DP=64): ~1.4GB
  • Gradients (TP=4): ~15GB
  • Total static: ~32GB, leaving ~48GB for activations on 80GB H100

Failure recovery:

  • Checkpoint every 1000 steps (~75 minutes)
  • Async checkpoint to shared NFS / object storage
  • Elastic training: handle up to 5% GPU failures without restart
  • Automatic restart from latest checkpoint on node failure
  • Watchdog: alert if loss spikes >2x or MFU drops below 30%

Scoring Rubric:

  • Strong Hire: Complete compute budget, scaling law analysis, specific parallelism configuration with justification, memory calculation, failure recovery plan.
  • Lean Hire: Provides a reasonable parallelism configuration but missing scaling law analysis or failure recovery.
  • No Hire: Cannot design a multi-dimensional parallelism configuration or does not understand the scale of the problem.

Interview Cheat Sheet

ConceptKey Formula/NumberOne-LinerRed Flag
Training memory (Adam, mixed prec.)16P16P bytes totalWeights + grads + optimizer + master weights"Training memory = model size"
Data parallelismReplicate model, split dataAllReduce gradients after backward"DP reduces model memory"
Ring-allreduce bandwidth2P2P bytes per GPUConstant cost regardless of GPU count"AllReduce is O(N)"
Tensor parallelismSplit layers across GPUsRequires NVLink, intra-node only"TP works across nodes"
Pipeline parallelismSplit layer groups across GPUsBubble = (P1)/(P1+M)(P-1)/(P-1+M)"No bubble with enough GPUs"
ZeRO Stage 1Shard optimizer statesMemory: 4P+12P/N4P + 12P/N"ZeRO removes all redundancy" (only Stage 3)
ZeRO Stage 3 / FSDPShard everythingMemory: 16P/N16P/N, 1.5x comm cost"FSDP is free"
Chinchilla rule20 tokens per parameterScale data and params equally"Bigger model always better"
Compute formulaC6NDC \approx 6ND6 FLOPs per param per tokenNot knowing the 6ND formula
MFUActual FLOPS / Peak FLOPS30-50% is typical at scale"MFU should be 90%+"
Activation checkpointingRecompute vs store~33% more compute, O(L)O(\sqrt{L}) memory"Free memory savings"

Spaced Repetition Checkpoints

Day 0 - Initial Learning

  • Read this entire page
  • Calculate memory requirements for a 7B and 70B model from memory
  • Draw the ring-allreduce algorithm on paper
  • Write out the ZeRO stages and memory savings for each
  • State the Chinchilla scaling law and the 20-tokens-per-parameter rule

Day 3 - First Recall

  • Without notes, calculate memory for a 13B model with Adam mixed precision
  • Explain data vs tensor vs pipeline parallelism in 2 minutes, aloud
  • Give the "60-Second Answer" for model memory out loud, timed
  • Derive the Chinchilla-optimal model size for 102310^{23} FLOPs

Day 7 - Connections

  • Design a parallelism configuration for 70B on 256 GPUs (write it out)
  • Explain why ring-allreduce bandwidth is independent of GPU count
  • Do Practice Problem 1 (memory budget) without looking at hints
  • Explain inference-aware scaling and how it changes the Chinchilla recommendation

Day 14 - Application

  • Do Practice Problem 2 (parallelism configuration) under timed conditions (15 minutes)
  • Do Practice Problem 4 (communication bottleneck) with full calculations
  • Design a training setup for a model size and GPU count of your choice
  • Review concepts you hesitated on

Day 21 - Mock Interview

  • Have someone ask: "How would you train a 70B model on 512 GPUs? Walk me through everything."
  • Time yourself: full answer should take 8-12 minutes covering memory, parallelism, scaling, and failure recovery
  • Do all 5 practice problems in sequence under timed conditions (60 minutes total)
  • Can you whiteboard the 3D parallelism diagram from memory?

Key Takeaways

  1. Memory is the first constraint. Before choosing a parallelism strategy, calculate the exact memory requirements. The 16P16P formula for Adam mixed precision is your starting point.

  2. Different parallelism strategies serve different purposes. Data parallelism scales throughput. Tensor parallelism fits layers on multiple GPUs. Pipeline parallelism fits models across nodes. ZeRO eliminates memory redundancy. Most large-scale training uses a combination of all four.

  3. Communication is the scaling bottleneck. Ring-allreduce is bandwidth-optimal but latency-limited. Tensor parallelism requires NVLink. Pipeline parallelism tolerates lower bandwidth. Understanding these tradeoffs is what makes you a strong distributed training engineer.

  4. Chinchilla changed the game. The 20-tokens-per-parameter rule optimizes for training compute. Inference-aware scaling (overtraining smaller models) optimizes for deployment cost. Know both perspectives and when each applies.

  5. Failure recovery is not optional at scale. With 1000 GPUs, hardware failures are frequent (multiple per day). Checkpoint frequently, use elastic training, and monitor aggressively.

© 2026 EngineersOfAI. All rights reserved.