Training Techniques - From Initialization to Distillation

Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, Applied Scientist

The Real Interview Moment

You are in a Meta MLE on-site. The interviewer puts up a training loss curve on the screen - the loss plateaus after the first epoch and never improves. She asks: "Debug this. What could cause this, and how would you fix it?" You start listing possibilities: bad learning rate, dead ReLUs, poor initialization. She nods, then asks: "Walk me through exactly how you'd initialize a 50-layer ResNet. What initialization scheme, what numerical precision, and what gradient management strategy?"

This is where candidates separate. Junior engineers say "use default PyTorch settings." Mid-level candidates mention Xavier or He initialization. Strong candidates explain why each technique exists - the variance propagation problem, the symmetry breaking requirement, the numerical stability guarantees - and can derive the initialization formulas from first principles. They know when to use FP16 vs BF16, why loss scaling exists, and how knowledge distillation can compress a model without sacrificing accuracy.

This page teaches you every training technique that interviewers expect you to know, with the mathematical depth to survive follow-up questions.

What You Will Master

  • Derive Xavier and He initialization from variance preservation principles
  • Explain why symmetry breaking is necessary and what happens without it
  • Implement gradient clipping (by norm and by value) and explain when each is needed
  • Compare FP16, BF16, and FP32 training with precise understanding of overflow/underflow
  • Design a mixed precision training pipeline with loss scaling
  • Apply curriculum learning strategies and explain when they help
  • Build a knowledge distillation pipeline with temperature-scaled soft labels
  • Use label smoothing and explain its regularization effect mathematically
  • Debug common training failures caused by each technique

Self-Assessment: Where Are You Now?

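| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Derive Xavier initialization | | | | | | ___ |
| Derive He initialization | | | | | | ___ |
| Explain symmetry breaking | | | | | | ___ |
| Implement gradient clipping | | | | | | ___ |
| Explain FP16 vs BF16 tradeoffs | | | | | | ___ |
| Design a mixed precision pipeline | | | | | | ___ |
| Explain knowledge distillation loss | | | | | | ___ |
| Apply label smoothing mathematically | | | | | | ___ |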

Target: All 4s and 5s before your interview.

Part 1 - Weight Initialization: Why It Matters

The Symmetry Problem

Consider a neural network where all weights are initialized to the same value (say, zero). What happens during forward and backward passes?

Forward pass: Every neuron in a layer computes the same output because all weights are identical. The layer effectively has one neuron, not hundreds.

Backward pass: Every neuron receives the same gradient. The weight update is identical for all neurons. They remain identical forever.

This is called the symmetry problem. No matter how long you train, neurons in the same layer will never differentiate. The network has the representational capacity of a single neuron per layer.
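A minimal sketch of the symmetry problem, assuming a tiny two-layer network (tanh hidden layer, linear output, squared-error loss) written in plain Python. The network shape, the constant 0.5, and the helper name `hidden_weight_grads` are illustrative choices, not taken from any library.

```python
import math

def hidden_weight_grads(W1, w2, x, target):
    """Return dLoss/dW1 for h = tanh(W1 @ x), y = w2 . h, loss = 0.5 * (y - target)^2."""
    # Forward pass
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    h = [math.tanh(p) for p in pre]
    y = sum(w * hj for w, hj in zip(w2, h))
    # Backward pass (chain rule by hand)
    dy = y - target
    grads = []
    for j in range(len(W1)):
        dpre = dy * w2[j] * (1.0 - h[j] ** 2)  # gradient at hidden unit j's pre-activation
        grads.append([dpre * xi for xi in x])
    return grads

x = [1.0, -2.0, 0.5]
# Every weight starts at the same constant, so all hidden units are clones.
W1_same = [[0.5] * 3 for _ in range(4)]
w2_same = [0.5] * 4
grads_same = hidden_weight_grads(W1_same, w2_same, x, target=1.0)

# Each hidden unit receives exactly the same gradient row,
# so the rows of W1 remain identical after any gradient step.
all_rows_equal = all(row == grads_same[0] for row in grads_same)
print("identical gradient for every hidden unit:", all_rows_equal)
```

With an all-zero initialization the situation is even worse in this toy network: the output weights are zero, so the hidden-layer gradients are zero as well.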

Instant Rejection

Never say "initialize all weights to zero" or "initialize all weights to the same value." This is a fundamental error that signals you do not understand neural network training. Even saying "small random values" without specifying the distribution will raise concerns.

The Variance Propagation Problem

Even with random initialization, choosing the wrong scale causes problems. Consider a layer $y = Wx$ where $W \in \mathbb{R}^{n_{out} \times n_{in}}$ and $x \in \mathbb{R}^{n_{in}}$.

Each output neuron computes:

$$y_j = \sum_{i=1}^{n_{in}} w_{ji} x_i$$

Assuming $w_{ji}$ and $x_i$ are independent with zero mean:

$$\text{Var}(y_j) = n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)$$
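The variance formula follows in a few steps. This worked derivation uses the independence and zero-mean assumptions stated above, plus the added assumption that all weights share one distribution and all inputs share another.

```latex
\begin{aligned}
\text{Var}(y_j) &= \text{Var}\left(\sum_{i=1}^{n_{in}} w_{ji} x_i\right)
  = \sum_{i=1}^{n_{in}} \text{Var}(w_{ji} x_i)
  && \text{(terms are uncorrelated: weights are independent, zero mean)} \\
\text{Var}(w_{ji} x_i) &= E[w_{ji}^2]\,E[x_i^2] - \left(E[w_{ji}]\,E[x_i]\right)^2
  = \text{Var}(w)\,\text{Var}(x)
  && \text{(independence, zero means)} \\
\Rightarrow\ \text{Var}(y_j) &= n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)
\end{aligned}
```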

If $\text{Var}(w) > 1/n_{in}$, the variance of activations grows exponentially with depth. After 50 layers, activations explode to infinity.

If $\text{Var}(w) < 1/n_{in}$, the variance shrinks exponentially. After 50 layers, activations collapse to zero.

[Figure: Weight Initialization - Xavier vs He: Keeping Variance Stable Across Layers]

The same problem occurs in the backward pass with gradients. This is why proper initialization is critical for deep networks.

Xavier (Glorot) Initialization

Paper: Glorot & Bengio, 2010 - "Understanding the difficulty of training deep feedforward neural networks"

Derivation:

We want the variance of activations to be preserved across layers in both the forward and backward pass.

Forward pass requirement: $\text{Var}(y) = \text{Var}(x)$, which gives $\text{Var}(w) = 1/n_{in}$.

Backward pass requirement: $\text{Var}(\delta x) = \text{Var}(\delta y)$, which gives $\text{Var}(w) = 1/n_{out}$.

Xavier initialization compromises:

$$\text{Var}(w) = \frac{2}{n_{in} + n_{out}}$$

For a uniform distribution: $w \sim U\left[-\sqrt{\frac{6}{n_{in}+n_{out}}},\ \sqrt{\frac{6}{n_{in}+n_{out}}}\right]$

For a normal distribution: $w \sim \mathcal{N}\left(0,\ \frac{2}{n_{in}+n_{out}}\right)$
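The uniform bound comes from matching the variance of a symmetric uniform distribution to the target variance. As a short worked check (the symbol $a$ for the bound is introduced here only for illustration):

```latex
w \sim U[-a, a] \ \Rightarrow\ \text{Var}(w) = \frac{(2a)^2}{12} = \frac{a^2}{3}
\qquad\text{so}\qquad
\frac{a^2}{3} = \frac{2}{n_{in} + n_{out}}
\ \Rightarrow\ a = \sqrt{\frac{6}{n_{in} + n_{out}}}
```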

Interviewer's Perspective

When an interviewer asks about Xavier initialization, they want three things: (1) the motivation - variance preservation, (2) the derivation - why the specific formula, and (3) the assumption - that activations are linear. Mentioning that Xavier assumes linear activations (or symmetric activations like tanh) and explaining why this matters for ReLU will immediately set you apart.

When Xavier works: tanh or linear activations. These are symmetric around zero and roughly linear near the origin, so the zero-mean assumption in the derivation holds approximately. Sigmoid is often paired with Xavier as well, but its outputs are not zero-centered, so the assumption holds less well there.

When Xavier fails: ReLU activations. ReLU zeros out half the inputs, effectively halving the variance. With $n_{in} = n_{out}$, Xavier initialization causes variance to shrink by a factor of 2 per layer, so a 50-layer ReLU network initialized with Xavier will have activation variance of roughly $2^{-50}$ times the input variance.

He (Kaiming) Initialization

Paper: He et al., 2015 - "Delving Deep into Rectifiers"

He initialization accounts for ReLU's variance-halving effect. Since ReLU zeros out approximately half the neurons (those with negative input), the effective fan-in is halved:

$$\text{Var}(w) = \frac{2}{n_{in}}$$

For a normal distribution: $w \sim \mathcal{N}\left(0,\ \frac{2}{n_{in}}\right)$

Why the factor of 2? ReLU passes only positive values. For a zero-mean, symmetric pre-activation $y$, half the values are positive, so the second moment after ReLU is half the pre-activation variance: $E[\max(0, y)^2] = \frac{1}{2}\text{Var}(y)$. It is this second moment that feeds the next layer's variance. To compensate, we double the initialization variance.
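The effect can be checked empirically. The following is a small simulation sketch in plain Python, assuming square layers (width 64), 30 ReLU layers, and a single random Gaussian input; the function names and sizes are illustrative. It reports the mean square of the final activations relative to the mean square of the input.

```python
import math
import random

def xavier_var(n_in, n_out):
    return 2.0 / (n_in + n_out)

def he_var(n_in, n_out):
    return 2.0 / n_in

def relu_signal_ratio(weight_var_fn, width=64, depth=30, seed=0):
    """Push one random input through `depth` ReLU layers and return
    (mean square of final activations) / (mean square of the input)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    input_ms = sum(v * v for v in x) / width
    std = math.sqrt(weight_var_fn(width, width))
    for _ in range(depth):
        # One fresh Gaussian weight row per output unit, sampled on the fly.
        x = [max(0.0, sum(rng.gauss(0.0, std) * v for v in x)) for _ in range(width)]
    return (sum(v * v for v in x) / width) / input_ms

xavier_ratio = relu_signal_ratio(xavier_var)
he_ratio = relu_signal_ratio(he_var)
print(f"Xavier + ReLU after 30 layers: {xavier_ratio:.3e}")
print(f"He + ReLU after 30 layers:     {he_ratio:.3e}")
```

Because both runs share a seed and ReLU is positively homogeneous, the two ratios differ by almost exactly $2^{30}$: the Xavier signal has all but vanished, while the He signal stays within a few orders of magnitude of 1 (a single width-64 sample is noisy).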

60-Second Answer

"Weight initialization controls the scale of activations and gradients through the network. Xavier initialization sets variance to $2/(n_{in} + n_{out})$ to preserve variance in both directions, assuming linear activations. He initialization sets variance to $2/n_{in}$ to account for ReLU zeroing out half the neurons. Using the wrong initialization for your activation function causes either exploding or vanishing activations, making training impossible or extremely slow. For modern networks with ReLU variants, I default to He initialization. For transformers with LayerNorm, the choice is less critical because normalization stabilizes variance regardless."

LSUV - Layer-Sequential Unit-Variance Initialization

Paper: Mishkin & Matas, 2016

LSUV is a data-driven initialization method:

  1. Initialize weights with orthogonal matrices (preserves norms)
  2. For each layer sequentially:
     a. Pass a mini-batch through the network
     b. Measure the variance of the layer's output
     c. Scale the weights so the output variance equals 1.0
     d. Repeat for the next layer

Advantage over Xavier/He: Works for any activation function and any architecture without deriving activation-specific formulas. Particularly useful for exotic architectures where the variance propagation analysis is complex.

Disadvantage: Requires a forward pass during initialization, adding startup cost.
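The per-layer rescaling loop can be sketched in plain Python. This is a simplified illustration, not the paper's reference implementation: it assumes fully connected layers stored as lists of rows, a tanh activation, and, for brevity, a Gaussian starting point instead of the orthogonal initialization that LSUV prescribes. The names `lsuv_init`, `matvec`, and `variance` are hypothetical helpers.

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def lsuv_init(layers, batch, activation=math.tanh, tol=0.01, max_iters=10):
    """Rescale each weight matrix in place so its output has unit variance
    on `batch`, working through the layers in order."""
    inputs = batch
    for W in layers:
        for _ in range(max_iters):
            outputs = [matvec(W, x) for x in inputs]
            var = variance([v for out in outputs for v in out])
            if abs(var - 1.0) < tol:
                break
            scale = 1.0 / math.sqrt(var)
            for row in W:
                row[:] = [w * scale for w in row]
        # Feed the activated outputs of the calibrated layer to the next layer.
        outputs = [matvec(W, x) for x in inputs]
        inputs = [[activation(v) for v in out] for out in outputs]
    return layers

rng = random.Random(0)
width, depth = 16, 5
# Deliberately badly scaled Gaussian start (assumption: stands in for orthogonal init).
layers = [[[rng.gauss(0.0, 3.0) for _ in range(width)] for _ in range(width)]
          for _ in range(depth)]
batch = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(32)]
lsuv_init(layers, batch)
```

For a purely linear layer one rescaling is enough; the iteration limit matters for layer types where output variance is not exactly proportional to the square of the weight scale.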

Common Trap

Do NOT confuse LSUV with batch normalization. Both aim to normalize activations, but LSUV is a one-time initialization procedure while BatchNorm is applied during every forward pass. LSUV sets the initial scale; BatchNorm maintains it throughout training. An interviewer may ask you to compare them.

Initialization for Specific Architectures

| Architecture | Recommended Init | Why |
|---|---|---|
| MLP with ReLU | He normal | Accounts for ReLU variance halving |
| MLP with tanh/sigmoid | Xavier normal | Assumes symmetric activations |
| ResNet | He normal + zero-init residual branch | Ensures residual block starts as identity |
| Transformer | Xavier normal scaled by $1/\sqrt{2N}$ for $N$ layers | GPT-2 convention, prevents output variance growth |
| Embedding layers | Normal with small std (0.01-0.02) | Empirical, no strong theoretical basis |
| LSTM | Orthogonal for recurrent weights, Xavier for input weights | Preserves gradient norms through time steps |

Part 2 - Gradient Clipping

Why Gradients Explode

During backpropagation, gradients are multiplied through layers. If the Jacobian of each layer has spectral norm greater than 1, gradients grow exponentially:

$$\|\nabla_{W_1} L\| \approx \prod_{l=1}^{L} \|J_l\| \cdot \|\nabla_{W_L} L\|$$

For $L = 50$ layers with $\|J_l\| = 1.1$, the gradient magnification is $1.1^{50} \approx 117$. With $\|J_l\| = 2$, it becomes $2^{50} \approx 10^{15}$.

Clipping Strategies

Clip by value: Cap each gradient element independently.

$$g_i \leftarrow \text{clip}(g_i, -\tau, \tau)$$

  • Simple but changes the direction of the gradient vector
  • Rarely used in practice

Clip by global norm (standard approach):

$$g \leftarrow g \cdot \frac{\tau}{\max(\|g\|, \tau)}$$

  • Preserves gradient direction, only scales magnitude
  • This is what PyTorch's torch.nn.utils.clip_grad_norm_ does
  • Standard values: $\tau = 1.0$ for transformers, $\tau = 5.0$ for RNNs
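Both strategies fit in a few lines. The sketch below is framework-free Python that treats gradients as a list of flat lists, one per parameter. The function names are illustrative and are not PyTorch's API; in PyTorch the equivalent calls are `torch.nn.utils.clip_grad_value_` and `torch.nn.utils.clip_grad_norm_`.

```python
import math

def clip_by_value(grads, tau):
    """Clamp every gradient element to [-tau, tau] (can change the direction)."""
    return [[max(-tau, min(tau, g)) for g in param] for param in grads]

def global_norm(grads):
    """L2 norm over all gradient elements of all parameters."""
    return math.sqrt(sum(g * g for param in grads for g in param))

def clip_by_global_norm(grads, tau):
    """Rescale all gradients by tau / max(||g||, tau) (keeps the direction)."""
    scale = tau / max(global_norm(grads), tau)
    return [[g * scale for g in param] for param in grads]

# Two "parameters" with flattened gradients; global norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
by_norm = clip_by_global_norm(grads, tau=1.0)   # same direction, norm 1
by_value = clip_by_value(grads, tau=1.0)        # every element becomes 1.0
print(global_norm(by_norm), by_value)
```

Clipping by value turns the 3:4:12 gradient into 1:1:1, which is a different direction. Clipping by global norm keeps the 3:4:12 ratio and only shrinks the length.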

[Figure: Gradient Clipping by Global Norm - Scale if Exceeds Threshold, Preserve Direction]

Company Variation

Google and DeepMind interviews often ask about gradient clipping in the context of transformer training. They expect you to know that gradient clipping is essential for transformer training (not optional) and that the typical threshold is 1.0. Meta asks about it in the context of RNN/LSTM training where exploding gradients are more dramatic.

Gradient Clipping vs. Other Solutions

| Problem | Gradient Clipping | Proper Initialization | Normalization Layers | Residual Connections |
|---|---|---|---|---|
| Exploding gradients | Directly caps | Prevents initially | Prevents during training | Prevents via shortcut |
| Vanishing gradients | Does NOT help | Partially prevents | Partially prevents | Directly solves |
| Computational cost | Negligible | Zero (one-time) | Per-layer per-step | Per-block per-step |
| When to use | Always for transformers/RNNs | Always | Architecture-dependent | Deep networks |

Part 3 - Mixed Precision Training

The Numerical Precision Landscape

Modern GPUs support multiple floating-point formats:

| Format | Bits | Exponent | Mantissa | Range | Precision |
|---|---|---|---|---|---|
| FP32 | 32 | 8 bits | 23 bits | $\pm 3.4 \times 10^{38}$ | ~7 decimal digits |
| FP16 | 16 | 5 bits | 10 bits | $\pm 65504$ | ~3 decimal digits |
| BF16 | 16 | 8 bits | 7 bits | $\pm 3.4 \times 10^{38}$ | ~2 decimal digits |
| TF32 | 19 | 8 bits | 10 bits | $\pm 3.4 \times 10^{38}$ | ~3 decimal digits |

Why Mixed Precision?

  • Speed: FP16/BF16 operations are 2-8x faster on modern GPUs (A100, H100).
  • Memory: 16-bit activations and gradients take half the memory of FP32, which leaves room for a larger batch size or model.
  • Throughput: Tensor Cores on NVIDIA GPUs are optimized for FP16/BF16 matrix multiplications.

But you cannot simply cast everything to FP16. Here is why:

Where FP16 Fails

Problem 1: Overflow. FP16 max value is 65504. Activations in deep networks can exceed this. Pre-LayerNorm transformers are especially vulnerable because the unnormalized residual stream grows with depth.

Problem 2: Underflow. Gradient values below $2^{-24} \approx 5.96 \times 10^{-8}$ become zero in FP16. Many gradient values, especially in early layers of deep networks, fall below this threshold. The model stops learning.

Problem 3: Accumulation errors. Summing many small FP16 values loses precision rapidly. The sum of 1000 values of 0.001 in FP16 may not equal 1.0.
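All three failure modes can be reproduced without a GPU. The sketch below assumes Python's standard `struct` module, whose `e` format code is IEEE 754 half precision; the helper name `to_fp16` is made up for this example. `struct` raises an error on finite values beyond the FP16 range, so the helper maps that case to infinity as FP16 arithmetic would.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # Out of FP16 range: mimic the inf that FP16 arithmetic would produce.
        return math.copysign(math.inf, x)

overflowed = to_fp16(70000.0)   # above the FP16 maximum of 65504
underflowed = to_fp16(1e-8)     # below the smallest FP16 subnormal (2**-24)

# Accumulate 1000 copies of 0.001, rounding to FP16 after every addition.
step = to_fp16(0.001)
total = 0.0
for _ in range(1000):
    total = to_fp16(total + step)

print("overflow:", overflowed)
print("underflow:", underflowed)
print("sum of 1000 x 0.001 in FP16:", total)
```

The running sum falls noticeably short of 1.0: once the accumulator is much larger than the increment, each addition is rounded to the accumulator's coarser grid and part of every increment is lost.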

The Mixed Precision Recipe (Micikevicius et al., 2018)

[Figure: Mixed Precision Training - FP16/BF16 Forward Pass with FP32 Master Weights]

Key principles:

  1. Master weights in FP32. Weight updates are tiny ($\text{lr} \times \text{gradient} \approx 10^{-3} \times 10^{-4} = 10^{-7}$). A value that small sits at the edge of FP16's representable range, and it is rounded away entirely when added to a weight of ordinary magnitude.
  2. Forward and backward pass in FP16. This is where the speedup comes from.
  3. Loss scaling. Multiply the loss by a large constant $S$ (e.g., 1024) before backprop. This shifts gradients into FP16's representable range. After backprop, divide the gradients by $S$ before the weight update.
  4. Dynamic loss scaling. Start with a large $S$. If inf/nan appears, skip the step and halve $S$. If no inf/nan appears for $N$ steps, double $S$.
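The loss-scaling steps can be sketched with the standard library alone. The code below simulates FP16 rounding with `struct`; `to_fp16` and `DynamicLossScaler` are hypothetical names for this illustration, and the default scale and growth interval are assumptions. In PyTorch this bookkeeping is handled by the AMP gradient scaler.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

class DynamicLossScaler:
    """Halve the scale on inf/nan; double it after `growth_interval` clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, grads_fp16):
        """Return True if the optimizer step should be applied."""
        if any(math.isinf(g) or math.isnan(g) for g in grads_fp16):
            self.scale /= 2.0
            self.clean_steps = 0
            return False  # skip this step; the gradients are unusable
        self.clean_steps += 1
        if self.clean_steps >= self.growth_interval:
            self.scale *= 2.0
            self.clean_steps = 0
        return True

# A true gradient of 1e-8 vanishes in FP16 unless the loss was scaled first.
true_grad = 1e-8
unscaled = to_fp16(true_grad)             # flushed to zero
S = 1024.0
recovered = to_fp16(true_grad * S) / S    # FP16 backward pass, unscale in higher precision
print(unscaled, recovered)
```

The recovered gradient matches the true value to within about 0.1%, whereas the unscaled gradient is lost entirely.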

BF16: The Better Alternative

BF16 (Brain Floating Point, developed by Google Brain) uses the same exponent size as FP32 (8 bits) but with reduced mantissa (7 bits vs 23 bits).

Advantages of BF16 over FP16:

  • Same dynamic range as FP32 - no overflow issues
  • No loss scaling needed (the entire purpose of loss scaling was to handle FP16's limited range)
  • Simpler training pipeline

Disadvantage: Slightly less precision than FP16 (2 decimal digits vs 3). In practice, this rarely matters for training.
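The range-versus-precision tradeoff can be seen by rounding the same numbers to both formats. This sketch emulates BF16 by keeping the top 16 bits of the FP32 encoding with round-to-nearest-even; `to_bf16` and `to_fp16` are illustrative helpers, and NaN inputs are not handled.

```python
import math
import struct

def to_fp16(x):
    """Round to IEEE 754 half precision (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

def to_bf16(x):
    """Round to bfloat16: keep the top 16 bits of the FP32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round-to-nearest-even bias
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Range: one million overflows FP16 but is fine in BF16.
big_fp16, big_bf16 = to_fp16(1e6), to_bf16(1e6)
# Precision: FP16 keeps a small offset from 1.0 that BF16 rounds away.
fine_fp16, fine_bf16 = to_fp16(1.001), to_bf16(1.001)

print("1e6   -> FP16:", big_fp16, " BF16:", big_bf16)
print("1.001 -> FP16:", fine_fp16, " BF16:", fine_bf16)
```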

Interviewer's Perspective

If asked "what precision would you train a large model in?", the strong answer is: "BF16 for both forward and backward passes on hardware that supports it (A100+, TPU v3+), with FP32 master weights. This avoids the need for loss scaling. If BF16 isn't available, FP16 with dynamic loss scaling. I'd keep optimizer states (momentum, variance in Adam) in FP32 regardless." This shows you understand the practical tradeoffs, not just the theory.

Operations That Must Stay in FP32

Not everything can be safely done in reduced precision:

  • Softmax: Exponentiation can overflow. Compute in FP32 or use numerically stable implementation (subtract max).
  • Layer normalization: Variance computation needs FP32 for numerical stability.
  • Loss computation: Cross-entropy involves log of small values. FP16 underflow causes NaN.
  • Optimizer state: Adam's running averages need FP32 precision.
  • Gradient accumulation: Summing many small gradients must be in FP32.

Common Trap

Do NOT say "mixed precision just means using FP16 everywhere." The "mixed" means some operations use FP16 and some use FP32. Knowing which operations must stay in FP32 is what distinguishes engineers who have actually trained models from those who have only read about it.

Part 4 - Curriculum Learning

Core Idea

Train the model on easy examples first, then gradually introduce harder examples - mimicking how humans learn.

Paper: Bengio et al., 2009 - "Curriculum Learning"

Why It Works

  1. Better optimization landscape: Easy examples provide strong, clean gradients that move the model toward a good region of parameter space. Hard/noisy examples early on provide conflicting gradients.
  2. Faster convergence: The model builds basic representations quickly on easy examples, then refines them on hard examples.
  3. Better generalization: Starting with clean examples helps the model learn true patterns before encountering noise.

Difficulty Metrics

How do you define "easy" vs "hard"?

| Method | Definition of Difficulty | Example |
|---|---|---|
| Loss-based | Higher loss = harder | Sort by per-sample loss after one epoch |
| Confidence-based | Lower confidence = harder | Sort by model's predicted probability |
| Heuristic | Domain knowledge | Short sentences before long sentences in NLP |
| Length-based | Longer = harder | Common in sequence tasks |
| Noise-based | More noise = harder | Remove noisy labels first |
| Self-paced | Model decides | Include examples where loss is below a threshold that increases over time |
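Whichever metric is chosen, a curriculum also needs a pacing rule that decides how much of the sorted data is visible at each step. A minimal sketch follows, assuming a length-based difficulty score and a linear pacing function; the names `linear_pacing` and `curriculum_pool` and the 20% starting fraction are illustrative assumptions.

```python
def linear_pacing(step, total_steps, start_frac=0.2):
    """Fraction of the easy-to-hard sorted dataset available at `step`."""
    progress = min(1.0, step / total_steps)
    return start_frac + (1.0 - start_frac) * progress

def curriculum_pool(examples, difficulty, step, total_steps):
    """Return the easiest examples the model may sample from at `step`."""
    ranked = sorted(examples, key=difficulty)
    keep = max(1, round(linear_pacing(step, total_steps) * len(ranked)))
    return ranked[:keep]

def num_tokens(sentence):
    return len(sentence.split())

sentences = ["a b", "a b c d e f", "a", "a b c d", "a b c"]
early = curriculum_pool(sentences, num_tokens, step=0, total_steps=100)
middle = curriculum_pool(sentences, num_tokens, step=50, total_steps=100)
late = curriculum_pool(sentences, num_tokens, step=100, total_steps=100)
print(early, middle, late, sep="\n")
```

A loss-based or confidence-based curriculum would swap `num_tokens` for a score computed from a reference model. A self-paced variant would recompute the scores as training progresses.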

Anti-Curriculum Learning

Surprisingly, some work shows that training on hard examples first can also help, particularly when:

  • The dataset contains many near-duplicate easy examples
  • Hard examples are the most informative for learning decision boundaries
  • The model needs to learn rare but important patterns

Company Variation

Curriculum learning is frequently asked at NLP-focused companies (Google, Meta AI, Cohere). It is especially relevant for training language models where sequence length, data quality, and domain mixing order all affect final performance. GPT-4 and Llama training reportedly use data curriculum strategies.

Part 5 - Knowledge Distillation

The Core Framework

Paper: Hinton et al., 2015 - "Distilling the Knowledge in a Neural Network"

A large "teacher" model has learned rich representations. A smaller "student" model learns to mimic the teacher's behavior, not just match the ground truth labels.

[Figure: Knowledge Distillation - Teacher Soft Labels Transfer Dark Knowledge to Student]

Why Soft Labels Are More Informative Than Hard Labels

Consider a teacher classifying an image. The hard label says "cat" (one-hot). But the teacher's soft predictions might be:

| Class | Hard Label | Teacher Soft Label (T=1) | Teacher Soft Label (T=5) |
|---|---|---|---|
| Cat | 1.0 | 0.92 | 0.41 |
| Dog | 0.0 | 0.05 | 0.23 |
| Bird | 0.0 | 0.02 | 0.19 |
| Car | 0.0 | 0.01 | 0.17 |

The soft labels reveal that this cat looks somewhat like a dog (both are furry animals) and less like a car. This "dark knowledge" - the inter-class similarity structure - is lost in hard labels but preserved in soft labels.

Temperature Scaling

The temperature parameter $T$ controls how "soft" the distribution is:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

  • $T = 1$: Standard softmax (peaky distribution)
  • $T > 1$: Softer distribution, reveals more inter-class structure
  • $T \to \infty$: Uniform distribution (all classes equally likely)
  • $T \to 0$: Hard argmax (one-hot)

Typical value: $T = 3$ to $T = 20$ (Hinton originally used $T = 20$).
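The limiting behaviours are easy to see numerically. A minimal sketch, assuming four hypothetical teacher logits; the function name `softmax_with_temperature` is illustrative.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, computed stably by subtracting the maximum."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5, -1.0]   # hypothetical teacher logits for 4 classes
for T in (0.1, 1.0, 5.0, 100.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T:>5}: {[round(p, 3) for p in probs]}")
```

At $T = 0.1$ nearly all probability mass sits on the top class, and at $T = 100$ the four probabilities are close to 0.25 each.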

The Distillation Loss

$$L = \alpha \cdot T^2 \cdot \text{KL}\left(\text{softmax}(z_t/T)\ \|\ \text{softmax}(z_s/T)\right) + (1-\alpha) \cdot \text{CE}(y, \text{softmax}(z_s))$$

Why $T^2$? The gradients of the KL divergence with temperature-scaled softmax are scaled by $1/T^2$ compared to standard softmax. Multiplying by $T^2$ compensates, keeping the relative contribution of the distillation loss consistent regardless of temperature.

Typical $\alpha$: 0.5 to 0.9 (weight toward distillation loss).
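The loss translates directly into code. The sketch below is a single-example, framework-free version, assuming $z_t$ and $z_s$ are the teacher and student logits and `label` is the index of the true class. The defaults `T=4.0` and `alpha=0.7` are illustrative picks from the typical ranges, not values from the paper.

```python
import math

def softmax(logits, T=1.0):
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(label, student)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])  # hard-label term uses T = 1
    return alpha * T * T * kl + (1.0 - alpha) * ce

teacher = [4.0, 1.0, 0.5, -1.0]
student = [2.0, 1.5, 0.0, 0.0]
print("student != teacher:", distillation_loss(student, teacher, label=0))
print("student == teacher:", distillation_loss(teacher, teacher, label=0))
```

When the student reproduces the teacher's logits exactly, the KL term is zero and only the weighted cross-entropy on the hard label remains.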

60-Second Answer

"Knowledge distillation transfers knowledge from a large teacher model to a smaller student model. Instead of training the student on hard labels only, we also train it to match the teacher's soft probability distribution. We raise the temperature in the softmax to reveal inter-class similarities - the 'dark knowledge.' The student learns not just what the correct answer is, but what the incorrect answers look like, which is much more informative per training example. The loss combines KL divergence between teacher and student soft distributions with standard cross-entropy on hard labels. This lets you deploy a model that's 3-10x smaller with only 1-3% accuracy loss."

Types of Knowledge Distillation

| Type | What is Transferred | Example |
|---|---|---|
| Response-based | Final layer output (soft labels) | Original Hinton distillation |
| Feature-based | Intermediate layer representations | FitNets (Romero et al., 2015) |
| Relation-based | Relationships between samples | Contrastive Representation Distillation |
| Self-distillation | Model distills into itself | Born-Again Networks, deep-to-shallow within same model |
| Online distillation | Models teach each other during training | Deep Mutual Learning |

Practical Distillation Results

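| Teacher | Student | Task | Teacher Acc | Student Acc (no distill) | Student Acc (distilled) |
|---|---|---|---|---|---|
| BERT-Large | BERT-Small | GLUE | 86.5 | 78.2 | 83.1 |
| ResNet-152 | ResNet-18 | ImageNet | 78.3 | 69.8 | 73.5 |
| GPT-3 | GPT-2 | Text Gen | - | - | Significant improvement |
| Llama-70B | Llama-7B | Benchmarks | - | - | ~5% improvement |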

Part 6 - Label Smoothing

The Problem with Hard Labels

Standard one-hot labels force the model to be infinitely confident: predict exactly 1.0 for the correct class and 0.0 for all others. To achieve this, logits must go to $\pm\infty$, which:

  1. Causes the model to be overconfident - calibration suffers
  2. Encourages memorization of training data
  3. Makes the model less generalizable - it overfits to the exact label distribution

Label Smoothing Formula

Replace the one-hot label yy with a smoothed version:

$$y_{\text{smooth}} = (1 - \epsilon) \cdot y + \frac{\epsilon}{K}$$

Where $\epsilon$ is the smoothing parameter (typically 0.1) and $K$ is the number of classes.

Example with $K=4$ and $\epsilon=0.1$:

| Class | Hard Label | Smoothed Label |
|---|---|---|
| Cat (correct) | 1.0 | 0.925 |
| Dog | 0.0 | 0.025 |
| Bird | 0.0 | 0.025 |
| Car | 0.0 | 0.025 |
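The table can be reproduced with a one-line formula. A minimal sketch in plain Python; the function name `smooth_labels` is illustrative.

```python
def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Return (1 - epsilon) * one_hot + epsilon / K as a list of K probabilities."""
    uniform = epsilon / num_classes
    return [(1.0 - epsilon) + uniform if k == target_index else uniform
            for k in range(num_classes)]

smoothed = smooth_labels(target_index=0, num_classes=4, epsilon=0.1)
print([round(p, 3) for p in smoothed])
```

In PyTorch the same effect is available through the `label_smoothing` argument of `torch.nn.CrossEntropyLoss`, so the smoothed targets rarely need to be built by hand.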

Why Label Smoothing Works

  1. Regularization effect: The model cannot achieve zero loss, so it never stops learning useful representations.
  2. Better calibration: The model learns to output probabilities closer to the true uncertainty.
  3. Tighter clusters: Label smoothing encourages the model to keep all class representations at a fixed distance from each other, creating tighter, more uniform clusters in embedding space.

Label Smoothing vs. Knowledge Distillation

Both provide "soft" targets, but they are fundamentally different:

| Aspect | Label Smoothing | Knowledge Distillation |
|---|---|---|
| Source of soft labels | Uniform distribution | Teacher model |
| Information content | No inter-class structure | Rich inter-class relationships |
| Requires teacher | No | Yes |
| Computational cost | Negligible | Requires teacher inference |
| Typical improvement | 0.5-2% | 3-10% |

Common Trap

Label smoothing and knowledge distillation are sometimes confused in interviews. When asked about label smoothing, do NOT describe knowledge distillation. The key difference: label smoothing uses a uniform distribution over non-target classes (every wrong class gets equal probability), while distillation uses the teacher's distribution (wrong classes get probability proportional to their similarity to the correct class). Label smoothing is a regularization technique; distillation is a knowledge transfer technique.

Part 7 - Putting It All Together: A Training Recipe

Here is how all these techniques combine in a modern training pipeline:

[Figure: Modern Training Recipe - Init, Precision, Gradient Clipping, Label Smoothing]

Common Training Failures and Fixes

| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss is NaN from step 1 | Bad initialization or no gradient clipping | Use He init, clip gradients at 1.0 |
| Loss plateaus immediately | Symmetry problem or dead neurons | Check initialization, use LeakyReLU |
| Loss spikes periodically | Exploding gradients | Reduce gradient clip threshold |
| Loss decreases then suddenly NaN | FP16 overflow | Enable loss scaling or switch to BF16 |
| Training is slow on GPU | Not using mixed precision | Enable AMP with BF16 |
| Student much worse than teacher | Wrong temperature or alpha | Try T=5-20, alpha=0.7-0.9 |
| Model overconfident | No label smoothing or regularization | Add label smoothing epsilon=0.1 |
| Model underfits easy examples | Curriculum too aggressive | Start with more easy examples |

Practice Problems

Problem 1: Initialization Debugging

You initialize a 100-layer MLP with ReLU activations using Xavier normal initialization. After the first forward pass, all activations in the last layer are nearly zero. Explain why and fix it.

Hint 1 - Direction

Think about what ReLU does to the variance of activations at each layer when Xavier initialization is used.

Hint 2 - Insight

ReLU zeros out approximately half the inputs. Xavier assumes linear/symmetric activations where all inputs pass through. Each ReLU layer halves the variance, so after 100 layers the variance is scaled by $(0.5)^{100} \approx 10^{-30}$.

Hint 3 - Full Solution + Rubric

Why activations vanish:

Xavier sets $\text{Var}(w) = 2/(n_{in} + n_{out})$, designed for linear activations. With ReLU, each layer halves the variance (since half the outputs are zeroed). After $L$ layers:

$$\text{Var}(\text{activation}_L) \approx \left(\frac{n_{in}}{n_{in} + n_{out}} \right)^L \cdot \text{Var}(x)$$

For $n_{in} = n_{out}$ and $L = 100$, this simplifies to $(0.5)^{100} \approx 10^{-30}$. Activations are effectively zero.

Fix: Switch to He initialization with $\text{Var}(w) = 2/n_{in}$, which compensates for ReLU's variance halving. After this change:

$$\text{Var}(\text{activation}_L) \approx \text{Var}(x)$$

The variance is preserved through all 100 layers.

Additional fixes for very deep networks: Add residual connections, add batch/layer normalization, or use LSUV initialization.

Scoring Rubric:

  • Strong Hire: Correctly identifies the ReLU-Xavier mismatch, derives the variance shrinkage factor, prescribes He initialization, and mentions complementary techniques (residual connections, normalization).
  • Lean Hire: Identifies the problem but cannot derive the shrinkage rate or only says "use He init" without explaining why.
  • No Hire: Cannot explain why Xavier fails with ReLU or suggests increasing learning rate as the fix.

Problem 2: Mixed Precision Design

You are training a 10B parameter language model on 8 A100 GPUs. Design the mixed precision strategy, including what precision each component uses and how you handle numerical stability.

Hint 1 - Direction

List every component: weights, activations, gradients, optimizer states, loss computation, normalization layers. Decide the precision for each.

Hint 2 - Insight

Master weights in FP32 (4 bytes per param = 40GB just for weights). Forward/backward in BF16 (avoids loss scaling). Adam optimizer states (momentum + variance) in FP32 (adds 2x the master-weight memory). Specific operations (softmax, layernorm, loss) in FP32 for stability.

Hint 3 - Full Solution + Rubric

Component-level precision assignment:

| Component | Precision | Reason |
|---|---|---|
| Master weights | FP32 | Weight updates are tiny, underflow in FP16 |
| Forward activations | BF16 | Speed + memory, BF16 range prevents overflow |
| Backward gradients | BF16 | Matches forward precision |
| Adam first moment | FP32 | Running average needs precision |
| Adam second moment | FP32 | Running average needs precision |
| Softmax | FP32 | exp() overflow risk |
| LayerNorm | FP32 | Variance computation needs precision |
| Loss computation | FP32 | log() underflow risk |
| Gradient accumulation | FP32 | Summing many small values |
| Gradient all-reduce | BF16 | Communication bandwidth bottleneck |

Memory breakdown for 10B params:

  • Master weights: 10B x 4 bytes = 40GB
  • BF16 weights (forward copy): 10B x 2 bytes = 20GB
  • Adam states: 10B x 8 bytes = 80GB
  • Gradients: 10B x 2 bytes = 20GB
  • Total per-model: ~160GB across 8 GPUs = 20GB per GPU (with sharding)
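The breakdown above is simple arithmetic and can be scripted for other model sizes. The sketch below assumes 1 GB = 1e9 bytes, full sharding of every component across GPUs, and no activation memory, which is not covered by the list above. The function name and dictionary keys are illustrative.

```python
def mixed_precision_memory_gb(num_params, num_gpus):
    """Per-component training-state memory in GB, excluding activations."""
    bytes_per_param = {
        "fp32_master_weights": 4,
        "bf16_weights": 2,
        "adam_moments_fp32": 8,   # first + second moment, 4 bytes each
        "bf16_gradients": 2,
    }
    breakdown = {name: num_params * b / 1e9 for name, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    breakdown["per_gpu_fully_sharded"] = breakdown["total"] / num_gpus
    return breakdown

report = mixed_precision_memory_gb(num_params=10e9, num_gpus=8)
for name, gb in report.items():
    print(f"{name:>24}: {gb:6.1f} GB")
```

Activations, temporary buffers, and communication overhead come on top of these numbers, so the per-GPU figure is a lower bound, not a full budget.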

Why BF16 over FP16: No loss scaling needed, simplifies the pipeline, same dynamic range as FP32. On A100s, BF16 and FP16 have the same throughput on Tensor Cores.

Scoring Rubric:

  • Strong Hire: Correct precision for every component, memory calculation, justification for BF16 over FP16, mentions gradient accumulation precision and communication precision.
  • Lean Hire: Gets the main idea (BF16 forward, FP32 master weights) but misses nuances like optimizer state precision or which operations need FP32.
  • No Hire: Suggests FP16 everywhere or cannot explain why master weights need FP32.

Problem 3: Distillation Temperature Selection

You are distilling a BERT-Large teacher into a DistilBERT student for sentiment classification (positive/negative). The teacher achieves 95% accuracy. With $T=1$, the student gets 88%. How would you tune temperature to improve the student? What temperatures would you try and why?

Hint 1 - Direction

Think about what temperature controls in the softmax distribution. For a binary classification task specifically, how much "dark knowledge" is available?

Hint 2 - Insight

With only 2 classes, the soft distribution has limited dark knowledge (just the ratio between positive and negative probabilities). Higher temperatures will flatten this further. The optimal temperature for binary classification is typically lower ($T = 2$-$5$) than for multi-class ($T = 5$-$20$) because there is less inter-class structure to reveal.

Hint 3 - Full Solution + Rubric

Analysis:

For binary sentiment classification, the teacher's output is a 2-class distribution. At $T=1$, a 95%-confident teacher outputs approximately [0.95, 0.05] for a clear positive example. Dividing the logit gap $\ln(0.95/0.05) \approx 2.94$ by $T$ gives the softened distributions below.

Temperature sweep:

  • $T=1$: [0.95, 0.05] - Almost hard label, little dark knowledge
  • $T=3$: [0.73, 0.27] - Reveals teacher's uncertainty more
  • $T=5$: [0.64, 0.36] - Quite soft, but still informative
  • $T=10$: [0.57, 0.43] - Approaching uniform, losing signal
  • $T=20$: [0.54, 0.46] - Nearly uniform, useless

Recommended approach:

  1. Try $T \in \{2, 3, 5, 7\}$ on a validation set
  2. For binary classification, $T=3$ or $T=5$ is likely optimal
  3. Tune $\alpha$ (weight between distillation and hard loss) jointly: try $\alpha \in \{0.3, 0.5, 0.7, 0.9\}$
  4. For binary tasks, consider augmenting with feature-based distillation (match intermediate representations) since the output distribution has limited dark knowledge

Key insight: Binary classification is the worst case for response-based distillation because there are only 2 classes to distribute probability over. Feature-based distillation (matching intermediate BERT hidden states) would likely give more improvement.

Scoring Rubric:

  • Strong Hire: Recognizes that binary classification limits dark knowledge, recommends feature-based distillation as a complement, provides specific temperature range with reasoning, discusses $\alpha$ tuning.
  • Lean Hire: Suggests trying different temperatures but doesn't recognize the fundamental limitation of 2-class distillation.
  • No Hire: Suggests $T=20$ "because higher is better" or doesn't understand what temperature controls.

Problem 4: Curriculum Learning for Code Generation

You are training a code generation model (similar to CodeLlama). Design a curriculum learning strategy. What defines "easy" vs "hard" code? How do you schedule the curriculum?

Hint 1 - Direction

Think about multiple dimensions of code difficulty: length, language complexity, algorithmic complexity, number of dependencies, test pass rate.

Hint 2 - Insight

Easy code: short functions, single language, common patterns (getters/setters, simple loops). Hard code: multi-file projects, complex algorithms, concurrent programming, domain-specific patterns. The curriculum should consider both syntactic complexity and semantic difficulty.

Hint 3 - Full Solution + Rubric

Difficulty dimensions:

  1. Length: Short functions (< 20 lines) to multi-file projects
  2. Cyclomatic complexity: Linear code to deeply nested conditionals/loops
  3. Language frequency: Python/JavaScript (common) to Rust/Haskell (rare)
  4. Algorithmic complexity: O(n) loops to dynamic programming/graph algorithms
  5. Dependencies: Standalone functions to complex import chains
  6. Documentation quality: Well-commented code to uncommented code
  7. Test pass rate: How often a baseline model generates correct code for this task

Proposed curriculum (4 phases):

Phase 1 (first 10% of training): Single-function Python/JavaScript, < 20 lines, simple logic, well-documented.

Phase 2 (10-40%): Multi-function files, add Java/TypeScript/Go, include standard library usage, introduce common algorithms.

Phase 3 (40-80%): Multi-file projects, all languages, complex algorithms, API usage, error handling patterns.

Phase 4 (80-100%): Full difficulty range including concurrent code, advanced type systems, low-level systems code, adversarial/edge-case examples.

Scheduling: Linear pacing function that gradually increases the difficulty threshold. At each phase boundary, validate on a held-out set at the target difficulty level.

Anti-curriculum consideration: For code, showing some hard examples early (e.g., complex algorithms) might help the model learn structural patterns. A mixed strategy that is 80% curriculum-ordered and 20% random might outperform strict ordering.
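
The scheduling and mixed-strategy ideas above can be sketched as follows. This is a minimal illustration, not a tuned implementation: it assumes each example already carries a scalar difficulty score in [0, 1] (for instance, a weighted blend of the dimensions listed earlier), and the names `difficulty_threshold`, `sample_batch`, `start`, and `random_frac` are hypothetical.

```python
import random

def difficulty_threshold(step, total_steps, start=0.2, end=1.0):
    """Linear pacing: the maximum difficulty unlocked at `step`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress

def sample_batch(examples, step, total_steps, batch_size,
                 random_frac=0.2, rng=random):
    """Draw a batch that is mostly curriculum-ordered, with a random remainder.

    `examples` is a list of (difficulty, item) pairs, difficulty in [0, 1].
    A fraction `random_frac` of the batch ignores the threshold entirely.
    """
    threshold = difficulty_threshold(step, total_steps)
    unlocked = [ex for ex in examples if ex[0] <= threshold]
    if not unlocked:
        # Fall back to the single easiest example if nothing is unlocked yet.
        unlocked = [min(examples, key=lambda ex: ex[0])]
    n_random = int(round(batch_size * random_frac))
    n_curriculum = batch_size - n_random
    batch = [rng.choice(unlocked) for _ in range(n_curriculum)]
    batch += [rng.choice(examples) for _ in range(n_random)]
    return batch
```

Sampling with replacement keeps the sketch short; a real data loader would more likely shuffle within the unlocked pool. The 80/20 split corresponds to `random_frac=0.2`.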

Scoring Rubric:

  • Strong Hire: Identifies multiple difficulty dimensions, provides concrete phasing, discusses scheduling, considers anti-curriculum aspects, mentions validation strategy.
  • Lean Hire: Provides a reasonable curriculum but only considers 1-2 difficulty dimensions.
  • No Hire: Cannot define code difficulty or proposes a curriculum with no clear rationale.

Problem 5: Training Recipe for Production

Your team is about to train a 1B parameter vision transformer on ImageNet from scratch. Write out the complete training recipe: initialization, precision, gradient management, label strategy, and any other techniques. Justify each choice.

Hint 1 - Direction

Think about what specific choices the ViT paper and follow-up work (DeiT, BEiT) made. Why did they deviate from standard ResNet training recipes?

Hint 2 - Insight

ViTs are notoriously harder to train than CNNs. They require: careful initialization (truncated normal, small std), strong data augmentation, longer training schedules, label smoothing, knowledge distillation from a CNN teacher (DeiT approach). Mixed precision is essential for 1B params.

Hint 3 - Full Solution + Rubric

Complete training recipe:

Initialization:

  • Patch embedding: truncated normal, std=0.02
  • Position embeddings: truncated normal, std=0.02
  • Attention layers: Xavier uniform
  • MLP layers: He normal (GELU approximates ReLU)
  • Final classifier head: zero-initialized (all logits start at zero, so initial predictions are uniform)
  • LayerNorm: weight=1, bias=0
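
To make the scales in this list concrete, here is a framework-free sketch of the quantities involved. It uses only the Python standard library; the helper names, the two-standard-deviation truncation cutoff, and the example width of 768 are assumptions for illustration, not values from the recipe.

```python
import math
import random

def xavier_uniform_bound(fan_in, fan_out):
    """Bound a such that U(-a, a) has variance 2 / (fan_in + fan_out)."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    """Standard deviation giving Var(w) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def trunc_normal(std=0.02, cutoff=2.0, rng=random):
    """Sample N(0, std^2), rejecting draws beyond +/- cutoff * std."""
    while True:
        x = rng.gauss(0.0, std)
        if abs(x) <= cutoff * std:
            return x

def init_matrix(rows, cols, sampler):
    """Build a rows x cols weight matrix by calling `sampler` per entry."""
    return [[sampler() for _ in range(cols)] for _ in range(rows)]

# Hypothetical width: 768-dim attention projection and MLP input.
attn_bound = xavier_uniform_bound(768, 768)      # 0.0625
mlp_std = he_normal_std(768)                     # about 0.051
pos_embed = init_matrix(4, 8, trunc_normal)      # truncated normal, std=0.02
head = init_matrix(4, 8, lambda: 0.0)            # zero-initialized head
```

Note that the Xavier bound and He standard deviation at this width are both larger than the 0.02 used for the embeddings, which is why the recipe specifies the embedding scale separately.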

Precision:

  • BF16 mixed precision on A100 GPUs
  • FP32 master weights and optimizer states
  • FP32 for LayerNorm and softmax attention
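
The reasoning behind these precision choices can be checked numerically without a GPU. The sketch below round-trips Python floats through FP16 (via the `struct` module's half-precision format) and through an emulated BF16. The BF16 emulation truncates the FP32 encoding to its top 16 bits, which is a simplification: real hardware typically rounds to nearest. The helper names and the loss scale of 65536 are illustrative assumptions.

```python
import struct

def to_fp16(x):
    """Round-trip a float through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Emulate BF16 by keeping the top 16 bits of the FP32 encoding."""
    top = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top + b"\x00\x00")[0]

tiny_grad = 1e-8

# FP16 cannot represent 1e-8 (below its smallest subnormal), so it becomes 0.
fp16_grad = to_fp16(tiny_grad)

# BF16 shares FP32's exponent range, so the gradient survives (coarsely).
bf16_grad = to_bf16(tiny_grad)

# Loss scaling rescues FP16: scale up, cast, then unscale in higher precision.
scale = 65536.0
recovered = to_fp16(tiny_grad * scale) / scale

# BF16 has few mantissa bits, so a small update to a weight near 1.0
# is lost entirely. This is why master weights stay in FP32.
bf16_weight_after_update = to_bf16(1.0 + 1e-4)

print(fp16_grad, bf16_grad, recovered, bf16_weight_after_update)
```

Values above 65504 make `struct.pack("<e", ...)` raise `OverflowError`, which mirrors FP16 overflow; the BF16 emulation accepts them.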

Gradient management:

  • Gradient clipping: global norm 1.0
  • Gradient accumulation: effective batch size 4096 (common for ViT)
  • Learning rate: 3e-4 peak with linear warmup (10K steps) + cosine decay
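
A minimal sketch of the schedule and the clipping rule above, in plain Python. The peak learning rate, warmup length, and clipping norm come from the list; `total_steps` is a hypothetical parameter, and gradients are represented as nested lists of floats purely for illustration.

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=10_000):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by one shared factor so the joint L2 norm
    does not exceed max_norm. Returns (clipped_grads, original_norm)."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = max_norm / max(total_norm, max_norm)
    return [[g * scale for g in tensor] for tensor in grads], total_norm

# Two "parameter tensors" whose joint norm is 5.0 are scaled down to norm 1.0.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Because every tensor is multiplied by the same factor, the update direction is preserved; clipping each tensor separately would change it.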

Label strategy:

  • Label smoothing: epsilon=0.1
  • Knowledge distillation from RegNetY-16GF teacher (DeiT approach)
  • Mixup (alpha=0.8) + CutMix (alpha=1.0) augmentation
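
As a rough illustration of how these label-side techniques change the training target, the sketch below builds a smoothed target and blends two targets as Mixup does. The function names are hypothetical, and only the label half of Mixup is shown; the inputs would be blended with the same coefficient.

```python
def smooth_labels(class_index, num_classes, epsilon=0.1):
    """Smoothed target (1 - epsilon) * one_hot + epsilon / num_classes."""
    off_value = epsilon / num_classes
    target = [off_value] * num_classes
    target[class_index] += 1.0 - epsilon
    return target

def mixup_targets(target_a, target_b, lam):
    """Blend two target distributions with mixing coefficient lam."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(target_a, target_b)]

# Four classes, true class 1: the true class gets 0.925, the others 0.025.
target = smooth_labels(1, 4, epsilon=0.1)
```

In Mixup the coefficient is usually drawn per batch from a Beta(alpha, alpha) distribution; with the standard library that would be `random.betavariate(0.8, 0.8)` for the alpha listed above.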

Additional techniques:

  • Stochastic depth (drop path rate 0.1)
  • Repeated augmentation
  • Random erasing (probability 0.25)
  • Weight decay 0.05 (AdamW)
  • EMA of model weights for evaluation
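
Two of these items are simple enough to sketch in a few lines. The version below operates on plain lists of floats; the EMA decay of 0.9999 is an assumed, commonly used value rather than one given in the recipe, and `drop_path` handles a single sample's residual branch.

```python
import random

def ema_update(ema_weights, weights, decay=0.9999):
    """Return the new EMA: decay * old_ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]

def drop_path(branch_output, drop_prob=0.1, training=True, rng=random):
    """Stochastic depth for one sample: zero the residual branch with
    probability drop_prob, otherwise rescale it by 1 / (1 - drop_prob)."""
    if not training or drop_prob == 0.0:
        return list(branch_output)
    keep_prob = 1.0 - drop_prob
    if rng.random() < keep_prob:
        return [v / keep_prob for v in branch_output]
    return [0.0 for _ in branch_output]
```

The rescaling keeps the expected branch output unchanged between training and evaluation, and the EMA copy is the one evaluated, while the optimizer keeps updating the raw weights.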

Why these choices matter for ViTs specifically:

  • ViTs lack the inductive biases of CNNs (translation equivariance, locality), so they need more data and stronger regularization
  • Knowledge distillation from a CNN teacher provides the inductive bias that ViTs lack
  • Label smoothing prevents the model from becoming overconfident on ImageNet's noisy labels
  • Large batch size with warmup is critical for stable transformer training

Scoring Rubric:

  • Strong Hire: Covers all categories (init, precision, gradients, labels, augmentation), provides specific values with justifications, mentions ViT-specific challenges, references DeiT/BEiT choices.
  • Lean Hire: Gets the main techniques right but misses ViT-specific nuances or provides generic values without justification.
  • No Hire: Treats ViT training the same as ResNet training or cannot provide a coherent recipe.

Interview Cheat Sheet

| Concept | Key Formula | One-Liner | Red Flag |
|---|---|---|---|
| Xavier init | $\text{Var}(w) = 2/(n_{in}+n_{out})$ | Preserve variance for linear activations | "Use Xavier for everything" |
| He init | $\text{Var}(w) = 2/n_{in}$ | Compensates for ReLU halving variance | "He and Xavier are the same" |
| Symmetry breaking | Random init required | Same init = same neurons forever | "Init to zero is fine" |
| Gradient clipping (norm) | $g \cdot \tau / \max(\lVert g \rVert, \tau)$ | Preserve direction, limit magnitude | "Clip each gradient independently" |
| FP16 range | Max 65504 | Needs loss scaling to avoid underflow | "FP16 is always fine" |
| BF16 advantage | Same range as FP32 | No loss scaling needed | "BF16 and FP16 are identical" |
| Loss scaling | Multiply loss by $S$ before backprop | Shifts gradients into FP16 range | Not knowing why it exists |
| Knowledge distillation | $\text{KL}(\text{soft}_T \parallel \text{soft}_S) \cdot T^2$ | Teacher's soft labels transfer dark knowledge | "Just train on teacher's hard predictions" |
| Temperature | $p_i = \exp(z_i/T) / \sum_j \exp(z_j/T)$ | Higher T = softer distribution | "T=100 gives best results" |
| Label smoothing | $(1-\epsilon) \cdot y + \epsilon/K$ | Prevents overconfidence | "Same as knowledge distillation" |
| Curriculum learning | Easy first, hard later | Mimics human learning | "Always helps in every setting" |
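
The temperature and distillation rows can be checked with a short, dependency-free sketch. The function names and the default $T = 4$ are illustrative assumptions; the loss shown is only the soft-label term, without the hard-label cross-entropy that is usually added to it.

```python
import math

def softmax_t(logits, T=1.0):
    """Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, T=4.0):
    """Soft-label distillation term: T^2 * KL(teacher_soft || student_soft)."""
    p = softmax_t(teacher_logits, T)
    q = softmax_t(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return (T ** 2) * kl

logits = [2.0, 1.0, 0.0]
sharp = softmax_t(logits, T=1.0)   # peaked distribution
soft = softmax_t(logits, T=4.0)    # flatter distribution, same ranking
```

Raising the temperature lowers the top probability and lifts the others without changing their order, which is what exposes the teacher's relative preferences among wrong classes. The sketch assumes the student probabilities do not underflow to zero; a production version would work with log-probabilities.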

Spaced Repetition Checkpoints

Day 0 - Initial Learning

  • Read this entire page
  • Derive Xavier initialization from the variance preservation requirement
  • Derive He initialization and explain why the factor differs from Xavier
  • Draw the mixed precision training pipeline from memory
  • Explain the knowledge distillation loss function, including the $T^2$ factor

Day 3 - First Recall

  • Without notes, write the Xavier and He initialization formulas and state when each applies
  • Give the "60-Second Answer" for initialization out loud, timed
  • Explain the difference between FP16 and BF16 without looking
  • List 5 operations that must stay in FP32 during mixed precision training

Day 7 - Connections

  • Explain how initialization, gradient clipping, and normalization all address the same underlying problem (variance control)
  • Compare label smoothing and knowledge distillation - similarities and differences
  • Do Practice Problem 1 (initialization debugging) without hints

Day 14 - Application

  • Do Practice Problem 2 (mixed precision design) under timed conditions (15 minutes)
  • Design a knowledge distillation pipeline for a specific task of your choice
  • Do Practice Problem 5 (training recipe) - can you produce a complete, coherent recipe?

Day 21 - Mock Interview

  • Have someone ask: "Walk me through how you would initialize and train a 100-layer ResNet from scratch"
  • Time yourself: full answer should take 5-8 minutes
  • Do all 5 practice problems in sequence under timed conditions (50 minutes total)
  • Can you debug a training failure from a loss curve alone?

Key Takeaways

  1. Initialization is not an afterthought. It determines whether your network can even begin learning. Xavier for symmetric activations, He for ReLU, and know why each formula exists.

  2. Mixed precision is essential for large models. BF16 is the modern default. Know which operations must stay in FP32 and why loss scaling exists for FP16.

  3. Knowledge distillation is one of the most practical techniques in ML. It compresses models for deployment while preserving accuracy. Understand temperature, the $T^2$ factor, and when to use feature-based distillation.

  4. Label smoothing is cheap insurance. It costs almost nothing computationally and often improves generalization and calibration.

  5. Every technique addresses a specific failure mode. Strong candidates can diagnose a problem from a training curve and prescribe the right technique - not just list techniques they have memorized.

© 2026 EngineersOfAI. All rights reserved.