Training Techniques - From Initialization to Distillation

Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, Applied Scientist

The Real Interview Moment

You are in a Meta MLE on-site. The interviewer puts up a training loss curve on the screen - the loss plateaus after the first epoch and never improves. She asks: "Debug this. What could cause this, and how would you fix it?" You start listing possibilities: bad learning rate, dead ReLUs, poor initialization. She nods, then asks: "Walk me through exactly how you'd initialize a 50-layer ResNet. What initialization scheme, what numerical precision, and what gradient management strategy?"

This is where candidates separate. Junior engineers say "use default PyTorch settings." Mid-level candidates mention Xavier or He initialization. Strong candidates explain why each technique exists - the variance propagation problem, the symmetry breaking requirement, the numerical stability guarantees - and can derive the initialization formulas from first principles. They know when to use FP16 vs BF16, why loss scaling exists, and how knowledge distillation can compress a model with only a small loss in accuracy.

This page teaches you every training technique that interviewers expect you to know, with the mathematical depth to survive follow-up questions.

What You Will Master

Derive Xavier and He initialization from variance preservation principles
Explain why symmetry breaking is necessary and what happens without it
Implement gradient clipping (by norm and by value) and explain when each is needed
Compare FP16, BF16, and FP32 training with precise understanding of overflow/underflow
Design a mixed precision training pipeline with loss scaling
Apply curriculum learning strategies and explain when they help
Build a knowledge distillation pipeline with temperature-scaled soft labels
Use label smoothing and explain its regularization effect mathematically
Debug common training failures caused by each technique

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Explain	4 - Can Derive	5 - Can Teach	Your Score
Derive Xavier initialization						___
Derive He initialization						___
Explain symmetry breaking						___
Implement gradient clipping						___
Explain FP16 vs BF16 tradeoffs						___
Design a mixed precision pipeline						___
Explain knowledge distillation loss						___
Apply label smoothing mathematically						___

Target: All 4s and 5s before your interview.

Part 1 - Weight Initialization: Why It Matters

The Symmetry Problem

Consider a neural network where all weights are initialized to the same value (say, zero). What happens during forward and backward passes?

Forward pass: Every neuron in a layer computes the same output because all weights are identical. The layer effectively has one neuron, not hundreds.

Backward pass: Every neuron receives the same gradient. The weight update is identical for all neurons. They remain identical forever.

This is called the symmetry problem. No matter how long you train, neurons in the same layer will never differentiate. The network has the representational capacity of a single neuron per layer.

Instant Rejection

Never say "initialize all weights to zero" or "initialize all weights to the same value." This is a fundamental error that signals you do not understand neural network training. Even saying "small random values" without specifying the distribution will raise concerns.

The Variance Propagation Problem

Even with random initialization, choosing the wrong scale causes problems. Consider a layer $y = Wx$ where $W \in \mathbb{R}^{n_{out} \times n_{in}}$ and $x \in \mathbb{R}^{n_{in}}$ .

Each output neuron computes:

$y_j = \sum_{i=1}^{n_{in}} w_{ji} x_i$

Assuming $w_{ji}$ and $x_i$ are independent with zero mean:

$\text{Var}(y_j) = n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)$

If $\text{Var}(w) > 1/n_{in}$ , the variance of activations grows exponentially with depth. After 50 layers, activations explode to infinity.

If $\text{Var}(w) < 1/n_{in}$ , the variance shrinks exponentially. After 50 layers, activations collapse to zero.

Figure: Weight Initialization - Xavier vs He: Keeping Variance Stable Across Layers

The same problem occurs in the backward pass with gradients. This is why proper initialization is critical for deep networks.
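
The depth dependence is easy to check numerically. The sketch below simply applies $\text{Var}(y_j) = n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)$ once per layer; the helper name `activation_variance` and the width of 256 are illustrative choices, not part of any library.

```python
def activation_variance(depth, fan_in, weight_var, input_var=1.0):
    """Variance after `depth` linear layers, using Var(y) = n_in * Var(w) * Var(x)."""
    per_layer_gain = fan_in * weight_var
    return input_var * per_layer_gain ** depth


fan_in = 256  # hypothetical layer width
for label, weight_var in [
    ("Var(w) = 1.1 / n_in", 1.1 / fan_in),
    ("Var(w) = 1.0 / n_in", 1.0 / fan_in),
    ("Var(w) = 0.9 / n_in", 0.9 / fan_in),
]:
    variance = activation_variance(50, fan_in, weight_var)
    print(f"{label}: variance after 50 layers = {variance:.3g}")
```

Even a 10% mismatch in either direction compounds to a change of two orders of magnitude over 50 layers.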

Xavier (Glorot) Initialization

Paper: Glorot & Bengio, 2010 - "Understanding the difficulty of training deep feedforward neural networks"

Derivation:

We want the variance of activations to be preserved across layers in both the forward and backward pass.

Forward pass requirement: $\text{Var}(y) = \text{Var}(x)$ , which gives $\text{Var}(w) = 1/n_{in}$ .

Backward pass requirement: $\text{Var}(\delta x) = \text{Var}(\delta y)$ , which gives $\text{Var}(w) = 1/n_{out}$ .

Xavier initialization compromises:

$\text{Var}(w) = \frac{2}{n_{in} + n_{out}}$

For a uniform distribution: $w \sim U\left[-\sqrt{\frac{6}{n_{in}+n_{out}}},\ \sqrt{\frac{6}{n_{in}+n_{out}}}\right]$

For a normal distribution: $w \sim \mathcal{N}\left(0,\ \frac{2}{n_{in}+n_{out}}\right)$
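
As a sanity check on the two formulas, the sketch below computes the uniform bound and normal standard deviation and samples a weight matrix. The helper names and the layer sizes (300 inputs, 100 outputs) are assumptions made for illustration; the uniform bound follows from the fact that $U[-a, a]$ has variance $a^2/3$.

```python
import math
import random


def xavier_uniform_bound(n_in, n_out):
    """Half-width a of U[-a, a] whose variance a^2 / 3 equals 2 / (n_in + n_out)."""
    return math.sqrt(6.0 / (n_in + n_out))


def xavier_normal_std(n_in, n_out):
    """Standard deviation of the Xavier normal distribution."""
    return math.sqrt(2.0 / (n_in + n_out))


def xavier_uniform(n_in, n_out, rng):
    """Sample an n_out x n_in weight matrix as a list of rows."""
    bound = xavier_uniform_bound(n_in, n_out)
    return [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]


n_in, n_out = 300, 100  # hypothetical layer sizes
weights = xavier_uniform(n_in, n_out, random.Random(0))
flat = [w for row in weights for w in row]
empirical_var = sum(w * w for w in flat) / len(flat)
print(f"target variance:    {2.0 / (n_in + n_out):.5f}")
print(f"empirical variance: {empirical_var:.5f}")
```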

Interviewer's Perspective

When an interviewer asks about Xavier initialization, they want three things: (1) the motivation - variance preservation, (2) the derivation - why the specific formula, and (3) the assumption - that activations are linear. Mentioning that Xavier assumes linear activations (or symmetric activations like tanh) and explaining why this matters for ReLU will immediately set you apart.

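When Xavier works: tanh or linear activations. These are symmetric around zero and approximately linear near the origin, so the zero-mean assumption in the derivation holds approximately. Sigmoid is often listed alongside them, but its outputs are centered at 0.5 rather than zero, so the derivation applies only loosely.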

When Xavier fails: ReLU activations. ReLU zeros out half the inputs, effectively halving the variance. With $n_{in} = n_{out}$ , Xavier initialization causes the activation variance to shrink by a factor of 2 per layer - after 50 ReLU layers it is roughly $2^{-50}$ times the input variance.

He (Kaiming) Initialization

Paper: He et al., 2015 - "Delving Deep into Rectifiers"

He initialization accounts for ReLU's variance-halving effect. Since ReLU zeros out approximately half the neurons (those with negative input), the effective fan-in is halved:

$\text{Var}(w) = \frac{2}{n_{in}}$

For a normal distribution: $w \sim \mathcal{N}\left(0,\ \frac{2}{n_{in}}\right)$

Why the factor of 2? ReLU passes only positive values. For a zero-mean, symmetric pre-activation $y$ , half the values are zeroed, so $E[\text{ReLU}(y)^2] = \frac{1}{2}\text{Var}(y)$ : the second moment that feeds the next layer is half the pre-activation variance. To compensate, we double the initialization variance.
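
The factor of 2 can be checked numerically. The sketch below estimates $E[\text{ReLU}(y)^2]$ for $y \sim \mathcal{N}(0, 1)$ by Monte Carlo and then compares the per-layer variance gain $n_{in} \cdot \text{Var}(w) \cdot \frac{1}{2}$ under Xavier and He. The helper names, the sample count, and the width of 512 are illustrative assumptions.

```python
import math
import random


def relu(x):
    return x if x > 0.0 else 0.0


def he_normal_std(n_in):
    """Standard deviation for He initialization: sqrt(2 / n_in)."""
    return math.sqrt(2.0 / n_in)


def relu_layer_gain(n_in, weight_var):
    """Per-layer variance gain n_in * Var(w) * 1/2 for a linear layer followed by ReLU."""
    return n_in * weight_var * 0.5


# Monte Carlo estimate of E[ReLU(y)^2] for y ~ N(0, 1); it should be close to 0.5.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
relu_second_moment = sum(relu(y) ** 2 for y in samples) / len(samples)
print(f"E[ReLU(y)^2] ~= {relu_second_moment:.3f}")

n = 512  # hypothetical width with n_in == n_out
print("Xavier gain per ReLU layer:", relu_layer_gain(n, 2.0 / (n + n)))
print("He gain per ReLU layer:    ", relu_layer_gain(n, he_normal_std(n) ** 2))
```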

60-Second Answer

"Weight initialization controls the scale of activations and gradients through the network. Xavier initialization sets variance to $2/(n_{in} + n_{out})$ to preserve variance in both directions, assuming linear activations. He initialization sets variance to $2/n_{in}$ to account for ReLU zeroing out half the neurons. Using the wrong initialization for your activation function causes either exploding or vanishing activations, making training impossible or extremely slow. For modern networks with ReLU variants, I default to He initialization. For transformers with LayerNorm, the choice is less critical because normalization stabilizes variance regardless."

LSUV - Layer-Sequential Unit-Variance Initialization

Paper: Mishkin & Matas, 2016

LSUV is a data-driven initialization method:

Initialize weights with orthogonal matrices (preserves norms)
For each layer sequentially:
a. Pass a mini-batch through the network
b. Measure the variance of the layer's output
c. Scale the weights so the output variance equals 1.0
d. Repeat for the next layer
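
The per-layer loop can be sketched in plain Python for a small fully connected network. This is a simplified illustration, not the paper's reference implementation: it assumes tanh activations, tiny hypothetical sizes, and a plain Gaussian pre-initialization standing in for the orthogonal step. The function names are made up for this example.

```python
import math
import random


def matvec(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def output_variance(weights, inputs):
    """Variance of all pre-activation outputs of one layer over a mini-batch."""
    values = [v for x in inputs for v in matvec(weights, x)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def lsuv_init(layers, batch, activation=math.tanh, tol=1e-3, max_iters=10):
    """Rescale each weight matrix in place until its output variance is ~1 on `batch`."""
    inputs = batch
    for weights in layers:
        for _ in range(max_iters):
            var = output_variance(weights, inputs)
            if abs(var - 1.0) < tol:
                break
            scale = 1.0 / math.sqrt(var)
            for row in weights:
                row[:] = [w * scale for w in row]
        # Propagate the batch through the rescaled layer before moving on.
        inputs = [[activation(v) for v in matvec(weights, x)] for x in inputs]
    return layers


rng = random.Random(0)
width, depth, batch_size = 16, 5, 64  # hypothetical sizes
# Assumption: Gaussian pre-initialization stands in for the orthogonal step.
layers = [[[rng.gauss(0.0, 0.05) for _ in range(width)] for _ in range(width)]
          for _ in range(depth)]
batch = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch_size)]
lsuv_init(layers, batch)
```

Because each layer is linear in its weights, a single rescale by $1/\sqrt{\text{Var}}$ already brings the output variance to 1; the inner loop matters for layers where that relationship is only approximate.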

Advantage over Xavier/He: Works for any activation function and any architecture without deriving activation-specific formulas. Particularly useful for exotic architectures where the variance propagation analysis is complex.

Disadvantage: Requires a forward pass during initialization, adding startup cost.

Common Trap

Do NOT confuse LSUV with batch normalization. Both aim to normalize activations, but LSUV is a one-time initialization procedure while BatchNorm is applied during every forward pass. LSUV sets the initial scale; BatchNorm maintains it throughout training. An interviewer may ask you to compare them.

Initialization for Specific Architectures

Architecture	Recommended Init	Why
MLP with ReLU	He normal	Accounts for ReLU variance halving
MLP with tanh/sigmoid	Xavier normal	Assumes symmetric activations
ResNet	He normal + zero-init residual branch	Ensures residual block starts as identity
Transformer	Normal (std 0.02), residual output projections scaled by $1/\sqrt{2N}$ for $N$ blocks	GPT-2 convention, prevents output variance growth
Embedding layers	Normal with small std (0.01-0.02)	Empirical, no strong theoretical basis
LSTM	Orthogonal for recurrent weights, Xavier for input weights	Preserves gradient norms through time steps

Part 2 - Gradient Clipping

Why Gradients Explode

During backpropagation, gradients are multiplied through layers. If the Jacobian of each layer has spectral norm greater than 1, gradients grow exponentially:

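$\|\nabla_{W_1} \mathcal{L}\| \lesssim \prod_{l=1}^{L} \|J_l\| \cdot \|\nabla_{W_L} \mathcal{L}\|$

Here $\mathcal{L}$ is the loss and $L$ is the number of layers; the product of Jacobian norms bounds how much the gradient can be magnified on its way back to the first layer.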

For $L = 50$ layers with $\|J_l\| = 1.1$ , the gradient magnification is $1.1^{50} \approx 117$ . With $\|J_l\| = 2$ , it becomes $2^{50} \approx 10^{15}$ .

Clipping Strategies

Clip by value: Cap each gradient element independently.

$g_i \leftarrow \text{clip}(g_i, -\tau, \tau)$

Simple but changes the direction of the gradient vector
Rarely used in practice

Clip by global norm (standard approach):

$g \leftarrow g \cdot \frac{\tau}{\max(\|g\|, \tau)}$

Preserves gradient direction, only scales magnitude
This is what PyTorch's torch.nn.utils.clip_grad_norm_ does
Standard values: $\tau = 1.0$ for transformers, $\tau = 5.0$ for RNNs
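
Both strategies fit in a few lines. The sketch below represents gradients as a list of flattened parameter tensors; the function names are illustrative and are not PyTorch's API, although `clip_by_global_norm` follows the same formula as above.

```python
import math


def clip_by_value(grads, tau):
    """Clamp every gradient element to [-tau, tau] independently."""
    return [[max(-tau, min(tau, g)) for g in param] for param in grads]


def global_norm(grads):
    """L2 norm over all elements of all parameter gradients."""
    return math.sqrt(sum(g * g for param in grads for g in param))


def clip_by_global_norm(grads, tau):
    """Rescale all gradients by tau / max(norm, tau), which keeps their direction."""
    scale = tau / max(global_norm(grads), tau)
    return [[g * scale for g in param] for param in grads]


# Two hypothetical parameter tensors, flattened to lists.
grads = [[3.0, 0.1], [4.0]]
print("by value:", clip_by_value(grads, 1.0))        # ratios between entries change
print("by norm: ", clip_by_global_norm(grads, 1.0))  # every entry scaled by one factor
```

Note that the norm is taken over all parameters together, so one layer with a very large gradient shrinks the update for every layer.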

Figure: Gradient Clipping by Global Norm - Scale if Exceeds Threshold, Preserve Direction

Company Variation

Google and DeepMind interviews often ask about gradient clipping in the context of transformer training. They expect you to know that gradient clipping is essential for transformer training (not optional) and that the typical threshold is 1.0. Meta asks about it in the context of RNN/LSTM training where exploding gradients are more dramatic.

Gradient Clipping vs. Other Solutions

Problem	Gradient Clipping	Proper Initialization	Normalization Layers	Residual Connections
Exploding gradients	Directly caps	Prevents initially	Prevents during training	Prevents via shortcut
Vanishing gradients	Does NOT help	Partially prevents	Partially prevents	Directly solves
Computational cost	Negligible	Zero (one-time)	Per-layer per-step	Per-block per-step
When to use	Always for transformers/RNNs	Always	Architecture-dependent	Deep networks

Part 3 - Mixed Precision Training

The Numerical Precision Landscape

Modern GPUs support multiple floating-point formats:

Format	Bits	Exponent	Mantissa	Range	Precision
FP32	32	8 bits	23 bits	$\pm 3.4 \times 10^{38}$	~7 decimal digits
FP16	16	5 bits	10 bits	$\pm 65504$	~3 decimal digits
BF16	16	8 bits	7 bits	$\pm 3.4 \times 10^{38}$	~2 decimal digits
TF32	19	8 bits	10 bits	$\pm 3.4 \times 10^{38}$	~3 decimal digits

Why Mixed Precision?

Speed: FP16/BF16 operations are 2-8x faster on modern GPUs (A100, H100).
Memory: Half the memory per parameter means you can double the batch size or model size.
Throughput: Tensor Cores on NVIDIA GPUs are optimized for FP16/BF16 matrix multiplications.

But you cannot simply cast everything to FP16. Here is why:

Where FP16 Fails

Problem 1: Overflow. FP16 max value is 65504. Activations in deep networks can exceed this. Pre-LayerNorm transformers are especially vulnerable because the residual stream is never normalized and its activations can grow with depth.

Problem 2: Underflow. Gradient values below $2^{-24} \approx 5.96 \times 10^{-8}$ become zero in FP16. Many gradient values, especially in early layers of deep networks, fall below this threshold. The model stops learning.

Problem 3: Accumulation errors. Summing many small FP16 values loses precision rapidly. The sum of 1000 values of 0.001 in FP16 may not equal 1.0.
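
All three failure modes can be reproduced without a GPU. The sketch below emulates FP16 rounding with the `e` (IEEE 754 half precision) format of Python's `struct` module; `to_fp16` is a hypothetical helper, and mapping out-of-range values to infinity is an assumption that mimics what hardware does.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses values beyond the FP16 range; hardware would give +/-inf.
        return float("inf") if x > 0 else float("-inf")


# Problem 1: overflow above 65504.
print("70000 ->", to_fp16(70000.0))

# Problem 2: tiny gradients flush to zero.
print("1e-8  ->", to_fp16(1e-8))

# Problem 3: accumulate 1000 copies of 0.001, rounding to FP16 after every add.
total = 0.0
for _ in range(1000):
    total = to_fp16(total + to_fp16(0.001))
print("FP16 running sum:", total)
```

The running sum drifts because, once the accumulator is large, each small addend is rounded to the accumulator's much coarser spacing.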

The Mixed Precision Recipe (Micikevicius et al., 2018)

Figure: Mixed Precision Training - FP16/BF16 Forward Pass with FP32 Master Weights

Key principles:

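Master weights in FP32. Weight updates are tiny ( $\text{lr} \times \text{gradient} \approx 10^{-3} \times 10^{-4} = 10^{-7}$ ). FP16 values near 1.0 are spaced $2^{-10} \approx 10^{-3}$ apart, so adding such an update to a weight of ordinary magnitude rounds back to the same value and the update is lost.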
Forward and backward pass in FP16. This is where the speedup comes from.
Loss scaling. Multiply the loss by a large constant $S$ (e.g., 1024) before backprop. This shifts gradients into FP16's representable range. After backprop, divide by $S$ before the weight update.
Dynamic loss scaling. Start with a large $S$ . If inf/nan appears in the gradients, skip that weight update and halve $S$ . If no inf/nan appears for N steps, double $S$ .
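
The dynamic loss scaling rule can be sketched as a small state machine. `DynamicLossScaler` is a hypothetical class written for this page, not a library API; the default scale and growth interval are assumptions, and skipping the optimizer step on overflow is the usual practice in mixed precision implementations.

```python
import math


class DynamicLossScaler:
    """Minimal dynamic loss scaling: halve on inf/nan, double after a run of clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def unscale(self, scaled_grads):
        """Divide gradients of the scaled loss by S to recover the true gradients."""
        return [g / self.scale for g in scaled_grads]

    def update(self, scaled_grads):
        """Return True if the optimizer step should run, and adjust the scale."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale /= 2.0
            self.clean_steps = 0
            return False  # skip this step: the gradients are unusable
        self.clean_steps += 1
        if self.clean_steps == self.growth_interval:
            self.scale *= 2.0
            self.clean_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
print(scaler.update([0.5, float("inf")]), scaler.scale)  # overflow: skipped, scale halved
for _ in range(3):
    scaler.update([0.5, 0.25])
print(scaler.scale)  # three clean steps: scale doubled again
```

In a training loop, the loss would be multiplied by `scaler.scale` before backprop, and `unscale` applied to the gradients only when `update` returns True.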

BF16: The Better Alternative

BF16 (Brain Floating Point, developed by Google Brain) uses the same exponent size as FP32 (8 bits) but with reduced mantissa (7 bits vs 23 bits).

Advantages of BF16 over FP16:

Same dynamic range as FP32 - no overflow issues
No loss scaling needed (the entire purpose of loss scaling was to handle FP16's limited range)
Simpler training pipeline

Disadvantage: Slightly less precision than FP16 (2 decimal digits vs 3). In practice, this rarely matters for training.
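
The range-versus-precision tradeoff can be seen by emulating BF16 in plain Python. Since BF16 is the top 16 bits of the FP32 encoding, the sketch below rounds away the low 16 bits (round to nearest, ties to even). `to_bf16` is a hypothetical helper and assumes finite inputs within FP32 range.

```python
import struct


def to_bf16(x):
    """Round a float to bfloat16 by keeping the top 16 bits of its FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that are dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(to_bf16(70000.0))  # finite: beyond FP16's maximum but well inside BF16's range
print(to_bf16(1e-8))     # nonzero: far below FP16's smallest subnormal
print(to_bf16(1.001))    # coarse mantissa: neighbors of 1.0 are about 0.0078 apart
```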

Interviewer's Perspective

If asked "what precision would you train a large model in?", the strong answer is: "BF16 for both forward and backward passes on hardware that supports it (A100+, TPU v3+), with FP32 master weights. This avoids the need for loss scaling. If BF16 isn't available, FP16 with dynamic loss scaling. I'd keep optimizer states (momentum, variance in Adam) in FP32 regardless." This shows you understand the practical tradeoffs, not just the theory.

Operations That Must Stay in FP32

Not everything can be safely done in reduced precision:

Softmax: Exponentiation can overflow. Compute in FP32 or use numerically stable implementation (subtract max).
Layer normalization: Variance computation needs FP32 for numerical stability.
Loss computation: Cross-entropy involves log of small values. A probability that underflows to zero in FP16 gives log(0) = -inf, which then propagates as inf/NaN.
Optimizer state: Adam's running averages need FP32 precision.
Gradient accumulation: Summing many small gradients must be in FP32.

Common Trap

Do NOT say "mixed precision just means using FP16 everywhere." The "mixed" means some operations use FP16 and some use FP32. Knowing which operations must stay in FP32 is what distinguishes engineers who have actually trained models from those who have only read about it.

Part 4 - Curriculum Learning

Core Idea

Train the model on easy examples first, then gradually introduce harder examples - mimicking how humans learn.

Paper: Bengio et al., 2009 - "Curriculum Learning"

Why It Works

Better optimization landscape: Easy examples provide strong, clean gradients that move the model toward a good region of parameter space. Hard/noisy examples early on provide conflicting gradients.
Faster convergence: The model builds basic representations quickly on easy examples, then refines them on hard examples.
Better generalization: Starting with clean examples helps the model learn true patterns before encountering noise.

Difficulty Metrics

How do you define "easy" vs "hard"?

Method	Definition of Difficulty	Example
Loss-based	Higher loss = harder	Sort by per-sample loss after one epoch
Confidence-based	Lower confidence = harder	Sort by model's predicted probability
Heuristic	Domain knowledge	Short sentences before long sentences in NLP
Length-based	Longer = harder	Common in sequence tasks
Noise-based	More noise = harder	Remove noisy labels first
Self-paced	Model decides	Include examples where loss is below a threshold that increases over time
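
Whatever the difficulty metric, the scheduling logic looks similar: sort by difficulty and widen the pool of eligible examples over time. The sketch below uses a length-based metric and a linear pacing function; the function names, the starting fraction of 0.25, and the toy sentences are all assumptions for illustration.

```python
def linear_pacing(step, total_steps, start_frac=0.25):
    """Fraction of the difficulty-sorted dataset available at `step` (grows linearly to 1)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_frac + (1.0 - start_frac) * progress


def curriculum_pool(examples, difficulty, step, total_steps, start_frac=0.25):
    """Return the easiest examples the model may sample from at this step."""
    ordered = sorted(examples, key=difficulty)
    count = max(1, int(len(ordered) * linear_pacing(step, total_steps, start_frac)))
    return ordered[:count]


def num_tokens(sentence):
    return len(sentence.split())


# Hypothetical length-based curriculum: shorter sentences count as easier.
sentences = [" ".join(["tok"] * n) for n in (7, 2, 5, 1, 8, 3, 6, 4)]
for step in (0, 500, 1000):
    pool = curriculum_pool(sentences, num_tokens, step, total_steps=1000)
    print(step, [num_tokens(s) for s in pool])
```

In practice the dataset would be sorted once up front rather than at every step; the repeated sort here just keeps the example short.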

Anti-Curriculum Learning

Surprisingly, some work shows that training on hard examples first can also help, particularly when:

The dataset contains many near-duplicate easy examples
Hard examples are the most informative for learning decision boundaries
The model needs to learn rare but important patterns

Company Variation

Curriculum learning is frequently asked at NLP-focused companies (Google, Meta AI, Cohere). It is especially relevant for training language models where sequence length, data quality, and domain mixing order all affect final performance. GPT-4 and Llama training reportedly use data curriculum strategies.

Part 5 - Knowledge Distillation

The Core Framework

Paper: Hinton et al., 2015 - "Distilling the Knowledge in a Neural Network"

A large "teacher" model has learned rich representations. A smaller "student" model learns to mimic the teacher's behavior, not just match the ground truth labels.

Figure: Knowledge Distillation - Teacher Soft Labels Transfer Dark Knowledge to Student

Why Soft Labels Are More Informative Than Hard Labels

Consider a teacher classifying an image. The hard label says "cat" (one-hot). But the teacher's soft predictions might be:

Class	Hard Label	Teacher Soft Label (T=1)	Teacher Soft Label (T=5)
Cat	1.0	0.92	0.41
Dog	0.0	0.05	0.23
Bird	0.0	0.02	0.19
Car	0.0	0.01	0.17

The soft labels reveal that this cat looks somewhat like a dog (both are furry animals) and less like a car. This "dark knowledge" - the inter-class similarity structure - is lost in hard labels but preserved in soft labels.

Temperature Scaling

The temperature parameter $T$ controls how "soft" the distribution is:

$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$

$T = 1$ : Standard softmax (peaky distribution)
$T > 1$ : Softer distribution, reveals more inter-class structure
$T \to \infty$ : Uniform distribution (all classes equally likely)
$T \to 0$ : Hard argmax (one-hot)

Typical value: $T = 3$ to $T = 20$ (Hinton et al. used $T = 20$ in their MNIST experiment).
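
A short sketch makes the four regimes concrete. The logits below are hypothetical, and the function subtracts the maximum before exponentiating purely for numerical stability; that does not change the result.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / T, computed with the max subtracted for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [6.0, 3.0, 2.0, 1.0]  # hypothetical teacher logits for four classes
for temperature in (0.1, 1.0, 5.0, 100.0):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature:>5}: " + ", ".join(f"{p:.3f}" for p in probs))
```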

The Distillation Loss

$L = \alpha \cdot T^2 \cdot \text{KL}\left(\text{softmax}(z_t/T)\ \|\ \text{softmax}(z_s/T)\right) + (1-\alpha) \cdot \text{CE}(y, \text{softmax}(z_s))$

Why $T^2$ ? The gradients of the KL divergence with temperature-scaled softmax are scaled by $1/T^2$ compared to standard softmax. Multiplying by $T^2$ compensates, keeping the relative contribution of the distillation loss consistent regardless of temperature.

Typical $\alpha$ : 0.5 to 0.9 (weight toward distillation loss).
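
The loss above translates directly into code. The following is a minimal single-example sketch in plain Python, assuming hypothetical logits and the illustrative defaults $T = 4$ and $\alpha = 0.7$; a real pipeline would compute the same quantity batched in a framework.

```python
import math


def log_softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(label, student)."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_teacher, log_p_student))
    cross_entropy = -log_softmax(student_logits)[label]  # hard-label term uses T = 1
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * cross_entropy


teacher = [6.0, 3.0, 2.0, 1.0]  # hypothetical logits; class 0 is the true label
student = [2.0, 1.5, 0.5, 0.0]
print(f"loss = {distillation_loss(student, teacher, label=0):.4f}")
```

When the student's logits equal the teacher's, the KL term vanishes and only the weighted cross-entropy remains, which is a handy unit test for an implementation.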

60-Second Answer

"Knowledge distillation transfers knowledge from a large teacher model to a smaller student model. Instead of training the student on hard labels only, we also train it to match the teacher's soft probability distribution. We raise the temperature in the softmax to reveal inter-class similarities - the 'dark knowledge.' The student learns not just what the correct answer is, but what the incorrect answers look like, which is much more informative per training example. The loss combines KL divergence between teacher and student soft distributions with standard cross-entropy on hard labels. This lets you deploy a model that's 3-10x smaller with only 1-3% accuracy loss."

Types of Knowledge Distillation

Type	What is Transferred	Example
Response-based	Final layer output (soft labels)	Original Hinton distillation
Feature-based	Intermediate layer representations	FitNets (Romero et al., 2015)
Relation-based	Relationships between samples	Contrastive Representation Distillation
Self-distillation	Model distills into itself	Born-Again Networks, deep-to-shallow within same model
Online distillation	Models teach each other during training	Deep Mutual Learning

Practical Distillation Results

Teacher	Student	Task	Teacher Acc	Student Acc (no distill)	Student Acc (distilled)
BERT-Large	BERT-Small	GLUE	86.5	78.2	83.1
ResNet-152	ResNet-18	ImageNet	78.3	69.8	73.5
GPT-3	GPT-2	Text Gen	-	-	Significant improvement
Llama-70B	Llama-7B	Benchmarks	-	-	~5% improvement

Part 6 - Label Smoothing

The Problem with Hard Labels

Standard one-hot labels force the model to be infinitely confident: predict exactly 1.0 for the correct class and 0.0 for all others. To achieve this, logits must go to $\pm\infty$ , which:

Causes the model to be overconfident - calibration suffers
Encourages memorization of training data
Makes the model less generalizable - it overfits to the exact label distribution

Label Smoothing Formula

Replace the one-hot label $y$ with a smoothed version:

$y_{\text{smooth}} = (1 - \epsilon) \cdot y + \frac{\epsilon}{K}$

Where $\epsilon$ is the smoothing parameter (typically 0.1) and $K$ is the number of classes.

Example with $K=4$ and $\epsilon=0.1$ :

Class	Hard Label	Smoothed Label
Cat (correct)	1.0	0.925
Dog	0.0	0.025
Bird	0.0	0.025
Car	0.0	0.025
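
The formula is a one-liner in code. The sketch below builds the smoothed target for the $K=4$ , $\epsilon=0.1$ case and compares the cross-entropy of an overconfident prediction against one that matches the smoothed target; the helper names and the overconfident probabilities are illustrative assumptions.

```python
import math


def smooth_labels(target, num_classes, epsilon=0.1):
    """(1 - epsilon) * one_hot(target) + epsilon / K for each of the K classes."""
    off_value = epsilon / num_classes
    return [(1.0 - epsilon) + off_value if k == target else off_value
            for k in range(num_classes)]


def cross_entropy(target_probs, predicted_probs):
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0.0)


smoothed = smooth_labels(target=0, num_classes=4, epsilon=0.1)
print(smoothed)

# With smoothed targets, an extremely confident prediction is penalized
# more than a prediction that matches the smoothed distribution.
overconfident = [0.9997, 0.0001, 0.0001, 0.0001]
print(cross_entropy(smoothed, overconfident), cross_entropy(smoothed, smoothed))
```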

Why Label Smoothing Works

Regularization effect: The smoothed targets are matched at finite logits, so the loss no longer rewards pushing the correct-class logit toward infinity. This penalizes overconfident predictions.
Better calibration: The model learns to output probabilities closer to the true uncertainty.
Tighter clusters: Label smoothing encourages the model to keep all class representations at a fixed distance from each other, creating tighter, more uniform clusters in embedding space.

Label Smoothing vs. Knowledge Distillation

Both provide "soft" targets, but they are fundamentally different:

Aspect	Label Smoothing	Knowledge Distillation
Source of soft labels	Uniform distribution	Teacher model
Information content	No inter-class structure	Rich inter-class relationships
Requires teacher	No	Yes
Computational cost	Negligible	Requires teacher inference
Typical improvement	0.5-2%	3-10%

Common Trap

Label smoothing and knowledge distillation are sometimes confused in interviews. When asked about label smoothing, do NOT describe knowledge distillation. The key difference: label smoothing uses a uniform distribution over non-target classes (every wrong class gets equal probability), while distillation uses the teacher's distribution (wrong classes get probability proportional to their similarity to the correct class). Label smoothing is a regularization technique; distillation is a knowledge transfer technique.

Part 7 - Putting It All Together: A Training Recipe

Here is how all these techniques combine in a modern training pipeline:

Figure: Modern Training Recipe - Init, Precision, Gradient Clipping, Label Smoothing

Common Training Failures and Fixes

Symptom	Likely Cause	Fix
Loss is NaN from step 1	Bad initialization or no gradient clipping	Use He init, clip gradients at 1.0
Loss plateaus immediately	Symmetry problem or dead neurons	Check initialization, use LeakyReLU
Loss spikes periodically	Exploding gradients	Reduce gradient clip threshold
Loss decreases then suddenly NaN	FP16 overflow	Enable loss scaling or switch to BF16
Training is slow on GPU	Not using mixed precision	Enable AMP with BF16
Student much worse than teacher	Wrong temperature or alpha	Try T=3-20, alpha=0.5-0.9
Model overconfident	No label smoothing or regularization	Add label smoothing epsilon=0.1
Model underfits easy examples	Curriculum too aggressive	Start with more easy examples

Practice Problems

Problem 1: Initialization Debugging

You initialize a 100-layer MLP with ReLU activations using Xavier normal initialization. After the first forward pass, all activations in the last layer are nearly zero. Explain why and fix it.

Hint 1 - Direction

Think about what ReLU does to the variance of activations at each layer when Xavier initialization is used.

Hint 2 - Insight

ReLU zeros out approximately half the inputs. Xavier assumes linear/symmetric activations where all inputs pass through. Each ReLU layer halves the variance, so after 100 layers the variance is scaled by $(0.5)^{100} \approx 10^{-30}$ .

Hint 3 - Full Solution + Rubric

Why activations vanish:

Xavier sets $\text{Var}(w) = 2/(n_{in} + n_{out})$ , designed for linear activations. With ReLU, each layer halves the variance (since half the outputs are zeroed). After $L$ layers:

$\text{Var}(\text{activation}_L) \approx \left(\frac{n_{in}}{n_{in} + n_{out}} \right)^L \cdot \text{Var}(x)$

For $n_{in} = n_{out}$ , this simplifies to $(0.5)^{100} \approx 10^{-30}$ . Activations are effectively zero.

Fix: Switch to He initialization with $\text{Var}(w) = 2/n_{in}$ , which compensates for ReLU's variance halving. After this change:

$\text{Var}(\text{activation}_L) \approx \text{Var}(x)$

The variance is preserved through all 100 layers.
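
The shrinkage can also be observed empirically. The sketch below pushes a small random batch through a random ReLU MLP under both initializations and reports the mean squared activation at the last layer. To keep it fast in plain Python it assumes a depth of 20 and width of 32 rather than the 100 layers in the problem; the function name and sizes are illustrative.

```python
import math
import random


def mean_square_after_relu_mlp(depth, width, weight_std, rng, batch_size=8):
    """Mean squared activation at the last layer of a random ReLU MLP."""
    batch = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch_size)]
    for _ in range(depth):
        weights = [[rng.gauss(0.0, weight_std) for _ in range(width)] for _ in range(width)]
        batch = [[max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]
                 for x in batch]
    return sum(v * v for x in batch for v in x) / (batch_size * width)


depth, width = 20, 32  # smaller than the 100-layer problem so it runs quickly
xavier_std = math.sqrt(2.0 / (width + width))
he_std = math.sqrt(2.0 / width)
xavier_ms = mean_square_after_relu_mlp(depth, width, xavier_std, random.Random(0))
he_ms = mean_square_after_relu_mlp(depth, width, he_std, random.Random(0))
print(f"Xavier: {xavier_ms:.3e}   He: {he_ms:.3e}")
```

Both runs use the same seed, so the two networks differ only in weight scale, and the gap between them grows by a factor of about 2 with every layer.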

Additional fixes for very deep networks: Add residual connections, add batch/layer normalization, or use LSUV initialization.

Scoring Rubric:

Strong Hire: Correctly identifies the ReLU-Xavier mismatch, derives the variance shrinkage factor, prescribes He initialization, and mentions complementary techniques (residual connections, normalization).
Lean Hire: Identifies the problem but cannot derive the shrinkage rate or only says "use He init" without explaining why.
No Hire: Cannot explain why Xavier fails with ReLU or suggests increasing learning rate as the fix.

Problem 2: Mixed Precision Design

You are training a 10B parameter language model on 8 A100 GPUs. Design the mixed precision strategy, including what precision each component uses and how you handle numerical stability.

Hint 1 - Direction

List every component: weights, activations, gradients, optimizer states, loss computation, normalization layers. Decide the precision for each.

Hint 2 - Insight

Master weights in FP32 (4 bytes per param = 40GB just for weights). Forward/backward in BF16 (avoids loss scaling). Adam optimizer states (momentum + variance) in FP32 (adds 2x weight memory). Specific operations (softmax, layernorm, loss) in FP32 for stability.

Hint 3 - Full Solution + Rubric

Component-level precision assignment:

Component	Precision	Reason
Master weights	FP32	Weight updates are tiny, underflow in FP16
Forward activations	BF16	Speed + memory, BF16 range prevents overflow
Backward gradients	BF16	Matches forward precision
Adam first moment	FP32	Running average needs precision
Adam second moment	FP32	Running average needs precision
Softmax	FP32	exp() overflow risk
LayerNorm	FP32	Variance computation needs precision
Loss computation	FP32	log() underflow risk
Gradient accumulation	FP32	Summing many small values
Gradient all-reduce	BF16	Communication bandwidth bottleneck

Memory breakdown for 10B params:

Master weights: 10B x 4 bytes = 40GB
BF16 weights (forward copy): 10B x 2 bytes = 20GB
Adam states: 10B x 8 bytes = 80GB
Gradients: 10B x 2 bytes = 20GB
Total per-model: ~160GB across 8 GPUs = 20GB per GPU (with sharding)
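
The arithmetic generalizes to other model sizes. The sketch below encodes the per-parameter byte counts from the breakdown above; it assumes 1 GB = $10^9$ bytes and perfectly even sharding, and it ignores activation memory. The function and key names are hypothetical.

```python
def training_memory_gb(num_params, num_gpus):
    """Per-component memory in GB (1 GB = 1e9 bytes) for BF16 training with Adam."""
    bytes_per_param = {
        "master_weights_fp32": 4,
        "bf16_weights": 2,
        "adam_states_fp32": 8,  # first and second moments, 4 bytes each
        "gradients_bf16": 2,
    }
    breakdown = {name: num_params * size / 1e9 for name, size in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    breakdown["per_gpu_sharded"] = breakdown["total"] / num_gpus
    return breakdown


for name, gigabytes in training_memory_gb(num_params=10e9, num_gpus=8).items():
    print(f"{name:>22}: {gigabytes:6.1f} GB")
```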

Why BF16 over FP16: No loss scaling needed, simplifies the pipeline, same dynamic range as FP32. On A100s, BF16 and FP16 have the same throughput on Tensor Cores.

Scoring Rubric:

Strong Hire: Correct precision for every component, memory calculation, justification for BF16 over FP16, mentions gradient accumulation precision and communication precision.
Lean Hire: Gets the main idea (BF16 forward, FP32 master weights) but misses nuances like optimizer state precision or which operations need FP32.
No Hire: Suggests FP16 everywhere or cannot explain why master weights need FP32.

Problem 3: Distillation Temperature Selection

You are distilling a BERT-Large teacher into a DistilBERT student for sentiment classification (positive/negative). The teacher achieves 95% accuracy. With $T=1$ , the student gets 88%. How would you tune temperature to improve the student? What temperatures would you try and why?

Hint 1 - Direction

Think about what temperature controls in the softmax distribution. For a binary classification task specifically, how much "dark knowledge" is available?

Hint 2 - Insight

With only 2 classes, the soft distribution has limited dark knowledge (just the ratio between positive and negative probabilities). Higher temperatures will flatten this further. The optimal temperature for binary classification is typically lower ( $T = 2$ - $5$ ) than for multi-class ( $T = 5$ - $20$ ) because there is less inter-class structure to reveal.

Hint 3 - Full Solution + Rubric

Analysis:

For binary sentiment classification, the teacher's output is a 2-class distribution. At $T=1$ , a 95%-confident teacher outputs approximately [0.95, 0.05] for a clear positive example.

Temperature sweep:

$T=1$ : [0.95, 0.05] - Almost hard label, little dark knowledge
$T=3$ : [0.73, 0.27] - Reveals teacher's uncertainty more
$T=5$ : [0.64, 0.36] - Quite soft, but still informative
$T=10$ : [0.57, 0.43] - Approaching uniform, losing signal
$T=20$ : [0.54, 0.46] - Nearly uniform, useless
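
For two classes the sweep has a closed form: the softened probability is a sigmoid of the teacher's logit gap divided by $T$ . The sketch below recomputes such a sweep from the teacher's confidence at $T=1$ ; `soften_binary` is a hypothetical helper.

```python
import math


def soften_binary(confidence, temperature):
    """Teacher probability of the predicted class after dividing its logit gap by T."""
    logit_gap = math.log(confidence / (1.0 - confidence))
    return 1.0 / (1.0 + math.exp(-logit_gap / temperature))


for temperature in (1, 3, 5, 10, 20):
    p = soften_binary(0.95, temperature)
    print(f"T={temperature:>2}: [{p:.2f}, {1.0 - p:.2f}]")
```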

Recommended approach:

Try $T \in \{2, 3, 5, 7\}$ on a validation set
For binary classification, $T=3$ or $T=5$ is likely optimal
Tune $\alpha$ (weight between distillation and hard loss) jointly: try $\alpha \in \{0.3, 0.5, 0.7, 0.9\}$
For binary tasks, consider augmenting with feature-based distillation (match intermediate representations) since the output distribution has limited dark knowledge

Key insight: Binary classification is the worst case for response-based distillation because there are only 2 classes to distribute probability over. Feature-based distillation (matching intermediate BERT hidden states) would likely give more improvement.

Scoring Rubric:

Strong Hire: Recognizes that binary classification limits dark knowledge, recommends feature-based distillation as a complement, provides specific temperature range with reasoning, discusses $\alpha$ tuning.
Lean Hire: Suggests trying different temperatures but doesn't recognize the fundamental limitation of 2-class distillation.
No Hire: Suggests $T=20$ "because higher is better" or doesn't understand what temperature controls.

Problem 4: Curriculum Learning for Code Generation

You are training a code generation model (similar to CodeLlama). Design a curriculum learning strategy. What defines "easy" vs "hard" code? How do you schedule the curriculum?

Hint 1 - Direction

Think about multiple dimensions of code difficulty: length, language complexity, algorithmic complexity, number of dependencies, test pass rate.

Hint 2 - Insight

Easy code: short functions, single language, common patterns (getters/setters, simple loops). Hard code: multi-file projects, complex algorithms, concurrent programming, domain-specific patterns. The curriculum should consider both syntactic complexity and semantic difficulty.

Hint 3 - Full Solution + Rubric

Difficulty dimensions:

Length: Short functions (< 20 lines) to multi-file projects
Cyclomatic complexity: Linear code to deeply nested conditionals/loops
Language frequency: Python/JavaScript (common) to Rust/Haskell (rare)
Algorithmic complexity: O(n) loops to dynamic programming/graph algorithms
Dependencies: Standalone functions to complex import chains
Documentation quality: Well-commented code to uncommented code
Test pass rate: How often a baseline model generates correct code for this task

Proposed curriculum (4 phases):

Phase 1 (first 10% of training): Single-function Python/JavaScript, < 20 lines, simple logic, well-documented.

Phase 2 (10-40%): Multi-function files, add Java/TypeScript/Go, include standard library usage, introduce common algorithms.

Phase 3 (40-80%): Multi-file projects, all languages, complex algorithms, API usage, error handling patterns.

Phase 4 (80-100%): Full difficulty range including concurrent code, advanced type systems, low-level systems code, adversarial/edge-case examples.

Scheduling: Linear pacing function that gradually increases the difficulty threshold. At each phase boundary, validate on a held-out set at the target difficulty level.

Anti-curriculum consideration: For code, showing some hard examples early (e.g., complex algorithms) might help the model learn structural patterns. A mixed strategy that is 80% curriculum-ordered and 20% random might outperform strict ordering.

Scoring Rubric:

Strong Hire: Identifies multiple difficulty dimensions, provides concrete phasing, discusses scheduling, considers anti-curriculum aspects, mentions validation strategy.
Lean Hire: Provides a reasonable curriculum but only considers 1-2 difficulty dimensions.
No Hire: Cannot define code difficulty or proposes a curriculum with no clear rationale.

Problem 5: Training Recipe for Production

Your team is about to train a 1B parameter vision transformer on ImageNet from scratch. Write out the complete training recipe: initialization, precision, gradient management, label strategy, and any other techniques. Justify each choice.

Hint 1 - Direction

Think about what specific choices the ViT paper and follow-up work (DeiT, BEiT) made. Why did they deviate from standard ResNet training recipes?

Hint 2 - Insight

ViTs are notoriously harder to train than CNNs. They typically need careful initialization (truncated normal, small std), strong data augmentation, longer training schedules, and label smoothing, and they often benefit from knowledge distillation from a CNN teacher (the DeiT approach). Mixed precision is essential at 1B parameters.

Hint 3 - Full Solution + Rubric

Complete training recipe:

Initialization:

Patch embedding: truncated normal, std=0.02
Position embeddings: truncated normal, std=0.02
Attention layers: Xavier uniform
MLP layers: He normal (GELU approximates ReLU)
Final classifier head: zero-initialized (all logits start at zero, so the initial predictions are uniform)
LayerNorm: weight=1, bias=0
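
To make these choices concrete, the sketch below computes the weight scale each scheme implies for ViT-like layer shapes. The hidden size of 1024 and MLP width of 4096 are assumed values for illustration only; the source does not specify the model dimensions.

```python
import math


def xavier_uniform_bound(fan_in: int, fan_out: int) -> float:
    """Bound b of Uniform(-b, b) such that Var(w) = 2 / (fan_in + fan_out)."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def he_normal_std(fan_in: int) -> float:
    """Standard deviation such that Var(w) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


HIDDEN = 1024  # assumed embedding dimension
MLP = 4096     # assumed MLP width (4x hidden)

# Attention projections (HIDDEN -> HIDDEN) with Xavier uniform.
attn_bound = xavier_uniform_bound(HIDDEN, HIDDEN)
attn_std = attn_bound / math.sqrt(3.0)  # a Uniform(-b, b) has variance b^2 / 3

# MLP up-projection (HIDDEN -> MLP) and down-projection (MLP -> HIDDEN) with He normal.
mlp_up_std = he_normal_std(HIDDEN)
mlp_down_std = he_normal_std(MLP)

print(f"attention: bound={attn_bound:.4f}, std={attn_std:.4f}")
print(f"mlp up:    std={mlp_up_std:.4f}")
print(f"mlp down:  std={mlp_down_std:.4f}")
```

Under these assumed shapes the standard deviations come out at roughly 0.031, 0.044, and 0.022, the same order of magnitude as the fixed std=0.02 used for the embeddings.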

Precision:

BF16 mixed precision on A100 GPUs
FP32 master weights and optimizer states
FP32 for LayerNorm and the attention softmax
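
The preference for BF16 over FP16 comes down to range versus precision, which can be illustrated with the standard library alone. In this rough sketch, `to_bf16` simulates BF16 by truncating the FP32 bit pattern (real hardware typically rounds to nearest, so this is a simplification), and the gradient value 1e-8 and scale factor 1024 are arbitrary illustrative choices.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip a value through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Simulate BF16 by keeping only the top 16 bits of the FP32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


tiny_grad = 1e-8  # illustrative small gradient value

# FP16 has only 5 exponent bits: the tiny gradient underflows to zero.
print("fp16:", to_fp16(tiny_grad))

# BF16 keeps FP32's 8 exponent bits: the gradient survives (with less precision).
print("bf16:", to_bf16(tiny_grad))

# Loss scaling multiplies the loss (and hence the gradients) before backprop,
# which moves the value back into FP16's representable range.
print("fp16 with scale 1024:", to_fp16(tiny_grad * 1024))

# The cost of BF16: only 7 mantissa bits, so nearby values collapse.
print("bf16(1.001):", to_bf16(1.001))
```

This is also why the recipe keeps FP32 master weights: small updates added to a BF16 weight can vanish in the coarse mantissa.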

Gradient management:

Gradient clipping: global norm 1.0
Gradient accumulation: effective batch size 4096 (common for ViT)
Learning rate: 3e-4 peak with linear warmup (10K steps) + cosine decay
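
The schedule and the clipping rule above can be sketched in plain Python. The peak learning rate, warmup length, and clip threshold come from the recipe; `total_steps`, `min_lr`, and the representation of gradients as nested lists (rather than framework tensors) are assumptions made to keep the example self-contained.

```python
import math


def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=10_000, min_lr=0.0):
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by max_norm / max(global_norm, max_norm).

    `grads` is a list of per-parameter gradient lists. The norm is taken over
    every element jointly, so the update direction is preserved.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = max_norm / max(total_norm, max_norm)
    return [[g * scale for g in grad] for grad in grads], total_norm


# Usage: two "parameters" whose joint gradient norm is 5.0 get rescaled to norm 1.0.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before, clipped)
print(lr_at(0, 100_000), lr_at(9_999, 100_000), lr_at(100_000, 100_000))
```

With gradient accumulation, clipping is normally applied once to the accumulated gradient just before the optimizer step, not to each micro-batch.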

Label strategy:

Label smoothing: epsilon=0.1
Knowledge distillation from RegNetY-16GF teacher (DeiT approach)
Mixup (alpha=0.8) + CutMix (alpha=1.0) augmentation
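
Label smoothing and Mixup both act on the target distribution, so they compose naturally. The following is a minimal sketch of how the targets might be built, using plain lists as stand-ins for tensors; the function names and the 4-class toy setup are illustrative assumptions.

```python
import random


def smooth_one_hot(label, num_classes, epsilon=0.1):
    """Smoothed target (1 - epsilon) * y + epsilon / K for a one-hot y."""
    off_value = epsilon / num_classes
    return [off_value + ((1.0 - epsilon) if k == label else 0.0)
            for k in range(num_classes)]


def mixup_targets(target_a, target_b, lam):
    """Convex combination of two target distributions with weight `lam`."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(target_a, target_b)]


rng = random.Random(0)
lam = rng.betavariate(0.8, 0.8)  # Mixup draws lambda from Beta(alpha, alpha), alpha = 0.8

target_a = smooth_one_hot(label=0, num_classes=4)
target_b = smooth_one_hot(label=2, num_classes=4)
mixed = mixup_targets(target_a, target_b, lam)
print(target_a)
print(mixed)
```

The same `lam` must also be used to blend the two input images, so that the mixed target matches the mixed input.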

Additional techniques:

Stochastic depth (drop path rate 0.1)
Repeated augmentation
Random erasing (probability 0.25)
Weight decay 0.05 (AdamW)
EMA of model weights for evaluation
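
Of these techniques, the EMA update is the simplest to write down. The sketch below uses flat lists as stand-ins for model parameters; the default decay of 0.9999 is a commonly used value assumed here, not one given in the recipe, and the toy loop uses 0.5 so the effect is visible in a few steps.

```python
def ema_update(ema, current, decay=0.9999):
    """Return decay * ema + (1 - decay) * current, elementwise."""
    return [decay * e + (1.0 - decay) * c for e, c in zip(ema, current)]


ema = [0.0, 0.0]
for step in range(3):
    current = [1.0, 2.0]  # stand-in for the live weights after an optimizer step
    ema = ema_update(ema, current, decay=0.5)
print(ema)
```

In a real training loop the EMA copy is usually initialized from the model's weights rather than zeros, and only the EMA copy is used for evaluation.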

Why these choices matter for ViTs specifically:

ViTs lack the inductive biases of CNNs (translation equivariance, locality), so they need more data and stronger regularization
Knowledge distillation from a CNN teacher provides the inductive bias that ViTs lack
Label smoothing prevents the model from becoming overconfident on ImageNet's noisy labels
Large batch size with warmup is critical for stable transformer training

Scoring Rubric:

Strong Hire: Covers all categories (init, precision, gradients, labels, augmentation), provides specific values with justifications, mentions ViT-specific challenges, references DeiT/BEiT choices.
Lean Hire: Gets the main techniques right but misses ViT-specific nuances or provides generic values without justification.
No Hire: Treats ViT training the same as ResNet training or cannot provide a coherent recipe.

Interview Cheat Sheet

Concept	Key Formula	One-Liner	Red Flag
Xavier init	$\text{Var}(w) = 2/(n_{in}+n_{out})$	Preserve variance for linear activations	"Use Xavier for everything"
He init	$\text{Var}(w) = 2/n_{in}$	Compensates for ReLU halving variance	"He and Xavier are the same"
Symmetry breaking	Random init required	Same init = same neurons forever	"Init to zero is fine"
Gradient clipping (norm)	$g \cdot \tau / \max(\|g\|, \tau)$	Preserve direction, limit magnitude	"Clip each gradient independently"
FP16 range	Max 65504	Needs loss scaling to avoid underflow	"FP16 is always fine"
BF16 advantage	Same range as FP32	No loss scaling needed	"BF16 and FP16 are identical"
Loss scaling	Multiply loss by $S$ before backprop	Shifts gradients into FP16 range	Not knowing why it exists
Knowledge distillation	$\text{KL}(\text{soft}_T \,\|\, \text{soft}_S) \cdot T^2$	Teacher's soft labels transfer dark knowledge	"Just train on teacher's hard predictions"
Temperature	$p_i = \exp(z_i/T) / \sum_j \exp(z_j/T)$	Higher T = softer distribution	"T=100 gives best results"
Label smoothing	$(1-\epsilon) \cdot y + \epsilon/K$	Prevents overconfidence	"Same as knowledge distillation"
Curriculum learning	Easy first, hard later	Mimics human learning	"Always helps in every setting"
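
The temperature and distillation rows of the table can be checked numerically with a few lines of Python. This is a sketch under simple assumptions: logits are plain lists, the temperature T=4 and the example logits are arbitrary illustrative values, and only the soft-label term of the distillation loss is shown (the hard-label cross-entropy term is omitted).

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T), computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, T=4.0):
    """T^2 * KL(teacher || student), both softened with temperature T."""
    p = softmax_with_temperature(teacher_logits, T)
    q = softmax_with_temperature(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * T * T


teacher = [8.0, 2.0, 0.5]  # illustrative logits
student = [5.0, 3.0, 1.0]

print(softmax_with_temperature(teacher, T=1.0))  # sharp distribution
print(softmax_with_temperature(teacher, T=4.0))  # softer distribution
print(distillation_loss(teacher, student, T=4.0))
```

The $T^2$ factor compensates for the soft-target gradients shrinking roughly as $1/T^2$, so the soft and hard loss terms stay on a comparable scale when T changes.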

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Read this entire page
Derive Xavier initialization from the variance preservation requirement
Derive He initialization and explain why the factor differs from Xavier
Draw the mixed precision training pipeline from memory
Explain the knowledge distillation loss function, including the $T^2$ factor

Day 3 - First Recall

Without notes, write the Xavier and He initialization formulas and state when each applies
Give the "60-Second Answer" for initialization out loud, timed
Explain the difference between FP16 and BF16 without looking
List 5 operations that must stay in FP32 during mixed precision training

Day 7 - Connections

Explain how initialization, gradient clipping, and normalization all address the same underlying problem (variance control)
Compare label smoothing and knowledge distillation - similarities and differences
Do Practice Problem 1 (initialization debugging) without hints

Day 14 - Application

Do Practice Problem 2 (mixed precision design) under timed conditions (15 minutes)
Design a knowledge distillation pipeline for a specific task of your choice
Do Practice Problem 5 (training recipe) - can you produce a complete, coherent recipe?

Day 21 - Mock Interview

Have someone ask: "Walk me through how you would initialize and train a 100-layer ResNet from scratch"
Time yourself: full answer should take 5-8 minutes
Do all 5 practice problems in sequence under timed conditions (50 minutes total)
Can you debug a training failure from a loss curve alone?

Key Takeaways

Initialization is not an afterthought. It determines whether your network can even begin learning. Use Xavier for symmetric activations such as tanh, use He for ReLU, and know why each formula exists.
Mixed precision is essential for large models. BF16 is the modern default. Know which operations must stay in FP32 and why loss scaling exists for FP16.
Knowledge distillation is one of the most practical techniques in ML. It compresses models for deployment while preserving most of the teacher's accuracy. Understand temperature, the $T^2$ factor, and when to use feature-based distillation.
Label smoothing is cheap insurance. It adds negligible compute and often improves generalization and calibration.
Every technique addresses a specific failure mode. Strong candidates can diagnose a problem from a training curve and prescribe the right technique - not just list techniques they have memorized.
