Training Techniques - From Initialization to Distillation
Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, Applied Scientist
The Real Interview Moment
You are in a Meta MLE on-site. The interviewer puts up a training loss curve on the screen - the loss plateaus after the first epoch and never improves. She asks: "Debug this. What could cause this, and how would you fix it?" You start listing possibilities: bad learning rate, dead ReLUs, poor initialization. She nods, then asks: "Walk me through exactly how you'd initialize a 50-layer ResNet. What initialization scheme, what numerical precision, and what gradient management strategy?"
This is where candidates separate. Junior engineers say "use default PyTorch settings." Mid-level candidates mention Xavier or He initialization. Strong candidates explain why each technique exists - the variance propagation problem, the symmetry breaking requirement, the numerical stability guarantees - and can derive the initialization formulas from first principles. They know when to use FP16 vs BF16, why loss scaling exists, and how knowledge distillation can compress a model without sacrificing accuracy.
This page teaches you every training technique that interviewers expect you to know, with the mathematical depth to survive follow-up questions.
What You Will Master
- Derive Xavier and He initialization from variance preservation principles
- Explain why symmetry breaking is necessary and what happens without it
- Implement gradient clipping (by norm and by value) and explain when each is needed
- Compare FP16, BF16, and FP32 training with precise understanding of overflow/underflow
- Design a mixed precision training pipeline with loss scaling
- Apply curriculum learning strategies and explain when they help
- Build a knowledge distillation pipeline with temperature-scaled soft labels
- Use label smoothing and explain its regularization effect mathematically
- Debug common training failures caused by each technique
Self-Assessment: Where Are You Now?
| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Derive Xavier initialization | | | | | | ___ |
| Derive He initialization | | | | | | ___ |
| Explain symmetry breaking | | | | | | ___ |
| Implement gradient clipping | | | | | | ___ |
| Explain FP16 vs BF16 tradeoffs | | | | | | ___ |
| Design a mixed precision pipeline | | | | | | ___ |
| Explain knowledge distillation loss | | | | | | ___ |
| Apply label smoothing mathematically | | | | | | ___ |
Target: All 4s and 5s before your interview.
Part 1 - Weight Initialization: Why It Matters
The Symmetry Problem
Consider a neural network where all weights are initialized to the same value (say, zero). What happens during forward and backward passes?
Forward pass: Every neuron in a layer computes the same output because all weights are identical. The layer effectively has one neuron, not hundreds.
Backward pass: Every neuron receives the same gradient. The weight update is identical for all neurons. They remain identical forever.
This is called the symmetry problem. No matter how long you train, neurons in the same layer will never differentiate. The network has the representational capacity of a single neuron per layer.
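A minimal sketch of the symmetry problem, assuming a toy network that is not from the text above: one input, two tanh hidden units, one output, squared-error loss, and plain SGD. The function name `train_two_neuron_net` and every numeric value are illustrative. It uses an identical nonzero starting value rather than zero so that gradients are not trivially zero.

```python
import math

def train_two_neuron_net(w_hidden, w_out, steps=50, lr=0.1):
    """Train a 1-input, 2-hidden-unit (tanh), 1-output net on a single example."""
    x, target = 1.0, 0.5
    w_hidden, w_out = list(w_hidden), list(w_out)
    for _ in range(steps):
        h = [math.tanh(w * x) for w in w_hidden]            # hidden activations
        y = sum(v * a for v, a in zip(w_out, h))             # network output
        d_y = 2.0 * (y - target)                             # dLoss/dy for squared error
        grad_out = [d_y * a for a in h]
        grad_hidden = [d_y * v * (1.0 - a * a) * x for v, a in zip(w_out, h)]
        w_out = [v - lr * g for v, g in zip(w_out, grad_out)]
        w_hidden = [w - lr * g for w, g in zip(w_hidden, grad_hidden)]
    return w_hidden, w_out

# Identical starting weights: both hidden units receive identical updates forever.
same_hidden, same_out = train_two_neuron_net([0.3, 0.3], [0.3, 0.3])
# Distinct starting weights: the two units are free to specialize.
diff_hidden, diff_out = train_two_neuron_net([0.3, -0.2], [0.1, 0.4])

print("identical init:", same_hidden)
print("distinct init: ", diff_hidden)
```

With identical starting values the two hidden weights remain equal after training, which is the "one neuron per layer" effect described above.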
Never say "initialize all weights to zero" or "initialize all weights to the same value." This is a fundamental error that signals you do not understand neural network training. Even saying "small random values" without specifying the distribution will raise concerns.
The Variance Propagation Problem
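Even with random initialization, choosing the wrong scale causes problems. Consider a linear layer $y = Wx$, where $W \in \mathbb{R}^{n_{out} \times n_{in}}$ and $x \in \mathbb{R}^{n_{in}}$.
Each output neuron computes:
$$y_j = \sum_{i=1}^{n_{in}} w_{ji} x_i$$
Assuming $w_{ji}$ and $x_i$ are independent with zero mean:
$$\text{Var}(y_j) = n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)$$
If $n_{in} \cdot \text{Var}(w) > 1$, the variance of activations grows exponentially with depth. After 50 layers, activations can overflow.
If $n_{in} \cdot \text{Var}(w) < 1$, the variance shrinks exponentially. After 50 layers, activations collapse toward zero.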
The same problem occurs in the backward pass with gradients. This is why proper initialization is critical for deep networks.
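The forward-pass argument can be checked numerically. The sketch below assumes a stack of purely linear, equal-width layers with Gaussian weights; `final_mean_square` and the `gain` parameter (standing for the product of fan-in and weight variance) are hypothetical names chosen for illustration.

```python
import random

def final_mean_square(gain, width=64, depth=20, seed=0):
    """Push one random vector through `depth` linear layers with Var(w) = gain / width."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = (gain / width) ** 0.5
    for _ in range(depth):
        # Each output unit is a fresh random linear combination of the inputs.
        x = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(width)]
    return sum(v * v for v in x) / width

results = {gain: final_mean_square(gain) for gain in (0.5, 1.0, 2.0)}
for gain, mean_square in results.items():
    print(f"fan_in * Var(w) = {gain}: mean square after 20 layers = {mean_square:.3e}")
```

A gain below 1 drives the activations toward zero, a gain above 1 blows them up, and a gain of 1 keeps them at roughly their original scale.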
Xavier (Glorot) Initialization
Paper: Glorot & Bengio, 2010 - "Understanding the difficulty of training deep feedforward neural networks"
Derivation:
We want the variance of activations to be preserved across layers in both the forward and backward pass.
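Forward pass requirement: $n_{in} \cdot \text{Var}(w) = 1$, which gives $\text{Var}(w) = 1/n_{in}$.
Backward pass requirement: $n_{out} \cdot \text{Var}(w) = 1$, which gives $\text{Var}(w) = 1/n_{out}$.
Xavier initialization compromises:
$$\text{Var}(w) = \frac{2}{n_{in} + n_{out}}$$
For a uniform distribution:
$$W \sim U\left[-\sqrt{\frac{6}{n_{in} + n_{out}}},\ +\sqrt{\frac{6}{n_{in} + n_{out}}}\right]$$
For a normal distribution:
$$W \sim \mathcal{N}\left(0,\ \frac{2}{n_{in} + n_{out}}\right)$$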
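As a sanity check on the uniform form, a uniform distribution on $[-a, a]$ has variance $a^2/3$, so a bound of $\sqrt{6/(n_{in}+n_{out})}$ yields the target variance $2/(n_{in}+n_{out})$. The helper names below are illustrative, not a library API, and the layer sizes are arbitrary.

```python
import math
import random

def xavier_uniform_bound(fan_in, fan_out):
    """Half-width of the uniform interval for Xavier initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def xavier_normal_std(fan_in, fan_out):
    """Standard deviation of the Gaussian for Xavier initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))

fan_in, fan_out = 256, 128
rng = random.Random(0)
bound = xavier_uniform_bound(fan_in, fan_out)
samples = [rng.uniform(-bound, bound) for _ in range(50_000)]

empirical_var = sum(s * s for s in samples) / len(samples)
target_var = 2.0 / (fan_in + fan_out)
print(f"empirical variance {empirical_var:.6f} vs target {target_var:.6f}")
```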
When an interviewer asks about Xavier initialization, they want three things: (1) the motivation - variance preservation, (2) the derivation - why the specific formula, and (3) the assumption - that activations are linear. Mentioning that Xavier assumes linear activations (or symmetric activations like tanh) and explaining why this matters for ReLU will immediately set you apart.
When Xavier works: tanh or linear activations, and less cleanly sigmoid. tanh is symmetric around zero and close to linear near the origin, so the zero-mean assumption in the derivation holds approximately. Sigmoid outputs are not zero-centered, so the assumption holds only loosely there.
When Xavier fails: ReLU activations. ReLU zeros out half the inputs, which halves the second moment of the activations. With equal-width layers, Xavier initialization therefore shrinks the variance by a factor of 2 per layer - a 50-layer ReLU network initialized with Xavier will have activations that are roughly $2^{-50} \approx 10^{-15}$ times the input variance.
He (Kaiming) Initialization
Paper: He et al., 2015 - "Delving Deep into Rectifiers"
He initialization accounts for ReLU's variance-halving effect. Since ReLU zeros out approximately half the neurons (those with negative input), the effective fan-in is halved:
$$\text{Var}(w) = \frac{2}{n_{in}}$$
For a normal distribution:
$$W \sim \mathcal{N}\left(0,\ \frac{2}{n_{in}}\right)$$
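A rough empirical comparison of the two scalings, assuming equal-width fully connected ReLU layers (so Xavier reduces to a weight variance of $1/n_{in}$) and Gaussian weights. The function name `relu_net_mean_square` and the width/depth values are illustrative choices.

```python
import random

def relu_net_mean_square(fan_in_times_var, width=128, depth=20, seed=0):
    """Mean squared activation after `depth` ReLU layers with Var(w) = fan_in_times_var / width."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = (fan_in_times_var / width) ** 0.5
    for _ in range(depth):
        pre = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(width)]
        x = [max(0.0, p) for p in pre]  # ReLU
    return sum(v * v for v in x) / width

xavier_like = relu_net_mean_square(1.0)  # Var(w) = 1 / n_in
he_like = relu_net_mean_square(2.0)      # Var(w) = 2 / n_in

print(f"Xavier-style scaling: {xavier_like:.3e}")
print(f"He-style scaling:     {he_like:.3e}")
```

Under these assumptions the Xavier-style run should shrink by about a factor of 2 per layer, while the He-style run stays within the same order of magnitude as the input.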
Why the factor of 2? ReLU passes only positive values. For a zero-mean, symmetric pre-activation $y$, half the values are positive, so $E[\text{ReLU}(y)^2] = \frac{1}{2}\text{Var}(y)$: the second moment that feeds the next layer is half the pre-activation variance. To compensate, we double the initialization variance.
"Weight initialization controls the scale of activations and gradients through the network. Xavier initialization sets variance to $2/(n_{in} + n_{out})$ to preserve variance in both directions, assuming linear activations. He initialization sets variance to $2/n_{in}$ to account for ReLU zeroing out half the neurons. Using the wrong initialization for your activation function causes either exploding or vanishing activations, making training impossible or extremely slow. For modern networks with ReLU variants, I default to He initialization. For transformers with LayerNorm, the choice is less critical because normalization stabilizes variance regardless."
LSUV - Layer-Sequential Unit-Variance Initialization
Paper: Mishkin & Matas, 2016
LSUV is a data-driven initialization method:
- Initialize weights with orthogonal matrices (preserves norms)
- For each layer sequentially:
  - a. Pass a mini-batch through the network
  - b. Measure the variance of the layer's output
  - c. Scale the weights so the output variance equals 1.0
  - d. Repeat for the next layer
Advantage over Xavier/He: Works for any activation function and any architecture without deriving activation-specific formulas. Particularly useful for exotic architectures where the variance propagation analysis is complex.
Disadvantage: Requires a forward pass during initialization, adding startup cost.
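The sequential rescaling loop can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes Gaussian rather than orthogonal starting weights, fully connected layers with ReLU, and normalization of each layer's pre-activation output. All function names are hypothetical.

```python
import random

def forward_layer(weights, inputs):
    """Pre-activation outputs of one linear layer for every sample in the batch."""
    return [[sum(w * x for w, x in zip(row, sample)) for row in weights] for sample in inputs]

def variance(batch_outputs):
    flat = [v for sample in batch_outputs for v in sample]
    mean = sum(flat) / len(flat)
    return sum((v - mean) ** 2 for v in flat) / len(flat)

def lsuv_like_init(layer_sizes, batch, seed=0):
    rng = random.Random(seed)
    layers, variances, activations = [], [], batch
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Assumption: Gaussian start instead of the orthogonal start used by LSUV.
        weights = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)] for _ in range(fan_out)]
        std = variance(forward_layer(weights, activations)) ** 0.5
        weights = [[w / std for w in row] for row in weights]  # rescale to unit output variance
        outputs = forward_layer(weights, activations)
        variances.append(variance(outputs))
        activations = [[max(0.0, v) for v in sample] for sample in outputs]  # ReLU
        layers.append(weights)
    return layers, variances

data_rng = random.Random(1)
batch = [[data_rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(32)]
layers, variances = lsuv_like_init([16, 32, 32, 8], batch)
print([round(v, 6) for v in variances])
```

Because each layer here is linear before its activation, a single rescale is enough to hit unit variance; the iterative version in the paper matters when the measured output is not a linear function of the weights being scaled.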
Do NOT confuse LSUV with batch normalization. Both aim to normalize activations, but LSUV is a one-time initialization procedure while BatchNorm is applied during every forward pass. LSUV sets the initial scale; BatchNorm maintains it throughout training. An interviewer may ask you to compare them.
Initialization for Specific Architectures
| Architecture | Recommended Init | Why |
|---|---|---|
| MLP with ReLU | He normal | Accounts for ReLU variance halving |
| MLP with tanh/sigmoid | Xavier normal | Assumes symmetric activations |
| ResNet | He normal + zero-init residual branch | Ensures residual block starts as identity |
| Transformer | Xavier normal, with residual output projections scaled by $1/\sqrt{N}$ for $N$ residual layers | GPT-2 convention, prevents output variance growth |
| Embedding layers | Normal with small std (0.01-0.02) | Empirical, no strong theoretical basis |
| LSTM | Orthogonal for recurrent weights, Xavier for input weights | Preserves gradient norms through time steps |
Part 2 - Gradient Clipping
Why Gradients Explode
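During backpropagation, gradients are multiplied through layers. If the Jacobian of each layer has spectral norm greater than 1, gradients can grow exponentially:
$$\left\lVert \frac{\partial \mathcal{L}}{\partial h_0} \right\rVert \le \left( \prod_{l=1}^{L} \lVert J_l \rVert \right) \left\lVert \frac{\partial \mathcal{L}}{\partial h_L} \right\rVert$$
For $L$ layers with $\lVert J_l \rVert = \sigma$, the gradient magnification can reach $\sigma^L$. With, for example, $L = 50$ and $\sigma = 1.1$, that is $1.1^{50} \approx 117$. With $\sigma = 1.5$, it becomes $1.5^{50} \approx 6 \times 10^{8}$.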
Clipping Strategies
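Clip by value: Cap each gradient element independently.
$$g_i \leftarrow \max(-c,\ \min(c,\ g_i))$$
- Simple but changes the direction of the gradient vector
- Rarely used in practice
Clip by global norm (standard approach):
$$g \leftarrow g \cdot \min\left(1,\ \frac{c}{\lVert g \rVert_2}\right)$$
- Preserves gradient direction, only scales magnitude
- This is what PyTorch's `torch.nn.utils.clip_grad_norm_` does
- Standard values: $c = 1.0$ for transformers; for RNNs the threshold is usually tuned per task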
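Both strategies fit in a few lines of plain Python. The sketch below treats the gradient as a flat list of floats; the function names are illustrative, and the small epsilon in the norm version is an assumption added to avoid division by zero.

```python
import math

def clip_by_value(grads, clip_value):
    """Clamp each gradient element to [-clip_value, clip_value] independently."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale the whole gradient vector so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

grads = [3.0, -4.0, 0.5]

value_clipped = clip_by_value(grads, 1.0)
norm_clipped, original_norm = clip_by_global_norm(grads, 1.0)

print(value_clipped)   # [1.0, -1.0, 0.5]
print(norm_clipped, original_norm)
```

Value clipping turns the ratio 3 : -4 : 0.5 into 1 : -1 : 0.5, so the update direction changes. Norm clipping multiplies every element by the same factor, so the ratios are untouched.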
Google and DeepMind interviews often ask about gradient clipping in the context of transformer training. They expect you to know that gradient clipping is essential for transformer training (not optional) and that the typical threshold is 1.0. Meta asks about it in the context of RNN/LSTM training where exploding gradients are more dramatic.
Gradient Clipping vs. Other Solutions
| Problem | Gradient Clipping | Proper Initialization | Normalization Layers | Residual Connections |
|---|---|---|---|---|
| Exploding gradients | Directly caps | Prevents initially | Prevents during training | Prevents via shortcut |
| Vanishing gradients | Does NOT help | Partially prevents | Partially prevents | Directly solves |
| Computational cost | Negligible | Zero (one-time) | Per-layer per-step | Per-block per-step |
| When to use | Always for transformers/RNNs | Always | Architecture-dependent | Deep networks |
Part 3 - Mixed Precision Training
The Numerical Precision Landscape
Modern GPUs support multiple floating-point formats:
| Format | Bits | Exponent | Mantissa | Range | Precision |
|---|---|---|---|---|---|
| FP32 | 32 | 8 bits | 23 bits | ~$\pm 3.4 \times 10^{38}$ | ~7 decimal digits |
| FP16 | 16 | 5 bits | 10 bits | $\pm 65504$ | ~3 decimal digits |
| BF16 | 16 | 8 bits | 7 bits | ~$\pm 3.4 \times 10^{38}$ | ~2 decimal digits |
| TF32 | 19 | 8 bits | 10 bits | ~$\pm 3.4 \times 10^{38}$ | ~3 decimal digits |
Why Mixed Precision?
- Speed: FP16/BF16 operations are 2-8x faster on modern GPUs (A100, H100).
- Memory: Half the memory per parameter means you can double the batch size or model size.
- Throughput: Tensor Cores on NVIDIA GPUs are optimized for FP16/BF16 matrix multiplications.
But you cannot simply cast everything to FP16. Here is why:
Where FP16 Fails
Problem 1: Overflow. FP16 max value is 65504. Activations in deep networks can exceed this. Pre-LayerNorm transformers are especially vulnerable because the unnormalized residual stream can grow with depth.
Problem 2: Underflow. Gradient values below roughly $2^{-24} \approx 6 \times 10^{-8}$ (the smallest positive FP16 value) become zero in FP16. Many gradient values, especially in early layers of deep networks, fall below this threshold, and the affected parameters stop learning.
Problem 3: Accumulation errors. Summing many small FP16 values loses precision rapidly. The sum of 1000 values of 0.001 in FP16 may not equal 1.0.
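All three failure modes can be reproduced without a GPU by rounding through Python's half-precision `struct` format (`"e"`). This is a simulation under stated assumptions: `to_fp16` is a hypothetical helper, and `struct` raises `OverflowError` where FP16 hardware would produce `inf`.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Problem 1: overflow. Values beyond the FP16 range cannot be represented.
try:
    to_fp16(70000.0)
    overflowed = False
except OverflowError:
    overflowed = True

# Problem 2: underflow. A tiny gradient rounds to exactly zero.
tiny_gradient = to_fp16(1e-8)

# Problem 3: accumulation. Round to FP16 after every addition.
fp16_sum = 0.0
for _ in range(1000):
    fp16_sum = to_fp16(fp16_sum + to_fp16(0.001))
fp64_sum = sum(0.001 for _ in range(1000))

print("overflowed:", overflowed)
print("tiny gradient in FP16:", tiny_gradient)
print("FP16 running sum:", fp16_sum, "vs float64:", fp64_sum)
```

The running sum drifts because once the accumulator is large, the spacing between adjacent FP16 values is a sizable fraction of 0.001, so each addition is rounded noticeably. This is the reason accumulations are kept in FP32.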
The Mixed Precision Recipe (Micikevicius et al., 2018)
Key principles:
- Master weights in FP32. Weight updates are tiny ($\eta \cdot g$, learning rate times gradient), so they can underflow in FP16 or be rounded away when added to a much larger weight.
- Forward and backward pass in FP16. This is where the speedup comes from.
- Loss scaling. Multiply the loss by a large constant $S$ (e.g., 1024) before backprop. This shifts gradients into FP16's representable range. After backprop, divide the gradients by $S$ before the weight update.
- Dynamic loss scaling. Start with a large $S$. If inf/nan appears, skip the update and halve $S$. If no inf/nan appears for N steps, double $S$.
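The dynamic loss scaling rule can be sketched as a small state machine. This is an illustrative stand-in, not a framework API: the class name, method name, and default values are assumptions, and gradients are represented as a flat list of floats.

```python
import math

class DynamicLossScaler:
    """Halve the scale on inf/nan, double it after a run of clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000, factor=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.factor = factor
        self._clean_steps = 0

    def unscale_and_check(self, scaled_grads):
        """Return unscaled gradients, or None if this step must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale /= self.factor      # back off and skip the update
            self._clean_steps = 0
            return None
        grads = [g / self.scale for g in scaled_grads]
        self._clean_steps += 1
        if self._clean_steps % self.growth_interval == 0:
            self.scale *= self.factor      # try a larger scale again
        return grads

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
history = []
for scaled in ([float("inf"), 1.0], [512.0], [256.0]):
    history.append((scaler.unscale_and_check(scaled), scaler.scale))
print(history)
```

In a training loop, the loss would be multiplied by `scaler.scale` before backprop, and the optimizer step would run only when `unscale_and_check` returns gradients.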
BF16: The Better Alternative
BF16 (Brain Floating Point, developed by Google Brain) uses the same exponent size as FP32 (8 bits) but with reduced mantissa (7 bits vs 23 bits).
Advantages of BF16 over FP16:
- Same dynamic range as FP32 - no overflow issues
- No loss scaling needed (the entire purpose of loss scaling was to handle FP16's limited range)
- Simpler training pipeline
Disadvantage: Slightly less precision than FP16 (2 decimal digits vs 3). In practice, this rarely matters for training.
If asked "what precision would you train a large model in?", the strong answer is: "BF16 for both forward and backward passes on hardware that supports it (A100+, TPU v3+), with FP32 master weights. This avoids the need for loss scaling. If BF16 isn't available, FP16 with dynamic loss scaling. I'd keep optimizer states (momentum, variance in Adam) in FP32 regardless." This shows you understand the practical tradeoffs, not just the theory.
Operations That Must Stay in FP32
Not everything can be safely done in reduced precision:
- Softmax: Exponentiation can overflow. Compute in FP32 or use numerically stable implementation (subtract max).
- Layer normalization: Variance computation needs FP32 for numerical stability.
- Loss computation: Cross-entropy involves log of small values. FP16 underflow causes NaN.
- Optimizer state: Adam's running averages need FP32 precision.
- Gradient accumulation: Summing many small gradients must be in FP32.
Do NOT say "mixed precision just means using FP16 everywhere." The "mixed" means some operations use FP16 and some use FP32. Knowing which operations must stay in FP32 is what distinguishes engineers who have actually trained models from those who have only read about it.
Part 4 - Curriculum Learning
Core Idea
Train the model on easy examples first, then gradually introduce harder examples - mimicking how humans learn.
Paper: Bengio et al., 2009 - "Curriculum Learning"
Why It Works
- Better optimization landscape: Easy examples provide strong, clean gradients that move the model toward a good region of parameter space. Hard/noisy examples early on provide conflicting gradients.
- Faster convergence: The model builds basic representations quickly on easy examples, then refines them on hard examples.
- Better generalization: Starting with clean examples helps the model learn true patterns before encountering noise.
Difficulty Metrics
How do you define "easy" vs "hard"?
| Method | Definition of Difficulty | Example |
|---|---|---|
| Loss-based | Higher loss = harder | Sort by per-sample loss after one epoch |
| Confidence-based | Lower confidence = harder | Sort by model's predicted probability |
| Heuristic | Domain knowledge | Short sentences before long sentences in NLP |
| Length-based | Longer = harder | Common in sequence tasks |
| Noise-based | More noise = harder | Train on cleanly labeled examples first |
| Self-paced | Model decides | Include examples where loss is below a threshold that increases over time |
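The self-paced row can be made concrete with a loss threshold that rises over training. The linear schedule, the function name, and the threshold values below are assumptions for illustration; other pacing functions are equally valid.

```python
def self_paced_subset(per_sample_losses, step, total_steps, start_threshold, end_threshold):
    """Indices of samples whose loss is at or below a linearly increasing threshold."""
    progress = min(1.0, max(0.0, step / total_steps))
    threshold = start_threshold + progress * (end_threshold - start_threshold)
    return [i for i, loss in enumerate(per_sample_losses) if loss <= threshold]

losses = [0.2, 1.5, 0.7, 3.0, 0.4]

early = self_paced_subset(losses, step=0, total_steps=100, start_threshold=0.5, end_threshold=3.0)
middle = self_paced_subset(losses, step=50, total_steps=100, start_threshold=0.5, end_threshold=3.0)
late = self_paced_subset(losses, step=100, total_steps=100, start_threshold=0.5, end_threshold=3.0)

print(early)   # [0, 4]
print(middle)  # [0, 1, 2, 4]
print(late)    # [0, 1, 2, 3, 4]
```

In practice the per-sample losses would be refreshed periodically, since an example that was hard early in training may become easy later.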
Anti-Curriculum Learning
Surprisingly, some work shows that training on hard examples first can also help, particularly when:
- The dataset contains many near-duplicate easy examples
- Hard examples are the most informative for learning decision boundaries
- The model needs to learn rare but important patterns
Curriculum learning is frequently asked at NLP-focused companies (Google, Meta AI, Cohere). It is especially relevant for training language models where sequence length, data quality, and domain mixing order all affect final performance. GPT-4 and Llama training reportedly use data curriculum strategies.
Part 5 - Knowledge Distillation
The Core Framework
Paper: Hinton et al., 2015 - "Distilling the Knowledge in a Neural Network"
A large "teacher" model has learned rich representations. A smaller "student" model learns to mimic the teacher's behavior, not just match the ground truth labels.
Why Soft Labels Are More Informative Than Hard Labels
Consider a teacher classifying an image. The hard label says "cat" (one-hot). But the teacher's soft predictions might be:
| Class | Hard Label | Teacher Soft Label (T=1) | Teacher Soft Label (T=5) |
|---|---|---|---|
| Cat | 1.0 | 0.92 | 0.41 |
| Dog | 0.0 | 0.05 | 0.23 |
| Bird | 0.0 | 0.02 | 0.19 |
| Car | 0.0 | 0.01 | 0.17 |
The soft labels reveal that this cat looks somewhat like a dog (both are furry animals) and less like a car. This "dark knowledge" - the inter-class similarity structure - is lost in hard labels but preserved in soft labels.
Temperature Scaling
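The temperature parameter $T$ controls how "soft" the distribution is:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
- $T = 1$: Standard softmax (peaky distribution)
- $T > 1$: Softer distribution, reveals more inter-class structure
- $T \to \infty$: Uniform distribution (all classes equally likely)
- $T \to 0$: Hard argmax (one-hot)
Typical value: roughly $T = 2$ to $T = 20$ (Hinton et al. report temperatures up to 20 in their experiments).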
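A short sketch of temperature scaling, assuming the logits are recovered as the log of the $T=1$ probabilities from the cat/dog/bird/car example. The function name is illustrative, and the max-subtraction is the usual numerical-stability trick rather than part of the definition.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably by subtracting the max."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Logits consistent with the T=1 teacher probabilities (cat, dog, bird, car).
logits = [math.log(p) for p in (0.92, 0.05, 0.02, 0.01)]

distributions = {t: softmax_with_temperature(logits, t) for t in (0.1, 1.0, 5.0, 100.0)}
for t, probs in distributions.items():
    print(f"T={t}: {[round(p, 2) for p in probs]}")
```

Low temperatures push the distribution toward one-hot, and very high temperatures push it toward uniform, matching the limits listed above.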
The Distillation Loss
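$$\mathcal{L} = \alpha \, T^2 \, \text{KL}\left(p^{(T)}_{\text{teacher}} \,\Vert\, p^{(T)}_{\text{student}}\right) + (1 - \alpha) \, \text{CE}\left(y,\ p_{\text{student}}\right)$$
Here $p^{(T)}$ denotes a softmax at temperature $T$, $y$ is the hard label, and $\alpha$ weights the distillation term.
Why $T^2$? The gradients of the KL divergence with temperature-scaled softmax are scaled by $1/T^2$ compared to standard softmax. Multiplying by $T^2$ compensates, keeping the relative contribution of the distillation loss consistent regardless of temperature.
Typical $\alpha$: 0.5 to 0.9 (weight toward distillation loss).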
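A per-example sketch of the combined loss, assuming the common form $\alpha T^2 \cdot \text{KL}(\text{teacher} \Vert \text{student}) + (1-\alpha) \cdot \text{CE}$ with the cross-entropy taken at temperature 1. The function names and the default `temperature` and `alpha` values are illustrative choices.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student at temperature T) + (1 - alpha) * CE(label, student)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    hard_ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * hard_ce

teacher = [2.0, 0.5, -1.0]
matching_student = distillation_loss([2.0, 0.5, -1.0], teacher, label=0)
mismatched_student = distillation_loss([-1.0, 0.5, 2.0], teacher, label=0)
print(matching_student, mismatched_student)
```

When the student reproduces the teacher's logits, the KL term vanishes and only the weighted hard-label cross-entropy remains.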
"Knowledge distillation transfers knowledge from a large teacher model to a smaller student model. Instead of training the student on hard labels only, we also train it to match the teacher's soft probability distribution. We raise the temperature in the softmax to reveal inter-class similarities - the 'dark knowledge.' The student learns not just what the correct answer is, but what the incorrect answers look like, which is much more informative per training example. The loss combines KL divergence between teacher and student soft distributions with standard cross-entropy on hard labels. This lets you deploy a model that's 3-10x smaller with only 1-3% accuracy loss."
Types of Knowledge Distillation
| Type | What is Transferred | Example |
|---|---|---|
| Response-based | Final layer output (soft labels) | Original Hinton distillation |
| Feature-based | Intermediate layer representations | FitNets (Romero et al., 2015) |
| Relation-based | Relationships between samples | Contrastive Representation Distillation |
| Self-distillation | Model distills into itself | Born-Again Networks, deep-to-shallow within same model |
| Online distillation | Models teach each other during training | Deep Mutual Learning |
Practical Distillation Results
| Teacher | Student | Task | Teacher Acc | Student Acc (no distill) | Student Acc (distilled) |
|---|---|---|---|---|---|
| BERT-Large | BERT-Small | GLUE | 86.5 | 78.2 | 83.1 |
| ResNet-152 | ResNet-18 | ImageNet | 78.3 | 69.8 | 73.5 |
| GPT-3 | GPT-2 | Text Gen | - | - | Significant improvement |
| Llama-70B | Llama-7B | Benchmarks | - | - | ~5% improvement |
Part 6 - Label Smoothing
The Problem with Hard Labels
Standard one-hot labels force the model to be infinitely confident: predict exactly 1.0 for the correct class and 0.0 for all others. To achieve this, logits must go to $\pm\infty$, which:
- Causes the model to be overconfident - calibration suffers
- Encourages memorization of training data
- Makes the model less generalizable - it overfits to the exact label distribution
Label Smoothing Formula
Replace the one-hot label with a smoothed version:
$$y_i^{LS} = (1 - \epsilon) \, y_i + \frac{\epsilon}{K}$$
Where $\epsilon$ is the smoothing parameter (typically 0.1) and $K$ is the number of classes.
Example with $\epsilon = 0.1$ and $K = 4$:
| Class | Hard Label | Smoothed Label |
|---|---|---|
| Cat (correct) | 1.0 | 0.925 |
| Dog | 0.0 | 0.025 |
| Bird | 0.0 | 0.025 |
| Car | 0.0 | 0.025 |
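The table values follow directly from the formula: the correct class gets $(1 - \epsilon) + \epsilon/K$ and every other class gets $\epsilon/K$. A minimal sketch, with `smooth_labels` as a hypothetical helper name:

```python
def smooth_labels(num_classes, target_index, epsilon=0.1):
    """Smoothed target distribution: (1 - epsilon) * one_hot + epsilon / num_classes."""
    uniform_share = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform_share if i == target_index else uniform_share
        for i in range(num_classes)
    ]

smoothed = smooth_labels(num_classes=4, target_index=0, epsilon=0.1)
print([round(p, 3) for p in smoothed])  # [0.925, 0.025, 0.025, 0.025]
```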
Why Label Smoothing Works
- Regularization effect: The model cannot achieve zero loss, so it never stops learning useful representations.
- Better calibration: The model learns to output probabilities closer to the true uncertainty.
- Tighter clusters: Label smoothing encourages the model to keep all class representations at a fixed distance from each other, creating tighter, more uniform clusters in embedding space.
Label Smoothing vs. Knowledge Distillation
Both provide "soft" targets, but they are fundamentally different:
| Aspect | Label Smoothing | Knowledge Distillation |
|---|---|---|
| Source of soft labels | Uniform distribution | Teacher model |
| Information content | No inter-class structure | Rich inter-class relationships |
| Requires teacher | No | Yes |
| Computational cost | Negligible | Requires teacher inference |
| Typical improvement | 0.5-2% | 3-10% |
Label smoothing and knowledge distillation are sometimes confused in interviews. When asked about label smoothing, do NOT describe knowledge distillation. The key difference: label smoothing uses a uniform distribution over non-target classes (every wrong class gets equal probability), while distillation uses the teacher's distribution (wrong classes get probability proportional to their similarity to the correct class). Label smoothing is a regularization technique; distillation is a knowledge transfer technique.
Part 7 - Putting It All Together: A Training Recipe
Here is how all these techniques combine in a modern training pipeline:
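One way to picture the combination is as a single recipe object that records each decision from Parts 1-6. The sketch below is a hypothetical configuration, not a framework API: the field names, the `recipe_for` helper, and the distillation defaults are illustrative, while the values for clipping, label smoothing, and precision follow the recommendations above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingRecipe:
    init_scheme: str = "he_normal"        # Part 1: match init to the activation
    grad_clip_norm: float = 1.0           # Part 2: clip by global norm
    compute_dtype: str = "bf16"           # Part 3: forward/backward precision
    master_weight_dtype: str = "fp32"     # Part 3: FP32 master weights
    optimizer_state_dtype: str = "fp32"   # Part 3: FP32 optimizer states
    use_loss_scaling: bool = False        # Part 3: only needed for FP16
    curriculum: str = "easy_to_hard"      # Part 4: data ordering
    distill_temperature: float = 4.0      # Part 5: illustrative default
    distill_alpha: float = 0.7            # Part 5: weight on the distillation term
    label_smoothing: float = 0.1          # Part 6: typical epsilon

def recipe_for(hardware_supports_bf16: bool) -> TrainingRecipe:
    """Fall back to FP16 with dynamic loss scaling when BF16 is unavailable."""
    if hardware_supports_bf16:
        return TrainingRecipe()
    return TrainingRecipe(compute_dtype="fp16", use_loss_scaling=True)

bf16_recipe = recipe_for(hardware_supports_bf16=True)
fp16_recipe = recipe_for(hardware_supports_bf16=False)
print(bf16_recipe)
print(fp16_recipe)
```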
Common Training Failures and Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss is NaN from step 1 | Bad initialization or no gradient clipping | Use He init, clip gradients at 1.0 |
| Loss plateaus immediately | Symmetry problem or dead neurons | Check initialization, use LeakyReLU |
| Loss spikes periodically | Exploding gradients | Reduce gradient clip threshold |
| Loss decreases then suddenly NaN | FP16 overflow | Enable loss scaling or switch to BF16 |
| Training is slow on GPU | Not using mixed precision | Enable AMP with BF16 |
| Student much worse than teacher | Wrong temperature or alpha | Try T=5-20, alpha=0.7-0.9 |
| Model overconfident | No label smoothing or regularization | Add label smoothing epsilon=0.1 |
| Model underfits easy examples | Curriculum too aggressive | Start with more easy examples |
Practice Problems
Problem 1: Initialization Debugging
You initialize a 100-layer MLP with ReLU activations using Xavier normal initialization. After the first forward pass, all activations in the last layer are nearly zero. Explain why and fix it.
Hint 1 - Direction
Think about what ReLU does to the variance of activations at each layer when Xavier initialization is used.
Hint 2 - Insight
ReLU zeros out approximately half the inputs. Xavier assumes linear/symmetric activations where all inputs pass through. Each ReLU layer halves the variance, so after 100 layers the variance is scaled by $2^{-100} \approx 8 \times 10^{-31}$.
Hint 3 - Full Solution + Rubric
Why activations vanish:
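Xavier sets $\text{Var}(w) = 2/(n_{in} + n_{out})$, which equals $1/n_{in}$ for equal-width layers and is designed for linear activations. With ReLU, each layer halves the variance (since half the outputs are zeroed). After $L$ layers:
$$\text{Var}(a_L) = \left(\frac{1}{2}\right)^{L} \text{Var}(x)$$
For $L = 100$, this simplifies to $2^{-100} \approx 8 \times 10^{-31}$. Activations are effectively zero.
Fix: Switch to He initialization with $\text{Var}(w) = 2/n_{in}$, which compensates for ReLU's variance halving. After this change:
$$\text{Var}(a_L) = \left(n_{in} \cdot \frac{2}{n_{in}} \cdot \frac{1}{2}\right)^{L} \text{Var}(x) = \text{Var}(x)$$
The variance is preserved through all 100 layers.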
Additional fixes for very deep networks: Add residual connections, add batch/layer normalization, or use LSUV initialization.
Scoring Rubric:
- Strong Hire: Correctly identifies the ReLU-Xavier mismatch, derives the variance shrinkage factor, prescribes He initialization, and mentions complementary techniques (residual connections, normalization).
- Lean Hire: Identifies the problem but cannot derive the shrinkage rate or only says "use He init" without explaining why.
- No Hire: Cannot explain why Xavier fails with ReLU or suggests increasing learning rate as the fix.
Problem 2: Mixed Precision Design
You are training a 10B parameter language model on 8 A100 GPUs. Design the mixed precision strategy, including what precision each component uses and how you handle numerical stability.
Hint 1 - Direction
List every component: weights, activations, gradients, optimizer states, loss computation, normalization layers. Decide the precision for each.
Hint 2 - Insight
Master weights in FP32 (4 bytes per param = 40GB just for weights). Forward/backward in BF16 (avoids loss scaling). Adam optimizer states (momentum + variance) in FP32 (adds 2x weight memory). Specific operations (softmax, layernorm, loss) in FP32 for stability.
Hint 3 - Full Solution + Rubric
Component-level precision assignment:
| Component | Precision | Reason |
|---|---|---|
| Master weights | FP32 | Weight updates are tiny, underflow in FP16 |
| Forward activations | BF16 | Speed + memory, BF16 range prevents overflow |
| Backward gradients | BF16 | Matches forward precision |
| Adam first moment | FP32 | Running average needs precision |
| Adam second moment | FP32 | Running average needs precision |
| Softmax | FP32 | exp() overflow risk |
| LayerNorm | FP32 | Variance computation needs precision |
| Loss computation | FP32 | log() underflow risk |
| Gradient accumulation | FP32 | Summing many small values |
| Gradient all-reduce | BF16 | Communication bandwidth bottleneck |
Memory breakdown for 10B params:
- Master weights: 10B x 4 bytes = 40GB
- BF16 weights (forward copy): 10B x 2 bytes = 20GB
- Adam states: 10B x 8 bytes = 80GB
- Gradients: 10B x 2 bytes = 20GB
- Total per-model: ~160GB across 8 GPUs = 20GB per GPU (with sharding)
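The breakdown is simple arithmetic and easy to re-run for other model sizes. The sketch below assumes 1 GB = $10^9$ bytes, ideal sharding across GPUs, and counts only the model-state terms listed above (no activations or temporary buffers). The function name is illustrative.

```python
def mixed_precision_memory_gb(num_params, num_gpus):
    """Model-state memory for BF16 training with FP32 master weights and Adam."""
    bytes_per_param = {
        "master_weights_fp32": 4,
        "bf16_weights": 2,
        "adam_states_fp32": 8,  # first and second moment, 4 bytes each
        "gradients_bf16": 2,
    }
    breakdown = {name: num_params * size / 1e9 for name, size in bytes_per_param.items()}
    total = sum(breakdown.values())
    return breakdown, total, total / num_gpus

breakdown, total_gb, per_gpu_gb = mixed_precision_memory_gb(num_params=10e9, num_gpus=8)
print(breakdown)
print(total_gb, per_gpu_gb)  # 160.0 20.0
```

That is 16 bytes per parameter before activations, which is why optimizer-state sharding matters at this scale.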
Why BF16 over FP16: No loss scaling needed, simplifies the pipeline, same dynamic range as FP32. On A100s, BF16 and FP16 have the same throughput on Tensor Cores.
Scoring Rubric:
- Strong Hire: Correct precision for every component, memory calculation, justification for BF16 over FP16, mentions gradient accumulation precision and communication precision.
- Lean Hire: Gets the main idea (BF16 forward, FP32 master weights) but misses nuances like optimizer state precision or which operations need FP32.
- No Hire: Suggests FP16 everywhere or cannot explain why master weights need FP32.
Problem 3: Distillation Temperature Selection
You are distilling a BERT-Large teacher into a DistilBERT student for sentiment classification (positive/negative). The teacher achieves 95% accuracy. With the current temperature setting, the student gets 88%. How would you tune temperature to improve the student? What temperatures would you try and why?
Hint 1 - Direction
Think about what temperature controls in the softmax distribution. For a binary classification task specifically, how much "dark knowledge" is available?
Hint 2 - Insight
With only 2 classes, the soft distribution has limited dark knowledge (just the ratio between positive and negative probabilities). Higher temperatures will flatten this further. The optimal temperature for binary classification is typically lower than for multi-class tasks because there is less inter-class structure to reveal.
Hint 3 - Full Solution + Rubric
Analysis:
For binary sentiment classification, the teacher's output is a 2-class distribution. At $T = 1$, a 95%-confident teacher outputs approximately [0.95, 0.05] for a clear positive example. Dividing the logit gap $\ln(0.95/0.05) \approx 2.94$ by $T$ gives the softened distribution.
Temperature sweep:
- $T = 1$: [0.95, 0.05] - Almost hard label, little dark knowledge
- $T = 3$: [0.73, 0.27] - Reveals teacher's uncertainty more
- $T = 5$: [0.64, 0.36] - Quite soft, but still informative
- $T = 10$: [0.57, 0.43] - Approaching uniform, losing signal
- $T = 20$: [0.54, 0.46] - Nearly uniform, useless
Recommended approach:
- Try several temperatures from this sweep on a validation set
- For binary classification, a moderate temperature such as $T = 3$ or $T = 5$ is a reasonable starting point
- Tune $\alpha$ (weight between distillation and hard loss) jointly, for example across the typical 0.5 to 0.9 range
- For binary tasks, consider augmenting with feature-based distillation (match intermediate representations) since the output distribution has limited dark knowledge
Key insight: Binary classification is the worst case for response-based distillation because there are only 2 classes to distribute probability over. Feature-based distillation (matching intermediate BERT hidden states) would likely give more improvement.
Scoring Rubric:
- Strong Hire: Recognizes that binary classification limits dark knowledge, recommends feature-based distillation as a complement, provides specific temperature range with reasoning, discusses tuning.
- Lean Hire: Suggests trying different temperatures but doesn't recognize the fundamental limitation of 2-class distillation.
- No Hire: Suggests an extreme temperature "because higher is better" or doesn't understand what temperature controls.
Problem 4: Curriculum Learning for Code Generation
You are training a code generation model (similar to CodeLlama). Design a curriculum learning strategy. What defines "easy" vs "hard" code? How do you schedule the curriculum?
Hint 1 - Direction
Think about multiple dimensions of code difficulty: length, language complexity, algorithmic complexity, number of dependencies, test pass rate.
Hint 2 - Insight
Easy code: short functions, single language, common patterns (getters/setters, simple loops). Hard code: multi-file projects, complex algorithms, concurrent programming, domain-specific patterns. The curriculum should consider both syntactic complexity and semantic difficulty.
Hint 3 - Full Solution + Rubric
Difficulty dimensions:
- Length: Short functions (< 20 lines) to multi-file projects
- Cyclomatic complexity: Linear code to deeply nested conditionals/loops
- Language frequency: Python/JavaScript (common) to Rust/Haskell (rare)
- Algorithmic complexity: O(n) loops to dynamic programming/graph algorithms
- Dependencies: Standalone functions to complex import chains
- Documentation quality: Well-commented code to uncommented code
- Test pass rate: How often a baseline model generates correct code for this task
Proposed curriculum (4 phases):
Phase 1 (first 10% of training): Single-function Python/JavaScript, < 20 lines, simple logic, well-documented.
Phase 2 (10-40%): Multi-function files, add Java/TypeScript/Go, include standard library usage, introduce common algorithms.
Phase 3 (40-80%): Multi-file projects, all languages, complex algorithms, API usage, error handling patterns.
Phase 4 (80-100%): Full difficulty range including concurrent code, advanced type systems, low-level systems code, adversarial/edge-case examples.
Scheduling: Linear pacing function that gradually increases the difficulty threshold. At each phase boundary, validate on a held-out set at the target difficulty level.
Anti-curriculum consideration: For code, showing some hard examples early (e.g., complex algorithms) might help the model learn structural patterns. A mixed strategy that is 80% curriculum-ordered and 20% random might outperform strict ordering.
Scoring Rubric:
- Strong Hire: Identifies multiple difficulty dimensions, provides concrete phasing, discusses scheduling, considers anti-curriculum aspects, mentions validation strategy.
- Lean Hire: Provides a reasonable curriculum but only considers 1-2 difficulty dimensions.
- No Hire: Cannot define code difficulty or proposes a curriculum with no clear rationale.
Problem 5: Training Recipe for Production
Your team is about to train a 1B parameter vision transformer on ImageNet from scratch. Write out the complete training recipe: initialization, precision, gradient management, label strategy, and any other techniques. Justify each choice.
Hint 1 - Direction
Think about what specific choices the ViT paper and follow-up work (DeiT, BEiT) made. Why did they deviate from standard ResNet training recipes?
Hint 2 - Insight
ViTs are notoriously harder to train than CNNs. They require: careful initialization (truncated normal, small std), strong data augmentation, longer training schedules, label smoothing, knowledge distillation from a CNN teacher (DeiT approach). Mixed precision is essential for 1B params.
Hint 3 - Full Solution + Rubric
Complete training recipe:
Initialization:
- Patch embedding: truncated normal, std=0.02
- Position embeddings: truncated normal, std=0.02
- Attention layers: Xavier uniform
- MLP layers: He normal (GELU approximates ReLU)
- Final classifier head: zero-initialized (initial predictions are uniform across classes)
- LayerNorm: weight=1, bias=0
Precision:
- BF16 mixed precision on A100 GPUs
- FP32 master weights and optimizer states
- FP32 for LayerNorm and softmax attention
Gradient management:
- Gradient clipping: global norm 1.0
- Gradient accumulation: effective batch size 4096 (common for ViT)
- Learning rate: 3e-4 peak with linear warmup (10K steps) + cosine decay
Label strategy:
- Label smoothing: epsilon=0.1
- Knowledge distillation from RegNetY-16GF teacher (DeiT approach)
- Mixup (alpha=0.8) + CutMix (alpha=1.0) augmentation
Additional techniques:
- Stochastic depth (drop path rate 0.1)
- Repeated augmentation
- Random erasing (probability 0.25)
- Weight decay 0.05 (AdamW)
- EMA of model weights for evaluation
Why these choices matter for ViTs specifically:
- ViTs lack the inductive biases of CNNs (translation equivariance, locality), so they need more data and stronger regularization
- Knowledge distillation from a CNN teacher provides the inductive bias that ViTs lack
- Label smoothing prevents the model from becoming overconfident on ImageNet's noisy labels
- Learning-rate warmup is critical for stable transformer training, especially at large batch sizes
Scoring Rubric:
- Strong Hire: Covers all categories (init, precision, gradients, labels, augmentation), provides specific values with justifications, mentions ViT-specific challenges, references DeiT/BEiT choices.
- Lean Hire: Gets the main techniques right but misses ViT-specific nuances or provides generic values without justification.
- No Hire: Treats ViT training the same as ResNet training or cannot provide a coherent recipe.
Interview Cheat Sheet
| Concept | Key Formula | One-Liner | Red Flag |
|---|---|---|---|
| Xavier init | $\mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$ | Preserve variance for linear activations | "Use Xavier for everything" |
| He init | $\mathrm{Var}(W) = \frac{2}{n_{\text{in}}}$ | Compensates for ReLU halving variance | "He and Xavier are the same" |
| Symmetry breaking | Random init required | Same init = same neurons forever | "Init to zero is fine" |
| Gradient clipping (norm) | $g \leftarrow g \cdot \min\left(1, \frac{c}{\lVert g \rVert_2}\right)$ | Preserve direction, limit magnitude | "Clip each gradient independently" |
| FP16 range | Max 65504 | Needs loss scaling to avoid underflow | "FP16 is always fine" |
| BF16 advantage | Same range as FP32 | No loss scaling needed | "BF16 and FP16 are identical" |
| Loss scaling | Multiply loss by a scale factor $S$ before backprop | Shifts gradients into FP16 range | Not knowing why it exists |
| Knowledge distillation | $\mathcal{L} = \alpha T^2 \, \mathrm{KL}\left(p_t^{(T)} \parallel p_s^{(T)}\right) + (1 - \alpha)\, \mathrm{CE}(y, p_s)$ | Teacher's soft labels transfer dark knowledge | "Just train on teacher's hard predictions" |
| Temperature | $p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$ | Higher T = softer distribution | "T=100 gives best results" |
| Label smoothing | $y' = (1 - \epsilon)\, y + \frac{\epsilon}{K}$ | Prevents overconfidence | "Same as knowledge distillation" |
| Curriculum learning | Easy first, hard later | Mimics human learning | "Always helps in every setting" |
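The loss scaling row is easier to remember with numbers attached. The sketch below rounds a small gradient into FP16 with and without a loss scale, then shows a simplified overflow check. The helper names and the scale of 65536 are assumptions, and the growth phase of dynamic loss scaling (raising the scale after a run of clean steps) is left out.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; values beyond the FP16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def scaled_fp16_grad(true_grad, loss_scale):
    """Store (true_grad * loss_scale) in FP16, then unscale in higher precision.

    Returns (recovered_gradient, value_as_stored_in_fp16)."""
    stored = to_fp16(true_grad * loss_scale)
    return stored / loss_scale, stored


def next_loss_scale(scale, fp16_grads):
    """Halve the scale and skip the step if any gradient is inf or NaN."""
    if any(not math.isfinite(g) for g in fp16_grads):
        return scale / 2.0, True
    return scale, False


true_grad = 1e-8
unscaled, _ = scaled_fp16_grad(true_grad, 1.0)
recovered, stored = scaled_fp16_grad(true_grad, 65536.0)
print("without scaling:", unscaled)
print("with scaling, stored in FP16:", stored, "recovered:", recovered)

# A large gradient overflows once scaled, so the step is skipped and the scale halved.
overflowed = to_fp16(10.0 * 65536.0)
print("after overflow:", next_loss_scale(65536.0, [overflowed]))
```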
Spaced Repetition Checkpoints
Day 0 - Initial Learning
- Read this entire page
- Derive Xavier initialization from the variance preservation requirement
- Derive He initialization and explain why the factor differs from Xavier
- Draw the mixed precision training pipeline from memory
- Explain the knowledge distillation loss function, including the $T^2$ factor
Day 3 - First Recall
- Without notes, write the Xavier and He initialization formulas and state when each applies
- Give the "60-Second Answer" for initialization out loud, timed
- Explain the difference between FP16 and BF16 without looking
- List 5 operations that must stay in FP32 during mixed precision training
Day 7 - Connections
- Explain how initialization, gradient clipping, and normalization all address the same underlying problem (variance control)
- Compare label smoothing and knowledge distillation - similarities and differences
- Do Practice Problem 1 (initialization debugging) without hints
Day 14 - Application
- Do Practice Problem 2 (mixed precision design) under timed conditions (15 minutes)
- Design a knowledge distillation pipeline for a specific task of your choice
- Do Practice Problem 5 (training recipe) - can you produce a complete, coherent recipe?
Day 21 - Mock Interview
- Have someone ask: "Walk me through how you would initialize and train a 100-layer ResNet from scratch"
- Time yourself: full answer should take 5-8 minutes
- Do all 5 practice problems in sequence under timed conditions (50 minutes total)
- Can you debug a training failure from a loss curve alone?
Key Takeaways
- Initialization is not an afterthought. It determines whether your network can even begin learning. Use Xavier for symmetric activations such as tanh and He for ReLU, and know why each formula exists.
- Mixed precision is essential for large models. BF16 is the modern default. Know which operations must stay in FP32 and why loss scaling exists for FP16.
- Knowledge distillation is one of the most practical techniques in ML. It compresses models for deployment while retaining most of the teacher's accuracy. Understand temperature, the $T^2$ factor, and when to use feature-based distillation.
- Label smoothing is cheap insurance. It adds negligible compute and usually improves generalization and calibration.
- Every technique addresses a specific failure mode. Strong candidates can diagnose a problem from a training curve and prescribe the right technique, not just list techniques they have memorized.
