Optimization Algorithms - The Engine Behind Every Model

Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Eng, Data Scientist

The Real Interview Moment

You're twenty minutes into a technical screen at a top AI lab. The interviewer pulls up a training loss curve on the whiteboard - it oscillates wildly for 10 epochs, then plateaus at a suspiciously high value. She asks: "This is our production model. We're using Adam with default hyperparameters. The team tried lowering the learning rate, but it just trains slower and plateaus at the same loss. What would you investigate?"

This is where most candidates freeze. They know what Adam is. They can recite the update rule. But they cannot reason about optimizer behavior in practice. The strong candidate starts asking questions: What's the batch size? Is the loss landscape convex or riddled with saddle points? Have you tried learning rate warmup? Could gradient accumulation help? Is AdamW more appropriate given the weight decay interaction?

The ability to diagnose optimization problems - not just define optimizers - is what separates a hire from a no-hire. This page gives you the mental models, the intuition, and the interview-ready explanations to handle exactly this kind of question.

Optimization is the beating heart of machine learning. Every model, from linear regression to GPT-4, is ultimately the result of an optimization algorithm searching a loss landscape for good parameters. Understanding how that search works, why it sometimes fails, and how to fix it is non-negotiable knowledge for any ML role.

What You Will Master

After reading this page, you will be able to:

Explain SGD, momentum, RMSProp, Adam, and AdamW from first principles with mathematical intuition
Draw the optimizer selection flowchart used by experienced practitioners
Diagnose common training failures: oscillation, divergence, plateaus, and saddle points
Design learning rate schedules (warmup, cosine decay, step decay, cyclical) for different training scenarios
Articulate why Adam sometimes fails and when SGD+momentum is superior
Reason about batch size effects on optimization dynamics and generalization
Implement gradient accumulation to simulate large batch training on limited hardware
Explain convergence theory at the level expected in an MLE interview (convexity, Lipschitz smoothness, convergence rates)
Navigate company-specific optimization questions (Google prefers SGD for vision, OpenAI uses AdamW for LLMs)
Handle whiteboard derivations of Adam's update rule and moment estimation bias correction

Self-Assessment: Where Are You Now?

Skill Area	1 (Never heard of it)	3 (Can explain basics)	5 (Can derive + debug)	Your Rating
SGD + momentum	What's momentum?	Know the formula	Can derive convergence guarantees	___
Adam / AdamW	Just use defaults	Know the update rule	Can explain bias correction + weight decay fix	___
Learning rate schedules	Always use constant LR	Know warmup exists	Design schedules for specific problems	___
Loss landscapes	Never visualized one	Know about local minima	Understand saddle points, flat minima, sharpness	___
Batch size effects	"Bigger is better"	Know about gradient noise	Can explain generalization-batch size relationship	___
Convergence theory	No idea	Know convex = easy	Can state convergence rates for convex/non-convex	___
Practical debugging	Never debugged training	Tried adjusting LR	Systematic optimizer diagnosis process	___

Score interpretation:

7-14: Start here and work through every section carefully. Focus on intuition first, math second.
15-25: Good foundation. Focus on the "why" sections and practice problems.
26-35: You're close to interview-ready. Drill the practice problems and edge cases.

Part 1 - The Optimization Landscape

What Are We Actually Doing?

Every ML model defines a loss function L(theta) that measures how bad the model's predictions are. Training means finding parameters theta* that minimize this loss:

theta* = argmin_theta L(theta)

The challenge: for a neural network with millions of parameters, we cannot solve this analytically. Instead, we use iterative optimization - we start somewhere and take steps that (hopefully) decrease the loss.

60-Second Answer

"Optimization in ML is about finding model parameters that minimize a loss function. We can't solve this analytically for neural networks, so we use iterative methods. The simplest is gradient descent: compute the gradient of the loss with respect to parameters, then step in the opposite direction. The key challenges are: the landscape is non-convex (many local minima and saddle points), we use noisy gradient estimates from mini-batches, and the loss surface has very different curvature in different directions. Modern optimizers like Adam address these by adapting the learning rate per-parameter based on gradient history."

The Loss Landscape: Why Optimization Is Hard

[Figure: Loss Landscape Features]

Interviewer's Perspective

A common interview question is: "What's more problematic in deep learning - local minima or saddle points?" The correct answer is saddle points. In high-dimensional spaces, true local minima are rare. Most critical points are saddle points - they're minima in some directions and maxima in others. A 2014 paper by Dauphin et al. showed that for random high-dimensional functions, the probability of a critical point being a true local minimum decreases exponentially with dimensionality. Strong candidates know this and can explain why adaptive methods handle saddle points better than vanilla SGD.

Convex vs. Non-Convex Optimization

Property	Convex	Non-Convex (Neural Networks)
Local minima	Every local min = global min	Many local minima, saddle points
Convergence guarantee	Yes, to global optimum	Only to stationary point
SGD convergence rate	O(1/sqrt(T)) for smooth convex; O(1/T) if strongly convex	O(1/sqrt(T)) for smooth non-convex (gradient norm only)
Practical examples	Linear regression, logistic regression, SVMs	Neural networks, deep learning
Sensitivity to initialization	Low - any start works	High - initialization matters enormously
Learning rate sensitivity	Wide range works	Narrow range; too high = divergence

Common Trap

Don't claim that neural networks get stuck in "bad local minima." Modern research (Li et al., 2018) shows that most local minima in over-parameterized networks have loss values very close to the global minimum. The real problem is saddle points and flat regions that slow down optimization, not local minima that trap the optimizer at high loss values.

Part 2 - The Optimizer Zoo

Stochastic Gradient Descent (SGD)

The foundation of everything. Instead of computing the gradient over the entire dataset (expensive), we estimate it from a mini-batch:

theta_{t+1} = theta_t - eta * g_t

where g_t = (1/B) * sum_i grad L(x_i, theta_t) is the mini-batch gradient estimate and eta is the learning rate.

Why stochastic? The mini-batch gradient is a noisy estimate of the true gradient. This noise is actually beneficial - it helps escape saddle points and sharp minima, acting as implicit regularization.
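To make the update rule concrete, here is a minimal sketch of mini-batch SGD fitting a one-parameter linear model. The toy data (y = 3x plus noise), the function name, and every hyperparameter value are assumptions chosen for illustration, not a recommended setup.

```python
import random

def minibatch_sgd(xs, ys, lr=0.05, batch_size=8, steps=500, seed=0):
    """Fit y ~ w * x by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Sample a random mini-batch of indices (with replacement).
        batch = [rng.randrange(n) for _ in range(batch_size)]
        # g_t = (1/B) * sum_i d/dw (w * x_i - y_i)^2
        g = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= lr * g  # theta_{t+1} = theta_t - eta * g_t
    return w

# Hypothetical toy data: y = 3x plus small Gaussian noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [3.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]
w_fit = minibatch_sgd(xs, ys)
```

Each step sees only 8 of the 200 points, so individual gradients are noisy, yet w_fit still lands close to the true slope of 3.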

60-Second Answer

"SGD computes the gradient on a random mini-batch instead of the full dataset. This makes each step noisy but much cheaper. The noise is actually beneficial - it helps escape sharp minima and saddle points, and there's evidence it leads to flatter minima that generalize better. The main downside is that vanilla SGD can oscillate in ravines where the loss surface has very different curvature in different directions. Momentum fixes this by accumulating a running average of past gradients."

SGD with Momentum

Momentum adds inertia - the update accumulates past gradients like a ball rolling downhill:

v_t = beta * v_{t-1} + g_t
theta_{t+1} = theta_t - eta * v_t

Typical beta = 0.9 means 90% of the velocity is carried over from the previous step.

Why it works: In a ravine (steep walls, shallow floor), the gradient oscillates between walls. Momentum cancels these oscillations (they alternate sign) and accumulates the consistent downhill component.
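The ravine argument can be checked on a toy problem. The sketch below is an illustration under stated assumptions: a two-dimensional quadratic that is 100x steeper along y than along x, exact (not mini-batch) gradients, and an arbitrary starting point. Setting beta=0 recovers plain gradient descent.

```python
def loss(x, y):
    # Ravine-shaped quadratic: shallow along x, steep along y.
    return 0.5 * (x * x + 100.0 * y * y)

def grad(x, y):
    return x, 100.0 * y

def run(beta, eta=0.015, steps=100):
    """beta = 0 gives plain gradient descent; beta > 0 adds momentum."""
    x, y = 10.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + gx          # v_t = beta * v_{t-1} + g_t
        vy = beta * vy + gy
        x -= eta * vx                # theta_{t+1} = theta_t - eta * v_t
        y -= eta * vy
    return loss(x, y)

loss_plain = run(beta=0.0)
loss_momentum = run(beta=0.9)
```

With the same learning rate and step budget, the momentum run ends at a far lower loss because it keeps making progress along the shallow x direction, where plain gradient descent crawls.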

Nesterov momentum is a slight improvement - it computes the gradient at the "look-ahead" position:

v_t = beta * v_{t-1} + grad L(theta_t - eta * beta * v_{t-1})
theta_{t+1} = theta_t - eta * v_t

This gives a corrective term that reduces overshooting.

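Variant	Convergence (Convex, Smooth, Full-Batch Gradients)	Key Advantage
Vanilla SGD	O(1/T)	Simplicity, strong generalization
SGD + Momentum	O(1/T), better constant	Accelerates through ravines
SGD + Nesterov	O(1/T^2)	Look-ahead correction

These rates assume exact full-batch gradients. With noisy mini-batch gradients, all three are limited to O(1/sqrt(T)) on convex problems.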

RMSProp

RMSProp adapts the learning rate per-parameter based on the magnitude of recent gradients:

s_t = beta * s_{t-1} + (1 - beta) * g_t^2
theta_{t+1} = theta_t - eta * g_t / (sqrt(s_t) + epsilon)

Parameters with large recent gradients get a smaller effective learning rate; parameters with small gradients get a larger one. This is crucial for parameters that receive very sparse gradient signals (e.g., embeddings for rare words).
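A small sketch of the per-parameter scaling, assuming the gradient is held constant so the running average can settle. The helper names and the two gradient magnitudes are made up for illustration.

```python
import math

def rmsprop_step(theta, g, s, eta=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update for a single scalar parameter."""
    s = beta * s + (1.0 - beta) * g * g
    theta = theta - eta * g / (math.sqrt(s) + eps)
    return theta, s

def last_step_size(g, steps=200):
    """Size of the final update when the gradient is held constant at g."""
    theta, s = 0.0, 0.0
    prev = theta
    for _ in range(steps):
        prev = theta
        theta, s = rmsprop_step(theta, g, s)
    return abs(theta - prev)

step_large_grad = last_step_size(g=100.0)
step_small_grad = last_step_size(g=0.01)
```

Although the two gradients differ by a factor of 10,000, both parameters end up moving by roughly eta per step, which is exactly the normalization described above.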

Historical note: RMSProp was proposed by Geoff Hinton in a Coursera lecture, never formally published. It was independently motivated as a fix for AdaGrad's aggressively decaying learning rate.

Adam - The Default Choice

Adam combines momentum (first moment) with RMSProp (second moment):

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t          # First moment (momentum)
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2         # Second moment (RMSProp)
m_hat_t = m_t / (1 - beta1^t)                        # Bias correction
v_hat_t = v_t / (1 - beta2^t)                        # Bias correction
theta_{t+1} = theta_t - eta * m_hat_t / (sqrt(v_hat_t) + epsilon)

Default hyperparameters: beta1=0.9, beta2=0.999, epsilon=1e-8, eta=1e-3

60-Second Answer

"Adam combines two ideas: momentum (tracking the exponential moving average of gradients, like SGD+momentum) and adaptive learning rates (tracking the exponential moving average of squared gradients, like RMSProp). The first moment gives you the direction, the second moment gives you per-parameter scaling. The bias correction terms fix the initialization problem - both moments are initialized to zero, so without correction the early estimates would be biased toward zero. Default hyperparameters (beta1=0.9, beta2=0.999, lr=1e-3) work surprisingly well for most problems, which is why Adam is the go-to optimizer."

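Why bias correction matters: At step t=1, m_1 = 0.1 * g_1 (biased toward 0). The correction m_hat_1 = m_1 / (1 - 0.9^1) = m_1 / 0.1 = g_1 recovers the true gradient estimate. The second moment is biased even more strongly: v_1 = 0.001 * g_1^2, so sqrt(v_1) is about 0.032 * |g_1|. With no correction at all, the first update would be eta * 0.1 / 0.032, roughly 3x larger than the intended eta, so the correction mainly protects early training from oversized steps.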
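The update rule and the t=1 arithmetic can be traced with a scalar sketch. This is an illustrative single-parameter version using the default hyperparameters listed above; the function name and the gradient value 4.0 are arbitrary choices.

```python
import math

def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g            # first moment
    v = beta2 * v + (1.0 - beta2) * g * g        # second moment
    m_hat = m / (1.0 - beta1 ** t)               # bias correction
    v_hat = v / (1.0 - beta2 ** t)               # bias correction
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v, m_hat, v_hat

g1 = 4.0
theta1, m1, v1, m_hat1, v_hat1 = adam_step(theta=0.0, g=g1, m=0.0, v=0.0, t=1)
```

Here m1 is only 0.4, but m_hat1 recovers the full gradient 4.0 and v_hat1 recovers 16.0, so the first step has magnitude eta regardless of how large the gradient is.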

Common Trap

Many candidates dismiss Adam's epsilon as just a "small constant to prevent division by zero." While that's technically what it does, the value of epsilon actually matters. If the optimizer arithmetic runs in FP16, the default epsilon=1e-8 is below FP16's smallest representable positive value (~6e-8) and rounds to zero, so a zero second-moment estimate leads to a division by zero and NaN updates. Either keep the optimizer state in FP32 (the usual mixed-precision setup) or raise epsilon to around 1e-5 to 1e-4. This is a real production issue that strong candidates know about.

AdamW - The Fixed Adam

Weight decay was long implemented incorrectly for Adam. Common implementations added an L2 penalty to the gradient, but in Adam, L2 regularization and weight decay are not equivalent (unlike in plain SGD, where they are). AdamW (Loshchilov & Hutter, 2019) fixes this:

Adam with L2 regularization (wrong):

g_t = grad L(theta_t) + lambda * theta_t    # L2 added to gradient
# Then Adam update as usual - but the adaptive scaling affects the regularization term!

AdamW (correct):

# Compute m_hat_t and v_hat_t as usual from the unregularized gradient
theta_{t+1} = theta_t - eta * (m_hat_t / (sqrt(v_hat_t) + epsilon) + lambda * theta_t)    # Decoupled weight decay

The difference: in Adam+L2, the penalty term lambda * theta_t is divided by sqrt(v_hat_t) along with the rest of the gradient. Parameters with large historical gradients therefore get less regularization and parameters with small gradients get more, so the effective decay depends on gradient scale rather than on lambda alone. AdamW applies weight decay directly, independent of the adaptive scaling.
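A toy experiment makes the difference visible. The sketch below is an illustration under strong assumptions: a single scalar weight, and a loss gradient that simply alternates between +g_scale and -g_scale so that it has zero mean and only the regularizer can shrink the weight systematically. All names and values are hypothetical.

```python
import math

def run(g_scale, decoupled, steps=1000, eta=1e-3, lam=0.1,
        beta1=0.9, beta2=0.999, eps=1e-8):
    """Train one scalar weight on a zero-mean gradient alternating +/- g_scale."""
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = g_scale if t % 2 == 1 else -g_scale
        if not decoupled:
            g = g + lam * theta            # Adam + L2: penalty enters the gradient
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        update = m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            update = update + lam * theta  # AdamW: decay bypasses adaptive scaling
        theta = theta - eta * update
    return theta

l2_large, l2_small = run(100.0, decoupled=False), run(0.01, decoupled=False)
adamw_large, adamw_small = run(100.0, decoupled=True), run(0.01, decoupled=True)
```

Under Adam+L2 the weight with large gradients is barely regularized while the weight with small gradients is shrunk aggressively. Under AdamW both weights decay by nearly the same amount, since the decay no longer passes through the division by sqrt(v_hat).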

Interviewer's Perspective

If a candidate can explain the AdamW vs. Adam distinction clearly, it signals deep understanding. The follow-up question I like to ask is: "So when does the Adam vs. AdamW difference actually matter?" Answer: It matters most when weight decay is large (common in LLM training with lambda=0.1) and when the adaptive scaling varies significantly across parameters (which it does in transformers where attention weights and embedding layers have very different gradient magnitudes).

Company Variation

Google/DeepMind (vision): SGD+momentum is the usual choice for ResNet-style image classification, where it tends to reach better final accuracy than Adam.
OpenAI/Anthropic (LLMs): AdamW is standard for transformer training. GPT-3 used AdamW with beta1=0.9, beta2=0.95.
Meta (recommendations): Mixed - Adam for embedding-heavy models, SGD for dense models.
Startups: Almost always Adam/AdamW - faster convergence matters more than squeezing out 0.1% accuracy.

The Optimizer Selection Flowchart

[Figure: Optimizer Selection Flowchart]

When Adam Fails - And You Need SGD

Adam does not always win. Key failure modes:

Generalization gap: Multiple studies (Wilson et al., 2017) showed that SGD+momentum generalizes better than Adam on image classification. The adaptive learning rate can converge to sharp minima that overfit.
Non-convergence in theory: Reddi et al. (2018) showed that Adam can fail to converge even on simple convex problems. The fix (AMSGrad) uses the maximum of past second moments instead of the exponential average, but in practice the difference is small.
Interaction with weight decay: Before AdamW, using L2 regularization with Adam gave incorrect regularization behavior (see AdamW section above).
Large batch training: With very large batches, the gradient noise that Adam relies on for exploration diminishes. LARS/LAMB optimizers (layer-wise adaptive learning rates) work better for large-batch training.

Scenario	Best Optimizer	Why
ResNet on ImageNet (best accuracy)	SGD + momentum	Better generalization, well-studied schedules
BERT/GPT pre-training	AdamW	Adaptive LR essential for transformers
Fine-tuning a pre-trained model	AdamW with small LR	Stable, preserves pre-trained features
GAN training	Adam (tuned betas)	Stability in adversarial dynamics
Reinforcement learning	Adam	Handles non-stationary objectives
Recommendation with embeddings	Adam/AdamW	Sparse gradients need adaptive LR
Training with batch size > 8K	LARS/LAMB	Layer-wise scaling for large batches
Limited compute budget	Adam	Converges faster in wall-clock time

Part 3 - Learning Rate Schedules

The learning rate is the single most important hyperparameter. A learning rate schedule changes the learning rate during training to get the best of both worlds: fast early progress and fine-grained late convergence.

Why Schedules Matter

[Figure: Learning Rate Schedule Phases]

Common Learning Rate Schedules

1. Step Decay Multiply LR by a factor (e.g., 0.1) at fixed epochs.

LR = initial_lr * gamma^(floor(epoch / step_size))

Example: Start at 0.1, multiply by 0.1 at epochs 30, 60, 90.

Pros: Simple, well-studied (standard for ResNet training)
Cons: Discrete jumps can cause instability; requires manual tuning of milestones
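The step decay formula above translates directly into a small function. This is a sketch that uses the example values from the text (start at 0.1, gamma of 0.1, a drop every 30 epochs); the function name is an assumption.

```python
def step_decay_lr(epoch, initial_lr=0.1, gamma=0.1, step_size=30):
    """LR = initial_lr * gamma ** floor(epoch / step_size)."""
    return initial_lr * gamma ** (epoch // step_size)

# Learning rate just before and at each milestone.
lrs = {epoch: step_decay_lr(epoch) for epoch in (0, 29, 30, 60, 90)}
```

The LR stays at 0.1 through epoch 29, then drops to 0.01, 0.001, and 0.0001 at epochs 30, 60, and 90.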

2. Cosine Annealing Smoothly decreases LR following a cosine curve:

LR = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))

Pros: Smooth, few hyperparameters (T plus the eta_max and eta_min endpoints); strong empirical results
Cons: Fixed schedule length; extending training requires warm restarts

3. Warmup + Cosine Decay (the modern standard) Start with a linearly increasing LR (warmup), then cosine decay:

if t < warmup_steps:
    LR = initial_lr * (t / warmup_steps)
else:
    LR = eta_min + 0.5 * (initial_lr - eta_min) * (1 + cos(pi * (t - warmup_steps) / (T - warmup_steps)))
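The schedule above can be written as a runnable function. This is a minimal sketch: the parameter names peak_lr and min_lr stand in for initial_lr and eta_min, and the run length, warmup fraction, and peak value are hypothetical.

```python
import math

def warmup_cosine_lr(t, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if t < warmup_steps:
        return peak_lr * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 10,000 steps, 3% warmup, decay to 10% of the peak.
T, W, PEAK = 10_000, 300, 3e-4
schedule = [warmup_cosine_lr(t, T, W, PEAK, min_lr=0.1 * PEAK) for t in range(T + 1)]
```

The schedule starts at zero, reaches the peak exactly at the end of warmup, passes through the midpoint between peak and floor halfway through the decay, and ends at the floor.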

60-Second Answer

"Learning rate warmup solves a cold-start problem. At the beginning of training, the model's parameters are random, so the gradients are large and unreliable. Adam's second moment estimates are also inaccurate (initialized to zero, not yet warmed up). A high learning rate + noisy gradients + inaccurate adaptive scaling = potential divergence. Warmup starts with a tiny learning rate and ramps up linearly, giving the optimizer time to calibrate its moment estimates. This is especially critical for transformers where the attention mechanism can produce very large gradients early in training. For most transformer training, 1-5% of total steps as warmup works well."

4. Cyclical Learning Rates / Warm Restarts Periodically reset the LR to a high value:

SGDR (warm restarts): Cosine annealing with periodic resets
1cycle policy: One cycle of increasing then decreasing LR (Smith, 2018)

Warm restarts help escape local minima by periodically increasing exploration. The 1cycle policy is particularly effective for fast training - it enables super-convergence where the model trains in far fewer epochs.

5. Linear Decay Decrease LR linearly to zero:

LR = initial_lr * (1 - t / T)

Simple and effective for fine-tuning pre-trained models
Used in many NLP fine-tuning recipes (BERT, GPT)

Schedule Comparison Table

Schedule	Best For	Key Hyperparameters	Common Pitfall
Step decay	CNN training (ResNet)	Milestones, gamma	Wrong milestone selection
Cosine annealing	General-purpose	T (total steps)	Fixed schedule length
Warmup + cosine	Transformer training	Warmup steps, peak LR	Too short warmup
1cycle	Fast training, limited budget	Max LR, div factor	Max LR too aggressive
Linear decay	Fine-tuning	Initial LR	Starting LR too high for fine-tuning
Exponential decay	Legacy, rarely used now	Decay rate	Over-decays, training stalls

Company Variation

Google (BERT): Linear warmup (10K steps) + linear decay. Simple and effective.
OpenAI (GPT-3): Linear warmup (375M tokens) + cosine decay to 10% of peak LR.
Meta (LLaMA): Cosine decay with warmup (2000 steps) to 10% of peak.
Kaiming He's group (vision): Step decay (divide by 10 at specific epochs) for ResNets.

Part 4 - Batch Size, Gradient Accumulation, and Practical Training

Batch Size Effects

Batch size affects optimization in multiple ways:

Batch Size	Gradient Noise	Convergence	Generalization	Hardware
Small (16-64)	High noise	Slow in wall-clock, more updates per epoch	Often better (noise acts as regularizer)	Fits on one GPU
Medium (128-512)	Moderate	Good balance	Good	One or few GPUs
Large (1K-64K)	Low noise	Fast in wall-clock, fewer updates	Can degrade without tricks	Multi-GPU / TPU

Interviewer's Perspective

The batch size question separates textbook knowledge from practical experience. A strong candidate knows that doubling the batch size should be paired with doubling the learning rate (linear scaling rule) - but also knows this breaks down for very large batches. They mention gradient accumulation as a practical solution for simulating large batches on small GPUs. And they know that large-batch training can hurt generalization unless you use warmup, LARS/LAMB, or label smoothing to compensate.

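The Linear Scaling Rule: When you increase batch size by factor k, increase LR by factor k. The reasoning: one step on a batch of size k*B with LR k*eta approximates k consecutive steps on batches of size B with LR eta, as long as the gradient changes little over those k steps. This rule fails for very large batches - the scaled LR eventually exceeds what the curvature of the loss allows, and the approximation stops holding.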

The generalization gap: Larger batches tend to converge to sharper minima that generalize worse. Keskar et al. (2017) showed this empirically. Mitigations:

Warmup (let the optimizer adapt before taking large steps)
LARS/LAMB (layer-wise adaptive learning rates)
Longer training (more epochs compensate for fewer updates per epoch)
Label smoothing
Gradient noise injection

Gradient Accumulation

When your ideal batch size doesn't fit in GPU memory, accumulate gradients over multiple forward/backward passes:

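# Assumes model, criterion, optimizer, and dataloader are already defined
accumulation_steps = 4                               # Micro-batches per optimizer step (example value)

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)                          # Forward pass (micro-batch)
    loss = criterion(outputs, labels) / accumulation_steps  # Scale loss
    loss.backward()                                  # Accumulate gradients

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                              # Update weights
        optimizer.zero_grad()                         # Reset gradients
# Note: if len(dataloader) is not a multiple of accumulation_steps,
# the last few micro-batches never trigger an optimizer step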

Key detail: You must divide the loss by accumulation_steps to keep the gradient magnitude equivalent to a single large batch.
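The scaling claim is easy to verify numerically. The sketch below is a framework-free illustration, assuming a mean-reduced squared-error loss, equal-sized micro-batches, and made-up data. It compares the gradient of one batch of 8 with the sum of four scaled micro-batch gradients.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical batch of 8 examples and a current weight value.
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 1.0, -2.0]
ys = [1.0, -2.5, 4.0, 3.5, -1.0, 6.5, 2.0, -4.0]
w = 0.3

full_batch_grad = mse_grad(w, xs, ys)

accumulation_steps, micro = 4, 2
accumulated = 0.0
for k in range(accumulation_steps):
    lo, hi = k * micro, (k + 1) * micro
    # Dividing by accumulation_steps mirrors "loss / accumulation_steps" above.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps
```

The two gradients agree up to floating-point rounding. Dropping the division would make the accumulated gradient 4x too large, which is equivalent to silently multiplying the learning rate by 4 under plain SGD.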

Common Trap

Gradient accumulation is not mathematically identical to large batch training when using batch normalization. BN statistics are computed per micro-batch, not over the accumulated batch, and synchronized batch norm does not fix this because it synchronizes across devices, not across sequential accumulation steps. For models with BN layers, keep micro-batches large enough for stable statistics or switch to layer norm / group norm. Layer norm, which transformers use, does not depend on batch statistics, so it is fully compatible with gradient accumulation and distributed training.

Mixed-Precision Training

Modern training uses FP16/BF16 for forward and backward passes, FP32 for weight updates:

Forward pass:   FP16 (fast, less memory)
Backward pass:  FP16 (fast, less memory)
Master weights: FP32 (precise updates)
Loss scaling:   Multiply loss by a scale factor to prevent FP16 gradient underflow

Why this matters for optimization:

FP16 has limited dynamic range - gradients smaller than ~6e-8 underflow to zero
Loss scaling multiplies the loss before backward pass (amplifies gradients) and divides the gradient before the optimizer step
BF16 (Brain Float) has the same exponent range as FP32 but less precision - better for training, no loss scaling needed
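Adam's epsilon must be increased (1e-4 to 1e-5) if the optimizer state itself is kept in FP16, because the default 1e-8 rounds to zero there; with FP32 optimizer state the default is usually fine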
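Underflow and loss scaling can be demonstrated without a GPU. The sketch below is an illustration that uses Python's struct module to round values to IEEE 754 half precision. The gradient value and the static scale factor of 1024 are assumptions, and the unscaling happens in Python's native FP64 rather than FP32.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

true_grad = 1e-8                      # below FP16's smallest subnormal (~6e-8)
unscaled = to_fp16(true_grad)         # what FP16 stores without loss scaling

loss_scale = 1024.0                   # hypothetical static scale factor
scaled = to_fp16(true_grad * loss_scale)
recovered = scaled / loss_scale       # unscale in higher precision before the step
```

Without scaling the gradient is stored as exactly zero. With scaling it survives the trip through FP16 and comes back within about 1% of its true value.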

Part 5 - Convergence Theory (Interview Level)

You don't need to prove theorems, but you need to know the key results and be able to state them.

Key Convergence Results

Setting	Algorithm	Rate	What It Means
Convex, smooth, full GD	GD	O(1/T)	After T steps, error < C/T
Convex, smooth, SGD	SGD	O(1/sqrt(T))	Slower due to gradient noise
Strongly convex, smooth, SGD	SGD (decaying LR)	O(1/T)	Strong convexity recovers the rate
Non-convex, smooth, SGD	SGD	O(1/sqrt(T)) to stationary point	Only guarantee: small gradient norm
Convex, Nesterov accelerated	Nesterov GD	O(1/T^2)	Optimal for first-order methods

60-Second Answer

"For convex problems, SGD converges at O(1/sqrt(T)) - meaning you need 4x the steps to halve the error. For strongly convex problems (like L2-regularized objectives), you get the faster O(1/T) rate with decaying learning rate. For non-convex problems like neural networks, we can only guarantee convergence to a stationary point (gradient near zero) at O(1/sqrt(T)). In practice, this means non-convex optimization is fundamentally harder - we have no guarantee of finding the global minimum, but empirically the local minima of over-parameterized networks are nearly as good as the global minimum."

The Learning Rate-Convergence Tradeoff

For gradient descent on an L-smooth function (one whose gradient is L-Lipschitz):

Learning rate must satisfy eta < 2/L for convergence (where L is the Lipschitz constant of the gradient)
Too large: divergence (overshooting the minimum)
Too small: very slow convergence
Optimal: eta = 1/L for gradient descent, but we rarely know L in practice
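These thresholds can be seen on the simplest possible example. The sketch below assumes a one-dimensional quadratic f(x) = 0.5 * L * x^2, whose gradient is exactly L-Lipschitz. The value L = 4 and the three learning rates are arbitrary illustrations.

```python
def gd_final_distance(eta, L=4.0, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * L * x^2 and return |x_T|."""
    x = x0
    for _ in range(steps):
        x -= eta * L * x          # gradient of f is L * x
    return abs(x)

L = 4.0
d_optimal = gd_final_distance(1.0 / L)    # eta = 1/L
d_slow = gd_final_distance(0.01 / L)      # eta far too small
d_diverge = gd_final_distance(2.1 / L)    # eta just above 2/L
```

Each step multiplies x by (1 - eta * L). With eta = 1/L that factor is zero and the minimum is reached in one step; with a tiny eta the factor is 0.99 and progress is slow; just above 2/L the factor is -1.1, so the iterates flip sign and grow without bound.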

Gradient Clipping

When gradients are very large (common in RNNs and early transformer training), clip them:

Clip by norm:

if ||g|| > max_norm:
    g = g * (max_norm / ||g||)

Clip by value:

g = clamp(g, -max_value, max_value)

Clip by norm is preferred because it preserves gradient direction. Clip by value can change direction.
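The direction claim is easy to check on a two-component gradient. This is a plain-Python sketch with hypothetical function names and values, not a framework API.

```python
import math

def clip_by_norm(g, max_norm):
    """Rescale the whole vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm > max_norm:
        return [x * max_norm / norm for x in g]
    return list(g)

def clip_by_value(g, max_value):
    """Clamp each component independently to [-max_value, max_value]."""
    return [max(-max_value, min(max_value, x)) for x in g]

g = [3.0, 40.0]
by_norm = clip_by_norm(g, max_norm=5.0)
by_value = clip_by_value(g, max_value=5.0)
```

Norm clipping shrinks both components by the same factor, so the ratio between them stays at 40:3. Value clipping returns [3.0, 5.0], which points in a noticeably different direction from the original gradient.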

Instant Rejection

If asked "How do you handle exploding gradients?" and you only say "gradient clipping" without mentioning why gradients explode (deep networks with large weights, recurrent connections, attention with large values), you're giving a surface-level answer. Always explain the cause before the fix. Even better, mention that gradient clipping is a symptom treatment - better architecture choices (residual connections, layer norm, proper initialization) address the root cause.

Part 6 - Advanced Topics for Senior Roles

Second-Order Methods

Second-order methods use the Hessian (matrix of second derivatives) to take better steps:

theta_{t+1} = theta_t - H^{-1} * g_t

where H is the Hessian matrix. This gives the Newton step, which converges quadratically near the minimum.

Why we don't use them directly: The Hessian is n x n where n = number of parameters. For a 100M parameter model, that's 10^16 entries - impossible to store or invert.

Approximations used in practice:

L-BFGS: Approximates the inverse Hessian from gradient history. Works for small-to-medium models.
Natural gradient: Uses the Fisher information matrix instead of the Hessian. Theoretically elegant.
K-FAC: Kronecker-factored approximate curvature. Practical for neural networks.
Shampoo (Google): Structured second-order method that's competitive with Adam on TPUs.

LARS and LAMB for Large-Batch Training

LARS (Layer-wise Adaptive Rate Scaling):

For each layer l:
    local_lr = trust_coefficient * (||theta_l|| / ||g_l||)
    theta_l = theta_l - local_lr * eta * g_l

The ratio ||theta_l|| / ||g_l|| (the trust ratio) scales the learning rate per layer so that each update is a fixed small fraction of that layer's weight norm. This prevents layers with small weights from getting destabilizing large updates.
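A stripped-down sketch of the per-layer scaling follows. It omits the momentum and weight decay terms that full LARS includes; the parameter trust plays the role of the scalar in front of the norm ratio in the pseudocode above, and all values are illustrative assumptions.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def lars_update(theta, g, eta=0.1, trust=0.001, eps=1e-12):
    """Simplified LARS step for one layer (no momentum, no weight decay)."""
    local_lr = trust * l2_norm(theta) / (l2_norm(g) + eps)
    return [w - eta * local_lr * gi for w, gi in zip(theta, g)]

layer = [3.0, 4.0]                                  # weight norm is 5
after_small = lars_update(layer, [0.01, 0.0])       # tiny gradient
after_huge = lars_update(layer, [1000.0, 0.0])      # huge gradient
```

Both calls move the layer by the same distance, eta * trust * ||theta||, even though the gradients differ by five orders of magnitude. Only the gradient's direction matters; the step length is tied to the weight norm.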

LAMB (Layer-wise Adaptive Moments for Batch training): Combines LARS with Adam. Used by Google to train BERT in 76 minutes with batch size 65K.

Stochastic Weight Averaging (SWA)

Average the weights from multiple points along the SGD trajectory:

After training for T_0 epochs:
    theta_SWA = (1/K) * sum_{i=1}^{K} theta_{T_0 + i*c}

SWA finds flatter minima with better generalization. It's essentially a cheap ensemble that requires no additional inference cost.

Lookahead Optimizer

Maintains two sets of weights - "fast weights" that explore with any optimizer, and "slow weights" that average the fast weights periodically:

For every k steps of inner optimizer:
    theta_slow = theta_slow + alpha * (theta_fast - theta_slow)
    theta_fast = theta_slow  # Reset fast to slow

Reduces variance and improves stability. Can wrap any optimizer (Lookahead-Adam, Lookahead-SGD).
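The two-level loop can be sketched with plain gradient descent as the inner optimizer. This is an illustration on a scalar parameter; the function name, the quadratic objective, and the values of k, alpha, and the inner learning rate are assumptions.

```python
def lookahead_sgd(grad_fn, theta0, inner_lr=0.1, k=5, alpha=0.5, outer_rounds=20):
    """Lookahead wrapped around plain gradient descent on a scalar parameter."""
    slow = theta0
    for _ in range(outer_rounds):
        fast = slow                              # reset fast weights to slow weights
        for _ in range(k):                       # k steps of the inner optimizer
            fast -= inner_lr * grad_fn(fast)
        slow = slow + alpha * (fast - slow)      # move slow weights toward fast weights
    return slow

theta_final = lookahead_sgd(grad_fn=lambda x: x, theta0=1.0)   # f(x) = 0.5 * x^2
```

The slow weights only travel a fraction alpha of the way toward wherever the fast weights ended up, which damps the effect of any single noisy inner trajectory.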

Practice Problems

Problem 1: Diagnose the Training Curve

You're training a transformer model. The training loss decreases for 5 epochs, then suddenly spikes to NaN. What happened and how do you fix it?

Hint 1 - Direction

Think about what can cause numerical values to become NaN. Consider the learning rate, the loss scale, and specific operations in transformers that can produce very large values.

Hint 2 - Insight

NaN usually comes from: (1) gradient explosion leading to overflow, (2) division by zero (e.g., in softmax, layer norm, or Adam's denominator when epsilon underflows), or (3) log of zero/negative values. In transformers specifically, the attention scores (QK^T) can grow very large, leading to softmax saturation and gradient spikes.

Hint 3 - Full Solution + Rubric

Root causes (investigate in order):

Learning rate too high: After 5 epochs, the optimizer reaches a region with high curvature. The update overshoots, gradients explode, NaN.
- Fix: Reduce LR, add warmup, add gradient clipping
Loss scaling overflow (FP16): The loss scale grows too large, causing FP16 overflow.
- Fix: Use dynamic loss scaling with backoff, or switch to BF16
Attention score explosion: Entries of QK^T grow with the head dimension d_k and with the norms of the query and key vectors, which can drift upward during training.
- Fix: Use scaled dot-product attention (divide by sqrt(d_k)), check initialization
Numerical instability in softmax/log_softmax: Large logits cause exp() overflow.
- Fix: Use log_softmax instead of separate log and softmax

Scoring Rubric:

Strong Hire: Systematically investigates multiple causes, mentions FP16 issues, knows about attention scaling, suggests gradient clipping as immediate fix + root cause analysis
Lean Hire: Identifies gradient explosion and suggests clipping and LR reduction, but misses numerical precision issues
No Hire: Only suggests "lower the learning rate" without diagnosing the actual cause

Problem 2: Optimizer Selection

You're fine-tuning a pre-trained BERT model on a classification task with 10K training examples. The team has been using SGD with momentum and getting 85% accuracy. The PM wants 90%. What optimizer changes would you try?

Hint 1 - Direction

Think about why SGD might underperform for fine-tuning a pre-trained transformer. Consider the structure of the parameter space (embeddings vs. classification head) and the learning rate requirements.

Hint 2 - Insight

Pre-trained transformers have parameters at very different scales - the embedding layers, attention weights, and the randomly initialized classification head all need different effective learning rates. Adaptive optimizers handle this naturally. Also consider the learning rate schedule and whether the current LR is appropriate for fine-tuning.

Hint 3 - Full Solution + Rubric

Recommended changes (in priority order):

Switch to AdamW: Adaptive per-parameter learning rates handle the heterogeneous parameter scales in BERT. Use lr=2e-5 (standard BERT fine-tuning LR), weight_decay=0.01.
Add linear warmup + linear decay: 6-10% of total steps as warmup. This is standard in every BERT fine-tuning recipe and makes a significant difference.
Layer-wise learning rate decay: Lower LR for earlier layers (general-purpose features worth preserving), higher for later layers and the classification head. E.g., multiply LR by 0.95 for each layer moving from the head toward the embeddings.
Try different LR values: Grid search over [1e-5, 2e-5, 3e-5, 5e-5]. BERT fine-tuning is sensitive to LR.
Increase epochs (within reason): 3-5 epochs is typical for BERT fine-tuning. More can overfit with 10K examples.

Why SGD underperforms here: Plain SGD applies one global LR to every parameter unless you configure per-layer LRs by hand. The pre-trained embedding layer needs a tiny LR to avoid catastrophic forgetting, while the classification head needs a larger LR to learn from scratch. Adaptive optimizers adjust the step size per parameter automatically.
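The layer-wise learning rate decay from the list above can be sketched as a small helper that produces one LR per block. The function name and the layout of the returned list are assumptions; in practice each value would be attached to the matching parameter group of the optimizer.

```python
def layerwise_lrs(num_layers, head_lr=2e-5, decay=0.95):
    """Return LRs for [embeddings, layer_1, ..., layer_N, head].

    The head keeps head_lr; each step toward the embeddings multiplies by decay.
    """
    depth_from_head = range(num_layers + 1, -1, -1)   # embeddings are farthest
    return [head_lr * decay ** d for d in depth_from_head]

lrs = layerwise_lrs(num_layers=12)   # BERT-base has 12 encoder layers
```

With 12 encoder layers and a decay of 0.95, the embeddings train at roughly half the LR of the classification head, and the LR rises monotonically in between.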

Scoring Rubric:

Strong Hire: Recommends AdamW with specific hyperparameters, mentions warmup schedule, discusses layer-wise LR decay, explains why adaptive methods help for fine-tuning
Lean Hire: Suggests switching to Adam, knows about low LR for fine-tuning, but missing schedule and layer-wise details
No Hire: Suggests "just try different learning rates with SGD" without understanding why the optimizer choice matters

Problem 3: Large Batch Training

Your team wants to reduce training time for a large language model by increasing the batch size from 256 to 4096. What changes to the optimization setup are needed?

Hint 1 - Direction

Think about what changes when you increase batch size by 16x. Consider the learning rate, gradient noise, convergence behavior, and any special techniques for large-batch training.

Hint 2 - Insight

The linear scaling rule says to scale LR proportionally to batch size, but this doesn't work perfectly at very large scales. You need warmup to let the optimizer adjust to the larger LR. Also consider that reduced gradient noise changes the optimization dynamics - you may need to train for more epochs or use techniques that inject noise back.

Hint 3 - Full Solution + Rubric

Required changes:

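Scale learning rate: Increase LR by up to 16x (linear scaling rule). Treat this as an upper bound rather than a target: for Adam-family optimizers, a more conservative starting point is square-root scaling (4x). Test empirically.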
Extend warmup: Longer warmup period (proportional to LR increase). Large LR + early unstable gradients = divergence without adequate warmup. Try 2-5% of total steps.
Consider LAMB optimizer: For very large batches, LAMB (Layer-wise Adaptive Moments for Batch training) outperforms AdamW. It adds layer-wise trust ratios.
Gradient clipping: More aggressive clipping may be needed. Large batch gradients are lower variance but can still have outliers.
Adjust total training steps: Fewer steps per epoch (since each step processes 16x more data). But you might need more epochs to compensate for reduced gradient noise.
Monitor carefully: Track gradient norm per layer, loss stability, and validation metrics. Large batch training is less forgiving of hyperparameter choices.

What NOT to change:

Don't change beta1, beta2, epsilon - these are generally robust to batch size changes
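Don't change the weight decay coefficient at first - but note that in common AdamW implementations the per-step decay is eta * lambda, so raising the LR also strengthens the effective decay; re-tune lambda if the model starts to underfit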

Scoring Rubric:

Strong Hire: Mentions linear scaling rule with caveats, warmup extension, LAMB as an option, discusses generalization risks, monitoring strategy
Lean Hire: Knows about LR scaling and warmup, but missing advanced techniques and generalization discussion
No Hire: Just says "increase the learning rate" or "it should just work with the same settings"

Problem 4: Cosine vs. Step Decay

Your colleague argues that cosine annealing is always better than step decay. Design an experiment to test this claim on your image classification task.

Hint 1 - Direction

Think about what "better" means - training speed? Final accuracy? Robustness to hyperparameters? Think about what variables you need to control and what metrics to report.

Hint 2 - Insight

A fair comparison requires equal total training budget, proper hyperparameter tuning for both schedules, and multiple random seeds. The answer likely depends on the total training budget - cosine annealing tends to win when the training budget is known in advance and you train for exactly that many steps.

Hint 3 - Full Solution + Rubric

Experimental design:

Fixed variables: Same model architecture, dataset, optimizer (SGD+momentum), weight decay, batch size, data augmentation, total epochs
Tuned variables per schedule:
- Step decay: Grid search over milestone sets x gamma values x initial LR
- Cosine: Grid search over initial LR x eta_min
Metrics: Final top-1 accuracy, best accuracy during training, time-to-threshold (epochs to reach 90% accuracy), variance across 3+ random seeds
Training budgets: Test at multiple budgets (50, 100, 200, 300 epochs) - the relative advantage may change
Expected results (based on literature):
- Cosine annealing is more robust to total budget (works well without knowing optimal milestones)
- Step decay can match or beat cosine when milestones are tuned, but the tuning cost is higher
- For short budgets, cosine's smooth decay is generally better
- For very long training, warm restarts can help cosine

Key insight: The claim "always better" is too strong. Cosine annealing is easier to tune and more robust, but well-tuned step decay can match it.
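To make the comparison concrete, here is a minimal sketch of the two schedules as closed-form functions, plus a small helper that enumerates the search grid for one training budget. Everything here is illustrative: the function names, the candidate LRs, gammas, milestone fractions, and `eta_min` values are assumptions for the example, not recommended settings.

```python
import math
from itertools import product

def step_decay_lr(epoch, base_lr, milestones, gamma=0.1):
    """Multiply the LR by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

def cosine_lr(epoch, base_lr, total_epochs, eta_min=0.0):
    """Cosine annealing from base_lr at epoch 0 to eta_min at total_epochs."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term

def build_grid(total_epochs, seeds=(0, 1, 2)):
    """Enumerate hypothetical run configs for both schedules at one budget."""
    lrs = (0.05, 0.1, 0.2)
    milestone_fracs = ((0.5, 0.75), (0.3, 0.6, 0.9))
    step_runs = [
        {"schedule": "step", "lr": lr, "gamma": gamma, "seed": seed,
         "milestones": [round(f * total_epochs) for f in fracs]}
        for lr, gamma, fracs, seed in product(lrs, (0.1, 0.2), milestone_fracs, seeds)
    ]
    cosine_runs = [
        {"schedule": "cosine", "lr": lr, "eta_min": eta_min, "seed": seed}
        for lr, eta_min, seed in product(lrs, (0.0, 1e-5), seeds)
    ]
    return step_runs + cosine_runs

grid = build_grid(total_epochs=100)
```

Note the asymmetry the grid exposes: step decay has three tuned dimensions against two for cosine, so it needs roughly twice as many runs here. Reporting that tuning cost alongside accuracy is part of a fair comparison.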

Scoring Rubric:

Strong Hire: Designs a rigorous experiment with proper controls, multiple seeds, hyperparameter search for both, discusses when each schedule wins
Lean Hire: Reasonable experimental design but misses some controls (e.g., same total epochs, HP tuning for both)
No Hire: Just says "train with both and see which is better" without experimental rigor

Problem 5: Adam's Bias Correction

Derive Adam's bias correction. Why is m_hat_t = m_t / (1 - beta1^t) the correct unbiased estimator?

Hint 1 - Direction

Start by expanding the recurrence relation for m_t. What is m_t in terms of g_1, g_2, ..., g_t? Take the expectation.

Hint 2 - Insight

m_t = (1 - beta1) * sum_{i=1}^{t} beta1^{t-i} * g_i. The expectation is E[m_t] = (1 - beta1) * sum_{i=1}^{t} beta1^{t-i} * E[g_i]. If we assume E[g_i] = E[g] (stationary), this simplifies using the geometric series formula.

Hint 3 - Full Solution + Rubric

Derivation:

Initialize m_0 = 0. The recurrence is:

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t

Expanding:

m_1 = (1 - beta1) * g_1
m_2 = beta1 * (1 - beta1) * g_1 + (1 - beta1) * g_2
m_t = (1 - beta1) * sum_{i=1}^{t} beta1^{t-i} * g_i

Taking expectations (assuming E[g_i] = mu for all i):

E[m_t] = (1 - beta1) * mu * sum_{i=1}^{t} beta1^{t-i}
        = (1 - beta1) * mu * sum_{j=0}^{t-1} beta1^j
        = (1 - beta1) * mu * (1 - beta1^t) / (1 - beta1)
        = mu * (1 - beta1^t)

So E[m_t] = mu * (1 - beta1^t), which is biased toward zero (since (1 - beta1^t) < 1).

Dividing by (1 - beta1^t):

E[m_hat_t] = E[m_t / (1 - beta1^t)] = mu

This is now an unbiased estimator of the true gradient mean.

Why it matters: At t=1 with beta1=0.9: m_1 = 0.1 * g_1. Without correction, the first-moment estimate is 10x smaller than the gradient it is meant to track. With correction: m_hat_1 = 0.1 * g_1 / 0.1 = g_1. The correction matters most early in training and becomes negligible as t grows large (beta1^t approaches 0, so the divisor approaches 1).
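The result can be checked numerically. The sketch below feeds a constant gradient (so E[g_i] = mu holds trivially) through the recurrence and compares the raw and corrected estimates; the helper name `ema_first_moment` and the values mu = 2.0 and 50 steps are assumptions made for the illustration.

```python
def ema_first_moment(grads, beta1=0.9):
    """Run Adam's first-moment recurrence from m_0 = 0.

    Returns the raw estimates m_t and the bias-corrected
    estimates m_t / (1 - beta1^t) for t = 1..len(grads).
    """
    m = 0.0
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        raw.append(m)
        corrected.append(m / (1.0 - beta1 ** t))
    return raw, corrected

mu = 2.0      # constant "true" gradient
beta1 = 0.9
raw, corrected = ema_first_moment([mu] * 50, beta1=beta1)

for t in (1, 2, 10, 50):
    predicted = mu * (1.0 - beta1 ** t)  # closed form from the derivation
    print(f"t={t:2d}  m_t={raw[t - 1]:.4f}  "
          f"mu*(1-beta1^t)={predicted:.4f}  m_hat_t={corrected[t - 1]:.4f}")
```

The raw estimate follows mu * (1 - beta1^t) step for step, while the corrected estimate equals mu from the very first step, matching the derivation.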

Scoring Rubric:

Strong Hire: Clean derivation, geometric series, explains practical impact, mentions the correction becomes negligible for large t
Lean Hire: Gets the right formula but derivation has gaps, or explains intuitively without formal proof
No Hire: Cannot derive the formula or doesn't understand why bias correction exists

Interview Cheat Sheet

Topic	Key Fact	When to Mention
SGD	O(1/sqrt(T)) convergence; noise helps generalization	"Why not just use full-batch GD?"
Momentum	beta=0.9 standard; cancels oscillation, accumulates consistent direction	"How does momentum help?"
Adam	First + second moment; bias correction; default: lr=1e-3, beta1=0.9, beta2=0.999	Any optimizer question
AdamW	Decoupled weight decay; Adam+L2 is wrong because adaptive scaling affects regularization	"Adam vs AdamW?"
Warmup	Prevents divergence when optimizer moments are uninitialized; 1-5% of total steps	Any transformer training question
Cosine decay	Smooth; no milestone tuning; standard for LLMs with warmup	"What LR schedule?"
Step decay	Multiply by 0.1 at milestones; standard for ResNet	"How to train a CNN?"
Batch size	Larger batch = scale LR proportionally; warmup needed; may hurt generalization	"How to speed up training?"
Gradient accumulation	Simulates large batch; divide loss by accumulation steps; BatchNorm statistics still come from each micro-batch	"What if batch doesn't fit in memory?"
Gradient clipping	Clip by norm preserves direction; clip by value may change direction	"How to handle gradient explosion?"
Saddle points	More problematic than local minima in high dimensions; adaptive methods help	"What makes optimization hard?"
Convergence	Non-convex: only guarantee stationary point; convex: global optimum	"What guarantees does SGD have?"
LARS/LAMB	Layer-wise adaptive LR for large batch training; LAMB = LARS + Adam	"How to train with 64K batch size?"
Mixed precision	FP16 forward/backward, FP32 weights; loss scaling; increase Adam epsilon	"How to speed up training?"
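The gradient clipping row is easy to verify by hand. A minimal sketch, assuming a toy two-dimensional gradient and hypothetical helper names, shows that clipping by norm rescales the vector without rotating it, while clipping by value can change its direction:

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale the whole vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def clip_by_value(grad, limit):
    """Clamp each component independently to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grad]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

grad = [10.0, 1.0]                    # toy gradient dominated by one component
by_norm = clip_by_norm(grad, 1.0)     # same direction, norm 1
by_value = clip_by_value(grad, 1.0)   # becomes [1.0, 1.0], a different direction

print(cosine_similarity(grad, by_norm))
print(cosine_similarity(grad, by_value))
```

The second similarity falls below 1 because value clipping shrinks only the large component, so the clipped vector points somewhere the original gradient did not.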

Spaced Repetition Checkpoints

Day 0 - Immediate Recall

Write the SGD, SGD+momentum, and Adam update rules from memory
Explain why Adam needs bias correction in one sentence
Draw the optimizer selection flowchart from memory
State the linear scaling rule for batch size

Day 3 - Active Recall

Explain the difference between Adam and AdamW without looking at notes
Describe three learning rate schedules and when to use each
Explain why saddle points are more problematic than local minima in high dimensions
What happens if you use FP16 training with Adam epsilon=1e-8?

Day 7 - Application

Given a training loss curve (oscillating and not converging), diagnose three possible causes and fixes
Design the optimization setup for fine-tuning BERT (optimizer, LR, schedule, weight decay)
Explain gradient accumulation to a junior engineer, including the loss scaling detail

Day 14 - Synthesis

Compare SGD+momentum vs. Adam for image classification vs. NLP - when does each win and why?
Design an experiment to find the optimal batch size for a new task
Derive Adam's bias correction from scratch

Day 21 - Interview Simulation

Given: "Our model trains fine for 10 epochs then loss goes to NaN." Walk through a systematic diagnosis.
Given: "We want to train 10x faster." Propose a complete optimization strategy (batch size, optimizer, schedule, hardware).
Explain to a non-ML interviewer why training neural networks is hard (in terms of optimization).
