Optimization Algorithms - The Engine Behind Every Model

Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Eng, Data Scientist

The Real Interview Moment

You're twenty minutes into a technical screen at a top AI lab. The interviewer pulls up a training loss curve on the whiteboard - it oscillates wildly for 10 epochs, then plateaus at a suspiciously high value. She asks: "This is our production model. We're using Adam with default hyperparameters. The team tried lowering the learning rate, but it just trains slower and plateaus at the same loss. What would you investigate?"

This is where most candidates freeze. They know what Adam is. They can recite the update rule. But they cannot reason about optimizer behavior in practice. The strong candidate starts asking questions: What's the batch size? Is the loss landscape convex or riddled with saddle points? Have you tried learning rate warmup? Could gradient accumulation help? Is AdamW more appropriate given the weight decay interaction?

The ability to diagnose optimization problems - not just define optimizers - is what separates a hire from a no-hire. This page gives you the mental models, the intuition, and the interview-ready explanations to handle exactly this kind of question.

Optimization is the beating heart of machine learning. Every model, from linear regression to GPT-4, is ultimately the result of an optimization algorithm searching a loss landscape for good parameters. Understanding how that search works, why it sometimes fails, and how to fix it is non-negotiable knowledge for any ML role.

What You Will Master

After reading this page, you will be able to:

  • Explain SGD, momentum, RMSProp, Adam, and AdamW from first principles with mathematical intuition
  • Draw the optimizer selection flowchart used by experienced practitioners
  • Diagnose common training failures: oscillation, divergence, plateaus, and saddle points
  • Design learning rate schedules (warmup, cosine decay, step decay, cyclical) for different training scenarios
  • Articulate why Adam sometimes fails and when SGD+momentum is superior
  • Reason about batch size effects on optimization dynamics and generalization
  • Implement gradient accumulation to simulate large batch training on limited hardware
  • Explain convergence theory at the level expected in an MLE interview (convexity, Lipschitz smoothness, convergence rates)
  • Navigate company-specific optimization questions (Google prefers SGD for vision, OpenAI uses AdamW for LLMs)
  • Handle whiteboard derivations of Adam's update rule and moment estimation bias correction

Self-Assessment: Where Are You Now?

| Skill Area | 1 (Never heard of it) | 3 (Can explain basics) | 5 (Can derive + debug) | Your Rating |
| --- | --- | --- | --- | --- |
| SGD + momentum | What's momentum? | Know the formula | Can derive convergence guarantees | ___ |
| Adam / AdamW | Just use defaults | Know the update rule | Can explain bias correction + weight decay fix | ___ |
| Learning rate schedules | Always use constant LR | Know warmup exists | Design schedules for specific problems | ___ |
| Loss landscapes | Never visualized one | Know about local minima | Understand saddle points, flat minima, sharpness | ___ |
| Batch size effects | "Bigger is better" | Know about gradient noise | Can explain generalization-batch size relationship | ___ |
| Convergence theory | No idea | Know convex = easy | Can state convergence rates for convex/non-convex | ___ |
| Practical debugging | Never debugged training | Tried adjusting LR | Systematic optimizer diagnosis process | ___ |

Score interpretation:

  • 7-14: Start here and work through every section carefully. Focus on intuition first, math second.
  • 15-25: Good foundation. Focus on the "why" sections and practice problems.
  • 26-35: You're close to interview-ready. Drill the practice problems and edge cases.

Part 1 - The Optimization Landscape

What Are We Actually Doing?

Every ML model defines a loss function L(theta) that measures how bad the model's predictions are. Training means finding parameters theta* that minimize this loss:

theta* = argmin_theta L(theta)

The challenge: for a neural network with millions of parameters, we cannot solve this analytically. Instead, we use iterative optimization - we start somewhere and take steps that (hopefully) decrease the loss.

60-Second Answer

"Optimization in ML is about finding model parameters that minimize a loss function. We can't solve this analytically for neural networks, so we use iterative methods. The simplest is gradient descent: compute the gradient of the loss with respect to parameters, then step in the opposite direction. The key challenges are: the landscape is non-convex (many local minima and saddle points), we use noisy gradient estimates from mini-batches, and the loss surface has very different curvature in different directions. Modern optimizers like Adam address these by adapting the learning rate per-parameter based on gradient history."

The Loss Landscape: Why Optimization Is Hard

[Figure: Loss Landscape Features]

Interviewer's Perspective

A common interview question is: "What's more problematic in deep learning - local minima or saddle points?" The correct answer is saddle points. In high-dimensional spaces, true local minima are rare. Most critical points are saddle points - they're minima in some directions and maxima in others. A 2014 paper by Dauphin et al. showed that for random high-dimensional functions, the probability of a critical point being a true local minimum decreases exponentially with dimensionality. Strong candidates know this and can explain why adaptive methods handle saddle points better than vanilla SGD.

Convex vs. Non-Convex Optimization

| Property | Convex | Non-Convex (Neural Networks) |
| --- | --- | --- |
| Local minima | Every local min = global min | Many local minima, saddle points |
| Convergence guarantee | Yes, to global optimum | Only to stationary point |
| SGD convergence rate | O(1/sqrt(T)) for smooth convex; O(1/T) if strongly convex | O(1/sqrt(T)) to a stationary point for smooth non-convex |
| Practical examples | Linear regression, logistic regression, SVMs | Neural networks, deep learning |
| Sensitivity to initialization | Low - any start works | High - initialization matters enormously |
| Learning rate sensitivity | Wide range works | Narrow range; too high = divergence |

Common Trap

Don't claim that neural networks get stuck in "bad local minima." Modern research (Li et al., 2018) shows that most local minima in over-parameterized networks have loss values very close to the global minimum. The real problem is saddle points and flat regions that slow down optimization, not local minima that trap the optimizer at high loss values.

Part 2 - The Optimizer Zoo

Stochastic Gradient Descent (SGD)

The foundation of everything. Instead of computing the gradient over the entire dataset (expensive), we estimate it from a mini-batch:

theta_{t+1} = theta_t - eta * g_t

where g_t = (1/B) * sum_i grad L(x_i, theta_t) is the mini-batch gradient estimate and eta is the learning rate.
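As a rough illustration of the update rule above, the sketch below runs mini-batch SGD on a one-parameter least-squares problem. The data, batch size, learning rate, and the function name `sgd_fit_slope` are made-up values chosen for illustration, not a recommended setup.

```python
import random

def sgd_fit_slope(xs, ys, lr=0.05, batch_size=4, steps=200, seed=0):
    """Fit y ~ w * x with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        batch = rng.sample(range(n), batch_size)  # random mini-batch of indices
        # g_t = (1/B) * sum_i d/dw (w*x_i - y_i)^2
        grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= lr * grad  # theta_{t+1} = theta_t - eta * g_t
    return w

xs = [0.1 * i for i in range(1, 21)]  # inputs 0.1 .. 2.0
ys = [3.0 * x for x in xs]            # noiseless targets, true slope 3
w_hat = sgd_fit_slope(xs, ys)
```

Each step sees only 4 of the 20 points, yet the estimate still approaches the true slope because every mini-batch gradient points toward it on average.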

Why stochastic? The mini-batch gradient is a noisy estimate of the true gradient. This noise is actually beneficial - it helps escape saddle points and sharp minima, acting as implicit regularization.

60-Second Answer

"SGD computes the gradient on a random mini-batch instead of the full dataset. This makes each step noisy but much cheaper. The noise is actually beneficial - it helps escape sharp minima and saddle points, and there's evidence it leads to flatter minima that generalize better. The main downside is that vanilla SGD can oscillate in ravines where the loss surface has very different curvature in different directions. Momentum fixes this by accumulating a running average of past gradients."

SGD with Momentum

Momentum adds inertia - the update accumulates past gradients like a ball rolling downhill:

v_t = beta * v_{t-1} + g_t
theta_{t+1} = theta_t - eta * v_t

Typical beta = 0.9 means 90% of the velocity is carried over from the previous step.

Why it works: In a ravine (steep walls, shallow floor), the gradient oscillates between walls. Momentum cancels these oscillations (they alternate sign) and accumulates the consistent downhill component.
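The ravine argument can be checked numerically. The sketch below compares plain gradient descent (beta = 0) with momentum (beta = 0.9) on an assumed toy quadratic whose curvature differs by 100x between the two axes; the function, starting point, and step count are illustrative choices.

```python
def loss(x, y):
    # Shallow along x (curvature 1), steep along y (curvature 100).
    return 0.5 * (x * x + 100.0 * y * y)

def run(beta, lr=0.015, steps=100):
    """Apply v_t = beta * v_{t-1} + g_t, theta_{t+1} = theta_t - lr * v_t."""
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = x, 100.0 * y      # gradient of the quadratic
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return loss(x, y)

loss_plain = run(beta=0.0)      # no momentum
loss_momentum = run(beta=0.9)   # same learning rate, with momentum
```

With the same learning rate, the momentum run ends at a lower loss because it keeps accumulating the small, consistent gradient along the shallow axis.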

Nesterov momentum is a slight improvement - it computes the gradient at the "look-ahead" position:

v_t = beta * v_{t-1} + grad L(theta_t - eta * beta * v_{t-1})
theta_{t+1} = theta_t - eta * v_t

This gives a corrective term that reduces overshooting.

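| Variant | Convergence (Convex, Smooth, Full-Batch Gradient) | Key Advantage |
| --- | --- | --- |
| Vanilla SGD | O(1/T) | Simplicity, strong generalization |
| SGD + Momentum | O(1/T), better constant | Accelerates through ravines |
| SGD + Nesterov | O(1/T^2) | Look-ahead correction |

These rates assume exact (full-batch) gradients. With noisy mini-batch gradients, all three variants are limited to O(1/sqrt(T)) on convex problems.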

RMSProp

RMSProp adapts the learning rate per-parameter based on the magnitude of recent gradients:

s_t = beta * s_{t-1} + (1 - beta) * g_t^2
theta_{t+1} = theta_t - eta * g_t / (sqrt(s_t) + epsilon)

Parameters with large recent gradients get a smaller effective learning rate; parameters with small gradients get a larger one. This is crucial for parameters that receive very sparse gradient signals (e.g., embeddings for rare words).
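A small numerical sketch of this per-parameter scaling, assuming two parameters whose gradients stay constant at very different magnitudes (100 and 0.01). The helper name `rmsprop_last_steps` and all values are illustrative.

```python
import math

def rmsprop_last_steps(grads, lr=0.01, beta=0.9, eps=1e-8, steps=50):
    """Return the size of the final RMSProp step for each constant gradient."""
    s = [0.0] * len(grads)
    last = [0.0] * len(grads)
    for _ in range(steps):
        for i, g in enumerate(grads):
            s[i] = beta * s[i] + (1 - beta) * g * g       # running mean of g^2
            last[i] = lr * g / (math.sqrt(s[i]) + eps)    # scaled step
    return last

steps_taken = rmsprop_last_steps([100.0, 0.01])
```

Although the raw gradients differ by four orders of magnitude, both parameters end up taking steps of roughly `lr`, which is the equalizing effect described above.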

Historical note: RMSProp was proposed by Geoff Hinton in a Coursera lecture and never formally published. It was motivated as a fix for AdaGrad's aggressively decaying learning rate.

Adam - The Default Choice

Adam combines momentum (first moment) with RMSProp (second moment):

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t # First moment (momentum)
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 # Second moment (RMSProp)
m_hat_t = m_t / (1 - beta1^t) # Bias correction
v_hat_t = v_t / (1 - beta2^t) # Bias correction
theta_{t+1} = theta_t - eta * m_hat_t / (sqrt(v_hat_t) + epsilon)

Default hyperparameters: beta1=0.9, beta2=0.999, epsilon=1e-8, eta=1e-3

60-Second Answer

"Adam combines two ideas: momentum (tracking the exponential moving average of gradients, like SGD+momentum) and adaptive learning rates (tracking the exponential moving average of squared gradients, like RMSProp). The first moment gives you the direction, the second moment gives you per-parameter scaling. The bias correction terms fix the initialization problem - both moments are initialized to zero, so without correction the early estimates would be biased toward zero. Default hyperparameters (beta1=0.9, beta2=0.999, lr=1e-3) work surprisingly well for most problems, which is why Adam is the go-to optimizer."

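Why bias correction matters: At step t=1, m_1 = 0.1 * g_1 (biased toward 0). The correction m_hat_1 = m_1 / (1 - 0.9^1) = m_1 / 0.1 = g_1 recovers the true gradient estimate. The second moment is biased even more strongly: v_1 = 0.001 * g_1^2, and v_hat_1 = v_1 / (1 - 0.999^1) = g_1^2. Without either correction, the ratio m_1 / sqrt(v_1) would be about 3.2x larger than intended (0.1 / sqrt(0.001)), so the earliest steps would be miscalibrated exactly when the estimates are least reliable.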
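The first-step arithmetic can be reproduced directly. This is a minimal sketch of a single Adam step for one scalar parameter, using the default hyperparameters quoted above; the function name and the gradient value 0.5 are arbitrary.

```python
import math

def adam_first_step(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Compute bias-corrected moments and the update size at t = 1."""
    m = (1 - beta1) * g          # m_1, starting from m_0 = 0
    v = (1 - beta2) * g * g      # v_1, starting from v_0 = 0
    m_hat = m / (1 - beta1 ** 1) # bias correction recovers g
    v_hat = v / (1 - beta2 ** 1) # bias correction recovers g^2
    step = lr * m_hat / (math.sqrt(v_hat) + eps)
    return m_hat, v_hat, step

m_hat, v_hat, step = adam_first_step(g=0.5)
```

After correction, `m_hat` equals the gradient and `v_hat` equals its square, so the first update has magnitude close to `lr` regardless of how large the gradient is.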

Common Trap

Many candidates dismiss Adam's epsilon as just a "small constant to prevent division by zero." That is what it does, but its value matters. If the optimizer arithmetic runs in FP16, the default epsilon=1e-8 is below the smallest representable FP16 value (about 6e-8) and rounds to zero, so a zero second moment produces a division by zero and NaN updates. In that setting you need a larger epsilon such as 1e-4 or 1e-5, or you keep the optimizer state in FP32. This is a real production issue that strong candidates know about.

AdamW - The Fixed Adam

The common way of adding weight decay to Adam, as an L2 penalty folded into the gradient, is wrong. In Adam, L2 regularization and weight decay are not equivalent (unlike in plain SGD, where they are). AdamW fixes this:

Adam with L2 regularization (wrong):

g_t = grad L(theta_t) + lambda * theta_t # L2 added to gradient
# Then Adam update as usual - but the adaptive scaling affects the regularization term!

AdamW (correct):

# Compute m_hat_t and v_hat_t as usual from the unregularized gradient g_t = grad L(theta_t)
theta_{t+1} = theta_t - eta * (m_hat_t / (sqrt(v_hat_t) + epsilon) + lambda * theta_t) # Decoupled weight decay

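The difference: in Adam+L2, the decay term lambda * theta_t passes through the adaptive scaling along with the gradient. Parameters with a large gradient history (large v_hat_t) therefore receive weaker regularization, and parameters with small gradients receive stronger regularization, so the effective decay varies per parameter in a way you did not choose. AdamW applies weight decay directly to the weights, independent of the adaptive scaling.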
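To make the contrast concrete, the sketch below looks only at the first step (t = 1), where the bias-corrected moments reduce to m_hat = g and v_hat = g^2, and asks how much of the update is attributable to the decay term under each scheme. The function name and the values theta = 1.0 and lambda = 0.1 are illustrative assumptions.

```python
import math

def first_step_decay(theta, g, lam=0.1, lr=1e-3, eps=1e-8):
    """Return (decay part under Adam+L2, decay part under AdamW) at t = 1."""
    # Adam + L2: decay is folded into the gradient and then divided by sqrt(v_hat).
    g_l2 = g + lam * theta
    l2_decay_part = lr * (lam * theta) / (math.sqrt(g_l2 * g_l2) + eps)
    # AdamW: decay is applied to the weight directly, outside the adaptive scaling.
    adamw_decay_part = lr * lam * theta
    return l2_decay_part, adamw_decay_part

large_grad = first_step_decay(theta=1.0, g=10.0)
small_grad = first_step_decay(theta=1.0, g=0.001)
```

Under Adam+L2 the same weight is decayed far less when its gradient is large than when it is small, while the AdamW decay is identical in both cases.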

Interviewer's Perspective

If a candidate can explain the AdamW vs. Adam distinction clearly, it signals deep understanding. The follow-up question I like to ask is: "So when does the Adam vs. AdamW difference actually matter?" Answer: It matters most when weight decay is large (common in LLM training with lambda=0.1) and when the adaptive scaling varies significantly across parameters (which it does in transformers where attention weights and embedding layers have very different gradient magnitudes).

Company Variation
  • Google/DeepMind (vision): Often prefers SGD+momentum for image classification (ResNets, EfficientNets). Their internal benchmarks show better final accuracy.
  • OpenAI/Anthropic (LLMs): AdamW is standard for transformer training. GPT-3 used AdamW with beta1=0.9, beta2=0.95.
  • Meta (recommendations): Mixed - Adam for embedding-heavy models, SGD for dense models.
  • Startups: Almost always Adam/AdamW - faster convergence matters more than squeezing out 0.1% accuracy.

The Optimizer Selection Flowchart

[Figure: Optimizer Selection Flowchart]

When Adam Fails - And You Need SGD

Adam does not always win. Key failure modes:

  1. Generalization gap: Multiple studies (Wilson et al., 2017) showed that SGD+momentum generalizes better than Adam on image classification. The adaptive learning rate can converge to sharp minima that overfit.

  2. Non-convergence in theory: Reddi et al. (2018) showed that Adam can fail to converge even on simple convex problems. The fix (AMSGrad) uses the maximum of past second moments instead of the exponential average, but in practice the difference is small.

  3. Interaction with weight decay: Before AdamW, using L2 regularization with Adam gave incorrect regularization behavior (see AdamW section above).

  4. Large batch training: With very large batches, the gradient noise that Adam relies on for exploration diminishes. LARS/LAMB optimizers (layer-wise adaptive learning rates) work better for large-batch training.

| Scenario | Best Optimizer | Why |
| --- | --- | --- |
| ResNet on ImageNet (best accuracy) | SGD + momentum | Better generalization, well-studied schedules |
| BERT/GPT pre-training | AdamW | Adaptive LR essential for transformers |
| Fine-tuning a pre-trained model | AdamW with small LR | Stable, preserves pre-trained features |
| GAN training | Adam (tuned betas) | Stability in adversarial dynamics |
| Reinforcement learning | Adam | Handles non-stationary objectives |
| Recommendation with embeddings | Adam/AdamW | Sparse gradients need adaptive LR |
| Training with batch size > 8K | LARS/LAMB | Layer-wise scaling for large batches |
| Limited compute budget | Adam | Converges faster in wall-clock time |

Part 3 - Learning Rate Schedules

The learning rate is the single most important hyperparameter. A learning rate schedule changes the learning rate during training to get the best of both worlds: fast early progress and fine-grained late convergence.

Why Schedules Matter

[Figure: Learning Rate Schedule Phases]

Common Learning Rate Schedules

1. Step Decay: Multiply LR by a factor (e.g., 0.1) at fixed epochs.

LR = initial_lr * gamma^(floor(epoch / step_size))

Example: Start at 0.1, multiply by 0.1 at epochs 30, 60, 90.

  • Pros: Simple, well-studied (standard for ResNet training)
  • Cons: Discrete jumps can cause instability; requires manual tuning of milestones
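The step decay formula translates directly into a one-line function. The sketch below uses the example values from the text (start at 0.1, multiply by 0.1 every 30 epochs); the function name is an illustrative choice.

```python
def step_decay_lr(epoch, initial_lr=0.1, gamma=0.1, step_size=30):
    """LR = initial_lr * gamma^(floor(epoch / step_size))."""
    return initial_lr * gamma ** (epoch // step_size)

# Learning rate at the start of each decay stage.
lrs_at_milestones = [step_decay_lr(e) for e in (0, 30, 60, 90)]
```

Within each 30-epoch stage the learning rate is constant, which is where the discrete jumps mentioned above come from.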

2. Cosine Annealing: Smoothly decreases LR following a cosine curve:

LR = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))

  • Pros: Smooth, few hyperparameters (eta_max, eta_min, and the total length T); strong empirical results
  • Cons: Fixed schedule length; extending training requires warm restarts

3. Warmup + Cosine Decay (the modern standard): Start with a linearly increasing LR (warmup), then cosine decay:

if t < warmup_steps:
    LR = initial_lr * (t / warmup_steps)
else:
    LR = eta_min + 0.5 * (initial_lr - eta_min) * (1 + cos(pi * (t - warmup_steps) / (T - warmup_steps)))

60-Second Answer

"Learning rate warmup solves a cold-start problem. At the beginning of training, the model's parameters are random, so the gradients are large and unreliable. Adam's second moment estimates are also inaccurate (initialized to zero, not yet warmed up). A high learning rate + noisy gradients + inaccurate adaptive scaling = potential divergence. Warmup starts with a tiny learning rate and ramps up linearly, giving the optimizer time to calibrate its moment estimates. This is especially critical for transformers where the attention mechanism can produce very large gradients early in training. For most transformer training, 1-5% of total steps as warmup works well."

4. Cyclical Learning Rates / Warm Restarts: Periodically reset the LR to a high value:

  • SGDR (warm restarts): Cosine annealing with periodic resets
  • 1cycle policy: One cycle of increasing then decreasing LR (Smith, 2018)

Warm restarts help escape local minima by periodically increasing exploration. The 1cycle policy is particularly effective for fast training - it enables super-convergence where the model trains in far fewer epochs.

5. Linear Decay: Decrease LR linearly to zero:

LR = initial_lr * (1 - t / T)

  • Simple and effective for fine-tuning pre-trained models
  • Used in many NLP fine-tuning recipes (BERT, GPT)
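The warmup + cosine schedule (item 3) and the linear decay schedule (item 5) can be sketched as plain functions of the step index. The function names, the 1000-step horizon, and the peak and floor values are assumptions made for illustration.

```python
import math

def warmup_cosine_lr(t, total_steps, warmup_steps, peak_lr, eta_min=0.0):
    """Linear warmup to peak_lr, then cosine decay to eta_min."""
    if t < warmup_steps:
        return peak_lr * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (peak_lr - eta_min) * (1 + math.cos(math.pi * progress))

def linear_decay_lr(t, total_steps, initial_lr):
    """LR = initial_lr * (1 - t / T)."""
    return initial_lr * (1 - t / total_steps)

# 5% warmup, decaying to 10% of the peak learning rate.
schedule = [warmup_cosine_lr(t, 1000, 50, 3e-4, 3e-5) for t in range(1001)]
```

The schedule starts at zero, peaks at the end of warmup, and ends at `eta_min`. In a training loop the returned value would be written into the optimizer's learning rate before each step.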

Schedule Comparison Table

| Schedule | Best For | Key Hyperparameters | Common Pitfall |
| --- | --- | --- | --- |
| Step decay | CNN training (ResNet) | Milestones, gamma | Wrong milestone selection |
| Cosine annealing | General-purpose | T (total steps) | Fixed schedule length |
| Warmup + cosine | Transformer training | Warmup steps, peak LR | Too short warmup |
| 1cycle | Fast training, limited budget | Max LR, div factor | Max LR too aggressive |
| Linear decay | Fine-tuning | Initial LR | Starting LR too high for fine-tuning |
| Exponential decay | Legacy, rarely used now | Decay rate | Over-decays, training stalls |

Company Variation
  • Google (BERT): Linear warmup (10K steps) + linear decay. Simple and effective.
  • OpenAI (GPT-3): Linear warmup (375M tokens) + cosine decay to 10% of peak LR.
  • Meta (LLaMA): Cosine decay with warmup (2000 steps) to 10% of peak.
  • Kaiming He's group (vision): Step decay (divide by 10 at specific epochs) for ResNets.

Part 4 - Batch Size, Gradient Accumulation, and Practical Training

Batch Size Effects

Batch size affects optimization in multiple ways:

| Batch Size | Gradient Noise | Convergence | Generalization | Hardware |
| --- | --- | --- | --- | --- |
| Small (16-64) | High noise | Slow in wall-clock, more updates per epoch | Often better (noise acts as regularizer) | Fits on one GPU |
| Medium (128-512) | Moderate | Good balance | Good | One or few GPUs |
| Large (1K-64K) | Low noise | Fast in wall-clock, fewer updates | Can degrade without tricks | Multi-GPU / TPU |

Interviewer's Perspective

The batch size question separates textbook knowledge from practical experience. A strong candidate knows that doubling the batch size should be paired with doubling the learning rate (linear scaling rule) - but also knows this breaks down for very large batches. They mention gradient accumulation as a practical solution for simulating large batches on small GPUs. And they know that large-batch training can hurt generalization unless you use warmup, LARS/LAMB, or label smoothing to compensate.

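The Linear Scaling Rule: When you increase batch size by factor k, increase LR by factor k. The intuition is that one step on a k-times larger batch should cover roughly the same ground as k steps on the original batch, so progress per epoch stays comparable. However, this rule fails for very large batches: once the gradient estimate is already accurate, further increases in LR run into the curvature limit of the loss surface and training becomes unstable.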

The generalization gap: Larger batches tend to converge to sharper minima that generalize worse. Keskar et al. (2017) showed this empirically. Mitigations:

  • Warmup (let the optimizer adapt before taking large steps)
  • LARS/LAMB (layer-wise adaptive learning rates)
  • Longer training (more epochs compensate for fewer updates per epoch)
  • Label smoothing
  • Gradient noise injection

Gradient Accumulation

When your ideal batch size doesn't fit in GPU memory, accumulate gradients over multiple forward/backward passes:

# Assumes model, criterion, optimizer, dataloader, and accumulation_steps are defined elsewhere
optimizer.zero_grad()
for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)                                 # Forward pass (micro-batch)
    loss = criterion(outputs, labels) / accumulation_steps  # Scale loss
    loss.backward()                                         # Accumulate gradients

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                                    # Update weights
        optimizer.zero_grad()                               # Reset gradients

Key detail: You must divide the loss by accumulation_steps to keep the gradient magnitude equivalent to a single large batch.
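The equivalence behind this scaling can be verified without any framework. The sketch below uses an assumed one-parameter least-squares model with made-up data, and compares the full-batch gradient with the sum of micro-batch gradients that have each been divided by `accumulation_steps`.

```python
def mean_sq_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 3.9, 6.1, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5
accumulation_steps = 4
micro = len(xs) // accumulation_steps  # micro-batch size of 2

full_batch_grad = mean_sq_grad(w, xs, ys)

accumulated = 0.0
for k in range(accumulation_steps):
    lo, hi = k * micro, (k + 1) * micro
    # Each micro-batch loss is a mean, so dividing by accumulation_steps
    # turns the sum of micro-batch gradients into the full-batch mean.
    accumulated += mean_sq_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps
```

The match is exact only when all micro-batches have the same size; a smaller final micro-batch would be slightly over-weighted.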

Common Trap

Gradient accumulation is not mathematically identical to large batch training when using batch normalization. BN statistics are computed per micro-batch, not over the accumulated batch. Synchronized batch norm does not fix this, because it pools statistics across devices within one forward pass, not across accumulation steps. For models with BN layers, keep each micro-batch large enough for stable statistics or switch to layer norm / group norm. This is one reason layer norm suits transformer training: it does not depend on batch statistics, so it is unaffected by gradient accumulation and distributed training.

Mixed-Precision Training

Modern training uses FP16/BF16 for forward and backward passes, FP32 for weight updates:

Forward pass: FP16 (fast, less memory)
Backward pass: FP16 (fast, less memory)
Master weights: FP32 (precise updates)
Loss scaling: Multiply loss by a scale factor to prevent FP16 gradient underflow

Why this matters for optimization:

  • FP16 has limited dynamic range - gradients smaller than ~6e-8 underflow to zero
  • Loss scaling multiplies the loss before backward pass (amplifies gradients) and divides the gradient before the optimizer step
  • BF16 (Brain Float) has the same exponent range as FP32 but less precision - better for training, no loss scaling needed
  • Adam's epsilon must be increased (1e-4 to 1e-5) for FP16 to avoid division by near-zero
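The underflow and the effect of loss scaling can be demonstrated with the standard library alone, since `struct` supports IEEE half precision through the `"e"` format code. The gradient value 1e-8 and the scale factor 1024 are arbitrary illustrative choices.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]

tiny_grad = 1e-8
loss_scale = 1024.0

unscaled = to_fp16(tiny_grad)             # below the FP16 range: becomes 0.0
scaled = to_fp16(tiny_grad * loss_scale)  # representable after scaling
recovered = scaled / loss_scale           # unscale in higher precision
```

The same rounding explains the epsilon issue: `to_fp16(1e-8)` is zero, so an epsilon of 1e-8 offers no protection if it is stored or applied in FP16.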

Part 5 - Convergence Theory (Interview Level)

You don't need to prove theorems, but you need to know the key results and be able to state them.

Key Convergence Results

| Setting | Algorithm | Rate | What It Means |
| --- | --- | --- | --- |
| Convex, smooth, full GD | GD | O(1/T) | After T steps, error < C/T |
| Convex, smooth, SGD | SGD | O(1/sqrt(T)) | Slower due to gradient noise |
| Strongly convex, smooth, SGD | SGD (decaying LR) | O(1/T) | Strong convexity recovers the rate |
| Non-convex, smooth, SGD | SGD | O(1/sqrt(T)) to stationary point | Only guarantee: small gradient norm |
| Convex, Nesterov accelerated | Nesterov GD | O(1/T^2) | Optimal for first-order methods |

60-Second Answer

"For convex problems, SGD converges at O(1/sqrt(T)) - meaning you need 4x the steps to halve the error. For strongly convex problems (like L2-regularized objectives), you get the faster O(1/T) rate with decaying learning rate. For non-convex problems like neural networks, we can only guarantee convergence to a stationary point (gradient near zero) at O(1/sqrt(T)). In practice, this means non-convex optimization is fundamentally harder - we have no guarantee of finding the global minimum, but empirically the local minima of over-parameterized networks are nearly as good as the global minimum."

The Learning Rate-Convergence Tradeoff

For gradient descent on an L-smooth function (one whose gradient is L-Lipschitz):

  • Learning rate must satisfy eta < 2/L for convergence (where L is the Lipschitz constant of the gradient)
  • Too large: divergence (overshooting the minimum)
  • Too small: very slow convergence
  • Optimal: eta = 1/L for gradient descent, but we rarely know L in practice
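The 2/L threshold is easy to see on a one-dimensional quadratic, where L is simply the curvature. The sketch below runs plain gradient descent on the assumed toy function f(x) = 0.5 * L * x^2 with three learning rates; the function name and constants are illustrative.

```python
def gd_on_quadratic(lr, curvature=10.0, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x^2; return final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x  # each step multiplies x by (1 - lr * curvature)
    return abs(x)

L = 10.0
stable = gd_on_quadratic(lr=1.9 / L)    # just below 2/L: oscillates but shrinks
optimal = gd_on_quadratic(lr=1.0 / L)   # eta = 1/L: reaches the minimum directly
unstable = gd_on_quadratic(lr=2.1 / L)  # just above 2/L: grows every step
```

Each step multiplies x by (1 - eta * L), so the iterates shrink only when that factor has magnitude below 1, which is exactly the condition eta < 2/L.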

Gradient Clipping

When gradients are very large (common in RNNs and early transformer training), clip them:

Clip by norm:

if ||g|| > max_norm:
    g = g * (max_norm / ||g||)

Clip by value:

g = clamp(g, -max_value, max_value)

Clip by norm is preferred because it preserves gradient direction. Clip by value can change direction.
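A minimal sketch of both clipping modes on a plain Python list, showing the direction change. The gradient vector `[3.0, 40.0]` and the thresholds are arbitrary, and the function names are illustrative, not a library API.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale the whole vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return list(grad)

def clip_by_value(grad, max_value):
    """Clamp each component independently to [-max_value, max_value]."""
    return [max(-max_value, min(max_value, g)) for g in grad]

grad = [3.0, 40.0]
by_norm = clip_by_norm(grad, 1.0)    # components keep their 3:40 ratio
by_value = clip_by_value(grad, 1.0)  # both components saturate at 1.0
```

Norm clipping keeps the ratio between components, while value clipping turns a gradient pointing mostly along the second axis into one pointing along the diagonal.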

Instant Rejection

If asked "How do you handle exploding gradients?" and you only say "gradient clipping" without mentioning why gradients explode (deep networks with large weights, recurrent connections, attention with large values), you're giving a surface-level answer. Always explain the cause before the fix. Even better, mention that gradient clipping is a symptom treatment - better architecture choices (residual connections, layer norm, proper initialization) address the root cause.

Part 6 - Advanced Topics for Senior Roles

Second-Order Methods

Second-order methods use the Hessian (matrix of second derivatives) to take better steps:

theta_{t+1} = theta_t - H^{-1} * g_t

where H is the Hessian matrix. This gives the Newton step, which converges quadratically near the minimum.

Why we don't use them directly: The Hessian is n x n where n = number of parameters. For a 100M parameter model, that's 10^16 entries - impossible to store or invert.

Approximations used in practice:

  • L-BFGS: Approximates the inverse Hessian from gradient history. Works for small-to-medium models.
  • Natural gradient: Uses the Fisher information matrix instead of the Hessian. Theoretically elegant.
  • K-FAC: Kronecker-factored approximate curvature. Practical for neural networks.
  • Shampoo (Google): Structured second-order method that's competitive with Adam on TPUs.

LARS and LAMB for Large-Batch Training

LARS (Layer-wise Adaptive Rate Scaling):

For each layer l:
    local_lr = trust_coeff * ||theta_l|| / ||g_l||
    theta_l = theta_l - eta * local_lr * g_l

Here trust_coeff is a small fixed hyperparameter, and the ratio of weight norm to gradient norm (the trust ratio) sets each layer's step relative to the size of its weights. This keeps every layer's relative update bounded, which prevents layers with small weights from getting destabilizingly large updates.
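The layer-wise scaling can be sketched for a single update as follows. This is a simplified version that omits the momentum and weight decay terms used in full LARS; `trust_coeff` is my name for the scalar that multiplies the norm ratio, and the two toy layers are made-up values.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def lars_update(theta, grad, lr=1.0, trust_coeff=0.001, eps=1e-12):
    """One simplified LARS step for one layer (no momentum, no weight decay)."""
    local_lr = trust_coeff * l2_norm(theta) / (l2_norm(grad) + eps)
    return [t - lr * local_lr * g for t, g in zip(theta, grad)]

def relative_change(old, new):
    """Size of the update relative to the size of the weights."""
    delta = [n - o for o, n in zip(old, new)]
    return l2_norm(delta) / l2_norm(old)

small_layer, small_grad = [0.01, -0.02], [5.0, 5.0]  # small weights, large gradient
big_layer, big_grad = [1.0, -2.0], [0.05, 0.05]      # large weights, small gradient

rel_small = relative_change(small_layer, lars_update(small_layer, small_grad))
rel_big = relative_change(big_layer, lars_update(big_layer, big_grad))
```

Both layers change by the same relative amount (`lr * trust_coeff`) even though their weight and gradient scales differ by orders of magnitude.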

LAMB (Layer-wise Adaptive Moments for Batch training): Combines LARS with Adam. Used by Google to train BERT in 76 minutes with batch size 65K.

Stochastic Weight Averaging (SWA)

Average the weights from multiple points along the SGD trajectory:

After training for T_0 epochs:
    theta_SWA = (1/K) * sum_{i=1}^{K} theta_{T_0 + i*c}

SWA finds flatter minima with better generalization. It's essentially a cheap ensemble that requires no additional inference cost.
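In practice the average is usually maintained incrementally so that only one extra copy of the weights is stored. A minimal sketch, assuming each checkpoint is a flat list of floats (the function name and data are illustrative):

```python
def swa_running_average(checkpoints):
    """Average weight vectors one checkpoint at a time."""
    swa = list(checkpoints[0])
    for n, ckpt in enumerate(checkpoints[1:], start=1):
        # Running mean: swa <- swa + (ckpt - swa) / (n + 1)
        swa = [s + (c - s) / (n + 1) for s, c in zip(swa, ckpt)]
    return swa

# Three sampled checkpoints of a two-parameter model.
checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
theta_swa = swa_running_average(checkpoints)
```

The running update produces the same result as summing all K checkpoints and dividing by K.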

Lookahead Optimizer

Maintains two sets of weights - "fast weights" that explore with any optimizer, and "slow weights" that average the fast weights periodically:

For every k steps of inner optimizer:
    theta_slow = theta_slow + alpha * (theta_fast - theta_slow)
    theta_fast = theta_slow # Reset fast to slow

Reduces variance and improves stability. Can wrap any optimizer (Lookahead-Adam, Lookahead-SGD).
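A rough sketch of the two-level loop, wrapping plain gradient descent as the inner optimizer on an assumed scalar objective (t - 3)^2. The function name, k = 5, alpha = 0.5, and the objective are illustrative choices.

```python
def lookahead_sgd(grad_fn, theta0, inner_lr=0.1, k=5, alpha=0.5, outer_steps=20):
    """Lookahead around plain gradient descent for a scalar parameter."""
    slow = theta0
    fast = theta0
    for _ in range(outer_steps):
        for _ in range(k):
            fast -= inner_lr * grad_fn(fast)  # k fast-weight steps
        slow = slow + alpha * (fast - slow)   # move slow weights part of the way
        fast = slow                           # reset fast weights to slow weights
    return slow

# Gradient of (t - 3)^2 is 2 * (t - 3); the minimum is at t = 3.
theta = lookahead_sgd(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```

The slow weights move only a fraction `alpha` toward wherever the fast weights ended up, which damps the effect of any single noisy inner trajectory.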

Practice Problems

Problem 1: Diagnose the Training Curve

You're training a transformer model. The training loss decreases for 5 epochs, then suddenly spikes to NaN. What happened and how do you fix it?

Hint 1 - Direction

Think about what can cause numerical values to become NaN. Consider the learning rate, the loss scale, and specific operations in transformers that can produce very large values.

Hint 2 - Insight

NaN usually comes from: (1) gradient explosion leading to overflow, (2) division by zero (e.g., in softmax, layer norm, or Adam's denominator when epsilon is too small), or (3) log of zero/negative values. In transformers specifically, the attention scores (QK^T) can grow very large, leading to softmax saturation and gradient spikes.

Hint 3 - Full Solution + Rubric

Root causes (investigate in order):

  1. Learning rate too high: After 5 epochs, the optimizer reaches a region with high curvature. The update overshoots, gradients explode, NaN.
    • Fix: Reduce LR, add warmup, add gradient clipping
  2. Loss scaling overflow (FP16): The loss scale grows too large, causing FP16 overflow.
    • Fix: Use dynamic loss scaling with backoff, or switch to BF16
  3. Attention score explosion: QK^T values grow with the head dimension d_k and with the norms of the query and key vectors.
    • Fix: Confirm the scores use scaled dot-product attention (divide by sqrt(d_k)), check initialization
  4. Numerical instability in softmax/log_softmax: Large logits cause exp() overflow.
    • Fix: Use log_softmax instead of separate log and softmax
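The overflow in cause 4 and its fix can be reproduced with the standard library. The sketch below contrasts a naive log(softmax(z)) with a version that subtracts the maximum logit first (the log-sum-exp trick, which is the standard way fused log-softmax routines stay stable). The function names and logit values are illustrative.

```python
import math

def naive_log_softmax(logits):
    exps = [math.exp(z) for z in logits]  # overflows for large logits
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(logits):
    m = max(logits)
    # Shifting by the max keeps every exponent <= 0, so exp() cannot overflow.
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

logits = [1000.0, 999.0, 0.0]
stable = stable_log_softmax(logits)

try:
    naive_log_softmax(logits)
    naive_failed = False
except OverflowError:
    naive_failed = True
```

In pure Python the naive version raises an error; in a tensor library the same computation would typically produce inf and then NaN instead.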

Scoring Rubric:

  • Strong Hire: Systematically investigates multiple causes, mentions FP16 issues, knows about attention scaling, suggests gradient clipping as immediate fix + root cause analysis
  • Lean Hire: Identifies gradient explosion and suggests clipping and LR reduction, but misses numerical precision issues
  • No Hire: Only suggests "lower the learning rate" without diagnosing the actual cause

Problem 2: Optimizer Selection

You're fine-tuning a pre-trained BERT model on a classification task with 10K training examples. The team has been using SGD with momentum and getting 85% accuracy. The PM wants 90%. What optimizer changes would you try?

Hint 1 - Direction

Think about why SGD might underperform for fine-tuning a pre-trained transformer. Consider the structure of the parameter space (embeddings vs. classification head) and the learning rate requirements.

Hint 2 - Insight

Pre-trained transformers have parameters at very different scales - the embedding layers, attention weights, and the randomly initialized classification head all need different effective learning rates. Adaptive optimizers handle this naturally. Also consider the learning rate schedule and whether the current LR is appropriate for fine-tuning.

Hint 3 - Full Solution + Rubric

Recommended changes (in priority order):

  1. Switch to AdamW: Adaptive per-parameter learning rates handle the heterogeneous parameter scales in BERT. Use lr=2e-5 (standard BERT fine-tuning LR), weight_decay=0.01.

  2. Add linear warmup + linear decay: 6-10% of total steps as warmup. This is standard in every BERT fine-tuning recipe and makes a significant difference.

  3. Layer-wise learning rate decay: Lower LR for earlier layers (which hold general pre-trained features), higher for later layers and the classification head. E.g., multiply LR by 0.95 for each layer you move away from the classification head toward the embeddings.

  4. Try different LR values: Grid search over [1e-5, 2e-5, 3e-5, 5e-5]. BERT fine-tuning is sensitive to LR.

  5. Increase epochs (within reason): 3-5 epochs is typical for BERT fine-tuning. More can overfit with 10K examples.

Why SGD underperforms here: SGD with a single global LR applies the same step size to all parameters. The pre-trained embedding layer needs a tiny LR to avoid catastrophic forgetting, while the classification head needs a larger LR to learn from scratch. Without manually configured per-layer learning rates, SGD cannot differentiate.
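The layer-wise learning rate decay from item 3 amounts to a simple geometric schedule over depth. The sketch below computes one LR per group for an assumed 12-layer encoder; the group names (`head`, `layer_i`, `embeddings`) are hypothetical labels, not identifiers from any particular library.

```python
def layerwise_lrs(num_layers, head_lr=2e-5, decay=0.95):
    """Return {group_name: lr}; layer num_layers-1 is closest to the head."""
    lrs = {"head": head_lr}
    for layer in range(num_layers - 1, -1, -1):
        depth_from_head = num_layers - layer  # 1 for the top encoder layer
        lrs[f"layer_{layer}"] = head_lr * decay ** depth_from_head
    # Embeddings sit below the first encoder layer, so they get the smallest LR.
    lrs["embeddings"] = head_lr * decay ** (num_layers + 1)
    return lrs

lrs = layerwise_lrs(num_layers=12)
```

These values would then be passed to the optimizer as separate parameter groups, one per layer, so that each group is updated with its own learning rate.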

Scoring Rubric:

  • Strong Hire: Recommends AdamW with specific hyperparameters, mentions warmup schedule, discusses layer-wise LR decay, explains why adaptive methods help for fine-tuning
  • Lean Hire: Suggests switching to Adam, knows about low LR for fine-tuning, but missing schedule and layer-wise details
  • No Hire: Suggests "just try different learning rates with SGD" without understanding why the optimizer choice matters

Problem 3: Large Batch Training

Your team wants to reduce training time for a large language model by increasing the batch size from 256 to 4096. What changes to the optimization setup are needed?

Hint 1 - Direction

Think about what changes when you increase batch size by 16x. Consider the learning rate, gradient noise, convergence behavior, and any special techniques for large-batch training.

Hint 2 - Insight

The linear scaling rule says to scale LR proportionally to batch size, but this doesn't work perfectly at very large scales. You need warmup to let the optimizer adjust to the larger LR. Also consider that reduced gradient noise changes the optimization dynamics - you may need to train for more epochs or use techniques that inject noise back.

Hint 3 - Full Solution + Rubric

Required changes:

  1. Scale learning rate: Increase LR by 16x (linear scaling rule). But cap at a maximum - test empirically.

  2. Extend warmup: Longer warmup period (proportional to LR increase). Large LR + early unstable gradients = divergence without adequate warmup. Try 2-5% of total steps.

  3. Consider LAMB optimizer: For very large batches, LAMB (Layer-wise Adaptive Moments for Batch training) outperforms AdamW. It adds layer-wise trust ratios.

  4. Gradient clipping: More aggressive clipping may be needed. Large batch gradients are lower variance but can still have outliers.

  5. Adjust total training steps: Fewer steps per epoch (since each step processes 16x more data). But you might need more epochs to compensate for reduced gradient noise.

  6. Monitor carefully: Track gradient norm per layer, loss stability, and validation metrics. Large batch training is less forgiving of hyperparameter choices.

What NOT to change:

  • Don't change beta1, beta2, epsilon - these are generally robust to batch size changes
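  • Don't change weight decay by default - in AdamW it is decoupled from the adaptive gradient scaling. Note that the decay term is still multiplied by the learning rate, so the per-step shrinkage rises with a larger LR; revisit lambda only if you see signs of over-regularization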

Scoring Rubric:

  • Strong Hire: Mentions linear scaling rule with caveats, warmup extension, LAMB as an option, discusses generalization risks, monitoring strategy
  • Lean Hire: Knows about LR scaling and warmup, but missing advanced techniques and generalization discussion
  • No Hire: Just says "increase the learning rate" or "it should just work with the same settings"

Problem 4: Cosine vs. Step Decay

Your colleague argues that cosine annealing is always better than step decay. Design an experiment to test this claim on your image classification task.

Hint 1 - Direction

Think about what "better" means - training speed? Final accuracy? Robustness to hyperparameters? Think about what variables you need to control and what metrics to report.

Hint 2 - Insight

A fair comparison requires equal total training budget, proper hyperparameter tuning for both schedules, and multiple random seeds. The answer likely depends on the total training budget - cosine annealing tends to win when the training budget is known in advance and you train for exactly that many steps.

Hint 3 - Full Solution + Rubric

Experimental design:

  1. Fixed variables: Same model architecture, dataset, optimizer (SGD+momentum), weight decay, batch size, data augmentation, total epochs

  2. Tuned variables per schedule:

    • Step decay: Grid search over milestone sets x gamma values x initial LR
    • Cosine: Grid search over initial LR x eta_min
  3. Metrics: Final top-1 accuracy, best accuracy during training, time-to-threshold (epochs to reach 90% accuracy), variance across 3+ random seeds

  4. Training budgets: Test at multiple budgets (50, 100, 200, 300 epochs) - the relative advantage may change

  5. Expected results (based on literature):

    • Cosine annealing needs only the total budget, not hand-picked milestones, so it tends to work well across budgets with little retuning
    • Step decay can match or beat cosine when milestones are tuned, but the tuning cost is higher
    • For short budgets, cosine's smooth decay is generally better
    • For very long training, warm restarts can help cosine

Key insight: The claim "always better" is too strong. Cosine annealing is easier to tune and more robust, but well-tuned step decay can match it.
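
To make the two schedules concrete, here is a small sketch of both as plain functions of the epoch. The names `step_decay_lr` and `cosine_lr`, the milestones, and the base LR of 0.1 are assumptions for illustration; a real experiment would normally use the scheduler classes of its training framework.

```python
import math


def step_decay_lr(epoch, base_lr, milestones, gamma=0.1):
    """Multiply base_lr by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def cosine_lr(epoch, base_lr, total_epochs, eta_min=0.0):
    """Cosine annealing from base_lr at epoch 0 down to eta_min at total_epochs."""
    progress = min(epoch, total_epochs) / total_epochs
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * progress))


# Hypothetical settings, for illustration only.
total_epochs = 100
milestones = [30, 60, 90]

for epoch in (0, 30, 50, 90, 100):
    print(
        epoch,
        step_decay_lr(epoch, 0.1, milestones),
        cosine_lr(epoch, 0.1, total_epochs),
    )
```

Note that the cosine schedule takes `total_epochs` as an input, which is why it is sensitive to stopping early or training longer than planned, while step decay depends on where the milestones sit.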

Scoring Rubric:

  • Strong Hire: Designs a rigorous experiment with proper controls, multiple seeds, hyperparameter search for both, discusses when each schedule wins
  • Lean Hire: Reasonable experimental design but misses some controls (e.g., same total epochs, HP tuning for both)
  • No Hire: Just says "train with both and see which is better" without experimental rigor

Problem 5: Adam's Bias Correction

Derive Adam's bias correction. Why is m_hat_t = m_t / (1 - beta1^t) the correct unbiased estimator?

Hint 1 - Direction

Start by expanding the recurrence relation for m_t. What is m_t in terms of g_1, g_2, ..., g_t? Take the expectation.

Hint 2 - Insight

m_t = (1 - beta1) * sum_{i=1}^{t} beta1^{t-i} * g_i. The expectation is E[m_t] = (1 - beta1) * sum_{i=1}^{t} beta1^{t-i} * E[g_i]. If we assume E[g_i] = E[g] (stationary), this simplifies using the geometric series formula.

Hint 3 - Full Solution + Rubric

Derivation:

Initialize m_0 = 0. The recurrence is:

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t

Expanding:

m_1 = (1 - beta1) * g_1
m_2 = beta1 * (1 - beta1) * g_1 + (1 - beta1) * g_2
m_t = (1 - beta1) * sum_{i=1}^{t} beta1^{t-i} * g_i

Taking expectations (assuming E[g_i] = mu for all i):

E[m_t] = (1 - beta1) * mu * sum_{i=1}^{t} beta1^{t-i}
= (1 - beta1) * mu * sum_{j=0}^{t-1} beta1^j
= (1 - beta1) * mu * (1 - beta1^t) / (1 - beta1)
= mu * (1 - beta1^t)

So E[m_t] = mu * (1 - beta1^t), which is biased toward zero (since (1 - beta1^t) < 1).

Dividing by (1 - beta1^t):

E[m_hat_t] = E[m_t / (1 - beta1^t)] = mu

Under the stationarity assumption, m_hat_t is therefore an unbiased estimator of the gradient mean mu.

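Why it matters: At t=1 with beta1=0.9: m_1 = 0.1 * g_1, so the uncorrected first-moment estimate is 10x smaller than the gradient it is meant to track. With correction: m_hat_1 = 0.1 * g_1 / 0.1 = g_1. Adam applies the same correction to the second moment using beta2, and the update uses the ratio m_hat_t / sqrt(v_hat_t), so both corrections are needed for the early step sizes to come out as intended. The correction is most important early in training and becomes negligible as t grows large (beta1^t approaches 0).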
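
The derivation can be checked numerically. The sketch below feeds a constant gradient mu into the recurrence (the simplest stationary case) and compares the raw and corrected first moments; the helper name `first_moment` and the value mu = 2.0 are assumptions for illustration.

```python
def first_moment(grads, beta1=0.9):
    """Return a list of (m_t, m_hat_t) pairs, starting from m_0 = 0."""
    m = 0.0
    history = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # exponential moving average
        m_hat = m / (1 - beta1 ** t)           # bias-corrected estimate
        history.append((m, m_hat))
    return history


# Constant gradient: E[g_i] = mu holds trivially.
mu = 2.0
history = first_moment([mu] * 50)

for t in (1, 2, 10, 50):
    m, m_hat = history[t - 1]
    print(t, m, m_hat)
```

The raw m_t should track mu * (1 - beta1^t) and climb toward mu, while m_hat_t should sit at mu from the first step.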

Scoring Rubric:

  • Strong Hire: Clean derivation, geometric series, explains practical impact, mentions the correction becomes negligible for large t
  • Lean Hire: Gets the right formula but derivation has gaps, or explains intuitively without formal proof
  • No Hire: Cannot derive the formula or doesn't understand why bias correction exists

Interview Cheat Sheet

| Topic | Key Fact | When to Mention |
|---|---|---|
| SGD | O(1/sqrt(T)) convergence; noise helps generalization | "Why not just use full-batch GD?" |
| Momentum | beta=0.9 standard; cancels oscillation, accumulates consistent direction | "How does momentum help?" |
| Adam | First + second moment; bias correction; default: lr=1e-3, beta1=0.9, beta2=0.999 | Any optimizer question |
| AdamW | Decoupled weight decay; Adam+L2 is wrong because adaptive scaling affects regularization | "Adam vs AdamW?" |
| Warmup | Prevents divergence when optimizer moments are uninitialized; 1-5% of total steps | Any transformer training question |
| Cosine decay | Smooth; no milestone tuning; standard for LLMs with warmup | "What LR schedule?" |
| Step decay | Multiply by 0.1 at milestones; standard for ResNet | "How to train a CNN?" |
| Batch size | Larger batch = scale LR proportionally; warmup needed; may hurt generalization | "How to speed up training?" |
| Gradient accumulation | Simulates large batch; divide loss by accumulation steps; BN incompatible | "What if batch doesn't fit in memory?" |
| Gradient clipping | Clip by norm preserves direction; clip by value may change direction | "How to handle gradient explosion?" |
| Saddle points | More problematic than local minima in high dimensions; adaptive methods help | "What makes optimization hard?" |
| Convergence | Non-convex: only guarantee stationary point; convex: global optimum | "What guarantees does SGD have?" |
| LARS/LAMB | Layer-wise adaptive LR for large batch training; LAMB = LARS + Adam | "How to train with 64K batch size?" |
| Mixed precision | FP16 forward/backward, FP32 weights; loss scaling; increase Adam epsilon | "How to speed up training?" |
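
The gradient clipping row is easy to verify by hand. The sketch below clips a toy two-component gradient both ways; `clip_by_norm`, `clip_by_value`, and the example vector are hypothetical, and real training code would normally use the framework's own clipping utilities.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale the whole vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def clip_by_value(grad, limit):
    """Clamp each component independently to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grad]


# Toy gradient with one large and one small component.
grad = [3.0, 0.4]
by_norm = clip_by_norm(grad, 1.0)
by_value = clip_by_value(grad, 1.0)

print("original ratio:", grad[0] / grad[1])
print("clip by norm  :", by_norm, "ratio", by_norm[0] / by_norm[1])
print("clip by value :", by_value, "ratio", by_value[0] / by_value[1])
```

Clipping by norm keeps the ratio between components, so the update direction is unchanged; clipping by value shrinks only the large component, which rotates the update.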

Spaced Repetition Checkpoints

Day 0 - Immediate Recall

  • Write the SGD, SGD+momentum, and Adam update rules from memory
  • Explain why Adam needs bias correction in one sentence
  • Draw the optimizer selection flowchart from memory
  • State the linear scaling rule for batch size

Day 3 - Active Recall

  • Explain the difference between Adam and AdamW without looking at notes
  • Describe three learning rate schedules and when to use each
  • Explain why saddle points are more problematic than local minima in high dimensions
  • What happens if you use FP16 training with Adam epsilon=1e-8?

Day 7 - Application

  • Given a training loss curve (oscillating and not converging), diagnose three possible causes and fixes
  • Design the optimization setup for fine-tuning BERT (optimizer, LR, schedule, weight decay)
  • Explain gradient accumulation to a junior engineer, including the loss scaling detail

Day 14 - Synthesis

  • Compare SGD+momentum vs. Adam for image classification vs. NLP - when does each win and why?
  • Design an experiment to find the optimal batch size for a new task
  • Derive Adam's bias correction from scratch

Day 21 - Interview Simulation

  • Given: "Our model trains fine for 10 epochs then loss goes to NaN." Walk through a systematic diagnosis.
  • Given: "We want to train 10x faster." Propose a complete optimization strategy (batch size, optimizer, schedule, hardware).
  • Explain to a non-ML interviewer why training neural networks is hard (in terms of optimization).
© 2026 EngineersOfAI. All rights reserved.