Skip to main content

Adam Optimizer - The Default That Deserves Understanding

Reading time: ~30 min | Interview relevance: High | Roles: MLE, Research Engineer, Data Scientist

The Real Interview Moment

You're 40 minutes into a research engineer interview at a major AI lab. The interviewer has been probing your understanding of training dynamics, and then asks: "You're training a large transformer model and notice that loss plateaus after 10K steps. You're currently using Adam with default hyperparameters. Walk me through exactly what Adam is doing under the hood, why it might plateau, and what modifications you'd try. Also - can you explain why people sometimes switch to SGD with momentum for the final phase of training?"

You've used torch.optim.Adam a thousand times. You know it "just works." But can you derive the update rule from scratch? Can you explain why bias correction matters in the first few steps? Do you know when AdamW is actually critical versus a nice-to-have? The interviewer isn't testing whether you can call optimizer.step() - they're testing whether you understand the optimizer well enough to debug training when it breaks.

This is the optimizer interview. It separates practitioners who blindly use defaults from engineers who can reason about why training succeeds or fails.

What You Will Master

After reading this page, you will be able to:

  • Derive Adam from first principles, starting from vanilla SGD
  • Explain momentum and RMSprop and why Adam combines both
  • Write out the full Adam update rule with bias correction and explain each term
  • Distinguish AdamW from Adam with L2 regularization and explain when it matters
  • Implement Adam from scratch in Python
  • Diagnose common optimizer-related training failures
  • Argue convincingly for Adam vs SGD in different scenarios
  • Answer every common interview question about optimizers

Part 1 - The Problem Adam Solves

Why Vanilla SGD Is Not Enough

Stochastic gradient descent with a fixed learning rate has fundamental limitations:

θt+1=θtηgt\theta_{t+1} = \theta_t - \eta \cdot g_t

where gt=θL(θt)g_t = \nabla_\theta \mathcal{L}(\theta_t) is the gradient at step tt.

Instant Rejection

Never say "Adam is just better than SGD." In many settings (e.g., image classification with ResNets), well-tuned SGD with momentum outperforms Adam on final test accuracy. The advantage of Adam is faster convergence and robustness to hyperparameter choices, not universally better results.

Three problems with vanilla SGD:

ProblemDescriptionExample
Noisy gradientsMini-batch gradients are noisy estimates of the true gradientLoss oscillates wildly, slow convergence
Uniform learning rateSame step size for all parameters, regardless of gradient magnitudeRare features get tiny effective updates; frequent features dominate
Saddle points & ravinesGets stuck oscillating in steep directions while making slow progress in flat directionsLoss surface has very different curvatures along different axes

SGD Problems and Solutions: Momentum, RMSprop, Adam

Part 2 - Building Blocks: Momentum and RMSprop

Momentum (Polyak, 1964)

Momentum smooths gradients by maintaining an exponential moving average (EMA) of past gradients:

mt=β1mt1+(1β1)gtm_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t θt+1=θtηmt\theta_{t+1} = \theta_t - \eta \cdot m_t

Intuition: Think of a ball rolling downhill. Momentum accumulates velocity in consistent directions and dampens oscillations in inconsistent directions.

  • Typical value: β1=0.9\beta_1 = 0.9 (averages roughly over last 11β1=10\frac{1}{1-\beta_1} = 10 gradients)
  • Effect: Smoother trajectory, faster convergence through flat regions, reduced oscillation
60-Second Answer

"Momentum maintains a running average of past gradients. When gradients consistently point in the same direction, momentum accelerates updates. When gradients oscillate, momentum cancels out the noise. It's like a heavy ball rolling downhill - it builds speed on consistent slopes and isn't disturbed by small bumps."

RMSprop (Hinton, unpublished lecture notes, 2012)

RMSprop adapts the learning rate per parameter based on the magnitude of recent gradients:

vt=β2vt1+(1β2)gt2v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 θt+1=θtηvt+ϵgt\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \cdot g_t

Intuition: Parameters with large gradients get smaller effective learning rates. Parameters with small gradients get larger effective learning rates.

  • Typical value: β2=0.999\beta_2 = 0.999
  • ϵ=108\epsilon = 10^{-8} prevents division by zero
  • Effect: Handles parameters at very different scales, useful for sparse gradients

Comparison of Building Blocks

PropertySGDSGD + MomentumRMSpropAdam
Gradient smoothingNoneYes (1st moment)NoneYes (1st moment)
Per-parameter LRNoNoYes (2nd moment)Yes (2nd moment)
Bias correctionN/ANoNoYes
Memory overhead01x params1x params2x params
Typical useResearch baselinesImage classificationRNNs (historical)Transformers, default

Part 3 - The Adam Algorithm

Full Derivation

Adam (Adaptive Moment Estimation) combines momentum and RMSprop with a critical addition: bias correction.

Algorithm (Kingma & Ba, 2015):

Initialize: m0=0m_0 = 0, v0=0v_0 = 0, t=0t = 0

For each step:

  1. t=t+1t = t + 1
  2. gt=θL(θt1)g_t = \nabla_\theta \mathcal{L}(\theta_{t-1}) - compute gradient
  3. mt=β1mt1+(1β1)gtm_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t - update first moment (mean)
  4. vt=β2vt1+(1β2)gt2v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 - update second moment (uncentered variance)
  5. m^t=mt1β1t\hat{m}_t = \frac{m_t}{1 - \beta_1^t} - bias-corrected first moment
  6. v^t=vt1β2t\hat{v}_t = \frac{v_t}{1 - \beta_2^t} - bias-corrected second moment
  7. θt=θt1ηv^t+ϵm^t\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t - update parameters

Why Bias Correction Matters

Since m0=0m_0 = 0 and v0=0v_0 = 0, the moment estimates are biased toward zero in early steps.

At step 1: m1=β10+(1β1)g1=(1β1)g1m_1 = \beta_1 \cdot 0 + (1 - \beta_1) \cdot g_1 = (1 - \beta_1) \cdot g_1

With β1=0.9\beta_1 = 0.9, m1=0.1g1m_1 = 0.1 \cdot g_1 - the moment estimate is 10x smaller than the actual gradient!

The bias correction factor: 11β1t\frac{1}{1 - \beta_1^t} compensates for this. At t=1t = 1: 110.91=10\frac{1}{1 - 0.9^1} = 10, so m^1=100.1g1=g1\hat{m}_1 = 10 \cdot 0.1 \cdot g_1 = g_1. As tt \to \infty, β1t0\beta_1^t \to 0, and the correction becomes negligible.

Common Trap

Many candidates can state the Adam update rule but cannot explain why bias correction exists. The interviewer will specifically ask about those division terms. If you just say "it corrects for initialization," follow up with the math showing that without correction, early updates are severely underestimated, especially for vtv_t where β2=0.999\beta_2 = 0.999 means the bias persists for hundreds of steps.

For the second moment with β2=0.999\beta_2 = 0.999, bias correction is even more critical. At step tt: E[vt]=(1β2t)E[gt2]\mathbb{E}[v_t] = (1 - \beta_2^t) \cdot \mathbb{E}[g_t^2]

The correction factor 110.999t\frac{1}{1 - 0.999^t} doesn't approach 1 until t1000t \approx 1000, meaning without correction, the variance estimate is severely wrong for the first thousand steps.

Default Hyperparameters

HyperparameterDefaultMeaningSensitivity
η\eta (learning rate)10310^{-3}Base step sizeVery high - most important to tune
β1\beta_10.9Momentum decayLow - rarely changed
β2\beta_20.999Variance decayMedium - sometimes 0.98 or 0.95 for unstable training
ϵ\epsilon10810^{-8}Numerical stabilityVery low - almost never changed

Part 4 - Adam from Scratch

import numpy as np

class Adam:
"""Adam optimizer implemented from scratch."""

def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
self.params = list(params)
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
self.t = 0

# Initialize moment estimates
self.m = [np.zeros_like(p) for p in self.params] # First moment
self.v = [np.zeros_like(p) for p in self.params] # Second moment

def step(self, grads):
"""Perform one optimization step."""
self.t += 1

for i, (param, grad) in enumerate(zip(self.params, grads)):
# Update biased first moment estimate
self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grad

# Update biased second moment estimate
self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * grad**2

# Bias correction
m_hat = self.m[i] / (1 - self.beta1**self.t)
v_hat = self.v[i] / (1 - self.beta2**self.t)

# Update parameters
param -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

return self.params

PyTorch equivalent:

import torch

# Standard Adam
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# AdamW (decoupled weight decay - preferred for transformers)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Training loop
for batch in dataloader:
optimizer.zero_grad()
loss = model(batch)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

Part 5 - AdamW: Decoupled Weight Decay

The Weight Decay vs L2 Regularization Confusion

This is one of the most commonly confused topics in deep learning.

L2 regularization adds a penalty to the loss: Lreg=L+λ2θ2\mathcal{L}_\text{reg} = \mathcal{L} + \frac{\lambda}{2} \|\theta\|^2

This modifies the gradient: gtreg=gt+λθtg_t^\text{reg} = g_t + \lambda \theta_t

Weight decay directly shrinks weights at each step: θt+1=(1ηλ)θtηgt\theta_{t+1} = (1 - \eta \lambda) \theta_t - \eta \cdot g_t

60-Second Answer

"For vanilla SGD, L2 regularization and weight decay are mathematically equivalent. But for Adam, they're different! With L2 regularization, the regularization gradient gets scaled by Adam's adaptive learning rate, which means parameters with large gradient variance get less regularization. AdamW fixes this by applying weight decay directly to the parameters, bypassing Adam's scaling. This is called 'decoupled weight decay' and is why AdamW is the standard for training transformers."

Why They Diverge in Adam

With L2 regularization in Adam, the regularization term λθ\lambda \theta gets divided by v^t+ϵ\sqrt{\hat{v}_t} + \epsilon. This means:

  • Parameters with large gradients (large v^t\hat{v}_t) get less regularization
  • Parameters with small gradients (small v^t\hat{v}_t) get more regularization

This is not the intended behavior. Weight decay should apply uniformly.

AdamW update rule (Loshchilov & Hutter, 2019):

θt=θt1η(m^tv^t+ϵ+λθt1)\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)

The weight decay term λθt1\lambda \theta_{t-1} is outside the adaptive scaling.

AdamW vs Adam+L2: Decoupled Weight Decay

When AdamW Actually Matters

ScenarioImpact of AdamW vs Adam+L2
Small models, low regularizationMinimal difference
Large transformers with weight decaySignificant - AdamW gives better generalization
Models with very heterogeneous gradient scalesCritical - L2 regularization becomes unpredictable
Fine-tuning pre-trained modelsImportant - uniform decay prevents selective forgetting

Part 6 - Learning Rate Warmup

Why Warmup Helps Adam

Despite bias correction, Adam's variance estimate vtv_t is still imprecise during early training. A large learning rate combined with imprecise vtv_t can cause dangerously large updates.

Linear warmup gradually increases the learning rate:

ηt={ηmaxtTwarmupif tTwarmupηmaxdecay(t)if t>Twarmup\eta_t = \begin{cases} \eta_\text{max} \cdot \frac{t}{T_\text{warmup}} & \text{if } t \leq T_\text{warmup} \\ \eta_\text{max} \cdot \text{decay}(t) & \text{if } t > T_\text{warmup} \end{cases}

from torch.optim.lr_scheduler import LambdaLR

def get_warmup_cosine_scheduler(optimizer, warmup_steps, total_steps):
"""Warmup + cosine decay - the standard for transformer training."""
def lr_lambda(step):
if step < warmup_steps:
return step / warmup_steps # Linear warmup
# Cosine decay
progress = (step - warmup_steps) / (total_steps - warmup_steps)
return 0.5 * (1 + np.cos(np.pi * progress))

return LambdaLR(optimizer, lr_lambda)

# Typical usage for a 100K step training run
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = get_warmup_cosine_scheduler(optimizer, warmup_steps=2000, total_steps=100000)
Company Variation
  • Google/DeepMind: Often use warmup + inverse sqrt decay: ηt=ηmaxmin(t0.5,tTwarmup1.5)\eta_t = \eta_\text{max} \cdot \min(t^{-0.5}, t \cdot T_\text{warmup}^{-1.5}). This is the original Transformer schedule.
  • OpenAI: Cosine decay with warmup is standard for GPT-style models.
  • Meta: Sometimes use linear decay with warmup for LLaMA-style training.
  • Startups: Often just use cosine with warmup - it's robust and works well.

Learning Rate Schedule: Warmup + Cosine Decay

Part 7 - Adam vs SGD: The Great Debate

When to Use Each

FactorFavor AdamFavor SGD + Momentum
ArchitectureTransformers, LSTMs, attention modelsCNNs (ResNet, EfficientNet)
Dataset sizeSmall to medium datasetsLarge datasets with long training
Tuning budgetLimited hyperparameter search budgetCan afford extensive grid search
Training stabilityNeed robust convergenceCan tolerate more tuning
Final accuracyGood but sometimes slightly worseOften better generalization with tuning
Speed to good solutionFaster - fewer steps to convergeSlower - but can reach better final loss
Sparse gradientsExcellent (per-parameter LR)Poor (uniform LR for all params)
Memory2x parameter memory for moments1x parameter memory for momentum

The Generalization Gap

A well-documented phenomenon: Adam often converges faster but to a wider minimum, while SGD with momentum finds sharper minima that sometimes generalize better.

Common Trap

Don't simply say "Adam generalizes worse than SGD." The generalization gap has been significantly reduced by AdamW, proper weight decay, and warmup schedules. For transformers, Adam/AdamW typically generalizes better than SGD because transformers have loss landscapes that SGD struggles with. The generalization gap is primarily observed in CNN training on image classification tasks.

Key research findings:

  1. Wilson et al. (2017): SGD generalizes better than Adam on image tasks - but with extensive tuning
  2. Loshchilov & Hutter (2019): AdamW closes much of the gap by fixing weight decay
  3. Liu et al. (2020): RAdam (rectified Adam) addresses early training instability
  4. In practice (2024-2025): AdamW with warmup is the de facto standard for transformers

The Practical Answer

# For transformers / LLMs / NLP - use AdamW
optimizer = torch.optim.AdamW(
model.parameters(),
lr=3e-4, # Lower than SGD typically
betas=(0.9, 0.999),
weight_decay=0.01,
eps=1e-8
)

# For CNNs / image classification - SGD with momentum is competitive
optimizer = torch.optim.SGD(
model.parameters(),
lr=0.1, # Higher than Adam typically
momentum=0.9,
weight_decay=1e-4,
nesterov=True # Nesterov momentum slightly better
)

Part 8 - Adam Variants

AMSGrad

Problem: Adam's second moment estimate vtv_t can decrease over time, causing the effective learning rate to increase, potentially leading to divergence.

AMSGrad fix: Use the maximum of all past vtv_t values: v^t=max(v^t1,vt)\hat{v}_t = \max(\hat{v}_{t-1}, v_t)

In practice: AMSGrad rarely makes a significant difference. Included in PyTorch (amsgrad=True) but seldom used.

RAdam (Rectified Adam)

Problem: Adam's variance estimate has high variance in early training (not enough samples), causing unstable updates even with bias correction.

RAdam fix: Compute the variance of the variance estimate. When it's too high (early training), fall back to SGD without the adaptive term. As training progresses and the estimate stabilizes, smoothly transition to full Adam.

LAMB (Layer-wise Adaptive Moments)

Designed for large-batch training. Normalizes the Adam update by the ratio of parameter norm to update norm, per layer:

θt+1=θtηθtutut\theta_{t+1} = \theta_t - \eta \cdot \frac{\|\theta_t\|}{\|u_t\|} \cdot u_t

where utu_t is the Adam update. Used to train BERT with batch size 65K.

8-bit Adam (bitsandbytes)

Quantizes the optimizer states mtm_t and vtv_t to 8-bit, reducing memory by 75% with negligible accuracy loss. Critical for training large models on limited GPU memory.

import bitsandbytes as bnb

optimizer = bnb.optim.Adam8bit(
model.parameters(),
lr=1e-3,
betas=(0.9, 0.999)
)

Variant Summary

Adam Variants Family Tree

Part 9 - Diagnosing Optimizer Issues

Common Training Failures and Optimizer Fixes

SymptomLikely CauseFix
Loss explodes early in trainingLR too high + imprecise variance estimateAdd warmup, reduce initial LR
Loss plateaus after fast initial dropLR too high for fine-grained optimizationAdd LR decay (cosine)
Loss oscillates wildlyLR too high or β1\beta_1 too lowReduce LR, increase β1\beta_1
Training loss drops but val loss doesn'tOverfitting, not optimizer issueIncrease weight decay, add dropout
NaN loss after many stepsNumerical instability, gradient explosionGradient clipping, increase ϵ\epsilon
Different layers converge at very different ratesExpected - Adam handles thisThis is why you use Adam
Fine-tuning destroys pretrained knowledgeLR too high for pretrained layersUse layer-wise LR decay

Debugging Checklist

# 1. Monitor gradient norms per layer
for name, param in model.named_parameters():
if param.grad is not None:
grad_norm = param.grad.norm().item()
print(f"{name}: grad_norm={grad_norm:.6f}")

# 2. Check for NaN/Inf in gradients
for name, param in model.named_parameters():
if param.grad is not None:
if torch.isnan(param.grad).any() or torch.isinf(param.grad).any():
print(f"WARNING: NaN/Inf gradient in {name}")

# 3. Monitor effective learning rate
for group in optimizer.param_groups:
for p in group['params']:
state = optimizer.state[p]
if 'exp_avg_sq' in state:
v = state['exp_avg_sq']
effective_lr = group['lr'] / (v.sqrt() + group['eps'])
print(f"Effective LR range: [{effective_lr.min():.8f}, {effective_lr.max():.8f}]")

Part 10 - Practice Problems

Problem 1: The Bias Correction Derivation

Derive the bias correction factor for the first moment estimate. Show that E[m^t]=E[gt]\mathbb{E}[\hat{m}_t] = \mathbb{E}[g_t] when m^t=mt1β1t\hat{m}_t = \frac{m_t}{1 - \beta_1^t}.

Hint 1 - Direction

Expand mtm_t by recursively substituting the update rule. You'll get a geometric series involving β1\beta_1.

Hint 2 - Key Insight

mt=(1β1)i=1tβ1tigim_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i. Now take the expectation, assuming E[gi]=g\mathbb{E}[g_i] = g for all ii.

Full Answer

Expand the recursion: mt=(1β1)gt+β1mt1m_t = (1 - \beta_1) g_t + \beta_1 m_{t-1} =(1β1)gt+β1[(1β1)gt1+β1mt2]= (1 - \beta_1) g_t + \beta_1 [(1 - \beta_1) g_{t-1} + \beta_1 m_{t-2}] =(1β1)i=1tβ1tigi= (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i

Taking expectation (assuming stationary gradients where E[gi]=g\mathbb{E}[g_i] = g): E[mt]=(1β1)gi=1tβ1ti=(1β1)g1β1t1β1=g(1β1t)\mathbb{E}[m_t] = (1 - \beta_1) g \sum_{i=1}^{t} \beta_1^{t-i} = (1 - \beta_1) g \cdot \frac{1 - \beta_1^t}{1 - \beta_1} = g(1 - \beta_1^t)

Therefore: E[mt1β1t]=g(1β1t)1β1t=g\mathbb{E}\left[\frac{m_t}{1 - \beta_1^t}\right] = \frac{g(1 - \beta_1^t)}{1 - \beta_1^t} = g

The bias correction factor 11β1t\frac{1}{1 - \beta_1^t} exactly compensates for the initialization at zero.

Problem 2: Adam vs SGD Recommendation

You're training a ResNet-50 on ImageNet and a GPT-2 model on a text corpus. Recommend an optimizer for each and justify your choices with specific technical reasons.

Hint 1 - Direction

Consider the loss landscape properties of CNNs vs transformers. Think about the role of batch normalization in CNNs.

Full Answer + Rubric

ResNet-50 on ImageNet: SGD with momentum (0.9), Nesterov, LR=0.1, cosine decay, weight decay=1e-4.

Reasons: (1) BatchNorm normalizes activations, reducing the need for per-parameter adaptive LRs. (2) Extensive literature shows SGD achieves better final accuracy on ImageNet with ResNets. (3) Lower memory cost (1x vs 2x params). (4) Well-established training recipes exist.

GPT-2 on text: AdamW, LR=3e-4, warmup 2000 steps, cosine decay, weight decay=0.01.

Reasons: (1) Transformers have highly heterogeneous gradient scales across attention heads and layers. (2) No batch normalization - LayerNorm doesn't provide the same normalizing effect on gradients. (3) Sparse attention patterns create sparse gradients. (4) Adam's adaptive LR is critical for stable training.

Scoring:

  • Strong Hire: Gives both recommendations correctly with architecture-specific justifications
  • Lean Hire: Correct recommendations but generic justifications
  • No Hire: Says "always use Adam" or "always use SGD"

Problem 3: Debugging a Training Run

You're training a transformer with AdamW (lr=1e-3, warmup=500 steps). Training loss drops rapidly for 1000 steps, then starts increasing. What's happening and how do you fix it?

Hint 1 - Direction

The LR of 1e-3 is quite high for Adam on transformers. Consider what happens after warmup when the full learning rate kicks in.

Full Answer + Rubric

Diagnosis: The learning rate 1e-3 is likely too high. During warmup (0-500 steps), the effective LR ramps from 0 to 1e-3, and training is stable. After warmup, the full LR causes overshooting - the optimizer steps past minima, and the adaptive estimates from the low-LR warmup phase don't reflect the dynamics at the full LR.

Fixes (in order of priority):

  1. Reduce LR to 3e-4 (standard for transformers)
  2. Extend warmup to 2000+ steps
  3. Add gradient clipping (max norm = 1.0)
  4. Reduce β2\beta_2 from 0.999 to 0.98 (faster adaptation to current gradients)
  5. Add cosine decay to gradually reduce LR

Scoring:

  • Strong Hire: Identifies LR as root cause, suggests multiple targeted fixes, explains the warmup-to-full-LR transition dynamics
  • Lean Hire: Says "reduce learning rate" without deeper analysis
  • No Hire: Suggests switching optimizers or increasing batch size without diagnosing

Part 11 - The Paper in Context

Key Contributions of the Adam Paper

  1. Combined momentum + adaptive LR - not novel individually, but the combination with bias correction was
  2. Bias correction - the key technical contribution. Without it, Adam performs poorly in early training
  3. Default hyperparameters - β1=0.9\beta_1=0.9, β2=0.999\beta_2=0.999, ϵ=108\epsilon=10^{-8} work surprisingly well across many domains
  4. Convergence proof - proved convergence for convex objectives (the non-convex case is harder)

What the Paper Got Wrong

  • Convergence proof was flawed: Reddi et al. (2018) showed Adam can diverge on simple convex problems where SGD converges
  • L2 regularization: The paper didn't distinguish between L2 reg and weight decay (fixed by AdamW in 2019)
  • Generalization: The paper didn't address the generalization gap, which became apparent later

Historical Timeline

Optimizer Historical Timeline: SGD to 8-bit Adam

Interview Cheat Sheet

Question PatternFrameworkKey Phrases
"Explain Adam"SGD → Momentum (1st moment) → RMSprop (2nd moment) → Adam (both + bias correction)"Adam combines momentum's gradient smoothing with RMSprop's per-parameter scaling, plus bias correction for the zero initialization"
"What is bias correction?"Zero init → biased estimates → correction factor → proof"Since moments are initialized at zero, early estimates underestimate the true values. The factor 1/(1-beta^t) exactly compensates."
"Adam vs AdamW?"L2 reg vs weight decay → equivalent for SGD → different for Adam → decoupled decay"In Adam, L2 regularization gets scaled by the adaptive learning rate. AdamW applies weight decay directly, giving uniform regularization."
"When would you use SGD over Adam?"CNNs with BatchNorm, large-scale image classification, when you can tune extensively"For CNNs, SGD with momentum often achieves better generalization with proper tuning. For transformers, AdamW is almost always better."
"Why use warmup?"Imprecise early variance estimates → large steps → instability"The second moment estimate needs hundreds of steps to stabilize. Warmup gives it time while keeping updates small."
"Adam is not converging. Debug it."Check LR → check gradients → check warmup → check weight decay → check data"I'd first verify the learning rate is appropriate, then check for gradient explosion, then examine the warmup schedule."

Spaced Repetition Checkpoints

  • Day 0: Read this page. Write out the Adam update rule from memory. Explain bias correction to a rubber duck.
  • Day 3: Without looking, explain why AdamW differs from Adam + L2. Implement the Adam update step in 10 lines of Python.
  • Day 7: Compare Adam vs SGD for transformers vs CNNs. What three reasons make Adam better for transformers?
  • Day 14: Solve all three practice problems from memory. Time yourself - you should answer each in 5-8 minutes.
  • Day 21: Explain the full optimizer landscape (SGD → Momentum → RMSprop → Adam → AdamW → 8-bit Adam) in 3 minutes.

Next Steps

© 2026 EngineersOfAI. All rights reserved.