
Optimizers: Adam, SGD, RMSProp

The Real Interview Moment

You are training a ResNet-50 for image classification. You use Adam with the default learning rate of 1e-3. Training loss drops quickly at first, then plateaus at 2.1. A colleague suggests switching to SGD with momentum. You switch. The loss initially drops more slowly, but after 90 epochs it reaches 1.8, a noticeably better final value.

Meanwhile, another colleague is training a BERT model and insists Adam is essential. She tried SGD for a week and it never converged at all. You both reach the same conclusion: "Use what works." But neither of you understands why the two optimizers behave so differently.

This lesson explains the mechanism behind every major optimizer, derives why Adam and SGD behave differently on different types of loss landscapes, and gives you the vocabulary to make principled choices - not just copy hyperparameters from papers.

Optimizer Family Tree

Gradient Descent Foundation

The simplest optimizer updates each parameter by the gradient of the loss:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$$

Where $\eta$ is the learning rate. Two fundamental problems:

  1. Computing $\nabla_\theta L$ over the full dataset is prohibitively expensive for large neural networks
  2. No adaptivity - same step size for all parameters regardless of their loss surface geometry

Stochastic Gradient Descent uses a mini-batch estimate:

$$\theta_{t+1} = \theta_t - \eta \hat{g}_t \quad \text{where } \hat{g}_t = \nabla_\theta L_{\mathcal{B}_t}(\theta_t)$$

Mini-batch gradients are noisy estimates of the true gradient. The noise is manageable: with dataset size $N$ and batch size $B$, you get $N/B$ gradient steps per epoch instead of 1, and the noise can act as a regularizer.

SGD with Momentum: Building Velocity

Raw SGD oscillates across narrow loss landscape valleys and makes slow progress along the valley floor. Momentum adds a "velocity" term:

$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \nabla_\theta L_{\mathcal{B}_t}(\theta_t)$$

$$\theta_{t+1} = \theta_t - \eta \mathbf{v}_t$$

Where $\beta$ is the momentum coefficient (typically 0.9). Expanding the velocity recursion:

$$\mathbf{v}_t = \sum_{k=0}^{t} \beta^{t-k} \nabla L_k$$

The velocity is an exponentially weighted sum of past gradients - gradients from $k$ steps ago contribute with weight $\beta^k$. With $\beta = 0.9$, gradients from 10 steps ago still contribute with weight $0.9^{10} \approx 0.35$ to the current velocity.

Physical intuition: the parameter is a ball rolling down a hill. Momentum lets it build up speed in consistent gradient directions (the ball accelerates), while oscillations across narrow dimensions cancel out (alternating signs in the gradient diminish the velocity in that dimension). In a long narrow valley, the ball accelerates along the valley axis and oscillations across it cancel.
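The valley intuition can be checked numerically. The sketch below is an illustrative toy, not a benchmark: it assumes a hypothetical quadratic $f(x, y) = \frac{1}{2}(x^2 + 100y^2)$, a learning rate of 0.01, and 100 steps, and compares plain gradient descent ($\beta = 0$) with momentum ($\beta = 0.9$) using the velocity update given above.

```python
def quadratic_grad(x, y):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2), a long narrow valley."""
    return x, 100.0 * y


def loss(x, y):
    return 0.5 * (x ** 2 + 100.0 * y ** 2)


def run(beta, lr=0.01, steps=100, start=(5.0, 2.0)):
    """Gradient descent with momentum coefficient beta (beta=0 is plain GD)."""
    x, y = start
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = quadratic_grad(x, y)
        vx = beta * vx + gx   # v_t = beta * v_{t-1} + grad
        vy = beta * vy + gy
        x -= lr * vx          # theta_{t+1} = theta_t - lr * v_t
        y -= lr * vy
    return loss(x, y)


loss_plain = run(beta=0.0)
loss_momentum = run(beta=0.9)
print(f"plain GD loss after 100 steps: {loss_plain:.4f}")
print(f"momentum loss after 100 steps: {loss_momentum:.6f}")
```

With the same learning rate, plain gradient descent is still far from the optimum along the shallow $x$ axis, while momentum has built up speed in that direction and ends with a much lower loss.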

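Alternative SGD+momentum formulation (normalized variant):

$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta) \nabla L_t$$

This makes the velocity a proper exponential moving average: the weights on past gradients sum to (nearly) 1, so the velocity has the same scale as a single gradient. Note that PyTorch's torch.optim.SGD uses the unnormalized form from the previous section by default. It computes $\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1 - \text{dampening}) \nabla L_t$ with dampening=0, so the normalized variant corresponds to setting dampening equal to $\beta$ (apart from PyTorch's handling of the very first step). The two forms differ only by a factor of $1/(1-\beta)$ in the effective learning rate.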

Nesterov Momentum: Look-Ahead Gradient

Nesterov accelerated gradient (NAG) evaluates the gradient at the "look-ahead" position - where momentum alone would take the parameters - rather than the current position:

$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \nabla_\theta L(\theta_t - \eta\beta \mathbf{v}_{t-1})$$

$$\theta_{t+1} = \theta_t - \eta \mathbf{v}_t$$

Why this helps: standard momentum evaluates the gradient at the current position and adds it to the velocity. Nesterov first takes the momentum step, evaluates the gradient there, then corrects. If the momentum step is overshooting a valley, the gradient at the look-ahead position points back toward the valley floor - Nesterov corrects more quickly. In practice, Nesterov often converges slightly faster than standard momentum and overshoots less near minima.

import torch
import torch.nn as nn


def configure_sgd(model: nn.Module, lr: float = 0.01, momentum: float = 0.9,
                  nesterov: bool = True, weight_decay: float = 5e-4) -> torch.optim.SGD:
    """
    SGD with Nesterov momentum - a common choice for vision models trained from scratch.
    Nesterov usually matches or slightly beats standard momentum.
    """
    return torch.optim.SGD(
        model.parameters(),
        lr=lr,
        momentum=momentum,
        nesterov=nesterov,
        weight_decay=weight_decay,
    )


# Typical ResNet training configuration (assumes `model` is an nn.Module defined elsewhere)
optimizer = configure_sgd(model, lr=0.1, momentum=0.9, weight_decay=5e-4)
# Learning rate schedule: 10x step decay at epochs 30, 60, 80 (call scheduler.step() once per epoch)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 80], gamma=0.1)

AdaGrad: Per-Parameter Adaptive Learning Rates

SGD uses the same learning rate for all parameters. AdaGrad (Duchi et al., 2011) adapts per-parameter based on historical gradient magnitude:

$$G_{t,i} = G_{t-1,i} + g_{t,i}^2 \quad \text{(cumulative sum of squared gradients)}$$

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} g_{t,i}$$

Intuition: parameters with large historical gradients get smaller effective learning rates; those with small historical gradients get larger effective rates. For sparse features (e.g., word embeddings), the gradient is zero for most vocabulary items in any given batch; only the few tokens that appear receive updates. AdaGrad's per-parameter rates allow infrequent features to receive aggressive updates when they do appear.

The critical problem: $G_{t,i}$ is a cumulative sum that only grows. After enough training steps, $\sqrt{G_{t,i}}$ grows without bound for every parameter that keeps receiving nonzero gradients, driving its effective learning rate toward zero. AdaGrad works well early in training and in convex optimization, but effectively stops learning in long training runs. This makes it mostly obsolete for deep learning.

RMSProp: Fixing AdaGrad's Dying Learning Rates

RMSProp (Hinton, unpublished, described in a Coursera lecture in 2012) replaces the cumulative sum with an exponential moving average of squared gradients:

$$v_{t,i} = \beta_2 v_{t-1,i} + (1 - \beta_2) g_{t,i}^2$$

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{v_{t,i} + \epsilon}} g_{t,i}$$

Where $\beta_2$ (typically in the range $[0.9, 0.999]$) controls how quickly old gradient information is forgotten. The exponential moving average gives recent gradients higher weight, so the effective learning rate does not monotonically decay - it tracks recent gradient magnitudes and remains stable over long runs.
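To make the difference between the two accumulators concrete, here is a small sketch that tracks the effective learning rate $\eta / \sqrt{\text{state} + \epsilon}$ for AdaGrad and RMSProp. It assumes an artificial constant gradient of 1.0 and illustrative values $\eta = 0.1$, $\beta_2 = 0.99$; real gradients vary, so treat this only as a picture of the trend.

```python
import math


def effective_lrs(steps, lr=0.1, beta2=0.99, eps=1e-8, grad=1.0):
    """Track lr / sqrt(state + eps) for AdaGrad and RMSProp under a constant gradient."""
    G = 0.0   # AdaGrad: cumulative sum of squared gradients
    v = 0.0   # RMSProp: exponential moving average of squared gradients
    history = []
    for t in range(1, steps + 1):
        G += grad ** 2
        v = beta2 * v + (1 - beta2) * grad ** 2
        history.append((t, lr / math.sqrt(G + eps), lr / math.sqrt(v + eps)))
    return history


history = effective_lrs(10_000)
for t, adagrad_lr, rmsprop_lr in history:
    if t in (1, 100, 10_000):
        print(f"t={t:>6}  AdaGrad lr={adagrad_lr:.5f}  RMSProp lr={rmsprop_lr:.5f}")
```

AdaGrad's effective rate keeps shrinking like $1/\sqrt{t}$, while RMSProp's settles near $\eta / |g|$ once the moving average has warmed up.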

When RMSProp excels: RNNs (where gradients can change scale dramatically across timesteps), non-stationary loss landscapes, and tasks where AdaGrad's dying learning rates are problematic. RMSProp was introduced as a mini-batch-friendly adaptation of Rprop, was widely adopted for training RNNs, and remains competitive in that domain.

# RMSProp configuration (assumes `torch` is imported and `model` is an nn.Module)
optimizer_rmsprop = torch.optim.RMSprop(
    model.parameters(),
    lr=1e-3,
    alpha=0.99,      # decay rate for squared gradient EMA
    eps=1e-8,
    weight_decay=0,
    momentum=0,      # can optionally add momentum
)

Adam: Combining Momentum and RMSProp

Adam (Kingma and Ba, 2015) combines momentum (first moment estimate) with RMSProp (second moment estimate) and adds bias correction for both:

First moment estimate (smoothed gradient - like momentum):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

Second moment estimate (smoothed squared gradient - like RMSProp):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

Bias correction - the critical step:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Parameter update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

Why Bias Correction Is Necessary

Both $m_0 = 0$ and $v_0 = 0$ are initialized to zero. At step $t = 1$:

$$m_1 = \beta_1 \cdot 0 + (1 - \beta_1) \cdot g_1 = (1 - \beta_1) g_1$$

With $\beta_1 = 0.9$: $m_1 = 0.1 \cdot g_1$. The first moment estimate is biased toward zero by a factor of 0.1 - the gradient signal is downscaled 10x in early training. This initialization bias is sometimes called the cold start problem. The second moment is biased even more strongly: with $\beta_2 = 0.999$, $v_1 = 0.001 \cdot g_1^2$, a 1000x underestimate. Because the update divides $m_t$ by $\sqrt{v_t}$, the two biases do not cancel: without correction the first step would be $0.1 / \sqrt{0.001} \approx 3.2$ times larger than intended, and early steps would stay poorly scaled until both averages warm up.

Bias correction divides by $(1 - \beta_1^t)$, which equals $1 - 0.9^1 = 0.1$ at $t = 1$, canceling the 0.1 factor exactly. At $t = 100$: $1 - 0.9^{100} \approx 0.99997$, so the first-moment correction has essentially no effect after the first few dozen steps. The second-moment correction fades more slowly: with $\beta_2 = 0.999$, $1 - 0.999^{1000} \approx 0.63$, so it remains noticeable for a few thousand steps. Bias correction matters most at the very start of training.
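The arithmetic of the first step is easy to verify directly. The following sketch uses Adam's default $\beta_1 = 0.9$, $\beta_2 = 0.999$ and a hypothetical first gradient `g1 = 0.5` (any nonzero value gives the same ratio); it compares the first update with and without bias correction and prints how quickly each correction factor approaches 1.

```python
import math

beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8
g1 = 0.5  # hypothetical first gradient value

# Moments after the first update, starting from m_0 = v_0 = 0
m1 = (1 - beta1) * g1
v1 = (1 - beta2) * g1 ** 2

step_uncorrected = lr * m1 / (math.sqrt(v1) + eps)
step_corrected = lr * (m1 / (1 - beta1)) / (math.sqrt(v1 / (1 - beta2)) + eps)
ratio = step_uncorrected / step_corrected

print(f"uncorrected first step: {step_uncorrected:.6f}")
print(f"corrected first step:   {step_corrected:.6f}")
print(f"ratio:                  {ratio:.3f}")

# How quickly each correction factor approaches 1
for t in (1, 10, 100, 1000, 10000):
    print(f"t={t:>5}  1-beta1^t={1 - beta1 ** t:.5f}  1-beta2^t={1 - beta2 ** t:.5f}")
```

The corrected first step has magnitude $\eta$ (up to $\epsilon$), while the uncorrected one is about 3.16 times larger, since $0.1 / \sqrt{0.001} \approx 3.16$.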

Default Hyperparameters

| Hyperparameter | Default | Notes |
|---|---|---|
| $\eta$ (learning rate) | 1e-3 | Most important - tune this first |
| $\beta_1$ (momentum decay) | 0.9 | Rarely needs tuning |
| $\beta_2$ (RMS decay) | 0.999 | Occasionally try 0.95 for small batches |
| $\epsilon$ | 1e-8 | Rarely needs tuning; increase to 1e-6 if training is unstable |

AdamW: Decoupled Weight Decay (Loshchilov and Hutter, 2019)

Standard Adam with L2 regularization adds $\frac{\lambda}{2}\|\theta\|^2$ to the loss, meaning the weight decay gradient $\lambda \theta_i$ is treated like any other gradient component. Strictly, $\lambda\theta$ is added to $g_t$ before the moment estimates are computed; schematically, the update is:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{\hat{v}_{t,i}} + \epsilon} (\hat{m}_{t,i} + \lambda \theta_{t,i})$$

The weight decay is scaled by the inverse square root of the second moment - exactly like the gradient. Parameters with large gradient history (for which Adam has already reduced the effective LR) also receive reduced weight decay, so regularization is weakest exactly where gradients are largest. This is the wrong behavior.

AdamW decouples weight decay from the adaptive gradient scaling:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t - \eta \lambda \theta_t$$

The weight decay term $-\eta \lambda \theta_t$ is applied directly to the parameters, independently of the adaptive learning rate. This matches the intended weight decay behavior - all parameters shrink toward zero at the same proportional rate regardless of their gradient history.
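A toy simulation makes the coupling visible. The sketch below is an illustration under stated assumptions, not a faithful model of real training: each scalar weight starts at 1.0 and receives a hypothetical zero-mean "loss" gradient that alternates between `+grad_scale` and `-grad_scale`, so any net shrinkage comes from weight decay alone. The helper `run` and its arguments are made up for this example.

```python
import math


def run(decoupled, grad_scale, steps=1000, lr=0.01, wd=0.1,
        beta1=0.9, beta2=0.999, eps=1e-8):
    """Train one scalar weight with Adam + L2 (decoupled=False) or AdamW (decoupled=True)."""
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_scale if t % 2 else -grad_scale   # zero-mean loss gradient
        if not decoupled:
            g += wd * theta                        # Adam + L2: decay enters the gradient
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            theta -= lr * wd * theta               # AdamW: decay applied directly
    return theta


l2_small, l2_large = run(False, 0.1), run(False, 10.0)
adamw_small, adamw_large = run(True, 0.1), run(True, 10.0)
print(f"Adam + L2: small-gradient weight={l2_small:.3f}, large-gradient weight={l2_large:.3f}")
print(f"AdamW:     small-gradient weight={adamw_small:.3f}, large-gradient weight={adamw_large:.3f}")
```

Under Adam + L2 the weight with large gradients barely shrinks while the weight with small gradients is driven almost to zero. Under AdamW both weights shrink by nearly the same factor, roughly $(1 - \eta\lambda)^{1000}$.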

Practical consequence: AdamW consistently outperforms Adam + L2 regularization in large model training. The difference is particularly noticeable in transformer models where some layers have very different gradient magnitudes. The HuggingFace Transformers library defaults to AdamW. For most new work involving transformers, AdamW is the correct choice.

import torch
import torch.nn as nn


class AdamWFromScratch:
    """
    Minimal AdamW implementation on PyTorch tensors for pedagogical clarity.
    Shows exactly what each step does.
    """

    def __init__(self, params: list, lr: float = 1e-3, betas: tuple = (0.9, 0.999),
                 eps: float = 1e-8, weight_decay: float = 0.01):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.weight_decay = weight_decay
        self.t = 0  # step counter

        # Initialize moment estimates to zero
        self.m = [torch.zeros_like(p) for p in self.params]  # first moments
        self.v = [torch.zeros_like(p) for p in self.params]  # second moments

    @torch.no_grad()  # parameter updates must not be tracked by autograd
    def step(self) -> None:
        self.t += 1

        # Bias correction factors
        bc1 = 1 - self.beta1 ** self.t  # approaches 1 as t grows
        bc2 = 1 - self.beta2 ** self.t

        for i, p in enumerate(self.params):
            if p.grad is None:
                continue

            g = p.grad

            # Update first moment (momentum)
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g

            # Update second moment (RMS)
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g

            # Bias-corrected estimates
            m_hat = self.m[i] / bc1
            v_hat = self.v[i] / bc2

            # Adam gradient step (in place)
            p -= self.lr * m_hat / (v_hat.sqrt() + self.eps)

            # Decoupled weight decay - SEPARATE from adaptive gradient
            # This is the key difference from Adam + L2
            p -= self.lr * self.weight_decay * p


def configure_adamw(model: nn.Module, lr: float = 1e-3,
                    weight_decay: float = 0.01) -> torch.optim.AdamW:
    """
    AdamW with proper parameter group separation.
    Biases and norm parameters should NOT have weight decay.
    Weight decay on biases provides little regularization benefit and
    can hurt convergence by pulling biases toward zero arbitrarily.
    """
    decay_params = []
    no_decay_params = []

    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen parameters do not belong in the optimizer
        # No weight decay for 1D parameters (biases) and norm layers
        if param.ndim == 1 or 'norm' in name.lower() or 'bias' in name.lower():
            no_decay_params.append(param)
        else:
            decay_params.append(param)

    return torch.optim.AdamW(
        [
            {"params": decay_params, "weight_decay": weight_decay},
            {"params": no_decay_params, "weight_decay": 0.0},
        ],
        lr=lr,
        betas=(0.9, 0.999),
        eps=1e-8,
    )

LAMB: Large-Batch Optimizer for BERT (You et al., 2020)

LAMB (Layer-wise Adaptive Moments optimizer for Batch training) extends AdamW to enable efficient large-batch training. It scales the Adam update by the ratio of parameter norm to update norm:

$$\theta_{t+1} = \theta_t - \eta \cdot \frac{\|\theta_t\|}{\|\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda\theta_t\|} \cdot \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\theta_t\right)$$

The additional ratio $\|\theta\| / \|\text{update}\|$ is the layerwise adaptive rate, computed separately for each layer - it normalizes the update magnitude to be proportional to the parameter norm. Large layers (with large parameter norms) receive larger absolute updates; small layers receive smaller ones.

Why this matters for large-batch training: LAMB enabled training BERT in 76 minutes on TPUs using a batch size of 32,768 - previously requiring days. At large batch sizes, gradient noise is low, so the gradient direction is very reliable. LAMB's layerwise normalization lets large batches take correspondingly large steps without destabilizing small layers.
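The layerwise ratio is simple to compute by hand. The sketch below applies the scaling from the formula above to a single layer stored as a plain list; `lamb_layer_update` is a hypothetical helper, the Adam direction is supplied as an input rather than computed, and the fallback to a ratio of 1 when a norm is zero is an assumed convention rather than something stated in this lesson.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def lamb_layer_update(theta, adam_direction, lr, weight_decay=0.0):
    """
    Apply the LAMB scaling to one layer.
    theta: current weights of the layer (list of floats)
    adam_direction: m_hat / (sqrt(v_hat) + eps) for the same layer
    """
    update = [d + weight_decay * w for d, w in zip(adam_direction, theta)]
    theta_norm, update_norm = l2_norm(theta), l2_norm(update)
    # Assumption: fall back to a ratio of 1 when either norm is zero
    trust_ratio = theta_norm / update_norm if theta_norm > 0 and update_norm > 0 else 1.0
    new_theta = [w - lr * trust_ratio * u for w, u in zip(theta, update)]
    return new_theta, trust_ratio


# Two layers with the same Adam direction but very different weight norms
small_layer, small_ratio = lamb_layer_update([0.03, 0.04], [1.0, 0.0], lr=0.1)
large_layer, large_ratio = lamb_layer_update([3.0, 4.0], [1.0, 0.0], lr=0.1)
print(f"small layer: ratio={small_ratio:.3f}, weights={small_layer}")
print(f"large layer: ratio={large_ratio:.3f}, weights={large_layer}")
```

Both layers move by the same fraction of their own norm (here 10%), which is what keeps a single large learning rate from overwhelming layers with small weights.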

Lion: Sign-Based Optimizer (Chen et al., 2023)

Lion (EvoLved Sign Momentum) was discovered by Google Brain via program search:

$$c_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$$\theta_{t+1} = \theta_t - \eta (\text{sign}(c_t) + \lambda \theta_t)$$

$$m_t = \beta_2 m_{t-1} + (1 - \beta_2) g_t$$

Key properties:

  • Uses only the sign of the update, not the magnitude - before weight decay, every parameter moves by exactly $\pm \eta$
  • More memory efficient than Adam: only one momentum buffer (not two)
  • Decoupled weight decay, as in AdamW
  • Typically needs a smaller learning rate (and a correspondingly larger weight decay) than AdamW, because the sign update has a larger norm; its reported gains tend to grow with batch size

Lion has shown competitive results with Adam in large-scale image and language model training, while roughly halving optimizer-state memory (one buffer per parameter instead of two).
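The three Lion equations translate almost line for line into code. This is a minimal scalar sketch, not a production optimizer: `lion_step` is a hypothetical helper, and the defaults $\beta_1 = 0.9$, $\beta_2 = 0.99$ are the values commonly quoted for Lion, taken here as an assumption.

```python
def sign(x):
    return (x > 0) - (x < 0)


def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a scalar parameter. Returns (new_theta, new_m)."""
    c = beta1 * m + (1 - beta1) * grad                   # interpolation used for the update
    theta = theta - lr * (sign(c) + weight_decay * theta)
    m = beta2 * m + (1 - beta2) * grad                   # momentum stored for the next step
    return theta, m


theta, m = 1.0, 0.0
for grad in (0.5, 0.002, 40.0):      # wildly different gradient magnitudes
    new_theta, m = lion_step(theta, m, grad, lr=0.1)
    print(f"grad={grad:>6}: step size = {abs(new_theta - theta):.3f}")
    theta = new_theta
```

Every step has the same size (the learning rate) regardless of gradient magnitude, which is why Lion's learning rate usually has to be chosen differently from Adam's.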

AMSGrad: Convergence Fix for Adam

Adam does not converge in some settings - the original convergence proof had a gap. AMSGrad (Reddi et al., 2018) fixes this by using the maximum of past second moment estimates:

$$\hat{v}_t^{\max} = \max(\hat{v}_{t-1}^{\max}, \hat{v}_t)$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t^{\max}} + \epsilon} \hat{m}_t$$

The non-decreasing $\hat{v}_t^{\max}$ guarantees the effective learning rate never increases, fixing the theoretical convergence issue. In practice, the improvement over Adam is marginal on most tasks.
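The effect of the running maximum shows up most clearly after a gradient spike. The sketch below assumes an artificial gradient sequence (one spike of 10.0 followed by 200 gradients of 0.1) and tracks only the denominators $\sqrt{\hat{v}_t}$ (Adam) and $\sqrt{\hat{v}_t^{\max}}$ (AMSGrad), following the formulas above.

```python
beta2 = 0.999
grads = [10.0] + [0.1] * 200   # one large spike followed by small gradients

v, v_max = 0.0, 0.0
adam_denominators, amsgrad_denominators = [], []
for t, g in enumerate(grads, start=1):
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = v / (1 - beta2 ** t)      # bias-corrected second moment
    v_max = max(v_max, v_hat)         # AMSGrad: running maximum
    adam_denominators.append(v_hat ** 0.5)
    amsgrad_denominators.append(v_max ** 0.5)

print(f"Adam    sqrt(v_hat): first={adam_denominators[0]:.3f}, last={adam_denominators[-1]:.3f}")
print(f"AMSGrad sqrt(v_max): first={amsgrad_denominators[0]:.3f}, last={amsgrad_denominators[-1]:.3f}")
```

Adam's denominator shrinks as the spike fades from the moving average, so its effective learning rate rises again; AMSGrad's denominator stays at the peak value. In PyTorch, the same behavior is available through the `amsgrad=True` flag on `torch.optim.Adam` and `torch.optim.AdamW`.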

Full NumPy Adam from Scratch

from typing import Optional

import numpy as np


class AdamOptimizer:
    """
    Full Adam implementation in NumPy for pedagogical clarity.
    Includes all components: bias correction, optional decoupled weight decay
    (AdamW behavior), and gradient clipping.
    """

    def __init__(self, params: list[np.ndarray], lr: float = 1e-3,
                 beta1: float = 0.9, beta2: float = 0.999,
                 eps: float = 1e-8, weight_decay: float = 0.0,
                 max_grad_norm: Optional[float] = None):
        self.params = params
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.weight_decay = weight_decay
        self.max_grad_norm = max_grad_norm
        self.t = 0

        # Moment buffers start at zero - this is why bias correction is needed
        self.m = [np.zeros_like(p) for p in params]  # first moments
        self.v = [np.zeros_like(p) for p in params]  # second moments

    def clip_grad_norm(self, grads: list[np.ndarray]) -> list[np.ndarray]:
        """Clip gradient by global L2 norm."""
        if self.max_grad_norm is None:
            return grads
        total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
        if total_norm > self.max_grad_norm:
            clip_coef = self.max_grad_norm / (total_norm + 1e-12)
            grads = [g * clip_coef for g in grads]
        return grads

    def step(self, grads: list[np.ndarray]) -> None:
        """Apply one Adam update step (parameters are modified in place)."""
        assert len(grads) == len(self.params)
        self.t += 1

        # Optional gradient clipping
        grads = self.clip_grad_norm(grads)

        # Bias correction factors (both approach 1 as t grows)
        bc1 = 1 - self.beta1 ** self.t
        bc2 = 1 - self.beta2 ** self.t

        for i, (p, g) in enumerate(zip(self.params, grads)):
            # First moment: running mean of gradient
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g

            # Second moment: running mean of squared gradient
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2

            # Bias-corrected estimates
            m_hat = self.m[i] / bc1
            v_hat = self.v[i] / bc2

            # Adam gradient step
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

            # Decoupled weight decay (AdamW behavior when weight_decay > 0)
            if self.weight_decay > 0:
                p -= self.lr * self.weight_decay * p


def train_with_numpy_adam():
    """Demonstrate Adam on a simple quadratic."""
    # Minimize: f(x, y) = x^2 + 10*y^2 (elongated quadratic - tests momentum)
    params = [np.array([5.0]), np.array([2.0])]  # start far from optimum
    optimizer = AdamOptimizer(params, lr=0.1, weight_decay=0.0)

    print(f"{'Step':>5} | {'x':>8} | {'y':>8} | {'loss':>12}")
    for step in range(50):
        x, y = params
        # Read the values before the update: the optimizer modifies the arrays in place
        x_val, y_val = x.item(), y.item()
        loss = x_val**2 + 10 * y_val**2
        grads = [2 * x, 20 * y]  # analytical gradients

        optimizer.step(grads)

        if step % 10 == 0:
            print(f"{step:>5} | {x_val:>8.4f} | {y_val:>8.4f} | {loss:>12.6f}")

    print(f"\nFinal position: x={params[0].item():.6f}, y={params[1].item():.6f}")
    print("Optimum is (0, 0)")


train_with_numpy_adam()

SGD vs Adam: When Each Wins

This is one of the most debated empirical questions in deep learning. The key distinction:

Adam wins on:

  • Language models and transformers (BERT, GPT, T5, LLaMA)
  • Tasks with sparse gradients (word embeddings, recommendation systems)
  • Fast convergence when you want results quickly
  • Fine-tuning pretrained models
  • Any task with heterogeneous gradient scales across layers

SGD with momentum wins on:

  • Image classification from scratch (ResNet, EfficientNet)
  • Object detection (well-tuned SGD on COCO)
  • Tasks where final generalization quality matters more than convergence speed

The generalization gap: empirically, SGD with momentum often finds flatter minima than Adam. Flat minima are widely believed to generalize better than sharp minima (Hochreiter and Schmidhuber, 1997). One hypothesis is that Adam's adaptive learning rates efficiently navigate sharp narrow valleys in the loss landscape - but these sharp valleys correspond to sharp minima that generalize worse to unseen data. SGD with larger learning rates tends to "bounce around" more, spending more time in flatter regions. This is still an active research area.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def compare_optimizers_on_task(model_fn, X_train: torch.Tensor, y_train: torch.Tensor,
                               X_val: torch.Tensor, y_val: torch.Tensor,
                               n_epochs: int = 100) -> dict:
    """
    Compare SGD, Adam, AdamW, and RMSProp on the same task for the same number of epochs.
    model_fn: callable that returns a fresh model instance.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    train_ds = TensorDataset(X_train, y_train)
    train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
    criterion = nn.CrossEntropyLoss()

    optimizer_configs = {
        "SGD+Nesterov": lambda m: torch.optim.SGD(
            m.parameters(), lr=0.01, momentum=0.9, nesterov=True, weight_decay=5e-4
        ),
        # In torch.optim.Adam, weight_decay is coupled L2 regularization
        "Adam": lambda m: torch.optim.Adam(
            m.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-4
        ),
        "AdamW": lambda m: torch.optim.AdamW(
            m.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.01
        ),
        "RMSProp": lambda m: torch.optim.RMSprop(
            m.parameters(), lr=1e-3, alpha=0.99
        ),
    }

    results = {}
    for name, opt_fn in optimizer_configs.items():
        torch.manual_seed(42)  # same initialization and data order for every optimizer
        model = model_fn().to(device)
        optimizer = opt_fn(model)

        train_losses = []
        for epoch in range(n_epochs):
            model.train()
            epoch_loss = 0.0
            for bx, by in train_loader:
                bx, by = bx.to(device), by.to(device)
                optimizer.zero_grad()
                loss = criterion(model(bx), by)
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()
            train_losses.append(epoch_loss / len(train_loader))

        model.eval()
        with torch.no_grad():
            val_logits = model(X_val.to(device))
            val_acc = (val_logits.argmax(1) == y_val.to(device)).float().mean().item()

        # First epoch whose loss falls below half of the first-epoch loss
        convergence_epoch = next(
            (i for i, epoch_loss in enumerate(train_losses)
             if epoch_loss < train_losses[0] * 0.5),
            n_epochs,
        )
        results[name] = {
            "final_train_loss": train_losses[-1],
            "val_acc": val_acc,
            "convergence_epoch": convergence_epoch,
        }
        print(f"{name:20s}: train_loss={train_losses[-1]:.4f}, val_acc={val_acc:.3f}, "
              f"50%-convergence at epoch {convergence_epoch}")

    return results

Parameter Groups: Fine-Grained Learning Rate Control

Different parts of a model often need different learning rates - a critical production technique:

import re

import torch
import torch.nn as nn


def configure_layerwise_lr_decay(model: nn.Module, base_lr: float,
                                 n_layers: int = 12,
                                 lr_decay: float = 0.8,
                                 weight_decay: float = 0.01) -> torch.optim.AdamW:
    """
    Layer-wise learning rate decay for transformer fine-tuning.
    Earlier layers get smaller learning rates - they contain more general features
    that should be preserved. Later layers get larger rates - they need to adapt.

    LR at depth d = base_lr * lr_decay^(n_layers - d)
    e.g., n_layers=12, lr_decay=0.8:
        Depth 0 (embedding):  base_lr * 0.8^12 ≈ 0.069 * base_lr
        Depth 6 (middle):     base_lr * 0.8^6  ≈ 0.262 * base_lr
        Depth 12 (top):       base_lr * 0.8^0  = base_lr

    Simplified name matching (an assumption about how the model names its parameters):
    "layer.<i>." or "layers.<i>." marks transformer block i (depth i + 1), names
    containing "embed" are depth 0, and everything else (pooler, task head) is the top.
    """
    layer_pattern = re.compile(r"layers?\.(\d+)\.")
    no_decay_keys = ("bias", "norm")  # lowercase match also covers LayerNorm / layer_norm

    # Map (depth, use_weight_decay) -> parameters, so every parameter lands in exactly one group
    groups: dict = {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        match = layer_pattern.search(name)
        if match:
            depth = min(int(match.group(1)) + 1, n_layers)
        elif "embed" in name.lower():
            depth = 0
        else:
            depth = n_layers
        use_decay = not any(key in name.lower() for key in no_decay_keys)
        groups.setdefault((depth, use_decay), []).append(param)

    param_groups = [
        {"params": params,
         "lr": base_lr * lr_decay ** (n_layers - depth),
         "weight_decay": weight_decay if use_decay else 0.0}
        for (depth, use_decay), params in sorted(groups.items())
    ]

    return torch.optim.AdamW(param_groups, lr=base_lr)


# Fine-tuning example: backbone gets 10x smaller LR than new head
def finetune_optimizer(backbone: nn.Module, head: nn.Module,
                       base_lr: float = 1e-3) -> torch.optim.AdamW:
    """Separate learning rates for pretrained backbone vs new classification head."""
    return torch.optim.AdamW([
        {"params": backbone.parameters(), "lr": base_lr * 0.1, "weight_decay": 0.0},
        {"params": head.parameters(), "lr": base_lr, "weight_decay": 0.01},
    ])

Optimizer Selection Guide

Learning Rate Selection Guide

All optimizer machinery is useless without the right learning rate. Rules of thumb for starting points:

| Optimizer | Typical LR Range | Common Starting Point |
|---|---|---|
| SGD (no momentum) | 0.1–1.0 | 0.1 |
| SGD + momentum | 0.01–0.1 | 0.01 |
| Adam | 1e-4–1e-3 | 1e-3 |
| AdamW (transformers, pretrain) | 1e-4–5e-4 | 3e-4 |
| AdamW (fine-tuning) | 1e-6–1e-4 | 1e-5 |
| RMSProp | 1e-4–1e-3 | 1e-3 |

The key insight: Adam is much less sensitive to learning rate than SGD. Adam's default hyperparameters work reasonably for most tasks. SGD requires more careful LR tuning but can find better optima when properly configured. For fast prototyping, start with Adam. For pushing the last percentage points on a well-understood task, try SGD with momentum.

:::warning Gradient accumulation and optimizer.step()
When using gradient accumulation, you perform multiple backward passes before calling `optimizer.step()`. The gradients accumulate (add up) across these passes. If you do not divide the loss by the number of accumulation steps, the accumulated gradient is `accumulation_steps` times the average gradient. With SGD this multiplies the effective learning rate by the same factor, and any gradient clipping threshold or logged gradient norm refers to the wrong scale. Adam's normalized update is largely insensitive to a constant gradient scale, but clipping, `eps`, and coupled L2 weight decay are still affected. Always divide the loss by accumulation steps: `loss = criterion(output, target) / grad_accum_steps`.
:::
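The scaling rule for gradient accumulation can be checked on a toy problem. The sketch below assumes a hypothetical one-parameter linear model with mean-squared-error loss and two equal-sized micro-batches; `mse_grad` is an illustrative helper with a hand-derived gradient, standing in for `loss.backward()`.

```python
def mse_grad(w, batch):
    """d/dw of mean((w*x - y)^2) over the batch, for a 1-parameter linear model."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]   # toy (x, y) pairs
w = 0.5
micro_batches = [data[:2], data[2:]]                      # equal-sized micro-batches
accum_steps = len(micro_batches)

full_batch_grad = mse_grad(w, data)

# Correct: scale each micro-batch loss (hence its gradient) by 1 / accum_steps
scaled = sum(mse_grad(w, mb) / accum_steps for mb in micro_batches)

# Incorrect: summing unscaled gradients overshoots by a factor of accum_steps
unscaled = sum(mse_grad(w, mb) for mb in micro_batches)

print(f"full batch: {full_batch_grad}, scaled accumulation: {scaled}, unscaled: {unscaled}")
```

The scaled accumulation reproduces the full-batch gradient, while the unscaled sum is exactly `accum_steps` times too large. The equivalence relies on the micro-batches having equal size.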

:::danger zero_grad() Before Every Backward Pass
PyTorch accumulates gradients by default - calling `.backward()` adds to existing gradients rather than replacing them. If you forget `optimizer.zero_grad()` before computing a new batch's loss and backward pass, gradients from previous batches accumulate, the effective gradient grows each step, and the loss typically diverges. This is one of the most common PyTorch bugs. The standard pattern: `optimizer.zero_grad()` → `loss.backward()` → `optimizer.step()`. Deviate from this order only deliberately, as in gradient accumulation.
:::

YouTube Resources

| Video | Channel | Why Watch It |
|---|---|---|
| Gradient Descent, How Neural Networks Learn | 3Blue1Brown | Visual intuition for gradient descent and loss landscapes |
| Adam Optimizer Explained | Andrej Karpathy | Derivation with code, bias correction visual |
| Why Momentum Works | Yannic Kilcher | Physical intuition and convergence theory |
| AdamW vs Adam | Fast.ai | Practical explanation of decoupled weight decay |
| CS231n - Optimization Algorithms | Stanford CS231n | Complete optimizer survey with SGD vs Adam comparison |

Interview Q&A

Q1: Explain how Adam combines momentum and RMSProp and what problem each addresses.

Adam maintains two exponential moving averages. The first moment ($m_t$), like momentum, accumulates a smoothed estimate of the gradient direction - this reduces oscillations and builds speed in consistent gradient directions, similar to a ball gaining momentum on a slope. The second moment ($v_t$), like RMSProp, accumulates a smoothed estimate of the squared gradient magnitude - this creates per-parameter adaptive learning rates: parameters with large gradient history receive small effective learning rates, and parameters with small gradient history receive large effective rates. The division $m_t / \sqrt{v_t}$ normalizes the gradient direction by its historical magnitude, making the effective step size approximately constant across parameters regardless of their gradient scale. Bias correction addresses the cold start: both moments are initialized to zero, so early estimates are artificially small (by factors of $1-\beta_1^t$ and $1-\beta_2^t$). Dividing by these factors restores the correct scale in early training.

Q2: What is AdamW and why is it preferred over Adam with L2 regularization?

With standard Adam and L2 regularization, the weight decay gradient $\lambda\theta_i$ is treated as part of the regular gradient and divided by $\sqrt{\hat{v}_{t,i}}$ - the adaptive scaling factor. Parameters with large gradient history (for which Adam has already reduced the effective LR) also receive reduced weight decay. Weight decay is strongest for parameters that are updated least aggressively, which is backwards - frequently-updated parameters need more regularization. AdamW decouples weight decay from gradient scaling: the weight decay step $-\eta\lambda\theta$ is applied directly to parameters, independent of the adaptive learning rate. This ensures weight decay functions as intended: all parameters shrink toward zero at the same proportional rate regardless of gradient history. For transformer and language model training where different layers have vastly different gradient magnitudes, proper weight decay from AdamW is critical for regularization to work correctly.

Q3: Why does SGD with momentum sometimes generalize better than Adam?

This is empirically observed and only partly understood theoretically. The primary hypothesis involves loss landscape geometry: Adam's adaptive learning rates efficiently navigate sharp, narrow valleys in the loss landscape, finding minima quickly. But sharp minima - points where the loss changes rapidly with small parameter perturbations - are thought to generalize poorly. When the test distribution differs slightly from training, parameters in sharp minima produce large loss increases. SGD with momentum, using a larger learning rate and no adaptive scaling, tends to explore flatter regions. The "bouncing around" that makes SGD appear slower causes it to spend more time in regions where the loss is flat in multiple directions - flat minima generalize better because small distribution shifts only cause small loss increases. On ImageNet with ResNets, well-tuned SGD is often reported to achieve 0.5–1.5% higher top-1 accuracy than Adam. For language models, this effect is generally not observed, because heterogeneous gradient scales across embedding vs attention vs FFN layers make Adam's adaptivity essential.

Q4: What does bias correction do in Adam and why is it necessary?

Both moment estimates in Adam are initialized to zero. In the first step, the first moment is $m_1 = (1-\beta_1) g_1$. With $\beta_1 = 0.9$, this is 10% of the actual gradient - a 10x downscaling. The second moment is $v_1 = (1-\beta_2) g_1^2$. With $\beta_2 = 0.999$, this is 0.1% of $g_1^2$ - a 1000x downscaling. Without bias correction, the effective step size in the first few iterations would be wrong. Adam's first update would be $\eta \cdot m_1 / \sqrt{v_1} = \eta \cdot 0.1 g_1 / \sqrt{0.001 g_1^2} = \eta \cdot (0.1 / \sqrt{0.001}) \cdot \text{sign}(g_1) \approx 3.16\eta \cdot \text{sign}(g_1)$ - about 3x larger than the bias-corrected step $\eta \cdot \text{sign}(g_1)$, because the two biases do not cancel. Bias correction divides by $(1-\beta_1^t)$ and $(1-\beta_2^t)$, which cancel the startup bias. As training continues (large $t$), these factors approach 1 and the correction becomes negligible. Bias correction matters most at the start of training and whenever the optimizer state is reinitialized (moments and step counter reset to zero).

Q5: You are fine-tuning BERT for text classification. Which optimizer and what learning rate strategy?

Use AdamW with decoupled weight decay. Key configuration: (1) Learning rate: 1e-5 to 5e-5 for the pretrained BERT layers - too large will destroy pretrained representations. 1e-4 to 1e-3 for the new classification head. Use separate parameter groups to set different rates. (2) Weight decay: 0.01 for weight matrices, 0.0 for biases and LayerNorm parameters. (3) Schedule: linear warmup for roughly the first 6% of total training steps (often a few hundred steps), then linear decay to zero. (4) Gradient clipping: max_norm=1.0 - prevents gradient spikes that can destroy pretrained representations. (5) Layer-wise LR decay with decay factor 0.8–0.9: earlier BERT layers get smaller rates than later layers, preserving general language representations while allowing task-specific adaptation in higher layers. (6) Number of epochs: typically 2–4 for text classification - more leads to overfitting on small datasets. This recipe, which combines the fine-tuning settings reported in the BERT paper with common HuggingFace practice, works well across classification, NER, and extractive QA tasks.
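The warmup-then-decay schedule from point (3) can be written as a multiplier on the peak learning rate. The sketch below is one possible formulation under stated assumptions: `linear_warmup_decay` is a hypothetical helper, and the 5000 total steps and 2e-5 peak rate are made-up illustrative values.

```python
def linear_warmup_decay(step, total_steps, warmup_fraction=0.06):
    """Multiplier in [0, 1] applied to the peak learning rate at a given step."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        return step / warmup_steps                        # linear ramp 0 -> 1
    remaining = max(1, total_steps - warmup_steps)
    return max(0.0, (total_steps - step) / remaining)     # linear decay 1 -> 0


total_steps = 5000   # hypothetical length of a fine-tuning run
peak_lr = 2e-5
for step in (0, 150, 300, 2650, 5000):
    print(f"step {step:>4}: lr = {peak_lr * linear_warmup_decay(step, total_steps):.2e}")
```

A function of this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR` (for example via `lambda s: linear_warmup_decay(s, total_steps)`), with `scheduler.step()` called once per optimizer step rather than once per epoch.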

Gradient Clipping: Protecting Training Stability

Gradient clipping caps the global L2 norm of all gradients before the parameter update. This prevents gradient explosion from causing catastrophic parameter updates - especially important in transformers, RNNs, and deep MLPs.

$$\text{if } \|\mathbf{g}\|_2 > \tau: \quad \mathbf{g} \leftarrow \tau \cdot \frac{\mathbf{g}}{\|\mathbf{g}\|_2}$$

The clipping threshold $\tau$ is typically around 1.0 for transformers and around 5.0 for RNNs (which have higher gradient variability). The gradient direction is preserved - only the magnitude is clipped.

import torch
import torch.nn as nn


def train_step_with_clipping(model: nn.Module, batch, optimizer,
                             criterion, max_norm: float = 1.0) -> tuple[float, float]:
    """
    Standard training step with gradient clipping.
    Clipping happens AFTER backward() and BEFORE optimizer.step().
    Returns (loss, global gradient norm before clipping).
    """
    inputs, targets = batch
    optimizer.zero_grad()

    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()

    # clip_grad_norm_ rescales the gradients in place and returns the global
    # L2 norm measured BEFORE clipping, so it doubles as the monitoring value.
    total_norm_before = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=max_norm
    )

    optimizer.step()

    return loss.item(), float(total_norm_before)


# Monitoring gradient norms in production - helps detect training issues early
class GradientNormMonitor:
    """
    Track gradient norms over training. Useful for diagnosing:
    - Gradient explosion (norm suddenly spikes to 100+)
    - Gradient vanishing (norm consistently near 0)
    - Training instability (very high variance in norm over time)
    Call record() after backward() and before the gradients are zeroed.
    """

    def __init__(self, window_size: int = 100):
        self.norms = []
        self.window_size = window_size

    def record(self, model: nn.Module) -> float:
        # Global L2 norm = sqrt of the sum of squared per-parameter norms
        total_norm = 0.0
        for p in model.parameters():
            if p.grad is not None:
                total_norm += p.grad.detach().norm(2).item() ** 2
        total_norm = total_norm ** 0.5
        self.norms.append(total_norm)
        return total_norm

    @property
    def recent_mean(self) -> float:
        recent = self.norms[-self.window_size:]
        return sum(recent) / len(recent) if recent else 0.0

    @property
    def is_exploding(self) -> bool:
        # Compare the latest norm with the mean of the PREVIOUS window so the
        # spike does not inflate its own baseline (with a short history the
        # latest value could otherwise never exceed 10x the mean).
        previous = self.norms[-self.window_size - 1:-1]
        if not previous:
            return False
        return self.norms[-1] > 10 * (sum(previous) / len(previous))

    @property
    def is_vanishing(self) -> bool:
        return len(self.norms) > self.window_size and self.recent_mean < 1e-6

Optimizer State: Memory Costs and Considerations

A critical practical consideration for large models - optimizers carry significant memory overhead:

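| Optimizer | Extra memory per parameter | Notes |
| --- | --- | --- |
| SGD (no momentum) | 0 | No extra state |
| SGD + momentum | 1 buffer | Velocity vector |
| Adam / AdamW | 2 buffers | First + second moment |
| LAMB | 2 buffers | Same as Adam |
| Lion | 1 buffer | Only one moment |
| Adafactor | Less than 1 buffer (factored) | Keeps per-row and per-column second-moment statistics for large matrices instead of a full buffer |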

For a 7B parameter model (LLaMA-7B):

  • Model weights in float32: $7 \times 10^9 \times 4 = 28$ GB
  • Adam optimizer states (2 buffers, float32): $2 \times 28 = 56$ GB
  • Weights plus optimizer states: $84$ GB minimum, before gradients (another 28 GB in float32) and activations - requires multiple A100s

This is why:

  1. Large-scale training uses mixed precision: weights in bfloat16 (2 bytes) but optimizer states in float32 (4 bytes) for numerical stability
  2. Gradient checkpointing trades compute for memory (recomputing activations instead of storing them)
  3. Optimizer sharding (via ZeRO in DeepSpeed) distributes optimizer states across GPUs
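The effect of sharding optimizer states can be estimated with simple arithmetic. The sketch below is a simplification under stated assumptions: weights and gradients are replicated on every GPU, the two Adam buffers are split evenly across GPUs, and activations are ignored. The function name `adam_training_memory_gb` is hypothetical, and this is not DeepSpeed's actual accounting.

```python
def adam_training_memory_gb(n_params: float, n_gpus: int = 1,
                            weight_bytes: int = 4, state_bytes: int = 4) -> dict:
    """Per-GPU memory in GB for weights, gradients, and Adam states."""
    weights = n_params * weight_bytes / 1e9
    grads = n_params * weight_bytes / 1e9                    # same size as the weights
    adam_states = 2 * n_params * state_bytes / 1e9 / n_gpus  # two buffers, sharded
    return {"weights": weights, "grads": grads, "adam_states": adam_states,
            "total": weights + grads + adam_states}


for n_gpus in (1, 8):
    print(n_gpus, adam_training_memory_gb(7e9, n_gpus=n_gpus))
```

For a 7B-parameter model in float32, splitting the 56 GB of Adam states across 8 GPUs leaves 7 GB of optimizer state per GPU, while the replicated weights and gradients are unchanged.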

import torch.nn as nn


def estimate_training_memory(model: nn.Module, batch_size: int, seq_len: int,
                             precision: str = "fp16", d_model: int = 512,
                             n_layers: int = 12) -> dict:
    """
    Rough estimate of total GPU memory needed for training with Adam.
    Accounts for: parameters, gradients, optimizer states, activations.
    """
    n_params = sum(p.numel() for p in model.parameters())
    # fp16 and bf16 both use 2 bytes per value; anything else is treated as fp32
    bytes_per_param = 2 if precision in ("fp16", "bf16") else 4

    model_bytes = n_params * bytes_per_param
    grad_bytes = n_params * bytes_per_param  # gradients same size as params
    adam_bytes = n_params * 4 * 2  # float32 first + second moment

    # Rough activation estimate: depends heavily on architecture
    # Transformer: O(batch * seq_len * d_model * n_layers)
    # The d_model and n_layers defaults are placeholders; pass the real values.
    activation_bytes = batch_size * seq_len * d_model * n_layers * bytes_per_param

    total_bytes = model_bytes + grad_bytes + adam_bytes + activation_bytes

    return {
        "model_MB": model_bytes / 1e6,
        "gradients_MB": grad_bytes / 1e6,
        "adam_states_MB": adam_bytes / 1e6,
        "activations_MB": activation_bytes / 1e6,
        "total_MB": total_bytes / 1e6,
        "total_GB": total_bytes / 1e9,
        "n_params": f"{n_params / 1e6:.1f}M",
    }

Hyperparameter Sensitivity: Adam vs SGD

One of Adam's most practically important properties is its robustness to hyperparameter choice:

import math

import torch
import torch.nn as nn
import torch.optim as optim


def compare_lr_sensitivity():
    """
    Compare how sensitive SGD and Adam are to the learning rate.
    Both optimizers are trained on the same simple regression task
    with a range of learning rates.
    """
    def make_model():
        return nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

    # Simple regression: y = sin(2*pi*x)
    x = torch.linspace(-1, 1, 100).unsqueeze(1)
    y = torch.sin(2 * math.pi * x)

    def final_loss(optimizer_cls, lr: float, n_steps: int = 500) -> float:
        torch.manual_seed(0)  # identical initial weights for both optimizers
        model = make_model()
        opt = optimizer_cls(model.parameters(), lr=lr)
        for _ in range(n_steps):
            opt.zero_grad()
            loss = ((model(x) - y) ** 2).mean()
            loss.backward()
            opt.step()
        # Evaluate after the last update; a diverged run reports inf or nan
        with torch.no_grad():
            return ((model(x) - y) ** 2).mean().item()

    lr_values = [1e-4, 1e-3, 1e-2, 0.1, 1.0]

    print(f"\n{'LR':<10} | {'SGD final loss':>16} | {'Adam final loss':>16}")
    print("-" * 50)

    for lr in lr_values:
        sgd_loss = final_loss(optim.SGD, lr)
        adam_loss = final_loss(optim.Adam, lr)
        print(f"{lr:<10} | {sgd_loss:>16.4f} | {adam_loss:>16.4f}")

    # Pattern this experiment is meant to illustrate (exact thresholds depend
    # on the task and seed, so read the printed numbers rather than assume them):
    #   Small LR (1e-4): SGD barely moves, Adam converges slowly
    #   Medium LR (1e-3): both work, Adam usually faster
    #   Larger LRs: SGD tends to become unstable before Adam does


compare_lr_sensitivity()

The practical takeaway: with Adam, the default learning rate of 1e-3 is a reasonable starting point across a wide range of problems. With SGD, you usually have to tune the learning rate for each problem - too small and it trains slowly, too large and it diverges. This wider range of workable learning rates, often one to two orders of magnitude, is why Adam is the default choice when prototyping or when the LR tuning budget is limited.
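One mechanism behind this robustness is that Adam's update is nearly invariant to the overall scale of the gradients, whereas an SGD step scales with them. The sketch below illustrates this for a single scalar parameter; the function names and the gradient sequence are made up for the demonstration.

```python
import math


def sgd_step(g: float, lr: float) -> float:
    """Size of a plain SGD update for gradient g."""
    return lr * g


def adam_steps(grads: list[float], lr: float, beta1: float = 0.9,
               beta2: float = 0.999, eps: float = 1e-8) -> list[float]:
    """Successive bias-corrected Adam updates for a scalar parameter."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps


grads = [0.2, -0.1, 0.3, 0.25]
scaled = [1000 * g for g in grads]  # same problem with the loss multiplied by 1000

print("SGD :", sgd_step(grads[0], 0.01), sgd_step(scaled[0], 0.01))
print("Adam:", adam_steps(grads, 0.01)[-1], adam_steps(scaled, 0.01)[-1])
```

The SGD step grows 1000x with the rescaled gradients, so its learning rate would need retuning, while the Adam steps differ only through the tiny `eps` term.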

