
Optimization Algorithms Deep Dive

Reading time: ~45 min · Interview relevance: Very High · Target roles: ML Engineer, Research Engineer, MLOps Engineer


The Training Run That Never Converged

A team at a research lab was training a large language model from scratch. They had carefully tuned the architecture, the data pipeline was clean, and the compute budget was approved. They launched the training run with SGD and momentum - the optimizer that had served them well on ResNet-50 image classification for years.

After 48 hours, the loss was barely moving. They doubled the learning rate. It diverged. They halved it. It moved - slowly, painfully slowly. By the time they reached 10% of the planned training tokens, the loss was still above what a well-tuned Adam run would reach in 1% of the same budget.

The problem was not their architecture or data. It was the optimizer. Transformers have a fundamentally different loss landscape than CNNs: the gradient magnitudes vary wildly across layers and heads, the effective Hessian has condition numbers orders of magnitude worse than convolutional architectures, and the initialization scale requirements are different. SGD with momentum - well suited to the comparatively well-conditioned loss surfaces that CNNs tend to produce - performs poorly in this environment.

This is not a trivial engineering detail. The choice of optimizer, its hyperparameters, and the learning rate schedule together determine whether a training run succeeds or fails. Yet most practitioners treat optimization as a black box: "use Adam, set $\beta_1 = 0.9$, $\beta_2 = 0.999$, and lr = 1e-3." That the defaults work at all is a tribute to Adam's robustness. But at the frontier - large models, long training runs, tight compute budgets - you need to understand what these algorithms are actually doing.

This lesson derives every major optimizer from first principles. We start with the geometry of gradient descent, build through momentum and adaptive methods to Adam's bias-corrected moment estimates, and end with the practical engineering knowledge that determines real training outcomes: gradient clipping, learning rate warmup, cosine annealing, and the loss landscape visualization techniques that diagnose training failures.


Why This Exists

The fundamental problem of machine learning is optimization: find parameters $\theta$ that minimize a loss function $L(\theta)$. For shallow models with convex losses (logistic regression, SVM), this is a solved problem - convex optimization theory provides algorithms with guaranteed convergence to the global minimum. For deep neural networks with non-convex losses, the theory is messier, but decades of empirical evidence and recent theoretical work give us a reliable toolkit.

The challenge is the scale. Modern neural networks have billions of parameters. Computing the exact gradient over the full training set for each update is infeasible. We must use stochastic gradient estimates. This introduces noise - noise that can be exploited (it helps escape sharp minima and acts as implicit regularization) but that also requires careful management through learning rate schedules and gradient clipping.

A secondary challenge is the geometry of neural network loss landscapes. The loss is not just non-convex - it has dramatically different curvature in different directions. Gradient descent with a fixed learning rate must move slowly (to avoid overshooting in the steep directions) even if the shallow directions could be traversed quickly with a larger step. Adaptive methods like Adam address this by maintaining per-parameter learning rates, effectively normalizing gradients by an estimate of the local curvature.

Understanding optimization means understanding this geometry. Once you see why Adam adapts to ill-conditioned curvature and why momentum accumulates velocity in consistent gradient directions, the algorithm becomes intuitive rather than magical.


Historical Context

Gradient descent is ancient by ML standards - Cauchy proposed it in 1847 for solving systems of equations. But the stochastic variant (SGD) was introduced by Robbins and Monro (1951) in the context of stochastic approximation - finding roots of noisy functions. The convergence conditions they proved (step sizes $\eta_t$ satisfying $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$) remain the theoretical foundation.

Momentum was formalized by Polyak (1964) as the "heavy ball" method and independently by Rumelhart, Hinton, and Williams (1986) for neural network training in the backpropagation paper. Nesterov's accelerated gradient (1983) proved that a "look-ahead" variant achieves the optimal $O(1/k^2)$ convergence rate for convex functions.

AdaGrad (Duchi, Hazan, Singer, 2011) introduced adaptive learning rates, motivated by sparse gradient problems in NLP. RMSProp was proposed by Hinton in an unpublished Coursera lecture in 2012, addressing AdaGrad's learning rate decay. Adam (Kingma and Ba, 2015) combined momentum with RMSProp's adaptive learning rates and added bias correction, becoming the dominant optimizer for deep learning almost immediately after publication.

AdamW (Loshchilov and Hutter, 2017) fixed a subtle but important bug in L2 regularization with Adam. The Muon optimizer (Jordan et al., 2024) extends the Shampoo/SOAP line of work, and Lion (Chen et al., 2023) added sign-based momentum to the mix. The field continues to evolve - but Adam with cosine decay remains the dominant recipe for large model training.


Convex vs Non-Convex Optimization

Convex Optimization

A function $f: \mathbb{R}^d \to \mathbb{R}$ is convex if:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y) \quad \forall x, y,\ \lambda \in [0,1]$$

For convex functions: any local minimum is a global minimum. Gradient descent converges to the global optimum. The convergence rate for $L$-smooth convex functions is $O(1/T)$ (GD) or $O(1/T^2)$ (Nesterov accelerated).

Logistic regression, linear SVMs, and linear regression have convex losses. Convex optimization theory fully covers these cases.
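The defining inequality is easy to check numerically for a simple convex function. The sketch below is a toy verification, with an illustrative choice of $f(x) = \|x\|^2$ and dimension, not part of the original lesson:

import numpy as np

# Minimal numeric check of the convexity inequality for f(x) = ||x||^2.
# For a convex function the assertion should never fire.
rng = np.random.default_rng(0)
f = lambda x: float(np.dot(x, x))

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    lam = rng.uniform()
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    assert lhs <= rhs + 1e-9, "convexity violated"

print("Convexity inequality held for 1000 random (x, y, lambda) triples")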

The Non-Convex Reality of Deep Learning

Neural network loss landscapes are non-convex. But recent theoretical and empirical work suggests the non-convexity is "benign" in practice:

  1. Saddle points, not local minima, are the main obstacle. In high dimensions, most critical points where $\nabla L = 0$ are saddle points (some directions have positive curvature, some negative) rather than local minima. SGD with noise escapes saddle points naturally.

  2. Wide minima generalize better. Sharp minima (where the loss increases steeply as you move away) tend to generalize poorly because small data distribution shifts move you out of the minimum. Wide flat minima are robust. This is why larger batch sizes (lower noise SGD) often generalize worse.

  3. Loss decreasing does not mean converging to global minimum. For deep networks, "good enough" is sufficient. The practical question is not "did we find the global minimum?" but "did we find parameters that generalize well?"


Gradient Descent Variants

Batch Gradient Descent

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$$

where $L(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(\theta; x_i, y_i)$ is the full-dataset loss.

Exact gradient, guaranteed convergence for convex problems. But: requires computing over all $n$ examples per step - impractical when $n = 10^9$.

Stochastic Gradient Descent

$$\theta_{t+1} = \theta_t - \eta \hat{g}_t, \quad \hat{g}_t = \nabla_\theta \ell(\theta_t; x_i, y_i)$$

Single example per step. Maximum noise - gradient estimate has high variance. Historical importance: this is what trained the first practical neural networks. Rarely used raw today; mini-batch SGD is standard.

Mini-Batch SGD

$$\hat{g}_t = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla_\theta \ell(\theta_t; x_i, y_i)$$

Batch size $B$ is a trade-off: a larger $B$ means fewer steps, lower variance per gradient estimate, and better GPU utilization; a smaller $B$ means more gradient noise, which often improves generalization.

The linear scaling rule (Goyal et al., 2017): when multiplying the batch size by $k$, multiply the learning rate by $k$. This approximately maintains the same noise-to-signal ratio. It works well up to roughly $B \approx 4096$; beyond that it breaks down and requires warmup.
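A minimal sketch of the linear scaling rule as stated above (the base batch size and learning rate here are illustrative, not from any specific recipe):

def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: lr scales with batch size (Goyal et al., 2017)."""
    return base_lr * new_batch / base_batch

# Illustrative numbers: a recipe tuned at batch 256 with lr 0.1,
# scaled up to batch 2048 (warmup is still needed at large batch sizes).
print(scale_lr(0.1, 256, 2048))   # 0.8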

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from typing import List, Tuple, Optional

# ============================================================
# Core optimizer implementations from scratch
# ============================================================

class VanillaSGD:
    """Gradient descent with optional momentum."""

    def __init__(self, params, lr: float = 0.01,
                 momentum: float = 0.0, weight_decay: float = 0.0):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        # Velocity for each parameter (initialized to zero)
        self.velocity = [torch.zeros_like(p) for p in self.params]

    def step(self):
        for p, v in zip(self.params, self.velocity):
            if p.grad is None:
                continue
            g = p.grad.data

            # L2 regularization: add weight_decay * p to gradient
            if self.weight_decay != 0:
                g = g + self.weight_decay * p.data

            # Momentum: v = mu * v + g
            v.mul_(self.momentum).add_(g)

            # Update: theta -= lr * v
            p.data.sub_(self.lr * v)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

Momentum

The Heavy Ball Intuition

Gradient descent oscillates in narrow valleys - it overshoots in the steep direction, undershoots in the shallow direction. Momentum smooths this by accumulating velocity: in consistent gradient directions, velocity builds up. In oscillating directions, positive and negative updates cancel.

The analogy is a heavy ball rolling on the loss surface. It does not change direction immediately when it hits an opposing gradient - it has inertia. For convex quadratics, momentum reduces the condition number's effect on convergence from $O(\kappa)$ (gradient descent) to $O(\sqrt{\kappa})$ (heavy ball).

$$v_t = \mu v_{t-1} + g_t, \qquad \theta_{t+1} = \theta_t - \eta v_t$$
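To see the effect described above, here is a small sketch (with illustrative learning rate, momentum, and step count) comparing plain gradient descent against heavy-ball momentum on an ill-conditioned quadratic $f(\theta) = \frac{1}{2}(100\,\theta_1^2 + \theta_2^2)$, condition number $\kappa = 100$:

import numpy as np

# Diagonal quadratic: curvature 100 along theta_1, 1 along theta_2.
# Gradient is elementwise H * theta.
H = np.array([100.0, 1.0])
theta_gd = np.array([1.0, 1.0])
theta_mom = np.array([1.0, 1.0])
v = np.zeros(2)
lr, mu = 0.015, 0.9   # lr must stay below 2/100 for plain GD stability

for t in range(200):
    theta_gd = theta_gd - lr * H * theta_gd        # plain gradient descent
    v = mu * v + H * theta_mom                     # heavy-ball velocity
    theta_mom = theta_mom - lr * v

print("GD       distance to optimum:", np.linalg.norm(theta_gd))
print("Momentum distance to optimum:", np.linalg.norm(theta_mom))

After 200 steps, plain GD is still limping along the shallow direction while momentum has converged several orders of magnitude closer, which is the $O(\kappa)$ vs $O(\sqrt{\kappa})$ gap in miniature.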

Nesterov Accelerated Gradient

Nesterov's key insight: instead of computing the gradient at the current position $\theta_t$, compute it at the "lookahead" position $\theta_t - \mu v_{t-1}$ (where you expect to be after applying momentum).

$$v_t = \mu v_{t-1} + g_t^{\text{lookahead}}, \quad g_t^{\text{lookahead}} = \nabla L(\theta_t - \mu v_{t-1})$$

This corrects for momentum overshooting: by the time you apply the gradient correction, you account for where momentum is taking you. For convex functions, Nesterov achieves the optimal $O(1/k^2)$ convergence rate. In practice, it often trains faster with less oscillation.

class NesterovSGD:
    """
    Nesterov Accelerated Gradient.
    Compute gradient at lookahead position, not current position.
    Achieves O(1/k^2) convergence for convex functions vs O(1/k) for GD.
    """

    def __init__(self, params, lr: float = 0.01, momentum: float = 0.9):
        self.params = list(params)
        self.lr = lr
        self.mu = momentum
        self.velocity = [torch.zeros_like(p) for p in self.params]

    def step(self, closure=None):
        # PyTorch's standard Nesterov implementation
        # is equivalent to: update velocity, then step with
        # g + mu * v (the gradient at the lookahead point)
        for p, v in zip(self.params, self.velocity):
            if p.grad is None:
                continue
            g = p.grad.data
            v.mul_(self.mu).add_(g)
            # Nesterov: use g + mu * v_new as effective gradient
            p.data.sub_(self.lr * (g + self.mu * v))

Adaptive Learning Rate Methods

AdaGrad

AdaGrad (Duchi et al., 2011) adapts the learning rate for each parameter based on the accumulated squared gradients. Parameters that receive frequent large gradients get a smaller learning rate; sparse parameters get a larger rate.

$$G_t = G_{t-1} + g_t^2 \quad \text{(accumulated squared gradient)}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t$$

Strength: great for sparse gradients (NLP word embeddings, sparse features). Weakness: $G_t$ is monotonically increasing, so the effective learning rate $\eta / \sqrt{G_t}$ decays to zero. Training stops making progress.

RMSProp

Hinton's fix: replace the full accumulation with an exponential moving average:

$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon}\, g_t$$

The EMA with decay $\rho \approx 0.9$ "forgets" old gradients - the effective learning rate remains bounded and the optimizer never permanently slows down.
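A minimal sketch of the two update rules above on a single toy parameter makes the difference concrete: AdaGrad's effective learning rate shrinks toward zero, while RMSProp's stays bounded (the constant unit gradient and hyperparameters are illustrative only):

import numpy as np

eta, eps, rho = 0.1, 1e-8, 0.9
G, Eg2 = 0.0, 0.0            # AdaGrad accumulator, RMSProp EMA

for t in range(1, 10001):
    g = 1.0                                  # constant unit gradient, for illustration
    G += g ** 2                              # AdaGrad: monotone accumulation
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2     # RMSProp: exponential moving average
    if t in (10, 100, 1000, 10000):
        print(f"t={t:>5}  AdaGrad eff. lr={eta / (np.sqrt(G) + eps):.5f}  "
              f"RMSProp eff. lr={eta / (np.sqrt(Eg2) + eps):.5f}")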


Adam: The Full Derivation

Adam (Kingma and Ba, 2015) combines momentum (first moment) with RMSProp (second moment), with a crucial bias correction that makes it work well from the first step.

First moment (momentum term): exponential moving average of gradients: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

Second moment (adaptive learning rate): exponential moving average of squared gradients: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

Why bias correction is needed: at $t=1$, $m_1 = (1-\beta_1)g_1$. With $\beta_1 = 0.9$, that is $0.1\, g_1$ - only 10% of the first gradient. The estimate is heavily biased toward zero. The same applies to $v_t$.

Bias correction: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

At $t=1$: $\hat{m}_1 = m_1 / (1-0.9) = 0.1\, g_1 / 0.1 = g_1$. At large $t$: $\hat{m}_t \approx m_t$ (the correction becomes negligible as $\beta^t \to 0$).

The Adam update: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

class AdamOptimizer:
    """
    Adam optimizer from scratch with full state tracking.
    Kingma & Ba, ICLR 2015.

    Key design decisions:
    - beta1=0.9: momentum decay. Accumulates ~1/(1-0.9)=10 steps of gradient history
    - beta2=0.999: second moment decay. Accumulates ~1/(1-0.999)=1000 steps
    - eps=1e-8: numerical stability, prevents division by zero
    - Bias correction: critical for early training steps
    """

    def __init__(self, params, lr: float = 1e-3,
                 beta1: float = 0.9, beta2: float = 0.999,
                 eps: float = 1e-8, weight_decay: float = 0.0):
        self.params = list(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.weight_decay = weight_decay

        # Adam state: first moment (m), second moment (v), step count (t)
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]
        self.t = 0

    def step(self):
        self.t += 1
        b1, b2 = self.beta1, self.beta2

        # Bias correction factors
        bias_correction1 = 1 - b1 ** self.t
        bias_correction2 = 1 - b2 ** self.t

        for i, p in enumerate(self.params):
            if p.grad is None:
                continue
            g = p.grad.data

            # Note: weight_decay here is L2 regularization applied to gradient
            # This is subtly different from AdamW's decoupled weight decay
            if self.weight_decay != 0:
                g = g + self.weight_decay * p.data

            # Update first moment (momentum)
            self.m[i].mul_(b1).add_((1 - b1) * g)

            # Update second moment (adaptive learning rate denominator)
            self.v[i].mul_(b2).add_((1 - b2) * g * g)

            # Bias-corrected estimates
            m_hat = self.m[i] / bias_correction1
            v_hat = self.v[i] / bias_correction2

            # Parameter update
            p.data.sub_(self.lr * m_hat / (v_hat.sqrt() + self.eps))

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

    def state_dict(self) -> dict:
        """Inspect optimizer state - useful for debugging training issues."""
        return {
            'step': self.t,
            'lr': self.lr,
            'first_moment_norms': [m.norm().item() for m in self.m],
            'second_moment_norms': [v.norm().item() for v in self.v],
            'effective_lr': [
                self.lr / ((v / (1 - self.beta2**self.t)).sqrt().mean().item()
                           + self.eps)
                for v in self.v
            ]
        }
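As a quick usage sketch (reusing the imports above; the toy linear model and synthetic data are assumptions for illustration), the from-scratch AdamOptimizer drops into an ordinary training loop:

torch.manual_seed(0)
toy_model = nn.Linear(4, 1)
opt = AdamOptimizer(toy_model.parameters(), lr=1e-2)

X_toy = torch.randn(64, 4)
y_toy = X_toy @ torch.tensor([[1.0], [-2.0], [0.5], [3.0]])  # synthetic targets

for step in range(200):
    opt.zero_grad()
    loss = F.mse_loss(toy_model(X_toy), y_toy)
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.6f}")
print(f"optimizer step count: {opt.state_dict()['step']}")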

AdamW: Decoupled Weight Decay

The bug in Adam with L2 regularization: L2 adds $\lambda \theta$ to the gradient, so the effective weight decay is $\lambda \eta / (\sqrt{\hat{v}} + \epsilon)$. For parameters with large gradient variance (large $\hat{v}$), the effective decay is small. For parameters with small gradient variance, decay is large. This breaks the intended uniform regularization.

AdamW (Loshchilov and Hutter, 2017) decouples weight decay from the gradient update:

$$\theta_{t+1} = \theta_t - \eta \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t\right)$$

The weight decay $\lambda \theta_t$ is applied directly to the parameters, independent of the adaptive gradient scaling. This gives uniform regularization across all parameters regardless of their gradient statistics.

class AdamW:
    """
    AdamW: Adam with decoupled weight decay.
    Loshchilov & Hutter, 2017 (published at ICLR 2019).

    Standard in all modern LLM training:
    GPT-2, BERT, T5, GPT-3, LLaMA all use AdamW.
    """

    def __init__(self, params, lr: float = 1e-3,
                 beta1: float = 0.9, beta2: float = 0.999,
                 eps: float = 1e-8, weight_decay: float = 0.01):
        self.params = list(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.weight_decay = weight_decay  # decoupled from gradient update

        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]
        self.t = 0

    def step(self):
        self.t += 1
        b1, b2 = self.beta1, self.beta2
        bc1 = 1 - b1 ** self.t
        bc2 = 1 - b2 ** self.t

        for i, p in enumerate(self.params):
            if p.grad is None:
                continue
            g = p.grad.data

            # Update moment estimates (no weight decay in gradient!)
            self.m[i].mul_(b1).add_((1 - b1) * g)
            self.v[i].mul_(b2).add_((1 - b2) * g * g)

            m_hat = self.m[i] / bc1
            v_hat = self.v[i] / bc2

            # Gradient step
            p.data.sub_(self.lr * m_hat / (v_hat.sqrt() + self.eps))

            # Decoupled weight decay: applied directly, not through gradient
            p.data.mul_(1 - self.lr * self.weight_decay)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()


def compare_adam_adamw():
    """
    Show that AdamW provides more uniform regularization than Adam+L2.
    With Adam+L2, effective weight decay depends on gradient variance.
    With AdamW, weight decay is constant for all parameters.
    """
    import copy

    torch.manual_seed(42)
    d = 10

    # Two identical copies of the same model, one per optimizer,
    # so both updates start from the same parameters
    model_adam = nn.Sequential(
        nn.Linear(d, 100),   # large gradient parameters
        nn.ReLU(),
        nn.Linear(100, 1)    # small gradient parameters
    )
    model_adamw = copy.deepcopy(model_adam)

    # Adam+L2 (weight_decay applied to gradient)
    opt_adam = torch.optim.Adam(model_adam.parameters(), lr=1e-3, weight_decay=0.01)

    # AdamW (weight_decay decoupled)
    opt_adamw = torch.optim.AdamW(model_adamw.parameters(), lr=1e-3, weight_decay=0.01)

    X = torch.randn(100, d)
    y = torch.randn(100, 1)

    print("Adam with L2 vs AdamW: effective regularization comparison")
    for opt_name, model, opt in [('Adam+L2', model_adam, opt_adam),
                                 ('AdamW', model_adamw, opt_adamw)]:
        opt.zero_grad()
        pred = model(X)
        loss = F.mse_loss(pred, y)
        loss.backward()
        opt.step()
        # Check parameter norms after update
        param_norms = [p.norm().item() for p in model.parameters()]
        print(f"{opt_name}: param norms = {[f'{n:.4f}' for n in param_norms]}")

Learning Rate Schedules

The learning rate schedule is as important as the optimizer. The right schedule can double training efficiency. The wrong schedule causes training collapse.

Warmup

Large models are sensitive to the learning rate at initialization. With Adam, the second moment estimate $\hat{v}$ is built from only a handful of noisy gradients during the first steps, so the effective step size $\eta / (\sqrt{\hat{v}} + \epsilon)$ can be far larger than intended for parameters whose early gradients happen to be small. This causes large, unstable updates early in training.

Linear warmup for $T_w$ steps: $\eta_t = \eta_{\max} \cdot t / T_w$. This gives the second moment estimates time to reach accurate values before taking large steps.

Standard for transformer training: 1-2% of total steps as warmup. GPT-3 uses 375M tokens of warmup out of 300B total (0.125%).

Cosine Annealing

After warmup, decay the learning rate following a cosine curve:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t - T_w}{T_{\max} - T_w}\,\pi\right)\right)$$

This is smooth (no abrupt drops), naturally slows training near the end (allowing fine-grained convergence), and restarts can be used to escape local minima (SGDR: warm restarts).

class CosineAnnealingWithWarmup:
    """
    Linear warmup + cosine annealing learning rate schedule.
    Standard for LLM training: GPT-4, LLaMA, Mistral all use this.
    """

    def __init__(self, optimizer: torch.optim.Optimizer,
                 warmup_steps: int,
                 total_steps: int,
                 min_lr: float = 1e-5,
                 max_lr: float = 1e-3):
        self.optimizer = optimizer
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.current_step = 0

    def step(self):
        self.current_step += 1
        lr = self.get_lr(self.current_step)
        for pg in self.optimizer.param_groups:
            pg['lr'] = lr
        return lr

    def get_lr(self, step: int) -> float:
        if step < self.warmup_steps:
            # Linear warmup: 0 -> max_lr
            return self.max_lr * step / self.warmup_steps
        elif step >= self.total_steps:
            return self.min_lr
        else:
            # Cosine decay: max_lr -> min_lr
            progress = (step - self.warmup_steps) / (
                self.total_steps - self.warmup_steps)
            cosine_decay = 0.5 * (1 + np.cos(np.pi * progress))
            return self.min_lr + (self.max_lr - self.min_lr) * cosine_decay


def plot_lr_schedule(warmup_steps=1000, total_steps=10000,
                     min_lr=1e-5, max_lr=3e-4):
    """Visualize the warmup + cosine decay schedule."""
    steps = np.arange(total_steps)
    lrs = []

    for step in steps:
        if step < warmup_steps:
            lr = max_lr * step / warmup_steps
        else:
            progress = (step - warmup_steps) / (total_steps - warmup_steps)
            cosine_decay = 0.5 * (1 + np.cos(np.pi * progress))
            lr = min_lr + (max_lr - min_lr) * cosine_decay
        lrs.append(lr)

    plt.figure(figsize=(10, 4))
    plt.plot(steps, lrs, 'b-', linewidth=2)
    plt.axvline(x=warmup_steps, color='r', linestyle='--',
                label=f'Warmup ends (step {warmup_steps})')
    plt.xlabel('Training Step')
    plt.ylabel('Learning Rate')
    plt.title('Cosine Annealing with Linear Warmup')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('/tmp/lr_schedule.png', dpi=100)
    print(f"Max LR: {max_lr:.2e}, Min LR: {min_lr:.2e}")
    print(f"LR at warmup end: {lrs[warmup_steps]:.6f}")
    print(f"LR at halfway: {lrs[total_steps//2]:.6f}")
    print(f"LR at end: {lrs[-1]:.6f}")

plot_lr_schedule()

Gradient Clipping

In deep networks, gradients can explode - especially in RNNs and early transformer training. A single large gradient step can destroy weeks of training.

Gradient norm clipping: if $\|g\| > \text{max\_norm}$, scale $g \leftarrow g \cdot \text{max\_norm} / \|g\|$. This preserves the gradient direction while bounding its magnitude.

The difference from gradient value clipping (element-wise): norm clipping respects the relative magnitudes within the gradient - if one dimension is large, all dimensions are scaled proportionally.

def clip_gradient_norm(parameters, max_norm: float,
                       norm_type: float = 2.0) -> float:
    """
    Clip gradients by global norm.
    Returns the pre-clipping gradient norm (useful for monitoring).

    Standard practice:
    - Transformers: max_norm = 1.0 (GPT-3, LLaMA standard)
    - RNNs: max_norm = 5.0 or 1.0
    - CNNs: often not needed, but max_norm = 1.0 is safe default

    Important: call AFTER loss.backward() and BEFORE optimizer.step()
    """
    params = [p for p in parameters if p.grad is not None]
    if not params:
        return 0.0

    # Compute global gradient norm
    total_norm = torch.norm(
        torch.stack([torch.norm(p.grad.data, norm_type) for p in params]),
        norm_type
    ).item()

    # Scale gradients if norm exceeds max_norm
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for p in params:
            p.grad.data.mul_(clip_coef)

    return total_norm


def monitor_gradient_health(model: nn.Module, step: int) -> dict:
    """
    Compute gradient statistics for training diagnostics.
    Call after backward(), before optimizer step.

    Warning signs:
    - grad_norm > 10: potential explosion (or bad batch)
    - grad_norm < 1e-4: vanishing gradients
    - grad_norm suddenly spikes: check for bad data samples
    """
    grad_norms = {}
    total_norm_sq = 0.0

    for name, p in model.named_parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2).item()
            grad_norms[name] = param_norm
            total_norm_sq += param_norm ** 2

    global_norm = total_norm_sq ** 0.5

    return {
        'step': step,
        'global_grad_norm': global_norm,
        'per_layer_norms': grad_norms,
        'max_layer_norm': max(grad_norms.values()) if grad_norms else 0,
        'min_layer_norm': min(grad_norms.values()) if grad_norms else 0,
    }


# Complete training loop with all best practices
def train_with_best_practices(model, train_loader, total_steps=10000):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,
        betas=(0.9, 0.95),   # beta2=0.95 for LLM training (more responsive)
        weight_decay=0.1
    )

    scheduler = CosineAnnealingWithWarmup(
        optimizer,
        warmup_steps=int(0.02 * total_steps),   # 2% warmup
        total_steps=total_steps,
        min_lr=3e-5,
        max_lr=3e-4
    )

    grad_norm_history = []

    for step, (x, y) in enumerate(train_loader):
        if step >= total_steps:
            break

        optimizer.zero_grad()
        loss = model(x, y)
        loss.backward()

        # Clip gradients (monitor norm BEFORE clipping)
        grad_norm = clip_gradient_norm(model.parameters(), max_norm=1.0)
        grad_norm_history.append(grad_norm)

        # Alert on gradient spikes
        if grad_norm > 5.0:
            print(f"WARNING: Large gradient at step {step}: {grad_norm:.2f}")

        optimizer.step()
        scheduler.step()

        if step % 100 == 0:
            lr = optimizer.param_groups[0]['lr']
            print(f"Step {step}: loss={loss.item():.4f}, "
                  f"grad_norm={grad_norm:.3f}, lr={lr:.6f}")

    return grad_norm_history

Loss Landscape Visualization

Understanding why an optimizer succeeds or fails requires visualizing the loss landscape. The standard technique (Li et al., 2018) is to project the landscape onto two random directions through the final parameter vector.

def visualize_loss_landscape(
    model: nn.Module,
    loss_fn,
    data_loader,
    resolution: int = 20,
    range_val: float = 1.0
) -> None:
    """
    Loss landscape visualization via random direction projection.
    Li et al., NeurIPS 2018: "Visualizing the Loss Landscape of Neural Nets"

    Creates a 2D grid around the final trained parameters.
    Shows whether the minimum is sharp (bad generalization) or flat (good).
    """
    # Save final parameters
    theta_star = [p.data.clone() for p in model.parameters()]

    # Sample two random Gaussian directions, filter-normalized
    # Filter normalization: scale each filter to have same norm as
    # the corresponding final parameter filter (makes scale consistent)
    def make_direction():
        direction = [torch.randn_like(p) for p in model.parameters()]
        # Filter normalization
        for d, p in zip(direction, model.parameters()):
            d.mul_(p.norm() / (d.norm() + 1e-10))
        return direction

    dir1 = make_direction()
    dir2 = make_direction()

    alpha_range = np.linspace(-range_val, range_val, resolution)
    beta_range = np.linspace(-range_val, range_val, resolution)
    losses = np.zeros((resolution, resolution))

    # Evaluate loss at each point in the grid
    for i, alpha in enumerate(alpha_range):
        for j, beta in enumerate(beta_range):
            # Move parameters to theta_star + alpha * dir1 + beta * dir2
            for param, star, d1, d2 in zip(model.parameters(),
                                           theta_star, dir1, dir2):
                param.data.copy_(star + alpha * d1 + beta * d2)

            # Compute loss
            total_loss = 0.0
            n_batches = 0
            with torch.no_grad():
                for x, y in data_loader:
                    total_loss += loss_fn(model(x), y).item()
                    n_batches += 1
                    if n_batches >= 5:  # subsample for speed
                        break
            losses[i, j] = total_loss / n_batches

    # Restore final parameters
    for param, star in zip(model.parameters(), theta_star):
        param.data.copy_(star)

    # Plot: contour on the left, 3D surface on the right
    # (the 3D axes must be created with projection='3d' from the start)
    fig = plt.figure(figsize=(14, 5))
    ax1 = fig.add_subplot(121)
    ax2 = fig.add_subplot(122, projection='3d')

    # Contour plot
    X, Y = np.meshgrid(alpha_range, beta_range)
    levels = np.percentile(losses, np.linspace(0, 100, 30))
    ax1.contourf(X, Y, losses.T, levels=levels, cmap='viridis')
    ax1.set_title('Loss Landscape Contours')
    ax1.set_xlabel('Direction 1')
    ax1.set_ylabel('Direction 2')

    # 3D surface
    ax2.plot_surface(X, Y, losses.T, cmap='viridis', alpha=0.8)
    ax2.set_title('Loss Landscape 3D')

    plt.tight_layout()
    plt.savefig('/tmp/loss_landscape.png', dpi=100)
    print(f"Loss range: [{losses.min():.4f}, {losses.max():.4f}]")
    print(f"Loss at center (trained params): {losses[resolution//2, resolution//2]:.4f}")

Why Adam Beats SGD for Transformers But SGD Beats Adam for CNNs

This is one of the most important empirical observations in modern deep learning, and understanding the reason gives deep insight into both optimizers.

For CNNs: The loss landscape is relatively well-conditioned. Convolutional filters have similar gradient scales across layers. L2 regularization (which for SGD+momentum is essentially equivalent to weight decay) provides effective regularization. SGD+momentum with cosine decay converges to flatter, better-generalizing minima than Adam for image classification. The famous ResNet and WideResNet training recipes explicitly use SGD+momentum.

For Transformers: The loss landscape is ill-conditioned. Gradient magnitudes vary enormously: the query/key projections have different gradient scales than the FFN weights, which have different scales than the embedding layer. This heterogeneity makes a fixed learning rate catastrophic - it is either too large for some layers or too small for others. Adam's per-parameter adaptive learning rates normalize these differences, making the effective step size consistent across all parameters.

A second factor: embedding parameters have sparse gradients (only the tokens in the current batch receive gradient updates). AdaGrad/Adam maintain large second moment estimates for embedding rows that appear frequently, giving them smaller effective learning rates - which prevents overshooting on common tokens while still training rare tokens adequately.


Muon Optimizer

Muon (MomentUm Orthogonalized by Newton-Schulz; Jordan et al., 2024) extends the Shampoo line of work by orthogonalizing the gradient update. The key idea: instead of normalizing gradient components independently (Adam) or using a full second-order Hessian approximation (Shampoo), Muon applies the Newton-Schulz iteration to orthogonalize the momentum matrix.

def newton_schulz_5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """
    5th-order Newton-Schulz iteration for matrix orthogonalization.
    Finds U in the decomposition G = US where U is orthogonal.
    Used in Muon optimizer for gradient normalization.
    """
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)

    X = G / (G.norm() + 1e-7)  # normalize

    if G.shape[0] > G.shape[1]:
        X = X.T

    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X

    if G.shape[0] > G.shape[1]:
        X = X.T

    return X


class Muon(torch.optim.Optimizer):
    """
    Muon: Momentum Orthogonalized by Newton-Schulz.
    Jordan et al., 2024.

    Best results on 2D parameter matrices (weight matrices in Linear layers).
    Use Adam for 1D parameters (biases, norms) and embeddings.
    Shows strong results in small/medium scale LLM training.
    """

    def __init__(self, params, lr: float = 0.02, momentum: float = 0.95,
                 ns_steps: int = 5):
        defaults = dict(lr=lr, momentum=momentum, ns_steps=ns_steps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr = group['lr']
            momentum = group['momentum']
            ns_steps = group['ns_steps']

            for p in group['params']:
                if p.grad is None:
                    continue

                g = p.grad
                state = self.state[p]

                # Initialize momentum buffer
                if 'momentum_buffer' not in state:
                    state['momentum_buffer'] = torch.zeros_like(g)

                buf = state['momentum_buffer']
                buf.mul_(momentum).add_(g)

                # Orthogonalize gradient update for 2D parameters
                if g.ndim == 2:
                    update = newton_schulz_5(buf, steps=ns_steps)
                else:
                    update = buf  # 1D params: standard momentum

                p.data.sub_(lr * update)

Second-Order Methods

First-order methods (gradient descent, Adam) use only first derivatives. Second-order methods use the Hessian $H = \nabla^2 L$ to approximate the curvature and take a more direct path to the minimum.

Newton's method update:

$$\theta_{t+1} = \theta_t - H_t^{-1} g_t$$

This step goes directly to the minimum of the local quadratic approximation. Convergence is quadratic for convex problems. But: computing and inverting the $p \times p$ Hessian costs $O(p^3)$ - for a model with $p = 10^9$ parameters, this is completely infeasible.
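For intuition, here is a tiny sketch of Newton's method on a 2D quadratic, where the Hessian is constant and a single Newton step lands exactly on the minimizer (the matrix and vector values are illustrative):

import numpy as np

# Quadratic f(theta) = 0.5 * theta^T H theta - b^T theta, minimized at H^{-1} b
H = np.array([[100.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

theta = np.zeros(2)
grad = H @ theta - b
theta = theta - np.linalg.solve(H, grad)   # Newton step: theta - H^{-1} g

print("After one Newton step:", theta)          # [0.01, 1.0]
print("Exact minimizer:      ", np.linalg.solve(H, b))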

L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) approximates the inverse Hessian using the last $m$ gradient differences, requiring only $O(mp)$ storage. It is standard for full-batch optimization on small-to-medium problems. PyTorch includes it:

def train_with_lbfgs(model: nn.Module, X: torch.Tensor, y: torch.Tensor,
                     max_iter: int = 100):
    """
    L-BFGS for full-batch optimization.
    Effective for:
    - Fine-tuning small models on modest datasets
    - Physics-informed neural networks (PINNs)
    - Neural ODEs with small state spaces
    NOT suitable for:
    - Mini-batch training (L-BFGS is not stochastic)
    - Models with stochastic layers (dropout active)
    - Any model > ~100M parameters
    """
    optimizer = torch.optim.LBFGS(
        model.parameters(),
        lr=1.0,
        max_iter=20,                      # inner iterations per step
        history_size=100,                 # number of gradient pairs to remember
        line_search_fn='strong_wolfe'     # line search for step size
    )

    def closure():
        """L-BFGS requires a closure that recomputes loss and gradient."""
        optimizer.zero_grad()
        pred = model(X)
        loss = F.mse_loss(pred, y)
        loss.backward()
        return loss

    for step in range(max_iter):
        loss = optimizer.step(closure)
        if step % 10 == 0:
            print(f"L-BFGS step {step}: loss={loss.item():.6f}")

Production Engineering Notes

Optimizer Hyperparameter Tuning Hierarchy

Not all hyperparameters are equal. The order of importance for a standard AdamW + cosine schedule (a configuration sketch follows this list):

  1. Learning rate ($\eta$): most important. Search 1e-4 to 5e-3 on a log scale.
  2. Weight decay ($\lambda$): second most important. Too low = overfitting. Too high = underfitting. 0.01-0.1 for transformers.
  3. Warmup steps: 1-5% of total training steps. Too little = unstable start. Too much = wasted budget.
  4. beta2: for transformers, 0.95 > 0.999 (more responsive to recent gradients). For fine-tuning, 0.999 is safe.
  5. beta1: 0.9 is rarely wrong. Sometimes 0.95 for very long training runs.
  6. epsilon: 1e-8 is the default. For mixed precision training (float16), use 1e-6 to avoid underflow.
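A sketch of this hierarchy as a concrete configuration, reusing the CosineAnnealingWithWarmup class defined above and a log-spaced grid for the most important knob. The specific values are assumptions that follow the ranges in the list, not a prescribed recipe:

import numpy as np
import torch

def make_optimizer_and_schedule(model, total_steps: int, lr: float):
    """AdamW + warmup/cosine schedule using the settings discussed above."""
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=lr,
        betas=(0.9, 0.95),        # beta2=0.95 for long transformer runs
        eps=1e-8,                 # consider 1e-6 under float16
        weight_decay=0.1,
    )
    scheduler = CosineAnnealingWithWarmup(
        optimizer,
        warmup_steps=int(0.02 * total_steps),   # 2% warmup
        total_steps=total_steps,
        min_lr=0.1 * lr,
        max_lr=lr,
    )
    return optimizer, scheduler

# Search the learning rate first, on a log scale (most important hyperparameter)
lr_grid = np.logspace(np.log10(1e-4), np.log10(5e-3), num=5)
print("candidate learning rates:", [f"{lr:.1e}" for lr in lr_grid])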

Mixed Precision and Optimizer State

In mixed precision training, parameters are stored in float16 for forward/backward pass (memory efficiency, speed) but optimizer state and parameter updates are done in float32 (numerical stability).

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # manages loss scaling for float16 gradients

def train_step_mixed_precision(model, optimizer, x, y, scaler):
    optimizer.zero_grad()

    with autocast():  # float16 forward pass
        loss = model(x, y)

    # Scale loss to prevent float16 gradient underflow
    scaler.scale(loss).backward()

    # Unscale before gradient clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Step (float32 update internally)
    scaler.step(optimizer)
    scaler.update()

    return loss.item()
Warning: Never Clip Gradients After Scaling Without Unscaling First

In mixed-precision training, gradients are scaled by a large factor to prevent float16 underflow. If you call clip_grad_norm_ before scaler.unscale_(), you are clipping the artificially inflated gradients. The effective max_norm becomes max_norm / scale_factor - possibly near zero. Always call scaler.unscale_(optimizer) before gradient clipping.

Danger: Adam on Sparse Gradients Without Regularization Is a Trap

Adam maintains a second moment estimate $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ for every parameter. For embedding parameters with sparse gradients (only updated when the corresponding token appears in the batch), $v_t$ becomes very small for rare tokens - causing Adam to take enormous steps when those tokens finally appear. The fix: use weight decay (AdamW) and consider a larger $\epsilon$ (1e-6 to 1e-4) for the embedding layer specifically. In PyTorch, use param_groups to set different hyperparameters for embedding vs other parameters.
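A sketch of the param_groups fix mentioned above. It assumes a model whose embedding parameters can be identified by the substring 'embed' in their names; the attribute matching and hyperparameter values are illustrative only:

def build_param_groups(model: nn.Module):
    """Separate hyperparameters for embedding vs. other parameters."""
    embed_params, other_params = [], []
    for name, p in model.named_parameters():
        if 'embed' in name:        # illustrative: match by parameter name
            embed_params.append(p)
        else:
            other_params.append(p)
    return [
        # Larger eps bounds the step size when v_t is tiny for rare tokens
        {'params': embed_params, 'eps': 1e-6, 'weight_decay': 0.1},
        {'params': other_params, 'eps': 1e-8, 'weight_decay': 0.1},
    ]

# optimizer = torch.optim.AdamW(build_param_groups(model), lr=3e-4, betas=(0.9, 0.95))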


Interview Questions and Answers

Q1: Derive the Adam optimizer. Why is bias correction necessary?

Adam maintains two exponential moving averages: first moment $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ (momentum) and second moment $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ (adaptive learning rate denominator).

Bias correction is necessary because both estimates are initialized to zero, causing systematic underestimation early in training. Expanding $m_t$: $m_t = (1-\beta_1)\sum_{i=1}^t \beta_1^{t-i} g_i$. If we assume $g_i \approx g$ (stationary), then $m_t = g(1-\beta_1^t)$. But we want an estimate of $g$, so we correct: $\hat{m}_t = m_t / (1-\beta_1^t)$.

At $t=1$: $m_1 = 0.1\, g_1$ (only 10% of the first gradient with $\beta_1=0.9$), but $\hat{m}_1 = g_1$. At $t=10$: $1-0.9^{10} = 0.65$, still needs correction. At $t=100$: $1-0.9^{100} \approx 1$, the correction is negligible. The bias correction matters most in the first few thousand steps - exactly when learning rate stability is critical.
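The correction factors quoted above can be checked in a couple of lines; note how much longer the second-moment correction ($\beta_2 = 0.999$) stays significant:

for t in (1, 10, 100, 1000):
    # Bias-correction denominators for beta1=0.9 and beta2=0.999
    print(f"t={t:>4}  1 - 0.9**t = {1 - 0.9**t:.4f}   1 - 0.999**t = {1 - 0.999**t:.4f}")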

Q2: What is the difference between L2 regularization with Adam and AdamW?

L2 regularization adds $\lambda \theta$ to the gradient before the optimizer update. In Adam, this means the regularization term is also adaptively scaled: the effective weight decay for parameter $i$ is $\lambda \eta / (\sqrt{\hat{v}_i} + \epsilon)$. Parameters with large gradient variance (large $\hat{v}$) receive weak regularization; parameters with small gradient variance receive strong regularization.

AdamW applies weight decay directly to the parameters after the gradient step: $\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$. This gives uniform weight decay $\lambda$ for all parameters, independent of gradient statistics.

In practice: AdamW consistently outperforms Adam+L2 regularization for transformer models. The LLaMA, GPT-4, and most modern LLM training recipes use AdamW. The difference is most pronounced when some layer types (e.g., attention vs FFN) have significantly different gradient variances.

Q3: Explain learning rate warmup. Why do transformers need it but ResNets typically do not?

Warmup gradually increases the learning rate from 0 (or a small value) to the target maximum over the first few hundred or thousand training steps.

The Adam-specific reason: at step 1, the second moment estimate $\hat{v}_1 = g_1^2$ (a single gradient squared). The effective learning rate is $\eta / \sqrt{\hat{v}_1} = \eta / |g_1|$. For parameters with small gradients, $|g_1|$ may be tiny, making the effective step extremely large. As training progresses and $\hat{v}$ accumulates a history, it stabilizes. Warmup gives this stabilization time before taking large steps.

Transformers are more sensitive because: (1) they have more parameter types with heterogeneous gradient scales; (2) LayerNorm parameters, embedding weights, and attention projections all have different gradient magnitudes; (3) attention patterns have sharp gradients at initialization that can destabilize early training.

ResNets are less sensitive because: convolutional filters have relatively homogeneous gradient scales; batch normalization stabilizes training; the architecture is more robust to moderate learning rates at initialization.

Q4: What is gradient clipping, and when should you use it?

Gradient norm clipping: if the global L2 norm of all gradients exceeds max_norm, scale all gradients proportionally so the norm equals max_norm. The gradient direction is preserved; only the magnitude is bounded.

When to use: (1) RNN/LSTM training - BPTT through long sequences causes gradient explosion; clipping at 5.0 or 1.0 is standard; (2) Transformer training - early in training, attention weights are random and can produce large gradient spikes; max_norm=1.0 is standard for GPT-scale training; (3) Any time you see loss NaN or sudden spikes in training curves.

When NOT to use (or use carefully): fine-tuning pre-trained models - if your gradients are already small, clipping at 1.0 has no effect and adds overhead; reward model training in RLHF - gradient scale depends heavily on reward signal distribution.

The key insight: gradient clipping is not a fix for vanishing gradients (it only helps with explosion). For vanishing gradients, you need architectural solutions (residual connections, LayerNorm, better initialization).

Q5: Why does Adam generally beat SGD for transformers but SGD+momentum beats Adam for CNNs?

The core reason is the condition number of the loss landscape. For transformers:

  • Gradient magnitudes vary dramatically across layer types and positions
  • Embedding gradients are sparse; attention projection gradients are dense
  • The effective Hessian has very high condition number, making a uniform learning rate ineffective
  • Adam normalizes each parameter by its gradient history, equalizing effective learning rates

For CNNs:

  • Convolutional filters at each layer have similar gradient scales due to weight sharing
  • The loss landscape is better conditioned
  • SGD+momentum finds flatter minima (wider basins of attraction) with better generalization
  • Adam's adaptive normalization can hurt CNNs by adapting away from the implicit regularization effect of uniform gradient steps

The empirical evidence is clear: WideResNet achieves 96.7% on CIFAR-10 with SGD+momentum; Adam typically reaches 95-96%. For GPT-scale transformers, SGD convergence speed is 10-100x slower than Adam, often requiring impractically small learning rates to avoid instability.

Q6: Describe cosine annealing with warm restarts. When would you use restarts?

Standard cosine annealing (without restarts) decays the learning rate from $\eta_{\max}$ to $\eta_{\min}$ following a cosine curve over the entire training run. With warm restarts (SGDR, Loshchilov & Hutter 2016), the learning rate periodically resets to $\eta_{\max}$ and decays again, with each cycle potentially longer than the last.

Why restarts help: training often settles into a basin with a good but not optimal minimum. The periodic large learning rate resets allow the optimizer to escape this basin and potentially find a better one. Each restart is like a new, shorter training run initialized from the end of the previous one.

When to use: (1) ensembling - take model snapshots at the end of each cosine cycle and ensemble them (snapshot ensembles); (2) when training curves show the model settling into a plateau before the full training budget is used; (3) when exploring multiple local minima for diversity in ensemble members. When NOT to use: very long training runs where the final minimum quality is more important than escape dynamics; fine-tuning where you want smooth convergence to a nearby minimum.
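A minimal sketch of a cosine schedule with warm restarts, following the SGDR description above; the cycle length, multiplier, and learning-rate bounds are illustrative assumptions:

import numpy as np

def sgdr_lr(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
            first_cycle: int = 1000, t_mult: int = 2) -> float:
    """Cosine annealing with warm restarts: each cycle is t_mult x longer."""
    cycle_len, cycle_start = first_cycle, 0
    while step >= cycle_start + cycle_len:
        cycle_start += cycle_len
        cycle_len *= t_mult
    progress = (step - cycle_start) / cycle_len
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(np.pi * progress))

# The learning rate resets to max_lr at steps 1000, 3000, 7000, ...
for s in (0, 999, 1000, 2999, 3000):
    print(s, f"{sgdr_lr(s):.2e}")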


Summary

Optimization is where theory meets engineering reality. The derivation of Adam from moment estimates and bias correction explains why it is robust in early training. The difference between AdamW and Adam+L2 explains why modern LLM training universally uses AdamW. The connection between loss landscape geometry and optimizer choice explains why the same model trained with different optimizers can produce dramatically different results.

The practical recipe for large model training is stable: AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay 0.1, gradient clipping at 1.0, linear warmup for 2% of training, cosine decay to 10% of peak LR. Deviations from this require justification, and the justification should come from understanding what each hyperparameter controls in the optimization dynamics.

Second-order methods (L-BFGS) remain valuable for small-to-medium full-batch problems. The Muon optimizer and other orthogonalization-based methods are showing promise for efficiently trained small language models. But for training at scale, the Adam family with cosine schedules remains the established baseline to beat.
