
Optimization Algorithms Deep Dive

Reading time: ~45 min · Interview relevance: Very High · Target roles: ML Engineer, Research Engineer, MLOps Engineer


The Training Run That Never Converged

A team at a research lab was training a large language model from scratch. They had carefully tuned the architecture, the data pipeline was clean, and the compute budget was approved. They launched the training run with SGD and momentum - the optimizer that had served them well on ResNet-50 image classification for years.

After 48 hours, the loss was barely moving. They doubled the learning rate. It diverged. They halved it. It moved - slowly, painfully slowly. By the time they reached 10% of the planned training tokens, the loss was still above what a well-tuned Adam run would reach in 1% of the same budget.

The problem was not their architecture or data. It was the optimizer. Transformers have a fundamentally different loss landscape than CNNs: the gradient magnitudes vary wildly across layers and heads, the effective Hessian has condition numbers orders of magnitude worse than convolutional architectures, and the initialization scale requirements are different. SGD with momentum - well suited to the comparatively well-conditioned loss surfaces that CNNs tend to produce - performs poorly in this environment.

This is not a trivial engineering detail. The choice of optimizer, its hyperparameters, and the learning rate schedule together determine whether a training run succeeds or fails. Yet most practitioners treat optimization as a black box: "use Adam, set $\beta_1 = 0.9$, $\beta_2 = 0.999$, and lr = 1e-3." That the defaults work at all is a tribute to Adam's robustness. But at the frontier - large models, long training runs, tight compute budgets - you need to understand what these algorithms are actually doing.

This lesson derives every major optimizer from first principles. We start with the geometry of gradient descent, build through momentum and adaptive methods to Adam's bias-corrected moment estimates, and end with the practical engineering knowledge that determines real training outcomes: gradient clipping, learning rate warmup, cosine annealing, and the loss landscape visualization techniques that diagnose training failures.


Why This Exists

The fundamental problem of machine learning is optimization: find parameters $\theta$ that minimize a loss function $L(\theta)$. For shallow models with convex losses (logistic regression, SVM), this is a solved problem - convex optimization theory provides algorithms with guaranteed convergence to the global minimum. For deep neural networks with non-convex losses, the theory is messier, but decades of empirical evidence and recent theoretical work give us a reliable toolkit.

The challenge is the scale. Modern neural networks have billions of parameters. Computing the exact gradient over the full training set for each update is infeasible. We must use stochastic gradient estimates. This introduces noise - noise that can be exploited (it helps escape sharp minima and acts as implicit regularization) but that also requires careful management through learning rate schedules and gradient clipping.

A secondary challenge is the geometry of neural network loss landscapes. The loss is not just non-convex - it has dramatically different curvature in different directions. Gradient descent with a fixed learning rate must move slowly (to avoid overshooting in the steep directions) even if the shallow directions could be traversed quickly with a larger step. Adaptive methods like Adam address this by maintaining per-parameter learning rates, effectively normalizing gradients by an estimate of the local curvature.

Understanding optimization means understanding this geometry. Once you see why Adam adapts to ill-conditioned curvature and why momentum accumulates velocity in consistent gradient directions, the algorithm becomes intuitive rather than magical.


Historical Context

Gradient descent is ancient by ML standards - Cauchy proposed it in 1847 for solving systems of equations. But the stochastic variant (SGD) was introduced by Robbins and Monro (1951) in the context of stochastic approximation - finding roots of noisy functions. The convergence conditions they proved (step sizes $\eta_t$ satisfying $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$) remain the theoretical foundation.

Momentum was formalized by Polyak (1964) as the "heavy ball" method and independently by Rumelhart, Hinton, and Williams (1986) for neural network training in the backpropagation paper. Nesterov's accelerated gradient (1983) proved that a "look-ahead" variant achieves the optimal $O(1/k^2)$ convergence rate for convex functions.

AdaGrad (Duchi, Hazan, Singer, 2011) introduced adaptive learning rates, motivated by sparse gradient problems in NLP. RMSProp was proposed by Hinton in an unpublished Coursera lecture in 2012, addressing AdaGrad's learning rate decay. Adam (Kingma and Ba, 2015) combined momentum with RMSProp's adaptive learning rates and added bias correction, becoming the dominant optimizer for deep learning almost immediately after publication.

AdamW (Loshchilov and Hutter, 2017) fixed a subtle but important bug in L2 regularization with Adam. The Muon optimizer (Jordan et al., 2024) extends the Shampoo/SOAP line of work, and Lion (Chen et al., 2023) added sign-based momentum to the mix. The field continues to evolve - but Adam with cosine decay remains the dominant recipe for large model training.


Convex vs Non-Convex Optimization

Convex Optimization

A function $f: \mathbb{R}^d \to \mathbb{R}$ is convex if:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y) \quad \forall x, y,\ \lambda \in [0,1]$$

For convex functions: any local minimum is a global minimum. Gradient descent converges to the global optimum. The convergence rate for $L$-smooth convex functions is $O(1/T)$ (GD) or $O(1/T^2)$ (Nesterov accelerated).

Logistic regression, linear SVMs, and linear regression have convex losses. Convex optimization theory fully covers these cases.
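The defining inequality is easy to check numerically for a simple convex function. The sketch below is a toy verification, with an illustrative choice of $f(x) = \|x\|^2$ and dimension, not part of the original lesson:

import numpy as np

# Minimal numeric check of the convexity inequality for f(x) = ||x||^2.
# For a convex function the assertion should never fire.
rng = np.random.default_rng(0)
f = lambda x: float(np.dot(x, x))

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    lam = rng.uniform()
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    assert lhs <= rhs + 1e-9, "convexity violated"

print("Convexity inequality held for 1000 random (x, y, lambda) triples")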

The Non-Convex Reality of Deep Learning

Neural network loss landscapes are non-convex. But recent theoretical and empirical work suggests the non-convexity is "benign" in practice:

  1. Saddle points, not local minima, are the main obstacle. In high dimensions, most critical points where $\nabla L = 0$ are saddle points (some directions have positive curvature, some negative) rather than local minima. SGD with noise escapes saddle points naturally.

  2. Wide minima generalize better. Sharp minima (where the loss increases steeply as you move away) tend to generalize poorly because small data distribution shifts move you out of the minimum. Wide flat minima are robust. This is why larger batch sizes (lower noise SGD) often generalize worse.

  3. Loss decreasing does not mean converging to global minimum. For deep networks, "good enough" is sufficient. The practical question is not "did we find the global minimum?" but "did we find parameters that generalize well?"


Gradient Descent Variants

Batch Gradient Descent

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$$

where $L(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(\theta; x_i, y_i)$ is the full-dataset loss.

Exact gradient, guaranteed convergence for convex problems. But: requires computing over all $n$ examples per step - impractical when $n = 10^9$.

Stochastic Gradient Descent

$$\theta_{t+1} = \theta_t - \eta \hat{g}_t, \quad \hat{g}_t = \nabla_\theta \ell(\theta_t; x_i, y_i)$$

Single example per step. Maximum noise - gradient estimate has high variance. Historical importance: this is what trained the first practical neural networks. Rarely used raw today; mini-batch SGD is standard.

Mini-Batch SGD

$$\hat{g}_t = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla_\theta \ell(\theta_t; x_i, y_i)$$

Batch size $B$ is a trade-off: a larger $B$ means fewer steps, lower variance per gradient estimate, and better GPU utilization; a smaller $B$ means more gradient noise, which often improves generalization.

The linear scaling rule (Goyal et al., 2017): when multiplying the batch size by $k$, multiply the learning rate by $k$. This approximately maintains the same noise-to-signal ratio. It works well up to roughly $B \approx 4096$; beyond that it breaks down and requires warmup.
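A minimal sketch of the linear scaling rule as stated above (the base batch size and learning rate here are illustrative, not from any specific recipe):

def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: lr scales with batch size (Goyal et al., 2017)."""
    return base_lr * new_batch / base_batch

# Illustrative numbers: a recipe tuned at batch 256 with lr 0.1,
# scaled up to batch 2048 (warmup is still needed at large batch sizes).
print(scale_lr(0.1, 256, 2048))   # 0.8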

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from typing import List, Tuple, Optional

# ============================================================
# Core optimizer implementations from scratch
# ============================================================

class VanillaSGD:
    """Gradient descent with optional momentum."""

    def __init__(self, params, lr: float = 0.01,
                 momentum: float = 0.0, weight_decay: float = 0.0):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        # Velocity for each parameter (initialized to zero)
        self.velocity = [torch.zeros_like(p) for p in self.params]

    def step(self):
        for p, v in zip(self.params, self.velocity):
            if p.grad is None:
                continue
            g = p.grad.data

            # L2 regularization: add weight_decay * p to gradient
            if self.weight_decay != 0:
                g = g + self.weight_decay * p.data

            # Momentum: v = mu * v + g
            v.mul_(self.momentum).add_(g)

            # Update: theta -= lr * v
            p.data.sub_(self.lr * v)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

Momentum

The Heavy Ball Intuition

Gradient descent oscillates in narrow valleys - it overshoots in the steep direction, undershoots in the shallow direction. Momentum smooths this by accumulating velocity: in consistent gradient directions, velocity builds up. In oscillating directions, positive and negative updates cancel.

The analogy is a heavy ball rolling on the loss surface. It does not change direction immediately when it hits an opposing gradient - it has inertia. For convex quadratics, momentum reduces the condition number's effect on convergence from $O(\kappa)$ (gradient descent) to $O(\sqrt{\kappa})$ (heavy ball).

$$v_t = \mu v_{t-1} + g_t, \qquad \theta_{t+1} = \theta_t - \eta v_t$$
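To see the effect described above, here is a small sketch (with illustrative learning rate, momentum, and step count) comparing plain gradient descent against heavy-ball momentum on an ill-conditioned quadratic $f(\theta) = \frac{1}{2}(100\,\theta_1^2 + \theta_2^2)$, condition number $\kappa = 100$:

import numpy as np

# Diagonal quadratic: curvature 100 along theta_1, 1 along theta_2.
# Gradient is elementwise H * theta.
H = np.array([100.0, 1.0])
theta_gd = np.array([1.0, 1.0])
theta_mom = np.array([1.0, 1.0])
v = np.zeros(2)
lr, mu = 0.015, 0.9   # lr must stay below 2/100 for plain GD stability

for t in range(200):
    theta_gd = theta_gd - lr * H * theta_gd        # plain gradient descent
    v = mu * v + H * theta_mom                     # heavy-ball velocity
    theta_mom = theta_mom - lr * v

print("GD       distance to optimum:", np.linalg.norm(theta_gd))
print("Momentum distance to optimum:", np.linalg.norm(theta_mom))

After 200 steps, plain GD is still limping along the shallow direction while momentum has converged several orders of magnitude closer, which is the $O(\kappa)$ vs $O(\sqrt{\kappa})$ gap in miniature.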

Nesterov Accelerated Gradient

Nesterov's key insight: instead of computing the gradient at the current position $\theta_t$, compute it at the "lookahead" position $\theta_t - \mu v_{t-1}$ (where you expect to be after applying momentum).

$$v_t = \mu v_{t-1} + g_t^{\text{lookahead}}, \quad g_t^{\text{lookahead}} = \nabla L(\theta_t - \mu v_{t-1})$$

This corrects for momentum overshooting: by the time you apply the gradient correction, you account for where momentum is taking you. For convex functions, Nesterov achieves the optimal $O(1/k^2)$ convergence rate. In practice, it often trains faster with less oscillation.

class NesterovSGD:
    """
    Nesterov Accelerated Gradient.
    Compute gradient at lookahead position, not current position.
    Achieves O(1/k^2) convergence for convex functions vs O(1/k) for GD.
    """

    def __init__(self, params, lr: float = 0.01, momentum: float = 0.9):
        self.params = list(params)
        self.lr = lr
        self.mu = momentum
        self.velocity = [torch.zeros_like(p) for p in self.params]

    def step(self, closure=None):
        # PyTorch's standard Nesterov implementation
        # is equivalent to: update velocity, then step with
        # g + mu * v (the gradient at the lookahead point)
        for p, v in zip(self.params, self.velocity):
            if p.grad is None:
                continue
            g = p.grad.data
            v.mul_(self.mu).add_(g)
            # Nesterov: use g + mu * v_new as effective gradient
            p.data.sub_(self.lr * (g + self.mu * v))

Adaptive Learning Rate Methods

AdaGrad

AdaGrad (Duchi et al., 2011) adapts the learning rate for each parameter based on the accumulated squared gradients. Parameters that receive frequent large gradients get a smaller learning rate; sparse parameters get a larger rate.

$$G_t = G_{t-1} + g_t^2 \quad \text{(accumulated squared gradient)}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t$$

Strength: great for sparse gradients (NLP word embeddings, sparse features). Weakness: $G_t$ is monotonically increasing, so the effective learning rate $\eta / \sqrt{G_t}$ decays to zero. Training stops making progress.

RMSProp

Hinton's fix: replace the full accumulation with an exponential moving average:

$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon}\, g_t$$

The EMA with decay $\rho \approx 0.9$ "forgets" old gradients - the effective learning rate remains bounded and the optimizer never permanently slows down.
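A minimal sketch of the two update rules above on a single toy parameter makes the difference concrete: AdaGrad's effective learning rate shrinks toward zero, while RMSProp's stays bounded (the constant unit gradient and hyperparameters are illustrative only):

import numpy as np

eta, eps, rho = 0.1, 1e-8, 0.9
G, Eg2 = 0.0, 0.0            # AdaGrad accumulator, RMSProp EMA

for t in range(1, 10001):
    g = 1.0                                  # constant unit gradient, for illustration
    G += g ** 2                              # AdaGrad: monotone accumulation
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2     # RMSProp: exponential moving average
    if t in (10, 100, 1000, 10000):
        print(f"t={t:>5}  AdaGrad eff. lr={eta / (np.sqrt(G) + eps):.5f}  "
              f"RMSProp eff. lr={eta / (np.sqrt(Eg2) + eps):.5f}")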


Adam: The Full Derivation

Adam (Kingma and Ba, 2015) combines momentum (first moment) with RMSProp (second moment), with a crucial bias correction that makes it work well from the first step.

First moment (momentum term): exponential moving average of gradients: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

Second moment (adaptive learning rate): exponential moving average of squared gradients: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

Why bias correction is needed: at $t=1$, $m_1 = (1-\beta_1)g_1$. With $\beta_1 = 0.9$, that is $0.1\, g_1$ - only 10% of the first gradient. The estimate is heavily biased toward zero. The same applies to $v_t$.

Bias correction: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

At $t=1$: $\hat{m}_1 = m_1 / (1-0.9) = 0.1\, g_1 / 0.1 = g_1$. At large $t$: $\hat{m}_t \approx m_t$ (the correction becomes negligible as $\beta^t \to 0$).

The Adam update: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

class AdamOptimizer:
    """
    Adam optimizer from scratch with full state tracking.
    Kingma & Ba, ICLR 2015.

    Key design decisions:
    - beta1=0.9: momentum decay. Accumulates ~1/(1-0.9)=10 steps of gradient history
    - beta2=0.999: second moment decay. Accumulates ~1/(1-0.999)=1000 steps
    - eps=1e-8: numerical stability, prevents division by zero
    - Bias correction: critical for early training steps
    """

    def __init__(self, params, lr: float = 1e-3,
                 beta1: float = 0.9, beta2: float = 0.999,
                 eps: float = 1e-8, weight_decay: float = 0.0):
        self.params = list(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.weight_decay = weight_decay

        # Adam state: first moment (m), second moment (v), step count (t)
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]
        self.t = 0

    def step(self):
        self.t += 1
        b1, b2 = self.beta1, self.beta2

        # Bias correction factors
        bias_correction1 = 1 - b1 ** self.t
        bias_correction2 = 1 - b2 ** self.t

        for i, p in enumerate(self.params):
            if p.grad is None:
                continue
            g = p.grad.data

            # Note: weight_decay here is L2 regularization applied to gradient
            # This is subtly different from AdamW's decoupled weight decay
            if self.weight_decay != 0:
                g = g + self.weight_decay * p.data

            # Update first moment (momentum)
            self.m[i].mul_(b1).add_((1 - b1) * g)

            # Update second moment (adaptive learning rate denominator)
            self.v[i].mul_(b2).add_((1 - b2) * g * g)

            # Bias-corrected estimates
            m_hat = self.m[i] / bias_correction1
            v_hat = self.v[i] / bias_correction2

            # Parameter update
            p.data.sub_(self.lr * m_hat / (v_hat.sqrt() + self.eps))

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

    def state_dict(self) -> dict:
        """Inspect optimizer state - useful for debugging training issues."""
        return {
            'step': self.t,
            'lr': self.lr,
            'first_moment_norms': [m.norm().item() for m in self.m],
            'second_moment_norms': [v.norm().item() for v in self.v],
            'effective_lr': [
                self.lr / ((v / (1 - self.beta2**self.t)).sqrt().mean().item()
                           + self.eps)
                for v in self.v
            ]
        }
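As a quick usage sketch (reusing the imports above; the toy linear model and synthetic data are assumptions for illustration), the from-scratch AdamOptimizer drops into an ordinary training loop:

torch.manual_seed(0)
toy_model = nn.Linear(4, 1)
opt = AdamOptimizer(toy_model.parameters(), lr=1e-2)

X_toy = torch.randn(64, 4)
y_toy = X_toy @ torch.tensor([[1.0], [-2.0], [0.5], [3.0]])  # synthetic targets

for step in range(200):
    opt.zero_grad()
    loss = F.mse_loss(toy_model(X_toy), y_toy)
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.6f}")
print(f"optimizer step count: {opt.state_dict()['step']}")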

AdamW: Decoupled Weight Decay

The bug in Adam with L2 regularization: L2 adds $\lambda \theta$ to the gradient, so the effective weight decay is $\lambda \eta / (\sqrt{\hat{v}} + \epsilon)$. For parameters with large gradient variance (large $\hat{v}$), the effective decay is small. For parameters with small gradient variance, decay is large. This breaks the intended uniform regularization.

AdamW (Loshchilov and Hutter, 2017) decouples weight decay from the gradient update:

$$\theta_{t+1} = \theta_t - \eta \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t\right)$$

The weight decay $\lambda \theta_t$ is applied directly to the parameters, independent of the adaptive gradient scaling. This gives uniform regularization across all parameters regardless of their gradient statistics.

class AdamW:
    """
    AdamW: Adam with decoupled weight decay.
    Loshchilov & Hutter, 2017 (published at ICLR 2019).

    Standard in all modern LLM training:
    GPT-2, BERT, T5, GPT-3, LLaMA all use AdamW.
    """

    def __init__(self, params, lr: float = 1e-3,
                 beta1: float = 0.9, beta2: float = 0.999,
                 eps: float = 1e-8, weight_decay: float = 0.01):
        self.params = list(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.weight_decay = weight_decay  # decoupled from gradient update

        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]
        self.t = 0

    def step(self):
        self.t += 1
        b1, b2 = self.beta1, self.beta2
        bc1 = 1 - b1 ** self.t
        bc2 = 1 - b2 ** self.t

        for i, p in enumerate(self.params):
            if p.grad is None:
                continue
            g = p.grad.data

            # Update moment estimates (no weight decay in gradient!)
            self.m[i].mul_(b1).add_((1 - b1) * g)
            self.v[i].mul_(b2).add_((1 - b2) * g * g)

            m_hat = self.m[i] / bc1
            v_hat = self.v[i] / bc2

            # Gradient step
            p.data.sub_(self.lr * m_hat / (v_hat.sqrt() + self.eps))

            # Decoupled weight decay: applied directly, not through gradient
            p.data.mul_(1 - self.lr * self.weight_decay)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()


def compare_adam_adamw():
    """
    Show that AdamW provides more uniform regularization than Adam+L2.
    With Adam+L2, effective weight decay depends on gradient variance.
    With AdamW, weight decay is constant for all parameters.
    """
    import copy

    torch.manual_seed(42)
    d = 10

    # Two identical copies of the same model, one per optimizer,
    # so both updates start from the same parameters
    model_adam = nn.Sequential(
        nn.Linear(d, 100),   # large gradient parameters
        nn.ReLU(),
        nn.Linear(100, 1)    # small gradient parameters
    )
    model_adamw = copy.deepcopy(model_adam)

    # Adam+L2 (weight_decay applied to gradient)
    opt_adam = torch.optim.Adam(model_adam.parameters(), lr=1e-3, weight_decay=0.01)

    # AdamW (weight_decay decoupled)
    opt_adamw = torch.optim.AdamW(model_adamw.parameters(), lr=1e-3, weight_decay=0.01)

    X = torch.randn(100, d)
    y = torch.randn(100, 1)

    print("Adam with L2 vs AdamW: effective regularization comparison")
    for opt_name, model, opt in [('Adam+L2', model_adam, opt_adam),
                                 ('AdamW', model_adamw, opt_adamw)]:
        opt.zero_grad()
        pred = model(X)
        loss = F.mse_loss(pred, y)
        loss.backward()
        opt.step()
        # Check parameter norms after update
        param_norms = [p.norm().item() for p in model.parameters()]
        print(f"{opt_name}: param norms = {[f'{n:.4f}' for n in param_norms]}")

Learning Rate Schedules

The learning rate schedule is as important as the optimizer. The right schedule can double training efficiency. The wrong schedule causes training collapse.

Warmup

Large models are sensitive to the learning rate at initialization. With Adam, the second moment estimate $\hat{v}$ is built from only a handful of noisy gradients during the first steps, so the effective step size $\eta / (\sqrt{\hat{v}} + \epsilon)$ can be far larger than intended for parameters whose early gradients happen to be small. This causes large, unstable updates early in training.

Linear warmup for $T_w$ steps: $\eta_t = \eta_{\max} \cdot t / T_w$. This gives the second moment estimates time to reach accurate values before taking large steps.

Standard for transformer training: 1-2% of total steps as warmup. GPT-3 uses 375M tokens of warmup out of 300B total (0.125%).

Cosine Annealing

After warmup, decay the learning rate following a cosine curve:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t - T_w}{T_{\max} - T_w}\,\pi\right)\right)$$

This is smooth (no abrupt drops), naturally slows training near the end (allowing fine-grained convergence), and restarts can be used to escape local minima (SGDR: warm restarts).

class CosineAnnealingWithWarmup:
    """
    Linear warmup + cosine annealing learning rate schedule.
    Standard for LLM training: GPT-4, LLaMA, Mistral all use this.
    """

    def __init__(self, optimizer: torch.optim.Optimizer,
                 warmup_steps: int,
                 total_steps: int,
                 min_lr: float = 1e-5,
                 max_lr: float = 1e-3):
        self.optimizer = optimizer
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.current_step = 0

    def step(self):
        self.current_step += 1
        lr = self.get_lr(self.current_step)
        for pg in self.optimizer.param_groups:
            pg['lr'] = lr
        return lr

    def get_lr(self, step: int) -> float:
        if step < self.warmup_steps:
            # Linear warmup: 0 -> max_lr
            return self.max_lr * step / self.warmup_steps
        elif step >= self.total_steps:
            return self.min_lr
        else:
            # Cosine decay: max_lr -> min_lr
            progress = (step - self.warmup_steps) / (
                self.total_steps - self.warmup_steps)
            cosine_decay = 0.5 * (1 + np.cos(np.pi * progress))
            return self.min_lr + (self.max_lr - self.min_lr) * cosine_decay


def plot_lr_schedule(warmup_steps=1000, total_steps=10000,
                     min_lr=1e-5, max_lr=3e-4):
    """Visualize the warmup + cosine decay schedule."""
    steps = np.arange(total_steps)
    lrs = []

    for step in steps:
        if step < warmup_steps:
            lr = max_lr * step / warmup_steps
        else:
            progress = (step - warmup_steps) / (total_steps - warmup_steps)
            cosine_decay = 0.5 * (1 + np.cos(np.pi * progress))
            lr = min_lr + (max_lr - min_lr) * cosine_decay
        lrs.append(lr)

    plt.figure(figsize=(10, 4))
    plt.plot(steps, lrs, 'b-', linewidth=2)
    plt.axvline(x=warmup_steps, color='r', linestyle='--',
                label=f'Warmup ends (step {warmup_steps})')
    plt.xlabel('Training Step')
    plt.ylabel('Learning Rate')
    plt.title('Cosine Annealing with Linear Warmup')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('/tmp/lr_schedule.png', dpi=100)
    print(f"Max LR: {max_lr:.2e}, Min LR: {min_lr:.2e}")
    print(f"LR at warmup end: {lrs[warmup_steps]:.6f}")
    print(f"LR at halfway: {lrs[total_steps//2]:.6f}")
    print(f"LR at end: {lrs[-1]:.6f}")

plot_lr_schedule()

Gradient Clipping

In deep networks, gradients can explode - especially in RNNs and early transformer training. A single large gradient step can destroy weeks of training.

Gradient norm clipping: if $\|g\| > \text{max\_norm}$, scale $g \leftarrow g \cdot \text{max\_norm} / \|g\|$. This preserves the gradient direction while bounding its magnitude.

The difference from gradient value clipping (element-wise): norm clipping respects the relative magnitudes within the gradient - if one dimension is large, all dimensions are scaled proportionally.

def clip_gradient_norm(parameters, max_norm: float,
                       norm_type: float = 2.0) -> float:
    """
    Clip gradients by global norm.
    Returns the pre-clipping gradient norm (useful for monitoring).

    Standard practice:
    - Transformers: max_norm = 1.0 (GPT-3, LLaMA standard)
    - RNNs: max_norm = 5.0 or 1.0
    - CNNs: often not needed, but max_norm = 1.0 is safe default

    Important: call AFTER loss.backward() and BEFORE optimizer.step()
    """
    params = [p for p in parameters if p.grad is not None]
    if not params:
        return 0.0

    # Compute global gradient norm
    total_norm = torch.norm(
        torch.stack([torch.norm(p.grad.data, norm_type) for p in params]),
        norm_type
    ).item()

    # Scale gradients if norm exceeds max_norm
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for p in params:
            p.grad.data.mul_(clip_coef)

    return total_norm


def monitor_gradient_health(model: nn.Module, step: int) -> dict:
    """
    Compute gradient statistics for training diagnostics.
    Call after backward(), before optimizer step.

    Warning signs:
    - grad_norm > 10: potential explosion (or bad batch)
    - grad_norm < 1e-4: vanishing gradients
    - grad_norm suddenly spikes: check for bad data samples
    """
    grad_norms = {}
    total_norm_sq = 0.0

    for name, p in model.named_parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2).item()
            grad_norms[name] = param_norm
            total_norm_sq += param_norm ** 2

    global_norm = total_norm_sq ** 0.5

    return {
        'step': step,
        'global_grad_norm': global_norm,
        'per_layer_norms': grad_norms,
        'max_layer_norm': max(grad_norms.values()) if grad_norms else 0,
        'min_layer_norm': min(grad_norms.values()) if grad_norms else 0,
    }


# Complete training loop with all best practices
def train_with_best_practices(model, train_loader, total_steps=10000):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,
        betas=(0.9, 0.95),   # beta2=0.95 for LLM training (more responsive)
        weight_decay=0.1
    )

    scheduler = CosineAnnealingWithWarmup(
        optimizer,
        warmup_steps=int(0.02 * total_steps),   # 2% warmup
        total_steps=total_steps,
        min_lr=3e-5,
        max_lr=3e-4
    )

    grad_norm_history = []

    for step, (x, y) in enumerate(train_loader):
        if step >= total_steps:
            break

        optimizer.zero_grad()
        loss = model(x, y)
        loss.backward()

        # Clip gradients (monitor norm BEFORE clipping)
        grad_norm = clip_gradient_norm(model.parameters(), max_norm=1.0)
        grad_norm_history.append(grad_norm)

        # Alert on gradient spikes
        if grad_norm > 5.0:
            print(f"WARNING: Large gradient at step {step}: {grad_norm:.2f}")

        optimizer.step()
        scheduler.step()

        if step % 100 == 0:
            lr = optimizer.param_groups[0]['lr']
            print(f"Step {step}: loss={loss.item():.4f}, "
                  f"grad_norm={grad_norm:.3f}, lr={lr:.6f}")

    return grad_norm_history

Loss Landscape Visualization

Understanding why an optimizer succeeds or fails requires visualizing the loss landscape. The standard technique (Li et al., 2018) is to project the landscape onto two random directions through the final parameter vector.

def visualize_loss_landscape(
    model: nn.Module,
    loss_fn,
    data_loader,
    resolution: int = 20,
    range_val: float = 1.0
) -> None:
    """
    Loss landscape visualization via random direction projection.
    Li et al., NeurIPS 2018: "Visualizing the Loss Landscape of Neural Nets"

    Creates a 2D grid around the final trained parameters.
    Shows whether the minimum is sharp (bad generalization) or flat (good).
    """
    # Save final parameters
    theta_star = [p.data.clone() for p in model.parameters()]

    # Sample two random Gaussian directions, filter-normalized
    # Filter normalization: scale each filter to have same norm as
    # the corresponding final parameter filter (makes scale consistent)
    def make_direction():
        direction = [torch.randn_like(p) for p in model.parameters()]
        # Filter normalization
        for d, p in zip(direction, model.parameters()):
            d.mul_(p.norm() / (d.norm() + 1e-10))
        return direction

    dir1 = make_direction()
    dir2 = make_direction()

    alpha_range = np.linspace(-range_val, range_val, resolution)
    beta_range = np.linspace(-range_val, range_val, resolution)
    losses = np.zeros((resolution, resolution))

    # Evaluate loss at each point in the grid
    for i, alpha in enumerate(alpha_range):
        for j, beta in enumerate(beta_range):
            # Move parameters to theta_star + alpha * dir1 + beta * dir2
            for param, star, d1, d2 in zip(model.parameters(),
                                           theta_star, dir1, dir2):
                param.data.copy_(star + alpha * d1 + beta * d2)

            # Compute loss
            total_loss = 0.0
            n_batches = 0
            with torch.no_grad():
                for x, y in data_loader:
                    total_loss += loss_fn(model(x), y).item()
                    n_batches += 1
                    if n_batches >= 5:  # subsample for speed
                        break
            losses[i, j] = total_loss / n_batches

    # Restore final parameters
    for param, star in zip(model.parameters(), theta_star):
        param.data.copy_(star)

    # Plot: contour on the left, 3D surface on the right
    # (the 3D axes must be created with projection='3d' from the start)
    fig = plt.figure(figsize=(14, 5))
    ax1 = fig.add_subplot(121)
    ax2 = fig.add_subplot(122, projection='3d')

    # Contour plot
    X, Y = np.meshgrid(alpha_range, beta_range)
    levels = np.percentile(losses, np.linspace(0, 100, 30))
    ax1.contourf(X, Y, losses.T, levels=levels, cmap='viridis')
    ax1.set_title('Loss Landscape Contours')
    ax1.set_xlabel('Direction 1')
    ax1.set_ylabel('Direction 2')

    # 3D surface
    ax2.plot_surface(X, Y, losses.T, cmap='viridis', alpha=0.8)
    ax2.set_title('Loss Landscape 3D')

    plt.tight_layout()
    plt.savefig('/tmp/loss_landscape.png', dpi=100)
    print(f"Loss range: [{losses.min():.4f}, {losses.max():.4f}]")
    print(f"Loss at center (trained params): {losses[resolution//2, resolution//2]:.4f}")

Why Adam Beats SGD for Transformers But SGD Beats Adam for CNNs

This is one of the most important empirical observations in modern deep learning, and understanding the reason gives deep insight into both optimizers.

For CNNs: The loss landscape is relatively well-conditioned. Convolutional filters have similar gradient scales across layers. L2 regularization (which for SGD+momentum is essentially equivalent to weight decay) provides effective regularization. SGD+momentum with cosine decay converges to flatter, better-generalizing minima than Adam for image classification. The famous ResNet and WideResNet training recipes explicitly use SGD+momentum.

For Transformers: The loss landscape is ill-conditioned. Gradient magnitudes vary enormously: the query/key projections have different gradient scales than the FFN weights, which have different scales than the embedding layer. This heterogeneity makes a fixed learning rate catastrophic - it is either too large for some layers or too small for others. Adam's per-parameter adaptive learning rates normalize these differences, making the effective step size consistent across all parameters.

A second factor: embedding parameters have sparse gradients (only the tokens in the current batch receive gradient updates). AdaGrad/Adam maintain large second moment estimates for embedding rows that appear frequently, giving them smaller effective learning rates - which prevents overshooting on common tokens while still training rare tokens adequately.


Muon Optimizer

Muon (MomentUm Orthogonalized by Newton-Schulz; Jordan et al., 2024) extends the Shampoo line of work by orthogonalizing the gradient update. The key idea: instead of normalizing gradient components independently (Adam) or using a full second-order Hessian approximation (Shampoo), Muon applies the Newton-Schulz iteration to orthogonalize the momentum matrix.

def newton_schulz_5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """
    5th-order Newton-Schulz iteration for matrix orthogonalization.
    Finds U in the decomposition G = US where U is orthogonal.
    Used in Muon optimizer for gradient normalization.
    """
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)

    X = G / (G.norm() + 1e-7)  # normalize

    if G.shape[0] > G.shape[1]:
        X = X.T

    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X

    if G.shape[0] > G.shape[1]:
        X = X.T

    return X


class Muon(torch.optim.Optimizer):
    """
    Muon: Momentum Orthogonalized by Newton-Schulz.
    Jordan et al., 2024.

    Best results on 2D parameter matrices (weight matrices in Linear layers).
    Use Adam for 1D parameters (biases, norms) and embeddings.
    Shows strong results in small/medium scale LLM training.
    """

    def __init__(self, params, lr: float = 0.02, momentum: float = 0.95,
                 ns_steps: int = 5):
        defaults = dict(lr=lr, momentum=momentum, ns_steps=ns_steps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr = group['lr']
            momentum = group['momentum']
            ns_steps = group['ns_steps']

            for p in group['params']:
                if p.grad is None:
                    continue

                g = p.grad
                state = self.state[p]

                # Initialize momentum buffer
                if 'momentum_buffer' not in state:
                    state['momentum_buffer'] = torch.zeros_like(g)

                buf = state['momentum_buffer']
                buf.mul_(momentum).add_(g)

                # Orthogonalize gradient update for 2D parameters
                if g.ndim == 2:
                    update = newton_schulz_5(buf, steps=ns_steps)
                else:
                    update = buf  # 1D params: standard momentum

                p.data.sub_(lr * update)

Second-Order Methods

First-order methods (gradient descent, Adam) use only first derivatives. Second-order methods use the Hessian $H = \nabla^2 L$ to approximate the curvature and take a more direct path to the minimum.

Newton's method update:

$$\theta_{t+1} = \theta_t - H_t^{-1} g_t$$

This step goes directly to the minimum of the local quadratic approximation. Convergence is quadratic for convex problems. But: computing and inverting the $p \times p$ Hessian costs $O(p^3)$ - for a model with $p = 10^9$ parameters, this is completely infeasible.
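For intuition, here is a tiny sketch of Newton's method on a 2D quadratic, where the Hessian is constant and a single Newton step lands exactly on the minimizer (the matrix and vector values are illustrative):

import numpy as np

# Quadratic f(theta) = 0.5 * theta^T H theta - b^T theta, minimized at H^{-1} b
H = np.array([[100.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

theta = np.zeros(2)
grad = H @ theta - b
theta = theta - np.linalg.solve(H, grad)   # Newton step: theta - H^{-1} g

print("After one Newton step:", theta)          # [0.01, 1.0]
print("Exact minimizer:      ", np.linalg.solve(H, b))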

L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) approximates the inverse Hessian using the last $m$ gradient differences, requiring only $O(mp)$ storage. It is standard for full-batch optimization on small-to-medium problems. PyTorch includes it:

def train_with_lbfgs(model: nn.Module, X: torch.Tensor, y: torch.Tensor,
                     max_iter: int = 100):
    """
    L-BFGS for full-batch optimization.
    Effective for:
    - Fine-tuning small models on modest datasets
    - Physics-informed neural networks (PINNs)
    - Neural ODEs with small state spaces
    NOT suitable for:
    - Mini-batch training (L-BFGS is not stochastic)
    - Models with stochastic layers (dropout active)
    - Any model > ~100M parameters
    """
    optimizer = torch.optim.LBFGS(
        model.parameters(),
        lr=1.0,
        max_iter=20,                      # inner iterations per step
        history_size=100,                 # number of gradient pairs to remember
        line_search_fn='strong_wolfe'     # line search for step size
    )

    def closure():
        """L-BFGS requires a closure that recomputes loss and gradient."""
        optimizer.zero_grad()
        pred = model(X)
        loss = F.mse_loss(pred, y)
        loss.backward()
        return loss

    for step in range(max_iter):
        loss = optimizer.step(closure)
        if step % 10 == 0:
            print(f"L-BFGS step {step}: loss={loss.item():.6f}")

Production Engineering Notes

Optimizer Hyperparameter Tuning Hierarchy

Not all hyperparameters are equal. The order of importance for a standard AdamW + cosine schedule (a configuration sketch follows this list):

  1. Learning rate ($\eta$): most important. Search 1e-4 to 5e-3 on a log scale.
  2. Weight decay ($\lambda$): second most important. Too low = overfitting. Too high = underfitting. 0.01-0.1 for transformers.
  3. Warmup steps: 1-5% of total training steps. Too little = unstable start. Too much = wasted budget.
  4. beta2: for transformers, 0.95 > 0.999 (more responsive to recent gradients). For fine-tuning, 0.999 is safe.
  5. beta1: 0.9 is rarely wrong. Sometimes 0.95 for very long training runs.
  6. epsilon: 1e-8 is the default. For mixed precision training (float16), use 1e-6 to avoid underflow.
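A sketch of this hierarchy as a concrete configuration, reusing the CosineAnnealingWithWarmup class defined above and a log-spaced grid for the most important knob. The specific values are assumptions that follow the ranges in the list, not a prescribed recipe:

import numpy as np
import torch

def make_optimizer_and_schedule(model, total_steps: int, lr: float):
    """AdamW + warmup/cosine schedule using the settings discussed above."""
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=lr,
        betas=(0.9, 0.95),        # beta2=0.95 for long transformer runs
        eps=1e-8,                 # consider 1e-6 under float16
        weight_decay=0.1,
    )
    scheduler = CosineAnnealingWithWarmup(
        optimizer,
        warmup_steps=int(0.02 * total_steps),   # 2% warmup
        total_steps=total_steps,
        min_lr=0.1 * lr,
        max_lr=lr,
    )
    return optimizer, scheduler

# Search the learning rate first, on a log scale (most important hyperparameter)
lr_grid = np.logspace(np.log10(1e-4), np.log10(5e-3), num=5)
print("candidate learning rates:", [f"{lr:.1e}" for lr in lr_grid])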

Mixed Precision and Optimizer State

In mixed precision training, parameters are stored in float16 for forward/backward pass (memory efficiency, speed) but optimizer state and parameter updates are done in float32 (numerical stability).

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # manages loss scaling for float16 gradients

def train_step_mixed_precision(model, optimizer, x, y, scaler):
    optimizer.zero_grad()

    with autocast():  # float16 forward pass
        loss = model(x, y)

    # Scale loss to prevent float16 gradient underflow
    scaler.scale(loss).backward()

    # Unscale before gradient clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Step (float32 update internally)
    scaler.step(optimizer)
    scaler.update()

    return loss.item()
Warning: Never Clip Gradients After Scaling Without Unscaling First

In mixed-precision training, gradients are scaled by a large factor to prevent float16 underflow. If you call clip_grad_norm_ before scaler.unscale_(), you are clipping the artificially inflated gradients. The effective max_norm becomes max_norm / scale_factor - possibly near zero. Always call scaler.unscale_(optimizer) before gradient clipping.

Danger: Adam on Sparse Gradients Without Regularization Is a Trap

Adam maintains a second moment estimate $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ for every parameter. For embedding parameters with sparse gradients (only updated when the corresponding token appears in the batch), $v_t$ becomes very small for rare tokens - causing Adam to take enormous steps when those tokens finally appear. The fix: use weight decay (AdamW) and consider a larger $\epsilon$ (1e-6 to 1e-4) for the embedding layer specifically. In PyTorch, use param_groups to set different hyperparameters for embedding vs other parameters.
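A sketch of the param_groups fix mentioned above. It assumes a model whose embedding parameters can be identified by the substring 'embed' in their names; the attribute matching and hyperparameter values are illustrative only:

def build_param_groups(model: nn.Module):
    """Separate hyperparameters for embedding vs. other parameters."""
    embed_params, other_params = [], []
    for name, p in model.named_parameters():
        if 'embed' in name:        # illustrative: match by parameter name
            embed_params.append(p)
        else:
            other_params.append(p)
    return [
        # Larger eps bounds the step size when v_t is tiny for rare tokens
        {'params': embed_params, 'eps': 1e-6, 'weight_decay': 0.1},
        {'params': other_params, 'eps': 1e-8, 'weight_decay': 0.1},
    ]

# optimizer = torch.optim.AdamW(build_param_groups(model), lr=3e-4, betas=(0.9, 0.95))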


Interview Questions and Answers

Q1: Derive the Adam optimizer. Why is bias correction necessary?

Adam maintains two exponential moving averages: first moment $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ (momentum) and second moment $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ (adaptive learning rate denominator).

Bias correction is necessary because both estimates are initialized to zero, causing systematic underestimation early in training. Expanding $m_t$: $m_t = (1-\beta_1)\sum_{i=1}^t \beta_1^{t-i} g_i$. If we assume $g_i \approx g$ (stationary), then $m_t = g(1-\beta_1^t)$. But we want an estimate of $g$, so we correct: $\hat{m}_t = m_t / (1-\beta_1^t)$.

At $t=1$: $m_1 = 0.1\, g_1$ (only 10% of the first gradient with $\beta_1=0.9$), but $\hat{m}_1 = g_1$. At $t=10$: $1-0.9^{10} = 0.65$, still needs correction. At $t=100$: $1-0.9^{100} \approx 1$, the correction is negligible. The bias correction matters most in the first few thousand steps - exactly when learning rate stability is critical.
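The correction factors quoted above can be checked in a couple of lines; note how much longer the second-moment correction ($\beta_2 = 0.999$) stays significant:

for t in (1, 10, 100, 1000):
    # Bias-correction denominators for beta1=0.9 and beta2=0.999
    print(f"t={t:>4}  1 - 0.9**t = {1 - 0.9**t:.4f}   1 - 0.999**t = {1 - 0.999**t:.4f}")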

Q2: What is the difference between L2 regularization with Adam and AdamW?

L2 regularization adds $\lambda \theta$ to the gradient before the optimizer update. In Adam, this means the regularization term is also adaptively scaled: the effective weight decay for parameter $i$ is $\lambda \eta / (\sqrt{\hat{v}_i} + \epsilon)$. Parameters with large gradient variance (large $\hat{v}$) receive weak regularization; parameters with small gradient variance receive strong regularization.

AdamW applies weight decay directly to the parameters after the gradient step: $\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$. This gives uniform weight decay $\lambda$ for all parameters, independent of gradient statistics.

In practice: AdamW consistently outperforms Adam+L2 regularization for transformer models. The LLaMA, GPT-4, and most modern LLM training recipes use AdamW. The difference is most pronounced when some layer types (e.g., attention vs FFN) have significantly different gradient variances.

Q3: Explain learning rate warmup. Why do transformers need it but ResNets typically do not?

Warmup gradually increases the learning rate from 0 (or a small value) to the target maximum over the first few hundred or thousand training steps.

The Adam-specific reason: at step 1, the second moment estimate $\hat{v}_1 = g_1^2$ (a single gradient squared). The effective learning rate is $\eta / \sqrt{\hat{v}_1} = \eta / |g_1|$. For parameters with small gradients, $|g_1|$ may be tiny, making the effective step extremely large. As training progresses and $\hat{v}$ accumulates a history, it stabilizes. Warmup gives this stabilization time before taking large steps.

Transformers are more sensitive because: (1) they have more parameter types with heterogeneous gradient scales; (2) LayerNorm parameters, embedding weights, and attention projections all have different gradient magnitudes; (3) attention patterns have sharp gradients at initialization that can destabilize early training.

ResNets are less sensitive because: convolutional filters have relatively homogeneous gradient scales; batch normalization stabilizes training; the architecture is more robust to moderate learning rates at initialization.

Q4: What is gradient clipping, and when should you use it?

Gradient norm clipping: if the global L2 norm of all gradients exceeds max_norm, scale all gradients proportionally so the norm equals max_norm. The gradient direction is preserved; only the magnitude is bounded.

When to use: (1) RNN/LSTM training - BPTT through long sequences causes gradient explosion; clipping at 5.0 or 1.0 is standard; (2) Transformer training - early in training, attention weights are random and can produce large gradient spikes; max_norm=1.0 is standard for GPT-scale training; (3) Any time you see loss NaN or sudden spikes in training curves.

When NOT to use (or use carefully): fine-tuning pre-trained models - if your gradients are already small, clipping at 1.0 has no effect and adds overhead; reward model training in RLHF - gradient scale depends heavily on reward signal distribution.

The key insight: gradient clipping is not a fix for vanishing gradients (it only helps with explosion). For vanishing gradients, you need architectural solutions (residual connections, LayerNorm, better initialization).

Q5: Why does Adam generally beat SGD for transformers but SGD+momentum beats Adam for CNNs?

The core reason is the condition number of the loss landscape. For transformers:

  • Gradient magnitudes vary dramatically across layer types and positions
  • Embedding gradients are sparse; attention projection gradients are dense
  • The effective Hessian has very high condition number, making a uniform learning rate ineffective
  • Adam normalizes each parameter by its gradient history, equalizing effective learning rates

For CNNs:

  • Convolutional filters at each layer have similar gradient scales due to weight sharing
  • The loss landscape is better conditioned
  • SGD+momentum finds flatter minima (wider basins of attraction) with better generalization
  • Adam's adaptive normalization can hurt CNNs by adapting away from the implicit regularization effect of uniform gradient steps

The empirical evidence is clear: WideResNet achieves 96.7% on CIFAR-10 with SGD+momentum; Adam typically reaches 95-96%. For GPT-scale transformers, SGD convergence speed is 10-100x slower than Adam, often requiring impractically small learning rates to avoid instability.

Q6: Describe cosine annealing with warm restarts. When would you use restarts?

Standard cosine annealing (without restarts) decays the learning rate from $\eta_{\max}$ to $\eta_{\min}$ following a cosine curve over the entire training run. With warm restarts (SGDR, Loshchilov & Hutter 2016), the learning rate periodically resets to $\eta_{\max}$ and decays again, with each cycle potentially longer than the last.

Why restarts help: training often settles into a basin with a good but not optimal minimum. The periodic large learning rate resets allow the optimizer to escape this basin and potentially find a better one. Each restart is like a new, shorter training run initialized from the end of the previous one.

When to use: (1) ensembling - take model snapshots at the end of each cosine cycle and ensemble them (snapshot ensembles); (2) when training curves show the model settling into a plateau before the full training budget is used; (3) when exploring multiple local minima for diversity in ensemble members. When NOT to use: very long training runs where the final minimum quality is more important than escape dynamics; fine-tuning where you want smooth convergence to a nearby minimum.
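A minimal sketch of a cosine schedule with warm restarts, following the SGDR description above; the cycle length, multiplier, and learning-rate bounds are illustrative assumptions:

import numpy as np

def sgdr_lr(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
            first_cycle: int = 1000, t_mult: int = 2) -> float:
    """Cosine annealing with warm restarts: each cycle is t_mult x longer."""
    cycle_len, cycle_start = first_cycle, 0
    while step >= cycle_start + cycle_len:
        cycle_start += cycle_len
        cycle_len *= t_mult
    progress = (step - cycle_start) / cycle_len
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(np.pi * progress))

# The learning rate resets to max_lr at steps 1000, 3000, 7000, ...
for s in (0, 999, 1000, 2999, 3000):
    print(s, f"{sgdr_lr(s):.2e}")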


Summary

Optimization is where theory meets engineering reality. The derivation of Adam from moment estimates and bias correction explains why it is robust in early training. The difference between AdamW and Adam+L2 explains why modern LLM training universally uses AdamW. The connection between loss landscape geometry and optimizer choice explains why the same model trained with different optimizers can produce dramatically different results.

The practical recipe for large model training is stable: AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay 0.1, gradient clipping at 1.0, linear warmup for 2% of training, cosine decay to 10% of peak LR. Deviations from this require justification, and the justification should come from understanding what each hyperparameter controls in the optimization dynamics.

Second-order methods (L-BFGS) remain valuable for small-to-medium full-batch problems. The Muon optimizer and other orthogonalization-based methods are showing promise for efficiently trained small language models. But for training at scale, the Adam family with cosine schedules remains the established baseline to beat.
