Optimizers: Adam, SGD, RMSProp

The Real Interview Moment

You are training a ResNet-50 for image classification. You use Adam with the default learning rate of 1e-3. Training loss drops quickly at first, then plateaus at 2.1. A colleague suggests switching to SGD with momentum. You switch. The loss initially drops more slowly - SGD takes more careful steps - but after 90 epochs it reaches 1.8, a noticeably better final value.

Meanwhile, another colleague is training a BERT model and insists Adam is essential. She tried SGD for a week and it never converged at all. You both reach the same conclusion: "Use what works." But you do not understand why they work differently.

This lesson explains the mechanism behind every major optimizer, derives why Adam and SGD behave differently on different types of loss landscapes, and gives you the vocabulary to make principled choices - not just copy hyperparameters from papers.

Gradient Descent Foundation

The simplest optimizer updates each parameter by the gradient of the loss:

$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$

Where $\eta$ is the learning rate. Two fundamental problems:

Computing $\nabla_\theta L$ over the full dataset is prohibitively expensive for large $N$
No adaptivity - same step size for all parameters regardless of their loss surface geometry

Stochastic Gradient Descent uses a mini-batch estimate:

$\theta_{t+1} = \theta_t - \eta \hat{g}_t \quad \text{where } \hat{g}_t = \nabla_\theta L_{\mathcal{B}_t}(\theta_t)$

Mini-batch gradients are noisy estimates of the true gradient. The noise is manageable: with dataset size $N$ and batch size $B$ , you get $N/B$ gradient steps per epoch instead of 1, and the noise acts as a regularizer.
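As a minimal sketch of the mini-batch update above, assuming a toy one-parameter least-squares problem (the data, `batch_size`, `lr`, and the helper name `sgd_linear_fit` are illustrative choices, not part of the lesson):

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, batch_size=4, epochs=50, seed=0):
    """Fit y ≈ w * x with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # N / B parameter updates per epoch instead of a single full-batch update
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) over the mini-batch only
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


xs = [0.1 * i for i in range(20)]   # toy inputs (assumed for illustration)
ys = [3.0 * x for x in xs]          # noiseless targets with true slope 3
w_hat = sgd_linear_fit(xs, ys)
print(f"estimated slope: {w_hat:.6f}")
```

Each epoch performs five updates here (20 samples, batch size 4), each computed from only four samples.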

SGD with Momentum: Building Velocity

Raw SGD oscillates across narrow valleys of the loss landscape and makes slow progress along the valley floor. Momentum adds a "velocity" term:

$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \nabla_\theta L_{\mathcal{B}_t}(\theta_t)$

$\theta_{t+1} = \theta_t - \eta \mathbf{v}_t$

Where $\beta$ is the momentum coefficient (typically 0.9). Expanding the velocity recursion:

$\mathbf{v}_t = \sum_{k=1}^{t} \beta^{t-k} \nabla L_k \quad (\mathbf{v}_0 = 0)$

The velocity is an exponentially weighted sum of past gradients - the gradient from $k$ steps ago contributes with weight $\beta^k$ . With $\beta = 0.9$ , gradients from 10 steps ago still contribute $0.9^{10} \approx 0.35$ to the current velocity.

Physical intuition: the parameter is a ball rolling down a hill. Momentum lets it build up speed in consistent gradient directions (the ball accelerates), while oscillations across narrow dimensions cancel out (alternating signs in the gradient diminish the velocity in that dimension). In a long narrow valley, the ball accelerates along the valley axis and oscillations across it cancel.
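The ball-in-a-valley picture can be checked numerically. The sketch below compares plain gradient descent with the momentum update on an elongated quadratic; the loss function, starting point, learning rate, and step count are assumptions chosen for illustration.

```python
def loss(x, y):
    # Elongated quadratic: curvature 1 along x, 50 along y
    return 0.5 * (x * x + 50.0 * y * y)


def grad(x, y):
    return x, 50.0 * y


def run(beta, lr=0.03, steps=200):
    """Momentum update v = beta*v + g, theta -= lr*v (beta=0 is plain GD)."""
    x, y = 5.0, 2.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return loss(x, y)


loss_gd = run(beta=0.0)
loss_momentum = run(beta=0.9)
print(f"plain GD: {loss_gd:.3e}   momentum: {loss_momentum:.3e}")
```

With the same learning rate, plain gradient descent is limited by the shallow x direction, while momentum builds up speed along it and ends at a lower loss.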

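Alternative SGD+momentum formulation (normalized, or "dampened", variant):

$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta) \nabla L_t$

This makes the velocity a proper exponential moving average: the weights on past gradients sum to $1 - \beta^t$ , which approaches 1, so the velocity has the same scale as a single gradient. Note that PyTorch's torch.optim.SGD uses the unnormalized form above by default (dampening=0); setting dampening equal to the momentum coefficient corresponds to the normalized variant.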
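The two momentum formulations differ only by a constant factor of $(1-\beta)$ in the velocity, so they produce the same parameter trajectory once the learning rate is rescaled by $1/(1-\beta)$ . A small check, using an arbitrary fixed gradient sequence (an assumption made so the comparison is exact):

```python
def run_unnormalized(grads, lr, beta):
    """v = beta*v + g; theta -= lr*v."""
    theta, v = 0.0, 0.0
    for g in grads:
        v = beta * v + g
        theta -= lr * v
    return theta


def run_normalized(grads, lr, beta):
    """v = beta*v + (1-beta)*g; theta -= lr*v."""
    theta, v = 0.0, 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g
        theta -= lr * v
    return theta


grads = [1.0, -0.5, 2.0, 0.3, -1.2]   # arbitrary illustrative gradients
beta = 0.9
a = run_unnormalized(grads, lr=0.01, beta=beta)
b = run_normalized(grads, lr=0.01 / (1 - beta), beta=beta)
print(a, b)   # the two results agree up to floating-point rounding
```

This is why a learning rate tuned for one formulation should not be reused unchanged with the other.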

Nesterov Momentum: Look-Ahead Gradient

Nesterov accelerated gradient (NAG) evaluates the gradient at the "look-ahead" position - where momentum alone would take the parameters - rather than the current position:

$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \nabla_\theta L(\theta_t - \eta\beta \mathbf{v}_{t-1})$

$\theta_{t+1} = \theta_t - \eta \mathbf{v}_t$

Why this helps: standard momentum applies the current gradient to the current position, then adds velocity. Nesterov first takes a momentum step, evaluates the gradient there, then corrects. If the momentum step is overshooting a valley, the gradient at the look-ahead position points back toward the valley floor - Nesterov corrects more quickly. In practice, Nesterov converges slightly faster than standard momentum and overshoots less near minima.
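A minimal sketch of the difference, assuming a steep 1-D quadratic with its minimum at 0 (the function, `lr`, and `beta` are illustrative). It tracks how far each method overshoots past the minimum:

```python
def grad(theta):
    # Gradient of f(theta) = 5 * theta^2, a steep 1-D quadratic
    return 10.0 * theta


def momentum_step(theta, v, lr, beta):
    v = beta * v + grad(theta)            # gradient at the current position
    return theta - lr * v, v


def nesterov_step(theta, v, lr, beta):
    lookahead = theta - lr * beta * v     # where momentum alone would land
    v = beta * v + grad(lookahead)        # gradient at the look-ahead position
    return theta - lr * v, v


def max_overshoot(step_fn, steps=100, lr=0.02, beta=0.9):
    theta, v = 1.0, 0.0
    worst = 0.0
    for _ in range(steps):
        theta, v = step_fn(theta, v, lr, beta)
        worst = max(worst, -theta)        # distance travelled past the minimum
    return worst


over_momentum = max_overshoot(momentum_step)
over_nesterov = max_overshoot(nesterov_step)
print(f"momentum overshoot: {over_momentum:.3f}   nesterov: {over_nesterov:.3f}")
```

On this toy problem the look-ahead gradient damps the swing past the minimum; real loss surfaces will not behave this cleanly.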

import torch
import torch.nn as nn


def configure_sgd(model: nn.Module, lr: float = 0.01, momentum: float = 0.9,
                  nesterov: bool = True, weight_decay: float = 5e-4):
    """
    SGD with Nesterov momentum - the standard for vision tasks from scratch.
    Nesterov often converges slightly faster than standard momentum.
    """
    return torch.optim.SGD(
        model.parameters(),
        lr=lr,
        momentum=momentum,
        nesterov=nesterov,
        weight_decay=weight_decay,
    )


# Typical ResNet training configuration (assumes `model` is an nn.Module defined elsewhere)
optimizer = configure_sgd(model, lr=0.1, momentum=0.9, weight_decay=5e-4)
# Learning rate schedule: step decay at epochs 30, 60, 80
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 80], gamma=0.1)

AdaGrad: Per-Parameter Adaptive Learning Rates

SGD uses the same learning rate for all parameters. AdaGrad (Duchi et al., 2011) adapts per-parameter based on historical gradient magnitude:

$G_{t,i} = G_{t-1,i} + g_{t,i}^2 \quad \text{(cumulative sum of squared gradients)}$

$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} g_{t,i}$

Intuition: parameters with large historical gradients get smaller effective learning rates; those with small historical gradients get larger effective rates. For sparse features (e.g., word embeddings), most gradient updates are zero for most vocabulary items. The few tokens that appear in a batch receive large updates. AdaGrad's per-parameter rates allow infrequent features to receive aggressive updates when they do appear.

The critical problem: $G_{t,i}$ is a cumulative sum that only grows. For any parameter that keeps receiving non-zero gradients, $\sqrt{G_{t,i}}$ grows without bound, so its effective learning rate approaches zero. AdaGrad works well early in training and in convex optimization, but effectively stops learning in long training runs. This makes it mostly obsolete for deep learning.
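The decay is easy to see numerically. This sketch tracks the effective step size $\eta / \sqrt{G_t + \epsilon}$ for a single parameter, assuming a constant unit gradient (an idealized input chosen for illustration):

```python
import math


def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return lr / sqrt(G_t + eps) after each gradient in the sequence."""
    G = 0.0
    rates = []
    for g in grads:
        G += g * g                          # cumulative sum only grows
        rates.append(lr / math.sqrt(G + eps))
    return rates


rates = adagrad_effective_lrs([1.0] * 10_000)
print(rates[0], rates[99], rates[9_999])
```

With a constant gradient the effective rate falls as $\eta / \sqrt{t}$ : after 100 steps it is a tenth of its initial value, after 10,000 steps a hundredth.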

RMSProp: Fixing AdaGrad's Dying Learning Rates

RMSProp (Hinton, unpublished, described in a Coursera lecture in 2012) replaces the cumulative sum with an exponential moving average of squared gradients:

$v_{t,i} = \beta_2 v_{t-1,i} + (1 - \beta_2) g_{t,i}^2$

$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{v_{t,i} + \epsilon}} g_{t,i}$

Where $\beta_2$ (typically between 0.9 and 0.999) controls how quickly old gradient information is forgotten. The exponential moving average gives recent gradients higher weight, so the effective learning rate does not monotonically decay - it tracks recent gradient magnitudes and remains stable over long runs.

When RMSProp excels: RNNs (where gradients can change scale dramatically across timesteps), non-stationary loss landscapes, and tasks where AdaGrad's dying learning rates are problematic. RMSProp was a popular choice for training RNNs and remains competitive in that domain.
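To contrast with AdaGrad, the sketch below tracks RMSProp's effective step size $\eta / \sqrt{v_t + \epsilon}$ when the gradient scale changes partway through training. The gradient sequence is an assumption for illustration.

```python
import math


def rmsprop_effective_lrs(grads, lr=0.1, beta2=0.99, eps=1e-8):
    """Return lr / sqrt(v_t + eps) after each gradient in the sequence."""
    v = 0.0
    rates = []
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g   # EMA forgets old gradients
        rates.append(lr / math.sqrt(v + eps))
    return rates


# Gradient magnitude drops from 1.0 to 0.1 halfway through
grads = [1.0] * 2000 + [0.1] * 2000
rates = rmsprop_effective_lrs(grads)
print(rates[1999], rates[3999])
```

The effective rate settles near $\eta / |g|$ for the current gradient scale and adapts upward when gradients shrink, instead of decaying toward zero.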

# RMSProp configuration
optimizer_rmsprop = torch.optim.RMSprop(
    model.parameters(),
    lr=1e-3,
    alpha=0.99,       # decay rate for squared gradient EMA
    eps=1e-8,
    weight_decay=0,
    momentum=0,       # can optionally add momentum
)

Adam: Combining Momentum and RMSProp

Adam (Kingma and Ba, 2015) combines momentum (first moment estimate) with RMSProp (second moment estimate) and adds bias correction for both:

First moment estimate (smoothed gradient - like momentum):

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

Second moment estimate (smoothed squared gradient - like RMSProp):

$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

Bias correction - the critical step:

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

Parameter update:

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$

Why Bias Correction Is Necessary

Both moments are initialized to zero ( $m_0 = 0$ and $v_0 = 0$ ). At step $t = 1$ :

$m_1 = \beta_1 \cdot 0 + (1 - \beta_1) \cdot g_1 = (1 - \beta_1) g_1$

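With $\beta_1 = 0.9$ : $m_1 = 0.1 \cdot g_1$ . The first moment estimate is biased toward zero by a factor of 0.1 - the gradient signal is downscaled 10x in early training. This is called the cold start problem. The second moment is biased even more strongly: $v_1 = 0.001 \cdot g_1^2$ , a 1000x downscaling. Because the update divides by $\sqrt{v_t}$ , the two biases do not cancel: without correction the first step would be $0.1 / \sqrt{0.001} \approx 3.16$ times larger than intended, and early steps in general would be mis-scaled in a way that depends on $\beta_1$ and $\beta_2$ .

Bias correction divides by $(1 - \beta_1^t)$ , which equals $1 - 0.9^1 = 0.1$ at $t = 1$ , canceling the 0.1 factor exactly. At $t = 100$ : $1 - 0.9^{100} \approx 0.99997$ , so the first-moment correction has essentially no effect after the first few dozen steps. The second-moment correction fades more slowly: with $\beta_2 = 0.999$ , $1 - 0.999^{100} \approx 0.095$ , and the factor only approaches 1 after several thousand steps. Bias correction matters most at the very start of training.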
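The arithmetic can be reproduced in a few lines. This sketch evaluates the correction factors at several steps and compares the size of the very first update with and without correction, for a single scalar gradient (the value of `g` is arbitrary):

```python
def bias_correction(beta, t):
    """Factor the raw moment estimate is divided by at step t."""
    return 1.0 - beta ** t


for t in (1, 10, 100, 1000, 10_000):
    print(f"t={t:>6}  1-beta1^t={bias_correction(0.9, t):.5f}  "
          f"1-beta2^t={bias_correction(0.999, t):.5f}")

# Size of the first update, in units of the learning rate
g, beta1, beta2 = 0.5, 0.9, 0.999
m1 = (1 - beta1) * g
v1 = (1 - beta2) * g * g
uncorrected = m1 / v1 ** 0.5
corrected = (m1 / bias_correction(beta1, 1)) / (v1 / bias_correction(beta2, 1)) ** 0.5
print(f"uncorrected: {uncorrected:.4f}   corrected: {corrected:.4f}")
```

The printed table shows how much more slowly the $\beta_2$ factor approaches 1 than the $\beta_1$ factor.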

Default Hyperparameters

Hyperparameter	Default	Notes
$\eta$ (learning rate)	1e-3	Most important - tune this first
$\beta_1$ (momentum decay)	0.9	Rarely needs tuning
$\beta_2$ (RMS decay)	0.999	Occasionally lowered; 0.95 is common in large-scale transformer training
$\epsilon$	1e-8	Rarely needs tuning; increase to 1e-6 if instability

AdamW: Decoupled Weight Decay (Loshchilov and Hutter, 2019)

Standard Adam with L2 regularization adds $\frac{\lambda}{2}\|\theta\|^2$ to the loss, so the weight decay gradient $\lambda \theta_i$ is folded into the gradient and treated like any other gradient component:

$g_{t,i} = \nabla_{\theta_i} L(\theta_t) + \lambda \theta_{t,i}$

$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{\hat{v}_{t,i}} + \epsilon} \hat{m}_{t,i}$

Here $\hat{m}_{t,i}$ and $\hat{v}_{t,i}$ are computed from the combined gradient $g_{t,i}$ , so the weight decay contribution is scaled by the inverse square root of the second moment - exactly like the loss gradient. Parameters with large gradient history (for which Adam has already reduced the effective LR) also receive reduced weight decay. Weight decay becomes weaker for exactly the parameters with the largest gradients, which is not the intended behavior.

AdamW decouples weight decay from the adaptive gradient scaling:

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t - \eta \lambda \theta_t$

The weight decay term $-\eta \lambda \theta_t$ is applied directly to the parameters, independently of the adaptive learning rate. This matches the intended weight decay behavior - all parameters shrink toward zero at the same proportional rate regardless of their gradient history.

Practical consequence: AdamW generally outperforms Adam + L2 regularization in large model training. The difference is particularly noticeable in transformer models where some layers have very different gradient magnitudes. The HuggingFace Transformers library defaults to AdamW. For most new work involving transformers, AdamW is the sensible default.
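The coupling between gradient scale and weight decay can be made concrete with a scalar experiment. In this sketch the loss gradient alternates in sign so that, on average, only the decay term moves the weight; the gradient scales, `lam`, and step count are assumptions for illustration.

```python
import math


def run(grad_scale, decoupled, steps=1000, lr=0.01, lam=0.1,
        beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam with either L2-in-the-gradient or decoupled weight decay."""
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_scale if t % 2 else -grad_scale   # zero-mean loss gradient
        if not decoupled:
            g += lam * theta                        # L2 term enters both moments
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            theta -= lr * lam * theta               # decay bypasses the normalizer
    return theta


l2_small, l2_large = run(0.01, decoupled=False), run(10.0, decoupled=False)
adamw_small, adamw_large = run(0.01, decoupled=True), run(10.0, decoupled=True)
print(f"Adam + L2: small-grad {l2_small:.3f}, large-grad {l2_large:.3f}")
print(f"AdamW:     small-grad {adamw_small:.3f}, large-grad {adamw_large:.3f}")
```

With L2 in the gradient, the weight with large gradients barely shrinks while the one with small gradients collapses toward zero; with decoupled decay both shrink by nearly the same amount.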

import torch
import torch.nn as nn


class AdamWFromScratch:
    """
    AdamW implemented on PyTorch tensors for pedagogical clarity.
    Shows exactly what each step does.
    """

    def __init__(self, params: list, lr: float = 1e-3, betas: tuple = (0.9, 0.999),
                 eps: float = 1e-8, weight_decay: float = 0.01):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.weight_decay = weight_decay
        self.t = 0  # step counter

        # Initialize moment estimates to zero
        self.m = [torch.zeros_like(p) for p in self.params]  # first moments
        self.v = [torch.zeros_like(p) for p in self.params]  # second moments

    def step(self) -> None:
        self.t += 1

        # Bias correction factors
        bc1 = 1 - self.beta1 ** self.t  # approaches 1 as t grows
        bc2 = 1 - self.beta2 ** self.t

        for i, p in enumerate(self.params):
            if p.grad is None:
                continue

            g = p.grad.data

            # Update first moment (momentum)
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g

            # Update second moment (RMS)
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g

            # Bias-corrected estimates
            m_hat = self.m[i] / bc1
            v_hat = self.v[i] / bc2

            # Adam gradient step
            p.data -= self.lr * m_hat / (v_hat.sqrt() + self.eps)

            # Decoupled weight decay - SEPARATE from adaptive gradient
            # This is the key difference from Adam + L2
            p.data -= self.lr * self.weight_decay * p.data


def configure_adamw(model: nn.Module, lr: float = 1e-3,
                    weight_decay: float = 0.01) -> torch.optim.AdamW:
    """
    AdamW with proper parameter group separation.
    Biases and norm parameters should NOT have weight decay.
    Weight decay on biases provides no regularization benefit and
    can hurt convergence by pulling biases toward zero arbitrarily.
    """
    decay_params = []
    no_decay_params = []

    for name, param in model.named_parameters():
        # No weight decay for 1D parameters (biases) and norm layers
        if param.ndim == 1 or 'norm' in name.lower() or 'bias' in name.lower():
            no_decay_params.append(param)
        else:
            decay_params.append(param)

    return torch.optim.AdamW(
        [
            {"params": decay_params, "weight_decay": weight_decay},
            {"params": no_decay_params, "weight_decay": 0.0},
        ],
        lr=lr,
        betas=(0.9, 0.999),
        eps=1e-8,
    )

LAMB: Large-Batch Optimizer for BERT (You et al., 2020)

LAMB (Layer-wise Adaptive Moments optimizer for Batch training) extends AdamW to enable efficient large-batch training. It scales the Adam update by the ratio of parameter norm to update norm:

$\theta_{t+1} = \theta_t - \eta \cdot \frac{\|\theta_t\|}{\|\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda\theta_t\|} \cdot \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\theta_t\right)$

The additional ratio $\|\theta\| / \|\text{update}\|$ is the layerwise adaptive rate - it normalizes the update magnitude to be proportional to the parameter norm. Large layers (with large parameter norms) receive larger absolute updates; small layers receive smaller ones.

Why this matters for large-batch training: LAMB enabled training BERT in 76 minutes on TPUs using a batch size of 32,768 - previously requiring days. At large batch sizes, gradient noise is low, so the gradient direction is reliable and larger steps are justified. LAMB's layerwise normalization lets large batches take correspondingly large steps without destabilizing small layers.
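A minimal sketch of the layerwise trust ratio, assuming the bias-corrected moments for one layer are already available (the helper names, the fallback to a ratio of 1.0 for zero norms, and the toy values are assumptions):

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def lamb_layer_step(theta, m_hat, v_hat, lr=1e-3, lam=0.01, eps=1e-6):
    """One LAMB-style update for a single layer, given bias-corrected moments."""
    # Adam-style direction plus weight decay term
    update = [m / (math.sqrt(v) + eps) + lam * w
              for w, m, v in zip(theta, m_hat, v_hat)]
    w_norm, u_norm = l2_norm(theta), l2_norm(update)
    # Trust ratio ||theta|| / ||update||; use 1.0 if either norm is zero
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    new_theta = [w - lr * trust * u for w, u in zip(theta, update)]
    return new_theta, trust


m_hat, v_hat = [0.1, 0.1], [0.01, 0.01]        # same moments for both layers
small, large = [0.01, -0.02], [1.0, -2.0]      # layers with very different norms
small_new, _ = lamb_layer_step(small, m_hat, v_hat)
large_new, _ = lamb_layer_step(large, m_hat, v_hat)

rel_small = l2_norm([a - b for a, b in zip(small, small_new)]) / l2_norm(small)
rel_large = l2_norm([a - b for a, b in zip(large, large_new)]) / l2_norm(large)
print(rel_small, rel_large)   # both relative step sizes equal lr
```

Each layer moves by the same fraction of its own norm, which is the property that keeps small layers stable when the global learning rate is raised.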

Lion: Sign-Based Optimizer (Chen et al., 2023)

Lion (Evolved Sign Momentum) was discovered by Google Brain via program search:

$c_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

$\theta_{t+1} = \theta_t - \eta (\text{sign}(c_t) + \lambda \theta_t)$

$m_t = \beta_2 m_{t-1} + (1 - \beta_2) g_t$

Key properties:

Uses only the sign of the update, not the magnitude - before weight decay, every parameter moves by exactly $\pm \eta$
More memory efficient than Adam: only one momentum buffer (not two)
Decoupled weight decay, as in AdamW
Typically needs a smaller learning rate than Adam (the authors suggest roughly 3-10x smaller, with correspondingly larger weight decay) and tends to benefit from larger batch sizes

Lion has shown competitive results with Adam in large-scale image and language model training, while storing roughly half the optimizer state (one buffer vs two).
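The three Lion equations translate almost line for line into code. This is a sketch for plain Python lists of scalars; the function name and the toy inputs are assumptions, and the defaults $\beta_1 = 0.9$ , $\beta_2 = 0.99$ follow the values suggested by the Lion authors.

```python
def sign(x):
    return (x > 0) - (x < 0)


def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, lam=0.0):
    """One Lion update; returns (new_theta, new_m)."""
    new_theta, new_m = [], []
    for w, m_i, g in zip(theta, m, grad):
        c = beta1 * m_i + (1 - beta1) * g             # interpolation used for the step
        new_theta.append(w - lr * (sign(c) + lam * w))
        new_m.append(beta2 * m_i + (1 - beta2) * g)   # the single momentum buffer
    return new_theta, new_m


theta = [1.0, -1.0, 0.5]
grad = [1e-6, -300.0, 0.0]      # wildly different gradient magnitudes
new_theta, new_m = lion_step(theta, [0.0, 0.0, 0.0], grad)
print(new_theta)
```

The first two parameters move by the same distance `lr` despite gradients that differ by eight orders of magnitude, and the parameter with zero gradient does not move.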

AMSGrad: Convergence Fix for Adam

Adam does not converge in some settings - the theoretical convergence proof had a gap. AMSGrad (Reddi et al., 2018) fixes this by using the maximum of past second moment estimates:

$\hat{v}_t^{\max} = \max(\hat{v}_{t-1}^{\max}, \hat{v}_t)$

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t^{\max}} + \epsilon} \hat{m}_t$

The non-decreasing $\hat{v}_t^{\max}$ guarantees the effective learning rate never increases, fixing the theoretical convergence issue. In practice, the improvement over Adam is marginal on most tasks.
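A scalar sketch of the AMSGrad update as written above, keeping a running maximum of the bias-corrected second moment (the `state` dictionary layout and the gradient sequence are assumptions for illustration):

```python
import math


def amsgrad_step(theta, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update for a scalar parameter; mutates `state` in place."""
    state["t"] += 1
    t = state["t"]
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    state["v_max"] = max(state["v_max"], v_hat)   # never decreases
    return theta - lr * m_hat / (math.sqrt(state["v_max"]) + eps)


state = {"t": 0, "m": 0.0, "v": 0.0, "v_max": 0.0}
theta = 1.0
for g in [10.0] + [0.1] * 50:     # one large gradient spike, then small gradients
    theta = amsgrad_step(theta, g, state)
print(state["v_max"], theta)
```

After the spike, plain Adam's second-moment estimate would decay and let the step size grow again; here `v_max` stays at the spike's level, so the effective learning rate cannot increase.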

Full NumPy Adam from Scratch

import numpy as np


class AdamOptimizer:
    """
    Full Adam implementation in NumPy for pedagogical clarity.
    Includes all components: bias correction, decoupled weight decay,
    and gradient clipping.
    """

    def __init__(self, params: list[np.ndarray], lr: float = 1e-3,
                 beta1: float = 0.9, beta2: float = 0.999,
                 eps: float = 1e-8, weight_decay: float = 0.0,
                 max_grad_norm: float = None):
        self.params = params
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.weight_decay = weight_decay
        self.max_grad_norm = max_grad_norm
        self.t = 0

        # Initialize moment buffers to zero - triggers bias correction
        self.m = [np.zeros_like(p) for p in params]  # first moments
        self.v = [np.zeros_like(p) for p in params]  # second moments

    def clip_grad_norm(self, grads: list[np.ndarray]) -> list[np.ndarray]:
        """Clip gradient by global L2 norm."""
        if self.max_grad_norm is None:
            return grads
        total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
        if total_norm > self.max_grad_norm:
            clip_coef = self.max_grad_norm / (total_norm + 1e-12)
            grads = [g * clip_coef for g in grads]
        return grads

    def step(self, grads: list[np.ndarray]) -> None:
        """Apply one Adam update step."""
        assert len(grads) == len(self.params)
        self.t += 1

        # Optional gradient clipping
        grads = self.clip_grad_norm(grads)

        # Bias correction factors (both approach 1 as t grows)
        bc1 = 1 - self.beta1 ** self.t
        bc2 = 1 - self.beta2 ** self.t

        for i, (p, g) in enumerate(zip(self.params, grads)):
            # First moment: running mean of gradient
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g

            # Second moment: running mean of squared gradient
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2

            # Bias-corrected estimates
            m_hat = self.m[i] / bc1
            v_hat = self.v[i] / bc2

            # Adam gradient step
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

            # Decoupled weight decay (AdamW behavior when weight_decay > 0)
            if self.weight_decay > 0:
                p -= self.lr * self.weight_decay * p


def train_with_numpy_adam():
    """Demonstrate Adam convergence on a simple quadratic."""
    # Minimize: f(x, y) = x^2 + 10*y^2 (elongated quadratic - tests momentum)
    params = [np.array([5.0]), np.array([2.0])]  # start far from optimum
    optimizer = AdamOptimizer(params, lr=0.1, weight_decay=0.0)

    print(f"{'Step':>5} | {'x':>8} | {'y':>8} | {'loss':>12}")
    for step in range(50):
        x, y = params
        loss = (x**2 + 10 * y**2).item()
        grads = [2 * x, 20 * y]  # analytical gradients

        # Log before the update so x, y, and loss describe the same point
        # (the optimizer modifies the parameter arrays in place)
        if step % 10 == 0:
            print(f"{step:>5} | {x.item():>8.4f} | {y.item():>8.4f} | {loss:>12.6f}")

        optimizer.step(grads)

    print(f"\nFinal position: x={params[0].item():.6f}, y={params[1].item():.6f}")
    print("Optimum is (0, 0)")


train_with_numpy_adam()

SGD vs Adam: When Each Wins

This is one of the most debated empirical questions in deep learning. The key distinction:

Adam wins on:

Language models and transformers (BERT, GPT, T5, LLaMA)
Tasks with sparse gradients (word embeddings, recommendation systems)
Fast convergence when you want results quickly
Fine-tuning pretrained models
Any task with heterogeneous gradient scales across layers

SGD with momentum wins on:

Image classification from scratch (ResNet, EfficientNet)
Object detection (well-tuned SGD on COCO)
Tasks where final generalization quality matters more than convergence speed

The generalization gap: empirically, SGD with momentum often finds flatter minima than Adam. Flat minima are widely believed to generalize better than sharp minima (Hochreiter and Schmidhuber, 1997). Adam's adaptive learning rates efficiently navigate sharp narrow valleys in the loss landscape - but these valleys can lead to sharp minima that generalize worse to unseen data. SGD with larger learning rates tends to "bounce around" more, spending more time in flatter regions. This is still an active research area.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def compare_optimizers_on_task(model_fn, X_train: torch.Tensor, y_train: torch.Tensor,
                                X_val: torch.Tensor, y_val: torch.Tensor,
                                n_epochs: int = 100):
    """
    Compare SGD, Adam, AdamW, and RMSProp on the same task with a matched epoch budget.
    model_fn: callable that returns a fresh model instance.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    train_ds = TensorDataset(X_train, y_train)
    train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
    criterion = nn.CrossEntropyLoss()

    optimizer_configs = {
        "SGD+Nesterov": lambda m: torch.optim.SGD(
            m.parameters(), lr=0.01, momentum=0.9, nesterov=True, weight_decay=5e-4
        ),
        "Adam": lambda m: torch.optim.Adam(
            m.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-4
        ),
        "AdamW": lambda m: torch.optim.AdamW(
            m.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.01
        ),
        "RMSProp": lambda m: torch.optim.RMSprop(
            m.parameters(), lr=1e-3, alpha=0.99
        ),
    }

    results = {}
    for name, opt_fn in optimizer_configs.items():
        torch.manual_seed(42)
        model = model_fn().to(device)
        optimizer = opt_fn(model)

        train_losses = []
        for epoch in range(n_epochs):
            model.train()
            epoch_loss = 0.0
            for bx, by in train_loader:
                bx, by = bx.to(device), by.to(device)
                optimizer.zero_grad()
                loss = criterion(model(bx), by)
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()
            train_losses.append(epoch_loss / len(train_loader))

        model.eval()
        with torch.no_grad():
            val_logits = model(X_val.to(device))
            val_acc = (val_logits.argmax(1) == y_val.to(device)).float().mean().item()

        results[name] = {
            "final_train_loss": train_losses[-1],
            "val_acc": val_acc,
            "convergence_epoch": next(
                (i for i, l in enumerate(train_losses) if l < train_losses[0] * 0.5), n_epochs
            ),
        }
        print(f"{name:20s}: train_loss={train_losses[-1]:.4f}, val_acc={val_acc:.3f}, "
              f"50%-convergence at epoch {results[name]['convergence_epoch']}")

    return results

Parameter Groups: Fine-Grained Learning Rate Control

Different parts of a model often need different learning rates - a critical production technique:

import torch
import torch.nn as nn


def configure_layerwise_lr_decay(model: nn.Module, base_lr: float,
                                  n_layers: int = 12,
                                  lr_decay: float = 0.8,
                                  weight_decay: float = 0.01) -> torch.optim.AdamW:
    """
    Layer-wise learning rate decay for transformer fine-tuning.
    Earlier layers get smaller learning rates - they contain more general features
    that should be preserved. Later layers get larger rates - they need to adapt.

    LR for layer i = base_lr * lr_decay^(n_layers - i)
    e.g., n_layers=12, lr_decay=0.8:
      Layer 0 (bottom):  base_lr * 0.8^12 ≈ 0.069 * base_lr
      Layer 6 (middle):  base_lr * 0.8^6  ≈ 0.262 * base_lr
      Layer 12 (top):    base_lr * 0.8^0  = base_lr

    Simplified: layers are matched by "layer.{i}." / "layers.{i}." in the
    parameter name. Parameters that match no layer (e.g., embeddings, pooler,
    task head) are trained at base_lr so nothing is silently left out; adjust
    this for your model's naming scheme.
    """
    no_decay = ('bias', 'norm', 'LayerNorm', 'layer_norm')
    param_groups = []
    assigned = set()  # ids of parameters already placed in a group

    def add_groups(named_params, lr):
        # Split into decayed / non-decayed groups that share one learning rate
        decay = [p for n, p in named_params if not any(nd in n for nd in no_decay)]
        skip = [p for n, p in named_params if any(nd in n for nd in no_decay)]
        if decay:
            param_groups.append({"params": decay, "lr": lr, "weight_decay": weight_decay})
        if skip:
            param_groups.append({"params": skip, "lr": lr, "weight_decay": 0.0})
        assigned.update(id(p) for _, p in named_params)

    for layer_idx in range(n_layers + 1):
        layer_lr = base_lr * (lr_decay ** (n_layers - layer_idx))
        layer_named = [
            (n, p) for n, p in model.named_parameters()
            if (f"layer.{layer_idx}." in n or f"layers.{layer_idx}." in n)
            and id(p) not in assigned  # a parameter may belong to only one group
        ]
        add_groups(layer_named, layer_lr)

    # Everything not matched above (assumption: embeddings, head, etc.)
    rest = [(n, p) for n, p in model.named_parameters() if id(p) not in assigned]
    add_groups(rest, base_lr)

    return torch.optim.AdamW(param_groups, lr=base_lr)


# Fine-tuning example: backbone gets 10x smaller LR than new head
def finetune_optimizer(backbone: nn.Module, head: nn.Module,
                        base_lr: float = 1e-3) -> torch.optim.AdamW:
    """Separate learning rates for pretrained backbone vs new classification head."""
    return torch.optim.AdamW([
        {"params": backbone.parameters(), "lr": base_lr * 0.1, "weight_decay": 0.0},
        {"params": head.parameters(), "lr": base_lr, "weight_decay": 0.01},
    ])

Learning Rate Selection Guide

All optimizer machinery is useless without the right learning rate. Rules of thumb for starting points:

Optimizer	Typical LR Range	Common Starting Point
SGD (no momentum)	0.1–1.0	0.1
SGD + momentum	0.01–0.1	0.01
Adam	1e-4–1e-3	1e-3
AdamW (transformers, pretrain)	1e-4–5e-4	3e-4
AdamW (fine-tuning)	1e-6–1e-4	1e-5
RMSProp	1e-4–1e-3	1e-3

The key insight: Adam is much less sensitive to learning rate than SGD. Adam's default hyperparameters work reasonably for most tasks. SGD requires more careful LR tuning but can find better optima when properly configured. For fast prototyping, start with Adam. For pushing the last percentage points on a well-understood task, try SGD with momentum.

:::warning gradient_accumulation and optimizer.step()
When using gradient accumulation, you perform multiple backward passes before calling optimizer.step(). The gradients accumulate (add up) across these passes. If you do not divide the loss by the accumulation steps, the accumulated gradient is accumulation_steps times the mean gradient. With SGD this multiplies the effective learning rate by accumulation_steps. Adam's normalized update is largely insensitive to a constant gradient scale, but the inflated gradients still distort gradient clipping thresholds, the influence of $\epsilon$ , and logged loss values. Always divide the loss by accumulation steps: loss = criterion(output, target) / grad_accum_steps.
:::
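The effect of the loss scaling can be verified without any framework. This sketch accumulates micro-batch gradients of a one-parameter least-squares loss and compares them with the full-batch gradient; the data and the helper `mse_grad` are assumptions for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2]
w = 0.5
accum_steps = 4
micro = len(xs) // accum_steps            # equal-sized micro-batches

full = mse_grad(w, xs, ys)                # gradient of the full-batch loss
scaled, unscaled = 0.0, 0.0
for k in range(accum_steps):
    sl = slice(k * micro, (k + 1) * micro)
    g = mse_grad(w, xs[sl], ys[sl])
    scaled += g / accum_steps             # loss divided by accum_steps
    unscaled += g                         # loss left unscaled
print(full, scaled, unscaled)
```

With equal-sized micro-batches, the scaled accumulation reproduces the full-batch gradient, while the unscaled one is `accum_steps` times too large.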

:::danger zero_grad() Before Every Backward Pass
PyTorch accumulates gradients by default - calling .backward() adds to existing gradients rather than replacing them. If you forget optimizer.zero_grad() before computing a new batch's loss and backward, gradients from previous batches accumulate. The effective gradient grows each step and the loss can diverge. This is a very common PyTorch bug. The standard pattern: optimizer.zero_grad() → loss.backward() → optimizer.step(). Only deviate from this order when you are accumulating gradients on purpose.
:::

Interview Q&A

Q1: Explain how Adam combines momentum and RMSProp and what problem each addresses.

Adam maintains two exponential moving averages. The first moment ( $m_t$ ), like momentum, accumulates a smoothed estimate of the gradient direction - this reduces oscillations and builds speed in consistent gradient directions, similar to a ball gaining momentum on a slope. The second moment ( $v_t$ ), like RMSProp, accumulates a smoothed estimate of the squared gradient magnitude - this creates per-parameter adaptive learning rates: parameters with large gradient history receive small effective learning rates, and parameters with small gradient history receive large effective rates. The division $m_t / \sqrt{v_t}$ normalizes the gradient direction by its historical magnitude, making the effective step size approximately constant across parameters regardless of their gradient scale. Bias correction addresses the cold start: both moments are initialized to zero, so early estimates are artificially small (by factors of $1-\beta_1^t$ and $1-\beta_2^t$ ). Dividing by these factors restores the correct scale in early training.

Q2: What is AdamW and why is it preferred over Adam with L2 regularization?

With standard Adam and L2 regularization, the weight decay gradient $\lambda\theta_i$ is treated as part of the regular gradient and divided by $\sqrt{\hat{v}_{t,i}}$ - the adaptive scaling factor. Parameters with large gradient history (which Adam has already reduced the effective LR for) also receive reduced weight decay. Weight decay is strongest for parameters that are updated least aggressively, which is backwards - frequently-updated parameters need more regularization. AdamW decouples weight decay from gradient scaling: the weight decay step $-\eta\lambda\theta$ is applied directly to parameters, independent of the adaptive learning rate. This ensures weight decay functions as intended: all parameters shrink toward zero at the same proportional rate regardless of gradient history. For transformer and language model training where different layers have vastly different gradient magnitudes, proper weight decay from AdamW is critical for regularization to work correctly.

Q3: Why does SGD with momentum sometimes generalize better than Adam?

This is empirically observed and only partly understood theoretically. The primary hypothesis involves loss landscape geometry: Adam's adaptive learning rates efficiently navigate sharp, narrow valleys in the loss landscape, finding minima quickly. But sharp minima - points where the loss changes rapidly with small parameter perturbations - are thought to generalize poorly. When the test distribution differs slightly from training, parameters in sharp minima produce large loss increases. SGD with momentum, using a larger learning rate and no adaptive scaling, tends to explore flatter regions. The "bouncing around" that makes SGD appear slower causes it to spend more time in regions where the loss is flat in multiple directions - flat minima generalize better because small distribution shifts only cause small loss increases. On ImageNet with ResNets, well-tuned SGD is commonly reported to reach higher top-1 accuracy than Adam, with gaps on the order of 0.5–1.5%. For language models, this effect is not observed because heterogeneous gradient scales across embedding vs attention vs FFN layers make Adam's adaptivity essential.

Q4: What does bias correction do in Adam and why is it necessary?

Both moment estimates in Adam are initialized to zero. In the first step, the first moment is $m_1 = (1-\beta_1) g_1$ . With $\beta_1 = 0.9$ , this is 10% of the actual gradient - a 10x downscaling. The second moment is $v_1 = (1-\beta_2) g_1^2$ . With $\beta_2 = 0.999$ , this is 0.1% of $g_1^2$ - a 1000x downscaling. Without bias correction, the effective step size in the first few iterations would be wrong. The uncorrected first update would be $\eta \cdot m_1 / \sqrt{v_1} = \eta \cdot 0.1 g_1 / \sqrt{0.001 g_1^2} \approx 3.16\,\eta\,\text{sign}(g_1)$ , about 3.16x larger than the bias-corrected step $\eta\,\text{sign}(g_1)$ , because the two biases enter the ratio differently and do not cancel. Bias correction divides by $(1-\beta_1^t)$ and $(1-\beta_2^t)$ , which cancel the startup bias exactly. As training continues (large $t$ ), these factors approach 1 and the correction becomes negligible. Bias correction matters most at the start of training and whenever the optimizer state is reinitialized, for example when Adam is restarted with fresh moment buffers.

Q5: You are fine-tuning BERT for text classification. Which optimizer and what learning rate strategy?

Use AdamW with decoupled weight decay. Key configuration: (1) Learning rate: 1e-5 to 5e-5 for the pretrained BERT layers - too large a rate will destroy pretrained representations - and 1e-4 to 1e-3 for the new classification head. Use separate parameter groups to set the different rates. (2) Weight decay: 0.01 for weight matrices, 0.0 for biases and LayerNorm parameters. (3) Schedule: linear warmup for 6% of total training steps (typically 300–500 steps), then linear decay to zero. (4) Gradient clipping: max_norm=1.0 - prevents gradient spikes that can destroy pretrained representations. (5) Layer-wise LR decay with decay factor 0.8–0.9: earlier BERT layers get smaller rates than later layers, preserving general language representations while allowing task-specific adaptation in higher layers. (6) Number of epochs: typically 3–5 for text classification - more leads to overfitting on small datasets. This recipe combines the learning-rate range from the original BERT paper with common Hugging Face practice and later fine-tuning studies (layer-wise LR decay is not part of the original paper). It works well across classification, NER, and extractive QA tasks.
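The schedule in (3) and the layer-wise decay in (5) can be sketched in a few lines of plain Python. The function names `linear_warmup_decay` and `layerwise_lrs` are hypothetical helpers written for this illustration, not part of any library, and the 12-layer count assumes a BERT-base-sized encoder. The first function returns a multiplier on the base learning rate, which is the shape PyTorch's `torch.optim.lr_scheduler.LambdaLR` expects for its `lr_lambda` argument.

```python
def linear_warmup_decay(step: int, total_steps: int, warmup_frac: float = 0.06) -> float:
    """LR multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return max(0.0, (total_steps - step) / remaining)


def layerwise_lrs(base_lr: float, n_layers: int, decay: float = 0.9) -> list:
    """LR per encoder layer (index 0 = earliest layer); the top layer gets base_lr."""
    return [base_lr * decay ** (n_layers - 1 - i) for i in range(n_layers)]


total_steps = 1000  # illustrative run length
for s in (0, 30, 60, 530, 1000):
    print(f"step {s:>4}: multiplier = {linear_warmup_decay(s, total_steps):.3f}")

lrs = layerwise_lrs(base_lr=2e-5, n_layers=12, decay=0.9)
print(f"layer 0 lr = {lrs[0]:.2e}, layer 11 lr = {lrs[-1]:.2e}")
```

In a real fine-tuning run, each entry of `lrs` would become the `lr` of one optimizer parameter group, with the classification head in its own group at the higher rate.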

Gradient Clipping: Protecting Training Stability

Gradient clipping caps the global L2 norm of all gradients before the parameter update. This prevents gradient explosion from causing catastrophic parameter updates - especially important in transformers, RNNs, and deep MLPs.

$\text{if } \|\mathbf{g}\|_2 > \tau: \quad \mathbf{g} \leftarrow \tau \cdot \frac{\mathbf{g}}{\|\mathbf{g}\|_2}$

The clipping threshold $\tau$ is typically 1.0 for transformers and 5.0 for RNNs (which have higher gradient variability). The gradient direction is preserved - only the magnitude is clipped.
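The formula can be checked by hand on a tiny example. The sketch below uses plain Python lists in place of tensors; `clip_by_global_norm` is a hypothetical helper for illustration, and the small constant in the denominator is an assumption added to guard against division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale factor is capped at 1.0 so small gradients pass through unchanged
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


# Two "parameter tensors"; joint norm = sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(f"norm before: {norm_before:.3f}, norm after: {norm_after:.3f}")
print(f"component ratio before: {grads[0][0] / grads[0][1]:.2f}, "
      f"after: {clipped[0][0] / clipped[0][1]:.2f}")
```

Every component is multiplied by the same factor, which is why the ratios between components (the direction) are unchanged while the overall length shrinks to the threshold.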

import torch
import torch.nn as nn


def train_step_with_clipping(model: nn.Module, batch, optimizer,
                              criterion, max_norm: float = 1.0) -> tuple[float, float]:
    """
    Standard training step with gradient clipping.
    Clipping happens AFTER backward() and BEFORE optimizer.step().
    Returns (loss, global gradient norm measured before clipping).
    """
    inputs, targets = batch
    optimizer.zero_grad()

    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()

    # clip_grad_norm_ rescales the gradients in place and returns the global
    # L2 norm measured BEFORE clipping, so no separate loop is needed to log it
    total_norm_before = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=max_norm
    ).item()

    optimizer.step()

    return loss.item(), total_norm_before


# Monitoring gradient norms in production - helps detect training issues early
class GradientNormMonitor:
    """
    Track gradient norms over training. Useful for diagnosing:
    - Gradient explosion (norm suddenly spikes to 100+)
    - Gradient vanishing (norm consistently near 0)
    - Training instability (very high variance in norm over time)
    """

    def __init__(self, window_size: int = 100):
        self.norms = []
        self.window_size = window_size

    def record(self, model: nn.Module) -> float:
        total_norm = 0.0
        for p in model.parameters():
            if p.grad is not None:
                total_norm += p.grad.detach().norm(2).item() ** 2
        total_norm = total_norm ** 0.5
        self.norms.append(total_norm)
        return total_norm

    @property
    def recent_mean(self) -> float:
        recent = self.norms[-self.window_size:]
        return sum(recent) / len(recent) if recent else 0.0

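    @property
    def is_exploding(self) -> bool:
        # Compare the latest norm against the mean of the PRECEDING window.
        # Including the latest value in the baseline would dilute the spike,
        # and with 10 or fewer recorded norms the check could never fire.
        if len(self.norms) < 2:
            return False
        previous = self.norms[-self.window_size - 1:-1]
        baseline = sum(previous) / len(previous)
        return self.norms[-1] > 10 * baseline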

    @property
    def is_vanishing(self) -> bool:
        return len(self.norms) > self.window_size and self.recent_mean < 1e-6

Optimizer State: Memory Costs and Considerations

A critical practical consideration for large models - optimizers carry significant memory overhead:

Optimizer	Extra memory per parameter	Notes
SGD (no momentum)	0	No extra state
SGD + momentum	1 buffer	Velocity vector
Adam / AdamW	2 buffers	First + second moment
LAMB	2 buffers	Same as Adam
Lion	1 buffer	Only one moment
Adafactor	~1 buffer (factored)	Factors the second moment of large matrices into row and column statistics (not an SVD)

For a 7B parameter model (LLaMA-7B):

Model weights in float32: $7 \times 10^9 \times 4 = 28$ GB
Gradients in float32: $28$ GB
Adam optimizer states (2 buffers, float32): $2 \times 28 = 56$ GB
Total for training: $28 + 28 + 56 = 112$ GB before activations - more than a single 80 GB A100 holds

This is why:

Large-scale training uses mixed precision: weights in bfloat16 (2 bytes) but optimizer states in float32 (4 bytes) for numerical stability
Gradient checkpointing trades compute for memory (recomputing activations instead of storing them)
Optimizer sharding (via ZeRO in DeepSpeed) distributes optimizer states across GPUs
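The arithmetic behind these points can be sketched as a small calculator. `training_memory_gb` is a hypothetical helper, not a library function. It counts weights, gradients, and optimizer state (activations are ignored), and models sharding in the simplest way by dividing only the optimizer state across GPUs, in the style of ZeRO stage 1. The mixed-precision case also assumes a float32 master copy of the weights is kept alongside the two Adam moments, which gives 12 bytes of optimizer-side state per parameter.

```python
def training_memory_gb(n_params: float, weight_bytes: int, grad_bytes: int,
                       optimizer_bytes: int, n_shards: int = 1) -> float:
    """Per-GPU GB for weights + gradients + optimizer state.

    Only the optimizer state is divided across n_shards GPUs; weights and
    gradients are assumed to be fully replicated. Activations are ignored.
    """
    per_param = weight_bytes + grad_bytes + optimizer_bytes / n_shards
    return n_params * per_param / 1e9


n = 7e9  # 7B parameters

# Everything in float32: 4 (weights) + 4 (grads) + 8 (two Adam moments)
fp32_full = training_memory_gb(n, 4, 4, 8)

# Mixed precision (assumed layout): bf16 weights and grads, plus float32
# master weights and two float32 Adam moments = 12 bytes of optimizer state
mixed = training_memory_gb(n, 2, 2, 12)

# Same layout with the optimizer state sharded across 8 GPUs
mixed_sharded = training_memory_gb(n, 2, 2, 12, n_shards=8)

print(f"float32, 1 GPU:           {fp32_full:.1f} GB")
print(f"mixed precision, 1 GPU:   {mixed:.1f} GB")
print(f"mixed precision, 8 GPUs:  {mixed_sharded:.1f} GB per GPU")
```

Under these assumptions, mixed precision alone does not shrink the total, because the float32 master weights offset the savings. The per-GPU footprint drops only once the optimizer state is sharded.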

import torch
import torch.nn as nn


def estimate_training_memory(model: nn.Module, batch_size: int,
                              seq_len: int, precision: str = "fp16",
                              d_model: int = 512, n_layers: int = 12) -> dict:
    """
    Rough estimate of total GPU memory needed for training with Adam.
    Accounts for: parameters, gradients, optimizer states, activations.
    d_model and n_layers only feed the activation estimate; set them to
    match the architecture (the defaults are placeholders).
    """
    n_params = sum(p.numel() for p in model.parameters())
    # fp16 and bf16 both use 2 bytes per value; anything else is treated as fp32
    bytes_per_param = 2 if precision in ("fp16", "bf16") else 4

    model_bytes = n_params * bytes_per_param
    grad_bytes = n_params * bytes_per_param      # gradients same size as params
    adam_bytes = n_params * 4 * 2               # float32 first + second moment

    # Rough activation estimate: depends heavily on architecture
    # Transformer: O(batch * seq_len * d_model * n_layers)
    activation_bytes = batch_size * seq_len * d_model * n_layers * bytes_per_param

    total_bytes = model_bytes + grad_bytes + adam_bytes + activation_bytes

    return {
        "model_MB": model_bytes / 1e6,
        "gradients_MB": grad_bytes / 1e6,
        "adam_states_MB": adam_bytes / 1e6,
        "activations_MB": activation_bytes / 1e6,
        "total_MB": total_bytes / 1e6,
        "total_GB": total_bytes / 1e9,
        "n_params": f"{n_params / 1e6:.1f}M",
    }

Hyperparameter Sensitivity: Adam vs SGD

One of Adam's most practically important properties is its robustness to hyperparameter choice:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np


def compare_lr_sensitivity():
    """
    Show that Adam is less sensitive to LR than SGD.
    Both optimizers trained on the same simple regression task
    with a range of learning rates.
    """
    def make_model():
        return nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

    # Simple regression: y = sin(2*pi*x)
    x = torch.linspace(-1, 1, 100).unsqueeze(1)
    y = torch.sin(2 * np.pi * x)

    lr_values = [1e-4, 1e-3, 1e-2, 0.1, 1.0]
    n_steps = 500

    print(f"\n{'LR':<10} | {'SGD final loss':>16} | {'Adam final loss':>16}")
    print("-" * 50)

    for lr in lr_values:
        # SGD
        torch.manual_seed(0)
        model_sgd = make_model()
        opt_sgd = optim.SGD(model_sgd.parameters(), lr=lr)
        for _ in range(n_steps):
            opt_sgd.zero_grad()
            loss = ((model_sgd(x) - y) ** 2).mean()
            loss.backward()
            opt_sgd.step()
        sgd_loss = loss.item()

        # Adam
        torch.manual_seed(0)
        model_adam = make_model()
        opt_adam = optim.Adam(model_adam.parameters(), lr=lr)
        for _ in range(n_steps):
            opt_adam.zero_grad()
            loss = ((model_adam(x) - y) ** 2).mean()
            loss.backward()
            opt_adam.step()
        adam_loss = loss.item()

        print(f"{lr:<10} | {sgd_loss:>16.4f} | {adam_loss:>16.4f}")
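    # Pattern this comparison is meant to illustrate (exact numbers depend on
    # the seed and this toy task, so read the printed table rather than
    # assuming it):
    # Small LR (1e-4): SGD barely moves, Adam makes slow progress
    # Medium LR (1e-3 to 1e-2): Adam typically reaches a lower loss in 500 steps
    # Large LR (0.1 to 1.0): plain SGD is more likely to stall or diverge;
    #   Adam's normalized steps usually keep it stable over a wider range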


compare_lr_sensitivity()

The practical takeaway: with Adam, the default learning rate of 1e-3 is a reasonable starting point across a wide range of problems. With SGD, you must tune the learning rate carefully for each problem - too small and it trains slowly, too large and it diverges. Adam's usable LR range is often one to two orders of magnitude wider, which is why it is the default choice when prototyping or when the LR tuning budget is limited.
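The mechanism can be isolated on a problem small enough to reason about by hand. The sketch below minimizes the one-dimensional quadratic $L(x) = \frac{1}{2} h x^2$ with $h = 100$, an illustrative curvature. Plain gradient descent multiplies $x$ by $(1 - \eta h)$ each step, so it diverges once $\eta > 2/h = 0.02$. Adam divides by a running gradient magnitude, which keeps each step on the order of $\eta$ regardless of $h$. The helpers `run_gd` and `run_adam` are written for this illustration and use Adam's usual default betas.

```python
import math


def run_gd(lr, curvature=100.0, x0=1.0, steps=100):
    """Plain gradient descent on L(x) = 0.5 * curvature * x**2."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x
    return x


def run_adam(lr, curvature=100.0, x0=1.0, steps=100,
             beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam with bias correction on the same quadratic."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = curvature * x
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


for lr in (1e-3, 1e-2, 0.1):
    print(f"lr={lr:<6} GD |x|={abs(run_gd(lr)):.3e}  "
          f"Adam |x|={abs(run_adam(lr)):.3e}")
```

The flip side is visible at the smallest rate: Adam moves roughly $\eta$ per step no matter how large the gradient is, so at `lr=1e-3` it makes slower progress here than gradient descent does. The wider stable range is a trade-off, not a free win.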

