Gradient Descent Mechanics - The Engine of Every Training Loop

Reading time: ~26 minutes | Level: Mathematical Foundations → ML Engineering

Every time you run a training loop - whether linear regression on 1,000 samples or a transformer on 1 trillion tokens - the same engine is turning underneath: gradient descent.

The parameter update is θ ← θ - α∇L(θ). Six symbols. But each decision embedded in those six symbols - how to compute the gradient (full batch? mini-batch? single sample?), how large α should be, whether to add momentum, when to decay α - determines whether your model converges in hours or diverges in minutes.

This lesson builds gradient descent from first principles and explains every mechanic that determines training success.

What You Will Learn

Gradient descent derivation from Taylor expansion
Why the gradient is the right direction to move
Learning rate: geometric interpretation, too high vs. too low behavior
Convergence conditions and guarantees
Batch GD vs. mini-batch GD vs. SGD trade-off analysis
Momentum: escaping poor curvature
Learning rate schedules: step decay, exponential, cosine, warmup
From-scratch Python implementation of each variant

Prerequisites

Lesson 01: Derivatives and Gradients (required)
Lesson 02: Chain Rule and Backpropagation (recommended)
NumPy and basic Python

Part 1 - Gradient Descent Derivation

What we are trying to solve

Given a loss function L(θ) where θ ∈ ℝⁿ, find:

$\theta^* = \arg\min_\theta L(\theta)$

If we could solve this analytically (set ∇L = 0 and solve), we would. For most ML problems, this is intractable. Instead, we use iterative refinement.

First-order Taylor approximation

Near the current point θ₀, we approximate L using a Taylor expansion:

$L(\theta_0 + \Delta\theta) \approx L(\theta_0) + \nabla L(\theta_0)^T \Delta\theta$

We want to move to a new point θ₀ + Δθ that decreases L. We need ∇L(θ₀)ᵀ Δθ < 0.

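The inner product is ∇L · Δθ = ‖∇L‖ · ‖Δθ‖ · cos(φ), where φ is the angle between the two vectors. For a fixed step length ‖Δθ‖, this is most negative when cos(φ) = -1, that is, when Δθ points exactly opposite to ∇L: Δθ = -α∇L(θ₀) for some step size α > 0.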

This gives the gradient descent update:

$\boxed{\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)}$

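Gradient descent steps along the negative gradient, which is the direction of steepest descent under the first-order Taylor approximation. Because that approximation is only accurate near θ_t, α must be small enough to keep each step inside the region where it holds.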
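
The steepest-descent claim can be checked numerically. The sketch below is only an illustration: the loss L(θ) = θ₀² + 3θ₁² and the starting point are assumptions chosen for the example. It samples unit directions around the circle and confirms that the one with the most negative first-order change lines up with -∇L.

```python
import numpy as np

def grad_L(theta):
    # Gradient of the illustrative loss L(θ) = θ₀² + 3θ₁² (an assumed example)
    return np.array([2.0 * theta[0], 6.0 * theta[1]])

theta0 = np.array([1.0, 1.0])
g = grad_L(theta0)

# Sample unit directions and evaluate the first-order change ∇L(θ₀)ᵀ Δθ for each
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
first_order_change = directions @ g

best = directions[np.argmin(first_order_change)]   # most negative change
steepest = -g / np.linalg.norm(g)                  # normalized negative gradient

print("best sampled direction:", best.round(3))
print("-grad / ||grad||:      ", steepest.round(3))
print("cosine similarity:     ", float(best @ steepest))
```

The two directions agree up to the resolution of the angle grid.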

Geometric interpretation

L(θ) contour plot (two parameters):

      θ₂
       │
  3.0  ┤       ●  ← initial θ₀
       │        ↘
  2.0  ┤         ↘  ← step in -∇L direction
       │          ●  θ₁
  1.0  ┤            ↘
       │              ●  θ₂
  0.0  ┤               ↘
       │                ★  ← minimum θ*
       └──────┬──────┬──────── θ₁
             0      2

Each step moves along the negative gradient, downhill. The path curves because the gradient direction changes at each new position.

Learning rate effect

import numpy as np

def gradient_descent_1d(
    f, grad_f, x0: float, lr: float,
    n_steps: int = 200, tol: float = 1e-8
) -> list:
    """1D gradient descent with convergence detection."""
    x = x0
    history = [{'x': x, 'f': f(x)}]
    for step in range(n_steps):
        grad = grad_f(x)
        x = x - lr * grad
        history.append({'x': x, 'f': f(x)})
        if abs(grad) < tol:
            break
    return history

# Minimize f(x) = x^2 + 2x + 1 = (x+1)^2  (minimum at x=-1)
f = lambda x: x**2 + 2*x + 1
grad_f = lambda x: 2*x + 2

print("Effect of learning rate on convergence:")
print(f"{'lr':>6} | {'steps':>6} | {'final_x':>10} | {'final_loss':>12}")
print("-" * 45)
for lr in [0.001, 0.1, 0.9, 1.0, 1.1]:
    hist = gradient_descent_1d(f, grad_f, x0=5.0, lr=lr, n_steps=200)
    final = hist[-1]
    status = "DIVERGED" if abs(final['f']) > 1e6 else f"{final['f']:.8f}"
    print(f"{lr:>6.3f} | {len(hist):>6} | {final['x']:>10.4f} | {status:>12}")

Part 2 - Convergence Theory

When does gradient descent converge?

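For a convex function L whose gradient is Lipschitz continuous with constant L_const (‖∇L(x) - ∇L(y)‖ ≤ L_const‖x - y‖ for all x, y), gradient descent with learning rate α ≤ 1/L_const decreases the loss at every step, and the loss converges to the global minimum value at rate O(1/t).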

Convergence rate for strongly convex functions:

$L(\theta_t) - L(\theta^*) \leq \left(1 - \alpha \mu\right)^t \left(L(\theta_0) - L(\theta^*)\right)$

where μ is the strong convexity constant and α ≤ 1/L_const. The error shrinks by a constant factor each step - linear convergence.
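
To get a feel for what this bound implies, the sketch below computes how many steps the bound needs before the error has shrunk by a factor ε. It assumes α = 1/L_const, so the per-step factor is 1 - μ/L_const = 1 - 1/κ; the function name `steps_to_tolerance` is hypothetical.

```python
import math

def steps_to_tolerance(kappa: float, eps: float) -> int:
    """Smallest t with (1 - 1/kappa)**t <= eps, assuming α = 1/L_const."""
    if kappa <= 1.0:
        return 1  # the factor is 0, so the bound is met after a single step
    return math.ceil(math.log(eps) / math.log(1.0 - 1.0 / kappa))

for kappa in [10, 100, 1000]:
    print(f"κ = {kappa:>4}: {steps_to_tolerance(kappa, 1e-6)} steps for a 1e-6 reduction")
```

The step count grows roughly like κ · ln(1/ε), which is why badly conditioned problems are slow for plain gradient descent.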

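Safe learning rate range: 0 < α < 2/L_const. On a quadratic, any α beyond 2/L_const makes the iterates grow along the direction of largest curvature, so gradient descent diverges; on general smooth losses the descent guarantee is lost.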
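
For the 1D quadratic from Part 1, f(x) = (x + 1)², the curvature is L_const = 2, so the threshold is α = 1. A minimal sketch of why: each step multiplies the distance to the minimum by (1 - α · L_const). The helper name `error_multiplier` is an assumption made for this example.

```python
def error_multiplier(lr: float, curvature: float) -> float:
    """Per-step factor on the distance to the minimum of a 1D quadratic."""
    return abs(1.0 - lr * curvature)

L_const = 2.0  # second derivative of f(x) = (x + 1)^2
for lr in [0.1, 0.5, 0.9, 1.0, 1.1]:
    m = error_multiplier(lr, L_const)
    if m < 1:
        verdict = "converges"
    elif m == 1:
        verdict = "oscillates without converging"
    else:
        verdict = "diverges"
    print(f"lr = {lr:.1f}: factor = {m:.2f} -> {verdict}")
```

This matches the learning-rate sweep in Part 1: lr = 1.0 sits exactly on the boundary, and lr = 1.1 grows the error by about 1.2x per step.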

The condition number

The condition number κ = L_const/μ (max curvature / min curvature) determines how hard optimization is:

κ = 1 (sphere):              κ = 100 (elongated ellipse):
     ○ ○ ○                   ─────────────────────────
    ○ ○★○ ○                   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
     ○ ○ ○                   ─ ─ ─ ─ ─★─ ─ ─ ─ ─ ─
                              ─────────────────────────

Fast convergence               Zigzag across narrow valley

import numpy as np

# Demonstrate condition number effect on convergence speed
def optimize_quadratic(kappa: float, n_steps: int = 500) -> list:
    """
    Minimize f(θ) = 0.5 * (κ*θ₀² + θ₁²) with gradient descent.
    L_const = κ (largest curvature), μ = 1 (smallest curvature)
    Learning rate used: α = 1/L_const = 1/κ (stability only requires α < 2/L_const)
    """
    theta = np.array([1.0, 1.0])
    lr = 1.0 / kappa  # α = 1/L_const, the standard conservative choice
    losses = []

    for _ in range(n_steps):
        grad = np.array([kappa * theta[0], theta[1]])
        theta -= lr * grad
        loss = 0.5 * (kappa * theta[0]**2 + theta[1]**2)
        losses.append(loss)

    return losses

print(f"{'κ':>6} | {'steps to 0.01':>14} | {'convergence rate':>16}")
print("-" * 42)
for kappa in [1, 10, 100, 1000]:
    losses = optimize_quadratic(kappa, n_steps=5000)
    # losses[i] is the loss after i + 1 updates
    steps = next((i + 1 for i, l in enumerate(losses) if l < 0.01), 5000)
    rate = 1 - 1/kappa  # per-step factor on θ₁ (the slow direction) when α = 1/κ
    print(f"{kappa:>6} | {steps:>14} | {rate:>16.4f}")

Part 3 - Batch, Mini-Batch, and Stochastic Gradient Descent

Full batch gradient descent

Compute the exact gradient over the entire dataset:

$\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(\theta)$

import numpy as np

def batch_gradient_descent(X, y, lr=0.01, n_epochs=100):
    """Full batch GD for linear regression."""
    N, d = X.shape
    theta = np.zeros(d)
    losses = []
    for _ in range(n_epochs):
        residuals = X @ theta - y
        gradient = (2/N) * X.T @ residuals
        theta -= lr * gradient
        losses.append(np.mean(residuals**2))
    return theta, losses

Pros: Exact gradient, smooth loss curve, deterministic
Cons: Must fit entire dataset in memory, slow for large N

Stochastic gradient descent (SGD)

Compute the gradient on a single randomly chosen sample:

def stochastic_gradient_descent(X, y, lr=0.01, n_epochs=10):
    """SGD: one sample at a time."""
    N, d = X.shape
    theta = np.zeros(d)
    losses = []
    for _ in range(n_epochs):
        for i in np.random.permutation(N):
            xi, yi = X[i], y[i]
            grad = 2 * (xi @ theta - yi) * xi
            theta -= lr * grad
        losses.append(np.mean((X @ theta - y)**2))
    return theta, losses

Pros: Fast updates, naturally explores the loss surface, noise can help escape saddle points
Cons: Noisy gradient estimates, requires careful learning rate tuning

Mini-batch gradient descent (the standard)

Compute the gradient on a batch of B samples:

def minibatch_gradient_descent(X, y, batch_size=32, lr=0.01, n_epochs=50):
    """Mini-batch GD - the standard for deep learning."""
    N, d = X.shape
    theta = np.zeros(d)
    losses = []
    for _ in range(n_epochs):
        idx = np.random.permutation(N)
        X_shuf, y_shuf = X[idx], y[idx]
        for start in range(0, N, batch_size):
            Xb = X_shuf[start:start+batch_size]
            yb = y_shuf[start:start+batch_size]
            grad = (2/len(yb)) * Xb.T @ (Xb @ theta - yb)
            theta -= lr * grad
        losses.append(np.mean((X @ theta - y)**2))
    return theta, losses

Comparison

Property	Full Batch	Mini-Batch	SGD
Gradient accuracy	Exact	Approximate	Very noisy
Memory	O(N)	O(B)	O(1)
GPU efficiency	Good	Best	Poor
Convergence noise	None	Low	High
Saddle point escape	Poor	Good	Best
Typical batch size	N	32–4096	1
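
The "gradient accuracy" row can be made concrete by measuring how far mini-batch gradients fall from the full-batch gradient. This is a rough sketch on synthetic linear-regression data; the dataset size, seed, and the helper names `mse_grad` and `grad_noise` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 2000, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
theta = np.zeros(d)  # evaluate every gradient at the same point

def mse_grad(Xb, yb, theta):
    # Gradient of the mean squared error on the given batch
    return (2.0 / len(yb)) * Xb.T @ (Xb @ theta - yb)

full_grad = mse_grad(X, y, theta)

def grad_noise(batch_size, n_trials=500):
    """Mean squared distance between a mini-batch gradient and the full gradient."""
    errs = []
    for _ in range(n_trials):
        idx = rng.choice(N, size=batch_size, replace=False)
        errs.append(np.sum((mse_grad(X[idx], y[idx], theta) - full_grad) ** 2))
    return float(np.mean(errs))

noise = {B: grad_noise(B) for B in [1, 32, 256]}
for B, v in noise.items():
    print(f"B = {B:>3}: mean squared gradient error = {v:.4f}")
```

The error shrinks roughly in proportion to 1/B, which is the trade the table summarizes: larger batches buy a cleaner gradient at the cost of more computation per update.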

# Verify all three converge to the same solution
np.random.seed(42)
N, d = 500, 3
X = np.random.randn(N, d)
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + np.random.randn(N) * 0.1

theta_batch, _ = batch_gradient_descent(X, y, lr=0.1, n_epochs=200)
theta_mini, _ = minibatch_gradient_descent(X, y, batch_size=32, lr=0.05, n_epochs=50)
theta_sgd, _ = stochastic_gradient_descent(X, y, lr=0.01, n_epochs=30)

print("True theta:       ", true_theta)
print("Batch GD theta:   ", theta_batch.round(4))
print("Mini-batch theta: ", theta_mini.round(4))
print("SGD theta:        ", theta_sgd.round(4))

:::tip Batch size and generalization
Larger batches give lower-variance gradients and better hardware throughput, but they often generalize worse. Smaller batches produce noisier gradients that act as implicit regularization, steering optimization toward flat minima that tend to generalize better. Start with batch size 256 or 512; if generalization is poor, try 32 or 64.
:::

Part 4 - Momentum

Vanilla gradient descent oscillates in narrow valleys (high condition number). Momentum accumulates gradient history to smooth oscillations.

Intuition: ball rolling downhill

Without momentum: a ball placed on a narrow slope bounces back and forth across the valley while inching forward.

With momentum: the ball builds up speed in the consistent downhill direction, smoothing out the oscillations.

Update rule

$v_t = \beta v_{t-1} + (1 - \beta) \nabla L(\theta_t)$

$\theta_{t+1} = \theta_t - \alpha v_t$

v is the velocity (exponential moving average of gradients)
β (typically 0.9) controls how much history to retain
Oscillating gradients (changing sign each step) cancel out in the EMA
Consistent gradients (same sign each step) reinforce each other, so v approaches the full gradient in that direction
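
The last two points can be seen directly by feeding the EMA two artificial gradient sequences, one that flips sign every step and one that stays constant. This is a minimal sketch; the helper name `ema` and the 50-step length are assumptions for the example.

```python
import numpy as np

def ema(grads, beta=0.9):
    """Velocity sequence v_t = beta * v_{t-1} + (1 - beta) * g_t, starting from v = 0."""
    v = 0.0
    out = []
    for g in grads:
        v = beta * v + (1 - beta) * g
        out.append(v)
    return np.array(out)

steps = 50
oscillating = np.array([(-1.0) ** t for t in range(steps)])  # +1, -1, +1, ...
consistent = np.ones(steps)                                   # +1, +1, +1, ...

v_osc = ema(oscillating)
v_con = ema(consistent)
print(f"oscillating gradients: |v| settles near {abs(v_osc[-1]):.3f}")
print(f"consistent gradients:  v settles near  {v_con[-1]:.3f}")
```

With β = 0.9 the oscillating sequence is damped to about (1 - β)/(1 + β) ≈ 0.05 of its amplitude, while the consistent sequence passes through almost unchanged.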

import numpy as np

def sgd_with_momentum(
    grad_fn, theta0: np.ndarray,
    lr: float = 0.01, beta: float = 0.9, n_steps: int = 200
) -> list:
    """SGD with momentum."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    history = [theta.copy()]

    for _ in range(n_steps):
        grad = grad_fn(theta)
        v = beta * v + (1 - beta) * grad
        theta = theta - lr * v
        history.append(theta.copy())

    return history

# Demonstrate momentum on elongated (high κ) loss surface
# L(θ) = 10*θ₀² + 0.1*θ₁²  (κ = 100)
def elongated_grad(theta):
    return np.array([20.0 * theta[0], 0.2 * theta[1]])

def elongated_loss(theta):
    return 10*theta[0]**2 + 0.1*theta[1]**2

theta0 = np.array([5.0, 5.0])
n_steps = 1000

# Vanilla GD: lr must stay below 2/L_const = 2/20 = 0.1 to be stable
theta_v = theta0.copy()
loss_v = []
for _ in range(n_steps):
    theta_v -= 0.04 * elongated_grad(theta_v)
    loss_v.append(elongated_loss(theta_v))

# Momentum GD: the (1 - β) factor makes v an average of gradients, so at the
# same lr it moves at about the same speed as vanilla GD. The gain comes from
# damping the oscillation along θ₀, which permits a 10x larger lr here
# (lr = 0.4 would make vanilla GD diverge).
hist_m = sgd_with_momentum(elongated_grad, theta0, lr=0.4, beta=0.9, n_steps=n_steps)
loss_m = [elongated_loss(t) for t in hist_m[1:]]  # skip the initial point

# Find steps to reach loss < 0.01
steps_v = next((i + 1 for i, l in enumerate(loss_v) if l < 0.01), n_steps)
steps_m = next((i + 1 for i, l in enumerate(loss_m) if l < 0.01), n_steps)

print(f"Vanilla GD:   {steps_v} steps to reach loss < 0.01")
print(f"Momentum GD:  {steps_m} steps to reach loss < 0.01")
print(f"Speedup: {steps_v / max(steps_m, 1):.1f}x")

Nesterov momentum (lookahead)

Computes the gradient at the approximate next position before updating:

$\theta_\text{look} = \theta_t - \alpha \beta v_{t-1}$

$v_t = \beta v_{t-1} + (1-\beta) \nabla L(\theta_\text{look})$

$\theta_{t+1} = \theta_t - \alpha v_t$

def sgd_nesterov(
    grad_fn, theta0: np.ndarray,
    lr: float = 0.01, beta: float = 0.9, n_steps: int = 200
) -> list:
    """SGD with Nesterov momentum (lookahead)."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    history = [theta.copy()]

    for _ in range(n_steps):
        # Look ahead to where the momentum term alone would move θ.
        # The step applied to θ is lr * v, so the lookahead is scaled by lr too.
        theta_look = theta - lr * beta * v
        grad = grad_fn(theta_look)               # gradient at lookahead
        v = beta * v + (1 - beta) * grad
        theta = theta - lr * v
        history.append(theta.copy())

    return history

Nesterov often converges slightly faster than vanilla momentum, particularly for convex problems, because it "corrects" the overshoot by computing the gradient where the parameters will be after the momentum step rather than where they are now.
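
A side-by-side run makes the comparison easy to try. The sketch below reuses the elongated quadratic L(θ) = 10θ₀² + 0.1θ₁² and assumes the lookahead point is θ - αβv, meaning the point the momentum term alone would reach. The learning rate 0.4 and the helper name `run` are choices made for this illustration, not tuned values.

```python
import numpy as np

def grad(theta):
    # Gradient of L(θ) = 10θ₀² + 0.1θ₁²
    return np.array([20.0 * theta[0], 0.2 * theta[1]])

def loss(theta):
    return 10.0 * theta[0] ** 2 + 0.1 * theta[1] ** 2

def run(nesterov: bool, lr=0.4, beta=0.9, n_steps=300):
    theta = np.array([5.0, 5.0])
    v = np.zeros_like(theta)
    losses = []
    for _ in range(n_steps):
        # Assumed lookahead: where the momentum term alone would move θ
        point = theta - lr * beta * v if nesterov else theta
        v = beta * v + (1 - beta) * grad(point)
        theta = theta - lr * v
        losses.append(loss(theta))
    return losses

loss_momentum = run(nesterov=False)
loss_nesterov = run(nesterov=True)
for name, ls in [("momentum", loss_momentum), ("nesterov", loss_nesterov)]:
    steps = next((i + 1 for i, l in enumerate(ls) if l < 0.01), None)
    print(f"{name:>8}: loss < 0.01 after {steps} steps, final loss {ls[-1]:.2e}")
```

Both variants converge on this surface; how large the gap is depends on lr and β, so treat the printed step counts as specific to these settings.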

Part 5 - Learning Rate Schedules

A fixed learning rate is often suboptimal:

Large early steps: fast initial convergence
Small late steps: precise final convergence

Learning rate schedules reduce the learning rate over time to get the best of both.

Step decay

Drop the learning rate by factor γ every k epochs:

$\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t/k \rfloor}$

def step_decay(initial_lr, decay_factor, step_size, epoch):
    return initial_lr * (decay_factor ** (epoch // step_size))

# lr: 0.1 → 0.05 → 0.025 → 0.0125 every 20 epochs
lrs = [step_decay(0.1, 0.5, 20, e) for e in range(80)]
for e in [0, 20, 40, 60]:
    print(f"Epoch {e}: lr = {lrs[e]:.4f}")

Exponential decay

$\alpha_t = \alpha_0 \cdot e^{-\lambda t}$

import numpy as np

def exponential_decay(initial_lr, decay_rate, step):
    return initial_lr * np.exp(-decay_rate * step)
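
The decay rate λ is easier to choose through a half-life, since α halves whenever λt = ln 2. A small sketch follows; the helper `decay_rate_for_half_life` is hypothetical, and `exponential_decay` is repeated only so the snippet runs on its own.

```python
import numpy as np

def exponential_decay(initial_lr, decay_rate, step):
    return initial_lr * np.exp(-decay_rate * step)

def decay_rate_for_half_life(half_life_steps: float) -> float:
    """λ such that the learning rate halves every half_life_steps steps."""
    return np.log(2.0) / half_life_steps

lam = decay_rate_for_half_life(10)
for step in [0, 10, 20, 30]:
    print(f"step {step:2d}: lr = {exponential_decay(0.1, lam, step):.4f}")
```

With a half-life of 10 steps the learning rate goes 0.1, 0.05, 0.025, 0.0125, the same values as the step-decay example but reached smoothly.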

Cosine annealing (modern standard)

Follows a cosine curve from initial_lr to min_lr over T_max steps:

$\alpha_t = \alpha_{\min} + \frac{1}{2}(\alpha_0 - \alpha_{\min})\left(1 + \cos\left(\frac{\pi t}{T_{\max}}\right)\right)$

import numpy as np

def cosine_annealing(initial_lr, min_lr, t_max, step):
    """Smooth cosine decay from initial_lr to min_lr."""
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + np.cos(np.pi * step / t_max))

T_max = 100
schedule = [cosine_annealing(0.1, 0.0001, T_max, t) for t in range(T_max + 1)]  # include t = T_max so the last entry is min_lr
print(f"Initial lr:  {schedule[0]:.4f}")
print(f"Halfway:     {schedule[50]:.5f}")
print(f"Final lr:    {schedule[-1]:.6f}")

Cosine annealing with warm restarts (SGDR)

Periodically reset the learning rate to allow cyclical exploration:

def cosine_warm_restarts(initial_lr, min_lr, t_0, t_mult, step):
    """Cosine annealing with warm restarts. Period grows after each restart."""
    t = t_0
    while step >= t:
        step -= t
        t *= t_mult
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + np.cos(np.pi * step / t))
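
It helps to know where the restarts land. The sketch below lists the global steps at which the period loop above resets the learning rate to initial_lr; the helper name `restart_steps` is hypothetical.

```python
def restart_steps(t_0: int, t_mult: int, n_restarts: int) -> list:
    """Global step indices at which the schedule jumps back to initial_lr."""
    boundaries, period, step = [], t_0, 0
    for _ in range(n_restarts):
        step += period          # end of the current cosine cycle
        boundaries.append(step)
        period *= t_mult        # the next cycle is t_mult times longer
    return boundaries

print(restart_steps(t_0=10, t_mult=2, n_restarts=4))  # → [10, 30, 70, 150]
```

With t_mult = 2 the cycles last 10, 20, 40, and 80 steps, so later cycles spend longer at low learning rates.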

Linear warmup + cosine decay (transformer standard)

def warmup_cosine_schedule(initial_lr, warmup_steps, total_steps, min_lr, current_step):
    """
    Linear warmup then cosine decay.
    Near-universal for transformer training (BERT, GPT, LLaMA, etc.)
    """
    if current_step < warmup_steps:
        return initial_lr * (current_step / warmup_steps)
    progress = (current_step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + np.cos(np.pi * progress))

# Typical transformer training schedule
print("Warmup + cosine schedule sample:")
for step in [0, 200, 500, 2500, 9500]:
    # Evaluate the schedule at the step being printed so label and value match
    lr = warmup_cosine_schedule(3e-4, 500, 10000, 1e-6, step)
    print(f"  Step {step:5d}: lr = {lr:.2e}")

:::tip Why warmup matters for transformers
At the start of training, parameters are random and gradients are poorly estimated. Without warmup, a large initial learning rate can cause catastrophic updates in the first few hundred steps. Warmup starts with small learning rates (safe early updates) and ramps up as gradients become more reliable.
:::

Part 6 - Gradient Clipping

Exploding gradients - common in RNNs and sometimes in early transformer training - cause parameters to jump wildly. Gradient clipping bounds the gradient norm before the update.

$g \leftarrow g \cdot \min\left(1, \frac{C}{\|g\|_2}\right)$

If ‖g‖₂ ≤ C: no change. If ‖g‖₂ > C: scale to have norm C (preserving direction).

import numpy as np

def clip_gradient_norm(grad: np.ndarray, max_norm: float) -> tuple:
    """
    Clip gradient to have norm at most max_norm.
    Returns clipped gradient and original norm.
    """
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad, norm

# Example: large gradient that would cause instability
grad = np.array([50.0, -30.0, 80.0])
print(f"Original grad norm: {np.linalg.norm(grad):.2f}")

clipped, original_norm = clip_gradient_norm(grad, max_norm=5.0)
print(f"Clipped grad norm:  {np.linalg.norm(clipped):.2f}")
print(f"Direction preserved: {np.allclose(clipped/np.linalg.norm(clipped), grad/original_norm)}")

import torch

# PyTorch built-in gradient clipping
def train_step(model, optimizer, loss):
    optimizer.zero_grad()
    loss.backward()

    # Clip before optimizer step
    grad_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(),
        max_norm=1.0
    )

    optimizer.step()
    return grad_norm.item()

Part 7 - Complete Training Loop

import numpy as np
from typing import Callable, Optional

class SGDOptimizer:
    """
    Mini-batch SGD with momentum, learning rate schedule, and gradient clipping.
    """

    def __init__(
        self,
        lr: float = 0.01,
        momentum: float = 0.9,
        clip_norm: Optional[float] = None,
        lr_schedule: Optional[Callable] = None,
    ):
        self.lr = lr
        self.momentum = momentum
        self.clip_norm = clip_norm
        self.lr_schedule = lr_schedule
        self.velocity = None
        self.step_count = 0

    def step(self, theta: np.ndarray, grad: np.ndarray) -> np.ndarray:
        """Apply one gradient descent step."""
        if self.velocity is None:
            self.velocity = np.zeros_like(theta)

        # Get current learning rate
        current_lr = self.lr_schedule(self.step_count) if self.lr_schedule else self.lr

        # Clip gradient
        if self.clip_norm is not None:
            norm = np.linalg.norm(grad)
            if norm > self.clip_norm:
                grad = grad * (self.clip_norm / norm)

        # Momentum update
        self.velocity = self.momentum * self.velocity + (1 - self.momentum) * grad
        theta = theta - current_lr * self.velocity

        self.step_count += 1
        return theta


def train(
    X: np.ndarray,
    y: np.ndarray,
    batch_size: int = 32,
    n_epochs: int = 50,
    lr: float = 0.1
):
    """Complete training loop with mini-batch SGD."""
    N, d = X.shape
    theta = np.zeros(d)

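    # Cosine annealing schedule (cosine_annealing is the function from Part 5).
    # Count the partial final batch as a step; N // batch_size undercounts, and
    # past total_steps the cosine would start rising again.
    total_steps = -(-N // batch_size) * n_epochs  # ceil(N / batch_size) per epoch
    schedule = lambda s: cosine_annealing(lr, 1e-5, total_steps, s)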
    optimizer = SGDOptimizer(lr=lr, momentum=0.9, clip_norm=5.0, lr_schedule=schedule)

    for epoch in range(n_epochs):
        idx = np.random.permutation(N)
        Xs, ys = X[idx], y[idx]
        epoch_loss = 0.0
        n_batches = 0

        for start in range(0, N, batch_size):
            Xb = Xs[start:start+batch_size]
            yb = ys[start:start+batch_size]
            B = len(yb)

            preds = Xb @ theta
            residuals = preds - yb
            grad = (2/B) * Xb.T @ residuals

            theta = optimizer.step(theta, grad)
            epoch_loss += np.mean(residuals**2)
            n_batches += 1

        if epoch % 10 == 0:
            avg_loss = epoch_loss / n_batches
            print(f"Epoch {epoch:3d}: loss = {avg_loss:.6f}, "
                  f"lr = {schedule(optimizer.step_count):.2e}")

    return theta


# Demo
np.random.seed(42)
N, d = 1000, 5
X = np.random.randn(N, d)
true_theta = np.array([1.0, -2.0, 0.5, 0.3, -1.5])
y = X @ true_theta + np.random.randn(N) * 0.5

final_theta = train(X, y, batch_size=64, n_epochs=50, lr=0.1)
print(f"\nTrue theta:  {true_theta}")
print(f"Found theta: {final_theta.round(4)}")

Part 8 - Engineering Red Flags

:::danger Learning rate too high: NaN loss
If your loss jumps to NaN or inf in the first steps, your learning rate is almost certainly too high. Reduce by 10x until stable.

# Diagnosis pattern
for lr in [1e-1, 1e-2, 1e-3, 1e-4]:
    loss = run_few_steps(model, lr=lr)
    status = "DIVERGED" if (np.isnan(loss) or np.isinf(loss)) else f"{loss:.4f}"
    print(f"lr={lr}: {status}")
# Use ~1/3 of the largest non-diverging lr

:::
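
The diagnosis pattern above calls a `run_few_steps` helper that this lesson does not define. One way to make it concrete, purely as an illustration, is to use the 1D quadratic f(x) = (x + 1)² from Part 1 as a stand-in for the model; the signature, the sweep range, and the "loss went up" divergence test below are all assumptions.

```python
import numpy as np

def run_few_steps(lr: float, x0: float = 5.0, n_steps: int = 20) -> float:
    """Stand-in helper: n_steps of GD on f(x) = (x + 1)^2, returns the final loss."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * (2.0 * x + 2.0)
    return (x + 1.0) ** 2

initial_loss = (5.0 + 1.0) ** 2
largest_stable = None
for lr in np.geomspace(1e-4, 10.0, 60):
    loss = run_few_steps(lr)
    diverged = (not np.isfinite(loss)) or loss > initial_loss
    if not diverged:
        largest_stable = lr  # keep the largest lr that still reduced the loss

suggested_lr = largest_stable / 3.0  # the ~1/3 rule from the comment above
print(f"largest non-diverging lr: {largest_stable:.3f}")
print(f"suggested lr:             {suggested_lr:.3f}")
```

On this quadratic the stability boundary is lr = 1, so the sweep stops just below it. On a real model the boundary is not known in advance, which is the reason for sweeping.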

:::warning Not shuffling in SGD
Iterating through data in fixed order causes systematic gradient bias. Always shuffle at the start of each epoch.

# WRONG: fixed order
for i in range(N):
    update(X[i], y[i])

# RIGHT: shuffle each epoch
for i in np.random.permutation(N):
    update(X[i], y[i])

:::

:::warning Learning rate schedule not reset between experiments
If you reuse an optimizer object across multiple experiments without resetting velocity and step count, the second run starts with stale velocity and a schedule that has already decayed.
:::
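
A cheap safeguard is an explicit reset method, or simply constructing a fresh optimizer for each run. The class below is a hypothetical minimal sketch that mirrors the velocity and step_count fields of the Part 7 optimizer; it is not that class itself.

```python
import numpy as np

class MomentumState:
    """Hypothetical minimal optimizer state with an explicit reset."""

    def __init__(self):
        self.velocity = None
        self.step_count = 0

    def step(self, theta, grad, lr=0.1, beta=0.9):
        if self.velocity is None:
            self.velocity = np.zeros_like(theta)
        self.velocity = beta * self.velocity + (1 - beta) * grad
        self.step_count += 1
        return theta - lr * self.velocity

    def reset(self):
        """Forget velocity and schedule position before starting a new run."""
        self.velocity = None
        self.step_count = 0

opt = MomentumState()
theta = np.array([1.0, -1.0])
for _ in range(5):
    theta = opt.step(theta, 2.0 * theta)  # gradient of ‖θ‖²
steps_before_reset = opt.step_count

opt.reset()  # call this (or build a new optimizer) before the next experiment
print(steps_before_reset, opt.step_count, opt.velocity)
```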

:::tip Learning rate finder
Before committing to a training run, sweep the learning rate from 1e-7 to 10 over ~100 steps and plot loss vs. learning rate. The learning rate just before the loss starts increasing sharply is a practical upper bound. A common heuristic is to train with ~10% of that value.
:::

Interview Questions

Q1: Derive gradient descent from first principles.

Setup: Minimize L(θ). We are at θ_t and want to find direction Δθ to decrease L.

Step 1: 1st-order Taylor approximation: $L(\theta_t + \Delta\theta) \approx L(\theta_t) + \nabla L(\theta_t)^T \Delta\theta$

Step 2: To decrease L, we need ∇L(θ_t)ᵀ Δθ < 0.

Step 3: The inner product ∇L · Δθ = ‖∇L‖ · ‖Δθ‖ · cos(φ). Most negative when φ = 180°, i.e., Δθ = -α∇L(θ_t).

Result: θ_{t+1} = θ_t - α∇L(θ_t)

Key caveat: Valid only for small α. Large α steps outside the region where the Taylor approximation holds, potentially causing divergence.

Q2: What are the trade-offs between batch size choices?

Aspect	Small batch (1-32)	Large batch (256-4096)
Gradient variance	High (noisy)	Low (accurate)
Memory	Low	High
GPU utilization	Poor	Excellent
Steps per epoch	Many	Few
Generalization	Often better	Often worse
Saddle point escape	Good	Poor

Why small batches often generalize better: Noisy gradients act as implicit regularization, steering optimization toward flat minima (wider valleys in the loss landscape). Flat minima tend to generalize better because small perturbations in parameters do not dramatically increase loss. Large-batch training tends to converge to sharp minima that are sensitive to perturbations (Keskar et al., 2017).

Q3: Explain momentum and when it helps.

Momentum adds a velocity term: v_t = β·v_{t-1} + (1-β)·∇L(θ_t), then θ ← θ - α·v_t.

When it helps: high condition number (elongated) loss surfaces

In a narrow valley, the gradient perpendicular to the valley direction oscillates (changes sign each step). The EMA averages out these oscillations (they cancel due to sign flipping). The gradient along the valley is consistent in direction and accumulates constructively. Effect: faster progress along the valley, less oscillation across it.

Hyperparameter: β = 0.9 is the standard default. Higher β (0.99) means more history but slower response to gradient changes.
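
A rule of thumb for "how much history" is an effective horizon of about 1/(1 - β) steps. The sketch below checks how much of the EMA's total weight falls on that many recent gradients; the helper name `ema_weight_on_recent` is hypothetical.

```python
def ema_weight_on_recent(beta: float, n: int) -> float:
    """Total EMA weight carried by the most recent n gradients: 1 - beta**n."""
    return 1.0 - beta ** n

for beta in [0.9, 0.99]:
    horizon = round(1.0 / (1.0 - beta))
    weight = ema_weight_on_recent(beta, horizon)
    print(f"beta = {beta}: horizon ≈ {horizon} steps, "
          f"weight on the last {horizon} gradients = {weight:.3f}")
```

In both cases about 63 to 65 percent of the weight sits inside the horizon, so β = 0.9 averages over roughly 10 steps and β = 0.99 over roughly 100.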

Nesterov variant: Compute the gradient at the "lookahead" position (where momentum would take you). This gives a better convergence rate in theory for smooth convex problems and often marginally better empirical results.

Q4: Why is cosine annealing preferred over step decay in modern deep learning?

Step decay drops the learning rate abruptly (e.g., multiply by 0.1 every 30 epochs):

Requires per-model tuning of the step epochs
Abrupt drop can cause temporary loss spikes
After the final drop, the learning rate may be too small to escape poor local optima

Cosine annealing follows a smooth cosine curve:

Smooth - no discontinuous drops
Principled - starts fast, slows as training progresses
Works across architectures without per-model tuning
With warm restarts (SGDR), enables cyclical exploration of the loss landscape

For transformers specifically: Linear warmup + cosine decay is the near-universal standard. Warmup prevents large destructive updates early in training when gradients are poorly estimated. This combination is used in BERT, GPT, LLaMA, and virtually every modern language model.

Q5: What is gradient clipping and when should you use it?

Gradient clipping bounds the gradient norm before the parameter update:

if grad_norm > max_norm:
    grad = grad * (max_norm / grad_norm)

This scales all gradient components down proportionally, preserving direction but bounding magnitude.

When to use it:

RNNs and LSTMs: Long sequence backprop multiplies many Jacobians together, causing exponential gradient growth
Early training: Random initialization can produce large initial gradients
Non-convex loss with cliffs: The loss landscape can have sudden sharp regions

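How to set max_norm: Common choices are 1.0 (transformers) or 5.0 (RNNs). Monitor the pre-clipping gradient norm. If it exceeds your threshold only on occasional spikes, clipping is doing its job. If it exceeds the threshold on almost every step, clipping is rescaling every update, so revisit the threshold or the learning rate. If it almost never triggers, you may not need clipping.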

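Global norm vs. per-parameter: Prefer global norm clipping, which rescales the full gradient (all parameters taken together) by a single factor and so preserves its direction. Clipping each parameter tensor separately rescales them by different factors, which distorts the gradient direction.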
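
The difference shows up with just two parameter tensors of very different scale. In the sketch below, "per-parameter" is read as clipping each tensor's norm separately, which is one possible interpretation; the helper names and example gradients are assumptions, and the code assumes no tensor has a zero gradient.

```python
import numpy as np

def clip_global(grads, max_norm):
    # One scale factor for all tensors, based on the norm of the full gradient
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / total)
    return [g * scale for g in grads]

def clip_per_tensor(grads, max_norm):
    # A separate scale factor for each tensor
    return [g * min(1.0, max_norm / np.linalg.norm(g)) for g in grads]

def cosine(a_list, b_list):
    a, b = np.concatenate(a_list), np.concatenate(b_list)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

grads = [np.array([30.0, 40.0]), np.array([0.3, 0.4])]  # norms 50 and 0.5
cos_global = cosine(grads, clip_global(grads, 1.0))
cos_per_tensor = cosine(grads, clip_per_tensor(grads, 1.0))

print(f"global clip:     cosine with original gradient = {cos_global:.4f}")
print(f"per-tensor clip: cosine with original gradient = {cos_per_tensor:.4f}")
```

Global clipping keeps the cosine at 1, while per-tensor clipping shrinks only the large tensor and tilts the overall update toward the small one.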

Quick Reference

Concept	Update Rule	Key Hyperparameter
Vanilla GD	θ ← θ - α∇L	α (learning rate)
Mini-batch SGD	θ ← θ - α∇̃L	α, batch size B
Momentum	v = βv + (1-β)g, θ ← θ - αv	β (typically 0.9)
Nesterov	v = βv + (1-β)g_look, θ ← θ - αv	β
Step decay	α_t = α₀ · γ^⌊t/k⌋	γ, k
Cosine annealing	α_t = α_min + 0.5(α₀-α_min)(1+cos(πt/T))	T_max, α_min
Warmup	linear ramp for first W steps	W, peak lr
Gradient clipping	g ← g · min(1, C/‖g‖)	C (max_norm)

Key Takeaways

Gradient descent derives from the 1st-order Taylor approximation - we step in the direction of steepest descent
The learning rate controls step size: too large diverges, too small is impractically slow
Mini-batch gradient descent balances accuracy, memory, and GPU efficiency - the standard for deep learning
Momentum accumulates gradient history to smooth oscillations in narrow valleys, accelerating convergence
Learning rate schedules (cosine annealing with warmup) are nearly universal in modern deep learning and significantly affect final model quality
Gradient clipping is essential for RNNs and helpful in early transformer training to prevent catastrophic updates

Next: Convex Functions and Optimization →

