Gradient Descent Mechanics - The Engine of Every Training Loop
Reading time: ~26 minutes | Level: Mathematical Foundations → ML Engineering
Every time you run a training loop - whether linear regression on 1,000 samples or a transformer on 1 trillion tokens - the same engine is turning underneath: gradient descent.
The parameter update is θ ← θ - α∇L(θ). Six symbols. But each decision embedded in those six symbols - how to compute the gradient (full batch? mini-batch? single sample?), how large α should be, whether to add momentum, when to decay α - determines whether your model converges in hours or diverges in minutes.
This lesson builds gradient descent from first principles and explains every mechanic that determines training success.
What You Will Learn
- Gradient descent derivation from Taylor expansion
- Why the gradient is the right direction to move
- Learning rate: geometric interpretation, too high vs. too low behavior
- Convergence conditions and guarantees
- Batch GD vs. mini-batch GD vs. SGD trade-off analysis
- Momentum: escaping poor curvature
- Learning rate schedules: step decay, exponential, cosine, warmup
- From-scratch Python implementation of each variant
Prerequisites
- Lesson 01: Derivatives and Gradients (required)
- Lesson 02: Chain Rule and Backpropagation (recommended)
- NumPy and basic Python
Part 1 - Gradient Descent Derivation
What we are trying to solve
Given a loss function L(θ) where θ ∈ ℝⁿ, find:
$$
\theta^* = \arg\min_{\theta} L(\theta)
$$
If we could solve this analytically (set ∇L = 0 and solve), we would. For most ML problems, this is intractable. Instead, we use iterative refinement.
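Linear regression is one of the few cases where the analytic route does work, which makes it a useful reference point. The sketch below uses hypothetical toy data (the sizes, weights, and noise level are assumptions for illustration) and solves ∇L = 0 directly through the normal equations.

```python
import numpy as np

# Hypothetical toy data: 200 samples, 3 features, known weights plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=200)

# Setting the gradient of the mean squared error to zero gives the
# normal equations (X^T X) theta = X^T y, a linear system we can solve directly.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# At the solution the gradient (2/N) X^T (X theta - y) should vanish.
grad_at_solution = (2 / len(y)) * X.T @ (X @ theta_closed - y)

print("closed-form theta:", theta_closed.round(3))
print("gradient norm at solution:", np.linalg.norm(grad_at_solution))
```

For models with nonlinear layers no such linear system exists, which is why the rest of the lesson relies on iterative updates.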
First-order Taylor approximation
Near the current point θ₀, we approximate L using a Taylor expansion:
$$
L(\theta_0 + \Delta\theta) \approx L(\theta_0) + \nabla L(\theta_0)^\top \Delta\theta
$$
We want to move to a new point θ₀ + Δθ that decreases L. We need ∇L(θ₀)ᵀ Δθ < 0.
The inner product ∇L · Δθ = ‖∇L‖ · ‖Δθ‖ · cos(φ). For a fixed step length ‖Δθ‖, this is most negative when cos(φ) = -1, i.e. when Δθ points exactly opposite to ∇L: Δθ = -α∇L(θ₀) with α > 0.
This gives the gradient descent update:
$$
\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)
$$
Gradient descent therefore steps in the direction of steepest descent under the first-order Taylor approximation.
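The steepest-descent claim can be checked numerically. The sketch below assumes a hypothetical two-parameter test function (not one from this lesson) and compares the loss decrease along -∇L with the decrease along many random unit directions of the same small length.

```python
import numpy as np

def loss(theta):
    # Hypothetical smooth test function, chosen only for illustration.
    return theta[0] ** 2 + 3.0 * theta[1] ** 2 + np.sin(theta[0])

def grad(theta):
    return np.array([2.0 * theta[0] + np.cos(theta[0]), 6.0 * theta[1]])

theta0 = np.array([1.0, -0.5])
g = grad(theta0)
step = 1e-3  # small enough for the first-order term to dominate

# Unit vector opposite to the gradient.
steepest = -g / np.linalg.norm(g)
best_decrease = loss(theta0) - loss(theta0 + step * steepest)

# Compare against many random unit directions with the same step length.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=1000)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
decreases = np.array([loss(theta0) - loss(theta0 + step * d) for d in directions])

print(f"decrease along -grad:        {best_decrease:.3e}")
print(f"best random-direction value: {decreases.max():.3e}")
```

Because the step is finite, a random direction can match the negative gradient up to a tiny second-order correction, but none should beat it by a meaningful margin.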
Geometric interpretation
L(θ) contour plot (two parameters):
```text
  θ₂
    │
3.0 ┤  ● ← initial θ₀
    │    ↘
2.0 ┤      ↘ ← step in -∇L direction
    │        ● θ₁
1.0 ┤          ↘
    │            ● θ₂
0.0 ┤              ↘
    │                ★ ← minimum θ*
    └──────┬──────┬──────── θ₁
           0      2
```
Each step follows the gradient downhill. The path curves because the gradient direction changes at each new position.
Learning rate effect
```python
import numpy as np

def gradient_descent_1d(
    f, grad_f, x0: float, lr: float,
    n_steps: int = 200, tol: float = 1e-8
) -> list:
    """1D gradient descent with convergence detection."""
    x = x0
    history = [{'x': x, 'f': f(x)}]
    for step in range(n_steps):
        grad = grad_f(x)
        if abs(grad) < tol:  # stationary point reached: stop before stepping
            break
        x = x - lr * grad
        history.append({'x': x, 'f': f(x)})
    return history

# Minimize f(x) = x^2 + 2x + 1 = (x+1)^2 (minimum at x=-1)
f = lambda x: x**2 + 2*x + 1
grad_f = lambda x: 2*x + 2

print("Effect of learning rate on convergence:")
print(f"{'lr':>6} | {'steps':>6} | {'final_x':>10} | {'final_loss':>12}")
print("-" * 45)
for lr in [0.001, 0.1, 0.9, 1.0, 1.1]:
    hist = gradient_descent_1d(f, grad_f, x0=5.0, lr=lr, n_steps=200)
    final = hist[-1]
    n_updates = len(hist) - 1  # history also stores the starting point
    status = "DIVERGED" if abs(final['f']) > 1e6 else f"{final['f']:.8f}"
    print(f"{lr:>6.3f} | {n_updates:>6} | {final['x']:>10.4f} | {status:>12}")
```
Part 2 - Convergence Theory
When does gradient descent converge?
For a convex function L with Lipschitz-continuous gradient (‖∇L(x) - ∇L(y)‖ ≤ L_const‖x-y‖), gradient descent with learning rate α ≤ 1/L_const converges to the global minimum.
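Convergence rate for strongly convex functions (with α = 1/L_const):
$$
L(\theta_t) - L(\theta^*) \le \left(1 - \frac{\mu}{L_{\text{const}}}\right)^{t} \left(L(\theta_0) - L(\theta^*)\right)
$$
where μ is the strong convexity constant. The error shrinks by a constant factor each step - linear convergence.
Safe learning rate range: 0 < α < 2/L_const. Beyond 2/L_const, gradient descent can diverge; on a quadratic, the component along the highest-curvature direction grows at every step.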
The condition number
The condition number κ = L_const/μ (max curvature / min curvature) determines how hard optimization is:
```text
κ = 1 (sphere):          κ = 100 (elongated ellipse):

   ○ ○ ○                 ─────────────────────────
  ○ ○★○ ○                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
   ○ ○ ○                 ─ ─ ─ ─ ─ ─★─ ─ ─ ─ ─ ─
                         ─────────────────────────
Fast convergence         Zigzag across narrow valley
```
```python
import numpy as np

# Demonstrate condition number effect on convergence speed
def optimize_quadratic(kappa: float, n_steps: int = 500) -> list:
    """
    Minimize f(θ) = 0.5 * (κ*θ₀² + θ₁²) with gradient descent.
    L_const = κ (largest curvature), μ = 1 (smallest curvature)
    Step size: α = 1/L_const = 1/κ (the stability limit is 2/κ)
    """
    theta = np.array([1.0, 1.0])
    lr = 1.0 / kappa  # α = 1/L_const, half of the stability limit 2/L_const
    losses = []
    for _ in range(n_steps):
        grad = np.array([kappa * theta[0], theta[1]])
        theta -= lr * grad
        loss = 0.5 * (kappa * theta[0]**2 + theta[1]**2)
        losses.append(loss)
    return losses

print(f"{'κ':>6} | {'steps to 0.01':>14} | {'convergence rate':>16}")
print("-" * 42)
for kappa in [1, 10, 100, 1000]:
    losses = optimize_quadratic(kappa, n_steps=5000)
    # losses[i] is the loss after i + 1 updates
    steps = next((i + 1 for i, l in enumerate(losses) if l < 0.01), 5000)
    rate = 1 - 1 / kappa  # per-step contraction of the slow coordinate θ₁ at α = 1/κ
    print(f"{kappa:>6} | {steps:>14} | {rate:>16.4f}")
```
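For a mean-squared-error linear regression loss, L_const and μ are not abstract: they are the largest and smallest eigenvalues of the constant Hessian (2/N)·XᵀX. The sketch below uses hypothetical data with deliberately mismatched feature scales (an assumption made to produce a large κ) and reads off the step-size limits from those eigenvalues.

```python
import numpy as np

# Hypothetical data with deliberately mismatched feature scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([1.0, 3.0, 10.0])

# For L(theta) = mean((X @ theta - y)**2) the Hessian is constant: H = (2/N) X^T X.
H = (2.0 / X.shape[0]) * X.T @ X
eigenvalues = np.linalg.eigvalsh(H)  # ascending order for symmetric matrices

mu = eigenvalues[0]        # smallest curvature
L_const = eigenvalues[-1]  # largest curvature
kappa = L_const / mu

print(f"mu = {mu:.3f}, L_const = {L_const:.3f}, kappa = {kappa:.1f}")
print(f"step size 1/L_const       = {1.0 / L_const:.5f}")
print(f"stability limit 2/L_const = {2.0 / L_const:.5f}")

# Standardizing the columns is one common way to shrink kappa.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eig_std = np.linalg.eigvalsh((2.0 / X_std.shape[0]) * X_std.T @ X_std)
kappa_std = eig_std[-1] / eig_std[0]
print(f"kappa after standardizing: {kappa_std:.2f}")
```

The drop in κ after standardizing is one reason feature scaling tends to make gradient descent easier to tune, though the exact numbers depend on the data.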
Part 3 - Batch, Mini-Batch, and Stochastic Gradient Descent
Full batch gradient descent
Compute the exact gradient over the entire dataset:
```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, n_epochs=100):
    """Full batch GD for linear regression."""
    N, d = X.shape
    theta = np.zeros(d)
    losses = []
    for _ in range(n_epochs):
        residuals = X @ theta - y
        gradient = (2/N) * X.T @ residuals
        theta -= lr * gradient
        losses.append(np.mean(residuals**2))  # loss before this epoch's update
    return theta, losses
```
Pros: Exact gradient, smooth loss curve, deterministic

Cons: Must fit entire dataset in memory, slow for large N
Stochastic gradient descent (SGD)
Compute the gradient on a single randomly chosen sample:
```python
def stochastic_gradient_descent(X, y, lr=0.01, n_epochs=10):
    """SGD: one sample at a time."""
    N, d = X.shape
    theta = np.zeros(d)
    losses = []
    for _ in range(n_epochs):
        for i in np.random.permutation(N):  # fresh shuffle every epoch
            xi, yi = X[i], y[i]
            grad = 2 * (xi @ theta - yi) * xi
            theta -= lr * grad
        losses.append(np.mean((X @ theta - y)**2))
    return theta, losses
```
Pros: Fast updates, noise naturally explores the loss surface and can help escape saddle points

Cons: Noisy gradient estimates, requires careful learning rate tuning
Mini-batch gradient descent (the standard)
Compute the gradient on a batch of B samples:
```python
def minibatch_gradient_descent(X, y, batch_size=32, lr=0.01, n_epochs=50):
    """Mini-batch GD - the standard for deep learning."""
    N, d = X.shape
    theta = np.zeros(d)
    losses = []
    for _ in range(n_epochs):
        idx = np.random.permutation(N)
        X_shuf, y_shuf = X[idx], y[idx]
        for start in range(0, N, batch_size):
            Xb = X_shuf[start:start+batch_size]
            yb = y_shuf[start:start+batch_size]
            # len(yb) handles a shorter final batch
            grad = (2/len(yb)) * Xb.T @ (Xb @ theta - yb)
            theta -= lr * grad
        losses.append(np.mean((X @ theta - y)**2))
    return theta, losses
```
Comparison
| Property | Full Batch | Mini-Batch | SGD |
|---|---|---|---|
| Gradient accuracy | Exact | Approximate | Very noisy |
| Memory | O(N) | O(B) | O(1) |
| GPU efficiency | Good | Best | Poor |
| Convergence noise | None | Low | High |
| Saddle point escape | Poor | Good | Best |
| Typical batch size | N | 32–4096 | 1 |
```python
# Verify all three converge to the same solution
np.random.seed(42)
N, d = 500, 3
X = np.random.randn(N, d)
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + np.random.randn(N) * 0.1

theta_batch, _ = batch_gradient_descent(X, y, lr=0.1, n_epochs=200)
theta_mini, _ = minibatch_gradient_descent(X, y, batch_size=32, lr=0.05, n_epochs=50)
theta_sgd, _ = stochastic_gradient_descent(X, y, lr=0.01, n_epochs=30)

print("True theta:       ", true_theta)
print("Batch GD theta:   ", theta_batch.round(4))
print("Mini-batch theta: ", theta_mini.round(4))
print("SGD theta:        ", theta_sgd.round(4))
```
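The "Gradient accuracy" row of the comparison table can be made quantitative. The sketch below, on hypothetical regression data, measures how far a mini-batch gradient lands from the full-batch gradient at a fixed θ; the helper names `mse_grad` and `mean_sq_error_of_estimate` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 2000, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
theta = np.zeros(d)  # evaluate every gradient at the same fixed point

def mse_grad(Xb, yb, theta):
    return (2.0 / len(yb)) * Xb.T @ (Xb @ theta - yb)

full_grad = mse_grad(X, y, theta)

def mean_sq_error_of_estimate(batch_size, n_trials=2000):
    """Average squared distance between a mini-batch gradient and the full gradient."""
    errors = []
    for _ in range(n_trials):
        idx = rng.choice(N, size=batch_size, replace=False)
        errors.append(np.sum((mse_grad(X[idx], y[idx], theta) - full_grad) ** 2))
    return float(np.mean(errors))

noise = {B: mean_sq_error_of_estimate(B) for B in [1, 8, 64, 512]}
for B, err in noise.items():
    print(f"B = {B:4d}: mean squared gradient error = {err:.4f}")
```

The error should fall roughly in proportion to 1/B, which is why doubling the batch size buys progressively less accuracy per unit of compute.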
:::tip Batch size and generalization
Larger batch sizes often train faster in wall-clock time but can generalize worse. Smaller batches produce noisier gradients that act as implicit regularization, steering toward flat minima that tend to generalize better. Start with batch size 256 or 512; if generalization is poor, try 32 or 64.
:::
Part 4 - Momentum
Vanilla gradient descent oscillates in narrow valleys (high condition number). Momentum accumulates gradient history to smooth oscillations.
Intuition: ball rolling downhill
Without momentum: a ball placed on a narrow slope bounces back and forth across the valley while inching forward.
With momentum: the ball builds up speed in the consistent downhill direction, smoothing out the oscillations.
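Update rule
$$
v_t = \beta v_{t-1} + (1-\beta)\nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \alpha v_t
$$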
- v is the velocity (exponential moving average of gradients)
- β (typically 0.9) controls how much history to retain
- Oscillating gradients (changing sign each step) cancel out in the EMA
- Consistent gradients (same sign each step) accumulate constructively
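The last two bullets can be seen directly by feeding two synthetic gradient sequences through the EMA. This is a minimal sketch with scalar gradients; the helper `ema` and the sequences are assumptions made for illustration.

```python
import numpy as np

def ema(gradients, beta=0.9):
    """Velocity after feeding a sequence of scalar gradients through the EMA."""
    v = 0.0
    for g in gradients:
        v = beta * v + (1 - beta) * g
    return v

n = 50
oscillating = np.array([(-1.0) ** k for k in range(n)])  # +1, -1, +1, ...
consistent = np.ones(n)                                  # +1, +1, +1, ...

v_osc = ema(oscillating)
v_con = ema(consistent)
print(f"velocity after oscillating gradients: {v_osc:+.4f}")
print(f"velocity after consistent gradients:  {v_con:+.4f}")
```

With β = 0.9 the oscillating sequence settles at a magnitude of about (1-β)/(1+β) ≈ 0.05, while the consistent sequence approaches 1, the full gradient magnitude.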
```python
import numpy as np

def sgd_with_momentum(
    grad_fn, theta0: np.ndarray,
    lr: float = 0.01, beta: float = 0.9, n_steps: int = 200
) -> list:
    """SGD with momentum."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    history = [theta.copy()]
    for _ in range(n_steps):
        grad = grad_fn(theta)
        v = beta * v + (1 - beta) * grad
        theta = theta - lr * v
        history.append(theta.copy())
    return history

# Demonstrate momentum on elongated (high κ) loss surface
# L(θ) = 10*θ₀² + 0.1*θ₁² (κ = 100)
def elongated_grad(theta):
    return np.array([20.0 * theta[0], 0.2 * theta[1]])

def elongated_loss(theta):
    return 10*theta[0]**2 + 0.1*theta[1]**2

theta0 = np.array([5.0, 5.0])
n_steps = 1000  # enough for vanilla GD to reach the target as well

# Vanilla GD: lr must stay below 2/20 = 0.1 to be stable along the steep θ₀ axis
theta_v = theta0.copy()
loss_v = []
for _ in range(n_steps):
    theta_v -= 0.04 * elongated_grad(theta_v)
    loss_v.append(elongated_loss(theta_v))

# Momentum GD: with the (1 - β) scaling, the same lr gives the same steady-state
# step as vanilla GD, so the gain comes from tolerating a larger lr without diverging
hist_m = sgd_with_momentum(elongated_grad, theta0, lr=0.5, beta=0.9, n_steps=n_steps)
loss_m = [elongated_loss(t) for t in hist_m[1:]]  # skip the initial point

# Find steps to reach loss < 0.01
steps_v = next((i + 1 for i, l in enumerate(loss_v) if l < 0.01), n_steps)
steps_m = next((i + 1 for i, l in enumerate(loss_m) if l < 0.01), n_steps)
print(f"Vanilla GD: {steps_v} steps to reach loss < 0.01")
print(f"Momentum GD: {steps_m} steps to reach loss < 0.01")
print(f"Speedup: {steps_v / max(steps_m, 1):.1f}x")
```
Nesterov momentum (lookahead)
Computes the gradient at the approximate next position before updating:
```python
def sgd_nesterov(
    grad_fn, theta0: np.ndarray,
    lr: float = 0.01, beta: float = 0.9, n_steps: int = 200
) -> list:
    """SGD with Nesterov momentum (lookahead)."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    history = [theta.copy()]
    for _ in range(n_steps):
        # Look ahead to where the carried-over velocity alone would move theta.
        # The update below is theta - lr * v, so the lookahead uses the same lr scaling.
        theta_look = theta - lr * beta * v
        grad = grad_fn(theta_look)  # gradient at lookahead
        v = beta * v + (1 - beta) * grad
        theta = theta - lr * v
        history.append(theta.copy())
    return history
```
Nesterov often converges slightly faster than vanilla momentum, particularly for convex problems, because it "corrects" the overshoot by computing the gradient at the point where the parameters will be after the momentum step, rather than at the current point.
Part 5 - Learning Rate Schedules
A fixed learning rate is often suboptimal:
- Large early steps: fast initial convergence
- Small late steps: precise final convergence
Learning rate schedules reduce the learning rate over time to get the best of both.
Step decay
Drop the learning rate by factor γ every k epochs:
```python
def step_decay(initial_lr, decay_factor, step_size, epoch):
    return initial_lr * (decay_factor ** (epoch // step_size))

# lr: 0.1 → 0.05 → 0.025 → 0.0125 every 20 epochs
lrs = [step_decay(0.1, 0.5, 20, e) for e in range(80)]
for e in [0, 20, 40, 60]:
    print(f"Epoch {e}: lr = {lrs[e]:.4f}")
```
Exponential decay
```python
import numpy as np

def exponential_decay(initial_lr, decay_rate, step):
    return initial_lr * np.exp(-decay_rate * step)
```
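One way to pick `decay_rate` is to work backwards from a half-life: the number of steps after which the learning rate should have halved. The sketch below assumes a hypothetical half-life of 1,000 steps; the function is repeated so the snippet runs on its own.

```python
import numpy as np

def exponential_decay(initial_lr, decay_rate, step):
    return initial_lr * np.exp(-decay_rate * step)

half_life = 1000                    # hypothetical choice, in steps
decay_rate = np.log(2) / half_life  # solves exp(-decay_rate * half_life) = 0.5

for step in [0, 1000, 2000, 3000]:
    print(f"step {step:4d}: lr = {exponential_decay(0.1, decay_rate, step):.4f}")
```

Each additional half-life halves the learning rate again, so the schedule never reaches zero but becomes very small after a handful of half-lives.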
Cosine annealing (modern standard)
Follows a cosine curve from initial_lr to min_lr over T_max steps:
```python
import numpy as np

def cosine_annealing(initial_lr, min_lr, t_max, step):
    """Smooth cosine decay from initial_lr to min_lr."""
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + np.cos(np.pi * step / t_max))

T_max = 100
# Include step T_max itself so the last entry is exactly min_lr
schedule = [cosine_annealing(0.1, 0.0001, T_max, t) for t in range(T_max + 1)]
print(f"Initial lr: {schedule[0]:.4f}")
print(f"Halfway: {schedule[50]:.5f}")
print(f"Final lr: {schedule[-1]:.6f}")
```
Cosine annealing with warm restarts (SGDR)
Periodically reset the learning rate to allow cyclical exploration:
```python
def cosine_warm_restarts(initial_lr, min_lr, t_0, t_mult, step):
    """Cosine annealing with warm restarts. Period grows after each restart."""
    t = t_0
    while step >= t:  # skip over completed cycles
        step -= t
        t *= t_mult
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + np.cos(np.pi * step / t))
```
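To see where the restarts land, the sketch below evaluates the schedule over 70 steps with illustrative settings (a first cycle of 10 steps and a period multiplier of 2, both assumptions) and reports every step at which the learning rate jumps back up. The function is repeated so the snippet is self-contained.

```python
import numpy as np

def cosine_warm_restarts(initial_lr, min_lr, t_0, t_mult, step):
    # Same logic as the function above, repeated so this snippet runs on its own.
    t = t_0
    while step >= t:
        step -= t
        t *= t_mult
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + np.cos(np.pi * step / t))

# Illustrative settings: first cycle lasts 10 steps, each later cycle is twice as long.
lrs = [cosine_warm_restarts(0.1, 0.001, t_0=10, t_mult=2, step=s) for s in range(70)]

# A restart is any step where the learning rate rises relative to the previous step.
restarts = [s for s in range(1, 70) if lrs[s] > lrs[s - 1]]
print("restart steps:", restarts)
```

The cycles cover steps 0-9, 10-29, and 30-69, so the restarts fall at steps 10 and 30.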
Linear warmup + cosine decay (transformer standard)
```python
import numpy as np

def warmup_cosine_schedule(initial_lr, warmup_steps, total_steps, min_lr, current_step):
    """
    Linear warmup then cosine decay.
    Near-universal for transformer training (BERT, GPT, LLaMA, etc.)
    """
    if current_step < warmup_steps:
        return initial_lr * (current_step / warmup_steps)
    progress = (current_step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + np.cos(np.pi * progress))

# Typical transformer training schedule
print("Warmup + cosine schedule sample:")
for step in [0, 200, 500, 2500, 9500]:
    # Evaluate the schedule at the step being printed so label and value match
    lr = warmup_cosine_schedule(3e-4, 500, 10000, 1e-6, step)
    print(f" Step {step:5d}: lr = {lr:.2e}")
```
:::tip Why warmup matters for transformers
At the start of training, parameters are random and gradients are poorly estimated. Without warmup, a large initial learning rate causes catastrophic updates in the first few hundred steps. Warmup starts with small learning rates (safe early updates) and ramps up after gradients become more reliable.
:::
Part 6 - Gradient Clipping
Exploding gradients - common in RNNs and sometimes in early transformer training - cause parameters to jump wildly. Gradient clipping bounds the gradient norm before the update.
$$
g \leftarrow g \cdot \min\left(1, \frac{C}{\|g\|_2}\right)
$$
If ‖g‖₂ ≤ C: no change. If ‖g‖₂ > C: scale to have norm C (preserving direction).
```python
import numpy as np

def clip_gradient_norm(grad: np.ndarray, max_norm: float) -> tuple:
    """
    Clip gradient to have norm at most max_norm.
    Returns clipped gradient and original norm.
    """
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad, norm

# Example: large gradient that would cause instability
grad = np.array([50.0, -30.0, 80.0])
print(f"Original grad norm: {np.linalg.norm(grad):.2f}")
clipped, original_norm = clip_gradient_norm(grad, max_norm=5.0)
print(f"Clipped grad norm: {np.linalg.norm(clipped):.2f}")
print(f"Direction preserved: {np.allclose(clipped/np.linalg.norm(clipped), grad/original_norm)}")
```
```python
import torch

# PyTorch built-in gradient clipping
def train_step(model, optimizer, loss):
    optimizer.zero_grad()
    loss.backward()
    # Clip before optimizer step
    grad_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(),
        max_norm=1.0
    )
    optimizer.step()
    return grad_norm.item()
```
Part 7 - Complete Training Loop
```python
import numpy as np
from typing import Callable, Optional

class SGDOptimizer:
    """
    Mini-batch SGD with momentum, learning rate schedule, and gradient clipping.
    """
    def __init__(
        self,
        lr: float = 0.01,
        momentum: float = 0.9,
        clip_norm: Optional[float] = None,
        lr_schedule: Optional[Callable] = None,
    ):
        self.lr = lr
        self.momentum = momentum
        self.clip_norm = clip_norm
        self.lr_schedule = lr_schedule
        self.velocity = None
        self.step_count = 0

    def step(self, theta: np.ndarray, grad: np.ndarray) -> np.ndarray:
        """Apply one gradient descent step."""
        if self.velocity is None:
            self.velocity = np.zeros_like(theta)
        # Get current learning rate
        current_lr = self.lr_schedule(self.step_count) if self.lr_schedule else self.lr
        # Clip gradient
        if self.clip_norm is not None:
            norm = np.linalg.norm(grad)
            if norm > self.clip_norm:
                grad = grad * (self.clip_norm / norm)
        # Momentum update
        self.velocity = self.momentum * self.velocity + (1 - self.momentum) * grad
        theta = theta - current_lr * self.velocity
        self.step_count += 1
        return theta

def train(
    X: np.ndarray,
    y: np.ndarray,
    batch_size: int = 32,
    n_epochs: int = 50,
    lr: float = 0.1
):
    """Complete training loop with mini-batch SGD."""
    N, d = X.shape
    theta = np.zeros(d)
    # Cosine annealing schedule (cosine_annealing is defined in Part 5).
    # Round up so the partial final batch of each epoch is counted; otherwise
    # the step count overshoots t_max and the cosine starts rising again.
    steps_per_epoch = int(np.ceil(N / batch_size))
    total_steps = steps_per_epoch * n_epochs
    schedule = lambda s: cosine_annealing(lr, 1e-5, total_steps, s)
    optimizer = SGDOptimizer(lr=lr, momentum=0.9, clip_norm=5.0, lr_schedule=schedule)
    for epoch in range(n_epochs):
        idx = np.random.permutation(N)
        Xs, ys = X[idx], y[idx]
        epoch_loss = 0.0
        n_batches = 0
        for start in range(0, N, batch_size):
            Xb = Xs[start:start+batch_size]
            yb = ys[start:start+batch_size]
            B = len(yb)
            preds = Xb @ theta
            residuals = preds - yb
            grad = (2/B) * Xb.T @ residuals
            theta = optimizer.step(theta, grad)
            epoch_loss += np.mean(residuals**2)
            n_batches += 1
        if epoch % 10 == 0:
            avg_loss = epoch_loss / n_batches
            print(f"Epoch {epoch:3d}: loss = {avg_loss:.6f}, "
                  f"lr = {schedule(optimizer.step_count):.2e}")
    return theta

# Demo
np.random.seed(42)
N, d = 1000, 5
X = np.random.randn(N, d)
true_theta = np.array([1.0, -2.0, 0.5, 0.3, -1.5])
y = X @ true_theta + np.random.randn(N) * 0.5
final_theta = train(X, y, batch_size=64, n_epochs=50, lr=0.1)
print(f"\nTrue theta: {true_theta}")
print(f"Found theta: {final_theta.round(4)}")
```
Part 8 - Engineering Red Flags
:::danger Learning rate too high: NaN loss
If your loss jumps to NaN or inf in the first steps, your learning rate is almost certainly too high. Reduce by 10x until stable.
```python
# Diagnosis pattern (run_few_steps is a placeholder for your own short training run)
for lr in [1e-1, 1e-2, 1e-3, 1e-4]:
    loss = run_few_steps(model, lr=lr)
    status = "DIVERGED" if (np.isnan(loss) or np.isinf(loss)) else f"{loss:.4f}"
    print(f"lr={lr}: {status}")
# Use ~1/3 of the largest non-diverging lr
```
:::
:::warning Not shuffling in SGD
Iterating through data in fixed order causes systematic gradient bias. Always shuffle at the start of each epoch.
```python
# WRONG: fixed order
for i in range(N):
    update(X[i], y[i])

# RIGHT: shuffle each epoch
for i in np.random.permutation(N):
    update(X[i], y[i])
```
:::
:::warning Learning rate schedule not reset between experiments
If you reuse an optimizer object across multiple experiments without resetting velocity and step count, the schedule is wrong for the second run.
:::
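The stale-state problem is easy to reproduce. The sketch below uses a hypothetical minimal optimizer, `MomentumState`, with an explicit `reset()` method (both are assumptions for illustration, not part of the `SGDOptimizer` class in Part 7) and compares the first step of a second experiment with and without a reset.

```python
import numpy as np

class MomentumState:
    """Hypothetical minimal optimizer holding the two pieces of state that go stale."""
    def __init__(self, lr=0.1, beta=0.9):
        self.lr, self.beta = lr, beta
        self.reset()

    def reset(self):
        self.velocity = None
        self.step_count = 0

    def step(self, theta, grad):
        if self.velocity is None:
            self.velocity = np.zeros_like(theta)
        self.velocity = self.beta * self.velocity + (1 - self.beta) * grad
        self.step_count += 1
        return theta - self.lr * self.velocity

grad_fn = lambda theta: 2.0 * theta  # gradient of ||theta||^2
opt = MomentumState()

# Experiment 1: run 20 steps, leaving velocity and step_count behind.
theta = np.array([1.0, 1.0])
for _ in range(20):
    theta = opt.step(theta, grad_fn(theta))

# Experiment 2: first step from the same start, without and with a reset.
start = np.array([1.0, 1.0])
stale_first = opt.step(start, grad_fn(start))
opt.reset()
clean_first = opt.step(start, grad_fn(start))

print("first step with stale state:", stale_first.round(4))
print("first step after reset():   ", clean_first.round(4))
```

Constructing a fresh optimizer per experiment achieves the same thing and is often the simpler habit.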
:::tip Learning rate finder
Before committing to a training run, sweep the learning rate from 1e-7 to 10 over ~100 steps. Plot loss vs. learning rate. The optimal lr is just before the loss starts increasing sharply. Use ~10% of that value for safe training.
:::
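A learning rate sweep of this kind might look like the following on a small linear regression problem. This is a sketch under stated assumptions: the data are synthetic, `lr_range_test` is a hypothetical helper, and the lowest recorded loss is used as a stand-in for "just before the loss starts increasing sharply".

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def lr_range_test(X, y, lr_min=1e-7, lr_max=10.0, n_steps=100, batch_size=64):
    """Raise lr geometrically each step and record the mini-batch loss."""
    lrs = np.geomspace(lr_min, lr_max, n_steps)
    theta = np.zeros(X.shape[1])
    losses = []
    for lr in lrs:
        idx = rng.choice(len(y), size=batch_size, replace=False)
        residuals = X[idx] @ theta - y[idx]
        loss = float(np.mean(residuals ** 2))
        losses.append(loss)
        if not np.isfinite(loss) or loss > 1e3 * losses[0]:
            break  # stop once the loss has clearly blown up
        theta -= lr * (2.0 / batch_size) * X[idx].T @ residuals
    return lrs[: len(losses)], np.array(losses)

lrs, losses = lr_range_test(X, y)
best = int(np.argmin(losses))
print(f"loss is lowest near lr = {lrs[best]:.3g}")
print(f"10% heuristic from the tip: lr = {0.1 * lrs[best]:.3g}")
```

Mini-batch losses are noisy, so in practice the curve is usually smoothed before reading off a value.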
Interview Questions
Q1: Derive gradient descent from first principles.
Setup: Minimize L(θ). We are at θ_t and want to find direction Δθ to decrease L.
Step 1: 1st-order Taylor approximation:
$$
L(\theta_t + \Delta\theta) \approx L(\theta_t) + \nabla L(\theta_t)^\top \Delta\theta
$$
Step 2: To decrease L, we need ∇L(θ_t)ᵀ Δθ < 0.
Step 3: The inner product ∇L · Δθ = ‖∇L‖ · ‖Δθ‖ · cos(φ). Most negative when φ = 180°, i.e., Δθ = -α∇L(θ_t).
Result: θ_{t+1} = θ_t - α∇L(θ_t)
Key caveat: Valid only for small α. Large α steps outside the region where the Taylor approximation holds, potentially causing divergence.
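The caveat can be made concrete by comparing the first-order prediction with the true loss after a step of size α. The sketch below assumes a hypothetical quartic test function chosen only because its curvature grows away from the origin.

```python
import numpy as np

def loss(theta):
    # Hypothetical non-quadratic test function for illustration only.
    return np.sum(theta ** 4) + np.sum(theta ** 2)

def grad(theta):
    return 4.0 * theta ** 3 + 2.0 * theta

theta_t = np.array([1.0, -2.0])
g = grad(theta_t)

print(f"{'alpha':>8} | {'predicted':>12} | {'actual':>12}")
results = {}
for alpha in [1e-3, 1e-2, 1e-1, 1.0]:
    delta = -alpha * g
    predicted = loss(theta_t) + g @ delta  # first-order Taylor estimate
    actual = loss(theta_t + delta)         # true loss after the step
    results[alpha] = (predicted, actual)
    print(f"{alpha:>8.3f} | {predicted:>12.3f} | {actual:>12.3f}")
```

For the smallest α the prediction and the true loss nearly coincide; at α = 1 the linear model predicts a large decrease while the true loss ends up far above its starting value.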
Q2: What are the trade-offs between batch size choices?
| Aspect | Small batch (1-32) | Large batch (256-4096) |
|---|---|---|
| Gradient variance | High (noisy) | Low (accurate) |
| Memory | Low | High |
| GPU utilization | Poor | Excellent |
| Steps per epoch | Many | Few |
| Generalization | Often better | Often worse |
| Saddle point escape | Good | Poor |
Why small batch often generalizes better: Noisy gradients act as implicit regularization, steering optimization toward flat minima (wider valleys in the loss landscape). Flat minima tend to generalize better because small perturbations in parameters do not dramatically increase loss. Large-batch training tends to converge to sharper minima that are more sensitive to perturbations (Keskar et al., 2017).
Q3: Explain momentum and when it helps.
Momentum adds a velocity term: v_t = β·v_{t-1} + (1-β)·∇L(θ_t), then θ ← θ - α·v_t.
When it helps: high condition number (elongated) loss surfaces
In a narrow valley, the gradient perpendicular to the valley direction oscillates (changes sign each step). The EMA averages out these oscillations (they cancel due to sign flipping). The gradient along the valley is consistent in direction and accumulates constructively. Effect: faster progress along the valley, less oscillation across it.
Hyperparameter: β = 0.9 is the standard default. Higher β (0.99) means more history but slower response to gradient changes.
Nesterov variation: Compute the gradient at the "lookahead" position (where momentum would take you). This gives slightly better convergence in theory and marginally better empirical results.
Q4: Why is cosine annealing preferred over step decay in modern deep learning?
Step decay drops the learning rate abruptly (e.g., multiply by 0.1 every 30 epochs):
- Requires per-model tuning of the step epochs
- Abrupt drop can cause temporary loss spikes
- After the final drop, the learning rate may be too small to escape poor local optima
Cosine annealing follows a smooth cosine curve:
- Smooth - no discontinuous drops
- Principled - starts fast, slows as training progresses
- Works across architectures without per-model tuning
- With warm restarts (SGDR), enables cyclical exploration of the loss landscape
For transformers specifically: Linear warmup + cosine decay is the near-universal standard. Warmup prevents large destructive updates early in training when gradients are poorly estimated. This combination is used in BERT, GPT, LLaMA, and virtually every modern language model.
Q5: What is gradient clipping and when should you use it?
Gradient clipping bounds the gradient norm before the parameter update:
```python
if grad_norm > max_norm:
    grad = grad * (max_norm / grad_norm)
```
This scales all gradient components down proportionally, preserving direction but bounding magnitude.
When to use it:
- RNNs and LSTMs: Long sequence backprop multiplies many Jacobians together, causing exponential gradient growth
- Early training: Random initialization can produce large initial gradients
- Non-convex loss with cliffs: The loss landscape can have sudden sharp regions
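How to set max_norm: Common choices are 1.0 (transformers) or 5.0 (RNNs). Monitor the pre-clipping gradient norm: if it exceeds your threshold only during occasional spikes, the threshold is doing its job; if nearly every step is clipped, the threshold is rescaling every update and behaves like a smaller learning rate; if it almost never triggers, you may not need clipping.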
Global norm vs. per-parameter: Always use global norm clipping (normalize the total gradient across all parameters). Per-parameter clipping distorts the gradient direction.
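The direction argument can be checked with a three-component gradient. The sketch below compares global norm clipping with element-wise value clipping, which is one common form of per-parameter clipping (treating it as the form meant here is an assumption); `cosine_similarity` is a helper introduced for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

grad = np.array([50.0, -30.0, 0.5])
C = 1.0

# Global norm clipping: one shared scale factor for every component.
norm_clipped = grad * min(1.0, C / np.linalg.norm(grad))

# Value clipping: each component is clamped to [-C, C] independently.
value_clipped = np.clip(grad, -C, C)

cos_norm = cosine_similarity(grad, norm_clipped)
cos_value = cosine_similarity(grad, value_clipped)
print("norm-clipped :", norm_clipped.round(4), "cos =", round(cos_norm, 4))
print("value-clipped:", value_clipped.round(4), "cos =", round(cos_value, 4))
```

Norm clipping keeps the cosine similarity with the original gradient at 1, while value clipping inflates the relative weight of the small third component and tilts the update direction.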
Quick Reference
| Concept | Update Rule | Key Hyperparameter |
|---|---|---|
| Vanilla GD | θ ← θ - α∇L | α (learning rate) |
| Mini-batch SGD | θ ← θ - α∇̃L | α, batch size B |
| Momentum | v = βv + (1-β)g, θ ← θ - αv | β (typically 0.9) |
| Nesterov | v = βv + (1-β)g_look, θ ← θ - αv | β |
| Step decay | α_t = α₀ · γ^⌊t/k⌋ | γ, k |
| Cosine annealing | α_t = α_min + 0.5(α₀-α_min)(1+cos(πt/T)) | T_max, α_min |
| Warmup | linear ramp for first W steps | W, peak lr |
| Gradient clipping | g ← g · min(1, C/‖g‖) | C (max_norm) |
Key Takeaways
- Gradient descent derives from the 1st-order Taylor approximation - we step in the direction of steepest descent
- The learning rate controls step size: too large diverges, too small is impractically slow
- Mini-batch gradient descent balances accuracy, memory, and GPU efficiency - the standard for deep learning
- Momentum accumulates gradient history to smooth oscillations in narrow valleys, accelerating convergence
- Learning rate schedules (cosine annealing with warmup) are nearly universal in modern deep learning and significantly affect final model quality
- Gradient clipping is essential for RNNs and helpful in early transformer training to prevent catastrophic updates
