Backpropagation From Scratch
The Production Scenario
It is 2 AM on a Tuesday. The model has been training for four hours and the loss curve is doing something you have never seen: it decreases smoothly for the first 2,000 steps, then begins oscillating with increasing amplitude - not the usual LR-too-high chaos, but a slow, growing wobble that suggests a specific layer is destabilizing. You print gradient norms. Layer 1 is fine. Layer 2 is fine. The third linear layer has a gradient norm of 847. The fourth has a norm of 0.0003. Something is wrong with the backward pass.
Your colleague implemented a custom attention variant last week and changed the backward method of a custom autograd Function. You pull up the code. The forward pass looks correct - the shapes are right, the computation matches the paper. But the backward pass has a subtle shape transposition error: where it should multiply by the Jacobian transpose, it multiplies by the Jacobian directly. The gradients are wrong. The loss appears to decrease at first because the error is small when weights are small, then compounds as weights grow.
To even understand this bug, you need to know what a Jacobian is in this context, why the transpose matters, what the expected shapes should be, and how to verify it numerically in under five minutes. You need to understand not just that backprop "applies the chain rule" but exactly how that chain rule translates into matrix operations at every layer.
This is not a theoretical exercise. In production ML engineering, you will write custom layers. You will implement paper algorithms that PyTorch does not support natively. You will inherit code from colleagues who understood backprop at a surface level. The engineers who can read a backward pass, spot the bug, and verify the fix numerically - those are the engineers who debug in hours instead of days.
This lesson covers everything you need to be one of them.
Why Backpropagation Exists
Before backpropagation, training multilayer neural networks was largely considered intractable. The problem is not computing a single gradient - that is just calculus. The problem is computing gradients for a million parameters efficiently enough that training finishes in a reasonable time.
The naive approach: for each parameter , perturb it slightly, run a full forward pass, measure how the loss changes, divide by the perturbation size. This gives . Repeat for every parameter. If you have 100 million parameters, you need 100 million forward passes per training step. A forward pass through a large model takes seconds. This is completely intractable.
The early 1980s alternative: restrict to single-layer networks where gradients could be computed by hand in closed form. This severely limited the representational power of neural networks and is one reason the first "AI winter" happened - researchers could not train networks deep enough to solve hard problems.
The key insight of backprop: all the partial derivatives share enormous amounts of intermediate computation. The gradient of the loss with respect to layer 1's weights depends on the gradient flowing back through layers 4, 3, and 2. If you compute that backward flow once and reuse it for every weight in layer 1, you avoid redundant computation. By traversing the computation graph in reverse and accumulating these reusable intermediate gradients, you can compute all partial derivatives in a single backward pass - the same cost as a single forward pass.
Rumelhart, Hinton, and Williams formalized this in their landmark 1986 Nature paper, though the core ideas appear earlier in Werbos (1974). The practical impact was enormous: suddenly networks with multiple hidden layers were trainable, and the door to deep learning opened.
Historical Context
David Rumelhart, Geoffrey Hinton, Ronald Williams - 1986. The paper "Learning Representations by Back-propagating Errors" in Nature is the canonical reference. Hinton later described the key insight as recognizing that you could efficiently compute derivatives through composed functions using dynamic programming - computing each intermediate gradient exactly once and reusing it.
Paul Werbos - 1974. Werbos derived a general method for computing gradients through composite functions in his Harvard PhD thesis, predating the 1986 paper. The algorithm was essentially the same but the ML community largely missed it at the time.
The autograd revolution - 2007–2017. The jump from "implement backprop by hand per architecture" to "define a forward pass, get gradients for free" came from automatic differentiation (autodiff) research. Theano (2007), Torch (2011), and then PyTorch (2016) and TensorFlow (2015) made this practical. The key idea: build a computation graph during the forward pass, traverse it in reverse during the backward pass. This is what every modern deep learning framework does.
The "aha moment" in modern terms: the computation graph approach decouples the forward computation from the backward computation. Each operation only needs to know its local backward rule. The autograd engine chains these local rules together automatically.
The Chain Rule: From Scalar to Matrix
Scalar Chain Rule
The scalar chain rule is taught in every calculus course. If and , then:
This says: the rate of change of with respect to equals the rate of change of with respect to the intermediate value , times the rate of change of with respect to . The chain decomposes the path from to into local steps.
For a deeper chain :
Backprop computes this product from right to left (or equivalently, accumulates the product from left to right starting from ).
Vector Chain Rule and Jacobians
When inputs and outputs are vectors, derivatives become Jacobian matrices. If with and , the Jacobian is:
The chain rule in vector form: if and , then:
This is a matrix product - Jacobian times Jacobian. In principle, this is how backprop works. In practice, forming explicit Jacobian matrices is computationally catastrophic.
Why We Never Form Full Jacobians
Consider a linear layer with . The Jacobian - that is just , fine. But consider a layer with : the Jacobian has elements. Stack 100 such layers. The product of 100 Jacobians each of size is never computed explicitly - it would require exabytes of memory.
The key observation: we never actually need the full Jacobian. The loss is a scalar. We only ever need , which is a vector the same shape as .
where is the gradient vector flowing in from upstream (same shape as ). This is a vector-Jacobian product (VJP): multiply the Jacobian transpose by the upstream gradient vector.
The VJP can always be computed in time - the same as the forward pass - without ever materializing the full Jacobian. This is reverse-mode automatic differentiation.
Jacobians for Common Operations
Linear layer :
Element-wise activation (e.g., ReLU, sigmoid):
The Jacobian is diagonal - . So:
Element-wise multiplication - no matrix needed.
Softmax , :
The Jacobian is where is the Kronecker delta. This is a rank-1 update of a diagonal matrix. The VJP is:
where . In practice, softmax is almost always followed immediately by cross-entropy, and the combined gradient is much simpler: .
Reverse-Mode vs Forward-Mode Autodiff
There are two modes of automatic differentiation. Understanding both clarifies why backprop is the right choice for neural networks.
Forward-Mode: Directional Derivatives
Forward-mode computes the directional derivative for a chosen direction vector . It propagates derivatives forward alongside the values.
Cost: One forward pass per directional derivative, i.e., one pass per output if you want for each output .
When it wins: When the function maps few inputs to many outputs. Example: a simulation with 2 parameters producing 1,000 output metrics - 2 forward passes compute all partial derivatives.
Reverse-Mode: Backpropagation
Reverse-mode computes for a chosen co-vector (in practice, the upstream gradient). It propagates gradients backward from the loss.
Cost: One forward pass (to build the graph and store intermediates) + one backward pass, regardless of the number of parameters.
When it wins: When the function maps many inputs to a scalar output. A neural network with 100 million parameters and a scalar loss - one backward pass gives all 100 million gradients simultaneously.
Concrete Example
Consider evaluated at .
Forward pass values: , , .
Backward pass (reverse-mode, starting from ):
All gradients in one pass. Note receives contributions from two paths through the graph (the term in and the direct ). Backprop handles this by adding contributions at each node - this is why gradients accumulate with += in micrograd-style implementations.
Computational Graphs: Forward Builds, Backward Traverses
A computational graph is a directed acyclic graph where:
- Leaf nodes are inputs and parameters
- Interior nodes are operations (add, multiply, ReLU, matmul)
- Directed edges represent data flow
- Each node stores its forward output AND a backward function (closure over the inputs)
The forward pass simultaneously computes values and builds the graph. The backward pass traverses the graph in reverse topological order, calling each node's backward function and accumulating gradients at each node.
:::tip The Topological Order Requirement
Backprop must process nodes in reverse topological order: a node can only be processed after all nodes that depend on it have been processed. This ensures that when we call a node's backward function, the upstream gradient has already been fully accumulated. The topological sort is computed during backward() - this is what build_topo does in micrograd.
:::
Full Derivation: 3-Layer MLP
Let us derive all gradients for a 3-layer MLP with sigmoid hidden activations and softmax + cross-entropy output. This is the complete computation a backprop pass performs.
Architecture:
- Input:
- Layer 1: ,
- Layer 2: ,
- Output: ,
- Loss: (cross-entropy, is one-hot)
For a batch of examples: all vectors become matrices , etc.
Backprop step 1: Output layer gradient
The combined softmax + cross-entropy gradient has the elegant form:
This is the prediction minus the one-hot target - predicted probabilities minus ground truth. When the model is perfectly confident and correct, this is zero.
Backprop step 2: Output weight and bias gradients
Backprop step 3: Propagate through to
Backprop step 4: Through sigmoid at layer 2
Sigmoid derivative:
This is the notorious sigmoid gradient squashing: the maximum value of is 0.25. After 10 layers, the gradient is multiplied by 0.25 ten times, reducing it by a factor of roughly .
Backprop step 5: Layer 2 weight gradients
Backprop step 6: Propagate to , through sigmoid, to layer 1 weights
Shape check rule: the gradient of the loss with respect to any weight matrix must have the same shape as . If the shapes do not match, you have a transposition error - the most common backprop bug.
NumPy Implementation: 3-Layer MLP from Scratch
import numpy as np
from typing import Tuple, Dict
def sigmoid(z: np.ndarray) -> np.ndarray:
"""Numerically stable sigmoid."""
return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))
def sigmoid_grad(h: np.ndarray) -> np.ndarray:
"""Gradient of sigmoid given its output h = sigmoid(z)."""
return h * (1 - h)
def softmax(z: np.ndarray) -> np.ndarray:
"""Numerically stable softmax."""
z_shifted = z - z.max(axis=1, keepdims=True)
exp_z = np.exp(z_shifted)
return exp_z / exp_z.sum(axis=1, keepdims=True)
def cross_entropy_loss(p: np.ndarray, y_onehot: np.ndarray) -> float:
"""Cross-entropy loss. p: predicted probabilities, y_onehot: one-hot targets."""
eps = 1e-12
return -np.mean(np.sum(y_onehot * np.log(p + eps), axis=1))
class ThreeLayerMLP:
"""
3-layer MLP (2 hidden + 1 output) with sigmoid hidden activations.
Loss: softmax + cross-entropy.
All gradients computed manually via backprop.
"""
def __init__(
self,
d0: int, # input dimension
d1: int, # hidden layer 1 dimension
d2: int, # hidden layer 2 dimension
K: int, # output dimension (number of classes)
seed: int = 42,
):
rng = np.random.default_rng(seed)
# He initialization for sigmoid (use sqrt(2/n) for ReLU, sqrt(1/n) for sigmoid/tanh)
def he_init(fan_in: int, fan_out: int) -> np.ndarray:
return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
self.params: Dict[str, np.ndarray] = {
"W1": he_init(d0, d1),
"b1": np.zeros(d1),
"W2": he_init(d1, d2),
"b2": np.zeros(d2),
"W3": he_init(d2, K),
"b3": np.zeros(K),
}
# Cache for backward pass - populated during forward
self.cache: Dict[str, np.ndarray] = {}
def forward(self, X: np.ndarray) -> np.ndarray:
"""
Forward pass.
X: (B, d0)
Returns: predicted probabilities (B, K)
"""
W1, b1 = self.params["W1"], self.params["b1"]
W2, b2 = self.params["W2"], self.params["b2"]
W3, b3 = self.params["W3"], self.params["b3"]
# Layer 1
Z1 = X @ W1 + b1 # (B, d1)
H1 = sigmoid(Z1) # (B, d1)
# Layer 2
Z2 = H1 @ W2 + b2 # (B, d2)
H2 = sigmoid(Z2) # (B, d2)
# Output layer
Z3 = H2 @ W3 + b3 # (B, K)
P = softmax(Z3) # (B, K)
# Store everything needed for backward
self.cache = {
"X": X, "Z1": Z1, "H1": H1,
"Z2": Z2, "H2": H2,
"Z3": Z3, "P": P,
}
return P
def backward(self, Y_onehot: np.ndarray) -> Dict[str, np.ndarray]:
"""
Backward pass. Computes gradients for all parameters.
Y_onehot: (B, K) one-hot encoded targets.
Returns: dict of gradients matching self.params keys.
"""
X = self.cache["X"]
H1 = self.cache["H1"]
H2 = self.cache["H2"]
P = self.cache["P"]
W2 = self.params["W2"]
W3 = self.params["W3"]
B = X.shape[0]
# --- Output layer: softmax + cross-entropy combined gradient ---
# dL/dZ3 = (P - Y) / B
delta3 = (P - Y_onehot) / B # (B, K)
dW3 = H2.T @ delta3 # (d2, K) - correct shape matches W3
db3 = delta3.sum(axis=0) # (K,)
# --- Propagate through layer 2 linear to H2 ---
dH2 = delta3 @ W3.T # (B, d2)
# --- Through sigmoid at layer 2 ---
# dL/dZ2 = dL/dH2 * sigmoid'(Z2) = dL/dH2 * H2 * (1 - H2)
delta2 = dH2 * sigmoid_grad(H2) # (B, d2)
dW2 = H1.T @ delta2 # (d1, d2)
db2 = delta2.sum(axis=0) # (d2,)
# --- Propagate through layer 1 linear to H1 ---
dH1 = delta2 @ W2.T # (B, d1)
# --- Through sigmoid at layer 1 ---
delta1 = dH1 * sigmoid_grad(H1) # (B, d1)
dW1 = X.T @ delta1 # (d0, d1)
db1 = delta1.sum(axis=0) # (d1,)
return {
"W1": dW1, "b1": db1,
"W2": dW2, "b2": db2,
"W3": dW3, "b3": db3,
}
def loss(self, X: np.ndarray, Y_onehot: np.ndarray) -> float:
P = self.forward(X)
return cross_entropy_loss(P, Y_onehot)
def train_mlp():
"""Train the 3-layer MLP on synthetic data, verify loss decreases."""
np.random.seed(42)
# Synthetic 3-class dataset
N, d0, K = 200, 4, 3
X = np.random.randn(N, d0)
labels = np.random.randint(0, K, size=N)
Y_onehot = np.eye(K)[labels]
model = ThreeLayerMLP(d0=d0, d1=32, d2=16, K=K)
lr = 0.1
print("Training 3-layer MLP with manual backprop:")
for epoch in range(201):
# Forward
P = model.forward(X)
loss = cross_entropy_loss(P, Y_onehot)
if epoch % 20 == 0:
acc = (P.argmax(axis=1) == labels).mean()
print(f" Epoch {epoch:>3}: loss = {loss:.4f}, acc = {acc:.3f}")
# Backward
grads = model.backward(Y_onehot)
# SGD update
for key in model.params:
model.params[key] -= lr * grads[key]
train_mlp()
Numerical Gradient Checking
Whenever you implement a custom backward pass, verify it numerically. The finite difference approximation of the derivative:
where is the unit vector along dimension and is a small step (typically ). The centered difference is more accurate than the one-sided difference because it is second-order accurate in .
import numpy as np
from typing import Callable, Dict
def numerical_gradient(
loss_fn: Callable[[np.ndarray], float],
param: np.ndarray,
eps: float = 1e-5,
) -> np.ndarray:
"""
Compute numerical gradient of loss_fn with respect to param.
Uses centered finite differences: (f(x+eps) - f(x-eps)) / (2*eps).
"""
grad = np.zeros_like(param)
it = np.nditer(param, flags=["multi_index"], op_flags=["readwrite"])
while not it.finished:
idx = it.multi_index
orig = param[idx]
param[idx] = orig + eps
f_plus = loss_fn(param)
param[idx] = orig - eps
f_minus = loss_fn(param)
param[idx] = orig # restore
grad[idx] = (f_plus - f_minus) / (2 * eps)
it.iternext()
return grad
def gradient_check(model: ThreeLayerMLP, X: np.ndarray, Y: np.ndarray, eps: float = 1e-5):
"""
Compare analytical (backprop) gradients to numerical gradients.
Prints relative error for each parameter.
"""
# Compute analytical gradients
model.forward(X)
analytical = model.backward(Y)
print("\nGradient Check (analytical vs numerical):")
print(f"{'Parameter':<12} {'Max Abs Diff':>14} {'Relative Error':>16} {'Status':>8}")
print("-" * 55)
all_pass = True
for name, param in model.params.items():
# Capture name in closure properly
def make_loss(param_ref, other_params):
def loss_fn(p):
# Temporarily set this param
saved = param_ref.copy()
param_ref[:] = p.reshape(param_ref.shape)
val = model.loss(X, Y)
param_ref[:] = saved
return val
return loss_fn
# Numerical gradient for this parameter
num_grad = numerical_gradient(
lambda p: model.loss(X, Y),
param,
eps=eps,
)
a = analytical[name]
n = num_grad
max_abs = np.abs(a - n).max()
denom = max(np.abs(a).max(), np.abs(n).max(), 1e-8)
rel_err = max_abs / denom
status = "PASS" if rel_err < 1e-4 else "FAIL"
if rel_err >= 1e-4:
all_pass = False
print(f"{name:<12} {max_abs:>14.2e} {rel_err:>16.2e} {status:>8}")
print(f"\nOverall: {'ALL PASS' if all_pass else 'SOME FAIL'}")
return all_pass
:::warning Gradient Check Gotcha Numerical gradient checking is slow - forward passes for parameters. Never use it in a training loop. Use it only during development to verify a new custom backward implementation. Also: gradient checking fails near kinks (ReLU at exactly 0). Use small inputs and avoid degenerate cases. :::
PyTorch Custom Autograd Function
When you need to implement a custom forward/backward in PyTorch - for fused operations, custom CUDA kernels, or non-differentiable steps - use torch.autograd.Function.
import torch
import torch.nn as nn
from torch.autograd import Function
class SigmoidFunction(Function):
"""
Custom sigmoid implementation to illustrate the autograd.Function API.
In practice, use torch.sigmoid - this is for learning purposes.
"""
@staticmethod
def forward(ctx, x: torch.Tensor) -> torch.Tensor:
"""
ctx is the context object. Save tensors here for use in backward.
forward() must be @staticmethod - no self.
"""
output = torch.sigmoid(x)
# Save output (not input) because sigmoid gradient uses the output value
ctx.save_for_backward(output)
return output
@staticmethod
def backward(ctx, grad_output: torch.Tensor) -> torch.Tensor:
"""
grad_output: gradient of loss w.r.t. this function's output.
Return: gradient of loss w.r.t. this function's input.
Must return one tensor per input to forward().
"""
(output,) = ctx.saved_tensors
# dL/dx = dL/d(output) * output * (1 - output)
return grad_output * output * (1 - output)
# Usage
sigmoid_fn = SigmoidFunction.apply
x = torch.randn(4, 8, requires_grad=True)
y = sigmoid_fn(x)
loss = y.sum()
loss.backward()
print(f"Analytical grad norm: {x.grad.norm():.4f}")
# Verify with gradcheck
from torch.autograd import gradcheck
x_check = torch.randn(3, 4, dtype=torch.float64, requires_grad=True)
result = gradcheck(SigmoidFunction.apply, (x_check,), eps=1e-6, atol=1e-4)
print(f"gradcheck passed: {result}")
class CustomScaledDotAttention(Function):
"""
Fused scaled dot-product attention backward - common in production.
This illustrates the pattern; a real implementation would use CUDA.
"""
@staticmethod
def forward(ctx, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
"""
Q, K, V: (B, T, d_k)
Output: (B, T, d_k)
"""
d_k = Q.shape[-1]
scale = d_k ** -0.5
# Compute attention weights
scores = torch.bmm(Q, K.transpose(-2, -1)) * scale # (B, T, T)
weights = torch.softmax(scores, dim=-1) # (B, T, T)
output = torch.bmm(weights, V) # (B, T, d_k)
# Save for backward
ctx.save_for_backward(Q, K, V, weights)
ctx.scale = scale
return output
@staticmethod
def backward(ctx, grad_output: torch.Tensor):
"""
grad_output: (B, T, d_k) - gradient w.r.t. attention output
Returns: gradients for Q, K, V
"""
Q, K, V, weights = ctx.saved_tensors
scale = ctx.scale
# Gradient w.r.t. V: dL/dV = weights^T @ grad_output
grad_V = torch.bmm(weights.transpose(-2, -1), grad_output) # (B, T, d_k)
# Gradient w.r.t. attention weights: dL/d(weights) = grad_output @ V^T
grad_weights = torch.bmm(grad_output, V.transpose(-2, -1)) # (B, T, T)
# Gradient through softmax
# dL/d(scores) = softmax_backward(weights, grad_weights)
# = weights * (grad_weights - (weights * grad_weights).sum(-1, keepdim=True))
sum_term = (weights * grad_weights).sum(dim=-1, keepdim=True)
grad_scores = weights * (grad_weights - sum_term) * scale # (B, T, T)
# Gradient w.r.t. Q: dL/dQ = grad_scores @ K
grad_Q = torch.bmm(grad_scores, K) # (B, T, d_k)
# Gradient w.r.t. K: dL/dK = grad_scores^T @ Q
grad_K = torch.bmm(grad_scores.transpose(-2, -1), Q) # (B, T, d_k)
return grad_Q, grad_K, grad_V
# Verify correctness
B, T, d_k = 2, 4, 8
Q = torch.randn(B, T, d_k, dtype=torch.float64, requires_grad=True)
K = torch.randn(B, T, d_k, dtype=torch.float64, requires_grad=True)
V = torch.randn(B, T, d_k, dtype=torch.float64, requires_grad=True)
result = gradcheck(CustomScaledDotAttention.apply, (Q, K, V), eps=1e-6, atol=1e-4)
print(f"Attention gradcheck: {result}")
Micrograd: Building Autograd from Scratch
Andrej Karpathy's micrograd approach - a scalar-valued autograd engine - is the cleanest way to understand the mechanics. Each Value node wraps a scalar, stores its gradient, and records a backward closure.
import math
from typing import Union, Callable, Set
class Value:
"""
Scalar value in a computational graph with reverse-mode autodiff.
Implements: +, -, *, /, **, exp, log, relu, tanh, sigmoid.
"""
def __init__(
self,
data: float,
_children: tuple = (),
_op: str = "",
label: str = "",
):
self.data = float(data)
self.grad = 0.0 # dL/d(self), accumulated
self._backward: Callable = lambda: None
self._prev: Set["Value"] = set(_children)
self._op = _op
self.label = label
def __repr__(self) -> str:
return f"Value({self.data:.4f}, grad={self.grad:.4f}, op='{self._op}')"
# --- Forward operations ---
def __add__(self, other: Union["Value", float]) -> "Value":
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data + other.data, (self, other), "+")
def _backward():
# Gradient of (a + b) w.r.t. a = 1, w.r.t. b = 1
# += because a node can be used in multiple places (multivariate chain rule)
self.grad += out.grad
other.grad += out.grad
out._backward = _backward
return out
def __mul__(self, other: Union["Value", float]) -> "Value":
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data * other.data, (self, other), "*")
def _backward():
# Gradient of (a * b) w.r.t. a = b, w.r.t. b = a
self.grad += other.data * out.grad
other.grad += self.data * out.grad
out._backward = _backward
return out
def __pow__(self, exponent: float) -> "Value":
assert isinstance(exponent, (int, float))
out = Value(self.data ** exponent, (self,), f"**{exponent}")
def _backward():
# d/dx of x^n = n * x^(n-1)
self.grad += exponent * (self.data ** (exponent - 1)) * out.grad
out._backward = _backward
return out
def exp(self) -> "Value":
e = math.exp(self.data)
out = Value(e, (self,), "exp")
def _backward():
# d/dx of exp(x) = exp(x) - stored as e
self.grad += e * out.grad
out._backward = _backward
return out
def log(self) -> "Value":
assert self.data > 0, f"log({self.data}) undefined"
out = Value(math.log(self.data), (self,), "log")
def _backward():
# d/dx of log(x) = 1/x
self.grad += (1.0 / self.data) * out.grad
out._backward = _backward
return out
def relu(self) -> "Value":
out = Value(max(0.0, self.data), (self,), "ReLU")
def _backward():
# d/dx of ReLU(x) = 1 if x > 0 else 0
self.grad += (out.data > 0) * out.grad
out._backward = _backward
return out
def tanh(self) -> "Value":
t = math.tanh(self.data)
out = Value(t, (self,), "tanh")
def _backward():
# d/dx of tanh(x) = 1 - tanh^2(x)
self.grad += (1 - t ** 2) * out.grad
out._backward = _backward
return out
def sigmoid(self) -> "Value":
s = 1.0 / (1.0 + math.exp(-self.data))
out = Value(s, (self,), "sigmoid")
def _backward():
self.grad += s * (1 - s) * out.grad
out._backward = _backward
return out
def backward(self):
"""
Run backpropagation from this node.
Call on the loss value after the full forward pass.
"""
# Topological sort: process each node after all its dependents
topo: list["Value"] = []
visited: Set[int] = set()
def build_topo(v: "Value"):
if id(v) not in visited:
visited.add(id(v))
for child in v._prev:
build_topo(child)
topo.append(v)
build_topo(self)
# Gradient of loss w.r.t. itself = 1
self.grad = 1.0
# Reverse topological order: from output to inputs
for node in reversed(topo):
node._backward()
# --- Python operator overloads ---
def __neg__(self) -> "Value": return self * -1
def __sub__(self, other): return self + (-other)
def __truediv__(self, other): return self * other ** -1
def __radd__(self, other): return self + other
def __rmul__(self, other): return self * other
def __rsub__(self, other): return other + (-self)
def __rtruediv__(self, other): return other * self ** -1
def demo_2neuron_network():
"""Demonstrate backward pass through a tiny 2-neuron network."""
# Inputs
x1 = Value(2.0, label="x1")
x2 = Value(0.0, label="x2")
# Weights and bias (hand-picked to give interesting gradients)
w1 = Value(-3.0, label="w1")
w2 = Value(1.0, label="w2")
b = Value(6.881, label="b")
# Layer computation
z = x1 * w1 + x2 * w2 + b
h = z.tanh()
# Loss = output (so dL/dh = 1, making this equivalent to maximizing h)
h.backward()
print("=== Network Gradients ===")
print(f"x1.grad = {x1.grad:.4f} (how much increasing x1 increases h)")
print(f"x2.grad = {x2.grad:.4f} (how much increasing x2 increases h)")
print(f"w1.grad = {w1.grad:.4f} (how much increasing w1 increases h)")
print(f"w2.grad = {w2.grad:.4f} (how much increasing w2 increases h)")
print(f"b.grad = {b.grad:.4f} (how much increasing b increases h)")
# Cross-check with PyTorch
import torch
x1t = torch.tensor(2.0, requires_grad=True)
x2t = torch.tensor(0.0, requires_grad=True)
w1t = torch.tensor(-3.0, requires_grad=True)
w2t = torch.tensor(1.0, requires_grad=True)
bt = torch.tensor(6.881, requires_grad=True)
zt = x1t * w1t + x2t * w2t + bt
ht = zt.tanh()
ht.backward()
print("\n=== PyTorch Cross-Check ===")
print(f"x1.grad = {x1t.grad:.4f}")
print(f"w1.grad = {w1t.grad:.4f}")
print(f"b.grad = {bt.grad:.4f}")
demo_2neuron_network()
Common Backprop Bugs
:::danger Bug 1: Shape Transposition in Weight Gradient
The weight gradient should be upstream_grad.T @ input (or equivalently input.T @ upstream_grad depending on convention). Swapping the transpose produces a gradient of the wrong shape that PyTorch will not automatically catch if the shapes happen to be square.
# WRONG: produces incorrect gradient for non-square W
dW = upstream_grad @ H.T # shape may be wrong for non-square case
# CORRECT: (d_out, B).T @ (B, d_in) -> (d_out, d_in) which matches W shape
dW = (1/B) * upstream_grad.T @ H
:::
:::danger Bug 2: Forgetting optimizer.zero_grad()
Without zero_grad, gradients accumulate across batches. PyTorch .grad accumulates with += by design (to support gradient accumulation). This means after two batches without zero_grad, the effective gradient is the sum of two batches, causing incorrect and growing updates.
# WRONG - gradients double, triple, ... with each batch
for batch_x, batch_y in loader:
loss = criterion(model(batch_x), batch_y)
loss.backward()
optimizer.step()
# CORRECT
for batch_x, batch_y in loader:
optimizer.zero_grad(set_to_none=True) # set_to_none=True is slightly faster
loss = criterion(model(batch_x), batch_y)
loss.backward()
optimizer.step()
:::
:::danger Bug 3: In-Place Operations Breaking the Graph In-place operations modify a tensor that may be stored in another node's backward closure. PyTorch detects many of these and raises a RuntimeError, but not all. Always prefer out-of-place operations during training.
# WRONG
x = torch.randn(3, requires_grad=True)
y = x + 1
y.relu_() # in-place - modifies y which x's gradient computation needs
loss = y.sum()
loss.backward() # RuntimeError
# CORRECT
y = (x + 1).relu() # out-of-place
:::
:::warning Bug 4: .item() Detaches from Graph
Calling .item() converts a tensor to a Python scalar, cutting the gradient graph. A common pattern is using .item() inside a loss function for logging, then accidentally returning the result.
# WRONG
def compute_loss(pred, target):
mse = ((pred - target)**2).mean().item() # detached!
return torch.tensor(mse) # new tensor, no gradient path
# CORRECT
def compute_loss(pred, target):
return ((pred - target)**2).mean() # stays in graph
:::
:::warning Bug 5: Retaining Graph When Not Needed (Memory Leak)
If you call loss.backward(retain_graph=True) unnecessarily, the intermediate activations are never freed, causing memory to grow with every training step. Only retain the graph when you genuinely need to call backward a second time.
# MEMORY LEAK if called every step
loss.backward(retain_graph=True)
# CORRECT - only use retain_graph when truly needed
# e.g., for meta-learning or penalty-based second-order methods
loss.backward() # graph freed automatically
:::
Inspecting Gradient Flow in PyTorch
import torch
import torch.nn as nn
from typing import Dict
def print_gradient_flow(model: nn.Module, threshold_vanish: float = 1e-6,
threshold_explode: float = 10.0) -> Dict[str, float]:
"""
Print gradient norms for all named parameters.
Call after loss.backward() and before optimizer.step().
Returns dict of {name: grad_norm}.
"""
print(f"\n{'Layer':<45} {'Grad Norm':>12} {'Param Norm':>12} {'Ratio':>10} Status")
print("-" * 95)
norms = {}
for name, param in model.named_parameters():
if param.grad is not None:
g_norm = param.grad.detach().norm(2).item()
p_norm = param.data.detach().norm(2).item()
ratio = g_norm / (p_norm + 1e-8)
norms[name] = g_norm
status = "OK"
if g_norm < threshold_vanish:
status = "VANISHING"
elif g_norm > threshold_explode:
status = "EXPLODING"
print(f"{name:<45} {g_norm:>12.4e} {p_norm:>12.4e} {ratio:>10.4e} {status}")
else:
norms[name] = 0.0
print(f"{name:<45} {'NO GRAD':>12}")
return norms
def register_gradient_hooks(model: nn.Module) -> list:
"""
Register backward hooks to monitor gradients flowing through each layer.
Hooks fire during backward() - useful for real-time monitoring.
Usage:
handles = register_gradient_hooks(model)
loss.backward()
for h in handles: h.remove()
"""
handles = []
def make_hook(name: str):
def hook(grad: torch.Tensor) -> torch.Tensor:
norm = grad.norm().item()
has_nan = torch.isnan(grad).any().item()
print(f" Backward hook [{name}]: norm={norm:.4e}, has_nan={has_nan}")
return grad # return None to keep gradient unchanged, or return modified grad
return hook
for name, param in model.named_parameters():
if param.requires_grad:
handles.append(param.register_hook(make_hook(name)))
return handles
# Demo usage
model = nn.Sequential(
nn.Linear(32, 64),
nn.ReLU(),
nn.Linear(64, 32),
nn.ReLU(),
nn.Linear(32, 10),
)
x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
norms = print_gradient_flow(model)
YouTube Resources
| Title | Channel | Why Watch |
|---|---|---|
| The spelled-out intro to neural networks and backpropagation | Andrej Karpathy | Builds micrograd live - the single best backprop from scratch video; 2.5 hours of pure clarity |
| Backpropagation calculus | 3Blue1Brown | Beautiful visual intuition for the chain rule on computational graphs; 10 minutes |
| CS231n Lecture 4: Backpropagation and Neural Networks | Stanford / Andrej Karpathy | The definitive university lecture on backprop; covers Jacobians, gates, and implementation |
| PyTorch Autograd Explained | Elliot Waite | Detailed walkthrough of the PyTorch autograd engine internals |
| Automatic Differentiation in Machine Learning: a Survey | NeurIPS Tutorial | Academic depth on forward vs reverse mode and higher-order derivatives |
Production Engineering Notes
Memory cost scales with depth and batch size. Every intermediate activation must be stored for the backward pass. A model with 48 transformer layers trained with batch size 64 and sequence length 2048 can require tens of gigabytes just for stored activations - before parameters and gradients. Gradient checkpointing (torch.utils.checkpoint) trades compute for memory: it does not store activations during forward, then recomputes them during backward. This roughly halves memory at the cost of ~33% more compute.
Gradient accumulation for large effective batch sizes. When GPU memory limits batch size to 8 but you want an effective batch of 64, run 8 mini-batches without zero_grad, accumulate gradients, then step. Scale the loss by 1/accumulation_steps so the effective gradient magnitude matches the full batch.
Mixed precision and gradient scaling. Training in float16 reduces memory and speeds up computation, but float16 underflows to zero for gradients smaller than ~6e-8. PyTorch's GradScaler multiplies the loss by a large factor before backward (so gradients are large enough to not underflow in float16), then divides before the optimizer step. Check scaler.get_scale() - if it drops rapidly, gradients are overflowing (reduce LR or check for bugs).
Second-order gradients. Meta-learning (MAML), gradient penalty (WGAN-GP), and knowledge distillation with gradient matching all require differentiating through the backward pass. Use create_graph=True in torch.autograd.grad(), which builds the computation graph for the gradient computation itself. Memory cost doubles or triples. Profile before using in production.
Interview Q&A
Q1: Explain backpropagation to a non-technical interviewer in one paragraph.
Backpropagation is the algorithm that teaches a neural network from its mistakes. After we run data through the network and compare the output to the correct answer, we have a single number representing the error - the loss. Backprop then asks, for each weight in the network: if I increase this weight by a tiny amount, does the loss go up or down, and by how much? The answer for all weights simultaneously is computed in a single backward pass through the network - from output to input - using the chain rule from calculus to propagate responsibility for the error backward through each layer. This gives every weight a gradient that says "move me in this direction to reduce the loss." Gradient descent then takes a small step in that direction. Repeating thousands of times is how the network learns.
Q2: Why is reverse-mode autodiff preferred over forward-mode for neural networks?
For neural networks, the loss is a scalar and the parameters form a high-dimensional vector - millions to billions of elements. Forward-mode autodiff computes the directional derivative for one direction per pass. To compute all partial derivatives, you need one forward pass per parameter. With 100 million parameters, that is 100 million forward passes per training step - completely intractable.
Reverse-mode computes for all simultaneously in a single backward pass. The key insight: since is scalar, there is only one "upstream gradient" flowing backward, and each layer can compute its parameter gradients and pass the gradient downstream - all reusing the same upstream value. Reverse-mode wins when there are many inputs and one scalar output. Forward-mode wins when there are few inputs and many outputs (e.g., computing the full Jacobian of a vector-valued function with respect to two scalar inputs).
Q3: What is the vanishing gradient problem and how does it occur in backpropagation?
When backpropagating through many layers, the gradient is multiplied by the local Jacobian at each layer. For element-wise activation functions like sigmoid (maximum derivative 0.25) and tanh (maximum derivative 1.0, but typically well below 1.0 for moderate inputs), the gradient is multiplied by a number less than 1 at each layer. Across 20 layers of sigmoid activations, the gradient shrinks by at least - effectively zero. Early layers receive zero gradient and their weights stop updating. The network appears to train - the later layers learn fine - but the early feature extractors are frozen.
Solutions: ReLU activations (derivative exactly 1 for positive inputs), residual connections (which add an identity path so gradient can flow without multiplication), batch normalization (which keeps activations in a regime where sigmoid/tanh gradients are larger), and careful initialization (Xavier/He initialization ensures the gradient magnitudes are roughly preserved at initialization).
Q4: What happens when you call loss.backward() twice without retain_graph=True?
The first backward() traverses the computation graph, calls each node's backward function, accumulates gradients, and then frees the intermediate activations from memory. The forward pass stores intermediate values (like pre-activation tensors needed to compute the ReLU mask) in the computation graph nodes. After backward, these are released to free GPU memory.
If you call backward() again without retain_graph=True, the graph no longer exists - those intermediate tensors have been freed - and PyTorch raises RuntimeError: Trying to backward through the graph a second time. The fix is loss.backward(retain_graph=True) for the first call, which keeps the activations alive. The pattern arises legitimately in meta-learning (MAML differentiates through an inner gradient update), gradient penalty regularization (penalty involves gradient norms), and some RL algorithms.
Q5: How do you write and verify a custom backward pass for a PyTorch autograd Function?
Use torch.autograd.Function with @staticmethod forward and @staticmethod backward. In forward, call ctx.save_for_backward(...) to save any tensors needed in backward (do not save raw Python values - use ctx.my_scalar = value). In backward, retrieve saved tensors with ctx.saved_tensors and return one gradient per input to forward (return None for inputs that do not require gradients).
Verify with torch.autograd.gradcheck: convert inputs to float64 (required because finite differences in float32 have too much numerical noise), set requires_grad=True, and call gradcheck(MyFunc.apply, (x,), eps=1e-6, atol=1e-4). A passing gradcheck means your backward implementation matches finite differences to within tolerance. The most common bugs: forgetting a transpose, wrong shape for an intermediate term, or not accounting for the batch dimension mean in the gradient.
Q6: A training run produces NaN loss after 500 steps. Walk through your debugging process.
Step 1: Add torch.autograd.set_detect_anomaly(True) before the training loop. This adds a hook to every operation that detects the first NaN or Inf and prints a stack trace showing exactly which operation produced it. It is slow (2-3x slowdown), so only use it for one or two batches.
Step 2: If the NaN appears in the loss, check for log(0) - cross-entropy on zero probabilities. Add epsilon: torch.clamp(probs, min=1e-8) before log.
Step 3: If the NaN is in gradients, print gradient norms before the optimizer step. If any norm is Inf or NaN, the loss landscape has a spike. Add gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) and reduce LR.
Step 4: Check input data. assert not torch.isnan(batch_x).any() before every forward pass during the first 10 steps.
Step 5: Check for in-place operations that might corrupt stored activations. Grep for _() suffix operations and += on non-leaf tensors.
Step 6: If using mixed precision, check the GradScaler - if the scale drops to 1.0, gradients are overflowing in float16. Add explicit float32 casting for numerically sensitive operations.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Backpropagation demo on the EngineersOfAI Playground - no code required.
:::
