What is backpropagation?

Full chain rule derivation on computational graphs, Jacobian matrices and vector-Jacobian products, reverse-mode vs forward-mode autodiff, numpy 3-layer MLP implementation, PyTorch custom autograd Functions, and numerical gradient checking - every concept a senior engineer needs to debug, extend, and explain backprop under pressure.

How does chain rule work in practice?

Backpropagation From Scratch covers backpropagation, chain rule, computational graph from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/neural-networks/backpropagation-from-scratch

What is the difference between backpropagation and computational graph?

See the full breakdown at https://engineersofai.com/docs/ml/neural-networks/backpropagation-from-scratch

Backpropagation From Scratch

The Production Scenario

It is 2 AM on a Tuesday. The model has been training for four hours and the loss curve is doing something you have never seen: it decreases smoothly for the first 2,000 steps, then begins oscillating with increasing amplitude - not the usual LR-too-high chaos, but a slow, growing wobble that suggests a specific layer is destabilizing. You print gradient norms. Layer 1 is fine. Layer 2 is fine. The third linear layer has a gradient norm of 847. The fourth has a norm of 0.0003. Something is wrong with the backward pass.

Your colleague implemented a custom attention variant last week and changed the backward method of a custom autograd Function. You pull up the code. The forward pass looks correct - the shapes are right, the computation matches the paper. But the backward pass has a subtle shape transposition error: where it should multiply by the Jacobian transpose, it multiplies by the Jacobian directly. The gradients are wrong. The loss appears to decrease at first because the error is small when weights are small, then compounds as weights grow.

To even understand this bug, you need to know what a Jacobian is in this context, why the transpose matters, what the expected shapes should be, and how to verify it numerically in under five minutes. You need to understand not just that backprop "applies the chain rule" but exactly how that chain rule translates into matrix operations at every layer.

This is not a theoretical exercise. In production ML engineering, you will write custom layers. You will implement paper algorithms that PyTorch does not support natively. You will inherit code from colleagues who understood backprop at a surface level. The engineers who can read a backward pass, spot the bug, and verify the fix numerically - those are the engineers who debug in hours instead of days.

This lesson covers everything you need to be one of them.

Why Backpropagation Exists

Before backpropagation, training multilayer neural networks was largely considered intractable. The problem is not computing a single gradient - that is just calculus. The problem is computing gradients for a million parameters efficiently enough that training finishes in a reasonable time.

The naive approach: for each parameter $w_i$ , perturb it slightly, run a full forward pass, measure how the loss changes, divide by the perturbation size. This gives $\partial L / \partial w_i$ . Repeat for every parameter. If you have 100 million parameters, you need 100 million forward passes per training step. A forward pass through a large model takes seconds. This is completely intractable.

The early 1980s alternative: restrict to single-layer networks where gradients could be computed by hand in closed form. This severely limited the representational power of neural networks and is one reason the first "AI winter" happened - researchers could not train networks deep enough to solve hard problems.

The key insight of backprop: all the partial derivatives share enormous amounts of intermediate computation. The gradient of the loss with respect to layer 1's weights depends on the gradient flowing back through layers 4, 3, and 2. If you compute that backward flow once and reuse it for every weight in layer 1, you avoid redundant computation. By traversing the computation graph in reverse and accumulating these reusable intermediate gradients, you can compute all partial derivatives in a single backward pass - the same cost as a single forward pass.

Rumelhart, Hinton, and Williams formalized this in their landmark 1986 Nature paper, though the core ideas appear earlier in Werbos (1974). The practical impact was enormous: suddenly networks with multiple hidden layers were trainable, and the door to deep learning opened.

Historical Context

David Rumelhart, Geoffrey Hinton, Ronald Williams - 1986. The paper "Learning Representations by Back-propagating Errors" in Nature is the canonical reference. Hinton later described the key insight as recognizing that you could efficiently compute derivatives through composed functions using dynamic programming - computing each intermediate gradient exactly once and reusing it.

Paul Werbos - 1974. Werbos derived a general method for computing gradients through composite functions in his Harvard PhD thesis, predating the 1986 paper. The algorithm was essentially the same but the ML community largely missed it at the time.

The autograd revolution - 2007–2017. The jump from "implement backprop by hand per architecture" to "define a forward pass, get gradients for free" came from automatic differentiation (autodiff) research. Theano (2007), Torch (2011), and then PyTorch (2016) and TensorFlow (2015) made this practical. The key idea: build a computation graph during the forward pass, traverse it in reverse during the backward pass. This is what every modern deep learning framework does.

The "aha moment" in modern terms: the computation graph approach decouples the forward computation from the backward computation. Each operation only needs to know its local backward rule. The autograd engine chains these local rules together automatically.

The Chain Rule: From Scalar to Matrix

Scalar Chain Rule

The scalar chain rule is taught in every calculus course. If $L = f(z)$ and $z = g(w)$ , then:

$\frac{dL}{dw} = \frac{dL}{dz} \cdot \frac{dz}{dw}$

This says: the rate of change of $L$ with respect to $w$ equals the rate of change of $L$ with respect to the intermediate value $z$ , times the rate of change of $z$ with respect to $w$ . The chain decomposes the path from $w$ to $L$ into local steps.

For a deeper chain $L = f_n(f_{n-1}(\ldots f_1(w) \ldots))$ :

$\frac{dL}{dw} = \frac{dL}{df_n} \cdot \frac{df_n}{df_{n-1}} \cdots \frac{df_2}{df_1} \cdot \frac{df_1}{dw}$

Backprop computes this product from right to left (or equivalently, accumulates the product from left to right starting from $dL/df_n = 1$ ).

Vector Chain Rule and Jacobians

When inputs and outputs are vectors, derivatives become Jacobian matrices. If $\mathbf{y} = f(\mathbf{x})$ with $\mathbf{y} \in \mathbb{R}^m$ and $\mathbf{x} \in \mathbb{R}^n$ , the Jacobian is:

$J = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}, \quad J_{ij} = \frac{\partial y_i}{\partial x_j}$

The chain rule in vector form: if $\mathbf{z} = f(\mathbf{y})$ and $\mathbf{y} = g(\mathbf{x})$ , then:

$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$

This is a matrix product - Jacobian times Jacobian. In principle, this is how backprop works. In practice, forming explicit Jacobian matrices is computationally catastrophic.

Why We Never Form Full Jacobians

Consider a linear layer $\mathbf{y} = \mathbf{W}\mathbf{x}$ with $\mathbf{W} \in \mathbb{R}^{m \times n}$ . The Jacobian $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}$ - that is just $\mathbf{W}$ , fine. But consider a layer with $m = n = 10{,}000$ : the Jacobian has $10^8$ elements. Stack 100 such layers. The product of 100 Jacobians each of size $10^8$ is never computed explicitly - it would require exabytes of memory.

The key observation: we never actually need the full Jacobian. The loss $L$ is a scalar. We only ever need $\frac{\partial L}{\partial \mathbf{x}}$ , which is a vector the same shape as $\mathbf{x}$ .

$\frac{\partial L}{\partial \mathbf{x}} = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^T \frac{\partial L}{\partial \mathbf{y}} = J^T \mathbf{g}$

where $\mathbf{g} = \frac{\partial L}{\partial \mathbf{y}}$ is the gradient vector flowing in from upstream (same shape as $\mathbf{y}$ ). This is a vector-Jacobian product (VJP): multiply the Jacobian transpose by the upstream gradient vector.

The VJP can always be computed in $O(mn)$ time - the same as the forward pass - without ever materializing the full $m \times n$ Jacobian. This is reverse-mode automatic differentiation.

Jacobians for Common Operations

Linear layer $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$ :

$\frac{\partial L}{\partial \mathbf{x}} = \mathbf{W}^T \frac{\partial L}{\partial \mathbf{y}}$

$\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{y}} \cdot \mathbf{x}^T$

$\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{y}}$

Element-wise activation $\mathbf{y} = \sigma(\mathbf{x})$ (e.g., ReLU, sigmoid):

The Jacobian is diagonal - $J_{ij} = \sigma'(x_i) \cdot \mathbf{1}[i=j]$ . So:

$\frac{\partial L}{\partial \mathbf{x}} = \sigma'(\mathbf{x}) \odot \frac{\partial L}{\partial \mathbf{y}}$

Element-wise multiplication - no matrix needed.

Softmax $\mathbf{p} = \text{softmax}(\mathbf{z})$ , $p_i = e^{z_i} / \sum_j e^{z_j}$ :

The Jacobian is $J_{ij} = p_i(\delta_{ij} - p_j)$ where $\delta_{ij}$ is the Kronecker delta. This is a rank-1 update of a diagonal matrix. The VJP is:

$\frac{\partial L}{\partial \mathbf{z}} = \mathbf{p} \odot \left(\mathbf{g} - (\mathbf{p} \cdot \mathbf{g})\mathbf{1}\right)$

where $\mathbf{g} = \frac{\partial L}{\partial \mathbf{p}}$ . In practice, softmax is almost always followed immediately by cross-entropy, and the combined gradient is much simpler: $\frac{\partial L}{\partial \mathbf{z}} = \mathbf{p} - \mathbf{y}_{\text{onehot}}$ .

Reverse-Mode vs Forward-Mode Autodiff

There are two modes of automatic differentiation. Understanding both clarifies why backprop is the right choice for neural networks.

Forward-Mode: Directional Derivatives

Forward-mode computes the directional derivative $\frac{\partial f}{\partial \mathbf{x}} \cdot \mathbf{v}$ for a chosen direction vector $\mathbf{v}$ . It propagates derivatives forward alongside the values.

Cost: One forward pass per directional derivative, i.e., one pass per output if you want $\frac{\partial f_j}{\partial \mathbf{x}}$ for each output $j$ .

When it wins: When the function maps few inputs to many outputs. Example: a simulation with 2 parameters producing 1,000 output metrics - 2 forward passes compute all partial derivatives.

Reverse-Mode: Backpropagation

Reverse-mode computes $\mathbf{v}^T \frac{\partial f}{\partial \mathbf{x}}$ for a chosen co-vector $\mathbf{v}$ (in practice, the upstream gradient). It propagates gradients backward from the loss.

Cost: One forward pass (to build the graph and store intermediates) + one backward pass, regardless of the number of parameters.

When it wins: When the function maps many inputs to a scalar output. A neural network with 100 million parameters and a scalar loss - one backward pass gives all 100 million gradients simultaneously.

Concrete Example

Consider $f(x_1, x_2) = (x_1 + x_2)^2 \cdot x_1$ evaluated at $(x_1, x_2) = (3, 2)$ .

Forward pass values: $z = x_1 + x_2 = 5$ , $w = z^2 = 25$ , $f = w \cdot x_1 = 75$ .

Backward pass (reverse-mode, starting from $\frac{\partial f}{\partial f} = 1$ ):

$\frac{\partial f}{\partial w} = x_1 = 3$

$\frac{\partial f}{\partial z} = \frac{\partial f}{\partial w} \cdot 2z = 3 \cdot 10 = 30$

$\frac{\partial f}{\partial x_1} = \frac{\partial f}{\partial z} \cdot 1 + w \cdot 1 = 30 + 25 = 55$

$\frac{\partial f}{\partial x_2} = \frac{\partial f}{\partial z} \cdot 1 = 30$

All gradients in one pass. Note $\frac{\partial f}{\partial x_1}$ receives contributions from two paths through the graph (the $x_1$ term in $z = x_1 + x_2$ and the direct $f = w \cdot x_1$ ). Backprop handles this by adding contributions at each node - this is why gradients accumulate with += in micrograd-style implementations.

Computational Graphs: Forward Builds, Backward Traverses

A computational graph is a directed acyclic graph where:

Leaf nodes are inputs and parameters
Interior nodes are operations (add, multiply, ReLU, matmul)
Directed edges represent data flow
Each node stores its forward output AND a backward function (closure over the inputs)

The forward pass simultaneously computes values and builds the graph. The backward pass traverses the graph in reverse topological order, calling each node's backward function and accumulating gradients at each node.

:::tip The Topological Order Requirement Backprop must process nodes in reverse topological order: a node can only be processed after all nodes that depend on it have been processed. This ensures that when we call a node's backward function, the upstream gradient has already been fully accumulated. The topological sort is computed during backward() - this is what build_topo does in micrograd. :::

Full Derivation: 3-Layer MLP

Let us derive all gradients for a 3-layer MLP with sigmoid hidden activations and softmax + cross-entropy output. This is the complete computation a backprop pass performs.

Architecture:

Input: $\mathbf{x} \in \mathbb{R}^{d_0}$
Layer 1: $\mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}$ , $\mathbf{h}^{(1)} = \sigma(\mathbf{z}^{(1)}) \in \mathbb{R}^{d_1}$
Layer 2: $\mathbf{z}^{(2)} = \mathbf{W}^{(2)}\mathbf{h}^{(1)} + \mathbf{b}^{(2)}$ , $\mathbf{h}^{(2)} = \sigma(\mathbf{z}^{(2)}) \in \mathbb{R}^{d_2}$
Output: $\mathbf{z}^{(3)} = \mathbf{W}^{(3)}\mathbf{h}^{(2)} + \mathbf{b}^{(3)}$ , $\hat{\mathbf{y}} = \text{softmax}(\mathbf{z}^{(3)}) \in \mathbb{R}^{K}$
Loss: $L = -\sum_k y_k \log \hat{y}_k$ (cross-entropy, $\mathbf{y}$ is one-hot)

For a batch of $B$ examples: all vectors become matrices $\mathbf{X} \in \mathbb{R}^{B \times d_0}$ , etc.

Backprop step 1: Output layer gradient

The combined softmax + cross-entropy gradient has the elegant form:

$\boldsymbol{\delta}^{(3)} = \frac{\partial L}{\partial \mathbf{Z}^{(3)}} = \hat{\mathbf{Y}} - \mathbf{Y} \in \mathbb{R}^{B \times K}$

This is the prediction minus the one-hot target - predicted probabilities minus ground truth. When the model is perfectly confident and correct, this is zero.

Backprop step 2: Output weight and bias gradients

$\frac{\partial L}{\partial \mathbf{W}^{(3)}} = \frac{1}{B} \left(\boldsymbol{\delta}^{(3)}\right)^T \mathbf{H}^{(2)} \in \mathbb{R}^{K \times d_2}$

$\frac{\partial L}{\partial \mathbf{b}^{(3)}} = \frac{1}{B} \sum_{i=1}^B \boldsymbol{\delta}^{(3)}_i \in \mathbb{R}^K$

Backprop step 3: Propagate through to $\mathbf{H}^{(2)}$

$\frac{\partial L}{\partial \mathbf{H}^{(2)}} = \boldsymbol{\delta}^{(3)} \mathbf{W}^{(3)} \in \mathbb{R}^{B \times d_2}$

Backprop step 4: Through sigmoid at layer 2

Sigmoid derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z)) = h(1 - h)$

$\boldsymbol{\delta}^{(2)} = \frac{\partial L}{\partial \mathbf{Z}^{(2)}} = \frac{\partial L}{\partial \mathbf{H}^{(2)}} \odot \mathbf{H}^{(2)} \odot (1 - \mathbf{H}^{(2)})$

This is the notorious sigmoid gradient squashing: the maximum value of $\sigma'$ is 0.25. After 10 layers, the gradient is multiplied by 0.25 ten times, reducing it by a factor of roughly $10^6$ .

Backprop step 5: Layer 2 weight gradients

$\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \frac{1}{B} \left(\boldsymbol{\delta}^{(2)}\right)^T \mathbf{H}^{(1)} \in \mathbb{R}^{d_2 \times d_1}$

$\frac{\partial L}{\partial \mathbf{b}^{(2)}} = \frac{1}{B} \sum_{i=1}^B \boldsymbol{\delta}^{(2)}_i$

Backprop step 6: Propagate to $\mathbf{H}^{(1)}$ , through sigmoid, to layer 1 weights

$\frac{\partial L}{\partial \mathbf{H}^{(1)}} = \boldsymbol{\delta}^{(2)} \mathbf{W}^{(2)}$

$\boldsymbol{\delta}^{(1)} = \frac{\partial L}{\partial \mathbf{Z}^{(1)}} = \frac{\partial L}{\partial \mathbf{H}^{(1)}} \odot \mathbf{H}^{(1)} \odot (1 - \mathbf{H}^{(1)})$

$\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \frac{1}{B} \left(\boldsymbol{\delta}^{(1)}\right)^T \mathbf{X}$

Shape check rule: the gradient of the loss with respect to any weight matrix $\mathbf{W}$ must have the same shape as $\mathbf{W}$ . If the shapes do not match, you have a transposition error - the most common backprop bug.

NumPy Implementation: 3-Layer MLP from Scratch

import numpy as np
from typing import Tuple, Dict


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Numerically stable sigmoid."""
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))


def sigmoid_grad(h: np.ndarray) -> np.ndarray:
    """Gradient of sigmoid given its output h = sigmoid(z)."""
    return h * (1 - h)


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax."""
    z_shifted = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)


def cross_entropy_loss(p: np.ndarray, y_onehot: np.ndarray) -> float:
    """Cross-entropy loss. p: predicted probabilities, y_onehot: one-hot targets."""
    eps = 1e-12
    return -np.mean(np.sum(y_onehot * np.log(p + eps), axis=1))


class ThreeLayerMLP:
    """
    3-layer MLP (2 hidden + 1 output) with sigmoid hidden activations.
    Loss: softmax + cross-entropy.
    All gradients computed manually via backprop.
    """

    def __init__(
        self,
        d0: int,       # input dimension
        d1: int,       # hidden layer 1 dimension
        d2: int,       # hidden layer 2 dimension
        K: int,        # output dimension (number of classes)
        seed: int = 42,
    ):
        rng = np.random.default_rng(seed)

        # He initialization for sigmoid (use sqrt(2/n) for ReLU, sqrt(1/n) for sigmoid/tanh)
        def he_init(fan_in: int, fan_out: int) -> np.ndarray:
            return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

        self.params: Dict[str, np.ndarray] = {
            "W1": he_init(d0, d1),
            "b1": np.zeros(d1),
            "W2": he_init(d1, d2),
            "b2": np.zeros(d2),
            "W3": he_init(d2, K),
            "b3": np.zeros(K),
        }

        # Cache for backward pass - populated during forward
        self.cache: Dict[str, np.ndarray] = {}

    def forward(self, X: np.ndarray) -> np.ndarray:
        """
        Forward pass.
        X: (B, d0)
        Returns: predicted probabilities (B, K)
        """
        W1, b1 = self.params["W1"], self.params["b1"]
        W2, b2 = self.params["W2"], self.params["b2"]
        W3, b3 = self.params["W3"], self.params["b3"]

        # Layer 1
        Z1 = X @ W1 + b1          # (B, d1)
        H1 = sigmoid(Z1)           # (B, d1)

        # Layer 2
        Z2 = H1 @ W2 + b2          # (B, d2)
        H2 = sigmoid(Z2)           # (B, d2)

        # Output layer
        Z3 = H2 @ W3 + b3          # (B, K)
        P = softmax(Z3)            # (B, K)

        # Store everything needed for backward
        self.cache = {
            "X": X, "Z1": Z1, "H1": H1,
            "Z2": Z2, "H2": H2,
            "Z3": Z3, "P": P,
        }

        return P

    def backward(self, Y_onehot: np.ndarray) -> Dict[str, np.ndarray]:
        """
        Backward pass. Computes gradients for all parameters.
        Y_onehot: (B, K) one-hot encoded targets.
        Returns: dict of gradients matching self.params keys.
        """
        X  = self.cache["X"]
        H1 = self.cache["H1"]
        H2 = self.cache["H2"]
        P  = self.cache["P"]

        W2 = self.params["W2"]
        W3 = self.params["W3"]

        B = X.shape[0]

        # --- Output layer: softmax + cross-entropy combined gradient ---
        # dL/dZ3 = (P - Y) / B
        delta3 = (P - Y_onehot) / B    # (B, K)

        dW3 = H2.T @ delta3             # (d2, K) - correct shape matches W3
        db3 = delta3.sum(axis=0)        # (K,)

        # --- Propagate through layer 2 linear to H2 ---
        dH2 = delta3 @ W3.T             # (B, d2)

        # --- Through sigmoid at layer 2 ---
        # dL/dZ2 = dL/dH2 * sigmoid'(Z2) = dL/dH2 * H2 * (1 - H2)
        delta2 = dH2 * sigmoid_grad(H2)  # (B, d2)

        dW2 = H1.T @ delta2              # (d1, d2)
        db2 = delta2.sum(axis=0)         # (d2,)

        # --- Propagate through layer 1 linear to H1 ---
        dH1 = delta2 @ W2.T              # (B, d1)

        # --- Through sigmoid at layer 1 ---
        delta1 = dH1 * sigmoid_grad(H1)  # (B, d1)

        dW1 = X.T @ delta1               # (d0, d1)
        db1 = delta1.sum(axis=0)         # (d1,)

        return {
            "W1": dW1, "b1": db1,
            "W2": dW2, "b2": db2,
            "W3": dW3, "b3": db3,
        }

    def loss(self, X: np.ndarray, Y_onehot: np.ndarray) -> float:
        P = self.forward(X)
        return cross_entropy_loss(P, Y_onehot)


def train_mlp():
    """Train the 3-layer MLP on synthetic data, verify loss decreases."""
    np.random.seed(42)

    # Synthetic 3-class dataset
    N, d0, K = 200, 4, 3
    X = np.random.randn(N, d0)
    labels = np.random.randint(0, K, size=N)
    Y_onehot = np.eye(K)[labels]

    model = ThreeLayerMLP(d0=d0, d1=32, d2=16, K=K)
    lr = 0.1

    print("Training 3-layer MLP with manual backprop:")
    for epoch in range(201):
        # Forward
        P = model.forward(X)
        loss = cross_entropy_loss(P, Y_onehot)

        if epoch % 20 == 0:
            acc = (P.argmax(axis=1) == labels).mean()
            print(f"  Epoch {epoch:>3}: loss = {loss:.4f}, acc = {acc:.3f}")

        # Backward
        grads = model.backward(Y_onehot)

        # SGD update
        for key in model.params:
            model.params[key] -= lr * grads[key]


train_mlp()

Numerical Gradient Checking

Whenever you implement a custom backward pass, verify it numerically. The finite difference approximation of the derivative:

$\frac{\partial L}{\partial w_i} \approx \frac{L(\mathbf{w} + \epsilon \mathbf{e}_i) - L(\mathbf{w} - \epsilon \mathbf{e}_i)}{2\epsilon}$

where $\mathbf{e}_i$ is the unit vector along dimension $i$ and $\epsilon$ is a small step (typically $10^{-5}$ ). The centered difference is more accurate than the one-sided difference because it is second-order accurate in $\epsilon$ .

import numpy as np
from typing import Callable, Dict


def numerical_gradient(
    loss_fn: Callable[[np.ndarray], float],
    param: np.ndarray,
    eps: float = 1e-5,
) -> np.ndarray:
    """
    Compute numerical gradient of loss_fn with respect to param.
    Uses centered finite differences: (f(x+eps) - f(x-eps)) / (2*eps).
    """
    grad = np.zeros_like(param)
    it = np.nditer(param, flags=["multi_index"], op_flags=["readwrite"])

    while not it.finished:
        idx = it.multi_index
        orig = param[idx]

        param[idx] = orig + eps
        f_plus = loss_fn(param)

        param[idx] = orig - eps
        f_minus = loss_fn(param)

        param[idx] = orig  # restore

        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()

    return grad


def gradient_check(model: ThreeLayerMLP, X: np.ndarray, Y: np.ndarray, eps: float = 1e-5):
    """
    Compare analytical (backprop) gradients to numerical gradients.
    Prints relative error for each parameter.
    """
    # Compute analytical gradients
    model.forward(X)
    analytical = model.backward(Y)

    print("\nGradient Check (analytical vs numerical):")
    print(f"{'Parameter':<12} {'Max Abs Diff':>14} {'Relative Error':>16} {'Status':>8}")
    print("-" * 55)

    all_pass = True
    for name, param in model.params.items():
        # Capture name in closure properly
        def make_loss(param_ref, other_params):
            def loss_fn(p):
                # Temporarily set this param
                saved = param_ref.copy()
                param_ref[:] = p.reshape(param_ref.shape)
                val = model.loss(X, Y)
                param_ref[:] = saved
                return val
            return loss_fn

        # Numerical gradient for this parameter
        num_grad = numerical_gradient(
            lambda p: model.loss(X, Y),
            param,
            eps=eps,
        )

        a = analytical[name]
        n = num_grad

        max_abs = np.abs(a - n).max()
        denom = max(np.abs(a).max(), np.abs(n).max(), 1e-8)
        rel_err = max_abs / denom

        status = "PASS" if rel_err < 1e-4 else "FAIL"
        if rel_err >= 1e-4:
            all_pass = False

        print(f"{name:<12} {max_abs:>14.2e} {rel_err:>16.2e} {status:>8}")

    print(f"\nOverall: {'ALL PASS' if all_pass else 'SOME FAIL'}")
    return all_pass

:::warning Gradient Check Gotcha Numerical gradient checking is slow - $O(P)$ forward passes for $P$ parameters. Never use it in a training loop. Use it only during development to verify a new custom backward implementation. Also: gradient checking fails near kinks (ReLU at exactly 0). Use small inputs and avoid degenerate cases. :::

PyTorch Custom Autograd Function

When you need to implement a custom forward/backward in PyTorch - for fused operations, custom CUDA kernels, or non-differentiable steps - use torch.autograd.Function.

import torch
import torch.nn as nn
from torch.autograd import Function


class SigmoidFunction(Function):
    """
    Custom sigmoid implementation to illustrate the autograd.Function API.
    In practice, use torch.sigmoid - this is for learning purposes.
    """

    @staticmethod
    def forward(ctx, x: torch.Tensor) -> torch.Tensor:
        """
        ctx is the context object. Save tensors here for use in backward.
        forward() must be @staticmethod - no self.
        """
        output = torch.sigmoid(x)
        # Save output (not input) because sigmoid gradient uses the output value
        ctx.save_for_backward(output)
        return output

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> torch.Tensor:
        """
        grad_output: gradient of loss w.r.t. this function's output.
        Return: gradient of loss w.r.t. this function's input.
        Must return one tensor per input to forward().
        """
        (output,) = ctx.saved_tensors
        # dL/dx = dL/d(output) * output * (1 - output)
        return grad_output * output * (1 - output)


# Usage
sigmoid_fn = SigmoidFunction.apply

x = torch.randn(4, 8, requires_grad=True)
y = sigmoid_fn(x)
loss = y.sum()
loss.backward()

print(f"Analytical grad norm: {x.grad.norm():.4f}")

# Verify with gradcheck
from torch.autograd import gradcheck

x_check = torch.randn(3, 4, dtype=torch.float64, requires_grad=True)
result = gradcheck(SigmoidFunction.apply, (x_check,), eps=1e-6, atol=1e-4)
print(f"gradcheck passed: {result}")

class CustomScaledDotAttention(Function):
    """
    Fused scaled dot-product attention backward - common in production.
    This illustrates the pattern; a real implementation would use CUDA.
    """

    @staticmethod
    def forward(ctx, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        """
        Q, K, V: (B, T, d_k)
        Output: (B, T, d_k)
        """
        d_k = Q.shape[-1]
        scale = d_k ** -0.5

        # Compute attention weights
        scores = torch.bmm(Q, K.transpose(-2, -1)) * scale  # (B, T, T)
        weights = torch.softmax(scores, dim=-1)              # (B, T, T)
        output = torch.bmm(weights, V)                       # (B, T, d_k)

        # Save for backward
        ctx.save_for_backward(Q, K, V, weights)
        ctx.scale = scale

        return output

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        """
        grad_output: (B, T, d_k) - gradient w.r.t. attention output
        Returns: gradients for Q, K, V
        """
        Q, K, V, weights = ctx.saved_tensors
        scale = ctx.scale

        # Gradient w.r.t. V: dL/dV = weights^T @ grad_output
        grad_V = torch.bmm(weights.transpose(-2, -1), grad_output)  # (B, T, d_k)

        # Gradient w.r.t. attention weights: dL/d(weights) = grad_output @ V^T
        grad_weights = torch.bmm(grad_output, V.transpose(-2, -1))   # (B, T, T)

        # Gradient through softmax
        # dL/d(scores) = softmax_backward(weights, grad_weights)
        # = weights * (grad_weights - (weights * grad_weights).sum(-1, keepdim=True))
        sum_term = (weights * grad_weights).sum(dim=-1, keepdim=True)
        grad_scores = weights * (grad_weights - sum_term) * scale     # (B, T, T)

        # Gradient w.r.t. Q: dL/dQ = grad_scores @ K
        grad_Q = torch.bmm(grad_scores, K)                            # (B, T, d_k)

        # Gradient w.r.t. K: dL/dK = grad_scores^T @ Q
        grad_K = torch.bmm(grad_scores.transpose(-2, -1), Q)          # (B, T, d_k)

        return grad_Q, grad_K, grad_V


# Verify correctness
B, T, d_k = 2, 4, 8
Q = torch.randn(B, T, d_k, dtype=torch.float64, requires_grad=True)
K = torch.randn(B, T, d_k, dtype=torch.float64, requires_grad=True)
V = torch.randn(B, T, d_k, dtype=torch.float64, requires_grad=True)

result = gradcheck(CustomScaledDotAttention.apply, (Q, K, V), eps=1e-6, atol=1e-4)
print(f"Attention gradcheck: {result}")

Micrograd: Building Autograd from Scratch

Andrej Karpathy's micrograd approach - a scalar-valued autograd engine - is the cleanest way to understand the mechanics. Each Value node wraps a scalar, stores its gradient, and records a backward closure.

import math
from typing import Union, Callable, Set


class Value:
    """
    Scalar value in a computational graph with reverse-mode autodiff.
    Implements: +, -, *, /, **, exp, log, relu, tanh, sigmoid.
    """

    def __init__(
        self,
        data: float,
        _children: tuple = (),
        _op: str = "",
        label: str = "",
    ):
        self.data = float(data)
        self.grad = 0.0                      # dL/d(self), accumulated
        self._backward: Callable = lambda: None
        self._prev: Set["Value"] = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self) -> str:
        return f"Value({self.data:.4f}, grad={self.grad:.4f}, op='{self._op}')"

    # --- Forward operations ---

    def __add__(self, other: Union["Value", float]) -> "Value":
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), "+")

        def _backward():
            # Gradient of (a + b) w.r.t. a = 1, w.r.t. b = 1
            # += because a node can be used in multiple places (multivariate chain rule)
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other: Union["Value", float]) -> "Value":
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), "*")

        def _backward():
            # Gradient of (a * b) w.r.t. a = b, w.r.t. b = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def __pow__(self, exponent: float) -> "Value":
        assert isinstance(exponent, (int, float))
        out = Value(self.data ** exponent, (self,), f"**{exponent}")

        def _backward():
            # d/dx of x^n = n * x^(n-1)
            self.grad += exponent * (self.data ** (exponent - 1)) * out.grad

        out._backward = _backward
        return out

    def exp(self) -> "Value":
        e = math.exp(self.data)
        out = Value(e, (self,), "exp")

        def _backward():
            # d/dx of exp(x) = exp(x) - stored as e
            self.grad += e * out.grad

        out._backward = _backward
        return out

    def log(self) -> "Value":
        assert self.data > 0, f"log({self.data}) undefined"
        out = Value(math.log(self.data), (self,), "log")

        def _backward():
            # d/dx of log(x) = 1/x
            self.grad += (1.0 / self.data) * out.grad

        out._backward = _backward
        return out

    def relu(self) -> "Value":
        out = Value(max(0.0, self.data), (self,), "ReLU")

        def _backward():
            # d/dx of ReLU(x) = 1 if x > 0 else 0
            self.grad += (out.data > 0) * out.grad

        out._backward = _backward
        return out

    def tanh(self) -> "Value":
        t = math.tanh(self.data)
        out = Value(t, (self,), "tanh")

        def _backward():
            # d/dx of tanh(x) = 1 - tanh^2(x)
            self.grad += (1 - t ** 2) * out.grad

        out._backward = _backward
        return out

    def sigmoid(self) -> "Value":
        s = 1.0 / (1.0 + math.exp(-self.data))
        out = Value(s, (self,), "sigmoid")

        def _backward():
            self.grad += s * (1 - s) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        """
        Run backpropagation from this node.
        Call on the loss value after the full forward pass.
        """
        # Topological sort: process each node after all its dependents
        topo: list["Value"] = []
        visited: Set[int] = set()

        def build_topo(v: "Value"):
            if id(v) not in visited:
                visited.add(id(v))
                for child in v._prev:
                    build_topo(child)
                topo.append(v)

        build_topo(self)

        # Gradient of loss w.r.t. itself = 1
        self.grad = 1.0

        # Reverse topological order: from output to inputs
        for node in reversed(topo):
            node._backward()

    # --- Python operator overloads ---
    def __neg__(self) -> "Value": return self * -1
    def __sub__(self, other): return self + (-other)
    def __truediv__(self, other): return self * other ** -1
    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other
    def __rsub__(self, other): return other + (-self)
    def __rtruediv__(self, other): return other * self ** -1


def demo_2neuron_network():
    """Demonstrate backward pass through a tiny 2-neuron network."""
    # Inputs
    x1 = Value(2.0, label="x1")
    x2 = Value(0.0, label="x2")

    # Weights and bias (hand-picked to give interesting gradients)
    w1 = Value(-3.0, label="w1")
    w2 = Value(1.0,  label="w2")
    b  = Value(6.881, label="b")

    # Layer computation
    z = x1 * w1 + x2 * w2 + b
    h = z.tanh()

    # Loss = output (so dL/dh = 1, making this equivalent to maximizing h)
    h.backward()

    print("=== Network Gradients ===")
    print(f"x1.grad = {x1.grad:.4f}  (how much increasing x1 increases h)")
    print(f"x2.grad = {x2.grad:.4f}  (how much increasing x2 increases h)")
    print(f"w1.grad = {w1.grad:.4f}  (how much increasing w1 increases h)")
    print(f"w2.grad = {w2.grad:.4f}  (how much increasing w2 increases h)")
    print(f"b.grad  = {b.grad:.4f}   (how much increasing b increases h)")

    # Cross-check with PyTorch
    import torch
    x1t = torch.tensor(2.0,  requires_grad=True)
    x2t = torch.tensor(0.0,  requires_grad=True)
    w1t = torch.tensor(-3.0, requires_grad=True)
    w2t = torch.tensor(1.0,  requires_grad=True)
    bt  = torch.tensor(6.881, requires_grad=True)

    zt  = x1t * w1t + x2t * w2t + bt
    ht  = zt.tanh()
    ht.backward()

    print("\n=== PyTorch Cross-Check ===")
    print(f"x1.grad = {x1t.grad:.4f}")
    print(f"w1.grad = {w1t.grad:.4f}")
    print(f"b.grad  = {bt.grad:.4f}")


demo_2neuron_network()

Common Backprop Bugs

:::danger Bug 1: Shape Transposition in Weight Gradient The weight gradient should be upstream_grad.T @ input (or equivalently input.T @ upstream_grad depending on convention). Swapping the transpose produces a gradient of the wrong shape that PyTorch will not automatically catch if the shapes happen to be square.

# WRONG: produces incorrect gradient for non-square W
dW = upstream_grad @ H.T     # shape may be wrong for non-square case

# CORRECT: (d_out, B).T @ (B, d_in) -> (d_out, d_in) which matches W shape
dW = (1/B) * upstream_grad.T @ H

:::

:::danger Bug 2: Forgetting optimizer.zero_grad() Without zero_grad, gradients accumulate across batches. PyTorch .grad accumulates with += by design (to support gradient accumulation). This means after two batches without zero_grad, the effective gradient is the sum of two batches, causing incorrect and growing updates.

# WRONG - gradients double, triple, ... with each batch
for batch_x, batch_y in loader:
    loss = criterion(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()

# CORRECT
for batch_x, batch_y in loader:
    optimizer.zero_grad(set_to_none=True)  # set_to_none=True is slightly faster
    loss = criterion(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()

:::

:::danger Bug 3: In-Place Operations Breaking the Graph In-place operations modify a tensor that may be stored in another node's backward closure. PyTorch detects many of these and raises a RuntimeError, but not all. Always prefer out-of-place operations during training.

# WRONG
x = torch.randn(3, requires_grad=True)
y = x + 1
y.relu_()     # in-place - modifies y which x's gradient computation needs
loss = y.sum()
loss.backward()  # RuntimeError

# CORRECT
y = (x + 1).relu()  # out-of-place

:::

:::warning Bug 4: .item() Detaches from Graph Calling .item() converts a tensor to a Python scalar, cutting the gradient graph. A common pattern is using .item() inside a loss function for logging, then accidentally returning the result.

# WRONG
def compute_loss(pred, target):
    mse = ((pred - target)**2).mean().item()  # detached!
    return torch.tensor(mse)  # new tensor, no gradient path

# CORRECT
def compute_loss(pred, target):
    return ((pred - target)**2).mean()  # stays in graph

:::

:::warning Bug 5: Retaining Graph When Not Needed (Memory Leak) If you call loss.backward(retain_graph=True) unnecessarily, the intermediate activations are never freed, causing memory to grow with every training step. Only retain the graph when you genuinely need to call backward a second time.

# MEMORY LEAK if called every step
loss.backward(retain_graph=True)

# CORRECT - only use retain_graph when truly needed
# e.g., for meta-learning or penalty-based second-order methods
loss.backward()  # graph freed automatically

:::

Inspecting Gradient Flow in PyTorch

import torch
import torch.nn as nn
from typing import Dict


def print_gradient_flow(model: nn.Module, threshold_vanish: float = 1e-6,
                         threshold_explode: float = 10.0) -> Dict[str, float]:
    """
    Print gradient norms for all named parameters.
    Call after loss.backward() and before optimizer.step().
    Returns dict of {name: grad_norm}.
    """
    print(f"\n{'Layer':<45} {'Grad Norm':>12} {'Param Norm':>12} {'Ratio':>10} Status")
    print("-" * 95)

    norms = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            g_norm = param.grad.detach().norm(2).item()
            p_norm = param.data.detach().norm(2).item()
            ratio = g_norm / (p_norm + 1e-8)
            norms[name] = g_norm

            status = "OK"
            if g_norm < threshold_vanish:
                status = "VANISHING"
            elif g_norm > threshold_explode:
                status = "EXPLODING"

            print(f"{name:<45} {g_norm:>12.4e} {p_norm:>12.4e} {ratio:>10.4e} {status}")
        else:
            norms[name] = 0.0
            print(f"{name:<45} {'NO GRAD':>12}")

    return norms


def register_gradient_hooks(model: nn.Module) -> list:
    """
    Register backward hooks to monitor gradients flowing through each layer.
    Hooks fire during backward() - useful for real-time monitoring.

    Usage:
        handles = register_gradient_hooks(model)
        loss.backward()
        for h in handles: h.remove()
    """
    handles = []

    def make_hook(name: str):
        def hook(grad: torch.Tensor) -> torch.Tensor:
            norm = grad.norm().item()
            has_nan = torch.isnan(grad).any().item()
            print(f"  Backward hook [{name}]: norm={norm:.4e}, has_nan={has_nan}")
            return grad  # return None to keep gradient unchanged, or return modified grad
        return hook

    for name, param in model.named_parameters():
        if param.requires_grad:
            handles.append(param.register_hook(make_hook(name)))

    return handles


# Demo usage
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 10),
)

x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

norms = print_gradient_flow(model)

YouTube Resources

Title	Channel	Why Watch
The spelled-out intro to neural networks and backpropagation	Andrej Karpathy	Builds micrograd live - the single best backprop from scratch video; 2.5 hours of pure clarity
Backpropagation calculus	3Blue1Brown	Beautiful visual intuition for the chain rule on computational graphs; 10 minutes
CS231n Lecture 4: Backpropagation and Neural Networks	Stanford / Andrej Karpathy	The definitive university lecture on backprop; covers Jacobians, gates, and implementation
PyTorch Autograd Explained	Elliot Waite	Detailed walkthrough of the PyTorch autograd engine internals
Automatic Differentiation in Machine Learning: a Survey	NeurIPS Tutorial	Academic depth on forward vs reverse mode and higher-order derivatives

Production Engineering Notes

Memory cost scales with depth and batch size. Every intermediate activation must be stored for the backward pass. A model with 48 transformer layers trained with batch size 64 and sequence length 2048 can require tens of gigabytes just for stored activations - before parameters and gradients. Gradient checkpointing (torch.utils.checkpoint) trades compute for memory: it does not store activations during forward, then recomputes them during backward. This roughly halves memory at the cost of ~33% more compute.

Gradient accumulation for large effective batch sizes. When GPU memory limits batch size to 8 but you want an effective batch of 64, run 8 mini-batches without zero_grad, accumulate gradients, then step. Scale the loss by 1/accumulation_steps so the effective gradient magnitude matches the full batch.

Mixed precision and gradient scaling. Training in float16 reduces memory and speeds up computation, but float16 underflows to zero for gradients smaller than ~6e-8. PyTorch's GradScaler multiplies the loss by a large factor before backward (so gradients are large enough to not underflow in float16), then divides before the optimizer step. Check scaler.get_scale() - if it drops rapidly, gradients are overflowing (reduce LR or check for bugs).

Second-order gradients. Meta-learning (MAML), gradient penalty (WGAN-GP), and knowledge distillation with gradient matching all require differentiating through the backward pass. Use create_graph=True in torch.autograd.grad(), which builds the computation graph for the gradient computation itself. Memory cost doubles or triples. Profile before using in production.

Interview Q&A

Q1: Explain backpropagation to a non-technical interviewer in one paragraph.

Backpropagation is the algorithm that teaches a neural network from its mistakes. After we run data through the network and compare the output to the correct answer, we have a single number representing the error - the loss. Backprop then asks, for each weight in the network: if I increase this weight by a tiny amount, does the loss go up or down, and by how much? The answer for all weights simultaneously is computed in a single backward pass through the network - from output to input - using the chain rule from calculus to propagate responsibility for the error backward through each layer. This gives every weight a gradient that says "move me in this direction to reduce the loss." Gradient descent then takes a small step in that direction. Repeating thousands of times is how the network learns.

Q2: Why is reverse-mode autodiff preferred over forward-mode for neural networks?

For neural networks, the loss $L$ is a scalar and the parameters $\theta$ form a high-dimensional vector - millions to billions of elements. Forward-mode autodiff computes the directional derivative $\nabla_\theta L \cdot \mathbf{v}$ for one direction $\mathbf{v}$ per pass. To compute all partial derivatives, you need one forward pass per parameter. With 100 million parameters, that is 100 million forward passes per training step - completely intractable.

Reverse-mode computes $\frac{\partial L}{\partial \theta_i}$ for all $i$ simultaneously in a single backward pass. The key insight: since $L$ is scalar, there is only one "upstream gradient" flowing backward, and each layer can compute its parameter gradients and pass the gradient downstream - all reusing the same upstream value. Reverse-mode wins when there are many inputs and one scalar output. Forward-mode wins when there are few inputs and many outputs (e.g., computing the full Jacobian of a vector-valued function with respect to two scalar inputs).

Q3: What is the vanishing gradient problem and how does it occur in backpropagation?

When backpropagating through many layers, the gradient is multiplied by the local Jacobian at each layer. For element-wise activation functions like sigmoid (maximum derivative 0.25) and tanh (maximum derivative 1.0, but typically well below 1.0 for moderate inputs), the gradient is multiplied by a number less than 1 at each layer. Across 20 layers of sigmoid activations, the gradient shrinks by at least $(0.25)^{20} \approx 10^{-12}$ - effectively zero. Early layers receive zero gradient and their weights stop updating. The network appears to train - the later layers learn fine - but the early feature extractors are frozen.

Solutions: ReLU activations (derivative exactly 1 for positive inputs), residual connections (which add an identity path so gradient can flow without multiplication), batch normalization (which keeps activations in a regime where sigmoid/tanh gradients are larger), and careful initialization (Xavier/He initialization ensures the gradient magnitudes are roughly preserved at initialization).

Q4: What happens when you call loss.backward() twice without retain_graph=True?

The first backward() traverses the computation graph, calls each node's backward function, accumulates gradients, and then frees the intermediate activations from memory. The forward pass stores intermediate values (like pre-activation tensors needed to compute the ReLU mask) in the computation graph nodes. After backward, these are released to free GPU memory.

If you call backward() again without retain_graph=True, the graph no longer exists - those intermediate tensors have been freed - and PyTorch raises RuntimeError: Trying to backward through the graph a second time. The fix is loss.backward(retain_graph=True) for the first call, which keeps the activations alive. The pattern arises legitimately in meta-learning (MAML differentiates through an inner gradient update), gradient penalty regularization (penalty involves gradient norms), and some RL algorithms.

Q5: How do you write and verify a custom backward pass for a PyTorch autograd Function?

Use torch.autograd.Function with @staticmethod forward and @staticmethod backward. In forward, call ctx.save_for_backward(...) to save any tensors needed in backward (do not save raw Python values - use ctx.my_scalar = value). In backward, retrieve saved tensors with ctx.saved_tensors and return one gradient per input to forward (return None for inputs that do not require gradients).

Verify with torch.autograd.gradcheck: convert inputs to float64 (required because finite differences in float32 have too much numerical noise), set requires_grad=True, and call gradcheck(MyFunc.apply, (x,), eps=1e-6, atol=1e-4). A passing gradcheck means your backward implementation matches finite differences to within tolerance. The most common bugs: forgetting a transpose, wrong shape for an intermediate term, or not accounting for the batch dimension mean in the gradient.

Q6: A training run produces NaN loss after 500 steps. Walk through your debugging process.

Step 1: Add torch.autograd.set_detect_anomaly(True) before the training loop. This adds a hook to every operation that detects the first NaN or Inf and prints a stack trace showing exactly which operation produced it. It is slow (2-3x slowdown), so only use it for one or two batches.

Step 2: If the NaN appears in the loss, check for log(0) - cross-entropy on zero probabilities. Add epsilon: torch.clamp(probs, min=1e-8) before log.

Step 3: If the NaN is in gradients, print gradient norms before the optimizer step. If any norm is Inf or NaN, the loss landscape has a spike. Add gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) and reduce LR.

Step 4: Check input data. assert not torch.isnan(batch_x).any() before every forward pass during the first 10 steps.

Step 5: Check for in-place operations that might corrupt stored activations. Grep for _() suffix operations and += on non-leaf tensors.

Step 6: If using mixed precision, check the GradScaler - if the scale drops to 1.0, gradients are overflowing in float16. Add explicit float32 casting for numerically sensitive operations.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Backpropagation demo on the EngineersOfAI Playground - no code required.

:::

The Production Scenario​

Why Backpropagation Exists​

Historical Context​

The Chain Rule: From Scalar to Matrix​

Scalar Chain Rule​

Vector Chain Rule and Jacobians​

Why We Never Form Full Jacobians​

Jacobians for Common Operations​

Reverse-Mode vs Forward-Mode Autodiff​

Forward-Mode: Directional Derivatives​

Reverse-Mode: Backpropagation​

Concrete Example​

Computational Graphs: Forward Builds, Backward Traverses​

Full Derivation: 3-Layer MLP​

NumPy Implementation: 3-Layer MLP from Scratch​

Numerical Gradient Checking​

PyTorch Custom Autograd Function​

Micrograd: Building Autograd from Scratch​

Common Backprop Bugs​

Inspecting Gradient Flow in PyTorch​

YouTube Resources​

Production Engineering Notes​

Interview Q&A​