Backpropagation - The Engine of Deep Learning

Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, ML Compiler Eng

The Real Interview Moment

You are in a Google MLE on-site interview. The interviewer draws a simple two-layer neural network on the whiteboard - input layer, one hidden layer with ReLU, one output layer with softmax, cross-entropy loss. She hands you the marker and says: "Compute the gradient of the loss with respect to the weights in the first layer. Show every step."

You start writing. You get through the softmax-cross-entropy gradient (you memorized that one), but then you hit the ReLU layer and hesitate. "Does the chain rule multiply or... compose?" The interviewer's expression does not change, but you can feel the evaluation shifting. She follows up: "Now tell me - what happens to this gradient when you have 100 layers instead of 2? And why does PyTorch compute all these gradients efficiently?"

This is backpropagation in interviews. Not a conceptual overview - a hands-on mathematical derivation where you must trace gradients through every operation, understand what goes wrong at scale, and explain how frameworks like PyTorch make it practical. This page gives you every tool you need.

What You Will Master

State the chain rule in both scalar and matrix forms
Draw computational graphs for arbitrary expressions and trace forward/backward passes
Derive backpropagation for a 2-layer network with ReLU and softmax-cross-entropy loss
Compute Jacobians for common operations: matrix multiply, ReLU, softmax, batch norm
Explain vanishing gradients mathematically and connect them to activation functions and depth
Explain exploding gradients and the role of gradient clipping
Distinguish forward-mode and reverse-mode automatic differentiation
Justify why reverse-mode AD is used in deep learning (one backward pass for all parameters)
Implement a simple autograd engine conceptually (define forward, store graph, backward)
Answer backpropagation interview questions at whiteboard speed

Self-Assessment: Where Are You Now?

Skill	1 -- Cannot	2 -- Vaguely	3 -- Can Explain	4 -- Can Derive	5 -- Can Teach	Your Score
State the chain rule for compositions						___
Draw a computational graph for $L = (wx + b - y)^2$						___
Trace forward pass values on a graph						___
Trace backward pass gradients on a graph						___
Derive gradients for a 2-layer network						___
Explain vanishing gradients mathematically						___
Explain forward vs reverse mode AD						___
Implement simple autograd conceptually						___

Target: All 4s and 5s before your interview.

Part 1 - The Chain Rule: From Calculus to Computation

Scalar Chain Rule

If $y = f(g(x))$ , then:

$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$

This is the single most important equation in deep learning. Every gradient computation in every neural network is an application of this rule, composed many times.

Example: Let $y = (3x + 2)^2$ . Define $g = 3x + 2$ and $y = g^2$ .

$\frac{dg}{dx} = 3, \quad \frac{dy}{dg} = 2g = 2(3x+2), \quad \frac{dy}{dx} = 2(3x+2) \cdot 3 = 6(3x+2)$

Multivariate Chain Rule

When a variable influences the output through multiple paths, we sum over all paths:

$\frac{\partial L}{\partial x} = \sum_{i} \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial x}$

This is critical in neural networks because a single weight can affect the loss through multiple neurons.

60-Second Answer

"Backpropagation is just the chain rule applied systematically on a computational graph. We do a forward pass to compute the loss, then a backward pass where we propagate gradients from the loss back to every parameter. At each node, we multiply the incoming gradient by the local derivative - that is the chain rule. The key insight is that reverse-mode differentiation lets us compute gradients with respect to all parameters in a single backward pass, which is why it is used in deep learning where we have millions of parameters but a single scalar loss."

Vector/Matrix Chain Rule

For neural networks, we work with vectors and matrices. If $\mathbf{y} = f(\mathbf{x})$ where $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{y} \in \mathbb{R}^m$ , the Jacobian is:

$J = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$

For a chain $L = f(\mathbf{g}(\mathbf{x}))$ where $L$ is scalar:

$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{g}} \cdot J_g$

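In practice, we never explicitly construct Jacobian matrices. We work with vector-Jacobian products (VJPs) - multiplying the upstream gradient vector by the Jacobian without materializing the full matrix. (The formula above treats $\frac{\partial L}{\partial \mathbf{g}}$ as a row vector; with column-vector gradients, as in the derivations later on this page, the same product is written $J_g^T \frac{\partial L}{\partial \mathbf{g}}$ .)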

Common Trap

Do NOT confuse the Jacobian with the gradient. The gradient $\nabla_x L$ is a vector of partial derivatives of a scalar with respect to a vector. The Jacobian is a matrix of partial derivatives of a vector with respect to a vector. In backprop, we always propagate gradients of the scalar loss, so we compute VJPs, not full Jacobians.
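
To make the VJP idea concrete, here is a minimal sketch for softmax in plain Python. The function names (`softmax_jacobian`, `vjp_explicit`, `softmax_vjp`) and the sample numbers are illustrative assumptions, not taken from any framework. It compares the product computed through the full Jacobian with the closed-form product that never builds the matrix.

```python
import math

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(s):
    # Full n x n Jacobian: J[i][j] = s_i * (delta_ij - s_j).
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

def vjp_explicit(v, J):
    # Row vector times matrix: (v^T J)_j = sum_i v_i * J[i][j].
    n = len(v)
    return [sum(v[i] * J[i][j] for i in range(n)) for j in range(n)]

def softmax_vjp(v, s):
    # Same product in O(n) without building J: s_j * (v_j - <v, s>).
    dot = sum(vi * si for vi, si in zip(v, s))
    return [si * (vi - dot) for vi, si in zip(v, s)]

z = [1.0, 2.0, 0.5]
upstream = [0.3, -1.2, 0.7]   # stands in for the gradient arriving from later layers
s = softmax(z)
explicit = vjp_explicit(upstream, softmax_jacobian(s))
implicit = softmax_vjp(upstream, s)
print(explicit)
print(implicit)
```

The two results agree up to floating-point rounding, but the explicit route costs $O(n^2)$ memory and time while the closed form costs $O(n)$. That gap is the practical reason frameworks implement a VJP rule per operation rather than a Jacobian.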

Part 2 - Computational Graphs

What Is a Computational Graph?

A computational graph is a directed acyclic graph (DAG) where:

Leaf nodes are inputs and parameters
Internal nodes are operations (add, multiply, matmul, ReLU, etc.)
Edges represent data flow
The root is the loss value

Every expression can be decomposed into a computational graph. This is exactly what PyTorch and TensorFlow do under the hood.

Example: Linear Regression Loss

Consider $L = (wx + b - y)^2$ for a single data point.

Figure: Computational graph for $L = (wx + b - y)^2$

Forward Pass

We compute values left-to-right (inputs to loss). Let $w = 2, x = 3, b = 1, y = 8$ .

Node	Computation	Value
mul	$w \cdot x = 2 \cdot 3$	6
add1	$\text{mul} + b = 6 + 1$	7
sub	$\text{add1} - y = 7 - 8$	$-1$
sq	$\text{sub}^2 = (-1)^2$	1
L	= sq	1

Backward Pass

We compute gradients right-to-left (loss to parameters). We need $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ .

Start at the loss: $\frac{\partial L}{\partial L} = 1$ .

Node	Local Derivative	Upstream Gradient	Gradient
sq	$\frac{\partial}{\partial \text{sub}}(\text{sub}^2) = 2 \cdot \text{sub} = 2(-1) = -2$	1	$-2$
sub	$\frac{\partial \text{sub}}{\partial \text{add1}} = 1$	$-2$	$-2$
add1 (to b)	$\frac{\partial \text{add1}}{\partial b} = 1$	$-2$	$\frac{\partial L}{\partial b} = -2$
add1 (to mul)	$\frac{\partial \text{add1}}{\partial \text{mul}} = 1$	$-2$	$-2$
mul (to w)	$\frac{\partial \text{mul}}{\partial w} = x = 3$	$-2$	$\frac{\partial L}{\partial w} = -2 \cdot 3 = -6$

Result: $\frac{\partial L}{\partial w} = -6$ , $\frac{\partial L}{\partial b} = -2$ .

Verification: $L = (wx + b - y)^2$ . By direct differentiation: $\frac{\partial L}{\partial w} = 2(wx + b - y) \cdot x = 2(-1)(3) = -6$ . Correct.
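
The two tables above translate line for line into code. This is a sketch of the same trace in plain Python, with variable names chosen to mirror the node names in the tables (`mul`, `add1`, `sub`); the `d`-prefixed names are an assumed convention for "gradient of $L$ with respect to this node".

```python
# Forward pass: compute and keep every intermediate value.
w, x, b, y = 2.0, 3.0, 1.0, 8.0
mul = w * x          # 6
add1 = mul + b       # 7
sub = add1 - y       # -1
L = sub ** 2         # 1

# Backward pass: walk the graph in reverse, multiplying the upstream
# gradient by each node's local derivative.
dL = 1.0
dsub = dL * 2 * sub          # d(sub^2)/dsub = 2 * sub
dadd1 = dsub * 1.0           # d(add1 - y)/dadd1 = 1
db = dadd1 * 1.0             # d(mul + b)/db = 1
dmul = dadd1 * 1.0           # d(mul + b)/dmul = 1
dw = dmul * x                # d(w * x)/dw = x
print(dw, db)                # -6.0 -2.0
```

Note that the backward pass reads `sub` and `x` from the forward pass. This is the reason reverse-mode differentiation has to keep intermediate values around.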

Interviewer's Perspective

"I ask candidates to trace a backward pass on a computational graph because it tells me whether they truly understand backprop or just memorized formulas. If they can handle the graph for a 2-layer network, they will not panic when I ask about more complex architectures. The key thing I watch for is: do they correctly handle nodes where the gradient flows through multiple paths? That is where most candidates make mistakes."

Fan-Out: When a Variable Is Used Twice

Consider $L = x \cdot x = x^2$ . The node $x$ fans out to two inputs of the multiply operation.

In the backward pass, when a variable feeds into multiple downstream operations, we sum the gradients from all paths:

$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial \text{mul}} \cdot \frac{\partial \text{mul}}{\partial x_{\text{left}}} + \frac{\partial L}{\partial \text{mul}} \cdot \frac{\partial \text{mul}}{\partial x_{\text{right}}} = 1 \cdot x + 1 \cdot x = 2x$

This summation rule is the multivariate chain rule in action and is one of the most common sources of interview errors.
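
A slightly less trivial fan-out case can be checked numerically. The sketch below uses $f(x) = x \cdot \sin(x)$, an example chosen for illustration and not taken from the text above, where $x$ feeds both the sin node and the multiply node. It sums the two gradient contributions and compares against a central finite difference.

```python
import math

x = 1.3
# Forward: x feeds two consumers, the sin node and the mul node.
s = math.sin(x)
f = x * s

# Backward: each consumer sends a gradient back to x; the contributions add.
df = 1.0
dx_via_mul = df * s                  # mul's local derivative w.r.t. its left input
ds = df * x                          # mul's local derivative w.r.t. its right input
dx_via_sin = ds * math.cos(x)        # sin's local derivative
dx = dx_via_mul + dx_via_sin

# Central finite difference as an independent check.
eps = 1e-6
numeric = ((x + eps) * math.sin(x + eps) - (x - eps) * math.sin(x - eps)) / (2 * eps)
print(dx, numeric)
```

Dropping either contribution gives a gradient that looks plausible but fails the numerical check, which is the mistake the summation rule guards against.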

Part 3 - Backpropagation in a 2-Layer Network

This is the canonical interview derivation. Memorize it, but more importantly, understand every step.

Network Setup

Input: $\mathbf{x} \in \mathbb{R}^{d}$ (a single training example)
Layer 1: $\mathbf{z}_1 = W_1 \mathbf{x} + \mathbf{b}_1$ , then $\mathbf{h}_1 = \text{ReLU}(\mathbf{z}_1)$
Layer 2: $\mathbf{z}_2 = W_2 \mathbf{h}_1 + \mathbf{b}_2$ , then $\hat{\mathbf{y}} = \text{softmax}(\mathbf{z}_2)$
Loss: $L = -\sum_k y_k \log \hat{y}_k$ (cross-entropy)

Where $W_1 \in \mathbb{R}^{h \times d}$ , $W_2 \in \mathbb{R}^{c \times h}$ , $h$ = hidden size, $c$ = number of classes.

Figure: 2-layer network, forward and backward pass

Forward Pass

$\mathbf{z}_1 = W_1 \mathbf{x} + \mathbf{b}_1$
$\mathbf{h}_1 = \max(0, \mathbf{z}_1)$ (element-wise ReLU)
$\mathbf{z}_2 = W_2 \mathbf{h}_1 + \mathbf{b}_2$
$\hat{\mathbf{y}} = \text{softmax}(\mathbf{z}_2)$
$L = -\sum_k y_k \log \hat{y}_k$

Backward Pass - Step by Step

Step 1: Gradient of loss w.r.t. softmax input $\mathbf{z}_2$

The combined softmax + cross-entropy gradient has a beautifully simple form:

$\frac{\partial L}{\partial \mathbf{z}_2} = \hat{\mathbf{y}} - \mathbf{y}$

This is one of the most elegant results in deep learning. The gradient is simply the prediction minus the one-hot label. If the true class is $k$ , then the gradient for class $k$ is $\hat{y}_k - 1$ and for all other classes $j$ it is $\hat{y}_j$ .

Company Variation

At Google and Meta, you may be asked to derive this result from scratch - first computing $\frac{\partial L}{\partial \hat{\mathbf{y}}}$ and then $\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{z}_2}$ (the Jacobian of softmax). At Amazon and startups, knowing the final result is usually sufficient.
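
For reference, the from-scratch derivation can be sketched in three lines. It assumes the labels sum to one ( $\sum_k y_k = 1$ , which holds for one-hot $\mathbf{y}$ ) and uses the Kronecker delta $\delta_{kj}$ , notation that is introduced here and not elsewhere on this page.

$\frac{\partial L}{\partial \hat{y}_k} = -\frac{y_k}{\hat{y}_k}, \qquad \frac{\partial \hat{y}_k}{\partial z_{2,j}} = \hat{y}_k(\delta_{kj} - \hat{y}_j)$

$\frac{\partial L}{\partial z_{2,j}} = \sum_k \frac{\partial L}{\partial \hat{y}_k} \cdot \frac{\partial \hat{y}_k}{\partial z_{2,j}} = -\sum_k y_k(\delta_{kj} - \hat{y}_j)$

$= -y_j + \hat{y}_j \sum_k y_k = \hat{y}_j - y_j$

The $\hat{y}_k$ in the denominator of the loss derivative cancels against the $\hat{y}_k$ in the softmax Jacobian, which is why the combined gradient is so simple.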

Step 2: Gradients for Layer 2 parameters

$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \mathbf{z}_2} \cdot \mathbf{h}_1^T = (\hat{\mathbf{y}} - \mathbf{y}) \mathbf{h}_1^T$

$\frac{\partial L}{\partial \mathbf{b}_2} = \frac{\partial L}{\partial \mathbf{z}_2} = \hat{\mathbf{y}} - \mathbf{y}$

Note: $\frac{\partial L}{\partial W_2}$ has shape $(c \times h)$ - same as $W_2$ . This is the outer product of the error signal and the hidden activations.

Step 3: Gradient flowing back to hidden layer

$\frac{\partial L}{\partial \mathbf{h}_1} = W_2^T \frac{\partial L}{\partial \mathbf{z}_2} = W_2^T (\hat{\mathbf{y}} - \mathbf{y})$

This is the key step - the gradient flows backward through the weight matrix by multiplying with its transpose.

Step 4: Gradient through ReLU

$\frac{\partial L}{\partial \mathbf{z}_1} = \frac{\partial L}{\partial \mathbf{h}_1} \odot \mathbb{1}[\mathbf{z}_1 > 0]$

where $\odot$ is element-wise multiplication and $\mathbb{1}[\mathbf{z}_1 > 0]$ is 1 where $\mathbf{z}_1 > 0$ and 0 elsewhere. ReLU passes the gradient through unchanged where the input was positive, and blocks it where the input was negative or zero.

Step 5: Gradients for Layer 1 parameters

$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{z}_1} \cdot \mathbf{x}^T$

$\frac{\partial L}{\partial \mathbf{b}_1} = \frac{\partial L}{\partial \mathbf{z}_1}$

Summary of Gradient Shapes

Quantity	Shape	Expression
$\frac{\partial L}{\partial \mathbf{z}_2}$	$(c,)$	$\hat{\mathbf{y}} - \mathbf{y}$
$\frac{\partial L}{\partial W_2}$	$(c, h)$	$(\hat{\mathbf{y}} - \mathbf{y})\mathbf{h}_1^T$
$\frac{\partial L}{\partial \mathbf{b}_2}$	$(c,)$	$\hat{\mathbf{y}} - \mathbf{y}$
$\frac{\partial L}{\partial \mathbf{h}_1}$	$(h,)$	$W_2^T(\hat{\mathbf{y}} - \mathbf{y})$
$\frac{\partial L}{\partial \mathbf{z}_1}$	$(h,)$	$W_2^T(\hat{\mathbf{y}} - \mathbf{y}) \odot \mathbb{1}[\mathbf{z}_1 > 0]$
$\frac{\partial L}{\partial W_1}$	$(h, d)$	$\frac{\partial L}{\partial \mathbf{z}_1} \cdot \mathbf{x}^T$
$\frac{\partial L}{\partial \mathbf{b}_1}$	$(h,)$	$\frac{\partial L}{\partial \mathbf{z}_1}$

Instant Rejection

If you write $\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{z}_1} \cdot \mathbf{x}$ (without the transpose), you will get an instant shape mismatch. Always check: the gradient of $L$ w.r.t. a parameter must have the same shape as the parameter. $W_1$ is $(h, d)$ , so $\frac{\partial L}{\partial W_1}$ must be $(h, d)$ - which requires the outer product $\frac{\partial L}{\partial \mathbf{z}_1} \cdot \mathbf{x}^T$ , where the shapes are $(h, 1) \times (1, d) = (h, d)$ .

Part 4 - Vanishing and Exploding Gradients

The Core Problem

Consider a deep network with $N$ layers. The gradient of the loss w.r.t. the first layer's weights involves a product of $N-1$ Jacobians:

$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{z}_N} \cdot \prod_{i=2}^{N} \frac{\partial \mathbf{z}_i}{\partial \mathbf{z}_{i-1}} \cdot \frac{\partial \mathbf{z}_1}{\partial W_1}$

Each factor $\frac{\partial \mathbf{z}_i}{\partial \mathbf{z}_{i-1}}$ involves the weight matrix and the activation derivative. If these factors have singular values consistently less than 1, the product shrinks exponentially with depth. If consistently greater than 1, it grows exponentially.

Vanishing Gradients - Mathematical Analysis

For a network with sigmoid activations, the derivative of sigmoid is:

$\sigma'(z) = \sigma(z)(1 - \sigma(z)) \leq 0.25$

The maximum value is 0.25 (at $z = 0$ ). So at each layer, the gradient is multiplied by at most 0.25 times the weight magnitude. For a 10-layer network:

$\left|\frac{\partial L}{\partial W_1}\right| \propto (0.25)^{10} \cdot \prod \|W_i\| \approx 10^{-6} \cdot \prod \|W_i\|$

Even with well-initialized weights, the gradient reaching the first layer is vanishingly small. The first layers stop learning while the last layers train normally.
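
The decay is easy to reproduce with a scalar stand-in for the Jacobian product. The sketch below assumes one unit per layer and unit weights, a deliberate simplification of the matrix case, and multiplies one factor $w \cdot \sigma'(z)$ per layer. The helper name `chain_gradient` is illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def chain_gradient(pre_activations, weights):
    # Scalar stand-in for the Jacobian product: one factor w * sigma'(z) per layer.
    g = 1.0
    for z, w in zip(pre_activations, weights):
        g *= w * sigmoid_grad(z)
    return g

depth = 10
best_case = chain_gradient([0.0] * depth, [1.0] * depth)   # every unit at z = 0
saturated = chain_gradient([4.0] * depth, [1.0] * depth)   # units pushed into the flat tail
print(best_case)    # 0.25 ** 10, about 9.5e-07
print(saturated)    # many orders of magnitude smaller still
```

The $z = 0$ case is the best the sigmoid can do. Once units saturate, the per-layer factor falls well below 0.25 and the gradient disappears even faster.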

Figure: Vanishing gradients across layers with sigmoid activations

Exploding Gradients - Mathematical Analysis

Conversely, if the weight matrices have large spectral norms, the gradient product grows exponentially:

$\left|\frac{\partial L}{\partial W_1}\right| \propto \prod_{i=2}^{N} \|W_i\| \cdot \|\text{diag}(\sigma'(\mathbf{z}_i))\|$

With 100 layers and spectral norms slightly above 1 (say 1.5):

$(1.5)^{100} \approx 4 \times 10^{17}$

The gradients become astronomically large, causing NaN values or oscillating parameters.

The Gradient Flow Spectrum

Figure: Gradient flow spectrum - vanishing, stable, and exploding regimes

Complete Solution Table

Problem	Solution	How It Helps	Introduced In
Vanishing gradients	ReLU activation	$\text{ReLU}'(z) = 1$ for $z > 0$ , no shrinkage	Nair & Hinton, 2010
Vanishing gradients	Skip connections (ResNet)	Gradient has additive path: $\frac{\partial}{\partial x}(F(x) + x) = F'(x) + I$	He et al., 2015
Vanishing gradients	LSTM gating	Constant error carousel preserves gradient flow	Hochreiter & Schmidhuber, 1997
Vanishing gradients	He initialization	Sets $\text{Var}(W) = \frac{2}{n_{\text{in}}}$ to keep activation variance stable	He et al., 2015
Vanishing gradients	BatchNorm / LayerNorm	Prevents activations from reaching saturated regions	Ioffe & Szegedy, 2015
Exploding gradients	Gradient clipping	Caps gradient norm: if $\|\mathbf{g}\| > \tau$ then $\mathbf{g} \leftarrow \tau \cdot \frac{\mathbf{g}}{\|\mathbf{g}\|}$	Pascanu et al., 2013
Exploding gradients	Weight regularization	L2 penalty keeps weight magnitudes bounded	Standard
Exploding gradients	Lower learning rate	Smaller parameter updates even with large gradients	Standard
Both	Proper initialization	Xavier for sigmoid/tanh, He for ReLU - matches activation function	Glorot & Bengio, 2010
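
The gradient clipping rule from the table can be sketched in a few lines. This version assumes gradients are stored as a list of flat Python lists, one per parameter tensor, and clips by the global norm across all of them. The function name `clip_by_global_norm` is illustrative and is not a framework API.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # grads: list of flat gradient lists, one per parameter tensor.
    total = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if total <= max_norm:
        return grads, total
    # Rescale every entry by the same factor so the direction is preserved.
    scale = max_norm / total
    return [[g * scale for g in tensor] for tensor in grads], total

grads = [[3.0, 4.0], [12.0]]          # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Clipping by norm rescales the whole gradient and keeps its direction, unlike clipping each component independently, which can change the update direction.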

Initialization Deep Dive

The goal of initialization is to keep the variance of activations (and gradients) approximately constant across layers.

Xavier initialization (for sigmoid/tanh): $W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$

Derivation: For a linear layer $z = Wx$ with $\text{Var}(x_i) = v$ , we want $\text{Var}(z_j) = v$ . Since $z_j = \sum_{i=1}^{n_\text{in}} w_{ji} x_i$ :

$\text{Var}(z_j) = n_{\text{in}} \cdot \text{Var}(w_{ji}) \cdot \text{Var}(x_i) = n_{\text{in}} \cdot \text{Var}(W) \cdot v$

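Setting $\text{Var}(z_j) = v$ gives $\text{Var}(W) = \frac{1}{n_{\text{in}}}$ (this assumes zero-mean, independent weights and inputs). The same argument applied to the backward pass gives $\text{Var}(W) = \frac{1}{n_{\text{out}}}$ . Both cannot hold unless $n_{\text{in}} = n_{\text{out}}$ , so Xavier initialization compromises by averaging the two fan sizes: $\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$ .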

He initialization (for ReLU): $W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$

ReLU zeroes out half the activations on average, halving the variance. The factor of 2 compensates for this.
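
The factor of 2 can be seen in a small simulation. The sketch below is an assumed toy setup (width 64, depth 10, one random input, no biases) that pushes a vector through a stack of ReLU layers twice: once with weight variance $\frac{2}{n_{\text{in}}}$ and once with $\frac{1}{n_{\text{in}}}$ . Both runs share a seed so that the only difference is the scale factor.

```python
import math
import random

def relu_forward_mean_square(width, depth, weight_var_scale, seed=0):
    # Push one random input through `depth` ReLU layers and return the mean
    # square of the final activations. Weights ~ N(0, weight_var_scale / width).
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = math.sqrt(weight_var_scale / width)
    for _ in range(depth):
        z = [sum(rng.gauss(0.0, std) * hi for hi in h) for _ in range(width)]
        h = [max(0.0, zj) for zj in z]
    return sum(v * v for v in h) / width

he = relu_forward_mean_square(width=64, depth=10, weight_var_scale=2.0)
naive = relu_forward_mean_square(width=64, depth=10, weight_var_scale=1.0)
print("He-scaled:", he)
print("1/n_in   :", naive)
```

With the $\frac{1}{n_{\text{in}}}$ scaling the activation energy roughly halves at every ReLU layer, so after 10 layers it is about three orders of magnitude smaller than in the He-scaled run.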

Interviewer's Perspective

"When I ask about vanishing gradients, I want the candidate to give me the mathematical reason - the product of Jacobians shrinking exponentially - not just say 'gradients get small.' Then I want them to connect it to specific solutions. A really strong candidate will point out that ResNet skip connections create an additive gradient path: $\frac{\partial}{\partial x}(F(x) + x) = F'(x) + I$ , where the identity term gives the gradient a direct route to earlier layers that is not scaled by $F'(x)$ , so it is not shrunk layer after layer."

Part 5 - Automatic Differentiation

Why Not Symbolic or Numerical Differentiation?

Symbolic differentiation (like in Mathematica) produces exact analytical derivatives but suffers from expression swell - expressions grow exponentially with depth, making it impractical for neural networks with millions of operations.

Numerical differentiation uses finite differences: $\frac{\partial f}{\partial x_i} \approx \frac{f(x + \epsilon e_i) - f(x)}{\epsilon}$ . This requires one extra forward pass per parameter - millions of forward passes for a neural network. It also suffers from numerical error: choosing $\epsilon$ too large introduces truncation error, and choosing it too small introduces floating-point round-off error. It remains useful for gradient checking on a handful of parameters.

Automatic differentiation (AD) is the best of both worlds: exact derivatives (no approximation error) computed efficiently by decomposing the program into elementary operations and applying the chain rule.
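
The $\epsilon$ tradeoff in numerical differentiation is easy to observe. This sketch applies the forward difference to $\sin(x)$ at $x = 1$ , a test function picked for illustration because its exact derivative $\cos(1)$ is known, and reports the error for three step sizes.

```python
import math

def forward_difference(f, x, eps):
    return (f(x + eps) - f(x)) / eps

exact = math.cos(1.0)                      # d/dx sin(x) at x = 1
errors = {eps: abs(forward_difference(math.sin, 1.0, eps) - exact)
          for eps in (1e-1, 1e-6, 1e-15)}
for eps, err in errors.items():
    print(f"eps={eps:g}  abs error={err:.2e}")
```

The large step suffers from truncation error and the tiny step from round-off, with the middle value doing best. AD avoids this tuning entirely because it applies exact local derivative rules.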

Forward-Mode AD

Forward-mode AD propagates derivatives alongside the computation. For each intermediate variable, we compute both its value and its derivative with respect to one chosen input variable.

Given $f(x_1, x_2)$ , to compute $\frac{\partial f}{\partial x_1}$ :

Seed: $\dot{x}_1 = 1, \dot{x}_2 = 0$
At each operation $v = \text{op}(a, b)$ : compute $\dot{v} = \frac{\partial \text{op}}{\partial a}\dot{a} + \frac{\partial \text{op}}{\partial b}\dot{b}$

This computes a Jacobian-vector product (JVP): $J \cdot \mathbf{v}$ where $\mathbf{v}$ is the seed vector.

Cost: One forward pass computes the derivative w.r.t. one input variable. For $n$ inputs, we need $n$ forward passes.

Example: Compute $\frac{\partial f}{\partial x_1}$ for $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ at $x_1 = 2, x_2 = 3$ .

Step	Value	Derivative ( $\dot{v}$ , seed $\dot{x}_1 = 1$ )
$v_1 = x_1$	2	$\dot{v}_1 = 1$
$v_2 = x_2$	3	$\dot{v}_2 = 0$
$v_3 = v_1 \cdot v_2$	6	$\dot{v}_3 = \dot{v}_1 \cdot v_2 + v_1 \cdot \dot{v}_2 = 3 + 0 = 3$
$v_4 = \sin(v_1)$	0.909	$\dot{v}_4 = \cos(v_1) \cdot \dot{v}_1 = \cos(2) \cdot 1 = -0.416$
$v_5 = v_3 + v_4$	6.909	$\dot{v}_5 = \dot{v}_3 + \dot{v}_4 = 3 + (-0.416) = 2.584$

Result: $\frac{\partial f}{\partial x_1} = 2.584$ .
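
Forward-mode AD is commonly implemented with dual numbers, where each value carries its derivative along with it. The following is a minimal sketch of that idea for the example above. The `Dual` class and `dsin` helper are illustrative names and support only the operations this example needs.

```python
import math

class Dual:
    """A value paired with its derivative w.r.t. one seeded input."""
    def __init__(self, value, dot=0.0):
        self.value = value
        self.dot = dot

    def __add__(self, other):
        return Dual(self.value + other.value, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.value * other.value,
                    self.dot * other.value + self.value * other.dot)

def dsin(a):
    # Chain rule for sin: derivative is cos(value) times the incoming derivative.
    return Dual(math.sin(a.value), math.cos(a.value) * a.dot)

def f(x1, x2):
    return x1 * x2 + dsin(x1)

# Seed x1 with 1 to get df/dx1; a second pass seeded on x2 gives df/dx2.
out1 = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
out2 = f(Dual(2.0, 0.0), Dual(3.0, 1.0))
print(out1.value, out1.dot)   # value about 6.909, df/dx1 about 2.584
print(out2.dot)               # df/dx2 = x1 = 2.0
```

Getting both partial derivatives took two passes, one per input. That per-input cost is what makes forward mode a poor fit for functions with millions of inputs.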

Reverse-Mode AD (Backpropagation)

Reverse-mode AD does a forward pass first (storing intermediate values), then a backward pass that computes derivatives with respect to all inputs in a single pass.

Forward: Compute and store all intermediate values
Backward: Starting from $\bar{L} = 1$ , propagate $\bar{v}_i = \sum_{j: v_i \to v_j} \bar{v}_j \frac{\partial v_j}{\partial v_i}$

This computes a vector-Jacobian product (VJP): $\mathbf{v}^T \cdot J$ where $\mathbf{v}$ is the upstream gradient.

Cost: One forward pass + one backward pass computes derivatives w.r.t. all input variables.
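
The define-forward, store-graph, run-backward recipe fits in a small scalar autograd engine. This is a conceptual sketch and does not reflect how any particular framework is implemented. The `Value` class, its attributes, and the three supported operations are assumptions made for illustration.

```python
import math

class Value:
    """Scalar node that records how it was produced so gradients can flow back."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def sin(self):
        return Value(math.sin(self.data), (self,), (math.cos(self.data),))

    def backward(self):
        # Topologically order the graph so each node is processed only after
        # every node that consumes it.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for p in node._parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local   # += sums over fan-out paths

x1, x2 = Value(2.0), Value(3.0)
f = x1 * x2 + x1.sin()
f.backward()
print(x1.grad, x2.grad)   # about 2.584 and 2.0, from a single backward pass
```

The `+=` in `backward` is the fan-out summation rule: `x1` is used by both the multiply and the sine, and its gradient accumulates both contributions. One call to `backward` fills in the gradient of every input, which is the property that makes reverse mode the right fit for a scalar loss.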

Why Reverse Mode Wins for Deep Learning

Aspect	Forward Mode	Reverse Mode
Cost per input variable	$O(1)$ passes	Amortized across all inputs
Cost per output variable	Amortized across all outputs	$O(1)$ passes
Total cost for $n$ inputs, 1 output	$O(n)$ passes	$O(1)$ passes (one fwd + one bwd)
Memory	Low (no storage needed)	High (must store all intermediates)
Best when	Few inputs, many outputs	Many inputs, few outputs
Computes	JVP: $J \cdot \mathbf{v}$	VJP: $\mathbf{v}^T \cdot J$

Neural networks have millions of parameters (inputs to the loss function) but a single scalar loss (one output). Reverse-mode AD computes all gradients in one backward pass. This is why backpropagation is reverse-mode AD.

Figure: Forward mode vs reverse mode automatic differentiation

Common Trap

Do NOT say "backpropagation IS gradient descent." Backpropagation computes gradients. Gradient descent uses those gradients to update parameters. They are separate steps. Backprop answers "which direction?", gradient descent answers "how far in that direction?" You can use backprop with Adam, SGD, or any other optimizer.

The Memory-Compute Tradeoff

Reverse-mode AD must store all intermediate activations from the forward pass (needed for the backward pass). For a network with $N$ layers, batch size $B$ , and hidden size $h$ , this is $O(N \cdot B \cdot h)$ memory.

For a Transformer with 96 layers, batch size 32, sequence length 2048, and hidden size 12288 (GPT-3 scale):

Activation memory per layer: approximately $32 \times 2048 \times 12288 \times 4$ bytes (float32) = ~3 GB
Total for 96 layers: ~288 GB - far exceeding GPU memory

Gradient checkpointing (also called activation checkpointing) trades memory for compute:

Only store activations at every $k$ -th layer
During the backward pass, recompute the missing activations from the nearest checkpoint
Reduces memory from $O(N)$ to $O(\sqrt{N})$ with optimal checkpoint placement
Cost: one additional forward pass (roughly 33% more compute)

This is essential for training large models and is a common interview topic at companies training foundation models.
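
The recompute-from-checkpoint idea can be sketched on a toy chain of scalar layers $a_{i+1} = \tanh(w_i a_i)$ . Everything here is an illustrative assumption (the layer form, the function names, taking the final activation as the "loss"). Real implementations checkpoint whole tensors and modules, but the control flow is the same.

```python
import math

def forward_full(a0, weights):
    # Store every activation: memory grows linearly with depth.
    acts = [a0]
    for w in weights:
        acts.append(math.tanh(w * acts[-1]))
    return acts

def backward_segment(acts, weights, grad_out):
    # acts[i] is the input to layer i and acts[i + 1] its output.
    grad_w = [0.0] * len(weights)
    g = grad_out
    for i in reversed(range(len(weights))):
        local = 1.0 - acts[i + 1] ** 2          # tanh'(w_i * a_i)
        grad_w[i] = g * local * acts[i]         # gradient w.r.t. w_i
        g = g * local * weights[i]              # gradient w.r.t. a_i
    return g, grad_w

def checkpointed_grads(a0, weights, k):
    # Forward: keep only the input to every k-th layer.
    checkpoints, a = {}, a0
    for i, w in enumerate(weights):
        if i % k == 0:
            checkpoints[i] = a
        a = math.tanh(w * a)
    # Backward: recompute each segment from its checkpoint, last segment first.
    g, grad_w = 1.0, [0.0] * len(weights)
    for start in reversed(range(0, len(weights), k)):
        seg_w = weights[start:start + k]
        seg_acts = forward_full(checkpoints[start], seg_w)
        g, seg_grad = backward_segment(seg_acts, seg_w, g)
        grad_w[start:start + k] = seg_grad
    return g, grad_w

weights = [0.9, -1.1, 0.7, 1.3, -0.8, 1.05, 0.95, -1.2, 0.6]
ref = backward_segment(forward_full(0.5, weights), weights, 1.0)
chk = checkpointed_grads(0.5, weights, k=3)
print(ref[0], chk[0])
```

With 9 layers and `k=3` the checkpointed version holds 3 checkpoints plus one segment of activations at a time, versus 10 stored activations for the full version, and produces the same gradients at the price of re-running each segment's forward computation once.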

Part 6 - Local Gradients for Common Operations

Knowing these local gradients lets you derive the backward pass for any network architecture. This is your reference table.

Matrix Multiplication

If $\mathbf{z} = W\mathbf{x}$ and we receive $\frac{\partial L}{\partial \mathbf{z}}$ from upstream:

$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \mathbf{z}} \cdot \mathbf{x}^T \qquad \frac{\partial L}{\partial \mathbf{x}} = W^T \cdot \frac{\partial L}{\partial \mathbf{z}}$

Memory trick: The gradient w.r.t. the weight is always upstream_grad @ input.T. The gradient w.r.t. the input is always weight.T @ upstream_grad.

Addition (Bias)

If $\mathbf{z} = \mathbf{a} + \mathbf{b}$ :

$\frac{\partial L}{\partial \mathbf{a}} = \frac{\partial L}{\partial \mathbf{z}}, \quad \frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{z}}$

Addition distributes (copies) the gradient equally to both inputs.

Element-wise Operations

For any element-wise function $h_i = f(z_i)$ , the Jacobian is diagonal:

$\frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial h_i} \cdot f'(z_i)$

This is just element-wise multiplication with the local derivative.

ReLU

$\frac{\partial L}{\partial \mathbf{z}} = \frac{\partial L}{\partial \mathbf{h}} \odot \mathbb{1}[\mathbf{z} > 0]$

ReLU acts as a binary gate: gradient passes through where input was positive, is blocked where negative.

Sigmoid

$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial h} \cdot \sigma(z)(1 - \sigma(z))$

Note: $\sigma(z)(1 - \sigma(z)) \leq 0.25$ , which is why sigmoid causes vanishing gradients.

Tanh

$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial h} \cdot (1 - \tanh^2(z))$

Maximum derivative is 1 (at $z = 0$ ), better than sigmoid's 0.25 but still causes vanishing gradients for large $|z|$ .

Softmax + Cross-Entropy (Combined)

$\frac{\partial L}{\partial \mathbf{z}} = \hat{\mathbf{y}} - \mathbf{y}$

Always compute the combined gradient - computing them separately requires the full softmax Jacobian.

Batch Normalization

For $\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$ where $\mu$ and $\sigma^2$ are batch statistics:

$\frac{\partial L}{\partial z_i} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} \left( \frac{\partial L}{\partial \hat{z}_i} - \frac{1}{B}\sum_j \frac{\partial L}{\partial \hat{z}_j} - \frac{\hat{z}_i}{B}\sum_j \frac{\partial L}{\partial \hat{z}_j}\hat{z}_j \right)$

The complexity arises because $\mu$ and $\sigma^2$ depend on all batch elements, so each element's gradient receives contributions from every other element.
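
A finite-difference check is a practical way to convince yourself the BatchNorm formula is right. The sketch below assumes a batch of scalars, no learnable scale or shift, and an arbitrary linear loss $L = \sum_i c_i \hat{z}_i$ so that $\frac{\partial L}{\partial \hat{z}_i} = c_i$ . These choices, the constant `EPS`, and the function names are made here for illustration.

```python
import math

EPS = 1e-5

def batchnorm_forward(z):
    B = len(z)
    mu = sum(z) / B
    var = sum((v - mu) ** 2 for v in z) / B       # biased batch variance
    inv_std = 1.0 / math.sqrt(var + EPS)
    return [(v - mu) * inv_std for v in z], inv_std

def batchnorm_backward(dz_hat, z_hat, inv_std):
    # Direct transcription of the formula above.
    B = len(z_hat)
    mean_g = sum(dz_hat) / B
    mean_gz = sum(g * zh for g, zh in zip(dz_hat, z_hat)) / B
    return [inv_std * (g - mean_g - zh * mean_gz)
            for g, zh in zip(dz_hat, z_hat)]

z = [0.5, -1.2, 2.0, 0.3]
c = [1.0, -2.0, 0.5, 3.0]          # arbitrary weights defining L = sum_i c_i * z_hat_i

def loss(zs):
    z_hat, _ = batchnorm_forward(zs)
    return sum(ci * zh for ci, zh in zip(c, z_hat))

z_hat, inv_std = batchnorm_forward(z)
analytic = batchnorm_backward(c, z_hat, inv_std)   # dL/dz_hat_i = c_i

# Central finite differences, one input at a time.
h = 1e-6
numeric = []
for i in range(len(z)):
    zp = list(z); zp[i] += h
    zm = list(z); zm[i] -= h
    numeric.append((loss(zp) - loss(zm)) / (2 * h))
print(analytic)
print(numeric)
```

The analytic gradients sum to zero (up to rounding). Shifting every $z_i$ by the same constant leaves $\hat{z}$ unchanged, so the loss cannot depend on the batch mean. This is a quick sanity check worth mentioning at the whiteboard.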

Company Variation

Deriving the BatchNorm gradient is a classic Google/Meta interview question for senior roles. The key insight is that since $\mu$ and $\sigma^2$ are functions of the entire batch, the backward pass must account for these dependencies. Most candidates handle the $\frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$ part but forget the gradient contributions through $\mu$ and $\sigma^2$ .

Part 7 - Numerical Example: Full Backprop Walkthrough

Let us work through a concrete example with actual numbers. This is what you would write on a whiteboard.

Network: 2 inputs, 2 hidden units (ReLU), 2 outputs (softmax + cross-entropy)

Parameters:

$W_1 = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}$ , $\mathbf{b}_1 = \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix}$ , $W_2 = \begin{bmatrix} 0.5 & 0.6 \\ 0.7 & 0.8 \end{bmatrix}$ , $\mathbf{b}_2 = \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix}$

Input: $\mathbf{x} = [1.0, 2.0]^T$ , Target: $\mathbf{y} = [1, 0]^T$ (class 0)

Forward pass:

Step	Computation	Result
$\mathbf{z}_1$	$W_1\mathbf{x} + \mathbf{b}_1 = [0.1(1) + 0.2(2) + 0.1, \ 0.3(1) + 0.4(2) + 0.1]^T$	$[0.6, 1.2]^T$
$\mathbf{h}_1$	$\text{ReLU}(\mathbf{z}_1)$ - both positive	$[0.6, 1.2]^T$
$\mathbf{z}_2$	$W_2\mathbf{h}_1 + \mathbf{b}_2 = [0.5(0.6) + 0.6(1.2) + 0.1, \ 0.7(0.6) + 0.8(1.2) + 0.1]^T$	$[1.12, 1.48]^T$
$\hat{\mathbf{y}}$	$\text{softmax}([1.12, 1.48])$ : $e^{1.12} = 3.065, e^{1.48} = 4.393$ , sum $= 7.458$	$[0.411, 0.589]^T$
$L$	$-(1 \cdot \ln 0.411 + 0 \cdot \ln 0.589)$	$0.889$

Backward pass:

Step	Computation	Result
$\frac{\partial L}{\partial \mathbf{z}_2}$	$\hat{\mathbf{y}} - \mathbf{y} = [0.411 - 1, \ 0.589 - 0]^T$	$[-0.589, 0.589]^T$
$\frac{\partial L}{\partial W_2}$	$\frac{\partial L}{\partial \mathbf{z}_2} \cdot \mathbf{h}_1^T$	$\begin{bmatrix} -0.353 & -0.707 \\ 0.353 & 0.707 \end{bmatrix}$
$\frac{\partial L}{\partial \mathbf{h}_1}$	$W_2^T \cdot \frac{\partial L}{\partial \mathbf{z}_2}$	$[0.118, 0.118]^T$
$\frac{\partial L}{\partial \mathbf{z}_1}$	$\frac{\partial L}{\partial \mathbf{h}_1} \odot [1, 1]^T$ (both $z_1$ positive)	$[0.118, 0.118]^T$
$\frac{\partial L}{\partial W_1}$	$\frac{\partial L}{\partial \mathbf{z}_1} \cdot \mathbf{x}^T$	$\begin{bmatrix} 0.118 & 0.236 \\ 0.118 & 0.236 \end{bmatrix}$

Shape verification: $\frac{\partial L}{\partial W_1}$ is $(2 \times 2)$ , same as $W_1$ . $\frac{\partial L}{\partial W_2}$ is $(2 \times 2)$ , same as $W_2$ . All shapes match.
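
The walkthrough can be replayed in plain Python to check the arithmetic. This sketch uses nested lists in place of a matrix library, and the helpers `matvec` and `outer` are defined here for illustration. The parameter values are the ones given above.

```python
import math

W1 = [[0.1, 0.2], [0.3, 0.4]]; b1 = [0.1, 0.1]
W2 = [[0.5, 0.6], [0.7, 0.8]]; b2 = [0.1, 0.1]
x = [1.0, 2.0]; y = [1.0, 0.0]

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

# Forward pass
z1 = [s + b for s, b in zip(matvec(W1, x), b1)]
h1 = [max(0.0, v) for v in z1]
z2 = [s + b for s, b in zip(matvec(W2, h1), b2)]
exps = [math.exp(v - max(z2)) for v in z2]           # max-shift for stability
y_hat = [e / sum(exps) for e in exps]
loss = -sum(yk * math.log(pk) for yk, pk in zip(y, y_hat))

# Backward pass
dz2 = [p - t for p, t in zip(y_hat, y)]              # softmax + cross-entropy
dW2 = outer(dz2, h1)
W2_T = [list(col) for col in zip(*W2)]
dh1 = matvec(W2_T, dz2)
dz1 = [g if z > 0 else 0.0 for g, z in zip(dh1, z1)]  # ReLU mask on pre-activations
dW1 = outer(dz1, x)

for name, val in [("loss", loss), ("dz2", dz2), ("dW2", dW2), ("dh1", dh1), ("dW1", dW1)]:
    print(name, val)
```

The printed values match the tables to three decimal places. The bias gradients are not printed because they are just `dz2` and `dz1`.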

Part 8 - Backprop Through Time (BPTT)

Backpropagation in recurrent neural networks is called Backpropagation Through Time (BPTT) because the RNN is "unrolled" across time steps, creating a very deep computational graph.

For a vanilla RNN: $\mathbf{h}_t = \tanh(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b})$

The gradient of the loss at time $T$ w.r.t. $\mathbf{h}_1$ involves:

$\frac{\partial L_T}{\partial \mathbf{h}_1} = \prod_{t=2}^{T} \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} \cdot \frac{\partial L_T}{\partial \mathbf{h}_T}$

Each factor is $\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \text{diag}(\tanh'(\mathbf{z}_t)) \cdot W_h$ .

Since $\tanh'(z) \leq 1$ and the same $W_h$ is reused at every step, this product either vanishes or explodes depending on the spectral radius of $W_h$ . This is exactly why LSTMs were invented - they add a cell state with additive gradient flow, analogous to how ResNets add skip connections.

Truncated BPTT: In practice, we limit the backward pass to $k$ time steps (e.g., $k = 256$ ) to bound memory and compute. Dependencies longer than $k$ steps receive no gradient, so this is a biased estimate of the true gradient, but it works well in practice.
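
The repeated-factor behavior can be seen in a scalar RNN, where the Jacobian product reduces to a product of numbers. This sketch assumes a one-unit RNN with no bias, and `bptt_gradient` is an illustrative name. With zero inputs the state stays at the origin, where $\tanh'(0) = 1$ , so each factor equals $w_h$ exactly.

```python
import math

def bptt_gradient(w_h, w_x, xs, h0=0.0):
    # Forward: unroll h_t = tanh(w_h * h_{t-1} + w_x * x_t) and store each h_t.
    hs = [h0]
    for x_t in xs:
        hs.append(math.tanh(w_h * hs[-1] + w_x * x_t))
    # Backward: d h_T / d h_1 is the product of tanh'(z_t) * w_h for t = 2..T,
    # using tanh'(z_t) = 1 - h_t^2.
    grad = 1.0
    for t in range(len(xs), 1, -1):
        grad *= (1.0 - hs[t] ** 2) * w_h
    return grad

xs_zero = [0.0] * 50
xs_driven = [1.0] * 50
small = bptt_gradient(w_h=0.5, w_x=1.0, xs=xs_zero)        # 0.5 ** 49: vanishes
large = bptt_gradient(w_h=1.5, w_x=1.0, xs=xs_zero)        # 1.5 ** 49: explodes
saturated = bptt_gradient(w_h=1.5, w_x=1.0, xs=xs_driven)  # tanh saturates
print(small, large, saturated)
```

The third case is a useful nuance: a recurrent weight above 1 does not guarantee explosion, because strong inputs push tanh into saturation, where $\tanh'(z)$ is close to zero and the gradient vanishes anyway.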

Practice Problems

Problem 1: Computational Graph Gradient

Given $f(x, y, z) = (x + y) \cdot z$ , draw the computational graph and compute all partial derivatives when $x = -2, y = 5, z = -4$ .

Hint 1 - Direction

Break the expression into two operations: an addition and a multiplication. Each becomes a node in the graph.

Hint 2 - Insight

Forward: $q = x + y = 3$ , $f = q \cdot z = -12$ . Backward: start with $\frac{\partial f}{\partial f} = 1$ . The multiplication node has local gradients $z$ and $q$ for its two inputs.

Hint 3 - Full Solution + Rubric

Forward: $q = x + y = 3$ , $f = q \cdot z = -12$

Backward:

$\frac{\partial f}{\partial f} = 1$
$\frac{\partial f}{\partial q} = z = -4$ and $\frac{\partial f}{\partial z} = q = 3$
$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = -4 \cdot 1 = -4$
$\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial y} = -4 \cdot 1 = -4$

Results: $\frac{\partial f}{\partial x} = -4$ , $\frac{\partial f}{\partial y} = -4$ , $\frac{\partial f}{\partial z} = 3$

Verification: $\frac{\partial f}{\partial x} = z = -4$ , $\frac{\partial f}{\partial y} = z = -4$ , $\frac{\partial f}{\partial z} = x + y = 3$ . Matches.

Scoring Rubric:

Strong Hire: Draws correct graph, computes all gradients correctly, verifies with direct differentiation, explains that the addition node passes the upstream gradient unchanged to both of its inputs
Lean Hire: Correct final answers but hesitates on the process or skips verification
No Hire: Cannot set up the computational graph or makes sign errors in the chain rule

Problem 2: Vanishing Gradient Analysis

A 5-layer network uses sigmoid activations with Xavier-initialized weights. Estimate the magnitude of the gradient at layer 1 relative to layer 5, and propose three solutions.

Hint 1 - Direction

The gradient at each layer is multiplied by $\sigma'(z)$ and the weight matrix. What is the maximum value of $\sigma'(z)$ ?

Hint 2 - Insight

$\sigma'(z) \leq 0.25$ . With Xavier initialization, $\|W_i\| \approx 1$ . So the gradient at layer 1 is approximately $(0.25)^4$ times the gradient at layer 5.

Hint 3 - Full Solution + Rubric

With Xavier initialization, the expected singular values of $W_i$ are approximately 1. The maximum of $\sigma'(z)$ is 0.25.

$\frac{\|\nabla_{W_1} L\|}{\|\nabla_{W_5} L\|} \approx \prod_{i=2}^{5} (1 \cdot 0.25) = (0.25)^4 = 0.0039$

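Under these assumptions the gradient at layer 1 is at least roughly 256x smaller than at layer 5, since 0.25 is the best case for $\sigma'(z)$ and saturated units shrink it further.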

Three solutions:

Replace sigmoid with ReLU: gradient factor becomes 1 (for active units), eliminating the 0.25 multiplier
Add skip connections (ResNet-style): gradient has a direct additive path bypassing the multiplications
Use batch normalization: prevents activations from saturating in the flat regions of sigmoid

Scoring Rubric:

Strong Hire: Quantifies the decay factor as $(0.25)^4$ , derives it mathematically, connects to 3+ solutions, mentions that ReLU has its own issues (dying ReLU)
Lean Hire: Knows gradients shrink exponentially with sigmoid, can state solutions but cannot quantify the decay
No Hire: Vaguely mentions "gradients get small" without mathematical reasoning

Problem 3: Implement Backward Pass

Given this forward pass in pseudocode, write the backward pass:

# Forward
z1 = W1 @ x + b1      # shape: (h,)
h1 = relu(z1)          # shape: (h,)
z2 = W2 @ h1 + b2      # shape: (c,)
loss = cross_entropy_with_softmax(z2, y)  # scalar

Hint 1 - Direction

Start from the loss and work backward. The combined softmax-cross-entropy gradient is $\hat{y} - y$ . For each forward operation, write the corresponding backward operation.

Hint 2 - Insight

For matmul backward: gradient w.r.t. weight is upstream @ input.T, gradient w.r.t. input is weight.T @ upstream. For ReLU backward: multiply by binary mask of where pre-activation was positive.

Hint 3 - Full Solution + Rubric

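# Backward (PyTorch-style pseudocode; z1, h1, z2 are saved from the forward pass, y is one-hot)
dz2 = softmax(z2) - y                    # (c,) combined softmax + cross-entropy gradient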
dW2 = dz2[:, None] @ h1[None, :]         # (c, h) outer product
db2 = dz2                                # (c,)
dh1 = W2.T @ dz2                         # (h,)
dz1 = dh1 * (z1 > 0).float()             # (h,) ReLU mask
dW1 = dz1[:, None] @ x[None, :]          # (h, d) outer product
db1 = dz1                                # (h,)

Key shape checks:

dW2 has same shape as W2: (c, h) -- outer product of (c, 1) and (1, h)
dW1 has same shape as W1: (h, d) -- outer product of (h, 1) and (1, d)
ReLU mask uses z1 > 0 (pre-activation values)
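The solution above can be turned into a runnable check. This is a minimal NumPy sketch, not the only way to write it: the dimensions `d, h, c = 4, 5, 3`, the random parameters, and the helper names `forward`, `backward`, and `numeric_grad_W1` are assumptions made for illustration. It compares the analytic `dW1` against central finite differences.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def forward(W1, b1, W2, b2, x, y):
    z1 = W1 @ x + b1                          # (h,)
    h1 = np.maximum(z1, 0.0)                  # (h,) ReLU
    z2 = W2 @ h1 + b2                         # (c,)
    loss = -np.sum(y * np.log(softmax(z2)))   # scalar cross-entropy
    return loss, (z1, h1, z2)

def backward(W2, x, y, cache):
    z1, h1, z2 = cache
    dz2 = softmax(z2) - y        # (c,) combined softmax + CE gradient
    dW2 = np.outer(dz2, h1)      # (c, h)
    db2 = dz2                    # (c,)
    dh1 = W2.T @ dz2             # (h,)
    dz1 = dh1 * (z1 > 0)         # (h,) ReLU mask on the pre-activation
    dW1 = np.outer(dz1, x)       # (h, d)
    db1 = dz1                    # (h,)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
d, h, c = 4, 5, 3                # assumed sizes for the check
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(c, h)), rng.normal(size=c)
x = rng.normal(size=d)
y = np.eye(c)[1]                 # one-hot label for class 1

loss, cache = forward(W1, b1, W2, b2, x, y)
dW1, db1, dW2, db2 = backward(W2, x, y, cache)

def numeric_grad_W1(eps=1e-6):
    """Central finite differences, one entry of W1 at a time."""
    g = np.zeros_like(W1)
    for i in range(h):
        for j in range(d):
            Wp, Wm = W1.copy(), W1.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            g[i, j] = (forward(Wp, b1, W2, b2, x, y)[0]
                       - forward(Wm, b1, W2, b2, x, y)[0]) / (2 * eps)
    return g

max_err = np.abs(dW1 - numeric_grad_W1()).max()
print("max |analytic - numeric| for dW1:", max_err)
```

The finite-difference loop needs two forward passes per weight, which is the same cost asymmetry that makes reverse-mode differentiation attractive for training.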

Scoring Rubric:

Strong Hire: Correct shapes, correct operations, explains each step, explicitly verifies shapes match parameters
Lean Hire: Mostly correct but confuses one transpose or forgets the outer product notation
No Hire: Cannot write the backward pass in correct order or gets more than 2 operations wrong

Problem 4: Forward-Mode vs Reverse-Mode AD

You have a function $f: \mathbb{R}^{1000000} \to \mathbb{R}$ (a neural network with 1M parameters and scalar loss). How many passes does forward-mode AD need? How many does reverse-mode need? Now consider $g: \mathbb{R} \to \mathbb{R}^{1000}$ . Which mode is better for $g$ ?

Hint 1 - Direction

Forward-mode computes derivatives w.r.t. one input per pass. Reverse-mode computes derivatives w.r.t. all inputs per pass but for one output.

Hint 2 - Insight

For $f$ : reverse-mode needs 1 forward + 1 backward pass for all 1M gradients. Forward-mode needs 1M passes. For $g$ : the analysis flips.

Hint 3 - Full Solution + Rubric

For $f: \mathbb{R}^{1000000} \to \mathbb{R}$ :

Forward-mode: 1,000,000 passes (one per input dimension). Completely impractical.
Reverse-mode: 1 forward + 1 backward = 2 passes for all 1M gradients. This is why deep learning uses reverse-mode.

For $g: \mathbb{R} \to \mathbb{R}^{1000}$ :

Forward-mode: 1 pass computes all 1000 output derivatives w.r.t. the single input. Efficient.
Reverse-mode: 1000 backward passes (one per output). Wasteful.
Forward-mode wins decisively.

The general rule: Use forward-mode when number of outputs >> number of inputs. Use reverse-mode when number of inputs >> number of outputs. Training a neural network means differentiating one scalar loss with respect to millions of parameters, so reverse-mode wins there.
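The pass counts can be made concrete with a toy forward-mode implementation based on dual numbers. This is a sketch under stated assumptions: the `Dual` class, the test functions `f` and `g`, and the helper `forward_mode_grad` are hypothetical names invented for illustration, and only `+` and `*` are supported.

```python
class Dual:
    """A value paired with its derivative along one seeded input (a JVP)."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__

def f(xs):
    """Many inputs -> one output: f(x) = sum_i x_i^2."""
    total = Dual(0.0)
    for x in xs:
        total = total + x * x
    return total

def g(t):
    """One input -> many outputs: g(t) = [t^2, 2t^2, 3t^2]."""
    return [k * t * t for k in range(1, 4)]

def forward_mode_grad(func, point):
    """Full gradient of a scalar function: one forward pass per input."""
    grad, passes = [], 0
    for i in range(len(point)):
        seeded = [Dual(v, 1.0 if j == i else 0.0)
                  for j, v in enumerate(point)]
        grad.append(func(seeded).dot)
        passes += 1
    return grad, passes

# f has 3 inputs, so forward mode needs 3 passes for the gradient.
grad_f, passes_f = forward_mode_grad(f, [1.0, 2.0, 3.0])

# g has 1 input, so a single seeded pass yields every output derivative.
derivs_g = [out.dot for out in g(Dual(2.0, 1.0))]

print(grad_f, passes_f)  # gradient is 2x at x = (1, 2, 3)
print(derivs_g)          # d/dt (k t^2) = 2kt at t = 2
```

Scaling `f` from 3 inputs to 1,000,000 multiplies the number of seeded passes accordingly, while `g` stays at one pass no matter how many outputs it has.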

Scoring Rubric:

Strong Hire: Gives correct pass counts, states the general rule, mentions JVP vs VJP distinction, notes forward-mode is used in certain scientific computing and physics simulation contexts
Lean Hire: Correct pass counts for both cases and the general rule
No Hire: Cannot explain the asymmetry between forward and reverse mode

Problem 5: BatchNorm Backward

Explain conceptually why the BatchNorm backward pass is more complex than a simple element-wise operation. What are the three terms in the gradient, and what does each account for?

Hint 1 - Direction

BatchNorm normalizes using batch statistics $\mu$ and $\sigma^2$ . These statistics depend on ALL elements in the batch, not just the current element.

Hint 2 - Insight

The gradient has three paths: (1) the direct path through the normalization, (2) the path through the mean $\mu$ , and (3) the path through the variance $\sigma^2$ . The mean and variance create coupling between all batch elements.

Hint 3 - Full Solution + Rubric

BatchNorm computes $\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$ where $\mu = \frac{1}{B}\sum_j z_j$ and $\sigma^2 = \frac{1}{B}\sum_j (z_j - \mu)^2$ .

The gradient $\frac{\partial L}{\partial z_i}$ has three components:

Direct term: $\frac{1}{\sqrt{\sigma^2 + \epsilon}} \frac{\partial L}{\partial \hat{z}_i}$ - the straightforward gradient through the division, as if $\mu$ and $\sigma^2$ were constants.
Mean term: $-\frac{1}{B\sqrt{\sigma^2 + \epsilon}} \sum_j \frac{\partial L}{\partial \hat{z}_j}$ - accounts for the fact that changing $z_i$ changes $\mu$ , which affects all $\hat{z}_j$ . This term centers the gradients (subtracts their mean).
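Variance term: $-\frac{\hat{z}_i}{B\sqrt{\sigma^2 + \epsilon}} \sum_j \frac{\partial L}{\partial \hat{z}_j} \hat{z}_j$ - accounts for the fact that changing $z_i$ changes $\sigma^2$ , which affects all $\hat{z}_j$ . This term removes the component of the gradient along $\hat{z}$ : rescaling the batch about its mean leaves the normalized output unchanged (up to $\epsilon$ ), so the loss has no gradient in that direction.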

The net effect: BatchNorm backward projects the upstream gradient, centering it (zero mean over the batch) and removing its component along $\hat{z}$ . In this loose sense it "whitens" the gradients, and this is one reason BatchNorm stabilizes training.
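The three terms can be checked numerically. The sketch below assumes a single feature with batch size 8 and a hypothetical downstream loss $L = \sum_i w_i \hat{z}_i$ , so that $\frac{\partial L}{\partial \hat{z}_i} = w_i$ ; the function names and the omission of the learnable scale and shift are simplifications for illustration.

```python
import numpy as np

def batchnorm_forward(z, eps=1e-5):
    mu = z.mean()
    var = z.var()                  # biased: (1/B) * sum (z - mu)^2
    std = np.sqrt(var + eps)
    return (z - mu) / std, std

def batchnorm_backward(dz_hat, z_hat, std):
    B = dz_hat.shape[0]
    direct = dz_hat / std                                    # mu, var held fixed
    mean_term = -dz_hat.sum() / (B * std)                    # path through mu
    var_term = -z_hat * (dz_hat * z_hat).sum() / (B * std)   # path through var
    return direct + mean_term + var_term

rng = np.random.default_rng(0)
B = 8                              # assumed batch size
z = rng.normal(size=B)
w = rng.normal(size=B)             # stand-in for dL/dz_hat

z_hat, std = batchnorm_forward(z)
dz = batchnorm_backward(w, z_hat, std)

def loss(z_in):
    return np.sum(w * batchnorm_forward(z_in)[0])

# Central finite differences, perturbing one batch element at a time
h = 1e-6
numeric = np.array([(loss(z + h * e) - loss(z - h * e)) / (2 * h)
                    for e in np.eye(B)])

max_err = np.abs(dz - numeric).max()
print("max |analytic - numeric|:", max_err)
print("sum of dz (centering):", dz.sum())
print("dz . z_hat (scale direction):", dz @ z_hat)
```

The last two printed values should be close to zero: the first because of the mean term, the second because of the variance term (it is only approximately zero since $\epsilon$ keeps $\sum_i \hat{z}_i^2$ slightly below $B$ ).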

Scoring Rubric:

Strong Hire: Identifies all three paths, explains the coupling between batch elements, notes the whitening effect on gradients
Lean Hire: Knows BatchNorm backward is complex because of batch statistics, can identify at least 2 of 3 terms
No Hire: Treats BatchNorm as an element-wise operation or cannot explain why batch statistics complicate the gradient

Interview Cheat Sheet

Concept	Key Formula / Fact	Common Mistakes
Chain rule	$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$	Forgetting to sum over multiple paths (fan-out)
Matmul backward	$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot x^T$ , $\frac{\partial L}{\partial x} = W^T \cdot \frac{\partial L}{\partial z}$	Missing the transpose
ReLU backward	Multiply by $\mathbb{1}[z > 0]$	Masking on the upstream gradient instead of the forward-pass value $z$ (for ReLU, $h > 0$ gives the same mask as $z > 0$ )
Softmax + CE	$\frac{\partial L}{\partial z} = \hat{y} - y$	Computing softmax and CE gradients separately
Sigmoid derivative	$\sigma'(z) = \sigma(z)(1 - \sigma(z)) \leq 0.25$	Saying "gradients are zero" (they are small, not zero)
Vanishing gradients	Product of Jacobians shrinks exponentially	Not connecting to specific solutions (ReLU, skip connections, LSTM)
Exploding gradients	Gradient norm grows exponentially	Not mentioning gradient clipping as the standard fix
Forward-mode AD	JVP, $n$ passes for $n$ inputs	Saying it is "the same as" reverse mode
Reverse-mode AD	VJP, 1 pass for all inputs (scalar output)	Forgetting the memory cost of storing activations
Gradient checkpointing	$O(\sqrt{N})$ memory, ~33% extra compute	Not knowing this technique exists
He initialization	$\text{Var}(W) = \frac{2}{n_{\text{in}}}$ for ReLU	Using Xavier init with ReLU (wrong by factor of 2)
BPTT	Unrolled RNN = very deep network	Not connecting to vanishing gradient explanation

Spaced Repetition Checkpoints

Day 0 - After First Read

Write the chain rule in scalar, vector, and matrix forms
Draw a computational graph for $L = (Wx + b - y)^2$ and trace forward/backward passes
State the backward rule for: matmul, ReLU, sigmoid, softmax+CE

Day 3 - First Review

Derive backprop for a 2-layer network (ReLU + softmax) without looking at notes
Explain vanishing gradients with specific numbers: sigmoid max derivative of 0.25, exponential decay
List 3 solutions to vanishing gradients and why each works mathematically

Day 7 - Connections Review

Explain why reverse-mode AD is used in deep learning in under 60 seconds
Trace the numerical backprop example (Part 7) end-to-end without looking at the solution
Explain the memory-compute tradeoff and how gradient checkpointing addresses it

Day 14 - Interview Simulation

Derive all gradients for a 2-layer network on a whiteboard in under 10 minutes
Explain forward-mode vs reverse-mode AD: when each is used, JVP vs VJP, pass counts
Given an arbitrary computational graph with fan-out, trace the backward pass without hesitation

Day 21 - Final Calibration

Complete all 5 practice problems under time pressure (8 minutes each)
Connect backpropagation to: activation functions (gradient properties), CNNs (conv gradient), RNNs (BPTT), skip connections (additive gradient path)
Explain BatchNorm backward at a conceptual level (the three terms)
Give the 60-second answer for backpropagation cold, without preparation

