Skip to main content

Backpropagation - The Engine of Deep Learning

Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, ML Compiler Eng

The Real Interview Moment

You are in a Google MLE on-site interview. The interviewer draws a simple two-layer neural network on the whiteboard - input layer, one hidden layer with ReLU, one output layer with softmax, cross-entropy loss. She hands you the marker and says: "Compute the gradient of the loss with respect to the weights in the first layer. Show every step."

You start writing. You get through the softmax-cross-entropy gradient (you memorized that one), but then you hit the ReLU layer and hesitate. "Does the chain rule multiply or... compose?" The interviewer's expression does not change, but you can feel the evaluation shifting. She follows up: "Now tell me - what happens to this gradient when you have 100 layers instead of 2? And why does PyTorch compute all these gradients efficiently?"

This is backpropagation in interviews. Not a conceptual overview - a hands-on mathematical derivation where you must trace gradients through every operation, understand what goes wrong at scale, and explain how frameworks like PyTorch make it practical. This page gives you every tool you need.

What You Will Master

  • State the chain rule in both scalar and matrix forms
  • Draw computational graphs for arbitrary expressions and trace forward/backward passes
  • Derive backpropagation for a 2-layer network with ReLU and softmax-cross-entropy loss
  • Compute Jacobians for common operations: matrix multiply, ReLU, softmax, batch norm
  • Explain vanishing gradients mathematically and connect them to activation functions and depth
  • Explain exploding gradients and the role of gradient clipping
  • Distinguish forward-mode and reverse-mode automatic differentiation
  • Justify why reverse-mode AD is used in deep learning (one backward pass for all parameters)
  • Implement a simple autograd engine conceptually (define forward, store graph, backward)
  • Answer backpropagation interview questions at whiteboard speed

Self-Assessment: Where Are You Now?

Skill1 -- Cannot2 -- Vaguely3 -- Can Explain4 -- Can Derive5 -- Can TeachYour Score
State the chain rule for compositions___
Draw a computational graph for L=(wx+by)2L = (wx + b - y)^2___
Trace forward pass values on a graph___
Trace backward pass gradients on a graph___
Derive gradients for a 2-layer network___
Explain vanishing gradients mathematically___
Explain forward vs reverse mode AD___
Implement simple autograd conceptually___

Target: All 4s and 5s before your interview.

Part 1 - The Chain Rule: From Calculus to Computation

Scalar Chain Rule

If y=f(g(x))y = f(g(x)), then:

dydx=dydgdgdx\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}

This is the single most important equation in deep learning. Every gradient computation in every neural network is an application of this rule, composed many times.

Example: Let y=(3x+2)2y = (3x + 2)^2. Define g=3x+2g = 3x + 2 and y=g2y = g^2.

dgdx=3,dydg=2g=2(3x+2),dydx=2(3x+2)3=6(3x+2)\frac{dg}{dx} = 3, \quad \frac{dy}{dg} = 2g = 2(3x+2), \quad \frac{dy}{dx} = 2(3x+2) \cdot 3 = 6(3x+2)

Multivariate Chain Rule

When a variable influences the output through multiple paths, we sum over all paths:

Lx=iLyiyix\frac{\partial L}{\partial x} = \sum_{i} \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial x}

This is critical in neural networks because a single weight can affect the loss through multiple neurons.

60-Second Answer

"Backpropagation is just the chain rule applied systematically on a computational graph. We do a forward pass to compute the loss, then a backward pass where we propagate gradients from the loss back to every parameter. At each node, we multiply the incoming gradient by the local derivative - that is the chain rule. The key insight is that reverse-mode differentiation lets us compute gradients with respect to all parameters in a single backward pass, which is why it is used in deep learning where we have millions of parameters but a single scalar loss."

Vector/Matrix Chain Rule

For neural networks, we work with vectors and matrices. If y=f(x)\mathbf{y} = f(\mathbf{x}) where xRn\mathbf{x} \in \mathbb{R}^n and yRm\mathbf{y} \in \mathbb{R}^m, the Jacobian is:

J=yx=[y1x1y1xnymx1ymxn]J = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}

For a chain L=f(g(x))L = f(\mathbf{g}(\mathbf{x})) where LL is scalar:

Lx=LgJg\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{g}} \cdot J_g

In practice, we never explicitly construct Jacobian matrices. We work with vector-Jacobian products (VJPs) - multiplying the upstream gradient vector by the Jacobian without materializing the full matrix.

Common Trap

Do NOT confuse the Jacobian with the gradient. The gradient xL\nabla_x L is a vector of partial derivatives of a scalar with respect to a vector. The Jacobian is a matrix of partial derivatives of a vector with respect to a vector. In backprop, we always propagate gradients of the scalar loss, so we compute VJPs, not full Jacobians.

Part 2 - Computational Graphs

What Is a Computational Graph?

A computational graph is a directed acyclic graph (DAG) where:

  • Leaf nodes are inputs and parameters
  • Internal nodes are operations (add, multiply, matmul, ReLU, etc.)
  • Edges represent data flow
  • The root is the loss value

Every expression can be decomposed into a computational graph. This is exactly what PyTorch and TensorFlow do under the hood.

Example: Linear Regression Loss

Consider L=(wx+by)2L = (wx + b - y)^2 for a single data point.

Computational Graph for L = (wx + b - y)²

Forward Pass

We compute values left-to-right (inputs to loss). Let w=2,x=3,b=1,y=8w = 2, x = 3, b = 1, y = 8.

NodeComputationValue
mulwx=23w \cdot x = 2 \cdot 36
add1mul+b=6+1\text{mul} + b = 6 + 17
subadd1y=78\text{add1} - y = 7 - 81-1
sqsub2=(1)2\text{sub}^2 = (-1)^21
L= sq1

Backward Pass

We compute gradients right-to-left (loss to parameters). We need Lw\frac{\partial L}{\partial w} and Lb\frac{\partial L}{\partial b}.

Start at the loss: LL=1\frac{\partial L}{\partial L} = 1.

NodeLocal DerivativeUpstream GradientGradient
sqsub(sub2)=2sub=2(1)=2\frac{\partial}{\partial \text{sub}}(\text{sub}^2) = 2 \cdot \text{sub} = 2(-1) = -212-2
subsubadd1=1\frac{\partial \text{sub}}{\partial \text{add1}} = 12-22-2
add1 (to b)add1b=1\frac{\partial \text{add1}}{\partial b} = 12-2Lb=2\frac{\partial L}{\partial b} = -2
add1 (to mul)add1mul=1\frac{\partial \text{add1}}{\partial \text{mul}} = 12-22-2
mul (to w)mulw=x=3\frac{\partial \text{mul}}{\partial w} = x = 32-2Lw=23=6\frac{\partial L}{\partial w} = -2 \cdot 3 = -6

Result: Lw=6\frac{\partial L}{\partial w} = -6, Lb=2\frac{\partial L}{\partial b} = -2.

Verification: L=(wx+by)2L = (wx + b - y)^2. By direct differentiation: Lw=2(wx+by)x=2(1)(3)=6\frac{\partial L}{\partial w} = 2(wx + b - y) \cdot x = 2(-1)(3) = -6. Correct.

Interviewer's Perspective

"I ask candidates to trace a backward pass on a computational graph because it tells me whether they truly understand backprop or just memorized formulas. If they can handle the graph for a 2-layer network, they will not panic when I ask about more complex architectures. The key thing I watch for is: do they correctly handle nodes where the gradient flows through multiple paths? That is where most candidates make mistakes."

Fan-Out: When a Variable Is Used Twice

Consider L=xx=x2L = x \cdot x = x^2. The node xx fans out to two inputs of the multiply operation.

In the backward pass, when a variable feeds into multiple downstream operations, we sum the gradients from all paths:

Lx=Lmulmulxleft+Lmulmulxright=1x+1x=2x\frac{\partial L}{\partial x} = \frac{\partial L}{\partial \text{mul}} \cdot \frac{\partial \text{mul}}{\partial x_{\text{left}}} + \frac{\partial L}{\partial \text{mul}} \cdot \frac{\partial \text{mul}}{\partial x_{\text{right}}} = 1 \cdot x + 1 \cdot x = 2x

This summation rule is the multivariate chain rule in action and is one of the most common sources of interview errors.

Part 3 - Backpropagation in a 2-Layer Network

This is the canonical interview derivation. Memorize it, but more importantly, understand every step.

Network Setup

  • Input: xRd\mathbf{x} \in \mathbb{R}^{d} (a single training example)
  • Layer 1: z1=W1x+b1\mathbf{z}_1 = W_1 \mathbf{x} + \mathbf{b}_1, then h1=ReLU(z1)\mathbf{h}_1 = \text{ReLU}(\mathbf{z}_1)
  • Layer 2: z2=W2h1+b2\mathbf{z}_2 = W_2 \mathbf{h}_1 + \mathbf{b}_2, then y^=softmax(z2)\hat{\mathbf{y}} = \text{softmax}(\mathbf{z}_2)
  • Loss: L=kyklogy^kL = -\sum_k y_k \log \hat{y}_k (cross-entropy)

Where W1Rh×dW_1 \in \mathbb{R}^{h \times d}, W2Rc×hW_2 \in \mathbb{R}^{c \times h}, hh = hidden size, cc = number of classes.

2-Layer Network: Forward and Backward Pass

Forward Pass

  1. z1=W1x+b1\mathbf{z}_1 = W_1 \mathbf{x} + \mathbf{b}_1
  2. h1=max(0,z1)\mathbf{h}_1 = \max(0, \mathbf{z}_1) (element-wise ReLU)
  3. z2=W2h1+b2\mathbf{z}_2 = W_2 \mathbf{h}_1 + \mathbf{b}_2
  4. y^=softmax(z2)\hat{\mathbf{y}} = \text{softmax}(\mathbf{z}_2)
  5. L=kyklogy^kL = -\sum_k y_k \log \hat{y}_k

Backward Pass - Step by Step

Step 1: Gradient of loss w.r.t. softmax input z2\mathbf{z}_2

The combined softmax + cross-entropy gradient has a beautifully simple form:

Lz2=y^y\frac{\partial L}{\partial \mathbf{z}_2} = \hat{\mathbf{y}} - \mathbf{y}

This is one of the most elegant results in deep learning. The gradient is simply the prediction minus the one-hot label. If the true class is kk, then the gradient for class kk is y^k1\hat{y}_k - 1 and for all other classes jj it is y^j\hat{y}_j.

Company Variation

At Google and Meta, you may be asked to derive this result from scratch - first computing Ly^\frac{\partial L}{\partial \hat{\mathbf{y}}} and then y^z2\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{z}_2} (the Jacobian of softmax). At Amazon and startups, knowing the final result is usually sufficient.

Step 2: Gradients for Layer 2 parameters

LW2=Lz2h1T=(y^y)h1T\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \mathbf{z}_2} \cdot \mathbf{h}_1^T = (\hat{\mathbf{y}} - \mathbf{y}) \mathbf{h}_1^T

Lb2=Lz2=y^y\frac{\partial L}{\partial \mathbf{b}_2} = \frac{\partial L}{\partial \mathbf{z}_2} = \hat{\mathbf{y}} - \mathbf{y}

Note: LW2\frac{\partial L}{\partial W_2} has shape (c×h)(c \times h) - same as W2W_2. This is the outer product of the error signal and the hidden activations.

Step 3: Gradient flowing back to hidden layer

Lh1=W2TLz2=W2T(y^y)\frac{\partial L}{\partial \mathbf{h}_1} = W_2^T \frac{\partial L}{\partial \mathbf{z}_2} = W_2^T (\hat{\mathbf{y}} - \mathbf{y})

This is the key step - the gradient flows backward through the weight matrix by multiplying with its transpose.

Step 4: Gradient through ReLU

Lz1=Lh11[z1>0]\frac{\partial L}{\partial \mathbf{z}_1} = \frac{\partial L}{\partial \mathbf{h}_1} \odot \mathbb{1}[\mathbf{z}_1 > 0]

where \odot is element-wise multiplication and 1[z1>0]\mathbb{1}[\mathbf{z}_1 > 0] is 1 where z1>0\mathbf{z}_1 > 0 and 0 elsewhere. ReLU passes the gradient through unchanged where the input was positive, and blocks it where the input was negative or zero.

Step 5: Gradients for Layer 1 parameters

LW1=Lz1xT\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{z}_1} \cdot \mathbf{x}^T

Lb1=Lz1\frac{\partial L}{\partial \mathbf{b}_1} = \frac{\partial L}{\partial \mathbf{z}_1}

Summary of Gradient Shapes

QuantityShapeExpression
Lz2\frac{\partial L}{\partial \mathbf{z}_2}(c,)(c,)y^y\hat{\mathbf{y}} - \mathbf{y}
LW2\frac{\partial L}{\partial W_2}(c,h)(c, h)(y^y)h1T(\hat{\mathbf{y}} - \mathbf{y})\mathbf{h}_1^T
Lb2\frac{\partial L}{\partial \mathbf{b}_2}(c,)(c,)y^y\hat{\mathbf{y}} - \mathbf{y}
Lh1\frac{\partial L}{\partial \mathbf{h}_1}(h,)(h,)W2T(y^y)W_2^T(\hat{\mathbf{y}} - \mathbf{y})
Lz1\frac{\partial L}{\partial \mathbf{z}_1}(h,)(h,)W2T(y^y)1[z1>0]W_2^T(\hat{\mathbf{y}} - \mathbf{y}) \odot \mathbb{1}[\mathbf{z}_1 > 0]
LW1\frac{\partial L}{\partial W_1}(h,d)(h, d)Lz1xT\frac{\partial L}{\partial \mathbf{z}_1} \cdot \mathbf{x}^T
Lb1\frac{\partial L}{\partial \mathbf{b}_1}(d,)(d,)Lz1\frac{\partial L}{\partial \mathbf{z}_1}
Instant Rejection

If you write LW1=Lz1x\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{z}_1} \cdot \mathbf{x} (without the transpose), you will get an instant shape mismatch. Always check: the gradient of LL w.r.t. a parameter must have the same shape as the parameter. W1W_1 is (h,d)(h, d), so LW1\frac{\partial L}{\partial W_1} must be (h,d)(h, d) - which requires the outer product Lz1xT\frac{\partial L}{\partial \mathbf{z}_1} \cdot \mathbf{x}^T, where the shapes are (h,1)×(1,d)=(h,d)(h, 1) \times (1, d) = (h, d).

Part 4 - Vanishing and Exploding Gradients

The Core Problem

Consider a deep network with NN layers. The gradient of the loss w.r.t. the first layer's weights involves a product of N1N-1 Jacobians:

LW1=LzNi=2Nzizi1z1W1\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{z}_N} \cdot \prod_{i=2}^{N} \frac{\partial \mathbf{z}_i}{\partial \mathbf{z}_{i-1}} \cdot \frac{\partial \mathbf{z}_1}{\partial W_1}

Each factor zizi1\frac{\partial \mathbf{z}_i}{\partial \mathbf{z}_{i-1}} involves the weight matrix and the activation derivative. If these factors have eigenvalues consistently less than 1, the product shrinks exponentially. If greater than 1, it grows exponentially.

Vanishing Gradients - Mathematical Analysis

For a network with sigmoid activations, the derivative of sigmoid is:

σ(z)=σ(z)(1σ(z))0.25\sigma'(z) = \sigma(z)(1 - \sigma(z)) \leq 0.25

The maximum value is 0.25 (at z=0z = 0). So at each layer, the gradient is multiplied by at most 0.25 times the weight magnitude. For a 10-layer network:

LW1(0.25)10Wi106Wi\left|\frac{\partial L}{\partial W_1}\right| \propto (0.25)^{10} \cdot \prod \|W_i\| \approx 10^{-6} \cdot \prod \|W_i\|

Even with well-initialized weights, the gradient reaching the first layer is vanishingly small. The first layers stop learning while the last layers train normally.

Vanishing Gradients Across Layers with Sigmoid Activations

Exploding Gradients - Mathematical Analysis

Conversely, if the weight matrices have large spectral norms, the gradient product grows exponentially:

LW1i=2NWidiag(σ(zi))\left|\frac{\partial L}{\partial W_1}\right| \propto \prod_{i=2}^{N} \|W_i\| \cdot \|\text{diag}(\sigma'(\mathbf{z}_i))\|

With 100 layers and spectral norms slightly above 1 (say 1.5):

(1.5)1004×1017(1.5)^{100} \approx 4 \times 10^{17}

The gradients become astronomically large, causing NaN values or oscillating parameters.

The Gradient Flow Spectrum

Gradient Flow Spectrum: Vanishing, Stable, and Exploding Regimes

Complete Solution Table

ProblemSolutionHow It HelpsIntroduced In
Vanishing gradientsReLU activationReLU(z)=1\text{ReLU}'(z) = 1 for z>0z > 0, no shrinkageNair & Hinton, 2010
Vanishing gradientsSkip connections (ResNet)Gradient has additive path: x(F(x)+x)=F(x)+I\frac{\partial}{\partial x}(F(x) + x) = F'(x) + IHe et al., 2015
Vanishing gradientsLSTM gatingConstant error carousel preserves gradient flowHochreiter & Schmidhuber, 1997
Vanishing gradientsHe initializationSets Var(W)=2nin\text{Var}(W) = \frac{2}{n_{\text{in}}} to keep activation variance stableHe et al., 2015
Vanishing gradientsBatchNorm / LayerNormPrevents activations from reaching saturated regionsIoffe & Szegedy, 2015
Exploding gradientsGradient clippingCaps gradient norm: if g>τ\|\mathbf{g}\| > \tau then gτgg\mathbf{g} \leftarrow \tau \cdot \frac{\mathbf{g}}{\|\mathbf{g}\|}Pascanu et al., 2013
Exploding gradientsWeight regularizationL2 penalty keeps weight magnitudes boundedStandard
Exploding gradientsLower learning rateSmaller parameter updates even with large gradientsStandard
BothProper initializationXavier for sigmoid/tanh, He for ReLU - matches activation functionGlorot & Bengio, 2010

Initialization Deep Dive

The goal of initialization is to keep the variance of activations (and gradients) approximately constant across layers.

Xavier initialization (for sigmoid/tanh): WN(0,2nin+nout)W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)

Derivation: For a linear layer z=Wxz = Wx with Var(xi)=v\text{Var}(x_i) = v, we want Var(zj)=v\text{Var}(z_j) = v. Since zj=i=1ninwjixiz_j = \sum_{i=1}^{n_\text{in}} w_{ji} x_i:

Var(zj)=ninVar(wji)Var(xi)=ninVar(W)v\text{Var}(z_j) = n_{\text{in}} \cdot \text{Var}(w_{ji}) \cdot \text{Var}(x_i) = n_{\text{in}} \cdot \text{Var}(W) \cdot v

Setting Var(zj)=v\text{Var}(z_j) = v gives Var(W)=1nin\text{Var}(W) = \frac{1}{n_{\text{in}}}. Averaging the forward and backward constraints gives 2nin+nout\frac{2}{n_{\text{in}} + n_{\text{out}}}.

He initialization (for ReLU): WN(0,2nin)W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)

ReLU zeroes out half the activations on average, halving the variance. The factor of 2 compensates for this.

Interviewer's Perspective

"When I ask about vanishing gradients, I want the candidate to give me the mathematical reason - the product of Jacobians shrinking exponentially - not just say 'gradients get small.' Then I want them to connect it to specific solutions. A really strong candidate will point out that ResNet skip connections create an additive gradient path: x(F(x)+x)=F(x)+I\frac{\partial}{\partial x}(F(x) + x) = F'(x) + I, where the identity term ensures the gradient is at least 1 regardless of what F(x)F'(x) does."

Part 5 - Automatic Differentiation

Why Not Symbolic or Numerical Differentiation?

Symbolic differentiation (like in Mathematica) produces exact analytical derivatives but suffers from expression swell - expressions grow exponentially with depth, making it impractical for neural networks with millions of operations.

Numerical differentiation uses finite differences: fxif(x+ϵei)f(x)ϵ\frac{\partial f}{\partial x_i} \approx \frac{f(x + \epsilon e_i) - f(x)}{\epsilon}. This requires one forward pass per parameter - millions of forward passes for a neural network. Also suffers from numerical instability (choosing ϵ\epsilon too large introduces truncation error; too small introduces floating-point error).

Automatic differentiation (AD) is the best of both worlds: exact derivatives (no approximation error) computed efficiently by decomposing the program into elementary operations and applying the chain rule.

Forward-Mode AD

Forward-mode AD propagates derivatives alongside the computation. For each intermediate variable, we compute both its value and its derivative with respect to one chosen input variable.

Given f(x1,x2)f(x_1, x_2), to compute fx1\frac{\partial f}{\partial x_1}:

  • Seed: x˙1=1,x˙2=0\dot{x}_1 = 1, \dot{x}_2 = 0
  • At each operation v=op(a,b)v = \text{op}(a, b): compute v˙=opaa˙+opbb˙\dot{v} = \frac{\partial \text{op}}{\partial a}\dot{a} + \frac{\partial \text{op}}{\partial b}\dot{b}

This computes a Jacobian-vector product (JVP): JvJ \cdot \mathbf{v} where v\mathbf{v} is the seed vector.

Cost: One forward pass computes the derivative w.r.t. one input variable. For nn inputs, we need nn forward passes.

Example: Compute fx1\frac{\partial f}{\partial x_1} for f(x1,x2)=x1x2+sin(x1)f(x_1, x_2) = x_1 x_2 + \sin(x_1) at x1=2,x2=3x_1 = 2, x_2 = 3.

StepValueDerivative (v˙\dot{v}, seed x˙1=1\dot{x}_1 = 1)
v1=x1v_1 = x_12v˙1=1\dot{v}_1 = 1
v2=x2v_2 = x_23v˙2=0\dot{v}_2 = 0
v3=v1v2v_3 = v_1 \cdot v_26v˙3=v˙1v2+v1v˙2=3+0=3\dot{v}_3 = \dot{v}_1 \cdot v_2 + v_1 \cdot \dot{v}_2 = 3 + 0 = 3
v4=sin(v1)v_4 = \sin(v_1)0.909v˙4=cos(v1)v˙1=cos(2)1=0.416\dot{v}_4 = \cos(v_1) \cdot \dot{v}_1 = \cos(2) \cdot 1 = -0.416
v5=v3+v4v_5 = v_3 + v_46.909v˙5=v˙3+v˙4=3+(0.416)=2.584\dot{v}_5 = \dot{v}_3 + \dot{v}_4 = 3 + (-0.416) = 2.584

Result: fx1=2.584\frac{\partial f}{\partial x_1} = 2.584.

Reverse-Mode AD (Backpropagation)

Reverse-mode AD does a forward pass first (storing intermediate values), then a backward pass that computes derivatives with respect to all inputs in a single pass.

  • Forward: Compute and store all intermediate values
  • Backward: Starting from Lˉ=1\bar{L} = 1, propagate vˉi=j:vivjvˉjvjvi\bar{v}_i = \sum_{j: v_i \to v_j} \bar{v}_j \frac{\partial v_j}{\partial v_i}

This computes a vector-Jacobian product (VJP): vTJ\mathbf{v}^T \cdot J where v\mathbf{v} is the upstream gradient.

Cost: One forward pass + one backward pass computes derivatives w.r.t. all input variables.

Why Reverse Mode Wins for Deep Learning

AspectForward ModeReverse Mode
Cost per input variableO(1)O(1) passesAmortized across all inputs
Cost per output variableAmortized across all outputsO(1)O(1) passes
Total cost for nn inputs, 1 outputO(n)O(n) passesO(1)O(1) passes (one fwd + one bwd)
MemoryLow (no storage needed)High (must store all intermediates)
Best whenFew inputs, many outputsMany inputs, few outputs
ComputesJVP: JvJ \cdot \mathbf{v}VJP: vTJ\mathbf{v}^T \cdot J

Neural networks have millions of parameters (inputs to the loss function) but a single scalar loss (one output). Reverse-mode AD computes all gradients in one backward pass. This is why backpropagation is reverse-mode AD.

Forward Mode vs Reverse Mode Automatic Differentiation

Common Trap

Do NOT say "backpropagation IS gradient descent." Backpropagation computes gradients. Gradient descent uses those gradients to update parameters. They are separate steps. Backprop answers "which direction?", gradient descent answers "how far in that direction?" You can use backprop with Adam, SGD, or any other optimizer.

The Memory-Compute Tradeoff

Reverse-mode AD must store all intermediate activations from the forward pass (needed for the backward pass). For a network with NN layers, batch size BB, and hidden size hh, this is O(NBh)O(N \cdot B \cdot h) memory.

For a Transformer with 96 layers, batch size 32, sequence length 2048, and hidden size 12288 (GPT-3 scale):

  • Activation memory per layer: approximately 32×2048×12288×432 \times 2048 \times 12288 \times 4 bytes (float32) = ~3 GB
  • Total for 96 layers: ~288 GB - far exceeding GPU memory

Gradient checkpointing (also called activation checkpointing) trades memory for compute:

  • Only store activations at every kk-th layer
  • During the backward pass, recompute the missing activations from the nearest checkpoint
  • Reduces memory from O(N)O(N) to O(N)O(\sqrt{N}) with optimal checkpoint placement
  • Cost: one additional forward pass (roughly 33% more compute)

This is essential for training large models and is a common interview topic at companies training foundation models.

Part 6 - Local Gradients for Common Operations

Knowing these local gradients lets you derive the backward pass for any network architecture. This is your reference table.

Matrix Multiplication

If z=Wx\mathbf{z} = W\mathbf{x} and we receive Lz\frac{\partial L}{\partial \mathbf{z}} from upstream:

LW=LzxTLx=WTLz\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \mathbf{z}} \cdot \mathbf{x}^T \qquad \frac{\partial L}{\partial \mathbf{x}} = W^T \cdot \frac{\partial L}{\partial \mathbf{z}}

Memory trick: The gradient w.r.t. the weight is always upstream_grad @ input.T. The gradient w.r.t. the input is always weight.T @ upstream_grad.

Addition (Bias)

If z=a+b\mathbf{z} = \mathbf{a} + \mathbf{b}:

La=Lz,Lb=Lz\frac{\partial L}{\partial \mathbf{a}} = \frac{\partial L}{\partial \mathbf{z}}, \quad \frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{z}}

Addition distributes (copies) the gradient equally to both inputs.

Element-wise Operations

For any element-wise function hi=f(zi)h_i = f(z_i), the Jacobian is diagonal:

Lzi=Lhif(zi)\frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial h_i} \cdot f'(z_i)

This is just element-wise multiplication with the local derivative.

ReLU

Lz=Lh1[z>0]\frac{\partial L}{\partial \mathbf{z}} = \frac{\partial L}{\partial \mathbf{h}} \odot \mathbb{1}[\mathbf{z} > 0]

ReLU acts as a binary gate: gradient passes through where input was positive, is blocked where negative.

Sigmoid

Lz=Lhσ(z)(1σ(z))\frac{\partial L}{\partial z} = \frac{\partial L}{\partial h} \cdot \sigma(z)(1 - \sigma(z))

Note: σ(z)(1σ(z))0.25\sigma(z)(1 - \sigma(z)) \leq 0.25, which is why sigmoid causes vanishing gradients.

Tanh

Lz=Lh(1tanh2(z))\frac{\partial L}{\partial z} = \frac{\partial L}{\partial h} \cdot (1 - \tanh^2(z))

Maximum derivative is 1 (at z=0z = 0), better than sigmoid's 0.25 but still causes vanishing gradients for large z|z|.

Softmax + Cross-Entropy (Combined)

Lz=y^y\frac{\partial L}{\partial \mathbf{z}} = \hat{\mathbf{y}} - \mathbf{y}

Always compute the combined gradient - computing them separately requires the full softmax Jacobian.

Batch Normalization

For z^i=ziμσ2+ϵ\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}} where μ\mu and σ2\sigma^2 are batch statistics:

Lzi=1σ2+ϵ(Lz^i1BjLz^jz^iBjLz^jz^j)\frac{\partial L}{\partial z_i} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} \left( \frac{\partial L}{\partial \hat{z}_i} - \frac{1}{B}\sum_j \frac{\partial L}{\partial \hat{z}_j} - \frac{\hat{z}_i}{B}\sum_j \frac{\partial L}{\partial \hat{z}_j}\hat{z}_j \right)

The complexity arises because μ\mu and σ2\sigma^2 depend on all batch elements, so each element's gradient receives contributions from every other element.

Company Variation

Deriving the BatchNorm gradient is a classic Google/Meta interview question for senior roles. The key insight is that since μ\mu and σ2\sigma^2 are functions of the entire batch, the backward pass must account for these dependencies. Most candidates handle the ziμσ2+ϵ\frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}} part but forget the gradient contributions through μ\mu and σ2\sigma^2.

Part 7 - Numerical Example: Full Backprop Walkthrough

Let us work through a concrete example with actual numbers. This is what you would write on a whiteboard.

Network: 2 inputs, 2 hidden units (ReLU), 2 outputs (softmax + cross-entropy)

Parameters:

W1=[0.10.20.30.4]W_1 = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}, b1=[0.10.1]\mathbf{b}_1 = \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix}, W2=[0.50.60.70.8]W_2 = \begin{bmatrix} 0.5 & 0.6 \\ 0.7 & 0.8 \end{bmatrix}, b2=[0.10.1]\mathbf{b}_2 = \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix}

Input: x=[1.0,2.0]T\mathbf{x} = [1.0, 2.0]^T, Target: y=[1,0]T\mathbf{y} = [1, 0]^T (class 0)

Forward pass:

StepComputationResult
z1\mathbf{z}_1W1x+b1=[0.1(1)+0.2(2)+0.1, 0.3(1)+0.4(2)+0.1]TW_1\mathbf{x} + \mathbf{b}_1 = [0.1(1) + 0.2(2) + 0.1, \ 0.3(1) + 0.4(2) + 0.1]^T[0.6,1.2]T[0.6, 1.2]^T
h1\mathbf{h}_1ReLU(z1)\text{ReLU}(\mathbf{z}_1) - both positive[0.6,1.2]T[0.6, 1.2]^T
z2\mathbf{z}_2W2h1+b2=[0.5(0.6)+0.6(1.2)+0.1, 0.7(0.6)+0.8(1.2)+0.1]TW_2\mathbf{h}_1 + \mathbf{b}_2 = [0.5(0.6) + 0.6(1.2) + 0.1, \ 0.7(0.6) + 0.8(1.2) + 0.1]^T[1.12,1.48]T[1.12, 1.48]^T
y^\hat{\mathbf{y}}softmax([1.12,1.48])\text{softmax}([1.12, 1.48]): e1.12=3.065,e1.48=4.393e^{1.12} = 3.065, e^{1.48} = 4.393, sum =7.458= 7.458[0.411,0.589]T[0.411, 0.589]^T
LL(1ln0.411+0ln0.589)-(1 \cdot \ln 0.411 + 0 \cdot \ln 0.589)0.8890.889

Backward pass:

StepComputationResult
Lz2\frac{\partial L}{\partial \mathbf{z}_2}y^y=[0.4111, 0.5890]T\hat{\mathbf{y}} - \mathbf{y} = [0.411 - 1, \ 0.589 - 0]^T[0.589,0.589]T[-0.589, 0.589]^T
LW2\frac{\partial L}{\partial W_2}Lz2h1T\frac{\partial L}{\partial \mathbf{z}_2} \cdot \mathbf{h}_1^T[0.3530.7070.3530.707]\begin{bmatrix} -0.353 & -0.707 \\ 0.353 & 0.707 \end{bmatrix}
Lh1\frac{\partial L}{\partial \mathbf{h}_1}W2TLz2W_2^T \cdot \frac{\partial L}{\partial \mathbf{z}_2}[0.118,0.118]T[0.118, 0.118]^T
Lz1\frac{\partial L}{\partial \mathbf{z}_1}Lh1[1,1]T\frac{\partial L}{\partial \mathbf{h}_1} \odot [1, 1]^T (both z1z_1 positive)[0.118,0.118]T[0.118, 0.118]^T
LW1\frac{\partial L}{\partial W_1}Lz1xT\frac{\partial L}{\partial \mathbf{z}_1} \cdot \mathbf{x}^T[0.1180.2360.1180.236]\begin{bmatrix} 0.118 & 0.236 \\ 0.118 & 0.236 \end{bmatrix}

Shape verification: LW1\frac{\partial L}{\partial W_1} is (2×2)(2 \times 2), same as W1W_1. LW2\frac{\partial L}{\partial W_2} is (2×2)(2 \times 2), same as W2W_2. All shapes match.

Part 8 - Backprop Through Time (BPTT)

Backpropagation in recurrent neural networks is called Backpropagation Through Time (BPTT) because the RNN is "unrolled" across time steps, creating a very deep computational graph.

For a vanilla RNN: ht=tanh(Whht1+Wxxt+b)\mathbf{h}_t = \tanh(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b})

The gradient of the loss at time TT w.r.t. h1\mathbf{h}_1 involves:

LTh1=t=2Ththt1LThT\frac{\partial L_T}{\partial \mathbf{h}_1} = \prod_{t=2}^{T} \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} \cdot \frac{\partial L_T}{\partial \mathbf{h}_T}

Each factor is htht1=diag(tanh(zt))Wh\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \text{diag}(\tanh'(\mathbf{z}_t)) \cdot W_h.

Since tanh(z)1\tanh'(z) \leq 1 and the same WhW_h is reused at every step, this product either vanishes or explodes depending on the spectral radius of WhW_h. This is exactly why LSTMs were invented - they add a cell state with additive gradient flow, analogous to how ResNets add skip connections.

Truncated BPTT: In practice, we limit the backward pass to kk time steps (e.g., k=256k = 256) to manage both memory and vanishing gradients. This is a biased estimate of the true gradient but works well in practice.

Practice Problems

Problem 1: Computational Graph Gradient

Given f(x,y,z)=(x+y)zf(x, y, z) = (x + y) \cdot z, draw the computational graph and compute all partial derivatives when x=2,y=5,z=4x = -2, y = 5, z = -4.

Hint 1 - Direction

Break the expression into two operations: an addition and a multiplication. Each becomes a node in the graph.

Hint 2 - Insight

Forward: q=x+y=3q = x + y = 3, f=qz=12f = q \cdot z = -12. Backward: start with ff=1\frac{\partial f}{\partial f} = 1. The multiplication node has local gradients zz and qq for its two inputs.

Hint 3 - Full Solution + Rubric

Forward: q=x+y=3q = x + y = 3, f=qz=12f = q \cdot z = -12

Backward:

  • ff=1\frac{\partial f}{\partial f} = 1
  • fq=z=4\frac{\partial f}{\partial q} = z = -4 and fz=q=3\frac{\partial f}{\partial z} = q = 3
  • fx=fqqx=41=4\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = -4 \cdot 1 = -4
  • fy=fqqy=41=4\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial y} = -4 \cdot 1 = -4

Results: fx=4\frac{\partial f}{\partial x} = -4, fy=4\frac{\partial f}{\partial y} = -4, fz=3\frac{\partial f}{\partial z} = 3

Verification: fx=z=4\frac{\partial f}{\partial x} = z = -4, fy=z=4\frac{\partial f}{\partial y} = z = -4, fz=x+y=3\frac{\partial f}{\partial z} = x + y = 3. Matches.

Scoring Rubric:

  • Strong Hire: Draws correct graph, computes all gradients correctly, verifies with direct differentiation, explains the fan-out rule for the addition node
  • Lean Hire: Correct final answers but hesitates on the process or skips verification
  • No Hire: Cannot set up the computational graph or makes sign errors in the chain rule

Problem 2: Vanishing Gradient Analysis

A 5-layer network uses sigmoid activations with Xavier-initialized weights. Estimate the magnitude of the gradient at layer 1 relative to layer 5, and propose three solutions.

Hint 1 - Direction

The gradient at each layer is multiplied by σ(z)\sigma'(z) and the weight matrix. What is the maximum value of σ(z)\sigma'(z)?

Hint 2 - Insight

σ(z)0.25\sigma'(z) \leq 0.25. With Xavier initialization, Wi1\|W_i\| \approx 1. So the gradient at layer 1 is approximately (0.25)4(0.25)^4 times the gradient at layer 5.

Hint 3 - Full Solution + Rubric

With Xavier initialization, the expected singular values of WiW_i are approximately 1. The maximum of σ(z)\sigma'(z) is 0.25.

W1LW5Li=25(10.25)=(0.25)4=0.0039\frac{\|\nabla_{W_1} L\|}{\|\nabla_{W_5} L\|} \approx \prod_{i=2}^{5} (1 \cdot 0.25) = (0.25)^4 = 0.0039

The gradient at layer 1 is roughly 256x smaller than at layer 5.

Three solutions:

  1. Replace sigmoid with ReLU: gradient factor becomes 1 (for active units), eliminating the 0.25 multiplier
  2. Add skip connections (ResNet-style): gradient has a direct additive path bypassing the multiplications
  3. Use batch normalization: prevents activations from saturating in the flat regions of sigmoid

Scoring Rubric:

  • Strong Hire: Quantifies the decay factor as (0.25)4(0.25)^4, derives it mathematically, connects to 3+ solutions, mentions that ReLU has its own issues (dying ReLU)
  • Lean Hire: Knows gradients shrink exponentially with sigmoid, can state solutions but cannot quantify the decay
  • No Hire: Vaguely mentions "gradients get small" without mathematical reasoning

Problem 3: Implement Backward Pass

Given this forward pass in pseudocode, write the backward pass:

# Forward
z1 = W1 @ x + b1 # shape: (h,)
h1 = relu(z1) # shape: (h,)
z2 = W2 @ h1 + b2 # shape: (c,)
loss = cross_entropy_with_softmax(z2, y) # scalar
Hint 1 - Direction

Start from the loss and work backward. The combined softmax-cross-entropy gradient is y^y\hat{y} - y. For each forward operation, write the corresponding backward operation.

Hint 2 - Insight

For matmul backward: gradient w.r.t. weight is upstream @ input.T, gradient w.r.t. input is weight.T @ upstream. For ReLU backward: multiply by binary mask of where pre-activation was positive.

Hint 3 - Full Solution + Rubric
# Backward
dz2 = softmax(z2) - y # (c,)
dW2 = dz2[:, None] @ h1[None, :] # (c, h) outer product
db2 = dz2 # (c,)
dh1 = W2.T @ dz2 # (h,)
dz1 = dh1 * (z1 > 0).float() # (h,) ReLU mask
dW1 = dz1[:, None] @ x[None, :] # (h, d) outer product
db1 = dz1 # (h,)

Key shape checks:

  • dW2 has same shape as W2: (c, h) -- outer product of (c, 1) and (1, h)
  • dW1 has same shape as W1: (h, d) -- outer product of (h, 1) and (1, d)
  • ReLU mask uses z1 > 0 (pre-activation values)

Scoring Rubric:

  • Strong Hire: Correct shapes, correct operations, explains each step, explicitly verifies shapes match parameters
  • Lean Hire: Mostly correct but confuses one transpose or forgets the outer product notation
  • No Hire: Cannot write the backward pass in correct order or gets more than 2 operations wrong

Problem 4: Forward-Mode vs Reverse-Mode AD

You have a function f:R1000000Rf: \mathbb{R}^{1000000} \to \mathbb{R} (a neural network with 1M parameters and scalar loss). How many passes does forward-mode AD need? How many does reverse-mode need? Now consider g:RR1000g: \mathbb{R} \to \mathbb{R}^{1000}. Which mode is better for gg?

Hint 1 - Direction

Forward-mode computes derivatives w.r.t. one input per pass. Reverse-mode computes derivatives w.r.t. all inputs per pass but for one output.

Hint 2 - Insight

For ff: reverse-mode needs 1 forward + 1 backward pass for all 1M gradients. Forward-mode needs 1M passes. For gg: the analysis flips.

Hint 3 - Full Solution + Rubric

For f:R1000000Rf: \mathbb{R}^{1000000} \to \mathbb{R}:

  • Forward-mode: 1,000,000 passes (one per input dimension). Completely impractical.
  • Reverse-mode: 1 forward + 1 backward = 2 passes for all 1M gradients. This is why deep learning uses reverse-mode.

For g:RR1000g: \mathbb{R} \to \mathbb{R}^{1000}:

  • Forward-mode: 1 pass computes all 1000 output derivatives w.r.t. the single input. Efficient.
  • Reverse-mode: 1000 backward passes (one per output). Wasteful.
  • Forward-mode wins decisively.

The general rule: Use forward-mode when number of outputs >> number of inputs. Use reverse-mode when number of inputs >> number of outputs. Neural networks have millions of inputs (parameters) and one output (scalar loss), so reverse-mode always wins.

Scoring Rubric:

  • Strong Hire: Gives correct pass counts, states the general rule, mentions JVP vs VJP distinction, notes forward-mode is used in certain scientific computing and physics simulation contexts
  • Lean Hire: Correct pass counts for both cases and the general rule
  • No Hire: Cannot explain the asymmetry between forward and reverse mode

Problem 5: BatchNorm Backward

Explain conceptually why the BatchNorm backward pass is more complex than a simple element-wise operation. What are the three terms in the gradient, and what does each account for?

Hint 1 - Direction

BatchNorm normalizes using batch statistics μ\mu and σ2\sigma^2. These statistics depend on ALL elements in the batch, not just the current element.

Hint 2 - Insight

The gradient has three paths: (1) the direct path through the normalization, (2) the path through the mean μ\mu, and (3) the path through the variance σ2\sigma^2. The mean and variance create coupling between all batch elements.

Hint 3 - Full Solution + Rubric

BatchNorm computes z^i=ziμσ2+ϵ\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}} where μ=1Bjzj\mu = \frac{1}{B}\sum_j z_j and σ2=1Bj(zjμ)2\sigma^2 = \frac{1}{B}\sum_j (z_j - \mu)^2.

The gradient Lzi\frac{\partial L}{\partial z_i} has three components:

  1. Direct term: 1σ2+ϵLz^i\frac{1}{\sqrt{\sigma^2 + \epsilon}} \frac{\partial L}{\partial \hat{z}_i} - the straightforward gradient through the division, as if μ\mu and σ2\sigma^2 were constants.

  2. Mean term: 1Bσ2+ϵjLz^j-\frac{1}{B\sqrt{\sigma^2 + \epsilon}} \sum_j \frac{\partial L}{\partial \hat{z}_j} - accounts for the fact that changing ziz_i changes μ\mu, which affects all z^j\hat{z}_j. This term centers the gradients (subtracts their mean).

  3. Variance term: z^iBσ2+ϵjLz^jz^j-\frac{\hat{z}_i}{B\sqrt{\sigma^2 + \epsilon}} \sum_j \frac{\partial L}{\partial \hat{z}_j} \hat{z}_j - accounts for the fact that changing ziz_i changes σ2\sigma^2, which affects all z^j\hat{z}_j. This term prevents the gradient from scaling the variance.

The net effect: BatchNorm backward "whitens" the gradients - centering them (zero mean) and preventing variance explosion. This is one reason BatchNorm stabilizes training.

Scoring Rubric:

  • Strong Hire: Identifies all three paths, explains the coupling between batch elements, notes the whitening effect on gradients
  • Lean Hire: Knows BatchNorm backward is complex because of batch statistics, can identify at least 2 of 3 terms
  • No Hire: Treats BatchNorm as an element-wise operation or cannot explain why batch statistics complicate the gradient

Interview Cheat Sheet

ConceptKey Formula / FactCommon Mistakes
Chain ruleLx=Lyyx\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}Forgetting to sum over multiple paths (fan-out)
Matmul backwardLW=LzxT\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot x^T, Lx=WTLz\frac{\partial L}{\partial x} = W^T \cdot \frac{\partial L}{\partial z}Missing the transpose
ReLU backwardMultiply by 1[z>0]\mathbb{1}[z > 0]Using post-activation h>0h > 0 instead of pre-activation z>0z > 0
Softmax + CELz=y^y\frac{\partial L}{\partial z} = \hat{y} - yComputing softmax and CE gradients separately
Sigmoid derivativeσ(z)=σ(z)(1σ(z))0.25\sigma'(z) = \sigma(z)(1 - \sigma(z)) \leq 0.25Saying "gradients are zero" (they are small, not zero)
Vanishing gradientsProduct of Jacobians shrinks exponentiallyNot connecting to specific solutions (ReLU, skip connections, LSTM)
Exploding gradientsGradient norm grows exponentiallyNot mentioning gradient clipping as the standard fix
Forward-mode ADJVP, nn passes for nn inputsSaying it is "the same as" reverse mode
Reverse-mode ADVJP, 1 pass for all inputs (scalar output)Forgetting the memory cost of storing activations
Gradient checkpointingO(N)O(\sqrt{N}) memory, ~33% extra computeNot knowing this technique exists
He initializationVar(W)=2nin\text{Var}(W) = \frac{2}{n_{\text{in}}} for ReLUUsing Xavier init with ReLU (wrong by factor of 2)
BPTTUnrolled RNN = very deep networkNot connecting to vanishing gradient explanation

Spaced Repetition Checkpoints

Day 0 - After First Read

  • Write the chain rule in scalar, vector, and matrix forms
  • Draw a computational graph for L=(Wx+by)2L = (Wx + b - y)^2 and trace forward/backward passes
  • State the backward rule for: matmul, ReLU, sigmoid, softmax+CE

Day 3 - First Review

  • Derive backprop for a 2-layer network (ReLU + softmax) without looking at notes
  • Explain vanishing gradients with specific numbers: sigmoid max derivative of 0.25, exponential decay
  • List 3 solutions to vanishing gradients and why each works mathematically

Day 7 - Connections Review

  • Explain why reverse-mode AD is used in deep learning in under 60 seconds
  • Trace the numerical backprop example (Part 7) end-to-end without looking at the solution
  • Explain the memory-compute tradeoff and how gradient checkpointing addresses it

Day 14 - Interview Simulation

  • Derive all gradients for a 2-layer network on a whiteboard in under 10 minutes
  • Explain forward-mode vs reverse-mode AD: when each is used, JVP vs VJP, pass counts
  • Given an arbitrary computational graph with fan-out, trace the backward pass without hesitation

Day 21 - Final Calibration

  • Complete all 5 practice problems under time pressure (8 minutes each)
  • Connect backpropagation to: activation functions (gradient properties), CNNs (conv gradient), RNNs (BPTT), skip connections (additive gradient path)
  • Explain BatchNorm backward at a conceptual level (the three terms)
  • Give the 60-second answer for backpropagation cold, without preparation
© 2026 EngineersOfAI. All rights reserved.