Backpropagation - The Engine of Deep Learning
Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, ML Compiler Eng
The Real Interview Moment
You are in a Google MLE on-site interview. The interviewer draws a simple two-layer neural network on the whiteboard - input layer, one hidden layer with ReLU, one output layer with softmax, cross-entropy loss. She hands you the marker and says: "Compute the gradient of the loss with respect to the weights in the first layer. Show every step."
You start writing. You get through the softmax-cross-entropy gradient (you memorized that one), but then you hit the ReLU layer and hesitate. "Does the chain rule multiply or... compose?" The interviewer's expression does not change, but you can feel the evaluation shifting. She follows up: "Now tell me - what happens to this gradient when you have 100 layers instead of 2? And why does PyTorch compute all these gradients efficiently?"
This is backpropagation in interviews. Not a conceptual overview - a hands-on mathematical derivation where you must trace gradients through every operation, understand what goes wrong at scale, and explain how frameworks like PyTorch make it practical. This page gives you every tool you need.
What You Will Master
- State the chain rule in both scalar and matrix forms
- Draw computational graphs for arbitrary expressions and trace forward/backward passes
- Derive backpropagation for a 2-layer network with ReLU and softmax-cross-entropy loss
- Compute Jacobians for common operations: matrix multiply, ReLU, softmax, batch norm
- Explain vanishing gradients mathematically and connect them to activation functions and depth
- Explain exploding gradients and the role of gradient clipping
- Distinguish forward-mode and reverse-mode automatic differentiation
- Justify why reverse-mode AD is used in deep learning (one backward pass for all parameters)
- Implement a simple autograd engine conceptually (define forward, store graph, backward)
- Answer backpropagation interview questions at whiteboard speed
Self-Assessment: Where Are You Now?
| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Explain | 4 -- Can Derive | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| State the chain rule for compositions | ___ | |||||
| Draw a computational graph for | ___ | |||||
| Trace forward pass values on a graph | ___ | |||||
| Trace backward pass gradients on a graph | ___ | |||||
| Derive gradients for a 2-layer network | ___ | |||||
| Explain vanishing gradients mathematically | ___ | |||||
| Explain forward vs reverse mode AD | ___ | |||||
| Implement simple autograd conceptually | ___ |
Target: All 4s and 5s before your interview.
Part 1 - The Chain Rule: From Calculus to Computation
Scalar Chain Rule
If , then:
This is the single most important equation in deep learning. Every gradient computation in every neural network is an application of this rule, composed many times.
Example: Let . Define and .
Multivariate Chain Rule
When a variable influences the output through multiple paths, we sum over all paths:
This is critical in neural networks because a single weight can affect the loss through multiple neurons.
"Backpropagation is just the chain rule applied systematically on a computational graph. We do a forward pass to compute the loss, then a backward pass where we propagate gradients from the loss back to every parameter. At each node, we multiply the incoming gradient by the local derivative - that is the chain rule. The key insight is that reverse-mode differentiation lets us compute gradients with respect to all parameters in a single backward pass, which is why it is used in deep learning where we have millions of parameters but a single scalar loss."
Vector/Matrix Chain Rule
For neural networks, we work with vectors and matrices. If where and , the Jacobian is:
For a chain where is scalar:
In practice, we never explicitly construct Jacobian matrices. We work with vector-Jacobian products (VJPs) - multiplying the upstream gradient vector by the Jacobian without materializing the full matrix.
Do NOT confuse the Jacobian with the gradient. The gradient is a vector of partial derivatives of a scalar with respect to a vector. The Jacobian is a matrix of partial derivatives of a vector with respect to a vector. In backprop, we always propagate gradients of the scalar loss, so we compute VJPs, not full Jacobians.
Part 2 - Computational Graphs
What Is a Computational Graph?
A computational graph is a directed acyclic graph (DAG) where:
- Leaf nodes are inputs and parameters
- Internal nodes are operations (add, multiply, matmul, ReLU, etc.)
- Edges represent data flow
- The root is the loss value
Every expression can be decomposed into a computational graph. This is exactly what PyTorch and TensorFlow do under the hood.
Example: Linear Regression Loss
Consider for a single data point.
Forward Pass
We compute values left-to-right (inputs to loss). Let .
| Node | Computation | Value |
|---|---|---|
| mul | 6 | |
| add1 | 7 | |
| sub | ||
| sq | 1 | |
| L | = sq | 1 |
Backward Pass
We compute gradients right-to-left (loss to parameters). We need and .
Start at the loss: .
| Node | Local Derivative | Upstream Gradient | Gradient |
|---|---|---|---|
| sq | 1 | ||
| sub | |||
| add1 (to b) | |||
| add1 (to mul) | |||
| mul (to w) |
Result: , .
Verification: . By direct differentiation: . Correct.
"I ask candidates to trace a backward pass on a computational graph because it tells me whether they truly understand backprop or just memorized formulas. If they can handle the graph for a 2-layer network, they will not panic when I ask about more complex architectures. The key thing I watch for is: do they correctly handle nodes where the gradient flows through multiple paths? That is where most candidates make mistakes."
Fan-Out: When a Variable Is Used Twice
Consider . The node fans out to two inputs of the multiply operation.
In the backward pass, when a variable feeds into multiple downstream operations, we sum the gradients from all paths:
This summation rule is the multivariate chain rule in action and is one of the most common sources of interview errors.
Part 3 - Backpropagation in a 2-Layer Network
This is the canonical interview derivation. Memorize it, but more importantly, understand every step.
Network Setup
- Input: (a single training example)
- Layer 1: , then
- Layer 2: , then
- Loss: (cross-entropy)
Where , , = hidden size, = number of classes.
Forward Pass
- (element-wise ReLU)
Backward Pass - Step by Step
Step 1: Gradient of loss w.r.t. softmax input
The combined softmax + cross-entropy gradient has a beautifully simple form:
This is one of the most elegant results in deep learning. The gradient is simply the prediction minus the one-hot label. If the true class is , then the gradient for class is and for all other classes it is .
At Google and Meta, you may be asked to derive this result from scratch - first computing and then (the Jacobian of softmax). At Amazon and startups, knowing the final result is usually sufficient.
Step 2: Gradients for Layer 2 parameters
Note: has shape - same as . This is the outer product of the error signal and the hidden activations.
Step 3: Gradient flowing back to hidden layer
This is the key step - the gradient flows backward through the weight matrix by multiplying with its transpose.
Step 4: Gradient through ReLU
where is element-wise multiplication and is 1 where and 0 elsewhere. ReLU passes the gradient through unchanged where the input was positive, and blocks it where the input was negative or zero.
Step 5: Gradients for Layer 1 parameters
Summary of Gradient Shapes
| Quantity | Shape | Expression |
|---|---|---|
If you write (without the transpose), you will get an instant shape mismatch. Always check: the gradient of w.r.t. a parameter must have the same shape as the parameter. is , so must be - which requires the outer product , where the shapes are .
Part 4 - Vanishing and Exploding Gradients
The Core Problem
Consider a deep network with layers. The gradient of the loss w.r.t. the first layer's weights involves a product of Jacobians:
Each factor involves the weight matrix and the activation derivative. If these factors have eigenvalues consistently less than 1, the product shrinks exponentially. If greater than 1, it grows exponentially.
Vanishing Gradients - Mathematical Analysis
For a network with sigmoid activations, the derivative of sigmoid is:
The maximum value is 0.25 (at ). So at each layer, the gradient is multiplied by at most 0.25 times the weight magnitude. For a 10-layer network:
Even with well-initialized weights, the gradient reaching the first layer is vanishingly small. The first layers stop learning while the last layers train normally.
Exploding Gradients - Mathematical Analysis
Conversely, if the weight matrices have large spectral norms, the gradient product grows exponentially:
With 100 layers and spectral norms slightly above 1 (say 1.5):
The gradients become astronomically large, causing NaN values or oscillating parameters.
The Gradient Flow Spectrum
Complete Solution Table
| Problem | Solution | How It Helps | Introduced In |
|---|---|---|---|
| Vanishing gradients | ReLU activation | for , no shrinkage | Nair & Hinton, 2010 |
| Vanishing gradients | Skip connections (ResNet) | Gradient has additive path: | He et al., 2015 |
| Vanishing gradients | LSTM gating | Constant error carousel preserves gradient flow | Hochreiter & Schmidhuber, 1997 |
| Vanishing gradients | He initialization | Sets to keep activation variance stable | He et al., 2015 |
| Vanishing gradients | BatchNorm / LayerNorm | Prevents activations from reaching saturated regions | Ioffe & Szegedy, 2015 |
| Exploding gradients | Gradient clipping | Caps gradient norm: if then | Pascanu et al., 2013 |
| Exploding gradients | Weight regularization | L2 penalty keeps weight magnitudes bounded | Standard |
| Exploding gradients | Lower learning rate | Smaller parameter updates even with large gradients | Standard |
| Both | Proper initialization | Xavier for sigmoid/tanh, He for ReLU - matches activation function | Glorot & Bengio, 2010 |
Initialization Deep Dive
The goal of initialization is to keep the variance of activations (and gradients) approximately constant across layers.
Xavier initialization (for sigmoid/tanh):
Derivation: For a linear layer with , we want . Since :
Setting gives . Averaging the forward and backward constraints gives .
He initialization (for ReLU):
ReLU zeroes out half the activations on average, halving the variance. The factor of 2 compensates for this.
"When I ask about vanishing gradients, I want the candidate to give me the mathematical reason - the product of Jacobians shrinking exponentially - not just say 'gradients get small.' Then I want them to connect it to specific solutions. A really strong candidate will point out that ResNet skip connections create an additive gradient path: , where the identity term ensures the gradient is at least 1 regardless of what does."
Part 5 - Automatic Differentiation
Why Not Symbolic or Numerical Differentiation?
Symbolic differentiation (like in Mathematica) produces exact analytical derivatives but suffers from expression swell - expressions grow exponentially with depth, making it impractical for neural networks with millions of operations.
Numerical differentiation uses finite differences: . This requires one forward pass per parameter - millions of forward passes for a neural network. Also suffers from numerical instability (choosing too large introduces truncation error; too small introduces floating-point error).
Automatic differentiation (AD) is the best of both worlds: exact derivatives (no approximation error) computed efficiently by decomposing the program into elementary operations and applying the chain rule.
Forward-Mode AD
Forward-mode AD propagates derivatives alongside the computation. For each intermediate variable, we compute both its value and its derivative with respect to one chosen input variable.
Given , to compute :
- Seed:
- At each operation : compute
This computes a Jacobian-vector product (JVP): where is the seed vector.
Cost: One forward pass computes the derivative w.r.t. one input variable. For inputs, we need forward passes.
Example: Compute for at .
| Step | Value | Derivative (, seed ) |
|---|---|---|
| 2 | ||
| 3 | ||
| 6 | ||
| 0.909 | ||
| 6.909 |
Result: .
Reverse-Mode AD (Backpropagation)
Reverse-mode AD does a forward pass first (storing intermediate values), then a backward pass that computes derivatives with respect to all inputs in a single pass.
- Forward: Compute and store all intermediate values
- Backward: Starting from , propagate
This computes a vector-Jacobian product (VJP): where is the upstream gradient.
Cost: One forward pass + one backward pass computes derivatives w.r.t. all input variables.
Why Reverse Mode Wins for Deep Learning
| Aspect | Forward Mode | Reverse Mode |
|---|---|---|
| Cost per input variable | passes | Amortized across all inputs |
| Cost per output variable | Amortized across all outputs | passes |
| Total cost for inputs, 1 output | passes | passes (one fwd + one bwd) |
| Memory | Low (no storage needed) | High (must store all intermediates) |
| Best when | Few inputs, many outputs | Many inputs, few outputs |
| Computes | JVP: | VJP: |
Neural networks have millions of parameters (inputs to the loss function) but a single scalar loss (one output). Reverse-mode AD computes all gradients in one backward pass. This is why backpropagation is reverse-mode AD.
Do NOT say "backpropagation IS gradient descent." Backpropagation computes gradients. Gradient descent uses those gradients to update parameters. They are separate steps. Backprop answers "which direction?", gradient descent answers "how far in that direction?" You can use backprop with Adam, SGD, or any other optimizer.
The Memory-Compute Tradeoff
Reverse-mode AD must store all intermediate activations from the forward pass (needed for the backward pass). For a network with layers, batch size , and hidden size , this is memory.
For a Transformer with 96 layers, batch size 32, sequence length 2048, and hidden size 12288 (GPT-3 scale):
- Activation memory per layer: approximately bytes (float32) = ~3 GB
- Total for 96 layers: ~288 GB - far exceeding GPU memory
Gradient checkpointing (also called activation checkpointing) trades memory for compute:
- Only store activations at every -th layer
- During the backward pass, recompute the missing activations from the nearest checkpoint
- Reduces memory from to with optimal checkpoint placement
- Cost: one additional forward pass (roughly 33% more compute)
This is essential for training large models and is a common interview topic at companies training foundation models.
Part 6 - Local Gradients for Common Operations
Knowing these local gradients lets you derive the backward pass for any network architecture. This is your reference table.
Matrix Multiplication
If and we receive from upstream:
Memory trick: The gradient w.r.t. the weight is always upstream_grad @ input.T. The gradient w.r.t. the input is always weight.T @ upstream_grad.
Addition (Bias)
If :
Addition distributes (copies) the gradient equally to both inputs.
Element-wise Operations
For any element-wise function , the Jacobian is diagonal:
This is just element-wise multiplication with the local derivative.
ReLU
ReLU acts as a binary gate: gradient passes through where input was positive, is blocked where negative.
Sigmoid
Note: , which is why sigmoid causes vanishing gradients.
Tanh
Maximum derivative is 1 (at ), better than sigmoid's 0.25 but still causes vanishing gradients for large .
Softmax + Cross-Entropy (Combined)
Always compute the combined gradient - computing them separately requires the full softmax Jacobian.
Batch Normalization
For where and are batch statistics:
The complexity arises because and depend on all batch elements, so each element's gradient receives contributions from every other element.
Deriving the BatchNorm gradient is a classic Google/Meta interview question for senior roles. The key insight is that since and are functions of the entire batch, the backward pass must account for these dependencies. Most candidates handle the part but forget the gradient contributions through and .
Part 7 - Numerical Example: Full Backprop Walkthrough
Let us work through a concrete example with actual numbers. This is what you would write on a whiteboard.
Network: 2 inputs, 2 hidden units (ReLU), 2 outputs (softmax + cross-entropy)
Parameters:
, , ,
Input: , Target: (class 0)
Forward pass:
| Step | Computation | Result |
|---|---|---|
| - both positive | ||
| : , sum | ||
Backward pass:
| Step | Computation | Result |
|---|---|---|
| (both positive) | ||
Shape verification: is , same as . is , same as . All shapes match.
Part 8 - Backprop Through Time (BPTT)
Backpropagation in recurrent neural networks is called Backpropagation Through Time (BPTT) because the RNN is "unrolled" across time steps, creating a very deep computational graph.
For a vanilla RNN:
The gradient of the loss at time w.r.t. involves:
Each factor is .
Since and the same is reused at every step, this product either vanishes or explodes depending on the spectral radius of . This is exactly why LSTMs were invented - they add a cell state with additive gradient flow, analogous to how ResNets add skip connections.
Truncated BPTT: In practice, we limit the backward pass to time steps (e.g., ) to manage both memory and vanishing gradients. This is a biased estimate of the true gradient but works well in practice.
Practice Problems
Problem 1: Computational Graph Gradient
Given , draw the computational graph and compute all partial derivatives when .
Hint 1 - Direction
Break the expression into two operations: an addition and a multiplication. Each becomes a node in the graph.
Hint 2 - Insight
Forward: , . Backward: start with . The multiplication node has local gradients and for its two inputs.
Hint 3 - Full Solution + Rubric
Forward: ,
Backward:
- and
Results: , ,
Verification: , , . Matches.
Scoring Rubric:
- Strong Hire: Draws correct graph, computes all gradients correctly, verifies with direct differentiation, explains the fan-out rule for the addition node
- Lean Hire: Correct final answers but hesitates on the process or skips verification
- No Hire: Cannot set up the computational graph or makes sign errors in the chain rule
Problem 2: Vanishing Gradient Analysis
A 5-layer network uses sigmoid activations with Xavier-initialized weights. Estimate the magnitude of the gradient at layer 1 relative to layer 5, and propose three solutions.
Hint 1 - Direction
The gradient at each layer is multiplied by and the weight matrix. What is the maximum value of ?
Hint 2 - Insight
. With Xavier initialization, . So the gradient at layer 1 is approximately times the gradient at layer 5.
Hint 3 - Full Solution + Rubric
With Xavier initialization, the expected singular values of are approximately 1. The maximum of is 0.25.
The gradient at layer 1 is roughly 256x smaller than at layer 5.
Three solutions:
- Replace sigmoid with ReLU: gradient factor becomes 1 (for active units), eliminating the 0.25 multiplier
- Add skip connections (ResNet-style): gradient has a direct additive path bypassing the multiplications
- Use batch normalization: prevents activations from saturating in the flat regions of sigmoid
Scoring Rubric:
- Strong Hire: Quantifies the decay factor as , derives it mathematically, connects to 3+ solutions, mentions that ReLU has its own issues (dying ReLU)
- Lean Hire: Knows gradients shrink exponentially with sigmoid, can state solutions but cannot quantify the decay
- No Hire: Vaguely mentions "gradients get small" without mathematical reasoning
Problem 3: Implement Backward Pass
Given this forward pass in pseudocode, write the backward pass:
# Forward
z1 = W1 @ x + b1 # shape: (h,)
h1 = relu(z1) # shape: (h,)
z2 = W2 @ h1 + b2 # shape: (c,)
loss = cross_entropy_with_softmax(z2, y) # scalar
Hint 1 - Direction
Start from the loss and work backward. The combined softmax-cross-entropy gradient is . For each forward operation, write the corresponding backward operation.
Hint 2 - Insight
For matmul backward: gradient w.r.t. weight is upstream @ input.T, gradient w.r.t. input is weight.T @ upstream. For ReLU backward: multiply by binary mask of where pre-activation was positive.
Hint 3 - Full Solution + Rubric
# Backward
dz2 = softmax(z2) - y # (c,)
dW2 = dz2[:, None] @ h1[None, :] # (c, h) outer product
db2 = dz2 # (c,)
dh1 = W2.T @ dz2 # (h,)
dz1 = dh1 * (z1 > 0).float() # (h,) ReLU mask
dW1 = dz1[:, None] @ x[None, :] # (h, d) outer product
db1 = dz1 # (h,)
Key shape checks:
dW2has same shape asW2:(c, h)-- outer product of(c, 1)and(1, h)dW1has same shape asW1:(h, d)-- outer product of(h, 1)and(1, d)- ReLU mask uses
z1 > 0(pre-activation values)
Scoring Rubric:
- Strong Hire: Correct shapes, correct operations, explains each step, explicitly verifies shapes match parameters
- Lean Hire: Mostly correct but confuses one transpose or forgets the outer product notation
- No Hire: Cannot write the backward pass in correct order or gets more than 2 operations wrong
Problem 4: Forward-Mode vs Reverse-Mode AD
You have a function (a neural network with 1M parameters and scalar loss). How many passes does forward-mode AD need? How many does reverse-mode need? Now consider . Which mode is better for ?
Hint 1 - Direction
Forward-mode computes derivatives w.r.t. one input per pass. Reverse-mode computes derivatives w.r.t. all inputs per pass but for one output.
Hint 2 - Insight
For : reverse-mode needs 1 forward + 1 backward pass for all 1M gradients. Forward-mode needs 1M passes. For : the analysis flips.
Hint 3 - Full Solution + Rubric
For :
- Forward-mode: 1,000,000 passes (one per input dimension). Completely impractical.
- Reverse-mode: 1 forward + 1 backward = 2 passes for all 1M gradients. This is why deep learning uses reverse-mode.
For :
- Forward-mode: 1 pass computes all 1000 output derivatives w.r.t. the single input. Efficient.
- Reverse-mode: 1000 backward passes (one per output). Wasteful.
- Forward-mode wins decisively.
The general rule: Use forward-mode when number of outputs >> number of inputs. Use reverse-mode when number of inputs >> number of outputs. Neural networks have millions of inputs (parameters) and one output (scalar loss), so reverse-mode always wins.
Scoring Rubric:
- Strong Hire: Gives correct pass counts, states the general rule, mentions JVP vs VJP distinction, notes forward-mode is used in certain scientific computing and physics simulation contexts
- Lean Hire: Correct pass counts for both cases and the general rule
- No Hire: Cannot explain the asymmetry between forward and reverse mode
Problem 5: BatchNorm Backward
Explain conceptually why the BatchNorm backward pass is more complex than a simple element-wise operation. What are the three terms in the gradient, and what does each account for?
Hint 1 - Direction
BatchNorm normalizes using batch statistics and . These statistics depend on ALL elements in the batch, not just the current element.
Hint 2 - Insight
The gradient has three paths: (1) the direct path through the normalization, (2) the path through the mean , and (3) the path through the variance . The mean and variance create coupling between all batch elements.
Hint 3 - Full Solution + Rubric
BatchNorm computes where and .
The gradient has three components:
-
Direct term: - the straightforward gradient through the division, as if and were constants.
-
Mean term: - accounts for the fact that changing changes , which affects all . This term centers the gradients (subtracts their mean).
-
Variance term: - accounts for the fact that changing changes , which affects all . This term prevents the gradient from scaling the variance.
The net effect: BatchNorm backward "whitens" the gradients - centering them (zero mean) and preventing variance explosion. This is one reason BatchNorm stabilizes training.
Scoring Rubric:
- Strong Hire: Identifies all three paths, explains the coupling between batch elements, notes the whitening effect on gradients
- Lean Hire: Knows BatchNorm backward is complex because of batch statistics, can identify at least 2 of 3 terms
- No Hire: Treats BatchNorm as an element-wise operation or cannot explain why batch statistics complicate the gradient
Interview Cheat Sheet
| Concept | Key Formula / Fact | Common Mistakes |
|---|---|---|
| Chain rule | Forgetting to sum over multiple paths (fan-out) | |
| Matmul backward | , | Missing the transpose |
| ReLU backward | Multiply by | Using post-activation instead of pre-activation |
| Softmax + CE | Computing softmax and CE gradients separately | |
| Sigmoid derivative | Saying "gradients are zero" (they are small, not zero) | |
| Vanishing gradients | Product of Jacobians shrinks exponentially | Not connecting to specific solutions (ReLU, skip connections, LSTM) |
| Exploding gradients | Gradient norm grows exponentially | Not mentioning gradient clipping as the standard fix |
| Forward-mode AD | JVP, passes for inputs | Saying it is "the same as" reverse mode |
| Reverse-mode AD | VJP, 1 pass for all inputs (scalar output) | Forgetting the memory cost of storing activations |
| Gradient checkpointing | memory, ~33% extra compute | Not knowing this technique exists |
| He initialization | for ReLU | Using Xavier init with ReLU (wrong by factor of 2) |
| BPTT | Unrolled RNN = very deep network | Not connecting to vanishing gradient explanation |
Spaced Repetition Checkpoints
Day 0 - After First Read
- Write the chain rule in scalar, vector, and matrix forms
- Draw a computational graph for and trace forward/backward passes
- State the backward rule for: matmul, ReLU, sigmoid, softmax+CE
Day 3 - First Review
- Derive backprop for a 2-layer network (ReLU + softmax) without looking at notes
- Explain vanishing gradients with specific numbers: sigmoid max derivative of 0.25, exponential decay
- List 3 solutions to vanishing gradients and why each works mathematically
Day 7 - Connections Review
- Explain why reverse-mode AD is used in deep learning in under 60 seconds
- Trace the numerical backprop example (Part 7) end-to-end without looking at the solution
- Explain the memory-compute tradeoff and how gradient checkpointing addresses it
Day 14 - Interview Simulation
- Derive all gradients for a 2-layer network on a whiteboard in under 10 minutes
- Explain forward-mode vs reverse-mode AD: when each is used, JVP vs VJP, pass counts
- Given an arbitrary computational graph with fan-out, trace the backward pass without hesitation
Day 21 - Final Calibration
- Complete all 5 practice problems under time pressure (8 minutes each)
- Connect backpropagation to: activation functions (gradient properties), CNNs (conv gradient), RNNs (BPTT), skip connections (additive gradient path)
- Explain BatchNorm backward at a conceptual level (the three terms)
- Give the 60-second answer for backpropagation cold, without preparation
