ResNet - The Paper That Made Depth Possible

Reading time: ~45 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, Computer Vision Engineer

The Real Interview Moment

You are in a Meta AI research interview. The interviewer shows you a plot: training error of a 56-layer plain network is higher than that of a 20-layer plain network. She asks: "This is not overfitting - both curves are on the training set. Why does the deeper network perform worse on training data, and how did He et al. solve this problem?"

You explain the degradation problem, and she follows up: "Write the mathematical formulation of a residual block. Now prove to me why gradients flow better through skip connections than through plain layers. Specifically, compute $\frac{\partial \mathcal{L}}{\partial x_l}$ for a residual network and show me what happens as the network gets deep."

This question separates candidates who have memorized "skip connections help gradients" from those who understand the mathematics. The degradation problem is counterintuitive - deeper networks should perform at least as well as shallower ones because they can learn identity mappings. The fact that they do not reveals a fundamental optimization difficulty. ResNet's skip connections provide a direct solution grounded in gradient flow mathematics.

What You Will Master

  • Explain the degradation problem and why it is not overfitting
  • Derive the residual learning formulation and its mathematical properties
  • Prove why gradients flow better through skip connections
  • Describe identity mappings and pre-activation residual blocks
  • Draw ResNet architectures (18, 34, 50, 101, 152) from memory
  • Explain bottleneck blocks and their efficiency
  • Discuss ResNet's impact on modern architectures (Transformers, DenseNet, U-Net)

Self-Assessment: Where Are You Now?

| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Explain the degradation problem | | | | | | ___ |
| Write the residual learning formulation | | | | | | ___ |
| Derive gradient flow through skip connections | | | | | | ___ |
| Explain identity mappings (ResNet v2) | | | | | | ___ |
| Draw bottleneck vs basic residual blocks | | | | | | ___ |
| Describe ResNet-50 architecture | | | | | | ___ |
| Explain skip connections in Transformers | | | | | | ___ |
| Compare ResNet to VGG and Inception | | | | | | ___ |
| Discuss modern uses of residual learning | | | | | | ___ |
| Implement a residual block from scratch | | | | | | ___ |

Target: All 4s and 5s before your interview.

Part 1 - The Degradation Problem

The Puzzle

By 2015, the prevailing belief was simple: deeper networks are more expressive, so they should perform better. VGG had shown that going from 11 to 19 layers improved ImageNet accuracy. The natural question was: why not go to 50 or 100 layers?

He et al. tried exactly this and discovered something surprising. When they compared a 20-layer and 56-layer plain (no skip connections) network:

  • The 56-layer network had higher training error than the 20-layer network
  • This was not overfitting - the 56-layer network was worse on training data

The Degradation Problem: Deeper Networks Train Worse

Why This Is Counterintuitive

A 56-layer network is at least as expressive as a 20-layer network. In theory, the 56-layer network could learn the identity function for 36 of its layers and replicate the 20-layer network exactly. The fact that optimization fails to find this solution reveals that the problem is not about model capacity - it is about optimization difficulty.
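The identity-mapping argument can be checked on a toy example. The sketch below is an illustration, not an experiment from the paper: `forward` is a hypothetical helper, layers are plain matrix multiplies followed by ReLU, and the extra layers are set to identity matrices by hand rather than learned. Because activations after a ReLU are non-negative, an identity layer followed by another ReLU leaves them unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def forward(weights, x):
    """Apply a stack of linear layers, each followed by ReLU."""
    for W in weights:
        x = relu(W @ x)
    return x

dim = 16
# A "shallow" 3-layer network with He-style random weights.
shallow = [rng.standard_normal((dim, dim)) * np.sqrt(2 / dim) for _ in range(3)]
# A "deep" 8-layer network: the same 3 layers plus 5 hand-built identity layers.
deep = shallow + [np.eye(dim) for _ in range(5)]

x = rng.standard_normal(dim)
y_shallow = forward(shallow, x)
y_deep = forward(deep, x)
print("Outputs identical:", np.array_equal(y_shallow, y_deep))
```

The deeper stack reproduces the shallow one exactly, so a solution that is no worse exists. The degradation problem is that gradient-based training does not find it.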

The deeper network has a harder loss landscape. Gradients must flow through more layers during backpropagation, and the composition of many nonlinear functions creates optimization challenges that standard SGD cannot overcome.

Instant Rejection

If asked "Why do deeper networks perform worse?" and you answer "Because of overfitting" or "Because of vanishing gradients" - both are wrong. The degradation problem is specifically about training error increasing with depth, which rules out overfitting. And while vanishing gradients are related, they were largely addressed by batch normalization and careful initialization. The degradation problem persists even with these techniques. The correct answer is that the loss landscape becomes harder to optimize - the network cannot easily learn identity mappings for the unnecessary layers.

Why Not Just Vanishing Gradients?

By 2015, vanishing gradients were partially solved:

  • Batch normalization (Ioffe & Szegedy, 2015) stabilized activations
  • ReLU activations provided non-saturating gradients
  • Careful initialization (He initialization) calibrated variance

Yet the degradation problem persisted. He et al. showed that even with batch normalization, 56-layer plain networks degraded. The problem was more fundamental than gradient magnitude - it was about the optimization landscape itself.

60-Second Answer

"The degradation problem is the observation that adding more layers to a deep network increases training error - not just test error. This is not overfitting, because the deeper network performs worse on training data too. It is not simply vanishing gradients, because it persists with batch normalization and ReLU. The core issue is that deep plain networks have difficulty learning identity mappings: in theory, the extra layers could learn to be identity functions, but in practice, optimizers cannot find these solutions in the complex loss landscape."

Part 2 - The Residual Learning Formulation

The Core Idea

Instead of learning the desired mapping $H(x)$ directly, learn the residual $F(x) = H(x) - x$:

$$y = F(x) + x$$

If the optimal mapping is close to identity (the extra layers are unnecessary), then $F(x) \approx 0$ is much easier to learn than $H(x) \approx x$. Pushing weights toward zero is trivial for any optimizer - it is the default behavior of weight decay.

Residual Block: Plain vs Skip Connection

Mathematical Formulation

A basic residual block with two layers:

$$y = F(x, \{W_i\}) + x$$

Where $F$ represents the residual mapping. For a block with two convolutional layers:

$$F(x) = W_2 \cdot \sigma(W_1 \cdot x + b_1) + b_2$$

Here $\sigma$ is the ReLU activation and $W_1, W_2$ are weight matrices (convolutional filters in practice).

The skip connection adds $x$ directly to the output, requiring no parameters and only a negligible element-wise addition.

Dimension Matching

The skip connection $x + F(x)$ requires that $x$ and $F(x)$ have the same dimensions. When they do not (e.g., when changing the number of channels or spatial resolution), two options exist:

Option A - Zero padding: Pad the extra dimensions with zeros. No parameters added.

$$y = F(x) + \begin{bmatrix} x \\ 0 \end{bmatrix}$$

Option B - Projection: Use a linear projection $W_s$ to match dimensions.

$$y = F(x) + W_s x$$

The paper found that projection shortcuts give slightly better results but zero-padding works almost as well, confirming that the identity shortcut is the key ingredient.

import numpy as np

class BasicResidualBlock:
    """A basic residual block with two convolutional layers (simplified)."""

    def __init__(self, in_channels, out_channels, stride=1):
        self.stride = stride
        self.needs_projection = (in_channels != out_channels) or (stride != 1)

        # Two 3x3 conv layers (simplified as linear for demonstration)
        self.W1 = np.random.randn(out_channels, in_channels) * np.sqrt(2 / in_channels)
        self.W2 = np.random.randn(out_channels, out_channels) * np.sqrt(2 / out_channels)

        if self.needs_projection:
            self.W_proj = np.random.randn(out_channels, in_channels) * np.sqrt(2 / in_channels)

    def forward(self, x):
        """Forward pass with skip connection."""
        # Residual path: two layers with ReLU between them
        out = self.W1 @ x          # Conv 1
        out = np.maximum(0, out)   # ReLU
        out = self.W2 @ out        # Conv 2

        # Skip connection: identity or projection
        if self.needs_projection:
            identity = self.W_proj @ x
        else:
            identity = x

        # Add skip connection, then ReLU
        out = out + identity       # The key: F(x) + x
        out = np.maximum(0, out)   # ReLU after addition
        return out


# Demonstrate: what happens when F(x) ≈ 0?
block = BasicResidualBlock(64, 64)
x = np.random.randn(64)

# If we zero out the weights, the output is just x (identity)
block.W1 = np.zeros_like(block.W1)
block.W2 = np.zeros_like(block.W2)

output = block.forward(x)
print("With zero weights, output ≈ ReLU(x)")
print(f"||output - ReLU(x)|| = {np.linalg.norm(output - np.maximum(0, x)):.6f}")
# 0.000000 - the block becomes an identity (through ReLU)
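Option A from the Dimension Matching discussion can be sketched the same way. This is a minimal illustration on flat vectors, assuming only the channel count changes; the helper names `zero_pad_shortcut` and `projection_shortcut` are hypothetical, and `F_x` is a random stand-in for the residual branch output. Spatial downsampling, which the real network also has to handle, is ignored here.

```python
import numpy as np

def zero_pad_shortcut(x, out_channels):
    """Option A: append zeros so x has out_channels entries (no parameters)."""
    pad = out_channels - x.shape[0]
    if pad < 0:
        raise ValueError("zero-padding can only increase the channel count")
    return np.concatenate([x, np.zeros(pad, dtype=x.dtype)])

def projection_shortcut(x, W_s):
    """Option B: learned linear projection W_s (adds parameters)."""
    return W_s @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(64)            # 64-channel input
F_x = rng.standard_normal(128)         # stand-in for a 128-channel residual output
W_s = rng.standard_normal((128, 64)) * np.sqrt(1 / 64)

y_a = F_x + zero_pad_shortcut(x, 128)      # y = F(x) + [x; 0]
y_b = F_x + projection_shortcut(x, W_s)    # y = F(x) + W_s x

print("Option A shape:", y_a.shape, "extra parameters: 0")
print("Option B shape:", y_b.shape, "extra parameters:", W_s.size)
```

With Option A the first 64 entries receive the identity shortcut and the remaining 64 receive only the residual branch, which is why it costs nothing in parameters.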

Part 3 - Gradient Flow: The Mathematical Proof

Why Skip Connections Work

This is the most important section for interviews. The mathematical analysis of gradient flow through residual networks explains precisely why they can be trained to extreme depth.

Consider a residual network with $L$ blocks. Let $x_l$ denote the input to block $l$:

$$x_{l+1} = x_l + F(x_l, W_l)$$

By recursion, the output of any layer $L$ can be written as:

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)$$

This is powerful: every layer's output is the sum of the input and all intermediate residuals. The input $x_l$ has a direct path to any deeper layer $x_L$.
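The unrolled sum can be verified numerically. The sketch below assumes a toy residual branch $F(x, W) = \tanh(Wx)$ (chosen only for illustration, with the hypothetical name `residual_fn`) and checks that running the blocks one by one gives the same $x_L$ as adding all intermediate residuals to $x_l$.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_blocks = 8, 10
weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(num_blocks)]

def residual_fn(x, W):
    """Toy residual branch F(x, W) = tanh(W x)."""
    return np.tanh(W @ x)

# Forward pass: x_{i+1} = x_i + F(x_i, W_i), storing every intermediate x_i.
xs = [rng.standard_normal(dim)]
for W in weights:
    xs.append(xs[-1] + residual_fn(xs[-1], W))

# Unrolled form: x_L = x_l + sum_{i=l}^{L-1} F(x_i, W_i)
l, L = 3, num_blocks
unrolled = xs[l] + sum(residual_fn(xs[i], weights[i]) for i in range(l, L))

print("Recursive and unrolled forms agree:", np.allclose(xs[L], unrolled))
```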

Gradient Computation

Now compute the gradient of the loss $\mathcal{L}$ with respect to an early layer's output $x_l$:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot \frac{\partial x_L}{\partial x_l}$$

Using the expansion $x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)$:

$$\frac{\partial x_L}{\partial x_l} = 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i)$$

Therefore:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot \left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i)\right)$$

The critical term is the 1 (the identity matrix in the vector case). It carries $\frac{\partial \mathcal{L}}{\partial x_L}$ directly back to $x_l$, so the gradient signal does not vanish just because the residual gradient terms $\frac{\partial F}{\partial x_l}$ are small. It could cancel only if those terms summed to exactly $-1$, which is unlikely to hold for every sample in a mini-batch. In a plain network without skip connections:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot \prod_{i=l}^{L-1} \frac{\partial H_i}{\partial x_i}$$

This is a product of gradients through all layers. If the factors are consistently smaller than 1 in magnitude, the product shrinks exponentially with depth; if they are consistently larger, it explodes.

| Network Type | Gradient Form | Depth Behavior |
|---|---|---|
| Plain network | $\prod_{i=l}^{L-1} \frac{\partial H_i}{\partial x_i}$ | Exponential decay (vanishing) or explosion |
| Residual network | $1 + \frac{\partial}{\partial x_l}\sum F_i$ | Always includes 1 - a direct path that does not decay with depth |
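The product-versus-sum contrast can be made concrete with a toy backward pass. The sketch below assumes purely linear layers, so the Jacobian of layer $i$ is just a random matrix $W_i$; the scale `sigma = 0.5` and depth of 50 are arbitrary choices for illustration. It backpropagates the same gradient vector through a plain stack (factors $W_i$) and a residual stack (factors $I + W_i$).

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth, sigma = 64, 50, 0.5

# Toy linear layers: the Jacobian of layer i is the matrix W_i itself.
jacobians = [rng.standard_normal((dim, dim)) * sigma / np.sqrt(dim)
             for _ in range(depth)]

grad_out = rng.standard_normal(dim)   # stand-in for dL/dx_L

g_plain = grad_out.copy()
g_res = grad_out.copy()
for J in reversed(jacobians):
    g_plain = g_plain @ J                    # plain: product of Jacobians
    g_res = g_res @ (np.eye(dim) + J)        # residual: each factor is I + J

plain_norm = np.linalg.norm(g_plain)
res_norm = np.linalg.norm(g_res)
print(f"||dL/dx_0|| plain network:    {plain_norm:.3e}")
print(f"||dL/dx_0|| residual network: {res_norm:.3e}")
```

In this setup the plain gradient collapses by many orders of magnitude while the residual gradient does not. Note that the residual gradient actually grows here, because nothing in this toy model normalizes the activations; the point is only that the identity term prevents the decay.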

Gradient Flow: Plain Network vs ResNet

60-Second Answer

"In a plain network, the gradient is a product of layer-wise Jacobians: $\prod \frac{\partial H_i}{\partial x_i}$. If these terms are less than 1, the gradient vanishes exponentially with depth. In a residual network, the gradient becomes $1 + \sum$ (residual gradient terms). The additive 1 from the skip connection creates a 'gradient highway' - the gradient can flow directly from the loss to any layer without passing through nonlinearities. This is why ResNets can train at 100+ layers while plain networks degrade at 50+."

Part 4 - Identity Mappings: ResNet v2

The Paper

"Identity Mappings in Deep Residual Networks" - He et al., 2016

The Insight

The original ResNet (v1) applied batch normalization after each convolution inside the residual function and a ReLU after the addition:

$$x_{l+1} = \text{ReLU}(F(x_l) + x_l)$$

The ReLU after the addition means the skip connection path is not truly an identity - it passes through a nonlinearity. He et al. showed that making the skip connection a pure identity mapping (no operations on the shortcut path) makes training easier and improves generalization.

Pre-Activation vs Post-Activation

Post-activation (ResNet v1):

$$x_{l+1} = \text{ReLU}(\text{BN}(W_2 \cdot \text{ReLU}(\text{BN}(W_1 \cdot x_l))) + x_l)$$

Pre-activation (ResNet v2):

$$x_{l+1} = W_2 \cdot \text{ReLU}(\text{BN}(W_1 \cdot \text{ReLU}(\text{BN}(x_l)))) + x_l$$

The key difference: in v2, BN and ReLU come before the weight layers, and the skip connection has no operations at all.
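The effect on the shortcut path is easy to see in a small sketch. The two functions below (hypothetical names `block_v1` and `block_v2`) follow the two formulas with batch normalization omitted for brevity, which is a simplification. Zeroing the weights removes the residual branch, so whatever remains is what the shortcut path alone does to the input.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def block_v1(x, W1, W2):
    """Post-activation (BN omitted): ReLU is applied after the addition."""
    return relu(W2 @ relu(W1 @ x) + x)

def block_v2(x, W1, W2):
    """Pre-activation (BN omitted): nothing is applied after the addition."""
    return W2 @ relu(W1 @ relu(x)) + x

x = np.array([1.5, -2.0, 0.5, -0.1])
W_zero = np.zeros((4, 4))   # zero weights: the residual branch contributes nothing

out_v1 = block_v1(x, W_zero, W_zero)
out_v2 = block_v2(x, W_zero, W_zero)

print("input:", x)
print("v1 shortcut output:", out_v1)   # negative entries are clipped by the final ReLU
print("v2 shortcut output:", out_v2)   # input passes through unchanged
```

In v1 the negative entries of the input are lost even with an empty residual branch; in v2 the input survives exactly.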

ResNet v1 vs v2: Post-Activation vs Pre-Activation

Why Pre-Activation Is Better

With pre-activation, the gradient from layer $L$ to layer $l$ is exactly:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} + \text{residual gradient terms}$$

No Jacobians of ReLU or BN on the shortcut path. This cleaner gradient flow enables training of 1001-layer networks (He et al. showed this explicitly).

| Architecture | CIFAR-10 Error | CIFAR-100 Error |
|---|---|---|
| ResNet v1, 164 layers | 5.93% | 25.16% |
| ResNet v2, 164 layers | 5.46% | 24.33% |
| ResNet v1, 1001 layers | 7.61% | 27.82% |
| ResNet v2, 1001 layers | 4.92% | 22.71% |

Common Trap

Many candidates say "ResNet uses skip connections" without distinguishing v1 from v2. In v1, the skip path passes through a ReLU after addition, which corrupts the identity mapping. V2 moves BN and ReLU before the weight layers, giving a pure identity shortcut. If asked about identity mappings, you must explain this distinction - it is what enabled going from 152 to 1001 layers.

Part 5 - Architecture Variants

The ResNet Family

ResNet Architecture Variants: Basic vs Bottleneck Block

Basic Block vs Bottleneck Block

Basic block (used in ResNet-18 and ResNet-34):

  • Two 3x3 convolutional layers
  • Parameters: $2 \times (3 \times 3 \times C \times C) = 18C^2$

Bottleneck block (used in ResNet-50, 101, 152):

  • 1x1 conv (reduce channels by 4x) → 3x3 conv → 1x1 conv (restore channels)
  • Parameters: $(1 \times 1 \times C \times C/4) + (3 \times 3 \times C/4 \times C/4) + (1 \times 1 \times C/4 \times C) = C^2/4 + 9C^2/16 + C^2/4 \approx 1.06C^2$

For the same input/output width $C$, the bottleneck block has fewer parameters and less compute than the basic block while adding one more layer of depth:

def count_params_basic(C):
    """Parameters in a basic residual block (biases and BN ignored)."""
    # Two 3x3 convolutions: C input channels, C output channels each
    return 2 * (3 * 3 * C * C)

def count_params_bottleneck(C):
    """Parameters in a bottleneck residual block (biases and BN ignored)."""
    # 1x1 reduce: C → C/4
    reduce = 1 * 1 * C * (C // 4)
    # 3x3 process: C/4 → C/4
    process = 3 * 3 * (C // 4) * (C // 4)
    # 1x1 expand: C/4 → C
    expand = 1 * 1 * (C // 4) * C
    return reduce + process + expand

for C in [64, 128, 256, 512]:
    basic = count_params_basic(C)
    bottleneck = count_params_bottleneck(C)
    ratio = bottleneck / basic
    print(f"C={C:3d}: Basic={basic:>10,d} Bottleneck={bottleneck:>10,d} "
          f"Ratio={ratio:.3f}")

# C= 64: Basic=    73,728 Bottleneck=     4,352 Ratio=0.059
# C=128: Basic=   294,912 Bottleneck=    17,408 Ratio=0.059
# C=256: Basic= 1,179,648 Bottleneck=    69,632 Ratio=0.059
# C=512: Basic= 4,718,592 Bottleneck=   278,528 Ratio=0.059

Full Architecture Table

| Model | Layers | Blocks | Parameter Count | Top-1 Error (ImageNet) |
|---|---|---|---|---|
| ResNet-18 | 18 | [2, 2, 2, 2] basic | 11.7M | 30.2% |
| ResNet-34 | 34 | [3, 4, 6, 3] basic | 21.8M | 26.7% |
| ResNet-50 | 50 | [3, 4, 6, 3] bottleneck | 25.6M | 24.0% |
| ResNet-101 | 101 | [3, 4, 23, 3] bottleneck | 44.5M | 22.4% |
| ResNet-152 | 152 | [3, 8, 36, 3] bottleneck | 60.2M | 21.3% |

Note: ResNet-50 has more layers than ResNet-34 but only slightly more parameters because bottleneck blocks are more parameter-efficient.
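The layer counts in the model names follow from the block lists in the table. The sketch below uses a hypothetical helper `count_layers` and counts only weighted layers on the main path: the initial 7x7 convolution, two convolutions per basic block or three per bottleneck block, and the final fully connected layer. Projection shortcuts are not counted, which matches the usual naming convention.

```python
def count_layers(blocks_per_stage, convs_per_block):
    """Weighted layers: stem conv + all block convs + final FC layer."""
    return 1 + sum(blocks_per_stage) * convs_per_block + 1

# (blocks per stage, convolutions per block) taken from the table above
variants = {
    "ResNet-18": ([2, 2, 2, 2], 2),     # basic blocks
    "ResNet-34": ([3, 4, 6, 3], 2),     # basic blocks
    "ResNet-50": ([3, 4, 6, 3], 3),     # bottleneck blocks
    "ResNet-101": ([3, 4, 23, 3], 3),   # bottleneck blocks
    "ResNet-152": ([3, 8, 36, 3], 3),   # bottleneck blocks
}

layer_counts = {name: count_layers(*spec) for name, spec in variants.items()}
for name, n in layer_counts.items():
    print(f"{name}: {n} layers")
```

ResNet-34 and ResNet-50 share the same block list; the jump from 34 to 50 layers comes entirely from swapping two-convolution basic blocks for three-convolution bottleneck blocks.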

ResNet-50 Architecture in Detail

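| Stage | Output Size | Block Type | Channels | Blocks | Stride |
|---|---|---|---|---|---|
| Conv1 | 112 x 112 | 7x7 conv | 64 | 1 | 2 |
| Pool | 56 x 56 | 3x3 max pool | 64 | 1 | 2 |
| Stage 1 | 56 x 56 | Bottleneck | 64/256 | 3 | 1 |
| Stage 2 | 28 x 28 | Bottleneck | 128/512 | 4 | 2 |
| Stage 3 | 14 x 14 | Bottleneck | 256/1024 | 6 | 2 |
| Stage 4 | 7 x 7 | Bottleneck | 512/2048 | 3 | 2 |
| Avg Pool | 1 x 1 | Global avg pool | 2048 | 1 | - |
| FC | 1 x 1 | Fully connected | 1000 | 1 | - |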
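The output sizes in the table can be reproduced with the standard convolution size formula $\lfloor (n + 2p - k)/s \rfloor + 1$. The sketch below assumes a 224 x 224 input and paddings of 3 for the 7x7 convolution and 1 for the 3x3 max pool; neither is stated in the table, so treat them as assumptions. Each stage's stride is modeled as a single 1x1 convolution; implementations differ on which convolution inside the block carries the stride, but the resulting size is the same.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224                                             # assumed input resolution
sizes = []

size = conv_out(size, kernel=7, stride=2, padding=3)   # Conv1
sizes.append(size)
size = conv_out(size, kernel=3, stride=2, padding=1)   # max pool
sizes.append(size)

for stage_stride in [1, 2, 2, 2]:                      # Stages 1 to 4
    size = conv_out(size, kernel=1, stride=stage_stride, padding=0)
    sizes.append(size)

print(sizes)
```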

Part 6 - ResNet's Impact on Modern Architectures

Skip Connections Are Everywhere

ResNet's skip connections became one of the most influential ideas in deep learning. They appear in virtually every modern architecture:

| Architecture | How Skip Connections Are Used |
|---|---|
| Transformers | Residual connections around every self-attention and FFN sublayer |
| DenseNet | Concatenation instead of addition - each layer connects to all previous layers |
| U-Net | Skip connections between encoder and decoder at each spatial resolution |
| Highway Networks | Gated skip connections (precursor to ResNet) |
| EfficientNet | Skip connections in MBConv blocks |
| GPT/BERT | Residual connections are essential for training deep Transformer stacks (for example, the 96 layers of GPT-3) |

The Transformer Connection

Pre-LayerNorm Transformer layers (the arrangement used in GPT-2 and GPT-3) apply the same residual formulation twice per layer, once around attention and once around the feed-forward network:

$$x_{l+1} = x_l + \text{Attention}(\text{LayerNorm}(x_l))$$

$$x_{l+2} = x_{l+1} + \text{FFN}(\text{LayerNorm}(x_{l+1}))$$

The original Transformer and BERT place LayerNorm after the addition instead, but the residual sum is the same. Without residual connections, training a 96-layer GPT-3 would be impractical. The gradient flow argument He et al. made for ConvNets carries over directly to Transformers.
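A minimal NumPy sketch of one pre-LayerNorm layer makes the two residual additions explicit. Everything here is simplified and the names (`pre_ln_block`, `attention`, `ffn`, the parameter dictionary) are hypothetical: attention is single-head and unmasked, LayerNorm has no learned gain or bias, and the weights are random.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(h, Wq, Wk, Wv, Wo):
    """Single-head, unmasked self-attention over a (seq_len, d_model) input."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v @ Wo

def ffn(h, W1, W2):
    return np.maximum(0, h @ W1) @ W2

def pre_ln_block(x, p):
    x = x + attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])  # residual 1
    x = x + ffn(layer_norm(x), p["W1"], p["W2"])                          # residual 2
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64
shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
          "Wv": (d_model, d_model), "Wo": (d_model, d_model),
          "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
params = {name: rng.standard_normal(shape) * 0.02 for name, shape in shapes.items()}

x = rng.standard_normal((seq_len, d_model))
y = pre_ln_block(x, params)

# With zero output projections, both sublayers vanish and the layer is the identity.
params_zero = {**params, "Wo": np.zeros((d_model, d_model)),
               "W2": np.zeros((d_ff, d_model))}
y_identity = pre_ln_block(x, params_zero)
print("Identity when sublayers output zero:", np.allclose(y_identity, x))
```

As with the ResNet block, a layer whose sublayers output zero passes its input through untouched, so a deep stack starts out close to the identity.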

Company Variation

At vision-focused companies (Tesla Autopilot, Apple Vision, Meta Reality Labs), expect detailed ResNet questions. At NLP/LLM companies (OpenAI, Anthropic, Cohere), focus on how residual connections enable deep Transformers. The mathematical principles are identical - only the application context differs.

DenseNet: An Alternative to Residual Addition

DenseNet (Huang et al., 2017) replaced addition with concatenation:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$

Each layer takes as input the concatenation of ALL previous layers' outputs. This creates even stronger gradient flow but increases memory usage.

| Property | ResNet (Addition) | DenseNet (Concatenation) |
|---|---|---|
| Gradient flow | Good (additive identity) | Excellent (direct connections) |
| Feature reuse | Implicit (through addition) | Explicit (all features preserved) |
| Memory cost | Low (only current features) | High (all previous features stored) |
| Parameter efficiency | Moderate | High (fewer filters needed) |
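The memory cost of concatenation shows up directly in how the input width grows. The sketch below is a toy dense block on flat vectors, with each layer $H_l$ reduced to a linear map plus ReLU; the input width of 16 and the growth rate of 12 new features per layer are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, growth_rate, num_layers = 16, 12, 4

features = [rng.standard_normal(in_features)]   # x_0
input_widths = []

for _ in range(num_layers):
    concat = np.concatenate(features)           # [x_0, x_1, ..., x_{l-1}]
    input_widths.append(concat.shape[0])
    W = rng.standard_normal((growth_rate, concat.shape[0])) * np.sqrt(2 / concat.shape[0])
    features.append(np.maximum(0, W @ concat))  # x_l = H_l(concat)

final_width = np.concatenate(features).shape[0]
print("Input width seen by each layer:", input_widths)
print("Width after the block:", final_width)
```

Each layer adds only `growth_rate` new features, but every earlier feature must stay in memory, so the input width grows linearly with depth. With addition, the width would stay at 16 throughout.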

Part 7 - He Initialization

The ResNet paper initialized its weights with He initialization (also called Kaiming initialization), which He et al. had introduced earlier in 2015 specifically for networks with ReLU activations:

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$$

Where $n_{\text{in}}$ is the number of input connections (fan-in) and the second argument is the variance.

Why the Factor of 2?

For a layer $y = W \cdot x$ followed by ReLU:

$$\text{Var}(y) = \frac{n_{\text{in}}}{2} \cdot \text{Var}(W) \cdot \text{Var}(x)$$

The factor of $\frac{1}{2}$ appears because ReLU zeroes out negative values, halving the expected variance. Setting $\text{Var}(W) = \frac{2}{n_{\text{in}}}$ ensures $\text{Var}(y) = \text{Var}(x)$, keeping activations at a stable scale across layers.

Compare to Xavier initialization (for sigmoid/tanh), shown here in its fan-in form:

$$W \sim \mathcal{N}\left(0, \frac{1}{n_{\text{in}}}\right)$$

Xavier assumes a symmetric activation function where the variance is preserved without the factor of 2. Using Xavier initialization with ReLU halves the activation variance at every layer, so the standard deviation shrinks by $\frac{1}{\sqrt{2}}$ per layer, which degrades deep networks.

import numpy as np

def demonstrate_initialization():
    """Show why He initialization is needed for ReLU networks."""
    n_layers = 50
    n_neurons = 512

    # Simulate forward pass with different initializations
    for name, scale in [("Xavier", 1.0), ("He", 2.0)]:
        x = np.random.randn(1, n_neurons)

        for i in range(n_layers):
            W = np.random.randn(n_neurons, n_neurons) * np.sqrt(scale / n_neurons)
            x = x @ W
            x = np.maximum(0, x)  # ReLU

        print(f"{name:>6s} init, layer {n_layers}: "
              f"mean={x.mean():.4e}, std={x.std():.4e}, "
              f"fraction_zero={np.mean(x == 0):.2%}")

demonstrate_initialization()
# Xavier: activations shrink, many dead neurons
# He: activations stay stable

Part 8 - Ablation Study Results

From the Original Paper

| Architecture | Layers | ImageNet Top-1 Error | Notes |
|---|---|---|---|
| VGG-19 | 19 | 25.6% | Previous SOTA |
| Plain-34 | 34 | 28.5% | Worse than 18-layer! |
| Plain-18 | 18 | 27.9% | Degradation demonstrated |
| ResNet-34 | 34 | 24.5% | Skip connections fix degradation |
| ResNet-18 | 18 | 27.9% | Similar to plain (shallow enough) |
| ResNet-50 | 50 | 24.0% | Bottleneck blocks |
| ResNet-101 | 101 | 22.4% | Deeper = better |
| ResNet-152 | 152 | 21.3% | Deepest model in the paper |
| Ensemble | - | 19.4% | Won ILSVRC 2015 |

Key observations:

  1. Plain-34 is worse than Plain-18 (degradation confirmed)
  2. ResNet-34 is better than ResNet-18 (skip connections fix degradation)
  3. Deeper ResNets consistently improve (34 → 50 → 101 → 152)
  4. ResNet won ILSVRC 2015 by a large margin

Shortcut Connection Variants

| Shortcut Type (ResNet-34) | Top-1 Error | Params |
|---|---|---|
| A: Zero-padding shortcuts | 25.03% | Fewest |
| B: Projection shortcuts (only for dimension changes) | 24.52% | Moderate |
| C: All projection shortcuts | 24.19% | Most |

Option B became the standard: use identity shortcuts where possible, projection only when dimensions change.

Part 9 - Common Interview Deep Dives

"Can you use skip connections across more than two layers?"

Yes, and this is explored in DenseNet (across all layers) and in various ResNet variants. The key principle is that the gradient highway must exist. Longer skip connections work but provide less fine-grained feature reuse.

"What happens if you use multiplication instead of addition for skip connections?"

$$x_{l+1} = F(x_l) \times x_l$$

This would restore the product-of-gradients problem. The gradient would be:

$$\frac{\partial x_{l+1}}{\partial x_l} = F(x_l) + x_l \cdot \frac{\partial F}{\partial x_l}$$

This does not have the constant term 1, so gradients can still vanish. Addition is specifically chosen because it preserves gradient magnitude.
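A scalar toy example illustrates the difference. The sketch below assumes $F(x) = wx$ with $w = 0.1$ (an arbitrary choice) and chains five layers. For the additive form the per-layer derivative is $1 + w$; for the multiplicative form $x \cdot F(x) = wx^2$ the derivative is $F(x) + x \cdot F'(x) = 2wx$, matching the formula above. `chain_gradient` is a hypothetical helper that multiplies the per-layer derivatives along the forward pass.

```python
def chain_gradient(step, dstep, x0, depth):
    """Multiply per-layer derivatives along a forward pass of `depth` layers."""
    x, grad = x0, 1.0
    for _ in range(depth):
        grad *= dstep(x)   # chain rule: derivative of this layer at its input
        x = step(x)
    return grad

w, depth, x0 = 0.1, 5, 1.0

# Additive skip: x + F(x), derivative 1 + w
grad_add = chain_gradient(lambda x: x + w * x, lambda x: 1 + w, x0, depth)

# Multiplicative skip: x * F(x) = w * x^2, derivative F(x) + x * F'(x) = 2 * w * x
grad_mul = chain_gradient(lambda x: x * (w * x), lambda x: 2 * w * x, x0, depth)

print(f"additive:       {grad_add:.5f}")
print(f"multiplicative: {grad_mul:.3e}")
```

After only five layers the multiplicative gradient is many orders of magnitude smaller, because every factor depends on the (shrinking) activation itself and there is no constant 1 to fall back on.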

"What is the relationship between ResNets and ordinary differential equations?"

This is an advanced but increasingly common question. A ResNet can be viewed as an Euler discretization of an ODE:

$$\frac{dx}{dt} = F(x(t), t, \theta)$$

$$x(t + \Delta t) = x(t) + \Delta t \cdot F(x(t), t, \theta)$$

Setting $\Delta t = 1$ gives the residual block $x_{l+1} = x_l + F(x_l)$. This connection led to Neural ODEs (Chen et al., 2018), which use continuous-depth networks with adaptive solvers.
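The Euler update is the same code as a residual block. The sketch below uses the simple test equation $dx/dt = -x$ (chosen only because its exact solution $x(t) = x_0 e^{-t}$ is known; it is not from the Neural ODE paper) and integrates from $t = 0$ to $t = 1$ with one coarse step and with 100 fine steps.

```python
import numpy as np

def f(x):
    """Right-hand side of the toy ODE dx/dt = -x."""
    return -x

def euler(x0, dt, steps):
    x = x0
    for _ in range(steps):
        x = x + dt * f(x)   # same form as a residual block: x + F(x), scaled by dt
    return x

x0 = 1.0
coarse = euler(x0, dt=1.0, steps=1)      # one step with dt = 1, like one residual block
fine = euler(x0, dt=0.01, steps=100)     # many small steps over the same interval
exact = x0 * np.exp(-1.0)

print(f"1 step   (dt=1.00): {coarse:.4f}")
print(f"100 steps (dt=0.01): {fine:.4f}")
print(f"exact x(1):          {exact:.4f}")
```

More, smaller steps track the continuous solution more closely, which is the intuition behind treating depth as a continuous variable.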

Part 10 - ResNet's Legacy: Beyond Image Classification

Object Detection and Segmentation

ResNet became the default backbone for nearly every computer vision task:

| Task | Model | Backbone | Key Innovation |
|---|---|---|---|
| Object Detection | Faster R-CNN | ResNet-50/101 | Region Proposal Network on ResNet features |
| Instance Segmentation | Mask R-CNN | ResNet-50-FPN | Feature Pyramid Network with ResNet |
| Semantic Segmentation | DeepLab v3+ | ResNet-101 | Atrous convolutions on ResNet |
| Panoptic Segmentation | Panoptic FPN | ResNet-50 | Unified detection + segmentation |
| Pose Estimation | HRNet | Inspired by ResNet | High-resolution residual connections |

Feature Pyramid Networks (FPN)

FPN (Lin et al., 2017) builds on ResNet by creating multi-scale feature maps with skip connections at each scale:

Feature Pyramid Network Architecture with ResNet Backbone

The lateral connections from ResNet to FPN are themselves skip connections - the same principle applied at the architecture level rather than the layer level.

ResNeXt: Grouped Convolutions

ResNeXt (Xie et al., 2017) extended the bottleneck block by replacing the 3x3 convolution with grouped convolutions:

$$F(x) = \sum_{i=1}^{C} T_i(x)$$

Where $C$ is the "cardinality" - the number of parallel transformation paths. This creates a wider block with the same parameter count:

| Model | Top-1 Error | Parameters | FLOPs |
|---|---|---|---|
| ResNet-50 | 24.0% | 25.6M | 4.1G |
| ResNeXt-50 (32x4d) | 22.2% | 25.0M | 4.3G |
| ResNet-101 | 22.4% | 44.5M | 7.8G |
| ResNeXt-101 (32x4d) | 21.2% | 44.2M | 8.0G |

The insight: increasing cardinality is more effective than increasing depth or width, at the same computational budget.
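The "same parameter count" claim can be checked with a short calculation. A grouped convolution with $g$ groups connects each output channel to only $1/g$ of the input channels, so its weight count is divided by $g$. The sketch below compares a ResNet bottleneck (256 → 64 → 64 → 256) with a 32x4d ResNeXt block (256 → 128 → 128 → 256, with 32 groups in the 3x3 convolution). These widths are the commonly cited first-stage configuration and are an assumption here, since the text above does not list them; `conv_params` is a hypothetical helper that ignores biases and BN.

```python
def conv_params(kernel, c_in, c_out, groups=1):
    """Weights in a (possibly grouped) convolution, ignoring biases."""
    return kernel * kernel * (c_in // groups) * c_out

# ResNet bottleneck: 256 -> 64 -> 64 -> 256
resnet_block = (conv_params(1, 256, 64)
                + conv_params(3, 64, 64)
                + conv_params(1, 64, 256))

# ResNeXt 32x4d block: 256 -> 128 -> 128 (32 groups of 4 channels) -> 256
resnext_block = (conv_params(1, 256, 128)
                 + conv_params(3, 128, 128, groups=32)
                 + conv_params(1, 128, 256))

print(f"ResNet bottleneck:   {resnet_block:,} parameters")
print(f"ResNeXt 32x4d block: {resnext_block:,} parameters")
```

The grouped 3x3 convolution is cheap enough that the block can double its inner width while keeping roughly the same budget.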

SE-ResNet: Channel Attention

Squeeze-and-Excitation Networks (Hu et al., 2018) added channel attention to residual blocks:

$$\text{SE}(x) = x \cdot \sigma(\text{FC}_2(\text{ReLU}(\text{FC}_1(\text{GAP}(x)))))$$

Where GAP is global average pooling. This "squeezes" spatial information into a channel descriptor, then "excites" (reweights) channels based on their importance. Adding SE to ResNet-50 improved top-1 accuracy by ~1% with minimal computational overhead.

import numpy as np

def squeeze_excitation(x, r=16):
    """
    Simplified SE block.
    x: (channels, height, width)
    r: reduction ratio
    """
    C = x.shape[0]

    # Squeeze: Global Average Pooling
    z = x.mean(axis=(1, 2))  # (channels,)

    # Excitation: FC → ReLU → FC → Sigmoid
    # (weights are random here; in a real network they are learned)
    W1 = np.random.randn(C // r, C) * 0.01
    W2 = np.random.randn(C, C // r) * 0.01

    s = np.maximum(0, W1 @ z)        # (C/r,) - ReLU
    s = 1 / (1 + np.exp(-(W2 @ s)))  # (C,) - Sigmoid

    # Scale: channel-wise multiplication
    return x * s.reshape(-1, 1, 1)

# SE adds only ~8.2K parameters per block for C=256, r=16 (biases ignored)
C, r = 256, 16
se_params = (C * C // r) + (C // r * C)
print(f"SE block parameters: {se_params:,}")                      # 8,192
print(f"ResNet block parameters: ~{18 * C * C:,}")                # 1,179,648 (basic block)
print(f"SE overhead: {se_params / (18 * C * C) * 100:.2f}%")      # 0.69%

EfficientNet and the Evolution Beyond ResNet

While ResNet dominated for years, EfficientNet (Tan & Le, 2019) showed that compound scaling (simultaneously scaling depth, width, and resolution) outperforms scaling depth alone. However, the core building block (MBConv) still uses skip connections - the residual learning principle remains foundational.

Practice Problems

Problem 1: Gradient Flow Derivation

For a 100-layer residual network, write the gradient $\frac{\partial \mathcal{L}}{\partial x_1}$ and explain why it does not vanish, even if the residual functions $F_i$ have small gradients.

Hint

$\frac{\partial \mathcal{L}}{\partial x_1} = \frac{\partial \mathcal{L}}{\partial x_{100}} \cdot \left(1 + \frac{\partial}{\partial x_1}\sum_{i=1}^{99} F(x_i, W_i)\right)$. The "1" term guarantees a direct gradient path from the loss to layer 1. Even if all $\frac{\partial F_i}{\partial x_1}$ terms are zero, the gradient is still $\frac{\partial \mathcal{L}}{\partial x_{100}} \cdot 1$, which is the loss gradient at the final layer. In a plain network, the gradient would be $\frac{\partial \mathcal{L}}{\partial x_{100}} \cdot \prod_{i=1}^{99} \frac{\partial H_i}{\partial x_i}$, which vanishes exponentially when the factors are smaller than 1.

Problem 2: Bottleneck Efficiency

Calculate the ratio of FLOPs between a basic block and a bottleneck block for a layer with 256 input/output channels and 56x56 spatial resolution.

Hint

Basic block: Two 3x3 convolutions with 256 channels. FLOPs per conv = $2 \times 3 \times 3 \times 256 \times 256 \times 56 \times 56 \approx 3.7 \times 10^9$. Total: $2 \times 3.7 \times 10^9 = 7.4 \times 10^9$. Bottleneck: 1x1 conv (256→64): $2 \times 1 \times 1 \times 256 \times 64 \times 56 \times 56 \approx 1.0 \times 10^8$. 3x3 conv (64→64): $2 \times 3 \times 3 \times 64 \times 64 \times 56 \times 56 \approx 2.3 \times 10^8$. 1x1 conv (64→256): $2 \times 1 \times 1 \times 64 \times 256 \times 56 \times 56 \approx 1.0 \times 10^8$. Total: $\approx 4.4 \times 10^8$. Ratio: $4.4 / 74 \approx 0.059$. The bottleneck uses ~6% of the FLOPs!

Problem 3: Degradation vs Overfitting

Your colleague trains a 100-layer network that has lower training accuracy than a 50-layer network. She claims the model is overfitting. Design an experiment to prove her wrong.

Hint

If the model were overfitting, training accuracy would be higher (or at least equal) for the deeper model, and test accuracy would be lower. The degradation problem is specifically about training accuracy decreasing with depth. To confirm: (1) plot both training AND test accuracy curves, (2) show that the deeper model is worse on BOTH training and test data, (3) add skip connections to the 100-layer network and show that both training and test accuracy improve. The skip connection experiment is the definitive proof - if degradation were due to overfitting, skip connections (which do not reduce capacity) would not help.

Problem 4: Skip Connection Variants

You are designing a new architecture. Compare three skip connection strategies: (a) addition ($x + F(x)$), (b) concatenation ($[x, F(x)]$), (c) gated ($g \cdot F(x) + (1-g) \cdot x$ where $g$ is learned). What are the trade-offs?

Hint

(a) Addition: Simple, no extra parameters, preserves dimensionality. Gradient = $1 + \frac{\partial F}{\partial x}$. Used in ResNet and Transformers. (b) Concatenation: Preserves all information (no lossy addition), but doubles feature map size per layer. Requires downstream layers to handle growing dimensions. Used in DenseNet and U-Net. More memory intensive. (c) Gated: Maximum flexibility - can learn to be identity, residual, or anything between. But the gate $g$ reintroduces potential gradient issues (if $g \approx 0$, the residual path is blocked; if $g \approx 1$, the identity path is blocked). Used in Highway Networks. In practice, simple addition (a) won out due to its reliability and simplicity.
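The three strategies differ in only a line or two of code. The sketch below works on flat vectors, with `fx` as a random stand-in for $F(x)$ and the gate computed as a sigmoid of fixed logits; in a real gated network the logits would come from a learned layer, so the constant values here are assumptions for illustration.

```python
import numpy as np

def add_skip(x, fx):
    """(a) Addition: output keeps the input's dimensionality."""
    return x + fx

def concat_skip(x, fx):
    """(b) Concatenation: output width is the sum of both widths."""
    return np.concatenate([x, fx])

def gated_skip(x, fx, gate_logits):
    """(c) Gated: g * F(x) + (1 - g) * x with g = sigmoid(gate_logits)."""
    g = 1 / (1 + np.exp(-gate_logits))
    return g * fx + (1 - g) * x

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
fx = rng.standard_normal(8)    # stand-in for the residual branch output F(x)

y_add = add_skip(x, fx)
y_cat = concat_skip(x, fx)
y_gate_closed = gated_skip(x, fx, np.full(8, -20.0))  # g ≈ 0: identity path only
y_gate_open = gated_skip(x, fx, np.full(8, 20.0))     # g ≈ 1: residual path only

print("add:", y_add.shape, " concat:", y_cat.shape)
print("gate closed ≈ x:   ", np.allclose(y_gate_closed, x, atol=1e-6))
print("gate open   ≈ F(x):", np.allclose(y_gate_open, fx, atol=1e-6))
```

The saturated gates show the trade-off from the hint: when $g$ sits near 0 or 1, one of the two paths is effectively switched off.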

Problem 5: ResNet for Modern LLMs

Explain how the residual connection principle from ResNet enables training of 96-layer GPT-3. What would happen if you removed all skip connections from a Transformer?

Hint

Each Transformer layer has two residual connections: $x + \text{Attention}(\text{LN}(x))$ and $x + \text{FFN}(\text{LN}(x))$. GPT-3 with 96 layers has 192 residual connections. Without them, gradients must flow through 192 layers of nonlinear functions, suffering exponential decay. With skip connections, gradients have a direct path from the loss to any layer. Additionally, at initialization, each sublayer output is approximately zero (due to small weight initialization), so the network starts close to identity, gradually learning residuals. Removing skip connections from a 96-layer Transformer would make it untrainable - gradients would vanish or explode within the first few training steps.

Interview Cheat Sheet

| Question | Key Points |
|---|---|
| "What is the degradation problem?" | Deeper plain networks have higher TRAINING error. Not overfitting - optimization difficulty. |
| "Write the residual formulation" | $y = F(x) + x$. Learn the residual $F(x) = H(x) - x$, not the full mapping $H(x)$. |
| "Why do skip connections help gradients?" | Gradient becomes $1 + \sum$ (sum) vs $\prod$ (product). The additive 1 prevents vanishing. |
| "Pre-activation vs post-activation?" | V2 moves BN+ReLU before weights. Skip path becomes pure identity. Enables 1001 layers. |
| "Basic vs bottleneck block?" | Basic: two 3x3 convs. Bottleneck: 1x1 reduce → 3x3 → 1x1 expand. ~94% fewer params at the same width $C$ ($1.06C^2$ vs $18C^2$). |
| "ResNet-50 architecture?" | Conv1 → Pool → Stages [3,4,6,3] bottleneck blocks → AvgPool → FC. 25.6M params. |
| "What is He initialization?" | $W \sim \mathcal{N}(0, 2/n_{\text{in}})$. Factor of 2 compensates for ReLU halving variance. |
| "How do skip connections appear in Transformers?" | $x + \text{Attention}(\text{LN}(x))$, $x + \text{FFN}(\text{LN}(x))$. Same principle. |
| "ResNet vs DenseNet?" | ResNet: addition (simple, low memory). DenseNet: concatenation (preserves all features, high memory). |
| "Connection to ODEs?" | $x_{l+1} = x_l + F(x_l)$ is Euler discretization of $dx/dt = F(x,t)$. Leads to Neural ODEs. |

Spaced Repetition Checkpoints

Day 0 (Today)

  • Explain the degradation problem in one paragraph
  • Write the residual formulation $y = F(x) + x$
  • Explain why the gradient includes an additive 1

Day 3

  • Derive gradient flow for a 100-layer ResNet from memory
  • Draw both basic and bottleneck residual blocks
  • Explain pre-activation vs post-activation

Day 7

  • Draw the full ResNet-50 architecture from memory
  • Calculate parameter counts for basic vs bottleneck blocks
  • Explain He initialization and why the factor is 2

Day 14

  • Mock interview: answer all 10 cheat sheet questions
  • Explain ResNet's connection to Transformers
  • Discuss DenseNet, Highway Networks, and U-Net

Day 21

  • Full 20-minute paper discussion simulation on ResNet
  • Handle follow-up questions on gradient flow, ODE connections, modern variants
  • Implement a residual block from scratch on a whiteboard

Next Steps

You now understand why depth is possible in modern deep learning - skip connections provide the gradient highways that make 100+ layer networks trainable. Next, explore Chapter 7: Batch Normalization - the other critical ingredient that enabled deep network training, and the ongoing debate about why it actually works.

© 2026 EngineersOfAI. All rights reserved.