Convolutional Neural Networks - From Pixels to Understanding

Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, CV Eng, Research Engineer, Robotics ML

The Real Interview Moment

You are in a Tesla Autopilot MLE on-site. The interviewer draws a 6x6 input feature map on the whiteboard and says: "Apply a 3x3 convolution with stride 2 and padding 1. What is the output size? Now stack 50 of these layers - what is the receptive field? And why would we use a ResNet instead of a plain stack?"

You start computing the output size but second-guess yourself on the padding formula. You get the receptive field calculation half right but mix up the recursive formula. When the interviewer asks about ResNet, you say "skip connections help gradient flow" - correct, but she wants the math: "Show me how the gradient changes with and without the skip connection."

CNN questions in interviews are deceptively layered. They start with simple arithmetic (output size calculation) and escalate to deep architectural reasoning (why ResNet works, what 1x1 convolutions do, why depthwise separable convolutions save computation). This page arms you with both the mechanical skills and the architectural intuition.

What You Will Master

Compute output dimensions for any conv layer given input size, kernel size, stride, padding, and dilation
Trace the convolution operation as a sliding dot product with weight sharing
Calculate receptive fields for deep networks using the recursive formula
Explain pooling operations (max, average, global) and their purposes
Narrate the architecture evolution: LeNet to AlexNet to VGG to GoogLeNet to ResNet to EfficientNet to ConvNeXt
Derive why skip connections enable training of very deep networks (gradient highway argument)
Explain 1x1 convolutions as pointwise channel mixing and dimensionality reduction
Analyze depthwise separable convolutions and their computational savings
Design transfer learning and fine-tuning strategies for new tasks
Answer CNN architecture questions with both math and intuition

Self-Assessment: Where Are You Now?

Skill	1 -- Cannot	2 -- Vaguely	3 -- Can Explain	4 -- Can Derive	5 -- Can Teach	Your Score
Compute conv output size						___
Explain convolution as sliding dot product						___
Calculate receptive field						___
Explain max pooling vs average pooling						___
Trace LeNet to ResNet evolution						___
Derive skip connection gradient benefit						___
Explain 1x1 convolutions						___
Explain depthwise separable convolutions						___
Design transfer learning strategy						___

Target: All 4s and 5s before your interview.

Part 1 - The Convolution Operation

What Convolution Does

A 2D convolution slides a small filter (kernel) across an input feature map, computing a dot product at each position. This produces an output feature map (also called an activation map).

Key properties that make convolution powerful for vision:

Local connectivity: Each output neuron connects to only a small region of the input (the receptive field), not the entire input. This encodes the prior that nearby pixels are more related than distant ones.
Weight sharing: The same filter is applied at every spatial position. A feature detector learned in one part of the image works everywhere. This dramatically reduces parameters: a 3x3 filter has 9 weights regardless of image size.
Translation equivariance: If the input shifts, the output shifts by the same amount. A cat detector works regardless of where the cat is in the image.

60-Second Answer

"A CNN applies learned filters across spatial positions using three key ideas: local connectivity (each neuron sees only a small region), weight sharing (the same filter detects the same feature everywhere), and translation equivariance (features are detected regardless of position). Early layers learn edges and textures, middle layers learn parts (eyes, wheels), and deep layers learn objects. The output size formula is $\lfloor(W - K + 2P)/S\rfloor + 1$ for input size $W$ , kernel $K$ , padding $P$ , stride $S$ . Modern CNNs use skip connections (ResNet) to enable training hundreds of layers by providing direct gradient paths."

The Math: 2D Convolution

For an input feature map $X$ of size $H \times W$ and a kernel $K$ of size $k \times k$:

$(X * K)[i, j] = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X[i+m, j+n] \cdot K[m, n]$

Technically, this is cross-correlation, not convolution (which would flip the kernel). In deep learning, we always mean cross-correlation when we say "convolution" - the distinction does not matter because the kernels are learned.
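The double sum above can be sketched directly in NumPy. This is a minimal, unoptimized illustration for a single channel with no padding and stride 1; the function name `cross_correlate2d` and the example kernel are illustrative choices, not part of any library.

```python
import numpy as np

def cross_correlate2d(x, kernel):
    """Valid (no padding), stride-1 cross-correlation of a 2D array with a 2D kernel."""
    k_h, k_w = kernel.shape
    out_h = x.shape[0] - k_h + 1
    out_w = x.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the kernel and the input window anchored at (i, j)
            out[i, j] = np.sum(x[i:i + k_h, j:j + k_w] * kernel)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)      # values increase by 1 per column
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)  # left column minus right column
result = cross_correlate2d(x, vertical_edge)
print(result)  # every entry is -6: three rows, each contributing x[c] - x[c+2] = -2
```

Note that the same nine kernel weights are reused at every window position, which is the weight sharing described earlier.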

Multi-Channel Convolution

In practice, inputs have $C_{\text{in}}$ channels (e.g., 3 for RGB) and we want $C_{\text{out}}$ output channels:

Each filter has shape $C_{\text{in}} \times k \times k$
We have $C_{\text{out}}$ such filters
Total weight shape: $C_{\text{out}} \times C_{\text{in}} \times k \times k$
Each filter produces one output channel by summing over all input channels

Parameter count: $C_{\text{out}} \times C_{\text{in}} \times k \times k + C_{\text{out}}$ (including bias)

Example: A conv layer with 64 input channels, 128 output channels, and 3x3 kernels has $128 \times 64 \times 3 \times 3 + 128 = 73,856$ parameters.
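The parameter formula is easy to turn into a helper for checking answers. A small sketch, where `conv2d_params` is an illustrative name rather than a library function:

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Parameter count of a k x k conv layer: one (c_in, k, k) filter per output channel."""
    weights = c_out * c_in * k * k
    return weights + (c_out if bias else 0)

print(conv2d_params(64, 128, 3))  # 73856, matching the example above
print(conv2d_params(3, 64, 7))    # 9472, a 7x7 conv on an RGB input
```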

Part 2 - Output Size, Stride, Padding, and Dilation

The Output Size Formula

This is the most frequently tested calculation in CNN interviews.

$O = \left\lfloor\frac{W - K + 2P}{S}\right\rfloor + 1$

where:

$W$ = input spatial dimension (height or width)
$K$ = kernel size
$P$ = padding (zeros added to each side)
$S$ = stride (step size of the sliding window)

For dilated convolutions:

$O = \left\lfloor\frac{W - K_{\text{eff}} + 2P}{S}\right\rfloor + 1, \quad K_{\text{eff}} = K + (K-1)(D-1)$

where $D$ = dilation rate and $K_{\text{eff}}$ is the effective kernel size.

Common Configurations

Configuration	Kernel	Stride	Padding	Effect on Size
Standard	3x3	1	0	Shrinks by 2 (each side loses 1)
Same padding	3x3	1	1	Preserves spatial size
Downsampling	3x3	2	1	Halves spatial size
Aggressive downsample	7x7	2	3	Roughly halves (used in ResNet stem)
Pointwise (1x1)	1x1	1	0	Changes channels only
Dilated	3x3, dilation=2	1	2	Preserves size, larger receptive field

Worked Examples

Example 1: Input 32x32, kernel 5x5, stride 1, padding 0.

$O = \left\lfloor\frac{32 - 5 + 0}{1}\right\rfloor + 1 = 28$

Example 2: Input 224x224, kernel 7x7, stride 2, padding 3.

$O = \left\lfloor\frac{224 - 7 + 6}{2}\right\rfloor + 1 = \left\lfloor\frac{223}{2}\right\rfloor + 1 = 111 + 1 = 112$

Example 3: Input 56x56, kernel 3x3, stride 2, padding 1.

$O = \left\lfloor\frac{56 - 3 + 2}{2}\right\rfloor + 1 = \left\lfloor\frac{55}{2}\right\rfloor + 1 = 27 + 1 = 28$
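The formula, including the dilated variant, can be checked with a few lines of Python. This is a minimal sketch; the function name `conv_output_size` and its argument names are illustrative.

```python
def conv_output_size(w, k, s=1, p=0, d=1):
    """Output size along one spatial dimension (apply separately to height and width)."""
    k_eff = k + (k - 1) * (d - 1)        # effective kernel size under dilation d
    return (w - k_eff + 2 * p) // s + 1  # integer division implements the floor

print(conv_output_size(32, 5))                 # 28  (Example 1)
print(conv_output_size(224, 7, s=2, p=3))      # 112 (Example 2)
print(conv_output_size(56, 3, s=2, p=1))       # 28  (Example 3)
print(conv_output_size(56, 3, s=1, p=2, d=2))  # 56  (dilated, size-preserving)
```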

Common Trap

The floor operation matters when the division is not exact. Input 8x8, kernel 3x3, stride 2, padding 0: $O = \lfloor(8-3)/2\rfloor + 1 = \lfloor 2.5 \rfloor + 1 = 3$, not 3.5. Some candidates forget the floor and get wrong answers. Also remember that the formula applies independently to height and width - they do not have to be equal.

Padding Types

Padding	Formula	When Used
Valid (no padding)	$P = 0$	When spatial shrinkage is acceptable
Same	$P = \lfloor K/2 \rfloor$ (for stride 1 and odd $K$)	Preserve spatial dimensions
Full	$P = K - 1$	Transposed convolutions, signal processing
Causal	Pad only one side	1D convolutions for time series (no future leakage)

Dilation (Atrous Convolution)

Dilation inserts gaps between kernel elements, enlarging the effective receptive field without adding parameters or reducing resolution.

A 3x3 kernel with dilation 2 has the same 9 parameters but covers a 5x5 effective area (with gaps). With dilation 4, it covers 9x9.

Use cases: Semantic segmentation (DeepLab), where you need large receptive fields at full resolution.

Part 3 - Receptive Field

What Is the Receptive Field?

The receptive field of a neuron is the region of the original input that can influence that neuron's value. It is determined by the cumulative effect of all preceding conv and pooling layers.

Recursive Receptive Field Formula

For layer $l$ with kernel size $k_l$ and stride $s_l$ :

$r_l = r_{l-1} + (k_l - 1) \cdot \prod_{i=1}^{l-1} s_i$

where $r_0 = 1$ (a single pixel).

The key insight: stride in early layers has a multiplicative effect on receptive field growth. This is one reason architectures like ResNet use a stride-2 conv in the first layer (the other is reduced compute) - it doubles the receptive field contribution of every subsequent layer.

Worked Example: Simple 3-Layer CNN

Layer	Kernel	Stride	Receptive Field
Input	-	-	$r_0 = 1$
Conv1	3x3	1	$r_1 = 1 + (3-1) \cdot 1 = 3$
Conv2	3x3	1	$r_2 = 3 + (3-1) \cdot 1 = 5$
Conv3	3x3	1	$r_3 = 5 + (3-1) \cdot 1 = 7$

Three 3x3 conv layers with stride 1 give a 7x7 receptive field.
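The recursion can be sketched as a short loop that tracks the running product of strides. The function name `receptive_field` and the variable `jump` are illustrative names, and the second example is a hypothetical stem-like stack rather than a specific published network.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) tuples in input-to-output order."""
    r = 1     # r_0: a single input pixel
    jump = 1  # product of the strides of all earlier layers
    for k, s in layers:
        r += (k - 1) * jump  # r_l = r_{l-1} + (k_l - 1) * prod(s_1 .. s_{l-1})
        jump *= s
    return r

print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7, the table above
print(receptive_field([(7, 2), (3, 2), (3, 1)]))  # 19: early strides amplify later layers
```

In the second stack, the final 3x3 layer adds 8 pixels of receptive field instead of 2, because the two stride-2 layers before it give it a jump of 4.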

Why Three 3x3 Convs Instead of One 7x7?

This is a classic interview question (VGGNet insight).

Approach	Parameters	Receptive Field	Nonlinearities
One 7x7 conv (64 channels)	$64 \times 64 \times 7 \times 7 = 200,704$	7x7	1
Three 3x3 convs (64 channels)	$3 \times 64 \times 64 \times 3 \times 3 = 110,592$	7x7	3

Three 3x3 convs have 45% fewer parameters and 3x more nonlinearity for the same receptive field. The extra nonlinear layers make the function more expressive. This is why VGG exclusively uses 3x3 convolutions.

Interviewer's Perspective

"The receptive field question separates candidates who understand CNN architecture from those who just use pretrained models. I ask: 'Your model fails to detect large objects. Why?' A strong candidate immediately thinks about receptive field - if the receptive field is smaller than the object, the network literally cannot see the whole object in any single neuron. Solutions: add more layers, use dilated convolutions, use larger strides, or add a global average pooling layer."

Part 4 - Pooling Operations

Max Pooling

Takes the maximum value in each window. With a 2x2 window and stride 2, it halves the spatial dimensions.

Properties:

Provides a small amount of translation invariance
Selects the strongest activation (most prominent feature)
No learnable parameters
Discards spatial information (location within the window)

Average Pooling

Takes the mean value in each window.

Properties:

Smoother than max pooling
Every value in the window contributes to the output, so less information is discarded than with max pooling
Used less frequently than max pooling in classification architectures

Global Average Pooling (GAP)

Averages each entire feature map into a single number. For a $C \times H \times W$ feature map, produces a $C$ -dimensional vector.

Properties:

Replaces fully connected layers at the end of classification CNNs (GoogLeNet, ResNet)
No parameters - eliminates the FC layer parameters
Acts as a structural regularizer
Provides complete translation invariance
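The three pooling operations can be compared on a small array. This is a minimal NumPy sketch for illustration; `pool2d` and `global_average_pool` are hypothetical helper names, and the loops favor readability over speed.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Max or average pooling over a single 2D feature map (no padding)."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out

def global_average_pool(feature_maps):
    """(C, H, W) -> (C,): one number per channel."""
    return feature_maps.mean(axis=(1, 2))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])
max_out = pool2d(x, mode="max")                        # [[4, 8], [4, 1]]
avg_out = pool2d(x, mode="avg")                        # [[2.5, 6.5], [1, 1]]
gap_out = global_average_pool(np.stack([x, 2.0 * x]))  # [2.75, 5.5]
print(max_out, avg_out, gap_out, sep="\n")
```

The bottom-left window shows the difference: max pooling reports the single strong activation (4), while average pooling dilutes it to 1.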

Figure: Pooling types compared - max, average, and global average pooling

Strided Convolution vs Pooling

Modern architectures often replace pooling with strided convolutions (stride 2):

Approach	Parameters	Learns what to discard?	Used in
Max pooling	0	No (fixed max operation)	VGG, older ResNets
Strided convolution	$C_\text{out} \times C_\text{in} \times K^2$	Yes (learned downsampling)	ResNet-D, ConvNeXt

Strided convolutions are often preferred in recent architectures because they allow the network to learn its downsampling rather than using a fixed max operation, at the cost of extra parameters and compute.

Part 5 - Architecture Evolution

This is one of the most frequently tested topics in CNN interviews. You must know the key innovation of each architecture and why it mattered.

Figure: CNN architecture evolution from LeNet (1998) to ConvNeXt (2022)

LeNet-5 (LeCun et al., 1998)

Innovation: Demonstrated that CNNs can learn useful features from raw pixels
Architecture: 2 conv layers (5x5), 2 subsampling layers, 3 FC layers
Parameters: ~60,000
Task: Handwritten digit recognition (MNIST)
Impact: Proved the concept but limited by hardware

AlexNet (Krizhevsky et al., 2012)

Innovation: Won ImageNet by a massive margin, launching the deep learning revolution
Key ideas: ReLU activation (not tanh/sigmoid), dropout regularization, GPU training, data augmentation, local response normalization
Architecture: 5 conv layers, 3 FC layers
Parameters: ~60 million
Impact: Proved that deep learning works at scale for vision

VGGNet (Simonyan & Zisserman, 2014)

Innovation: Showed that depth matters - use only 3x3 convolutions stacked deeply
Key insight: Three 3x3 convs = one 7x7 conv in receptive field, but with fewer parameters and more nonlinearity
Architecture: 16 or 19 layers, all 3x3 convs
Parameters: ~138 million (huge FC layers)
Limitation: Very expensive, no skip connections, training is difficult beyond 19 layers

GoogLeNet / Inception (Szegedy et al., 2014)

Innovation: Process at multiple scales simultaneously with the Inception module
Key idea: Each Inception module applies 1x1, 3x3, 5x5 convs and max pooling in parallel, then concatenates
1x1 convs for dimensionality reduction: Before the expensive 3x3 and 5x5 convs, a 1x1 conv reduces channels (the "bottleneck")
Parameters: ~6.8 million (roughly 20x fewer than VGG-16's ~138 million, thanks to 1x1 bottlenecks and global average pooling in place of large FC layers)
Impact: Showed that architecture engineering (not just depth) matters

ResNet (He et al., 2015)

Innovation: Skip connections enable training of networks with 152+ layers
The problem it solved: Plain networks degrade (not overfit - degrade) beyond ~20 layers. Adding more layers makes training loss worse.
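The solution: Instead of having the stacked layers learn the target mapping $H(x)$ directly, have them learn the residual $F(x) = H(x) - x$, so the block computes $H(x) = F(x) + x$. If the ideal mapping is close to identity, the layers only need to push $F(x)$ toward zero.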
Parameters: ~25M (ResNet-50) to ~60M (ResNet-152)
Impact: The single most important architecture innovation in CNNs

EfficientNet (Tan & Le, 2019)

Innovation: Compound scaling - scale depth, width, and resolution together with a principled formula
Key idea: Previous work scaled networks in one dimension (deeper OR wider OR higher resolution). EfficientNet scales all three simultaneously with compound coefficients: depth $\propto \alpha^\phi$ , width $\propto \beta^\phi$ , resolution $\propto \gamma^\phi$ with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$
Building block: MBConv (mobile inverted bottleneck) with depthwise separable convolutions and squeeze-and-excitation
Impact: State-of-the-art efficiency - much better accuracy/FLOPs tradeoff
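The compound scaling rule can be made concrete with a short calculation. The constants below ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$) are the grid-search values commonly cited from the EfficientNet paper for the B0 baseline; treat them as an assumption here and check the paper before relying on them. The function name `compound_scale` is illustrative.

```python
# Assumed constants: commonly cited grid-search values for the EfficientNet-B0 baseline.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# FLOPs scale roughly with depth * width^2 * resolution^2, so each step of phi
# multiplies the cost by about alpha * beta^2 * gamma^2.
flops_growth_per_step = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(flops_growth_per_step, 2))  # 1.92, close to the target of 2

for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(phi, round(depth, 2), round(width, 2), round(resolution, 2))
```

The constraint $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ is what makes each increment of $\phi$ roughly double the FLOPs budget.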

ConvNeXt (Liu et al., 2022)

Innovation: "A ConvNet for the 2020s" - modernized ResNet to match Vision Transformer performance
Key changes from ResNet:
1. Patchify stem (4x4 stride-4 conv, like ViT)
2. Larger kernels (7x7 depthwise conv, like Transformer attention windows)
3. GELU activation (from Transformers)
4. LayerNorm instead of BatchNorm (from Transformers)
5. Inverted bottleneck (expand then contract, from MobileNet)
6. Fewer activation functions (only one per block)
Impact: Proved that CNNs are not inherently inferior to Transformers - the architecture details matter

Part 6 - Skip Connections and Why ResNet Works

The Degradation Problem

Plain deep networks exhibit a surprising failure: deeper networks have higher training error than shallower ones. This is not overfitting (which would show low training error but high test error). This is an optimization failure - the optimizer cannot find a good solution.

If a 20-layer network achieves loss $L$ , a 56-layer network should achieve at most $L$ (it could just learn identity for the extra 36 layers). But in practice, the 56-layer network does worse. Why?

The gradient signal degrades over many layers (not just vanishing - the gradient direction becomes increasingly noisy), making it nearly impossible for early layers to learn useful features.

The Skip Connection Solution

A residual block computes:

$\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$

where $F$ is the residual function (typically two conv layers with BN and ReLU).

Figure: ResNet skip connection, $\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$, with the gradient highway

Mathematical Proof: Gradient Highways

The gradient of the loss w.r.t. the input $\mathbf{x}$ of a residual block:

$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \cdot \left(\frac{\partial F(\mathbf{x})}{\partial \mathbf{x}} + I\right) = \frac{\partial L}{\partial \mathbf{y}} \cdot \frac{\partial F(\mathbf{x})}{\partial \mathbf{x}} + \frac{\partial L}{\partial \mathbf{y}}$

The gradient has two components:

$\frac{\partial L}{\partial \mathbf{y}} \cdot \frac{\partial F}{\partial \mathbf{x}}$ : gradient through the conv layers (may vanish)
$\frac{\partial L}{\partial \mathbf{y}}$: direct gradient through the skip connection (passes through unattenuated, regardless of $F$)

For a network with $N$ residual blocks, the gradient from block $N$ to block 1 always has a direct path with factor $I^N = I$ . Even if all the residual functions $F$ have tiny gradients, the skip connections ensure the gradient reaches every layer.

Stacking Residual Blocks

For $N$ stacked residual blocks:

$\frac{\partial L}{\partial \mathbf{x}_0} = \frac{\partial L}{\partial \mathbf{x}_N} \cdot \prod_{n=1}^{N} \left(I + \frac{\partial F_n}{\partial \mathbf{x}_{n-1}}\right)$

Expanding this product creates $2^N$ terms, each representing a different path through the network. Critically, one of these paths is the all-identity path $I^N = I$ , which preserves the gradient magnitude exactly. The other $2^N - 1$ paths provide additional gradient information.
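A quick numerical experiment illustrates the contrast between the two Jacobian products. This is a toy sketch, not a trained network: the per-block Jacobians are random matrices with small entries (an assumption chosen so that each $\|J_n\|$ is well below 1), and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_blocks = 8, 50

# Hypothetical per-block Jacobians J_n = dF_n/dx with small entries
jacobians = [0.01 * rng.standard_normal((dim, dim)) for _ in range(n_blocks)]

plain = np.eye(dim)
residual = np.eye(dim)
for J in jacobians:
    plain = plain @ J                        # plain stack: product of J_n
    residual = residual @ (np.eye(dim) + J)  # residual stack: product of (I + J_n)

plain_norm = np.linalg.norm(plain)
residual_norm = np.linalg.norm(residual)
print(f"plain: {plain_norm:.3e}   residual: {residual_norm:.3e}")
```

The plain product collapses toward zero after 50 blocks, while the residual product stays on the order of the identity matrix's norm, because the all-identity path survives the expansion.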

Bottleneck Residual Block

For deeper ResNets (50, 101, 152 layers), a bottleneck design reduces computation:

1x1 conv: Reduce channels (e.g., 256 to 64) - the "bottleneck"
3x3 conv: Spatial processing at reduced channel count
1x1 conv: Restore channels (e.g., 64 to 256)

This 1x1-3x3-1x1 pattern has far fewer parameters than two 3x3 convs at the full channel width.

Parameter comparison (256 channels):

Design	Parameters
Two 3x3 convs at 256 channels	$2 \times 256 \times 256 \times 9 = 1,179,648$
Bottleneck (256-64-64-256)	$256 \times 64 + 64 \times 64 \times 9 + 64 \times 256 = 69,632$

The bottleneck has about 17x fewer parameters. The trade-off is a smaller per-block receptive field: one 3x3 conv (3x3) instead of two stacked 3x3 convs (5x5).
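The two parameter counts in the table can be reproduced in a few lines. A small sketch, with `conv_weights` as an illustrative helper name and biases omitted to match the table:

```python
def conv_weights(c_in, c_out, k):
    """Weight count of a k x k conv layer (bias omitted)."""
    return c_out * c_in * k * k

basic = 2 * conv_weights(256, 256, 3)
bottleneck = (conv_weights(256, 64, 1)    # 1x1 reduce: 256 -> 64
              + conv_weights(64, 64, 3)   # 3x3 at the reduced width
              + conv_weights(64, 256, 1)) # 1x1 restore: 64 -> 256
print(basic, bottleneck, round(basic / bottleneck, 1))  # 1179648 69632 16.9
```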

Instant Rejection

Do NOT say "ResNet works because it prevents vanishing gradients." This is partially true but incomplete. The deeper insight is that skip connections transform the optimization landscape - they make the loss surface smoother (Li et al., 2018 visualized this). Without skip connections, the loss surface has many sharp minima and saddle points that trap the optimizer. With skip connections, the landscape becomes more convex-like. If the interviewer asks "why not just use better optimization?" you need this answer.

Part 7 - 1x1 Convolutions

What 1x1 Convolutions Do

A 1x1 convolution with $C_{\text{out}}$ filters operates only along the channel dimension. At each spatial position, it computes a linear combination of the $C_{\text{in}}$ input channels to produce $C_{\text{out}}$ output channels.

It is equivalent to applying a fully connected layer independently at every spatial position (hence also called pointwise convolution or network in network).

Three Uses of 1x1 Convolutions

1. Dimensionality reduction (bottleneck): Reduce channels before expensive operations. For example, an Inception-style block that uses a 1x1 conv to reduce 256 channels to 64 before a 5x5 conv cuts the cost of that 5x5 conv by $\frac{256}{64} = 4\times$ (for a fixed number of output channels).

2. Dimensionality expansion: Increase channels. In inverted bottlenecks (MobileNetV2, EfficientNet), 1x1 expands channels before depthwise conv.

3. Channel mixing: Learn cross-channel interactions without spatial operations. This is what ResNet's bottleneck does.

Parameter count for 1x1 conv: $C_{\text{out}} \times C_{\text{in}} + C_{\text{out}}$ - no spatial kernel parameters.
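The equivalence between a 1x1 convolution and a per-position fully connected layer can be verified numerically. A minimal NumPy sketch with arbitrary small shapes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, h, w = 4, 2, 3, 3
x = rng.standard_normal((c_in, h, w))
weight = rng.standard_normal((c_out, c_in))  # a 1x1 kernel is just a (C_out, C_in) matrix
bias = rng.standard_normal(c_out)

# 1x1 convolution: mix channels at every spatial position at once
conv_out = np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

# The same computation as a fully connected layer applied pixel by pixel
fc_out = np.empty_like(conv_out)
for i in range(h):
    for j in range(w):
        fc_out[:, i, j] = weight @ x[:, i, j] + bias

print(np.allclose(conv_out, fc_out))  # True
```

No neighboring pixels are ever read, which is why a 1x1 convolution leaves the receptive field unchanged.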

Part 8 - Depthwise Separable Convolutions

Standard Convolution Cost

For input $C_{\text{in}} \times H \times W$ , a standard $K \times K$ convolution to $C_{\text{out}}$ channels:

Parameters: $C_{\text{out}} \times C_{\text{in}} \times K \times K$
FLOPs (counted as multiply-accumulates): $C_{\text{out}} \times C_{\text{in}} \times K \times K \times H_{\text{out}} \times W_{\text{out}}$

Depthwise Separable Convolution

Splits the standard convolution into two steps:

Step 1 - Depthwise convolution: Apply one $K \times K$ filter per input channel independently.

Parameters: $C_{\text{in}} \times K \times K$
Each channel is filtered separately (no cross-channel mixing)

Step 2 - Pointwise convolution: Apply a 1x1 convolution to mix channels.

Parameters: $C_{\text{out}} \times C_{\text{in}}$
Performs all cross-channel interaction

Figure: Standard vs depthwise separable convolution (roughly 8-9x fewer FLOPs for 3x3 kernels)

Computational Savings

$\text{Ratio} = \frac{C_{\text{in}} \times K^2 + C_{\text{out}} \times C_{\text{in}}}{C_{\text{out}} \times C_{\text{in}} \times K^2} = \frac{1}{C_{\text{out}}} + \frac{1}{K^2}$

For typical values ($C_{\text{out}} = 256$, $K = 3$):

$\text{Ratio} = \frac{1}{256} + \frac{1}{9} \approx 0.115$

Depthwise separable convolutions use roughly 8-9x fewer FLOPs and 8-9x fewer parameters than standard convolutions.
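The ratio can be checked by counting multiply-accumulates for both designs. A short sketch; the function names are illustrative, and the 56x56 output size with 256 input channels is an arbitrary example.

```python
def standard_conv_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulates for a standard k x k convolution."""
    return c_out * c_in * k * k * h_out * w_out

def depthwise_separable_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulates for a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = c_in * k * k * h_out * w_out  # one k x k filter per input channel
    pointwise = c_out * c_in * h_out * w_out  # 1x1 conv mixes channels
    return depthwise + pointwise

std = standard_conv_macs(256, 256, 3, 56, 56)
dws = depthwise_separable_macs(256, 256, 3, 56, 56)
ratio = dws / std
print(round(ratio, 3), round(1 / 256 + 1 / 9, 3))  # both 0.115
```

The spatial size and $C_{\text{in}}$ cancel out of the ratio, so only $C_{\text{out}}$ and $K$ determine the savings.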

Where They Are Used

Architecture	How DSC Is Used
MobileNet (Howard et al., 2017)	All convolutions are depthwise separable
Xception (Chollet, 2017)	Replaces all Inception module convolutions
EfficientNet (Tan & Le, 2019)	MBConv blocks use depthwise separable convs
ConvNeXt (Liu et al., 2022)	7x7 depthwise conv followed by 1x1 (pointwise) layers in an inverted bottleneck

Part 9 - Transfer Learning and Fine-Tuning

Why Transfer Learning Works

CNNs learn hierarchical features:

Early layers (1-3): Low-level features - edges, textures, colors. These are universal across tasks.
Middle layers (4-8): Mid-level features - corners, contours, patterns. Somewhat task-specific.
Late layers (9+): High-level features - object parts, scenes. Highly task-specific.

Transfer learning works because early and middle layer features are useful across very different tasks (ImageNet features work for medical images, satellite images, etc.).

Fine-Tuning Strategies

Figure: Transfer learning strategy by dataset size - feature extraction vs fine-tuning

Practical Fine-Tuning Recipe

Replace the classification head: Remove the final FC layer, add a new one matching your number of classes
Freeze backbone initially: Train only the new head for 5-10 epochs
Unfreeze gradually: Start unfreezing from the last layer backward
Use differential learning rates: Early layers get 10x-100x smaller LR than the head
Use smaller overall LR: Start with $10^{-4}$ to $10^{-5}$ (not $10^{-2}$ like training from scratch)
Data augmentation: Critical when fine-tuning dataset is small
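The differential learning rate step of the recipe can be sketched without any framework. This is one possible scheme under stated assumptions: the group names are hypothetical labels for a ResNet-like backbone, and the geometric decay of 0.4 per group is an arbitrary choice that makes the earliest group roughly 100x smaller than the head.

```python
def discriminative_lrs(group_names, head_lr=1e-4, decay=0.4):
    """Assign geometrically smaller learning rates to groups further from the head.

    group_names must be ordered from the earliest backbone group to the head.
    """
    n = len(group_names)
    return {name: head_lr * decay ** (n - 1 - i) for i, name in enumerate(group_names)}

# Hypothetical layer groups, earliest first
groups = ["stem", "stage1", "stage2", "stage3", "stage4", "head"]
lrs = discriminative_lrs(groups)
for name in groups:
    print(f"{name:>7}: {lrs[name]:.2e}")
```

In a framework such as PyTorch, each entry would typically become one optimizer parameter group with its own `lr`.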

Common Fine-Tuning Mistakes

Mistake	Why It Fails	Fix
Using the same LR everywhere	A learning rate large enough for the new head overwrites the pretrained, general-purpose features in early layers	Differential LR: head 10x, middle 5x, early 1x
Not freezing initially	Random head weights send garbage gradients to backbone	Freeze backbone, train head first
Training too long	Small datasets cause overfitting quickly	Early stopping, strong augmentation
Wrong input normalization	Pretrained model expects ImageNet normalization	Always use `mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]` for ImageNet-pretrained models
Resizing inputs incorrectly	Models trained at 224x224 lose features at other sizes	Resize to the pretrained resolution or use multi-scale training

Company Variation

At Google/Meta, transfer learning questions focus on architecture design: "How would you modify a ResNet-50 for a 3-channel radar image with 500x500 resolution?" At startups, the questions are more practical: "You have 500 labeled images. Walk me through your transfer learning pipeline." At Apple, expect questions about efficient fine-tuning for on-device models.

Practice Problems

Problem 1: Output Size Calculation

A CNN has the following layers applied to a 224x224x3 input:

Conv: 64 filters, 7x7, stride 2, padding 3
Max pool: 3x3, stride 2, padding 1
Conv: 128 filters, 3x3, stride 1, padding 1
Conv: 128 filters, 3x3, stride 2, padding 1

Compute the spatial size after each layer.

Hint 1 - Direction

Apply the formula $O = \lfloor(W - K + 2P)/S\rfloor + 1$ at each layer. Be careful with the floor operation.

Hint 2 - Insight

Layer 1: $\lfloor(224 - 7 + 6)/2\rfloor + 1 = 112$ . Stride 2 halves the size. Continue for each layer.

Hint 3 - Full Solution + Rubric

Layer	Input Size	Formula	Output Size
Conv1	224x224	$\lfloor(224-7+6)/2\rfloor + 1$	112x112
MaxPool	112x112	$\lfloor(112-3+2)/2\rfloor + 1$	56x56
Conv2	56x56	$\lfloor(56-3+2)/1\rfloor + 1$	56x56
Conv3	56x56	$\lfloor(56-3+2)/2\rfloor + 1$	28x28

Final output: 28x28x128.

Total parameter count:

Conv1: $64 \times 3 \times 7 \times 7 + 64 = 9,472$
Conv2: $128 \times 64 \times 3 \times 3 + 128 = 73,856$
Conv3: $128 \times 128 \times 3 \times 3 + 128 = 147,584$
Total: 230,912

Scoring Rubric:

Strong Hire: All sizes correct, computes parameter counts, notes this resembles ResNet stem
Lean Hire: Sizes correct but needs to carefully think through the formula
No Hire: Makes errors in the stride-2 calculations or forgets the +1

Problem 2: ResNet Skip Connection Gradient

Prove mathematically that skip connections prevent vanishing gradients. Specifically, for a network with $N$ residual blocks, show that the gradient from the last block to the first always has a term with magnitude 1.

Hint 1 - Direction

Write the output of a residual block: $\mathbf{x}_{n+1} = F_n(\mathbf{x}_n) + \mathbf{x}_n$ . Apply the chain rule repeatedly.

Hint 2 - Insight

$\frac{\partial \mathbf{x}_{n+1}}{\partial \mathbf{x}_n} = \frac{\partial F_n}{\partial \mathbf{x}_n} + I$ . The product of these terms over $N$ blocks, when expanded, contains the term $I^N = I$ .

Hint 3 - Full Solution + Rubric

For residual block $n$ : $\mathbf{x}_{n+1} = F_n(\mathbf{x}_n) + \mathbf{x}_n$

The gradient:

$\frac{\partial \mathbf{x}_N}{\partial \mathbf{x}_0} = \prod_{n=0}^{N-1} \frac{\partial \mathbf{x}_{n+1}}{\partial \mathbf{x}_n} = \prod_{n=0}^{N-1} \left(I + \frac{\partial F_n}{\partial \mathbf{x}_n}\right)$

Expanding this product:

$\prod_{n=0}^{N-1}(I + J_n) = I + \sum_n J_n + \sum_{m < n} J_m J_n + \cdots + \prod_n J_n$

where $J_n = \frac{\partial F_n}{\partial \mathbf{x}_n}$ .

The first term is $I$ - the identity. This means:

$\frac{\partial L}{\partial \mathbf{x}_0} = \frac{\partial L}{\partial \mathbf{x}_N} \cdot I + \text{(other terms)} = \frac{\partial L}{\partial \mathbf{x}_N} + \text{(other terms)}$

The gradient of the loss w.r.t. $\mathbf{x}_0$ always contains the term $\frac{\partial L}{\partial \mathbf{x}_N}$ with no attenuation. No matter how many layers there are, the gradient from the last layer reaches the first layer with full magnitude through the skip connections.

Without skip connections: $\frac{\partial \mathbf{x}_N}{\partial \mathbf{x}_0} = \prod J_n$ , which vanishes exponentially if $\|J_n\| < 1$ .

Scoring Rubric:

Strong Hire: Complete derivation, expands the product to show the $I$ term, contrasts with the non-skip case, mentions the $2^N$ paths interpretation
Lean Hire: Correctly derives $I + J_n$ for one block and intuits the result for $N$ blocks
No Hire: Says "skip connections help gradient flow" without any mathematical argument

Problem 3: Depthwise Separable Computation

You need to process a 56x56x256 feature map with a 3x3 convolution producing 512 output channels. Compare the FLOPs for a standard convolution vs a depthwise separable convolution.

Hint 1 - Direction

Standard conv FLOPs: $C_\text{out} \times C_\text{in} \times K^2 \times H_\text{out} \times W_\text{out}$ . Depthwise: compute each step separately.

Hint 2 - Insight

Assume stride 1, same padding, so output is 56x56. Standard: $512 \times 256 \times 9 \times 56 \times 56$ . Depthwise: $256 \times 9 \times 56^2$ for depthwise + $512 \times 256 \times 56^2$ for pointwise.

Hint 3 - Full Solution + Rubric

Assume stride 1, padding 1 (same), output size 56x56.

Standard convolution: FLOPs = $512 \times 256 \times 9 \times 56 \times 56 = 3,699,376,128 \approx 3.7$ GFLOPs

Depthwise separable convolution:

Depthwise: $256 \times 9 \times 56 \times 56 = 7,225,344 \approx 7.2$ MFLOPs
Pointwise: $512 \times 256 \times 56 \times 56 = 411,041,792 \approx 411$ MFLOPs
Total: $418,267,136 \approx 418$ MFLOPs

Speedup: $3699 / 418 \approx 8.8\times$

This matches the theoretical ratio: $\frac{1}{C_\text{out}} + \frac{1}{K^2} = \frac{1}{512} + \frac{1}{9} \approx 0.113$ , so $1/0.113 \approx 8.8\times$ .

Scoring Rubric:

Strong Hire: Computes both correctly, derives the speedup, states the general ratio formula, mentions that actual wall-clock speedup may differ due to memory access patterns
Lean Hire: Computes both correctly and notes the large savings
No Hire: Cannot set up the FLOP calculation or confuses depthwise and pointwise steps

Problem 4: Transfer Learning Strategy

You have 2,000 labeled X-ray images (4 disease classes) and want to build a classifier. You have a ResNet-50 pretrained on ImageNet. Design your transfer learning strategy and justify each decision.

Hint 1 - Direction

2,000 images is a small dataset. Think about overfitting risk. Medical images are somewhat different from ImageNet but still share low-level features (edges, textures).

Hint 2 - Insight

Strategy: Replace the classification head (1000 to 4 classes), freeze backbone initially, then progressively unfreeze. Use strong data augmentation. Consider differential learning rates.

Hint 3 - Full Solution + Rubric

Step 1 - Modify architecture:

Replace final FC layer (1000 classes) with new FC layer (4 classes)
Consider adding a hidden layer (e.g., 512 units + ReLU + dropout 0.5) before the final classifier for more capacity

Step 2 - Phase 1: Feature extraction (10 epochs):

Freeze all ResNet backbone weights
Train only the new head with LR = $10^{-3}$
Use SGD with momentum or Adam
This establishes reasonable head weights without disturbing learned features

Step 3 - Phase 2: Fine-tune last stage (20 epochs):

Unfreeze ResNet stage 4 (last residual blocks)
Use differential LR: backbone LR = $10^{-5}$ , head LR = $10^{-4}$
This adapts high-level features to the medical domain

Step 4 - Phase 3: Fine-tune all (optional, 10 epochs):

Unfreeze everything
Use very small backbone LR ( $10^{-6}$ ), head LR ( $10^{-5}$ )
Only do this if validation accuracy is still improving

Data augmentation (critical with 2K images):

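Random horizontal flip (only if left/right orientation carries no diagnostic meaning), rotation (up to 15 degrees), brightness/contrast jitter (X-rays are grayscale, so hue and saturation jitter do not apply)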
Random crop with resize back to 224x224
Mixup or CutMix for regularization
Test-time augmentation for final predictions

Additional considerations:

Use ImageNet normalization: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
X-rays are grayscale - replicate to 3 channels, or modify the first conv layer
Use 5-fold cross-validation given the small dataset
Consider using a smaller model (ResNet-18) to reduce overfitting risk

Scoring Rubric:

Strong Hire: Multi-phase fine-tuning with differential LR, strong augmentation, addresses grayscale input, mentions cross-validation, considers smaller model
Lean Hire: Correct basic approach (freeze then fine-tune) with data augmentation
No Hire: Trains from scratch on 2K images or uses the same LR for all layers

Interview Cheat Sheet

Concept	Key Fact	Common Mistakes
Output size formula	$\lfloor(W - K + 2P)/S\rfloor + 1$	Forgetting the floor or the $+1$
Receptive field	$r_l = r_{l-1} + (k_l - 1) \cdot \prod s_i$	Not accounting for stride's multiplicative effect
3x3 vs 7x7	Three 3x3 = 7x7 RF, fewer params, more nonlinearity	Saying they are "the same" without quantifying
Skip connections	$\frac{\partial}{\partial x}(F(x) + x) = F'(x) + I$ - gradient cannot vanish	Saying "prevents vanishing gradients" without the math
1x1 convolutions	Channel mixing, bottleneck reduction, no spatial operation	Thinking they are useless because kernel is "too small"
Depthwise separable	~8-9x fewer FLOPs for 3x3 kernels	Confusing depthwise and pointwise steps
Global average pooling	Replaces FC layers, no parameters, full translation invariance	Not knowing it exists (many candidates only know max pooling)
Transfer learning	Freeze first, differential LR, augment heavily for small data	Using same LR everywhere or training from scratch
ResNet bottleneck	1x1-3x3-1x1 pattern, about 17x fewer params than two full-width 3x3 convs	Not knowing why 1x1 convs are needed
ConvNeXt	Modernized ResNet with Transformer tricks, matches ViT	Saying "CNNs are obsolete because of Transformers"
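Several rows of the cheat sheet can be checked numerically. The sketch below assumes dilation 1 and ignores biases. The 256/64 channel widths in the bottleneck comparison are an assumption (the usual ResNet-50 first-stage sizes), and the helper names are illustrative.

```python
def conv_output_size(w, k, s=1, p=0):
    """Output size formula: floor((W - K + 2P) / S) + 1 (dilation 1)."""
    return (w - k + 2 * p) // s + 1

def receptive_field(layers):
    """Recursive receptive field; layers is a list of (kernel, stride)."""
    r, jump = 1, 1  # jump = product of strides of all earlier layers
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution, biases ignored."""
    return k * k * c_in * c_out

# 6x6 input, 3x3 kernel, stride 2, padding 1
print(conv_output_size(6, 3, s=2, p=1))  # 3

# Three stacked 3x3 stride-1 convs see a 7x7 window
print(receptive_field([(3, 1)] * 3))  # 7

# Two 3x3 convs at 256 channels vs a 256 -> 64 -> 64 -> 256 bottleneck
plain = 2 * conv_params(3, 256, 256)
bottleneck = (conv_params(1, 256, 64) + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))
print(plain, bottleneck, round(plain / bottleneck, 1))  # 1179648 69632 16.9
```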

Spaced Repetition Checkpoints

Day 0 - After First Read

Write the output size formula from memory and solve 3 examples
Draw a residual block and write the gradient equation showing the identity term
List the 7 key architectures in order (LeNet, AlexNet, VGG, GoogLeNet, ResNet, EfficientNet, ConvNeXt) and state each one's primary innovation
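As a reference for the gradient equation in the list above, one way to write it is shown here. It assumes a single block with output $y = F(x) + x$ and loss $\mathcal{L}$, and ignores any nonlinearity applied after the addition:

$$
\text{plain block: } y = F(x) \;\Rightarrow\; \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
$$

$$
\text{residual block: } y = F(x) + x \;\Rightarrow\; \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F}{\partial x} + I\right)
$$

The $I$ term passes $\frac{\partial \mathcal{L}}{\partial y}$ back unchanged, so the gradient reaching $x$ does not depend solely on $\frac{\partial F}{\partial x}$.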

Day 3 - First Review

Compute the receptive field for a 5-layer CNN with alternating stride-1 and stride-2 layers
Explain depthwise separable convolutions and compute the FLOP ratio for 3x3 kernels
Compare VGG-16 and ResNet-50 in terms of: depth, parameters, key innovation, performance
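For the depthwise separable item, the FLOP ratio can be sketched as below. It counts multiply-accumulates only, assumes equal input and output spatial size, and uses an arbitrary 56x56 feature map (the spatial size cancels in the ratio). Function names are illustrative.

```python
def standard_conv_macs(k, c_in, c_out, h, w):
    """Multiply-accumulates for a standard k x k convolution."""
    return k * k * c_in * c_out * h * w

def depthwise_separable_macs(k, c_in, c_out, h, w):
    """Depthwise k x k step plus pointwise 1x1 step."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1x1 conv mixes channels
    return depthwise + pointwise

def savings(k, c_in, c_out, h=56, w=56):
    """How many times cheaper the separable version is."""
    return (standard_conv_macs(k, c_in, c_out, h, w)
            / depthwise_separable_macs(k, c_in, c_out, h, w))

for c in (64, 256, 1024):
    print(c, round(savings(3, c, c), 2))
# 64 7.89
# 256 8.69
# 1024 8.92
```

The ratio equals $k^2 C_{out} / (k^2 + C_{out})$, which approaches $k^2 = 9$ as the channel count grows. That is where the "8-9x" figure for 3x3 kernels comes from.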

Day 7 - Connections Review

Explain how the need for a large receptive field pushes networks deeper, how depth leads to the degradation problem, and how skip connections address it
Explain how 1x1 convolutions are used in: GoogLeNet (bottleneck), ResNet (channel matching), MobileNet (pointwise)
Design a CNN architecture for a given task, justifying kernel sizes, strides, and normalization

Day 14 - Interview Simulation

Given a feature map shape and a target, design 3 conv layers with correct math
Prove the ResNet gradient benefit on a whiteboard in under 5 minutes
Walk through a complete transfer learning strategy for a given scenario

Day 21 - Final Calibration

Complete all 4 practice problems under time pressure (10 minutes each)
Explain why ConvNeXt adopted ideas from Transformers and what they changed
Connect CNNs to the broader deep learning picture: how do they relate to attention (Vision Transformers), and when would you choose one over the other?
