Convolutional Neural Networks - From Pixels to Understanding

Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, CV Eng, Research Engineer, Robotics ML

The Real Interview Moment

You are in a Tesla Autopilot MLE on-site. The interviewer draws a 6x6 input feature map on the whiteboard and says: "Apply a 3x3 convolution with stride 2 and padding 1. What is the output size? Now stack 50 of these layers - what is the receptive field? And why would we use a ResNet instead of a plain stack?"

You start computing the output size but second-guess yourself on the padding formula. You get the receptive field calculation half right but mix up the recursive formula. When the interviewer asks about ResNet, you say "skip connections help gradient flow" - correct, but she wants the math: "Show me how the gradient changes with and without the skip connection."

CNN questions in interviews are deceptively layered. They start with simple arithmetic (output size calculation) and escalate to deep architectural reasoning (why ResNet works, what 1x1 convolutions do, why depthwise separable convolutions save computation). This page arms you with both the mechanical skills and the architectural intuition.

What You Will Master

  • Compute output dimensions for any conv layer given input size, kernel size, stride, padding, and dilation
  • Trace the convolution operation as a sliding dot product with weight sharing
  • Calculate receptive fields for deep networks using the recursive formula
  • Explain pooling operations (max, average, global) and their purposes
  • Narrate the architecture evolution: LeNet to AlexNet to VGG to GoogLeNet to ResNet to EfficientNet to ConvNeXt
  • Derive why skip connections enable training of very deep networks (gradient highway argument)
  • Explain 1x1 convolutions as pointwise channel mixing and dimensionality reduction
  • Analyze depthwise separable convolutions and their computational savings
  • Design transfer learning and fine-tuning strategies for new tasks
  • Answer CNN architecture questions with both math and intuition

Self-Assessment: Where Are You Now?

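| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Explain | 4 -- Can Derive | 5 -- Can Teach | Your Score |
| --- | --- | --- | --- | --- | --- | --- |
| Compute conv output size | | | | | | ___ |
| Explain convolution as sliding dot product | | | | | | ___ |
| Calculate receptive field | | | | | | ___ |
| Explain max pooling vs average pooling | | | | | | ___ |
| Trace LeNet to ResNet evolution | | | | | | ___ |
| Derive skip connection gradient benefit | | | | | | ___ |
| Explain 1x1 convolutions | | | | | | ___ |
| Explain depthwise separable convolutions | | | | | | ___ |
| Design transfer learning strategy | | | | | | ___ |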

Target: All 4s and 5s before your interview.

Part 1 - The Convolution Operation

What Convolution Does

A 2D convolution slides a small filter (kernel) across an input feature map, computing a dot product at each position. This produces an output feature map (also called an activation map).

Key properties that make convolution powerful for vision:

  1. Local connectivity: Each output neuron connects to only a small region of the input (the receptive field), not the entire input. This encodes the prior that nearby pixels are more related than distant ones.

  2. Weight sharing: The same filter is applied at every spatial position. A feature detector learned in one part of the image works everywhere. This dramatically reduces parameters: a 3x3 filter has 9 weights regardless of image size.

  3. Translation equivariance: If the input shifts, the output shifts by the same amount. A cat detector works regardless of where the cat is in the image.
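Translation equivariance can be checked on a toy 1D signal. The sketch below is illustrative only: `correlate1d` and the two-tap `edge_filter` are hypothetical names, and it uses valid (unpadded) cross-correlation in plain Python.

```python
def correlate1d(signal, kernel):
    """Valid 1D cross-correlation: slide the kernel and take dot products."""
    k = len(kernel)
    return [
        sum(signal[i + m] * kernel[m] for m in range(k))
        for i in range(len(signal) - k + 1)
    ]

edge_filter = [-1, 1]                  # hypothetical "step up" detector
signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0] + signal[:-2]         # same pattern, moved 2 positions right

out = correlate1d(signal, edge_filter)
out_shifted = correlate1d(shifted, edge_filter)

# Shifting the input by 2 shifts the response by 2; the response itself is unchanged.
print(out)          # [0, 1, 0, -1, 0, 0, 0]
print(out_shifted)  # [0, 0, 0, 1, 0, -1, 0]
```

The same weights fire on the same pattern at its new location, which is weight sharing and equivariance in one picture.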

60-Second Answer

"A CNN applies learned filters across spatial positions using three key ideas: local connectivity (each neuron sees only a small region), weight sharing (the same filter detects the same feature everywhere), and translation equivariance (features are detected regardless of position). Early layers learn edges and textures, middle layers learn parts (eyes, wheels), and deep layers learn objects. The output size formula is $\lfloor (W - K + 2P)/S \rfloor + 1$ for input size $W$, kernel $K$, padding $P$, stride $S$. Modern CNNs use skip connections (ResNet) to enable training hundreds of layers by providing direct gradient paths."

The Math: 2D Convolution

For an input feature map $X$ of size $H \times W$ and a kernel $K$ of size $k \times k$:

$$(X * K)[i, j] = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X[i+m, j+n] \cdot K[m, n]$$

Technically, this is cross-correlation, not convolution (which would flip the kernel). In deep learning, we always mean cross-correlation when we say "convolution" - the distinction does not matter because the kernels are learned.
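The double sum translates almost directly into code. This is a minimal single-channel sketch on nested lists, assuming stride 1 and no padding; `cross_correlate2d` is a hypothetical helper name, not a library function.

```python
def cross_correlate2d(x, kernel):
    """Valid (no padding), stride-1 2D cross-correlation on nested lists."""
    k = len(kernel)
    out_h = len(x) - k + 1
    out_w = len(x[0]) - k + 1
    return [
        [
            # Dot product between the kernel and the k x k patch at (i, j)
            sum(x[i + m][j + n] * kernel[m][n] for m in range(k) for n in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

image = [
    [1, 2, 3, 0],
    [4, 5, 6, 0],
    [7, 8, 9, 0],
    [0, 0, 0, 0],
]
kernel = [
    [1, 0],
    [0, -1],
]
result = cross_correlate2d(image, kernel)
print(result)  # [[-4, -4, 3], [-4, -4, 6], [7, 8, 9]]
```

A 4x4 input with a 2x2 kernel gives a 3x3 output, and the kernel is indexed without flipping, which is what makes this cross-correlation.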

Multi-Channel Convolution

In practice, inputs have $C_{\text{in}}$ channels (e.g., 3 for RGB) and we want $C_{\text{out}}$ output channels:

  • Each filter has shape $C_{\text{in}} \times k \times k$
  • We have $C_{\text{out}}$ such filters
  • Total weight shape: $C_{\text{out}} \times C_{\text{in}} \times k \times k$
  • Each filter produces one output channel by summing over all input channels

Parameter count: $C_{\text{out}} \times C_{\text{in}} \times k \times k + C_{\text{out}}$ (including bias)

Example: A conv layer with 64 input channels, 128 output channels, and 3x3 kernels has $128 \times 64 \times 3 \times 3 + 128 = 73{,}856$ parameters.

Part 2 - Output Size, Stride, Padding, and Dilation

The Output Size Formula

This is the most frequently tested calculation in CNN interviews.

$$O = \left\lfloor\frac{W - K + 2P}{S}\right\rfloor + 1$$

where:

  • $W$ = input spatial dimension (height or width)
  • $K$ = kernel size
  • $P$ = padding (zeros added to each side)
  • $S$ = stride (step size of the sliding window)

For dilated convolutions:

$$O = \left\lfloor\frac{W - K_{\text{eff}} + 2P}{S}\right\rfloor + 1, \quad K_{\text{eff}} = K + (K-1)(D-1)$$

where $D$ = dilation rate and $K_{\text{eff}}$ is the effective kernel size.
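The formula fits in a few lines of code. A small sketch, assuming symmetric zero padding and inputs large enough that the numerator is non-negative; `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(w, k, stride=1, padding=0, dilation=1):
    """Spatial output size of a conv (or pooling) layer along one dimension."""
    k_eff = k + (k - 1) * (dilation - 1)           # effective kernel size
    # Integer division implements the floor for non-negative numerators.
    return (w - k_eff + 2 * padding) // stride + 1

print(conv_output_size(32, 5))                          # 28
print(conv_output_size(224, 7, stride=2, padding=3))    # 112
print(conv_output_size(6, 3, stride=2, padding=1))      # 3
print(conv_output_size(56, 3, padding=2, dilation=2))   # 56
```

The third call is the whiteboard question from the opening scenario: a 6x6 input with a 3x3 kernel, stride 2, and padding 1 gives a 3x3 output.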

Common Configurations

| Configuration | Kernel | Stride | Padding | Effect on Size |
| --- | --- | --- | --- | --- |
| Standard | 3x3 | 1 | 0 | Shrinks by 2 (each side loses 1) |
| Same padding | 3x3 | 1 | 1 | Preserves spatial size |
| Downsampling | 3x3 | 2 | 1 | Halves spatial size |
| Aggressive downsample | 7x7 | 2 | 3 | Roughly halves (used in ResNet stem) |
| Pointwise (1x1) | 1x1 | 1 | 0 | Changes channels only |
| Dilated | 3x3, dilation=2 | 1 | 2 | Preserves size, larger receptive field |

Worked Examples

Example 1: Input 32x32, kernel 5x5, stride 1, padding 0.

$$O = \left\lfloor\frac{32 - 5 + 0}{1}\right\rfloor + 1 = 28$$

Example 2: Input 224x224, kernel 7x7, stride 2, padding 3.

$$O = \left\lfloor\frac{224 - 7 + 6}{2}\right\rfloor + 1 = \left\lfloor\frac{223}{2}\right\rfloor + 1 = 111 + 1 = 112$$

Example 3: Input 56x56, kernel 3x3, stride 2, padding 1.

$$O = \left\lfloor\frac{56 - 3 + 2}{2}\right\rfloor + 1 = \left\lfloor\frac{55}{2}\right\rfloor + 1 = 27 + 1 = 28$$

Common Trap

The floor operation matters when the division is not exact. Input 8x8, kernel 3x3, stride 2, padding 0: $O = \lfloor (8-3)/2 \rfloor + 1 = \lfloor 2.5 \rfloor + 1 = 3$, not 3.5. Candidates who forget the floor end up with a fractional size. Also remember that the formula applies independently to height and width, which do not have to be equal.

Padding Types

| Padding | Formula | When Used |
| --- | --- | --- |
| Valid (no padding) | $P = 0$ | When spatial shrinkage is acceptable |
| Same | $P = \lfloor K/2 \rfloor$ (for stride 1) | Preserve spatial dimensions |
| Full | $P = K - 1$ | Transposed convolutions, signal processing |
| Causal | Pad only one side | 1D convolutions for time series (no future leakage) |

Dilation (Atrous Convolution)

Dilation inserts gaps between kernel elements, enlarging the effective receptive field without adding parameters or reducing resolution.

A 3x3 kernel with dilation 2 has the same 9 parameters but covers a 5x5 effective area (with gaps). With dilation 4, it covers 9x9.

Use cases: Semantic segmentation (DeepLab), where you need large receptive fields at full resolution.

Part 3 - Receptive Field

What Is the Receptive Field?

The receptive field of a neuron is the region of the original input that can influence that neuron's value. It is determined by the cumulative effect of all preceding conv and pooling layers.

Recursive Receptive Field Formula

For layer $l$ with kernel size $k_l$ and stride $s_l$:

$$r_l = r_{l-1} + (k_l - 1) \cdot \prod_{i=1}^{l-1} s_i$$

where $r_0 = 1$ (a single pixel).

The key insight: stride in early layers has a multiplicative effect on receptive field growth. This is why architectures like ResNet use a stride-2 conv in the first layer - it doubles the receptive field contribution of every subsequent layer.

Worked Example: Simple 3-Layer CNN

| Layer | Kernel | Stride | Receptive Field |
| --- | --- | --- | --- |
| Input | - | - | $r_0 = 1$ |
| Conv1 | 3x3 | 1 | $r_1 = 1 + (3-1) \cdot 1 = 3$ |
| Conv2 | 3x3 | 1 | $r_2 = 3 + (3-1) \cdot 1 = 5$ |
| Conv3 | 3x3 | 1 | $r_3 = 5 + (3-1) \cdot 1 = 7$ |

Three 3x3 conv layers with stride 1 give a 7x7 receptive field.
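The recursion is easy to mechanize by carrying the running stride product alongside the receptive field. A minimal sketch, assuming undilated layers given as `(kernel, stride)` pairs; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """Receptive field after a stack of (kernel, stride) layers, input to output."""
    r, jump = 1, 1              # jump = product of strides of all earlier layers
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        jump *= stride
    return r

print(receptive_field([(3, 1)] * 3))               # 7
print(receptive_field([(7, 2), (3, 2), (3, 1)]))   # 19
```

For the opening whiteboard question (fifty 3x3 stride-2 layers), the same recursion gives $2^{51} - 1$, far larger than any real image. The practical answer is that the theoretical receptive field covers the whole input after only a handful of such layers.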

Why Three 3x3 Convs Instead of One 7x7?

This is a classic interview question (VGGNet insight).

| Approach | Parameters | Receptive Field | Nonlinearities |
| --- | --- | --- | --- |
| One 7x7 conv (64 channels) | $64 \times 64 \times 7 \times 7 = 200{,}704$ | 7x7 | 1 |
| Three 3x3 convs (64 channels) | $3 \times 64 \times 64 \times 3 \times 3 = 110{,}592$ | 7x7 | 3 |

Three 3x3 convs have 45% fewer parameters and 3x more nonlinearity for the same receptive field. The extra nonlinear layers make the function more expressive. This is why VGG exclusively uses 3x3 convolutions.

Interviewer's Perspective

"The receptive field question separates candidates who understand CNN architecture from those who just use pretrained models. I ask: 'Your model fails to detect large objects. Why?' A strong candidate immediately thinks about receptive field - if the receptive field is smaller than the object, the network literally cannot see the whole object in any single neuron. Solutions: add more layers, use dilated convolutions, use larger strides, or add a global average pooling layer."

Part 4 - Pooling Operations

Max Pooling

Takes the maximum value in each window. With a 2x2 window and stride 2, it halves the spatial dimensions.

Properties:

  • Provides a small amount of translation invariance
  • Selects the strongest activation (most prominent feature)
  • No learnable parameters
  • Discards spatial information (location within the window)

Average Pooling

Takes the mean value in each window.

Properties:

  • Smoother than max pooling
  • Preserves more spatial information
  • Used less frequently than max pooling in classification architectures

Global Average Pooling (GAP)

Averages each entire feature map into a single number. For a $C \times H \times W$ feature map, produces a $C$-dimensional vector.

Properties:

  • Replaces fully connected layers at the end of classification CNNs (GoogLeNet, ResNet)
  • No parameters - eliminates the FC layer parameters
  • Acts as a structural regularizer
  • Provides complete translation invariance
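The three pooling variants differ only in the window and the reduction. A plain-Python sketch on nested lists, assuming no padding; `pool2d`, `mean`, and `global_average_pool` are hypothetical helper names.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pool2d(x, size=2, stride=2, reducer=max):
    """Apply `reducer` (max or mean) over each size x size window."""
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    return [
        [
            reducer(
                x[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def global_average_pool(feature_maps):
    """Collapse each H x W map to one number: C x H x W -> C values."""
    return [mean(v for row in fmap for v in row) for fmap in feature_maps]

fmap = [
    [1, 3, 2, 0],
    [5, 7, 4, 6],
    [0, 2, 8, 8],
    [2, 0, 8, 8],
]
print(pool2d(fmap, reducer=max))    # [[7, 6], [2, 8]]
print(pool2d(fmap, reducer=mean))   # [[4.0, 3.0], [1.0, 8.0]]
print(global_average_pool([fmap]))  # [4.0]
```

Max pooling keeps only the strongest response per window, average pooling blends all four, and global average pooling discards position entirely.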

[Figure: Pooling Types Compared: Max, Average, and Global Average Pooling]

Strided Convolution vs Pooling

Modern architectures often replace pooling with strided convolutions (stride 2):

| Approach | Parameters | Learns what to discard? | Used in |
| --- | --- | --- | --- |
| Max pooling | 0 | No (fixed max operation) | VGG, older ResNets |
| Strided convolution | $C_\text{out} \times C_\text{in} \times K^2$ | Yes (learned downsampling) | ResNet-D, ConvNeXt |

Strided convolutions are now preferred because they allow the network to learn an optimal downsampling strategy rather than using a fixed max operation.

Part 5 - Architecture Evolution

This is one of the most frequently tested topics in CNN interviews. You must know the key innovation of each architecture and why it mattered.

[Figure: CNN Architecture Evolution from LeNet (1998) to ConvNeXt (2022)]

LeNet-5 (LeCun et al., 1998)

  • Innovation: Demonstrated that CNNs can learn useful features from raw pixels
  • Architecture: 2 conv layers (5x5), 2 subsampling layers, 3 FC layers
  • Parameters: ~60,000
  • Task: Handwritten digit recognition (MNIST)
  • Impact: Proved the concept but limited by hardware

AlexNet (Krizhevsky et al., 2012)

  • Innovation: Won ImageNet by a massive margin, launching the deep learning revolution
  • Key ideas: ReLU activation (not tanh/sigmoid), dropout regularization, GPU training, data augmentation, local response normalization
  • Architecture: 5 conv layers, 3 FC layers
  • Parameters: ~60 million
  • Impact: Proved that deep learning works at scale for vision

VGGNet (Simonyan & Zisserman, 2014)

  • Innovation: Showed that depth matters - use only 3x3 convolutions stacked deeply
  • Key insight: Three 3x3 convs = one 7x7 conv in receptive field, but with fewer parameters and more nonlinearity
  • Architecture: 16 or 19 layers, all 3x3 convs
  • Parameters: ~138 million (huge FC layers)
  • Limitation: Very expensive, no skip connections, training is difficult beyond 19 layers

GoogLeNet / Inception (Szegedy et al., 2014)

  • Innovation: Process at multiple scales simultaneously with the Inception module
  • Key idea: Each Inception module applies 1x1, 3x3, 5x5 convs and max pooling in parallel, then concatenates
  • 1x1 convs for dimensionality reduction: Before the expensive 3x3 and 5x5 convs, a 1x1 conv reduces channels (the "bottleneck")
  • Parameters: ~6.8 million (roughly 20x fewer than VGG's ~138 million, through bottleneck design)
  • Impact: Showed that architecture engineering (not just depth) matters

ResNet (He et al., 2015)

  • Innovation: Skip connections enable training of networks with 152+ layers
  • The problem it solved: Plain networks degrade (not overfit - degrade) beyond ~20 layers. Adding more layers makes training loss worse.
  • The solution: Instead of learning $H(x) = F(x)$, learn the residual $F(x) = H(x) - x$, so the layer computes $H(x) = F(x) + x$
  • Parameters: ~25M (ResNet-50) to ~60M (ResNet-152)
  • Impact: The single most important architecture innovation in CNNs

EfficientNet (Tan & Le, 2019)

  • Innovation: Compound scaling - scale depth, width, and resolution together with a principled formula
  • Key idea: Previous work scaled networks in one dimension (deeper OR wider OR higher resolution). EfficientNet scales all three simultaneously with compound coefficients: depth $\propto \alpha^\phi$, width $\propto \beta^\phi$, resolution $\propto \gamma^\phi$ with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$
  • Building block: MBConv (mobile inverted bottleneck) with depthwise separable convolutions and squeeze-and-excitation
  • Impact: State-of-the-art efficiency - much better accuracy/FLOPs tradeoff
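The compound scaling rule can be sketched numerically. The coefficients below ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$) are the values recalled from the EfficientNet paper's grid search. They are not stated in the text above, so treat them as an assumption. `compound_scale` is a hypothetical helper, and the released models round their depths, widths, and resolutions rather than following these multipliers exactly.

```python
# Assumed coefficients (recalled from the EfficientNet paper, not from this page)
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi, base_resolution=224):
    """Depth/width multipliers and input resolution for compound coefficient phi."""
    return {
        "depth_mult": ALPHA ** phi,
        "width_mult": BETA ** phi,
        "resolution": base_resolution * GAMMA ** phi,
    }

# FLOPs scale linearly with depth and quadratically with width and resolution,
# so each unit increase of phi multiplies FLOPs by roughly this factor.
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2

print(round(flops_factor, 2))   # 1.92
print(compound_scale(0))
print(compound_scale(3))
```

The constraint $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ therefore means that each step of $\phi$ costs about twice the FLOPs.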

ConvNeXt (Liu et al., 2022)

  • Innovation: "A ConvNet for the 2020s" - modernized ResNet to match Vision Transformer performance
  • Key changes from ResNet:
    1. Patchify stem (4x4 stride-4 conv, like ViT)
    2. Larger kernels (7x7 depthwise conv, like Transformer attention windows)
    3. GELU activation (from Transformers)
    4. LayerNorm instead of BatchNorm (from Transformers)
    5. Inverted bottleneck (expand then contract, from MobileNet)
    6. Fewer activation functions (only one per block)
  • Impact: Proved that CNNs are not inherently inferior to Transformers - the architecture details matter

Part 6 - Skip Connections and Why ResNet Works

The Degradation Problem

Plain deep networks exhibit a surprising failure: deeper networks have higher training error than shallower ones. This is not overfitting (which would show low training error but high test error). This is an optimization failure - the optimizer cannot find a good solution.

If a 20-layer network achieves loss $L$, a 56-layer network should achieve at most $L$ (it could just learn identity for the extra 36 layers). But in practice, the 56-layer network does worse. Why?

The gradient signal degrades over many layers (not just vanishing - the gradient direction becomes increasingly noisy), making it nearly impossible for early layers to learn useful features.

The Skip Connection Solution

A residual block computes:

$$\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$$

where $F$ is the residual function (typically two conv layers with BN and ReLU).

[Figure: ResNet Skip Connection: y = F(x) + x with Gradient Highway]

Mathematical Proof: Gradient Highways

The gradient of the loss w.r.t. the input $\mathbf{x}$ of a residual block:

$$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \cdot \left(\frac{\partial F(\mathbf{x})}{\partial \mathbf{x}} + I\right) = \frac{\partial L}{\partial \mathbf{y}} \cdot \frac{\partial F(\mathbf{x})}{\partial \mathbf{x}} + \frac{\partial L}{\partial \mathbf{y}}$$

The gradient has two components:

  1. $\frac{\partial L}{\partial \mathbf{y}} \cdot \frac{\partial F}{\partial \mathbf{x}}$: gradient through the conv layers (may vanish)
  2. $\frac{\partial L}{\partial \mathbf{y}}$: direct gradient through the skip connection (cannot vanish!)

For a network with $N$ residual blocks, the gradient from block $N$ to block 1 always has a direct path with factor $I^N = I$. Even if all the residual functions $F$ have tiny gradients, the skip connections ensure the gradient reaches every layer.

Stacking Residual Blocks

For $N$ stacked residual blocks:

$$\frac{\partial L}{\partial \mathbf{x}_0} = \frac{\partial L}{\partial \mathbf{x}_N} \cdot \prod_{n=1}^{N} \left(I + \frac{\partial F_n}{\partial \mathbf{x}_{n-1}}\right)$$

Expanding this product creates $2^N$ terms, each representing a different path through the network. Critically, one of these paths is the all-identity path $I^N = I$, which preserves the gradient magnitude exactly. The other $2^N - 1$ paths provide additional gradient information.
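A scalar toy version makes the contrast concrete. This is a deliberately simplified sketch: each block's Jacobian is replaced by a single number (0.01, an assumed "tiny local derivative"), so the matrix products collapse to ordinary multiplication. `chain_gradient` is a hypothetical helper name.

```python
def chain_gradient(local_derivs, skip):
    """Product of per-block factors: (1 + j) with a skip connection, j without."""
    grad = 1.0
    for j in local_derivs:
        grad *= (1.0 + j) if skip else j
    return grad

local_derivs = [0.01] * 50      # assumption: every block has a tiny derivative

plain = chain_gradient(local_derivs, skip=False)     # 0.01 ** 50
residual = chain_gradient(local_derivs, skip=True)   # 1.01 ** 50

print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {residual:.3f}")
```

Without skips the gradient is on the order of $10^{-100}$. With skips it stays of order 1, because the identity contributes a factor of 1 at every block.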

Bottleneck Residual Block

For deeper ResNets (50, 101, 152 layers), a bottleneck design reduces computation:

  1. 1x1 conv: Reduce channels (e.g., 256 to 64) - the "bottleneck"
  2. 3x3 conv: Spatial processing at reduced channel count
  3. 1x1 conv: Restore channels (e.g., 64 to 256)

This 1x1-3x3-1x1 pattern has far fewer parameters than two 3x3 convs at the full channel width.

Parameter comparison (256 channels):

| Design | Parameters |
| --- | --- |
| Two 3x3 convs at 256 channels | $2 \times 256 \times 256 \times 9 = 1{,}179{,}648$ |
| Bottleneck (256-64-64-256) | $256 \times 64 + 64 \times 64 \times 9 + 64 \times 256 = 69{,}632$ |

The bottleneck has roughly 17x fewer parameters. The trade-off is a smaller per-block receptive field: it contains one 3x3 conv instead of two, so each block contributes one 3x3 step of receptive field growth rather than two.
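The comparison can be reproduced in a few lines. A small sketch that counts weights only (biases and BatchNorm parameters are ignored, as in the table); `conv_weights` is a hypothetical helper name.

```python
def conv_weights(c_in, c_out, k):
    """Weight count of a k x k convolution, ignoring bias."""
    return c_out * c_in * k * k

# Basic design: two 3x3 convs at the full 256-channel width
basic = 2 * conv_weights(256, 256, 3)

# Bottleneck: 1x1 reduce (256 -> 64), 3x3 at 64 channels, 1x1 restore (64 -> 256)
bottleneck = (
    conv_weights(256, 64, 1)
    + conv_weights(64, 64, 3)
    + conv_weights(64, 256, 1)
)

print(basic, bottleneck, round(basic / bottleneck, 1))  # 1179648 69632 16.9
```

Most of the saving comes from running the 3x3 conv at a quarter of the channel width, which cuts its cost by a factor of 16.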

Instant Rejection

Do NOT stop at "ResNet works because it prevents vanishing gradients." This is partially true but incomplete. The deeper insight is that skip connections transform the optimization landscape - they make the loss surface smoother (Li et al., 2018 visualized this). Without skip connections, the loss surface has many sharp minima and saddle points that trap the optimizer. With skip connections, the landscape becomes more convex-like. If the interviewer asks "why not just use better optimization?" you need this answer.

Part 7 - 1x1 Convolutions

What 1x1 Convolutions Do

A 1x1 convolution with $C_{\text{out}}$ filters operates only along the channel dimension. At each spatial position, it computes a linear combination of the $C_{\text{in}}$ input channels to produce $C_{\text{out}}$ output channels.

It is equivalent to applying a fully connected layer independently at every spatial position (hence also called pointwise convolution or network in network).

Three Uses of 1x1 Convolutions

1. Dimensionality reduction (bottleneck): Reduce channels before expensive operations. GoogLeNet uses 1x1 convs to reduce 256 channels to 64 before a 5x5 conv, saving $\frac{256}{64} = 4\times$ computation in the 5x5 conv.

2. Dimensionality expansion: Increase channels. In inverted bottlenecks (MobileNetV2, EfficientNet), 1x1 expands channels before depthwise conv.

3. Channel mixing: Learn cross-channel interactions without spatial operations. This is what ResNet's bottleneck does.

Parameter count for 1x1 conv: $C_{\text{out}} \times C_{\text{in}} + C_{\text{out}}$ - no spatial kernel parameters.
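A 1x1 convolution is the same small linear layer applied at every pixel. The sketch below shows this in plain Python. `conv1x1` is a hypothetical helper, and the weights are made up for illustration: output channel 0 sums input channels 0 and 1, while output channel 1 doubles channel 2 and adds a bias.

```python
def conv1x1(x, weight, bias):
    """x: [C_in][H][W], weight: [C_out][C_in], bias: [C_out] -> [C_out][H][W]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [
                # Linear combination of the input channels at pixel (i, j)
                bias[o] + sum(weight[o][c] * x[c][i][j] for c in range(c_in))
                for j in range(w)
            ]
            for i in range(h)
        ]
        for o in range(len(weight))
    ]

x = [
    [[1, 2], [3, 4]],        # channel 0
    [[10, 20], [30, 40]],    # channel 1
    [[0, 1], [1, 0]],        # channel 2
]
weight = [[1, 1, 0], [0, 0, 2]]   # 3 input channels -> 2 output channels
bias = [0, 1]

y = conv1x1(x, weight, bias)
print(y)  # [[[11, 22], [33, 44]], [[1, 3], [3, 1]]]
```

The spatial size is untouched (2x2 in, 2x2 out), only the channel count changes, and the layer has $2 \times 3 + 2 = 8$ parameters.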

Part 8 - Depthwise Separable Convolutions

Standard Convolution Cost

For input $C_{\text{in}} \times H \times W$, a standard $K \times K$ convolution to $C_{\text{out}}$ channels:

  • Parameters: $C_{\text{out}} \times C_{\text{in}} \times K \times K$
  • FLOPs: $C_{\text{out}} \times C_{\text{in}} \times K \times K \times H_{\text{out}} \times W_{\text{out}}$

Depthwise Separable Convolution

Splits the standard convolution into two steps:

Step 1 - Depthwise convolution: Apply one $K \times K$ filter per input channel independently.

  • Parameters: $C_{\text{in}} \times K \times K$
  • Each channel is filtered separately (no cross-channel mixing)

Step 2 - Pointwise convolution: Apply a 1x1 convolution to mix channels.

  • Parameters: $C_{\text{out}} \times C_{\text{in}}$
  • Performs all cross-channel interaction

[Figure: Standard vs Depthwise Separable Convolution - 8-9x fewer FLOPs]

Computational Savings

$$\text{Ratio} = \frac{C_{\text{in}} \times K^2 + C_{\text{out}} \times C_{\text{in}}}{C_{\text{out}} \times C_{\text{in}} \times K^2} = \frac{1}{C_{\text{out}}} + \frac{1}{K^2}$$

For typical values ($C_{\text{out}} = 256$, $K = 3$):

$$\text{Ratio} = \frac{1}{256} + \frac{1}{9} \approx 0.115$$

Depthwise separable convolutions use roughly 8-9x fewer FLOPs and 8-9x fewer parameters than standard convolutions.
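The ratio can be checked by counting both designs directly. This sketch counts multiply-accumulates as FLOPs, ignores biases, and assumes the depthwise step keeps the spatial size. The function names are hypothetical, and `c_in = 256` is an assumed value, since the ratio does not depend on it.

```python
def standard_conv_cost(c_in, c_out, k, h_out, w_out):
    """(parameters, multiply-accumulates) of a standard k x k convolution."""
    params = c_out * c_in * k * k
    return params, params * h_out * w_out

def separable_conv_cost(c_in, c_out, k, h_out, w_out):
    """(parameters, multiply-accumulates) of depthwise k x k + pointwise 1x1."""
    params = c_in * k * k + c_out * c_in
    return params, params * h_out * w_out

c_in, c_out, k = 256, 256, 3
std_params, std_macs = standard_conv_cost(c_in, c_out, k, 56, 56)
sep_params, sep_macs = separable_conv_cost(c_in, c_out, k, 56, 56)

ratio = sep_params / std_params
print(round(ratio, 3), round(1 / ratio, 1))   # 0.115 8.7
```

Because both FLOP counts are the parameter count times the number of output positions, the parameter ratio and the FLOP ratio are identical.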

Where They Are Used

| Architecture | How DSC Is Used |
| --- | --- |
| MobileNet (Howard et al., 2017) | All convolutions except the first layer are depthwise separable |
| Xception (Chollet, 2017) | Replaces Inception modules with depthwise separable convolutions |
| EfficientNet (Tan & Le, 2019) | MBConv blocks use depthwise separable convs |
| ConvNeXt (Liu et al., 2022) | 7x7 depthwise convs followed by 1x1 pointwise layers in an inverted bottleneck |

Part 9 - Transfer Learning and Fine-Tuning

Why Transfer Learning Works

CNNs learn hierarchical features:

  • Early layers (1-3): Low-level features - edges, textures, colors. These are universal across tasks.
  • Middle layers (4-8): Mid-level features - corners, contours, patterns. Somewhat task-specific.
  • Late layers (9+): High-level features - object parts, scenes. Highly task-specific.

Transfer learning works because early and middle layer features are useful across very different tasks (ImageNet features work for medical images, satellite images, etc.).

Fine-Tuning Strategies

[Figure: Transfer Learning Strategy by Dataset Size - Feature Extraction vs Fine-Tuning]

Practical Fine-Tuning Recipe

  1. Replace the classification head: Remove the final FC layer, add a new one matching your number of classes
  2. Freeze backbone initially: Train only the new head for 5-10 epochs
  3. Unfreeze gradually: Start unfreezing from the last layer backward
  4. Use differential learning rates: Early layers get 10x-100x smaller LR than the head
  5. Use smaller overall LR: Start with $10^{-4}$ to $10^{-5}$ (not $10^{-2}$ like training from scratch)
  6. Data augmentation: Critical when fine-tuning dataset is small
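The differential learning rate step of the recipe can be sketched without committing to a framework. Everything below is illustrative: `differential_lrs` is a hypothetical helper, the stage names are made up, and the geometric spacing is one reasonable choice rather than a prescribed schedule.

```python
def differential_lrs(stages, head_lr=1e-4, max_ratio=100.0):
    """Per-stage learning rates, geometrically spaced.

    `stages` is ordered from input to output with the new head last (needs at
    least two entries). The head gets `head_lr`, the earliest stage gets
    `head_lr / max_ratio`.
    """
    n = len(stages)
    lrs = {}
    for i, name in enumerate(stages):
        depth_fraction = i / (n - 1)      # 0.0 at the input, 1.0 at the head
        lrs[name] = head_lr / max_ratio ** (1.0 - depth_fraction)
    return lrs

# Hypothetical stage names for a ResNet-style backbone plus a new head
stages = ["stem", "stage1", "stage2", "stage3", "stage4", "head"]
lrs = differential_lrs(stages)
for name in stages:
    print(f"{name:>7}: {lrs[name]:.2e}")
```

In PyTorch, values like these would typically be supplied as per-group `lr` entries in the list of parameter-group dicts passed to an optimizer. During the initial frozen phase, only the head's parameters would be handed to the optimizer at all.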

Common Fine-Tuning Mistakes

| Mistake | Why It Fails | Fix |
| --- | --- | --- |
| Using the same LR everywhere | A head-sized LR overwrites the general-purpose features in early layers | Differential LR relative to a base rate: head 10x, middle 5x, early 1x |
| Not freezing initially | Random head weights send garbage gradients to backbone | Freeze backbone, train head first |
| Training too long | Small datasets cause overfitting quickly | Early stopping, strong augmentation |
| Wrong input normalization | Pretrained model expects ImageNet normalization | Always use mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] for ImageNet-pretrained models |
| Resizing inputs incorrectly | Features learned at 224x224 may not match object scales at other resolutions | Resize to the pretrained resolution or use multi-scale training |

Company Variation

At Google/Meta, transfer learning questions focus on architecture design: "How would you modify a ResNet-50 for a 3-channel radar image with 500x500 resolution?" At startups, the questions are more practical: "You have 500 labeled images. Walk me through your transfer learning pipeline." At Apple, expect questions about efficient fine-tuning for on-device models.

Practice Problems

Problem 1: Output Size Calculation

A CNN has the following layers applied to a 224x224x3 input:

  1. Conv: 64 filters, 7x7, stride 2, padding 3
  2. Max pool: 3x3, stride 2, padding 1
  3. Conv: 128 filters, 3x3, stride 1, padding 1
  4. Conv: 128 filters, 3x3, stride 2, padding 1

Compute the spatial size after each layer.

Hint 1 - Direction

Apply the formula $O = \lfloor (W - K + 2P)/S \rfloor + 1$ at each layer. Be careful with the floor operation.

Hint 2 - Insight

Layer 1: $\lfloor (224 - 7 + 6)/2 \rfloor + 1 = 112$. Stride 2 halves the size. Continue for each layer.

Hint 3 - Full Solution + Rubric
| Layer | Input Size | Formula | Output Size |
| --- | --- | --- | --- |
| Conv1 | 224x224 | $\lfloor (224-7+6)/2 \rfloor + 1$ | 112x112 |
| MaxPool | 112x112 | $\lfloor (112-3+2)/2 \rfloor + 1$ | 56x56 |
| Conv2 | 56x56 | $\lfloor (56-3+2)/1 \rfloor + 1$ | 56x56 |
| Conv3 | 56x56 | $\lfloor (56-3+2)/2 \rfloor + 1$ | 28x28 |

Final output: 28x28x128.

Total parameter count:

  • Conv1: $64 \times 3 \times 7 \times 7 + 64 = 9{,}472$
  • Conv2: $128 \times 64 \times 3 \times 3 + 128 = 73{,}856$
  • Conv3: $128 \times 128 \times 3 \times 3 + 128 = 147{,}584$
  • Total: 230,912
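The whole solution can be double-checked by tracing the stack in code. A short sketch, assuming square inputs and bias terms on every conv; the helper names and the tuple layout are hypothetical.

```python
def out_size(w, k, s, p):
    return (w - k + 2 * p) // s + 1

def conv_params(c_in, c_out, k):
    return c_out * c_in * k * k + c_out     # weights + biases

# (name, kernel, stride, padding, c_out); c_out is None for the parameter-free pool
layers = [
    ("Conv1", 7, 2, 3, 64),
    ("MaxPool", 3, 2, 1, None),
    ("Conv2", 3, 1, 1, 128),
    ("Conv3", 3, 2, 1, 128),
]

size, channels, total_params = 224, 3, 0
trace = []
for name, k, s, p, c_out in layers:
    size = out_size(size, k, s, p)
    if c_out is not None:
        total_params += conv_params(channels, c_out, k)
        channels = c_out
    trace.append((name, size, channels))

for name, size, channels in trace:
    print(f"{name:>7}: {size}x{size}x{channels}")
print("total parameters:", total_params)   # 230912
```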

Scoring Rubric:

  • Strong Hire: All sizes correct, computes parameter counts, notes this resembles ResNet stem
  • Lean Hire: Sizes correct but needs to carefully think through the formula
  • No Hire: Makes errors in the stride-2 calculations or forgets the +1

Problem 2: ResNet Skip Connection Gradient

Prove mathematically that skip connections prevent vanishing gradients. Specifically, for a network with $N$ residual blocks, show that the gradient from the last block to the first always has a term with magnitude 1.

Hint 1 - Direction

Write the output of a residual block: $\mathbf{x}_{n+1} = F_n(\mathbf{x}_n) + \mathbf{x}_n$. Apply the chain rule repeatedly.

Hint 2 - Insight

$\frac{\partial \mathbf{x}_{n+1}}{\partial \mathbf{x}_n} = \frac{\partial F_n}{\partial \mathbf{x}_n} + I$. The product of these terms over $N$ blocks, when expanded, contains the term $I^N = I$.

Hint 3 - Full Solution + Rubric

For residual block $n$: $\mathbf{x}_{n+1} = F_n(\mathbf{x}_n) + \mathbf{x}_n$

The gradient:

$$\frac{\partial \mathbf{x}_N}{\partial \mathbf{x}_0} = \prod_{n=0}^{N-1} \frac{\partial \mathbf{x}_{n+1}}{\partial \mathbf{x}_n} = \prod_{n=0}^{N-1} \left(I + \frac{\partial F_n}{\partial \mathbf{x}_n}\right)$$

Expanding this product:

$$\prod_{n=0}^{N-1}(I + J_n) = I + \sum_n J_n + \sum_{m < n} J_m J_n + \cdots + \prod_n J_n$$

where $J_n = \frac{\partial F_n}{\partial \mathbf{x}_n}$.

The first term is $I$ - the identity. This means:

$$\frac{\partial L}{\partial \mathbf{x}_0} = \frac{\partial L}{\partial \mathbf{x}_N} \cdot I + \text{(other terms)} = \frac{\partial L}{\partial \mathbf{x}_N} + \text{(other terms)}$$

The gradient of the loss w.r.t. $\mathbf{x}_0$ always contains the term $\frac{\partial L}{\partial \mathbf{x}_N}$ with no attenuation. No matter how many layers there are, the gradient from the last layer reaches the first layer with full magnitude through the skip connections.

Without skip connections: $\frac{\partial \mathbf{x}_N}{\partial \mathbf{x}_0} = \prod J_n$, which vanishes exponentially if $\|J_n\| < 1$.
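As a concrete check of the expansion, the two-block case ($N = 2$) can be written out in full. The ordering of the matrix factors assumes the usual convention that later blocks multiply on the left.

```latex
\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_0}
  = \frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1}
    \cdot \frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}
  = (I + J_1)(I + J_0)
  = I + J_0 + J_1 + J_1 J_0
```

There are $2^2 = 4$ terms, one per path: skip both blocks ($I$), pass through only block 0 ($J_0$), pass through only block 1 ($J_1$), or pass through both ($J_1 J_0$). The plain network keeps only the last of these.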

Scoring Rubric:

  • Strong Hire: Complete derivation, expands the product to show the $I$ term, contrasts with the non-skip case, mentions the $2^N$ paths interpretation
  • Lean Hire: Correctly derives $I + J_n$ for one block and intuits the result for $N$ blocks
  • No Hire: Says "skip connections help gradient flow" without any mathematical argument

Problem 3: Depthwise Separable Computation

You need to process a 56x56x256 feature map with a 3x3 convolution producing 512 output channels. Compare the FLOPs for a standard convolution vs a depthwise separable convolution.

Hint 1 - Direction

Standard conv FLOPs: $C_\text{out} \times C_\text{in} \times K^2 \times H_\text{out} \times W_\text{out}$. Depthwise: compute each step separately.

Hint 2 - Insight

Assume stride 1, same padding, so output is 56x56. Standard: $512 \times 256 \times 9 \times 56 \times 56$. Depthwise separable: $256 \times 9 \times 56^2$ for depthwise + $512 \times 256 \times 56^2$ for pointwise.

Hint 3 - Full Solution + Rubric

Assume stride 1, padding 1 (same), output size 56x56.

Standard convolution: FLOPs = $512 \times 256 \times 9 \times 56 \times 56 = 3{,}699{,}376{,}128 \approx 3.7$ GFLOPs

Depthwise separable convolution:

  • Depthwise: $256 \times 9 \times 56 \times 56 = 7{,}225{,}344 \approx 7.2$ MFLOPs
  • Pointwise: $512 \times 256 \times 56 \times 56 = 411{,}041{,}792 \approx 411$ MFLOPs
  • Total: $418{,}267{,}136 \approx 418$ MFLOPs

Speedup: $3699 / 418 \approx 8.8\times$

This matches the theoretical ratio: $\frac{1}{C_\text{out}} + \frac{1}{K^2} = \frac{1}{512} + \frac{1}{9} \approx 0.113$, so $1/0.113 \approx 8.8\times$.

Scoring Rubric:

  • Strong Hire: Computes both correctly, derives the speedup, states the general ratio formula, mentions that actual wall-clock speedup may differ due to memory access patterns
  • Lean Hire: Computes both correctly and notes the large savings
  • No Hire: Cannot set up the FLOP calculation or confuses depthwise and pointwise steps

Problem 4: Transfer Learning Strategy

You have 2,000 labeled X-ray images (4 disease classes) and want to build a classifier. You have a ResNet-50 pretrained on ImageNet. Design your transfer learning strategy and justify each decision.

Hint 1 - Direction

2,000 images is a small dataset. Think about overfitting risk. Medical images are somewhat different from ImageNet but still share low-level features (edges, textures).

Hint 2 - Insight

Strategy: Replace the classification head (1000 to 4 classes), freeze backbone initially, then progressively unfreeze. Use strong data augmentation. Consider differential learning rates.

Hint 3 - Full Solution + Rubric

Step 1 - Modify architecture:

  • Replace final FC layer (1000 classes) with new FC layer (4 classes)
  • Consider adding a hidden layer (e.g., 512 units + ReLU + dropout 0.5) before the final classifier for more capacity

Step 2 - Phase 1: Feature extraction (10 epochs):

  • Freeze all ResNet backbone weights
  • Train only the new head with LR = $10^{-3}$
  • Use SGD with momentum or Adam
  • This establishes reasonable head weights without disturbing learned features

Step 3 - Phase 2: Fine-tune last stage (20 epochs):

  • Unfreeze ResNet stage 4 (last residual blocks)
  • Use differential LR: backbone LR = $10^{-5}$, head LR = $10^{-4}$
  • This adapts high-level features to the medical domain

Step 4 - Phase 3: Fine-tune all (optional, 10 epochs):

  • Unfreeze everything
  • Use very small backbone LR ($10^{-6}$), head LR ($10^{-5}$)
  • Only do this if validation accuracy is still improving

Data augmentation (critical with 2K images):

  • Random horizontal flip, rotation (up to 15 degrees), brightness/contrast jitter (the grayscale analogue of color jitter)
  • Random crop with resize back to 224x224
  • Mixup or CutMix for regularization
  • Test-time augmentation for final predictions

Additional considerations:

  • Use ImageNet normalization: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
  • X-rays are grayscale - replicate to 3 channels, or modify the first conv layer
  • Use 5-fold cross-validation given the small dataset
  • Consider using a smaller model (ResNet-18) to reduce overfitting risk

Scoring Rubric:

  • Strong Hire: Multi-phase fine-tuning with differential LR, strong augmentation, addresses grayscale input, mentions cross-validation, considers smaller model
  • Lean Hire: Correct basic approach (freeze then fine-tune) with data augmentation
  • No Hire: Trains from scratch on 2K images or uses the same LR for all layers

Interview Cheat Sheet

| Concept | Key Fact | Common Mistakes |
| --- | --- | --- |
| Output size formula | $\lfloor (W - K + 2P)/S \rfloor + 1$ | Forgetting the floor or the $+1$ |
| Receptive field | $r_l = r_{l-1} + (k_l - 1) \cdot \prod s_i$ | Not accounting for stride's multiplicative effect |
| 3x3 vs 7x7 | Three 3x3 = 7x7 RF, fewer params, more nonlinearity | Saying they are "the same" without quantifying |
| Skip connections | $\frac{\partial}{\partial x}(F(x) + x) = F'(x) + I$ - gradient cannot vanish | Saying "prevents vanishing gradients" without the math |
| 1x1 convolutions | Channel mixing, bottleneck reduction, no spatial operation | Thinking they are useless because kernel is "too small" |
| Depthwise separable | ~8-9x fewer FLOPs for 3x3 kernels | Confusing depthwise and pointwise steps |
| Global average pooling | Replaces FC layers, no parameters, full translation invariance | Not knowing it exists (many candidates only know max pooling) |
| Transfer learning | Freeze first, differential LR, augment heavily for small data | Using same LR everywhere or training from scratch |
| ResNet bottleneck | 1x1-3x3-1x1 pattern, 17x fewer params than direct 3x3-3x3 | Not knowing why 1x1 convs are needed |
| ConvNeXt | Modernized ResNet with Transformer tricks, matches ViT | Saying "CNNs are obsolete because of Transformers" |

Spaced Repetition Checkpoints

Day 0 - After First Read

  • Write the output size formula from memory and solve 3 examples
  • Draw a residual block and write the gradient equation showing the identity term
  • List the 7 key architectures in order and state each one's primary innovation

Day 3 - First Review

  • Compute the receptive field for a 5-layer CNN with alternating stride-1 and stride-2 layers
  • Explain depthwise separable convolutions and compute the FLOP ratio for 3x3 kernels
  • Compare VGG-16 and ResNet-50 in terms of: depth, parameters, key innovation, performance

Day 7 - Connections Review

  • Explain how receptive field connects to the degradation problem connects to skip connections
  • Explain how 1x1 convolutions are used in: GoogLeNet (bottleneck), ResNet (channel matching), MobileNet (pointwise)
  • Design a CNN architecture for a given task, justifying kernel sizes, strides, and normalization

Day 14 - Interview Simulation

  • Given a feature map shape and a target, design 3 conv layers with correct math
  • Prove the ResNet gradient benefit on a whiteboard in under 5 minutes
  • Walk through a complete transfer learning strategy for a given scenario

Day 21 - Final Calibration

  • Complete all 4 practice problems under time pressure (10 minutes each)
  • Explain why ConvNeXt adopted ideas from Transformers and what they changed
  • Connect CNNs to the broader deep learning picture: how do they relate to attention (Vision Transformers), and when would you choose one over the other?
© 2026 EngineersOfAI. All rights reserved.