Convolutional Neural Networks - From Pixels to Understanding
Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, CV Eng, Research Engineer, Robotics ML
The Real Interview Moment
You are in a Tesla Autopilot MLE on-site. The interviewer draws a 6x6 input feature map on the whiteboard and says: "Apply a 3x3 convolution with stride 2 and padding 1. What is the output size? Now stack 50 of these layers - what is the receptive field? And why would we use a ResNet instead of a plain stack?"
You start computing the output size but second-guess yourself on the padding formula. You get the receptive field calculation half right but mix up the recursive formula. When the interviewer asks about ResNet, you say "skip connections help gradient flow" - correct, but she wants the math: "Show me how the gradient changes with and without the skip connection."
CNN questions in interviews are deceptively layered. They start with simple arithmetic (output size calculation) and escalate to deep architectural reasoning (why ResNet works, what 1x1 convolutions do, why depthwise separable convolutions save computation). This page arms you with both the mechanical skills and the architectural intuition.
What You Will Master
- Compute output dimensions for any conv layer given input size, kernel size, stride, padding, and dilation
- Trace the convolution operation as a sliding dot product with weight sharing
- Calculate receptive fields for deep networks using the recursive formula
- Explain pooling operations (max, average, global) and their purposes
- Narrate the architecture evolution: LeNet to AlexNet to VGG to GoogLeNet to ResNet to EfficientNet to ConvNeXt
- Derive why skip connections enable training of very deep networks (gradient highway argument)
- Explain 1x1 convolutions as pointwise channel mixing and dimensionality reduction
- Analyze depthwise separable convolutions and their computational savings
- Design transfer learning and fine-tuning strategies for new tasks
- Answer CNN architecture questions with both math and intuition
Self-Assessment: Where Are You Now?
| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Explain | 4 -- Can Derive | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Compute conv output size | | | | | | ___ |
| Explain convolution as sliding dot product | | | | | | ___ |
| Calculate receptive field | | | | | | ___ |
| Explain max pooling vs average pooling | | | | | | ___ |
| Trace LeNet to ResNet evolution | | | | | | ___ |
| Derive skip connection gradient benefit | | | | | | ___ |
| Explain 1x1 convolutions | | | | | | ___ |
| Explain depthwise separable convolutions | | | | | | ___ |
| Design transfer learning strategy | | | | | | ___ |
Target: All 4s and 5s before your interview.
Part 1 - The Convolution Operation
What Convolution Does
A 2D convolution slides a small filter (kernel) across an input feature map, computing a dot product at each position. This produces an output feature map (also called an activation map).
Key properties that make convolution powerful for vision:
- Local connectivity: Each output neuron connects to only a small region of the input (the receptive field), not the entire input. This encodes the prior that nearby pixels are more related than distant ones.
- Weight sharing: The same filter is applied at every spatial position. A feature detector learned in one part of the image works everywhere. This dramatically reduces parameters: a 3x3 filter has 9 weights regardless of image size.
- Translation equivariance: If the input shifts, the output shifts by the same amount. A cat detector works regardless of where the cat is in the image.
"A CNN applies learned filters across spatial positions using three key ideas: local connectivity (each neuron sees only a small region), weight sharing (the same filter detects the same feature everywhere), and translation equivariance (features are detected regardless of position). Early layers learn edges and textures, middle layers learn parts (eyes, wheels), and deep layers learn objects. The output size formula is for input size , kernel , padding , stride . Modern CNNs use skip connections (ResNet) to enable training hundreds of layers by providing direct gradient paths."
The Math: 2D Convolution
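For an input feature map $X$ of size $H \times W$ and a kernel $F$ of size $K \times K$ (stride 1, no padding):
$$Y[i, j] = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} X[i + m,\, j + n] \, F[m, n]$$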
Technically, this is cross-correlation, not convolution (which would flip the kernel). In deep learning, we always mean cross-correlation when we say "convolution" - the distinction does not matter because the kernels are learned.
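The sliding dot product can be sketched in a few lines of plain Python. This is a minimal single-channel illustration, not how a framework implements it; the function name `conv2d`, the toy `image`, and the vertical-edge `kernel` are assumptions made up for this example.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Naive single-channel 2D cross-correlation on lists of lists."""
    k = len(kernel)  # assumes a square k x k kernel
    # Zero-pad the input on all four sides.
    width = len(image[0]) + 2 * padding
    padded = [[0] * width for _ in range(padding)]
    for row in image:
        padded.append([0] * padding + list(row) + [0] * padding)
    padded += [[0] * width for _ in range(padding)]

    out_h = (len(padded) - k) // stride + 1
    out_w = (width - k) // stride + 1
    output = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            # Dot product between the kernel and the window anchored at (i, j).
            acc = 0
            for m in range(k):
                for n in range(k):
                    acc += padded[i * stride + m][j * stride + n] * kernel[m][n]
            out_row.append(acc)
        output.append(out_row)
    return output


# Hypothetical 4x4 input and a vertical-edge kernel (no flipping: cross-correlation).
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
result = conv2d(image, kernel)
print(result)  # [[-2, -2], [2, -2]]
```

The same nine kernel weights are reused at every window position, which is the weight sharing described earlier.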
Multi-Channel Convolution
In practice, inputs have $C_{in}$ channels (e.g., 3 for RGB) and we want $C_{out}$ output channels:
- Each filter has shape $C_{in} \times K \times K$
- We have $C_{out}$ such filters
- Total weight shape: $C_{out} \times C_{in} \times K \times K$
- Each filter produces one output channel by summing over all input channels
Parameter count: $C_{out} \cdot (C_{in} \cdot K^2 + 1)$ (including bias)
Example: A conv layer with 64 input channels, 128 output channels, and 3x3 kernels has $128 \cdot (64 \cdot 9 + 1) = 73{,}856$ parameters.
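As a quick check of the parameter arithmetic, here is a small helper. The name `conv_param_count` is made up for this sketch, and it assumes square kernels and one bias per output channel.

```python
def conv_param_count(c_in, c_out, kernel_size, bias=True):
    """Parameters of a conv layer with square kernels."""
    weights = c_out * c_in * kernel_size * kernel_size
    return weights + (c_out if bias else 0)


print(conv_param_count(64, 128, 3))              # 73856
print(conv_param_count(64, 128, 3, bias=False))  # 73728
```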
Part 2 - Output Size, Stride, Padding, and Dilation
The Output Size Formula
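This is the most frequently tested calculation in CNN interviews.
$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$
where:
- $W$ = input spatial dimension (height or width)
- $K$ = kernel size
- $P$ = padding (zeros added to each side)
- $S$ = stride (step size of the sliding window)
For dilated convolutions:
$$O = \left\lfloor \frac{W + 2P - D(K - 1) - 1}{S} \right\rfloor + 1$$
where $D$ = dilation rate and $D(K - 1) + 1$ is the effective kernel size.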
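The formula translates directly into code. The helper below is a sketch under the usual assumptions (symmetric zero padding, the same settings along the dimension being computed); the name `conv_output_size` is illustrative only.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a conv or pooling layer."""
    effective_kernel = dilation * (kernel - 1) + 1
    # Integer floor division implements the floor in the formula.
    return (size + 2 * padding - effective_kernel) // stride + 1


print(conv_output_size(6, 3, stride=2, padding=1))     # 3
print(conv_output_size(224, 7, stride=2, padding=3))   # 112
print(conv_output_size(56, 3, padding=2, dilation=2))  # 56
```

The first call is the whiteboard question from the opening scenario: a 6x6 input with a 3x3 kernel, stride 2, and padding 1 gives a 3x3 output.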
Common Configurations
| Configuration | Kernel | Stride | Padding | Effect on Size |
|---|---|---|---|---|
| Standard | 3x3 | 1 | 0 | Shrinks by 2 (each side loses 1) |
| Same padding | 3x3 | 1 | 1 | Preserves spatial size |
| Downsampling | 3x3 | 2 | 1 | Halves spatial size |
| Aggressive downsample | 7x7 | 2 | 3 | Roughly halves (used in ResNet stem) |
| Pointwise (1x1) | 1x1 | 1 | 0 | Changes channels only |
| Dilated | 3x3, dilation=2 | 1 | 2 | Preserves size, larger receptive field |
Worked Examples
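Example 1: Input 32x32, kernel 5x5, stride 1, padding 0. $\lfloor (32 - 5 + 0)/1 \rfloor + 1 = 28$, so the output is 28x28.
Example 2: Input 224x224, kernel 7x7, stride 2, padding 3. $\lfloor (224 - 7 + 6)/2 \rfloor + 1 = \lfloor 111.5 \rfloor + 1 = 112$, so the output is 112x112.
Example 3: Input 56x56, kernel 3x3, stride 2, padding 1. $\lfloor (56 - 3 + 2)/2 \rfloor + 1 = \lfloor 27.5 \rfloor + 1 = 28$, so the output is 28x28.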
The floor operation matters when the division is not exact. Input 8x8, kernel 3x3, stride 2, padding 0: $\lfloor (8 - 3)/2 \rfloor + 1 = \lfloor 2.5 \rfloor + 1 = 3$, not 3.5. Some candidates forget the floor and get wrong answers. Also remember that the formula applies independently to height and width - they do not have to be equal.
Padding Types
| Padding | Formula | When Used |
|---|---|---|
| Valid (no padding) | $P = 0$ | When spatial shrinkage is acceptable |
| Same | $P = (K - 1)/2$ (for stride 1, odd $K$) | Preserve spatial dimensions |
| Full | $P = K - 1$ | Transposed convolutions, signal processing |
| Causal | Pad only one side | 1D convolutions for time series (no future leakage) |
Dilation (Atrous Convolution)
Dilation inserts gaps between kernel elements, enlarging the effective receptive field without adding parameters or reducing resolution.
A 3x3 kernel with dilation 2 has the same 9 parameters but covers a 5x5 effective area (with gaps). With dilation 4, it covers 9x9.
Use cases: Semantic segmentation (DeepLab), where you need large receptive fields at full resolution.
Part 3 - Receptive Field
What Is the Receptive Field?
The receptive field of a neuron is the region of the original input that can influence that neuron's value. It is determined by the cumulative effect of all preceding conv and pooling layers.
Recursive Receptive Field Formula
For layer $l$ with kernel size $K_l$ and stride $S_l$:
$$RF_l = RF_{l-1} + (K_l - 1) \cdot \prod_{i=1}^{l-1} S_i$$
where $RF_0 = 1$ (a single pixel).
The key insight: stride in early layers has a multiplicative effect on receptive field growth. This is why architectures like ResNet use a stride-2 conv in the first layer - it doubles the receptive field contribution of every subsequent layer.
Worked Example: Simple 3-Layer CNN
| Layer | Kernel | Stride | Receptive Field |
|---|---|---|---|
| Input | - | - | $RF_0 = 1$ |
| Conv1 | 3x3 | 1 | $1 + (3 - 1) \cdot 1 = 3$ |
| Conv2 | 3x3 | 1 | $3 + (3 - 1) \cdot 1 = 5$ |
| Conv3 | 3x3 | 1 | $5 + (3 - 1) \cdot 1 = 7$ |
Three 3x3 conv layers with stride 1 give a 7x7 receptive field.
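The recursion is easy to mechanize. The sketch below tracks the running product of strides (often called the "jump") alongside the receptive field; the function name and the second example stack are assumptions chosen for illustration.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered input to output."""
    rf = 1    # RF_0: a single input pixel
    jump = 1  # product of the strides of all earlier layers
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# A stem-like stack: 7x7 stride-2 conv, 3x3 stride-2 pool, 3x3 stride-1 conv.
print(receptive_field([(7, 2), (3, 2), (3, 1)]))  # 19
```

The second call shows the multiplicative effect of early strides: the final 3x3 layer adds 8 pixels of receptive field instead of 2 because the two earlier layers each have stride 2.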
Why Three 3x3 Convs Instead of One 7x7?
This is a classic interview question (VGGNet insight).
| Approach | Parameters | Receptive Field | Nonlinearities |
|---|---|---|---|
| One 7x7 conv (64 channels) | $7 \cdot 7 \cdot 64 \cdot 64 = 200{,}704$ | 7x7 | 1 |
| Three 3x3 convs (64 channels) | $3 \cdot (3 \cdot 3 \cdot 64 \cdot 64) = 110{,}592$ | 7x7 | 3 |
Three 3x3 convs have 45% fewer parameters and 3x more nonlinearity for the same receptive field. The extra nonlinear layers make the function more expressive. This is why VGG exclusively uses 3x3 convolutions.
"The receptive field question separates candidates who understand CNN architecture from those who just use pretrained models. I ask: 'Your model fails to detect large objects. Why?' A strong candidate immediately thinks about receptive field - if the receptive field is smaller than the object, the network literally cannot see the whole object in any single neuron. Solutions: add more layers, use dilated convolutions, use larger strides, or add a global average pooling layer."
Part 4 - Pooling Operations
Max Pooling
Takes the maximum value in each window. With a 2x2 window and stride 2, it halves the spatial dimensions.
Properties:
- Provides a small amount of translation invariance
- Selects the strongest activation (most prominent feature)
- No learnable parameters
- Discards spatial information (location within the window)
Average Pooling
Takes the mean value in each window.
Properties:
- Smoother than max pooling
- Preserves more spatial information
- Used less frequently than max pooling in classification architectures
Global Average Pooling (GAP)
Averages each entire feature map into a single number. For a $C \times H \times W$ feature map, produces a $C$-dimensional vector.
Properties:
- Replaces fully connected layers at the end of classification CNNs (GoogLeNet, ResNet)
- No parameters - eliminates the FC layer parameters
- Acts as a structural regularizer
- Provides complete translation invariance
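The three pooling variants above can be compared on a toy feature map. This is a plain-Python sketch without padding; `pool2d`, `global_average_pool`, and the 4x4 `fmap` values are made up for the example.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over one channel (no padding)."""
    reducer = max if mode == "max" else (lambda w: sum(w) / len(w))
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            ]
            row.append(reducer(window))
        output.append(row)
    return output


def global_average_pool(channels):
    """channels: list of C feature maps (each H x W) -> list of C means."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in channels
    ]


fmap = [
    [1, 3, 2, 0],
    [5, 7, 4, 8],
    [0, 2, 6, 6],
    [2, 4, 6, 6],
]
print(pool2d(fmap, mode="max"))     # [[7, 8], [4, 6]]
print(pool2d(fmap, mode="avg"))     # [[4.0, 3.5], [2.0, 6.0]]
print(global_average_pool([fmap]))  # [3.875]
```

Neither function has learnable parameters, and global average pooling collapses each channel to one number regardless of its spatial size.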
Strided Convolution vs Pooling
Modern architectures often replace pooling with strided convolutions (stride 2):
| Approach | Parameters | Learns what to discard? | Used in |
|---|---|---|---|
| Max pooling | 0 | No (fixed max operation) | VGG, older ResNets |
| Strided convolution | $K^2 \cdot C_{in} \cdot C_{out}$ (plus bias) | Yes (learned downsampling) | ResNet-D, ConvNeXt |
Strided convolutions are now preferred because they allow the network to learn an optimal downsampling strategy rather than using a fixed max operation.
Part 5 - Architecture Evolution
This is one of the most frequently tested topics in CNN interviews. You must know the key innovation of each architecture and why it mattered.
LeNet-5 (LeCun et al., 1998)
- Innovation: Demonstrated that CNNs can learn useful features from raw pixels
- Architecture: 2 conv layers (5x5), 2 subsampling layers, 3 FC layers
- Parameters: ~60,000
- Task: Handwritten digit recognition (MNIST)
- Impact: Proved the concept but limited by hardware
AlexNet (Krizhevsky et al., 2012)
- Innovation: Won ImageNet by a massive margin, launching the deep learning revolution
- Key ideas: ReLU activation (not tanh/sigmoid), dropout regularization, GPU training, data augmentation, local response normalization
- Architecture: 5 conv layers, 3 FC layers
- Parameters: ~60 million
- Impact: Proved that deep learning works at scale for vision
VGGNet (Simonyan & Zisserman, 2014)
- Innovation: Showed that depth matters - use only 3x3 convolutions stacked deeply
- Key insight: Three 3x3 convs = one 7x7 conv in receptive field, but with fewer parameters and more nonlinearity
- Architecture: 16 or 19 layers, all 3x3 convs
- Parameters: ~138 million (huge FC layers)
- Limitation: Very expensive, no skip connections, training is difficult beyond 19 layers
GoogLeNet / Inception (Szegedy et al., 2014)
- Innovation: Process at multiple scales simultaneously with the Inception module
- Key idea: Each Inception module applies 1x1, 3x3, 5x5 convs and max pooling in parallel, then concatenates
- 1x1 convs for dimensionality reduction: Before the expensive 3x3 and 5x5 convs, a 1x1 conv reduces channels (the "bottleneck")
- Parameters: ~6.8 million (roughly 20x fewer than VGG's ~138 million through bottleneck design; the paper itself reports 12x fewer than AlexNet)
- Impact: Showed that architecture engineering (not just depth) matters
ResNet (He et al., 2015)
- Innovation: Skip connections enable training of networks with 152+ layers
- The problem it solved: Plain networks degrade (not overfit - degrade) beyond ~20 layers. Adding more layers makes training loss worse.
- The solution: Instead of learning the target mapping $H(x)$ directly, learn the residual $F(x) = H(x) - x$, so the layer computes $H(x) = F(x) + x$
- Parameters: ~25M (ResNet-50) to ~60M (ResNet-152)
- Impact: The single most important architecture innovation in CNNs
EfficientNet (Tan & Le, 2019)
- Innovation: Compound scaling - scale depth, width, and resolution together with a principled formula
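- Key idea: Previous work scaled networks in one dimension (deeper OR wider OR higher resolution). EfficientNet scales all three simultaneously with compound coefficients: depth $d = \alpha^{\phi}$, width $w = \beta^{\phi}$, resolution $r = \gamma^{\phi}$ with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$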
- Building block: MBConv (mobile inverted bottleneck) with depthwise separable convolutions and squeeze-and-excitation
- Impact: State-of-the-art efficiency - much better accuracy/FLOPs tradeoff
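To make the compound scaling rule concrete, the sketch below plugs in the base coefficients reported in the EfficientNet paper ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$). The function names are hypothetical, and the FLOPs estimate assumes cost grows linearly with depth and quadratically with width and resolution, as the constraint implies.

```python
# Base coefficients as reported in the EfficientNet paper (found by grid search there).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Depth, width, and resolution multipliers for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi):
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    s = compound_scale(phi)
    return s["depth"] * s["width"] ** 2 * s["resolution"] ** 2


for phi in range(4):
    print(phi, round(flops_multiplier(phi), 2))
```

Because $\alpha \cdot \beta^2 \cdot \gamma^2$ is close to 2, each increment of $\phi$ roughly doubles the FLOPs budget.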
ConvNeXt (Liu et al., 2022)
- Innovation: "A ConvNet for the 2020s" - modernized ResNet to match Vision Transformer performance
- Key changes from ResNet:
  - Patchify stem (4x4 stride-4 conv, like ViT)
  - Larger kernels (7x7 depthwise conv, like Transformer attention windows)
  - GELU activation (from Transformers)
  - LayerNorm instead of BatchNorm (from Transformers)
  - Inverted bottleneck (expand then contract, from MobileNet)
  - Fewer activation functions (only one per block)
- Impact: Proved that CNNs are not inherently inferior to Transformers - the architecture details matter
Part 6 - Skip Connections and Why ResNet Works
The Degradation Problem
Plain deep networks exhibit a surprising failure: deeper networks have higher training error than shallower ones. This is not overfitting (which would show low training error but high test error). This is an optimization failure - the optimizer cannot find a good solution.
If a 20-layer network achieves loss $L_{20}$, a 56-layer network should achieve at most $L_{20}$ (it could just learn identity for the extra 36 layers). But in practice, the 56-layer network does worse. Why?
The gradient signal degrades over many layers (not just vanishing - the gradient direction becomes increasingly noisy), making it nearly impossible for early layers to learn useful features.
The Skip Connection Solution
A residual block computes:
$$y = F(x) + x$$
where $F(x)$ is the residual function (typically two conv layers with BN and ReLU).
Mathematical Proof: Gradient Highways
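The gradient of the loss $L$ w.r.t. the input $x$ of a residual block $y = F(x) + x$:
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \left( \frac{\partial F}{\partial x} + I \right)$$
The gradient has two components:
- $\frac{\partial L}{\partial y} \frac{\partial F}{\partial x}$: gradient through the conv layers (may vanish)
- $\frac{\partial L}{\partial y} \cdot I$: direct gradient through the skip connection (cannot vanish!)
For a network with $N$ residual blocks, the gradient from block $N$ to block 1 always has a direct path with factor $I$. Even if all the residual functions have tiny gradients, the skip connections ensure the gradient reaches every layer.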
Stacking Residual Blocks
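For $N$ stacked residual blocks $x_l = x_{l-1} + F_l(x_{l-1})$, $l = 1, \dots, N$:
$$\frac{\partial x_N}{\partial x_0} = \prod_{l=1}^{N} \left( I + \frac{\partial F_l}{\partial x_{l-1}} \right)$$
Expanding this product creates $2^N$ terms, each representing a different path through the network. Critically, one of these paths is the all-identity path $I \cdot I \cdots I = I$, which preserves the gradient magnitude exactly. The other paths provide additional gradient information.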
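A scalar toy version makes the path expansion tangible. In the sketch below each block's Jacobian is replaced by a single number, which is a strong simplification (real Jacobians are matrices and can have negative directions); the function names and the value 0.1 are assumptions for illustration.

```python
from itertools import combinations
from math import prod


def plain_gradient(jacobians):
    """Gradient factor through a plain stack: product of per-layer terms."""
    return prod(jacobians)


def residual_gradient(jacobians):
    """Gradient factor through residual blocks: product of (1 + j_l)."""
    return prod(1.0 + j for j in jacobians)


def residual_gradient_by_paths(jacobians):
    """Same quantity as a sum over all 2^N paths.

    Each path takes either the skip (factor 1) or the residual branch
    (factor j_l) at every block.
    """
    total = 0.0
    for r in range(len(jacobians) + 1):
        for subset in combinations(jacobians, r):
            total += prod(subset)  # the empty subset is the all-identity path
    return total


jacobians = [0.1] * 10  # ten blocks whose residual branches all shrink the signal
print(plain_gradient(jacobians))              # about 1e-10: effectively vanished
print(residual_gradient(jacobians))           # about 2.59
print(residual_gradient_by_paths(jacobians))  # same value, summed over 1024 paths
```

The all-identity path contributes exactly 1 to the sum, so in this toy setting the residual factor cannot drop toward zero the way the plain product does.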
Bottleneck Residual Block
For deeper ResNets (50, 101, 152 layers), a bottleneck design reduces computation:
- 1x1 conv: Reduce channels (e.g., 256 to 64) - the "bottleneck"
- 3x3 conv: Spatial processing at reduced channel count
- 1x1 conv: Restore channels (e.g., 64 to 256)
This 1x1-3x3-1x1 pattern has far fewer parameters than two 3x3 convs at the full channel width.
Parameter comparison (256 channels):
| Design | Parameters |
|---|---|
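| Two 3x3 convs at 256 channels | $2 \cdot (3 \cdot 3 \cdot 256 \cdot 256) = 1{,}179{,}648$ |
| Bottleneck (256-64-64-256) | $256 \cdot 64 + 3 \cdot 3 \cdot 64 \cdot 64 + 64 \cdot 256 = 69{,}632$ |
The bottleneck has about 17x fewer parameters (biases omitted). Note that it contains only one 3x3 conv, so a single block covers a 3x3 receptive field rather than the 5x5 of two stacked 3x3 convs; the savings are what make the extra depth affordable.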
Do NOT say "ResNet works because it prevents vanishing gradients." This is partially true but incomplete. The deeper insight is that skip connections transform the optimization landscape - they make the loss surface smoother (Li et al., 2018 visualized this). Without skip connections, the loss surface has many sharp minima and saddle points that trap the optimizer. With skip connections, the landscape becomes more convex-like. If the interviewer asks "why not just use better optimization?" you need this answer.
Part 7 - 1x1 Convolutions
What 1x1 Convolutions Do
A 1x1 convolution with $C_{out}$ filters operates only along the channel dimension. At each spatial position, it computes a linear combination of the $C_{in}$ input channels to produce $C_{out}$ output channels.
It is equivalent to applying a fully connected layer independently at every spatial position (hence also called pointwise convolution or network in network).
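The "fully connected layer at every pixel" view can be written out directly. This is a plain-Python sketch on nested lists; `conv1x1` and the toy tensors `x` and `w` are made up for the example.

```python
def conv1x1(feature_map, weights, bias=None):
    """feature_map: C_in x H x W nested lists; weights: C_out x C_in."""
    c_in = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    c_out = len(weights)
    bias = bias or [0.0] * c_out
    output = [[[0.0] * width for _ in range(height)] for _ in range(c_out)]
    for o in range(c_out):
        for y in range(height):
            for x in range(width):
                # Same linear map at every pixel: mixes channels, never neighbors.
                output[o][y][x] = bias[o] + sum(
                    weights[o][c] * feature_map[c][y][x] for c in range(c_in)
                )
    return output


# Assumed toy input: 3 channels, 2x2 spatial, reduced to 2 channels.
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
w = [
    [1.0, 0.0, 0.0],   # output channel 0 copies input channel 0
    [0.5, 1.0, -1.0],  # output channel 1 mixes all three channels
]
y = conv1x1(x, w)
print(y[0])  # [[1.0, 2.0], [3.0, 4.0]]
print(y[1])  # [[-1.5, 0.0], [-0.5, 1.0]]
```

The spatial size is unchanged (2x2 in, 2x2 out); only the channel count moves from 3 to 2.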
Three Uses of 1x1 Convolutions
1. Dimensionality reduction (bottleneck): Reduce channels before expensive operations. GoogLeNet places 1x1 convs before its 3x3 and 5x5 convs; reducing, for example, 256 channels to 64 before a 5x5 conv cuts that conv's cost by about 4x.
2. Dimensionality expansion: Increase channels. In inverted bottlenecks (MobileNetV2, EfficientNet), 1x1 expands channels before depthwise conv.
3. Channel mixing: Learn cross-channel interactions without spatial operations. This is what ResNet's bottleneck does.
Parameter count for 1x1 conv: $C_{in} \cdot C_{out}$ (plus $C_{out}$ biases) - no spatial kernel parameters.
Part 8 - Depthwise Separable Convolutions
Standard Convolution Cost
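For input $H \times W \times C_{in}$, a standard $K \times K$ convolution to $C_{out}$ channels (stride 1, same padding):
- Parameters: $K^2 \cdot C_{in} \cdot C_{out}$
- FLOPs: $H \cdot W \cdot K^2 \cdot C_{in} \cdot C_{out}$ (counting each multiply-accumulate as one FLOP)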
Depthwise Separable Convolution
Splits the standard convolution into two steps:
Step 1 - Depthwise convolution: Apply one $K \times K$ filter per input channel independently.
- Parameters: $K^2 \cdot C_{in}$
- Each channel is filtered separately (no cross-channel mixing)
Step 2 - Pointwise convolution: Apply a 1x1 convolution to mix channels.
- Parameters: $C_{in} \cdot C_{out}$
- Performs all cross-channel interaction
Computational Savings
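With $C_{in}$ input channels, $C_{out}$ output channels, and a $K \times K$ kernel, the cost ratio of depthwise separable to standard convolution is:
$$\frac{K^2 C_{in} + C_{in} C_{out}}{K^2 C_{in} C_{out}} = \frac{1}{C_{out}} + \frac{1}{K^2}$$
For typical values (e.g., $K = 3$, $C_{out} = 256$): $\frac{1}{256} + \frac{1}{9} \approx 0.115$, i.e. roughly 8.7x cheaper.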
Depthwise separable convolutions use roughly 8-9x fewer FLOPs and 8-9x fewer parameters than standard convolutions.
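The savings can be checked by counting both ways: directly from the layer shapes and from the closed-form ratio. The helper names and the 28x28, 128-to-256-channel layer are assumptions for this sketch; biases are ignored and one multiply-accumulate counts as one FLOP.

```python
def standard_conv_cost(h, w, c_in, c_out, k):
    """(parameters, multiply-accumulates) for a standard k x k conv."""
    params = k * k * c_in * c_out
    return params, h * w * params


def depthwise_separable_cost(h, w, c_in, c_out, k):
    """(parameters, multiply-accumulates) for depthwise + pointwise convs."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    params = depthwise + pointwise
    return params, h * w * params


def savings_ratio(c_out, k):
    """Closed-form cost ratio: 1 / C_out + 1 / K^2."""
    return 1 / c_out + 1 / (k * k)


std_params, std_macs = standard_conv_cost(28, 28, 128, 256, 3)
dsc_params, dsc_macs = depthwise_separable_cost(28, 28, 128, 256, 3)
print(std_params, dsc_params)           # 294912 33920
print(round(std_macs / dsc_macs, 2))    # 8.69
print(round(savings_ratio(256, 3), 4))  # 0.115
```

The ratio is the same for parameters and FLOPs because both scale with the same spatial factor $H \cdot W$ under stride 1 and same padding.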
Where They Are Used
| Architecture | How DSC Is Used |
|---|---|
| MobileNet (Howard et al., 2017) | All convolutions are depthwise separable |
| Xception (Chollet, 2017) | Replaces all Inception module convolutions |
| EfficientNet (Tan & Le, 2019) | MBConv blocks use depthwise separable convs |
| ConvNeXt (Liu et al., 2022) | Uses 7x7 depthwise convs followed by 1x1 pointwise layers in an inverted bottleneck |
Part 9 - Transfer Learning and Fine-Tuning
Why Transfer Learning Works
CNNs learn hierarchical features:
- Early layers (1-3): Low-level features - edges, textures, colors. These are universal across tasks.
- Middle layers (4-8): Mid-level features - corners, contours, patterns. Somewhat task-specific.
- Late layers (9+): High-level features - object parts, scenes. Highly task-specific.
Transfer learning works because early and middle layer features are useful across very different tasks (ImageNet features work for medical images, satellite images, etc.).
Fine-Tuning Strategies
Practical Fine-Tuning Recipe
- Replace the classification head: Remove the final FC layer, add a new one matching your number of classes
- Freeze backbone initially: Train only the new head for 5-10 epochs
- Unfreeze gradually: Start unfreezing from the last layer backward
- Use differential learning rates: Early layers get 10x-100x smaller LR than the head
- Use smaller overall LR: Start one to two orders of magnitude below the rate you would use when training from scratch
- Data augmentation: Critical when fine-tuning dataset is small
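The freeze-then-unfreeze schedule with differential learning rates can be sketched as framework-agnostic configuration. Everything here is hypothetical: the helper `build_param_groups`, the stage names, the head learning rate of 1e-3, and the per-stage decay of 0.5 are illustrative assumptions, not values from any particular library or recipe.

```python
def build_param_groups(stage_names, head_lr=1e-3, decay=0.5, frozen=()):
    """Return optimizer-style param-group dicts, one per stage.

    stage_names is ordered from input to output; the last entry is the head.
    Each step back from the head multiplies the learning rate by `decay`.
    """
    groups = []
    depth = len(stage_names)
    for index, name in enumerate(stage_names):
        steps_from_head = depth - 1 - index
        groups.append({
            "name": name,
            "lr": head_lr * decay ** steps_from_head,
            "trainable": name not in frozen,
        })
    return groups


# Hypothetical stage names for a ResNet-like backbone plus a new head.
stages = ["stem", "stage1", "stage2", "stage3", "stage4", "head"]

# Phase 1: freeze the whole backbone, train only the head.
phase1 = build_param_groups(stages, frozen=stages[:-1])
# Phase 2: unfreeze from the last stage backward.
phase2 = build_param_groups(stages, frozen=stages[:-2])

for group in phase2:
    print(group["name"], group["lr"], group["trainable"])
```

With these assumed settings the stem ends up with a learning rate 32x smaller than the head, inside the 10x-100x range suggested above. In a real framework each dict would also carry the corresponding parameters.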
Common Fine-Tuning Mistakes
| Mistake | Why It Fails | Fix |
|---|---|---|
| Using the same LR everywhere | Early layers overfit quickly, losing universal features | Differential LR: head 10x, middle 5x, early 1x |
| Not freezing initially | Random head weights send garbage gradients to backbone | Freeze backbone, train head first |
| Training too long | Small datasets cause overfitting quickly | Early stopping, strong augmentation |
| Wrong input normalization | Pretrained model expects ImageNet normalization | Always use mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] for ImageNet-pretrained models |
| Resizing inputs incorrectly | Models trained at 224x224 lose features at other sizes | Resize to the pretrained resolution or use multi-scale training |
At Google/Meta, transfer learning questions focus on architecture design: "How would you modify a ResNet-50 for a 3-channel radar image with 500x500 resolution?" At startups, the questions are more practical: "You have 500 labeled images. Walk me through your transfer learning pipeline." At Apple, expect questions about efficient fine-tuning for on-device models.
Practice Problems
Problem 1: Output Size Calculation
A CNN has the following layers applied to a 224x224x3 input:
- Conv: 64 filters, 7x7, stride 2, padding 3
- Max pool: 3x3, stride 2, padding 1
- Conv: 128 filters, 3x3, stride 1, padding 1
- Conv: 128 filters, 3x3, stride 2, padding 1
Compute the spatial size after each layer.
Hint 1 - Direction
Apply the formula $O = \lfloor (W - K + 2P)/S \rfloor + 1$ at each layer. Be careful with the floor operation.
Hint 2 - Insight
Layer 1: $\lfloor (224 - 7 + 6)/2 \rfloor + 1 = 112$. Stride 2 halves the size. Continue for each layer.
Hint 3 - Full Solution + Rubric
| Layer | Input Size | Formula | Output Size |
|---|---|---|---|
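| Conv1 | 224x224 | $\lfloor (224 - 7 + 6)/2 \rfloor + 1$ | 112x112 |
| MaxPool | 112x112 | $\lfloor (112 - 3 + 2)/2 \rfloor + 1$ | 56x56 |
| Conv2 | 56x56 | $\lfloor (56 - 3 + 2)/1 \rfloor + 1$ | 56x56 |
| Conv3 | 56x56 | $\lfloor (56 - 3 + 2)/2 \rfloor + 1$ | 28x28 |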
Final output: 28x28x128.
Total parameter count:
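- Conv1: $7 \cdot 7 \cdot 3 \cdot 64 + 64 = 9{,}472$
- Conv2: $3 \cdot 3 \cdot 64 \cdot 128 + 128 = 73{,}856$
- Conv3: $3 \cdot 3 \cdot 128 \cdot 128 + 128 = 147{,}584$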
- Total: 230,912
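The whole solution can be reproduced by tracing shapes and parameters layer by layer. This is a sketch; the tuple layout and the helper names `out_size` and `trace` are assumptions for this example, and conv layers are assumed to have one bias per output channel.

```python
def out_size(size, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kind, out_channels, kernel, stride, padding)
layers = [
    ("Conv1", "conv", 64, 7, 2, 3),
    ("MaxPool", "pool", None, 3, 2, 1),
    ("Conv2", "conv", 128, 3, 1, 1),
    ("Conv3", "conv", 128, 3, 2, 1),
]


def trace(input_size, input_channels, layers):
    """Return per-layer (name, size, channels, params) rows and the total params."""
    size, channels, total_params = input_size, input_channels, 0
    rows = []
    for name, kind, out_channels, kernel, stride, padding in layers:
        size = out_size(size, kernel, stride, padding)
        params = 0
        if kind == "conv":
            params = kernel * kernel * channels * out_channels + out_channels
            channels = out_channels
        total_params += params  # pooling contributes no parameters
        rows.append((name, size, channels, params))
    return rows, total_params


rows, total = trace(224, 3, layers)
for row in rows:
    print(row)
print(total)  # 230912
```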
Scoring Rubric:
- Strong Hire: All sizes correct, computes parameter counts, notes this resembles ResNet stem
- Lean Hire: Sizes correct but needs to carefully think through the formula
- No Hire: Makes errors in the stride-2 calculations or forgets the +1
Problem 2: ResNet Skip Connection Gradient
Prove mathematically that skip connections prevent vanishing gradients. Specifically, for a network with $N$ residual blocks, show that the gradient from the last block to the first always has a term with magnitude 1.
Hint 1 - Direction
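Write the output of a residual block: $x_l = x_{l-1} + F_l(x_{l-1})$. Apply the chain rule repeatedly.
Hint 2 - Insight
$\frac{\partial x_l}{\partial x_{l-1}} = I + \frac{\partial F_l}{\partial x_{l-1}}$. The product of these terms over $N$ blocks, when expanded, contains the term $I$.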
Hint 3 - Full Solution + Rubric
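For residual block $l$:
$$x_l = x_{l-1} + F_l(x_{l-1})$$
The gradient:
$$\frac{\partial x_N}{\partial x_0} = \prod_{l=1}^{N} \left( I + \frac{\partial F_l}{\partial x_{l-1}} \right)$$
Expanding this product:
$$\frac{\partial x_N}{\partial x_0} = I + \sum_{l=1}^{N} J_l + \sum_{l > m} J_l J_m + \cdots + J_N J_{N-1} \cdots J_1$$
where $J_l = \frac{\partial F_l}{\partial x_{l-1}}$.
The first term is $I$ - the identity. This means:
$$\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_N} \cdot \frac{\partial x_N}{\partial x_0} = \frac{\partial L}{\partial x_N} + \frac{\partial L}{\partial x_N} \left( \sum_{l=1}^{N} J_l + \cdots \right)$$
The gradient of the loss w.r.t. $x_0$ always contains the term $\frac{\partial L}{\partial x_N}$ with no attenuation. No matter how many layers there are, the gradient from the last layer reaches the first layer with full magnitude through the skip connections.
Without skip connections: $\frac{\partial x_N}{\partial x_0} = J_N J_{N-1} \cdots J_1$, which vanishes exponentially if $\lVert J_l \rVert < 1$.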
Scoring Rubric:
- Strong Hire: Complete derivation, expands the product to show the $I$ term, contrasts with the non-skip case, mentions the $2^N$ paths interpretation
- Lean Hire: Correctly derives $I + \frac{\partial F}{\partial x}$ for one block and intuits the result for $N$ blocks
- No Hire: Says "skip connections help gradient flow" without any mathematical argument
Problem 3: Depthwise Separable Computation
You need to process a 56x56x256 feature map with a 3x3 convolution producing 512 output channels. Compare the FLOPs for a standard convolution vs a depthwise separable convolution.
Hint 1 - Direction
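Standard conv FLOPs: $H \cdot W \cdot K^2 \cdot C_{in} \cdot C_{out}$. Depthwise: compute each step separately.
Hint 2 - Insight
Assume stride 1, same padding, so output is 56x56. Standard: $56 \cdot 56 \cdot 9 \cdot 256 \cdot 512$. Depthwise separable: $56 \cdot 56 \cdot 9 \cdot 256$ for depthwise + $56 \cdot 56 \cdot 256 \cdot 512$ for pointwise.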
Hint 3 - Full Solution + Rubric
Assume stride 1, padding 1 (same), output size 56x56.
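Standard convolution: FLOPs = $56 \cdot 56 \cdot 3 \cdot 3 \cdot 256 \cdot 512 = 3{,}699{,}376{,}128 \approx 3.70$ GFLOPs
Depthwise separable convolution:
- Depthwise: $56 \cdot 56 \cdot 3 \cdot 3 \cdot 256 = 7{,}225{,}344 \approx 7.2$ MFLOPs
- Pointwise: $56 \cdot 56 \cdot 256 \cdot 512 = 411{,}041{,}792 \approx 411.0$ MFLOPs
- Total: $418{,}267{,}136 \approx 418.3$ MFLOPs
Speedup: $3699.4 / 418.3 \approx 8.8\times$
This matches the theoretical ratio: $\frac{1}{C_{out}} + \frac{1}{K^2} = \frac{1}{512} + \frac{1}{9} \approx 0.113$, so $1 / 0.113 \approx 8.8\times$.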
Scoring Rubric:
- Strong Hire: Computes both correctly, derives the speedup, states the general ratio formula, mentions that actual wall-clock speedup may differ due to memory access patterns
- Lean Hire: Computes both correctly and notes the large savings
- No Hire: Cannot set up the FLOP calculation or confuses depthwise and pointwise steps
Problem 4: Transfer Learning Strategy
You have 2,000 labeled X-ray images (4 disease classes) and want to build a classifier. You have a ResNet-50 pretrained on ImageNet. Design your transfer learning strategy and justify each decision.
Hint 1 - Direction
2,000 images is a small dataset. Think about overfitting risk. Medical images are somewhat different from ImageNet but still share low-level features (edges, textures).
Hint 2 - Insight
Strategy: Replace the classification head (1000 to 4 classes), freeze backbone initially, then progressively unfreeze. Use strong data augmentation. Consider differential learning rates.
Hint 3 - Full Solution + Rubric
Step 1 - Modify architecture:
- Replace final FC layer (1000 classes) with new FC layer (4 classes)
- Consider adding a hidden layer (e.g., 512 units + ReLU + dropout 0.5) before the final classifier for more capacity
Step 2 - Phase 1: Feature extraction (10 epochs):
- Freeze all ResNet backbone weights
- Train only the new head, with a learning rate tuned on the validation set
- Use SGD with momentum or Adam
- This establishes reasonable head weights without disturbing learned features
Step 3 - Phase 2: Fine-tune last stage (20 epochs):
- Unfreeze ResNet stage 4 (last residual blocks)
- Use differential LR: set the backbone LR roughly 10x below the head LR
- This adapts high-level features to the medical domain
Step 4 - Phase 3: Fine-tune all (optional, 10 epochs):
- Unfreeze everything
- Use a very small backbone LR, roughly 100x below the head LR
- Only do this if validation accuracy is still improving
Data augmentation (critical with 2K images):
- Random horizontal flip, rotation (up to 15 degrees), color jitter
- Random crop with resize back to 224x224
- Mixup or CutMix for regularization
- Test-time augmentation for final predictions
Additional considerations:
- Use ImageNet normalization: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
- X-rays are grayscale - replicate to 3 channels, or modify the first conv layer
- Use 5-fold cross-validation given the small dataset
- Consider using a smaller model (ResNet-18) to reduce overfitting risk
Scoring Rubric:
- Strong Hire: Multi-phase fine-tuning with differential LR, strong augmentation, addresses grayscale input, mentions cross-validation, considers smaller model
- Lean Hire: Correct basic approach (freeze then fine-tune) with data augmentation
- No Hire: Trains from scratch on 2K images or uses the same LR for all layers
Interview Cheat Sheet
| Concept | Key Fact | Common Mistakes |
|---|---|---|
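| Output size formula | $O = \lfloor (W - K + 2P)/S \rfloor + 1$ | Forgetting the floor or the $+1$ |
| Receptive field | $RF_l = RF_{l-1} + (K_l - 1) \cdot \prod_{i=1}^{l-1} S_i$ | Not accounting for stride's multiplicative effect |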
| 3x3 vs 7x7 | Three 3x3 = 7x7 RF, fewer params, more nonlinearity | Saying they are "the same" without quantifying |
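| Skip connections | $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \left( \frac{\partial F}{\partial x} + I \right)$ - gradient cannot vanish | Saying "prevents vanishing gradients" without the math |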
| 1x1 convolutions | Channel mixing, bottleneck reduction, no spatial operation | Thinking they are useless because kernel is "too small" |
| Depthwise separable | ~8-9x fewer FLOPs for 3x3 kernels | Confusing depthwise and pointwise steps |
| Global average pooling | Replaces FC layers, no parameters, full translation invariance | Not knowing it exists (many candidates only know max pooling) |
| Transfer learning | Freeze first, differential LR, augment heavily for small data | Using same LR everywhere or training from scratch |
| ResNet bottleneck | 1x1-3x3-1x1 pattern, 17x fewer params than direct 3x3-3x3 | Not knowing why 1x1 convs are needed |
| ConvNeXt | Modernized ResNet with Transformer tricks, matches ViT | Saying "CNNs are obsolete because of Transformers" |
Spaced Repetition Checkpoints
Day 0 - After First Read
- Write the output size formula from memory and solve 3 examples
- Draw a residual block and write the gradient equation showing the identity term
- List the 7 key architectures in order (LeNet, AlexNet, VGG, GoogLeNet, ResNet, EfficientNet, ConvNeXt) and state each one's primary innovation
Day 3 - First Review
- Compute the receptive field for a 5-layer CNN with alternating stride-1 and stride-2 layers
- Explain depthwise separable convolutions and compute the FLOP ratio for 3x3 kernels
- Compare VGG-16 and ResNet-50 in terms of: depth, parameters, key innovation, performance
Day 7 - Connections Review
- Explain how receptive field connects to the degradation problem connects to skip connections
- Explain how 1x1 convolutions are used in: GoogLeNet (bottleneck), ResNet (channel matching), MobileNet (pointwise)
- Design a CNN architecture for a given task, justifying kernel sizes, strides, and normalization
Day 14 - Interview Simulation
- Given a feature map shape and a target, design 3 conv layers with correct math
- Prove the ResNet gradient benefit on a whiteboard in under 5 minutes
- Walk through a complete transfer learning strategy for a given scenario
Day 21 - Final Calibration
- Complete all 4 practice problems under time pressure (10 minutes each)
- Explain why ConvNeXt adopted ideas from Transformers and what they changed
- Connect CNNs to the broader deep learning picture: how do they relate to attention (Vision Transformers), and when would you choose one over the other?
