Convolutional Neural Networks
Reading Time: ~45 min | Interview Relevance: Very High | Target Roles: MLE, CV Engineer, Applied Scientist
You are four months into your role at a semiconductor company. The previous team built a wafer defect classifier: 224x224 RGB images flattened into 150,528-dimensional vectors, fed into a stack of fully connected layers. The model has 40 million parameters. It overfits immediately - training accuracy hits 98%, validation accuracy plateaus at 62%. Training takes 8 hours on a V100. They tried dropout, L2 regularization, reducing the learning rate. Nothing helped. The validation curve stayed flat.
You look at the architecture and realize the problem has nothing to do with regularization. The problem is structural. The model treats the pixel at row 0, column 0 and the pixel at row 223, column 223 as completely unrelated inputs. It has no notion that adjacent pixels are part of the same feature. It has no concept that a scratch in the top-left corner of a wafer looks the same as a scratch in the bottom-right corner. It must learn 40 million independent weights, each encoding a relationship that the architecture does not know should be shared.
You replace it with a CNN: 5 convolutional layers, batch normalization, global average pooling, a single linear classifier. Two million parameters. Validation accuracy: 89%. Training time: 45 minutes. Your manager calls a team meeting and asks you to explain, from the beginning, why this works so much better. That explanation is what this lesson covers.
Why Fully Connected Networks Fail on Images
Let's put concrete numbers on the problem before we explain the solution.
A 224x224 RGB image has 224 * 224 * 3 = 150,528 input values. If the first hidden layer has 1,024 neurons, that layer alone requires 150,528 * 1,024 = 154,140,672 weight parameters - about 154 million numbers, just for the first layer. A second layer of the same size adds another 1,024 * 1,024 = 1,048,576 parameters. And then a third. In four layers you have burned through most of your GPU memory before you have learned anything meaningful.
This is catastrophic for three distinct reasons, and it is worth being precise about each one.
Reason 1 - Memory. 154 million float32 weights occupy about 616 MB for one layer. But during training you also need to store activations (the forward pass values), gradients (the backward pass values), and optimizer state (Adam stores two momentum buffers per parameter). A naively deep FC network on images exhausts GPU memory before it gets deep enough to be useful. You end up with a shallow model that cannot learn complex features.
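As a rough sanity check on these numbers, the sketch below tallies the memory needed to train just that first FC layer. It assumes float32 values and Adam with two buffers per parameter, and it ignores activations and biases, so the total is a lower bound. The function name `fc_training_memory_mb` is illustrative only.

```python
def fc_training_memory_mb(n_in: int, n_out: int, bytes_per_value: int = 4) -> dict:
    """Rough memory tally for one fully connected layer trained with Adam."""
    weights = n_in * n_out                      # weight matrix only, biases ignored
    weight_mb = weights * bytes_per_value / 1e6
    return {
        "params": weights,
        "weights_mb": weight_mb,
        "gradients_mb": weight_mb,              # one gradient per weight
        "adam_state_mb": 2 * weight_mb,         # first and second moment buffers
        "total_mb": 4 * weight_mb,              # weights + gradients + Adam state
    }

stats = fc_training_memory_mb(224 * 224 * 3, 1024)
print(f"{stats['params']:,} weights")
print(f"{stats['weights_mb']:.1f} MB for weights, {stats['total_mb']:.1f} MB with gradients and Adam state")
```

Under these assumptions the layer needs roughly four times its weight memory before a single activation is stored.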
Reason 2 - Data efficiency. With 154 million parameters in just the first layer, you need an enormous number of labeled examples to avoid overfitting. A classical rule of thumb is that an unstructured model needs a training set on the same order as its number of free parameters to generalize well. Most real-world computer vision datasets - medical imaging, satellite imagery, industrial inspection - have thousands or tens of thousands of images, not hundreds of millions. The semiconductor wafer dataset has 12,000 labeled samples. An FC network will memorize them almost instantly and learn little that generalizes.
Reason 3 - Semantic blindness. This is the deepest problem, and it is architectural rather than statistical. An FC layer treats each of the 150,528 pixel values as an independent feature. It has no idea that pixel (10, 20) and pixel (10, 21) are neighbors. It cannot use the fact that visual features - edges, textures, shapes - are local. A cat's whisker spans a few dozen adjacent pixels; an FC network must discover that these pixels are related from scratch, independently, at every possible position in the image.
Natural images have three fundamental properties that FC networks cannot exploit:
Local structure: meaningful patterns (edges, textures, shapes) are spatially local. A vertical edge is defined by the relationship between a few adjacent pixels, not by the relationship between pixels on opposite sides of the image. This property is exploited by local connectivity.
Translation invariance: a defect in a wafer looks the same whether it appears in the center or the corner. The feature detector for "scratch" should be the same detector regardless of where the scratch is located. An FC network must learn a separate set of weights for "scratch in top-left", "scratch in top-center", "scratch in center", and so on - for every possible position. This property is exploited by weight sharing.
Compositional hierarchy: high-level patterns are built from low-level patterns. Eyes are made of edges and circles. Faces are made of eyes, noses, and mouths. An architecture that can build hierarchical representations should identify complex objects from simpler parts. This property is exploited by depth.
CNNs were designed, explicitly, to exploit all three.
The parameter count problem is not just about memory. It is about the statistical sample complexity of learning. An FC layer on a 224x224 image must independently learn that "a horizontal edge at position (10,20)" and "a horizontal edge at position (80,150)" are the same kind of feature. A convolutional layer learns it once and applies it everywhere.
The Insight That Changed Everything: Hubel and Wiesel, 1959
The story of CNNs begins not in a computer lab but in a physiology lab at Johns Hopkins University. David Hubel and Torsten Wiesel were recording electrical signals from neurons in the visual cortex of cats. To stimulate these neurons, they projected slides onto a screen in front of the anesthetized cat.
What they discovered was surprising. Most neurons in the primary visual cortex (V1) did not respond to uniform illumination or random patterns. They responded to specific, localized stimuli: a bar of light at a particular orientation, appearing in a particular region of the visual field. Each neuron had a small "window" on the world - a region it was sensitive to, and an orientation it preferred.
Some neurons responded to a horizontal bar in the upper-left of the visual field. Others to a diagonal bar in the center. Still others to a vertical bar on the right. The key observations were:
- Local receptive fields: each neuron responded to only a small patch of the visual field, not the whole scene.
- Orientation selectivity: each neuron was tuned to a specific edge orientation - a natural "feature detector."
- Spatial organization: neurons sensitive to the same orientation but different positions were arrayed systematically across V1.
Hubel and Wiesel won the Nobel Prize in Physiology or Medicine in 1981 for this work. The implication for machine vision was profound: the brain does not process the whole visual scene simultaneously with a flat representation. It uses small, specialized, local detectors arranged hierarchically.
LeCun, at Bell Labs in 1989, was thinking about optical character recognition - reading handwritten ZIP codes on envelopes. His insight: if a feature is useful at one position, it is useful at every position. A horizontal edge detector that works at row 5, column 3 should work equally well at row 47, column 112. The same filter weights should be used everywhere. This was weight sharing, directly inspired by the organization of V1.
The result was LeNet-5 (1998): the first practical convolutional network, trained end-to-end with backpropagation, that could reliably read handwritten digits. It ran on AT&T ATMs to process check deposits. The architecture - alternating convolution, activation, pooling layers, ending with a classifier - is still recognizable in every modern CNN.
The 14-year gap between LeNet-5 and AlexNet is instructive. The ideas were right in 1998. What was missing was compute (GPUs), data (ImageNet), and a few algorithmic tricks (ReLU activations, dropout). When those arrived, CNNs achieved a step-change on ImageNet 2012 that no one expected: AlexNet achieved 15.3% top-5 error versus the runner-up's 26.2%. The field of computer vision permanently changed direction that year.
What a Convolution Actually Does
Before we write a single equation, let's build visual intuition with a concrete analogy.
Imagine you are exploring a completely dark room, carrying a small flashlight. The flashlight only illuminates a 3x3 patch of floor at a time. You slide it systematically from corner to corner - left to right, top to bottom - looking at each small patch in turn. At each position, you are asking a question of that patch: "Does this patch contain what I am looking for?"
The filter is the question. The sliding is the convolution. The answer at each position (a single number) is one value in the output feature map.
Let's make this concrete with a 5x5 grayscale image and a 3x3 vertical edge detector:
Image (5x5): Filter (3x3):
┌───────────────────────────┐ ┌─────────────────┐
│ 1 2 3 4 5 │ │ -1 0 +1 │
│ 6 7 8 9 10 │ │ -2 0 +2 │
│ 11 12 13 14 15 │ │ -1 0 +1 │
│ 16 17 18 19 20 │ └─────────────────┘
│ 21 22 23 24 25 │
└───────────────────────────┘
Output position [0,0] - top-left corner patch:
Patch: Filter: Product:
1 2 3 -1 0 +1 -1 0 +3
6 7 8 x -2 0 +2 = -12 0 +16
11 12 13 -1 0 +1 -11 0 +13
Sum = -1 + 0 + 3 - 12 + 0 + 16 - 11 + 0 + 13 = 8
The filter slides to every valid 3x3 position. For a 5x5 input with no padding, there are (5-3+1) * (5-3+1) = 9 positions, producing a 3x3 output. Each output value is a single number: the dot product of the filter weights and the corresponding image patch.
This particular filter - the horizontal difference [-1, 0, +1] stacked over three rows with smoothing weights [1, 2, 1] - is the Sobel X filter, a vertical edge detector. It produces a large positive value when there is a dark region to the left and a bright region to the right, and exactly zero on uniform regions. It fires wherever there is a left-to-right brightness change - a vertical edge.
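The worked example can be checked at all nine positions with a few lines of plain Python. This is a minimal sketch for the 5x5 ramp image and the Sobel X filter only; the helper name `correlate_valid` is an illustrative choice.

```python
image = [[5 * r + c + 1 for c in range(5)] for r in range(5)]  # values 1..25, row by row
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

def correlate_valid(img, kernel):
    """Slide `kernel` over `img` with stride 1 and no padding."""
    k = len(kernel)
    out_h = len(img) - k + 1
    out_w = len(img[0]) - k + 1
    return [
        [
            sum(img[i + m][j + n] * kernel[m][n] for m in range(k) for n in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

feature_map = correlate_valid(image, sobel_x)
for row in feature_map:
    print(row)
# [8, 8, 8]
# [8, 8, 8]
# [8, 8, 8]
```

Every output is 8 because this image brightens by the same amount per column everywhere, so the filter sees the same left-to-right gradient in every patch.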
Here is what different 3x3 filters detect, as a reference:
Vertical edges: Horizontal edges: Blur/Smooth: Sharpen:
-1 0 +1 -1 -2 -1 1/9 1/9 1/9 0 -1 0
-2 0 +2 0 0 0 1/9 1/9 1/9 -1 5 -1
-1 0 +1 +1 +2 +1 1/9 1/9 1/9 0 -1 0
(Sobel X) (Sobel Y) (box blur) (sharpening)
In classical computer vision (before deep learning), these filters were hand-designed by engineers with domain expertise. A CNN learns them automatically from labeled data. The remarkable fact: when you train a CNN on natural images and visualize what its first-layer filters learned, you see oriented edge detectors, color blob detectors, and texture detectors - essentially what human engineers had been designing by hand, but discovered automatically through gradient descent.
The Three Key Properties of Convolutions
The convolution operation has three structural properties that make it suited for image data. Understanding these is not academic - they come up directly in interviews, and knowing them helps you make better architectural decisions.
Property 1: Local Connectivity
Each output neuron connects to only a small local patch of the input - a K x K region - not the entire input. This is the "flashlight" property: each computation only examines a small neighborhood.
Why does this matter? Visual features are local. An edge spans a few adjacent pixels. A texture repeats over a small patch. A corner is the junction of two edges in a small region. There is no image feature that requires combining pixels from opposite corners of a 224x224 image directly - any such relationship would be built up hierarchically through many layers.
Local connectivity reduces parameter count dramatically: instead of connecting each output neuron to all H * W * C_in inputs, you connect it to only K * K * C_in inputs. For K=3 on a 224x224 image: 9 connections instead of 50,176 per channel.
Property 2: Weight Sharing
The same filter weights are used at every spatial position. The single set of K * K * C_in weights for one filter is applied at all H_out * W_out positions to produce one feature map. Every application of that filter is computing the same dot product, just at a different location.
This is the critical insight from the Hubel-Wiesel work: a horizontal edge detector that works at one position should work everywhere, because horizontal edges look the same everywhere in the image. You do not need a different detector for each position.
Weight sharing reduces parameter count further: instead of H_out * W_out * K * K * C_in * C_out parameters, you have K * K * C_in * C_out - completely independent of the spatial resolution of the input.
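To see how much each property contributes, the sketch below counts weights for three ways of mapping a 224x224x3 input to a 224x224x64 output. It assumes "same" padding (output resolution equals input resolution) and ignores biases. The "locally connected" case is the hypothetical middle ground: local patches, but no weight sharing.

```python
def layer_param_counts(h: int, w: int, c_in: int, c_out: int, k: int) -> dict:
    """Weight counts (no biases) for three ways to map h x w x c_in -> h x w x c_out."""
    positions = h * w  # "same" padding assumed, so the output is also h x w
    return {
        # every output value connected to every input value
        "fully_connected": (h * w * c_in) * (positions * c_out),
        # local k x k patches, but a separate filter bank at every position
        "locally_connected": positions * k * k * c_in * c_out,
        # local patches and one filter bank shared across all positions
        "convolution": k * k * c_in * c_out,
    }

counts = layer_param_counts(224, 224, 3, 64, 3)
for name, n in counts.items():
    print(f"{name:>18}: {n:,}")
```

Local connectivity alone removes several orders of magnitude; weight sharing then removes the remaining factor of H_out * W_out.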
Property 3: Translation Equivariance
If you shift the input image by some amount and then convolve, you get the same result as convolving first and then shifting the output. Formally: if f is a convolution and T_d is a spatial translation by d, then f(T_d(x)) = T_d(f(x)).
This is a direct consequence of weight sharing. The filter "slides" with the feature - if a cat shifts 5 pixels to the right, the activations in the feature map also shift 5 pixels to the right. The network does not need to relearn that shifted features are the same feature; the architecture guarantees it.
Equivariance (the output shifts with the input) is different from invariance (the output does not change when the input shifts). Convolutions give equivariance. Pooling layers add invariance within a small window. Global pooling gives full spatial invariance. A well-designed CNN exploits equivariance through the convolutional layers and invariance through pooling and global aggregation at the end.
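A small 1D experiment makes the distinction tangible. This is a sketch under simplifying assumptions: stride 1, no padding, and a feature that stays away from the borders (near the borders, zero-filling breaks exact equivariance). The helper names are illustrative.

```python
def correlate_1d(x, kernel):
    """Valid 1D cross-correlation with stride 1."""
    k = len(kernel)
    return [sum(x[i + m] * kernel[m] for m in range(k)) for i in range(len(x) - k + 1)]

def shift_right(x, d):
    """Translate a signal d steps to the right, filling with zeros."""
    return [0] * d + x[: len(x) - d]

signal = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]   # a small bump near the left
edge = [-1, 0, 1]                         # 1D analogue of one Sobel X row
d = 3

conv_then_shift = shift_right(correlate_1d(signal, edge), d)
shift_then_conv = correlate_1d(shift_right(signal, d), edge)

print(conv_then_shift)                       # [0, 0, 0, 1, 3, 0, -3, -1]
print(shift_then_conv)                       # [0, 0, 0, 1, 3, 0, -3, -1]
print(conv_then_shift == shift_then_conv)    # True  -> equivariance

# Global max pooling discards position entirely -> invariance
print(max(correlate_1d(signal, edge)), max(shift_then_conv))   # 3 3
```

The feature map moves with the input (equivariance), while the globally pooled value does not change at all (invariance).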
The Math of Convolution
Now that we have the intuition, let's be precise.
For a single-channel input and a single filter, the output at position (i, j) is:
Output[i, j] = sum over m,n of Input[i*S + m, j*S + n] * Filter[m, n] + b
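Where K is the filter size (kernel size), so m and n each range over 0..K-1; S is the stride (how many pixels the filter jumps between positions); and b is a bias term. Indices refer to the input after any zero-padding has been applied.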
For realistic inputs with C_in input channels and C_out output filters, each output filter has shape K * K * C_in - it spans all input channels simultaneously:
Output[i, j, f] = sum over c, m, n of Input[i*S+m, j*S+n, c] * Filter_f[m, n, c] + b_f
Where f indexes the output filter (from 0 to C_out - 1). The total parameter count for the layer is:
Params = C_out * K * K * C_in + C_out (weights + biases)
The output spatial dimensions given input H x W, kernel size K, padding P, stride S:
H_out = floor((H_in - K + 2*P) / S) + 1
W_out = floor((W_in - K + 2*P) / S) + 1
Three examples you should be able to compute instantly in an interview:
- 224x224, kernel 3x3, padding=1, stride=1: floor((224-3+2)/1)+1 = 224 - same size
- 224x224, kernel 3x3, padding=1, stride=2: floor((224-3+2)/2)+1 = 112 - halved
- 224x224, kernel 7x7, padding=3, stride=2: floor((224-7+6)/2)+1 = 112 - halved (typical stem)
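The output-size and parameter-count formulas are short enough to keep as helpers. The sketch below assumes square kernels and symmetric padding; the function names are illustrative and not part of any library.

```python
def conv_output_size(size_in: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """floor((size_in - kernel + 2 * padding) / stride) + 1 for one spatial axis."""
    return (size_in - kernel + 2 * padding) // stride + 1

def conv_param_count(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Weights plus optional biases for a square-kernel conv layer."""
    return c_out * kernel * kernel * c_in + (c_out if bias else 0)

for kernel, padding, stride in [(3, 1, 1), (3, 1, 2), (7, 3, 2)]:
    out = conv_output_size(224, kernel, padding, stride)
    print(f"K={kernel}, P={padding}, S={stride}: 224 -> {out}")
# K=3, P=1, S=1: 224 -> 224
# K=3, P=1, S=2: 224 -> 112
# K=7, P=3, S=2: 224 -> 112

print(conv_param_count(c_in=3, c_out=64, kernel=3))  # 1792
```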
Why "Convolutional" is a slight misnomer. Technically, what CNNs compute is cross-correlation, not convolution. In signal processing, true convolution flips the kernel before sliding it: (f * g)[n] = sum_k f[k] * g[n-k]. Cross-correlation skips the flip: (f star g)[n] = sum_k f[k] * g[n+k]. The difference is a kernel flip.
This does not matter in practice. The filters are learned from data. If the optimal filter for detecting a left edge is [-1, 0, +1], and we are computing cross-correlation, gradient descent will learn [-1, 0, +1]. If we were computing true convolution, gradient descent would learn [+1, 0, -1] (the flipped version). The network does not care - it learns whatever works. Every major deep learning framework calls it "convolution." The name has stuck.
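The flip is easy to see in one dimension with NumPy, whose `np.correlate` and `np.convolve` follow the two definitions above. This is a minimal illustration on a made-up signal, not a statement about how any framework implements its layers.

```python
import numpy as np

signal = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
kernel = np.array([-1.0, 0.0, 1.0])

cross_corr = np.correlate(signal, kernel, mode="valid")   # no flip (what CNN layers compute)
true_conv = np.convolve(signal, kernel, mode="valid")     # flips the kernel before sliding

print(cross_corr)   # [3. 5. 7.]
print(true_conv)    # [-3. -5. -7.]

# Convolution with a kernel equals cross-correlation with the flipped kernel
print(np.allclose(true_conv, np.correlate(signal, kernel[::-1], mode="valid")))  # True
```

Because the kernel is learned, the two conventions differ only in which orientation of the weights gradient descent ends up storing.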
Multiple Filters = Multiple Feature Maps
A single filter learns to detect one kind of pattern. Real images have many kinds of patterns - horizontal edges, vertical edges, diagonal edges, color gradients, textures, corners. To detect all of them, you use multiple filters in parallel.
For a convolutional layer with C_out filters, each filter is a separate K x K x C_in tensor with its own learned weights. Each filter produces one output feature map of shape H_out x W_out. Together, they produce an output of shape H_out x W_out x C_out.
You can think of this as: the first filter asks "is there a vertical edge here?", the second asks "is there a horizontal edge here?", the third asks "is there a red region here?", and so on. After training, the filters learn to ask questions that are jointly useful for the task.
After training on ImageNet, you might see filters like:
Filter 0: detects left-to-right edges (vertical edges)
Filter 1: detects top-to-bottom edges (horizontal edges)
Filter 2: responds to blue-yellow color contrast
Filter 3: responds to red-green color contrast
Filter 4: detects diagonal edges (top-left to bottom-right)
Filter 5: detects corners (high frequency response in all directions)
...
Filter 63: detects some pattern that has no obvious name but is useful
This is why the output of a convolutional layer is a 3D tensor, not a 2D image. The spatial dimensions track position in the image. The channel dimension tracks which kind of feature was detected at that position.
Hierarchical Feature Learning
One of the most powerful aspects of deep CNNs is not what a single layer does, but what the stack of layers learns to do.
A key empirical finding from Zeiler and Fergus (2013) and others: when you visualize what filters in different layers of a trained CNN respond to most strongly, you see a consistent pattern:
Layer 1: Oriented edges - horizontal, vertical, diagonal (Gabor-like)
Color blobs and simple textures
Layer 2: Combinations of edges: corners, curves, cross-hatching
Multi-scale textures, grids, parallel lines
Layer 3: Object parts beginning to emerge
Simple shapes: arcs, circles, rectangles
Layer 4: Recognizable parts - wheels, dog faces, text characters
Repeating structures, scene geometry
Layer 5+: Whole objects - car bodies, human faces, chairs, dogs
This is not programmed. It emerges from gradient descent on a classification objective.
Why does this hierarchy emerge? Each layer applies its filters to the feature maps of the previous layer, not to the raw image. So Layer 2 filters are looking at patterns in edge-detector outputs. A pattern of edges curving in a consistent direction looks like a curve. A pattern of curves meeting at angles looks like a shape. A pattern of shapes arranged spatially looks like a part. This compositional structure is exactly what you would want for recognizing objects, and the architecture makes it easy to learn because each layer only needs to combine local patterns from the layer below.
The hierarchy also explains transfer learning: the early layers of a CNN trained on ImageNet learn general visual features (edges, textures, simple shapes) that are useful for almost any visual task, while the final layers are the most task-specific. This is why fine-tuning from pretrained weights often works well even when transferring to quite different domains - medical imaging, satellite imagery, industrial defect detection - because the early layers need little or no change.
A key design implication: when you fine-tune a pretrained CNN for a new task, freeze the early layers (the general feature extractors) and only update the later layers (the task-specific features). The more data you have, the more layers you can afford to update.
Parameter Efficiency: The Numbers That Matter
Let's make the parameter savings concrete. Compare a fully connected layer to a convolutional layer processing the same input.
Input: 224x224x3 RGB image. Goal: produce 64 feature maps.
| Layer Type | Parameters | Ratio |
|---|---|---|
| FC layer (64 outputs) | 150,528 * 64 = 9,633,792 | 1x baseline |
| Conv 3x3 (64 filters) | 3 * 3 * 3 * 64 = 1,728 | ~5,575x fewer |
| Conv 5x5 (64 filters) | 5 * 5 * 3 * 64 = 4,800 | ~2,007x fewer |
| Conv 7x7 (64 filters) | 7 * 7 * 3 * 64 = 9,408 | ~1,024x fewer |
That 3x3 conv with only 1,728 parameters scans across every 3x3 patch in the entire 224x224 image - with padding 1, it applies these same 1,728 weights at each of the 224 * 224 = 50,176 spatial positions and produces 64 full-resolution feature maps. The FC layer spends 9.6 million independent weights to produce just 64 scalar outputs, because it has no concept that "horizontal edge at position (10,20)" and "horizontal edge at position (50,80)" should be detected the same way.
Weight sharing is why CNNs generalize far better from small datasets. An FC layer must learn a separate representation for each spatial position. A conv layer learns one representation and applies it everywhere - far fewer parameters to overfit. This is the core statistical argument for CNNs, not just the engineering argument.
The Full CNN Forward Pass
Receptive Field: How Deep Networks See the Whole Image
The receptive field of a neuron is the region of the original input image that can affect that neuron's output. Understanding receptive field growth is essential for designing networks that can see the whole image without becoming impractically large.
Start simple: a single 3x3 conv layer gives each output neuron a 3x3 receptive field in the input. It literally only sees a 3x3 patch - 9 pixels out of 50,176. After stacking two stride-1 3x3 conv layers, each output neuron sees a 5x5 patch in the original input. After three layers, 7x7. After five, 11x11.
The formula for receptive field growth through a stack of layers (all stride 1):
RF_L = 1 + sum over l from 1 to L of (K_l - 1)
With stride, the receptive field grows much faster. If layer l has stride S_l, the contribution of later layers to the input-space receptive field is amplified. Each stride-2 layer doubles the effective distance that subsequent layers cover:
RF_L = RF_(L-1) + (K_L - 1) * product of S_i for i < L
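The recurrence is easy to run for any list of (kernel, stride) pairs. The sketch below tracks the running product of strides (often called the "jump") alongside the receptive field; it ignores dilation, and padding does not affect the size. The function name is illustrative.

```python
def receptive_fields(layers):
    """Return the receptive field size after each (kernel, stride) layer.

    Implements RF_L = RF_(L-1) + (K_L - 1) * product of the strides of earlier layers.
    """
    rf, jump = 1, 1          # jump = product of strides of all layers seen so far
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # uses strides of earlier layers only
        jump *= stride              # this layer's stride affects later layers
        sizes.append(rf)
    return sizes

print(receptive_fields([(3, 1)] * 5))               # [3, 5, 7, 9, 11]
print(receptive_fields([(7, 2), (3, 2), (3, 1)]))   # [7, 11, 19]
```

Note that a layer's own stride does not enlarge its own receptive field; it only scales the contribution of every layer that follows.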
Here is how receptive field grows through a ResNet-like backbone:
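| Layer | Kernel | Stride | Receptive field in input space |
|---|---|---|---|
| 0 | - | - | 1x1 (single pixel) |
| 1 | 7 | 2 | 7x7 (stem conv) |
| 2 | 3 | 2 | 11x11 (maxpool, treated as a 3x3 stride-2 layer) |
| 3 | 3 | 1 | 19x19 |
| 4 | 3 | 1 | 27x27 |
| 5 | 3 | 2 | 35x35 |
| 6 | 3 | 1 | 51x51 |
| 7 | 3 | 2 | 67x67 |
| 8 | 3 | 1 | 99x99 |
| 9 | 3 | 2 | 131x131 |
| 10 | 3 | 1 | 195x195 (sees most of 224x224) |
| 11 | 3 | 2 | 259x259 (larger than the 224x224 input) |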
After a few stride-2 downsampling steps, neurons deep in the network effectively see most of the image - even though each individual convolution only operates on a 3x3 window. The stride-2 steps are what make this happen efficiently: each one doubles the distance covered by each subsequent filter step.
The theoretical receptive field overstates how much a neuron actually "uses" its receptive field. The effective receptive field - the region that most influences the neuron's output - is smaller and approximately Gaussian in shape, centered in the theoretical RF. Center pixels contribute far more than border pixels because many more paths through the stacked local connections lead from a central pixel to the output than from a pixel at the edge. This is why very deep networks with enormous theoretical RFs sometimes still miss long-range spatial relationships.
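One way to see why the center dominates is to count how many distinct paths connect each input position to a single output unit. The sketch below does this for a 1D stack of 3-tap layers. It is a deliberate simplification that assumes linear layers with uniform weights; real effective receptive fields also depend on learned weights and non-linearities.

```python
def path_counts(num_layers: int, kernel_size: int = 3):
    """Count paths from each input position to one output unit in a 1D conv stack."""
    counts = [1]                                   # the output unit itself
    for _ in range(num_layers):
        widened = [0] * (len(counts) + kernel_size - 1)
        for i, c in enumerate(counts):
            for m in range(kernel_size):
                widened[i + m] += c                # each unit fans out to kernel_size inputs
        counts = widened
    return counts

counts = path_counts(num_layers=5)
print(counts)                                  # [1, 5, 15, 30, 45, 51, 45, 30, 15, 5, 1]
print(counts[len(counts) // 2] / counts[0])    # 51.0
```

After only five layers the central input has 51 times as many paths to the output as a border input, and the profile is already bell-shaped.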
Why RF matters for task design:
- Classification: the final feature map neurons need RF large enough to see the whole object. If your RF is too small, neurons can only see object parts, not the whole.
- Object detection: neurons in feature maps at different scales need RF matching the scale of objects they detect. That is why FPN (Feature Pyramid Network) uses features from multiple layers.
- Semantic segmentation: you need a large RF to capture context (what is around a pixel), but you cannot lose spatial resolution. This tension is what motivated dilated convolutions.
1x1 Convolutions: Channel Mixing Without Spatial Change
A 1x1 convolution seems paradoxical at first: a filter of size 1x1 has no spatial extent. It cannot detect edges or shapes. What does it do?
It performs a learned linear combination across channels, at each spatial position independently. For an input of shape H x W x C_in and C_out output filters, a 1x1 conv applies C_out different linear combinations to the C_in-dimensional vector at each of the H * W spatial positions.
Think of it this way: at each pixel location, you have a C_in-dimensional vector (the channel values at that position). A 1x1 conv applies a learned C_in x C_out matrix multiplication to that vector, independently at each position. It is a fully connected layer applied pointwise across the spatial dimension.
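The equivalence between a 1x1 convolution and a per-pixel matrix multiply can be checked directly in NumPy. This is a small sketch with arbitrary made-up shapes; the weight matrix here is stored as (C_out, C_in), which is the 1x1 kernel with its two singleton spatial dimensions dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C_in, C_out = 4, 5, 8, 3
x = rng.standard_normal((C_in, H, W))          # feature map, channels first
weight = rng.standard_normal((C_out, C_in))    # 1x1 kernel without the 1x1 dims

# View 1: a 1x1 convolution written as an explicit loop over positions
conv_out = np.zeros((C_out, H, W))
for i in range(H):
    for j in range(W):
        conv_out[:, i, j] = weight @ x[:, i, j]    # mix channels at one pixel

# View 2: flatten the spatial grid and do a single matrix multiply
matmul_out = (weight @ x.reshape(C_in, H * W)).reshape(C_out, H, W)

print(conv_out.shape)                      # (3, 4, 5)
print(np.allclose(conv_out, matmul_out))   # True
```

No pixel ever sees its neighbors in either view, which is exactly why a 1x1 conv changes the channel count without any spatial mixing.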
This is useful for three concrete purposes:
Use case 1 - Channel dimensionality reduction (bottleneck):
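If a feature map has 256 channels and you want to run a 3x3 conv over it, a 256 → 256 3x3 conv costs 9 * 256 * 256 = 589,824 multiply-accumulate operations per output location. If you first apply a 1x1 conv to reduce 256 → 64 channels (256 * 64 = 16,384 operations), a 3x3 conv from 64 to 256 channels costs 9 * 64 * 256 = 147,456 operations - the 3x3 itself is 4x cheaper, and the pair is about 3.6x cheaper overall. The ResNet-50 bottleneck pushes the same idea further: it runs the 3x3 at 64 → 64 and restores 256 channels with a second 1x1 conv.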
Use case 2 - Adding non-linearity without spatial mixing:
A 1x1 conv followed by ReLU applies a learned non-linear projection to each spatial location's channel vector. This increases the model's representational power without any spatial mixing. GoogLeNet (Inception v1) used 1x1 convolutions before expensive 3x3 and 5x5 convolutions to cheaply add capacity.
Use case 3 - Channel expansion before depthwise conv:
MobileNetV2's inverted bottleneck expands channels before the depthwise conv (to give the depthwise conv more channels to work with), then compresses back with a 1x1 projection. The 1x1 convs bracket the depthwise spatial operation.
Here is the full ResNet-50 bottleneck pattern with the parameter arithmetic:
Input: 256 channels, 56x56 spatial
-- 1x1 conv: 256 -> 64 channels (4x compression, no spatial mixing)
Parameters: 256 * 64 = 16,384
-- 3x3 conv: 64 -> 64 channels (spatial mixing on cheap 64-ch maps)
Parameters: 9 * 64 * 64 = 36,864
-- 1x1 conv: 64 -> 256 channels (expand back to 256)
Parameters: 64 * 256 = 16,384
Output: 256 channels, 56x56 spatial
Total bottleneck params: 69,632
vs. direct 3x3 conv 256->256: 9 * 256 * 256 = 589,824
Reduction: 8.5x fewer parameters, similar representational capacity
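The same arithmetic in code, as a quick check. The sketch counts weights only (no biases or batch-norm parameters), and `conv_weights` is an illustrative helper name.

```python
def conv_weights(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a square-kernel conv layer, biases omitted."""
    return kernel * kernel * c_in * c_out

bottleneck = [
    conv_weights(256, 64, 1),   # 1x1 reduce
    conv_weights(64, 64, 3),    # 3x3 spatial mixing on the narrow maps
    conv_weights(64, 256, 1),   # 1x1 expand
]
direct = conv_weights(256, 256, 3)

print(bottleneck, sum(bottleneck))          # [16384, 36864, 16384] 69632
print(direct)                               # 589824
print(f"{direct / sum(bottleneck):.1f}x")   # 8.5x
```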
Depthwise Separable Convolutions: MobileNet's Key Innovation
A standard 3x3 convolution does two things simultaneously:
- Mixes information across the K x K spatial neighborhood (spatial mixing)
- Mixes information across all C_in input channels (channel mixing)
These two operations are coupled in standard convolution. Depthwise separable convolutions factor them into two separate, cheaper operations.
Step 1 - Depthwise convolution: Apply one K x K filter per input channel, independently. No mixing across channels. Each of the C_in channels produces one output channel. This is purely spatial mixing.
Step 2 - Pointwise convolution: Apply a 1x1 convolution to mix channels. Takes the C_in depthwise outputs and projects them to C_out channels. This is purely channel mixing.
The parameter arithmetic:
Standard 3x3 conv from 64 to 128 channels:
Parameters: K*K * C_in * C_out = 9 * 64 * 128 = 73,728
Depthwise separable equivalent:
Depthwise: K*K * C_in * 1 = 9 * 64 = 576 (one filter per channel)
Pointwise: 1*1 * C_in * C_out = 1 * 64 * 128 = 8,192 (channel mixing)
Total: 8,768
Reduction factor: 73,728 / 8,768 = 8.4x fewer parameters
The general cost ratio (depthwise separable cost divided by standard conv cost) for a K x K kernel:
Ratio = 1/C_out + 1/K^2
For K=3 and large C_out the ratio approaches 1/9, so roughly 9x fewer FLOPs and parameters. For K=5 it approaches 1/25, roughly 25x fewer.
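The expression 1/C_out + 1/K^2 follows from dividing the two parameter counts given above (the depthwise plus pointwise count over the standard count). Written out, ignoring biases:

```latex
\frac{\text{depthwise} + \text{pointwise}}{\text{standard}}
  = \frac{K^2 C_{in} + C_{in} C_{out}}{K^2 C_{in} C_{out}}
  = \frac{1}{C_{out}} + \frac{1}{K^2}
```

For the example above (K = 3, C_out = 128) this gives 1/128 + 1/9 ≈ 0.119, i.e. about 8.4x fewer parameters, matching 73,728 / 8,768. The same ratio applies to multiply-accumulate counts, since both costs are multiplied by the same number of output positions.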
The intuition: spatial feature detection (finding edges, textures within a K x K patch) and channel mixing (combining different feature types) are largely independent tasks. Factoring them lets you do each cheaply and combine the results, with minimal accuracy loss because the operations are nearly independent anyway.
MobileNetV1 replaced nearly all standard convolutions in a VGG-style network with depthwise separable convolutions (the first layer stays a regular convolution), achieving about 8-9x fewer FLOPs with only ~1% top-1 accuracy drop on ImageNet. That tradeoff makes deployment on phones and edge devices practical.
Networks using depthwise separable convolutions: MobileNet (v1, v2, v3), EfficientNet, Xception, ShuffleNet.
From Scratch in NumPy: Implementing Conv2D
Let's implement a 2D convolution from scratch using NumPy to make the sliding-window operation completely concrete. This is for understanding, not production.
import numpy as np
from typing import Tuple
def conv2d_naive(
    input_map: np.ndarray,
    filters: np.ndarray,
    stride: int = 1,
    padding: int = 0,
) -> np.ndarray:
    """
    Naive 2D convolution (cross-correlation) from scratch.
    Args:
        input_map: shape (C_in, H, W)
        filters: shape (C_out, C_in, K, K)
        stride: step size for the sliding window
        padding: zero-padding to add around the border
    Returns:
        output: shape (C_out, H_out, W_out)
    """
    C_in, H, W = input_map.shape
    C_out, _C_in, K, _K = filters.shape
    assert _C_in == C_in, f"Filter C_in={_C_in} must match input C_in={C_in}"
    assert _K == K, "Filters must be square"
    # Zero-pad the input if needed
    if padding > 0:
        input_padded = np.pad(
            input_map,
            pad_width=((0, 0), (padding, padding), (padding, padding)),
            mode='constant',
            constant_values=0
        )
    else:
        input_padded = input_map
    # Compute output spatial dimensions
    H_out = (H + 2 * padding - K) // stride + 1
    W_out = (W + 2 * padding - K) // stride + 1
    output = np.zeros((C_out, H_out, W_out), dtype=np.float32)
    # The core sliding-window triple loop
    for f in range(C_out):  # for each output filter
        for i in range(H_out):  # for each output row
            for j in range(W_out):  # for each output column
                # Extract the K x K x C_in input patch at this position
                h_start = i * stride
                w_start = j * stride
                patch = input_padded[:, h_start:h_start + K, w_start:w_start + K]
                # Element-wise multiply with filter, sum everything
                # patch shape: (C_in, K, K) - matches filters[f] shape
                output[f, i, j] = np.sum(patch * filters[f])
    return output
# ---- Test 1: vertical edge detector on a clean 6x6 image ----
# Left half = dark (0), right half = bright (1) -> vertical edge at column 3
image = np.array([[
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
]], dtype=np.float32) # shape (1, 6, 6) -- single channel
# Sobel X filter: detects left-to-right brightness increase
sobel_x = np.array([[
[-1, 0, 1],
[-2, 0, 2],
[-1, 0, 1],
]], dtype=np.float32).reshape(1, 1, 3, 3) # shape (1, 1, 3, 3)
output = conv2d_naive(image, sobel_x, stride=1, padding=0)
print("Output feature map shape:", output.shape) # (1, 4, 4)
print("Output values (columns 1 and 2 should be 4, columns 0 and 3 should be 0):")
print(output[0])
# Expected: every row of the output is [0, 4, 4, 0]. The edge sits between input
# columns 2 and 3, and the two 3x3 windows starting at columns 1 and 2 both straddle it.
# ---- Test 2: verify against PyTorch ----
import torch
import torch.nn.functional as F
img_t = torch.from_numpy(image).unsqueeze(0) # (1, 1, 6, 6)
flt_t = torch.from_numpy(sobel_x) # (1, 1, 3, 3)
out_pt = F.conv2d(img_t, flt_t, padding=0)
print("\nPyTorch output:")
print(out_pt[0, 0].numpy())
print("NumPy matches PyTorch:", np.allclose(output, out_pt.numpy(), atol=1e-5))
# ---- Test 3: parameter count check ----
C_in, C_out, K = 3, 64, 3
params_conv = C_out * K * K * C_in + C_out # weights + biases
params_fc = (224 * 224 * C_in) * C_out # equivalent FC layer
print(f"\nConv layer params: {params_conv:,}")
print(f"FC layer params: {params_fc:,}")
print(f"Reduction: {params_fc / params_conv:.0f}x")
# Prints "Reduction: 5376x" (9,633,792 / 1,792, with the 64 biases included in the conv count)
Running the NumPy implementation is slow - the triple loop over (filters, rows, columns) is not vectorized. In practice, frameworks commonly implement convolution as GEMM (General Matrix Multiply) using the im2col transformation, which reshapes the sliding patches into a matrix and calls a highly optimized BLAS routine; depending on the layer shape they may instead pick other algorithms such as Winograd or FFT-based convolution. The result is the same up to floating-point rounding; the implementation is orders of magnitude faster.
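To make the im2col idea concrete, here is a minimal NumPy sketch for a single (C_in, H, W) input. It still builds the patch matrix with Python loops for readability, so it shows the data layout rather than the speed; production libraries build or avoid this matrix far more cleverly. The function names `im2col` and `conv2d_im2col` are illustrative.

```python
import numpy as np

def im2col(x: np.ndarray, k: int, stride: int = 1) -> np.ndarray:
    """Unfold (C_in, H, W) into a matrix with one flattened patch per column."""
    c_in, h, w = x.shape
    h_out = (h - k) // stride + 1
    w_out = (w - k) // stride + 1
    cols = np.empty((c_in * k * k, h_out * w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i * stride:i * stride + k, j * stride:j * stride + k]
            cols[:, i * w_out + j] = patch.ravel()   # (C_in, K, K) flattened row-major
    return cols

def conv2d_im2col(x: np.ndarray, filters: np.ndarray, stride: int = 1, padding: int = 0) -> np.ndarray:
    """Cross-correlation expressed as one matrix multiply (GEMM)."""
    c_out, c_in, k, _ = filters.shape
    if padding > 0:
        x = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
    h_out = (x.shape[1] - k) // stride + 1
    w_out = (x.shape[2] - k) // stride + 1
    cols = im2col(x, k, stride)                       # (C_in*K*K, H_out*W_out)
    weights = filters.reshape(c_out, c_in * k * k)    # (C_out, C_in*K*K), same flattening order
    return (weights @ cols).reshape(c_out, h_out, w_out)

# A 6x6 step-edge image (left half 0, right half 1) and the Sobel X filter
image = np.tile(np.array([0, 0, 0, 1, 1, 1], dtype=np.float32), (6, 1))[None]  # (1, 6, 6)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32).reshape(1, 1, 3, 3)

result = conv2d_im2col(image, sobel_x)
print(result.shape)   # (1, 4, 4)
print(result[0])      # every row is [0. 4. 4. 0.]
```

The key design point is that the patch flattening order in `im2col` matches the order used when reshaping the filters, so each row-times-column product is exactly one sliding-window dot product.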
PyTorch Conv2d: Complete Reference
import torch
import torch.nn as nn
import torch.nn.functional as F
# ---- Parameter explanations ----
conv = nn.Conv2d(
in_channels=3, # C_in: number of input channels (3 for RGB)
out_channels=64, # C_out: number of output feature maps (= number of filters)
kernel_size=3, # K: filter height and width (or tuple (kH, kW) for non-square)
stride=1, # S: sliding step (1=dense scan, 2=halve spatial dimensions)
padding=1, # P: zero-padding on each side (1 = "same" size for K=3, S=1)
dilation=1, # d: spacing between kernel elements (>1 = atrous convolution)
groups=1, # G: grouped convolution (groups=C_in gives depthwise conv)
bias=True, # whether to add a learnable bias (set False if BatchNorm follows)
padding_mode='zeros' # 'zeros', 'reflect', 'replicate', or 'circular'
)
x = torch.randn(8, 3, 224, 224) # (batch, C_in, H, W) - standard NCHW format
out = conv(x)
print(f"Input: {x.shape}") # torch.Size([8, 3, 224, 224])
print(f"Output: {out.shape}") # torch.Size([8, 64, 224, 224])
# Parameter count: 64 filters * (3 * 3 * 3) weights + 64 biases = 1,792
print(f"Parameters: {sum(p.numel() for p in conv.parameters()):,}") # 1,792
# ---- "Same" output size: padding = K // 2 for odd K, stride=1 ----
for K in [1, 3, 5, 7]:
    conv_same = nn.Conv2d(3, 64, kernel_size=K, padding=K // 2)
    out = conv_same(x)
    print(f"K={K}, pad={K//2}: output {out.shape}")  # Always [8, 64, 224, 224]
# ---- Stride=2: halve spatial dimensions ----
conv_s2 = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
out_s2 = conv_s2(x)
print(f"Stride-2 output: {out_s2.shape}") # [8, 64, 112, 112]
# ---- Depthwise conv: groups = in_channels ----
# One K x K filter per channel, no cross-channel mixing
depthwise = nn.Conv2d(
in_channels=64,
out_channels=64, # must equal in_channels when groups=in_channels
kernel_size=3,
padding=1,
groups=64, # THIS makes it depthwise - each input channel gets its own filter
bias=False
)
pointwise = nn.Conv2d(64, 128, kernel_size=1, bias=False) # channel mixing
x64 = torch.randn(8, 64, 56, 56)
out_dw = depthwise(x64) # (8, 64, 56, 56) - spatial mixing, same channels
out_pw = pointwise(out_dw) # (8, 128, 56, 56) - channel mixing
print(f"Depthwise output: {out_dw.shape}")
print(f"Pointwise output: {out_pw.shape}")
# Compare parameter counts
std_conv = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
dw_params = sum(p.numel() for p in depthwise.parameters())
pw_params = sum(p.numel() for p in pointwise.parameters())
std_params = sum(p.numel() for p in std_conv.parameters())
print(f"Standard 3x3: {std_params:,} parameters") # 73,728
print(f"DW + PW: {dw_params + pw_params:,}") # 8,768
print(f"Reduction: {std_params / (dw_params + pw_params):.1f}x") # 8.4x
# ---- Dilated (atrous) convolution ----
# dilation=2: 3x3 kernel whose taps are spaced 2 pixels apart (1-pixel gaps) -> 5x5 effective receptive field
# For "same" size with odd K and stride 1: padding = dilation * (K-1) // 2
dilated = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)
out_dil = dilated(x64)
print(f"Dilated output: {out_dil.shape}") # (8, 64, 56, 56) - same size, bigger RF
# ---- nn.functional vs nn.Module ----
# Use nn.Conv2d for standard model building (weights stored in model, autograd tracked)
# Use F.conv2d when you need to pass weights explicitly
custom_weight = torch.randn(64, 3, 3, 3) # (C_out, C_in, K, K)
custom_bias = torch.randn(64)
out_fn = F.conv2d(x, custom_weight, custom_bias, stride=1, padding=1)
print(f"F.conv2d output: {out_fn.shape}") # (8, 64, 224, 224)
Depthwise Separable Conv as a Reusable Module
import torch
import torch.nn as nn
class DepthwiseSeparableConv(nn.Module):
    """
    Factored convolution: depthwise spatial mixing + pointwise channel mixing.
    Roughly K^2 times fewer parameters than a standard conv when C_out is large.
    Used in: MobileNet (v1/v2/v3), EfficientNet, Xception, ShuffleNet.

    Args:
        in_channels: number of input channels
        out_channels: number of output channels
        kernel_size: spatial kernel size (default 3)
        stride: stride for the depthwise conv (use 2 to downsample)
        padding: padding for the depthwise conv
    """

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int = 3,
        stride: int = 1,
        padding: int = 1,
    ):
        super().__init__()
        # Depthwise: one K x K filter per channel - spatial mixing only
        self.depthwise = nn.Conv2d(
            in_channels,
            in_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            groups=in_channels,  # The key: each channel processed independently
            bias=False
        )
        self.bn_dw = nn.BatchNorm2d(in_channels)
        # Pointwise: 1x1 conv for channel mixing - no spatial information used
        self.pointwise = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=1,
            bias=False
        )
        self.bn_pw = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU6(inplace=True)  # ReLU6 is standard in MobileNet

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Step 1: spatial mixing (no cross-channel interaction)
        x = self.relu(self.bn_dw(self.depthwise(x)))
        # Step 2: channel mixing (no spatial interaction)
        x = self.relu(self.bn_pw(self.pointwise(x)))
        return x
# Usage and parameter comparison
dw_sep = DepthwiseSeparableConv(64, 128, kernel_size=3)
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
dw_params = sum(p.numel() for p in dw_sep.parameters())
std_params = sum(p.numel() for p in standard.parameters())
print(f"Standard conv: {std_params:,} parameters") # 73,728
print(f"Depthwise sep: {dw_params:,} parameters") # 9,152 (8,768 conv + 384 BN scale/shift)
print(f"Reduction: ~{std_params / 8768:.1f}x (excluding BN)")
# Verify output shapes match
x = torch.randn(4, 64, 56, 56)
print(f"Input: {x.shape}")
print(f"DW-sep: {dw_sep(x).shape}") # (4, 128, 56, 56)
print(f"Std: {standard(x).shape}") # (4, 128, 56, 56)
Visualizing What CNNs Learn
One of the most illuminating things you can do as a CV engineer is visualize the first-layer filters of a pretrained network. The results validate the story of hierarchical feature learning - and they closely resemble the Gabor-like receptive fields that neuroscientists have measured in the primary visual cortex (V1).
import torch
import torchvision.models as models
import matplotlib.pyplot as plt
import numpy as np
def visualize_first_layer_filters(model_name: str = "resnet50") -> None:
    """
    Load a pretrained model and visualize its first-layer convolutional filters.
    First-layer filters operate on RGB directly, so we can display them as images.
    """
    model = getattr(models, model_name)(weights="IMAGENET1K_V1")
    model.eval()
    # ResNet: model.conv1 | VGG: model.features[0] | EfficientNet: model.features[0][0]
    first_conv = model.conv1
    weights = first_conv.weight.data.cpu()  # shape: (64, 3, 7, 7) for ResNet-50
    print(f"First conv filter shape: {weights.shape}")
    # (C_out=64, C_in=3, kH=7, kW=7)
    # Normalize each filter independently to [0, 1] for display
    n_filters = weights.shape[0]
    w_min = weights.amin(dim=(1, 2, 3), keepdim=True)  # per-filter minimum
    w_max = weights.amax(dim=(1, 2, 3), keepdim=True)  # per-filter maximum
    weights_norm = (weights - w_min) / (w_max - w_min + 1e-8)
    cols = 8
    rows = (n_filters + cols - 1) // cols
    # squeeze=False keeps axes 2D even when there is a single row
    fig, axes = plt.subplots(rows, cols, figsize=(cols * 1.5, rows * 1.5), squeeze=False)
    fig.suptitle(f"{model_name} - First Layer Filters (C_out={n_filters}, K={weights.shape[-1]})")
    for idx in range(rows * cols):
        ax = axes[idx // cols][idx % cols]
        if idx < n_filters:
            # weights[idx]: (3, K, K) -> transpose to (K, K, 3) for imshow
            filter_img = weights_norm[idx].permute(1, 2, 0).numpy()
            ax.imshow(np.clip(filter_img, 0, 1))
        ax.axis("off")
    plt.tight_layout()
    plt.savefig(f"{model_name}_filters.png", dpi=150, bbox_inches="tight")
    print("Saved. You should see: oriented edges at various angles, color blobs,")
    print("and Gabor-like patterns - exactly what V1 neurons respond to.")
def trace_spatial_dimensions(input_size: int = 224) -> None:
    """Trace how spatial dimensions change through a ResNet-50 backbone."""
    def out_size(h: int, k: int, s: int, p: int) -> int:
        return (h + 2 * p - k) // s + 1

    # (name, kernel, stride, padding) of the layer that sets each stage's resolution
    configs = [
        ("stem conv 7x7 s2", 7, 2, 3),
        ("maxpool 3x3 s2", 3, 2, 1),
        ("layer1 (3x3 s1)", 3, 1, 1),
        ("layer2 (3x3 s2)", 3, 2, 1),
        ("layer3 (3x3 s2)", 3, 2, 1),
        ("layer4 (3x3 s2)", 3, 2, 1),
    ]
    print(f"\nSpatial dimension trace for {input_size}x{input_size} input:")
    h_in = input_size
    for name, k, s, p in configs:
        # Each stage consumes the previous stage's output size
        h_out = out_size(h_in, k, s, p)
        print(f" {name:28s}: {h_in:>4}x{h_in:<4} -> {h_out}x{h_out}")
        h_in = h_out
    # Global average pooling collapses whatever spatial size remains to 1x1
    print(f" {'global avg pool':28s}: {h_in:>4}x{h_in:<4} -> 1x1")
Common Mistakes
Wrong channel order (NCHW vs NHWC). PyTorch uses (batch, channels, height, width). TensorFlow defaults to (batch, height, width, channels), and images loaded as NumPy arrays (for example via OpenCV or PIL) are usually (height, width, channels). If you build a custom layer or import weights from another framework without transposing, you will get silently wrong results or dimension mismatches. Always verify: weight.shape for a Conv2d should be (C_out, C_in / groups, K, K).
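As a small illustration of the transposes involved, the sketch below converts a channels-last batch and a channels-last kernel to PyTorch's layout with NumPy. The array sizes are arbitrary, and the (kH, kW, C_in, C_out) kernel layout is the one used by Keras Conv2D kernels; check the layout of whatever framework you are actually importing from.

```python
import numpy as np

# Hypothetical channels-last batch: (N, H, W, C)
batch_nhwc = np.arange(2 * 4 * 5 * 3, dtype=np.float32).reshape(2, 4, 5, 3)
# Move the channel axis to position 1: (N, H, W, C) -> (N, C, H, W)
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))
print(batch_nhwc.shape, "->", batch_nchw.shape)

# Hypothetical channels-last kernel: (kH, kW, C_in, C_out)
kernel_hwio = np.arange(3 * 3 * 4 * 8, dtype=np.float32).reshape(3, 3, 4, 8)
# PyTorch Conv2d expects (C_out, C_in, kH, kW)
kernel_oihw = np.transpose(kernel_hwio, (3, 2, 0, 1))
print(kernel_hwio.shape, "->", kernel_oihw.shape)

# Spot check: the same element is reachable under both index orders
n, c, h, w = 1, 2, 3, 4
same_element = batch_nchw[n, c, h, w] == batch_nhwc[n, h, w, c]
print("Same element after transpose:", bool(same_element))
```

A reshape would also produce the right shape here but would scramble the data, which is exactly the silent failure described above; the element-level spot check catches that.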
Forgetting to normalize inputs. Pretrained CNNs expect inputs preprocessed the same way as during training: typically pixel values scaled to [0, 1] and then standardized per channel to roughly zero mean and unit variance. If you train from scratch on raw uint8 values [0, 255], the first-layer weights can adapt to that scale - but pretrained weights (tuned for normalized inputs) will produce badly scaled activations. For ImageNet-pretrained torchvision models, scale to [0, 1] first (transforms.ToTensor() does this) and then apply transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]).
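The arithmetic behind that preprocessing can be sketched in plain NumPy. The function name `preprocess` is illustrative; the mean and standard deviation are the ImageNet values quoted above, and the input is assumed to be an (H, W, 3) uint8 RGB array.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_uint8: np.ndarray) -> np.ndarray:
    """(H, W, 3) uint8 RGB -> (3, H, W) float32, ImageNet-normalized."""
    x = image_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD      # broadcasts over the channel axis
    return np.transpose(x, (2, 0, 1))           # HWC -> CHW for PyTorch

white = np.full((4, 4, 3), 255, dtype=np.uint8)
black = np.zeros((4, 4, 3), dtype=np.uint8)
out_white = preprocess(white)
out_black = preprocess(black)
print("shape:", out_white.shape, "dtype:", out_white.dtype)
print("red-channel range:", float(out_black[0, 0, 0]), "to", float(out_white[0, 0, 0]))
```

After normalization the values span a few units on either side of zero rather than 0 to 255, which is the scale the pretrained first-layer weights were fitted to.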
Using bias=True when followed by BatchNorm. BatchNorm subtracts the per-channel mean, which cancels any constant conv bias, and it has its own learnable shift parameter (beta) that plays the same role. If you use both, the conv bias is redundant and wastes parameters. Set bias=False in every conv layer that feeds directly into BatchNorm. This is standard practice in ResNet, EfficientNet, and most modern architectures.
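The redundancy is easy to check numerically. The sketch below uses a hand-written training-mode batch norm without gamma and beta (an illustrative `batch_norm` helper, not the PyTorch module) and random stand-in activations.

```python
import numpy as np

def batch_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Training-mode batch norm per channel over (N, H, W), without gamma/beta."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
conv_out = rng.normal(size=(8, 4, 5, 5))  # stand-in for a bias-free conv output (N, C, H, W)
bias = rng.normal(size=(1, 4, 1, 1))      # one constant per channel, like a conv bias

# Adding a per-channel constant shifts the mean by that constant and leaves
# the variance unchanged, so the normalized output is identical.
bias_cancelled = np.allclose(batch_norm(conv_out), batch_norm(conv_out + bias))
print("BatchNorm output unchanged by conv bias:", bias_cancelled)
```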
Using flattening + FC layers instead of Global Average Pooling. Flatten -> Linear(512*7*7, 4096) has about 103 million parameters and only works when the final feature map is exactly 7x7. AdaptiveAvgPool2d(1) -> Linear(512, num_classes) has 512 * num_classes weights and works on any input size. Whenever you see x.view(batch, -1) as the transition from convolutional to linear layers in a classifier, ask if GAP would work instead.
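The two head sizes can be checked with a few lines of arithmetic. `linear_params` is an illustrative helper that also counts biases, and `num_classes = 10` is an arbitrary assumption.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameter count of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

num_classes = 10  # hypothetical number of classes

flatten_head = linear_params(512 * 7 * 7, 4096)  # Flatten -> Linear(25088, 4096)
gap_head = linear_params(512, num_classes)       # GAP -> Linear(512, num_classes)

print(f"Flatten + FC head: {flatten_head:,}")  # 102,764,544
print(f"GAP + FC head:     {gap_head:,}")      # 5,130
```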
Not using pretrained weights. If your dataset has fewer than ~100,000 images, you should almost always start from ImageNet-pretrained weights. The early layers learn general visual features (edges, textures, shapes) that transfer to most visual tasks. Training from scratch on small datasets leads to overfitting unless you design a very small architecture specifically for your scale. Start with weights='IMAGENET1K_V1', freeze early layers, fine-tune late layers.
Forgetting to adjust padding when adding dilation. For a dilated conv with dilation d and odd kernel size K at stride 1, the "same" padding is P = d * (K-1) / 2. For K=3, d=2: padding=2. For K=3, d=4: padding=4. Leaving the padding at its undilated value is a common bug that silently shrinks feature map sizes - the layer itself runs without error, but the spatial dimensions are smaller than expected, which surfaces as a shape mismatch when you try to add a skip connection.
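A quick way to catch this before training is to compute the output size by hand. The sketch below uses the output-size formula documented for PyTorch's Conv2d; the helper names `conv_out_size` and `same_padding` are illustrative.

```python
def conv_out_size(h: int, k: int, stride: int = 1, padding: int = 0, dilation: int = 1) -> int:
    """Output size: floor((h + 2p - d*(k-1) - 1) / s) + 1."""
    return (h + 2 * padding - dilation * (k - 1) - 1) // stride + 1

def same_padding(k: int, dilation: int = 1) -> int:
    """Padding that preserves size for odd k at stride 1."""
    return dilation * (k - 1) // 2

h = 56
# Bug: dilation added, padding left at the undilated value of 1
shrunk = conv_out_size(h, k=3, padding=1, dilation=2)
# Fix: padding recomputed for the dilation
fixed_d2 = conv_out_size(h, k=3, padding=same_padding(3, dilation=2), dilation=2)
fixed_d4 = conv_out_size(h, k=3, padding=same_padding(3, dilation=4), dilation=4)

print(f"d=2, padding=1: {h} -> {shrunk}")    # 56 -> 54
print(f"d=2, padding=2: {h} -> {fixed_d2}")  # 56 -> 56
print(f"d=4, padding=4: {h} -> {fixed_d4}")  # 56 -> 56
```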
Interview Q&A
Q1: Why do CNNs use weight sharing, and what inductive bias does it encode?
Weight sharing means the same filter weights are applied at every spatial position. This encodes translation equivariance: the network assumes that a useful feature at one location is equally useful at any other location. Concretely, a horizontal edge detector at position (10, 20) and at position (80, 150) should detect the same thing - so they should use the same weights. This reduces the parameter count from O(H * W * C_in * C_out) to O(K^2 * C_in * C_out), completely independent of image resolution. For a 224x224 RGB image with 64 first-layer 3x3 filters, this is ~5,400x fewer parameters than a fully connected layer with 64 outputs (1,792 vs. about 9.6 million). The practical consequence: CNNs generalize far better from small datasets because they have orders of magnitude fewer parameters to overfit. The inductive bias encoded is: "visual features are local, and what a feature looks like does not depend on where it appears."
Q2: What is the receptive field and why does it matter for network design?
The receptive field is the region of the original input that can influence a given neuron's output. For a stack of stride-1 3x3 conv layers, it grows by 2 pixels per layer: after L layers, the RF is 1 + 2L pixels. Stride and pooling grow the RF faster: a stride-2 layer doubles the effective distance that subsequent layers "see" in input space. This matters because: (1) neurons in the final layer need a large enough RF to see the entire object they are classifying; (2) for dense prediction tasks like segmentation, you need large RF without destroying spatial resolution, which motivated dilated convolutions; (3) if your RF is too small, the network cannot integrate global context - it sees object parts but not whole objects. A common interview error is assuming deep networks automatically have large enough RFs - you need to calculate this for your specific architecture.
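The calculation recommended above can be sketched with the standard recurrence: each layer adds (K - 1) times the current "jump" (the product of all earlier strides) to the receptive field. The function name `receptive_field` and the layer lists are illustrative.

```python
def receptive_field(layers: list) -> int:
    """Receptive field of one output unit, given (kernel, stride) per layer."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth is scaled by the accumulated stride
        jump *= stride
    return rf

five_plain = receptive_field([(3, 1)] * 5)                 # 1 + 2L with L = 5
three_plain = receptive_field([(3, 1)] * 3)
with_stride = receptive_field([(3, 2), (3, 1), (3, 1)])    # same depth, first layer stride 2

print(f"5 x (3x3, s1):        RF = {five_plain}")   # 11
print(f"3 x (3x3, s1):        RF = {three_plain}")  # 7
print(f"(3x3, s2) + 2 x s1:   RF = {with_stride}")  # 11
```

Three layers with one stride-2 step reach the same receptive field as five stride-1 layers, which is the "stride grows the RF faster" effect described in the answer.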
Q3: Why do modern CNNs use 3x3 convolutions almost exclusively instead of larger kernels like 7x7?
Two stacked 3x3 convolutions have the same 5x5 receptive field as one 5x5 convolution, but fewer parameters (2 * 9 * C^2 = 18*C^2 vs. 25*C^2) and two non-linearities instead of one. Three stacked 3x3 convolutions match a 7x7 RF with 27*C^2 vs 49*C^2 parameters - 45% fewer - plus three non-linearities. More non-linearities mean more representational power. The VGG paper (Simonyan & Zisserman, 2014) established this empirically and the industry has used 3x3 convolutions almost exclusively since. The stem convolution (the very first layer) often uses 7x7 or two 3x3s to capture large-scale structure from raw pixels, but deep in the network 3x3 dominates.
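The parameter arithmetic in this answer can be verified directly. The sketch assumes C input and C output channels in every layer and ignores biases; `C = 64` is an arbitrary choice, since the ratios do not depend on it.

```python
def conv_weights(k: int, channels: int) -> int:
    """Weights in a k x k conv with `channels` inputs and outputs (no bias)."""
    return k * k * channels * channels

C = 64  # hypothetical channel width

two_3x3 = 2 * conv_weights(3, C)    # 18 * C^2
one_5x5 = conv_weights(5, C)        # 25 * C^2
three_3x3 = 3 * conv_weights(3, C)  # 27 * C^2
one_7x7 = conv_weights(7, C)        # 49 * C^2

saving_vs_7x7 = 1 - three_3x3 / one_7x7
print(f"Two 3x3:   {two_3x3:,} vs one 5x5: {one_5x5:,}")
print(f"Three 3x3: {three_3x3:,} vs one 7x7: {one_7x7:,}")
print(f"Saving vs 7x7: {saving_vs_7x7:.0%}")  # 45%
```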
Q4: What do 1x1 convolutions actually do? Give two concrete uses.
A 1x1 convolution applies a learned linear combination across channels at each spatial position independently - it mixes channels without any spatial information. Two concrete uses: (1) Bottleneck dimensionality reduction: before an expensive 3x3 conv, use a 1x1 conv to reduce channels from 256 to 64. The 3x3 conv operates on cheap 64-channel maps, then a 1x1 expands back to 256. This is the ResNet-50 bottleneck, which uses about 8.5x fewer weights than a single 3x3 conv at 256 channels (69,632 vs. 589,824). (2) Channel count adjustment for skip connections: when a residual skip connection needs to match the channel count of the main path (because channels were increased in the block), a 1x1 conv adjusts the channel count without spatial mixing.
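The bottleneck saving can be reproduced with a short count. The sketch ignores biases and BatchNorm parameters and compares against a single 3x3 conv at 256 channels; `conv_weights` is an illustrative helper.

```python
def conv_weights(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k conv from c_in to c_out channels (no bias)."""
    return c_in * c_out * k * k

# 1x1 reduce -> 3x3 at reduced width -> 1x1 expand
bottleneck = (
    conv_weights(256, 64, 1)
    + conv_weights(64, 64, 3)
    + conv_weights(64, 256, 1)
)
plain_3x3 = conv_weights(256, 256, 3)

print(f"Bottleneck (1x1, 3x3, 1x1): {bottleneck:,}")  # 69,632
print(f"Single 3x3 at 256 channels: {plain_3x3:,}")   # 589,824
print(f"Reduction: {plain_3x3 / bottleneck:.1f}x")    # 8.5x
```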
Q5: What makes depthwise separable convolutions efficient, and approximately how much faster are they?
A standard convolution performs spatial mixing (across the K x K window) and channel mixing (across all C_in channels) simultaneously, costing K^2 * C_in * C_out multiply-accumulates per output location. Depthwise separable factorizes this: (1) depthwise conv - one K x K filter per input channel, K^2 * C_in operations per output location, purely spatial; (2) pointwise 1x1 conv - C_in * C_out operations per output location, purely channel. The total is K^2 * C_in + C_in * C_out. The cost relative to a standard conv is therefore 1/C_out + 1/K^2 of the original. For K=3 and large C_out: roughly 8-9x fewer FLOPs. MobileNetV1 achieved this reduction with only ~1% top-1 accuracy drop on ImageNet, making it practical for mobile and edge deployment.
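The cost ratio can be checked numerically for a few widths. The helper names are illustrative, and the counts are multiply-accumulates per output location, as in the answer above.

```python
def standard_macs(k: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulates per output location for a standard conv."""
    return k * k * c_in * c_out

def separable_macs(k: int, c_in: int, c_out: int) -> int:
    """Depthwise (k*k*c_in) plus pointwise (c_in*c_out) per output location."""
    return k * k * c_in + c_in * c_out

def speedup(k: int, c_in: int, c_out: int) -> float:
    return standard_macs(k, c_in, c_out) / separable_macs(k, c_in, c_out)

k, c_in = 3, 64
for c_out in (32, 128, 1024):
    ratio = separable_macs(k, c_in, c_out) / standard_macs(k, c_in, c_out)
    closed_form = 1 / c_out + 1 / k**2  # should agree with the measured ratio
    print(f"C_out={c_out:5d}: ratio={ratio:.4f}, 1/C_out + 1/K^2={closed_form:.4f}, "
          f"speedup={speedup(k, c_in, c_out):.2f}x")
```

The speedup approaches K^2 = 9 from below as C_out grows, which is why the "9x" figure is an upper bound rather than a typical value for narrow layers.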
Q6: Your CNN is overfitting on a 5,000-image medical dataset. You have tried dropout and weight decay. What architectural changes would you make?
First: switch to a pretrained backbone (ResNet-18 or EfficientNet-B0) and fine-tune only the final 1-2 layers - the early layers learn general visual features that transfer. Second: reduce the model capacity - use fewer channels, shallower network, or replace standard convolutions with depthwise separable convolutions. Third: use global average pooling instead of flattening to a fully connected layer - GAP has far fewer parameters and acts as a strong regularizer by forcing spatially distributed representations. Fourth: aggressive data augmentation (random crops, flips, color jitter, mixup) effectively multiplies your dataset size. Fifth: if the medical images have specific characteristics (e.g., grayscale, unusual color distributions), re-examine whether ImageNet normalization is appropriate - domain-specific normalization may help.
Q7: Explain the difference between translation equivariance and translation invariance in CNNs.
Equivariance means the output shifts when the input shifts: if a cat moves 5 pixels right in the input, the activation of the "cat" feature detector also shifts 5 pixels right in the feature map. This is what convolutional layers provide - the filter "slides" with the feature. Invariance means the output does not change when the input shifts: whether the cat is in the top-left or bottom-right, the classification is "cat." This is what pooling layers and global average pooling provide - by aggregating over spatial positions, small translations no longer change the pooled output. A well-designed CNN exploits equivariance through the convolutional layers (preserving spatial location for dense tasks) and achieves invariance through global aggregation at the end (for classification tasks).
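Both properties can be observed on a toy example. The sketch below uses a hand-written stride-1 cross-correlation, an all-ones 3x3 kernel as a hypothetical "blob detector", and a 12x12 image; all of these are illustrative choices, and the blob is kept away from the borders so that no edge effects interfere.

```python
import numpy as np

def cross_correlate(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Stride-1, no-padding 2D cross-correlation."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.ones((3, 3))                   # hypothetical blob detector
image = np.zeros((12, 12))
image[4:7, 3:6] = 1.0                      # a 3x3 blob
shifted = np.roll(image, shift=4, axis=1)  # same blob, 4 pixels to the right

fmap = cross_correlate(image, kernel)
fmap_shifted = cross_correlate(shifted, kernel)

# Equivariance: the feature map shifts by the same 4 pixels
equivariant = np.allclose(np.roll(fmap, 4, axis=1), fmap_shifted)
# Invariance: global pooling gives the same value regardless of position
invariant = np.isclose(fmap.max(), fmap_shifted.max()) and np.isclose(fmap.mean(), fmap_shifted.mean())

peak = tuple(int(v) for v in np.unravel_index(fmap.argmax(), fmap.shape))
peak_shifted = tuple(int(v) for v in np.unravel_index(fmap_shifted.argmax(), fmap_shifted.shape))
print("Peak location:", peak, "->", peak_shifted)       # (4, 3) -> (4, 7)
print("Feature map shifts with input:", bool(equivariant))
print("Globally pooled value unchanged:", bool(invariant))
```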
