Pooling, Strides, and Padding
Reading Time: ~35 min | Interview Relevance: High | Target Roles: MLE, CV Engineer, Applied Scientist
You are three weeks into debugging a CNN trained on a medical imaging dataset - chest X-rays, 512×512 pixels. The model sits at 78% AUC and refuses to budge. You try heavier augmentation, learning rate schedules, label smoothing. Nothing moves the needle by more than half a percent.
A colleague suggests something unexpected: replace the two max pooling layers in the backbone with stride-2 convolutions. You make the change in twenty lines of code, retrain overnight, and wake up to 83% AUC - a 5-point improvement from a single architectural change that took ten minutes to implement.
Your colleague explains it simply: max pooling decides what to throw away using a fixed rule - keep the maximum, discard the rest. A stride-2 convolution lets the network learn what to keep. For medical imaging, where a subtle shadow or a faint nodule might be diagnostically critical, that difference matters.
But why does throwing away values hurt here? Why does it not hurt for ImageNet classification? When does max pooling work well and when does it fail? What is the relationship between stride, padding, and receptive field, and how do you reason about spatial dimensions without running the model? This lesson answers all of it - from the physics of why we downsample at all, through the mechanics of every pooling variant, to production debugging techniques.
Why We Need Spatial Downsampling
Start with a concrete problem. An input image is 224×224 pixels. Pass it through a 3×3 conv with 64 filters, same padding, stride 1. The output is 224×224×64. Another conv with 128 filters: 224×224×128. Another with 256 filters: 224×224×256.
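The memory cost of that last feature map, at batch size 32 and 4 bytes per float32 value:
$$32 \times 256 \times 224 \times 224 \times 4 \text{ bytes} = 1{,}644{,}167{,}168 \text{ bytes} \approx 1.64 \text{ GB}$$
That is just one layer's output. Add activations for every other layer, plus gradients for backprop, and you are out of GPU memory within a few layers.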
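The arithmetic behind that estimate can be sketched in a few lines. This is a minimal illustration, assuming float32 activations (4 bytes per value); `activation_bytes` is a hypothetical helper name, not part of any library.

```python
def activation_bytes(batch, channels, height, width, bytes_per_value=4):
    """Memory for one activation tensor, assuming float32 (4 bytes) by default."""
    return batch * channels * height * width * bytes_per_value


# The three stride-1 conv outputs from the example, at batch size 32
for channels in (64, 128, 256):
    size = activation_bytes(32, channels, 224, 224)
    print(f"{channels:>4} channels: {size / 1e9:.2f} GB")
# 64 channels -> 0.41 GB, 128 -> 0.82 GB, 256 -> 1.64 GB
```

Training roughly doubles these numbers, because each activation is kept for the backward pass alongside its gradient.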
But memory is not the deepest problem. The deeper issue is receptive field. A stack of stride-1 3×3 convolutions grows the receptive field by 2 pixels per layer. After 10 layers, each output neuron sees a 21×21 patch of the original image. For a 224×224 image, you need about 112 layers to give each neuron a full-image receptive field ($1 + 2 \times 112 = 225 \ge 224$). That is impractical - you would spend all your compute churning through 224×224 feature maps, and the network would integrate global context only at the very end, if at all.
Downsampling solves both problems simultaneously. Halving the spatial dimensions at layer $l$ multiplies the effective input-space jump of every subsequent filter by 2. After one stride-2 downsampling, a 3×3 conv in the next layer spans a 5×5 region of the original input (counting only the stride, not the downsampling kernel's own extent). After two downsampling steps, it spans a 9×9 region - without adding layers.
The three reasons we downsample, in order of importance:
- Grow the receptive field faster. Each downsampling step doubles the input-space coverage of all subsequent operations. After five non-overlapping 2×2, stride-2 steps, a single 3×3 conv covers a 96×96 region of the input.
- Reduce computation. Feature map memory and FLOP cost scale with $H \times W$. Halving both dimensions reduces computation by 4×.
- Build translation invariance. After pooling or strided convolution, a feature that shifts by 1 pixel in the input may not change the output at all - the network becomes insensitive to small positional variations.
The tradeoff is information loss. Downsampling always discards spatial precision. For image classification, this is fine - you do not care exactly where the object is, just whether it is present. For object detection, semantic segmentation, and pose estimation, you cannot afford to discard spatial information aggressively - this is why architectures like U-Net and Feature Pyramid Networks carefully reconstruct spatial resolution after downsampling.
Max Pooling: The Classic Approach
Max pooling takes the maximum value within a sliding $K \times K$ window, with stride $S$:
$$y_{c,i,j} = \max_{0 \le m,\,n < K} x_{c,\; iS + m,\; jS + n}$$
Visualized with a 2×2 max pool on a 4×4 feature map (stride 2):
Input (4×4):
┌────┬────┬────┬────┐
│  1 │  3 │  2 │  4 │
├────┼────┼────┼────┤
│  5 │  7 │  6 │  8 │
├────┼────┼────┼────┤
│  3 │  2 │  1 │  9 │
├────┼────┼────┼────┤
│ 11 │  4 │  5 │  6 │
└────┴────┴────┴────┘
Output (2×2) after the 2×2 max pool:
┌────┬────┐
│  7 │  8 │
├────┼────┤
│ 11 │  9 │
└────┴────┘
Top-left 2×2: max(1,3,5,7)=7
Top-right 2×2: max(2,4,6,8)=8
Bottom-left 2×2: max(3,2,11,4)=11
Bottom-right 2×2: max(1,9,5,6)=9
The logic: if a feature detector fires strongly anywhere within the pooling window, we want to know it fired - not exactly where. A cat's ear might be 1 pixel to the left or right depending on the exact image crop; max pooling makes the network robust to that shift.
The biological analogy. David Hubel and Torsten Wiesel's Nobel-winning work on cat visual cortex (1959) distinguished "simple cells" (respond to oriented edges at specific positions - like conv filters) and "complex cells" (respond to oriented edges anywhere within a broader region - like max pooling). The CNN design mirrors this two-stage detection hierarchy.
What max pooling actually discards. A 2×2 max pool with stride 2 keeps 1 value and throws away 3 - discarding 75% of all values. There is no learning involved. The network has no say in which values survive. For a classification task where you just need to know "was this feature present?", that is fine. For a task where the precise spatial arrangement of features matters, you are throwing away potentially critical information.
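To make the mechanics concrete, here is a minimal pure-Python sketch of max pooling on a nested list, reproducing the 4×4 example above. `max_pool2d` is a hypothetical helper written for illustration (no padding, square window); in practice you would use `nn.MaxPool2d`.

```python
def max_pool2d(grid, k=2, s=2):
    """Max-pool a 2D list of numbers with a k×k window and stride s (no padding)."""
    h, w = len(grid), len(grid[0])
    out_h = (h - k) // s + 1
    out_w = (w - k) // s + 1
    return [
        [
            max(grid[i * s + m][j * s + n] for m in range(k) for n in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [3, 2, 1, 9],
    [11, 4, 5, 6],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[7, 8], [11, 9]]

# Fraction of values that survive a 2×2, stride-2 pool
kept = sum(len(row) for row in pooled) / sum(len(row) for row in feature_map)
print(kept)  # 0.25
```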
Average Pooling
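Average pooling computes the mean over each $K \times K$ window (stride $S$):
$$y_{c,i,j} = \frac{1}{K^2} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} x_{c,\; iS + m,\; jS + n}$$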
Average pooling preserves the average response across the window. If a feature fires weakly everywhere in the window, average pooling preserves that weak signal. Max pooling would only keep the strongest response and ignore the distributed pattern.
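The difference shows up when comparing a single strong response against a weak but uniform one. The sketch below uses two made-up 2×2 windows (the values are illustrative, not from any dataset), and `pool_window` is a hypothetical helper.

```python
def pool_window(window, mode):
    """Reduce one 2D window to a scalar with either 'max' or 'avg'."""
    values = [v for row in window for v in row]
    return max(values) if mode == "max" else sum(values) / len(values)


spike = [[0.0, 0.0], [0.0, 0.8]]     # one strong response, nothing elsewhere
diffuse = [[0.5, 0.5], [0.5, 0.5]]   # weak response at every position

for name, window in [("spike", spike), ("diffuse", diffuse)]:
    print(name, "max:", pool_window(window, "max"), "avg:", pool_window(window, "avg"))
```

Max pooling ranks the spike above the diffuse pattern; average pooling ranks them the other way round. Neither is right in general, which is part of the argument for letting the network learn the aggregation.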
When to use average pooling over max pooling for spatial downsampling: almost never in modern practice. Strided convolutions are usually preferred for intermediate downsampling because the aggregation is learned. But average pooling has one dominant use case where it is the near-universal choice: global average pooling at the final spatial aggregation step.
Global Average Pooling: The Classification Head Standard
Global average pooling (GAP) reduces the entire spatial extent of a feature map to a single scalar per channel:
$$y_c = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{c,i,j}$$
Input: $(B, C, H, W)$ for any $H$ and $W$. Output: $(B, C)$, always.
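A minimal sketch of that shape independence, using nested lists for a single sample laid out as (C, H, W). `global_avg_pool` is a hypothetical helper for illustration; the PyTorch equivalents appear in the reference section later.

```python
def global_avg_pool(feature_map):
    """feature_map: list of C channels, each an H×W list of lists. Returns C floats."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


small = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]            # C=2, 2×2
large = [[[4.0] * 5 for _ in range(3)], [[1.0] * 5 for _ in range(3)]]  # C=2, 3×5

print(global_avg_pool(small))  # [4.0, 2.0]
print(global_avg_pool(large))  # [4.0, 1.0]
```

Both inputs produce one value per channel, regardless of their spatial size.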
GAP replaced the traditional "Flatten → FC(4096) → FC(4096) → FC(num_classes)" head. The replacement was proposed by Lin et al. in "Network in Network" (2013) and popularized by the original ResNet (He et al., 2015). Let's understand why it won.
VGG-16's final layers vs GAP:
VGG-16 classification head:
Input feature map: (batch, 512, 7, 7)
Flatten: (batch, 512 × 7 × 7) = (batch, 25,088)
FC(25088 → 4096): 102,764,544 parameters ← most of the model is here
FC(4096 → 4096): 16,781,312 parameters
FC(4096 → 1000): 4,097,000 parameters
Head total: ~124M parameters - out of ~138M for the whole model
ResNet-50 classification head (GAP version):
Input feature map: (batch, 2048, 7, 7)
GlobalAvgPool: (batch, 2048)
Linear(2048 → 1000): 2,049,000 parameters
Total head: ~2M parameters
GAP achieves this roughly 60× parameter reduction in the head while delivering better accuracy. Four reasons:
No fixed input size requirement. The FC head requires a fixed input - you cannot run a VGG on a 384×384 image without retraining. GAP works on any spatial resolution, because it averages over whatever it receives. This is critical for deployment where you want to run inference on images of varying sizes.
Stronger regularization. The FC layers in VGG are where the model stores most of its capacity - and most of its propensity to overfit. GAP forces the network to encode classification-relevant information spatially distributed across the feature map. Each channel must summarize a semantically meaningful pattern across the entire image, which is a stronger constraint that happens to generalize better.
Class Activation Maps (CAMs). GAP is the prerequisite for the CAM technique (Zhou et al., 2016), which produces heatmaps showing which spatial regions contributed to a classification decision. This requires that the final classification weights can be projected back to spatial positions - which is only possible when the head is a single linear layer on top of GAP output, not a deep FC stack.
Spatial invariance. GAP is invariant to where in the feature map a pattern occurs. A feature that activates in the top-left and the same feature activating in the bottom-right produce the same GAP output. For classification, this is exactly what you want.
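The parameter counts in the VGG-16 vs ResNet-50 head comparison above can be checked with a short calculation. This sketch counts weights plus biases for each fully connected layer; `linear_params` is a hypothetical helper.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer: weights plus optional biases."""
    return n_in * n_out + (n_out if bias else 0)


vgg_head = (
    linear_params(512 * 7 * 7, 4096)
    + linear_params(4096, 4096)
    + linear_params(4096, 1000)
)
gap_head = linear_params(2048, 1000)  # GAP itself has no parameters

print(f"VGG-16 FC head: {vgg_head:,}")               # 123,642,856
print(f"GAP + linear:   {gap_head:,}")               # 2,049,000
print(f"ratio:          {vgg_head / gap_head:.1f}x")  # 60.3x
```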
Strided Convolutions: The Modern Approach to Downsampling
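A stride-2 convolution slides the kernel two pixels at a time instead of one, halving the output spatial dimensions:
$$H_{out} = \left\lfloor \frac{H_{in} + 2P - K}{S} \right\rfloor + 1$$
For $K=3$, $P=1$, stride=2: $H_{out} = \lfloor (H_{in} - 1)/2 \rfloor + 1$. For even inputs this is exactly $H_{in}/2$.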
This produces the same spatial downsampling as a 2×2 max pool with stride 2. The critical difference: the stride-2 conv is parameterized. The network learns whether to average, take a maximum, emphasize the center pixel, or do something more complex - whatever the task requires.
This matters in practice. ResNet (He et al., 2015) keeps only the stem max pool and does all later downsampling with stride-2 convolutions. Replacing max pools with stride-2 convs often improves accuracy, with gains on the order of 0.5–2% commonly cited, though the size of the effect depends on the task and dataset.
Why stride-2 conv consistently beats max pooling:
Learned aggregation. Gradient descent finds the aggregation function that minimizes task loss. For medical imaging, a subtle texture might be best detected by a weighted average, not a max. For edge detection, the max might matter. The network discovers this from data.
Non-linear transformation during downsampling. A stride-2 conv → BatchNorm → ReLU simultaneously downsamples and applies a learned non-linear transformation. Max pooling just downsamples - no transformation.
Simultaneous channel transformation. A stride-2 conv can go from 64 to 128 channels while halving spatial dimensions, in one operation. Max pooling must be paired with a separate conv to change channel count.
Gradient flow. Max pooling produces sparse gradients - only the pixel that "won" the max operation receives gradient. All other positions get zero gradient. Strided conv distributes gradient across all kernel positions, providing richer gradient signal during training.
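The gradient-flow point can be illustrated without an autograd library by writing the two backward rules by hand. This is a sketch on a single flattened 2×2 window; the kernel weights are made up for illustration (here they happen to encode an average), and both function names are hypothetical.

```python
def max_pool_backward(window, upstream_grad):
    """Gradient of max over a flat window: all of it goes to the argmax position."""
    winner = max(range(len(window)), key=lambda i: window[i])
    return [upstream_grad if i == winner else 0.0 for i in range(len(window))]


def weighted_sum_backward(weights, upstream_grad):
    """Gradient of sum(w_i * x_i) with respect to each x_i is w_i * upstream_grad."""
    return [w * upstream_grad for w in weights]


window = [0.2, 0.9, 0.4, 0.7]        # a flattened 2×2 window of activations
weights = [0.25, 0.25, 0.25, 0.25]   # hypothetical conv kernel weights

print(max_pool_backward(window, 1.0))       # [0.0, 1.0, 0.0, 0.0]
print(weighted_sum_backward(weights, 1.0))  # [0.25, 0.25, 0.25, 0.25]
```

With max pooling, three of the four input positions receive no gradient; with a convolution, every position with a non-zero weight does.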
When does max pooling still appear? In the stem of the network (the very first downsampling after the input), many architectures retain a max pool after the initial conv. This early max pool provides strong translation invariance at low computational cost, before the network has learned much. Beyond the stem, modern networks use strided convolutions.
Dilated (Atrous) Convolutions: More Receptive Field Without Downsampling
There is a third way to grow the receptive field beyond stacking layers and downsampling: insert gaps between kernel elements.
A standard 3×3 conv looks at a 3×3 contiguous patch. A 3×3 conv with dilation $d=2$ looks at a 5×5 patch, but only samples 9 of the 25 pixels - the ones at every other position. With $d=4$, it looks at a 9×9 patch, sampling every 4th position.
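The effective kernel size with dilation $d$ and kernel size $K$:
$$K_{\text{eff}} = d\,(K - 1) + 1$$
For $K=3$: $d=1 \to 3$, $d=2 \to 5$, $d=4 \to 9$, $d=8 \to 17$, $d=16 \to 33$.
The output size formula with dilation:
$$H_{out} = \left\lfloor \frac{H_{in} + 2P - d\,(K - 1) - 1}{S} \right\rfloor + 1$$
For "same" output size with dilation $d$ and $K=3$ (stride 1): $P = d$.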
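A quick numeric check of both formulas, as a sketch: `effective_kernel` and `conv_out` are hypothetical helper names, and the 56-pixel input is just an example size. The last column shows what happens if the dilation is changed while the padding is left at 1.

```python
def effective_kernel(k, d):
    """Span of a kernel of size k with dilation d, in input pixels."""
    return d * (k - 1) + 1


def conv_out(h_in, k, s=1, p=0, d=1):
    """Output size of a convolution along one spatial dimension."""
    return (h_in + 2 * p - effective_kernel(k, d)) // s + 1


for d in (1, 2, 4, 8, 16):
    k_eff = effective_kernel(3, d)
    same = conv_out(56, k=3, p=d, d=d)    # padding = dilation keeps the size
    stale = conv_out(56, k=3, p=1, d=d)   # padding left at 1 shrinks the map
    print(f"d={d:>2}: K_eff={k_eff:>2}, padding=d -> {same}, padding=1 -> {stale}")
```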
The key property: a dilated conv with stride 1 grows the receptive field without reducing spatial resolution. This is the defining property that makes it invaluable for dense prediction tasks.
DeepLab (Chen et al., 2015) used dilated convolutions to build semantic segmentation models with large receptive fields while maintaining 1/8 spatial resolution output - no need for a full encoder-decoder. The ASPP (Atrous Spatial Pyramid Pooling) module applies parallel dilated convolutions with rates such as $d \in \{6, 12, 18\}$ to capture context at multiple scales.
WaveNet (van den Oord et al., 2016) used dilated causal convolutions for audio generation. By stacking convolutions with $d = 1, 2, 4, \ldots, 512$, WaveNet achieves a receptive field of over 1000 time steps with a manageable number of parameters - exponential growth of RF with linear depth.
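The exponential growth is easy to verify for a 1D stack. The sketch below assumes a kernel size of 2 and ten layers with dilations doubling from 1 to 512 (one WaveNet-style dilation cycle); `stacked_rf` is a hypothetical helper, and all layers are stride 1.

```python
def stacked_rf(kernel, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += d * (kernel - 1)
    return rf


doubling = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
constant = [1] * 10                     # same depth, no dilation

print(stacked_rf(2, doubling))  # 1024
print(stacked_rf(2, constant))  # 11
```

Same depth and same parameter count per layer, but the dilated stack covers about 93 times more of the input.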
Dilated convolutions - receptive field vs dilation (■ = sampled position, □ = skipped):
Standard 3×3 (d=1), RF = 3×3, 9 params:
■ ■ ■
■ ■ ■
■ ■ ■
Dilated d=2, RF = 5×5, 9 params:
■ □ ■ □ ■
□ □ □ □ □
■ □ ■ □ ■
□ □ □ □ □
■ □ ■ □ ■
Dilated d=4, RF = 9×9, 9 params:
■ □ □ □ ■ □ □ □ ■
□ □ □ □ □ □ □ □ □
□ □ □ □ □ □ □ □ □
□ □ □ □ □ □ □ □ □
■ □ □ □ ■ □ □ □ ■
□ □ □ □ □ □ □ □ □
□ □ □ □ □ □ □ □ □
□ □ □ □ □ □ □ □ □
■ □ □ □ ■ □ □ □ ■
All three configurations use exactly 9 parameters (a 3×3 filter). The receptive field grows with dilation but the parameter count does not.
Padding: The Boundary Problem
Without padding, every convolution shrinks the spatial dimensions. A 3×3 conv on a 7×7 input produces a 5×5 output - the border pixels cannot be the center of a 3×3 window, so they are excluded. Each unpadded 3×3 layer removes 2 pixels per dimension, so 10 layers remove 20: the 7×7 map is already down to 1×1 after three layers (7 → 5 → 3 → 1), and any input smaller than 21×21 cannot survive all ten.
Padding adds values around the border of the input, allowing convolutions to be computed at border positions and controlling the output size.
Valid Padding: No Padding Added
No border values are added. The convolution is computed only where the kernel fits entirely within the input:
$$H_{out} = \left\lfloor \frac{H_{in} - K}{S} \right\rfloor + 1$$
For a 3×3 conv on a 7×7 input, stride 1: $H_{out} = (7 - 3)/1 + 1 = 5$. Each layer shrinks the feature map by $(K-1)/2$ pixels per side. Border pixels participate in fewer dot products than interior pixels, which creates border artifacts - the network sees less context near the edges of the image.
Valid padding is appropriate when you deliberately want each conv layer to shrink the spatial dimensions (as in LeNet-5 or the original U-Net) or when border artifacts need to be eliminated by never computing near the edge.
Same Padding: Keep Output Size Equal to Input
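Add zeros to each side so that the output spatial dimensions match the input (for stride=1):
$$P = \frac{K - 1}{2}$$
For $K=3$: $P=1$. For $K=5$: $P=2$. For $K=7$: $P=3$. This formula only works for odd kernel sizes. Even kernel sizes (2, 4, 6) require asymmetric padding, which differs between frameworks.
The full output size formula with padding:
$$H_{out} = \left\lfloor \frac{H_{in} + 2P - K}{S} \right\rfloor + 1$$
For stride-2 with $K=3$, $P=1$: $H_{out} = \lfloor (H_{in} - 1)/2 \rfloor + 1$.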
Padding Modes: What to Fill the Border With
Zero padding (the default) fills border values with 0. This is clean and simple but introduces an artificial discontinuity - the zero-filled border does not reflect the statistics of the image.
| Padding Mode | Border Values | Best Use |
|---|---|---|
| zeros | All zeros | General purpose; default everywhere |
| replicate | Copy the nearest edge pixel | Reduces edge artifacts in dense prediction |
| reflect | Mirror the image about the border | Minimizes discontinuity; good for texture synthesis |
| circular | Wrap around (left edge = right edge) | Panoramic images, cylindrical projections, audio |
For most classification networks, zeros is fine. For dense prediction tasks (segmentation, depth estimation), reflect or replicate can reduce the drop in quality at image borders.
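The four modes are easiest to compare on a short 1D signal. The sketch below is a pure-Python illustration of what each mode writes into the border; `pad_1d` is a hypothetical helper, and it assumes `pad` is smaller than the signal length.

```python
def pad_1d(values, pad, mode):
    """Pad a 1D sequence on both sides; assumes pad < len(values)."""
    values = list(values)
    n = len(values)
    if mode == "zeros":
        left, right = [0] * pad, [0] * pad
    elif mode == "replicate":   # repeat the edge value
        left, right = [values[0]] * pad, [values[-1]] * pad
    elif mode == "reflect":     # mirror about the edge, edge value not repeated
        left = [values[i] for i in range(pad, 0, -1)]
        right = [values[n - 1 - i] for i in range(1, pad + 1)]
    elif mode == "circular":    # wrap around to the opposite end
        left, right = values[n - pad:], values[:pad]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return left + values + right


signal = [1, 2, 3, 4]
for mode in ("zeros", "replicate", "reflect", "circular"):
    print(f"{mode:<10} {pad_1d(signal, 2, mode)}")
# zeros      [0, 0, 1, 2, 3, 4, 0, 0]
# replicate  [1, 1, 1, 2, 3, 4, 4, 4]
# reflect    [3, 2, 1, 2, 3, 4, 3, 2]
# circular   [3, 4, 1, 2, 3, 4, 1, 2]
```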
padding='same' in PyTorch (nn.Conv2d) is only valid for stride=1 layers. If you use padding='same' with stride > 1, PyTorch raises an error. For stride-2 downsampling convolutions, always compute padding explicitly: for $K=3$, $S=2$, use padding=1.
PyTorch's integer padding argument is symmetric (same amount on each side). TensorFlow's "SAME" mode pads asymmetrically when the total padding required is odd, placing the extra row or column on the bottom and right. This is a common source of subtle shape and alignment differences when porting models between the two frameworks.
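The asymmetry can be seen by computing the padding amounts directly. The sketch below follows the rule documented for TensorFlow's "SAME" mode (output size is ceil(input / stride), and any odd padding remainder goes after the data); the function names are hypothetical and the code handles one spatial dimension.

```python
import math


def tf_same_padding(size, k, s):
    """Return (output size, padding before, padding after) under TF-style SAME."""
    out = math.ceil(size / s)
    total = max((out - 1) * s + k - size, 0)
    before = total // 2
    return out, before, total - before


def symmetric_out(size, k, s, p):
    """Output size with the same padding p on both sides (PyTorch-style)."""
    return (size + 2 * p - k) // s + 1


print(tf_same_padding(224, k=3, s=2))      # (112, 0, 1)
print(symmetric_out(224, k=3, s=2, p=1))   # 112
print(tf_same_padding(224, k=3, s=1))      # (224, 1, 1)
```

For the stride-2 case both frameworks produce 112 outputs, but the sampling grid is shifted by one pixel: TensorFlow pads (0, 1) while PyTorch with padding=1 pads (1, 1). Ported weights then see slightly different inputs even though every shape matches.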
Worked Examples: Tracing Spatial Dimensions
One of the most useful skills for a CV engineer is the ability to trace spatial dimensions through a network by hand. Interviewers check this regularly. Let's trace through two real backbones.
VGG-16 Spatial Dimension Flow
VGG-16 uses max pooling for all downsampling, same-padding 3×3 convs for feature extraction.
| Stage | Operation | H×W | Channels |
|---|---|---|---|
| Input | | 224×224 | 3 |
| Block 1 (×2 conv) | 3×3, pad=1, s=1 | 224×224 | 64 |
| MaxPool | 2×2, s=2 | 112×112 | 64 |
| Block 2 (×2 conv) | 3×3, pad=1, s=1 | 112×112 | 128 |
| MaxPool | 2×2, s=2 | 56×56 | 128 |
| Block 3 (×3 conv) | 3×3, pad=1, s=1 | 56×56 | 256 |
| MaxPool | 2×2, s=2 | 28×28 | 256 |
| Block 4 (×3 conv) | 3×3, pad=1, s=1 | 28×28 | 512 |
| MaxPool | 2×2, s=2 | 14×14 | 512 |
| Block 5 (×3 conv) | 3×3, pad=1, s=1 | 14×14 | 512 |
| MaxPool | 2×2, s=2 | 7×7 | 512 |
| Flatten | | 25,088 | - |
| FC 4096 | | - | 4096 |
| FC 4096 | | - | 4096 |
| FC 1000 | | - | 1000 |

5 max pools, each halving: 224 → 112 → 56 → 28 → 14 → 7
ResNet-50 Spatial Dimension Flow
ResNet uses a 7×7 stem conv with stride 2, one max pool in the stem, then stride-2 convolutions for all subsequent downsampling.
| Stage | Operation | H×W | Channels |
|---|---|---|---|
| Input | | 224×224 | 3 |
| Stem conv | 7×7, pad=3, s=2 | 112×112 | 64 |
| Stem maxpool | 3×3, pad=1, s=2 | 56×56 | 64 |
| Layer 1 (×3 blocks) | bottleneck, s=1 | 56×56 | 256 |
| Layer 2 (×4 blocks) | bottleneck, s=2 | 28×28 | 512 |
| Layer 3 (×6 blocks) | bottleneck, s=2 | 14×14 | 1024 |
| Layer 4 (×3 blocks) | bottleneck, s=2 | 7×7 | 2048 |
| GlobalAvgPool | | 1×1 | 2048 |
| Linear | | - | 1000 |

Spatial downsampling: 224 → 112 → 56 → 56 → 28 → 14 → 7 → 1
Note: ResNet only uses max pooling in the stem (once). All other downsampling is via stride-2 convolutions.
The pattern: spatial resolution halves in each dimension while channel count doubles. This maintains roughly constant computation per layer: halving both spatial dimensions leaves a quarter of the positions, while doubling both input and output channels makes each position 4× more expensive, so the two effects cancel. The network trades spatial precision for semantic depth.
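The cancellation can be checked by counting multiply-accumulates for a plain 3×3 convolution at each stage. This sketch assumes a simplified C → C conv per stage with the channel widths of a ResNet-18-style network; `conv_macs` is a hypothetical helper, and real bottleneck blocks differ in detail.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulates for one conv layer producing an h×w output map."""
    return h * w * k * k * c_in * c_out


stages = [(56, 64), (28, 128), (14, 256), (7, 512)]  # (spatial size, channels)
for size, ch in stages:
    macs = conv_macs(size, size, 3, ch, ch)
    print(f"{size:>2}×{size:<2} with {ch:>3} channels: {macs:,} MACs")
# Every stage prints 115,605,504 MACs
```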
Receptive Field: How Much Does a Neuron See?
The receptive field (RF) is the region of the original input that can influence a given neuron's output. Small RF = sees local details only. Large RF = integrates global context.
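For a stack of $L$ stride-1 3×3 conv layers, the RF grows linearly, by 2 pixels per layer:
$$RF_L = 1 + 2L$$
After 10 stride-1 3×3 layers: RF = 21 pixels. To see a 224×224 image with stride-1 layers alone, you need 112 layers - impractical.
Stride compounds RF growth. For a stack of layers each with stride $S_l$ and kernel $K_l$:
$$RF_L = 1 + \sum_{l=1}^{L} (K_l - 1) \prod_{i=1}^{l-1} S_i$$
With dilation $d_l$, the effective kernel size is $K_{\text{eff},l} = d_l (K_l - 1) + 1$, so the RF contribution from layer $l$ is $(K_{\text{eff},l} - 1) \prod_{i=1}^{l-1} S_i$.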
Traced through ResNet-50's early layers:
| Layer | K | S | Cumulative stride (before layer) | RF contribution | Total RF |
|---|---|---|---|---|---|
| stem 7×7 conv | 7 | 2 | 1 | 6 × 1 = 6 | 7 |
| maxpool 3×3 | 3 | 2 | 2 | 2 × 2 = 4 | 11 |
| res block 1 (3×3) | 3 | 1 | 4 | 2 × 4 = 8 | 19 |
| res block 1 (3×3) | 3 | 1 | 4 | 2 × 4 = 8 | 27 |
| layer2 stride (3×3) | 3 | 2 | 4 | 2 × 4 = 8 | 35 |
| layer2 block (3×3) | 3 | 1 | 8 | 2 × 8 = 16 | 51 |
...
After just 6 layers of ResNet-50, each neuron sees a 51-pixel region of the 224-pixel input. By the end of layer 4, the theoretical RF exceeds the input size.
Effective vs. theoretical receptive field. The theoretical RF is an overestimate. Empirically, the center of the RF contributes exponentially more to the neuron's output than the periphery - the effective RF is approximately Gaussian in shape, much smaller than the theoretical bound. This is why very large networks can still benefit from dilated convolutions and attention mechanisms that explicitly aggregate long-range dependencies.
Why stack 3×3 convolutions instead of one large kernel?
Two stacked 3×3 convs: same 5×5 RF as one 5×5 conv.
Parameters:
One 5×5 conv (C→C channels): 5×5×C×C = 25C²
Two 3×3 convs (C→C each): 2×9×C×C = 18C² → 28% fewer parameters
Non-linearities:
One 5×5 conv: 1 non-linearity
Two 3×3 convs: 2 non-linearities → more representational capacity
Three stacked 3×3 convs match a 7×7 RF with $27C^2$ vs $49C^2$ parameters - 45% fewer - and three non-linearities. This is why the entire deep learning field converged on 3×3 as the standard kernel size after VGG (2014).
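The savings can be reproduced for any channel width. The sketch below uses C = 64 as an arbitrary example and ignores biases, matching the counts above; both helper names are hypothetical.

```python
def stacked_3x3_params(n_layers, c):
    """Weights in n_layers stacked 3×3 convs, each C -> C, biases ignored."""
    return n_layers * 3 * 3 * c * c


def single_conv_params(k, c):
    """Weights in one k×k conv, C -> C, biases ignored."""
    return k * k * c * c


c = 64
for n_layers, k in [(2, 5), (3, 7)]:
    small = stacked_3x3_params(n_layers, c)
    large = single_conv_params(k, c)
    rf = 1 + 2 * n_layers  # RF of n stacked stride-1 3×3 convs
    saving = 100 * (1 - small / large)
    print(f"{n_layers} × 3×3 (RF {rf}): {small:,} vs one {k}×{k}: {large:,} "
          f"-> {saving:.0f}% fewer")
# 2 × 3×3 (RF 5): 73,728 vs one 5×5: 102,400 -> 28% fewer
# 3 × 3×3 (RF 7): 110,592 vs one 7×7: 200,704 -> 45% fewer
```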
Python: Tracing Spatial Dimensions Through Any Model
import torch
import torch.nn as nn
from typing import Dict, List, Tuple
def trace_spatial_dimensions(
    model: nn.Module,
    input_shape: Tuple[int, int, int, int],  # (batch, channels, H, W)
    device: str = "cpu"
) -> List[Dict]:
    """
    Run a dummy forward pass and record the spatial dimensions at each layer.
    Uses forward hooks to capture shape information without modifying the model.

    Args:
        model: any PyTorch nn.Module
        input_shape: input tensor shape as (B, C, H, W)
        device: 'cpu' or 'cuda'

    Returns:
        List of dicts with layer name and output shape
    """
    records = []
    hooks = []

    def make_hook(name: str):
        def hook_fn(module, input, output):
            if isinstance(output, torch.Tensor):
                shape = tuple(output.shape)
                records.append({
                    "layer": name,
                    "type": type(module).__name__,
                    "output_shape": shape,
                    "spatial": f"{shape[2]}×{shape[3]}" if len(shape) == 4 else "-",
                    "channels": shape[1] if len(shape) >= 2 else "-",
                })
        return hook_fn

    # Register hooks on leaf modules only (skip containers such as Sequential)
    for name, module in model.named_modules():
        if not list(module.children()) and name != "":
            h = module.register_forward_hook(make_hook(name))
            hooks.append(h)

    # Run a dummy forward pass (note: this switches the model to eval mode)
    dummy = torch.zeros(input_shape).to(device)
    model.to(device)
    model.eval()
    with torch.no_grad():
        model(dummy)

    # Clean up hooks
    for h in hooks:
        h.remove()

    return records

def print_spatial_trace(records: List[Dict], max_rows: int = 40) -> None:
    """Pretty-print the spatial dimension trace."""
    print(f"\n{'Layer':<40} {'Type':<20} {'Spatial':<12} {'Channels'}")
    print("─" * 85)
    for r in records[:max_rows]:
        print(
            f"{r['layer']:<40} "
            f"{r['type']:<20} "
            f"{r['spatial']:<12} "
            f"{r['channels']}"
        )
    if len(records) > max_rows:
        print(f" ... and {len(records) - max_rows} more layers")

# --- Example: trace ResNet-18 ---
import torchvision.models as models
resnet = models.resnet18(weights=None)
records = trace_spatial_dimensions(resnet, input_shape=(1, 3, 224, 224))
print_spatial_trace(records)
# Expected output (truncated):
# Layer Type Spatial Channels
# conv1 Conv2d 112×112 64
# bn1 BatchNorm2d 112×112 64
# relu ReLU 112×112 64
# maxpool MaxPool2d 56×56 64
# layer1.0.conv1 Conv2d 56×56 64
# ...
# avgpool AdaptiveAvgPool2d 1×1 512
# fc Linear - 1000
# --- Manual calculation: output size formula ---
def conv_out_size(h_in: int, k: int, s: int, p: int, d: int = 1) -> int:
    """
    Compute output spatial size for a Conv2d layer.
    Formula: floor((H_in + 2P - d*(K-1) - 1) / S) + 1
    Equivalent to floor((H_in + 2P - K_eff) / S) + 1
    where K_eff = d*(K-1)+1
    """
    k_eff = d * (k - 1) + 1
    return (h_in + 2 * p - k_eff) // s + 1


def pool_out_size(h_in: int, k: int, s: int, p: int = 0) -> int:
    """Compute output size for a pooling layer."""
    return (h_in + 2 * p - k) // s + 1

# Trace ResNet-50 by hand to verify understanding
print("\nResNet-50 spatial trace (manual):")
h = 224
configs = [
    ("stem conv 7×7, s=2, p=3", conv_out_size, (7, 2, 3, 1)),
    ("stem maxpool 3×3, s=2, p=1", pool_out_size, (3, 2, 1)),
    ("layer1 (s=1, 3×3, p=1)", conv_out_size, (3, 1, 1, 1)),
    ("layer2 stride conv (s=2)", conv_out_size, (3, 2, 1, 1)),
    ("layer3 stride conv (s=2)", conv_out_size, (3, 2, 1, 1)),
    ("layer4 stride conv (s=2)", conv_out_size, (3, 2, 1, 1)),
]
for name, fn, args in configs:
    h_out = fn(h, *args)
    print(f" {name:<40}: {h}×{h} → {h_out}×{h_out}")
    h = h_out
PyTorch Reference: All Pooling Operations
import torch
import torch.nn as nn
x = torch.randn(4, 64, 56, 56) # (batch=4, channels=64, H=56, W=56)
# 1. Max pooling - take maximum in each K×K window
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(max_pool(x).shape) # (4, 64, 28, 28)
# With padding
max_pool_p = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
print(max_pool_p(x).shape) # (4, 64, 28, 28) - ceil(56/2)
# 2. Max pooling with indices - for MaxUnpool2d (autoencoders, decoders)
max_pool_idx = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
x_pooled, indices = max_pool_idx(x)
print(x_pooled.shape) # (4, 64, 28, 28)
print(indices.shape) # (4, 64, 28, 28) - index of max in each window
# Unpool: restore spatial size (sparse - only winning positions get value)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
x_unpooled = unpool(x_pooled, indices, output_size=x.shape)
print(x_unpooled.shape) # (4, 64, 56, 56) - sparse, 75% zeros
# 3. Average pooling - take mean in each K×K window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
print(avg_pool(x).shape) # (4, 64, 28, 28)
# count_include_pad=False: for border cells that are partially padded, average
# only over valid cells (not over zero-padded positions)
avg_pool_nopad = nn.AvgPool2d(kernel_size=3, stride=1, padding=1, count_include_pad=False)
print(avg_pool_nopad(x).shape) # (4, 64, 56, 56) - same size
# 4. Global Average Pooling - collapse all spatial positions per channel
gap_v1 = nn.AdaptiveAvgPool2d((1, 1))
gap_out = gap_v1(x)
print(gap_out.shape) # (4, 64, 1, 1)
gap_flat = gap_out.squeeze(-1).squeeze(-1)
print(gap_flat.shape) # (4, 64) - ready for linear classifier
# Equivalent using .mean()
gap_v2 = x.mean(dim=(-2, -1)) # average over H and W dimensions
print(gap_v2.shape) # (4, 64)
# 5. Adaptive Average Pooling - specify OUTPUT size; PyTorch computes stride/kernel
# Works on any input size, always produces the specified output size
adaptive = nn.AdaptiveAvgPool2d((7, 7))
x_large = torch.randn(4, 64, 100, 100) # arbitrary input size
x_small = torch.randn(4, 64, 14, 14) # another arbitrary size
print(adaptive(x_large).shape) # (4, 64, 7, 7) - always
print(adaptive(x_small).shape) # (4, 64, 7, 7) - always
# Non-square output
adaptive_rect = nn.AdaptiveAvgPool2d((4, 8)) # height=4, width=8
print(adaptive_rect(x).shape) # (4, 64, 4, 8)
# 6. Strided convolution - learned downsampling (modern networks)
stride_conv = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False)
print(stride_conv(x).shape) # (4, 128, 28, 28) - halved AND channel-doubled
# Compared to max pool: same spatial reduction, but learned
max_then_conv = nn.Sequential(
    nn.MaxPool2d(2, 2),     # (4, 64, 28, 28)
    nn.Conv2d(64, 128, 1),  # (4, 128, 28, 28) - separate channel change
)
print(max_then_conv(x).shape) # (4, 128, 28, 28) - same final shape
# 7. Dilated convolution - large receptive field without downsampling
# padding = dilation for "same" output size with K=3
dilated_d2 = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2) # RF: 5×5
dilated_d4 = nn.Conv2d(64, 64, kernel_size=3, dilation=4, padding=4) # RF: 9×9
dilated_d8 = nn.Conv2d(64, 64, kernel_size=3, dilation=8, padding=8) # RF: 17×17
dilated_d16 = nn.Conv2d(64, 64, kernel_size=3, dilation=16, padding=16) # RF: 33×33
# All produce same spatial output size, but see increasingly large input regions
for name, layer in [("d=2", dilated_d2), ("d=4", dilated_d4),
                    ("d=8", dilated_d8), ("d=16", dilated_d16)]:
    out = layer(x)
    d = int(name.split("=")[1])
    rf = d * (3 - 1) + 1
    print(f"Dilation {name}: output {out.shape}, effective RF = {rf}×{rf}")
# 8. Padding modes
for mode in ['zeros', 'reflect', 'replicate', 'circular']:
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, padding_mode=mode)
    out = conv(x)
    print(f"padding_mode='{mode}': {out.shape}")  # All (4, 64, 56, 56)
Computing Receptive Field Programmatically
from dataclasses import dataclass
from typing import List


@dataclass
class LayerSpec:
    name: str
    kernel: int
    stride: int
    dilation: int = 1
    layer_type: str = "conv"  # 'conv' or 'pool'


def compute_receptive_fields(layers: List[LayerSpec]) -> None:
    """
    Compute and print the receptive field and spatial coverage after each layer.
    Uses the formula:
        RF_l = RF_{l-1} + (K_eff - 1) * cumulative_stride
    where K_eff = dilation * (kernel - 1) + 1
    and cumulative_stride is the product of the strides of all earlier layers.
    """
    rf = 1  # receptive field in input pixel space
    cum_stride = 1  # cumulative product of all strides so far
    print(f"\n{'Layer':<30} {'K':>3} {'S':>3} {'d':>3} {'Cum S':>7} {'RF':>8}")
    print("─" * 60)
    print(f"{'Input':<30} {'-':>3} {'-':>3} {'-':>3} {'1':>7} {'1':>8}")
    for spec in layers:
        k_eff = spec.dilation * (spec.kernel - 1) + 1
        rf = rf + (k_eff - 1) * cum_stride
        cum_stride *= spec.stride  # printed value includes this layer's stride
        print(
            f"{spec.name:<30} "
            f"{spec.kernel:>3} "
            f"{spec.stride:>3} "
            f"{spec.dilation:>3} "
            f"{cum_stride:>7} "
            f"{rf:>8}"
        )

# ResNet-50 receptive field through the first few stages
resnet50_layers = [
    LayerSpec("stem conv 7×7 s=2", kernel=7, stride=2),
    LayerSpec("stem maxpool 3×3 s=2", kernel=3, stride=2, layer_type="pool"),
    LayerSpec("layer1 3×3 s=1 (×1)", kernel=3, stride=1),
    LayerSpec("layer1 3×3 s=1 (×2)", kernel=3, stride=1),
    LayerSpec("layer2 3×3 s=2", kernel=3, stride=2),
    LayerSpec("layer2 3×3 s=1", kernel=3, stride=1),
    LayerSpec("layer3 3×3 s=2", kernel=3, stride=2),
    LayerSpec("layer3 3×3 s=1", kernel=3, stride=1),
    LayerSpec("layer4 3×3 s=2", kernel=3, stride=2),
    LayerSpec("layer4 3×3 s=1", kernel=3, stride=1),
]
compute_receptive_fields(resnet50_layers)
# DeepLab-style ASPP: dilated convolutions for large RF without downsampling.
# ASPP branches run in parallel on the same input, so each branch is traced
# on its own rather than as one sequential stack.
print("\nASPP dilated convs (stride=1 throughout, no downsampling):")
aspp_branches = [
    LayerSpec("conv d=1 (standard)", kernel=3, stride=1, dilation=1),
    LayerSpec("conv d=6", kernel=3, stride=1, dilation=6),
    LayerSpec("conv d=12", kernel=3, stride=1, dilation=12),
    LayerSpec("conv d=18", kernel=3, stride=1, dilation=18),
]
for branch in aspp_branches:
    compute_receptive_fields([branch])
# Notice: RF grows to 37 pixels from a single dilated conv d=18 - no downsampling needed
Production Engineering Notes
The spatial dimension formula is your most important debugging tool. When a residual connection fails ("tensor sizes don't match"), when a skip connection in U-Net produces a shape error, when view() throws "invalid argument" - the cause is almost always an unexpected spatial dimension somewhere. Keep conv_out_size as a utility in your project. Run it on paper before you run the model.
AdaptiveAvgPool2d instead of hardcoded spatial sizes. Never write x = x.view(batch, -1) assuming a fixed spatial size. If you load a model trained on 224×224 images and run it on 320×320 images (a common practice for boosted inference accuracy), hardcoded flattening breaks silently or loudly. Always use nn.AdaptiveAvgPool2d((1, 1)) before the classifier. Your model becomes resolution-agnostic at inference time for free.
Stride-2 conv vs maxpool: test both for your task. The general recommendation is stride-2 conv, but always benchmark. On some edge hardware (especially mobile NPUs), max pooling has highly optimized kernels and may actually be faster than a parameterized stride-2 conv. The theoretical FLOP reduction does not always translate to wallclock time.
Dilated convolutions and padding. When you add or change dilation in a conv layer, you must update the padding simultaneously or your feature map size changes silently. For $K=3$ and stride 1, the same-size padding is always padding = dilation. Forgetting this is the most common dilation bug.
Padding mode affects border quality for dense tasks. For semantic segmentation, depth estimation, and super-resolution, the choice of padding mode can noticeably affect quality near image borders. Zero padding creates an artificial discontinuity that the network learns to partially compensate for, but never perfectly. reflect padding minimizes this artifact. For classification networks, it does not matter.
Channels-last memory format for mobile. On ARM processors and Apple M-series GPUs, channels_last memory format (NHWC) can be significantly faster than the default NCHW. For deployment on mobile hardware, convert with model.to(memory_format=torch.channels_last) and profile before and after.
Global max pooling vs global average pooling. GAP is the standard and usually better. Global max pooling is used occasionally in multi-label classification where you care whether at least one spatial location activates strongly, not about the average activation level. For single-label classification, GAP is nearly always the correct choice.
Transposed Convolutions: Learnable Upsampling
Every operation covered so far either reduces spatial dimensions (pooling, strided conv) or keeps them constant (same-padded conv). For segmentation decoders, image generation, and autoencoders, you need to go the other direction - increase spatial resolution from a compact feature map back toward full resolution. This is upsampling.
The simplest approach is bilinear interpolation: take the small feature map, fill in the gaps using bilinear interpolation, then apply a standard convolution to refine. This is the approach used in FPN (Feature Pyramid Network) and many U-Net variants. It is artifact-free and parameter-efficient.
The learned approach is a transposed convolution (sometimes called "deconvolution," though this name is technically inaccurate and discouraged). A transposed convolution learns the upsampling transformation from data.
How transposed convolution works. In a regular stride-2 convolution from H to H/2, each output position aggregates a K x K window of input positions: many inputs map to one output. A transposed convolution inverts this connectivity: each input position (in the small, deep feature map) contributes to multiple output positions (in the large, shallow feature map). For stride=2, each input position distributes its activation across a K x K region in the output, consecutive regions are offset by 2 pixels, and overlapping contributions are summed. The learned kernel weights determine the distribution pattern.
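The scatter-and-sum description can be checked directly in 1D. The sketch below writes the transposed convolution as an explicit loop and compares it against `torch.nn.functional.conv_transpose1d`; `transposed_conv1d_scatter` is a hypothetical helper, and the input and kernel values are chosen only to make each contribution easy to read.

```python
import torch
import torch.nn.functional as F

def transposed_conv1d_scatter(x, w, stride):
    """Each input value stamps a scaled copy of the kernel into the output."""
    n, k = x.numel(), w.numel()
    out = torch.zeros((n - 1) * stride + k)
    for i in range(n):
        start = i * stride
        out[start:start + k] += x[i] * w  # overlapping stamps are summed
    return out

x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([1.0, 10.0, 100.0])

manual = transposed_conv1d_scatter(x, w, stride=2)
reference = F.conv_transpose1d(x.view(1, 1, -1), w.view(1, 1, -1), stride=2).flatten()

print(manual.tolist())  # [1.0, 10.0, 102.0, 20.0, 203.0, 30.0, 300.0]
print(torch.allclose(manual, reference))  # True
```

Output positions 2 and 4 each hold the sum of two stamps (100 + 2 and 200 + 3), which is exactly the overlap that matters for the artifact discussion later in this section.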
Output size formula for transposed convolution:
H_out = (H_in - 1) * stride - 2 * padding + kernel_size + output_padding
For a standard 2x upsampling with K=4, S=2, P=1: H_out = (H-1)*2 - 2 + 4 = 2H. The output_padding parameter handles ambiguity when multiple input sizes map to the same output size.
import torch
import torch.nn as nn
x = torch.randn(1, 256, 7, 7) # decoder input: small, rich feature map
# Transposed conv: 2x upsampling (7 -> 14)
# K=4, S=2, P=1 is the "safe" configuration (K divisible by S)
tconv = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)
out = tconv(x)
print(f"ConvTranspose2d output: {out.shape}") # (1, 128, 14, 14)
# Progressive upsampling in a decoder (7 -> 14 -> 28 -> 56; two more stages would reach 224)
decoder = nn.Sequential(
nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), # 7 -> 14
nn.ReLU(),
nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), # 14 -> 28
nn.ReLU(),
nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), # 28 -> 56
nn.ReLU(),
)
out = decoder(x)
print(f"Decoder output: {out.shape}") # (1, 32, 56, 56)
# The artifact-free alternative: bilinear upsample + conv
# This is preferred for segmentation; transposed conv for GANs
artifact_free = nn.Sequential(
nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
nn.Conv2d(256, 128, kernel_size=3, padding=1, bias=False),
nn.BatchNorm2d(128),
nn.ReLU(inplace=True),
)
out2 = artifact_free(x)
print(f"Bilinear + conv output: {out2.shape}") # (1, 128, 14, 14)
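The output-size formula above, and the ambiguity that output_padding resolves, can also be checked in plain Python. This is a minimal sketch; `conv_out` and `tconv_out` are hypothetical helper names that simply restate the formulas (dilation is assumed to be 1).

```python
def conv_out(h, k, s, p):
    """Forward conv output size (floor-based)."""
    return (h + 2 * p - k) // s + 1

def tconv_out(h, k, s, p, output_padding=0):
    """Transposed conv output size, as in the formula above."""
    return (h - 1) * s - 2 * p + k + output_padding

# K=4, S=2, P=1 doubles the spatial size exactly
doubled = [tconv_out(h, k=4, s=2, p=1) for h in (7, 14, 28)]
print(doubled)  # [14, 28, 56]

# Ambiguity: a K=3, S=2, P=1 conv maps both 7 and 8 down to 4 ...
print(conv_out(7, 3, 2, 1), conv_out(8, 3, 2, 1))  # 4 4

# ... so the matching transposed conv needs output_padding to pick which one to restore
restore_7 = tconv_out(4, 3, 2, 1, output_padding=0)
restore_8 = tconv_out(4, 3, 2, 1, output_padding=1)
print(restore_7, restore_8)  # 7 8
```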
The checkerboard artifact problem. Transposed convolutions are notorious for producing regular grid patterns in their output - periodic bright and dark values that are clearly artifacts. The cause: when kernel_size is not evenly divisible by stride, different output positions receive contributions from different numbers of input positions. For kernel=3, stride=2: some output positions are "hit" by two filter applications, others by one. This creates a 2x2 periodic over-activation pattern.
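Concretely: for K=3, S=2 in 1D, input position 0 writes to output positions {0,1,2} and input position 1 writes to output positions {2,3,4}. Output position 2 receives contributions from both inputs, while positions 1 and 3 receive only one each. Interior even-indexed outputs are covered twice and odd-indexed outputs once, so the overlap is non-uniform, which creates checkerboard patterns.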
The fix: use kernel=4, stride=2, padding=1. Since 4 is divisible by 2, every interior output position receives contributions from exactly 2 input positions - uniform overlap, no periodic coverage pattern. Or switch to bilinear upsample + conv entirely.
Checkerboard cause: kernel=3, stride=2 (unequal overlap)
1D transposed conv, stride=2:
Input: A B C
| | |
Kernel positions spread:
A -> [a0, a1, a2] at output positions [0, 1, 2]
B -> [b0, b1, b2] at output positions [2, 3, 4]
C -> [c0, c1, c2] at output positions [4, 5, 6]
Output position 0: hit by A (once)
Output position 1: hit by A (once)
Output position 2: hit by A (once) + B (once) = 2x contribution
Output position 3: hit by B (once)
Output position 4: hit by B (once) + C (once) = 2x contribution
-> Alternating coverage (1x, 2x, 1x, 2x, ...): CHECKERBOARD
Fix: use kernel=4, stride=2 (even divisibility)
Every interior output position hit by exactly 2 input positions -> uniform
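The coverage pattern in the diagram can be counted programmatically. The sketch below is plain Python and counts, for a 1D transposed convolution, how many input positions write into each output position; `coverage` is a hypothetical helper, and padding is modeled as cropping that many positions from each end of the full output.

```python
def coverage(n_in, kernel, stride, padding=0):
    """Number of input positions contributing to each output position (1D)."""
    full_len = (n_in - 1) * stride + kernel
    counts = [0] * full_len
    for i in range(n_in):
        for k in range(kernel):
            counts[i * stride + k] += 1
    return counts[padding:full_len - padding]

uneven = coverage(4, kernel=3, stride=2, padding=1)
uniform = coverage(4, kernel=4, stride=2, padding=1)
print(uneven)   # [1, 2, 1, 2, 1, 2, 1] - alternating coverage
print(uniform)  # [1, 2, 2, 2, 2, 2, 2, 1] - uniform away from the borders
```

Even with kernel=4, the outermost positions are covered only once, which is why the uniform-overlap claim is about interior positions.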
If you see regular grid-like artifacts in segmentation masks or generated images, the cause is almost always a transposed convolution with kernel_size % stride != 0. Replace with bilinear upsampling followed by a standard 3x3 convolution. For GAN generators where you must use transposed convolutions, use kernel=4, stride=2, padding=1.
Practical Guide: When to Use What
This decision framework consolidates the key choices for every spatial operation:
The guiding principle in one sentence: classification networks can afford to lose spatial precision (use pooling/strides aggressively, GAP at the end); dense prediction networks cannot (use strides sparingly, dilated convolutions for large RF, skip connections to restore resolution).
Common Mistakes
Padding mismatch in skip connections. In U-Net and FPN architectures, encoder feature maps are concatenated or added to decoder feature maps at matching scales. If your encoder uses valid padding (no padding) while your decoder produces a different spatial size, you get a shape mismatch at the skip connection. This is the most common bug when implementing U-Net from scratch. Fix: use same padding throughout the encoder, or explicitly crop feature maps to match sizes. Verify spatial dimensions for every skip connection before training.
Stride-2 conv on odd-sized inputs. For nn.Conv2d(C, C, kernel_size=3, stride=2, padding=1), the output is ceil(H/2). For even inputs (224x224): output is 112x112. For odd inputs (225x225): output is 113x113. When you chain multiple stride-2 layers and the input size is not divisible by 2^n (n = number of stride-2 layers), some stage sees an odd size, and a 2x upsample in the decoder no longer reproduces the encoder's size (113 -> 226, not 225), breaking skip connections. Safest approach: use input sizes divisible by the total stride (224, 256, and 512 are all divisible by 32) or explicitly verify all spatial dimensions with a dry run before training.
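A dry run for this check does not need a model at all. The sketch below, in plain Python, traces the ceil(H/2) rule through several stride-2 stages and lists every scale where a 2x decoder upsample would fail to match the encoder; both helper names are hypothetical.

```python
import math

def encoder_sizes(h, num_stages):
    """Spatial size after each stride-2 stage (K=3, S=2, P=1 gives ceil(h / 2))."""
    sizes = [h]
    for _ in range(num_stages):
        sizes.append(math.ceil(sizes[-1] / 2))
    return sizes

def skip_mismatches(h, num_stages):
    """(decoder_input, upsampled_size, encoder_size) wherever 2x upsampling misses."""
    sizes = encoder_sizes(h, num_stages)
    return [(small, 2 * small, big)
            for big, small in zip(sizes, sizes[1:])
            if 2 * small != big]

print(encoder_sizes(224, 5))    # [224, 112, 56, 28, 14, 7]
print(skip_mismatches(224, 5))  # []
print(encoder_sizes(225, 5))    # [225, 113, 57, 29, 15, 8]
print(skip_mismatches(225, 5))  # mismatch at every scale, e.g. (113, 226, 225)
```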
Forgetting to adjust padding when adding dilation. The "same" padding for a dilated conv with dilation d and kernel K=3 is padding = d. If you change dilation from 1 to 2 and forget to update padding from 1 to 2, your feature maps shrink silently - the network trains without error but spatial dimensions are wrong, causing a mismatch at skip connections or the final output layer. Always update padding when changing dilation.
Over-using max pooling in dense prediction tasks. Max pooling provides translation invariance - useful for classification, harmful for segmentation and detection where exact positions matter. Using 4 stages of max pooling in a segmentation backbone means the encoder output is at 1/16 spatial resolution, and recovering pixel-accurate boundaries from a 14x14 feature map is extremely hard. Replace max pooling with strided convolutions in segmentation backbones, or use architectures like DeepLab that avoid downsampling aggressively.
Hardcoding view(batch, -1) before the classifier. This locks your model to a specific input resolution. Any image size other than the training size causes an error. Use nn.AdaptiveAvgPool2d((1, 1)) instead - it works on any resolution and adds zero parameters.
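A minimal sketch of a resolution-agnostic head, assuming a toy two-layer backbone (`TinyClassifier` is a hypothetical example class, not part of any library). Because the adaptive pool always emits a 1x1 map, the linear layer sees 64 features at any input size.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d((1, 1))  # (N, 64, H', W') -> (N, 64, 1, 1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.fc(torch.flatten(x, 1))

model = TinyClassifier().eval()
shapes = {}
with torch.no_grad():
    for size in (64, 96, 225):  # includes an odd size
        shapes[size] = tuple(model(torch.randn(2, 3, size, size)).shape)
print(shapes)  # every entry is (2, 10)
```

With a hardcoded flatten into a fixed-width linear layer, only one of these three input sizes would run without a shape error.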
Checkerboard artifacts from wrong transposed convolution config. If your segmentation output has a grid pattern, use bilinear upsample + conv. If you must use ConvTranspose2d, use kernel=4, stride=2, padding=1 (kernel divisible by stride).
Interview Q&A
Q1: Why did modern CNNs move away from max pooling toward strided convolutions for spatial downsampling?
Max pooling applies a fixed, non-learnable rule: keep the maximum, discard everything else. Strided convolutions are parameterized - gradient descent learns what spatial information to preserve based on the task. In practice, this means a stride-2 conv can learn to perform a weighted average, detect the presence of a feature, or implement any other aggregation that is useful for the downstream objective. Max pooling has no such adaptability. Additionally, stride-2 convolutions simultaneously change channel count and downsample (one operation), apply a learned non-linear transformation when followed by BatchNorm+ReLU, and distribute gradients to all kernel positions during backprop (vs. max pooling's sparse gradients, where only the winning position receives gradient). Empirically, replacing max pooling with stride-2 convs gives consistent small improvements on most benchmarks. The only place max pooling remains dominant is the stem of some networks, where it provides strong early translation invariance cheaply.
Q2: What is Global Average Pooling and why is it used instead of flattening and FC layers?
Global Average Pooling reduces each channel's spatial feature map to a single scalar by averaging all spatial positions: $y_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{c,i,j}$. Output shape: $(N, C, 1, 1)$ regardless of input $H$ and $W$. It replaced the traditional Flatten→FC(4096)→FC(4096)→FC(num_classes) head for four reasons: (1) No fixed input size - the model works on any resolution during inference; (2) Far fewer parameters - VGG-16's first FC layer alone has about 102M parameters (25088 x 4096), vs about 2M (2048 x 1000) for ResNet-50's entire head; (3) Better regularization - each channel must encode a semantically meaningful global pattern, reducing overfitting; (4) Enables Class Activation Maps for visual explainability. GAP was introduced in "Network in Network" (2013) and became universal after ResNet (2015).
Q3: How does same padding keep spatial dimensions constant, and when does it break?
For a conv with kernel size $K$ and stride $S = 1$, adding $P = (K-1)/2$ zeros on each side gives $H_{out} = H + 2P - K + 1 = H$ - output equals input. For odd kernel sizes (1, 3, 5, 7), $(K-1)/2$ is exact and same padding works perfectly. For even kernel sizes (2, 4, 6), $(K-1)/2$ is not an integer, so you need asymmetric padding (extra pixel on one side), which PyTorch and TensorFlow handle differently - a common source of porting bugs. Same padding breaks for $S > 1$: PyTorch raises an error if you use padding='same' with stride > 1. For stride-2 downsampling, you must compute padding manually: for $K = 3$, $S = 2$, use $P = 1$, giving $H_{out} = \lfloor (H + 2 - 3)/2 \rfloor + 1 = \lceil H/2 \rceil$.
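Both behaviors are quick to confirm. The sketch below assumes a recent PyTorch release that accepts the string form padding="same"; the channel counts and input sizes are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)

# Stride 1: padding="same" preserves the spatial size
same_conv = nn.Conv2d(8, 8, kernel_size=3, stride=1, padding="same")
same_shape = tuple(same_conv(x).shape)
print(same_shape)  # (1, 8, 32, 32)

# Stride 2: the string form is rejected when the layer is constructed
try:
    nn.Conv2d(8, 8, kernel_size=3, stride=2, padding="same")
    strided_same_rejected = False
except ValueError as err:
    strided_same_rejected = True
    print(f"Rejected: {err}")

# Manual padding for stride-2 downsampling: K=3, S=2, P=1 gives ceil(H / 2)
down_conv = nn.Conv2d(8, 8, kernel_size=3, stride=2, padding=1)
even_shape = tuple(down_conv(x).shape)
odd_shape = tuple(down_conv(torch.randn(1, 8, 33, 33)).shape)
print(even_shape, odd_shape)  # (1, 8, 16, 16) (1, 8, 17, 17)
```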
Q4: Why stack multiple 3×3 convolutions instead of using a single 7×7 convolution?
Two stacked 3×3 convolutions have the same 5×5 receptive field as one 5×5 conv, but use $2 \cdot 9C^2 = 18C^2$ vs $25C^2$ parameters (for $C$ input and output channels) - 28% fewer - and add two non-linearities instead of one. Three stacked 3×3 convs match a 7×7 RF with $3 \cdot 9C^2 = 27C^2$ vs $49C^2$ parameters - 45% fewer - and three non-linearities. More non-linearities mean more representational capacity per parameter. The VGG paper (2014) established this principle, and it has been the dominant design choice ever since. The exception is the stem convolution: many networks use a 7×7 conv for the very first layer to capture large-scale structure from raw pixels before the network has learned hierarchical features. Deeper in the network, 3×3 is the usual default.
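The parameter arithmetic is easy to reproduce. This plain-Python sketch counts weights only (no biases) and assumes the same channel count C in and out of every layer; `conv_params` is a hypothetical helper and C = 64 is an arbitrary choice, since the ratio does not depend on it.

```python
def conv_params(kernel, channels):
    """Weight count of a kernel x kernel conv with `channels` in and out (no bias)."""
    return kernel * kernel * channels * channels

C = 64
savings = {}
for n_layers, big_kernel in ((2, 5), (3, 7)):
    stacked = n_layers * conv_params(3, C)
    single = conv_params(big_kernel, C)
    receptive_field = 1 + n_layers * (3 - 1)  # stride-1 stack of 3x3 convs
    assert receptive_field == big_kernel
    savings[big_kernel] = round(100 * (1 - stacked / single))
    print(f"{n_layers} x 3x3: {stacked:,} vs one {big_kernel}x{big_kernel}: {single:,}")

print(savings)  # {5: 28, 7: 45} percent fewer parameters
```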
Q5: What is dilated (atrous) convolution and when should you use it?
A dilated convolution inserts gaps of size d-1 between kernel elements, giving an effective kernel size of d*(K-1)+1 while using only K*K parameters. A 3x3 conv with dilation d=4 has a 9x9 effective receptive field but only 9 parameters - the same as a standard 3x3 conv. Critically, dilated convolutions can be used with stride=1, so they expand the receptive field without reducing spatial resolution. Use dilated convolutions when you need large receptive field AND high spatial resolution: semantic segmentation (DeepLab's ASPP uses d in {1,6,12,18} in parallel), dense depth estimation, and audio generation (WaveNet uses exponentially increasing dilation {1,2,4,...,512} to reach 1000+ timestep context). Do not use dilated convolutions blindly in classification networks - they add complexity with minimal benefit where you already downsample aggressively and high RF is achieved through depth and stride.
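The receptive-field numbers quoted above follow from two short formulas. The sketch below is plain Python with hypothetical helper names; the WaveNet-style case assumes kernel size 2 and a single stack of ten stride-1 layers, which is an illustrative simplification rather than the full published architecture.

```python
def effective_kernel(kernel, dilation):
    """Span covered by one dilated kernel along a single axis."""
    return dilation * (kernel - 1) + 1

def stacked_receptive_field(kernel, dilations):
    """Receptive field of stride-1 convs stacked with the given dilations."""
    return 1 + sum(d * (kernel - 1) for d in dilations)

print(effective_kernel(3, 4))  # 9 - a 3x3 kernel with d=4 spans 9x9

# Exponentially increasing dilations 1, 2, 4, ..., 512 with kernel size 2
wavenet_like = [2 ** i for i in range(10)]
rf = stacked_receptive_field(2, wavenet_like)
print(rf)  # 1024 timesteps from only 10 layers

# The same 10 layers without dilation reach only 11 timesteps
plain_rf = stacked_receptive_field(2, [1] * 10)
print(plain_rf)  # 11
```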
Q6: What causes checkerboard artifacts in transposed convolutions and how do you fix them?
Checkerboard artifacts occur when the transposed convolution's kernel size is not divisible by the stride. For kernel=3, stride=2: some output positions receive contributions from two overlapping filter applications, others from one - creating a periodic pattern of over- and under-activation. This manifests as a regular grid of bright and dark values in the output. Fix options: (1) Switch to bilinear upsampling followed by a standard convolution - nn.Upsample(scale_factor=2, mode='bilinear') + nn.Conv2d(...). This decouples upsampling from feature learning and removes the uneven-overlap source of the artifacts. (2) If you must use transposed convolution, use kernel=4, stride=2, padding=1 - since 4 is divisible by 2, every interior output position receives contributions from exactly 2 input positions, giving uniform overlap. Odena et al. (2016), "Deconvolution and Checkerboard Artifacts," analyzed this in detail and recommended resize-then-convolve (interpolation followed by a standard conv) for most use cases.
Q7: Your segmentation model produces blurry, low-resolution predictions. What architectural changes would you consider?
Check the downsampling strategy first. If you have four max pooling layers, the encoder output is at 1/16 resolution - recovering pixel-accurate boundaries from a 14x14 feature map is extremely difficult. First change: replace max pooling with strided convolutions, which preserve more information. Second change: add skip connections from encoder layers at 1/2, 1/4, and 1/8 resolution to the decoder (U-Net style). These skip connections inject fine-grained spatial detail (edges, textures) directly into the decoder at each scale. Third change: if you still need a large receptive field in the encoder but cannot afford more downsampling, replace the last 1-2 stride-2 layers with dilated convolutions (keeping spatial resolution) - this is the DeepLab approach. Fourth: improve the decoder itself - progressive 2x upsampling with feature refinement at each scale gives much sharper results than a single large upsampling step. Finally, check your loss function - pixel-level cross-entropy treats all errors equally; adding a boundary loss or Dice loss that penalizes boundary errors specifically can improve sharpness of predicted edges.
