CNN Architectures: From AlexNet to ConvNeXt

You need to deploy an image classifier for a product catalog with 10,000 categories. The engineer before you trained a VGG-16 - it gets 82% top-1 accuracy but takes 600ms per image on a CPU server. You switch to EfficientNet-B4 and get 85% accuracy at 90ms. A visitor from the research team says to use a ViT. Understanding why these architectures perform differently - not just that they do - is what lets you make the right call for each situation.

This lesson traces the full arc of CNN architecture evolution, from the era of handcrafted features through AlexNet's shock moment, the gradual sophistication of VGG and Inception, the deep insight of ResNet, the efficiency engineering of EfficientNet, and finally ConvNeXt's demonstration that CNNs are not obsolete. Every major architecture existed because the one before it had a specific, identifiable failure mode.

Reading Time: ~50 min | Interview Relevance: Very High | Target Roles: MLE, CV Engineer, Research Engineer


Part 1: Why This Exists - The World Before 2012

The Era of Handcrafted Features

Before deep learning took over, computer vision was an exercise in human ingenuity and frustration.

The workflow looked like this: you had an image, and you needed a computer to understand it. Since the computer could not figure out what was important on its own, you - a human expert - had to specify the features manually.

HOG (Histogram of Oriented Gradients) - published in 2005 by Dalal and Triggs for pedestrian detection - worked by dividing an image into small cells, computing gradient directions in each cell, and histogramming those directions. The intuition was solid: the shape of a person is defined by the arrangement of edges. HOG worked reasonably well for pedestrians in constrained scenes. Change the lighting, add occlusion, or ask it to detect something other than pedestrians and it fell apart.

SIFT (Scale-Invariant Feature Transform) - Lowe, 1999/2004 - was more sophisticated. It found interest points (corners and blobs) in an image and described each with a 128-dimensional vector that was invariant to scale and rotation. SIFT features could be matched between images even if photographed from a different angle or at a different size. But describing an entire image for classification required another layer on top - bag-of-visual-words encodings that threw away spatial relationships.

SURF (Speeded-Up Robust Features) - Bay et al., 2006 - was a faster approximation of SIFT. Same idea, better performance at scale.

All of these required a multi-step pipeline:

Raw Image
→ Feature Detector (finds interesting regions)
→ Feature Descriptor (encodes each region as a vector)
→ Feature Aggregation (bag-of-words → one image vector)
→ Classifier (SVM or similar)

The fragility was baked in at every step. The features that work for detecting pedestrians are not the same as those that work for detecting textures or faces. Every new task required new feature engineering. And the features were always a human's best guess about what matters, not learned from data.

The ImageNet Challenge Sets the Stage

In 2010, Fei-Fei Li's lab at Stanford announced the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The dataset: 1.2 million images, 1000 object categories. The task: classify each image correctly. The metric: top-5 error (is the correct label in your top 5 predictions?).
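The top-5 metric is easy to pin down in code. The sketch below is a minimal illustration, not ILSVRC tooling: the function name `top_k_error` and the toy scores are made up for this example.

```python
from typing import Sequence


def top_k_error(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 5) -> float:
    """Fraction of examples whose true label is NOT among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this example
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy example: 2 images, 6 classes, true class is 0 for both
scores = [
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],  # class 0 has the lowest score: a top-5 miss
    [0.30, 0.25, 0.20, 0.15, 0.05, 0.05],  # class 0 has the highest score: a hit
]
labels = [0, 0]
print(top_k_error(scores, labels, k=5))  # 0.5
```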

From 2010 to 2011, the winning solutions used traditional feature pipelines - HOG, SIFT, Fisher Vectors, SVMs. The best top-5 error in 2011: 25.8%, down from 28.2% in 2010. Progress was measured in a couple of points per year.

Then came 2012.


Part 2: AlexNet 2012 - The Moment Everything Changed

The Contest Result That Shocked Computer Vision

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted a deep convolutional neural network to ILSVRC 2012. Their top-5 error: 15.3%. Second place: 26.2%.

A ten-point gap. In a benchmark where annual progress was measured in single digits. The entire computer vision community had to stop and reconsider everything.

Why was this so stunning? Because deep networks were supposed to be untrainable. The conventional wisdom was that gradients vanished during backpropagation in deep networks - they became too small before reaching the early layers, so networks with more than 3-4 layers just did not converge meaningfully. Hinton's group had been arguing for years that this problem was solvable. ILSVRC 2012 was the undeniable proof.

What Made AlexNet Work

AlexNet was not a single innovation - it was a combination of engineering decisions that, together, unlocked deep networks:

1. ReLU Instead of Sigmoid or Tanh

Previous networks used sigmoid or tanh activations. Both saturate: when inputs are large or small, the derivative approaches zero, and the gradient that flows backward becomes vanishingly small. Chain-multiply many near-zero gradients across many layers and you get essentially nothing reaching the first layer.

ReLU - Rectified Linear Unit - is f(x) = max(0, x). Its derivative is 1 for positive inputs and 0 for negative. No saturation for positive activations. In the AlexNet paper, a small four-layer CNN with ReLUs reached 25% training error on CIFAR-10 roughly 6x faster than the same network with tanh units.
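A rough way to see the saturation argument is to multiply activation derivatives along one path through the network. This is a simplified sketch: it ignores the weights and assumes every layer sees the same pre-activation, and the helper name `chained_grad` is invented for illustration.

```python
import math


def sigmoid_grad(x: float) -> float:
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)  # peaks at 0.25 when x = 0


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0


def chained_grad(grad_fn, pre_activation: float, depth: int) -> float:
    """Product of per-layer activation derivatives along one path.

    Simplification: same pre-activation at every layer, weights ignored.
    """
    return grad_fn(pre_activation) ** depth


for depth in (2, 8, 20):
    # Sigmoid is evaluated at its best case (x = 0); ReLU at any positive input
    print(depth, chained_grad(sigmoid_grad, 0.0, depth), chained_grad(relu_grad, 1.0, depth))
```

Even in sigmoid's best case the product shrinks by a factor of 4 per layer, while an active ReLU path passes the gradient through unchanged.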

tip

ReLU is one of the core reasons deep networks became trainable. Not a better optimizer, not a smarter architecture. Just replacing saturating activations with a non-saturating one. Simple insights have enormous consequences.

2. Dropout for Regularization

With 60 million parameters and only 1.2 million training images, overfitting was a serious risk. AlexNet applied dropout with p=0.5 to the two fully connected layers - randomly zeroing half the neurons during each forward pass. This forced the network to learn redundant representations and dramatically reduced co-adaptation between neurons.

3. Data Augmentation

The training pipeline extracted random 224×224 crops from 256×256 images and applied horizontal flips. At test time, predictions were averaged across 10 crops (4 corners + center, plus their flips). This gave the network many more effective training examples without additional labeling cost.
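The test-time crop layout can be sketched as plain box arithmetic. The function name `five_crop_boxes` and the (left, top, right, bottom) box convention are choices made for this example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def five_crop_boxes(image_size: int = 256, crop_size: int = 224) -> List[Box]:
    """Boxes for the four corner crops plus the center crop of a square image."""
    margin = image_size - crop_size
    center = margin // 2
    offsets = [(0, 0), (margin, 0), (0, margin), (margin, margin), (center, center)]
    return [(x, y, x + crop_size, y + crop_size) for x, y in offsets]


boxes = five_crop_boxes()
num_test_views = 2 * len(boxes)  # each crop is also evaluated horizontally flipped
print(boxes[-1])       # (16, 16, 240, 240)
print(num_test_views)  # 10
```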

4. Two-GPU Training

In 2012, a single GPU could not hold AlexNet's weights and activations in memory. Krizhevsky split the network across two GTX 580 GPUs (3 GB each), with the GPUs communicating only at certain layers: the third convolutional layer and the fully connected layers. This was a practical constraint that shaped the architecture.

5. Local Response Normalization (LRN)

AlexNet used LRN to normalize activations across adjacent feature maps, inspired by lateral inhibition in neuroscience. This was later found to be largely unnecessary and was replaced by batch normalization.

AlexNet Architecture in Detail

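Input: 224×224×3 (as stated in the paper; the 55×55 Conv1 output below requires a 227×227 input or 2 pixels of padding)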

Conv1: 96 filters, 11×11, stride 4 → 55×55×96 → ReLU → LRN → MaxPool 3×3/2 → 27×27×96
Conv2: 256 filters, 5×5, pad 2 → 27×27×256 → ReLU → LRN → MaxPool 3×3/2 → 13×13×256
Conv3: 384 filters, 3×3, pad 1 → 13×13×384 → ReLU
Conv4: 384 filters, 3×3, pad 1 → 13×13×384 → ReLU
Conv5: 256 filters, 3×3, pad 1 → 13×13×256 → ReLU → MaxPool 3×3/2 → 6×6×256

Flatten → 9,216
FC6: 4096 → ReLU → Dropout(0.5)
FC7: 4096 → ReLU → Dropout(0.5)
FC8: 1000 → Softmax

Total parameters: ~62 million, dominated by FC layers. FC6 alone (9216 × 4096) is 37.7M parameters.
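The shapes and the parameter total can be checked with two small formulas. This sketch assumes a 227×227 input with no padding in Conv1 and dense (ungrouped) convolutions; the original two-GPU model split some convolutions across GPUs, which gives a somewhat lower count. The helper names are invented for this example.

```python
def conv_out(size: int, kernel: int, stride: int = 1, pad: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1


def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights plus biases of a dense (ungrouped) convolution."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out


# Spatial sizes, assuming a 227x227 input and no padding in Conv1
s = conv_out(227, 11, stride=4)        # Conv1   -> 55
s = conv_out(s, 3, stride=2)           # MaxPool -> 27
s = conv_out(s, 5, pad=2)              # Conv2   -> 27
s = conv_out(s, 3, stride=2)           # MaxPool -> 13
s = conv_out(s, 3, pad=1)              # Conv3-5 keep 13
final_size = conv_out(s, 3, stride=2)  # MaxPool -> 6
flat = final_size * final_size * 256

conv_total = (conv_params(3, 96, 11) + conv_params(96, 256, 5)
              + conv_params(256, 384, 3) + conv_params(384, 384, 3)
              + conv_params(384, 256, 3))
fc_total = fc_params(flat, 4096) + fc_params(4096, 4096) + fc_params(4096, 1000)

print(conv_out(224, 11, stride=4))  # 54: a 224 input without padding does not give 55
print(flat)                         # 9216
print(conv_total + fc_total)        # 62378344
print(round(fc_total / (conv_total + fc_total), 3))  # 0.94
```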

The architecture looks messy by modern standards - inconsistent kernel sizes, LRN that turned out not to matter much, filter splits across GPUs. But it worked. And that was what mattered in 2012.


Part 3: VGGNet 2014 - Deeper but Simpler

The Problem with AlexNet's Architecture

AlexNet used 11×11 kernels in the first layer and 5×5 in the second. Large kernels see a bigger patch of the image at once, which sounds useful, but they carry a steep parameter cost. An 11×11 filter has 121 weights per channel; a 3×3 filter has 9.

Karen Simonyan and Andrew Zisserman at Oxford's Visual Geometry Group (VGG) asked: What if we replaced all large kernels with stacks of 3×3 kernels?

The Key Insight: Equivalent Receptive Fields, Fewer Parameters

Two consecutive 3×3 convolutions cover the same spatial region as one 5×5 convolution. Three consecutive 3×3 convolutions cover the same region as one 7×7. But the parameter counts differ substantially:

  • One 5×5 conv with C input/output channels: 5 × 5 × C × C = 25C² parameters
  • Two 3×3 convs with C channels: 2 × 3 × 3 × C × C = 18C² - 28% fewer

And crucially: stacking two 3×3 convolutions gives you an extra ReLU in between. That additional non-linearity makes the function space richer. You can represent more complex functions with two 3×3 layers than with one 5×5, even with fewer parameters.
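The receptive-field and parameter arithmetic above can be verified in a few lines. The helper names are illustrative, and the sketch assumes stride-1 convolutions with the same channel count C in and out of every layer.

```python
from typing import Sequence


def stacked_receptive_field(kernels: Sequence[int]) -> int:
    """Receptive field of stride-1 convolutions applied one after another."""
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf


def stacked_params(kernels: Sequence[int], channels: int) -> int:
    """Weight count (biases ignored) with `channels` in and out of every layer."""
    return sum(k * k * channels * channels for k in kernels)


C = 64
print(stacked_receptive_field([3, 3]), stacked_receptive_field([5]))     # 5 5
print(stacked_receptive_field([3, 3, 3]), stacked_receptive_field([7]))  # 7 7
print(stacked_params([3, 3], C) / stacked_params([5], C))                # 0.72
print(round(stacked_params([3, 3, 3], C) / stacked_params([7], C), 3))   # 0.551
```

The saving grows with kernel size: three 3×3 layers use about 45% fewer weights than one 7×7 layer.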

VGG-16 Architecture

VGG-16 consists of 16 learnable layers (13 conv + 3 FC), organized into 5 blocks:

Input: 224×224×3

Block 1: Conv(64, 3×3) → ReLU → Conv(64, 3×3) → ReLU → MaxPool 2×2 → 112×112×64
Block 2: Conv(128, 3×3) → ReLU → Conv(128, 3×3) → ReLU → MaxPool 2×2 → 56×56×128
Block 3: Conv(256, 3×3) × 3 → MaxPool 2×2 → 28×28×256
Block 4: Conv(512, 3×3) × 3 → MaxPool 2×2 → 14×14×512
Block 5: Conv(512, 3×3) × 3 → MaxPool 2×2 → 7×7×512

Flatten → 25,088
FC: 4096 → ReLU → Dropout
FC: 4096 → ReLU → Dropout
FC: 1000 → Softmax

Total: 138 million parameters. Achieved 7.3% top-5 error on ImageNet - a significant improvement over AlexNet.

VGG's Legacy and Its Problem

VGG's uniformity made it easy to understand and modify. For years, VGG-16 was the default feature extractor for transfer learning and the backbone for object detection. Its simplicity made the features predictable.

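The problem: 138 million parameters, with roughly 90% of them in the FC layers (25088 × 4096 + 4096 × 4096 + 4096 × 1000). Almost all of the parameters and memory go to classification, while almost all of the compute (about 15 billion multiply-accumulates per image) goes to the convolutional feature extractor. This is architecturally wasteful.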
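A quick tally makes the imbalance concrete. This sketch counts parameters and multiply-accumulates (MACs) from the VGG-16 layout listed earlier; it assumes 224×224 inputs, "same" padding, and biases on every layer, and the variable names are made up for the example.

```python
# (output channels, number of 3x3 convs, spatial size during the block)
VGG16_BLOCKS = [(64, 2, 224), (128, 2, 112), (256, 3, 56), (512, 3, 28), (512, 3, 14)]
FC_LAYERS = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_params = conv_macs = 0
c_in = 3
for c_out, n_convs, size in VGG16_BLOCKS:
    for _ in range(n_convs):
        weights = 3 * 3 * c_in * c_out
        conv_params += weights + c_out      # weights + biases
        conv_macs += weights * size * size  # one MAC per weight per output pixel
        c_in = c_out

fc_params = sum(n_in * n_out + n_out for n_in, n_out in FC_LAYERS)
fc_macs = sum(n_in * n_out for n_in, n_out in FC_LAYERS)

total_params = conv_params + fc_params
print(total_params)                                 # 138357544
print(round(fc_params / total_params, 3))           # 0.894 of parameters are in FC layers
print(round(conv_macs / (conv_macs + fc_macs), 3))  # 0.992 of MACs are in conv layers
```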

More depth also ran into diminishing returns. VGG-19 is only marginally better than VGG-16. Simply adding more uniform layers was not scaling. Nobody had a clear explanation for why - yet.


Part 4: GoogLeNet/Inception 2014 - Wider Instead of Deeper

A Different Question: What Scale Are the Features?

While VGG was getting deeper with more of the same, a team at Google (Szegedy et al.) asked a different question: Instead of choosing one kernel size, what if we used all of them simultaneously?

Objects in images appear at different scales. A nearby cat fills the frame; a distant cat is a small blob. A fixed 3×3 kernel captures fine detail but misses large-scale context. A 5×5 kernel captures broader context but is expensive and misses fine detail.

The Inception module runs multiple operations in parallel on the same input and concatenates the results:

                  Input feature map
                          |
     ┌────────────┬───────┴──────┬──────────────┐
     ↓            ↓              ↓              ↓
 1×1 conv     1×1 conv       1×1 conv      3×3 MaxPool
     |            ↓              ↓              ↓
     |        3×3 conv       5×5 conv       1×1 conv
     └────────────┴───────┬──────┴──────────────┘
                          ↓
            Concatenate along channel dim

The 1×1 convolutions before the 3×3 and 5×5 are critical bottlenecks that reduce channel count before expensive spatial operations. Without them, a 5×5 conv mapping a 256-channel feature map to 256 channels would cost 5×5×256×256 = 1.6M parameters per Inception module. With a 1×1 bottleneck reducing from 256 to 32 channels first (256×32 = 8,192 parameters), the 5×5 conv costs 5×5×32×256 = 204,800 - roughly an 8x reduction, even after counting the bottleneck itself.

Why 1×1 Convolutions Are More Useful Than They Look

A 1×1 convolution does not change spatial dimensions. It mixes information across channels at each pixel location independently. This sounds trivial - why would you want to mix channels without any spatial context?

Two concrete reasons:

  1. Dimensionality reduction: compress 256 channels down to 64 before an expensive 3×3 or 5×5 operation
  2. Adding non-linearity: a 1×1 conv followed by ReLU introduces another non-linear transformation for essentially zero spatial computation cost

Lin et al. (2013) called this idea "Network in Network." GoogLeNet turned it into a systematic design primitive.
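To make "mixing channels at each pixel independently" concrete, here is a dependency-free sketch of a 1×1 convolution as a per-pixel matrix multiply. The nested-list layout `[channel][row][col]` and the function name `conv1x1` are choices made for this illustration; a real implementation would use a tensor library.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def conv1x1(x: FeatureMap, weight: List[List[float]]) -> FeatureMap:
    """1x1 convolution without bias. weight[o][i] maps input channel i to output channel o."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [sum(w_oi * x[i][r][c] for i, w_oi in enumerate(w_o)) for c in range(width)]
            for r in range(height)
        ]
        for w_o in weight
    ]


# 3 input channels on a 2x2 grid, reduced to 2 output channels
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 20.0], [30.0, 40.0]],
    [[100.0, 200.0], [300.0, 400.0]],
]
weight = [
    [1.0, 0.0, 0.0],  # output 0 copies input channel 0
    [0.0, 1.0, 1.0],  # output 1 sums input channels 1 and 2
]
y = conv1x1(x, weight)
print(len(y), len(y[0]), len(y[0][0]))  # 2 2 2: fewer channels, same spatial size
print(y[1])                             # [[110.0, 220.0], [330.0, 440.0]]
```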

GoogLeNet Key Facts

  • 22 layers deep but only 6.8 million parameters - roughly 9x fewer than AlexNet's 62M
  • No large FC layers at the end. Instead: global average pooling collapses the 7×7 feature map to a single vector per channel, then one FC layer. This eliminates tens of millions of parameters
  • Two auxiliary classifiers branch off the middle of the network during training, computing softmax losses at intermediate layers to inject gradient signal directly into the middle of the network
  • Won ILSVRC 2014 with 6.67% top-5 error
note

The auxiliary classifiers were later shown to contribute little to final accuracy - they mainly acted as regularization. ResNet made them obsolete by providing a clean gradient highway through skip connections.


Part 5: The Degradation Problem - The Mystery That Launched ResNet

Here is something that confused researchers between 2013 and 2015: deeper networks are not always better, and nobody could explain why.

Intuitively, a deeper network should be at least as good as a shallower one. Consider: take a 20-layer network that achieves some accuracy. Now build a 56-layer network where the first 20 layers are identical and the remaining 36 layers are identity functions (output = input). The 56-layer network should achieve the same accuracy as the 20-layer one at minimum, because the extra layers do nothing.

But in practice, a 56-layer plain network trained with standard SGD performs worse than a 20-layer plain network - not just on validation, but on training data too.

This definitively rules out overfitting. Overfitting gives lower training error but higher validation error. Here, both training and validation error are higher for the deeper network. The optimization itself is failing - not generalization.

He et al. (2015) named this the degradation problem. The hypothesis: learning an identity mapping through multiple stacked non-linear layers is surprisingly hard for gradient-based optimization. The network architecturally can represent the identity, but SGD cannot find it.

This was the problem ResNet was designed to solve.


Part 6: ResNet 2015 - The Skip Connection Breakthrough

The Key Insight: Learn the Residual, Not the Full Mapping

He et al.'s insight was elegant: do not ask the layers to learn the desired output directly. Ask them to learn the difference between the desired output and the input.

Instead of asking a block to learn H(x) (the desired mapping), reformulate it as H(x) = F(x) + x, where F(x) = H(x) - x is the residual. The block only needs to learn F(x).

Why is this easier? If the optimal function is close to the identity (which it often is in deep networks - many layers are doing minor refinements), then the optimal F(x) is close to zero. Pushing a function toward zero is far easier for gradient descent than pushing a stack of non-linear layers toward an exact identity mapping. The weights of F can be initialized near zero, and the block starts as an approximate identity - a natural, safe starting point.

The implementation adds a skip connection (also called a shortcut connection) that passes the input directly to the output, bypassing the convolutional layers.

Why Gradients Flow Through Skip Connections

The gradient math makes the benefit concrete. If y = F(x) + x, then:

dL/dx = dL/dy * (dF/dx + 1)

The +1 term is the key. No matter how small dF/dx becomes (the vanishing gradient problem), dL/dx still contains an unattenuated copy of dL/dy that arrives through the skip path. The skip connection is a gradient highway: part of the gradient flows from the loss backward through every block without being scaled down by that block's weights.

This is why ResNet can train 152-layer networks successfully, while a 56-layer plain network fails to outperform a 20-layer one.
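The effect of the +1 term can be seen in a toy scalar model. This is only a sketch under strong assumptions: each "layer" is F(x) = w · tanh(x) with the same small positive weight, and the function name `backprop_through_stack` is invented here. Real networks have matrix-valued Jacobians, and dF/dx can be negative, so the numbers are illustrative only.

```python
import math


def backprop_through_stack(depth: int, weight: float, residual: bool, x: float = 0.5) -> float:
    """d(output)/d(input) for `depth` scalar layers with F(x) = weight * tanh(x).

    residual=True uses y = x + F(x); residual=False uses y = F(x).
    """
    grad = 1.0
    for _ in range(depth):
        local = weight * (1.0 - math.tanh(x) ** 2)  # dF/dx at the current input
        if residual:
            grad *= 1.0 + local                     # the +1 comes from the skip path
            x = x + weight * math.tanh(x)
        else:
            grad *= local
            x = weight * math.tanh(x)
    return grad


plain = backprop_through_stack(depth=50, weight=0.1, residual=False)
skip = backprop_through_stack(depth=50, weight=0.1, residual=True)
print(f"plain stack:    {plain:.3e}")  # vanishingly small
print(f"residual stack: {skip:.3e}")   # stays at or above 1 in this toy setup
```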

Basic Block vs Bottleneck Block

ResNet comes in two variants depending on depth.

Basic Block (ResNet-18, ResNet-34):

x → Conv(3×3, C→C, stride=s) → BN → ReLU
→ Conv(3×3, C→C, stride=1) → BN
→ (+x via skip, projected if needed)
→ ReLU

Bottleneck Block (ResNet-50, ResNet-101, ResNet-152):

x → Conv(1×1, C→C/4) → BN → ReLU [compress]
→ Conv(3×3, C/4→C/4, s) → BN → ReLU [spatial mix at low cost]
→ Conv(1×1, C/4→C) → BN [expand]
→ (+x via skip, projected if needed)
→ ReLU

The bottleneck reduces parameters by doing expensive 3×3 spatial convolution at a fraction of the channels. For ResNet-50 with 256-channel blocks:

  • Direct 3×3 conv: 256 × 256 × 9 = 589,824 parameters
  • Bottleneck 3×3 (on 64 channels): 64 × 64 × 9 = 36,864 parameters - 16x fewer

This is why ResNet-50 fits 50 layers into 25.6M parameters, while the 16-layer VGG-16 needs 138M.
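Counting the whole block, not just the 3×3 layer, gives a similar ratio. A minimal sketch, assuming bias-free convolutions and ignoring BatchNorm parameters and projection shortcuts; the function names are illustrative.

```python
def conv_weights(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a bias-free convolution."""
    return kernel * kernel * c_in * c_out


def basic_block_weights(channels: int) -> int:
    """Two 3x3 convs at full width."""
    return 2 * conv_weights(channels, channels, 3)


def bottleneck_weights(channels: int, reduction: int = 4) -> int:
    mid = channels // reduction
    return (conv_weights(channels, mid, 1)     # 1x1 compress
            + conv_weights(mid, mid, 3)        # 3x3 at reduced width
            + conv_weights(mid, channels, 1))  # 1x1 expand


print(basic_block_weights(256))  # 1179648
print(bottleneck_weights(256))   # 69632
print(round(basic_block_weights(256) / bottleneck_weights(256), 1))  # 16.9
```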

Projection Shortcuts

When a block changes spatial resolution (stride=2) or channel count, the skip connection cannot be a plain identity - the shapes do not match. The solution: a projection shortcut, which is a 1×1 convolution applied to the skip path to match output dimensions.

When downsampling: skip = Conv(1×1, stride=2) → BN
When channels change: skip = Conv(1×1, new_channels) → BN

The projection shortcut has learnable weights but is computationally cheap relative to the main path.

ResNet-50 Full Architecture

Input: 224×224×3

Stem: Conv(7×7, 64, stride=2) → BN → ReLU → MaxPool(3×3, stride=2) → 56×56×64

Layer1: 3 × Bottleneck(64→256 channels) 56×56 [no spatial downsampling]
Layer2: 4 × Bottleneck(128→512 channels) 28×28 [stride=2 in first block]
Layer3: 6 × Bottleneck(256→1024 channels) 14×14 [stride=2]
Layer4: 3 × Bottleneck(512→2048 channels) 7×7 [stride=2]

Global Average Pool → 2048-dim vector
FC(1000) → Softmax

ResNet won ILSVRC 2015 with 3.57% top-5 error using an ensemble. Single-model ResNet-152: 4.49% top-5. The winning 2014 model (GoogLeNet): 6.67%. That is a 3-point improvement in one year, made possible largely by the depth that skip connections unlocked.

tip

ResNet-50 remains one of the most widely deployed CNN backbones in production computer vision systems. Its combination of accuracy, parameter efficiency, excellent pretrained weights, and broad ecosystem compatibility (Detectron2, MMDetection, Hugging Face) has kept it relevant a decade later.

ResNet Family Summary

| Model | Block Type | Params | ImageNet Top-1 |
|---|---|---|---|
| ResNet-18 | Basic (2 conv) | 11.7M | 69.8% |
| ResNet-34 | Basic (2 conv) | 21.8M | 73.3% |
| ResNet-50 | Bottleneck (3 conv) | 25.6M | 76.1% |
| ResNet-101 | Bottleneck (3 conv) | 44.5M | 77.4% |
| ResNet-152 | Bottleneck (3 conv) | 60.2M | 78.3% |

Part 7: DenseNet 2017 - Connect Everything

Taking Skip Connections to Their Logical Extreme

ResNet adds one skip connection per block: input bypasses two or three layers, adds to output. DenseNet (Huang et al., 2017) asked: what if every layer received the feature maps from all preceding layers?

In a DenseNet dense block with L layers, each layer receives the concatenation of the block input and the feature maps of every layer before it:

Layer 1 → h1
Layer 2 input: [x, h1] (concatenation, not addition)
Layer 3 input: [x, h1, h2]
Layer 4 input: [x, h1, h2, h3]
...

This creates L(L+1)/2 connections in a block of L layers. With 12 layers: 78 direct connections.

Why concatenation instead of addition? Addition blends features from different layers into the same feature map, potentially losing information. Concatenation keeps all feature maps separate and lets later layers selectively use whatever is useful - early edge detectors remain accessible all the way to the final layers.

Dense Block and Transition Layer

A DenseNet alternates between dense blocks (all-to-all connections within) and transition layers (compress + downsample between blocks):

  • Dense block: L layers, each producing k new channels (growth rate). Layer l has input width k0 + k × (l-1) channels where k0 is initial channels
  • Transition layer: 1×1 conv (reduce channels by half) → 2×2 average pool (halve spatial dimensions)

With growth rate k=32, each layer adds only 32 new channels. This is extremely parameter-efficient.
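The channel bookkeeping can be sketched directly from the formulas above. The values k0=64 and 6 layers are example numbers chosen for illustration, and the function names are invented here.

```python
def dense_block_input_widths(k0: int, growth_rate: int, num_layers: int) -> list:
    """Input channel count seen by each layer l = 1..num_layers."""
    return [k0 + growth_rate * (l - 1) for l in range(1, num_layers + 1)]


def dense_block_connections(num_layers: int) -> int:
    """Direct connections, counting the block input as a source: L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2


k0, k, L = 64, 32, 6
widths = dense_block_input_widths(k0, k, L)
block_out = k0 + k * L       # everything is concatenated at the end of the block
after_transition = block_out // 2  # transition layer halves the channels

print(widths)                       # [64, 96, 128, 160, 192, 224]
print(block_out, after_transition)  # 256 128
print(dense_block_connections(12))  # 78
```

Each layer only ever produces k new channels; the growing input width comes entirely from concatenating earlier outputs.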

DenseNet-201: 20M parameters, 77.3% top-1 - competitive with ResNet-152 (60M params). DenseNet is particularly popular in medical imaging where training data is scarce and feature reuse on small datasets matters significantly.

note

DenseNet's feature reuse idea is shared by many architectures. The core insight that earlier feature maps remain useful throughout the network also shows up in U-Net's encoder-decoder skip connections (for medical segmentation, and predating DenseNet), various detection architectures, and multi-scale feature fusion heads.


Part 8: MobileNet and EfficientNet - Efficiency as a Design Goal

The Deployment Gap

By 2017, the pattern was clear: bigger networks are more accurate. ResNet-152 outperforms ResNet-50. More data helps. Longer training helps. But this research assumed server-grade GPUs. What about phones, edge devices, or servers where both compute cost and inference latency matter?

MobileNet: Depthwise Separable Convolutions

MobileNet (Howard et al., 2017) is built around depthwise separable convolutions, which factorize a standard convolution into two operations:

  1. Depthwise convolution: apply one 3×3 filter per input channel independently (spatial mixing, no channel mixing)
  2. Pointwise convolution: apply 1×1 convolutions to mix channels (channel mixing, no spatial mixing)

Parameter comparison for a layer with C_in input channels, C_out output channels, 3×3 kernel:

  • Standard conv: C_in × C_out × 9 parameters
  • Depthwise separable: C_in × 9 (depthwise) + C_in × C_out (pointwise) parameters
  • Reduction factor: approximately 1/C_out + 1/9 - for 256 output channels: roughly 8-9x fewer parameters
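The parameter comparison above can be checked numerically. A minimal sketch with biases ignored; the function names are illustrative.

```python
def standard_conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    return c_in * c_out * kernel * kernel


def depthwise_separable_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    depthwise = c_in * kernel * kernel  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1x1 conv mixes channels
    return depthwise + pointwise


c_in, c_out = 256, 256
std = standard_conv_params(c_in, c_out)
sep = depthwise_separable_params(c_in, c_out)
print(std, sep)                     # 589824 67840
print(round(std / sep, 2))          # 8.69 times fewer parameters
print(round(1 / c_out + 1 / 9, 4))  # 0.115, the cost ratio sep / std
```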

MobileNetV2 added an inverted residual: expand channels with 1×1, apply depthwise 3×3, compress with 1×1, with the skip connection joining the narrow ends. The expansion factor is typically 6 - the opposite of the standard bottleneck, but still efficient because a depthwise conv is cheap even in the expanded channel space, where there is more representational capacity.

EfficientNet: The Compound Scaling Discovery

EfficientNet (Tan & Le, 2019) asked a surprisingly fundamental question: Given a fixed compute budget increase, what is the optimal way to scale a neural network?

Three dimensions can be scaled:

  • Depth: more layers - better feature hierarchy but harder to optimize past a point
  • Width: more channels per layer - richer representations but saturates quickly without depth
  • Resolution: larger input images - more spatial detail but quadratic cost increase

The critical finding: all three dimensions are interdependent. A deeper network benefits from wider channels. A wider network with more channels benefits from higher resolution. Scaling one dimension without the others hits diminishing returns quickly.

The compound scaling rule:

  • depth: d = alpha^phi
  • width: w = beta^phi
  • resolution: r = gamma^phi
  • constraint: alpha × beta² × gamma² ≈ 2 (total compute doubles when phi increases by 1)

The constants alpha=1.2, beta=1.1, gamma=1.15 are found by grid search. EfficientNet-B0 is the NAS-optimized base architecture. B1 through B7 increase phi from 1 to 7.

| Model | phi | Top-1 | Params |
|---|---|---|---|
| EfficientNet-B0 | 0 | 77.1% | 5.3M |
| EfficientNet-B1 | 1 | 79.1% | 7.8M |
| EfficientNet-B4 | 4 | 82.9% | 19M |
| EfficientNet-B7 | 7 | 84.3% | 66M |

EfficientNet-B4 achieves 82.9% with 19M parameters. ResNet-152 achieves 78.3% with 60M parameters. Better accuracy at one-third the parameter count.
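The compound scaling rule can be sketched as a small calculator. It uses the alpha, beta, gamma values quoted above and the usual approximation that FLOPs scale linearly with depth and quadratically with width and resolution. The function name `compound_scale` is invented for this example, and the released B1-B7 configurations may round or adjust these multipliers.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from the text


def compound_scale(phi: float) -> dict:
    """Multipliers relative to the base network for a given compound coefficient."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth, quadratically with width and resolution
    flops = depth * width ** 2 * resolution ** 2
    return {"depth": depth, "width": width, "resolution": resolution, "flops": flops}


print(round(ALPHA * BETA ** 2 * GAMMA ** 2, 3))  # 1.92, close to the target of 2
for phi in (1, 4, 7):
    scaled = compound_scale(phi)
    print(phi, {name: round(value, 2) for name, value in scaled.items()})
```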

MBConv: EfficientNet's Building Block

EfficientNet uses Mobile Inverted Bottleneck Convolution (MBConv):

x
→ 1×1 conv (expand channels by factor k, typically 6) → BN → Swish
→ 3×3 depthwise conv → BN → Swish
→ Squeeze-and-Excitation block
→ 1×1 conv (project back to original channels) → BN
→ (+x, if same shape)

Squeeze-and-Excitation: global average pool the feature map, pass through two small FC layers with sigmoid, multiply each channel by its learned attention weight. The network learns which channels matter for the current input.

Swish activation: x × sigmoid(x). Smooth, non-monotonic, consistently outperforms ReLU in this architecture family.
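The Squeeze-and-Excitation step and Swish can be sketched without any framework. This is an illustration under assumptions: feature maps are nested lists indexed `[channel][row][col]`, the FC layers have no bias, the weights are hand-picked toy values, and using Swish inside the gate is a choice of this sketch (the original SE design uses ReLU there).

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # [channel][row][col]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def swish(x: float) -> float:
    return x * sigmoid(x)


def linear(x: List[float], weight: List[List[float]]) -> List[float]:
    """Bias-free fully connected layer with weight[o][i]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def squeeze_excite(x: FeatureMap, w_reduce: List[List[float]],
                   w_expand: List[List[float]]) -> FeatureMap:
    # Squeeze: global average pool, one scalar per channel
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # Excite: reduce -> Swish -> expand -> sigmoid gives a gate in (0, 1) per channel
    hidden = [swish(h) for h in linear(squeezed, w_reduce)]
    gates = [sigmoid(g) for g in linear(hidden, w_expand)]
    # Scale: multiply every value in a channel by that channel's gate
    return [[[v * gate for v in row] for row in ch] for ch, gate in zip(x, gates)]


x = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]  # 2 channels, 2x2
w_reduce = [[1.0, 1.0]]     # 2 channels -> 1 hidden unit (toy weights)
w_expand = [[0.0], [10.0]]  # channel 0 gets gate 0.5, channel 1 a gate near 1
out = squeeze_excite(x, w_reduce, w_expand)
print(out[0][0][0])  # 0.5
print(out[1][0][0])  # just under 2.0
```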


Part 9: ConvNeXt 2022 - CNNs Strike Back

The Vision Transformer Disruption

In 2021, Vision Transformers (ViT - Dosovitskiy et al.) showed that pure attention-based architectures could match or outperform CNNs on ImageNet if trained with enough data. Swin Transformer (Liu et al., 2021) made ViT practical by introducing hierarchical structure and local window attention, achieving better accuracy-efficiency tradeoffs than CNNs on many benchmarks.

The narrative in the research community began shifting: CNNs are old news. Transformers are the future.

Then came ConvNeXt.

The Systematic Experiment

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu and colleagues at Meta AI Research asked: What happens if we take a ResNet-50 and systematically apply every design choice from Vision Transformers, one at a time?

They ran the experiment with rigorous ablation - one change at a time, measuring accuracy after each modification. The result was a pure CNN - no attention, no positional embeddings - that matched Swin Transformer on ImageNet and outperformed it on object detection and segmentation.

The Modernization Roadmap

Starting from ResNet-50 at 76.1% top-1 accuracy:

| Change | Accuracy | Notes |
|---|---|---|
| Baseline ResNet-50 | 76.1% | Starting point |
| Better training recipe (300 epochs, AdamW, mixup, label smoothing) | 78.8% | Most of ViT's edge was training, not architecture |
| Stage ratio 1:1:3:1 (3, 3, 9, 3 blocks) + patchify stem (4×4 non-overlapping conv) | 79.5% | Match Swin's macro layout |
| ResNeXt-ify (depthwise conv, expand width to match FLOPs) | 80.5% | Separate spatial/channel mixing |
| Inverted bottleneck (expand 4× then contract) | 80.6% | Mirror transformer FFN: wide in middle |
| Large kernel 7×7 depthwise instead of 3×3 (depthwise conv moved to the start of the block) | 80.6% | Larger receptive field per layer; moving the conv alone dips to 79.9% |
| Replace ReLU with GELU | 80.6% | Smooth activation like transformers |
| Fewer activations per block (one, not three) | 81.3% | Transformers have one activation per FFN |
| Fewer norm layers (one per block) | 81.4% | Reduce normalization overhead |
| LayerNorm instead of BatchNorm | 81.5% | Better for variable batch sizes, simpler |
| Separate downsampling layers between stages | 82.0% | Match Swin's patch merging between stages |

ConvNeXt-Tiny: 28M parameters, 82.1% top-1 - edging out Swin-T (28M, 81.3%) with simpler code and no window attention complexity.

What ConvNeXt Demonstrated

The key lesson: most of ViT's performance advantage came from training recipes (AdamW, stronger augmentation, longer training) and design choices (LayerNorm, GELU, inverted bottleneck, large kernels) - not from self-attention itself.

A modernized CNN that incorporated these ideas matched the transformer without any attention mechanism. This does not make transformers useless - at very large scale and with vast pretraining data (hundreds of millions of images), ViT's global attention is a genuine advantage. But for typical deployment scales (10M–200M parameters, ImageNet-scale data), ConvNeXt demonstrated that CNNs are not architecturally inferior.

tip

ConvNeXt is worth understanding deeply for interviews. It demonstrates that knowing why something works matters more than adopting the current popular approach. Every design choice was isolated and measured - this is how rigorous architecture research should work.


Part 10: Architecture Comparison Table

| Architecture | Year | Params | Top-1 ImageNet | Key Innovation | Still Used For |
|---|---|---|---|---|---|
| AlexNet | 2012 | 62M | 56.5% | GPUs + ReLU + Dropout | Historical reference only |
| VGG-16 | 2014 | 138M | 71.5% | Uniform 3×3 depth | Perceptual loss in style transfer |
| GoogLeNet | 2014 | 6.8M | 69.8% | Inception module, 1×1 bottlenecks | Multi-scale detection heads |
| ResNet-50 | 2015 | 25.6M | 76.1% | Residual/skip connections | Universal production backbone |
| ResNet-152 | 2015 | 60.2M | 78.3% | Deeper with skip connections | High-accuracy backbone |
| DenseNet-201 | 2017 | 20M | 77.3% | Dense connections, feature reuse | Medical imaging |
| EfficientNet-B0 | 2019 | 5.3M | 77.1% | NAS + compound scaling | Lightweight deployment |
| EfficientNet-B4 | 2019 | 19.3M | 82.9% | Compound scaling (phi=4) | Accuracy-focused production |
| EfficientNet-B7 | 2019 | 66M | 84.3% | Maximum compound scaling | Benchmark accuracy |
| ConvNeXt-T | 2022 | 28.6M | 82.1% | Modernized ResNet | State-of-the-art CNN baseline |
| ConvNeXt-L | 2022 | 197M | 84.3% | Large modernized ResNet | Research benchmark |

Part 11: Choosing an Architecture in Practice

Practical rules:

  • Start with ResNet-50: well-understood, excellent pretrained weights (IMAGENET1K_V2), compatible with Detectron2, MMDetection, and every other framework, easy to debug. The workhorse of production CV.
  • Switch to EfficientNet-B4 if accuracy matters: consistently outperforms ResNet-50 with fewer parameters for classification tasks.
  • Use MobileNetV3 or EfficientNet-Lite for mobile: designed for ARM CPUs and mobile NPUs.
  • Consider ConvNeXt for new projects: benefits from modern training techniques that older architectures were not designed around.
  • Pretrained weights matter more than architecture choice for small datasets: with under 10K images, the quality of pretrained features often dominates. ResNet-50 with IMAGENET1K_V2 weights can beat EfficientNet-B4 with older weights.
  • Always benchmark on your target hardware: parameter count does not predict inference latency. EfficientNet's depthwise convolutions are memory-bound on many backends. Profile first.
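The "profile first" advice can start with a very small timing harness. This is a generic sketch: the function name `benchmark`, the warmup and iteration counts, and the stand-in workload are all assumptions for illustration. In practice the callable would wrap the model's forward pass on a fixed input, and GPU timing additionally needs device synchronization before each clock read.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 5, iters: int = 30) -> Dict[str, float]:
    """Time a zero-argument callable and report latency in milliseconds."""
    for _ in range(warmup):  # warm caches and lazy initialization before timing
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[min(len(samples) - 1, int(0.95 * len(samples)))],
    }


# Stand-in workload; replace with a closure around the model's forward pass
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Reporting a median and a tail percentile, rather than a single mean, makes it easier to spot latency spikes that matter for serving.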

Part 12: PyTorch - Loading and Working with Pretrained Architectures

Loading Pretrained Models with torchvision

import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained ResNet-50 with latest weights
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Inspect the structure (abbreviated summary of the printout shown below)
print(resnet)
# ResNet(
# (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
# (bn1): BatchNorm2d(64)
# (relu): ReLU(inplace=True)
# (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1)
# (layer1): Sequential(3 × Bottleneck, 64→256 channels, 56×56)
# (layer2): Sequential(4 × Bottleneck, 128→512 channels, 28×28)
# (layer3): Sequential(6 × Bottleneck, 256→1024 channels,14×14)
# (layer4): Sequential(3 × Bottleneck, 512→2048 channels, 7×7)
# (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
# (fc): Linear(in_features=2048, out_features=1000, bias=True)
# )

print(resnet.fc.in_features) # 2048 - the feature dimension before the head

Building Residual Blocks from Scratch

Understanding ResNet means being able to implement it. Here are both block types with all edge cases handled:

import torch
import torch.nn as nn
from typing import Optional


class BasicBlock(nn.Module):
"""ResNet basic block - used in ResNet-18 and ResNet-34."""
expansion = 1 # output channels = planes * expansion

def __init__(self, in_planes: int, planes: int, stride: int = 1):
super().__init__()
# First 3×3 conv - may downsample spatially if stride > 1
self.conv1 = nn.Conv2d(in_planes, planes, 3, stride=stride, padding=1, bias=False)
self.bn1 = nn.BatchNorm2d(planes)
# Second 3×3 conv - always stride 1
self.conv2 = nn.Conv2d(planes, planes, 3, stride=1, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(planes)
self.relu = nn.ReLU(inplace=True)

# Projection shortcut: needed when stride > 1 or channels change
self.shortcut: Optional[nn.Sequential] = None
if stride != 1 or in_planes != planes * self.expansion:
self.shortcut = nn.Sequential(
nn.Conv2d(in_planes, planes * self.expansion, 1, stride=stride, bias=False),
nn.BatchNorm2d(planes * self.expansion)
)

def forward(self, x: torch.Tensor) -> torch.Tensor:
identity = x

out = self.relu(self.bn1(self.conv1(x)))
out = self.bn2(self.conv2(out)) # No ReLU before the skip addition

if self.shortcut is not None:
identity = self.shortcut(x)

out = out + identity # The residual addition - this is the magic
return self.relu(out)


class BottleneckBlock(nn.Module):
    """ResNet bottleneck - used in ResNet-50, 101, 152."""
    expansion = 4  # output channels = planes * 4

    def __init__(self, in_planes: int, planes: int, stride: int = 1):
        super().__init__()
        # 1×1: compress channels by 4x
        self.conv1 = nn.Conv2d(in_planes, planes, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        # 3×3: spatial mixing at reduced channel count (cheap because C/4 channels)
        self.conv2 = nn.Conv2d(planes, planes, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        # 1×1: expand back to full channels
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)

        self.shortcut: Optional[nn.Sequential] = None
        if stride != 1 or in_planes != planes * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, planes * self.expansion, 1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * self.expansion)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))  # No ReLU before the skip addition

        if self.shortcut is not None:
            identity = self.shortcut(x)

        return self.relu(out + identity)


# Verify shapes
block_basic = BasicBlock(in_planes=64, planes=64, stride=1)
x = torch.randn(4, 64, 56, 56)
print(block_basic(x).shape) # torch.Size([4, 64, 56, 56])

block_down = BasicBlock(in_planes=64, planes=128, stride=2)
print(block_down(x).shape) # torch.Size([4, 128, 28, 28])

bottleneck = BottleneckBlock(in_planes=256, planes=64, stride=1)
x_b = torch.randn(4, 256, 28, 28)
print(bottleneck(x_b).shape) # torch.Size([4, 256, 28, 28])
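
To make the "cheap because C/4 channels" comment concrete, the sketch below counts convolution weights (BatchNorm ignored) for the bottleneck configuration verified above and compares it with a hypothetical pair of 3×3 convolutions at the full 256-channel width. The helper name conv_weights is an illustration and not part of the lesson's code.

```python
def conv_weights(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a bias-free k×k convolution."""
    return c_in * c_out * k * k

full, reduced = 256, 64  # in_planes=256, planes=64 as in the shape check above

# Bottleneck: 1×1 compress -> 3×3 at reduced width -> 1×1 expand
bottleneck = (
    conv_weights(full, reduced, 1)
    + conv_weights(reduced, reduced, 3)
    + conv_weights(reduced, full, 1)
)

# Hypothetical alternative: two 3×3 convs at the full channel width
two_3x3 = 2 * conv_weights(full, full, 3)

print(bottleneck)                      # 69632
print(two_3x3)                         # 1179648
print(round(two_3x3 / bottleneck, 1))  # 16.9
```

Under these assumptions the bottleneck needs roughly 17x fewer convolution weights than the full-width alternative, which is why ResNet-50 and deeper variants can afford 4x wider block outputs.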

Replacing Classification Heads for Custom Tasks

def build_classifier(
    architecture: str,
    num_classes: int,
    pretrained: bool = True,
    dropout: float = 0.3
) -> nn.Module:
    """
    Build a pretrained model with a custom classification head.
    Handles the different head structures across architectures.
    """
    if architecture == "resnet50":
        model = models.resnet50(
            weights=models.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
        )
        in_features = model.fc.in_features  # 2048
        model.fc = nn.Sequential(
            nn.Dropout(p=dropout),
            nn.Linear(in_features, num_classes)
        )

    elif architecture == "efficientnet_b4":
        model = models.efficientnet_b4(
            weights=models.EfficientNet_B4_Weights.IMAGENET1K_V1 if pretrained else None
        )
        in_features = model.classifier[-1].in_features  # 1792
        model.classifier = nn.Sequential(
            nn.Dropout(p=dropout, inplace=True),
            nn.Linear(in_features, num_classes)
        )

    elif architecture == "convnext_tiny":
        model = models.convnext_tiny(
            weights=models.ConvNeXt_Tiny_Weights.IMAGENET1K_V1 if pretrained else None
        )
        in_features = model.classifier[-1].in_features  # 768
        model.classifier[-1] = nn.Linear(in_features, num_classes)

    else:
        # Fall back to timm for anything else
        import timm
        model = timm.create_model(architecture, pretrained=pretrained, num_classes=num_classes)

    return model


# Verify output shapes
for arch in ["resnet50", "efficientnet_b4", "convnext_tiny"]:
    model = build_classifier(arch, num_classes=100, pretrained=False)
    x = torch.randn(2, 3, 224, 224)
    out = model(x)
    print(f"{arch}: {out.shape}")  # (2, 100) for each

Using timm for Access to 700+ Models

timm (PyTorch Image Models, by Ross Wightman) provides a consistent API across hundreds of architectures, often with stronger training recipes than torchvision's original weights:

import timm
import torch

# Browse available models (exact counts depend on the installed timm version)
print(len(timm.list_models("resnet*", pretrained=True)))  # 80+ variants
print(len(timm.list_models("efficientnet*", pretrained=True)))  # 50+ variants

# Load model with automatic head replacement
model = timm.create_model("efficientnet_b4", pretrained=True, num_classes=50)
x = torch.randn(4, 3, 380, 380)  # 380×380 is the EfficientNet-B4 resolution from the paper
print(model(x).shape)  # (4, 50)

# Get model config: normalization constants, native input size
cfg = model.default_cfg
print(cfg["input_size"])  # native input size of the loaded checkpoint; varies with weights and timm version
print(cfg["mean"])  # (0.485, 0.456, 0.406) - ImageNet statistics for this checkpoint
print(cfg["std"])  # (0.229, 0.224, 0.225)

# Extract multi-scale features (for detection/segmentation backbones)
backbone = timm.create_model(
    "resnet50",
    pretrained=True,
    features_only=True,
    out_indices=(1, 2, 3, 4)  # Which stages to return
)
feature_maps = backbone(torch.randn(4, 3, 224, 224))
for i, fm in enumerate(feature_maps):
    print(f"Stage {i+1}: {fm.shape}")
# Stage 1: torch.Size([4, 256, 56, 56])
# Stage 2: torch.Size([4, 512, 28, 28])
# Stage 3: torch.Size([4, 1024, 14, 14])
# Stage 4: torch.Size([4, 2048, 7, 7])
tip

The features_only=True mode in timm is essential for object detection and segmentation - it returns feature maps at multiple spatial scales, which you pass to a Feature Pyramid Network (FPN) or a segmentation decoder. Frameworks such as Detectron2 and MMDetection consume CNN backbones in the same multi-scale way for models like Mask R-CNN.

Parameter Count Comparison

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

architectures = {
    "ResNet-18": models.resnet18(weights=None),
    "ResNet-50": models.resnet50(weights=None),
    "ResNet-101": models.resnet101(weights=None),
    "EfficientNet-B0": models.efficientnet_b0(weights=None),
    "EfficientNet-B4": models.efficientnet_b4(weights=None),
    "EfficientNet-B7": models.efficientnet_b7(weights=None),
    "ConvNeXt-Tiny": models.convnext_tiny(weights=None),
    "ConvNeXt-Base": models.convnext_base(weights=None),
}

print(f"{'Model':<22} {'Params (M)':>12}")
print("-" * 36)
for name, model in architectures.items():
    params = count_params(model) / 1e6
    print(f"{name:<22} {params:>11.1f}M")

# ResNet-18 11.7M
# ResNet-50 25.6M
# ResNet-101 44.5M
# EfficientNet-B0 5.3M
# EfficientNet-B4 19.3M
# EfficientNet-B7 66.3M
# ConvNeXt-Tiny 28.6M
# ConvNeXt-Base 88.6M

Part 13: Common Mistakes

warning

Forgetting model.eval() during inference. BatchNorm normalizes with the current batch's mean and variance in training mode and with the running statistics accumulated during training in evaluation mode; Dropout is likewise active only in training mode. If the model is left in train mode during evaluation, the output for a given image depends on which other images share its batch (and on random dropout masks), so validation metrics fluctuate unpredictably. Always call model.eval() before evaluation loops and model.train() before training loops.
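
The batch dependence can be seen on a single BatchNorm layer. The following is a minimal sketch with synthetic tensors (the names sample, batch_a, and batch_b are illustrative): the same image is placed in two different batches and passed through the layer in both modes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)

# The same "image" placed into two batches with very different companions
sample = torch.randn(1, 3, 8, 8)
batch_a = torch.cat([sample, torch.randn(3, 3, 8, 8)])
batch_b = torch.cat([sample, 5 * torch.randn(3, 3, 8, 8) + 2])

bn.train()  # normalizes with statistics of the current batch
with torch.no_grad():
    train_a = bn(batch_a)[0]
    train_b = bn(batch_b)[0]
print(torch.allclose(train_a, train_b))  # False: output depends on batch mates

bn.eval()  # normalizes with the stored running statistics
with torch.no_grad():
    eval_a = bn(batch_a)[0]
    eval_b = bn(batch_b)[0]
print(torch.allclose(eval_a, eval_b))  # True: output depends only on the sample
```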

danger

Using the wrong pretrained weights. torchvision provides multiple weight sets per architecture. ResNet50_Weights.IMAGENET1K_V1 achieves ~76.1% top-1; ResNet50_Weights.IMAGENET1K_V2 achieves ~80.9% top-1. That is nearly a 5-point gap from weights alone, with identical architecture. The difference comes from a better training recipe (longer training, stronger augmentation, label smoothing). Prefer DEFAULT, or explicitly specify the V2 weights when you need a reproducible checkpoint. To see which recipe and reported accuracy a checkpoint carries, inspect the meta dictionary on the weights enum (for example ResNet50_Weights.IMAGENET1K_V2.meta); the model object itself has no meta attribute.

Not applying ImageNet normalization. torchvision's ImageNet-pretrained models expect inputs normalized with mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]; a few checkpoints in other libraries use different constants, so check the preprocessing documented for the weights you load. If your DataLoader skips this step, the model receives inputs completely outside its training distribution. Pretrained features will be degraded immediately, and fine-tuning will start from a much worse initial point.

from torchvision import transforms

# Always include this normalization for ImageNet-pretrained models
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

Confusing requires_grad=False with model.eval(). They are independent. Setting requires_grad=False on parameters prevents their update during backprop - useful for freezing layers. Calling model.eval() changes the forward-pass behavior of BatchNorm and Dropout. A frozen backbone needs requires_grad=False on its parameters, but if it still runs in model.train() mode its BatchNorm layers keep updating their running statistics even though no weights change. That can be what you want when adapting to a new domain; if you want the pretrained statistics preserved, also put those BatchNorm layers in eval mode.

Selecting architecture by parameter count alone. EfficientNet-B0 (5.3M params) and ResNet-18 (11.7M params) have very different computational profiles - EfficientNet's depthwise convolutions are memory-bandwidth-bound on many hardware backends, while standard convolutions are compute-bound. Always profile latency on your target hardware, not just count parameters.

Using VGG as a backbone for new projects. VGG-16's 138M parameters, roughly 90% of them in the fully connected layers, make it slow, memory-hungry, and inflexible. Its only modern legitimate use is as a frozen perceptual loss network for style transfer and image generation. Never use VGG as a classification or detection backbone in 2024+.
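
The parameter split can be checked with plain arithmetic. This sketch assumes the standard VGG-16 layout (thirteen 3×3 convolutions with biases, then three fully connected layers on a 512×7×7 feature map, 1000 output classes).

```python
# VGG-16 configuration: numbers are conv output channels, "M" is 2×2 max pooling
cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]

conv_params, c_in = 0, 3
for v in cfg:
    if v == "M":
        continue  # pooling has no parameters
    conv_params += c_in * v * 3 * 3 + v  # 3×3 weights + bias
    c_in = v

# Classifier: 512×7×7 flattened features -> 4096 -> 4096 -> 1000 classes
fc_sizes = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
fc_params = sum(i * o + o for i, o in fc_sizes)

total = conv_params + fc_params
print(total)                        # 138357544
print(round(fc_params / total, 3))  # 0.894
```

The first fully connected layer alone accounts for over 100M of those parameters, which is why later architectures replaced it with global average pooling.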


Part 14: Interview Q&A

Q1: Why did AlexNet use ReLU instead of sigmoid, and why did it matter so much?

Sigmoid and tanh saturate - their derivatives approach zero for large positive or large negative inputs. In a deep network, backpropagation multiplies gradients layer by layer. A product of many near-zero numbers approaches zero exponentially fast - this is the vanishing gradient problem. Early layers receive essentially no gradient signal and barely update, so the network effectively does not train beyond the first few layers.

ReLU's derivative is 1 for positive inputs and exactly 0 for negative inputs, so there is no saturation on the positive side and gradients pass unchanged through every active unit. The AlexNet paper reports that a four-layer CNN with ReLUs reached 25% training error on CIFAR-10 six times faster than an equivalent network with tanh units. ReLU is not a minor engineering improvement - it is one of the main reasons deep networks became trainable at scale in 2012.
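
The exponential decay is easy to reproduce with scalar arithmetic. The sketch below multiplies per-layer activation derivatives only and deliberately ignores the weight matrices, so it illustrates the trend rather than modelling a real network.

```python
import math


def sigmoid_grad(z: float) -> float:
    """Derivative of the logistic sigmoid at pre-activation z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Derivative of ReLU at pre-activation z."""
    return 1.0 if z > 0 else 0.0


# Best case for sigmoid: z = 0, where its derivative peaks at 0.25
for depth in (5, 10, 20):
    print(depth, sigmoid_grad(0.0) ** depth, relu_grad(1.0) ** depth)
```

Even in the sigmoid's best case the factor after 20 layers is below 1e-11, while an active ReLU path keeps a factor of exactly 1.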

Q2: What is the degradation problem, and why is it not the same as overfitting?

The degradation problem: adding more layers to a plain deep network causes both training error and validation error to increase. A 56-layer plain network trains worse than a 20-layer plain network.

Overfitting looks like: training error decreases while validation error increases. The model memorizes training data but does not generalize. This is fixable with more data or regularization.

Degradation looks like: training error itself is higher with more depth. Adding layers actively makes the model worse at learning, even from training data. This is an optimization failure, not a generalization failure. The deeper network architecturally can represent all functions the shallower network can (the extra layers could learn identity), but gradient descent cannot find that solution through stacked non-linear layers.

ResNet fixes this by reformulating what each block learns: instead of H(x), ask for F(x) = H(x) - x. If identity is optimal, F(x) just needs to be zero - a natural default for near-zero initialized weights.

Q3: Explain residual connections from the gradient perspective. Why do they help training?

Given a residual block where y = F(x) + x, the gradient of the loss L with respect to the input x is:

dL/dx = dL/dy × (dF/dx + 1)

The +1 term comes from the identity skip connection. However small dF/dx becomes - which is the vanishing gradient problem - dL/dx still contains dL/dy passed through unchanged, so the gradient cannot vanish merely because the residual branch's contribution shrinks. The skip connection acts as a gradient highway that carries the signal past any number of blocks. For ResNet-152, which stacks 50 bottleneck blocks, this means the gradient from the loss reaches the first layers without the exponential decay that kills plain networks. (The few downsampling blocks use a 1×1 projection on the shortcut instead of a pure identity.)
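
The effect can be observed with autograd on a toy stack. The sketch below is not a ResNet: it uses 50 small fully connected tanh layers with PyTorch's default initialization, and the helper name input_grad_norm is illustrative. It compares the gradient norm at the input with and without the skip connection.

```python
import torch
import torch.nn as nn


def input_grad_norm(depth: int, residual: bool, width: int = 16, seed: int = 0) -> float:
    """Gradient norm at the input of a stack of tanh(Linear) layers."""
    torch.manual_seed(seed)
    layers = [nn.Linear(width, width) for _ in range(depth)]
    x = torch.randn(8, width, requires_grad=True)
    h = x
    for layer in layers:
        f = torch.tanh(layer(h))
        h = h + f if residual else f  # y = F(x) + x  versus  y = F(x)
    h.sum().backward()
    return x.grad.norm().item()


plain = input_grad_norm(depth=50, residual=False)
skip = input_grad_norm(depth=50, residual=True)
print(f"plain: {plain:.3e}  residual: {skip:.3e}")
```

In this toy setting the plain stack's input gradient collapses by many orders of magnitude, while the residual stack's stays at a usable scale, mirroring the +1 term in the formula.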

Q4: What is EfficientNet's compound scaling, and what problem does it solve?

Before EfficientNet, practitioners scaled CNNs ad hoc: ResNet-101 is ResNet-50 with more blocks (depth scaling only). This is suboptimal because the three scaling dimensions - depth, width, and input resolution - are interdependent.

EfficientNet's empirical finding: scaling one dimension alone quickly hits diminishing returns, but scaling all three together consistently improves accuracy. Deeper networks benefit from wider channels to use increased depth. Wider channels benefit from higher resolution to provide spatial detail for all those channels.

The compound scaling rule scales depth, width, and resolution simultaneously with fixed ratios (d = alpha^phi, w = beta^phi, r = gamma^phi), subject to the constraint alpha × beta^2 × gamma^2 ≈ 2, so that each unit increase in phi roughly doubles FLOPs. The constants are found by a small grid search on the baseline network (EfficientNet-B0). This gives EfficientNet-B4 (82.9% top-1, 19M params) better accuracy than ResNet-152 (78.3%, 60M params) at one-third the parameters.
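
A worked example of the rule, using the coefficients reported in the EfficientNet paper for the B0 baseline (alpha = 1.2, beta = 1.1, gamma = 1.15). The published B1-B7 configurations are rounded, so treat this as an illustration of the rule rather than a reproduction of the exact model settings.

```python
# Coefficients reported for the EfficientNet-B0 baseline (grid search at phi = 1)
alpha, beta, gamma = 1.2, 1.1, 1.15  # depth, width, resolution multipliers

# FLOPs scale roughly with depth * width^2 * resolution^2
flops_per_step = alpha * beta**2 * gamma**2
print(round(flops_per_step, 2))  # 1.92

for phi in range(4):
    d, w, r = alpha**phi, beta**phi, gamma**phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{flops_per_step**phi:.2f}")
```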

Q5: What did ConvNeXt demonstrate about Vision Transformers' advantages?

ConvNeXt showed that a substantial portion of ViT's performance gains came from training recipes and design choices rather than from self-attention. By systematically applying transformer design decisions to ResNet - GELU activations, LayerNorm instead of BatchNorm, inverted bottleneck, 7×7 depthwise convolutions, fewer normalization layers per block, and modern training with AdamW - the ConvNeXt authors matched Swin Transformer in accuracy without any attention mechanism.
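
Several of these design decisions fit in one small block. The following is a simplified sketch in the style of a ConvNeXt block, written for illustration: it omits layer scale and stochastic depth, and the class name ConvNeXtBlockSketch is hypothetical rather than the torchvision or timm implementation.

```python
import torch
import torch.nn as nn


class ConvNeXtBlockSketch(nn.Module):
    """Simplified ConvNeXt-style block (no layer scale, no stochastic depth)."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        # 7×7 depthwise conv: spatial mixing with one filter per channel
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                      # single norm per block
        self.pw_expand = nn.Linear(dim, expansion * dim)   # inverted bottleneck: expand
        self.act = nn.GELU()                               # single activation per block
        self.pw_project = nn.Linear(expansion * dim, dim)  # project back to dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.dwconv(x)
        out = out.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C) for LayerNorm/Linear
        out = self.pw_project(self.act(self.pw_expand(self.norm(out))))
        out = out.permute(0, 3, 1, 2)  # back to (N, C, H, W)
        return identity + out          # residual connection, as in ResNet


block = ConvNeXtBlockSketch(dim=96)
print(block(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```

Compared with the bottleneck block earlier in this lesson, the channel count is widest in the middle instead of narrowest, and there is one normalization and one activation rather than three of each.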

The implication: at typical deployment scales (10M–200M parameters, ImageNet-level data), CNNs and transformers are competitive once you use the same training techniques. ViT has genuine advantages at very large scale with massive pretraining data, where global attention enables long-range reasoning that stacked convolutions can only approximate. But for most production CV tasks, a modernized CNN like ConvNeXt is a perfectly valid choice.

Q6: How do you choose between ResNet-50, EfficientNet-B4, and ConvNeXt-T for a new task?

Four axes matter. (1) Inference latency on target hardware - profile each candidate on your actual deployment hardware before committing. EfficientNet-B4 at 380×380 is much slower than ResNet-50 at 224×224 on CPU despite having fewer parameters. (2) Ecosystem requirements - if you need to plug into Detectron2, MMDetection, or another detection/segmentation framework, ResNet-50/101 has the widest head compatibility. (3) Dataset size - with under 10K images, ResNet-50's IMAGENET1K_V2 weights often transfer more robustly; with 10K-100K images, EfficientNet-B4 or ConvNeXt-T become reasonable candidates. (4) Debugging priority - ResNet-50 is the most studied, most documented architecture with the best-understood failure modes. When debugging matters more than squeezing out accuracy, start there.

The recommended workflow: baseline with ResNet-50 first, benchmark EfficientNet-B4, check if the accuracy gain justifies the latency cost or complexity increase, and use ConvNeXt if you want a modern architecture with better training-recipe compatibility for new projects.


© 2026 EngineersOfAI. All rights reserved.