Module 05: Computer Vision

Computer vision is the discipline that teaches machines to interpret visual data. It underpins autonomous vehicles, medical imaging, satellite analysis, manufacturing quality control, and any product with a camera. Even if you never build a pure vision system, the architectures and training techniques developed for images have propagated into NLP, audio, and multimodal AI - residual connections, large-scale transfer learning, and modern data augmentation were all established in vision research.

This module takes you from the mechanics of the convolution operation through production-ready architectures, detection and segmentation tasks, and finally to the Vision Transformer - the architecture that unified the vision and language worlds.

Why Computer Vision Matters for ML Engineers

Vision is where deep learning first clearly outperformed hand-engineered methods on a large-scale benchmark (AlexNet on ImageNet, 2012). The techniques developed for vision have since become universal ML infrastructure:

Residual connections from ResNet are inside virtually every Transformer-based language model
The Vision Transformer (ViT) brought attention from NLP into vision and is now a standard image encoder in multimodal systems
Transfer learning workflows were proven at scale on ImageNet before being applied to every domain
Data augmentation strategies developed for images have been adapted to text, audio, and tabular tasks

A strong computer vision foundation is not optional for ML engineers - it is the historical and technical bedrock of the field.

Module Map

Lesson Table

#	Lesson	Key Concepts	PyTorch APIs
01	Convolutional Neural Networks	Convolution operation, weight sharing, 1x1 convolutions, depthwise separable	`nn.Conv2d`, filter visualization
02	Pooling, Strides, and Padding	Max/avg pooling, strided convolutions, receptive field, global avg pooling	`nn.MaxPool2d`, `nn.AdaptiveAvgPool2d`
03	CNN Architectures: ResNet, EfficientNet	Residual blocks, skip connections, compound scaling, ConvNeXt	`torchvision.models`, custom heads
04	Transfer Learning	Fine-tuning strategies, layer freezing, domain gap, LR scheduling	`requires_grad`, pretrained models
05	Object Detection: YOLO and R-CNN	Bounding boxes, anchor boxes, RPN, NMS, mAP@50	`torchvision.ops`, detection models
06	Semantic Segmentation	FCN, U-Net, skip connections, DeepLab, panoptic segmentation	U-Net from scratch
07	Data Augmentation	Spatial/photometric transforms, Mixup, CutMix, RandAugment, TTA	`torchvision.transforms`, Albumentations
08	Vision Transformers (ViT)	Patch embeddings, CLS token, Swin Transformer, DeiT	`timm` library, ViT fine-tuning

Prerequisites

Before this module you should be comfortable with:

Module 04 - Neural Networks: forward pass, backpropagation, batch norm, dropout, training loops in PyTorch
NumPy array indexing and reshaping
Understanding of gradient descent and cross-entropy loss

Production Reality

When you deploy vision systems, the real constraints look like this:

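Latency vs accuracy. As a rough guide, EfficientNet-B0 runs at ~6ms per image on a V100, while ViT-L runs at ~50ms; exact figures depend on batch size, numeric precision, and the inference runtime. A real-time 30fps application has a budget of ~33ms per frame (1000ms / 30), and that budget must also cover preprocessing and postprocessing. Inference latency therefore rules architectures out before validation accuracy is even compared.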
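The budget check above can be sketched in a few lines. This is a minimal illustration using only the standard library, not a benchmarking tool: `measure_latency_ms`, `frame_budget_ms`, and `fits_budget` are hypothetical helper names, and the 10ms pre/postprocessing overhead is an assumed figure chosen only for the example.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=5, runs=50):
    """Return the median wall-clock latency of fn() in milliseconds."""
    # Warm-up calls absorb one-time costs (caches, lazy initialization).
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def frame_budget_ms(fps):
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps


def fits_budget(model_ms, overhead_ms, fps):
    """True if inference plus pre/postprocessing fits in one frame."""
    return model_ms + overhead_ms <= frame_budget_ms(fps)


# Latency figures quoted in the text; the 10 ms overhead is an assumption.
print(round(frame_budget_ms(30), 1))   # 33.3
print(fits_budget(6.0, 10.0, fps=30))  # True
print(fits_budget(50.0, 10.0, fps=30)) # False
```

When timing a model on a GPU, remember that CUDA kernels are launched asynchronously, so the callable passed to `measure_latency_ms` should synchronize the device (for example with `torch.cuda.synchronize()`) before returning; otherwise the timer measures launch time rather than execution time.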

Distribution shift. A model trained on ImageNet-scale web images degrades when deployed on industrial cameras with different optics, color profiles, and lighting conditions. Transfer learning helps, but domain-specific data often determines whether you hit 70% or 95% accuracy.
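A common first response to distribution shift is to freeze a pretrained backbone and train only a new head on the domain-specific data. The sketch below shows the mechanics with `requires_grad`. It makes several assumptions: the tiny `backbone` is a stand-in for a real pretrained network (which would normally come from `torchvision.models` or `timm`), the 5-class target task is hypothetical, and random tensors take the place of real images.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pretrained backbone (no weights are downloaded).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(16, 5)  # new head for an assumed 5-class target domain

# Freeze the backbone so that only the head receives gradient updates.
for param in backbone.parameters():
    param.requires_grad = False

model = nn.Sequential(backbone, head)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-2)

# One training step on random data standing in for domain-specific images.
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 5, (8,))
loss = nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

num_trainable = sum(p.numel() for p in trainable)
num_frozen = sum(p.numel() for p in backbone.parameters())
print(num_trainable, num_frozen)  # 85 448
```

Freezing everything but the head is the cheapest option and a reasonable baseline. When the domain gap is large, unfreezing later backbone layers with a smaller learning rate is the usual next step.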

Label cost. Pixel-level annotation for segmentation costs roughly 60x more per image than classification labels. Semi-supervised and self-supervised pretraining (DINO, MAE) are production tools that every CV engineer should know.

Edge deployment. Phones, cameras, and embedded chips impose hard model-size constraints. MobileNet, EfficientNet-Lite, and quantization-aware training are production-critical knowledge, not optional extras.
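The model-size constraint can be estimated before any training. The following is a back-of-the-envelope sketch, assuming the weights dominate the footprint (activations, buffers, and file-format metadata are ignored). `model_size_mb` is a hypothetical helper, and the 5M parameter count is an illustrative figure, not a measurement of any specific network.

```python
def model_size_mb(num_params, bits_per_param):
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


num_params = 5_000_000  # hypothetical mobile-scale model

fp32_mb = model_size_mb(num_params, 32)  # 4 bytes per weight
int8_mb = model_size_mb(num_params, 8)   # 1 byte per weight after quantization

print(fp32_mb)  # 20.0
print(int8_mb)  # 5.0
```

Under these assumptions, moving from 32-bit floats to 8-bit integers cuts weight storage by 4x, which is why quantization is often the first lever pulled when a model does not fit on the target device.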

How to Study This Module

Read each lesson in order - concepts build sequentially from CNNs through ViT
Run every PyTorch code block in a Jupyter notebook or Colab
For architecture lessons, read the original papers linked in each lesson
After object detection and segmentation, experiment on a real dataset (COCO, Pascal VOC, Cityscapes)
Practice the Interview Q&A sections aloud - being able to explain these concepts verbally is what interviews test

This module prepares you for ML Engineer, Computer Vision Engineer, and Applied Scientist roles at companies ranging from startups using pretrained models to research teams designing new architectures.
