Object Detection: YOLO and R-CNN

Reading Time: ~55 min | Interview Relevance: Very High | Target Roles: MLE, CV Engineer, Applied Scientist


It is 3:47 AM on a Tuesday, and a safety alarm has just fired inside a food processing warehouse outside Frankfurt. The CCTV cameras are rolling, the conveyor belt is moving at 80 centimeters per second, and a worker's hand has drifted within 28 centimeters of a high-speed slicer. The computer vision system you built last quarter is supposed to catch exactly this. But it doesn't. It told you there was a hand in the image - classification, working fine - but it had no idea where the hand was. The bounding box you needed to compute the proximity alert simply did not exist.

You start reading papers at 4 AM. You find that your root problem is fundamental: image classification produces a single label per image ("there is a hand"), but warehouse safety requires localization - you need to know exactly where the hand is, whether it is inside the danger zone, and how close it is to the blade. These are different tasks requiring different model architectures.

By 5 AM you have sketched two paths forward. The first, Faster R-CNN, would give you the best small-object localization and fits your quality requirements - but it runs at 5 frames per second, and the conveyor moves too fast for that. The second, YOLOv8n, runs at 120 fps on your GPU, but you are not sure its accuracy on hands-near-blades is good enough. You need to understand both architectures - not just their performance numbers, but the engineering decisions that created those tradeoffs - before you can make an intelligent call.

This lesson traces the history of object detection from the brute-force era of 2012 through the modern anchor-free designs of 2023. Every architectural choice was a response to a real limitation. Understanding those limitations is the same as understanding the architecture itself.

The Detection Problem: What Classification Misses

Image classification answers one question: what is the dominant object in this image? The output is a single probability distribution over classes. This is the task that AlexNet and ResNet solved.

Object detection must answer two questions simultaneously: what objects are here, and where are they? The output is a variable-length list of predictions, where each prediction contains:

  1. A class label
  2. A confidence score
  3. A bounding box specifying location and extent

A bounding box can be parameterized two ways, and confusion between them is a persistent source of bugs:

$$\text{xyxy format: } (x_{\min}, y_{\min}, x_{\max}, y_{\max})$$

$$\text{cxcywh format: } (c_x, c_y, w, h)$$

COCO annotations use $(x, y, w, h)$ where $(x, y)$ is the top-left corner. YOLO labels use normalized $(c_x, c_y, w, h)$ where values are divided by image width and height. torchvision uses xyxy. Check your format before computing any loss.
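
Because format mix-ups are so common, it can help to keep the conversions in one place. The helpers below are a minimal sketch in plain Python; the function names are illustrative (not from any library), and the 640×480 image size in the usage lines is an assumption chosen only for the example.

```python
def xyxy_to_cxcywh(box):
    """(x_min, y_min, x_max, y_max) -> (c_x, c_y, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def cxcywh_to_xyxy(box):
    """(c_x, c_y, w, h) -> (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def coco_to_xyxy(box):
    """COCO-style (x_top_left, y_top_left, w, h) -> xyxy."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def xyxy_to_yolo(box, img_w, img_h):
    """xyxy in pixels -> YOLO-style normalized (c_x, c_y, w, h)."""
    cx, cy, w, h = xyxy_to_cxcywh(box)
    return (cx / img_w, cy / img_h, w / img_w, h / img_h)


# The "hand" box from the example output, in an assumed 640x480 image
hand_xyxy = (400, 150, 440, 195)
print(xyxy_to_cxcywh(hand_xyxy))          # (420.0, 172.5, 40, 45)
print(xyxy_to_yolo(hand_xyxy, 640, 480))  # (0.65625, 0.359375, 0.0625, 0.09375)
```

Round-tripping a box through two conversions and comparing it with the original is a cheap unit test that catches most format bugs before they reach a loss function.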

Object detection is harder than classification for four compounding reasons:

  • Variable output size: one image might have 0 objects, another might have 80. You cannot use a fixed-output softmax head.
  • Multi-scale objects: a car 2 meters away fills half the image; a car 200 meters away covers 50 pixels. The model must handle both.
  • Occlusion: objects partially hidden behind other objects must still be detected from partial evidence.
  • Localization: you must not just recognize what is there, but precisely specify where it is. An 80% overlap with the truth is good; a 20% overlap is useless.

Image with 3 objects:            Detection output:

┌─────────────────────────┐      ┌──────────────────────────────────────────┐
│                         │      │ [car, conf=0.97, (45, 30, 210, 180)]     │
│  [ car ]                │      │ [person, conf=0.91, (280, 60, 340, 220)] │
│            [person]     │      │ [hand, conf=0.84, (400, 150, 440, 195)]  │
│                  [hand] │      └──────────────────────────────────────────┘
└─────────────────────────┘

This is fundamentally harder than producing a single label - and it requires architectures designed from scratch for the task.

Naive Approach: Sliding Window + Classifier

Before deep learning changed everything, the standard approach to detection was conceptually straightforward: take a classifier you trust, and apply it to every possible sub-region of the image.

The algorithm:

  1. Take a pre-trained image classifier (e.g., a HOG + SVM classifier, or later AlexNet)
  2. Define a set of window sizes (e.g., 64×64, 128×128, 256×128)
  3. Slide each window across the image at a step size of 8 pixels
  4. For each crop, run the classifier
  5. Keep crops where the classifier reports high confidence

The problem with this approach is not conceptual - it is computational. A 640×480 image with 3 window sizes and 8-pixel stride produces roughly 10,000 crops. Running a CNN forward pass on each crop is prohibitively slow. At 2015 GPU speeds, this took 30–60 seconds per image. Real-time was not in scope.
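
The crop count is easy to check with a few lines of arithmetic. This is a rough sketch that assumes the window sizes listed above are width×height, that windows must fit entirely inside the image, and that no image pyramid is used; `count_windows` is an illustrative helper name.

```python
def count_windows(img_w, img_h, win_w, win_h, stride):
    """Number of positions a win_w x win_h window can take at the given stride."""
    if win_w > img_w or win_h > img_h:
        return 0
    nx = (img_w - win_w) // stride + 1
    ny = (img_h - win_h) // stride + 1
    return nx * ny


window_sizes = [(64, 64), (128, 128), (256, 128)]  # assumed (width, height)
counts = {size: count_windows(640, 480, size[0], size[1], stride=8)
          for size in window_sizes}
total = sum(counts.values())

print(counts)  # {(64, 64): 3869, (128, 128): 2925, (256, 128): 2205}
print(total)   # 8999
```

Roughly 9,000 crops for a single scale; adding more aspect ratios or an image pyramid multiplies that number again.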

There was a second problem: fixed aspect ratios. A window designed to detect standing people (tall and narrow) will miss lying-down people, or cars (wide and short). You either use an explosion of aspect ratios (making the problem even slower) or accept missed detections.

This was the state of the art in 2012. The Deformable Parts Model (DPM) by Felzenszwalb et al. was the best system of that era - it used hand-crafted HOG features and a tree-structured model for deformable parts. DPM won the PASCAL VOC detection challenge multiple years in a row. But the writing was on the wall. Girshick, one of the DPM authors, would pivot to deep learning the following year and create R-CNN.

R-CNN (2013): The Deep Learning Revolution for Detection

Ross Girshick et al. published Regions with CNN features in 2013. The key insight was not to abandon the sliding window paradigm entirely, but to make it smarter: instead of evaluating every possible region, use a fast, non-CNN algorithm to propose the most likely object regions, then run a deep CNN only on those proposals.

For proposals they used Selective Search (Uijlings et al.), an existing algorithm that combines image segmentation, color histograms, texture gradients, and region merging to propose ~2,000 candidate regions per image. These regions are not class-specific - they just look like "something interesting is here" based on low-level visual structure. Generating 2,000 proposals is much cheaper than evaluating 10,000 crops.

The three-stage pipeline:

Stage 1 - Region proposals: Selective Search proposes ~2,000 candidate bounding boxes per image.

Stage 2 - Feature extraction: Each proposal is warped to 227×227 pixels and passed through AlexNet, producing a 4,096-dimensional feature vector. This step requires 2,000 CNN forward passes per image.

Stage 3 - Classification and regression: A class-specific linear SVM classifies each feature vector. A separate class-specific linear regressor refines the bounding box coordinates.

The results were dramatic. On PASCAL VOC 2012, R-CNN improved mAP from 40.9% (the best DPM variant) to 53.3%. Deep CNN features were not just better - they were transformatively better.

But the speed was painful: 47 seconds per image at test time on a GPU with the deeper VGG-16 backbone (about 13 seconds with AlexNet). The bottleneck was those 2,000 CNN forward passes. Every crop was processed independently, which meant the backbone computed features redundantly for overlapping regions. The fix would come two years later.

note

Why SVMs instead of a softmax classifier? In 2013, the conventional wisdom was that SVMs were more accurate than softmax classifiers for multi-class problems. By 2015, Girshick showed that this was an artifact of the training procedure, not a fundamental property of the loss function. Fast R-CNN replaced SVMs with a softmax head and got better results.

Fast R-CNN (2015): The Feature Sharing Fix

Girshick's follow-up paper introduced one idea that changed everything: run the CNN once on the whole image, then extract region features from the shared feature map.

The logic is simple. If two region proposals overlap (and in a typical image, many of them do), R-CNN was computing CNN features for the overlap region twice - once for each proposal. This is pure redundancy. If you run the CNN once on the full image, you get a single feature map that covers the entire image. You can then look up the features for any region by simply selecting the corresponding spatial region of the feature map.

The operation that makes this work is called RoI Pooling (Region of Interest Pooling). Given a proposal in the original image coordinates, RoI Pooling:

  1. Projects the proposal's coordinates into the feature map space (divides by the backbone's stride)
  2. Divides the projected region into a fixed H×W grid (typically 7×7)
  3. Max-pools each cell of the grid

The output is always a fixed 7×7 feature map for any input proposal size. This fixed-size feature can then be fed into fully connected layers for classification and regression.

Full image → Backbone CNN → Feature Map (stride 32)
             (one forward pass)

For each of 2,000 proposals:
  Project to feature map coords → RoI Pool (7×7) → FC layers → class + box
                                  (cheap, no CNN)
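
The three RoI Pooling steps can be sketched in plain Python on a single-channel feature map stored as a nested list. This is an illustration under simplifying assumptions (one channel, floor/ceil quantization, a 2×2 output grid instead of 7×7 so the result is easy to read); `roi_max_pool` is a hypothetical helper, not the torchvision operator.

```python
import math


def roi_max_pool(feature_map, roi_xyxy, stride, out_size=7):
    """Max-pool one image-space RoI into an out_size x out_size grid."""
    fh, fw = len(feature_map), len(feature_map[0])

    # Step 1: project image-space corners onto the feature map (quantized)
    x1 = min(max(math.floor(roi_xyxy[0] / stride), 0), fw - 1)
    y1 = min(max(math.floor(roi_xyxy[1] / stride), 0), fh - 1)
    x2 = min(max(math.ceil(roi_xyxy[2] / stride), x1 + 1), fw)
    y2 = min(max(math.ceil(roi_xyxy[3] / stride), y1 + 1), fh)
    roi_w, roi_h = x2 - x1, y2 - y1

    pooled = []
    for gy in range(out_size):
        # Step 2: bin boundaries (floor for start, ceil for end, so no bin is empty)
        ys = y1 + (gy * roi_h) // out_size
        ye = y1 + -(-(gy + 1) * roi_h // out_size)
        row = []
        for gx in range(out_size):
            xs = x1 + (gx * roi_w) // out_size
            xe = x1 + -(-(gx + 1) * roi_w // out_size)
            # Step 3: max over every feature-map cell that falls in this bin
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# 4x4 feature map with values 0..15, standing in for a 64x64 image at stride 16
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
print(roi_max_pool(feature_map, (0, 0, 64, 64), stride=16, out_size=2))
# [[5, 7], [13, 15]]
print(roi_max_pool(feature_map, (16, 16, 48, 48), stride=16, out_size=2))
# [[5, 6], [9, 10]]
```

Whatever the proposal size, the output grid has the same shape, which is what lets the fully connected layers that follow accept any proposal.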

Fast R-CNN also replaced the multi-stage training (pre-train CNN, train SVM, train regressor separately) with end-to-end joint training. The total loss is:

$$\mathcal{L} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{box}}$$

where $\mathcal{L}_{\text{cls}}$ is cross-entropy over class predictions and $\mathcal{L}_{\text{box}}$ is Smooth L1 loss over the four bounding box offsets (applied only for non-background proposals).

Speed improvement: from 47 seconds per image down to 0.3 seconds per image for the CNN portion. But a new bottleneck emerged: Selective Search still took 2 seconds per image - more than 6× slower than the rest of the system. The fix required making region proposals a part of the network itself.

Faster R-CNN (2015): The Region Proposal Network

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun published Faster R-CNN in late 2015, introducing the Region Proposal Network (RPN) - a small convolutional network that slides over the feature map and generates proposals directly from learned features, at essentially zero additional cost.

The key realization: the backbone feature map, already computed for the detection head, contains all the information needed to identify potential object locations. Instead of using a separate algorithm (Selective Search) that does not benefit from training, you can train a mini-network on top of the same feature map to propose regions.

How the RPN works:

The RPN slides a 3×3 convolutional window over the feature map. At each spatial location, it considers $k$ pre-defined reference boxes called anchors. Anchors cover multiple scales and aspect ratios - typically 3 scales (128², 256², 512² pixels in the original image) and 3 aspect ratios (1:1, 1:2, 2:1), giving $k = 9$ anchors per location.
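
As a concrete picture of those 9 anchors, the sketch below builds them around one image-space location. It assumes each anchor keeps an area of scale² and that the ratio is height divided by width; `make_anchors` is an illustrative name, and real implementations differ in rounding and ordering.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return xyxy anchors centered at (cx, cy): one per (scale, ratio) pair."""
    anchors = []
    for scale in scales:
        for ratio in ratios:           # assumed convention: ratio = height / width
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(320, 320)
print(len(anchors))   # 9
print(anchors[1])     # (256.0, 256.0, 384.0, 384.0)  -> the 128x128, 1:1 anchor
```

Repeating this at every feature-map location is why even a modest feature map yields tens of thousands of anchors per image.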

For each anchor, the RPN predicts two things:

  1. Objectness score: is there an object here? (binary classification, foreground vs background)
  2. Bounding box offsets: how should the anchor coordinates be adjusted to better fit the object?

The offsets are parameterized relative to the anchor:

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}$$

$$t_w = \log\!\left(\frac{w}{w_a}\right), \quad t_h = \log\!\left(\frac{h}{h_a}\right)$$

where $(x_a, y_a, w_a, h_a)$ are the anchor's center coordinates and size, and $(x, y, w, h)$ are those of the target ground-truth box. The log parameterization for width and height prevents scale imbalance - a width of 10 pixels vs 100 pixels is a difference of about 2.3 in natural-log space (1 in base-10 terms), not 90 in linear space.
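
A small encode/decode pair makes the parameterization concrete. This is a sketch that assumes both boxes are given as (center x, center y, width, height); the function names and the example boxes are made up for illustration.

```python
import math


def encode_box(gt, anchor):
    """Ground-truth box -> regression targets (t_x, t_y, t_w, t_h) relative to an anchor."""
    x, y, w, h = gt
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode_box(t, anchor):
    """Invert encode_box: predicted offsets + anchor -> box in cxcywh."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (100.0, 100.0, 128.0, 128.0)
gt = (116.0, 92.0, 64.0, 256.0)   # half as wide, twice as tall as the anchor

targets = encode_box(gt, anchor)
print(targets)                      # t_x = 0.125, t_y = -0.0625, t_w = -log 2, t_h = +log 2
print(decode_box(targets, anchor))  # recovers gt up to floating-point error
```

Halving and doubling the size give targets of equal magnitude and opposite sign, which is the symmetry the log parameterization is meant to provide.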

The RPN is trained jointly with the detection head. Since both share the backbone feature map, the entire system is one end-to-end network with a single forward pass (plus a non-differentiable proposal selection step in between).

┌─────────────────────────────────────────┐
│        Faster R-CNN Architecture        │
└─────────────────────────────────────────┘

         Input Image (H × W × 3)
                   │
                   ▼
           ┌──────────────┐
           │   Backbone   │  ResNet-50, VGG, etc.
           │  CNN (once)  │
           └──────────────┘
                   │
                   ▼
      Feature Map (H/32 × W/32 × C)
         ┌─────────┴────────────┐
         │                      │
         ▼                      ▼
     ┌──────┐             ┌──────────┐
     │ RPN  │→ proposals →│ RoI Pool │
     │      │             │  (7×7)   │
     └──────┘             └──────────┘
   (objectness +                │
    box offsets)                ▼
                        ┌──────────────┐
                        │  Detection   │
                        │     Head     │
                        │ (cls + box)  │
                        └──────────────┘
                                │
                                ▼
                        Final Detections

Speed: 0.2 seconds per image (~5 fps). Not yet real-time for video, but getting there. More importantly, the accuracy on COCO improved significantly because the RPN proposals are better than Selective Search proposals - they are trained on the same task, using the same features.

IoU: The Metric at the Heart of Detection

Before going further, you need to understand Intersection over Union (IoU) - the single most important metric in object detection. It measures how well two bounding boxes overlap.

$$\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Ground truth box (G) and predicted box (P):

┌─────────────┐
│ G           │
│      ┌──────┼──────┐
│      │ G∩P  │      │
└──────┼──────┘      │
       │           P │
       └─────────────┘

IoU = Area(G ∩ P) / Area(G ∪ P)

  • IoU = 1.0: perfect prediction - boxes are identical
  • IoU = 0.0: no overlap - prediction and truth share no pixels
  • IoU = 0.5: the conventional threshold for PASCAL VOC (a detection "counts" if IoU > 0.5)
  • IoU = 0.75: the strict COCO threshold

IoU is used in three places in the detection pipeline:

  1. Training the RPN: anchors with IoU > 0.7 with any ground truth box are labeled "positive" (object). Anchors with IoU < 0.3 with all ground truth boxes are labeled "negative" (background). Anchors between 0.3 and 0.7 are ignored during training.
  2. Non-Maximum Suppression: suppress overlapping predicted boxes with IoU > threshold.
  3. Evaluation: determine if a predicted box correctly localizes a ground truth object.
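
The RPN labeling rule from item 1 can be sketched for a single anchor as follows. This is a simplified illustration in plain Python: it applies only the two thresholds stated above, and the helper names are hypothetical.

```python
def iou_xyxy(a, b):
    """IoU of two boxes in xyxy format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (object), 0 (background), or -1 (ignored during training)."""
    best_iou = max((iou_xyxy(anchor, gt) for gt in gt_boxes), default=0.0)
    if best_iou > pos_thr:
        return 1
    if best_iou < neg_thr:
        return 0
    return -1


gt_boxes = [(0, 0, 100, 100)]
print(label_anchor((10, 0, 110, 100), gt_boxes))     # 1   (IoU ~ 0.82)
print(label_anchor((50, 0, 150, 100), gt_boxes))     # -1  (IoU ~ 0.33, ignored)
print(label_anchor((200, 200, 300, 300), gt_boxes))  # 0   (no overlap)
```

The ignored band keeps ambiguous anchors from pushing the objectness classifier in either direction.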

import torch


def compute_iou(box1: torch.Tensor, box2: torch.Tensor) -> torch.Tensor:
    """
    Compute pairwise IoU between two sets of boxes in xyxy format.

    Args:
        box1: (N, 4) tensor, format (x1, y1, x2, y2)
        box2: (M, 4) tensor, format (x1, y1, x2, y2)

    Returns:
        (N, M) tensor of pairwise IoU values
    """
    # Areas of each box
    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])  # (N,)
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])  # (M,)

    # Intersection top-left and bottom-right
    # Broadcasting: (N, 1, 2) vs (1, M, 2) → (N, M, 2)
    inter_tl = torch.max(box1[:, None, :2], box2[None, :, :2])  # (N, M, 2)
    inter_br = torch.min(box1[:, None, 2:], box2[None, :, 2:])  # (N, M, 2)

    # Intersection dimensions (clamped to 0 if no overlap)
    inter_wh = (inter_br - inter_tl).clamp(min=0)  # (N, M, 2)
    inter_area = inter_wh[:, :, 0] * inter_wh[:, :, 1]  # (N, M)

    # Union area
    union_area = area1[:, None] + area2[None, :] - inter_area  # (N, M)

    return inter_area / union_area.clamp(min=1e-6)


# Sanity check
boxes_a = torch.tensor([
    [0., 0., 10., 10.],   # 10×10 box
    [5., 5., 15., 15.],   # 10×10 box, overlapping
])
boxes_b = torch.tensor([[5., 5., 15., 15.]])

iou = compute_iou(boxes_a, boxes_b)
print(iou)
# tensor([[0.1429],    ← small overlap (intersection 25 / union 175)
#         [1.0000]])   ← perfect match

Non-Maximum Suppression (NMS)

When a network predicts boxes for every anchor at every location, a single object (say, a person) will generate dozens to hundreds of overlapping boxes - every anchor near the person fires with some confidence. You need to collapse this set of overlapping predictions to a single "best" prediction per object.

Non-Maximum Suppression is the standard algorithm:

  1. Filter all boxes below a confidence threshold (e.g., 0.05)
  2. Sort remaining boxes by confidence score, descending
  3. Take the highest-confidence box and add it to the output
  4. Remove all remaining boxes whose IoU with the selected box exceeds a threshold (typically 0.5)
  5. Repeat from Step 3 with the remaining boxes until none remain

def nms_from_scratch(
    boxes: torch.Tensor,
    scores: torch.Tensor,
    iou_threshold: float = 0.5,
) -> torch.Tensor:
    """
    Non-Maximum Suppression implementation from first principles.

    Args:
        boxes: (N, 4) bounding boxes in xyxy format
        scores: (N,) confidence scores
        iou_threshold: IoU above which boxes are suppressed

    Returns:
        (K,) indices of boxes to keep, K <= N
    """
    if boxes.numel() == 0:
        return torch.empty(0, dtype=torch.long)

    # Sort by score descending
    order = scores.argsort(descending=True)
    keep = []

    while order.numel() > 0:
        # The current best box
        idx = order[0].item()
        keep.append(idx)

        if order.numel() == 1:
            break

        # Compute IoU of current box with all remaining
        current_box = boxes[idx].unsqueeze(0)  # (1, 4)
        remaining_boxes = boxes[order[1:]]  # (K, 4)
        iou = compute_iou(current_box, remaining_boxes).squeeze(0)  # (K,)

        # Keep only boxes with low IoU (not overlapping with current)
        low_iou_mask = iou <= iou_threshold
        order = order[1:][low_iou_mask]

    return torch.tensor(keep, dtype=torch.long)


# Demonstrate on overlapping boxes
boxes = torch.tensor([
    [10., 10., 50., 50.],      # box A
    [12., 12., 52., 52.],      # box B - heavily overlaps A
    [14., 14., 54., 54.],      # box C - heavily overlaps A and B
    [200., 200., 240., 240.],  # box D - far away, different object
])
scores = torch.tensor([0.9, 0.75, 0.6, 0.85])

keep_indices = nms_from_scratch(boxes, scores, iou_threshold=0.5)
print(keep_indices)  # tensor([0, 3]) - keeps box A (highest score) and box D (different object)

Soft-NMS is a gentler alternative. Instead of hard-removing boxes with IoU > threshold, it decays their confidence score by a Gaussian function of their IoU with the selected box:

$$s_i \leftarrow s_i \cdot \exp\!\left(-\frac{\text{IoU}(M, b_i)^2}{\sigma}\right)$$

In dense scenes - a crowd of people, a shelf of products - hard NMS frequently suppresses legitimate detections. Soft-NMS improves recall in these cases at the cost of more false positives, so the confidence threshold must be raised accordingly.
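
To see the difference from hard NMS, here is a minimal pure-Python sketch of the Gaussian variant. The `sigma=0.5` and `score_threshold=0.001` defaults are assumptions for illustration, and `soft_nms` is a hypothetical helper rather than a library call.

```python
import math


def iou_xyxy(a, b):
    """IoU of two boxes in xyxy format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return [(index, decayed_score), ...] in selection order."""
    scores = list(scores)               # copy: scores are decayed in place
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        m = max(remaining, key=lambda i: scores[i])
        if scores[m] < score_threshold:
            break                       # everything left has decayed to noise
        keep.append((m, scores[m]))
        remaining.remove(m)
        for i in remaining:
            # Gaussian decay: heavy overlap with the selected box -> large penalty
            overlap = iou_xyxy(boxes[m], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (14, 14, 54, 54), (200, 200, 240, 240)]
scores = [0.9, 0.75, 0.6, 0.85]
for idx, score in soft_nms(boxes, scores):
    print(idx, round(score, 3))
```

On the same four boxes used in the hard-NMS demo, all of them survive, but the two duplicates of box A end up with much lower scores. A final confidence cut then decides which of them are reported.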

tip

For most deployment scenarios, use torchvision.ops.nms or torchvision.ops.batched_nms. These are CUDA-accelerated C++ implementations that run orders of magnitude faster than pure Python. The from-scratch implementation above is for understanding - not production.

YOLO (2015): You Only Look Once

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi published YOLO in 2015 - and it was a genuinely radical rethinking of the detection problem.

R-CNN and its descendants decomposed detection into stages: propose regions, then classify them. YOLO asked: what if detection were a single regression problem, solved in one forward pass? What if you looked at the whole image once and directly predicted all boxes and classes?

YOLO v1 Architecture:

Divide the image into an S×S grid (7×7 in the original paper). Each grid cell is responsible for detecting objects whose center falls within that cell.

Each cell predicts:

  • B bounding boxes (2 in v1), each with 4 coordinates and 1 confidence score
  • C class probabilities (conditional on an object being present)

Final output tensor: $S \times S \times (B \times 5 + C)$

For PASCAL VOC with S=7, B=2, C=20: $7 \times 7 \times (2 \times 5 + 20) = 7 \times 7 \times 30$

This entire tensor is predicted in a single forward pass through a CNN with 24 convolutional layers.
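
The grid assignment is simple enough to sketch directly. The snippet below assumes the 448×448 input resolution used by YOLO v1 and a box given in xyxy pixels; `responsible_cell` and `yolo_v1_output_shape` are illustrative names.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: S x S x (B*5 + C)."""
    return (S, S, B * 5 + C)


def responsible_cell(box_xyxy, img_w, img_h, S=7):
    """Return (row, col, x_offset, y_offset) for the cell containing the box center.

    The offsets are the center's position inside that cell, in [0, 1).
    """
    cx = (box_xyxy[0] + box_xyxy[2]) / 2
    cy = (box_xyxy[1] + box_xyxy[3]) / 2
    gx = cx / img_w * S                 # center in grid units
    gy = cy / img_h * S
    col = min(int(gx), S - 1)           # clamp centers on the right/bottom border
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row


print(yolo_v1_output_shape())                             # (7, 7, 30)
print(responsible_cell((400, 150, 440, 195), 448, 448))   # row 2, col 6
```

With only 49 cells, two objects whose centers land in the same cell compete for the same predictions.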

YOLO v1 Loss Function:

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\
&+ \lambda_{\text{coord}} \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} \left[(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2\right] \\
&+ \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i,j} \mathbf{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i} \mathbf{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
$$

Key design decisions in this loss:

  • Square root of width and height: large boxes tolerate larger absolute errors than small boxes. The square root compresses the large-box error, giving small boxes more relative influence.
  • $\lambda_{\text{coord}} = 5$: coordinates are weighted 5× more than class probabilities - localization is hard, and you want to push gradient toward it.
  • $\lambda_{\text{noobj}} = 0.5$: background cells are weighted 0.5× - most cells have no object, so without downweighting, no-object gradients would dominate.
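
A quick worked example shows what the square root buys you. The numbers are made up for illustration and use raw pixel widths, whereas YOLO v1 normalizes widths by image size; the qualitative effect is the same.

```python
import math


def linear_sq_err(w_true, w_pred):
    """Plain squared error on width."""
    return (w_true - w_pred) ** 2


def sqrt_sq_err(w_true, w_pred):
    """Squared error on sqrt(width), as in the YOLO v1 loss."""
    return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2


small = (20.0, 30.0)     # 20 px box predicted as 30 px: off by 50%
large = (200.0, 210.0)   # 200 px box predicted as 210 px: off by 5%

print(linear_sq_err(*small), linear_sq_err(*large))  # 100.0 100.0
print(round(sqrt_sq_err(*small), 3), round(sqrt_sq_err(*large), 3))  # 1.01 0.122
```

The same 10-pixel miss costs the same under plain squared error, but roughly 8× more on the small box once the square root is applied.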

YOLO v1 Speed vs Accuracy:

YOLO v1 ran at 45 fps on a Titan X GPU - real-time video detection. Faster R-CNN ran at 5 fps. YOLO was 9× faster.

But YOLO v1 had real limitations:

  • Each grid cell predicts only one class - if two objects of different classes share the same cell (a hand overlapping a glove), only one can be detected
  • Small objects were missed often - the 7×7 grid is coarse
  • YOLO v1's mAP on PASCAL VOC was 63.4% vs Faster R-CNN's 73.2%

The YOLO series spent the next 8 years closing that accuracy gap while preserving the speed advantage.

YOLOv2 through YOLOv8: The Evolution

Each YOLO version addressed specific weaknesses with targeted architectural improvements.

YOLOv2 (2016): Added anchor boxes (borrowed from Faster R-CNN), allowing multiple objects per grid cell. Used k-means clustering on training box dimensions to design dataset-specific anchor shapes. Introduced Darknet-19 backbone. mAP improved to 78.6% on VOC.

YOLOv3 (2018): Introduced multi-scale prediction - detection at 3 different feature map scales (13×13, 26×26, 52×52). Large objects are detected at the coarse 13×13 scale (large receptive field). Small objects are detected at the fine 52×52 scale (high spatial resolution). Replaced softmax with independent binary classifiers for multi-label detection. Used Darknet-53 backbone (residual connections). mAP@50 on COCO: 57.9%.

YOLOv4 (2020): A systems-engineering masterclass by Alexey Bochkovskiy. Introduced:

  • CSP (Cross-Stage Partial) networks: split feature maps across stages to reduce computation
  • PANet (Path Aggregation Network): bottom-up feature propagation alongside FPN's top-down path
  • Mosaic augmentation: combine 4 images into one training sample - the model sees 4× the object variety per sample
  • CIoU loss: a better box regression loss that accounts for center distance and aspect ratio
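
To make the CIoU bullet concrete, here is a sketch of the score for a single pair of boxes, following the usual published formulation: IoU minus a normalized center-distance term minus an aspect-ratio term. The small `eps` guard and the function name are assumptions for the example; the training loss would be `1 - ciou(...)`.

```python
import math


def ciou(box, gt, eps=1e-9):
    """Complete-IoU score for two xyxy boxes (higher is better, at most 1)."""
    w, h = box[2] - box[0], box[3] - box[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]

    # Plain IoU
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    iou = inter / (w * h + gw * gh - inter + eps)

    # Squared distance between centers, normalized by the squared diagonal
    # of the smallest box enclosing both
    rho2 = ((box[0] + box[2]) - (gt[0] + gt[2])) ** 2 / 4 \
         + ((box[1] + box[3]) - (gt[1] + gt[3])) ** 2 / 4
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - rho2 / c2 - alpha * v


print(round(ciou((0, 0, 10, 10), (0, 0, 10, 10)), 4))      # 1.0
print(round(ciou((0, 0, 10, 10), (20, 20, 30, 30)), 4))    # -0.4444
```

Unlike plain IoU, the score keeps changing for non-overlapping boxes, so the regressor still receives a useful signal when a prediction misses the object entirely.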

YOLOv5 (2020, Ultralytics): Not an academic paper but an engineering product. Introduced clean Python/PyTorch codebase, easy fine-tuning API, ONNX/TensorRT export. Became the dominant production deployment choice by 2021.

YOLOv8 (2023, Ultralytics): The current production standard:

  • Anchor-free head: no pre-defined anchor boxes. Each feature-map location directly predicts the distances from itself to the four box edges. Simpler to tune for custom datasets.
  • Decoupled head: separate branches for classification and box regression (vs coupled in older YOLO). Reduces task interference.
  • C2f backbone: improved CSP variant with gradient flow through all layers.
  • Distribution Focal Loss (DFL): models box edge positions as distributions rather than point estimates - more stable training.

| Version | Backbone | mAP@50 (COCO) | FPS (V100) | Notable Innovation |
|---|---|---|---|---|
| YOLOv1 | Custom CNN | ~63% (VOC) | 45 (Titan X) | Single-pass detection |
| YOLOv3 | Darknet-53 | 57.9% | ~30 | Multi-scale prediction |
| YOLOv5m | CSP-Darknet | 63.9% | ~74 | Production-ready API |
| YOLOv8n | C2f (nano) | 52.9% | ~160 | Anchor-free, decoupled head |
| YOLOv8x | C2f (xlarge) | 61.9% | ~20 | Best YOLOv8 accuracy |

Feature Pyramid Networks (FPN)

Both the two-stage and one-stage families faced the same fundamental problem: multi-scale objects. A CNN's feature map at the end of the backbone is semantically rich but spatially coarse - great for detecting large objects, terrible for small ones. The early layers have high spatial resolution but poor semantics - they know where edges are, not what objects are.

Feature Pyramid Networks (Lin et al., 2017) solved this with a top-down pathway that fuses semantics from deep layers with spatial detail from shallow layers:

ResNet Backbone (bottom-up):      FPN (top-down):

C2 → 56×56 × 256                  P2 ← 56×56 × 256   ← detect smallest objects
                                   ↑ upsample + add
C3 → 28×28 × 512                  P3 ← 28×28 × 256
                                   ↑ upsample + add
C4 → 14×14 × 1024                 P4 ← 14×14 × 256
                                   ↑ upsample + add
C5 → 7×7 × 2048                   P5 ← 7×7 × 256     ← detect largest objects

At each FPN level:

  1. Apply a 1×1 conv to match the lateral connection's channel count to FPN's output channels
  2. Upsample the feature map from the level above by 2× (nearest neighbor)
  3. Add the two feature maps element-wise
  4. Apply a 3×3 conv to reduce aliasing

Each FPN level (P2 through P5) is connected to a detection head. Small objects get detected at P2 (high resolution, now with top-down semantic context), large objects at P5 (low resolution but correct scale). This is why Faster R-CNN + FPN is substantially better on small objects than the original Faster R-CNN.

YOLOv3 introduced multi-scale predictions, and YOLOv4+ use PANet - a bidirectional FPN variant that also propagates features bottom-up, giving each scale access to both high-level semantics and fine-grained local detail.

mAP: Evaluating Detection Models

Mean Average Precision (mAP) is the standard evaluation metric for object detection. Understanding it deeply matters for interpreting benchmark comparisons.

Step 1: For each class, build a Precision-Recall curve.

Given a confidence threshold $\tau$, a predicted box is a True Positive (TP) if:

  • Its confidence $\geq \tau$
  • Its IoU with an unmatched ground truth box of the correct class $\geq 0.5$ (or another IoU threshold)

A predicted box is a False Positive (FP) if the confidence threshold is met but IoU < threshold, or if there's no unmatched ground truth box. A False Negative (FN) is a ground truth box with no matching prediction.

Sweep $\tau$ from 0 to 1. At each threshold, compute Precision and Recall:

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \quad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

Step 2: Average Precision (AP) = area under the Precision-Recall curve for a single class. Interpolated at 11 points (PASCAL VOC old) or 101 points (COCO).

Step 3: mAP = mean AP across all classes.

$$\text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c$$
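
The three steps can be sketched for the 11-point (old PASCAL VOC) variant. This example assumes the detections for one class have already been matched to ground truth and sorted by descending confidence, so each is just a TP/FP flag; the function name and the toy data are illustrative.

```python
def average_precision_11pt(is_tp, num_gt):
    """11-point interpolated AP for one class.

    is_tp:  TP/FP flags for detections sorted by descending confidence
    num_gt: number of ground-truth boxes of this class
    """
    tp = fp = 0
    precisions, recalls = [], []
    for flag in is_tp:                  # lowering the threshold one detection at a time
        if flag:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)

    ap = 0.0
    for k in range(11):                 # recall levels 0.0, 0.1, ..., 1.0
        level = k / 10
        # Interpolated precision: best precision at any recall >= level
        p_interp = max((p for p, r in zip(precisions, recalls) if r >= level),
                       default=0.0)
        ap += p_interp / 11
    return ap


# One class: 5 detections, 4 ground-truth boxes (one is never found)
ap_hand = average_precision_11pt([True, True, False, True, False], num_gt=4)
ap_person = average_precision_11pt([True, True, True, True], num_gt=4)
mean_ap = (ap_hand + ap_person) / 2

print(round(ap_hand, 4), round(ap_person, 4), round(mean_ap, 4))
```

The missed ground-truth box caps recall at 0.75, so the three highest recall levels contribute zero precision. That is how false negatives pull AP down even when every reported box is fairly confident.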

mAP@50 vs mAP@0.5:0.95:

  • mAP@50: IoU threshold is 0.5. A box is "correct" if it overlaps the truth by at least half. Relatively permissive - a rough bounding box counts.
  • mAP@0.5:0.95 (COCO metric): average mAP over IoU thresholds 0.5, 0.55, 0.60, ..., 0.95. Penalizes imprecise localization heavily. A model with mAP@50 of 70% might score only 45% on mAP@0.5:0.95.

COCO also reports AP_S (objects < 32² pixels), AP_M (32²–96²), and AP_L (> 96²) separately. This reveals where a model fails - YOLOv1's AP_S was catastrophically low; Faster R-CNN + FPN is consistently stronger on small objects.

Python: Complete YOLOv8 Inference Pipeline

"""
YOLOv8 inference using the Ultralytics library.
Install: pip install ultralytics
"""
from pathlib import Path
import torch
from ultralytics import YOLO
from PIL import Image
import numpy as np


class WarehouseSafetyDetector:
    """
    End-to-end YOLOv8-based safety detection system.
    Detects hands and reports proximity to danger zones.
    """

    # "hand" is not one of the COCO classes, so the stock yolov8n.pt weights
    # will never emit it; point model_path at a fine-tuned checkpoint for hands.
    DANGER_ZONE_CLASSES = ["hand", "person"]

    def __init__(
        self,
        model_path: str = "yolov8n.pt",  # nano for edge, yolov8s/m for better accuracy
        confidence_threshold: float = 0.4,
        nms_iou_threshold: float = 0.5,
        device=None,  # None lets Ultralytics pick; or pass e.g. "cpu" or "cuda:0"
    ):
        self.model = YOLO(model_path)
        self.conf_threshold = confidence_threshold
        self.iou_threshold = nms_iou_threshold
        self.device = device

    def detect(self, image_path: str) -> list[dict]:
        """
        Run inference and return structured detections.

        Returns:
            List of dicts: {class_name, confidence, box_xyxy, box_cxcywh}
        """
        results = self.model(
            image_path,
            conf=self.conf_threshold,
            iou=self.iou_threshold,  # NMS IoU threshold
            device=self.device,
            verbose=False,
        )

        detections = []
        for result in results:
            boxes = result.boxes
            if boxes is None:
                continue

            for box in boxes:
                xyxy = box.xyxy[0].cpu().numpy()  # [x1, y1, x2, y2]
                xywh = box.xywh[0].cpu().numpy()  # [cx, cy, w, h]
                conf = box.conf[0].item()
                cls_id = int(box.cls[0].item())
                cls_name = result.names[cls_id]

                detections.append({
                    "class_name": cls_name,
                    "confidence": conf,
                    "box_xyxy": xyxy.tolist(),
                    "box_cxcywh": xywh.tolist(),
                })

        return detections

    def check_proximity(
        self,
        detections: list[dict],
        danger_zone: tuple[float, float, float, float],  # xyxy
        proximity_threshold_px: float = 50.0,
    ) -> list[dict]:
        """
        For each detection, compute distance to the danger zone box.
        Returns detections that are within proximity_threshold_px pixels.
        """
        dx1, dy1, dx2, dy2 = danger_zone
        alerts = []

        for det in detections:
            if det["class_name"] not in self.DANGER_ZONE_CLASSES:
                continue

            bx1, by1, bx2, by2 = det["box_xyxy"]

            # Gap between the box and the danger zone along each axis
            # (0 when they overlap on that axis)
            dist_x = max(0., max(dx1 - bx2, bx1 - dx2))
            dist_y = max(0., max(dy1 - by2, by1 - dy2))
            dist = (dist_x**2 + dist_y**2) ** 0.5

            if dist <= proximity_threshold_px:
                det["distance_to_danger_px"] = round(dist, 1)
                alerts.append(det)

        return alerts


def fine_tune_yolov8(
    data_yaml: str,
    base_model: str = "yolov8n.pt",
    epochs: int = 100,
    imgsz: int = 640,
) -> None:
    """
    Fine-tune YOLOv8 on a custom dataset.

    data_yaml must define:
        path: /path/to/dataset
        train: images/train
        val: images/val
        names: {0: hand, 1: glove, ...}
    """
    model = YOLO(base_model)
    results = model.train(
        data=data_yaml,
        epochs=epochs,
        imgsz=imgsz,
        batch=16,
        lr0=0.01,
        lrf=0.01,
        mosaic=1.0,       # Mosaic augmentation (combine 4 images)
        copy_paste=0.5,   # CopyPaste augmentation (relies on segmentation masks; likely a no-op with box-only labels)
        degrees=10.0,     # Rotation augmentation
        fliplr=0.5,
        patience=50,      # Early stopping
        save_period=10,   # Save checkpoint every N epochs
        project="warehouse_safety",
        name="yolov8n_hands",
    )
    print(f"Best mAP@50: {results.results_dict['metrics/mAP50(B)']:.4f}")


def export_for_production(model_path: str) -> None:
    """
    Export YOLOv8 to ONNX and TensorRT for deployment.
    """
    model = YOLO(model_path)

    # Export to ONNX (cross-platform, works on CPU/GPU)
    model.export(format="onnx", dynamic=True, simplify=True)

    # Export to TensorRT (NVIDIA GPU only, fastest inference)
    model.export(format="engine", half=True)  # FP16 TensorRT engine

    print("Exported ONNX and TensorRT models")

Two-Stage vs One-Stage: Engineering Tradeoffs

| Aspect | Two-Stage (Faster R-CNN + FPN) | One-Stage (YOLOv8) |
|---|---|---|
| Speed | 100–500 ms/image (CPU: 2–10 s) | 5–50 ms/image (CPU: 50–200 ms) |
| mAP@50-95 (COCO) | ~46–50% (ResNet-101) | ~44–55% (v8x) |
| Small object detection | Strong (dedicated FPN + RPN) | Weaker (improving in v8) |
| Dense object scenes | Better (per-region classification) | Struggles with heavy overlap |
| Deployment complexity | Higher (two-stage graph, RoI ops) | Lower (single graph) |
| Fine-tuning | Moderate (torchvision API) | Easy (Ultralytics API, 3 lines) |
| Edge deployment | Difficult (RoI ops hardware support) | Natural (ONNX, CoreML, TensorRT) |
| Typical use case | Medical imaging, satellite, QA inspection | Real-time video, mobile, robotics |
| Memory footprint | Higher (proposals + features stored) | Lower |

warning

Do not choose architecture by benchmark numbers alone. YOLOv8x can match Faster R-CNN + FPN on COCO mAP@50-95 - but benchmarks are on the COCO distribution. Your dataset may have different properties. If you have many small objects (smaller than 32×32 pixels), evaluate AP_S specifically. If latency matters, measure on your actual hardware with your batch size - not reported FPS from a paper.

Feature Pyramid Network (FPN): PyTorch Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F


class FPN(nn.Module):
    """
    Feature Pyramid Network (Lin et al., 2017).
    Merges multi-scale backbone features into a unified pyramid
    where each level has the same number of channels and carries
    both semantic richness and spatial detail.
    """

    def __init__(self, in_channels_list: list[int], out_channels: int = 256):
        """
        Args:
            in_channels_list: channel counts of backbone feature maps,
                e.g., [256, 512, 1024, 2048] for ResNet-50
            out_channels: uniform output channels at all FPN levels
        """
        super().__init__()

        # Lateral 1×1 convolutions: reduce each level to out_channels
        self.lateral_convs = nn.ModuleList([
            nn.Conv2d(in_ch, out_channels, kernel_size=1)
            for in_ch in in_channels_list
        ])

        # Output 3×3 convolutions: smooth the merged feature maps
        self.output_convs = nn.ModuleList([
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels_list
        ])

    def forward(self, feature_maps: list[torch.Tensor]) -> list[torch.Tensor]:
        """
        Args:
            feature_maps: backbone feature maps ordered finest to coarsest,
                e.g. [C2, C3, C4, C5]. Index 0 has the largest spatial size,
                index -1 the smallest.

        Returns:
            FPN maps [P2, P3, P4, P5] in the same order, each with
            out_channels channels.
        """
        # Apply lateral convolutions
        laterals = [conv(f) for conv, f in zip(self.lateral_convs, feature_maps)]

        # Top-down pathway: merge coarse semantics into fine spatial maps.
        # Walk from the second-coarsest level down to the finest.
        for i in range(len(laterals) - 2, -1, -1):
            # Upsample the next coarser level to this level's spatial size
            upsampled = F.interpolate(
                laterals[i + 1],
                size=laterals[i].shape[-2:],
                mode="nearest",
            )
            laterals[i] = laterals[i] + upsampled

        # Apply output convolutions to smooth each level
        outputs = [conv(lat) for conv, lat in zip(self.output_convs, laterals)]
        return outputs


# Example: 4-level FPN on top of ResNet-50 feature sizes
fpn = FPN(in_channels_list=[256, 512, 1024, 2048], out_channels=256)

# Simulate ResNet-50 feature maps for a 640×640 input
feature_maps = [
    torch.randn(2, 256, 160, 160),   # C2 (stride 4)
    torch.randn(2, 512, 80, 80),     # C3 (stride 8)
    torch.randn(2, 1024, 40, 40),    # C4 (stride 16)
    torch.randn(2, 2048, 20, 20),    # C5 (stride 32)
]

outputs = fpn(feature_maps)
for i, out in enumerate(outputs):
    print(f"P{i+2}: {out.shape}")  # All have 256 channels, matching spatial sizes

Production Notes

Model export for deployment:

# ONNX export for cross-platform inference
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights=None, num_classes=13)
model.eval()

# torchvision detection models take a list of 3D image tensors;
# export traces the model with this fixed-size example input
dummy_input = [torch.randn(3, 640, 640)]
torch.onnx.export(
    model, (dummy_input,), "faster_rcnn.onnx",
    input_names=["input"], output_names=["boxes", "labels", "scores"],
    opset_version=11,
)

# TensorRT: use torch2trt or the tensorrt Python API
# Usually the fastest option on NVIDIA hardware - FP16 roughly halves weight
# memory and often raises throughput, but measure the speedup on your own GPU

Batch inference for throughput:

# YOLOv8 batch inference
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Process multiple images in one forward pass
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg", "img4.jpg"]
results = model(image_paths, batch=4, stream=True) # stream=True reduces memory

for result in results:
    # Process one result at a time - memory efficient for large batches
    boxes = result.boxes.xyxy.cpu()
    print(f"Detected {len(boxes)} objects")

Confidence calibration:

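Raw model confidence scores are often miscalibrated. Overconfident scores are dangerous in safety-critical systems, where a score of 0.9 is read as "90% likely correct"; underconfident scores hurt recall-sensitive systems, where true detections fall below the confidence threshold. Calibrate with temperature scaling: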

def apply_temperature_scaling(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Divide logits by temperature before softmax. T > 1 softens, T < 1 sharpens."""
    return torch.softmax(logits / temperature, dim=-1)

# Tune temperature on a validation set to minimize Expected Calibration Error (ECE)
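The tuning step mentioned in the comment above can be sketched without any framework. The example below is an illustration under stated assumptions, not a production recipe: it computes ECE on top-1 softmax confidence with equal-width bins and picks the temperature by grid search. The helper names (`expected_calibration_error`, `ece_at_temperature`, `fit_temperature`) and the toy validation data are made up for this sketch. Fitting the temperature by minimizing negative log-likelihood is another common choice.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax over one row of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(n_bins - 1, int(conf * n_bins))
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def ece_at_temperature(logits_rows, labels, temperature: float) -> float:
    """ECE of the top-1 prediction after scaling logits by `temperature`."""
    confidences, correct = [], []
    for row, label in zip(logits_rows, labels):
        probs = softmax(row, temperature)
        predicted = probs.index(max(probs))
        confidences.append(probs[predicted])
        correct.append(1.0 if predicted == label else 0.0)
    return expected_calibration_error(confidences, correct)


def fit_temperature(logits_rows, labels, candidates) -> float:
    """Grid search: return the candidate temperature with the lowest ECE."""
    return min(candidates, key=lambda t: ece_at_temperature(logits_rows, labels, t))


# Toy validation set (made up): very confident logits, but one of four is wrong.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 1, 1]
grid = [0.5 + 0.25 * i for i in range(19)]  # 0.5, 0.75, ..., 5.0

best_t = fit_temperature(val_logits, val_labels, grid)
ece_before = ece_at_temperature(val_logits, val_labels, 1.0)
ece_after = ece_at_temperature(val_logits, val_labels, best_t)
print(f"best T = {best_t:.2f}, ECE before = {ece_before:.3f}, after = {ece_after:.3f}")
```

Because the toy model is overconfident, the search settles on a temperature above 1, which softens the scores toward the observed accuracy.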

Anchor design for custom datasets:

If using YOLOv5 or anchor-based detection on a custom dataset where object shapes differ from COCO (e.g., very thin objects like wires, or wide objects like conveyor belts), run k-means on your training box dimensions:

import numpy as np


def kmeans_anchor_boxes(wh: np.ndarray, k: int = 9) -> np.ndarray:
    """
    K-means clustering on box dimensions to design optimal anchors.
    Uses 1 - IoU as the distance metric (IoU-based k-means).

    Args:
        wh: (N, 2) array of box widths and heights (normalized)
        k: number of anchors
    Returns:
        (k, 2) anchor sizes
    """
    n = wh.shape[0]
    anchors = wh[np.random.choice(n, k, replace=False)]

    for _ in range(300):
        # Compute IoU between each box and each anchor
        # Using broadcast: expand dims for vectorized computation
        min_w = np.minimum(wh[:, None, 0], anchors[None, :, 0])
        min_h = np.minimum(wh[:, None, 1], anchors[None, :, 1])
        inter = min_w * min_h
        union = (wh[:, None, 0] * wh[:, None, 1]
                 + anchors[None, :, 0] * anchors[None, :, 1] - inter)
        iou = inter / (union + 1e-7)  # (N, k)

        # Assign each box to nearest anchor (highest IoU = lowest distance)
        assignments = iou.argmax(axis=1)
        new_anchors = np.array([
            wh[assignments == j].mean(axis=0) if (assignments == j).any()
            else anchors[j]
            for j in range(k)
        ])

        if np.allclose(new_anchors, anchors, atol=1e-4):
            break
        anchors = new_anchors

    return np.round(anchors, 3)

Interview Q&A

Q1: What is the key architectural difference between two-stage (Faster R-CNN) and one-stage (YOLO) detectors? When would you choose each?

Two-stage detectors decompose detection into two steps: (1) generate region proposals with the RPN, then (2) classify and refine each proposal. The split lets the model dedicate specialized computation to each task - proposals are coarsely scored for objectness, then the best candidates are carefully classified. This gives better accuracy, especially on small objects, because the second stage re-examines each candidate using features pooled from just that region. One-stage detectors predict boxes and classes directly from a grid or anchor layout in a single forward pass, with no separate proposal step. This is faster (roughly 5–50 ms vs 100–500 ms per image) and simpler to deploy, but historically had lower accuracy on small objects and dense scenes. Choose two-stage for medical imaging, satellite imagery, and quality inspection, where accuracy dominates. Choose one-stage for real-time video, edge devices, and other latency-constrained scenarios.

Q2: Explain the Region Proposal Network (RPN). What exactly does it predict, and how are anchors used?

The RPN is a small network (3×3 conv + two 1×1 conv heads) that slides over the backbone's shared feature map. At each spatial location, it evaluates $k$ pre-defined reference boxes called anchors - typically 9 per location (3 scales × 3 aspect ratios). For each anchor, the RPN predicts two outputs: (1) an objectness score - binary classification, "is there any object here or is this background?" - and (2) four bounding box offsets $(t_x, t_y, t_w, t_h)$ that transform the anchor into a tighter-fitting proposal. The offsets are parameterized relative to the anchor size so the regression task is scale-independent. Training assigns positive labels to anchors with IoU > 0.7 with any ground truth box (plus, for each ground truth box, the anchor that overlaps it most) and negative labels to anchors with IoU < 0.3 against every ground truth box; anchors in between are ignored. The top proposals by objectness are passed to the detection head. Because the RPN shares the backbone feature map with the detection head, region proposals are generated at essentially zero additional cost - this is why it replaced Selective Search.
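The offset parameterization can be made concrete with a short sketch. It follows the encoding from the Faster R-CNN paper (center shifts divided by anchor size, log-ratios for width and height). The function names `encode_offsets` and `decode_offsets` and the example boxes are illustrative choices, not part of any library.

```python
import math

Box = tuple[float, float, float, float]  # (center_x, center_y, width, height)


def encode_offsets(anchor: Box, target: Box) -> Box:
    """Offsets (t_x, t_y, t_w, t_h) that map `anchor` onto `target`."""
    xa, ya, wa, ha = anchor
    x, y, w, h = target
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode_offsets(anchor: Box, offsets: Box) -> Box:
    """Invert encode_offsets: apply predicted offsets to an anchor."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = offsets
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (100.0, 100.0, 64.0, 64.0)
ground_truth = (108.0, 92.0, 80.0, 48.0)

offsets = encode_offsets(anchor, ground_truth)
recovered = decode_offsets(anchor, offsets)
print("offsets:", [round(t, 4) for t in offsets])
print("decoded:", [round(v, 4) for v in recovered])

# Scale independence: doubling every coordinate leaves the offsets unchanged.
scaled_offsets = encode_offsets(
    tuple(2 * v for v in anchor), tuple(2 * v for v in ground_truth)
)
print("scaled offsets:", [round(t, 4) for t in scaled_offsets])
```

Dividing by the anchor's width and height is what makes the regression targets comparable across small and large anchors.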

Q3: How does Non-Maximum Suppression work? What does the IoU threshold control, and when would you use Soft-NMS instead?

NMS collapses many overlapping predictions for the same object into a single "best" box. Algorithm: sort all predicted boxes by confidence score descending; take the highest-confidence box and add it to the output; remove all remaining boxes whose IoU with the selected box exceeds the IoU threshold; repeat until no boxes remain. The IoU threshold controls aggressiveness. Lower threshold (0.3) removes boxes with even modest overlap - good for non-overlapping objects where any duplicate is clearly wrong. Higher threshold (0.6–0.7) only removes heavily overlapping boxes - better for densely packed objects like a crowd or shelf, where different objects may genuinely share a lot of image space. Soft-NMS replaces the hard removal with a score decay: boxes with high IoU with the selected box have their confidence reduced by a Gaussian function rather than deleted. This improves recall in dense scenes because legitimate nearby detections survive with reduced (but nonzero) scores; a final score threshold then discards only the boxes whose decayed score has dropped too low.
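Both procedures are short enough to sketch in plain Python. The version below is a readable illustration rather than an optimized implementation: the function names, the `sigma` and `score_threshold` defaults, and the three example boxes are assumptions made for this sketch.

```python
import math

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def hard_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the selected one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: decay overlapping scores instead of deleting boxes.

    Returns (index, decayed_score) pairs in selection order.
    """
    remaining = dict(enumerate(scores))
    kept = []
    while remaining:
        best = max(remaining, key=remaining.get)
        best_score = remaining.pop(best)
        if best_score < score_threshold:
            break  # everything left has decayed below the cutoff
        kept.append((best, best_score))
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            remaining[i] *= math.exp(-(overlap ** 2) / sigma)
    return kept


boxes = [(0, 0, 100, 100), (10, 0, 110, 100), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]

print("hard NMS keeps:", hard_nms(boxes, scores))
print("soft NMS keeps:", [(i, round(s, 3)) for i, s in soft_nms(boxes, scores)])
```

In this example the second box overlaps the first heavily, so hard NMS deletes it, while Soft-NMS keeps it with a much lower score and leaves the decision to the final score threshold.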

Q4: What is mAP@0.5:0.95 (COCO metric), and why is it much harder to achieve than mAP@50?

mAP@50 computes Average Precision using IoU threshold 0.5 - a predicted box counts as a correct detection if its IoU with the ground truth is at least 0.5. This is relatively permissive: a box that roughly covers the object qualifies. mAP@0.5:0.95 averages mAP over 10 IoU thresholds - 0.50, 0.55, 0.60, ..., 0.95. The strict thresholds (0.85, 0.90, 0.95) require precise localization: the predicted box must nearly exactly cover the ground truth. A model might detect the right object in the right area but have sloppy boundaries. At an IoU threshold of 0.50 that detection gets full credit; at 0.95 it counts as a miss. This is why a model with mAP@50 of 65% might only achieve mAP@0.5:0.95 of 38%. The COCO metric is better at separating models that truly localize objects precisely from those that just detect presence loosely. For downstream tasks like segmentation initialization or robotic grasping, precise localization matters - mAP@0.5:0.95 measures what you actually need.
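A single box makes the effect of the stricter thresholds visible. The sketch below is not a full AP computation; it only checks at how many of the ten COCO IoU thresholds one slightly shifted prediction would still count as a match. The boxes are made-up values.

```python
def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 100.0, 100.0)
prediction = (10.0, 10.0, 110.0, 110.0)  # right object, shifted by 10 px each way

thresholds = [0.50 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95
overlap = iou(ground_truth, prediction)
passed = [t for t in thresholds if overlap >= t]

print(f"IoU = {overlap:.3f}")
print(f"counts as a match at {len(passed)} of {len(thresholds)} thresholds")
```

A 10-pixel shift on a 100-pixel box is easy to overlook visually, yet it already fails most of the thresholds that mAP@0.5:0.95 averages over.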

Q5: YOLO v1 divides the image into a 7×7 grid and predicts objects per cell. What are the fundamental limitations of this approach, and how did later versions fix them?

YOLO v1's grid cell constraint is its most significant limitation: each grid cell predicts only one class. If two objects of different classes have overlapping centers or centers in the same grid cell (a hand on top of a tool), YOLO v1 can only detect one. This was partly addressed by predicting B=2 boxes per cell, but the single class prediction per cell remained the binding constraint. The coarse 7×7 grid also means small objects whose entire extent falls within one cell rely on a single cell's prediction - if that prediction is wrong, there is no recovery. YOLOv2 fixed the class issue by adopting anchor boxes: each cell predicts multiple boxes with independent class predictions, so multiple objects per cell are possible. YOLOv3 introduced multi-scale prediction at 3 different feature map resolutions (13×13, 26×26, 52×52 for a 416×416 input), which dramatically improved small object detection by giving small objects a dedicated fine-grained prediction surface. YOLOv8 went anchor-free entirely - instead of predicting offsets from pre-defined anchor shapes, each feature-map location predicts its box directly (as distances from that location to the four box edges), which removes anchor design as a tuning step on custom datasets where object aspect ratios differ from COCO.
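The grid-cell collision is easy to reproduce. The sketch below maps two hypothetical object centers to their cells at several grid resolutions. `grid_cell` and the coordinates are illustrative, and the sketch ignores how YOLOv3 actually assigns objects to scales and anchors; it only shows how cell resolution separates nearby centers.

```python
def grid_cell(center_x: float, center_y: float, grid_size: int) -> tuple[int, int]:
    """Map a normalized center (0..1) to its (row, col) cell on an S x S grid."""
    col = min(grid_size - 1, int(center_x * grid_size))
    row = min(grid_size - 1, int(center_y * grid_size))
    return row, col


# Two hypothetical objects with nearby centers (normalized image coordinates).
hand_center = (0.45, 0.47)
tool_center = (0.49, 0.51)

for grid_size in (7, 13, 26, 52):
    hand_cell = grid_cell(*hand_center, grid_size)
    tool_cell = grid_cell(*tool_center, grid_size)
    status = "same cell" if hand_cell == tool_cell else "different cells"
    print(f"{grid_size}x{grid_size}: hand {hand_cell}, tool {tool_cell} -> {status}")
```

On the 7×7 grid both centers land in one cell, so a single class prediction has to cover both objects; on the finer grids they fall into separate cells.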


© 2026 EngineersOfAI. All rights reserved.