Computer Vision Systems
30 Cameras, 100Hz, No Second Chances
The perception system for an autonomous vehicle is among the most unforgiving production ML deployments in existence. Every 10 milliseconds, 30 cameras mounted around the vehicle - forward-facing, rear-facing, side-facing, and fisheye - capture frames simultaneously. Each frame must be processed: objects (cars, pedestrians, cyclists, traffic lights, lane markings) must be detected, tracked across frames, and located in 3D. Each pipeline stage must finish its share of this work within that same 10ms window, before the next set of frames arrives.
The latency budget is not a soft target. If the perception system falls behind by even one frame, the vehicle's planning system makes decisions with stale data. At highway speeds (30 m/s), one missed 10ms frame means the vehicle's model of the world is 30 centimeters stale. In a dense urban intersection, 30 centimeters is the difference between a correct prediction and a critical failure.
The engineering constraints are genuinely hard: 30 cameras × 1920×1080 resolution × 100 frames per second is approximately 6.2 GB/s of raw image data at one byte per pixel, arriving continuously. Computation budget per stage: 10ms for detection, 3ms for tracking, 2ms for 3D fusion - 15ms end to end, with the stages pipelined across consecutive frames so that a new frame set is accepted every 10ms. Hardware: NVIDIA DRIVE Orin SoC with 254 TOPS (tera operations per second). No cloud inference - the vehicle must operate autonomously even with no network connectivity.
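The data-rate and staleness figures are simple products, so it is worth re-deriving them for whatever rig you are sizing. The sketch below is illustrative only: the function names are made up for this example, and it assumes one byte per pixel (raw Bayer output), so 8-bit RGB would triple the bandwidth.

```python
def raw_bandwidth_gb_per_s(cameras, width, height, fps, bytes_per_pixel=1):
    """Raw sensor bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return cameras * width * height * bytes_per_pixel * fps / 1e9


def staleness_m(speed_m_per_s, delay_s):
    """Distance travelled while the world model is delay_s seconds out of date."""
    return speed_m_per_s * delay_s


# 30 cameras at 1080p, one byte per pixel (assumption)
bw_30fps = raw_bandwidth_gb_per_s(30, 1920, 1080, fps=30)    # about 1.87 GB/s
bw_100hz = raw_bandwidth_gb_per_s(30, 1920, 1080, fps=100)   # about 6.22 GB/s

# One missed 10 ms frame at 30 m/s
stale = staleness_m(30.0, 0.010)                             # about 0.3 m

print(f"{bw_30fps:.2f} GB/s at 30 fps, {bw_100hz:.2f} GB/s at 100 Hz, {stale:.2f} m stale")
```

The frame rate dominates the result: the same rig needs more than three times the bandwidth at 100Hz as at 30 fps.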
This case study is about how you design, optimize, and maintain a production computer vision system under these constraints. The principles - hardware-aware model selection, TensorRT optimization, quantization, active learning for continuous improvement - apply beyond autonomous vehicles to any latency-critical vision deployment.
Requirements Analysis
Functional requirements:
- Detect and classify: vehicles, pedestrians, cyclists, traffic signs, traffic lights, lane markings
- 3D position estimation (relative to vehicle) for all detected objects
- Multi-object tracking: consistent IDs across frames
- Semantic segmentation of driveable area
- Support for all lighting conditions (day, night, rain, fog, glare)
Non-functional requirements:
- Latency: 10ms per frame for detection (100Hz pipeline)
- Throughput: process all 30 cameras per cycle (some cameras processed in parallel)
- Accuracy: pedestrian recall above 99.5% at IoU 0.5 (false negatives are more dangerous than false positives)
- Edge deployment: inference on NVIDIA DRIVE Orin (no cloud connectivity)
- Model size: under 50MB per model after quantization (cache pressure)
System Architecture
Component 1: Model Selection - YOLO vs RT-DETR
The choice of detection architecture determines whether you hit the 10ms latency target.
YOLO family (YOLOv9, YOLOv10): Single-stage detectors. Divide the image into a grid; each grid cell predicts bounding boxes and class probabilities directly. Extremely fast - YOLOv9-S achieves 2-4ms inference on a modern GPU at 640x640 input. Recent versions are anchor-free, but dense prediction on downsampled feature maps still makes small, distant objects harder to detect. Good for constrained edge hardware.
RT-DETR (Real-Time Detection Transformer): Transformer-based detector without NMS post-processing. Processes the image with a ResNet backbone + deformable attention. Competitive with YOLO on speed (3-6ms) but with better small object detection due to attention-based feature aggregation. Better for complex scenes with occlusion.
Decision for autonomous driving: Use RT-DETR for the forward-facing cameras (complex urban scenes, small pedestrians at distance). Use YOLOv9 for side and rear cameras (simpler scenes, larger nearby objects, tighter latency budget).
The key insight: you do not need to use one model architecture for all cameras. Different cameras see different distributions of objects and have different latency tolerances.
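One way to make the per-camera decision explicit is a small routing table from camera role to detector. This is a sketch under stated assumptions: the role names, the `DetectorConfig` type, and the budget values are hypothetical, not part of any NVIDIA or detector API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorConfig:
    model: str                 # which detector architecture serves this role
    latency_budget_ms: float   # per-frame budget for this role (illustrative)


# Hypothetical mapping: forward cameras get the transformer detector,
# side and rear cameras get the faster single-stage detector.
DETECTOR_BY_ROLE = {
    "front": DetectorConfig(model="rt-detr", latency_budget_ms=10.0),
    "side": DetectorConfig(model="yolov9", latency_budget_ms=8.0),
    "rear": DetectorConfig(model="yolov9", latency_budget_ms=8.0),
}


def detector_for(camera_role: str) -> DetectorConfig:
    """Look up the detector for a camera role, failing loudly on unknown roles."""
    try:
        return DETECTOR_BY_ROLE[camera_role]
    except KeyError:
        raise ValueError(f"no detector configured for camera role {camera_role!r}") from None


print(detector_for("front"))
print(detector_for("rear"))
```

Failing on an unknown role, rather than falling back to a default model, keeps a misconfigured camera from silently running the wrong detector.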
TensorRT Optimization
Raw PyTorch inference on the Orin would be too slow. TensorRT converts the PyTorch model into an optimized inference engine:
```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (importing this creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

# Note: this code uses the TensorRT 8.x Python API. TensorRT 10 removed
# build_engine, max_workspace_size, and the binding-index calls used below.


def export_to_tensorrt(
    model_path: str,
    output_path: str,
    input_shape: tuple = (1, 3, 640, 640),  # batch, C, H, W
    precision: str = "fp16",  # "fp32", "fp16", "int8"
    workspace_gb: int = 4,
) -> None:
    """
    Build an optimized TensorRT engine from an ONNX model.

    Args:
        model_path: path to ONNX model (export from PyTorch first)
        output_path: path to save TensorRT engine
        input_shape: shape the engine is tuned for (the "opt" profile shape)
        precision: inference precision (fp16 is typical for automotive)
        workspace_gb: GPU memory workspace for TensorRT optimization
    """
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    # Parse ONNX model
    with open(model_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse ONNX model")

    # Build configuration
    config = builder.create_builder_config()
    config.max_workspace_size = workspace_gb * (1 << 30)

    if precision == "fp16" and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    elif precision == "int8":
        config.set_flag(trt.BuilderFlag.INT8)
        # INT8 requires calibration data to determine quantization parameters
        # config.int8_calibrator = create_calibrator(calibration_data)

    # Optimization profile for dynamic batch sizes (min, opt, max).
    # "input" must match the input tensor name in the ONNX graph.
    profile = builder.create_optimization_profile()
    profile.set_shape("input", (1, 3, 640, 640), input_shape, (4, 3, 640, 640))
    config.add_optimization_profile(profile)

    # Build and serialize engine
    engine = builder.build_engine(network, config)
    if engine is None:
        raise RuntimeError("TensorRT engine build failed")
    with open(output_path, "wb") as f:
        f.write(engine.serialize())

    print(f"TensorRT engine saved to {output_path}")
    print("Estimated latency improvement: 2-5x over PyTorch inference")


class TRTInferenceEngine:
    """Fast inference using a pre-built TensorRT engine (one input, one output)."""

    def __init__(self, engine_path: str):
        logger = trt.Logger(trt.Logger.WARNING)
        runtime = trt.Runtime(logger)
        with open(engine_path, "rb") as f:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.stream = cuda.Stream()

        # Pre-allocate GPU buffers. Assumes static shapes: an engine built with a
        # dynamic batch profile reports -1 here and needs
        # context.set_binding_shape(0, shape) before buffers can be sized.
        self.device_buffers = []  # keep references so the allocations stay alive
        self.bindings = []
        shapes, dtypes = [], []
        for binding in self.engine:
            shape = tuple(self.engine.get_binding_shape(binding))
            dtype = np.dtype(trt.nptype(self.engine.get_binding_dtype(binding)))
            mem = cuda.mem_alloc(trt.volume(shape) * dtype.itemsize)
            self.device_buffers.append(mem)
            self.bindings.append(int(mem))
            shapes.append(shape)
            dtypes.append(dtype)

        # Page-locked host buffer for the output (binding 1), required for async copies
        self.host_output = cuda.pagelocked_empty(shapes[1], dtypes[1])

    def infer(self, input_tensor: np.ndarray) -> np.ndarray:
        """Run inference. Input and output are numpy arrays."""
        input_tensor = np.ascontiguousarray(input_tensor)
        cuda.memcpy_htod_async(self.device_buffers[0], input_tensor, self.stream)
        self.context.execute_async_v2(self.bindings, self.stream.handle)
        cuda.memcpy_dtoh_async(self.host_output, self.device_buffers[1], self.stream)
        self.stream.synchronize()
        return self.host_output
```
INT8 Quantization for Edge Deployment
FP16 inference on the DRIVE Orin achieves 2-4ms per frame. INT8 quantization reduces this to 1-2ms by using integer arithmetic, which the Orin's dedicated DLA (Deep Learning Accelerator) handles natively.
INT8 quantization maps FP32 weights to INT8 representation: $w_{\text{int8}} = \mathrm{round}(w_{\text{fp32}} / s)$, where $s$ is the scale factor. Determining the scale factor requires calibration data - a representative sample of inference inputs that characterizes the activation distribution.
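To make the role of the scale factor concrete, here is a minimal sketch of symmetric per-tensor quantization. It assumes max-abs calibration, which is the simplest possible choice; TensorRT's calibrators use more robust statistics (for example entropy-based calibration), and the function names here are illustrative.

```python
def calibrate_scale(values, num_bits=8):
    """Max-abs calibration: map the largest magnitude onto the top of the integer range."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Divide by the scale, round to the nearest integer, clamp to the signed range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integers back to approximate real values."""
    return [q * scale for q in q_values]


weights = [-0.50, -0.12, 0.0, 0.03, 0.26, 0.50]
scale = calibrate_scale(weights)
q = quantize(weights, scale)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"scale={scale:.6f}  q={q}  max round-trip error={max_error:.6f}")
```

The round-trip error is bounded by half the scale, which is why a single outlier in the calibration data (inflating the scale) costs precision for every other value.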
Post-Training Quantization (PTQ): Run calibration data through the FP32 model, record activation statistics, determine INT8 scale factors. Fast but may lose 1-3% mAP on challenging detection tasks.
Quantization-Aware Training (QAT): Simulate quantization during training using fake quantization operators. The model learns to represent weights in a quantization-friendly way. Recovers most of the accuracy lost in PTQ, at the cost of retraining.
For safety-critical automotive applications, QAT is recommended. The 1-3% mAP recovery at pedestrian-relevant recall thresholds is worth the extra training effort.
Component 2: Multi-Object Tracking
Detection identifies objects in each frame independently. Tracking assigns consistent IDs across frames, enabling the planning system to predict trajectories.
SORT (Simple Online Realtime Tracking): The standard baseline. Uses Kalman filtering to predict each tracked object's position in the next frame. Uses Hungarian algorithm to match predictions to new detections. Fast (1-2ms), effective for non-occluded scenarios.
ByteTrack: Improves SORT by associating low-confidence detections (objects partially occluded) with existing tracks before discarding them. Significantly better tracking through occlusions, which is critical in dense urban environments.
```python
from dataclasses import dataclass, field
from typing import List

import numpy as np
from scipy.optimize import linear_sum_assignment


@dataclass
class Track:
    track_id: int
    bbox: np.ndarray  # [x1, y1, x2, y2]
    confidence: float
    class_id: int
    age: int = 0   # frames since the last matched detection
    hits: int = 1  # total number of matched detections
    velocity: np.ndarray = field(default_factory=lambda: np.zeros(4))  # unused: no motion model here


def iou(bbox_a: np.ndarray, bbox_b: np.ndarray) -> float:
    """Compute IoU between two bounding boxes [x1, y1, x2, y2]."""
    x1 = max(bbox_a[0], bbox_b[0])
    y1 = max(bbox_a[1], bbox_b[1])
    x2 = min(bbox_a[2], bbox_b[2])
    y2 = min(bbox_a[3], bbox_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (bbox_a[2] - bbox_a[0]) * (bbox_a[3] - bbox_a[1])
    area_b = (bbox_b[2] - bbox_b[0]) * (bbox_b[3] - bbox_b[1])
    union = area_a + area_b - intersection
    return float(intersection / union) if union > 0 else 0.0


class ByteTracker:
    """
    Simplified ByteTrack-style multi-object tracker.
    Associates both high-confidence and low-confidence detections with tracks.
    Matches against each track's last observed box (no Kalman prediction).
    """

    def __init__(
        self,
        high_threshold: float = 0.6,
        low_threshold: float = 0.1,
        max_lost: int = 30,  # frames before removing a lost track
        min_hits: int = 3,   # frames before confirming a new track
        iou_threshold: float = 0.3,
    ):
        self.high_threshold = high_threshold
        self.low_threshold = low_threshold
        self.max_lost = max_lost
        self.min_hits = min_hits
        self.iou_threshold = iou_threshold
        self.tracks: List[Track] = []
        self.next_id = 0

    def update(self, detections: List[dict]) -> List[Track]:
        """
        Update tracks with new frame detections.

        Args:
            detections: list of {"bbox": np.ndarray, "confidence": float, "class_id": int}
        """
        # Split into high and low confidence detections
        high_dets = [d for d in detections if d["confidence"] >= self.high_threshold]
        low_dets = [
            d for d in detections
            if self.low_threshold <= d["confidence"] < self.high_threshold
        ]

        # Step 1: Match high-confidence detections to tracks
        unmatched_tracks, matched_high = self._match(self.tracks, high_dets)

        # Step 2: Match low-confidence detections to the tracks still unmatched
        still_unmatched, matched_low = self._match(unmatched_tracks, low_dets)

        # Update matched tracks (from both association rounds)
        for track, det in matched_high + matched_low:
            track.bbox = det["bbox"]
            track.confidence = det["confidence"]
            track.hits += 1
            track.age = 0

        # Increment age of tracks that matched nothing this frame
        for track in still_unmatched:
            track.age += 1

        # Remove old lost tracks
        self.tracks = [t for t in self.tracks if t.age <= self.max_lost]

        # Create new tracks from unmatched high-confidence detections only
        matched_det_ids = {id(d) for _, d in matched_high}
        for det in high_dets:
            if id(det) not in matched_det_ids:
                new_track = Track(
                    track_id=self.next_id,
                    bbox=det["bbox"],
                    confidence=det["confidence"],
                    class_id=det["class_id"],
                )
                self.next_id += 1
                self.tracks.append(new_track)

        # Return confirmed tracks (enough hits to be reliable)
        return [t for t in self.tracks if t.hits >= self.min_hits]

    def _match(self, tracks, detections):
        """Hungarian matching on the IoU matrix. Returns (unmatched_tracks, matched_pairs)."""
        if not tracks or not detections:
            return list(tracks), []

        iou_matrix = np.zeros((len(tracks), len(detections)))
        for i, track in enumerate(tracks):
            for j, det in enumerate(detections):
                iou_matrix[i, j] = iou(track.bbox, det["bbox"])

        # linear_sum_assignment minimizes cost, so negate IoU to maximize overlap
        row_ind, col_ind = linear_sum_assignment(-iou_matrix)

        matched = []
        matched_track_indices = set()
        for r, c in zip(row_ind, col_ind):
            if iou_matrix[r, c] >= self.iou_threshold:
                matched.append((tracks[r], detections[c]))
                matched_track_indices.add(r)

        unmatched_tracks = [t for i, t in enumerate(tracks) if i not in matched_track_indices]
        return unmatched_tracks, matched
```
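The tracker above matches detections against each track's last observed box and has no motion model. The Kalman prediction step that SORT relies on can be sketched in one dimension as follows. This is a simplified illustration, not the SORT implementation: real trackers filter a full box state (center, scale, aspect ratio and their velocities), and the noise values below are assumed, not tuned.

```python
from dataclasses import dataclass


@dataclass
class ConstantVelocityKalman1D:
    """Kalman filter over state [position, velocity] with position-only measurements."""
    pos: float = 0.0
    vel: float = 0.0
    # Covariance P = [[p00, p01], [p10, p11]]; large diagonal = uncertain start
    p00: float = 100.0
    p01: float = 0.0
    p10: float = 0.0
    p11: float = 100.0
    process_var: float = 0.01     # Q, added to both diagonal terms (assumed value)
    measurement_var: float = 1.0  # R (assumed value)

    def predict(self, dt: float = 1.0) -> float:
        """Advance the state one step; returns the predicted position."""
        self.pos += self.vel * dt
        # P <- F P F^T + Q, with F = [[1, dt], [0, 1]]
        p00 = self.p00 + dt * (self.p10 + self.p01) + dt * dt * self.p11 + self.process_var
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.process_var
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.pos

    def update(self, measured_pos: float) -> None:
        """Correct the state with a position measurement (H = [1, 0])."""
        residual = measured_pos - self.pos
        s = self.p00 + self.measurement_var      # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s      # Kalman gain
        self.pos += k0 * residual
        self.vel += k1 * residual
        # P <- (I - K H) P
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11


# Object moving 2 pixels per frame; the filter starts with zero velocity
kf = ConstantVelocityKalman1D()
for frame in range(1, 51):
    kf.predict()
    kf.update(2.0 * frame)

predicted_next = kf.predict()  # position expected at frame 51
print(f"velocity estimate: {kf.vel:.3f}, predicted next position: {predicted_next:.2f}")
```

In a tracker, the predicted box (rather than the last observed one) is what gets compared against new detections, which keeps IoU matching reliable when objects move quickly between frames.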
Component 3: Active Learning for Vision
Manual annotation of autonomous driving data is expensive. A pedestrian annotation at 99.5% recall quality costs $8-15 per image. At 100Hz with 30 cameras, you generate 3,000 images per second: 180,000 per minute, or 10.8 million per hour of driving. You cannot annotate everything. Active learning selects which images to annotate, maximizing model improvement per annotation dollar.
Uncertainty sampling: Select frames where the model is most uncertain (low confidence detections, or high variance in ensemble predictions). These frames are most informative for model improvement.
Diversity sampling: Select frames that are most different from the existing training set. Avoids redundantly annotating similar scenes.
Core-set selection: Select a representative subset of unlabeled frames that covers the embedding space of all unlabeled frames. Uses greedy k-center selection.
```python
from typing import List

import numpy as np


class ActiveLearningSelector:
    """
    Select frames for annotation using uncertainty and diversity sampling.
    """

    def __init__(self, uncertainty_weight: float = 0.6, diversity_weight: float = 0.4):
        self.unc_weight = uncertainty_weight
        self.div_weight = diversity_weight

    def compute_uncertainty(self, detections: List[dict]) -> float:
        """
        Frame uncertainty: based on low-confidence detections.
        High uncertainty = many objects with confidence near 0.5.
        """
        if not detections:
            return 0.5  # No detection = uncertain about empty scene

        confidences = np.array([d["confidence"] for d in detections])
        # Per-box uncertainty peaks at 1.0 for confidence 0.5 and falls to 0.0 at 0 or 1
        uncertainty_per_box = 1 - np.abs(confidences - 0.5) * 2
        return float(np.mean(uncertainty_per_box))

    def select_frames_for_annotation(
        self,
        frame_pool: List[dict],  # {"frame_id": str, "detections": [...], "embedding": np.ndarray}
        existing_training_embeddings: np.ndarray,
        n_select: int = 1000,
    ) -> List[str]:
        """
        Select n_select frames from frame_pool for annotation.
        Balances uncertainty and diversity relative to existing training set.
        """
        uncertainty_scores = np.array([
            self.compute_uncertainty(f["detections"]) for f in frame_pool
        ])

        # Diversity score: distance to nearest neighbor in training set
        pool_embeddings = np.stack([f["embedding"] for f in frame_pool])
        diversity_scores = self._min_distance_to_set(pool_embeddings, existing_training_embeddings)

        # Normalize both scores to [0, 1]
        # (np.ptp rather than the ndarray.ptp method, which NumPy 2.0 removed)
        unc_norm = (uncertainty_scores - uncertainty_scores.min()) / (np.ptp(uncertainty_scores) + 1e-8)
        div_norm = (diversity_scores - diversity_scores.min()) / (np.ptp(diversity_scores) + 1e-8)

        # Combined score, highest first
        combined = self.unc_weight * unc_norm + self.div_weight * div_norm
        top_indices = np.argsort(combined)[::-1][:n_select]

        return [frame_pool[i]["frame_id"] for i in top_indices]

    def _min_distance_to_set(
        self,
        query_embeddings: np.ndarray,
        reference_embeddings: np.ndarray,
    ) -> np.ndarray:
        """Compute minimum L2 distance from each query to the reference set."""
        # Broadcasting builds an (n_queries, n_reference, dim) array; chunk this for large pools
        distances = np.linalg.norm(
            query_embeddings[:, np.newaxis, :] - reference_embeddings[np.newaxis, :, :],
            axis=-1,
        )  # (n_queries, n_reference)
        return distances.min(axis=1)  # (n_queries,)
```
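The selector above scores each frame independently, so it can still pick many near-duplicate frames. Core-set selection, described earlier, avoids this by accounting for what has already been chosen. A minimal sketch of greedy k-center follows; the function name and the tuple-based embeddings are illustrative assumptions, and a production version would use vectorized distances over real embedding arrays.

```python
import math


def k_center_greedy(pool, labeled, k):
    """
    Greedy k-center selection.

    Repeatedly picks the pool point farthest from every point already covered
    (the labeled set plus earlier picks). Returns indices into pool, in pick order.
    pool and labeled are lists of equal-length coordinate tuples.
    """
    if k > len(pool):
        raise ValueError("k cannot exceed the pool size")

    # Distance from each pool point to its nearest covered point
    min_dist = [
        min((math.dist(p, c) for c in labeled), default=math.inf)
        for p in pool
    ]

    selected = []
    for _ in range(k):
        idx = max(range(len(pool)), key=lambda i: min_dist[i])
        selected.append(idx)
        # The new pick now covers its neighbourhood
        for i, p in enumerate(pool):
            min_dist[i] = min(min_dist[i], math.dist(p, pool[idx]))
    return selected


# Two tight clusters plus an outlier; (0, 0) is already in the training set
pool = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
labeled = [(0.0, 0.0)]
picks = k_center_greedy(pool, labeled, k=2)
print(picks)  # → [4, 2]
```

The outlier is chosen first, then one representative of the uncovered cluster; its near-duplicate neighbour is skipped because it is already covered.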
Production Engineering Notes
Data Pipeline for Vision
A production vision ML pipeline requires:
Data versioning: Every training dataset version must be reproducible. Use DVC (Data Version Control) to version datasets alongside model weights. A model trained on dataset-v42 should be reproducible from that tag alone.
Augmentation pipeline: Standard augmentations for autonomous driving - horizontal flip, color jitter, cutout, mosaic (combine 4 images into one training example). GPU-accelerated augmentation (NVIDIA DALI) saves significant CPU bottlenecks in data loading.
Quality control: Every annotated frame undergoes automated quality checks: Are bounding boxes tight to objects? Are class labels consistent with the scene? Are there missing annotations (unannotated objects in the frame)? Automated QC catches annotation errors before they enter training data.
Quality Metrics
Standard vision metrics for autonomous driving:
mAP (mean Average Precision): Area under the precision-recall curve, averaged over all classes and IoU thresholds. Standard for comparing models. Target: mAP@0.5 above 0.75.
Class-specific recall at fixed precision: For pedestrians, the system must maintain 99.5% recall at IoU 0.5. This is more important than average mAP - missing one pedestrian is categorically different from misclassifying a signpost.
Tracking metrics: MOTA (Multiple Object Tracking Accuracy) = 1 - (FN + FP + ID switches) / GT. MOTP (precision): average IoU of matched detections. Target: MOTA above 0.80.
Latency histogram: Not just mean latency - the P99 and P99.9 latency matter. At 100Hz, a P99 latency violation means the perception system falls behind on 1% of cycles: 3,600 of the 360,000 cycles in an hour, or 36 seconds' worth of frames. P99 must be under 10ms.
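Tail latency is straightforward to compute from logged per-cycle timings. The sketch below uses the nearest-rank percentile definition and synthetic latencies; the helper names and the sample data are illustrative, not measurements from real hardware.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def late_cycles_per_hour(samples_ms, budget_ms, rate_hz=100):
    """Extrapolate the observed fraction of over-budget cycles to one hour of operation."""
    late_fraction = sum(s > budget_ms for s in samples_ms) / len(samples_ms)
    return late_fraction * rate_hz * 3600


# Synthetic log: 990 cycles at 7 ms and 10 slow cycles at 14 ms
latencies_ms = [7.0] * 990 + [14.0] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99 = percentile_nearest_rank(latencies_ms, 99)
p999 = percentile_nearest_rank(latencies_ms, 99.9)
late = late_cycles_per_hour(latencies_ms, budget_ms=10.0)

print(f"mean={mean_ms:.2f} ms  P99={p99} ms  P99.9={p999} ms  late cycles/hour={late:.0f}")
```

In this synthetic log the mean (7.07ms) and even the P99 (7ms) look healthy, while the P99.9 (14ms) exposes the slow cycles: 1% of cycles over budget extrapolates to 3,600 late cycles per hour at 100Hz.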
Common Mistakes
Mistake: Optimizing for mAP without checking class-specific recall at safety-critical thresholds.
mAP averages across classes. A model can achieve mAP=0.85 while having pedestrian recall of only 95% by excelling at vehicle detection. For autonomous driving, each safety-critical class (pedestrians, cyclists) must be evaluated separately at the recall threshold that corresponds to safe operation. Define minimum recall thresholds per class as hard requirements in the model specification, not as soft preferences.
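Treating per-class recall as a hard requirement can be as simple as a release gate that runs alongside the mAP report. The sketch below is illustrative: the pedestrian threshold comes from the requirements above, while the cyclist threshold, the function names, and the counts are hypothetical.

```python
# Minimum recall per safety-critical class (cyclist value is a placeholder)
MIN_RECALL = {"pedestrian": 0.995, "cyclist": 0.99}


def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN); defined as 0.0 when there is no ground truth."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def release_gate(counts, min_recall=MIN_RECALL):
    """
    counts: {class_name: (true_positives, false_negatives)} from the evaluation set.
    Returns {class_name: measured_recall} for every class below its threshold.
    A class missing from counts fails the gate rather than passing silently.
    """
    failures = {}
    for cls, threshold in min_recall.items():
        tp, fn = counts.get(cls, (0, 0))
        measured = recall(tp, fn)
        if measured < threshold:
            failures[cls] = measured
    return failures


# A model that is strong on vehicles but misses 5% of pedestrians
counts = {
    "pedestrian": (9500, 500),
    "cyclist": (1990, 10),
    "vehicle": (49900, 100),
}
failures = release_gate(counts)
print(failures if failures else "all safety-critical classes pass")
```

Here the vehicle class would pull the average up, but the gate still blocks the release because pedestrian recall is 95%.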
Mistake: Testing latency only on a development GPU and not the deployment hardware.
A PyTorch model that runs in 8ms on an A100 may run in 45ms on the DRIVE Orin with native PyTorch. TensorRT FP16 on the Orin may achieve 9ms. INT8 on the Orin's DLA may achieve 5ms. Always profile on target hardware. Latency is not transferable across hardware platforms without measurement.
Tip: Use separate detection heads for different camera types rather than one universal model.
Fisheye cameras produce heavily distorted images where standard bounding box detection degrades significantly. Train separate detection heads (or separate models) for fisheye vs. rectilinear cameras. Deformable convolutions handle fisheye distortion better than standard convolutions. The accuracy improvement from camera-type-specific models is substantial and the serving cost is minimal since cameras of each type run in parallel anyway.
Interview Q&A
Q: Design a production computer vision system for autonomous vehicle perception with 30 cameras at 100Hz.
A: The constraints define the architecture. 30 cameras at 100Hz = 3,000 frames per second total, but they share the 10ms window. I split the cameras into three groups of 10; each group is processed as one batched forward pass in 8ms on a TensorRT-optimized YOLOv9 or RT-DETR model. The three groups run in parallel on separate accelerators (the Orin has two DLA cores plus its GPU). Model: YOLOv9-M for side and rear cameras, RT-DETR for forward cameras (better at small distant pedestrians). Both are exported to TensorRT INT8. After detection, ByteTrack for multi-object tracking (2ms). Kalman filtering for velocity estimation. LiDAR-camera fusion for 3D position estimation (1ms, uses precomputed calibration matrices). Total: 11ms end to end, inside the 15ms budget with 4ms of margin; because the stages are pipelined, only the 8ms detection stage has to fit within the 10ms frame interval. Continuous improvement via active learning: select uncertainty-and-diversity-maximizing frames for annotation, with weekly model updates validated against a regression test suite before deployment.
Q: How do you quantize a vision model for edge deployment without losing critical recall on pedestrians?
A: Post-training quantization first, with careful calibration. Calibration data should be drawn from the same distribution as deployment - use a representative dataset of challenging scenarios (night, rain, occlusion). After PTQ, measure pedestrian-specific recall at the operating IoU threshold. If recall drops more than 0.5%, switch to quantization-aware training (QAT). QAT simulates quantization during fine-tuning using fake quantization operators, allowing the model to adapt its weight distribution to be quantization-friendly. QAT typically recovers 50-80% of the recall loss from PTQ. Additionally: use sensitivity analysis to identify which layers are most sensitive to quantization (typically the first and last layers) and keep those in FP16 while quantizing the rest to INT8. This mixed-precision approach recovers quality in the most sensitive layers with minimal latency cost.
Q: How does active learning work for improving a vision model continuously in production?
A: The core insight is that you cannot annotate every frame - at 100Hz with 30 cameras, that is about 10.8 million frames per hour. Active learning selects which frames provide the most information per annotation cost. Two key signals: uncertainty (frames where the model is least confident - low-confidence detections, or detections that disappear and reappear across frames) and diversity (frames most different from the existing training set, measured by embedding distance). A sampling strategy that balances both prevents redundant annotation of easy common scenes and focuses budget on rare, challenging scenarios (unusual lighting, novel object configurations, edge cases). In production: run inference on every frame, compute uncertainty and diversity scores, queue high-scoring frames for human annotation, retrain weekly on the augmented dataset. This creates a data flywheel: the model improves, new model failures reveal new challenging scenarios, those scenarios are annotated, the cycle continues.
