Design: Autonomous Driving ML - Perception, Prediction, and Planning

Reading time: ~25 min | Interview relevance: Medium-High | Roles: MLE (specialized)

The Real Interview Moment

"Design the ML system for a self-driving car." The scope is enormous. You start describing a CNN for object detection. The interviewer asks: "Your model detects a pedestrian at 50 meters. What happens next? How does the car decide what to do? What if the pedestrian is partially occluded? What if the detector has 99.9% recall? That still means 1 missed detection per 1,000, and at highway speed a single miss can be fatal."

Autonomous driving is the system design question where safety constraints dominate everything. The strongest candidates design for failure - what happens when the model is wrong, not just when it's right.

What You Will Master

The perception-prediction-planning stack
Object detection and 3D perception (LiDAR + camera fusion)
Behavior prediction for other road agents
Motion planning under uncertainty
Safety architecture: redundancy, fallbacks, operational design domain
Real-time constraints: end-to-end latency <100ms

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

Perceive the environment: detect vehicles, pedestrians, cyclists, lane markings, traffic signs
Predict behavior of other agents (other cars, pedestrians)
Plan a safe trajectory
Execute the plan via vehicle control

Non-functional requirements:

Latency: <100ms end-to-end (perception → plan)
Safety: Mean distance between critical failures > 10M miles
Operational design domain (ODD): Highway driving in clear weather (Level 4 scope)
Sensor redundancy: System must function with any single sensor failure
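To make the latency requirement concrete, it helps to convert it into distance. The sketch below is plain arithmetic: how far the car travels before a plan can even be produced. The 108 km/h speed is an illustrative assumption, not a figure from the requirements.

```python
def distance_during_latency(speed_kmh: float, latency_ms: float) -> float:
    """Metres travelled at constant speed while the stack is still computing."""
    speed_m_per_s = speed_kmh * 1000.0 / 3600.0
    return speed_m_per_s * latency_ms / 1000.0


if __name__ == "__main__":
    # 108 km/h is 30 m/s (assumed highway speed for illustration).
    for latency_ms in (50.0, 100.0, 200.0):
        metres = distance_during_latency(108.0, latency_ms)
        print(f"{latency_ms:.0f} ms -> {metres:.1f} m travelled")
```

At 30 m/s, the 100ms budget corresponds to 3 m of travel before braking can begin, which is why the latency target is treated as a safety requirement rather than a performance nicety.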

Step 2: Problem Formulation (5 min)

Autonomous Driving Stack - Sensors → Perception → Prediction → Planning → Control

Component	ML Problem	Key Metric
Perception	3D object detection + tracking	mAP, recall for pedestrians
Prediction	Trajectory forecasting	ADE (average displacement error)
Planning	Trajectory optimization	Safety score, comfort score

Step 3: Perception Stack (8 min)

Sensor Fusion

Sensor	Strengths	Weaknesses
Camera	Color, texture, traffic signs, lane markings	No depth, poor in darkness/glare
LiDAR	Precise 3D depth, works in dark	No color, sparse at distance, expensive
Radar	Velocity directly, works in rain/fog	Low resolution, no shape info

Fusion approach: Early fusion (raw point cloud + image features) or late fusion (separate detectors, merge at output).
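Late fusion can be sketched as a matching problem between two detector outputs. The example below is a minimal illustration, not a production algorithm: the `Detection` type, the greedy nearest-center matching, the 1.5 m gate, and the noisy-OR confidence merge are all assumptions made for this sketch, and it assumes the camera detector already provides an estimated ground-plane position.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Detection:
    x: float      # metres, vehicle frame (assumed)
    y: float
    label: str
    score: float  # detector confidence in [0, 1]
    source: str


def late_fuse(lidar, camera, max_dist=1.5):
    """Greedily match LiDAR and camera detections by class and center distance."""
    fused = []
    unmatched_camera = list(camera)
    for ld in lidar:
        best, best_dist = None, max_dist
        for cd in unmatched_camera:
            dist = hypot(ld.x - cd.x, ld.y - cd.y)
            if cd.label == ld.label and dist <= best_dist:
                best, best_dist = cd, dist
        if best is None:
            fused.append(ld)  # LiDAR-only detection is kept
            continue
        unmatched_camera.remove(best)
        # Trust LiDAR for position; combine confidences with a noisy-OR.
        score = 1.0 - (1.0 - ld.score) * (1.0 - best.score)
        fused.append(Detection(ld.x, ld.y, ld.label, score, "lidar+camera"))
    # Camera-only detections are kept too: this favors recall over precision.
    fused.extend(unmatched_camera)
    return fused
```

Keeping unmatched detections from either sensor is a deliberate choice here: a false positive costs comfort, while a dropped true positive costs safety.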

Object Detection

Approach	How It Works	Latency	Accuracy
PointPillars	Encode LiDAR points into pillars, 2D CNN	~20ms	Good
CenterPoint	Center-based 3D detection on LiDAR	~30ms	Very good
BEVFusion	Bird's-eye-view fusion of camera + LiDAR	~50ms	Best

The key design trade-off: Camera-only (Tesla's approach) vs. LiDAR+Camera (Waymo, Cruise). Camera-only is cheaper at scale but makes safety targets harder to reach, because depth must be inferred rather than measured. LiDAR+Camera is more reliable but more expensive.

Common Trap

Don't focus only on detection accuracy. In autonomous driving, recall for vulnerable road users (pedestrians, cyclists) is more important than overall mAP. Missing a car at 50m is bad. Missing a child at 50m is catastrophic. Design your loss function to weight classes by safety impact.
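One way to weight classes by safety impact is a class-weighted cross-entropy. The sketch below uses plain Python for clarity; the weight values are hypothetical and would in practice be tuned against per-class recall targets.

```python
import math

# Hypothetical safety weights: vulnerable road users count more.
SAFETY_WEIGHTS = {"car": 1.0, "cyclist": 3.0, "pedestrian": 5.0}


def weighted_cross_entropy(probs, labels, weights=SAFETY_WEIGHTS):
    """Weighted mean of -log p(true class).

    probs:  list of dicts mapping class name -> predicted probability
    labels: list of true class names, same length as probs
    """
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        norm += w
    return total / norm
```

With these weights, a low-confidence prediction on a true pedestrian contributes five times the loss of the same mistake on a car, so gradient updates push harder on pedestrian recall.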

Step 4: Prediction (5 min)

Behavior Prediction

Given detected agents, predict their future trajectories (next 3-8 seconds).

Approaches:

Method	How It Works	Pro	Con
Physics-based	Constant velocity/acceleration	Simple, interpretable	Doesn't capture intent
Social LSTM	Per-agent LSTMs with pooling over neighbors' hidden states	Captures agent-agent interactions	Slow, limited context
Transformer-based	Attention over agent histories + map	Best accuracy, captures interactions	Compute-heavy
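The physics-based row is simple enough to write out in full, and it makes a useful baseline to compare learned predictors against. This is a minimal sketch of constant-velocity extrapolation; the function name, the 3-second horizon, and the 0.5-second step are illustrative choices.

```python
def constant_velocity_forecast(x, y, vx, vy, horizon_s=3.0, dt=0.5):
    """Extrapolate an agent's position assuming velocity stays constant.

    Returns a list of (x, y) waypoints at dt, 2*dt, ..., horizon_s.
    """
    steps = int(round(horizon_s / dt))
    return [(x + vx * dt * k, y + vy * dt * k) for k in range(1, steps + 1)]


if __name__ == "__main__":
    # A car at the origin moving at 10 m/s along x.
    print(constant_velocity_forecast(0.0, 0.0, 10.0, 0.0))
```

The weakness listed in the table is visible in the code: nothing in the inputs encodes lanes, turn signals, or other agents, so the forecast cannot anticipate a turn.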

Key insight: Predictions should be multi-modal - a car at an intersection might go straight, turn left, or turn right. Output multiple possible trajectories with probabilities, not a single prediction.
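The value of multi-modal output shows up directly in the error metric. The sketch below compares the ADE of the single most likely mode against the minimum ADE over all modes (often called minADE). The `(probability, trajectory)` representation is an assumption made for this example.

```python
from math import hypot


def ade(pred, truth):
    """Average displacement error: mean Euclidean distance per timestep."""
    errors = [hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(errors) / len(errors)


def most_likely_ade(modes, truth):
    """ADE of the highest-probability mode only (single-prediction behavior)."""
    _, traj = max(modes, key=lambda m: m[0])
    return ade(traj, truth)


def min_ade(modes, truth):
    """ADE of the best mode: rewards covering what actually happened."""
    return min(ade(traj, truth) for _, traj in modes)


if __name__ == "__main__":
    truth = [(5.0, 0.0), (10.0, 2.0), (15.0, 6.0)]  # the car turned left
    modes = [
        (0.6, [(5.0, 0.0), (10.0, 0.0), (15.0, 0.0)]),  # go straight
        (0.4, [(5.0, 0.0), (10.0, 2.0), (15.0, 6.0)]),  # turn left
    ]
    print(most_likely_ade(modes, truth), min_ade(modes, truth))
```

A predictor that emits only the "go straight" mode is wrong by several meters in this scenario, while the multi-modal output contains the correct maneuver and lets the planner hedge against it.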

Step 5: Planning & Safety (8 min)

Motion Planning

Motion Planning Pipeline - Predicted Trajectories + Route Graph → Candidate Generation → Score (Safety, Comfort, Progress) → Select Best

Trajectory scoring: Score = w1*Safety + w2*Comfort + w3*Progress

Safety: Distance to predicted agent trajectories, time-to-collision
Comfort: Jerk (rate of acceleration change), lateral acceleration
Progress: Distance toward goal, speed relative to speed limit
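The scoring formula above can be sketched as follows. All weights, normalization constants, and the `Candidate` fields are hypothetical. The sketch also adds one assumption beyond the formula: a hard minimum-gap filter applied before scoring, so that no amount of progress can buy back an unsafe trajectory.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    min_gap_m: float   # closest approach to any predicted agent trajectory
    max_jerk: float    # m/s^3
    progress_m: float  # distance advanced toward the goal


def score(c, w_safety=0.6, w_comfort=0.1, w_progress=0.3,
          safe_gap_m=10.0, jerk_limit=5.0, horizon_m=100.0):
    """Score = w1*Safety + w2*Comfort + w3*Progress, each term scaled to [0, 1]."""
    safety = min(c.min_gap_m / safe_gap_m, 1.0)
    comfort = 1.0 - min(c.max_jerk / jerk_limit, 1.0)
    progress = min(c.progress_m / horizon_m, 1.0)
    return w_safety * safety + w_comfort * comfort + w_progress * progress


def select_best(candidates, hard_min_gap_m=2.0):
    """Drop candidates that violate the hard safety gap, then take the top score."""
    feasible = [c for c in candidates if c.min_gap_m >= hard_min_gap_m]
    return max(feasible, key=score) if feasible else None
```

Returning `None` when nothing is feasible is intentional: that case should hand control to a fallback layer rather than pick the least-bad unsafe option.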

Safety Architecture

Layer	What It Does	Example
Primary system	Full ML stack (perception → planning)	Normal driving
Safety monitor	Rule-based checks on planned trajectory	Reject trajectory if collision risk > threshold
Emergency system	Hardcoded emergency maneuvers	Emergency brake if object detected < 5m
Minimal risk condition	Bring vehicle to safe stop	Pull over and stop if system degrades
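The four layers can be read as a priority-ordered arbiter. The sketch below is one possible ordering, assumed for illustration: the emergency check runs first because it depends on the fewest components, then system health, then the trajectory-level monitor. The 5 m distance comes from the table; the risk threshold value and all names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    FOLLOW_PLAN = "follow_plan"              # primary system output accepted
    REJECT_PLAN = "reject_plan"              # safety monitor vetoes the trajectory
    EMERGENCY_BRAKE = "emergency_brake"      # hardcoded emergency maneuver
    MINIMAL_RISK_STOP = "minimal_risk_stop"  # pull over and stop


def arbitrate(nearest_obstacle_m, collision_risk, system_healthy,
              emergency_dist_m=5.0, risk_threshold=0.01):
    """Pick the action of the highest-priority layer whose trigger fires."""
    if nearest_obstacle_m < emergency_dist_m:
        return Action.EMERGENCY_BRAKE
    if not system_healthy:
        return Action.MINIMAL_RISK_STOP
    if collision_risk > risk_threshold:
        return Action.REJECT_PLAN
    return Action.FOLLOW_PLAN
```

The useful property is that the lower layers are rule-based and do not depend on the ML planner being correct, so a perception or planning failure still resolves to a defined safe action.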

Interviewer's Perspective

The candidate who designs for failure gets Strong Hire. "What happens when your perception model fails?" should have an answer: redundant sensors, safety monitors, emergency braking, and a minimal risk condition. The system should fail safe, not fail dangerous.

Step 6: Evaluation (8 min)

Testing Hierarchy

Level	What	Scale
Unit tests	Component-level (detector precision, predictor error)	Automated, thousands of scenarios
Simulation	Full-stack in simulated environments (CARLA, nuPlan)	Millions of simulated miles
Closed course	Real vehicle on test tracks	Thousands of scenarios
Public road	Safety driver present, driving within the operational design domain	Millions of real miles

Key Metrics

Metric	Definition	Target
Miles per disengagement	How often the safety driver takes over	> 10K miles
Miles per critical event	Near-miss or actual collision	> 1M miles
Pedestrian recall	% of pedestrians detected	> 99.99%
Trajectory ADE	Mean distance between predicted and actual positions, averaged over a 3-second horizon	< 1m
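Targets expressed in miles per event raise a follow-up question: how many miles are needed to demonstrate them? A rough sketch, assuming critical events follow a Poisson process with a constant rate: observing zero events over n miles bounds the rate at a given confidence level. This is the reasoning behind the "rule of three" (about 3 times the target distance at 95% confidence).

```python
import math


def miles_needed_zero_events(target_miles_per_event, confidence=0.95):
    """Event-free miles required to claim the true rate beats the target.

    Under a Poisson model, P(0 events in n miles) = exp(-n / miles_per_event).
    Requiring this to be at most (1 - confidence) gives the bound below.
    """
    return -math.log(1.0 - confidence) * target_miles_per_event


def miles_per_event(miles, events):
    """Observed miles per event; infinite if no events were recorded."""
    return miles / events if events else float("inf")


if __name__ == "__main__":
    needed = miles_needed_zero_events(1_000_000)
    print(f"{needed:,.0f} event-free miles for a 1M-mile target at 95%")
```

This is why simulation sits in the testing hierarchy: a 1M-mile target needs roughly 3M event-free miles of evidence, which is slow and costly to collect on public roads alone.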

Practice Problems

Problem 1: Handling Occlusion

Direction

A child runs out from behind a parked van. Your LiDAR and cameras can't see the child until they're 3 meters in front of the car. How does your system handle this?

Key Insight

This is a "phantom object" / occlusion reasoning problem. Solutions: (1) Reduce speed near parked vehicles and school zones - if you can't see, go slower. (2) Occupancy prediction: model what might be behind occluded areas. (3) Use radar (can detect motion through gaps). (4) Conservative planning: treat occluded regions as potentially containing pedestrians. The fundamental answer is: if you can't perceive, you must plan conservatively.
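"If you can't see, go slower" can be quantified. The sketch below solves for the highest speed at which the car can still stop within its visible distance, using the standard stopping-distance model d = v*t + v^2/(2a). The 0.1 s reaction time (matching the 100ms latency budget) and the 8 m/s^2 deceleration are assumed values for illustration.

```python
import math


def stopping_distance(speed_m_s, reaction_s, decel_m_s2):
    """Distance covered during reaction time plus braking to a halt."""
    return speed_m_s * reaction_s + speed_m_s ** 2 / (2.0 * decel_m_s2)


def max_safe_speed(sight_m, reaction_s=0.1, decel_m_s2=8.0):
    """Largest speed whose stopping distance fits inside the sight distance.

    Positive root of v^2 + 2*a*t*v - 2*a*d = 0.
    """
    at = decel_m_s2 * reaction_s
    return -at + math.sqrt(at * at + 2.0 * decel_m_s2 * sight_m)


if __name__ == "__main__":
    v = max_safe_speed(3.0)
    print(f"{v:.2f} m/s ({v * 3.6:.1f} km/h) with 3 m of visibility")
```

Under these assumptions, stopping within 3 meters requires a speed of roughly 22 km/h, which makes the case for speed limits tied to occlusion rather than to the posted limit.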

Problem 2: End-to-End vs. Modular

Direction

Tesla uses an end-to-end neural network that goes from camera pixels to driving commands. Waymo uses a modular stack (perception → prediction → planning). Discuss the trade-offs.

Key Insight

End-to-end: trains on driving demonstrations, can learn things hard to specify in rules, but is a black box (hard to debug, hard to certify safety). Modular: each component is testable and debuggable, but information is lost at interfaces between components, and errors compound. The industry trend is toward hybrid: modular architecture with learned components, and an end-to-end model as a "teacher" that provides additional training signal.

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Design AV ML stack"	Perception → Prediction → Planning	"Multi-sensor perception, multi-modal prediction, safety-constrained planning"
"Camera vs. LiDAR?"	Sensor trade-offs	"LiDAR gives reliable depth, camera gives semantics - fusion is strongest"
"How do you ensure safety?"	Redundant layers	"Safety monitor, emergency system, minimal risk condition - fail safe, not fail dangerous"
"How do you test?"	Testing hierarchy	"Simulation for scale, closed course for validation, public roads for final proof"

Spaced Repetition Checkpoints

Day 0: Draw the perception → prediction → planning pipeline from memory.
Day 3: Explain camera vs. LiDAR trade-offs. What's the argument for each?
Day 7: Design the safety architecture with all four layers. What triggers each?
Day 14: Explain multi-modal trajectory prediction. Why is a single prediction insufficient?
Day 21: Mock interview with follow-ups on occlusion, edge cases, and testing methodology.

What's Next

Visual Search - Embedding models and nearest neighbor search
AI Chatbot System - Another system where safety constraints are critical
