Design: Autonomous Driving ML - Perception, Prediction, and Planning
Reading time: ~25 min | Interview relevance: Medium-High | Roles: MLE (specialized)
The Real Interview Moment
"Design the ML system for a self-driving car." The scope is enormous. You start describing a CNN for object detection. The interviewer asks: "Your model detects a pedestrian at 50 meters. What happens next? How does the car decide what to do? What if the pedestrian is partially occluded? What if the model is 99.9% accurate but that means 1 missed detection per 1000 - at highway speed, that's fatal."
Autonomous driving is the system design question where safety constraints dominate everything. The strongest candidates design for failure - what happens when the model is wrong, not just when it's right.
What You Will Master
- The perception-prediction-planning stack
- Object detection and 3D perception (LiDAR + camera fusion)
- Behavior prediction for other road agents
- Motion planning under uncertainty
- Safety architecture: redundancy, fallbacks, operational design domain
- Real-time constraints: end-to-end latency <100ms
The Complete Design
Step 1: Requirements (5 min)
Functional requirements:
- Perceive the environment: detect vehicles, pedestrians, cyclists, lane markings, traffic signs
- Predict behavior of other agents (other cars, pedestrians)
- Plan a safe trajectory
- Execute the plan via vehicle control
Non-functional requirements:
- Latency: <100ms end-to-end (perception → plan)
- Safety: Mean distance between critical failures > 10M miles
- Operational design domain (ODD): Highway driving in clear weather (Level 4 scope)
- Sensor redundancy: System must function with any single sensor failure
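To see why the latency requirement matters, it helps to convert it into distance. The sketch below is only arithmetic; the 30 m/s highway speed is an assumed figure, not one from the requirements.

```python
def distance_during_latency(speed_mps: float, latency_s: float) -> float:
    """Distance the vehicle covers while one perception-to-plan cycle runs."""
    return speed_mps * latency_s


HIGHWAY_SPEED_MPS = 30.0  # assumption: roughly 108 km/h
LATENCY_BUDGET_S = 0.100  # the <100ms end-to-end requirement

blind_distance_m = distance_during_latency(HIGHWAY_SPEED_MPS, LATENCY_BUDGET_S)
print(f"{blind_distance_m:.1f} m travelled per perception-to-plan cycle")
```

Under these assumptions the car moves about 3 m before it can react to anything new, which is the number to quote when the interviewer asks why 100ms is the ceiling.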
Step 2: Problem Formulation (5 min)
| Component | ML Problem | Key Metric |
|---|---|---|
| Perception | 3D object detection + tracking | mAP, recall for pedestrians |
| Prediction | Trajectory forecasting | ADE (average displacement error) |
| Planning | Trajectory optimization | Safety score, comfort score |
Step 3: Perception Stack (8 min)
Sensor Fusion
| Sensor | Strengths | Weaknesses |
|---|---|---|
| Camera | Color, texture, traffic signs, lane markings | No depth, poor in darkness/glare |
| LiDAR | Precise 3D depth, works in dark | No color, sparse at distance, expensive |
| Radar | Velocity directly, works in rain/fog | Low resolution, no shape info |
Fusion approach: Early fusion (combine the raw point cloud with image features before detection) or late fusion (run a separate detector per sensor, then merge their outputs).
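A minimal sketch of late fusion, assuming both detectors already report positions in a shared bird's-eye-view frame. The `Detection` type, the 1.5 m matching radius, and the noisy-OR score merge are illustrative choices, not a specific production method.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    x: float      # metres ahead of the vehicle
    y: float      # metres to the left of the vehicle
    label: str
    score: float  # detector confidence in [0, 1]


def late_fuse(lidar: List[Detection], camera: List[Detection],
              max_dist_m: float = 1.5) -> List[Detection]:
    """Greedily pair same-class detections that lie close together."""
    fused = []
    unmatched_camera = list(camera)
    for ld in lidar:
        best, best_dist = None, max_dist_m
        for cd in unmatched_camera:
            dist = math.hypot(ld.x - cd.x, ld.y - cd.y)
            if cd.label == ld.label and dist <= best_dist:
                best, best_dist = cd, dist
        if best is None:
            fused.append(ld)  # LiDAR-only detection is kept
            continue
        unmatched_camera.remove(best)
        # Noisy-OR: agreement between sensors raises confidence.
        score = 1.0 - (1.0 - ld.score) * (1.0 - best.score)
        fused.append(Detection(ld.x, ld.y, ld.label, score))
    fused.extend(unmatched_camera)  # camera-only detections are kept too
    return fused


lidar_dets = [Detection(10.0, 0.0, "pedestrian", 0.6), Detection(30.0, 2.0, "car", 0.9)]
camera_dets = [Detection(10.5, 0.2, "pedestrian", 0.7), Detection(50.0, -3.0, "cyclist", 0.5)]
fused_dets = late_fuse(lidar_dets, camera_dets)
```

Keeping unmatched detections from either sensor favors recall over precision, which fits a setting where a missed pedestrian costs far more than a false alarm.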
Object Detection
| Approach | How It Works | Latency | Accuracy |
|---|---|---|---|
| PointPillars | Encode LiDAR points into pillars, 2D CNN | ~20ms | Good |
| CenterPoint | Center-based 3D detection on LiDAR | ~30ms | Very good |
| BEVFusion | Bird's-eye-view fusion of camera + LiDAR | ~50ms | Best |
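
The detector latencies above can be checked against the 100ms end-to-end budget. In this sketch the detector numbers come from the table, while the prediction and planning costs are assumed placeholders that would need to be measured on the target hardware.

```python
# Approximate detector latencies from the table above.
DETECTOR_LATENCY_MS = {"PointPillars": 20, "CenterPoint": 30, "BEVFusion": 50}

# Assumptions for illustration only: downstream stage costs.
PREDICTION_MS = 20
PLANNING_MS = 20
BUDGET_MS = 100  # end-to-end requirement


def headroom_ms(detector: str) -> int:
    """Budget left over after perception, prediction, and planning."""
    used = DETECTOR_LATENCY_MS[detector] + PREDICTION_MS + PLANNING_MS
    return BUDGET_MS - used


for name in DETECTOR_LATENCY_MS:
    print(f"{name}: {headroom_ms(name)} ms headroom")
```

With these placeholder costs the most accurate detector leaves the least slack for tracking, sensor I/O, and jitter, which is the accuracy-versus-latency trade-off to discuss.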
The key design trade-off: Camera-only (Tesla's approach) vs. LiDAR+Camera (Waymo, Cruise). Camera-only is cheaper at scale but makes safety targets harder to reach, because depth must be inferred rather than measured. LiDAR+Camera is more reliable but more expensive.
Don't focus only on detection accuracy. In autonomous driving, recall for vulnerable road users (pedestrians, cyclists) is more important than overall mAP. Missing a car at 50m is bad. Missing a child at 50m is catastrophic. Design your loss function to weight classes by safety impact.
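One simple way to weight classes by safety impact is a class-weighted cross-entropy. This is a sketch: the weight values are hypothetical and would be tuned against per-class recall in practice.

```python
import math
from typing import Mapping

# Hypothetical safety weights: errors on vulnerable road users cost more.
SAFETY_WEIGHT = {"car": 1.0, "cyclist": 3.0, "pedestrian": 5.0}


def weighted_cross_entropy(predicted_probs: Mapping[str, float],
                           true_class: str) -> float:
    """Cross-entropy for one object, scaled by the true class's safety weight."""
    p_true = max(predicted_probs[true_class], 1e-12)  # guard against log(0)
    return -SAFETY_WEIGHT[true_class] * math.log(p_true)


# The same uncertain prediction is penalized more when the object is a pedestrian.
uncertain = {"car": 0.4, "cyclist": 0.3, "pedestrian": 0.3}
loss_if_car = weighted_cross_entropy(uncertain, "car")
loss_if_pedestrian = weighted_cross_entropy(uncertain, "pedestrian")
```

Because the weight multiplies the gradient as well as the loss, the optimizer spends more of its capacity on the classes where a miss is least acceptable.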
Step 4: Prediction (5 min)
Behavior Prediction
Given detected agents, predict their future trajectories (next 3-8 seconds).
Approaches:
| Method | How It Works | Pro | Con |
|---|---|---|---|
| Physics-based | Constant velocity/acceleration | Simple, interpretable | Doesn't capture intent |
| Social LSTM | RNN encoding agent interactions | Captures social forces | Slow, limited context |
| Transformer-based | Attention over agent histories + map | Best accuracy, captures interactions | Compute-heavy |
Key insight: Predictions should be multi-modal - a car at an intersection might go straight, turn left, or turn right. Output multiple possible trajectories with probabilities, not a single prediction.
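The shape of a multi-modal output can be sketched with a physics-based rollout per maneuver. The mode set, yaw rates, and probabilities below are made-up placeholders; a learned predictor would produce both the trajectories and their probabilities.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def rollout(speed_mps: float, yaw_rate_rps: float,
            horizon_s: float = 3.0, dt_s: float = 0.5) -> List[Point]:
    """Constant-speed, constant-turn-rate trajectory starting at the origin."""
    x = y = heading = 0.0
    points = []
    for _ in range(int(round(horizon_s / dt_s))):
        heading += yaw_rate_rps * dt_s
        x += speed_mps * math.cos(heading) * dt_s
        y += speed_mps * math.sin(heading) * dt_s
        points.append((x, y))
    return points


# Hypothetical modes: name -> (yaw rate in rad/s, probability).
MODES = {"straight": (0.0, 0.60), "left": (0.3, 0.25), "right": (-0.3, 0.15)}

predictions: Dict[str, Tuple[float, List[Point]]] = {
    name: (prob, rollout(10.0, yaw_rate))
    for name, (yaw_rate, prob) in MODES.items()
}
```

The planner then consumes every mode, typically weighting collision risk by mode probability while still refusing to ignore a low-probability but dangerous one.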
Step 5: Planning & Safety (8 min)
Motion Planning
Trajectory scoring: Score = w1*Safety + w2*Comfort + w3*Progress
- Safety: Distance to predicted agent trajectories, time-to-collision
- Comfort: Jerk (rate of acceleration change), lateral acceleration
- Progress: Distance toward goal, speed relative to speed limit
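The scoring formula can be sketched as follows. The `Candidate` fields, the normalization constants, and the weights are all assumptions chosen for illustration; each term is scaled to [0, 1] so the weights are comparable.

```python
from dataclasses import dataclass

# Assumed normalization constants (illustrative only).
SAFE_GAP_M = 10.0       # gap at which the safety term saturates
MAX_JERK_MPS3 = 5.0     # jerk at which comfort drops to zero
MAX_PROGRESS_M = 100.0  # progress at which the progress term saturates


@dataclass
class Candidate:
    name: str
    min_gap_m: float      # closest approach to any predicted agent trajectory
    max_jerk_mps3: float  # peak jerk along the trajectory
    progress_m: float     # distance gained toward the goal


def score(c: Candidate, w_safety: float = 0.6,
          w_comfort: float = 0.1, w_progress: float = 0.3) -> float:
    """Score = w1*Safety + w2*Comfort + w3*Progress, each term in [0, 1]."""
    safety = min(c.min_gap_m / SAFE_GAP_M, 1.0)
    comfort = 1.0 - min(c.max_jerk_mps3 / MAX_JERK_MPS3, 1.0)
    progress = min(c.progress_m / MAX_PROGRESS_M, 1.0)
    return w_safety * safety + w_comfort * comfort + w_progress * progress


candidates = [
    Candidate("keep_lane", min_gap_m=8.0, max_jerk_mps3=1.0, progress_m=80.0),
    Candidate("squeeze_past", min_gap_m=1.0, max_jerk_mps3=4.0, progress_m=100.0),
]
best = max(candidates, key=score)
```

A weighted sum lets progress trade off against safety, so a sufficiently large progress weight could in principle pick a risky trajectory. That is one reason to enforce safety as a hard constraint outside the scorer as well.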
Safety Architecture
| Layer | What It Does | Example |
|---|---|---|
| Primary system | Full ML stack (perception → planning) | Normal driving |
| Safety monitor | Rule-based checks on planned trajectory | Reject trajectory if collision risk > threshold |
| Emergency system | Hardcoded emergency maneuvers | Emergency brake if an object is detected less than 5 m ahead |
| Minimal risk condition | Bring vehicle to safe stop | Pull over and stop if system degrades |
The candidate who designs for failure gets Strong Hire. "What happens when your perception model fails?" should have an answer: redundant sensors, safety monitors, emergency braking, and a minimal risk condition. The system should fail safe, not fail dangerous.
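A minimal sketch of how the four layers might be arbitrated each cycle. Only the 5 m emergency distance comes from the table; the function name, the risk threshold, the sensor count, and the priority order are assumptions.

```python
def select_action(planned_collision_risk: float,
                  nearest_object_m: float,
                  healthy_sensors: int,
                  required_sensors: int = 2,
                  risk_threshold: float = 0.01,
                  emergency_distance_m: float = 5.0) -> str:
    """Check the layers from most to least urgent and return the first that fires."""
    # Emergency system: hardcoded, independent of the ML stack's plan.
    if nearest_object_m < emergency_distance_m:
        return "EMERGENCY_BRAKE"
    # Minimal risk condition: too few healthy sensors to keep driving.
    if healthy_sensors < required_sensors:
        return "MINIMAL_RISK_STOP"
    # Safety monitor: rule-based veto on the planned trajectory.
    if planned_collision_risk > risk_threshold:
        return "REJECT_TRAJECTORY"
    # Primary system: the planned trajectory is accepted.
    return "FOLLOW_PLAN"
```

The property worth pointing out is that the checks are simple, rule-based, and do not depend on the primary model being right, so a perception or planning failure still ends in a safe state.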
Step 6: Evaluation (8 min)
Testing Hierarchy
| Level | What | Scale |
|---|---|---|
| Unit tests | Component-level (detector precision, predictor error) | Automated, thousands of scenarios |
| Simulation | Full-stack in simulated environments (CARLA, nuPlan) | Millions of simulated miles |
| Closed course | Real vehicle on test tracks | Thousands of scenarios |
| Public road | Safety driver present, within the operational design domain | Millions of real miles |
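
The reason simulation has to carry most of the mileage can be shown with a rough statistical bound. The sketch assumes critical events arrive as a Poisson process, so the chance of seeing zero events in m miles is exp(-m / target); the function name is hypothetical.

```python
import math


def event_free_miles_needed(target_miles_per_event: float,
                            confidence: float = 0.95) -> float:
    """Event-free miles needed to claim the true rate beats the target.

    Assumes a Poisson process: P(0 events in m miles) = exp(-m / target).
    Solving exp(-m / target) = 1 - confidence for m gives the bound.
    """
    return -math.log(1.0 - confidence) * target_miles_per_event


# Example target: at most one critical event per 1M miles.
miles = event_free_miles_needed(1_000_000)
print(f"{miles:,.0f} event-free miles at 95% confidence")
```

Under that assumption, showing a one-in-a-million-miles rate takes roughly three million miles without a single event, and any event along the way pushes the requirement higher. Public-road driving alone reaches that scale slowly.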
Key Metrics
| Metric | Definition | Target |
|---|---|---|
| Miles per disengagement | How often the safety driver takes over | > 10K miles |
| Miles per critical event | Near-miss or actual collision | > 1M miles |
| Pedestrian recall | % of pedestrians detected | > 99.99% |
| Trajectory ADE | Average displacement error over a 3-second prediction horizon | < 1m |
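
For the trajectory metric, ADE averages the point-wise error over the whole horizon, while the related final displacement error (FDE) looks only at the last point. A minimal sketch, assuming both trajectories are sampled at the same timestamps:

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def displacement_errors(pred: Sequence[Point], truth: Sequence[Point]) -> List[float]:
    """Euclidean distance between predicted and true position at each timestep."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and the same length")
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]


def ade(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Average displacement error over the horizon."""
    errors = displacement_errors(pred, truth)
    return sum(errors) / len(errors)


def fde(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Final displacement error: the error at the last timestep only."""
    return displacement_errors(pred, truth)[-1]


predicted = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
observed = [(1.0, 0.0), (2.0, 1.0), (3.0, 2.0)]
```

Here the per-step errors are 0, 1, and 2 m, so ADE is 1 m and FDE is 2 m. Reporting both shows whether the error is growing toward the end of the horizon.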
Practice Problems
Problem 1: Handling Occlusion
Direction
A child runs out from behind a parked van. Your LiDAR and cameras can't see the child until they're 3 meters in front of the car. How does your system handle this?
Key Insight
This is a "phantom object" / occlusion reasoning problem. Solutions: (1) Reduce speed near parked vehicles and school zones - if you can't see, go slower. (2) Occupancy prediction: model what might be behind occluded areas. (3) Use radar (can detect motion through gaps). (4) Conservative planning: treat occluded regions as potentially containing pedestrians. The fundamental answer is: if you can't perceive, you must plan conservatively.
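"If you can't see, go slower" can be made concrete: pick the highest speed at which the car can still stop within the distance it can see. The sketch below solves v*t + v^2/(2a) = d for v. The 8 m/s^2 braking deceleration and the 0.1 s reaction time are assumptions.

```python
import math


def stopping_distance_m(speed_mps: float, max_decel_mps2: float = 8.0,
                        reaction_time_s: float = 0.1) -> float:
    """Distance covered during the reaction time plus braking to a stop."""
    return speed_mps * reaction_time_s + speed_mps ** 2 / (2.0 * max_decel_mps2)


def max_safe_speed_mps(visible_distance_m: float, max_decel_mps2: float = 8.0,
                       reaction_time_s: float = 0.1) -> float:
    """Largest speed whose stopping distance fits in the visible distance.

    Positive root of v**2 / (2a) + v*t - d = 0.
    """
    a, t, d = max_decel_mps2, reaction_time_s, visible_distance_m
    return -a * t + math.sqrt((a * t) ** 2 + 2.0 * a * d)


# Child becomes visible 3 m ahead, as in the scenario above.
v_safe = max_safe_speed_mps(3.0)
print(f"{v_safe * 3.6:.0f} km/h keeps the stopping distance within 3 m")
```

With these assumed values the safe speed past the van is a little over 6 m/s, which gives the interviewer a number to anchor the "plan conservatively" argument.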
Problem 2: End-to-End vs. Modular
Direction
Tesla uses an end-to-end neural network that goes from camera pixels to driving commands. Waymo uses a modular stack (perception → prediction → planning). Discuss the trade-offs.
Key Insight
End-to-end: trains on driving demonstrations, can learn things hard to specify in rules, but is a black box (hard to debug, hard to certify safety). Modular: each component is testable and debuggable, but information is lost at interfaces between components, and errors compound. The industry trend is toward hybrid: modular architecture with learned components, and an end-to-end model as a "teacher" that provides additional training signal.
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Design AV ML stack" | Perception → Prediction → Planning | "Multi-sensor perception, multi-modal prediction, safety-constrained planning" |
| "Camera vs. LiDAR?" | Sensor trade-offs | "LiDAR gives reliable depth, camera gives semantics - fusion is strongest" |
| "How do you ensure safety?" | Redundant layers | "Safety monitor, emergency system, minimal risk condition - fail safe, not fail dangerous" |
| "How do you test?" | Testing hierarchy | "Simulation for scale, closed course for validation, public roads for final proof" |
Spaced Repetition Checkpoints
- Day 0: Draw the perception → prediction → planning pipeline from memory.
- Day 3: Explain camera vs. LiDAR trade-offs. What's the argument for each?
- Day 7: Design the safety architecture with all four layers. What triggers each?
- Day 14: Explain multi-modal trajectory prediction. Why is a single prediction insufficient?
- Day 21: Mock interview with follow-ups on occlusion, edge cases, and testing methodology.
What's Next
- Visual Search - Embedding models and nearest neighbor search
- AI Chatbot System - Another system where safety constraints are critical
