
Design: Autonomous Driving ML - Perception, Prediction, and Planning

Reading time: ~25 min | Interview relevance: Medium-High | Roles: MLE (specialized)

The Real Interview Moment

"Design the ML system for a self-driving car." The scope is enormous. You start describing a CNN for object detection. The interviewer asks: "Your model detects a pedestrian at 50 meters. What happens next? How does the car decide what to do? What if the pedestrian is partially occluded? What if the model is 99.9% accurate but that means 1 missed detection per 1000 - at highway speed, that's fatal."

Autonomous driving is the system design question where safety constraints dominate everything. The strongest candidates design for failure - what happens when the model is wrong, not just when it's right.

What You Will Master

  • The perception-prediction-planning stack
  • Object detection and 3D perception (LiDAR + camera fusion)
  • Behavior prediction for other road agents
  • Motion planning under uncertainty
  • Safety architecture: redundancy, fallbacks, operational design domain
  • Real-time constraints: end-to-end latency <100ms

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

  • Perceive the environment: detect vehicles, pedestrians, cyclists, lane markings, traffic signs
  • Predict behavior of other agents (other cars, pedestrians)
  • Plan a safe trajectory
  • Execute the plan via vehicle control

Non-functional requirements:

  • Latency: <100ms end-to-end (perception → plan)
  • Safety: Mean distance between critical failures > 10M miles
  • Operational design domain (ODD): Highway driving in clear weather (Level 4 scope)
  • Sensor redundancy: System must function with any single sensor failure
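
To make the 100ms budget concrete, here is a back-of-the-envelope sketch of how far the vehicle travels before a plan can reflect a new sensor frame. The function name and the example speeds are illustrative assumptions, not part of any specification.

```python
def blind_distance_m(speed_kmh: float, latency_ms: float) -> float:
    """Distance travelled (metres) while one frame is still being processed."""
    speed_ms = speed_kmh / 3.6                 # km/h -> m/s
    return speed_ms * (latency_ms / 1000.0)    # ms -> s

# Illustrative speeds (assumed): urban, highway, fast highway
for speed_kmh in (50, 100, 130):
    d = blind_distance_m(speed_kmh, latency_ms=100)
    print(f"{speed_kmh} km/h -> {d:.2f} m per 100 ms")
```

Because this distance grows linearly with speed, a highway ODD puts more pressure on the latency budget than low-speed driving does.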

Step 2: Problem Formulation (5 min)

Autonomous Driving Stack - Sensors → Perception → Prediction → Planning → Control

Component | ML Problem | Key Metric
Perception | 3D object detection + tracking | mAP, recall for pedestrians
Prediction | Trajectory forecasting | ADE (average displacement error)
Planning | Trajectory optimization | Safety score, comfort score

Step 3: Perception Stack (8 min)

Sensor Fusion

Sensor | Strengths | Weaknesses
Camera | Color, texture, traffic signs, lane markings | No depth, poor in darkness/glare
LiDAR | Precise 3D depth, works in dark | No color, sparse at distance, expensive
Radar | Velocity directly, works in rain/fog | Low resolution, no shape info

Fusion approach: Early fusion combines the raw point cloud with image features before detection. Late fusion runs a separate detector per sensor and merges the results at the output.
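
As a rough illustration of late fusion, the sketch below matches camera and LiDAR detections by distance in the ground plane, then keeps the LiDAR position and the camera label. The `Detection` type, the 2 m gate, and the noisy-OR score combination (which assumes independent sensor errors) are all assumptions for illustration. Production systems typically use proper data association and tracking.

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class Detection:          # hypothetical minimal detection record
    x: float              # metres ahead of the vehicle
    y: float              # metres to the left
    label: str
    score: float
    source: str

def late_fuse(camera, lidar, gate_m=2.0):
    """Greedy nearest-neighbour merge of two per-sensor detection lists."""
    fused, unmatched_cam = [], list(camera)
    for ld in lidar:
        best = min(unmatched_cam,
                   key=lambda c: hypot(c.x - ld.x, c.y - ld.y),
                   default=None)
        if best is not None and hypot(best.x - ld.x, best.y - ld.y) <= gate_m:
            unmatched_cam.remove(best)
            # Noisy-OR: assumes the two sensors err independently
            score = 1 - (1 - ld.score) * (1 - best.score)
            # LiDAR supplies geometry, camera supplies semantics
            fused.append(Detection(ld.x, ld.y, best.label, score, "fused"))
        else:
            fused.append(ld)           # LiDAR-only detection is kept
    fused.extend(unmatched_cam)        # camera-only detections are kept too
    return fused

camera = [Detection(20.5, 1.0, "pedestrian", 0.7, "camera"),
          Detection(60.0, -3.0, "car", 0.6, "camera")]
lidar = [Detection(20.0, 1.2, "unknown", 0.8, "lidar")]
for det in late_fuse(camera, lidar):
    print(det)
```

Keeping unmatched single-sensor detections rather than dropping them favors recall, which matters more than precision for vulnerable road users.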

Object Detection

Approach | How It Works | Latency | Accuracy
PointPillars | Encode LiDAR points into pillars, 2D CNN | ~20ms | Good
CenterPoint | Center-based 3D detection on LiDAR | ~30ms | Very good
BEVFusion | Bird's-eye-view fusion of camera + LiDAR | ~50ms | Best

The key design trade-off: Camera-only (Tesla's approach) vs. LiDAR+Camera (Waymo, Cruise). Camera-only is cheaper at scale but harder to achieve safety targets. LiDAR+Camera is more reliable but expensive.

Common Trap

Don't focus only on detection accuracy. In autonomous driving, recall for vulnerable road users (pedestrians, cyclists) is more important than overall mAP. Missing a car at 50m is bad. Missing a child at 50m is catastrophic. Design your loss function to weight classes by safety impact.
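
One simple way to weight classes by safety impact is a class-weighted cross-entropy term. The sketch below is a minimal pure-Python version. The weight values are hypothetical and would need tuning against recall targets, and real detectors usually combine this with box-regression losses.

```python
import math

# Hypothetical safety weights: misclassifying a pedestrian costs 5x a car
CLASS_WEIGHTS = {"car": 1.0, "cyclist": 3.0, "pedestrian": 5.0}

def weighted_cross_entropy(probs: dict, true_label: str) -> float:
    """Cross-entropy for one object, scaled by the true class's safety weight."""
    p = max(probs[true_label], 1e-12)   # guard against log(0)
    return -CLASS_WEIGHTS[true_label] * math.log(p)

# Same model confidence (0.6) in the true class, different penalty
car_loss = weighted_cross_entropy(
    {"car": 0.6, "cyclist": 0.2, "pedestrian": 0.2}, "car")
ped_loss = weighted_cross_entropy(
    {"car": 0.2, "cyclist": 0.2, "pedestrian": 0.6}, "pedestrian")
print(f"car loss: {car_loss:.3f}, pedestrian loss: {ped_loss:.3f}")
```

With these weights, an equally uncertain pedestrian prediction contributes five times the gradient signal of a car prediction.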

Step 4: Prediction (5 min)

Behavior Prediction

Given detected agents, predict their future trajectories (next 3-8 seconds).

Approaches:

Method | How It Works | Pro | Con
Physics-based | Constant velocity/acceleration | Simple, interpretable | Doesn't capture intent
Social LSTM | RNN encoding agent interactions | Captures social forces | Slow, limited context
Transformer-based | Attention over agent histories + map | Best accuracy, captures interactions | Compute-heavy

Key insight: Predictions should be multi-modal - a car at an intersection might go straight, turn left, or turn right. Output multiple possible trajectories with probabilities, not a single prediction.
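
The toy sketch below shows why a single prediction is not enough. It uses the constant-velocity baseline to generate two hypothetical modes at an intersection, then compares ADE for the most likely mode against the best-of-K variant (often called minADE). The velocities and probabilities are made-up illustration values, and the "left turn" is a crude straight-line stand-in.

```python
from math import hypot

def constant_velocity(pos, vel, horizon_s=3.0, dt=0.5):
    """Physics-based baseline: roll the current velocity forward in time."""
    steps = int(round(horizon_s / dt))
    return [(pos[0] + vel[0] * dt * k, pos[1] + vel[1] * dt * k)
            for k in range(1, steps + 1)]

def ade(pred, truth):
    """Average displacement error: mean L2 distance over all timesteps."""
    return sum(hypot(px - tx, py - ty)
               for (px, py), (tx, ty) in zip(pred, truth)) / len(truth)

def min_ade(modes, truth):
    """Best-of-K ADE: error of the predicted mode closest to the truth."""
    return min(ade(traj, truth) for traj, _prob in modes)

start = (0.0, 0.0)
modes = [
    (constant_velocity(start, (10.0, 0.0)), 0.6),  # go straight
    (constant_velocity(start, (7.0, 7.0)), 0.4),   # stand-in for a left turn
]
truth = constant_velocity(start, (7.0, 7.0))       # the agent actually turned

most_likely = max(modes, key=lambda m: m[1])[0]
print(f"single-mode ADE: {ade(most_likely, truth):.2f} m")
print(f"minADE over modes: {min_ade(modes, truth):.2f} m")
```

A planner that only sees the most likely mode would be badly wrong here, while one that receives both modes with probabilities can hedge against the turn.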

Step 5: Planning & Safety (8 min)

Motion Planning

Motion Planning Pipeline - Predicted Trajectories + Route Graph → Candidate Trajectory Generation → Score (Safety, Comfort, Progress) → Select Best

Trajectory scoring: Score = w1*Safety + w2*Comfort + w3*Progress

  • Safety: Distance to predicted agent trajectories, time-to-collision
  • Comfort: Jerk (rate of acceleration change), lateral acceleration
  • Progress: Distance toward goal, speed relative to speed limit
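
A minimal sketch of the scoring step, assuming each candidate trajectory has already been reduced to three summary features. The `Candidate` fields, the weights, and the normalization limits are hypothetical. The hard minimum-gap filter is an added design assumption, so that no amount of progress can buy back an unsafe trajectory.

```python
from dataclasses import dataclass

@dataclass
class Candidate:               # hypothetical summary of one trajectory
    name: str
    min_gap_m: float           # closest approach to any predicted agent path
    max_jerk: float            # m/s^3, lower is more comfortable
    progress_m: float          # distance gained toward the goal

W_SAFETY, W_COMFORT, W_PROGRESS = 0.6, 0.2, 0.2   # assumed weights

def clamp01(v: float) -> float:
    return max(0.0, min(1.0, v))

def score(c, safe_gap_m=5.0, jerk_limit=4.0, max_progress_m=90.0):
    """Score = w1*Safety + w2*Comfort + w3*Progress, each term in [0, 1]."""
    safety = clamp01(c.min_gap_m / safe_gap_m)
    comfort = 1.0 - clamp01(c.max_jerk / jerk_limit)
    progress = clamp01(c.progress_m / max_progress_m)
    return W_SAFETY * safety + W_COMFORT * comfort + W_PROGRESS * progress

def select_best(candidates, hard_min_gap_m=1.0):
    """Drop candidates violating the hard gap, then take the top score."""
    feasible = [c for c in candidates if c.min_gap_m >= hard_min_gap_m]
    return max(feasible, key=score) if feasible else None

candidates = [
    Candidate("keep_lane", min_gap_m=6.0, max_jerk=0.5, progress_m=80.0),
    Candidate("aggressive_pass", min_gap_m=0.8, max_jerk=3.0, progress_m=90.0),
    Candidate("brake", min_gap_m=8.0, max_jerk=2.5, progress_m=40.0),
]
best = select_best(candidates)
print(best.name if best else "no feasible trajectory")
```

Returning `None` when nothing is feasible gives the caller an explicit signal to hand control to a fallback layer.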

Safety Architecture

Layer | What It Does | Example
Primary system | Full ML stack (perception → planning) | Normal driving
Safety monitor | Rule-based checks on planned trajectory | Reject trajectory if collision risk > threshold
Emergency system | Hardcoded emergency maneuvers | Emergency brake if object detected < 5m
Minimal risk condition | Bring vehicle to safe stop | Pull over and stop if system degrades

Interviewer's Perspective

The candidate who designs for failure gets Strong Hire. "What happens when your perception model fails?" should have an answer: redundant sensors, safety monitors, emergency braking, and a minimal risk condition. The system should fail safe, not fail dangerous.
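
The layered fallbacks can be sketched as a small rule-based arbiter that runs after the ML planner. Everything here is an illustrative assumption: the function name, the thresholds, and the rule that more than one failed sensor triggers the minimal risk condition (chosen to match the single-sensor-failure requirement).

```python
from enum import Enum

class Action(Enum):
    FOLLOW_PLAN = "follow_plan"
    REJECT_AND_REPLAN = "reject_and_replan"
    EMERGENCY_BRAKE = "emergency_brake"
    MINIMAL_RISK_STOP = "minimal_risk_stop"

def arbitrate(nearest_obstacle_m, collision_risk, failed_sensors,
              risk_threshold=0.01, emergency_range_m=5.0):
    """Check the most severe condition first; fall through to normal driving."""
    # Emergency system: hardcoded, independent of the ML plan
    if nearest_obstacle_m < emergency_range_m:
        return Action.EMERGENCY_BRAKE
    # Minimal risk condition: beyond the tolerated single sensor failure
    if failed_sensors > 1:
        return Action.MINIMAL_RISK_STOP
    # Safety monitor: veto a planned trajectory that is too risky
    if collision_risk > risk_threshold:
        return Action.REJECT_AND_REPLAN
    # Primary system: plan passed all checks
    return Action.FOLLOW_PLAN

print(arbitrate(nearest_obstacle_m=40.0, collision_risk=0.001, failed_sensors=0))
print(arbitrate(nearest_obstacle_m=3.0, collision_risk=0.001, failed_sensors=0))
```

Keeping this logic rule-based and separate from the learned components means it can be reviewed and tested exhaustively.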

Step 6: Evaluation (8 min)

Testing Hierarchy

Level | What | Scale
Unit tests | Component-level (detector precision, predictor error) | Automated, thousands of scenarios
Simulation | Full-stack in simulated environments (CARLA, nuPlan) | Millions of simulated miles
Closed course | Real vehicle on test tracks | Thousands of scenarios
Public road | Safety driver present, operational design domain | Millions of real miles

Key Metrics

Metric | Definition | Target
Miles per disengagement | How often the safety driver takes over | > 10K miles
Miles per critical event | Near-miss or actual collision | > 1M miles
Pedestrian recall | % of pedestrians detected | > 99.99%
Trajectory ADE | Mean displacement error over a 3-second prediction horizon | < 1m
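
A small sketch of how these metrics might be computed from fleet logs, using made-up counts. The last function is a standard statistical aside: assuming events follow a Poisson process, it estimates how many event-free miles are needed to support a rate target at a given confidence. That is one reason real-world mileage requirements run into the millions.

```python
import math

def miles_per_event(miles: float, events: int) -> float:
    """Average miles driven per logged event (infinite if none occurred)."""
    return math.inf if events == 0 else miles / events

def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of ground-truth objects that were detected."""
    return true_positives / (true_positives + false_negatives)

def event_free_miles_needed(target_miles_per_event: float,
                            confidence: float = 0.95) -> float:
    """Miles with zero events needed to claim the target (Poisson assumption)."""
    return -math.log(1.0 - confidence) * target_miles_per_event

# Made-up log counts for illustration
print(f"miles/disengagement: {miles_per_event(250_000, 20):,.0f}")
print(f"pedestrian recall:   {recall(9_999, 1):.4%}")
print(f"event-free miles for 1M target: {event_free_miles_needed(1e6):,.0f}")
```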

Practice Problems

Problem 1: Handling Occlusion

Direction

A child runs out from behind a parked van. Your LiDAR and cameras can't see the child until they're 3 meters in front of the car. How does your system handle this?

Key Insight

This is a "phantom object" / occlusion reasoning problem. Solutions: (1) Reduce speed near parked vehicles and school zones - if you can't see, go slower. (2) Occupancy prediction: model what might be behind occluded areas. (3) Use radar (can detect motion through gaps). (4) Conservative planning: treat occluded regions as potentially containing pedestrians. The fundamental answer is: if you can't perceive, you must plan conservatively.
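
The "if you can't see, go slower" rule can be made quantitative. The sketch below solves v·t + v²/(2a) ≤ d for the highest speed at which the car can still stop within the visible distance d. The 0.1 s reaction time mirrors the 100ms latency budget, and the 6 m/s² deceleration is an assumed hard-braking value on dry pavement. Actuator lag and road conditions are ignored.

```python
import math

def stopping_distance_m(speed_ms, reaction_s=0.1, decel_ms2=6.0):
    """Distance covered during system reaction plus constant-deceleration braking."""
    return speed_ms * reaction_s + speed_ms ** 2 / (2.0 * decel_ms2)

def max_safe_speed_ms(visible_m, reaction_s=0.1, decel_ms2=6.0):
    """Largest v satisfying v*t + v^2/(2a) <= visible distance."""
    at = decel_ms2 * reaction_s
    return -at + math.sqrt(at * at + 2.0 * decel_ms2 * visible_m)

# Illustrative visible distances past an occluding vehicle
for visible_m in (3.0, 10.0, 30.0):
    v = max_safe_speed_ms(visible_m)
    print(f"visible {visible_m:>4.1f} m -> max {v:.1f} m/s ({v * 3.6:.0f} km/h)")
```

Under these assumptions, a 3-meter sightline only supports a speed of roughly 20 km/h, which is why speed reduction near parked vehicles is the first line of defense.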

Problem 2: End-to-End vs. Modular

Direction

Tesla uses an end-to-end neural network that goes from camera pixels to driving commands. Waymo uses a modular stack (perception → prediction → planning). Discuss the trade-offs.

Key Insight

End-to-end: trains on driving demonstrations, can learn things hard to specify in rules, but is a black box (hard to debug, hard to certify safety). Modular: each component is testable and debuggable, but information is lost at interfaces between components, and errors compound. The industry trend is toward hybrid: modular architecture with learned components, and an end-to-end model as a "teacher" that provides additional training signal.

Interview Cheat Sheet

Question Pattern | Framework | Key Phrases
"Design AV ML stack" | Perception → Prediction → Planning | "Multi-sensor perception, multi-modal prediction, safety-constrained planning"
"Camera vs. LiDAR?" | Sensor trade-offs | "LiDAR gives reliable depth, camera gives semantics - fusion is strongest"
"How do you ensure safety?" | Redundant layers | "Safety monitor, emergency system, minimal risk condition - fail safe, not fail dangerous"
"How do you test?" | Testing hierarchy | "Simulation for scale, closed course for validation, public roads for final proof"

Spaced Repetition Checkpoints

  • Day 0: Draw the perception → prediction → planning pipeline from memory.
  • Day 3: Explain camera vs. LiDAR trade-offs. What's the argument for each?
  • Day 7: Design the safety architecture with all four layers. What triggers each?
  • Day 14: Explain multi-modal trajectory prediction. Why is a single prediction insufficient?
  • Day 21: Mock interview with follow-ups on occlusion, edge cases, and testing methodology.

© 2026 EngineersOfAI. All rights reserved.