The ML System Design Framework
The Production Moment
It is 9:47 AM on a Tuesday. You are sitting in a room with a senior staff engineer, a principal scientist, and the hiring manager for the ML Platform team at a company that runs recommendations for 400 million users. The whiteboard is clean. They have given you one sentence: "Design a video recommendation system."
You have 45 minutes.
Most engineers immediately start drawing boxes. They sketch a "model" in the center, put "user" on the left, "recommendations" on the right, and begin listing features they would use - watch history, click-through rate, video duration. By minute ten they are deep in a debate with themselves about whether to use two-tower models or session-based transformers, while the interviewers exchange glances.
Here is what the experienced engineer does differently. They pick up the marker, write three words at the top of the board - Requirements. Scale. Design - and start asking questions. Not because they don't know about two-tower models. Because they know that the right architecture depends entirely on answers to questions they have not asked yet. Is this for a cold-start user or a returning user? Does the response need to be personalized in real time or can it be precomputed? What is the acceptable latency? What markets does this serve?
The framework is not a crutch. It is what separates engineers who build systems that work from engineers who build systems that look good on whiteboards but fall apart in production. The 4-step ML System Design Framework - Requirements, Scale Estimation, High-Level Design, Deep Dive - is how production ML systems actually get built. This lesson teaches you to use it.
Why This Framework Exists
Before structured ML system design methodology emerged, ML engineers designed systems the same way research scientists did: start with the model, add infrastructure as an afterthought. You would fine-tune BERT for a week, achieve 92% accuracy on your validation set, and then hand it to a platform team who discovered that inference took 800ms per request, you needed 40 GPU machines to serve peak traffic, and there was no monitoring plan.
The result was predictable. Beautiful models that never made it to production. Systems that worked in the lab but melted under real traffic. Feature pipelines that produced training data that looked nothing like what the model would see at inference time - the training-serving skew problem that Google's engineers identified as one of the top failure modes in production ML (Sculley et al., "Hidden Technical Debt in Machine Learning Systems," NeurIPS 2015).
The structured framework emerged from hard lessons at companies dealing with scale. Google's Site Reliability Engineering practices, Facebook's ML infrastructure retrospectives, and Netflix's engineering blog posts in the 2015–2019 period all converged on the same insight: ML systems are software systems first. They need requirements engineering, capacity planning, and architectural discipline - with an ML-specific layer on top.
The framework described here synthesizes practices from Google's ML system design interviews, Meta's internal design review process, and the "Designing Machine Learning Systems" methodology popularized by Chip Huyen (2022). It is not a theoretical construct - it is what production ML teams actually use.
The 4-Step Framework
Step 1: Requirements (5–8 minutes)
Before drawing a single box, ask questions. Your goal is to constrain the solution space from infinite to tractable.
Functional requirements define what the system must do. For a recommendation system:
- What is being recommended? (Videos, products, people to follow, ads)
- Who are the users? (Logged-in only? Guest users too?)
- How many recommendations per request?
- Does it need to filter already-seen content?
- Are there diversity or freshness constraints? (No two videos from same creator, nothing older than 30 days)
Non-functional requirements define how well it must do it:
- Latency: What is the maximum acceptable response time? (50ms? 200ms? 1 second?)
- Throughput: How many requests per second at peak?
- Availability: 99.9% (about 8.8 hours of downtime/year) vs 99.99% (about 53 minutes/year)?
- Consistency: Must a given user see identical recommendations across repeated requests and replicas, or is eventual consistency acceptable?
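The availability figures follow from simple arithmetic on the hours in a year. A minimal sketch, assuming a 365-day year; the function name is illustrative and not part of any library:

```python
# Allowed downtime per year for a given availability target.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Return the yearly downtime budget in hours for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


three_nines = downtime_hours_per_year(0.999)   # about 8.76 hours
four_nines = downtime_hours_per_year(0.9999)   # about 0.876 hours, roughly 52.6 minutes
```

Each extra nine shrinks the downtime budget by 10×, which is why the availability target should be agreed before any redundancy is designed.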
ML-specific requirements are the layer that traditional system design ignores:
- Accuracy: What metric? (Precision@K, NDCG, click-through rate, watch time)
- Freshness: How quickly do new videos become recommendable? Real-time? Within an hour?
- Fairness: Are there constraints on creator diversity or demographic fairness?
- Explainability: Do users or regulators need to understand why they got a recommendation?
- Online vs offline: Is this real-time inference or can recommendations be precomputed?
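To make the accuracy question concrete, here is a minimal sketch of two of the offline metrics named above. It assumes binary or graded relevance labels held in plain Python containers, and it uses the linear-gain form of DCG (some teams use an exponential gain instead). The function names and sample data are illustrative.

```python
import math


def precision_at_k(recommended: list, relevant: set, k: int) -> float:
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k


def ndcg_at_k(recommended: list, relevance: dict, k: int) -> float:
    """Normalized DCG with linear gain: sum of rel / log2(rank + 1), rank from 1."""
    def dcg(gains):
        # enumerate() is 0-based, so rank + 2 gives log2(2) for the first position
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    gains = [relevance.get(item, 0.0) for item in recommended[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


recommended = ["a", "b", "c", "d"]
p_at_2 = precision_at_k(recommended, {"a", "c"}, k=2)           # 0.5
ndcg_at_3 = ndcg_at_k(recommended, {"a": 1.0, "c": 1.0}, k=3)   # below 1.0: "c" is ranked third
```

Precision@K ignores ordering inside the top K, while NDCG rewards placing relevant items higher, so the choice between them is itself a requirement worth clarifying.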
:::tip Interview Strategy In interviews, spend the first 5–8 minutes asking questions and writing answers on the board. Do not start designing. Interviewers at Google and Meta explicitly state that candidates who jump to design without clarifying requirements fail the "requirements engineering" rubric - even if their design is technically sound. :::
Step 2: Scale Estimation (3–5 minutes)
Once you have requirements, estimate the scale. This determines your infrastructure choices.
The estimation framework follows a cascade: users → requests → data → compute → storage.
For our video recommendation system, a worked example:
- Daily Active Users (DAU): 50 million (stated in requirements)
- Requests per second: 50M DAU × 10 page loads/day / 86,400 seconds ≈ 5,800 QPS (peak: 3× ≈ 17,000 QPS)
- Candidate set size: 1 billion videos in the catalog
- Feature data per user: 1,000 user features × 8 bytes × 50M users = 400 GB of user features
- Training data: 50M users × 100 events/day × 365 days × 100 bytes/event = 182 TB/year
- Model size: 2-tower model with 100M parameters × 4 bytes = 400 MB per replica
These numbers tell you: you need a distributed feature store (400 GB of user features exceeds the RAM of a typical single server), you need aggressive caching to sustain a 17,000 QPS peak, and you need a distributed processing framework such as Spark or Flink to handle 182 TB/year of training data.
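The cascade is easy to keep honest if you write it down as arithmetic. The sketch below reproduces the worked example; the dataclass, its field names, and the 3× peak multiplier are assumptions taken from the numbers above, not a standard API.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class ScaleAssumptions:
    """Inputs from the worked example; every value is an assumption to confirm."""
    dau: int = 50_000_000
    requests_per_user_per_day: int = 10
    peak_multiplier: float = 3.0
    features_per_user: int = 1_000
    bytes_per_feature: int = 8
    events_per_user_per_day: int = 100
    bytes_per_event: int = 100
    model_parameters: int = 100_000_000
    bytes_per_parameter: int = 4  # float32


def estimate_scale(a: ScaleAssumptions) -> dict:
    """Walk the cascade: users -> requests -> data -> model size."""
    avg_qps = a.dau * a.requests_per_user_per_day / SECONDS_PER_DAY
    yearly_event_bytes = a.dau * a.events_per_user_per_day * 365 * a.bytes_per_event
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * a.peak_multiplier,
        "user_feature_gb": a.dau * a.features_per_user * a.bytes_per_feature / 1e9,
        "training_data_tb_per_year": yearly_event_bytes / 1e12,
        "model_mb": a.model_parameters * a.bytes_per_parameter / 1e6,
    }


estimate = estimate_scale(ScaleAssumptions())
# avg_qps is about 5,787 and peak_qps about 17,361; the text rounds these to 5,800 and 17,000
```

Changing one assumption (for example, 30 page loads per day instead of 10) and re-running shows immediately which architectural conclusions still hold.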
Step 3: High-Level Design (10–15 minutes)
Now you draw boxes. A production ML system almost always has two separate pipelines: training and serving.
The key insight is that training and serving have completely different requirements:
- Training cares about throughput: process all training data as fast as possible (batch, can take hours)
- Serving cares about latency: respond to each user request in milliseconds (online, must be fast)
This is why they are separate pipelines. A common architectural mistake is designing one pipeline that tries to satisfy both - it ends up satisfying neither.
Step 4: Deep Dive (15–20 minutes)
Pick the two or three most interesting or risky components and go deep. In a recommendation system, the interviewer likely wants to hear about:
- Retrieval (candidate generation): How do you go from 1 billion videos to 500 candidates in under 10ms? Answer: Approximate Nearest Neighbor (ANN) search with FAISS or ScaNN over learned embeddings.
- Feature freshness: How do you ensure the model sees up-to-date user behavior? Answer: streaming feature computation with Kafka and Flink, updating the online feature store in near-real-time.
- Training-serving skew: How do you ensure features computed at training time match features computed at serving time? Answer: point-in-time correct joins in the offline store, shared feature computation logic.
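"Shared feature computation logic" can be as simple as one function that both pipelines import. The sketch below is a hypothetical illustration: the feature, the function names, and the in-memory list of timestamps stand in for whatever your feature platform provides.

```python
from datetime import datetime, timedelta


def watch_count_in_window(watch_times: list, as_of: datetime, window_days: int = 30) -> int:
    """Count watch events in the window (as_of - window_days, as_of]."""
    window_start = as_of - timedelta(days=window_days)
    return sum(1 for t in watch_times if window_start < t <= as_of)


def build_training_feature(watch_times: list, label_event_time: datetime) -> int:
    # Offline path: evaluate the feature as of the labeled event's timestamp.
    return watch_count_in_window(watch_times, as_of=label_event_time)


def build_serving_feature(watch_times: list, request_time: datetime) -> int:
    # Online path: the same function, evaluated as of the request time.
    return watch_count_in_window(watch_times, as_of=request_time)


history = [datetime(2024, 1, 1), datetime(2024, 1, 20), datetime(2024, 2, 10)]
train_value = build_training_feature(history, datetime(2024, 1, 25))  # 2
serve_value = build_serving_feature(history, datetime(2024, 2, 25))   # 1
```

Because the window definition lives in exactly one place, a change to the window length or boundary handling cannot silently diverge between training and serving.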
The ML-Specific Layer
Traditional system design (databases, APIs, caches) has a well-established vocabulary. ML systems need an additional vocabulary layer.
Online vs Offline Inference
Offline (batch) inference precomputes predictions for all users and stores them. You query a database instead of running a model at request time.
Pros: Fast (database lookup), cheap (no GPU at serving time), simple. Cons: Staleness (recommendations do not react to recent behavior), storage cost (storing predictions for all users × items).
Use when: Recommendations where freshness does not matter much (weekly email digest), compute-expensive models you cannot afford to run online.
Online (real-time) inference runs the model at request time.
Pros: Fresh (sees recent user behavior), personalized, adapts to context. Cons: Latency pressure, GPU cost, model must be fast enough.
Use when: Search ranking, fraud detection, real-time personalization, anything where recency matters.
Near-real-time hybrid is what most production systems use: precompute candidates offline (heavy lifting), rank them online (fast forward pass on small candidate set).
# The canonical 2-stage recommendation pattern
class RecommendationSystem:
    def __init__(self, ann_index, ranking_model, feature_store):
        self.ann_index = ann_index          # FAISS or ScaNN
        self.ranking_model = ranking_model  # lightweight neural ranker
        self.feature_store = feature_store  # Redis for online features

    def get_recommendations(self, user_id: str, k: int = 20) -> list:
        # Stage 1: Retrieval (offline-built index, fast ANN search)
        # Get user embedding from online feature store (pre-computed);
        # assumed to be returned as a 1-D NumPy array
        user_embedding = self.feature_store.get(f"user_emb:{user_id}")
        # ANN search: 1B items -> 500 candidates in ~5ms
        # FAISS-style search returns (distances, ids) in that order;
        # adapt the unpacking if your ANN library differs
        _, candidate_ids = self.ann_index.search(
            user_embedding.reshape(1, -1),
            k=500
        )
        # Stage 2: Ranking (online model inference, 500 candidates)
        user_features = self.feature_store.get_user_features(user_id)
        item_features = self.feature_store.get_item_features_batch(
            candidate_ids[0].tolist()
        )
        # Score each candidate (fast forward pass)
        scores = self.ranking_model.predict(user_features, item_features)
        # Stage 3: Re-ranking (business rules, diversity)
        ranked = self._rerank(candidate_ids[0], scores)
        return ranked[:k]

    def _rerank(self, candidates, scores) -> list:
        """Apply business rules: diversity, freshness, filtering."""
        ranked = sorted(
            zip(candidates.tolist(), scores.tolist()),
            key=lambda x: x[1],
            reverse=True
        )
        # Deduplicate creators (max 2 videos per creator)
        seen_creators: dict = {}
        filtered = []
        for item_id, score in ranked:
            creator = self._get_creator(item_id)
            if seen_creators.get(creator, 0) < 2:
                filtered.append(item_id)
                seen_creators[creator] = seen_creators.get(creator, 0) + 1
        return filtered

    def _get_creator(self, item_id) -> str:
        return self.feature_store.get(f"item_creator:{item_id}") or "unknown"
Full Worked Example: Video Recommendation System
Let's walk through the complete framework for a YouTube-scale video recommendation system.
Requirements (stated):
- 50M DAU, 1B videos in catalog
- Recommendations served on homepage and end-of-video
- Latency: under 150ms end-to-end
- Freshness: new videos discoverable within 1 hour of upload
- Age-gated content must not appear for unverified users
Scale estimation:
- 50M DAU × 10 requests/day / 86,400s ≈ 5,800 QPS (peak ~17,000 QPS)
- With batching (batch size 64): 17,000/64 ≈ 265 GPU calls/second - manageable with ~10 GPU replicas
- Feature store: 50M users × 512-dim float32 embedding = 50M × 2KB = 100 GB of user embeddings (fits in Redis cluster)
- Item index: 1B items × 256-dim float32 = 1B × 1KB = 1 TB FAISS index (sharded across machines)
The latency budget decomposition - this is critical to discuss explicitly:
| Component | Budget |
|---|---|
| Network (client to load balancer) | 5ms |
| API gateway and auth | 5ms |
| Feature fetch (Redis) | 10ms |
| ANN retrieval (FAISS) | 15ms |
| Ranking model inference | 40ms |
| Re-ranking and business rules | 10ms |
| Response serialization | 5ms |
| Network (return) | 5ms |
| Total | 95ms (55ms headroom) |
Notice the explicit budget. Without it, every team adds their own 20ms and you end up at 500ms with nobody to blame.
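A budget like this is worth encoding so that it can be checked in review or CI. A minimal sketch, assuming the 150ms end-to-end target from the requirements; the component keys and the function name are illustrative:

```python
LATENCY_SLO_MS = 150  # end-to-end target from the stated requirements

budget_ms = {
    "network_in": 5,
    "api_gateway_auth": 5,
    "feature_fetch": 10,
    "ann_retrieval": 15,
    "ranking_inference": 40,
    "reranking_rules": 10,
    "serialization": 5,
    "network_out": 5,
}


def check_budget(budget: dict, slo_ms: int) -> int:
    """Return remaining headroom in ms; raise if the components exceed the SLO."""
    total = sum(budget.values())
    if total > slo_ms:
        raise ValueError(f"budget {total}ms exceeds SLO {slo_ms}ms")
    return slo_ms - total


headroom_ms = check_budget(budget_ms, LATENCY_SLO_MS)  # 150 - 95 = 55
```

When a team asks for another 20ms, the request becomes a visible change to one line of the budget rather than an untracked regression.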
Deep dive - point-in-time correct training data:
The most subtle correctness issue is temporal leakage. If you train with features computed at the wrong time, you get a model that looks good in offline evaluation but fails in production.
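The core operation behind a point-in-time correct join is an as-of lookup: for each training event, take the latest feature snapshot computed at or before the event. A minimal in-memory sketch, assuming snapshots are sorted by timestamp; the function name, the integer timestamps, and the sample history are hypothetical:

```python
from bisect import bisect_right


def as_of_lookup(snapshots: list, event_ts: int):
    """Return the latest snapshot value with computed_at <= event_ts, else None.

    `snapshots` is a list of (computed_at, value) tuples sorted by computed_at.
    """
    timestamps = [ts for ts, _ in snapshots]
    idx = bisect_right(timestamps, event_ts)
    return snapshots[idx - 1][1] if idx > 0 else None


# Hypothetical feature history for one user: (computed_at, watch_count_30d)
feature_history = [(100, 3), (200, 7), (300, 12)]

value_at_250 = as_of_lookup(feature_history, 250)  # 7: the snapshot from t=200
value_at_50 = as_of_lookup(feature_history, 50)    # None: no snapshot existed yet
leaky_value = feature_history[-1][1]               # 12: what a naive "latest value" join would use
```

The gap between 7 and 12 for the event at t=250 is exactly the future information that leaks when training rows are joined against current feature values.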
# WRONG: Using current feature values for historical events
# This leaks future information into training - causes offline/online gap
wrong_query = """
SELECT
    events.user_id,
    events.item_id,
    events.label,
    -- BUG: user_features was computed TODAY, not at event time
    user_features.watch_count_30d
FROM events
JOIN user_features ON events.user_id = user_features.user_id
"""
# CORRECT: Point-in-time join using feature history
# The feature store must retain historical snapshots for this to work
# ASOF JOIN is dialect-specific (supported by e.g. DuckDB and ClickHouse)
correct_query = """
SELECT
    e.user_id,
    e.item_id,
    e.label,
    uf.watch_count_30d
FROM events e
-- Uses feature values as they existed AT event time
ASOF JOIN user_feature_history uf
    ON e.user_id = uf.user_id
    AND uf.computed_at <= e.event_timestamp
"""
Common ML System Design Questions
The framework applies to every ML system design question. Here are the five most common with the key insight for each.
Video/Content Recommendation (YouTube, Netflix, TikTok): Two-stage pipeline - retrieval then ranking. Item embedding index built offline. User embeddings updated near-real-time. Diversity constraints in re-ranking.
Search Ranking (Google, LinkedIn, Amazon): Query understanding, then candidate retrieval, then ranking. Heavy signals (neural ranking) reserved for top-K candidates. Latency budget is tight (100–150ms end-to-end).
Fraud Detection (Stripe, PayPal, banks): Must be online (real-time), low latency (sub-50ms), high precision (false positives kill UX), needs explainability (regulatory). Streaming features critical.
Content Moderation (Facebook, Twitter, YouTube): Two-stage: automated (fast, recall-focused) then human review queue. Adversarial robustness matters. Feedback loops from human labels.
Ad Click Prediction (Facebook Ads, Google Ads): Extreme scale (tens of billions of predictions/day). Feature hashing for high-cardinality categoricals. Calibrated probability outputs required.
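Of the techniques named above, feature hashing is the easiest to show in a few lines. The sketch below maps high-cardinality categorical values into a fixed number of buckets using a stable hash. The bucket count, the key format, and the example feature names are assumptions; Python's built-in `hash()` is avoided because it is salted per process for strings.

```python
import hashlib


def hash_feature(name: str, value: str, num_buckets: int = 2**20) -> int:
    """Map a (feature name, value) pair to a stable bucket index."""
    key = f"{name}={value}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an unsigned integer, then fold into the bucket range.
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(raw: dict, num_buckets: int = 2**20) -> list:
    """Return the sorted, de-duplicated bucket indices for a dict of categorical features."""
    return sorted({hash_feature(k, v, num_buckets) for k, v in raw.items()})


indices = hash_features({"ad_id": "ad_98431", "publisher": "news.example", "device": "ios"})
```

The model's input dimension stays fixed at `num_buckets` no matter how many new ad IDs appear; the cost is occasional collisions, which is the trade-off to raise in a design discussion.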
Production Engineering Notes
Training Pipeline Monitoring
Every production training pipeline needs these monitoring hooks:
from dataclasses import dataclass


@dataclass
class TrainingRunMetrics:
    run_id: str
    dataset_size: int
    training_loss_final: float
    validation_metric: float
    training_duration_seconds: float
    features_used: list
    data_freshness_hours: float  # how old is the training data?


class TrainingPipelineMonitor:
    def __init__(self, metrics_client):
        self.metrics = metrics_client

    def pre_training_checks(self, dataset_path: str) -> bool:
        """Validate training data before expensive training run."""
        # Check data freshness
        data_age_hours = self._get_data_age(dataset_path)
        if data_age_hours > 48:
            self.metrics.emit_alert(
                "training_data_stale",
                {"age_hours": data_age_hours, "threshold": 48}
            )
            return False
        # Check dataset size hasn't dropped dramatically
        current_size = self._get_dataset_size(dataset_path)
        previous_size = self.metrics.get_previous_run_dataset_size()
        if previous_size and current_size < previous_size * 0.8:
            self.metrics.emit_alert(
                "dataset_size_regression",
                {"current": current_size, "previous": previous_size}
            )
            return False
        return True

    def _get_data_age(self, path: str) -> float:
        """Return age of dataset in hours."""
        # Placeholder: implementation depends on storage backend
        return 0.0

    def _get_dataset_size(self, path: str) -> int:
        """Return number of rows in dataset."""
        # Placeholder: implementation depends on storage backend
        return 0
Model Registry Integration
Every model trained in production should be registered with metadata for reproducibility:
import mlflow
import mlflow.pytorch  # PyTorch flavor; assumes `model` is a torch.nn.Module


def register_trained_model(
    model,
    metrics: TrainingRunMetrics,
    model_name: str,
    feature_config: dict
) -> str:
    """Register model with full reproducibility metadata."""
    with mlflow.start_run() as run:
        # Log training metrics
        mlflow.log_metric("val_ndcg", metrics.validation_metric)
        mlflow.log_metric("training_loss", metrics.training_loss_final)
        mlflow.log_metric("dataset_size", metrics.dataset_size)
        # Log configuration for reproducibility
        mlflow.log_params({
            "features": str(metrics.features_used),
            "data_freshness_hours": metrics.data_freshness_hours,
        })
        mlflow.log_dict(feature_config, "feature_config.json")
        # Log the model
        mlflow.pytorch.log_model(model, "model")
        # Register in Model Registry
        model_uri = f"runs:/{run.info.run_id}/model"
        mv = mlflow.register_model(model_uri, model_name)
        return mv.version
:::warning The Model-First Trap The single most common mistake in ML system design is leading with model choice: "I would use a transformer-based two-tower model with..." The interviewer does not care about the model choice until you have established: what latency budget you are working with, how much training data exists, and whether online or offline inference is appropriate. Model choice is a consequence of system requirements, not a starting point. :::
:::danger Designing Without Numbers A design without numbers is not a design - it is a suggestion. Every architectural claim you make should be backed by an estimate. "We need a distributed feature store" needs to be followed by "because our user feature data is 400 GB, which will not fit on a single machine." Unmotivated design decisions are a red flag in both interviews and real engineering reviews. :::
What Interviewers Actually Look For
Based on published engineering blogs and interview debrief posts from Google, Meta, and Netflix, here is what distinguishes candidates who pass ML system design:
Top performers:
- Ask 5–8 clarifying questions before drawing anything
- Identify the training pipeline AND serving pipeline as separate concerns
- Proactively mention training-serving skew
- Know that scale estimation drives architecture choices
- Discuss failure modes and monitoring
- Make explicit trade-offs ("we could use X, but because of Y constraint, Z is better")
Common failure modes:
- Starting with model architecture before understanding requirements
- Designing only the happy path with no failure handling
- Ignoring the data pipeline entirely (how does training data get created?)
- Not discussing monitoring (how do you know if the system is working?)
- Using the same architecture for wildly different scale requirements
The System Design Template
Use this template for any ML system design question:
## Requirements
### Functional
- [List what the system must do]
### Non-functional
- Latency: [Xms at pYY]
- Throughput: [X QPS]
- Availability: [X nines]
### ML-specific
- Metric: [what you are optimizing]
- Freshness: [how stale is acceptable]
- Online vs offline: [which and why]
## Scale Estimation
- DAU: X million
- Peak QPS: X
- Data size: X TB/month
- Model size: X GB
## High-Level Design
[Draw training pipeline and serving pipeline separately]
## Deep Dive: [Component 1]
[Detailed design, trade-offs, failure modes]
## Deep Dive: [Component 2]
[Detailed design, trade-offs, failure modes]
## Monitoring
- Business metrics: [CTR, engagement, conversion]
- ML metrics: [online accuracy, prediction distribution]
- Infrastructure metrics: [latency, throughput, error rate]
- Data quality: [freshness, completeness, drift]
Interview Q&A
Q1: What is the difference between functional and non-functional requirements for an ML system, and why does the distinction matter?
Functional requirements define what the system does - the capabilities it must expose. For a recommendation system: "given a user ID, return 20 ranked video recommendations." Non-functional requirements define how well it does it - latency, throughput, availability, accuracy.
The distinction matters because they drive different architectural decisions. Functional requirements determine what components you need. Non-functional requirements determine how you build those components. A recommendation system that requires 50ms latency needs ANN search and an in-memory feature store. One that allows 2-second latency can use a simpler architecture. ML systems have a third category - ML-specific requirements (freshness, fairness, explainability) - that do not exist in traditional software and often get forgotten entirely.
Q2: What is training-serving skew and why is it so dangerous?
Training-serving skew is when the features computed during model training do not match the features computed during serving, even for the same input. It is dangerous because it silently degrades model performance without any obvious error.
Classic example: you train with "user_watch_count_30d" computed as of training time. In production, you compute the same feature differently - using a different time window, a different aggregation, or data from a different source. The model was optimized for one distribution; it receives another. Offline metrics look great; production performance is poor.
Prevention requires: shared feature computation code between training and serving, point-in-time correct feature retrieval for training data, and monitoring feature distributions at serving time against the training distribution.
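For the monitoring part, one commonly used drift score is the Population Stability Index (PSI), which compares binned feature distributions. The sketch below is one possible implementation using only the standard library; the bin count, the floor for empty buckets, and the sample data are assumptions, and PSI is only one of several reasonable drift measures.

```python
import math


def population_stability_index(expected: list, actual: list, bins: int = 10) -> float:
    """PSI between a training sample (expected) and a serving sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all expected values are equal

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clip out-of-range values into edge bins
        # A small floor avoids log(0) when a bucket is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e = bucket_fractions(expected)
    a = bucket_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


training_values = [float(v) for v in range(100)]
serving_values = [v + 30.0 for v in training_values]  # simulated upward shift

psi_same = population_stability_index(training_values, training_values)    # 0.0
psi_shifted = population_stability_index(training_values, serving_values)  # clearly above 0
```

A frequently cited rule of thumb treats PSI above roughly 0.25 as a significant shift, but the alert threshold should be tuned per feature against historical variation.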
Q3: When would you use offline inference vs online inference?
Use offline (batch) inference when: recommendations do not need to react to very recent behavior, the model is computationally expensive, or you need to serve very high QPS with low infrastructure cost. Example: weekly "recommended for you" email digest - precompute recommendations for all users nightly, store in a database, and retrieval is just a database lookup.
Use online (real-time) inference when: recency matters (a user just watched 5 videos and their next recommendations should reflect that), context matters (different recommendations in the morning vs evening), or the user's session behavior should influence the current request.
Most production systems use a hybrid: heavy retrieval done offline (precomputed item embeddings, ANN index built daily), while ranking is done online (lightweight model runs in real-time using fresh session features).
Q4: How would you estimate the GPU infrastructure needed to serve a recommendation model at 10,000 QPS?
Work through the math explicitly. First, measure model forward pass latency on target hardware - say your ranking model takes 2ms per forward pass on an A100 GPU. Capacity per GPU: 1000ms / 2ms = 500 requests per second. Without batching: 10,000 QPS / 500 = 20 A100 GPUs, plus a 3× safety factor for traffic spikes = 60 GPUs.
With batching (batch size 32): suppose a forward pass takes ~5ms total for 32 requests. One GPU then handles (32 requests / 5ms) × 1,000ms/s = 6,400 requests/second. Now you need 10,000 / 6,400 ≈ 1.6, rounded up to 2 GPUs, and with the same 3× safety factor, 6 GPUs. Batching cuts the GPU count by 10× in this example (60 down to 6) - this is why GPU utilization optimization is critical for ML serving economics.
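The same estimate as a small function, so the inputs can be swapped for measured values. This is a back-of-the-envelope sketch: the function name and the 3× safety factor are taken from the example above, and it assumes batches are always full and ignores queueing delay.

```python
import math


def gpus_needed(target_qps: float, batch_latency_ms: float,
                batch_size: int = 1, safety_factor: float = 3.0) -> int:
    """Estimate GPU count from per-batch latency, batch size, and a safety factor."""
    per_gpu_qps = batch_size * 1000.0 / batch_latency_ms
    base = math.ceil(target_qps / per_gpu_qps)  # whole GPUs needed at steady state
    return math.ceil(base * safety_factor)


unbatched = gpus_needed(10_000, batch_latency_ms=2.0)               # 20 GPUs * 3 = 60
batched = gpus_needed(10_000, batch_latency_ms=5.0, batch_size=32)  # 2 GPUs * 3 = 6
```

In practice the batch latency should come from a benchmark on the target hardware, since waiting to fill a batch also adds to per-request latency.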
Q5: What does a good monitoring strategy look like for an ML system?
ML systems need monitoring at three layers. The infrastructure layer covers CPU/GPU utilization, memory, request latency (p50/p95/p99), error rates, and cache hit rates - standard SRE-style monitoring. The ML pipeline layer covers training job success/failure, data pipeline freshness, and model training metrics over time. The business and ML health layer covers online metrics (click-through rate, watch time, conversion rate) with statistical significance testing, plus feature distribution drift and prediction distribution drift.
The most important operational distinction is separating model degradation from data pipeline failure. Both cause business metrics to drop, but the fixes are completely different. Separate alerting for each layer enables fast diagnosis.
Summary
The ML System Design Framework is a discipline, not a template. The four steps - Requirements, Scale Estimation, High-Level Design, Deep Dive - give you structure when facing an open-ended problem. The ML-specific layer - online vs offline, training pipeline vs serving pipeline, training-serving skew - is the vocabulary that separates ML engineers from software engineers doing ML work.
The most important habit to build: always ask why before what. Why does this system need to exist? Why does it need to run in 50ms? Why do you need a feature store instead of just querying the database? The answers to "why" determine everything about the design.
:::tip Key Takeaway In any ML system design situation - interview or real project - spend roughly the first 15–20% of your time on requirements before drawing a single box. The most expensive mistakes in ML system design happen in the first five minutes, when people assume requirements instead of asking. :::
