ML System Design Framework - Your 45-Minute Playbook

Reading time: ~20 min | Interview relevance: Critical | Roles: MLE, AI Eng, MLOps

The Framework: RPFMSE

Every ML system design answer should follow these 6 steps. This structure ensures you cover everything interviewers evaluate - and prevents the most common failure mode (spending 30 minutes on the model and 0 minutes on serving and evaluation).

RPFMSE Framework - 6 steps: Requirements, Problem Formulation, Features, Model, Serving, Evaluation

Step 1: Requirements (5 min)

Goal: Clarify the problem before designing anything.

Questions to Ask

Functional requirements:

What is the core user experience? What does the user see/do?
What inputs does the system receive? What outputs does it produce?
What scale are we designing for? (users, requests/sec, data volume)

Non-functional requirements:

What's the latency budget? (real-time: <100ms, near-real-time: <1s, batch: hours)
What's the accuracy/quality bar? (Do we need 99.9% precision, or is 80% recall fine?)
What's the cost budget? (GPU serving at scale is expensive)
Is there existing data? Labels? Infrastructure?

BAD: Skip requirements and start drawing boxes.

GOOD: "Before I design, let me make sure I understand the constraints. We're building a recommendation system for an e-commerce platform with 10M users, 1M products, and we need to serve recommendations in under 200ms. Are there specific business objectives - increasing revenue, engagement, or both? And do we have historical user-item interaction data to start with?"
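The scale numbers in an answer like this can be turned into a rough load estimate before any design work. The sketch below is a back-of-envelope calculation only: the 10M users figure comes from the example above, while the daily-active fraction, requests per user, and peak-to-average ratio are made-up assumptions you would confirm with the interviewer.

```python
def estimate_peak_qps(total_users, daily_active_fraction,
                      requests_per_user_per_day, peak_to_average):
    """Rough peak requests/sec derived from user-level assumptions."""
    daily_requests = total_users * daily_active_fraction * requests_per_user_per_day
    average_qps = daily_requests / 86_400  # seconds in a day
    return average_qps * peak_to_average


# Only total_users is taken from the example; the rest are illustrative guesses.
peak_qps = estimate_peak_qps(
    total_users=10_000_000,
    daily_active_fraction=0.2,       # assumption: 20% of users active per day
    requests_per_user_per_day=10,    # assumption
    peak_to_average=3.0,             # assumption: peak traffic is 3x the average
)
print(f"peak QPS ~ {peak_qps:.0f}")  # roughly 694 under these assumptions
```

A number in the hundreds of requests per second, combined with the 200ms budget, is what later drives choices such as caching and replica count.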

Interviewer's Perspective

The requirements phase is where I assess seniority. Junior candidates skip it entirely. Mid-level candidates ask basic questions. Senior candidates ask the questions that change the design - like "Is this real-time or batch?" or "Do we optimize for revenue or engagement?" These questions show you've built real systems and know what matters.

Step 2: Problem Formulation (5 min)

Goal: Translate the business problem into an ML problem.

The Translation

Business Goal	ML Objective	Metric
"Increase purchases"	Predict P(purchase | user, item)	Conversion rate, revenue per session
"Reduce fraud"	Binary classification: fraud vs. legit	Precision @ low FPR, dollar amount saved
"Show relevant search results"	Learning-to-rank: order results by relevance	NDCG, MRR
"Filter harmful content"	Multi-label classification: toxicity categories	Recall (catch harmful) + Precision (don't over-block)
"Answer customer questions"	RAG + generation: retrieve context, generate answer	Answer accuracy, user satisfaction, resolution rate
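For the ranking row, it helps to be able to write down what NDCG and MRR actually compute. The following is a minimal sketch, assuming graded relevance labels listed in ranked order and a linear gain (some definitions use 2^rel - 1 instead); the function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(ranked_lists):
    """Average of 1/rank of the first relevant item in each ranked list."""
    total = 0.0
    for relevances in ranked_lists:
        for rank, rel in enumerate(relevances, start=1):
            if rel > 0:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(ranked_lists) if ranked_lists else 0.0
```

NDCG rewards putting the most relevant items near the top of the whole list, while MRR only cares about where the first relevant item appears, which is why MRR suits "one right answer" search and NDCG suits graded results.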

Key Decisions at This Stage

What type of ML problem? Classification, regression, ranking, generation, retrieval?
What's the prediction target? P(click), P(fraud), relevance score, generated text?
What's the north star metric? One primary metric + 2-3 guardrail metrics.
Offline vs. online? Can we precompute predictions in batch, or do we need real-time inference?

Step 3: Features & Data (8 min)

Goal: Identify data sources, engineer features, and handle labels.

Feature Categories

For most ML systems, features fall into these categories:

Category	Examples	Freshness
User features	Demographics, history, preferences, engagement patterns	Updated hourly-daily
Item features	Category, price, description embeddings, popularity	Updated on change
Context features	Time of day, device, location, session behavior	Real-time
Cross features	User-item affinity, user-category preference, co-occurrence	Computed batch or real-time

Data Considerations

Label availability: Do we have ground truth? How is it collected? What's the label delay?
Class imbalance: What's the positive rate? (fraud: 0.1%, clicks: 3%, purchases: 1%)
Data freshness: How often does the data distribution change?
Data quality: Missing values, duplicates, noise, adversarial data
Training data construction: How do you avoid leakage? Point-in-time correctness?
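Point-in-time correctness means each training row only sees feature values that existed before the label event. A minimal sketch of that lookup, assuming each feature has a history of (timestamp, value) pairs sorted by timestamp; the helper name and data layout are hypothetical, and a production feature store would do the same join at scale.

```python
from bisect import bisect_left


def point_in_time_value(feature_history, label_time):
    """Return the latest feature value recorded strictly before label_time.

    feature_history: list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet, which avoids leaking the future.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_left(timestamps, label_time)  # number of entries with ts < label_time
    return feature_history[idx - 1][1] if idx > 0 else None


# Illustrative history: average order value recorded at times 1, 5, and 9.
history = [(1, 10.0), (5, 12.5), (9, 40.0)]
print(point_in_time_value(history, 6))  # 12.5, the value known at time 6
```

Joining on the latest value overall (40.0 here) would leak information from after the label event, which is the most common source of offline metrics that fail to reproduce online.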

Common Trap

Many candidates list features without thinking about serving. "Average purchase amount over the last 30 days" is easy in batch SQL but requires a streaming aggregation pipeline for real-time. Always ask: "Can I compute this feature at serving time within my latency budget?"
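To make the gap between batch and serving concrete, here is a minimal sketch of the state a real-time version of that 30-day average has to maintain. It assumes events arrive in timestamp order and fit in memory for one user; the class name is hypothetical, and a real system would typically push this into a streaming engine or feature store.

```python
from collections import deque


class RollingAverage:
    """Average of amounts seen within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, amount), oldest first
        self.total = 0.0

    def add(self, timestamp, amount):
        # Assumption: timestamps are non-decreasing across calls.
        self.events.append((timestamp, amount))
        self.total += amount

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, amount = self.events.popleft()
            self.total -= amount

    def value(self, now):
        """Current windowed average; 0.0 when the window is empty."""
        self._evict(now)
        return self.total / len(self.events) if self.events else 0.0
```

The serving-time read is cheap, but the system now has to keep per-user state up to date on every purchase event, which is the cost the batch SQL version hides.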

Step 4: Model (8 min)

Goal: Start simple, iterate toward complexity with justification.

The Progression

Model Progression - Baseline (Rules/LR) → Gradient Boosting → Deep Learning → Hybrid/Ensemble

Always start with a baseline. This shows engineering judgment and gives the interviewer confidence you won't over-engineer.

Problem Type	Baseline	Strong Model	Why Start Simple
Classification	Logistic Regression	XGBoost → Neural Network	Interpretable, fast, sets benchmark
Ranking	Pointwise LR	Pairwise (LambdaMART) → Listwise (Deep Ranking)	Understand feature importance first
Retrieval	TF-IDF or BM25	Two-tower embedding model	Fast, no training needed
Generation	Template-based	LLM with RAG	Reliable, deterministic
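To show how little a baseline needs, the classification row can be sketched as a logistic regression trained with plain batch gradient descent. This is an illustrative, dependency-free version with hypothetical function names and untuned hyperparameters; in practice you would reach for a library implementation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    """Batch gradient descent on mean log loss. X: list of feature lists, y: 0/1 labels."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(weights, bias, xi) - yi
            for j, value in enumerate(xi):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias
```

The learned weights double as a first look at feature importance, and the offline metric this model reaches becomes the number every more complex model has to beat.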

What to Cover

Architecture: What model and why (for this specific problem)?
Training: How do you train? Data splits, hyperparameter tuning, training infrastructure.
Offline evaluation: What metrics, on what holdout set?
Trade-offs: Why this model over alternatives? What did you sacrifice?

Step 5: Serving (8 min)

Goal: Get the model into production reliably.

Key Decisions

Decision	Options	Trade-offs
Real-time vs. batch	Real-time: per-request predictions. Batch: pre-compute, cache.	Latency vs. freshness vs. cost
Model format	PyTorch, ONNX, TensorRT	Flexibility vs. inference speed
Infrastructure	GPU vs. CPU	Cost vs. latency
Scaling	Horizontal (more replicas) vs. vertical (bigger machines)	Cost vs. simplicity
Caching	Cache predictions for common inputs	Lower cost and latency vs. stale results
Fallback	Rules-based fallback, cached results, or graceful degradation	Availability when the model is down vs. result quality
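The caching and fallback rows can be combined in a thin wrapper around the model call. The sketch below is one possible shape, not a prescribed design: `model_fn` and `fallback_fn` are hypothetical callables, the cache is an unbounded in-process dict, and the TTL value is arbitrary.

```python
import time


class ResilientPredictor:
    """Serve from cache when fresh, from the model otherwise, and degrade on failure."""

    def __init__(self, model_fn, fallback_fn, ttl_seconds=300.0, clock=time.monotonic):
        self.model_fn = model_fn
        self.fallback_fn = fallback_fn
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._cache = {}  # key -> (expires_at, prediction); unbounded in this sketch

    def predict(self, key):
        """Return (prediction, source) where source names which path served it."""
        now = self.clock()
        cached = self._cache.get(key)
        if cached is not None and cached[0] > now:
            return cached[1], "cache"
        try:
            prediction = self.model_fn(key)
        except Exception:
            # Model is down or erroring: prefer a stale prediction over rules.
            if cached is not None:
                return cached[1], "stale_cache"
            return self.fallback_fn(key), "fallback"
        self._cache[key] = (now + self.ttl_seconds, prediction)
        return prediction, "model"
```

Returning the source alongside the prediction makes it easy to monitor how often the system is running degraded, which matters as much as the fallback itself.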

Multi-Stage Serving (Common for Ranking)

Multi-Stage Serving - Candidate Generation → Scoring → Re-ranking → User (top 20)
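The funnel above can be sketched as three function calls, each narrowing the set so the expensive model only sees a few hundred items. Everything here is illustrative: the scoring callables, the per-category cap used as the re-ranking rule, and the default sizes are assumptions, and a real candidate generator would use an index (for example approximate nearest neighbors) rather than scanning the catalog.

```python
import heapq


def cap_per_category(items, category_of, max_per_category):
    """Example re-ranking rule: keep order but limit items per category."""
    kept, counts = [], {}
    for item in items:
        category = category_of(item)
        if counts.get(category, 0) < max_per_category:
            kept.append(item)
            counts[category] = counts.get(category, 0) + 1
    return kept


def recommend(user_id, catalog, cheap_score, expensive_score, rerank,
              n_candidates=500, n_final=20):
    # Stage 1: candidate generation with a cheap score over the full catalog.
    candidates = heapq.nlargest(
        n_candidates, catalog, key=lambda item: cheap_score(user_id, item)
    )
    # Stage 2: scoring the shortlist with the heavier model.
    scored = sorted(
        candidates, key=lambda item: expensive_score(user_id, item), reverse=True
    )
    # Stage 3: re-ranking with business rules, then truncating for the user.
    return rerank(scored)[:n_final]
```

The latency budget is usually split across the stages, so the candidate count is the main knob for trading recall against scoring cost.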

Step 6: Evaluation & Iteration (8 min)

Goal: Measure, monitor, and improve.

Offline Evaluation

Holdout test set with proper temporal split (no future data leakage)
Metrics matched to business goals (see Step 2)
Error analysis: where does the model fail? What patterns emerge?
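A temporal split is simple to state in code, which makes it a good thing to write on the whiteboard. This sketch assumes each row carries an event timestamp and that the two cutoffs are chosen by you; the function and parameter names are hypothetical.

```python
def temporal_split(rows, timestamp_of, train_end, validation_end):
    """Split rows by time: train < train_end <= validation < validation_end <= test."""
    train, validation, test = [], [], []
    for row in rows:
        ts = timestamp_of(row)
        if ts < train_end:
            train.append(row)
        elif ts < validation_end:
            validation.append(row)
        else:
            test.append(row)
    return train, validation, test
```

Unlike a random split, every test row is strictly later than every training row, so the offline metric reflects how the model behaves on data it could not have seen.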

Online Evaluation

A/B testing: Treatment (new model) vs. Control (current model), measure business KPIs
Interleaving: For ranking systems, interleave results from both models in the same list
Canary deployment: Roll out to 5-10% of traffic, monitor for regressions
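For an A/B test on a conversion-style KPI, the readout often comes down to a two-proportion z-test. The sketch below assumes independent users, a binary outcome per user, and samples large enough for the normal approximation; the function name and the example counts are made up.

```python
import math


def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Return (z, two-sided p-value) for the difference in rates, B minus A."""
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # all users converted or none did; no detectable difference
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Made-up counts: control converts 1000/10000, treatment 1100/10000.
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
```

In an interview it is worth adding that the sample size and test duration should be fixed in advance, since peeking at the p-value repeatedly inflates false positives.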

Monitoring

Input drift: Feature distributions diverging from the training data
Output drift: Prediction distribution shifting
Performance drift: Online metrics degrading
Alerting: Automated alerts with thresholds + human review
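One common way to put a number on input or output drift is the Population Stability Index (PSI), which compares binned distributions between a reference sample and live traffic. This is a minimal sketch for a single numeric feature, assuming equal-width bins derived from the reference sample; the bin count and epsilon are arbitrary choices.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a live sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those thresholds are conventions rather than guarantees and should be tuned per feature before wiring them into alerts.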

Iteration Plan

What would V2 look like? What's the next biggest improvement?
What data would you need? What experiments would you run?
What would you change about the architecture?

Interviewer's Perspective

Ending with an iteration plan is the strongest possible close. It tells me: "This person doesn't think they're done - they're already thinking about how to make it better." That's exactly the mindset I want on my team.

Time Management Cheat Sheet

Phase	Time	What to Say When Transitioning
Requirements	0:00-5:00	"Now that I understand the constraints, let me formulate this as an ML problem."
Problem Formulation	5:00-10:00	"With the objective defined, let me think about the features and data pipeline."
Features & Data	10:00-18:00	"Given these features, here's my model approach."
Model	18:00-26:00	"Now let me discuss how we'd serve this in production."
Serving	26:00-34:00	"Finally, let me cover evaluation and monitoring."
Evaluation	34:00-42:00	"Here's what I'd focus on for V2."
Q&A	42:00-45:00	"Happy to go deeper on any component."

Spaced Repetition Checkpoints

Day 0: Memorize the 6 steps (RPFMSE). Draw the framework from memory.
Day 3: Apply the framework to a Recommendation System. Time yourself.
Day 7: Apply to a completely different problem (Fraud Detection). Verify you hit all 6 steps.
Day 14: Do a mock interview. Have your partner score you on each step.
Day 21: The framework should be automatic. Focus on depth within each step.

What's Next

Evaluation Rubric - Understand exactly how you're scored
Recommendation System - The most commonly asked design problem
