ML System Design Framework - Your 45-Minute Playbook
Reading time: ~20 min | Interview relevance: Critical | Roles: MLE, AI Eng, MLOps
The Framework: RPFMSE
Every ML system design answer should follow these 6 steps. This structure ensures you cover everything interviewers evaluate - and prevents the most common failure mode (spending 30 minutes on the model and 0 minutes on serving and evaluation).
Step 1: Requirements (5 min)
Goal: Clarify the problem before designing anything.
Questions to Ask
Functional requirements:
- What is the core user experience? What does the user see/do?
- What inputs does the system receive? What outputs does it produce?
- What scale are we designing for? (users, requests/sec, data volume)
Non-functional requirements:
- What's the latency budget? (real-time: <100ms, near-real-time: <1s, batch: hours)
- What's the accuracy/quality bar? (Do we need 99.9% precision, or is 80% recall fine?)
- What's the cost budget? (GPU serving at scale is expensive)
- Is there existing data? Labels? Infrastructure?
BAD: Skip requirements and start drawing boxes.
GOOD: "Before I design, let me make sure I understand the constraints. We're building a recommendation system for an e-commerce platform with 10M users, 1M products, and we need to serve recommendations in under 200ms. Are there specific business objectives - increasing revenue, engagement, or both? And do we have historical user-item interaction data to start with?"
The requirements phase is where I assess seniority. Junior candidates skip it entirely. Mid-level candidates ask basic questions. Senior candidates ask the questions that change the design - like "Is this real-time or batch?" or "Do we optimize for revenue or engagement?" These questions show you've built real systems and know what matters.
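The scale and latency answers feed directly into a back-of-envelope capacity estimate. The sketch below is one way to do that arithmetic; every input value (active-user share, requests per user, peak ratio, per-replica throughput, headroom) is an assumption chosen for illustration, not a figure from this guide.

```python
import math

def estimate_replicas(daily_active_users, requests_per_user_per_day,
                      peak_to_average, per_replica_qps, headroom=0.7):
    """Back-of-envelope sizing: average QPS, peak QPS, and replica count."""
    seconds_per_day = 24 * 60 * 60
    avg_qps = daily_active_users * requests_per_user_per_day / seconds_per_day
    peak_qps = avg_qps * peak_to_average
    # Run each replica at a fraction of its measured capacity to absorb spikes.
    replicas = math.ceil(peak_qps / (per_replica_qps * headroom))
    return avg_qps, peak_qps, replicas

avg_qps, peak_qps, replicas = estimate_replicas(
    daily_active_users=2_000_000,   # assumed: 20% of 10M users active daily
    requests_per_user_per_day=10,   # assumed
    peak_to_average=3.0,            # assumed
    per_replica_qps=200,            # assumed; would come from a load test
)
```

Stating the assumptions out loud matters more than the final number: the interviewer can correct any input and the estimate updates immediately.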
Step 2: Problem Formulation (5 min)
Goal: Translate the business problem into an ML problem.
The Translation
| Business Goal | ML Objective | Metric |
|---|---|---|
| "Increase purchases" | Predict P(purchase \| user, item) | Conversion rate, revenue per session |
| "Reduce fraud" | Binary classification: fraud vs. legit | Precision @ low FPR, dollar amount saved |
| "Show relevant search results" | Learning-to-rank: order results by relevance | NDCG, MRR |
| "Filter harmful content" | Multi-label classification: toxicity categories | Recall (catch harmful) + Precision (don't over-block) |
| "Answer customer questions" | RAG + generation: retrieve context, generate answer | Answer accuracy, user satisfaction, resolution rate |
Key Decisions at This Stage
- What type of ML problem? Classification, regression, ranking, generation, retrieval?
- What's the prediction target? P(click), P(fraud), relevance score, generated text?
- What's the north star metric? One primary metric + 2-3 guardrail metrics.
- Offline vs. online? Can we use batch predictions, or do we need real-time inference?
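The ranking metrics named in the table (NDCG, MRR) are short enough to define precisely. The following is a minimal sketch, assuming graded relevance labels listed in ranked order and a linear gain (some formulations use `2**rel - 1` instead); the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mrr(ranked_lists):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for relevances in ranked_lists:
        for rank, rel in enumerate(relevances, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(ranked_lists) if ranked_lists else 0.0
```

NDCG rewards putting the most relevant items first across the whole list, while MRR only cares about the position of the first relevant result, so the choice depends on whether users scan a list or want one answer.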
Step 3: Features & Data (8 min)
Goal: Identify data sources, engineer features, and handle labels.
Feature Categories
For most ML systems, features fall into these categories:
| Category | Examples | Freshness |
|---|---|---|
| User features | Demographics, history, preferences, engagement patterns | Updated hourly-daily |
| Item features | Category, price, description embeddings, popularity | Updated on change |
| Context features | Time of day, device, location, session behavior | Real-time |
| Cross features | User-item affinity, user-category preference, co-occurrence | Computed batch or real-time |
Data Considerations
- Label availability: Do we have ground truth? How is it collected? What's the label delay?
- Class imbalance: What's the positive rate? (fraud: 0.1%, clicks: 3%, purchases: 1%)
- Data freshness: How often does the data distribution change?
- Data quality: Missing values, duplicates, noise, adversarial data
- Training data construction: How do you avoid leakage? Point-in-time correctness?
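Point-in-time correctness means each training row may only see feature values that existed before its label event. A minimal sketch of that lookup, assuming a per-user feature history sorted by timestamp (the data layout and function names are hypothetical):

```python
from bisect import bisect_left

def point_in_time_value(feature_history, label_time):
    """Return the latest feature value recorded strictly before label_time.

    feature_history: list of (timestamp, value) tuples sorted by timestamp.
    Returns None when no value existed yet.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_left(timestamps, label_time)  # first index with ts >= label_time
    return feature_history[idx - 1][1] if idx > 0 else None

def build_training_rows(labels, feature_histories):
    """labels: list of (user_id, label_time, label) tuples."""
    rows = []
    for user_id, label_time, label in labels:
        history = feature_histories.get(user_id, [])
        rows.append({
            "user_id": user_id,
            "feature": point_in_time_value(history, label_time),
            "label": label,
        })
    return rows
```

Joining on the latest value regardless of time is the usual source of leakage: the model trains on information it will not have at serving time.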
Many candidates list features without thinking about serving. "Average purchase amount over the last 30 days" is easy in batch SQL but requires a streaming aggregation pipeline for real-time. Always ask: "Can I compute this feature at serving time within my latency budget?"
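To make the serving-side cost concrete, here is a rough sketch of a windowed average maintained incrementally, the in-memory analogue of a streaming aggregation. It assumes events arrive in timestamp order and fit in memory; the class name and interface are illustrative, and a production system would typically use a stream processor and a feature store instead.

```python
from collections import deque

class RollingAverage:
    """Average of values observed within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, amount), oldest first
        self.total = 0.0

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, amount = self.events.popleft()
            self.total -= amount

    def add(self, timestamp, amount):
        self._evict(timestamp)
        self.events.append((timestamp, amount))
        self.total += amount

    def value(self, now):
        self._evict(now)
        return self.total / len(self.events) if self.events else None
```

Reads are cheap because the running total is kept up to date on every write, which is the property that lets a feature like this fit inside a tight latency budget.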
Step 4: Model (8 min)
Goal: Start simple, iterate toward complexity with justification.
The Progression
Always start with a baseline. This shows engineering judgment and gives the interviewer confidence you won't over-engineer.
| Problem Type | Baseline | Strong Model | Why Start Simple |
|---|---|---|---|
| Classification | Logistic Regression | XGBoost → Neural Network | Interpretable, fast, sets benchmark |
| Ranking | Pointwise LR | Pairwise (LambdaMART) → Listwise (Deep Ranking) | Understand feature importance first |
| Retrieval | TF-IDF or BM25 | Two-tower embedding model | Fast, no training needed |
| Generation | Template-based | LLM with RAG | Reliable, deterministic |
What to Cover
- Architecture: What model and why (for this specific problem)?
- Training: How do you train? Data splits, hyperparameter tuning, training infrastructure.
- Offline evaluation: What metrics, on what holdout set?
- Trade-offs: Why this model over alternatives? What did you sacrifice?
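As an illustration of how little a classification baseline needs, the sketch below trains logistic regression with plain batch gradient descent. It is a teaching example under simple assumptions (dense numeric features, no regularization, fixed learning rate); in practice a library implementation would normally be used.

```python
import math

def sigmoid(z):
    # Clamp to avoid overflow in math.exp for extreme logits.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on log loss. Returns (weights, bias)."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(X, y):
            logit = sum(w * f for w, f in zip(weights, features)) + bias
            error = sigmoid(logit) - label
            for j, f in enumerate(features):
                grad_w[j] += error * f
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

def predict_proba(weights, bias, features):
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
```

The learned weights double as a first look at feature importance, which is part of the case for starting here before moving to gradient-boosted trees or a neural network.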
Step 5: Serving (8 min)
Goal: Get the model into production reliably.
Key Decisions
| Decision | Options | Trade-offs |
|---|---|---|
| Real-time vs. batch | Real-time: per-request predictions. Batch: pre-compute, cache. | Latency vs. freshness vs. cost |
| Model format | PyTorch, ONNX, TensorRT | Flexibility vs. inference speed |
| Infrastructure | GPU vs. CPU | Cost vs. latency |
| Scaling | Horizontal (more replicas) vs. vertical (bigger machines) | Cost vs. simplicity |
| Caching | Cache predictions for common inputs | Lower cost and latency vs. risk of stale results |
| Fallback (when the model is down) | Rules-based fallback, cached results, or graceful degradation | Availability vs. prediction quality |
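The caching and fallback rows can be combined in one thin wrapper around the model call. This is a sketch under assumptions: `model_fn` and `fallback_fn` are hypothetical callables, the cache is an in-process dict with a TTL, and the clock is injectable so the behavior can be tested.

```python
import time

class PredictionService:
    """Wraps a model call with a TTL cache and a rules-based fallback."""

    def __init__(self, model_fn, fallback_fn, ttl_seconds=300, clock=time.monotonic):
        self.model_fn = model_fn
        self.fallback_fn = fallback_fn
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.cache = {}  # key -> (expires_at, prediction)

    def predict(self, key):
        now = self.clock()
        cached = self.cache.get(key)
        if cached and cached[0] > now:
            return cached[1], "cache"
        try:
            prediction = self.model_fn(key)
        except Exception:
            # Model is unavailable: serve a stale cached value if we have one,
            # otherwise degrade to the rules-based fallback.
            if cached:
                return cached[1], "stale_cache"
            return self.fallback_fn(key), "fallback"
        self.cache[key] = (now + self.ttl_seconds, prediction)
        return prediction, "model"
```

Returning the source alongside the prediction makes it easy to monitor how often the system is serving cached or fallback results.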
Multi-Stage Serving (Common for Ranking)
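A common shape for multi-stage serving is a funnel: cheap retrieval over the full catalog, a heavier ranking model over the survivors, then a re-ranking pass for business rules. The sketch below assumes that three-stage layout; the function names, the dot-product retrieval, and the per-category cap are illustrative choices rather than a prescribed design.

```python
def retrieve_candidates(user_vector, item_vectors, k):
    """Stage 1: cheap dot-product retrieval over the full catalog."""
    scored = [
        (sum(u * v for u, v in zip(user_vector, vec)), item_id)
        for item_id, vec in item_vectors.items()
    ]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]

def rank_candidates(candidates, expensive_score):
    """Stage 2: apply a heavier model only to the retrieved candidates."""
    return sorted(candidates, key=expensive_score, reverse=True)

def rerank(ranked, category_of, max_per_category, n):
    """Stage 3: business rules, here a per-category diversity cap."""
    counts, final = {}, []
    for item_id in ranked:
        category = category_of[item_id]
        if counts.get(category, 0) < max_per_category:
            final.append(item_id)
            counts[category] = counts.get(category, 0) + 1
        if len(final) == n:
            break
    return final
```

The point of the funnel is cost: the expensive model scores a small candidate set instead of the whole catalog, which is how a ranking system stays inside its latency budget. At real catalog sizes, stage 1 would use an approximate nearest-neighbor index rather than a full scan.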
Step 6: Evaluation & Iteration (8 min)
Goal: Measure, monitor, and improve.
Offline Evaluation
- Holdout test set with proper temporal split (no future data leakage)
- Metrics matched to business goals (see Step 2)
- Error analysis: where does the model fail? What patterns emerge?
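A temporal split is simple to state in code. This sketch assumes each row is a dict carrying a comparable `timestamp` field (the field name and cutoffs are hypothetical):

```python
def temporal_split(rows, train_end, validation_end, time_key="timestamp"):
    """Split rows by time: train < train_end <= validation < validation_end <= test."""
    train, validation, test = [], [], []
    for row in rows:
        ts = row[time_key]
        if ts < train_end:
            train.append(row)
        elif ts < validation_end:
            validation.append(row)
        else:
            test.append(row)
    return train, validation, test
```

Unlike a random split, every test row is strictly later than every training row, which mirrors how the model will actually be used after deployment.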
Online Evaluation
- A/B testing: Treatment (new model) vs. Control (current model), measure business KPIs
- Interleaving: For ranking systems, interleave results from both models in the same list
- Canary deployment: Roll out to 5-10% of traffic, monitor for regressions
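For an A/B test on a conversion-style KPI, one standard way to check whether the difference is more than noise is a two-proportion z-test. The sketch below assumes independent users, a binary outcome, and samples large enough for the normal approximation; it is not a substitute for a full experimentation platform.

```python
import math

def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    Returns (z, p_value); positive z means B converted better than A.
    """
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / std_err
    # Standard normal two-sided tail probability via the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

In an interview it is worth adding that the sample size and test duration should be fixed before the experiment starts, since peeking at the p-value repeatedly inflates false positives.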
Monitoring
- Input drift: Feature distributions changing from training data
- Output drift: Prediction distribution shifting
- Performance drift: Online metrics degrading
- Alerting: Automated alerts with thresholds + human review
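Input drift on a numeric feature is often summarized with the Population Stability Index (PSI). The following is a minimal sketch, assuming equal-width buckets derived from the training sample and a small epsilon to avoid empty buckets; the bucket count and any alert threshold are choices to tune, not fixed rules.

```python
import math

def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a training-time sample and a serving-time sample of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range serving values into the edge buckets.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = bucket_fractions(expected), bucket_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A PSI near zero means the serving distribution matches training. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift worth an alert, though the right threshold depends on the feature.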
Iteration Plan
- What would V2 look like? What's the next biggest improvement?
- What data would you need? What experiments would you run?
- What would you change about the architecture?
Ending with an iteration plan is the strongest possible close. It tells me: "This person doesn't think they're done - they're already thinking about how to make it better." That's exactly the mindset I want on my team.
Time Management Cheat Sheet
| Phase | Time | What to Say When Transitioning |
|---|---|---|
| Requirements | 0:00-5:00 | "Now that I understand the constraints, let me formulate this as an ML problem." |
| Problem Formulation | 5:00-10:00 | "With the objective defined, let me think about the features and data pipeline." |
| Features & Data | 10:00-18:00 | "Given these features, here's my model approach." |
| Model | 18:00-26:00 | "Now let me discuss how we'd serve this in production." |
| Serving | 26:00-34:00 | "Finally, let me cover evaluation and monitoring." |
| Evaluation | 34:00-42:00 | "Here's what I'd focus on for V2." |
| Q&A | 42:00-45:00 | "Happy to go deeper on any component." |
Spaced Repetition Checkpoints
- Day 0: Memorize the 6 steps (RPFMSE). Draw the framework from memory.
- Day 3: Apply the framework to a Recommendation System. Time yourself.
- Day 7: Apply to a completely different problem (Fraud Detection). Verify you hit all 6 steps.
- Day 14: Do a mock interview. Have your partner score you on each step.
- Day 21: The framework should be automatic. Focus on depth within each step.
What's Next
- Evaluation Rubric - Understand exactly how you're scored
- Recommendation System - The most commonly asked design problem
