Evaluation Rubric - How You're Actually Scored
Reading time: ~12 min | Interview relevance: Critical | Roles: MLE, AI Eng, MLOps
The Rubric
ML system design interviews are typically scored across 6 dimensions, weighted as shown below. Knowing the rubric lets you spend your time where it earns the most credit.
Dimension 1: Requirements & Problem Formulation (15%)
| Rating | Behavior |
|---|---|
| Strong Hire | Asks questions that change the design. Identifies non-obvious constraints. Formulates a precise ML objective with clear metrics. |
| Lean Hire | Asks basic clarifying questions. Reasonable problem formulation. |
| No Hire | Skips requirements. Wrong or vague ML objective. Starts drawing boxes immediately. |
Dimension 2: Data & Features (20%)
| Rating | Behavior |
|---|---|
| Strong Hire | Creative feature engineering. Considers feature freshness and serving feasibility. Addresses data quality, labeling, and leakage. |
| Lean Hire | Reasonable features. Basic awareness of data challenges. |
| No Hire | Only uses raw columns. No thought about labels, leakage, or data quality. |
Dimension 3: Model Architecture (20%)
| Rating | Behavior |
|---|---|
| Strong Hire | Starts with baseline, iterates with justification. Discusses trade-offs (complexity vs. latency vs. interpretability). |
| Lean Hire | Reasonable model choice but jumps to a complex model without baseline. |
| No Hire | Can't justify model choice. Picks the most complex model without reasoning. |
Dimension 4: Serving & Infrastructure (15%)
| Rating | Behavior |
|---|---|
| Strong Hire | Detailed serving architecture. Addresses latency, caching, fallbacks, multi-stage ranking. Considers cost. |
| Lean Hire | Basic serving discussion. Mentions real-time vs. batch. |
| No Hire | Ignores serving entirely. "The model outputs predictions." |
Dimension 5: Evaluation & Monitoring (15%)
| Rating | Behavior |
|---|---|
| Strong Hire | Offline + online evaluation. A/B testing methodology. Drift monitoring. Clear iteration plan. |
| Lean Hire | Mentions offline metrics. Basic monitoring awareness. |
| No Hire | No evaluation plan. No monitoring. No iteration strategy. |
Dimension 6: Communication & Structure (15%)
| Rating | Behavior |
|---|---|
| Strong Hire | Organized, clear structure. Proactively addresses concerns. Manages time well. Engages with interviewer questions. |
| Lean Hire | Reasonably organized. Answers questions adequately. |
| No Hire | Unstructured rambling. Hard to follow. Ignores interviewer signals. |
The Most Common Failure Mode
The #1 failure mode is spending too much time on the model and not enough on everything else. Interviewers see many candidates who can describe a transformer architecture in detail but can't explain how to serve it at 10K QPS or how to detect when it starts degrading. Balance your time across all 6 dimensions.
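One rough way to make "balance your time" concrete is to split the interview clock in proportion to the rubric weights. The sketch below is an illustration, not part of the rubric: the 45-minute length is an assumed value, `time_budget` is a hypothetical helper, and Communication & Structure is left out on the assumption that it runs through the whole interview rather than occupying its own slot.

```python
# Weights for the five content dimensions, taken from the rubric headings.
# Communication & Structure is omitted (assumption: it spans the whole interview).
PHASE_WEIGHTS = {
    "Requirements & Problem Formulation": 15,
    "Data & Features": 20,
    "Model Architecture": 20,
    "Serving & Infrastructure": 15,
    "Evaluation & Monitoring": 15,
}


def time_budget(total_minutes, weights):
    """Split total_minutes across phases in proportion to their weights."""
    total_weight = sum(weights.values())
    return {
        name: total_minutes * weight / total_weight
        for name, weight in weights.items()
    }


# 45 minutes is an assumed interview length; substitute your own.
budget = time_budget(45, PHASE_WEIGHTS)
for name, minutes in budget.items():
    print(f"{name}: {minutes:.1f} min")
```

Treat the result as a ceiling per phase rather than a schedule. If the model discussion has used more than its share, that is a signal to move on to serving and monitoring.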
What "Strong Hire" Looks Like: A Pattern
The candidates who consistently get "Strong Hire" share these traits:
- They start with requirements - the design is driven by constraints, not technology preferences
- They justify every decision - "I chose XGBoost over a neural network because with 50 features and 1M samples, tree models typically outperform neural networks on tabular data, train faster, and are more interpretable"
- They acknowledge what they don't know - "I'd need to validate this assumption with the data team"
- They think about failure - "When the model is down, we fall back to a popularity-based ranking"
- They think about iteration - "For V2, I'd explore real-time features to capture session intent"
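The "think about failure" trait is easier to talk through with a concrete shape in mind. Here is a minimal sketch of the popularity-based fallback quoted above, assuming interactions arrive as `(user, item)` pairs and the model is any callable that returns a ranked list. The names `recommend`, `popularity_ranking`, and `model_rank` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def popularity_ranking(interactions, k):
    """Fallback: the k most-interacted-with items, most popular first."""
    counts = Counter(item for _user, item in interactions)
    return [item for item, _count in counts.most_common(k)]


def recommend(user_id, model_rank, interactions, k=3):
    """Serve the model's ranking; fall back to popularity if it fails."""
    try:
        ranked = model_rank(user_id, k)
        if ranked:  # treat an empty result as a failure as well
            return ranked, "model"
    except Exception:
        # A real service would log, alert, and catch narrower exception types.
        pass
    return popularity_ranking(interactions, k), "fallback"


# Made-up interaction log: item "a" has 3 interactions, "b" has 2, "c" has 1.
interactions = [
    ("u1", "a"), ("u2", "a"), ("u3", "a"),
    ("u1", "b"), ("u2", "b"),
    ("u1", "c"),
]


def broken_model(user_id, k):
    """Stand-in for a model server that is down."""
    raise TimeoutError("model unavailable")


def healthy_model(user_id, k):
    """Stand-in for a working model."""
    return ["x", "y"][:k]


print(recommend("u1", broken_model, interactions, k=2))
print(recommend("u1", healthy_model, interactions, k=2))
```

Returning the source label alongside the ranking is a design choice worth mentioning in the interview: it lets monitoring track how often the fallback is served.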
Practice Exercise
Take any design problem from this section. After completing your design, score yourself on each dimension (1-5). Be honest. Your weakest dimension is your biggest prep priority.
| Dimension | Self-Score (1-5) | Notes |
|---|---|---|
| Requirements & Problem Formulation | ___ | |
| Data & Features | ___ | |
| Model Architecture | ___ | |
| Serving & Infrastructure | ___ | |
| Evaluation & Monitoring | ___ | |
| Communication & Structure | ___ | |
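If you want a single number to track across practice sessions, one option is to weight each self-score by its rubric percentage. The sketch below assumes the weights from the dimension headings and uses made-up example scores; `summarize` is a hypothetical helper, and combining the scores this way is an illustration rather than something interviewers are known to do.

```python
# Rubric weights from the dimension headings (they sum to 100%).
WEIGHTS = {
    "Requirements & Problem Formulation": 0.15,
    "Data & Features": 0.20,
    "Model Architecture": 0.20,
    "Serving & Infrastructure": 0.15,
    "Evaluation & Monitoring": 0.15,
    "Communication & Structure": 0.15,
}


def summarize(scores):
    """Return (weighted average on the 1-5 scale, weakest dimension)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name}: score must be between 1 and 5")
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # On ties, min() keeps the dimension that appears first in the rubric.
    weakest = min(WEIGHTS, key=lambda name: scores[name])
    return weighted, weakest


# Made-up self-scores for one practice session.
example_scores = {
    "Requirements & Problem Formulation": 4,
    "Data & Features": 3,
    "Model Architecture": 5,
    "Serving & Infrastructure": 2,
    "Evaluation & Monitoring": 3,
    "Communication & Structure": 4,
}

weighted, weakest = summarize(example_scores)
print(f"Weighted score: {weighted:.2f} / 5")
print(f"Prep priority: {weakest}")
```

The weighted average is useful for tracking progress, but the weakest dimension is still the one to practice first, as the exercise says.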
What's Next
Now that you understand the framework and rubric, start practicing with design problems:
- Recommendation System - The most commonly asked problem
- Search Ranking - Multi-stage ranking at scale
- Fraud Detection - Real-time classification with extreme imbalance
