Evaluation Metrics - Measuring What Actually Matters
Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Data Scientist, Research Eng
The Real Interview Moment
You're in a machine learning design round at a fintech company. The interviewer describes a fraud detection system: "We have 100 million transactions per day. About 0.01% are fraudulent. Our current model has 99.5% accuracy. The business says fraud losses are still too high. How would you evaluate this model properly?"
The candidate who says "99.5% accuracy sounds good" is done. A model that predicts "not fraud" for every single transaction would achieve 99.99% accuracy on this dataset. The interviewer wants to hear you immediately recognize the class imbalance problem, pivot to precision-recall analysis, discuss the business cost asymmetry (missing fraud is 100x more expensive than a false alert), and propose AUC-PR as the primary metric with a threshold tuned to minimize total expected cost.
Choosing the right evaluation metric is not a footnote - it's the first design decision in any ML project. The wrong metric leads to the wrong model, the wrong threshold, and the wrong business outcome. In interviews, your metric choice reveals whether you understand the problem or just the algorithm.
What You Will Master
After reading this page, you will be able to:
- Select the correct evaluation metric for any ML problem type (classification, ranking, generation, regression)
- Explain why accuracy is misleading for imbalanced datasets and what to use instead
- Derive and interpret precision, recall, F1, and their micro/macro/weighted variants
- Draw and interpret ROC curves, PR curves, and explain when each is appropriate
- Apply ranking metrics (NDCG, MAP, MRR) for search and recommendation systems
- Use generation metrics (BLEU, ROUGE, perplexity) for NLP tasks
- Design threshold selection strategies based on business requirements
- Explain model calibration and why it matters for decision-making
- Handle multi-class and multi-label metric computation
- Navigate the "what metric would you use?" interview question for any scenario
Self-Assessment: Where Are You Now?
| Skill Area | 1 (Never heard of it) | 3 (Can define it) | 5 (Can derive + apply) | Your Rating |
|---|---|---|---|---|
| Precision / Recall / F1 | Never used | Know the formulas | Can explain micro/macro/weighted variants | ___ |
| AUC-ROC | Don't know what ROC stands for | Can draw the curve | Know why AUC-PR is better for imbalanced data | ___ |
| Log loss / Cross-entropy | Never computed | Know the formula | Can explain calibration connection | ___ |
| Ranking metrics (NDCG, MAP) | Never heard of NDCG | Know the formulas | Can compute by hand and explain position bias | ___ |
| NLP metrics (BLEU, ROUGE) | Never used | Know they exist | Can explain limitations and when to use each | ___ |
| Threshold selection | Just use 0.5 | Know it's adjustable | Can design cost-sensitive threshold optimization | ___ |
| Calibration | Never heard of it | Know what it means | Can diagnose and fix calibration issues | ___ |
Score interpretation:
- 7-14: Essential reading. Metrics questions appear in almost every ML interview.
- 15-25: Good foundation. Focus on the edge cases and practice problems.
- 26-35: You're well-prepared. Drill the company-specific scenarios and ranking/generation metrics.
Part 1 - Classification Metrics
The Confusion Matrix: Where Everything Starts
Every binary classification metric derives from four numbers:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
"The confusion matrix is the foundation of all classification metrics. From it, we derive accuracy (how often we're right overall), precision (of the things we called positive, how many actually were), recall (of the actual positives, how many did we catch), and F1 (the harmonic mean of precision and recall). The choice between these depends on the cost asymmetry - if missing a positive is expensive (cancer detection, fraud), optimize for recall. If false alarms are expensive (spam filtering in important email), optimize for precision. F1 balances both when the costs are roughly equal."
Core Metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The proportion of correct predictions. Simple but dangerous.
Precision = TP / (TP + FP)
"Of the items I flagged as positive, what fraction actually were?" High precision = few false alarms.
Recall (Sensitivity, True Positive Rate) = TP / (TP + FN)
"Of the actual positive items, what fraction did I catch?" High recall = few missed positives.
Specificity (True Negative Rate) = TN / (TN + FP)
"Of the actual negative items, what fraction did I correctly identify?" Important for medical screening.
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
The harmonic mean of precision and recall. The harmonic mean penalizes extreme imbalance: if either precision or recall is near zero, F1 is near zero too.
F-beta Score = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
Generalizes F1. F2 weighs recall 2x more than precision (use when missing positives is costly). F0.5 weighs precision 2x more than recall (use when false alarms are costly).
Many candidates say F1 is "the average of precision and recall." It's the harmonic mean, not the arithmetic mean. The harmonic mean is always less than or equal to the arithmetic mean, and it's zero when either input is zero. This is important: a model with 100% precision and 1% recall has an arithmetic mean of about 50% but an F1 of about 2%. If an interviewer catches this mistake, it signals you've memorized rather than understood.
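The formulas above map directly onto a few lines of code. The sketch below is illustrative only: the function name `classification_metrics`, the dictionary keys, and the convention of reporting F-beta as 0 when it cannot be computed are assumptions, not part of any library.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute core binary metrics from confusion-matrix counts."""
    def ratio(num, den):
        # A metric is undefined (NaN) when its denominator is zero.
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    b2 = beta ** 2
    denom = b2 * precision + recall
    # Assumed convention: report F-beta as 0.0 when it is undefined or 0/0.
    f_beta = (1 + b2) * precision * recall / denom if denom > 0 else 0.0
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f_beta": f_beta,
    }

# Hypothetical counts chosen for illustration.
m = classification_metrics(tp=80, fp=20, fn=120, tn=9780)
print(m)
```

With these counts, precision is 0.80 and recall is 0.40, so F1 is about 0.53 while the arithmetic mean would be 0.60.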
When Accuracy Is Misleading
Example: Fraud detection with 0.01% positive rate
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Predict all "not fraud" | 99.99% | undefined | 0% | 0% |
| Random (match base rate) | 99.98% | 0.01% | 0.01% | 0.01% |
| Decent fraud model | 99.85% | 5% | 80% | 9.4% |
| Good fraud model | 99.53% | 2% | 95% | 3.9% |
Notice: The "good" fraud model has lower accuracy and lower precision than the decent one, but its recall is much higher. Which is better depends on whether catching 95% vs. 80% of fraud is worth the extra false positives.
When I ask "what metric would you use for X?" I'm testing three things: (1) Do you recognize the class distribution? (2) Do you understand the business cost of each error type? (3) Can you translate that into a concrete metric? The best answers start with "It depends on the cost of false positives vs. false negatives" and then propose a specific metric with reasoning. The worst answers are "accuracy" for any imbalanced problem or "F1" without discussing why.
Multi-Class Metrics: Micro, Macro, Weighted
For K classes, you have K precision/recall values. How do you aggregate?
Macro average: Compute metric per class, then average. Treats all classes equally.
Macro-F1 = (1/K) * sum_{c=1}^{K} F1_c
Micro average: Pool all TP, FP, FN across classes, then compute metric. Dominated by frequent classes.
Micro-Precision = sum_c TP_c / sum_c (TP_c + FP_c)
Weighted average: Like macro but weighted by class frequency (support).
Weighted-F1 = sum_{c=1}^{K} (n_c / N) * F1_c
| Averaging | When to Use | Key Property |
|---|---|---|
| Macro | All classes equally important regardless of frequency | A rare class with F1=0 drags down the average |
| Micro | Overall correctness matters most | Equals accuracy for single-label classification |
| Weighted | Compromise - account for class frequency but report per-class performance | Reported next to the macro average in sklearn's classification_report |
If you report a single F1 number for a multi-class problem without specifying macro, micro, or weighted, you haven't fully answered the question. In an interview, always specify the averaging method and explain your choice. For imbalanced multi-class problems, macro-F1 is usually the right choice because it exposes poor performance on minority classes.
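To make the three averaging schemes concrete, here is a minimal sketch that computes them from per-class one-vs-rest counts. The helper names `f1` and `averaged_f1` and the three-class counts are hypothetical, chosen so that one rare class is never predicted correctly.

```python
def f1(tp, fp, fn):
    # F1 written directly in counts: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(per_class):
    """per_class maps class name -> (tp, fp, fn), counted one-vs-rest."""
    scores = {c: f1(*counts) for c, counts in per_class.items()}
    support = {c: tp + fn for c, (tp, fp, fn) in per_class.items()}
    n = sum(support.values())
    macro = sum(scores.values()) / len(scores)
    weighted = sum(support[c] / n * scores[c] for c in scores)
    micro = f1(
        sum(tp for tp, _, _ in per_class.values()),
        sum(fp for _, fp, _ in per_class.values()),
        sum(fn for _, _, fn in per_class.values()),
    )
    return {"macro": macro, "micro": micro, "weighted": weighted}

# Hypothetical single-label problem: 100 samples, class supports 90 / 8 / 2.
counts = {"A": (85, 6, 5), "B": (4, 4, 4), "C": (0, 1, 2)}
result = averaged_f1(counts)
print(result)
```

Here micro-F1 is 0.89, the same as accuracy (89 of 100 correct), while macro-F1 drops to about 0.48 because class C has an F1 of 0.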
Multi-Label Metrics
In multi-label classification (each sample can have multiple labels), the metrics extend differently:
- Exact match ratio: Fraction of samples where the predicted label set exactly matches the true label set. Very strict.
- Hamming loss: Fraction of label-sample pairs that are incorrect. More lenient.
- Per-label F1 + macro average: Most common in practice.
- Sample-averaged F1: Compute F1 per sample, then average.
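The first two multi-label metrics above can be sketched with plain sets. This is a minimal illustration; the function names and the support-ticket labels are made up for the example.

```python
def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose predicted label set equals the true set."""
    return sum(set(t) == set(p) for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred, labels):
    """Fraction of (sample, label) pairs that are wrong."""
    wrong = 0
    for t, p in zip(y_true, y_pred):
        # Symmetric difference = labels present in exactly one of the two sets.
        wrong += len(set(t) ^ set(p))
    return wrong / (len(y_true) * len(labels))

# Hypothetical label space and predictions.
labels = ["billing", "login", "shipping", "refund"]
y_true = [{"billing", "refund"}, {"login"}, {"shipping"}]
y_pred = [{"billing"}, {"login"}, {"shipping", "refund"}]

print(exact_match_ratio(y_true, y_pred))
print(hamming_loss(y_true, y_pred, labels))
```

Only one of three samples matches exactly (ratio 1/3), yet just 2 of the 12 label-sample pairs are wrong (Hamming loss 1/6), which shows how much stricter exact match is.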
Part 2 - Probabilistic and Threshold-Based Metrics
Log Loss (Binary Cross-Entropy)
Log Loss = -(1/N) * sum_{i=1}^{N} [y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]
Measures the quality of probability estimates, not just the binary prediction. A model that outputs 0.51 for a positive example is penalized much more than one that outputs 0.99.
Key properties:
- Minimized when predicted probabilities match true class probabilities (calibration)
- Heavily penalizes confident wrong predictions (predicting 0.01 for a positive: -log(0.01) = 4.6)
- Unlike accuracy/F1, log loss evaluates the entire probability distribution, not just the argmax
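A small sketch of the formula makes the confidence penalty visible. The function name `log_loss` and the clipping constant `eps` are choices made for this example (clipping only guards against log(0)).

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy averaged over examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# One positive example scored with three different confidences.
for p in (0.99, 0.51, 0.01):
    print(p, round(log_loss([1], [p]), 3))
```

All of 0.99 and 0.51 count as "correct" at a 0.5 threshold, but their losses are roughly 0.01 and 0.67; the confident wrong prediction of 0.01 costs about 4.6.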
"Log loss measures how well your model's predicted probabilities match reality. It's different from accuracy or F1 because it cares about confidence - predicting 0.51 for a true positive is correct for accuracy but terrible for log loss compared to predicting 0.99. Log loss is the proper scoring rule for probability estimation, which makes it the right training loss for classification. But for model evaluation, you often want metrics that reflect the business decision (precision, recall) rather than the probability quality (log loss). Log loss is most useful when you need well-calibrated probabilities for downstream decision-making."
AUC-ROC
The ROC curve plots True Positive Rate (recall) vs. False Positive Rate (1 - specificity) at every possible classification threshold.
AUC-ROC = Area Under the ROC Curve. Interpretation: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example.
| AUC-ROC Value | Interpretation |
|---|---|
| 1.0 | Perfect separation |
| 0.5 | Random classifier (no discrimination) |
| < 0.5 | Worse than random (flip predictions) |
| 0.7-0.8 | Acceptable for many applications |
| 0.8-0.9 | Good discrimination |
| > 0.9 | Excellent discrimination |
Advantages:
- Threshold-independent - evaluates the model's ability to rank positives above negatives
- Scale-independent - doesn't depend on calibration
- Easy to interpret probabilistically
Disadvantages:
- Misleading for imbalanced datasets - ROC can look great even when precision is terrible
- Includes performance at thresholds you'd never use (e.g., classifying everything as positive)
- FPR denominator (TN + FP) is huge for imbalanced data, making FPR artificially low
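The probabilistic interpretation of AUC-ROC can be computed literally by comparing every positive-negative pair. The sketch below does exactly that; it is quadratic in the number of examples, so it is meant for understanding rather than production use, and the name `pairwise_auc` is an assumption of this example.

```python
def pairwise_auc(y_true, scores):
    """P(score of random positive > score of random negative); ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores: one negative (0.7) outranks one positive (0.6).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(pairwise_auc(y_true, scores))
```

Eight of the nine positive-negative pairs are ordered correctly, so the result is 8/9. Rescaling the scores (for example, squaring them) leaves the value unchanged, which is the scale-independence noted above.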
AUC-PR (Average Precision)
The PR curve plots precision vs. recall at every threshold. AUC-PR is the area under this curve (approximated by Average Precision in sklearn).
A common interview mistake is saying "AUC-ROC is always the best metric for classification." For imbalanced datasets, AUC-ROC can be very high (0.95+) even when the model's precision is terrible. This happens because the False Positive Rate that the ROC curve is built on has all true negatives in its denominator (FP + TN) - when negatives vastly outnumber positives, even many false positives barely move the FPR. AUC-PR is much more informative for imbalanced problems because both precision and recall ignore true negatives and focus on the positive class.
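The gap between the two metrics is easy to reproduce on a synthetic ranking. In this sketch the input is simply the list of 0/1 labels sorted by descending model score (no tied scores assumed); both function names and the ranking itself are hypothetical.

```python
def average_precision(labels_by_rank):
    """Mean of precision@k taken at each rank k that holds a positive."""
    hits, total = 0, 0.0
    for rank, y in enumerate(labels_by_rank, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def auc_roc_from_ranking(labels_by_rank):
    """1 minus the fraction of (positive, negative) pairs ranked the wrong way."""
    n_pos = sum(labels_by_rank)
    n_neg = len(labels_by_rank) - n_pos
    neg_above, inversions = 0, 0
    for y in labels_by_rank:
        if y == 0:
            neg_above += 1
        else:
            inversions += neg_above  # negatives ranked above this positive
    return 1 - inversions / (n_pos * n_neg)

# 10 negatives ranked first, then all 10 positives, then 9,990 more negatives.
ranking = [0] * 10 + [1] * 10 + [0] * 9_990
print(auc_roc_from_ranking(ranking), average_precision(ranking))
```

AUC-ROC comes out at 0.999 because only 100 of 100,000 pairs are inverted, while average precision is about 0.33 because the top ten alerts are all false alarms.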
Threshold Selection
The model outputs probabilities; the threshold converts them to binary decisions. Choosing the threshold is a business decision, not a statistical one.
Strategy 1: Maximize F1. Find the threshold that maximizes F1 on validation data. Good default when costs are symmetric.
Strategy 2: Target a specific recall. "We must catch 95% of fraud." Find the threshold that gives recall >= 0.95, then report the resulting precision.
Strategy 3: Minimize expected cost. Define a cost matrix:
Cost = C_FN * FN + C_FP * FP
Find the threshold that minimizes this on validation data.
Strategy 4: Precision at fixed recall (or vice versa). Common in retrieval: "What's our precision when we recall 80% of relevant items?"
| Strategy | When to Use | Example |
|---|---|---|
| Max F1 | Balanced costs | General classification |
| Target recall | Missing positives is very costly | Medical screening, fraud |
| Target precision | False alarms are very costly | Content moderation, spam |
| Min expected cost | Known cost asymmetry | Any business-critical decision |
| Precision@k | Top-k retrieval | Search, recommendations |
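The minimum-expected-cost strategy can be sketched as a sweep over candidate thresholds on validation data. Everything named here (`total_cost`, `best_threshold`, the costs, and the tiny validation set) is hypothetical; a prediction counts as positive when its probability is at or above the threshold.

```python
def total_cost(y_true, p_pred, threshold, cost_fn, cost_fp):
    """Cost = C_FN * FN + C_FP * FP at a given threshold."""
    fn = sum(1 for y, p in zip(y_true, p_pred) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, p_pred) if y == 0 and p >= threshold)
    return cost_fn * fn + cost_fp * fp

def best_threshold(y_true, p_pred, cost_fn, cost_fp):
    # Candidates: every distinct score, plus infinity meaning "flag nothing".
    candidates = sorted(set(p_pred)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(y_true, p_pred, t, cost_fn, cost_fp))

# Hypothetical validation labels and predicted probabilities.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

print(best_threshold(y_val, p_val, cost_fn=100, cost_fp=1))  # misses are costly
print(best_threshold(y_val, p_val, cost_fn=1, cost_fp=5))    # false alarms are costly
```

On this data the chosen threshold is 0.40 when a miss costs 100x a false alarm, and rises to 0.70 when a false alarm costs 5x a miss, which is the business-driven shift the table describes.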
- Google Search: Uses NDCG and MRR for ranking quality; precision@k for snippet relevance
- Meta Ads: Optimizes for calibrated probabilities (log loss) because bid = p(click) * value
- Stripe/PayPal (fraud): Recall at a fixed false positive rate ("catch 95% of fraud with <1% false positive rate on value")
- Netflix/Spotify: Uses ranking metrics (NDCG) and engagement metrics (click-through rate, play rate)
Calibration
A model is calibrated if its predicted probability matches the true frequency: when it says "80% chance of positive," 80% of those cases should actually be positive.
Why calibration matters:
- Decision-making: If you use probabilities to set prices, bids, or risk scores, they must be calibrated
- Threshold stability: A well-calibrated model's optimal threshold is more stable across data distributions
- Ensembling: Averaging probabilities across models works best when each model's probabilities are calibrated
How to check calibration:
- Reliability diagram: Bin predictions by confidence, plot predicted probability vs. actual frequency. Perfect calibration = diagonal line.
- Expected Calibration Error (ECE): Average gap between predicted and actual probability across bins.
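A minimal sketch of ECE for a binary classifier follows. It assumes equal-width bins over the predicted probability of the positive class and weights each bin's gap by its share of the examples; the function name and the bin count are choices made for this example.

```python
def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Weighted average of |mean predicted prob - observed positive rate| per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for _, p in members) / len(members)
        frac_pos = sum(y for y, _ in members) / len(members)
        ece += len(members) / n * abs(avg_conf - frac_pos)
    return ece

# Calibrated: says 0.8 and is right 8 times out of 10.
print(expected_calibration_error([1] * 8 + [0] * 2, [0.8] * 10))
# Overconfident: says 0.9 but is right only 5 times out of 10.
print(expected_calibration_error([1] * 5 + [0] * 5, [0.9] * 10))
```

The first call returns approximately 0 and the second approximately 0.4, the gap between the stated 90% and the observed 50%.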
How to fix poor calibration:
- Platt scaling: Fit a logistic regression on the model's outputs using a validation set
- Temperature scaling: Divide logits by a learned temperature parameter T (simplest, works well for neural networks)
- Isotonic regression: Fit a non-decreasing piecewise function (more flexible, needs more data)
Calibration is an advanced topic that impresses interviewers. If you're asked "how would you deploy this model for credit risk scoring?" and you mention calibration - explaining that the predicted probability needs to match the actual default rate for the bank's pricing models to work - you're demonstrating production awareness that most candidates lack.
Part 3 - Ranking Metrics
For search engines, recommendation systems, and information retrieval, we care about the order of results, not just binary correctness.
Precision@k and Recall@k
Precision@k: Of the top k results, how many are relevant?
Precision@k = |relevant items in top k| / k
Recall@k: Of all relevant items, how many appear in the top k?
Recall@k = |relevant items in top k| / |total relevant items|
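Both cutoff metrics are one-liners once the ranked list is encoded as 0/1 relevance flags. The sketch below assumes that encoding; the function names and the example ranking are hypothetical.

```python
def precision_at_k(ranked_relevance, k):
    """ranked_relevance: 0/1 flags in ranked order, best-scored item first."""
    return sum(ranked_relevance[:k]) / k

def recall_at_k(ranked_relevance, k, total_relevant):
    """total_relevant counts every relevant item, including ones never retrieved."""
    return sum(ranked_relevance[:k]) / total_relevant

# Hypothetical query with 4 relevant items, 3 of which appear in the list.
ranked = [1, 0, 1, 0, 0, 1, 0]
print(precision_at_k(ranked, 3), recall_at_k(ranked, 3, total_relevant=4))
```

At k = 3 this gives precision 2/3 and recall 2/4. Note that recall needs the total number of relevant items for the query, which is often the hard part to obtain in practice.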
Mean Reciprocal Rank (MRR)
For queries where there's one "correct" answer: What's the rank of the first correct result?
MRR = (1/Q) * sum_{q=1}^{Q} 1/rank_q
Example: If the correct results appear at positions [3, 1, 5] for three queries: MRR = (1/3 + 1/1 + 1/5) / 3 = 0.51
Use when: There's a single correct answer (question answering, entity resolution, "I'm feeling lucky" search).
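The worked example above can be checked in a couple of lines. In this sketch a rank of `None` stands for a query where no correct result was returned and contributes 0; that convention and the function name are assumptions of the example.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Ranks are 1-based; None means the correct result was not returned."""
    reciprocal = [1 / r if r else 0.0 for r in first_correct_ranks]
    return sum(reciprocal) / len(reciprocal)

print(round(mean_reciprocal_rank([3, 1, 5]), 2))
```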
Mean Average Precision (MAP)
For queries with multiple relevant results: Compute precision at every relevant position, average.
AP(q) = (1/R_q) * sum_{k=1}^{n} Precision@k * rel(k)
MAP = (1/Q) * sum_{q=1}^{Q} AP(q)
where rel(k) = 1 if item at position k is relevant, 0 otherwise, and R_q is the total number of relevant documents for query q.
Example: Query with 3 relevant docs. Ranked list: [R, N, R, N, R]
- Precision@1 = 1/1 (relevant) -> counted
- Position 2: not relevant -> skip
- Precision@3 = 2/3 (relevant) -> counted
- Position 4: not relevant -> skip
- Precision@5 = 3/5 (relevant) -> counted
- AP = (1.0 + 0.667 + 0.6) / 3 = 0.756
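The same calculation as a short sketch, extended to MAP over several queries. The function names are hypothetical; `total_relevant` defaults to the number of relevant items in the list, which assumes every relevant document was retrieved.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """AP for one query; ranked_relevance holds 0/1 flags in ranked order."""
    if total_relevant is None:
        total_relevant = sum(ranked_relevance)  # assumes nothing relevant is missing
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k, counted only at relevant positions
    return total / total_relevant if total_relevant else 0.0

def mean_average_precision(queries):
    return sum(average_precision(q) for q in queries) / len(queries)

print(round(average_precision([1, 0, 1, 0, 1]), 3))
print(round(mean_average_precision([[1, 0, 1, 0, 1], [0, 1, 0, 0, 0]]), 3))
```

The first call reproduces the 0.756 from the example; adding a second query whose only relevant document sits at rank 2 (AP = 0.5) gives a MAP of about 0.628.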
Normalized Discounted Cumulative Gain (NDCG)
The gold standard for ranking metrics. Unlike MAP (which uses binary relevance), NDCG handles graded relevance (e.g., "highly relevant" = 3, "somewhat relevant" = 1, "not relevant" = 0).
DCG@k = sum_{i=1}^{k} (2^{rel_i} - 1) / log2(i + 1)
NDCG@k = DCG@k / IDCG@k
where IDCG is the DCG of the ideal (perfectly sorted) ranking.
Key properties:
- Ranges from 0 to 1 (normalized)
- Position-weighted: top positions count more, with a logarithmic discount for lower positions
- Handles graded relevance (not just binary)
- The log discount means: position 1 is worth about 1.58x position 2 and 2x position 3
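The DCG and NDCG formulas can be sketched as follows. One simplifying assumption: the ideal ranking is built by re-sorting the same list, which is only the true ideal if every relevant item for the query is already in it. The function names and the example relevance grades are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Sum of (2^rel - 1) / log2(position + 1) over the top k positions."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance in ranked order: the best item (3) is second.
print(round(ndcg_at_k([2, 3, 0, 1], k=4), 3))

# Discount weights for positions 1-3: 1.0, about 0.63, 0.5.
print([round(1 / math.log2(i + 1), 2) for i in (1, 2, 3)])
```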
"NDCG measures how well a ranking system orders results by relevance, accounting for position bias - users look at top results more than bottom ones. It uses graded relevance (not just relevant/irrelevant) and normalizes by the ideal ranking. The formula has two key components: the numerator (2^relevance - 1) rewards highly relevant items exponentially, and the denominator (log2(position + 1)) discounts items at lower positions logarithmically. NDCG = 1 means perfect ranking. I'd use NDCG when relevance is graded (like search with highly/somewhat/not relevant labels) and MAP when relevance is binary."
Ranking Metrics Comparison
| Metric | Relevance Type | Position-Weighted | Use Case |
|---|---|---|---|
| Precision@k | Binary | Equal weight for top k | "How good is page 1?" |
| Recall@k | Binary | No | "Did we find everything?" |
| MRR | Binary (single answer) | First correct only | QA, entity search |
| MAP | Binary (multiple answers) | Yes (precision at each relevant position) | Document retrieval |
| NDCG@k | Graded (multi-level) | Yes (log discount) | Search ranking, recommendations |
Part 4 - Generation and NLP Metrics
Perplexity
The standard metric for language models. Measures how "surprised" the model is by the test data.
PPL = exp(-(1/N) * sum_{i=1}^{N} log p(w_i | w_1, ..., w_{i-1}))
Equivalent to the exponential of the average cross-entropy loss.
Interpretation:
- PPL = 10 means the model is as "confused" as if it had to choose uniformly among 10 options at each step
- Lower is better
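- GPT-2 reported zero-shot perplexity on WikiText-103 of roughly 37 for its smallest model down to roughly 17 for its largest (1.5B parameters). GPT-3 reported roughly 20 on Penn Treebank, a different dataset, so the two are not directly comparable. GPT-4: not disclosed.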
Limitations:
- Only comparable across models using the same tokenizer and vocabulary
- Doesn't measure factual accuracy, coherence, or usefulness
- A model can have low perplexity but generate repetitive, boring text
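The "uniform choice among 10 options" reading follows directly from the formula. A minimal sketch, assuming the input is the list of probabilities the model assigned to each actual next token (the function name is hypothetical):

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that gives every observed token probability 0.1 has perplexity 10.
print(perplexity([0.1] * 20))

# One very unlikely token raises perplexity for the whole sequence.
print(perplexity([0.5, 0.5, 0.5, 0.001]))
```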
BLEU (Bilingual Evaluation Understudy)
Measures n-gram overlap between generated text and reference translations. Originally designed for machine translation.
BLEU = BP * exp(sum_{n=1}^{4} w_n * log(precision_n))
where BP is a brevity penalty and precision_n is the modified precision for n-grams.
Key details:
- Modified precision: each n-gram in the candidate can match at most as many times as it appears in the reference (prevents gaming by repeating common words)
- Standard BLEU uses n=1,2,3,4 with equal weights (w_n = 0.25)
- Brevity penalty penalizes short translations: BP = min(1, exp(1 - r/c)) where r is reference length, c is candidate length
Limitations:
- Doesn't capture meaning - a paraphrase with different words gets a low BLEU score
- Sensitive to tokenization and preprocessing choices
- Not great for open-ended generation (summarization, dialogue)
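To see how clipping and the brevity penalty interact, here is a simplified sentence-level sketch with a single reference and no smoothing. Real toolkits work at corpus level, support multiple references, and usually apply smoothing, so treat this as an illustration of the formula rather than a replacement for them; all function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Clip each candidate n-gram count at its count in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is 0 if any n-gram order has no match
    log_avg = sum(math.log(p) for p in precisions) / max_n  # equal weights
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(log_avg)

reference = "the cat is on the mat".split()
print(modified_precision(["the"] * 7, reference, 1))             # clipped to 2/7
print(bleu("the cat is on the mat today".split(), reference))
print(bleu("the cat sat on the mat".split(), reference))         # no 4-gram match
```

The last call returns 0: changing a single word in a six-word sentence removes every matching 4-gram, which is one reason unsmoothed sentence-level BLEU is harsh on paraphrases.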
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Measures recall of n-grams from the reference in the generated text. Primarily used for summarization.
Variants:
- ROUGE-1: Unigram recall
- ROUGE-2: Bigram recall
- ROUGE-L: Longest Common Subsequence (captures sentence-level structure)
- ROUGE-Lsum: ROUGE-L applied to summaries (sentence-level then aggregated)
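The recall-oriented variants can be sketched in a similar way. This minimal version computes only the recall component for a single reference (published ROUGE scores usually also report precision and F-measure); the function names are hypothetical.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also appear in the candidate (clipped)."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(candidate, reference):
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(rouge_n_recall(candidate, reference, 1))  # 5 of 6 unigrams
print(rouge_n_recall(candidate, reference, 2))  # 3 of 5 bigrams
print(rouge_l_recall(candidate, reference))     # LCS "the cat on the mat" = 5 of 6
```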
BLEU vs. ROUGE:
| Aspect | BLEU | ROUGE |
|---|---|---|
| Focus | Precision (are generated n-grams in the reference?) | Recall (are reference n-grams in the generated text?) |
| Best for | Machine translation | Summarization |
| Multiple references | Matches against all references | Matches against all references |
| Limitation | Penalizes valid paraphrases | High recall possible by generating very long text |
BERTScore
Uses contextual embeddings (from BERT) to compute similarity between generated and reference text at the token level. Handles paraphrases better than BLEU/ROUGE because it compares meanings, not surface forms.
Advantages over BLEU/ROUGE:
- Captures semantic similarity, not just lexical overlap
- Better correlation with human judgment
- Handles paraphrases and synonyms
Disadvantages:
- Slower to compute (requires BERT inference)
- Not as interpretable as n-gram counts
- Results depend on which BERT model is used
METEOR
Improves on BLEU by incorporating stemming, synonyms, and a recall component:
- Matches words after stemming ("running" matches "runs")
- Uses WordNet synonyms ("good" matches "excellent")
- Computes both precision and recall, combining with harmonic mean weighted toward recall
- Adds a fragmentation penalty for matches that aren't contiguous
Generally correlates better with human judgment than BLEU for machine translation.
Human Evaluation
For open-ended generation (chatbots, creative writing, general summarization), automated metrics correlate poorly with human preference. Human evaluation methods:
- Likert scale rating: Rate outputs 1-5 on fluency, relevance, factual accuracy
- Pairwise comparison: "Which response is better, A or B?" (used for RLHF)
- Win rate: Fraction of comparisons won against a baseline (Chatbot Arena approach)
- Best-worst scaling: Identify best and worst among several options (more reliable than Likert)
Candidates sometimes cite BLEU as the metric for evaluating chatbots or summarization systems. BLEU was designed for machine translation where there's a specific correct answer. For open-ended generation, BLEU has very low correlation with human judgment. Strong candidates mention BERTScore for automated evaluation and human evaluation (with specific protocols) as the gold standard, noting the cost-quality tradeoff.
Generation Metrics Comparison
| Metric | Measures | Best For | Correlation with Humans |
|---|---|---|---|
| Perplexity | Language model quality | LM evaluation, pretraining | Moderate (good proxy for fluency) |
| BLEU | N-gram precision | Machine translation | Moderate for MT, low for open-ended |
| ROUGE | N-gram recall | Summarization | Moderate |
| BERTScore | Semantic similarity | Any generation task | High |
| METEOR | N-gram + synonyms + stemming | Machine translation | Better than BLEU |
| CIDEr | TF-IDF weighted n-grams | Image captioning | Moderate |
| Human eval | Everything | Any task | Gold standard (by definition) |
Part 5 - Regression Metrics and Special Cases
Regression Metrics
| Metric | Formula | Interpretation | When to Use |
|---|---|---|---|
| MAE | mean(abs(y - y_hat)) | Average absolute error in original units | When outliers shouldn't dominate |
| MSE | mean((y - y_hat)^2) | Average squared error | When large errors are especially bad |
| RMSE | sqrt(MSE) | Error in original units (like MAE) | Same as MSE, more interpretable |
| MAPE | mean(abs(y - y_hat) / abs(y)) * 100% | Percentage error | When relative error matters |
| R^2 | 1 - SS_res / SS_tot | Proportion of variance explained (0 to 1 for reasonable models) | Model comparison, reporting |
"MAE vs MSE comes down to how you want to treat outliers. MSE squares the errors, so a prediction that's off by 10 contributes 100 to the loss - it heavily penalizes large errors. MAE treats all errors linearly. If being off by 10 is exactly ten times as bad for your business as being off by 1, use MAE. If being off by 10 is far worse than ten times as bad, use MSE. Practically, MSE leads to the mean as the optimal prediction, while MAE leads to the median. For data with outliers or heavy-tailed error distributions, MAE is more robust."
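The mean-versus-median point can be checked on a toy dataset with one outlier. The data and helper names below are made up for illustration; the comparison is between two constant predictions, the mean and the median of the targets.

```python
import statistics

def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

y = [10, 12, 11, 13, 100]  # hypothetical targets with one outlier
predict_mean = [statistics.mean(y)] * len(y)      # constant prediction 29.2
predict_median = [statistics.median(y)] * len(y)  # constant prediction 12

print("MSE  mean vs median:", mse(y, predict_mean), mse(y, predict_median))
print("MAE  mean vs median:", mae(y, predict_mean), mae(y, predict_median))
```

Predicting the mean gives the lower MSE, and predicting the median gives the lower MAE; the single outlier pulls the MSE-optimal prediction far from the typical value.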
R-Squared Pitfalls
- R^2 can be negative (model is worse than predicting the mean)
- For least-squares fits, R^2 on the training data never decreases when you add more features, even useless ones (use adjusted R^2 to penalize model complexity)
- R^2 = 0.8 does NOT mean the model explains 80% of the data - it means it explains 80% of the variance
- High R^2 doesn't guarantee good predictions - it can mask systematic bias
- R^2 is not comparable across different datasets or different target variables
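The first pitfall is easy to demonstrate from the definition R^2 = 1 - SS_res / SS_tot. A minimal sketch with made-up numbers (the function name is hypothetical, and it assumes the targets are not all identical so SS_tot is non-zero):

```python
def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in y)            # total sum of squares
    return 1 - ss_res / ss_tot

y = [1, 2, 3]
print(r_squared(y, [2, 2, 2]))  # always predicting the mean
print(r_squared(y, [3, 2, 1]))  # predictions that trend the wrong way
```

Predicting the mean scores exactly 0, and the reversed predictions score -3, which is well below what "proportion of variance explained" would suggest is possible.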
MAPE Limitations
- Undefined when actual values are zero (division by zero)
- Asymmetric: penalizes over-predictions more than under-predictions
- Biased toward models that under-predict
- Alternative: Symmetric MAPE (SMAPE) or MASE (Mean Absolute Scaled Error)
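The asymmetry comes from the error being bounded on one side only: for non-negative forecasts, an under-prediction can cost at most 100%, while an over-prediction has no ceiling. A small sketch with hypothetical numbers (the function raises `ZeroDivisionError` if any actual value is zero, matching the first limitation):

```python
def mape(y, y_hat):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(a - b) / abs(a) for a, b in zip(y, y_hat)) / len(y)

actual = [100]
print(mape(actual, [0]))    # worst possible under-prediction
print(mape(actual, [300]))  # an over-prediction can exceed it
```

The worst under-prediction scores 100%, while predicting 300 scores 200%, so a model tuned on MAPE is nudged toward predicting low.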
Segmented Evaluation
Always evaluate metrics on meaningful segments, not just overall:
Overall AUC-ROC: 0.92
- New users: 0.78 (much worse!)
- Power users: 0.97
- Users with < 5 interactions: 0.65 (terrible!)
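A segment breakdown like the one above can be produced by grouping rows before computing the metric. The sketch below uses a simple pairwise AUC and a tiny made-up dataset; the function names, segment names, and scores are all hypothetical.

```python
from collections import defaultdict

def pairwise_auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_segment(rows):
    """rows: iterable of (segment, label, score) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for segment, label, score in rows:
        grouped[segment][0].append(label)
        grouped[segment][1].append(score)
    return {seg: pairwise_auc(labels, scores)
            for seg, (labels, scores) in grouped.items()}

rows = [
    ("power", 1, 0.9), ("power", 0, 0.2), ("power", 1, 0.8), ("power", 0, 0.1),
    ("new", 1, 0.4), ("new", 0, 0.6), ("new", 1, 0.7), ("new", 0, 0.3),
]
overall = pairwise_auc([r[1] for r in rows], [r[2] for r in rows])
per_segment = auc_by_segment(rows)
print(overall, per_segment)
```

On this toy data the overall AUC is about 0.94, while the "new" segment sits at 0.75, the kind of gap an aggregate number hides.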
An experienced ML engineer always asks: "What's the metric breakdown by segment?" In production, overall metrics hide critical failures. I've seen models with 0.95 AUC-ROC that performed at random for a specific user cohort responsible for 30% of revenue. In an interview, if you're asked "how would you evaluate this model?" and you only talk about aggregate metrics without mentioning segmented evaluation, you're missing a key production insight.
Fairness Metrics
Increasingly important in ML interviews, especially at companies deploying models that affect people:
- Demographic parity: Positive prediction rate is equal across groups
- Equalized odds: TPR and FPR are equal across groups
- Predictive parity: Precision is equal across groups
- Calibration across groups: Model is well-calibrated for each demographic group
These metrics often conflict - when base rates differ across groups, an imperfect classifier cannot satisfy all of these criteria simultaneously (impossibility theorem by Chouldechova, 2017). The choice depends on the application context and legal requirements.
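The first two criteria can be compared by computing per-group rates. This is a minimal sketch on a made-up dataset of eight rows; the function name, group names, and data are hypothetical, and a real audit would need far more data plus confidence intervals.

```python
def group_rates(rows):
    """rows: iterable of (group, y_true, y_pred) with 0/1 labels and predictions."""
    counts = {}
    for group, y, pred in rows:
        c = counts.setdefault(group, {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0, "fp": 0})
        c["n"] += 1
        c["pred_pos"] += pred
        c["pos"] += y
        c["tp"] += y * pred
        c["fp"] += (1 - y) * pred

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        g: {
            "positive_rate": ratio(c["pred_pos"], c["n"]),  # demographic parity
            "tpr": ratio(c["tp"], c["pos"]),                # equalized odds (part 1)
            "fpr": ratio(c["fp"], c["n"] - c["pos"]),       # equalized odds (part 2)
        }
        for g, c in counts.items()
    }

rows = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 0),  # base rate 50%
    ("B", 1, 1), ("B", 0, 1), ("B", 0, 0), ("B", 0, 0),  # base rate 25%
]
rates = group_rates(rows)
print(rates)
```

Both groups receive positive predictions at the same 50% rate, so demographic parity holds, yet group B's false positive rate is 1/3 versus 0 for group A, so equalized odds does not.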
Part 6 - The Metric Selection Decision Tree
Practice Problems
Problem 1: The Imbalanced Dataset
You're building a cancer screening model. The dataset has 1% positive (cancer) and 99% negative. Your model achieves 98% accuracy, 40% precision, and 95% recall. Is this model good? What single metric would you report to stakeholders?
Hint 1 - Direction
Consider what each error type means clinically. What's the consequence of a false negative (missed cancer) vs. a false positive (unnecessary biopsy)?
Hint 2 - Insight
For cancer screening, recall is critical - missing a cancer case can be fatal. The 95% recall means we catch 95% of cancers. The 40% precision means 60% of flagged patients don't have cancer (false alarms). Is that acceptable? Consider what happens after a positive screening result.
Hint 3 - Full Solution + Rubric
Analysis:
- 98% accuracy is below the 99% baseline (predicting all negative). So accuracy tells us the model does something, but it's not a useful metric here.
- 95% recall means we miss 5% of cancers. Whether this is acceptable depends on the screening context - for a first-pass screening that's followed by specialist review, this may be fine.
- 40% precision means 60% of flagged patients get unnecessary follow-up testing. In a screening context, this is often acceptable - the cost of a follow-up test is much lower than the cost of missing cancer.
- By accuracy the model is worse than the trivial baseline (98% vs. 99%), but by recall it is far better (95% vs. the baseline's 0%).
Recommended metrics:
- Primary: Recall (sensitivity) at a minimum threshold (e.g., "must catch 95%+ of cancers")
- Secondary: Precision (positive predictive value) to understand false alarm burden
- Summary: AUC-PR (better than AUC-ROC for 1% positive rate)
- For stakeholders: "The model catches 95 out of 100 cancers, but 60% of alerts are false alarms requiring follow-up testing"
This model is good for a screening application where false negatives are much more costly than false positives. It would NOT be good for a confirmatory diagnostic test.
Scoring Rubric:
- Strong Hire: Immediately recognizes accuracy is misleading, discusses cost asymmetry (missed cancer vs false alarm), recommends recall as primary metric, mentions AUC-PR, discusses screening vs diagnostic context
- Lean Hire: Knows accuracy is misleading, recommends F1 or AUC-ROC, but doesn't fully articulate the cost asymmetry
- No Hire: Says "98% accuracy is good" or only recommends F1 without cost discussion
Problem 2: Ranking System Evaluation
You're evaluating a product search engine. For the query "running shoes," the top 5 results are: [relevant, irrelevant, highly relevant, irrelevant, relevant]. Compute Precision@5, MRR, and NDCG@5 (use relevance scores: highly relevant=3, relevant=1, irrelevant=0).
Hint 1 - Direction
Apply each formula carefully. For NDCG, you need to compute DCG first (using the graded relevance), then the ideal DCG (sort by relevance), then divide.
Hint 2 - Insight
Precision@5 treats everything as binary (relevant or not). MRR only cares about the first relevant result. NDCG uses the graded scores and discounts by position. This problem illustrates why different metrics capture different aspects of ranking quality.
Hint 3 - Full Solution + Rubric
Precision@5:
- Relevant items in top 5: 3 (positions 1, 3, 5)
- Precision@5 = 3/5 = 0.60
MRR:
- First relevant result is at position 1
- RR = 1/1 = 1.0
- (If this were one of many queries, we'd average across queries)
NDCG@5:
- Relevance scores: [1, 0, 3, 0, 1]
- DCG@5 = (2^1 - 1)/log2(2) + (2^0 - 1)/log2(3) + (2^3 - 1)/log2(4) + (2^0 - 1)/log2(5) + (2^1 - 1)/log2(6)
- DCG@5 = 1/1 + 0/1.585 + 7/2 + 0/2.322 + 1/2.585
- DCG@5 = 1.0 + 0 + 3.5 + 0 + 0.387 = 4.887
- Ideal order: [3, 1, 1, 0, 0] (sort by relevance descending)
- IDCG@5 = (2^3 - 1)/log2(2) + (2^1 - 1)/log2(3) + (2^1 - 1)/log2(4) + 0 + 0
- IDCG@5 = 7/1 + 1/1.585 + 1/2 + 0 + 0
- IDCG@5 = 7.0 + 0.631 + 0.5 = 8.131
- NDCG@5 = 4.887 / 8.131 = 0.601
Key insight: MRR = 1.0 (perfect!) because the first result is relevant, but NDCG = 0.601 because the highly relevant item is at position 3 instead of position 1. NDCG captures that the best result should be first.
Scoring Rubric:
- Strong Hire: Correct computation of all three, explains what each metric captures differently, notes that MRR misses the position of the "highly relevant" item
- Lean Hire: Gets Precision@5 and MRR correct, struggles with NDCG computation but understands the concept
- No Hire: Cannot compute any metric correctly or confuses the formulas
Problem 3: Metric for a Chatbot
You're evaluating a customer service chatbot. The PM asks you to "measure how good the chatbot is." What metrics would you propose and how would you collect them?
Hint 1 - Direction
Think about what "good" means for a customer service chatbot. There are multiple dimensions: factual accuracy, helpfulness, response quality, conversation efficiency, and user satisfaction.
Hint 2 - Insight
Automated metrics (BLEU, ROUGE) are nearly useless for chatbot evaluation because there's no single "correct" response. You need a combination of automated proxy metrics (resolution rate, conversation length), offline human evaluation (correctness ratings), and online metrics (user satisfaction, escalation rate).
Hint 3 - Full Solution + Rubric
Proposed metric framework:
Online metrics (production):
- Resolution rate: % of conversations resolved without human escalation
- Conversation length: Average turns to resolution (shorter is usually better)
- User satisfaction: Post-conversation rating (1-5 stars or thumbs up/down)
- Repeat contact rate: Does the user come back with the same issue?
- Escalation rate: % of conversations escalated to human agent
Offline evaluation (development):
- Factual accuracy: Human raters grade responses for correctness (sampled)
- Relevance: Does the response address the user's question? (1-5 scale)
- Hallucination rate: % of responses containing fabricated information
- Safety: % of responses that violate content policies
Automated proxies (fast iteration):
- BERTScore against known-good responses (for common questions)
- Intent classification accuracy (does the chatbot understand the query?)
- Response latency
What NOT to use:
- BLEU/ROUGE (no single reference answer for open-ended dialogue)
- Perplexity alone (doesn't measure helpfulness or accuracy)
Data collection:
- A/B testing for online metrics (new model vs baseline)
- Human evaluation on a sampled set of conversations (weekly cadence)
- Automated pipeline for BERTScore on a regression test set
Scoring Rubric:
- Strong Hire: Proposes a multi-layered framework (online + offline + automated), explains why standard NLP metrics don't work, mentions specific collection methods, discusses tradeoffs between evaluation speed and quality
- Lean Hire: Mentions some good metrics (resolution rate, user satisfaction) but is missing the systematic framework
- No Hire: Proposes BLEU or accuracy as the primary metric
Problem 4: A/B Test Metrics
You've deployed a new recommendation model. The A/B test shows: click-through rate improved by 3%, but average session duration decreased by 8%. Should you ship the new model?
Hint 1 - Direction
Think about what these metrics actually measure and what business goal they serve. Higher CTR with lower session duration could mean different things depending on context.
Hint 2 - Insight
Higher CTR could mean better recommendations (users find what they want faster) OR clickbait-style recommendations (users click but bounce). Lower session duration could mean efficiency (users accomplish their goal faster) OR dissatisfaction (users leave). You need to dig deeper.
Hint 3 - Full Solution + Rubric
Do NOT ship immediately. Investigate further.
Possible interpretations:
- Good scenario: Better recommendations help users find what they want faster. CTR up, session duration down because they accomplish their goal in fewer steps. Check: Did conversion rate also increase? Did bounce rate decrease?
- Bad scenario: The model recommends clickbait content. Users click more but get disappointed, leading to shorter sessions. Check: What's the bounce rate after clicking? Did return visit rate change?
Additional metrics to check:
- Conversion rate (purchases, signups) - the ultimate business metric
- Bounce rate after clicking a recommendation
- Return visit rate (do users come back?)
- Revenue per session
- Downstream engagement metrics (time on clicked item, scroll depth)
- Statistical significance of both changes (is -8% session duration even significant?)
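The significance check in the last bullet can be sketched for the CTR change with a two-proportion z-test. The function name and the click and impression counts below are made up for illustration. Session duration is a continuous metric, so it would need a test on means instead (for example Welch's t-test).

```python
import math


def two_proportion_z_test(clicks_a: int, n_a: int,
                          clicks_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both arms have the same CTR."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled CTR under the null hypothesis that the arms do not differ.
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: control CTR 10.0%, treatment CTR 10.3% (a 3% relative lift).
z, p = two_proportion_z_test(clicks_a=10_000, n_a=100_000,
                             clicks_b=10_300, n_b=100_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these illustrative counts the lift clears the usual 0.05 level; the same relative lift on a tenth of the traffic would not, which is why the raw percentage alone says little.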
Decision framework:
- If conversion rate is up and bounce rate is down: SHIP (users are finding what they want faster)
- If conversion rate is flat and bounce rate is up: DO NOT SHIP (clickbait)
- If results are mixed: Run the test longer, add guardrail metrics
Key principle: Never make shipping decisions on a single metric. Define a primary metric (usually revenue or conversion) and guardrail metrics (session duration, return rate) before the test starts.
Scoring Rubric:
- Strong Hire: Does not give a yes/no answer immediately, proposes additional metrics to investigate, considers multiple interpretations, mentions guardrail metrics and statistical significance
- Lean Hire: Recognizes the ambiguity, suggests looking at conversion, but is missing the systematic investigation framework
- No Hire: Says "ship it, CTR is up" or "don't ship, session duration is down" without further analysis
Problem 5: Calibration in Practice
Your ad click prediction model outputs probability 0.3 for a set of ads, but the actual click rate for those ads is 0.1. Is this a problem? How would you fix it?
Hint 1 - Direction
Think about what miscalibration means for an ad system. How are predicted probabilities used in ad auction mechanics?
Hint 2 - Insight
In ad systems, the bid is typically: bid = p(click) * value_per_click. If p(click) is over-estimated by 3x, the system will over-bid and overspend. This directly costs money.
Hint 3 - Full Solution + Rubric
Yes, this is a serious problem. The model over-predicts the click probability by 3x for these ads (predicted 0.3 vs. observed 0.1).
Business impact:
- Ad auction bid = p(click) * value_per_click
- Overestimating p(click) by 3x means overbidding by 3x
- Depending on the pricing model, the platform either overcharges advertisers (up to 3x what the clicks are worth) or skews internal ad allocation toward ads with inflated click predictions
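A quick worked example of the bid formula above makes the 3x effect concrete. The $2.00 value per click is an assumed number for illustration; the two probabilities come from the problem statement.

```python
def expected_value_bid(p_click: float, value_per_click: float) -> float:
    # bid = p(click) * value_per_click
    return p_click * value_per_click


value_per_click = 2.00   # assumed dollar value of one click (illustrative)
predicted_p = 0.3        # model output from the problem statement
actual_p = 0.1           # observed click rate from the problem statement

bid_from_model = expected_value_bid(predicted_p, value_per_click)
bid_if_calibrated = expected_value_bid(actual_p, value_per_click)
overbid_ratio = bid_from_model / bid_if_calibrated

print(f"model bid: ${bid_from_model:.2f}, "
      f"calibrated bid: ${bid_if_calibrated:.2f}, "
      f"ratio: {overbid_ratio:.1f}x")
```

The ratio does not depend on the assumed value per click: any multiplicative error in p(click) passes straight through to the bid.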
Diagnosis:
- Plot a reliability diagram across all probability bins (not just 0.3)
- Compute ECE (Expected Calibration Error) to quantify the overall miscalibration
- Check if miscalibration is uniform or varies by segment (ad type, user segment, device)
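The first two diagnosis steps can be sketched together: the per-bin rows are the points of a reliability diagram, and ECE is their count-weighted average gap. This is a minimal version assuming equal-width bins and binary labels; the function name and return format are illustrative choices, not a standard API.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins: int = 10):
    """Return (ece, rows); rows hold (mean predicted, observed rate, count) per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges yields bin ids in 0 .. n_bins - 1.
    bin_ids = np.digitize(y_prob, edges[1:-1])

    ece, rows = 0.0, []
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # skip empty bins
        mean_pred = y_prob[mask].mean()   # average predicted probability in the bin
        frac_pos = y_true[mask].mean()    # observed positive rate in the bin
        # Weight each bin's gap by the share of samples that fall into it.
        ece += mask.mean() * abs(mean_pred - frac_pos)
        rows.append((float(mean_pred), float(frac_pos), int(mask.sum())))
    return float(ece), rows
```

For the segment-level check, the same function can be called once per segment (ad type, device, and so on) and the results compared.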
Fix options (in order of simplicity):
- Temperature scaling: Learn a single scalar T such that p_calibrated = sigmoid(logit / T). Simplest, often sufficient.
- Platt scaling: Fit logistic regression on a held-out validation set: p_calibrated = sigmoid(a * logit + b). Adds a bias term.
- Isotonic regression: Non-parametric calibration. More flexible but needs more data and can overfit.
- Retrain with calibration objective: Add a calibration loss term during training. More complex.
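A minimal sketch of two of the post-hoc options above, using scikit-learn. The helper names and the synthetic data (a model that over-predicts by 3x everywhere) are assumptions for illustration. Temperature scaling is the special case of the Platt form with a = 1/T and b = 0.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression


def fit_platt(logits_val, y_val):
    """Platt scaling: p = sigmoid(a * logit + b), fit on held-out data."""
    lr = LogisticRegression(C=1e6)  # large C: effectively no regularization
    lr.fit(np.asarray(logits_val).reshape(-1, 1), y_val)
    return lambda logits: lr.predict_proba(
        np.asarray(logits).reshape(-1, 1))[:, 1]


def fit_isotonic(probs_val, y_val):
    """Isotonic regression: monotone, non-parametric map from raw to calibrated."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(probs_val, y_val)
    return iso.predict


# Synthetic validation set (made up): the true click probability is one
# third of what the model predicts.
rng = np.random.default_rng(0)
raw_probs = rng.uniform(0.05, 0.6, size=20_000)
y = rng.binomial(1, raw_probs / 3)
logits = np.log(raw_probs / (1 - raw_probs))

platt = fit_platt(logits, y)
iso = fit_isotonic(raw_probs, y)

# Both should map a raw prediction of 0.3 to roughly 0.1.
print(platt(np.array([np.log(0.3 / 0.7)])), iso(np.array([0.3])))
```

In practice the calibrator is fit on a held-out split and checked on a separate one, since isotonic regression in particular can memorize noise on small validation sets.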
Validation: After calibration, re-check the reliability diagram on held-out data. The calibrated model should show predicted ≈ actual across all bins.
Scoring Rubric:
- Strong Hire: Explains business impact (overbidding in ad auctions), proposes calibration methods in order of complexity, mentions reliability diagram for validation, discusses segment-level calibration
- Lean Hire: Recognizes the problem and suggests Platt scaling, but misses the business impact discussion
- No Hire: Doesn't understand why miscalibration matters or suggests retraining from scratch
Interview Cheat Sheet
| Topic | Key Fact | When to Mention |
|---|---|---|
| Accuracy | Misleading for imbalanced data; trivial baseline can beat it | Any imbalanced classification question |
| Precision | TP/(TP+FP); high precision = few false alarms | When FP cost is high (spam, content moderation) |
| Recall | TP/(TP+FN); high recall = few missed positives | When FN cost is high (fraud, medical, safety) |
| F1 | Harmonic mean of P and R; F-beta for asymmetric costs | Balanced cost classification |
| AUC-ROC | Threshold-independent; probability that a random positive is ranked above a random negative | Balanced datasets, model comparison |
| AUC-PR | Better than AUC-ROC for imbalanced data; focuses on positive class | Any imbalanced classification |
| Log loss | Evaluates probability quality, not just binary prediction | When calibrated probabilities matter |
| NDCG | Graded relevance + position discount; normalized by ideal ranking | Search and recommendation ranking |
| MAP | Binary relevance, precision at each relevant position | Document retrieval |
| MRR | 1/rank of first correct result | Single-answer retrieval (QA) |
| BLEU | N-gram precision + brevity penalty; bad for open-ended generation | Machine translation only |
| ROUGE | Recall-oriented overlap: ROUGE-N uses n-grams, ROUGE-L uses longest common subsequence | Summarization |
| Perplexity | exp(cross-entropy); lower is better; LM quality | Language model evaluation |
| Calibration | Predicted probability matches true frequency; critical for bidding/pricing | Risk scoring, ad systems |
| Segmented eval | Always check metrics by user/item segments | Any production ML question |
| MAE vs MSE | MAE: robust to outliers, minimized by the median; MSE: penalizes large errors, minimized by the mean | Regression metric choice |
| Fairness | Demographic parity, equalized odds - generally cannot both be satisfied when base rates differ across groups | User-facing models, hiring, lending |
Spaced Repetition Checkpoints
Day 0 - Immediate Recall
- Write the formulas for precision, recall, F1 from memory
- Explain why accuracy is misleading for imbalanced data in one sentence
- Name one scenario where AUC-PR is better than AUC-ROC
- Define NDCG in one sentence
Day 3 - Active Recall
- Without notes: When would you use F2 vs F0.5? Give an example for each.
- Explain the difference between micro, macro, and weighted F1
- Compute NDCG@3 for a ranking [relevant, irrelevant, highly relevant] with scores [1, 0, 3]
- Why is BLEU bad for chatbot evaluation?
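To check the NDCG@3 answer after attempting it by hand, here is a small sketch. It assumes the log2 position discount; because conventions differ on the gain term, it supports both linear gain (rel) and exponential gain (2^rel - 1). Use whichever matches the definition you learned.

```python
import math


def dcg_at_k(rels, k, exponential_gain=False):
    total = 0.0
    for i, rel in enumerate(rels[:k]):
        gain = (2 ** rel - 1) if exponential_gain else rel
        total += gain / math.log2(i + 2)  # index i corresponds to rank i + 1
    return total


def ndcg_at_k(rels, k, exponential_gain=False):
    # The ideal ranking sorts the same items by relevance, highest first.
    ideal = sorted(rels, reverse=True)
    idcg = dcg_at_k(ideal, k, exponential_gain)
    return dcg_at_k(rels, k, exponential_gain) / idcg if idcg > 0 else 0.0


linear = ndcg_at_k([1, 0, 3], k=3)
exponential = ndcg_at_k([1, 0, 3], k=3, exponential_gain=True)
print(round(linear, 4), round(exponential, 4))
```

The two conventions give roughly 0.69 and 0.59 here, so state which gain function you are using when you quote an NDCG number.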
Day 7 - Application
- Design the evaluation metrics for a fraud detection system at a bank. Include primary, secondary, and guardrail metrics.
- Explain model calibration to a product manager. Why should they care?
- Given an imbalanced classification problem, walk through threshold selection using business costs.
Day 14 - Synthesis
- Compare the evaluation strategy for: (a) image classification, (b) product search ranking, (c) text summarization, (d) chatbot. What metrics for each and why?
- An A/B test shows improved NDCG but decreased CTR. What does this mean and what would you investigate?
- Design a comprehensive evaluation pipeline for a new recommendation system from scratch.
Day 21 - Interview Simulation
- "Our model has 99.9% accuracy on production data." What's your first question?
- "We need to evaluate our new LLM for customer support." Propose a complete evaluation framework.
- "Our search ranking model has NDCG@10 of 0.75. Is that good?" How do you answer this?
