Evaluation Metrics - Measuring What Actually Matters

Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Data Scientist, Research Eng

The Real Interview Moment

You're in a machine learning design round at a fintech company. The interviewer describes a fraud detection system: "We have 100 million transactions per day. About 0.01% are fraudulent. Our current model has 99.5% accuracy. The business says fraud losses are still too high. How would you evaluate this model properly?"

The candidate who says "99.5% accuracy sounds good" is done. A model that predicts "not fraud" for every single transaction would achieve 99.99% accuracy on this dataset, so the current model actually scores below the trivial baseline on accuracy. The interviewer wants to hear you immediately recognize the class imbalance problem, pivot to precision-recall analysis, discuss the business cost asymmetry (missing fraud might be, say, 100x more expensive than a false alert), and propose AUC-PR as the primary metric with a threshold tuned to minimize total expected cost.

Choosing the right evaluation metric is not a footnote - it's the first design decision in any ML project. The wrong metric leads to the wrong model, the wrong threshold, and the wrong business outcome. In interviews, your metric choice reveals whether you understand the problem or just the algorithm.

What You Will Master

After reading this page, you will be able to:

Select the correct evaluation metric for any ML problem type (classification, ranking, generation, regression)
Explain why accuracy is misleading for imbalanced datasets and what to use instead
Derive and interpret precision, recall, F1, and their micro/macro/weighted variants
Draw and interpret ROC curves, PR curves, and explain when each is appropriate
Apply ranking metrics (NDCG, MAP, MRR) for search and recommendation systems
Use generation metrics (BLEU, ROUGE, perplexity) for NLP tasks
Design threshold selection strategies based on business requirements
Explain model calibration and why it matters for decision-making
Handle multi-class and multi-label metric computation
Navigate the "what metric would you use?" interview question for any scenario

Self-Assessment: Where Are You Now?

Skill Area	1 (Never heard of it)	3 (Can define it)	5 (Can derive + apply)	Your Rating
Precision / Recall / F1	Never used	Know the formulas	Can explain micro/macro/weighted variants	___
AUC-ROC	Don't know what ROC stands for	Can draw the curve	Know why AUC-PR is better for imbalanced data	___
Log loss / Cross-entropy	Never computed	Know the formula	Can explain calibration connection	___
Ranking metrics (NDCG, MAP)	Never heard of NDCG	Know the formulas	Can compute by hand and explain position bias	___
NLP metrics (BLEU, ROUGE)	Never used	Know they exist	Can explain limitations and when to use each	___
Threshold selection	Just use 0.5	Know it's adjustable	Can design cost-sensitive threshold optimization	___
Calibration	Never heard of it	Know what it means	Can diagnose and fix calibration issues	___

Score interpretation:

7-14: Essential reading. Metrics questions appear in almost every ML interview.
15-25: Good foundation. Focus on the edge cases and practice problems.
26-35: You're well-prepared. Drill the company-specific scenarios and ranking/generation metrics.

Part 1 - Classification Metrics

The Confusion Matrix: Where Everything Starts

Every binary classification metric derives from four numbers:

	Predicted Positive	Predicted Negative
Actually Positive	True Positive (TP)	False Negative (FN)
Actually Negative	False Positive (FP)	True Negative (TN)

60-Second Answer

"The confusion matrix is the foundation of all classification metrics. From it, we derive accuracy (how often we're right overall), precision (of the things we called positive, how many actually were), recall (of the actual positives, how many did we catch), and F1 (the harmonic mean of precision and recall). The choice between these depends on the cost asymmetry - if missing a positive is expensive (cancer detection, fraud), optimize for recall. If false alarms are expensive (a spam filter that hides an important email), optimize for precision. F1 balances both when the costs are roughly equal."

Core Metrics

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The proportion of correct predictions. Simple but dangerous.

Precision = TP / (TP + FP)

"Of the items I flagged as positive, what fraction actually were?" High precision = few false alarms.

Recall (Sensitivity, True Positive Rate) = TP / (TP + FN)

"Of the actual positive items, what fraction did I catch?" High recall = few missed positives.

Specificity (True Negative Rate) = TN / (TN + FP)

"Of the actual negative items, what fraction did I correctly identify?" Important for medical screening.

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

The harmonic mean of precision and recall. The harmonic mean penalizes extreme imbalance: if either precision or recall is near zero, F1 is near zero too.

F-beta Score = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)

Generalizes F1. F2 weighs recall 2x more than precision (use when missing positives is costly). F0.5 weighs precision 2x more than recall (use when false alarms are costly).
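The formulas above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch, not a reference implementation; the function name `classification_metrics` and the example counts are assumptions made for illustration, and returning F-beta = 0 when precision or recall is undefined is a convention chosen here.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute core classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Each ratio is undefined when its denominator is zero; return None then.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    if precision is None or recall is None or precision + recall == 0:
        f_beta = 0.0  # convention assumed for this sketch
    else:
        b2 = beta ** 2
        f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_beta": f_beta,
    }


# Hypothetical counts, for illustration only.
for name, value in classification_metrics(tp=80, fp=20, fn=20, tn=880).items():
    print(name, value)
```

Passing `beta=2` or `beta=0.5` gives the F2 and F0.5 variants described above.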

Common Trap

Many candidates say F1 is "the average of precision and recall." It's the harmonic mean, not the arithmetic mean. The harmonic mean is always less than or equal to the arithmetic mean, and it's zero when either input is zero. This is important: a model with 100% precision and 0% recall has an arithmetic mean of 50% but an F1 of 0%. If an interviewer catches this mistake, it signals you've memorized rather than understood.

When Accuracy Is Misleading

[Figure: Accuracy Limitations Flowchart]

Example: Fraud detection with 0.01% positive rate

Model	Accuracy	Precision	Recall	F1
Predict all "not fraud"	99.99%	undefined	0%	0%
Random (match base rate)	99.98%	0.01%	0.01%	0.01%
Decent fraud model	99.85%	5%	80%	9.4%
Good fraud model	99.53%	2%	95%	3.9%

Notice: The "good" fraud model has lower accuracy, lower precision, and lower F1 than the decent one, but its recall is much higher. Which is better depends on whether catching 95% vs. 80% of fraud is worth the extra false positives.
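One way to sanity-check a row in a table like this is to back out the confusion-matrix counts from the positive rate, precision, and recall, and then recompute accuracy. The sketch below assumes a hypothetical helper named `accuracy_from_pr` and requires precision to be greater than zero.

```python
def accuracy_from_pr(prevalence, precision, recall, n=1_000_000):
    """Derive accuracy from prevalence, precision, and recall.

    Works on expected counts per n examples; precision must be > 0.
    """
    positives = prevalence * n
    tp = recall * positives
    fn = positives - tp
    # precision = tp / (tp + fp)  =>  fp = tp * (1 - precision) / precision
    fp = tp * (1 - precision) / precision
    return 1 - (fp + fn) / n


# 0.01% positive rate, as in the fraud example.
for precision, recall in [(0.05, 0.80), (0.02, 0.95)]:
    acc = accuracy_from_pr(0.0001, precision, recall)
    print(f"precision={precision:.0%} recall={recall:.0%} accuracy={acc:.2%}")
```

The point of the exercise: at this level of imbalance, accuracy is almost entirely determined by the number of false positives, which is why it moves so little between very different models.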

Interviewer's Perspective

When I ask "what metric would you use for X?" I'm testing three things: (1) Do you recognize the class distribution? (2) Do you understand the business cost of each error type? (3) Can you translate that into a concrete metric? The best answers start with "It depends on the cost of false positives vs. false negatives" and then propose a specific metric with reasoning. The worst answers are "accuracy" for any imbalanced problem or "F1" without discussing why.

Multi-Class Metrics: Micro, Macro, Weighted

For K classes, you have K precision/recall values. How do you aggregate?

Macro average: Compute metric per class, then average. Treats all classes equally.

Macro-F1 = (1/K) * sum_{c=1}^{K} F1_c

Micro average: Pool all TP, FP, FN across classes, then compute metric. Dominated by frequent classes.

Micro-Precision = sum_c TP_c / sum_c (TP_c + FP_c)

Weighted average: Like macro but weighted by class frequency (support).

Weighted-F1 = sum_{c=1}^{K} (n_c / N) * F1_c

Averaging	When to Use	Key Property
Macro	All classes equally important regardless of frequency	A rare class with F1=0 drags down the average
Micro	Overall correctness matters most	Equals accuracy for single-label classification
Weighted	Compromise - every class counts, in proportion to its frequency	Reported next to the macro average in sklearn's classification_report; not the default of f1_score (which is average='binary')
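The three averaging schemes can be compared side by side on a small imbalanced example. This is a minimal sketch in plain Python; the helper names `f1` and `averaged_f1` and the toy labels are assumptions for illustration.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 written as 2TP / (2TP + FP + FN); 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred):
    """Return (macro, micro, weighted) F1 for single-label multi-class data."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    support = Counter(y_true)
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    macro = sum(per_class.values()) / len(classes)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    weighted = sum(support[c] / len(y_true) * per_class[c] for c in classes)
    return macro, micro, weighted


# Toy data: the model never predicts the minority class "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
macro, micro, weighted = averaged_f1(y_true, y_pred)
print(f"macro={macro:.3f} micro={micro:.3f} weighted={weighted:.3f}")
```

On this toy data the macro average is pulled down sharply by the ignored minority class, while the micro average simply equals accuracy.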

Instant Rejection

If you report a single F1 number for a multi-class problem without specifying macro, micro, or weighted, you haven't fully answered the question. In an interview, always specify the averaging method and explain your choice. For imbalanced multi-class problems, macro-F1 is usually the right choice because it exposes poor performance on minority classes.

Multi-Label Metrics

In multi-label classification (each sample can have multiple labels), the metrics extend differently:

Exact match ratio: Fraction of samples where the predicted label set exactly matches the true label set. Very strict.
Hamming loss: Fraction of label-sample pairs that are incorrect. More lenient.
Per-label F1 + macro average: Most common in practice.
Sample-averaged F1: Compute F1 per sample, then average.
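A small sketch of three of these multi-label metrics, representing each sample's labels as a Python set. The function names and toy data are assumptions for illustration; treating two empty label sets as a perfect match (F1 = 1.0) is a convention chosen here.

```python
def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose predicted label set equals the true set."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(y_true, y_pred, n_labels):
    """Fraction of (sample, label) pairs that are wrong."""
    # Symmetric difference = labels wrongly added plus labels wrongly missed.
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
    return wrong / (len(y_true) * n_labels)


def sample_averaged_f1(y_true, y_pred):
    """Compute F1 per sample from set overlap, then average over samples."""
    scores = []
    for t, p in zip(y_true, y_pred):
        denom = len(t) + len(p)
        scores.append(2 * len(t & p) / denom if denom else 1.0)
    return sum(scores) / len(scores)


# Toy data with 3 possible labels: a, b, c.
y_true = [{"a", "b"}, {"c"}]
y_pred = [{"a"}, {"c"}]
print(exact_match_ratio(y_true, y_pred))
print(hamming_loss(y_true, y_pred, n_labels=3))
print(sample_averaged_f1(y_true, y_pred))
```

Missing a single label on the first sample halves the exact match ratio but barely moves the Hamming loss, which is the strict-versus-lenient contrast described above.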

Part 2 - Probabilistic and Threshold-Based Metrics

Log Loss (Binary Cross-Entropy)

Log Loss = -(1/N) * sum_{i=1}^{N} [y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]

Measures the quality of probability estimates, not just the binary prediction. A model that outputs 0.51 for a positive example is penalized much more than one that outputs 0.99.

Key properties:

Minimized when predicted probabilities match true class probabilities (calibration)
Heavily penalizes confident wrong predictions (predicting 0.01 for a positive: -log(0.01) = 4.6)
Unlike accuracy/F1, log loss evaluates the entire probability distribution, not just the argmax
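To make the confidence penalty concrete, here is a minimal sketch of binary log loss in plain Python. The function name `log_loss` and the clipping constant `eps` are choices made for this example (clipping avoids log(0)).

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy averaged over examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Per-example loss for a single positive example at different confidences.
for p in (0.99, 0.51, 0.01):
    print(f"p={p}: loss={log_loss([1], [p]):.3f}")
```

All three predictions except the last would count as "correct" at a 0.5 threshold, yet their losses differ by well over an order of magnitude.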

60-Second Answer

"Log loss measures how well your model's predicted probabilities match reality. It's different from accuracy or F1 because it cares about confidence - predicting 0.51 for a true positive is correct for accuracy but terrible for log loss compared to predicting 0.99. Log loss is a strictly proper scoring rule for probability estimation, which is one reason it is the standard training loss for classification. But for model evaluation, you often want metrics that reflect the business decision (precision, recall) rather than the probability quality (log loss). Log loss is most useful when you need well-calibrated probabilities for downstream decision-making."

AUC-ROC

The ROC curve plots True Positive Rate (recall) vs. False Positive Rate (1 - specificity) at every possible classification threshold.

AUC-ROC = Area Under the ROC Curve. Interpretation: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example.

AUC-ROC Value	Interpretation
1.0	Perfect separation
0.5	Random classifier (no discrimination)
< 0.5	Worse than random (flip predictions)
0.7-0.8	Acceptable for many applications
0.8-0.9	Good discrimination
> 0.9	Excellent discrimination

Advantages:

Threshold-independent - evaluates the model's ability to rank positives above negatives
Scale-independent - depends only on the ordering of scores, so it is unaffected by calibration (and tells you nothing about it)
Easy to interpret probabilistically

Disadvantages:

Misleading for imbalanced datasets - ROC can look great even when precision is terrible
Includes performance at thresholds you'd never use (e.g., classifying everything as positive)
FPR denominator (TN + FP) is huge for imbalanced data, making FPR artificially low

AUC-PR (Average Precision)

The PR curve plots precision vs. recall at every threshold. AUC-PR is the area under this curve (approximated by Average Precision in sklearn).

[Figure: AUC-ROC vs AUC-PR Comparison]

Common Trap

A common interview mistake is saying "AUC-ROC is always the best metric for classification." For imbalanced datasets, AUC-ROC can be very high (0.95+) even when the model's precision is terrible. This happens because the False Positive Rate has all actual negatives (FP + TN) in its denominator - when negatives vastly outnumber positives, even many false positives barely move the FPR. AUC-PR is much more informative for imbalanced problems because precision and recall are both computed relative to the positive class and ignore true negatives.
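The effect can be reproduced on a small synthetic example. The sketch below computes AUC-ROC through its pairwise-ranking interpretation and average precision from the ranked list; the function names, the dataset, and the scores are all made up for illustration, and `average_precision` assumes at least one positive example.

```python
def auc_roc(y_true, scores):
    """P(random positive is scored above random negative); ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision@rank taken at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Synthetic data: 10 positives, 1,000 negatives.
# 50 negatives outscore every positive; the other 950 score below them.
y_true = [0] * 50 + [1] * 10 + [0] * 950
scores = [0.9] * 50 + [0.8] * 10 + [0.1] * 950

print(f"AUC-ROC = {auc_roc(y_true, scores):.3f}")
print(f"AUC-PR (average precision) = {average_precision(y_true, scores):.3f}")
```

Only 5% of negatives outrank the positives, so AUC-ROC looks excellent, but those 50 false alarms swamp the 10 true positives at the top of the list, and average precision stays below 0.1.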

Threshold Selection

The model outputs probabilities; the threshold converts them to binary decisions. Choosing the threshold is a business decision, not a statistical one.

Strategy 1: Maximize F1. Find the threshold that maximizes F1 on validation data. Good default when costs are symmetric.

Strategy 2: Target a specific recall. "We must catch 95% of fraud." Find the threshold that gives recall >= 0.95, then report the resulting precision.

Strategy 3: Minimize expected cost. Assign a cost to each error type:

Cost = C_FN * FN + C_FP * FP

Find the threshold that minimizes this on validation data.

Strategy 4: Precision at fixed recall (or vice versa). Common in retrieval: "What's our precision when we recall 80% of relevant items?"

Strategy	When to Use	Example
Max F1	Balanced costs	General classification
Target recall	Missing positives is very costly	Medical screening, fraud
Target precision	False alarms are very costly	Content moderation, spam
Min expected cost	Known cost asymmetry	Any business-critical decision
Precision@k	Top-k retrieval	Search, recommendations
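As an illustration of the cost-based strategy, the sketch below sweeps candidate thresholds on a validation set and keeps the one with the lowest total cost. The function name `best_threshold`, the toy scores, and the cost values are hypothetical; a real system would use far more data and real cost estimates.

```python
def best_threshold(y_true, scores, cost_fn, cost_fp):
    """Return (threshold, cost) minimizing cost_fn * FN + cost_fp * FP.

    An example is flagged positive when score >= threshold.
    """
    best_t, best_cost = None, float("inf")
    # Candidates: every distinct score, plus +inf (flag nothing).
    for t in sorted(set(scores)) + [float("inf")]:
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 1)
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


# Toy validation set.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]

# Missing a positive is 100x worse than a false alarm: low threshold.
print(best_threshold(y_true, scores, cost_fn=100, cost_fp=1))
# A false alarm is 5x worse than a miss: high threshold.
print(best_threshold(y_true, scores, cost_fn=1, cost_fp=5))
```

The same model yields two different operating points purely because the cost assumptions changed, which is the sense in which threshold choice is a business decision.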

Company Variation

Google Search: Uses NDCG and MRR for ranking quality; precision@k for snippet relevance
Meta Ads: Optimizes for calibrated probabilities (log loss) because bid = p(click) * value
Stripe/PayPal (fraud): Recall at a fixed false positive rate ("catch 95% of fraud with <1% false positive rate on value")
Netflix/Spotify: Uses ranking metrics (NDCG) and engagement metrics (click-through rate, play rate)

Calibration

A model is calibrated if its predicted probability matches the true frequency: when it says "80% chance of positive," 80% of those cases should actually be positive.

Why calibration matters:

Decision-making: If you use probabilities to set prices, bids, or risk scores, they must be calibrated
Threshold stability: A well-calibrated model's optimal threshold is more stable across data distributions
Ensembling: Averaging probabilities across models works best when each model's probabilities are calibrated to the same scale

How to check calibration:

Reliability diagram: Bin predictions by confidence, plot predicted probability vs. actual frequency. Perfect calibration = diagonal line.
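Expected Calibration Error (ECE): Weighted average of the gap between mean predicted probability and actual positive frequency in each bin, with each bin weighted by the fraction of samples it contains.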
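A minimal sketch of the binned check for a binary classifier follows. It bins by predicted probability of the positive class using equal-width bins; the function name, the bin count, and the toy inputs are assumptions for illustration, and other ECE variants (for example, binning by top-class confidence in multi-class settings) exist.

```python
def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Sample-weighted mean |avg predicted prob - observed positive rate|."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_pred = sum(p for _, p in b) / len(b)
        frac_pos = sum(y for y, _ in b) / len(b)
        ece += len(b) / len(y_true) * abs(avg_pred - frac_pos)
    return ece


# Says 0.8 and is right 80% of the time: well calibrated.
print(expected_calibration_error([1] * 8 + [0] * 2, [0.8] * 10))
# Says 0.9 but is right only 50% of the time: overconfident.
print(expected_calibration_error([1] * 5 + [0] * 5, [0.9] * 10))
```

The per-bin pairs (`avg_pred`, `frac_pos`) are exactly the points plotted in a reliability diagram.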

How to fix poor calibration:

Platt scaling: Fit a logistic regression on the model's outputs using a validation set
Temperature scaling: Divide logits by a learned temperature parameter T (simplest, works well for neural networks)
Isotonic regression: Fit a non-decreasing piecewise function (more flexible, needs more data)
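To show the idea behind temperature scaling, the sketch below picks T by grid search to minimize negative log-likelihood on held-out binary logits. This is a simplification: in practice T is usually fit with a gradient-based optimizer on validation logits. The function names, the grid, and the toy logits are assumptions for illustration.

```python
import math


def nll(logits, labels, temperature):
    """Average negative log-likelihood of binary labels under sigmoid(z / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1 / (1 + math.exp(-z / temperature))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimizes validation NLL."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: nll(logits, labels, t))


# Overconfident toy model: logits of +/-4 (about 98% confidence),
# but only 3 out of 4 predictions on each side are correct.
logits = [4.0] * 4 + [-4.0] * 4
labels = [1, 1, 1, 0, 0, 0, 0, 1]

t = fit_temperature(logits, labels)
print("fitted temperature:", t)
print("confidence before:", 1 / (1 + math.exp(-4.0)))
print("confidence after: ", 1 / (1 + math.exp(-4.0 / t)))
```

A fitted T above 1 softens the probabilities toward the observed 75% hit rate without changing the ranking of examples, so threshold-free metrics such as AUC-ROC are unaffected.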

Interviewer's Perspective

Calibration is an advanced topic that impresses interviewers. If you're asked "how would you deploy this model for credit risk scoring?" and you mention calibration - explaining that the predicted probability needs to match the actual default rate for the bank's pricing models to work - you're demonstrating production awareness that most candidates lack.

Part 3 - Ranking Metrics

For search engines, recommendation systems, and information retrieval, we care about the order of results, not just binary correctness.

Precision@k and Recall@k

Precision@k: Of the top k results, how many are relevant?

Precision@k = |relevant items in top k| / k

Recall@k: Of all relevant items, how many appear in the top k?

Recall@k = |relevant items in top k| / |total relevant items|

Mean Reciprocal Rank (MRR)

For queries where there's one "correct" answer: What's the rank of the first correct result?

MRR = (1/Q) * sum_{q=1}^{Q} 1/rank_q

Example: If the correct results appear at positions [3, 1, 5] for three queries: MRR = (1/3 + 1/1 + 1/5) / 3 = 0.51

Use when: There's a single correct answer (question answering, entity resolution, "I'm feeling lucky" search).
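The ranking formulas so far are short enough to write out directly. In this sketch, a ranked list is a list of 0/1 relevance flags in rank order; the function names are chosen for illustration.

```python
def precision_at_k(relevant_flags, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevant_flags[:k]) / k


def recall_at_k(relevant_flags, k, total_relevant):
    """Fraction of all relevant items that appear in the top k."""
    return sum(relevant_flags[:k]) / total_relevant


def mean_reciprocal_rank(first_correct_ranks):
    """Average of 1/rank of the first correct result, one rank per query."""
    return sum(1 / r for r in first_correct_ranks) / len(first_correct_ranks)


# MRR example from the text: correct results at ranks 3, 1, and 5.
print(round(mean_reciprocal_rank([3, 1, 5]), 2))  # 0.51

# Hypothetical ranked list with 4 relevant items in the whole collection.
flags = [1, 0, 1, 0, 0]
print(precision_at_k(flags, 5), recall_at_k(flags, 5, total_relevant=4))
```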

Mean Average Precision (MAP)

For queries with multiple relevant results: Compute precision at every relevant position, average.

AP(q) = (1/R_q) * sum_{k=1}^{n} Precision@k * rel(k)
MAP = (1/Q) * sum_{q=1}^{Q} AP(q)

where rel(k) = 1 if item at position k is relevant, 0 otherwise, and R_q is the total number of relevant documents for query q.

Example: Query with 3 relevant docs. Ranked list: [R, N, R, N, R]

Precision@1 = 1/1 (relevant) -> counted
Position 2: not relevant -> skip
Precision@3 = 2/3 (relevant) -> counted
Position 4: not relevant -> skip
Precision@5 = 3/5 (relevant) -> counted
AP = (1.0 + 0.667 + 0.6) / 3 = 0.756
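The worked example can be checked with a few lines of Python. This is a sketch with illustrative function names; `total_relevant` corresponds to R_q in the formula and defaults to the number of relevant items found in the list.

```python
def average_precision(relevant_flags, total_relevant=None):
    """AP for one query: mean of Precision@k over the relevant positions."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            total += hits / k  # Precision@k at a relevant position
    r = total_relevant if total_relevant is not None else hits
    return total / r if r else 0.0


def mean_average_precision(queries):
    """MAP: average AP over a list of per-query relevance-flag lists."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Ranked list [R, N, R, N, R] with 3 relevant documents in total.
print(round(average_precision([1, 0, 1, 0, 1], total_relevant=3), 3))  # 0.756
```

If a relevant document is missing from the ranked list entirely, pass the true `total_relevant` so that the miss lowers AP instead of being ignored.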

Normalized Discounted Cumulative Gain (NDCG)

The gold standard for ranking metrics. Unlike MAP (which uses binary relevance), NDCG handles graded relevance (e.g., "highly relevant" = 3, "somewhat relevant" = 1, "not relevant" = 0).

DCG@k = sum_{i=1}^{k} (2^{rel_i} - 1) / log2(i + 1)
NDCG@k = DCG@k / IDCG@k

where IDCG is the DCG of the ideal (perfectly sorted) ranking.

Key properties:

Ranges from 0 to 1 (normalized)
Position-weighted: top positions matter more, through a logarithmic discount (it is the gain 2^rel - 1 that grows exponentially, with relevance grade)
Handles graded relevance (not just binary)
The log discount means: the weights for positions 1, 2, 3 are 1.00, 0.63, 0.50, so position 1 is worth about 1.6x position 2 and exactly 2x position 3
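The DCG and NDCG formulas translate into a short sketch. The function names and the example relevance list are illustrative; note the assumption that the ideal ranking is built only from the items present in the list, which understates IDCG if relevant items were never retrieved.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@k with exponential gain and log2 position discount."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0


# Position discounts 1 / log2(i + 1) for the first three ranks.
print([round(1 / math.log2(i + 1), 3) for i in (1, 2, 3)])  # [1.0, 0.631, 0.5]

# Hypothetical graded relevances in ranked order (3 = highly relevant).
print(ndcg_at_k([3, 2, 0, 1], k=4))
# A perfectly sorted list scores exactly 1.0.
print(ndcg_at_k([3, 2, 1, 0], k=4))
```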

60-Second Answer

"NDCG measures how well a ranking system orders results by relevance, accounting for position bias - users look at top results more than bottom ones. It uses graded relevance (not just relevant/irrelevant) and normalizes by the ideal ranking. The formula has two key components: the numerator (2^relevance - 1) rewards highly relevant items exponentially, and the denominator (log2(position + 1)) discounts items at lower positions logarithmically. NDCG = 1 means perfect ranking. I'd use NDCG when relevance is graded (like search with highly/somewhat/not relevant labels) and MAP when relevance is binary."

Ranking Metrics Comparison

Metric	Relevance Type	Position-Weighted	Use Case
Precision@k	Binary	Equal weight for top k	"How good is page 1?"
Recall@k	Binary	No	"Did we find everything?"
MRR	Binary (single answer)	First correct only	QA, entity search
MAP	Binary (multiple answers)	Yes (precision at each relevant position)	Document retrieval
NDCG@k	Graded (multi-level)	Yes (log discount)	Search ranking, recommendations

[Figure: Ranking Metric Selection]

Part 4 - Generation and NLP Metrics

Perplexity

The standard metric for language models. Measures how "surprised" the model is by the test data.

PPL = exp(-(1/N) * sum_{i=1}^{N} log p(w_i | w_1, ..., w_{i-1}))

Equivalent to the exponential of the average cross-entropy loss.

Interpretation:

PPL = 10 means the model is as "confused" as if it had to choose uniformly among 10 options at each step
Lower is better
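Published reference points: GPT-2 reported zero-shot perplexity on WikiText-103 ranging from about 37.5 (117M parameters) down to about 17.5 (1.5B parameters); GPT-3 reported about 20.5 on Penn Treebank. GPT-4's perplexity has not been disclosed. These figures come from different datasets and tokenizations, so they are not directly comparable.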

Limitations:

Only comparable across models using the same tokenizer and vocabulary
Doesn't measure factual accuracy, coherence, or usefulness
A model can have low perplexity but generate repetitive, boring text
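The "choosing uniformly among N options" interpretation follows directly from the formula, as this small sketch shows. The input is the list of probabilities the model assigned to each actual next token; the function name and the example values are illustrative.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Assigning probability 0.1 to every observed token gives PPL of about 10.
print(perplexity([0.1] * 20))

# Hypothetical mixed confidences: one badly predicted token dominates.
print(perplexity([0.9, 0.8, 0.9, 0.01]))
```

Because perplexity is a geometric mean of inverse probabilities, a single very unlikely token raises it far more than several well-predicted tokens lower it.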

BLEU (Bilingual Evaluation Understudy)

Measures n-gram overlap between generated text and reference translations. Originally designed for machine translation.

BLEU = BP * exp(sum_{n=1}^{4} w_n * log(precision_n))

where BP is a brevity penalty and precision_n is the modified precision for n-grams.

Key details:

Modified precision: each n-gram in the candidate can match at most as many times as it appears in the reference (prevents gaming by repeating common words)
Standard BLEU uses n=1,2,3,4 with equal weights (w_n = 0.25)
Brevity penalty penalizes short translations: BP = min(1, exp(1 - r/c)) where r is reference length, c is candidate length
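The pieces above (clipped n-gram precision, equal weights, brevity penalty) fit together as in the following simplified sketch. It is sentence-level, uses a single reference, and applies no smoothing, so it is not a substitute for a standard BLEU implementation; all function names are chosen for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each match is capped by the reference count."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0


def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU with equal n-gram weights."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    c, r = len(candidate), len(reference)
    bp = min(1.0, math.exp(1 - r / c))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "the cat is on the mat".split()
# Repeating a common word is clipped to its reference count (2 of 7 match).
print(modified_precision("the the the the the the the".split(), reference, 1))
print(bleu("the cat is on the mat".split(), reference))
```

The zero-precision case is why sentence-level BLEU is usually computed with smoothing, or at corpus level, in practice.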

Limitations:

Doesn't capture meaning - a paraphrase with different words gets a low BLEU score
Sensitive to tokenization and preprocessing choices
Not great for open-ended generation (summarization, dialogue)

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Measures recall of n-grams from the reference in the generated text. Primarily used for summarization.

Variants:

ROUGE-1: Unigram recall
ROUGE-2: Bigram recall
ROUGE-L: Longest Common Subsequence (captures sentence-level structure)
ROUGE-Lsum: ROUGE-L applied to summaries (sentence-level then aggregated)
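A sketch of the recall side of ROUGE-N and ROUGE-L for a single reference follows. Common ROUGE tooling also reports precision and F-measure and applies its own tokenization and stemming, none of which is modeled here; the function names and sentences are illustrative.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_recall(candidate, reference):
    """LCS length divided by reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
print(rouge_l_recall(candidate, reference))
```

A single substituted word costs one unigram but two bigrams, which is why ROUGE-2 drops faster than ROUGE-1 or ROUGE-L here.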

BLEU vs. ROUGE:

Aspect	BLEU	ROUGE
Focus	Precision (are generated n-grams in the reference?)	Recall (are reference n-grams in the generated text?)
Best for	Machine translation	Summarization
Multiple references	Matches against all references	Matches against all references
Limitation	Penalizes valid paraphrases	High recall possible by generating very long text

BERTScore

Uses contextual embeddings (from BERT) to compute similarity between generated and reference text at the token level. Handles paraphrases better than BLEU/ROUGE because it compares meanings, not surface forms.

Advantages over BLEU/ROUGE:

Captures semantic similarity, not just lexical overlap
Better correlation with human judgment
Handles paraphrases and synonyms

Disadvantages:

Slower to compute (requires BERT inference)
Not as interpretable as n-gram counts
Results depend on which BERT model is used

METEOR

Improves on BLEU by incorporating stemming, synonyms, and a recall component:

Matches words after stemming ("running" matches "runs")
Uses WordNet synonyms ("good" matches "excellent")
Computes both precision and recall, combining with harmonic mean weighted toward recall
Adds a fragmentation penalty for matches that aren't contiguous

Generally correlates better with human judgment than BLEU for machine translation.

Human Evaluation

For open-ended generation (chatbots, creative writing, general summarization), automated metrics correlate poorly with human preference. Human evaluation methods:

Likert scale rating: Rate outputs 1-5 on fluency, relevance, factual accuracy
Pairwise comparison: "Which response is better, A or B?" (used for RLHF)
Win rate: Fraction of comparisons won against a baseline (Chatbot Arena approach)
Best-worst scaling: Identify best and worst among several options (more reliable than Likert)

Common Trap

Candidates sometimes cite BLEU as the metric for evaluating chatbots or summarization systems. BLEU was designed for machine translation where there's a specific correct answer. For open-ended generation, BLEU has very low correlation with human judgment. Strong candidates mention BERTScore for automated evaluation and human evaluation (with specific protocols) as the gold standard, noting the cost-quality tradeoff.

Generation Metrics Comparison

Metric	Measures	Best For	Correlation with Humans
Perplexity	Language model quality	LM evaluation, pretraining	Moderate (good proxy for fluency)
BLEU	N-gram precision	Machine translation	Moderate for MT, low for open-ended
ROUGE	N-gram recall	Summarization	Moderate
BERTScore	Semantic similarity	Any generation task	Generally higher than n-gram metrics
METEOR	N-gram + synonyms + stemming	Machine translation	Better than BLEU
CIDEr	TF-IDF weighted n-grams	Image captioning	Moderate
Human eval	Everything	Any task	Gold standard (by definition)

Part 5 - Regression Metrics and Special Cases

Regression Metrics

Metric	Formula	Interpretation	When to Use
MAE	mean(|y - y_hat|)	Average absolute error in original units	When outliers shouldn't dominate
MSE	mean((y - y_hat)^2)	Average squared error	When large errors are especially bad
RMSE	sqrt(MSE)	Error in original units (like MAE)	Same as MSE, more interpretable
MAPE	mean(|y - y_hat| / |y|) * 100%	Percentage error	When relative error matters
R^2	1 - SS_res / SS_tot	Proportion of variance explained (0 to 1 for reasonable models)	Model comparison, reporting
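The table's formulas in one place, as a plain-Python sketch. The function name and the toy values are assumptions for illustration; the MAPE line assumes no actual value is zero, and R^2 assumes the targets are not all identical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, MAPE (%), and R^2."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # MAPE is undefined if any actual value is zero (assumed not to happen).
    mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mape": mape,
        "r2": 1 - ss_res / ss_tot,
    }


y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
print(regression_metrics(y_true, y_pred))

# A constant, badly wrong prediction gives a negative R^2.
print(regression_metrics(y_true, [100.0] * 4)["r2"])
```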

60-Second Answer

"MAE vs MSE comes down to how you want to treat large errors. MSE squares the errors, so a prediction that's off by 10 contributes 100 to the loss - it heavily penalizes large errors. MAE treats all errors linearly. If being off by $10 is exactly as bad as 10 instances of being off by $1, use MAE. If being off by $10 is much worse than 10 instances of being off by $1, use MSE. Practically, MSE leads to the mean as the optimal prediction, while MAE leads to the median. For data with outliers or heavy-tailed error distributions, MAE is more robust."

R-Squared Pitfalls

R^2 can be negative (model is worse than predicting the mean)
R^2 never decreases on the training data when you add more features to a linear model, even useless ones (use adjusted R^2 to penalize model complexity)
R^2 = 0.8 does NOT mean the model explains 80% of the data - it means it explains 80% of the variance
High R^2 doesn't guarantee good predictions - it can mask systematic bias
R^2 is not comparable across different datasets or different target variables

MAPE Limitations

Undefined when actual values are zero (division by zero)
Asymmetric: penalizes over-predictions more than under-predictions
Biased toward models that under-predict
Alternative: Symmetric MAPE (SMAPE) or MASE (Mean Absolute Scaled Error)

Segmented Evaluation

Always evaluate metrics on meaningful segments, not just overall:

Overall AUC-ROC: 0.92
  - New users: 0.78 (much worse!)
  - Power users: 0.97
  - Users with < 5 interactions: 0.65 (terrible!)
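A breakdown like the one above comes from grouping predictions by segment before computing the metric. The sketch below uses accuracy as a stand-in for AUC-ROC to stay short; the function names, segment labels, and records are hypothetical.

```python
from collections import defaultdict


def metric_by_segment(records, metric_fn):
    """Apply metric_fn(y_true, y_pred) separately to each segment.

    records: iterable of (segment, y_true, y_pred) tuples.
    """
    groups = defaultdict(lambda: ([], []))
    for segment, y, p in records:
        groups[segment][0].append(y)
        groups[segment][1].append(p)
    return {seg: metric_fn(ys, ps) for seg, (ys, ps) in groups.items()}


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical evaluation records: (segment, label, prediction).
records = [
    ("new_user", 1, 0), ("new_user", 0, 0), ("new_user", 1, 0), ("new_user", 0, 1),
    ("power_user", 1, 1), ("power_user", 0, 0), ("power_user", 1, 1), ("power_user", 0, 0),
]
print(metric_by_segment(records, accuracy))
```

Any metric with the signature `metric_fn(y_true, y_pred)` can be swapped in, but small segments produce noisy estimates, so segment sizes are worth reporting alongside the scores.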

Interviewer's Perspective

An experienced ML engineer always asks: "What's the metric breakdown by segment?" In production, overall metrics hide critical failures. I've seen models with 0.95 AUC-ROC that performed at random for a specific user cohort responsible for 30% of revenue. In an interview, if you're asked "how would you evaluate this model?" and you only talk about aggregate metrics without mentioning segmented evaluation, you're missing a key production insight.

Fairness Metrics

Increasingly important in ML interviews, especially at companies deploying models that affect people:

Demographic parity: Positive prediction rate is equal across groups
Equalized odds: TPR and FPR are equal across groups
Predictive parity: Precision is equal across groups
Calibration across groups: Model is well-calibrated for each demographic group

These metrics often conflict - when base rates differ across groups, an imperfect classifier cannot satisfy predictive parity and equalized odds at the same time (impossibility theorem by Chouldechova, 2017). The choice depends on the application context and legal requirements.
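The four criteria can be inspected by computing the relevant rates per group. This is a minimal sketch with a hypothetical function name and made-up data; it only reports the rates and makes no judgment about which gap is acceptable.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group positive rate, TPR, FPR, and precision for binary labels."""
    out = {}
    for g in set(groups):
        rows = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        tp = sum(1 for t, p in rows if t == 1 and p == 1)
        fp = sum(1 for t, p in rows if t == 0 and p == 1)
        fn = sum(1 for t, p in rows if t == 1 and p == 0)
        tn = sum(1 for t, p in rows if t == 0 and p == 0)
        out[g] = {
            "positive_rate": (tp + fp) / len(rows),          # demographic parity
            "tpr": tp / (tp + fn) if tp + fn else None,      # equalized odds
            "fpr": fp / (fp + tn) if fp + tn else None,      # equalized odds
            "precision": tp / (tp + fp) if tp + fp else None,  # predictive parity
        }
    return out


# Made-up labels and predictions for two groups, A and B.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
for g, rates in sorted(group_rates(y_true, y_pred, groups).items()):
    print(g, rates)
```

In this toy data the two groups differ on every rate at once, which illustrates why a single fairness criterion has to be chosen deliberately rather than assumed.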

Part 6 - The Metric Selection Decision Tree

[Figure: Metric Selection Decision Tree]

Practice Problems

Problem 1: The Imbalanced Dataset

You're building a cancer screening model. The dataset has 1% positive (cancer) and 99% negative. Your model achieves 98% accuracy, 40% precision, and 95% recall. Is this model good? What single metric would you report to stakeholders?

Hint 1 - Direction

Consider what each error type means clinically. What's the consequence of a false negative (missed cancer) vs. a false positive (unnecessary biopsy)?

Hint 2 - Insight

For cancer screening, recall is critical - missing a cancer case can be fatal. The 95% recall means we catch 95% of cancers. The 40% precision means 60% of flagged patients don't have cancer (false alarms). Is that acceptable? Consider what happens after a positive screening result.

Hint 3 - Full Solution + Rubric

Analysis:

98% accuracy is below the 99% you would get by predicting all negative, so accuracy cannot tell you whether this model is useful.
95% recall means we miss 5% of cancers. Whether this is acceptable depends on the screening context - for a first-pass screening that's followed by specialist review, this may be fine.
40% precision means 60% of flagged patients get unnecessary follow-up testing. In a screening context, this is often acceptable - the cost of a follow-up test is much lower than the cost of missing cancer.
Judged by accuracy, the model looks worse than the trivial baseline; judged by recall, it catches 95% of cancers where the baseline catches none.

Recommended metrics:

Primary: Recall (sensitivity) at a minimum threshold (e.g., "must catch 95%+ of cancers")
Secondary: Precision (positive predictive value) to understand false alarm burden
Summary: AUC-PR (better than AUC-ROC for 1% positive rate)
For stakeholders: "The model catches 95 out of 100 cancers, but 60% of alerts are false alarms requiring follow-up testing"

This model is good for a screening application where false negatives are much more costly than false positives. It would NOT be good for a confirmatory diagnostic test.

Scoring Rubric:

Strong Hire: Immediately recognizes accuracy is misleading, discusses cost asymmetry (missed cancer vs false alarm), recommends recall as primary metric, mentions AUC-PR, discusses screening vs diagnostic context
Lean Hire: Knows accuracy is misleading, recommends F1 or AUC-ROC, but doesn't fully articulate the cost asymmetry
No Hire: Says "98% accuracy is good" or only recommends F1 without cost discussion

Problem 2: Ranking System Evaluation

You're evaluating a product search engine. For the query "running shoes," the top 5 results are: [relevant, irrelevant, highly relevant, irrelevant, relevant]. Compute Precision@5, MRR, and NDCG@5 (use relevance scores: highly relevant=3, relevant=1, irrelevant=0).

Hint 1 - Direction

Apply each formula carefully. For NDCG, you need to compute DCG first (using the graded relevance), then the ideal DCG (sort by relevance), then divide.

Hint 2 - Insight

Precision@5 treats everything as binary (relevant or not). MRR only cares about the first relevant result. NDCG uses the graded scores and discounts by position. This problem illustrates why different metrics capture different aspects of ranking quality.

Hint 3 - Full Solution + Rubric

Precision@5:

Relevant items in top 5: 3 (positions 1, 3, 5)
Precision@5 = 3/5 = 0.60

MRR:

First relevant result is at position 1
RR = 1/1 = 1.0
(If this were one of many queries, we'd average across queries)

NDCG@5:

Relevance scores: [1, 0, 3, 0, 1]
DCG@5 = (2^1 - 1)/log2(2) + (2^0 - 1)/log2(3) + (2^3 - 1)/log2(4) + (2^0 - 1)/log2(5) + (2^1 - 1)/log2(6)
DCG@5 = 1/1 + 0/1.585 + 7/2 + 0/2.322 + 1/2.585
DCG@5 = 1.0 + 0 + 3.5 + 0 + 0.387 = 4.887
Ideal order: [3, 1, 1, 0, 0] (sort by relevance descending)
IDCG@5 = (2^3 - 1)/log2(2) + (2^1 - 1)/log2(3) + (2^1 - 1)/log2(4) + 0 + 0
IDCG@5 = 7/1 + 1/1.585 + 1/2 + 0 + 0
IDCG@5 = 7.0 + 0.631 + 0.5 = 8.131
NDCG@5 = 4.887 / 8.131 = 0.601

Key insight: MRR = 1.0 (perfect!) because the first result is relevant, but NDCG = 0.601 because the highly relevant item is at position 3 instead of position 1. NDCG captures that the best result should be first.

Scoring Rubric:

Strong Hire: Correct computation of all three, explains what each metric captures differently, notes that MRR misses the position of the "highly relevant" item
Lean Hire: Gets Precision@5 and MRR correct, struggles with NDCG computation but understands the concept
No Hire: Cannot compute any metric correctly or confuses the formulas

Problem 3: Metric for a Chatbot

You're evaluating a customer service chatbot. The PM asks you to "measure how good the chatbot is." What metrics would you propose and how would you collect them?

Hint 1 - Direction

Think about what "good" means for a customer service chatbot. There are multiple dimensions: factual accuracy, helpfulness, response quality, conversation efficiency, and user satisfaction.

Hint 2 - Insight

Automated metrics (BLEU, ROUGE) are nearly useless for chatbot evaluation because there's no single "correct" response. You need a combination of automated proxy metrics (resolution rate, conversation length), offline human evaluation (correctness ratings), and online metrics (user satisfaction, escalation rate).

Hint 3 - Full Solution + Rubric

Proposed metric framework:

Online metrics (production):

Resolution rate: % of conversations resolved without human escalation
Conversation length: Average turns to resolution (shorter is usually better)
User satisfaction: Post-conversation rating (1-5 stars or thumbs up/down)
Repeat contact rate: Does the user come back with the same issue?
Escalation rate: % of conversations escalated to human agent

Offline evaluation (development):

Factual accuracy: Human raters grade responses for correctness (sampled)
Relevance: Does the response address the user's question? (1-5 scale)
Hallucination rate: % of responses containing fabricated information
Safety: % of responses that violate content policies

Automated proxies (fast iteration):

BERTScore against known-good responses (for common questions)
Intent classification accuracy (does the chatbot understand the query?)
Response latency

What NOT to use:

BLEU/ROUGE (no single reference answer for open-ended dialogue)
Perplexity alone (doesn't measure helpfulness or accuracy)

Data collection:

A/B testing for online metrics (new model vs baseline)
Human evaluation on a sampled set of conversations (weekly cadence)
Automated pipeline for BERTScore on regression test set

Scoring Rubric:

Strong Hire: Proposes a multi-layered framework (online + offline + automated), explains why standard NLP metrics don't work, mentions specific collection methods, discusses tradeoffs between evaluation speed and quality
Lean Hire: Mentions some good metrics (resolution rate, user satisfaction) but missing the systematic framework
No Hire: Proposes BLEU or accuracy as the primary metric

Problem 4: A/B Test Metrics

You've deployed a new recommendation model. The A/B test shows: click-through rate improved by 3%, but average session duration decreased by 8%. Should you ship the new model?

Hint 1 - Direction

Think about what these metrics actually measure and what business goal they serve. Higher CTR with lower session duration could mean different things depending on context.

Hint 2 - Insight

Higher CTR could mean better recommendations (users find what they want faster) OR clickbait-style recommendations (users click but bounce). Lower session duration could mean efficiency (users accomplish their goal faster) OR dissatisfaction (users leave). You need to dig deeper.

Hint 3 - Full Solution + Rubric

Do NOT ship immediately. Investigate further.

Possible interpretations:

Good scenario: Better recommendations help users find what they want faster. CTR up, session duration down because they accomplish their goal in fewer steps. Check: Did conversion rate also increase? Did bounce rate decrease?
Bad scenario: The model recommends clickbait content. Users click more but get disappointed, leading to shorter sessions. Check: What's the bounce rate after clicking? Did return visit rate change?

Additional metrics to check:

Conversion rate (purchases, signups) - the ultimate business metric
Bounce rate after clicking a recommendation
Return visit rate (do users come back?)
Revenue per session
Downstream engagement metrics (time on clicked item, scroll depth)
Statistical significance of both changes (is -8% session duration even significant?)
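
The significance check in the last item can be sketched for a rate metric such as CTR with a two-proportion z-test. The click and impression counts below are made-up numbers chosen to give a 3% relative lift; session duration is a continuous metric and would need a different test (for example a t-test or a bootstrap), which is not shown.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions (pooled variance)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: control CTR 10.0%, treatment CTR 10.3% (+3% relative).
z, p_value = two_proportion_z_test(10_000, 100_000, 10_300, 100_000)
```

With these illustrative counts the lift comes out at roughly z ≈ 2.2 and p ≈ 0.026. The same relative lift on a tenth of the traffic would not clear a 0.05 threshold, which is why sample size matters before reading anything into either change.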

Decision framework:

If conversion rate is up and bounce rate is down: SHIP (users are finding what they want faster)
If conversion rate is flat and bounce rate is up: DO NOT SHIP (clickbait)
If results are mixed: Run the test longer, add guardrail metrics
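
The three rules above can be written down as a small function. This is only a sketch: the argument names, the `tolerance` parameter, and the assumption that both changes have already passed a significance check are illustrative choices, not part of the framework itself.

```python
def ship_decision(conversion_change, bounce_change, tolerance=0.0):
    """Map relative changes (treatment vs control) to a shipping decision.

    conversion_change and bounce_change are relative changes, e.g. 0.02 for +2%.
    Changes within +/- tolerance are treated as flat (hypothetical convention).
    """
    conversion_up = conversion_change > tolerance
    bounce_down = bounce_change < -tolerance
    bounce_up = bounce_change > tolerance

    if conversion_up and bounce_down:
        return "SHIP"          # users are finding what they want faster
    if not conversion_up and bounce_up:
        return "DO NOT SHIP"   # looks like clickbait
    return "RUN LONGER"        # mixed results: extend the test, add guardrails


decision = ship_decision(conversion_change=0.02, bounce_change=-0.05)
```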

Key principle: Never make shipping decisions on a single metric. Define a primary metric (usually revenue or conversion) and guardrail metrics (session duration, return rate) before the test starts.

Scoring Rubric:

Strong Hire: Does not give a yes/no answer immediately, proposes additional metrics to investigate, considers multiple interpretations, mentions guardrail metrics and statistical significance
Lean Hire: Recognizes the ambiguity, suggests looking at conversion, but missing the systematic investigation framework
No Hire: Says "ship it, CTR is up" or "don't ship, session duration is down" without further analysis

Problem 5: Calibration in Practice

Your ad click prediction model outputs probability 0.3 for a set of ads, but the actual click rate for those ads is 0.1. Is this a problem? How would you fix it?

Hint 1 - Direction

Think about what miscalibration means for an ad system. How are predicted probabilities used in ad auction mechanics?

Hint 2 - Insight

In ad systems, the bid is typically: bid = p(click) * value_per_click. If p(click) is over-estimated by 3x, the system will over-bid and overspend. This directly costs money.

Hint 3 - Full Solution + Rubric

Yes, this is a serious problem. The model over-predicts click probability by 3x (0.3 predicted vs 0.1 observed).

Business impact:

Ad auction bid = p(click) * value_per_click
Overestimating p(click) by 3x means overbidding by 3x
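Depending on where the model sits, either the bidder pays up to 3x more per impression than the expected clicks are worth, or the platform's internal ad ranking and allocation are heavily skewed toward the over-predicted ads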
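
The overbidding arithmetic can be made concrete with a short sketch. The value per click of 2.00 is an arbitrary, hypothetical figure; only the 0.3 and 0.1 probabilities come from the problem statement.

```python
def expected_value_bid(p_click, value_per_click):
    """Bid per impression under the simple rule bid = p(click) * value_per_click."""
    return p_click * value_per_click


value_per_click = 2.00  # hypothetical value of one click, in currency units

model_bid = expected_value_bid(0.3, value_per_click)  # uses the predicted probability
fair_bid = expected_value_bid(0.1, value_per_click)   # uses the observed click rate
overbid_factor = model_bid / fair_bid                 # about 3, independent of value_per_click
```

Because the value per click cancels out, the overbid factor equals the ratio of predicted to observed click rate, whatever the click is worth.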

Diagnosis:

Plot a reliability diagram across all probability bins (not just 0.3)
Compute ECE (Expected Calibration Error) to quantify the overall miscalibration
Check if miscalibration is uniform or varies by segment (ad type, user segment, device)
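
A minimal pure-Python sketch of the first two diagnosis steps: it bins predictions into equal-width bins, returns the per-bin rows you would plot in a reliability diagram, and computes ECE as the size-weighted average gap. The toy data is invented so that the 0.3 bin reproduces the 0.3 vs 0.1 gap from the problem; equal-width binning with 10 bins is one common choice, not the only one.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Return (ECE, rows), where rows are (mean_predicted, observed_rate, count) per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    rows = []
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_pred - observed)
        rows.append((mean_pred, observed, len(members)))
    return ece, rows


# Toy data: ads scored 0.3 click 10% of the time; ads scored 0.9 click 90% of the time.
probs = [0.3] * 10 + [0.9] * 10
labels = [1] + [0] * 9 + [1] * 9 + [0]
ece, rows = expected_calibration_error(probs, labels)
```

Here only the 0.3 bin is miscalibrated (gap 0.2) and it holds half the data, so ECE is about 0.1. Running the same function per segment (ad type, device) covers the third diagnosis step.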

Fix options (in order of simplicity):

Temperature scaling: Learn a single scalar T such that p_calibrated = sigmoid(logit / T). Simplest, often sufficient.
Platt scaling: Fit logistic regression on validation set: p_calibrated = sigmoid(a * logit + b). Adds a bias term.
Isotonic regression: Non-parametric calibration. More flexible but needs more data and can overfit.
Retrain with calibration objective: Add a calibration loss term during training. More complex.
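
As an illustration of the simplest option, temperature scaling can be fit by searching for the T that minimizes log loss on a validation set. The grid search and the toy data below are assumptions made for this sketch; real implementations usually optimize T with a gradient-based or 1-D optimizer.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits, labels, temperature):
    """Mean binary log loss of sigmoid(logit / T) against 0/1 labels."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None):
    """Pick the T on a grid (default 0.25 to 5.0) that minimizes validation log loss."""
    if grid is None:
        grid = [0.25 + 0.05 * i for i in range(96)]
    return min(grid, key=lambda t: nll(logits, labels, t))


# Toy validation set: the model outputs logits of +/-4 (about 98% confident)
# but is right only 80% of the time, so it is over-confident.
val_logits = [4.0] * 10 + [-4.0] * 10
val_labels = [1] * 8 + [0] * 2 + [0] * 8 + [1] * 2

T = fit_temperature(val_logits, val_labels)
calibrated = [sigmoid(z / T) for z in val_logits]
```

On this toy set the fitted T is a little under 3, which pulls the 98% predictions down to roughly 80%. Temperature scaling only rescales logits, so a systematic shift (every prediction too high) may need the bias term that Platt scaling adds.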

Validation: After calibration, re-check the reliability diagram and ECE on held-out data that was not used to fit the calibrator. The calibrated model should show predicted ≈ actual across all bins, up to sampling noise.

Scoring Rubric:

Strong Hire: Explains business impact (overbidding in ad auctions), proposes calibration methods in order of complexity, mentions reliability diagram for validation, discusses segment-level calibration
Lean Hire: Recognizes the problem and suggests Platt scaling, but misses the business impact discussion
No Hire: Doesn't understand why miscalibration matters or suggests retraining from scratch

Interview Cheat Sheet

Topic	Key Fact	When to Mention
Accuracy	Misleading for imbalanced data; trivial baseline can beat it	Any imbalanced classification question
Precision	TP/(TP+FP); high precision = few false alarms	When FP cost is high (spam, content moderation)
Recall	TP/(TP+FN); high recall = few missed positives	When FN cost is high (fraud, medical, safety)
F1	Harmonic mean of P and R; F-beta for asymmetric costs	Balanced cost classification
AUC-ROC	Threshold-independent; probability that a random positive is ranked above a random negative	Balanced datasets, model comparison
AUC-PR	Better than AUC-ROC for imbalanced data; focuses on positive class	Any imbalanced classification
Log loss	Evaluates probability quality, not just binary prediction	When calibrated probabilities matter
NDCG	Graded relevance + position discount; normalized by ideal ranking	Search and recommendation ranking
MAP	Binary relevance, precision at each relevant position	Document retrieval
MRR	1/rank of first correct result	Single-answer retrieval (QA)
BLEU	N-gram precision + brevity penalty; bad for open-ended generation	Machine translation only
ROUGE	N-gram recall (ROUGE-N); ROUGE-L uses longest common subsequence	Summarization
Perplexity	exp(cross-entropy); lower is better; LM quality	Language model evaluation
Calibration	Predicted probability matches true frequency; critical for bidding/pricing	Risk scoring, ad systems
Segmented eval	Always check metrics by user/item segments	Any production ML question
MAE vs MSE	MAE: robust to outliers, minimized by the median; MSE: penalizes large errors, minimized by the mean	Regression metric choice
Fairness	Demographic parity, equalized odds - cannot satisfy all simultaneously	User-facing models, hiring, lending
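
To tie the first few rows of the cheat sheet back to the opening fraud example (100 million transactions, 0.01% fraud), here is a small sketch that computes accuracy, precision, recall, and F-beta from confusion-matrix counts. Returning 0.0 when a denominator is zero is a convention chosen for this sketch, and the counts for the second model are invented.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Accuracy, precision, recall and F-beta from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # convention: 0 when undefined
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_beta": f_beta}


# "Always predict not-fraud" on 100M transactions with 10,000 frauds.
always_negative = classification_metrics(tp=0, fp=0, fn=10_000, tn=99_990_000)

# A hypothetical model that catches 8,000 frauds at the cost of 2,000 false alerts.
real_model = classification_metrics(tp=8_000, fp=2_000, fn=2_000, tn=99_988_000)
```

The trivial baseline reaches 99.99% accuracy with zero recall, while the hypothetical model's accuracy is almost identical even though its precision and recall are both 0.8. Accuracy cannot tell the two apart; the other metrics can.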

Spaced Repetition Checkpoints

Day 0 - Immediate Recall

Write the formulas for precision, recall, F1 from memory
Explain why accuracy is misleading for imbalanced data in one sentence
Name one scenario where AUC-PR is better than AUC-ROC
Define NDCG in one sentence

Day 3 - Active Recall

Without notes: When would you use F2 vs F0.5? Give an example for each.
Explain the difference between micro, macro, and weighted F1
Compute NDCG@3 for a ranking [relevant, irrelevant, highly relevant] with scores [1, 0, 3]
Why is BLEU bad for chatbot evaluation?
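
To check your NDCG@3 answer after attempting it by hand, the following sketch computes it for the scores [1, 0, 3]. Two gain conventions are in common use (linear gain `rel` and exponential gain `2^rel - 1`), and they give different numbers, so the function takes a flag; which one applies depends on the definition you were asked to use.

```python
import math


def dcg_at_k(relevances, k, exponential_gain=False):
    """Discounted cumulative gain with a log2(position + 1) discount."""
    total = 0.0
    for i, rel in enumerate(relevances[:k]):
        gain = (2 ** rel - 1) if exponential_gain else rel
        total += gain / math.log2(i + 2)  # position 1 has discount log2(2) = 1
    return total


def ndcg_at_k(relevances, k, exponential_gain=False):
    """DCG of the given ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k, exponential_gain)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k, exponential_gain) / ideal


ranking = [1, 0, 3]  # relevant, irrelevant, highly relevant
ndcg_linear = ndcg_at_k(ranking, k=3)
ndcg_exponential = ndcg_at_k(ranking, k=3, exponential_gain=True)
```

With linear gain, DCG is 1 + 0 + 3/2 = 2.5 and the ideal DCG is 3 + 1/log2(3) ≈ 3.631, giving NDCG@3 ≈ 0.689. With exponential gain the result is about 0.590, because burying the highly relevant item in position 3 is penalized more heavily.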

Day 7 - Application

Design the evaluation metrics for a fraud detection system at a bank. Include primary, secondary, and guardrail metrics.
Explain model calibration to a product manager. Why should they care?
Given an imbalanced classification problem, walk through threshold selection using business costs.

Day 14 - Synthesis

Compare the evaluation strategy for: (a) image classification, (b) product search ranking, (c) text summarization, (d) chatbot. What metrics for each and why?
An A/B test shows improved NDCG but decreased CTR. What does this mean and what would you investigate?
Design a comprehensive evaluation pipeline for a new recommendation system from scratch.

Day 21 - Interview Simulation

"Our model has 99.9% accuracy on production data." What's your first question?
"We need to evaluate our new LLM for customer support." Propose a complete evaluation framework.
"Our search ranking model has NDCG@10 of 0.75. Is that good?" How do you answer this?
