Evaluation Metrics for Classification
Reading time: ~30 minutes | Level: ML Foundations | Role: MLE, ML Engineer, Data Scientist, Research Engineer
A fraud detection team built a model. It predicted "not fraud" for every single transaction. On a dataset with 0.1% fraud rate, this gave them 99.9% accuracy.
They shipped it.
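Accuracy is the wrong headline metric for imbalanced problems: a model that never predicts the rare class can still score close to 100%. The right metric depends on the relative costs of false positives and false negatives, and those costs come from the business context, not from the dataset statistics.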
An ML engineer who does not understand classification metrics deeply will optimize for the wrong thing, ship models that look great offline and fail in production, and have no vocabulary to explain model behavior to stakeholders.
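The arithmetic behind that 99.9% figure is easy to reproduce. The sketch below assumes a hypothetical batch of 100,000 transactions at the 0.1% fraud rate from the story; the counts are illustrative, not data from a real system.

```python
# Hypothetical batch: 100,000 transactions at a 0.1% fraud rate (illustrative).
n_total = 100_000
n_fraud = 100
n_legit = n_total - n_fraud

# A "model" that predicts "not fraud" for every transaction.
tp, fn = 0, n_fraud   # every fraud case is missed
fp, tn = 0, n_legit   # every legit transaction is trivially correct

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)   # fraction of fraud actually caught

print(f"accuracy = {accuracy:.3f}")
print(f"recall   = {recall:.3f}")
```

Accuracy looks excellent while recall on the fraud class is zero, which is why the rest of this lesson works from the confusion matrix rather than from accuracy alone.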
What You Will Learn
- The confusion matrix: the fundamental four-cell structure every metric derives from
- Precision, recall, F1 - definitions, formulas, and when each matters
- The precision-recall tradeoff and threshold selection
- AUC-ROC and AUC-PR: ranking-based metrics and when they differ dramatically
- Log loss: the probabilistic evaluation metric for calibrated models
- Matthews Correlation Coefficient (MCC): the metric for severe class imbalance
- Multi-class extensions: macro, micro, and weighted averaging
- Code for everything with sklearn, matplotlib, and business context
- Five interview questions at senior ML engineer level
Part 1 - The Confusion Matrix
Every classification metric is derived from the 2×2 confusion matrix for binary classification:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
- True Positive (TP): Model predicts positive, truth is positive ✓
- True Negative (TN): Model predicts negative, truth is negative ✓
- False Positive (FP): Model predicts positive, truth is negative ✗ (Type I error)
- False Negative (FN): Model predicts negative, truth is positive ✗ (Type II error)
Which error is worse? Depends entirely on the application:
| Application | Worse error | Why |
|---|---|---|
| Cancer screening | FN (miss a cancer) | Patient doesn't get treatment → death |
| Spam filter | FP (block real email) | User loses important messages |
| Fraud detection | FN (miss fraud) | Bank loses money |
| Fraud detection (strict) | FP (block legit transaction) | Customer churn |
| Hiring model | FP (hire wrong) or FN (reject right) | Depends on cost of bad hire vs cost of missed talent |
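One way to make "which error is worse" concrete is to attach a cost to each error type and compare models by total cost. The sketch below is a minimal illustration: the two models, their error counts, and the cost values are all hypothetical.

```python
def total_cost(fp: int, fn: int, cost_fp: float, cost_fn: float) -> float:
    """Total cost of a model's errors, given a per-error cost (hypothetical units)."""
    return fp * cost_fp + fn * cost_fn

# Two hypothetical fraud models evaluated on the same data.
model_a = {"fp": 200, "fn": 10}   # cautious: many false alarms, few misses
model_b = {"fp": 20, "fn": 40}    # lenient: few false alarms, more misses

# Scenario 1: a missed fraud costs 50x as much as a false alarm.
cost_a_1 = total_cost(**model_a, cost_fp=1.0, cost_fn=50.0)
cost_b_1 = total_cost(**model_b, cost_fp=1.0, cost_fn=50.0)

# Scenario 2: blocking a legit customer costs as much as missing a fraud.
cost_a_2 = total_cost(**model_a, cost_fp=1.0, cost_fn=1.0)
cost_b_2 = total_cost(**model_b, cost_fp=1.0, cost_fn=1.0)

print(f"Scenario 1: A={cost_a_1:.0f}, B={cost_b_1:.0f}")
print(f"Scenario 2: A={cost_a_2:.0f}, B={cost_b_2:.0f}")
```

Model A is cheaper in the first scenario and model B in the second, even though their confusion matrices never change. Only the cost assumptions did.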
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Simulate a fraud detection model
np.random.seed(42)
n = 10000
fraud_rate = 0.02  # 2% fraud
y_true = np.random.choice([0, 1], size=n, p=[1 - fraud_rate, fraud_rate])

# Model scores: fraud has higher scores, non-fraud lower
scores = np.where(y_true == 1,
                  np.random.beta(7, 3, n),   # fraud: high scores
                  np.random.beta(2, 8, n))   # legit: low scores

threshold = 0.5
y_pred = (scores >= threshold).astype(int)

# sklearn orders rows and columns by label value: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
TP, FN = cm[1, 1], cm[1, 0]
FP, TN = cm[0, 1], cm[0, 0]

print("Confusion Matrix:")
print(f"  TP={TP}, FN={FN}")
print(f"  FP={FP}, TN={TN}")
print(f"\nAccuracy = (TP+TN)/(all) = {(TP+TN)/n:.4f}")
# Predicting "negative" everywhere is correct on every actual negative (TN + FP)
print(f"Null accuracy (predict all negative) = {TN+FP}/{n} = {(TN+FP)/n:.4f}")

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Legit', 'Fraud'])
fig, ax = plt.subplots(figsize=(5, 4))
disp.plot(ax=ax, colorbar=False)
plt.title('Confusion Matrix - Fraud Detection', fontsize=12)
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
```
Part 2 - Precision, Recall, F1
Precision
"Of all the things I called positive, what fraction actually were?"

$$\text{Precision} = \frac{TP}{TP + FP}$$

Low precision = many false alarms. High precision = when you say fraud, you're right.
Recall (Sensitivity, True Positive Rate)
"Of all actual positives, what fraction did I catch?"

$$\text{Recall} = \frac{TP}{TP + FN}$$

Low recall = missing many real cases. High recall = catching most fraud/cancer/defects.
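Both definitions reduce to ratios of confusion-matrix counts. The sketch below works them out by hand for a hypothetical alerting scenario; the helper name `precision_recall` and the counts are illustrative assumptions, and returning 0.0 for an empty denominator is one convention among several.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall) from raw counts, using 0.0 when a denominator is empty."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical day of fraud alerts:
# the model raised 100 alerts, 80 of them were real fraud,
# and 200 fraud cases occurred in total (so 120 were missed).
tp, fp, fn = 80, 20, 120
precision, recall = precision_recall(tp, fp, fn)

print(f"precision = {precision:.2f}")  # 80 of 100 alerts were right
print(f"recall    = {recall:.2f}")     # 80 of 200 fraud cases were caught
```

The same model is "right 80% of the time it alerts" and "misses 60% of fraud" at once, which is why neither number is sufficient alone.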
The Precision-Recall Tradeoff
Lowering the threshold predicts more positives, which raises recall and typically lowers precision. Raising the threshold predicts fewer positives, which lowers recall and typically raises precision.

| | High threshold (strict) | Low threshold (loose) |
|---|---|---|
| Alerts | Few fraud alerts | Many fraud alerts |
| False positives | Low (few false alarms) | High (many false alarms) |
| False negatives | High (miss more fraud) | Low (catch more fraud) |
| Result | High precision, low recall | High recall, low precision |
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, f1_score

# Use the fraud scores (y_true, scores) from the confusion matrix example above
thresholds = np.linspace(0.01, 0.99, 100)
precisions, recalls, f1s = [], [], []

for t in thresholds:
    y_pred_t = (scores >= t).astype(int)
    # Avoid division by zero when all predictions are negative
    if y_pred_t.sum() == 0:
        precisions.append(1.0); recalls.append(0.0); f1s.append(0.0)
    else:
        precisions.append(precision_score(y_true, y_pred_t, zero_division=1))
        recalls.append(recall_score(y_true, y_pred_t))
        f1s.append(f1_score(y_true, y_pred_t))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Precision, recall, and F1 as functions of the threshold
ax = axes[0]
ax.plot(thresholds, precisions, 'b-', lw=2, label='Precision')
ax.plot(thresholds, recalls, 'r-', lw=2, label='Recall')
ax.plot(thresholds, f1s, 'g-', lw=2, label='F1 Score')
ax.axvline(0.5, color='gray', ls='--', lw=1.5, label='Default threshold 0.5')
best_f1_idx = np.argmax(f1s)
ax.axvline(thresholds[best_f1_idx], color='purple', ls=':', lw=2,
           label=f'Best F1 threshold={thresholds[best_f1_idx]:.2f}')
ax.set_xlabel('Classification Threshold')
ax.set_ylabel('Score')
ax.set_title('Precision, Recall, F1 vs Threshold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Precision-Recall curve
ax = axes[1]
ax.plot(recalls, precisions, 'b-', lw=2)
ax.scatter([recalls[best_f1_idx]], [precisions[best_f1_idx]],
           color='purple', s=100, zorder=5, label=f'Best F1={f1s[best_f1_idx]:.3f}')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Precision-Recall Curve (Fraud Detection)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('precision_recall_curves.png', dpi=150)

print("\nAt default threshold=0.5:")
y_default = (scores >= 0.5).astype(int)
print(f"  Precision: {precision_score(y_true, y_default):.4f}")
print(f"  Recall:    {recall_score(y_true, y_default):.4f}")
print(f"  F1:        {f1_score(y_true, y_default):.4f}")

print(f"\nAt optimal threshold={thresholds[best_f1_idx]:.2f}:")
y_opt = (scores >= thresholds[best_f1_idx]).astype(int)
print(f"  Precision: {precision_score(y_true, y_opt):.4f}")
print(f"  Recall:    {recall_score(y_true, y_opt):.4f}")
print(f"  F1:        {f1_score(y_true, y_opt):.4f}")
```
F1 Score
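The harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$
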
F1 penalizes extreme imbalances between precision and recall. A model with precision=1.0, recall=0.01 has F1 = 0.02 - near-useless despite perfect precision.
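F-beta generalizes F1 to weight recall $\beta$ times more than precision:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

- $\beta = 2$: recall twice as important (missed cancer costs more than false alarm)
- $\beta = 0.5$: precision twice as important (false alarms cost more)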
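To see how the weighting shifts the score, the sketch below evaluates F-beta directly from a precision and recall pair using the standard formula $F_\beta = (1+\beta^2)PR / (\beta^2 P + R)$. The helper `f_beta` and the example values are illustrative; in practice scikit-learn's `fbeta_score` computes the same quantity from labels and predictions.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model with mediocre precision and high recall.
p, r = 0.5, 0.9
f1 = f_beta(p, r)              # beta = 1: plain harmonic mean
f2 = f_beta(p, r, beta=2.0)    # favors recall, so the high recall is rewarded
f05 = f_beta(p, r, beta=0.5)   # favors precision, so the low precision is punished

# The extreme case from the text: perfect precision, 1% recall.
f1_extreme = f_beta(1.0, 0.01)

print(f"F1={f1:.3f}  F2={f2:.3f}  F0.5={f05:.3f}  extreme F1={f1_extreme:.3f}")
```

Because recall is the stronger of the two numbers here, F2 comes out above F1 and F0.5 below it.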
Part 3 - AUC-ROC: The Ranking Metric
The ROC Curve
The Receiver Operating Characteristic curve plots:
- True Positive Rate (TPR) = Recall = $TP / (TP + FN)$ on the y-axis
- False Positive Rate (FPR) = $FP / (FP + TN)$ on the x-axis
Each point on the curve corresponds to a different threshold.
[Figure: ROC curve sketch with TPR on the y-axis and FPR on the x-axis. A perfect model (AUC = 1.0) hugs the top-left corner; a random model (AUC = 0.5) follows the diagonal.]
AUC-ROC
The Area Under the ROC Curve (AUC or AUROC) summarizes the ROC curve as a single number:
- AUC = 1.0: perfect discrimination
- AUC = 0.5: random (diagonal line)
- AUC < 0.5: worse than random (the ranking is inverted; negating the scores gives AUC = 1 - AUC)
Probabilistic interpretation: AUC = probability that a randomly chosen positive example is ranked higher (has a higher model score) than a randomly chosen negative example.
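The probabilistic interpretation can be checked by brute force on a toy example: compare every positive with every negative and count how often the positive scores higher, counting ties as half. The helper `pairwise_auc` and the toy labels and scores below are illustrative assumptions; on the same inputs, scikit-learn's `roc_auc_score` should return the same value.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for s_pos, s_neg in product(pos, neg):
        if s_pos > s_neg:
            wins += 1.0
        elif s_pos == s_neg:
            wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: 3 positives, 4 negatives; one negative (0.8) outranks two positives.
y_toy = [1, 1, 1, 0, 0, 0, 0]
s_toy = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]

auc_toy = pairwise_auc(y_toy, s_toy)
print(f"pairwise AUC = {auc_toy:.4f}")  # 10 of 12 pairs are ordered correctly
```

This pairwise view also explains why AUC depends only on the ordering of the scores: any monotonic rescaling leaves every comparison unchanged.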
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# 90/10 imbalanced synthetic dataset
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=2, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=500),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
}

fig, ax = plt.subplots(figsize=(8, 6))
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, proba)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, lw=2, label=f'{name} (AUC={roc_auc:.3f})')

ax.plot([0, 1], [0, 1], 'k--', lw=1.5, label='Random (AUC=0.5)')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves - 90/10 Imbalanced Classification', fontsize=13)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curves.png', dpi=150)
```
When AUC-ROC Is Misleading: Use AUC-PR for Imbalanced Data
AUC-ROC can be overly optimistic for severely imbalanced datasets. The FPR denominator ($FP + TN$) is dominated by the large negative class - a model can have a small FPR even with many false positives in absolute terms.
AUC-PR (area under the precision-recall curve) is more informative when:
- Positive class is rare (fraud, rare disease, defects)
- The cost of false positives and false negatives is very different
- You care about how well you rank the positive class, not just separate it from negatives
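A short calculation shows how a good-looking ROC operating point can hide poor precision. The sketch below fixes TPR = 0.90 and FPR = 0.01 and varies only the class balance; the helper `precision_at` and the population sizes are hypothetical.

```python
def precision_at(tpr: float, fpr: float, n_pos: int, n_neg: int) -> float:
    """Precision implied by a (TPR, FPR) operating point and the class counts."""
    tp = round(tpr * n_pos)
    fp = round(fpr * n_neg)
    return tp / (tp + fp)

tpr, fpr = 0.90, 0.01   # the same point on the ROC curve in both scenarios

# Balanced data: 500,000 positives and 500,000 negatives.
precision_balanced = precision_at(tpr, fpr, n_pos=500_000, n_neg=500_000)

# Rare positives: 1,000 positives among 1,000,000 examples (0.1% prevalence).
precision_rare = precision_at(tpr, fpr, n_pos=1_000, n_neg=999_000)

print(f"balanced: precision = {precision_balanced:.3f}")
print(f"rare:     precision = {precision_rare:.3f}")
```

The ROC point is identical in both cases, but with rare positives a 1% FPR still produces 9,990 false alarms against 900 true hits, so more than nine in ten alerts are wrong. The PR curve exposes this; the ROC curve does not.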
```python
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve

# Reuses models, X_test, and y_test from the ROC example above
# (the models are already fitted there).
fig, ax = plt.subplots(figsize=(8, 6))
for name, model in models.items():
    proba = model.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, proba)
    ap = average_precision_score(y_test, proba)
    ax.plot(recall, precision, lw=2, label=f'{name} (AP={ap:.3f})')

# A random ranker's precision equals the positive-class prevalence
baseline_rate = y_test.mean()
ax.axhline(baseline_rate, color='gray', ls='--', lw=1.5,
           label=f'Random (AP={baseline_rate:.3f})')
ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Curves - 90/10 Imbalanced', fontsize=13)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('pr_curves.png', dpi=150)

print("Note: Random baseline AUC-ROC = 0.5 always.")
print(f"Random baseline AUC-PR = class prevalence = {baseline_rate:.3f}")
print("AUC-PR is much harder to inflate on imbalanced data.")
```
Part 4 - Log Loss (Cross-Entropy Loss)
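Log loss evaluates the predicted probabilities themselves - not just which class was predicted, but how confident the model was:

$$\text{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $p_i$ is the predicted probability of the positive class for example $i$.

- Perfect model: log loss → 0
- Uninformed model (always predicts 0.5): log loss = $\ln 2 \approx 0.693$ for binary classification
- Wrong with confidence: log loss → ∞ (predicting 0.99 when true label is 0 is catastrophic)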
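These three regimes can be reproduced with a few lines of standard-library Python. The sketch below implements binary cross-entropy with natural logarithms; the helper `binary_log_loss`, the clipping constant, and the toy labels are illustrative assumptions rather than scikit-learn's implementation.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_demo = [1, 0, 1, 0]

coin_flip = binary_log_loss(y_demo, [0.5, 0.5, 0.5, 0.5])            # always predicts 0.5
confident_right = binary_log_loss(y_demo, [0.99, 0.01, 0.99, 0.01])  # confident and correct
confident_wrong = binary_log_loss(y_demo, [0.01, 0.99, 0.01, 0.99])  # confident and wrong

print(f"coin flip:       {coin_flip:.4f}")        # equals ln 2
print(f"confident right: {confident_right:.4f}")
print(f"confident wrong: {confident_wrong:.4f}")
```

Moving from a coin flip to confident-and-right saves less than 0.7 nats per example, while confident-and-wrong costs about 4.6, which is the asymmetry the bullet list describes.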
```python
import numpy as np
from sklearn.metrics import log_loss

# All three score vectors rank every positive above every negative
# (identical discrimination), yet their log losses differ.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])

# Moderately confident model: probabilities lean the right way without being extreme
p_calibrated = np.array([0.9, 0.8, 0.1, 0.2, 0.85, 0.15, 0.7, 0.3])
# Very confident model: extreme probabilities, plus one mildly wrong call (0.6 on a true 0)
p_overconfident = np.array([0.99, 0.95, 0.01, 0.05, 0.98, 0.05, 0.9, 0.6])
# Underconfident model: never commits
p_underconfident = np.array([0.6, 0.55, 0.45, 0.4, 0.58, 0.42, 0.62, 0.38])

print("Log loss comparison:")
print(f"  Moderately confident model: {log_loss(y_true, p_calibrated):.4f}")
print(f"  Very confident model:       {log_loss(y_true, p_overconfident):.4f}")
print(f"  Underconfident model:       {log_loss(y_true, p_underconfident):.4f}")

# Show why confident mistakes hurt. With a single example, log_loss needs
# labels=[0, 1] because y_true alone contains only one class.
wrong_confident = np.array([0.99])
wrong_y = np.array([0])
print(f"\nPredicting 0.99 when true is 0: log loss = "
      f"{log_loss(wrong_y, wrong_confident, labels=[0, 1]):.4f}")
print(f"Predicting 0.60 when true is 0: log loss = "
      f"{log_loss(wrong_y, [0.6], labels=[0, 1]):.4f}")
print("Confident mistakes are heavily penalized - log loss rewards honest probabilities.")
```
When to use log loss: When you need well-calibrated probabilities - risk scoring, medical diagnosis, anything where the probability value itself matters (not just the class label).
Part 5 - Matthews Correlation Coefficient (MCC)
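For severely imbalanced datasets, even F1 can be misleading. MCC is a balanced metric that accounts for all four cells of the confusion matrix:

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

- MCC = +1: perfect prediction
- MCC = 0: no better than random prediction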
- MCC = -1: perfectly inverted prediction
MCC is symmetric: it treats the positive and negative class equivalently. It gives high scores only when all four confusion matrix cells are good simultaneously.
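The standard MCC formula, $(TP \cdot TN - FP \cdot FN) / \sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}$, is simple enough to evaluate by hand. The sketch below does so for a hypothetical dataset with 1,000 negatives and 10 positives; the helper `mcc` is illustrative, and returning 0.0 when the denominator is zero is a common convention, assumed here.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed): an undefined MCC (empty row or column) is reported as 0.
    return numerator / denominator if denominator > 0 else 0.0

# Hypothetical dataset: 1000 negatives, 10 positives.
mcc_all_negative = mcc(tp=0, tn=1000, fp=0, fn=10)   # predicts "negative" for everything
mcc_good = mcc(tp=9, tn=995, fp=5, fn=1)             # catches 9 of 10, with 5 false alarms
mcc_inverted = mcc(tp=0, tn=0, fp=5, fn=5)           # every prediction is wrong

print(f"all-negative: {mcc_all_negative:.3f}")
print(f"good model:   {mcc_good:.3f}")
print(f"inverted:     {mcc_inverted:.3f}")
```

The all-negative baseline gets 0 despite roughly 99% accuracy, while the useful model scores about 0.76.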
```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

# Extreme imbalance: 1000 negatives, 10 positives
y_imbalanced = np.array([0] * 1000 + [1] * 10)

# Model 1: Predict all negatives (naive baseline)
y_all_neg = np.zeros(1010, dtype=int)

# Model 2: Good model (catches 9/10 fraud, 5 false alarms)
y_good = np.zeros(1010, dtype=int)
y_good[1000:1009] = 1   # catch 9 of 10 fraud
y_good[50:55] = 1       # 5 false alarms on legit

print(f"{'Metric':<20} {'All-negative':>15} {'Good model':>15}")
print('-' * 52)
for metric_fn, name in [
    (accuracy_score, 'Accuracy'),
    (lambda yt, yp: precision_score(yt, yp, zero_division=0), 'Precision'),
    (recall_score, 'Recall'),
    (lambda yt, yp: f1_score(yt, yp, zero_division=0), 'F1'),
    (matthews_corrcoef, 'MCC'),
]:
    v_neg = metric_fn(y_imbalanced, y_all_neg)
    v_good = metric_fn(y_imbalanced, y_good)
    print(f"{name:<20} {v_neg:>15.4f} {v_good:>15.4f}")

print("\nMCC clearly separates the all-negative baseline from the useful model.")
print("Accuracy barely moves between the two - do not rely on it under severe imbalance.")
```
Part 6 - Multi-Class Metrics
For $K$ classes, extend binary metrics using averaging strategies:
```python
from sklearn.metrics import classification_report
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# 10-class digit recognition
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(y_test, y_pred, digits=3))
```
Averaging strategies:
| Strategy | Formula | When to use |
|---|---|---|
| Macro | Mean of per-class scores | All classes equally important; highlights poor performance on rare classes |
| Weighted | Class-size-weighted mean | Imbalanced classes; overall performance weighted by frequency |
| Micro | Global TP/FP/FN across all classes | Same as accuracy for multiclass with one prediction per example |
$$\text{Macro F1} = \frac{1}{K}\sum_{k=0}^{K-1} \text{F1}(\text{class } k)$$

$$\text{Weighted F1} = \sum_{k=0}^{K-1} \frac{n_k}{n}\,\text{F1}(\text{class } k)$$

where $n_k$ is the number of examples in class $k$ and $n$ is the total number of examples.
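The three averaging strategies can disagree sharply when a rare class is handled badly. The sketch below computes them by hand on a hypothetical 3-class example where the rarest class is never predicted; the helper names and the toy labels are illustrative. In scikit-learn, the `average` argument of `f1_score` (`"macro"`, `"micro"`, `"weighted"`) selects among the same strategies.

```python
def per_class_counts(y_true, y_pred, labels):
    """One-vs-rest (TP, FP, FN) counts for each class."""
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Hypothetical labels: class 0 is common, class 2 is rare and never predicted.
y_true_mc = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred_mc = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
labels = [0, 1, 2]

counts = per_class_counts(y_true_mc, y_pred_mc, labels)
f1_per_class = {c: f1_from_counts(*counts[c]) for c in labels}

n = len(y_true_mc)
support = {c: y_true_mc.count(c) for c in labels}

macro_f1 = sum(f1_per_class.values()) / len(labels)
weighted_f1 = sum(support[c] / n * f1_per_class[c] for c in labels)
# Micro: pool TP/FP/FN across classes, then compute a single F1.
micro_f1 = f1_from_counts(*(sum(counts[c][i] for c in labels) for i in range(3)))
accuracy = sum(1 for t, p in zip(y_true_mc, y_pred_mc) if t == p) / n

print(f"macro={macro_f1:.3f}  weighted={weighted_f1:.3f}  micro={micro_f1:.3f}  accuracy={accuracy:.3f}")
```

Macro F1 is dragged down by the zero score on class 2, weighted F1 mostly reflects the common class, and micro F1 matches accuracy because every example receives exactly one prediction.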
Part 7 - Choosing the Right Metric: A Decision Guide
| Application | Recommended Primary | Secondary |
|---|---|---|
| Balanced classification | Accuracy + AUC-ROC | F1 |
| Fraud detection | AUC-PR + F1 | Recall at fixed FPR |
| Medical diagnosis (rare disease) | Recall (sensitivity) | AUC-PR |
| Spam filter | Precision + F1 | AUC-ROC |
| Recommendation (top-k) | Precision@k, NDCG | AUC-PR |
| Risk scoring | Log loss + AUC-ROC | Calibration curve |
| Severe imbalance (< 1%) | MCC | AUC-PR |
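"Recall at fixed FPR", listed above as a secondary metric for fraud detection, asks how much of the positive class you can catch within a fixed false-alarm budget. The sketch below is a minimal brute-force version under stated assumptions: the helper `recall_at_fpr` is hypothetical, predictions are positive when `score >= threshold`, and the toy scores are illustrative.

```python
def recall_at_fpr(y_true, scores, max_fpr):
    """Return (recall, threshold) with the highest recall whose FPR <= max_fpr.

    If several thresholds reach the same recall, the loosest one is returned.
    Returns (0.0, None) if no candidate threshold fits the FPR budget.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_recall, best_threshold = 0.0, None
    # Candidate thresholds: each distinct score, from strictest to loosest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        if fp / n_neg <= max_fpr and tp / n_pos >= best_recall:
            best_recall, best_threshold = tp / n_pos, threshold
    return best_recall, best_threshold

# Toy data: 4 positives followed by 10 negatives.
y_toy = [1, 1, 1, 1] + [0] * 10
s_toy = [0.95, 0.85, 0.6, 0.3,
         0.9, 0.5, 0.4, 0.35, 0.25, 0.2, 0.15, 0.1, 0.05, 0.02]

recall_10, thr_10 = recall_at_fpr(y_toy, s_toy, max_fpr=0.10)  # at most 1 false alarm
recall_0, thr_0 = recall_at_fpr(y_toy, s_toy, max_fpr=0.0)     # no false alarms allowed

print(f"FPR <= 10%: recall={recall_10:.2f} at threshold {thr_10}")
print(f"FPR <=  0%: recall={recall_0:.2f} at threshold {thr_0}")
```

Reporting the metric this way ties the evaluation to an operating constraint the business can state directly, such as "no more than 1 in 10 legitimate transactions flagged".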
Recommended Resources
:::tip Video Resources
- StatQuest - Confusion Matrix: Essential visual walkthrough of TP/FP/TN/FN with examples. (~8 min)
- StatQuest - Sensitivity and Specificity: Recall vs specificity, and when each matters. (~8 min)
- StatQuest - ROC and AUC Explained: The best explanation of what AUC-ROC actually means. (~16 min)
- StatQuest - Precision-Recall Curves: When to use PR curves instead of ROC curves. (~11 min)
:::
Interview Questions
Q1: What is the difference between AUC-ROC and AUC-PR, and when does it matter?
The ROC curve plots TPR against FPR; the PR curve plots precision against recall. Each AUC is the area under its curve. On balanced datasets, both usually rank models similarly. On imbalanced datasets (rare positive class), AUC-ROC can be misleading: FPR = FP/(FP+TN) has the large negative class in its denominator, so FPR stays small even with many false positives in absolute terms, which can make a weak model look good. AUC-PR avoids this because precision = TP/(TP+FP) responds directly to the absolute FP count. A random classifier has AUC-ROC = 0.5 regardless of class balance, while its AUC-PR (average precision) is approximately the class prevalence. For fraud detection (0.1% fraud), random AUC-PR ≈ 0.001 - so even a model with AUC-PR = 0.1 is 100x better than random. Prefer AUC-PR whenever the positive class is rare.
Q2: When would you choose recall over precision, and why?
Recall = TP/(TP+FN). Choose recall when false negatives are more costly than false positives: cancer screening (missing a cancer is worse than a false scare), sepsis detection (missing sepsis → death), safety systems (missing a braking event). Choose precision when false positives are more costly: spam filter (blocking real emails is worse than letting some spam through), recommendation systems (showing irrelevant results degrades user experience), hiring models (making a wrong hire may cost more than missing a good candidate). F-beta with $\beta > 1$ formalizes "recall is more important": $\beta = 2$ means recall is twice as important as precision.
Q3: Explain log loss and why it penalizes overconfident wrong predictions so harshly.
Log loss is $-\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$. For a positive example ($y = 1$), the loss is $-\log p$ - it approaches 0 as $p \to 1$ and approaches $\infty$ as $p \to 0$. Predicting $p = 0.01$ for a positive example: loss = $-\ln(0.01) \approx 4.61$. Predicting $p = 0.99$ for a positive example: loss = $-\ln(0.99) \approx 0.01$. The logarithm creates an asymmetric penalty: being confidently wrong is catastrophically expensive, while being confidently right is only marginally better than being moderately confident and right. This incentivizes calibration - the model should express uncertainty when it's uncertain, not suppress it for better-looking average accuracy.
Q4: Your model has precision=0.95 and recall=0.3 on the positive class. What does this mean and is it acceptable?
Precision=0.95 means when the model says "positive," it's right 95% of the time - very few false alarms. Recall=0.3 means the model catches only 30% of actual positives - misses 70%. This is a model that is very conservative: only flags things it's very sure about, misses most positive cases. Whether it's acceptable depends entirely on the application. For a preliminary screening tool (you'll do a follow-up test anyway), 30% recall may be fine if each positive flagged is acted upon. For a fraud detection system where you need to catch most fraud, 30% recall means 70% of fraud goes undetected - unacceptable. F1 = 2*(0.95*0.3)/(0.95+0.3) = 0.46 - mediocre. The fix: lower the classification threshold to increase recall at the cost of precision, then evaluate the precision-recall tradeoff at the new operating point.
Q5: What is the Matthews Correlation Coefficient and when should you use it instead of F1?
MCC = $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$. It accounts for all four confusion matrix cells and ranges from -1 to +1. Unlike F1, MCC is symmetric - it gives the same score when you swap the positive and negative class labels. F1 ignores true negatives, so it can be misleadingly high when the positive class is the majority: a model that predicts every example as positive on a dataset that is 99% positive gets precision = 0.99, recall = 1.0, and F1 ≈ 0.995 despite having learned nothing. Relabel the rare class as positive and the same predictions score F1 = 0. MCC is 0 for that model under either labeling. Use MCC when: (1) severe class imbalance (< 5% positive), (2) you want a symmetric single-number summary, (3) comparing models across datasets with different class balances. MCC is particularly common in bioinformatics and medical ML where rare-event detection is the norm.
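The label-swap asymmetry of F1 can be verified by hand. The sketch below scores one constant classifier on a hypothetical dataset of 990 majority-class and 10 minority-class examples under both labeling conventions; the helpers `f1` and `mcc` are illustrative, and reporting an undefined MCC as 0.0 is an assumed convention.

```python
import math

def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed): undefined MCC is reported as 0.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# The classifier assigns every example to the majority class.

# Labeling A: majority class is "positive" -> the model predicts all positive.
f1_a = f1(tp=990, fp=10, fn=0)
mcc_a = mcc(tp=990, tn=0, fp=10, fn=0)

# Labeling B: minority class is "positive" -> the same model predicts all negative.
f1_b = f1(tp=0, fp=0, fn=10)
mcc_b = mcc(tp=0, tn=990, fp=0, fn=10)

print(f"Labeling A: F1={f1_a:.3f}, MCC={mcc_a:.3f}")
print(f"Labeling B: F1={f1_b:.3f}, MCC={mcc_b:.3f}")
```

F1 swings from near-perfect to zero purely because of which class is called positive, while MCC reports 0 both times for a classifier that has learned nothing.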
