Evaluation Metrics for Classification

Reading time: ~30 minutes | Level: ML Foundations | Role: MLE, ML Engineer, Data Scientist, Research Engineer

A fraud detection team built a model. It predicted "not fraud" for every single transaction. On a dataset with a 0.1% fraud rate, this gave them 99.9% accuracy.

They shipped it.

Accuracy is the wrong metric for imbalanced problems: a model that never predicts the rare class can still score close to 100%. The right metric depends on the costs of false positives and false negatives - on the business context, not the dataset statistics.

An ML engineer who does not understand classification metrics deeply will optimize for the wrong thing, ship models that look great offline and fail in production, and have no vocabulary to explain model behavior to stakeholders.

What You Will Learn

The confusion matrix: the fundamental four-cell structure every metric derives from
Precision, recall, F1 - definitions, formulas, and when each matters
The precision-recall tradeoff and threshold selection
AUC-ROC and AUC-PR: ranking-based metrics and when they differ dramatically
Log loss: the probabilistic evaluation metric for calibrated models
Matthews Correlation Coefficient (MCC): the metric for severe class imbalance
Multi-class extensions: macro, micro, and weighted averaging
Code for everything with sklearn, matplotlib, and business context
Five interview questions at senior ML engineer level

Part 1 - The Confusion Matrix

Every classification metric is derived from the 2×2 confusion matrix for binary classification:

                    Predicted
                 Positive    Negative
Actual  Positive    TP          FN       (Positive class row)
        Negative    FP          TN       (Negative class row)

True Positive (TP): Model predicts positive, truth is positive ✓
True Negative (TN): Model predicts negative, truth is negative ✓
False Positive (FP): Model predicts positive, truth is negative ✗ (Type I error)
False Negative (FN): Model predicts negative, truth is positive ✗ (Type II error)

Which error is worse? Depends entirely on the application:

Application	Worse error	Why
Cancer screening	FN (miss a cancer)	Patient doesn't get treatment → death
Spam filter	FP (block real email)	User loses important messages
Fraud detection	FN (miss fraud)	Bank loses money
Fraud detection (strict)	FP (block legit transaction)	Customer churn
Hiring model	FP (hire wrong) or FN (reject right)	Depends on cost of bad hire vs cost of missed talent
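One way to make the table above concrete is to attach a cost to each error type and compare thresholds by total cost. The sketch below is illustrative only: the costs (`cost_fp`, `cost_fn`), the synthetic scores, and the helper `total_cost` are assumptions made up for this example, not figures from a real system.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical costs: a missed fraud (FN) is assumed to cost 50x a false alarm (FP)
cost_fp, cost_fn = 1.0, 50.0

rng = np.random.default_rng(0)
y_true_demo = rng.choice([0, 1], size=5000, p=[0.98, 0.02])
# Synthetic scores: positives tend to score higher than negatives
scores_demo = np.where(y_true_demo == 1,
                       rng.beta(7, 3, 5000),
                       rng.beta(2, 8, 5000))

def total_cost(threshold):
    """Total error cost when predicting positive for scores >= threshold."""
    y_pred = (scores_demo >= threshold).astype(int)
    # labels=[0, 1] guarantees a 2x2 matrix even if one class is never predicted
    tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred, labels=[0, 1]).ravel()
    return cost_fp * fp + cost_fn * fn

candidates = np.linspace(0.05, 0.95, 19)
costs = np.array([total_cost(t) for t in candidates])
best_threshold = candidates[np.argmin(costs)]
print(f"Lowest-cost threshold: {best_threshold:.2f} (total cost = {costs.min():.0f})")
```

Changing the cost ratio moves the lowest-cost threshold, which is the point: the preferred operating point follows from the costs, not from the data alone.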

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Simulate a fraud detection model
np.random.seed(42)
n = 10000
fraud_rate = 0.02    # 2% fraud

y_true = np.random.choice([0, 1], size=n, p=[1-fraud_rate, fraud_rate])
# Model scores: fraud has higher scores, non-fraud lower
scores = np.where(y_true == 1,
                  np.random.beta(7, 3, n),   # fraud: high scores
                  np.random.beta(2, 8, n))   # legit: low scores

threshold = 0.5
y_pred = (scores >= threshold).astype(int)

cm = confusion_matrix(y_true, y_pred)
TP, FN = cm[1, 1], cm[1, 0]
FP, TN = cm[0, 1], cm[0, 0]

print("Confusion Matrix:")
print(f"  TP={TP}, FN={FN}")
print(f"  FP={FP}, TN={TN}")
print(f"\nAccuracy = (TP+TN)/(all) = {(TP+TN)/n:.4f}")
# Predicting all negative is correct exactly on the actual negatives (TN + FP)
print(f"Null accuracy (predict all negative) = {TN+FP}/{n} = {(TN+FP)/n:.4f}")

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Legit', 'Fraud'])
fig, ax = plt.subplots(figsize=(5, 4))
disp.plot(ax=ax, colorbar=False)
plt.title('Confusion Matrix - Fraud Detection', fontsize=12)
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)

Part 2 - Precision, Recall, F1

Precision

$\text{Precision} = \frac{TP}{TP + FP}$

"Of all the things I called positive, what fraction actually were?"

Low precision = many false alarms. High precision = when you say fraud, you're right.

Recall (Sensitivity, True Positive Rate)

$\text{Recall} = \frac{TP}{TP + FN}$

"Of all actual positives, what fraction did I catch?"

Low recall = missing many real cases. High recall = catching most fraud/cancer/defects.
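As a quick sanity check, both definitions can be computed by hand from raw counts and compared with sklearn. The toy labels below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy data: 4 actual positives, 6 actual negatives
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))  # 2 caught positives
fp = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))  # 1 false alarm
fn = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))  # 2 missed positives

precision_manual = tp / (tp + fp)  # 2/3: of 3 alerts, 2 were real
recall_manual = tp / (tp + fn)     # 2/4: of 4 real positives, 2 were caught

print(f"Precision: manual={precision_manual:.4f}, "
      f"sklearn={precision_score(y_true_toy, y_pred_toy):.4f}")
print(f"Recall:    manual={recall_manual:.4f}, "
      f"sklearn={recall_score(y_true_toy, y_pred_toy):.4f}")
```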

The Precision-Recall Tradeoff

Lower threshold → more positives predicted → higher recall, lower precision. Higher threshold → fewer positives predicted → lower recall, higher precision.

High threshold (strict):     Low threshold (loose):
  Few fraud alerts               Many fraud alerts
  Low FP (few false alarms)      Low FN (catch more fraud)
  High FN (miss more fraud)      High FP (many false alarms)
  → High precision               → High recall
  → Low recall                   → Low precision

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, f1_score

# Use the fraud scores from above
thresholds = np.linspace(0.01, 0.99, 100)
precisions, recalls, f1s = [], [], []

for t in thresholds:
    y_pred_t = (scores >= t).astype(int)
    # Avoid division by zero when all predictions are negative
    if y_pred_t.sum() == 0:
        precisions.append(1.0); recalls.append(0.0); f1s.append(0.0)
    else:
        precisions.append(precision_score(y_true, y_pred_t, zero_division=1))
        recalls.append(recall_score(y_true, y_pred_t))
        f1s.append(f1_score(y_true, y_pred_t))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Precision-Recall vs threshold
ax = axes[0]
ax.plot(thresholds, precisions, 'b-', lw=2, label='Precision')
ax.plot(thresholds, recalls,    'r-', lw=2, label='Recall')
ax.plot(thresholds, f1s,        'g-', lw=2, label='F1 Score')
ax.axvline(0.5, color='gray', ls='--', lw=1.5, label='Default threshold 0.5')
best_f1_idx = np.argmax(f1s)
ax.axvline(thresholds[best_f1_idx], color='purple', ls=':', lw=2,
           label=f'Best F1 threshold={thresholds[best_f1_idx]:.2f}')
ax.set_xlabel('Classification Threshold')
ax.set_ylabel('Score')
ax.set_title('Precision, Recall, F1 vs Threshold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Precision-Recall curve
ax = axes[1]
ax.plot(recalls, precisions, 'b-', lw=2)
ax.scatter([recalls[best_f1_idx]], [precisions[best_f1_idx]],
           color='purple', s=100, zorder=5, label=f'Best F1={f1s[best_f1_idx]:.3f}')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Precision-Recall Curve (Fraud Detection)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('precision_recall_curves.png', dpi=150)

print(f"\nAt default threshold=0.5:")
y_default = (scores >= 0.5).astype(int)
print(f"  Precision: {precision_score(y_true, y_default):.4f}")
print(f"  Recall:    {recall_score(y_true, y_default):.4f}")
print(f"  F1:        {f1_score(y_true, y_default):.4f}")

print(f"\nAt optimal threshold={thresholds[best_f1_idx]:.2f}:")
y_opt = (scores >= thresholds[best_f1_idx]).astype(int)
print(f"  Precision: {precision_score(y_true, y_opt):.4f}")
print(f"  Recall:    {recall_score(y_true, y_opt):.4f}")
print(f"  F1:        {f1_score(y_true, y_opt):.4f}")

F1 Score

The harmonic mean of precision and recall:

$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$

F1 penalizes extreme imbalances between precision and recall. A model with precision=1.0, recall=0.01 has F1 = 0.02 - near-useless despite perfect precision.

F-beta generalizes F1 to weight recall $\beta$ times more than precision:

$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$

$\beta = 2$ : recall twice as important (missed cancer costs more than false alarm)
$\beta = 0.5$ : precision twice as important (false alarms cost more)
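The F-beta formula can be checked against sklearn's `fbeta_score`. This is a small sketch on invented toy labels; the helper `f_beta` is defined here only for the comparison.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy data with precision = 2/3 and recall = 1/2
y_true_fb = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_fb = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true_fb, y_pred_fb)
r = recall_score(y_true_fb, y_pred_fb)

def f_beta(p, r, beta):
    """F-beta computed directly from the formula above."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: formula={f_beta(p, r, beta):.4f}, "
          f"sklearn={fbeta_score(y_true_fb, y_pred_fb, beta=beta):.4f}")
```

Because precision exceeds recall in this toy example, the score is highest for $\beta = 0.5$ and lowest for $\beta = 2$: the more weight recall gets, the more the weaker recall drags the score down.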

Part 3 - AUC-ROC: The Ranking Metric

The ROC Curve

The Receiver Operating Characteristic curve plots:

True Positive Rate (TPR) = Recall $= TP/(TP+FN)$ on the y-axis
False Positive Rate (FPR) $= FP/(FP+TN) = 1 - \text{Specificity}$ on the x-axis

Each point on the curve corresponds to a different threshold.

TPR
1.0 │     *****
    │   **
    │  *              Perfect model:
    │ *               AUC = 1.0
    │*                (curve hugs top-left)
0.5 │
    │     Random:
    │     AUC = 0.5
    │     (diagonal)
    │
0.0 └──────────────
    0.0   0.5   1.0
             FPR

AUC-ROC

The Area Under the ROC Curve (AUC or AUROC) summarizes the ROC curve as a single number:

AUC = 1.0: perfect discrimination
AUC = 0.5: random (diagonal line)
AUC < 0.5: worse than random (swap predictions)

Probabilistic interpretation: AUC = probability that a randomly chosen positive example is ranked higher (has a higher model score) than a randomly chosen negative example.

$\text{AUC} = P(\hat{p}(y=1|x^+) > \hat{p}(y=1|x^-))$
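The probabilistic interpretation can be verified directly: compare every positive score with every negative score and count how often the positive wins. The sketch below uses synthetic scores (an assumption for illustration) and counts ties as half a win, which matches the trapezoidal area that `roc_auc_score` computes.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_auc = rng.choice([0, 1], size=500, p=[0.9, 0.1])
# Synthetic scores: positives are shifted upward by one standard deviation
s_auc = np.where(y_auc == 1,
                 rng.normal(1.0, 1.0, 500),
                 rng.normal(0.0, 1.0, 500))

pos = s_auc[y_auc == 1]
neg = s_auc[y_auc == 0]

# All (positive, negative) pairs via broadcasting
diff = pos[:, None] - neg[None, :]
auc_pairwise = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc_sklearn = roc_auc_score(y_auc, s_auc)
print(f"Pairwise AUC: {auc_pairwise:.6f}")
print(f"sklearn AUC:  {auc_sklearn:.6f}")
```

The pairwise version is quadratic in the number of examples, so it is useful for building intuition rather than for large datasets.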

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, RocCurveDisplay
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

np.random.seed(42)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                            n_classes=2, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=500),
    'Random Forest':       RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting':   GradientBoostingClassifier(n_estimators=100, random_state=42),
}

fig, ax = plt.subplots(figsize=(8, 6))
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, lw=2, label=f'{name} (AUC={roc_auc:.3f})')

ax.plot([0,1],[0,1], 'k--', lw=1.5, label='Random (AUC=0.5)')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves - 90/10 Imbalanced Classification', fontsize=13)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curves.png', dpi=150)

When AUC-ROC Is Misleading: Use AUC-PR for Imbalanced Data

AUC-ROC can be overly optimistic for severely imbalanced datasets. The FPR denominator ($FP+TN$) is dominated by the large negative class - a model can have a small FPR even with many false positives in absolute terms.

AUC-PR (area under the precision-recall curve) is more informative when:

Positive class is rare (fraud, rare disease, defects)
The cost of false positives and false negatives is very different
You care about how well you rank the positive class, not just separate it from negatives
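A worked example with hypothetical counts shows how a single operating point can look excellent on the ROC axes and poor on the PR axes. The numbers below are invented for illustration (0.1% prevalence).

```python
# Hypothetical confusion-matrix counts for a rare-positive problem
tp, fn = 90, 10          # 100 actual positives
fp, tn = 1_000, 98_900   # 99,900 actual negatives

tpr = tp / (tp + fn)         # 0.90: looks great on the ROC y-axis
fpr = fp / (fp + tn)         # about 0.010: looks great on the ROC x-axis
precision = tp / (tp + fp)   # about 0.083: over 9 in 10 alerts are false alarms

print(f"TPR (recall): {tpr:.3f}")
print(f"FPR:          {fpr:.4f}")
print(f"Precision:    {precision:.3f}")
```

The 1,000 false positives barely register in FPR because they are divided by 99,900 negatives, but they dominate the precision denominator.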

from sklearn.metrics import average_precision_score, precision_recall_curve

fig, ax = plt.subplots(figsize=(8, 6))
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, proba)
    ap = average_precision_score(y_test, proba)
    ax.plot(recall, precision, lw=2, label=f'{name} (AP={ap:.3f})')

baseline_rate = y_test.mean()
ax.axhline(baseline_rate, color='gray', ls='--', lw=1.5,
           label=f'Random (AP={baseline_rate:.3f})')
ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Curves - 90/10 Imbalanced', fontsize=13)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('pr_curves.png', dpi=150)

print("Note: Random baseline AUC-ROC = 0.5 always.")
print(f"Random baseline AUC-PR = class prevalence = {baseline_rate:.3f}")
print("AUC-PR is much harder to inflate on imbalanced data.")

Part 4 - Log Loss (Cross-Entropy Loss)

Log loss evaluates probability calibration - not just which class was predicted, but how confident the model was:

$\text{Log Loss} = -\frac{1}{n}\sum_{i=1}^n \left[y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$

Perfect model: log loss → 0
Uninformative model (always predicts 0.5): log loss = $\log(2) \approx 0.693$ for binary classification
Wrong with confidence: log loss → ∞ (predicting 0.99 when true label is 0 is catastrophic)
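The formula can be evaluated term by term to see where the loss comes from. The sketch below uses four invented predictions and compares the hand-computed mean with sklearn's `log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

y_ll = np.array([1, 0, 1, 0])
p_ll = np.array([0.9, 0.2, 0.6, 0.99])   # the last prediction is confidently wrong

# Per-example loss: -log(p) for positives, -log(1 - p) for negatives
per_example = -(y_ll * np.log(p_ll) + (1 - y_ll) * np.log(1 - p_ll))
manual = per_example.mean()

for yi, pi, li in zip(y_ll, p_ll, per_example):
    print(f"y={yi}, p={pi:.2f} -> loss={li:.3f}")
print(f"Mean (manual):  {manual:.4f}")
print(f"sklearn:        {log_loss(y_ll, p_ll):.4f}")

# An uninformative model that always outputs 0.5 scores log(2)
uninformative = log_loss(y_ll, np.full(4, 0.5))
print(f"Always-0.5 model: {uninformative:.4f}")
```

The single confidently wrong prediction contributes most of the total, which is why one bad overconfident example can dominate the average.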

import numpy as np
from sklearn.metrics import log_loss

# Demonstrate calibration vs discrimination
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])

# Well-calibrated model: confident when right
p_calibrated = np.array([0.9, 0.8, 0.1, 0.2, 0.85, 0.15, 0.7, 0.3])

# Overconfident wrong model: confident when wrong too
p_overconfident = np.array([0.99, 0.95, 0.01, 0.05, 0.98, 0.05, 0.9, 0.6])

# Underconfident model: never commits
p_underconfident = np.array([0.6, 0.55, 0.45, 0.4, 0.58, 0.42, 0.62, 0.38])

print(f"Log loss comparison:")
print(f"  Calibrated model:     {log_loss(y_true, p_calibrated):.4f}")
print(f"  Overconfident model:  {log_loss(y_true, p_overconfident):.4f}")
print(f"  Underconfident model: {log_loss(y_true, p_underconfident):.4f}")

# Show why overconfidence hurts
# labels=[0, 1] is required here: with a single example, y_true contains only
# one class and log_loss cannot infer the label set on its own
wrong_confident = np.array([0.99])
wrong_y = np.array([0])
print(f"\nPredicting 0.99 when true is 0: log loss = {log_loss(wrong_y, wrong_confident, labels=[0, 1]):.4f}")
print(f"Predicting 0.60 when true is 0: log loss = {log_loss(wrong_y, [0.6], labels=[0, 1]):.4f}")
print("Overconfidence is heavily penalized - log loss rewards calibration.")

When to use log loss: When you need well-calibrated probabilities - risk scoring, medical diagnosis, anything where the probability value itself matters (not just the class label).

Part 5 - Matthews Correlation Coefficient (MCC)

For severely imbalanced datasets, even F1 can be misleading. MCC is a balanced metric that accounts for all four cells of the confusion matrix:

$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

MCC = +1: perfect prediction
MCC = 0: random prediction
MCC = -1: perfectly inverted prediction

MCC is symmetric: it treats the positive and negative class equivalently. It gives high scores only when all four confusion matrix cells are good simultaneously.
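The symmetry claim can be checked by swapping which class is called positive. The sketch below builds predictions with a known confusion matrix (invented counts), computes MCC from the formula, and compares F1 and MCC before and after the swap.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef

# 10 positives, 90 negatives; predictions chosen so TP=6, FN=4, FP=9, TN=81
y_m = np.array([1] * 10 + [0] * 90)
p_m = np.array([1] * 6 + [0] * 4 + [1] * 9 + [0] * 81)

tn, fp, fn, tp = confusion_matrix(y_m, p_m, labels=[0, 1]).ravel()
mcc_manual = (tp * tn - fp * fn) / np.sqrt(
    float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Swap the roles of the two classes in both labels and predictions
y_sw, p_sw = 1 - y_m, 1 - p_m

print(f"MCC (formula): {mcc_manual:.4f}")
print(f"MCC original:  {matthews_corrcoef(y_m, p_m):.4f}")
print(f"MCC swapped:   {matthews_corrcoef(y_sw, p_sw):.4f}")
print(f"F1 original:   {f1_score(y_m, p_m):.4f}")
print(f"F1 swapped:    {f1_score(y_sw, p_sw):.4f}")
```

MCC is identical under the swap, while F1 changes substantially because it ignores TN and therefore depends on which class is labeled positive.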

from sklearn.metrics import matthews_corrcoef

# Extreme imbalance: 1000 negatives, 10 positives
np.random.seed(42)
y_imbalanced = np.array([0]*1000 + [1]*10)

# Model 1: Predict all negatives (naive baseline)
y_all_neg = np.zeros(1010, dtype=int)

# Model 2: Good model (catches 9/10 fraud, 5 false alarms)
y_good = np.zeros(1010, dtype=int)
y_good[1000:1009] = 1   # catch 9 of 10 fraud
y_good[50:55] = 1        # 5 false alarms on legit

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(f"{'Metric':<20} {'All-negative':>15} {'Good model':>15}")
print('-' * 52)
for metric_fn, name in [
    (accuracy_score, 'Accuracy'),
    (lambda yt, yp: precision_score(yt, yp, zero_division=0), 'Precision'),
    (recall_score, 'Recall'),
    (lambda yt, yp: f1_score(yt, yp, zero_division=0), 'F1'),
    (matthews_corrcoef, 'MCC'),
]:
    v_neg = metric_fn(y_imbalanced, y_all_neg)
    v_good = metric_fn(y_imbalanced, y_good)
    print(f"{name:<20} {v_neg:>15.4f} {v_good:>15.4f}")

print("\nAccuracy barely separates the all-negative baseline from the useful model.")
print("Recall, F1, and MCC expose the gap - MCC is preferred for severe imbalance.")

Part 6 - Multi-Class Metrics

For $K > 2$ classes, extend binary metrics using averaging strategies:

from sklearn.metrics import classification_report
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred, digits=3))

Averaging strategies:

Strategy	Formula	When to use
Macro	Mean of per-class scores	All classes equally important; highlights poor performance on rare classes
Weighted	Class-size-weighted mean	Imbalanced classes; overall performance weighted by frequency
Micro	Global TP/FP/FN across all classes	Every example equally important; equals accuracy for multiclass with one prediction per example

Macro F1:    F1(class 0) + F1(class 1) + ... + F1(class K-1)
             ─────────────────────────────────────────────────
                                   K

Weighted F1: Σ_k (n_k / n) * F1(class k)
             where n_k = number of examples in class k
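The three averages can be reproduced by hand from the per-class F1 scores. The sketch below uses a small invented 3-class example in which the rarest class is never predicted, so the gap between macro and weighted averaging is visible.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 6 examples of class 0, 3 of class 1, 1 of class 2 (never predicted)
y_true_mc = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred_mc = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# zero_division=0 scores the never-predicted class as F1 = 0 without a warning
per_class = f1_score(y_true_mc, y_pred_mc, average=None, zero_division=0)
support = np.bincount(y_true_mc)           # n_k for each class

macro_manual = per_class.mean()
weighted_manual = np.sum(support / support.sum() * per_class)
micro = f1_score(y_true_mc, y_pred_mc, average='micro')

print("Per-class F1:", np.round(per_class, 3))
print(f"Macro F1:    {macro_manual:.4f}")
print(f"Weighted F1: {weighted_manual:.4f}")
print(f"Micro F1:    {micro:.4f}  (accuracy = {accuracy_score(y_true_mc, y_pred_mc):.4f})")
```

Macro F1 is pulled down by the missed rare class, weighted F1 mostly reflects the majority class, and micro F1 matches accuracy.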

Part 7 - Choosing the Right Metric: A Decision Guide

Application	Recommended Primary	Secondary
Balanced classification	Accuracy + AUC-ROC	F1
Fraud detection	AUC-PR + F1	Recall at fixed FPR
Medical diagnosis (rare disease)	Recall (sensitivity)	AUC-PR
Spam filter	Precision + F1	AUC-ROC
Recommendation (top-k)	Precision@k, NDCG	AUC-PR
Risk scoring	Log loss + AUC-ROC	Calibration curve
Severe imbalance (< 1%)	MCC	AUC-PR
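The "Recall at fixed FPR" entry in the table above can be read off the ROC curve. The following is a minimal sketch on synthetic scores; the FPR budget `max_fpr` is a hypothetical value chosen for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_r = rng.choice([0, 1], size=5000, p=[0.98, 0.02])
# Synthetic scores: positives tend to score higher than negatives
s_r = np.where(y_r == 1, rng.beta(7, 3, 5000), rng.beta(2, 8, 5000))

max_fpr = 0.01   # hypothetical budget: at most 1% of legit cases flagged

fpr, tpr, thr = roc_curve(y_r, s_r)
# fpr and tpr are non-decreasing along the curve, so the last point within
# the budget has the highest recall; the first point (fpr = 0) always qualifies
idx = np.where(fpr <= max_fpr)[0][-1]
recall_at_fpr = tpr[idx]
threshold_at_fpr = thr[idx]

print(f"Recall at FPR <= {max_fpr:.2%}: {recall_at_fpr:.3f} "
      f"(threshold = {threshold_at_fpr:.3f})")
```

Predicting positive for scores at or above `threshold_at_fpr` reproduces that operating point on the same data; on new data the realized FPR will differ somewhat.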

Recommended Resources

Video Resources

StatQuest - Confusion Matrix: Essential visual walkthrough of TP/FP/TN/FN with examples. (~8 min)

StatQuest - Sensitivity and Specificity: Recall vs specificity, and when each matters. (~8 min)

StatQuest - ROC and AUC Explained: The best explanation of what AUC-ROC actually means. (~16 min)

StatQuest - Precision-Recall Curves: When to use PR curves instead of ROC curves. (~11 min)

Interview Questions

Q1: What is the difference between AUC-ROC and AUC-PR, and when does it matter?

AUC-ROC plots TPR vs FPR; AUC-PR plots precision vs recall. For balanced datasets, both give similar rankings of model quality. For imbalanced datasets (rare positive class), AUC-ROC is misleading: the FPR denominator includes the large negative class, so FPR is small even with many false positives in absolute terms - making a bad model look good. AUC-PR avoids this: precision is TP/(TP+FP) - affected by absolute FP count. A random classifier has AUC-ROC = 0.5 regardless of class balance, but AUC-PR (average precision) equals the class prevalence for a random classifier. For fraud detection (0.1% fraud), random AUC-PR = 0.001 - so even a model with AUC-PR = 0.1 is 100x better than random. Use AUC-PR whenever the positive class is rare.

Q2: When would you choose recall over precision, and why?

Recall = TP/(TP+FN). Choose recall when false negatives are more costly than false positives: cancer screening (missing a cancer is worse than a false scare), sepsis detection (missing sepsis → death), safety systems (missing a braking event). Choose precision when false positives are more costly: spam filter (blocking real emails is worse than letting some spam through), recommendation systems (showing irrelevant results degrades user experience), hiring models (making a wrong hire may cost more than missing a good candidate). F-beta with $\beta > 1$ formalizes "recall is more important": $\beta = 2$ means recall is twice as important as precision.

Q3: Explain log loss and why it penalizes overconfident wrong predictions so harshly.

$\text{Log loss} = -\frac{1}{n}\sum_i [y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)]$. For a positive example ($y_i=1$), the loss is $-\log(\hat{p}_i)$: it approaches 0 as $\hat{p}_i \to 1$ and approaches $\infty$ as $\hat{p}_i \to 0$. Predicting $\hat{p}=0.99$ for a positive example: loss = $-\log(0.99) \approx 0.01$. Predicting $\hat{p}=0.01$ for a positive example: loss = $-\log(0.01) \approx 4.6$. The logarithm creates an asymmetric penalty: being confidently wrong is catastrophically expensive, while being confidently right is only marginally better than being moderately confident and right. This incentivizes calibration - the model should express uncertainty when it is uncertain, not suppress it for better-looking average accuracy.

Q4: Your model has precision=0.95 and recall=0.3 on the positive class. What does this mean and is it acceptable?

Precision=0.95 means that when the model says "positive," it is right 95% of the time - very few false alarms. Recall=0.3 means the model catches only 30% of actual positives and misses 70%. This is a very conservative model: it flags only the cases it is most sure about and misses most positives. Whether that is acceptable depends entirely on the application. If each flag triggers an expensive or irreversible action (for example, automatically blocking an account) and other controls catch the remaining cases, high precision with 30% recall may be fine. For a fraud detection system that needs to catch most fraud, 30% recall means 70% of fraud goes undetected - unacceptable. F1 = 2 × (0.95 × 0.3) / (0.95 + 0.3) ≈ 0.46 - mediocre. The fix: lower the classification threshold to raise recall at the cost of precision, then re-evaluate the precision-recall tradeoff at the new operating point.
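The threshold adjustment described in Q4 might look like the following sketch: pick the highest threshold that still reaches a target recall, then report the precision paid for it. The synthetic scores and the `target_recall` value are assumptions for illustration; in practice the threshold should be chosen on a validation set, not the test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, recall_score

rng = np.random.default_rng(1)
y_q = rng.choice([0, 1], size=5000, p=[0.95, 0.05])
# Synthetic scores: positives tend to score higher than negatives
s_q = np.where(y_q == 1, rng.beta(5, 3, 5000), rng.beta(2, 6, 5000))

target_recall = 0.80   # hypothetical requirement

precision, recall, thresholds = precision_recall_curve(y_q, s_q)
# precision/recall have one extra trailing point (recall = 0) with no threshold
precision, recall = precision[:-1], recall[:-1]

# Recall is non-increasing as the threshold rises, so the last index that
# meets the target is the highest (most precise) qualifying threshold
idx = np.where(recall >= target_recall)[0][-1]
chosen_threshold = thresholds[idx]

y_pred_q = (s_q >= chosen_threshold).astype(int)
print(f"Chosen threshold: {chosen_threshold:.3f}")
print(f"Recall:    {recall[idx]:.3f}")
print(f"Precision: {precision[idx]:.3f}")
```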

Q5: What is the Matthews Correlation Coefficient and when should you use it instead of F1?

MCC = $(TP \cdot TN - FP \cdot FN) / \sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}$. It accounts for all four confusion matrix cells and ranges from -1 to +1. Unlike F1, MCC is symmetric - it gives the same score when you swap the positive and negative class labels. F1 ignores TN, so it can be misleadingly high when the class labeled positive is the majority: a model that predicts every example as positive on a dataset with 99% positives gets precision = 0.99 and recall = 1.0, so F1 ≈ 0.995, even though it never identifies a single negative. For that model the MCC formula is 0/0, which is conventionally reported as 0 - no better than chance. Use MCC when: (1) severe class imbalance (< 5% positive), (2) you want a symmetric single-number summary, (3) comparing models across datasets with different class balances. MCC is particularly common in bioinformatics and medical ML where rare-event detection is the norm.
