Evaluation Metrics for Regression

Reading time: ~35 minutes | Level: ML Foundations | Role: MLE, Data Scientist, MLOps, AI Engineer

Your house price model has an RMSE of $48,000. Your manager asks: "Is that good?" Most junior engineers freeze. The instinct is to say "it depends" - but a senior engineer says what it depends on and why.

That $48,000 is meaningless without knowing: (1) the mean house price - if it's $500K, RMSE is 9.6% of the mean, which may be acceptable; if it's $80K, it's catastrophic; (2) your baseline - does your model actually beat a naive mean-prediction model?; (3) the business cost - is a $48K error symmetric, or does overestimating trigger costly over-purchasing?
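As a rough illustration of checks (1) and (2), the sketch below puts an RMSE next to the target mean and a mean-prediction baseline. The synthetic prices, the noise level, and the helper name `rmse_in_context` are assumptions made for this example, not figures from a real model.

```python
import numpy as np

def rmse_in_context(y_true, y_pred):
    """Report RMSE next to the target mean and a mean-prediction baseline (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    # Baseline: always predict the mean of the observed targets
    baseline_rmse = np.sqrt(np.mean((y_true - y_true.mean()) ** 2))
    return {
        "rmse": rmse,
        "rmse_pct_of_mean": 100 * rmse / y_true.mean(),
        "baseline_rmse": baseline_rmse,
        "improvement_over_baseline_pct": 100 * (1 - rmse / baseline_rmse),
    }

# Synthetic house prices (assumed values, for illustration only)
rng = np.random.default_rng(0)
prices = rng.normal(500_000, 120_000, size=1_000)
preds = prices + rng.normal(0, 48_000, size=1_000)

report = rmse_in_context(prices, preds)
for name, value in report.items():
    print(f"{name}: {value:,.1f}")
```

Check (3), the business cost, cannot be read off the data; it has to come from whoever pays for the errors.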

This lesson builds the complete reasoning framework. You'll leave knowing not just the formulas, but which metric to use when, why each one fails in specific situations, how to diagnose your model from residuals, and how to monitor regression quality in production.

What You Will Learn

The math and intuition behind MAE, MSE, RMSE, R², MAPE, sMAPE, and Huber loss
When each metric misleads you and what to use instead
Residual analysis as a systematic model diagnostic tool - heteroscedasticity, autocorrelation, normality
Custom asymmetric loss functions for business-aligned evaluation
Proper baseline comparisons and the null model benchmark
Production monitoring with rolling window metrics and bias detection
Complete sklearn, scipy, and statsmodels implementations
Eight interview Q&As at senior MLE/DS level

Part 1 - The Error Decomposition Mindset

Before choosing a metric, understand what kind of errors you care about.

Every regression prediction produces a residual:
  residual_i = y_i - ŷ_i

Positive residual → model under-predicted
Negative residual → model over-predicted

The metric is just a different summary statistic over these residuals.
Different summaries optimise for different things.
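To see how different summaries of the same residuals can tell different stories, here is a small sketch on made-up numbers (the arrays are illustrative assumptions). One large miss barely moves the median absolute error but dominates RMSE.

```python
import numpy as np

y_true = np.array([300.0, 450.0, 200.0, 350.0, 500.0])
y_pred = np.array([310.0, 430.0, 220.0, 370.0, 380.0])  # last prediction misses by 120

residuals = y_true - y_pred  # positive = under-prediction

# Same residuals, five different summary statistics
summaries = {
    "bias (mean residual)": residuals.mean(),
    "MAE": np.abs(residuals).mean(),
    "median AE": np.median(np.abs(residuals)),
    "RMSE": np.sqrt((residuals ** 2).mean()),
    "max abs error": np.abs(residuals).max(),
}
for name, value in summaries.items():
    print(f"{name:>22}: {value:8.2f}")
```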

┌──────────────────────────────────────────────────────────┐
│                  METRIC FAMILIES                         │
│                                                          │
│  Scale-dependent     Scale-free       Variance-based     │
│  ───────────────     ──────────       ──────────────     │
│  MAE                 MAPE             R²                 │
│  RMSE                sMAPE            Adjusted R²        │
│  MSE                 RMSLE            Explained Variance │
│  Huber                                                   │
│                                                          │
│  Robust metrics      Directional      Custom/Business    │
│  ──────────────      ───────────      ───────────────    │
│  MAE                 Bias (mean res)  Asymmetric MAE     │
│  Huber               DA (dir. acc.)   Pinball / Quantile │
│  Median AE           Theil's U        Business loss fn   │
└──────────────────────────────────────────────────────────┘

The first question before choosing a metric: what does a costly error look like?

Are over-predictions worse than under-predictions? → Asymmetric loss
Are percentage errors more meaningful than absolute? → MAPE family
Do outlier predictions matter critically? → RMSE over MAE
Do you need scale-free comparison across datasets? → R² or MAPE
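For the first question in the list above, one simple way to encode "over-predictions cost more" is to weight the two directions differently. This is only a sketch: the function name `asymmetric_mae` and the 2:1 weighting are assumptions chosen for illustration, and real weights should come from the business cost.

```python
import numpy as np

def asymmetric_mae(y_true, y_pred, over_weight=2.0, under_weight=1.0):
    """MAE variant that weights over- and under-predictions differently (weights are assumed)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # residual < 0 means the model over-predicted
    weights = np.where(residuals < 0, over_weight, under_weight)
    return np.mean(weights * np.abs(residuals))

y_true = np.array([100.0, 100.0])
over = asymmetric_mae(y_true, np.array([110.0, 110.0]))   # both over-predicted by 10
under = asymmetric_mae(y_true, np.array([90.0, 90.0]))    # both under-predicted by 10
print(over)   # 20.0
print(under)  # 10.0
```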

Part 2 - MAE: Mean Absolute Error

$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$

Intuition: The average magnitude of prediction error, in the same units as your target. If your target is house prices in thousands of dollars and MAE = 18, you're off by $18K on average.

Why MAE corresponds to the median: MAE is minimised when your model predicts the conditional median $\hat{y} = \text{median}(y | x)$, not the mean. This is because the subgradient of $|y - \hat{y}|$ with respect to $\hat{y}$ is $-1$ when $\hat{y} < y$ and $+1$ when $\hat{y} > y$. Summed over all points, these terms cancel only when as many targets lie above $\hat{y}$ as below it, which is exactly the median.
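The claim can be checked numerically. The sketch below tries every observed value as a constant prediction on a skewed synthetic sample (the log-normal distribution and sample size are assumptions) and reports which constant minimises MAE and which minimises MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.0, sigma=1.0, size=2_001)  # right-skewed sample (assumed distribution)

# Evaluate every observed value as a candidate constant prediction
candidates = np.sort(y)
mae_per_candidate = np.array([np.mean(np.abs(y - c)) for c in candidates])
mse_per_candidate = np.array([np.mean((y - c) ** 2) for c in candidates])

best_for_mae = candidates[np.argmin(mae_per_candidate)]
best_for_mse = candidates[np.argmin(mse_per_candidate)]

print(f"median(y):           {np.median(y):.3f}")
print(f"best constant (MAE): {best_for_mae:.3f}")
print(f"mean(y):             {y.mean():.3f}")
print(f"best constant (MSE): {best_for_mse:.3f}")
```

On a right-skewed target the two answers differ noticeably, which is why the choice of metric also changes what a "good" prediction is.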

Properties:

Same units as target variable - directly interpretable
More robust to outliers than MSE: each error contributes in proportion to its magnitude, not its square
Not differentiable at zero (creates issues for some gradient-based optimizers)
No squaring means large errors are weighted linearly: an error of 20 counts twice as much as an error of 10, not four times as much

import numpy as np
from sklearn.metrics import mean_absolute_error, median_absolute_error
import matplotlib.pyplot as plt

# Setup
np.random.seed(42)
y_true = np.array([300, 450, 200, 350, 500, 275, 420, 380, 310, 480])
y_pred = np.array([310, 430, 220, 370, 480, 260, 440, 360, 325, 500])

residuals = y_true - y_pred

# Standard MAE
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE:        {mae:.2f}")

# Median Absolute Error (more robust than MAE for heavy-tailed distributions)
medae = median_absolute_error(y_true, y_pred)
print(f"Median AE:  {medae:.2f}")

# Manual implementation - helpful in interviews
mae_manual = np.mean(np.abs(y_true - y_pred))
print(f"MAE manual: {mae_manual:.2f}")

# Weighted MAE - when some predictions matter more
weights = np.array([1, 2, 1, 3, 2, 1, 2, 1, 1, 2])  # e.g., high-value items
wmae = np.average(np.abs(y_true - y_pred), weights=weights)
print(f"Weighted MAE: {wmae:.2f}")

MAE on Skewed Distributions

# Demonstrate MAE vs Median AE with outliers
y_true_outlier = np.append(y_true, [5000])  # add a large outlier
y_pred_outlier = np.append(y_pred, [400])   # model misses it badly

print(f"\nWith outlier:")
print(f"MAE:        {mean_absolute_error(y_true_outlier, y_pred_outlier):.2f}")
print(f"Median AE:  {median_absolute_error(y_true_outlier, y_pred_outlier):.2f}")
# MAE jumps dramatically; Median AE barely changes → Median AE more robust

When to use MAE:

Business costs are proportional to error magnitude (e.g., delivery delay in minutes)
Outliers are measurement noise that shouldn't dominate the metric
Stakeholders want simple, interpretable error in target units
Comparing models where you care about typical performance, not worst-case

Part 3 - MSE and RMSE

$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

Intuition: Squaring residuals before averaging means large errors are penalised much more than small ones. A prediction that's off by 20 contributes 4× as much to MSE as one off by 10, and 400× as much as one off by 1.

Why MSE corresponds to the mean: MSE is minimised by the conditional expectation $\hat{y} = \mathbb{E}[y | x]$. For a single constant prediction, taking the derivative of $\sum(y_i - \hat{y})^2$ with respect to $\hat{y}$ and setting it to zero gives $\hat{y} = \bar{y}$.
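Spelled out as a short sketch for the constant-prediction case (the conditional case follows the same argument applied at each value of $x$):

$\frac{\partial}{\partial \hat{y}} \sum_{i=1}^{n} (y_i - \hat{y})^2 = -2 \sum_{i=1}^{n} (y_i - \hat{y}) = 0$

$\Rightarrow \; n\hat{y} = \sum_{i=1}^{n} y_i \;\Rightarrow\; \hat{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}$

The second derivative is $2n > 0$, so this stationary point is a minimum.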

from sklearn.metrics import mean_squared_error
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.5])
y_pred = np.array([2.5,  0.0, 2.0, 8.0, 4.0])

mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.4f}")   # units are (target)²
print(f"RMSE: {rmse:.4f}")  # same units as target

# sklearn 1.4+ has root_mean_squared_error
try:
    from sklearn.metrics import root_mean_squared_error
    print(f"RMSE (sklearn 1.4+): {root_mean_squared_error(y_true, y_pred):.4f}")
except ImportError:
    pass

# Relationship: RMSE >= MAE always (Jensen's inequality)
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)
print(f"\nMAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"RMSE/MAE ratio: {rmse/mae:.4f}")  # 1.0 = uniform errors; >1 = outlier influence

The RMSE/MAE Ratio as an Outlier Detector

def error_ratio_analysis(y_true, y_pred, label="Model"):
    """
    RMSE/MAE ratio reveals how much a few large errors dominate RMSE.
    Thresholds below are rules of thumb, not hard limits.
    Ratio = 1.0:  all errors have equal magnitude (highly unlikely in practice)
    Ratio ≈ 1.25: what Gaussian errors produce (sqrt(pi/2)) - not a warning sign
    Ratio > 1.5:  few large errors dominating RMSE
    Ratio > 2.0:  severe outlier influence - investigate those samples
    """
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    ratio = rmse / mae

    print(f"\n{label}")
    print(f"  MAE:        {mae:.4f}")
    print(f"  RMSE:       {rmse:.4f}")
    print(f"  RMSE/MAE:   {ratio:.4f}")

    if ratio < 1.3:
        print("  → Errors look roughly Gaussian or tighter")
    elif ratio < 1.5:
        print("  → Moderate outlier influence")
    else:
        print("  → HIGH outlier influence - investigate worst predictions")

    # Find the worst predictions
    errors = np.abs(y_true - y_pred)
    worst_idx = np.argsort(errors)[-3:]
    print(f"  → Worst 3 errors: indices {worst_idx}, errors {errors[worst_idx]}")

# Example: compare uniform vs outlier-dominated errors
y_t = np.random.normal(100, 10, 100)
y_uniform = y_t + np.random.normal(0, 5, 100)
y_outlier = y_t + np.random.normal(0, 5, 100)
y_outlier[[10, 20, 30]] += 80  # inject 3 large errors

error_ratio_analysis(y_t, y_uniform, "Uniform errors")
error_ratio_analysis(y_t, y_outlier, "Outlier-influenced errors")
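As a reference point for reading the ratio, it helps to know what well-behaved error distributions produce. The sketch below simulates Gaussian and Laplace errors (the scale of 5 and the sample size are arbitrary assumptions) and compares the empirical RMSE/MAE ratio with the theoretical values $\sqrt{\pi/2}$ and $\sqrt{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def rmse_mae_ratio(errors):
    """RMSE divided by MAE for a vector of errors."""
    return np.sqrt(np.mean(errors ** 2)) / np.mean(np.abs(errors))

gaussian_ratio = rmse_mae_ratio(rng.normal(0, 5, n))
laplace_ratio = rmse_mae_ratio(rng.laplace(0, 5, n))

print(f"Gaussian errors: {gaussian_ratio:.3f}  (theory: sqrt(pi/2) = {np.sqrt(np.pi / 2):.3f})")
print(f"Laplace errors:  {laplace_ratio:.3f}  (theory: sqrt(2)    = {np.sqrt(2):.3f})")
```

A ratio near 1.25 is therefore what clean Gaussian noise looks like; values well above 1.4 point to tails heavier than either reference distribution.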

Visualising the Difference

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

errors = np.linspace(-10, 10, 100)

# MAE penalty - linear
ax1.plot(errors, np.abs(errors), 'b-', linewidth=2, label='MAE penalty')
ax1.fill_between(errors, np.abs(errors), alpha=0.2)
ax1.set_title('MAE: Linear penalty (robust to outliers)')
ax1.set_xlabel('Residual')
ax1.set_ylabel('Loss contribution')
ax1.legend()

# RMSE penalty - quadratic
ax2.plot(errors, errors**2, 'r-', linewidth=2, label='MSE penalty (RMSE²)')
ax2.fill_between(errors, errors**2, alpha=0.2, color='r')
ax2.set_title('MSE: Quadratic penalty (sensitive to outliers)')
ax2.set_xlabel('Residual')
ax2.set_ylabel('Loss contribution')
ax2.legend()

plt.tight_layout()
plt.savefig('mae_vs_mse_penalty.png', dpi=150)

Part 4 - R² (Coefficient of Determination)

$R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$

Intuition: The fraction of variance in $y$ explained by the model. A naive baseline that always predicts $\bar{y}$ has R² = 0 exactly. A perfect model has R² = 1. Any model worse than predicting the mean has R² < 0.

SS_tot: total sum of squares (how spread out the actual values are around their mean)
SS_res: residual sum of squares (the variation the model fails to explain)

R² = 1 - fraction_unexplained
   = fraction_explained

from sklearn.metrics import r2_score
import numpy as np

np.random.seed(42)
X = np.random.randn(200, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, 200)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

r2 = r2_score(y, y_pred)
print(f"R²: {r2:.4f}")

# Manual verification
ss_res = np.sum((y - y_pred)**2)
ss_tot = np.sum((y - np.mean(y))**2)
r2_manual = 1 - ss_res / ss_tot
print(f"R² (manual): {r2_manual:.4f}")

# Baseline comparison
y_pred_mean = np.full_like(y, y.mean())  # always predict mean
r2_baseline = r2_score(y, y_pred_mean)
print(f"R² (baseline mean): {r2_baseline:.4f}")  # exactly 0.0

R² Pitfalls - The Complete List

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

# PITFALL 1: Training R² never decreases as features are added (Adjusted R² corrects for this)
# -------------------------------------------------------------------
# Adding useless features raises in-sample R² even if they have no real predictive power

from sklearn.linear_model import LinearRegression

np.random.seed(42)
n = 100
y_true = np.random.randn(n)
X_random = np.random.randn(n, 49)  # pure-noise features, generated once so feature sets are nested

# R² as we add random features one at a time
r2_values = []
for k in range(1, 50):
    model = LinearRegression()
    model.fit(X_random[:, :k], y_true)
    y_p = model.predict(X_random[:, :k])
    r2_values.append(r2_score(y_true, y_p))

print("R² with random features:")
for k in [1, 10, 20, 40, 49]:
    print(f"  k={k:2d} features: R²={r2_values[k - 1]:.4f}")
# In-sample R² keeps climbing (toward 1.0 as k approaches n - 1) even with pure noise features!

# PITFALL 2: High R² doesn't prevent terrible predictions in some regions
# -------------------------------------------------------------------
# Anscombe's quartet (first three of the four datasets): near-identical R², wildly different structure
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

from sklearn.linear_model import LinearRegression
X_anscombe = x.reshape(-1, 1)
for label, yi in [("Dataset I", y1), ("Dataset II", y2), ("Dataset III", y3)]:
    m = LinearRegression().fit(X_anscombe, yi)
    r2 = r2_score(yi, m.predict(X_anscombe))
    print(f"{label}: R² = {r2:.3f}")  # All ≈ 0.67 - despite very different shapes!

# PITFALL 3: R² can be negative on test data even with a good training R²
# -------------------------------------------------------------------
# A straight line fits a curved relationship well inside the training range,
# then extrapolates badly once the inputs shift outside it.
X_train = np.random.uniform(0, 1, size=(100, 1))
X_test  = np.random.uniform(2, 3, size=(100, 1))  # shifted input range
y_train = X_train[:, 0] ** 2 + np.random.randn(100) * 0.05
y_test  = X_test[:, 0] ** 2 + np.random.randn(100) * 0.05

model = LinearRegression().fit(X_train, y_train)
r2_train = r2_score(y_train, model.predict(X_train))
r2_test  = r2_score(y_test,  model.predict(X_test))
print(f"\nDistribution shift: R²_train={r2_train:.3f}, R²_test={r2_test:.3f}")
# R²_test is negative here - on the shifted data the model is worse than predicting the test mean

Adjusted R²

Adjusted R² penalises adding features that don't improve the model:

$\bar{R}^2 = 1 - (1 - R^2) \cdot \frac{n - 1}{n - k - 1}$

where $k$ = number of predictors, $n$ = number of observations.

def adjusted_r2(r2: float, n: int, k: int) -> float:
    """
    Compute Adjusted R².

    r2: R-squared on the dataset
    n:  number of observations
    k:  number of predictor features (excluding intercept)

    Returns:
        Adjusted R² - penalises unnecessary features
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Demo: Adjusted R² vs R² as features are added
np.random.seed(42)
n = 100
X_true = np.random.randn(n, 3)  # 3 truly predictive features
y = X_true @ np.array([2, -1, 0.5]) + np.random.randn(n) * 0.5

print(f"{'Features':>10} {'R²':>10} {'Adj R²':>10}")
print("-" * 35)

for k_noise in range(0, 15, 3):
    X_noise = np.hstack([X_true, np.random.randn(n, k_noise)])
    model = LinearRegression().fit(X_noise, y)
    y_p   = model.predict(X_noise)
    r2    = r2_score(y, y_p)
    adjr2 = adjusted_r2(r2, n, 3 + k_noise)
    print(f"{3 + k_noise:>10} {r2:>10.4f} {adjr2:>10.4f}")
# Adj R² stays roughly flat or drops as noise features are added; R² keeps creeping up

Part 5 - MAPE and Its Variants

$\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|$

Intuition: Average percentage deviation from truth. Scale-free - directly comparable across datasets with different target scales. "MAPE = 8%" is intuitively meaningful to non-technical stakeholders.

import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Percentage Error - fails when y_true contains zeros."""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    if np.any(y_true == 0):
        raise ValueError("MAPE undefined when y_true contains zeros.")
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Example
y_true = np.array([100, 200, 300, 400, 500])
y_pred = np.array([110, 185, 315, 390, 520])
print(f"MAPE: {mape(y_true, y_pred):.2f}%")
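scikit-learn also ships a MAPE implementation (available since version 0.24). The sketch below runs it on the same numbers as the hand-written version. Two behaviours are worth knowing: it returns a fraction rather than a percentage, and it does not raise on zero targets; it divides by a tiny epsilon instead, which yields an enormous value.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([100, 200, 300, 400, 500])
y_pred = np.array([110, 185, 315, 390, 520])

# scikit-learn returns a fraction, not a percentage
mape_fraction = mean_absolute_percentage_error(y_true, y_pred)
print(f"MAPE (sklearn, fraction): {mape_fraction:.4f}")
print(f"MAPE (as percent):        {mape_fraction * 100:.2f}%")

# A zero target does not raise; the result is a very large finite number
zero_case = mean_absolute_percentage_error(np.array([0.0, 1.0]), np.array([1.0, 1.0]))
print(f"MAPE with a zero target:  {zero_case:.3e}")
```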

Why MAPE Fails - Three Critical Problems

Problem 1: Division by zero when y_true = 0
  y_true = 0, y_pred = 5 → MAPE = inf
  Common in sales forecasting (0 units sold on some days)

Problem 2: Asymmetric penalty
  y_true = 100, y_pred = 50  → error = 50%   (prediction is half the truth)
  y_true = 100, y_pred = 200 → error = 100%  (prediction is double the truth)
  For non-negative predictions, under-prediction error is capped at 100%; over-prediction error is unbounded.
  → Models minimising MAPE are biased toward under-prediction.

Problem 3: Scale distortion
  y_true = 1, y_pred = 2    → 100% error  (small absolute error, large %)
  y_true = 1000, y_pred = 1050 → 5% error (large absolute error, small %)
  Small-valued targets dominate MAPE even when their absolute errors are tiny.
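Problem 2 can be made visible with a small experiment. The sketch below searches for the best constant prediction under MAPE and under MAE on a synthetic right-skewed target (the log-normal distribution and grid resolution are assumptions for illustration). The MAPE-optimal constant sits well below the median.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=5.0, sigma=0.5, size=5_000)  # positive, right-skewed targets (assumed)

# Search over constant predictions on a grid spanning the data
grid = np.linspace(y.min(), y.max(), 2_000)
mape_per_c = np.array([np.mean(np.abs((y - c) / y)) for c in grid])
mae_per_c = np.array([np.mean(np.abs(y - c)) for c in grid])

best_mape = grid[np.argmin(mape_per_c)]
best_mae = grid[np.argmin(mae_per_c)]

print(f"median(y):            {np.median(y):.1f}")
print(f"best constant (MAE):  {best_mae:.1f}")
print(f"best constant (MAPE): {best_mape:.1f}")
```

Dividing by $y_i$ gives small targets more weight, so the optimiser shades its predictions downward to keep those percentage errors small.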

sMAPE - Symmetric Solution

$\text{sMAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|) / 2}$

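def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Symmetric MAPE - bounded in [0%, 200%] and defined when y_true contains zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    numerator   = np.abs(y_true - y_pred)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    # Edge case: y_true = 0 and y_pred = 0 → counted as 0% error (not dropped from the mean)
    ratio = np.divide(numerator, denominator,
                      out=np.zeros_like(numerator), where=denominator != 0)
    return np.mean(ratio) * 100

# Compare MAPE vs sMAPE on equal-sized over- and under-predictions.
# Note: sMAPE is bounded but not perfectly symmetric - for the same absolute
# error it scores the under-prediction higher than the over-prediction.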
cases = [
    ("Under-predict by 50%", 100, 50),
    ("Over-predict by 50%",  100, 150),
    ("Under-predict by 90%", 100, 10),
    ("Over-predict by 90%",  100, 190),
]
print(f"\n{'Case':<30} {'MAPE':>8} {'sMAPE':>8}")
print("-" * 50)
for label, yt, yp in cases:
    m = abs(yt - yp) / abs(yt) * 100
    s = abs(yt - yp) / ((abs(yt) + abs(yp)) / 2) * 100
    print(f"{label:<30} {m:>8.1f}% {s:>8.1f}%")

RMSLE - Root Mean Squared Log Error

Useful when the target spans multiple orders of magnitude:

$\text{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\log(y_i + 1) - \log(\hat{y}_i + 1))^2}$

from sklearn.metrics import mean_squared_error, mean_squared_log_error
import numpy as np

y_true = np.array([100, 1000, 10000, 100000])
y_pred = np.array([120, 1100,  9500,  95000])

rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
rmse  = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"RMSE:  {rmse:.2f}")   # ≈2513 - dominated by the 5,000 miss on the 100K target
print(f"RMSLE: {rmsle:.4f}")  # ≈0.108 - driven by relative error, so the 20% miss on 100 counts most
# Use RMSLE for house prices, population, revenue - non-negative targets that are roughly log-normal
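RMSLE is simply RMSE computed on `log1p`-transformed values, which the sketch below writes out by hand on the same four points. The per-sample log errors make it easy to see which prediction drives the metric.

```python
import numpy as np

y_true = np.array([100, 1000, 10000, 100000], dtype=float)
y_pred = np.array([120, 1100, 9500, 95000], dtype=float)

# RMSLE written out directly: RMSE between log1p-transformed values
log_errors = np.log1p(y_pred) - np.log1p(y_true)  # positive = over-prediction
rmsle_manual = np.sqrt(np.mean(log_errors ** 2))

for yt, yp, le in zip(y_true, y_pred, log_errors):
    print(f"y={yt:>9.0f}  pred={yp:>9.0f}  log error={le:+.4f}  ratio={np.exp(le):.3f}")
print(f"RMSLE (manual): {rmsle_manual:.4f}")
```

Each log error is approximately the log of the ratio between prediction and truth, so a 20% miss costs roughly the same whether the target is 100 or 100,000.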

Part 6 - Huber Loss

Huber loss combines the strengths of MSE and MAE: it is quadratic for small errors (smooth, fast convergence) and linear for large errors (robust to outliers).
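A minimal sketch of the standard Huber definition, with residual $r = y - \hat{y}$ and threshold $\delta$: the loss is $\frac{1}{2}r^2$ when $|r| \le \delta$ and $\delta(|r| - \frac{1}{2}\delta)$ otherwise. The function name `huber_loss`, the default $\delta = 1.0$, and the example arrays are assumptions for illustration; in practice $\delta$ should be set relative to the scale of the target.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss; delta is the residual size where the penalty switches from quadratic to linear."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_res = np.abs(residuals)
    quadratic = 0.5 * residuals ** 2           # used when |r| <= delta
    linear = delta * (abs_res - 0.5 * delta)   # used when |r| > delta
    return np.mean(np.where(abs_res <= delta, quadratic, linear))

y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.5])
y_pred = np.array([2.5,  0.0, 2.0, 8.0, 14.5])  # last prediction is an outlier-sized miss

huber = huber_loss(y_true, y_pred, delta=1.0)
mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))

print(f"Huber (delta=1.0): {huber:.4f}")
print(f"MSE:               {mse:.4f}")
print(f"MAE:               {mae:.4f}")
```

The single miss of 10 contributes 100 to the squared-error sum but only 9.5 to the Huber sum, which is why the Huber value stays close to MAE here.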
