Skip to main content

Evaluation Metrics for Regression

Reading time: ~35 minutes | Level: ML Foundations | Role: MLE, Data Scientist, MLOps, AI Engineer

Your house price model has an RMSE of $48,000. Your manager asks: "Is that good?" Most junior engineers freeze. The instinct is to say "it depends" - but a senior engineer says what it depends on and why.

That 48,000ismeaninglesswithoutknowing:(1)themeanhousepriceifits48,000 is meaningless without knowing: (1) the mean house price - if it's 500K, RMSE is 9.6% which may be acceptable; if 80Kitscatastrophic;(2)yourbaselinecananaivemeanpredictionmodelbeatit?;(3)thebusinesscostisa80K it's catastrophic; (2) your baseline - can a naive mean-prediction model beat it?; (3) the business cost - is a 48K error symmetric, or does overestimating trigger costly over-purchasing?

This lesson builds the complete reasoning framework. You'll leave knowing not just the formulas, but which metric to use when, why each one fails in specific situations, how to diagnose your model from residuals, and how to monitor regression quality in production.

What You Will Learn

  • The math and intuition behind MAE, MSE, RMSE, R², MAPE, sMAPE, and Huber loss
  • When each metric misleads you and what to use instead
  • Residual analysis as a systematic model diagnostic tool - heteroscedasticity, autocorrelation, normality
  • Custom asymmetric loss functions for business-aligned evaluation
  • Proper baseline comparisons and the null model benchmark
  • Production monitoring with rolling window metrics and bias detection
  • Complete sklearn, scipy, and statsmodels implementations
  • Eight interview Q&As at senior MLE/DS level

Part 1 - The Error Decomposition Mindset

Before choosing a metric, understand what kind of errors you care about.

Every regression prediction produces a residual:
residual_i = y_i - ŷ_i

Positive residual → model under-predicted
Negative residual → model over-predicted

The metric is just a different summary statistic over these residuals.
Different summaries optimise for different things.
┌──────────────────────────────────────────────────────────┐
│ METRIC FAMILIES │
│ │
│ Scale-dependent Scale-free Variance-based │
│ ─────────────── ────────── ────────────── │
│ MAE MAPE R² │
│ RMSE sMAPE Adjusted R² │
│ MSE RMSLE Explained Variance │
│ Huber │
│ │
│ Robust metrics Directional Custom/Business │
│ ────────────── ─────────── ─────────────── │
│ MAE Bias (mean res) Asymmetric MAE │
│ Huber DA (dir. acc.) Pinball / Quantile │
│ Median AE Theil's U Business loss fn │
└──────────────────────────────────────────────────────────┘

The first question before choosing a metric: what does a costly error look like?

  • Are over-predictions worse than under-predictions? → Asymmetric loss
  • Are percentage errors more meaningful than absolute? → MAPE family
  • Do outlier predictions matter critically? → RMSE over MAE
  • Do you need scale-free comparison across datasets? → R² or MAPE

Part 2 - MAE: Mean Absolute Error

MAE=1ni=1nyiy^i\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

Intuition: The average magnitude of prediction error, in the same units as your target. If your target is house prices in thousands of dollars and MAE = 18, you're off by $18K on average.

Why MAE corresponds to the median: MAE is minimised when your model predicts the conditional median y^=median(yx)\hat{y} = \text{median}(y | x), not the mean. This is because the subgradient of yy^|y - \hat{y}| with respect to y^\hat{y} is 1-1 when y^<y\hat{y} < y and +1+1 when y^>y\hat{y} > y, leading to the median as the minimiser.

Properties:

  • Same units as target variable - directly interpretable
  • Robust to outliers: each error contributes proportionally to its magnitude
  • Not differentiable at zero (creates issues for some gradient-based optimizers)
  • No squaring means small and large errors are treated on the same scale
import numpy as np
from sklearn.metrics import mean_absolute_error, median_absolute_error
import matplotlib.pyplot as plt

# Setup
np.random.seed(42)
y_true = np.array([300, 450, 200, 350, 500, 275, 420, 380, 310, 480])
y_pred = np.array([310, 430, 220, 370, 480, 260, 440, 360, 325, 500])

residuals = y_true - y_pred

# Standard MAE
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.2f}")

# Median Absolute Error (more robust than MAE for heavy-tailed distributions)
medae = median_absolute_error(y_true, y_pred)
print(f"Median AE: {medae:.2f}")

# Manual implementation - helpful in interviews
mae_manual = np.mean(np.abs(y_true - y_pred))
print(f"MAE manual: {mae_manual:.2f}")

# Weighted MAE - when some predictions matter more
weights = np.array([1, 2, 1, 3, 2, 1, 2, 1, 1, 2]) # e.g., high-value items
wmae = np.average(np.abs(y_true - y_pred), weights=weights)
print(f"Weighted MAE: {wmae:.2f}")

MAE on Skewed Distributions

# Demonstrate MAE vs Median AE with outliers
y_true_outlier = np.append(y_true, [5000]) # add a large outlier
y_pred_outlier = np.append(y_pred, [400]) # model misses it badly

print(f"\nWith outlier:")
print(f"MAE: {mean_absolute_error(y_true_outlier, y_pred_outlier):.2f}")
print(f"Median AE: {median_absolute_error(y_true_outlier, y_pred_outlier):.2f}")
# MAE jumps dramatically; Median AE barely changes → Median AE more robust

When to use MAE:

  • Business costs are proportional to error magnitude (e.g., delivery delay in minutes)
  • Outliers are measurement noise that shouldn't dominate the metric
  • Stakeholders want simple, interpretable error in target units
  • Comparing models where you care about typical performance, not worst-case

Part 3 - MSE and RMSE

MSE=1ni=1n(yiy^i)2\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

RMSE=MSE=1ni=1n(yiy^i)2\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

Intuition: Squaring residuals before averaging means large errors are penalised much more than small ones. A prediction that's off by 20 contributes 4× more to MSE than one off by 10, and 400× more than one off by 1.

Why MSE corresponds to the mean: MSE is minimised by the conditional expectation y^=E[yx]\hat{y} = \mathbb{E}[y | x]. Taking the derivative of (yiy^)2\sum(y_i - \hat{y})^2 with respect to y^\hat{y} and setting to zero gives y^=yˉ\hat{y} = \bar{y}.

from sklearn.metrics import mean_squared_error
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.5])
y_pred = np.array([2.5, 0.0, 2.0, 8.0, 4.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print(f"MSE: {mse:.4f}") # units are (target)²
print(f"RMSE: {rmse:.4f}") # same units as target

# sklearn 1.4+ has root_mean_squared_error
try:
from sklearn.metrics import root_mean_squared_error
print(f"RMSE (sklearn 1.4+): {root_mean_squared_error(y_true, y_pred):.4f}")
except ImportError:
pass

# Relationship: RMSE >= MAE always (Jensen's inequality)
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)
print(f"\nMAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"RMSE/MAE ratio: {rmse/mae:.4f}") # 1.0 = uniform errors; >1 = outlier influence

The RMSE/MAE Ratio as an Outlier Detector

def error_ratio_analysis(y_true, y_pred, label="Model"):
"""
RMSE/MAE ratio reveals outlier influence on the metric.
Ratio = 1.0: all errors are equal (highly unlikely in practice)
Ratio > 1.5: few large errors dominating RMSE
Ratio > 2.0: severe outlier influence - investigate those samples
"""
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
ratio = rmse / mae

print(f"\n{label}")
print(f" MAE: {mae:.4f}")
print(f" RMSE: {rmse:.4f}")
print(f" RMSE/MAE: {ratio:.4f}")

if ratio < 1.2:
print(" → Errors are fairly uniform")
elif ratio < 1.5:
print(" → Moderate outlier influence")
else:
print(" → HIGH outlier influence - investigate worst predictions")

# Find the worst predictions
errors = np.abs(y_true - y_pred)
worst_idx = np.argsort(errors)[-3:]
print(f" → Worst 3 errors: indices {worst_idx}, errors {errors[worst_idx]}")

# Example: compare uniform vs outlier-dominated errors
y_t = np.random.normal(100, 10, 100)
y_uniform = y_t + np.random.normal(0, 5, 100)
y_outlier = y_t + np.random.normal(0, 5, 100)
y_outlier[[10, 20, 30]] += 80 # inject 3 large errors

error_ratio_analysis(y_t, y_uniform, "Uniform errors")
error_ratio_analysis(y_t, y_outlier, "Outlier-influenced errors")

Visualising the Difference

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

errors = np.linspace(-10, 10, 100)

# MAE penalty - linear
ax1.plot(errors, np.abs(errors), 'b-', linewidth=2, label='MAE penalty')
ax1.fill_between(errors, np.abs(errors), alpha=0.2)
ax1.set_title('MAE: Linear penalty (robust to outliers)')
ax1.set_xlabel('Residual')
ax1.set_ylabel('Loss contribution')
ax1.legend()

# RMSE penalty - quadratic
ax2.plot(errors, errors**2, 'r-', linewidth=2, label='MSE penalty (RMSE²)')
ax2.fill_between(errors, errors**2, alpha=0.2, color='r')
ax2.set_title('MSE: Quadratic penalty (sensitive to outliers)')
ax2.set_xlabel('Residual')
ax2.set_ylabel('Loss contribution')
ax2.legend()

plt.tight_layout()
plt.savefig('mae_vs_mse_penalty.png', dpi=150)

Part 4 - R² (Coefficient of Determination)

R2=1SSresSStot=1i=1n(yiy^i)2i=1n(yiyˉ)2R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}

Intuition: The fraction of variance in yy explained by the model. A naive baseline that always predicts yˉ\bar{y} has R² = 0 exactly. A perfect model has R² = 1. Any model worse than predicting the mean has R² < 0.

SS_tot: total variance in y (how spread out the actual values are)
SS_res: residual variance (how much variance the model fails to explain)

R² = 1 - fraction_unexplained
= fraction_explained
from sklearn.metrics import r2_score
import numpy as np

np.random.seed(42)
X = np.random.randn(200, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, 200)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

r2 = r2_score(y, y_pred)
print(f"R²: {r2:.4f}")

# Manual verification
ss_res = np.sum((y - y_pred)**2)
ss_tot = np.sum((y - np.mean(y))**2)
r2_manual = 1 - ss_res / ss_tot
print(f"R² (manual): {r2_manual:.4f}")

# Baseline comparison
y_pred_mean = np.full_like(y, y.mean()) # always predict mean
r2_baseline = r2_score(y, y_pred_mean)
print(f"R² (baseline mean): {r2_baseline:.4f}") # exactly 0.0

R² Pitfalls - The Complete List

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

# PITFALL 1: R² always increases with more features (Adjusted R² fixes this)
# -------------------------------------------------------------------
# Adding useless features increases R² even if they have no real predictive power

np.random.seed(42)
n = 100
y_true = np.random.randn(n)

# R² as we add random features
r2_values = []
for k in range(1, 50):
X_random = np.random.randn(n, k)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_random, y_true)
y_p = model.predict(X_random)
r2_values.append(r2_score(y_true, y_p))

print("R² with random features:")
for k, r2 in zip([1, 10, 20, 40, 49], [r2_values[0], r2_values[9],
r2_values[19], r2_values[39], r2_values[48]]):
print(f" k={k:2d} features: R²={r2:.4f}")
# R² grows toward 1.0 even with pure noise features!

# PITFALL 2: High R² doesn't prevent terrible predictions in some regions
# -------------------------------------------------------------------
# Anscombe's quartet: 4 datasets with identical R², wildly different structure
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

from sklearn.linear_model import LinearRegression
X_anscombe = x.reshape(-1, 1)
for label, yi in [("Dataset I", y1), ("Dataset II", y2), ("Dataset III", y3)]:
m = LinearRegression().fit(X_anscombe, yi)
r2 = r2_score(yi, m.predict(X_anscombe))
print(f"{label}: R² = {r2:.3f}") # All ≈ 0.67 - despite very different shapes!

# PITFALL 3: R² can be negative on test data even with a good training R²
# -------------------------------------------------------------------
X_train = np.random.randn(100, 5)
X_test = np.random.randn(100, 5) * 10 # different scale/distribution
y_train = X_train[:, 0] * 2 + np.random.randn(100) * 0.1
y_test = X_test[:, 0] * 2 + np.random.randn(100) * 0.1

model = LinearRegression().fit(X_train, y_train)
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
print(f"\nDistribution shift: R²_train={r2_train:.3f}, R²_test={r2_test:.3f}")
# R²_test can be negative - model fails to generalize

Adjusted R²

Adjusted R² penalises adding features that don't improve the model:

Rˉ2=1(1R2)n1nk1\bar{R}^2 = 1 - (1 - R^2) \cdot \frac{n - 1}{n - k - 1}

where kk = number of predictors, nn = number of observations.

def adjusted_r2(r2: float, n: int, k: int) -> float:
"""
Compute Adjusted R².

r2: R-squared on the dataset
n: number of observations
k: number of predictor features (excluding intercept)

Returns:
Adjusted R² - penalises unnecessary features
"""
return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Demo: Adjusted R² vs R² as features are added
np.random.seed(42)
n = 100
X_true = np.random.randn(n, 3) # 3 truly predictive features
y = X_true @ np.array([2, -1, 0.5]) + np.random.randn(n) * 0.5

print(f"{'Features':>10} {'R²':>10} {'Adj R²':>10}")
print("-" * 35)

for k_noise in range(0, 15, 3):
X_noise = np.hstack([X_true, np.random.randn(n, k_noise)])
model = LinearRegression().fit(X_noise, y)
y_p = model.predict(X_noise)
r2 = r2_score(y, y_p)
adjr2 = adjusted_r2(r2, n, 3 + k_noise)
print(f"{3 + k_noise:>10} {r2:>10.4f} {adjr2:>10.4f}")
# Adj R² stays stable or drops when adding noise; R² always increases

Part 5 - MAPE and Its Variants

MAPE=100%ni=1nyiy^iyi\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|

Intuition: Average percentage deviation from truth. Scale-free - directly comparable across datasets with different target scales. "MAPE = 8%" is intuitively meaningful to non-technical stakeholders.

import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
"""Mean Absolute Percentage Error - fails when y_true contains zeros."""
y_true, y_pred = np.array(y_true), np.array(y_pred)
if np.any(y_true == 0):
raise ValueError("MAPE undefined when y_true contains zeros.")
return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Example
y_true = np.array([100, 200, 300, 400, 500])
y_pred = np.array([110, 185, 315, 390, 520])
print(f"MAPE: {mape(y_true, y_pred):.2f}%")

Why MAPE Fails - Three Critical Problems

Problem 1: Division by zero when y_true = 0
y_true = 0, y_pred = 5 → MAPE = inf
Common in sales forecasting (0 units sold on some days)

Problem 2: Asymmetric penalty
y_true = 100, y_pred = 50 → error = 50% (under-predict by half)
y_true = 100, y_pred = 200 → error = 100% (double the prediction)
The under-prediction is bounded at 100%, over-prediction unbounded.
→ Models minimising MAPE are biased toward under-prediction.

Problem 3: Scale distortion
y_true = 1, y_pred = 2 → 100% error (small absolute error, large %)
y_true = 1000, y_pred = 1050 → 5% error (large absolute error, small %)
Small-valued predictions dominate MAPE even with tiny absolute errors.

sMAPE - Symmetric Solution

sMAPE=100%ni=1nyiy^i(yi+y^i)/2\text{sMAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|) / 2}

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
"""Symmetric Mean Absolute Percentage Error - handles asymmetry."""
y_true, y_pred = np.array(y_true), np.array(y_pred)
numerator = np.abs(y_true - y_pred)
denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
# Handle edge case: both y_true=0 and y_pred=0 → 0% error
mask = denominator != 0
return np.mean(numerator[mask] / denominator[mask]) * 100

# Compare MAPE vs sMAPE symmetry
cases = [
("Under-predict by 50%", 100, 50),
("Over-predict by 50%", 100, 150),
("Under-predict by 90%", 100, 10),
("Over-predict by 90%", 100, 190),
]
print(f"\n{'Case':<30} {'MAPE':>8} {'sMAPE':>8}")
print("-" * 50)
for label, yt, yp in cases:
m = abs(yt - yp) / abs(yt) * 100
s = abs(yt - yp) / ((abs(yt) + abs(yp)) / 2) * 100
print(f"{label:<30} {m:>8.1f}% {s:>8.1f}%")

RMSLE - Root Mean Squared Log Error

Useful when the target spans multiple orders of magnitude:

RMSLE=1ni=1n(log(yi+1)log(y^i+1))2\text{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\log(y_i + 1) - \log(\hat{y}_i + 1))^2}

from sklearn.metrics import mean_squared_log_error
import numpy as np

y_true = np.array([100, 1000, 10000, 100000])
y_pred = np.array([120, 1100, 9500, 95000])

rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"RMSE: {rmse:.2f}") # 3606 - dominated by 100K prediction
print(f"RMSLE: {rmsle:.4f}") # ≈0.05 - balanced across all scales
# Use RMSLE for house prices, population, revenue - anything log-normal

Part 6 - Huber Loss

Huber loss is the best of both worlds - quadratic for small errors (smooth, fast convergence), linear for large errors (robust to outliers):

\frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta \cdot |y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$ The parameter $\delta$ defines the transition point. Errors below $\delta$ are treated quadratically; errors above $\delta$ linearly. ```python import numpy as np import matplotlib.pyplot as plt from sklearn.metrics import mean_squared_error, mean_absolute_error def huber_loss(y_true: np.ndarray, y_pred: np.ndarray, delta: float = 1.0) -> float: """Huber loss - robust to outliers, smooth near zero.""" residuals = np.abs(y_true - y_pred) quadratic = 0.5 * residuals**2 linear = delta * residuals - 0.5 * delta**2 return np.mean(np.where(residuals <= delta, quadratic, linear)) # Visualise the Huber loss for different deltas errors = np.linspace(-5, 5, 1000) fig, ax = plt.subplots(figsize=(10, 5)) ax.plot(errors, 0.5 * errors**2, 'b-', linewidth=2, label='MSE (quadratic)') ax.plot(errors, np.abs(errors), 'g-', linewidth=2, label='MAE (linear)') ax.plot(errors, [huber_loss(np.array([0.0]), np.array([e]), delta=2.0) for e in errors], 'r-', linewidth=2, label='Huber (δ=2)') ax.set_title('Loss Function Comparison: MSE vs MAE vs Huber') ax.set_xlabel('Residual') ax.set_ylabel('Loss') ax.legend() ax.grid(True, alpha=0.3) plt.tight_layout() # Compare on outlier dataset np.random.seed(42) y_t = np.random.normal(50, 10, 100) y_p = y_t + np.random.normal(0, 2, 100) # good predictions y_p_outlier = y_p.copy() y_p_outlier[[5, 15, 25]] = [200, -100, 300] # inject extreme errors print("\nWith outliers injected:") for delta in [0.5, 1.0, 5.0, 10.0]: h = huber_loss(y_t, y_p_outlier, delta=delta) print(f" Huber(δ={delta:.1f}): {h:.4f}") print(f" MAE: {mean_absolute_error(y_t, y_p_outlier):.4f}") print(f" RMSE: {np.sqrt(mean_squared_error(y_t, y_p_outlier)):.4f}") ``` ### Huber in PyTorch and sklearn ```python import torch import torch.nn as nn # PyTorch (smooth_l1_loss = Huber with delta=1) y_pred_torch = torch.tensor([2.5, 0.0, 2.0, 8.0]) y_true_torch = torch.tensor([3.0, -0.5, 2.0, 7.0]) huber_torch = nn.HuberLoss(delta=1.0) loss = huber_torch(y_pred_torch, y_true_torch) print(f"PyTorch Huber: {loss.item():.4f}") # sklearn HuberRegressor (training with Huber loss) from sklearn.linear_model import HuberRegressor, Ridge from sklearn.datasets import make_regression X, y, _ = make_regression(n_samples=200, n_features=5, noise=10, coef=True, random_state=42) # Inject outliers y[10:15] += 500 huber_reg = HuberRegressor(epsilon=1.35) # epsilon controls sensitivity ridge_reg = Ridge(alpha=1.0) huber_reg.fit(X, y) ridge_reg.fit(X, y) print(f"Huber MAE: {mean_absolute_error(y, huber_reg.predict(X)):.2f}") print(f"Ridge MAE: {mean_absolute_error(y, ridge_reg.predict(X)):.2f}") # Huber should be more robust to the injected outliers ``` ## Part 7 - Residual Analysis: Diagnosing What Went Wrong Metrics tell you *how much* the model errs. Residual analysis tells you *why*. $$\text{residual}_i = y_i - \hat{y}_i$$ A well-specified model has residuals that are: 1. **Centred at zero** - no systematic bias (positive or negative) 2. **Homoscedastic** - constant variance across the range of predictions 3. **Independent** - no autocorrelation (especially in time series) 4. **Approximately normal** - valid for confidence interval construction ```python import numpy as np import matplotlib.pyplot as plt from scipy import stats from sklearn.linear_model import LinearRegression from sklearn.datasets import make_regression # Case 1: Well-specified model (residuals look good) np.random.seed(42) X, y = make_regression(n_samples=300, n_features=3, noise=15, random_state=42) model = LinearRegression() model.fit(X, y) y_pred = model.predict(X) residuals = y - y_pred def plot_residual_diagnostics(y_pred, residuals, title="Residual Diagnostics"): fig, axes = plt.subplots(2, 2, figsize=(14, 10)) fig.suptitle(title, fontsize=14, fontweight='bold') # 1. Residuals vs Fitted - check for non-linearity and heteroscedasticity axes[0, 0].scatter(y_pred, residuals, alpha=0.5, s=20) axes[0, 0].axhline(0, color='red', linestyle='--', linewidth=1.5) axes[0, 0].set_xlabel('Fitted Values') axes[0, 0].set_ylabel('Residuals') axes[0, 0].set_title('Residuals vs Fitted') # 2. Q-Q plot - check normality of residuals stats.probplot(residuals, dist="norm", plot=axes[0, 1]) axes[0, 1].set_title('Normal Q-Q Plot') # 3. Scale-Location - check homoscedasticity sqrt_abs_res = np.sqrt(np.abs(residuals)) axes[1, 0].scatter(y_pred, sqrt_abs_res, alpha=0.5, s=20) axes[1, 0].set_xlabel('Fitted Values') axes[1, 0].set_ylabel('√|Residuals|') axes[1, 0].set_title('Scale-Location (Spread vs Level)') # 4. Residual histogram with normal overlay axes[1, 1].hist(residuals, bins=30, density=True, alpha=0.7, edgecolor='black') x_range = np.linspace(residuals.min(), residuals.max(), 100) axes[1, 1].plot(x_range, stats.norm.pdf(x_range, residuals.mean(), residuals.std()), 'r-', linewidth=2, label='Normal fit') axes[1, 1].set_xlabel('Residuals') axes[1, 1].set_title('Residual Distribution') axes[1, 1].legend() plt.tight_layout() return fig plot_residual_diagnostics(y_pred, residuals, "Well-Specified Model") ``` ### Residual Pattern Reference Guide ``` RESIDUALS VS FITTED: ───────────────────────────────────────────────────── Random scatter around 0 → Model well-specified Systematic curve (arch/U) → Missing non-linear terms → Add polynomial features or use non-linear model Fan shape (spread grows) → Heteroscedasticity → Try log(y) transformation or WLS Band getting tighter → Reverse heteroscedasticity → High predictions are constrained (ceiling effect) Diagonal streaks or clusters → Discrete target or group effects Q-Q PLOT PATTERNS: ───────────────────────────────────────────────────── Points on 45° diagonal → Normal residuals ✓ Heavy S-curve → Heavy tails - outliers influential Concave curve (left skew) → Left-skewed distribution Convex curve (right skew) → Right-skewed distribution → Try log/sqrt transform of target SCALE-LOCATION PLOT: ───────────────────────────────────────────────────── Flat horizontal band → Homoscedasticity ✓ Increasing trend → Variance grows with fitted value Decreasing trend → Variance shrinks with fitted value ``` ### Formal Statistical Tests ```python import statsmodels.api as sm from statsmodels.stats.stattools import durbin_watson from statsmodels.stats.diagnostic import ( het_breuschpagan, het_white, acorr_ljungbox ) from scipy.stats import shapiro, kstest, normaltest def residual_test_suite(y, y_pred, X=None): """Run formal statistical tests on model residuals.""" residuals = y - y_pred n = len(residuals) print("=" * 55) print("RESIDUAL DIAGNOSTIC TESTS") print("=" * 55) # 1. Normality: Shapiro-Wilk (n < 5000), or D'Agostino if n <= 5000: stat_sw, p_sw = shapiro(residuals) print(f"\n1. Shapiro-Wilk normality test:") print(f" W={stat_sw:.4f}, p={p_sw:.4f}") print(f" → {'Normal (p>0.05)' if p_sw > 0.05 else 'NOT normal (p<0.05) - investigate!'}") else: stat_dag, p_dag = normaltest(residuals) print(f"\n1. D'Agostino normality test (large n):") print(f" stat={stat_dag:.4f}, p={p_dag:.4f}") # 2. Heteroscedasticity: Breusch-Pagan if X is not None: X_const = sm.add_constant(X) bp_lm, bp_p, _, _ = het_breuschpagan(residuals, X_const) print(f"\n2. Breusch-Pagan heteroscedasticity test:") print(f" LM={bp_lm:.4f}, p={bp_p:.4f}") print(f" → {'Homoscedastic (p>0.05)' if bp_p > 0.05 else 'HETEROSCEDASTIC (p<0.05) - consider log(y) or WLS'}") # 3. Autocorrelation: Durbin-Watson dw = durbin_watson(residuals) print(f"\n3. Durbin-Watson autocorrelation statistic: {dw:.4f}") if dw < 1.5: print(" → Positive autocorrelation - residuals are correlated") elif dw > 2.5: print(" → Negative autocorrelation") else: print(" → No significant autocorrelation ✓") # 4. Residual bias bias = np.mean(residuals) print(f"\n4. Residual bias (mean residual): {bias:.4f}") if abs(bias) < 0.01 * np.std(residuals): print(" → No systematic bias ✓") else: sign = "over" if bias < 0 else "under" print(f" → Systematic {sign}-prediction detected!") print("=" * 55) # Usage residual_test_suite(y, y_pred, X) ``` ## Part 8 - Baseline Comparisons A metric without a baseline is worthless. Always compare against: 1. **Mean baseline** - predict the training mean for every sample (R² = 0 by definition) 2. **Median baseline** - predict the training median (more robust than mean) 3. **Last-value baseline** (time series) - predict the previous period's value 4. **Simple rule** - domain expert heuristic ("predict price = 1.2 × cost") ```python import numpy as np from sklearn.dummy import DummyRegressor from sklearn.linear_model import LinearRegression, Ridge from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor from sklearn.metrics import mean_absolute_error, r2_score from sklearn.model_selection import train_test_split from sklearn.datasets import make_regression # Dataset X, y = make_regression(n_samples=500, n_features=10, noise=20, random_state=42) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) def evaluate_model(name, model, X_train, y_train, X_test, y_test): model.fit(X_train, y_train) y_pred = model.predict(X_test) mae = mean_absolute_error(y_test, y_pred) r2 = r2_score(y_test, y_pred) return name, mae, r2 models = [ ("Mean Baseline", DummyRegressor(strategy='mean')), ("Median Baseline", DummyRegressor(strategy='median')), ("Linear Regression", LinearRegression()), ("Ridge", Ridge(alpha=1.0)), ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=42)), ("Gradient Boosting", GradientBoostingRegressor(n_estimators=100, random_state=42)), ] print(f"\n{'Model':<25} {'MAE':>10} {'R²':>10} {'vs Baseline'}") print("-" * 60) _, baseline_mae, _ = evaluate_model( "baseline", DummyRegressor(strategy='mean'), X_train, y_train, X_test, y_test ) for name, model in models: n, mae, r2 = evaluate_model(name, model, X_train, y_train, X_test, y_test) improvement = (baseline_mae - mae) / baseline_mae * 100 print(f"{n:<25} {mae:>10.2f} {r2:>10.4f} {improvement:>+.1f}%") ``` ## Part 9 - Custom Business-Aligned Metrics Real ML systems often need metrics that reflect business costs: ```python import numpy as np from typing import Callable def asymmetric_mae( y_true: np.ndarray, y_pred: np.ndarray, over_penalty: float = 2.0, under_penalty: float = 1.0 ) -> float: """ Asymmetric MAE for cases where over-prediction and under-prediction have different business costs. Example use cases: - Demand forecasting: overstock (wasted capital) > stockout (lost sale) → over_penalty = 3.0, under_penalty = 1.0 - Medical dosing: under-dose is dangerous → under_penalty = 5.0, over_penalty = 1.0 - SLA prediction: missing SLA (under-estimate time to fix) → over_penalty high """ residuals = y_pred - y_true # positive = over-prediction losses = np.where( residuals > 0, over_penalty * np.abs(residuals), under_penalty * np.abs(residuals) ) return np.mean(losses) def pinball_loss( y_true: np.ndarray, y_pred: np.ndarray, quantile: float = 0.9 ) -> float: """ Quantile (pinball) loss - for quantile regression. At quantile q, over-predictions penalised at rate (1-q), under-predictions at rate q. Use when you need P90 or P95 predictions (SLAs, capacity planning). """ residuals = y_true - y_pred return np.mean(np.where( residuals >= 0, quantile * residuals, (quantile - 1) * residuals )) # Inventory planning example np.random.seed(42) actual_demand = np.random.poisson(lam=100, size=365) model_forecast = actual_demand + np.random.normal(0, 15, 365) print("Demand forecasting metrics:") print(f" Standard MAE: {mean_absolute_error(actual_demand, model_forecast):.2f}") print(f" Asymmetric (overstock 3x): {asymmetric_mae(actual_demand, model_forecast, over_penalty=3.0):.2f}") print(f" P90 Pinball loss: {pinball_loss(actual_demand, model_forecast, quantile=0.9):.2f}") # WAPE - Weighted Average Percentage Error (common in supply chain) def wape(y_true, y_pred): """More stable than MAPE - aggregates errors before dividing.""" return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)) * 100 print(f" WAPE: {wape(actual_demand, model_forecast):.2f}%") ``` ## Part 10 - Metric Selection Flowchart ```mermaid flowchart TD A[Choosing a Regression Metric] --> B{Outliers common?} B -- Yes --> C{Business impact<br/>of outliers?} B -- No --> D{Scale comparison<br/>needed?} C -- Low impact --> E[MAE - robust to outliers] C -- High impact --> F[RMSE or Huber Loss] D -- Yes, cross-dataset --> G{Zero values<br/>in target?} D -- No, same scale --> H{Comparing models with<br/>different feature counts?} G -- Yes --> I[sMAPE or RMSLE] G -- No --> J[MAPE] H -- Yes --> K[Adjusted R²] H -- No --> L[R²] L --> M{Distribution of<br/>errors matters?} M -- Yes --> N[Run residual diagnostics<br/>+ Shapiro-Wilk, BP test] M -- No --> O[Report MAE + R² together] style E fill:#dcfce7,color:#14532d style F fill:#dbeafe,color:#1e3a8a style I fill:#fef3c7,color:#78350f style J fill:#fef3c7,color:#78350f style K fill:#f3e8ff,color:#581c87 style L fill:#f3e8ff,color:#581c87 ``` ## Part 11 - Production Monitoring Tracking metrics in production requires rolling window computation, alerting on degradation, and detecting systematic bias: ```python import numpy as np from collections import deque from dataclasses import dataclass, field from typing import Optional, List, Dict import time @dataclass class RegressionMonitor: """ Production regression metric monitor. Maintains a rolling window of predictions and actuals. Raises alerts when metrics exceed thresholds or when systematic bias is detected. """ window_size: int = 1000 mae_threshold: Optional[float] = None rmse_threshold: Optional[float] = None r2_threshold: Optional[float] = None bias_fraction_threshold: float = 0.15 # alert when bias > 15% of MAE def __post_init__(self): self._actuals = deque(maxlen=self.window_size) self._predictions = deque(maxlen=self.window_size) self._timestamps = deque(maxlen=self.window_size) def record(self, y_true: float, y_pred: float) -> None: """Record a single prediction-actual pair.""" self._actuals.append(y_true) self._predictions.append(y_pred) self._timestamps.append(time.time()) def record_batch(self, y_true: np.ndarray, y_pred: np.ndarray) -> None: """Record a batch of predictions.""" for yt, yp in zip(y_true, y_pred): self.record(float(yt), float(yp)) def current_metrics(self) -> Optional[Dict]: """Compute current rolling window metrics.""" if len(self._actuals) < 10: return None y_true = np.array(self._actuals) y_pred = np.array(self._predictions) residuals = y_true - y_pred mae = np.mean(np.abs(residuals)) rmse = np.sqrt(np.mean(residuals**2)) bias = np.mean(residuals) ss_res = np.sum(residuals**2) ss_tot = np.sum((y_true - y_true.mean())**2) r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float('nan') return { 'mae': mae, 'rmse': rmse, 'r2': r2, 'bias': bias, 'p50_error': np.percentile(np.abs(residuals), 50), 'p95_error': np.percentile(np.abs(residuals), 95), 'p99_error': np.percentile(np.abs(residuals), 99), 'rmse_mae_ratio': rmse / mae if mae > 0 else float('nan'), 'n_samples': len(y_true), } def check_alerts(self) -> List[str]: """Return list of active alert messages.""" metrics = self.current_metrics() if metrics is None: return [] alerts = [] if self.mae_threshold and metrics['mae'] > self.mae_threshold: alerts.append( f"MAE {metrics['mae']:.3f} exceeds threshold {self.mae_threshold}" ) if self.rmse_threshold and metrics['rmse'] > self.rmse_threshold: alerts.append( f"RMSE {metrics['rmse']:.3f} exceeds threshold {self.rmse_threshold}" ) if self.r2_threshold and metrics['r2'] < self.r2_threshold: alerts.append( f"R² {metrics['r2']:.3f} below threshold {self.r2_threshold}" ) if abs(metrics['bias']) > self.bias_fraction_threshold * metrics['mae']: direction = "over" if metrics['bias'] < 0 else "under" alerts.append( f"Systematic {direction}-prediction: bias={metrics['bias']:.3f}" ) if metrics['rmse_mae_ratio'] > 2.0: alerts.append( f"High RMSE/MAE ratio ({metrics['rmse_mae_ratio']:.2f}) - outlier predictions detected" ) return alerts def summary(self) -> str: m = self.current_metrics() if m is None: return "Insufficient data (<10 samples)" lines = [ f"Regression Monitor (n={m['n_samples']})", f" MAE: {m['mae']:.4f}", f" RMSE: {m['rmse']:.4f}", f" R²: {m['r2']:.4f}", f" Bias: {m['bias']:.4f}", f" P50 error: {m['p50_error']:.4f}", f" P95 error: {m['p95_error']:.4f}", f" P99 error: {m['p99_error']:.4f}", ] alerts = self.check_alerts() if alerts: lines.append("\n ALERTS:") for a in alerts: lines.append(f" ⚠ {a}") return "\n".join(lines) # Simulate production inference stream monitor = RegressionMonitor( window_size=500, mae_threshold=15.0, r2_threshold=0.7, bias_fraction_threshold=0.2 ) np.random.seed(42) for batch in range(20): y_true_batch = np.random.normal(100, 20, 25) # Introduce bias after batch 10 drift = 20 if batch > 10 else 0 y_pred_batch = y_true_batch + np.random.normal(drift, 5, 25) monitor.record_batch(y_true_batch, y_pred_batch) print(monitor.summary()) ``` ## YouTube Resources | Video | Channel | Focus | |-------|---------|-------| | [MSE, RMSE, MAE, R-squared](https://www.youtube.com/watch?v=LW27n5_vBLQ) | StatQuest | Core regression metrics intuition | | [R-squared, Clearly Explained](https://www.youtube.com/watch?v=2AQKmw14mHM) | StatQuest | R² from first principles | | [Regression Diagnostics in Python](https://www.youtube.com/watch?v=eTZ4VUZHzxw) | ritvikmath | Residual plots and statsmodels | | [Choosing Evaluation Metrics](https://www.youtube.com/watch?v=jqkzFpIbSqg) | Data School | MAPE vs RMSE trade-offs | | [Huber Loss Explained](https://www.youtube.com/watch?v=SBfHoiTiHKk) | Serrano.Academy | Robust loss functions | ## Interview Questions **Q1: RMSE = 45,000 on a house price prediction model. Is that good?** That number is meaningless without context. I'd immediately ask three things: (1) What's the mean/median house price? If the median is $500K, 9% RMSE might be acceptable; if it's $80K, it's catastrophic. (2) What does the mean baseline achieve? If predicting the mean gives RMSE = 200,000, then 45,000 is excellent. (3) What's the business tolerance? A $45K pricing error might cause a deal to fall through or a lending model to approve the wrong loan amount. Only after benchmarking against baselines and business costs can you assess whether the metric is acceptable. **Q2: Why does R² increase when you add features even if they're random noise?** Every added feature, even random noise, allows the regression to fit the training data slightly better. The numerator (SS_res) decreases or stays equal, while SS_tot is unchanged, so R² can only increase or stay the same when you add features to an OLS regression. This is because OLS minimises SS_res and can always find small coefficients for noise features that accidentally correlate with the target. Adjusted R² corrects for this by penalising model complexity: it multiplies $(1 - R^2)$ by $(n-1)/(n-k-1)$, which grows as $k$ increases. When you add a useless feature, R² barely improves but the penalty increases, so adjusted R² falls. **Q3: What's the difference between heteroscedasticity in training and test sets, and why does it matter?** Heteroscedasticity means the variance of residuals depends on the level of the fitted values or predictors. In training, heteroscedasticity violates OLS assumptions and makes coefficient standard errors unreliable (invalid hypothesis tests, poor confidence intervals). In production, it means your model has larger errors for certain prediction ranges - which may align with high-stakes predictions. Detection: residuals vs fitted plot (funnel shape). Fix: log-transform the target (if target is right-skewed), use WLS (weighted least squares) with weights proportional to 1/variance, or use a model that doesn't assume homoscedasticity (tree-based models). **Q4: When should you use MAPE and when should you avoid it?** Use MAPE when: targets span different scales (forecasting products with prices $1–$10,000), stakeholders think in percentages (retail demand, revenue forecasting), and you need an intuitive business-facing metric. Avoid MAPE when: targets can be zero or near-zero (division by zero or extreme sensitivity - common in cold-start forecasting); when error direction matters (MAPE penalises under-prediction at most 100% but over-prediction unboundedly); when small-value items dominate the metric unfairly. Alternatives: sMAPE for symmetry, WAPE (weighted APE) for supply chain, RMSLE for log-normal targets, MAE for zero-containing targets. **Q5: Your model has bias (mean residual ≠ 0). How do you diagnose the root cause?** A non-zero mean residual signals systematic prediction error. I'd investigate in this order: (1) Check if the bias appeared after a certain date → could be distribution shift or feature drift; (2) Plot bias as a function of prediction magnitude - if error grows proportionally, the model is under/over-fitted for a specific range; (3) Check if the bias is segment-specific (e.g., only on certain customer types or product categories) - segment-level mean residuals reveal if a global model fails a subgroup; (4) Inspect feature distributions - a shifted mean in an input feature (seasonal effect, new user cohort) propagates into biased predictions; (5) If the model uses a linear activation with uncentered features, the intercept may have absorbed a biased distribution during training. Fix: recalibrate intercept, retrain with updated data, or apply a post-hoc bias correction layer. **Q6: How do you choose δ (delta) for Huber loss?** Delta defines which residuals are treated quadratically vs linearly. I'd set delta based on: (1) the interquartile range of residuals from a simpler model - delta around the 75th–90th percentile of absolute residuals means typical errors are quadratically penalised while outliers are linearly penalised; (2) domain knowledge about what constitutes an "outlier" error; (3) cross-validation over a grid of delta values, optimising MAE on validation data. A practical starting point: fit the dataset with OLS, compute residual quantiles, set delta = 90th percentile absolute residual, then tune. In PyTorch, HuberLoss(delta=1.0) defaults to the common setting; sklearn HuberRegressor uses epsilon=1.35 by convention from robust statistics. **Q7: What residual patterns indicate that you need to transform the target variable?** Right-skewed residuals with increasing variance (fan shape in residuals-vs-fitted) often indicate the target follows a log-normal distribution - applying log(y) frequently resolves both. U-shaped residuals (systematic curve) indicate a missing quadratic term - either add y² as a feature or switch to a polynomial/non-linear model. Bimodal residuals suggest two distinct subpopulations that should be modelled separately. Heavy-tailed Q-Q plots (S-curves) suggest outlier influence - either Huber loss or log(y+1) transformation. Periodic/oscillating patterns in time-order residuals indicate seasonality not captured by the model. Always visualise before applying transformations. **Q8: How would you monitor regression model quality in production?** I'd implement a rolling window metric tracker (last N predictions, typically 1,000–10,000 depending on traffic). Key metrics to track: MAE/RMSE (absolute performance), bias (mean residual - systematic errors), P95/P99 errors (tail behaviour), and RMSE/MAE ratio (outlier influence). Alert thresholds: (1) absolute metric exceeds a pre-set business tolerance; (2) metric degrades > 2σ from the 30-day rolling baseline; (3) bias grows beyond 15–20% of MAE (systematic drift). Separately, monitor input feature distributions with PSI or KS tests to distinguish input drift (handled by retraining) from model degradation (requires deeper investigation). Store all predictions for post-hoc analysis. :::tip Role-Specific Angles **MLE Interview**: Huber loss implementation, residual diagnostics, custom asymmetric loss functions, production monitoring design **Data Scientist**: R², adjusted R², MAPE selection for business reporting, stakeholder communication **MLOps**: Rolling window monitoring, alert thresholds, metric degradation detection, drift vs model failure distinction **AI Engineer**: Huber in PyTorch training objectives, quantile loss for uncertainty-aware predictions **Research Engineer**: Theoretical properties (MAE ↔ median, MSE ↔ mean), Jensen's inequality, bias-variance of estimators ::: :::tip 🎮 Interactive Playground **Visualize this concept:** Try the **[Regression Explorer](/playground/regression-explorer)** demo on the EngineersOfAI Playground - no code required. :::
© 2026 EngineersOfAI. All rights reserved.