Bayesian Model Comparison
The Problem: Which Model Should You Trust?
You're choosing between three regression models:
- M1: Linear regression - simple, few parameters
- M2: Polynomial degree-3 - moderate complexity
- M3: Polynomial degree-10 - high complexity
On the training set, M3 achieves the lowest error. Always. But you know overfitting is real - M3 might just be memorizing noise.
The frequentist approach: use cross-validation to estimate generalization error, then pick the best model. This works but requires held-out data.
The Bayesian approach: use the marginal likelihood (model evidence) - - which automatically penalizes overly complex models through integration over the prior. No held-out data required. Occam's razor emerges mathematically.
This lesson covers the Bayesian framework for model comparison: Bayes factors, marginal likelihood, and how BIC/AIC approximate the Bayesian answer.
Bayesian Model Comparison: The Framework
Given data and two competing models and , we want to compute the posterior odds:
The first factor is the Bayes factor ; the second is the prior odds:
The Bayes factor is the ratio of marginal likelihoods:
where the marginal likelihood (model evidence) for model is:
This integral averages the likelihood over all possible parameter values under the prior. A model with many parameters has a diffuse prior over a large space - most of those parameter values fit the data poorly, so the average likelihood is lower than for a simpler model that concentrates its prior probability in a region that explains the data.
Occam's Razor: Complexity Penalty for Free
The marginal likelihood automatically penalizes complexity. Here's the intuition:
Consider two models:
- : simple, has a concentrated prior over a small region of data space
- : complex, has a diffuse prior over a large region of data space
P(D | M1): Prior concentrates on few datasets P(D | M2): Prior spreads over many datasets
↓ ↓
────┐ ───────────────────────────
│ │ │
M1 │ high density here │ M2 low density │
│ ██████████ │ everywhere │
────┘ ───────────────────────────
D1 D_observed D_observed
If is in the concentrated region of , is HIGH. The complex model places little probability on any specific dataset because it can explain so many - so is LOW even if fits well with its MAP parameters.
Formally, the marginal likelihood can be approximated as:
where the Occam factor is:
Here is the Hessian of the negative log-posterior at and is the number of parameters. The Occam factor penalizes models with many parameters (large ) and diffuse posteriors (small ).
The Bayes Factor: Interpretation Scale
Kass & Raftery (1995) provide an interpretation scale:
| Bayes Factor | Evidence Against | |
|---|---|---|
| 0 to 2 | 1 to 3 | Not worth mentioning |
| 2 to 6 | 3 to 20 | Positive evidence for |
| 6 to 10 | 20 to 150 | Strong evidence for |
| > 10 | > 150 | Very strong evidence for |
Note: Unlike p-values, the Bayes factor provides evidence for any of the hypotheses - including the null. A Bayes factor of is strong evidence for (the null).
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
def interpret_bayes_factor(B12):
"""Interpret Bayes factor using Kass & Raftery scale."""
log2B = 2 * np.log(B12)
if B12 < 1:
return f"B12 = {B12:.3f}: Evidence for M2 (against M1)"
elif log2B < 2:
return f"B12 = {B12:.3f}: Not worth mentioning"
elif log2B < 6:
return f"B12 = {B12:.3f}: Positive evidence for M1"
elif log2B < 10:
return f"B12 = {B12:.3f}: Strong evidence for M1"
else:
return f"B12 = {B12:.3f}: Very strong evidence for M1"
# Example Bayes factors
for B12 in [0.5, 2.0, 5.0, 25.0, 200.0]:
print(interpret_bayes_factor(B12))
Computing Marginal Likelihood for Simple Models
For conjugate models, the marginal likelihood has a closed form.
Beta-Binomial Marginal Likelihood
For Bernoulli observations with Beta prior :
where is the Beta function.
from scipy.special import betaln
def beta_binomial_log_evidence(h, t, alpha, beta):
"""
Log marginal likelihood for Beta-Binomial model.
h: number of successes
t: number of failures
alpha, beta: prior parameters
"""
return betaln(alpha + h, beta + t) - betaln(alpha, beta)
# Model comparison: is this coin fair (M1) or biased toward heads (M2)?
# Data: 15 heads, 5 tails (n=20)
h, t = 15, 5
# M1: Fair coin prior Beta(10, 10) -- strong belief coin is fair
log_ev_M1 = beta_binomial_log_evidence(h, t, alpha=10, beta=10)
# M2: Biased coin prior Beta(2, 1) -- weak prior favoring heads
log_ev_M2 = beta_binomial_log_evidence(h, t, alpha=2, beta=1)
# Bayes factor
log_BF_12 = log_ev_M1 - log_ev_M2
BF_12 = np.exp(log_BF_12)
print(f"Data: {h} heads, {t} tails")
print(f"Log evidence M1 (fair coin prior): {log_ev_M1:.4f}")
print(f"Log evidence M2 (biased prior): {log_ev_M2:.4f}")
print(f"Log Bayes factor (M1 vs M2): {log_BF_12:.4f}")
print(f"Bayes factor: {BF_12:.4f}")
print(interpret_bayes_factor(BF_12))
Gaussian Linear Regression Marginal Likelihood
For Bayesian linear regression with Gaussian prior on weights:
where is the marginal covariance of .
This is the same formula used in Gaussian process regression for kernel hyperparameter optimization.
BIC and AIC: Frequentist Approximations to Bayesian Model Comparison
Bayesian Information Criterion (BIC)
The BIC approximates the log marginal likelihood using the Laplace approximation:
where = number of parameters, = number of data points.
Derivation: The Laplace approximation to around the MAP gives:
Assuming a flat prior (so and the prior term is constant), and approximating the Hessian, the dominant term gives:
BIC model selection: Choose the model with the lowest BIC.
Akaike Information Criterion (AIC)
AIC is not derived from the marginal likelihood - it estimates the expected out-of-sample KL divergence from the true data distribution to the fitted model (Akaike, 1974). It uses a weaker complexity penalty ( vs ) and is more appropriate for predictive model selection when is small.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
def compute_aic_bic(X, y, degree):
"""Fit polynomial regression and compute AIC and BIC."""
n = len(y)
poly = Pipeline([
('poly', PolynomialFeatures(degree=degree, include_bias=True)),
('lr', LinearRegression(fit_intercept=False))
])
poly.fit(X.reshape(-1, 1), y)
y_pred = poly.predict(X.reshape(-1, 1))
residuals = y - y_pred
sse = np.sum(residuals**2)
# Number of parameters
k = degree + 1 # polynomial coefficients
# Also estimate sigma^2 (or fix it)
sigma2_mle = sse / n
# Log-likelihood assuming Gaussian errors
log_lik = -n/2 * np.log(2*np.pi*sigma2_mle) - sse/(2*sigma2_mle)
aic = -2 * log_lik + 2 * k
bic = -2 * log_lik + k * np.log(n)
return log_lik, aic, bic, poly
# Generate data: true model is cubic, but noisy
np.random.seed(42)
n = 50
X = np.linspace(-3, 3, n)
true_y = 0.5 * X**3 - 2 * X + 1
y = true_y + np.random.normal(0, 2.0, n)
print(f"{'Degree':<8} {'LogLik':<12} {'AIC':<12} {'BIC':<12}")
print("-" * 44)
for degree in range(1, 11):
log_lik, aic, bic, _ = compute_aic_bic(X, y, degree)
marker = " <-- AIC best" if degree == 3 else ""
marker += " <-- BIC best" if degree == 3 else ""
print(f"{degree:<8} {log_lik:<12.2f} {aic:<12.2f} {bic:<12.2f}{marker}")
print()
print("Both AIC and BIC should select degree 3 (the true model)")
print("BIC penalizes complexity more heavily (k*log(n) vs 2k)")
print("BIC is consistent: selects the true model as n->inf")
print("AIC minimizes predictive error, may prefer slightly higher degree")
AIC vs BIC vs Marginal Likelihood
| Property | AIC | BIC | Marginal Likelihood |
|---|---|---|---|
| Penalty term | Automatic via integration | ||
| Consistent? | No | Yes | Yes |
| Optimal for prediction? | Yes (asymptotically) | Not for prediction | Yes (for Bayesian prediction) |
| Requires prior? | No | No | Yes |
| Computable analytically? | Always | Always | Only for conjugate models |
| Preferred for n small | AIC preferred | BIC too aggressive | Marginal likelihood |
| Preferred for n large | Both converge | BIC better | Both converge |
| Theoretical basis | KL divergence | Laplace approx to log P(D|M) | Exact Bayesian inference |
Computing Marginal Likelihood Approximately
For complex models where the marginal likelihood has no closed form:
1. Laplace Approximation
Approximate the posterior as a Gaussian around the MAP:
where is the Hessian of the negative log-posterior.
2. Harmonic Mean Estimator (from MCMC samples)
where are MCMC samples from the posterior. Warning: This estimator has infinite variance in general - avoid in practice.
3. Annealed Importance Sampling (AIS)
The most reliable general-purpose method for estimating log marginal likelihoods, especially for complex models. Uses a sequence of intermediate distributions to bridge the prior and posterior.
4. WAIC (Widely Applicable Information Criterion)
where is the posterior predictive and is the effective number of parameters estimated from posterior variance.
WAIC is a fully Bayesian information criterion that works with MCMC samples and is the recommended approach in PyMC / Stan for model comparison.
import pymc as pm
import arviz as az
import numpy as np
# Compare two models using LOO-CV (leave-one-out cross-validation)
# LOO approximates the expected log predictive density
# and is the recommended comparison tool in PyMC/ArviZ
np.random.seed(42)
n = 100
X = np.random.randn(n)
y = 1.5 * X + np.random.randn(n) # true: linear
# Model 1: Linear regression
with pm.Model() as linear_model:
alpha = pm.Normal('alpha', 0, 10)
beta = pm.Normal('beta', 0, 10)
sigma = pm.HalfNormal('sigma', 1)
mu = alpha + beta * X
y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
trace_linear = pm.sample(2000, tune=1000, return_inferencedata=True,
progressbar=False, random_seed=42)
# Compute log-likelihood for WAIC/LOO
pm.compute_log_likelihood(trace_linear)
# Model 2: Quadratic regression (overly complex)
with pm.Model() as quad_model:
alpha = pm.Normal('alpha', 0, 10)
beta1 = pm.Normal('beta1', 0, 10)
beta2 = pm.Normal('beta2', 0, 10)
sigma = pm.HalfNormal('sigma', 1)
mu = alpha + beta1 * X + beta2 * X**2
y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
trace_quad = pm.sample(2000, tune=1000, return_inferencedata=True,
progressbar=False, random_seed=42)
pm.compute_log_likelihood(trace_quad)
# Compare using LOO
loo_linear = az.loo(trace_linear)
loo_quad = az.loo(trace_quad)
print(f"LOO comparison (higher ELPD = better):")
print(f" Linear model ELPD: {loo_linear.elpd_loo:.2f} ± {loo_linear.se:.2f}")
print(f" Quadratic model ELPD: {loo_quad.elpd_loo:.2f} ± {loo_quad.se:.2f}")
print(f"\nLinear model correctly preferred (true model)")
print(f"Occam's razor: quadratic model penalized for extra complexity")
# Also compare with WAIC
comparison = az.compare(
{'linear': trace_linear, 'quadratic': trace_quad},
ic='loo'
)
print("\nModel comparison table:")
print(comparison)
Practical Recommendation for ML Engineers
| Situation | Recommended Approach |
|---|---|
| Comparing neural network architectures | Cross-validation or LOO-CV |
| Bayesian model in PyMC/Stan | WAIC or LOO via ArviZ |
| Model selection with many candidate models | BIC (fast, interpretable) |
| Small n, want predictive optimality | AIC or WAIC |
| Scientific publication, comparing nested models | Bayes factor (requires proper priors) |
| Comparing GP kernels | Log marginal likelihood (closed form) |
| Deep learning architecture search | NAS + validation loss; Bayesian optimization with BIC proxy |
Interview Questions
Q1: What is the Bayes factor and how does it differ from a p-value for model comparison?
The Bayes factor is the ratio of marginal likelihoods - how much more likely the data is under than , after averaging over all parameter values. Unlike a p-value: (1) Bayes factors can provide evidence for any hypothesis, including the simpler null model (a p-value can only reject the null, never confirm it); (2) Bayes factors are symmetric (); (3) Bayes factors have a direct probability interpretation - with equal prior odds, means ; (4) Bayes factors automatically penalize model complexity through integration (Occam's razor), while p-values require explicit degrees-of-freedom corrections. The downside: Bayes factors require proper priors and are sensitive to prior choice in a way that posterior estimates are not.
Q2: Why does the marginal likelihood automatically penalize model complexity?
The marginal likelihood integrates the likelihood over the prior: . A complex model has many parameters and thus a diffuse prior spread across a large space. Most of those parameter settings explain the data poorly, so the average likelihood is low. A simple model concentrates its prior probability in a smaller region - if that region happens to contain the data-generating process, the average likelihood is high. This is Occam's razor as mathematics: simple models that explain the data get higher marginal likelihood than complex models that overfit (which would need specific parameter values to achieve the same fit). The Occam factor - the ratio of posterior volume to prior volume - quantifies this penalty. It's proportional to , which is exactly the BIC penalty.
Q3: What is the difference between AIC and BIC, and when would you prefer each?
AIC penalizes complexity by (twice the number of parameters), while BIC uses (larger for ). Theoretically: AIC minimizes the expected KL divergence from the true data distribution to the model - it's optimized for prediction. BIC approximates the log marginal likelihood and is consistent: as , BIC selects the true model (assuming it's in the candidate set), while AIC tends to overfit slightly. Practically: use AIC when you want the best predictive model and aren't sure the true model is among your candidates; use BIC when you want to identify the "true" model structure. For small (< 40), prefer AICc (corrected AIC): .
Q4: How does cross-validation relate to the marginal likelihood?
Leave-one-out cross-validation (LOO-CV) approximates the expected log predictive density (ELPD): where denotes all data except the -th point. The marginal likelihood can be written as: - the product of sequential one-step-ahead predictive probabilities (prequential interpretation). For large , WAIC and the marginal likelihood converge to LOO-CV when the posterior is approximately normal. In practice: LOO-CV (computed via Pareto-smoothed importance sampling in ArviZ) is the recommended model comparison tool in PyMC/Stan because it: works with MCMC samples, handles model misspecification, and gives interpretable effective number of parameters .
Q5: Why is the marginal likelihood sensitive to the prior, and what does this imply for model comparison?
The marginal likelihood integrates over the prior: . An improper prior (integrating to infinity) makes the marginal likelihood undefined or improper - Bayes factors between models with improper priors are undefined. Even with proper priors, a very diffuse prior assigns probability to regions where the likelihood is negligible, reducing the marginal likelihood. This means Bayes factors are sensitive to prior choice, especially for parameters with little data. Practical implications: (1) always use proper priors for model comparison; (2) Bayes factors are most reliable when both models have informative priors informed by domain knowledge; (3) for exploratory model comparison, prefer predictive criteria (LOO-CV, WAIC) which are less prior-sensitive; (4) when using BIC, the prior cancels out in the Laplace approximation, so BIC is more robust to prior choice (but requires large n for the approximation to hold).
:::tip 🎮 Interactive Playground
Visualize this concept: Try the KL Divergence demo on the EngineersOfAI Playground - no code required.
:::
