Skip to main content

Bayesian Model Comparison

The Problem: Which Model Should You Trust?

You're choosing between three regression models:

  • M1: Linear regression - simple, few parameters
  • M2: Polynomial degree-3 - moderate complexity
  • M3: Polynomial degree-10 - high complexity

On the training set, M3 achieves the lowest error. Always. But you know overfitting is real - M3 might just be memorizing noise.

The frequentist approach: use cross-validation to estimate generalization error, then pick the best model. This works but requires held-out data.

The Bayesian approach: use the marginal likelihood (model evidence) - P(DMi)P(\mathcal{D} \mid M_i) - which automatically penalizes overly complex models through integration over the prior. No held-out data required. Occam's razor emerges mathematically.

This lesson covers the Bayesian framework for model comparison: Bayes factors, marginal likelihood, and how BIC/AIC approximate the Bayesian answer.

Bayesian Model Comparison: The Framework

Given data D\mathcal{D} and two competing models M1M_1 and M2M_2, we want to compute the posterior odds:

P(M1D)P(M2D)=P(DM1)P(DM2)P(M1)P(M2)\frac{P(M_1 \mid \mathcal{D})}{P(M_2 \mid \mathcal{D})} = \frac{P(\mathcal{D} \mid M_1)}{P(\mathcal{D} \mid M_2)} \cdot \frac{P(M_1)}{P(M_2)}

The first factor is the Bayes factor B12B_{12}; the second is the prior odds:

posterior odds=B12×prior odds\text{posterior odds} = B_{12} \times \text{prior odds}

The Bayes factor B12B_{12} is the ratio of marginal likelihoods:

B12=P(DM1)P(DM2)B_{12} = \frac{P(\mathcal{D} \mid M_1)}{P(\mathcal{D} \mid M_2)}

where the marginal likelihood (model evidence) for model MiM_i is:

P(DMi)=P(Dθi,Mi)P(θiMi)dθiP(\mathcal{D} \mid M_i) = \int P(\mathcal{D} \mid \theta_i, M_i) \cdot P(\theta_i \mid M_i) \, d\theta_i

This integral averages the likelihood over all possible parameter values under the prior. A model with many parameters has a diffuse prior over a large space - most of those parameter values fit the data poorly, so the average likelihood is lower than for a simpler model that concentrates its prior probability in a region that explains the data.

Occam's Razor: Complexity Penalty for Free

The marginal likelihood automatically penalizes complexity. Here's the intuition:

Consider two models:

  • M1M_1: simple, has a concentrated prior over a small region of data space
  • M2M_2: complex, has a diffuse prior over a large region of data space
P(D | M1): Prior concentrates on few datasets P(D | M2): Prior spreads over many datasets
↓ ↓
────┐ ───────────────────────────
│ │ │
M1 │ high density here │ M2 low density │
│ ██████████ │ everywhere │
────┘ ───────────────────────────
D1 D_observed D_observed

If Dobserved\mathcal{D}_{observed} is in the concentrated region of M1M_1, P(DM1)P(\mathcal{D} \mid M_1) is HIGH. The complex model M2M_2 places little probability on any specific dataset because it can explain so many - so P(DM2)P(\mathcal{D} \mid M_2) is LOW even if M2M_2 fits Dobserved\mathcal{D}_{observed} well with its MAP parameters.

Formally, the marginal likelihood can be approximated as:

logP(DM)logP(Dθ^MAP)+Occam factor\log P(\mathcal{D} \mid M) \approx \log P(\mathcal{D} \mid \hat{\theta}_{MAP}) + \text{Occam factor}

where the Occam factor is:

Occam factor=d2log2π12logH\text{Occam factor} = \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\mathbf{H}|

Here H\mathbf{H} is the Hessian of the negative log-posterior at θ^MAP\hat{\theta}_{MAP} and dd is the number of parameters. The Occam factor penalizes models with many parameters (large dd) and diffuse posteriors (small H|\mathbf{H}|).

The Bayes Factor: Interpretation Scale

Kass & Raftery (1995) provide an interpretation scale:

2lnB122\ln B_{12}Bayes Factor B12B_{12}Evidence Against M2M_2
0 to 21 to 3Not worth mentioning
2 to 63 to 20Positive evidence for M1M_1
6 to 1020 to 150Strong evidence for M1M_1
> 10> 150Very strong evidence for M1M_1

Note: Unlike p-values, the Bayes factor provides evidence for any of the hypotheses - including the null. A Bayes factor of B12=0.05B_{12} = 0.05 is strong evidence for M2M_2 (the null).

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def interpret_bayes_factor(B12):
"""Interpret Bayes factor using Kass & Raftery scale."""
log2B = 2 * np.log(B12)
if B12 < 1:
return f"B12 = {B12:.3f}: Evidence for M2 (against M1)"
elif log2B < 2:
return f"B12 = {B12:.3f}: Not worth mentioning"
elif log2B < 6:
return f"B12 = {B12:.3f}: Positive evidence for M1"
elif log2B < 10:
return f"B12 = {B12:.3f}: Strong evidence for M1"
else:
return f"B12 = {B12:.3f}: Very strong evidence for M1"

# Example Bayes factors
for B12 in [0.5, 2.0, 5.0, 25.0, 200.0]:
print(interpret_bayes_factor(B12))

Computing Marginal Likelihood for Simple Models

For conjugate models, the marginal likelihood has a closed form.

Beta-Binomial Marginal Likelihood

For Bernoulli observations with Beta prior θBeta(α,β)\theta \sim \text{Beta}(\alpha, \beta):

P(DM)=01θh(1θ)tθα1(1θ)β1B(α,β)dθ=B(α+h,β+t)B(α,β)P(\mathcal{D} \mid M) = \int_0^1 \theta^h (1-\theta)^t \cdot \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)} d\theta = \frac{B(\alpha+h, \beta+t)}{B(\alpha,\beta)}

where B(α,β)=Γ(α)Γ(β)/Γ(α+β)B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta) is the Beta function.

from scipy.special import betaln

def beta_binomial_log_evidence(h, t, alpha, beta):
"""
Log marginal likelihood for Beta-Binomial model.
h: number of successes
t: number of failures
alpha, beta: prior parameters
"""
return betaln(alpha + h, beta + t) - betaln(alpha, beta)

# Model comparison: is this coin fair (M1) or biased toward heads (M2)?
# Data: 15 heads, 5 tails (n=20)
h, t = 15, 5

# M1: Fair coin prior Beta(10, 10) -- strong belief coin is fair
log_ev_M1 = beta_binomial_log_evidence(h, t, alpha=10, beta=10)

# M2: Biased coin prior Beta(2, 1) -- weak prior favoring heads
log_ev_M2 = beta_binomial_log_evidence(h, t, alpha=2, beta=1)

# Bayes factor
log_BF_12 = log_ev_M1 - log_ev_M2
BF_12 = np.exp(log_BF_12)

print(f"Data: {h} heads, {t} tails")
print(f"Log evidence M1 (fair coin prior): {log_ev_M1:.4f}")
print(f"Log evidence M2 (biased prior): {log_ev_M2:.4f}")
print(f"Log Bayes factor (M1 vs M2): {log_BF_12:.4f}")
print(f"Bayes factor: {BF_12:.4f}")
print(interpret_bayes_factor(BF_12))

Gaussian Linear Regression Marginal Likelihood

For Bayesian linear regression with Gaussian prior on weights:

logP(yX,M)=n2log2π12logC12yC1y\log P(\mathbf{y} \mid \mathbf{X}, M) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\log|\mathbf{C}| - \frac{1}{2}\mathbf{y}^\top \mathbf{C}^{-1} \mathbf{y}

where C=σn2I+σw2XX\mathbf{C} = \sigma_n^2\mathbf{I} + \sigma_w^2 \mathbf{X}\mathbf{X}^\top is the marginal covariance of y\mathbf{y}.

This is the same formula used in Gaussian process regression for kernel hyperparameter optimization.

BIC and AIC: Frequentist Approximations to Bayesian Model Comparison

Bayesian Information Criterion (BIC)

The BIC approximates the log marginal likelihood using the Laplace approximation:

BIC=2logP(Dθ^MLE)+klogn\text{BIC} = -2\log P(\mathcal{D} \mid \hat{\theta}_{MLE}) + k \log n

where kk = number of parameters, nn = number of data points.

Derivation: The Laplace approximation to logP(DM)\log P(\mathcal{D} \mid M) around the MAP gives:

logP(DM)logP(Dθ^MAP)+logP(θ^MAP)k2logn+const\log P(\mathcal{D} \mid M) \approx \log P(\mathcal{D} \mid \hat{\theta}_{MAP}) + \log P(\hat{\theta}_{MAP}) - \frac{k}{2}\log n + \text{const}

Assuming a flat prior (so θ^MAP=θ^MLE\hat{\theta}_{MAP} = \hat{\theta}_{MLE} and the prior term is constant), and approximating the Hessian, the dominant term gives:

2logP(DM)2logP(Dθ^MLE)+klogn=BIC-2\log P(\mathcal{D} \mid M) \approx -2\log P(\mathcal{D} \mid \hat{\theta}_{MLE}) + k\log n = \text{BIC}

BIC model selection: Choose the model with the lowest BIC.

Akaike Information Criterion (AIC)

AIC=2logP(Dθ^MLE)+2k\text{AIC} = -2\log P(\mathcal{D} \mid \hat{\theta}_{MLE}) + 2k

AIC is not derived from the marginal likelihood - it estimates the expected out-of-sample KL divergence from the true data distribution to the fitted model (Akaike, 1974). It uses a weaker complexity penalty (2k2k vs klognk\log n) and is more appropriate for predictive model selection when nn is small.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def compute_aic_bic(X, y, degree):
"""Fit polynomial regression and compute AIC and BIC."""
n = len(y)
poly = Pipeline([
('poly', PolynomialFeatures(degree=degree, include_bias=True)),
('lr', LinearRegression(fit_intercept=False))
])
poly.fit(X.reshape(-1, 1), y)
y_pred = poly.predict(X.reshape(-1, 1))
residuals = y - y_pred
sse = np.sum(residuals**2)

# Number of parameters
k = degree + 1 # polynomial coefficients
# Also estimate sigma^2 (or fix it)
sigma2_mle = sse / n

# Log-likelihood assuming Gaussian errors
log_lik = -n/2 * np.log(2*np.pi*sigma2_mle) - sse/(2*sigma2_mle)

aic = -2 * log_lik + 2 * k
bic = -2 * log_lik + k * np.log(n)

return log_lik, aic, bic, poly

# Generate data: true model is cubic, but noisy
np.random.seed(42)
n = 50
X = np.linspace(-3, 3, n)
true_y = 0.5 * X**3 - 2 * X + 1
y = true_y + np.random.normal(0, 2.0, n)

print(f"{'Degree':<8} {'LogLik':<12} {'AIC':<12} {'BIC':<12}")
print("-" * 44)
for degree in range(1, 11):
log_lik, aic, bic, _ = compute_aic_bic(X, y, degree)
marker = " <-- AIC best" if degree == 3 else ""
marker += " <-- BIC best" if degree == 3 else ""
print(f"{degree:<8} {log_lik:<12.2f} {aic:<12.2f} {bic:<12.2f}{marker}")

print()
print("Both AIC and BIC should select degree 3 (the true model)")
print("BIC penalizes complexity more heavily (k*log(n) vs 2k)")
print("BIC is consistent: selects the true model as n->inf")
print("AIC minimizes predictive error, may prefer slightly higher degree")

AIC vs BIC vs Marginal Likelihood

PropertyAICBICMarginal Likelihood
Penalty term2k2kklognk\log nAutomatic via integration
Consistent?NoYesYes
Optimal for prediction?Yes (asymptotically)Not for predictionYes (for Bayesian prediction)
Requires prior?NoNoYes
Computable analytically?AlwaysAlwaysOnly for conjugate models
Preferred for n smallAIC preferredBIC too aggressiveMarginal likelihood
Preferred for n largeBoth convergeBIC betterBoth converge
Theoretical basisKL divergenceLaplace approx to log P(D|M)Exact Bayesian inference

Computing Marginal Likelihood Approximately

For complex models where the marginal likelihood has no closed form:

1. Laplace Approximation

Approximate the posterior as a Gaussian around the MAP:

logP(DM)logP(Dθ^MAP)+logP(θ^MAP)+d2log2π12logH\log P(\mathcal{D} \mid M) \approx \log P(\mathcal{D} \mid \hat{\theta}_{MAP}) + \log P(\hat{\theta}_{MAP}) + \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\mathbf{H}|

where H=2logP(θD)θ^MAP\mathbf{H} = -\nabla^2 \log P(\theta \mid \mathcal{D})|_{\hat{\theta}_{MAP}} is the Hessian of the negative log-posterior.

2. Harmonic Mean Estimator (from MCMC samples)

logP(DM)log[1Nt=1N1P(Dθ(t))]\log P(\mathcal{D} \mid M) \approx -\log\left[\frac{1}{N}\sum_{t=1}^N \frac{1}{P(\mathcal{D} \mid \theta^{(t)})}\right]

where θ(t)\theta^{(t)} are MCMC samples from the posterior. Warning: This estimator has infinite variance in general - avoid in practice.

3. Annealed Importance Sampling (AIS)

The most reliable general-purpose method for estimating log marginal likelihoods, especially for complex models. Uses a sequence of intermediate distributions to bridge the prior and posterior.

4. WAIC (Widely Applicable Information Criterion)

WAIC=2i=1nlogP(yiD)+2pWAIC\text{WAIC} = -2\sum_{i=1}^n \log P(y_i \mid \mathcal{D}) + 2\text{p}_{WAIC}

where P(yiD)=1NtP(yiθ(t))P(y_i \mid \mathcal{D}) = \frac{1}{N}\sum_t P(y_i \mid \theta^{(t)}) is the posterior predictive and pWAIC\text{p}_{WAIC} is the effective number of parameters estimated from posterior variance.

WAIC is a fully Bayesian information criterion that works with MCMC samples and is the recommended approach in PyMC / Stan for model comparison.

import pymc as pm
import arviz as az
import numpy as np

# Compare two models using LOO-CV (leave-one-out cross-validation)
# LOO approximates the expected log predictive density
# and is the recommended comparison tool in PyMC/ArviZ

np.random.seed(42)
n = 100
X = np.random.randn(n)
y = 1.5 * X + np.random.randn(n) # true: linear

# Model 1: Linear regression
with pm.Model() as linear_model:
alpha = pm.Normal('alpha', 0, 10)
beta = pm.Normal('beta', 0, 10)
sigma = pm.HalfNormal('sigma', 1)
mu = alpha + beta * X
y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
trace_linear = pm.sample(2000, tune=1000, return_inferencedata=True,
progressbar=False, random_seed=42)
# Compute log-likelihood for WAIC/LOO
pm.compute_log_likelihood(trace_linear)

# Model 2: Quadratic regression (overly complex)
with pm.Model() as quad_model:
alpha = pm.Normal('alpha', 0, 10)
beta1 = pm.Normal('beta1', 0, 10)
beta2 = pm.Normal('beta2', 0, 10)
sigma = pm.HalfNormal('sigma', 1)
mu = alpha + beta1 * X + beta2 * X**2
y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
trace_quad = pm.sample(2000, tune=1000, return_inferencedata=True,
progressbar=False, random_seed=42)
pm.compute_log_likelihood(trace_quad)

# Compare using LOO
loo_linear = az.loo(trace_linear)
loo_quad = az.loo(trace_quad)

print(f"LOO comparison (higher ELPD = better):")
print(f" Linear model ELPD: {loo_linear.elpd_loo:.2f} ± {loo_linear.se:.2f}")
print(f" Quadratic model ELPD: {loo_quad.elpd_loo:.2f} ± {loo_quad.se:.2f}")
print(f"\nLinear model correctly preferred (true model)")
print(f"Occam's razor: quadratic model penalized for extra complexity")

# Also compare with WAIC
comparison = az.compare(
{'linear': trace_linear, 'quadratic': trace_quad},
ic='loo'
)
print("\nModel comparison table:")
print(comparison)

Practical Recommendation for ML Engineers

SituationRecommended Approach
Comparing neural network architecturesCross-validation or LOO-CV
Bayesian model in PyMC/StanWAIC or LOO via ArviZ
Model selection with many candidate modelsBIC (fast, interpretable)
Small n, want predictive optimalityAIC or WAIC
Scientific publication, comparing nested modelsBayes factor (requires proper priors)
Comparing GP kernelsLog marginal likelihood (closed form)
Deep learning architecture searchNAS + validation loss; Bayesian optimization with BIC proxy

Interview Questions

Q1: What is the Bayes factor and how does it differ from a p-value for model comparison?

The Bayes factor B12=P(DM1)/P(DM2)B_{12} = P(\mathcal{D}|M_1)/P(\mathcal{D}|M_2) is the ratio of marginal likelihoods - how much more likely the data is under M1M_1 than M2M_2, after averaging over all parameter values. Unlike a p-value: (1) Bayes factors can provide evidence for any hypothesis, including the simpler null model (a p-value can only reject the null, never confirm it); (2) Bayes factors are symmetric (B12=1/B21B_{12} = 1/B_{21}); (3) Bayes factors have a direct probability interpretation - with equal prior odds, B12=5B_{12} = 5 means P(M1D)/P(M2D)=5P(M_1|\mathcal{D})/P(M_2|\mathcal{D}) = 5; (4) Bayes factors automatically penalize model complexity through integration (Occam's razor), while p-values require explicit degrees-of-freedom corrections. The downside: Bayes factors require proper priors and are sensitive to prior choice in a way that posterior estimates are not.

Q2: Why does the marginal likelihood automatically penalize model complexity?

The marginal likelihood integrates the likelihood over the prior: P(DM)=P(Dθ,M)P(θM)dθP(\mathcal{D}|M) = \int P(\mathcal{D}|\theta,M)P(\theta|M)d\theta. A complex model has many parameters and thus a diffuse prior spread across a large space. Most of those parameter settings explain the data poorly, so the average likelihood is low. A simple model concentrates its prior probability in a smaller region - if that region happens to contain the data-generating process, the average likelihood is high. This is Occam's razor as mathematics: simple models that explain the data get higher marginal likelihood than complex models that overfit (which would need specific parameter values to achieve the same fit). The Occam factor - the ratio of posterior volume to prior volume - quantifies this penalty. It's proportional to exp(k2logn)\exp(-\frac{k}{2}\log n), which is exactly the BIC penalty.

Q3: What is the difference between AIC and BIC, and when would you prefer each?

AIC penalizes complexity by 2k2k (twice the number of parameters), while BIC uses klognk\log n (larger for n>7n > 7). Theoretically: AIC minimizes the expected KL divergence from the true data distribution to the model - it's optimized for prediction. BIC approximates the log marginal likelihood and is consistent: as nn \to \infty, BIC selects the true model (assuming it's in the candidate set), while AIC tends to overfit slightly. Practically: use AIC when you want the best predictive model and aren't sure the true model is among your candidates; use BIC when you want to identify the "true" model structure. For small nn (< 40), prefer AICc (corrected AIC): AICc=AIC+2k(k+1)/(nk1)\text{AICc} = \text{AIC} + 2k(k+1)/(n-k-1).

Q4: How does cross-validation relate to the marginal likelihood?

Leave-one-out cross-validation (LOO-CV) approximates the expected log predictive density (ELPD): ELPDLOO=ilogP(yiyi,M)\text{ELPD}_{LOO} = \sum_i \log P(y_i | y_{-i}, M) where yiy_{-i} denotes all data except the ii-th point. The marginal likelihood can be written as: logP(DM)=i=1nlogP(yiy1:i1,M)\log P(\mathcal{D}|M) = \sum_{i=1}^n \log P(y_i | y_{1:i-1}, M) - the product of sequential one-step-ahead predictive probabilities (prequential interpretation). For large nn, WAIC and the marginal likelihood converge to LOO-CV when the posterior is approximately normal. In practice: LOO-CV (computed via Pareto-smoothed importance sampling in ArviZ) is the recommended model comparison tool in PyMC/Stan because it: works with MCMC samples, handles model misspecification, and gives interpretable effective number of parameters pLOOp_{LOO}.

Q5: Why is the marginal likelihood sensitive to the prior, and what does this imply for model comparison?

The marginal likelihood integrates over the prior: P(DM)=P(Dθ)P(θM)dθP(\mathcal{D}|M) = \int P(\mathcal{D}|\theta)P(\theta|M)d\theta. An improper prior (integrating to infinity) makes the marginal likelihood undefined or improper - Bayes factors between models with improper priors are undefined. Even with proper priors, a very diffuse prior assigns probability to regions where the likelihood is negligible, reducing the marginal likelihood. This means Bayes factors are sensitive to prior choice, especially for parameters with little data. Practical implications: (1) always use proper priors for model comparison; (2) Bayes factors are most reliable when both models have informative priors informed by domain knowledge; (3) for exploratory model comparison, prefer predictive criteria (LOO-CV, WAIC) which are less prior-sensitive; (4) when using BIC, the prior cancels out in the Laplace approximation, so BIC is more robust to prior choice (but requires large n for the approximation to hold).

:::tip 🎮 Interactive Playground

Visualize this concept: Try the KL Divergence demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.