Bayesian Model Comparison

The Problem: Which Model Should You Trust?

You're choosing between three regression models:

M1: Linear regression - simple, few parameters
M2: Polynomial degree-3 - moderate complexity
M3: Polynomial degree-10 - high complexity

On the training set, M3 achieves the lowest error. Always. But you know overfitting is real - M3 might just be memorizing noise.

The frequentist approach: use cross-validation to estimate generalization error, then pick the best model. This works but requires held-out data.

The Bayesian approach: use the marginal likelihood (model evidence) - $P(\mathcal{D} \mid M_i)$ - which automatically penalizes overly complex models through integration over the prior. No held-out data required. Occam's razor emerges mathematically.

This lesson covers the Bayesian framework for model comparison: Bayes factors, marginal likelihood, and how BIC/AIC approximate the Bayesian answer.

Bayesian Model Comparison: The Framework

Given data $\mathcal{D}$ and two competing models $M_1$ and $M_2$ , we want to compute the posterior odds:

$\frac{P(M_1 \mid \mathcal{D})}{P(M_2 \mid \mathcal{D})} = \frac{P(\mathcal{D} \mid M_1)}{P(\mathcal{D} \mid M_2)} \cdot \frac{P(M_1)}{P(M_2)}$

The first factor is the Bayes factor $B_{12}$ ; the second is the prior odds:

$\text{posterior odds} = B_{12} \times \text{prior odds}$

The Bayes factor $B_{12}$ is the ratio of marginal likelihoods:

$B_{12} = \frac{P(\mathcal{D} \mid M_1)}{P(\mathcal{D} \mid M_2)}$

where the marginal likelihood (model evidence) for model $M_i$ is:

$P(\mathcal{D} \mid M_i) = \int P(\mathcal{D} \mid \theta_i, M_i) \cdot P(\theta_i \mid M_i) \, d\theta_i$

This integral averages the likelihood over all possible parameter values under the prior. A model with many parameters has a diffuse prior over a large space - most of those parameter values fit the data poorly, so the average likelihood is lower than for a simpler model that concentrates its prior probability in a region that explains the data.

Occam's Razor: Complexity Penalty for Free

The marginal likelihood automatically penalizes complexity. Here's the intuition:

Consider two models:

$M_1$ : simple, has a concentrated prior over a small region of data space
$M_2$ : complex, has a diffuse prior over a large region of data space

P(D | M1):  Prior concentrates on few datasets     P(D | M2): Prior spreads over many datasets
            ↓                                                   ↓
  ────┐                                           ───────────────────────────
      │                                          │                           │
   M1 │       high density here                  │  M2  low density          │
      │       ██████████                         │      everywhere           │
  ────┘                                           ───────────────────────────
           D1   D_observed                                D_observed

If $\mathcal{D}_{observed}$ is in the concentrated region of $M_1$ , $P(\mathcal{D} \mid M_1)$ is HIGH. The complex model $M_2$ places little probability on any specific dataset because it can explain so many - so $P(\mathcal{D} \mid M_2)$ is LOW even if $M_2$ fits $\mathcal{D}_{observed}$ well with its MAP parameters.

Formally, the marginal likelihood can be approximated as:

$\log P(\mathcal{D} \mid M) \approx \log P(\mathcal{D} \mid \hat{\theta}_{MAP}) + \text{Occam factor}$

where the Occam factor is:

$\text{Occam factor} = \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\mathbf{H}|$

Here $\mathbf{H}$ is the Hessian of the negative log-posterior at $\hat{\theta}_{MAP}$ and $d$ is the number of parameters. The Occam factor penalizes models with many parameters (large $d$ ) and diffuse posteriors (small $|\mathbf{H}|$ ).

The Bayes Factor: Interpretation Scale

Kass & Raftery (1995) provide an interpretation scale:

$2\ln B_{12}$	Bayes Factor $B_{12}$	Evidence Against $M_2$
0 to 2	1 to 3	Not worth mentioning
2 to 6	3 to 20	Positive evidence for $M_1$
6 to 10	20 to 150	Strong evidence for $M_1$
> 10	> 150	Very strong evidence for $M_1$

Note: Unlike p-values, the Bayes factor provides evidence for any of the hypotheses - including the null. A Bayes factor of $B_{12} = 0.05$ is strong evidence for $M_2$ (the null).

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def interpret_bayes_factor(B12):
    """Interpret Bayes factor using Kass & Raftery scale."""
    log2B = 2 * np.log(B12)
    if B12 < 1:
        return f"B12 = {B12:.3f}: Evidence for M2 (against M1)"
    elif log2B < 2:
        return f"B12 = {B12:.3f}: Not worth mentioning"
    elif log2B < 6:
        return f"B12 = {B12:.3f}: Positive evidence for M1"
    elif log2B < 10:
        return f"B12 = {B12:.3f}: Strong evidence for M1"
    else:
        return f"B12 = {B12:.3f}: Very strong evidence for M1"

# Example Bayes factors
for B12 in [0.5, 2.0, 5.0, 25.0, 200.0]:
    print(interpret_bayes_factor(B12))

Computing Marginal Likelihood for Simple Models

For conjugate models, the marginal likelihood has a closed form.

Beta-Binomial Marginal Likelihood

For Bernoulli observations with Beta prior $\theta \sim \text{Beta}(\alpha, \beta)$ :

$P(\mathcal{D} \mid M) = \int_0^1 \theta^h (1-\theta)^t \cdot \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)} d\theta = \frac{B(\alpha+h, \beta+t)}{B(\alpha,\beta)}$

where $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ is the Beta function.

from scipy.special import betaln

def beta_binomial_log_evidence(h, t, alpha, beta):
    """
    Log marginal likelihood for Beta-Binomial model.
    h: number of successes
    t: number of failures
    alpha, beta: prior parameters
    """
    return betaln(alpha + h, beta + t) - betaln(alpha, beta)

# Model comparison: is this coin fair (M1) or biased toward heads (M2)?
# Data: 15 heads, 5 tails (n=20)
h, t = 15, 5

# M1: Fair coin prior Beta(10, 10) -- strong belief coin is fair
log_ev_M1 = beta_binomial_log_evidence(h, t, alpha=10, beta=10)

# M2: Biased coin prior Beta(2, 1) -- weak prior favoring heads
log_ev_M2 = beta_binomial_log_evidence(h, t, alpha=2, beta=1)

# Bayes factor
log_BF_12 = log_ev_M1 - log_ev_M2
BF_12 = np.exp(log_BF_12)

print(f"Data: {h} heads, {t} tails")
print(f"Log evidence M1 (fair coin prior): {log_ev_M1:.4f}")
print(f"Log evidence M2 (biased prior):    {log_ev_M2:.4f}")
print(f"Log Bayes factor (M1 vs M2): {log_BF_12:.4f}")
print(f"Bayes factor: {BF_12:.4f}")
print(interpret_bayes_factor(BF_12))

Gaussian Linear Regression Marginal Likelihood

For Bayesian linear regression with Gaussian prior on weights:

$\log P(\mathbf{y} \mid \mathbf{X}, M) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\log|\mathbf{C}| - \frac{1}{2}\mathbf{y}^\top \mathbf{C}^{-1} \mathbf{y}$

where $\mathbf{C} = \sigma_n^2\mathbf{I} + \sigma_w^2 \mathbf{X}\mathbf{X}^\top$ is the marginal covariance of $\mathbf{y}$ .

This is the same formula used in Gaussian process regression for kernel hyperparameter optimization.

BIC and AIC: Frequentist Approximations to Bayesian Model Comparison

Bayesian Information Criterion (BIC)

The BIC approximates the log marginal likelihood using the Laplace approximation:

$\text{BIC} = -2\log P(\mathcal{D} \mid \hat{\theta}_{MLE}) + k \log n$

where $k$ = number of parameters, $n$ = number of data points.

Derivation: The Laplace approximation to $\log P(\mathcal{D} \mid M)$ around the MAP gives:

$\log P(\mathcal{D} \mid M) \approx \log P(\mathcal{D} \mid \hat{\theta}_{MAP}) + \log P(\hat{\theta}_{MAP}) - \frac{k}{2}\log n + \text{const}$

Assuming a flat prior (so $\hat{\theta}_{MAP} = \hat{\theta}_{MLE}$ and the prior term is constant), and approximating the Hessian, the dominant term gives:

$-2\log P(\mathcal{D} \mid M) \approx -2\log P(\mathcal{D} \mid \hat{\theta}_{MLE}) + k\log n = \text{BIC}$

BIC model selection: Choose the model with the lowest BIC.

Akaike Information Criterion (AIC)

$\text{AIC} = -2\log P(\mathcal{D} \mid \hat{\theta}_{MLE}) + 2k$

AIC is not derived from the marginal likelihood - it estimates the expected out-of-sample KL divergence from the true data distribution to the fitted model (Akaike, 1974). It uses a weaker complexity penalty ( $2k$ vs $k\log n$ ) and is more appropriate for predictive model selection when $n$ is small.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def compute_aic_bic(X, y, degree):
    """Fit polynomial regression and compute AIC and BIC."""
    n = len(y)
    poly = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=True)),
        ('lr', LinearRegression(fit_intercept=False))
    ])
    poly.fit(X.reshape(-1, 1), y)
    y_pred = poly.predict(X.reshape(-1, 1))
    residuals = y - y_pred
    sse = np.sum(residuals**2)

    # Number of parameters
    k = degree + 1  # polynomial coefficients
    # Also estimate sigma^2 (or fix it)
    sigma2_mle = sse / n

    # Log-likelihood assuming Gaussian errors
    log_lik = -n/2 * np.log(2*np.pi*sigma2_mle) - sse/(2*sigma2_mle)

    aic = -2 * log_lik + 2 * k
    bic = -2 * log_lik + k * np.log(n)

    return log_lik, aic, bic, poly

# Generate data: true model is cubic, but noisy
np.random.seed(42)
n = 50
X = np.linspace(-3, 3, n)
true_y = 0.5 * X**3 - 2 * X + 1
y = true_y + np.random.normal(0, 2.0, n)

print(f"{'Degree':<8} {'LogLik':<12} {'AIC':<12} {'BIC':<12}")
print("-" * 44)
for degree in range(1, 11):
    log_lik, aic, bic, _ = compute_aic_bic(X, y, degree)
    marker = " <-- AIC best" if degree == 3 else ""
    marker += " <-- BIC best" if degree == 3 else ""
    print(f"{degree:<8} {log_lik:<12.2f} {aic:<12.2f} {bic:<12.2f}{marker}")

print()
print("Both AIC and BIC should select degree 3 (the true model)")
print("BIC penalizes complexity more heavily (k*log(n) vs 2k)")
print("BIC is consistent: selects the true model as n->inf")
print("AIC minimizes predictive error, may prefer slightly higher degree")

AIC vs BIC vs Marginal Likelihood

Property	AIC	BIC	Marginal Likelihood
Penalty term	$2k$	$k\log n$	Automatic via integration
Consistent?	No	Yes	Yes
Optimal for prediction?	Yes (asymptotically)	Not for prediction	Yes (for Bayesian prediction)
Requires prior?	No	No	Yes
Computable analytically?	Always	Always	Only for conjugate models
Preferred for n small	AIC preferred	BIC too aggressive	Marginal likelihood
Preferred for n large	Both converge	BIC better	Both converge
Theoretical basis	KL divergence	Laplace approx to log P(D\|M)	Exact Bayesian inference

Computing Marginal Likelihood Approximately

For complex models where the marginal likelihood has no closed form:

1. Laplace Approximation

Approximate the posterior as a Gaussian around the MAP:

$\log P(\mathcal{D} \mid M) \approx \log P(\mathcal{D} \mid \hat{\theta}_{MAP}) + \log P(\hat{\theta}_{MAP}) + \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\mathbf{H}|$

where $\mathbf{H} = -\nabla^2 \log P(\theta \mid \mathcal{D})|_{\hat{\theta}_{MAP}}$ is the Hessian of the negative log-posterior.

2. Harmonic Mean Estimator (from MCMC samples)

$\log P(\mathcal{D} \mid M) \approx -\log\left[\frac{1}{N}\sum_{t=1}^N \frac{1}{P(\mathcal{D} \mid \theta^{(t)})}\right]$

where $\theta^{(t)}$ are MCMC samples from the posterior. Warning: This estimator has infinite variance in general - avoid in practice.

3. Annealed Importance Sampling (AIS)

The most reliable general-purpose method for estimating log marginal likelihoods, especially for complex models. Uses a sequence of intermediate distributions to bridge the prior and posterior.

4. WAIC (Widely Applicable Information Criterion)

$\text{WAIC} = -2\sum_{i=1}^n \log P(y_i \mid \mathcal{D}) + 2\text{p}_{WAIC}$

where $P(y_i \mid \mathcal{D}) = \frac{1}{N}\sum_t P(y_i \mid \theta^{(t)})$ is the posterior predictive and $\text{p}_{WAIC}$ is the effective number of parameters estimated from posterior variance.

WAIC is a fully Bayesian information criterion that works with MCMC samples and is the recommended approach in PyMC / Stan for model comparison.

import pymc as pm
import arviz as az
import numpy as np

# Compare two models using LOO-CV (leave-one-out cross-validation)
# LOO approximates the expected log predictive density
# and is the recommended comparison tool in PyMC/ArviZ

np.random.seed(42)
n = 100
X = np.random.randn(n)
y = 1.5 * X + np.random.randn(n)  # true: linear

# Model 1: Linear regression
with pm.Model() as linear_model:
    alpha = pm.Normal('alpha', 0, 10)
    beta = pm.Normal('beta', 0, 10)
    sigma = pm.HalfNormal('sigma', 1)
    mu = alpha + beta * X
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
    trace_linear = pm.sample(2000, tune=1000, return_inferencedata=True,
                             progressbar=False, random_seed=42)
    # Compute log-likelihood for WAIC/LOO
    pm.compute_log_likelihood(trace_linear)

# Model 2: Quadratic regression (overly complex)
with pm.Model() as quad_model:
    alpha = pm.Normal('alpha', 0, 10)
    beta1 = pm.Normal('beta1', 0, 10)
    beta2 = pm.Normal('beta2', 0, 10)
    sigma = pm.HalfNormal('sigma', 1)
    mu = alpha + beta1 * X + beta2 * X**2
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
    trace_quad = pm.sample(2000, tune=1000, return_inferencedata=True,
                           progressbar=False, random_seed=42)
    pm.compute_log_likelihood(trace_quad)

# Compare using LOO
loo_linear = az.loo(trace_linear)
loo_quad = az.loo(trace_quad)

print(f"LOO comparison (higher ELPD = better):")
print(f"  Linear model ELPD:    {loo_linear.elpd_loo:.2f} ± {loo_linear.se:.2f}")
print(f"  Quadratic model ELPD: {loo_quad.elpd_loo:.2f} ± {loo_quad.se:.2f}")
print(f"\nLinear model correctly preferred (true model)")
print(f"Occam's razor: quadratic model penalized for extra complexity")

# Also compare with WAIC
comparison = az.compare(
    {'linear': trace_linear, 'quadratic': trace_quad},
    ic='loo'
)
print("\nModel comparison table:")
print(comparison)

Practical Recommendation for ML Engineers

Situation	Recommended Approach
Comparing neural network architectures	Cross-validation or LOO-CV
Bayesian model in PyMC/Stan	WAIC or LOO via ArviZ
Model selection with many candidate models	BIC (fast, interpretable)
Small n, want predictive optimality	AIC or WAIC
Scientific publication, comparing nested models	Bayes factor (requires proper priors)
Comparing GP kernels	Log marginal likelihood (closed form)
Deep learning architecture search	NAS + validation loss; Bayesian optimization with BIC proxy

Interview Questions

Q1: What is the Bayes factor and how does it differ from a p-value for model comparison?

The Bayes factor $B_{12} = P(\mathcal{D}|M_1)/P(\mathcal{D}|M_2)$ is the ratio of marginal likelihoods - how much more likely the data is under $M_1$ than $M_2$ , after averaging over all parameter values. Unlike a p-value: (1) Bayes factors can provide evidence for any hypothesis, including the simpler null model (a p-value can only reject the null, never confirm it); (2) Bayes factors are symmetric ( $B_{12} = 1/B_{21}$ ); (3) Bayes factors have a direct probability interpretation - with equal prior odds, $B_{12} = 5$ means $P(M_1|\mathcal{D})/P(M_2|\mathcal{D}) = 5$ ; (4) Bayes factors automatically penalize model complexity through integration (Occam's razor), while p-values require explicit degrees-of-freedom corrections. The downside: Bayes factors require proper priors and are sensitive to prior choice in a way that posterior estimates are not.

Q2: Why does the marginal likelihood automatically penalize model complexity?

The marginal likelihood integrates the likelihood over the prior: $P(\mathcal{D}|M) = \int P(\mathcal{D}|\theta,M)P(\theta|M)d\theta$ . A complex model has many parameters and thus a diffuse prior spread across a large space. Most of those parameter settings explain the data poorly, so the average likelihood is low. A simple model concentrates its prior probability in a smaller region - if that region happens to contain the data-generating process, the average likelihood is high. This is Occam's razor as mathematics: simple models that explain the data get higher marginal likelihood than complex models that overfit (which would need specific parameter values to achieve the same fit). The Occam factor - the ratio of posterior volume to prior volume - quantifies this penalty. It's proportional to $\exp(-\frac{k}{2}\log n)$ , which is exactly the BIC penalty.

Q3: What is the difference between AIC and BIC, and when would you prefer each?

AIC penalizes complexity by $2k$ (twice the number of parameters), while BIC uses $k\log n$ (larger for $n > 7$ ). Theoretically: AIC minimizes the expected KL divergence from the true data distribution to the model - it's optimized for prediction. BIC approximates the log marginal likelihood and is consistent: as $n \to \infty$ , BIC selects the true model (assuming it's in the candidate set), while AIC tends to overfit slightly. Practically: use AIC when you want the best predictive model and aren't sure the true model is among your candidates; use BIC when you want to identify the "true" model structure. For small $n$ (< 40), prefer AICc (corrected AIC): $\text{AICc} = \text{AIC} + 2k(k+1)/(n-k-1)$ .

Q4: How does cross-validation relate to the marginal likelihood?

Leave-one-out cross-validation (LOO-CV) approximates the expected log predictive density (ELPD): $\text{ELPD}_{LOO} = \sum_i \log P(y_i | y_{-i}, M)$ where $y_{-i}$ denotes all data except the $i$ -th point. The marginal likelihood can be written as: $\log P(\mathcal{D}|M) = \sum_{i=1}^n \log P(y_i | y_{1:i-1}, M)$ - the product of sequential one-step-ahead predictive probabilities (prequential interpretation). For large $n$ , WAIC and the marginal likelihood converge to LOO-CV when the posterior is approximately normal. In practice: LOO-CV (computed via Pareto-smoothed importance sampling in ArviZ) is the recommended model comparison tool in PyMC/Stan because it: works with MCMC samples, handles model misspecification, and gives interpretable effective number of parameters $p_{LOO}$ .

Q5: Why is the marginal likelihood sensitive to the prior, and what does this imply for model comparison?

The marginal likelihood integrates over the prior: $P(\mathcal{D}|M) = \int P(\mathcal{D}|\theta)P(\theta|M)d\theta$ . An improper prior (integrating to infinity) makes the marginal likelihood undefined or improper - Bayes factors between models with improper priors are undefined. Even with proper priors, a very diffuse prior assigns probability to regions where the likelihood is negligible, reducing the marginal likelihood. This means Bayes factors are sensitive to prior choice, especially for parameters with little data. Practical implications: (1) always use proper priors for model comparison; (2) Bayes factors are most reliable when both models have informative priors informed by domain knowledge; (3) for exploratory model comparison, prefer predictive criteria (LOO-CV, WAIC) which are less prior-sensitive; (4) when using BIC, the prior cancels out in the Laplace approximation, so BIC is more robust to prior choice (but requires large n for the approximation to hold).

:::tip 🎮 Interactive Playground

Visualize this concept: Try the KL Divergence demo on the EngineersOfAI Playground - no code required.

:::

The Problem: Which Model Should You Trust?​

Bayesian Model Comparison: The Framework​

Occam's Razor: Complexity Penalty for Free​

The Bayes Factor: Interpretation Scale​

Computing Marginal Likelihood for Simple Models​

Beta-Binomial Marginal Likelihood​

Gaussian Linear Regression Marginal Likelihood​

BIC and AIC: Frequentist Approximations to Bayesian Model Comparison​

Bayesian Information Criterion (BIC)​

Akaike Information Criterion (AIC)​

AIC vs BIC vs Marginal Likelihood​

Computing Marginal Likelihood Approximately​

1. Laplace Approximation​

2. Harmonic Mean Estimator (from MCMC samples)​

3. Annealed Importance Sampling (AIS)​

4. WAIC (Widely Applicable Information Criterion)​

Practical Recommendation for ML Engineers​

Interview Questions​