Markov Chain Monte Carlo
The Problem: Most Posteriors Have No Closed Form
You want to perform Bayesian inference on a Bayesian neural network. The posterior is:
where has millions of parameters. To compute any quantity of interest - the posterior mean, credible intervals, predictive distributions - you need to integrate over this high-dimensional posterior:
This integral is intractable. The posterior is a complex, potentially multimodal distribution in millions of dimensions. You can't compute it analytically, and numerical quadrature fails catastrophically in high dimensions.
MCMC is the solution: instead of computing the integral analytically, draw samples from the posterior. Then approximate any integral by a sample average:
where are samples from .
The genius of MCMC: you can draw samples from a distribution you can only evaluate up to a normalizing constant. You never need to compute .
Why Markov Chains Work
A Markov chain is a sequence of random variables where each state depends only on the previous state:
The transition kernel defines the probability of moving from state to state .
The key theorem: If the Markov chain satisfies:
- Irreducibility: Can reach any state from any state
- Aperiodicity: Doesn't cycle with a fixed period
- Detailed balance:
Then the chain has a unique stationary distribution , and in the limit, the chain samples from .
The MCMC trick: design a transition kernel that satisfies detailed balance with respect to the target posterior .
Metropolis-Hastings Algorithm
The most fundamental MCMC algorithm. Given current state :
Step 1 - Propose: sample a candidate from a proposal distribution
Step 2 - Accept/Reject: compute the acceptance ratio:
For symmetric proposals (, e.g., Gaussian random walk), this simplifies to:
Step 3 - Update: set with probability , otherwise .
Why this works: The detailed balance condition is satisfied by construction. If has higher posterior probability than , we always move there. If it has lower posterior probability, we move with probability proportional to the ratio - allowing the chain to explore but spending more time in high-probability regions.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def metropolis_hastings(log_posterior, theta_init, proposal_std, n_samples, burnin=1000):
"""
Metropolis-Hastings with Gaussian random walk proposal.
Args:
log_posterior: function that computes log p(theta | data)
theta_init: starting point
proposal_std: standard deviation of Gaussian proposal
n_samples: number of samples to draw after burnin
burnin: number of initial samples to discard
Returns:
samples: array of shape (n_samples, d)
acceptance_rate: fraction of proposals accepted
"""
d = len(np.atleast_1d(theta_init))
theta = np.atleast_1d(theta_init).copy().astype(float)
samples = []
n_accepted = 0
total_steps = n_samples + burnin
for i in range(total_steps):
# Propose: symmetric Gaussian random walk
theta_star = theta + np.random.normal(0, proposal_std, size=d)
# Acceptance ratio (log space for numerical stability)
log_alpha = log_posterior(theta_star) - log_posterior(theta)
alpha = min(1.0, np.exp(log_alpha))
# Accept or reject
if np.random.uniform() < alpha:
theta = theta_star.copy()
n_accepted += 1
# Collect samples after burnin
if i >= burnin:
samples.append(theta.copy())
acceptance_rate = n_accepted / total_steps
return np.array(samples), acceptance_rate
# Example: posterior for a coin flip
# Prior: Beta(2, 2) -- weakly centered at 0.5
# Data: 12 heads, 8 tails (n=20)
n_heads, n_tails = 12, 8
def log_posterior_coin(theta_arr):
theta = theta_arr[0]
if not (0 < theta < 1):
return -np.inf
log_likelihood = n_heads * np.log(theta) + n_tails * np.log(1 - theta)
log_prior = stats.beta(2, 2).logpdf(theta)
return log_likelihood + log_prior
# True posterior: Beta(2+12, 2+8) = Beta(14, 10)
true_posterior = stats.beta(14, 10)
# Run MH sampler
np.random.seed(42)
samples, acc_rate = metropolis_hastings(
log_posterior=log_posterior_coin,
theta_init=np.array([0.5]),
proposal_std=0.1,
n_samples=5000,
burnin=1000
)
samples = samples[:, 0] # Extract scalar theta
print(f"Acceptance rate: {acc_rate:.3f} (ideal: 0.23-0.44 for 1D)")
print(f"Sample mean: {samples.mean():.4f} (true: {true_posterior.mean():.4f})")
print(f"Sample std: {samples.std():.4f} (true: {true_posterior.std():.4f})")
print(f"Sample 95% CI: ({np.percentile(samples, 2.5):.4f}, {np.percentile(samples, 97.5):.4f})")
print(f"True 95% CI: ({true_posterior.ppf(0.025):.4f}, {true_posterior.ppf(0.975):.4f})")
:::tip The 0.234 Optimal Acceptance Rate For high-dimensional problems, the theoretically optimal acceptance rate for a Gaussian random walk MH sampler is approximately 0.234 (23.4%). In 1D, it's about 0.44. If your acceptance rate is too high (> 0.5), your proposals are too small - the chain mixes slowly. If too low (< 0.1), proposals are too large - almost all proposals are rejected. Tune the proposal variance to hit these targets during a warm-up phase. :::
Gibbs Sampling
Gibbs sampling is a special case of MH where acceptance is always 1 - by sampling from full conditional distributions.
The algorithm: For a parameter vector , cycle through each dimension, sampling from its conditional given all others:
When to use Gibbs: When each full conditional is a known distribution you can sample from (often happens with conjugate priors in hierarchical models).
The key advantage: No proposal tuning needed. Acceptance rate is always 1. Works beautifully in graphical models where conditionals factor nicely.
def gibbs_sampler_gaussian(mu_true, Sigma_true, n_samples=5000, burnin=500):
"""
Gibbs sampler for a 2D Gaussian posterior.
Full conditionals of a bivariate Gaussian are univariate Gaussians.
"""
rho = Sigma_true[0, 1] / np.sqrt(Sigma_true[0, 0] * Sigma_true[1, 1])
sig1, sig2 = np.sqrt(Sigma_true[0, 0]), np.sqrt(Sigma_true[1, 1])
samples = np.zeros((n_samples + burnin, 2))
theta = np.zeros(2) # Initialize at origin
for t in range(1, n_samples + burnin):
# Sample theta_1 | theta_2 (Gaussian full conditional)
mu1_given2 = mu_true[0] + rho * (sig1/sig2) * (theta[1] - mu_true[1])
sig1_given2 = sig1 * np.sqrt(1 - rho**2)
theta[0] = np.random.normal(mu1_given2, sig1_given2)
# Sample theta_2 | theta_1 (Gaussian full conditional)
mu2_given1 = mu_true[1] + rho * (sig2/sig1) * (theta[0] - mu_true[0])
sig2_given1 = sig2 * np.sqrt(1 - rho**2)
theta[1] = np.random.normal(mu2_given1, sig2_given1)
samples[t] = theta.copy()
return samples[burnin:]
# Example: sample from a bivariate Gaussian with correlation
mu_true = np.array([1.0, -0.5])
Sigma_true = np.array([[1.0, 0.8], [0.8, 1.0]])
np.random.seed(42)
samples = gibbs_sampler_gaussian(mu_true, Sigma_true, n_samples=5000)
print(f"Sample mean: {samples.mean(axis=0)}")
print(f"True mean: {mu_true}")
print(f"Sample cov diagonal: {np.var(samples, axis=0)}")
print(f"True cov diagonal: {np.diag(Sigma_true)}")
print(f"Sample correlation: {np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]:.4f}")
print(f"True correlation: 0.8")
Weakness of Gibbs: When variables are highly correlated, Gibbs sampling mixes very slowly - the chain explores only small regions at each step. High-correlation structures require many steps to traverse the full posterior.
Modern MCMC: NUTS and HMC
Modern probabilistic programming libraries use Hamiltonian Monte Carlo (HMC) and No-U-Turn Sampler (NUTS), which are far more efficient than random-walk MH in high dimensions.
The key idea: Instead of a random walk, use gradient information (like an optimization step) to make informed proposals. HMC adds momentum variables and simulates Hamiltonian dynamics:
The "potential energy" is the negative log posterior. The "kinetic energy" is a Gaussian term. Simulating Hamilton's equations using leapfrog integration generates proposals that move along iso-probability contours - very high acceptance rates.
NUTS automatically tunes the number of leapfrog steps and step size, making it the default sampler in PyMC, Stan, and NumPyro.
Practical MCMC with PyMC
import pymc as pm
import arviz as az
import numpy as np
# Generate synthetic data: Bayesian linear regression
np.random.seed(42)
n = 100
X = np.random.normal(0, 1, n)
true_alpha, true_beta, true_sigma = 1.5, 2.5, 0.5
y = true_alpha + true_beta * X + np.random.normal(0, true_sigma, n)
# Build Bayesian model in PyMC
with pm.Model() as linear_model:
# Priors (weakly informative)
alpha = pm.Normal('alpha', mu=0, sigma=10)
beta = pm.Normal('beta', mu=0, sigma=10)
sigma = pm.HalfNormal('sigma', sigma=1)
# Likelihood
mu = alpha + beta * X
y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
# Sample using NUTS (default)
trace = pm.sample(
draws=2000, # samples per chain after warmup
tune=1000, # warmup/adaptation steps
chains=4, # run 4 independent chains for diagnostics
return_inferencedata=True,
progressbar=True,
random_seed=42
)
# Summarize posterior
print(az.summary(trace, var_names=['alpha', 'beta', 'sigma']))
print(f"\nTrue values: alpha={true_alpha}, beta={true_beta}, sigma={true_sigma}")
# Plot posterior distributions
az.plot_posterior(trace, var_names=['alpha', 'beta', 'sigma'])
# Plot trace plots (MUST check for good mixing)
az.plot_trace(trace, var_names=['alpha', 'beta', 'sigma'])
MCMC Diagnostics: Detecting Convergence Problems
Running MCMC is not enough - you must verify the chain has converged to the true posterior. Several diagnostics help:
1. The R-hat (Gelman-Rubin) Statistic
Run multiple chains from different starting points. If they've converged to the same distribution, their variance should be the same as the within-chain variance. R-hat measures this:
where is the estimated total variance and is the within-chain variance.
- : chains have converged (aim for with NUTS)
- : chains haven't converged - run longer or fix model
2. Effective Sample Size (ESS)
MCMC samples are correlated. The ESS measures the equivalent number of independent samples:
where is the autocorrelation at lag . For reliable estimates, aim for per chain, ideally total.
3. Trace Plots
Visually inspect the sample sequence for each parameter. Good traces look like "fuzzy caterpillars" - stationary, well-mixed, with no obvious trends or stuck periods.
import arviz as az
import numpy as np
def check_mcmc_diagnostics(trace):
"""Check MCMC convergence diagnostics."""
summary = az.summary(trace)
print("=== MCMC Convergence Diagnostics ===\n")
print(f"{'Parameter':<15} {'R-hat':<10} {'ESS bulk':<12} {'ESS tail':<12} {'Status':<10}")
print("-" * 60)
for param in summary.index:
rhat = summary.loc[param, 'r_hat']
ess_bulk = summary.loc[param, 'ess_bulk']
ess_tail = summary.loc[param, 'ess_tail']
# Check for convergence issues
issues = []
if rhat > 1.01:
issues.append(f"R-hat={rhat:.3f}>1.01")
if ess_bulk < 400:
issues.append(f"ESS={ess_bulk:.0f}<400")
status = "WARN: " + ", ".join(issues) if issues else "OK"
print(f"{param:<15} {rhat:<10.4f} {ess_bulk:<12.1f} {ess_tail:<12.1f} {status:<10}")
# Check for divergences (HMC-specific: indicates model pathologies)
divergences = trace.sample_stats.diverging.values.sum()
if divergences > 0:
print(f"\nWARNING: {divergences} divergent transitions detected!")
print("Consider reparameterizing the model or using more informative priors.")
else:
print(f"\nNo divergences detected.")
# Run with the PyMC trace from above
# check_mcmc_diagnostics(trace)
# Common diagnostic patterns:
print("""
Common MCMC Problems and Solutions:
1. High R-hat (> 1.01):
- Run more chains
- Run longer (more draws, more tune steps)
- Model may have identifiability issues
- Consider reparameterization
2. Low ESS (< 100):
- Increase draws
- Increase target_accept (e.g., target_accept=0.9)
- The model may have correlated parameters
3. Divergences (HMC-specific):
- Model pathology: parameters near 0 with half-normal priors
- Solution: non-centered parameterization
- Solution: more informative priors to constrain parameter space
4. Bimodal trace plots:
- Chain is stuck and bouncing between modes
- Model may be non-identifiable
- Try different initializations or more warmup steps
""")
4. Divergences (HMC/NUTS Only)
A divergence occurs when the leapfrog integrator goes off track - numerically unstable. Divergences indicate the posterior has sharp curvature in some region that the sampler can't navigate. This almost always indicates a model problem:
# Non-centered parameterization: a crucial trick for hierarchical models
# PROBLEMATIC: centered parameterization causes funnel geometry
with pm.Model() as centered_model:
mu_global = pm.Normal('mu_global', 0, 1)
sigma = pm.HalfNormal('sigma', 1)
theta = pm.Normal('theta', mu_global, sigma, shape=10)
# BETTER: non-centered parameterization avoids the funnel
with pm.Model() as noncentered_model:
mu_global = pm.Normal('mu_global', 0, 1)
sigma = pm.HalfNormal('sigma', 1)
# Reparameterize: theta = mu_global + sigma * theta_raw
theta_raw = pm.Normal('theta_raw', 0, 1, shape=10)
theta = pm.Deterministic('theta', mu_global + sigma * theta_raw)
# Now theta_raw and sigma are nearly independent -> no funnel
MCMC in Production: When to Use It
MCMC is powerful but expensive. Use it when:
| Use Case | Verdict |
|---|---|
| Model has < 1000 parameters | MCMC feasible |
| Need exact posterior quantiles (not just mean) | MCMC preferred |
| Hierarchical model with complex dependencies | MCMC (often with non-centered parameterization) |
| Scientific publication requiring credible intervals | MCMC |
| Neural network with millions of parameters | Use variational inference instead |
| Real-time inference | MCMC too slow - use VI or approximate methods |
| Exploratory Bayesian analysis | MCMC in PyMC/Stan |
Interview Questions
Q1: Explain why the Metropolis-Hastings acceptance criterion guarantees sampling from the correct distribution.
The acceptance criterion is designed to satisfy the detailed balance condition: . Here, where . To verify detailed balance, consider two cases: If , then and . Then both sides of detailed balance equal . The other case is symmetric. Detailed balance implies is a stationary distribution of the Markov chain, and irreducibility + aperiodicity ensure it's the unique stationary distribution.
Q2: What is the difference between warm-up/burn-in samples and main samples in MCMC?
During warm-up (burn-in), the chain starts at an arbitrary initial point and hasn't yet reached the stationary distribution. These initial samples are not representative of the posterior and must be discarded. In PyMC's NUTS, the warm-up period also serves for adaptation: the sampler automatically tunes the step size and mass matrix (the covariance of the momentum distribution in HMC) to achieve an optimal acceptance rate (target 0.8 for NUTS). This adaptation is crucial for efficiency. After warm-up, adaptation stops and the tuned parameters are fixed for the main sampling phase. Typically, 500-1000 warm-up steps suffice for simple models; complex hierarchical models may need 2000-4000.
Q3: Why does Gibbs sampling mix slowly when parameters are highly correlated?
In Gibbs sampling, each step updates one parameter conditioned on all others at their current values. When two parameters are highly correlated (say, correlation 0.99), the conditional is a very narrow distribution concentrated around . So each Gibbs step can only move a tiny amount in . To move across the full posterior range, the chain needs many steps - the mixing time scales as where is the correlation. HMC avoids this by simulating trajectories that can traverse the entire posterior in one proposal - it naturally accounts for correlations through the mass matrix (a preconditioning matrix). This is why HMC/NUTS is so much more efficient than Gibbs for correlated posteriors.
Q4: What are divergences in HMC, and what do they indicate about a model?
Divergences occur when the leapfrog integrator in HMC makes a numerical error - the total energy (Hamiltonian) changes too much during a trajectory. This happens when the posterior has regions of extremely high curvature, like Neal's funnel geometry in poorly parameterized hierarchical models. The funnel occurs when a scale parameter can approach zero: the posterior becomes very concentrated in that region, and the leapfrog integrator can't handle the rapid curvature change. Divergences indicate the model has pathological geometry. Solutions: (1) non-centered parameterization (reparameterize so the geometry is more uniform); (2) more informative priors on scale parameters to prevent them from approaching zero; (3) increase target_accept to force smaller step sizes. Crucially, divergences indicate model problems, not just computational problems - they suggest your model may not be well-specified.
Q5: How would you use MCMC to estimate predictive uncertainty for a classification model?
Use Bayesian logistic regression with MCMC. First, define the model: weights , likelihood . Run MCMC to get posterior samples . For a new input , compute the posterior predictive: . This gives a probability that accounts for parameter uncertainty. The variance across the predictions quantifies epistemic uncertainty - it's high near decision boundaries and in regions with little training data. Unlike point-estimate logistic regression, this tells you not just the predicted probability but also how confident you should be in that probability.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the MCMC Explorer demo on the EngineersOfAI Playground - no code required.
:::
