Markov Chain Monte Carlo

The Problem: Most Posteriors Have No Closed Form

You want to perform Bayesian inference on a Bayesian neural network. The posterior is:

$P(\mathbf{W} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mathbf{W}) \cdot P(\mathbf{W})$

where $\mathbf{W}$ has millions of parameters. To compute any quantity of interest - the posterior mean, credible intervals, predictive distributions - you need to integrate over this high-dimensional posterior:

$E[f(\mathbf{W}) \mid \mathcal{D}] = \int f(\mathbf{W}) P(\mathbf{W} \mid \mathcal{D}) d\mathbf{W}$

This integral is intractable. The posterior is a complex, potentially multimodal distribution in millions of dimensions. You can't compute it analytically, and numerical quadrature fails catastrophically in high dimensions.

MCMC is the solution: instead of computing the integral analytically, draw samples from the posterior. Then approximate any integral by a sample average:

$E[f(\mathbf{W}) \mid \mathcal{D}] \approx \frac{1}{N} \sum_{t=1}^{N} f(\mathbf{W}^{(t)})$

where $\mathbf{W}^{(1)}, \ldots, \mathbf{W}^{(N)}$ are samples from $P(\mathbf{W} \mid \mathcal{D})$ .

The genius of MCMC: you can draw samples from a distribution you can only evaluate up to a normalizing constant. You never need to compute $P(\mathcal{D})$ .

Why Markov Chains Work

A Markov chain is a sequence of random variables $\theta^{(1)}, \theta^{(2)}, \ldots$ where each state depends only on the previous state:

$P(\theta^{(t+1)} \mid \theta^{(t)}, \theta^{(t-1)}, \ldots) = P(\theta^{(t+1)} \mid \theta^{(t)})$

The transition kernel $T(\theta' \mid \theta)$ defines the probability of moving from state $\theta$ to state $\theta'$ .

The key theorem: If the Markov chain satisfies:

Irreducibility: Can reach any state from any state
Aperiodicity: Doesn't cycle with a fixed period
Detailed balance: $\pi(\theta) T(\theta' \mid \theta) = \pi(\theta') T(\theta \mid \theta')$

Then the chain has a unique stationary distribution $\pi(\theta)$ , and in the limit, the chain samples from $\pi(\theta)$ .

The MCMC trick: design a transition kernel $T$ that satisfies detailed balance with respect to the target posterior $\pi(\theta) = P(\theta \mid \mathcal{D})$ .

Metropolis-Hastings Algorithm

The most fundamental MCMC algorithm. Given current state $\theta^{(t)}$ :

Step 1 - Propose: sample a candidate from a proposal distribution $\theta^* \sim q(\theta^* \mid \theta^{(t)})$

Step 2 - Accept/Reject: compute the acceptance ratio:

$\alpha = \min\left(1, \frac{\pi(\theta^*) \cdot q(\theta^{(t)} \mid \theta^*)}{\pi(\theta^{(t)}) \cdot q(\theta^* \mid \theta^{(t)})}\right)$

For symmetric proposals ( $q(\theta^* \mid \theta) = q(\theta \mid \theta^*)$ , e.g., Gaussian random walk), this simplifies to:

$\alpha = \min\left(1, \frac{\pi(\theta^*)}{\pi(\theta^{(t)})}\right) = \min\left(1, \frac{P(\mathcal{D} \mid \theta^*) P(\theta^*)}{P(\mathcal{D} \mid \theta^{(t)}) P(\theta^{(t)})}\right)$

Step 3 - Update: set $\theta^{(t+1)} = \theta^*$ with probability $\alpha$ , otherwise $\theta^{(t+1)} = \theta^{(t)}$ .

Why this works: The detailed balance condition is satisfied by construction. If $\theta^*$ has higher posterior probability than $\theta^{(t)}$ , we always move there. If it has lower posterior probability, we move with probability proportional to the ratio - allowing the chain to explore but spending more time in high-probability regions.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def metropolis_hastings(log_posterior, theta_init, proposal_std, n_samples, burnin=1000):
    """
    Metropolis-Hastings with Gaussian random walk proposal.

    Args:
        log_posterior: function that computes log p(theta | data)
        theta_init: starting point
        proposal_std: standard deviation of Gaussian proposal
        n_samples: number of samples to draw after burnin
        burnin: number of initial samples to discard

    Returns:
        samples: array of shape (n_samples, d)
        acceptance_rate: fraction of proposals accepted
    """
    d = len(np.atleast_1d(theta_init))
    theta = np.atleast_1d(theta_init).copy().astype(float)

    samples = []
    n_accepted = 0
    total_steps = n_samples + burnin

    for i in range(total_steps):
        # Propose: symmetric Gaussian random walk
        theta_star = theta + np.random.normal(0, proposal_std, size=d)

        # Acceptance ratio (log space for numerical stability)
        log_alpha = log_posterior(theta_star) - log_posterior(theta)
        alpha = min(1.0, np.exp(log_alpha))

        # Accept or reject
        if np.random.uniform() < alpha:
            theta = theta_star.copy()
            n_accepted += 1

        # Collect samples after burnin
        if i >= burnin:
            samples.append(theta.copy())

    acceptance_rate = n_accepted / total_steps
    return np.array(samples), acceptance_rate


# Example: posterior for a coin flip
# Prior: Beta(2, 2) -- weakly centered at 0.5
# Data: 12 heads, 8 tails (n=20)
n_heads, n_tails = 12, 8

def log_posterior_coin(theta_arr):
    theta = theta_arr[0]
    if not (0 < theta < 1):
        return -np.inf
    log_likelihood = n_heads * np.log(theta) + n_tails * np.log(1 - theta)
    log_prior = stats.beta(2, 2).logpdf(theta)
    return log_likelihood + log_prior

# True posterior: Beta(2+12, 2+8) = Beta(14, 10)
true_posterior = stats.beta(14, 10)

# Run MH sampler
np.random.seed(42)
samples, acc_rate = metropolis_hastings(
    log_posterior=log_posterior_coin,
    theta_init=np.array([0.5]),
    proposal_std=0.1,
    n_samples=5000,
    burnin=1000
)
samples = samples[:, 0]  # Extract scalar theta

print(f"Acceptance rate: {acc_rate:.3f} (ideal: 0.23-0.44 for 1D)")
print(f"Sample mean: {samples.mean():.4f} (true: {true_posterior.mean():.4f})")
print(f"Sample std: {samples.std():.4f} (true: {true_posterior.std():.4f})")
print(f"Sample 95% CI: ({np.percentile(samples, 2.5):.4f}, {np.percentile(samples, 97.5):.4f})")
print(f"True 95% CI:   ({true_posterior.ppf(0.025):.4f}, {true_posterior.ppf(0.975):.4f})")

:::tip The 0.234 Optimal Acceptance Rate For high-dimensional problems, the theoretically optimal acceptance rate for a Gaussian random walk MH sampler is approximately 0.234 (23.4%). In 1D, it's about 0.44. If your acceptance rate is too high (> 0.5), your proposals are too small - the chain mixes slowly. If too low (< 0.1), proposals are too large - almost all proposals are rejected. Tune the proposal variance to hit these targets during a warm-up phase. :::

Gibbs Sampling

Gibbs sampling is a special case of MH where acceptance is always 1 - by sampling from full conditional distributions.

The algorithm: For a parameter vector $\theta = (\theta_1, \ldots, \theta_d)$ , cycle through each dimension, sampling from its conditional given all others:

$\theta_j^{(t+1)} \sim P(\theta_j \mid \theta_1^{(t+1)}, \ldots, \theta_{j-1}^{(t+1)}, \theta_{j+1}^{(t)}, \ldots, \theta_d^{(t)})$

When to use Gibbs: When each full conditional is a known distribution you can sample from (often happens with conjugate priors in hierarchical models).

The key advantage: No proposal tuning needed. Acceptance rate is always 1. Works beautifully in graphical models where conditionals factor nicely.

def gibbs_sampler_gaussian(mu_true, Sigma_true, n_samples=5000, burnin=500):
    """
    Gibbs sampler for a 2D Gaussian posterior.
    Full conditionals of a bivariate Gaussian are univariate Gaussians.
    """
    rho = Sigma_true[0, 1] / np.sqrt(Sigma_true[0, 0] * Sigma_true[1, 1])
    sig1, sig2 = np.sqrt(Sigma_true[0, 0]), np.sqrt(Sigma_true[1, 1])

    samples = np.zeros((n_samples + burnin, 2))
    theta = np.zeros(2)  # Initialize at origin

    for t in range(1, n_samples + burnin):
        # Sample theta_1 | theta_2 (Gaussian full conditional)
        mu1_given2 = mu_true[0] + rho * (sig1/sig2) * (theta[1] - mu_true[1])
        sig1_given2 = sig1 * np.sqrt(1 - rho**2)
        theta[0] = np.random.normal(mu1_given2, sig1_given2)

        # Sample theta_2 | theta_1 (Gaussian full conditional)
        mu2_given1 = mu_true[1] + rho * (sig2/sig1) * (theta[0] - mu_true[0])
        sig2_given1 = sig2 * np.sqrt(1 - rho**2)
        theta[1] = np.random.normal(mu2_given1, sig2_given1)

        samples[t] = theta.copy()

    return samples[burnin:]

# Example: sample from a bivariate Gaussian with correlation
mu_true = np.array([1.0, -0.5])
Sigma_true = np.array([[1.0, 0.8], [0.8, 1.0]])

np.random.seed(42)
samples = gibbs_sampler_gaussian(mu_true, Sigma_true, n_samples=5000)

print(f"Sample mean: {samples.mean(axis=0)}")
print(f"True mean: {mu_true}")
print(f"Sample cov diagonal: {np.var(samples, axis=0)}")
print(f"True cov diagonal: {np.diag(Sigma_true)}")
print(f"Sample correlation: {np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]:.4f}")
print(f"True correlation: 0.8")

Weakness of Gibbs: When variables are highly correlated, Gibbs sampling mixes very slowly - the chain explores only small regions at each step. High-correlation structures require many steps to traverse the full posterior.

Modern MCMC: NUTS and HMC

Modern probabilistic programming libraries use Hamiltonian Monte Carlo (HMC) and No-U-Turn Sampler (NUTS), which are far more efficient than random-walk MH in high dimensions.

The key idea: Instead of a random walk, use gradient information (like an optimization step) to make informed proposals. HMC adds momentum variables $\mathbf{p}$ and simulates Hamiltonian dynamics:

$H(\theta, \mathbf{p}) = -\log P(\theta \mid \mathcal{D}) + \frac{1}{2}\mathbf{p}^\top M^{-1} \mathbf{p}$

The "potential energy" is the negative log posterior. The "kinetic energy" is a Gaussian term. Simulating Hamilton's equations using leapfrog integration generates proposals that move along iso-probability contours - very high acceptance rates.

NUTS automatically tunes the number of leapfrog steps and step size, making it the default sampler in PyMC, Stan, and NumPyro.

Practical MCMC with PyMC

import pymc as pm
import arviz as az
import numpy as np

# Generate synthetic data: Bayesian linear regression
np.random.seed(42)
n = 100
X = np.random.normal(0, 1, n)
true_alpha, true_beta, true_sigma = 1.5, 2.5, 0.5
y = true_alpha + true_beta * X + np.random.normal(0, true_sigma, n)

# Build Bayesian model in PyMC
with pm.Model() as linear_model:
    # Priors (weakly informative)
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # Likelihood
    mu = alpha + beta * X
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)

    # Sample using NUTS (default)
    trace = pm.sample(
        draws=2000,        # samples per chain after warmup
        tune=1000,         # warmup/adaptation steps
        chains=4,          # run 4 independent chains for diagnostics
        return_inferencedata=True,
        progressbar=True,
        random_seed=42
    )

# Summarize posterior
print(az.summary(trace, var_names=['alpha', 'beta', 'sigma']))
print(f"\nTrue values: alpha={true_alpha}, beta={true_beta}, sigma={true_sigma}")

# Plot posterior distributions
az.plot_posterior(trace, var_names=['alpha', 'beta', 'sigma'])

# Plot trace plots (MUST check for good mixing)
az.plot_trace(trace, var_names=['alpha', 'beta', 'sigma'])

MCMC Diagnostics: Detecting Convergence Problems

Running MCMC is not enough - you must verify the chain has converged to the true posterior. Several diagnostics help:

1. The R-hat (Gelman-Rubin) Statistic

Run multiple chains from different starting points. If they've converged to the same distribution, their variance should be the same as the within-chain variance. R-hat measures this:

$\hat{R} = \sqrt{\frac{\hat{V}}{W}}$

where $\hat{V}$ is the estimated total variance and $W$ is the within-chain variance.

$\hat{R} \approx 1.0$ : chains have converged (aim for $< 1.01$ with NUTS)
$\hat{R} > 1.01$ : chains haven't converged - run longer or fix model

2. Effective Sample Size (ESS)

MCMC samples are correlated. The ESS measures the equivalent number of independent samples:

$N_{eff} = \frac{N}{1 + 2\sum_{k=1}^\infty \rho_k}$

where $\rho_k$ is the autocorrelation at lag $k$ . For reliable estimates, aim for $N_{eff} > 100$ per chain, ideally $> 400$ total.

3. Trace Plots

Visually inspect the sample sequence for each parameter. Good traces look like "fuzzy caterpillars" - stationary, well-mixed, with no obvious trends or stuck periods.

import arviz as az
import numpy as np

def check_mcmc_diagnostics(trace):
    """Check MCMC convergence diagnostics."""
    summary = az.summary(trace)

    print("=== MCMC Convergence Diagnostics ===\n")
    print(f"{'Parameter':<15} {'R-hat':<10} {'ESS bulk':<12} {'ESS tail':<12} {'Status':<10}")
    print("-" * 60)

    for param in summary.index:
        rhat = summary.loc[param, 'r_hat']
        ess_bulk = summary.loc[param, 'ess_bulk']
        ess_tail = summary.loc[param, 'ess_tail']

        # Check for convergence issues
        issues = []
        if rhat > 1.01:
            issues.append(f"R-hat={rhat:.3f}>1.01")
        if ess_bulk < 400:
            issues.append(f"ESS={ess_bulk:.0f}<400")

        status = "WARN: " + ", ".join(issues) if issues else "OK"
        print(f"{param:<15} {rhat:<10.4f} {ess_bulk:<12.1f} {ess_tail:<12.1f} {status:<10}")

    # Check for divergences (HMC-specific: indicates model pathologies)
    divergences = trace.sample_stats.diverging.values.sum()
    if divergences > 0:
        print(f"\nWARNING: {divergences} divergent transitions detected!")
        print("Consider reparameterizing the model or using more informative priors.")
    else:
        print(f"\nNo divergences detected.")

# Run with the PyMC trace from above
# check_mcmc_diagnostics(trace)

# Common diagnostic patterns:
print("""
Common MCMC Problems and Solutions:

1. High R-hat (> 1.01):
   - Run more chains
   - Run longer (more draws, more tune steps)
   - Model may have identifiability issues
   - Consider reparameterization

2. Low ESS (< 100):
   - Increase draws
   - Increase target_accept (e.g., target_accept=0.9)
   - The model may have correlated parameters

3. Divergences (HMC-specific):
   - Model pathology: parameters near 0 with half-normal priors
   - Solution: non-centered parameterization
   - Solution: more informative priors to constrain parameter space

4. Bimodal trace plots:
   - Chain is stuck and bouncing between modes
   - Model may be non-identifiable
   - Try different initializations or more warmup steps
""")

4. Divergences (HMC/NUTS Only)

A divergence occurs when the leapfrog integrator goes off track - numerically unstable. Divergences indicate the posterior has sharp curvature in some region that the sampler can't navigate. This almost always indicates a model problem:

# Non-centered parameterization: a crucial trick for hierarchical models
# PROBLEMATIC: centered parameterization causes funnel geometry
with pm.Model() as centered_model:
    mu_global = pm.Normal('mu_global', 0, 1)
    sigma = pm.HalfNormal('sigma', 1)
    theta = pm.Normal('theta', mu_global, sigma, shape=10)

# BETTER: non-centered parameterization avoids the funnel
with pm.Model() as noncentered_model:
    mu_global = pm.Normal('mu_global', 0, 1)
    sigma = pm.HalfNormal('sigma', 1)
    # Reparameterize: theta = mu_global + sigma * theta_raw
    theta_raw = pm.Normal('theta_raw', 0, 1, shape=10)
    theta = pm.Deterministic('theta', mu_global + sigma * theta_raw)
    # Now theta_raw and sigma are nearly independent -> no funnel

MCMC in Production: When to Use It

MCMC is powerful but expensive. Use it when:

Use Case	Verdict
Model has < 1000 parameters	MCMC feasible
Need exact posterior quantiles (not just mean)	MCMC preferred
Hierarchical model with complex dependencies	MCMC (often with non-centered parameterization)
Scientific publication requiring credible intervals	MCMC
Neural network with millions of parameters	Use variational inference instead
Real-time inference	MCMC too slow - use VI or approximate methods
Exploratory Bayesian analysis	MCMC in PyMC/Stan

Interview Questions

Q1: Explain why the Metropolis-Hastings acceptance criterion guarantees sampling from the correct distribution.

The acceptance criterion is designed to satisfy the detailed balance condition: $\pi(\theta)T(\theta'|\theta) = \pi(\theta')T(\theta|\theta')$ . Here, $T(\theta'|\theta) = q(\theta'|\theta)\alpha(\theta,\theta')$ where $\alpha = \min(1, \frac{\pi(\theta')q(\theta|\theta')}{\pi(\theta)q(\theta'|\theta)})$ . To verify detailed balance, consider two cases: If $\pi(\theta')q(\theta|\theta') > \pi(\theta)q(\theta'|\theta)$ , then $\alpha(\theta\to\theta') = 1$ and $\alpha(\theta'\to\theta) = \frac{\pi(\theta)q(\theta'|\theta)}{\pi(\theta')q(\theta|\theta')}$ . Then both sides of detailed balance equal $\pi(\theta)q(\theta'|\theta)$ . The other case is symmetric. Detailed balance implies $\pi$ is a stationary distribution of the Markov chain, and irreducibility + aperiodicity ensure it's the unique stationary distribution.

Q2: What is the difference between warm-up/burn-in samples and main samples in MCMC?

During warm-up (burn-in), the chain starts at an arbitrary initial point and hasn't yet reached the stationary distribution. These initial samples are not representative of the posterior and must be discarded. In PyMC's NUTS, the warm-up period also serves for adaptation: the sampler automatically tunes the step size and mass matrix (the covariance of the momentum distribution in HMC) to achieve an optimal acceptance rate (target 0.8 for NUTS). This adaptation is crucial for efficiency. After warm-up, adaptation stops and the tuned parameters are fixed for the main sampling phase. Typically, 500-1000 warm-up steps suffice for simple models; complex hierarchical models may need 2000-4000.

Q3: Why does Gibbs sampling mix slowly when parameters are highly correlated?

In Gibbs sampling, each step updates one parameter conditioned on all others at their current values. When two parameters $\theta_1, \theta_2$ are highly correlated (say, correlation 0.99), the conditional $P(\theta_1|\theta_2 = c)$ is a very narrow distribution concentrated around $0.99c$ . So each Gibbs step can only move a tiny amount in $\theta_1$ . To move across the full posterior range, the chain needs many steps - the mixing time scales as $O(1/(1-\rho^2))$ where $\rho$ is the correlation. HMC avoids this by simulating trajectories that can traverse the entire posterior in one proposal - it naturally accounts for correlations through the mass matrix (a preconditioning matrix). This is why HMC/NUTS is so much more efficient than Gibbs for correlated posteriors.

Q4: What are divergences in HMC, and what do they indicate about a model?

Divergences occur when the leapfrog integrator in HMC makes a numerical error - the total energy (Hamiltonian) changes too much during a trajectory. This happens when the posterior has regions of extremely high curvature, like Neal's funnel geometry in poorly parameterized hierarchical models. The funnel occurs when a scale parameter $\sigma$ can approach zero: the posterior becomes very concentrated in that region, and the leapfrog integrator can't handle the rapid curvature change. Divergences indicate the model has pathological geometry. Solutions: (1) non-centered parameterization (reparameterize so the geometry is more uniform); (2) more informative priors on scale parameters to prevent them from approaching zero; (3) increase target_accept to force smaller step sizes. Crucially, divergences indicate model problems, not just computational problems - they suggest your model may not be well-specified.

Q5: How would you use MCMC to estimate predictive uncertainty for a classification model?

Use Bayesian logistic regression with MCMC. First, define the model: weights $\mathbf{w} \sim \mathcal{N}(0, I)$ , likelihood $P(y_i = 1 | x_i, \mathbf{w}) = \sigma(\mathbf{w}^\top x_i)$ . Run MCMC to get $N$ posterior samples $\{\mathbf{w}^{(t)}\}_{t=1}^N$ . For a new input $x^*$ , compute the posterior predictive: $P(y^* = 1 | x^*, \mathcal{D}) \approx \frac{1}{N}\sum_t \sigma((\mathbf{w}^{(t)})^\top x^*)$ . This gives a probability that accounts for parameter uncertainty. The variance across the $N$ predictions $\text{Var}[\sigma(\mathbf{w}^\top x^*)]$ quantifies epistemic uncertainty - it's high near decision boundaries and in regions with little training data. Unlike point-estimate logistic regression, this tells you not just the predicted probability but also how confident you should be in that probability.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the MCMC Explorer demo on the EngineersOfAI Playground - no code required.

:::

The Problem: Most Posteriors Have No Closed Form​

Why Markov Chains Work​

Metropolis-Hastings Algorithm​

Gibbs Sampling​

Modern MCMC: NUTS and HMC​

Practical MCMC with PyMC​

MCMC Diagnostics: Detecting Convergence Problems​

1. The R-hat (Gelman-Rubin) Statistic​

2. Effective Sample Size (ESS)​

3. Trace Plots​

4. Divergences (HMC/NUTS Only)​

MCMC in Production: When to Use It​

Interview Questions​