Sampling Methods
Reading time: ~50 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist
The ML Scenario That Motivates This Lesson
You're training a Bayesian neural network. After updating the model with data, the posterior distribution over weights is a high-dimensional, non-Gaussian distribution - it has no closed form. You need to:
- Sample from this distribution to make predictions (average over posterior)
- Estimate integrals over this distribution (expected loss, predictive variance)
- Explore the distribution to understand model uncertainty
None of these are possible analytically. You need sampling algorithms.
Sampling methods appear throughout modern ML:
- Dropout: stochastic approximation to Bayesian neural network inference
- Data augmentation: sampling transformed versions of training data
- Language model generation: sampling from
- Diffusion models: iterative denoising is a sampling process from a learned distribution
- RLHF: sampling model outputs for human preference labeling
- VAE: sampling from the latent space prior
This lesson covers the algorithmic toolkit for sampling from arbitrary probability distributions.
1. Why Sampling Is Hard
Sampling from a distribution seems simple: generate random numbers that follow . But in ML, we often face:
- High dimensionality: over (image pixel space)
- Intractable normalization: but is unknown
- Complex shape: multi-modal, non-Gaussian distributions
- Implicit distributions: defined via a neural network (GAN, diffusion)
Different sampling algorithms handle different subsets of these challenges:
SAMPLING ALGORITHM SELECTION GUIDE:
Can you evaluate F^{-1}(u) efficiently? ──► YES ──► Inverse CDF Method
│
NO
│
▼
Do you know a proposal distribution q(x)? ──► YES ──► Rejection Sampling
│ or Importance Sampling
NO
│
▼
Can you evaluate p(x) up to a constant? ──► YES ──► MCMC (Metropolis-Hastings,
│ Gibbs Sampling, HMC)
NO
│
▼
Can you sample from a neural network? ──► YES ──► Direct (GAN, Diffusion, Flow)
2. Inverse CDF (Inverse Transform) Method
The most elegant sampling algorithm for 1D distributions.
The Key Theorem
If and is a CDF, then:
Proof: . The last step uses .
Algorithm
1. Sample u ~ Uniform(0, 1)
2. Return x = F^{-1}(u)
Example: Exponential Distribution
For :
- CDF:
- Inverse:
Since if , we can simplify to .
import numpy as np
from scipy.stats import exponential, ks_2samp
np.random.seed(42)
n_samples = 100_000
lam = 2.0
# --- Inverse CDF sampling ---
def sample_exponential_icdf(lam, n):
"""Sample from Exponential(lam) using inverse CDF."""
u = np.random.uniform(0, 1, n)
return -np.log(1 - u) / lam # equivalently: -np.log(u) / lam
samples_icdf = sample_exponential_icdf(lam, n_samples)
samples_scipy = np.random.exponential(scale=1/lam, size=n_samples)
print(f"Exponential({lam}) via Inverse CDF:")
print(f" Empirical mean: {samples_icdf.mean():.5f} (true: {1/lam:.5f})")
print(f" Empirical std: {samples_icdf.std():.5f} (true: {1/lam:.5f})")
# KS test: are the samples from the right distribution?
ks_stat, p_val = ks_2samp(samples_icdf, samples_scipy)
print(f" KS test (vs scipy): stat={ks_stat:.5f}, p={p_val:.4f} (p>>0.05 = good)")
# Discrete inverse CDF: sampling from a custom discrete distribution
def sample_discrete_icdf(probs, n_samples):
"""
Sample from a discrete distribution using the inverse CDF.
Equivalent to np.random.choice but shows the mechanism.
"""
cdf = np.cumsum(probs)
u = np.random.uniform(0, 1, n_samples)
# Find the smallest index k such that CDF(k) >= u
samples = np.searchsorted(cdf, u)
return samples
# Custom discrete distribution: P(0)=0.1, P(1)=0.4, P(2)=0.3, P(3)=0.2
probs = np.array([0.1, 0.4, 0.3, 0.2])
samples = sample_discrete_icdf(probs, n_samples=100_000)
print("\nDiscrete Inverse CDF Sampling:")
print(f"{'Value':>6} | {'Empirical Freq':>15} | {'True Prob':>10}")
for k, p in enumerate(probs):
print(f"{k:>6} | {(samples==k).mean():>15.5f} | {p:>10.4f}")
Limitation
Inverse CDF requires knowing analytically. For many distributions (multivariate Gaussian, posterior distributions in Bayesian models), this is not available.
3. Rejection Sampling
Sample from a proposal distribution that is easy to sample from, and reject samples that are unlikely under the target .
Algorithm
Given target and proposal with constant such that for all :
REPEAT:
1. Sample x ~ q(x)
2. Sample u ~ Uniform(0, 1)
3. IF u <= p(x) / (M * q(x)):
ACCEPT x (return x)
ELSE:
REJECT x (try again)
Acceptance probability:
The tighter the proposal envelope ( close to 1), the fewer rejections.
Rejection Sampling Visualization:
M·q(x) ───────────────────────────────── (proposal envelope)
│ │
p(x) │ ╭───────╮ │ (target density)
│ ╭─╯ ╰─╮ │
│ ╭─╯ ╰───╮ │
└──────────────────────────── x
Accept region (between p(x) and M·q(x)): probability p(x)/(M·q(x))
Reject region (above p(x)): probability 1 - p(x)/(M·q(x))
import numpy as np
from scipy.stats import norm, beta
def rejection_sampling(target_pdf, proposal_pdf, proposal_sampler, M, n_samples):
"""
Generic rejection sampler.
Args:
target_pdf: function x -> p(x)
proposal_pdf: function x -> q(x)
proposal_sampler: function n -> n samples from q
M: constant such that p(x) <= M * q(x) for all x
n_samples: desired number of accepted samples
Returns:
accepted_samples, n_total_attempts
"""
accepted = []
n_attempts = 0
while len(accepted) < n_samples:
# Step 1: sample from proposal
x = proposal_sampler(1)[0]
n_attempts += 1
# Step 2: compute acceptance ratio
accept_ratio = target_pdf(x) / (M * proposal_pdf(x))
# Step 3: accept or reject
if np.random.uniform() <= accept_ratio:
accepted.append(x)
return np.array(accepted), n_attempts
# Example: sample from Beta(3, 5) using Uniform(0,1) as proposal
from scipy.stats import beta as beta_dist, uniform
alpha, beta_param = 3.0, 5.0
target = beta_dist(alpha, beta_param)
# Max of Beta(3,5) pdf is at mode = (alpha-1)/(alpha+beta-2) = 2/6 ≈ 0.333
mode = (alpha - 1) / (alpha + beta_param - 2)
M = target.pdf(mode) # max density value
print(f"Rejection Sampling: Beta({alpha}, {beta_param}) from Uniform(0,1)")
print(f" M = {M:.4f} (proposal envelope)")
print(f" Theoretical acceptance rate = 1/M = {1/M:.4f}")
np.random.seed(42)
samples, n_attempts = rejection_sampling(
target_pdf=target.pdf,
proposal_pdf=lambda x: uniform.pdf(x),
proposal_sampler=lambda n: np.random.uniform(size=n),
M=M,
n_samples=10000
)
print(f" Actual acceptance rate: {10000/n_attempts:.4f}")
print(f" Sample mean: {samples.mean():.5f} (true: {target.mean():.5f})")
print(f" Sample std: {samples.std():.5f} (true: {target.std():.5f})")
Limitation
Rejection sampling is inefficient in high dimensions because the acceptance rate decreases exponentially: , and must cover the full distribution, which becomes exponentially hard as dimension grows (the "curse of dimensionality").
4. Importance Sampling
Rather than sampling from directly, use samples from but weight them to correct for the wrong distribution.
The Key Identity
The weights are called importance weights.
Algorithm
1. Sample x_1, ..., x_n ~ q(x)
2. Compute importance weights: w_i = p(x_i) / q(x_i)
3. Estimate: E_p[f(X)] ≈ (Σ_i w_i f(x_i)) / (Σ_i w_i)
The unnormalized estimator requires knowing exactly. The self-normalized estimator works even when is known only up to a constant.
import numpy as np
from scipy.stats import norm
np.random.seed(42)
n_samples = 100_000
# Estimate E_p[X^2] where p = N(3, 1) using q = N(0, 1) as proposal
def importance_sampling(f, target_logpdf, proposal_logpdf, proposal_sampler, n):
"""
Estimate E_p[f(X)] using importance sampling with proposal q.
Uses log-weights for numerical stability.
"""
samples = proposal_sampler(n)
# Log importance weights
log_weights = target_logpdf(samples) - proposal_logpdf(samples)
# Normalize using log-sum-exp trick
log_weights -= log_weights.max() # for numerical stability
weights = np.exp(log_weights)
weights /= weights.sum()
return np.sum(weights * f(samples)), weights
target = norm(loc=3.0, scale=1.0)
proposal = norm(loc=0.0, scale=1.0)
# Estimate E_p[X^2]
# True value: E[X^2] = Var(X) + E[X]^2 = 1 + 9 = 10
f = lambda x: x**2
est, weights = importance_sampling(
f=f,
target_logpdf=target.logpdf,
proposal_logpdf=proposal.logpdf,
proposal_sampler=lambda n: proposal.rvs(n),
n=n_samples
)
# Direct Monte Carlo from target (ground truth)
direct_mc = f(target.rvs(n_samples)).mean()
print(f"Importance Sampling: E_{{N(3,1)}}[X^2]")
print(f" True value: 10.0000")
print(f" Importance sampling: {est:.5f}")
print(f" Direct MC: {direct_mc:.5f}")
# Effective sample size (ESS) - measures quality of importance weights
ess = 1.0 / np.sum(weights**2)
print(f" Effective sample size (ESS): {ess:.0f} / {n_samples}")
print(f" ESS ratio: {ess/n_samples:.4f} (1.0 = perfect, <<1 = bad proposal)")
Effective Sample Size (ESS)
ESS measures how many iid samples from are equivalent to your weighted samples. If , weights are all similar, and . If is a poor proposal, a few samples dominate, and ESS .
:::warning When Importance Sampling Fails Importance sampling fails catastrophically when the proposal has lighter tails than the target . If for some in the support of , the importance weights have infinite variance. This happens when, e.g., estimating tail probabilities of a Gaussian using a Uniform proposal. In high dimensions, finding a good proposal is extremely difficult - which is why MCMC is preferred for high-dimensional posterior sampling. :::
5. Monte Carlo Integration
The most widely used application of sampling: estimating integrals that can't be solved analytically.
Basic Monte Carlo Estimator
More generally, for expectation under :
Error rate: By CLT, the error is - independent of dimensionality! This is the key advantage over deterministic numerical integration, which suffers the curse of dimensionality.
import numpy as np
from scipy.stats import norm
np.random.seed(42)
# Classic Monte Carlo: estimate π by sampling in a square
# π/4 = area of quarter circle / area of unit square
# = P(X^2 + Y^2 <= 1) where (X,Y) ~ Uniform([0,1]^2)
for n in [100, 1000, 10000, 100000, 1000000]:
x = np.random.uniform(0, 1, n)
y = np.random.uniform(0, 1, n)
inside = (x**2 + y**2 <= 1)
pi_estimate = 4 * inside.mean()
error = abs(pi_estimate - np.pi)
print(f"n={n:>8}: pi ≈ {pi_estimate:.6f}, error={error:.6f}")
# Monte Carlo integration: E_p[f(X)] for Gaussian p
# Compute E[cos(X)] where X ~ N(0,1)
# True value: real part of moment generating function at i = Re[e^{-1/2}]
n_samples = 100_000
samples = np.random.randn(n_samples)
# Monte Carlo estimate
mc_estimate = np.cos(samples).mean()
# True value (via integration formula)
true_value = np.exp(-0.5) # E[cos(X)] = Re[E[e^{iX}]] = Re[e^{-σ²/2}]
print(f"\nE_{{N(0,1)}}[cos(X)]:")
print(f" True value: {true_value:.6f}")
print(f" MC estimate: {mc_estimate:.6f}")
print(f" Error: {abs(mc_estimate - true_value):.6f}")
print(f" Expected error ~ 1/√n = {1/np.sqrt(n_samples):.6f}")
Variance Reduction Techniques
Monte Carlo variance can be high, especially for rare events. Common techniques:
| Technique | Idea | When to Use |
|---|---|---|
| Antithetic variates | Use and together | Symmetric integrands |
| Control variates | Subtract known-mean random variable | Known correlated function |
| Stratified sampling | Sample evenly from sub-regions | Non-uniform integrand |
| Importance sampling | Use better proposal | Rare event estimation |
6. Markov Chain Monte Carlo (MCMC)
MCMC generates samples from a complex distribution by constructing a Markov chain whose stationary distribution is .
The Core Idea
Instead of sampling independently (hard), take a random walk through -space:
x_1 → x_2 → x_3 → x_4 → ... → x_n
where transitions are designed so that the chain
"mixes" according to the target distribution p(x).
After a burn-in period (letting the chain forget its starting point), subsequent samples are approximately from .
Metropolis-Hastings Algorithm
Initialize x_0 arbitrarily
FOR t = 1, 2, ..., N:
1. Propose x* ~ q(x* | x_{t-1}) (proposal distribution)
2. Compute acceptance ratio:
alpha = min(1, [p(x*) * q(x_{t-1}|x*)] / [p(x_{t-1}) * q(x*|x_{t-1})])
3. Accept: x_t = x* with probability alpha
Reject: x_t = x_{t-1} with probability (1 - alpha)
For symmetric proposals (e.g., Gaussian random walk):
Crucially: we only need up to a normalization constant! The constant cancels in the ratio.
import numpy as np
from scipy.stats import norm, multivariate_normal
def metropolis_hastings(log_target, n_samples, x0, proposal_std=0.5, burn_in=1000):
"""
Metropolis-Hastings with Gaussian random walk proposal.
Args:
log_target: log p(x) (up to a constant)
n_samples: number of samples (after burn-in)
x0: initial state
proposal_std: std of Gaussian random walk proposal
burn_in: number of burn-in steps to discard
"""
x = np.array(x0, dtype=float)
dim = x.shape[0] if hasattr(x, 'shape') else 1
samples = []
n_accepted = 0
for i in range(n_samples + burn_in):
# Propose new state
x_star = x + proposal_std * np.random.randn(*x.shape)
# Acceptance ratio (in log space for numerical stability)
log_alpha = log_target(x_star) - log_target(x)
alpha = min(1.0, np.exp(log_alpha))
# Accept or reject
if np.random.uniform() < alpha:
x = x_star
if i >= burn_in:
n_accepted += 1
if i >= burn_in:
samples.append(x.copy())
acceptance_rate = n_accepted / n_samples
return np.array(samples), acceptance_rate
# Example 1: Sample from N(2, 0.5^2) using MH
target_mean, target_std = 2.0, 0.5
log_target_1d = lambda x: norm.logpdf(x, loc=target_mean, scale=target_std)
samples_1d, acc_rate = metropolis_hastings(
log_target=lambda x: log_target_1d(x[0]),
n_samples=10000,
x0=np.array([0.0]),
proposal_std=0.5
)
samples_1d = samples_1d.flatten()
print(f"MH Sampling from N({target_mean}, {target_std}^2):")
print(f" Acceptance rate: {acc_rate:.3f} (optimal ~0.234 in 1D)")
print(f" Sample mean: {samples_1d.mean():.5f} (true: {target_mean})")
print(f" Sample std: {samples_1d.std():.5f} (true: {target_std})")
# Example 2: Sample from a bimodal distribution (challenging for MCMC)
def log_bimodal(x):
"""Log density of 0.5*N(-2,1) + 0.5*N(2,1)."""
log_p1 = np.log(0.5) + norm.logpdf(x[0], -2, 1)
log_p2 = np.log(0.5) + norm.logpdf(x[0], 2, 1)
return np.logaddexp(log_p1, log_p2)
for proposal_std in [0.1, 1.0, 3.0]:
samples, acc = metropolis_hastings(
log_target=log_bimodal,
n_samples=10000,
x0=np.array([0.0]),
proposal_std=proposal_std
)
samples = samples.flatten()
print(f" Proposal std={proposal_std}: acc_rate={acc:.3f}, "
f"sample mean={samples.mean():.3f} (expected 0)")
MCMC Diagnostics
# Auto-correlation function to check MCMC mixing
def effective_sample_size(samples):
"""Estimate ESS using autocorrelation."""
n = len(samples)
# Compute autocorrelation up to lag 100
max_lag = min(100, n // 2)
mean = samples.mean()
var = samples.var()
if var < 1e-10:
return n
# Compute autocorrelations
rho = []
for lag in range(1, max_lag):
c = np.mean((samples[:n-lag] - mean) * (samples[lag:] - mean)) / var
rho.append(c)
if c < 0.05: # stop when autocorrelation drops below threshold
break
# ESS ≈ n / (1 + 2 * sum(autocorrelations))
ess = n / (1 + 2 * sum(rho))
return max(1, ess)
samples, _ = metropolis_hastings(
log_target=lambda x: norm.logpdf(x[0], 2, 0.5),
n_samples=5000,
x0=np.array([0.0]),
proposal_std=0.5
)
samples_flat = samples.flatten()
ess = effective_sample_size(samples_flat)
print(f"\nMCMC Effective Sample Size:")
print(f" Total samples: 5000")
print(f" ESS: {ess:.0f}")
print(f" ESS ratio: {ess/5000:.4f} (0.1 to 0.5 is typical)")
Gibbs Sampling
When the full joint is complex but the conditional distributions are tractable:
Initialize: x_1, ..., x_d
REPEAT:
FOR j = 1 to d:
Sample x_j ~ p(x_j | x_1, ..., x_{j-1}, x_{j+1}, ..., x_d)
Each variable is updated conditioned on all others. This is the basis of many Bayesian inference algorithms and was central to LDA (Latent Dirichlet Allocation) training.
7. Sampling in Deep Learning
Language Model Decoding
Token generation from language models is sampling from :
import numpy as np
def softmax(x, temperature=1.0):
z = x / temperature
e = np.exp(z - z.max())
return e / e.sum()
def greedy_decode(logits):
"""Argmax - deterministic, no sampling."""
return np.argmax(logits)
def temperature_sample(logits, temperature=1.0):
"""Sample from softmax with temperature."""
probs = softmax(logits, temperature)
return np.random.choice(len(probs), p=probs)
def top_k_sample(logits, k=50):
"""Sample from top-k distribution."""
top_k_idx = np.argpartition(logits, -k)[-k:]
top_k_logits = logits[top_k_idx]
probs = softmax(top_k_logits)
return top_k_idx[np.random.choice(k, p=probs)]
def top_p_sample(logits, p=0.9):
"""Nucleus (top-p) sampling: sample from smallest set with cumulative prob >= p."""
probs = softmax(logits)
sorted_idx = np.argsort(probs)[::-1]
sorted_probs = probs[sorted_idx]
cumsum = np.cumsum(sorted_probs)
# Find nucleus: smallest set covering p probability mass
nucleus = sorted_idx[cumsum <= p]
if len(nucleus) == 0:
nucleus = sorted_idx[:1]
nucleus_probs = probs[nucleus]
nucleus_probs /= nucleus_probs.sum()
return nucleus[np.random.choice(len(nucleus), p=nucleus_probs)]
# Example
np.random.seed(42)
vocab_size = 10
logits = np.random.randn(vocab_size)
print("Decoding strategies on logits:", logits.round(2))
print(f" Greedy: token {greedy_decode(logits)}")
print(f" Temp=1.0: token {temperature_sample(logits, 1.0)}")
print(f" Temp=0.5: token {temperature_sample(logits, 0.5)}")
print(f" Top-k (k=3): token {top_k_sample(logits, 3)}")
print(f" Top-p (p=0.9): token {top_p_sample(logits, 0.9)}")
Monte Carlo Dropout for Uncertainty Estimation
import numpy as np
def mc_dropout_inference(model_forward, x, n_passes=100, dropout_rate=0.1):
"""
Run n stochastic forward passes through the model
to estimate prediction mean and uncertainty.
"""
predictions = []
for _ in range(n_passes):
# Apply dropout mask manually (approximation)
pred = model_forward(x, dropout_rate=dropout_rate)
predictions.append(pred)
predictions = np.array(predictions) # (n_passes, n_classes)
mean_pred = predictions.mean(axis=0)
uncertainty = predictions.std(axis=0) # epistemic uncertainty
return mean_pred, uncertainty
# Simulate a simple model forward pass
def fake_model(x, dropout_rate=0.1):
"""Simulate a stochastic forward pass."""
np.random.seed(None) # ensure different random masks each call
base_logits = np.array([2.0, -1.0, 0.5])
# Dropout: randomly zero out some activations
mask = np.random.binomial(1, 1-dropout_rate, size=3).astype(float)
noisy_logits = base_logits * mask + 0.2 * np.random.randn(3)
e = np.exp(noisy_logits - noisy_logits.max())
return e / e.sum()
np.random.seed(42)
x = np.array([1.0, 2.0, 3.0])
mean_pred, uncertainty = mc_dropout_inference(fake_model, x, n_passes=100)
print(f"\nMC Dropout Inference (100 passes):")
print(f" Mean prediction: {mean_pred.round(4)}")
print(f" Uncertainty: {uncertainty.round(4)}")
print(f" Most uncertain class: {np.argmax(uncertainty)}")
8. Summary: Sampling Methods Reference
METHOD | REQUIRES | BEST FOR
────────────────────|─────────────────────────────|──────────────────────────
Inverse CDF | Closed-form F^{-1} | 1D distributions with
| | known CDF inverse
────────────────────|─────────────────────────────|──────────────────────────
Rejection Sampling | Proposal q(x) ≥ p(x)/M, | Low-dimensional targets
| easy to sample from q | when good proposal exists
────────────────────|─────────────────────────────|──────────────────────────
Importance Sampling | Proposal q(x) ≠ 0 wherever | Estimating expectations
| p(x) ≠ 0 | (not generating samples)
────────────────────|─────────────────────────────|──────────────────────────
Metropolis-Hastings | log p(x) up to constant | General-purpose high-dim
| (unnormalized target OK) | Bayesian inference
────────────────────|─────────────────────────────|──────────────────────────
Gibbs Sampling | Tractable conditionals | Bayesian networks,
| p(x_j | x_{-j}) | LDA, graphical models
────────────────────|─────────────────────────────|──────────────────────────
HMC / NUTS | Gradient of log p(x) | High-dim posteriors
| | (Stan, PyMC, NumPyro)
────────────────────|─────────────────────────────|──────────────────────────
Direct (Neural) | Trained generative model | Images, text, audio
| (GAN, VAE, Diffusion) | generation
9. Interview Q&A
Q1: Explain the inverse CDF sampling method. Why does it work?
A: The inverse CDF method generates samples from a target distribution by sampling and returning . It works because - the last equality holds because is uniform, so for . This is the probability integral transform. Practical applications: generating Exponential samples as , generating from any distribution whose quantile function can be computed (e.g., scipy's .ppf() method). It fails when is not available analytically, which happens for most multivariate distributions and complex posteriors in Bayesian models.
Q2: What is importance sampling and when does it fail?
A: Importance sampling estimates without sampling from directly. The identity with importance weights allows us to draw samples from a convenient proposal and reweight them. The self-normalized estimator even works when is only known up to a constant. It fails when the proposal has lighter tails than the target : if for some , the importance weights have infinite variance, and the estimator has no finite variance - it can produce wildly inaccurate results. In practice this manifests as a few samples with enormous weights dominating the estimate, which can be detected by computing the effective sample size (ESS) - low ESS indicates a poor proposal.
Q3: What is the Metropolis-Hastings algorithm and what problem does it solve?
A: MH solves the problem of sampling from a distribution known only up to a normalization constant: where is intractable. The algorithm: starting from current state , propose , then accept with probability ; otherwise stay at . The normalizing constant cancels in the ratio, so we only need . The Markov chain converges to as its stationary distribution - this is guaranteed by the detailed balance condition: . Key challenges: choosing a good proposal (too narrow → slow mixing; too wide → high rejection rate), detecting convergence (burn-in), and handling high-dimensional spaces.
Q4: How does dropout relate to Bayesian inference and sampling?
A: Standard dropout during training randomly zeros activations with probability , which Gal and Ghahramani (2016) showed is equivalent to approximate variational Bayesian inference over the weights - each dropout mask is a sample from an approximate posterior over weights. At test time, standard practice disables dropout and uses expected weights (scaling by ), which approximates crudely. Monte Carlo Dropout keeps dropout active at test time and runs stochastic forward passes. The mean prediction is a Monte Carlo approximation to the posterior predictive mean, and the variance across passes estimates epistemic uncertainty (uncertainty about model parameters). This converts any dropout-trained neural network into an approximate Bayesian neural network with no additional training cost - just forward passes at inference time.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Sampling Methods demo on the EngineersOfAI Playground - no code required.
:::
Q5: Why is MCMC necessary for high-dimensional Bayesian inference, and what are its practical limitations?
A: In Bayesian ML, we want to sample from the posterior over model parameters . For neural networks with millions of parameters, this posterior is a high-dimensional, non-Gaussian distribution with no closed form. Inverse CDF is unavailable, rejection sampling has exponentially small acceptance rates in high dimensions, and importance sampling produces near-zero ESS. MCMC handles this by constructing a Markov chain that explores the high-dimensional space ergodically, without needing a global normalization constant - only is needed. Practical limitations: (1) convergence: it is hard to know when the chain has mixed; (2) autocorrelation: successive samples are correlated, reducing effective sample size; (3) cost: for large neural networks, even one MCMC step requires a full gradient computation - Hamiltonian Monte Carlo (HMC) needs many gradient evaluations per step; (4) scalability: standard MCMC is impractical for models with millions of parameters, which is why variational inference (ELBO, normalizing flows) is often preferred despite being approximate.
