Skip to main content

Sampling Methods

Reading time: ~50 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist

The ML Scenario That Motivates This Lesson

You're training a Bayesian neural network. After updating the model with data, the posterior distribution over weights P(θD)P(\theta \mid \mathcal{D}) is a high-dimensional, non-Gaussian distribution - it has no closed form. You need to:

  1. Sample from this distribution to make predictions (average over posterior)
  2. Estimate integrals over this distribution (expected loss, predictive variance)
  3. Explore the distribution to understand model uncertainty

None of these are possible analytically. You need sampling algorithms.

Sampling methods appear throughout modern ML:

  • Dropout: stochastic approximation to Bayesian neural network inference
  • Data augmentation: sampling transformed versions of training data
  • Language model generation: sampling from P(wtw1,,wt1)P(w_t \mid w_1, \ldots, w_{t-1})
  • Diffusion models: iterative denoising is a sampling process from a learned distribution
  • RLHF: sampling model outputs for human preference labeling
  • VAE: sampling from the latent space prior

This lesson covers the algorithmic toolkit for sampling from arbitrary probability distributions.

1. Why Sampling Is Hard

Sampling from a distribution p(x)p(x) seems simple: generate random numbers that follow pp. But in ML, we often face:

  1. High dimensionality: p(x)p(\mathbf{x}) over R10000\mathbb{R}^{10000} (image pixel space)
  2. Intractable normalization: p(x)p~(x)p(x) \propto \tilde{p}(x) but p~(x)dx\int \tilde{p}(x) dx is unknown
  3. Complex shape: multi-modal, non-Gaussian distributions
  4. Implicit distributions: defined via a neural network (GAN, diffusion)

Different sampling algorithms handle different subsets of these challenges:

SAMPLING ALGORITHM SELECTION GUIDE:

Can you evaluate F^{-1}(u) efficiently? ──► YES ──► Inverse CDF Method

NO


Do you know a proposal distribution q(x)? ──► YES ──► Rejection Sampling
│ or Importance Sampling
NO


Can you evaluate p(x) up to a constant? ──► YES ──► MCMC (Metropolis-Hastings,
│ Gibbs Sampling, HMC)
NO


Can you sample from a neural network? ──► YES ──► Direct (GAN, Diffusion, Flow)

2. Inverse CDF (Inverse Transform) Method

The most elegant sampling algorithm for 1D distributions.

The Key Theorem

If UUniform(0,1)U \sim \text{Uniform}(0, 1) and FF is a CDF, then:

X=F1(U)FX = F^{-1}(U) \sim F

Proof: P(Xx)=P(F1(U)x)=P(UF(x))=F(x)P(X \leq x) = P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x). The last step uses UUniform(0,1)U \sim \text{Uniform}(0,1).

Algorithm

1. Sample u ~ Uniform(0, 1)
2. Return x = F^{-1}(u)

Example: Exponential Distribution

For XExponential(λ)X \sim \text{Exponential}(\lambda):

  • CDF: F(x)=1eλxF(x) = 1 - e^{-\lambda x}
  • Inverse: F1(u)=ln(1u)λF^{-1}(u) = -\frac{\ln(1-u)}{\lambda}

Since 1UUniform(0,1)1-U \sim \text{Uniform}(0,1) if UUniform(0,1)U \sim \text{Uniform}(0,1), we can simplify to X=lnUλX = -\frac{\ln U}{\lambda}.

import numpy as np
from scipy.stats import exponential, ks_2samp

np.random.seed(42)
n_samples = 100_000
lam = 2.0

# --- Inverse CDF sampling ---
def sample_exponential_icdf(lam, n):
"""Sample from Exponential(lam) using inverse CDF."""
u = np.random.uniform(0, 1, n)
return -np.log(1 - u) / lam # equivalently: -np.log(u) / lam

samples_icdf = sample_exponential_icdf(lam, n_samples)
samples_scipy = np.random.exponential(scale=1/lam, size=n_samples)

print(f"Exponential({lam}) via Inverse CDF:")
print(f" Empirical mean: {samples_icdf.mean():.5f} (true: {1/lam:.5f})")
print(f" Empirical std: {samples_icdf.std():.5f} (true: {1/lam:.5f})")

# KS test: are the samples from the right distribution?
ks_stat, p_val = ks_2samp(samples_icdf, samples_scipy)
print(f" KS test (vs scipy): stat={ks_stat:.5f}, p={p_val:.4f} (p>>0.05 = good)")
# Discrete inverse CDF: sampling from a custom discrete distribution
def sample_discrete_icdf(probs, n_samples):
"""
Sample from a discrete distribution using the inverse CDF.
Equivalent to np.random.choice but shows the mechanism.
"""
cdf = np.cumsum(probs)
u = np.random.uniform(0, 1, n_samples)
# Find the smallest index k such that CDF(k) >= u
samples = np.searchsorted(cdf, u)
return samples

# Custom discrete distribution: P(0)=0.1, P(1)=0.4, P(2)=0.3, P(3)=0.2
probs = np.array([0.1, 0.4, 0.3, 0.2])
samples = sample_discrete_icdf(probs, n_samples=100_000)

print("\nDiscrete Inverse CDF Sampling:")
print(f"{'Value':>6} | {'Empirical Freq':>15} | {'True Prob':>10}")
for k, p in enumerate(probs):
print(f"{k:>6} | {(samples==k).mean():>15.5f} | {p:>10.4f}")

Limitation

Inverse CDF requires knowing F1F^{-1} analytically. For many distributions (multivariate Gaussian, posterior distributions in Bayesian models), this is not available.

3. Rejection Sampling

Sample from a proposal distribution q(x)q(x) that is easy to sample from, and reject samples that are unlikely under the target p(x)p(x).

Algorithm

Given target p(x)p(x) and proposal q(x)q(x) with constant MM such that p(x)Mq(x)p(x) \leq M \cdot q(x) for all xx:

REPEAT:
1. Sample x ~ q(x)
2. Sample u ~ Uniform(0, 1)
3. IF u <= p(x) / (M * q(x)):
ACCEPT x (return x)
ELSE:
REJECT x (try again)

Acceptance probability: P(accept)=1MP(\text{accept}) = \frac{1}{M}

The tighter the proposal envelope (MM close to 1), the fewer rejections.

Rejection Sampling Visualization:

M·q(x) ───────────────────────────────── (proposal envelope)
│ │
p(x) │ ╭───────╮ │ (target density)
│ ╭─╯ ╰─╮ │
│ ╭─╯ ╰───╮ │
└──────────────────────────── x

Accept region (between p(x) and M·q(x)): probability p(x)/(M·q(x))
Reject region (above p(x)): probability 1 - p(x)/(M·q(x))
import numpy as np
from scipy.stats import norm, beta

def rejection_sampling(target_pdf, proposal_pdf, proposal_sampler, M, n_samples):
"""
Generic rejection sampler.

Args:
target_pdf: function x -> p(x)
proposal_pdf: function x -> q(x)
proposal_sampler: function n -> n samples from q
M: constant such that p(x) <= M * q(x) for all x
n_samples: desired number of accepted samples

Returns:
accepted_samples, n_total_attempts
"""
accepted = []
n_attempts = 0

while len(accepted) < n_samples:
# Step 1: sample from proposal
x = proposal_sampler(1)[0]
n_attempts += 1

# Step 2: compute acceptance ratio
accept_ratio = target_pdf(x) / (M * proposal_pdf(x))

# Step 3: accept or reject
if np.random.uniform() <= accept_ratio:
accepted.append(x)

return np.array(accepted), n_attempts

# Example: sample from Beta(3, 5) using Uniform(0,1) as proposal
from scipy.stats import beta as beta_dist, uniform

alpha, beta_param = 3.0, 5.0
target = beta_dist(alpha, beta_param)

# Max of Beta(3,5) pdf is at mode = (alpha-1)/(alpha+beta-2) = 2/6 ≈ 0.333
mode = (alpha - 1) / (alpha + beta_param - 2)
M = target.pdf(mode) # max density value

print(f"Rejection Sampling: Beta({alpha}, {beta_param}) from Uniform(0,1)")
print(f" M = {M:.4f} (proposal envelope)")
print(f" Theoretical acceptance rate = 1/M = {1/M:.4f}")

np.random.seed(42)
samples, n_attempts = rejection_sampling(
target_pdf=target.pdf,
proposal_pdf=lambda x: uniform.pdf(x),
proposal_sampler=lambda n: np.random.uniform(size=n),
M=M,
n_samples=10000
)

print(f" Actual acceptance rate: {10000/n_attempts:.4f}")
print(f" Sample mean: {samples.mean():.5f} (true: {target.mean():.5f})")
print(f" Sample std: {samples.std():.5f} (true: {target.std():.5f})")

Limitation

Rejection sampling is inefficient in high dimensions because the acceptance rate decreases exponentially: P(accept)1/MP(\text{accept}) \leq 1/M, and MM must cover the full distribution, which becomes exponentially hard as dimension grows (the "curse of dimensionality").

4. Importance Sampling

Rather than sampling from p(x)p(x) directly, use samples from q(x)q(x) but weight them to correct for the wrong distribution.

The Key Identity

Ep[f(X)]=f(x)p(x)dx=f(x)p(x)q(x)q(x)dx=Eq[f(X)p(X)q(X)]\mathbb{E}_{p}[f(X)] = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx = \mathbb{E}_{q}\left[f(X) \frac{p(X)}{q(X)}\right]

The weights w(x)=p(x)/q(x)w(x) = p(x)/q(x) are called importance weights.

Algorithm

1. Sample x_1, ..., x_n ~ q(x)
2. Compute importance weights: w_i = p(x_i) / q(x_i)
3. Estimate: E_p[f(X)] ≈ (Σ_i w_i f(x_i)) / (Σ_i w_i)

The unnormalized estimator 1niwif(xi)\frac{1}{n}\sum_i w_i f(x_i) requires knowing p(x)p(x) exactly. The self-normalized estimator iwif(xi)iwi\frac{\sum_i w_i f(x_i)}{\sum_i w_i} works even when pp is known only up to a constant.

import numpy as np
from scipy.stats import norm

np.random.seed(42)
n_samples = 100_000

# Estimate E_p[X^2] where p = N(3, 1) using q = N(0, 1) as proposal

def importance_sampling(f, target_logpdf, proposal_logpdf, proposal_sampler, n):
"""
Estimate E_p[f(X)] using importance sampling with proposal q.
Uses log-weights for numerical stability.
"""
samples = proposal_sampler(n)

# Log importance weights
log_weights = target_logpdf(samples) - proposal_logpdf(samples)

# Normalize using log-sum-exp trick
log_weights -= log_weights.max() # for numerical stability
weights = np.exp(log_weights)
weights /= weights.sum()

return np.sum(weights * f(samples)), weights

target = norm(loc=3.0, scale=1.0)
proposal = norm(loc=0.0, scale=1.0)

# Estimate E_p[X^2]
# True value: E[X^2] = Var(X) + E[X]^2 = 1 + 9 = 10
f = lambda x: x**2

est, weights = importance_sampling(
f=f,
target_logpdf=target.logpdf,
proposal_logpdf=proposal.logpdf,
proposal_sampler=lambda n: proposal.rvs(n),
n=n_samples
)

# Direct Monte Carlo from target (ground truth)
direct_mc = f(target.rvs(n_samples)).mean()

print(f"Importance Sampling: E_{{N(3,1)}}[X^2]")
print(f" True value: 10.0000")
print(f" Importance sampling: {est:.5f}")
print(f" Direct MC: {direct_mc:.5f}")

# Effective sample size (ESS) - measures quality of importance weights
ess = 1.0 / np.sum(weights**2)
print(f" Effective sample size (ESS): {ess:.0f} / {n_samples}")
print(f" ESS ratio: {ess/n_samples:.4f} (1.0 = perfect, <<1 = bad proposal)")

Effective Sample Size (ESS)

ESS=(iwi)2iwi2\text{ESS} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}

ESS measures how many iid samples from pp are equivalent to your nn weighted samples. If qpq \approx p, weights are all similar, and ESSn\text{ESS} \approx n. If qq is a poor proposal, a few samples dominate, and ESS n\ll n.

:::warning When Importance Sampling Fails Importance sampling fails catastrophically when the proposal qq has lighter tails than the target pp. If p(x)/q(x)p(x)/q(x) \to \infty for some xx in the support of pp, the importance weights have infinite variance. This happens when, e.g., estimating tail probabilities of a Gaussian using a Uniform proposal. In high dimensions, finding a good proposal is extremely difficult - which is why MCMC is preferred for high-dimensional posterior sampling. :::

5. Monte Carlo Integration

The most widely used application of sampling: estimating integrals that can't be solved analytically.

Basic Monte Carlo Estimator

Ωf(x)dxΩni=1nf(xi),xiUniform(Ω)\int_\Omega f(x) dx \approx \frac{|\Omega|}{n} \sum_{i=1}^n f(x_i), \quad x_i \sim \text{Uniform}(\Omega)

More generally, for expectation under pp:

Ep[f(X)]=f(x)p(x)dx1ni=1nf(xi),xip\mathbb{E}_p[f(X)] = \int f(x) p(x) dx \approx \frac{1}{n} \sum_{i=1}^n f(x_i), \quad x_i \sim p

Error rate: By CLT, the error is O(1/n)O(1/\sqrt{n}) - independent of dimensionality! This is the key advantage over deterministic numerical integration, which suffers the curse of dimensionality.

import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Classic Monte Carlo: estimate π by sampling in a square
# π/4 = area of quarter circle / area of unit square
# = P(X^2 + Y^2 <= 1) where (X,Y) ~ Uniform([0,1]^2)

for n in [100, 1000, 10000, 100000, 1000000]:
x = np.random.uniform(0, 1, n)
y = np.random.uniform(0, 1, n)
inside = (x**2 + y**2 <= 1)
pi_estimate = 4 * inside.mean()
error = abs(pi_estimate - np.pi)
print(f"n={n:>8}: pi ≈ {pi_estimate:.6f}, error={error:.6f}")
# Monte Carlo integration: E_p[f(X)] for Gaussian p
# Compute E[cos(X)] where X ~ N(0,1)
# True value: real part of moment generating function at i = Re[e^{-1/2}]

n_samples = 100_000
samples = np.random.randn(n_samples)

# Monte Carlo estimate
mc_estimate = np.cos(samples).mean()

# True value (via integration formula)
true_value = np.exp(-0.5) # E[cos(X)] = Re[E[e^{iX}]] = Re[e^{-σ²/2}]

print(f"\nE_{{N(0,1)}}[cos(X)]:")
print(f" True value: {true_value:.6f}")
print(f" MC estimate: {mc_estimate:.6f}")
print(f" Error: {abs(mc_estimate - true_value):.6f}")
print(f" Expected error ~ 1/√n = {1/np.sqrt(n_samples):.6f}")

Variance Reduction Techniques

Monte Carlo variance can be high, especially for rare events. Common techniques:

TechniqueIdeaWhen to Use
Antithetic variatesUse UU and 1U1-U togetherSymmetric integrands
Control variatesSubtract known-mean random variableKnown correlated function
Stratified samplingSample evenly from sub-regionsNon-uniform integrand
Importance samplingUse better proposalRare event estimation

6. Markov Chain Monte Carlo (MCMC)

MCMC generates samples from a complex distribution p(x)p(x) by constructing a Markov chain whose stationary distribution is pp.

The Core Idea

Instead of sampling independently (hard), take a random walk through xx-space:

x_1 → x_2 → x_3 → x_4 → ... → x_n

where transitions are designed so that the chain
"mixes" according to the target distribution p(x).

After a burn-in period (letting the chain forget its starting point), subsequent samples are approximately from p(x)p(x).

Metropolis-Hastings Algorithm

Initialize x_0 arbitrarily
FOR t = 1, 2, ..., N:
1. Propose x* ~ q(x* | x_{t-1}) (proposal distribution)
2. Compute acceptance ratio:
alpha = min(1, [p(x*) * q(x_{t-1}|x*)] / [p(x_{t-1}) * q(x*|x_{t-1})])
3. Accept: x_t = x* with probability alpha
Reject: x_t = x_{t-1} with probability (1 - alpha)

For symmetric proposals q(xx)=q(xx)q(x^* \mid x) = q(x \mid x^*) (e.g., Gaussian random walk):

α=min(1,p(x)p(xt1))\alpha = \min\left(1, \frac{p(x^*)}{p(x_{t-1})}\right)

Crucially: we only need pp up to a normalization constant! The constant cancels in the ratio.

import numpy as np
from scipy.stats import norm, multivariate_normal

def metropolis_hastings(log_target, n_samples, x0, proposal_std=0.5, burn_in=1000):
"""
Metropolis-Hastings with Gaussian random walk proposal.

Args:
log_target: log p(x) (up to a constant)
n_samples: number of samples (after burn-in)
x0: initial state
proposal_std: std of Gaussian random walk proposal
burn_in: number of burn-in steps to discard
"""
x = np.array(x0, dtype=float)
dim = x.shape[0] if hasattr(x, 'shape') else 1
samples = []
n_accepted = 0

for i in range(n_samples + burn_in):
# Propose new state
x_star = x + proposal_std * np.random.randn(*x.shape)

# Acceptance ratio (in log space for numerical stability)
log_alpha = log_target(x_star) - log_target(x)
alpha = min(1.0, np.exp(log_alpha))

# Accept or reject
if np.random.uniform() < alpha:
x = x_star
if i >= burn_in:
n_accepted += 1

if i >= burn_in:
samples.append(x.copy())

acceptance_rate = n_accepted / n_samples
return np.array(samples), acceptance_rate

# Example 1: Sample from N(2, 0.5^2) using MH
target_mean, target_std = 2.0, 0.5
log_target_1d = lambda x: norm.logpdf(x, loc=target_mean, scale=target_std)

samples_1d, acc_rate = metropolis_hastings(
log_target=lambda x: log_target_1d(x[0]),
n_samples=10000,
x0=np.array([0.0]),
proposal_std=0.5
)
samples_1d = samples_1d.flatten()

print(f"MH Sampling from N({target_mean}, {target_std}^2):")
print(f" Acceptance rate: {acc_rate:.3f} (optimal ~0.234 in 1D)")
print(f" Sample mean: {samples_1d.mean():.5f} (true: {target_mean})")
print(f" Sample std: {samples_1d.std():.5f} (true: {target_std})")
# Example 2: Sample from a bimodal distribution (challenging for MCMC)
def log_bimodal(x):
"""Log density of 0.5*N(-2,1) + 0.5*N(2,1)."""
log_p1 = np.log(0.5) + norm.logpdf(x[0], -2, 1)
log_p2 = np.log(0.5) + norm.logpdf(x[0], 2, 1)
return np.logaddexp(log_p1, log_p2)

for proposal_std in [0.1, 1.0, 3.0]:
samples, acc = metropolis_hastings(
log_target=log_bimodal,
n_samples=10000,
x0=np.array([0.0]),
proposal_std=proposal_std
)
samples = samples.flatten()
print(f" Proposal std={proposal_std}: acc_rate={acc:.3f}, "
f"sample mean={samples.mean():.3f} (expected 0)")

MCMC Diagnostics

# Auto-correlation function to check MCMC mixing
def effective_sample_size(samples):
"""Estimate ESS using autocorrelation."""
n = len(samples)
# Compute autocorrelation up to lag 100
max_lag = min(100, n // 2)
mean = samples.mean()
var = samples.var()

if var < 1e-10:
return n

# Compute autocorrelations
rho = []
for lag in range(1, max_lag):
c = np.mean((samples[:n-lag] - mean) * (samples[lag:] - mean)) / var
rho.append(c)
if c < 0.05: # stop when autocorrelation drops below threshold
break

# ESS ≈ n / (1 + 2 * sum(autocorrelations))
ess = n / (1 + 2 * sum(rho))
return max(1, ess)

samples, _ = metropolis_hastings(
log_target=lambda x: norm.logpdf(x[0], 2, 0.5),
n_samples=5000,
x0=np.array([0.0]),
proposal_std=0.5
)
samples_flat = samples.flatten()

ess = effective_sample_size(samples_flat)
print(f"\nMCMC Effective Sample Size:")
print(f" Total samples: 5000")
print(f" ESS: {ess:.0f}")
print(f" ESS ratio: {ess/5000:.4f} (0.1 to 0.5 is typical)")

Gibbs Sampling

When the full joint p(x)p(\mathbf{x}) is complex but the conditional distributions p(xjxj)p(x_j \mid \mathbf{x}_{-j}) are tractable:

Initialize: x_1, ..., x_d
REPEAT:
FOR j = 1 to d:
Sample x_j ~ p(x_j | x_1, ..., x_{j-1}, x_{j+1}, ..., x_d)

Each variable is updated conditioned on all others. This is the basis of many Bayesian inference algorithms and was central to LDA (Latent Dirichlet Allocation) training.

7. Sampling in Deep Learning

Language Model Decoding

Token generation from language models is sampling from P(wtw1,,wt1)P(w_t \mid w_1, \ldots, w_{t-1}):

import numpy as np

def softmax(x, temperature=1.0):
z = x / temperature
e = np.exp(z - z.max())
return e / e.sum()

def greedy_decode(logits):
"""Argmax - deterministic, no sampling."""
return np.argmax(logits)

def temperature_sample(logits, temperature=1.0):
"""Sample from softmax with temperature."""
probs = softmax(logits, temperature)
return np.random.choice(len(probs), p=probs)

def top_k_sample(logits, k=50):
"""Sample from top-k distribution."""
top_k_idx = np.argpartition(logits, -k)[-k:]
top_k_logits = logits[top_k_idx]
probs = softmax(top_k_logits)
return top_k_idx[np.random.choice(k, p=probs)]

def top_p_sample(logits, p=0.9):
"""Nucleus (top-p) sampling: sample from smallest set with cumulative prob >= p."""
probs = softmax(logits)
sorted_idx = np.argsort(probs)[::-1]
sorted_probs = probs[sorted_idx]
cumsum = np.cumsum(sorted_probs)
# Find nucleus: smallest set covering p probability mass
nucleus = sorted_idx[cumsum <= p]
if len(nucleus) == 0:
nucleus = sorted_idx[:1]
nucleus_probs = probs[nucleus]
nucleus_probs /= nucleus_probs.sum()
return nucleus[np.random.choice(len(nucleus), p=nucleus_probs)]

# Example
np.random.seed(42)
vocab_size = 10
logits = np.random.randn(vocab_size)

print("Decoding strategies on logits:", logits.round(2))
print(f" Greedy: token {greedy_decode(logits)}")
print(f" Temp=1.0: token {temperature_sample(logits, 1.0)}")
print(f" Temp=0.5: token {temperature_sample(logits, 0.5)}")
print(f" Top-k (k=3): token {top_k_sample(logits, 3)}")
print(f" Top-p (p=0.9): token {top_p_sample(logits, 0.9)}")

Monte Carlo Dropout for Uncertainty Estimation

import numpy as np

def mc_dropout_inference(model_forward, x, n_passes=100, dropout_rate=0.1):
"""
Run n stochastic forward passes through the model
to estimate prediction mean and uncertainty.
"""
predictions = []
for _ in range(n_passes):
# Apply dropout mask manually (approximation)
pred = model_forward(x, dropout_rate=dropout_rate)
predictions.append(pred)

predictions = np.array(predictions) # (n_passes, n_classes)
mean_pred = predictions.mean(axis=0)
uncertainty = predictions.std(axis=0) # epistemic uncertainty
return mean_pred, uncertainty

# Simulate a simple model forward pass
def fake_model(x, dropout_rate=0.1):
"""Simulate a stochastic forward pass."""
np.random.seed(None) # ensure different random masks each call
base_logits = np.array([2.0, -1.0, 0.5])
# Dropout: randomly zero out some activations
mask = np.random.binomial(1, 1-dropout_rate, size=3).astype(float)
noisy_logits = base_logits * mask + 0.2 * np.random.randn(3)
e = np.exp(noisy_logits - noisy_logits.max())
return e / e.sum()

np.random.seed(42)
x = np.array([1.0, 2.0, 3.0])
mean_pred, uncertainty = mc_dropout_inference(fake_model, x, n_passes=100)

print(f"\nMC Dropout Inference (100 passes):")
print(f" Mean prediction: {mean_pred.round(4)}")
print(f" Uncertainty: {uncertainty.round(4)}")
print(f" Most uncertain class: {np.argmax(uncertainty)}")

8. Summary: Sampling Methods Reference

METHOD | REQUIRES | BEST FOR
────────────────────|─────────────────────────────|──────────────────────────
Inverse CDF | Closed-form F^{-1} | 1D distributions with
| | known CDF inverse
────────────────────|─────────────────────────────|──────────────────────────
Rejection Sampling | Proposal q(x) ≥ p(x)/M, | Low-dimensional targets
| easy to sample from q | when good proposal exists
────────────────────|─────────────────────────────|──────────────────────────
Importance Sampling | Proposal q(x) ≠ 0 wherever | Estimating expectations
| p(x) ≠ 0 | (not generating samples)
────────────────────|─────────────────────────────|──────────────────────────
Metropolis-Hastings | log p(x) up to constant | General-purpose high-dim
| (unnormalized target OK) | Bayesian inference
────────────────────|─────────────────────────────|──────────────────────────
Gibbs Sampling | Tractable conditionals | Bayesian networks,
| p(x_j | x_{-j}) | LDA, graphical models
────────────────────|─────────────────────────────|──────────────────────────
HMC / NUTS | Gradient of log p(x) | High-dim posteriors
| | (Stan, PyMC, NumPyro)
────────────────────|─────────────────────────────|──────────────────────────
Direct (Neural) | Trained generative model | Images, text, audio
| (GAN, VAE, Diffusion) | generation

9. Interview Q&A

Q1: Explain the inverse CDF sampling method. Why does it work?

A: The inverse CDF method generates samples from a target distribution FF by sampling UUniform(0,1)U \sim \text{Uniform}(0,1) and returning X=F1(U)X = F^{-1}(U). It works because P(Xx)=P(F1(U)x)=P(UF(x))=F(x)P(X \leq x) = P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x) - the last equality holds because UU is uniform, so P(Ut)=tP(U \leq t) = t for t[0,1]t \in [0,1]. This is the probability integral transform. Practical applications: generating Exponential samples as X=log(U)/λX = -\log(U)/\lambda, generating from any distribution whose quantile function can be computed (e.g., scipy's .ppf() method). It fails when F1F^{-1} is not available analytically, which happens for most multivariate distributions and complex posteriors in Bayesian models.

Q2: What is importance sampling and when does it fail?

A: Importance sampling estimates Ep[f(X)]\mathbb{E}_p[f(X)] without sampling from pp directly. The identity Ep[f(X)]=Eq[f(X)w(X)]\mathbb{E}_p[f(X)] = \mathbb{E}_q[f(X) w(X)] with importance weights w(x)=p(x)/q(x)w(x) = p(x)/q(x) allows us to draw samples from a convenient proposal qq and reweight them. The self-normalized estimator iwif(xi)/iwi\sum_i w_i f(x_i) / \sum_i w_i even works when pp is only known up to a constant. It fails when the proposal qq has lighter tails than the target pp: if p(x)/q(x)p(x)/q(x) \to \infty for some xx, the importance weights have infinite variance, and the estimator has no finite variance - it can produce wildly inaccurate results. In practice this manifests as a few samples with enormous weights dominating the estimate, which can be detected by computing the effective sample size (ESS) - low ESS indicates a poor proposal.

Q3: What is the Metropolis-Hastings algorithm and what problem does it solve?

A: MH solves the problem of sampling from a distribution p(x)p(x) known only up to a normalization constant: p(x)=p~(x)/Zp(x) = \tilde{p}(x)/Z where Z=p~(x)dxZ = \int \tilde{p}(x) dx is intractable. The algorithm: starting from current state xx, propose xq(xx)x^* \sim q(x^* \mid x), then accept with probability α=min(1,p~(x)q(xx)p~(x)q(xx))\alpha = \min(1, \frac{\tilde{p}(x^*) q(x \mid x^*)}{{\tilde{p}(x) q(x^* \mid x)}}); otherwise stay at xx. The normalizing constant ZZ cancels in the ratio, so we only need p~\tilde{p}. The Markov chain converges to pp as its stationary distribution - this is guaranteed by the detailed balance condition: p(x)q(xx)α(xx)=p(x)q(xx)α(xx)p(x) q(x^* \mid x) \alpha(x \to x^*) = p(x^*) q(x \mid x^*) \alpha(x^* \to x). Key challenges: choosing a good proposal (too narrow → slow mixing; too wide → high rejection rate), detecting convergence (burn-in), and handling high-dimensional spaces.

Q4: How does dropout relate to Bayesian inference and sampling?

A: Standard dropout during training randomly zeros activations with probability pp, which Gal and Ghahramani (2016) showed is equivalent to approximate variational Bayesian inference over the weights - each dropout mask is a sample from an approximate posterior over weights. At test time, standard practice disables dropout and uses expected weights (scaling by 1p1-p), which approximates Eθ[P(yx,θ)]\mathbb{E}_\theta[P(y \mid x, \theta)] crudely. Monte Carlo Dropout keeps dropout active at test time and runs TT stochastic forward passes. The mean prediction is a Monte Carlo approximation to the posterior predictive mean, and the variance across passes estimates epistemic uncertainty (uncertainty about model parameters). This converts any dropout-trained neural network into an approximate Bayesian neural network with no additional training cost - just TT forward passes at inference time.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Sampling Methods demo on the EngineersOfAI Playground - no code required.

:::

Q5: Why is MCMC necessary for high-dimensional Bayesian inference, and what are its practical limitations?

A: In Bayesian ML, we want to sample from the posterior P(θD)P(Dθ)P(θ)P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) P(\theta) over model parameters θ\theta. For neural networks with millions of parameters, this posterior is a high-dimensional, non-Gaussian distribution with no closed form. Inverse CDF is unavailable, rejection sampling has exponentially small acceptance rates in high dimensions, and importance sampling produces near-zero ESS. MCMC handles this by constructing a Markov chain that explores the high-dimensional space ergodically, without needing a global normalization constant - only logp(θD)=logP(Dθ)+logP(θ)+const\log p(\theta \mid \mathcal{D}) = \log P(\mathcal{D} \mid \theta) + \log P(\theta) + \text{const} is needed. Practical limitations: (1) convergence: it is hard to know when the chain has mixed; (2) autocorrelation: successive samples are correlated, reducing effective sample size; (3) cost: for large neural networks, even one MCMC step requires a full gradient computation - Hamiltonian Monte Carlo (HMC) needs many gradient evaluations per step; (4) scalability: standard MCMC is impractical for models with millions of parameters, which is why variational inference (ELBO, normalizing flows) is often preferred despite being approximate.

© 2026 EngineersOfAI. All rights reserved.