Sampling Methods

Reading time: ~50 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist

The ML Scenario That Motivates This Lesson

You're training a Bayesian neural network. After updating the model with data, the posterior distribution over weights $P(\theta \mid \mathcal{D})$ is a high-dimensional, non-Gaussian distribution - it has no closed form. You need to:

Sample from this distribution to make predictions (average over posterior)
Estimate integrals over this distribution (expected loss, predictive variance)
Explore the distribution to understand model uncertainty

None of these are possible analytically. You need sampling algorithms.

Sampling methods appear throughout modern ML:

Dropout: stochastic approximation to Bayesian neural network inference
Data augmentation: sampling transformed versions of training data
Language model generation: sampling from $P(w_t \mid w_1, \ldots, w_{t-1})$
Diffusion models: iterative denoising is a sampling process from a learned distribution
RLHF: sampling model outputs for human preference labeling
VAE: sampling from the latent space prior

This lesson covers the algorithmic toolkit for sampling from arbitrary probability distributions.

1. Why Sampling Is Hard

Sampling from a distribution $p(x)$ seems simple: generate random numbers that follow $p$ . But in ML, we often face:

High dimensionality: $p(\mathbf{x})$ over $\mathbb{R}^{10000}$ (image pixel space)
Intractable normalization: $p(x) \propto \tilde{p}(x)$ but $\int \tilde{p}(x) dx$ is unknown
Complex shape: multi-modal, non-Gaussian distributions
Implicit distributions: defined via a neural network (GAN, diffusion)

Different sampling algorithms handle different subsets of these challenges:

SAMPLING ALGORITHM SELECTION GUIDE:

Can you evaluate F^{-1}(u) efficiently?  ──► YES ──► Inverse CDF Method
                   │
                  NO
                   │
                   ▼
Do you know a proposal distribution q(x)?  ──► YES ──► Rejection Sampling
                   │                                    or Importance Sampling
                  NO
                   │
                   ▼
Can you evaluate p(x) up to a constant?   ──► YES ──► MCMC (Metropolis-Hastings,
                   │                                   Gibbs Sampling, HMC)
                  NO
                   │
                   ▼
Can you sample from a neural network?    ──► YES ──► Direct (GAN, Diffusion, Flow)

2. Inverse CDF (Inverse Transform) Method

The most elegant sampling algorithm for 1D distributions.

The Key Theorem

If $U \sim \text{Uniform}(0, 1)$ and $F$ is a CDF, then:

$X = F^{-1}(U) \sim F$

Proof: $P(X \leq x) = P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x)$ . The last step uses $U \sim \text{Uniform}(0,1)$ .

Algorithm

1. Sample u ~ Uniform(0, 1)
2. Return x = F^{-1}(u)

Example: Exponential Distribution

For $X \sim \text{Exponential}(\lambda)$ :

CDF: $F(x) = 1 - e^{-\lambda x}$
Inverse: $F^{-1}(u) = -\frac{\ln(1-u)}{\lambda}$

Since $1-U \sim \text{Uniform}(0,1)$ if $U \sim \text{Uniform}(0,1)$ , we can simplify to $X = -\frac{\ln U}{\lambda}$ .

import numpy as np
from scipy.stats import exponential, ks_2samp

np.random.seed(42)
n_samples = 100_000
lam = 2.0

# --- Inverse CDF sampling ---
def sample_exponential_icdf(lam, n):
    """Sample from Exponential(lam) using inverse CDF."""
    u = np.random.uniform(0, 1, n)
    return -np.log(1 - u) / lam  # equivalently: -np.log(u) / lam

samples_icdf = sample_exponential_icdf(lam, n_samples)
samples_scipy = np.random.exponential(scale=1/lam, size=n_samples)

print(f"Exponential({lam}) via Inverse CDF:")
print(f"  Empirical mean:    {samples_icdf.mean():.5f}  (true: {1/lam:.5f})")
print(f"  Empirical std:     {samples_icdf.std():.5f}  (true: {1/lam:.5f})")

# KS test: are the samples from the right distribution?
ks_stat, p_val = ks_2samp(samples_icdf, samples_scipy)
print(f"  KS test (vs scipy): stat={ks_stat:.5f}, p={p_val:.4f} (p>>0.05 = good)")

# Discrete inverse CDF: sampling from a custom discrete distribution
def sample_discrete_icdf(probs, n_samples):
    """
    Sample from a discrete distribution using the inverse CDF.
    Equivalent to np.random.choice but shows the mechanism.
    """
    cdf = np.cumsum(probs)
    u = np.random.uniform(0, 1, n_samples)
    # Find the smallest index k such that CDF(k) >= u
    samples = np.searchsorted(cdf, u)
    return samples

# Custom discrete distribution: P(0)=0.1, P(1)=0.4, P(2)=0.3, P(3)=0.2
probs = np.array([0.1, 0.4, 0.3, 0.2])
samples = sample_discrete_icdf(probs, n_samples=100_000)

print("\nDiscrete Inverse CDF Sampling:")
print(f"{'Value':>6} | {'Empirical Freq':>15} | {'True Prob':>10}")
for k, p in enumerate(probs):
    print(f"{k:>6} | {(samples==k).mean():>15.5f} | {p:>10.4f}")

Limitation

Inverse CDF requires knowing $F^{-1}$ analytically. For many distributions (multivariate Gaussian, posterior distributions in Bayesian models), this is not available.

3. Rejection Sampling

Sample from a proposal distribution $q(x)$ that is easy to sample from, and reject samples that are unlikely under the target $p(x)$ .

Algorithm

Given target $p(x)$ and proposal $q(x)$ with constant $M$ such that $p(x) \leq M \cdot q(x)$ for all $x$ :

REPEAT:
  1. Sample x ~ q(x)
  2. Sample u ~ Uniform(0, 1)
  3. IF u <= p(x) / (M * q(x)):
       ACCEPT x  (return x)
     ELSE:
       REJECT x  (try again)

Acceptance probability: $P(\text{accept}) = \frac{1}{M}$

The tighter the proposal envelope ( $M$ close to 1), the fewer rejections.

Rejection Sampling Visualization:

  M·q(x)  ─────────────────────────────────  (proposal envelope)
           │                           │
  p(x)     │      ╭───────╮            │      (target density)
           │    ╭─╯       ╰─╮          │
           │  ╭─╯           ╰───╮      │
           └──────────────────────────── x

  Accept region (between p(x) and M·q(x)): probability p(x)/(M·q(x))
  Reject region (above p(x)):              probability 1 - p(x)/(M·q(x))

import numpy as np
from scipy.stats import norm, beta

def rejection_sampling(target_pdf, proposal_pdf, proposal_sampler, M, n_samples):
    """
    Generic rejection sampler.

    Args:
        target_pdf: function x -> p(x)
        proposal_pdf: function x -> q(x)
        proposal_sampler: function n -> n samples from q
        M: constant such that p(x) <= M * q(x) for all x
        n_samples: desired number of accepted samples

    Returns:
        accepted_samples, n_total_attempts
    """
    accepted = []
    n_attempts = 0

    while len(accepted) < n_samples:
        # Step 1: sample from proposal
        x = proposal_sampler(1)[0]
        n_attempts += 1

        # Step 2: compute acceptance ratio
        accept_ratio = target_pdf(x) / (M * proposal_pdf(x))

        # Step 3: accept or reject
        if np.random.uniform() <= accept_ratio:
            accepted.append(x)

    return np.array(accepted), n_attempts

# Example: sample from Beta(3, 5) using Uniform(0,1) as proposal
from scipy.stats import beta as beta_dist, uniform

alpha, beta_param = 3.0, 5.0
target  = beta_dist(alpha, beta_param)

# Max of Beta(3,5) pdf is at mode = (alpha-1)/(alpha+beta-2) = 2/6 ≈ 0.333
mode = (alpha - 1) / (alpha + beta_param - 2)
M = target.pdf(mode)   # max density value

print(f"Rejection Sampling: Beta({alpha}, {beta_param}) from Uniform(0,1)")
print(f"  M = {M:.4f} (proposal envelope)")
print(f"  Theoretical acceptance rate = 1/M = {1/M:.4f}")

np.random.seed(42)
samples, n_attempts = rejection_sampling(
    target_pdf=target.pdf,
    proposal_pdf=lambda x: uniform.pdf(x),
    proposal_sampler=lambda n: np.random.uniform(size=n),
    M=M,
    n_samples=10000
)

print(f"  Actual acceptance rate: {10000/n_attempts:.4f}")
print(f"  Sample mean: {samples.mean():.5f}  (true: {target.mean():.5f})")
print(f"  Sample std:  {samples.std():.5f}  (true: {target.std():.5f})")

Limitation

Rejection sampling is inefficient in high dimensions because the acceptance rate decreases exponentially: $P(\text{accept}) \leq 1/M$ , and $M$ must cover the full distribution, which becomes exponentially hard as dimension grows (the "curse of dimensionality").

4. Importance Sampling

Rather than sampling from $p(x)$ directly, use samples from $q(x)$ but weight them to correct for the wrong distribution.

The Key Identity

$\mathbb{E}_{p}[f(X)] = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx = \mathbb{E}_{q}\left[f(X) \frac{p(X)}{q(X)}\right]$

The weights $w(x) = p(x)/q(x)$ are called importance weights.

Algorithm

Sample x_1, ..., x_n ~ q(x)
Compute importance weights: w_i = p(x_i) / q(x_i)
Estimate: E_p[f(X)] ≈ (Σ_i w_i f(x_i)) / (Σ_i w_i)

The unnormalized estimator $\frac{1}{n}\sum_i w_i f(x_i)$ requires knowing $p(x)$ exactly. The self-normalized estimator $\frac{\sum_i w_i f(x_i)}{\sum_i w_i}$ works even when $p$ is known only up to a constant.

import numpy as np
from scipy.stats import norm

np.random.seed(42)
n_samples = 100_000

# Estimate E_p[X^2] where p = N(3, 1) using q = N(0, 1) as proposal

def importance_sampling(f, target_logpdf, proposal_logpdf, proposal_sampler, n):
    """
    Estimate E_p[f(X)] using importance sampling with proposal q.
    Uses log-weights for numerical stability.
    """
    samples = proposal_sampler(n)

    # Log importance weights
    log_weights = target_logpdf(samples) - proposal_logpdf(samples)

    # Normalize using log-sum-exp trick
    log_weights -= log_weights.max()  # for numerical stability
    weights = np.exp(log_weights)
    weights /= weights.sum()

    return np.sum(weights * f(samples)), weights

target = norm(loc=3.0, scale=1.0)
proposal = norm(loc=0.0, scale=1.0)

# Estimate E_p[X^2]
# True value: E[X^2] = Var(X) + E[X]^2 = 1 + 9 = 10
f = lambda x: x**2

est, weights = importance_sampling(
    f=f,
    target_logpdf=target.logpdf,
    proposal_logpdf=proposal.logpdf,
    proposal_sampler=lambda n: proposal.rvs(n),
    n=n_samples
)

# Direct Monte Carlo from target (ground truth)
direct_mc = f(target.rvs(n_samples)).mean()

print(f"Importance Sampling: E_{{N(3,1)}}[X^2]")
print(f"  True value:          10.0000")
print(f"  Importance sampling: {est:.5f}")
print(f"  Direct MC:           {direct_mc:.5f}")

# Effective sample size (ESS) - measures quality of importance weights
ess = 1.0 / np.sum(weights**2)
print(f"  Effective sample size (ESS): {ess:.0f} / {n_samples}")
print(f"  ESS ratio: {ess/n_samples:.4f}  (1.0 = perfect, <<1 = bad proposal)")

Effective Sample Size (ESS)

$\text{ESS} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}$

ESS measures how many iid samples from $p$ are equivalent to your $n$ weighted samples. If $q \approx p$ , weights are all similar, and $\text{ESS} \approx n$ . If $q$ is a poor proposal, a few samples dominate, and ESS $\ll n$ .

:::warning When Importance Sampling Fails Importance sampling fails catastrophically when the proposal $q$ has lighter tails than the target $p$ . If $p(x)/q(x) \to \infty$ for some $x$ in the support of $p$ , the importance weights have infinite variance. This happens when, e.g., estimating tail probabilities of a Gaussian using a Uniform proposal. In high dimensions, finding a good proposal is extremely difficult - which is why MCMC is preferred for high-dimensional posterior sampling. :::

5. Monte Carlo Integration

The most widely used application of sampling: estimating integrals that can't be solved analytically.

Basic Monte Carlo Estimator

$\int_\Omega f(x) dx \approx \frac{|\Omega|}{n} \sum_{i=1}^n f(x_i), \quad x_i \sim \text{Uniform}(\Omega)$

More generally, for expectation under $p$ :

$\mathbb{E}_p[f(X)] = \int f(x) p(x) dx \approx \frac{1}{n} \sum_{i=1}^n f(x_i), \quad x_i \sim p$

Error rate: By CLT, the error is $O(1/\sqrt{n})$ - independent of dimensionality! This is the key advantage over deterministic numerical integration, which suffers the curse of dimensionality.

import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Classic Monte Carlo: estimate π by sampling in a square
# π/4 = area of quarter circle / area of unit square
# = P(X^2 + Y^2 <= 1) where (X,Y) ~ Uniform([0,1]^2)

for n in [100, 1000, 10000, 100000, 1000000]:
    x = np.random.uniform(0, 1, n)
    y = np.random.uniform(0, 1, n)
    inside = (x**2 + y**2 <= 1)
    pi_estimate = 4 * inside.mean()
    error = abs(pi_estimate - np.pi)
    print(f"n={n:>8}: pi ≈ {pi_estimate:.6f}, error={error:.6f}")

# Monte Carlo integration: E_p[f(X)] for Gaussian p
# Compute E[cos(X)] where X ~ N(0,1)
# True value: real part of moment generating function at i = Re[e^{-1/2}]

n_samples = 100_000
samples = np.random.randn(n_samples)

# Monte Carlo estimate
mc_estimate = np.cos(samples).mean()

# True value (via integration formula)
true_value = np.exp(-0.5)  # E[cos(X)] = Re[E[e^{iX}]] = Re[e^{-σ²/2}]

print(f"\nE_{{N(0,1)}}[cos(X)]:")
print(f"  True value:       {true_value:.6f}")
print(f"  MC estimate:      {mc_estimate:.6f}")
print(f"  Error:            {abs(mc_estimate - true_value):.6f}")
print(f"  Expected error ~ 1/√n = {1/np.sqrt(n_samples):.6f}")

Variance Reduction Techniques

Monte Carlo variance can be high, especially for rare events. Common techniques:

Technique	Idea	When to Use
Antithetic variates	Use $U$ and $1-U$ together	Symmetric integrands
Control variates	Subtract known-mean random variable	Known correlated function
Stratified sampling	Sample evenly from sub-regions	Non-uniform integrand
Importance sampling	Use better proposal	Rare event estimation

6. Markov Chain Monte Carlo (MCMC)

MCMC generates samples from a complex distribution $p(x)$ by constructing a Markov chain whose stationary distribution is $p$ .

The Core Idea

Instead of sampling independently (hard), take a random walk through $x$ -space:

x_1 → x_2 → x_3 → x_4 → ... → x_n

where transitions are designed so that the chain
"mixes" according to the target distribution p(x).

After a burn-in period (letting the chain forget its starting point), subsequent samples are approximately from $p(x)$ .

Metropolis-Hastings Algorithm

Initialize x_0 arbitrarily
FOR t = 1, 2, ..., N:
  1. Propose x* ~ q(x* | x_{t-1})     (proposal distribution)
  2. Compute acceptance ratio:
       alpha = min(1, [p(x*) * q(x_{t-1}|x*)] / [p(x_{t-1}) * q(x*|x_{t-1})])
  3. Accept: x_t = x* with probability alpha
     Reject: x_t = x_{t-1} with probability (1 - alpha)

For symmetric proposals $q(x^* \mid x) = q(x \mid x^*)$ (e.g., Gaussian random walk):

$\alpha = \min\left(1, \frac{p(x^*)}{p(x_{t-1})}\right)$

Crucially: we only need $p$ up to a normalization constant! The constant cancels in the ratio.

import numpy as np
from scipy.stats import norm, multivariate_normal

def metropolis_hastings(log_target, n_samples, x0, proposal_std=0.5, burn_in=1000):
    """
    Metropolis-Hastings with Gaussian random walk proposal.

    Args:
        log_target: log p(x) (up to a constant)
        n_samples: number of samples (after burn-in)
        x0: initial state
        proposal_std: std of Gaussian random walk proposal
        burn_in: number of burn-in steps to discard
    """
    x = np.array(x0, dtype=float)
    dim = x.shape[0] if hasattr(x, 'shape') else 1
    samples = []
    n_accepted = 0

    for i in range(n_samples + burn_in):
        # Propose new state
        x_star = x + proposal_std * np.random.randn(*x.shape)

        # Acceptance ratio (in log space for numerical stability)
        log_alpha = log_target(x_star) - log_target(x)
        alpha = min(1.0, np.exp(log_alpha))

        # Accept or reject
        if np.random.uniform() < alpha:
            x = x_star
            if i >= burn_in:
                n_accepted += 1

        if i >= burn_in:
            samples.append(x.copy())

    acceptance_rate = n_accepted / n_samples
    return np.array(samples), acceptance_rate

# Example 1: Sample from N(2, 0.5^2) using MH
target_mean, target_std = 2.0, 0.5
log_target_1d = lambda x: norm.logpdf(x, loc=target_mean, scale=target_std)

samples_1d, acc_rate = metropolis_hastings(
    log_target=lambda x: log_target_1d(x[0]),
    n_samples=10000,
    x0=np.array([0.0]),
    proposal_std=0.5
)
samples_1d = samples_1d.flatten()

print(f"MH Sampling from N({target_mean}, {target_std}^2):")
print(f"  Acceptance rate: {acc_rate:.3f}  (optimal ~0.234 in 1D)")
print(f"  Sample mean:    {samples_1d.mean():.5f}  (true: {target_mean})")
print(f"  Sample std:     {samples_1d.std():.5f}  (true: {target_std})")

# Example 2: Sample from a bimodal distribution (challenging for MCMC)
def log_bimodal(x):
    """Log density of 0.5*N(-2,1) + 0.5*N(2,1)."""
    log_p1 = np.log(0.5) + norm.logpdf(x[0], -2, 1)
    log_p2 = np.log(0.5) + norm.logpdf(x[0],  2, 1)
    return np.logaddexp(log_p1, log_p2)

for proposal_std in [0.1, 1.0, 3.0]:
    samples, acc = metropolis_hastings(
        log_target=log_bimodal,
        n_samples=10000,
        x0=np.array([0.0]),
        proposal_std=proposal_std
    )
    samples = samples.flatten()
    print(f"  Proposal std={proposal_std}: acc_rate={acc:.3f}, "
          f"sample mean={samples.mean():.3f} (expected 0)")

MCMC Diagnostics

# Auto-correlation function to check MCMC mixing
def effective_sample_size(samples):
    """Estimate ESS using autocorrelation."""
    n = len(samples)
    # Compute autocorrelation up to lag 100
    max_lag = min(100, n // 2)
    mean = samples.mean()
    var  = samples.var()

    if var < 1e-10:
        return n

    # Compute autocorrelations
    rho = []
    for lag in range(1, max_lag):
        c = np.mean((samples[:n-lag] - mean) * (samples[lag:] - mean)) / var
        rho.append(c)
        if c < 0.05:  # stop when autocorrelation drops below threshold
            break

    # ESS ≈ n / (1 + 2 * sum(autocorrelations))
    ess = n / (1 + 2 * sum(rho))
    return max(1, ess)

samples, _ = metropolis_hastings(
    log_target=lambda x: norm.logpdf(x[0], 2, 0.5),
    n_samples=5000,
    x0=np.array([0.0]),
    proposal_std=0.5
)
samples_flat = samples.flatten()

ess = effective_sample_size(samples_flat)
print(f"\nMCMC Effective Sample Size:")
print(f"  Total samples: 5000")
print(f"  ESS:          {ess:.0f}")
print(f"  ESS ratio:    {ess/5000:.4f}  (0.1 to 0.5 is typical)")

Gibbs Sampling

When the full joint $p(\mathbf{x})$ is complex but the conditional distributions $p(x_j \mid \mathbf{x}_{-j})$ are tractable:

Initialize: x_1, ..., x_d
REPEAT:
  FOR j = 1 to d:
    Sample x_j ~ p(x_j | x_1, ..., x_{j-1}, x_{j+1}, ..., x_d)

Each variable is updated conditioned on all others. This is the basis of many Bayesian inference algorithms and was central to LDA (Latent Dirichlet Allocation) training.

7. Sampling in Deep Learning

Language Model Decoding

Token generation from language models is sampling from $P(w_t \mid w_1, \ldots, w_{t-1})$ :

import numpy as np

def softmax(x, temperature=1.0):
    z = x / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(logits):
    """Argmax - deterministic, no sampling."""
    return np.argmax(logits)

def temperature_sample(logits, temperature=1.0):
    """Sample from softmax with temperature."""
    probs = softmax(logits, temperature)
    return np.random.choice(len(probs), p=probs)

def top_k_sample(logits, k=50):
    """Sample from top-k distribution."""
    top_k_idx = np.argpartition(logits, -k)[-k:]
    top_k_logits = logits[top_k_idx]
    probs = softmax(top_k_logits)
    return top_k_idx[np.random.choice(k, p=probs)]

def top_p_sample(logits, p=0.9):
    """Nucleus (top-p) sampling: sample from smallest set with cumulative prob >= p."""
    probs = softmax(logits)
    sorted_idx = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_idx]
    cumsum = np.cumsum(sorted_probs)
    # Find nucleus: smallest set covering p probability mass
    nucleus = sorted_idx[cumsum <= p]
    if len(nucleus) == 0:
        nucleus = sorted_idx[:1]
    nucleus_probs = probs[nucleus]
    nucleus_probs /= nucleus_probs.sum()
    return nucleus[np.random.choice(len(nucleus), p=nucleus_probs)]

# Example
np.random.seed(42)
vocab_size = 10
logits = np.random.randn(vocab_size)

print("Decoding strategies on logits:", logits.round(2))
print(f"  Greedy:       token {greedy_decode(logits)}")
print(f"  Temp=1.0:     token {temperature_sample(logits, 1.0)}")
print(f"  Temp=0.5:     token {temperature_sample(logits, 0.5)}")
print(f"  Top-k (k=3):  token {top_k_sample(logits, 3)}")
print(f"  Top-p (p=0.9): token {top_p_sample(logits, 0.9)}")

Monte Carlo Dropout for Uncertainty Estimation

import numpy as np

def mc_dropout_inference(model_forward, x, n_passes=100, dropout_rate=0.1):
    """
    Run n stochastic forward passes through the model
    to estimate prediction mean and uncertainty.
    """
    predictions = []
    for _ in range(n_passes):
        # Apply dropout mask manually (approximation)
        pred = model_forward(x, dropout_rate=dropout_rate)
        predictions.append(pred)

    predictions = np.array(predictions)  # (n_passes, n_classes)
    mean_pred = predictions.mean(axis=0)
    uncertainty = predictions.std(axis=0)  # epistemic uncertainty
    return mean_pred, uncertainty

# Simulate a simple model forward pass
def fake_model(x, dropout_rate=0.1):
    """Simulate a stochastic forward pass."""
    np.random.seed(None)  # ensure different random masks each call
    base_logits = np.array([2.0, -1.0, 0.5])
    # Dropout: randomly zero out some activations
    mask = np.random.binomial(1, 1-dropout_rate, size=3).astype(float)
    noisy_logits = base_logits * mask + 0.2 * np.random.randn(3)
    e = np.exp(noisy_logits - noisy_logits.max())
    return e / e.sum()

np.random.seed(42)
x = np.array([1.0, 2.0, 3.0])
mean_pred, uncertainty = mc_dropout_inference(fake_model, x, n_passes=100)

print(f"\nMC Dropout Inference (100 passes):")
print(f"  Mean prediction: {mean_pred.round(4)}")
print(f"  Uncertainty:     {uncertainty.round(4)}")
print(f"  Most uncertain class: {np.argmax(uncertainty)}")

8. Summary: Sampling Methods Reference

METHOD              | REQUIRES                    | BEST FOR
────────────────────|─────────────────────────────|──────────────────────────
Inverse CDF         | Closed-form F^{-1}          | 1D distributions with
                    |                             | known CDF inverse
────────────────────|─────────────────────────────|──────────────────────────
Rejection Sampling  | Proposal q(x) ≥ p(x)/M,    | Low-dimensional targets
                    | easy to sample from q       | when good proposal exists
────────────────────|─────────────────────────────|──────────────────────────
Importance Sampling | Proposal q(x) ≠ 0 wherever  | Estimating expectations
                    | p(x) ≠ 0                    | (not generating samples)
────────────────────|─────────────────────────────|──────────────────────────
Metropolis-Hastings | log p(x) up to constant     | General-purpose high-dim
                    | (unnormalized target OK)    | Bayesian inference
────────────────────|─────────────────────────────|──────────────────────────
Gibbs Sampling      | Tractable conditionals      | Bayesian networks,
                    | p(x_j | x_{-j})            | LDA, graphical models
────────────────────|─────────────────────────────|──────────────────────────
HMC / NUTS          | Gradient of log p(x)        | High-dim posteriors
                    |                             | (Stan, PyMC, NumPyro)
────────────────────|─────────────────────────────|──────────────────────────
Direct (Neural)     | Trained generative model    | Images, text, audio
                    | (GAN, VAE, Diffusion)       | generation

9. Interview Q&A

Q1: Explain the inverse CDF sampling method. Why does it work?

A: The inverse CDF method generates samples from a target distribution $F$ by sampling $U \sim \text{Uniform}(0,1)$ and returning $X = F^{-1}(U)$ . It works because $P(X \leq x) = P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x)$ - the last equality holds because $U$ is uniform, so $P(U \leq t) = t$ for $t \in [0,1]$ . This is the probability integral transform. Practical applications: generating Exponential samples as $X = -\log(U)/\lambda$ , generating from any distribution whose quantile function can be computed (e.g., scipy's .ppf() method). It fails when $F^{-1}$ is not available analytically, which happens for most multivariate distributions and complex posteriors in Bayesian models.

Q2: What is importance sampling and when does it fail?

A: Importance sampling estimates $\mathbb{E}_p[f(X)]$ without sampling from $p$ directly. The identity $\mathbb{E}_p[f(X)] = \mathbb{E}_q[f(X) w(X)]$ with importance weights $w(x) = p(x)/q(x)$ allows us to draw samples from a convenient proposal $q$ and reweight them. The self-normalized estimator $\sum_i w_i f(x_i) / \sum_i w_i$ even works when $p$ is only known up to a constant. It fails when the proposal $q$ has lighter tails than the target $p$ : if $p(x)/q(x) \to \infty$ for some $x$ , the importance weights have infinite variance, and the estimator has no finite variance - it can produce wildly inaccurate results. In practice this manifests as a few samples with enormous weights dominating the estimate, which can be detected by computing the effective sample size (ESS) - low ESS indicates a poor proposal.

Q3: What is the Metropolis-Hastings algorithm and what problem does it solve?

A: MH solves the problem of sampling from a distribution $p(x)$ known only up to a normalization constant: $p(x) = \tilde{p}(x)/Z$ where $Z = \int \tilde{p}(x) dx$ is intractable. The algorithm: starting from current state $x$ , propose $x^* \sim q(x^* \mid x)$ , then accept with probability $\alpha = \min(1, \frac{\tilde{p}(x^*) q(x \mid x^*)}{{\tilde{p}(x) q(x^* \mid x)}})$ ; otherwise stay at $x$ . The normalizing constant $Z$ cancels in the ratio, so we only need $\tilde{p}$ . The Markov chain converges to $p$ as its stationary distribution - this is guaranteed by the detailed balance condition: $p(x) q(x^* \mid x) \alpha(x \to x^*) = p(x^*) q(x \mid x^*) \alpha(x^* \to x)$ . Key challenges: choosing a good proposal (too narrow → slow mixing; too wide → high rejection rate), detecting convergence (burn-in), and handling high-dimensional spaces.

Q4: How does dropout relate to Bayesian inference and sampling?

A: Standard dropout during training randomly zeros activations with probability $p$ , which Gal and Ghahramani (2016) showed is equivalent to approximate variational Bayesian inference over the weights - each dropout mask is a sample from an approximate posterior over weights. At test time, standard practice disables dropout and uses expected weights (scaling by $1-p$ ), which approximates $\mathbb{E}_\theta[P(y \mid x, \theta)]$ crudely. Monte Carlo Dropout keeps dropout active at test time and runs $T$ stochastic forward passes. The mean prediction is a Monte Carlo approximation to the posterior predictive mean, and the variance across passes estimates epistemic uncertainty (uncertainty about model parameters). This converts any dropout-trained neural network into an approximate Bayesian neural network with no additional training cost - just $T$ forward passes at inference time.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Sampling Methods demo on the EngineersOfAI Playground - no code required.

:::

Q5: Why is MCMC necessary for high-dimensional Bayesian inference, and what are its practical limitations?

A: In Bayesian ML, we want to sample from the posterior $P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) P(\theta)$ over model parameters $\theta$ . For neural networks with millions of parameters, this posterior is a high-dimensional, non-Gaussian distribution with no closed form. Inverse CDF is unavailable, rejection sampling has exponentially small acceptance rates in high dimensions, and importance sampling produces near-zero ESS. MCMC handles this by constructing a Markov chain that explores the high-dimensional space ergodically, without needing a global normalization constant - only $\log p(\theta \mid \mathcal{D}) = \log P(\mathcal{D} \mid \theta) + \log P(\theta) + \text{const}$ is needed. Practical limitations: (1) convergence: it is hard to know when the chain has mixed; (2) autocorrelation: successive samples are correlated, reducing effective sample size; (3) cost: for large neural networks, even one MCMC step requires a full gradient computation - Hamiltonian Monte Carlo (HMC) needs many gradient evaluations per step; (4) scalability: standard MCMC is impractical for models with millions of parameters, which is why variational inference (ELBO, normalizing flows) is often preferred despite being approximate.

The ML Scenario That Motivates This Lesson​

1. Why Sampling Is Hard​

2. Inverse CDF (Inverse Transform) Method​

The Key Theorem​

Algorithm​

Example: Exponential Distribution​

Limitation​

3. Rejection Sampling​

Algorithm​

Limitation​

4. Importance Sampling​

The Key Identity​

Algorithm​

Effective Sample Size (ESS)​

5. Monte Carlo Integration​

Basic Monte Carlo Estimator​

Variance Reduction Techniques​

6. Markov Chain Monte Carlo (MCMC)​

The Core Idea​

Metropolis-Hastings Algorithm​

MCMC Diagnostics​

Gibbs Sampling​

7. Sampling in Deep Learning​

Language Model Decoding​

Monte Carlo Dropout for Uncertainty Estimation​

8. Summary: Sampling Methods Reference​

9. Interview Q&A​

The ML Scenario That Motivates This Lesson

1. Why Sampling Is Hard

2. Inverse CDF (Inverse Transform) Method

The Key Theorem

Algorithm

Example: Exponential Distribution

Limitation

3. Rejection Sampling

Algorithm

Limitation

4. Importance Sampling

The Key Identity

Algorithm

Effective Sample Size (ESS)

5. Monte Carlo Integration

Basic Monte Carlo Estimator

Variance Reduction Techniques

6. Markov Chain Monte Carlo (MCMC)

The Core Idea

Metropolis-Hastings Algorithm

MCMC Diagnostics

Gibbs Sampling

7. Sampling in Deep Learning

Language Model Decoding

Monte Carlo Dropout for Uncertainty Estimation

8. Summary: Sampling Methods Reference

9. Interview Q&A