Numerical Integration - Quadrature, Monte Carlo, and Bayesian Inference
Reading time: ~22 minutes | Level: Numerical Methods → Probabilistic ML
In Bayesian inference, the posterior distribution is:
The denominator is the normalizing constant - an integral over all possible parameter values. For a neural network with millions of parameters, this integral is over a million-dimensional space. No analytic solution exists. No grid-based quadrature can span a million dimensions.
This is why Bayesian deep learning is hard. And it is why Monte Carlo integration, importance sampling, and variational inference exist.
What You Will Learn
- Classical quadrature: trapezoidal, Simpson's rule, Gaussian quadrature
- The curse of dimensionality for deterministic integration
- Monte Carlo integration: why it works and how fast it converges
- Importance sampling: variance reduction via smart proposal distributions
- Applications in Bayesian inference and ELBO optimization
- Quasi-Monte Carlo: lower variance for structured integrands
Part 1 - Classical Quadrature Methods
The integration problem
We want to compute:
Quadrature methods approximate this as a weighted sum:
where are nodes and are weights.
Trapezoidal rule
Approximate as a piecewise linear function between equally spaced points:
Error: .
import numpy as np
from scipy import integrate
def trapezoidal_rule(f, a: float, b: float, n: int) -> float:
"""Trapezoidal rule with n intervals."""
x = np.linspace(a, b, n + 1)
y = f(x)
h = (b - a) / n
return h * (y[0]/2 + np.sum(y[1:-1]) + y[-1]/2)
# Equivalent to numpy.trapz
f = np.sin
a, b = 0, np.pi
true_value = 2.0 # ∫₀^π sin(x) dx = 2
for n in [10, 100, 1000]:
approx = trapezoidal_rule(f, a, b, n)
error = abs(approx - true_value)
print(f"n={n:5d}: approx={approx:.8f}, error={error:.2e}")
# Also available via scipy
result, error_est = integrate.quad(np.sin, 0, np.pi)
print(f"scipy.quad: {result:.10f}, error estimate: {error_est:.2e}")
Simpson's rule
Approximates with piecewise quadratics:
Error: - much faster convergence than trapezoidal.
from scipy.integrate import simps, quad
import numpy as np
f = lambda x: np.exp(-x**2) # Gaussian-like integrand
# Simpson's rule
x = np.linspace(0, 3, 101) # n must be odd for basic Simpson
y = f(x)
result_simps = simps(y, x) # scipy's Simpson's rule
# Compare with quad (adaptive quadrature - gold standard)
true_val, _ = quad(f, 0, 3)
print(f"Simpson's: {result_simps:.8f}")
print(f"quad: {true_val:.8f}")
print(f"Error: {abs(result_simps - true_val):.2e}")
Gaussian quadrature
Gaussian quadrature chooses nodes and weights optimally to integrate polynomials of degree up to exactly with just points. For smooth functions, it achieves exponential convergence.
import numpy as np
from numpy.polynomial.legendre import leggauss
def gaussian_quadrature(f, a: float, b: float, n: int) -> float:
"""
Gauss-Legendre quadrature on [a, b] with n nodes.
Exact for polynomials of degree ≤ 2n-1.
"""
# Nodes and weights for [-1, 1]
nodes, weights = leggauss(n)
# Transform from [-1, 1] to [a, b]
x = 0.5 * (b - a) * nodes + 0.5 * (b + a)
w = 0.5 * (b - a) * weights
return np.sum(w * f(x))
# Test on ∫₀^1 exp(-x²) dx
f = lambda x: np.exp(-x**2)
true_val, _ = integrate.quad(f, 0, 1)
for n in [2, 5, 10, 20]:
approx = gaussian_quadrature(f, 0, 1, n)
error = abs(approx - true_val)
print(f"n={n:2d}: approx={approx:.10f}, error={error:.2e}")
# n=20 already achieves machine precision for smooth f!
Part 2 - The Curse of Dimensionality for Deterministic Integration
All quadrature methods require grid points. For a -dimensional integral with points per dimension, the total number of function evaluations grows as .
To achieve accuracy with a quadrature rule of order , you need points per dimension, giving:
| per dim | Total points | Feasibility | |
|---|---|---|---|
| 1 | 100 | 100 | Trivial |
| 5 | 100 | Borderline | |
| 10 | 100 | Impossible | |
| 1000 | 100 | Absurd |
For Bayesian neural networks with parameters - deterministic quadrature is completely impossible.
Monte Carlo integration has convergence rate independent of dimension - this is its killer advantage.
Part 3 - Monte Carlo Integration
The Monte Carlo estimator
To compute :
- Sample
- Estimate:
By the law of large numbers, as .
Convergence rate: By the CLT, the error decreases as:
This is dimension-independent - the same rate holds whether or .
import numpy as np
def monte_carlo_integrate(
f,
sampler, # Function returning N samples from the integration domain
volume: float, # Volume of the integration domain
N: int = 10000
) -> tuple:
"""
Monte Carlo integration: I ≈ volume × (1/N) Σ f(xᵢ)
Returns: (estimate, standard_error)
"""
samples = sampler(N)
f_values = np.array([f(x) for x in samples])
estimate = volume * np.mean(f_values)
std_error = volume * np.std(f_values) / np.sqrt(N)
return estimate, std_error
# Estimate π via Monte Carlo: area of quarter circle
def in_quarter_circle(point):
return 1.0 if point[0]**2 + point[1]**2 <= 1.0 else 0.0
def uniform_sampler(N):
return np.random.uniform(0, 1, size=(N, 2))
for N in [100, 1000, 10000, 100000]:
pi_est, std_err = monte_carlo_integrate(
in_quarter_circle, uniform_sampler, volume=4.0, N=N
)
print(f"N={N:7d}: π ≈ {pi_est:.5f} ± {2*std_err:.5f}")
# Error decreases as ~1/√N - 10× more samples → ~3× smaller error
Monte Carlo for high-dimensional integrals
import numpy as np
from scipy.stats import multivariate_normal
def compute_normalizing_constant(
unnormalized_log_prob, # log p̃(θ) - unnormalized log density
proposal_sampler, # Sample from proposal q(θ)
log_proposal_prob, # log q(θ) - log density of proposal
N: int = 100000
) -> float:
"""
Estimate normalizing constant Z = ∫ p̃(θ) dθ using importance sampling.
Returns log Z for numerical stability.
Z = E_{q(θ)}[p̃(θ)/q(θ)] = E_{q(θ)}[exp(log p̃(θ) - log q(θ))]
"""
# Sample from proposal
theta_samples = proposal_sampler(N)
# Compute importance weights in log space
log_weights = np.array([
unnormalized_log_prob(theta) - log_proposal_prob(theta)
for theta in theta_samples
])
# log Z = log(mean(exp(log_weights))) - use logsumexp for stability
from scipy.special import logsumexp
log_Z = logsumexp(log_weights) - np.log(N)
return log_Z
Part 4 - Importance Sampling
The variance problem
Simple Monte Carlo uses evaluations from a uniform distribution. If is concentrated in a small region, most samples contribute near zero - high variance.
Importance sampling uses a smarter proposal distribution that concentrates samples where is large:
Estimator: , where .
The ratio is the importance weight.
Optimal proposal: The optimal gives zero variance (but requires knowing the answer). In practice, we use a tractable that roughly matches the shape of .
import numpy as np
from scipy.stats import norm
def importance_sampling(
f,
q_sampler,
log_q,
log_f, # log f(x) for numerical stability
N: int = 10000
) -> tuple:
"""
Importance sampling estimator.
Returns: (estimate, effective_sample_size)
"""
# Sample from proposal
samples = q_sampler(N)
# Log importance weights: log w(x) = log f(x) - log q(x)
log_weights = np.array([log_f(x) - log_q(x) for x in samples])
# Normalize weights for stability
log_weights_normalized = log_weights - np.max(log_weights)
weights = np.exp(log_weights_normalized)
weights /= weights.sum()
# Estimate
f_values = np.array([f(x) for x in samples])
estimate = np.sum(weights * f_values)
# Effective sample size: 1/Σw² - measures diversity of samples
# ESS close to N: proposal matches target well
# ESS close to 1: proposal is very different from target
ess = 1.0 / np.sum(weights**2)
return estimate, ess
# Example: estimate E[X²] under N(0,1) using N(2,1) proposal
# True value: 1.0 (variance of standard normal)
np.random.seed(42)
# Target: N(0, 1)
log_target = lambda x: norm.logpdf(x, 0, 1)
target_f = lambda x: x**2 * np.exp(norm.logpdf(x, 0, 1))
# Proposal: shifted N(2, 1) - bad for this problem, illustrates weight collapse
log_proposal_bad = lambda x: norm.logpdf(x, 2, 1)
sampler_bad = lambda N: norm.rvs(2, 1, N)
# Proposal: N(0, 1) - perfect match
log_proposal_good = lambda x: norm.logpdf(x, 0, 1)
sampler_good = lambda N: norm.rvs(0, 1, N)
est_bad, ess_bad = importance_sampling(
lambda x: x**2, sampler_bad, log_proposal_bad, log_target, N=1000)
est_good, ess_good = importance_sampling(
lambda x: x**2, sampler_good, log_proposal_good, log_target, N=1000)
print(f"Bad proposal: E[X²] = {est_bad:.4f}, ESS = {ess_bad:.0f}/1000")
print(f"Good proposal: E[X²] = {est_good:.4f}, ESS = {ess_good:.0f}/1000")
# Good proposal: ESS ≈ N, Bad: ESS << N (weight collapse)
Self-normalized importance sampling
When the normalizing constant is unknown (common in Bayesian inference):
where uses the unnormalized density .
Part 5 - Applications in Bayesian Inference and Variational Methods
The ELBO: numerical integration appears everywhere
The Evidence Lower BOund (ELBO) in variational inference is:
The expectation is an integral that must be computed numerically.
import torch
import torch.distributions as dist
def compute_elbo(
model: torch.nn.Module,
x: torch.Tensor,
y: torch.Tensor,
n_samples: int = 10
) -> torch.Tensor:
"""
Monte Carlo ELBO for a simple Bayesian neural network.
Approximates E_{q(θ)}[log p(y|x,θ)] using n_samples draws from q.
"""
# Variational posterior: q(θ) = N(μ, σ²) per weight
# (Simplified - in practice use a proper BNN library)
total_log_likelihood = 0.0
for _ in range(n_samples):
# Sample weights from variational posterior
sampled_weights = model.sample_weights() # From q(θ)
# Compute log p(y|x, θ)
predictions = model.forward_with_weights(x, sampled_weights)
log_likelihood = dist.Normal(predictions, 0.1).log_prob(y).sum()
total_log_likelihood += log_likelihood
# Monte Carlo estimate of the expected log likelihood
expected_log_lik = total_log_likelihood / n_samples
# KL divergence (analytic for Gaussian q and p)
kl = model.kl_divergence() # KL(q(θ) || p(θ))
return expected_log_lik - kl
def reparameterization_trick(
mu: torch.Tensor,
log_sigma: torch.Tensor,
n_samples: int = 1
) -> torch.Tensor:
"""
Reparameterization trick: sample θ ~ N(μ, σ²) as θ = μ + σ·ε, ε ~ N(0,1).
Makes the sampling operation differentiable w.r.t. μ and σ.
This is what makes variational autoencoders trainable.
"""
sigma = torch.exp(log_sigma)
eps = torch.randn(n_samples, *mu.shape) # ε ~ N(0, I)
return mu + sigma * eps # θ = μ + σ·ε
Monte Carlo integration in neural network training
import torch
def expected_reward_policy_gradient(
policy_network,
environment,
n_trajectories: int = 50
) -> torch.Tensor:
"""
REINFORCE: estimate E_{τ~π}[R(τ)] via Monte Carlo.
The policy gradient theorem requires this expectation.
∇_θ E[R] = E_{τ~π}[R(τ) ∇_θ log π(τ)]
≈ (1/N) Σᵢ R(τᵢ) ∇_θ log π(τᵢ)
Each trajectory τᵢ is a Monte Carlo sample.
"""
total_loss = 0.0
for _ in range(n_trajectories):
states, actions, rewards = environment.rollout(policy_network)
# Log probability of the trajectory under current policy
log_probs = policy_network.log_prob(states, actions)
# Total reward
R = sum(rewards)
# Policy gradient loss
total_loss += -log_probs.sum() * R # Negative for gradient ascent
return total_loss / n_trajectories
def stratified_sampling_variance_reduction(
f,
n_strata: int = 10,
n_per_stratum: int = 100
) -> tuple:
"""
Stratified sampling: divide [0,1] into n_strata equal intervals.
Sample uniformly within each stratum.
Reduces variance compared to uniform Monte Carlo.
"""
N = n_strata * n_per_stratum
strata_edges = np.linspace(0, 1, n_strata + 1)
estimates = []
for i in range(n_strata):
lo, hi = strata_edges[i], strata_edges[i+1]
samples = np.random.uniform(lo, hi, n_per_stratum)
stratum_mean = np.mean(f(samples)) * (hi - lo)
estimates.append(stratum_mean)
total = sum(estimates)
# Variance is reduced because each stratum is independently estimated
return total, np.var(estimates) / n_strata
Part 6 - Quasi-Monte Carlo
Standard Monte Carlo samples points randomly and achieves convergence. Quasi-Monte Carlo (QMC) uses deterministic low-discrepancy sequences that fill the space more uniformly, achieving convergence - better than for low-dimensional smooth integrands.
from scipy.stats.qmc import Sobol, Halton
import numpy as np
def qmc_integration(f, d: int, N: int = 10000, engine: str = 'sobol') -> float:
"""
Quasi-Monte Carlo integration over [0,1]^d.
Uses low-discrepancy sequences for better uniform coverage.
"""
if engine == 'sobol':
sampler = Sobol(d=d, scramble=True)
samples = sampler.random(N)
elif engine == 'halton':
sampler = Halton(d=d, scramble=True)
samples = sampler.random(N)
else:
# Standard Monte Carlo for comparison
samples = np.random.uniform(0, 1, size=(N, d))
f_values = np.array([f(x) for x in samples])
return np.mean(f_values) # Volume of [0,1]^d = 1
# Compare convergence: Monte Carlo vs Quasi-Monte Carlo
f = lambda x: np.sin(np.pi * x[0]) * np.cos(np.pi * x[1]) # 2D integrand
true_value = 0.0 # by symmetry
for N in [100, 1000, 10000]:
mc_est = qmc_integration(f, d=2, N=N, engine='random')
qmc_est = qmc_integration(f, d=2, N=N, engine='sobol')
print(f"N={N:6d}: MC error={abs(mc_est - true_value):.4f}, "
f"QMC error={abs(qmc_est - true_value):.4f}")
# QMC typically 5-10x more accurate for same N in low dimensions
Interview Questions
Q1: Why does Monte Carlo integration achieve dimension-independent convergence?
Deterministic quadrature requires points per dimension to achieve accuracy at rate , giving total evaluations - exponential in dimension.
Monte Carlo estimator is:
By the Central Limit Theorem:
where depends on the function but not on the dimension . To achieve error : need - independent of .
Why does dimension not appear? Each sample is drawn from the full -dimensional domain and evaluates at a random point. The variance of a single estimate is regardless of . Averaging such estimates reduces variance by , regardless of .
The trade-off: Monte Carlo has worse convergence than deterministic quadrature in 1D (where is slower than for Simpson's rule). But above roughly 8–10 dimensions, Monte Carlo beats all deterministic methods because of the curse of dimensionality.
Q2: What is importance sampling and when does it fail?
Importance sampling estimates using samples from a different distribution :
The ratio is the importance weight.
When it fails (weight collapse):
-
Heavy tails: If has lighter tails than , some samples have extremely large weights . A few samples dominate the estimate - effective sample size (ESS) collapses to near 1.
-
Dimension mismatch: In high dimensions, and may be very different even if marginals look similar. Importance weights can have exponentially large variance.
-
Support mismatch: If where , those regions are never sampled - severe underestimation.
Diagnostic: Monitor the effective sample size (ESS):
where are normalized weights. ESS means (good). ESS means weight collapse (bad).
Practical rule: If ESS , the importance sampling estimate is unreliable - choose a better proposal distribution.
Q3: What is the ELBO and why does computing it require numerical integration?
The ELBO (Evidence Lower BOund) in variational inference:
is a lower bound on the log evidence . Maximizing the ELBO is equivalent to minimizing - fitting the variational distribution to the posterior.
Why integration is needed: The first term is:
For a Bayesian neural network with millions of parameters, where can be millions. This integral has no closed form.
Solutions:
-
Monte Carlo: Sample and estimate: . Works but requires forward passes per step.
-
Reparameterization trick: Write where (simple distribution). Now gradient can be computed via autodiff.
-
Local reparameterization trick: Instead of sampling weights, sample activations - reduces variance because activations are lower-dimensional.
Q4: How does Simpson's rule achieve O(h⁴) accuracy while the trapezoidal rule only achieves O(h²)?
Error analysis via Euler-Maclaurin formula:
Trapezoidal rule error:
The error is dominated by the term - .
Richardson extrapolation insight: The trapezoidal rule applied at step size and can be combined to cancel the leading error term:
This is exactly Simpson's rule. The errors cancel, leaving .
Geometric interpretation: The trapezoidal rule fits flat tops (piecewise constant) between function values - only zeroth-order accuracy per panel. Wait - actually linear. The error comes from the second derivative.
Simpson's rule fits parabolas (degree-2 polynomials) through each pair of panels. A degree-2 rule integrates polynomials up to degree 3 exactly (a special property of the symmetric 3-point rule). The next non-zero error term involves , hence .
Practical implication: For smooth integrands, use Gaussian quadrature (exponential convergence) or adaptive quadrature (scipy.integrate.quad). For tabulated data or functions with kinks, use Simpson's or trapezoidal rule.
Q5: Explain the reparameterization trick and why it matters for variational autoencoders.
The problem: In a VAE, we want to maximize:
The first term requires differentiating through sampling: .
Naive sampling is not differentiable: - sampling is a stochastic node, gradients cannot flow through it.
The reparameterization trick: Instead of sampling directly, write:
Now is sampled from a fixed distribution (no parameters), and is a deterministic differentiable function of . Gradients flow through and normally.
This is a specific application of importance sampling with a reparameterized proposal: we changed variables from (parameterized) to (fixed), transforming a stochastic computational node into a deterministic one.
Why it matters: Without reparameterization, training VAEs requires REINFORCE-style gradient estimators (high variance, slow convergence). With reparameterization, standard backpropagation works directly - VAEs train efficiently.
Quick Reference
| Method | Convergence Rate | Dimensions | Notes |
|---|---|---|---|
| Trapezoidal rule | 1D | Robust, handles kinks | |
| Simpson's rule | 1D | Better for smooth f | |
| Gaussian quadrature | Exponential in n | 1D | Best for smooth f, fixed interval |
| scipy.integrate.quad | Adaptive | 1D | Gold standard for 1D |
| Monte Carlo | Any d | Dimension-independent | |
| Importance sampling | Any d | Better if proposal matches target | |
| Quasi-Monte Carlo | Low d | Better than MC for d < 10 |
Next: Lesson 06: Root-Finding Algorithms →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Numerical Integration demo on the EngineersOfAI Playground - no code required.
:::
