Skip to main content

Numerical Integration - Quadrature, Monte Carlo, and Bayesian Inference

Reading time: ~22 minutes | Level: Numerical Methods → Probabilistic ML

In Bayesian inference, the posterior distribution is:

p(θD)=p(Dθ)p(θ)p(D)p(\theta | \mathcal{D}) = \frac{p(\mathcal{D} | \theta) \cdot p(\theta)}{p(\mathcal{D})}

The denominator p(D)=p(Dθ)p(θ)dθp(\mathcal{D}) = \int p(\mathcal{D} | \theta) p(\theta) \, d\theta is the normalizing constant - an integral over all possible parameter values. For a neural network with millions of parameters, this integral is over a million-dimensional space. No analytic solution exists. No grid-based quadrature can span a million dimensions.

This is why Bayesian deep learning is hard. And it is why Monte Carlo integration, importance sampling, and variational inference exist.

What You Will Learn

  • Classical quadrature: trapezoidal, Simpson's rule, Gaussian quadrature
  • The curse of dimensionality for deterministic integration
  • Monte Carlo integration: why it works and how fast it converges
  • Importance sampling: variance reduction via smart proposal distributions
  • Applications in Bayesian inference and ELBO optimization
  • Quasi-Monte Carlo: lower variance for structured integrands

Part 1 - Classical Quadrature Methods

The integration problem

We want to compute:

I=abf(x)dxI = \int_a^b f(x) \, dx

Quadrature methods approximate this as a weighted sum:

Ii=1nwif(xi)I \approx \sum_{i=1}^n w_i f(x_i)

where xix_i are nodes and wiw_i are weights.

Trapezoidal rule

Approximate ff as a piecewise linear function between n+1n+1 equally spaced points:

abf(x)dxh[f(x0)2+f(x1)++f(xn1)+f(xn)2],h=ban\int_a^b f(x) \, dx \approx h \left[ \frac{f(x_0)}{2} + f(x_1) + \cdots + f(x_{n-1}) + \frac{f(x_n)}{2} \right], \quad h = \frac{b-a}{n}

Error: O(h2)=O(1/n2)O(h^2) = O(1/n^2).

import numpy as np
from scipy import integrate

def trapezoidal_rule(f, a: float, b: float, n: int) -> float:
"""Trapezoidal rule with n intervals."""
x = np.linspace(a, b, n + 1)
y = f(x)
h = (b - a) / n
return h * (y[0]/2 + np.sum(y[1:-1]) + y[-1]/2)

# Equivalent to numpy.trapz
f = np.sin
a, b = 0, np.pi
true_value = 2.0 # ∫₀^π sin(x) dx = 2

for n in [10, 100, 1000]:
approx = trapezoidal_rule(f, a, b, n)
error = abs(approx - true_value)
print(f"n={n:5d}: approx={approx:.8f}, error={error:.2e}")

# Also available via scipy
result, error_est = integrate.quad(np.sin, 0, np.pi)
print(f"scipy.quad: {result:.10f}, error estimate: {error_est:.2e}")

Simpson's rule

Approximates ff with piecewise quadratics:

abf(x)dxh3[f(x0)+4f(x1)+2f(x2)+4f(x3)++4f(xn1)+f(xn)]\int_a^b f(x) \, dx \approx \frac{h}{3} \left[ f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + \cdots + 4f(x_{n-1}) + f(x_n) \right]

Error: O(h4)=O(1/n4)O(h^4) = O(1/n^4) - much faster convergence than trapezoidal.

from scipy.integrate import simps, quad
import numpy as np

f = lambda x: np.exp(-x**2) # Gaussian-like integrand

# Simpson's rule
x = np.linspace(0, 3, 101) # n must be odd for basic Simpson
y = f(x)
result_simps = simps(y, x) # scipy's Simpson's rule

# Compare with quad (adaptive quadrature - gold standard)
true_val, _ = quad(f, 0, 3)

print(f"Simpson's: {result_simps:.8f}")
print(f"quad: {true_val:.8f}")
print(f"Error: {abs(result_simps - true_val):.2e}")

Gaussian quadrature

Gaussian quadrature chooses nodes and weights optimally to integrate polynomials of degree up to 2n12n-1 exactly with just nn points. For smooth functions, it achieves exponential convergence.

import numpy as np
from numpy.polynomial.legendre import leggauss

def gaussian_quadrature(f, a: float, b: float, n: int) -> float:
"""
Gauss-Legendre quadrature on [a, b] with n nodes.
Exact for polynomials of degree ≤ 2n-1.
"""
# Nodes and weights for [-1, 1]
nodes, weights = leggauss(n)

# Transform from [-1, 1] to [a, b]
x = 0.5 * (b - a) * nodes + 0.5 * (b + a)
w = 0.5 * (b - a) * weights

return np.sum(w * f(x))

# Test on ∫₀^1 exp(-x²) dx
f = lambda x: np.exp(-x**2)
true_val, _ = integrate.quad(f, 0, 1)

for n in [2, 5, 10, 20]:
approx = gaussian_quadrature(f, 0, 1, n)
error = abs(approx - true_val)
print(f"n={n:2d}: approx={approx:.10f}, error={error:.2e}")
# n=20 already achieves machine precision for smooth f!

Part 2 - The Curse of Dimensionality for Deterministic Integration

All quadrature methods require grid points. For a dd-dimensional integral with nn points per dimension, the total number of function evaluations grows as ndn^d.

To achieve accuracy ϵ\epsilon with a quadrature rule of order pp, you need n=O(ϵ1/p)n = O(\epsilon^{-1/p}) points per dimension, giving:

Ntotal=nd=O(ϵd/p)N_{\text{total}} = n^d = O\left(\epsilon^{-d/p}\right)

ddnn per dimTotal pointsFeasibility
1100100Trivial
5100101010^{10}Borderline
10100102010^{20}Impossible
100010010200010^{2000}Absurd

For Bayesian neural networks with d=106d = 10^6 parameters - deterministic quadrature is completely impossible.

Monte Carlo integration has convergence rate independent of dimension - this is its killer advantage.

Part 3 - Monte Carlo Integration

The Monte Carlo estimator

To compute I=abf(x)dx=(ba)EXUniform(a,b)[f(X)]I = \int_a^b f(x) \, dx = (b-a) \cdot E_{X \sim \text{Uniform}(a,b)}[f(X)]:

  1. Sample x1,,xNUniform(a,b)x_1, \ldots, x_N \sim \text{Uniform}(a, b)
  2. Estimate: I^N=(ba)1Ni=1Nf(xi)\hat{I}_N = (b-a) \cdot \frac{1}{N} \sum_{i=1}^N f(x_i)

By the law of large numbers, I^NI\hat{I}_N \to I as NN \to \infty.

Convergence rate: By the CLT, the error decreases as:

std(I^N)=(ba)std(f)N=O(1N)\text{std}(\hat{I}_N) = \frac{(b-a) \cdot \text{std}(f)}{\sqrt{N}} = O\left(\frac{1}{\sqrt{N}}\right)

This is dimension-independent - the same O(1/N)O(1/\sqrt{N}) rate holds whether d=1d = 1 or d=106d = 10^6.

import numpy as np

def monte_carlo_integrate(
f,
sampler, # Function returning N samples from the integration domain
volume: float, # Volume of the integration domain
N: int = 10000
) -> tuple:
"""
Monte Carlo integration: I ≈ volume × (1/N) Σ f(xᵢ)
Returns: (estimate, standard_error)
"""
samples = sampler(N)
f_values = np.array([f(x) for x in samples])

estimate = volume * np.mean(f_values)
std_error = volume * np.std(f_values) / np.sqrt(N)
return estimate, std_error

# Estimate π via Monte Carlo: area of quarter circle
def in_quarter_circle(point):
return 1.0 if point[0]**2 + point[1]**2 <= 1.0 else 0.0

def uniform_sampler(N):
return np.random.uniform(0, 1, size=(N, 2))

for N in [100, 1000, 10000, 100000]:
pi_est, std_err = monte_carlo_integrate(
in_quarter_circle, uniform_sampler, volume=4.0, N=N
)
print(f"N={N:7d}: π ≈ {pi_est:.5f} ± {2*std_err:.5f}")
# Error decreases as ~1/√N - 10× more samples → ~3× smaller error

Monte Carlo for high-dimensional integrals

import numpy as np
from scipy.stats import multivariate_normal

def compute_normalizing_constant(
unnormalized_log_prob, # log p̃(θ) - unnormalized log density
proposal_sampler, # Sample from proposal q(θ)
log_proposal_prob, # log q(θ) - log density of proposal
N: int = 100000
) -> float:
"""
Estimate normalizing constant Z = ∫ p̃(θ) dθ using importance sampling.
Returns log Z for numerical stability.

Z = E_{q(θ)}[p̃(θ)/q(θ)] = E_{q(θ)}[exp(log p̃(θ) - log q(θ))]
"""
# Sample from proposal
theta_samples = proposal_sampler(N)

# Compute importance weights in log space
log_weights = np.array([
unnormalized_log_prob(theta) - log_proposal_prob(theta)
for theta in theta_samples
])

# log Z = log(mean(exp(log_weights))) - use logsumexp for stability
from scipy.special import logsumexp
log_Z = logsumexp(log_weights) - np.log(N)
return log_Z

Part 4 - Importance Sampling

The variance problem

Simple Monte Carlo uses f(xi)f(x_i) evaluations from a uniform distribution. If ff is concentrated in a small region, most samples contribute near zero - high variance.

Importance sampling uses a smarter proposal distribution q(x)q(x) that concentrates samples where ff is large:

I=f(x)dx=f(x)q(x)q(x)dx=EXq[f(X)q(X)]I = \int f(x) \, dx = \int \frac{f(x)}{q(x)} q(x) \, dx = E_{X \sim q}\left[\frac{f(X)}{q(X)}\right]

Estimator: I^IS=1Ni=1Nf(xi)q(xi)\hat{I}_{IS} = \frac{1}{N} \sum_{i=1}^N \frac{f(x_i)}{q(x_i)}, where xiqx_i \sim q.

The ratio w(x)=f(x)/q(x)w(x) = f(x)/q(x) is the importance weight.

Optimal proposal: The optimal q(x)f(x)q^*(x) \propto |f(x)| gives zero variance (but requires knowing the answer). In practice, we use a tractable qq that roughly matches the shape of f|f|.

import numpy as np
from scipy.stats import norm

def importance_sampling(
f,
q_sampler,
log_q,
log_f, # log f(x) for numerical stability
N: int = 10000
) -> tuple:
"""
Importance sampling estimator.
Returns: (estimate, effective_sample_size)
"""
# Sample from proposal
samples = q_sampler(N)

# Log importance weights: log w(x) = log f(x) - log q(x)
log_weights = np.array([log_f(x) - log_q(x) for x in samples])

# Normalize weights for stability
log_weights_normalized = log_weights - np.max(log_weights)
weights = np.exp(log_weights_normalized)
weights /= weights.sum()

# Estimate
f_values = np.array([f(x) for x in samples])
estimate = np.sum(weights * f_values)

# Effective sample size: 1/Σw² - measures diversity of samples
# ESS close to N: proposal matches target well
# ESS close to 1: proposal is very different from target
ess = 1.0 / np.sum(weights**2)

return estimate, ess

# Example: estimate E[X²] under N(0,1) using N(2,1) proposal
# True value: 1.0 (variance of standard normal)
np.random.seed(42)

# Target: N(0, 1)
log_target = lambda x: norm.logpdf(x, 0, 1)
target_f = lambda x: x**2 * np.exp(norm.logpdf(x, 0, 1))

# Proposal: shifted N(2, 1) - bad for this problem, illustrates weight collapse
log_proposal_bad = lambda x: norm.logpdf(x, 2, 1)
sampler_bad = lambda N: norm.rvs(2, 1, N)

# Proposal: N(0, 1) - perfect match
log_proposal_good = lambda x: norm.logpdf(x, 0, 1)
sampler_good = lambda N: norm.rvs(0, 1, N)

est_bad, ess_bad = importance_sampling(
lambda x: x**2, sampler_bad, log_proposal_bad, log_target, N=1000)
est_good, ess_good = importance_sampling(
lambda x: x**2, sampler_good, log_proposal_good, log_target, N=1000)

print(f"Bad proposal: E[X²] = {est_bad:.4f}, ESS = {ess_bad:.0f}/1000")
print(f"Good proposal: E[X²] = {est_good:.4f}, ESS = {ess_good:.0f}/1000")
# Good proposal: ESS ≈ N, Bad: ESS << N (weight collapse)

Self-normalized importance sampling

When the normalizing constant is unknown (common in Bayesian inference):

Ep[h(θ)]i=1Nw~(θi)h(θi)i=1Nw~(θi)E_{p}[h(\theta)] \approx \frac{\sum_{i=1}^N \tilde{w}(\theta_i) h(\theta_i)}{\sum_{i=1}^N \tilde{w}(\theta_i)}

where w~(θ)=p~(θ)/q(θ)\tilde{w}(\theta) = \tilde{p}(\theta) / q(\theta) uses the unnormalized density p~\tilde{p}.

Part 5 - Applications in Bayesian Inference and Variational Methods

The ELBO: numerical integration appears everywhere

The Evidence Lower BOund (ELBO) in variational inference is:

ELBO(q)=Eq(θ)[logp(Dθ)]KL(q(θ)p(θ))\text{ELBO}(q) = E_{q(\theta)}[\log p(\mathcal{D} | \theta)] - \text{KL}(q(\theta) \| p(\theta))

The expectation Eq(θ)[logp(Dθ)]E_{q(\theta)}[\log p(\mathcal{D} | \theta)] is an integral that must be computed numerically.

import torch
import torch.distributions as dist

def compute_elbo(
model: torch.nn.Module,
x: torch.Tensor,
y: torch.Tensor,
n_samples: int = 10
) -> torch.Tensor:
"""
Monte Carlo ELBO for a simple Bayesian neural network.
Approximates E_{q(θ)}[log p(y|x,θ)] using n_samples draws from q.
"""
# Variational posterior: q(θ) = N(μ, σ²) per weight
# (Simplified - in practice use a proper BNN library)

total_log_likelihood = 0.0

for _ in range(n_samples):
# Sample weights from variational posterior
sampled_weights = model.sample_weights() # From q(θ)

# Compute log p(y|x, θ)
predictions = model.forward_with_weights(x, sampled_weights)
log_likelihood = dist.Normal(predictions, 0.1).log_prob(y).sum()
total_log_likelihood += log_likelihood

# Monte Carlo estimate of the expected log likelihood
expected_log_lik = total_log_likelihood / n_samples

# KL divergence (analytic for Gaussian q and p)
kl = model.kl_divergence() # KL(q(θ) || p(θ))

return expected_log_lik - kl


def reparameterization_trick(
mu: torch.Tensor,
log_sigma: torch.Tensor,
n_samples: int = 1
) -> torch.Tensor:
"""
Reparameterization trick: sample θ ~ N(μ, σ²) as θ = μ + σ·ε, ε ~ N(0,1).
Makes the sampling operation differentiable w.r.t. μ and σ.
This is what makes variational autoencoders trainable.
"""
sigma = torch.exp(log_sigma)
eps = torch.randn(n_samples, *mu.shape) # ε ~ N(0, I)
return mu + sigma * eps # θ = μ + σ·ε

Monte Carlo integration in neural network training

import torch

def expected_reward_policy_gradient(
policy_network,
environment,
n_trajectories: int = 50
) -> torch.Tensor:
"""
REINFORCE: estimate E_{τ~π}[R(τ)] via Monte Carlo.
The policy gradient theorem requires this expectation.

∇_θ E[R] = E_{τ~π}[R(τ) ∇_θ log π(τ)]
≈ (1/N) Σᵢ R(τᵢ) ∇_θ log π(τᵢ)

Each trajectory τᵢ is a Monte Carlo sample.
"""
total_loss = 0.0

for _ in range(n_trajectories):
states, actions, rewards = environment.rollout(policy_network)

# Log probability of the trajectory under current policy
log_probs = policy_network.log_prob(states, actions)

# Total reward
R = sum(rewards)

# Policy gradient loss
total_loss += -log_probs.sum() * R # Negative for gradient ascent

return total_loss / n_trajectories


def stratified_sampling_variance_reduction(
f,
n_strata: int = 10,
n_per_stratum: int = 100
) -> tuple:
"""
Stratified sampling: divide [0,1] into n_strata equal intervals.
Sample uniformly within each stratum.
Reduces variance compared to uniform Monte Carlo.
"""
N = n_strata * n_per_stratum
strata_edges = np.linspace(0, 1, n_strata + 1)
estimates = []

for i in range(n_strata):
lo, hi = strata_edges[i], strata_edges[i+1]
samples = np.random.uniform(lo, hi, n_per_stratum)
stratum_mean = np.mean(f(samples)) * (hi - lo)
estimates.append(stratum_mean)

total = sum(estimates)
# Variance is reduced because each stratum is independently estimated
return total, np.var(estimates) / n_strata

Part 6 - Quasi-Monte Carlo

Standard Monte Carlo samples points randomly and achieves O(1/N)O(1/\sqrt{N}) convergence. Quasi-Monte Carlo (QMC) uses deterministic low-discrepancy sequences that fill the space more uniformly, achieving O((logN)d/N)O((\log N)^d / N) convergence - better than O(1/N)O(1/\sqrt{N}) for low-dimensional smooth integrands.

from scipy.stats.qmc import Sobol, Halton
import numpy as np

def qmc_integration(f, d: int, N: int = 10000, engine: str = 'sobol') -> float:
"""
Quasi-Monte Carlo integration over [0,1]^d.
Uses low-discrepancy sequences for better uniform coverage.
"""
if engine == 'sobol':
sampler = Sobol(d=d, scramble=True)
samples = sampler.random(N)
elif engine == 'halton':
sampler = Halton(d=d, scramble=True)
samples = sampler.random(N)
else:
# Standard Monte Carlo for comparison
samples = np.random.uniform(0, 1, size=(N, d))

f_values = np.array([f(x) for x in samples])
return np.mean(f_values) # Volume of [0,1]^d = 1

# Compare convergence: Monte Carlo vs Quasi-Monte Carlo
f = lambda x: np.sin(np.pi * x[0]) * np.cos(np.pi * x[1]) # 2D integrand
true_value = 0.0 # by symmetry

for N in [100, 1000, 10000]:
mc_est = qmc_integration(f, d=2, N=N, engine='random')
qmc_est = qmc_integration(f, d=2, N=N, engine='sobol')
print(f"N={N:6d}: MC error={abs(mc_est - true_value):.4f}, "
f"QMC error={abs(qmc_est - true_value):.4f}")
# QMC typically 5-10x more accurate for same N in low dimensions

Interview Questions

Q1: Why does Monte Carlo integration achieve dimension-independent convergence?

Deterministic quadrature requires nn points per dimension to achieve accuracy ϵ\epsilon at rate O(hp)O(h^p), giving N=ndN = n^d total evaluations - exponential in dimension.

Monte Carlo estimator is: I^N=1Ni=1Nf(Xi),XiUniform(Ω)\hat{I}_N = \frac{1}{N} \sum_{i=1}^N f(X_i), \quad X_i \sim \text{Uniform}(\Omega)

By the Central Limit Theorem: std(I^N)=σfN=O(1N)\text{std}(\hat{I}_N) = \frac{\sigma_f}{\sqrt{N}} = O\left(\frac{1}{\sqrt{N}}\right)

where σf=std(f(X))\sigma_f = \text{std}(f(X)) depends on the function but not on the dimension dd. To achieve error ϵ\epsilon: need N=O(σf2/ϵ2)N = O(\sigma_f^2 / \epsilon^2) - independent of dd.

Why does dimension not appear? Each sample XiRdX_i \in \mathbb{R}^d is drawn from the full dd-dimensional domain and evaluates ff at a random point. The variance of a single estimate f(Xi)f(X_i) is σf2\sigma_f^2 regardless of dd. Averaging NN such estimates reduces variance by NN, regardless of dd.

The trade-off: Monte Carlo has worse convergence than deterministic quadrature in 1D (where 1/N1/\sqrt{N} is slower than 1/N41/N^4 for Simpson's rule). But above roughly 8–10 dimensions, Monte Carlo beats all deterministic methods because of the curse of dimensionality.

Q2: What is importance sampling and when does it fail?

Importance sampling estimates I=f(x)p(x)dxI = \int f(x) p(x) dx using samples from a different distribution q(x)q(x):

I=f(x)p(x)q(x)q(x)dx=Eq[f(x)p(x)q(x)]1Ni=1Nf(xi)p(xi)q(xi)I = \int \frac{f(x) p(x)}{q(x)} q(x) dx = E_q\left[\frac{f(x) p(x)}{q(x)}\right] \approx \frac{1}{N} \sum_{i=1}^N \frac{f(x_i) p(x_i)}{q(x_i)}

The ratio w(x)=p(x)/q(x)w(x) = p(x)/q(x) is the importance weight.

When it fails (weight collapse):

  1. Heavy tails: If qq has lighter tails than pp, some samples have extremely large weights p(xi)/q(xi)1p(x_i)/q(x_i) \gg 1. A few samples dominate the estimate - effective sample size (ESS) collapses to near 1.

  2. Dimension mismatch: In high dimensions, pp and qq may be very different even if marginals look similar. Importance weights can have exponentially large variance.

  3. Support mismatch: If q(x)=0q(x) = 0 where f(x)p(x)>0f(x)p(x) > 0, those regions are never sampled - severe underestimation.

Diagnostic: Monitor the effective sample size (ESS): ESS=1i=1Nw^i2[1,N]\text{ESS} = \frac{1}{\sum_{i=1}^N \hat{w}_i^2} \in [1, N]

where w^i=wi/jwj\hat{w}_i = w_i / \sum_j w_j are normalized weights. ESS N\approx N means qpq \approx p (good). ESS 1\approx 1 means weight collapse (bad).

Practical rule: If ESS <N/10< N/10, the importance sampling estimate is unreliable - choose a better proposal distribution.

Q3: What is the ELBO and why does computing it require numerical integration?

The ELBO (Evidence Lower BOund) in variational inference:

ELBO(q)=Eq(θ)[logp(Dθ)]KL(q(θ)p(θ))\text{ELBO}(q) = E_{q(\theta)}[\log p(\mathcal{D} | \theta)] - \text{KL}(q(\theta) \| p(\theta))

is a lower bound on the log evidence logp(D)\log p(\mathcal{D}). Maximizing the ELBO is equivalent to minimizing KL(qp)\text{KL}(q \| p) - fitting the variational distribution qq to the posterior.

Why integration is needed: The first term is: Eq(θ)[logp(Dθ)]=q(θ)logp(Dθ)dθE_{q(\theta)}[\log p(\mathcal{D} | \theta)] = \int q(\theta) \log p(\mathcal{D} | \theta) \, d\theta

For a Bayesian neural network with millions of parameters, θRn\theta \in \mathbb{R}^n where nn can be millions. This integral has no closed form.

Solutions:

  1. Monte Carlo: Sample θ1,,θKq(θ)\theta_1, \ldots, \theta_K \sim q(\theta) and estimate: 1Kklogp(Dθk)\frac{1}{K} \sum_k \log p(\mathcal{D} | \theta_k). Works but requires KK forward passes per step.

  2. Reparameterization trick: Write θ=g(ϵ;ϕ)\theta = g(\epsilon; \phi) where ϵN(0,I)\epsilon \sim \mathcal{N}(0, I) (simple distribution). Now gradient ϕELBO\nabla_\phi \text{ELBO} can be computed via autodiff.

  3. Local reparameterization trick: Instead of sampling weights, sample activations - reduces variance because activations are lower-dimensional.

Q4: How does Simpson's rule achieve O(h⁴) accuracy while the trapezoidal rule only achieves O(h²)?

Error analysis via Euler-Maclaurin formula:

Trapezoidal rule error: abf(x)dxTn=(ba)312n2f(ξ)+O(h4)\int_a^b f(x) dx - T_n = -\frac{(b-a)^3}{12n^2} f''(\xi) + O(h^4)

The error is dominated by the ff'' term - O(h2)O(h^2).

Richardson extrapolation insight: The trapezoidal rule applied at step size hh and h/2h/2 can be combined to cancel the leading O(h2)O(h^2) error term:

Sn=4T2nTn3S_n = \frac{4 T_{2n} - T_n}{3}

This is exactly Simpson's rule. The O(h2)O(h^2) errors cancel, leaving O(h4)O(h^4).

Geometric interpretation: The trapezoidal rule fits flat tops (piecewise constant) between function values - only zeroth-order accuracy per panel. Wait - actually linear. The error comes from the second derivative.

Simpson's rule fits parabolas (degree-2 polynomials) through each pair of panels. A degree-2 rule integrates polynomials up to degree 3 exactly (a special property of the symmetric 3-point rule). The next non-zero error term involves f(4)f^{(4)}, hence O(h4)O(h^4).

Practical implication: For smooth integrands, use Gaussian quadrature (exponential convergence) or adaptive quadrature (scipy.integrate.quad). For tabulated data or functions with kinks, use Simpson's or trapezoidal rule.

Q5: Explain the reparameterization trick and why it matters for variational autoencoders.

The problem: In a VAE, we want to maximize: L(ϕ,θ)=Eqϕ(zx)[logpθ(xz)]KL(qϕ(zx)p(z))\mathcal{L}(\phi, \theta) = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \| p(z))

The first term requires differentiating through sampling: zqϕ(zx)=N(μϕ(x),σϕ2(x))z \sim q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x)).

Naive sampling is not differentiable: z=sample(qϕ)z = \text{sample}(q_\phi) - sampling is a stochastic node, gradients cannot flow through it.

The reparameterization trick: Instead of sampling zz directly, write:

z=μϕ(x)+σϕ(x)ε,εN(0,I)z = \mu_\phi(x) + \sigma_\phi(x) \cdot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)

Now ε\varepsilon is sampled from a fixed distribution (no parameters), and zz is a deterministic differentiable function of (ϕ,ε)(\phi, \varepsilon). Gradients flow through μϕ\mu_\phi and σϕ\sigma_\phi normally.

This is a specific application of importance sampling with a reparameterized proposal: we changed variables from zz (parameterized) to ε\varepsilon (fixed), transforming a stochastic computational node into a deterministic one.

Why it matters: Without reparameterization, training VAEs requires REINFORCE-style gradient estimators (high variance, slow convergence). With reparameterization, standard backpropagation works directly - VAEs train efficiently.

Quick Reference

MethodConvergence RateDimensionsNotes
Trapezoidal ruleO(h2)O(h^2)1DRobust, handles kinks
Simpson's ruleO(h4)O(h^4)1DBetter for smooth f
Gaussian quadratureExponential in n1DBest for smooth f, fixed interval
scipy.integrate.quadAdaptive1DGold standard for 1D
Monte CarloO(1/N)O(1/\sqrt{N})Any dDimension-independent
Importance samplingO(1/NESS/N)O(1/\sqrt{N \cdot \text{ESS}/N})Any dBetter if proposal matches target
Quasi-Monte CarloO((logN)d/N)O((\log N)^d / N)Low dBetter than MC for d < 10

Next: Lesson 06: Root-Finding Algorithms →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Numerical Integration demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.