Numerical Integration - Quadrature, Monte Carlo, and Bayesian Inference

Reading time: ~22 minutes | Level: Numerical Methods → Probabilistic ML

In Bayesian inference, the posterior distribution is:

$p(\theta | \mathcal{D}) = \frac{p(\mathcal{D} | \theta) \cdot p(\theta)}{p(\mathcal{D})}$

The denominator $p(\mathcal{D}) = \int p(\mathcal{D} | \theta) p(\theta) \, d\theta$ is the normalizing constant - an integral over all possible parameter values. For a neural network with millions of parameters, this integral is over a million-dimensional space. No analytic solution exists. No grid-based quadrature can span a million dimensions.

This is why Bayesian deep learning is hard. And it is why Monte Carlo integration, importance sampling, and variational inference exist.

What You Will Learn

Classical quadrature: trapezoidal, Simpson's rule, Gaussian quadrature
The curse of dimensionality for deterministic integration
Monte Carlo integration: why it works and how fast it converges
Importance sampling: variance reduction via smart proposal distributions
Applications in Bayesian inference and ELBO optimization
Quasi-Monte Carlo: lower variance for structured integrands

Part 1 - Classical Quadrature Methods

The integration problem

We want to compute:

$I = \int_a^b f(x) \, dx$

Quadrature methods approximate this as a weighted sum:

$I \approx \sum_{i=1}^n w_i f(x_i)$

where $x_i$ are nodes and $w_i$ are weights.

Trapezoidal rule

Approximate $f$ as a piecewise linear function between $n+1$ equally spaced points:

$\int_a^b f(x) \, dx \approx h \left[ \frac{f(x_0)}{2} + f(x_1) + \cdots + f(x_{n-1}) + \frac{f(x_n)}{2} \right], \quad h = \frac{b-a}{n}$

Error: $O(h^2) = O(1/n^2)$ .

import numpy as np
from scipy import integrate

def trapezoidal_rule(f, a: float, b: float, n: int) -> float:
    """Trapezoidal rule with n intervals."""
    x = np.linspace(a, b, n + 1)
    y = f(x)
    h = (b - a) / n
    return h * (y[0]/2 + np.sum(y[1:-1]) + y[-1]/2)

# Equivalent to numpy.trapz
f = np.sin
a, b = 0, np.pi
true_value = 2.0  # ∫₀^π sin(x) dx = 2

for n in [10, 100, 1000]:
    approx = trapezoidal_rule(f, a, b, n)
    error = abs(approx - true_value)
    print(f"n={n:5d}: approx={approx:.8f}, error={error:.2e}")

# Also available via scipy
result, error_est = integrate.quad(np.sin, 0, np.pi)
print(f"scipy.quad: {result:.10f}, error estimate: {error_est:.2e}")

Simpson's rule

Approximates $f$ with piecewise quadratics:

$\int_a^b f(x) \, dx \approx \frac{h}{3} \left[ f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + \cdots + 4f(x_{n-1}) + f(x_n) \right]$

Error: $O(h^4) = O(1/n^4)$ - much faster convergence than trapezoidal.

from scipy.integrate import simps, quad
import numpy as np

f = lambda x: np.exp(-x**2)  # Gaussian-like integrand

# Simpson's rule
x = np.linspace(0, 3, 101)   # n must be odd for basic Simpson
y = f(x)
result_simps = simps(y, x)    # scipy's Simpson's rule

# Compare with quad (adaptive quadrature - gold standard)
true_val, _ = quad(f, 0, 3)

print(f"Simpson's: {result_simps:.8f}")
print(f"quad:      {true_val:.8f}")
print(f"Error:     {abs(result_simps - true_val):.2e}")

Gaussian quadrature

Gaussian quadrature chooses nodes and weights optimally to integrate polynomials of degree up to $2n-1$ exactly with just $n$ points. For smooth functions, it achieves exponential convergence.

import numpy as np
from numpy.polynomial.legendre import leggauss

def gaussian_quadrature(f, a: float, b: float, n: int) -> float:
    """
    Gauss-Legendre quadrature on [a, b] with n nodes.
    Exact for polynomials of degree ≤ 2n-1.
    """
    # Nodes and weights for [-1, 1]
    nodes, weights = leggauss(n)

    # Transform from [-1, 1] to [a, b]
    x = 0.5 * (b - a) * nodes + 0.5 * (b + a)
    w = 0.5 * (b - a) * weights

    return np.sum(w * f(x))

# Test on ∫₀^1 exp(-x²) dx
f = lambda x: np.exp(-x**2)
true_val, _ = integrate.quad(f, 0, 1)

for n in [2, 5, 10, 20]:
    approx = gaussian_quadrature(f, 0, 1, n)
    error = abs(approx - true_val)
    print(f"n={n:2d}: approx={approx:.10f}, error={error:.2e}")
# n=20 already achieves machine precision for smooth f!

Part 2 - The Curse of Dimensionality for Deterministic Integration

All quadrature methods require grid points. For a $d$ -dimensional integral with $n$ points per dimension, the total number of function evaluations grows as $n^d$ .

To achieve accuracy $\epsilon$ with a quadrature rule of order $p$ , you need $n = O(\epsilon^{-1/p})$ points per dimension, giving:

$N_{\text{total}} = n^d = O\left(\epsilon^{-d/p}\right)$

$d$	$n$ per dim	Total points	Feasibility
1	100	100	Trivial
5	100	$10^{10}$	Borderline
10	100	$10^{20}$	Impossible
1000	100	$10^{2000}$	Absurd

For Bayesian neural networks with $d = 10^6$ parameters - deterministic quadrature is completely impossible.

Monte Carlo integration has convergence rate independent of dimension - this is its killer advantage.

Part 3 - Monte Carlo Integration

The Monte Carlo estimator

To compute $I = \int_a^b f(x) \, dx = (b-a) \cdot E_{X \sim \text{Uniform}(a,b)}[f(X)]$ :

Sample $x_1, \ldots, x_N \sim \text{Uniform}(a, b)$
Estimate: $\hat{I}_N = (b-a) \cdot \frac{1}{N} \sum_{i=1}^N f(x_i)$

By the law of large numbers, $\hat{I}_N \to I$ as $N \to \infty$ .

Convergence rate: By the CLT, the error decreases as:

$\text{std}(\hat{I}_N) = \frac{(b-a) \cdot \text{std}(f)}{\sqrt{N}} = O\left(\frac{1}{\sqrt{N}}\right)$

This is dimension-independent - the same $O(1/\sqrt{N})$ rate holds whether $d = 1$ or $d = 10^6$ .

import numpy as np

def monte_carlo_integrate(
    f,
    sampler,           # Function returning N samples from the integration domain
    volume: float,     # Volume of the integration domain
    N: int = 10000
) -> tuple:
    """
    Monte Carlo integration: I ≈ volume × (1/N) Σ f(xᵢ)
    Returns: (estimate, standard_error)
    """
    samples = sampler(N)
    f_values = np.array([f(x) for x in samples])

    estimate = volume * np.mean(f_values)
    std_error = volume * np.std(f_values) / np.sqrt(N)
    return estimate, std_error

# Estimate π via Monte Carlo: area of quarter circle
def in_quarter_circle(point):
    return 1.0 if point[0]**2 + point[1]**2 <= 1.0 else 0.0

def uniform_sampler(N):
    return np.random.uniform(0, 1, size=(N, 2))

for N in [100, 1000, 10000, 100000]:
    pi_est, std_err = monte_carlo_integrate(
        in_quarter_circle, uniform_sampler, volume=4.0, N=N
    )
    print(f"N={N:7d}: π ≈ {pi_est:.5f} ± {2*std_err:.5f}")
    # Error decreases as ~1/√N - 10× more samples → ~3× smaller error

Monte Carlo for high-dimensional integrals

import numpy as np
from scipy.stats import multivariate_normal

def compute_normalizing_constant(
    unnormalized_log_prob,    # log p̃(θ) - unnormalized log density
    proposal_sampler,         # Sample from proposal q(θ)
    log_proposal_prob,        # log q(θ) - log density of proposal
    N: int = 100000
) -> float:
    """
    Estimate normalizing constant Z = ∫ p̃(θ) dθ using importance sampling.
    Returns log Z for numerical stability.

    Z = E_{q(θ)}[p̃(θ)/q(θ)] = E_{q(θ)}[exp(log p̃(θ) - log q(θ))]
    """
    # Sample from proposal
    theta_samples = proposal_sampler(N)

    # Compute importance weights in log space
    log_weights = np.array([
        unnormalized_log_prob(theta) - log_proposal_prob(theta)
        for theta in theta_samples
    ])

    # log Z = log(mean(exp(log_weights))) - use logsumexp for stability
    from scipy.special import logsumexp
    log_Z = logsumexp(log_weights) - np.log(N)
    return log_Z

Part 4 - Importance Sampling

The variance problem

Simple Monte Carlo uses $f(x_i)$ evaluations from a uniform distribution. If $f$ is concentrated in a small region, most samples contribute near zero - high variance.

Importance sampling uses a smarter proposal distribution $q(x)$ that concentrates samples where $f$ is large:

$I = \int f(x) \, dx = \int \frac{f(x)}{q(x)} q(x) \, dx = E_{X \sim q}\left[\frac{f(X)}{q(X)}\right]$

Estimator: $\hat{I}_{IS} = \frac{1}{N} \sum_{i=1}^N \frac{f(x_i)}{q(x_i)}$ , where $x_i \sim q$ .

The ratio $w(x) = f(x)/q(x)$ is the importance weight.

Optimal proposal: The optimal $q^*(x) \propto |f(x)|$ gives zero variance (but requires knowing the answer). In practice, we use a tractable $q$ that roughly matches the shape of $|f|$ .

import numpy as np
from scipy.stats import norm

def importance_sampling(
    f,
    q_sampler,
    log_q,
    log_f,          # log f(x) for numerical stability
    N: int = 10000
) -> tuple:
    """
    Importance sampling estimator.
    Returns: (estimate, effective_sample_size)
    """
    # Sample from proposal
    samples = q_sampler(N)

    # Log importance weights: log w(x) = log f(x) - log q(x)
    log_weights = np.array([log_f(x) - log_q(x) for x in samples])

    # Normalize weights for stability
    log_weights_normalized = log_weights - np.max(log_weights)
    weights = np.exp(log_weights_normalized)
    weights /= weights.sum()

    # Estimate
    f_values = np.array([f(x) for x in samples])
    estimate = np.sum(weights * f_values)

    # Effective sample size: 1/Σw² - measures diversity of samples
    # ESS close to N: proposal matches target well
    # ESS close to 1: proposal is very different from target
    ess = 1.0 / np.sum(weights**2)

    return estimate, ess

# Example: estimate E[X²] under N(0,1) using N(2,1) proposal
# True value: 1.0 (variance of standard normal)
np.random.seed(42)

# Target: N(0, 1)
log_target = lambda x: norm.logpdf(x, 0, 1)
target_f = lambda x: x**2 * np.exp(norm.logpdf(x, 0, 1))

# Proposal: shifted N(2, 1) - bad for this problem, illustrates weight collapse
log_proposal_bad = lambda x: norm.logpdf(x, 2, 1)
sampler_bad = lambda N: norm.rvs(2, 1, N)

# Proposal: N(0, 1) - perfect match
log_proposal_good = lambda x: norm.logpdf(x, 0, 1)
sampler_good = lambda N: norm.rvs(0, 1, N)

est_bad, ess_bad = importance_sampling(
    lambda x: x**2, sampler_bad, log_proposal_bad, log_target, N=1000)
est_good, ess_good = importance_sampling(
    lambda x: x**2, sampler_good, log_proposal_good, log_target, N=1000)

print(f"Bad proposal:  E[X²] = {est_bad:.4f}, ESS = {ess_bad:.0f}/1000")
print(f"Good proposal: E[X²] = {est_good:.4f}, ESS = {ess_good:.0f}/1000")
# Good proposal: ESS ≈ N, Bad: ESS << N (weight collapse)

Self-normalized importance sampling

When the normalizing constant is unknown (common in Bayesian inference):

$E_{p}[h(\theta)] \approx \frac{\sum_{i=1}^N \tilde{w}(\theta_i) h(\theta_i)}{\sum_{i=1}^N \tilde{w}(\theta_i)}$

where $\tilde{w}(\theta) = \tilde{p}(\theta) / q(\theta)$ uses the unnormalized density $\tilde{p}$ .

Part 5 - Applications in Bayesian Inference and Variational Methods

The ELBO: numerical integration appears everywhere

The Evidence Lower BOund (ELBO) in variational inference is:

$\text{ELBO}(q) = E_{q(\theta)}[\log p(\mathcal{D} | \theta)] - \text{KL}(q(\theta) \| p(\theta))$

The expectation $E_{q(\theta)}[\log p(\mathcal{D} | \theta)]$ is an integral that must be computed numerically.

import torch
import torch.distributions as dist

def compute_elbo(
    model: torch.nn.Module,
    x: torch.Tensor,
    y: torch.Tensor,
    n_samples: int = 10
) -> torch.Tensor:
    """
    Monte Carlo ELBO for a simple Bayesian neural network.
    Approximates E_{q(θ)}[log p(y|x,θ)] using n_samples draws from q.
    """
    # Variational posterior: q(θ) = N(μ, σ²) per weight
    # (Simplified - in practice use a proper BNN library)

    total_log_likelihood = 0.0

    for _ in range(n_samples):
        # Sample weights from variational posterior
        sampled_weights = model.sample_weights()  # From q(θ)

        # Compute log p(y|x, θ)
        predictions = model.forward_with_weights(x, sampled_weights)
        log_likelihood = dist.Normal(predictions, 0.1).log_prob(y).sum()
        total_log_likelihood += log_likelihood

    # Monte Carlo estimate of the expected log likelihood
    expected_log_lik = total_log_likelihood / n_samples

    # KL divergence (analytic for Gaussian q and p)
    kl = model.kl_divergence()   # KL(q(θ) || p(θ))

    return expected_log_lik - kl


def reparameterization_trick(
    mu: torch.Tensor,
    log_sigma: torch.Tensor,
    n_samples: int = 1
) -> torch.Tensor:
    """
    Reparameterization trick: sample θ ~ N(μ, σ²) as θ = μ + σ·ε, ε ~ N(0,1).
    Makes the sampling operation differentiable w.r.t. μ and σ.
    This is what makes variational autoencoders trainable.
    """
    sigma = torch.exp(log_sigma)
    eps = torch.randn(n_samples, *mu.shape)   # ε ~ N(0, I)
    return mu + sigma * eps                    # θ = μ + σ·ε

Monte Carlo integration in neural network training

import torch

def expected_reward_policy_gradient(
    policy_network,
    environment,
    n_trajectories: int = 50
) -> torch.Tensor:
    """
    REINFORCE: estimate E_{τ~π}[R(τ)] via Monte Carlo.
    The policy gradient theorem requires this expectation.

    ∇_θ E[R] = E_{τ~π}[R(τ) ∇_θ log π(τ)]
             ≈ (1/N) Σᵢ R(τᵢ) ∇_θ log π(τᵢ)

    Each trajectory τᵢ is a Monte Carlo sample.
    """
    total_loss = 0.0

    for _ in range(n_trajectories):
        states, actions, rewards = environment.rollout(policy_network)

        # Log probability of the trajectory under current policy
        log_probs = policy_network.log_prob(states, actions)

        # Total reward
        R = sum(rewards)

        # Policy gradient loss
        total_loss += -log_probs.sum() * R   # Negative for gradient ascent

    return total_loss / n_trajectories


def stratified_sampling_variance_reduction(
    f,
    n_strata: int = 10,
    n_per_stratum: int = 100
) -> tuple:
    """
    Stratified sampling: divide [0,1] into n_strata equal intervals.
    Sample uniformly within each stratum.
    Reduces variance compared to uniform Monte Carlo.
    """
    N = n_strata * n_per_stratum
    strata_edges = np.linspace(0, 1, n_strata + 1)
    estimates = []

    for i in range(n_strata):
        lo, hi = strata_edges[i], strata_edges[i+1]
        samples = np.random.uniform(lo, hi, n_per_stratum)
        stratum_mean = np.mean(f(samples)) * (hi - lo)
        estimates.append(stratum_mean)

    total = sum(estimates)
    # Variance is reduced because each stratum is independently estimated
    return total, np.var(estimates) / n_strata

Part 6 - Quasi-Monte Carlo

Standard Monte Carlo samples points randomly and achieves $O(1/\sqrt{N})$ convergence. Quasi-Monte Carlo (QMC) uses deterministic low-discrepancy sequences that fill the space more uniformly, achieving $O((\log N)^d / N)$ convergence - better than $O(1/\sqrt{N})$ for low-dimensional smooth integrands.

from scipy.stats.qmc import Sobol, Halton
import numpy as np

def qmc_integration(f, d: int, N: int = 10000, engine: str = 'sobol') -> float:
    """
    Quasi-Monte Carlo integration over [0,1]^d.
    Uses low-discrepancy sequences for better uniform coverage.
    """
    if engine == 'sobol':
        sampler = Sobol(d=d, scramble=True)
        samples = sampler.random(N)
    elif engine == 'halton':
        sampler = Halton(d=d, scramble=True)
        samples = sampler.random(N)
    else:
        # Standard Monte Carlo for comparison
        samples = np.random.uniform(0, 1, size=(N, d))

    f_values = np.array([f(x) for x in samples])
    return np.mean(f_values)   # Volume of [0,1]^d = 1

# Compare convergence: Monte Carlo vs Quasi-Monte Carlo
f = lambda x: np.sin(np.pi * x[0]) * np.cos(np.pi * x[1])   # 2D integrand
true_value = 0.0   # by symmetry

for N in [100, 1000, 10000]:
    mc_est = qmc_integration(f, d=2, N=N, engine='random')
    qmc_est = qmc_integration(f, d=2, N=N, engine='sobol')
    print(f"N={N:6d}: MC error={abs(mc_est - true_value):.4f}, "
          f"QMC error={abs(qmc_est - true_value):.4f}")
# QMC typically 5-10x more accurate for same N in low dimensions

Interview Questions

Q1: Why does Monte Carlo integration achieve dimension-independent convergence?

Deterministic quadrature requires $n$ points per dimension to achieve accuracy $\epsilon$ at rate $O(h^p)$ , giving $N = n^d$ total evaluations - exponential in dimension.

Monte Carlo estimator is: $\hat{I}_N = \frac{1}{N} \sum_{i=1}^N f(X_i), \quad X_i \sim \text{Uniform}(\Omega)$

By the Central Limit Theorem: $\text{std}(\hat{I}_N) = \frac{\sigma_f}{\sqrt{N}} = O\left(\frac{1}{\sqrt{N}}\right)$

where $\sigma_f = \text{std}(f(X))$ depends on the function but not on the dimension $d$ . To achieve error $\epsilon$ : need $N = O(\sigma_f^2 / \epsilon^2)$ - independent of $d$ .

Why does dimension not appear? Each sample $X_i \in \mathbb{R}^d$ is drawn from the full $d$ -dimensional domain and evaluates $f$ at a random point. The variance of a single estimate $f(X_i)$ is $\sigma_f^2$ regardless of $d$ . Averaging $N$ such estimates reduces variance by $N$ , regardless of $d$ .

The trade-off: Monte Carlo has worse convergence than deterministic quadrature in 1D (where $1/\sqrt{N}$ is slower than $1/N^4$ for Simpson's rule). But above roughly 8–10 dimensions, Monte Carlo beats all deterministic methods because of the curse of dimensionality.

Q2: What is importance sampling and when does it fail?

Importance sampling estimates $I = \int f(x) p(x) dx$ using samples from a different distribution $q(x)$ :

$I = \int \frac{f(x) p(x)}{q(x)} q(x) dx = E_q\left[\frac{f(x) p(x)}{q(x)}\right] \approx \frac{1}{N} \sum_{i=1}^N \frac{f(x_i) p(x_i)}{q(x_i)}$

The ratio $w(x) = p(x)/q(x)$ is the importance weight.

When it fails (weight collapse):

Heavy tails: If $q$ has lighter tails than $p$ , some samples have extremely large weights $p(x_i)/q(x_i) \gg 1$ . A few samples dominate the estimate - effective sample size (ESS) collapses to near 1.
Dimension mismatch: In high dimensions, $p$ and $q$ may be very different even if marginals look similar. Importance weights can have exponentially large variance.
Support mismatch: If $q(x) = 0$ where $f(x)p(x) > 0$ , those regions are never sampled - severe underestimation.

Diagnostic: Monitor the effective sample size (ESS): $\text{ESS} = \frac{1}{\sum_{i=1}^N \hat{w}_i^2} \in [1, N]$

where $\hat{w}_i = w_i / \sum_j w_j$ are normalized weights. ESS $\approx N$ means $q \approx p$ (good). ESS $\approx 1$ means weight collapse (bad).

Practical rule: If ESS $< N/10$ , the importance sampling estimate is unreliable - choose a better proposal distribution.

Q3: What is the ELBO and why does computing it require numerical integration?

The ELBO (Evidence Lower BOund) in variational inference:

$\text{ELBO}(q) = E_{q(\theta)}[\log p(\mathcal{D} | \theta)] - \text{KL}(q(\theta) \| p(\theta))$

is a lower bound on the log evidence $\log p(\mathcal{D})$ . Maximizing the ELBO is equivalent to minimizing $\text{KL}(q \| p)$ - fitting the variational distribution $q$ to the posterior.

Why integration is needed: The first term is: $E_{q(\theta)}[\log p(\mathcal{D} | \theta)] = \int q(\theta) \log p(\mathcal{D} | \theta) \, d\theta$

For a Bayesian neural network with millions of parameters, $\theta \in \mathbb{R}^n$ where $n$ can be millions. This integral has no closed form.

Solutions:

Monte Carlo: Sample $\theta_1, \ldots, \theta_K \sim q(\theta)$ and estimate: $\frac{1}{K} \sum_k \log p(\mathcal{D} | \theta_k)$ . Works but requires $K$ forward passes per step.
Reparameterization trick: Write $\theta = g(\epsilon; \phi)$ where $\epsilon \sim \mathcal{N}(0, I)$ (simple distribution). Now gradient $\nabla_\phi \text{ELBO}$ can be computed via autodiff.
Local reparameterization trick: Instead of sampling weights, sample activations - reduces variance because activations are lower-dimensional.

Q4: How does Simpson's rule achieve O(h⁴) accuracy while the trapezoidal rule only achieves O(h²)?

Error analysis via Euler-Maclaurin formula:

Trapezoidal rule error: $\int_a^b f(x) dx - T_n = -\frac{(b-a)^3}{12n^2} f''(\xi) + O(h^4)$

The error is dominated by the $f''$ term - $O(h^2)$ .

Richardson extrapolation insight: The trapezoidal rule applied at step size $h$ and $h/2$ can be combined to cancel the leading $O(h^2)$ error term:

$S_n = \frac{4 T_{2n} - T_n}{3}$

This is exactly Simpson's rule. The $O(h^2)$ errors cancel, leaving $O(h^4)$ .

Geometric interpretation: The trapezoidal rule fits flat tops (piecewise constant) between function values - only zeroth-order accuracy per panel. Wait - actually linear. The error comes from the second derivative.

Simpson's rule fits parabolas (degree-2 polynomials) through each pair of panels. A degree-2 rule integrates polynomials up to degree 3 exactly (a special property of the symmetric 3-point rule). The next non-zero error term involves $f^{(4)}$ , hence $O(h^4)$ .

Practical implication: For smooth integrands, use Gaussian quadrature (exponential convergence) or adaptive quadrature (scipy.integrate.quad). For tabulated data or functions with kinks, use Simpson's or trapezoidal rule.

Q5: Explain the reparameterization trick and why it matters for variational autoencoders.

The problem: In a VAE, we want to maximize: $\mathcal{L}(\phi, \theta) = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \| p(z))$

The first term requires differentiating through sampling: $z \sim q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$ .

Naive sampling is not differentiable: $z = \text{sample}(q_\phi)$ - sampling is a stochastic node, gradients cannot flow through it.

The reparameterization trick: Instead of sampling $z$ directly, write:

$z = \mu_\phi(x) + \sigma_\phi(x) \cdot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)$

Now $\varepsilon$ is sampled from a fixed distribution (no parameters), and $z$ is a deterministic differentiable function of $(\phi, \varepsilon)$ . Gradients flow through $\mu_\phi$ and $\sigma_\phi$ normally.

This is a specific application of importance sampling with a reparameterized proposal: we changed variables from $z$ (parameterized) to $\varepsilon$ (fixed), transforming a stochastic computational node into a deterministic one.

Why it matters: Without reparameterization, training VAEs requires REINFORCE-style gradient estimators (high variance, slow convergence). With reparameterization, standard backpropagation works directly - VAEs train efficiently.

Quick Reference

Method	Convergence Rate	Dimensions	Notes
Trapezoidal rule	$O(h^2)$	1D	Robust, handles kinks
Simpson's rule	$O(h^4)$	1D	Better for smooth f
Gaussian quadrature	Exponential in n	1D	Best for smooth f, fixed interval
scipy.integrate.quad	Adaptive	1D	Gold standard for 1D
Monte Carlo	$O(1/\sqrt{N})$	Any d	Dimension-independent
Importance sampling	$O(1/\sqrt{N \cdot \text{ESS}/N})$	Any d	Better if proposal matches target
Quasi-Monte Carlo	$O((\log N)^d / N)$	Low d	Better than MC for d < 10

Next: Lesson 06: Root-Finding Algorithms →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Numerical Integration demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Part 1 - Classical Quadrature Methods​

The integration problem​

Trapezoidal rule​

Simpson's rule​

Gaussian quadrature​

Part 2 - The Curse of Dimensionality for Deterministic Integration​

Part 3 - Monte Carlo Integration​

The Monte Carlo estimator​

Monte Carlo for high-dimensional integrals​

Part 4 - Importance Sampling​

The variance problem​

Self-normalized importance sampling​

Part 5 - Applications in Bayesian Inference and Variational Methods​

The ELBO: numerical integration appears everywhere​

Monte Carlo integration in neural network training​

Part 6 - Quasi-Monte Carlo​

Interview Questions​

Quick Reference​