What is score matching?

How Song and Ermon's score matching framework unifies DDPM and enables stochastic differential equations for continuous-time diffusion - the mathematical theory behind modern diffusion models, from score functions and Langevin dynamics through denoising score matching and the SDE unification.

How does score function work in practice?

Score-Based Generative Models - Diffusion Through the Lens of Score Matching covers score matching, score function, Langevin dynamics from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/diffusion-models/score-based-generative-models

What is the difference between score matching and Langevin dynamics?

See the full breakdown at https://engineersofai.com/docs/ml/diffusion-models/score-based-generative-models

Score-Based Generative Models - Diffusion Through the Lens of Score Matching

:::note Reading time: ~45 minutes | Interview relevance: High | Target roles: Research Engineer, ML Engineer, Applied Scientist, AI Engineer :::

The Real Interview Moment

You are at a research scientist interview at DeepMind. The interviewer leans forward. "Can you explain what the score function is, why you cannot train a score network by direct estimation, how denoising score matching solves this, and then connect the entire framework to DDPM?"

This question separates people who have read the DDPM paper from people who understand the deep probabilistic theory underneath it. The score matching view of diffusion matters for three distinct reasons.

First, it gives an independent mathematical justification for the DDPM training objective - so you are not just taking Ho et al.'s "empirically this works better" claim on faith. Second, it opens the door to the SDE framework (Song et al., 2021), which generalizes both DDPM and NCSN to a unified continuous-time formulation enabling faster ODE-based sampling. Third, it explains why Langevin MCMC sampling works, when it fails, and why annealing is necessary - the physical intuition that connects thermodynamics to machine learning.

Yang Song at Stanford in 2019 noticed: if you knew the score function $\nabla_x \log p(x)$ of the data distribution, Langevin dynamics could generate samples without ever computing $p(x)$ itself. The problem: you do not know $p(x)$ . His insight: train a neural network to estimate the score directly using denoising score matching - a technique that bypasses the intractable Jacobian trace. Two years later, his group unified this with DDPM under a single stochastic differential equation framework. This lesson covers both.

Why This Exists - The Problem with Density Estimation

The fundamental challenge in generative modeling: you want to sample from $p_\text{data}(x)$ - the true data distribution - but you only have finite samples from it. You cannot evaluate $p_\text{data}(x)$ for a given $x$ ; you can only draw examples.

Classical approaches either define an explicit parametric $p_\theta(x)$ and maximize likelihood (VAE, flow), or implicitly match features (GAN). Score matching takes a third path: forget $p_\theta(x)$ entirely. Instead, learn the gradient of the log-density:

$s_\theta(x) \approx \nabla_x \log p_\text{data}(x)$

Why the gradient? Because once you have it, Langevin dynamics generates samples - you need only the gradient of the log-density, not the density itself. The score points "uphill" toward regions of higher probability. Follow the score (with injected noise for mixing), and you converge to the data distribution.

This is elegant. But it comes with a severe practical problem - one that denoising score matching solves.

Historical Context

Score matching was introduced by Hyvärinen (2005) as a method for estimating parameters of unnormalized statistical models - specifically to avoid computing the partition function. The field largely ignored it for a decade. Vincent (2011) showed the equivalence between denoising score matching and the denoising autoencoder objective, providing a practical training procedure. Song and Ermon (2019) combined score matching with multiple noise levels and Langevin sampling to produce NCSN - the first competitive score-based generative model, achieving a new FID record on CIFAR-10. NCSNv2 (Song and Ermon, 2020) refined the noise schedule and achieved 2.45 FID on CIFAR-10. The unification with SDEs appeared in Song et al. (2021), connecting DDPM (VP-SDE), NCSN (VE-SDE), and introducing the probability flow ODE that underlies DDIM-style deterministic sampling.

1. The Score Function and Its Properties

Definition

The score function of a distribution $p(x)$ is:

$s(x) = \nabla_x \log p(x)$

This is a vector field - at every point $x \in \mathbb{R}^d$ , it gives a direction pointing toward regions of higher probability density.

Key property - independence from the normalizing constant: for any distribution $p(x) = \tilde{p}(x) / Z$ where $Z = \int \tilde{p}(x)\, dx$ :

$\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x) - \nabla_x \log Z = \nabla_x \log \tilde{p}(x)$

since $Z$ does not depend on $x$ . The score does not require computing $Z$ - this is the critical property that makes score-based models computationally tractable for energy-based models and other unnormalized densities.

Concrete examples:

Gaussian $p(x) = \mathcal{N}(\mu, \sigma^2)$ : $s(x) = \nabla_x \log p(x) = -\frac{x-\mu}{\sigma^2}$ . The score points from $x$ toward $\mu$ - it is a restoring force pulling toward the mean.
Mixture of Gaussians $p(x) = \sum_k \pi_k \mathcal{N}(\mu_k, \sigma_k^2)$ : $s(x) = \frac{\sum_k \pi_k \mathcal{N}(x;\mu_k,\sigma_k^2) \cdot (-\frac{x-\mu_k}{\sigma_k^2})}{\sum_k \pi_k \mathcal{N}(x;\mu_k,\sigma_k^2)}$ . The score is a weighted average of the component scores - pointing toward the nearest cluster center, weighted by posterior probability.

Langevin Dynamics: Sampling from the Score

Given access to the score function $s(x) = \nabla_x \log p(x)$ , Langevin dynamics samples from $p(x)$ via the iterative update:

$x_{k+1} = x_k + \frac{\varepsilon}{2}\, \nabla_x \log p(x_k) + \sqrt{\varepsilon}\, \eta_k, \qquad \eta_k \sim \mathcal{N}(0, \mathbf{I})$

As $k \to \infty$ and $\varepsilon \to 0$ , the distribution of $x_k$ converges to $p(x)$ under mild regularity conditions. The two terms have a clear physical interpretation:

Drift term $\frac{\varepsilon}{2}\nabla_x \log p(x_k)$ : moves $x$ toward regions of higher probability (uphill in log-density space). Without this term, the chain is a random walk - it explores uniformly.
Diffusion term $\sqrt{\varepsilon}\, \eta_k$ : injects Gaussian noise. Without this term, the chain would converge to a mode and stay there - it would not mix between modes or explore the distribution's tails.

The balance between drift (converging to a mode) and diffusion (mixing between modes) determines how well Langevin samples the full distribution. For multi-modal distributions (natural images have thousands of modes corresponding to different classes, objects, scenes), this balance is delicate.

Unadjusted Langevin Algorithm (ULA): the version above with finite $\varepsilon$ introduces a bias - the stationary distribution is approximately $p(x)$ but not exactly. The Metropolis-Adjusted Langevin Algorithm (MALA) adds a Metropolis-Hastings acceptance step to correct for this bias, but at additional computational cost. For deep learning applications, ULA (no correction) is used - the small bias is acceptable given the approximation from the learned score network.

2. The Problem with Direct Score Estimation

Naive Score Matching - The Jacobian Trace Problem

The most direct approach: minimize the Fisher divergence between the true score and the estimated score:

$J_\text{ESM}(\theta) = \frac{1}{2}\mathbb{E}_{p(x)}\!\left[\|s_\theta(x) - \nabla_x \log p(x)\|^2\right]$

This is explicit score matching. Fatal flaw: it requires $\nabla_x \log p(x)$ , which you do not have.

Via integration by parts (Hyvärinen, 2005), this is equivalent to:

$J_\text{ISM}(\theta) = \mathbb{E}_{p(x)}\!\left[\text{tr}\!\left(\nabla_x s_\theta(x)\right) + \frac{1}{2}\|s_\theta(x)\|^2\right] + \text{const}$

This implicit score matching objective depends only on samples from $p(x)$ - no $\nabla_x \log p(x)$ needed. But it requires $\text{tr}(\nabla_x s_\theta(x))$ - the trace of the Jacobian of the score network.

For a network with $d$ -dimensional output ( $d =$ image dimension), computing the Jacobian trace requires $d$ backward passes (one per output dimension). For CIFAR-10 images: $d = 3 \times 32 \times 32 = 3{,}072$ backward passes per training example. For $512 \times 512$ RGB images: $d = 786{,}432$ backward passes. Completely infeasible.

Sliced score matching (Song et al., 2019) approximates the Jacobian trace by projecting onto a random direction $\mathbf{v} \sim \mathcal{N}(0, \mathbf{I})$ :

$J_\text{SSM}(\theta) = \mathbb{E}_{p(x),\, \mathbf{v}}\!\left[\mathbf{v}^\top \nabla_x s_\theta(x)\, \mathbf{v} + \frac{1}{2}\|s_\theta(x)\|^2\right]$

The directional derivative $\mathbf{v}^\top \nabla_x s_\theta(x)\, \mathbf{v}$ costs only two backward passes (using Hessian-vector products). This is tractable but still slower than DSM. For high-dimensional images, denoising score matching is preferred.

3. Denoising Score Matching

The Core Insight

Vincent (2011): instead of training the score network on the clean data distribution $p_\text{data}(x)$ , train it on a noise-perturbed distribution $q_\sigma(\tilde{x}) = \int p(x)\mathcal{N}(\tilde{x};\, x,\, \sigma^2\mathbf{I})\, dx$ .

For a perturbed point $\tilde{x} = x + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \mathbf{I})$ , the score of the perturbed distribution is analytically known:

$\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = -\frac{\tilde{x} - x}{\sigma^2} = -\frac{\varepsilon}{\sigma}$

This score is tractable - it just points from $\tilde{x}$ back toward the original clean point $x$ , scaled by $1/\sigma^2$ .

Denoising Score Matching Objective

Train $s_\theta(\tilde{x}, \sigma)$ to estimate the score of the perturbed distribution:

$J_\text{DSM}(\theta) = \mathbb{E}_{x \sim p_\text{data},\, \varepsilon \sim \mathcal{N}(0,\mathbf{I})}\!\left[\left\|s_\theta(x + \sigma\varepsilon,\; \sigma) + \frac{\varepsilon}{\sigma}\right\|^2\right]$

The optimal $s_\theta^*(\tilde{x}, \sigma) = -\varepsilon/\sigma$ . This is simply MSE between the predicted score and the known analytical score of the Gaussian perturbation kernel. No Jacobian trace. No $d$ backward passes. Just a single standard backward pass.

The DDPM Connection: Score = Noise Prediction

Define $\varepsilon_\theta(\tilde{x}, \sigma) = -\sigma \cdot s_\theta(\tilde{x}, \sigma)$ . Then:

$J_\text{DSM}(\theta) = \mathbb{E}\!\left[\left\|\frac{\varepsilon_\theta(\tilde{x}, \sigma)}{\sigma} - \frac{\varepsilon}{\sigma}\right\|^2\right] \propto \mathbb{E}\!\left[\|\varepsilon_\theta(\tilde{x}, \sigma) - \varepsilon\|^2\right]$

This is exactly the DDPM simplified objective $\mathcal{L}_\text{simple} = \mathbb{E}[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2]$ , with $\sigma$ corresponding to $\sqrt{1-\bar{\alpha}_t}$ at timestep $t$ .

This is the deep connection: DDPM's noise prediction training is mathematically identical to denoising score matching. Ho et al.'s empirical finding that "predicting $\varepsilon$ works better than predicting $x_0$ " is equivalent to saying "denoising score matching works better than the $x_0$ -parameterized score matching objective." The relationship between the score network and the DDPM noise predictor is:

$s_\theta(x_t, t) = \nabla_{x_t} \log p_t(x_t) = -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$

4. Noise Conditional Score Networks (NCSN and NCSNv2)

Song and Ermon (2019) observed that a single noise level $\sigma$ is insufficient. At small $\sigma$ , the score landscape has sharp peaks around each data point - Langevin dynamics mixes very slowly (exponentially long to escape a mode). At large $\sigma$ , the score landscape is smooth and mixing is fast, but the perturbed distribution no longer resembles the data distribution.

NCSN solution: train a single score network jointly over $L$ noise levels $\{\sigma_i\}_{i=1}^L$ forming a geometric sequence from $\sigma_1 = \sigma_\text{max}$ (large) to $\sigma_L = \sigma_\text{min}$ (small):

$\mathcal{L}_\text{NCSN}(\theta) = \frac{1}{L}\sum_{i=1}^{L} \lambda(\sigma_i)\, \mathbb{E}_{x,\, \varepsilon}\!\left[\left\|s_\theta(x + \sigma_i\varepsilon,\; \sigma_i) + \frac{\varepsilon}{\sigma_i}\right\|^2\right]$

where $\lambda(\sigma_i) = \sigma_i^2$ (balances the loss magnitude across noise levels - without weighting, small- $\sigma$ terms dominate because $1/\sigma^2$ amplifies errors).

Generation via annealed Langevin dynamics: start at large $\sigma$ (smooth landscape, fast mixing), then gradually reduce $\sigma$ (sharper landscape, fine-grained refinement):

Initialize x ~ N(0, σ_max² I)
For σ in [σ_max, ..., σ_min]:              ← outer annealing loop
    α = ε · (σ / σ_min)²                  ← step size scaled to noise level
    For k in 1..K:                         ← inner Langevin loop
        x = x + (α/2) · sθ(x, σ) + √α · η
        η ~ N(0, I)
Return x

The $\alpha = \varepsilon(\sigma/\sigma_\text{min})^2$ scaling ensures the signal-to-noise ratio of the update is consistent across noise levels: at large $\sigma$ the score is small (in absolute value), so the step size is correspondingly large.

NCSNv2 (Song and Ermon, 2020) refined: geometric noise schedule with more levels, spectral normalization in the score network, and careful initialization. Achieved FID 2.45 on CIFAR-10 - competitive with state-of-the-art GANs of the time.

5. The SDE Framework - Unifying Everything

Song et al. (2021) showed that both DDPM and NCSN are discretizations of SDEs, and that this continuous-time view unlocks additional capabilities.

Forward SDE

The noise-addition process is described by an Itô stochastic differential equation:

$dx = f(x, t)\, dt + g(t)\, dW_t$

where:

$f(x, t)$ : the drift coefficient (deterministic force on $x$ )
$g(t)$ : the diffusion coefficient (amount of noise injected per unit time)
$W_t$ : standard Brownian motion (Wiener process) - the continuous-time limit of Gaussian random walks

As $t$ goes from 0 to $T$ , this SDE transforms $x_0 \sim p_\text{data}$ into $x_T \sim p_T$ (a known prior, typically $\mathcal{N}(0, \mathbf{I})$ ).

Reverse SDE (Anderson, 1982)

Any forward Itô SDE has a corresponding reverse-time SDE running the process backwards in time:

$dx = \left[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar{W}_t$

where $d\bar{W}_t$ is a reverse-time Brownian motion, and $\nabla_x \log p_t(x)$ is the score of the marginal distribution at time $t$ . This is the key: the reverse process requires the score function at each noise level - precisely what the score network $s_\theta(x, t)$ estimates.

The reverse SDE generates samples: start from $x_T \sim p_T$ , integrate backwards from $t = T$ to $t = 0$ using the learned score $s_\theta \approx \nabla_x \log p_t$ . This recovers $x_0 \sim p_\text{data}$ .

DDPM as VP-SDE (Variance Preserving)

The DDPM forward process $q(x_t|x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t\mathbf{I})$ in the limit $T \to \infty$ with $\beta(t) = T\beta_{t/T}$ becomes:

$dx = -\frac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dW_t \qquad \text{(VP-SDE)}$

The drift $f(x,t) = -\frac{1}{2}\beta(t)x$ contracts the signal toward zero. The diffusion $g(t) = \sqrt{\beta(t)}$ adds noise. The marginal at time $t$ is:

$p_t(x|x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}(t)}\, x_0,\; (1-\bar{\alpha}(t))\mathbf{I}\right)$

where $\bar{\alpha}(t) = \exp(-\int_0^t \beta(s)\, ds)$ . The variance of $p_t(x|x_0)$ is $1 - \bar{\alpha}(t)$ , which approaches 1 as $t \to T$ - hence "variance preserving" (the total variance stays at approximately 1).

NCSN as VE-SDE (Variance Exploding)

The NCSN forward process - adding Gaussian noise with increasing variance $\sigma^2(t)$ from 0 to $\sigma_\text{max}^2$ - in continuous time becomes:

$dx = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, dW_t \qquad \text{(VE-SDE)}$

The drift is zero ( $f(x,t) = 0$ ): no contraction. The diffusion coefficient grows with $\sigma(t)$ . The marginal is $p_t(x|x_0) = \mathcal{N}(x_0,\, \sigma^2(t)\mathbf{I})$ - the data convolved with growing Gaussian noise. The variance $\sigma^2(t)$ explodes to $\sigma_\text{max}^2$ - hence "variance exploding." The endpoint is not a clean $\mathcal{N}(0, \mathbf{I})$ prior but rather a large-variance Gaussian centered on the data.

VP vs VE comparison:

Property	VP-SDE (DDPM)	VE-SDE (NCSN)
Drift	$-\frac{1}{2}\beta(t)x$ (contracting)	0
Diffusion	$\sqrt{\beta(t)}$	$\sqrt{d[\sigma^2]/dt}$
Signal at $t=T$	Near zero (contracted)	Near $x_0$ (not contracted)
Prior $p_T$	$\approx \mathcal{N}(0, \mathbf{I})$ (clean)	$\mathcal{N}(x_0, \sigma_\text{max}^2\mathbf{I})$ (messy)
Preferred for	Generation (clean prior)	Density estimation

Probability Flow ODE

Key result: every SDE of the form $dx = f(x,t)\, dt + g(t)\, dW_t$ has a corresponding deterministic ODE with the same marginal distributions $\{p_t(x)\}_{t \in [0,T]}$ :

$\frac{dx}{dt} = f(x, t) - \frac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x)$

This probability flow ODE generates samples deterministically - no stochastic component. Replacing $\nabla_x \log p_t$ with the learned $s_\theta(x, t)$ gives a neural ODE that can be integrated with standard ODE solvers (Runge-Kutta, Adams-Bashforth).

Properties of the probability flow ODE:

Deterministic: given the same starting $x_T$ , always produces the same $x_0$ .
Invertible: run the ODE forward (from $t=0$ to $t=T$ ) to encode a real image into its noise space - exact inversion.
Likelihood computation: the instantaneous change of variables formula for ODEs gives: $\log p_0(x_0) = \log p_T(x_T) - \int_0^T \text{tr}(\nabla_x \frac{dx}{dt})\, dt$ , computable via the Hutchinson trace estimator in $O(1)$ backward passes.
Faster sampling: ODE solvers can take larger steps than SDE samplers without diverging. DDIM corresponds to numerically integrating the probability flow ODE for the VP-SDE with a specific first-order solver.

Code: Score Matching Training and Langevin Sampling

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import math


# ============================================================
# 1. Score Network (time-conditioned MLP for 2D toy data)
# ============================================================
class ScoreNetwork(nn.Module):
    """
    Noise-conditional score network: sθ(x, σ) ≈ ∇_x log p_σ(x).
    For images, replace with a time-conditioned U-Net.

    Input: x [batch, d] + σ [batch] → score [batch, d]
    """

    def __init__(self, input_dim: int = 2, hidden_dim: int = 256,
                 sigma_emb_dim: int = 32):
        super().__init__()

        # Log-sigma embedding: maps scalar log(σ) to a feature vector
        self.sigma_emb = nn.Sequential(
            nn.Linear(1, sigma_emb_dim),
            nn.SiLU(),
            nn.Linear(sigma_emb_dim, sigma_emb_dim),
            nn.SiLU(),
        )

        # Main score network
        self.net = nn.Sequential(
            nn.Linear(input_dim + sigma_emb_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, input_dim)
        )

    def forward(self, x: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
        """
        x     : [batch, d]
        sigma : [batch] - noise level for each sample
        Returns estimated score sθ(x, σ) = ≈ ∇_x log p_σ(x)
        """
        log_sigma = torch.log(sigma + 1e-8).unsqueeze(-1)   # [batch, 1]
        emb = self.sigma_emb(log_sigma)                      # [batch, emb_dim]
        return self.net(torch.cat([x, emb], dim=-1))


# ============================================================
# 2. Denoising Score Matching Loss (NCSN training objective)
# ============================================================
def ncsn_loss(score_net: ScoreNetwork,
              x_clean: torch.Tensor,
              sigmas: torch.Tensor) -> torch.Tensor:
    """
    NCSN training loss: denoising score matching over multiple noise levels.

    For each sample:
        1. Sample a random noise level σᵢ
        2. Add noise: x̃ = x + σᵢ·ε, ε ~ N(0, I)
        3. Target score: -ε/σᵢ  (score of N(x̃; x, σᵢ²I))
        4. Loss: σᵢ² · ‖sθ(x̃, σᵢ) - (-ε/σᵢ)‖²

    The σ² weighting balances loss magnitude across noise levels.
    Without it, small-σ terms (with 1/σ² score magnitude) dominate.

    Args:
        score_net : noise-conditional score network
        x_clean   : clean data [batch, d]
        sigmas    : geometric sequence of noise levels [L]
    """
    batch, d = x_clean.shape
    device = x_clean.device

    # Sample a random noise level for each example
    idx = torch.randint(len(sigmas), (batch,), device=device)
    sigma = sigmas[idx]                              # [batch]

    # Add noise: x̃ = x + σ·ε
    eps = torch.randn_like(x_clean)
    x_noisy = x_clean + sigma.unsqueeze(-1) * eps

    # Predicted score
    score_pred = score_net(x_noisy, sigma)           # [batch, d]

    # Target score: ∇_{x̃} log N(x̃; x, σ²I) = -(x̃ - x)/σ² = -ε/σ
    score_target = -eps / sigma.unsqueeze(-1)        # [batch, d]

    # Loss weighted by σ² (NCSN weighting scheme)
    loss = 0.5 * (sigma.unsqueeze(-1) ** 2) * (score_pred - score_target) ** 2
    return loss.mean()


# ============================================================
# 3. Annealed Langevin Dynamics Sampling
# ============================================================
@torch.no_grad()
def annealed_langevin_sampling(
    score_net: ScoreNetwork,
    sigmas: list[float],
    n_samples: int = 200,
    n_steps_per_sigma: int = 100,
    eps_base: float = 2e-5,
    dim: int = 2,
    device: str = "cpu",
) -> torch.Tensor:
    """
    NCSN sampling: annealed Langevin dynamics.

    Algorithm:
      Initialize x ~ N(0, σ_max² I)
      For σ in [σ_max → σ_min]:         (outer annealing loop)
        α = eps_base · (σ / σ_min)²     (step size: large at high noise)
        For k in 1..K:                  (inner Langevin loop)
          x ← x + (α/2) · sθ(x, σ) + √α · η,  η ~ N(0, I)

    Intuition for annealing:
      High σ → smooth landscape → fast Langevin mixing → gets near a mode
      Low σ  → sharp landscape  → slow mixing BUT already near correct mode
                                → fine-detail refinement
    """
    score_net.eval()
    sigma_min = sigmas[-1]

    # Initialize from large-noise prior
    x = torch.randn(n_samples, dim, device=device) * sigmas[0]

    for sigma in sigmas:
        sigma_t = torch.full((n_samples,), sigma, device=device)

        # Step size: scaled to maintain constant signal-to-noise ratio
        alpha = eps_base * (sigma / sigma_min) ** 2

        for _ in range(n_steps_per_sigma):
            score = score_net(x, sigma_t)                    # [n, d]
            noise = torch.randn_like(x)
            x = x + (alpha / 2.0) * score + math.sqrt(alpha) * noise

    return x


# ============================================================
# 4. Training NCSN on 2D Gaussian Mixture
# ============================================================
def make_gaussian_mixture(n: int = 30_000, n_modes: int = 8,
                          radius: float = 5.0) -> torch.Tensor:
    """
    Classic 2D score matching demo: 8 Gaussians on a circle.
    Tests whether the score network can learn multi-modal distributions.
    """
    angles = torch.linspace(0, 2 * math.pi, n_modes + 1)[:-1]
    centers = torch.stack([radius * angles.cos(), radius * angles.sin()], dim=1)
    idx = torch.randint(n_modes, (n,))
    return centers[idx] + 0.4 * torch.randn(n, 2)


def train_ncsn(
    n_epochs: int = 600,
    batch_size: int = 512,
    lr: float = 1e-3,
    sigma_begin: float = 10.0,
    sigma_end: float = 0.01,
    n_sigmas: int = 12,
    device: str = "cpu",
) -> tuple:
    """Train NCSN on 2D mixture, return (score_net, sigmas)."""
    data = make_gaussian_mixture(n=60_000).to(device)
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(data),
        batch_size=batch_size, shuffle=True, drop_last=True
    )

    # Geometric noise schedule: σ₁ > σ₂ > ... > σ_L
    sigmas = torch.exp(
        torch.linspace(math.log(sigma_begin), math.log(sigma_end), n_sigmas)
    ).to(device)

    score_net = ScoreNetwork(input_dim=2, hidden_dim=256).to(device)
    optimizer = torch.optim.Adam(score_net.parameters(), lr=lr,
                                  betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=n_epochs, eta_min=1e-5)

    print(f"Training NCSN | σ_max={sigma_begin} | σ_min={sigma_end} | "
          f"L={n_sigmas} noise levels")
    for epoch in range(n_epochs):
        total_loss = 0.0
        for (x_batch,) in loader:
            loss = ncsn_loss(score_net, x_batch.to(device), sigmas)
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(score_net.parameters(), 1.0)
            optimizer.step()
            total_loss += loss.item()
        scheduler.step()

        if (epoch + 1) % 200 == 0:
            avg = total_loss / len(loader)
            print(f"  Epoch {epoch+1:4d} | DSM Loss: {avg:.5f} | "
                  f"LR: {scheduler.get_last_lr()[0]:.2e}")

    return score_net, sigmas


# ============================================================
# 5. Score ↔ DDPM Noise Predictor Conversion
# ============================================================
def score_to_noise(score: torch.Tensor,
                   sigma_t: float) -> torch.Tensor:
    """
    Convert score estimate to DDPM noise prediction.

    Relationship:
        sθ(x_t, t) = ∇_{x_t} log p_t(x_t) ≈ -ε/σ_t
        ⟹ εθ(x_t, t) = -σ_t · sθ(x_t, t)

    where σ_t = √(1 - ᾱ_t) in DDPM notation.
    """
    return -sigma_t * score


def noise_to_score(eps_pred: torch.Tensor,
                   sigma_t: float) -> torch.Tensor:
    """
    Convert DDPM noise prediction to score estimate.

        sθ(x_t, t) = -εθ(x_t, t) / σ_t
    """
    return -eps_pred / (sigma_t + 1e-8)


# ============================================================
# 6. VP-SDE and VE-SDE Implementations
# ============================================================
class VESDE:
    """
    Variance Exploding SDE (NCSN forward process).
    dx = sqrt(d[σ²(t)]/dt) dW
    Marginal: p(x_t | x_0) = N(x_0, σ²(t) I)
    """

    def __init__(self, sigma_min: float = 0.01, sigma_max: float = 50.0):
        self.sigma_min = sigma_min
        self.sigma_max = sigma_max

    def sigma(self, t: torch.Tensor) -> torch.Tensor:
        """Geometric noise schedule: σ(t) = σ_min · (σ_max/σ_min)^t"""
        return self.sigma_min * (self.sigma_max / self.sigma_min) ** t

    def marginal_forward(self, x0: torch.Tensor, t: torch.Tensor,
                          eps: torch.Tensor | None = None
                          ) -> tuple[torch.Tensor, torch.Tensor]:
        """Sample x_t ~ p(x_t | x_0) = N(x_0, σ²(t) I)"""
        if eps is None:
            eps = torch.randn_like(x0)
        sigma_t = self.sigma(t)
        # Broadcast sigma_t for batch × spatial dims
        if x0.dim() > 1:
            sigma_t = sigma_t.view(-1, *([1] * (x0.dim() - 1)))
        return x0 + sigma_t * eps, sigma_t

    def score(self, x_t: torch.Tensor, x_0: torch.Tensor,
              t: torch.Tensor) -> torch.Tensor:
        """True score of p_t(x|x_0): -(x_t - x_0) / σ²(t)"""
        sigma_t = self.sigma(t)
        if x_t.dim() > 1:
            sigma_t = sigma_t.view(-1, *([1] * (x_t.dim() - 1)))
        return -(x_t - x_0) / (sigma_t ** 2)


class VPSDE:
    """
    Variance Preserving SDE (DDPM forward process in continuous time).
    dx = -½β(t)x dt + √β(t) dW
    Marginal: p(x_t|x_0) = N(√ᾱ(t) x_0, (1-ᾱ(t)) I)
    """

    def __init__(self, beta_min: float = 0.1, beta_max: float = 20.0):
        self.beta_min = beta_min
        self.beta_max = beta_max

    def beta(self, t: torch.Tensor) -> torch.Tensor:
        """Linear schedule: β(t) = β_min + t(β_max - β_min)"""
        return self.beta_min + t * (self.beta_max - self.beta_min)

    def alpha_bar(self, t: torch.Tensor) -> torch.Tensor:
        """
        ᾱ(t) = exp(-∫₀ᵗ β(s)ds) = exp(-½t²(β_max-β_min) - t·β_min)
        As t→1: ᾱ→0 → x_t → pure noise
        """
        return torch.exp(
            -0.5 * t ** 2 * (self.beta_max - self.beta_min)
            - t * self.beta_min
        )

    def marginal_forward(self, x0: torch.Tensor, t: torch.Tensor,
                          eps: torch.Tensor | None = None
                          ) -> tuple[torch.Tensor, torch.Tensor]:
        """Sample x_t ~ N(√ᾱ(t)·x_0, (1-ᾱ(t))·I)"""
        if eps is None:
            eps = torch.randn_like(x0)
        ab = self.alpha_bar(t)
        if x0.dim() > 1:
            ab = ab.view(-1, *([1] * (x0.dim() - 1)))
        return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, ab

    def score(self, x_t: torch.Tensor, x_0: torch.Tensor,
              t: torch.Tensor) -> torch.Tensor:
        """True score of p_t(x|x_0): -(x_t - √ᾱ x_0) / (1-ᾱ)"""
        ab = self.alpha_bar(t)
        if x_t.dim() > 1:
            ab = ab.view(-1, *([1] * (x_t.dim() - 1)))
        return -(x_t - ab.sqrt() * x_0) / (1 - ab)


# ── Compare VE and VP at t=1 (endpoint behavior) ─────────────────────────────
def compare_sde_endpoints():
    # Off-center data distribution
    x0 = torch.randn(2000, 2) + torch.tensor([4.0, 4.0])

    ve = VESDE(sigma_min=0.01, sigma_max=50.0)
    vp = VPSDE(beta_min=0.1, beta_max=20.0)

    t_end = torch.ones(2000)  # t=1 (endpoint of forward process)

    xT_ve, _ = ve.marginal_forward(x0, t_end)
    xT_vp, _ = vp.marginal_forward(x0, t_end)

    print("SDE endpoints (t=1):")
    print(f"  VE-SDE mean : {xT_ve.mean(0).tolist()}")   # ≈ (4,4) - not contracted
    print(f"  VE-SDE std  : {xT_ve.std(0).tolist()}")    # ≈ (50, 50) - huge variance
    print(f"  VP-SDE mean : {xT_vp.mean(0).tolist()}")   # ≈ (0, 0) - contracted to 0
    print(f"  VP-SDE std  : {xT_vp.std(0).tolist()}")    # ≈ (1, 1) - unit Gaussian


compare_sde_endpoints()

The Architecture: U-Net Score Network

Production Engineering Notes

:::tip Score vs noise prediction - which parameterization to use In HuggingFace Diffusers, prediction_type controls the parameterization:

"epsilon" (default, DDPM-style): network predicts $\varepsilon$ . Equivalent to $s_\theta = -\varepsilon_\theta / \sigma_t$ up to sign.
"v_prediction" (Salimans and Ho, 2022): network predicts the "velocity" $v = \sqrt{\bar{\alpha}_t}\varepsilon - \sqrt{1-\bar{\alpha}_t}x_0$ . More numerically stable for very high or very low noise levels. Used in Imagen and some Stable Diffusion variants.
"sample" (x0-prediction): network directly predicts $x_0$ . Less stable in practice - score matching theory suggests $\varepsilon$ -prediction works better.

The underlying math is identical; only the network's output target differs. Convert between them using the marginal distribution relationships. :::

:::note Likelihood computation via the probability flow ODE To compute $\log p_\theta(x_0)$ for a given image $x_0$ :

Encode $x_0 \to x_T$ by integrating the probability flow ODE forward from $t=0$ to $t=T$ . This is deterministic and invertible.
Compute $\log p_T(x_T) = \log\mathcal{N}(x_T; 0, \mathbf{I})$ - known analytically.
Add the change-of-variables term: $\log p_0(x_0) = \log p_T(x_T) - \int_0^T \text{tr}(\nabla_x \frac{dx}{dt})\, dt$ .
Estimate the integral via Hutchinson trace estimator: $\text{tr}(\mathbf{J}) \approx \mathbf{v}^\top\mathbf{J}\mathbf{v}$ with $\mathbf{v} \sim \mathcal{N}(0, \mathbf{I})$ .

This gives approximate log-likelihood in $O(T_\text{ODE})$ steps where $T_\text{ODE}$ can be as small as 20–50 with high-order ODE solvers. :::

:::warning Langevin dynamics requires small step size relative to curvature The Langevin update $x_{k+1} = x_k + \frac{\varepsilon}{2}s_\theta(x_k) + \sqrt{\varepsilon}\eta$ requires step size $\varepsilon$ small relative to the local curvature of $\log p(x)$ . Too large: the dynamics diverge. The curvature of the score landscape scales as $1/\sigma^2$ - much larger at small noise levels. NCSN's $\alpha = \varepsilon(\sigma/\sigma_\text{min})^2$ schedule compensates: step sizes scale proportionally to $\sigma^2$ , keeping the effective step size relative to local curvature constant. :::

Common Mistakes

:::danger Confusing the score of p(x) with the score of p_σ(x) The score network $s_\theta(x, \sigma)$ estimates $\nabla_x \log p_\sigma(x)$ - the score of the noise-perturbed distribution, not the original data distribution. At $\sigma \to 0$ , these converge. For finite $\sigma$ , they differ: the score of $p_\sigma$ is smoother and points toward the nearest cluster of data points, weighted by the Gaussian kernel of width $\sigma$ . The denoising score matching target $-\varepsilon/\sigma$ is the score of the Gaussian noise kernel $\mathcal{N}(\tilde{x}; x_0, \sigma^2\mathbf{I})$ , not of $p_\sigma(\tilde{x})$ - though Vincent (2011) proved these are equal in expectation. :::

:::danger Treating VE and VP SDEs as interchangeable VP-SDE (DDPM) contracts the signal toward zero - the endpoint $x_T \approx \mathcal{N}(0, \mathbf{I})$ is a clean, data-independent Gaussian prior. Starting generation from $\mathcal{N}(0, \mathbf{I})$ is straightforward. VE-SDE (NCSN) does not contract - the endpoint depends on the data: $x_T \approx \mathcal{N}(x_0, \sigma_\text{max}^2\mathbf{I})$ . The prior is data-dependent and very large in variance. For generation, VP is cleaner; for density estimation and inversion, VE is sometimes preferred. :::

:::warning Langevin dynamics fails in low-data-density regions The data manifold hypothesis says natural images occupy a thin low-dimensional manifold in $\mathbb{R}^d$ . In regions of zero data density - most of $\mathbb{R}^d$ - the score function is undefined or points toward the nearest data cluster but very weakly. Langevin dynamics initialized in a low-density region can take exponentially long to reach the manifold. This is exactly why NCSN uses large $\sigma$ to start: large noise "spreads" the data distribution throughout space, making the score defined and meaningful everywhere. The annealing schedule gradually reduces $\sigma$ , staying on the high-density region throughout. :::

YouTube Resources

Resource	Channel	Why Watch
Score Matching Tutorial - Yang Song	Stanford Lecture	The author of NCSN explains score matching, denoising score matching, and Langevin dynamics - the definitive reference
SDE Framework for Diffusion - NeurIPS 2021	Yang Song Invited Talk	The VP-SDE, VE-SDE, probability flow ODE, and DDIM connection - presented by the author
Diffusion Models Math - Outlier	Outlier	Visual walkthrough of score matching → Langevin → DDPM with animations
Langevin Dynamics and MCMC - Pieter Abbeel	UC Berkeley	Why Langevin MCMC converges to the target distribution - thermodynamics perspective
Score-Based Models and SDEs - Valentin De Bortoli	Gatsby Unit	Rigorous mathematical treatment - Fokker-Planck equations, convergence guarantees

Interview Q&A

Q1: What is the score function and why is it useful for generative modeling?

The score function of a distribution $p(x)$ is $s(x) = \nabla_x \log p(x)$ - the gradient of the log-density with respect to the input. It is a vector field pointing toward regions of higher probability density. Two properties make it useful: (1) It does not depend on the normalizing constant $Z$ of $p(x)$ , since $\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x)$ - computable from unnormalized models. (2) Langevin dynamics $x_{k+1} = x_k + \frac{\varepsilon}{2}s(x_k) + \sqrt{\varepsilon}\eta_k$ generates samples from $p(x)$ using only the score. Starting from any $x_0$ , the chain converges to $p(x)$ as step size $\varepsilon \to 0$ and steps $k \to \infty$ .

Q2: Explain denoising score matching. Why does it avoid the Jacobian trace problem?

Explicit score matching requires $\nabla_x \log p(x)$ (unknown). Implicit score matching avoids this but requires $\text{tr}(\nabla_x s_\theta(x))$ - the trace of the score Jacobian - costing $d$ backward passes (infeasible for $d = 3 \times 32 \times 32 = 3072$ ).

Denoising score matching (Vincent, 2011): perturb data with noise $\sigma$ to get $\tilde{x} = x + \sigma\varepsilon$ . The score of the noise-perturbed distribution is analytically known: $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = -(\tilde{x}-x)/\sigma^2 = -\varepsilon/\sigma$ . Training objective: $\mathbb{E}[\|s_\theta(\tilde{x}, \sigma) - (-\varepsilon/\sigma)\|^2]$ . No Jacobian trace - just a single backward pass per training example. Tractable at any image dimension.

Q3: How does the SDE framework unify DDPM and NCSN?

Song et al. (2021) show that both are discretizations of SDEs. DDPM corresponds to the VP-SDE: $dx = -\frac{1}{2}\beta(t)x\, dt + \sqrt{\beta(t)}\, dW$ , which contracts the signal toward zero. Marginal: $p_t(x|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}(t)}x_0, (1-\bar{\alpha}(t))\mathbf{I})$ . NCSN corresponds to the VE-SDE: $dx = \sqrt{d[\sigma^2(t)]/dt}\, dW$ , which adds noise without contraction. Marginal: $p_t(x|x_0) = \mathcal{N}(x_0, \sigma^2(t)\mathbf{I})$ .

Both are special cases of $dx = f(x,t)\, dt + g(t)\, dW$ . The reverse SDE for any such process is $dx = [f - g^2\nabla_x \log p_t]\, dt + g\, d\bar{W}$ . Plugging in the learned score $s_\theta \approx \nabla_x \log p_t$ gives the unified sampling algorithm. The SDE framework also enables continuous-time analysis, higher-order ODE solvers, and exact likelihood computation.

Q4: What is the probability flow ODE, and how does it relate to DDIM?

Every SDE $dx = f(x,t)\, dt + g(t)\, dW$ has an associated probability flow ODE $\frac{dx}{dt} = f(x,t) - \frac{1}{2}g(t)^2\nabla_x \log p_t(x)$ with identical marginals $\{p_t\}$ but deterministic trajectories. Replacing $\nabla_x \log p_t$ with $s_\theta$ gives a learnable neural ODE.

For the VP-SDE (DDPM), the probability flow ODE discretized with a first-order Euler solver is exactly the DDIM sampling update. DDIM's properties - determinism, invertibility, and ability to use fewer steps - all follow from solving an ODE rather than a stochastic process. Higher-order ODE solvers (Runge-Kutta 4, DPM-Solver) reduce the required function evaluations from 1000 to 10–20 at similar quality.

Q5: Why does annealed Langevin sampling outperform single-scale Langevin?

At $\sigma \approx 0$ (target distribution), the score landscape has sharp peaks around each data point. Langevin dynamics with any step size either overshoots (diverges) or requires exponentially many steps to move between modes. This is the classic multi-modal MCMC problem.

Annealing solves this hierarchically: at large $\sigma$ (smooth landscape), Langevin mixes quickly between all modes - the chain explores the global structure. At smaller $\sigma$ (sharper landscape), the chain refines within a mode - local detail is added. Each noise level uses the final state of the previous level as initialization - the chain never needs to escape a mode at small $\sigma$ . This is analogous to simulated annealing: start hot (high entropy, global exploration), cool slowly (converge to the target distribution's modes one by one).

Q6: Derive the connection between the DDPM noise prediction objective and denoising score matching.

DDPM trains $\varepsilon_\theta$ to minimize $\mathbb{E}[\|\varepsilon_\theta(x_t, t) - \varepsilon\|^2]$ where $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\varepsilon$ .

Denoising score matching trains $s_\theta$ to minimize $\mathbb{E}[\|s_\theta(x_t, \sigma_t) - (-\varepsilon/\sigma_t)\|^2]$ where $\sigma_t = \sqrt{1-\bar{\alpha}_t}$ .

Define $\varepsilon_\theta(x_t, t) = -\sigma_t \cdot s_\theta(x_t, \sigma_t)$ . Then:

$\mathbb{E}\!\left[\|s_\theta(x_t, \sigma_t) + \varepsilon/\sigma_t\|^2\right] = \mathbb{E}\!\left[\left\|\frac{-\varepsilon_\theta}{\sigma_t} + \frac{\varepsilon}{\sigma_t}\right\|^2\right] = \frac{1}{\sigma_t^2}\mathbb{E}\!\left[\|\varepsilon_\theta - \varepsilon\|^2\right]$

The DDPM objective is $\frac{1}{\sigma_t^2}$ times the DSM objective - they share the same minimizer. Therefore, training DDPM to predict noise $\varepsilon$ is mathematically equivalent to training a score network via denoising score matching. Ho et al.'s empirical observation that noise prediction works better than $x_0$ prediction corresponds to the DSM parameterization ( $\varepsilon$ -prediction) outperforming the direct score parameterization.

Fokker-Planck Equation and Score Dynamics

The SDE framework connects to a classical result from statistical physics: the Fokker-Planck equation (also called the Kolmogorov forward equation). For the SDE $dx = f(x,t)\, dt + g(t)\, dW$ , the marginal distribution $p_t(x)$ evolves according to:

$\frac{\partial p_t}{\partial t} = -\nabla_x \cdot (f\, p_t) + \frac{g(t)^2}{2}\Delta_x p_t$

where $\nabla_x \cdot$ is the divergence and $\Delta_x$ is the Laplacian. This is a PDE governing how the probability density evolves over time. For the VP-SDE with $f(x,t) = -\frac{1}{2}\beta(t)x$ :

$\frac{\partial p_t}{\partial t} = \frac{\beta(t)}{2}\nabla_x \cdot (x\, p_t) + \frac{\beta(t)}{2}\Delta_x p_t$

The stationary distribution of this PDE (as $t \to T$ ) is $\mathcal{N}(0, \mathbf{I})$ - exactly the prior we want for generation. The reverse SDE is derived by time-reversing the Fokker-Planck equation, which requires knowledge of $\nabla_x \log p_t(x)$ - the score function. This is the deep physical reason why the score is the central quantity in diffusion.

Practical: Inspecting Score Networks in HuggingFace Diffusers

# ── Inspecting how HuggingFace Diffusers handles score/noise prediction ───────
# This requires: pip install diffusers transformers

try:
    from diffusers import DDPMScheduler, UNet2DModel
    import torch

    # A minimal DDPM-style U-Net (the score network / noise predictor)
    # This is what sθ(x_t, t) looks like in production code
    unet = UNet2DModel(
        sample_size=32,          # 32×32 images (e.g., CIFAR-10)
        in_channels=3,           # RGB
        out_channels=3,          # predicted noise (same shape as input)
        layers_per_block=2,
        block_out_channels=(64, 128, 128, 256),
        down_block_types=("DownBlock2D", "DownBlock2D",
                          "AttnDownBlock2D", "DownBlock2D"),
        up_block_types=("UpBlock2D", "AttnUpBlock2D",
                        "UpBlock2D", "UpBlock2D"),
    )

    # DDPM noise scheduler: manages β_t, ᾱ_t, etc.
    scheduler = DDPMScheduler(
        num_train_timesteps=1000,
        beta_schedule="linear",
        beta_start=0.0001,
        beta_end=0.02,
        prediction_type="epsilon",   # predict ε (equivalent to DSM)
    )

    print(f"UNet parameters: {sum(p.numel() for p in unet.parameters()):,}")
    print(f"Scheduler: {scheduler.num_train_timesteps} steps")
    print(f"prediction_type: {scheduler.config.prediction_type}")
    print()

    # Training step demonstration
    x0 = torch.randn(4, 3, 32, 32)   # fake clean images
    t  = torch.randint(0, 1000, (4,))  # random timesteps

    # Add noise: q(x_t | x_0) = N(√ᾱ_t x_0, (1-ᾱ_t) I)
    eps = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, eps, t)

    # Predict noise with U-Net (= approximate score network)
    eps_pred = unet(x_t, t).sample

    # Simple MSE loss (= DSM objective reweighted)
    loss = torch.nn.functional.mse_loss(eps_pred, eps)
    print(f"Training loss example: {loss.item():.4f}")
    print(f"Input shape:  {x_t.shape}")
    print(f"Output shape: {eps_pred.shape}  (same as input - predicts ε)")

    # Convert noise prediction to score estimate
    # sθ(x_t, t) = -εθ(x_t, t) / √(1 - ᾱ_t)
    alpha_bars = scheduler.alphas_cumprod
    sigma_t = (1 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)
    score_estimate = -eps_pred / sigma_t
    print(f"Score estimate shape: {score_estimate.shape}")
    print(f"Score estimate norm (avg): {score_estimate.norm(dim=(1,2,3)).mean():.2f}")

except ImportError:
    print("diffusers not installed. Install with: pip install diffusers transformers")
    print("This code shows how production score networks are structured.")
    print("Key points:")
    print("  - UNet2DModel: the score/noise predictor sθ(x_t, t)")
    print("  - DDPMScheduler: manages the noise schedule (β_t, ᾱ_t)")
    print("  - prediction_type='epsilon': noise prediction = DSM training")
    print("  - Convert: score = -eps_pred / sqrt(1 - alpha_bar_t)")

Summary: Score Matching - The Unifying Thread

The score-based view provides the most coherent mathematical foundation for modern diffusion models. Tracing the logical chain:

Goal: sample from $p_\text{data}(x)$ without computing it.
Key tool: the score $\nabla_x \log p(x)$ + Langevin dynamics.
Problem: training the score network requires intractable Jacobian traces.
Solution: denoising score matching - perturb data, train score on the analytically-known score of the Gaussian kernel.
Multiple noise levels: NCSN extends this to a geometric sequence of $\sigma$ values, enabling annealed sampling that mixes globally at large $\sigma$ and refines locally at small $\sigma$ .
Continuous limit: the discrete multi-level process is the SDE $dx = f\, dt + g\, dW$ in continuous time.
Reverse process: Anderson's theorem gives the reverse SDE; substituting the learned score gives the generation algorithm.
DDPM connection: the DDPM noise prediction objective is algebraically identical to DSM, with $\varepsilon = -\sigma_t s$ .
Deterministic sampling: the probability flow ODE has the same marginals - solving it numerically gives DDIM, DPM-Solver, and related fast samplers.

This chain - from Langevin dynamics to Hyvärinen's score matching to Vincent's DSM to NCSN to Song's SDE framework to DDPM to DDIM - is the mathematical spine of the entire modern diffusion literature. Every paper in this area builds on some piece of this chain. Understanding it end-to-end puts you in a position to read and extend the research, not just apply pretrained models.

Sub-VP SDE: A Third Family

Song et al. (2021) introduced a third SDE beyond VP and VE - the sub-VP SDE:

$dx = -\frac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)\!\left(1 - e^{-2\int_0^t \beta(s)ds}\right)}\, dW_t$

The sub-VP SDE has smaller diffusion coefficient than the VP-SDE, giving it better likelihood estimation (closer to exact flow) while maintaining trainability. Its marginal distribution is:

$p_t(x|x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}(t)}\, x_0,\; \left(1 - \bar{\alpha}(t)\right)^2 \mathbf{I}\right)$

Note: the variance is $(1 - \bar{\alpha})^2$ versus $(1 - \bar{\alpha})$ for VP-SDE. For small $\bar{\alpha}$ (high noise), sub-VP has less variance than VP - hence "sub-variance-preserving." In practice, sub-VP achieves better bits-per-dim on density estimation benchmarks at the cost of slightly worse FID, making it the best choice when likelihood matters.

class SubVPSDE:
    """
    Sub-VP SDE - better likelihood, slightly worse FID than VP-SDE.
    Variance: (1 - alpha_bar(t))^2 instead of VP's (1 - alpha_bar(t)).
    """
    def __init__(self, beta_min: float = 0.1, beta_max: float = 20.0):
        self.beta_min = beta_min
        self.beta_max = beta_max

    def beta(self, t: float) -> float:
        return self.beta_min + t * (self.beta_max - self.beta_min)

    def alpha_bar(self, t: float) -> float:
        import math
        return math.exp(
            -0.5 * t ** 2 * (self.beta_max - self.beta_min)
            - t * self.beta_min
        )

    def marginal_variance(self, t: float) -> float:
        """
        VP-SDE variance:     (1 - ᾱ(t))
        Sub-VP-SDE variance: (1 - ᾱ(t))²
        Sub-VP is always smaller variance for the same t.
        """
        return (1 - self.alpha_bar(t)) ** 2

    def compare_variance(self):
        import numpy as np
        vpsde = VPSDE(self.beta_min, self.beta_max)
        ts = np.linspace(0, 1, 11)
        print("VP vs Sub-VP variance at different timesteps:")
        print(f"{'t':>5} | {'VP variance':>14} | {'Sub-VP variance':>16} | {'Ratio':>8}")
        print("-" * 52)
        for t in ts:
            vp_var    = float(1 - vpsde.alpha_bar(torch.tensor(t)).item())
            subvp_var = self.marginal_variance(float(t))
            ratio     = subvp_var / (vp_var + 1e-12)
            print(f"{t:>5.2f} | {vp_var:>14.4f} | {subvp_var:>16.4f} | {ratio:>8.4f}")


subvp = SubVPSDE()
subvp.compare_variance()

Connecting Everything: A Field Map

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Diffusion Process (DDPM) demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why This Exists - The Problem with Density Estimation​

Historical Context​

1. The Score Function and Its Properties​

Definition​

Langevin Dynamics: Sampling from the Score​

2. The Problem with Direct Score Estimation​

Naive Score Matching - The Jacobian Trace Problem​

3. Denoising Score Matching​

The Core Insight​

Denoising Score Matching Objective​

The DDPM Connection: Score = Noise Prediction​

4. Noise Conditional Score Networks (NCSN and NCSNv2)​

5. The SDE Framework - Unifying Everything​

Forward SDE​

Reverse SDE (Anderson, 1982)​

DDPM as VP-SDE (Variance Preserving)​

NCSN as VE-SDE (Variance Exploding)​

Probability Flow ODE​

Code: Score Matching Training and Langevin Sampling​

The Architecture: U-Net Score Network​

Production Engineering Notes​

Common Mistakes​

YouTube Resources​

Interview Q&A​

Fokker-Planck Equation and Score Dynamics​

Practical: Inspecting Score Networks in HuggingFace Diffusers​

Summary: Score Matching - The Unifying Thread​

Sub-VP SDE: A Third Family​

Connecting Everything: A Field Map​