Skip to main content

Score-Based Generative Models - Diffusion Through the Lens of Score Matching

:::note Reading time: ~45 minutes | Interview relevance: High | Target roles: Research Engineer, ML Engineer, Applied Scientist, AI Engineer :::

The Real Interview Moment

You are at a research scientist interview at DeepMind. The interviewer leans forward. "Can you explain what the score function is, why you cannot train a score network by direct estimation, how denoising score matching solves this, and then connect the entire framework to DDPM?"

This question separates people who have read the DDPM paper from people who understand the deep probabilistic theory underneath it. The score matching view of diffusion matters for three distinct reasons.

First, it gives an independent mathematical justification for the DDPM training objective - so you are not just taking Ho et al.'s "empirically this works better" claim on faith. Second, it opens the door to the SDE framework (Song et al., 2021), which generalizes both DDPM and NCSN to a unified continuous-time formulation enabling faster ODE-based sampling. Third, it explains why Langevin MCMC sampling works, when it fails, and why annealing is necessary - the physical intuition that connects thermodynamics to machine learning.

Yang Song at Stanford in 2019 noticed: if you knew the score function xlogp(x)\nabla_x \log p(x) of the data distribution, Langevin dynamics could generate samples without ever computing p(x)p(x) itself. The problem: you do not know p(x)p(x). His insight: train a neural network to estimate the score directly using denoising score matching - a technique that bypasses the intractable Jacobian trace. Two years later, his group unified this with DDPM under a single stochastic differential equation framework. This lesson covers both.

Why This Exists - The Problem with Density Estimation

The fundamental challenge in generative modeling: you want to sample from pdata(x)p_\text{data}(x) - the true data distribution - but you only have finite samples from it. You cannot evaluate pdata(x)p_\text{data}(x) for a given xx; you can only draw examples.

Classical approaches either define an explicit parametric pθ(x)p_\theta(x) and maximize likelihood (VAE, flow), or implicitly match features (GAN). Score matching takes a third path: forget pθ(x)p_\theta(x) entirely. Instead, learn the gradient of the log-density:

sθ(x)xlogpdata(x)s_\theta(x) \approx \nabla_x \log p_\text{data}(x)

Why the gradient? Because once you have it, Langevin dynamics generates samples - you need only the gradient of the log-density, not the density itself. The score points "uphill" toward regions of higher probability. Follow the score (with injected noise for mixing), and you converge to the data distribution.

This is elegant. But it comes with a severe practical problem - one that denoising score matching solves.

Historical Context

Score matching was introduced by Hyvärinen (2005) as a method for estimating parameters of unnormalized statistical models - specifically to avoid computing the partition function. The field largely ignored it for a decade. Vincent (2011) showed the equivalence between denoising score matching and the denoising autoencoder objective, providing a practical training procedure. Song and Ermon (2019) combined score matching with multiple noise levels and Langevin sampling to produce NCSN - the first competitive score-based generative model, achieving a new FID record on CIFAR-10. NCSNv2 (Song and Ermon, 2020) refined the noise schedule and achieved 2.45 FID on CIFAR-10. The unification with SDEs appeared in Song et al. (2021), connecting DDPM (VP-SDE), NCSN (VE-SDE), and introducing the probability flow ODE that underlies DDIM-style deterministic sampling.

1. The Score Function and Its Properties

Definition

The score function of a distribution p(x)p(x) is:

s(x)=xlogp(x)s(x) = \nabla_x \log p(x)

This is a vector field - at every point xRdx \in \mathbb{R}^d, it gives a direction pointing toward regions of higher probability density.

Key property - independence from the normalizing constant: for any distribution p(x)=p~(x)/Zp(x) = \tilde{p}(x) / Z where Z=p~(x)dxZ = \int \tilde{p}(x)\, dx:

xlogp(x)=xlogp~(x)xlogZ=xlogp~(x)\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x) - \nabla_x \log Z = \nabla_x \log \tilde{p}(x)

since ZZ does not depend on xx. The score does not require computing ZZ - this is the critical property that makes score-based models computationally tractable for energy-based models and other unnormalized densities.

Concrete examples:

  • Gaussian p(x)=N(μ,σ2)p(x) = \mathcal{N}(\mu, \sigma^2): s(x)=xlogp(x)=xμσ2s(x) = \nabla_x \log p(x) = -\frac{x-\mu}{\sigma^2}. The score points from xx toward μ\mu - it is a restoring force pulling toward the mean.

  • Mixture of Gaussians p(x)=kπkN(μk,σk2)p(x) = \sum_k \pi_k \mathcal{N}(\mu_k, \sigma_k^2): s(x)=kπkN(x;μk,σk2)(xμkσk2)kπkN(x;μk,σk2)s(x) = \frac{\sum_k \pi_k \mathcal{N}(x;\mu_k,\sigma_k^2) \cdot (-\frac{x-\mu_k}{\sigma_k^2})}{\sum_k \pi_k \mathcal{N}(x;\mu_k,\sigma_k^2)}. The score is a weighted average of the component scores - pointing toward the nearest cluster center, weighted by posterior probability.

Langevin Dynamics: Sampling from the Score

Given access to the score function s(x)=xlogp(x)s(x) = \nabla_x \log p(x), Langevin dynamics samples from p(x)p(x) via the iterative update:

xk+1=xk+ε2xlogp(xk)+εηk,ηkN(0,I)x_{k+1} = x_k + \frac{\varepsilon}{2}\, \nabla_x \log p(x_k) + \sqrt{\varepsilon}\, \eta_k, \qquad \eta_k \sim \mathcal{N}(0, \mathbf{I})

As kk \to \infty and ε0\varepsilon \to 0, the distribution of xkx_k converges to p(x)p(x) under mild regularity conditions. The two terms have a clear physical interpretation:

  • Drift term ε2xlogp(xk)\frac{\varepsilon}{2}\nabla_x \log p(x_k): moves xx toward regions of higher probability (uphill in log-density space). Without this term, the chain is a random walk - it explores uniformly.
  • Diffusion term εηk\sqrt{\varepsilon}\, \eta_k: injects Gaussian noise. Without this term, the chain would converge to a mode and stay there - it would not mix between modes or explore the distribution's tails.

The balance between drift (converging to a mode) and diffusion (mixing between modes) determines how well Langevin samples the full distribution. For multi-modal distributions (natural images have thousands of modes corresponding to different classes, objects, scenes), this balance is delicate.

Unadjusted Langevin Algorithm (ULA): the version above with finite ε\varepsilon introduces a bias - the stationary distribution is approximately p(x)p(x) but not exactly. The Metropolis-Adjusted Langevin Algorithm (MALA) adds a Metropolis-Hastings acceptance step to correct for this bias, but at additional computational cost. For deep learning applications, ULA (no correction) is used - the small bias is acceptable given the approximation from the learned score network.

2. The Problem with Direct Score Estimation

Naive Score Matching - The Jacobian Trace Problem

The most direct approach: minimize the Fisher divergence between the true score and the estimated score:

JESM(θ)=12Ep(x) ⁣[sθ(x)xlogp(x)2]J_\text{ESM}(\theta) = \frac{1}{2}\mathbb{E}_{p(x)}\!\left[\|s_\theta(x) - \nabla_x \log p(x)\|^2\right]

This is explicit score matching. Fatal flaw: it requires xlogp(x)\nabla_x \log p(x), which you do not have.

Via integration by parts (Hyvärinen, 2005), this is equivalent to:

JISM(θ)=Ep(x) ⁣[tr ⁣(xsθ(x))+12sθ(x)2]+constJ_\text{ISM}(\theta) = \mathbb{E}_{p(x)}\!\left[\text{tr}\!\left(\nabla_x s_\theta(x)\right) + \frac{1}{2}\|s_\theta(x)\|^2\right] + \text{const}

This implicit score matching objective depends only on samples from p(x)p(x) - no xlogp(x)\nabla_x \log p(x) needed. But it requires tr(xsθ(x))\text{tr}(\nabla_x s_\theta(x)) - the trace of the Jacobian of the score network.

For a network with dd-dimensional output (d=d = image dimension), computing the Jacobian trace requires dd backward passes (one per output dimension). For CIFAR-10 images: d=3×32×32=3,072d = 3 \times 32 \times 32 = 3{,}072 backward passes per training example. For 512×512512 \times 512 RGB images: d=786,432d = 786{,}432 backward passes. Completely infeasible.

Sliced score matching (Song et al., 2019) approximates the Jacobian trace by projecting onto a random direction vN(0,I)\mathbf{v} \sim \mathcal{N}(0, \mathbf{I}):

JSSM(θ)=Ep(x),v ⁣[vxsθ(x)v+12sθ(x)2]J_\text{SSM}(\theta) = \mathbb{E}_{p(x),\, \mathbf{v}}\!\left[\mathbf{v}^\top \nabla_x s_\theta(x)\, \mathbf{v} + \frac{1}{2}\|s_\theta(x)\|^2\right]

The directional derivative vxsθ(x)v\mathbf{v}^\top \nabla_x s_\theta(x)\, \mathbf{v} costs only two backward passes (using Hessian-vector products). This is tractable but still slower than DSM. For high-dimensional images, denoising score matching is preferred.

3. Denoising Score Matching

The Core Insight

Vincent (2011): instead of training the score network on the clean data distribution pdata(x)p_\text{data}(x), train it on a noise-perturbed distribution qσ(x~)=p(x)N(x~;x,σ2I)dxq_\sigma(\tilde{x}) = \int p(x)\mathcal{N}(\tilde{x};\, x,\, \sigma^2\mathbf{I})\, dx.

For a perturbed point x~=x+σε\tilde{x} = x + \sigma\varepsilon with εN(0,I)\varepsilon \sim \mathcal{N}(0, \mathbf{I}), the score of the perturbed distribution is analytically known:

x~logqσ(x~x)=x~xσ2=εσ\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = -\frac{\tilde{x} - x}{\sigma^2} = -\frac{\varepsilon}{\sigma}

This score is tractable - it just points from x~\tilde{x} back toward the original clean point xx, scaled by 1/σ21/\sigma^2.

Denoising Score Matching Objective

Train sθ(x~,σ)s_\theta(\tilde{x}, \sigma) to estimate the score of the perturbed distribution:

JDSM(θ)=Expdata,εN(0,I) ⁣[sθ(x+σε,  σ)+εσ2]J_\text{DSM}(\theta) = \mathbb{E}_{x \sim p_\text{data},\, \varepsilon \sim \mathcal{N}(0,\mathbf{I})}\!\left[\left\|s_\theta(x + \sigma\varepsilon,\; \sigma) + \frac{\varepsilon}{\sigma}\right\|^2\right]

The optimal sθ(x~,σ)=ε/σs_\theta^*(\tilde{x}, \sigma) = -\varepsilon/\sigma. This is simply MSE between the predicted score and the known analytical score of the Gaussian perturbation kernel. No Jacobian trace. No dd backward passes. Just a single standard backward pass.

The DDPM Connection: Score = Noise Prediction

Define εθ(x~,σ)=σsθ(x~,σ)\varepsilon_\theta(\tilde{x}, \sigma) = -\sigma \cdot s_\theta(\tilde{x}, \sigma). Then:

JDSM(θ)=E ⁣[εθ(x~,σ)σεσ2]E ⁣[εθ(x~,σ)ε2]J_\text{DSM}(\theta) = \mathbb{E}\!\left[\left\|\frac{\varepsilon_\theta(\tilde{x}, \sigma)}{\sigma} - \frac{\varepsilon}{\sigma}\right\|^2\right] \propto \mathbb{E}\!\left[\|\varepsilon_\theta(\tilde{x}, \sigma) - \varepsilon\|^2\right]

This is exactly the DDPM simplified objective Lsimple=E[εεθ(xt,t)2]\mathcal{L}_\text{simple} = \mathbb{E}[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2], with σ\sigma corresponding to 1αˉt\sqrt{1-\bar{\alpha}_t} at timestep tt.

This is the deep connection: DDPM's noise prediction training is mathematically identical to denoising score matching. Ho et al.'s empirical finding that "predicting ε\varepsilon works better than predicting x0x_0" is equivalent to saying "denoising score matching works better than the x0x_0-parameterized score matching objective." The relationship between the score network and the DDPM noise predictor is:

sθ(xt,t)=xtlogpt(xt)=εθ(xt,t)1αˉts_\theta(x_t, t) = \nabla_{x_t} \log p_t(x_t) = -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}

4. Noise Conditional Score Networks (NCSN and NCSNv2)

Song and Ermon (2019) observed that a single noise level σ\sigma is insufficient. At small σ\sigma, the score landscape has sharp peaks around each data point - Langevin dynamics mixes very slowly (exponentially long to escape a mode). At large σ\sigma, the score landscape is smooth and mixing is fast, but the perturbed distribution no longer resembles the data distribution.

NCSN solution: train a single score network jointly over LL noise levels {σi}i=1L\{\sigma_i\}_{i=1}^L forming a geometric sequence from σ1=σmax\sigma_1 = \sigma_\text{max} (large) to σL=σmin\sigma_L = \sigma_\text{min} (small):

LNCSN(θ)=1Li=1Lλ(σi)Ex,ε ⁣[sθ(x+σiε,  σi)+εσi2]\mathcal{L}_\text{NCSN}(\theta) = \frac{1}{L}\sum_{i=1}^{L} \lambda(\sigma_i)\, \mathbb{E}_{x,\, \varepsilon}\!\left[\left\|s_\theta(x + \sigma_i\varepsilon,\; \sigma_i) + \frac{\varepsilon}{\sigma_i}\right\|^2\right]

where λ(σi)=σi2\lambda(\sigma_i) = \sigma_i^2 (balances the loss magnitude across noise levels - without weighting, small-σ\sigma terms dominate because 1/σ21/\sigma^2 amplifies errors).

Generation via annealed Langevin dynamics: start at large σ\sigma (smooth landscape, fast mixing), then gradually reduce σ\sigma (sharper landscape, fine-grained refinement):

Initialize x ~ N(0, σ_max² I)
For σ in [σ_max, ..., σ_min]: ← outer annealing loop
α = ε · (σ / σ_min)² ← step size scaled to noise level
For k in 1..K: ← inner Langevin loop
x = x + (α/2) · sθ(x, σ) + √α · η
η ~ N(0, I)
Return x

The α=ε(σ/σmin)2\alpha = \varepsilon(\sigma/\sigma_\text{min})^2 scaling ensures the signal-to-noise ratio of the update is consistent across noise levels: at large σ\sigma the score is small (in absolute value), so the step size is correspondingly large.

NCSNv2 (Song and Ermon, 2020) refined: geometric noise schedule with more levels, spectral normalization in the score network, and careful initialization. Achieved FID 2.45 on CIFAR-10 - competitive with state-of-the-art GANs of the time.

5. The SDE Framework - Unifying Everything

Song et al. (2021) showed that both DDPM and NCSN are discretizations of SDEs, and that this continuous-time view unlocks additional capabilities.

Forward SDE

The noise-addition process is described by an Itô stochastic differential equation:

dx=f(x,t)dt+g(t)dWtdx = f(x, t)\, dt + g(t)\, dW_t

where:

  • f(x,t)f(x, t): the drift coefficient (deterministic force on xx)
  • g(t)g(t): the diffusion coefficient (amount of noise injected per unit time)
  • WtW_t: standard Brownian motion (Wiener process) - the continuous-time limit of Gaussian random walks

As tt goes from 0 to TT, this SDE transforms x0pdatax_0 \sim p_\text{data} into xTpTx_T \sim p_T (a known prior, typically N(0,I)\mathcal{N}(0, \mathbf{I})).

Reverse SDE (Anderson, 1982)

Any forward Itô SDE has a corresponding reverse-time SDE running the process backwards in time:

dx=[f(x,t)g(t)2xlogpt(x)]dt+g(t)dWˉtdx = \left[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar{W}_t

where dWˉtd\bar{W}_t is a reverse-time Brownian motion, and xlogpt(x)\nabla_x \log p_t(x) is the score of the marginal distribution at time tt. This is the key: the reverse process requires the score function at each noise level - precisely what the score network sθ(x,t)s_\theta(x, t) estimates.

The reverse SDE generates samples: start from xTpTx_T \sim p_T, integrate backwards from t=Tt = T to t=0t = 0 using the learned score sθxlogpts_\theta \approx \nabla_x \log p_t. This recovers x0pdatax_0 \sim p_\text{data}.

DDPM as VP-SDE (Variance Preserving)

The DDPM forward process q(xtxt1)=N(xt;1βtxt1,βtI)q(x_t|x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t\mathbf{I}) in the limit TT \to \infty with β(t)=Tβt/T\beta(t) = T\beta_{t/T} becomes:

dx=12β(t)xdt+β(t)dWt(VP-SDE)dx = -\frac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dW_t \qquad \text{(VP-SDE)}

The drift f(x,t)=12β(t)xf(x,t) = -\frac{1}{2}\beta(t)x contracts the signal toward zero. The diffusion g(t)=β(t)g(t) = \sqrt{\beta(t)} adds noise. The marginal at time tt is:

pt(xx0)=N ⁣(αˉ(t)x0,  (1αˉ(t))I)p_t(x|x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}(t)}\, x_0,\; (1-\bar{\alpha}(t))\mathbf{I}\right)

where αˉ(t)=exp(0tβ(s)ds)\bar{\alpha}(t) = \exp(-\int_0^t \beta(s)\, ds). The variance of pt(xx0)p_t(x|x_0) is 1αˉ(t)1 - \bar{\alpha}(t), which approaches 1 as tTt \to T - hence "variance preserving" (the total variance stays at approximately 1).

NCSN as VE-SDE (Variance Exploding)

The NCSN forward process - adding Gaussian noise with increasing variance σ2(t)\sigma^2(t) from 0 to σmax2\sigma_\text{max}^2 - in continuous time becomes:

dx=d[σ2(t)]dtdWt(VE-SDE)dx = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, dW_t \qquad \text{(VE-SDE)}

The drift is zero (f(x,t)=0f(x,t) = 0): no contraction. The diffusion coefficient grows with σ(t)\sigma(t). The marginal is pt(xx0)=N(x0,σ2(t)I)p_t(x|x_0) = \mathcal{N}(x_0,\, \sigma^2(t)\mathbf{I}) - the data convolved with growing Gaussian noise. The variance σ2(t)\sigma^2(t) explodes to σmax2\sigma_\text{max}^2 - hence "variance exploding." The endpoint is not a clean N(0,I)\mathcal{N}(0, \mathbf{I}) prior but rather a large-variance Gaussian centered on the data.

VP vs VE comparison:

PropertyVP-SDE (DDPM)VE-SDE (NCSN)
Drift12β(t)x-\frac{1}{2}\beta(t)x (contracting)0
Diffusionβ(t)\sqrt{\beta(t)}d[σ2]/dt\sqrt{d[\sigma^2]/dt}
Signal at t=Tt=TNear zero (contracted)Near x0x_0 (not contracted)
Prior pTp_TN(0,I)\approx \mathcal{N}(0, \mathbf{I}) (clean)N(x0,σmax2I)\mathcal{N}(x_0, \sigma_\text{max}^2\mathbf{I}) (messy)
Preferred forGeneration (clean prior)Density estimation

Probability Flow ODE

Key result: every SDE of the form dx=f(x,t)dt+g(t)dWtdx = f(x,t)\, dt + g(t)\, dW_t has a corresponding deterministic ODE with the same marginal distributions {pt(x)}t[0,T]\{p_t(x)\}_{t \in [0,T]}:

dxdt=f(x,t)12g(t)2xlogpt(x)\frac{dx}{dt} = f(x, t) - \frac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x)

This probability flow ODE generates samples deterministically - no stochastic component. Replacing xlogpt\nabla_x \log p_t with the learned sθ(x,t)s_\theta(x, t) gives a neural ODE that can be integrated with standard ODE solvers (Runge-Kutta, Adams-Bashforth).

Properties of the probability flow ODE:

  1. Deterministic: given the same starting xTx_T, always produces the same x0x_0.
  2. Invertible: run the ODE forward (from t=0t=0 to t=Tt=T) to encode a real image into its noise space - exact inversion.
  3. Likelihood computation: the instantaneous change of variables formula for ODEs gives: logp0(x0)=logpT(xT)0Ttr(xdxdt)dt\log p_0(x_0) = \log p_T(x_T) - \int_0^T \text{tr}(\nabla_x \frac{dx}{dt})\, dt, computable via the Hutchinson trace estimator in O(1)O(1) backward passes.
  4. Faster sampling: ODE solvers can take larger steps than SDE samplers without diverging. DDIM corresponds to numerically integrating the probability flow ODE for the VP-SDE with a specific first-order solver.

Code: Score Matching Training and Langevin Sampling

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import math


# ============================================================
# 1. Score Network (time-conditioned MLP for 2D toy data)
# ============================================================
class ScoreNetwork(nn.Module):
"""
Noise-conditional score network: sθ(x, σ) ≈ ∇_x log p_σ(x).
For images, replace with a time-conditioned U-Net.

Input: x [batch, d] + σ [batch] → score [batch, d]
"""

def __init__(self, input_dim: int = 2, hidden_dim: int = 256,
sigma_emb_dim: int = 32):
super().__init__()

# Log-sigma embedding: maps scalar log(σ) to a feature vector
self.sigma_emb = nn.Sequential(
nn.Linear(1, sigma_emb_dim),
nn.SiLU(),
nn.Linear(sigma_emb_dim, sigma_emb_dim),
nn.SiLU(),
)

# Main score network
self.net = nn.Sequential(
nn.Linear(input_dim + sigma_emb_dim, hidden_dim),
nn.SiLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.SiLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.SiLU(),
nn.Linear(hidden_dim, input_dim)
)

def forward(self, x: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
"""
x : [batch, d]
sigma : [batch] - noise level for each sample
Returns estimated score sθ(x, σ) = ≈ ∇_x log p_σ(x)
"""
log_sigma = torch.log(sigma + 1e-8).unsqueeze(-1) # [batch, 1]
emb = self.sigma_emb(log_sigma) # [batch, emb_dim]
return self.net(torch.cat([x, emb], dim=-1))


# ============================================================
# 2. Denoising Score Matching Loss (NCSN training objective)
# ============================================================
def ncsn_loss(score_net: ScoreNetwork,
x_clean: torch.Tensor,
sigmas: torch.Tensor) -> torch.Tensor:
"""
NCSN training loss: denoising score matching over multiple noise levels.

For each sample:
1. Sample a random noise level σᵢ
2. Add noise: x̃ = x + σᵢ·ε, ε ~ N(0, I)
3. Target score: -ε/σᵢ (score of N(x̃; x, σᵢ²I))
4. Loss: σᵢ² · ‖sθ(x̃, σᵢ) - (-ε/σᵢ)‖²

The σ² weighting balances loss magnitude across noise levels.
Without it, small-σ terms (with 1/σ² score magnitude) dominate.

Args:
score_net : noise-conditional score network
x_clean : clean data [batch, d]
sigmas : geometric sequence of noise levels [L]
"""
batch, d = x_clean.shape
device = x_clean.device

# Sample a random noise level for each example
idx = torch.randint(len(sigmas), (batch,), device=device)
sigma = sigmas[idx] # [batch]

# Add noise: x̃ = x + σ·ε
eps = torch.randn_like(x_clean)
x_noisy = x_clean + sigma.unsqueeze(-1) * eps

# Predicted score
score_pred = score_net(x_noisy, sigma) # [batch, d]

# Target score: ∇_{x̃} log N(x̃; x, σ²I) = -(x̃ - x)/σ² = -ε/σ
score_target = -eps / sigma.unsqueeze(-1) # [batch, d]

# Loss weighted by σ² (NCSN weighting scheme)
loss = 0.5 * (sigma.unsqueeze(-1) ** 2) * (score_pred - score_target) ** 2
return loss.mean()


# ============================================================
# 3. Annealed Langevin Dynamics Sampling
# ============================================================
@torch.no_grad()
def annealed_langevin_sampling(
score_net: ScoreNetwork,
sigmas: list[float],
n_samples: int = 200,
n_steps_per_sigma: int = 100,
eps_base: float = 2e-5,
dim: int = 2,
device: str = "cpu",
) -> torch.Tensor:
"""
NCSN sampling: annealed Langevin dynamics.

Algorithm:
Initialize x ~ N(0, σ_max² I)
For σ in [σ_max → σ_min]: (outer annealing loop)
α = eps_base · (σ / σ_min)² (step size: large at high noise)
For k in 1..K: (inner Langevin loop)
x ← x + (α/2) · sθ(x, σ) + √α · η, η ~ N(0, I)

Intuition for annealing:
High σ → smooth landscape → fast Langevin mixing → gets near a mode
Low σ → sharp landscape → slow mixing BUT already near correct mode
→ fine-detail refinement
"""
score_net.eval()
sigma_min = sigmas[-1]

# Initialize from large-noise prior
x = torch.randn(n_samples, dim, device=device) * sigmas[0]

for sigma in sigmas:
sigma_t = torch.full((n_samples,), sigma, device=device)

# Step size: scaled to maintain constant signal-to-noise ratio
alpha = eps_base * (sigma / sigma_min) ** 2

for _ in range(n_steps_per_sigma):
score = score_net(x, sigma_t) # [n, d]
noise = torch.randn_like(x)
x = x + (alpha / 2.0) * score + math.sqrt(alpha) * noise

return x


# ============================================================
# 4. Training NCSN on 2D Gaussian Mixture
# ============================================================
def make_gaussian_mixture(n: int = 30_000, n_modes: int = 8,
radius: float = 5.0) -> torch.Tensor:
"""
Classic 2D score matching demo: 8 Gaussians on a circle.
Tests whether the score network can learn multi-modal distributions.
"""
angles = torch.linspace(0, 2 * math.pi, n_modes + 1)[:-1]
centers = torch.stack([radius * angles.cos(), radius * angles.sin()], dim=1)
idx = torch.randint(n_modes, (n,))
return centers[idx] + 0.4 * torch.randn(n, 2)


def train_ncsn(
n_epochs: int = 600,
batch_size: int = 512,
lr: float = 1e-3,
sigma_begin: float = 10.0,
sigma_end: float = 0.01,
n_sigmas: int = 12,
device: str = "cpu",
) -> tuple:
"""Train NCSN on 2D mixture, return (score_net, sigmas)."""
data = make_gaussian_mixture(n=60_000).to(device)
loader = torch.utils.data.DataLoader(
torch.utils.data.TensorDataset(data),
batch_size=batch_size, shuffle=True, drop_last=True
)

# Geometric noise schedule: σ₁ > σ₂ > ... > σ_L
sigmas = torch.exp(
torch.linspace(math.log(sigma_begin), math.log(sigma_end), n_sigmas)
).to(device)

score_net = ScoreNetwork(input_dim=2, hidden_dim=256).to(device)
optimizer = torch.optim.Adam(score_net.parameters(), lr=lr,
betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=n_epochs, eta_min=1e-5)

print(f"Training NCSN | σ_max={sigma_begin} | σ_min={sigma_end} | "
f"L={n_sigmas} noise levels")
for epoch in range(n_epochs):
total_loss = 0.0
for (x_batch,) in loader:
loss = ncsn_loss(score_net, x_batch.to(device), sigmas)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(score_net.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
scheduler.step()

if (epoch + 1) % 200 == 0:
avg = total_loss / len(loader)
print(f" Epoch {epoch+1:4d} | DSM Loss: {avg:.5f} | "
f"LR: {scheduler.get_last_lr()[0]:.2e}")

return score_net, sigmas


# ============================================================
# 5. Score ↔ DDPM Noise Predictor Conversion
# ============================================================
def score_to_noise(score: torch.Tensor,
sigma_t: float) -> torch.Tensor:
"""
Convert score estimate to DDPM noise prediction.

Relationship:
sθ(x_t, t) = ∇_{x_t} log p_t(x_t) ≈ -ε/σ_t
⟹ εθ(x_t, t) = -σ_t · sθ(x_t, t)

where σ_t = √(1 - ᾱ_t) in DDPM notation.
"""
return -sigma_t * score


def noise_to_score(eps_pred: torch.Tensor,
sigma_t: float) -> torch.Tensor:
"""
Convert DDPM noise prediction to score estimate.

sθ(x_t, t) = -εθ(x_t, t) / σ_t
"""
return -eps_pred / (sigma_t + 1e-8)


# ============================================================
# 6. VP-SDE and VE-SDE Implementations
# ============================================================
class VESDE:
"""
Variance Exploding SDE (NCSN forward process).
dx = sqrt(d[σ²(t)]/dt) dW
Marginal: p(x_t | x_0) = N(x_0, σ²(t) I)
"""

def __init__(self, sigma_min: float = 0.01, sigma_max: float = 50.0):
self.sigma_min = sigma_min
self.sigma_max = sigma_max

def sigma(self, t: torch.Tensor) -> torch.Tensor:
"""Geometric noise schedule: σ(t) = σ_min · (σ_max/σ_min)^t"""
return self.sigma_min * (self.sigma_max / self.sigma_min) ** t

def marginal_forward(self, x0: torch.Tensor, t: torch.Tensor,
eps: torch.Tensor | None = None
) -> tuple[torch.Tensor, torch.Tensor]:
"""Sample x_t ~ p(x_t | x_0) = N(x_0, σ²(t) I)"""
if eps is None:
eps = torch.randn_like(x0)
sigma_t = self.sigma(t)
# Broadcast sigma_t for batch × spatial dims
if x0.dim() > 1:
sigma_t = sigma_t.view(-1, *([1] * (x0.dim() - 1)))
return x0 + sigma_t * eps, sigma_t

def score(self, x_t: torch.Tensor, x_0: torch.Tensor,
t: torch.Tensor) -> torch.Tensor:
"""True score of p_t(x|x_0): -(x_t - x_0) / σ²(t)"""
sigma_t = self.sigma(t)
if x_t.dim() > 1:
sigma_t = sigma_t.view(-1, *([1] * (x_t.dim() - 1)))
return -(x_t - x_0) / (sigma_t ** 2)


class VPSDE:
"""
Variance Preserving SDE (DDPM forward process in continuous time).
dx = -½β(t)x dt + √β(t) dW
Marginal: p(x_t|x_0) = N(√ᾱ(t) x_0, (1-ᾱ(t)) I)
"""

def __init__(self, beta_min: float = 0.1, beta_max: float = 20.0):
self.beta_min = beta_min
self.beta_max = beta_max

def beta(self, t: torch.Tensor) -> torch.Tensor:
"""Linear schedule: β(t) = β_min + t(β_max - β_min)"""
return self.beta_min + t * (self.beta_max - self.beta_min)

def alpha_bar(self, t: torch.Tensor) -> torch.Tensor:
"""
ᾱ(t) = exp(-∫₀ᵗ β(s)ds) = exp(-½t²(β_max-β_min) - t·β_min)
As t→1: ᾱ→0 → x_t → pure noise
"""
return torch.exp(
-0.5 * t ** 2 * (self.beta_max - self.beta_min)
- t * self.beta_min
)

def marginal_forward(self, x0: torch.Tensor, t: torch.Tensor,
eps: torch.Tensor | None = None
) -> tuple[torch.Tensor, torch.Tensor]:
"""Sample x_t ~ N(√ᾱ(t)·x_0, (1-ᾱ(t))·I)"""
if eps is None:
eps = torch.randn_like(x0)
ab = self.alpha_bar(t)
if x0.dim() > 1:
ab = ab.view(-1, *([1] * (x0.dim() - 1)))
return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, ab

def score(self, x_t: torch.Tensor, x_0: torch.Tensor,
t: torch.Tensor) -> torch.Tensor:
"""True score of p_t(x|x_0): -(x_t - √ᾱ x_0) / (1-ᾱ)"""
ab = self.alpha_bar(t)
if x_t.dim() > 1:
ab = ab.view(-1, *([1] * (x_t.dim() - 1)))
return -(x_t - ab.sqrt() * x_0) / (1 - ab)


# ── Compare VE and VP at t=1 (endpoint behavior) ─────────────────────────────
def compare_sde_endpoints():
# Off-center data distribution
x0 = torch.randn(2000, 2) + torch.tensor([4.0, 4.0])

ve = VESDE(sigma_min=0.01, sigma_max=50.0)
vp = VPSDE(beta_min=0.1, beta_max=20.0)

t_end = torch.ones(2000) # t=1 (endpoint of forward process)

xT_ve, _ = ve.marginal_forward(x0, t_end)
xT_vp, _ = vp.marginal_forward(x0, t_end)

print("SDE endpoints (t=1):")
print(f" VE-SDE mean : {xT_ve.mean(0).tolist()}") # ≈ (4,4) - not contracted
print(f" VE-SDE std : {xT_ve.std(0).tolist()}") # ≈ (50, 50) - huge variance
print(f" VP-SDE mean : {xT_vp.mean(0).tolist()}") # ≈ (0, 0) - contracted to 0
print(f" VP-SDE std : {xT_vp.std(0).tolist()}") # ≈ (1, 1) - unit Gaussian


compare_sde_endpoints()

The Architecture: U-Net Score Network

Production Engineering Notes

:::tip Score vs noise prediction - which parameterization to use In HuggingFace Diffusers, prediction_type controls the parameterization:

  • "epsilon" (default, DDPM-style): network predicts ε\varepsilon. Equivalent to sθ=εθ/σts_\theta = -\varepsilon_\theta / \sigma_t up to sign.
  • "v_prediction" (Salimans and Ho, 2022): network predicts the "velocity" v=αˉtε1αˉtx0v = \sqrt{\bar{\alpha}_t}\varepsilon - \sqrt{1-\bar{\alpha}_t}x_0. More numerically stable for very high or very low noise levels. Used in Imagen and some Stable Diffusion variants.
  • "sample" (x0-prediction): network directly predicts x0x_0. Less stable in practice - score matching theory suggests ε\varepsilon-prediction works better.

The underlying math is identical; only the network's output target differs. Convert between them using the marginal distribution relationships. :::

:::note Likelihood computation via the probability flow ODE To compute logpθ(x0)\log p_\theta(x_0) for a given image x0x_0:

  1. Encode x0xTx_0 \to x_T by integrating the probability flow ODE forward from t=0t=0 to t=Tt=T. This is deterministic and invertible.
  2. Compute logpT(xT)=logN(xT;0,I)\log p_T(x_T) = \log\mathcal{N}(x_T; 0, \mathbf{I}) - known analytically.
  3. Add the change-of-variables term: logp0(x0)=logpT(xT)0Ttr(xdxdt)dt\log p_0(x_0) = \log p_T(x_T) - \int_0^T \text{tr}(\nabla_x \frac{dx}{dt})\, dt.
  4. Estimate the integral via Hutchinson trace estimator: tr(J)vJv\text{tr}(\mathbf{J}) \approx \mathbf{v}^\top\mathbf{J}\mathbf{v} with vN(0,I)\mathbf{v} \sim \mathcal{N}(0, \mathbf{I}).

This gives approximate log-likelihood in O(TODE)O(T_\text{ODE}) steps where TODET_\text{ODE} can be as small as 20–50 with high-order ODE solvers. :::

:::warning Langevin dynamics requires small step size relative to curvature The Langevin update xk+1=xk+ε2sθ(xk)+εηx_{k+1} = x_k + \frac{\varepsilon}{2}s_\theta(x_k) + \sqrt{\varepsilon}\eta requires step size ε\varepsilon small relative to the local curvature of logp(x)\log p(x). Too large: the dynamics diverge. The curvature of the score landscape scales as 1/σ21/\sigma^2 - much larger at small noise levels. NCSN's α=ε(σ/σmin)2\alpha = \varepsilon(\sigma/\sigma_\text{min})^2 schedule compensates: step sizes scale proportionally to σ2\sigma^2, keeping the effective step size relative to local curvature constant. :::

Common Mistakes

:::danger Confusing the score of p(x) with the score of p_σ(x) The score network sθ(x,σ)s_\theta(x, \sigma) estimates xlogpσ(x)\nabla_x \log p_\sigma(x) - the score of the noise-perturbed distribution, not the original data distribution. At σ0\sigma \to 0, these converge. For finite σ\sigma, they differ: the score of pσp_\sigma is smoother and points toward the nearest cluster of data points, weighted by the Gaussian kernel of width σ\sigma. The denoising score matching target ε/σ-\varepsilon/\sigma is the score of the Gaussian noise kernel N(x~;x0,σ2I)\mathcal{N}(\tilde{x}; x_0, \sigma^2\mathbf{I}), not of pσ(x~)p_\sigma(\tilde{x}) - though Vincent (2011) proved these are equal in expectation. :::

:::danger Treating VE and VP SDEs as interchangeable VP-SDE (DDPM) contracts the signal toward zero - the endpoint xTN(0,I)x_T \approx \mathcal{N}(0, \mathbf{I}) is a clean, data-independent Gaussian prior. Starting generation from N(0,I)\mathcal{N}(0, \mathbf{I}) is straightforward. VE-SDE (NCSN) does not contract - the endpoint depends on the data: xTN(x0,σmax2I)x_T \approx \mathcal{N}(x_0, \sigma_\text{max}^2\mathbf{I}). The prior is data-dependent and very large in variance. For generation, VP is cleaner; for density estimation and inversion, VE is sometimes preferred. :::

:::warning Langevin dynamics fails in low-data-density regions The data manifold hypothesis says natural images occupy a thin low-dimensional manifold in Rd\mathbb{R}^d. In regions of zero data density - most of Rd\mathbb{R}^d - the score function is undefined or points toward the nearest data cluster but very weakly. Langevin dynamics initialized in a low-density region can take exponentially long to reach the manifold. This is exactly why NCSN uses large σ\sigma to start: large noise "spreads" the data distribution throughout space, making the score defined and meaningful everywhere. The annealing schedule gradually reduces σ\sigma, staying on the high-density region throughout. :::

YouTube Resources

ResourceChannelWhy Watch
Score Matching Tutorial - Yang SongStanford LectureThe author of NCSN explains score matching, denoising score matching, and Langevin dynamics - the definitive reference
SDE Framework for Diffusion - NeurIPS 2021Yang Song Invited TalkThe VP-SDE, VE-SDE, probability flow ODE, and DDIM connection - presented by the author
Diffusion Models Math - OutlierOutlierVisual walkthrough of score matching → Langevin → DDPM with animations
Langevin Dynamics and MCMC - Pieter AbbeelUC BerkeleyWhy Langevin MCMC converges to the target distribution - thermodynamics perspective
Score-Based Models and SDEs - Valentin De BortoliGatsby UnitRigorous mathematical treatment - Fokker-Planck equations, convergence guarantees

Interview Q&A

Q1: What is the score function and why is it useful for generative modeling?

The score function of a distribution p(x)p(x) is s(x)=xlogp(x)s(x) = \nabla_x \log p(x) - the gradient of the log-density with respect to the input. It is a vector field pointing toward regions of higher probability density. Two properties make it useful: (1) It does not depend on the normalizing constant ZZ of p(x)p(x), since xlogp(x)=xlogp~(x)\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x) - computable from unnormalized models. (2) Langevin dynamics xk+1=xk+ε2s(xk)+εηkx_{k+1} = x_k + \frac{\varepsilon}{2}s(x_k) + \sqrt{\varepsilon}\eta_k generates samples from p(x)p(x) using only the score. Starting from any x0x_0, the chain converges to p(x)p(x) as step size ε0\varepsilon \to 0 and steps kk \to \infty.

Q2: Explain denoising score matching. Why does it avoid the Jacobian trace problem?

Explicit score matching requires xlogp(x)\nabla_x \log p(x) (unknown). Implicit score matching avoids this but requires tr(xsθ(x))\text{tr}(\nabla_x s_\theta(x)) - the trace of the score Jacobian - costing dd backward passes (infeasible for d=3×32×32=3072d = 3 \times 32 \times 32 = 3072).

Denoising score matching (Vincent, 2011): perturb data with noise σ\sigma to get x~=x+σε\tilde{x} = x + \sigma\varepsilon. The score of the noise-perturbed distribution is analytically known: x~logqσ(x~x)=(x~x)/σ2=ε/σ\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = -(\tilde{x}-x)/\sigma^2 = -\varepsilon/\sigma. Training objective: E[sθ(x~,σ)(ε/σ)2]\mathbb{E}[\|s_\theta(\tilde{x}, \sigma) - (-\varepsilon/\sigma)\|^2]. No Jacobian trace - just a single backward pass per training example. Tractable at any image dimension.

Q3: How does the SDE framework unify DDPM and NCSN?

Song et al. (2021) show that both are discretizations of SDEs. DDPM corresponds to the VP-SDE: dx=12β(t)xdt+β(t)dWdx = -\frac{1}{2}\beta(t)x\, dt + \sqrt{\beta(t)}\, dW, which contracts the signal toward zero. Marginal: pt(xx0)=N(αˉ(t)x0,(1αˉ(t))I)p_t(x|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}(t)}x_0, (1-\bar{\alpha}(t))\mathbf{I}). NCSN corresponds to the VE-SDE: dx=d[σ2(t)]/dtdWdx = \sqrt{d[\sigma^2(t)]/dt}\, dW, which adds noise without contraction. Marginal: pt(xx0)=N(x0,σ2(t)I)p_t(x|x_0) = \mathcal{N}(x_0, \sigma^2(t)\mathbf{I}).

Both are special cases of dx=f(x,t)dt+g(t)dWdx = f(x,t)\, dt + g(t)\, dW. The reverse SDE for any such process is dx=[fg2xlogpt]dt+gdWˉdx = [f - g^2\nabla_x \log p_t]\, dt + g\, d\bar{W}. Plugging in the learned score sθxlogpts_\theta \approx \nabla_x \log p_t gives the unified sampling algorithm. The SDE framework also enables continuous-time analysis, higher-order ODE solvers, and exact likelihood computation.

Q4: What is the probability flow ODE, and how does it relate to DDIM?

Every SDE dx=f(x,t)dt+g(t)dWdx = f(x,t)\, dt + g(t)\, dW has an associated probability flow ODE dxdt=f(x,t)12g(t)2xlogpt(x)\frac{dx}{dt} = f(x,t) - \frac{1}{2}g(t)^2\nabla_x \log p_t(x) with identical marginals {pt}\{p_t\} but deterministic trajectories. Replacing xlogpt\nabla_x \log p_t with sθs_\theta gives a learnable neural ODE.

For the VP-SDE (DDPM), the probability flow ODE discretized with a first-order Euler solver is exactly the DDIM sampling update. DDIM's properties - determinism, invertibility, and ability to use fewer steps - all follow from solving an ODE rather than a stochastic process. Higher-order ODE solvers (Runge-Kutta 4, DPM-Solver) reduce the required function evaluations from 1000 to 10–20 at similar quality.

Q5: Why does annealed Langevin sampling outperform single-scale Langevin?

At σ0\sigma \approx 0 (target distribution), the score landscape has sharp peaks around each data point. Langevin dynamics with any step size either overshoots (diverges) or requires exponentially many steps to move between modes. This is the classic multi-modal MCMC problem.

Annealing solves this hierarchically: at large σ\sigma (smooth landscape), Langevin mixes quickly between all modes - the chain explores the global structure. At smaller σ\sigma (sharper landscape), the chain refines within a mode - local detail is added. Each noise level uses the final state of the previous level as initialization - the chain never needs to escape a mode at small σ\sigma. This is analogous to simulated annealing: start hot (high entropy, global exploration), cool slowly (converge to the target distribution's modes one by one).

Q6: Derive the connection between the DDPM noise prediction objective and denoising score matching.

DDPM trains εθ\varepsilon_\theta to minimize E[εθ(xt,t)ε2]\mathbb{E}[\|\varepsilon_\theta(x_t, t) - \varepsilon\|^2] where xt=αˉtx0+1αˉtεx_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\varepsilon.

Denoising score matching trains sθs_\theta to minimize E[sθ(xt,σt)(ε/σt)2]\mathbb{E}[\|s_\theta(x_t, \sigma_t) - (-\varepsilon/\sigma_t)\|^2] where σt=1αˉt\sigma_t = \sqrt{1-\bar{\alpha}_t}.

Define εθ(xt,t)=σtsθ(xt,σt)\varepsilon_\theta(x_t, t) = -\sigma_t \cdot s_\theta(x_t, \sigma_t). Then:

E ⁣[sθ(xt,σt)+ε/σt2]=E ⁣[εθσt+εσt2]=1σt2E ⁣[εθε2]\mathbb{E}\!\left[\|s_\theta(x_t, \sigma_t) + \varepsilon/\sigma_t\|^2\right] = \mathbb{E}\!\left[\left\|\frac{-\varepsilon_\theta}{\sigma_t} + \frac{\varepsilon}{\sigma_t}\right\|^2\right] = \frac{1}{\sigma_t^2}\mathbb{E}\!\left[\|\varepsilon_\theta - \varepsilon\|^2\right]

The DDPM objective is 1σt2\frac{1}{\sigma_t^2} times the DSM objective - they share the same minimizer. Therefore, training DDPM to predict noise ε\varepsilon is mathematically equivalent to training a score network via denoising score matching. Ho et al.'s empirical observation that noise prediction works better than x0x_0 prediction corresponds to the DSM parameterization (ε\varepsilon-prediction) outperforming the direct score parameterization.

Fokker-Planck Equation and Score Dynamics

The SDE framework connects to a classical result from statistical physics: the Fokker-Planck equation (also called the Kolmogorov forward equation). For the SDE dx=f(x,t)dt+g(t)dWdx = f(x,t)\, dt + g(t)\, dW, the marginal distribution pt(x)p_t(x) evolves according to:

ptt=x(fpt)+g(t)22Δxpt\frac{\partial p_t}{\partial t} = -\nabla_x \cdot (f\, p_t) + \frac{g(t)^2}{2}\Delta_x p_t

where x\nabla_x \cdot is the divergence and Δx\Delta_x is the Laplacian. This is a PDE governing how the probability density evolves over time. For the VP-SDE with f(x,t)=12β(t)xf(x,t) = -\frac{1}{2}\beta(t)x:

ptt=β(t)2x(xpt)+β(t)2Δxpt\frac{\partial p_t}{\partial t} = \frac{\beta(t)}{2}\nabla_x \cdot (x\, p_t) + \frac{\beta(t)}{2}\Delta_x p_t

The stationary distribution of this PDE (as tTt \to T) is N(0,I)\mathcal{N}(0, \mathbf{I}) - exactly the prior we want for generation. The reverse SDE is derived by time-reversing the Fokker-Planck equation, which requires knowledge of xlogpt(x)\nabla_x \log p_t(x) - the score function. This is the deep physical reason why the score is the central quantity in diffusion.

Practical: Inspecting Score Networks in HuggingFace Diffusers

# ── Inspecting how HuggingFace Diffusers handles score/noise prediction ───────
# This requires: pip install diffusers transformers

try:
from diffusers import DDPMScheduler, UNet2DModel
import torch

# A minimal DDPM-style U-Net (the score network / noise predictor)
# This is what sθ(x_t, t) looks like in production code
unet = UNet2DModel(
sample_size=32, # 32×32 images (e.g., CIFAR-10)
in_channels=3, # RGB
out_channels=3, # predicted noise (same shape as input)
layers_per_block=2,
block_out_channels=(64, 128, 128, 256),
down_block_types=("DownBlock2D", "DownBlock2D",
"AttnDownBlock2D", "DownBlock2D"),
up_block_types=("UpBlock2D", "AttnUpBlock2D",
"UpBlock2D", "UpBlock2D"),
)

# DDPM noise scheduler: manages β_t, ᾱ_t, etc.
scheduler = DDPMScheduler(
num_train_timesteps=1000,
beta_schedule="linear",
beta_start=0.0001,
beta_end=0.02,
prediction_type="epsilon", # predict ε (equivalent to DSM)
)

print(f"UNet parameters: {sum(p.numel() for p in unet.parameters()):,}")
print(f"Scheduler: {scheduler.num_train_timesteps} steps")
print(f"prediction_type: {scheduler.config.prediction_type}")
print()

# Training step demonstration
x0 = torch.randn(4, 3, 32, 32) # fake clean images
t = torch.randint(0, 1000, (4,)) # random timesteps

# Add noise: q(x_t | x_0) = N(√ᾱ_t x_0, (1-ᾱ_t) I)
eps = torch.randn_like(x0)
x_t = scheduler.add_noise(x0, eps, t)

# Predict noise with U-Net (= approximate score network)
eps_pred = unet(x_t, t).sample

# Simple MSE loss (= DSM objective reweighted)
loss = torch.nn.functional.mse_loss(eps_pred, eps)
print(f"Training loss example: {loss.item():.4f}")
print(f"Input shape: {x_t.shape}")
print(f"Output shape: {eps_pred.shape} (same as input - predicts ε)")

# Convert noise prediction to score estimate
# sθ(x_t, t) = -εθ(x_t, t) / √(1 - ᾱ_t)
alpha_bars = scheduler.alphas_cumprod
sigma_t = (1 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)
score_estimate = -eps_pred / sigma_t
print(f"Score estimate shape: {score_estimate.shape}")
print(f"Score estimate norm (avg): {score_estimate.norm(dim=(1,2,3)).mean():.2f}")

except ImportError:
print("diffusers not installed. Install with: pip install diffusers transformers")
print("This code shows how production score networks are structured.")
print("Key points:")
print(" - UNet2DModel: the score/noise predictor sθ(x_t, t)")
print(" - DDPMScheduler: manages the noise schedule (β_t, ᾱ_t)")
print(" - prediction_type='epsilon': noise prediction = DSM training")
print(" - Convert: score = -eps_pred / sqrt(1 - alpha_bar_t)")

Summary: Score Matching - The Unifying Thread

The score-based view provides the most coherent mathematical foundation for modern diffusion models. Tracing the logical chain:

  1. Goal: sample from pdata(x)p_\text{data}(x) without computing it.
  2. Key tool: the score xlogp(x)\nabla_x \log p(x) + Langevin dynamics.
  3. Problem: training the score network requires intractable Jacobian traces.
  4. Solution: denoising score matching - perturb data, train score on the analytically-known score of the Gaussian kernel.
  5. Multiple noise levels: NCSN extends this to a geometric sequence of σ\sigma values, enabling annealed sampling that mixes globally at large σ\sigma and refines locally at small σ\sigma.
  6. Continuous limit: the discrete multi-level process is the SDE dx=fdt+gdWdx = f\, dt + g\, dW in continuous time.
  7. Reverse process: Anderson's theorem gives the reverse SDE; substituting the learned score gives the generation algorithm.
  8. DDPM connection: the DDPM noise prediction objective is algebraically identical to DSM, with ε=σts\varepsilon = -\sigma_t s.
  9. Deterministic sampling: the probability flow ODE has the same marginals - solving it numerically gives DDIM, DPM-Solver, and related fast samplers.

This chain - from Langevin dynamics to Hyvärinen's score matching to Vincent's DSM to NCSN to Song's SDE framework to DDPM to DDIM - is the mathematical spine of the entire modern diffusion literature. Every paper in this area builds on some piece of this chain. Understanding it end-to-end puts you in a position to read and extend the research, not just apply pretrained models.

Sub-VP SDE: A Third Family

Song et al. (2021) introduced a third SDE beyond VP and VE - the sub-VP SDE:

dx=12β(t)xdt+β(t) ⁣(1e20tβ(s)ds)dWtdx = -\frac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)\!\left(1 - e^{-2\int_0^t \beta(s)ds}\right)}\, dW_t

The sub-VP SDE has smaller diffusion coefficient than the VP-SDE, giving it better likelihood estimation (closer to exact flow) while maintaining trainability. Its marginal distribution is:

pt(xx0)=N ⁣(αˉ(t)x0,  (1αˉ(t))2I)p_t(x|x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}(t)}\, x_0,\; \left(1 - \bar{\alpha}(t)\right)^2 \mathbf{I}\right)

Note: the variance is (1αˉ)2(1 - \bar{\alpha})^2 versus (1αˉ)(1 - \bar{\alpha}) for VP-SDE. For small αˉ\bar{\alpha} (high noise), sub-VP has less variance than VP - hence "sub-variance-preserving." In practice, sub-VP achieves better bits-per-dim on density estimation benchmarks at the cost of slightly worse FID, making it the best choice when likelihood matters.

class SubVPSDE:
"""
Sub-VP SDE - better likelihood, slightly worse FID than VP-SDE.
Variance: (1 - alpha_bar(t))^2 instead of VP's (1 - alpha_bar(t)).
"""
def __init__(self, beta_min: float = 0.1, beta_max: float = 20.0):
self.beta_min = beta_min
self.beta_max = beta_max

def beta(self, t: float) -> float:
return self.beta_min + t * (self.beta_max - self.beta_min)

def alpha_bar(self, t: float) -> float:
import math
return math.exp(
-0.5 * t ** 2 * (self.beta_max - self.beta_min)
- t * self.beta_min
)

def marginal_variance(self, t: float) -> float:
"""
VP-SDE variance: (1 - ᾱ(t))
Sub-VP-SDE variance: (1 - ᾱ(t))²
Sub-VP is always smaller variance for the same t.
"""
return (1 - self.alpha_bar(t)) ** 2

def compare_variance(self):
import numpy as np
vpsde = VPSDE(self.beta_min, self.beta_max)
ts = np.linspace(0, 1, 11)
print("VP vs Sub-VP variance at different timesteps:")
print(f"{'t':>5} | {'VP variance':>14} | {'Sub-VP variance':>16} | {'Ratio':>8}")
print("-" * 52)
for t in ts:
vp_var = float(1 - vpsde.alpha_bar(torch.tensor(t)).item())
subvp_var = self.marginal_variance(float(t))
ratio = subvp_var / (vp_var + 1e-12)
print(f"{t:>5.2f} | {vp_var:>14.4f} | {subvp_var:>16.4f} | {ratio:>8.4f}")


subvp = SubVPSDE()
subvp.compare_variance()

Connecting Everything: A Field Map

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Diffusion Process (DDPM) demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.