DDPMs - The Mathematical Foundation of Diffusion Models
:::note Reading time: ~55 minutes | Interview relevance: Very High | Target roles: ML Engineer, Research Engineer, AI Engineer, Applied Scientist :::
The Real Interview Moment
You are interviewing for a research scientist role at a generative AI lab. The interviewer writes $q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1 - \bar\alpha_t) I\big)$ on the whiteboard and says: "Derive this closed form. Then explain how the training objective follows from the ELBO, and tell me why Ho et al. chose to predict $\epsilon$ instead of $x_0$."
This is a qualifying question. It separates engineers who have used Stable Diffusion from engineers who understand it. Most people can draw forward and backward arrows on a diagram. Few can derive the closed-form distribution at an arbitrary timestep $t$, explain why that derivation enables efficient training, connect it to the variational lower bound, articulate the empirical finding that noise prediction outperforms image prediction, and then describe why the cosine schedule (Nichol and Dhariwal 2021) matters for high-resolution generation.
The interviewer continues: "Now, sketch the U-Net architecture - how does it incorporate the timestep, how do skip connections help, and where does attention go?"
Then: "What FID score would you expect a well-trained DDPM to achieve on CIFAR-10, and what does FID actually measure?"
These questions form a complete arc from mathematical derivation to practical engineering to evaluation. Jonathan Ho, Ajay Jain, and Pieter Abbeel at UC Berkeley answered all of them in their 2020 DDPM paper. They unified ideas from non-equilibrium thermodynamics and score matching, used a U-Net with attention as the denoiser, and achieved a state-of-the-art FID of 3.17 on CIFAR-10 along with sample quality comparable to ProgressiveGAN on 256x256 LSUN. This lesson gives you the complete derivation, implementation, and intuition to answer every question in that arc.
Why This Exists - The Core Problem with Adversarial Training
Prior to DDPM, the dominant approach to high-quality image generation was Generative Adversarial Networks (GANs). GANs have one fundamental training problem: the generator and discriminator must be carefully balanced throughout training. Too strong a discriminator and gradients vanish - the generator gets no useful signal. Too weak a discriminator and the generator does not improve. Mode collapse can happen silently and is difficult to detect until evaluation time. Every GAN paper required new stabilization tricks: spectral normalization, gradient penalty, progressive growing, minibatch standard deviation.
The deeper problem is that GANs optimize an adversarial objective, not a likelihood-based one. The training signal is indirect: "the discriminator says this looks fake." There is no explicit mathematical connection between the training loss and the quality of the learned distribution. When a GAN converges, there is no theoretical guarantee that it has learned the true data distribution - only that the generator can fool the discriminator.
Diffusion models take the opposite approach. They define an explicit, principled training objective derived from the variational lower bound on the log-likelihood. The training signal is direct: "predict the noise that was added at a specific timestep." This is a simple regression problem, always well-defined, never adversarial. The result is dramatically more stable training - you can train DDPM reliably with a standard Adam optimizer and cosine learning rate schedule, no discriminator, no mode collapse, no specialized tricks required.
The cost is inference speed: DDPM requires 1000 sequential denoising steps to generate one sample. GANs generate in a single forward pass. This speed gap motivated DDIM, DPM-Solver, and Consistency Models (covered in Lessons 04–06). But the stability advantage of DDPM's principled objective is what made the entire diffusion model ecosystem possible.
Historical Context - The Origins of DDPM
The idea of progressive noising comes from non-equilibrium statistical mechanics. Sohl-Dickstein et al. (2015) proposed learning a generative model by reversing a diffusion process - gradually adding noise until the data distribution becomes a known prior, then learning to reverse the process. The paper proved this was theoretically sound: if you can perfectly learn the reverse of a diffusion process, you can generate samples from the data distribution. But the implementation was too slow and the networks too small to produce competitive results. The paper was visionary but practically ahead of its time.
Yang Song and Stefano Ermon (2019, 2020) approached generative modeling from a different angle: score matching. If you know the score function $\nabla_x \log p(x)$ of the data distribution, you can generate samples via Langevin dynamics - just follow the gradient of the log probability, adding noise to avoid getting trapped. They showed that training on data perturbed at multiple noise scales produced a score function that worked across the full noise range. Their NCSN model produced impressive results but still lagged behind GANs in FID.
Ho et al. (2020) connected these threads. They showed that a particular parameterization of the DDPM model - specifically, predicting the noise rather than the denoised image - gave a training objective equivalent to weighted denoising score matching. They replaced the slow noise prediction network with a modern U-Net with attention layers, scaled the training to 256x256 resolution, and achieved an FID of 3.17 on CIFAR-10 - competitive with the best GANs, without any adversarial training instability. DDPM was born, and with it, the modern era of diffusion-based generative AI.
1. The Forward Process - Adding Noise Systematically
Definition and Intuition
The forward process is a Markov chain that gradually destroys structure in a clean image $x_0$ by adding Gaussian noise at each step:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\,x_{t-1},\ \beta_t I\big)$$
Here $\{\beta_t\}_{t=1}^{T}$ is the noise schedule - a sequence of small positive constants controlling how much noise is added at each step. The mean $\sqrt{1 - \beta_t}\,x_{t-1}$ scales down the signal from the previous step. The variance $\beta_t I$ adds Gaussian noise. Over $T$ steps (Ho et al. used $T = 1000$), this progressively destroys all structure in $x_0$ until only noise remains.
Intuitively: imagine repeatedly photocopying a photo on a machine that adds a tiny amount of static each time. After many copies, all you have is static. The forward process does exactly this to a digital image, but in a mathematically controlled way.
The full forward joint distribution factors as:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
Key Property: Closed-Form at Any Timestep
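One of the most important mathematical properties of this particular forward process is that $q(x_t \mid x_0)$ has a closed form at any arbitrary timestep $t$ - you do not need to chain $t$ steps of sampling to get $x_t$.
Define $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$. Then:
$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1 - \bar\alpha_t) I\big)$$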
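Proof (by induction, using the fact that a sum of independent Gaussians is Gaussian):
Base case: At $t = 1$: $q(x_1 \mid x_0) = \mathcal{N}\big(x_1;\ \sqrt{\alpha_1}\,x_0,\ (1 - \alpha_1) I\big)$, which matches the formula since $\bar\alpha_1 = \alpha_1$.
Inductive step: Suppose $q(x_{t-1} \mid x_0) = \mathcal{N}\big(x_{t-1};\ \sqrt{\bar\alpha_{t-1}}\,x_0,\ (1 - \bar\alpha_{t-1}) I\big)$.
Apply one more forward step. We can write $x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1 - \alpha_t}\,\epsilon_t$ where $\epsilon_t \sim \mathcal{N}(0, I)$.
Substituting the inductive hypothesis $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1 - \bar\alpha_{t-1}}\,\epsilon'$, with $\epsilon' \sim \mathcal{N}(0, I)$ independent of $\epsilon_t$:
$$x_t = \sqrt{\alpha_t \bar\alpha_{t-1}}\,x_0 + \sqrt{\alpha_t (1 - \bar\alpha_{t-1})}\,\epsilon' + \sqrt{1 - \alpha_t}\,\epsilon_t$$
The two noise terms are independent zero-mean Gaussians, so their sum is Gaussian. The variance of the combined noise term is:
$$\alpha_t (1 - \bar\alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \bar\alpha_{t-1} = 1 - \bar\alpha_t$$
So $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. QED.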
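The induction can also be checked numerically. The following is a minimal sketch in plain Python, assuming the linear schedule from Ho et al. with T = 1000; the helper names `linear_betas` and `chained_coefficients` are illustrative and not part of any library. It tracks the signal coefficient and noise variance one step at a time and compares them with the closed form.

```python
import math

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear schedule, stored 0-indexed: betas[0] corresponds to beta_1."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def chained_coefficients(betas, t):
    """Apply t forward steps symbolically.

    Tracks (signal, var) such that x_t = signal * x_0 + sqrt(var) * eps.
    One step maps signal -> sqrt(alpha) * signal and var -> alpha * var + beta.
    """
    signal, var = 1.0, 0.0
    for beta in betas[:t]:
        alpha = 1.0 - beta
        signal *= math.sqrt(alpha)
        var = alpha * var + beta
    return signal, var

betas = linear_betas()
t = 300
alpha_bar = math.prod(1.0 - b for b in betas[:t])  # closed-form cumulative product
signal, var = chained_coefficients(betas, t)

print(signal, math.sqrt(alpha_bar))  # signal coefficient: chained vs closed form
print(var, 1.0 - alpha_bar)          # noise variance: chained vs closed form
```

The two printed pairs should agree to floating-point precision for any choice of `t`, which is exactly what the inductive step claims.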
Reparameterized Sampling
This closed form gives us a crucial efficiency: we can sample $x_t$ directly from $x_0$ in one step using the reparameterization trick:
$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
This is the reparameterization used in every DDPM training iteration. We do not need to run $t$ sequential noising steps - we jump directly from the clean image to a noisy image at any desired noise level with a single scale-and-add operation. Without this closed form, training would require O(T) operations per sample. With it, training is O(1) per sample regardless of $t$.
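To make the O(T) versus O(1) contrast concrete, here is a small Monte Carlo sketch on a scalar "image". It assumes the linear schedule with T = 1000; `chained_sample` and `direct_sample` are hypothetical helper names. Both samplers should produce the same mean and variance for $x_t$, up to sampling noise.

```python
import math
import random
import statistics

def chained_sample(x0, betas, t, rng):
    """Run t sequential forward steps: O(t) work per sample."""
    x = x0
    for beta in betas[:t]:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

def direct_sample(x0, alpha_bar_t, rng):
    """Jump straight to x_t with the closed form: O(1) work per sample."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

rng = random.Random(0)
T, t, x0, n = 1000, 100, 1.5, 4000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar_t = math.prod(1.0 - b for b in betas[:t])

chained = [chained_sample(x0, betas, t, rng) for _ in range(n)]
direct = [direct_sample(x0, alpha_bar_t, rng) for _ in range(n)]

print("mean    ", statistics.mean(chained), statistics.mean(direct))
print("variance", statistics.variance(chained), statistics.variance(direct))
print("theory  ", math.sqrt(alpha_bar_t) * x0, 1.0 - alpha_bar_t)
```

The chained sampler does `t` Gaussian draws per sample while the direct sampler does one, which is why training code only ever uses the direct form.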
Forward Process Visualization
As $t$ increases, $\bar\alpha_t \to 0$, so the clean signal is scaled toward zero and the image becomes pure Gaussian noise:
t=0: x_0 (clean image, full detail, bar_alpha = 1.0; values below are for the linear schedule, T = 1000)
  ↓ add very small noise (beta_1 = 0.0001)
t=100: x_100 (slightly noisy, bar_alpha ≈ 0.90)
  ↓ add more noise
t=500: x_500 (heavy noise, structure fading, bar_alpha ≈ 0.08)
  ↓ add more noise
t=800: x_800 (nearly pure noise, mostly indistinguishable, bar_alpha ≈ 0.002)
  ↓ add final noise
t=T: x_T ≈ N(0, I) (pure noise, no identifiable structure)
The noise schedule controls how quickly this destruction happens - a design choice with real consequences for generation quality.
2. The Noise Schedule - Linear vs Cosine
The noise schedule determines the rate at which signal is destroyed. This is not just a technical detail - the choice of schedule significantly affects training efficiency and sample quality, especially at high resolutions.
Linear Schedule (Ho et al. 2020)
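Ho et al. used a linear schedule increasing uniformly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$:
$$\beta_t = \beta_1 + \frac{t - 1}{T - 1}\,(\beta_T - \beta_1)$$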
The small $\beta_1$ means very little noise is added in the first few steps, and the gradual increase ensures a smooth transition to pure noise by step $T$. Ho et al. used this schedule for all of their experiments, including 256x256 LSUN.
Problem: Nichol and Dhariwal (2021) found the linear schedule suboptimal at 32x32 and 64x64. It destroys information faster than necessary, so the end of the forward process is too noisy to contribute much to sample quality. Many training steps fall in a regime where the image is nearly pure noise, providing little useful training signal. The model sees too many "already destroyed" images and too few "partially corrupted" ones.
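Signal-to-noise ratio at the midpoint (linear): with $\mathrm{SNR}(t) = \bar\alpha_t / (1 - \bar\alpha_t)$ and $\bar\alpha_{500} \approx 0.08$ for the schedule above, $\mathrm{SNR}(500) \approx 0.09$. By step 600, the image has lost most of its structure.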
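The signal-to-noise profile of the linear schedule can be tabulated directly. This is a small sketch in plain Python, assuming $\beta_1 = 10^{-4}$, $\beta_T = 0.02$, and $T = 1000$; the function names are illustrative.

```python
import math

def linear_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for the linear schedule (index 0 is t=1)."""
    alpha_bars, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def snr(alpha_bar):
    """Signal-to-noise ratio of x_t: squared signal scale over noise variance."""
    return alpha_bar / (1.0 - alpha_bar)

alpha_bars = linear_alpha_bars()
for t in (100, 300, 500, 700, 900):
    ab = alpha_bars[t - 1]
    print(f"t={t:4d}  alpha_bar={ab:.4f}  SNR={snr(ab):.4f}")
```

Reading down the printed table shows how early the SNR falls below 1 under this schedule, which is the behavior the cosine schedule is designed to soften.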
Cosine Schedule (Nichol and Dhariwal 2021)
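Nichol and Dhariwal proposed a cosine schedule defined through $\bar\alpha_t$:
$$\bar\alpha_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right), \qquad \beta_t = 1 - \frac{\bar\alpha_t}{\bar\alpha_{t-1}}$$
where $s = 0.008$ is a small offset that prevents $\beta_t$ from being too small near $t = 0$. Nichol and Dhariwal found that with extremely small amounts of noise at the start of the process, the network struggled to predict $\epsilon$ accurately.
The cosine schedule has a characteristic shape: $\bar\alpha_t$ changes slowly near $t = 0$, falls almost linearly through the middle of the process, and flattens out again as it approaches zero near $t = T$. This ensures: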
- Early timesteps are meaningful: the model sees images with small but non-trivial amounts of noise, learning fine-grained denoising
- Middle timesteps are information-rich: the transition zone from structured to unstructured gets more training steps
- Late timesteps are efficient: the model does not waste steps on images that are already essentially pure noise
Nichol and Dhariwal report better log-likelihood and FID with the cosine schedule than with the linear schedule on CIFAR-10 and ImageNet 64x64. The best schedule depends on resolution and model, so in practice it is treated as a hyperparameter (or learned) rather than a fixed default.
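The cosine schedule is short enough to write out in full. The sketch below uses plain Python with $T = 1000$ and $s = 0.008$; the cap `max_beta=0.999` follows the clipping Nichol and Dhariwal describe for avoiding a singular final step, and the function names are illustrative.

```python
import math

def cosine_alpha_bars(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0) for t = 0..T, with f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2)."""
    def f(t):
        return math.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

def betas_from_alpha_bars(alpha_bars, max_beta=0.999):
    """beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped from above."""
    return [min(1.0 - alpha_bars[t] / alpha_bars[t - 1], max_beta)
            for t in range(1, len(alpha_bars))]

alpha_bars = cosine_alpha_bars()          # length T + 1, alpha_bars[0] == 1
betas = betas_from_alpha_bars(alpha_bars)  # length T, betas[0] is beta_1

for t in (100, 300, 500, 700, 900):
    print(f"t={t:4d}  alpha_bar={alpha_bars[t]:.4f}  beta={betas[t - 1]:.5f}")
```

Because the schedule is defined through $\bar\alpha_t$ rather than $\beta_t$, the betas are a derived quantity; only the last few steps hit the clipping cap.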
3. The Reverse Process - Learning to Denoise
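The reverse process is what we want to learn. Starting from $x_T \sim \mathcal{N}(0, I)$, we want to progressively denoise to recover $x_0$. Each reverse step is modeled as a Gaussian:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$
The functions $\mu_\theta$ and $\Sigma_\theta$ are parameterized by neural networks (in practice, a single U-Net). The full generative model is:
$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$
where $p(x_T) = \mathcal{N}(x_T; 0, I)$ is the prior - pure Gaussian noise.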
The True Reverse Posterior
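A key mathematical property: the true reverse posterior $q(x_{t-1} \mid x_t, x_0)$ is tractable when conditioned on $x_0$. Using Bayes' theorem and the Gaussian forward process, it is also Gaussian:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\big)$$
where:
$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\,x_t, \qquad \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t$$
This posterior mean is a weighted combination of $x_0$ (the clean image, which we do not know at inference time) and $x_t$ (the current noisy image, which we do have). The variance $\tilde\beta_t$ is entirely determined by the noise schedule - no learning needed for the variance in the original DDPM formulation.
The learned $p_\theta(x_{t-1} \mid x_t)$ should approximate this posterior. But since $x_0$ is unknown at inference time, we must parameterize $\mu_\theta$ in terms of what the network predicts from $x_t$ alone.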
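The posterior formulas can be sanity-checked against a direct application of Bayes' rule for Gaussians. The sketch below works on scalars with made-up schedule values ($\alpha_t = 0.99$, $\bar\alpha_{t-1} = 0.60$); the function names are illustrative.

```python
import math

def posterior_params(x0, xt, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Closed-form mean and variance of q(x_{t-1} | x_t, x_0) for scalar inputs."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    var = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t
    return coef_x0 * x0 + coef_xt * xt, var

def posterior_by_bayes(x0, xt, alpha_t, alpha_bar_prev):
    """Same posterior via precision-weighted combination of prior and likelihood."""
    beta_t = 1.0 - alpha_t
    # Prior on x_{t-1}: q(x_{t-1} | x_0) = N(sqrt(alpha_bar_prev) * x0, 1 - alpha_bar_prev).
    prior_mean, prior_var = math.sqrt(alpha_bar_prev) * x0, 1.0 - alpha_bar_prev
    # Likelihood in x_{t-1}: x_t ~ N(sqrt(alpha_t) * x_{t-1}, beta_t).
    precision = 1.0 / prior_var + alpha_t / beta_t
    mean = (prior_mean / prior_var + math.sqrt(alpha_t) * xt / beta_t) / precision
    return mean, 1.0 / precision

alpha_t, alpha_bar_prev = 0.99, 0.60          # illustrative values, not from a real schedule
alpha_bar_t = alpha_t * alpha_bar_prev
closed = posterior_params(0.8, -0.3, alpha_t, alpha_bar_t, alpha_bar_prev)
bayes = posterior_by_bayes(0.8, -0.3, alpha_t, alpha_bar_prev)
print(closed)
print(bayes)
```

Both routes give the same mean and variance, and the variance is always below $\beta_t$, which is why $\tilde\beta_t$ and $\beta_t$ act as lower and upper choices for the reverse-step variance.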
4. The ELBO Derivation
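To train the reverse process, we maximize the log-likelihood $\log p_\theta(x_0)$. Since this is intractable (it involves marginalizing over all intermediate steps $x_{1:T}$), we maximize the ELBO (Evidence Lower BOund):
$$\log p_\theta(x_0) \ge \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$$
Expanding by substituting the factored forms of $p_\theta(x_{0:T})$ and $q(x_{1:T} \mid x_0)$:
$$\mathrm{ELBO} = \mathbb{E}_q\big[\log p_\theta(x_0 \mid x_1)\big] - D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big) - \sum_{t=2}^{T} \mathbb{E}_q\Big[D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\Big]$$
Parsing each term:
- Reconstruction, $\mathbb{E}_q[\log p_\theta(x_0 \mid x_1)]$: How well does the final denoising step recover the clean image? Analogous to the reconstruction term in a VAE. In practice, treated as a Gaussian with fixed variance.
- Prior matching, $D_{KL}(q(x_T \mid x_0)\,\|\,p(x_T))$: How close is the fully noised image to the prior $\mathcal{N}(0, I)$? If the noise schedule is designed so that $\bar\alpha_T \approx 0$, this term is approximately zero and requires no optimization - it is determined by the fixed forward process, not the model. Ho et al. ignore this term entirely.
- Denoising terms, $D_{KL}(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t))$ (the main training signal): For each $t$ from 2 to $T$, how well does the learned reverse step match the true conditional reverse posterior $q(x_{t-1} \mid x_t, x_0)$?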
Since both distributions are Gaussian, each KL divergence has a closed form proportional to the squared difference between means. This is where the bulk of training occurs.
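The "squared difference between means" claim is easy to verify for one-dimensional Gaussians. The sketch below implements the general KL formula and its equal-variance special case; the numbers are arbitrary illustrations. When the two variances differ but are both fixed (as with $\tilde\beta_t$ and $\sigma_t^2 = \beta_t$), the extra terms do not depend on $\theta$, so the trainable part is still the squared mean difference.

```python
import math

def kl_gaussian(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for one-dimensional Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def kl_equal_variance(mu_q, mu_p, var):
    """Special case var_q == var_p: only the squared mean difference survives."""
    return (mu_q - mu_p) ** 2 / (2.0 * var)

general = kl_gaussian(0.30, 0.01, 0.25, 0.01)
special = kl_equal_variance(0.30, 0.25, 0.01)
print(general, special)

# With fixed but unequal variances, the gap between two candidate means is
# still governed purely by the squared mean difference term.
gap = kl_gaussian(0.30, 0.008, 0.20, 0.01) - kl_gaussian(0.30, 0.008, 0.25, 0.01)
print(gap, (0.10 ** 2 - 0.05 ** 2) / (2.0 * 0.01))
```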
5. The Simplified Training Objective
From ELBO to Noise Prediction
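Ho et al. made a critical practical choice: reparameterize the mean in terms of the noise $\epsilon$ rather than predicting $\tilde\mu_t$ or $x_0$ directly.
From the forward process: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$, so we can solve for $x_0$:
$$x_0 = \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1 - \bar\alpha_t}\,\epsilon\right)$$
Substituting this expression into the true posterior mean $\tilde\mu_t(x_t, x_0)$:
$$\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon\right)$$
So if the network predicts $\epsilon_\theta(x_t, t)$, we can compute the optimal mean:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right)$$
The denoising KL terms become, up to a constant independent of $\theta$:
$$\mathbb{E}_{x_0, \epsilon}\left[\frac{\beta_t^2}{2 \sigma_t^2\,\alpha_t\,(1 - \bar\alpha_t)}\,\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,\ t\big)\big\|^2\right]$$
Ho et al. dropped the timestep-dependent weighting coefficient and found that training with the simplified objective worked better in practice:
$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,\ t\big)\big\|^2\right]$$
This is the DDPM training objective in its final form. Sample a clean image $x_0$, pick a random timestep $t$, add noise $\epsilon$ to get $x_t$, ask the network to predict $\epsilon$, measure MSE. Simple, elegant, powerful.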
Why Predict Noise, Not the Clean Image?
You might reasonably ask: why predict $\epsilon$ rather than predicting $x_0$ directly? Both are mathematically equivalent - given $x_t$ and $\epsilon_\theta$, we can compute $\hat{x}_0 = (x_t - \sqrt{1 - \bar\alpha_t}\,\epsilon_\theta) / \sqrt{\bar\alpha_t}$.
The empirical answer: Ho et al. report that predicting $x_0$ directly led to worse sample quality in their early experiments, and their ablation shows $\epsilon$-prediction trained with the simplified objective achieving the best FID.
The intuitive explanation: at high noise levels ($t$ large, $\bar\alpha_t$ small), predicting $x_0$ requires reconstructing a clean image from nearly pure noise. The target has extremely high variance - many different clean images are consistent with the same noisy observation. The gradient signal is therefore very noisy. Predicting $\epsilon$ is always predicting from a fixed, unit-variance distribution regardless of noise level. The regression target is well-behaved at every $t$.
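The algebraic equivalence, and the way errors transfer between the two parameterizations, can be seen in a few lines. This is a scalar sketch with arbitrary values; `x0_from_eps` and `eps_from_x0` are illustrative names. It shows only the algebra, not training dynamics: a fixed error in the noise estimate moves the implied $x_0$ by a factor of $\sqrt{(1 - \bar\alpha_t) / \bar\alpha_t}$, which grows sharply at high noise levels.

```python
import math

def x0_from_eps(xt, eps, alpha_bar_t):
    """Invert x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps for x0."""
    return (xt - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)

def eps_from_x0(xt, x0, alpha_bar_t):
    """Invert the same relation for eps."""
    return (xt - math.sqrt(alpha_bar_t) * x0) / math.sqrt(1.0 - alpha_bar_t)

def x0_shift(alpha_bar_t, eps_error, x0=0.7, eps=-1.2):
    """How far the implied x0 moves when the noise estimate is off by eps_error."""
    xt = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps
    return abs(x0_from_eps(xt, eps + eps_error, alpha_bar_t) - x0)

for alpha_bar_t in (0.9, 0.5, 0.01):
    shift = x0_shift(alpha_bar_t, eps_error=0.1)
    print(f"alpha_bar={alpha_bar_t:.2f}  x0 shift for a 0.1 error in eps: {shift:.3f}")
```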
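The theoretical connection: predicting $\epsilon$ is mathematically equivalent to estimating the score function of the noisy data distribution, since:
$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1 - \bar\alpha_t} = -\frac{\epsilon}{\sqrt{1 - \bar\alpha_t}}, \qquad \text{so} \qquad \epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar\alpha_t}\;\nabla_{x_t} \log q(x_t)$$
This connects DDPM to the score matching framework of Song and Ermon - the theoretical reason why the two approaches learn essentially the same quantity despite different derivations.
Why the simplified objective works better than the weighted one: the weighting term $\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar\alpha_t)}$ is large at small $t$ (where the step is small and the denoising task is easy) and small at large $t$ (where the step is large and the denoising task is hard). The simplified objective, by dropping this weight, places equal emphasis on all timesteps. Relative to the ELBO, that down-weights the easy low-noise steps so the network can focus on the harder ones. Empirically this produces better FID - the model learns to handle both easy and hard timesteps equally well.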
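To see how uneven the ELBO weighting is, the coefficient can be evaluated along the schedule. The sketch below assumes the linear schedule with $T = 1000$ and the fixed variance $\sigma_t^2 = \beta_t$; `elbo_weight` is an illustrative helper name.

```python
import math

def elbo_weight(beta_t, alpha_bar_t, sigma2_t):
    """Coefficient on ||eps - eps_theta||^2 in the ELBO denoising term at step t."""
    alpha_t = 1.0 - beta_t
    return beta_t ** 2 / (2.0 * sigma2_t * alpha_t * (1.0 - alpha_bar_t))

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

weights, alpha_bar = [], 1.0
for beta in betas:
    alpha_bar *= 1.0 - beta
    weights.append(elbo_weight(beta, alpha_bar, sigma2_t=beta))  # sigma_t^2 = beta_t

for t in (2, 10, 100, 500, 1000):
    print(f"t={t:4d}  ELBO weight={weights[t - 1]:.4f}  (simplified objective uses 1)")
```

Under these assumptions the weight at the first few steps is more than an order of magnitude larger than at mid-range and late steps, so replacing it with a constant shifts emphasis away from the lowest-noise steps.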
6. The Sampling Algorithm
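At generation time, we start from $x_T \sim \mathcal{N}(0, I)$ and apply the learned reverse process $T$ times. Substituting the noise-prediction parameterization:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$
where $\sigma_t^2 = \beta_t$ (the choice in the original DDPM paper). At the final step $t = 1$, we set $z = 0$ to avoid adding noise to an almost-clean image.
Full Algorithm:
- Sample $x_T \sim \mathcal{N}(0, I)$
- For $t = T, T-1, \dots, 1$:
  a. If $t > 1$: sample $z \sim \mathcal{N}(0, I)$, else $z = 0$
  b. Compute $\epsilon_\theta(x_t, t)$ - one U-Net forward pass
  c. Update: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$
- Return $x_0$
This requires $T$ U-Net forward passes - 1000 for the original DDPM. This is the sampling bottleneck that DDIM (Lesson 04) reduces to 20-50 steps.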
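The sampling loop can be exercised without training anything by choosing toy data for which the ideal noise predictor is known in closed form. The sketch below is an assumption-laden illustration, not the lesson's implementation: the data are one-dimensional Gaussian, $\mathcal{N}(2, 0.5^2)$, so $\mathbb{E}[\epsilon \mid x_t]$ has an analytic expression that stands in for a trained $\epsilon_\theta$. Timesteps are 0-indexed, and the names `optimal_eps` and `ancestral_sample` are hypothetical.

```python
import math
import random
import statistics

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # linear schedule
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

DATA_MEAN, DATA_STD = 2.0, 0.5  # toy 1-D data distribution

def optimal_eps(xt, t):
    """E[eps | x_t] for Gaussian data; stands in for a trained eps_theta(x_t, t)."""
    ab = alpha_bars[t]
    marginal_var = ab * DATA_STD ** 2 + (1.0 - ab)  # variance of x_t
    return math.sqrt(1.0 - ab) * (xt - math.sqrt(ab) * DATA_MEAN) / marginal_var

def ancestral_sample(rng):
    """Run all T reverse steps for one scalar sample."""
    x = rng.gauss(0.0, 1.0)  # x_T ~ N(0, 1)
    for t in reversed(range(T)):
        eps_hat = optimal_eps(x, t)
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(alphas[t])
        z = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the final step
        x = mean + math.sqrt(betas[t]) * z          # sigma_t^2 = beta_t
    return x

rng = random.Random(0)
samples = [ancestral_sample(rng) for _ in range(500)]
print("sample mean:", statistics.mean(samples))
print("sample std: ", statistics.stdev(samples))
```

Starting from standard normal noise, the samples end up distributed close to the toy data distribution, which is the whole point of the reverse process. Each sample costs `T` evaluations of the noise predictor, mirroring the cost of `T` U-Net forward passes.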
Variance Choice: Fixed vs Learned
Ho et al. fixed $\sigma_t^2 = \beta_t$ (the upper bound on the posterior variance) rather than using the posterior variance $\tilde\beta_t$ (the lower bound). Nichol and Dhariwal (2021) showed that learning the variance - interpolating between $\beta_t$ and $\tilde\beta_t$ using the network's output - improves log-likelihood and allows faster sampling with fewer steps. Their improved DDPM parameterizes the variance as:
$$\Sigma_\theta(x_t, t) = \exp\big(v \log \beta_t + (1 - v) \log \tilde\beta_t\big)$$
where $v$ is a vector with one component per dimension, output by the network alongside the noise estimate.
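The interpolation happens in the log domain, which a scalar sketch makes easy to see. The schedule values below are illustrative rather than taken from a real run, and `interpolated_variance` is a hypothetical helper name.

```python
import math

def interpolated_variance(v, beta_t, beta_tilde_t):
    """exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t)) for a single component v."""
    return math.exp(v * math.log(beta_t) + (1.0 - v) * math.log(beta_tilde_t))

# Illustrative schedule values at some step t.
alpha_bar_prev, beta_t = 0.60, 0.01
alpha_bar_t = alpha_bar_prev * (1.0 - beta_t)
beta_tilde_t = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t

for v in (0.0, 0.5, 1.0):
    print(f"v={v:.1f}  variance={interpolated_variance(v, beta_t, beta_tilde_t):.6f}")
print("beta_tilde_t =", beta_tilde_t, " beta_t =", beta_t)
```

Setting `v = 0` recovers $\tilde\beta_t$ and `v = 1` recovers $\beta_t$; values in between give the geometric interpolation between the two bounds.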
7. The U-Net Backbone
The denoising network is a U-Net - originally proposed for biomedical image segmentation by Ronneberger et al. (2015). The U-Net is ideal for the DDPM denoising task for specific structural reasons.
Why U-Net for Diffusion?
Skip connections preserve fine details: the U-Net's skip connections pass feature maps directly from encoder to decoder at each resolution level. Without skip connections, the information bottleneck in the middle of the network would destroy high-frequency spatial detail. With skip connections, fine textures and edge information bypass the bottleneck and are available for the decoder to use in reconstruction.
Multi-scale reasoning: the encoder-decoder structure processes the image at multiple spatial resolutions simultaneously. The bottleneck (lowest resolution) captures global structure - overall composition, large-scale lighting, semantic content. The shallow layers (highest resolution) capture local detail - texture, fine edges, grain patterns. This multi-scale hierarchy mirrors how diffusion naturally operates: high-noise steps require understanding global structure, low-noise steps require understanding fine detail.
Self-attention at low resolutions: attention layers in the bottleneck and lower-resolution feature maps capture long-range spatial dependencies - allowing the model to ensure that, for example, a face is internally consistent even when the two eyes are far apart in the image.
Timestep conditioning via sinusoidal embeddings: the timestep $t$ must be communicated to the network so it knows which noise level it is operating at. This is done via sinusoidal position embeddings (borrowed from Transformers):
$$\mathrm{PE}(t, 2i) = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad \mathrm{PE}(t, 2i + 1) = \cos\!\left(\frac{t}{10000^{2i/d}}\right)$$
This embedding is projected through a small MLP and added to the feature maps inside each residual block (some later architectures inject it through adaptive group normalization instead) - the network's behavior changes smoothly with $t$, from predicting fine-detail noise at small $t$ to predicting large-scale structure at large $t$.
Architecture Summary
The Ho et al. U-Net for 32x32 images:
Input: (B, C, 32, 32)
┌─────────────────────────────────────────────────┐
│ Encoder │
│ conv(C → 128) → [32×32] │
│ ResBlock × 2 + Downsample → [16×16] │
│ ResBlock × 2 + Self-Attention + Down → [8×8] │
│ ResBlock × 2 + Downsample → [4×4] │
├─────────────────────────────────────────────────│
│ Bottleneck │
│ ResBlock + Self-Attention + ResBlock [4×4] │
├─────────────────────────────────────────────────│
│ Decoder (with skip connections from encoder) │
│ ResBlock × 2 + Upsample → [8×8] │
│ ResBlock × 2 + Self-Attention + Up → [16×16] │
│ ResBlock × 2 + Upsample → [32×32] │
│ conv(128 → C) │
└─────────────────────────────────────────────────┘
Output: epsilon_hat (B, C, 32, 32)
Each ResBlock receives the timestep embedding and incorporates it via: GroupNorm → SiLU → Conv → Add(t_emb_proj) → GroupNorm → SiLU → Conv.
8. Complete PyTorch Implementation
The following is a complete, runnable DDPM implementation with a U-Net denoiser, training loop, and sampling loop. It trains on MNIST to demonstrate the core mechanics on accessible compute.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


# ============================================================
# Sinusoidal timestep embedding
# ============================================================
class SinusoidalPosEmb(nn.Module):
    """
    Sinusoidal positional embedding for timestep conditioning.
    Same encoding as Transformer positional embeddings.
    Output shape: (batch, dim)
    """

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        device = t.device
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        # t: (batch,) → multiply with frequency basis
        emb = t[:, None].float() * emb[None, :]
        emb = torch.cat([emb.sin(), emb.cos()], dim=-1)
        return emb  # (batch, dim)
# ============================================================
# Residual block with timestep conditioning
# ============================================================
class ResidualBlock(nn.Module):
    """
    Residual block conditioned on timestep embedding.
    Uses GroupNorm + SiLU (better than BatchNorm for diffusion).
    """

    def __init__(self, in_channels, out_channels, time_emb_dim, num_groups=8):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_emb_dim, out_channels)
        )
        self.block1 = nn.Sequential(
            nn.GroupNorm(num_groups, in_channels),
            nn.SiLU(),
            nn.Conv2d(in_channels, out_channels, 3, padding=1)
        )
        self.block2 = nn.Sequential(
            nn.GroupNorm(num_groups, out_channels),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
        )
        # Residual connection handles channel mismatch
        self.residual_conv = (
            nn.Conv2d(in_channels, out_channels, 1)
            if in_channels != out_channels else nn.Identity()
        )

    def forward(self, x, t_emb):
        h = self.block1(x)
        # Add timestep embedding broadcast over H, W
        h = h + self.time_mlp(t_emb)[:, :, None, None]
        h = self.block2(h)
        return h + self.residual_conv(x)
# ============================================================
# Simplified U-Net for DDPM
# ============================================================
class UNet(nn.Module):
    """
    U-Net denoising network for DDPM.
    Architecture: Encoder → Bottleneck → Decoder with skip connections.
    Timestep conditioning injected at every ResBlock.
    """

    def __init__(
        self,
        in_channels=1,
        base_channels=64,
        channel_mults=(1, 2, 4),
        time_emb_dim=128,
        num_groups=8
    ):
        super().__init__()
        # Timestep embedding: sinusoidal → MLP → projected dim
        self.time_emb = nn.Sequential(
            SinusoidalPosEmb(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim * 4),
            nn.SiLU(),
            nn.Linear(time_emb_dim * 4, time_emb_dim)
        )
        # Initial projection: in_channels → base_channels
        self.init_conv = nn.Conv2d(in_channels, base_channels, 3, padding=1)

        # Encoder: progressively halve spatial resolution, increase channels
        self.down_blocks = nn.ModuleList()
        self.downsamplers = nn.ModuleList()
        skip_channels = [base_channels]  # track skip connection channel counts
        ch = base_channels
        for mult in channel_mults:
            out_ch = base_channels * mult
            self.down_blocks.append(ResidualBlock(ch, out_ch, time_emb_dim, num_groups))
            self.downsamplers.append(nn.Conv2d(out_ch, out_ch, 4, stride=2, padding=1))
            skip_channels.append(out_ch)
            ch = out_ch

        # Bottleneck: same resolution, deepens representation
        self.mid_block1 = ResidualBlock(ch, ch, time_emb_dim, num_groups)
        self.mid_block2 = ResidualBlock(ch, ch, time_emb_dim, num_groups)

        # Decoder: progressively double spatial resolution, decrease channels
        self.up_blocks = nn.ModuleList()
        self.upsamplers = nn.ModuleList()
        for mult in reversed(channel_mults):
            skip_ch = skip_channels.pop()  # retrieve matching encoder skip
            out_ch = base_channels * mult
            self.upsamplers.append(
                nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
            )
            # Input = current channels + skip connection channels
            self.up_blocks.append(
                ResidualBlock(ch + skip_ch, out_ch, time_emb_dim, num_groups)
            )
            ch = out_ch

        # Final output: recover input channels
        self.final_conv = nn.Sequential(
            nn.GroupNorm(num_groups, ch),
            nn.SiLU(),
            nn.Conv2d(ch, in_channels, 1)
        )

    def forward(self, x, t):
        """
        Args:
            x: noisy image (B, C, H, W)
            t: timestep indices (B,) as integers
        Returns:
            predicted noise (B, C, H, W)
        """
        t_emb = self.time_emb(t)
        x = self.init_conv(x)
        skips = [x]

        # Encoder pass - store all activations for skip connections
        for block, downsample in zip(self.down_blocks, self.downsamplers):
            x = block(x, t_emb)
            skips.append(x)
            x = downsample(x)

        # Bottleneck
        x = self.mid_block1(x, t_emb)
        x = self.mid_block2(x, t_emb)

        # Decoder pass - concatenate skip connections
        for upsample, block in zip(self.upsamplers, self.up_blocks):
            x = upsample(x)
            skip = skips.pop()
            # Concatenate along channel dimension
            x = torch.cat([x, skip], dim=1)
            x = block(x, t_emb)

        return self.final_conv(x)
# ============================================================
# DDPM - noise schedule, training loss, sampling
# ============================================================
class DDPM:
    """
    Full DDPM implementation.
    Supports linear and cosine noise schedules.
    Implements the simplified training objective and DDPM sampling.
    Timesteps are 0-indexed in code: index 0 corresponds to t = 1 in the math.
    """

    def __init__(self, T=1000, schedule='cosine', device='cuda'):
        self.T = T
        self.device = device
        if schedule == 'linear':
            # Original Ho et al. 2020 schedule
            betas = torch.linspace(1e-4, 0.02, T, device=device)
        elif schedule == 'cosine':
            # Nichol & Dhariwal 2021 cosine schedule
            # Destroys information more gradually than the linear schedule
            s = 0.008  # small offset keeps beta_t from being too small near t = 0
            steps = T + 1
            x = torch.linspace(0, T, steps, device=device)
            # f(t) = cos^2(pi/2 * (t/T + s) / (1 + s))
            alpha_bars = torch.cos(
                ((x / T) + s) / (1 + s) * math.pi / 2
            ) ** 2
            alpha_bars = alpha_bars / alpha_bars[0]  # normalize to f(0)=1
            # Derive betas from alpha_bars
            betas = 1 - alpha_bars[1:] / alpha_bars[:-1]
            betas = torch.clamp(betas, min=0.0001, max=0.9999)  # numerical safety
        else:
            raise ValueError(f"Unknown schedule: {schedule}")

        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)

        # Pre-compute all quantities needed for training and sampling
        self.betas = betas
        self.alphas = alphas
        self.alpha_bars = alpha_bars
        self.sqrt_alpha_bars = alpha_bars.sqrt()
        self.sqrt_one_minus_alpha_bars = (1 - alpha_bars).sqrt()
        self.sqrt_recip_alphas = (1.0 / alphas).sqrt()

    def q_sample(self, x_0, t, noise=None):
        """
        Sample x_t from x_0 using the closed-form forward process.
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
        This is the key efficiency: O(1) regardless of t.
        """
        if noise is None:
            noise = torch.randn_like(x_0)
        # Index pre-computed coefficients and reshape for broadcasting
        sqrt_ab = self.sqrt_alpha_bars[t][:, None, None, None]
        sqrt_one_minus_ab = self.sqrt_one_minus_alpha_bars[t][:, None, None, None]
        return sqrt_ab * x_0 + sqrt_one_minus_ab * noise

    def training_loss(self, model, x_0):
        """
        DDPM simplified training objective (equation 14 in Ho et al. 2020).
        Steps:
            1. Sample random timestep index t uniformly from {0, ..., T-1}
            2. Sample noise eps ~ N(0, I)
            3. Compute noisy image x_t via closed-form forward process
            4. Predict eps with model
            5. MSE loss between true and predicted noise
        """
        batch = x_0.shape[0]
        # Random timestep for each sample in the batch
        t = torch.randint(0, self.T, (batch,), device=self.device)
        noise = torch.randn_like(x_0)
        x_t = self.q_sample(x_0, t, noise)
        # Model predicts the noise - this is the key parameterization choice
        predicted_noise = model(x_t, t)
        # Simplified objective: unweighted MSE over all timesteps
        return F.mse_loss(predicted_noise, noise)

    @torch.no_grad()
    def p_sample(self, model, x_t, t_scalar):
        """
        One reverse denoising step: x_t → x_{t-1}.
        Implements the DDPM sampling algorithm (Algorithm 2 in Ho et al. 2020).
        """
        batch = x_t.shape[0]
        t = torch.full((batch,), t_scalar, device=self.device, dtype=torch.long)
        # Predict noise with the trained U-Net
        eps_pred = model(x_t, t)

        # Retrieve precomputed schedule values for this timestep
        alpha_t = self.alphas[t_scalar]
        sqrt_one_minus_ab = self.sqrt_one_minus_alpha_bars[t_scalar]
        beta_t = self.betas[t_scalar]
        sqrt_recip_alpha_t = self.sqrt_recip_alphas[t_scalar]

        # Compute reverse process mean:
        # mu_theta = (1/sqrt(alpha_t)) * (x_t - (1-alpha_t)/sqrt(1-alpha_bar_t) * eps)
        coeff = (1 - alpha_t) / sqrt_one_minus_ab
        mean = sqrt_recip_alpha_t * (x_t - coeff * eps_pred)

        if t_scalar > 0:
            # Add stochastic noise (sigma_t = sqrt(beta_t) in original DDPM)
            noise = torch.randn_like(x_t)
            x_prev = mean + beta_t.sqrt() * noise
        else:
            # Final step: no noise added
            x_prev = mean
        return x_prev

    @torch.no_grad()
    def sample(self, model, shape, verbose=False):
        """
        Full DDPM sampling: T reverse steps from Gaussian noise.
        Note: this takes T U-Net forward passes (1000 by default).
        Use DDIM sampler (Lesson 04) to reduce to 20-50 steps.
        """
        model.eval()
        x = torch.randn(shape, device=self.device)
        for t in reversed(range(self.T)):
            if verbose and t % 100 == 0:
                print(f"  Sampling step {self.T - t}/{self.T} (t={t})")
            x = self.p_sample(model, x, t)
        return x
# ============================================================
# EMA (Exponential Moving Average) of model weights
# ============================================================
class EMA:
    """
    Exponential Moving Average of model weights.
    Critical for DDPM - EMA model gives significantly better FID
    than online training weights.
    Decay = 0.9999 is standard (Ho et al. 2020).
    """

    def __init__(self, model, decay=0.9999):
        self.decay = decay
        # Create a separate copy of model parameters for EMA
        self.shadow = {}
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self, model):
        """Call after each gradient step."""
        for name, param in model.named_parameters():
            if param.requires_grad:
                # EMA update: shadow = decay * shadow + (1 - decay) * param
                self.shadow[name] = (
                    self.decay * self.shadow[name]
                    + (1 - self.decay) * param.data
                )

    def apply_to(self, model):
        """Copy EMA weights to model for evaluation."""
        for name, param in model.named_parameters():
            if param.requires_grad:
                param.data.copy_(self.shadow[name])
# ============================================================
# Full training loop
# ============================================================
def train_ddpm(
    num_epochs=50,
    batch_size=128,
    lr=2e-4,
    T=1000,
    schedule='cosine',
    device='cuda' if torch.cuda.is_available() else 'cpu'
):
    """
    Complete DDPM training loop on MNIST.
    Includes EMA, gradient clipping, cosine LR schedule.
    """
    # MNIST: 28x28 → resized to 32x32 for clean stride-2 downsampling
    transform = transforms.Compose([
        transforms.Resize(32),
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,))  # [0,1] → [-1,1]
    ])
    dataset = datasets.MNIST(
        root='./data', train=True, download=True, transform=transform
    )
    loader = DataLoader(
        dataset, batch_size=batch_size, shuffle=True,
        num_workers=4, pin_memory=True
    )

    # Model and optimizer
    model = UNet(
        in_channels=1,
        base_channels=64,
        channel_mults=(1, 2, 4),
        time_emb_dim=128
    ).to(device)
    ema = EMA(model, decay=0.9999)
    ddpm = DDPM(T=T, schedule=schedule, device=device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Cosine LR decay, stepped once per epoch
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=num_epochs
    )

    print(f"Training DDPM on {device}")
    print(f"  Model parameters: {sum(p.numel() for p in model.parameters()):,}")
    print(f"  Noise schedule: {schedule}")
    print(f"  Diffusion steps T: {T}")

    model.train()
    for epoch in range(num_epochs):
        total_loss = 0.0
        for batch_idx, (x_0, _) in enumerate(loader):
            x_0 = x_0.to(device)
            # Compute DDPM training loss
            loss = ddpm.training_loss(model, x_0)
            optimizer.zero_grad()
            loss.backward()
            # Gradient clipping prevents gradient explosion in deep U-Net
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            # Update EMA after every gradient step
            ema.update(model)
            total_loss += loss.item()
        scheduler.step()
        avg_loss = total_loss / len(loader)
        print(
            f"Epoch {epoch+1:3d}/{num_epochs} | "
            f"Loss: {avg_loss:.4f} | "
            f"LR: {scheduler.get_last_lr()[0]:.2e}"
        )
    return model, ddpm, ema
# ============================================================
# Noise schedule comparison
# ============================================================
def compare_schedules(T=1000):
    """
    Compare linear and cosine noise schedules.
    Shows bar_alpha_t at key timesteps and the
    signal-to-noise ratio profile.
    """
    import numpy as np

    # Linear schedule
    betas_linear = np.linspace(1e-4, 0.02, T)
    alphas_linear = 1 - betas_linear
    ab_linear = np.cumprod(alphas_linear)

    # Cosine schedule
    s = 0.008
    steps = np.linspace(0, T, T + 1)
    f = np.cos(((steps / T) + s) / (1 + s) * np.pi / 2) ** 2
    f = f / f[0]
    betas_cosine = 1 - f[1:] / f[:-1]
    betas_cosine = np.clip(betas_cosine, 0.0001, 0.9999)
    alphas_cosine = 1 - betas_cosine
    ab_cosine = np.cumprod(alphas_cosine)

    checkpoints = [100, 200, 300, 400, 500, 600, 700, 800, 900, 999]
    print("Timestep | Linear alpha_bar | Cosine alpha_bar | Cosine/Linear (higher = more signal)")
    print("-" * 80)
    for t in checkpoints:
        ratio = ab_cosine[t] / (ab_linear[t] + 1e-10)
        print(f"  t={t:4d} | {ab_linear[t]:.4f} | {ab_cosine[t]:.4f} | {ratio:.2f}x")
    print()
    print("Interpretation: cosine schedule retains more signal at mid-range timesteps.")
    print("This means more informative training steps for textures and fine details.")
    print()

    # SNR comparison
    snr_linear = ab_linear / (1 - ab_linear + 1e-10)
    snr_cosine = ab_cosine / (1 - ab_cosine + 1e-10)
    print(f"Linear SNR at t=200: {snr_linear[199]:.3f}")
    print(f"Cosine SNR at t=200: {snr_cosine[199]:.3f}")
    print(f"Cosine has {snr_cosine[199]/snr_linear[199]:.1f}x higher SNR at t=200 → stronger training signal")
# ============================================================
# Run training and generate samples
# ============================================================
if __name__ == '__main__':
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # Train
    model, ddpm, ema = train_ddpm(
        num_epochs=100,
        batch_size=128,
        device=device
    )
    # Compare schedules
    compare_schedules()
    # Generate with online weights
    print("\nGenerating samples with online model weights...")
    model.eval()
    samples_online = ddpm.sample(model, shape=(16, 1, 32, 32), verbose=True)
    # Generate with EMA weights (typically better FID)
    print("\nGenerating samples with EMA weights...")
    ema_model = UNet(in_channels=1).to(device)
    ema.apply_to(ema_model)
    ema_model.eval()
    samples_ema = ddpm.sample(ema_model, shape=(16, 1, 32, 32), verbose=True)
    # Unnormalize from [-1, 1] to [0, 1] and keep the rescaled tensors for saving
    rescaled = {}
    for name, samples in [("online", samples_online), ("ema", samples_ema)]:
        samples = ((samples + 1) / 2).clamp(0, 1)
        rescaled[name] = samples
        print(f"\n{name} samples: shape={samples.shape}, "
              f"min={samples.min():.3f}, max={samples.max():.3f}")
    try:
        from torchvision.utils import save_image
        # save_image expects values in [0, 1], so pass the rescaled samples
        save_image(rescaled["ema"], 'ddpm_ema_samples.png', nrow=4)
        print("EMA samples saved to ddpm_ema_samples.png")
    except ImportError:
        print("Install torchvision to save samples as grid")
9. FID Score - What It Measures and What Counts as Good
FID (Fréchet Inception Distance) is the standard metric for evaluating generative model quality. Understanding it is essential for DDPM interviews.
How FID Works
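- Generate 50,000 images from the model
- Collect 50,000 real images from the reference dataset (for CIFAR-10 the standard reference is the 50,000-image training set; the test set has only 10,000 images)
- Run both sets through an Inception-v3 network, extract the 2048-dimensional penultimate layer features
- Fit multivariate Gaussians $\mathcal{N}(\mu_r, \Sigma_r)$ and $\mathcal{N}(\mu_g, \Sigma_g)$ to the real and generated features
- Compute the Fréchet distance between these Gaussians:

$$\text{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$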
Lower FID = better. FID = 0 means perfect match. Real images have FID ≈ 0 with themselves (up to sampling noise).
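The Gaussian-fitting and distance steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference FID implementation: it assumes the Inception-v3 features have already been extracted into two arrays of shape `(N, D)`, and the name `frechet_distance` is hypothetical. It evaluates the trace of the matrix square root through the eigenvalues of the covariance product, which are real and non-negative when both covariances are positive semi-definite.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (N, D) feature arrays.

    Computes ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}).
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Tr((S_r S_g)^{1/2}) equals the sum of square roots of the eigenvalues
    # of S_r S_g; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)

# Toy usage with random stand-in "features" (real FID uses 2048-d Inception features)
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))
shifted = real + 1.0  # same covariance, mean moved by 1 in each of 8 dimensions
print(frechet_distance(real, real))     # ~0: identical sets
print(frechet_distance(real, shifted))  # ~8: pure mean-shift term
```

In the toy usage the covariances are identical, so the whole distance comes from the mean term. A mode-collapsed generator would show up mainly in the trace term instead.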
What FID Measures
FID measures both quality and diversity. A model that generates only high-quality images of one mode (e.g., perfect cats, no dogs) will have high FID because its feature mean $\mu_g$ and covariance $\Sigma_g$ do not match the full diversity of the reference set.
DDPM Benchmarks on CIFAR-10
| Model | FID (CIFAR-10) | Notes |
|---|---|---|
| DDPM (Ho et al. 2020) | 3.17 | Original paper, 1000 steps |
| Improved DDPM (Nichol & Dhariwal 2021) | 2.90 | Cosine schedule + learned variance |
| StyleGAN2 + ADA (GAN baseline) | 2.92 | Best GAN at that time |
| DALL-E (autoregressive) | 17.9 | Much weaker than diffusion |
| Score-based (Song & Ermon 2020) | 3.21 | Score matching baseline |
FID of 3.17 means DDPM essentially matched the best GANs with much more stable training. A well-configured DDPM on CIFAR-10 should achieve FID below 4.0. Above 10.0 suggests a training bug (wrong normalization, incorrect schedule, or insufficient training time).
1000 Steps vs Fewer Steps
DDPM sampling uses 1000 steps because the model was trained with a 1000-step schedule. Using fewer steps with the DDPM (ancestral) sampler introduces discretization error - the assumption that each reverse step is approximately Gaussian holds only when individual steps are small, and it breaks down as the steps get larger. FID degrades sharply with fewer than ~100 steps in the DDPM sampler.
DDIM (Lesson 04) uses a different sampling formula that allows 20-50 steps on the same trained model by treating the reverse process as an ODE rather than a Markov chain. This is why you should always prefer DDIM or DPM-Solver for inference - the trained model is identical, only the sampler changes.
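To make "only the sampler changes" concrete, here is a hedged sketch of the deterministic DDIM update ($\eta = 0$) on a strided subset of timesteps. The function names `ddim_step` and `strided_timesteps` are illustrative, not from any library, and the noise prediction is supplied by hand here; in a real sampler it would come from the trained network. Lesson 04 covers the derivation.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from a noisier to a cleaner timestep."""
    # Clean-image estimate implied by the noise prediction
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise the estimate to the target noise level, reusing the predicted noise
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

def strided_timesteps(T, num_steps):
    """Evenly spaced subset of {0, ..., T-1}, returned in descending order."""
    return np.linspace(0, T - 1, num_steps).round().astype(int)[::-1]

# Sanity check with an oracle noise prediction on the linear schedule
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))
eps = rng.normal(size=(4, 4))
x_800 = np.sqrt(alpha_bar[800]) * x0 + np.sqrt(1.0 - alpha_bar[800]) * eps

# With the true noise, one large jump lands exactly on the closed-form x_600
x_600 = ddim_step(x_800, eps, alpha_bar[800], alpha_bar[600])
expected = np.sqrt(alpha_bar[600]) * x0 + np.sqrt(1.0 - alpha_bar[600]) * eps
print(np.allclose(x_600, expected))  # True
print(strided_timesteps(1000, 50)[:3])
```

The update never samples fresh noise, which is why large jumps between timesteps stay consistent as long as the noise prediction is accurate.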
10. YouTube Resources
| Title | Channel | Why Watch |
|---|---|---|
| Diffusion Models - Beat GANs | Yannic Kilcher | Dhariwal & Nichol paper - classifier guidance and improved DDPM |
| DDPM from Scratch (PyTorch) | Outlier | Step-by-step code implementation with intuition |
| Score-Based Generative Modeling | Yang Song | The connection between DDPM and score matching, from the author |
| Denoising Diffusion Probabilistic Models | AI Coffee Break | Mathematical walkthrough of ELBO derivation |
| Improved DDPM - Nichol & Dhariwal | Yannic Kilcher | Cosine schedule, learned variance, and path to ADM |
11. Production Engineering Notes
:::tip Model size and resolution scaling
The original DDPM used a U-Net with ~35M parameters for 32x32 images (CIFAR-10). For 256x256 images (ImageNet), the ADM model uses ~554M parameters with self-attention at the 32x32, 16x16, and 8x8 resolutions. For 512x512 (Stable Diffusion), computation is moved to a 64x64 latent space - reducing the number of spatial positions by 64x. A common rule of thumb: add attention at all resolution levels where the spatial dimension is at most 32. At or below this threshold, global reasoning is cheap and quality-critical.
:::
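The "attention is cheap at low resolution" argument is simple arithmetic: self-attention over an H x W feature map treats every spatial position as a token, so the attention matrix has (H*W)^2 entries. The helper below is a hypothetical illustration that counts entries per head and per image only; it ignores channel width and implementation details.

```python
def attention_cost(resolution):
    """Token count and attention-matrix entries for a square feature map."""
    tokens = resolution * resolution
    return tokens, tokens * tokens

for res in (64, 32, 16, 8):
    tokens, entries = attention_cost(res)
    print(f"{res:2d}x{res:<2d}  tokens={tokens:5d}  attention entries={entries:,}")

# Spatial positions saved by moving from 512x512 pixels to a 64x64 latent grid
print((512 * 512) // (64 * 64))  # 64
```

Each halving of resolution cuts the attention matrix by 16x, which is why attention is usually confined to the coarser levels of the U-Net.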
:::note Variance schedule matters more than you expect
The linear schedule works well for low-resolution images (32x32, 64x64). For 256x256 and above, the cosine schedule is significantly better because the linear schedule destroys high-frequency detail too aggressively in early timesteps, leaving insufficient training signal for learning fine-grained textures. If you are training on high-resolution data and samples look blurry or lack texture, switch to the cosine schedule before trying anything else.
:::
:::note GroupNorm vs BatchNorm for diffusion models
DDPM uses GroupNorm (typically 8 or 32 groups) rather than BatchNorm. The reason: BatchNorm computes statistics across the batch dimension, but diffusion models process images at many different noise levels within the same batch. A timestep-mixed batch has wildly different mean and variance statistics depending on the noise level, which breaks BatchNorm's assumptions. GroupNorm normalizes within each channel group independently of batch statistics - it is stable regardless of batch composition.
:::
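The batch-independence claim can be checked with a small NumPy sketch. Both functions are simplified stand-ins written for this illustration (no learnable scale or shift, training-mode statistics only), not the PyTorch layers themselves. The same sample is placed in two batches whose other members have very different noise scales.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize each sample over (channels in group, H, W); no batch statistics."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

def batch_norm(x, eps=1e-5):
    """Training-mode batch norm: per-channel statistics over (batch, H, W)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
sample = rng.normal(size=(1, 8, 4, 4))          # the sample we track
mates_a = 0.1 * rng.normal(size=(3, 8, 4, 4))   # batch-mates at a low noise scale
mates_b = 5.0 * rng.normal(size=(3, 8, 4, 4))   # batch-mates at a high noise scale

batch_a = np.concatenate([sample, mates_a])
batch_b = np.concatenate([sample, mates_b])

# GroupNorm: output for the tracked sample ignores its batch-mates
print(np.allclose(group_norm(batch_a, 4)[0], group_norm(batch_b, 4)[0]))  # True
# BatchNorm: output for the same sample changes with batch composition
print(np.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0]))        # False
```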
12. Common Mistakes
:::danger Forgetting the closed-form forward process in training
A common implementation bug: computing $x_t$ by running $t$ sequential noising steps in a loop. This is O(T) per training sample - up to 1000x slower than the closed-form approach. The closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ computes $x_t$ directly in one operation regardless of $t$. This is what makes DDPM training efficient - every training step needs only one lookup into the precomputed $\bar\alpha_t$ table and one draw of $\epsilon$ before the network forward pass. Always use the closed form.
:::
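A quick way to convince yourself that the two approaches agree is to compare their sample statistics. The sketch below assumes the linear schedule with T = 1000 and uses hypothetical helper names (`q_sample_loop`, `q_sample_closed_form`); here `t` counts the number of noising steps applied, so the lookup index is `t - 1`.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear schedule
alpha_bar = np.cumprod(1.0 - betas)      # alpha_bar[t - 1] after t steps

def q_sample_loop(x0, t, rng):
    """Slow path: apply t sequential noising steps, O(t) work."""
    x = x0.copy()
    for s in range(t):
        noise = rng.normal(size=x.shape)
        x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * noise
    return x

def q_sample_closed_form(x0, t, rng):
    """Fast path: jump straight to step t with a single noise draw."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps

rng = np.random.default_rng(0)
x0 = np.ones(100_000)   # constant "image" so mean and std are easy to read
t = 300
slow = q_sample_loop(x0, t, rng)
fast = q_sample_closed_form(x0, t, rng)
print(f"loop:        mean={slow.mean():.3f}  std={slow.std():.3f}")
print(f"closed form: mean={fast.mean():.3f}  std={fast.std():.3f}")
```

Both paths should report the same mean and standard deviation up to sampling noise, while the closed form does a constant amount of work regardless of `t`.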
:::danger Not normalizing inputs to the correct range
The DDPM model expects inputs in $[-1, 1]$. If your images are in $[0, 1]$ (the standard output of `transforms.ToTensor()`), rescale them: `x = 2 * x - 1`. Failing to do this shifts the clean image distribution away from the zero-centered Gaussian noise added during the forward process. At high noise levels, the noisy image should look like pure zero-mean Gaussian noise - but if the clean image is in $[0, 1]$ instead of $[-1, 1]$, the data term $\sqrt{\bar\alpha_t}\,x_0$ has a positive mean, so every $x_t$ is systematically offset from zero. This causes subtle training failures that manifest as slightly off-center sample distributions.
:::
:::warning The noise schedule and T are tightly coupled
If you change $T$ from 1000 to 500 without adjusting the noise schedule, the model will fail. The noise schedule defines how quickly $\bar\alpha_t \to 0$. A schedule designed for $T = 1000$ reaches near-zero $\bar\alpha_t$ only at step 1000. At step 500, $\bar\alpha_t$ is still around 0.08 for the linear schedule (signal coefficient $\sqrt{\bar\alpha_t} \approx 0.28$) - the model has never seen pure noise during training. Sampling with $T = 500$ steps from this model starts the reverse process from pure Gaussian noise that it was never trained on, and the resulting images come out degraded and noisy. Always retrain or re-derive the schedule when changing $T$.
:::
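The coupling is easy to check numerically by looking at the terminal value of $\bar\alpha_T$ for a few candidate schedules. The sketch below is illustrative: `terminal_alpha_bar` is a hypothetical helper, and scaling the endpoints by 1000 / T is one heuristic for re-deriving a linear schedule, stated here as an assumption rather than the only option.

```python
import numpy as np

def terminal_alpha_bar(betas):
    """Fraction of signal variance left after applying every step in `betas`."""
    return float(np.prod(1.0 - betas))

T_ref, T_new = 1000, 500
full = np.linspace(1e-4, 0.02, T_ref)

candidates = {
    "linear, T=1000 (reference)": full,
    "first 500 steps of the T=1000 schedule": full[:T_new],
    "same endpoints, 500 steps": np.linspace(1e-4, 0.02, T_new),
    # Assumed heuristic: scale both endpoints by T_ref / T_new
    "endpoints scaled by 1000/T, 500 steps": np.linspace(
        1e-4 * T_ref / T_new, 0.02 * T_ref / T_new, T_new
    ),
}

for name, betas in candidates.items():
    print(f"{name:42s} alpha_bar_T = {terminal_alpha_bar(betas):.2e}")
```

Only schedules whose terminal $\bar\alpha_T$ is close to zero justify starting the reverse process from pure Gaussian noise.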
:::warning EMA weights are essential for evaluation - not optional
DDPM models must be evaluated using EMA weights (decay 0.9999), not the online training weights. The instantaneous training weights fluctuate around the loss minimum - they produce noisier, lower-quality samples. The EMA averages over many steps, providing a smoother approximation to the optimal weights. In practice, the EMA model achieves 0.5-1.5 lower FID than the online model at the same training step. If you are evaluating DDPM quality and getting unexpectedly poor FID, check whether you are using EMA weights for sampling.
:::
13. Interview Q&A
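Q1: Derive the closed-form distribution $q(x_t \mid x_0)$ for DDPM.
Define $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$. We claim $q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1 - \bar\alpha_t) I\right)$.
Proof by induction. Base case: $q(x_1 \mid x_0) = \mathcal{N}\left(\sqrt{1 - \beta_1}\,x_0,\ \beta_1 I\right)$, which matches since $\bar\alpha_1 = \alpha_1 = 1 - \beta_1$.
Inductive step: suppose $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1 - \bar\alpha_{t-1}}\,\epsilon_{t-1}$. Then $x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1 - \alpha_t}\,\epsilon_t$. Substituting: $x_t = \sqrt{\alpha_t \bar\alpha_{t-1}}\,x_0 + \sqrt{\alpha_t (1 - \bar\alpha_{t-1})}\,\epsilon_{t-1} + \sqrt{1 - \alpha_t}\,\epsilon_t$. The noise terms combine (sum of independent Gaussians) with variance $\alpha_t (1 - \bar\alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \bar\alpha_{t-1} = 1 - \bar\alpha_t$. So $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. QED.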
Q2: Why does Ho et al. predict $\epsilon$ rather than $x_0$ or the posterior mean $\tilde\mu_t$?
Three reasons. Practical: at high noise levels, predicting $x_0$ means reconstructing a clean image from nearly pure noise - very high variance target, noisy gradients. Predicting $\epsilon$ is a regression problem whose target has unit variance at every timestep. Mathematical: the ELBO denoising terms reduce to weighted MSE on $\epsilon$; Ho et al. found the simplified unweighted objective (dropping timestep-dependent weights) performs better empirically. Theoretical: predicting $\epsilon$ is equivalent to estimating the score $\nabla_{x_t} \log q(x_t)$, connecting DDPM to score matching. Specifically, $\nabla_{x_t} \log q(x_t \mid x_0) = -\epsilon / \sqrt{1 - \bar\alpha_t}$, giving a probabilistic interpretation of the training objective.
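The three parameterizations are related by simple algebra on the closed-form forward process. The sketch below uses illustrative helper names (`x0_from_eps`, `score_from_eps`, `eps_error_gain`) and a hand-supplied noise vector in place of a network output. It also shows how an error in the noise prediction is rescaled when converted into an $x_0$ estimate.

```python
import numpy as np

def x0_from_eps(x_t, eps, alpha_bar_t):
    """Clean-image estimate implied by a noise prediction."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

def score_from_eps(eps, alpha_bar_t):
    """Conditional score grad log q(x_t | x_0) implied by a noise prediction."""
    return -eps / np.sqrt(1.0 - alpha_bar_t)

def eps_error_gain(alpha_bar_t):
    """Factor by which an error in eps is scaled in the implied x_0 estimate."""
    return np.sqrt((1.0 - alpha_bar_t) / alpha_bar_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 3))
eps = rng.normal(size=(3, 3))
alpha_bar_t = 0.05   # a high-noise timestep
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# With the true noise, x_0 is recovered and the score matches the Gaussian formula
print(np.allclose(x0_from_eps(x_t, eps, alpha_bar_t), x0))  # True
analytic_score = -(x_t - np.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)
print(np.allclose(score_from_eps(eps, alpha_bar_t), analytic_score))  # True

for ab in (0.9, 0.5, 0.05, 1e-4):
    print(f"alpha_bar={ab:g}: eps error is scaled by {eps_error_gain(ab):.1f} in x_0 space")
```

At low noise the two targets are nearly interchangeable; at high noise the implied $x_0$ estimate becomes extremely sensitive, which matches the practical argument above.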
Q3: What is the DDPM ELBO and what does each term mean?
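The DDPM ELBO is:

$$\log p_\theta(x_0) \ge \mathbb{E}_q\left[\log p_\theta(x_0 \mid x_1)\right] - D_{\mathrm{KL}}\left(q(x_T \mid x_0) \,\|\, p(x_T)\right) - \sum_{t=2}^{T} \mathbb{E}_q\left[D_{\mathrm{KL}}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right)\right]$$

The reconstruction term $\mathbb{E}_q[\log p_\theta(x_0 \mid x_1)]$ measures how well the final denoising step recovers $x_0$. The prior matching term measures closeness of $q(x_T \mid x_0)$ to $\mathcal{N}(0, I)$ - it has no trainable parameters and is near zero when $\bar\alpha_T \approx 0$, so it is ignored. The denoising KL terms are the main training signal: since both $q(x_{t-1} \mid x_t, x_0)$ and $p_\theta(x_{t-1} \mid x_t)$ are Gaussian, each KL has a closed form proportional to $\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2$. Dropping timestep-dependent weighting coefficients gives $L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\lVert \epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,\ t) \rVert^2\right]$.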
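To see what "dropping the weights" actually discards, the sketch below evaluates the coefficient that multiplies the squared noise error in each exact KL term, $\beta_t^2 / (2 \sigma_t^2 \alpha_t (1 - \bar\alpha_t))$, under two stated assumptions: the linear schedule with T = 1000 and the fixed variance choice $\sigma_t^2 = \beta_t$. The variable names are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

# Exact ELBO weight on ||eps - eps_theta||^2 with sigma_t^2 = beta_t:
#   beta_t^2 / (2 * sigma_t^2 * alpha_t * (1 - alpha_bar_t))
#   = beta_t / (2 * alpha_t * (1 - alpha_bar_t))
elbo_weight = betas / (2.0 * alphas * (1.0 - alpha_bar))

# Index i corresponds to timestep t = i + 1
for t in (2, 10, 100, 500, 1000):
    print(f"t={t:4d}  ELBO weight={elbo_weight[t - 1]:.4f}  L_simple weight=1")

print(f"ratio of weights, t=2 vs t=1000: {elbo_weight[1] / elbo_weight[-1]:.1f}x")
```

The exact weights are largest at small $t$, so setting every weight to 1 shifts emphasis toward the noisier timesteps relative to the ELBO, which is the reweighting Ho et al. describe for the simplified objective.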
Q4: What is the difference between linear and cosine noise schedules, and when does it matter?
The linear schedule increases $\beta_t$ linearly from $10^{-4}$ to $0.02$ over $T = 1000$ steps, designed for 32x32 images. For higher resolutions, it destroys high-frequency detail too quickly - many training steps fall in a regime where the image is nearly pure noise, providing little useful signal. The cosine schedule defines $\bar\alpha_t$ directly as a squared-cosine curve, $\bar\alpha_t = f(t)/f(0)$ with $f(t) = \cos^2\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ and $s = 0.008$, maintaining higher SNR for a larger fraction of timesteps and decreasing more sharply near $t = T$. With $T = 1000$, the cosine schedule keeps $\bar\alpha_t$ about 1.4x higher than linear at $t = 200$, and the ratio grows at later timesteps, giving more training steps in the "interesting" noise regime. For 128x128 and above, cosine consistently improves FID. For 32x32, the difference is small (less than 0.3 FID on CIFAR-10).
Q5: How does DDPM relate to score matching?
They are different training formulations that produce equivalent models. Score matching (Song and Ermon 2019) trains $s_\theta(x_t, t)$ to estimate $\nabla_{x_t} \log q(x_t)$ - the score of the noisy data distribution. DDPM trains $\epsilon_\theta(x_t, t)$ to predict added noise. The connection: $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t) / \sqrt{1 - \bar\alpha_t}$. So the DDPM noise prediction network is proportional to the negative score. Song et al. (2021) formalized this in the SDE framework, showing DDPM (variance-preserving SDE) and NCSN (variance-exploding SDE) are special cases of a general continuous-time diffusion SDE with a corresponding probability flow ODE. This view explains why deterministic DDIM sampling works (it can be read as a discretization of that ODE) and motivated dedicated ODE solvers such as DPM-Solver, which exploit the ODE structure for accelerated sampling.
Q6: What FID score would you expect a well-trained DDPM to achieve on CIFAR-10, and what would indicate a bug?
A well-configured DDPM (cosine schedule, EMA weights, correct normalization) should achieve FID around 2.9-3.5 on CIFAR-10 after sufficient training. The 2020 Ho et al. result was 3.17 with 1000 steps; Improved DDPM achieved 2.90. FID above 5.0 suggests a problem - check: (1) normalization range (images should be in $[-1, 1]$), (2) whether EMA weights are being used for sampling, (3) schedule choice and whether $T$ matches the schedule, (4) whether the correct noise is used (fresh random $\epsilon$ at each training step, not the same across epochs). FID above 15-20 indicates a fundamental bug - likely the closed-form forward process is incorrectly implemented or inputs are not normalized.
This lesson is part of the Diffusion Models module. Next: Score-Based Models and SDEs - The Continuous-Time View.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Diffusion Process (DDPM) demo on the EngineersOfAI Playground - no code required.
:::
