The complete mathematical derivation of Denoising Diffusion Probabilistic Models - forward process, reverse process, ELBO objective, noise schedule comparison, U-Net architecture, and why predicting noise works better than predicting clean images.

DDPMs - The Mathematical Foundation of Diffusion Models

:::note Reading time: ~55 minutes | Interview relevance: Very High | Target roles: ML Engineer, Research Engineer, AI Engineer, Applied Scientist :::

The Real Interview Moment

You are interviewing for a research scientist role at a generative AI lab. The interviewer writes $q(x_t | x_0)$ on the whiteboard and says: "Derive this closed form. Then explain how the training objective follows from the ELBO, and tell me why Ho et al. chose to predict $\varepsilon$ instead of $x_0$ ."

This is a qualifying question. It separates engineers who have used Stable Diffusion from engineers who understand it. Most people can draw forward and backward arrows on a diagram. Few can derive the closed-form distribution at an arbitrary timestep $t$ , explain why that derivation enables efficient training, connect it to the variational lower bound, articulate the empirical finding that noise prediction outperforms image prediction, and then describe why the cosine schedule (Nichol and Dhariwal 2021) matters for high-resolution generation.

The interviewer continues: "Now, sketch the U-Net architecture - how does it incorporate the timestep, how do skip connections help, and where does attention go?"

Then: "What FID score would you expect a well-trained DDPM to achieve on CIFAR-10, and what does FID actually measure?"

These questions form a complete arc from mathematical derivation to practical engineering to evaluation. Jonathan Ho, Ajay Jain, and Pieter Abbeel at UC Berkeley answered all of them in their 2020 DDPM paper. They unified ideas from non-equilibrium thermodynamics and score matching, used a U-Net with self-attention as the denoising network, and reported a state-of-the-art FID of 3.17 on unconditional CIFAR-10, with 256x256 LSUN samples of quality similar to ProgressiveGAN. This lesson gives you the complete derivation, implementation, and intuition to answer every question in that arc.

Why This Exists - The Core Problem with Adversarial Training

Prior to DDPM, the dominant approach to high-quality image generation was Generative Adversarial Networks (GANs). GANs have one fundamental training problem: the generator and discriminator must be carefully balanced throughout training. Too strong a discriminator and gradients vanish - the generator gets no useful signal. Too weak a discriminator and the generator does not improve. Mode collapse can happen silently and is difficult to detect until evaluation time. Every GAN paper required new stabilization tricks: spectral normalization, gradient penalty, progressive growing, minibatch standard deviation.

The deeper problem is that GANs optimize an adversarial objective, not a likelihood-based one. The training signal is indirect: "the discriminator says this looks fake." There is no explicit mathematical connection between the training loss and the quality of the learned distribution. When a GAN converges, there is no theoretical guarantee that it has learned the true data distribution - only that the generator can fool the discriminator.

Diffusion models take the opposite approach. They define an explicit, principled training objective derived from the variational lower bound on the log-likelihood. The training signal is direct: "predict the noise that was added at a specific timestep." This is a simple regression problem, always well-defined, never adversarial. The result is dramatically more stable training - you can train DDPM reliably with a standard Adam optimizer and cosine learning rate schedule, no discriminator, no mode collapse, no specialized tricks required.

The cost is inference speed: DDPM requires 1000 sequential denoising steps to generate one sample. GANs generate in a single forward pass. This speed gap motivated DDIM, DPM-Solver, and Consistency Models (covered in Lessons 04–06). But the stability advantage of DDPM's principled objective is what made the entire diffusion model ecosystem possible.

Historical Context - The Origins of DDPM

The idea of progressive noising comes from non-equilibrium statistical mechanics. Sohl-Dickstein et al. (2015) proposed learning a generative model by reversing a diffusion process - gradually adding noise until the data distribution becomes a known prior, then learning to reverse the process. The paper proved this was theoretically sound: if you can perfectly learn the reverse of a diffusion process, you can generate samples from the data distribution. But the implementation was too slow and the networks too small to produce competitive results. The paper was visionary but practically ahead of its time.

Yang Song and Stefano Ermon (2019, 2020) approached generative modeling from a different angle: score matching. If you know the score function $\nabla_x \log p(x)$ of the data distribution, you can generate samples via Langevin dynamics - just follow the gradient of the log probability, adding noise to avoid getting trapped. They showed that training on data perturbed at multiple noise scales produced a score function that worked across the full noise range. Their NCSN model produced impressive results but still lagged behind GANs in FID.

Ho et al. (2020) connected these threads. They showed that a particular parameterization of DDPM - specifically, predicting the noise $\varepsilon$ rather than the denoised image $x_0$ - gave a training objective equivalent to weighted denoising score matching. They used a modern U-Net with self-attention layers as the noise prediction network, scaled the training to 256x256 resolution, and achieved an FID of 3.17 on unconditional CIFAR-10 - competitive with the best GANs, without any adversarial training instability. DDPM was born, and with it, the modern era of diffusion-based generative AI.

1. The Forward Process - Adding Noise Systematically

Definition and Intuition

The forward process is a Markov chain that gradually destroys structure in a clean image $x_0$ by adding Gaussian noise at each step:

$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\right)$

Here $\{\beta_t\}_{t=1}^T$ is the noise schedule - a sequence of small positive constants controlling how much noise is added at each step. The mean $\sqrt{1-\beta_t}\, x_{t-1}$ scales down the signal from the previous step. The variance $\beta_t I$ adds Gaussian noise. Over $T$ steps (Ho et al. used $T = 1000$ ), this progressively destroys all structure in $x_0$ until only noise remains.

Intuitively: imagine repeatedly photocopying a photo on a machine that adds a tiny amount of static each time. After many copies, all you have is static. The forward process does exactly this to a digital image, but in a mathematically controlled way.

The full forward joint distribution factors as:

$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$

Key Property: Closed-Form at Any Timestep

One of the most important mathematical properties of this particular forward process is that $q(x_t | x_0)$ has a closed form at any arbitrary timestep $t$ - you do not need to chain $t$ steps of sampling to get $x_t$ .

Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ . Then:

$\boxed{q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar{\alpha}_t}\, x_0,\; (1-\bar{\alpha}_t) I\right)}$

Proof (by induction, using the fact that a sum of independent Gaussians is Gaussian and their variances add):

Base case: At $t=1$ : $q(x_1|x_0) = \mathcal{N}(x_1; \sqrt{1-\beta_1}\, x_0, \beta_1 I) = \mathcal{N}(x_1; \sqrt{\alpha_1}\, x_0, (1-\alpha_1)I)$ , which matches the formula since $\bar{\alpha}_1 = \alpha_1$ .

Inductive step: Suppose $q(x_{t-1}|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_{t-1}}\, x_0, (1-\bar{\alpha}_{t-1}) I)$ .

Apply one more forward step. We can write $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{\beta_t}\, \varepsilon_2$ where $\varepsilon_2 \sim \mathcal{N}(0,I)$ .

Substituting the inductive hypothesis $x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1-\bar{\alpha}_{t-1}}\, \varepsilon_1$ :

$x_t = \sqrt{\alpha_t}\!\left(\sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1-\bar{\alpha}_{t-1}}\, \varepsilon_1\right) + \sqrt{\beta_t}\, \varepsilon_2$

$= \sqrt{\alpha_t \bar{\alpha}_{t-1}}\, x_0 + \underbrace{\sqrt{\alpha_t(1-\bar{\alpha}_{t-1})}\, \varepsilon_1 + \sqrt{\beta_t}\, \varepsilon_2}_{\text{sum of independent Gaussians}}$

The variance of the combined noise term is:

$\alpha_t(1-\bar{\alpha}_{t-1}) + \beta_t = \alpha_t - \alpha_t\bar{\alpha}_{t-1} + 1 - \alpha_t = 1 - \alpha_t\bar{\alpha}_{t-1} = 1 - \bar{\alpha}_t$

So $x_t \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I)$ . QED.
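The induction can also be checked numerically. The sketch below is illustrative only: it assumes the linear schedule introduced in the next section, and the helper names (`linear_betas`, `chain_coefficients`, `closed_form_coefficients`) are made up for this example. It applies the one-step recursion $t$ times, tracking the coefficient on $x_0$ and the accumulated noise variance, and compares the result with $\sqrt{\bar{\alpha}_t}$ and $1-\bar{\alpha}_t$.

```python
import math

def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear schedule beta_1, ..., beta_T (stored 0-indexed)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def chain_coefficients(betas, t):
    """Apply x_s = sqrt(alpha_s) x_{s-1} + sqrt(beta_s) eps_s for s = 1..t.

    Returns the coefficient multiplying x_0 and the variance of the
    accumulated (independent) Gaussian noise.
    """
    signal, noise_var = 1.0, 0.0
    for beta in betas[:t]:
        alpha = 1.0 - beta
        signal *= math.sqrt(alpha)
        noise_var = alpha * noise_var + beta  # variances of independent Gaussians add
    return signal, noise_var

def closed_form_coefficients(betas, t):
    """sqrt(alpha_bar_t) and 1 - alpha_bar_t from the boxed formula."""
    alpha_bar = math.prod(1.0 - b for b in betas[:t])
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

betas = linear_betas()
for t in (1, 10, 500, 1000):
    print(t, chain_coefficients(betas, t), closed_form_coefficients(betas, t))
```

The two pairs should agree up to floating-point rounding at every $t$, which is the content of the inductive step.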

Reparameterized Sampling

This closed form gives us a crucial efficiency: we can sample $x_t$ directly from $x_0$ in one step using the reparameterization trick:

$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$

This is the reparameterization used in every DDPM training iteration. We do not need to run $t$ sequential noising steps - we jump directly from the clean image $x_0$ to a noisy image $x_t$ at any desired noise level with a single elementwise scale-and-add. Without this closed form, producing $x_t$ would take O(t) sequential steps, up to O(T) per training sample. With it, the cost is O(1) per sample regardless of $t$ or $T$ .
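As a minimal sketch of this one-shot sampling (plain Python on a flat list of pixel values rather than tensors; the function name `q_sample`, the toy image, and the noise level `alpha_bar_t = 0.25` are assumptions for illustration):

```python
import math
import random

def q_sample(x0, alpha_bar_t, rng):
    """Draw x_t ~ q(x_t | x_0) in one step; returns (x_t, eps)."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
x0 = [0.5] * 20000      # toy "image": every pixel equals 0.5
alpha_bar_t = 0.25      # hypothetical noise level
x_t, eps = q_sample(x0, alpha_bar_t, rng)

mean = sum(x_t) / len(x_t)
var = sum((v - mean) ** 2 for v in x_t) / len(x_t)
# Expect roughly sqrt(0.25) * 0.5 = 0.25 for the mean and 1 - 0.25 = 0.75 for the variance.
print(mean, var)
```

Returning `eps` alongside `x_t` is a deliberate choice: the training objective needs the exact noise that was added as its regression target.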

Forward Process Visualization

As $t$ increases, $\bar{\alpha}_t \to 0$ , so the clean signal is scaled toward zero and the image becomes pure Gaussian noise:

t=0:    x_0  (clean image, full detail, bar_alpha = 1.0)
        ↓ add very small noise (beta_1 = 0.0001)
t=100:  x_100 (slightly noisy, bar_alpha ≈ 0.90)
        ↓ add more noise
t=500:  x_500 (heavy noise, only coarse structure left, bar_alpha ≈ 0.08)
        ↓ add more noise
t=800:  x_800 (nearly indistinguishable from noise, bar_alpha ≈ 0.002)
        ↓ add final noise
t=T:    x_T   ≈ N(0, I) (pure noise, no identifiable structure)
(approximate bar_alpha values for the linear schedule of Ho et al., beta from 0.0001 to 0.02, T = 1000)

The noise schedule $\{\beta_t\}$ controls how quickly this destruction happens - a design choice with real consequences for generation quality.

2. The Noise Schedule - Linear vs Cosine

The noise schedule $\{\beta_t\}$ determines the rate at which signal is destroyed. This is not just a technical detail - the choice of schedule significantly affects training efficiency and sample quality, especially at high resolutions.

Linear Schedule (Ho et al. 2020)

Ho et al. used a linear schedule increasing $\beta_t$ uniformly from $\beta_1$ to $\beta_T$ :

$\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1), \qquad \beta_1 = 10^{-4},\; \beta_T = 0.02,\; T=1000$

This schedule works well for 32x32 and 64x64 images. The small $\beta_1$ means very little noise is added in the first few steps, and the gradual increase ensures a smooth transition to pure noise by step $T$ .

Problem at high resolution: for 256x256 and above images, the linear schedule destroys high-frequency detail (edges, textures) too quickly in the early timesteps. Many training steps fall in a regime where the image is nearly pure noise, providing little useful training signal. The model sees too many "already destroyed" images and too few "partially corrupted" ones.

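Signal-to-noise ratio at the midpoint (linear): with $\bar{\alpha}_{500} \approx 0.08$ , $\text{SNR}_{t=500} = \bar{\alpha}_{500}/(1-\bar{\alpha}_{500}) \approx 0.085$ . Halfway through the chain the image is already dominated by noise, and by step 600 it has lost most of its structure.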
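These signal-to-noise figures can be recomputed directly from the schedule definition. The sketch below assumes the linear schedule exactly as written above ($\beta_1 = 10^{-4}$, $\beta_T = 0.02$, $T = 1000$); the helper name `snr` is illustrative.

```python
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# Cumulative product: alpha_bars[t - 1] holds alpha_bar_t for t = 1..T.
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def snr(t):
    """Signal-to-noise ratio alpha_bar_t / (1 - alpha_bar_t) at 1-indexed step t."""
    ab = alpha_bars[t - 1]
    return ab / (1.0 - ab)

for t in (100, 250, 500, 750, 1000):
    print(f"t={t:4d}  alpha_bar={alpha_bars[t - 1]:.5f}  SNR={snr(t):.5f}")
```

Printing a few timesteps like this is a quick way to see where in the chain a given schedule spends its time before committing to a long training run.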

Cosine Schedule (Nichol and Dhariwal 2021)

Nichol and Dhariwal proposed a cosine schedule defined through $\bar{\alpha}_t$ :

$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$

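where $s = 0.008$ is a small offset that prevents $\beta_t$ from being too small near $t=0$ . Nichol and Dhariwal found that adding only tiny amounts of noise at the start of the process made it hard for the network to predict $\varepsilon$ accurately. Dividing by $f(0)$ keeps $\bar{\alpha}_0 = 1$ , so no noise is present at $t=0$ . They also clip $\beta_t$ to at most 0.999 to avoid singularities near $t=T$ .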

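The cosine schedule has a characteristic shape: $\bar{\alpha}_t$ changes slowly near $t=0$ and $t=T$ and falls almost linearly in between. For $T=1000$ it is roughly 0.97 at $t=100$ , 0.49 at $t=500$ , and 0.09 at $t=800$ , whereas the linear schedule is already below 0.1 by $t=500$ . This ensures: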

Early timesteps are meaningful: the model sees images with small but non-trivial amounts of noise, learning fine-grained denoising
Middle timesteps are information-rich: the transition zone from structured to unstructured gets more training steps
Late timesteps are efficient: the model does not waste steps on images that are already essentially pure noise

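Nichol and Dhariwal report that the cosine schedule improves log-likelihood and, in their ImageNet 64x64 experiments, FID relative to the linear schedule; the size of the gain depends on the dataset and training setup. The cosine schedule, or a learned or resolution-adjusted variant, is a common choice when the linear schedule spends too many timesteps on nearly pure noise.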
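A minimal sketch of the cosine schedule in plain Python follows, assuming $T = 1000$ and $s = 0.008$ as above. The function names are illustrative, and the cap of 0.999 on $\beta_t$ follows the clipping described by Nichol and Dhariwal.

```python
import math

def cosine_alpha_bars(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0) for t = 0..T (index t, so alpha_bars[0] == 1)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    f0 = f(0)
    return [f(t) / f0 for t in range(T + 1)]

def betas_from_alpha_bars(alpha_bars, max_beta=0.999):
    """beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, capped to avoid beta_T -> 1."""
    return [
        min(1.0 - alpha_bars[t] / alpha_bars[t - 1], max_beta)
        for t in range(1, len(alpha_bars))
    ]

alpha_bars = cosine_alpha_bars()
betas = betas_from_alpha_bars(alpha_bars)
for t in (100, 500, 800, 1000):
    print(f"t={t:4d}  alpha_bar={alpha_bars[t]:.4f}  beta={betas[t - 1]:.5f}")
```

Defining the schedule through $\bar{\alpha}_t$ and deriving $\beta_t$ afterwards, rather than the other way round, is what makes it easy to shape how quickly the signal decays.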

3. The Reverse Process - Learning to Denoise

The reverse process is what we want to learn. Starting from $x_T \sim \mathcal{N}(0, I)$ , we want to progressively denoise to recover $x_0$ . Each reverse step is modeled as a Gaussian:

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)$

The functions $\mu_\theta$ and $\Sigma_\theta$ are parameterized by neural networks (in practice, a single U-Net). The full generative model is:

$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$

where $p(x_T) = \mathcal{N}(0, I)$ is the prior - pure Gaussian noise.

The True Reverse Posterior

A key mathematical property: the true reverse posterior $q(x_{t-1}|x_t, x_0)$ is tractable when conditioned on $x_0$ . Using Bayes' theorem and the Gaussian forward process, it is also Gaussian:

$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \tilde{\mu}_t(x_t, x_0),\; \tilde{\beta}_t I\right)$

where:

$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t$

$\tilde{\beta}_t = \frac{(1-\bar{\alpha}_{t-1})\beta_t}{1-\bar{\alpha}_t}$

This posterior mean is a weighted combination of $x_0$ (the clean image, which we do not know at inference time) and $x_t$ (the current noisy image, which we do have). The variance $\tilde{\beta}_t$ is entirely determined by the noise schedule - no learning needed for the variance in the original DDPM formulation.

The learned $p_\theta(x_{t-1}|x_t)$ should approximate this posterior. But since $x_0$ is unknown at inference time, we must parameterize $\mu_\theta$ in terms of what the network predicts from $x_t$ alone.
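To get a feel for the posterior, the sketch below computes the two mean coefficients and $\tilde{\beta}_t$ for a few timesteps. It assumes the linear schedule from Section 2, and `posterior_params` is an illustrative helper name. As a sanity check it verifies the law of total expectation: averaging $\tilde{\mu}_t$ over $x_t \mid x_0$ must give $\sqrt{\bar{\alpha}_{t-1}}\, x_0$.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)   # alpha_bars[t - 1] holds alpha_bar_t

def posterior_params(t):
    """Coefficients of q(x_{t-1} | x_t, x_0) for 1-indexed t >= 2.

    Returns (weight on x_0, weight on x_t, posterior variance beta_tilde_t).
    """
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    coef_x0 = math.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    beta_tilde = (1.0 - ab_prev) * beta_t / (1.0 - ab_t)
    return coef_x0, coef_xt, beta_tilde

for t in (2, 100, 500, 1000):
    coef_x0, coef_xt, beta_tilde = posterior_params(t)
    # E[x_t | x_0] = sqrt(alpha_bar_t) x_0, so this sum should equal sqrt(alpha_bar_{t-1}).
    implied = coef_x0 + coef_xt * math.sqrt(alpha_bars[t - 1])
    print(t, coef_x0, coef_xt, beta_tilde, implied, math.sqrt(alpha_bars[t - 2]))
```

The weight on $x_0$ is largest at small $t$ and shrinks as $t$ grows, which matches the intuition that the current noisy image carries less and less information about the clean one.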

4. The ELBO Derivation

To train the reverse process, we maximize the log-likelihood $\log p_\theta(x_0)$ . Since this is intractable (it involves marginalizing over all $T$ intermediate steps $x_{1:T}$ ), we maximize the ELBO (Evidence Lower BOund):

$\log p_\theta(x_0) \geq \mathcal{L}(\theta) = \mathbb{E}_q\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$

Expanding by substituting the factored forms of $p_\theta$ and $q$ :

$\mathcal{L} = \underbrace{\mathbb{E}_q[\log p_\theta(x_0 \mid x_1)]}_{L_0} - \underbrace{D_{\text{KL}}\!\left(q(x_T \mid x_0) \| p(x_T)\right)}_{L_T} - \sum_{t=2}^{T} \underbrace{D_{\text{KL}}\!\left(q(x_{t-1} \mid x_t, x_0) \| p_\theta(x_{t-1} \mid x_t)\right)}_{L_{t-1}}$

Parsing each term:

$L_0$ - Reconstruction: How well does the final denoising step $p_\theta(x_0|x_1)$ recover the clean image? Analogous to the reconstruction term in a VAE. In practice, treated as a Gaussian with fixed variance.

$L_T$ - Prior matching: How close is the fully noised image $x_T$ to the prior $\mathcal{N}(0,I)$ ? If the noise schedule is designed so that $\bar{\alpha}_T \approx 0$ , this term is approximately zero and requires no optimization - it is determined by the fixed forward process, not the model. Ho et al. ignore this term entirely.

$L_{t-1}$ - Denoising terms (the main training signal): For each $t$ from 2 to $T$ , how well does the learned reverse step $p_\theta(x_{t-1}|x_t)$ match the true conditional reverse posterior $q(x_{t-1}|x_t, x_0)$ ?

Since both distributions are Gaussian and the reverse variance is held fixed, each KL divergence reduces, up to an additive constant, to a scaled squared difference between the two means. This is where the bulk of training occurs.
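Written out, under the assumption that the reverse covariance is fixed to $\Sigma_\theta(x_t, t) = \sigma_t^2 I$ (the symbol $\sigma_t^2$ is the fixed reverse-step variance used later in the sampling section), each denoising term is:

$D_{\text{KL}}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right) = \frac{1}{2\sigma_t^2}\, \left\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2 + C$

where $C$ collects the terms involving only $\tilde{\beta}_t$ and $\sigma_t^2$ and does not depend on $\theta$ . This follows from the general KL divergence between two isotropic Gaussians, in which the only term containing the means is the squared distance divided by twice the variance of the second distribution.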

5. The Simplified Training Objective

From ELBO to Noise Prediction

Ho et al. made a critical practical choice: reparameterize the mean in terms of the noise $\varepsilon$ rather than predicting $\mu_\theta$ or $x_0$ directly.

From the forward process: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$ , so we can solve for $x_0$ :

$x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \varepsilon}{\sqrt{\bar{\alpha}_t}}$

Substituting this expression into the true posterior mean $\tilde{\mu}_t$ :

$\tilde{\mu}_t(x_t, \varepsilon) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon\right)$

So if the network predicts $\varepsilon_\theta(x_t, t) \approx \varepsilon$ , we can compute the optimal mean:

$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta(x_t, t)\right)$
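The algebra behind this substitution is easy to get wrong, so a quick numerical check is useful. The sketch below (scalar "images", linear schedule from Section 2, illustrative function names) evaluates the posterior mean once from $(x_t, x_0)$ and once from $(x_t, \varepsilon)$ and prints both.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)   # alpha_bars[t - 1] holds alpha_bar_t

def posterior_mean_from_x0(x_t, x0, t):
    """mu_tilde_t written in terms of x_0 (alpha_bar_0 is taken to be 1)."""
    beta, ab = betas[t - 1], alpha_bars[t - 1]
    ab_prev = alpha_bars[t - 2] if t > 1 else 1.0
    coef_x0 = math.sqrt(ab_prev) * beta / (1.0 - ab)
    coef_xt = math.sqrt(1.0 - beta) * (1.0 - ab_prev) / (1.0 - ab)
    return coef_x0 * x0 + coef_xt * x_t

def posterior_mean_from_eps(x_t, eps, t):
    """The same mean written in terms of the noise eps."""
    beta, ab = betas[t - 1], alpha_bars[t - 1]
    return (x_t - beta / math.sqrt(1.0 - ab) * eps) / math.sqrt(1.0 - beta)

rng = random.Random(0)
for t in (1, 2, 250, 1000):
    x0, eps = rng.uniform(-1.0, 1.0), rng.gauss(0.0, 1.0)
    ab = alpha_bars[t - 1]
    x_t = math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps
    print(t, posterior_mean_from_x0(x_t, x0, t), posterior_mean_from_eps(x_t, eps, t))
```

The two columns should match to floating-point precision whenever `x_t`, `x0`, and `eps` are linked by the forward-process identity.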

The denoising KL terms then become, up to a constant that does not depend on $\theta$ :

$\mathbb{E}_{x_0, \varepsilon, t}\!\left[\frac{(1-\alpha_t)^2}{2\sigma_t^2\,\alpha_t(1-\bar{\alpha}_t)}\, \|\varepsilon - \varepsilon_\theta(x_t, t)\|^2\right]$

where $\sigma_t^2$ is the fixed variance of the reverse step. Ho et al. dropped the timestep-dependent weighting coefficient $\frac{(1-\alpha_t)^2}{2\sigma_t^2\,\alpha_t(1-\bar{\alpha}_t)}$ and found that training with the simplified objective worked better in practice:

$\boxed{\mathcal{L}_{\text{simple}} = \mathbb{E}_{t \sim \mathcal{U}[1,T],\; x_0 \sim q,\; \varepsilon \sim \mathcal{N}(0,I)}\!\left[\|\varepsilon - \varepsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon,\; t\right)\|^2\right]}$

This is the DDPM training objective in its final form. Sample a clean image $x_0$ , pick a random timestep $t$ , add noise $\varepsilon$ to get $x_t$ , ask the network to predict $\varepsilon$ , measure MSE. Simple, elegant, powerful.
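The recipe in the last paragraph can be sketched without any deep learning framework. In the example below, `predict_noise` is a stand-in for the U-Net, images are flat lists of floats, and the loss is averaged per element; all names are illustrative assumptions. Two toy predictors bracket the behaviour: one that always outputs zero, and an "oracle" that cheats by knowing the single training image.

```python
import math
import random

def l_simple(x0_batch, alpha_bars, predict_noise, rng):
    """Monte Carlo estimate of L_simple, averaged per element."""
    T = len(alpha_bars)
    total, count = 0.0, 0
    for x0 in x0_batch:
        t = rng.randint(1, T)                         # t ~ Uniform{1, ..., T}
        ab = alpha_bars[t - 1]
        eps = [rng.gauss(0.0, 1.0) for _ in x0]       # target noise
        x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
        eps_hat = predict_noise(x_t, t)
        total += sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))
        count += len(x0)
    return total / count

T = 1000
alpha_bars = []
running = 1.0
for i in range(T):
    running *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))
    alpha_bars.append(running)

X0 = [0.3] * 64          # one fixed toy "image"
batch = [X0] * 64

def zero_predictor(x_t, t):
    return [0.0] * len(x_t)

def oracle_predictor(x_t, t):
    # Only possible here because the dataset contains a single known image.
    ab = alpha_bars[t - 1]
    return [(v - math.sqrt(ab) * X0[i]) / math.sqrt(1.0 - ab) for i, v in enumerate(x_t)]

rng = random.Random(0)
loss_zero = l_simple(batch, alpha_bars, zero_predictor, rng)
loss_oracle = l_simple(batch, alpha_bars, oracle_predictor, rng)
print(loss_zero, loss_oracle)
```

A predictor that outputs zeros should score close to 1, the variance of $\varepsilon$, which is a useful reference point: a real training run whose loss stays near 1 has learned nothing.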

Why Predict Noise, Not the Clean Image?

You might reasonably ask: why predict $\varepsilon$ rather than predicting $x_0$ directly? Both are mathematically equivalent - given $\varepsilon_\theta$ , we can compute $\hat{x}_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon_\theta) / \sqrt{\bar{\alpha}_t}$ .

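The empirical answer: Ho et al. report that noise prediction trained with the simplified objective gave their best FID. Their published ablation compares predicting the posterior mean $\tilde{\mu}_t$ against predicting $\varepsilon$ , and they note that predicting $x_0$ directly led to worse sample quality in their early experiments.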

The intuitive explanation: at high noise levels ( $t$ large, $\bar{\alpha}_t$ small), predicting $x_0$ requires reconstructing a clean image from nearly pure noise. The target has extremely high variance - many different clean images are consistent with the same noisy observation. The gradient signal is therefore very noisy. Predicting $\varepsilon \sim \mathcal{N}(0,I)$ is always predicting from a fixed, unit-variance distribution regardless of noise level. The regression target is well-behaved at every $t$ .

The theoretical connection: predicting $\varepsilon_\theta(x_t, t)$ is mathematically equivalent to estimating the score function $\nabla_{x_t} \log q_t(x_t)$ of the noisy data distribution, since:

$\varepsilon_\theta(x_t, t) \approx -\sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t} \log q_t(x_t)$

This connects DDPM to the score matching framework of Song and Ermon - it is the theoretical reason the two approaches end up training closely related models despite their different derivations.
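The relation can be checked in a case where the score is known in closed form. The sketch below assumes one-dimensional Gaussian data $x_0 \sim \mathcal{N}(0, s_0^2)$ (a toy assumption, as are the values of `alpha_bar` and `s0`), so that $q_t = \mathcal{N}(0, \bar{\alpha}_t s_0^2 + 1 - \bar{\alpha}_t)$ and the score is linear in $x_t$. It fits the best linear noise predictor by least squares and compares its slope with $-\sqrt{1-\bar{\alpha}_t}$ times the slope of the score.

```python
import math
import random

rng = random.Random(0)
alpha_bar = 0.5          # hypothetical noise level
s0 = 2.0                 # hypothetical data standard deviation
a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)

# Least-squares fit of eps ~ slope * x_t over samples from the forward process.
n = 100000
sum_xe = sum_xx = 0.0
for _ in range(n):
    x0 = rng.gauss(0.0, s0)
    eps = rng.gauss(0.0, 1.0)
    x_t = a * x0 + b * eps
    sum_xe += x_t * eps
    sum_xx += x_t * x_t
fitted_slope = sum_xe / sum_xx

# Score of N(0, var_t) is -x_t / var_t, so -sqrt(1 - alpha_bar) * score has this slope:
var_t = alpha_bar * s0 ** 2 + (1.0 - alpha_bar)
slope_from_score = b / var_t

print(fitted_slope, slope_from_score)
```

For Gaussian data the conditional expectation $\mathbb{E}[\varepsilon \mid x_t]$ is exactly linear, so the fitted slope should approach the score-derived slope as the number of samples grows.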

Why the simplified objective works better than the weighted one: with the fixed variance $\sigma_t^2 = \beta_t$ , the ELBO weighting coefficient is largest at small $t$ (where little noise has been added and the denoising task is easy) and much smaller at large $t$ (where the task is hard). Dropping the weight gives every timestep the same coefficient on the noise-prediction error, which relative to the ELBO down-weights the easy small- $t$ terms and lets the network focus on the harder, noisier timesteps. Ho et al. report that this yields their best sample quality (FID), although training on the full variational bound gives slightly better log-likelihood.

6. The Sampling Algorithm

At generation time, we start from $x_T \sim \mathcal{N}(0,I)$ and apply the learned reverse process $T$ times. Substituting the noise-prediction parameterization:

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t, t)\right) + \sigma_t\, z_t, \qquad z_t \sim \mathcal{N}(0,I)$

where $\sigma_t^2 = \beta_t$ (the choice in the original DDPM paper). At the final step $t=1$ , we set $z_1 = 0$ to avoid adding noise to an almost-clean image.

Full Algorithm:

1. Sample $x_T \sim \mathcal{N}(0, I)$
2. For $t = T, T-1, \ldots, 1$ :
   a. If $t > 1$ : sample $z \sim \mathcal{N}(0, I)$ , else $z = 0$
   b. Compute $\hat{\varepsilon} = \varepsilon_\theta(x_t, t)$ - one U-Net forward pass
   c. Update: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\varepsilon}\right) + \sqrt{\beta_t}\, z$
3. Return $x_0$

This requires $T$ U-Net forward passes - 1000 for the original DDPM. This is the sampling bottleneck that DDIM (Lesson 04) reduces to 20-50 steps.
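The loop above can be exercised end to end on a toy problem where the ideal noise predictor is known. The sketch below makes several assumptions for illustration: scalar samples, data distributed as $\mathcal{N}(0, 1)$ (for which the optimal predictor is $\varepsilon^*(x_t, t) = \sqrt{1-\bar{\alpha}_t}\, x_t$), and a shortened chain of $T = 200$ steps to keep the pure-Python run fast. `predict_noise` stands in for the U-Net.

```python
import math
import random

def alpha_bars_from(betas):
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)                  # out[t - 1] holds alpha_bar_t
    return out

def ddpm_sample(predict_noise, betas, rng):
    """Ancestral sampling with sigma_t^2 = beta_t for one scalar sample."""
    alpha_bars = alpha_bars_from(betas)
    x = rng.gauss(0.0, 1.0)                  # step 1: x_T ~ N(0, 1)
    for t in range(len(betas), 0, -1):       # step 2: t = T, ..., 1
        beta, ab = betas[t - 1], alpha_bars[t - 1]
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0
        eps_hat = predict_noise(x, t)
        mean = (x - beta / math.sqrt(1.0 - ab) * eps_hat) / math.sqrt(1.0 - beta)
        x = mean + math.sqrt(beta) * z
    return x                                 # step 3: x_0

T = 200
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
ALPHA_BARS = alpha_bars_from(betas)

def oracle_eps(x_t, t):
    # Closed-form optimal predictor, valid only for N(0, 1) data.
    return math.sqrt(1.0 - ALPHA_BARS[t - 1]) * x_t

rng = random.Random(0)
samples = [ddpm_sample(oracle_eps, betas, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)   # should be close to the data statistics: mean 0, variance 1
```

With a real network the structure is identical; only `predict_noise` changes, and each call becomes one forward pass.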

Variance Choice: Fixed vs Learned

Ho et al. fixed $\sigma_t^2 = \beta_t$ and report similar results with the posterior variance $\sigma_t^2 = \tilde{\beta}_t$ ; the two choices correspond to upper and lower bounds on the reverse-process entropy. Nichol and Dhariwal (2021) showed that learning the variance - interpolating between $\beta_t$ and $\tilde{\beta}_t$ in the log domain using the network's output - improves log-likelihood and allows sampling with fewer steps. Their improved DDPM parameterizes the variance as:

$\Sigma_\theta(x_t, t) = \exp\!\left(v \log \beta_t + (1-v) \log \tilde{\beta}_t\right)$

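where $v$ is predicted by the network alongside the noise estimate. In Nichol and Dhariwal's formulation $v$ is a vector with one component per output dimension and is not explicitly constrained, although values in $[0,1]$ interpolate between the two bounds.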
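A small sketch of this log-domain interpolation for a single component $v$ (the function name and the example values of $\beta_t$ and $\tilde{\beta}_t$ are illustrative, not taken from a trained model):

```python
import math

def interpolated_variance(v, beta_t, beta_tilde_t):
    """exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t)).

    Requires beta_tilde_t > 0, so it does not apply at t = 1 where
    beta_tilde_1 = 0 under the definition above.
    """
    return math.exp(v * math.log(beta_t) + (1.0 - v) * math.log(beta_tilde_t))

beta_t, beta_tilde_t = 0.02, 0.01    # hypothetical values for one timestep
for v in (0.0, 0.5, 1.0):
    print(v, interpolated_variance(v, beta_t, beta_tilde_t))
```

Interpolating logarithms means $v = 0.5$ gives the geometric mean of the two variances rather than the arithmetic mean, which behaves better when $\beta_t$ and $\tilde{\beta}_t$ differ by orders of magnitude.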

7. The U-Net Backbone

The denoising network $\varepsilon_\theta(x_t, t)$ is a U-Net - originally proposed for biomedical image segmentation by Ronneberger et al. (2015). The U-Net is ideal for the DDPM denoising task for specific structural reasons.

Why U-Net for Diffusion?

Skip connections preserve fine details: the U-Net's skip connections pass feature maps directly from encoder to decoder at each resolution level. Without skip connections, the information bottleneck in the middle of the network would destroy high-frequency spatial detail. With skip connections, fine textures and edge information bypass the bottleneck and are available for the decoder to use in reconstruction.

Multi-scale reasoning: the encoder-decoder structure processes the image at multiple spatial resolutions simultaneously. The bottleneck (lowest resolution) captures global structure - overall composition, large-scale lighting, semantic content. The shallow layers (highest resolution) capture local detail - texture, fine edges, grain patterns. This multi-scale hierarchy mirrors how diffusion naturally operates: high-noise steps require understanding global structure, low-noise steps require understanding fine detail.

Self-attention at low resolutions: attention layers in the bottleneck and lower-resolution feature maps capture long-range spatial dependencies - allowing the model to ensure that, for example, a face is internally consistent even when the two eyes are far apart in the image.

Timestep conditioning via sinusoidal embeddings: the timestep $t$ must be communicated to the network so it knows which noise level it is operating at. This is done via sinusoidal position embeddings (borrowed from Transformers):

$\text{emb}(t) = \left[\sin(t / 10000^{2i/d}),\; \cos(t / 10000^{2i/d})\right]_{i=1}^{d/2}$

This embedding is projected through a small MLP and added to the feature maps inside each residual block, so the network's behavior changes smoothly with $t$ , from predicting fine-detail noise at small $t$ to predicting large-scale structure at large $t$ .
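A dependency-free sketch of the sinusoidal embedding is shown below. It is one common variant, not necessarily the exact indexing of any particular codebase: frequencies are indexed from $i = 0$ and the sine and cosine halves are concatenated rather than interleaved. The function name, the `max_period` parameter, and the even-`dim` assumption are choices made for this example.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into `dim` values (dim even).

    Uses frequencies max_period ** (-2i / dim) for i = 0 .. dim/2 - 1,
    laid out as [sin(...), ..., cos(...), ...].
    """
    half = dim // 2
    freqs = [max_period ** (-i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

print(timestep_embedding(0, 8))      # all sines are 0, all cosines are 1
print(timestep_embedding(500, 8))
```

Low-index components oscillate quickly with $t$ and distinguish neighbouring timesteps, while high-index components vary slowly and encode the coarse position in the chain.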

Architecture Summary

The Ho et al. U-Net for 32x32 images:

Input: (B, C, 32, 32)
┌─────────────────────────────────────────────────┐
│ Encoder                                         │
│   conv(C → 128) → [32×32]                      │
│   ResBlock × 2 + Downsample → [16×16]          │
│   ResBlock × 2 + Self-Attention + Down → [8×8] │
│   ResBlock × 2 + Downsample → [4×4]            │
├─────────────────────────────────────────────────┤
│ Bottleneck                                      │
│   ResBlock + Self-Attention + ResBlock [4×4]    │
├─────────────────────────────────────────────────┤
│ Decoder (with skip connections from encoder)    │
│   ResBlock × 2 + Upsample → [8×8]              │
│   ResBlock × 2 + Self-Attention + Up → [16×16] │
│   ResBlock × 2 + Upsample → [32×32]            │
│   conv(128 → C)                                 │
└─────────────────────────────────────────────────┘
Output: epsilon_hat (B, C, 32, 32)

Each ResBlock receives the timestep embedding and incorporates it via: GroupNorm → SiLU → Conv → Add(t_emb_proj) → GroupNorm → SiLU → Conv.

8. Complete PyTorch Implementation

The following is a complete, runnable DDPM implementation with a U-Net denoiser, training loop, and sampling loop. It trains on MNIST to demonstrate the core mechanics on accessible compute.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import math


# ============================================================
# Sinusoidal timestep embedding
# ============================================================
class SinusoidalPosEmb(nn.Module):
    """
    Sinusoidal positional embedding for timestep conditioning.
    Same encoding as Transformer positional embeddings.
    Output shape: (batch, dim)
    """
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        device = t.device
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        # t: (batch,) → multiply with frequency basis
        emb = t[:, None].float() * emb[None, :]
        emb = torch.cat([emb.sin(), emb.cos()], dim=-1)
        return emb  # (batch, dim)


# ============================================================
# Residual block with timestep conditioning
# ============================================================
class ResidualBlock(nn.Module):
    """
    Residual block conditioned on timestep embedding.
    Uses GroupNorm + SiLU (better than BatchNorm for diffusion).
    """
    def __init__(self, in_channels, out_channels, time_emb_dim, num_groups=8):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_emb_dim, out_channels)
        )
        self.block1 = nn.Sequential(
            nn.GroupNorm(num_groups, in_channels),
            nn.SiLU(),
            nn.Conv2d(in_channels, out_channels, 3, padding=1)
        )
        self.block2 = nn.Sequential(
            nn.GroupNorm(num_groups, out_channels),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
        )
        # Residual connection handles channel mismatch
        self.residual_conv = (
            nn.Conv2d(in_channels, out_channels, 1)
            if in_channels != out_channels else nn.Identity()
        )

    def forward(self, x, t_emb):
        h = self.block1(x)
        # Add timestep embedding broadcast over H, W
        h = h + self.time_mlp(t_emb)[:, :, None, None]
        h = self.block2(h)
        return h + self.residual_conv(x)


# ============================================================
# Simplified U-Net for DDPM
# ============================================================
class UNet(nn.Module):
    """
    U-Net denoising network for DDPM.
    Architecture: Encoder → Bottleneck → Decoder with skip connections.
    Timestep conditioning injected at every ResBlock.
    """
    def __init__(
        self,
        in_channels=1,
        base_channels=64,
        channel_mults=(1, 2, 4),
        time_emb_dim=128,
        num_groups=8
    ):
        super().__init__()

        # Timestep embedding: sinusoidal → MLP → projected dim
        self.time_emb = nn.Sequential(
            SinusoidalPosEmb(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim * 4),
            nn.SiLU(),
            nn.Linear(time_emb_dim * 4, time_emb_dim)
        )

        # Initial projection: in_channels → base_channels
        self.init_conv = nn.Conv2d(in_channels, base_channels, 3, padding=1)

        # Encoder: progressively halve spatial resolution, increase channels
        self.down_blocks = nn.ModuleList()
        self.downsamplers = nn.ModuleList()
        skip_channels = [base_channels]  # track skip connection channel counts
        ch = base_channels

        for mult in channel_mults:
            out_ch = base_channels * mult
            self.down_blocks.append(ResidualBlock(ch, out_ch, time_emb_dim, num_groups))
            self.downsamplers.append(nn.Conv2d(out_ch, out_ch, 4, stride=2, padding=1))
            skip_channels.append(out_ch)
            ch = out_ch

        # Bottleneck: same resolution, deepens representation
        self.mid_block1 = ResidualBlock(ch, ch, time_emb_dim, num_groups)
        self.mid_block2 = ResidualBlock(ch, ch, time_emb_dim, num_groups)

        # Decoder: progressively double spatial resolution, decrease channels
        self.up_blocks = nn.ModuleList()
        self.upsamplers = nn.ModuleList()

        for mult in reversed(channel_mults):
            skip_ch = skip_channels.pop()  # retrieve matching encoder skip
            out_ch = base_channels * mult
            self.upsamplers.append(
                nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
            )
            # Input = current channels + skip connection channels
            self.up_blocks.append(
                ResidualBlock(ch + skip_ch, out_ch, time_emb_dim, num_groups)
            )
            ch = out_ch

        # Final output: recover input channels
        self.final_conv = nn.Sequential(
            nn.GroupNorm(num_groups, ch),
            nn.SiLU(),
            nn.Conv2d(ch, in_channels, 1)
        )

    def forward(self, x, t):
        """
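        Args:
            x: noisy image (B, C, H, W). H and W must be divisible by
               2 ** len(channel_mults), otherwise the upsampled maps will
               not match the skip connections (e.g. pad or resize 28x28
               MNIST digits to 32x32 for the default three levels).
            t: timestep indices (B,) as integers
        Returns:
            predicted noise (B, C, H, W)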
        """
        t_emb = self.time_emb(t)

        x = self.init_conv(x)
        skips = [x]

        # Encoder pass - store all activations for skip connections
        for block, downsample in zip(self.down_blocks, self.downsamplers):
            x = block(x, t_emb)
            skips.append(x)
            x = downsample(x)

        # Bottleneck
        x = self.mid_block1(x, t_emb)
        x = self.mid_block2(x, t_emb)

        # Decoder pass - concatenate skip connections
        for upsample, block in zip(self.upsamplers, self.up_blocks):
            x = upsample(x)
            skip = skips.pop()
            # Concatenate along channel dimension
            x = torch.cat([x, skip], dim=1)
            x = block(x, t_emb)

        return self.final_conv(x)


# ============================================================
# DDPM - noise schedule, training loss, sampling
# ============================================================
class DDPM:
    """
    Full DDPM implementation.
    Supports linear and cosine noise schedules.
    Implements the simplified training objective and DDPM sampling.
    """

    def __init__(self, T=1000, schedule='cosine', device='cuda'):
        self.T = T
        self.device = device

        if schedule == 'linear':
            # Original Ho et al. 2020 schedule
            betas = torch.linspace(1e-4, 0.02, T, device=device)

        elif schedule == 'cosine':
            # Nichol & Dhariwal 2021 cosine schedule
            # Better for high-resolution images - more uniform noise removal
            s = 0.008  # small offset keeps beta_t from being too small near t=0
            steps = T + 1
            x = torch.linspace(0, T, steps, device=device)
            # f(t) = cos^2(pi/2 * (t/T + s) / (1 + s))
            alpha_bars = torch.cos(
                ((x / T) + s) / (1 + s) * math.pi / 2
            ) ** 2
            alpha_bars = alpha_bars / alpha_bars[0]  # normalize to f(0)=1
            # Derive betas from alpha_bars
            betas = 1 - alpha_bars[1:] / alpha_bars[:-1]
            betas = torch.clamp(betas, min=0.0001, max=0.9999)  # numerical safety
        else:
            raise ValueError(f"Unknown schedule: {schedule}")

        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)

        # Pre-compute all quantities needed for training and sampling
        self.betas = betas
        self.alphas = alphas
        self.alpha_bars = alpha_bars
        self.sqrt_alpha_bars = alpha_bars.sqrt()
        self.sqrt_one_minus_alpha_bars = (1 - alpha_bars).sqrt()
        self.sqrt_recip_alphas = (1.0 / alphas).sqrt()

    def q_sample(self, x_0, t, noise=None):
        """
        Sample x_t from x_0 using the closed-form forward process.
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
        This is the key efficiency: O(1) regardless of t.
        """
        if noise is None:
            noise = torch.randn_like(x_0)
        # Index pre-computed coefficients and reshape for broadcasting
        sqrt_ab = self.sqrt_alpha_bars[t][:, None, None, None]
        sqrt_one_minus_ab = self.sqrt_one_minus_alpha_bars[t][:, None, None, None]
        return sqrt_ab * x_0 + sqrt_one_minus_ab * noise

    def training_loss(self, model, x_0):
        """
        DDPM simplified training objective (equation 14 in Ho et al. 2020).
        Steps:
          1. Sample random timestep t uniformly from the T steps (0-indexed here: 0..T-1)
          2. Sample noise eps ~ N(0, I)
          3. Compute noisy image x_t via closed-form forward process
          4. Predict eps with model
          5. MSE loss between true and predicted noise
        """
        batch = x_0.shape[0]
        # Random timestep for each sample in the batch
        t = torch.randint(0, self.T, (batch,), device=self.device)

        noise = torch.randn_like(x_0)
        x_t = self.q_sample(x_0, t, noise)

        # Model predicts the noise - this is the key parameterization choice
        predicted_noise = model(x_t, t)

        # Simplified objective: unweighted MSE over all timesteps
        return F.mse_loss(predicted_noise, noise)

    @torch.no_grad()
    def p_sample(self, model, x_t, t_scalar):
        """
        One reverse denoising step: x_t → x_{t-1}.
        Implements the DDPM sampling algorithm (Algorithm 2 in Ho et al. 2020).
        """
        batch = x_t.shape[0]
        t = torch.full((batch,), t_scalar, device=self.device, dtype=torch.long)

        # Predict noise with the trained U-Net
        eps_pred = model(x_t, t)

        # Retrieve precomputed schedule values for this timestep
        alpha_t = self.alphas[t_scalar]
        sqrt_one_minus_ab = self.sqrt_one_minus_alpha_bars[t_scalar]
        beta_t = self.betas[t_scalar]
        sqrt_recip_alpha_t = self.sqrt_recip_alphas[t_scalar]

        # Compute reverse process mean:
        # mu_theta = (1/sqrt(alpha_t)) * (x_t - (1-alpha_t)/sqrt(1-alpha_bar_t) * eps)
        coeff = (1 - alpha_t) / sqrt_one_minus_ab
        mean = sqrt_recip_alpha_t * (x_t - coeff * eps_pred)

        if t_scalar > 0:
            # Add stochastic noise (sigma_t = sqrt(beta_t) in original DDPM)
            noise = torch.randn_like(x_t)
            x_prev = mean + beta_t.sqrt() * noise
        else:
            # Final step: no noise added
            x_prev = mean

        return x_prev

    @torch.no_grad()
    def sample(self, model, shape, verbose=False):
        """
        Full DDPM sampling: T reverse steps (1000 by default) from Gaussian noise.
        Note: this takes one U-Net forward pass per step.
        Use DDIM sampler (Lesson 04) to reduce to 20-50 steps.
        """
        model.eval()
        x = torch.randn(shape, device=self.device)

        for t in reversed(range(self.T)):
            if verbose and t % 100 == 0:
                print(f"  Sampling step {self.T - t}/{self.T} (t={t})")
            x = self.p_sample(model, x, t)

        return x


# ============================================================
# EMA (Exponential Moving Average) of model weights
# ============================================================
class EMA:
    """
    Exponential Moving Average of model weights.
    Critical for DDPM - EMA model gives significantly better FID
    than online training weights.
    Decay = 0.9999 is standard (Ho et al. 2020).
    """

    def __init__(self, model, decay=0.9999):
        self.decay = decay
        # Create a separate copy of model parameters for EMA
        self.shadow = {}
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self, model):
        """Call after each gradient step."""
        for name, param in model.named_parameters():
            if param.requires_grad:
                # EMA update: shadow = decay * shadow + (1 - decay) * param
                self.shadow[name] = (
                    self.decay * self.shadow[name]
                    + (1 - self.decay) * param.data
                )

    def apply_to(self, model):
        """Copy EMA weights to model for evaluation."""
        for name, param in model.named_parameters():
            if param.requires_grad:
                param.data.copy_(self.shadow[name])


# ============================================================
# Full training loop
# ============================================================
def train_ddpm(
    num_epochs=50,
    batch_size=128,
    lr=2e-4,
    T=1000,
    schedule='cosine',
    device='cuda' if torch.cuda.is_available() else 'cpu'
):
    """
    Complete DDPM training loop on MNIST.
    Includes EMA, gradient clipping, cosine LR schedule.
    """
    # MNIST: 28x28 → resized to 32x32 for clean stride-2 downsampling
    transform = transforms.Compose([
        transforms.Resize(32),
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,))  # [0,1] → [-1,1]
    ])

    dataset = datasets.MNIST(
        root='./data', train=True, download=True, transform=transform
    )
    loader = DataLoader(
        dataset, batch_size=batch_size, shuffle=True,
        num_workers=4, pin_memory=True
    )

    # Model and optimizer
    model = UNet(
        in_channels=1,
        base_channels=64,
        channel_mults=(1, 2, 4),
        time_emb_dim=128
    ).to(device)

    ema = EMA(model, decay=0.9999)
    ddpm = DDPM(T=T, schedule=schedule, device=device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Cosine LR decay (a choice made for this example, not part of the original DDPM recipe)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=num_epochs
    )

    print(f"Training DDPM on {device}")
    print(f"  Model parameters: {sum(p.numel() for p in model.parameters()):,}")
    print(f"  Noise schedule: {schedule}")
    print(f"  Diffusion steps T: {T}")

    model.train()
    for epoch in range(num_epochs):
        total_loss = 0.0
        for batch_idx, (x_0, _) in enumerate(loader):
            x_0 = x_0.to(device)

            # Compute DDPM training loss
            loss = ddpm.training_loss(model, x_0)

            optimizer.zero_grad()
            loss.backward()
            # Gradient clipping prevents gradient explosion in deep U-Net
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

            # Update EMA after every gradient step
            ema.update(model)

            total_loss += loss.item()

        scheduler.step()
        avg_loss = total_loss / len(loader)
        print(
            f"Epoch {epoch+1:3d}/{num_epochs} | "
            f"Loss: {avg_loss:.4f} | "
            f"LR: {scheduler.get_last_lr()[0]:.2e}"
        )

    return model, ddpm, ema


# ============================================================
# Noise schedule comparison
# ============================================================
def compare_schedules(T=1000):
    """
    Compare linear and cosine noise schedules.
    Shows bar_alpha_t at key timesteps and the
    signal-to-noise ratio profile.
    """
    import numpy as np

    # Linear schedule
    betas_linear = np.linspace(1e-4, 0.02, T)
    alphas_linear = 1 - betas_linear
    ab_linear = np.cumprod(alphas_linear)

    # Cosine schedule
    s = 0.008
    steps = np.linspace(0, T, T + 1)
    f = np.cos(((steps / T) + s) / (1 + s) * np.pi / 2) ** 2
    f = f / f[0]
    betas_cosine = 1 - f[1:] / f[:-1]
    betas_cosine = np.clip(betas_cosine, 0.0001, 0.9999)
    alphas_cosine = 1 - betas_cosine
    ab_cosine = np.cumprod(alphas_cosine)

    checkpoints = [100, 200, 300, 400, 500, 600, 700, 800, 900, 999]

    print("Timestep | Linear alpha_bar | Cosine alpha_bar | Cosine/Linear (higher = more signal)")
    print("-" * 80)
    for t in checkpoints:
        ratio = ab_cosine[t] / (ab_linear[t] + 1e-10)
        print(f"  t={t:4d} | {ab_linear[t]:.4f}           | {ab_cosine[t]:.4f}           | {ratio:.2f}x")

    print()
    print("Interpretation: cosine schedule retains more signal at mid-range timesteps.")
    print("This means more informative training steps for textures and fine details.")
    print()

    # SNR comparison
    snr_linear = ab_linear / (1 - ab_linear + 1e-10)
    snr_cosine = ab_cosine / (1 - ab_cosine + 1e-10)
    print(f"Linear SNR at t=200: {snr_linear[199]:.3f}")
    print(f"Cosine SNR at t=200: {snr_cosine[199]:.3f}")
    print(f"Cosine has {snr_cosine[199]/snr_linear[199]:.1f}x higher SNR at t=200 → stronger training signal")


# ============================================================
# Run training and generate samples
# ============================================================
if __name__ == '__main__':
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Train
    model, ddpm, ema = train_ddpm(
        num_epochs=100,
        batch_size=128,
        device=device
    )

    # Compare schedules
    compare_schedules()

    # Generate with online weights
    print("\nGenerating samples with online model weights...")
    samples_online = ddpm.sample(model, shape=(16, 1, 32, 32), verbose=True)

    # Generate with EMA weights (typically better FID)
    print("\nGenerating samples with EMA weights...")
    # Rebuild the U-Net with the same hyperparameters used in train_ddpm so the
    # EMA tensors match the parameter shapes, then overwrite with EMA values
    ema_model = UNet(
        in_channels=1, base_channels=64, channel_mults=(1, 2, 4), time_emb_dim=128
    ).to(device)
    ema_model.load_state_dict(model.state_dict())
    ema.apply_to(ema_model)
    samples_ema = ddpm.sample(ema_model, shape=(16, 1, 32, 32), verbose=True)

    # Unnormalize from [-1, 1] to [0, 1] (rebinding so the saved grid is rescaled too)
    samples_online = ((samples_online + 1) / 2).clamp(0, 1)
    samples_ema = ((samples_ema + 1) / 2).clamp(0, 1)
    for name, samples in [("online", samples_online), ("ema", samples_ema)]:
        print(f"\n{name} samples: shape={samples.shape}, "
              f"min={samples.min():.3f}, max={samples.max():.3f}")

    # torchvision is already required for the dataset, so import it directly
    from torchvision.utils import save_image
    save_image(samples_ema, 'ddpm_ema_samples.png', nrow=4)
    print("EMA samples saved to ddpm_ema_samples.png")

9. FID Score - What It Measures and What Counts as Good

FID (Fréchet Inception Distance) is the standard metric for evaluating generative model quality. Understanding it is essential for DDPM interviews.

How FID Works

1. Generate 50,000 images from the model.
2. Collect 50,000 real images from the reference dataset (for CIFAR-10 this is typically the 50,000-image training set, since the test set has only 10,000 images).
3. Run both sets through an Inception-v3 network and extract the 2048-dimensional penultimate-layer features.
4. Fit multivariate Gaussians $\mathcal{N}(\mu_{real}, \Sigma_{real})$ and $\mathcal{N}(\mu_{gen}, \Sigma_{gen})$ to the real and generated features.
5. Compute the Fréchet distance between these Gaussians:

$\text{FID} = \|\mu_{real} - \mu_{gen}\|^2 + \text{tr}\!\left(\Sigma_{real} + \Sigma_{gen} - 2(\Sigma_{real}\Sigma_{gen})^{1/2}\right)$

Lower FID = better. FID = 0 means perfect match. Real images have FID ≈ 0 with themselves (up to sampling noise).
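The formula can be sketched in a few lines of NumPy. This is a minimal illustration that works on precomputed feature arrays: the Inception-v3 extraction step is omitted, the function name `frechet_distance` is hypothetical, and computing the trace of the matrix square root from eigenvalues is a choice made here rather than the method of any reference FID implementation.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (N, D) feature arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # tr((Sigma_r Sigma_g)^{1/2}) is the sum of square roots of the eigenvalues
    # of Sigma_r @ Sigma_g, which are real and non-negative for PSD inputs
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(5000, 8))   # stand-in for Inception features
same = real.copy()                  # identical set: distance should vanish
shifted = real + 0.5                # same covariance, mean moved by 0.5 per dimension

fid_same = frechet_distance(real, same)
fid_shifted = frechet_distance(real, shifted)
print(f"identical sets: {fid_same:.6f}")
print(f"shifted means:  {fid_shifted:.6f}")
```

With identical covariances the trace term cancels, so the shifted case isolates the mean term $\|\mu_{real} - \mu_{gen}\|^2$ (here 8 dimensions times $0.5^2$).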

What FID Measures

FID measures both quality and diversity. A model that generates only high-quality images of one mode (e.g., perfect cats, no dogs) will have a high (bad) FID because its $\mu_{gen}$ and $\Sigma_{gen}$ do not match the full diversity of the real data.
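The diversity penalty can be seen on a toy example. The sketch below assumes 2-D "features" drawn from a two-mode mixture (an illustrative stand-in for Inception features, not real data) and compares a generator that covers both modes with one that has collapsed onto a single mode. The helper names are hypothetical.

```python
import numpy as np

def frechet_distance(a, b):
    """Frechet distance between Gaussians fitted to two (N, D) arrays."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a, cov_b = np.cov(a, rowvar=False), np.cov(b, rowvar=False)
    eig = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eig, 0.0, None)).sum()
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def sample_modes(rng, n, centers):
    """Draw n points from an equal-weight mixture of isotropic Gaussians."""
    idx = rng.integers(len(centers), size=n)
    return np.asarray(centers)[idx] + 0.5 * rng.normal(size=(n, 2))

rng = np.random.default_rng(0)
both_modes = [(-3.0, 0.0), (3.0, 0.0)]

real = sample_modes(rng, 20000, both_modes)
gen_diverse = sample_modes(rng, 20000, both_modes)        # covers both modes
gen_collapsed = sample_modes(rng, 20000, [(3.0, 0.0)])    # only one mode

fid_diverse = frechet_distance(real, gen_diverse)
fid_collapsed = frechet_distance(real, gen_collapsed)
print(f"diverse generator:   {fid_diverse:.3f}")
print(f"collapsed generator: {fid_collapsed:.3f}")
```

Every collapsed sample is individually "realistic" (it lies on a real mode), yet both the mean and the covariance of the set are wrong, so the distance is large.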

DDPM Benchmarks on CIFAR-10

| Model | FID (CIFAR-10) | Notes |
| --- | --- | --- |
| DDPM (Ho et al. 2020) | 3.17 | Original paper, 1000 steps |
| Improved DDPM (Nichol & Dhariwal 2021) | 2.90 | Cosine schedule + learned variance |
| StyleGAN2 (GAN baseline) | 2.92 | Best GAN at that time |
| DALL-E (autoregressive) | 17.9 | Much weaker than diffusion |
| Score-based (Song & Ermon 2020) | 3.21 | Score matching baseline |

FID of 3.17 means DDPM essentially matched the best GANs with much more stable training. A well-configured DDPM on CIFAR-10 should achieve FID below 4.0. Above 10.0 suggests a training bug (wrong normalization, incorrect schedule, or insufficient training time).

1000 Steps vs Fewer Steps

DDPM sampling uses 1000 steps because the ancestral sampler walks back through every step of the 1000-step training schedule. Using fewer steps with the DDPM sampler introduces discretization error - the assumption that each reverse step is approximately Gaussian holds only when individual steps are small, and it breaks down as steps get larger. FID degrades sharply with fewer than ~100 steps in the DDPM sampler.

DDIM (Lesson 04) uses a different sampling formula that allows 20-50 steps on the same trained model: it replaces the stochastic Markov-chain update with a deterministic one that can be read as discretizing an ODE. This is why DDIM or DPM-Solver is usually preferred for inference - the trained model is identical, only the sampler changes.
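A minimal sketch of the deterministic DDIM update (the $\eta = 0$ case) over a strided subset of timesteps follows. It is written in NumPy and substitutes the true noise for a trained network (an "oracle" predictor, an assumption made purely so the example runs standalone). The function name `ddim_step` and the stride of 20 are illustrative choices.

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """Deterministic DDIM update (eta = 0) from noise level ab_t to ab_prev."""
    # Clean-image estimate implied by the noise prediction
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    # Move to the less noisy level, reusing the same predicted noise direction
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

# Strided sub-sequence of timestep indices: 50 steps instead of 1000
timesteps = np.arange(T - 1, -1, -20)   # 999, 979, ..., 19

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=(4, 4))
eps = rng.normal(size=x0.shape)
ab_start = alpha_bars[timesteps[0]]
x = np.sqrt(ab_start) * x0 + np.sqrt(1.0 - ab_start) * eps

# With an oracle noise predictor, every step recovers the same x0 estimate
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    x = ddim_step(x, eps, alpha_bars[t], alpha_bars[t_prev])
x_final = ddim_step(x, eps, alpha_bars[timesteps[-1]], 1.0)  # final step targets alpha_bar = 1

print("steps used:", len(timesteps))
print("max reconstruction error:", np.abs(x_final - x0).max())
```

A real sampler would call the trained network for `eps_pred` at each step; the point here is only that the update depends on $\bar{\alpha}$ at the two endpoints, so timesteps can be skipped without retraining.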

10. YouTube Resources

| Title | Channel | Why Watch |
| --- | --- | --- |
| Diffusion Models - Beat GANs | Yannic Kilcher | Dhariwal & Nichol paper - classifier guidance and improved DDPM |
| DDPM from Scratch (PyTorch) | Outlier | Step-by-step code implementation with intuition |
| Score-Based Generative Modeling | Yang Song | The connection between DDPM and score matching, from the author |
| Denoising Diffusion Probabilistic Models | AI Coffee Break | Mathematical walkthrough of ELBO derivation |
| Improved DDPM - Nichol & Dhariwal | Yannic Kilcher | Cosine schedule, learned variance, and path to ADM |

11. Production Engineering Notes

:::tip Model size and resolution scaling

The original DDPM used a U-Net with ~35M parameters for 32x32 images (CIFAR-10). For 256x256 images (ImageNet), the ADM model uses ~554M parameters with self-attention at the 32x32, 16x16, and 8x8 resolutions. For 512x512 (Stable Diffusion), computation is moved to a 64x64 latent space - 64x fewer spatial positions (about 48x fewer values overall, since the latent has 4 channels instead of 3). A common rule of thumb: add attention at all resolution levels where the spatial dimension is at most 32. At or below this threshold, global reasoning is cheap and quality-critical.

:::

:::note Variance schedule matters more than you expect

The linear schedule works well for low-resolution images (32x32, 64x64). For 256x256 and above, the cosine schedule is significantly better because the linear schedule destroys high-frequency detail too aggressively in early timesteps, leaving insufficient training signal for learning fine-grained textures. If you are training on high-resolution data and samples look blurry or lack texture, switch to the cosine schedule before trying anything else.

:::

:::note GroupNorm vs BatchNorm for diffusion models

DDPM uses GroupNorm (typically 8 or 32 groups) rather than BatchNorm. The reason: BatchNorm computes statistics across the batch dimension, but diffusion models process images at many different noise levels within the same batch. A timestep-mixed batch has wildly different mean and variance statistics depending on the noise level, which breaks BatchNorm's assumptions. GroupNorm normalizes within each channel group independently of batch statistics - it is stable regardless of batch composition.

:::
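The batch-independence argument can be checked numerically. The sketch below re-implements both normalizations by hand in NumPy (without learnable scale and shift, a simplification made here) and feeds the same sample alongside low-noise and high-noise batch-mates. The arrays are random stand-ins for activations, not outputs of a real U-Net.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize each sample over (channels-in-group, H, W); no affine parameters."""
    b, c, h, w = x.shape
    g = x.reshape(b, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(b, c, h, w)

def batch_norm_train(x, eps=1e-5):
    """Training-mode batch norm: per-channel statistics over (batch, H, W)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
sample = rng.normal(size=(1, 8, 4, 4))              # the sample we track
mates_low = 0.1 * rng.normal(size=(3, 8, 4, 4))     # batch-mates with small activations
mates_high = 5.0 * rng.normal(size=(3, 8, 4, 4))    # batch-mates with large activations

batch_a = np.concatenate([sample, mates_low])
batch_b = np.concatenate([sample, mates_high])

# How much does the tracked sample's output change when its batch-mates change?
gn_diff = np.abs(group_norm(batch_a, 4)[0] - group_norm(batch_b, 4)[0]).max()
bn_diff = np.abs(batch_norm_train(batch_a)[0] - batch_norm_train(batch_b)[0]).max()
print(f"GroupNorm change: {gn_diff:.2e}")
print(f"BatchNorm change: {bn_diff:.2e}")
```

GroupNorm's output for the tracked sample is unaffected by the rest of the batch, while BatchNorm's output shifts with whatever noise levels happen to be sampled alongside it.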

12. Common Mistakes

:::danger Forgetting the closed-form forward process in training

A common implementation bug: computing $x_t$ by running $t$ sequential noising steps in a loop. This is O(T) per training sample - up to 1000x slower than the closed-form approach for $T=1000$. The closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$ computes $x_t$ directly in one operation regardless of $t$. This is what makes DDPM training efficient - each training step needs only one lookup into the precomputed schedule and one network forward pass. Always use the closed form.

:::
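To make the equivalence concrete, the following NumPy sketch draws $x_t$ both ways for many copies of a single scalar "pixel" and compares the empirical mean and standard deviation. The function names and the choice of `t = 299` are illustrative. Arrays are 0-indexed, so index `t` corresponds to `t + 1` noising steps.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample_loop(x0, t, rng):
    """Slow path: apply the single-step kernels for indices 0..t one after another."""
    x = x0
    for s in range(t + 1):
        x = np.sqrt(alphas[s]) * x + np.sqrt(betas[s]) * rng.normal(size=x.shape)
    return x

def q_sample_closed(x0, t, rng):
    """Fast path: one draw from q(x_t | x_0)."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = np.full(50_000, 0.8)   # many copies of one pixel value
t = 299

x_loop = q_sample_loop(x0, t, rng)
x_closed = q_sample_closed(x0, t, rng)

print(f"loop:   mean={x_loop.mean():.3f}  std={x_loop.std():.3f}")
print(f"closed: mean={x_closed.mean():.3f}  std={x_closed.std():.3f}")
print(f"theory: mean={np.sqrt(alpha_bars[t]) * 0.8:.3f}  std={np.sqrt(1 - alpha_bars[t]):.3f}")
```

The two samplers agree up to Monte Carlo error, but the loop needs `t + 1` rounds of random draws where the closed form needs one.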

:::danger Not normalizing inputs to the correct range

The DDPM model expects inputs in $[-1, 1]$. If your images are in $[0, 1]$ (the standard output of `transforms.ToTensor()`), rescale them: `x = 2 * x - 1`. The forward process adds zero-mean Gaussian noise and the sampler starts from $\mathcal{N}(0, I)$, so the whole pipeline assumes data roughly centered at 0. Images left in $[0,1]$ have a positive mean, which skews every intermediate $x_t$ relative to that assumption. This causes subtle training failures that manifest as slightly off-center sample distributions.

:::
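A small sketch of the two conversions, with hypothetical helper names `to_model_range` and `to_image_range`. Keeping them as a matched pair makes it harder to forget the inverse step when saving samples.

```python
import numpy as np

def to_model_range(x01):
    """Map pixel values from [0, 1] to [-1, 1] before feeding the model."""
    return 2.0 * x01 - 1.0

def to_image_range(xm11):
    """Map model outputs from [-1, 1] back to [0, 1], clipping any overshoot."""
    return np.clip((xm11 + 1.0) / 2.0, 0.0, 1.0)

pixels = np.array([0.0, 0.25, 0.5, 1.0])
scaled = to_model_range(pixels)
restored = to_image_range(scaled)

print("model range:", scaled)
print("restored:   ", restored)
```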

:::warning The noise schedule and T are tightly coupled

If you change $T$ from 1000 to 500 by simply truncating the schedule, the model will fail. The noise schedule defines how quickly $\bar{\alpha}_t \to 0$. A linear schedule designed for $T=1000$ reaches near-zero $\bar{\alpha}_T$ only at step 1000. At step 500, $\bar{\alpha}_{500}$ is still around 0.08 - a model trained only up to that point has never seen pure noise. Sampling still starts from pure Gaussian noise, an input outside the training distribution, so the resulting images are poor. Always re-derive the schedule for the new $T$ and retrain when changing $T$.

:::
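The coupling is easy to inspect directly. This sketch assumes the linear $\beta$ range from Ho et al. ($10^{-4}$ to $0.02$) and prints $\bar{\alpha}$ halfway through the 1000-step schedule, at its end, and at the end of the same range stretched over only 500 steps.

```python
import numpy as np

betas_1000 = np.linspace(1e-4, 0.02, 1000)
alpha_bars_1000 = np.cumprod(1.0 - betas_1000)

ab_mid = alpha_bars_1000[499]   # after 500 steps of the T=1000 schedule
ab_end = alpha_bars_1000[999]   # after all 1000 steps

# Re-spacing the same beta range over 500 steps gives yet another endpoint
betas_500 = np.linspace(1e-4, 0.02, 500)
ab_end_500 = np.cumprod(1.0 - betas_500)[-1]

print(f"alpha_bar after 500 of 1000 steps:  {ab_mid:.5f}")
print(f"alpha_bar after 1000 of 1000 steps: {ab_end:.5f}")
print(f"alpha_bar at the end of a T=500 linspace schedule: {ab_end_500:.5f}")
```

Neither shortcut lands on the same terminal noise level as the original schedule, which is why the $\beta$ endpoints (or the schedule family) have to be revisited whenever $T$ changes.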

:::warning EMA weights are essential for evaluation - not optional

DDPM models must be evaluated using EMA weights (decay 0.9999), not the online training weights. The instantaneous training weights fluctuate around the loss minimum - they produce noisier, lower-quality samples. The EMA averages over many steps, providing a smoother approximation to the optimal weights. In practice, the EMA model achieves 0.5-1.5 lower FID than the online model at the same training step. If you are evaluating DDPM quality and getting unexpectedly poor FID, check whether you are using EMA weights for sampling.

:::

13. Interview Q&A

Q1: Derive the closed-form distribution $q(x_t | x_0)$ for DDPM.

Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. We claim $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t)I)$.

Proof by induction. Base case: $q(x_1|x_0) = \mathcal{N}(\sqrt{1-\beta_1}\, x_0, \beta_1 I) = \mathcal{N}(\sqrt{\alpha_1}\, x_0, (1-\alpha_1)I)$, which matches since $\bar{\alpha}_1 = \alpha_1$.

Inductive step: suppose $q(x_{t-1}|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_{t-1}}\, x_0, (1-\bar{\alpha}_{t-1})I)$, i.e. $x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1-\bar{\alpha}_{t-1}}\, \bar{\varepsilon}_{t-1}$ with $\bar{\varepsilon}_{t-1} \sim \mathcal{N}(0, I)$. One forward step gives $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \varepsilon_t$, where $\varepsilon_t \sim \mathcal{N}(0, I)$ is independent of $\bar{\varepsilon}_{t-1}$. Substituting: $x_t = \sqrt{\alpha_t\bar{\alpha}_{t-1}}\, x_0 + \sqrt{\alpha_t(1-\bar{\alpha}_{t-1})}\, \bar{\varepsilon}_{t-1} + \sqrt{1-\alpha_t}\, \varepsilon_t$. The mean coefficient is $\sqrt{\alpha_t\bar{\alpha}_{t-1}} = \sqrt{\bar{\alpha}_t}$. The two noise terms are independent zero-mean Gaussians, so their sum is Gaussian with variance $\alpha_t(1-\bar{\alpha}_{t-1}) + (1-\alpha_t) = 1 - \alpha_t\bar{\alpha}_{t-1} = 1 - \bar{\alpha}_t$. So $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t)I)$. QED.
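The induction can also be checked numerically without any sampling. This sketch propagates the mean coefficient and the variance through the one-step recursion and compares them with the closed form at every step. It assumes the linear schedule, though the identity holds for any $\beta_t \in (0, 1)$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

mean_coef = 1.0   # coefficient on x_0 in the mean of q(x_t | x_0)
var = 0.0         # variance of q(x_t | x_0)
max_mean_err = 0.0
max_var_err = 0.0

for t in range(T):
    # One forward step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps_t
    mean_coef = np.sqrt(alphas[t]) * mean_coef
    var = alphas[t] * var + (1.0 - alphas[t])
    # Compare against the claimed closed form at this step
    max_mean_err = max(max_mean_err, abs(mean_coef - np.sqrt(alpha_bars[t])))
    max_var_err = max(max_var_err, abs(var - (1.0 - alpha_bars[t])))

print(f"max |mean coefficient - sqrt(alpha_bar)|: {max_mean_err:.2e}")
print(f"max |variance - (1 - alpha_bar)|:         {max_var_err:.2e}")
```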

Q2: Why does Ho et al. predict $\varepsilon$ rather than $\mu_\theta$ or $x_0$?

Three reasons. Practical: at high noise levels, predicting $x_0$ means reconstructing a clean image from nearly pure noise - a very high variance target that yields noisy gradients. Predicting $\varepsilon \sim \mathcal{N}(0,I)$ gives a regression target with zero mean and unit variance at every timestep. Mathematical: the ELBO denoising terms reduce to weighted MSE on $\varepsilon$; Ho et al. found the simplified unweighted objective (dropping timestep-dependent weights) performs better empirically. Theoretical: predicting $\varepsilon_\theta(x_t, t)$ is equivalent to estimating the score $\nabla_{x_t} \log q_t(x_t)$, connecting DDPM to score matching. Specifically, $s_\theta(x_t, t) = -\varepsilon_\theta(x_t, t)/\sqrt{1-\bar{\alpha}_t}$, giving a probabilistic interpretation of the training objective.
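The two parameterizations are linked by inverting the forward process, and the conversion factor shows how differently they behave across noise levels. The sketch below assumes the linear schedule and a hypothetical small error added to the true noise. It measures how much that error is magnified in the implied $x_0$ estimate.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def x0_from_eps(x_t, eps, ab_t):
    """Invert x_t = sqrt(ab) * x_0 + sqrt(1 - ab) * eps for x_0."""
    return (x_t - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=1000)
eps = rng.normal(size=x0.shape)
eps_err = 0.01 * rng.normal(size=x0.shape)   # hypothetical prediction error

amplification = {}
for t in (10, 500, 999):
    ab = alpha_bars[t]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    x0_hat = x0_from_eps(x_t, eps + eps_err, ab)
    # RMS error in the implied x_0 divided by RMS error in eps
    amplification[t] = np.sqrt(np.mean((x0_hat - x0) ** 2) / np.mean(eps_err ** 2))
    print(f"t index {t:3d}: error in x0 = {amplification[t]:.3f} x error in eps")
```

The ratio equals $\sqrt{(1-\bar{\alpha}_t)/\bar{\alpha}_t}$: tiny at low noise and very large near $t = T$, which is one way to see that $x_0$ is poorly determined by $x_t$ at high noise while the $\varepsilon$ target keeps the same scale throughout.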

Q3: What is the DDPM ELBO and what does each term mean?

The DDPM ELBO is:

$\mathcal{L} = \underbrace{\mathbb{E}_q[\log p_\theta(x_0|x_1)]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}(q(x_T|x_0) \| p(x_T))}_{\approx 0\text{ by design}} - \sum_{t=2}^{T} \underbrace{D_{\text{KL}}(q(x_{t-1}|x_t,x_0) \| p_\theta(x_{t-1}|x_t))}_{\text{denoising terms} \propto \|\varepsilon - \varepsilon_\theta\|^2}$

The reconstruction term measures how well the final denoising step recovers $x_0$. The prior matching term measures closeness of $x_T$ to $\mathcal{N}(0,I)$ - it is near zero when $\bar{\alpha}_T \approx 0$ and has no trainable parameters because the forward process is fixed, so it is dropped from training. The denoising KL terms are the main training signal: since both $q(x_{t-1}|x_t,x_0)$ and $p_\theta(x_{t-1}|x_t)$ are Gaussian, each KL (with the reverse variance held fixed) has a closed form proportional to $\|\varepsilon - \varepsilon_\theta\|^2$. Dropping the timestep-dependent weighting coefficients gives $\mathcal{L}_{\text{simple}}$.
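To see what dropping the weights does, the sketch below evaluates the coefficient that multiplies $\|\varepsilon - \varepsilon_\theta\|^2$ in each denoising term, $\beta_t^2 / (2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t))$ from Ho et al. (2020), Eq. 12. It assumes the linear schedule and the fixed variance choice $\sigma_t^2 = \beta_t$. Array index `i` corresponds to timestep `t = i + 1`.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# ELBO weight on the squared noise-prediction error, with sigma_t^2 = beta_t
sigma2 = betas
elbo_weights = betas ** 2 / (2.0 * sigma2 * alphas * (1.0 - alpha_bars))

# The denoising sum runs over t = 2..T, i.e. indices 1..T-1
for idx in (1, 9, 99, 499, 999):
    print(f"t={idx + 1:4d}  ELBO weight={elbo_weights[idx]:.4f}  L_simple weight=1.0")
```

Under these assumptions the ELBO weight is largest at the smallest timesteps, so replacing every weight with 1 shifts relative emphasis away from the low-noise steps and toward the harder, noisier ones.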

Q4: What is the difference between linear and cosine noise schedules, and when does it matter?

The linear schedule increases $\beta_t$ linearly from $10^{-4}$ to $0.02$ over $T$ steps, designed for 32x32 images. For higher resolutions, it destroys high-frequency detail too quickly - many training steps fall in a regime where the image is nearly pure noise, providing little useful signal. The cosine schedule defines $\bar{\alpha}_t$ directly as a squared-cosine curve: it changes slowly near $t=0$ and $t=T$ and falls almost linearly in between, maintaining higher SNR for a larger fraction of timesteps. With $T=1000$, the cosine schedule keeps $\bar{\alpha}_{t=200}$ about 1.4x higher than linear (roughly 0.90 vs 0.66), giving more training steps in the "interesting" noise regime. For 128x128 and above, cosine consistently improves FID. For 32x32, the difference is small (less than 0.3 FID on CIFAR-10).

Q5: How does DDPM relate to score matching?

They are different training formulations that produce equivalent models. Score matching (Song and Ermon 2019) trains $s_\theta(x, \sigma)$ to estimate $\nabla_x \log p_\sigma(x)$ - the score of the noisy data distribution. DDPM trains $\varepsilon_\theta(x_t, t)$ to predict added noise. The connection: $s_\theta(x_t, t) = -\varepsilon_\theta(x_t, t)/\sqrt{1-\bar{\alpha}_t}$. So the DDPM noise prediction network is proportional to the negative score. Song et al. (2021) formalized this in the SDE framework, showing DDPM (variance-preserving SDE) and NCSN (variance-exploding SDE) are special cases of a general continuous-time diffusion SDE with a corresponding probability flow ODE. This continuous-time view explains deterministic samplers such as DDIM as discretizations of that ODE and motivated dedicated ODE solvers such as DPM-Solver for accelerated sampling.
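The identity can be verified in a setting where both sides are known in closed form. The sketch assumes 1-D Gaussian data $x_0 \sim \mathcal{N}(0, s^2)$ (a toy assumption, chosen so that the marginal of $x_t$ and the optimal noise predictor $\mathbb{E}[\varepsilon \mid x_t]$ are analytic) and also fits the predictor's slope from samples as a sanity check.

```python
import numpy as np

s = 2.0            # data standard deviation (toy assumption)
alpha_bar = 0.4    # an arbitrary noise level
var_t = alpha_bar * s ** 2 + (1.0 - alpha_bar)   # x_t ~ N(0, var_t)

def true_score(x_t):
    """d/dx log N(x; 0, var_t)."""
    return -x_t / var_t

def optimal_eps(x_t):
    """E[eps | x_t] for jointly Gaussian (eps, x_t): Cov(eps, x_t) / Var(x_t) * x_t."""
    return np.sqrt(1.0 - alpha_bar) * x_t / var_t

grid = np.linspace(-3, 3, 7)
score_from_eps = -optimal_eps(grid) / np.sqrt(1.0 - alpha_bar)
print("true score:     ", np.round(true_score(grid), 4))
print("score from eps: ", np.round(score_from_eps, 4))

# Monte Carlo check: least-squares fit of eps ~ slope * x_t recovers the same predictor
rng = np.random.default_rng(0)
x0 = s * rng.normal(size=200_000)
eps = rng.normal(size=x0.shape)
xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
fitted_slope = np.mean(eps * xt) / np.mean(xt ** 2)
print(f"fitted slope: {fitted_slope:.4f}")
```

A trained network approximates `optimal_eps` for real data, where no closed form exists. The rescaling by $-1/\sqrt{1-\bar{\alpha}_t}$ is the same.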

Q6: What FID score would you expect a well-trained DDPM to achieve on CIFAR-10, and what would indicate a bug?

A well-configured DDPM (cosine schedule, EMA weights, correct normalization) should achieve FID around 2.9-3.5 on CIFAR-10 after sufficient training. The 2020 Ho et al. result was 3.17 with 1000 steps; Improved DDPM achieved 2.90. FID above 5.0 suggests a problem - check: (1) normalization range (images should be in $[-1,1]$), (2) whether EMA weights are being used for sampling, (3) schedule choice and whether $T$ matches the schedule, (4) whether the correct noise is used (random fresh $\varepsilon$ at each training step, not the same across epochs). FID above 15-20 indicates a fundamental bug - likely the closed-form forward process is incorrectly implemented or inputs are not normalized.

This lesson is part of the Diffusion Models module. Next: Score-Based Models and SDEs - The Continuous-Time View.

