DDPMs - The Mathematical Foundation of Diffusion Models

:::note Reading time: ~55 minutes | Interview relevance: Very High | Target roles: ML Engineer, Research Engineer, AI Engineer, Applied Scientist :::


The Real Interview Moment

You are interviewing for a research scientist role at a generative AI lab. The interviewer writes $q(x_t \mid x_0)$ on the whiteboard and says: "Derive this closed form. Then explain how the training objective follows from the ELBO, and tell me why Ho et al. chose to predict $\varepsilon$ instead of $x_0$."

This is a qualifying question. It separates engineers who have used Stable Diffusion from engineers who understand it. Most people can draw forward and backward arrows on a diagram. Few can derive the closed-form distribution at an arbitrary timestep $t$, explain why that derivation enables efficient training, connect it to the variational lower bound, articulate the empirical finding that noise prediction outperforms image prediction, and then describe why the cosine schedule (Nichol and Dhariwal 2021) improves on the linear one.

The interviewer continues: "Now, sketch the U-Net architecture - how does it incorporate the timestep, how do skip connections help, and where does attention go?"

Then: "What FID score would you expect a well-trained DDPM to achieve on CIFAR-10, and what does FID actually measure?"

These questions form a complete arc from mathematical derivation to practical engineering to evaluation. Jonathan Ho, Ajay Jain, and Pieter Abbeel at UC Berkeley answered all of them in their 2020 DDPM paper. They unified ideas from non-equilibrium thermodynamics and score matching, used a U-Net with self-attention as the denoiser, and reported a state-of-the-art FID of 3.17 on unconditional CIFAR-10 along with 256x256 LSUN samples of quality similar to ProgressiveGAN. This lesson gives you the complete derivation, implementation, and intuition to answer every question in that arc.


Why This Exists - The Core Problem with Adversarial Training

Prior to DDPM, the dominant approach to high-quality image generation was Generative Adversarial Networks (GANs). GANs have one fundamental training problem: the generator and discriminator must be carefully balanced throughout training. Too strong a discriminator and gradients vanish - the generator gets no useful signal. Too weak a discriminator and the generator does not improve. Mode collapse can happen silently and is difficult to detect until evaluation time. Every GAN paper required new stabilization tricks: spectral normalization, gradient penalty, progressive growing, minibatch standard deviation.

The deeper problem is that GANs optimize an adversarial objective, not a likelihood-based one. The training signal is indirect: "the discriminator says this looks fake." There is no explicit mathematical connection between the training loss and the quality of the learned distribution. When a GAN converges, there is no theoretical guarantee that it has learned the true data distribution - only that the generator can fool the discriminator.

Diffusion models take the opposite approach. They define an explicit, principled training objective derived from the variational lower bound on the log-likelihood. The training signal is direct: "predict the noise that was added at a specific timestep." This is a simple regression problem, always well-defined, never adversarial. The result is dramatically more stable training - you can train DDPM reliably with a standard Adam optimizer and cosine learning rate schedule, no discriminator, no mode collapse, no specialized tricks required.

The cost is inference speed: DDPM requires 1000 sequential denoising steps to generate one sample. GANs generate in a single forward pass. This speed gap motivated DDIM, DPM-Solver, and Consistency Models (covered in Lessons 04–06). But the stability advantage of DDPM's principled objective is what made the entire diffusion model ecosystem possible.


Historical Context - The Origins of DDPM

The idea of progressive noising comes from non-equilibrium statistical mechanics. Sohl-Dickstein et al. (2015) proposed learning a generative model by reversing a diffusion process - gradually adding noise until the data distribution becomes a known prior, then learning to reverse the process. The paper proved this was theoretically sound: if you can perfectly learn the reverse of a diffusion process, you can generate samples from the data distribution. But the implementation was too slow and the networks too small to produce competitive results. The paper was visionary but practically ahead of its time.

Yang Song and Stefano Ermon (2019, 2020) approached generative modeling from a different angle: score matching. If you know the score function $\nabla_x \log p(x)$ of the data distribution, you can generate samples via Langevin dynamics - just follow the gradient of the log probability, adding noise to avoid getting trapped. They showed that training on data perturbed at multiple noise scales produced a score function that worked across the full noise range. Their NCSN model produced impressive results but still lagged behind GANs in FID.

Ho et al. (2020) connected these threads. They showed that a particular parameterization of the DDPM model - specifically, predicting the noise $\varepsilon$ rather than the denoised image $x_0$ - gave a training objective equivalent to weighted denoising score matching. They used a modern U-Net with attention layers as the denoiser, scaled the training to 256x256 resolution, and achieved an FID of 3.17 on CIFAR-10 - competitive with the best GANs, without any adversarial training instability. DDPM was born, and with it, the modern era of diffusion-based generative AI.


1. The Forward Process - Adding Noise Systematically

Definition and Intuition

The forward process is a Markov chain that gradually destroys structure in a clean image $x_0$ by adding Gaussian noise at each step:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\right)$$

Here $\{\beta_t\}_{t=1}^T$ is the noise schedule - a sequence of small positive constants controlling how much noise is added at each step. The mean $\sqrt{1-\beta_t}\, x_{t-1}$ scales down the signal from the previous step. The variance $\beta_t I$ adds Gaussian noise. Over $T$ steps (Ho et al. used $T = 1000$), this progressively destroys all structure in $x_0$ until only noise remains.

Intuitively: imagine repeatedly photocopying a photo on a machine that adds a tiny amount of static each time. After many copies, all you have is static. The forward process does exactly this to a digital image, but in a mathematically controlled way.

The full forward joint distribution factors as:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

Key Property: Closed-Form at Any Timestep

One of the most important mathematical properties of this particular forward process is that $q(x_t \mid x_0)$ has a closed form at any arbitrary timestep $t$ - you do not need to chain $t$ steps of sampling to get $x_t$.

Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Then:

$$\boxed{q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar{\alpha}_t}\, x_0,\; (1-\bar{\alpha}_t) I\right)}$$

Proof (by induction, using the fact that a sum of independent Gaussians is Gaussian):

Base case: At $t=1$: $q(x_1 \mid x_0) = \mathcal{N}(x_1; \sqrt{1-\beta_1}\, x_0, \beta_1 I) = \mathcal{N}(x_1; \sqrt{\alpha_1}\, x_0, (1-\alpha_1)I)$, which matches the formula since $\bar{\alpha}_1 = \alpha_1$.

Inductive step: Suppose $q(x_{t-1} \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_{t-1}}\, x_0, (1-\bar{\alpha}_{t-1}) I)$.

Apply one more forward step. We can write $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{\beta_t}\, \varepsilon_2$ where $\varepsilon_2 \sim \mathcal{N}(0,I)$.

Substituting the inductive hypothesis $x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1-\bar{\alpha}_{t-1}}\, \varepsilon_1$, with $\varepsilon_1 \sim \mathcal{N}(0,I)$ independent of $\varepsilon_2$:

$$x_t = \sqrt{\alpha_t}\!\left(\sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1-\bar{\alpha}_{t-1}}\, \varepsilon_1\right) + \sqrt{\beta_t}\, \varepsilon_2$$

$$= \sqrt{\alpha_t \bar{\alpha}_{t-1}}\, x_0 + \underbrace{\sqrt{\alpha_t(1-\bar{\alpha}_{t-1})}\, \varepsilon_1 + \sqrt{\beta_t}\, \varepsilon_2}_{\text{sum of independent Gaussians}}$$

The two noise terms are independent and zero-mean, so their sum is a zero-mean Gaussian whose variance is the sum of the individual variances:

$$\alpha_t(1-\bar{\alpha}_{t-1}) + \beta_t = \alpha_t - \alpha_t\bar{\alpha}_{t-1} + 1 - \alpha_t = 1 - \alpha_t\bar{\alpha}_{t-1} = 1 - \bar{\alpha}_t$$

So $x_t \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I)$. QED.
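The induction can also be checked numerically. The sketch below is illustrative only: it assumes the linear schedule from Ho et al. ($\beta_1 = 10^{-4}$, $\beta_T = 0.02$, $T = 1000$), and the names `mean_coef` and `noise_var` are ours. It applies the one-step update to the coefficient on $x_0$ and to the noise variance, then compares both with $\sqrt{\bar{\alpha}_t}$ and $1-\bar{\alpha}_t$ at every step.

```python
import math

T = 1000
# Linear schedule (assumed for illustration): betas[i] holds beta_{i+1}.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

mean_coef = 1.0   # coefficient multiplying x_0 in x_t (x_0 itself at t = 0)
noise_var = 0.0   # variance of the accumulated noise in x_t
alpha_bar = 1.0   # running product of alpha_s
max_err = 0.0

for beta in betas:
    alpha = 1.0 - beta
    # One forward step: x_t = sqrt(alpha) * x_{t-1} + sqrt(beta) * eps
    mean_coef = math.sqrt(alpha) * mean_coef
    noise_var = alpha * noise_var + beta
    alpha_bar *= alpha
    # Compare against the closed form q(x_t | x_0)
    max_err = max(
        max_err,
        abs(mean_coef - math.sqrt(alpha_bar)),
        abs(noise_var - (1.0 - alpha_bar)),
    )

print(f"largest deviation from the closed form: {max_err:.2e}")
```

The deviation should be at the level of floating-point rounding, which is what the variance identity in the proof predicts.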

Reparameterized Sampling

This closed form gives us a crucial efficiency: we can sample $x_t$ directly from $x_0$ in one step using the reparameterization trick:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

This is the reparameterization used in every DDPM training iteration. We do not need to run $t$ sequential noising steps - we jump directly from the clean image $x_0$ to a noisy image $x_t$ at any desired noise level in a single elementwise operation. Without this closed form, producing a training input at timestep $t$ would take $t$ sequential noising steps, $O(T)$ in the worst case. With it, the cost is $O(1)$ per sample regardless of $t$.

Forward Process Visualization

As $t$ increases, $\bar{\alpha}_t \to 0$, so the clean signal is scaled toward zero and the image becomes pure Gaussian noise. Approximate values for the linear schedule of Ho et al. ($\beta_1 = 10^{-4}$, $\beta_T = 0.02$, $T = 1000$):

t=0: x_0 (clean image, full detail, bar_alpha = 1.0)
↓ add very small noise (beta_1 = 0.0001)
t=100: x_100 (slightly noisy, bar_alpha ≈ 0.90)
↓ add more noise
t=500: x_500 (heavy noise, only coarse structure left, bar_alpha ≈ 0.08)
↓ add more noise
t=800: x_800 (almost indistinguishable from noise, bar_alpha ≈ 0.002)
↓ add final noise
t=T: x_T ≈ N(0, I) (pure noise, no identifiable structure)

The noise schedule $\{\beta_t\}$ controls how quickly this destruction happens - a design choice with real consequences for generation quality.


2. The Noise Schedule - Linear vs Cosine

The noise schedule $\{\beta_t\}$ determines the rate at which signal is destroyed. This is not just a technical detail - the choice of schedule significantly affects training efficiency and sample quality, and the best choice depends on image resolution.

Linear Schedule (Ho et al. 2020)

Ho et al. used a linear schedule increasing $\beta_t$ uniformly from $\beta_1$ to $\beta_T$:

$$\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1), \qquad \beta_1 = 10^{-4},\; \beta_T = 0.02,\; T=1000$$
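To get a feel for how fast this schedule removes signal, the following sketch (plain Python, helper names are ours) tabulates $\bar{\alpha}_t$ and the signal-to-noise ratio $\bar{\alpha}_t/(1-\bar{\alpha}_t)$ at a few timesteps.

```python
T = 1000
beta_1, beta_T = 1e-4, 0.02

# betas[t - 1] holds beta_t for t = 1..T
betas = [beta_1 + (t - 1) / (T - 1) * (beta_T - beta_1) for t in range(1, T + 1)]

# Cumulative products: alpha_bars[t - 1] holds alpha_bar_t
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)


def alpha_bar(t):
    """Cumulative signal coefficient for the 1-indexed timestep t."""
    return alpha_bars[t - 1]


def snr(t):
    """Signal-to-noise ratio alpha_bar_t / (1 - alpha_bar_t)."""
    return alpha_bar(t) / (1.0 - alpha_bar(t))


for t in (100, 500, 800, 1000):
    print(f"t={t:4d}  alpha_bar={alpha_bar(t):.4f}  SNR={snr(t):.4f}")
```

Running it is a quick way to confirm any hand-quoted value of $\bar{\alpha}_t$ before relying on it in an argument about the schedule.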

Ho et al. obtained their CIFAR-10 and LSUN results with this schedule. The small $\beta_1$ means very little noise is added in the first few steps, and the gradual increase ensures a smooth transition to pure noise by step $T$.

Problem: the linear schedule destroys information faster than necessary. Nichol and Dhariwal (2021) report that it works well for high-resolution images but is sub-optimal at 64x64 and 32x32, because the end of the forward process is too noisy to contribute much to sample quality. Many training steps fall in a regime where the image is nearly pure noise, providing little useful training signal. The model sees too many "already destroyed" images and too few "partially corrupted" ones.

Signal-to-noise ratio at the midpoint (linear): $\text{SNR}_{t=500} = \bar{\alpha}_{500}/(1-\bar{\alpha}_{500}) \approx 0.085$, since $\bar{\alpha}_{500} \approx 0.08$. By step 600, the image has lost most of its structure.

Cosine Schedule (Nichol and Dhariwal 2021)

Nichol and Dhariwal proposed a cosine schedule defined through $\bar{\alpha}_t$:

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$$

where $s = 0.008$ is a small offset that keeps $\beta_t$ from being too small near $t=0$. Nichol and Dhariwal found that tiny amounts of noise at the start of the process made it hard for the network to predict $\varepsilon$ accurately. The individual $\beta_t = 1 - \bar{\alpha}_t/\bar{\alpha}_{t-1}$ are then recovered from consecutive values of $\bar{\alpha}_t$.

The cosine schedule has a characteristic shape: $\bar{\alpha}_t$ changes slowly near $t=0$ and $t=T$ and falls almost linearly in between, so roughly half of the signal variance is still present at the midpoint ($\bar{\alpha}_{T/2} \approx 0.49$ for $T=1000$). This ensures:

  1. Early timesteps are meaningful: the model sees images with small but non-trivial amounts of noise, learning fine-grained denoising
  2. Middle timesteps are information-rich: the transition zone from structured to unstructured gets more training steps
  3. Late timesteps are efficient: the model does not waste steps on images that are already essentially pure noise

Nichol and Dhariwal report that the cosine schedule improves both log-likelihood and FID in their ImageNet 64x64 ablations. The cosine schedule (or a learned or otherwise tuned schedule) is a common default, but the best schedule depends on resolution and should be validated for the model at hand.
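A small sketch of the cosine schedule, following the definition of $f(t)$ above. The function names are ours, and the cap of 0.999 on $\beta_t$ follows the clipping Nichol and Dhariwal describe for avoiding singularities near $t = T$.

```python
import math


def cosine_alpha_bar(t, T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(u):
        return math.cos((u / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)


def cosine_betas(T=1000, s=0.008, max_beta=0.999):
    """Recover beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, capped at max_beta."""
    return [
        min(1.0 - cosine_alpha_bar(t, T, s) / cosine_alpha_bar(t - 1, T, s), max_beta)
        for t in range(1, T + 1)
    ]


betas = cosine_betas()
for t in (0, 250, 500, 750, 1000):
    print(f"t={t:4d}  alpha_bar={cosine_alpha_bar(t):.4f}")
```

Defining the schedule through $\bar{\alpha}_t$ and deriving $\beta_t$ afterwards keeps the shape of the signal decay explicit, which is the quantity the schedule is designed around.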


3. The Reverse Process - Learning to Denoise

The reverse process is what we want to learn. Starting from $x_T \sim \mathcal{N}(0, I)$, we want to progressively denoise to recover $x_0$. Each reverse step is modeled as a Gaussian:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)$$

The functions $\mu_\theta$ and $\Sigma_\theta$ are parameterized by neural networks (in practice, a single U-Net). The full generative model is:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

where $p(x_T) = \mathcal{N}(0, I)$ is the prior - pure Gaussian noise.

The True Reverse Posterior

A key mathematical property: the true reverse posterior $q(x_{t-1} \mid x_t, x_0)$ is tractable when conditioned on $x_0$. Using Bayes' theorem and the Gaussian forward process, it is also Gaussian:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \tilde{\mu}_t(x_t, x_0),\; \tilde{\beta}_t I\right)$$

where:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t$$

$$\tilde{\beta}_t = \frac{(1-\bar{\alpha}_{t-1})\,\beta_t}{1-\bar{\alpha}_t}$$

This posterior mean is a weighted combination of $x_0$ (the clean image, which we do not know at inference time) and $x_t$ (the current noisy image, which we do have). The variance $\tilde{\beta}_t$ is entirely determined by the noise schedule - no learning needed for the variance in the original DDPM formulation.

The learned $p_\theta(x_{t-1} \mid x_t)$ should approximate this posterior. But since $x_0$ is unknown at inference time, we must parameterize $\mu_\theta$ in terms of what the network predicts from $x_t$ alone.
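The posterior coefficients are easy to tabulate once the schedule is fixed. The sketch below assumes the linear schedule from Ho et al. and uses our own helper name `posterior_terms`. It also supports a useful consistency check: averaging $\tilde{\mu}_t$ over $q(x_t \mid x_0)$ must give back $\sqrt{\bar{\alpha}_{t-1}}\,x_0$, the mean of $q(x_{t-1} \mid x_0)$.

```python
import math

T = 1000
# Linear schedule (assumed for illustration): betas[t - 1] holds beta_t.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)


def posterior_terms(t):
    """Return (coef_x0, coef_xt, beta_tilde) of q(x_{t-1} | x_t, x_0) for 1-indexed t."""
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    ab_t = alpha_bars[t - 1]
    ab_prev = alpha_bars[t - 2] if t > 1 else 1.0  # alpha_bar_0 = 1 by convention
    coef_x0 = math.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    beta_tilde = (1.0 - ab_prev) * beta_t / (1.0 - ab_t)
    return coef_x0, coef_xt, beta_tilde


for t in (1, 2, 500, 1000):
    c0, ct, bt = posterior_terms(t)
    print(f"t={t:4d}  coef_x0={c0:.4f}  coef_xt={ct:.4f}  beta_tilde={bt:.6f}")
```

Note that $\tilde{\beta}_1 = 0$: given both $x_1$ and $x_0$, there is nothing left to be uncertain about at the last reverse step.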


4. The ELBO Derivation

To train the reverse process, we would like to maximize the log-likelihood $\log p_\theta(x_0)$. Since this is intractable (it requires marginalizing over all $T$ intermediate latents $x_{1:T}$), we maximize the ELBO (Evidence Lower BOund) instead:

$$\log p_\theta(x_0) \geq \mathcal{L}(\theta) = \mathbb{E}_q\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$$

Expanding by substituting the factored forms of $p_\theta$ and $q$:

$$\mathcal{L} = \underbrace{\mathbb{E}_q[\log p_\theta(x_0 \mid x_1)]}_{L_0} - \underbrace{D_{\text{KL}}\!\left(q(x_T \mid x_0) \,\|\, p(x_T)\right)}_{L_T} - \sum_{t=2}^{T} \underbrace{\mathbb{E}_q\!\left[D_{\text{KL}}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right)\right]}_{L_{t-1}}$$

Parsing each term:

$L_0$ - Reconstruction: How well does the final denoising step $p_\theta(x_0 \mid x_1)$ recover the clean image? Analogous to the reconstruction term in a VAE. In practice, treated as a Gaussian with fixed variance.

$L_T$ - Prior matching: How close is the fully noised image $x_T$ to the prior $\mathcal{N}(0,I)$? If the noise schedule is designed so that $\bar{\alpha}_T \approx 0$, this term is approximately zero and requires no optimization - it is determined by the fixed forward process, not the model. Ho et al. ignore this term entirely.

$L_{t-1}$ - Denoising terms (the main training signal): For each $t$ from 2 to $T$, how well does the learned reverse step $p_\theta(x_{t-1} \mid x_t)$ match the true conditional reverse posterior $q(x_{t-1} \mid x_t, x_0)$?

Since both distributions are Gaussian, each KL divergence has a closed form. With the reverse variance held fixed, it is proportional to the squared difference between the means, plus a constant that does not depend on $\theta$. This is where the bulk of training occurs.
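As a worked step, assume the reverse variance is fixed to $\sigma_t^2 I$, as in Ho et al. The KL divergence between the two Gaussians then reduces to a scaled squared distance between their means, with $C$ collecting the terms that do not depend on $\theta$:

```latex
L_{t-1} = \mathbb{E}_q\!\left[ \frac{1}{2\sigma_t^2}
  \bigl\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \bigr\|^2 \right] + C
```

Training the mean $\mu_\theta$ is therefore a regression problem against the tractable posterior mean $\tilde{\mu}_t$.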


5. The Simplified Training Objective

From ELBO to Noise Prediction

Ho et al. made a critical practical choice: reparameterize the mean in terms of the noise $\varepsilon$ rather than predicting $\mu_\theta$ or $x_0$ directly.

From the forward process: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$, so we can solve for $x_0$:

$$x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \varepsilon}{\sqrt{\bar{\alpha}_t}}$$

Substituting this expression into the true posterior mean $\tilde{\mu}_t$:

$$\tilde{\mu}_t(x_t, \varepsilon) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon\right)$$

So if the network predicts $\varepsilon_\theta(x_t, t) \approx \varepsilon$, we can compute the optimal mean:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta(x_t, t)\right)$$
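As a sanity check on the algebra, the sketch below evaluates the posterior mean both ways: from $(x_t, x_0)$ and from $(x_t, \varepsilon)$. It uses scalar "images", assumes the linear schedule from Ho et al., and the helper names are ours. The two values should agree up to floating-point error.

```python
import math
import random

T = 1000
# Linear schedule (assumed for illustration): betas[t - 1] holds beta_t.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)


def mean_from_x0(x_t, x_0, t):
    """Posterior mean written as a weighted combination of x_0 and x_t."""
    ab_t = alpha_bars[t - 1]
    ab_prev = alpha_bars[t - 2] if t > 1 else 1.0  # alpha_bar_0 = 1
    coef_x0 = math.sqrt(ab_prev) * betas[t - 1] / (1.0 - ab_t)
    coef_xt = math.sqrt(alphas[t - 1]) * (1.0 - ab_prev) / (1.0 - ab_t)
    return coef_x0 * x_0 + coef_xt * x_t


def mean_from_eps(x_t, eps, t):
    """The same mean after substituting x_0 = (x_t - sqrt(1 - ab) * eps) / sqrt(ab)."""
    ab_t = alpha_bars[t - 1]
    scale = (1.0 - alphas[t - 1]) / math.sqrt(1.0 - ab_t)
    return (x_t - scale * eps) / math.sqrt(alphas[t - 1])


rng = random.Random(0)
max_gap = 0.0
for t in (1, 10, 250, 500, 999, 1000):
    x_0 = rng.uniform(-1.0, 1.0)
    eps = rng.gauss(0.0, 1.0)
    ab_t = alpha_bars[t - 1]
    x_t = math.sqrt(ab_t) * x_0 + math.sqrt(1.0 - ab_t) * eps
    gap = abs(mean_from_x0(x_t, x_0, t) - mean_from_eps(x_t, eps, t))
    max_gap = max(max_gap, gap)

print(f"largest gap between the two forms: {max_gap:.2e}")
```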

With the reverse variance fixed to $\sigma_t^2 I$, the denoising KL terms become:

$$\mathbb{E}_{x_0, \varepsilon}\!\left[\frac{(1-\alpha_t)^2}{2\sigma_t^2\,\alpha_t(1-\bar{\alpha}_t)}\, \|\varepsilon - \varepsilon_\theta(x_t, t)\|^2\right]$$

Ho et al. dropped the timestep-dependent weighting coefficient $\frac{(1-\alpha_t)^2}{2\sigma_t^2\,\alpha_t(1-\bar{\alpha}_t)}$ and found that training with the simplified objective worked better in practice:

$$\boxed{\mathcal{L}_{\text{simple}} = \mathbb{E}_{t \sim \mathcal{U}[1,T],\; x_0 \sim q,\; \varepsilon \sim \mathcal{N}(0,I)}\!\left[\left\|\varepsilon - \varepsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon,\; t\right)\right\|^2\right]}$$

This is the DDPM training objective in its final form. Sample a clean image $x_0$, pick a random timestep $t$, add noise $\varepsilon$ to get $x_t$, ask the network to predict $\varepsilon$, measure MSE. Simple, elegant, powerful.

Why Predict Noise, Not the Clean Image?

You might reasonably ask: why predict $\varepsilon$ rather than predicting $x_0$ directly? The two parameterizations are interchangeable - given $\varepsilon_\theta$, we can compute $\hat{x}_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon_\theta) / \sqrt{\bar{\alpha}_t}$ - but an unweighted MSE on one is not the same loss as an unweighted MSE on the other.

The empirical answer: Ho et al. report that predicting $x_0$ led to worse sample quality early in their experiments. Their published ablation compares predicting the posterior mean $\tilde{\mu}_t$ against predicting $\varepsilon$, and noise prediction with the simplified objective gave the best FID.

The intuitive explanation: at high noise levels ($t$ large, $\bar{\alpha}_t$ small), predicting $x_0$ requires reconstructing a clean image from nearly pure noise. The target has extremely high variance - many different clean images are consistent with the same noisy observation. The gradient signal is therefore very noisy. When predicting $\varepsilon \sim \mathcal{N}(0,I)$, the regression target has the same unit-variance distribution at every noise level, so it is well-behaved at every $t$.

The theoretical connection: the optimal $\varepsilon_\theta(x_t, t)$ is a rescaled estimate of the score function $\nabla_{x_t} \log q_t(x_t)$ of the noisy data distribution, since:

$$\varepsilon_\theta(x_t, t) \approx -\sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t} \log q_t(x_t)$$

This connects DDPM to the score matching framework of Song and Ermon - despite different derivations, both approaches train a network that estimates the score of the noised data at every noise level.
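The relation between the optimal noise prediction and the score can be verified on a toy problem where everything is computable. The sketch below is an illustration under our own assumptions: a one-dimensional "dataset" with two equally likely points, $x_0 \in \{-1, +1\}$, and helper names that are ours. It computes $\mathbb{E}[\varepsilon \mid x_t]$ exactly via the posterior over $x_0$, and compares it with $-\sqrt{1-\bar{\alpha}_t}$ times a finite-difference estimate of the score.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def q_t(x_t, alpha_bar, points=(-1.0, 1.0)):
    """Marginal density of x_t when x_0 is uniform over `points`."""
    var = 1.0 - alpha_bar
    return sum(normal_pdf(x_t, math.sqrt(alpha_bar) * p, var) for p in points) / len(points)


def expected_eps(x_t, alpha_bar, points=(-1.0, 1.0)):
    """E[eps | x_t]: posterior-weighted average of the noise implied by each x_0."""
    var = 1.0 - alpha_bar
    weights = [normal_pdf(x_t, math.sqrt(alpha_bar) * p, var) for p in points]
    total = sum(weights)
    return sum(
        w / total * (x_t - math.sqrt(alpha_bar) * p) / math.sqrt(var)
        for w, p in zip(weights, points)
    )


def score(x_t, alpha_bar, h=1e-5):
    """Central finite difference of log q_t(x_t)."""
    return (math.log(q_t(x_t + h, alpha_bar)) - math.log(q_t(x_t - h, alpha_bar))) / (2.0 * h)


alpha_bar, x_t = 0.6, 0.3
lhs = expected_eps(x_t, alpha_bar)
rhs = -math.sqrt(1.0 - alpha_bar) * score(x_t, alpha_bar)
print(f"E[eps | x_t] = {lhs:.6f}, -sqrt(1 - alpha_bar) * score = {rhs:.6f}")
```

A trained $\varepsilon_\theta$ approximates the left-hand side, which is why dividing its output by $-\sqrt{1-\bar{\alpha}_t}$ yields a score estimate.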

Why the simplified objective works better than the weighted one: the ELBO weight on each squared-error term is large at small $t$ (where the step is small and the denoising task is easy) and small at large $t$ (where the step is large and the denoising task is hard). The simplified objective, by dropping this weight, places equal emphasis on all timesteps. Relative to the ELBO, that down-weights the easy low-noise terms, so the network spends more of its capacity on the harder high-noise denoising tasks. Empirically this produces better FID.


6. The Sampling Algorithm

At generation time, we start from $x_T \sim \mathcal{N}(0,I)$ and apply the learned reverse process $T$ times. Substituting the noise-prediction parameterization:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t, t)\right) + \sigma_t\, z_t, \qquad z_t \sim \mathcal{N}(0,I)$$

where $\sigma_t^2 = \beta_t$ (the choice in the original DDPM paper). At the final step $t=1$, we set $z_1 = 0$ to avoid adding noise to an almost-clean image.

Full Algorithm:

  1. Sample $x_T \sim \mathcal{N}(0, I)$
  2. For $t = T, T-1, \ldots, 1$:
     a. If $t > 1$: sample $z \sim \mathcal{N}(0, I)$, else $z = 0$
     b. Compute $\hat{\varepsilon} = \varepsilon_\theta(x_t, t)$ - one U-Net forward pass
     c. Update: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\varepsilon}\right) + \sqrt{\beta_t}\, z$
  3. Return $x_0$

This requires $T$ U-Net forward passes - 1000 for the original DDPM. This is the sampling bottleneck that DDIM (Lesson 04) reduces to 20-50 steps.
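The loop can be exercised end to end without a trained network. The sketch below is a toy under our own assumptions: the "data" is one-dimensional with $x_0 \sim \mathcal{N}(2, 0.5^2)$, for which the ideal noise prediction $\mathbb{E}[\varepsilon \mid x_t]$ has a closed form, and that closed form (our hypothetical `eps_model`) stands in for the U-Net. The schedule is the linear one from Ho et al. and $\sigma_t^2 = \beta_t$.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # betas[i] = beta_{i+1}
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

DATA_MEAN, DATA_STD = 2.0, 0.5  # toy dataset: x_0 ~ N(2, 0.5^2)


def eps_model(x_t, i):
    """Closed-form E[eps | x_t] for the toy Gaussian data (stand-in for a trained U-Net)."""
    ab = alpha_bars[i]
    var_t = ab * DATA_STD ** 2 + (1.0 - ab)  # variance of x_t under the forward process
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * DATA_MEAN) / var_t


def ddpm_sample(rng):
    x = rng.gauss(0.0, 1.0)  # step 1: x_T ~ N(0, 1)
    for i in reversed(range(T)):  # i = T-1, ..., 0 corresponds to t = T, ..., 1
        z = rng.gauss(0.0, 1.0) if i > 0 else 0.0  # step 2a: no noise on the last step
        eps_hat = eps_model(x, i)  # step 2b: one "network" evaluation
        mean = (x - betas[i] / math.sqrt(1.0 - alpha_bars[i]) * eps_hat) / math.sqrt(alphas[i])
        x = mean + math.sqrt(betas[i]) * z  # step 2c with sigma_t^2 = beta_t
    return x  # step 3: x_0


rng = random.Random(0)
samples = [ddpm_sample(rng) for _ in range(500)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(f"sample mean = {sample_mean:.2f}, sample std = {sample_std:.2f}")
```

If the loop is implemented correctly, the sample statistics land close to the data mean and standard deviation, which makes this a cheap regression test for index and sign errors before plugging in a real model.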

Variance Choice: Fixed vs Learned

Ho et al. fixed $\sigma_t^2 = \beta_t$ (the upper bound on the reverse-process variance) rather than using the posterior variance $\tilde{\beta}_t$ (the lower bound), and noted that both choices gave similar sample quality. Nichol and Dhariwal (2021) showed that learning the variance - interpolating between $\beta_t$ and $\tilde{\beta}_t$ using the network's output - improves log-likelihood and allows faster sampling with fewer steps. Their improved DDPM parameterizes the variance as:

$$\Sigma_\theta(x_t, t) = \exp\!\left(v \log \beta_t + (1-v) \log \tilde{\beta}_t\right)$$

where $v$ is a vector with one component per dimension, predicted by the network alongside the noise estimate. Values of $v$ in $[0,1]$ interpolate between the two bounds in log space; the network output is not explicitly constrained to that range.
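A minimal sketch of the interpolation for a single scalar component, with a function name of our choosing and illustrative input values that are not taken from a trained model. At $t = 1$ the posterior variance $\tilde{\beta}_1$ is zero, so its logarithm needs special handling that this sketch leaves out.

```python
import math


def interpolated_variance(v, beta_t, beta_tilde_t):
    """Log-space interpolation: beta_tilde_t at v = 0, beta_t at v = 1."""
    return math.exp(v * math.log(beta_t) + (1.0 - v) * math.log(beta_tilde_t))


# Illustrative values for a mid-schedule step (assumed, not from a real model).
beta_t, beta_tilde_t = 0.01, 0.008
for v in (0.0, 0.5, 1.0):
    print(f"v={v:.1f}  variance={interpolated_variance(v, beta_t, beta_tilde_t):.6f}")
```

Interpolating in log space means $v = 0.5$ gives the geometric mean of the two bounds rather than their arithmetic mean.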


7. The U-Net Backbone

The denoising network $\varepsilon_\theta(x_t, t)$ is a U-Net - originally proposed for biomedical image segmentation by Ronneberger et al. (2015). The U-Net is well suited to the DDPM denoising task for specific structural reasons.

Why U-Net for Diffusion?

Skip connections preserve fine details: the U-Net's skip connections pass feature maps directly from encoder to decoder at each resolution level. Without skip connections, the information bottleneck in the middle of the network would destroy high-frequency spatial detail. With skip connections, fine textures and edge information bypass the bottleneck and are available for the decoder to use in reconstruction.

Multi-scale reasoning: the encoder-decoder structure processes the image at multiple spatial resolutions simultaneously. The bottleneck (lowest resolution) captures global structure - overall composition, large-scale lighting, semantic content. The shallow layers (highest resolution) capture local detail - texture, fine edges, grain patterns. This multi-scale hierarchy mirrors how diffusion naturally operates: high-noise steps require understanding global structure, low-noise steps require understanding fine detail.

Self-attention at low resolutions: attention layers in the bottleneck and lower-resolution feature maps capture long-range spatial dependencies - allowing the model to ensure that, for example, a face is internally consistent even when the two eyes are far apart in the image.

Timestep conditioning via sinusoidal embeddings: the timestep $t$ must be communicated to the network so it knows which noise level it is operating at. This is done via sinusoidal position embeddings (borrowed from Transformers):

$$\text{emb}(t) = \left[\sin(t / 10000^{2i/d}),\; \cos(t / 10000^{2i/d})\right]_{i=1}^{d/2}$$

This embedding is projected through a small MLP and added to the feature maps inside each residual block, so the network's behavior changes smoothly with $t$, from predicting fine-detail noise at small $t$ to predicting large-scale structure at large $t$.

Architecture Summary

The Ho et al. U-Net for 32x32 images:

Input: (B, C, 32, 32)
┌─────────────────────────────────────────────────┐
│ Encoder │
│ conv(C → 128) → [32×32] │
│ ResBlock × 2 + Downsample → [16×16] │
│ ResBlock × 2 + Self-Attention + Down → [8×8] │
│ ResBlock × 2 + Downsample → [4×4] │
├─────────────────────────────────────────────────│
│ Bottleneck │
│ ResBlock + Self-Attention + ResBlock [4×4] │
├─────────────────────────────────────────────────│
│ Decoder (with skip connections from encoder) │
│ ResBlock × 2 + Upsample → [8×8] │
│ ResBlock × 2 + Self-Attention + Up → [16×16] │
│ ResBlock × 2 + Upsample → [32×32] │
│ conv(128 → C) │
└─────────────────────────────────────────────────┘
Output: epsilon_hat (B, C, 32, 32)

Each ResBlock receives the timestep embedding and incorporates it via: GroupNorm → SiLU → Conv → Add(t_emb_proj) → GroupNorm → SiLU → Conv.


8. Complete PyTorch Implementation

The following is a complete, runnable DDPM implementation with a U-Net denoiser, training loop, and sampling loop. It trains on MNIST to demonstrate the core mechanics on accessible compute.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import math


# ============================================================
# Sinusoidal timestep embedding
# ============================================================
class SinusoidalPosEmb(nn.Module):
    """
    Sinusoidal positional embedding for timestep conditioning.
    Same encoding as Transformer positional embeddings.
    Output shape: (batch, dim)
    """
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        device = t.device
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        # t: (batch,) → multiply with frequency basis
        emb = t[:, None].float() * emb[None, :]
        emb = torch.cat([emb.sin(), emb.cos()], dim=-1)
        return emb  # (batch, dim)


# ============================================================
# Residual block with timestep conditioning
# ============================================================
class ResidualBlock(nn.Module):
    """
    Residual block conditioned on timestep embedding.
    Uses GroupNorm + SiLU (better than BatchNorm for diffusion).
    """
    def __init__(self, in_channels, out_channels, time_emb_dim, num_groups=8):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_emb_dim, out_channels)
        )
        self.block1 = nn.Sequential(
            nn.GroupNorm(num_groups, in_channels),
            nn.SiLU(),
            nn.Conv2d(in_channels, out_channels, 3, padding=1)
        )
        self.block2 = nn.Sequential(
            nn.GroupNorm(num_groups, out_channels),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
        )
        # Residual connection handles channel mismatch
        self.residual_conv = (
            nn.Conv2d(in_channels, out_channels, 1)
            if in_channels != out_channels else nn.Identity()
        )

    def forward(self, x, t_emb):
        h = self.block1(x)
        # Add timestep embedding broadcast over H, W
        h = h + self.time_mlp(t_emb)[:, :, None, None]
        h = self.block2(h)
        return h + self.residual_conv(x)


# ============================================================
# Simplified U-Net for DDPM
# ============================================================
class UNet(nn.Module):
    """
    U-Net denoising network for DDPM.
    Architecture: Encoder → Bottleneck → Decoder with skip connections.
    Timestep conditioning injected at every ResBlock.
    Note: H and W must be divisible by 2 ** len(channel_mults) so that the
    upsampled maps match the skip connections (28x28 MNIST inputs would need
    padding or resizing, e.g. to 32x32).
    """
    def __init__(
        self,
        in_channels=1,
        base_channels=64,
        channel_mults=(1, 2, 4),
        time_emb_dim=128,
        num_groups=8
    ):
        super().__init__()

        # Timestep embedding: sinusoidal → MLP → projected dim
        self.time_emb = nn.Sequential(
            SinusoidalPosEmb(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim * 4),
            nn.SiLU(),
            nn.Linear(time_emb_dim * 4, time_emb_dim)
        )

        # Initial projection: in_channels → base_channels
        self.init_conv = nn.Conv2d(in_channels, base_channels, 3, padding=1)

        # Encoder: progressively halve spatial resolution, increase channels
        self.down_blocks = nn.ModuleList()
        self.downsamplers = nn.ModuleList()
        skip_channels = [base_channels]  # track skip connection channel counts
        ch = base_channels

        for mult in channel_mults:
            out_ch = base_channels * mult
            self.down_blocks.append(ResidualBlock(ch, out_ch, time_emb_dim, num_groups))
            self.downsamplers.append(nn.Conv2d(out_ch, out_ch, 4, stride=2, padding=1))
            skip_channels.append(out_ch)
            ch = out_ch

        # Bottleneck: same resolution, deepens representation
        self.mid_block1 = ResidualBlock(ch, ch, time_emb_dim, num_groups)
        self.mid_block2 = ResidualBlock(ch, ch, time_emb_dim, num_groups)

        # Decoder: progressively double spatial resolution, decrease channels
        self.up_blocks = nn.ModuleList()
        self.upsamplers = nn.ModuleList()

        for mult in reversed(channel_mults):
            skip_ch = skip_channels.pop()  # retrieve matching encoder skip
            out_ch = base_channels * mult
            self.upsamplers.append(
                nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
            )
            # Input = current channels + skip connection channels
            self.up_blocks.append(
                ResidualBlock(ch + skip_ch, out_ch, time_emb_dim, num_groups)
            )
            ch = out_ch

        # Final output: recover input channels
        self.final_conv = nn.Sequential(
            nn.GroupNorm(num_groups, ch),
            nn.SiLU(),
            nn.Conv2d(ch, in_channels, 1)
        )

    def forward(self, x, t):
        """
        Args:
            x: noisy image (B, C, H, W)
            t: timestep indices (B,) as integers
        Returns:
            predicted noise (B, C, H, W)
        """
        t_emb = self.time_emb(t)

        x = self.init_conv(x)
        skips = [x]

        # Encoder pass - store all activations for skip connections
        for block, downsample in zip(self.down_blocks, self.downsamplers):
            x = block(x, t_emb)
            skips.append(x)
            x = downsample(x)

        # Bottleneck
        x = self.mid_block1(x, t_emb)
        x = self.mid_block2(x, t_emb)

        # Decoder pass - concatenate skip connections
        for upsample, block in zip(self.upsamplers, self.up_blocks):
            x = upsample(x)
            skip = skips.pop()
            # Concatenate along channel dimension
            x = torch.cat([x, skip], dim=1)
            x = block(x, t_emb)

        return self.final_conv(x)


# ============================================================
# DDPM - noise schedule, training loss, sampling
# ============================================================
class DDPM:
    """
    Full DDPM implementation.
    Supports linear and cosine noise schedules.
    Implements the simplified training objective and DDPM sampling.
    """

    def __init__(self, T=1000, schedule='cosine', device='cuda'):
        self.T = T
        self.device = device

        if schedule == 'linear':
            # Original Ho et al. 2020 schedule
            betas = torch.linspace(1e-4, 0.02, T, device=device)

        elif schedule == 'cosine':
            # Nichol & Dhariwal 2021 cosine schedule
            # Better for high-resolution images - more uniform noise removal
            s = 0.008  # small offset keeps beta_t from being too small near t=0
            steps = T + 1
            x = torch.linspace(0, T, steps, device=device)
            # f(t) = cos^2(pi/2 * (t/T + s) / (1 + s))
            alpha_bars = torch.cos(
                ((x / T) + s) / (1 + s) * math.pi / 2
            ) ** 2
            alpha_bars = alpha_bars / alpha_bars[0]  # normalize to f(0)=1
            # Derive betas from alpha_bars
            betas = 1 - alpha_bars[1:] / alpha_bars[:-1]
            betas = torch.clamp(betas, min=0.0001, max=0.9999)  # numerical safety
        else:
            raise ValueError(f"Unknown schedule: {schedule}")

        # Recompute alpha_bars from the (possibly clamped) betas so all quantities agree
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)

        # Pre-compute all quantities needed for training and sampling
        self.betas = betas
        self.alphas = alphas
        self.alpha_bars = alpha_bars
        self.sqrt_alpha_bars = alpha_bars.sqrt()
        self.sqrt_one_minus_alpha_bars = (1 - alpha_bars).sqrt()
        self.sqrt_recip_alphas = (1.0 / alphas).sqrt()

    def q_sample(self, x_0, t, noise=None):
        """
        Sample x_t from x_0 using the closed-form forward process.
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
        This is the key efficiency: O(1) regardless of t.
        """
        if noise is None:
            noise = torch.randn_like(x_0)
        # Index pre-computed coefficients and reshape for broadcasting
        sqrt_ab = self.sqrt_alpha_bars[t][:, None, None, None]
        sqrt_one_minus_ab = self.sqrt_one_minus_alpha_bars[t][:, None, None, None]
        return sqrt_ab * x_0 + sqrt_one_minus_ab * noise

    def training_loss(self, model, x_0):
        """
        DDPM simplified training objective (equation 14 in Ho et al. 2020).
        Steps:
        1. Sample random timestep index t uniformly from {0, ..., T-1}
           (0-indexed; corresponds to t ~ U{1, ..., T} in the paper)
        2. Sample noise eps ~ N(0, I)
        3. Compute noisy image x_t via closed-form forward process
        4. Predict eps with model
        5. MSE loss between true and predicted noise
        """
        batch = x_0.shape[0]
        # Random timestep for each sample in the batch
        t = torch.randint(0, self.T, (batch,), device=self.device)

        noise = torch.randn_like(x_0)
        x_t = self.q_sample(x_0, t, noise)

        # Model predicts the noise - this is the key parameterization choice
        predicted_noise = model(x_t, t)

        # Simplified objective: unweighted MSE over all timesteps
        return F.mse_loss(predicted_noise, noise)

    @torch.no_grad()
    def p_sample(self, model, x_t, t_scalar):
        """
        One reverse denoising step: x_t → x_{t-1}.
        Implements the DDPM sampling algorithm (Algorithm 2 in Ho et al. 2020).
        """
        batch = x_t.shape[0]
        t = torch.full((batch,), t_scalar, device=self.device, dtype=torch.long)

        # Predict noise with the trained U-Net
        eps_pred = model(x_t, t)

        # Retrieve precomputed schedule values for this timestep
        alpha_t = self.alphas[t_scalar]
        sqrt_one_minus_ab = self.sqrt_one_minus_alpha_bars[t_scalar]
        beta_t = self.betas[t_scalar]
        sqrt_recip_alpha_t = self.sqrt_recip_alphas[t_scalar]

        # Compute reverse process mean:
        # mu_theta = (1/sqrt(alpha_t)) * (x_t - (1-alpha_t)/sqrt(1-alpha_bar_t) * eps)
        coeff = (1 - alpha_t) / sqrt_one_minus_ab
        mean = sqrt_recip_alpha_t * (x_t - coeff * eps_pred)

        if t_scalar > 0:
            # Add stochastic noise (sigma_t = sqrt(beta_t) in original DDPM)
            noise = torch.randn_like(x_t)
            x_prev = mean + beta_t.sqrt() * noise
        else:
            # Final step: no noise added
            x_prev = mean

        return x_prev

    @torch.no_grad()
    def sample(self, model, shape, verbose=False):
        """
        Full DDPM sampling: T reverse steps from Gaussian noise (1000 by default).
        Note: this takes one U-Net forward pass per step.
        Use DDIM sampler (Lesson 04) to reduce to 20-50 steps.
        """
        model.eval()
        x = torch.randn(shape, device=self.device)

        for t in reversed(range(self.T)):
            if verbose and t % 100 == 0:
                print(f" Sampling step {self.T - t}/{self.T} (t={t})")
            x = self.p_sample(model, x, t)

        return x


# ============================================================
# EMA (Exponential Moving Average) of model weights
# ============================================================
class EMA:
    """
    Exponential Moving Average of model weights.
    Critical for DDPM - EMA model gives significantly better FID
    than online training weights.
    Decay = 0.9999 is standard (Ho et al. 2020).
    """

    def __init__(self, model, decay=0.9999):
        self.decay = decay
        # Create a separate copy of model parameters for EMA
        self.shadow = {}
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self, model):
        """Call after each gradient step."""
        for name, param in model.named_parameters():
            if param.requires_grad:
                # EMA update: shadow = decay * shadow + (1 - decay) * param
                self.shadow[name] = (
                    self.decay * self.shadow[name]
                    + (1 - self.decay) * param.data
                )

    def apply_to(self, model):
        """Copy EMA weights to model for evaluation."""
        for name, param in model.named_parameters():
            if param.requires_grad:
                param.data.copy_(self.shadow[name])


# ============================================================
# Full training loop
# ============================================================
def train_ddpm(
    num_epochs=50,
    batch_size=128,
    lr=2e-4,
    T=1000,
    schedule='cosine',
    device='cuda' if torch.cuda.is_available() else 'cpu'
):
    """
    Complete DDPM training loop on MNIST.
    Includes EMA, gradient clipping, cosine LR schedule.
    """
    # MNIST: 28x28 → resized to 32x32 for clean stride-2 downsampling
    transform = transforms.Compose([
        transforms.Resize(32),
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,))  # [0,1] → [-1,1]
    ])

    dataset = datasets.MNIST(
        root='./data', train=True, download=True, transform=transform
    )
    loader = DataLoader(
        dataset, batch_size=batch_size, shuffle=True,
        num_workers=4, pin_memory=True
    )

    # Model and optimizer
    model = UNet(
        in_channels=1,
        base_channels=64,
        channel_mults=(1, 2, 4),
        time_emb_dim=128
    ).to(device)

    ema = EMA(model, decay=0.9999)
    ddpm = DDPM(T=T, schedule=schedule, device=device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Cosine LR decay, stepped once per epoch
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=num_epochs
    )

    print(f"Training DDPM on {device}")
    print(f" Model parameters: {sum(p.numel() for p in model.parameters()):,}")
    print(f" Noise schedule: {schedule}")
    print(f" Diffusion steps T: {T}")

    model.train()
    for epoch in range(num_epochs):
        total_loss = 0.0
        for batch_idx, (x_0, _) in enumerate(loader):
            x_0 = x_0.to(device)

            # Compute DDPM training loss
            loss = ddpm.training_loss(model, x_0)

            optimizer.zero_grad()
            loss.backward()
            # Gradient clipping prevents gradient explosion in deep U-Net
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

            # Update EMA after every gradient step
            ema.update(model)

            total_loss += loss.item()

        scheduler.step()
        avg_loss = total_loss / len(loader)
        print(
            f"Epoch {epoch+1:3d}/{num_epochs} | "
            f"Loss: {avg_loss:.4f} | "
            f"LR: {scheduler.get_last_lr()[0]:.2e}"
        )

    return model, ddpm, ema


# ============================================================
# Noise schedule comparison
# ============================================================
def compare_schedules(T=1000):
    """
    Compare linear and cosine noise schedules.
    Shows bar_alpha_t at key timesteps and the
    signal-to-noise ratio profile.
    """
    import numpy as np

    # Linear schedule
    betas_linear = np.linspace(1e-4, 0.02, T)
    alphas_linear = 1 - betas_linear
    ab_linear = np.cumprod(alphas_linear)

    # Cosine schedule
    s = 0.008
    steps = np.linspace(0, T, T + 1)
    f = np.cos(((steps / T) + s) / (1 + s) * np.pi / 2) ** 2
    f = f / f[0]
    betas_cosine = 1 - f[1:] / f[:-1]
    betas_cosine = np.clip(betas_cosine, 0.0001, 0.9999)
    alphas_cosine = 1 - betas_cosine
    ab_cosine = np.cumprod(alphas_cosine)

    checkpoints = [100, 200, 300, 400, 500, 600, 700, 800, 900, 999]

    print("Timestep | Linear alpha_bar | Cosine alpha_bar | Cosine/Linear (higher = more signal)")
    print("-" * 80)
    for t in checkpoints:
        ratio = ab_cosine[t] / (ab_linear[t] + 1e-10)
        print(f" t={t:4d} | {ab_linear[t]:.4f} | {ab_cosine[t]:.4f} | {ratio:.2f}x")

    print()
    print("Interpretation: cosine schedule retains more signal at mid-range timesteps.")
    print("This means more informative training steps for textures and fine details.")
    print()

    # SNR comparison
    snr_linear = ab_linear / (1 - ab_linear + 1e-10)
    snr_cosine = ab_cosine / (1 - ab_cosine + 1e-10)
    print(f"Linear SNR at t=200: {snr_linear[199]:.3f}")
    print(f"Cosine SNR at t=200: {snr_cosine[199]:.3f}")
    print(f"Cosine has {snr_cosine[199]/snr_linear[199]:.1f}x higher SNR at t=200 → stronger training signal")


# ============================================================
# Run training and generate samples
# ============================================================
if __name__ == '__main__':
    import copy

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Train
    model, ddpm, ema = train_ddpm(
        num_epochs=100,
        batch_size=128,
        device=device
    )

    # Compare schedules
    compare_schedules()

    # Generate with online weights
    print("\nGenerating samples with online model weights...")
    samples_online = ddpm.sample(model, shape=(16, 1, 32, 32), verbose=True)

    # Generate with EMA weights (typically better FID)
    print("\nGenerating samples with EMA weights...")
    # Copy the trained model so the architecture matches the EMA shadow weights
    ema_model = copy.deepcopy(model)
    ema.apply_to(ema_model)
    samples_ema = ddpm.sample(ema_model, shape=(16, 1, 32, 32), verbose=True)

    # Unnormalize from [-1, 1] to [0, 1] and keep the results for saving
    images = {}
    for name, samples in [("online", samples_online), ("ema", samples_ema)]:
        samples = ((samples + 1) / 2).clamp(0, 1)
        images[name] = samples
        print(f"\n{name} samples: shape={samples.shape}, "
              f"min={samples.min():.3f}, max={samples.max():.3f}")

    try:
        from torchvision.utils import save_image
        # Save the unnormalized [0, 1] images, not the raw [-1, 1] samples
        save_image(images["ema"], 'ddpm_ema_samples.png', nrow=4)
        print("EMA samples saved to ddpm_ema_samples.png")
    except ImportError:
        print("Install torchvision to save samples as grid")

9. FID Score - What It Measures and What Counts as Good

FID (Fréchet Inception Distance) is the standard metric for evaluating generative model quality. Understanding it is essential for DDPM interviews.

How FID Works

  1. Generate 50,000 images from the model
  2. Collect 50,000 real images as the reference set (for CIFAR-10, commonly the training set)
  3. Run both sets through an Inception-v3 network, extract the 2048-dimensional penultimate layer features
  4. Fit multivariate Gaussians $\mathcal{N}(\mu_{real}, \Sigma_{real})$ and $\mathcal{N}(\mu_{gen}, \Sigma_{gen})$ to the real and generated features
  5. Compute the Fréchet distance between these Gaussians:

$$\text{FID} = \|\mu_{real} - \mu_{gen}\|^2 + \text{tr}\!\left(\Sigma_{real} + \Sigma_{gen} - 2(\Sigma_{real}\Sigma_{gen})^{1/2}\right)$$

Lower FID = better. FID = 0 means perfect match. Real images have FID ≈ 0 with themselves (up to sampling noise).
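
As a rough illustration of step 5, the Fréchet distance can be computed directly from the two sets of Gaussian statistics. The sketch below is a minimal NumPy version, not the reference FID implementation: the function name `frechet_distance` and the 3-dimensional toy statistics are assumptions for illustration, and the trace of the matrix square root is evaluated through the eigenvalues of $\Sigma_{real}\Sigma_{gen}$ rather than by forming the square root explicitly.

```python
import numpy as np

def frechet_distance(mu_real, sigma_real, mu_gen, sigma_gen):
    """Fréchet distance between N(mu_real, sigma_real) and N(mu_gen, sigma_gen)."""
    diff = mu_real - mu_gen
    # tr((S1 S2)^{1/2}) equals the sum of square roots of the eigenvalues of S1 S2,
    # which are real and non-negative when both covariances are PSD.
    eigvals = np.linalg.eigvals(sigma_real @ sigma_gen)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_real) + np.trace(sigma_gen) - 2.0 * tr_sqrt)

# Toy statistics standing in for Inception features (3-D instead of 2048-D)
mu_a, cov_a = np.zeros(3), np.eye(3)
mu_b, cov_b = np.array([1.0, 2.0, 2.0]), np.eye(3)

fid_same = frechet_distance(mu_a, cov_a, mu_a, cov_a)                  # identical Gaussians
fid_shift = frechet_distance(mu_a, cov_a, mu_b, cov_b)                 # mean shifted by a vector of norm 3
fid_collapsed = frechet_distance(mu_a, cov_a, mu_a, 0.25 * np.eye(3))  # same mean, too little variance

print(fid_same, fid_shift, fid_collapsed)
```

The third case has a matching mean but reduced variance and still gets a non-zero distance, which is how a loss of diversity shows up in the metric.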

What FID Measures

FID measures both quality and diversity. A model that generates only high-quality images of one mode (e.g., perfect cats, no dogs) will have high FID because its $\mu_{gen}$ and $\Sigma_{gen}$ do not match the full diversity of the test set.

DDPM Benchmarks on CIFAR-10

| Model | FID (CIFAR-10) | Notes |
|---|---|---|
| DDPM (Ho et al. 2020) | 3.17 | Original paper, 1000 steps |
| Improved DDPM (Nichol & Dhariwal 2021) | 2.90 | Cosine schedule + learned variance |
| StyleGAN2 (GAN baseline) | 2.92 | Best GAN at that time |
| DALL-E (autoregressive) | 17.9 | Much weaker than diffusion |
| Score-based (Song & Ermon 2020) | 3.21 | Score matching baseline |

FID of 3.17 means DDPM essentially matched the best GANs with much more stable training. A well-configured DDPM on CIFAR-10 should achieve FID below 4.0. Above 10.0 suggests a training bug (wrong normalization, incorrect schedule, or insufficient training time).

1000 Steps vs Fewer Steps

DDPM sampling uses 1000 steps because the model was trained with a 1000-step schedule. Using fewer steps with the DDPM sampler introduces discretization error: the assumption that each reverse step is approximately Gaussian only holds when the per-step noise is small, so it degrades as the steps get larger. FID degrades sharply with fewer than ~100 steps in the DDPM sampler.

DDIM (Lesson 04) uses a different sampling formula that allows 20-50 steps on the same trained model by making the reverse update deterministic, which can be read as discretizing an ODE rather than simulating a stochastic Markov chain. This is why DDIM or DPM-Solver is usually preferred for inference - the trained model is identical, only the sampler changes.
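
To make the "same model, different sampler" point concrete, here is a minimal NumPy sketch of one deterministic DDIM update (the $\eta = 0$ case). The names `ddim_step`, `ab_t`, and `ab_prev` are illustrative assumptions, and the noise prediction is passed in as an array rather than coming from a trained network.

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update from noise level ab_t to ab_prev (eta = 0).

    ab_t and ab_prev are alpha_bar values at the current and target timesteps;
    the target may be many training timesteps away.
    """
    # Estimate the clean image implied by the noise prediction
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    # Re-noise that estimate to the target level using the same predicted noise
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(2, 1, 4, 4))
eps = rng.standard_normal(x0.shape)

ab_t, ab_prev = 0.3, 0.8
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * eps

# With a perfect noise prediction, one large jump lands exactly on the
# forward-process sample at the target noise level.
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
expected = np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev) * eps
print(np.allclose(x_prev, expected))
```

Nothing in the update depends on the two noise levels being adjacent, which is what lets the sampler skip most of the 1000 training timesteps. With a real network the noise prediction is imperfect, so accuracy still falls as the jumps grow.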


10. YouTube Resources

| Title | Channel | Why Watch |
|---|---|---|
| Diffusion Models - Beat GANs | Yannic Kilcher | Dhariwal & Nichol paper - classifier guidance and improved DDPM |
| DDPM from Scratch (PyTorch) | Outlier | Step-by-step code implementation with intuition |
| Score-Based Generative Modeling | Yang Song | The connection between DDPM and score matching, from the author |
| Denoising Diffusion Probabilistic Models | AI Coffee Break | Mathematical walkthrough of ELBO derivation |
| Improved DDPM - Nichol & Dhariwal | Yannic Kilcher | Cosine schedule, learned variance, and path to ADM |

11. Production Engineering Notes

:::tip Model size and resolution scaling

The original DDPM used a U-Net with ~35M parameters for 32x32 images (CIFAR-10). For 256x256 images (ImageNet), the ADM model uses ~554M parameters with self-attention at the 32x32, 16x16, and 8x8 resolutions. For 512x512 (Stable Diffusion), computation is moved to a 64x64 latent space, which cuts the number of spatial positions by 64x. The scaling rule: add attention at all resolution levels where the spatial dimension is at most 32. At those resolutions, global reasoning is cheap and quality-critical.

:::

:::note Variance schedule matters more than you expect

The linear schedule works well for low-resolution images (32x32, 64x64). For 256x256 and above, the cosine schedule is significantly better because the linear schedule destroys high-frequency detail too aggressively in early timesteps, leaving insufficient training signal for learning fine-grained textures. If you are training on high-resolution data and samples look blurry or lack texture, switch to the cosine schedule before trying anything else.

:::

:::note GroupNorm vs BatchNorm for diffusion models

DDPM uses GroupNorm (typically 8 or 32 groups) rather than BatchNorm. The reason: BatchNorm computes statistics across the batch dimension, but diffusion models process images at many different noise levels within the same batch. A timestep-mixed batch has wildly different mean and variance statistics depending on the noise level, which breaks BatchNorm's assumptions. GroupNorm normalizes within each channel group independently of batch statistics - it is stable regardless of batch composition.

:::
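
The batch-independence argument can be checked with a small NumPy sketch. The functions `group_norm` and `batch_norm` below are simplified stand-ins written for this illustration (no learned scale or shift, training-mode statistics only), and the arrays are toy activations rather than real U-Net features.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """GroupNorm without learned scale/shift: statistics per sample, per channel group."""
    b, c, h, w = x.shape
    g = x.reshape(b, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(b, c, h, w)

def batch_norm(x, eps=1e-5):
    """Training-mode BatchNorm without learned scale/shift: per-channel statistics across the batch."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
low_noise = rng.uniform(-0.2, 0.2, size=(1, 8, 4, 4))   # toy activations for a lightly noised sample
high_noise = rng.standard_normal((3, 8, 4, 4))          # toy activations for heavily noised samples
batch = np.concatenate([low_noise, high_noise], axis=0)

# GroupNorm: the first sample's output ignores what else is in the batch
gn_alone = group_norm(low_noise, num_groups=4)[0]
gn_mixed = group_norm(batch, num_groups=4)[0]

# BatchNorm: the first sample's output changes with batch composition
bn_alone = batch_norm(low_noise)[0]
bn_mixed = batch_norm(batch)[0]

print("GroupNorm unchanged:", np.allclose(gn_alone, gn_mixed))
print("BatchNorm unchanged:", np.allclose(bn_alone, bn_mixed))
```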


12. Common Mistakes

:::danger Forgetting the closed-form forward process in training

A common implementation bug: computing $x_t$ by running $t$ sequential noising steps in a loop. This is O(t) per training sample - up to 1000x slower than the closed-form approach. The closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$ computes $x_t$ directly in one operation regardless of $t$. This is what makes DDPM training efficient - each training example needs a single noising operation followed by a single network forward pass. Always use the closed form.

:::

:::danger Not normalizing inputs to the correct range

The DDPM model expects inputs in $[-1, 1]$. If your images are in $[0, 1]$ (standard output from transforms.ToTensor()), normalize them: x = 2 * x - 1. Failing to do this shifts the clean image distribution away from the Gaussian noise added during the forward process. At high noise levels, the noisy image should look like pure Gaussian noise - but if the clean image is in $[0, 1]$ instead of $[-1, 1]$, the additive Gaussian noise centered at 0 will systematically shift the distribution. This causes subtle training failures that manifest as slightly off-center sample distributions.

:::

:::warning The noise schedule and T are tightly coupled

If you change $T$ from 1000 to 500 without adjusting the noise schedule, the model will fail. The noise schedule defines how quickly $\bar{\alpha}_t \to 0$. A schedule designed for $T=1000$ reaches near-zero $\bar{\alpha}_T$ at step 1000. At step 500, $\bar{\alpha}_{500}$ is still around 0.08 for the linear schedule, so roughly 28% of the signal amplitude ($\sqrt{0.08}$) remains - the model has never seen pure noise during training. Sampling with $T=500$ steps then starts from pure Gaussian noise, an input the model was never trained on, and sample quality suffers. Always retrain or re-derive the schedule when changing $T$.

:::
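
A quick numerical check of how much signal survives when the 1000-step linear schedule is cut at step 500. This is a sketch using the linear schedule endpoints quoted in this lesson; the variable names are chosen for illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear schedule from Ho et al. 2020
alpha_bars = np.cumprod(1.0 - betas)

ab_half = alpha_bars[499]                 # after 500 of the 1000 steps
ab_full = alpha_bars[-1]                  # after all 1000 steps

# sqrt(alpha_bar) is the factor multiplying x_0 in the closed-form forward process
print(f"alpha_bar at step 500:  {ab_half:.4f}  (signal scale {np.sqrt(ab_half):.3f})")
print(f"alpha_bar at step 1000: {ab_full:.2e}  (signal scale {np.sqrt(ab_full):.4f})")
```

The gap between the two signal scales is the mismatch a truncated schedule introduces between the training inputs and the pure-noise starting point of the sampler.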

:::warning EMA weights are essential for evaluation - not optional

DDPM models must be evaluated using EMA weights (decay 0.9999), not the online training weights. The instantaneous training weights fluctuate around the loss minimum - they produce noisier, lower-quality samples. The EMA averages over many steps, providing a smoother approximation to the optimal weights. In practice, the EMA model achieves 0.5-1.5 lower FID than the online model at the same training step. If you are evaluating DDPM quality and getting unexpectedly poor FID, check whether you are using EMA weights for sampling.

:::


13. Interview Q&A

Q1: Derive the closed-form distribution $q(x_t | x_0)$ for DDPM.

Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. We claim $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t)I)$.

Proof by induction. Base case: $q(x_1|x_0) = \mathcal{N}(\sqrt{1-\beta_1}\, x_0, \beta_1 I) = \mathcal{N}(\sqrt{\alpha_1}\, x_0, (1-\alpha_1)I)$, which matches since $\bar{\alpha}_1 = \alpha_1$.

Inductive step: suppose $q(x_{t-1}|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_{t-1}}\, x_0, (1-\bar{\alpha}_{t-1})I)$, so $x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1-\bar{\alpha}_{t-1}}\, \varepsilon_1$. The forward step is $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \varepsilon_2$, with $\varepsilon_1, \varepsilon_2 \sim \mathcal{N}(0, I)$ independent. Substituting: $x_t = \sqrt{\alpha_t\bar{\alpha}_{t-1}}\, x_0 + \sqrt{\alpha_t(1-\bar{\alpha}_{t-1})}\, \varepsilon_1 + \sqrt{1-\alpha_t}\, \varepsilon_2$. The noise terms combine (sum of independent Gaussians) with variance $\alpha_t(1-\bar{\alpha}_{t-1}) + (1-\alpha_t) = 1 - \alpha_t\bar{\alpha}_{t-1} = 1 - \bar{\alpha}_t$. So $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t)I)$. QED.
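
The induction can also be checked numerically. The sketch below, assuming the linear schedule from this lesson, pushes the mean coefficient and the noise variance through the one-step recursion and compares them with the closed form at every timestep. It tracks the distribution's parameters rather than drawing samples, so the comparison is exact up to floating-point error.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Track x_t = mean_coeff * x_0 + N(0, var * I) through the one-step recursion
# x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps_t
mean_coeff, var = 1.0, 0.0
mean_coeffs, variances = [], []
for alpha_t in alphas:
    mean_coeff = np.sqrt(alpha_t) * mean_coeff
    var = alpha_t * var + (1.0 - alpha_t)
    mean_coeffs.append(mean_coeff)
    variances.append(var)

mean_coeffs = np.array(mean_coeffs)
variances = np.array(variances)

# Largest deviation from the closed form at any timestep
max_mean_err = np.abs(mean_coeffs - np.sqrt(alpha_bars)).max()
max_var_err = np.abs(variances - (1.0 - alpha_bars)).max()
print(max_mean_err, max_var_err)
```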

Q2: Why does Ho et al. predict $\varepsilon$ rather than $\mu_\theta$ or $x_0$?

Three reasons. Practical: at high noise levels, predicting $x_0$ means reconstructing a clean image from nearly pure noise - very high variance target, noisy gradients. Predicting $\varepsilon \sim \mathcal{N}(0,I)$ is a regression problem whose targets have the same unit scale at every timestep. Mathematical: the ELBO denoising terms reduce to weighted MSE on $\varepsilon$; Ho et al. found the simplified unweighted objective (dropping timestep-dependent weights) performs better empirically. Theoretical: predicting $\varepsilon_\theta(x_t, t)$ is equivalent to estimating the score $\nabla_{x_t} \log q_t(x_t)$, connecting DDPM to score matching. Specifically, $s_\theta(x_t, t) = -\varepsilon_\theta(x_t, t)/\sqrt{1-\bar{\alpha}_t}$, giving a probabilistic interpretation of the training objective.
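
The three parameterizations are linked by simple algebra, which the following NumPy sketch illustrates. The helper names `eps_to_x0` and `eps_to_score` are assumptions for this example, and the true noise stands in for a network prediction. The check is against the conditional score $\nabla_{x_t} \log q(x_t | x_0)$; the marginal score that the network approximates is the average of this quantity over the $x_0$ values consistent with $x_t$.

```python
import numpy as np

def eps_to_x0(x_t, eps, ab_t):
    """Clean-image estimate implied by a noise prediction."""
    return (x_t - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)

def eps_to_score(eps, ab_t):
    """Score estimate implied by a noise prediction: -eps / sqrt(1 - alpha_bar_t)."""
    return -eps / np.sqrt(1.0 - ab_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 3))
eps = rng.standard_normal(x0.shape)
ab_t = 0.4
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * eps

# Gradient of log N(x_t; sqrt(ab_t) * x0, (1 - ab_t) I) with respect to x_t
conditional_score = -(x_t - np.sqrt(ab_t) * x0) / (1.0 - ab_t)

x0_from_eps = eps_to_x0(x_t, eps, ab_t)
score_from_eps = eps_to_score(eps, ab_t)

print(np.allclose(x0_from_eps, x0), np.allclose(score_from_eps, conditional_score))
```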

Q3: What is the DDPM ELBO and what does each term mean?

The DDPM ELBO is:

$$\mathcal{L} = \underbrace{\mathbb{E}_q[\log p_\theta(x_0|x_1)]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}(q(x_T|x_0) \| p(x_T))}_{\approx 0\text{ by design}} - \sum_{t=2}^{T} \underbrace{D_{\text{KL}}(q(x_{t-1}|x_t,x_0) \| p_\theta(x_{t-1}|x_t))}_{\text{denoising terms} \propto \|\varepsilon - \varepsilon_\theta\|^2}$$

The reconstruction term measures how well the final denoising step recovers $x_0$. The prior matching term measures closeness of $x_T$ to $\mathcal{N}(0,I)$ - it is near zero when $\bar{\alpha}_T \approx 0$ and, with a fixed noise schedule, contains no learnable parameters, so it is ignored. The denoising KL terms are the main training signal: since both $q(x_{t-1}|x_t,x_0)$ and $p_\theta(x_{t-1}|x_t)$ are Gaussian, each KL has a closed form proportional to $\|\varepsilon - \varepsilon_\theta\|^2$. Dropping timestep-dependent weighting coefficients gives $\mathcal{L}_{\text{simple}}$.
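
For reference, the proportionality in the denoising terms can be written out. Assuming the fixed reverse variance $\sigma_t^2 I$ used in Ho et al. 2020, each KL reduces to a scaled squared distance between the two means plus a constant $C$ that does not depend on $\theta$. Substituting the $\varepsilon$-parameterization of both means, with $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$, gives the per-timestep weight:

```latex
\begin{aligned}
D_{\text{KL}}\big(q(x_{t-1}|x_t,x_0) \,\|\, p_\theta(x_{t-1}|x_t)\big)
  &= \frac{1}{2\sigma_t^2}\,\big\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2 + C \\
  &= \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\,\big\|\varepsilon - \varepsilon_\theta(x_t, t)\big\|^2 + C
\end{aligned}
```

Here $\tilde{\mu}_t$ denotes the mean of the forward posterior $q(x_{t-1}|x_t,x_0)$. Replacing the leading coefficient with 1 for every $t$ is the step that turns the weighted bound into the simplified objective.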

Q4: What is the difference between linear and cosine noise schedules, and when does it matter?

The linear schedule increases $\beta_t$ linearly from $10^{-4}$ to $0.02$ over $T$ steps, designed for 32x32 images. For higher resolutions, it destroys high-frequency detail too quickly - many training steps fall in a regime where the image is nearly pure noise, providing little useful signal. The cosine schedule defines $\bar{\alpha}_t$ as a cosine curve, maintaining higher SNR for a larger fraction of timesteps and decreasing more sharply near $t=T$. With $T=1000$, the cosine schedule keeps $\bar{\alpha}_{200}$ roughly 1.4x higher than linear (about 0.90 versus 0.66), giving more training steps in the "interesting" noise regime. For 128x128 and above, cosine consistently improves FID. For 32x32, the difference is small (less than 0.3 FID on CIFAR-10).

Q5: How does DDPM relate to score matching?

They are different training formulations that produce equivalent models. Score matching (Song and Ermon 2019) trains $s_\theta(x, \sigma)$ to estimate $\nabla_x \log p_\sigma(x)$ - the score of the noisy data distribution. DDPM trains $\varepsilon_\theta(x_t, t)$ to predict added noise. The connection: $s_\theta(x_t, t) = -\varepsilon_\theta(x_t, t)/\sqrt{1-\bar{\alpha}_t}$. So the DDPM noise prediction network is proportional to the negative score. Song et al. (2021) formalized this in the SDE framework, showing DDPM (variance-preserving SDE) and NCSN (variance-exploding SDE) are special cases of a general continuous-time diffusion SDE with a corresponding probability flow ODE. The ODE view explains why deterministic samplers such as DDIM work and underlies fast ODE solvers such as DPM-Solver, which exploit that structure for accelerated sampling.

Q6: What FID score would you expect a well-trained DDPM to achieve on CIFAR-10, and what would indicate a bug?

A well-configured DDPM (cosine schedule, EMA weights, correct normalization) should achieve FID around 2.9-3.5 on CIFAR-10 after sufficient training. The 2020 Ho et al. result was 3.17 with 1000 steps; Improved DDPM achieved 2.90. FID above 5.0 suggests a problem - check: (1) normalization range (images should be in $[-1, 1]$), (2) whether EMA weights are being used for sampling, (3) schedule choice and whether $T$ matches the schedule, (4) whether the correct noise is used (random fresh $\varepsilon$ at each training step, not the same across epochs). FID above 15-20 indicates a fundamental bug - likely the closed-form forward process is incorrectly implemented or inputs are not normalized.


This lesson is part of the Diffusion Models module. Next: Score-Based Models and SDEs - The Continuous-Time View.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Diffusion Process (DDPM) demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.