
Module 15 - Diffusion Models

The Most Important Generative Modeling Breakthrough in a Decade

In 2020, Ho et al. published DDPM. In 2021, DALL-E, an autoregressive transformer rather than a diffusion model, made text-to-image generation mainstream. In 2022, the diffusion-based DALL-E 2 followed, Stable Diffusion went open-source, and Midjourney launched. By 2023, diffusion models were generating over 15 million images per day on Midjourney alone, powering drug discovery pipelines, producing film-quality audio synthesis, and designing protein structures that no human had ever conceived.

Diffusion models did not win by being incremental improvements. They won by resolving the fundamental tension in generative modeling: the trade-off between sample quality, diversity, training stability, and likelihood. GANs produced sharp images but suffered from mode collapse and training instability. VAEs were stable but produced blurry samples. Normalizing flows gave exact likelihoods but were architecturally constrained. Diffusion models delivered on all four objectives at once - and did it with a training objective so simple it fits in one line: predict the noise.

The core idea is borrowed from non-equilibrium thermodynamics. You take a clean data sample and gradually add Gaussian noise until it becomes pure random noise. Then you train a neural network to reverse this process - to denoise step by step. At generation time, you start from pure noise and iteratively denoise to produce a clean sample. In effect, the neural network learns the score function (the gradient of the log-density) of the noised data distribution at each noise level, and sampling can be viewed as numerical integration of a stochastic differential equation.
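The noising half of this idea, and the "predict the noise" target, can be sketched in a few lines. This is a minimal illustration rather than a training loop: it assumes the standard DDPM closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$, takes $\bar{\alpha}_t$ as a plain number instead of deriving it from a schedule, and uses a placeholder in place of a real network. The names `noise_sample` and `noise_prediction_loss` are hypothetical.

```python
import math
import random

def noise_sample(x0, abar_t, rng):
    """Draw x_t from q(x_t | x_0) in closed form; returns (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal, noise = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    xt = [signal * x + noise * e for x, e in zip(x0, eps)]
    return xt, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]  # a toy 3-dimensional "data sample"

# abar_t close to 1: almost no noise has been added yet.
xt_early, _ = noise_sample(x0, abar_t=0.9999, rng=rng)
# abar_t close to 0: the sample is almost pure Gaussian noise.
xt_late, eps_late = noise_sample(x0, abar_t=1e-4, rng=rng)

# Placeholder "network" that always predicts zero noise (assumption for illustration).
zero_guess_loss = noise_prediction_loss([0.0] * len(x0), eps_late)
# A perfect predictor would reach zero loss.
perfect_loss = noise_prediction_loss(eps_late, eps_late)
```

In a real model, the placeholder prediction would be replaced by a network call that takes $x_t$ and $t$, and the loss would be averaged over random timesteps and data samples.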

This module covers the complete diffusion model stack: the mathematical foundations of DDPM and score matching, the DDIM sampling speedup, latent diffusion that enables high-resolution generation on consumer hardware, classifier-free guidance for conditional control, fine-tuning techniques like DreamBooth and LoRA, applications beyond images to audio, molecular design, and time series, and evaluation metrics for generative models.


Module Map


Lessons in This Module

| # | Lesson | Key Concepts |
|---|--------|--------------|
| 01 | Generative Models Overview | VAEs, GANs, normalizing flows, diffusion comparison, FID scores, why diffusion won |
| 02 | DDPM - The Mathematical Foundation | Forward process, ELBO derivation, noise prediction objective, noise schedules, sampling algorithm |
| 03 | Score-Based Generative Models | Score function, Langevin dynamics, denoising score matching, SDE framework, VP/VE SDE |
| 04 | DDIM and Accelerated Sampling | Non-Markovian reverse process, deterministic ODE, DDIM inversion, DPM-Solver, consistency models |
| 05 | Latent Diffusion Models | VAE latent compression, cross-attention conditioning, Stable Diffusion architecture, text-to-image |
| 06 | Classifier-Free Guidance | Joint conditional/unconditional training, CFG scale, guidance trade-offs, CLIP guidance |
| 07 | Fine-Tuning Diffusion Models | DreamBooth, Textual Inversion, LoRA, ControlNet, parameter-efficient fine-tuning |
| 08 | Diffusion for Non-Image Domains | Audio waveforms, molecular design, protein structure, time series, video diffusion |
| 09 | Evaluation of Generative Models | FID, IS, CLIP score, precision/recall, human evaluation, calibration |

Key Concepts Glossary

Forward process $q(x_t \mid x_{t-1})$: The Markov chain that gradually adds Gaussian noise to data over $T$ timesteps until $x_T \approx \mathcal{N}(0, I)$.

Reverse process $p_\theta(x_{t-1} \mid x_t)$: The learned denoising process - a neural network (typically a U-Net) that predicts how to remove noise at each step.

Noise schedule $\{\beta_t\}_{t=1}^T$: Controls how much noise is added at each step. The linear schedule (original DDPM) and the cosine schedule (Improved DDPM) are the two most common choices.
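The two schedules can be compared numerically with a short sketch. The constants are assumptions taken from the usual published settings rather than from this page: the linear schedule runs from $10^{-4}$ to $0.02$ over $T = 1000$ steps, and the cosine schedule uses offset $s = 0.008$ with each $\beta_t$ clipped at $0.999$. The function names are hypothetical.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (assumed DDPM defaults)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def alpha_bars_from_betas(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

def betas_from_alpha_bars(abar, max_beta=0.999):
    """Recover per-step betas from alpha_bar, clipping to avoid beta = 1."""
    betas, prev = [], 1.0
    for a in abar:
        betas.append(min(1.0 - a / prev, max_beta))
        prev = a
    return betas

T = 1000
lin_abar = alpha_bars_from_betas(linear_betas(T))
cos_abar = cosine_alpha_bars(T)
cos_betas = betas_from_alpha_bars(cos_abar)

# Fraction of the original signal variance left halfway through the chain.
mid = T // 2 - 1
print(f"alpha_bar at t = T/2: linear {lin_abar[mid]:.3f}, cosine {cos_abar[mid]:.3f}")
```

Under these settings the cosine schedule keeps noticeably more signal at the midpoint than the linear one, which is the usual motivation given for it.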

Score function $\nabla_x \log p(x)$: The gradient of the log-density. The denoising network $\varepsilon_\theta$ is proportional to the negative score of the noised distribution: $s_\theta(x_t, t) = -\varepsilon_\theta(x_t, t) / \sqrt{1 - \bar{\alpha}_t}$, where $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$.
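This relation can be checked by hand in a case where the score is known analytically. The sketch below assumes one-dimensional data drawn from $\mathcal{N}(0, \sigma_0^2)$, so the noised marginal is Gaussian with variance $\bar{\alpha}_t \sigma_0^2 + (1 - \bar{\alpha}_t)$ and the optimal noise predictor has a closed form. All function names are hypothetical, and the closed-form predictor stands in for a trained network.

```python
import math

def score_from_eps(eps_pred, abar_t):
    """Convert a noise prediction into a score estimate: s = -eps / sqrt(1 - abar_t)."""
    return -eps_pred / math.sqrt(1.0 - abar_t)

def optimal_eps_gaussian(xt, abar_t, sigma0):
    """E[eps | x_t] when x_0 ~ N(0, sigma0^2); a stand-in for a trained network."""
    var_t = abar_t * sigma0 ** 2 + (1.0 - abar_t)
    return math.sqrt(1.0 - abar_t) * xt / var_t

def true_score_gaussian(xt, abar_t, sigma0):
    """Exact score of the noised marginal N(0, var_t): d/dx log p_t(x) = -x / var_t."""
    var_t = abar_t * sigma0 ** 2 + (1.0 - abar_t)
    return -xt / var_t

xt, abar_t, sigma0 = 0.7, 0.4, 2.0
estimated = score_from_eps(optimal_eps_gaussian(xt, abar_t, sigma0), abar_t)
exact = true_score_gaussian(xt, abar_t, sigma0)
```

Here `estimated` and `exact` agree up to floating-point error, because the predictor is exactly optimal. A trained network would only approximate this.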

ELBO: Evidence Lower BOund - the training objective for DDPM. In practice simplified to just predicting the added noise $\varepsilon$.

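DDIM inversion: Encode a real image into $x_T$ by running the deterministic DDIM update in the noising direction, from $x_0$ toward $x_T$, using the trained noise predictor at each step. Enables image editing in diffusion latent space.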
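A toy round trip shows the mechanics. This is a sketch under strong assumptions, not how an image pipeline is implemented: the data is one-dimensional and Gaussian, the "network" is the closed-form optimal noise predictor for that data, and `toy_alpha_bars` is a made-up schedule with $\bar{\alpha}_0 = 1$. The same deterministic DDIM update is used in both directions, and only the order of the timesteps changes.

```python
import math

def toy_alpha_bars(T):
    """Hypothetical schedule: alpha_bar_0 = 1 (clean), decaying to near 0 at t = T."""
    return [math.cos(0.5 * math.pi * t / (T + 1)) ** 2 for t in range(T + 1)]

def toy_eps_model(x, abar_t, sigma0=2.0):
    """Optimal noise predictor for data ~ N(0, sigma0^2); stands in for a trained network."""
    var_t = abar_t * sigma0 ** 2 + (1.0 - abar_t)
    return math.sqrt(1.0 - abar_t) * x / var_t

def ddim_step(x, abar_from, abar_to, eps):
    """Deterministic DDIM update (eta = 0) between two noise levels."""
    x0_pred = (x - math.sqrt(1.0 - abar_from) * eps) / math.sqrt(abar_from)
    return math.sqrt(abar_to) * x0_pred + math.sqrt(1.0 - abar_to) * eps

def invert(x0, abar):
    """DDIM inversion: walk from t = 0 up to t = T."""
    x = x0
    for t in range(len(abar) - 1):
        x = ddim_step(x, abar[t], abar[t + 1], toy_eps_model(x, abar[t]))
    return x

def sample(xT, abar):
    """DDIM sampling: walk from t = T back down to t = 0."""
    x = xT
    for t in range(len(abar) - 1, 0, -1):
        x = ddim_step(x, abar[t], abar[t - 1], toy_eps_model(x, abar[t]))
    return x

def roundtrip_error(x0, T):
    abar = toy_alpha_bars(T)
    return abs(sample(invert(x0, abar), abar) - x0)

err_coarse = roundtrip_error(1.5, T=20)
err_fine = roundtrip_error(1.5, T=200)
```

The reconstruction is approximate because each inversion step reuses the noise prediction from the current point rather than the next one. In this toy setting the error shrinks as the number of steps grows.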

Classifier-Free Guidance (CFG): At inference time, combine conditional and unconditional predictions: $\tilde{\varepsilon} = \varepsilon_{\text{uncond}} + w \cdot (\varepsilon_{\text{cond}} - \varepsilon_{\text{uncond}})$. Higher $w$ = stronger conditioning = less diversity.
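The combination rule is a single line of arithmetic per element. The sketch below applies it to plain Python lists standing in for the two noise predictions; the name `cfg_combine` and the example values are hypothetical.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]  # prediction with the condition dropped
eps_cond = [1.0, 3.0]    # prediction given the condition (e.g. a text prompt)

unguided = cfg_combine(eps_uncond, eps_cond, w=0.0)      # equals eps_uncond
conditional = cfg_combine(eps_uncond, eps_cond, w=1.0)   # equals eps_cond
extrapolated = cfg_combine(eps_uncond, eps_cond, w=2.0)  # pushed past eps_cond
```

With $w = 0$ the condition is ignored, with $w = 1$ the result is the ordinary conditional prediction, and with $w > 1$ the prediction is extrapolated away from the unconditional one.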

U-Net backbone: The standard architecture for the denoising network. Encoder-decoder with skip connections and timestep embedding via sinusoidal positional encoding injected into every residual block.
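A sinusoidal timestep embedding can be sketched as follows. The layout is an assumption (sine terms followed by cosine terms, geometrically spaced frequencies with a maximum period of 10000, even `dim`), which is one common convention among several. How the vector is then projected and injected into each residual block is not shown.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Map a scalar timestep to a dim-length vector of sines and cosines (dim assumed even)."""
    half = dim // 2
    # Frequencies decay geometrically from 1 down toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb_start = timestep_embedding(0, dim=8)   # sines are 0, cosines are 1
emb_later = timestep_embedding(10, dim=8)  # a distinct pattern for t = 10
```

Because nearby timesteps receive similar vectors and distant ones receive different vectors, the network can condition smoothly on the noise level.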


The Big Picture

Diffusion models occupy a unique position in the generative modeling landscape. They are not the fastest (GANs generate in one forward pass). They are not the most parameter-efficient. But they are the most controllable, the most stable to train, and produce the highest-quality, most-diverse outputs of any generative model class to date.

Understanding them deeply - from the mathematics of score matching to the engineering of Stable Diffusion - is essential for any ML engineer working in generative AI, multimodal systems, drug discovery, or creative AI applications. This module gives you that depth.

© 2026 EngineersOfAI. All rights reserved.