Module 15 - Diffusion Models
The Most Important Generative Modeling Breakthrough in a Decade
In 2020, Ho et al. published DDPM. In 2021, DALL-E made text-to-image generation mainstream. In 2022, Stable Diffusion went open-source and Midjourney launched. By 2023, diffusion models were reportedly generating over 15 million images per day on Midjourney alone, and were also being applied to drug discovery pipelines, high-fidelity audio synthesis, and the design of novel protein structures.
Diffusion models did not win by being incremental improvements. They won by resolving a fundamental tension in generative modeling: the trade-off between sample quality, diversity, training stability, and likelihood. GANs produced sharp images but suffered from mode collapse and training instability. VAEs were stable but produced blurry samples. Normalizing flows gave exact likelihoods but were architecturally constrained. Diffusion models delivered on all four objectives at once, and did it with a training objective so simple it fits in one line: predict the noise.
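For reference, the "predict the noise" objective is usually written as the simplified DDPM loss below. The notation is an assumption, since this overview does not fix symbols: $x_0$ is a clean sample, $\epsilon$ is the Gaussian noise that was added, $\bar{\alpha}_t$ is the cumulative signal-retention factor of the noise schedule at step $t$, and $\epsilon_\theta$ is the denoising network.

$$
L_{\text{simple}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right) \right\rVert^2\right]
$$

The argument of $\epsilon_\theta$ is simply the noised sample $x_t$, so the network sees a corrupted input and a timestep and is asked to recover the noise that corrupted it.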
The core idea is borrowed from non-equilibrium thermodynamics. You take a clean data sample and gradually add Gaussian noise until it becomes pure random noise. Then you train a neural network to reverse this process - to denoise step by step. At generation time, you start from pure noise and iteratively denoise to produce a clean sample. The neural network learns the score function of the noise-perturbed data distribution (up to a known scaling), and sampling becomes numerical integration of a reverse-time stochastic differential equation.
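The noising half of this idea can be sketched in a few lines. The example below is a minimal, dependency-free illustration, not a reference implementation: the function names (`linear_beta_schedule`, `cumulative_alphas`, `add_noise`) are hypothetical, plain Python lists stand in for image tensors, and the schedule endpoints (`1e-4` to `0.02` over 1000 steps) are the linear-schedule values reported in the DDPM paper. It relies on the closed-form property that $x_t$ can be sampled directly from $x_0$ without simulating every intermediate step.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced per-step noise variances beta_1 .. beta_T."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in closed form.

    Returns the noised sample and the noise that was added; the noise
    is the regression target for a noise-prediction network.
    """
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


rng = random.Random(0)                     # fixed seed for repeatability
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

x0 = [1.0, -0.5, 0.25]                     # toy "clean sample"
x_early, _ = add_noise(x0, 10, alpha_bars, rng)    # still close to x0
x_late, _ = add_noise(x0, 999, alpha_bars, rng)    # almost pure noise
```

At the final step `alpha_bars[-1]` is close to zero, so almost none of the original signal survives, which is what justifies starting generation from pure Gaussian noise.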
This module covers the complete diffusion model stack - from the mathematical foundations of DDPM and score matching, through the DDIM sampling speedup, latent diffusion that enables high-resolution generation on consumer hardware, classifier-free guidance for conditional control, fine-tuning techniques like DreamBooth and LoRA, evaluation metrics for generative models, and applications beyond images to audio, molecular design, and time series.
Module Map
Lessons in This Module
| # | Lesson | Key Concepts |
|---|---|---|
| 01 | Generative Models Overview | VAEs, GANs, normalizing flows, diffusion comparison, FID scores, why diffusion won |
| 02 | DDPM - The Mathematical Foundation | Forward process, ELBO derivation, noise prediction objective, noise schedules, sampling algorithm |
| 03 | Score-Based Generative Models | Score function, Langevin dynamics, denoising score matching, SDE framework, VP/VE SDE |
| 04 | DDIM and Accelerated Sampling | Non-Markovian reverse process, deterministic ODE, DDIM inversion, DPM-Solver, consistency models |
| 05 | Latent Diffusion Models | VAE latent compression, cross-attention conditioning, Stable Diffusion architecture, text-to-image |
| 06 | Classifier-Free Guidance | Joint conditional/unconditional training, CFG scale, guidance trade-offs, CLIP guidance |
| 07 | Fine-Tuning Diffusion Models | DreamBooth, Textual Inversion, LoRA, ControlNet, parameter-efficient fine-tuning |
| 08 | Diffusion for Non-Image Domains | Audio waveforms, molecular design, protein structure, time series, video diffusion |
| 09 | Evaluation of Generative Models | FID, IS, CLIP score, precision/recall, human evaluation, calibration |
Key Concepts Glossary
Forward process $q(x_t \mid x_{t-1})$: The Markov chain that gradually adds Gaussian noise to data over $T$ timesteps until $x_T$ is approximately distributed as $\mathcal{N}(0, I)$.
Reverse process $p_\theta(x_{t-1} \mid x_t)$: The learned denoising process - a neural network (typically a U-Net) that predicts how to remove noise at each step.
Noise schedule $\beta_t$: Controls how much noise is added at each step. The linear schedule (original DDPM) and the cosine schedule (Improved DDPM) are the two main choices.
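Score function $\nabla_x \log p(x)$: The gradient of the log-density with respect to the data. The denoising network's noise prediction is proportional to the negative score: $\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar{\alpha}_t}\, \nabla_{x_t} \log p_t(x_t)$.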
ELBO: Evidence Lower BOund - the training objective for DDPM. In practice simplified to just predicting the added noise $\epsilon$.
DDIM inversion: Encode a real image $x_0$ into a noise latent $x_T$ by running the deterministic DDIM update in the noising direction, from $t = 0$ up to $t = T$. Enables image editing in diffusion latent space.
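Classifier-Free Guidance (CFG): At inference time, combine conditional and unconditional predictions: $\tilde{\epsilon} = \epsilon_\theta(x_t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)$. Higher $w$ = stronger conditioning = less diversity.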
U-Net backbone: The standard architecture for the denoising network. Encoder-decoder with skip connections and timestep embedding via sinusoidal positional encoding injected into every residual block.
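As a concrete illustration of the Classifier-Free Guidance entry above, the sketch below combines two noise predictions elementwise. It assumes the common convention in which a guidance scale of `1` recovers the purely conditional prediction and `0` recovers the unconditional one; some papers parameterize the scale differently. The function name `cfg_combine` is hypothetical, and short Python lists stand in for the network's output tensors.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Return eps_uncond + w * (eps_cond - eps_uncond), elementwise.

    eps_uncond: noise predicted with the condition dropped (null prompt)
    eps_cond:   noise predicted with the condition (e.g. a text prompt)
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions standing in for two forward passes of the same network.
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

unguided = cfg_combine(eps_uncond, eps_cond, 0.0)   # equals eps_uncond
plain = cfg_combine(eps_uncond, eps_cond, 1.0)      # equals eps_cond
strong = cfg_combine(eps_uncond, eps_cond, 7.5)     # extrapolates past eps_cond
```

Scales above `1` push the prediction further along the direction that separates the conditional from the unconditional output, which is why larger values follow the prompt more closely at the cost of diversity. In practice this costs two network evaluations per sampling step, often batched together.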
The Big Picture
Diffusion models occupy a unique position in the generative modeling landscape. They are not the fastest (GANs generate in one forward pass, while diffusion sampling takes many network evaluations). They are not the most parameter-efficient. But they are among the most controllable and the most stable to train, and they produce some of the highest-quality, most diverse outputs of any generative model class to date.
Understanding them deeply - from the mathematics of score matching to the engineering of Stable Diffusion - is essential for any ML engineer working in generative AI, multimodal systems, drug discovery, or creative AI applications. This module gives you that depth.
