Module 15 - Diffusion Models

The Most Important Generative Modeling Breakthrough in a Decade

In 2020, Ho et al. published DDPM. In 2021, DALL-E made text-to-image generation mainstream. In 2022, Stable Diffusion was released publicly and Midjourney launched. By 2023, diffusion models were reportedly generating over 15 million images per day on Midjourney alone, powering drug discovery pipelines, producing film-quality audio, and designing protein structures that no human had ever conceived.

Diffusion models did not win by being an incremental improvement. They won by resolving the fundamental tension in generative modeling: the trade-off between sample quality, diversity, training stability, and likelihood. GANs produced sharp images but suffered from mode collapse and training instability. VAEs were stable but produced blurry samples. Normalizing flows gave exact likelihoods but required invertible architectures. Diffusion models delivered on all four objectives at once - and did it with a training objective so simple it fits in one line: predict the noise.

The core idea is borrowed from non-equilibrium thermodynamics. You take a clean data sample and gradually add Gaussian noise until it becomes pure random noise. Then you train a neural network to reverse this process - to denoise step by step. At generation time, you start from pure noise and iteratively denoise to produce a clean sample. Equivalently, the neural network learns the score function of the noised data distribution at every noise level, and sampling becomes numerical integration of a reverse-time stochastic differential equation.

This module covers the complete diffusion model stack - from the mathematical foundations of DDPM and score matching, through the DDIM sampling speedup, latent diffusion that enables high-resolution generation on consumer hardware, classifier-free guidance for conditional control, fine-tuning techniques like DreamBooth and LoRA, evaluation metrics for generative models, and applications beyond images to audio, molecular design, and time series.

Module Map

Lessons in This Module

#	Lesson	Key Concepts
01	Generative Models Overview	VAEs, GANs, normalizing flows, diffusion comparison, FID scores, why diffusion won
02	DDPM - The Mathematical Foundation	Forward process, ELBO derivation, noise prediction objective, noise schedules, sampling algorithm
03	Score-Based Generative Models	Score function, Langevin dynamics, denoising score matching, SDE framework, VP/VE SDE
04	DDIM and Accelerated Sampling	Non-Markovian reverse process, deterministic ODE, DDIM inversion, DPM-Solver, consistency models
05	Latent Diffusion Models	VAE latent compression, cross-attention conditioning, Stable Diffusion architecture, text-to-image
06	Classifier-Free Guidance	Joint conditional/unconditional training, CFG scale, guidance trade-offs, CLIP guidance
07	Fine-Tuning Diffusion Models	DreamBooth, Textual Inversion, LoRA, ControlNet, parameter-efficient fine-tuning
08	Diffusion for Non-Image Domains	Audio waveforms, molecular design, protein structure, time series, video diffusion
09	Evaluation of Generative Models	FID, IS, CLIP score, precision/recall, human evaluation, calibration

Key Concepts Glossary

Forward process $q(x_t | x_{t-1})$ : The Markov chain that gradually adds Gaussian noise to data over $T$ timesteps until $x_T \approx \mathcal{N}(0, I)$ .
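
Because every step is Gaussian, $x_t$ can be drawn from $x_0$ in one shot rather than by simulating the chain: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$, with $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ and $\varepsilon \sim \mathcal{N}(0, I)$. The following is a minimal standard-library sketch of that identity, not a reference implementation. The helper names `make_alpha_bar` and `q_sample` are hypothetical, the data is a plain list of floats, and the linear schedule values ($\beta$ from $10^{-4}$ to $0.02$, $T = 1000$) are assumed from the original DDPM setup.

```python
import math
import random

def make_alpha_bar(betas):
    """Cumulative product alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) directly; t is a 0-based index into alpha_bar."""
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    return x_t, eps

# Assumed linear schedule (values taken from the original DDPM setup).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = make_alpha_bar(betas)

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]
x_early, _ = q_sample(x0, 0, alpha_bar, rng)      # still very close to x0
x_late, _ = q_sample(x0, T - 1, alpha_bar, rng)   # almost pure Gaussian noise
print(alpha_bar[0], alpha_bar[-1])
```

With this schedule $\bar{\alpha}_T$ is tiny, which is what makes $x_T$ approximately standard normal regardless of $x_0$.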

Reverse process $p_\theta(x_{t-1} | x_t)$ : The learned denoising process - a neural network (typically a U-Net) that predicts how to remove noise at each step.

Noise schedule $\{\beta_t\}_{t=1}^T$ : Controls how much noise is added at each step. The linear schedule (original DDPM) and the cosine schedule (Improved DDPM) are the two main choices.
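
To make the difference between the two schedules concrete, here is a small sketch that builds both and compares how much signal $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$ survives. The function names are hypothetical. The linear endpoints ($10^{-4}$, $0.02$) and the cosine form with offset $s = 0.008$ and a clip at $0.999$ are assumed from the DDPM and Improved DDPM setups respectively.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced betas between beta_start and beta_end."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Betas implied by alpha_bar(t) proportional to cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}; the f(0) normalizer cancels in the ratio.
    return [min(1.0 - f(t + 1) / f(t), max_beta) for t in range(T)]

def cumulative_alpha_bar(betas):
    """Running product of (1 - beta)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

T = 1000
lin_ab = cumulative_alpha_bar(linear_betas(T))
cos_ab = cumulative_alpha_bar(cosine_betas(T))
for t in (0, T // 2, T - 1):
    print(t, lin_ab[t], cos_ab[t])
```

At the midpoint the cosine schedule keeps noticeably more signal than the linear one, which is the motivation usually given for it: the linear schedule destroys information early, leaving many late steps with almost nothing to learn from.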

Score function $\nabla_x \log p(x)$ : The gradient of the log-density. For the noised distribution at step $t$, the denoising network $\varepsilon_\theta$ is proportional to the negative score: $s_\theta(x_t, t) = -\varepsilon_\theta(x_t, t) / \sqrt{1 - \bar{\alpha}_t}$ , where $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ .
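
The proportionality can be checked in two lines, assuming the usual closed form of the forward process with $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ and $\varepsilon \sim \mathcal{N}(0, I)$:

$$
q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right)
\quad\Longleftrightarrow\quad
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon
$$

$$
\nabla_{x_t} \log q(x_t \mid x_0)
= -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t}
= -\frac{\varepsilon}{\sqrt{1 - \bar{\alpha}_t}}
$$

This is the score conditioned on $x_0$. The denoising score matching argument covered in Lesson 03 is what carries it over to the marginal score that the network actually approximates.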

ELBO: Evidence Lower BOund - a lower bound on the data log-likelihood, and the quantity DDPM training maximizes. In practice it is simplified to a mean-squared error on the added noise $\varepsilon$ .
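
The simplified objective is $L_{\text{simple}} = \mathbb{E}_{t, x_0, \varepsilon} \left[ \lVert \varepsilon - \varepsilon_\theta(x_t, t) \rVert^2 \right]$ with $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$. The sketch below computes one Monte Carlo sample of it in plain Python. Everything here is illustrative: `simple_loss` and `zero_model` are hypothetical names, the "model" is any callable taking `(x_t, t)`, the loss is averaged over dimensions, and the linear schedule is an assumption.

```python
import math
import random

def simple_loss(eps_model, x0, alpha_bar, rng):
    """One Monte Carlo sample of the noise-prediction loss (mean over dimensions)."""
    t = rng.randrange(len(alpha_bar))                      # uniform random timestep
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]                # the noise to be predicted
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(x0)

# Assumed linear beta schedule, accumulated into alpha_bar.
T = 1000
alpha_bar, running = [], 1.0
for i in range(T):
    running *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))
    alpha_bar.append(running)

def zero_model(x_t, t):
    """Untrained baseline that always predicts zero noise."""
    return [0.0 for _ in x_t]

rng = random.Random(0)
x0 = [0.3, -1.2, 0.8, 0.0]
avg = sum(simple_loss(zero_model, x0, alpha_bar, rng) for _ in range(2000)) / 2000
print(avg)
```

A predictor that always outputs zero should average close to 1, the variance of the noise itself, so any useful network has to beat that baseline.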

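DDIM inversion: Encode a real image into a latent $x_T$ by running the deterministic DDIM update in the noising direction, from $t = 0$ up to $t = T$, using the network's noise prediction at each step. Sampling from that $x_T$ approximately reconstructs the image, which enables image editing in diffusion latent space.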
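
A one-dimensional sketch of the idea follows. It uses the deterministic DDIM update: estimate $x_0$ from the current state and the predicted noise, then re-noise to the target level with that same noise. Applied toward higher noise levels it inverts, and applied toward lower ones it samples. This is a toy under stated assumptions: the function names are hypothetical, the noise predictor takes $\bar{\alpha}_t$ directly instead of a timestep, and in place of a trained network it uses the analytically exact predictor for Gaussian data $\mathcal{N}(\text{MEAN}, \text{STD}^2)$.

```python
import math

def ddim_step(x, eps, ab_from, ab_to):
    """Deterministic DDIM update (eta = 0) from noise level ab_from to ab_to."""
    x0_pred = (x - math.sqrt(1.0 - ab_from) * eps) / math.sqrt(ab_from)
    return math.sqrt(ab_to) * x0_pred + math.sqrt(1.0 - ab_to) * eps

def ddim_invert(x0, eps_model, alpha_bar):
    """Walk from clean data (index 0) up to the noisiest level (last index)."""
    x = x0
    for t in range(len(alpha_bar) - 1):
        x = ddim_step(x, eps_model(x, alpha_bar[t]), alpha_bar[t], alpha_bar[t + 1])
    return x

def ddim_sample(xT, eps_model, alpha_bar):
    """Walk from the noisiest level back down to clean data (index 0)."""
    x = xT
    for t in range(len(alpha_bar) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, alpha_bar[t]), alpha_bar[t], alpha_bar[t - 1])
    return x

# Toy stand-in for a trained network (assumption): the exact noise predictor
# when the data distribution is the 1-D Gaussian N(MEAN, STD^2).
MEAN, STD = 2.0, 0.5

def toy_eps_model(x, ab):
    var_t = ab * STD ** 2 + (1.0 - ab)   # variance of the noised marginal
    return math.sqrt(1.0 - ab) * (x - math.sqrt(ab) * MEAN) / var_t

# alpha_bar[0] = 1 is the clean level; the rest follow an assumed linear schedule.
T = 1000
alpha_bar = [1.0]
for i in range(T):
    beta = 1e-4 + (0.02 - 1e-4) * i / (T - 1)
    alpha_bar.append(alpha_bar[-1] * (1.0 - beta))

x0 = 2.5
xT = ddim_invert(x0, toy_eps_model, alpha_bar)
x0_rec = ddim_sample(xT, toy_eps_model, alpha_bar)
print(xT, x0_rec)
```

In this toy the inverted latent should land near the z-score $(x_0 - \text{MEAN}) / \text{STD} = 1$, and sampling from it should come back close to $x_0$. The round trip is approximate rather than exact, because each inversion step reuses the noise prediction made at the current state instead of the next one.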

Classifier-Free Guidance (CFG): At inference time, combine conditional and unconditional predictions: $\tilde{\varepsilon} = \varepsilon_{\text{uncond}} + w \cdot (\varepsilon_{\text{cond}} - \varepsilon_{\text{uncond}})$ . Setting $w = 1$ recovers the plain conditional prediction; a higher $w$ gives stronger adherence to the condition at the cost of diversity.
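
The combination rule is a single line of arithmetic. The sketch below applies it element by element to two small lists standing in for the network's noise predictions. The function name `cfg_combine` and the numbers are made up for illustration; a real pipeline would do the same thing on tensors.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: start at the unconditional prediction and move
    w times the (conditional - unconditional) difference, element by element."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.20, 0.00]   # illustrative values only
eps_cond = [0.30, -0.10, 0.40]

# w = 0: unconditional, w = 1: conditional, w > 1: extrapolates past conditional.
for w in (0.0, 1.0, 7.5):
    print(w, cfg_combine(eps_uncond, eps_cond, w))
```

Values of $w$ above 1 push the prediction beyond the conditional one, along the direction in which conditioning changed the output. That extrapolation is what trades diversity for prompt adherence.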

U-Net backbone: The standard architecture for the denoising network. It is an encoder-decoder with skip connections; the timestep is encoded with a sinusoidal positional embedding and injected into every residual block.
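
As a rough illustration of the timestep embedding, here is a standard-library sketch in the style of Transformer positional encodings. The function name, the sine-then-cosine layout, the even `dim`, and `max_period = 10000` are assumptions; implementations differ in these details, and in a real U-Net the vector would typically pass through a small MLP before reaching the residual blocks.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep (dim assumed even):
    first half sines, second half cosines, geometrically spaced frequencies."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb_0 = timestep_embedding(0, 8)       # all sines are 0, all cosines are 1
emb_500 = timestep_embedding(500, 8)
print(emb_0)
print(emb_500)
```

The point of the multi-frequency encoding is that nearby timesteps get similar vectors while distant ones remain distinguishable, which gives the network a smooth signal for "how noisy is this input".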

The Big Picture

Diffusion models occupy a unique position in the generative modeling landscape. They are not the fastest (GANs generate in one forward pass). They are not the most parameter-efficient. But they are highly controllable, stable to train, and produce some of the highest-quality and most diverse outputs of any generative model class to date.

Understanding them deeply - from the mathematics of score matching to the engineering of Stable Diffusion - is essential for any ML engineer working in generative AI, multimodal systems, drug discovery, or creative AI applications. This module gives you that depth.
