Diffusion Model Papers - Generating Images by Learning to Denoise
Reading time: ~35 min | Interview relevance: High | Roles: MLE, Research Engineer, AI Engineer (Generative AI)
The Real Interview Moment
You're interviewing for a generative AI role at a company building image generation products. The interviewer puts a blank whiteboard in front of you and says: "Explain to me, from first principles, how diffusion models generate images. Start with the forward process, derive the training objective, explain how sampling works, and then tell me how Stable Diffusion makes this practical for high-resolution images. I want math, not hand-waving."
You've used Stable Diffusion to generate images. You know it "adds noise and then removes it." But can you write the noise schedule? Can you derive the ELBO? Do you understand why DDIM is faster than DDPM? Can you explain classifier-free guidance and why a guidance scale of 7.5 is typical? The interviewer is testing whether you understand diffusion deeply enough to debug training, improve sampling, or adapt the architecture.
This is the diffusion interview. If you're working on any form of generative AI - images, video, audio, or 3D - diffusion models are foundational knowledge.
What You Will Master
After reading this page, you will be able to:
- Explain the forward and reverse diffusion processes with full mathematical derivation
- Derive the simplified training objective (predict the noise)
- Explain noise schedules and their impact on generation quality
- Compare DDPM and DDIM sampling and explain the speed-quality trade-off
- Explain latent diffusion and why it enables high-resolution generation
- Derive classifier-free guidance and explain the guidance scale
- Describe the Stable Diffusion architecture end-to-end
- Answer every common interview question about diffusion models
Part 1 - The Core Idea
Intuition
Diffusion models work by:
- Forward process: Gradually add Gaussian noise to data until it becomes pure noise
- Reverse process: Learn to gradually remove noise, transforming noise back into data
If you can learn to reverse one small noise step, you can chain many reverse steps together to generate data from pure noise.
Why Not Just Use GANs or VAEs?
| Property | GANs | VAEs | Diffusion Models |
|---|---|---|---|
| Training stability | Unstable (mode collapse, training failure) | Stable | Stable |
| Sample quality | High (but mode dropping) | Blurry | Highest |
| Mode coverage | Poor (mode collapse) | Good | Excellent |
| Likelihood | Not available | Tractable (ELBO) | Tractable (ELBO) |
| Sampling speed | Fast (single forward pass) | Fast (single forward pass) | Slow (many steps) |
| Controllability | Difficult | Moderate | Excellent (guidance) |
"Diffusion models learn to reverse a gradual noising process. You train a neural network to predict the noise added at each step, then at generation time, start from pure Gaussian noise and iteratively denoise. They beat GANs because training is stable with no mode collapse, and they beat VAEs because they produce much sharper samples. The trade-off is sampling speed - you need hundreds of denoising steps, though DDIM and distillation reduce this significantly."
Part 2 - The Forward Process
Adding Noise Gradually
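The forward process adds Gaussian noise over $T$ timesteps (typically $T = 1000$):
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$
where $\beta_t$ is the noise schedule - a small variance at each step.
Key property: You can sample $x_t$ directly from $x_0$ without iterating through all previous steps:
$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right)$$
where:
- $\alpha_t = 1 - \beta_t$
- $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ (cumulative product)
This means:
$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
A common mistake is to say "we add noise T times during training." This is wrong. During training, we sample a random timestep $t$ and directly compute $x_t$ from $x_0$ using the closed-form formula above. The iterative process only happens during sampling (generation).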
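The closed form can be checked numerically. The sketch below assumes the linear schedule of the original DDPM setup (`beta_start=1e-4`, `beta_end=0.02`, `T=1000`); the helper names are illustrative and not taken from any library. It applies the single-step rule $t$ times while tracking the coefficient on $x_0$ and the accumulated noise variance, then compares both with $\sqrt{\bar\alpha_t}$ and $1 - \bar\alpha_t$.

```python
import math

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (stored 0-indexed)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def iterate_forward(betas, t):
    """Apply q(x_s | x_{s-1}) for s = 1..t.

    Returns (coefficient on x_0, total noise variance) of x_t.
    """
    mean_coef, noise_var = 1.0, 0.0
    for beta in betas[:t]:
        alpha = 1.0 - beta
        mean_coef *= math.sqrt(alpha)         # signal shrinks by sqrt(alpha_s)
        noise_var = alpha * noise_var + beta  # old noise shrinks, new noise is added
    return mean_coef, noise_var

betas = linear_betas()
abar = alpha_bars(betas)
for t in (1, 250, 500, 1000):
    mean_coef, noise_var = iterate_forward(betas, t)
    closed_mean, closed_var = math.sqrt(abar[t - 1]), 1.0 - abar[t - 1]
    print(t, round(mean_coef, 4), round(closed_mean, 4), round(noise_var, 4), round(closed_var, 4))
```

Both routes agree at every timestep, which is why training can jump straight to any $t$ in a single step instead of simulating the chain.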
Noise Schedules
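The choice of $\beta_t$ significantly affects quality:
| Schedule | Formula | Properties |
|---|---|---|
| Linear | $\beta_t$ linearly spaced from $\beta_{\text{start}}$ to $\beta_{\text{end}}$ | Original DDPM. Too aggressive at the end. |
| Cosine | $\bar\alpha_t = f(t) / f(0)$, $f(t) = \cos^2\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ | Smoother, better for higher resolutions |
| Sigmoid | Sigmoid-shaped $\beta_t$ curve between two endpoints | Used in some modern architectures |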
```python
import torch
import numpy as np


def linear_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Original DDPM linear schedule."""
    return torch.linspace(beta_start, beta_end, T)


def cosine_schedule(T, s=0.008):
    """Improved cosine schedule (Nichol & Dhariwal, 2021)."""
    steps = T + 1
    t = torch.linspace(0, T, steps) / T
    # alpha_bar follows a squared-cosine curve, normalized so alpha_bar[0] = 1
    alphas_cumprod = torch.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    # Recover per-step betas from consecutive alpha_bar ratios
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clamp(betas, 0.0001, 0.999)
```
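To see why the cosine schedule is considered gentler, it helps to compare how much signal $\bar\alpha_t$ survives at the same timestep under both schedules. The following is a dependency-free sketch (plain Python rather than `torch`); the function names and the chosen checkpoints are illustrative assumptions, and the cosine curve is evaluated without the final clamp on $\beta_t$.

```python
import math

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for the linear beta schedule, t = 1..T."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

lin = linear_alpha_bar()
cos_ = cosine_alpha_bar()
for t in (250, 500, 750):
    print(f"t={t}: linear alpha_bar={lin[t - 1]:.4f}, cosine alpha_bar={cos_[t - 1]:.4f}")
```

At each checkpoint the cosine schedule keeps more of the original signal, so fewer timesteps are spent on inputs that are already almost pure noise.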
Part 3 - The Reverse Process and Training Objective
The Reverse Process
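The reverse process learns to denoise:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
The neural network predicts the mean of the denoised distribution. In practice, there are three parameterizations that are equivalent up to a timestep-dependent loss weighting:
| Prediction target | What the network outputs | Training loss |
|---|---|---|
| Predict noise $\epsilon$ | The noise that was added | $\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2$ |
| Predict $x_0$ | The clean image | $\lVert x_0 - \hat{x}_\theta(x_t, t) \rVert^2$ |
| Predict velocity $v$ | $v = \sqrt{\bar\alpha_t}\, \epsilon - \sqrt{1 - \bar\alpha_t}\, x_0$ | $\lVert v - v_\theta(x_t, t) \rVert^2$ |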
The noise prediction parameterization is most common (used in DDPM, Stable Diffusion).
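Because $x_t$ is a fixed linear combination of $x_0$ and $\epsilon$, any one of the three targets determines the other two. The scalar sketch below illustrates the conversions; it assumes the velocity definition $v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1 - \bar\alpha_t}\,x_0$ from Salimans & Ho (2022), and the function names and numeric values are hypothetical.

```python
import math

def make_noisy(x0, eps, alpha_bar):
    """Forward closed form: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1 - alpha_bar) * eps

def velocity(x0, eps, alpha_bar):
    """v-prediction target: v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    return math.sqrt(alpha_bar) * eps - math.sqrt(1 - alpha_bar) * x0

def x0_from_eps(x_t, eps, alpha_bar):
    """Recover the clean sample from a noise prediction."""
    return (x_t - math.sqrt(1 - alpha_bar) * eps) / math.sqrt(alpha_bar)

def x0_from_v(x_t, v, alpha_bar):
    """Recover the clean sample from a velocity prediction."""
    return math.sqrt(alpha_bar) * x_t - math.sqrt(1 - alpha_bar) * v

def eps_from_v(x_t, v, alpha_bar):
    """Recover the noise from a velocity prediction."""
    return math.sqrt(1 - alpha_bar) * x_t + math.sqrt(alpha_bar) * v

x0, eps, alpha_bar = 0.8, -1.3, 0.35  # hypothetical values
x_t = make_noisy(x0, eps, alpha_bar)
v = velocity(x0, eps, alpha_bar)
print(x0_from_eps(x_t, eps, alpha_bar), x0_from_v(x_t, v, alpha_bar), eps_from_v(x_t, v, alpha_bar))
```

The choice between them therefore changes how errors are weighted across timesteps, not what the model can represent.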
Deriving the Training Objective
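The full variational lower bound (ELBO) for diffusion models:
$$-\log p_\theta(x_0) \le \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\Vert\,p(x_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\Vert\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0 \mid x_1)}_{L_0}\Big]$$
The posterior $q(x_{t-1} \mid x_t, x_0)$ has a closed form (because both the forward and posterior are Gaussian):
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\right)$$
where:
$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\, \beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\, x_t, \qquad \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\, \beta_t$$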
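Substituting $x_0 = (x_t - \sqrt{1 - \bar\alpha_t}\,\epsilon) / \sqrt{\bar\alpha_t}$ into the posterior mean gives the form $\frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon\big)$, which is why predicting the noise is enough to recover the reverse-step mean. The sketch below checks this identity numerically for scalars; the function names and values are hypothetical.

```python
import math

def posterior_mean(x_t, x0, alpha_t, abar_t, abar_prev):
    """Mean of q(x_{t-1} | x_t, x_0), written in terms of x_0 and x_t."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0 + coef_xt * x_t

def mean_from_noise(x_t, eps, alpha_t, abar_t):
    """The same mean, rewritten in terms of the noise eps."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)

# Hypothetical values; abar_t must equal alpha_t * abar_prev.
alpha_t, abar_prev = 0.98, 0.40
abar_t = alpha_t * abar_prev
x0, eps = 0.5, 1.2
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

print(posterior_mean(x_t, x0, alpha_t, abar_t, abar_prev))
print(mean_from_noise(x_t, eps, alpha_t, abar_t))
```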
The Simplified Loss
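Ho et al. (2020) showed that the simplified loss works best in practice:
$$L_{\text{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\ t\right) \right\rVert^2\right]$$
In words: Sample a clean image $x_0$, a random timestep $t$, and random noise $\epsilon$. Noise the image to get $x_t$. Train the network to predict $\epsilon$ from $x_t$ and $t$.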
```python
import torch
import torch.nn as nn


class DiffusionTrainer:
    """Simplified DDPM training loop."""

    def __init__(self, model, T=1000, beta_start=1e-4, beta_end=0.02):
        self.model = model
        self.T = T
        # Precompute schedule
        betas = torch.linspace(beta_start, beta_end, T)
        alphas = 1 - betas
        self.alpha_bar = torch.cumprod(alphas, dim=0)

    def training_step(self, x_0):
        """One training step. Expects x_0 of shape (batch, channels, height, width)."""
        batch_size = x_0.shape[0]
        # Sample random timestep for each image (on the same device as the data)
        t = torch.randint(0, self.T, (batch_size,), device=x_0.device)
        # Sample noise
        epsilon = torch.randn_like(x_0)
        # Create noisy image: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
        alpha_bar_t = self.alpha_bar.to(x_0.device)[t].view(-1, 1, 1, 1)
        x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * epsilon
        # Predict noise
        epsilon_pred = self.model(x_t, t)
        # Simple MSE loss
        loss = nn.functional.mse_loss(epsilon_pred, epsilon)
        return loss
```
Never confuse the forward process (adding noise) with the training objective. The forward process is not learned - it's fixed. The training objective is to predict the noise. Many candidates mix these up and describe training as "learning to add noise," which is backwards.
Part 4 - Sampling: DDPM vs DDIM
DDPM Sampling (Stochastic)
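DDPM sampling follows the reverse process step by step:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z$$
where $z \sim \mathcal{N}(0, I)$ adds stochasticity (with $\sigma_t^2 = \beta_t$ or $\tilde\beta_t$, and $z = 0$ at the final step).
Problem: Requires all $T$ steps (typically 1000), which is very slow.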
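A toy run makes the sampling loop concrete. The sketch below is a 1D illustration under a strong assumption: every training example equals the constant `c`, so the optimal noise predictor is known in closed form and stands in for a trained network. The schedule values and function name are likewise illustrative.

```python
import math
import random

def ddpm_sample_point_mass(c=2.0, T=1000, beta_start=1e-4, beta_end=0.02, seed=0):
    """Ancestral DDPM sampling in 1D with an exact (oracle) noise predictor."""
    rng = random.Random(seed)
    betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    alphas = [1.0 - b for b in betas]
    abar, running = [], 1.0
    for a in alphas:
        running *= a
        abar.append(running)

    x = rng.gauss(0.0, 1.0)  # x_T ~ N(0, 1)
    for i in reversed(range(T)):
        # Oracle "network": with data fixed at c, eps = (x_t - sqrt(abar) * c) / sqrt(1 - abar)
        eps = (x - math.sqrt(abar[i]) * c) / math.sqrt(1.0 - abar[i])
        mean = (x - betas[i] / math.sqrt(1.0 - abar[i]) * eps) / math.sqrt(alphas[i])
        if i > 0:
            x = mean + math.sqrt(betas[i]) * rng.gauss(0.0, 1.0)  # sigma_t^2 = beta_t
        else:
            x = mean  # no noise is added at the final step
    return x

print(ddpm_sample_point_mass())
```

Every one of the `T` iterations calls the noise predictor once, which is exactly the cost that faster samplers try to cut.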
DDIM Sampling (Deterministic)
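DDIM (Song et al., 2021) derives a non-Markovian reverse process that allows skipping steps:
$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\, \underbrace{\left(\frac{x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right)}_{\text{predicted } x_0} + \sqrt{1 - \bar\alpha_{t-1} - \sigma_t^2}\, \epsilon_\theta(x_t, t) + \sigma_t \epsilon_t$$
When $\sigma_t = 0$ (fully deterministic), DDIM can use a sub-sequence of timesteps, e.g., every 20th timestep, reducing steps from 1000 to 50 or even 20.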
| Method | Steps | Quality (FID) | Speed | Deterministic? |
|---|---|---|---|---|
| DDPM | 1000 | Best | Slowest | No (stochastic) |
| DDIM (50 steps) | 50 | Near DDPM | 20x faster | Yes (optional) |
| DDIM (20 steps) | 20 | Good | 50x faster | Yes (optional) |
| DDIM (10 steps) | 10 | Acceptable | 100x faster | Yes (optional) |
```python
import torch


@torch.no_grad()
def ddim_sample(model, shape, T=1000, num_steps=50, eta=0.0, alpha_bar=None):
    """
    DDIM sampling with configurable steps.
    eta=0: deterministic, eta=1: DDPM-like stochastic sampling
    """
    if alpha_bar is None:
        # Assumption: the model was trained with the linear DDPM schedule
        betas = torch.linspace(1e-4, 0.02, T)
        alpha_bar = torch.cumprod(1 - betas, dim=0)

    # Create sub-sequence of timesteps
    step_size = T // num_steps
    timesteps = list(range(0, T, step_size))[::-1]  # [T - step_size, ..., step_size, 0]

    x = torch.randn(shape)  # Start from pure noise
    for i, t in enumerate(timesteps):
        t_tensor = torch.full((shape[0],), t, dtype=torch.long)
        # Predict noise
        eps_pred = model(x, t_tensor)
        # Predict x_0
        alpha_bar_t = alpha_bar[t]
        x0_pred = (x - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)

        if i < len(timesteps) - 1:
            t_next = timesteps[i + 1]
            alpha_bar_next = alpha_bar[t_next]
            # DDIM update
            sigma = eta * torch.sqrt((1 - alpha_bar_next) / (1 - alpha_bar_t)) * \
                torch.sqrt(1 - alpha_bar_t / alpha_bar_next)
            dir_xt = torch.sqrt(1 - alpha_bar_next - sigma**2) * eps_pred
            x = torch.sqrt(alpha_bar_next) * x0_pred + dir_xt
            if sigma > 0:
                x += sigma * torch.randn_like(x)
        else:
            # Last step: return the predicted clean sample
            x = x0_pred
    return x
```
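The `eta` knob in the sampler above interpolates between deterministic DDIM and DDPM-style stochastic sampling. As a sanity check, assuming a linear schedule and consecutive timesteps, the sketch below verifies that `eta=1` reproduces the DDPM posterior variance $\tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\beta_t$ and that `eta=0` removes the noise term. It also prints the strided timestep sub-sequence. The helper names are illustrative.

```python
import math

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar for a linear beta schedule (0-indexed)."""
    out, running = [], 1.0
    for i in range(T):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * i / (T - 1))
        out.append(running)
    return out

def ddim_sigma(abar_t, abar_prev, eta):
    """Noise scale of a DDIM step from alpha_bar_t to the less noisy alpha_bar_prev."""
    return eta * math.sqrt((1 - abar_prev) / (1 - abar_t)) * math.sqrt(1 - abar_t / abar_prev)

def posterior_variance(abar_t, abar_prev):
    """DDPM posterior variance beta_tilde_t for consecutive timesteps."""
    beta_t = 1 - abar_t / abar_prev
    return (1 - abar_prev) / (1 - abar_t) * beta_t

def strided_timesteps(T=1000, num_steps=50):
    """Evenly strided timestep sub-sequence, from most to least noisy."""
    step = T // num_steps
    return list(range(0, T, step))[::-1]

abar = linear_alpha_bar()
t = 500
print(ddim_sigma(abar[t], abar[t - 1], eta=1.0) ** 2, posterior_variance(abar[t], abar[t - 1]))
print(ddim_sigma(abar[t], abar[t - 1], eta=0.0))
print(strided_timesteps()[:3], strided_timesteps()[-1], len(strided_timesteps()))
```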
Part 5 - Score-Based Generative Models
Connection to Score Matching
Song & Ermon (2019) showed diffusion models can be understood through score matching.
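The score function is the gradient of the log probability density:
$$s(x) = \nabla_x \log p(x)$$
This points in the direction of increasing data likelihood - following it via Langevin dynamics generates samples:
$$x_{i+1} = x_i + \frac{\delta}{2}\, \nabla_x \log p(x_i) + \sqrt{\delta}\, z_i, \quad z_i \sim \mathcal{N}(0, I)$$
The problem: Estimating $\nabla_x \log p(x)$ is intractable for complex distributions.
The solution: Train a neural network $s_\theta(x, \sigma)$ to approximate the score at each noise level.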
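Langevin dynamics is easy to try when the score is known exactly. The sketch below targets a 1D Gaussian, for which the score is $-(x - \mu)/\sigma^2$, and uses the unadjusted update $x \leftarrow x + \frac{\delta}{2}\,\text{score}(x) + \sqrt{\delta}\,z$. The step size, chain length, and function names are illustrative assumptions; in a score-based model a trained network would replace the analytic score.

```python
import random

def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

def langevin_sample(score, n_steps=300, step_size=0.1, x_init=0.0, rng=random):
    """Unadjusted Langevin dynamics: x <- x + (step/2) * score(x) + sqrt(step) * z."""
    x = x_init
    for _ in range(n_steps):
        x = x + 0.5 * step_size * score(x) + step_size ** 0.5 * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [
    langevin_sample(lambda x: gaussian_score(x, mu=3.0, sigma=1.0), rng=rng)
    for _ in range(400)
]
mean = sum(samples) / len(samples)
print(f"sample mean = {mean:.3f} (target mean 3.0)")
```

With a finite step size the chain's stationary variance is slightly larger than the target's, which is one reason practical samplers anneal the step size or add a correction step.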
Unified Framework
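Song et al. (2021) unified DDPM and score-based models under stochastic differential equations (SDEs):
Forward SDE (noise addition):
$$dx = f(x, t)\, dt + g(t)\, dw$$
Reverse SDE (denoising):
$$dx = \left[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar{w}$$
The neural network learns $s_\theta(x, t) \approx \nabla_x \log p_t(x)$, which is equivalent to learning $\epsilon_\theta$ (noise prediction) up to a scaling factor:
$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}}$$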
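The scaling relation between score and noise can be checked directly on the Gaussian perturbation kernel $q(x_t \mid x_0)$, whose score is the target used in denoising score matching. The sketch below compares a finite-difference gradient of $\log q(x_t \mid x_0)$ with $-\epsilon / \sqrt{1 - \bar\alpha_t}$ in 1D; the names and values are hypothetical.

```python
import math

def log_q(x_t, x0, abar):
    """log N(x_t; sqrt(abar) * x0, 1 - abar) in 1D."""
    var = 1.0 - abar
    return -0.5 * math.log(2 * math.pi * var) - (x_t - math.sqrt(abar) * x0) ** 2 / (2 * var)

def numerical_score(x_t, x0, abar, h=1e-5):
    """Central finite difference of log q with respect to x_t."""
    return (log_q(x_t + h, x0, abar) - log_q(x_t - h, x0, abar)) / (2 * h)

x0, eps, abar = 0.7, -0.4, 0.3  # hypothetical values
x_t = math.sqrt(abar) * x0 + math.sqrt(1 - abar) * eps
score_fd = numerical_score(x_t, x0, abar)
score_from_eps = -eps / math.sqrt(1 - abar)
print(score_fd, score_from_eps)
```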
Industry adoption of these approaches:
- OpenAI: Used guided diffusion (DALL-E 2), and later published consistency models for faster sampling.
- Stability AI: Stable Diffusion (latent diffusion), open-source.
- Google: Imagen uses diffusion in pixel space with large text encoders (T5).
- Meta: Uses flow matching (continuous-time generalization of diffusion) for newer models.
Part 6 - Latent Diffusion and Stable Diffusion
The Resolution Problem
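Running diffusion in pixel space is extremely expensive. For a 512x512 RGB image:
- Input dimension: $512 \times 512 \times 3 = 786{,}432$
- Each denoising step processes this full-resolution tensor
- Pixel count grows quadratically with image side length, and self-attention cost grows quadratically with the number of spatial positions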
Latent Diffusion Models (LDMs)
Rombach et al. (2022) proposed running diffusion in a compressed latent space:
- Train a VAE (encoder $\mathcal{E}$ + decoder $\mathcal{D}$) to compress images: $z = \mathcal{E}(x)$
- Run diffusion in latent space: Add noise to $z$, train denoiser on $z_t$, sample in $z$-space
- Decode to pixels: $\hat{x} = \mathcal{D}(z)$
With downsampling factor $f = 8$: a 512x512 image becomes a 64x64x4 latent - 48x fewer dimensions.
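The 48x figure is simple arithmetic, sketched below. The defaults assume a downsampling factor of 8 and 4 latent channels as stated above; the function names are illustrative.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Spatial size shrinks by the downsampling factor; channels are set by the VAE."""
    return (height // downsample, width // downsample, latent_channels)

def compression_ratio(height, width, channels=3, downsample=8, latent_channels=4):
    """Ratio of pixel-space dimensions to latent-space dimensions."""
    pixels = height * width * channels
    h, w, c = latent_shape(height, width, downsample, latent_channels)
    return pixels / (h * w * c)

print(latent_shape(512, 512), compression_ratio(512, 512))
print(latent_shape(1024, 1024), compression_ratio(1024, 1024))
```

The ratio depends only on the downsampling factor and the channel counts, so it stays the same at higher resolutions while the absolute latent size still grows.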
The Stable Diffusion Architecture
Key components:
| Component | Architecture | Role | Parameters |
|---|---|---|---|
| VAE | KL-regularized autoencoder | Compress/decompress images | ~84M |
| U-Net | U-Net with attention | Denoise latents | ~860M |
| Text encoder | CLIP ViT-L/14 | Encode text prompts | ~123M |
| Total | - | - | ~1.07B |
Part 7 - Classifier-Free Guidance
The Problem
Conditional diffusion models can generate images conditioned on text, but the conditioning is often weak - the model ignores parts of the prompt.
Classifier Guidance (Dhariwal & Nichol, 2021)
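Use a separate classifier $p_\phi(y \mid x_t)$ to guide sampling:
$$\hat\epsilon = \epsilon_\theta(x_t, t) - s\, \sqrt{1 - \bar\alpha_t}\, \nabla_{x_t} \log p_\phi(y \mid x_t)$$
where $s$ is the classifier guidance scale.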
Problem: Requires training a separate classifier on noisy images.
Classifier-Free Guidance (Ho & Salimans, 2022)
Key insight: You can get the same effect without a classifier by training the model with and without conditioning (randomly dropping the text prompt during training).
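At sampling time, combine the conditional and unconditional predictions:
$$\hat\epsilon = (1 - w)\, \epsilon_\theta(x_t, t, \varnothing) + w\, \epsilon_\theta(x_t, t, c)$$
where:
- $\epsilon_\theta(x_t, t, c)$ is the conditional prediction (with text prompt)
- $\epsilon_\theta(x_t, t, \varnothing)$ is the unconditional prediction (text dropped)
- $w$ is the guidance scale
Rearranging:
$$\hat\epsilon = \epsilon_\theta(x_t, t, \varnothing) + w\, \big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big)$$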
This amplifies the "direction" that the conditioning provides.
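In code, guidance is a one-line extrapolation applied to the two noise predictions at every sampling step. The sketch below uses plain Python lists as stand-ins for model outputs; the values are hypothetical, and the convention assumed is the one where a scale of 0 gives the unconditional prediction and a scale of 1 gives the conditional one.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """eps_hat = eps_uncond + w * (eps_cond - eps_uncond), element-wise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.20, 0.30]  # hypothetical unconditional prediction
eps_cond = [0.30, -0.10, 0.20]    # hypothetical conditional prediction
for w in (0.0, 1.0, 7.5):
    print(w, cfg_combine(eps_uncond, eps_cond, w))
```

Scales above 1 move the result past the conditional prediction, along the direction from unconditional to conditional. Each sampling step needs both predictions, typically obtained by batching the prompted and empty-prompt inputs through the same network.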
| Guidance Scale ($w$) | Effect | Quality |
|---|---|---|
| 0 | No guidance (unconditional generation) | Low prompt adherence |
| 1-3 | Mild guidance | Natural but may ignore prompt details |
| 7-8 | Standard (Stable Diffusion default) | Good balance |
| 10-15 | Strong guidance | High prompt adherence but saturated/unnatural |
| 20+ | Extreme guidance | Artifacts, oversaturation |
"Classifier-free guidance trains the diffusion model to work both with and without text conditioning by randomly dropping the prompt during training. At inference, you compute both the conditional and unconditional noise predictions and amplify their difference by a guidance scale. Higher guidance means stronger prompt adherence but at the cost of diversity and naturalness. A scale of 7-8 is the typical sweet spot for Stable Diffusion."
Part 8 - Modern Advances
Consistency Models (Song et al., 2023)
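Map any point $x_t$ on the diffusion trajectory directly to $x_0$ in a single step:
$$f_\theta(x_t, t) \approx x_0, \qquad f_\theta(x_t, t) = f_\theta(x_{t'}, t') \ \text{ for all } t, t' \text{ on the same trajectory}$$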
Trained via consistency distillation (distill from a pre-trained diffusion model) or consistency training (train from scratch).
Result: 1-4 step generation with quality approaching 50-step DDIM.
Flow Matching (Lipman et al., 2023)
Replaces the noising/denoising framework with optimal transport between noise and data distributions. The model learns a velocity field that transports samples from noise to data along straight paths.
Advantages: Simpler training, straight paths (fewer steps needed), better theoretical properties.
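A minimal sketch of the straight-path idea follows. It assumes the simplest linear interpolation between a noise sample (at $t = 0$) and a data sample (at $t = 1$), for which the regression target is the constant velocity `x_data - x_noise`. The function names are illustrative, and the true velocity of a single pair stands in for a trained network $v_\theta(x_t, t)$.

```python
import random

def interpolate(x_noise, x_data, t):
    """Straight-line path: x_t = (1 - t) * x_noise + t * x_data."""
    return [(1 - t) * n + t * d for n, d in zip(x_noise, x_data)]

def target_velocity(x_noise, x_data):
    """Regression target along the straight path: dx_t/dt = x_data - x_noise."""
    return [d - n for n, d in zip(x_noise, x_data)]

def euler_integrate(x_start, velocity_fn, n_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x_start), 1.0 / n_steps
    for k in range(n_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
x_data = [1.0, -2.0, 0.5]                        # hypothetical data point
x_noise = [rng.gauss(0.0, 1.0) for _ in x_data]  # noise sample
v_true = target_velocity(x_noise, x_data)
x_end = euler_integrate(x_noise, lambda x, t: v_true)
print(x_end)
```

Because the velocity is constant along a straight path, a handful of Euler steps already lands on the data point, which is the intuition behind needing fewer sampling steps.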
DiT (Diffusion Transformers)
Replace the U-Net with a Vision Transformer (ViT) architecture:
- Peebles & Xie (2023): DiT outperforms U-Net on ImageNet at the same compute
- Used in Sora and Stable Diffusion 3
- Better scalability - transformers scale more predictably than U-Nets
Part 9 - Diffusion Models Beyond Images
| Domain | Model | Key Innovation |
|---|---|---|
| Video | Sora, Runway Gen-3 | Temporal attention; spatio-temporal U-Nets or transformers over spacetime patches |
| Audio | AudioLDM, Stable Audio | Mel-spectrogram latent space |
| 3D | DreamFusion, Zero-1-to-3 | Score distillation sampling (SDS) |
| Protein | RFdiffusion | Diffusion on residue backbone frames (3D positions and orientations) |
| Molecular | GeoDiff | Diffusion on molecular conformations |
| Music | Riffusion | Stable Diffusion fine-tuned on spectrogram images |
| Policy (RL) | Diffuser | Diffusion over action trajectories |
Part 10 - Practice Problems
Problem 1: Forward Process Derivation
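Show that $q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right)$ by recursively applying the single-step noise addition $q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ (1 - \alpha_t) I\right)$.
Hint 1 - Direction
Use the reparameterization trick at each step: $x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}$. Substitute $x_{t-1}$ in terms of $x_{t-2}$, and so on.
Full Answer
Step 1: Using the reparameterization trick:
$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}, \quad \epsilon_{t-1} \sim \mathcal{N}(0, I)$$
Step 2: Substitute $x_{t-1} = \sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_{t-1}}\, \epsilon_{t-2}$:
$$x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t (1 - \alpha_{t-1})}\, \epsilon_{t-2} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}$$
Step 3: The last two terms are independent Gaussians. Their sum has variance:
$$\alpha_t (1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$$
So:
$$x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar\epsilon, \quad \bar\epsilon \sim \mathcal{N}(0, I)$$
Step 4: Continuing recursively to $x_0$:
$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \bar\alpha_t = \prod_{s=1}^{t} \alpha_s, \quad \epsilon \sim \mathcal{N}(0, I)$$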
Problem 2: Guidance Scale Analysis
A user generates images at two guidance scales, one low ($w_{\text{low}}$) and one high ($w_{\text{high}}$). The first set looks good but doesn't follow the prompt well. The second set follows the prompt but has color saturation artifacts. Explain why and suggest a solution.
Hint 1 - Direction
Think about what classifier-free guidance does geometrically: it amplifies the difference between conditional and unconditional predictions.
Full Answer + Rubric
At $w_{\text{low}}$: The guidance is weak. The model generates plausible images but the conditional signal (text) doesn't dominate enough. The model partially ignores the prompt because the unconditional mode has significant weight.
At $w_{\text{high}}$: The guidance is too strong. The update $\hat\epsilon = \epsilon_\varnothing + w\,(\epsilon_c - \epsilon_\varnothing)$, with $\epsilon_c$ and $\epsilon_\varnothing$ the conditional and unconditional predictions, amplifies the conditional direction so much that the predicted noise goes outside the normal range. This pushes pixel values to extremes, causing saturation.
Solutions:
- Use $w \approx 7.5$: The empirically validated sweet spot for most prompts.
- Dynamic guidance: Use higher guidance for early timesteps (establish structure) and lower guidance for later timesteps (refine details). Some implementations use a cosine schedule for $w$.
- Thresholding: Clip the predicted $x_0$ to $[-s, s]$ at each step (dynamic thresholding from Imagen paper), preventing saturation.
- Better prompt engineering: More detailed prompts often work better at moderate guidance than vague prompts at high guidance.
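The thresholding fix can be sketched in a few lines. The version below follows the description of dynamic thresholding in the Imagen paper: pick $s$ as a high percentile of $|x_0|$, clip to $[-s, s]$ only when $s > 1$, and divide by $s$. The percentile value, the simple index-based percentile, and the function names are illustrative assumptions.

```python
def dynamic_threshold(x0_pred, percentile=0.995):
    """Clip predicted x_0 to [-s, s] and rescale by s, with s a high percentile of |x_0|.

    s is floored at 1.0 so predictions already inside [-1, 1] are left unchanged.
    """
    magnitudes = sorted(abs(v) for v in x0_pred)
    index = min(len(magnitudes) - 1, int(percentile * len(magnitudes)))
    s = max(1.0, magnitudes[index])
    return [max(-s, min(s, v)) / s for v in x0_pred]

def static_threshold(x0_pred):
    """Plain clipping to [-1, 1], shown for comparison."""
    return [max(-1.0, min(1.0, v)) for v in x0_pred]

saturated = [-3.0, -0.5, 0.2, 1.5, 4.0]  # hypothetical over-guided prediction
print(static_threshold(saturated))
print(dynamic_threshold(saturated))
```

Static clipping flattens every out-of-range value onto the boundary, while the dynamic variant rescales the whole prediction so relative contrast is preserved.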
Scoring:
- Strong Hire: Explains the geometric interpretation of guidance, identifies saturation mechanism, suggests dynamic guidance or thresholding
- Lean Hire: Knows that guidance scale controls prompt adherence but can't explain the artifacts
- No Hire: Suggests "just use a higher scale for better results"
Problem 3: System Design - Text-to-Image Service
Design a text-to-image generation service that handles 1000 requests per minute with <10 second latency. What are your key architectural choices?
Hint 1 - Direction
Consider: model choice (pixel vs latent), number of sampling steps, batching strategy, GPU selection, caching, and whether to use distilled models.
Full Answer + Rubric
Architecture:
- Model: Latent diffusion (Stable Diffusion XL or similar). Pixel-space diffusion is too slow at high resolution.
- Sampling: DDIM with 20-30 steps or consistency model with 4 steps. Not 1000-step DDPM.
- GPU: A100 80GB or H100. A single SDXL inference takes ~2-3 seconds on A100 with 20 steps.
- Throughput math: 1000 requests/min = ~17 requests/sec. Assuming a batch of 4 completes in ~3 s on one GPU, each GPU processes ~1.3 requests/sec, so 17 / 1.3 gives ~13 GPUs (plus redundancy = 16).
- Batching: Queue incoming requests, batch by similar dimensions. Dynamic batching with max wait time of 500ms.
- Caching: Cache text encoder outputs (CLIP embeddings for the same prompt). Cache finished images for exact repeats (same prompt, seed, and settings).
- Optimization: Use TensorRT or torch.compile for inference. Half-precision throughout. Consider quantized models (INT8 U-Net).
- Scaling: Horizontal scaling behind a load balancer. GPU autoscaling based on queue depth.
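The throughput estimate in the list above can be written as a small capacity calculation. The sketch assumes that a batch of 4 finishes in about 3 seconds on one GPU; the function name and the 25% headroom figure are illustrative assumptions rather than measured values.

```python
import math

def gpus_needed(requests_per_min, seconds_per_batch, batch_size, headroom=0.0):
    """GPUs required to sustain the request rate, optionally with fractional headroom."""
    demand_per_sec = requests_per_min / 60.0
    per_gpu_per_sec = batch_size / seconds_per_batch
    return math.ceil(demand_per_sec / per_gpu_per_sec * (1.0 + headroom))

print(gpus_needed(1000, seconds_per_batch=3.0, batch_size=4))
print(gpus_needed(1000, seconds_per_batch=3.0, batch_size=4, headroom=0.25))
```

Keeping the calculation parameterized makes it easy to redo the estimate in an interview when the step count, batch size, or GPU type changes.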
Scoring:
- Strong Hire: Concrete throughput math, correct model/sampling choices, addresses batching and optimization
- Lean Hire: Reasonable architecture but no concrete numbers or optimization strategies
- No Hire: Suggests running 1000-step DDPM on a single GPU
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Explain diffusion models" | Forward (add noise) → Training (predict noise) → Sampling (iterative denoise) | "Diffusion models learn to reverse a gradual noising process. Training is simple: predict the noise added at a random timestep. Sampling chains many denoising steps." |
| "What's the training loss?" | $L_{\text{simple}} = \mathbb{E}\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2$ - predict noise from noisy input | "The simplified loss is just MSE between the actual noise and the predicted noise. Sample a timestep, noise the image, predict the noise." |
| "DDPM vs DDIM?" | DDPM: 1000 steps, stochastic. DDIM: 20-50 steps, deterministic, same model | "DDIM uses a non-Markovian reverse process that allows skipping timesteps. Same model, 20x faster, deterministic sampling." |
| "How does Stable Diffusion work?" | VAE (compress) → U-Net (denoise latents) → CLIP (text condition) → CFG | "Stable Diffusion runs diffusion in a compressed latent space, using a VAE for compression and CLIP for text conditioning." |
| "What is classifier-free guidance?" | Train with/without condition → amplify conditional direction at inference | "Drop the conditioning randomly during training. At inference, amplify the difference between conditional and unconditional predictions." |
| "Why latent diffusion?" | Pixel space is 48x larger → latent space is efficient → same quality | "Running diffusion on 64x64 latents instead of 512x512 pixels gives 48x fewer dimensions with minimal quality loss." |
Spaced Repetition Checkpoints
- Day 0: Read this page. Write the simplified training loss. Explain the forward process in your own words.
- Day 3: Derive $q(x_t \mid x_0)$ from the single-step $q(x_t \mid x_{t-1})$. Explain DDPM vs DDIM sampling.
- Day 7: Draw the Stable Diffusion architecture from memory. Explain classifier-free guidance with the formula.
- Day 14: Compare diffusion models with GANs and VAEs on 5 dimensions. Explain flow matching in 2 minutes.
- Day 21: Solve all three practice problems from memory. Time yourself - 8-10 minutes each.
Next Steps
- Continue to RAG Papers for retrieval-augmented generation
- Review Attention Is All You Need for the transformer architecture used in DiT
- For generative AI system design, see ML System Design
