Diffusion Model Papers - Generating Images by Learning to Denoise

Reading time: ~35 min | Interview relevance: High | Roles: MLE, Research Engineer, AI Engineer (Generative AI)

The Real Interview Moment

You're interviewing for a generative AI role at a company building image generation products. The interviewer puts a blank whiteboard in front of you and says: "Explain to me, from first principles, how diffusion models generate images. Start with the forward process, derive the training objective, explain how sampling works, and then tell me how Stable Diffusion makes this practical for high-resolution images. I want math, not hand-waving."

You've used Stable Diffusion to generate images. You know it "adds noise and then removes it." But can you write the noise schedule? Can you derive the ELBO? Do you understand why DDIM is faster than DDPM? Can you explain classifier-free guidance and why a guidance scale of 7.5 is typical? The interviewer is testing whether you understand diffusion deeply enough to debug training, improve sampling, or adapt the architecture.

This is the diffusion interview. If you're working on any form of generative AI - images, video, audio, or 3D - diffusion models are foundational knowledge.

What You Will Master

After reading this page, you will be able to:

Explain the forward and reverse diffusion processes with full mathematical derivation
Derive the simplified training objective (predict the noise)
Explain noise schedules and their impact on generation quality
Compare DDPM and DDIM sampling and explain the speed-quality trade-off
Explain latent diffusion and why it enables high-resolution generation
Derive classifier-free guidance and explain the guidance scale
Describe the Stable Diffusion architecture end-to-end
Answer every common interview question about diffusion models

Part 1 - The Core Idea

Intuition

Diffusion models work by:

Forward process: Gradually add Gaussian noise to data until it becomes pure noise
Reverse process: Learn to gradually remove noise, transforming noise back into data

If you can learn to reverse one small noise step, you can chain many reverse steps together to generate data from pure noise.

[Figure: Diffusion forward and reverse process]

Why Not Just Use GANs or VAEs?

Property	GANs	VAEs	Diffusion Models
Training stability	Unstable (mode collapse, training failure)	Stable	Stable
Sample quality	High (but mode dropping)	Blurry	Highest
Mode coverage	Poor (mode collapse)	Good	Excellent
Likelihood	Not available	Lower bound (ELBO)	Lower bound (ELBO)
Sampling speed	Fast (single forward pass)	Fast (single forward pass)	Slow (many steps)
Controllability	Difficult	Moderate	Excellent (guidance)

60-Second Answer

"Diffusion models learn to reverse a gradual noising process. You train a neural network to predict the noise added at each step, then at generation time, start from pure Gaussian noise and iteratively denoise. They beat GANs because training is stable with no mode collapse, and they beat VAEs because they produce much sharper samples. The trade-off is sampling speed - you need hundreds of denoising steps, though DDIM and distillation reduce this significantly."

Part 2 - The Forward Process

Adding Noise Gradually

The forward process adds Gaussian noise over $T$ timesteps (typically $T = 1000$ ):

$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t \mathbf{I})$

where $\beta_t$ is the noise schedule - a small variance at each step.

Key property: You can sample $x_t$ directly from $x_0$ without iterating through all previous steps:

$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) \mathbf{I})$

where:

$\alpha_t = 1 - \beta_t$
$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ (cumulative product)

This means: $x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})$
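As a quick numeric sanity check of the closed form, the sketch below computes $\bar{\alpha}_t$ for a linear schedule in plain Python (no tensors) and prints the signal and noise coefficients at a few timesteps. The schedule endpoints `1e-4` and `0.02` and the helper names are illustrative assumptions.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas; the endpoint values are assumed for illustration."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar[t-1] = prod_{s=1}^{t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

T = 1000
alpha_bar = cumulative_alpha_bar(linear_betas(T))

for t in (1, 250, 500, 1000):
    signal = math.sqrt(alpha_bar[t - 1])       # coefficient on x_0
    noise = math.sqrt(1.0 - alpha_bar[t - 1])  # coefficient on epsilon
    print(f"t={t:4d}  signal={signal:.4f}  noise={noise:.4f}")
```

The signal coefficient falls from nearly 1 at $t = 1$ to nearly 0 at $t = T$, which is why $x_T$ can be treated as pure Gaussian noise.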

Common Trap

A common mistake is to say "we add noise T times during training." This is wrong. During training, we sample a random timestep $t \sim \text{Uniform}(1, T)$ and directly compute $x_t$ from $x_0$ using the closed-form formula above. The iterative process only happens during sampling (generation).

Noise Schedules

The choice of $\beta_t$ significantly affects quality:

Schedule	Formula	Properties
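Linear	$\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1)$	Original DDPM. Destroys information quickly, so the last steps are almost pure noise.
Cosine	$\bar{\alpha}_t = \frac{f(t)}{f(0)}$ , $f(t) = \cos\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)^2$	Adds noise more gradually; Nichol & Dhariwal (2021) report the largest gains at low resolutions (32x32, 64x64)
Sigmoid	$\beta_t = \sigma(a + (b-a) \cdot t/T)$ where $a,b$ set the sigmoid's input range (the output is typically rescaled to the target $\beta$ range)	Used in some modern architectures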

import torch
import numpy as np

def linear_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Original DDPM linear schedule."""
    return torch.linspace(beta_start, beta_end, T)

def cosine_schedule(T, s=0.008):
    """Improved cosine schedule (Nichol & Dhariwal, 2021)."""
    steps = T + 1
    t = torch.linspace(0, T, steps) / T
    alphas_cumprod = torch.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clamp(betas, 0.0001, 0.999)
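To make the difference between the two schedules concrete, here is a small plain-Python sketch (independent of the PyTorch helpers above) that compares $\bar{\alpha}_t$ at a few timesteps. The function names and the choice of timesteps are assumptions for illustration.

```python
import math

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for linearly spaced betas."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

T = 1000
linear, cosine = linear_alpha_bar(T), cosine_alpha_bar(T)
for t in (100, 500, 900):
    print(f"t={t}: linear alpha_bar={linear[t - 1]:.4f}  cosine alpha_bar={cosine[t - 1]:.4f}")
```

At the midpoint the cosine schedule still retains roughly half of the signal variance, while the linear schedule has already destroyed most of it.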

Part 3 - The Reverse Process and Training Objective

The Reverse Process

The reverse process learns to denoise:

$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$

The neural network $\mu_\theta$ predicts the mean of the denoised distribution. In practice, there are three equivalent parameterizations:

Prediction target	What the network outputs	Training loss
Predict noise $\epsilon_\theta(x_t, t)$	The noise that was added	$\|\epsilon - \epsilon_\theta(x_t, t)\|^2$
Predict $x_0$ via $\hat{x}_\theta(x_t, t)$	The clean image	$\|x_0 - \hat{x}_\theta(x_t, t)\|^2$
Predict velocity $v_\theta(x_t, t)$	$v = \sqrt{\bar\alpha_t}\epsilon - \sqrt{1-\bar\alpha_t}x_0$	$\|v - v_\theta(x_t, t)\|^2$

The noise prediction parameterization is most common (used in DDPM, Stable Diffusion).
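The three targets are equivalent in the sense that, given $x_t$ and $\bar{\alpha}_t$, any one of them determines the other two (the losses still differ by a timestep-dependent weighting). The sketch below checks this on scalars; the function names and the numeric values are hypothetical.

```python
import math

def to_x0_and_eps_from_v(x_t, v, alpha_bar_t):
    """Recover (x_0, eps) from x_t and v = sqrt(ab) * eps - sqrt(1 - ab) * x_0."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return a * x_t - b * v, b * x_t + a * v

def x0_from_eps(x_t, eps, alpha_bar_t):
    """Invert x_t = sqrt(ab) * x_0 + sqrt(1 - ab) * eps for x_0."""
    return (x_t - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)

# Hypothetical scalar "image", noise draw, and noise level.
x_0, eps, alpha_bar_t = 0.7, -1.3, 0.36
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps
v = math.sqrt(alpha_bar_t) * eps - math.sqrt(1.0 - alpha_bar_t) * x_0

x0_via_v, eps_via_v = to_x0_and_eps_from_v(x_t, v, alpha_bar_t)
x0_via_eps = x0_from_eps(x_t, eps, alpha_bar_t)
print(x0_via_v, eps_via_v, x0_via_eps)
```

Both routes recover the original $x_0$ and $\epsilon$ up to floating-point error, so a sampler written for one parameterization can consume a network trained with another after a cheap conversion.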

Deriving the Training Objective

Training minimizes the negative ELBO, a variational upper bound on $-\log p_\theta(x_0)$, which decomposes into one term per timestep:

$\mathcal{L} = \mathbb{E}_q \left[ D_\text{KL}(q(x_T|x_0) \| p(x_T)) + \sum_{t=2}^{T} D_\text{KL}(q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t)) - \log p_\theta(x_0|x_1) \right]$

The posterior $q(x_{t-1}|x_t, x_0)$ has a closed form: by Bayes' rule it is proportional to a product of Gaussians in $x_{t-1}$, so it is itself Gaussian:

$q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t \mathbf{I})$

where: $\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t$

$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$
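Substituting $x_0 = (x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon) / \sqrt{\bar{\alpha}_t}$ into $\tilde{\mu}_t$ collapses the posterior mean to $\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon\right)$, which is why predicting the noise is enough to take a reverse step. The sketch below checks that identity numerically on scalars; the function names and values are hypothetical.

```python
import math

def posterior_mean_from_x0(x_t, x_0, alpha_t, ab_t, ab_prev):
    """mu_tilde_t(x_t, x_0) as a weighted sum of x_0 and x_t."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    return coef_x0 * x_0 + coef_xt * x_t

def posterior_mean_from_eps(x_t, eps, alpha_t, ab_t):
    """Same mean after substituting x_0 = (x_t - sqrt(1 - ab_t) * eps) / sqrt(ab_t)."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - ab_t) * eps) / math.sqrt(alpha_t)

# Hypothetical values; ab_t must equal alpha_t * ab_prev.
alpha_t, ab_prev = 0.98, 0.50
ab_t = alpha_t * ab_prev
x_0, eps = 0.4, 1.1
x_t = math.sqrt(ab_t) * x_0 + math.sqrt(1.0 - ab_t) * eps

mean_a = posterior_mean_from_x0(x_t, x_0, alpha_t, ab_t, ab_prev)
mean_b = posterior_mean_from_eps(x_t, eps, alpha_t, ab_t)
print(mean_a, mean_b)
```

The two values agree up to floating-point error.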

The Simplified Loss

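Each KL term compares two Gaussians, so it reduces to a weighted squared error between the true noise $\epsilon$ and the predicted noise $\epsilon_\theta$. Ho et al. (2020) found that dropping the timestep-dependent weight gives a simplified loss that works better in practice: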

$\mathcal{L}_\text{simple} = \mathbb{E}_{t, x_0, \epsilon} \left[ \|\epsilon - \epsilon_\theta(\underbrace{\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon}_{x_t}, t)\|^2 \right]$

In words: Sample a clean image $x_0$ , a random timestep $t$ , and random noise $\epsilon$ . Noise the image to get $x_t$ . Train the network to predict $\epsilon$ from $x_t$ and $t$ .

import torch
import torch.nn as nn

class DiffusionTrainer:
    """Simplified DDPM training loop."""

    def __init__(self, model, T=1000, beta_start=1e-4, beta_end=0.02):
        self.model = model
        self.T = T

        # Precompute schedule
        betas = torch.linspace(beta_start, beta_end, T)
        alphas = 1 - betas
        self.alpha_bar = torch.cumprod(alphas, dim=0)

    def training_step(self, x_0):
        """One training step."""
        batch_size = x_0.shape[0]

        # Sample a random (0-indexed) timestep for each image, on the same device as the data
        t = torch.randint(0, self.T, (batch_size,), device=x_0.device)

        # Sample noise
        epsilon = torch.randn_like(x_0)

        # Create noisy image: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
        # view(-1, 1, 1, 1) broadcasts over (C, H, W); assumes x_0 is a 4D image batch
        alpha_bar_t = self.alpha_bar.to(x_0.device)[t].view(-1, 1, 1, 1)
        x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * epsilon

        # Predict noise
        epsilon_pred = self.model(x_t, t)

        # Simple MSE loss
        loss = nn.functional.mse_loss(epsilon_pred, epsilon)
        return loss

Instant Rejection

Never confuse the forward process (adding noise) with the training objective. The forward process is not learned - it's fixed. The training objective is to predict the noise. Many candidates mix these up and describe training as "learning to add noise," which is backwards.

Part 4 - Sampling: DDPM vs DDIM

DDPM Sampling (Stochastic)

DDPM sampling follows the reverse process step by step:

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sqrt{\beta_t} \, z$

where $z \sim \mathcal{N}(0, \mathbf{I})$ adds stochasticity.

Problem: Requires all $T$ steps (typically 1000), which is very slow.
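The update rule can be exercised end to end on a toy problem. The sketch below assumes a 1-D dataset in which every sample equals a single value `target`; in that case the ideal noise predictor has a closed form, so it can stand in for a trained $\epsilon_\theta$. The schedule endpoints, the function names, and the oracle are all assumptions for illustration, not part of DDPM itself.

```python
import math
import random

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear betas plus cumulative alpha_bar (endpoints are assumed values)."""
    betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return betas, alpha_bar

def oracle_eps(x_t, t, alpha_bar, target):
    """Exact noise predictor when every data point equals `target` (toy stand-in for eps_theta)."""
    ab = alpha_bar[t]
    return (x_t - math.sqrt(ab) * target) / math.sqrt(1.0 - ab)

def ddpm_sample(T=1000, target=0.5, seed=0):
    """Ancestral sampling with sigma_t^2 = beta_t, following the update rule above."""
    rng = random.Random(seed)
    betas, alpha_bar = make_schedule(T)
    x = rng.gauss(0.0, 1.0)                        # x_T ~ N(0, 1)
    for t in reversed(range(T)):                   # 0-indexed: t = T-1, ..., 0
        eps = oracle_eps(x, t, alpha_bar, target)
        alpha_t = 1.0 - betas[t]
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bar[t]) * eps) / math.sqrt(alpha_t)
        z = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise is added on the final step
        x = mean + math.sqrt(betas[t]) * z
    return x

print(ddpm_sample())
```

With the oracle predictor the chain ends at `target` up to floating-point error, regardless of the seed. Note that the loop still runs all $T$ iterations, each of which would be a full network call in a real model.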

DDIM Sampling (Deterministic)

DDIM (Song et al., 2021) derives a non-Markovian reverse process that allows skipping steps:

$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1-\bar{\alpha}_t} \, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon_\theta(x_t, t) + \sigma_t z$

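Because the update depends on the schedule only through $\bar{\alpha}_t$ and $\bar{\alpha}_{t-1}$, DDIM can run on a sub-sequence of timesteps, e.g., $\{T, T-k, T-2k, ..., 0\}$ (reading $t-1$ as the next timestep in the sub-sequence), reducing steps from 1000 to 50 or even 20. Setting $\sigma_t = 0$ makes sampling fully deterministic.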

Method	Steps	Quality (FID)	Speed	Deterministic?
DDPM	1000	Best	Slowest	No (stochastic)
DDIM (50 steps)	50	Near DDPM	20x faster	Yes (optional)
DDIM (20 steps)	20	Good	50x faster	Yes (optional)
DDIM (10 steps)	10	Acceptable	100x faster	Yes (optional)

import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bar, T=1000, num_steps=50, eta=0.0):
    """
    DDIM sampling with configurable steps.
    alpha_bar: 1-D tensor of length T holding the cumulative products of alphas.
    eta=0: deterministic, eta=1: DDPM-style stochastic sampling
    """
    # Create a descending sub-sequence of timesteps,
    # e.g. [980, 960, ..., 0] for T=1000 and num_steps=50
    step_size = max(T // num_steps, 1)
    timesteps = list(range(0, T, step_size))[::-1]

    device = alpha_bar.device
    x = torch.randn(shape, device=device)  # Start from pure noise

    for i, t in enumerate(timesteps):
        t_tensor = torch.full((shape[0],), t, dtype=torch.long, device=device)

        # Predict noise
        eps_pred = model(x, t_tensor)

        # Predict x_0
        alpha_bar_t = alpha_bar[t]
        x0_pred = (x - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)

        if i < len(timesteps) - 1:
            t_next = timesteps[i + 1]
            alpha_bar_next = alpha_bar[t_next]

            # DDIM update: sigma controls how much fresh noise is injected
            sigma = eta * torch.sqrt((1 - alpha_bar_next) / (1 - alpha_bar_t)) * \
                    torch.sqrt(1 - alpha_bar_t / alpha_bar_next)

            # Direction pointing back toward x_t
            dir_xt = torch.sqrt(1 - alpha_bar_next - sigma**2) * eps_pred

            x = torch.sqrt(alpha_bar_next) * x0_pred + dir_xt

            if eta > 0:
                x = x + sigma * torch.randn_like(x)
        else:
            # Last step: return the predicted clean sample
            x = x0_pred

    return x
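The `eta` argument in the sampler above interpolates between deterministic DDIM and DDPM-style sampling. A minimal scalar check, with hypothetical noise levels: at `eta=1` and adjacent timesteps, $\sigma_t^2$ equals the posterior variance $\tilde{\beta}_t$ from the ELBO derivation, and at `eta=0` no noise is injected.

```python
import math

def ddim_sigma(ab_t, ab_prev, eta):
    """DDIM noise scale between noise levels alpha_bar_t and alpha_bar_prev."""
    return eta * math.sqrt((1.0 - ab_prev) / (1.0 - ab_t)) * math.sqrt(1.0 - ab_t / ab_prev)

def posterior_variance(ab_t, ab_prev):
    """beta_tilde_t = (1 - ab_prev) / (1 - ab_t) * beta_t, with beta_t = 1 - ab_t / ab_prev."""
    beta_t = 1.0 - ab_t / ab_prev
    return (1.0 - ab_prev) / (1.0 - ab_t) * beta_t

# Hypothetical adjacent noise levels (ab_t < ab_prev).
ab_prev, ab_t = 0.60, 0.58
sigma_stochastic = ddim_sigma(ab_t, ab_prev, eta=1.0)
sigma_deterministic = ddim_sigma(ab_t, ab_prev, eta=0.0)
print(sigma_stochastic ** 2, posterior_variance(ab_t, ab_prev), sigma_deterministic)
```

Since DDPM is commonly run with either $\beta_t$ or $\tilde{\beta}_t$ as the reverse variance, `eta=1` matches the $\tilde{\beta}_t$ variant.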

[Figure: DDPM vs DDIM sampling comparison]

Part 5 - Score-Based Generative Models

Connection to Score Matching

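Song & Ermon (2019) introduced score-based generative modeling, which gives a second way to understand diffusion models: as learning the score function of the data at many noise levels.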

The score function is the gradient of the log probability density:

$\nabla_x \log p(x)$

This points in the direction of increasing data likelihood - following it via Langevin dynamics generates samples:

$x_{t+1} = x_t + \frac{\eta}{2} \nabla_x \log p(x_t) + \sqrt{\eta} \, z$

The problem: Estimating $\nabla_x \log p(x)$ is intractable for complex distributions.

The solution: Train a neural network $s_\theta(x, t) \approx \nabla_x \log q_t(x)$ to approximate the score at each noise level.
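To see why knowing the score is enough to sample, the sketch below runs the Langevin update on a 1-D Gaussian $\mathcal{N}(3, 1)$ whose score is known exactly. In a real model the learned $s_\theta$ would replace `gaussian_score`; the target distribution, step size, and chain length are assumptions chosen for illustration.

```python
import random

def gaussian_score(x, mu=3.0):
    """Score of N(mu, 1): d/dx log p(x) = -(x - mu). Known in closed form for this toy case."""
    return -(x - mu)

def langevin_chain(score, steps=200, eta=0.1, rng=None):
    """Run one unadjusted Langevin chain starting from x = 0."""
    rng = rng or random.Random()
    x = 0.0
    for _ in range(steps):
        x = x + 0.5 * eta * score(x) + (eta ** 0.5) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [langevin_chain(gaussian_score, rng=rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(f"sample mean = {sample_mean:.3f}, sample variance = {sample_var:.3f}")
```

The chains start far from the mode yet end up distributed close to $\mathcal{N}(3, 1)$; the finite step size inflates the variance slightly, which is the usual discretization bias of unadjusted Langevin dynamics.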

Unified Framework

Song et al. (2021) unified DDPM and score-based models under stochastic differential equations (SDEs):

Forward SDE (noise addition): $dx = f(x, t) dt + g(t) dw$

Reverse SDE (denoising): $dx = [f(x,t) - g(t)^2 \nabla_x \log p_t(x)] dt + g(t) d\bar{w}$

The neural network learns $\nabla_x \log p_t(x)$ , which is equivalent to learning $\epsilon_\theta$ (noise prediction) up to a scaling factor:

$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$
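The scaling factor comes from the Gaussian form of $q(x_t | x_0)$: its score is $-(x_t - \sqrt{\bar{\alpha}_t}\,x_0) / (1 - \bar{\alpha}_t)$, and the numerator is exactly $\sqrt{1 - \bar{\alpha}_t}\,\epsilon$. The check below uses the conditional score for a single $x_0$ (the denoising score matching target) with hypothetical scalar values.

```python
import math

def conditional_score(x_t, x_0, ab_t):
    """Score of q(x_t | x_0) = N(sqrt(ab_t) * x_0, 1 - ab_t): -(x_t - mean) / variance."""
    return -(x_t - math.sqrt(ab_t) * x_0) / (1.0 - ab_t)

# Hypothetical scalar values.
x_0, eps, ab_t = -0.2, 0.9, 0.25
x_t = math.sqrt(ab_t) * x_0 + math.sqrt(1.0 - ab_t) * eps

score = conditional_score(x_t, x_0, ab_t)
score_from_eps = -eps / math.sqrt(1.0 - ab_t)
print(score, score_from_eps)
```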

Company Variation

OpenAI: DALL-E 2 pairs a CLIP-based prior with a diffusion decoder (unCLIP). OpenAI later published consistency models as a route to few-step sampling.
Stability AI: Stable Diffusion (latent diffusion), open-source.
Google: Imagen uses diffusion in pixel space with large text encoders (T5).
Meta: Uses flow matching (continuous-time generalization of diffusion) for newer models.

Part 6 - Latent Diffusion and Stable Diffusion

The Resolution Problem

Running diffusion in pixel space is extremely expensive. For a 512x512 RGB image:

Input dimension: $512 \times 512 \times 3 = 786,432$
Each denoising step processes this full-resolution tensor
The pixel count grows quadratically with side length, and self-attention cost grows quadratically with the number of spatial positions

Latent Diffusion Models (LDMs)

Rombach et al. (2022) proposed running diffusion in a compressed latent space:

Train a VAE (encoder $\mathcal{E}$ + decoder $\mathcal{D}$ ) to compress images: $z = \mathcal{E}(x) \in \mathbb{R}^{h/f \times w/f \times c}$
Run diffusion in latent space: Add noise to $z$ , train denoiser on $z$ , sample in $z$ -space
Decode to pixels: $\hat{x} = \mathcal{D}(\hat{z})$

With downsampling factor $f = 8$ : a 512x512 image becomes a 64x64x4 latent - 48x fewer dimensions.
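The 48x figure follows directly from the tensor shapes. A minimal arithmetic check, assuming the 4 latent channels and $f = 8$ quoted above:

```python
def num_elements(height, width, channels):
    """Number of scalar values in an H x W x C tensor."""
    return height * width * channels

pixel_dims = num_elements(512, 512, 3)   # RGB image
f, latent_channels = 8, 4                # downsampling factor and latent channels from the text
latent_dims = num_elements(512 // f, 512 // f, latent_channels)

print(pixel_dims, latent_dims, pixel_dims / latent_dims)  # 786432 16384 48.0
```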

[Figure: Latent diffusion: compress, then diffuse]

The Stable Diffusion Architecture

[Figure: Stable Diffusion architecture]

Key components (parameter counts are approximate, for Stable Diffusion v1.x):

Component	Architecture	Role	Parameters
VAE	KL-regularized autoencoder	Compress/decompress images	~84M
U-Net	U-Net with attention	Denoise latents	~860M
Text encoder	CLIP ViT-L/14	Encode text prompts	~123M
Total	-	-	~1.07B

Part 7 - Classifier-Free Guidance

The Problem

Conditional diffusion models can generate images conditioned on text, but the conditioning is often weak - the model ignores parts of the prompt.

Classifier Guidance (Dhariwal & Nichol, 2021)

Use a separate classifier $p(c|x_t)$ to guide sampling:

$\hat{\epsilon}(x_t, t, c) = \epsilon_\theta(x_t, t) - \sqrt{1 - \bar{\alpha}_t} \cdot s \cdot \nabla_{x_t} \log p(c|x_t)$

Problem: Requires training a separate classifier on noisy images.

Classifier-Free Guidance (Ho & Salimans, 2022)

Key insight: You can get the same effect without a classifier by training the model with and without conditioning (randomly dropping the text prompt during training).

At sampling time, combine the conditional and unconditional predictions:

$\hat{\epsilon}(x_t, t, c) = (1 + w) \cdot \epsilon_\theta(x_t, t, c) - w \cdot \epsilon_\theta(x_t, t, \varnothing)$

where:

$\epsilon_\theta(x_t, t, c)$ is the conditional prediction (with text prompt)
$\epsilon_\theta(x_t, t, \varnothing)$ is the unconditional prediction (text dropped)
$w$ is the guidance scale

Rearranging:

$\hat{\epsilon} = \epsilon_\text{uncond} + (1 + w)(\epsilon_\text{cond} - \epsilon_\text{uncond})$

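This amplifies the "direction" that the conditioning provides. Many implementations, including Stable Diffusion, report the scale as $s = 1 + w$ instead: $s = 0$ is unconditional sampling, $s = 1$ is plain conditional sampling, and the familiar default of 7.5 is an $s$ value. The table below uses $s$.

Guidance Scale ( $s = 1 + w$ )	Effect	Quality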
0	No guidance (unconditional generation)	Low prompt adherence
1-3	Mild guidance	Natural but may ignore prompt details
7-8	Standard (Stable Diffusion default)	Good balance
10-15	Strong guidance	High prompt adherence but saturated/unnatural
20+	Extreme guidance	Artifacts, oversaturation
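The combination rule is a one-liner once both predictions are available. The sketch below applies it element-wise to plain lists and also shows the $s = 1 + w$ form that common Stable Diffusion tooling exposes as the guidance scale. The vectors are hypothetical stand-ins for the two denoiser outputs, and the function names are assumptions.

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """(1 + w) * eps_cond - w * eps_uncond, applied element-wise."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

def cfg_combine_scale(eps_cond, eps_uncond, s):
    """Equivalent form with s = 1 + w: eps_uncond + s * (eps_cond - eps_uncond)."""
    return [u + s * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Hypothetical 3-element noise predictions, one denoiser call each.
eps_cond = [0.20, -0.10, 0.05]
eps_uncond = [0.10, -0.05, 0.00]

for w in (0.0, 6.5):
    guided = cfg_combine(eps_cond, eps_uncond, w)
    same = cfg_combine_scale(eps_cond, eps_uncond, 1.0 + w)
    print(w, [round(g, 4) for g in guided], [round(g, 4) for g in same])
```

With $w = 0$ (that is, $s = 1$) the result is just the conditional prediction; with $s = 0$ it is the unconditional one. Each guided step costs two network evaluations, which is why implementations usually batch the conditional and unconditional inputs together.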

60-Second Answer

"Classifier-free guidance trains the diffusion model to work both with and without text conditioning by randomly dropping the prompt during training. At inference, you compute both the conditional and unconditional noise predictions and amplify their difference by a guidance scale. Higher guidance means stronger prompt adherence but at the cost of diversity and naturalness. A scale of 7-8 is the typical sweet spot for Stable Diffusion."

Part 8 - Modern Advances

Consistency Models (Song et al., 2023)

Map any point on the diffusion trajectory directly to $x_0$ in a single step:

$f_\theta(x_t, t) = x_0 \quad \text{for all } t$

Trained via consistency distillation (distill from a pre-trained diffusion model) or consistency training (train from scratch).

Result: 1-4 step generation with quality approaching 50-step DDIM.

Flow Matching (Lipman et al., 2023)

Replaces the noising/denoising framework with optimal transport between noise and data distributions. The model learns a velocity field that transports samples from noise to data along straight paths.

$\frac{dx}{dt} = v_\theta(x, t)$

Advantages: Simpler training, straight paths (fewer steps needed), better theoretical properties.
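A minimal sketch of the straight-path idea, under simplifying assumptions: noise sits at $t = 0$, data at $t = 1$, and the path between a paired noise sample and data point is a straight line (Lipman et al. consider more general Gaussian paths). The regression target for $v_\theta$ is then the constant `data - noise`, and an oracle velocity stands in for a trained network.

```python
def interpolate(noise, data, t):
    """Point on the straight path at time t: (1 - t) * noise + t * data."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]

def velocity_target(noise, data):
    """Regression target for v_theta along a straight path: data - noise (constant in t)."""
    return [d - n for n, d in zip(noise, data)]

def euler_integrate(x, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical 2-D noise sample and data point.
noise, data = [1.5, -0.5], [0.2, 0.8]
oracle = lambda x, t: velocity_target(noise, data)  # stands in for a trained v_theta

print(interpolate(noise, data, 0.5))
print(euler_integrate(noise, oracle, num_steps=1))
```

Because the velocity is constant along a straight path, a single Euler step already lands on the data point. A learned velocity field is only approximately straight, so real samplers still use several steps, but typically far fewer than a curved diffusion trajectory needs.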

DiT (Diffusion Transformers)

Replace the U-Net with a Vision Transformer (ViT) architecture:

Peebles & Xie (2023): DiT-XL/2 reported state-of-the-art FID on class-conditional ImageNet, with sample quality improving steadily as model compute increases
Used in Sora and Stable Diffusion 3
Better scalability - transformers scale more predictably than U-Nets

Architecture Evolution

[Figure: Diffusion model architecture evolution, from DDPM to DiT]

Part 9 - Diffusion Models Beyond Images

Domain	Model	Key Innovation
Video	Sora, Runway Gen-3	Temporal attention, spacetime-patch transformer backbones
Audio	AudioLDM, Stable Audio	Mel-spectrogram latent space
3D	DreamFusion, Zero-1-to-3	Score distillation sampling (SDS)
Protein	RFdiffusion	Diffusion on protein backbone frames (residue positions and orientations)
Molecular	GeoDiff	Diffusion on molecular conformations
Music	Noise2Music, Riffusion	Text-conditioned diffusion on audio or spectrograms
Policy (RL)	Diffuser	Diffusion over action trajectories

Part 10 - Practice Problems

Problem 1: Forward Process Derivation

Show that $q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t)\mathbf{I})$ by recursively applying the single-step noise addition $q(x_t | x_{t-1})$ .

Hint 1 - Direction

Use the reparameterization trick at each step: $x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{\beta_t} \epsilon_t$ . Substitute $x_{t-1}$ in terms of $x_{t-2}$ , and so on.

Full Answer

Step 1: Using the reparameterization trick:

$x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{1 - \alpha_t} \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, \mathbf{I})$

Step 2: Substitute $x_{t-1} = \sqrt{\alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_{t-1}} \epsilon_{t-1}$ :

$x_t = \sqrt{\alpha_t \alpha_{t-1}} x_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})} \epsilon_{t-1} + \sqrt{1-\alpha_t} \epsilon_t$

Step 3: The last two terms are independent Gaussians. Their sum has variance:

$\alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$

So: $x_t = \sqrt{\alpha_t \alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\epsilon}$

Step 4: Continuing recursively to $x_0$ :

$x_t = \sqrt{\prod_{s=1}^t \alpha_s} x_0 + \sqrt{1 - \prod_{s=1}^t \alpha_s} \epsilon = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$
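The derivation can also be checked empirically. The sketch below applies the single-step noising rule repeatedly to a scalar $x_0$ and compares the empirical mean and variance of $x_T$ with the closed form. The short 50-step schedule and the sample count are assumptions chosen so the simulation runs quickly.

```python
import math
import random

def simulate_forward(x_0, betas, rng):
    """Apply q(x_t | x_{t-1}) once per beta, starting from x_0."""
    x = x_0
    for beta in betas:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

# Hypothetical short schedule so the simulation stays fast.
T = 50
betas = [0.01 + (0.10 - 0.01) * i / (T - 1) for i in range(T)]
alpha_bar_T = math.prod(1.0 - beta for beta in betas)

rng = random.Random(0)
x_0 = 2.0
samples = [simulate_forward(x_0, betas, rng) for _ in range(5000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"empirical mean {mean:.3f} vs closed form {math.sqrt(alpha_bar_T) * x_0:.3f}")
print(f"empirical var  {var:.3f} vs closed form {1.0 - alpha_bar_T:.3f}")
```

The empirical statistics match $\sqrt{\bar{\alpha}_T}\,x_0$ and $1 - \bar{\alpha}_T$ up to Monte Carlo error.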

Problem 2: Guidance Scale Analysis

A user generates images with guidance scale $w = 3$ and $w = 20$ . The first set looks good but doesn't follow the prompt well. The second set follows the prompt but has color saturation artifacts. Explain why and suggest a solution.

Hint 1 - Direction

Think about what classifier-free guidance does geometrically: it amplifies the difference between conditional and unconditional predictions.

Full Answer + Rubric

At $w = 3$ : The guidance is weak. The model generates plausible images but the conditional signal (text) doesn't dominate enough. The model partially ignores the prompt because the unconditional mode has significant weight.

At $w = 20$ : The guidance is too strong. The update $\hat{\epsilon} = \epsilon_\text{uncond} + (1+w)(\epsilon_\text{cond} - \epsilon_\text{uncond})$ amplifies the conditional direction so much that the predicted noise goes outside the normal range. This pushes pixel values to extremes, causing saturation.

Solutions:

Use a guidance scale of 7 to 8: The empirically validated sweet spot for most prompts.
Dynamic guidance: Use higher guidance for early timesteps (establish structure) and lower guidance for later timesteps (refine details). Some implementations use a cosine schedule for $w$ .
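Thresholding: Clip the predicted $x_0$ at each step. Static thresholding clips to $[-1, 1]$; dynamic thresholding (from the Imagen paper) clips to a per-sample percentile $s$ of $|x_0|$ and rescales by $s$, which prevents saturation at high guidance.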
Better prompt engineering: More detailed prompts often work better at moderate guidance than vague prompts at high guidance.

Scoring:

Strong Hire: Explains the geometric interpretation of guidance, identifies saturation mechanism, suggests dynamic guidance or thresholding
Lean Hire: Knows that guidance scale controls prompt adherence but can't explain the artifacts
No Hire: Suggests "just use a higher scale for better results"

Problem 3: System Design - Text-to-Image Service

Design a text-to-image generation service that handles 1000 requests per minute with <10 second latency. What are your key architectural choices?

Hint 1 - Direction

Consider: model choice (pixel vs latent), number of sampling steps, batching strategy, GPU selection, caching, and whether to use distilled models.

Full Answer + Rubric

Architecture:

Model: Latent diffusion (Stable Diffusion XL or similar). Pixel-space diffusion is too slow at high resolution.
Sampling: DDIM with 20-30 steps or consistency model with 4 steps. Not 1000-step DDPM.
GPU: A100 80GB or H100. A single SDXL inference takes ~2-3 seconds on A100 with 20 steps.
Throughput math: 1000 requests/min = ~17 requests/sec. Assuming a batch of 4 finishes in about 3 s, each GPU processes ~1.3 requests/sec, so you need ~13 GPUs (16 with redundancy).
Batching: Queue incoming requests, batch by similar dimensions. Dynamic batching with max wait time of 500ms.
Caching: Cache text encoder outputs (CLIP embeddings) for repeated prompts. Cache finished images for exact repeat requests (same prompt, seed, and settings).
Optimization: Use TensorRT or torch.compile for inference. Half-precision throughout. Consider quantized models (INT8 U-Net).
Scaling: Horizontal scaling behind a load balancer. GPU autoscaling based on queue depth.
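The throughput estimate is worth being able to reproduce on the spot. The sketch below encodes it as a small function, assuming a batch of 4 completes in 3 seconds and ignoring queueing effects; the 20% headroom factor is an assumed safety margin, not a measured number.

```python
import math

def gpus_needed(requests_per_min, batch_size, seconds_per_batch):
    """GPUs required to keep up with the arrival rate, ignoring queueing effects."""
    arrival_rate = requests_per_min / 60.0         # requests per second
    per_gpu_rate = batch_size / seconds_per_batch  # requests per second per GPU
    return math.ceil(arrival_rate / per_gpu_rate)

# Load and batch figures from the answer above; the headroom factor is assumed.
base = gpus_needed(requests_per_min=1000, batch_size=4, seconds_per_batch=3.0)
with_headroom = math.ceil(base * 1.2)
print(base, with_headroom)  # 13 16
```

Changing `seconds_per_batch` shows how sensitive the fleet size is to sampler choice: halving the step count roughly halves the number of GPUs.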

Scoring:

Strong Hire: Concrete throughput math, correct model/sampling choices, addresses batching and optimization
Lean Hire: Reasonable architecture but no concrete numbers or optimization strategies
No Hire: Suggests running 1000-step DDPM on a single GPU

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Explain diffusion models"	Forward (add noise) → Training (predict noise) → Sampling (iterative denoise)	"Diffusion models learn to reverse a gradual noising process. Training is simple: predict the noise added at a random timestep. Sampling chains many denoising steps."
"What's the training loss?"	$\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ - predict noise from noisy input	"The simplified loss is just MSE between the actual noise and the predicted noise. Sample a timestep, noise the image, predict the noise."
"DDPM vs DDIM?"	DDPM: 1000 steps, stochastic. DDIM: 20-50 steps, deterministic, same model	"DDIM uses a non-Markovian reverse process that allows skipping timesteps. Same model, 20x faster, deterministic sampling."
"How does Stable Diffusion work?"	VAE (compress) → U-Net (denoise latents) → CLIP (text condition) → CFG	"Stable Diffusion runs diffusion in a compressed latent space, using a VAE for compression and CLIP for text conditioning."
"What is classifier-free guidance?"	Train with/without condition → amplify conditional direction at inference	"Drop the conditioning randomly during training. At inference, amplify the difference between conditional and unconditional predictions."
"Why latent diffusion?"	Pixel space is 48x larger → latent space is efficient → same quality	"Running diffusion on 64x64 latents instead of 512x512 pixels gives 48x fewer dimensions with minimal quality loss."

Spaced Repetition Checkpoints

Day 0: Read this page. Write the simplified training loss. Explain the forward process in your own words.
Day 3: Derive $q(x_t|x_0)$ from the single-step $q(x_t|x_{t-1})$ . Explain DDPM vs DDIM sampling.
Day 7: Draw the Stable Diffusion architecture from memory. Explain classifier-free guidance with the formula.
Day 14: Compare diffusion models with GANs and VAEs on 5 dimensions. Explain flow matching in 2 minutes.
Day 21: Solve all three practice problems from memory. Time yourself - 8-10 minutes each.

Next Steps

Continue to RAG Papers for retrieval-augmented generation
Review Attention Is All You Need for the transformer architecture used in DiT
For generative AI system design, see ML System Design
