
Diffusion Model Papers - Generating Images by Learning to Denoise

Reading time: ~35 min | Interview relevance: High | Roles: MLE, Research Engineer, AI Engineer (Generative AI)

The Real Interview Moment

You're interviewing for a generative AI role at a company building image generation products. The interviewer puts a blank whiteboard in front of you and says: "Explain to me, from first principles, how diffusion models generate images. Start with the forward process, derive the training objective, explain how sampling works, and then tell me how Stable Diffusion makes this practical for high-resolution images. I want math, not hand-waving."

You've used Stable Diffusion to generate images. You know it "adds noise and then removes it." But can you write the noise schedule? Can you derive the ELBO? Do you understand why DDIM is faster than DDPM? Can you explain classifier-free guidance and why a guidance scale of 7.5 is typical? The interviewer is testing whether you understand diffusion deeply enough to debug training, improve sampling, or adapt the architecture.

This is the diffusion interview. If you're working on any form of generative AI - images, video, audio, or 3D - diffusion models are foundational knowledge.

What You Will Master

After reading this page, you will be able to:

  • Explain the forward and reverse diffusion processes with full mathematical derivation
  • Derive the simplified training objective (predict the noise)
  • Explain noise schedules and their impact on generation quality
  • Compare DDPM and DDIM sampling and explain the speed-quality trade-off
  • Explain latent diffusion and why it enables high-resolution generation
  • Derive classifier-free guidance and explain the guidance scale
  • Describe the Stable Diffusion architecture end-to-end
  • Answer every common interview question about diffusion models

Part 1 - The Core Idea

Intuition

Diffusion models work by:

  1. Forward process: Gradually add Gaussian noise to data until it becomes pure noise
  2. Reverse process: Learn to gradually remove noise, transforming noise back into data

If you can learn to reverse one small noise step, you can chain many reverse steps together to generate data from pure noise.

[Figure: Diffusion forward and reverse process]

Why Not Just Use GANs or VAEs?

| Property | GANs | VAEs | Diffusion Models |
|---|---|---|---|
| Training stability | Unstable (mode collapse, training failure) | Stable | Stable |
| Sample quality | High (but mode dropping) | Blurry | Highest |
| Mode coverage | Poor (mode collapse) | Good | Excellent |
| Likelihood | Not available | Lower bound (ELBO) | Lower bound (ELBO) |
| Sampling speed | Fast (single forward pass) | Fast (single forward pass) | Slow (many steps) |
| Controllability | Difficult | Moderate | Excellent (guidance) |

60-Second Answer

"Diffusion models learn to reverse a gradual noising process. You train a neural network to predict the noise added at each step, then at generation time, start from pure Gaussian noise and iteratively denoise. They beat GANs because training is stable with no mode collapse, and they beat VAEs because they produce much sharper samples. The trade-off is sampling speed - you need hundreds of denoising steps, though DDIM and distillation reduce this significantly."

Part 2 - The Forward Process

Adding Noise Gradually

The forward process adds Gaussian noise over $T$ timesteps (typically $T = 1000$):

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t \mathbf{I})$$

where $\beta_t$ is the noise schedule - a small variance added at each step.

Key property: You can sample $x_t$ directly from $x_0$ without iterating through all previous steps:

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) \mathbf{I})$$

where:

  • $\alpha_t = 1 - \beta_t$
  • $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ (cumulative product)

This means: $x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})$
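The closed form can be checked numerically. The sketch below is a toy, not part of any library: it treats a single scalar "pixel", uses the linear schedule with the DDPM defaults, and compares the statistics of noising step by step against jumping straight to $x_t$. The helper names (`linear_betas`, `noise_iteratively`, and so on) and the values of `x0`, `t`, and `n` are illustrative assumptions.

```python
import math
import random


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas for steps 1..T (stored 0-indexed)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]


def alpha_bar_at(betas, t):
    """Cumulative product of (1 - beta_s) for s = 1..t."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def noise_iteratively(x0, betas, t, rng):
    """Apply the single-step q(x_s | x_{s-1}) t times in a row."""
    x = x0
    for beta in betas[:t]:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x


def noise_closed_form(x0, betas, t, rng):
    """Jump straight to x_t using the closed-form q(x_t | x_0)."""
    a_bar = alpha_bar_at(betas, t)
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)


def mean_and_var(samples):
    m = sum(samples) / len(samples)
    v = sum((s - m) ** 2 for s in samples) / len(samples)
    return m, v


rng = random.Random(0)
betas = linear_betas(1000)
x0, t, n = 2.0, 200, 4000  # hypothetical values for the demo
iter_stats = mean_and_var([noise_iteratively(x0, betas, t, rng) for _ in range(n)])
closed_stats = mean_and_var([noise_closed_form(x0, betas, t, rng) for _ in range(n)])
a_bar = alpha_bar_at(betas, t)
print("theory    (mean, var):", (math.sqrt(a_bar) * x0, 1.0 - a_bar))
print("iterative (mean, var):", iter_stats)
print("closed    (mean, var):", closed_stats)
```

Both empirical rows should land close to the theoretical mean $\sqrt{\bar{\alpha}_t}\,x_0$ and variance $1 - \bar{\alpha}_t$, up to Monte Carlo error.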

Common Trap

A common mistake is to say "we add noise T times during training." This is wrong. During training, we sample a random timestep $t \sim \text{Uniform}(1, T)$ and directly compute $x_t$ from $x_0$ using the closed-form formula above. The iterative process only happens during sampling (generation).

Noise Schedules

The choice of $\beta_t$ significantly affects quality:

| Schedule | Formula | Properties |
|---|---|---|
| Linear | $\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1)$ | Original DDPM. Noises too aggressively: the final steps are already close to pure noise. |
| Cosine | $\bar{\alpha}_t = \frac{f(t)}{f(0)}$, $f(t) = \cos\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)^2$ | Smoother decay of $\bar{\alpha}_t$. Nichol & Dhariwal (2021) introduced it because the linear schedule was sub-optimal at low resolutions such as 64x64. |
| Sigmoid | $\beta_t = \sigma(a + (b-a) \cdot t/T)$ where $a, b$ are endpoints | Used in some modern architectures |

```python
import numpy as np
import torch


def linear_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Original DDPM linear schedule: T betas evenly spaced in [beta_start, beta_end]."""
    return torch.linspace(beta_start, beta_end, T)


def cosine_schedule(T, s=0.008):
    """Improved cosine schedule (Nichol & Dhariwal, 2021)."""
    steps = T + 1
    t = torch.linspace(0, T, steps) / T  # normalized time in [0, 1]
    alphas_cumprod = torch.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]  # normalize so alpha_bar at t=0 is 1
    # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clamp(betas, 0.0001, 0.999)
```
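To see how the two schedules differ, it helps to print $\bar{\alpha}_t$ (the fraction of signal variance that survives) at a few timesteps. The following is a dependency-free sketch under the same defaults as above; it ignores the clamping of betas, and the function names are illustrative.

```python
import math


def alpha_bar_linear(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for t = 1..T under the linear beta schedule."""
    out, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def alpha_bar_cosine(T, s=0.008):
    """alpha_bar_t for t = 1..T under the cosine schedule, f(t) / f(0)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]


T = 1000
lin, cos = alpha_bar_linear(T), alpha_bar_cosine(T)
for t in (100, 250, 500, 750, 1000):
    # Index t - 1 because the lists are 0-indexed
    print(f"t={t:4d}  linear={lin[t - 1]:.4f}  cosine={cos[t - 1]:.4f}")
```

Under these settings the linear schedule has already discarded most of the signal by the midpoint, while the cosine schedule still retains roughly half of it, which is the sense in which cosine is "smoother".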

Part 3 - The Reverse Process and Training Objective

The Reverse Process

The reverse process learns to denoise:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

The neural network $\mu_\theta$ predicts the mean of the denoised distribution. In practice, there are three parameterizations. Given $x_t$ and $t$, each can be converted into the others, so they differ only in how the loss weights each timestep:

| Prediction target | What the network outputs | Training loss |
|---|---|---|
| Predict noise $\epsilon_\theta(x_t, t)$ | The noise that was added | $\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2$ |
| Predict $x_0$ via $\hat{x}_\theta(x_t, t)$ | The clean image | $\lVert x_0 - \hat{x}_\theta(x_t, t) \rVert^2$ |
| Predict velocity $v_\theta(x_t, t)$ | $v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1-\bar\alpha_t}\,x_0$ | $\lVert v - v_\theta(x_t, t) \rVert^2$ |

The noise prediction parameterization is most common (used in DDPM, Stable Diffusion).
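The conversions between the three targets follow directly from $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and the definition of $v$. A minimal sketch on scalars (the function names and the demo values are illustrative, not from any library):

```python
import math


def x0_from_eps(x_t, eps, a_bar):
    """Recover x_0 from a noise prediction."""
    return (x_t - math.sqrt(1 - a_bar) * eps) / math.sqrt(a_bar)


def v_from_eps_x0(eps, x0, a_bar):
    """Velocity target: v = sqrt(a_bar) * eps - sqrt(1 - a_bar) * x_0."""
    return math.sqrt(a_bar) * eps - math.sqrt(1 - a_bar) * x0


def x0_from_v(x_t, v, a_bar):
    """Recover x_0 from a velocity prediction."""
    return math.sqrt(a_bar) * x_t - math.sqrt(1 - a_bar) * v


def eps_from_v(x_t, v, a_bar):
    """Recover the noise from a velocity prediction."""
    return math.sqrt(1 - a_bar) * x_t + math.sqrt(a_bar) * v


# Round trip on one scalar "pixel" with hypothetical values
a_bar, x0, eps = 0.6, 0.3, -1.2
x_t = math.sqrt(a_bar) * x0 + math.sqrt(1 - a_bar) * eps
v = v_from_eps_x0(eps, x0, a_bar)
print(x0_from_eps(x_t, eps, a_bar), x0_from_v(x_t, v, a_bar), eps_from_v(x_t, v, a_bar))
```

The first two printed values should both recover `x0` and the third should recover `eps`, up to floating-point error.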

Deriving the Training Objective

The full variational bound on the negative log-likelihood (the negative ELBO) for diffusion models:

$$\mathcal{L} = \mathbb{E}_q \left[ D_\text{KL}(q(x_T \mid x_0) \,\|\, p(x_T)) + \sum_{t=2}^{T} D_\text{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) - \log p_\theta(x_0 \mid x_1) \right]$$

The posterior $q(x_{t-1} \mid x_t, x_0)$ has a closed form: the forward transitions are Gaussian, so Bayes' rule gives a Gaussian posterior:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t \mathbf{I})$$

where:

$$\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}} \, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t$$

$$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$$
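One step that is easy to skip is how this posterior mean turns into a sampling rule that uses the predicted noise. Substituting $x_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon)/\sqrt{\bar{\alpha}_t}$ into $\tilde{\mu}_t$ gives $\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right)$. The sketch below checks that identity numerically on scalars; the function names and the chosen values are illustrative.

```python
import math


def posterior_params(x_t, x0, alpha_t, a_bar_t, a_bar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for scalar inputs."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    mean = coef_x0 * x0 + coef_xt * x_t
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    return mean, var


def mean_from_eps(x_t, eps, alpha_t, a_bar_t):
    """The same mean, written in terms of the noise eps instead of x_0."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - a_bar_t) * eps) / math.sqrt(alpha_t)


# Hypothetical values for a single step
a_bar_prev, alpha_t = 0.70, 0.98
a_bar_t = a_bar_prev * alpha_t  # alpha_bar_t = alpha_bar_{t-1} * alpha_t
x0, eps = 0.5, 1.3
x_t = math.sqrt(a_bar_t) * x0 + math.sqrt(1.0 - a_bar_t) * eps

mean_post, var_post = posterior_params(x_t, x0, alpha_t, a_bar_t, a_bar_prev)
mean_eps = mean_from_eps(x_t, eps, alpha_t, a_bar_t)
print(mean_post, mean_eps, var_post)
```

The two means should agree to floating-point precision, and the posterior variance $\tilde{\beta}_t$ is always smaller than $\beta_t$.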

The Simplified Loss

Ho et al. (2020) found that a simplified loss, which drops the per-timestep weighting of the ELBO terms, gives better sample quality in practice:

$$\mathcal{L}_\text{simple} = \mathbb{E}_{t, x_0, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta(\underbrace{\sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1-\bar{\alpha}_t} \, \epsilon}_{x_t}, t) \right\|^2 \right]$$

In words: Sample a clean image $x_0$, a random timestep $t$, and random noise $\epsilon$. Noise the image to get $x_t$. Train the network to predict $\epsilon$ from $x_t$ and $t$.

```python
import torch
import torch.nn as nn


class DiffusionTrainer:
    """Simplified DDPM training loop."""

    def __init__(self, model, T=1000, beta_start=1e-4, beta_end=0.02):
        self.model = model
        self.T = T

        # Precompute schedule: alpha_bar[t] is the cumulative product of (1 - beta)
        betas = torch.linspace(beta_start, beta_end, T)
        alphas = 1 - betas
        self.alpha_bar = torch.cumprod(alphas, dim=0)

    def training_step(self, x_0):
        """One training step on a batch x_0 of shape (B, C, H, W); returns the loss."""
        batch_size = x_0.shape[0]

        # Sample a random (0-indexed) timestep for each image, on the same device as x_0
        t = torch.randint(0, self.T, (batch_size,), device=x_0.device)

        # Sample noise
        epsilon = torch.randn_like(x_0)

        # Create noisy image: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
        alpha_bar_t = self.alpha_bar.to(x_0.device)[t].view(-1, 1, 1, 1)
        x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * epsilon

        # Predict noise
        epsilon_pred = self.model(x_t, t)

        # Simple MSE loss
        loss = nn.functional.mse_loss(epsilon_pred, epsilon)
        return loss
```

Instant Rejection

Never confuse the forward process (adding noise) with the training objective. The forward process is not learned - it's fixed. The training objective is to predict the noise. Many candidates mix these up and describe training as "learning to add noise," which is backwards.

Part 4 - Sampling: DDPM vs DDIM

DDPM Sampling (Stochastic)

DDPM sampling follows the reverse process step by step:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sqrt{\beta_t} \, z$$

where $z \sim \mathcal{N}(0, \mathbf{I})$ adds stochasticity (with $z = 0$ on the final step, $t = 1$).

Problem: Requires all $T$ steps (typically 1000), which is very slow.
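The update rule can be exercised end to end on a toy problem where the ideal noise prediction is known in closed form. The sketch below assumes 1-D data $x_0 \sim \mathcal{N}(m, 1)$, for which $\mathbb{E}[\epsilon \mid x_t] = \sqrt{1-\bar{\alpha}_t}\,(x_t - \sqrt{\bar{\alpha}_t}\,m)$; this `oracle_eps` stands in for a trained network and is purely an assumption of the demo. It also uses a short 200-step schedule with a larger `beta_end` so that it runs quickly in plain Python.

```python
import math
import random


def make_schedule(T, beta_start=1e-4, beta_end=0.1):
    """Linear betas and their cumulative alpha_bar (toy settings, not the DDPM defaults)."""
    betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars


def oracle_eps(x_t, a_bar, data_mean):
    """Exact E[eps | x_t] when x_0 ~ N(data_mean, 1); replaces the neural network."""
    return math.sqrt(1.0 - a_bar) * (x_t - math.sqrt(a_bar) * data_mean)


def ddpm_sample(betas, alpha_bars, data_mean, rng):
    """Run the DDPM update from t = T down to t = 1 for one scalar sample."""
    x = rng.gauss(0.0, 1.0)  # x_T: pure noise
    for t in reversed(range(len(betas))):  # 0-indexed: T-1, ..., 0
        beta, a_bar = betas[t], alpha_bars[t]
        eps = oracle_eps(x, a_bar, data_mean)
        mean = (x - beta / math.sqrt(1.0 - a_bar) * eps) / math.sqrt(1.0 - beta)
        z = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the final step
        x = mean + math.sqrt(beta) * z
    return x


rng = random.Random(0)
betas, alpha_bars = make_schedule(T=200)
samples = [ddpm_sample(betas, alpha_bars, data_mean=3.0, rng=rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(sample_mean, sample_std)
```

Starting from standard normal noise, the samples should end up with mean near 3 and standard deviation near 1, i.e. the data distribution. With a learned $\epsilon_\theta$ the loop is the same; only `oracle_eps` changes.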

DDIM Sampling (Deterministic)

DDIM (Song et al., 2021) derives a non-Markovian reverse process that allows skipping steps:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1-\bar{\alpha}_t} \, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon_\theta(x_t, t) + \sigma_t z$$

When $\sigma_t = 0$ the update is fully deterministic. Because the process is non-Markovian, DDIM can also run on a sub-sequence of timesteps, e.g., $\{T, T-k, T-2k, \ldots, 0\}$, reducing steps from 1000 to 50 or even 20.

| Method | Steps | Quality (FID) | Speed | Deterministic? |
|---|---|---|---|---|
| DDPM | 1000 | Best | Slowest | No (stochastic) |
| DDIM (50 steps) | 50 | Near DDPM | 20x faster | Yes (optional) |
| DDIM (20 steps) | 20 | Good | 50x faster | Yes (optional) |
| DDIM (10 steps) | 10 | Acceptable | 100x faster | Yes (optional) |

```python
import torch


@torch.no_grad()
def ddim_sample(model, shape, alpha_bar, T=1000, num_steps=50, eta=0.0):
    """
    DDIM sampling with configurable steps.
    alpha_bar: 1-D tensor of length T holding the cumulative products of (1 - beta_t).
    eta=0: deterministic, eta=1: DDPM-style stochastic sampling.
    """
    # Create a decreasing sub-sequence of (0-indexed) timesteps
    step_size = T // num_steps
    timesteps = list(range(0, T, step_size))[::-1]  # e.g. [980, 960, ..., 0] for T=1000, 50 steps

    x = torch.randn(shape, device=alpha_bar.device)  # Start from pure noise

    for i, t in enumerate(timesteps):
        t_tensor = torch.full((shape[0],), t, dtype=torch.long, device=x.device)

        # Predict noise
        eps_pred = model(x, t_tensor)

        # Predict x_0
        alpha_bar_t = alpha_bar[t]
        x0_pred = (x - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)

        if i < len(timesteps) - 1:
            t_next = timesteps[i + 1]
            alpha_bar_next = alpha_bar[t_next]

            # DDIM update: sigma controls how much fresh noise is injected
            sigma = eta * torch.sqrt((1 - alpha_bar_next) / (1 - alpha_bar_t)) * \
                torch.sqrt(1 - alpha_bar_t / alpha_bar_next)

            # Direction pointing back toward x_t
            dir_xt = torch.sqrt(1 - alpha_bar_next - sigma**2) * eps_pred

            x = torch.sqrt(alpha_bar_next) * x0_pred + dir_xt

            if sigma > 0:
                x = x + sigma * torch.randn_like(x)
        else:
            # Final step: return the predicted clean sample
            x = x0_pred

    return x
```
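The `eta` parameter interpolates between deterministic DDIM and DDPM-style sampling. For adjacent timesteps, where $\beta_t = 1 - \bar{\alpha}_t / \bar{\alpha}_{t-1}$, the DDIM noise scale at `eta = 1` equals the posterior standard deviation $\sqrt{\tilde{\beta}_t}$. A small numeric check of that claim, with illustrative function names and values:

```python
import math


def ddim_sigma(a_bar_t, a_bar_prev, eta):
    """DDIM noise scale for a step from alpha_bar_t to alpha_bar_prev (less noisy)."""
    return (eta
            * math.sqrt((1.0 - a_bar_prev) / (1.0 - a_bar_t))
            * math.sqrt(1.0 - a_bar_t / a_bar_prev))


def posterior_std(a_bar_t, a_bar_prev):
    """sqrt(beta_tilde_t) with beta_t = 1 - alpha_bar_t / alpha_bar_prev (adjacent steps)."""
    beta_t = 1.0 - a_bar_t / a_bar_prev
    beta_tilde = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    return math.sqrt(beta_tilde)


a_bar_prev, a_bar_t = 0.70, 0.686  # hypothetical adjacent values
sigma_eta1 = ddim_sigma(a_bar_t, a_bar_prev, eta=1.0)
sigma_eta0 = ddim_sigma(a_bar_t, a_bar_prev, eta=0.0)
print(sigma_eta1, posterior_std(a_bar_t, a_bar_prev), sigma_eta0)
```

The first two numbers should match, and `eta = 0` gives a noise scale of zero, which is what makes the sampler deterministic.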

[Figure: DDPM vs DDIM sampling comparison]

Part 5 - Score-Based Generative Models

Connection to Score Matching

Song & Ermon (2019) showed diffusion models can be understood through score matching.

The score function is the gradient of the log probability density:

$$\nabla_x \log p(x)$$

The score points toward regions of higher data density. Following it with Langevin dynamics generates samples:

$$x_{t+1} = x_t + \frac{\eta}{2} \nabla_x \log p(x_t) + \sqrt{\eta} \, z$$

The problem: The true data density is unknown, so $\nabla_x \log p(x)$ cannot be computed directly for complex distributions.

The solution: Train a neural network $s_\theta(x, t) \approx \nabla_x \log q_t(x)$ to approximate the score at each noise level.
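The Langevin update is easiest to understand on a distribution whose score is known exactly. The sketch below assumes a 1-D Gaussian target $\mathcal{N}(2, 1)$, for which the score is $-(x - \mu)/\sigma^2$; the step size, chain length, and names are illustrative choices. In a score-based model, `gaussian_score` would be replaced by the learned $s_\theta$.

```python
import math
import random


def gaussian_score(x, mean, std):
    """Score of N(mean, std^2): d/dx log p(x) = -(x - mean) / std^2."""
    return -(x - mean) / (std ** 2)


def langevin_chain(x_init, mean, std, step, n_steps, rng):
    """Run x <- x + (step / 2) * score(x) + sqrt(step) * z for n_steps iterations."""
    x = x_init
    for _ in range(n_steps):
        x = x + 0.5 * step * gaussian_score(x, mean, std) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
# 2000 independent chains, all started far from the target mean
samples = [langevin_chain(0.0, mean=2.0, std=1.0, step=0.1, n_steps=200, rng=rng)
           for _ in range(2000)]
chain_mean = sum(samples) / len(samples)
chain_std = math.sqrt(sum((s - chain_mean) ** 2 for s in samples) / len(samples))
print(chain_mean, chain_std)
```

The chains should settle near mean 2 and standard deviation 1. The finite step size introduces a small bias in the variance, which shrinks as `step` decreases.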

Unified Framework

Song et al. (2021) unified DDPM and score-based models under stochastic differential equations (SDEs):

Forward SDE (noise addition): $dx = f(x, t) \, dt + g(t) \, dw$

Reverse SDE (denoising): $dx = [f(x, t) - g(t)^2 \nabla_x \log p_t(x)] \, dt + g(t) \, d\bar{w}$

The neural network learns $\nabla_x \log p_t(x)$, which is equivalent to learning $\epsilon_\theta$ (noise prediction) up to a scaling factor:

$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$

Company Variation
  • OpenAI: Used guided diffusion (DALL-E 2); its researchers later introduced consistency models for fast sampling.
  • Stability AI: Stable Diffusion (latent diffusion), open-source.
  • Google: Imagen uses diffusion in pixel space with large text encoders (T5).
  • Meta: Uses flow matching (continuous-time generalization of diffusion) for newer models.

Part 6 - Latent Diffusion and Stable Diffusion

The Resolution Problem

Running diffusion in pixel space is extremely expensive. For a 512x512 RGB image:

  • Input dimension: $512 \times 512 \times 3 = 786{,}432$
  • Each denoising step processes this full-resolution tensor
  • Memory and compute grow quadratically with image side length, and self-attention grows faster still (quadratically in the number of pixels)

Latent Diffusion Models (LDMs)

Rombach et al. (2022) proposed running diffusion in a compressed latent space:

  1. Train a VAE (encoder $\mathcal{E}$ + decoder $\mathcal{D}$) to compress images: $z = \mathcal{E}(x) \in \mathbb{R}^{h/f \times w/f \times c}$
  2. Run diffusion in latent space: Add noise to $z$, train denoiser on $z$, sample in $z$-space
  3. Decode to pixels: $\hat{x} = \mathcal{D}(\hat{z})$

With downsampling factor $f = 8$: a 512x512 image becomes a 64x64x4 latent - 48x fewer dimensions (786,432 / 16,384).
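The 48x figure is simple arithmetic, and the same calculation works for other resolutions. A small sketch, assuming the downsampling factor and 4 latent channels quoted above (the function names are illustrative):

```python
def latent_shape(height, width, f=8, channels=4):
    """Shape of the VAE latent for an image of the given size."""
    return (height // f, width // f, channels)


def compression_ratio(height, width, f=8, latent_channels=4, pixel_channels=3):
    """Ratio of pixel-space dimensions to latent-space dimensions."""
    h, w, c = latent_shape(height, width, f, latent_channels)
    return (height * width * pixel_channels) / (h * w * c)


print(latent_shape(512, 512))        # (64, 64, 4)
print(compression_ratio(512, 512))   # 48.0
print(latent_shape(1024, 1024))      # (128, 128, 4)
```

Because both the pixel and latent tensors scale with height times width, the ratio stays at 48 for any resolution divisible by 8.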

[Figure: Latent diffusion - compress, then diffuse]

The Stable Diffusion Architecture

[Figure: Stable Diffusion architecture]

Key components:

| Component | Architecture | Role | Parameters |
|---|---|---|---|
| VAE | KL-regularized autoencoder | Compress/decompress images | ~84M |
| U-Net | U-Net with attention | Denoise latents | ~860M |
| Text encoder | CLIP ViT-L/14 | Encode text prompts | ~123M |
| Total | - | - | ~1.07B |

Part 7 - Classifier-Free Guidance

The Problem

Conditional diffusion models can generate images conditioned on text, but the conditioning is often weak - the model ignores parts of the prompt.

Classifier Guidance (Dhariwal & Nichol, 2021)

Use a separate classifier $p(c \mid x_t)$ to guide sampling:

$$\hat{\epsilon}(x_t, t, c) = \epsilon_\theta(x_t, t) - \sqrt{1 - \bar{\alpha}_t} \cdot s \cdot \nabla_{x_t} \log p(c \mid x_t)$$

Problem: Requires training a separate classifier on noisy images.

Classifier-Free Guidance (Ho & Salimans, 2022)

Key insight: You can get the same effect without a classifier by training the model with and without conditioning (randomly dropping the text prompt during training).

At sampling time, combine the conditional and unconditional predictions:

$$\hat{\epsilon}(x_t, t, c) = (1 + w) \cdot \epsilon_\theta(x_t, t, c) - w \cdot \epsilon_\theta(x_t, t, \varnothing)$$

where:

  • $\epsilon_\theta(x_t, t, c)$ is the conditional prediction (with text prompt)
  • $\epsilon_\theta(x_t, t, \varnothing)$ is the unconditional prediction (text dropped)
  • $w$ is the guidance scale

Rearranging:

$$\hat{\epsilon} = \epsilon_\text{uncond} + (1 + w)(\epsilon_\text{cond} - \epsilon_\text{uncond})$$

This amplifies the "direction" that the conditioning provides.
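The two forms of the guidance formula are the same computation. The sketch below applies both to short lists standing in for noise predictions (the values and function names are made up for illustration) and shows that $w = 0$ returns the plain conditional prediction.

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """(1 + w) * eps_cond - w * eps_uncond, elementwise over flat lists."""
    return [(1 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


def cfg_rearranged(eps_cond, eps_uncond, w):
    """eps_uncond + (1 + w) * (eps_cond - eps_uncond), elementwise."""
    return [u + (1 + w) * (c - u) for c, u in zip(eps_cond, eps_uncond)]


eps_cond = [0.2, -0.5, 1.0]     # hypothetical conditional prediction
eps_uncond = [0.1, -0.4, 0.7]   # hypothetical unconditional prediction

print(cfg_combine(eps_cond, eps_uncond, w=0.0))   # same as eps_cond
print(cfg_combine(eps_cond, eps_uncond, w=6.5))
print(cfg_rearranged(eps_cond, eps_uncond, w=6.5))
```

In a real pipeline the two predictions are tensors, typically obtained from one batched forward pass with the prompt and an empty prompt; the combination step itself is just this weighted difference.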

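Stable Diffusion implementations usually report the scale $s = 1 + w$, so that $\hat{\epsilon} = \epsilon_\text{uncond} + s \, (\epsilon_\text{cond} - \epsilon_\text{uncond})$. The table below uses $s$:

| Guidance scale ($s = 1 + w$) | Effect | Quality |
|---|---|---|
| 0 | No guidance (unconditional generation) | Low prompt adherence |
| 1-3 | Mild guidance ($s = 1$ is the plain conditional model) | Natural but may ignore prompt details |
| 7-8 | Standard (Stable Diffusion default is 7.5) | Good balance |
| 10-15 | Strong guidance | High prompt adherence but saturated/unnatural |
| 20+ | Extreme guidance | Artifacts, oversaturation |
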
60-Second Answer

"Classifier-free guidance trains the diffusion model to work both with and without text conditioning by randomly dropping the prompt during training. At inference, you compute both the conditional and unconditional noise predictions and amplify their difference by a guidance scale. Higher guidance means stronger prompt adherence but at the cost of diversity and naturalness. A scale of 7-8 is the typical sweet spot for Stable Diffusion."

Part 8 - Modern Advances

Consistency Models (Song et al., 2023)

Map any point on the diffusion trajectory directly to $x_0$ in a single step:

$$f_\theta(x_t, t) = x_0 \quad \text{for all } t$$

Trained via consistency distillation (distill from a pre-trained diffusion model) or consistency training (train from scratch).

Result: 1-4 step generation with quality approaching 50-step DDIM.

Flow Matching (Lipman et al., 2023)

Replaces the noising/denoising framework with optimal transport between noise and data distributions. The model learns a velocity field that transports samples from noise to data along straight paths.

$$\frac{dx}{dt} = v_\theta(x, t)$$

Advantages: Simpler training, straight paths (fewer steps needed), better theoretical properties.
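A toy example makes the "straight paths" claim concrete. The sketch below assumes the linear interpolation path $x_t = (1 - t)\,x_\text{noise} + t\,x_\text{data}$, one common convention, and a single target point, for which the exact velocity field is $v(x, t) = (x_\text{data} - x)/(1 - t)$. This oracle replaces the learned $v_\theta$, and all names are illustrative.

```python
def euler_flow(x_noise, x_data, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps.

    Uses the oracle velocity v(x, t) = (x_data - x) / (1 - t), which is exact
    when the data distribution is a single point.
    """
    x = x_noise
    path = [x]
    for k in range(n_steps):
        t = k / n_steps
        v = (x_data - x) / (1.0 - t)
        x = x + v / n_steps  # Euler step of size 1 / n_steps
        path.append(x)
    return path


path = euler_flow(x_noise=-1.0, x_data=2.0, n_steps=4)
print(path)  # evenly spaced points on the line from -1.0 to 2.0
```

Because the velocity is constant along a straight path, even a handful of Euler steps lands on the target. With real data the learned field is only approximately straight, so the benefit is fewer steps rather than one step.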

DiT (Diffusion Transformers)

Replace the U-Net with a Vision Transformer (ViT) architecture:

  • Peebles & Xie (2023): DiT outperforms U-Net on ImageNet at the same compute
  • Used in Sora and Stable Diffusion 3
  • Better scalability - transformers scale more predictably than U-Nets

Architecture Evolution

[Figure: Diffusion model architecture evolution, DDPM to DiT]

Part 9 - Diffusion Models Beyond Images

| Domain | Model | Key Innovation |
|---|---|---|
| Video | Sora, Runway Gen-3 | Temporal attention over space-time patches or spatio-temporal U-Nets |
| Audio | AudioLDM, Stable Audio | Mel-spectrogram latent space |
| 3D | DreamFusion, Zero-1-to-3 | Score distillation sampling (SDS) |
| Protein | RFdiffusion | Diffusion on 3D coordinates and residue types |
| Molecular | GeoDiff | Diffusion on molecular conformations |
| Music | Riffusion | Stable Diffusion fine-tuned on spectrogram images (MusicGen, often listed here, is autoregressive rather than diffusion-based) |
| Policy (RL) | Diffuser | Diffusion over state-action trajectories |

Part 10 - Practice Problems

Problem 1: Forward Process Derivation

Show that $q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1-\bar{\alpha}_t)\mathbf{I})$ by recursively applying the single-step noise addition $q(x_t \mid x_{t-1})$.

Hint 1 - Direction

Use the reparameterization trick at each step: $x_t = \sqrt{\alpha_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon_t$. Substitute $x_{t-1}$ in terms of $x_{t-2}$, and so on.

Full Answer

Step 1: Using the reparameterization trick:

$$x_t = \sqrt{\alpha_t} \, x_{t-1} + \sqrt{1 - \alpha_t} \, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, \mathbf{I})$$

Step 2: Substitute $x_{t-1} = \sqrt{\alpha_{t-1}} \, x_{t-2} + \sqrt{1 - \alpha_{t-1}} \, \epsilon_{t-1}$:

$$x_t = \sqrt{\alpha_t \alpha_{t-1}} \, x_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})} \, \epsilon_{t-1} + \sqrt{1-\alpha_t} \, \epsilon_t$$

Step 3: The last two terms are independent zero-mean Gaussians, so their sum is Gaussian with variance:

$$\alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$$

So: $x_t = \sqrt{\alpha_t \alpha_{t-1}} \, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \, \bar{\epsilon}$

Step 4: Continuing recursively to $x_0$:

$$x_t = \sqrt{\prod_{s=1}^t \alpha_s} \, x_0 + \sqrt{1 - \prod_{s=1}^t \alpha_s} \, \epsilon = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon$$
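The variance bookkeeping in Steps 3 and 4 can be verified mechanically: each step maps the accumulated noise variance $v$ to $\alpha_t v + (1 - \alpha_t)$, and after $t$ steps this should equal $1 - \bar{\alpha}_t$. A short check using the linear-schedule values from earlier (function names are illustrative):

```python
def variance_by_recursion(alphas):
    """Track Var[x_t | x_0] through var_t = alpha_t * var_{t-1} + (1 - alpha_t)."""
    var = 0.0  # x_0 is given, so there is no noise at t = 0
    for alpha in alphas:
        var = alpha * var + (1.0 - alpha)
    return var


def variance_closed_form(alphas):
    """1 - alpha_bar_t, where alpha_bar_t is the product of the alphas."""
    prod = 1.0
    for alpha in alphas:
        prod *= alpha
    return 1.0 - prod


# alpha_t = 1 - beta_t for the linear schedule, beta from 1e-4 to 0.02 over 1000 steps
alphas = [1.0 - (1e-4 + (0.02 - 1e-4) * i / 999) for i in range(1000)]
for t in (1, 10, 300, 1000):
    print(t, variance_by_recursion(alphas[:t]), variance_closed_form(alphas[:t]))
```

The two columns should agree at every $t$, which is the scalar version of the induction in the answer above.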

Problem 2: Guidance Scale Analysis

A user generates images with guidance scale $w = 3$ and $w = 20$. The first set looks good but doesn't follow the prompt well. The second set follows the prompt but has color saturation artifacts. Explain why and suggest a solution.

Hint 1 - Direction

Think about what classifier-free guidance does geometrically: it amplifies the difference between conditional and unconditional predictions.

Full Answer + Rubric

At $w = 3$: The guidance is weak. The model generates plausible images but the conditional signal (text) doesn't dominate enough. The model partially ignores the prompt because the unconditional mode has significant weight.

At $w = 20$: The guidance is too strong. The update $\hat{\epsilon} = \epsilon_\text{uncond} + (1+w)(\epsilon_\text{cond} - \epsilon_\text{uncond})$ amplifies the conditional direction so much that the guided prediction leaves the range seen during training. The implied $x_0$ then overshoots $[-1, 1]$, which pushes pixel values to extremes and causes saturation.

Solutions:

  1. Use a moderate scale, around 7-8 in Stable Diffusion's convention: the empirically validated sweet spot for most prompts.

  2. Dynamic guidance: Use higher guidance for early timesteps (establish structure) and lower guidance for later timesteps (refine details). Some implementations use a cosine schedule for $w$.

  3. Thresholding: Clip the predicted $x_0$ to $[-1, 1]$ at each step (static thresholding), or use dynamic thresholding from the Imagen paper: clip to $[-s, s]$, where $s$ is a high percentile of $|x_0|$, then divide by $s$. Both prevent saturation.

  4. Better prompt engineering: More detailed prompts often work better at moderate guidance than vague prompts at high guidance.
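The thresholding option in the list above can be sketched in a few lines. This follows the description in the Imagen paper (choose $s$ as a percentile of the absolute predicted values, and only rescale when $s > 1$); the percentile value, the nearest-rank percentile, and the flat-list representation are simplifying assumptions.

```python
def static_threshold(x0_pred):
    """Clip every predicted value to [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in x0_pred]


def dynamic_threshold(x0_pred, percentile=0.995):
    """Clip to [-s, s] and divide by s, where s is a percentile of |x0_pred|.

    If s <= 1 this reduces to static clipping, since s is floored at 1.
    """
    magnitudes = sorted(abs(v) for v in x0_pred)
    index = min(len(magnitudes) - 1, int(percentile * len(magnitudes)))
    s = max(1.0, magnitudes[index])
    return [max(-s, min(s, v)) / s for v in x0_pred]


x0_pred = [-3.0, -0.5, 0.0, 0.5, 1.5, 2.0]  # hypothetical over-saturated prediction
print(static_threshold(x0_pred))
print(dynamic_threshold(x0_pred, percentile=0.8))
```

Static clipping flattens every out-of-range value to the boundary, while dynamic thresholding rescales the whole prediction, so values that differed before (1.5 and 2.0 here) remain distinguishable.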

Scoring:

  • Strong Hire: Explains the geometric interpretation of guidance, identifies saturation mechanism, suggests dynamic guidance or thresholding
  • Lean Hire: Knows that guidance scale controls prompt adherence but can't explain the artifacts
  • No Hire: Suggests "just use a higher scale for better results"

Problem 3: System Design - Text-to-Image Service

Design a text-to-image generation service that handles 1000 requests per minute with <10 second latency. What are your key architectural choices?

Hint 1 - Direction

Consider: model choice (pixel vs latent), number of sampling steps, batching strategy, GPU selection, caching, and whether to use distilled models.

Full Answer + Rubric

Architecture:

  1. Model: Latent diffusion (Stable Diffusion XL or similar). Pixel-space diffusion is too slow at high resolution.

  2. Sampling: DDIM with 20-30 steps or consistency model with 4 steps. Not 1000-step DDPM.

  3. GPU: A100 80GB or H100. A single SDXL inference takes ~2-3 seconds on A100 with 20 steps.

  4. Throughput math: 1000 requests/min = ~17 requests/sec. Assuming a batch of 4 completes in about 3s on one GPU, each GPU processes ~1.3 requests/sec, so ~13 GPUs are needed (16 with redundancy).

  5. Batching: Queue incoming requests, batch by similar dimensions. Dynamic batching with max wait time of 500ms.

  6. Caching: Cache text encoder outputs (CLIP embeddings for repeated prompts). For exact repeat requests (same prompt, seed, and settings), cache the final image instead of re-running the pipeline.

  7. Optimization: Use TensorRT or torch.compile for inference. Half-precision throughout. Consider quantized models (INT8 U-Net).

  8. Scaling: Horizontal scaling behind a load balancer. GPU autoscaling based on queue depth.
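The throughput estimate in the list above is worth being able to reproduce quickly on a whiteboard. A minimal sketch of the arithmetic, assuming a full batch completes in a fixed latency; the function name, parameters, and the 25% headroom figure are illustrative.

```python
import math


def gpus_needed(requests_per_minute, batch_latency_s, batch_size, headroom=0.0):
    """GPUs required to sustain the request rate, with optional fractional headroom."""
    per_gpu_rps = batch_size / batch_latency_s   # requests per second per GPU
    required_rps = requests_per_minute / 60.0    # incoming requests per second
    return math.ceil(required_rps / per_gpu_rps * (1.0 + headroom))


print(gpus_needed(1000, batch_latency_s=3.0, batch_size=4))                  # 13
print(gpus_needed(1000, batch_latency_s=3.0, batch_size=4, headroom=0.25))   # 16
```

With a 500ms maximum batching wait plus roughly 3s of inference, per-request latency stays well under the 10 second budget as long as the queue does not back up, which is what the headroom is for.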

Scoring:

  • Strong Hire: Concrete throughput math, correct model/sampling choices, addresses batching and optimization
  • Lean Hire: Reasonable architecture but no concrete numbers or optimization strategies
  • No Hire: Suggests running 1000-step DDPM on a single GPU

Interview Cheat Sheet

| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Explain diffusion models" | Forward (add noise) → Training (predict noise) → Sampling (iterative denoise) | "Diffusion models learn to reverse a gradual noising process. Training is simple: predict the noise added at a random timestep. Sampling chains many denoising steps." |
| "What's the training loss?" | $\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2$ - predict noise from noisy input | "The simplified loss is just MSE between the actual noise and the predicted noise. Sample a timestep, noise the image, predict the noise." |
| "DDPM vs DDIM?" | DDPM: 1000 steps, stochastic. DDIM: 20-50 steps, deterministic, same model | "DDIM uses a non-Markovian reverse process that allows skipping timesteps. Same model, 20x faster, deterministic sampling." |
| "How does Stable Diffusion work?" | VAE (compress) → U-Net (denoise latents) → CLIP (text condition) → CFG | "Stable Diffusion runs diffusion in a compressed latent space, using a VAE for compression and CLIP for text conditioning." |
| "What is classifier-free guidance?" | Train with/without condition → amplify conditional direction at inference | "Drop the conditioning randomly during training. At inference, amplify the difference between conditional and unconditional predictions." |
| "Why latent diffusion?" | Pixel space is 48x larger → latent space is efficient → same quality | "Running diffusion on 64x64 latents instead of 512x512 pixels gives 48x fewer dimensions with minimal quality loss." |

Spaced Repetition Checkpoints

  • Day 0: Read this page. Write the simplified training loss. Explain the forward process in your own words.
  • Day 3: Derive $q(x_t \mid x_0)$ from the single-step $q(x_t \mid x_{t-1})$. Explain DDPM vs DDIM sampling.
  • Day 7: Draw the Stable Diffusion architecture from memory. Explain classifier-free guidance with the formula.
  • Day 14: Compare diffusion models with GANs and VAEs on 5 dimensions. Explain flow matching in 2 minutes.
  • Day 21: Solve all three practice problems from memory. Time yourself - 8-10 minutes each.

© 2026 EngineersOfAI. All rights reserved.