Diffusion Models
Reading time: ~35 min | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Engineer, MLOps Engineer
The Day Synthetic Faces Became Indistinguishable
It was 2022 and your team was building an avatar generation feature for a social platform. You had been working with GANs - specifically StyleGAN2 - for six months. The results were impressive on headshots: photorealistic faces at 1024x1024. But every time someone requested an avatar with an unusual pose, non-white lighting, or a non-standard composition, the GAN would fail catastrophically. Mode collapse, checkerboard artifacts, faces that half-melted into abstract patterns. You could not reliably generate anything outside the training distribution.
Your ML research lead came in one morning and dropped a paper on your desk: "High-Resolution Image Synthesis with Latent Diffusion Models" by Rombach et al. She had spent the night running it. The generated samples were staggering - not just faces, but scenes, objects, textures, artistic styles. Photorealistic renders next to oil paintings next to pixel art, all from the same model. She ran your standard GAN failure prompts through it. Every one of them worked.
You spent the next week reading three papers: DDPM, DDIM, and the latent diffusion paper. The core insight was fundamentally different from everything you had learned about generative models. GANs were adversarial - a generator trying to fool a discriminator, locked in a delicate equilibrium that required constant care to maintain. Diffusion models were not adversarial at all. They were a denoising problem. The model's job was to predict and remove noise. Nothing more.
That simplicity was the key. A denoising model does not have the adversarial training instabilities of GANs. It does not have the blurry outputs of VAEs. It can be scaled with compute because the objective is stable. And when you add text conditioning - giving the denoiser a description of what you want - you get a model that follows natural language with remarkable fidelity. By the end of the week you had replaced your GAN-based pipeline with an early Stable Diffusion model. You never went back.
The shift from GANs to diffusion was not just a technical improvement. It was a fundamental rethinking of how you frame generative modeling. Instead of asking "how do I generate realistic images?", diffusion asks "how do I denoise images?" And it turns out denoising is much easier to learn.
Why GANs Fell Behind
Generative Adversarial Networks (Goodfellow et al., 2014) dominated image generation for six years. The concept was elegant: train a generator $G$ and a discriminator $D$ simultaneously. $G$ tries to generate realistic images; $D$ tries to distinguish real from generated. They improve each other in competition.
But this elegance came with severe practical problems:
Training instability. The GAN objective requires the generator and discriminator to stay in equilibrium. If the discriminator gets too strong, the generator receives no gradient signal - it cannot tell the direction to improve. If the generator gets too strong, the discriminator stops providing useful signal. This balance was notoriously difficult to maintain. Practitioners developed elaborate tricks: gradient penalty, spectral normalization, learning rate schedules, progressive growing - all to keep the adversarial game stable.
Mode collapse. GANs frequently converge on a small subset of the data distribution - generating only certain types of images very well while ignoring others. A GAN trained on faces might produce only frontal-view, neutral-expression faces despite the training data containing many other poses and expressions.
Limited compositional generalization. GANs are excellent within their training distribution but fail dramatically outside it. Text-conditioned GANs existed (for example StackGAN and AttnGAN), but controllable generation remained limited. Notably, DALL-E v1 was not a GAN at all: it used a discrete VAE + autoregressive transformer approach.
Evaluation complexity. Fréchet Inception Distance (FID), the standard GAN quality metric, correlates poorly with human preference on many tasks. Knowing your GAN has a good FID does not tell you if it can follow complex instructions.
Diffusion models address all of these problems not by being a better adversarial training scheme, but by abandoning the adversarial framework entirely.
The Core Idea: Learn to Reverse Noise
Imagine taking a beautiful photograph and gradually adding Gaussian noise to it. After $T$ steps, the image is pure noise - indistinguishable from random pixels. This is the forward process.
Now imagine training a neural network to undo this process one step at a time. Given a slightly noisy image, predict and remove the noise. If you can learn to do this well for every noise level, you can start from pure random noise and iteratively denoise it into a coherent image. This is the reverse process, and it is what diffusion models do.
The remarkable thing is that the reverse process, when conditioned on a text prompt, generates images that match the prompt. The noise is structured - not arbitrary - and the denoiser learns to steer the denoising trajectory toward images consistent with the conditioning signal.
The Mathematics of Diffusion
Forward Process
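The forward process adds Gaussian noise in $T$ steps. At each step $t$, a small amount of noise is added:

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)
$$

where $\beta_t$ is a small positive constant (the noise schedule) and $I$ is the identity matrix. Typically $\beta_t$ increases with $t$ along a linear or cosine-shaped schedule, for example from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ in the linear DDPM schedule.

Using the reparameterization trick, we can sample $x_t$ from $x_0$ in a single step:

$$
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

where $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.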
This is crucial for efficient training: you can directly sample a noisy version of an image at any noise level without simulating all steps.
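To make the single-step sampling concrete, here is a minimal sketch in plain Python. It assumes a linear schedule with DDPM-style endpoints and treats an image as a flat list of floats. The helper names (`linear_beta_schedule`, `cumulative_alpha_bar`, `q_sample`) are illustrative, not from any library.

```python
import math
import random


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas; list index 0 corresponds to step t = 1."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar[t] = product of (1 - beta) up to and including index t."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0) in one shot; returns (x_t, eps)."""
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    return x_t, eps


betas = linear_beta_schedule()
alpha_bar = cumulative_alpha_bar(betas)
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]  # a toy 4-"pixel" image
x_early, _ = q_sample(x0, 0, alpha_bar, rng)    # barely noised
x_late, _ = q_sample(x0, 999, alpha_bar, rng)   # almost pure noise
print(f"alpha_bar at first step: {alpha_bar[0]:.6f}")
print(f"alpha_bar at last step:  {alpha_bar[-1]:.6f}")
```

The cumulative product shrinks monotonically toward zero, so the signal term fades while the noise term approaches unit variance. No loop over intermediate steps is needed to reach any particular noise level.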
Reverse Process: The Neural Network's Job
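The reverse process is defined as:

$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
$$

A neural network $\epsilon_\theta(x_t, t)$ predicts the noise $\epsilon$ that was added at step $t$. The training objective is:

$$
L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \,\right]
$$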
This is a simple mean squared error loss: the model predicts the noise, and we compare its prediction to the actual noise that was added. This is far more stable than the adversarial GAN loss.
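The noise-prediction objective can be sketched in a few lines. This is a toy illustration under stated assumptions, not a training loop: images are flat lists of floats, the schedule is linear, and both "models" (`zero_model`, `oracle_model`) are hypothetical stand-ins for a neural network.

```python
import math
import random


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear schedule."""
    out, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        out.append(running)
    return out


def simple_loss(eps_model, x0, alpha_bar, rng):
    """One Monte Carlo sample of the noise-prediction MSE."""
    t = rng.randrange(len(alpha_bar))          # uniform random timestep index
    eps = [rng.gauss(0.0, 1.0) for _ in x0]    # the noise the model must predict
    a, b = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    eps_hat = eps_model(x_t, t)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)


alpha_bar = make_alpha_bar()
x0 = [0.5, -0.25, 1.0, 0.0]


def zero_model(x_t, t):
    """Hypothetical untrained model: always predicts zero noise."""
    return [0.0 for _ in x_t]


def oracle_model(x_t, t):
    """Hypothetical oracle that knows x0, so it can solve for eps exactly."""
    a, b = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    return [(xt - a * x) / b for xt, x in zip(x_t, x0)]


runs = 200
loss_zero = sum(simple_loss(zero_model, x0, alpha_bar, random.Random(s)) for s in range(runs)) / runs
loss_oracle = sum(simple_loss(oracle_model, x0, alpha_bar, random.Random(s)) for s in range(runs)) / runs
print(f"zero model loss:   {loss_zero:.3f}")
print(f"oracle model loss: {loss_oracle:.3e}")
```

A model that ignores its input scores about 1 (the variance of the noise), while a perfect predictor scores about 0. Training moves a real network between those two extremes using nothing but this regression target.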
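Given the predicted noise, we can sample the previous, slightly less noisy image:

$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)
$$

where $z$ is fresh noise added for stochasticity (omitted at the final step), $\sigma_t$ sets its scale (for example $\sigma_t^2 = \beta_t$ in DDPM), $\alpha_t = 1 - \beta_t$, and $\bar\alpha_t$ is the cumulative product of the $\alpha_s$ up to step $t$.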
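A single reverse step can be sketched as follows. It assumes the common choice $\sigma_t^2 = \beta_t$ and a linear schedule. List index `t` stands for step `t + 1`, so index 0 is the final denoising step. `ddpm_step` is an illustrative name, and the "predicted" noise in the sanity check is simply the true noise.

```python
import math
import random


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    return [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
            for i in range(num_steps)]


def ddpm_step(x_t, eps_hat, t, betas, alpha_bar, rng):
    """One ancestral sampling step from index t to t - 1 (sigma_t^2 = beta_t)."""
    alpha_t = 1.0 - betas[t]
    coef = betas[t] / math.sqrt(1.0 - alpha_bar[t])
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_hat)]
    if t == 0:
        return mean  # final step: return the mean, no fresh noise
    sigma = math.sqrt(betas[t])
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


betas = linear_betas()
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

# Sanity check: noise x0 by one step, then undo it with the true noise.
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x1 = [math.sqrt(alpha_bar[0]) * x + math.sqrt(1.0 - alpha_bar[0]) * e
      for x, e in zip(x0, eps)]
recovered = ddpm_step(x1, eps, 0, betas, alpha_bar, rng)

# An intermediate step adds fresh noise, so two calls differ.
mid_a = ddpm_step(x1, eps, 500, betas, alpha_bar, random.Random(1))
mid_b = ddpm_step(x1, eps, 500, betas, alpha_bar, random.Random(2))
print(recovered)
```

With a perfect noise prediction, the final step lands exactly on the clean image. Every earlier step deliberately re-injects a little noise, which is why DDPM sampling is stochastic.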
DDPM: The Foundational Paper
Ho et al. (2020) published "Denoising Diffusion Probabilistic Models" (DDPM), demonstrating that diffusion models could produce sample quality competitive with GANs on CIFAR-10. Key choices:
- $T = 1000$ steps
- Linear noise schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$
- UNet architecture as the denoiser, with sinusoidal time embeddings and residual blocks
- Training ran on a TPU v3-8 (which the paper describes as similar to 8 V100 GPUs); the 256×256 models needed multiple days of training
The samples were excellent but generation was slow: 1000 denoising steps per image, each requiring a full UNet forward pass. On a single GPU, generating one 256×256 image took ~20 seconds.
DDIM: Deterministic and Fast Sampling
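Song et al. (2020) showed that the same trained denoiser can be sampled with a non-Markovian process. In its deterministic limit, sampling amounts to integrating an ODE (ordinary differential equation) instead of simulating an SDE (stochastic differential equation), which enables deterministic sampling with far fewer steps.
DDIM (Denoising Diffusion Implicit Models) uses the same trained model as DDPM but skips steps during inference. Instead of denoising from $x_T$ to $x_0$ in 1000 steps, DDIM can go from $x_T$ to $x_0$ in 50 steps - a 20x speedup - with only a modest quality loss.
The DDIM update rule, where $t-1$ denotes the previous timestep in the (possibly shortened) sampling sequence and $\bar\alpha_t$ is the cumulative product of $1 - \beta_s$ up to step $t$:

$$
x_{t-1} = \sqrt{\bar\alpha_{t-1}} \left( \frac{x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} \right) + \sqrt{1 - \bar\alpha_{t-1} - \sigma_t^2}\; \epsilon_\theta(x_t, t) + \sigma_t z
$$

The term in parentheses is the model's current estimate of $x_0$. When $\sigma_t = 0$, sampling is fully deterministic: the same noise always produces the same image. This is critical for applications like image editing where you want reproducibility.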
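The deterministic case can be sketched in plain Python. This is an illustration under assumptions: a linear schedule, evenly strided timesteps, and a hypothetical `oracle_eps` function standing in for the trained network, so that the expected result is known in advance.

```python
import math
import random


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    out, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        out.append(running)
    return out


def ddim_timesteps(num_train_steps, num_sample_steps):
    """Evenly strided subset of training indices, highest noise first."""
    stride = num_train_steps // num_sample_steps
    return list(range(num_train_steps - 1, -1, -stride))[:num_sample_steps]


def ddim_sample(eps_model, x_T, alpha_bar, timesteps):
    """Deterministic DDIM (sigma_t = 0) over a sub-sequence of timesteps."""
    x = list(x_T)
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        # alpha_bar of the next, less noisy step; 1.0 means "clean image".
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps_hat = eps_model(x, t)
        x0_hat = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                  for xi, e in zip(x, eps_hat)]
        x = [math.sqrt(ab_prev) * p + math.sqrt(1.0 - ab_prev) * e
             for p, e in zip(x0_hat, eps_hat)]
    return x


alpha_bar = make_alpha_bar()
timesteps = ddim_timesteps(1000, 50)

rng = random.Random(0)
x0_true = [0.5, -0.25, 1.0, 0.0]
eps_true = [rng.gauss(0.0, 1.0) for _ in x0_true]
ab_start = alpha_bar[timesteps[0]]
x_T = [math.sqrt(ab_start) * x + math.sqrt(1.0 - ab_start) * e
       for x, e in zip(x0_true, eps_true)]


def oracle_eps(x_t, t):
    """Hypothetical perfect denoiser: solves for eps using the known x0."""
    a, b = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    return [(xt - a * x) / b for xt, x in zip(x_t, x0_true)]


sample = ddim_sample(oracle_eps, x_T, alpha_bar, timesteps)
print(len(timesteps), sample)
```

Only 50 of the 1000 training timesteps are visited, and no random numbers are drawn inside the loop. Rerunning with the same starting noise reproduces the same output.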
Latent Diffusion: The Key to Scaling
Rombach et al. (2022) made the decisive insight that enabled practical large-scale diffusion. Running diffusion directly in pixel space is expensive: a 512×512×3 image is 786,432 dimensions, and self-attention in the UNet scales quadratically with the number of spatial positions.
The solution: run diffusion in the latent space of a pretrained VAE.
How Latent Diffusion Works
The VAE compresses each 8×8 spatial region into a single 4-channel latent vector - an 8x spatial compression in each dimension, 64x overall. Diffusion runs on the 64×64×4 latent space. The UNet operates on a 64×64 grid instead of 512×512, which means 64x fewer spatial positions and 48x fewer values in total (16,384 instead of 786,432). This makes training and inference dramatically faster and cheaper.
Key insight: the VAE learns to separate "perceptual content" (stored in the latent) from fine pixel-level texture (handled by the decoder). The diffusion model only needs to learn the high-level structure - the VAE decoder fills in the details.
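The compression figures above are easy to check by hand. The sketch below just does the arithmetic, assuming the 8x downsampling factor and 4 latent channels described in this section; `latent_shape` is an illustrative helper, not a library function.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the VAE latent for an image of the given size."""
    return height // downsample, width // downsample, latent_channels


pixel_values = 512 * 512 * 3
lat_h, lat_w, lat_c = latent_shape(512, 512)
latent_values = lat_h * lat_w * lat_c

position_ratio = (512 * 512) // (lat_h * lat_w)  # fewer spatial positions
value_ratio = pixel_values // latent_values      # fewer numbers overall
attention_ratio = position_ratio ** 2            # fewer position pairs in self-attention

print(pixel_values, latent_values)                   # 786432 16384
print(position_ratio, value_ratio, attention_ratio)  # 64 48 4096
```

The last number matters most for cost. Self-attention compares every position with every other position, so 64x fewer positions means roughly 4096x fewer pairs at the full-resolution level.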
Stable Diffusion is the open-weights implementation of latent diffusion, developed by CompVis and Runway with compute from Stability AI, and trained on subsets of LAION-5B.
Text Conditioning: How the Model Follows Prompts
A diffusion model without conditioning generates random images. Text conditioning tells the denoiser what to generate.
The mechanism is cross-attention. The UNet denoiser has cross-attention layers where the spatial features of the noisy latent attend over the CLIP text embeddings:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^\top}{\sqrt{d}} \right) V
$$

where $Q$ comes from the latent features (projected), $K$ and $V$ come from the CLIP text encoder output (also projected), and $d$ is the key dimension.

At every denoising step, the spatial features "look at" the text embedding and steer the denoising direction toward images semantically consistent with the text. The text conditioning is applied at every UNet block that has cross-attention, which in Stable Diffusion 1.x covers most encoder resolutions, the middle block, and most decoder resolutions.
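A single-head version of this attention fits in a few lines of plain Python. The sketch assumes the queries, keys, and values have already been projected. In a real model those projections are learned linear layers, and the toy sizes here (3 spatial positions, 2 text tokens) are chosen only for readability.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head attention: each query row attends over all key/value rows."""
    d = len(keys[0])
    out, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out, weights


# Queries come from image positions; keys and values come from text tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = cross_attention(Q, K, V)
for position, row in enumerate(out):
    print(position, [round(v, 3) for v in row])
```

The output has one row per spatial position, no matter how many text tokens there are. Each row is a mixture of text-token values, weighted by how well that position's query matches each token's key.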
Classifier-Free Guidance (CFG)
The fundamental tension in text-to-image generation: diversity vs. prompt fidelity.
If you weaken the conditioning signal, you get diverse but off-prompt images. If you push it high, you get on-prompt images but with reduced variety and often visual artifacts.
Classifier-Free Guidance (Ho & Salimans, 2022) solves this with a clever trick. During training, randomly drop the text condition with probability 10-20%, replacing it with a null embedding (empty prompt). This trains the model to denoise both with and without conditioning.
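During inference, you run the model twice per step: once with the text condition and once with the null condition. The actual noise prediction is extrapolated beyond the conditional prediction:

$$
\tilde\epsilon_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right)
$$

where $w$ is the guidance scale (typically 7-15), $c$ is the text condition, and $\varnothing$ is the null condition.

When $w = 1$: just conditional generation, standard quality. When $w = 7$: the conditional direction is amplified 7x - strong prompt adherence, somewhat reduced diversity. When $w$ is at the top of that range or beyond: very tight prompt adherence, possible oversaturation and artifacts.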
CFG doubles the cost of inference (two forward passes per step) but dramatically improves prompt adherence. It is used in virtually all production text-to-image systems.
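Both halves of CFG are small enough to sketch directly. The function names (`maybe_drop_condition`, `cfg_combine`) and the toy noise vectors are illustrative assumptions. In a real pipeline the two predictions come from two forward passes of the denoiser.

```python
import random


def maybe_drop_condition(cond, null_cond, drop_prob, rng):
    """Training side: swap in the null embedding with probability drop_prob."""
    return null_cond if rng.random() < drop_prob else cond


def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Inference side: extrapolate from the unconditional toward the conditional prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.10, -0.20, 0.00]
eps_cond = [0.30, -0.10, 0.40]
for w in (0.0, 1.0, 7.5):
    print(w, [round(v, 3) for v in cfg_combine(eps_uncond, eps_cond, w)])

# How often the condition is dropped during training with a 10% rate.
rng = random.Random(0)
drops = sum(maybe_drop_condition("prompt", "", 0.1, rng) == "" for _ in range(10000))
print(f"dropped {drops} of 10000")
```

A scale of 0 returns the unconditional prediction and a scale of 1 returns the conditional one. Anything above 1 moves past the conditional prediction along the same direction, which is the extrapolation the formula describes.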
Architecture Evolution: UNet to DiT
The UNet Backbone (SD 1.x, SD 2.x, SDXL)
The original DDPM and Stable Diffusion use a UNet as the denoiser. The UNet has:
- Encoder: Progressively downsampling ResNet blocks
- Bottleneck: Middle blocks with self-attention at the lowest spatial resolution
- Decoder: Progressively upsampling ResNet blocks with skip connections from the encoder
- Cross-attention: Text conditioning in the attention blocks of the encoder, bottleneck, and decoder
- Time embedding: Sinusoidal encoding of the timestep $t$ injected at every block
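The time embedding in the list above can be sketched as follows. This assumes the transformer-style formulation with a maximum period of 10000 and a sine-then-cosine layout. Implementations differ on the exact ordering and scaling, so treat those details as assumptions.

```python
import math


def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep; dim is assumed to be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


emb_start = timestep_embedding(0, 8)
emb_mid = timestep_embedding(500, 8)
print([round(v, 3) for v in emb_start])
print([round(v, 3) for v in emb_mid])
```

Each timestep maps to a distinct vector with values in [-1, 1]. In the UNet this vector usually passes through a small MLP and is then added to the features of every residual block, so the network knows which noise level it is denoising.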
SDXL (Stability AI, 2023) scaled the UNet significantly: 2 text encoders (CLIP-ViT-L plus OpenCLIP-ViT-bigG), a much larger UNet backbone, and a separate refiner model for high-resolution detail. Training resolution: 1024×1024.
The DiT Revolution (SD3, FLUX, Sora)
Peebles & Xie (2023) proposed DiT - Diffusion Transformer - replacing the UNet backbone with a standard transformer. The key insight: transformers scale better than UNets with compute, consistent with findings in language modeling.
In DiT:
- The latent is patchified (each spatial patch becomes one token)
- A transformer with standard self-attention processes the sequence
- Time and class conditioning are injected via adaptive layer norm (adaLN)
- No UNet-style long skip connections between encoder and decoder stages; the transformer directly maps noisy patches to noise predictions
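The patchify step from the list above can be sketched with nested lists. It assumes a 32×32×4 latent (what an 8x VAE produces for a 256×256 image) and a patch size of 2. `patchify` is an illustrative helper, and a real DiT follows it with a learned linear projection to the transformer's hidden size.

```python
def patchify(latent, patch_size):
    """Split an H x W x C latent (nested lists) into flattened patch tokens."""
    height, width = len(latent), len(latent[0])
    assert height % patch_size == 0 and width % patch_size == 0
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for dy in range(patch_size):
                for dx in range(patch_size):
                    token.extend(latent[top + dy][left + dx])
            tokens.append(token)
    return tokens


# Each position stores its own raster index in all 4 channels, for easy inspection.
latent = [[[float(y * 32 + x)] * 4 for x in range(32)] for y in range(32)]
tokens = patchify(latent, patch_size=2)
print(len(tokens), len(tokens[0]))  # 256 tokens, each of length 2 * 2 * 4 = 16
```

Halving the patch size quadruples the number of tokens. Patch size is therefore a direct trade-off between detail and sequence length.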
DiT-XL/2 (the largest variant) reported a better FID on class-conditional ImageNet than prior UNet-based diffusion models. The scaling trend in the paper is consistent: more transformer compute (larger models, smaller patches) → better quality.
FLUX (Black Forest Labs, 2024) extends DiT with multi-modal diffusion transformers that jointly attend over image patches and text tokens - a more unified approach. SD3 (Stability AI, 2024) uses a similar multi-modal transformer architecture.
Sora (OpenAI, 2024) applies DiT to video by treating video frames as patches in spacetime. Same architecture, extended to the temporal dimension.
The Broader Family: Image-to-Image, Inpainting, ControlNet
Image-to-Image
Start from a noisy version of an existing image rather than pure noise. Inject noise up to an intermediate step $t_0 < T$ instead of all the way to $T$, then denoise from $t_0$. The amount of noise injected controls the "strength" of the transformation - low strength preserves most of the original, high strength allows more creative deviation.
# Image-to-image pipeline
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("input.jpg").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a photo of a cat in a garden, oil painting style",
    image=init_image,
    strength=0.75,  # 0 = no change, 1 = pure generation
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
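The relationship between `strength` and the number of denoising steps that actually run can be sketched as simple arithmetic. This follows the convention that strength scales the step count and rounds down. The exact rounding inside any particular library version is an assumption here, and `img2img_schedule` is an illustrative helper.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (first_step_index, steps_actually_run) for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    first_step = max(num_inference_steps - steps_to_run, 0)
    return first_step, steps_to_run


for s in (0.25, 0.75, 1.0):
    print(s, img2img_schedule(50, s))
```

Under this convention, a request for 50 steps at strength 0.75 runs 37 denoising steps. Low strength is therefore cheaper as well as more faithful to the input image.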
ControlNet
ControlNet (Zhang et al., 2023) adds spatial conditioning to a frozen diffusion model. A trainable copy of the UNet's encoder (same architecture, initialized from the pretrained weights) processes a spatial control signal - edge map, depth map, pose skeleton, segmentation mask - and injects it into the UNet via additional zero-initialized convolutions.
This enables precise spatial control: "generate an image of a person in this exact pose" or "generate a scene with this depth structure."
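The role of the zero-initialized convolutions can be shown with a toy example. The sketch treats a feature map as a list of per-position channel vectors and models the zero convolution as a 1x1 convolution. All names and numbers are illustrative.

```python
def conv1x1(features, weight, bias):
    """1x1 convolution on a list of per-position channel vectors."""
    return [[sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
            for vec in features]


def inject_control(unet_features, control_features, weight, bias):
    """Add the projected control features to the frozen UNet features."""
    projected = conv1x1(control_features, weight, bias)
    return [[u + p for u, p in zip(uv, pv)] for uv, pv in zip(unet_features, projected)]


channels = 3
zero_weight = [[0.0] * channels for _ in range(channels)]
zero_bias = [0.0] * channels
identity_weight = [[1.0 if i == j else 0.0 for j in range(channels)] for i in range(channels)]

unet_features = [[0.2, -1.0, 0.5], [1.5, 0.0, -0.3]]   # 2 positions, 3 channels
control_features = [[4.0, 4.0, 4.0], [-2.0, 1.0, 3.0]]

at_init = inject_control(unet_features, control_features, zero_weight, zero_bias)
after_training = inject_control(unet_features, control_features, identity_weight, zero_bias)
print(at_init == unet_features)   # the control branch contributes nothing yet
print(after_training)
```

At initialization the frozen model's features pass through unchanged, so training starts from the pretrained behavior. The control signal only takes effect as the convolution weights move away from zero.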
Code: Text-to-Image with Diffusers
import torch
from diffusers import (
    DPMSolverMultistepScheduler,
    StableDiffusionPipeline,
    StableDiffusionXLPipeline,
)
from PIL import Image


def setup_sd_pipeline(
    model_id: str = "runwayml/stable-diffusion-v1-5",
    device: str = "cuda",
    enable_xformers: bool = True,
) -> StableDiffusionPipeline:
    """Set up a Stable Diffusion pipeline with optimizations."""
    # Assumption: if this repo id no longer resolves on the Hub, pass a mirror you trust.
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        # Disables the built-in checker; outputs must then be filtered separately.
        safety_checker=None,
    ).to(device)

    # Use DPM-Solver++ for faster, higher-quality sampling
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config,
        use_karras_sigmas=True,
    )

    # Memory-efficient attention (optional dependency)
    if enable_xformers and device == "cuda":
        try:
            pipe.enable_xformers_memory_efficient_attention()
            print("xformers memory efficient attention enabled")
        except Exception as exc:
            print(f"xformers unavailable, using default attention: {exc}")

    # Attention slicing for lower VRAM
    pipe.enable_attention_slicing()
    return pipe


def generate_image(
    pipe,
    prompt: str,
    negative_prompt: str = "blurry, low quality, bad anatomy, ugly, watermark",
    width: int = 512,
    height: int = 512,
    num_steps: int = 30,
    guidance_scale: float = 7.5,
    seed: int = 42,
) -> Image.Image:
    """Generate an image from a text prompt."""
    generator = torch.Generator(device=pipe.device).manual_seed(seed)
    result = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        width=width,
        height=height,
        num_inference_steps=num_steps,
        guidance_scale=guidance_scale,
        generator=generator,
    )
    return result.images[0]


def explore_cfg_effect(pipe, prompt: str, guidance_scales: list[float]) -> list[Image.Image]:
    """Generate images at different CFG scales to visualize the trade-off."""
    images = []
    for cfg in guidance_scales:
        # Re-seed for every scale so the starting noise is identical
        # and the guidance scale is the only thing that changes.
        result = pipe(
            prompt=prompt,
            guidance_scale=cfg,
            num_inference_steps=30,
            generator=torch.Generator(device=pipe.device).manual_seed(42),
        )
        images.append(result.images[0])
        print(f"  CFG={cfg}: done")
    return images


def batch_generate(
    pipe,
    prompts: list[str],
    negative_prompt: str = "blurry, low quality",
    batch_size: int = 4,
    num_steps: int = 30,
    guidance_scale: float = 7.5,
) -> list[Image.Image]:
    """Generate multiple images efficiently using batching."""
    all_images = []
    for i in range(0, len(prompts), batch_size):
        batch_prompts = prompts[i:i + batch_size]
        result = pipe(
            prompt=batch_prompts,
            negative_prompt=[negative_prompt] * len(batch_prompts),
            num_inference_steps=num_steps,
            guidance_scale=guidance_scale,
        )
        all_images.extend(result.images)
        print(f"Generated {min(i + batch_size, len(prompts))}/{len(prompts)}")
    return all_images


if __name__ == "__main__":
    pipe = setup_sd_pipeline()

    # Single image generation
    image = generate_image(
        pipe,
        prompt="a serene mountain lake at dawn, photorealistic, golden hour lighting",
        num_steps=30,
        guidance_scale=7.5,
    )
    image.save("output.png")
    print("Saved output.png")

    # Explore CFG effect
    print("Exploring CFG scales...")
    scales = [1.0, 4.0, 7.5, 12.0, 20.0]
    cfg_images = explore_cfg_effect(
        pipe,
        prompt="an astronaut riding a horse on Mars",
        guidance_scales=scales,
    )
    for cfg, img in zip(scales, cfg_images):
        img.save(f"cfg_{cfg}.png")
Code: SDXL for Higher Quality
import torch
from diffusers import DPMSolverMultistepScheduler, StableDiffusionXLPipeline
from PIL import Image


def setup_sdxl_pipeline(device: str = "cuda") -> StableDiffusionXLPipeline:
    """Set up SDXL pipeline."""
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to(device)
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config,
        use_karras_sigmas=True,
    )
    pipe.enable_attention_slicing()
    return pipe


def generate_sdxl(
    pipe,
    prompt: str,
    negative_prompt: str = "blurry, low quality, bad anatomy",
    width: int = 1024,
    height: int = 1024,
    num_steps: int = 30,
    guidance_scale: float = 7.0,
    seed: int = 42,
) -> Image.Image:
    """Generate at SDXL native resolution (1024x1024)."""
    generator = torch.Generator(device=pipe.device).manual_seed(seed)
    result = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        width=width,
        height=height,
        num_inference_steps=num_steps,
        guidance_scale=guidance_scale,
        generator=generator,
    )
    return result.images[0]
Production Engineering Notes
Step Count vs Quality vs Latency
The number of denoising steps is the primary latency lever in diffusion inference. With DDIM or DPM-Solver++:
| Steps | Quality | Latency (A100, 512x512) |
|---|---|---|
| 10 | Good | ~0.8s |
| 20 | Very Good | ~1.5s |
| 30 | Excellent | ~2.2s |
| 50 | Near-optimal | ~3.5s |
| 100 | Marginal improvement | ~7s |
For production: 20-30 steps with DPM-Solver++ achieves near-optimal quality. Anything beyond 50 steps provides diminishing returns that users typically cannot perceive.
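The table above is close to linear in the step count, so a rough latency budget can be sketched with two constants. The per-step cost (0.07 s) and fixed overhead (0.1 s) are eyeballed from the table rather than measured, so treat both as assumptions to replace with your own benchmarks.

```python
def estimate_latency_seconds(steps, per_step=0.07, overhead=0.1):
    """Rough latency model: fixed overhead plus a constant cost per denoising step."""
    return overhead + per_step * steps


def max_steps_within(budget_seconds, per_step=0.07, overhead=0.1):
    """Largest step count whose estimated latency fits in the budget."""
    return max(int((budget_seconds - overhead) / per_step), 0)


for steps in (10, 20, 30):
    print(steps, round(estimate_latency_seconds(steps), 2))
print(max_steps_within(2.0))
```

A model like this is only useful for first-pass capacity planning. Real latency also depends on resolution, batch size, scheduler, and attention backend.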
VRAM Requirements
| Model | Resolution | VRAM Required |
|---|---|---|
| SD 1.5 (FP16) | 512x512 | 4GB |
| SD 2.1 (FP16) | 768x768 | 6GB |
| SDXL (FP16) | 1024x1024 | 10GB |
| SDXL with refiner | 1024x1024 | 18GB |
| FLUX (FP8) | 1024x1024 | 12GB |
For lower VRAM: enable CPU offloading (pipe.enable_model_cpu_offload()), use FP8 weights, or use quantization.
Safety Filtering
All production deployments must filter outputs for harmful content. Stable Diffusion's built-in safety checker is CLIP-based and scores the generated image against a list of harmful concepts. A simple standalone alternative is an image classifier fine-tuned for NSFW detection:
import torch
from PIL import Image
from transformers import pipeline as hf_pipeline


def setup_safety_checker():
    """Load an NSFW image classifier (assumed to expose "nsfw" and "normal" labels)."""
    return hf_pipeline(
        "image-classification",
        model="Falconsai/nsfw_image_detection",
        device=0 if torch.cuda.is_available() else -1,
    )


def check_safety(safety_checker, image: Image.Image, threshold: float = 0.5) -> dict:
    """Check if a generated image passes safety filters."""
    result = safety_checker(image)
    scores = {item["label"]: item["score"] for item in result}
    nsfw_score = scores.get("nsfw", 0.0)
    is_safe = nsfw_score < threshold
    return {
        "is_safe": is_safe,
        "nsfw_score": nsfw_score,
        "normal_score": scores.get("normal", 1.0 - nsfw_score),
    }
Beyond NSFW detection, production deployments should also filter for:
- Images of specific real people (privacy/deepfake concerns)
- Copyrighted art styles (legal exposure)
- Violence and graphic content
- Harmful instructions hidden in generated text
Batch Generation and Cost
import time
from contextlib import contextmanager


@contextmanager
def timer(label: str):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f}s")


def estimate_generation_cost(
    num_images: int,
    steps: int = 30,
    resolution: int = 512,
    a100_cost_per_hour: float = 3.00,
    images_per_second: float = 0.5,
) -> dict:
    """Estimate inference cost for batch image generation.

    `steps` and `resolution` document the configuration being priced but do not
    enter the calculation; measure `images_per_second` for that configuration.
    """
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    total_seconds = num_images / images_per_second
    gpu_hours = total_seconds / 3600
    cost = gpu_hours * a100_cost_per_hour
    return {
        "num_images": num_images,
        "estimated_time_minutes": total_seconds / 60,
        "gpu_hours": gpu_hours,
        "estimated_cost_usd": cost,
        "cost_per_image_cents": (cost / num_images) * 100,
    }
Common Mistakes
:::danger Running Full-Resolution Diffusion for Large Images
Running DDPM-style diffusion directly in 1024x1024 pixel space (not latent diffusion) is massively expensive: the UNet operates on tensors with roughly 3.1M values (1024×1024×3). Prefer latent diffusion (SD, SDXL, FLUX), which compresses the image to a 64x64 or 128x128 latent grid before diffusion. Direct pixel-space diffusion at this resolution multiplies memory and compute at every denoising step, for both training and sampling.
:::
:::danger Setting CFG Scale Too High
A common misconception is that higher guidance scale always means better images. Guidance scale above 12-15 typically causes oversaturation, color distortion, and unnatural sharpness. Start at 7.5 for general use. Only go above 10 for prompts where you need very tight adherence to specific details. Monitor generation quality by viewing outputs, not just trusting the number.
:::
:::warning Ignoring Negative Prompts
Negative prompts are not a hack - they are a critical part of prompt engineering for diffusion models. Without negative prompts, models frequently produce artifacts, poor anatomy, watermarks, and low-quality outputs. A standard negative prompt like "blurry, low quality, bad anatomy, ugly, watermark, signature, text" significantly improves output quality at zero extra cost.
:::
:::warning Not Caching the Model Between Requests
Loading a Stable Diffusion pipeline from disk takes 10-30 seconds. Never load a new pipeline for each request in production. Load the model once at startup, keep it in GPU memory, and process requests sequentially or with a request queue. Use `torch.compile` for additional speedup after warm-up.
:::
Interview Questions and Answers
Q1: Explain the forward and reverse processes in diffusion models. What is the model actually learning?
The forward process is a fixed Markov chain that gradually adds Gaussian noise to a training image over $T$ steps. After $T$ steps (typically 1000), the image is pure Gaussian noise. This process is not learned - it is defined by the noise schedule $\beta_t$. The reverse process is what the model learns: given a noisy image at step $t$, predict and remove the noise to get a slightly cleaner image at step $t-1$. The model is specifically trained to predict the noise $\epsilon$ that was added (called epsilon-prediction or the "simple objective" from DDPM). This is a regression problem trained with MSE loss. At inference time, you start from pure Gaussian noise and iteratively apply the learned denoiser $T$ times to produce a coherent image. Text conditioning steers the denoising trajectory by injecting text embeddings via cross-attention at each step.
Q2: Why did diffusion models displace GANs, despite GANs being dominant for years?
Three main advantages: (1) Training stability - diffusion uses a simple MSE regression objective rather than the adversarial min-max game. There is no mode collapse, no discriminator-generator imbalance. The loss is always well-defined and gradients flow reliably. (2) Better mode coverage - GANs suffer from mode collapse where they learn to generate only certain types of images. Diffusion models, by learning to denoise at all noise levels, are forced to model the full data distribution. (3) Scalability - diffusion objectives scale cleanly with compute. More parameters and more training data consistently improve results, as demonstrated by FLUX and Stable Diffusion XL. GANs were notoriously difficult to scale reliably. The tradeoff is inference speed: GANs generate in a single forward pass while diffusion requires 20-50 steps. But with DDIM, DPM-Solver++, and consistency models, the speed gap has narrowed significantly.
Q3: What is Classifier-Free Guidance and why is it important?
CFG is a technique for amplifying the influence of the conditioning signal (e.g., text prompt) during inference without training a separate classifier. During training, the conditioning is randomly dropped with 10-20% probability, training the model to denoise both conditionally and unconditionally. During inference, you run the model twice per step: once with the text condition and once with an empty/null condition. The actual noise prediction is: $\tilde\epsilon = \epsilon_\theta(x_t, \varnothing) + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right)$, where $w$ is the guidance scale, $c$ is the text condition, and $\varnothing$ is the null condition. This extrapolates in the direction where the conditional prediction differs from the unconditional prediction - it amplifies the features of the image that are caused by the text condition. Higher $w$ means stronger prompt adherence but less diversity and potential oversaturation. CFG doubles inference cost (two forward passes) but is used in virtually all production systems because it dramatically improves prompt fidelity.
Q4: What is the key innovation of latent diffusion models?
The key innovation is moving diffusion from pixel space to the latent space of a pretrained VAE. A standard diffusion model trained at 512x512 pixels must denoise tensors with 786K dimensions. The UNet's attention cost scales quadratically with the number of spatial positions, making high-resolution training prohibitively expensive. Latent diffusion (Rombach et al., 2022) first trains a VAE to compress images by 8x in each spatial dimension: a 512x512x3 image becomes a 64x64x4 latent. Diffusion then runs on these 64x64x4 latents, which have 64x fewer spatial positions and 48x fewer values overall. The UNet trains far faster and at much lower memory cost. The VAE decoder reconstructs high-quality pixel-space images from the denoised latents. The insight is that the VAE separates "semantic content" (captured by the latent) from fine perceptual detail (handled by the decoder), and diffusion only needs to model the semantic content.
Q5: How does ControlNet enable spatial conditioning in diffusion models?
ControlNet (Zhang et al., 2023) adds spatial conditioning to a frozen diffusion model's UNet without changing the original model's weights. A trainable copy of the UNet encoder (same architecture, initialized from the pretrained weights rather than from scratch) is trained to process the spatial control signal (edge map, depth map, pose skeleton, etc.). This control branch produces intermediate feature maps that are added, through zero-initialized convolutional layers, to the frozen UNet's middle block and decoder skip connections. The zero initialization is crucial: at the start of training, the control branch contributes exactly zero, so the model starts from its pretrained behavior and gradually learns to incorporate the spatial condition. The frozen original UNet's text-conditioning capabilities and quality are fully preserved. ControlNet enables precise spatial control: "generate an image of a dancer with the skeleton in exactly this pose" or "generate an architectural rendering that follows this floor plan." Multiple ControlNets can be combined with weighted blending.
