DDIM and Accelerated Diffusion Sampling
:::note Reading time: ~50 minutes | Interview relevance: Very High | Target roles: ML Engineer, Research Engineer, AI Engineer, Applied Scientist :::
The Real Interview Moment
You are interviewing for a generative AI platform role at a company deploying text-to-image at scale. The team has a DDPM model trained for six months on a cluster of 256 A100s. It produces stunning results. But deployment is blocked: each image takes 15 minutes on a V100 GPU. The product team needs sub-5-second latency for the consumer app. The interviewer asks: "How would you solve this, without retraining?"
Most engineers answer: "use a smaller model" or "get a faster GPU" or "quantize it." These miss the point. A 4x smaller model would degrade quality significantly. A 4x faster GPU would still produce 4-minute images. The correct answer requires understanding that DDPM's 1000-step sampling constraint is an artifact of the sampling algorithm, not the model itself.
The trained U-Net has no knowledge of how many sampling steps will be used at inference time. It was trained on the marginal distribution $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I)$. Any sampling process that respects these marginals can use the same trained model. Song et al. (2020) showed that the DDPM forward process is just one of infinitely many inference processes that yield the same marginals - and the deterministic member of that family enables 10x to 50x speedup with no retraining and modest quality loss.
This insight - that diffusion sampling is a differential equation problem with many valid solvers, not a fixed sequential recipe - underpins everything that came after: DDIM, PLMS, DPM-Solver, DPM-Solver++, Consistency Models. Understanding it is what separates engineers who deploy diffusion models from engineers who understand why they work.
Why This Exists - The Sampling Bottleneck
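DDPM trains a U-Net $\epsilon_\theta(x_t, t)$ to predict the noise $\epsilon$ added at any timestep $t$. Training is efficient: sample a random $x_0$, a random $t$, compute $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ in one step, and regress the predicted noise against $\epsilon$. The closed-form forward process makes each training step O(1) in $t$. Clean and fast.
The sampling process is the problem. To generate from DDPM, you must simulate the reverse Markov chain:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I), \quad t = T, T-1, \dots, 1$$
This requires $T = 1000$ sequential steps. Each step calls the U-Net once. Steps cannot be parallelized because step $t-1$ depends on the output of step $t$. On a V100 GPU running an 860M-parameter U-Net, each forward pass takes approximately 0.9 seconds. 1000 passes: 15 minutes per image.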
The problem is not the model size. It is that the DDPM derivation assumes the reverse process must mirror the Markovian forward process step-by-step. Song et al. (2020) asked: does it have to?
The answer is no. The DDPM training objective depends only on the marginals $q(x_t \mid x_0)$, not on the full joint distribution $q(x_{1:T} \mid x_0)$. There are infinitely many joint distributions sharing the same marginals. DDIM defines one that enables large, non-consecutive steps - reducing inference from 1000 to 20-50 steps with the same trained model.
Historical Context - The Non-Markovian Insight
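By mid-2020, DDPM had set state-of-the-art FID on CIFAR-10 but was too slow for practical deployment. Jiaming Song, Chenlin Meng, and Stefano Ermon at Stanford noticed something crucial about the DDPM training loss:
$$L_\text{simple} = \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]$$
where $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. This loss depends only on the marginal $q(x_t \mid x_0)$ - defined entirely by $\bar\alpha_t$. It does not depend on $q(x_t \mid x_{t-1})$ - the transition probability between consecutive steps.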
This is the key insight: the trained U-Net has no knowledge of whether the forward process is Markovian. It only knows the marginal distributions. Therefore, any alternative forward process sharing the same marginals can be reversed using the same trained U-Net.
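To make the "marginals only" point concrete, here is a minimal sketch of how one training example is built. The linear schedule endpoints (1e-4 to 0.02 with T = 1000) are the commonly used DDPM values and are an assumption here, as are the helper names `make_alpha_bar` and `q_sample`; scalars stand in for image tensors.

```python
import math
import random

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule (assumed values)."""
    alpha_bar, running = [], 1.0
    for t in range(T):
        beta_t = beta_start + (beta_end - beta_start) * t / (T - 1)
        running *= 1.0 - beta_t
        alpha_bar.append(running)
    return alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; no chain over x_1..x_{t-1} is simulated."""
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps  # eps is the regression target for the noise predictor

alpha_bar = make_alpha_bar()
rng = random.Random(0)
x_t, eps = q_sample(x0=0.5, t=400, alpha_bar=alpha_bar, rng=rng)
```

Nothing in `q_sample` refers to a transition between neighbouring timesteps: only `alpha_bar[t]` enters, which is why the trained network is agnostic to the joint process used at sampling time.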
DDIM defines a family of non-Markovian processes indexed by a stochasticity parameter $\eta$, all sharing the DDPM marginals. Setting $\eta = 0$ gives a fully deterministic reverse process - an ODE that can be solved with large step sizes, enabling 20-50x fewer steps. This was published as DDIM (Denoising Diffusion Implicit Models) at ICLR 2021.
1. Neural Function Evaluations - The Cost Unit
The cost of diffusion sampling is measured in NFE (Neural Function Evaluations) - the number of times the denoising network is called. Each NFE is one U-Net forward pass.
| Method | NFE | V100 time (SD-sized model, ~0.9 s/NFE) | Quality (FID, CIFAR-10) |
|---|---|---|---|
| DDPM (original) | 1000 | ~900s | 3.17 |
| DDIM ($\eta=0$, 250 steps) | 250 | ~225s | 4.16 |
| DDIM ($\eta=0$, 100 steps) | 100 | ~90s | 4.16 |
| DDIM ($\eta=0$, 50 steps) | 50 | ~45s | 4.67 |
| PLMS (25 steps) | 25 | ~22s | ~3.5 |
| DPM-Solver-2 (20 NFE) | 20 | ~18s | ~3.0 |
| DPM-Solver-3 (12 NFE) | 12 | ~11s | ~3.0 |
| LCM (4 steps) | 4 | ~4s | ~3.5 |
| LCM (1 step) | 1 | ~1s | ~6.0 |
(Approximate values - exact numbers depend on model size and hardware.)
The sequential dependency is the core constraint: step $t-1$ requires $x_t$, which requires step $t$ to finish. You cannot parallelize across timesteps within a single sample. All speedups come from either taking larger steps (DDIM, DPM-Solver) or fundamentally changing the inference paradigm (Consistency Models).
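Because the calls are strictly sequential, latency is simply NFE times per-call time. The sketch below plugs in the ~0.9 s per pass figure from the V100 scenario in this article; the function names and the 5-second budget helper are illustrative, not part of any library.

```python
import math

def sampling_latency_seconds(nfe: int, seconds_per_nfe: float) -> float:
    """Wall-clock time for one sample: network calls cannot overlap."""
    return nfe * seconds_per_nfe

def max_nfe_within_budget(budget_seconds: float, seconds_per_nfe: float) -> int:
    """Largest number of network calls that fits in a latency budget."""
    return math.floor(budget_seconds / seconds_per_nfe)

SECONDS_PER_NFE = 0.9  # assumed per-pass latency from the interview scenario
ddpm_latency = sampling_latency_seconds(1000, SECONDS_PER_NFE)
ddim50_latency = sampling_latency_seconds(50, SECONDS_PER_NFE)
nfe_budget = max_nfe_within_budget(5.0, SECONDS_PER_NFE)
```

Under these assumed numbers a 5-second budget leaves room for only a handful of calls, so on that hardware a sampler change alone is not enough: it has to be combined with faster per-call inference or a distilled few-step model.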
2. The DDIM Forward Process - Same Marginals, Different Joint
Keeping Marginals, Losing the Markov Property
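DDIM defines a family of non-Markovian forward processes parameterized by $\sigma_t \ge 0$:
$$q_\sigma(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(\sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\cdot\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{\sqrt{1-\bar\alpha_t}},\ \sigma_t^2 I\right)$$
Breaking this down:
- $\sqrt{\bar\alpha_{t-1}}\,x_0$ - the "target direction" toward the clean image at timestep $t-1$
- $\sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\cdot\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{\sqrt{1-\bar\alpha_t}}$ - the "noise direction," pointing from $x_0$ toward $x_t$ (the predicted noise direction), scaled to fill the remaining variance
- $\sigma_t^2 I$ - stochastic noise injected at this step
The term $(x_t - \sqrt{\bar\alpha_t}\,x_0)/\sqrt{1-\bar\alpha_t}$ is the noise direction - it equals the actual $\epsilon$ if the forward model is exact.
Verification that marginals are preserved: you can verify by integrating over $x_t$ that $q_\sigma(x_{t-1} \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_{t-1}}\,x_0,\ (1-\bar\alpha_{t-1}) I)$ - identical to DDPM. The joint is different from DDPM (it is non-Markovian), but the marginal at each timestep is the same. This is why the same trained U-Net works.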
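The marginal check can be written out in a few lines. This is a sketch of the standard argument; $\epsilon'$ is a symbol introduced here for the fresh noise drawn inside $q_\sigma$, independent of the $\epsilon$ that produced $x_t$.

```latex
% Assume x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon with \epsilon \sim N(0, I),
% so that (x_t - \sqrt{\bar\alpha_t}\,x_0)/\sqrt{1-\bar\alpha_t} = \epsilon. A draw from q_\sigma is then
\begin{aligned}
x_{t-1} &= \sqrt{\bar\alpha_{t-1}}\,x_0
          + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\epsilon
          + \sigma_t\,\epsilon', \qquad \epsilon' \sim \mathcal{N}(0, I) \\
\mathbb{E}[x_{t-1} \mid x_0] &= \sqrt{\bar\alpha_{t-1}}\,x_0 \\
\operatorname{Var}[x_{t-1} \mid x_0] &= (1-\bar\alpha_{t-1}-\sigma_t^2)\,I + \sigma_t^2\,I
                                      = (1-\bar\alpha_{t-1})\,I
\end{aligned}
```

The $\sigma_t^2$ terms cancel for every admissible $\sigma_t$, so by induction from $t = T$ downward every member of the family has the DDPM marginals.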
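The $\eta$ Parameter - Controlling Stochasticity
DDIM introduces $\eta \in [0, 1]$ to parameterize how much stochastic noise is injected:
$$\sigma_t(\eta) = \eta\,\sqrt{\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}}\;\sqrt{1-\frac{\bar\alpha_t}{\bar\alpha_{t-1}}}$$
The term $\sqrt{(1-\bar\alpha_{t-1})/(1-\bar\alpha_t)}\,\sqrt{1-\bar\alpha_t/\bar\alpha_{t-1}}$ equals exactly $\sqrt{\tilde\beta_t}$ - the DDPM posterior standard deviation. So:
- $\eta = 0$: $\sigma_t = 0$ - no noise injected - fully deterministic DDIM - the true DDIM case
- $\eta = 1$: $\sigma_t = \sqrt{\tilde\beta_t}$ - recovers the DDPM posterior variance - stochastic sampling identical to DDPM when all $T$ steps are used
- $0 < \eta < 1$: interpolation between deterministic and DDPM behavior
The naming "DDIM" (Denoising Diffusion Implicit Models) comes from the $\eta = 0$ case: the model is implicit in the sense that a sample is a deterministic function of the latent $x_T$, defined by the sampling trajectory rather than by an explicit stochastic Markov chain.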
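The claim that $\eta = 1$ recovers the DDPM posterior standard deviation is easy to check numerically. The sketch below uses illustrative $\bar\alpha$ values rather than a real schedule, and the function names are made up for this example.

```python
import math

def ddim_sigma(alpha_bar_t: float, alpha_bar_prev: float, eta: float) -> float:
    """sigma_t(eta) as defined above; eta = 0 gives the deterministic sampler."""
    return (eta
            * math.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar_t))
            * math.sqrt(1 - alpha_bar_t / alpha_bar_prev))

def ddpm_posterior_std(alpha_bar_t: float, alpha_bar_prev: float) -> float:
    """sqrt(beta_tilde_t), using beta_t = 1 - alpha_bar_t / alpha_bar_prev."""
    beta_t = 1 - alpha_bar_t / alpha_bar_prev
    beta_tilde = (1 - alpha_bar_prev) / (1 - alpha_bar_t) * beta_t
    return math.sqrt(beta_tilde)

ab_t, ab_prev = 0.50, 0.52  # illustrative values, not taken from a trained model
sigma_det = ddim_sigma(ab_t, ab_prev, eta=0.0)
sigma_full = ddim_sigma(ab_t, ab_prev, eta=1.0)
sigma_half = ddim_sigma(ab_t, ab_prev, eta=0.5)
```

`sigma_full` matches `ddpm_posterior_std(ab_t, ab_prev)` up to floating-point error, and intermediate `eta` values scale it linearly.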
3. The DDIM Update Rule
Step-by-Step Derivation
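At inference time, the U-Net predicts $\epsilon_\theta(x_t, t)$ - an estimate of the noise in $x_t$. The DDIM update moves from $x_t$ to $x_{t-1}$ in two logical sub-steps:
Sub-step 1 - Predict the clean image $\hat{x}_0$:
Using the forward process reparameterization $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, solve for $x_0$:
$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}$$
This is a per-step estimate of the clean image. Early in sampling (large $t$, high noise), this estimate is rough. Later in sampling (small $t$, low noise), it is increasingly accurate.
Sub-step 2 - Compute $x_{t-1}$ via the DDIM posterior:
Plug $\hat{x}_0$ into the DDIM transition distribution:
$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t, t) + \sigma_t\,\epsilon$$
where $\epsilon \sim \mathcal{N}(0, I)$ when $\eta > 0$ and the last term vanishes ($\sigma_t = 0$) when $\eta = 0$.
This is the complete DDIM update. Substituting $\hat{x}_0$ explicitly:
$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\left(\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right) + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t, t) + \sigma_t\,\epsilon$$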
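A scalar version of the update makes the two sub-steps easy to sanity-check. This is a minimal sketch: `ddim_step` is a name chosen for this example, and the "model" is replaced by the true noise so the result can be compared against the forward process.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, sigma_t=0.0, noise=0.0):
    """One DDIM update on scalars: predict x_0, then re-noise to the previous level."""
    # Sub-step 1: clean-image estimate from the predicted noise
    x0_hat = (x_t - math.sqrt(1 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Sub-step 2: move to the previous noise level along the predicted noise direction
    direction = math.sqrt(1 - alpha_bar_prev - sigma_t ** 2) * eps_pred
    return math.sqrt(alpha_bar_prev) * x0_hat + direction + sigma_t * noise

# Illustrative values; with a perfect noise prediction and eta = 0 the step
# lands exactly on the forward-process sample at the lower noise level.
x0, eps = 0.3, -1.2
ab_t, ab_prev = 0.2, 0.6
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1 - ab_t) * eps
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
expected = math.sqrt(ab_prev) * x0 + math.sqrt(1 - ab_prev) * eps
x_final = ddim_step(x_t, eps, ab_t, 1.0)  # jumping to alpha_bar = 1 returns x0_hat
```

Note that nothing in `ddim_step` requires `alpha_bar_prev` to belong to the adjacent timestep, which is what makes step skipping possible.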
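The ODE Perspective ($\eta = 0$)
When $\eta = 0$, there is no stochastic term and the DDIM update becomes the Euler discretization of a probability flow ODE. Using the substitutions $\sigma = \sqrt{1-\bar\alpha}/\sqrt{\bar\alpha}$ (the noise-to-signal ratio) and $\bar{x} = x/\sqrt{\bar\alpha}$, the ODE is:
$$\mathrm{d}\bar{x}(t) = \epsilon_\theta\!\left(\frac{\bar{x}(t)}{\sqrt{\sigma^2+1}},\ t\right)\mathrm{d}\sigma(t)$$
where $\epsilon_\theta$ is, up to scaling, the estimated score function: $\nabla_{x_t}\log q(x_t) \approx -\epsilon_\theta(x_t, t)/\sqrt{1-\bar\alpha_t}$. This ODE perspective is what enables higher-order solvers (DPM-Solver) to dramatically outperform the first-order DDIM Euler steps.
The ODE is deterministic: starting from the same $x_T$, the same image is produced every time. This determinism is what enables DDIM inversion (Section 5) and latent space interpolation.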
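The Euler-step claim can be verified by rearranging the $\eta = 0$ update. The sketch below writes $\bar{x}_t = x_t/\sqrt{\bar\alpha_t}$ and $\sigma_t = \sqrt{(1-\bar\alpha_t)/\bar\alpha_t}$ for the rescaled variables (here $\sigma_t$ is the noise-to-signal ratio, not the $\sigma_t(\eta)$ of the stochastic sampler).

```latex
% Start from x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t)
% with \hat{x}_0 = (x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)) / \sqrt{\bar\alpha_t},
% and divide both sides by \sqrt{\bar\alpha_{t-1}}:
\begin{aligned}
\frac{x_{t-1}}{\sqrt{\bar\alpha_{t-1}}}
  &= \frac{x_t}{\sqrt{\bar\alpha_t}}
   + \left(\sqrt{\frac{1-\bar\alpha_{t-1}}{\bar\alpha_{t-1}}}
         - \sqrt{\frac{1-\bar\alpha_t}{\bar\alpha_t}}\right)\epsilon_\theta(x_t, t) \\
\bar{x}_{t-1} &= \bar{x}_t + (\sigma_{t-1} - \sigma_t)\,\epsilon_\theta(x_t, t)
\end{aligned}
```

The last line is exactly one explicit Euler step of $\mathrm{d}\bar{x} = \epsilon_\theta\,\mathrm{d}\sigma$ with step size $\sigma_{t-1} - \sigma_t$.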
4. Accelerated Sampling - Skipping Timesteps
Why Step Skipping Works
With DDPM, you must step through $t = T, T-1, \dots, 1$ - 1000 steps. With DDIM, you can jump from $x_{\tau_i}$ to $x_{\tau_{i-1}}$ for any increasing subset $\tau \subset \{1, \dots, T\}$ of size $S$.
The noise schedule is smooth enough that skipping from $t$ to $t - 250$ (a step of 250) is a valid (larger) Euler step on the ODE. The quality degrades gracefully because the ODE is well-conditioned: the curvature along the trajectory is bounded, so large steps introduce bounded approximation error.
Why can't DDPM skip steps? Because DDPM sampling uses the stochastic Markov transition $p_\theta(x_{t-1} \mid x_t)$, which is derived assuming unit steps. Taking a 250-step jump in the DDPM sampler introduces massive discretization error in the Gaussian approximation. The DDIM ODE is more robust because the deterministic trajectory is smoother than a stochastic path.
Step Selection Strategies
Given $T$ training timesteps and $S$ desired sampling steps, how do you choose which timesteps to use?
Uniform spacing (most common): $\tau_i = \lfloor iT/S \rfloor$ for $i = 1, \dots, S$. With $T = 1000$ and $S = 50$ this selects every 20th timestep.
Quadratic spacing (better for very few steps): $\tau_i = \lfloor (i/S)^2\,T \rfloor$. Concentrates more steps at low noise levels ($t$ small), where the log-SNR changes fastest per timestep and fine detail is resolved.
Log spacing in $\lambda$-space (used by DPM-Solver): uniform spacing in $\lambda_t = \log\big(\sqrt{\bar\alpha_t}/\sqrt{1-\bar\alpha_t}\big)$ - the natural step-size variable for the Taylor expansion used in DPM-Solver.
T=1000, S=20:
Uniform steps: [50, 100, 150, 200, 250, 300, ..., 950, 1000]
Quadratic steps: [2, 10, 22, 40, 62, 90, 122, ..., 902, 1000]
(more steps at low noise, where t is small)
For DDIM at 20-50 steps, uniform spacing works well. For DPM-Solver at 10-15 steps, quadratic or $\lambda$-space spacing is better.
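The two closed-form schedules can be sketched in a few lines. The function names are illustrative; timesteps are 1-indexed here (values in 1..T), so subtract 1 before indexing a 0-based `alphas_cumprod` array.

```python
def uniform_timesteps(T: int, S: int) -> list:
    """tau_i = floor(i * T / S) for i = 1..S, returned high-noise first."""
    return [(i * T) // S for i in range(S, 0, -1)]

def quadratic_timesteps(T: int, S: int) -> list:
    """tau_i = floor((i / S)^2 * T); integer arithmetic avoids float rounding."""
    steps = {max(1, (i * i * T) // (S * S)) for i in range(1, S + 1)}
    return sorted(steps, reverse=True)  # duplicates (possible for large S) are merged

uniform = uniform_timesteps(1000, 20)
quadratic = quadratic_timesteps(1000, 20)
```

The gaps in `quadratic` grow with $t$: consecutive entries are a few timesteps apart near $t = 0$ and about a hundred apart near $t = T$.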
5. DDIM Inversion - Encoding Real Images
A powerful property of deterministic DDIM ($\eta = 0$): the ODE can be run backwards. Given a real image $x_0$, you can find the noise vector $x_T$ such that $\text{DDIM}(x_T) \approx x_0$. This is called DDIM inversion.
Why Inversion Enables Image Editing
Once you have $x_T$, you can:
- Change the text conditioning from $c$ to $c'$
- Run DDIM sampling from $x_T$ with the new conditioning
- Get an edited image that preserves the structure of $x_0$ but follows $c'$
This works because $x_T$ encodes the "structure" of the image as noise - the spatial configuration of the noise vector determines which image the ODE trajectory converges to. With the same starting noise but different text conditioning, the trajectory diverges while preserving broad structural similarity.
Applications: text-guided style transfer, changing object attributes ("make the cat orange"), editing specific regions, Prompt-to-Prompt-style attention manipulation.
Inversion Algorithm
DDIM inversion is the forward direction of the ODE (from $t = 0$ to $t = T$), using the same deterministic update but in reverse:
$$x_{t+1} = \sqrt{\bar\alpha_{t+1}}\,\hat{x}_0(x_t) + \sqrt{1-\bar\alpha_{t+1}}\,\epsilon_\theta(x_t, t)$$
where $\hat{x}_0(x_t) = \big(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)\big)/\sqrt{\bar\alpha_t}$ is the predicted clean image at each inversion step.
Limitations: inversion is not perfectly reversible due to discretization error. With 50+ steps, $\text{DDIM}(\text{Invert}(x_0)) \approx x_0$ with small visible artifacts. With 10-20 steps, the reconstruction error becomes noticeable. For precise editing applications, use more inversion steps than generation steps.
A second limitation: DDIM inversion assumes $\eta = 0$ (deterministic). With CFG at inference but not during inversion, there is a conditioning mismatch that causes drift. Null-Text Inversion (Mokady et al. 2022) addresses this by optimizing the null embedding per image to correct for CFG-induced drift.
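The effect of step count on round-trip error can be illustrated with a toy problem. The sketch below assumes scalar data drawn from $\mathcal{N}(0, 1)$, for which the ideal noise predictor is available in closed form ($\epsilon^*(x_t, t) = \sqrt{1-\bar\alpha_t}\,x_t$); the linear schedule and all function names are assumptions for illustration, not a real model.

```python
import math

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    out, running = [], 1.0
    for t in range(T):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * t / (T - 1))
        out.append(running)
    return out

ALPHA_BAR = make_alpha_bar()

def toy_eps_model(x, t):
    """Exact noise predictor when the data distribution is N(0, 1) (toy assumption)."""
    return math.sqrt(1.0 - ALPHA_BAR[t]) * x

def ddim_move(x, t_from, t_to):
    """One deterministic DDIM update (eta = 0) between two arbitrary timesteps."""
    eps = toy_eps_model(x, t_from)
    x0_hat = (x - math.sqrt(1 - ALPHA_BAR[t_from]) * eps) / math.sqrt(ALPHA_BAR[t_from])
    return math.sqrt(ALPHA_BAR[t_to]) * x0_hat + math.sqrt(1 - ALPHA_BAR[t_to]) * eps

def round_trip_error(x0, num_steps, T=1000):
    ts = list(range(0, T, T // num_steps))                  # ascending timestep indices
    x = x0
    for a, b in zip(ts[:-1], ts[1:]):                       # inversion: low -> high noise
        x = ddim_move(x, a, b)
    for a, b in zip(reversed(ts[1:]), reversed(ts[:-1])):   # sampling: high -> low noise
        x = ddim_move(x, a, b)
    return abs(x - x0)

errors = {s: round_trip_error(1.0, s) for s in (10, 50, 100)}
```

In this toy setting the reconstruction error shrinks as the number of steps grows, which mirrors the guidance above to spend more steps on inversion when fidelity matters. A real U-Net adds model error on top of this discretization error.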
6. DPM-Solver - Higher-Order Diffusion ODE Solvers
Why First-Order is Not Enough
DDIM applies the first-order Euler method to the probability flow ODE. Each step incurs $O(h^2)$ local truncation error, where $h$ is the step size in $\lambda$-space. With 50 steps over the full noise range, the total accumulated error is $O(h)$. Acceptable for $S \ge 50$, but at $S \le 20$, the Euler steps are so large that curvature in the ODE cannot be ignored.
Lu et al. (2022) recognized that the diffusion ODE has special structure: the linear part (the signal-preserving component) has an exact analytical solution. Only the nonlinear part (the score function term) needs numerical approximation. Separating these gives DPM-Solver.
The DPM-Solver Formulation
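The probability flow ODE in $\lambda$ space (log signal-to-noise ratio), written with $\alpha_t = \sqrt{\bar\alpha_t}$, $\sigma_t = \sqrt{1-\bar\alpha_t}$ and $\lambda_t = \log(\alpha_t/\sigma_t)$ (note that $\sigma_t$ here is the marginal noise level, not the DDIM $\sigma_t(\eta)$), is:
$$\frac{\mathrm{d}\hat{x}_\lambda}{\mathrm{d}\lambda} = \frac{\mathrm{d}\log\alpha_\lambda}{\mathrm{d}\lambda}\,\hat{x}_\lambda - \sigma_\lambda\,\hat\epsilon_\theta(\hat{x}_\lambda, \lambda)$$
where $\hat\epsilon_\theta$ is the denoising network in $\epsilon$-prediction form. The exact solution from $\lambda_s$ to $\lambda_t$ is:
$$x_t = \frac{\alpha_t}{\alpha_s}\,x_s - \alpha_t\int_{\lambda_s}^{\lambda_t} e^{-\lambda}\,\hat\epsilon_\theta(\hat{x}_\lambda, \lambda)\,\mathrm{d}\lambda$$
The integral over $\lambda$ is approximated using a Taylor expansion around $\lambda_s$. DPM-Solver-$k$ keeps the first $k$ terms (derivatives up to order $k-1$):
$$\hat\epsilon_\theta(\hat{x}_\lambda, \lambda) = \sum_{n=0}^{k-1}\frac{(\lambda-\lambda_s)^n}{n!}\,\hat\epsilon_\theta^{(n)}(\hat{x}_{\lambda_s}, \lambda_s) + O\big((\lambda-\lambda_s)^k\big)$$
DPM-Solver-1 ($k = 1$): uses only $\hat\epsilon_\theta(\hat{x}_{\lambda_s}, \lambda_s)$, giving $x_t = \frac{\alpha_t}{\alpha_s}x_s - \sigma_t(e^{h} - 1)\,\epsilon_\theta(x_s, s)$ with $h = \lambda_t - \lambda_s$ - first-order, equivalent to DDIM.
DPM-Solver-2 ($k = 2$): evaluates $\epsilon_\theta$ at an intermediate point computed from the first-order prediction. Uses this additional evaluation for the 2nd-order correction. Cost: 2 NFE per step, but each step is 2nd-order accurate (local error $O(h^3)$).
DPM-Solver-3 ($k = 3$): 3rd-order Taylor expansion, $O(h^4)$ local error per step. Works in 10-15 total steps.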
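The benefit of the second-order update at equal NFE can be seen on a toy problem. The sketch below assumes scalar data from $\mathcal{N}(0, 1)$ under a variance-preserving process, so the ideal noise predictor is $\epsilon^*(x, \lambda) = \sigma_\lambda x$ and the exact ODE solution keeps $x$ constant; any drift from the starting value is pure solver error. The function names are illustrative, and the updates follow the single-step DPM-Solver-1 and midpoint DPM-Solver-2 formulas.

```python
import math

def alpha(lam):
    """alpha_lambda for a variance-preserving process: alpha^2 = sigmoid(2 * lambda)."""
    return math.sqrt(1.0 / (1.0 + math.exp(-2.0 * lam)))

def sigma(lam):
    """sigma_lambda, with sigma^2 = 1 - alpha^2."""
    return math.sqrt(1.0 / (1.0 + math.exp(2.0 * lam)))

def toy_eps(x, lam):
    """Exact noise prediction when the data distribution is N(0, 1) (toy assumption)."""
    return sigma(lam) * x

def dpm_solver_1(x, lam_s, lam_t):
    h = lam_t - lam_s
    return alpha(lam_t) / alpha(lam_s) * x - sigma(lam_t) * math.expm1(h) * toy_eps(x, lam_s)

def dpm_solver_2(x, lam_s, lam_t):
    h = lam_t - lam_s
    lam_mid = lam_s + 0.5 * h
    # First-order half step to the midpoint, then a full step using the midpoint slope
    u = alpha(lam_mid) / alpha(lam_s) * x - sigma(lam_mid) * math.expm1(0.5 * h) * toy_eps(x, lam_s)
    return alpha(lam_t) / alpha(lam_s) * x - sigma(lam_t) * math.expm1(h) * toy_eps(u, lam_mid)

def integrate(step_fn, num_steps, lam_start=-5.0, lam_end=5.0, x=1.0):
    h = (lam_end - lam_start) / num_steps
    for i in range(num_steps):
        x = step_fn(x, lam_start + i * h, lam_start + (i + 1) * h)
    return x

err_first = abs(integrate(dpm_solver_1, 80) - 1.0)    # 80 steps x 1 NFE = 80 NFE
err_second = abs(integrate(dpm_solver_2, 40) - 1.0)   # 40 steps x 2 NFE = 80 NFE
```

With the same budget of 80 network calls, the second-order solver ends closer to the exact solution in this toy setting. How large the gap is on a real model depends on how smooth the learned $\epsilon_\theta$ is along the trajectory.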
DPM-Solver++ - Improving Stability
DPM-Solver++ (Lu et al. 2022b) applies the same idea but in $x_0$-prediction space (predicting the clean image) rather than $\epsilon$-prediction space. This produces better numerical stability at large guidance scales (CFG $> 7$) because $x_0$ predictions can be kept bounded (for example by thresholding) while $\epsilon$ predictions can become large at high guidance scales. DPM-Solver++ is a common choice in production systems and is available in the HuggingFace Diffusers library as DPMSolverMultistepScheduler.
DDIM (1st order): accumulates O(h^2) per step - need 50+ steps for quality
DPM-Solver-2: accumulates O(h^3) per step - 20 steps matches DDIM 50 steps
DPM-Solver-3: accumulates O(h^4) per step - 12 steps matches DDIM 50 steps
7. Architecture Diagram
8. Complete PyTorch DDIM Sampler with Inversion
This implementation works with any pre-trained DDPM U-Net - no changes to the model required.
import math
from typing import List, Optional, Tuple

import torch


class DDIMSampler:
    """
    DDIM sampler with:
    - Deterministic sampling (eta=0) and stochastic sampling (eta=1)
    - Step schedule selection (uniform, quadratic, lambda)
    - DDIM inversion for image editing
    - Works with any pre-trained DDPM-style U-Net

    Reference: Song et al. 2020 "Denoising Diffusion Implicit Models"
    """

    def __init__(self, model, alphas_cumprod: torch.Tensor):
        """
        Args:
            model: trained U-Net - callable (x_t, t) -> predicted_noise
                OR (x_t, t, conditioning) -> predicted_noise
            alphas_cumprod: shape (T,) - the alpha_bar_t values from training
        """
        self.model = model
        self.alphas_cumprod = alphas_cumprod
        self.T = len(alphas_cumprod)

    def make_timestep_schedule(
        self,
        num_steps: int,
        spacing: str = "uniform",
    ) -> List[int]:
        """
        Select roughly S timestep indices from the full T-step schedule.
        Returns indices in [0, T-1] in descending order (high noise -> low noise).

        Args:
            num_steps: S, the number of inference steps (1 <= S <= T)
            spacing: 'uniform', 'quadratic', or 'lambda' (log-SNR space)
        """
        if not 1 <= num_steps <= self.T:
            raise ValueError(f"num_steps must be in [1, {self.T}], got {num_steps}")

        if spacing == "uniform":
            # Evenly spaced indices 0, r, 2r, ... with r = T // S
            step_ratio = self.T // num_steps
            timesteps = list(range(0, self.T, step_ratio))
            timesteps = sorted(timesteps, reverse=True)
        elif spacing == "quadratic":
            # Indices grow as i^2, so steps are denser at low noise (small t)
            timesteps = [
                int((i / num_steps) ** 2 * (self.T - 1))
                for i in range(num_steps + 1)
            ]
            timesteps = sorted(set(timesteps), reverse=True)
        elif spacing == "lambda":
            # Uniform in lambda_t = log(alpha_t / sigma_t), i.e. half the log-SNR
            alphas = self.alphas_cumprod.sqrt()
            sigmas = (1 - self.alphas_cumprod).sqrt()
            lambdas = torch.log(alphas / sigmas)
            lambda_max = lambdas[0].item()    # t = 0: least noise, highest SNR
            lambda_min = lambdas[-1].item()   # t = T-1: most noise, lowest SNR
            lambda_seq = torch.linspace(lambda_min, lambda_max, num_steps + 1)
            # Snap each lambda value to the closest timestep index
            timesteps = [(lambdas - lam).abs().argmin().item() for lam in lambda_seq]
            timesteps = sorted(set(timesteps), reverse=True)
        else:
            raise ValueError(f"Unknown spacing: {spacing}")
        return timesteps

    @torch.no_grad()
    def sample(
        self,
        shape: Tuple,
        num_inference_steps: int = 50,
        eta: float = 0.0,
        conditioning=None,
        negative_conditioning=None,
        guidance_scale: float = 7.5,
        spacing: str = "uniform",
        device: str = "cuda",
        generator: Optional[torch.Generator] = None,
        verbose: bool = False,
        x_T: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """
        Generate samples using DDIM.

        Args:
            shape: (B, C, H, W) output shape
            num_inference_steps: S, number of denoising steps (20-50 typical)
            eta: 0.0 = deterministic DDIM, 1.0 = stochastic (DDPM behavior)
            conditioning: text or class conditioning tensor
            negative_conditioning: for CFG (null/negative embedding)
            guidance_scale: CFG guidance scale w (ignored if no conditioning)
            spacing: 'uniform', 'quadratic', or 'lambda'
            generator: torch.Generator for reproducibility (must live on `device`)
            x_T: optional starting latent, e.g. the output of invert();
                if None, sampling starts from fresh Gaussian noise
        Returns:
            Generated samples of shape (B, C, H, W)
        """
        B = shape[0]
        timesteps = self.make_timestep_schedule(num_inference_steps, spacing)

        # Start from pure Gaussian noise unless a starting latent is supplied
        if x_T is None:
            x = torch.randn(shape, generator=generator, device=device)
        else:
            x = x_T.to(device)
        alphas_cumprod = self.alphas_cumprod.to(device)
        use_cfg = (
            conditioning is not None
            and negative_conditioning is not None
            and guidance_scale > 1.0
        )

        if verbose:
            print(f"DDIM sampling: {num_inference_steps} steps, eta={eta}")
            print(f"Timesteps: {timesteps[:5]}...{timesteps[-3:]}")

        for i, t in enumerate(timesteps):
            # Next (less noisy) index; -1 marks the final jump to the clean sample.
            # A sentinel is needed because index 0 is itself a valid noisy timestep.
            t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
            t_batch = torch.full((B,), t, device=device, dtype=torch.long)

            # === Predict noise with U-Net ===
            if use_cfg:
                # CFG: batch conditional and unconditional passes together
                x_doubled = torch.cat([x, x], dim=0)
                t_doubled = torch.cat([t_batch, t_batch], dim=0)
                cond_cat = torch.cat([conditioning, negative_conditioning], dim=0)
                eps_both = self.model(x_doubled, t_doubled, cond_cat)
                eps_cond, eps_uncond = eps_both.chunk(2, dim=0)
                # CFG combination
                eps_theta = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
            elif conditioning is not None:
                eps_theta = self.model(x, t_batch, conditioning)
            else:
                eps_theta = self.model(x, t_batch)

            # === Retrieve noise schedule values ===
            alpha_bar_t = alphas_cumprod[t]
            if t_prev >= 0:
                alpha_bar_prev = alphas_cumprod[t_prev]
            else:
                alpha_bar_prev = alphas_cumprod.new_tensor(1.0)

            # === Predict x_0 from x_t ===
            # x_0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)
            x0_pred = (x - (1 - alpha_bar_t).sqrt() * eps_theta) / alpha_bar_t.sqrt()
            # Clamp to the training range. Suitable for pixel-space models trained
            # on [-1, 1]; remove for latent-space models whose latents are not bounded.
            x0_pred = x0_pred.clamp(-1.0, 1.0)

            # === Compute DDIM sigma_t ===
            # sigma_t = eta * sqrt((1 - a_prev) / (1 - a_t)) * sqrt(1 - a_t / a_prev)
            sigma_t = eta * (
                ((1 - alpha_bar_prev) / (1 - alpha_bar_t)).sqrt()
                * (1 - alpha_bar_t / alpha_bar_prev).sqrt()
            )

            # === Compute x_{t-1} ===
            # "Direction toward clean image" + "direction toward x_t" + noise
            dir_coef = (1 - alpha_bar_prev - sigma_t ** 2).clamp(min=0.0).sqrt()
            dir_xt = dir_coef * eps_theta
            x0_dir = alpha_bar_prev.sqrt() * x0_pred
            if eta > 0 and t_prev >= 0:
                # torch.randn_like does not accept a generator, so use torch.randn
                noise = torch.randn(
                    x.shape, generator=generator, device=x.device, dtype=x.dtype
                )
                random_noise = sigma_t * noise
            else:
                random_noise = 0.0
            x = x0_dir + dir_xt + random_noise

            if verbose and i % 10 == 0:
                print(f"  Step {i+1:3d}/{num_inference_steps}, t={t}, t_prev={t_prev}")

        return x  # denoised latent or image

    @torch.no_grad()
    def invert(
        self,
        x0: torch.Tensor,
        num_steps: int = 50,
        conditioning=None,
        device: str = "cuda",
        verbose: bool = False,
    ) -> torch.Tensor:
        """
        DDIM Inversion: encode a real image x_0 → x_T (approximate).

        Runs the ODE forward (from t=0 toward t=T) to find the noise
        that would produce x_0 under DDIM decoding.
        IMPORTANT: Valid only for eta=0 (deterministic DDIM).
        The more inversion steps, the more accurate the reconstruction.
        To reconstruct, pass the result as x_T to sample() with eta=0,
        spacing='uniform' and the same number of steps.

        Args:
            x0: clean image in [-1, 1], shape (B, C, H, W)
            num_steps: inversion steps (50+ for high fidelity)
            conditioning: optional text conditioning during inversion
        Returns:
            x_T: inverted noise, shape (B, C, H, W)
        """
        # Ascending timestep order for inversion (low noise → high noise)
        timesteps = self.make_timestep_schedule(num_steps, spacing="uniform")
        timesteps = sorted(timesteps)
        x = x0.to(device)
        alphas_cumprod = self.alphas_cumprod.to(device)

        if verbose:
            print(f"DDIM Inversion: {num_steps} steps")

        for i in range(len(timesteps) - 1):
            t = timesteps[i]
            t_next = timesteps[i + 1]
            t_batch = torch.full((x.shape[0],), t, device=device, dtype=torch.long)

            # Predict noise at the current (partially noisy) image
            if conditioning is not None:
                eps_theta = self.model(x, t_batch, conditioning)
            else:
                eps_theta = self.model(x, t_batch)

            alpha_bar_t = alphas_cumprod[t]
            alpha_bar_t_next = alphas_cumprod[t_next]

            # Inversion step: deterministic DDIM update toward higher noise (eta=0)
            x0_pred = (x - (1 - alpha_bar_t).sqrt() * eps_theta) / alpha_bar_t.sqrt()
            x = (
                alpha_bar_t_next.sqrt() * x0_pred
                + (1 - alpha_bar_t_next).sqrt() * eps_theta
            )

            if verbose and i % 10 == 0:
                print(f"  Inversion step {i+1}/{len(timesteps)-1}, t={t} → t_next={t_next}")

        return x  # approximate x_T: the inverted noise
# ============================================================
# Step count and eta: quality vs speed analysis
# ============================================================
def analyze_guidance_scale_quality_tradeoff():
    """
    Shows the empirical quality-speed tradeoff of DDIM at different
    step counts and eta values. The function name is kept for the
    __main__ block below; the analysis covers steps and eta, not CFG.
    """
    print("DDIM Quality vs Speed Analysis")
    print("=" * 60)
    print()

    # FID on CIFAR-10, taken from Table 1 of the DDIM paper unless noted
    results = [
        # (num_steps, eta, approx_FID_CIFAR10)
        (1000, 1.0, 3.17),   # DDPM baseline (larger-variance sampler in the paper)
        (250, 0.0, 4.16),    # approximate: 250 steps is not a column in the paper's table
        (100, 0.0, 4.16),    # DDIM deterministic, 100 steps
        (50, 0.0, 4.67),     # DDIM deterministic, 50 steps
        (50, 1.0, 8.01),     # DDIM stochastic, 50 steps
        (20, 0.0, 6.84),     # DDIM deterministic, 20 steps
        (10, 0.0, 13.36),    # DDIM too few steps
    ]

    print(f"{'Steps':>8} | {'eta':>6} | {'FID (CIFAR-10)':>15} | {'vs DDPM':>12} | {'Speedup':>8}")
    print("-" * 64)
    baseline_fid = 3.17
    for steps, eta, fid in results:
        speedup = 1000 / steps
        delta = fid - baseline_fid
        print(f"{steps:8d} | {eta:6.1f} | {fid:15.2f} | {delta:+8.2f} FID | {speedup:7.0f}x")

    print()
    print("Key observations:")
    print("  DDIM 50 steps:  FID 4.67 vs DDPM 3.17 → +1.5 FID, 20x faster")
    print("  DDIM 100 steps: FID 4.16 → only +1.0 FID, 10x faster")
    print("  DDIM 20 steps:  FID 6.84 → acceptable for previews, 50x faster")
    print("  eta=1 at 50 steps is worse in FID than eta=0 (8.01 vs 4.67)")
    print()
    print("Recommendation for production:")
    print("  - High quality: DPM-Solver-2 @ 20 steps")
    print("  - Fast preview: DDIM @ 20 steps")
    print("  - Real-time: LCM @ 4 steps (requires distilled model)")
# ============================================================
# DPM-Solver-2 - simplified educational implementation
# ============================================================
class DPMSolver2:
    """
    DPM-Solver-2: 2nd-order single-step diffusion ODE solver (epsilon-prediction).

    Key idea: solve the linear part of the diffusion ODE exactly and
    approximate only the noise-prediction integral, using one extra
    network evaluation at an intermediate point in log-SNR space.

    Simplification: continuous lambda values are snapped to the nearest
    trained timestep index, so very short steps may fall back to first order.

    Reference: Lu et al. (2022) "DPM-Solver: A Fast ODE Solver for
    Diffusion Probabilistic Model Sampling in Around 10 Steps"
    """

    def __init__(self, model, alphas_cumprod: torch.Tensor):
        self.model = model
        self.alphas_cumprod = alphas_cumprod
        # lambda_t = log(alpha_t / sigma_t): half the log-SNR, decreasing in t
        alphas = alphas_cumprod.sqrt()
        sigmas = (1 - alphas_cumprod).sqrt()
        self.lambdas = torch.log(alphas / sigmas)

    def _get_t_from_lambda(self, lam: float) -> int:
        """Find the timestep index closest to a given lambda value."""
        return int((self.lambdas - lam).abs().argmin().item())

    def _predict_eps(self, x: torch.Tensor, t_idx: int, conditioning) -> torch.Tensor:
        """Call the network at a single timestep index for the whole batch."""
        t_batch = torch.full((x.shape[0],), t_idx, device=x.device, dtype=torch.long)
        if conditioning is not None:
            return self.model(x, t_batch, conditioning)
        return self.model(x, t_batch)

    @torch.no_grad()
    def sample(
        self,
        shape: Tuple,
        num_steps: int = 20,
        conditioning=None,
        device: str = "cuda",
    ) -> torch.Tensor:
        """
        Sample using DPM-Solver-2.
        Uses up to 2 NFE per step: one at the current point, one at the
        intermediate point. Total NFE <= 2 * num_steps.
        """
        # Schedule: uniform in lambda, from high noise (lambda_min) to low noise (lambda_max)
        lambda_min = self.lambdas.min().item()
        lambda_max = self.lambdas.max().item()
        lambda_schedule = torch.linspace(lambda_min, lambda_max, num_steps + 1).tolist()

        x = torch.randn(shape, device=device)
        alphas = self.alphas_cumprod.to(device).sqrt()
        sigmas = (1 - self.alphas_cumprod.to(device)).sqrt()

        for lam_a, lam_b in zip(lambda_schedule[:-1], lambda_schedule[1:]):
            t_idx = self._get_t_from_lambda(lam_a)                  # current (noisier) index
            s_idx = self._get_t_from_lambda(lam_b)                  # target (cleaner) index
            if s_idx == t_idx:
                continue  # both lambdas snapped to the same index: nothing to do
            m_idx = self._get_t_from_lambda((lam_a + lam_b) / 2)    # intermediate index

            # Use the lambdas of the snapped indices so coefficients stay consistent
            lam_t = self.lambdas[t_idx].item()
            lam_s = self.lambdas[s_idx].item()
            lam_m = self.lambdas[m_idx].item()
            h = lam_s - lam_t                  # step size in lambda space (positive)

            eps_t = self._predict_eps(x, t_idx, conditioning)
            ratio = alphas[s_idx] / alphas[t_idx]

            if m_idx in (t_idx, s_idx):
                # No distinct intermediate timestep: first-order (DDIM) step
                x = ratio * x - sigmas[s_idx] * math.expm1(h) * eps_t
                continue

            # === First-order prediction to the intermediate point ===
            r1 = (lam_m - lam_t) / h           # 0.5 when the midpoint is hit exactly
            u = (alphas[m_idx] / alphas[t_idx]) * x - sigmas[m_idx] * math.expm1(r1 * h) * eps_t

            # === Second-order correction using the intermediate evaluation ===
            eps_m = self._predict_eps(u, m_idx, conditioning)
            x = (
                ratio * x
                - sigmas[s_idx] * math.expm1(h) * eps_t
                - sigmas[s_idx] / (2 * r1) * math.expm1(h) * (eps_m - eps_t)
            )

        return x
# ============================================================
# Usage example and comparison
# ============================================================
def sampling_comparison_demo(seconds_per_nfe: float = 0.009):
    """
    Demonstrates the NFE-quality tradeoff across samplers.

    seconds_per_nfe: assumed per-call latency (~9 ms is the rough A100 figure
    used in this article); measure on your own hardware before relying on it.
    """
    print("Sampler Comparison for SD-scale Model (860M U-Net)")
    print("=" * 65)
    print()

    # (method, total NFE, approximate FID on CIFAR-10)
    samplers = [
        ("DDPM (1000 steps)", 1000, 3.17),
        ("DDIM (250 steps, η=0)", 250, 4.16),
        ("DDIM (50 steps, η=0)", 50, 4.67),
        ("DDIM (20 steps, η=0)", 20, 6.84),
        ("DPM-Solver-2 (20 NFE)", 20, 3.05),
        ("DPM-Solver-3 (12 NFE)", 12, 3.08),
        ("LCM (4 steps)", 4, 3.50),  # distilled
    ]

    print(f"{'Method':<35} | {'NFE':>5} | {'A100 (s)':>9} | {'FID':>6}")
    print("-" * 64)
    for name, nfe, fid in samplers:
        total_time = nfe * seconds_per_nfe
        print(f"{name:<35} | {nfe:5d} | {total_time:8.2f}s | {fid:6.2f}")

    print()
    print("Recommendation:")
    print("  Production high-quality: DPM-Solver-2/3 @ 12-20 steps")
    print("  Fast preview: DDIM @ 20 steps")
    print("  Real-time (distilled): LCM @ 4 steps")
    print("  Image editing (invert): DDIM with 50+ inversion steps")


if __name__ == "__main__":
    analyze_guidance_scale_quality_tradeoff()
    print()
    sampling_comparison_demo()
9. YouTube Resources
| Video | Channel | What You Learn |
|---|---|---|
| DDIM Paper Explained | Yannic Kilcher | Full DDIM paper walkthrough with non-Markovian derivation |
| Diffusion Models Crash Course | Outlier | Intuitive DDIM vs DDPM comparison with code |
| DPM-Solver Explained | AI Coffee Break | ODE solver perspective, DPM-Solver vs DDIM, why it works |
| Consistency Models | Yannic Kilcher | Song et al. 2023, consistency training vs distillation |
| Stable Diffusion Deep Dive | Fast.ai | End-to-end SD pipeline with sampler options and CFG |
10. Production Engineering Notes
Throughput Optimization
The sequential nature of diffusion sampling (each step depends on the previous) prevents step-level parallelism. Production throughput improvements come from:
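Batching: process multiple requests simultaneously. For a batch of $B$ images at $S$ steps, total time = $S \times t_\text{batch}$, not $B \times S \times t_\text{single}$. As long as the GPU is not yet saturated, a batch of 4 images takes far less than 4x the time of 1 image for the U-Net forward pass (up to memory limits).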
Compilation: torch.compile() with mode="reduce-overhead" provides 20-40% speedup on A100/H100 by fusing operations and reducing kernel launch overhead. Diffusion models benefit significantly because the same U-Net architecture is called many times with different inputs.
FlashAttention: replaces standard attention ($O(N^2)$ memory in the sequence length $N$) with tiled attention ($O(N)$ memory, 2-4x faster). Essential for SDXL at 1024px where attention sequence lengths are large.
Quantization: FP16 is standard. INT8 with bitsandbytes or onnxruntime provides 20-30% additional speedup with minimal FID degradation. INT4 is more aggressive and produces noticeable quality loss.
Memory During Sampling
For SD 1.5 at 512x512 on a single A100 (80GB):
Component | Memory (FP16)
-----------------------|---------------
U-Net | 1.7 GB
VAE encoder/decoder | 0.4 GB
CLIP text encoder | 0.5 GB
Activations (CFG) | ~2.0 GB
Latent buffers | ~0.1 GB
-----------------------|---------------
Total | ~4.7 GB
CFG doubles the activation memory because both conditional and unconditional passes must fit. At batch size 8, total memory is ~12 GB - comfortable on A100 but tight on a 16GB consumer GPU.
11. Common Mistakes
:::danger Using DDIM with too few steps at high CFG Below 20 steps with DDIM, high guidance scale (CFG $> 7$) amplifies discretization errors. The first-order Euler step in DDIM accumulates error; at 10 steps, the step size $h$ is large enough that the error is visible as oversaturation and edge artifacts. Use DPM-Solver-2 or DPM-Solver-3 for step counts below 20 - they are substantially more robust to coarse discretization due to higher-order accuracy. :::
:::warning Assuming eta=0 always produces better images Deterministic DDIM ($\eta = 0$) trades diversity for consistency. The same prompt always produces the same image (given the same seed), which is desirable for reproducibility but reduces sample diversity. For evaluation on diversity metrics (Recall, Coverage, PRDC), stochastic sampling with $\eta > 0$ typically scores better because it explores more of the distribution. For creative applications requiring variety, use $\eta > 0$. :::
:::warning DDIM inversion reconstruction errors compound with CFG DDIM inversion assumes a deterministic ODE, but if CFG is used at inference with a different conditioning than during inversion, the reconstructed image will drift from the original. The more CFG amplification, the larger the drift. For precise image editing with CFG, use Null-Text Inversion (Mokady et al. 2022) which optimizes the null embedding per image to compensate for CFG-induced drift. Simple DDIM inversion without this correction works well only at low guidance scales. :::
:::warning The $\eta = 1$ case is NOT equivalent to DDPM in sample quality Setting $\eta = 1$ in DDIM with $S = 50$ steps is NOT the same as running DDPM with 1000 steps. DDIM with $\eta = 1$ uses 50 large stochastic steps whereas DDPM uses 1000 small steps. The stochastic noise at large step sizes behaves differently from many small noise injections. Empirically, on CIFAR-10 the DDIM paper reports that $\eta = 1$ at 50 steps achieves clearly worse FID than $\eta = 0$ at 50 steps (8.01 vs 4.67). In that table FID at low step counts gets worse as $\eta$ increases from 0 to 1, so treat any intermediate $\eta$ as a setting to validate on your own model rather than a default. :::
12. Progressive Distillation - The Training-Time Approach
Salimans and Ho (2022) introduced Progressive Distillation: instead of improving the sampler at inference time, distill the multi-step process into a student model that requires half the steps. The process is iterative:
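- Start with a teacher that generates in $N$ steps
- Train a student to match the teacher's output in $N/2$ steps: each student step matches two teacher steps
- Make the student the new teacher and repeat: $N \to N/2 \to N/4 \to \dots$
After 4 iterations: $N/16$ steps (for example, $1024 \to 64$). Quality loss at each halving is small because the student only needs to learn a 2x speedup, not 1000x. The distillation loss:
$$\mathcal{L}_\text{distill}(\theta) = \mathbb{E}_{x_0,\,\epsilon,\,t}\Big[\, w(t)\; d\big(\hat{x}_\theta(x_t, t),\ \tilde{x}_\text{teacher}(x_t, t)\big) \Big]$$
Here $\hat{x}_\theta$ is the student's clean-image prediction, $\tilde{x}_\text{teacher}$ is the target implied by two teacher DDIM steps from the same $x_t$, $w(t)$ is a timestep weighting and $d$ is a distance. Salimans and Ho use a weighted squared error for $d$; a perceptual distance such as LPIPS, as used in consistency distillation, is an alternative that can better preserve high-frequency texture. After 4 distillation iterations starting from 1024 steps, the model generates high-quality images in 64 steps - about 16x faster than the original, with the student model having truly learned to be efficient rather than just applying a different ODE solver.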
Advantage over DDIM/DPM-Solver: the student is retrained, so it can adapt its denoising strategy to the reduced step count. It is not constrained to follow the DDPM ODE trajectory.
Disadvantage: requires the training compute of a full fine-tuning run per distillation iteration. For large models (2.6B SDXL), this is expensive. DDIM and DPM-Solver require zero additional training.
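The halving schedule and its cumulative speedup can be tabulated with a short helper. This is a sketch: the function name is made up, and the 1024-step starting point is an assumed value chosen so that four rounds end at 64 steps.

```python
def distillation_schedule(initial_steps: int, rounds: int) -> list:
    """Sampler step counts after each teacher -> student halving round."""
    steps = [initial_steps]
    for _ in range(rounds):
        if steps[-1] % 2 != 0:
            # Each student step must cover exactly two teacher steps
            raise ValueError("step count must be even before halving")
        steps.append(steps[-1] // 2)
    return steps

schedule = distillation_schedule(1024, 4)   # assumed starting point
speedup = schedule[0] // schedule[-1]
```

Each entry after the first corresponds to one full distillation run, which is where the training cost mentioned above comes from.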
13. Noise Schedules and Their Interaction with Samplers
The noise schedule $\beta_t$ (equivalently $\bar\alpha_t$) determines the rate at which noise is added in the forward process. The choice of schedule has direct implications for how well step-skipping works at low NFE.
Linear Schedule and Its Curvature Profile
The linear schedule increases $\beta_t$ linearly in $t$, meaning $\log \bar\alpha_t$ decreases roughly quadratically. The ODE curvature (second derivative of the trajectory in $t$-space) is concentrated in the middle timesteps. This has a counterintuitive implication for step skipping:
- Skipping early timesteps: very cheap - the ODE is nearly linear there, so large steps incur minimal error
- Skipping middle timesteps: expensive - the ODE has its highest curvature, so large steps cause significant approximation error
- Skipping late timesteps: moderate cost - the model is refining fine details, and curvature is moderate
Practical implication: with DDIM at 50 steps on a linear schedule, the steps in the curvature-heavy middle region are limiting quality. Adding more steps in the middle region helps more than adding steps at the ends.
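A short worked check of the "roughly quadratic" behaviour, assuming the usual definition $\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)$ and writing the linear schedule as $\beta_s = \beta_1 + (s-1)\Delta$ with $\Delta = (\beta_T - \beta_1)/(T-1)$ (this notation is introduced here only for illustration):

$$\log \bar\alpha_t = \sum_{s=1}^{t} \log(1 - \beta_s) \;\approx\; -\sum_{s=1}^{t} \beta_s \;=\; -\Big(\beta_1\, t + \tfrac{\Delta}{2}\, t\,(t-1)\Big)$$

The approximation $\log(1-\beta) \approx -\beta$ holds because every $\beta_s$ is small, so the log signal level falls off quadratically in $t$ rather than linearly.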
Cosine Schedule and Uniform Curvature
The cosine schedule is designed to distribute information removal more uniformly across timesteps. It also distributes ODE curvature more uniformly. This means:
- No single region concentrates error
- Uniform step spacing works reliably well
- At very low step counts (10-15), quality degrades more gracefully with cosine than linear
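For reference, the cosine schedule can be computed in a few lines. This is a minimal sketch assuming the standard form $\bar\alpha_t = f(t)/f(0)$ with $f(t) = \cos^2\!\big(\tfrac{t/T + s}{1 + s} \cdot \tfrac{\pi}{2}\big)$ and offset $s = 0.008$; the function names and the choice $T = 1000$ are illustrative.

```python
import math

def cosine_alpha_bar(t, T=1000, s=0.008):
    """alpha_bar_t under the cosine schedule: f(t) / f(0)."""
    def f(u):
        return math.cos(((u / T + s) / (1.0 + s)) * math.pi / 2.0) ** 2
    return f(t) / f(0)

def half_log_snr(alpha_bar):
    """lambda = log(alpha / sigma) = 0.5 * log(alpha_bar / (1 - alpha_bar))."""
    return 0.5 * math.log(alpha_bar / (1.0 - alpha_bar))

# Print the signal level and log-SNR at evenly spaced timesteps
for t in (100, 300, 500, 700, 900):
    ab = cosine_alpha_bar(t)
    print(f"t={t:4d}  alpha_bar={ab:.4f}  lambda={half_log_snr(ab):+.3f}")
```

Printing the same quantities for a linear schedule side by side is a quick way to see how differently the two spend their timesteps.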
DPM-Solver's Lambda-Space Uniformity
DPM-Solver works best with steps uniformly spaced in $\lambda$ space, where $\lambda_t = \log(\alpha_t / \sigma_t)$ is half the log signal-to-noise ratio. This is because DPM-Solver's Taylor expansion for the score integral assumes a smooth ODE in $\lambda$-space. Uniform $\Delta\lambda$ keeps the local truncation error of every step at the same scale.
```python
import torch


def compute_lambda_schedule(
    alphas_cumprod: torch.Tensor,
    num_steps: int,
) -> torch.Tensor:
    """
    Compute timestep indices for uniform lambda-space spacing.
    Optimal for DPM-Solver-2 and DPM-Solver-3.
    """
    alphas = alphas_cumprod.sqrt()
    sigmas = (1 - alphas_cumprod).sqrt()
    # lambda_t = log(alpha_t / sigma_t); the epsilon guards against sigma = 0
    lambdas = torch.log(alphas / (sigmas + 1e-8))  # (T,)
    lambda_max = lambdas.max().item()
    lambda_min = lambdas.min().item()
    # Uniform spacing in lambda: equal ODE step size in log-SNR
    lambda_seq = torch.linspace(lambda_max, lambda_min, num_steps + 1)
    # Map each lambda value to the nearest timestep index
    timestep_indices = []
    for lam in lambda_seq:
        idx = (lambdas - lam).abs().argmin().item()
        timestep_indices.append(idx)
    # Deduplicate (nearby lambdas can map to the same index) and return in
    # descending order, from the noisiest timestep to the cleanest
    return torch.tensor(sorted(set(timestep_indices), reverse=True))


# Comparison: uniform-t vs uniform-lambda spacing for DPM-Solver
# At 15 steps with cosine schedule:
#   uniform-t:      FID ~3.2 on CIFAR-10
#   uniform-lambda: FID ~2.95 on CIFAR-10
# The lambda spacing aligns step sizes with ODE curvature
```
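To see why the spacing matters, it helps to measure how much log-SNR each step covers under plain uniform-$t$ spacing. The sketch below is stdlib-only and assumes a DDPM-style linear schedule with $\beta$ from `1e-4` to `0.02` over 1000 steps; those values and the helper name `lambda_gaps` are assumptions for illustration.

```python
import math

T = 1000
BETA_START, BETA_END = 1e-4, 0.02  # assumed DDPM-style linear schedule

# Build lambda_t = log(alpha_t / sigma_t) from the cumulative product alpha_bar_t
lambdas = []
alpha_bar = 1.0
for i in range(T):
    beta = BETA_START + (BETA_END - BETA_START) * i / (T - 1)
    alpha_bar *= 1.0 - beta
    lambdas.append(0.5 * math.log(alpha_bar / (1.0 - alpha_bar)))

def lambda_gaps(indices):
    """Absolute log-SNR gap covered by each consecutive pair of timesteps."""
    return [abs(lambdas[a] - lambdas[b]) for a, b in zip(indices, indices[1:])]

num_steps = 10
uniform_t = [round(k * (T - 1) / num_steps) for k in range(num_steps + 1)]
gaps = lambda_gaps(uniform_t)
print("uniform-t gaps in lambda:", [f"{g:.2f}" for g in gaps])
print("largest / smallest gap:", max(gaps) / min(gaps))
```

Uniform-$t$ steps cover very unequal stretches of $\lambda$, so a solver whose error depends on $\Delta\lambda$ takes a few oversized steps; uniform-$\lambda$ spacing removes that imbalance by construction.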
Schedule-Aware Recommendations
| Schedule | DDIM steps | Best DDIM spacing | DPM-Solver steps | Best DPM-Solver spacing |
|---|---|---|---|---|
| Linear | 50 | Uniform-t | 20 | Uniform-lambda |
| Cosine | 50 | Uniform-t | 15 | Uniform-lambda |
| Cosine | 30 | Uniform-t | 12 | Uniform-lambda |
The cosine schedule behaves well under both spacing strategies because its curvature is spread evenly across timesteps. For practical deployments: use DPM-Solver-2 with uniform-lambda spacing regardless of the underlying noise schedule - it is the recommended DPM-Solver spacing for both schedules in the table.
14. Interview Q&A
Q1: Derive the DDIM update rule from first principles. Why can the same trained U-Net be used?
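The DDIM update starts from the same marginals as DDPM: $q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\big)$. Song et al. define a non-Markovian process $q_\sigma(x_{t-1} \mid x_t, x_0)$ with the same marginals. Setting $\sigma_t = 0$ (deterministic) and replacing $x_0$ with the model's estimate $\hat{x}_0 = \big(x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon_\theta(x_t, t)\big) / \sqrt{\bar\alpha_t}$ gives:
$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\, \hat{x}_0 + \sqrt{1 - \bar\alpha_{t-1}}\, \epsilon_\theta(x_t, t)$$
The same trained U-Net works because the DDPM training loss depends only on $q(x_t \mid x_0)$ - the marginal at each timestep. The model was never trained with knowledge of the full Markov joint $q(x_{1:T} \mid x_0)$. Any sampling process sharing the same marginals is valid for inference with the same model.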
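A scalar sketch of the update, covering both the deterministic case and the general $\eta$ case, may help when talking through the derivation. It assumes the standard DDIM variance $\sigma_t = \eta \sqrt{(1 - \bar\alpha_{t-1})/(1 - \bar\alpha_t)}\, \sqrt{1 - \bar\alpha_t / \bar\alpha_{t-1}}$; the function signature and the numbers in the usage lines are illustrative, and `eps_pred` stands in for the U-Net output.

```python
import math
import random

def ddim_step(x_t, eps_pred, abar_t, abar_prev, eta=0.0, rng=random):
    """One DDIM update x_t -> x_{t-1} on scalars; eta=0 is deterministic."""
    # Predicted clean sample from the current noisy sample and predicted noise
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(abar_t)
    # Stochasticity level: eta=0 gives DDIM, eta=1 matches the DDPM posterior variance
    sigma = (eta * math.sqrt((1.0 - abar_prev) / (1.0 - abar_t))
                 * math.sqrt(1.0 - abar_t / abar_prev))
    # Component pointing back toward x_t along the predicted noise
    direction = math.sqrt(1.0 - abar_prev - sigma**2) * eps_pred
    noise = sigma * rng.gauss(0.0, 1.0) if sigma > 0.0 else 0.0
    return math.sqrt(abar_prev) * x0_pred + direction + noise

# Usage: with a perfect noise prediction, eta=0 lands exactly on the
# less-noisy marginal built from the same x_0 and eps
x0_true, eps_true = 1.5, -0.3
abar_t, abar_prev = 0.5, 0.8
x_t = math.sqrt(abar_t) * x0_true + math.sqrt(1.0 - abar_t) * eps_true
x_prev = ddim_step(x_t, eps_true, abar_t, abar_prev)
expected = math.sqrt(abar_prev) * x0_true + math.sqrt(1.0 - abar_prev) * eps_true
print(x_prev, expected)
```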
Q2: What does $\eta = 0$ physically mean in DDIM, and what are its practical advantages?
When $\eta = 0$, $\sigma_t = 0$ at every step - no stochastic noise is injected. The sampling becomes a deterministic function: the same noise $x_T$ always produces the same image. Physically, this corresponds to following a probability flow ODE rather than a stochastic process.
Practical advantages: (1) Reproducibility - given a seed, results are exactly reproducible. (2) Latent space interpolation - interpolating between two $x_T$ vectors (spherical interpolation is the usual choice) produces smooth interpolation in image space. (3) DDIM inversion - the deterministic ODE can be run backwards to encode a real image into latent space. (4) Slightly fewer artifacts at very low step counts because no stochastic noise amplifies discretization errors.
Tradeoff: reduced sample diversity. Two different seeds for the same prompt will produce images that look more similar to each other than with $\eta > 0$.
Q3: Why does DPM-Solver achieve better quality than DDIM in 20 steps, while DDIM needs 50?
DDIM uses first-order Euler integration of the probability flow ODE, accumulating $O(h^2)$ local error per step, where $h$ is the step size in $\lambda$-space. DPM-Solver separates the ODE into a linear part (with an exact solution, the scaling factor $\alpha_t / \alpha_s$) and a nonlinear part (the score function). The linear part is solved exactly; only the nonlinear part needs approximation. DPM-Solver-2 applies a 2nd-order Taylor expansion, achieving $O(h^3)$ local error. With 20 steps (larger $h$ in $\lambda$-space), this gives substantially lower total error than DDIM with 50 steps (smaller $h$ but 1st-order). DPM-Solver-3 extends to $O(h^4)$, working in 10-15 steps. The key insight: exploiting the analytical structure of the ODE is more efficient than brute-force first-order stepping.
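The order difference can be checked numerically on a toy problem where the exact answer is known. The sketch below assumes Gaussian data $x_0 \sim \mathcal{N}(0, 2^2)$ under a variance-preserving parameterisation, so the ideal noise predictor and the exact ODE solution both have closed forms; `eps_model` stands in for a trained network, and all names are illustrative. The two step functions follow the single-step first-order and second-order (midpoint) DPM-Solver updates written in $\lambda$.

```python
import math

DATA_STD = 2.0  # toy assumption: x_0 ~ N(0, DATA_STD^2)

def alpha_sigma(lam):
    """Variance-preserving alpha, sigma with lam = log(alpha / sigma)."""
    alpha_sq = 1.0 / (1.0 + math.exp(-2.0 * lam))
    return math.sqrt(alpha_sq), math.sqrt(1.0 - alpha_sq)

def marginal_var(lam):
    a, s = alpha_sigma(lam)
    return (a * DATA_STD) ** 2 + s ** 2

def eps_model(z, lam):
    """Exact noise prediction for Gaussian data; stands in for a trained U-Net."""
    _, s = alpha_sigma(lam)
    return s * z / marginal_var(lam)

def ddim_step(z, lam_s, lam_t):
    """First-order update (DDIM written in exponential-integrator form)."""
    a_s, _ = alpha_sigma(lam_s)
    a_t, s_t = alpha_sigma(lam_t)
    h = lam_t - lam_s
    return (a_t / a_s) * z - s_t * math.expm1(h) * eps_model(z, lam_s)

def dpm_solver2_step(z, lam_s, lam_t):
    """Second-order update: one extra model call at the lambda midpoint."""
    h = lam_t - lam_s
    lam_m = lam_s + 0.5 * h
    a_s, _ = alpha_sigma(lam_s)
    a_m, s_m = alpha_sigma(lam_m)
    a_t, s_t = alpha_sigma(lam_t)
    u = (a_m / a_s) * z - s_m * math.expm1(0.5 * h) * eps_model(z, lam_s)
    return (a_t / a_s) * z - s_t * math.expm1(h) * eps_model(u, lam_m)

def solve(step_fn, z, lam_start, lam_end, n_steps):
    """Apply step_fn over n_steps uniform steps in lambda."""
    h = (lam_end - lam_start) / n_steps
    for i in range(n_steps):
        z = step_fn(z, lam_start + i * h, lam_start + (i + 1) * h)
    return z

def exact_solution(z, lam_start, lam_end):
    """Closed-form probability flow ODE solution for this Gaussian toy problem."""
    return z * math.sqrt(marginal_var(lam_end) / marginal_var(lam_start))

z0, lam_start, lam_end = 1.0, -3.0, 3.0
truth = exact_solution(z0, lam_start, lam_end)
for n in (5, 10, 20):
    err1 = abs(solve(ddim_step, z0, lam_start, lam_end, n) - truth)
    err2 = abs(solve(dpm_solver2_step, z0, lam_start, lam_end, n) - truth)
    print(f"steps={n:2d}  DDIM error={err1:.5f}  DPM-Solver-2 error={err2:.5f}")
```

Each DPM-Solver-2 step costs two model evaluations, so a fair comparison at equal NFE pairs $n$ second-order steps with $2n$ DDIM steps.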
Q4: How does DDIM inversion enable image editing, and what are its limitations?
DDIM inversion runs the deterministic ODE from $t = 0$ to $t = T$: given a real image $x_0$, it finds $x_T$ such that $\text{DDIM}(x_T) \approx x_0$. With $x_T$ in hand, you can change the text conditioning and re-run DDIM decoding - the structure of the image is preserved (because $x_T$ encodes spatial structure in the noise trajectory) while semantics change according to the new prompt.
Limitations: (1) Imperfect reconstruction at low step counts due to discretization error - use 50+ inversion steps for high fidelity. (2) CFG during decoding but not inversion causes semantic drift - use Null-Text Inversion to correct for this. (3) Very large semantic changes require attention manipulation techniques (Prompt-to-Prompt), not just re-conditioning. (4) Works only for $\eta = 0$; stochastic DDIM cannot be inverted.
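Limitation (1) is easy to reproduce on a toy problem. The sketch below inverts a scalar "image" to noise and decodes it again with the same deterministic DDIM step, then reports the round-trip error for several step counts. It assumes Gaussian data with a closed-form noise predictor (`eps_model` is a stand-in for a trained network) and a variance-preserving schedule indexed by $\lambda = \log(\alpha/\sigma)$; the endpoints `LAM_CLEAN` and `LAM_NOISY` are arbitrary illustrative choices.

```python
import math

DATA_STD = 2.0  # toy assumption: x_0 ~ N(0, DATA_STD^2), so the ideal eps-model is known

def alpha_sigma(lam):
    """Variance-preserving alpha, sigma with lam = log(alpha / sigma)."""
    alpha_sq = 1.0 / (1.0 + math.exp(-2.0 * lam))
    return math.sqrt(alpha_sq), math.sqrt(1.0 - alpha_sq)

def eps_model(z, lam):
    """Exact noise prediction for Gaussian data; stands in for a trained U-Net."""
    a, s = alpha_sigma(lam)
    return s * z / ((a * DATA_STD) ** 2 + s ** 2)

def ddim_step(z, lam_from, lam_to):
    """Deterministic DDIM step; works in either direction along the ODE."""
    a_f, s_f = alpha_sigma(lam_from)
    a_t, s_t = alpha_sigma(lam_to)
    eps = eps_model(z, lam_from)          # noise is always evaluated at the current point
    x0_pred = (z - s_f * eps) / a_f
    return a_t * x0_pred + s_t * eps

def run(z, lam_start, lam_end, n_steps):
    h = (lam_end - lam_start) / n_steps
    for i in range(n_steps):
        z = ddim_step(z, lam_start + i * h, lam_start + (i + 1) * h)
    return z

LAM_CLEAN, LAM_NOISY = 3.0, -3.0  # illustrative endpoints (nearly clean, nearly pure noise)

def round_trip_error(x_clean, n_steps):
    x_noisy = run(x_clean, LAM_CLEAN, LAM_NOISY, n_steps)  # inversion: image -> noise
    x_rec = run(x_noisy, LAM_NOISY, LAM_CLEAN, n_steps)    # decoding: noise -> image
    return abs(x_rec - x_clean)

for n in (5, 50, 200):
    print(f"steps={n:3d}  round-trip error={round_trip_error(1.0, n):.5f}")
```

The mismatch arises because inversion evaluates the noise prediction at the cleaner end of each step while decoding evaluates it at the noisier end; the gap shrinks as the steps get smaller.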
Q5: How would you choose between DDIM, DPM-Solver, and LCM for a production system?
DDIM ($\eta = 0$, 30-50 steps): simplest implementation, well-understood, good for image editing via inversion. Prefer when latency allows 30-50 steps and you need inversion capability or are debugging pipeline issues.
DPM-Solver-2 or DPM-Solver++ (20 steps): the current production sweet spot. Better quality per NFE than DDIM, stable with high CFG (use DPM-Solver++ at high guidance scales), minimal code change from DDIM. Default choice for new production deployments in 2024-2025.
LCM / LCM-LoRA (4 steps): for real-time or near-real-time applications (interactive tools, live preview). Requires a separately distilled model - you cannot just swap the sampler on a DDPM model. Quality ceiling is lower than DPM-Solver at 20 steps, but 5-10x lower latency makes it the right choice for user-facing interactive applications.
Q6: What is the relationship between the DDPM variance schedule and DDIM's step-skipping quality?
The variance schedule determines how the timesteps of the probability flow ODE are spaced in $\lambda$ space. The linear schedule concentrates ODE curvature in the middle timesteps and has very low curvature at early and late steps. This means step-skipping is nearly free at very early and very late steps. The cosine schedule distributes curvature more uniformly, so any skipping pattern incurs more uniform error.
For DDIM at 50 steps: uniform spacing works well with both schedules. For DPM-Solver at 12 steps: uniform spacing in $\lambda$-space is optimal regardless of the underlying noise schedule, because DPM-Solver's Taylor expansion is most accurate with uniform $\Delta\lambda$. The practical recommendation: use uniform-in-$\lambda$ spacing for DPM-Solver, uniform-in-$t$ spacing for DDIM.
The schedule and sampler must be treated as a joint design decision. For any new production deployment: (1) identify which noise schedule was used during training, (2) choose DPM-Solver-2 as the default sampler, (3) set steps uniform in $\lambda$-space, (4) validate FID vs step count on a held-out sample. In practice this combination typically gets close to the full-step FID at 15-20 NFE, but step (4) is what confirms it for a given model.
Rectified Flow and SD3/FLUX
Rectified flow (Liu et al. 2022, used in SD3 and FLUX) replaces the cosine/linear noise schedules entirely. Instead of defining a noising process with gradually increasing variance, rectified flow interpolates linearly between noise and data:
$$x_t = (1 - t)\, x_0 + t\, \epsilon, \qquad t \in [0, 1]$$
The ideal reverse ODE follows straight lines between noise and data. Why does this matter for sampling? Straight-line trajectories have zero curvature, so a first-order solver (the role DDIM plays for DDPM-trained models) with 10-20 steps incurs minimal approximation error. Learned trajectories are only approximately straight, which is why a handful of steps is still needed. Empirically, SD3 and FLUX produce high-quality images at 20-28 steps, and acceptable quality at even fewer, without needing higher-order solvers. The guidance scale also tends to be lower (3-5), which is often attributed to rectified flow training being more stable for the conditioning signal.
The tradeoff: rectified flow requires retraining - you cannot apply rectified flow sampling to a DDPM-trained model. But for models trained from scratch (SD3, FLUX), it is generally the better choice for sampling efficiency.
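The zero-curvature argument can be illustrated with a scalar Euler sampler. The sketch below assumes the convention $x_t = (1 - t)\,x_0 + t\,\epsilon$, so the ideal velocity is the constant $\epsilon - x_0$; the `curved_velocity` field is a made-up contrast case, not something a real model produces, and the data and noise values are hypothetical.

```python
def euler_sample(velocity, x_start, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=1 (noise) to t=0 (data) with Euler steps."""
    x, dt = x_start, 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity(x, t)
    return x

x0, eps = 0.8, -1.3  # hypothetical data point and noise sample

def straight_velocity(x, t):
    """Velocity of the straight path (1 - t) * x0 + t * eps: constant in t."""
    return eps - x0

def curved_velocity(x, t):
    """Velocity of the curved path x0 + t^2 * (eps - x0), used only for contrast."""
    return 2.0 * t * (eps - x0)

for n in (1, 10):
    straight_err = abs(euler_sample(straight_velocity, eps, n) - x0)
    curved_err = abs(euler_sample(curved_velocity, eps, n) - x0)
    print(f"steps={n:2d}  straight error={straight_err:.2e}  curved error={curved_err:.2e}")
```

With a constant velocity, Euler integration is exact at any step count; the curved path keeps a discretization error that only shrinks as steps are added. A trained rectified-flow model sits between these two extremes.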
This lesson is part of the Diffusion Models module. Next: Latent Diffusion Models - The Architecture Behind Stable Diffusion.
