How DDIM reduces 1000-step DDPM sampling to 10-50 steps via a non-Markovian process, the eta parameter, DDIM inversion for image editing, and DPM-Solver as the current production standard.

DDIM and Accelerated Diffusion Sampling

:::note Reading time: ~50 minutes | Interview relevance: Very High | Target roles: ML Engineer, Research Engineer, AI Engineer, Applied Scientist :::

The Real Interview Moment

You are interviewing for a generative AI platform role at a company deploying text-to-image at scale. The team has a DDPM model trained for six months on a cluster of 256 A100s. It produces stunning results. But deployment is blocked: each image takes 15 minutes on a V100 GPU. The product team needs sub-5-second latency for the consumer app. The interviewer asks: "How would you solve this, without retraining?"

Most engineers answer: "use a smaller model" or "get a faster GPU" or "quantize it." These miss the point. A 4x smaller model would degrade quality significantly. A 4x faster GPU would still produce 4-minute images. The correct answer requires understanding that DDPM's 1000-step sampling constraint is an artifact of the sampling algorithm, not the model itself.

The trained U-Net has no knowledge of how many sampling steps will be used at inference time. It was trained only against the marginals $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}x_0, (1-\bar\alpha_t)I)$ . Any sampling process that respects these marginals can reuse the same trained model. Song et al. (2020) showed that the DDPM forward process is just one of infinitely many processes that yield the same marginals - and that the deterministic member of this family enables a 10x to 50x speedup with no retraining and a modest quality loss.

This insight - that diffusion sampling is a differential equation problem with many valid solvers, not a fixed sequential recipe - underpins everything that came after: DDIM, PLMS, DPM-Solver, DPM-Solver++, Consistency Models. Understanding it is what separates engineers who deploy diffusion models from engineers who understand why they work.

Why This Exists - The Sampling Bottleneck

DDPM trains a U-Net to predict the noise $\varepsilon$ added at any timestep $t$ . Training is efficient: sample a random $t$ , a random $\varepsilon$ , compute $x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\varepsilon$ in one step, and regress the predicted noise against $\varepsilon$ . The closed-form forward process makes each training step O(1). Clean and fast.
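The closed-form jump to $x_t$ can be sketched on a single scalar "pixel". This is a minimal illustration, not the lesson's training code: the linear beta schedule (1e-4 to 0.02 over 1000 steps) and the helper names `make_alpha_bar` and `noise_to_timestep` are assumptions introduced here.

```python
import math
import random

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def noise_to_timestep(x0, t, alpha_bar, rng):
    """Jump straight to x_t with one Gaussian draw, no loop over earlier steps."""
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

alpha_bar = make_alpha_bar()
rng = random.Random(0)
t = rng.randrange(len(alpha_bar))      # random timestep, as in training
x_t, eps = noise_to_timestep(0.5, t, alpha_bar, rng)
# A network eps_theta(x_t, t) would be regressed against `eps` at this point.
print(t, x_t, eps)
```

Only `alpha_bar[t]` enters the computation, which is the property the rest of the lesson builds on.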

The sampling process is the problem. To generate from DDPM, you must simulate the reverse Markov chain:

$p_\theta(x_{t-1}|x_t) = \mathcal{N}\!\left(\frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\varepsilon_\theta(x_t,t)\right),\; \beta_t I\right)$

This requires $T=1000$ sequential steps. Each step calls the U-Net once. Steps cannot be parallelized because step $t-1$ depends on the output of step $t$ . On a V100 GPU running an 860M-parameter U-Net, each forward pass takes approximately 0.9 seconds. 1000 passes at 0.9 seconds each is 900 seconds: 15 minutes per image.
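The arithmetic above generalizes to any step count. A small sketch, taking the 0.9 seconds per forward pass quoted above as given (the function name `sampling_seconds` is made up for this example):

```python
def sampling_seconds(nfe, seconds_per_nfe=0.9):
    """Sequential sampling time: one network call per step, no parallelism."""
    return nfe * seconds_per_nfe

for steps in (1000, 50, 20):
    secs = sampling_seconds(steps)
    print(f"{steps:5d} steps -> {secs:6.1f} s ({secs / 60:.1f} min)")
```

Because the cost is linear in the number of network calls, cutting steps is the only lever that does not touch the model or the hardware.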

The problem is not the model size. It is that the DDPM derivation assumes the reverse process must mirror the Markovian forward process step-by-step. Song et al. (2020) asked: does it have to?

The answer is no. The DDPM training objective depends only on marginals $q(x_t|x_0)$ , not on the full joint distribution $q(x_1, ..., x_T|x_0)$ . There are infinitely many joint distributions sharing the same marginals. DDIM defines one that enables non-sequential large steps - reducing inference from 1000 to 20-50 steps with the same trained model.

Historical Context - The Non-Markovian Insight

By mid-2020, DDPM had set state-of-the-art FID on CIFAR-10 but was obviously too slow for practical deployment. Jiaming Song, Chenlin Meng, and Stefano Ermon at Stanford noticed something crucial: the DDPM training loss is:

$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \varepsilon}\!\left[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2\right]$

where $x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\varepsilon$ . This loss depends only on the marginal $q(x_t|x_0)$ - defined entirely by $\bar\alpha_t$ . It does not depend on $q(x_t|x_{t-1})$ - the transition probability between consecutive steps.

This is the key insight: the trained U-Net has no knowledge of whether the forward process is Markovian. It only knows the marginal distributions. Therefore, any alternative forward process sharing the same marginals $q(x_t|x_0)$ can be reversed using the same trained U-Net.

DDIM defines a family of non-Markovian processes indexed by a stochasticity parameter $\eta$ , all sharing the DDPM marginals. Setting $\eta=0$ gives a fully deterministic reverse process - an ODE that can be solved with large step sizes, enabling 20-50x fewer steps. This was published as DDIM (Denoising Diffusion Implicit Models) at ICLR 2021.

1. Neural Function Evaluations - The Cost Unit

The cost of diffusion sampling is measured in NFE (Neural Function Evaluations) - the number of times the denoising network is called. Each NFE is one U-Net forward pass.

Method	NFE	Approx. time at ~0.9 s/NFE	Quality (FID, CIFAR-10)
DDPM (original)	1000	~900s	3.17
DDIM ( $\eta=0$ , 250 steps)	250	~230s	4.16
DDIM ( $\eta=0$ , 100 steps)	100	~90s	4.16
DDIM ( $\eta=0$ , 50 steps)	50	~45s	4.67
PLMS (25 steps)	25	~22s	~3.5
DPM-Solver-2 (20 NFE)	20	~18s	~3.0
DPM-Solver-3 (12 NFE)	12	~11s	~3.0
LCM (4 steps)	4	~4s	~3.5
LCM (1 step)	1	~1s	~6.0

(Approximate values. Times assume the ~0.9 s per forward pass quoted earlier; FID is reported for CIFAR-10 models, so the two columns describe different model scales. Exact numbers depend on model size and hardware.)

The sequential dependency is the core constraint: step $t-1$ requires $x_t$ , which requires step $t$ to finish. You cannot parallelize across timesteps within a single sample. All speedups come from either taking larger steps (DDIM, DPM-Solver) or fundamentally changing the inference paradigm (Consistency Models).

2. The DDIM Forward Process - Same Marginals, Different Joint

Keeping Marginals, Losing the Markov Property

DDIM defines a family of non-Markovian forward processes parameterized by $\sigma_t \geq 0$ :

$q_\sigma(x_{t-1}|x_t, x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\cdot\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{\sqrt{1-\bar\alpha_t}},\; \sigma_t^2 I\right)$

Breaking this down:

$\sqrt{\bar\alpha_{t-1}}\,x_0$ - the "target direction" toward the clean image at timestep $t-1$
$\sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\cdot\frac{x_t - \sqrt{\bar\alpha_t}x_0}{\sqrt{1-\bar\alpha_t}}$ - the "noise direction," pointing from $x_0$ toward $x_t$ (the predicted noise direction), scaled to fill the remaining variance
$\sigma_t^2 I$ - stochastic noise injected at this step

The term $\frac{x_t - \sqrt{\bar\alpha_t}x_0}{\sqrt{1-\bar\alpha_t}}$ is the noise direction - it equals the actual $\varepsilon$ if the forward model is exact.

Verification that marginals are preserved: by induction from $t=T$ downward, marginalizing $q_\sigma(x_{t-1}|x_t, x_0)\,q_\sigma(x_t|x_0)$ over $x_t$ gives $q_\sigma(x_{t-1}|x_0) = \mathcal{N}(\sqrt{\bar\alpha_{t-1}}x_0, (1-\bar\alpha_{t-1})I)$ - identical to DDPM. The joint $q_\sigma(x_{t-1}, x_t|x_0)$ is different from DDPM (it is non-Markovian), but the marginal at each timestep is the same. This is why the same trained U-Net works.
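The marginal-preservation claim can be checked numerically on scalars. The sketch below uses made-up values for $\bar\alpha_t$ and $\bar\alpha_{t-1}$ and fixed "draws" for the Gaussians; `sample_prev` is a hypothetical helper that evaluates the transition above.

```python
import math

def sample_prev(x_t, x0, ab_t, ab_prev, sigma, z):
    """One draw from q_sigma(x_{t-1} | x_t, x0), given a unit Gaussian draw z."""
    noise_dir = (x_t - math.sqrt(ab_t) * x0) / math.sqrt(1.0 - ab_t)
    mean = math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev - sigma ** 2) * noise_dir
    return mean + sigma * z

ab_t, ab_prev = 0.50, 0.60      # illustrative values, not a real schedule
x0, eps, z = 0.7, -0.4, 1.1     # arbitrary fixed draws
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1.0 - ab_t) * eps

for sigma in (0.0, 0.1, 0.3):
    k = math.sqrt(1.0 - ab_prev - sigma ** 2)
    # Substituting x_t shows x_{t-1} = sqrt(ab_prev)*x0 + k*eps + sigma*z
    direct = math.sqrt(ab_prev) * x0 + k * eps + sigma * z
    via_posterior = sample_prev(x_t, x0, ab_t, ab_prev, sigma, z)
    noise_variance = k ** 2 + sigma ** 2   # eps and z are independent
    print(sigma, abs(direct - via_posterior), noise_variance, 1.0 - ab_prev)
```

For every choice of sigma the mean coefficient on $x_0$ is $\sqrt{\bar\alpha_{t-1}}$ and the total noise variance is $1-\bar\alpha_{t-1}$, which is exactly the DDPM marginal.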

The $\eta$ Parameter - Controlling Stochasticity

DDIM introduces $\eta \in [0, 1]$ to parameterize how much stochastic noise is injected:

$\sigma_t(\eta) = \eta \cdot \sqrt{\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}} \cdot \sqrt{1 - \frac{\bar\alpha_t}{\bar\alpha_{t-1}}}$

The term $\sqrt{\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}} \cdot \sqrt{1 - \frac{\bar\alpha_t}{\bar\alpha_{t-1}}}$ equals exactly $\sqrt{\tilde\beta_t}$ - the DDPM posterior standard deviation. So:

$\eta = 0$ : $\sigma_t = 0$ - no noise injected - fully deterministic DDIM - the true DDIM case
$\eta = 1$ : $\sigma_t = \sqrt{\tilde\beta_t}$ - recovers the DDPM posterior variance - stochastic sampling identical to DDPM
$\eta \in (0,1)$ : interpolation between deterministic and DDPM behavior
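The identity between $\sigma_t(1)$ and the DDPM posterior standard deviation can be confirmed numerically. This sketch assumes a linear beta schedule (1e-4 to 0.02, 1000 steps); the helper names are illustrative.

```python
import math

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def cumulative_alpha_bar(betas):
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out

def ddim_sigma(ab_t, ab_prev, eta):
    """sigma_t(eta) as defined above."""
    return eta * math.sqrt((1 - ab_prev) / (1 - ab_t)) * math.sqrt(1 - ab_t / ab_prev)

betas = linear_betas()
alpha_bar = cumulative_alpha_bar(betas)
t = 500
ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
beta_tilde = (1 - ab_prev) / (1 - ab_t) * betas[t]   # DDPM posterior variance

print("eta=0  :", ddim_sigma(ab_t, ab_prev, 0.0))
print("eta=0.5:", ddim_sigma(ab_t, ab_prev, 0.5))
print("eta=1  :", ddim_sigma(ab_t, ab_prev, 1.0), "vs sqrt(beta_tilde):", math.sqrt(beta_tilde))
```

The match at eta=1 relies on $1 - \bar\alpha_t/\bar\alpha_{t-1} = \beta_t$ , which only holds for consecutive timesteps; with skipped steps the same formula is used but no longer equals a single-step DDPM variance.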

The name "DDIM" (Denoising Diffusion Implicit Models) comes from the $\eta=0$ case: once $x_T$ is drawn, sampling is a fixed deterministic map from latent to image, which is how implicit probabilistic models generate samples, rather than a chain of explicit random transitions.

3. The DDIM Update Rule

Step-by-Step Derivation

At inference time, the U-Net predicts $\varepsilon_\theta(x_t, t)$ - an estimate of the noise in $x_t$ . The DDIM update moves from $x_t$ to $x_{t-1}$ in two logical sub-steps:

Sub-step 1 - Predict the clean image $\hat{x}_0$ :

Using the forward process reparameterization $x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\varepsilon$ , solve for $x_0$ :

$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\varepsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}$

This is a per-step estimate of the clean image. Early in sampling (large $t$ , high noise), this estimate is rough. Late in sampling (small $t$ , low noise), it is increasingly accurate.

Sub-step 2 - Compute $x_{t-1}$ via the DDIM posterior:

Plug $\hat{x}_0$ into the DDIM transition distribution:

$x_{t-1} = \underbrace{\sqrt{\bar\alpha_{t-1}}\,\hat{x}_0}_{\text{direction toward clean image}} + \underbrace{\sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\cdot\varepsilon_\theta(x_t,t)}_{\text{direction toward } x_t} + \underbrace{\sigma_t\,\varepsilon_t}_{\text{random noise}}$

where $\varepsilon_t \sim \mathcal{N}(0,I)$ when $\eta > 0$ and $\varepsilon_t = 0$ when $\eta = 0$ .

This is the complete DDIM update. Substituting $\hat{x}_0$ explicitly:

$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\cdot\frac{x_t - \sqrt{1-\bar\alpha_t}\,\varepsilon_\theta}{\sqrt{\bar\alpha_t}} + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\cdot\varepsilon_\theta + \sigma_t\varepsilon_t$
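The update can be written as a few lines of scalar code. The sketch below feeds the step the true noise instead of a network prediction (an idealization; a real $\varepsilon_\theta$ only estimates it), with made-up $\bar\alpha$ values chosen to be far apart.

```python
import math

def ddim_step(x_t, eps_pred, ab_t, ab_prev, sigma=0.0, z=0.0):
    """One DDIM update: predict x0, then re-noise to the target level."""
    x0_hat = (x_t - math.sqrt(1 - ab_t) * eps_pred) / math.sqrt(ab_t)
    return (math.sqrt(ab_prev) * x0_hat
            + math.sqrt(1 - ab_prev - sigma ** 2) * eps_pred
            + sigma * z)

x0, eps = 0.8, -1.3
ab_t, ab_prev = 0.05, 0.70     # a deliberately large jump in noise level
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1 - ab_t) * eps

x_prev = ddim_step(x_t, eps, ab_t, ab_prev)   # idealized: eps_pred is the true noise
expected = math.sqrt(ab_prev) * x0 + math.sqrt(1 - ab_prev) * eps
print(x_prev, expected)
```

With a perfect noise estimate the jump lands on the correct marginal no matter how far apart the two noise levels are, so the error of a large DDIM step comes from how much the network's prediction changes along the way, not from the update formula itself.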

The ODE Perspective ( $\eta = 0$ )

When $\eta = 0$ , there is no stochastic term and the DDIM update becomes the Euler discretization of a probability flow ODE. In the rescaled variables $\bar{x} = x/\sqrt{\bar\alpha_t}$ and $\bar\sigma_t = \sqrt{(1-\bar\alpha_t)/\bar\alpha_t}$ (the noise-to-signal amplitude ratio), the ODE is:

$\frac{d\bar{x}}{d\bar\sigma} = \varepsilon_\theta(x_t, t) = -\bar\sigma\,\nabla_{\bar{x}} \log p_{\bar\sigma}(\bar{x})$

where $\nabla_{\bar{x}} \log p_{\bar\sigma}(\bar{x}) \approx -\varepsilon_\theta(x_t,t)/\bar\sigma$ is the estimated score of the rescaled variable (in the original variable $x_t$ , the score is $-\varepsilon_\theta(x_t,t)/\sqrt{1-\bar\alpha_t}$ ). This ODE perspective is what enables higher-order solvers (DPM-Solver) to dramatically outperform the first-order DDIM Euler steps.
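The Euler claim can be checked directly: in the rescaled coordinates $x/\sqrt{\bar\alpha_t}$ and $\bar\sigma_t$ , a deterministic DDIM step is the plain update "position plus step size times $\varepsilon_\theta$ ". A scalar sketch with arbitrary test values (function names are illustrative):

```python
import math

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """Deterministic (eta = 0) DDIM update."""
    x0_hat = (x_t - math.sqrt(1 - ab_t) * eps_pred) / math.sqrt(ab_t)
    return math.sqrt(ab_prev) * x0_hat + math.sqrt(1 - ab_prev) * eps_pred

def euler_step_rescaled(x_t, eps_pred, ab_t, ab_prev):
    """Euler step on d(x_bar)/d(sigma_bar) = eps, then map back to x."""
    sig_t = math.sqrt((1 - ab_t) / ab_t)
    sig_prev = math.sqrt((1 - ab_prev) / ab_prev)
    x_bar = x_t / math.sqrt(ab_t)
    x_bar_prev = x_bar + (sig_prev - sig_t) * eps_pred
    return x_bar_prev * math.sqrt(ab_prev)

x_t, eps_pred = 0.3, -0.7          # arbitrary test values
ab_t, ab_prev = 0.2, 0.6
print(ddim_step(x_t, eps_pred, ab_t, ab_prev),
      euler_step_rescaled(x_t, eps_pred, ab_t, ab_prev))
```

The two functions agree to floating-point precision, which is why any improvement over DDIM has to come from a better integrator rather than a different parameterization.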

The ODE is deterministic: starting from the same $x_T$ , the same image is produced every time. This determinism is what enables DDIM inversion (Section 5) and latent space interpolation.

4. Accelerated Sampling - Skipping Timesteps

Why Step Skipping Works

With DDPM, you must step through $T, T-1, T-2, \ldots, 1$ - all 1000 steps. With DDIM, you can jump from $x_{t_n}$ to $x_{t_{n-1}}$ for any subset $\{t_1, t_2, \ldots, t_S\} \subset \{1, \ldots, T\}$ of size $S \ll T$ .

The noise schedule $\{\bar\alpha_t\}$ is smooth enough that skipping from $t=750$ to $t=500$ (a step of 250) is a valid (larger) Euler step on the ODE. The quality degrades gracefully because the ODE is well-conditioned: the curvature along the trajectory is bounded, so large steps introduce bounded approximation error.

Why can't DDPM skip steps? Because DDPM sampling uses the stochastic Markov transition $p_\theta(x_{t-1}|x_t)$ , which is derived assuming unit steps. Taking a 250-step jump in the DDPM sampler introduces massive discretization error in the Gaussian approximation. The DDIM ODE is more robust because the deterministic trajectory is smoother than a stochastic path.

Step Selection Strategies

Given $T=1000$ and $S=50$ desired steps, how do you choose which $S$ timesteps to use?

Uniform spacing (most common): $t_n = \lfloor n \cdot T/S \rfloor$ for $n = 1, \ldots, S$ . Select every 20th timestep: $\{20, 40, 60, \ldots, 1000\}$ .

Quadratic spacing (better for very few steps): $t_n = \lfloor (n/S)^2 \cdot T \rfloor$ . Concentrates more steps at low noise levels ( $t$ small), where fine image detail is resolved, and takes larger jumps at high noise.

Log spacing in $\lambda$ -space (a default for DPM-Solver): uniform spacing in $\lambda_t = \log(\sqrt{\bar\alpha_t}/\sqrt{1-\bar\alpha_t})$ - a natural choice because DPM-Solver's Taylor expansion and error terms are expressed in the step size in $\lambda$ .

T=1000, S=20:

Uniform steps:    [50, 100, 150, 200, 250, 300, ..., 950, 1000]
Quadratic steps:  [2, 10, 22, 40, 62, 90, 122, ..., 902, 1000]
                  (more steps at low noise, larger jumps at high noise)

For DDIM at 20-50 steps, uniform spacing works well. For DPM-Solver at 10-15 steps, quadratic or $\lambda$ -space spacing is better.
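The uniform and quadratic formulas translate directly into code. This is a sketch of the two formulas as written above (1-based timesteps, ascending order), using integer arithmetic for the floor so no rounding surprises appear; the function names are illustrative, and the quadratic version deduplicates and clamps to 1 because small $n$ can otherwise collide at 0 when $S$ is large.

```python
def uniform_steps(T, S):
    """t_n = floor(n * T / S) for n = 1..S."""
    return [(n * T) // S for n in range(1, S + 1)]

def quadratic_steps(T, S):
    """t_n = floor((n / S)^2 * T) for n = 1..S, deduplicated, minimum 1."""
    return sorted({max(1, (n * n * T) // (S * S)) for n in range(1, S + 1)})

T, S = 1000, 20
uni = uniform_steps(T, S)
quad = quadratic_steps(T, S)
print("uniform  :", uni[:6], "...", uni[-2:])
print("quadratic:", quad[:6], "...", quad[-2:])
print("quadratic gaps, first vs last:", quad[1] - quad[0], quad[-1] - quad[-2])
```

Samplers consume these lists in descending order; the gaps printed on the last line make the uneven allocation of the quadratic schedule explicit.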

5. DDIM Inversion - Encoding Real Images

A unique and powerful property of deterministic DDIM ( $\eta=0$ ): the ODE can be run backwards. Given a real image $x_0$ , you can find the noise vector $x_T$ such that $\text{DDIM-sample}(x_T) \approx x_0$ . This is called DDIM inversion.

Why Inversion Enables Image Editing

Once you have $x_T^* = \text{DDIM-invert}(x_0)$ , you can:

Change the text conditioning from $c_1$ to $c_2$
Run DDIM sampling from $x_T^*$ with the new conditioning
Get an edited image that preserves the structure of $x_0$ but follows $c_2$

This works because $x_T$ encodes the "structure" of the image as noise - the spatial configuration of the noise vector determines which image the ODE trajectory converges to. With the same starting noise but different text conditioning, the trajectory diverges while preserving broad structural similarity.

Applications: text-guided style transfer, changing object attributes ("make the cat orange"), editing specific regions, Prompt-to-Prompt-style attention manipulation.

Inversion Algorithm

DDIM inversion is the forward direction of the ODE (from $t=0$ to $t=T$ ), using the same deterministic update but in reverse:

$x_{t+1} = \sqrt{\bar\alpha_{t+1}}\,\hat{x}_0^{(t)} + \sqrt{1-\bar\alpha_{t+1}}\,\varepsilon_\theta(x_t, t)$

where $\hat{x}_0^{(t)} = (x_t - \sqrt{1-\bar\alpha_t}\,\varepsilon_\theta(x_t, t))/\sqrt{\bar\alpha_t}$ is the predicted clean image at each inversion step.

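Limitations: inversion is not perfectly reversible. Each inversion step evaluates $\varepsilon_\theta$ at $(x_t, t)$ , whereas exactly undoing a sampling step would require $\varepsilon_\theta(x_{t+1}, t+1)$ , which is not yet known; the gap between the two is a discretization error that shrinks as steps get smaller. With 50+ steps, $\text{Decode}(\text{Encode}(x_0)) \approx x_0$ with small visible artifacts. With 10-20 steps, the reconstruction error becomes noticeable. For precise editing applications, use more inversion steps than generation steps.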

A second limitation: DDIM inversion assumes $\eta=0$ (deterministic). With CFG at inference but not during inversion, there is a conditioning mismatch that causes drift. Null-Text Inversion (Mokady et al. 2022) addresses this by optimizing the null embedding per image to correct for CFG-induced drift.
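The first limitation, discretization error in the round trip, can be seen in a toy setting. The sketch below replaces the U-Net with the exact noise predictor for scalar data drawn from $\mathcal{N}(0,1)$ , for which $\mathbb{E}[\varepsilon|x_t] = \sqrt{1-\bar\alpha_t}\,x_t$ , and assumes a linear beta schedule. Everything here is illustrative; a real network will not behave this cleanly.

```python
import math

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule, returned as cumulative products."""
    out, running = [], 1.0
    for i in range(T):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * i / (T - 1))
        out.append(running)
    return out

def eps_unit_gaussian(x_t, ab_t):
    """Exact E[eps | x_t] when x0 ~ N(0, 1); stands in for a trained network."""
    return math.sqrt(1 - ab_t) * x_t

def ddim_move(x, ab_from, ab_to):
    """Deterministic DDIM update between two noise levels, in either direction."""
    eps = eps_unit_gaussian(x, ab_from)
    x0_hat = (x - math.sqrt(1 - ab_from) * eps) / math.sqrt(ab_from)
    return math.sqrt(ab_to) * x0_hat + math.sqrt(1 - ab_to) * eps

def round_trip_error(x0, num_steps, alpha_bar):
    T = len(alpha_bar)
    ts = list(range(0, T, T // num_steps)) + [T - 1]    # ascending timesteps
    x = x0
    for a, b in zip(ts[:-1], ts[1:]):                   # invert: low -> high noise
        x = ddim_move(x, alpha_bar[a], alpha_bar[b])
    for a, b in zip(reversed(ts[1:]), reversed(ts[:-1])):   # sample: high -> low noise
        x = ddim_move(x, alpha_bar[a], alpha_bar[b])
    return abs(x - x0)

alpha_bar = make_alpha_bar()
for steps in (10, 50, 200):
    print(steps, round_trip_error(1.0, steps, alpha_bar))
```

In this toy case the reconstruction error shrinks steadily as the number of steps grows, matching the advice to spend more steps on inversion when fidelity matters.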

6. DPM-Solver - Higher-Order Diffusion ODE Solvers

Why First-Order is Not Enough

DDIM applies the first-order Euler method to the probability flow ODE. Each step incurs $O(h^2)$ local truncation error, where $h = \Delta\lambda$ is the step size in $\lambda$ -space. Over $S$ steps with $h \propto 1/S$ , the accumulated global error is $O(S \cdot h^2) = O(h) = O(1/S)$ . This is acceptable for $S=50$ , but at $S=10$ the Euler steps are so large that curvature in the ODE cannot be ignored.
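The effect of solver order is easy to see on a toy problem. The sketch below integrates $dy/dt = -y$ from 0 to 1 (not the diffusion ODE; chosen only because the exact answer $e^{-1}$ is known) with a first-order Euler step and a second-order midpoint step.

```python
import math

def integrate(step_fn, n_steps):
    """Integrate dy/dt = -y from t=0 to t=1 starting at y=1."""
    y, h = 1.0, 1.0 / n_steps
    for _ in range(n_steps):
        y = step_fn(y, h)
    return y

def euler(y, h):
    """First order: use the slope at the start of the step."""
    return y + h * (-y)

def midpoint(y, h):
    """Second order: re-evaluate the slope at the half step (2 evaluations)."""
    y_half = y + 0.5 * h * (-y)
    return y + h * (-y_half)

exact = math.exp(-1.0)
for name, fn in (("euler", euler), ("midpoint", midpoint)):
    e10 = abs(integrate(fn, 10) - exact)
    e20 = abs(integrate(fn, 20) - exact)
    print(f"{name:9s} err@10={e10:.6f} err@20={e20:.6f} ratio={e10 / e20:.2f}")
```

Doubling the step count roughly halves the Euler error but roughly quarters the midpoint error, which is why a second-order method can afford far fewer steps even though each step costs two function evaluations.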

Lu et al. (2022) recognized that the diffusion ODE has special structure: the linear part (the signal-preserving component) has an exact analytical solution. Only the nonlinear part (the score function term) needs numerical approximation. Separating these gives DPM-Solver.

The DPM-Solver Formulation

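Following DPM-Solver's notation, write $\alpha_t = \sqrt{\bar\alpha_t}$ and $\sigma_t = \sqrt{1-\bar\alpha_t}$ (these differ from the per-step $\alpha_t$ and the DDIM $\sigma_t$ used earlier). The probability flow ODE in $\lambda_t = \log(\alpha_t/\sigma_t)$ space (half the log signal-to-noise ratio) is:

$\frac{dx_\lambda}{d\lambda} = \frac{d\log\alpha_\lambda}{d\lambda}\,x_\lambda - \sigma_\lambda\,\varepsilon_\theta(x_\lambda, \lambda)$

where $\varepsilon_\theta$ is the noise-prediction network. The first term is linear in $x$ and integrates exactly, so the exact solution from $\lambda_s$ to $\lambda_t$ is:

$x_t = \frac{\alpha_t}{\alpha_s}x_s - \alpha_t \int_{\lambda_s}^{\lambda_t} e^{-\lambda}\,\varepsilon_\theta(x_\lambda, \lambda)\, d\lambda$

The integral over $\varepsilon_\theta$ is approximated using a Taylor expansion around $\lambda_s$ . DPM-Solver- $k$ is the resulting $k$ -th order solver:

DPM-Solver-1 ( $k=1$ ): uses only $\varepsilon_\theta(x_s, \lambda_s)$ , giving $x_t = \frac{\alpha_t}{\alpha_s}x_s - \sigma_t(e^{h}-1)\,\varepsilon_\theta(x_s, \lambda_s)$ with $h = \lambda_t - \lambda_s$ - first-order, equivalent to DDIM.

DPM-Solver-2 ( $k=2$ ): evaluates $\varepsilon_\theta$ at an intermediate point $s_{i+1/2}$ computed from the first-order prediction. Uses this additional evaluation for the 2nd-order correction. Cost: 2 NFE per step, but each step is 2nd-order accurate, with $O(h^3)$ local error.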

DPM-Solver-3 ( $k=3$ ): 3rd-order Taylor expansion, $O(h^4)$ accuracy per step. Works in 10-15 total steps.
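The equivalence between the first-order case and DDIM can be verified on scalars. The sketch assumes the exponential-integrator form of the first-order update, $x_t = \frac{\alpha_t}{\alpha_s}x_s - \sigma_t(e^h - 1)\varepsilon$ with $h = \lambda_t - \lambda_s$ , where $\alpha = \sqrt{\bar\alpha}$ and $\sigma = \sqrt{1-\bar\alpha}$ in DPM-Solver's notation; the test values are arbitrary.

```python
import math

def ddim_step(x, eps, ab_s, ab_t):
    """Deterministic DDIM update from noise level ab_s to ab_t."""
    x0_hat = (x - math.sqrt(1 - ab_s) * eps) / math.sqrt(ab_s)
    return math.sqrt(ab_t) * x0_hat + math.sqrt(1 - ab_t) * eps

def dpm_solver_1_step(x, eps, ab_s, ab_t):
    """First-order exponential-integrator update in lambda = log(alpha / sigma)."""
    alpha_s, sigma_s = math.sqrt(ab_s), math.sqrt(1 - ab_s)
    alpha_t, sigma_t = math.sqrt(ab_t), math.sqrt(1 - ab_t)
    h = math.log(alpha_t / sigma_t) - math.log(alpha_s / sigma_s)
    return (alpha_t / alpha_s) * x - sigma_t * math.expm1(h) * eps

x, eps = 0.3, -0.7            # arbitrary test values
ab_s, ab_t = 0.2, 0.6         # from noisier (s) to cleaner (t)
print(ddim_step(x, eps, ab_s, ab_t), dpm_solver_1_step(x, eps, ab_s, ab_t))
```

Both functions return the same value up to rounding, so the higher-order variants differ from DDIM only in how they estimate $\varepsilon_\theta$ across the step.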

DPM-Solver++ - Improving Stability

DPM-Solver++ (Lu et al. 2022b) applies the same idea but in $x_0$ -prediction space (predicting the clean image) rather than $\varepsilon$ -prediction space. This tends to be more stable at large guidance scales (CFG $\geq 7$ ): the $x_0$ prediction stays close to the data range and can be thresholded, while the guided $\varepsilon$ prediction is amplified by the guidance scale. DPM-Solver++ is a common default in production systems and is available in the HuggingFace Diffusers library as DPMSolverMultistepScheduler.

DDIM (1st order):    accumulates O(h^2) per step - need 50+ steps for quality
DPM-Solver-2:        accumulates O(h^3) per step - 20 steps matches DDIM 50 steps
DPM-Solver-3:        accumulates O(h^4) per step - 12 steps matches DDIM 50 steps

7. Architecture Diagram

8. Complete PyTorch DDIM Sampler with Inversion

This implementation works with any pre-trained DDPM U-Net - no changes to the model required.

import torch
import numpy as np
from typing import Optional, List, Tuple
import math


class DDIMSampler:
    """
    DDIM sampler with:
    - Deterministic sampling (eta=0) and stochastic sampling (eta=1)
    - Step schedule selection (uniform, quadratic)
    - DDIM inversion for image editing
    - Works with any pre-trained DDPM-style U-Net

    Reference: Song et al. 2020 "Denoising Diffusion Implicit Models"
    """

    def __init__(self, model, alphas_cumprod: torch.Tensor):
        """
        Args:
            model: trained U-Net - callable (x_t, t) -> predicted_noise
                   OR (x_t, t, conditioning) -> predicted_noise
            alphas_cumprod: shape (T,) - the alpha_bar_t values from training
        """
        self.model = model
        self.alphas_cumprod = alphas_cumprod
        self.T = len(alphas_cumprod)

    def make_timestep_schedule(
        self,
        num_steps: int,
        spacing: str = "uniform"
    ) -> List[int]:
        """
        Select S timesteps from the full T-step schedule.
        Returns list of timestep indices in descending order (T → 0).

        Args:
            num_steps: S, the number of inference steps
            spacing: 'uniform', 'quadratic', or 'lambda' (log-SNR space)
        """
        if spacing == "uniform":
            # Every step_ratio-th index: [0, step_ratio, 2*step_ratio, ...] below T
            step_ratio = max(self.T // num_steps, 1)  # guard against num_steps > T
            timesteps = list(range(0, self.T, step_ratio))
            timesteps = sorted(timesteps, reverse=True)

        elif spacing == "quadratic":
            # Denser steps at low noise (small t), larger jumps at high noise
            timesteps = [
                int((i / num_steps) ** 2 * (self.T - 1))
                for i in range(num_steps + 1)
            ]
            timesteps = sorted(list(set(timesteps)), reverse=True)

        elif spacing == "lambda":
            # Uniform in lambda_t = log(alpha_t / sigma_t), DPM-Solver's usual choice
            alphas = self.alphas_cumprod.sqrt()
            sigmas = (1 - self.alphas_cumprod).sqrt()
            lambdas = torch.log(alphas / sigmas)

            # lambda decreases with t: largest at t=0 (clean), smallest at t=T-1 (noisy)
            lambda_clean = lambdas[0].item()
            lambda_noisy = lambdas[-1].item()
            lambda_seq = torch.linspace(lambda_noisy, lambda_clean, num_steps + 1)

            # Find closest timestep for each lambda value
            timesteps = []
            for lam in lambda_seq:
                idx = (lambdas - lam).abs().argmin().item()
                timesteps.append(idx)
            timesteps = sorted(list(set(timesteps)), reverse=True)

        else:
            raise ValueError(f"Unknown spacing: {spacing}")

        return timesteps

    @torch.no_grad()
    def sample(
        self,
        shape: Tuple,
        num_inference_steps: int = 50,
        eta: float = 0.0,
        conditioning=None,
        negative_conditioning=None,
        guidance_scale: float = 7.5,
        spacing: str = "uniform",
        device: str = "cuda",
        generator: Optional[torch.Generator] = None,
        verbose: bool = False,
    ) -> torch.Tensor:
        """
        Generate samples using DDIM.

        Args:
            shape: (B, C, H, W) output shape
            num_inference_steps: S, number of denoising steps (20-50 typical)
            eta: 0.0 = deterministic DDIM, 1.0 = stochastic (DDPM behavior)
            conditioning: text or class conditioning tensor
            negative_conditioning: for CFG (null/negative embedding)
            guidance_scale: CFG guidance scale w (ignored if no conditioning)
            spacing: 'uniform', 'quadratic', or 'lambda'
            generator: torch.Generator for reproducibility

        Returns:
            Generated samples of shape (B, C, H, W)
        """
        B = shape[0]
        timesteps = self.make_timestep_schedule(num_inference_steps, spacing)

        # Start from pure Gaussian noise
        x = torch.randn(shape, generator=generator, device=device)

        alphas_cumprod = self.alphas_cumprod.to(device)
        use_cfg = (conditioning is not None and negative_conditioning is not None
                   and guidance_scale > 1.0)

        if verbose:
            print(f"DDIM sampling: {num_inference_steps} steps, eta={eta}")
            print(f"Timesteps: {timesteps[:5]}...{timesteps[-3:]}")

        for i, t in enumerate(timesteps):
            # Next (less noisy) timestep; -1 marks the final step to the clean image
            t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1

            t_batch = torch.full((B,), t, device=device, dtype=torch.long)

            # === Predict noise with U-Net ===
            if use_cfg:
                # CFG: batch conditional and unconditional for efficiency
                x_doubled = torch.cat([x, x], dim=0)
                t_doubled = torch.cat([t_batch, t_batch], dim=0)
                cond_cat = torch.cat([conditioning, negative_conditioning], dim=0)
                eps_both = self.model(x_doubled, t_doubled, cond_cat)
                eps_cond, eps_uncond = eps_both.chunk(2, dim=0)
                # CFG combination
                eps_theta = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
            elif conditioning is not None:
                eps_theta = self.model(x, t_batch, conditioning)
            else:
                eps_theta = self.model(x, t_batch)

            # === Retrieve noise schedule values ===
            alpha_bar_t = alphas_cumprod[t]
            # Index 0 is still the first noise level, so only the final step
            # (t_prev == -1) targets alpha_bar = 1, i.e. no noise left
            alpha_bar_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0, device=device)

            # === Predict x_0 from x_t ===
            # x_0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)
            x0_pred = (x - (1 - alpha_bar_t).sqrt() * eps_theta) / alpha_bar_t.sqrt()
            # Clamp assumes pixel-space data scaled to [-1, 1];
            # remove it when sampling latents, which are not bounded this way
            x0_pred = x0_pred.clamp(-1.0, 1.0)

            # === Compute DDIM sigma_t ===
            # sigma_t = eta * sqrt((1 - alpha_bar_{t-1}) / (1 - alpha_bar_t)) * sqrt(1 - alpha_bar_t / alpha_bar_{t-1})
            sigma_t = eta * (
                ((1 - alpha_bar_prev) / (1 - alpha_bar_t)).sqrt()
                * (1 - alpha_bar_t / alpha_bar_prev).sqrt()
            )

            # === Compute x_{t-1} ===
            # "Direction toward clean image" + "direction toward x_t" + noise
            dir_xt = (1 - alpha_bar_prev - sigma_t ** 2).sqrt() * eps_theta
            x0_dir = alpha_bar_prev.sqrt() * x0_pred

            if eta > 0 and t_prev >= 0:
                # torch.randn_like does not accept a generator, so use torch.randn
                random_noise = sigma_t * torch.randn(
                    x.shape, generator=generator, device=x.device, dtype=x.dtype
                )
            else:
                random_noise = 0.0

            x = x0_dir + dir_xt + random_noise

            if verbose and i % 10 == 0:
                print(f"  Step {i+1:3d}/{num_inference_steps}, t={t}, t_prev={t_prev}")

        return x  # denoised latent or image

    @torch.no_grad()
    def invert(
        self,
        x0: torch.Tensor,
        num_steps: int = 50,
        conditioning=None,
        device: str = "cuda",
        verbose: bool = False,
    ) -> torch.Tensor:
        """
        DDIM Inversion: encode a real image x_0 → x_T (approximate).
        Runs the ODE forward (from t=0 to t=T) to find the noise
        that would produce x_0 under DDIM decoding.

        IMPORTANT: Valid only for eta=0 (deterministic DDIM).
        The more inversion steps, the more accurate the reconstruction.

        Args:
            x0: clean image in [-1, 1], shape (B, C, H, W)
            num_steps: inversion steps (50+ for high fidelity)
            conditioning: optional text conditioning during inversion

        Returns:
            x_T: inverted noise, shape (B, C, H, W)
        """
        # Ascending timestep order for inversion (t=0 → t=T)
        timesteps = self.make_timestep_schedule(num_steps, spacing="uniform")
        timesteps = sorted(timesteps)  # ascending: 0 → T

        x = x0.to(device)
        alphas_cumprod = self.alphas_cumprod.to(device)

        if verbose:
            print(f"DDIM Inversion: {num_steps} steps")

        for i in range(len(timesteps) - 1):
            t = timesteps[i]
            t_next = timesteps[i + 1]

            t_batch = torch.full((x.shape[0],), t, device=device, dtype=torch.long)

            # Predict noise at current (partially noisy) image
            if conditioning is not None:
                eps_theta = self.model(x, t_batch, conditioning)
            else:
                eps_theta = self.model(x, t_batch)

            alpha_bar_t = alphas_cumprod[t]
            alpha_bar_t_next = alphas_cumprod[t_next]

            # Inversion step: deterministic DDIM forward (eta=0 implied)
            # Predict x_0 estimate
            x0_pred = (x - (1 - alpha_bar_t).sqrt() * eps_theta) / alpha_bar_t.sqrt()

            # Compute x_{t+1} via DDIM formula (reversed direction)
            # direction toward x_t encoded as noise direction
            x = (
                alpha_bar_t_next.sqrt() * x0_pred
                + (1 - alpha_bar_t_next).sqrt() * eps_theta
            )

            if verbose and i % 10 == 0:
                print(f"  Inversion step {i+1}/{len(timesteps)-1}, t={t} → t_next={t_next}")

        return x  # x_T: the inverted noise


# ============================================================
# DDIM quality vs. speed analysis
# ============================================================

def analyze_ddim_quality_speed_tradeoff():
    """
    Prints the empirical quality-speed tradeoff for
    DDIM at different step counts and eta values.
    """
    print("DDIM Quality vs Speed Analysis")
    print("=" * 60)
    print()

    # Approximate FID scores on CIFAR-10 as reported in the DDIM paper
    results = [
        # (num_steps, eta, approx_FID_CIFAR10)
        (1000, 1.0, 3.17),   # DDPM (baseline)
        (250,  0.0, 4.16),   # DDIM deterministic, 250 steps
        (100,  0.0, 4.16),   # DDIM deterministic, 100 steps
        (50,   0.0, 4.67),   # DDIM deterministic, 50 steps
        (50,   1.0, 8.01),   # DDIM stochastic, 50 steps
        (20,   0.0, 6.84),   # DDIM deterministic, 20 steps
        (10,   0.0, 13.36),  # DDIM too few steps
    ]

    print(f"{'Steps':>8} | {'eta':>6} | {'FID (CIFAR-10)':>15} | {'vs DDPM':>12} | {'speedup':>8}")
    print("-" * 62)
    baseline_fid = 3.17
    for steps, eta, fid in results:
        speedup = 1000 / steps
        delta = fid - baseline_fid
        print(f"{steps:8d} | {eta:6.1f} | {fid:15.2f} | {delta:+8.2f} FID | {speedup:7.0f}x")

    print()
    print("Key observations:")
    print("  DDIM 50 steps: FID 4.67 vs DDPM 3.17 → +1.5 FID, 20x faster")
    print("  DDIM 100 steps: FID 4.16 → only +1.0 FID, 10x faster")
    print("  DDIM 20 steps: FID 6.84 → acceptable for previews, 50x faster")
    print("  eta=1 (stochastic) degrades faster than eta=0 at 50 steps (8.01 vs 4.67)")
    print()
    print("Recommendation for production:")
    print("  - High quality: DPM-Solver-2 @ 20 steps")
    print("  - Fast preview: DDIM @ 20 steps")
    print("  - Real-time: LCM @ 4 steps (requires distilled model)")


# ============================================================
# DPM-Solver-2 - simplified educational implementation
# ============================================================

class DPMSolver2:
    """
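    DPM-Solver-2: 2nd-order diffusion ODE solver (simplified sketch).
    Aims for DDIM-level quality with substantially fewer network evaluations;
    timesteps are snapped to the nearest training index for simplicity.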

    Key idea: solve the exact linear part of the diffusion ODE,
    approximate only the nonlinear score function term
    using a 2nd-order Taylor expansion in log-SNR space.

    Reference: Lu et al. (2022) "DPM-Solver: A Fast ODE Solver for
    Diffusion Probabilistic Model Sampling in Around 10 Steps"
    """

    def __init__(self, model, alphas_cumprod: torch.Tensor):
        self.model = model
        self.alphas_cumprod = alphas_cumprod

        # Precompute lambda_t = log(alpha_t / sigma_t) - log SNR
        alphas = alphas_cumprod.sqrt()
        sigmas = (1 - alphas_cumprod).sqrt()
        self.lambdas = torch.log(alphas / sigmas)

    def _get_t_from_lambda(self, lam: float) -> int:
        """Find the timestep index closest to a given lambda value."""
        return int((self.lambdas - lam).abs().argmin().item())

    @torch.no_grad()
    def sample(
        self,
        shape: Tuple,
        num_steps: int = 20,
        conditioning=None,
        device: str = "cuda",
    ) -> torch.Tensor:
        """
        Sample using DPM-Solver-2.
        Uses 2 NFE per step: first-order prediction + 2nd-order correction.
        Total NFE = 2 * num_steps.
        """
        # Schedule: uniform in lambda space, running from high noise
        # (smallest lambda) to low noise (largest lambda)
        lambda_noisy = self.lambdas.min().item()
        lambda_clean = self.lambdas.max().item()
        lambda_schedule = torch.linspace(lambda_noisy, lambda_clean, num_steps + 1)

        x = torch.randn(shape, device=device)
        alphas_cumprod = self.alphas_cumprod.to(device)

        for i in range(num_steps):
            lam_t = lambda_schedule[i].item()      # current (noisier) level
            lam_s = lambda_schedule[i + 1].item()  # next (cleaner) level
            h = lam_s - lam_t  # step size in lambda space (positive)

            # Find corresponding timestep indices
            t_idx = self._get_t_from_lambda(lam_t)
            t_mid_idx = self._get_t_from_lambda((lam_t + lam_s) / 2)

            t_batch = torch.full((shape[0],), t_idx, device=device, dtype=torch.long)
            t_mid_batch = torch.full((shape[0],), t_mid_idx, device=device, dtype=torch.long)

            alpha_t = alphas_cumprod[t_idx].sqrt()
            sigma_t = (1 - alphas_cumprod[t_idx]).sqrt()

            # === First-order prediction (DPM-Solver-1 step to midpoint) ===
            if conditioning is not None:
                eps_t = self.model(x, t_batch, conditioning)
            else:
                eps_t = self.model(x, t_batch)

            # Predict x_0 in x_0-space (DPM-Solver++ style - more stable)
            x0_t = (x - sigma_t * eps_t) / alpha_t

            # First-order update to midpoint lambda
            alpha_mid = alphas_cumprod[t_mid_idx].sqrt()
            sigma_mid = (1 - alphas_cumprod[t_mid_idx]).sqrt()
            x_mid = alpha_mid * x0_t + sigma_mid * eps_t

            # === Second-order correction ===
            if conditioning is not None:
                eps_mid = self.model(x_mid, t_mid_batch, conditioning)
            else:
                eps_mid = self.model(x_mid, t_mid_batch)

            # 2nd-order update (midpoint rule): re-estimate x_0 at the starting
            # point using the midpoint noise prediction instead of eps_t
            x0_corrected = (x - sigma_t * eps_mid) / alpha_t

            # Full step from lambda_t to lambda_s using the midpoint estimate.
            # Algebraically: (alpha_s / alpha_t) * x - sigma_s * (e^h - 1) * eps_mid
            alpha_s_sq = alphas_cumprod[self._get_t_from_lambda(lam_s)]
            alpha_s = alpha_s_sq.sqrt()
            sigma_s = (1 - alpha_s_sq).sqrt()

            x = alpha_s * x0_corrected + sigma_s * eps_mid

        return x


# ============================================================
# Usage example and comparison
# ============================================================

def sampling_comparison_demo():
    """
    Demonstrates the NFE-quality tradeoff across samplers.
    """
    print("Sampler Comparison for SD-scale Model (860M U-Net)")
    print("=" * 65)
    print()

    # (name, NFE, FID). Wall-clock time is estimated from NFE in the loop below.
    samplers = [
        ("DDPM (1000 steps)",       1000, 3.17),
        ("DDIM (250 steps, η=0)",    250, 4.16),
        ("DDIM (50 steps, η=0)",      50, 4.67),
        ("DDIM (20 steps, η=0)",      20, 6.84),
        ("DPM-Solver-2 (20 steps)",   40, 3.05),  # 2 NFE per step
        ("DPM-Solver-3 (12 steps)",   36, 3.08),  # 3 NFE per step
        ("LCM (4 steps)",              4, 3.50),  # distilled
    ]

    ms_per_nfe = 9.0  # ~9ms per NFE for 860M model on A100

    print(f"{'Method':<35} | {'NFE':>5} | {'A100 (s)':>8} | {'FID':>6}")
    print("-" * 63)
    for name, nfe, fid in samplers:
        total_time = nfe * ms_per_nfe / 1000.0  # seconds
        print(f"{name:<35} | {nfe:5d} | {total_time:8.2f} | {fid:6.2f}")

    print()
    print("Recommendation:")
    print("  Production high-quality: DPM-Solver-2/3 @ 12-20 steps")
    print("  Fast preview:            DDIM @ 20 steps")
    print("  Real-time (distilled):   LCM @ 4 steps")
    print("  Image editing (invert):  DDIM with 50+ inversion steps")


if __name__ == "__main__":
    analyze_guidance_scale_quality_tradeoff()
    print()
    sampling_comparison_demo()

9. YouTube Resources

| Video | Channel | What You Learn |
|---|---|---|
| DDIM Paper Explained | Yannic Kilcher | Full DDIM paper walkthrough with non-Markovian derivation |
| Diffusion Models Crash Course | Outlier | Intuitive DDIM vs DDPM comparison with code |
| DPM-Solver Explained | AI Coffee Break | ODE solver perspective, DPM-Solver vs DDIM, why it works |
| Consistency Models | Yannic Kilcher | Song et al. 2023, consistency training vs distillation |
| Stable Diffusion Deep Dive | Fast.ai | End-to-end SD pipeline with sampler options and CFG |

10. Production Engineering Notes

Throughput Optimization

The sequential nature of diffusion sampling (each step depends on the previous) prevents step-level parallelism. Production throughput improvements come from:

Batching: process multiple requests simultaneously. For a batch of $N$ images at $S$ steps, total time = $S \times \text{(time per batch forward pass)}$ , not $N \times S \times \text{(time per single forward pass)}$ . A batch of 4 images takes roughly the same time as 1 image for the U-Net forward pass (up to memory limits).

Compilation: torch.compile() with mode="reduce-overhead" provides 20-40% speedup on A100/H100 by fusing operations and reducing kernel launch overhead. Diffusion models benefit significantly because the same U-Net architecture is called many times with different inputs.

FlashAttention: replaces standard attention ($O(N^2)$ memory) with tiled attention ($O(N)$ memory, 2-4x faster). Essential for SDXL at 1024px, where the attention sequence length $N$ is large.

Quantization: FP16 is standard. INT8 with bitsandbytes or onnxruntime provides 20-30% additional speedup with minimal FID degradation. INT4 is more aggressive and produces noticeable quality loss.
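
As a rough illustration of the batching arithmetic above, the sketch below estimates per-batch latency and throughput from the step count and NFE per step. The function name `estimate_throughput` and its parameters are hypothetical. The 9 ms per NFE default is the A100 figure used in this lesson's sampler comparison, and the assumption that a batched forward pass costs about the same as a single image only holds for small batches within memory and compute limits.

```python
def estimate_throughput(num_steps, nfe_per_step=1, ms_per_nfe=9.0, batch_size=1):
    """Return (seconds per batch, images per second).

    Assumes one batched forward pass costs roughly ms_per_nfe regardless of
    batch_size, which is only a fair approximation for small batches.
    """
    seconds_per_batch = num_steps * nfe_per_step * ms_per_nfe / 1000.0
    images_per_second = batch_size / seconds_per_batch
    return seconds_per_batch, images_per_second


# DPM-Solver-2 at 20 steps (2 NFE per step): single image vs batch of 4
single = estimate_throughput(20, nfe_per_step=2, batch_size=1)
batched = estimate_throughput(20, nfe_per_step=2, batch_size=4)
print(f"single: {single[0]:.2f}s per batch, {single[1]:.1f} img/s")
print(f"batch4: {batched[0]:.2f}s per batch, {batched[1]:.1f} img/s")
```

Latency per request is unchanged by batching in this model; only throughput improves, which is why batching helps serving cost more than it helps the sub-5-second latency target.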

Memory During Sampling

For SD 1.5 at 512x512 on a single A100 (80GB):

Component              | Memory (FP16)
-----------------------|---------------
U-Net                  | 1.7 GB
VAE encoder/decoder    | 0.4 GB
CLIP text encoder      | 0.5 GB
Activations (CFG)      | ~2.0 GB
Latent buffers         | ~0.1 GB
-----------------------|---------------
Total                  | ~4.7 GB

CFG doubles the activation memory because both conditional and unconditional passes must fit. At batch size 8, total memory is ~12 GB - comfortable on A100 but tight on a 16GB consumer GPU.

11. Common Mistakes

:::danger Using DDIM with too few steps at high CFG
Below 20 steps with DDIM, a high guidance scale (CFG $\geq 7$) amplifies discretization errors. Each first-order DDIM step incurs $O(h^2)$ local error; at 10 steps, $h$ is large enough that the accumulated error is visible as oversaturation and edge artifacts. Use DPM-Solver-2 or DPM-Solver-3 for step counts below 20. Their higher-order accuracy makes them substantially more robust to coarse discretization.
:::

:::warning Assuming eta=0 always produces better images
Deterministic DDIM ($\eta=0$) trades diversity for consistency. The same prompt always produces the same image (given the same seed), which is desirable for reproducibility but reduces sample diversity. For evaluation on diversity metrics (Recall, Coverage, PRDC), stochastic $\eta=1$ typically scores better because it explores more of the distribution. For creative applications requiring variety, use $\eta > 0$.
:::

:::warning DDIM inversion reconstruction errors compound with CFG
DDIM inversion assumes the same deterministic ODE is followed in both directions. If decoding uses CFG, or conditioning that differs from what was used during inversion, the reconstructed image drifts from the original, and the drift grows with the guidance scale. For precise image editing with CFG, use Null-Text Inversion (Mokady et al. 2022), which optimizes the null embedding per image to compensate for CFG-induced drift. Simple DDIM inversion without this correction works well only at low guidance scales.
:::

:::warning The $\eta=1$ case is NOT equivalent to DDPM in sample quality
Setting $\eta=1$ in DDIM with $S=50$ steps is NOT the same as running DDPM with 1000 steps. DDIM with $\eta=1$ takes 50 large stochastic steps, whereas DDPM takes 1000 small ones, and noise injected at large step sizes behaves differently from many small noise injections. Empirically, DDIM with $\eta=1$ at 50 steps achieves worse FID than DDIM with $\eta=0$ at 50 steps (8.01 vs 4.67 on CIFAR-10 in the DDIM paper). In that table, FID at low step counts degrades as $\eta$ increases, so $\eta=0$ is the safer default when steps are scarce.
:::
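
To make the "large stochastic steps" point concrete, the sketch below evaluates the DDIM noise scale $\sigma = \eta \sqrt{(1-\bar\alpha_{prev})/(1-\bar\alpha_t)}\,\sqrt{1-\bar\alpha_t/\bar\alpha_{prev}}$ for an adjacent-timestep update and for a 20-timestep skip. It assumes a linear $\beta$ schedule from 1e-4 to 0.02 over 1000 steps (a common DDPM default, not something this lesson fixes), and the helper names `linear_alphas_cumprod` and `ddim_sigma` are hypothetical.

```python
import math


def linear_alphas_cumprod(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    out, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def ddim_sigma(ab_t, ab_prev, eta):
    """Std of the noise injected when stepping from alpha-bar ab_t to ab_prev."""
    return eta * math.sqrt((1 - ab_prev) / (1 - ab_t)) * math.sqrt(1 - ab_t / ab_prev)


ab = linear_alphas_cumprod()
t = 500
sigma_adjacent = ddim_sigma(ab[t], ab[t - 1], eta=1.0)   # 1000-step style update
sigma_skip = ddim_sigma(ab[t], ab[t - 20], eta=1.0)      # 50-step style update
sigma_det = ddim_sigma(ab[t], ab[t - 20], eta=0.0)       # deterministic DDIM
print(f"eta=1 adjacent: {sigma_adjacent:.3f}")
print(f"eta=1 skip-20:  {sigma_skip:.3f}")
print(f"eta=0 skip-20:  {sigma_det:.3f}")
```

Under these assumptions the skipped step injects several times more noise in a single update than the adjacent step does, which is one way to see why 50 stochastic steps do not behave like 1000.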

12. Progressive Distillation - The Training-Time Approach

Salimans and Ho (2022) introduced Progressive Distillation: instead of improving the sampler at inference time, distill the multi-step process into a student model that requires half the steps. The process is iterative:

1. Start with a teacher that generates in $T$ steps.
2. Train a student to match the teacher's output in $T/2$ steps: each student step matches two teacher steps.
3. Make the student the new teacher and repeat: $T \to T/2 \to T/4 \to \ldots$

After 4 iterations: $1024 \to 512 \to 256 \to 128 \to 64$ steps. Quality loss at each halving is small because the student only needs to learn a 2x speedup, not 1000x. The distillation loss:

$L_{distill} = \mathbb{E}_{z_0, t}\!\left[w(t) \cdot d\!\left(\hat{x}_0^{student}(z_{t_n}),\; \hat{x}_0^{teacher}(z_{t_n})\right)\right]$

Using LPIPS perceptual distance for $d$ rather than MSE better preserves high-frequency texture. After 4 distillation iterations, the model generates high-quality images in 64 steps, about 16x fewer than the original 1024. The speedup comes from a student that has learned to take larger steps, not from swapping in a different ODE solver.
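
The halving loop and the "one student step matches two teacher steps" idea can be sketched in scalar form. This is a minimal illustration under stated assumptions, not the paper's training code: the teacher is an oracle that returns the true noise, the $\bar\alpha$ values are arbitrary, and the helper names `halving_schedule`, `ddim_step`, and `student_target` are hypothetical. The target is chosen so that a single student DDIM step from $z_t$ lands exactly where two teacher steps land.

```python
import math


def halving_schedule(start_steps, rounds):
    """Step counts after each distillation round, e.g. 1024 -> 512 -> ..."""
    return [start_steps // (2 ** k) for k in range(rounds + 1)]


def ddim_step(z, eps_hat, ab_from, ab_to):
    """Deterministic DDIM update between two alpha-bar levels."""
    x0_hat = (z - math.sqrt(1 - ab_from) * eps_hat) / math.sqrt(ab_from)
    return math.sqrt(ab_to) * x0_hat + math.sqrt(1 - ab_to) * eps_hat


def student_target(z_t, z_end, ab_t, ab_end):
    """x_0 target that makes ONE student DDIM step from z_t reach z_end."""
    a_t, s_t = math.sqrt(ab_t), math.sqrt(1 - ab_t)
    a_e, s_e = math.sqrt(ab_end), math.sqrt(1 - ab_end)
    ratio = s_e / s_t
    return (z_end - ratio * z_t) / (a_e - ratio * a_t)


steps = halving_schedule(1024, 4)
print(steps)

# Toy check with an oracle teacher that predicts the true noise
x0, eps = 0.8, -0.3
ab_t, ab_mid, ab_end = 0.5, 0.7, 0.9          # arbitrary illustrative values
z_t = math.sqrt(ab_t) * x0 + math.sqrt(1 - ab_t) * eps
z_mid = ddim_step(z_t, eps, ab_t, ab_mid)      # teacher step 1
z_end = ddim_step(z_mid, eps, ab_mid, ab_end)  # teacher step 2
target = student_target(z_t, z_end, ab_t, ab_end)
print(abs(target - x0) < 1e-9)
```

With a real (imperfect) teacher the target differs from the true $x_0$, and that difference is exactly what the student learns to absorb.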

Advantage over DDIM/DPM-Solver: the student is retrained, so it can adapt its denoising strategy to the reduced step count. It is not constrained to follow the DDPM ODE trajectory.

Disadvantage: requires the training compute of a full fine-tuning run per distillation iteration. For large models (2.6B SDXL), this is expensive. DDIM and DPM-Solver require zero additional training.

13. Noise Schedules and Their Interaction with Samplers

The noise schedule $\{\beta_t\}_{t=1}^T$ (equivalently $\{\bar\alpha_t\}$ ) determines the rate at which noise is added in the forward process. The choice of schedule has direct implications for how well step-skipping works at low NFE.

Linear Schedule and Its Curvature Profile

The linear schedule increases $\beta_t$ linearly in $t$, so $\log\bar\alpha_t$ decreases roughly quadratically. The ODE curvature (second derivative of the trajectory in $\lambda$-space) is concentrated in the middle timesteps. This has a counterintuitive implication for step skipping, where "early" and "late" refer to sampling order (sampling runs from $t = T$ down to $t = 0$):

- Skipping early timesteps ($t > 0.9T$): very cheap. The ODE is nearly linear there, so large steps incur minimal error.
- Skipping middle timesteps ($t \approx 0.4T$ to $0.7T$): expensive. The ODE has its highest curvature here, so large steps cause significant approximation error.
- Skipping late timesteps ($t < 0.1T$): moderate cost. The model is refining fine details and curvature is moderate.

Practical implication: with DDIM at 50 steps on a linear schedule, the steps in the curvature-heavy middle region limit quality. Adding more steps in $t \approx 400$ to $700$ helps more than adding steps at the ends.

Cosine Schedule and Uniform Curvature

The cosine schedule is designed to distribute information removal more uniformly across timesteps. It also distributes ODE curvature more uniformly. This means:

- No single region concentrates error
- Uniform step spacing works reliably well
- At very low step counts (10-15), quality degrades more gracefully with cosine than linear
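
The difference in how the two schedules remove signal can be checked numerically. The sketch below compares $\bar\alpha_t$ for a linear schedule and a cosine schedule; it shows signal retention, not curvature directly. The constants are assumptions: $\beta$ from 1e-4 to 0.02 over 1000 steps for linear, and the cosine form $\bar\alpha_t = f(t)/f(0)$ with $f(t) = \cos^2\!\big(\tfrac{t/T + s}{1 + s}\cdot\tfrac{\pi}{2}\big)$ and $s = 0.008$, as commonly used. The function names are hypothetical.

```python
import math


def linear_alphas_cumprod(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha-bar for a linear beta schedule (assumed DDPM-style defaults)."""
    out, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def cosine_alphas_cumprod(T=1000, s=0.008):
    """alpha-bar for a cosine schedule with offset s (assumed default)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]


linear = linear_alphas_cumprod()
cosine = cosine_alphas_cumprod()
for t in (100, 250, 500, 750, 900):
    # index t - 1 holds alpha-bar at timestep t
    print(f"t={t:4d}  linear={linear[t - 1]:.4f}  cosine={cosine[t - 1]:.4f}")
```

Under these assumptions the linear schedule has already destroyed most of the signal by the halfway point, while the cosine schedule retains roughly half of it, which is consistent with its more even behavior under uniform step spacing.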

DPM-Solver's Lambda-Space Uniformity

DPM-Solver works best with steps uniformly spaced in $\lambda_t = \log(\alpha_t/\sigma_t)$, which is half the log signal-to-noise ratio. DPM-Solver's Taylor expansion for the score integral is taken in $\lambda$, so its local truncation error scales with a power of $\Delta\lambda$. Uniform $\Delta\lambda$ keeps that error balanced across steps instead of letting a few large steps dominate.

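import torch


def compute_lambda_schedule(
    alphas_cumprod: torch.Tensor,
    num_steps: int
) -> torch.Tensor:
    """
    Compute timestep indices for uniform lambda-space spacing.
    Optimal for DPM-Solver-2 and DPM-Solver-3.

    Returns unique indices in descending order (noisiest first). Rounding to the
    nearest timestep can merge neighbours, so fewer than num_steps + 1 indices
    may be returned.
    """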
    alphas = alphas_cumprod.sqrt()
    sigmas = (1 - alphas_cumprod).sqrt()
    lambdas = torch.log(alphas / (sigmas + 1e-8))  # (T,)

    lambda_max = lambdas.max()
    lambda_min = lambdas.min()

    # Uniform spacing in lambda: equal ODE step size in log-SNR
    lambda_seq = torch.linspace(lambda_max, lambda_min, num_steps + 1)

    # Map each lambda value to the nearest timestep index
    timestep_indices = []
    for lam in lambda_seq:
        idx = (lambdas - lam).abs().argmin().item()
        timestep_indices.append(idx)

    return torch.tensor(sorted(set(timestep_indices), reverse=True))


# Comparison: uniform-t vs uniform-lambda spacing for DPM-Solver
# At 15 steps with cosine schedule:
# uniform-t:      FID ~3.2 on CIFAR-10
# uniform-lambda: FID ~2.95 on CIFAR-10
# The lambda spacing aligns step sizes with ODE curvature

Schedule-Aware Recommendations

| Schedule | DDIM steps | Best DDIM spacing | DPM-Solver steps | Best DPM-Solver spacing |
|---|---|---|---|---|
| Linear | 50 | Uniform-t | 20 | Uniform-lambda |
| Cosine | 50 | Uniform-t | 15 | Uniform-lambda |
| Cosine | 30 | Uniform-t | 12 | Uniform-lambda |

With the cosine schedule, both spacing strategies behave predictably. For practical deployments, DPM-Solver-2 with uniform-lambda spacing is a robust default regardless of the underlying noise schedule.

14. Interview Q&A

Q1: Derive the DDIM update rule from first principles. Why can the same trained U-Net be used?

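The DDIM update starts from the same marginals as DDPM: $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}x_0, (1-\bar\alpha_t)I)$. Song et al. define a family of non-Markovian forward processes $q_\sigma$ that share these marginals, and the generative step samples from $q_\sigma(x_{t-1}|x_t, x_0)$ with $x_0$ replaced by the model's prediction $\hat x_0$. Setting $\sigma_t = 0$ (deterministic) gives: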

$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\underbrace{\frac{x_t - \sqrt{1-\bar\alpha_t}\varepsilon_\theta}{\sqrt{\bar\alpha_t}}}_{\hat x_0} + \sqrt{1-\bar\alpha_{t-1}}\varepsilon_\theta$

The same trained U-Net works because the DDPM training loss $\mathbb{E}[\|\varepsilon - \varepsilon_\theta(x_t,t)\|^2]$ depends only on $q(x_t|x_0)$ - the marginal at each timestep. The model was never trained with knowledge of the full Markov joint $q(x_1, \ldots, x_T|x_0)$ . Any sampling process sharing the same marginals is valid for inference with the same model.
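
A quick scalar check makes the "same marginals" argument tangible. In the sketch below (hypothetical helper `ddim_step`, arbitrary $\bar\alpha$ values, and an oracle standing in for $\varepsilon_\theta$ by returning the true noise), a deterministic DDIM step from $x_t$ lands exactly on the sample the marginal $q(x_{t-1}|x_0)$ would give for the same $x_0$ and $\varepsilon$, even when the two noise levels are far apart.

```python
import math


def ddim_step(x_t, eps_hat, ab_t, ab_prev):
    """Deterministic (eta = 0) DDIM update from alpha-bar ab_t to ab_prev."""
    x0_hat = (x_t - math.sqrt(1 - ab_t) * eps_hat) / math.sqrt(ab_t)
    return math.sqrt(ab_prev) * x0_hat + math.sqrt(1 - ab_prev) * eps_hat


x0, eps = 1.5, -0.4
ab_t, ab_prev = 0.3, 0.8  # arbitrary, non-adjacent noise levels

# Forward marginal sample at the noisier level
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1 - ab_t) * eps

# One DDIM step with an oracle noise prediction
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)

# What the marginal at the cleaner level gives for the same (x0, eps)
expected = math.sqrt(ab_prev) * x0 + math.sqrt(1 - ab_prev) * eps
print(abs(x_prev - expected) < 1e-12)
```

Nothing in `ddim_step` refers to the timesteps between the two levels, which is why step skipping is possible; with a learned $\varepsilon_\theta$ the match is approximate rather than exact.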

Q2: What does $\eta=0$ physically mean in DDIM, and what are its practical advantages?

When $\eta=0$ , $\sigma_t = 0$ at every step - no stochastic noise is injected. The sampling becomes a deterministic function: the same noise $x_T$ always produces the same image. Physically, this corresponds to following a probability flow ODE rather than a stochastic process.

Practical advantages: (1) Reproducibility - given a seed, results are exactly reproducible. (2) Latent space interpolation - linearly interpolating between two $x_T$ vectors produces smooth interpolation in image space. (3) DDIM inversion - the deterministic ODE can be run backwards to encode a real image into latent space. (4) Slightly fewer artifacts at very low step counts because no stochastic noise amplifies discretization errors.

Tradeoff: reduced sample diversity. Two different seeds for the same prompt will produce images that look more similar to each other than with $\eta=1$ .

Q3: Why does DPM-Solver achieve better quality than DDIM in 20 steps, while DDIM needs 50?

DDIM uses first-order Euler integration of the probability flow ODE, accumulating $O(h^2)$ local error per step. DPM-Solver separates the ODE into a linear part (with exact solution involving $e^{\int \beta dt}$) and a nonlinear part (the score function). The linear part is solved exactly; only the nonlinear part needs approximation. DPM-Solver-2 applies a 2nd-order Taylor expansion, achieving $O(h^3)$ local error. With 20 steps (each covering about 1/20 of the $\lambda$ range), this gives substantially lower total error than DDIM with 50 steps (smaller steps of about 1/50 of the range, but only 1st-order). DPM-Solver-3 extends to $O(h^4)$, working in 10-15 steps. The key insight: exploiting the analytical structure of the ODE is more efficient than brute-force first-order stepping.
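
The order argument can be seen on a toy problem. The sketch below is not DPM-Solver (it has no exponential-integrator part) and not a diffusion ODE; it simply compares generic first-order Euler with the second-order explicit midpoint rule on $dx/dt = -x$, an assumed stand-in chosen because the exact solution is known. Function names are hypothetical.

```python
import math


def f(x):
    """Toy ODE right-hand side: dx/dt = -x, exact solution x(t) = exp(-t)."""
    return -x


def euler(x, h, n):
    """First-order method: one function evaluation per step."""
    for _ in range(n):
        x = x + h * f(x)
    return x


def midpoint(x, h, n):
    """Second-order method: two function evaluations per step."""
    for _ in range(n):
        x_mid = x + 0.5 * h * f(x)
        x = x + h * f(x_mid)
    return x


exact = math.exp(-1.0)
euler_err = abs(euler(1.0, 1 / 50, 50) - exact)        # 50 function evaluations
midpoint_err = abs(midpoint(1.0, 1 / 20, 20) - exact)  # 40 function evaluations
print(f"Euler, 50 steps:    error = {euler_err:.2e}")
print(f"Midpoint, 20 steps: error = {midpoint_err:.2e}")
```

Even with fewer function evaluations, the second-order method ends up with the smaller error on this problem, which mirrors the DDIM at 50 steps versus DPM-Solver-2 at 20 steps comparison.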

Q4: How does DDIM inversion enable image editing, and what are its limitations?

DDIM inversion runs the deterministic ODE from $t=0$ to $t=T$ : given a real image $x_0$ , it finds $x_T^*$ such that $\text{DDIM-decode}(x_T^*) \approx x_0$ . With $x_T^*$ in hand, you can change the text conditioning and re-run DDIM decoding - the structure of the image is preserved (because $x_T^*$ encodes spatial structure in the noise trajectory) while semantics change according to the new prompt.

Limitations: (1) Imperfect reconstruction at low step counts due to discretization error - use 50+ inversion steps for high fidelity. (2) CFG during decoding but not inversion causes semantic drift - use Null-Text Inversion to correct for this. (3) Very large semantic changes require attention manipulation techniques (Prompt-to-Prompt), not just re-conditioning. (4) Works only for $\eta=0$ ; stochastic DDIM cannot be inverted.

Q5: How would you choose between DDIM, DPM-Solver, and LCM for a production system?

DDIM ( $\eta=0$ , 30-50 steps): simplest implementation, well-understood, good for image editing via inversion. Prefer when latency allows 30-50 steps and you need inversion capability or are debugging pipeline issues.

DPM-Solver-2 or DPM-Solver++ (20 steps): the current production sweet spot. Better quality per NFE than DDIM, stable with high CFG (use DPM-Solver++ for CFG $\geq 7$ ), minimal code change from DDIM. Default choice for new production deployments in 2024-2025.

LCM / LCM-LoRA (4 steps): for real-time or near-real-time applications (interactive tools, live preview). Requires a separately distilled model - you cannot just swap the sampler on a DDPM model. Quality ceiling is lower than DPM-Solver at 20 steps, but 5-10x lower latency makes it the right choice for user-facing interactive applications.

Q6: What is the relationship between the DDPM variance schedule and DDIM's step-skipping quality?

The variance schedule determines how timesteps map to $\lambda_t = \log(\alpha_t/\sigma_t)$, and therefore how much the probability flow ODE changes between consecutive sampled timesteps. The linear schedule concentrates ODE curvature in the middle timesteps ($t \approx 400$ to $700$) and has lower curvature at early and late steps, so step-skipping is cheapest near the two ends. The cosine schedule distributes curvature more uniformly, so any skipping pattern incurs more uniform error.

For DDIM at 50 steps: uniform spacing works well with both schedules. For DPM-Solver at 12 steps: uniform spacing in $\lambda$ -space is optimal regardless of the underlying noise schedule, because DPM-Solver's Taylor expansion is most accurate with uniform $\Delta\lambda$ . The practical recommendation: use uniform-in- $\lambda$ spacing for DPM-Solver, uniform-in- $t$ spacing for DDIM.

The schedule and sampler must be treated as a joint design decision. For any new production deployment: (1) identify which noise schedule was used during training, (2) choose DPM-Solver-2 as the default sampler, (3) space the steps uniformly in $\lambda$, (4) validate FID vs step count on a held-out sample. In practice this combination typically gets close to full-step FID at 15-20 steps (30-40 NFE for DPM-Solver-2).

Rectified Flow and SD3/FLUX

Rectified flow (Liu et al. 2022, used in SD3 and FLUX) replaces the cosine/linear noise schedules entirely. Instead of defining a noising process with gradually increasing variance, rectified flow interpolates linearly between noise and data:

$x_t = (1 - t) \cdot x_0 + t \cdot \varepsilon, \qquad t \in [0, 1]$

The reverse ODE follows straight lines between data and noise in sample space. Why does this matter for sampling? Ideally straight trajectories have zero curvature, so a first-order solver (Euler, the same order as DDIM) with 10-20 steps incurs little approximation error. Empirically, SD3 and FLUX produce high-quality images at 20-28 steps, and acceptable quality at even fewer, without needing higher-order solvers. The guidance scale also tends to be lower (3-5) because the rectified flow training is more stable for the conditioning signal.

The tradeoff: rectified flow requires retraining. You cannot apply rectified flow sampling to a DDPM-trained model. But for models trained from scratch (SD3, FLUX), it is generally the better choice for sampling efficiency.
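
The straight-line property can be illustrated with scalars. The sketch below uses the interpolation $x_t = (1 - t)\,x_0 + t\,\varepsilon$ from above and an oracle velocity $v = \varepsilon - x_0$ (an assumption: a trained model only approximates this, and its predicted velocity varies along the path, which is why real samplers still use more than one step). Function names are hypothetical.

```python
def interpolate(x0, eps, t):
    """Rectified-flow interpolation between data (t = 0) and noise (t = 1)."""
    return (1 - t) * x0 + t * eps


def velocity(x0, eps):
    """d x_t / dt along the straight path; constant in t."""
    return eps - x0


def euler_sample(eps, v, num_steps):
    """Integrate from t = 1 (noise) back to t = 0 (data) with Euler steps."""
    x, dt = eps, 1.0 / num_steps
    for _ in range(num_steps):
        x = x - dt * v
    return x


x0, eps = 0.7, -1.2
v = velocity(x0, eps)
one_step = euler_sample(eps, v, num_steps=1)
four_steps = euler_sample(eps, v, num_steps=4)
print(abs(one_step - x0) < 1e-12, abs(four_steps - x0) < 1e-12)
```

With a constant velocity, Euler integration is exact for any step count, so the number of steps needed in practice reflects how far the learned velocity field deviates from a straight line.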

This lesson is part of the Diffusion Models module. Next: Latent Diffusion Models - The Architecture Behind Stable Diffusion.

