
Generative Models Overview - VAEs, GANs, Flow Models, and Diffusion

:::note
Reading time: ~45 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Engineer, Applied Scientist
:::

The Real Interview Moment

You are interviewing at Stability AI. The interviewer slides a whiteboard marker across the table. "Walk me through every major approach to generative modeling. For each one, tell me what it optimizes, where it fails, and why diffusion models ultimately beat them all."

Most candidates give a surface-level tour - "GANs use a discriminator, VAEs use an encoder-decoder structure" - and never explain why the community spent a decade wrestling with mode collapse, blurry outputs, and training instability before diffusion models arrived. The interviewer has heard that surface-level answer forty times this month.

What wins the room is a unified framework. Every generative model answers the same question: how do we learn a probability distribution $p_\theta(x)$ over high-dimensional data from finite samples, and then sample novel examples from it? Each architecture makes a different bet about what to optimize and what to sacrifice among tractable likelihood, sample quality, diversity, training stability, and computational cost. Understanding those trade-offs at a mathematical level, not just an intuitive one, is what separates an ML practitioner from an ML engineer.

By the end of 2022, one approach had cracked a four-way trade-off that had stymied the field for a decade. Midjourney launched. Stable Diffusion went open source. DALL-E 2 made text-to-image generation mainstream. Within 18 months of DALL-E 2's release, Midjourney was processing over 15 million image generations per day. But to understand why diffusion won, you need to understand what each earlier approach tried - and precisely where it broke.

Why This Exists - The Generative Modeling Problem

Machine learning began with discriminative models - classifiers and regressors that learn $p(y|x)$. Generative models go further: they learn the full distribution $p(x)$ over inputs.

This unlocks three capabilities discriminative models cannot provide:

  1. Sampling: Draw $x \sim p_\theta(x)$ - generate novel images, molecules, audio, protein sequences.
  2. Density estimation: Evaluate $\log p_\theta(x)$ for any $x$ - detect anomalies, assess whether a test sample is in-distribution.
  3. Conditional generation: Sample $x \sim p_\theta(x|y)$ - generate an image matching a text prompt, complete a partial molecule, inpaint a masked region.

The challenge: a $256 \times 256$ RGB image is a point in $\mathbb{R}^{196{,}608}$ (since $256 \cdot 256 \cdot 3 = 196{,}608$). The true data distribution $p_\text{data}(x)$ concentrates on an extremely thin manifold embedded in that space - this is the manifold hypothesis. Most points in $\mathbb{R}^{196{,}608}$ are pure noise. The manifold of valid natural images is vastly lower-dimensional (perhaps $d_\text{manifold} \approx 100$–$1000$). Learning to model that manifold accurately, from finite samples, without explicit access to $p_\text{data}$, is the core challenge.

Four major families of approaches tackle it differently. Each makes a different bet.

Historical Context

Generative modeling predates deep learning. Gaussian mixture models, hidden Markov models, and restricted Boltzmann machines were studied for decades. The modern deep generative era began in 2013–2014. Variational Autoencoders (Kingma and Welling, 2013) and Generative Adversarial Networks (Goodfellow et al., 2014) appeared within months of each other, defining two radically different paradigms. Normalizing flows were formalized by Rezende and Mohamed (2015) and Dinh et al. (2016). Score-based models and modern diffusion models appeared in 2019–2020 (Song and Ermon; Ho et al.), building on the earlier diffusion formulation of Sohl-Dickstein et al. (2015). DALL-E 2 (Ramesh et al., 2022) and Stable Diffusion (Rombach et al., 2022) demonstrated that diffusion had decisively surpassed GANs on the metrics that matter for production applications.

1. Latent Variable Models and the Intractability Problem

All probabilistic generative models face a common structure. Define a latent variable $z \in \mathbb{R}^k$ with a simple prior $p(z) = \mathcal{N}(0, \mathbf{I})$, and a conditional generative model $p_\theta(x|z)$ (a neural network). The marginal data likelihood is:

$$p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$$

The integral is over all possible latent configurations. With a neural-network decoder it has no closed form, and numerical integration scales exponentially in $k$, so it is intractable for any reasonable $k$. This intractability is the central problem that each generative model family resolves differently.

  • VAEs: approximate the integral via a lower bound (ELBO), introducing an encoder network.
  • GANs: bypass the integral entirely - never compute $p_\theta(x)$, just sample.
  • Flows: restrict the model architecture so the integral becomes a simple formula.
  • Diffusion: use many small steps where each conditional $p_\theta(x_{t-1}|x_t)$ is tractable.
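To make the marginal-likelihood integral concrete, here is a minimal sketch in plain Python. It assumes a toy one-dimensional linear-Gaussian model (the parameters `a` and `noise_var` are illustrative, not from any real system) where the integral happens to have a closed form, so a naive Monte Carlo estimate over prior samples can be checked against the exact answer. With a neural-network decoder and a high-dimensional latent, no such closed form exists and the naive estimator's variance becomes prohibitive.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def mc_marginal(x, a=2.0, noise_var=0.25, n_samples=20_000, seed=0):
    """Naive estimate of p(x) = E_{z ~ N(0,1)}[p(x|z)] with p(x|z) = N(a*z, noise_var)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)          # sample from the prior
        total += gaussian_pdf(x, a * z, noise_var)
    return total / n_samples


def exact_marginal(x, a=2.0, noise_var=0.25):
    """Closed form for the toy model: x = a*z + noise, so x ~ N(0, a^2 + noise_var)."""
    return gaussian_pdf(x, 0.0, a * a + noise_var)


estimate = mc_marginal(1.0)
exact = exact_marginal(1.0)
print(f"Monte Carlo: {estimate:.4f}   exact: {exact:.4f}")
```

In one dimension a few thousand prior samples suffice because many of them land where $p_\theta(x|z)$ is non-negligible. In a high-dimensional latent space almost none do, which is what motivates the four strategies listed above.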

2. Variational Autoencoders (VAEs)

The ELBO and the Reparameterization Trick

VAEs (Kingma and Welling, 2013) introduce an approximate posterior $q_\phi(z|x)$ (the encoder network) and derive a tractable lower bound on the log-likelihood:

$$\log p_\theta(x) \geq \mathcal{L}(x;\, \theta, \phi) = \underbrace{\mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right]}_{\text{reconstruction}} - \underbrace{D_\text{KL}\!\left(q_\phi(z|x) \| p(z)\right)}_{\text{regularization (KL toward prior)}}$$

This is the Evidence Lower BOund (ELBO). The gap between $\log p_\theta(x)$ and the ELBO is exactly $D_\text{KL}(q_\phi(z|x) \| p_\theta(z|x))$ - the KL divergence between the approximate and true posterior. Maximizing the ELBO jointly over $\theta$ (decoder) and $\phi$ (encoder) improves the generative model and the inference network at the same time.
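The claim that the ELBO gap equals the KL divergence to the true posterior can be checked numerically. The sketch below assumes a toy model chosen only because everything is available in closed form: $z \sim \mathcal{N}(0, 1)$ and $x|z \sim \mathcal{N}(z, 1)$, so $x \sim \mathcal{N}(0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. The approximate posterior is a Gaussian $\mathcal{N}(m, s^2)$ with hand-picked parameters rather than an encoder network.

```python
import math


def log_marginal(x):
    """log p(x) for the toy model, where x ~ N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0


def elbo(x, m, s2):
    """ELBO with q(z|x) = N(m, s2): reconstruction term minus KL to the N(0,1) prior."""
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl_to_prior = 0.5 * (m * m + s2 - 1.0 - math.log(s2))
    return recon - kl_to_prior


def kl_to_true_posterior(x, m, s2):
    """KL( N(m, s2) || N(x/2, 1/2) ), the exact ELBO gap for this model."""
    post_mean, post_var = x / 2.0, 0.5
    return 0.5 * (math.log(post_var / s2) + (s2 + (m - post_mean) ** 2) / post_var - 1.0)


x = 1.0
gap_poor_q = log_marginal(x) - elbo(x, m=0.0, s2=1.0)   # q equal to the prior
gap_best_q = log_marginal(x) - elbo(x, m=0.5, s2=0.5)   # q equal to the true posterior
print(f"gap with q = prior:          {gap_poor_q:.4f}")
print(f"KL(q || true posterior):     {kl_to_true_posterior(x, 0.0, 1.0):.4f}")
print(f"gap with q = true posterior: {gap_best_q:.2e}")
```

The bound is tight only when the encoder matches the true posterior exactly, which a restricted Gaussian encoder generally cannot do for a nonlinear decoder.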

The reparameterization trick makes this end-to-end differentiable. Rather than sampling $z \sim q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$ directly (the sampling operation blocks gradients with respect to $\phi$), write:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \mathbf{I})$$

The randomness $\varepsilon$ is moved outside the computational graph. Gradients flow through $\mu_\phi$ and $\sigma_\phi$ as normal.
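A small sketch of why this works, using plain Python instead of an autodiff framework. It assumes an illustrative objective $\mathbb{E}[z^2]$ (not the VAE loss) because its gradients are known exactly: $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so the gradients are $2\mu$ and $2\sigma$. Writing $z = \mu + \sigma\varepsilon$ lets the chain rule pass through each sample.

```python
import random


def reparam_grads(mu, sigma, n_samples=50_000, seed=0):
    """Pathwise (reparameterized) gradient estimate of E[z^2] w.r.t. mu and sigma."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # randomness sampled independently of mu, sigma
        z = mu + sigma * eps        # deterministic, differentiable function of mu, sigma
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


g_mu, g_sigma = reparam_grads(mu=1.5, sigma=0.8)
print(f"estimated grads: ({g_mu:.3f}, {g_sigma:.3f})   exact: (3.000, 1.600)")
```

In a VAE the same chain rule is applied automatically by the autodiff engine, with $\mu$ and $\sigma$ produced by the encoder.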

Why VAEs Produce Blurry Samples

The reconstruction term $\mathbb{E}_{q_\phi}[\log p_\theta(x|z)]$ is typically pixel-wise MSE or binary cross-entropy, which treats pixels as conditionally independent given $z$. When the model is uncertain about the exact value of a pixel (e.g., whether the edge of the cat's ear falls at position 47 or 49), it hedges and predicts the average. Compounded across every pixel, across the uncertainty in $z$, and across the variability in the training set, the result is systematic blurriness. In addition, the gap between the ELBO and the true log-likelihood (the KL divergence between the approximate and true posterior) means the model optimizes only a bound, so it cannot perfectly trade off reconstruction quality against the prior.
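The hedging argument can be reduced to a single pixel. The sketch below assumes a hypothetical edge pixel that is dark (0.0) in half of the plausible images and bright (1.0) in the other half. Under MSE the best single prediction is the mean, a gray value that appears in none of the plausible images.

```python
def mse(prediction, targets):
    """Mean squared error of one constant prediction against all plausible targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


def best_constant_under_mse(targets):
    """The minimizer of mean squared error over constants is the arithmetic mean."""
    return sum(targets) / len(targets)


# Hypothetical edge pixel: dark in half of the plausible images, bright in the rest.
plausible_values = [0.0] * 50 + [1.0] * 50
hedged = best_constant_under_mse(plausible_values)

for candidate in (0.0, hedged, 1.0):
    print(f"predict {candidate:.1f} -> MSE {mse(candidate, plausible_values):.2f}")
```

Committing to either sharp value costs twice the error of the gray average, so a pixel-wise loss rewards blur wherever the decoder is uncertain.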

Posterior collapse: if the decoder becomes powerful enough to reconstruct $x$ without using $z$ at all, the KL term drives $q_\phi(z|x) \to p(z)$, and the latent space becomes useless. The model degenerates into a pure decoder with an empty latent space.

VAE Properties

| Property | VAE |
| --- | --- |
| Training objective | Maximize ELBO |
| Likelihood | Lower bound (approximate) |
| Sample quality | Moderate (blurry) |
| Diversity | Good - covers full distribution |
| Training stability | Excellent - standard SGD |
| Sampling speed | Fast - one decoder pass |

3. Generative Adversarial Networks (GANs)

The Minimax Game and Nash Equilibrium

GANs (Goodfellow et al., 2014) take a fundamentally different approach: implicit density estimation via an adversarial game. The generator $G$ maps noise $z \sim p(z)$ to data space. The discriminator $D$ tries to distinguish real from generated samples:

$$\min_G \max_D \;\mathbb{E}_{x \sim p_\text{data}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]$$

At the Nash equilibrium: $G$ produces samples from $p_\text{data}$, and $D$ outputs $\frac{1}{2}$ everywhere (cannot distinguish real from fake). The generator never explicitly computes $p_\theta(x)$ - it learns a deterministic mapping from noise to data that fools the discriminator.
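The equilibrium can be illustrated on a tiny discrete example. The sketch below assumes two hand-written distributions over three outcomes (purely illustrative) and uses the standard result from Goodfellow et al. (2014) that, for a fixed generator, the optimal discriminator is $D^*(x) = p_\text{data}(x) / (p_\text{data}(x) + p_g(x))$.

```python
import math


def optimal_discriminator(p_data, p_gen):
    """D*(x) = p_data(x) / (p_data(x) + p_gen(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_gen)]


def gan_value(p_data, p_gen, d):
    """V(D, G) = E_data[log D(x)] + E_gen[log(1 - D(x))] on a finite support."""
    return sum(
        pd * math.log(dx) + pg * math.log(1.0 - dx)
        for pd, pg, dx in zip(p_data, p_gen, d)
    )


p_data = [0.5, 0.3, 0.2]
p_gen_bad = [0.2, 0.3, 0.5]      # generator that misallocates probability mass
p_gen_perfect = list(p_data)     # generator that matches the data exactly

d_bad = optimal_discriminator(p_data, p_gen_bad)
d_perfect = optimal_discriminator(p_data, p_gen_perfect)
value_bad = gan_value(p_data, p_gen_bad, d_bad)
value_perfect = gan_value(p_data, p_gen_perfect, d_perfect)

print("D* at equilibrium:", d_perfect)
print(f"value at equilibrium: {value_perfect:.4f}   (-log 4 = {-math.log(4):.4f})")
print(f"value with mismatched generator: {value_bad:.4f}")
```

When the generator matches the data, the best discriminator outputs 0.5 everywhere and the value is $-\log 4$. Any mismatch gives the discriminator something to exploit and raises the value above that floor.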

Why GANs produce sharp images: the discriminator learns a rich perceptual similarity measure. A blurry image is easily distinguishable from a real image - the discriminator penalizes it directly. Unlike pixel-wise MSE, the adversarial loss captures high-frequency texture details. This is why DCGAN (Radford et al., 2015) immediately produced dramatically sharper samples than VAEs of the same era.

The Failure Modes

Mode collapse: the generator finds a small set of outputs that reliably fool the discriminator and stops exploring. In the extreme, the generator maps every $z$ to near-identical outputs. The training dataset might have 1000 distinct categories while the generator collapses to producing images from 10. Mode collapse is detected by inspecting diversity directly: a reasonable-looking FID can coexist with partial mode collapse if the collapsed modes are photorealistic.

Training instability: the minimax game is difficult to optimize. When $D$ is too strong, the generator gradient $\nabla_G \mathbb{E}[\log(1-D(G(z)))]$ vanishes (discriminator is saturated). When $D$ is too weak, $G$ receives no useful learning signal. Balancing requires careful architectural choices: spectral normalization, gradient penalty (WGAN-GP), two-timescale learning rates, label smoothing. A failed GAN training run often produces only noise or one mode - and there is no loss curve diagnostic that reliably predicts failure.
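The vanishing-gradient claim can be checked with one line of calculus. The sketch below assumes the discriminator output is $D = \sigma(a)$ for a scalar logit $a$ and compares the gradient of the minimax generator loss $\log(1 - D)$ with the non-saturating alternative $-\log D$ that Goodfellow et al. (2014) suggest for practice. The logit values are illustrative.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(logit):
    """d/da of log(1 - sigmoid(a)), the minimax generator loss: equals -sigmoid(a)."""
    return -sigmoid(logit)


def grad_non_saturating(logit):
    """d/da of -log(sigmoid(a)), the non-saturating generator loss: equals -(1 - sigmoid(a))."""
    return -(1.0 - sigmoid(logit))


# A very negative logit means the discriminator confidently rejects the fake.
for logit in (-8.0, -2.0, 0.0):
    print(
        f"logit {logit:+.1f}: D={sigmoid(logit):.4f}  "
        f"saturating grad={grad_saturating(logit):+.4f}  "
        f"non-saturating grad={grad_non_saturating(logit):+.4f}"
    )
```

With a confident discriminator the minimax loss passes almost no gradient back to the generator, while the non-saturating form keeps a gradient close to -1. Training the generator with binary cross-entropy against "real" labels corresponds to the non-saturating form.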

No likelihood: a GAN cannot compute $\log p_\theta(x)$ for a given $x$. The mapping $G: \mathbb{R}^k \to \mathbb{R}^D$ with $k \ll D$ (100-dim noise to 196K-dim image) is not invertible. No likelihood means no anomaly detection, no model selection by held-out likelihood, no density estimation.

GAN Properties

| Property | GAN |
| --- | --- |
| Training objective | Minimax adversarial game |
| Likelihood | None (implicit density) |
| Sample quality | Excellent - sharp, photorealistic |
| Diversity | Poor to moderate - mode collapse |
| Training stability | Poor - requires tricks |
| Sampling speed | Extremely fast - one forward pass |

4. Normalizing Flows

Exact Density via Bijective Transforms

Normalizing flows model $p_\theta(x)$ exactly via a bijective transformation $f: \mathbb{R}^d \to \mathbb{R}^d$. If $z \sim p(z)$ and $x = f(z)$, the change-of-variables formula gives:

$$\log p_\theta(x) = \log p(z) - \log\left|\det J_f(z)\right|, \qquad z = f^{-1}(x)$$

where $J_f(z) = \frac{\partial f}{\partial z}$ is the Jacobian. Since $f$ is bijective:

  • Sample: draw $z \sim \mathcal{N}(0, \mathbf{I})$, compute $x = f(z)$ - one forward pass.
  • Evaluate exact likelihood: compute $z = f^{-1}(x)$, plug into the formula - one inverse pass.
  • Train: maximize exact log-likelihood end-to-end.
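The change-of-variables formula is easiest to verify on the simplest possible flow. The sketch below assumes a one-dimensional affine map $x = a z + b$ (the values of `a` and `b` are illustrative), for which the Jacobian is just $a$ and the resulting density must equal $\mathcal{N}(b, a^2)$.

```python
import math


def std_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2 * math.pi))


def flow_forward(z, a, b):
    """Sampling direction: x = f(z) = a*z + b."""
    return a * z + b


def flow_inverse(x, a, b):
    """Density direction: z = f^{-1}(x)."""
    return (x - b) / a


def flow_logpdf(x, a, b):
    """log p(x) = log p(z) - log|det J_f(z)|, where J_f = a for an affine map."""
    z = flow_inverse(x, a, b)
    return std_normal_logpdf(z) - math.log(abs(a))


def normal_logpdf(x, mean, std):
    """Reference: log density of N(mean, std^2)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


a, b = 2.0, -1.0
x = 0.7
print(f"flow log-density:     {flow_logpdf(x, a, b):.6f}")
print(f"analytic N(b, a^2):   {normal_logpdf(x, b, abs(a)):.6f}")
```

Real flows stack many nonlinear bijections, but each layer contributes to the log-density in exactly this way: invert, then subtract the log-determinant.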

This is the only family covered here that achieves exact maximum likelihood training together with efficient sampling - a remarkable theoretical property.

The Architectural Constraint

To make $\det J_f$ tractable, the transformation $f$ must have a structured Jacobian - triangular, block-triangular, or computed via autoregressive structure. This severely limits model expressiveness.

RealNVP (Dinh et al., 2016): coupling layers split dimensions into two halves; one half transforms the other. Jacobian is block-triangular. Efficient in both directions.
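A coupling layer can be sketched in a few lines of plain Python. The functions `scale_net` and `shift_net` below are hypothetical stand-ins for the small neural networks a real RealNVP layer would learn. The point is only the structure: the first half passes through unchanged, the second half is scaled and shifted using values computed from the first half, and the log-determinant is the sum of the log-scales.

```python
import math


def scale_net(x1):
    """Hypothetical stand-in for a learned network producing log-scales."""
    return [math.tanh(0.5 * sum(x1)) for _ in x1]


def shift_net(x1):
    """Hypothetical stand-in for a learned network producing shifts."""
    return [0.3 * sum(x1) for _ in x1]


def coupling_forward(x):
    """y1 = x1; y2 = x2 * exp(s(x1)) + t(x1). Returns (y, log|det J|)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = scale_net(x1), shift_net(x1)
    y2 = [xi * math.exp(si) + ti for xi, si, ti in zip(x2, s, t)]
    return x1 + y2, sum(s)


def coupling_inverse(y):
    """x1 = y1; x2 = (y2 - t(y1)) * exp(-s(y1)). No network inversion needed."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = scale_net(y1), shift_net(y1)
    x2 = [(yi - ti) * math.exp(-si) for yi, si, ti in zip(y2, s, t)]
    return y1 + x2


x = [0.2, -1.0, 0.5, 1.5]           # even length, so both halves have equal size
y, log_det = coupling_forward(x)
x_back = coupling_inverse(y)
print("y =", y)
print("log|det J| =", log_det)
print("round trip =", x_back)
```

Because the untouched half is available in both directions, the scale and shift networks never need to be inverted, which is why they can be arbitrarily complex.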

Glow (Kingma and Dhariwal, 2018): extends RealNVP with 1×1 invertible convolutions for learnable channel permutations. Produced decent $256 \times 256$ face samples.

Autoregressive flows (MAF, IAF): condition each dimension on previous ones. The Jacobian is triangular, so the log-determinant is cheap. The cost is a direction asymmetry: MAF evaluates density in one parallel pass but samples sequentially ($O(d)$ passes), while IAF samples in one parallel pass but needs $O(d)$ sequential passes to evaluate the density of an arbitrary data point. An autoregressive flow cannot have fast sampling and fast density evaluation simultaneously.

For high-dimensional images: the bijective constraint is painfully limiting. Glow required enormous model size and still produced samples clearly inferior to BigGAN of the same era - a reminder that exact likelihood and sample quality are distinct objectives. A model can be excellent at density estimation and mediocre at generation.

Flow Properties

| Property | Normalizing Flow |
| --- | --- |
| Training objective | Exact maximum likelihood |
| Likelihood | Exact |
| Sample quality | Moderate |
| Diversity | Good |
| Training stability | Good |
| Sampling speed | Fast (one pass) or slow (autoregressive) |

5. Energy-Based Models

Energy-based models (EBMs) define:

$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}, \qquad Z(\theta) = \int \exp(-E_\theta(x))\, dx$$

where $E_\theta(x)$ is a neural network (the "energy") and $Z(\theta)$ is the partition function - the normalizing constant. Low energy = high probability; high energy = low probability.

The intractable partition function: $Z(\theta)$ requires integrating over all of input space - impossible for high-dimensional $x$. This makes the maximum likelihood gradient $\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) - \nabla_\theta \log Z(\theta)$ intractable: the second term depends on $Z$, and evaluating it exactly requires an expectation under the model distribution itself.
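To see why the partition function is harmless in one dimension and hopeless in image space, consider the sketch below. It assumes a toy quadratic energy $E(x) = x^2/2$, whose partition function is known to be $\sqrt{2\pi}$, and integrates it on a grid. The grid resolution of 100 points per dimension used in the cost estimate is an arbitrary illustrative choice.

```python
import math


def energy(x):
    """Toy energy function; exp(-E) is an unnormalized standard Gaussian."""
    return 0.5 * x * x


def partition_function_1d(energy_fn, lo=-10.0, hi=10.0, n_points=20_001):
    """Trapezoid-rule estimate of Z = integral of exp(-E(x)) dx over [lo, hi]."""
    h = (hi - lo) / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        weight = 0.5 if i in (0, n_points - 1) else 1.0
        total += weight * math.exp(-energy_fn(lo + i * h))
    return total * h


z_estimate = partition_function_1d(energy)
print(f"grid estimate of Z: {z_estimate:.6f}   exact sqrt(2*pi): {math.sqrt(2 * math.pi):.6f}")

# The same grid strategy in image space: 100 points per dimension, 196,608 dimensions.
log10_grid_evaluations = 196_608 * math.log10(100)
print(f"grid evaluations for a 256x256 RGB image: about 10^{log10_grid_evaluations:.0f}")
```

The exponential growth of that count with dimension is what forces EBM training toward sampling-based approximations.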

Training uses contrastive divergence or MCMC-based estimators: generate negative samples (from the current model) via MCMC (e.g., Langevin dynamics), and update weights to increase energy of negative samples and decrease energy of positive (real) samples. This requires running MCMC chains at every training step - expensive and often poorly mixed.

EBMs are theoretically elegant - any neural network can be an energy function - but practically difficult to train for high-dimensional data.

6. Score-Based Models (Preview)

Score-based models (Song and Ermon, 2019) sidestep both the intractable partition function and the explicit likelihood by learning the score function:

$$s_\theta(x) \approx \nabla_x \log p(x)$$

The score is the gradient of the log-density with respect to the input. It does not depend on $Z$ (since $\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x)$ where $\tilde{p}$ is the unnormalized density). Given the score, Langevin dynamics generates samples:

$$x_{t+1} = x_t + \frac{\varepsilon}{2} s_\theta(x_t) + \sqrt{\varepsilon}\, \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \mathbf{I})$$

Starting from any $x_0$, as step size $\varepsilon \to 0$ and the number of steps $\to \infty$, the chain converges to $p(x)$. Score matching provides a way to train $s_\theta$ from data without access to $p(x)$ - covered in full in Lesson 03.
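The Langevin update can be tried out with a score that is known analytically. The sketch below assumes a one-dimensional Gaussian target $\mathcal{N}(3, 0.5^2)$, whose score is $-(x - 3)/0.25$, in place of a learned $s_\theta$. The step size, chain length, and number of chains are illustrative choices, not tuned values.

```python
import math
import random


def score_gaussian(x, mean=3.0, std=0.5):
    """Exact score of N(mean, std^2): d/dx log p(x) = -(x - mean) / std^2."""
    return -(x - mean) / (std * std)


def langevin_chain(score_fn, x0, rng, step=0.01, n_steps=500):
    """x <- x + (step/2) * score(x) + sqrt(step) * noise, repeated n_steps times."""
    x = x0
    for _ in range(n_steps):
        x = x + 0.5 * step * score_fn(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
# Every chain starts far from the target, at x0 = -5.
samples = [langevin_chain(score_gaussian, x0=-5.0, rng=rng) for _ in range(400)]

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / (len(samples) - 1)
print(f"sample mean {sample_mean:.3f} (target 3.0), sample variance {sample_var:.3f} (target 0.25)")
```

Only the score is used: the chain never evaluates a density or a normalizing constant. With a finite step size the stationary variance is slightly larger than the target, which is the discretization bias that vanishes as the step size goes to zero.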

7. Diffusion Models

The Core Idea

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) model the data distribution through a progressive denoising process.

The forward process gradually corrupts a clean data point $x_0$ by adding Gaussian noise over $T$ steps:

$$q(x_t | x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t\mathbf{I}\right)$$

By step $T$, $x_T$ is approximately distributed as $\mathcal{N}(0, \mathbf{I})$ - pure noise. Defining $\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)$, the marginal at any step is:

$$q(x_t|x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar{\alpha}_t}\, x_0,\; (1-\bar{\alpha}_t)\mathbf{I}\right)$$

allowing direct sampling at any noise level without iterating through all steps.
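The closed-form marginal can be confirmed by pushing the mean and variance of $x_t$ through the per-step kernel one step at a time. The sketch below assumes a linear $\beta$ schedule from $10^{-4}$ to $0.02$ over $T = 1000$ steps and a scalar $x_0$. It tracks moments rather than drawing samples, so the check is exact up to floating-point error.

```python
import math


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def iterate_forward_moments(betas, x0):
    """Apply q(x_t | x_{t-1}) step by step, tracking mean and variance of x_t given x0."""
    mean, var = x0, 0.0
    for beta in betas:
        mean = math.sqrt(1.0 - beta) * mean   # the mean shrinks at each step
        var = (1.0 - beta) * var + beta       # old variance shrinks, new noise is added
    return mean, var


betas = linear_betas(1000)
ab = alpha_bars(betas)
x0 = 2.0

iter_mean, iter_var = iterate_forward_moments(betas, x0)
closed_mean, closed_var = math.sqrt(ab[-1]) * x0, 1.0 - ab[-1]

print(f"alpha_bar_T = {ab[-1]:.2e}")
print(f"iterated:    mean {iter_mean:.6f}, var {iter_var:.6f}")
print(f"closed form: mean {closed_mean:.6f}, var {closed_var:.6f}")
```

Since $\bar{\alpha}_T$ is close to zero, almost nothing of $x_0$ survives at step $T$ and the variance is close to one, which is why $x_T$ can be treated as standard Gaussian noise.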

The reverse process is a learned neural network (typically a U-Net) that undoes one noising step:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)$$

Training: at each training iteration, sample a random $t$, corrupt $x_0$ to $x_t$, then train the network to predict the noise that was added:

$$\mathcal{L}_\text{simple} = \mathbb{E}_{t,\, x_0,\, \varepsilon \sim \mathcal{N}(0,\mathbf{I})}\!\left[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2\right]$$

This is simple MSE - no adversarial game, no intractable integral, no architectural constraints.

Why Diffusion Models Win - The Four Constraints

The generative modeling community cares about four things simultaneously:

Sample quality: do generated samples look realistic? FID, IS, human preference.

Sample diversity: does the model cover the full data distribution? A model that generates perfect images for only 100 of ImageNet's 1000 classes would score well on quality but fail on diversity. Measured by recall and coverage metrics; FID is also sensitive to it.

Training stability: can you reliably train to convergence without hand-tuning? GANs are notorious for mode collapse, loss curve divergence, and "it was working yesterday" instability.

Likelihood estimation: can you compute $\log p_\theta(x)$ for a given $x$? Essential for anomaly detection, model selection, and Bayesian inference pipelines.

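| Model | Quality | Diversity | Stability | Likelihood |
| --- | --- | --- | --- | --- |
| VAE | Moderate | Good | Excellent | Approx. lower bound |
| GAN | Excellent | Poor | Poor | None |
| Flow | Moderate | Good | Good | Exact |
| EBM | Good | Good | Poor | Intractable |
| Diffusion | Excellent | Excellent | Excellent | Approx. lower bound |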

The key insight: the forward process in diffusion is fixed and analytically known - you designed it. The training target at each step ($\varepsilon_\theta(x_t, t) \approx \varepsilon$) is always a well-conditioned MSE regression problem. There is no adversarial dynamic, no mode collapse risk, no non-stationary training landscape. Diffusion achieves excellent quality (iterative denoising over 100–1000 steps allows fine-grained correction) and excellent diversity (each sample starts from an independent random $x_T \sim \mathcal{N}(0, \mathbf{I})$).

The only sacrifice: sampling speed. DDPM requires $T = 1000$ denoising passes. DDIM (Song et al., 2020) reduces this to 20–50 steps via a deterministic sampler. Latent diffusion (Rombach et al., 2022) runs the diffusion process in a small latent space instead of pixel space, reducing per-step cost by 16–64×. These two innovations together make diffusion practical for production.
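The two speed-ups can be put into rough numbers. The sketch below rests on two assumptions that are not stated in the text above: that the 16–64× figure corresponds to autoencoder downsampling factors of 4 and 8 per spatial side, and that per-step cost scales with the number of spatial positions the denoiser processes. Real costs also depend on channel counts and architecture, so treat the output as an order-of-magnitude illustration.

```python
def spatial_positions(height, width, downsample_factor):
    """Number of spatial positions after downsampling each side by the given factor."""
    return (height // downsample_factor) * (width // downsample_factor)


pixel_positions = spatial_positions(256, 256, 1)    # diffusion directly on pixels
latent_f4 = spatial_positions(256, 256, 4)          # assumed downsampling factor 4
latent_f8 = spatial_positions(256, 256, 8)          # assumed downsampling factor 8

reduction_f4 = pixel_positions // latent_f4
reduction_f8 = pixel_positions // latent_f8
step_reduction = 1000 / 50                          # DDPM steps vs. a 50-step DDIM run

print(f"positions: pixels {pixel_positions}, f=4 latent {latent_f4}, f=8 latent {latent_f8}")
print(f"per-step spatial reduction: {reduction_f4}x to {reduction_f8}x")
print(f"step-count reduction: {step_reduction:.0f}x")
```

The step-count and per-step savings multiply, which is why the combination matters more than either technique alone.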

FID Evolution on ImageNet

FID (Fréchet Inception Distance) measures distributional similarity in Inception feature space. Lower is better. This table traces the progression from 2018 to 2023 (note that the StyleGAN2 entry is reported on FFHQ, not ImageNet):

| Model | Family | FID (ImageNet 256×256) | Year |
| --- | --- | --- | --- |
| Glow | Flow | 48.9 | 2018 |
| VQVAE-2 | Hybrid (VAE+Autoregressive) | 31.1 | 2019 |
| BigGAN | GAN | 7.4 | 2019 |
| StyleGAN2 | GAN | 3.8 (FFHQ 1024px) | 2020 |
| ADM (DDPM++) | Diffusion | 10.9 | 2021 |
| ADM-G (classifier guided) | Diffusion | 4.59 | 2021 |
| LDM-4 | Latent Diffusion | 10.56 | 2022 |
| DiT-XL/2 | Diffusion Transformer | 2.27 | 2022 |
| SDXL | Latent Diffusion | <2.0 | 2023 |

The DiT-XL/2 result (2.27 FID, ImageNet 256×256 class-conditional) surpassed all prior GAN-based methods. Diffusion did not just catch up to GANs - it decisively surpassed them.

FID formula for reference:

$$\text{FID} = \|\mu_r - \mu_g\|^2 + \text{tr}\!\left(\Sigma_r + \Sigma_g - 2\!\left(\Sigma_r\Sigma_g\right)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ are the mean and covariance of Inception features on real images, and $(\mu_g, \Sigma_g)$ on generated images. The first term captures the shift in mean features and the second captures covariance mismatch; loosely, these track quality and diversity, although FID does not separate the two cleanly. Lower FID = closer to real data in feature space.
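The formula is easier to read in one dimension, where the covariances are plain variances and the trace term collapses to $(\sigma_r - \sigma_g)^2$. The sketch below assumes scalar "features" with hand-picked means and variances. It is not a substitute for a real FID implementation, which needs Inception features and a matrix square root.

```python
import math


def fid_1d(mu_r, var_r, mu_g, var_g):
    """Frechet distance between two 1-D Gaussians: mean term plus variance term."""
    mean_term = (mu_r - mu_g) ** 2
    var_term = var_r + var_g - 2.0 * math.sqrt(var_r * var_g)
    return mean_term + var_term


identical = fid_1d(0.0, 1.0, 0.0, 1.0)        # generated matches real
mean_shift = fid_1d(0.0, 1.0, 1.0, 1.0)       # same spread, shifted features
low_diversity = fid_1d(0.0, 1.0, 0.0, 0.25)   # same mean, collapsed spread

print(f"identical distributions: {identical:.2f}")
print(f"mean shift only:         {mean_shift:.2f}")
print(f"variance collapse only:  {low_diversity:.2f}")
```

Both kinds of failure raise the same scalar, which is why a single FID value cannot say whether a model is failing on fidelity or on coverage.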

Code - Comparing Generative Model Architectures

import torch
import torch.nn as nn
import torch.nn.functional as F
import time


# ============================================================
# VAE - ELBO training, fast single-pass sampling
# ============================================================
class VAE(nn.Module):
    """
    Variational Autoencoder on flattened images (e.g., MNIST 28×28=784).
    Encoder q_φ(z|x) → (μ, log σ²)
    Decoder p_θ(x|z) → reconstructed x
    """
    def __init__(self, input_dim: int = 784, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, latent_dim * 2)  # output: [μ | log_σ²]
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, input_dim), nn.Sigmoid()
        )
        self.latent_dim = latent_dim

    def encode(self, x):
        h = self.encoder(x)
        mu, log_var = h.chunk(2, dim=-1)
        return mu, log_var

    def reparameterize(self, mu, log_var):
        """
        Reparameterization trick: z = μ + σ·ε, ε~N(0,I)
        Moves randomness outside the computational graph → differentiable.
        """
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var

    def elbo_loss(self, x):
        """
        ELBO = E[log p(x|z)] - KL(q(z|x) || p(z))
        Returns the negative ELBO (summed over the batch), i.e. a loss to minimize.
        Pixel-wise BCE for reconstruction; closed-form KL for Gaussians.
        The blurriness of VAE outputs comes from pixel-wise BCE/MSE:
        uncertainty → hedge → average → blur.
        """
        x_recon, mu, log_var = self(x)
        # Reconstruction: pixel-wise BCE (independent pixel assumption); expects x in [0, 1]
        recon_loss = F.binary_cross_entropy(x_recon, x, reduction="sum")
        # KL divergence: closed form for q=N(μ,σ²), p=N(0,1)
        kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon_loss + kl_loss

    @torch.no_grad()
    def sample(self, n: int, device: str) -> torch.Tensor:
        """One decoder forward pass - fast."""
        z = torch.randn(n, self.latent_dim, device=device)
        return self.decoder(z)


# ============================================================
# GAN - adversarial training, one-pass generation (no likelihood)
# ============================================================
class DCGANGenerator(nn.Module):
    """
    Deep Convolutional GAN generator.
    Maps z~N(0,I) → (C,H,W) images via transposed convolutions.
    One forward pass → sharp image. No likelihood ever computed.
    """
    def __init__(self, latent_dim: int = 100, channels: int = 64,
                 image_channels: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            # z → 4×4 feature map
            nn.ConvTranspose2d(latent_dim, channels * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(channels * 8), nn.ReLU(True),
            # 4×4 → 8×8
            nn.ConvTranspose2d(channels * 8, channels * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(channels * 4), nn.ReLU(True),
            # 8×8 → 16×16
            nn.ConvTranspose2d(channels * 4, channels * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(channels * 2), nn.ReLU(True),
            # 16×16 → 32×32
            nn.ConvTranspose2d(channels * 2, image_channels, 4, 2, 1, bias=False),
            nn.Tanh(),  # output in [-1, 1]
        )
        self.latent_dim = latent_dim

    @torch.no_grad()
    def sample(self, n: int, device: str) -> torch.Tensor:
        """One forward pass - extremely fast. No likelihood computation possible."""
        z = torch.randn(n, self.latent_dim, 1, 1, device=device)
        return self.net(z)


class DCGANDiscriminator(nn.Module):
    """
    DCGAN discriminator: real or fake?
    Expects 32×32 inputs (matching DCGANGenerator) and returns one probability per image.
    If too strong → generator gradient vanishes (training instability).
    If too weak → generator gets no useful signal.
    Balancing this is the core difficulty of GAN training.
    """
    def __init__(self, channels: int = 64, image_channels: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            # 32×32 → 16×16
            nn.Conv2d(image_channels, channels, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # 16×16 → 8×8
            nn.Conv2d(channels, channels * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(channels * 2), nn.LeakyReLU(0.2, inplace=True),
            # 8×8 → 4×4
            nn.Conv2d(channels * 2, channels * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(channels * 4), nn.LeakyReLU(0.2, inplace=True),
            # 4×4 → 1×1 score
            nn.Conv2d(channels * 4, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).view(-1)


def gan_losses(real_x, G, D):
    """
    Standard GAN losses.
    d_loss: D is trained to output 1 for real, 0 for fake.
    g_loss: G is trained to make D output 1 for its fakes
            (the non-saturating generator loss, -log D(G(z))).
    """
    bs = real_x.size(0)
    device = real_x.device

    real_lbl = torch.ones(bs, device=device)
    fake_lbl = torch.zeros(bs, device=device)

    z = torch.randn(bs, G.latent_dim, 1, 1, device=device)
    fake_x = G.net(z).detach()  # detach: do not update G when training D

    d_loss = (
        F.binary_cross_entropy(D(real_x), real_lbl)
        + F.binary_cross_entropy(D(fake_x), fake_lbl)
    )

    # Fresh noise for the generator update; gradients flow through D into G here
    z = torch.randn(bs, G.latent_dim, 1, 1, device=device)
    g_loss = F.binary_cross_entropy(D(G.net(z)), real_lbl)

    return d_loss, g_loss


# ============================================================
# Minimal DDPM - T forward passes, MSE training
# ============================================================
class MinimalDDPM(nn.Module):
    """
    DDPM on flattened data (MLP for simplicity; production uses U-Net).
    Training: simple MSE noise prediction.
    Sampling: T iterative denoising steps.
    The T-step cost is the only trade-off vs GAN/VAE.
    """
    def __init__(self, input_dim: int = 784, T: int = 1000):
        super().__init__()
        self.T = T
        self.input_dim = input_dim

        # Noise schedule: linear beta schedule
        betas = torch.linspace(1e-4, 0.02, T)
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)

        # Buffers move with the module (.to(device)) but are not trained
        self.register_buffer("betas", betas)
        self.register_buffer("alphas", alphas)
        self.register_buffer("alpha_bars", alpha_bars)

        # Timestep-conditioned noise predictor (U-Net in real implementations)
        self.eps_net = nn.Sequential(
            nn.Linear(input_dim + 1, 1024), nn.SiLU(),
            nn.Linear(1024, 1024), nn.SiLU(),
            nn.Linear(1024, 512), nn.SiLU(),
            nn.Linear(512, input_dim),
        )

    def add_noise(self, x0, t, eps=None):
        """Forward process: x_t = sqrt(ᾱ_t)·x0 + sqrt(1-ᾱ_t)·ε"""
        if eps is None:
            eps = torch.randn_like(x0)
        ab = self.alpha_bars[t].unsqueeze(-1)
        return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, eps

    def predict_noise(self, xt, t):
        # Scalar timestep embedding: t normalized to [0, 1) and appended to the input
        t_emb = (t.float() / self.T).unsqueeze(-1)
        return self.eps_net(torch.cat([xt, t_emb], dim=-1))

    def training_loss(self, x0):
        """Simple MSE: always a well-conditioned regression problem."""
        t = torch.randint(0, self.T, (x0.size(0),), device=x0.device)
        xt, eps = self.add_noise(x0, t)
        eps_pred = self.predict_noise(xt, t)
        return F.mse_loss(eps_pred, eps)

    @torch.no_grad()
    def sample(self, n: int, device: str, steps: int | None = None) -> torch.Tensor:
        """
        Ancestral sampling: T denoising passes.
        This is the slow part - O(T) network evaluations.
        Production solutions: DDIM (20-50 steps), DPM-Solver (10-20 steps).
        """
        # NOTE: passing steps < T starts the chain at timestep steps-1, where x_t is
        # not yet pure noise. Real step skipping needs a DDIM-style sampler.
        if steps is None:
            steps = self.T
        x = torch.randn(n, self.input_dim, device=device)

        for step in reversed(range(steps)):
            t_batch = torch.full((n,), step, device=device, dtype=torch.long)

            ab_t = self.alpha_bars[step]
            a_t = self.alphas[step]
            b_t = self.betas[step]

            eps_pred = self.predict_noise(x, t_batch)
            # DDPM reverse step mean
            mu = (1.0 / a_t.sqrt()) * (
                x - (1 - a_t) / (1 - ab_t).sqrt() * eps_pred
            )
            if step > 0:
                x = mu + b_t.sqrt() * torch.randn_like(x)
            else:
                x = mu  # Final step: no noise

        return x


# ── Sampling speed comparison ──────────────────────────────────────────────────
def compare_sampling_speed():
    device = "cpu"
    n_samples = 8

    vae = VAE(input_dim=784, latent_dim=64)
    gen = DCGANGenerator(latent_dim=100, image_channels=1)
    ddpm = MinimalDDPM(input_dim=784, T=100)  # T=100 for demo, production uses 1000

    for name, fn in [
        ("VAE (1 decoder pass)", lambda: vae.sample(n_samples, device)),
        ("GAN (1 generator pass)", lambda: gen.sample(n_samples, device)),
        ("DDPM (100 denoise steps)", lambda: ddpm.sample(n_samples, device, steps=100)),
    ]:
        t0 = time.perf_counter()
        out = fn()
        elapsed = (time.perf_counter() - t0) * 1000
        print(f"{name}: {elapsed:.1f}ms → shape {tuple(out.shape)}")


compare_sampling_speed()
# Expected: VAE and GAN each take one network pass; DDPM runs 100 network
# evaluations per batch, so expect it to be far slower overall

Evaluation Metrics

FID (Fréchet Inception Distance): the primary metric. Computes the Fréchet distance between real and generated image distributions in InceptionV3 feature space. Captures both quality (mean shift) and diversity (covariance mismatch).

Inception Score (IS): measures whether each generated image is confidently assigned to a single ImageNet class (a sharp, low-entropy $p(y|x)$) and whether the images are diverse across classes (a high-entropy marginal, $H(y)$). Limitation: it is unreliable when the generated data lies outside the ImageNet distribution (e.g., medical images).
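The standard definition is $\text{IS} = \exp\!\big(\mathbb{E}_x\, D_\text{KL}(p(y|x) \| p(y))\big)$, which the sketch below computes from a small matrix of class probabilities. The probability rows are hand-written stand-ins for classifier outputs, chosen to show the two extremes. A real evaluation would take them from an Inception network over many thousands of images.

```python
import math


def inception_score(probs):
    """exp of the mean KL( p(y|x) || p(y) ) over rows of per-image class probabilities."""
    n, k = len(probs), len(probs[0])
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    kl_total = 0.0
    for row in probs:
        # Terms with p = 0 contribute nothing (the limit of p * log p as p -> 0 is 0).
        kl_total += sum(p * math.log(p / marginal[j]) for j, p in enumerate(row) if p > 0)
    return math.exp(kl_total / n)


sharp_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sharp_but_collapsed = [[1.0, 0.0, 0.0]] * 3
blurry = [[1 / 3, 1 / 3, 1 / 3]] * 3

score_diverse = inception_score(sharp_and_diverse)
score_collapsed = inception_score(sharp_but_collapsed)
score_blurry = inception_score(blurry)
print(f"sharp and diverse:   {score_diverse:.3f}")
print(f"sharp but collapsed: {score_collapsed:.3f}")
print(f"blurry (uniform):    {score_blurry:.3f}")
```

The score ranges from 1 up to the number of classes. It reaches the maximum only when every image is confidently classified and the classes are used evenly.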

Precision and Recall: precision = fraction of generated samples that fall within the real data manifold (quality); recall = fraction of the real manifold that is covered by generated samples (diversity). More interpretable than FID, increasingly standard.

CLIP Score: for text-conditioned generation - cosine similarity between CLIP embeddings of the generated image and the conditioning text prompt. Measures text-image alignment rather than photorealism.

Production Engineering Notes

:::warning Choosing the right model for your application
Not every application needs diffusion.

  • Real-time single-step generation (interactive tools, gaming, mobile): GAN or Consistency Model.
  • Exact likelihood computation (anomaly detection, NLP density estimation, OOD detection): Normalizing Flow or Autoregressive model.
  • Maximum quality text-to-image (creative AI, product photography, marketing): Latent Diffusion + classifier-free guidance.
  • Small dataset (less than 5K samples): VAE - GANs overfit catastrophically on small data; diffusion also struggles.
  • Controllable editing (inpainting, style transfer, image editing): Diffusion with DDIM inversion.
  • Scientific generation (molecules, proteins, materials): Diffusion - best coverage of the combinatorial space.
  • Speed-critical with quality (4-step generation acceptable): Latent Consistency Models (LCM) or Consistency Models.
:::

:::note Quality vs likelihood - the Pareto frontier
High log-likelihood and high sample quality are distinct objectives that do not always align. Normalizing flows achieve exact likelihood but moderate FID. GANs achieve excellent FID but no likelihood. Diffusion models achieve strong FID and an ELBO lower bound - sitting on the best-known Pareto frontier as of 2024. For most production applications, FID and human preference are the correct optimization targets.
:::

Common Mistakes

:::danger Confusing likelihood and sample quality
High log-likelihood does not imply good samples. A model can assign high probability to blurry or average-looking images. Conversely, a GAN can produce photorealistic samples with no computable likelihood at all. Flow models have reported better log-likelihood (measured in bits-per-dim) than the original DDPM while producing visually inferior samples. For generative AI applications, FID and human evaluation are the right metrics - not NLL.
:::

:::danger Using FID as the only metric
FID conflates quality and diversity - a model with a poor FID could be failing on quality (bad mean in Inception feature space) or on diversity (bad covariance), and the single number does not say which. Always report Precision (quality: what fraction of generated samples are realistic?) and Recall (coverage: what fraction of the real distribution is covered?) alongside FID. The community is moving toward these more interpretable decomposed metrics.
:::

:::warning Mode collapse is not always visible in training loss
A GAN with excellent training loss curves can still suffer from severe mode collapse. It might generate 995 of 1000 ImageNet categories beautifully but fail completely on rare classes. Checking per-class generation diversity, visualizing failure modes, and computing stratified FID across class subsets are important diagnostics. FID computed on the full test set can be misleadingly good when a tail of rare classes fails.
:::

:::warning Not restarting when GAN training diverges
GAN training instability often manifests as: discriminator loss → 0 (discriminator is dominating), generator loss → constant (gradient vanished), or mode collapse (generated samples become identical). These failure modes are often not recoverable from the same checkpoint - the training dynamic has left the basin of good behavior. Prevention is better than recovery: use gradient penalty (WGAN-GP), spectral normalization, or switch to a GAN variant with more stable training (StyleGAN2, ProjectedGAN).
:::

YouTube Resources

| Resource | Channel | Why Watch |
| --- | --- | --- |
| Generative Models - Andrej Karpathy | Stanford CS231n | Best single lecture covering VAE, GAN, and flow comparison with intuition - start here |
| GAN Tutorial - Ian Goodfellow | NeurIPS 2016 Tutorial | The GAN inventor explains mode collapse, training instability, and convergence theory |
| DDPM - Ho et al. Paper Talk | NeurIPS 2020 | The original DDPM paper presentation - explains the simplified training objective |
| Diffusion Models Beat GANs - Dhariwal and Nichol | NeurIPS 2021 | The FID-record paper - classifier guidance and the architecture improvements |
| Score-Based Generative Models - Yang Song | Stanford | Score matching, NCSN, and the SDE unification - deep theory behind modern diffusion |

Interview Q&A

Q1: Why did diffusion models surpass GANs on image generation quality, despite being much slower?

GANs optimize a minimax game - a fundamentally unstable optimization with two competing objectives. The generator can only improve as fast as the discriminator provides useful signal. Mode collapse means the generator exploits a narrow set of outputs that fool the discriminator and stops improving. Training is non-stationary: the optimal generator changes as the discriminator improves, and vice versa. These dynamics require extensive hyperparameter tuning, architectural tricks (spectral normalization, gradient penalty), and often fail silently.

Diffusion models optimize simple MSE noise prediction - a standard regression problem with a fixed target rather than a moving adversary. The training signal is rich: the network is trained at every noise level from pure noise to nearly clean images, receiving informative gradient signal throughout training. Iterative sampling (100–1000 steps) allows the model to make fine-grained corrections at each step, accumulating quality that a single-pass generator cannot achieve. Every sample starts from an independent $x_T \sim \mathcal{N}(0, \mathbf{I})$, and the likelihood-based objective penalizes ignoring any part of the data distribution, so mode collapse is far less of a problem than in GANs - there is no adversarial game for a generator to exploit.
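The noise-prediction objective can be sketched in NumPy as follows. This is a minimal illustration, assuming a linear beta schedule with 1000 steps and a placeholder predictor that always outputs zero; `q_sample`, `simple_loss`, and `zero_model` are hypothetical names, and a real setup would replace the placeholder with a trained network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # linear schedule (an assumption)
alpha_bar = np.cumprod(1.0 - betas)           # cumulative product of (1 - beta_s)


def q_sample(x0, t, eps):
    """Closed-form forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a = alpha_bar[t][:, None]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps


def simple_loss(eps_model, x0, rng):
    """MSE between the true noise and the predicted noise at random noise levels."""
    t = rng.integers(0, T, size=x0.shape[0])  # one random timestep per sample
    eps = rng.normal(size=x0.shape)
    x_t = q_sample(x0, t, eps)
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))


rng = np.random.default_rng(0)
x0 = rng.normal(size=(256, 16))               # stand-in "data"
zero_model = lambda x_t, t: np.zeros_like(x_t)  # untrained placeholder predictor
print("loss of a zero predictor:", simple_loss(zero_model, x0, rng))
```

A predictor that outputs zero scores a loss close to 1, the variance of the noise; training drives the loss below that baseline at every noise level.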

Q2: What is the relationship between the ELBO in VAEs and the ELBO in diffusion models?

Both optimize an Evidence Lower BOund on the log-likelihood. In VAEs: $\mathcal{L}_\text{VAE} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_\text{KL}(q_\phi(z|x) \| p(z))$. In diffusion models, the ELBO decomposes across $T$ timesteps: a reconstruction term $\mathbb{E}_q[\log p_\theta(x_0|x_1)]$, a prior-matching term $D_\text{KL}(q(x_T|x_0) \| p(x_T))$, and one denoising term $D_\text{KL}(q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t))$ for each noise level $t > 1$. The simplified objective $\mathcal{L}_\text{simple} = \mathbb{E}[\|\varepsilon - \varepsilon_\theta\|^2]$ is a reweighted version of the full diffusion ELBO. The key difference: the diffusion ELBO provides training signal at every noise level ($T$ terms), while the VAE ELBO has a single bottleneck. This richer supervision is a plausible reason diffusion models generate better samples and train more stably.

Q3: Explain why normalizing flows produce exact likelihood but mediocre samples.

Flows maximize $\log p_\theta(x) = \log p(z) - \log|\det J_f(z)|$, which is exact because the bijective constraint makes the integral trivial (one-to-one mapping). But maximizing $\log p_\theta(x)$ does not directly optimize visual quality - it optimizes probability assignment. A model can achieve high likelihood by placing probability mass broadly across the data manifold, at the cost of per-sample sharpness.
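The change-of-variables formula can be checked on the simplest possible flow. The sketch below assumes a one-dimensional affine map $x = a z + b$ with a standard normal base distribution (an illustrative choice; `flow_logpdf` is a hypothetical helper), for which the exact answer is the density of $\mathcal{N}(b, a^2)$.

```python
import numpy as np


def std_normal_logpdf(z):
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))


def flow_logpdf(x, a, b):
    """Exact log p(x) for the flow x = f(z) = a*z + b with z ~ N(0, 1)."""
    z = (x - b) / a                      # invert the flow
    log_det = np.log(np.abs(a))          # log|det J_f(z)|, constant for an affine map
    return std_normal_logpdf(z) - log_det


x = np.array([-1.0, 0.5, 3.0])
a, b = 2.0, 0.5
# Reference: x = a*z + b is distributed as N(b, a^2).
reference = -0.5 * (((x - b) / a) ** 2 + np.log(2.0 * np.pi * a ** 2))
print(flow_logpdf(x, a, b))
print(reference)
```

Both lines print the same values, which is the sense in which the flow likelihood is exact rather than a bound.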

More fundamentally, the bijective constraint limits flow architectures: to maintain tractable $\log|\det J_f|$, transformations must be structured (coupling layers, triangular Jacobians). This rules out the unrestricted U-Net, ResNet, or Transformer architectures that diffusion models use. The expressiveness gap means flows cannot learn as complex a distribution in practice, even though their training objective is theoretically stronger.
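A coupling layer shows what "structured" means in practice. The following is a minimal NumPy sketch of an affine coupling transform, assuming linear maps with a `tanh` squashing for the scale network (real flows use deeper networks there); `coupling_forward` and `coupling_inverse` are illustrative names.

```python
import numpy as np


def coupling_forward(x, w_s, w_t):
    """Affine coupling: x1 passes through unchanged; x2 is scaled and shifted using x1."""
    x1, x2 = np.split(x, 2, axis=1)
    s = np.tanh(x1 @ w_s)                 # log-scale, may be any function of x1
    t = x1 @ w_t                          # shift, may be any function of x1
    y2 = x2 * np.exp(s) + t
    log_det = s.sum(axis=1)               # triangular Jacobian: sum of log-scales
    return np.concatenate([x1, y2], axis=1), log_det


def coupling_inverse(y, w_s, w_t):
    """Exact inverse: recompute s and t from the untouched half, then undo the affine map."""
    y1, y2 = np.split(y, 2, axis=1)
    s = np.tanh(y1 @ w_s)
    t = y1 @ w_t
    return np.concatenate([y1, (y2 - t) * np.exp(-s)], axis=1)


rng = np.random.default_rng(0)
w_s, w_t = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
x = rng.normal(size=(5, 4))
y, log_det = coupling_forward(x, w_s, w_t)
print("max reconstruction error:", np.abs(coupling_inverse(y, w_s, w_t) - x).max())
```

The price of the cheap log-determinant is visible in the code: half of the input passes through each layer untouched, which is the architectural restriction the paragraph above describes.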

Q4: What is FID and why is it the standard evaluation metric for generative models?

FID (Fréchet Inception Distance) measures the distance between the distributions of real and generated images in InceptionV3 feature space. Formally: $\text{FID} = \|\mu_r - \mu_g\|^2 + \text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$ where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are statistics of Inception features on real and generated images respectively. Lower FID = more similar to real images.

FID became the standard because: (1) it correlates with human quality judgments better than pixel-level metrics; (2) it captures both quality ($\|\mu_r-\mu_g\|^2$: mean shift) and diversity ($\text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$: covariance mismatch) in one number; (3) it is fast to compute (extract features, compute statistics). Limitations: it depends on InceptionV3 trained on ImageNet - may not transfer to medical, scientific, or artistic domains; it can be gamed; precision and recall provide more interpretable decompositions.
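As a quick check on the formula, consider the one-dimensional case (an illustrative simplification, since real Inception features are 2048-dimensional), where the covariances reduce to scalar variances $\sigma_r^2$ and $\sigma_g^2$:

$$\text{FID}_{1\text{D}} = (\mu_r - \mu_g)^2 + \sigma_r^2 + \sigma_g^2 - 2\sigma_r\sigma_g = (\mu_r - \mu_g)^2 + (\sigma_r - \sigma_g)^2$$

For example, with $\mu_r = 0$, $\sigma_r = 1$ and $\mu_g = 0.5$, $\sigma_g = 2$, this gives $0.25 + 1 = 1.25$. The score is zero only when both the mean and the spread match.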

Q5: What is the difference between explicit and implicit density models? Where does diffusion sit?

Explicit density models directly define $p_\theta(x)$ and can evaluate it, either exactly or through a bound. Examples: VAEs (via ELBO lower bound), normalizing flows (exact), autoregressive models (product of conditionals). You can compute $\log p_\theta(x)$, or a bound on it, for any test point - essential for anomaly detection, model selection, and Bayesian inference.

Implicit density models define a sampling procedure without computing $p_\theta(x)$. Example: GANs - the generator defines a deterministic map from noise to data with no tractable density. No likelihood computation possible.

Diffusion models sit between: they have a principled ELBO (approximate log-likelihood lower bound) computed via the sum of denoising losses across all timesteps, so they support approximate density estimation. But they are not as clean as flows for likelihood tasks. They are "explicit-ish" - better than GANs for density estimation, better than flows for sample quality, making them the best-known compromise on the quality-likelihood-stability-diversity Pareto frontier.

Computing FID in Practice

Understanding FID conceptually is not the same as computing it correctly. Here is a minimal implementation of the standard formula:

```python
import numpy as np
import torch
import torch.nn as nn
from scipy import linalg


def compute_fid(real_features: np.ndarray, gen_features: np.ndarray) -> float:
    """
    Compute FID between real and generated image feature distributions.

    Args:
        real_features: [n_real, D] InceptionV3 features for real images
        gen_features: [n_gen, D] InceptionV3 features for generated images

    Returns:
        FID score (lower = better; real vs real ≈ 0 given enough samples)

    Formula:
        FID = ‖μ_r - μ_g‖² + tr(Σ_r + Σ_g - 2·(Σ_r·Σ_g)^{1/2})
    """
    mu_r = real_features.mean(axis=0)
    mu_g = gen_features.mean(axis=0)
    sigma_r = np.cov(real_features, rowvar=False)
    sigma_g = np.cov(gen_features, rowvar=False)

    # Squared mean difference
    mean_diff_sq = np.sum((mu_r - mu_g) ** 2)

    # Matrix square root (Σ_r · Σ_g)^{1/2} via scipy's general-purpose sqrtm
    covmean = linalg.sqrtm(sigma_r @ sigma_g)

    # Handle numerical issues (sqrtm can produce tiny imaginary parts)
    if np.iscomplexobj(covmean):
        if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
            raise ValueError("sqrtm produced significant imaginary component")
        covmean = covmean.real

    # Trace term
    trace_term = np.trace(sigma_r) + np.trace(sigma_g) - 2 * np.trace(covmean)

    return float(mean_diff_sq + trace_term)


def extract_inception_features(images: torch.Tensor,
                               device: str = "cpu") -> np.ndarray:
    """
    Extract InceptionV3 pool3 features (2048-dim) for FID computation.
    images: [n, 3, H, W] in range [0, 1] (resized to 299×299 below)

    Note: published FID numbers use specific reference Inception weights and
    preprocessing; scores from torchvision's weights are only comparable to
    other scores computed the same way.
    """
    from torchvision.models import inception_v3, Inception_V3_Weights

    inception = inception_v3(weights=Inception_V3_Weights.DEFAULT,
                             transform_input=False).to(device)
    inception.eval()
    inception.fc = nn.Identity()  # Remove final classifier → 2048-dim features

    features = []
    batch_size = 32
    with torch.no_grad():
        for i in range(0, len(images), batch_size):
            batch = images[i:i + batch_size].to(device)
            # InceptionV3 needs 299×299 input
            batch = nn.functional.interpolate(batch, size=(299, 299),
                                              mode="bilinear",
                                              align_corners=False)
            feat = inception(batch)
            features.append(feat.cpu().numpy())

    return np.concatenate(features, axis=0)


# ── Sanity check: FID of real vs real should be close to 0 ────────────────────
np.random.seed(42)
# Simulated stand-in features. A small feature dimension keeps the sample
# covariances full rank; with real 2048-dim features you need n >> 2048 samples.
n, dim = 10_000, 64
real_feat_A = np.random.randn(n, dim) * 10 + 5  # real images, subset A
real_feat_B = np.random.randn(n, dim) * 10 + 5  # real images, subset B
gen_feat = np.random.randn(n, dim) * 12 + 3     # generated images

fid_real_vs_real = compute_fid(real_feat_A, real_feat_B)
fid_real_vs_gen = compute_fid(real_feat_A, gen_feat)

print(f"FID (real vs real, different subsets): {fid_real_vs_real:.2f} (expect small; shrinks toward 0 as n grows)")
print(f"FID (real vs generated): {fid_real_vs_gen:.2f} (expect much larger)")
print("\nFID pitfalls:")
print(" - Need ≥10K samples for stable estimates (high variance with n<5K)")
print(" - Must use the same InceptionV3 weights and preprocessing as reference")
print(" - Domain mismatch: FID on medical images is not comparable to ImageNet FID")
```

The Sampling Speed Problem and Its Solutions

Diffusion's 1000-step sampling is a production bottleneck. Four main solutions have emerged:

DDIM (Denoising Diffusion Implicit Models, Song et al. 2020): replaces the stochastic reverse process with a deterministic ODE integration. The same noise predictor is reused at fewer steps (20–50) by taking larger ODE steps. No retraining required - it is a different sampling algorithm for the same model. Quality degrades gracefully with fewer steps.
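The deterministic DDIM update can be sketched as below. This is a toy illustration under stated assumptions: a linear beta schedule, and an "oracle" noise predictor that returns the true noise so the result can be checked exactly; `ddim_step` is a hypothetical helper name, and a real sampler would call a trained network for `eps_pred` at every step.

```python
import numpy as np

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))   # assumed linear schedule


def ddim_step(x_t, eps_pred, t, t_prev):
    """Deterministic DDIM update (eta = 0) from timestep t to an earlier t_prev."""
    a_t = alpha_bar[t]
    a_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    # Clean sample implied by the current noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)
    # Re-noise the estimate to the (lower) noise level of t_prev.
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps_pred


rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))
eps = rng.normal(size=(4, 8))
x = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * eps   # x_T

# 20 evenly spaced timesteps instead of 1000. With an oracle that returns the
# true noise, the trajectory should land on x0 up to rounding error.
steps = np.linspace(T - 1, 0, 20).round().astype(int)
for t, t_prev in zip(steps, list(steps[1:]) + [-1]):
    x = ddim_step(x, eps, t, t_prev)
print("max |x - x0|:", np.abs(x - x0).max())
```

Nothing in the update depends on consecutive timesteps, which is why the step schedule can be thinned without retraining.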

DPM-Solver (Lu et al. 2022): a higher-order ODE solver specifically designed for diffusion models. Achieves high-quality samples in 10–20 steps by exploiting the semi-linear structure of the diffusion ODE (the linear part has an exact solution, only the nonlinear noise network output needs to be approximated).

Consistency Models (Song et al. 2023): train a model to directly map any noisy $x_t$ to the clean $x_0$, maintaining consistency - i.e., $f(x_t, t) = f(x_{t'}, t')$ for all $t, t'$ on the same trajectory. Achieves 1–4 step generation. Can be distilled from a pretrained diffusion model or trained from scratch.

Latent Diffusion (Rombach et al. 2022): compress images to a small latent space (for example, 8× downsampling per side, 64× fewer spatial positions) using a VAE encoder, run diffusion in that latent space, then decode. For a 512×512 RGB image with 4 latent channels, each denoising step then operates on 64×64×4 ≈ 16 thousand latent values instead of 512×512×3 ≈ 786 thousand pixel values - roughly 48× fewer values per step. This is the architecture underlying Stable Diffusion and many production text-to-image systems.
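To make the size reduction concrete, here is the arithmetic for one common configuration, assumed here for illustration: a 512×512 RGB image, a VAE that downsamples 8× per side, and 4 latent channels. The helper `num_values` is a hypothetical name.

```python
def num_values(height: int, width: int, channels: int) -> int:
    """Number of scalar values in a [channels, height, width] tensor."""
    return height * width * channels


pixel_values = num_values(512, 512, 3)              # RGB image in pixel space
latent_values = num_values(512 // 8, 512 // 8, 4)   # 8x downsampling, 4 channels (assumed)
spatial_ratio = (512 * 512) // (64 * 64)            # reduction in spatial positions

print("pixel values: ", pixel_values)
print("latent values:", latent_values)
print("value ratio:  ", pixel_values / latent_values)
print("spatial ratio:", spatial_ratio)
```

Actual compute savings per step depend on the denoiser architecture, since convolution cost scales with spatial positions while self-attention scales with their square.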

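| Method | Steps | Relative Speed | Quality Drop |
|--------|-------|----------------|--------------|
| DDPM | 1000 | 1× (baseline) | - |
| DDIM | 50 | 20× | Minimal |
| DDIM | 20 | 50× | Small |
| DPM-Solver | 15 | ~67× | Small |
| Consistency (distilled) | 4 | 250× | Moderate |
| Consistency (1-step) | 1 | 1000× | Significant |
| Latent Diffusion + DDIM-50 | 50 | 20× in steps, plus a cheaper step | - |

Relative speed here counts denoising steps only (1000 / steps). Latent diffusion additionally makes each step cheaper, so its end-to-end gain depends on resolution, architecture, and hardware.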

Conditional Generation - How Each Model Family Handles It

All production generative models need conditional generation: $p(x|c)$ where $c$ is a conditioning signal (text prompt, class label, reference image). Each architecture handles conditioning differently.

VAE: append $c$ to the latent $z$ (conditional VAE, CVAE) or condition both encoder and decoder on $c$. Works well for simple conditioning (class labels); poor for complex text prompts.

GAN: concatenate condition $c$ to both generator input and discriminator input (conditional GAN, cGAN). Works for class-conditional generation; struggles with complex text due to training instability. StyleGAN-based approaches use adaptive instance normalization (AdaIN) to inject conditioning.

Normalizing Flow: condition the affine coupling layers on $c$ (conditional flow). Works but the bijective constraint makes powerful conditioning difficult.

Diffusion: classifier guidance (Dhariwal and Nichol, 2021) - train a classifier $p_\phi(c|x_t)$ on noisy inputs and add its gradient to the score: $\tilde{s}_\theta(x_t, t) = s_\theta(x_t, t) + w\nabla_{x_t}\log p_\phi(c|x_t)$. This steers sampling toward $c$ with guidance scale $w$. Higher $w$ = stronger conditioning at cost of diversity.
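The guidance formula follows from Bayes' rule, and a toy case makes that visible. The sketch below assumes a one-dimensional two-component Gaussian mixture with analytic score and analytic classifier gradient, and ignores the noise level $t$ for simplicity; all function names are illustrative.

```python
import numpy as np


def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))


def mixture_score(x):
    """Gradient of log p(x) for p(x) = 0.5*N(-2, 1) + 0.5*N(+2, 1)."""
    r = sigmoid(4.0 * x)                  # posterior weight of the +2 component
    return -x + 4.0 * r - 2.0


def classifier_grad(x):
    """Gradient of log p(c=1 | x), where class 1 is the +2 component."""
    return 4.0 * (1.0 - sigmoid(4.0 * x))


def guided_score(x, w):
    """Unconditional score plus w times the classifier gradient."""
    return mixture_score(x) + w * classifier_grad(x)


x = np.linspace(-3.0, 3.0, 7)
print(guided_score(x, w=1.0))   # matches the score of N(+2, 1), which is 2 - x
print(guided_score(x, w=3.0))   # pushes harder toward the class-1 mode
```

With $w = 1$ the guided score equals the exact conditional score; larger $w$ sharpens the distribution around the target class, which is where the loss of diversity comes from.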

Classifier-Free Guidance (CFG) (Ho and Salimans, 2022): the dominant approach today. Train a single model that can be conditioned or unconditioned (randomly drop $c$ during training). At inference, extrapolate between conditioned and unconditioned predictions:

$$\tilde{\varepsilon}_\theta(x_t, t, c) = \varepsilon_\theta(x_t, t, \emptyset) + w \cdot (\varepsilon_\theta(x_t, t, c) - \varepsilon_\theta(x_t, t, \emptyset))$$

At $w = 0$: unconditioned. At $w = 1$: standard conditioned. At $w > 1$ (e.g., 7.5 in Stable Diffusion): extrapolated past the conditional prediction - stronger adherence to the text prompt at the cost of diversity. CFG is a major reason Stable Diffusion follows text prompts well - it is the primary mechanism for text-to-image control.

```python
import torch


def classifier_free_guidance_step(
    noise_pred_cond: torch.Tensor,
    noise_pred_uncond: torch.Tensor,
    guidance_scale: float = 7.5,
) -> torch.Tensor:
    """
    Classifier-Free Guidance (CFG) - the mechanism behind Stable Diffusion prompt adherence.

    CFG interpolates (w < 1) or extrapolates (w > 1) between the unconditioned
    and conditioned predictions.
        guidance_scale = 0.0: unconditioned prediction
        guidance_scale = 1.0: no guidance (standard conditional)
        guidance_scale = 7.5: strong guidance (Stable Diffusion default)
        guidance_scale > 15: often over-saturated, low diversity

    Args:
        noise_pred_cond: εθ(x_t, t, c) noise prediction with condition c
        noise_pred_uncond: εθ(x_t, t, ∅) noise prediction without condition
        guidance_scale: w - guidance strength
    """
    # uncond + w·(cond - uncond) = (1 - w)·uncond + w·cond
    #   w = 1 → standard conditional prediction
    #   w > 1 → extrapolates past the conditional prediction
    return noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)


def cfg_sampling_step(
    model,           # noise predictor εθ(x_t, t, c)
    x_t,             # current noisy image
    t,               # current timestep
    condition,       # text embeddings or class label
    guidance_scale,  # w
    scheduler,       # DDPM or DDIM scheduler
):
    """
    One CFG denoising step - used in every SD inference call.
    Two forward passes per step: once conditioned, once unconditioned.
    This doubles the compute cost of sampling vs unconditioned.
    """
    # Two forward passes: conditioned and unconditioned.
    # `model` is assumed to return the noise tensor directly. Real pipelines
    # usually pass a "null prompt" embedding for the unconditioned branch
    # rather than None, and often batch both passes together.
    noise_pred_cond = model(x_t, t, encoder_hidden_states=condition)
    noise_pred_uncond = model(x_t, t, encoder_hidden_states=None)

    # Apply CFG
    noise_pred = classifier_free_guidance_step(
        noise_pred_cond, noise_pred_uncond, guidance_scale
    )

    # Update x_t using the scheduler (DDPM, DDIM, DPM-Solver, etc.)
    x_prev = scheduler.step(noise_pred, t, x_t).prev_sample
    return x_prev


print("Classifier-Free Guidance:")
print(" Training: randomly drop condition c with probability p_uncond (e.g., 0.1)")
print(" Inference: two forward passes per step, extrapolate with guidance_scale w")
print(" guidance_scale = 1.0 → pure conditional (follows prompt loosely)")
print(" guidance_scale = 7.5 → Stable Diffusion default (strong prompt adherence)")
print(" guidance_scale > 15 → over-saturation, reduced diversity")
print(" Cost: 2× forward passes vs unconditioned sampling")
```

## Architecture Evolution - From U-Net to Transformer

The original DDPM used a **U-Net** architecture: a convolutional encoder-decoder with skip connections, residual blocks, and multi-head self-attention at the lowest spatial resolution. This architecture encodes strong inductive biases for image data: spatial locality, translation equivariance, multi-scale processing.

**DiT (Diffusion Transformer, Peebles and Xie, 2022)** replaced the U-Net with a **Vision Transformer** operating on patches of the noisy latent - patchifying the input, processing with standard transformer blocks, then unpatchifying to predict the noise. DiT-XL/2 achieved FID 2.27 on ImageNet 256×256 class-conditional, surpassing prior diffusion models on that benchmark. Transformers scale more predictably than U-Nets: in the DiT experiments, models with more compute (Gflops) reliably produced lower FID.
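The patchify and unpatchify steps are pure reshapes, sketched here in NumPy. The shapes assume a 4-channel 32×32 latent and patch size 2, matching the "/2" in DiT-XL/2; the function names are illustrative, and a real model would follow `patchify` with a learned linear embedding and positional encodings.

```python
import numpy as np


def patchify(x: np.ndarray, p: int) -> np.ndarray:
    """[C, H, W] -> [num_patches, p*p*C] sequence of flattened patches."""
    c, h, w = x.shape
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)                  # [H/p, W/p, p, p, C]
    return x.reshape((h // p) * (w // p), p * p * c)


def unpatchify(tokens: np.ndarray, p: int, c: int, h: int, w: int) -> np.ndarray:
    """Inverse of patchify: [num_patches, p*p*C] -> [C, H, W]."""
    x = tokens.reshape(h // p, w // p, p, p, c)
    x = x.transpose(4, 0, 2, 1, 3)                  # [C, H/p, p, W/p, p]
    return x.reshape(c, h, w)


latent = np.random.default_rng(0).normal(size=(4, 32, 32))
tokens = patchify(latent, p=2)
print("token sequence shape:", tokens.shape)        # 256 tokens of dimension 16
restored = unpatchify(tokens, p=2, c=4, h=32, w=32)
print("round trip exact:", np.array_equal(restored, latent))
```

Halving the patch size quadruples the sequence length, which is the main lever trading compute for quality in this design.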

The shift from U-Net to Transformer mirrors the broader shift in ML from CNNs to Transformers: transformers are more general (no spatial locality inductive bias), more scalable, and more amenable to conditioning via cross-attention (text prompts are integrated naturally as key-value pairs in cross-attention layers).
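A single-head cross-attention layer can be sketched as follows, to show how text enters as keys and values while image tokens supply the queries. This is a minimal NumPy illustration with random projection matrices and made-up dimensions; it omits multi-head splitting, output projections, and residual connections.

```python
import numpy as np


def softmax(a: np.ndarray, axis: int = -1) -> np.ndarray:
    a = a - a.max(axis=axis, keepdims=True)         # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Queries come from image tokens; keys and values come from text tokens."""
    q = image_tokens @ w_q                          # [n_img, d]
    k = text_tokens @ w_k                           # [n_txt, d]
    v = text_tokens @ w_v                           # [n_txt, d]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # [n_img, n_txt]
    return attn @ v, attn


rng = np.random.default_rng(0)
img = rng.normal(size=(16, 8))                      # 16 image tokens, width 8 (assumed)
txt = rng.normal(size=(5, 12))                      # 5 text tokens, width 12 (assumed)
w_q = rng.normal(size=(8, 4))
w_k = rng.normal(size=(12, 4))
w_v = rng.normal(size=(12, 4))
out, attn = cross_attention(img, txt, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Each image token receives a weighted mix of text-token values, so the prompt can influence every spatial location without changing the image token count.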

## When Each Model Class Still Wins in Production (2024 Reference)

Despite diffusion's dominance in image generation, each model family remains the right tool for specific production problems.

**GAN use cases in 2024**: (1) Real-time face reenactment and video avatars - GAN inference in one pass makes real-time (<100ms) generation feasible. (2) Image super-resolution (ESRGAN, RealESRGAN) - GAN perceptual loss gives sharper textures than diffusion-based SR for some use cases. (3) Neural radiance field (NeRF) compositing - integrating GAN-generated textures into 3D scenes.

**VAE use cases in 2024**: (1) As the compression backbone in Latent Diffusion - the VAE encoder/decoder converts between pixel space and latent space. (2) Drug discovery - VAEs on molecular graphs (junction tree VAE) for constrained chemical space exploration. (3) Any application needing a structured, smooth latent space for interpolation or attribute manipulation.

**Normalizing flow use cases in 2024**: (1) Density estimation for anomaly detection where exact log-likelihood is required. (2) Scientific applications (molecular dynamics, Boltzmann generators) where the bijective structure matches physical conservation laws. (3) Variational inference - flows as expressive approximate posteriors in Bayesian models.

**Autoregressive model use cases (not covered yet)**: (1) Text generation - GPT family. (2) Image generation over discrete tokens - a VQ-VAE-style tokenizer plus a transformer (the approach used in DALL-E 1). (3) Audio (AudioLM for generation; Whisper decodes speech transcripts autoregressively). The autoregressive family achieves exact likelihood via the chain rule of probability: $p(x) = \prod_i p(x_i | x_{<i})$.
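The chain-rule factorization is easy to evaluate directly. The toy below assumes a first-order model over a 3-symbol vocabulary, where each symbol depends only on the previous one (a special case of conditioning on the full prefix); the probability tables are made up for illustration.

```python
import numpy as np

# p(x) = p(x_1) * prod_i p(x_i | x_{i-1}) for a toy 3-symbol vocabulary.
initial = np.array([0.5, 0.3, 0.2])
transition = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4]])    # rows: previous symbol, columns: next symbol


def sequence_log_prob(seq) -> float:
    """Exact log-likelihood as a sum of conditional log-probabilities."""
    logp = np.log(initial[seq[0]])
    for prev, cur in zip(seq[:-1], seq[1:]):
        logp += np.log(transition[prev, cur])
    return float(logp)


print(sequence_log_prob([0, 0, 1, 2]))      # log(0.5 * 0.7 * 0.2 * 0.1)
```

Sampling uses the same tables one symbol at a time, which is why autoregressive generation cost grows with sequence length.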

## The Research Frontier (2024–2025)

**Flow Matching** (Lipman et al., 2022; Liu et al., 2022): instead of learning to reverse a specific noising SDE, flow matching directly learns a velocity field that transports noise to data along simple paths. Training objective: $\mathcal{L}_{FM} = \mathbb{E}_{t,x_0,x_1}\!\left[\|v_\theta(x_t, t) - (x_1 - x_0)\|^2\right]$ where $x_t = (1-t)x_0 + tx_1$ is a straight-line interpolation from noise $x_0$ to data $x_1$. Flow matching uses a simple simulation-free regression objective, encourages straighter trajectories (fewer ODE steps), and has reported competitive or state-of-the-art FID. Meta's Voicebox (audio), Stable Diffusion 3, and FLUX all use flow matching.
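The objective can be estimated by Monte Carlo in a few lines. The sketch below assumes a degenerate toy dataset with a single data point at (3, 3), chosen because the exact velocity field is then known in closed form; `flow_matching_loss`, `exact_v`, and `zero_v` are illustrative names, and a real model would be a trained network.

```python
import numpy as np


def flow_matching_loss(v_model, x1, rng) -> float:
    """Monte Carlo estimate of E||v(x_t, t) - (x1 - x0)||^2 on straight-line paths."""
    x0 = rng.normal(size=x1.shape)                  # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))          # one time per sample, in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1                   # linear interpolation
    target = x1 - x0                                # constant velocity along the path
    return float(np.mean(np.sum((v_model(x_t, t) - target) ** 2, axis=1)))


rng = np.random.default_rng(0)
data = np.full((2048, 2), 3.0)                      # toy dataset: one point at (3, 3)

# With a single data point, x1 - x0 = (x1 - x_t) / (1 - t), so the exact
# velocity field is known; a zero field is a deliberately bad baseline.
exact_v = lambda x_t, t: (3.0 - x_t) / (1.0 - t)
zero_v = lambda x_t, t: np.zeros_like(x_t)
print("exact field loss:", flow_matching_loss(exact_v, data, rng))
print("zero field loss: ", flow_matching_loss(zero_v, data, rng))
```

With more than one data point the regression target is no longer a deterministic function of $(x_t, t)$, so the minimum achievable loss is positive and the learned field is the conditional average of the targets.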

**Consistency Models** (Song et al., 2023): train a model $f_\theta(x_t, t) \approx x_0$ that is consistent along the probability flow ODE - i.e., all points on the same trajectory map to the same $x_0$. Achieves 1–4 step generation via direct distillation from a pretrained diffusion model, with only a moderate FID penalty.

**Video diffusion and 3D**: DiT-based architectures extended to video (Sora, OpenAI 2024), 3D generation (DreamFusion, Point-E), and motion generation (MDM). The same mathematical framework generalizes: the forward process corrupts video frames or 3D point clouds, and the reverse process denoises them.

The unifying trend: every new architecture is a special case of the SDE/flow matching framework. The mathematical tools from this lesson - score functions, Langevin dynamics, probability flow ODEs - remain central regardless of the specific model variant.

## Quick Reference - Generative Model Trade-offs

| Property | VAE | GAN | Flow | EBM | Diffusion |
|----------|-----|-----|------|-----|-----------|
| Training objective | ELBO (approx. MLE) | Minimax adversarial | Exact MLE | Contrastive divergence | MSE noise prediction |
| Likelihood | Approx. lower bound | None | Exact | Intractable Z | Approx. lower bound |
| Sample quality | Moderate (blur) | Excellent | Moderate | Good | Excellent |
| Sample diversity | Good | Poor–Moderate | Good | Good | Excellent |
| Training stability | Excellent | Poor | Good | Poor | Excellent |
| Sampling speed | Fast (1 pass) | Fast (1 pass) | Fast | MCMC (slow) | Slow (T passes) |
| Conditional gen. | CVAE | cGAN | Cond. flow | - | CFG (dominant) |
| Likelihood for anomaly | Approximate | No | Yes | No | Approximate |
| Scales to 1024px? | No (blurry) | Yes (StyleGAN) | Barely | No | Yes (LDM) |
| Production usage | VAE backbone | Real-time | Scientific | Research | Text-to-image |

This table is the one-page summary of what takes researchers years of papers to establish. Each cell represents a design decision, a theoretical result, and a set of engineering constraints. When an interviewer asks "compare generative models," this is the target answer - precise, multi-dimensional, and grounded in why each property holds rather than just what it is.

```mermaid
flowchart LR
A["DDPM 2020<br/>U-Net + MSE<br/>1000 steps, 3.17 FID CIFAR-10"]:::blue
B["DDIM 2020<br/>ODE sampling<br/>20-50 steps, same model"]:::green
C["ADM + Guidance 2021<br/>U-Net + classifier guidance<br/>4.59 FID ImageNet"]:::teal
D["Latent Diffusion 2022<br/>VAE compression + U-Net<br/>Stable Diffusion"]:::indigo
E["DiT-XL/2 2022<br/>Vision Transformer<br/>2.27 FID ImageNet"]:::purple
F["CFG + DiT 2023+<br/>Text-conditioned Transformer<br/>Production systems"]:::orange

A --> B --> C --> D
C --> E --> F

classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
classDef indigo fill:#e0e7ff,color:#3730a3,stroke:#6366f1
classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
```

:::tip 🎮 Interactive Playground

**Visualize this concept:** Try the **[VAE vs GAN vs Diffusion](/playground/generative-comparison)** demo on the EngineersOfAI Playground - no code required.

:::