Generative Models Overview - VAEs, GANs, Flow Models, and Diffusion
:::note Reading time: ~45 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Engineer, Applied Scientist :::
The Real Interview Moment
You are interviewing at Stability AI. The interviewer slides a whiteboard marker across the table. "Walk me through every major approach to generative modeling. For each one, tell me what it optimizes, where it fails, and why diffusion models ultimately beat them all."
Most candidates give a surface-level tour - "GANs use a discriminator, VAEs use an encoder-decoder structure" - and never explain why the community spent a decade wrestling with mode collapse, blurry outputs, and training instability before diffusion models arrived. The interviewer has heard that surface-level answer forty times this month.
What wins the room is a unified framework. Every generative model answers the same question: how do we learn a probability distribution over high-dimensional data from finite samples, and then sample novel examples from it? Each architecture makes a different bet about what to optimize and what to sacrifice - between tractable likelihood, sample quality, diversity, training stability, and computational cost. Understanding those trade-offs at a mathematical level, not just an intuitive one, is what separates an ML practitioner from an ML engineer.
By the end of 2022, one approach had cracked a four-way trade-off that had stymied the field for a decade. Midjourney launched. Stable Diffusion went open source. DALL-E 2 made text-to-image generation mainstream. Within 18 months of DALL-E 2's release, Midjourney was processing over 15 million image generations per day. But to understand why diffusion won, you need to understand what each earlier approach tried - and precisely where it broke.
Why This Exists - The Generative Modeling Problem
Machine learning began with discriminative models - classifiers and regressors that learn $p(y \mid x)$. Generative models go further: they learn the full distribution $p(x)$ over inputs.
This unlocks three capabilities discriminative models cannot provide:
- Sampling: Draw $x \sim p(x)$ - generate novel images, molecules, audio, protein sequences.
- Density estimation: Evaluate $p(x)$ for any $x$ - detect anomalies, assess whether a test sample is in-distribution.
- Conditional generation: Sample $x \sim p(x \mid c)$ - generate an image matching a text prompt, complete a partial molecule, inpaint a masked region.
The challenge: a $256 \times 256$ RGB image is a point in $\mathbb{R}^{196{,}608}$. The true data distribution occupies an extremely thin manifold embedded in that space - the manifold hypothesis. Most points in $\mathbb{R}^{196{,}608}$ are pure noise. The manifold of valid natural images is vastly lower-dimensional than the ambient space. Learning to model that manifold accurately, from finite samples, without explicit access to $p(x)$, is the core challenge.
Four major families of approaches tackle it differently. Each makes a different bet.
Historical Context
Generative modeling predates deep learning. Gaussian mixture models, hidden Markov models, and restricted Boltzmann machines were studied for decades. The modern deep generative era began in 2013–2014. Variational Autoencoders (Kingma and Welling, 2013) and Generative Adversarial Networks (Goodfellow et al., 2014) appeared within months of each other, defining two radically different paradigms. Normalizing flows were formalized by Rezende and Mohamed (2015) and Dinh et al. (2016). Diffusion models were introduced by Sohl-Dickstein et al. (2015), and score-based models and DDPM made the approach competitive in 2019–2020 (Song and Ermon; Ho et al.). DALL-E 2 (Ramesh et al., 2022) and Stable Diffusion (Rombach et al., 2022) demonstrated that diffusion had decisively surpassed GANs on the metrics that matter for production applications.
1. Latent Variable Models and the Intractability Problem
All probabilistic generative models share a common structure. Define a latent variable $z$ with a simple prior $p(z)$, and a conditional generative model $p_\theta(x \mid z)$ (a neural network). The marginal data likelihood is:
$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$
The integral is over all possible latent configurations - it is intractable to compute for any reasonably expressive $p_\theta(x \mid z)$. This intractability is the central problem that each generative model family resolves differently.
- VAEs: approximate the integral via a lower bound (ELBO), introducing an encoder network.
- GANs: bypass the integral entirely - never compute $p_\theta(x)$, just sample.
- Flows: restrict the model architecture so the integral becomes a simple formula.
- Diffusion: use many small steps where each conditional is tractable.
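To make the intractability concrete, the sketch below estimates the marginal likelihood by naive Monte Carlo on a toy one-dimensional model where the exact answer is known. The model (standard normal prior, Gaussian likelihood with an illustrative noise scale `sigma`) and all function names are assumptions chosen for the example, not part of any library.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def mc_marginal_likelihood(x, sigma, n_samples, seed=0):
    """Estimate p(x) = E_{z~p(z)}[p(x|z)] with z ~ N(0, 1) and x|z ~ N(z, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)            # sample a latent from the prior
        total += gaussian_pdf(x, z, sigma)  # weight by the likelihood
    return total / n_samples

def exact_marginal_likelihood(x, sigma):
    # For this linear-Gaussian toy model the integral has a closed form:
    # p(x) = N(x; 0, 1 + sigma^2)
    return gaussian_pdf(x, 0.0, math.sqrt(1.0 + sigma ** 2))

estimate = mc_marginal_likelihood(x=1.0, sigma=0.5, n_samples=50_000)
exact = exact_marginal_likelihood(x=1.0, sigma=0.5)
print(f"Monte Carlo: {estimate:.4f}  exact: {exact:.4f}")
```

Prior sampling works in one dimension. In high dimensions almost no prior sample lands where $p_\theta(x \mid z)$ is non-negligible, so the estimator's variance explodes, which is why each family above replaces it with something else.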
2. Variational Autoencoders (VAEs)
The ELBO and the Reparameterization Trick
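VAEs (Kingma and Welling, 2013) introduce an approximate posterior $q_\phi(z \mid x)$ (the encoder network) and derive a tractable lower bound on the log-likelihood:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
This is the Evidence Lower BOund (ELBO). The gap between $\log p_\theta(x)$ and the ELBO is exactly $D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)$ - the KL between the approximate and true posterior. Maximizing the ELBO jointly over $\theta$ (decoder) and $\phi$ (encoder) simultaneously improves the generative model and the inference network.
The reparameterization trick makes this end-to-end differentiable. Rather than sampling $z \sim q_\phi(z \mid x)$ directly (non-differentiable), write:
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
The randomness is moved outside the computational graph. Gradients flow through $\mu_\phi$ and $\sigma_\phi$ as normal.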
Why VAEs Produce Blurry Samples
The reconstruction term is typically pixel-wise MSE or binary cross-entropy - treating each pixel as independent. When the model is uncertain about the exact value of a pixel (e.g., whether the cat's ear should be at position 47 or 49), it hedges and predicts the average. Averaged across millions of pixels, across the uncertainty in $z$, across the variability in the training set - the result is systematic blurriness. The gap between the ELBO and the true log-likelihood also means the model cannot perfectly trade off reconstruction quality against the prior.
Posterior collapse: if the decoder becomes powerful enough to reconstruct $x$ without using $z$ at all, the KL term drives $q_\phi(z \mid x) \to p(z)$, and the latent space becomes useless. The model degenerates into a pure decoder with an empty latent space.
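The KL term that drives posterior collapse has a closed form for a diagonal Gaussian encoder and a standard normal prior. The sketch below evaluates it per latent dimension; the function name and the example values are illustrative assumptions, and the formula is algebraically the same one used in the `elbo_loss` method in the code section of this lesson.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ) for a single latent dimension, in nats."""
    return 0.5 * (mu ** 2 + math.exp(log_var) - log_var - 1.0)

# An informative posterior (shifted mean, small variance) pays a KL cost...
kl_informative = kl_to_standard_normal(mu=1.5, log_var=math.log(0.1))
# ...while a collapsed posterior that equals the prior pays none.
kl_collapsed = kl_to_standard_normal(mu=0.0, log_var=0.0)

print(f"informative: {kl_informative:.3f} nats, collapsed: {kl_collapsed:.3f} nats")
```

Because the collapsed posterior is free under the KL term, a sufficiently strong decoder has no incentive to keep information in $z$ unless the reconstruction term rewards it.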
VAE Properties
| Property | VAE |
|---|---|
| Training objective | Maximize ELBO |
| Likelihood | Lower bound (approximate) |
| Sample quality | Moderate (blurry) |
| Diversity | Good - covers full distribution |
| Training stability | Excellent - standard SGD |
| Sampling speed | Fast - one decoder pass |
3. Generative Adversarial Networks (GANs)
The Minimax Game and Nash Equilibrium
GANs (Goodfellow et al., 2014) take a fundamentally different approach: implicit density estimation via an adversarial game. The generator $G$ maps noise $z \sim p(z)$ to data space. The discriminator $D$ tries to distinguish real from generated samples:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right]$$
At the Nash equilibrium: $G$ produces samples from $p_{\text{data}}$, and $D$ outputs $1/2$ everywhere (cannot distinguish real from fake). The generator never explicitly computes $p(x)$ - it learns a deterministic mapping from noise to data that fools the discriminator.
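The equilibrium claim can be checked numerically on a small discrete example. The sketch below uses the standard result that, for a fixed generator, the best discriminator is $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x))$. The three-outcome distributions and function names are illustrative assumptions, and every probability is assumed strictly positive so the logarithms are defined.

```python
import math

def optimal_discriminator(p_data, p_gen):
    """D*(x) = p_data(x) / (p_data(x) + p_gen(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_gen)]

def value_at_optimal_d(p_data, p_gen):
    """V(D*, G) = E_data[log D*(x)] + E_gen[log(1 - D*(x))]."""
    d_star = optimal_discriminator(p_data, p_gen)
    return sum(
        pd * math.log(d) + pg * math.log(1.0 - d)
        for pd, pg, d in zip(p_data, p_gen, d_star)
    )

p_data = [0.5, 0.3, 0.2]
matched = [0.5, 0.3, 0.2]      # generator has matched the data distribution
mismatched = [0.8, 0.1, 0.1]   # generator over-produces the first outcome

d_matched = optimal_discriminator(p_data, matched)
v_matched = value_at_optimal_d(p_data, matched)
v_mismatched = value_at_optimal_d(p_data, mismatched)

print("D* at equilibrium:", d_matched)
print("V at equilibrium:", v_matched, "vs -log 4 =", -math.log(4))
print("V with mismatch:", v_mismatched)
```

When the generator matches the data, the discriminator outputs 0.5 everywhere and the value equals $-\log 4$. Any mismatch raises the value above that floor, and that excess is the signal the generator is trained to remove.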
Why GANs produce sharp images: the discriminator learns a rich perceptual similarity measure. A blurry image is easily distinguishable from a real image - the discriminator penalizes it directly. Unlike pixel-wise MSE, the adversarial loss captures high-frequency texture details. This is why DCGAN (Radford et al., 2015) immediately produced dramatically sharper samples than VAEs of the same era.
The Failure Modes
Mode collapse: the generator finds a small set of outputs that reliably fool the discriminator and stops exploring. In the extreme, the generator maps every $z$ to near-identical outputs. The training dataset might have 1000 distinct categories; the generator collapses to producing images from 10. It is detected only by inspecting diversity - low FID can coexist with mode collapse if the collapsed modes are photorealistic.
Training instability: the minimax game is difficult to optimize. When $D$ is too strong, the generator gradient vanishes (discriminator is saturated). When $D$ is too weak, $G$ receives no useful learning signal. Balancing requires careful architectural choices: spectral normalization, gradient penalty (WGAN-GP), two-timescale learning rates, label smoothing. A failed GAN training run often produces only noise or one mode - and there is no loss curve diagnostic that reliably predicts failure.
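The vanishing-gradient failure can be seen directly from the derivative of the generator's loss with respect to the discriminator's pre-sigmoid output on a fake sample. The sketch below compares the original minimax generator loss with the widely used non-saturating alternative; the function names and the chosen logits are illustrative assumptions.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(logit):
    """d/d(logit) of log(1 - sigmoid(logit)): the original minimax generator loss."""
    return -sigmoid(logit)

def grad_non_saturating(logit):
    """d/d(logit) of -log(sigmoid(logit)): the non-saturating generator loss."""
    return -(1.0 - sigmoid(logit))

# More negative logit = discriminator more confident that the sample is fake.
for logit in (0.0, -4.0, -8.0):
    print(
        f"D(G(z))={sigmoid(logit):.4f}  "
        f"saturating={grad_saturating(logit):+.4f}  "
        f"non-saturating={grad_non_saturating(logit):+.4f}"
    )
```

As the discriminator becomes confident, the saturating gradient shrinks toward zero while the non-saturating one stays near $-1$. The `g_loss` in this lesson's `gan_losses` function uses the non-saturating form, since it applies binary cross-entropy against the "real" label.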
No likelihood: a GAN cannot compute $p(x)$ for a given $x$. The mapping $G: z \mapsto x$ with $\dim(z) \ll \dim(x)$ (100-dim noise to 196K-dim image) is not invertible. No likelihood means no anomaly detection, no model selection by held-out likelihood, no density estimation.
GAN Properties
| Property | GAN |
|---|---|
| Training objective | Minimax adversarial game |
| Likelihood | None (implicit density) |
| Sample quality | Excellent - sharp, photorealistic |
| Diversity | Poor to moderate - mode collapse |
| Training stability | Poor - requires tricks |
| Sampling speed | Extremely fast - one forward pass |
4. Normalizing Flows
Exact Density via Bijective Transforms
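Normalizing flows model $p(x)$ exactly via a bijective transformation $f_\theta$. If $z \sim p_Z(z)$ and $x = f_\theta(z)$, the change-of-variables formula gives:
$$\log p_X(x) = \log p_Z\left(f_\theta^{-1}(x)\right) + \log \left| \det J_{f_\theta^{-1}}(x) \right|$$
where $J_{f_\theta^{-1}}(x) = \partial f_\theta^{-1}(x) / \partial x$ is the Jacobian of the inverse map. Since $f_\theta$ is bijective:
- Sample: draw $z \sim p_Z$, compute $x = f_\theta(z)$ - one forward pass.
- Evaluate exact likelihood: compute $z = f_\theta^{-1}(x)$, plug into the formula - one inverse pass.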
- Train: maximize exact log-likelihood end-to-end.
This is the only family that achieves exact maximum likelihood with efficient sampling - a remarkable theoretical property.
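The change-of-variables formula is easiest to verify on the simplest possible flow. The sketch below uses a one-dimensional affine map, a toy choice made for illustration rather than a practical architecture, and checks the flow's log-density against the Gaussian it must equal.

```python
import math

def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2 * math.pi))

def affine_flow_logpdf(x, scale, shift):
    """Exact log p_X(x) for the flow x = scale * z + shift with z ~ N(0, 1)."""
    z = (x - shift) / scale                  # inverse pass: z = f^{-1}(x)
    log_det_inverse = -math.log(abs(scale))  # log |d f^{-1} / dx|
    return standard_normal_logpdf(z) + log_det_inverse

def gaussian_logpdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

x = 1.7
flow_value = affine_flow_logpdf(x, scale=2.0, shift=0.5)
reference = gaussian_logpdf(x, mean=0.5, std=2.0)  # the flow's output is N(0.5, 2^2)
print(flow_value, reference)
```

Real flows stack many such invertible layers and sum their log-determinants, but each layer contributes to the likelihood in exactly this way.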
The Architectural Constraint
To make $\log |\det J|$ tractable, the transformation must have a structured Jacobian - triangular, block-triangular, or computed via autoregressive structure. This severely limits model expressiveness.
RealNVP (Dinh et al., 2016): coupling layers split dimensions into two halves; one half transforms the other. Jacobian is block-triangular. Efficient in both directions.
Glow (Kingma and Dhariwal, 2018): extends RealNVP with 1×1 invertible convolutions for learnable channel permutations. Produced decent face samples.
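Autoregressive flows (MAF, IAF): condition each dimension on previous ones. Jacobian is triangular - cheap log-det. But the sequential dependency makes one direction slow: MAF evaluates density in a single parallel pass but needs $D$ sequential passes to sample (where $D$ is the data dimensionality), while IAF samples in a single pass but needs $D$ sequential passes to evaluate the density of an arbitrary $x$. Cannot have fast sampling and fast density evaluation simultaneously.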
For high-dimensional images: the bijective constraint is painfully limiting. Glow required enormous model size and still produced samples clearly inferior to BigGAN of the same era - a reminder that exact likelihood and sample quality are distinct objectives. A model can be excellent at density estimation and mediocre at generation.
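The coupling layer used by RealNVP shows how the architectural constraint buys a cheap log-determinant. Below is a minimal two-dimensional sketch; `scale_net` and `shift_net` are hypothetical stand-ins for the neural networks a real model would learn.

```python
import math

def scale_net(x1):
    # Hypothetical stand-in for a learned network; any function of x1 is allowed.
    return math.tanh(x1)

def shift_net(x1):
    # Hypothetical stand-in for a learned network.
    return 0.5 * x1 ** 2

def coupling_forward(x1, x2):
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1). Returns (y1, y2, log|det J|)."""
    s, t = scale_net(x1), shift_net(x1)
    # The Jacobian is triangular with diagonal (1, exp(s)), so log|det J| = s.
    return x1, x2 * math.exp(s) + t, s

def coupling_inverse(y1, y2):
    """The inverse re-evaluates s and t at y1 = x1; the networks are never inverted."""
    s, t = scale_net(y1), shift_net(y1)
    return y1, (y2 - t) * math.exp(-s)

y1, y2, log_det = coupling_forward(0.3, -1.2)
x1, x2 = coupling_inverse(y1, y2)
print((x1, x2), log_det)
```

The scale and shift functions can be arbitrarily complex because only their outputs enter the transform. The price is that half the dimensions pass through each layer unchanged, which is one reason flows need many layers to match less constrained models.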
Flow Properties
| Property | Normalizing Flow |
|---|---|
| Training objective | Exact maximum likelihood |
| Likelihood | Exact |
| Sample quality | Moderate |
| Diversity | Good |
| Training stability | Good |
| Sampling speed | Fast (one pass) or slow (autoregressive) |
5. Energy-Based Models
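Energy-based models (EBMs) define:
$$p_\theta(x) = \frac{\exp\left(-E_\theta(x)\right)}{Z_\theta}, \qquad Z_\theta = \int \exp\left(-E_\theta(x)\right) dx$$
where $E_\theta(x)$ is a neural network (the "energy") and $Z_\theta$ is the partition function - the normalizing constant. Low energy = high probability; high energy = low probability.
The intractable partition function: $Z_\theta$ requires integrating over all of input space - impossible for high-dimensional $x$. This makes maximum likelihood training intractable - in $\log p_\theta(x) = -E_\theta(x) - \log Z_\theta$, the second term requires knowing $Z_\theta$.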
Training uses contrastive divergence or MCMC-based estimators: generate negative samples (from the current model) via MCMC (e.g., Langevin dynamics), and update weights to increase energy of negative samples and decrease energy of positive (real) samples. This requires running MCMC chains at every training step - expensive and often poorly mixed.
EBMs are theoretically elegant - any neural network can be an energy function - but practically difficult to train for high-dimensional data.
6. Score-Based Models (Preview)
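Score-based models (Song and Ermon, 2019) sidestep both the intractable partition function and the explicit likelihood by learning the score function:
$$s_\theta(x) \approx \nabla_x \log p(x)$$
The score is the gradient of the log-density with respect to the input. It does not depend on $Z$ (since $\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x) - \nabla_x \log Z = \nabla_x \log \tilde{p}(x)$, where $\tilde{p}$ is the unnormalized density). Given the score, Langevin dynamics generates samples:
$$x_{k+1} = x_k + \frac{\eta}{2}\, \nabla_x \log p(x_k) + \sqrt{\eta}\, \epsilon_k, \quad \epsilon_k \sim \mathcal{N}(0, I)$$
Starting from any $x_0$, as step size $\eta \to 0$ and steps $K \to \infty$, the chain converges to $p(x)$. Score matching provides a way to train $s_\theta$ from data without access to $p(x)$ - covered in full in Lesson 03.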
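Langevin dynamics can be tried on a target whose score is known analytically. The sketch below samples a one-dimensional Gaussian using only its score; the target parameters, step size, and chain counts are illustrative assumptions, and a real score-based model would replace `gaussian_score` with a learned network.

```python
import math
import random

def gaussian_score(x, mean, std):
    """Score of N(mean, std^2): d/dx log p(x) = -(x - mean) / std^2."""
    return -(x - mean) / (std ** 2)

def langevin_sample(score_fn, n_steps, step_size, rng):
    x = rng.uniform(-10.0, 10.0)  # arbitrary starting point
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Drift up the log-density, plus injected noise to keep exploring.
        x = x + 0.5 * step_size * score_fn(x) + math.sqrt(step_size) * noise
    return x

rng = random.Random(0)
samples = [
    langevin_sample(
        lambda x: gaussian_score(x, mean=3.0, std=1.0),
        n_steps=500, step_size=0.05, rng=rng,
    )
    for _ in range(500)
]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(f"sample mean {sample_mean:.2f}, sample variance {sample_var:.2f}")
```

The chains forget their starting points and settle near the target mean and variance. With a finite step size the stationary variance is slightly inflated, which is the discretization error that vanishes as $\eta \to 0$.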
7. Diffusion Models
The Core Idea
Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) model the data distribution through a progressive denoising process.
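The forward process gradually corrupts a clean data point $x_0$ by adding Gaussian noise over $T$ steps:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$
By step $T$, $x_T$ is approximately distributed as $\mathcal{N}(0, I)$ - pure noise. Using the notation $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, the marginal at any step is:
$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right)$$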
allowing direct sampling at any noise level without iterating through all steps.
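The closed-form marginal can be verified without any neural network. The sketch below builds the linear schedule used in this lesson's code (from 1e-4 to 0.02) and checks that applying the one-step kernel repeatedly gives the same variance as the closed form; the helper names are illustrative.

```python
def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = product of (1 - beta_s) for s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def iterated_variance(betas):
    """Variance of x_t given x_0, obtained by applying the one-step kernel t times."""
    variances, v = [], 0.0
    for beta in betas:
        v = (1.0 - beta) * v + beta  # Var[sqrt(1 - beta) * x + sqrt(beta) * eps]
        variances.append(v)
    return variances

betas = linear_beta_schedule(T=1000)
alpha_bars = cumulative_alpha_bar(betas)
variances = iterated_variance(betas)

# The step-by-step variance agrees with the closed form 1 - alpha_bar_t.
max_gap = max(abs(v - (1.0 - ab)) for v, ab in zip(variances, alpha_bars))
print(f"max gap: {max_gap:.2e}, alpha_bar_T: {alpha_bars[-1]:.2e}")
```

The final $\bar{\alpha}_T$ is close to zero, so almost nothing of $x_0$ survives at the last step. That is what justifies starting the reverse process from pure Gaussian noise.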
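The reverse process is a learned neural network (U-Net) that undoes one noising step:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
Training: at each step, sample a random $t$, corrupt $x_0$ to $x_t$, then train the network to predict the noise that was added:
$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right], \quad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$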
This is simple MSE - no adversarial game, no intractable integral, no architectural constraints.
Why Diffusion Models Win - The Four Constraints
The generative modeling community cares about four things simultaneously:
Sample quality: do generated samples look realistic? FID, IS, human preference.
Sample diversity: does the model cover the full data distribution? A model that generates perfect copies of the 100 most common ImageNet classes would score well on quality but fail on diversity. Measured by recall, coverage, FID recall component.
Training stability: can you reliably train to convergence without hand-tuning? GANs are notorious for mode collapse, loss curve divergence, and "it was working yesterday" instability.
Likelihood estimation: can you compute $p(x)$ for a given $x$? Essential for anomaly detection, model selection, and Bayesian inference pipelines.
| Model | Quality | Diversity | Stability | Likelihood |
|---|---|---|---|---|
| VAE | Moderate | Good | Excellent | Approx. lower bound |
| GAN | Excellent | Poor | Poor | None |
| Flow | Moderate | Good | Good | Exact |
| EBM | Good | Good | Poor | Intractable |
| Diffusion | Excellent | Excellent | Excellent | Approx. lower bound |
The key insight: the forward process in diffusion is fixed and analytically known - you designed it. The training target at each step (the added noise $\epsilon$) is known exactly, so training is always a well-conditioned MSE regression problem. There is no adversarial dynamic, no mode collapse risk, no non-stationary training landscape. Diffusion achieves excellent quality (iterative denoising over 100–1000 steps allows fine-grained correction) and excellent diversity (each sample starts from an independent random $x_T \sim \mathcal{N}(0, I)$).
The only sacrifice: sampling speed. DDPM requires $T = 1000$ denoising passes. DDIM (Song et al., 2020) reduces this to 20–50 steps via a deterministic ODE formulation. Latent diffusion (Rombach et al., 2022) compresses images to a small latent space, reducing per-step cost by 16–64×. These two innovations together make diffusion practical for production.
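The reason DDIM can skip steps shows up in its update rule. Below is a sketch of the deterministic ($\eta = 0$) DDIM step on scalar data, checked with an oracle that knows the true noise; in practice a trained network supplies `eps_pred`, and the $\bar{\alpha}$ values here are illustrative assumptions.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from noise level t to an earlier level."""
    # Estimate of the clean sample implied by the current noise prediction.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the target level, reusing the same predicted noise.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred

# Sanity check with an oracle noise prediction (scalar data for clarity).
x0, eps = 0.8, -1.3
alpha_bar_t, alpha_bar_prev = 0.05, 0.60  # a large jump that skips many DDPM steps
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

x_prev = ddim_step(x_t, eps_pred=eps, alpha_bar_t=alpha_bar_t, alpha_bar_prev=alpha_bar_prev)
expected = math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * eps
print(x_prev, expected)
```

With a perfect noise prediction, a single jump lands exactly on the correct less-noisy point regardless of how many timesteps it spans. A real network is imperfect, which is why a few dozen steps are still needed in practice.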
FID Evolution on ImageNet
FID (Fréchet Inception Distance) measures distributional similarity in Inception feature space. Lower is better. This table tells the story of a decade of progress:
| Model | Family | FID (ImageNet 256×256 unless noted) | Year |
|---|---|---|---|
| Glow | Flow | 48.9 | 2018 |
| VQVAE-2 | Hybrid (VAE+Autoregressive) | 31.1 | 2019 |
| BigGAN | GAN | 7.4 | 2019 |
| StyleGAN2 | GAN | 3.8 (FFHQ 1024px) | 2020 |
| ADM (DDPM++) | Diffusion | 10.9 | 2021 |
| ADM-G (classifier guided) | Diffusion | 4.59 | 2021 |
| LDM-4 | Latent Diffusion | 10.56 | 2022 |
| DiT-XL/2 | Diffusion Transformer | 2.27 | 2022 |
| SDXL | Latent Diffusion | <2.0 | 2023 |
The DiT-XL/2 result (2.27 FID, ImageNet 256×256 class-conditional) surpassed all prior GAN-based methods. Diffusion did not just catch up to GANs - it decisively surpassed them.
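FID formula for reference:
$$\text{FID} = \left\| \mu_r - \mu_g \right\|^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$
where $\mu_r, \Sigma_r$ are the mean and covariance of Inception features on real images, and $\mu_g, \Sigma_g$ on generated images. The first term captures mean shift (quality); the second captures covariance mismatch (diversity). Lower FID = closer to real data in feature space.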
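For intuition, the formula collapses to something very simple when the features are one-dimensional. The sketch below is that scalar special case only, with illustrative numbers; it is not a substitute for computing FID on real Inception features.

```python
import math

def fid_1d(mean_r, var_r, mean_g, var_g):
    """Fréchet distance between N(mean_r, var_r) and N(mean_g, var_g)."""
    mean_term = (mean_r - mean_g) ** 2
    # Tr(S_r + S_g - 2 (S_r S_g)^{1/2}) reduces to (std_r - std_g)^2 for scalars.
    cov_term = var_r + var_g - 2.0 * math.sqrt(var_r * var_g)
    return mean_term + cov_term

identical = fid_1d(0.0, 1.0, 0.0, 1.0)      # same distribution
shifted = fid_1d(0.0, 1.0, 0.5, 1.0)        # mean shift only
narrowed = fid_1d(0.0, 1.0, 0.0, 0.25)      # generated spread too small
print(identical, shifted, narrowed)
```

The shifted and narrowed cases produce the same score here, which illustrates why a single FID number cannot say whether a model is failing on quality or on diversity.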
Code - Comparing Generative Model Architectures
```python
import time
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


# ============================================================
# VAE - ELBO training, fast single-pass sampling
# ============================================================

class VAE(nn.Module):
    """
    Variational Autoencoder on flattened images (e.g., MNIST 28×28=784).

    Encoder q_φ(z|x) → (μ, log σ²)
    Decoder p_θ(x|z) → reconstructed x
    """

    def __init__(self, input_dim: int = 784, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, latent_dim * 2)  # output: [μ | log_σ²]
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, input_dim), nn.Sigmoid()
        )
        self.latent_dim = latent_dim

    def encode(self, x):
        h = self.encoder(x)
        mu, log_var = h.chunk(2, dim=-1)
        return mu, log_var

    def reparameterize(self, mu, log_var):
        """
        Reparameterization trick: z = μ + σ·ε, ε~N(0,I)
        Moves randomness outside the computational graph → differentiable.
        """
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var

    def elbo_loss(self, x):
        """
        ELBO = E[log p(x|z)] - KL(q(z|x) || p(z))
        Returns the negative ELBO, summed over the batch (a loss to minimize).
        Pixel-wise BCE for reconstruction; closed-form KL for Gaussians.

        The blurriness of VAE outputs comes from pixel-wise BCE/MSE:
        uncertainty → hedge → average → blur.
        """
        x_recon, mu, log_var = self(x)
        # Reconstruction: pixel-wise BCE (independent pixel assumption)
        recon_loss = F.binary_cross_entropy(x_recon, x, reduction="sum")
        # KL divergence: closed form for q=N(μ,σ²), p=N(0,1)
        kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon_loss + kl_loss

    @torch.no_grad()
    def sample(self, n: int, device: str) -> torch.Tensor:
        """One decoder forward pass - fast."""
        z = torch.randn(n, self.latent_dim, device=device)
        return self.decoder(z)


# ============================================================
# GAN - adversarial training, one-pass generation (no likelihood)
# ============================================================

class DCGANGenerator(nn.Module):
    """
    Deep Convolutional GAN generator.
    Maps z~N(0,I) → (C,H,W) images via transposed convolutions.
    One forward pass → sharp image. No likelihood ever computed.
    """

    def __init__(self, latent_dim: int = 100, channels: int = 64,
                 image_channels: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            # z → 4×4 feature map
            nn.ConvTranspose2d(latent_dim, channels * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(channels * 8), nn.ReLU(True),
            # 4×4 → 8×8
            nn.ConvTranspose2d(channels * 8, channels * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(channels * 4), nn.ReLU(True),
            # 8×8 → 16×16
            nn.ConvTranspose2d(channels * 4, channels * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(channels * 2), nn.ReLU(True),
            # 16×16 → 32×32
            nn.ConvTranspose2d(channels * 2, image_channels, 4, 2, 1, bias=False),
            nn.Tanh(),  # output in [-1, 1]
        )
        self.latent_dim = latent_dim

    @torch.no_grad()
    def sample(self, n: int, device: str) -> torch.Tensor:
        """One forward pass - extremely fast. No likelihood computation possible."""
        z = torch.randn(n, self.latent_dim, 1, 1, device=device)
        return self.net(z)


class DCGANDiscriminator(nn.Module):
    """
    DCGAN discriminator: real or fake? Expects 32×32 inputs.
    If too strong → generator gradient vanishes (training instability).
    If too weak → generator gets no useful signal.
    Balancing this is the core difficulty of GAN training.
    """

    def __init__(self, channels: int = 64, image_channels: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(image_channels, channels, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(channels * 2), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels * 2, channels * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(channels * 4), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels * 4, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).view(-1)


def gan_losses(real_x, G, D):
    """
    Standard GAN losses (non-saturating generator loss).
    d_loss: D is trained to output 1 for real, 0 for fake.
    g_loss: G is trained to make D output 1 for its fakes.
    """
    bs = real_x.size(0)
    device = real_x.device
    real_lbl = torch.ones(bs, device=device)
    fake_lbl = torch.zeros(bs, device=device)

    z = torch.randn(bs, G.latent_dim, 1, 1, device=device)
    fake_x = G.net(z).detach()  # detach: do not update G when training D
    d_loss = (
        F.binary_cross_entropy(D(real_x), real_lbl)
        + F.binary_cross_entropy(D(fake_x), fake_lbl)
    )

    z = torch.randn(bs, G.latent_dim, 1, 1, device=device)
    g_loss = F.binary_cross_entropy(D(G.net(z)), real_lbl)
    return d_loss, g_loss


# ============================================================
# Minimal DDPM - T forward passes, MSE training
# ============================================================

class MinimalDDPM(nn.Module):
    """
    DDPM on flattened data (MLP for simplicity; production uses U-Net).

    Training: simple MSE noise prediction.
    Sampling: T iterative denoising steps.
    The T-step cost is the only trade-off vs GAN/VAE.
    """

    def __init__(self, input_dim: int = 784, T: int = 1000):
        super().__init__()
        self.T = T
        self.input_dim = input_dim

        # Noise schedule: linear beta schedule
        betas = torch.linspace(1e-4, 0.02, T)
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)
        self.register_buffer("betas", betas)
        self.register_buffer("alphas", alphas)
        self.register_buffer("alpha_bars", alpha_bars)

        # Timestep-conditioned noise predictor (U-Net in real implementations)
        self.eps_net = nn.Sequential(
            nn.Linear(input_dim + 1, 1024), nn.SiLU(),
            nn.Linear(1024, 1024), nn.SiLU(),
            nn.Linear(1024, 512), nn.SiLU(),
            nn.Linear(512, input_dim),
        )

    def add_noise(self, x0, t, eps=None):
        """Forward process: x_t = sqrt(ᾱ_t)·x0 + sqrt(1-ᾱ_t)·ε"""
        if eps is None:
            eps = torch.randn_like(x0)
        ab = self.alpha_bars[t].unsqueeze(-1)
        return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, eps

    def predict_noise(self, xt, t):
        # Scalar timestep embedding in [0, 1), concatenated to the input
        t_emb = (t.float() / self.T).unsqueeze(-1)
        return self.eps_net(torch.cat([xt, t_emb], dim=-1))

    def training_loss(self, x0):
        """Simple MSE: always a well-conditioned regression problem."""
        t = torch.randint(0, self.T, (x0.size(0),), device=x0.device)
        xt, eps = self.add_noise(x0, t)
        eps_pred = self.predict_noise(xt, t)
        return F.mse_loss(eps_pred, eps)

    @torch.no_grad()
    def sample(self, n: int, device: str, steps: Optional[int] = None) -> torch.Tensor:
        """
        Ancestral sampling: T denoising passes.
        This is the slow part - O(T) network evaluations.
        Production solutions: DDIM (20-50 steps), DPM-Solver (10-20 steps).

        Note: passing steps < T only runs timesteps steps-1 .. 0; it is not
        an accelerated sampler, so use steps == T for faithful samples.
        """
        if steps is None:
            steps = self.T
        x = torch.randn(n, self.input_dim, device=device)
        for step in reversed(range(steps)):
            t_batch = torch.full((n,), step, device=device, dtype=torch.long)
            ab_t = self.alpha_bars[step]
            a_t = self.alphas[step]
            b_t = self.betas[step]
            eps_pred = self.predict_noise(x, t_batch)
            # DDPM reverse step mean
            mu = (1.0 / a_t.sqrt()) * (
                x - (1 - a_t) / (1 - ab_t).sqrt() * eps_pred
            )
            if step > 0:
                x = mu + b_t.sqrt() * torch.randn_like(x)
            else:
                x = mu  # Final step: no noise
        return x


# ============================================================
# Sampling speed comparison
# ============================================================

def compare_sampling_speed():
    device = "cpu"
    n_samples = 8
    vae = VAE(input_dim=784, latent_dim=64)
    gen = DCGANGenerator(latent_dim=100, image_channels=1)
    ddpm = MinimalDDPM(input_dim=784, T=100)  # T=100 for demo, production uses 1000

    for name, fn in [
        ("VAE (1 decoder pass)", lambda: vae.sample(n_samples, device)),
        ("GAN (1 generator pass)", lambda: gen.sample(n_samples, device)),
        ("DDPM (100 denoise steps)", lambda: ddpm.sample(n_samples, device, steps=100)),
    ]:
        t0 = time.perf_counter()
        out = fn()
        elapsed = (time.perf_counter() - t0) * 1000
        print(f"{name}: {elapsed:.1f}ms → shape {tuple(out.shape)}")


if __name__ == "__main__":
    compare_sampling_speed()
    # Expected: VAE and GAN take one network pass each; DDPM runs 100 network
    # evaluations per batch, so it is slower by roughly that factor.
```
Evaluation Metrics
FID (Fréchet Inception Distance): the primary metric. Computes the Fréchet distance between real and generated image distributions in InceptionV3 feature space. Captures both quality (mean shift) and diversity (covariance mismatch).
Inception Score (IS): measures whether generated images look like real ImageNet classes (high sharpness) and are diverse across classes (high entropy). Limitation: fails when the generative distribution is out of ImageNet distribution (e.g., medical images).
Precision and Recall: precision = fraction of generated samples that fall within the real data manifold (quality); recall = fraction of the real manifold that is covered by generated samples (diversity). More interpretable than FID, increasingly standard.
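The sketch below illustrates how precision and recall separate quality from coverage, in the spirit of the k-nearest-neighbour manifold estimate commonly used for these metrics. It is a simplified assumption-laden version: scalar "features" instead of Inception embeddings, `k=1`, and hypothetical helper names.

```python
def kth_nn_radius(points, index, k):
    """Distance from points[index] to its k-th nearest neighbour within the same set."""
    dists = sorted(abs(points[index] - p) for j, p in enumerate(points) if j != index)
    return dists[k - 1]

def coverage(queries, support, k):
    """Fraction of queries lying inside the k-NN ball of at least one support point."""
    radii = [kth_nn_radius(support, i, k) for i in range(len(support))]
    inside = sum(
        any(abs(q - s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return inside / len(queries)

def precision_recall(real, generated, k=1):
    precision = coverage(generated, real, k)  # are generated samples realistic?
    recall = coverage(real, generated, k)     # is the real data covered?
    return precision, recall

# Two real modes; the generator has collapsed onto the first one.
real = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
generated = [0.02, 0.05, 0.10, 0.12, 0.15, 0.18]
precision, recall = precision_recall(real, generated)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Every generated point looks realistic, so precision is perfect, but the second real mode is never produced, so recall is only one half. This is the mode-collapse signature that a single scalar metric can hide.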
CLIP Score: for text-conditioned generation - cosine similarity between CLIP embeddings of the generated image and the conditioning text prompt. Measures text-image alignment rather than photorealism.
Production Engineering Notes
:::warning Choosing the right model for your application Not every application needs diffusion.
- Real-time single-step generation (interactive tools, gaming, mobile): GAN or Consistency Model.
- Exact likelihood computation (anomaly detection, NLP density estimation, OOD detection): Normalizing Flow or Autoregressive model.
- Maximum quality text-to-image (creative AI, product photography, marketing): Latent Diffusion + classifier-free guidance.
- Small dataset (less than 5K samples): VAE - GANs overfit catastrophically on small data; diffusion also struggles.
- Controllable editing (inpainting, style transfer, image editing): Diffusion with DDIM inversion.
- Scientific generation (molecules, proteins, materials): Diffusion - best coverage of the combinatorial space.
- Speed-critical with quality (4-step generation acceptable): Latent Consistency Models (LCM) or Consistency Models. :::
:::note Quality vs likelihood - the Pareto frontier High log-likelihood and high sample quality are distinct objectives that do not always align. Normalizing flows achieve exact likelihood but moderate FID. GANs achieve excellent FID but no likelihood. Diffusion models achieve strong FID and an ELBO lower bound - sitting on the best-known Pareto frontier as of 2024. For most production applications, FID and human preference are the correct optimization targets. :::
Common Mistakes
:::danger Confusing likelihood and sample quality High log-likelihood does not imply good samples. A model can assign high probability to blurry or average-looking images. Conversely, a GAN can produce photorealistic samples with zero computable likelihood. Normalizing flows can post strong log-likelihoods (measured in bits-per-dim) while producing visually inferior samples. For generative AI applications, FID and human evaluation are the right metrics - not NLL. :::
:::danger Using FID as the only metric FID conflates quality and diversity - a model with low FID could fail by having poor quality (bad mean in Inception feature space) or poor diversity (bad covariance). Always report Precision (quality: what fraction of generated samples are realistic?) and Recall (coverage: what fraction of the real distribution is covered?) alongside FID. The community is moving toward these more interpretable decomposed metrics. :::
:::warning Mode collapse is not always visible in training loss A GAN with excellent training loss curves can still suffer from severe mode collapse. It might generate 995 of 1000 ImageNet categories beautifully but fail completely on rare classes. Checking per-class generation diversity, visualizing failure modes, and computing stratified FID across class subsets are important diagnostics. FID computed on the full test set can be misleadingly good when a tail of rare classes fails. :::
:::warning Not restarting when GAN training diverges GAN training instability often manifests as: discriminator loss → 0 (discriminator is dominating), generator loss → constant (gradient vanished), or mode collapse (generated samples become identical). These failure modes are often not recoverable from the same checkpoint - the training dynamic has left the basin of good behavior. Prevention is better than recovery: use gradient penalty (WGAN-GP), spectral normalization, or switch to a GAN variant with more stable training (StyleGAN2, ProjectedGAN). :::
YouTube Resources
| Resource | Channel | Why Watch |
|---|---|---|
| Generative Models - Andrej Karpathy | Stanford CS231n | Best single lecture covering VAE, GAN, and flow comparison with intuition - start here |
| GAN Tutorial - Ian Goodfellow | NeurIPS 2016 Tutorial | The GAN inventor explains mode collapse, training instability, and convergence theory |
| DDPM - Ho et al. Paper Talk | NeurIPS 2020 | The original DDPM paper presentation - explains the simplified training objective |
| Diffusion Models Beat GANs - Dhariwal and Nichol | NeurIPS 2021 | The FID-record paper - classifier guidance and the architecture improvements |
| Score-Based Generative Models - Yang Song | Stanford | Score matching, NCSN, and the SDE unification - deep theory behind modern diffusion |
Interview Q&A
Q1: Why did diffusion models surpass GANs on image generation quality, despite being much slower?
GANs optimize a minimax game - a fundamentally unstable optimization with two competing objectives. The generator can only improve as fast as the discriminator provides useful signal. Mode collapse means the generator exploits a narrow set of outputs that fool the discriminator and stops improving. Training is non-stationary: the optimal generator changes as the discriminator improves, and vice versa. These dynamics require extensive hyperparameter tuning, architectural tricks (spectral normalization, gradient penalty), and often fail silently.
Diffusion models optimize simple MSE noise prediction - always a well-conditioned regression problem. The training signal is rich: the network is trained at every noise level from pure noise to nearly clean images, receiving informative gradient signal throughout training. Iterative sampling (100–1000 steps) allows the model to make fine-grained corrections at each step, accumulating quality that a single-pass generator cannot achieve. Every sample starts from an independent $x_T \sim \mathcal{N}(0, I)$, preventing mode collapse - there is no adversarially trained generator network to collapse.
Q2: What is the relationship between the ELBO in VAEs and the ELBO in diffusion models?
Both optimize an Evidence Lower BOund on the log-likelihood. In VAEs: $\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$. In diffusion models, the ELBO decomposes across timesteps: each term is a denoising problem at noise level $t$. The simplified objective $L_{\text{simple}}$ is a reweighted version of the full diffusion ELBO. The key difference: the diffusion ELBO provides training signal at every noise level ($T$ terms), while the VAE ELBO has a single bottleneck. This richer supervision leads to better generative models - and explains why diffusion training is more stable.
Q3: Explain why normalizing flows produce exact likelihood but mediocre samples.
Flows maximize $\log p_\theta(x)$, which is exact because the bijective constraint makes the integral trivial (one-to-one mapping). But maximizing $\log p_\theta(x)$ does not directly optimize visual quality - it optimizes probability assignment. A model can achieve high likelihood by placing probability mass broadly across the data manifold, at the cost of per-sample sharpness.
More fundamentally, the bijective constraint limits flow architectures: to maintain tractable $\log |\det J|$, transformations must be structured (coupling layers, triangular Jacobians). This rules out the unrestricted U-Net, ResNet, or Transformer architectures that diffusion models use. The expressiveness gap means flows cannot learn as complex a distribution in practice, even though their training objective is theoretically stronger.
Q4: What is FID and why is it the standard evaluation metric for generative models?
FID (Fréchet Inception Distance) measures the distance between the distributions of real and generated images in InceptionV3 feature space. Formally: $\text{FID} = \left\| \mu_r - \mu_g \right\|^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \right)$, where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are statistics of Inception features on real and generated images respectively. Lower FID = more similar to real images.
FID became the standard because: (1) it correlates with human quality judgments better than pixel-level metrics; (2) it captures both quality ($\left\| \mu_r - \mu_g \right\|^2$: mean shift) and diversity (the trace term: covariance mismatch) in one number; (3) it is fast to compute (extract features, compute statistics). Limitations: it depends on InceptionV3 trained on ImageNet - may not transfer to medical, scientific, or artistic domains; it can be gamed; precision and recall provide more interpretable decompositions.
Q5: What is the difference between explicit and implicit density models? Where does diffusion sit?
Explicit density models directly define and compute $p_\theta(x)$. Examples: VAEs (via ELBO lower bound), normalizing flows (exact), autoregressive models (product of conditionals). You can compute $\log p_\theta(x)$ for any test point - essential for anomaly detection, model selection, and Bayesian inference.
Implicit density models define a sampling procedure without computing $p_\theta(x)$. Example: GANs - the generator defines a deterministic map from noise to data with no tractable density. No likelihood computation possible.
Diffusion models sit between: they have a principled ELBO (approximate log-likelihood lower bound) computed via the sum of denoising losses across all timesteps, so they support approximate density estimation. But they are not as clean as flows for likelihood tasks. They are "explicit-ish" - better than GANs for density estimation, better than flows for sample quality, making them the best-known compromise on the quality-likelihood-stability-diversity Pareto frontier.
## Computing FID in Practice
Understanding FID conceptually is not the same as computing it correctly. Here is a compact implementation of the formula, followed by a sanity check on simulated features:
```python
import torch
import torch.nn as nn
import numpy as np
from scipy import linalg


def compute_fid(real_features: np.ndarray, gen_features: np.ndarray) -> float:
    """
    Compute FID between real and generated image feature distributions.

    Args:
        real_features: [n_real, D] InceptionV3 features for real images
        gen_features: [n_gen, D] InceptionV3 features for generated images

    Returns:
        FID score (lower = better; real vs real approaches 0 as n grows)

    Formula:
        FID = ‖μ_r - μ_g‖² + tr(Σ_r + Σ_g - 2·(Σ_r·Σ_g)^{1/2})
    """
    mu_r = real_features.mean(axis=0)
    mu_g = gen_features.mean(axis=0)
    sigma_r = np.cov(real_features, rowvar=False)
    sigma_g = np.cov(gen_features, rowvar=False)

    # Squared mean difference
    mean_diff_sq = np.sum((mu_r - mu_g) ** 2)

    # Matrix square root: (Σ_r · Σ_g)^{1/2}
    # sqrtm is applied to the (generally non-symmetric) product; its trace equals
    # that of (Σ_r^{1/2} · Σ_g · Σ_r^{1/2})^{1/2}, which is what the formula needs
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)

    # Handle numerical issues (sqrtm can produce tiny imaginary parts)
    if np.iscomplexobj(covmean):
        if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
            raise ValueError("sqrtm produced significant imaginary component")
        covmean = covmean.real

    # Trace term
    trace_term = np.trace(sigma_r) + np.trace(sigma_g) - 2 * np.trace(covmean)
    return float(mean_diff_sq + trace_term)


def extract_inception_features(images: torch.Tensor,
                               device: str = "cpu") -> np.ndarray:
    """
    Extract InceptionV3 pool3 features (2048-dim) for FID computation.

    images: [n, 3, H, W] in range [0, 1] (resized to 299×299 below)

    Note: published FID numbers use the original TF-Inception weights (as ported
    in pytorch-fid). Scores from torchvision's weights are not comparable to them.
    """
    from torchvision.models import inception_v3, Inception_V3_Weights

    inception = inception_v3(weights=Inception_V3_Weights.DEFAULT,
                             transform_input=False).to(device)
    inception.eval()
    inception.fc = nn.Identity()  # Remove final classifier → 2048-dim features

    features = []
    batch_size = 32
    with torch.no_grad():
        for i in range(0, len(images), batch_size):
            batch = images[i:i + batch_size].to(device)
            # InceptionV3 needs 299×299 input
            batch = nn.functional.interpolate(batch, size=(299, 299),
                                              mode="bilinear",
                                              align_corners=False)
            # With transform_input=False the network expects inputs in [-1, 1]
            batch = batch * 2.0 - 1.0
            feat = inception(batch)
            features.append(feat.cpu().numpy())
    return np.concatenate(features, axis=0)


# ── Sanity check: real vs real should be far below real vs generated ──────────
np.random.seed(42)

# Simulated features. D is kept small (64 instead of 2048) so that n >> D and the
# sample covariances are well-conditioned. With n < D they are rank-deficient and
# even real-vs-real FID lands far from 0.
n, D = 5000, 64
real_feat_A = np.random.randn(n, D) * 10 + 5   # real images, subset A
real_feat_B = np.random.randn(n, D) * 10 + 5   # real images, subset B
gen_feat = np.random.randn(n, D) * 12 + 3      # generated images

fid_real_vs_real = compute_fid(real_feat_A, real_feat_B)
fid_real_vs_gen = compute_fid(real_feat_A, gen_feat)
print(f"FID (real vs real, different subsets): {fid_real_vs_real:.2f} (expect small, not exactly 0)")
print(f"FID (real vs generated): {fid_real_vs_gen:.2f} (expect much larger)")

print("\nFID pitfalls:")
print(" - Need ≥10K samples for stable estimates (high variance with n<5K)")
print(" - Must use the same InceptionV3 weights and preprocessing as reference")
print(" - Domain mismatch: FID on medical images is not comparable to ImageNet FID")
```
## The Sampling Speed Problem and Its Solutions
Diffusion's 1000-step sampling is a production bottleneck. Four main solutions have emerged:
**DDIM** (Denoising Diffusion Implicit Models, Song et al. 2020): replaces the stochastic reverse process with a deterministic update that can be read as integrating an ODE. The same noise predictor is reused at fewer steps (20–50) by taking larger ODE steps. No retraining required - it is a different sampling algorithm for the same model. Quality degrades gracefully with fewer steps.
**DPM-Solver** (Lu et al. 2022): a higher-order ODE solver specifically designed for diffusion models. Achieves high-quality samples in 10–20 steps by exploiting the semi-linear structure of the diffusion ODE (the linear part has an exact solution, only the nonlinear noise network output needs to be approximated).
**Consistency Models** (Song et al. 2023): train a model $f_\theta$ to directly map any noisy $x_t$ to the clean $x_0$, maintaining consistency - i.e., $f_\theta(x_t, t) = f_\theta(x_{t'}, t')$ for all $t, t'$ on the same trajectory. Achieves 1–4 step generation. Can be distilled from a pretrained diffusion model or trained from scratch.
**Latent Diffusion** (Rombach et al. 2022): compress images into a small latent space using a VAE encoder (downsampling each side by 4× or 8×, i.e. 16× or 64× fewer spatial positions), run diffusion in that latent space, then decode. For Stable Diffusion at 512×512, each denoising step operates on a 64×64×4 latent (16,384 values) instead of 512×512×3 pixel values (786,432), about 48× fewer values per step. This is the architecture underlying Stable Diffusion and many production text-to-image systems.
| Method | Steps | Relative Speed (by step count) | Quality Drop |
|---|---|---|---|
| DDPM | 1000 | 1× (baseline) | - |
| DDIM | 50 | 20× | Minimal |
| DDIM | 20 | 50× | Small |
| DPM-Solver | 15 | ~67× | Small |
| Consistency (distilled) | 4 | 250× | Moderate |
| Consistency (1-step) | 1 | 1000× | Significant |
| Latent Diffusion + DDIM-50 | 50 | 20× from step count, with each step run on a ~48× smaller tensor | - |
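
The DDIM idea of reusing one noise predictor on a shorter timestep schedule can be sketched as below. This is an illustrative NumPy version of the deterministic (η = 0) update, assuming a linear β schedule; `ddim_step`, `ddim_sample`, and the `oracle` predictor are names invented for this example, and the oracle stands in for a trained network.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) from noise level t to an earlier level."""
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

def ddim_sample(predict_eps, x_T, alpha_bar, num_steps):
    """Run DDIM over an evenly spaced subsequence of the training timesteps."""
    T = len(alpha_bar)
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # The final step targets alpha_bar = 1, i.e. fully clean data
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        x = ddim_step(x, predict_eps(x, t), a_t, a_prev)
    return x

betas = np.linspace(1e-4, 0.02, 1000)        # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 8))
eps = rng.standard_normal((2, 8))
x_T = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * eps

def oracle(x_t, t):
    """Hypothetical perfect noise predictor, used only to exercise the sampler."""
    return (x_t - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])

for steps in (1000, 50, 10):
    err = np.abs(ddim_sample(oracle, x_T, alpha_bar, steps) - x0).max()
    print(f"{steps:4d} steps -> max error {err:.2e}")
```

With a perfect predictor the result is the same at any step count, which is why no retraining is needed. A real network's prediction errors are what make quality fall off as the steps get larger.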
## Conditional Generation - How Each Model Family Handles It
All production generative models need conditional generation: sampling from $p(x \mid c)$, where $c$ is a conditioning signal (text prompt, class label, reference image). Each architecture handles conditioning differently.
**VAE**: append $c$ to the latent $z$ (conditional VAE, CVAE) or condition both encoder and decoder on $c$. Works well for simple conditioning (class labels); poor for complex text prompts.
**GAN**: concatenate the condition $c$ to both generator input and discriminator input (conditional GAN, cGAN). Works for class-conditional generation; struggles with complex text due to training instability. StyleGAN-based approaches use adaptive instance normalization (AdaIN) to inject conditioning.
**Normalizing Flow**: condition the affine coupling layers on $c$ (conditional flow). Works but the bijective constraint makes powerful conditioning difficult.
**Diffusion**: classifier guidance (Dhariwal and Nichol, 2021) - train a classifier $p_\phi(c \mid x_t)$ on noisy images and add its gradient to the score: $\nabla_{x_t} \log p(x_t) + s\,\nabla_{x_t} \log p_\phi(c \mid x_t)$. This steers sampling toward $c$ with guidance scale $s$. Higher $s$ = stronger conditioning at the cost of diversity.
**Classifier-Free Guidance (CFG)** (Ho and Salimans, 2022): the dominant approach today. Train a single model that can be conditioned or unconditioned (randomly drop $c$ during training). At inference, extrapolate between conditioned and unconditioned predictions:
$$\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + w\,\big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big)$$
At $w = 0$: unconditioned. At $w = 1$: standard conditioned. At $w > 1$ (e.g., 7.5 in Stable Diffusion): overextrapolated - stronger adherence to the text prompt at the cost of diversity. CFG is the reason Stable Diffusion follows text prompts well - it is the primary mechanism for text-to-image control.
```python
import torch


def classifier_free_guidance_step(
    noise_pred_cond: torch.Tensor,
    noise_pred_uncond: torch.Tensor,
    guidance_scale: float = 7.5,
) -> torch.Tensor:
    """
    Classifier-Free Guidance (CFG) - the mechanism behind Stable Diffusion prompt adherence.

    CFG interpolates (extrapolates) between conditioned and unconditioned predictions.
        guidance_scale = 1.0: no guidance (standard conditional)
        guidance_scale = 7.5: strong guidance (Stable Diffusion default)
        guidance_scale > 15: often over-saturated, low diversity

    Args:
        noise_pred_cond: εθ(x_t, t, c) noise prediction with condition c
        noise_pred_uncond: εθ(x_t, t, ∅) noise prediction without condition
        guidance_scale: w - guidance strength
    """
    # Linear extrapolation: uncond + w·(cond - uncond) = (1 - w)·uncond + w·cond
    #   w = 0 → unconditioned prediction
    #   w = 1 → standard conditional prediction
    #   w > 1 → extrapolates past the conditional prediction
    return noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)


def cfg_sampling_step(
    model,                # noise predictor εθ(x_t, t, c), assumed to return a tensor
    x_t,                  # current noisy image
    t,                    # current timestep
    condition,            # text embeddings or class label
    guidance_scale,       # w
    scheduler,            # DDPM or DDIM scheduler
    null_condition=None,  # the "no condition" input ∅ for this model
):
    """
    One CFG denoising step - used in every SD inference call.

    Two forward passes per step: once conditioned, once unconditioned.
    This doubles the compute cost of sampling vs unconditioned.

    In Stable Diffusion, null_condition is the text-encoder embedding of an empty
    prompt; leaving it as None only works for models that define their own null token.
    A diffusers UNet returns an output object, so use its `.sample` tensor there.
    """
    # Two forward passes: conditioned and unconditioned
    noise_pred_cond = model(x_t, t, encoder_hidden_states=condition)
    noise_pred_uncond = model(x_t, t, encoder_hidden_states=null_condition)

    # Apply CFG
    noise_pred = classifier_free_guidance_step(
        noise_pred_cond, noise_pred_uncond, guidance_scale
    )

    # Update x_t using the scheduler (DDPM, DDIM, DPM-Solver, etc.)
    x_prev = scheduler.step(noise_pred, t, x_t).prev_sample
    return x_prev


print("Classifier-Free Guidance:")
print(" Training: randomly drop condition c with probability p_uncond (e.g., 0.1)")
print(" Inference: two forward passes per step, extrapolate with guidance_scale w")
print(" guidance_scale = 1.0 → pure conditional (follows prompt loosely)")
print(" guidance_scale = 7.5 → Stable Diffusion default (strong prompt adherence)")
print(" guidance_scale > 15 → over-saturation, reduced diversity")
print(" Cost: 2× forward passes vs unconditioned sampling")
```
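
The training-side half of CFG, randomly dropping the condition, is simple enough to sketch directly. The snippet below is an illustration in NumPy, assuming the null condition is an all-zero embedding (in practice it is often a learned vector or the embedding of an empty prompt); `drop_condition` is a name introduced here.

```python
import numpy as np

def drop_condition(cond, null_cond, p_uncond, rng):
    """Replace each example's condition with the null condition with probability p_uncond."""
    drop = rng.random(cond.shape[0]) < p_uncond            # one Bernoulli draw per example
    mask = drop.reshape(-1, *([1] * (cond.ndim - 1)))      # broadcast over embedding dims
    return np.where(mask, null_cond, cond), drop

rng = np.random.default_rng(0)
cond = rng.standard_normal((10_000, 16))   # stand-in text embeddings
null_cond = np.zeros(16)                   # assumed null embedding
mixed, dropped = drop_condition(cond, null_cond, p_uncond=0.1, rng=rng)
print(f"fraction trained unconditionally: {dropped.mean():.3f}")
```

Because the same network sees both kinds of example, it learns the conditional and unconditional noise predictions that the guidance step later combines.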
## Architecture Evolution - From U-Net to Transformer
The original DDPM used a **U-Net** architecture: a convolutional encoder-decoder with skip connections, residual blocks, and self-attention at the 16×16 feature-map resolution. This architecture encodes strong inductive biases for image data: spatial locality, translation equivariance, multi-scale processing.
**DiT (Diffusion Transformer, Peebles and Xie, 2022)** replaced the U-Net with a **Vision Transformer** operating on patches of the noisy VAE latent - patchifying the latent, processing with standard transformer blocks, then unpatchifying to predict the noise. DiT-XL/2 achieved FID 2.27 on ImageNet 256×256 class-conditional, surpassing all prior models. Transformers scale more predictably than U-Nets: in the DiT experiments, FID improves steadily as model compute (Gflops) increases.
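
The patchify and unpatchify steps are pure reshapes, which the following NumPy sketch makes explicit. The tensor shape (a 32×32 grid with 4 channels and patch size 2) is chosen for illustration, and the function names are introduced here rather than taken from the DiT codebase.

```python
import numpy as np

def patchify(x, p):
    """[N, C, H, W] -> [N, (H/p)*(W/p), p*p*C] sequence of flattened patches."""
    n, c, h, w = x.shape
    assert h % p == 0 and w % p == 0, "spatial size must be divisible by patch size"
    x = x.reshape(n, c, h // p, p, w // p, p)
    x = x.transpose(0, 2, 4, 3, 5, 1)             # N, H/p, W/p, p, p, C
    return x.reshape(n, (h // p) * (w // p), p * p * c)

def unpatchify(tokens, p, c, h, w):
    """Inverse of patchify: token sequence back to [N, C, H, W]."""
    n = tokens.shape[0]
    x = tokens.reshape(n, h // p, w // p, p, p, c)
    x = x.transpose(0, 5, 1, 3, 2, 4)             # N, C, H/p, p, W/p, p
    return x.reshape(n, c, h, w)

rng = np.random.default_rng(0)
latent = rng.standard_normal((2, 4, 32, 32))      # assumed input: batch of 4-channel 32×32 grids
tokens = patchify(latent, p=2)
print(tokens.shape)                               # (2, 256, 16)
restored = unpatchify(tokens, p=2, c=4, h=32, w=32)
print("round trip exact:", np.array_equal(restored, latent))
```

Halving the patch size quadruples the sequence length, which is the main lever trading compute for quality in this design.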
The shift from U-Net to Transformer mirrors the broader shift in ML from CNNs to Transformers: transformers are more general (no spatial locality inductive bias), more scalable, and more amenable to conditioning via cross-attention (text prompts are integrated naturally as key-value pairs in cross-attention layers).
## When Each Model Class Still Wins in Production (2024 Reference)
Despite diffusion's dominance in image generation, each model family remains the right tool for specific production problems.
**GAN use cases in 2024**: (1) Real-time face reenactment and video avatars - GAN inference in one pass makes real-time (<100ms) generation feasible. (2) Image super-resolution (ESRGAN, RealESRGAN) - GAN perceptual loss gives sharper textures than diffusion-based SR for some use cases. (3) Neural radiance field (NeRF) compositing - integrating GAN-generated textures into 3D scenes.
**VAE use cases in 2024**: (1) As the compression backbone in Latent Diffusion - the VAE encoder/decoder converts between pixel space and latent space. (2) Drug discovery - VAEs on molecular graphs (junction tree VAE) for constrained chemical space exploration. (3) Any application needing a structured, smooth latent space for interpolation or attribute manipulation.
**Normalizing flow use cases in 2024**: (1) Density estimation for anomaly detection where exact log-likelihood is required. (2) Scientific applications (molecular dynamics, Boltzmann generators) where the bijective structure matches physical conservation laws. (3) Variational inference - flows as expressive approximate posteriors in Bayesian models.
**Autoregressive model use cases (not covered yet)**: (1) Text generation - GPT family. (2) Image generation over discrete tokens - a VQ-VAE-style tokenizer plus a transformer (the approach used in DALL-E 1). (3) Audio (AudioLM; Whisper's decoder is also autoregressive, though it emits text transcripts rather than audio). The autoregressive family achieves exact likelihood via the chain rule of probability: $p(x) = \prod_i p(x_i | x_{<i})$.
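
As a small worked example of the chain-rule factorization, the snippet below scores sequences under a toy autoregressive model. The two-symbol vocabulary and all probabilities are made up, and the model conditions only on the previous symbol, which is a simplifying special case of $p(x_i \mid x_{<i})$.

```python
import math
from itertools import product

# Toy first-order autoregressive model (all numbers are illustrative)
p_first = {"a": 0.6, "b": 0.4}
p_next = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.2, "b": 0.8}}

def log_prob(seq: str) -> float:
    """Exact log-likelihood: log p(x_1) + sum_i log p(x_i | x_{i-1})."""
    lp = math.log(p_first[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(p_next[prev][cur])
    return lp

print(f"log p('aab') = {log_prob('aab'):.4f}")

# Because every conditional is normalized, the sequence probabilities sum to 1
total = sum(math.exp(log_prob("".join(s))) for s in product("ab", repeat=3))
print(f"sum over all length-3 sequences = {total:.6f}")
```

The likelihood is exact for the same reason it is slow to sample from: every symbol needs its own pass conditioned on what came before.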
## The Research Frontier (2024–2025)
**Flow Matching** (Lipman et al., 2022; Liu et al., 2022): instead of learning to reverse a specific noising SDE, flow matching directly regresses a velocity field that transports noise to data along simple, typically straight-line, paths. Training objective: $\mathcal{L}_{FM} = \mathbb{E}_{t,x_0,x_1}\!\left[\|v_\theta(x_t, t) - (x_1 - x_0)\|^2\right]$ where $x_t = (1-t)x_0 + tx_1$ is a straight-line interpolation from noise $x_0$ to data $x_1$. Because the target paths are straight, the learned ODE tends to need fewer integration steps at sampling time, and reported FID is competitive with diffusion baselines. Meta's Voicebox (audio), Stable Diffusion 3, and FLUX all use flow matching.
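
The objective and the corresponding sampler can be sketched in NumPy as follows. This is a toy setup rather than a training recipe: the "data" is just noise shifted by a constant and each $x_1$ is deliberately paired with its own $x_0$, so the true velocity is known in closed form. `flow_matching_loss` and `euler_sample` are illustrative names.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Monte Carlo estimate of E||v(x_t, t) - (x1 - x0)||^2 on straight-line paths."""
    t = t.reshape(-1, 1)                  # one time per example, broadcast over features
    x_t = (1.0 - t) * x0 + t * x1         # point on the line from noise x0 to data x1
    target = x1 - x0                      # constant velocity along that line
    return float(np.mean(np.sum((velocity_fn(x_t, t) - target) ** 2, axis=1)))

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data) with Euler steps."""
    x, dt = x0.copy(), 1.0 / num_steps
    for k in range(num_steps):
        t = np.full((x.shape[0], 1), k * dt)
        x = x + dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
shift = np.array([3.0, -1.0])             # toy assumption: data = noise + shift
x0 = rng.standard_normal((256, 2))
x1 = x0 + shift
t = rng.random(256)

perfect_v = lambda x, t: np.broadcast_to(shift, x.shape)   # the true velocity in this toy
zero_v = lambda x, t: np.zeros_like(x)                     # an uninformed baseline
print("loss, perfect velocity:", flow_matching_loss(perfect_v, x0, x1, t))
print("loss, zero velocity:   ", flow_matching_loss(zero_v, x0, x1, t))

samples = euler_sample(perfect_v, x0, num_steps=1)
print("one Euler step reaches the data:", np.allclose(samples, x1))
```

When the learned velocity is constant along each path, a single Euler step is exact; the closer a trained model gets to that regime, the fewer ODE steps sampling needs.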
**Consistency Models** (Song et al., 2023): train a model $f_\theta(x_t, t) \approx x_0$ that is consistent along the probability flow ODE - i.e., all points on the same trajectory map to the same $x_0$. Achieves 1–4 step generation via direct distillation from a pretrained diffusion model, with only a moderate FID penalty.
**Video diffusion and 3D**: DiT-based architectures extended to video (Sora, OpenAI 2024), 3D generation (DreamFusion, Point-E), and motion generation (MDM). The same mathematical framework generalizes: the forward process corrupts video frames or 3D point clouds, and the reverse process denoises them.
The unifying trend: every new architecture is a special case of the SDE/flow matching framework. The mathematical tools from this lesson - score functions, Langevin dynamics, probability flow ODEs - remain central regardless of the specific model variant.
## Quick Reference - Generative Model Trade-offs
| Property | VAE | GAN | Flow | EBM | Diffusion |
|----------|-----|-----|------|-----|-----------|
| Training objective | ELBO (approx. MLE) | Minimax adversarial | Exact MLE | Contrastive divergence | MSE noise prediction |
| Likelihood | Approx. lower bound | None | Exact | Intractable Z | Approx. lower bound |
| Sample quality | Moderate (blur) | Excellent | Moderate | Good | Excellent |
| Sample diversity | Good | Poor–Moderate | Good | Good | Excellent |
| Training stability | Excellent | Poor | Good | Poor | Excellent |
| Sampling speed | Fast (1 pass) | Fast (1 pass) | Fast | MCMC (slow) | Slow (T passes) |
| Conditional gen. | CVAE | cGAN | Cond. flow | - | CFG (dominant) |
| Likelihood for anomaly | Approximate | No | Yes | No | Approximate |
| Scales to 1024px? | No (blurry) | Yes (StyleGAN) | Barely | No | Yes (LDM) |
| Production usage | VAE backbone | Real-time | Scientific | Research | Text-to-image |
This table is the one-page summary of what takes researchers years of papers to establish. Each cell represents a design decision, a theoretical result, and a set of engineering constraints. When an interviewer asks "compare generative models," this is the target answer - precise, multi-dimensional, and grounded in why each property holds rather than just what it is.
```mermaid
flowchart LR
A["DDPM 2020<br/>U-Net + MSE<br/>1000 steps, 3.17 FID CIFAR-10"]:::blue
B["DDIM 2020<br/>ODE sampling<br/>20-50 steps, same model"]:::green
C["ADM + Guidance 2021<br/>U-Net + classifier guidance<br/>4.59 FID ImageNet"]:::teal
D["Latent Diffusion 2022<br/>VAE compression + U-Net<br/>Stable Diffusion"]:::indigo
E["DiT-XL/2 2022<br/>Vision Transformer<br/>2.27 FID ImageNet"]:::purple
F["CFG + DiT 2023+<br/>Text-conditioned Transformer<br/>Production systems"]:::orange
A --> B --> C --> D
C --> E --> F
classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
classDef indigo fill:#e0e7ff,color:#3730a3,stroke:#6366f1
classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
```
:::tip 🎮 Interactive Playground
**Visualize this concept:** Try the **[VAE vs GAN vs Diffusion](/playground/generative-comparison)** demo on the EngineersOfAI Playground - no code required.
:::
