Generative Models - VAEs, GANs, Diffusion, and Beyond

Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, Applied Scientist, Computer Vision Engineer

The Real Interview Moment

You are in an OpenAI research engineer interview. The interviewer asks: "Compare VAEs, GANs, and diffusion models. For each, give me the training objective, the key mathematical insight, and when you would choose one over the others." You start with VAEs - "they maximize a variational lower bound on the log-likelihood..." - and the interviewer nods. You move to GANs - "minimax objective, the generator minimizes while the discriminator maximizes..." She interrupts: "Why does mode collapse happen in GANs but not in diffusion models? Derive the connection."

This is not a question you can answer with surface-level knowledge. It requires understanding the fundamental difference between adversarial training (which has an unstable equilibrium) and score-based denoising (which has a well-defined regression target at each noise level). Candidates who can only describe the architectures get a "lean hire." Candidates who can derive the objectives, explain the failure modes, and compare the paradigms at a mathematical level get a "strong hire."

This page gives you the complete mathematical foundation for every major generative model family, with the depth needed to handle follow-up questions at top AI labs.

What You Will Master

Derive the ELBO (Evidence Lower Bound) for VAEs from first principles
Explain the reparameterization trick and why it is necessary
State the GAN minimax objective and derive the optimal discriminator
Diagnose mode collapse, training instability, and solutions (WGAN, spectral norm)
Derive the forward and reverse diffusion processes in DDPM
Connect score matching to denoising diffusion
Compare VAE vs GAN vs Diffusion vs Autoregressive vs Flow models with precise tradeoffs
Explain modern applications: text-to-image, video generation, 3D generation

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Explain	4 - Can Derive	5 - Can Teach	Your Score
Derive the ELBO for VAEs						___
Explain reparameterization trick						___
State GAN objective and optimal D						___
Explain mode collapse and solutions						___
Derive DDPM forward/reverse process						___
Explain score matching						___
Compare all generative model families						___
Explain text-to-image diffusion						___

Target: All 4s and 5s before your interview.

Part 1 - Variational Autoencoders (VAEs)

The Generative Modeling Problem

We have data $x$ drawn from an unknown distribution $p_{\text{data}}(x)$ . We want to learn a model $p_\theta(x)$ that approximates this distribution so we can:

Evaluate the probability of new data (density estimation)
Generate new samples (sampling)
Learn meaningful latent representations (representation learning)

The Latent Variable Model

A VAE introduces latent variables $z$ and models:

$p_\theta(x) = \int p_\theta(x|z) \, p(z) \, dz$

Where $p(z) = \mathcal{N}(0, I)$ is a simple prior and $p_\theta(x|z)$ is a neural network (decoder) that maps $z$ to $x$ .

The problem: This integral is intractable - we cannot compute $p_\theta(x)$ directly because we would need to integrate over all possible $z$ values.

Deriving the ELBO

We want to maximize $\log p_\theta(x)$ . Since we cannot compute it directly, we derive a lower bound.

Start with the log-likelihood and introduce an approximate posterior $q_\phi(z|x)$ :

$\log p_\theta(x) = \log \int p_\theta(x, z) \, dz$

Multiply and divide by $q_\phi(z|x)$ :

$= \log \int \frac{p_\theta(x, z)}{q_\phi(z|x)} q_\phi(z|x) \, dz$

By Jensen's inequality ( $\log$ is concave):

$\geq \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \, dz$

Expanding $p_\theta(x, z) = p_\theta(x|z) p(z)$ :

$= 𝔼_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \| p(z))$

This is the ELBO (Evidence Lower Bound):

$\text{ELBO}(\theta, \phi; x) = \underbrace{𝔼_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{Reconstruction term}} - \underbrace{\text{KL}(q_\phi(z|x) \| p(z))}_{\text{Regularization term}}$
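As a rough illustration of the two ELBO terms, the sketch below assumes a diagonal-Gaussian encoder (so the KL to the standard-normal prior has a closed form) and a unit-variance Gaussian decoder (so the reconstruction term reduces to half the squared error, up to a constant). The function names and the toy values are hypothetical, not taken from any library.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def negative_elbo(x, x_recon, mu, log_var):
    """Single-sample negative ELBO estimate.

    Assumes a unit-variance Gaussian decoder, so -log p(x|z) is half the
    squared error plus a constant that is dropped here.
    """
    recon = 0.5 * np.sum((x - x_recon) ** 2, axis=-1)
    return recon + gaussian_kl_to_standard_normal(mu, log_var)

# Illustrative values only: two examples with a 2-D latent.
mu = np.array([[0.0, 0.0], [1.0, -1.0]])
log_var = np.zeros((2, 2))
kl = gaussian_kl_to_standard_normal(mu, log_var)

x = np.array([[1.0, 2.0], [0.0, 0.0]])
x_recon = np.array([[1.0, 2.0], [1.0, 1.0]])
loss = negative_elbo(x, x_recon, mu, log_var)
```

When the encoder output equals the prior (zero mean, unit variance), the KL term is exactly zero, which is the regime that posterior collapse drifts toward.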

[Figure: VAE - ELBO Training Flow with Reparameterization Trick]

60-Second Answer

"A VAE is a latent variable model that learns to generate data by encoding inputs into a latent space and decoding them back. The training objective is the ELBO - a lower bound on the log-likelihood - which has two terms: a reconstruction loss that encourages accurate decoding, and a KL divergence that regularizes the latent space to be close to a standard Gaussian. The key innovation is the reparameterization trick: instead of sampling z directly from the encoder's distribution (which is not differentiable), we express z as mu plus sigma times epsilon where epsilon is sampled from a standard normal. This allows gradients to flow through the sampling step."

The Reparameterization Trick

The problem: We need to compute $\nabla_\phi 𝔼_{q_\phi(z|x)}[f(z)]$ . But the expectation is over a distribution that depends on $\phi$ - we cannot simply backpropagate through a sampling operation.

The solution: Instead of $z \sim \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$ , write:

$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$

Now $z$ is a deterministic function of $\phi$ (through $\mu$ and $\sigma$ ) plus independent noise. Gradients with respect to $\phi$ flow through $\mu$ and $\sigma$ without issues.
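A small numerical check can make this concrete. The sketch below uses a scalar latent and the test function $f(z) = z^2$ (both chosen only for illustration), for which $𝔼[f(z)] = \mu^2 + \sigma^2$ and the true gradients are $2\mu$ and $2\sigma$. The pathwise (reparameterized) estimates should land close to those values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7          # illustrative encoder outputs

# Reparameterize: z = mu + sigma * eps, eps ~ N(0, 1).
eps = rng.standard_normal(200_000)
z = mu + sigma * eps

# f(z) = z**2, so df/dz = 2z. Chain rule through z gives per-sample gradients.
grad_mu = np.mean(2 * z * 1.0)      # dz/dmu = 1
grad_sigma = np.mean(2 * z * eps)   # dz/dsigma = eps

# Analytic reference: d/dmu (mu^2 + sigma^2) = 2*mu, d/dsigma = 2*sigma.
print(grad_mu, 2 * mu)
print(grad_sigma, 2 * sigma)
```

In an autodiff framework the same chain rule is applied automatically once `z` is written as `mu + sigma * eps`.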

Common Trap

Do NOT say "the reparameterization trick makes the VAE differentiable." The encoder and decoder are ordinary neural networks and are already differentiable. What the trick does is make the sampled $z$ a differentiable function of the encoder parameters. Without it, backpropagation cannot pass through the sampling step, and the fallback score-function (REINFORCE) estimator has much higher variance.

VAE Limitations

Blurry outputs: The pixel-wise reconstruction loss (MSE or binary cross-entropy) encourages averaging over modes, producing blurry images rather than sharp ones.
Posterior collapse: The KL term can dominate, causing the encoder to ignore the input and produce $q_\phi(z|x) \approx p(z)$ . The decoder then ignores $z$ and generates the same output regardless of input.
Limited expressiveness: The Gaussian assumption for both prior and posterior limits the complexity of the latent space.

Solutions: Beta-VAE (weight the KL term), VQ-VAE (discrete latent space), hierarchical VAEs (NVAE, VDVAE).

Part 2 - Generative Adversarial Networks (GANs)

The Minimax Objective

A GAN consists of two networks:

Generator $G(z)$ : maps noise $z \sim p(z)$ to fake data
Discriminator $D(x)$ : classifies data as real or fake

The training objective is a minimax game:

$\min_G \max_D \, V(D, G) = 𝔼_{x \sim p_{\text{data}}}[\log D(x)] + 𝔼_{z \sim p(z)}[\log(1 - D(G(z)))]$

The discriminator tries to maximize $V$ - assign high probability to real data and low probability to fake data. The generator tries to minimize $V$ - fool the discriminator into assigning high probability to fake data.

The Optimal Discriminator

For a fixed generator $G$ , the optimal discriminator is:

$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$

Proof: The discriminator maximizes:

$V(D) = \int \left[ p_{\text{data}}(x) \log D(x) + p_G(x) \log(1 - D(x)) \right] dx$

Taking the derivative with respect to $D(x)$ and setting it to zero:

$\frac{p_{\text{data}}(x)}{D(x)} - \frac{p_G(x)}{1 - D(x)} = 0$

Solving: $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_G(x))$ .
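The pointwise maximization can be sanity-checked numerically. This sketch picks arbitrary density values $a = p_{\text{data}}(x)$ and $b = p_G(x)$ at a single point and grid-searches the integrand. The values and the grid resolution are illustrative assumptions.

```python
import numpy as np

def pointwise_objective(y, a, b):
    """Integrand a*log(y) + b*log(1-y) with a = p_data(x), b = p_G(x)."""
    return a * np.log(y) + b * np.log(1.0 - y)

a, b = 0.3, 0.1                                   # illustrative density values
ys = np.linspace(1e-4, 1.0 - 1e-4, 100_001)       # candidate values of D(x)
y_best = ys[np.argmax(pointwise_objective(ys, a, b))]
y_closed_form = a / (a + b)                       # the derived optimum
print(y_best, y_closed_form)
```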

Substituting the Optimal Discriminator

When $D = D^*$ , the game value becomes:

$V(D^*, G) = -\log 4 + 2 \cdot \text{JSD}(p_{\text{data}} \| p_G)$

Where JSD is the Jensen-Shannon Divergence. This means the generator is minimizing the JSD between the real and generated distributions. The global optimum is $p_G = p_{\text{data}}$ , where $\text{JSD} = 0$ and $V = -\log 4$ .
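The identity is easy to verify on small discrete distributions. The sketch below uses two made-up three-outcome distributions with full support. The helper names `kl` and `jsd` are defined here for the example.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.5, 0.3, 0.2])   # illustrative "real" distribution
p_g = np.array([0.2, 0.2, 0.6])      # illustrative generator distribution

d_star = p_data / (p_data + p_g)     # optimal discriminator
value = np.sum(p_data * np.log(d_star) + p_g * np.log(1.0 - d_star))
identity_rhs = -np.log(4.0) + 2.0 * jsd(p_data, p_g)
print(value, identity_rhs)
```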

Interviewer's Perspective

Deriving the optimal discriminator and showing that the GAN objective reduces to JSD minimization is a classic interview question at Google Research, OpenAI, and DeepMind. The derivation is straightforward but you must be able to do it on a whiteboard without notes. Practice it until you can complete it in under 3 minutes.

Mode Collapse

What it is: The generator finds a single output (or a few outputs) that fools the discriminator and produces only those, ignoring the diversity of the real data distribution.

Why it happens: The generator's gradient points toward the mode that currently fools the discriminator best. Once it finds such a mode, the gradient reinforces it rather than encouraging exploration of other modes.

[Figure: GAN Training Dynamics - Stable vs Mode Collapse]

Solutions to GAN Training Instability

Technique	Key Idea	Paper
WGAN	Replace JSD with Wasserstein distance, clip weights	Arjovsky et al., 2017
WGAN-GP	Gradient penalty instead of weight clipping	Gulrajani et al., 2017
Spectral Normalization	Normalize discriminator weights by spectral norm	Miyato et al., 2018
Progressive Growing	Start with low-res, gradually increase	Karras et al., 2018
StyleGAN	Style-based generator with mapping network	Karras et al., 2019
Minibatch Discrimination	Discriminator sees batches, not individuals	Salimans et al., 2016

Wasserstein GAN (WGAN)

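The key insight: When the real and generated distributions have non-overlapping support (common in high dimensions), JSD saturates at its maximum value $\log 2$ no matter how far apart the distributions are, and the KL divergence is infinite. A constant divergence gives the generator zero gradient - it cannot learn.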

The Wasserstein distance (Earth Mover's Distance) provides a meaningful gradient even when distributions do not overlap:

$W(p_{\text{data}}, p_G) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_G)} 𝔼_{(x,y) \sim \gamma}[\|x - y\|]$

By the Kantorovich-Rubinstein duality:

$W(p_{\text{data}}, p_G) = \sup_{\|f\|_L \leq 1} 𝔼_{x \sim p_{\text{data}}}[f(x)] - 𝔼_{x \sim p_G}[f(x)]$

Where the supremum is over all 1-Lipschitz functions. The discriminator (now called "critic") approximates this Lipschitz function.
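A one-dimensional toy case shows the difference in behaviour. The sketch below places a "real" point mass at 0 and a "fake" point mass at increasing distances on a common grid. JSD stays at $\log 2$ regardless of distance, while the Wasserstein distance tracks how far the mass has to move. The grid, the point masses, and the CDF-based formula for $W_1$ in 1D are assumptions made for this illustration.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence for discrete distributions (0 * log 0 := 0)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q, grid):
    """W1 on a shared, evenly spaced 1D grid: integral of |CDF_p - CDF_q|."""
    dx = grid[1] - grid[0]
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx

grid = np.linspace(0.0, 10.0, 101)        # spacing 0.1

def point_mass(idx):
    p = np.zeros_like(grid)
    p[idx] = 1.0
    return p

real = point_mass(0)
for idx in (10, 50, 90):                  # fake mass at x = 1, 5, 9
    fake = point_mass(idx)
    print(grid[idx], jsd(real, fake), wasserstein_1d(real, fake, grid))
```

Every row prints the same JSD, so a generator moving its mass toward the data receives no signal from JSD, whereas the Wasserstein distance decreases steadily.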

Enforcing the Lipschitz constraint:

Weight clipping (WGAN): Clip all weights to $[-c, c]$ . Simple but leads to capacity underuse and exploding/vanishing gradients.
Gradient penalty (WGAN-GP): Add $\lambda 𝔼[(\|\nabla_{\hat{x}} D(\hat{x})\| - 1)^2]$ to the loss, where $\hat{x}$ is a random interpolation between real and fake samples.
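Spectral normalization: Divide each weight matrix by its spectral norm (largest singular value). This bounds each layer's Lipschitz constant by 1, and therefore the whole network's when the activations are 1-Lipschitz (e.g. ReLU), without any penalty term.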
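As one concrete instance of the third option, the sketch below estimates the largest singular value with power iteration and rescales the matrix. It is a standalone NumPy illustration with a hypothetical function name and a random weight matrix. Practical implementations typically run a single power-iteration step per training update and reuse the vector `u` across updates, rather than iterating to convergence as done here.

```python
import numpy as np

def spectral_normalize(w, n_iters=100, seed=0):
    """Divide w by a power-iteration estimate of its largest singular value."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(w.shape[0])
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v                 # estimate of the spectral norm
    return w / sigma, sigma

w = np.random.default_rng(1).standard_normal((64, 32))   # stand-in layer weights
w_sn, sigma_est = spectral_normalize(w)
print(sigma_est, np.linalg.norm(w, 2))   # estimate vs exact spectral norm
print(np.linalg.norm(w_sn, 2))           # should be close to 1
```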

Instant Rejection

Never say "WGAN just removes the log from the GAN loss." The mathematical change is fundamental - switching from JSD to Wasserstein distance, which requires the Lipschitz constraint. The removal of log is a consequence, not the cause. Saying this suggests you memorized a formula without understanding the theory.

Part 3 - Diffusion Models

The Core Idea

Diffusion models learn to reverse a gradual noising process. Start with clean data $x_0$ , add noise step by step until you get pure Gaussian noise $x_T$ , then learn a neural network that reverses each step.

Forward Process (Adding Noise)

The forward process is a fixed Markov chain that gradually adds Gaussian noise:

$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)$

Where $\beta_t$ is the noise variance added at step $t$, set by a schedule of small values (in Ho et al., 2020, increasing linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ with $T = 1000$).

Key property: We can sample $x_t$ directly from $x_0$ without iterating through all steps:

$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) I)$

Where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ .

This means: $x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon$ , where $\epsilon \sim \mathcal{N}(0, I)$ .

At $t = T$ (with large $T$ ), $\bar{\alpha}_T \approx 0$ and $x_T \approx \epsilon$ - pure noise.
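The closed-form jump is short to write down in code. This sketch assumes the linear schedule quoted above with $T = 1000$ and uses random arrays as a stand-in for data scaled to $[-1, 1]$. The name `q_sample` is a common convention but is defined here, not imported.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear schedule from the text
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)            # alpha_bar[t-1] = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) in closed form; t is 1-indexed as in the text."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 8))  # stand-in for data scaled to [-1, 1]
eps = rng.standard_normal(x0.shape)

x_mid = q_sample(x0, 500, eps)            # partially noised
x_T = q_sample(x0, T, eps)                # essentially pure noise
print(alpha_bar[0], alpha_bar[499], alpha_bar[-1])
```

With this schedule $\bar{\alpha}_T$ comes out on the order of $10^{-5}$, so `x_T` is numerically almost identical to `eps`.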

[Figure: DDPM - Forward Noising Process and Learned Reverse Denoising]

Reverse Process (Denoising)

The reverse process is parameterized by a neural network:

$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$

The network predicts the mean $\mu_\theta(x_t, t)$ of the denoising step at each timestep $t$ .
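For reference, Ho et al. (2020) do not output the mean directly. They parameterize it through a noise-prediction network $\epsilon_\theta$ , using the schedule quantities $\alpha_t$ , $\bar{\alpha}_t$ and $\beta_t$ defined for the forward process:

$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t) \right)$

A sampling step then draws $x_{t-1} = \mu_\theta(x_t, t) + \sigma_t \, \xi$ with $\xi \sim \mathcal{N}(0, I)$ (no noise is added at the final step). Setting $\sigma_t^2 = \beta_t$ is one of the choices used in that paper.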

DDPM Training Objective (Ho et al., 2020)

After simplification, the training objective reduces to:

$L_{\text{simple}} = 𝔼_{t, x_0, \epsilon}\left[\| \epsilon - \epsilon_\theta(x_t, t) \|^2\right]$

Where $\epsilon_\theta$ is the neural network predicting the noise that was added. The training procedure:

Sample a clean image $x_0$ from the dataset
Sample a random timestep $t \sim \text{Uniform}(\{1, \dots, T\})$
Sample noise $\epsilon \sim \mathcal{N}(0, I)$
Compute noisy image: $x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon$
Train the network to predict $\epsilon$ from $x_t$ and $t$ : minimize $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$
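The five steps above can be sketched as a single loss function. The version below is NumPy only and takes the denoiser as a plain callable, so it shows the data flow rather than an actual optimizer step. `zero_model` is a hypothetical placeholder standing in for $\epsilon_\theta$ , and the batch is random data.

```python
import numpy as np

def ddpm_loss(eps_model, x0, alpha_bar, rng):
    """One Monte Carlo estimate of L_simple for a batch x0 of shape (B, D)."""
    batch = x0.shape[0]
    T = len(alpha_bar)
    t = rng.integers(1, T + 1, size=batch)          # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)             # eps ~ N(0, I)
    a = alpha_bar[t - 1][:, None]                   # broadcast over features
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps  # closed-form forward sample
    eps_hat = eps_model(x_t, t)                     # network's noise prediction
    return np.mean(np.sum((eps - eps_hat) ** 2, axis=1))

def zero_model(x_t, t):
    """Hypothetical placeholder 'network' that always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.uniform(-1.0, 1.0, size=(512, 16))         # stand-in batch
loss = ddpm_loss(zero_model, x0, alpha_bar, rng)
print(loss)
```

A predictor that always outputs zero scores roughly the data dimension (16 here), which is a useful baseline: a trained denoiser should fall well below it.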

60-Second Answer

"Diffusion models learn to generate data by reversing a gradual noising process. The forward process adds Gaussian noise step by step until the data becomes pure noise. A neural network learns to reverse each step - given a noisy image and the noise level, it predicts the noise that was added. At generation time, we start with random noise and iteratively denoise to produce a clean image. The training is simple - just mean squared error between the predicted and actual noise. Unlike GANs, there is no adversarial training and no mode collapse. Unlike VAEs, there is no posterior approximation or blurriness. The main drawback is slow sampling - generating an image requires hundreds to thousands of network evaluations."

Score Matching Connection

The noise prediction $\epsilon_\theta(x_t, t)$ is closely related to the score function - the gradient of the log probability:

$\nabla_{x_t} \log q(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$

This connects DDPM to score-based generative models (Song & Ermon, 2019). The network is learning to estimate the direction toward higher probability at each noise level.

Why this matters: Score matching provides a clean theoretical framework. The denoising objective is equivalent to estimating the score function at every noise level, and Langevin dynamics (following the score with added noise) produces samples from the learned distribution.
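The relation can be checked in a case where everything is known in closed form. The sketch below assumes one-dimensional Gaussian data $x_0 \sim \mathcal{N}(0, s^2)$ , so the noised marginal is $\mathcal{N}(0, v)$ with $v = \bar{\alpha}_t s^2 + (1 - \bar{\alpha}_t)$ and score $-x_t / v$ . The MSE-optimal noise predictor is $𝔼[\epsilon \mid x_t]$ , which is linear in $x_t$ in this toy setting, so a least-squares slope stands in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
s2 = 4.0            # data variance: x0 ~ N(0, s2), a toy assumption
alpha_bar_t = 0.6   # illustrative noise level

n = 400_000
x0 = rng.normal(0.0, np.sqrt(s2), n)
eps = rng.standard_normal(n)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Least-squares fit eps ~ slope * x_t approximates the optimal predictor E[eps | x_t].
slope = np.sum(x_t * eps) / np.sum(x_t * x_t)

# True score of q(x_t) = N(0, v) is (-1/v) * x_t.
v = alpha_bar_t * s2 + (1.0 - alpha_bar_t)
score_coeff_true = -1.0 / v
score_coeff_from_eps = -slope / np.sqrt(1.0 - alpha_bar_t)
print(score_coeff_true, score_coeff_from_eps)
```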

Sampling Methods

Method	Steps	Speed	Quality	How It Works
DDPM	1000	Slow	High	Original reverse process, one step at a time
DDIM	50-100	Medium	High	Deterministic sampling, skip steps
DPM-Solver	10-25	Fast	High	ODE solver for the reverse process
Consistency Models	1-2	Very fast	Good	Distilled to predict final output directly
Latent Diffusion	20-50	Fast	High	Diffusion in latent space, not pixel space
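To show how a sampler can skip steps, here is a sketch of the deterministic DDIM update (the $\eta = 0$ case). It first forms an estimate of $x_0$ from the noise prediction, then re-noises that estimate to the target level. The check uses an oracle that returns the true noise, which is an assumption for testing only. A real sampler would call the trained network instead.

```python
import numpy as np

def ddim_step(x_t, eps_hat, a_t, a_prev):
    """Deterministic DDIM update (eta = 0) from noise level a_t to a_prev.

    a_t and a_prev are values of alpha_bar at the current and target
    timesteps; the target may be many steps earlier.
    """
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps_hat

# Sanity check with an oracle noise prediction.
rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.uniform(-1.0, 1.0, size=8)
eps = rng.standard_normal(8)

a_t, a_prev = alpha_bar[799], alpha_bar[399]       # jump from t=800 to t=400
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
x_prev = ddim_step(x_t, eps, a_t, a_prev)
expected = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
```

With a perfect noise estimate, one jump of 400 steps lands exactly where the forward process would have put the sample. In practice the error in the network's estimate is what limits how few steps can be used.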

Why Diffusion Models Avoid Mode Collapse

GANs suffer from mode collapse because the generator optimizes against a single discriminator - it can find shortcuts. Diffusion models avoid this because:

No adversarial training: The objective is simple regression (predict the noise).
Multi-scale denoising: The network must handle every noise level, from nearly clean to pure noise. This forces it to learn the full data distribution at every scale.
Stable training: MSE loss with random timestep sampling provides consistent, low-variance gradients.

Common Trap

Do NOT say "diffusion models are just VAEs with many latent variables." While there is a mathematical connection (the DDPM ELBO can be written as a sequence of KL terms), the training procedure and the resulting model behavior are fundamentally different. Diffusion models do not learn an encoder or a latent representation - they learn to denoise. The connection is theoretical, not practical.

Part 4 - Flow-Based Models

Core Idea

Flow-based models learn an invertible transformation $f_\theta$ from a simple distribution (Gaussian) to the data distribution:

$x = f_\theta(z), \quad z = f_\theta^{-1}(x), \quad z \sim \mathcal{N}(0, I)$

Because $f_\theta$ is invertible, we can compute exact log-likelihoods via the change of variables formula:

$\log p_\theta(x) = \log p(z) + \log \left|\det \frac{\partial f_\theta^{-1}}{\partial x}\right|$

$= \log p(f_\theta^{-1}(x)) + \log \left|\det J_{f_\theta^{-1}}(x)\right|$
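The simplest invertible map makes the formula tangible. For a scalar affine flow $x = a z + b$ , the change-of-variables density must match the Gaussian $\mathcal{N}(b, a^2)$ evaluated directly. The sketch below checks this. The function names and parameter values are chosen for illustration.

```python
import numpy as np

def standard_normal_logpdf(z):
    return -0.5 * (z**2 + np.log(2.0 * np.pi))

def affine_flow_logpdf(x, a, b):
    """log p(x) for x = f(z) = a*z + b with z ~ N(0, 1), via change of variables."""
    z = (x - b) / a                      # f^{-1}(x)
    log_det_inv = -np.log(np.abs(a))     # log |d f^{-1} / dx| = log |1/a|
    return standard_normal_logpdf(z) + log_det_inv

def normal_logpdf(x, mean, std):
    return -0.5 * (((x - mean) / std) ** 2 + np.log(2.0 * np.pi)) - np.log(std)

x = np.linspace(-3.0, 5.0, 9)
a, b = 2.0, 1.0
flow_lp = affine_flow_logpdf(x, a, b)
direct_lp = normal_logpdf(x, mean=b, std=abs(a))
```

Coupling layers in RealNVP-style models apply the same idea per dimension, with the scale and shift produced by a network, which keeps the log-determinant a cheap sum of log-scales.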

Key Models

Model	Key Innovation	Limitation
NICE (2014)	Additive coupling layers	Limited expressiveness
RealNVP (2017)	Affine coupling layers	Still limited by coupling structure
Glow (2018)	1x1 convolutions + squeeze	Better quality but expensive
Neural ODE / FFJORD (2018-2019)	Continuous-time flows	Slow numerical ODE solves for training and sampling

Why Flows Are Less Popular

Invertibility constraint: The requirement that $f_\theta$ be invertible limits the architecture. You cannot use standard residual connections, attention, or convolutions freely.
Computational cost: Computing the Jacobian determinant is expensive, even with clever architectures.
Sample quality: Despite exact likelihoods, flow models generally produce lower-quality samples than GANs or diffusion models at the same model size.

Where flows still shine: Density estimation tasks where exact likelihoods are needed, and as components of larger systems (e.g., normalizing flows in VAE posteriors).

Part 5 - The Grand Comparison

Head-to-Head Comparison

Property	VAE	GAN	Diffusion	Autoregressive	Flow
Training objective	ELBO (lower bound)	Minimax game	Denoising MSE	Exact log-likelihood	Exact log-likelihood
Training stability	Stable	Unstable	Very stable	Stable	Stable
Sample quality	Blurry	Sharp	Excellent	Excellent	Moderate
Mode coverage	Good	Poor (mode collapse)	Excellent	Excellent	Good
Sampling speed	Fast (one pass)	Fast (one pass)	Slow (iterative)	Slow (sequential)	Fast (one pass)
Exact likelihood	Lower bound only	No	Lower bound only	Yes	Yes
Latent space	Yes (continuous)	Yes (input noise)	No explicit	No	Yes (invertible)
Diversity control	Temperature on z	Truncation trick	Guidance scale	Temperature	Temperature on z
Key weakness	Blurry outputs	Mode collapse	Slow generation	Sequential, slow	Limited architecture

When to Choose Each

[Figure: Generative Model Selection - Decision Framework]

Company Variation

At OpenAI and Stability AI, diffusion model expertise is almost mandatory. At Meta, GAN knowledge is still valued (StyleGAN for faces, GAN upscaling). At Google, autoregressive models dominate (Gemini), although Imagen uses diffusion and the team also works on autoregressive image generation. For startups, understanding the speed-quality tradeoff matters most, since they often need fast inference.

Part 6 - Modern Applications

Text-to-Image: The Stable Diffusion Architecture

The key innovation in Stable Diffusion (Rombach et al., 2022) is latent diffusion: run the diffusion process in a compressed latent space rather than pixel space.

Autoencoder: Compress images from pixel space ( $256 \times 256 \times 3$ ) to latent space ( $32 \times 32 \times 4$ ) - a 48x compression
Diffusion model: Operate the forward/reverse process in latent space - much faster
Conditioning: Use cross-attention with text embeddings (from CLIP) to guide generation

Classifier-Free Guidance (CFG): At generation time, run the model twice - once with the text condition and once without. Amplify the difference:

$\epsilon_{\text{guided}} = \epsilon_{\text{uncond}} + s \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$

Where $s$ is the guidance scale. Higher $s$ means stronger adherence to the text prompt but less diversity.
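In code the guidance rule is a one-line extrapolation. The sketch below applies it to small made-up vectors standing in for the two noise predictions. The function name is hypothetical.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0, -1.0])   # stand-in for the unconditional output
eps_cond = np.array([1.0, 1.0, 0.0])      # stand-in for the text-conditioned output

no_guidance = classifier_free_guidance(eps_uncond, eps_cond, 0.0)
plain_cond = classifier_free_guidance(eps_uncond, eps_cond, 1.0)
amplified = classifier_free_guidance(eps_uncond, eps_cond, 7.5)
```

Setting $s = 0$ recovers the unconditional prediction and $s = 1$ the plain conditional one. Values above 1 push past the conditional prediction along the direction that separates the two.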

Video Generation

Video diffusion models extend image diffusion along the temporal axis:

Temporal attention: Add temporal self-attention layers to the 2D U-Net to attend across frames
Temporal convolutions: 3D convolutions or pseudo-3D (2D spatial + 1D temporal) convolutions
Frame conditioning: Generate keyframes first, then interpolate

Major models: Sora (OpenAI), Veo (Google), Gen-3 (Runway), Kling (Kuaishou).

3D Generation

Diffusion models for 3D content:

Point-E / Shap-E (OpenAI): Diffusion in point cloud or implicit function space
DreamFusion: Use 2D diffusion model as a prior to optimize a 3D representation (NeRF) via Score Distillation Sampling
Zero-1-to-3: Generate novel views from a single image using view-conditioned diffusion

Practice Problems

Problem 1: ELBO Derivation

Derive the ELBO from scratch, starting from $\log p_\theta(x)$ . Show every step, including where Jensen's inequality is applied and what the gap between the ELBO and the true log-likelihood represents.

Hint 1 - Direction

Start by writing $\log p_\theta(x) = \log \int p_\theta(x,z) dz$ . Introduce $q_\phi(z|x)$ by multiplying and dividing. Apply Jensen's inequality.

Hint 2 - Insight

The gap between $\log p_\theta(x)$ and the ELBO is exactly $\text{KL}(q_\phi(z|x) \| p_\theta(z|x))$ - the KL divergence between the approximate and true posterior. This is always non-negative, confirming the ELBO is indeed a lower bound. As $q_\phi$ becomes a better approximation of the true posterior, the gap shrinks.

Hint 3 - Full Solution + Rubric

Complete derivation:

Starting from:

$\log p_\theta(x) = \log \int p_\theta(x, z) \, dz$

Introduce $q_\phi(z|x)$ :

$= \log \int \frac{p_\theta(x, z)}{q_\phi(z|x)} q_\phi(z|x) \, dz$

$= \log 𝔼_{q_\phi(z|x)}\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$

By Jensen's inequality ( $\log 𝔼[X] \geq 𝔼[\log X]$ ):

$\geq 𝔼_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \text{ELBO}$

Expanding $p_\theta(x,z) = p_\theta(x|z) p(z)$ :

$\text{ELBO} = 𝔼_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \| p(z))$

The gap: Alternatively, we can write:

$\log p_\theta(x) = \text{ELBO} + \text{KL}(q_\phi(z|x) \| p_\theta(z|x))$

Since KL divergence is always $\geq 0$ , the ELBO $\leq \log p_\theta(x)$ . The gap is the KL between the approximate posterior $q_\phi(z|x)$ and the true posterior $p_\theta(z|x)$ .

To prove the gap identity, start from:

$\text{KL}(q_\phi(z|x) \| p_\theta(z|x)) = 𝔼_{q_\phi}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]$

$= 𝔼_{q_\phi}\left[\log q_\phi(z|x) - \log p_\theta(z|x)\right]$

$= 𝔼_{q_\phi}\left[\log q_\phi(z|x) - \log \frac{p_\theta(x,z)}{p_\theta(x)}\right]$

$= 𝔼_{q_\phi}\left[\log q_\phi(z|x) - \log p_\theta(x,z)\right] + \log p_\theta(x)$

$= -\text{ELBO} + \log p_\theta(x)$

Rearranging: $\log p_\theta(x) = \text{ELBO} + \text{KL}(q_\phi(z|x) \| p_\theta(z|x))$ .
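The gap identity can be confirmed numerically on a model small enough to enumerate. The sketch below uses a three-state discrete latent and a single fixed observation, with made-up probabilities, so the evidence and the true posterior are available exactly.

```python
import numpy as np

# Toy model: discrete latent z in {0, 1, 2} and one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])          # p(z)
likelihood = np.array([0.10, 0.60, 0.30])  # p(x | z) at the observed x
q = np.array([0.2, 0.5, 0.3])              # an arbitrary approximate posterior q(z | x)

joint = likelihood * prior                 # p(x, z)
evidence = joint.sum()                     # p(x)
posterior = joint / evidence               # p(z | x)

elbo = np.sum(q * (np.log(joint) - np.log(q)))
gap = np.sum(q * (np.log(q) - np.log(posterior)))   # KL(q || p(z | x))
log_evidence = np.log(evidence)

# With q equal to the true posterior the bound is tight.
elbo_tight = np.sum(posterior * (np.log(joint) - np.log(posterior)))
print(log_evidence, elbo, gap, elbo_tight)
```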

Scoring Rubric:

Strong Hire: Complete derivation with Jensen's inequality clearly identified, both forms of the ELBO (reconstruction + KL), derives the gap identity, explains that the gap is the posterior approximation quality.
Lean Hire: Gets the ELBO formula correct but skips the gap derivation or cannot explain what the gap represents.
No Hire: Cannot derive the ELBO or confuses the direction of the KL divergence.

Problem 2: GAN Optimal Discriminator

Prove that the optimal discriminator for a fixed generator is $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_G(x))$ . Then show that substituting $D^*$ into the GAN objective gives $-\log 4 + 2 \cdot \text{JSD}(p_{\text{data}} \| p_G)$ .

Hint 1 - Direction

For the optimal D, differentiate the integrand of the GAN objective with respect to $D(x)$ and set to zero. For the JSD connection, substitute $D^*$ and use the definition of JSD.

Hint 2 - Insight

The integrand is $p_{\text{data}} \log D + p_G \log(1-D)$ , which is maximized at $D = p_{\text{data}} / (p_{\text{data}} + p_G)$ . Substituting gives: two terms that each look like KL divergences to the mixture distribution $M = (p_{\text{data}} + p_G)/2$ .

Hint 3 - Full Solution + Rubric

Part 1: Optimal discriminator

The GAN objective:

$V(D,G) = \int p_{\text{data}}(x) \log D(x) + p_G(x) \log(1 - D(x)) \, dx$

For each $x$ , maximizing the integrand $a \log y + b \log(1-y)$ where $a = p_{\text{data}}(x)$ , $b = p_G(x)$ :

$\frac{a}{y} - \frac{b}{1-y} = 0 \implies y^* = \frac{a}{a+b} = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$

Part 2: JSD connection

Substituting $D^*$ , let $M = (p_{\text{data}} + p_G)/2$ :

$V(D^*, G) = \int p_{\text{data}} \log \frac{p_{\text{data}}}{p_{\text{data}} + p_G} + p_G \log \frac{p_G}{p_{\text{data}} + p_G} \, dx$

$= \int p_{\text{data}} \log \frac{p_{\text{data}}}{2M} + p_G \log \frac{p_G}{2M} \, dx$

$= \int p_{\text{data}} \log \frac{p_{\text{data}}}{M} - p_{\text{data}} \log 2 + p_G \log \frac{p_G}{M} - p_G \log 2 \, dx$

$= \text{KL}(p_{\text{data}} \| M) + \text{KL}(p_G \| M) - 2 \log 2$

$= 2 \cdot \text{JSD}(p_{\text{data}} \| p_G) - \log 4$

At the optimum where $p_G = p_{\text{data}}$ : JSD = 0, so $V = -\log 4$ .

Scoring Rubric:

Strong Hire: Complete both derivations, identifies the JSD, states the global optimum condition.
Lean Hire: Gets the optimal discriminator but cannot complete the JSD derivation.
No Hire: Cannot set up the optimization problem for D.

Problem 3: Diffusion Model Design

You need to design a diffusion model for generating 1024x1024 images. Running DDPM directly in pixel space would be extremely slow. Design an efficient architecture and explain each choice.

Hint 1 - Direction

Think about the Stable Diffusion approach: compress to latent space first, then run diffusion there. What autoencoder would you use? What U-Net architecture for the denoiser? How would you handle conditioning?

Hint 2 - Insight

Latent diffusion: (1) Train a VQ-VAE or KL-VAE to compress 1024x1024x3 to 128x128x4 or 64x64x4. (2) Train a U-Net denoiser in latent space with attention at multiple scales. (3) Use cross-attention for text conditioning. (4) Use classifier-free guidance for quality. (5) DDIM or DPM-Solver for fast sampling in 20-50 steps.

Hint 3 - Full Solution + Rubric

Architecture design:

Stage 1: Autoencoder

KL-regularized autoencoder (similar to Stable Diffusion's VAE)
Encode 1024x1024x3 to 128x128x4 (8x downsampling per side: 64x fewer spatial positions, 48x fewer values)
Train with reconstruction loss + perceptual loss + adversarial loss + small KL penalty
This autoencoder is trained once and frozen
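Because "compression factor" can refer to spatial positions or to total values, it helps to compute both. The helper below is a hypothetical utility for that arithmetic, applied to the shapes in this design.

```python
def compression_factors(image_shape, latent_shape):
    """Return (per-side downsampling, spatial-position ratio, total-value ratio)."""
    h, w, c = image_shape
    lh, lw, lc = latent_shape
    per_side = h // lh
    spatial = (h * w) / (lh * lw)
    values = (h * w * c) / (lh * lw * lc)
    return per_side, spatial, values

per_side, spatial, values = compression_factors((1024, 1024, 3), (128, 128, 4))
print(per_side, spatial, values)
```

The spatial ratio is the one that drives convolution and attention cost in the denoiser, while the total-value ratio describes how much information the autoencoder has to squeeze through the bottleneck.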

Stage 2: Latent Diffusion Model

U-Net architecture with:
- ResNet blocks at each resolution level
- Self-attention at 64x64, 32x32, 16x16 resolutions
- Cross-attention for text conditioning at the same resolutions
- Timestep conditioning via sinusoidal embeddings + AdaGN (Adaptive Group Norm)
Text encoder: CLIP ViT-L/14 or T5-XXL for text embeddings
Noise schedule: linear or cosine, T=1000 training steps

Stage 3: Conditioning and Guidance

Classifier-free guidance: train with 10% unconditional (drop text embedding)
Guidance scale s=7.5 to 12 at inference

Stage 4: Fast Sampling

DDIM with 50 steps, or DPM-Solver++ with 20 steps
Optionally: progressive distillation to 4-8 steps

Memory and compute considerations:

Working in 128x128x4 instead of 1024x1024x3 reduces per-step compute by ~64x
U-Net with ~900M parameters (similar to Stable Diffusion v1; SDXL's U-Net is considerably larger, at roughly 2.6B)
Training: 256 A100 GPUs, ~2 weeks on 2B image-text pairs

Scoring Rubric:

Strong Hire: Designs latent diffusion (not pixel-space), specifies autoencoder, U-Net architecture with attention, text conditioning via cross-attention, classifier-free guidance, fast sampling method, and discusses compute.
Lean Hire: Suggests latent diffusion but missing details on conditioning, guidance, or sampling speed.
No Hire: Proposes pixel-space diffusion for 1024x1024 or cannot describe the architecture.

Problem 4: Mode Collapse Diagnosis

You are training a GAN to generate diverse human faces. After 50K steps, the discriminator loss drops to near zero and the generator produces realistic but nearly identical faces (same pose, similar features). Diagnose and fix.

Hint 1 - Direction

Near-zero discriminator loss with low-diversity generation is classic mode collapse. Think about what signals the generator is receiving and why it converges to a single mode.

Hint 2 - Insight

The discriminator became too strong too fast - it can easily tell real from fake, but the generator found one mode that partially fools it and collapses there. Fixes: use WGAN-GP or spectral normalization to stabilize the discriminator, add minibatch discrimination for diversity, reduce discriminator update frequency, or switch to a diffusion model.

Hint 3 - Full Solution + Rubric

Diagnosis:

Mode collapse with discriminator domination. The generator finds the single face that maximizes $D(G(z))$ and produces only that face. The discriminator then achieves near-perfect accuracy because the generator's output is a single point (or a narrow mode).

Root cause analysis:

Discriminator is too powerful relative to generator - may have more capacity or be updated more frequently
Standard GAN loss (JSD) provides zero gradient when distributions do not overlap
No explicit diversity encouragement in the objective

Fixes (in order of priority):

Switch to WGAN-GP: Wasserstein distance provides meaningful gradients even when distributions do not overlap. Add gradient penalty ( $\lambda = 10$ ).
Spectral normalization on discriminator: Constrains discriminator's Lipschitz constant, preventing it from being too confident.
Balance update ratio: Update generator 2x for every discriminator update, or use learning rate imbalance (lower D learning rate).
Minibatch discrimination: Allow discriminator to see statistics across the batch. If all generated faces are similar, the discriminator can detect this.
R1 regularization: Add gradient penalty on real data only: $\frac{\gamma}{2} 𝔼[\|\nabla_x D(x)\|^2]$ . Prevents discriminator from overconfidently classifying near the data manifold.
Progressive growing: Start generating at low resolution (4x4), gradually increase. Low-resolution modes are easier to cover.
Architecture changes: StyleGAN-style generator with mapping network + style injection tends to be more stable than basic generators.

If all else fails: Switch to a diffusion model. Diffusion models do not suffer from mode collapse by design and consistently produce diverse, high-quality face images.

Scoring Rubric:

Strong Hire: Correctly diagnoses mode collapse from the symptoms, explains the root cause (discriminator/generator imbalance + JSD gradient issues), provides 3+ actionable fixes with justification, mentions WGAN-GP or spectral norm, considers switching to diffusion.
Lean Hire: Identifies mode collapse but provides only 1-2 generic fixes without explaining why they work.
No Hire: Cannot identify mode collapse from the description or suggests "train longer."

Problem 5: Generative Model Selection

Your startup needs to build a product that generates custom product images from text descriptions. The images must be 512x512, generation must take less than 2 seconds, and you have a limited compute budget for both training and inference. Which generative model family do you choose and why?

Hint 1 - Direction

Consider the tradeoffs: quality, speed, training cost, and text conditioning capability. The 2-second latency constraint eliminates some options.

Hint 2 - Insight

Standard diffusion (DDPM) with 1000 steps is too slow. But latent diffusion with fast samplers (20 steps DPM-Solver) or distilled models (4-8 steps) can meet the 2-second target. GANs are fast but harder to condition on text and have diversity issues. VAEs are too blurry. Autoregressive models (like DALL-E 1) are too slow for 2 seconds.

Hint 3 - Full Solution + Rubric

Recommendation: Latent Diffusion Model with fast sampling

Why not other options:

GAN: Fast inference but text conditioning is difficult to implement well, mode collapse risks with diverse product categories, and training instability with limited compute.
VAE: Too blurry for product images where detail matters.
Autoregressive: Too slow - generating a 512x512 image token by token takes many seconds.
Flow: Lower quality than diffusion, limited text conditioning support.

Latent diffusion implementation:

Autoencoder: Fine-tune a pre-trained SD VAE (or use it directly). Compress 512x512x3 to 64x64x4.
Denoiser: Fine-tune Stable Diffusion's U-Net on your product image dataset. This dramatically reduces training cost compared to training from scratch.
Text conditioning: Use CLIP text encoder (already part of SD) plus potentially a domain-specific text encoder fine-tuned on product descriptions.
Fast sampling for 2-second budget:
- DDIM with 20 steps: ~1.5 seconds on A10G GPU
- DPM-Solver++ with 15 steps: ~1.1 seconds
- LCM (Latent Consistency Model) with 4 steps: ~0.4 seconds
- Distilled model: train consistency distillation for 4-8 step generation
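Infrastructure: Serve on A10G GPUs (~$1.00/hour). At the latencies above, one GPU produces roughly 0.7 to 2.5 images per second unbatched (1/1.5 s for 20-step DDIM up to 1/0.4 s for 4-step LCM). Batching can raise throughput, at the cost of higher per-request latency.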

Training budget: Fine-tuning SD on ~100K product images takes ~2-4 A100 GPU-days. Much cheaper than training from scratch.
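
As a rough sanity check on the numbers above, the sketch below computes the latent compression ratio and tests each sampler against the 2-second budget. The latencies and the $1.00/hour price are the estimates quoted in this solution, not measurements, and the variable names are illustrative assumptions.

```python
# Back-of-the-envelope check of the figures quoted in the solution.
# All latencies are the rough A10G estimates from the text, not benchmarks.

def numel(shape):
    """Number of elements in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n

pixel_shape = (512, 512, 3)   # RGB image
latent_shape = (64, 64, 4)    # SD VAE latent (8x spatial downsampling)
compression = numel(pixel_shape) / numel(latent_shape)

latency_budget_s = 2.0
gpu_usd_per_hour = 1.00
sampler_latency_s = {
    "DDIM, 20 steps": 1.5,
    "DPM-Solver++, 15 steps": 1.1,
    "LCM, 4 steps": 0.4,
}

report = {}
for name, latency in sampler_latency_s.items():
    report[name] = {
        "fits_budget": latency < latency_budget_s,
        "headroom_s": latency_budget_s - latency,
        # One request at a time, no batching.
        "images_per_s_unbatched": 1.0 / latency,
        # GPU time cost only; ignores idle time and other overhead.
        "usd_per_1k_images": gpu_usd_per_hour / 3600.0 * latency * 1000.0,
    }

print(f"latent has {compression:.0f}x fewer elements than the image")
for name, row in report.items():
    print(name, row)
```

Under these assumptions all three samplers fit the budget, and the latent holds 48x fewer elements than the pixel image, which is where most of the speedup over pixel-space diffusion comes from.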

Scoring Rubric:

Strong Hire: Chooses latent diffusion with clear justification against alternatives, mentions fine-tuning from a pre-trained checkpoint, provides specific fast sampling methods with latency estimates, considers inference cost.
Lean Hire: Chooses diffusion but does not address the 2-second constraint or does not mention fine-tuning from a pre-trained model.
No Hire: Chooses a GAN or VAE without addressing their fundamental limitations for this use case.

Interview Cheat Sheet

Concept	Key Formula	One-Liner	Red Flag
VAE ELBO	$𝔼_{q(z \mid x)}[\log p(x \mid z)] - \text{KL}(q(z \mid x) \| p(z))$	Reconstruction term minus KL regularizer; lower-bounds $\log p(x)$
Reparameterization trick	$z = \mu + \sigma \odot \epsilon$	Makes sampling differentiable w.r.t. encoder	"Makes the model differentiable"
GAN objective	$\min_G \max_D 𝔼[\log D(x)] + 𝔼[\log(1-D(G(z)))]$	Generator fools discriminator	"Generator maximizes" (it minimizes)
Optimal discriminator	$D^* = p_{\text{data}} / (p_{\text{data}} + p_G)$	Ratio of real to total density	Not being able to derive this
Mode collapse	Generator produces low-diversity output	Adversarial training instability	"Just train longer to fix it"
WGAN	Wasserstein distance + Lipschitz constraint	Meaningful gradients everywhere	"WGAN removes the log"
DDPM objective	$𝔼[\|\epsilon - \epsilon_\theta(x_t, t)\|^2]$	Predict the noise that was added	"Diffusion predicts clean images"
Forward process	$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$	Gradually add noise to data	Not knowing you can jump to any t directly
Score function	$\nabla_x \log p(x)$	Direction toward higher probability	Confusing score with loss
Classifier-free guidance	$\epsilon_{\text{guided}} = \epsilon_u + s(\epsilon_c - \epsilon_u)$	Amplify conditional signal	"Higher guidance always better"
Latent diffusion	Diffusion in compressed space	Speed via compression	"Same as pixel diffusion"
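
Several rows of the cheat sheet are one-line formulas that translate directly into code. The sketch below is a minimal illustration in plain Python on lists of floats, not a training-ready implementation. The function names are hypothetical, and the linear beta schedule (1e-4 to 0.02 over 1000 steps, as in Ho et al., 2020) is an assumption made only to produce concrete values of $\bar{\alpha}_t$.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """VAE sampling: z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_G(x)) at a single point x."""
    return p_data / (p_data + p_g)

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def q_sample(x0, eps, alpha_bar_t):
    """Forward process in closed form: jump from x_0 directly to x_t."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def ddpm_loss(eps, eps_pred):
    """Squared L2 error between true and predicted noise for one sample.
    The training objective averages this over data, timesteps, and noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

def cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_u + s * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

rng = random.Random(0)
alpha_bar = make_alpha_bar()
x0 = [rng.gauss(0.0, 1.0) for _ in range(16)]   # stand-in for a data point
eps = [rng.gauss(0.0, 1.0) for _ in range(16)]  # noise to be predicted
x_last = q_sample(x0, eps, alpha_bar[-1])       # almost pure noise at t = T
```

Two quick checks are worth doing by hand: with `scale=1.0`, `cfg` returns the conditional prediction and with `scale=0.0` the unconditional one, and at the final timestep `x_last` is nearly identical to `eps` because $\bar{\alpha}_T$ is close to zero.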

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Read this entire page
Derive the ELBO on paper without looking
Write the GAN minimax objective and prove the optimal discriminator
Write the DDPM forward process and training objective
Fill out the grand comparison table from memory

Day 3 - First Recall

Without notes, derive the ELBO and explain the gap
Give the "60-Second Answer" for diffusion models out loud, timed
Explain why mode collapse happens in GANs but not diffusion models
Explain the reparameterization trick and why it is needed

Day 7 - Connections

Compare all five generative model families in a table from memory
Explain the connection between DDPM noise prediction and score matching
Do Practice Problem 1 (ELBO derivation) on a whiteboard
Explain classifier-free guidance and how guidance scale affects quality/diversity

Day 14 - Application

Do Practice Problem 3 (diffusion model design) under timed conditions (15 minutes)
Do Practice Problem 5 (model selection for a startup) under timed conditions (10 minutes)
Explain the Stable Diffusion architecture end-to-end
Review any derivations you hesitated on

Day 21 - Mock Interview

Have someone ask: "Compare VAEs, GANs, and diffusion models - training objectives, strengths, weaknesses"
Time yourself: full answer should take 5-8 minutes
Do all 5 practice problems in sequence under timed conditions (55 minutes total)
Can you whiteboard the DDPM training algorithm from memory?

Key Takeaways

Every generative model family makes different tradeoffs. VAEs give latent spaces but blurry outputs. GANs give sharp images but mode collapse. Diffusion models give the best quality and diversity but are slow. Understanding these tradeoffs is more valuable than memorizing any single model.
The ELBO, minimax, and denoising objectives are the three pillars. If you can derive all three and explain their implications, you can handle any generative model interview question.
Diffusion models dominate current practice. They combine excellent sample quality, mode coverage, easy conditioning, and stable training. The main challenge - slow sampling - is being addressed rapidly through DDIM, consistency models, and distillation.
The practical answer is often "fine-tune a pre-trained diffusion model." For most applications, training from scratch is unnecessary. Understanding how to adapt Stable Diffusion or similar models to your domain is the most valuable practical skill.
Know the math, but also know the engineering. Interviewers at top labs want both the derivations and the practical knowledge - latent vs pixel space, classifier-free guidance parameters, sampling speed budgets, and inference cost calculations.
