
Generative Models - VAEs, GANs, Diffusion, and Beyond

Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, Applied Scientist, Computer Vision Engineer

The Real Interview Moment

You are in an OpenAI research engineer interview. The interviewer asks: "Compare VAEs, GANs, and diffusion models. For each, give me the training objective, the key mathematical insight, and when you would choose one over the others." You start with VAEs - "they maximize a variational lower bound on the log-likelihood..." - and the interviewer nods. You move to GANs - "minimax objective, the generator minimizes while the discriminator maximizes..." She interrupts: "Why does mode collapse happen in GANs but not in diffusion models? Derive the connection."

This is not a question you can answer with surface-level knowledge. It requires understanding the fundamental difference between adversarial training (which has an unstable equilibrium) and score-based denoising (which has a well-defined regression target at each noise level). Candidates who can only describe the architectures get a "lean hire." Candidates who can derive the objectives, explain the failure modes, and compare the paradigms at a mathematical level get a "strong hire."

This page gives you the complete mathematical foundation for every major generative model family, with the depth needed to handle follow-up questions at top AI labs.

What You Will Master

  • Derive the ELBO (Evidence Lower Bound) for VAEs from first principles
  • Explain the reparameterization trick and why it is necessary
  • State the GAN minimax objective and prove the optimal discriminator
  • Diagnose mode collapse, training instability, and solutions (WGAN, spectral norm)
  • Derive the forward and reverse diffusion processes in DDPM
  • Connect score matching to denoising diffusion
  • Compare VAE vs GAN vs Diffusion vs Autoregressive vs Flow models with precise tradeoffs
  • Explain modern applications: text-to-image, video generation, 3D generation

Self-Assessment: Where Are You Now?

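| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
| --- | --- | --- | --- | --- | --- | --- |
| Derive the ELBO for VAEs | | | | | | ___ |
| Explain reparameterization trick | | | | | | ___ |
| State GAN objective and optimal D | | | | | | ___ |
| Explain mode collapse and solutions | | | | | | ___ |
| Derive DDPM forward/reverse process | | | | | | ___ |
| Explain score matching | | | | | | ___ |
| Compare all generative model families | | | | | | ___ |
| Explain text-to-image diffusion | | | | | | ___ |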

Target: All 4s and 5s before your interview.

Part 1 - Variational Autoencoders (VAEs)

The Generative Modeling Problem

We have data $x$ drawn from an unknown distribution $p_{\text{data}}(x)$. We want to learn a model $p_\theta(x)$ that approximates this distribution so we can:

  1. Evaluate the probability of new data (density estimation)
  2. Generate new samples (sampling)
  3. Learn meaningful latent representations (representation learning)

The Latent Variable Model

A VAE introduces latent variables $z$ and models:

$$p_\theta(x) = \int p_\theta(x|z) \, p(z) \, dz$$

Where $p(z) = \mathcal{N}(0, I)$ is a simple prior and $p_\theta(x|z)$ is a neural network (decoder) that maps $z$ to $x$.

The problem: This integral is intractable - we cannot compute $p_\theta(x)$ directly because we would need to integrate over all possible $z$ values.

Deriving the ELBO

We want to maximize $\log p_\theta(x)$. Since we cannot compute it directly, we derive a lower bound.

Start with the log-likelihood and introduce an approximate posterior $q_\phi(z|x)$:

$$\log p_\theta(x) = \log \int p_\theta(x, z) \, dz$$

Multiply and divide by $q_\phi(z|x)$:

$$= \log \int \frac{p_\theta(x, z)}{q_\phi(z|x)} \, q_\phi(z|x) \, dz$$

By Jensen's inequality ($\log$ is concave):

$$\geq \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \, dz$$

Expanding $p_\theta(x, z) = p_\theta(x|z) \, p(z)$:

$$= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \| p(z))$$

This is the ELBO (Evidence Lower Bound):

$$\text{ELBO}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{Reconstruction term}} - \underbrace{\text{KL}(q_\phi(z|x) \| p(z))}_{\text{Regularization term}}$$
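To make the two ELBO terms concrete, here is a minimal sketch in plain Python. It assumes a diagonal-Gaussian encoder and the standard normal prior, for which the KL term has a well-known closed form. The function names `kl_to_standard_normal` and `elbo_estimate` are illustrative, not from any library.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def elbo_estimate(log_likelihood, mu, log_var):
    """Single-sample ELBO estimate: reconstruction term minus KL term.

    log_likelihood is log p_theta(x|z) evaluated at one z ~ q_phi(z|x).
    """
    return log_likelihood - kl_to_standard_normal(mu, log_var)

# An encoder that outputs exactly the prior pays no KL penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Shifting the mean away from zero increases the penalty.
print(kl_to_standard_normal([1.0], [0.0]))
```

In training, the negative of this estimate, averaged over a minibatch, would serve as the loss.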

[Figure: VAE - ELBO Training Flow with Reparameterization Trick]

60-Second Answer

"A VAE is a latent variable model that learns to generate data by encoding inputs into a latent space and decoding them back. The training objective is the ELBO - a lower bound on the log-likelihood - which has two terms: a reconstruction loss that encourages accurate decoding, and a KL divergence that regularizes the latent space to be close to a standard Gaussian. The key innovation is the reparameterization trick: instead of sampling z directly from the encoder's distribution (which is not differentiable), we express z as mu plus sigma times epsilon where epsilon is sampled from a standard normal. This allows gradients to flow through the sampling step."

The Reparameterization Trick

The problem: We need to compute $\nabla_\phi \mathbb{E}_{q_\phi(z|x)}[f(z)]$. But the expectation is over a distribution that depends on $\phi$ - we cannot simply backpropagate through a sampling operation.

The solution: Instead of sampling $z \sim \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$ directly, write:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

Now $z$ is a deterministic, differentiable function of $\phi$ (through $\mu$ and $\sigma$) and of noise $\epsilon$ that does not depend on $\phi$. Gradients with respect to $\phi$ flow through $\mu$ and $\sigma$ without issues.
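The trick can be sketched in a few lines of plain Python. This assumes the encoder outputs a mean and a log-variance per latent dimension (a common but not universal parameterization); `reparameterize` is an illustrative name.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Return z = mu + sigma * eps with eps ~ N(0, I), plus the eps used.

    All randomness lives in eps, so for a fixed eps the map from
    (mu, sigma) to z is deterministic: dz/dmu = 1 and dz/dsigma = eps.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    return z, eps

rng = random.Random(0)
mu, log_var = [2.0], [math.log(0.25)]  # sigma = 0.5
samples = [reparameterize(mu, log_var, rng)[0][0] for _ in range(10000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)  # should be close to 2.0 and 0.25
```

The samples still follow $\mathcal{N}(\mu, \sigma^2)$, and an autodiff framework can differentiate through `mu + sigma * eps` like any other arithmetic.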

Common Trap

Do NOT say "the reparameterization trick makes the VAE differentiable." The VAE is already differentiable - the encoder and decoder are neural networks. The reparameterization trick makes the sampling step differentiable with respect to the encoder parameters. Without it, you cannot backpropagate from the loss through the sample to the encoder; the alternative is a score-function (REINFORCE) estimator, which typically has much higher variance.

VAE Limitations

  1. Blurry outputs: The pixel-wise reconstruction loss (MSE or binary cross-entropy) encourages averaging over modes, producing blurry images rather than sharp ones.
  2. Posterior collapse: The KL term can dominate, causing the encoder to ignore the input and produce $q_\phi(z|x) \approx p(z)$. The decoder then ignores $z$ and generates the same output regardless of input.
  3. Limited expressiveness: The Gaussian assumption for both prior and posterior limits the complexity of the latent space.

Solutions: Beta-VAE (weight the KL term), VQ-VAE (discrete latent space), hierarchical VAEs (NVAE, VDVAE).

Part 2 - Generative Adversarial Networks (GANs)

The Minimax Objective

A GAN consists of two networks:

  • Generator $G(z)$: maps noise $z \sim p(z)$ to fake data
  • Discriminator $D(x)$: classifies data as real or fake

The training objective is a minimax game:

$$\min_G \max_D \, V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

The discriminator tries to maximize $V$ - assign high probability to real data and low probability to fake data. The generator tries to minimize $V$ - fool the discriminator into assigning high probability to fake data.

The Optimal Discriminator

For a fixed generator $G$, the optimal discriminator is:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$$

Proof: The discriminator maximizes:

$$V(D) = \int \left[ p_{\text{data}}(x) \log D(x) + p_G(x) \log(1 - D(x)) \right] dx$$

Taking the derivative with respect to $D(x)$ and setting it to zero:

$$\frac{p_{\text{data}}(x)}{D(x)} - \frac{p_G(x)}{1 - D(x)} = 0$$

Solving: $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_G(x))$.

Substituting the Optimal Discriminator

When $D = D^*$, the game value becomes:

$$V(D^*, G) = -\log 4 + 2 \cdot \text{JSD}(p_{\text{data}} \| p_G)$$

Where JSD is the Jensen-Shannon Divergence. This means the generator is minimizing the JSD between the real and generated distributions. The global optimum is $p_G = p_{\text{data}}$, where $\text{JSD} = 0$ and $V = -\log 4$.
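Both results are easy to sanity-check numerically. The sketch below assumes a toy setting with three discrete outcomes, so the integral becomes a sum; the probability values and helper names are made up for illustration.

```python
import math

def gan_value(p_data, p_g, d):
    """V(D, G) on a discrete support, with d[i] = D(x_i) in (0, 1)."""
    return sum(p * math.log(di) + q * math.log(1.0 - di)
               for p, q, di in zip(p_data, p_g, d))

def optimal_discriminator(p_data, p_g):
    return [p / (p + q) for p, q in zip(p_data, p_g)]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = [0.5, 0.3, 0.2]  # toy "real" distribution
p_g = [0.2, 0.2, 0.6]     # toy generator distribution
d_star = optimal_discriminator(p_data, p_g)
v_star = gan_value(p_data, p_g, d_star)

# Any other discriminator should score no higher than D*.
v_perturbed = gan_value(p_data, p_g, [d + 0.05 for d in d_star])

print(v_star, -math.log(4) + 2 * jsd(p_data, p_g))  # the two should agree
```

Setting `p_g` equal to `p_data` gives `d_star = [0.5, 0.5, 0.5]` and a game value of $-\log 4$, matching the global optimum stated above.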

Interviewer's Perspective

Deriving the optimal discriminator and showing that the GAN objective reduces to JSD minimization is a classic interview question at Google Research, OpenAI, and DeepMind. The derivation is straightforward but you must be able to do it on a whiteboard without notes. Practice it until you can complete it in under 3 minutes.

Mode Collapse

What it is: The generator finds a single output (or a few outputs) that fools the discriminator and produces only those, ignoring the diversity of the real data distribution.

Why it happens: The generator's gradient points toward the mode that currently fools the discriminator best. Once it finds such a mode, the gradient reinforces it rather than encouraging exploration of other modes.

[Figure: GAN Training Dynamics - Stable vs Mode Collapse]

Solutions to GAN Training Instability

| Technique | Key Idea | Paper |
| --- | --- | --- |
| WGAN | Replace JSD with Wasserstein distance, clip weights | Arjovsky et al., 2017 |
| WGAN-GP | Gradient penalty instead of weight clipping | Gulrajani et al., 2017 |
| Spectral Normalization | Normalize discriminator weights by spectral norm | Miyato et al., 2018 |
| Progressive Growing | Start with low-res, gradually increase | Karras et al., 2018 |
| StyleGAN | Style-based generator with mapping network | Karras et al., 2019 |
| Minibatch Discrimination | Discriminator sees batches, not individuals | Salimans et al., 2016 |

Wasserstein GAN (WGAN)

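The key insight: when the real and generated distributions have non-overlapping support (common in high dimensions), JSD saturates at its maximum value $\log 2$ no matter how far apart the distributions are (and the KL divergence is infinite). A constant divergence gives the generator zero gradient - it cannot learn.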

The Wasserstein distance (Earth Mover's Distance) provides a meaningful gradient even when distributions do not overlap:

$$W(p_{\text{data}}, p_G) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_G)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]$$

By the Kantorovich-Rubinstein duality:

$$W(p_{\text{data}}, p_G) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_G}[f(x)]$$

Where the supremum is over all 1-Lipschitz functions. The discriminator (now called "critic") approximates this Lipschitz function.
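A standard toy case makes the contrast tangible: real data is a point mass at 0 and the generator is a point mass at $\theta$. The sketch below is illustrative only; the helpers are hypothetical, and the one-dimensional Wasserstein distance is computed by sorting and matching equal-size samples.

```python
import math

def jsd(p, q):
    """JSD between discrete distributions given as {point: probability}."""
    total = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        m = 0.5 * (px + qx)
        if px > 0:
            total += 0.5 * px * math.log(px / m)
        if qx > 0:
            total += 0.5 * qx * math.log(qx / m)
    return total

def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1-D empirical distributions."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

for theta in (0.1, 1.0, 10.0):
    real, fake = {0.0: 1.0}, {theta: 1.0}
    print(theta, jsd(real, fake), wasserstein_1d([0.0], [theta]))
```

JSD stays at $\log 2$ for every nonzero $\theta$, so it carries no information about which direction to move the generator. The Wasserstein distance equals $|\theta|$ and shrinks smoothly as the generator approaches the data.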

Enforcing the Lipschitz constraint:

  • Weight clipping (WGAN): Clip all weights to $[-c, c]$. Simple but leads to capacity underuse and exploding/vanishing gradients.
  • Gradient penalty (WGAN-GP): Add $\lambda \, \mathbb{E}[(\|\nabla_{\hat{x}} D(\hat{x})\| - 1)^2]$ to the loss, where $\hat{x}$ is a random interpolation between real and fake samples.
  • Spectral normalization: Divide each weight matrix by its spectral norm (largest singular value). This makes every linear layer 1-Lipschitz, and with 1-Lipschitz activations such as ReLU the whole network is 1-Lipschitz, without any penalty term.
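In practice the largest singular value is usually estimated with power iteration rather than a full SVD. The following is a dependency-free sketch on a tiny matrix, meant only to show the mechanics; the helper names and the iteration count are assumptions.

```python
import math

def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def transpose(w):
    return [list(col) for col in zip(*w)]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def spectral_norm(w, n_iters=100):
    """Estimate the largest singular value of w by power iteration."""
    wt = transpose(w)
    v = normalize([1.0] * len(w[0]))
    for _ in range(n_iters):
        u = normalize(matvec(w, v))   # left singular vector estimate
        v = normalize(matvec(wt, u))  # right singular vector estimate
    return sum(ui * x for ui, x in zip(u, matvec(w, v)))  # u^T W v

w = [[3.0, 0.0], [0.0, 1.0]]  # toy weight matrix with singular values 3 and 1
sigma = spectral_norm(w)
w_sn = [[x / sigma for x in row] for row in w]  # normalized weights
print(sigma, spectral_norm(w_sn))
```

Training code typically keeps `u` between steps and runs a single iteration per update, since the weights change only slightly each time.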
Instant Rejection

Never say "WGAN just removes the log from the GAN loss." The mathematical change is fundamental - switching from JSD to Wasserstein distance, which requires the Lipschitz constraint. The removal of log is a consequence, not the cause. Saying this suggests you memorized a formula without understanding the theory.

Part 3 - Diffusion Models

The Core Idea

Diffusion models learn to reverse a gradual noising process. Start with clean data $x_0$, add noise step by step until you get pure Gaussian noise $x_T$, then learn a neural network that reverses each step.

Forward Process (Adding Noise)

The forward process is a fixed Markov chain that gradually adds Gaussian noise:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)$$

Where $\beta_t$ is a small noise schedule (typically increasing from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$).

Key property: We can sample $x_t$ directly from $x_0$ without iterating through all steps:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) I)$$

Where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.

This means: $x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.

At $t = T$ (with large $T$), $\bar{\alpha}_T \approx 0$ and $x_T \approx \epsilon$ - pure noise.
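The closed-form jump to step $t$ can be sketched as follows. The example assumes a linear schedule with $T = 1000$ and the endpoint values quoted above; `linear_beta_schedule`, `cumulative_alpha_bar`, and `q_sample` are illustrative names.

```python
import math
import random

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), stored for t = 1..T."""
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot (t is 1-indexed)."""
    ab = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return x_t, eps

betas = linear_beta_schedule()
alpha_bars = cumulative_alpha_bar(betas)

# Cross-check the closed form against the step-by-step chain: the noise
# variance obeys v_t = alpha_t * v_{t-1} + beta_t, starting from v_0 = 0.
var = 0.0
for beta in betas:
    var = (1.0 - beta) * var + beta

print(alpha_bars[-1], var)  # alpha_bar_T near 0, accumulated variance near 1
```

The recursion lands on $1 - \bar{\alpha}_T$, which is why a single draw of $\epsilon$ can replace $t$ sequential noising steps during training.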

[Figure: DDPM - Forward Noising Process and Learned Reverse Denoising]

Reverse Process (Denoising)

The reverse process is parameterized by a neural network:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$$

The network predicts the mean $\mu_\theta(x_t, t)$ of the denoising step at each timestep $t$.

DDPM Training Objective (Ho et al., 2020)

After simplification, the training objective reduces to:

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\| \epsilon - \epsilon_\theta(x_t, t) \|^2\right]$$

Where $\epsilon_\theta$ is the neural network predicting the noise that was added. The training procedure:

  1. Sample a clean image $x_0$ from the dataset
  2. Sample a random timestep $t \sim \text{Uniform}(\{1, \dots, T\})$
  3. Sample noise $\epsilon \sim \mathcal{N}(0, I)$
  4. Compute noisy image: $x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon$
  5. Train the network to predict $\epsilon$ from $x_t$ and $t$: minimize $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$
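The five steps above map almost line for line onto code. This sketch computes the loss for one toy example without any deep learning framework. The "networks" are stand-in Python functions (a zero predictor and an oracle that is told $x_0$), used only to show what the loss measures.

```python
import math
import random

def ddpm_loss(x0, t, alpha_bars, eps_model, rng):
    """Squared error between true and predicted noise for one (x0, t) pair."""
    ab = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]                      # step 3
    x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e
           for x, e in zip(x0, eps)]                             # step 4
    eps_hat = eps_model(x_t, t)                                  # step 5
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))

# Linear schedule, assumed here to match the values quoted earlier.
T = 1000
alpha_bars, prod = [], 1.0
for i in range(T):
    prod *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))
    alpha_bars.append(prod)

x0 = [0.5, -1.0, 0.25]          # step 1: a toy "image" with three pixels
t = random.Random(0).randint(1, T)  # step 2

def zero_model(x_t, t):
    return [0.0] * len(x_t)

def oracle_model(x_t, t):
    # Cheats by knowing x0, so it can invert step 4 exactly.
    ab = alpha_bars[t - 1]
    return [(xt - math.sqrt(ab) * x) / math.sqrt(1.0 - ab)
            for xt, x in zip(x_t, x0)]

loss_zero = ddpm_loss(x0, t, alpha_bars, zero_model, random.Random(1))
loss_oracle = ddpm_loss(x0, t, alpha_bars, oracle_model, random.Random(1))
print(loss_zero, loss_oracle)
```

A real denoiser sits between these extremes: it never sees $x_0$, so the best it can do is predict the expected noise given $x_t$.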
60-Second Answer

"Diffusion models learn to generate data by reversing a gradual noising process. The forward process adds Gaussian noise step by step until the data becomes pure noise. A neural network learns to reverse each step - given a noisy image and the noise level, it predicts the noise that was added. At generation time, we start with random noise and iteratively denoise to produce a clean image. The training is simple - just mean squared error between the predicted and actual noise. Unlike GANs, there is no adversarial training and no mode collapse. Unlike VAEs, there is no posterior approximation or blurriness. The main drawback is slow sampling - generating an image requires hundreds to thousands of network evaluations."

Score Matching Connection

The noise prediction $\epsilon_\theta(x_t, t)$ is closely related to the score function - the gradient of the log probability:

$$\nabla_{x_t} \log q(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$

This connects DDPM to score-based generative models (Song & Ermon, 2019). The network is learning to estimate the direction toward higher probability at each noise level.

Why this matters: Score matching provides a clean theoretical framework. The denoising objective is equivalent to estimating the score function at every noise level, and Langevin dynamics (following the score with added noise) produces samples from the learned distribution.
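The conversion from noise prediction to score follows from $\nabla_{x_t} \log q(x_t | x_0) = -(x_t - \sqrt{\bar{\alpha}_t} \, x_0) / (1 - \bar{\alpha}_t) = -\epsilon / \sqrt{1 - \bar{\alpha}_t}$. Here is a small check under a strong simplifying assumption: scalar data with $x_0 \sim \mathcal{N}(0, 1)$, so that $q(x_t)$ is also standard normal and the true score is $-x_t$. The helper name is illustrative.

```python
import math

def score_from_eps(eps_hat, alpha_bar_t):
    """Convert a noise prediction into a score estimate."""
    return [-e / math.sqrt(1.0 - alpha_bar_t) for e in eps_hat]

# Toy case: x0 ~ N(0, 1) implies x_t ~ N(0, 1), whose score is -x_t.
# For jointly Gaussian (eps, x_t) the ideal noise prediction is
# E[eps | x_t] = sqrt(1 - alpha_bar_t) * x_t.
alpha_bar_t = 0.3
x_t = [1.7, -0.4]
ideal_eps = [math.sqrt(1.0 - alpha_bar_t) * x for x in x_t]
score = score_from_eps(ideal_eps, alpha_bar_t)
print(score)  # should match [-1.7, 0.4] up to rounding
```

For real data the marginal $q(x_t)$ is not Gaussian, but the same conversion applies to whatever the trained network outputs.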

Sampling Methods

| Method | Steps | Speed | Quality | How It Works |
| --- | --- | --- | --- | --- |
| DDPM | 1000 | Slow | High | Original reverse process, one step at a time |
| DDIM | 50-100 | Medium | High | Deterministic sampling, skip steps |
| DPM-Solver | 10-25 | Fast | High | ODE solver for the reverse process |
| Consistency Models | 1-2 | Very fast | Good | Distilled to predict final output directly |
| Latent Diffusion | 20-50 | Fast | High | Diffusion in latent space, not pixel space |
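To illustrate how a sampler can skip steps, here is a sketch of the deterministic DDIM update (the $\eta = 0$ case). It first forms an estimate of $x_0$ from the predicted noise, then re-noises that estimate to an arbitrary earlier noise level. The function name and toy values are assumptions for illustration.

```python
import math

def ddim_step(x_t, eps_hat, ab_t, ab_prev):
    """Deterministic DDIM update from noise level ab_t to ab_prev.

    ab_t and ab_prev are alpha_bar values; ab_prev = 1.0 means fully clean.
    """
    x0_hat = [(x - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
              for x, e in zip(x_t, eps_hat)]
    return [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e
            for x0, e in zip(x0_hat, eps_hat)]

# Toy check with a perfect noise prediction: one jump recovers x0.
x0, eps, ab_t = [0.5, -1.0], [0.3, 1.2], 0.1
x_t = [math.sqrt(ab_t) * x + math.sqrt(1.0 - ab_t) * e for x, e in zip(x0, eps)]
x_rec = ddim_step(x_t, eps, ab_t, 1.0)

# Going through an intermediate level gives the same answer here.
x_mid = ddim_step(x_t, eps, ab_t, 0.6)
x_rec_two_steps = ddim_step(x_mid, eps, 0.6, 1.0)
print(x_rec, x_rec_two_steps)
```

With a learned network the noise prediction is imperfect, so fewer, larger jumps trade some quality for speed, which is the tradeoff the table summarizes.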

Why Diffusion Models Avoid Mode Collapse

GANs suffer from mode collapse because the generator optimizes against a single discriminator - it can find shortcuts. Diffusion models avoid this because:

  1. No adversarial training: The objective is simple regression (predict the noise). Every training example contributes to the loss at every noise level, so ignoring a mode of the data is directly penalized rather than rewarded.
  2. Multi-scale denoising: The network must handle every noise level, from nearly clean to pure noise. This forces it to learn the full data distribution at every scale.
  3. Stable training: MSE loss with random timestep sampling provides consistent, low-variance gradients.

Common Trap

Do NOT say "diffusion models are just VAEs with many latent variables." While there is a mathematical connection (the DDPM ELBO can be written as a sequence of KL terms), the training procedure and the resulting model behavior are fundamentally different. Diffusion models do not learn an encoder or a latent representation - they learn to denoise. The connection is theoretical, not practical.

Part 4 - Flow-Based Models

Core Idea

Flow-based models learn an invertible transformation $f_\theta$ from a simple distribution (Gaussian) to the data distribution:

$$x = f_\theta(z), \quad z = f_\theta^{-1}(x), \quad z \sim \mathcal{N}(0, I)$$

Because $f_\theta$ is invertible, we can compute exact log-likelihoods via the change of variables formula:

$$\log p_\theta(x) = \log p(z) + \log \left|\det \frac{\partial f_\theta^{-1}}{\partial x}\right|$$

$$= \log p(f_\theta^{-1}(x)) + \log \left|\det J_{f_\theta^{-1}}(x)\right|$$
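The change of variables formula is easiest to see in one dimension. This sketch assumes the simplest possible flow, $x = a z + b$, whose inverse has Jacobian $1/a$; the exact log-likelihood should then match the density of $\mathcal{N}(b, a^2)$. Function names are illustrative.

```python
import math

def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def affine_flow_logpdf(x, a, b):
    """log p(x) for the flow x = a * z + b with z ~ N(0, 1)."""
    z = (x - b) / a                  # z = f^{-1}(x)
    log_det_inv = -math.log(abs(a))  # log |d f^{-1} / dx|
    return standard_normal_logpdf(z) + log_det_inv

def normal_logpdf(x, mean, std):
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))

print(affine_flow_logpdf(1.3, 2.0, 0.5), normal_logpdf(1.3, 0.5, 2.0))
```

Deep flows stack many invertible layers, and the log-determinants simply add, which is why architectures are chosen to make each layer's Jacobian determinant cheap.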

Key Models

| Model | Key Innovation | Limitation |
| --- | --- | --- |
| NICE (2014) | Additive coupling layers | Limited expressiveness |
| RealNVP (2017) | Affine coupling layers | Still limited by coupling structure |
| Glow (2018) | 1x1 convolutions + squeeze | Better quality but expensive |
| Neural ODE / FFJORD (2018-2019) | Continuous-time flows | Expensive determinant computation |

Limitations

  1. Invertibility constraint: The requirement that $f_\theta$ be invertible limits the architecture. You cannot use standard residual connections, attention, or convolutions freely.
  2. Computational cost: Computing the Jacobian determinant is expensive, even with clever architectures.
  3. Sample quality: Despite exact likelihoods, flow models generally produce lower-quality samples than GANs or diffusion models at the same model size.

Where flows still shine: Density estimation tasks where exact likelihoods are needed, and as components of larger systems (e.g., normalizing flows in VAE posteriors).

Part 5 - The Grand Comparison

Head-to-Head Comparison

| Property | VAE | GAN | Diffusion | Autoregressive | Flow |
| --- | --- | --- | --- | --- | --- |
| Training objective | ELBO (lower bound) | Minimax game | Denoising MSE | Exact log-likelihood | Exact log-likelihood |
| Training stability | Stable | Unstable | Very stable | Stable | Stable |
| Sample quality | Blurry | Sharp | Excellent | Excellent | Moderate |
| Mode coverage | Good | Poor (mode collapse) | Excellent | Excellent | Good |
| Sampling speed | Fast (one pass) | Fast (one pass) | Slow (iterative) | Slow (sequential) | Fast (one pass) |
| Exact likelihood | Lower bound only | No | Lower bound only | Yes | Yes |
| Latent space | Yes (continuous) | Yes (input noise) | No explicit | No | Yes (invertible) |
| Diversity control | Temperature on z | Truncation trick | Guidance scale | Temperature | Temperature on z |
| Key weakness | Blurry outputs | Mode collapse | Slow generation | Sequential, slow | Limited architecture |

When to Choose Each

[Figure: Generative Model Selection - Decision Framework]

Company Variation

At OpenAI and Stability AI, diffusion model expertise is almost mandatory. At Meta, GAN knowledge is still valued (StyleGAN for faces, GAN upscaling). At Google, autoregressive models dominate (Gemini); Imagen uses diffusion, but the team also works on autoregressive image generation. For startups, understanding the speed-quality tradeoff is most important - they often need fast inference.

Part 6 - Modern Applications

Text-to-Image: The Stable Diffusion Architecture

The key innovation in Stable Diffusion (Rombach et al., 2022) is latent diffusion: run the diffusion process in a compressed latent space rather than pixel space.

  1. Autoencoder: Compress images from pixel space ($256 \times 256 \times 3$) to latent space ($32 \times 32 \times 4$) - a 48x compression
  2. Diffusion model: Operate the forward/reverse process in latent space - much faster
  3. Conditioning: Use cross-attention with text embeddings (from CLIP) to guide generation
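The compression figure is simple arithmetic, but it is worth separating the two numbers that come up in interviews: the reduction in total values and the reduction in spatial positions. A small sketch, with hypothetical helper names:

```python
import math

def compression_factor(pixel_shape, latent_shape):
    """Ratio of total values (height * width * channels)."""
    return math.prod(pixel_shape) / math.prod(latent_shape)

def spatial_reduction(pixel_shape, latent_shape):
    """Ratio of spatial positions (height * width), ignoring channels."""
    return (pixel_shape[0] * pixel_shape[1]) / (latent_shape[0] * latent_shape[1])

print(compression_factor((256, 256, 3), (32, 32, 4)))  # 48.0
print(spatial_reduction((256, 256, 3), (32, 32, 4)))   # 64.0
```

An 8x downsampling per side always gives 64x fewer positions. The 48x figure comes from the channel count growing from 3 to 4.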

Classifier-Free Guidance (CFG): At generation time, run the model twice - once with the text condition and once without. Amplify the difference:

$$\epsilon_{\text{guided}} = \epsilon_{\text{uncond}} + s \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$$

Where $s$ is the guidance scale. Higher $s$ means stronger adherence to the text prompt but less diversity.
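The guidance formula is a one-line extrapolation. The sketch below applies it to toy noise vectors; `cfg_combine` is an illustrative name, and in a real sampler the two inputs would come from two forward passes of the denoiser.

```python
def cfg_combine(eps_uncond, eps_cond, s):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for s > 1, beyond) the conditional one."""
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

print(cfg_combine(eps_uncond, eps_cond, 0.0))  # [0.0, 1.0]: unconditional
print(cfg_combine(eps_uncond, eps_cond, 1.0))  # [1.0, 3.0]: plain conditional
print(cfg_combine(eps_uncond, eps_cond, 2.0))  # [2.0, 5.0]: extrapolated
```

Values of $s$ above 1 push past the conditional prediction, which is where the stronger prompt adherence and the reduced diversity both come from.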

Video Generation

Video diffusion models extend image diffusion along the temporal axis:

  • Temporal attention: Add temporal self-attention layers to the 2D U-Net to attend across frames
  • Temporal convolutions: 3D convolutions or pseudo-3D (2D spatial + 1D temporal) convolutions
  • Frame conditioning: Generate keyframes first, then interpolate

Major models: Sora (OpenAI), Veo (Google), Gen-3 (Runway), Kling (Kuaishou).

3D Generation

Diffusion models for 3D content:

  • Point-E / Shap-E (OpenAI): Diffusion in point cloud or implicit function space
  • DreamFusion: Use 2D diffusion model as a prior to optimize a 3D representation (NeRF) via Score Distillation Sampling
  • Zero-1-to-3: Generate novel views from a single image using view-conditioned diffusion

Practice Problems

Problem 1: ELBO Derivation

Derive the ELBO from scratch, starting from $\log p_\theta(x)$. Show every step, including where Jensen's inequality is applied and what the gap between the ELBO and the true log-likelihood represents.

Hint 1 - Direction

Start by writing $\log p_\theta(x) = \log \int p_\theta(x,z) \, dz$. Introduce $q_\phi(z|x)$ by multiplying and dividing. Apply Jensen's inequality.

Hint 2 - Insight

The gap between $\log p_\theta(x)$ and the ELBO is exactly $\text{KL}(q_\phi(z|x) \| p_\theta(z|x))$ - the KL divergence between the approximate and true posterior. This is always non-negative, confirming the ELBO is indeed a lower bound. As $q_\phi$ becomes a better approximation of the true posterior, the gap shrinks.

Hint 3 - Full Solution + Rubric

Complete derivation:

Starting from:

$$\log p_\theta(x) = \log \int p_\theta(x, z) \, dz$$

Introduce $q_\phi(z|x)$:

$$= \log \int \frac{p_\theta(x, z)}{q_\phi(z|x)} \, q_\phi(z|x) \, dz$$

$$= \log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$$

By Jensen's inequality ($\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$):

$$\geq \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \text{ELBO}$$

Expanding $p_\theta(x,z) = p_\theta(x|z) \, p(z)$:

$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \| p(z))$$

The gap: Alternatively, we can write:

$$\log p_\theta(x) = \text{ELBO} + \text{KL}(q_\phi(z|x) \| p_\theta(z|x))$$

Since KL divergence is always $\geq 0$, the ELBO $\leq \log p_\theta(x)$. The gap is the KL between the approximate posterior $q_\phi(z|x)$ and the true posterior $p_\theta(z|x)$.

To prove the gap identity, start from:

$$\text{KL}(q_\phi(z|x) \| p_\theta(z|x)) = \mathbb{E}_{q_\phi}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]$$

$$= \mathbb{E}_{q_\phi}\left[\log q_\phi(z|x) - \log p_\theta(z|x)\right]$$

$$= \mathbb{E}_{q_\phi}\left[\log q_\phi(z|x) - \log \frac{p_\theta(x,z)}{p_\theta(x)}\right]$$

$$= \mathbb{E}_{q_\phi}\left[\log q_\phi(z|x) - \log p_\theta(x,z)\right] + \log p_\theta(x)$$

$$= -\text{ELBO} + \log p_\theta(x)$$

Rearranging: $\log p_\theta(x) = \text{ELBO} + \text{KL}(q_\phi(z|x) \| p_\theta(z|x))$.
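The gap identity can be verified exactly on a model small enough to enumerate. The sketch below assumes a binary latent variable and one fixed observation; all probability values are made up for illustration.

```python
import math

prior = [0.5, 0.5]       # p(z) for z in {0, 1}
likelihood = [0.8, 0.1]  # p(x|z) for one fixed observation x
q = [0.6, 0.4]           # an arbitrary approximate posterior q(z|x)

joint = [pz * px_z for pz, px_z in zip(prior, likelihood)]  # p(x, z)
evidence = sum(joint)                                        # p(x)
log_px = math.log(evidence)
posterior = [j / evidence for j in joint]                    # true p(z|x)

def elbo(q_dist):
    return sum(qz * math.log(j / qz) for qz, j in zip(q_dist, joint))

def kl(a, b):
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

gap = kl(q, posterior)
print(log_px, elbo(q) + gap)  # the two should agree
print(elbo(posterior))        # bound is tight when q equals the posterior
```

Any other choice of `q` lowers the ELBO by exactly its KL divergence from the true posterior, which is the statement a strong answer should end on.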

Scoring Rubric:

  • Strong Hire: Complete derivation with Jensen's inequality clearly identified, both forms of the ELBO (reconstruction + KL), derives the gap identity, explains that the gap is the posterior approximation quality.
  • Lean Hire: Gets the ELBO formula correct but skips the gap derivation or cannot explain what the gap represents.
  • No Hire: Cannot derive the ELBO or confuses the direction of the KL divergence.

Problem 2: GAN Optimal Discriminator

Prove that the optimal discriminator for a fixed generator is $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_G(x))$. Then show that substituting $D^*$ into the GAN objective gives $-\log 4 + 2 \cdot \text{JSD}(p_{\text{data}} \| p_G)$.

Hint 1 - Direction

For the optimal D, differentiate the integrand of the GAN objective with respect to $D(x)$ and set to zero. For the JSD connection, substitute $D^*$ and use the definition of JSD.

Hint 2 - Insight

The integrand is $p_{\text{data}} \log D + p_G \log(1-D)$, which is maximized at $D = p_{\text{data}} / (p_{\text{data}} + p_G)$. Substituting gives two terms that each look like KL divergences to the mixture distribution $M = (p_{\text{data}} + p_G)/2$.

Hint 3 - Full Solution + Rubric

Part 1: Optimal discriminator

The GAN objective:

$$V(D,G) = \int p_{\text{data}}(x) \log D(x) + p_G(x) \log(1 - D(x)) \, dx$$

For each $x$, maximizing the integrand $a \log y + b \log(1-y)$ where $a = p_{\text{data}}(x)$, $b = p_G(x)$:

$$\frac{a}{y} - \frac{b}{1-y} = 0 \implies y^* = \frac{a}{a+b} = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$$

Part 2: JSD connection

Substituting $D^*$, let $M = (p_{\text{data}} + p_G)/2$:

$$V(D^*, G) = \int p_{\text{data}} \log \frac{p_{\text{data}}}{p_{\text{data}} + p_G} + p_G \log \frac{p_G}{p_{\text{data}} + p_G} \, dx$$

$$= \int p_{\text{data}} \log \frac{p_{\text{data}}}{2M} + p_G \log \frac{p_G}{2M} \, dx$$

$$= \int p_{\text{data}} \log \frac{p_{\text{data}}}{M} - p_{\text{data}} \log 2 + p_G \log \frac{p_G}{M} - p_G \log 2 \, dx$$

$$= \text{KL}(p_{\text{data}} \| M) + \text{KL}(p_G \| M) - 2 \log 2$$

$$= 2 \cdot \text{JSD}(p_{\text{data}} \| p_G) - \log 4$$

At the optimum where $p_G = p_{\text{data}}$: JSD = 0, so $V = -\log 4$.

Scoring Rubric:

  • Strong Hire: Complete both derivations, identifies the JSD, states the global optimum condition.
  • Lean Hire: Gets the optimal discriminator but cannot complete the JSD derivation.
  • No Hire: Cannot set up the optimization problem for D.

Problem 3: Diffusion Model Design

You need to design a diffusion model for generating 1024x1024 images. Running DDPM directly in pixel space would be extremely slow. Design an efficient architecture and explain each choice.

Hint 1 - Direction

Think about the Stable Diffusion approach: compress to latent space first, then run diffusion there. What autoencoder would you use? What U-Net architecture for the denoiser? How would you handle conditioning?

Hint 2 - Insight

Latent diffusion: (1) Train a VQ-VAE or KL-VAE to compress 1024x1024x3 to 128x128x4 or 64x64x4. (2) Train a U-Net denoiser in latent space with attention at multiple scales. (3) Use cross-attention for text conditioning. (4) Use classifier-free guidance for quality. (5) DDIM or DPM-Solver for fast sampling in 20-50 steps.

Hint 3 - Full Solution + Rubric

Architecture design:

Stage 1: Autoencoder

  • KL-regularized autoencoder (similar to Stable Diffusion's VAE)
  • Encode 1024x1024x3 to 128x128x4 (8x downsampling per side: 64x fewer spatial positions, 48x fewer values)
  • Train with reconstruction loss + perceptual loss + adversarial loss + small KL penalty
  • This autoencoder is trained once and frozen

Stage 2: Latent Diffusion Model

  • U-Net architecture with:
    • ResNet blocks at each resolution level
    • Self-attention at 64x64, 32x32, 16x16 resolutions
    • Cross-attention for text conditioning at the same resolutions
    • Timestep conditioning via sinusoidal embeddings + AdaGN (Adaptive Group Norm)
  • Text encoder: CLIP ViT-L/14 or T5-XXL for text embeddings
  • Noise schedule: linear or cosine, T=1000 training steps

Stage 3: Conditioning and Guidance

  • Classifier-free guidance: train with 10% unconditional (drop text embedding)
  • Guidance scale s=7.5 to 12 at inference

Stage 4: Fast Sampling

  • DDIM with 50 steps, or DPM-Solver++ with 20 steps
  • Optionally: progressive distillation to 4-8 steps

Memory and compute considerations:

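  • Working in 128x128x4 instead of 1024x1024x3 gives the denoiser 64x fewer spatial positions per step (48x fewer values); self-attention cost falls faster still because it scales quadratically with the number of positions
  • U-Net with ~900M parameters (comparable to the Stable Diffusion v1 U-Net; SDXL's U-Net is roughly 2.6B)
  • Training: on the order of 256 A100 GPUs for ~2 weeks on 2B image-text pairs (a rough estimate that depends heavily on the setup)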

Scoring Rubric:

  • Strong Hire: Designs latent diffusion (not pixel-space), specifies autoencoder, U-Net architecture with attention, text conditioning via cross-attention, classifier-free guidance, fast sampling method, and discusses compute.
  • Lean Hire: Suggests latent diffusion but missing details on conditioning, guidance, or sampling speed.
  • No Hire: Proposes pixel-space diffusion for 1024x1024 or cannot describe the architecture.

Problem 4: Mode Collapse Diagnosis

You are training a GAN to generate diverse human faces. After 50K steps, the discriminator loss drops to near zero and the generator produces realistic but nearly identical faces (same pose, similar features). Diagnose and fix.

Hint 1 - Direction

Near-zero discriminator loss with low-diversity generation is classic mode collapse. Think about what signals the generator is receiving and why it converges to a single mode.

Hint 2 - Insight

The discriminator became too strong too fast - it can easily tell real from fake, but the generator found one mode that partially fools it and collapses there. Fixes: use WGAN-GP or spectral normalization to stabilize the discriminator, add minibatch discrimination for diversity, reduce discriminator update frequency, or switch to a diffusion model.

Hint 3 - Full Solution + Rubric

Diagnosis:

Mode collapse with discriminator domination. The generator finds the single face that maximizes $D(G(z))$ and produces only that face. The discriminator then achieves near-perfect accuracy because the generator's output is a single point (or a narrow mode).

Root cause analysis:

  1. Discriminator is too powerful relative to generator - may have more capacity or be updated more frequently
  2. Standard GAN loss (JSD) provides zero gradient when distributions do not overlap
  3. No explicit diversity encouragement in the objective

Fixes (in order of priority):

  1. Switch to WGAN-GP: Wasserstein distance provides meaningful gradients even when distributions do not overlap. Add gradient penalty ($\lambda = 10$).

  2. Spectral normalization on discriminator: Constrains discriminator's Lipschitz constant, preventing it from being too confident.

  3. Balance update ratio: Update generator 2x for every discriminator update, or use learning rate imbalance (lower D learning rate).

  4. Minibatch discrimination: Allow discriminator to see statistics across the batch. If all generated faces are similar, the discriminator can detect this.

  5. R1 regularization: Add gradient penalty on real data only: $\frac{\gamma}{2} \mathbb{E}[\|\nabla_x D(x)\|^2]$. Prevents discriminator from overconfidently classifying near the data manifold.

  6. Progressive growing: Start generating at low resolution (4x4), gradually increase. Low-resolution modes are easier to cover.

  7. Architecture changes: StyleGAN-style generator with mapping network + style injection tends to be more stable than basic generators.

If all else fails: Switch to a diffusion model. Diffusion models do not suffer from mode collapse by design and consistently produce diverse, high-quality face images.
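Before applying any fix, it helps to confirm the collapse with a number rather than by eye. One simple diagnostic, sketched here under the assumption that samples are compared as raw vectors (in practice you would more likely compare embeddings from a pretrained feature extractor), is the ratio of mean pairwise distance in a generated batch to that in a real batch. The function names and toy batches are hypothetical.

```python
import math
from itertools import combinations

def mean_pairwise_distance(batch):
    """Average Euclidean distance over all pairs of samples in the batch."""
    pairs = list(combinations(batch, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def diversity_ratio(generated, real):
    """Near 1: similar spread to real data. Near 0: likely mode collapse."""
    return mean_pairwise_distance(generated) / mean_pairwise_distance(real)

real = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
collapsed = [[0.5, 0.5], [0.5, 0.51], [0.51, 0.5], [0.5, 0.5]]

print(diversity_ratio(collapsed, real))  # far below 1 for the collapsed batch
```

Tracking this ratio over training steps shows when diversity starts to fall, which is often earlier than it becomes obvious in sample grids.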

Scoring Rubric:

  • Strong Hire: Correctly diagnoses mode collapse from the symptoms, explains the root cause (discriminator/generator imbalance + JSD gradient issues), provides 3+ actionable fixes with justification, mentions WGAN-GP or spectral norm, considers switching to diffusion.
  • Lean Hire: Identifies mode collapse but provides only 1-2 generic fixes without explaining why they work.
  • No Hire: Cannot identify mode collapse from the description or suggests "train longer."

Problem 5: Generative Model Selection

Your startup needs to build a product that generates custom product images from text descriptions. The images must be 512x512, generation must take less than 2 seconds, and you have a limited compute budget for both training and inference. Which generative model family do you choose and why?

Hint 1 - Direction

Consider the tradeoffs: quality, speed, training cost, and text conditioning capability. The 2-second latency constraint eliminates some options.

Hint 2 - Insight

Standard diffusion (DDPM) with 1000 steps is too slow. But latent diffusion with fast samplers (DPM-Solver at around 20 steps) or distilled models (4-8 steps) can meet the 2-second target. GANs are fast but harder to condition on text and have diversity issues. VAEs are too blurry. Autoregressive models (like DALL-E 1) are too slow for 2 seconds.

Hint 3 - Full Solution + Rubric

Recommendation: Latent Diffusion Model with fast sampling

Why not other options:

  • GAN: Fast inference but text conditioning is difficult to implement well, mode collapse risks with diverse product categories, and training instability with limited compute.
  • VAE: Too blurry for product images where detail matters.
  • Autoregressive: Too slow - generating a 512x512 image token by token takes many seconds.
  • Flow: Lower quality than diffusion, limited text conditioning support.

Latent diffusion implementation:

  1. Autoencoder: Fine-tune a pre-trained SD VAE (or use it directly). Compress 512x512x3 to 64x64x4.

  2. Denoiser: Fine-tune Stable Diffusion's U-Net on your product image dataset. This dramatically reduces training cost compared to training from scratch.

  3. Text conditioning: Use CLIP text encoder (already part of SD) plus potentially a domain-specific text encoder fine-tuned on product descriptions.

  4. Fast sampling for 2-second budget:

    • DDIM with 20 steps: ~1.5 seconds on A10G GPU
    • DPM-Solver++ with 15 steps: ~1.1 seconds
    • LCM (Latent Consistency Model) with 4 steps: ~0.4 seconds
    • Distilled model: train consistency distillation for 4-8 step generation
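  5. Infrastructure: Serve on A10G GPUs (about $1.00/hour). The single-image latencies above imply roughly 0.7 (DDIM, 20 steps) to 2.5 (LCM, 4 steps) images per GPU per second; batching can raise throughput further at the cost of per-request latency, so benchmark on your own hardware before committing to capacity numbers.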

Training budget: Fine-tuning SD on ~100K product images takes ~2-4 A100 GPU-days. Much cheaper than training from scratch.
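The latency and cost figures above can be turned into a quick back-of-the-envelope check, which is the kind of arithmetic the rubric means by "considers inference cost." The sketch below only uses numbers quoted in this solution (per-image latencies, the A10G hourly price, the 2-second budget, and the 512x512x3 to 64x64x4 compression); the function names are hypothetical and it assumes one image at a time with no batching.

```python
import math

# Per-image latencies (seconds) quoted above for an A10G GPU.
LATENCY_S = {
    "DDIM, 20 steps": 1.5,
    "DPM-Solver++, 15 steps": 1.1,
    "LCM, 4 steps": 0.4,
}
GPU_COST_PER_HOUR = 1.00  # USD, A10G price quoted above
LATENCY_BUDGET_S = 2.0    # product requirement from the problem statement

def meets_budget(latency_s, budget_s=LATENCY_BUDGET_S):
    """True if a single generation fits inside the latency budget."""
    return latency_s < budget_s

def images_per_second(latency_s):
    """Throughput of one GPU serving one image at a time (no batching)."""
    return 1.0 / latency_s

def cost_per_1000_images(latency_s, gpu_cost_per_hour=GPU_COST_PER_HOUR):
    """GPU cost in USD to generate 1000 images sequentially on one GPU."""
    gpu_hours = 1000 * latency_s / 3600
    return gpu_hours * gpu_cost_per_hour

def latent_compression_ratio(pixel_shape=(512, 512, 3), latent_shape=(64, 64, 4)):
    """How many times fewer values the denoiser processes in latent space."""
    return math.prod(pixel_shape) / math.prod(latent_shape)

for name, latency in LATENCY_S.items():
    print(
        f"{name}: fits budget={meets_budget(latency)}, "
        f"{images_per_second(latency):.2f} img/s, "
        f"${cost_per_1000_images(latency):.2f} per 1000 images"
    )
print("latent compression ratio:", latent_compression_ratio())
```

All three samplers fit the 2-second budget, and the 4-step option costs roughly a quarter of the 20-step DDIM option per image, which is the quantitative argument for investing in distillation.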

Scoring Rubric:

  • Strong Hire: Chooses latent diffusion with clear justification against alternatives, mentions fine-tuning from a pre-trained checkpoint, provides specific fast sampling methods with latency estimates, considers inference cost.
  • Lean Hire: Chooses diffusion but does not address the 2-second constraint or does not mention fine-tuning from a pre-trained model.
  • No Hire: Chooses a GAN or VAE without addressing their fundamental limitations for this use case.

Interview Cheat Sheet

| Concept | Key Formula | One-Liner | Red Flag |
|---|---|---|---|
| VAE ELBO | $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \text{KL}(q(z \mid x) \,\Vert\, p(z))$ | | |
| Reparameterization trick | $z = \mu + \sigma \odot \epsilon$ | Makes sampling differentiable w.r.t. encoder | "Makes the model differentiable" |
| GAN objective | $\min_G \max_D \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1-D(G(z)))]$ | Generator fools discriminator | "Generator maximizes" (it minimizes) |
| Optimal discriminator | $D^* = p_{\text{data}} / (p_{\text{data}} + p_G)$ | Ratio of real to total density | Not being able to derive this |
| Mode collapse | Generator produces low-diversity output | Adversarial training instability | "Just train longer to fix it" |
| WGAN | Wasserstein distance + Lipschitz constraint | Meaningful gradients everywhere | "WGAN removes the log" |
| DDPM objective | $\mathbb{E}[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2]$ | Predict the noise that was added | "Diffusion predicts clean images" |
| Forward process | $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ | Gradually add noise to data | Not knowing you can jump to any $t$ directly |
| Score function | $\nabla_x \log p(x)$ | Direction toward higher probability | Confusing score with loss |
| Classifier-free guidance | $\epsilon_{\text{guided}} = \epsilon_u + s(\epsilon_c - \epsilon_u)$ | Amplify conditional signal | "Higher guidance always better" |
| Latent diffusion | Diffusion in compressed space | Speed via compression | "Same as pixel diffusion" |
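Three of the formulas in the cheat sheet (forward process, DDPM objective, classifier-free guidance) are short enough to write out directly. The sketch below is a minimal standard-library illustration, not a training implementation: the linear beta schedule from 1e-4 to 0.02 over 1000 steps is an assumed, commonly used default, timesteps are 0-indexed, and the function names (`make_alpha_bar`, `q_sample`, `ddpm_loss`, `cfg`) are hypothetical. Plain Python lists stand in for image tensors.

```python
import math
import random

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for an (assumed) linear beta schedule."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta_t = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta_t
        alpha_bar.append(running)
    return alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar[t])
    noise_scale = math.sqrt(1.0 - alpha_bar[t])
    x_t = [signal_scale * xi + noise_scale * ei for xi, ei in zip(x0, eps)]
    return x_t, eps

def ddpm_loss(eps, eps_pred):
    """DDPM objective for one sample: mean squared error on the added noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

def cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_u + s * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

rng = random.Random(0)
alpha_bar = make_alpha_bar()
x0 = [0.5, -1.0, 2.0]

# No loop over intermediate steps is needed to reach t = 500.
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)

print(ddpm_loss(eps, eps))               # a perfect noise predictor gives 0.0
print(cfg([0.0, 1.0], [1.0, 3.0], 1.0))  # scale 1 recovers the conditional prediction
print(cfg([0.0, 1.0], [1.0, 3.0], 7.5))  # larger scales extrapolate past it
```

Note that `cfg` with a scale above 1 extrapolates beyond the conditional prediction rather than interpolating toward it, which is why very high guidance scales trade diversity (and eventually quality) for prompt adherence.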

Spaced Repetition Checkpoints

Day 0 - Initial Learning

  • Read this entire page
  • Derive the ELBO on paper without looking
  • Write the GAN minimax objective and prove the optimal discriminator
  • Write the DDPM forward process and training objective
  • Fill out the grand comparison table from memory

Day 3 - First Recall

  • Without notes, derive the ELBO and explain the gap
  • Give the "60-Second Answer" for diffusion models out loud, timed
  • Explain why mode collapse happens in GANs but not diffusion models
  • Explain the reparameterization trick and why it is needed

Day 7 - Connections

  • Compare all five generative model families in a table from memory
  • Explain the connection between DDPM noise prediction and score matching
  • Do Practice Problem 1 (ELBO derivation) on a whiteboard
  • Explain classifier-free guidance and how guidance scale affects quality/diversity

Day 14 - Application

  • Do Practice Problem 3 (diffusion model design) under timed conditions (15 minutes)
  • Do Practice Problem 5 (model selection for a startup) under timed conditions (10 minutes)
  • Explain the Stable Diffusion architecture end-to-end
  • Review any derivations you hesitated on

Day 21 - Mock Interview

  • Have someone ask: "Compare VAEs, GANs, and diffusion models - training objectives, strengths, weaknesses"
  • Time yourself: full answer should take 5-8 minutes
  • Do all 5 practice problems in sequence under timed conditions (55 minutes total)
  • Can you whiteboard the DDPM training algorithm from memory?

Key Takeaways

  1. Every generative model family makes different tradeoffs. VAEs give latent spaces but blurry outputs. GANs give sharp images but mode collapse. Diffusion models give the best quality and diversity but are slow. Understanding these tradeoffs is more valuable than memorizing any single model.

  2. The ELBO, minimax, and denoising objectives are the three pillars. If you can derive all three and explain their implications, you can handle any generative model interview question.

  3. Diffusion models dominate current practice. They combine excellent sample quality, mode coverage, easy conditioning, and stable training. The main challenge - slow sampling - is being addressed rapidly through DDIM, consistency models, and distillation.

  4. The practical answer is often "fine-tune a pre-trained diffusion model." For most applications, training from scratch is unnecessary. Understanding how to adapt Stable Diffusion or similar models to your domain is the most valuable practical skill.

  5. Know the math, but also know the engineering. Interviewers at top labs want both the derivations and the practical knowledge - latent vs pixel space, classifier-free guidance parameters, sampling speed budgets, and inference cost calculations.

© 2026 EngineersOfAI. All rights reserved.