Generative Models - VAEs, GANs, Diffusion, and Beyond
Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, Applied Scientist, Computer Vision Engineer
The Real Interview Moment
You are in an OpenAI research engineer interview. The interviewer asks: "Compare VAEs, GANs, and diffusion models. For each, give me the training objective, the key mathematical insight, and when you would choose one over the others." You start with VAEs - "they maximize a variational lower bound on the log-likelihood..." - and the interviewer nods. You move to GANs - "minimax objective, the generator minimizes while the discriminator maximizes..." She interrupts: "Why does mode collapse happen in GANs but not in diffusion models? Derive the connection."
This is not a question you can answer with surface-level knowledge. It requires understanding the fundamental difference between adversarial training (which has an unstable equilibrium) and score-based denoising (which has a well-defined regression target at each noise level). Candidates who can only describe the architectures get a "lean hire." Candidates who can derive the objectives, explain the failure modes, and compare the paradigms at a mathematical level get a "strong hire."
This page gives you the complete mathematical foundation for every major generative model family, with the depth needed to handle follow-up questions at top AI labs.
What You Will Master
- Derive the ELBO (Evidence Lower Bound) for VAEs from first principles
- Explain the reparameterization trick and why it is necessary
- State the GAN minimax objective and prove the optimal discriminator
- Diagnose mode collapse, training instability, and solutions (WGAN, spectral norm)
- Derive the forward and reverse diffusion processes in DDPM
- Connect score matching to denoising diffusion
- Compare VAE vs GAN vs Diffusion vs Autoregressive vs Flow models with precise tradeoffs
- Explain modern applications: text-to-image, video generation, 3D generation
Self-Assessment: Where Are You Now?
| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Derive the ELBO for VAEs | | | | | | ___ |
| Explain reparameterization trick | | | | | | ___ |
| State GAN objective and optimal D | | | | | | ___ |
| Explain mode collapse and solutions | | | | | | ___ |
| Derive DDPM forward/reverse process | | | | | | ___ |
| Explain score matching | | | | | | ___ |
| Compare all generative model families | | | | | | ___ |
| Explain text-to-image diffusion | | | | | | ___ |
Target: All 4s and 5s before your interview.
Part 1 - Variational Autoencoders (VAEs)
The Generative Modeling Problem
We have data $x$ drawn from an unknown distribution $p_{\text{data}}(x)$. We want to learn a model $p_\theta(x)$ that approximates this distribution so we can:
- Evaluate the probability of new data (density estimation)
- Generate new samples (sampling)
- Learn meaningful latent representations (representation learning)
The Latent Variable Model
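A VAE introduces latent variables $z$ and models:
$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$
Where $p(z) = \mathcal{N}(0, I)$ is a simple prior and $p_\theta(x \mid z)$ is a neural network (decoder) that maps $z$ to $x$.
The problem: This integral is intractable - we cannot compute $p_\theta(x)$ directly because we would need to integrate over all possible $z$ values.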
Deriving the ELBO
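We want to maximize $\log p_\theta(x)$. Since we cannot compute it directly, we derive a lower bound.
Start with the log-likelihood and introduce an approximate posterior $q_\phi(z \mid x)$:
$$\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz$$
Multiply and divide by $q_\phi(z \mid x)$:
$$\log p_\theta(x) = \log \int q_\phi(z \mid x)\, \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]$$
By Jensen's inequality ($\log$ is concave):
$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]$$
Expanding the logarithm:
$$\mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \text{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
This is the ELBO (Evidence Lower Bound):
$$\mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction}} - \underbrace{\text{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{regularization}}$$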
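The bound can be checked numerically on a model small enough to solve by hand. The sketch below assumes a hypothetical 1D toy model (prior $z \sim \mathcal{N}(0, 1)$, decoder $x \mid z \sim \mathcal{N}(z, 1)$), chosen only because its evidence and true posterior are available in closed form; the function names are illustrative and not part of any VAE library.

```python
import math

def log_evidence(x):
    """Exact log p(x) for the toy model z ~ N(0, 1), x | z ~ N(z, 1), so x ~ N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0

def elbo(x, m, s2):
    """Closed-form ELBO for an approximate posterior q(z | x) = N(m, s2)."""
    # Reconstruction term: E_q[log N(x; z, 1)]
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # Regularization term: KL(N(m, s2) || N(0, 1))
    kl = 0.5 * (m * m + s2 - 1.0 - math.log(s2))
    return recon - kl

x = 1.5
loose = elbo(x, m=0.0, s2=1.0)        # q set to the prior: a valid but loose bound
tight = elbo(x, m=x / 2.0, s2=0.5)    # q set to the true posterior N(x/2, 1/2)
exact = log_evidence(x)
print(loose, tight, exact)
```

In this toy setting the bound is strictly below the evidence when $q$ ignores $x$, and it becomes tight when $q$ matches the true posterior.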
"A VAE is a latent variable model that learns to generate data by encoding inputs into a latent space and decoding them back. The training objective is the ELBO - a lower bound on the log-likelihood - which has two terms: a reconstruction loss that encourages accurate decoding, and a KL divergence that regularizes the latent space to be close to a standard Gaussian. The key innovation is the reparameterization trick: instead of sampling z directly from the encoder's distribution (which is not differentiable), we express z as mu plus sigma times epsilon where epsilon is sampled from a standard normal. This allows gradients to flow through the sampling step."
The Reparameterization Trick
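The problem: We need to compute $\nabla_\phi\, \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]$. But the expectation is over a distribution that depends on $\phi$ - we cannot simply backpropagate through a sampling operation.
The solution: Instead of $z \sim q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$, write:
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
Now $z$ is a deterministic function of $\phi$ (through $\mu_\phi$ and $\sigma_\phi$) plus independent noise. Gradients with respect to $\phi$ flow through $\mu_\phi$ and $\sigma_\phi$ without issues.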
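A minimal sketch of the idea, assuming a scalar latent and the illustrative test function $f(z) = z^2$ (not a real VAE loss): because $z = \mu + \sigma \epsilon$, the chain rule gives per-sample gradients with respect to $\mu$ and $\sigma$, and their Monte Carlo average can be compared against the analytic gradients of $\mathbb{E}[z^2] = \mu^2 + \sigma^2$.

```python
import random

def reparam_grads(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2] with z = mu + sigma * eps."""
    rng = random.Random(seed)
    g_mu = g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on the parameters
        z = mu + sigma * eps        # deterministic in (mu, sigma) once eps is fixed
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        g_mu += df_dz               # chain rule with dz/dmu = 1
        g_sigma += df_dz * eps      # chain rule with dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

g_mu, g_sigma = reparam_grads(mu=1.0, sigma=0.5)
print(g_mu, g_sigma)  # analytic gradients are 2 * mu and 2 * sigma
```

Autodiff frameworks apply the same chain rule automatically once the sample is written as a function of $\mu$, $\sigma$, and external noise.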
Do NOT say "the reparameterization trick makes the VAE differentiable." The VAE is already differentiable - the encoder and decoder are neural networks. The reparameterization trick makes the sampling step differentiable with respect to the encoder parameters. Without it, you cannot train the encoder via backpropagation because sampling is a stochastic, non-differentiable operation.
VAE Limitations
- Blurry outputs: The pixel-wise reconstruction loss (MSE or binary cross-entropy) encourages averaging over modes, producing blurry images rather than sharp ones.
- Posterior collapse: The KL term can dominate, causing the encoder to ignore the input and produce $q_\phi(z \mid x) \approx p(z)$. The decoder then ignores $z$ and generates the same output regardless of input.
- Limited expressiveness: The Gaussian assumption for both prior and posterior limits the complexity of the latent space.
Solutions: Beta-VAE (weight the KL term), VQ-VAE (discrete latent space), hierarchical VAEs (NVAE, VDVAE).
Part 2 - Generative Adversarial Networks (GANs)
The Minimax Objective
A GAN consists of two networks:
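- Generator $G$: maps noise $z \sim p_z(z)$ to fake data $G(z)$
- Discriminator $D$: classifies data as real or fake
The training objective is a minimax game:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$$
The discriminator tries to maximize $V$ - assign high probability to real data and low probability to fake data. The generator tries to minimize $V$ - fool the discriminator into assigning high probability to fake data.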
The Optimal Discriminator
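For a fixed generator $G$, the optimal discriminator is:
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$
where $p_g$ is the distribution of generated samples $G(z)$.
Proof: The discriminator maximizes:
$$V(D, G) = \int \left[ p_{\text{data}}(x) \log D(x) + p_g(x) \log\left(1 - D(x)\right) \right] dx$$
Taking the derivative with respect to $D(x)$ and setting it to zero:
$$\frac{p_{\text{data}}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0$$
Solving: $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$.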
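As a sanity check on the pointwise argument, the maximizer of $a \log D + b \log(1 - D)$ can be found by brute force and compared with $a / (a + b)$. The density values below are arbitrary assumptions standing in for $p_{\text{data}}(x)$ and $p_g(x)$ at one fixed $x$.

```python
import math

def integrand(d, a, b):
    """Pointwise GAN objective a * log(D) + b * log(1 - D) at a fixed x."""
    return a * math.log(d) + b * math.log(1.0 - d)

def argmax_on_grid(a, b, n=9999):
    """Brute-force maximizer of the integrand over a uniform grid on (0, 1)."""
    grid = [(i + 1) / (n + 1) for i in range(n)]
    return max(grid, key=lambda d: integrand(d, a, b))

a, b = 0.3, 0.1                     # assumed values of p_data(x) and p_g(x)
d_numeric = argmax_on_grid(a, b)
d_closed_form = a / (a + b)
print(d_numeric, d_closed_form)
```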
Substituting the Optimal Discriminator
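When $D = D^*$, the game value becomes:
$$V(D^*, G) = -\log 4 + 2\, \text{JSD}\left(p_{\text{data}} \,\|\, p_g\right)$$
Where JSD is the Jensen-Shannon Divergence. This means the generator is minimizing the JSD between the real and generated distributions. The global optimum is $p_g = p_{\text{data}}$, where $D^*(x) = \frac{1}{2}$ and $V(D^*, G) = -\log 4$.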
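The identity $V(D^*, G) = -\log 4 + 2\,\text{JSD}$ can be verified on small discrete distributions, where the integral becomes a sum. This is a sketch with made-up three-outcome distributions; the helper names are illustrative only.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """V(D*, G) with D*(x) = p_data(x) / (p_data(x) + p_g(x)), summed over outcomes."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        d_star = pd / (pd + pg)
        total += pd * math.log(d_star) + pg * math.log(1.0 - d_star)
    return total

p_data = [0.5, 0.3, 0.2]   # assumed "real" distribution
p_g = [0.2, 0.2, 0.6]      # assumed generator distribution
lhs = value_at_optimal_d(p_data, p_g)
rhs = -math.log(4) + 2 * jsd(p_data, p_g)
print(lhs, rhs)
```

Setting `p_g` equal to `p_data` drives the JSD to zero, leaving the value at $-\log 4$.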
Deriving the optimal discriminator and showing that the GAN objective reduces to JSD minimization is a classic interview question at Google Research, OpenAI, and DeepMind. The derivation is straightforward but you must be able to do it on a whiteboard without notes. Practice it until you can complete it in under 3 minutes.
Mode Collapse
What it is: The generator finds a single output (or a few outputs) that fools the discriminator and produces only those, ignoring the diversity of the real data distribution.
Why it happens: The generator's gradient points toward the mode that currently fools the discriminator best. Once it finds such a mode, the gradient reinforces it rather than encouraging exploration of other modes.
Solutions to GAN Training Instability
| Technique | Key Idea | Paper |
|---|---|---|
| WGAN | Replace JSD with Wasserstein distance, clip weights | Arjovsky et al., 2017 |
| WGAN-GP | Gradient penalty instead of weight clipping | Gulrajani et al., 2017 |
| Spectral Normalization | Normalize discriminator weights by spectral norm | Miyato et al., 2018 |
| Progressive Growing | Start with low-res, gradually increase | Karras et al., 2018 |
| StyleGAN | Style-based generator with mapping network | Karras et al., 2019 |
| Minibatch Discrimination | Discriminator sees batches, not individuals | Salimans et al., 2016 |
Wasserstein GAN (WGAN)
The key insight: When the real and generated distributions have non-overlapping support (common in high dimensions), the JSD saturates at its maximum value $\log 2$ no matter how far apart the distributions are. A constant divergence gives the generator zero gradient - it cannot learn.
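The Wasserstein distance (Earth Mover's Distance) provides a meaningful gradient even when distributions do not overlap:
$$W(p_{\text{data}}, p_g) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\|x - y\|\right]$$
By the Kantorovich-Rubinstein duality:
$$W(p_{\text{data}}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_{\text{data}}}\left[f(x)\right] - \mathbb{E}_{x \sim p_g}\left[f(x)\right]$$
Where the supremum is over all 1-Lipschitz functions $f$. The discriminator (now called "critic") approximates this Lipschitz function.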
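A small worked example makes the contrast concrete. The sketch assumes the simplest non-overlapping case, a point mass at 0 versus a point mass at $\theta$ on the real line, and represents each distribution as a dict from location to probability (an illustrative encoding, not a library API). The 1D Wasserstein-1 distance is computed as the area between the two CDFs.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence between discrete distributions {location: prob}."""
    total = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        m = 0.5 * (px + qx)
        if px > 0:
            total += 0.5 * px * math.log(px / m)
        if qx > 0:
            total += 0.5 * qx * math.log(qx / m)
    return total

def wasserstein_1(p, q):
    """1D Wasserstein-1 distance: integral of |CDF_p - CDF_q| over the real line."""
    points = sorted(set(p) | set(q))
    total, cdf_gap = 0.0, 0.0
    for left, right in zip(points, points[1:]):
        cdf_gap += p.get(left, 0.0) - q.get(left, 0.0)
        total += abs(cdf_gap) * (right - left)
    return total

real = {0.0: 1.0}
for theta in (0.1, 1.0, 5.0):
    fake = {theta: 1.0}
    print(theta, jsd(real, fake), wasserstein_1(real, fake))
```

The JSD column stays flat while the Wasserstein column shrinks as the generated mass moves toward the real data, which is the signal a generator needs.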
Enforcing the Lipschitz constraint:
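- Weight clipping (WGAN): Clip all weights to $[-c, c]$. Simple but leads to capacity underuse and exploding/vanishing gradients.
- Gradient penalty (WGAN-GP): Add $\lambda\, \mathbb{E}_{\hat{x}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]$ to the loss, where $\hat{x}$ is a random interpolation between real and fake samples.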
- Spectral normalization: Divide each weight matrix by its spectral norm (largest singular value). Guarantees 1-Lipschitz without any penalty term.
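To illustrate the spectral normalization bullet, the sketch below estimates the largest singular value of a small weight matrix by power iteration and divides the matrix by it. It is written in plain Python with nested lists purely for readability; practical implementations run on framework tensors and typically reuse a single power-iteration step per training update, which is an efficiency detail not shown here.

```python
import math

def matvec(w, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def transpose(w):
    return [list(col) for col in zip(*w)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def spectral_norm(w, n_iters=100):
    """Largest singular value of w, estimated by power iteration on w^T w."""
    wt = transpose(w)
    v = [1.0] * len(w[0])   # assumes the start vector is not orthogonal to the top singular vector
    for _ in range(n_iters):
        v = matvec(wt, matvec(w, v))
        scale = norm(v)
        v = [x / scale for x in v]
    return norm(matvec(w, v))

def spectrally_normalize(w):
    """Rescale w so that its largest singular value is 1."""
    sigma = spectral_norm(w)
    return [[wij / sigma for wij in row] for row in w]

w = [[3.0, 0.0], [0.0, 1.0]]   # toy weight matrix with singular values 3 and 1
w_sn = spectrally_normalize(w)
print(spectral_norm(w), spectral_norm(w_sn))
```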
Never say "WGAN just removes the log from the GAN loss." The mathematical change is fundamental - switching from JSD to Wasserstein distance, which requires the Lipschitz constraint. The removal of log is a consequence, not the cause. Saying this suggests you memorized a formula without understanding the theory.
Part 3 - Diffusion Models
The Core Idea
Diffusion models learn to reverse a gradual noising process. Start with clean data $x_0$, add noise step by step until you get pure Gaussian noise $x_T$, then learn a neural network that reverses each step.
Forward Process (Adding Noise)
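The forward process is a fixed Markov chain that gradually adds Gaussian noise:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$
Where $\beta_t$ is a small noise schedule (typically increasing from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$).
Key property: We can sample $x_t$ directly from $x_0$ without iterating through all steps:
$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right)$$
Where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
This means: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
At $t = T$ (with large $T$), $\bar{\alpha}_T \approx 0$ and $x_T \approx \epsilon$ - pure noise.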
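The closed-form jump to any timestep can be sketched in a few lines. The example assumes a linear schedule from $10^{-4}$ to $0.02$ over $T = 1000$ steps (the values used in the DDPM paper) and a toy three-value "image"; the function names are hypothetical.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 ... beta_T."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 0-based index into alpha_bars."""
    a_bar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * xi + math.sqrt(1.0 - a_bar) * ei for xi, ei in zip(x0, eps)]
    return xt, eps

T = 1000
alpha_bars = cumulative_alphas(linear_beta_schedule(T))
rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]                                    # toy "image"
x_early, _ = q_sample(x0, 0, alpha_bars, rng)             # almost identical to x0
x_late, eps_late = q_sample(x0, T - 1, alpha_bars, rng)   # almost identical to eps_late
print(alpha_bars[0], alpha_bars[-1])
```

Under this schedule the final cumulative product is small enough that $x_T$ is dominated by the noise term.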
Reverse Process (Denoising)
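The reverse process is parameterized by a neural network:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$$
The network $\mu_\theta$ predicts the mean of the denoising step at each timestep $t$.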
DDPM Training Objective (Ho et al., 2020)
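After simplification, the training objective reduces to:
$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]$$
Where $\epsilon_\theta$ is the neural network predicting the noise that was added. The training procedure:
- Sample a clean image $x_0$ from the dataset
- Sample a random timestep $t \sim \text{Uniform}\{1, \dots, T\}$
- Sample noise $\epsilon \sim \mathcal{N}(0, I)$
- Compute noisy image: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$
- Train the network to predict $\epsilon$ from $x_t$ and $t$: minimize $\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2$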
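The five steps above can be sketched as a loop on a deliberately tiny problem. Everything here is an assumption made for illustration: the data is 1D with $x_0 \sim \mathcal{N}(0, 1)$, the schedule has only 10 steps, and the "network" is a hypothetical table of one weight per timestep, $\epsilon_\theta(x_t, t) = w_t x_t$, trained by hand-written SGD. For this toy setup the best linear predictor is $w_t = \sqrt{1 - \bar{\alpha}_t}$, which gives something to compare against.

```python
import math
import random

T = 10
betas = [0.05 * (t + 1) for t in range(T)]   # toy schedule: 0.05, 0.10, ..., 0.50
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

# Hypothetical stand-in for the denoising network: eps_theta(x_t, t) = weights[t] * x_t
weights = [0.0] * T

def training_step(rng, lr=0.002):
    x0 = rng.gauss(0.0, 1.0)        # 1. sample clean data (toy: x0 ~ N(0, 1))
    t = rng.randrange(T)            # 2. sample a timestep uniformly (0-based)
    eps = rng.gauss(0.0, 1.0)       # 3. sample noise
    a_bar = alpha_bars[t]
    xt = math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps   # 4. closed-form noising
    error = weights[t] * xt - eps   # 5. predicted noise minus true noise
    weights[t] -= lr * 2.0 * error * xt   # SGD step on the squared error
    return error * error

rng = random.Random(0)
losses = [training_step(rng) for _ in range(200_000)]
targets = [math.sqrt(1.0 - a) for a in alpha_bars]
print([round(w, 2) for w in weights])
print([round(w, 2) for w in targets])
```

A real model replaces the weight table with a U-Net or transformer and the manual update with an optimizer, but the sampling of $t$, $\epsilon$, and $x_t$ follows the same pattern.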
"Diffusion models learn to generate data by reversing a gradual noising process. The forward process adds Gaussian noise step by step until the data becomes pure noise. A neural network learns to reverse each step - given a noisy image and the noise level, it predicts the noise that was added. At generation time, we start with random noise and iteratively denoise to produce a clean image. The training is simple - just mean squared error between the predicted and actual noise. Unlike GANs, there is no adversarial training and no mode collapse. Unlike VAEs, there is no posterior approximation or blurriness. The main drawback is slow sampling - generating an image requires hundreds to thousands of network evaluations."
Score Matching Connection
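The noise prediction $\epsilon_\theta$ is closely related to the score function - the gradient of the log probability:
$$\nabla_{x_t} \log q(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$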
This connects DDPM to score-based generative models (Song & Ermon, 2019). The network is learning to estimate the direction toward higher probability at each noise level.
Why this matters: Score matching provides a clean theoretical framework. The denoising objective is equivalent to estimating the score function at every noise level, and Langevin dynamics (following the score with added noise) produces samples from the learned distribution.
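Langevin dynamics can be sketched with a known score standing in for a learned one. The example assumes a 1D Gaussian target $\mathcal{N}(2, 1)$, whose score is available in closed form, and uses the unadjusted update $x \leftarrow x + \frac{\eta}{2} \nabla_x \log p(x) + \sqrt{\eta}\, \xi$; the step size and chain counts are arbitrary choices.

```python
import math
import random

def score_gaussian(x, mu=2.0, sigma=1.0):
    """Exact score d/dx log N(x; mu, sigma^2); stands in for a learned score network."""
    return -(x - mu) / (sigma * sigma)

def langevin_sample(score, n_steps=300, step_size=0.05, rng=None):
    """Run one unadjusted Langevin chain from x = 0 and return its final state."""
    rng = rng or random.Random()
    x = 0.0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * noise
    return x

rng = random.Random(0)
samples = [langevin_sample(score_gaussian, rng=rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

With a finite step size the chains target the distribution only approximately, so the sample statistics land close to, not exactly on, the target mean and variance.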
Sampling Methods
| Method | Steps | Speed | Quality | How It Works |
|---|---|---|---|---|
| DDPM | 1000 | Slow | High | Original reverse process, one step at a time |
| DDIM | 50-100 | Medium | High | Deterministic sampling, skip steps |
| DPM-Solver | 10-25 | Fast | High | ODE solver for the reverse process |
| Consistency Models | 1-2 | Very fast | Good | Distilled to predict final output directly |
| Latent Diffusion | 20-50 | Fast | High | Diffusion in latent space, not pixel space |
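
To make the DDIM row concrete, here is a sketch of a single deterministic DDIM update (the $\eta = 0$ case): it first converts the noise prediction into an estimate of $x_0$, then re-noises that estimate to the target noise level. The arguments `a_bar_t` and `a_bar_prev` are the cumulative products $\bar{\alpha}$ at the current and target steps; the function name and list-based vectors are illustrative assumptions.

```python
import math

def ddim_step(xt, eps_pred, a_bar_t, a_bar_prev):
    """One deterministic DDIM update from noise level a_bar_t to a_bar_prev."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = [(x - math.sqrt(1.0 - a_bar_t) * e) / math.sqrt(a_bar_t)
               for x, e in zip(xt, eps_pred)]
    # Move to the target noise level along the same noise direction.
    return [math.sqrt(a_bar_prev) * x0 + math.sqrt(1.0 - a_bar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]

# Toy check with a perfect noise prediction: jumping to a_bar = 1 recovers x0.
x0 = [1.0, -2.0]
eps = [0.3, 0.7]
a_bar_t = 0.2
xt = [math.sqrt(a_bar_t) * x + math.sqrt(1.0 - a_bar_t) * e for x, e in zip(x0, eps)]
recovered = ddim_step(xt, eps, a_bar_t, 1.0)
print(recovered)
```

Because the update depends only on the two noise levels, the target step does not have to be the immediately preceding one, which is what allows skipping steps.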
Why Diffusion Models Avoid Mode Collapse
GANs suffer from mode collapse because the generator optimizes against a single discriminator - it can find shortcuts. Diffusion models avoid this because:
- No adversarial training: The objective is simple regression (predict the noise).
- Multi-scale denoising: The network must handle every noise level, from nearly clean to pure noise. This forces it to learn the full data distribution at every scale.
- Stable training: MSE loss with random timestep sampling provides consistent, low-variance gradients.
Do NOT say "diffusion models are just VAEs with many latent variables." While there is a mathematical connection (the DDPM ELBO can be written as a sequence of KL terms), the training procedure and the resulting model behavior are fundamentally different. Diffusion models do not learn an encoder or a latent representation - they learn to denoise. The connection is theoretical, not practical.
Part 4 - Flow-Based Models
Core Idea
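Flow-based models learn an invertible transformation $f_\theta$ from a simple distribution (Gaussian) to the data distribution:
$$x = f_\theta(z), \quad z \sim \mathcal{N}(0, I)$$
Because $f_\theta$ is invertible, we can compute exact log-likelihoods via the change of variables formula:
$$\log p_\theta(x) = \log p_z\left(f_\theta^{-1}(x)\right) + \log \left| \det \frac{\partial f_\theta^{-1}(x)}{\partial x} \right|$$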
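The change of variables formula is easiest to verify on the simplest possible flow. The sketch assumes a 1D affine map $x = a z + b$ with $z \sim \mathcal{N}(0, 1)$, for which the exact answer is the log-density of $\mathcal{N}(b, a^2)$; the function names are illustrative.

```python
import math

def std_normal_logpdf(z):
    return -0.5 * math.log(2 * math.pi) - 0.5 * z * z

def affine_flow_logpdf(x, a, b):
    """log p(x) for the flow x = a * z + b, computed via change of variables."""
    z = (x - b) / a                        # inverse transformation f^{-1}(x)
    log_det_inverse = -math.log(abs(a))    # log |d f^{-1} / d x|
    return std_normal_logpdf(z) + log_det_inverse

def normal_logpdf(x, mean, std):
    """Reference log-density of N(mean, std^2)."""
    return -0.5 * math.log(2 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2

a, b = 2.0, -1.0
for x in (-3.0, 0.0, 1.5):
    print(x, affine_flow_logpdf(x, a, b), normal_logpdf(x, b, abs(a)))
```

Real flows stack many such invertible layers and sum their log-determinant terms.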
Key Models
| Model | Key Innovation | Limitation |
|---|---|---|
| NICE (2014) | Additive coupling layers | Limited expressiveness |
| RealNVP (2017) | Affine coupling layers | Still limited by coupling structure |
| Glow (2018) | 1x1 convolutions + squeeze | Better quality but expensive |
| Neural ODE / FFJORD (2018-2019) | Continuous-time flows | Expensive ODE solves for training and sampling |
Why Flows Are Less Popular
- Invertibility constraint: The requirement that $f_\theta$ be invertible limits the architecture. You cannot use standard residual connections, attention, or convolutions freely.
- Computational cost: Computing the Jacobian determinant is expensive, even with clever architectures.
- Sample quality: Despite exact likelihoods, flow models generally produce lower-quality samples than GANs or diffusion models at the same model size.
Where flows still shine: Density estimation tasks where exact likelihoods are needed, and as components of larger systems (e.g., normalizing flows in VAE posteriors).
Part 5 - The Grand Comparison
Head-to-Head Comparison
| Property | VAE | GAN | Diffusion | Autoregressive | Flow |
|---|---|---|---|---|---|
| Training objective | ELBO (lower bound) | Minimax game | Denoising MSE | Exact log-likelihood | Exact log-likelihood |
| Training stability | Stable | Unstable | Very stable | Stable | Stable |
| Sample quality | Blurry | Sharp | Excellent | Excellent | Moderate |
| Mode coverage | Good | Poor (mode collapse) | Excellent | Excellent | Good |
| Sampling speed | Fast (one pass) | Fast (one pass) | Slow (iterative) | Slow (sequential) | Fast (one pass) |
| Exact likelihood | Lower bound only | No | Lower bound only | Yes | Yes |
| Latent space | Yes (continuous) | Yes (input noise) | No explicit | No | Yes (invertible) |
| Diversity control | Temperature on z | Truncation trick | Guidance scale | Temperature | Temperature on z |
| Key weakness | Blurry outputs | Mode collapse | Slow generation | Sequential, slow | Limited architecture |
When to Choose Each
At OpenAI and Stability AI, diffusion model expertise is almost mandatory. At Meta, GAN knowledge is still valued (StyleGAN for faces, GAN upscaling). At Google, autoregressive models dominate (Gemini); Imagen uses diffusion, but teams there also work on autoregressive image generation. For startups, understanding the speed-quality tradeoff is most important - they often need fast inference.
Part 6 - Modern Applications
Text-to-Image: The Stable Diffusion Architecture
The key innovation in Stable Diffusion (Rombach et al., 2022) is latent diffusion: run the diffusion process in a compressed latent space rather than pixel space.
- Autoencoder: Compress images from pixel space ($512 \times 512 \times 3$) to latent space ($64 \times 64 \times 4$) - a 48x compression
- Diffusion model: Operate the forward/reverse process in latent space - much faster
- Conditioning: Use cross-attention with text embeddings (from CLIP) to guide generation
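Classifier-Free Guidance (CFG): At generation time, run the model twice - once with the text condition and once without. Amplify the difference:
$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + s \cdot \left(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\right)$$
Where $s$ is the guidance scale. Higher $s$ means stronger adherence to the text prompt but less diversity.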
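The guidance rule is a one-line extrapolation between the two noise predictions. In the sketch below, `eps_uncond` and `eps_cond` are assumed to be the outputs of the same denoiser run without and with the text condition, represented as plain lists for illustration.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move `scale` times the (conditional - unconditional) difference."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # toy unconditional noise prediction
eps_cond = [1.0, 1.0]     # toy text-conditioned noise prediction

print(cfg_combine(eps_uncond, eps_cond, 0.0))   # [0.0, 1.0]
print(cfg_combine(eps_uncond, eps_cond, 1.0))   # [1.0, 1.0]
print(cfg_combine(eps_uncond, eps_cond, 7.5))   # [7.5, 1.0]
```

A scale of 0 ignores the prompt, a scale of 1 recovers the plain conditional prediction, and larger scales push past it along the direction the prompt introduced.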
Video Generation
Video diffusion models extend image diffusion along the temporal axis:
- Temporal attention: Add temporal self-attention layers to the 2D U-Net to attend across frames
- Temporal convolutions: 3D convolutions or pseudo-3D (2D spatial + 1D temporal) convolutions
- Frame conditioning: Generate keyframes first, then interpolate
Major models: Sora (OpenAI), Veo (Google), Gen-3 (Runway), Kling (Kuaishou).
3D Generation
Diffusion models for 3D content:
- Point-E / Shap-E (OpenAI): Diffusion in point cloud or implicit function space
- DreamFusion: Use 2D diffusion model as a prior to optimize a 3D representation (NeRF) via Score Distillation Sampling
- Zero-1-to-3: Generate novel views from a single image using view-conditioned diffusion
Practice Problems
Problem 1: ELBO Derivation
Derive the ELBO from scratch, starting from $\log p_\theta(x)$. Show every step, including where Jensen's inequality is applied and what the gap between the ELBO and the true log-likelihood represents.
Hint 1 - Direction
Start by writing $\log p_\theta(x) = \log \int p_\theta(x, z)\, dz$. Introduce $q_\phi(z \mid x)$ by multiplying and dividing. Apply Jensen's inequality.
Hint 2 - Insight
The gap between $\log p_\theta(x)$ and the ELBO is exactly $\text{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)$ - the KL divergence between the approximate and true posterior. This is always non-negative, confirming the ELBO is indeed a lower bound. As $q_\phi(z \mid x)$ becomes a better approximation of the true posterior, the gap shrinks.
Hint 3 - Full Solution + Rubric
Complete derivation:
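Starting from:
$$\log p_\theta(x) = \log \int p_\theta(x, z)\, dz$$
Introduce $q_\phi(z \mid x)$:
$$\log p_\theta(x) = \log \int q_\phi(z \mid x)\, \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$
By Jensen's inequality ($\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$):
$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$
Expanding $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$:
$$\text{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \text{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
The gap: Alternatively, we can write:
$$\log p_\theta(x) = \text{ELBO} + \text{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)$$
Since KL divergence is always $\geq 0$, the ELBO $\leq \log p_\theta(x)$. The gap is the KL between the approximate posterior $q_\phi(z \mid x)$ and the true posterior $p_\theta(z \mid x)$.
To prove the gap identity, start from:
$$\text{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right] = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(x, z)}\right] + \log p_\theta(x) = -\text{ELBO} + \log p_\theta(x)$$
Rearranging: $\log p_\theta(x) = \text{ELBO} + \text{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)$.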
Scoring Rubric:
- Strong Hire: Complete derivation with Jensen's inequality clearly identified, both forms of the ELBO (reconstruction + KL), derives the gap identity, explains that the gap is the posterior approximation quality.
- Lean Hire: Gets the ELBO formula correct but skips the gap derivation or cannot explain what the gap represents.
- No Hire: Cannot derive the ELBO or confuses the direction of the KL divergence.
Problem 2: GAN Optimal Discriminator
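Prove that the optimal discriminator for a fixed generator is $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$. Then show that substituting $D^*$ into the GAN objective gives $-\log 4 + 2\, \text{JSD}\left(p_{\text{data}} \,\|\, p_g\right)$.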
Hint 1 - Direction
For the optimal D, differentiate the integrand of the GAN objective with respect to $D(x)$ and set to zero. For the JSD connection, substitute $D^*$ and use the definition of JSD.
Hint 2 - Insight
The integrand is $a \log D + b \log(1 - D)$, which is maximized at $D = \frac{a}{a + b}$. Substituting gives: two terms that each look like KL divergences to the mixture distribution $m = \frac{p_{\text{data}} + p_g}{2}$.
Hint 3 - Full Solution + Rubric
Part 1: Optimal discriminator
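The GAN objective:
$$V(D, G) = \int \left[ p_{\text{data}}(x) \log D(x) + p_g(x) \log\left(1 - D(x)\right) \right] dx$$
For each $x$, maximizing the integrand $f(D) = a \log D + b \log(1 - D)$ where $a = p_{\text{data}}(x)$, $b = p_g(x)$:
$$f'(D) = \frac{a}{D} - \frac{b}{1 - D} = 0 \implies D^* = \frac{a}{a + b} = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$
Part 2: JSD connection
Substituting $D^*$, let $m = \frac{p_{\text{data}} + p_g}{2}$:
$$V(D^*, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{2 m(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{2 m(x)}\right] = -\log 4 + \text{KL}\left(p_{\text{data}} \,\|\, m\right) + \text{KL}\left(p_g \,\|\, m\right) = -\log 4 + 2\, \text{JSD}\left(p_{\text{data}} \,\|\, p_g\right)$$
At the optimum where $p_g = p_{\text{data}}$: JSD = 0, so $V(D^*, G) = -\log 4$.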
Scoring Rubric:
- Strong Hire: Complete both derivations, identifies the JSD, states the global optimum condition.
- Lean Hire: Gets the optimal discriminator but cannot complete the JSD derivation.
- No Hire: Cannot set up the optimization problem for D.
Problem 3: Diffusion Model Design
You need to design a diffusion model for generating 1024x1024 images. Running DDPM directly in pixel space would be extremely slow. Design an efficient architecture and explain each choice.
Hint 1 - Direction
Think about the Stable Diffusion approach: compress to latent space first, then run diffusion there. What autoencoder would you use? What U-Net architecture for the denoiser? How would you handle conditioning?
Hint 2 - Insight
Latent diffusion: (1) Train a VQ-VAE or KL-VAE to compress 1024x1024x3 to 128x128x4 or 64x64x4. (2) Train a U-Net denoiser in latent space with attention at multiple scales. (3) Use cross-attention for text conditioning. (4) Use classifier-free guidance for quality. (5) DDIM or DPM-Solver for fast sampling in 20-50 steps.
Hint 3 - Full Solution + Rubric
Architecture design:
Stage 1: Autoencoder
- KL-regularized autoencoder (similar to Stable Diffusion's VAE)
- Encode 1024x1024x3 to 128x128x4 (8x downsampling per side: 64x fewer spatial positions, 48x fewer values)
- Train with reconstruction loss + perceptual loss + adversarial loss + small KL penalty
- This autoencoder is trained once and frozen
Stage 2: Latent Diffusion Model
- U-Net architecture with:
  - ResNet blocks at each resolution level
  - Self-attention at 64x64, 32x32, 16x16 resolutions
  - Cross-attention for text conditioning at the same resolutions
  - Timestep conditioning via sinusoidal embeddings + AdaGN (Adaptive Group Norm)
- Text encoder: CLIP ViT-L/14 or T5-XXL for text embeddings
- Noise schedule: linear or cosine, T=1000 training steps
Stage 3: Conditioning and Guidance
- Classifier-free guidance: train with 10% unconditional (drop text embedding)
- Guidance scale s=7.5 to 12 at inference
Stage 4: Fast Sampling
- DDIM with 50 steps, or DPM-Solver++ with 20 steps
- Optionally: progressive distillation to 4-8 steps
Memory and compute considerations:
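- Working in 128x128x4 instead of 1024x1024x3 gives the denoiser 64x fewer spatial positions, which reduces per-step compute by roughly 64x for convolutions and by more for self-attention
- U-Net with ~900M parameters (similar to Stable Diffusion 1.x)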
- Training: 256 A100 GPUs, ~2 weeks on 2B image-text pairs
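The compression figures are simple arithmetic and worth being able to reproduce on the spot. The helper below is a hypothetical utility that reports two ratios for a pixel shape and a latent shape given as (height, width, channels): the reduction in spatial positions and the reduction in total values.

```python
def compression(pixel_shape, latent_shape):
    """Return (spatial_ratio, value_ratio) between a pixel tensor and a latent tensor."""
    ph, pw, pc = pixel_shape
    lh, lw, lc = latent_shape
    spatial_ratio = (ph * pw) / (lh * lw)            # fewer positions for conv/attention
    value_ratio = (ph * pw * pc) / (lh * lw * lc)    # fewer numbers to store per image
    return spatial_ratio, value_ratio

print(compression((1024, 1024, 3), (128, 128, 4)))   # (64.0, 48.0)
print(compression((512, 512, 3), (64, 64, 4)))       # (64.0, 48.0)
```

The two ratios differ because the latent has 4 channels while the image has 3, so it helps to say which one a quoted "compression factor" refers to.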
Scoring Rubric:
- Strong Hire: Designs latent diffusion (not pixel-space), specifies autoencoder, U-Net architecture with attention, text conditioning via cross-attention, classifier-free guidance, fast sampling method, and discusses compute.
- Lean Hire: Suggests latent diffusion but missing details on conditioning, guidance, or sampling speed.
- No Hire: Proposes pixel-space diffusion for 1024x1024 or cannot describe the architecture.
Problem 4: Mode Collapse Diagnosis
You are training a GAN to generate diverse human faces. After 50K steps, the discriminator loss drops to near zero and the generator produces realistic but nearly identical faces (same pose, similar features). Diagnose and fix.
Hint 1 - Direction
Near-zero discriminator loss with low-diversity generation is classic mode collapse. Think about what signals the generator is receiving and why it converges to a single mode.
Hint 2 - Insight
The discriminator became too strong too fast - it can easily tell real from fake, but the generator found one mode that partially fools it and collapses there. Fixes: use WGAN-GP or spectral normalization to stabilize the discriminator, add minibatch discrimination for diversity, reduce discriminator update frequency, or switch to a diffusion model.
Hint 3 - Full Solution + Rubric
Diagnosis:
Mode collapse with discriminator domination. The generator finds the single face that maximizes $D(G(z))$ and produces only that face. The discriminator then achieves near-perfect accuracy because the generator's output is a single point (or a narrow mode).
Root cause analysis:
- Discriminator is too powerful relative to generator - may have more capacity or be updated more frequently
- Standard GAN loss (JSD) provides zero gradient when distributions do not overlap
- No explicit diversity encouragement in the objective
Fixes (in order of priority):
- Switch to WGAN-GP: Wasserstein distance provides meaningful gradients even when distributions do not overlap. Add gradient penalty ($\lambda = 10$).
- Spectral normalization on discriminator: Constrains discriminator's Lipschitz constant, preventing it from being too confident.
- Balance update ratio: Update generator 2x for every discriminator update, or use learning rate imbalance (lower D learning rate).
- Minibatch discrimination: Allow discriminator to see statistics across the batch. If all generated faces are similar, the discriminator can detect this.
- R1 regularization: Add gradient penalty on real data only: $\frac{\gamma}{2}\, \mathbb{E}_{x \sim p_{\text{data}}}\left[\|\nabla_x D(x)\|^2\right]$. Prevents discriminator from overconfidently classifying near the data manifold.
- Progressive growing: Start generating at low resolution (4x4), gradually increase. Low-resolution modes are easier to cover.
- Architecture changes: StyleGAN-style generator with mapping network + style injection tends to be more stable than basic generators.
If all else fails: Switch to a diffusion model. Diffusion models do not suffer from mode collapse by design and consistently produce diverse, high-quality face images.
Scoring Rubric:
- Strong Hire: Correctly diagnoses mode collapse from the symptoms, explains the root cause (discriminator/generator imbalance + JSD gradient issues), provides 3+ actionable fixes with justification, mentions WGAN-GP or spectral norm, considers switching to diffusion.
- Lean Hire: Identifies mode collapse but provides only 1-2 generic fixes without explaining why they work.
- No Hire: Cannot identify mode collapse from the description or suggests "train longer."
Problem 5: Generative Model Selection
Your startup needs to build a product that generates custom product images from text descriptions. The images must be 512x512, generation must take less than 2 seconds, and you have a limited compute budget for both training and inference. Which generative model family do you choose and why?
Hint 1 - Direction
Consider the tradeoffs: quality, speed, training cost, and text conditioning capability. The 2-second latency constraint eliminates some options.
Hint 2 - Insight
Standard diffusion (DDPM) with 1000 steps is too slow. But latent diffusion with fast samplers (20 steps DPM-Solver) or distilled models (4-8 steps) can meet the 2-second target. GANs are fast but harder to condition on text and have diversity issues. VAEs are too blurry. Autoregressive models (like DALL-E 1) are too slow for 2 seconds.
Hint 3 - Full Solution + Rubric
Recommendation: Latent Diffusion Model with fast sampling
Why not other options:
- GAN: Fast inference but text conditioning is difficult to implement well, mode collapse risks with diverse product categories, and training instability with limited compute.
- VAE: Too blurry for product images where detail matters.
- Autoregressive: Too slow - generating a 512x512 image token by token takes many seconds.
- Flow: Lower quality than diffusion, limited text conditioning support.
Latent diffusion implementation:
- Autoencoder: Fine-tune a pre-trained SD VAE (or use it directly). Compress 512x512x3 to 64x64x4.
- Denoiser: Fine-tune Stable Diffusion's U-Net on your product image dataset. This dramatically reduces training cost compared to training from scratch.
- Text conditioning: Use CLIP text encoder (already part of SD) plus potentially a domain-specific text encoder fine-tuned on product descriptions.
- Fast sampling for 2-second budget:
  - DDIM with 20 steps: ~1.5 seconds on A10G GPU
  - DPM-Solver++ with 15 steps: ~1.1 seconds
  - LCM (Latent Consistency Model) with 4 steps: ~0.4 seconds
  - Distilled model: train consistency distillation for 4-8 step generation
- Infrastructure: Serve on A10G GPUs ($1.00/hour). With batching, can serve 10-20 images per GPU per second.
Training budget: Fine-tuning SD on ~100K product images takes ~2-4 A100 GPU-days. Much cheaper than training from scratch.
Scoring Rubric:
- Strong Hire: Chooses latent diffusion with clear justification against alternatives, mentions fine-tuning from a pre-trained checkpoint, provides specific fast sampling methods with latency estimates, considers inference cost.
- Lean Hire: Chooses diffusion but does not address the 2-second constraint or does not mention fine-tuning from a pre-trained model.
- No Hire: Chooses a GAN or VAE without addressing their fundamental limitations for this use case.
Interview Cheat Sheet
| Concept | Key Formula | One-Liner | Red Flag |
|---|---|---|---|
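| VAE ELBO | $\mathbb{E}_q[\log p(x \mid z)] - \text{KL}(q(z \mid x) \,\Vert\, p(z))$ | Lower bound on log-likelihood: reconstruction minus KL | Confusing the direction of the KL divergence |
| Reparameterization trick | $z = \mu + \sigma \odot \epsilon$ | Makes sampling differentiable w.r.t. encoder | "Makes the model differentiable" |
| GAN objective | $\min_G \max_D V(D, G)$ | Generator fools discriminator | "Generator maximizes" (it minimizes) |
| Optimal discriminator | $D^* = \frac{p_{\text{data}}}{p_{\text{data}} + p_g}$ | Ratio of real to total density | Not being able to derive this |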
| Mode collapse | Generator produces low-diversity output | Adversarial training instability | "Just train longer to fix it" |
| WGAN | Wasserstein distance + Lipschitz constraint | Meaningful gradients everywhere | "WGAN removes the log" |
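| DDPM objective | $\Vert \epsilon - \epsilon_\theta(x_t, t) \Vert^2$ | Predict the noise that was added | "Diffusion predicts clean images" |
| Forward process | $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ | Gradually add noise to data | Not knowing you can jump to any t directly |
| Score function | $\nabla_x \log p(x)$ | Direction toward higher probability | Confusing score with loss |
| Classifier-free guidance | $\epsilon_\theta(x_t, \varnothing) + s\,(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))$ | Amplify conditional signal | "Higher guidance always better" |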
| Latent diffusion | Diffusion in compressed space | Speed via compression | "Same as pixel diffusion" |
Spaced Repetition Checkpoints
Day 0 - Initial Learning
- Read this entire page
- Derive the ELBO on paper without looking
- Write the GAN minimax objective and prove the optimal discriminator
- Write the DDPM forward process and training objective
- Fill out the grand comparison table from memory
Day 3 - First Recall
- Without notes, derive the ELBO and explain the gap
- Give the "60-Second Answer" for diffusion models out loud, timed
- Explain why mode collapse happens in GANs but not diffusion models
- Explain the reparameterization trick and why it is needed
Day 7 - Connections
- Compare all five generative model families in a table from memory
- Explain the connection between DDPM noise prediction and score matching
- Do Practice Problem 1 (ELBO derivation) on a whiteboard
- Explain classifier-free guidance and how guidance scale affects quality/diversity
Day 14 - Application
- Do Practice Problem 3 (diffusion model design) under timed conditions (15 minutes)
- Do Practice Problem 5 (model selection for a startup) under timed conditions (10 minutes)
- Explain the Stable Diffusion architecture end-to-end
- Review any derivations you hesitated on
Day 21 - Mock Interview
- Have someone ask: "Compare VAEs, GANs, and diffusion models - training objectives, strengths, weaknesses"
- Time yourself: full answer should take 5-8 minutes
- Do all 5 practice problems in sequence under timed conditions (55 minutes total)
- Can you whiteboard the DDPM training algorithm from memory?
Key Takeaways
- Every generative model family makes different tradeoffs. VAEs give latent spaces but blurry outputs. GANs give sharp images but mode collapse. Diffusion models give the best quality and diversity but are slow. Understanding these tradeoffs is more valuable than memorizing any single model.
- The ELBO, minimax, and denoising objectives are the three pillars. If you can derive all three and explain their implications, you can handle any generative model interview question.
- Diffusion models dominate current practice. They combine excellent sample quality, mode coverage, easy conditioning, and stable training. The main challenge - slow sampling - is being addressed rapidly through DDIM, consistency models, and distillation.
- The practical answer is often "fine-tune a pre-trained diffusion model." For most applications, training from scratch is unnecessary. Understanding how to adapt Stable Diffusion or similar models to your domain is the most valuable practical skill.
- Know the math, but also know the engineering. Interviewers at top labs want both the derivations and the practical knowledge - latent vs pixel space, classifier-free guidance parameters, sampling speed budgets, and inference cost calculations.
