Generative Adversarial Networks - From the Original GAN to StyleGAN

:::note
Reading time: ~50 minutes | Interview relevance: Very High | Target roles: ML Engineer, Research Engineer, AI Engineer, Applied Scientist
:::

The Real Interview Moment

"Derive the optimal discriminator for the original GAN objective. Then show that at optimality, training the generator is equivalent to minimizing JS divergence. Then explain why this is problematic." The interviewer writes the minimax objective on the whiteboard and waits.

This is a classic generative modeling interview question. The derivation is three steps: find $D^*$ by functional differentiation, substitute $D^*$ back into the generator objective, and recognize $JS(p_{data} \| p_g)$. Then the punchline: JS divergence has zero gradient when the supports of $p_{data}$ and $p_g$ do not overlap - which is almost always true early in training when $p_g$ generates garbage. This is the theoretical root of GAN training instability, and understanding it motivates the entire line of work from WGAN to StyleGAN.

Ian Goodfellow reportedly conceived the adversarial training idea at a Montreal bar in 2014, after a colleague suggested training a network to fool another network. He went home that night, coded the first GAN, and saw it generate recognizable digits on MNIST. Within five years, GANs produced photorealistic faces that are hard to distinguish from real photos and powered hundreds of commercial applications - before diffusion models overtook them as the dominant paradigm.

Why This Exists - The Generative Modeling Problem

Before GANs, the dominant approach to deep generative modeling was the Variational Autoencoder (VAE). VAEs are elegant - they maximize a variational lower bound on the log-likelihood and produce a smooth, structured latent space. But VAE samples are consistently blurry. The reason: the reconstruction loss (pixel-level MSE or BCE) causes the decoder to average over multiple plausible reconstructions, producing the average image rather than any particular sharp one.

The deeper issue is that pixel-level likelihood objectives measure per-pixel error, not perceptual realism. A slightly blurry image can score better than a sharp, realistic one whose details sit a few pixels away from the target. What you want is a training signal that says: "does this image look realistic?" - a human judgment, not a pixel-level metric.

GANs operationalize this insight by training a discriminator whose only job is to answer "real or fake?" The generator's training signal is the discriminator's judgment, not a pixel-level metric. This adversarial training signal encourages the generator to produce realistic images, not averaged-out blurry ones.

The cost: training two networks adversarially is fundamentally unstable. A decade of GAN research was largely the story of understanding and mitigating this instability.

Historical Context - The Adversarial Idea

Ian Goodfellow, Yoshua Bengio, Aaron Courville, and colleagues published the original GAN paper at NIPS 2014. The paper introduced the minimax formulation and proved that the global optimum has the generator matching the data distribution. Implementation was minimal - a fully-connected network on MNIST.

The DCGAN paper (Radford et al. 2015) made GANs practical by discovering architectural tricks: strided convolutions instead of pooling, batch normalization, and LeakyReLU activations. DCGAN generated convincing 64x64 bedroom images. From 2015-2018, GAN research expanded rapidly: conditional GANs, image-to-image translation (Pix2Pix, CycleGAN), video prediction, and many stabilization techniques.

Wasserstein GAN (Arjovsky et al. 2017) provided the theoretical breakthrough: Earth Mover's distance does not suffer the zero-gradient problem of JS divergence, enabling meaningful gradient signal even when supports do not overlap. WGAN-GP (Gulrajani et al. 2017) improved Wasserstein training stability with a gradient penalty.

Progressive GAN (Karras et al. 2018) produced the first photorealistic human faces at 1024x1024. StyleGAN (2019) and StyleGAN2 (2020) produced state-of-the-art photorealistic faces that launched the "deepfake era." By 2021, GANs had peaked - diffusion models began matching and then surpassing GAN quality while offering more stable training.

1. The GAN Minimax Formulation

The Objective

The original GAN trains two networks simultaneously:

Generator $G: z \rightarrow x$ maps noise $z \sim p_z$ (typically $p_z = \mathcal{N}(0, I)$) to generated images
Discriminator $D: x \rightarrow [0,1]$ estimates the probability that input $x$ is a real image

The minimax objective:

$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$

The discriminator maximizes $V$ - it wants $D(x) \to 1$ for real $x$ and $D(G(z)) \to 0$ for generated $G(z)$. The generator minimizes $V$ - it wants $D(G(z)) \to 1$ (fool the discriminator into thinking generated images are real).

The Optimal Discriminator

For a fixed generator $G$ (fixed $p_g$), the optimal discriminator is found by functional differentiation. At each point $x$, we maximize:

$p_{data}(x) \log D(x) + p_g(x) \log(1 - D(x))$

Taking derivative with respect to $D(x)$ and setting to zero:

$\frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0$

$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$

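The optimal discriminator outputs, at each point $x$, the posterior probability that $x$ came from the real distribution rather than the generated one (assuming real and generated samples are equally likely a priori). Where $p_{data}(x) = p_g(x)$, it outputs 0.5.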
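As a quick sanity check of the closed form, the sketch below compares $D^*(x)$ with a brute-force search over candidate values of $D(x)$ at a single point. The density values `p_data_x` and `p_g_x` are made-up illustrative numbers, and the function names are hypothetical helpers rather than part of any library.

```python
import math


def pointwise_objective(d: float, p_data: float, p_g: float) -> float:
    """Integrand of V(D, G) at one point x, as a function of d = D(x)."""
    return p_data * math.log(d) + p_g * math.log(1.0 - d)


def optimal_discriminator(p_data: float, p_g: float) -> float:
    """Closed-form maximizer D*(x) = p_data / (p_data + p_g)."""
    return p_data / (p_data + p_g)


# Hypothetical density values at a single point x (illustrative only).
p_data_x, p_g_x = 0.3, 0.1

d_star = optimal_discriminator(p_data_x, p_g_x)

# Brute-force search over a fine grid of candidate D(x) values in (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
d_best = max(grid, key=lambda d: pointwise_objective(d, p_data_x, p_g_x))

print(f"closed form: {d_star:.4f}, grid search: {d_best:.4f}")
```

Because the pointwise objective is strictly concave in $D(x)$, the grid search and the closed form should agree up to the grid resolution.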

Connection to JS Divergence

Substituting $D^*$ back into the generator objective:

$V(D^*, G) = \mathbb{E}_{x \sim p_{data}}\!\left[\log \frac{p_{data}(x)}{p_{data}(x)+p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\!\left[\log \frac{p_g(x)}{p_{data}(x)+p_g(x)}\right]$

This equals $-\log 4 + 2 \cdot JS(p_{data} \| p_g)$, where JS divergence is:

$JS(p_{data} \| p_g) = \frac{1}{2}KL\!\left(p_{data} \middle\| \frac{p_{data}+p_g}{2}\right) + \frac{1}{2}KL\!\left(p_g \middle\| \frac{p_{data}+p_g}{2}\right)$

The global minimum $G^*$ satisfies $p_g = p_{data}$, achieving $JS = 0$ and $V(D^*, G^*) = -\log 4$.
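The identity $V(D^*, G) = -\log 4 + 2 \cdot JS(p_{data} \| p_g)$ can be checked numerically on small discrete distributions. The sketch below is only an illustration: the two probability vectors are arbitrary made-up examples, and `kl`, `js` and `value_at_optimal_d` are hypothetical helper names.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence with the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def value_at_optimal_d(p_data, p_g):
    """V(D*, G) with D*(x) = p_data(x) / (p_data(x) + p_g(x)) substituted in."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total


# Arbitrary example distributions over four outcomes (illustrative only).
p_data = [0.5, 0.3, 0.2, 0.0]
p_g = [0.1, 0.2, 0.3, 0.4]

lhs = value_at_optimal_d(p_data, p_g)
rhs = -math.log(4) + 2 * js(p_data, p_g)
print(f"V(D*, G) = {lhs:.6f}, -log 4 + 2 JS = {rhs:.6f}")
```

The two printed numbers should match up to floating-point error, and setting `p_g` equal to `p_data` should give $-\log 4$.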

Why JS Divergence Causes Vanishing Gradients

The JS divergence between two distributions with disjoint supports is exactly $\log 2$ - a constant. Early in training, the generator produces nonsense images with a distribution $p_g$ that has essentially zero overlap with $p_{data}$. The JS divergence is then constant ($\log 2$), meaning its gradient with respect to generator parameters is zero. The generator receives no useful gradient signal.
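A minimal discrete illustration of this saturation, assuming point-mass distributions on a line of 10 bins (a toy setup chosen for clarity, not taken from the original paper): no matter how far the generated point mass sits from the real one, the JS divergence stays at $\log 2$.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    total = 0.0
    for pi, qi in zip(p, q):
        mi = 0.5 * (pi + qi)
        if pi > 0:
            total += 0.5 * pi * math.log(pi / mi)
        if qi > 0:
            total += 0.5 * qi * math.log(qi / mi)
    return total


def point_mass(index, n_bins=10):
    """Distribution that puts all probability on a single bin."""
    return [1.0 if i == index else 0.0 for i in range(n_bins)]


p_data = point_mass(0)

# Move the generated point mass from bin 9 toward bin 1: JS never changes.
js_values = [js_divergence(p_data, point_mass(k)) for k in range(9, 0, -1)]
for k, value in zip(range(9, 0, -1), js_values):
    print(f"p_g at bin {k}: JS = {value:.6f}  (log 2 = {math.log(2):.6f})")
```

Since the value does not depend on the bin index, moving the generated mass closer to the real mass yields no change in the loss, which is the zero-gradient problem in miniature.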

This is the fundamental theoretical flaw of the original GAN objective, and it explains why GAN training is so difficult: the discriminator becomes too good (easily distinguishes real from fake), at which point the generator's gradients vanish and training stalls.

2. The Non-Saturating Loss

To address vanishing gradients in practice, Goodfellow et al. immediately proposed an alternative generator objective. Instead of minimizing $\mathbb{E}[\log(1 - D(G(z)))]$ (which saturates when $D(G(z)) \to 0$), maximize $\mathbb{E}[\log D(G(z))]$:

$\text{Original (saturating):} \quad \min_G \mathbb{E}_z[\log(1 - D(G(z)))]$

$\text{Non-saturating:} \quad \max_G \mathbb{E}_z[\log D(G(z))]$

When $D(G(z)) \approx 0$ (discriminator correctly identifies generated images as fake), the gradient of $\log(1-D(G(z)))$ with respect to generator parameters is nearly zero. The gradient of $\log D(G(z))$ is large - strong signal to improve the generator. Non-saturating loss is used in practice in almost all GAN implementations.
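The difference in gradient size can be made concrete with a small calculation. The sketch assumes the discriminator output is a sigmoid of a logit $a$, so $D = \sigma(a)$, and compares the derivative of each generator objective with respect to that logit. The function names are hypothetical helpers.

```python
import math


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a: float) -> float:
    """d/da log(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(a)


def grad_non_saturating(a: float) -> float:
    """d/da log(sigmoid(a)) = 1 - sigmoid(a)."""
    return 1.0 - sigmoid(a)


# Negative logits mean the discriminator confidently rejects the fake sample.
for logit in [-6.0, -2.0, 0.0, 2.0]:
    print(
        f"logit={logit:+.1f}  D(G(z))={sigmoid(logit):.4f}  "
        f"|saturating grad|={abs(grad_saturating(logit)):.4f}  "
        f"|non-saturating grad|={abs(grad_non_saturating(logit)):.4f}"
    )
```

When the discriminator is confident that a sample is fake (very negative logit), the saturating gradient shrinks toward 0 while the non-saturating gradient approaches 1, which is exactly the regime the generator is in early in training.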

3. DCGAN - Making GANs Work in Practice

Architectural Discoveries

Radford et al. (2015) ran systematic experiments and discovered the architectural choices that stabilize GAN training:

Replace pooling with strided convolutions: the discriminator uses strided convolutions (stride 2) to downsample; the generator uses fractionally-strided (transposed) convolutions to upsample. This allows the network to learn its own spatial downsampling rather than using fixed pooling.

Batch Normalization in most layers: apply BN in both generator and discriminator, except at the generator's output layer and the discriminator's input layer. Radford et al. report that applying BN to every layer caused sample oscillation and model instability.

LeakyReLU in discriminator: standard ReLU kills gradients for negative activations. LeakyReLU with slope 0.2 for negatives maintains gradient flow throughout the discriminator.

Tanh output in generator: output layer uses tanh to bound generated pixel values in [-1, 1], matching normalized real images.

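No fully-connected hidden layers: the architecture is convolutional throughout. The noise vector is projected and reshaped into a small spatial feature map at the generator's input, and the discriminator's last convolutional features feed a single sigmoid output.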

These discoveries were empirical, not theoretically derived - the practical impact was massive. DCGAN produced state-of-the-art results and became the baseline architecture for the next several years.
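To see how strided and transposed convolutions move between resolutions, the sketch below applies the standard output-size formulas (with dilation 1 and no output padding) to a stack that assumes the common DCGAN configuration of kernel 4, stride 2, padding 1, with a kernel-4, stride-1, padding-0 layer at the 1×1 end. The layer list is an assumption for illustration, not a transcription of the paper.

```python
def conv_transpose_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a regular strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# Generator: 1x1 noise "image" upsampled to 64x64.
size = 1
generator_sizes = [size]
for kernel, stride, padding in [(4, 1, 0)] + [(4, 2, 1)] * 4:
    size = conv_transpose_out(size, kernel, stride, padding)
    generator_sizes.append(size)

# Discriminator: 64x64 image downsampled to a single score.
size = 64
discriminator_sizes = [size]
for kernel, stride, padding in [(4, 2, 1)] * 4 + [(4, 1, 0)]:
    size = conv_out(size, kernel, stride, padding)
    discriminator_sizes.append(size)

print("generator:    ", generator_sizes)
print("discriminator:", discriminator_sizes)
```

Each stride-2 layer doubles (generator) or halves (discriminator) the spatial size, so the two networks mirror each other.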

4. Training Instabilities - Mode Collapse

What Mode Collapse Looks Like

Mode collapse occurs when the generator learns to produce a small subset of all possible real images - it finds a few "modes" that reliably fool the discriminator and gets stuck there. In an extreme case, all generated images look nearly identical.

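The root cause: the generator is trained against the current discriminator, not the optimal one, so it can do well by concentrating on a few outputs that the current discriminator rates as real. This is a failure to reach the Nash equilibrium of the minimax game (where $p_g = p_{data}$), not an equilibrium itself. The discriminator eventually learns to reject the collapsed outputs, the generator jumps to a different small set of modes, and the two networks can cycle without ever covering the full data distribution.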

Mode collapse is subtle and hard to detect. FID might be acceptable if the produced modes are diverse enough. Recall (from the Precision-Recall framework) will be low, but if you only report FID, you will not see it. The practical detection method: look at a grid of 64+ generated images from different noise vectors. If they are all variations of the same facial expression or object pose, you have mode collapse.

Mitigation Techniques

Mini-batch discrimination (Salimans et al. 2016): include a "mini-batch layer" in the discriminator that computes feature statistics across the entire batch. If the generator collapses to similar images, the mini-batch layer detects them and the discriminator learns to penalize low diversity within a batch.

Historical averaging: add a penalty term $\|\theta - \frac{1}{t}\sum_i \theta_i\|^2$ that prevents parameters from moving too far from their historical average, preventing oscillatory dynamics.

Unrolled GANs: compute the generator gradient not against the current discriminator but against the discriminator $k$ steps into the future (unroll the discriminator update loop). Reduces oscillations.

Spectral normalization: normalize each weight matrix in the discriminator by its spectral norm (largest singular value), bounding the Lipschitz constant. This prevents the discriminator from becoming too sharp, maintaining useful gradient signal.
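Spectral normalization is the easiest of these techniques to show in a few lines. The sketch below estimates the largest singular value of a small weight matrix with power iteration and divides the matrix by it. It is a pure-Python illustration with hypothetical helper names: practical implementations typically run a single power-iteration step per training step and keep the vector between steps, and PyTorch ships this as `torch.nn.utils.spectral_norm`.

```python
import math


def matvec(W, v):
    """Compute W @ v for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def matvec_t(W, u):
    """Compute W.T @ u."""
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(W[0]))]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def spectral_norm(W, n_iters: int = 50) -> float:
    """Estimate the largest singular value of W by power iteration."""
    v = normalize([1.0] * len(W[0]))
    for _ in range(n_iters):
        u = normalize(matvec(W, v))
        v = normalize(matvec_t(W, u))
    return sum(ui * wi for ui, wi in zip(u, matvec(W, v)))


# Toy weight matrix with singular values 3 and 1 (illustrative only).
W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(W)
W_sn = [[w / sigma for w in row] for row in W]

print(f"estimated spectral norm: {sigma:.6f}")
print(f"spectral norm after normalization: {spectral_norm(W_sn):.6f}")
```

After normalization the layer's linear map has spectral norm 1, which is what bounds its contribution to the discriminator's Lipschitz constant.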

5. Wasserstein GAN - The Theoretical Fix

Earth Mover's Distance

Arjovsky, Chintala, and Bottou (2017) proposed replacing JS divergence with the Earth Mover's Distance (EMD) / Wasserstein-1 distance:

$W(p_{data}, p_g) = \sup_{\|f\|_L \leq 1} \left[\mathbb{E}_{x \sim p_{data}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]\right]$

where the supremum is over all 1-Lipschitz functions $f$ (functions satisfying $|f(x) - f(y)| \leq \|x - y\|$ for all $x, y$). Intuitively, EMD measures the minimum "work" required to transform distribution $p_g$ into $p_{data}$ - the minimum amount of probability mass times the distance it must be moved.

The key property: under mild assumptions on the generator, $W(p_{data}, p_g)$ is continuous in the generator parameters and differentiable almost everywhere, even when the supports of $p_{data}$ and $p_g$ are disjoint. Unlike JS divergence, which is constant ($\log 2$) when supports do not overlap, EMD keeps changing with how far apart the distributions are, so the critic still provides a usable gradient. Early in training, when $p_g$ is far from $p_{data}$, the distance is large, and it decreases smoothly as the distributions converge.
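In one dimension the Wasserstein-1 distance between two equal-size sample sets has a simple closed form: sort both sets and average the absolute differences. The sketch below uses that fact on made-up samples to show the distance shrinking smoothly as the generated samples are shifted toward the real ones. `wasserstein_1d` is a hypothetical helper written for this illustration.

```python
def wasserstein_1d(xs, ys) -> float:
    """W1 between two equal-size empirical distributions on the real line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Made-up "real" samples; the "generated" samples are a shifted copy.
real = [0.0, 0.1, 0.2, 0.3]

distances = []
for shift in [5.0, 2.0, 1.0, 0.5, 0.0]:
    fake = [x + shift for x in real]
    distances.append(wasserstein_1d(real, fake))
    print(f"shift={shift:.1f}  W1={distances[-1]:.3f}")
```

For every nonzero shift larger than the spread of the samples the two sets have disjoint supports, so the JS divergence would be stuck at $\log 2$, while W1 tracks the shift itself.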

WGAN Implementation

To approximate the supremum over 1-Lipschitz functions, WGAN trains a critic $f_w$ (not a discriminator - it outputs real numbers, not probabilities) and enforces the 1-Lipschitz constraint by weight clipping: after each gradient step, clip all critic weights to $[-c, c]$ for a small constant $c$ (e.g., 0.01).

The WGAN objective:

Critic (maximize): $\mathbb{E}_{x \sim p_{data}}[f_w(x)] - \mathbb{E}_{z \sim p_z}[f_w(G(z))]$
Generator (minimize): $-\mathbb{E}_{z \sim p_z}[f_w(G(z))]$

Key training changes from original GAN:

No sigmoid at critic output (real-valued output)
Train the critic for several steps per generator step (the paper uses 5) so it stays close to optimal
Adam optimizer is problematic - use RMSProp

Weight clipping enforces Lipschitz but causes capacity under-utilization - the critic learns to use weights near the clipping boundary, effectively becoming a low-capacity function approximator.
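A deliberately tiny toy makes the clipping behaviour visible. The sketch assumes a one-parameter linear critic $f_w(x) = w \cdot x$ on scalar data, which is far simpler than any real critic, and uses made-up samples and hyperparameters. Without clipping the critic objective is unbounded and $w$ keeps growing; with clipping $w$ is pinned at the boundary $c$.

```python
def clip(value: float, c: float) -> float:
    """Clamp a weight to the interval [-c, c]."""
    return max(-c, min(c, value))


def train_toy_critic(real, fake, steps=100, lr=0.005, c=None) -> float:
    """Gradient ascent on E[f_w(real)] - E[f_w(fake)] for f_w(x) = w * x."""
    w = 0.0
    # For a linear critic the objective is w * (mean_real - mean_fake),
    # so its gradient with respect to w is constant.
    grad = sum(real) / len(real) - sum(fake) / len(fake)
    for _ in range(steps):
        w += lr * grad
        if c is not None:
            w = clip(w, c)  # WGAN-style weight clipping after each step
    return w


# Made-up scalar samples (illustrative only).
real = [0.9, 1.0, 1.1]
fake = [-0.1, 0.0, 0.1]

w_unclipped = train_toy_critic(real, fake)
w_clipped = train_toy_critic(real, fake, c=0.01)
print(f"without clipping: w = {w_unclipped:.3f}")
print(f"with clipping:    w = {w_clipped:.3f}")
```

The clipped weight ends exactly at the boundary, a one-parameter version of the tendency for clipped critics to push weights to $\pm c$.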

WGAN-GP - Gradient Penalty

Gulrajani et al. (2017) improved WGAN by replacing weight clipping with a gradient penalty. A differentiable critic is 1-Lipschitz exactly when its gradient norm is at most 1 everywhere, and the optimal critic has gradient norm 1 almost everywhere along straight lines between real and generated samples. WGAN-GP therefore adds a soft penalty to the critic loss that pushes the gradient norm toward 1 at points interpolated between real and generated samples:

$\hat{x} = \alpha x_{real} + (1-\alpha) G(z), \quad \alpha \sim \text{Uniform}(0,1)$

$\lambda \mathbb{E}_{\hat{x}}\!\left[\!\left(\|\nabla_{\hat{x}} f_w(\hat{x})\|_2 - 1\right)^2\right]$

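WGAN-GP generally trains more reliably than weight-clipped WGAN: it avoids capacity under-utilization, works with the Adam optimizer, and enables deep architectures that weight clipping makes difficult to train. The cost is an extra backward pass through the critic to compute the penalty, and the paper recommends avoiding batch normalization in the critic because the penalty is defined per sample. WGAN-GP became a standard stabilization technique and is still used today in specialized GAN applications.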

6. Progressive Growing of GANs

The Resolution Scaling Problem

Training a GAN directly at 1024x1024 is extremely difficult: the generator and discriminator must simultaneously learn low-level textures, mid-level structures, and high-level semantic content. Early generated images are pure noise, and the discriminator learns to identify noise vs real data - which gives the generator a useless gradient.

Karras et al. (2018) at NVIDIA introduced Progressive Growing: start with 4x4 images (minimal structure), stabilize training, then progressively add layers at higher resolutions as training converges:

4×4 → 8×8 → 16×16 → 32×32 → 64×64 → 128×128 → 256×256 → 512×512 → 1024×1024

When a new resolution layer is added, it is blended in with a weight $\alpha$ that follows a fixed schedule: it starts at 0 (the new layer is invisible) and increases linearly to 1. This smooth transition prevents training instability when new resolution layers are added.
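The fade-in can be sketched as a linear schedule plus a weighted sum of the upsampled old output and the new layer's output. The example below works on a 1D row of pixel values to stay short. The helper names and the `fade_steps` parameter are assumptions made for illustration, not the paper's code.

```python
def fade_in_alpha(step: int, fade_steps: int) -> float:
    """Linear schedule: 0 at the start of the fade, 1 once it is complete."""
    return min(1.0, step / fade_steps)


def upsample_nearest_1d(row):
    """Nearest-neighbor upsampling by a factor of 2 (1D for brevity)."""
    return [value for value in row for _ in range(2)]


def blend(old_low_res, new_high_res, alpha: float):
    """Mix the upsampled old output with the new high-resolution output."""
    upsampled = upsample_nearest_1d(old_low_res)
    return [(1 - alpha) * o + alpha * n for o, n in zip(upsampled, new_high_res)]


# Made-up outputs of the old (2-pixel) and new (4-pixel) resolution stages.
old_output = [1.0, 2.0]
new_output = [0.0, 0.0, 0.0, 0.0]

for step in [0, 50, 100]:
    alpha = fade_in_alpha(step, fade_steps=100)
    print(f"step={step:3d}  alpha={alpha:.2f}  output={blend(old_output, new_output, alpha)}")
```

At $\alpha = 0$ the output is just the upsampled low-resolution image, so adding the new layer does not change what the discriminator sees at first.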

The result: photorealistic 1024x1024 face generation (trained on CelebA-HQ), setting a new quality bar for generative models in 2018.

7. StyleGAN - Disentangled Latent Space

Architecture Overview

StyleGAN (Karras, Laine, Aila, 2019) introduced a radically new generator architecture that disentangles the latent code from the image synthesis process:

Mapping Network: Instead of feeding the noise vector $z$ directly to the generator, a mapping network $f: \mathcal{Z} \rightarrow \mathcal{W}$ transforms $z \in \mathbb{R}^{512}$ through 8 fully-connected layers to a disentangled latent code $w \in \mathbb{R}^{512}$.

The $\mathcal{W}$ space is empirically much less entangled than $\mathcal{Z}$: individual dimensions of $w$ correspond more cleanly to interpretable attributes (age, gender, hair color) rather than complex combinations.

Style Injection via AdaIN: the style code $w$ is injected at every layer of the generator through Adaptive Instance Normalization:

$\text{AdaIN}(x_i, y) = y_{s,i} \cdot \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$

where $y_{s,i}$ and $y_{b,i}$ are scale and bias computed by learned affine transformations of $w$. AdaIN normalizes the feature map at each layer and then re-scales according to the style code - this allows $w$ to control the style at each resolution level independently.
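For a single channel, AdaIN amounts to standardizing the activations and then applying the style's scale and bias. The sketch below does this on a flat list of made-up activations. The small `eps` term is an assumption added for numerical safety and does not appear in the formula above.

```python
import math


def adain(channel, scale: float, bias: float, eps: float = 1e-8):
    """Apply AdaIN to one feature-map channel given as a flat list of values."""
    n = len(channel)
    mean = sum(channel) / n
    variance = sum((v - mean) ** 2 for v in channel) / n
    std = math.sqrt(variance + eps)  # eps avoids division by zero
    return [scale * (v - mean) / std + bias for v in channel]


# Made-up activations and style parameters (illustrative only).
features = [1.0, 2.0, 3.0, 4.0]
styled = adain(features, scale=2.0, bias=5.0)

styled_mean = sum(styled) / len(styled)
styled_std = math.sqrt(sum((v - styled_mean) ** 2 for v in styled) / len(styled))
print(f"output mean = {styled_mean:.4f}, output std = {styled_std:.4f}")
```

Whatever statistics the incoming channel had, the output channel's mean and standard deviation are set by the style, which is why each layer's style overrides the one before it.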

Stochastic Noise: at each layer, small amounts of Gaussian noise are added to the feature maps before AdaIN. This allows the model to generate fine stochastic details (individual hair placement, skin pore texture) independently from the main style code - the style controls global structure while noise controls per-pixel variation.

Learned Constant Input: the generator starts from a learned constant (not from $z$ or $w$) - the "template" shape. Styles are applied progressively to modify this template.

Why StyleGAN's Latent Space is Disentangled

The $\mathcal{Z}$ space must follow a standard Gaussian prior - all points in $\mathcal{Z}$ map to valid images. This constrains $\mathcal{Z}$ to be a "warped" version of the image space, with inevitable entanglements where one dimension must control multiple attributes to avoid holes in the distribution.

The $\mathcal{W}$ space has no such constraint - the mapping network can unwarp the Gaussian prior into a more geometrically natural representation. The Perceptual Path Length (PPL) metric measures this: it computes the average perceptual distance between images generated from nearby latent codes, so a lower value means small steps in latent space produce small, smooth changes in the image. The StyleGAN paper reports lower PPL for its full style-based generator measured in $\mathcal{W}$ than for a traditional generator measured in $\mathcal{Z}$, consistent with better disentanglement.

StyleGAN2

StyleGAN2 (Karras et al. 2020) fixes the characteristic "water droplet" artifacts visible in StyleGAN images. The artifacts were caused by the AdaIN normalization interfering with the generator's ability to control feature map statistics. StyleGAN2 removes AdaIN and replaces it with weight demodulation - normalizing the generator's convolutional weights by the expected standard deviation of the feature maps.
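Weight demodulation can be sketched in two steps: scale each input channel's weights by its style (modulation), then rescale each output channel's weights to unit L2 norm (demodulation). The version below is a pure-Python illustration on a tiny made-up weight tensor indexed as `weights[out_channel][in_channel][tap]`. The layout, names and `eps` value are assumptions, not the official implementation.

```python
import math


def modulate_demodulate(weights, styles, eps: float = 1e-8):
    """Scale weights per input channel by the style, then normalize per output channel."""
    result = []
    for w_out in weights:
        # Modulation: multiply each input channel's taps by that channel's style.
        modulated = [[styles[i] * w for w in taps] for i, taps in enumerate(w_out)]
        # Demodulation: divide by the L2 norm over input channels and taps.
        norm = math.sqrt(sum(w * w for taps in modulated for w in taps) + eps)
        result.append([[w / norm for w in taps] for taps in modulated])
    return result


# Two output channels, two input channels, two spatial taps (made-up numbers).
weights = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
styles = [2.0, 0.5]

demodulated = modulate_demodulate(weights, styles)
channel_norms = [
    math.sqrt(sum(w * w for taps in w_out for w in taps)) for w_out in demodulated
]
print("per-output-channel L2 norms:", [round(n, 6) for n in channel_norms])
```

Because the normalization acts on the weights rather than on the activations, the style still controls relative feature strength without an explicit instance normalization of the feature maps.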

Additional improvements: path length regularization (keeps the mapping from $\mathcal{W}$ to image space smooth) and replacing progressive growing with skip and residual connections in the generator and discriminator. Adaptive Discriminator Augmentation (ADA), introduced in a follow-up paper (Karras et al. 2020, often called StyleGAN2-ADA), added discriminator-side augmentation for small-dataset training.

8. Architecture Diagram

9. Image-to-Image Translation

Pix2Pix - Paired Translation

Isola et al. (2017) trained a conditional GAN for paired image-to-image translation: given aligned pairs $(x, y)$ (e.g., sketch and corresponding photo), train a generator $G(x)$ to produce the corresponding output. The loss combines L1 (for overall structure) and adversarial (for realism):

$L = \lambda L_1(G(x), y) + L_{adv}(G, D)$

The PatchGAN discriminator (from Pix2Pix) classifies each $N \times N$ patch as real or fake, rather than classifying the whole image, and averages the patch decisions. This is parameter-efficient and focuses the adversarial loss on local, high-frequency structure, while the L1 term handles low-frequency correctness.

CycleGAN - Unpaired Translation

Zhu et al. (2017) tackled unpaired image translation: given two image sets (horses and zebras, but not aligned pairs), learn to translate between them. The key insight: add a cycle consistency constraint - if you translate a horse image to a zebra and then back to a horse, you should recover the original.

$L_{cyc}(G, F) = \mathbb{E}_{x \sim p_X}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_Y}[\|G(F(y)) - y\|_1]$

CycleGAN uses two generators ($G: X \rightarrow Y$, $F: Y \rightarrow X$) and two discriminators. The cycle consistency loss prevents both generators from learning arbitrary mappings - they must be approximate mutual inverses.
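The cycle consistency term is easy to sketch with toy "generators" that act on short vectors. In the example below, `G`, `F_inverse` and `F_wrong` are hypothetical stand-ins for trained networks: the loss is zero when the two maps invert each other and positive otherwise.

```python
def l1_norm(a, b) -> float:
    """L1 distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cycle_loss(G, F, xs, ys) -> float:
    """E_x ||F(G(x)) - x||_1 + E_y ||G(F(y)) - y||_1 over sample lists."""
    forward = sum(l1_norm(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_norm(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


def G(v):
    """Toy X -> Y map standing in for a trained generator."""
    return [2 * t + 1 for t in v]


def F_inverse(v):
    """Exact inverse of G."""
    return [(t - 1) / 2 for t in v]


def F_wrong(v):
    """A Y -> X map that is not the inverse of G."""
    return [t / 2 for t in v]


# Made-up samples from the two domains.
xs = [[0.0, 1.0], [2.0, 3.0]]
ys = [[1.0, 3.0]]

loss_inverse = cycle_loss(G, F_inverse, xs, ys)
loss_wrong = cycle_loss(G, F_wrong, xs, ys)
print(f"cycle loss with inverse maps: {loss_inverse}")
print(f"cycle loss with mismatched maps: {loss_wrong}")
```

In training, this term is added to the two adversarial losses so that each generator is penalized for discarding information the other would need to reconstruct the input.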

10. Full PyTorch DCGAN Implementation

"""
DCGAN implementation for 64x64 images loaded from an ImageFolder directory
(for example CelebA).
Demonstrates the generator, discriminator, alternating training, and
fixed-noise sample grids saved after each epoch.
"""

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, utils as vutils
from pathlib import Path


# ============================================================
# Generator
# ============================================================

class DCGANGenerator(nn.Module):
    """
    DCGAN Generator: noise z → image.
    Architecture: reshape z to (B, z_dim, 1, 1) → (ConvTranspose → BN → ReLU) × 4 → ConvTranspose → Tanh
    """

    def __init__(self, z_dim: int = 100, n_features: int = 64, n_channels: int = 3):
        super().__init__()
        self.z_dim = z_dim

        self.model = nn.Sequential(
            # Input: (B, z_dim, 1, 1) - reshape before this block
            # Block 1: 1×1 → 4×4
            nn.ConvTranspose2d(z_dim, n_features * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(n_features * 8),
            nn.ReLU(inplace=True),

            # Block 2: 4×4 → 8×8
            nn.ConvTranspose2d(n_features * 8, n_features * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(n_features * 4),
            nn.ReLU(inplace=True),

            # Block 3: 8×8 → 16×16
            nn.ConvTranspose2d(n_features * 4, n_features * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(n_features * 2),
            nn.ReLU(inplace=True),

            # Block 4: 16×16 → 32×32
            nn.ConvTranspose2d(n_features * 2, n_features, 4, 2, 1, bias=False),
            nn.BatchNorm2d(n_features),
            nn.ReLU(inplace=True),

            # Block 5: 32×32 → 64×64 - output layer, no BN, use Tanh
            nn.ConvTranspose2d(n_features, n_channels, 4, 2, 1, bias=False),
            nn.Tanh()
            # Output in [-1, 1], matching normalized real images
        )

        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
                nn.init.normal_(m.weight, 0.0, 0.02)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.normal_(m.weight, 1.0, 0.02)
                nn.init.constant_(m.bias, 0)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        """
        Args:
            z: (B, z_dim) noise vectors

        Returns:
            images: (B, n_channels, 64, 64)
        """
        z = z.view(-1, self.z_dim, 1, 1)  # (B, z_dim) → (B, z_dim, 1, 1)
        return self.model(z)


# ============================================================
# Discriminator
# ============================================================

class DCGANDiscriminator(nn.Module):
    """
    DCGAN Discriminator: image → [0, 1] (real probability).
    Architecture: Conv + LeakyReLU × N → sigmoid
    Key: LeakyReLU everywhere (slope=0.2), BN except input layer
    """

    def __init__(self, n_channels: int = 3, n_features: int = 64):
        super().__init__()

        self.model = nn.Sequential(
            # Input: (B, n_channels, 64, 64)
            # Block 1: 64×64 → 32×32 - NO BatchNorm at input layer
            nn.Conv2d(n_channels, n_features, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),

            # Block 2: 32×32 → 16×16
            nn.Conv2d(n_features, n_features * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(n_features * 2),
            nn.LeakyReLU(0.2, inplace=True),

            # Block 3: 16×16 → 8×8
            nn.Conv2d(n_features * 2, n_features * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(n_features * 4),
            nn.LeakyReLU(0.2, inplace=True),

            # Block 4: 8×8 → 4×4
            nn.Conv2d(n_features * 4, n_features * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(n_features * 8),
            nn.LeakyReLU(0.2, inplace=True),

            # Block 5: 4×4 → 1×1 - output, sigmoid
            nn.Conv2d(n_features * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, 0.0, 0.02)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.normal_(m.weight, 1.0, 0.02)
                nn.init.constant_(m.bias, 0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (B, n_channels, 64, 64) images

        Returns:
            score: (B,) real/fake probabilities
        """
        return self.model(x).view(-1)  # flatten to (B,)


# ============================================================
# Training loop
# ============================================================

def train_dcgan(
    data_root: str = "./data",
    output_dir: str = "./dcgan_output",
    z_dim: int = 100,
    n_features: int = 64,
    n_channels: int = 3,
    image_size: int = 64,
    num_epochs: int = 25,
    batch_size: int = 128,
    lr: float = 2e-4,
    beta1: float = 0.5,
    device: str = "cuda" if torch.cuda.is_available() else "cpu",
    seed: int = 42,
):
    torch.manual_seed(seed)
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # ---- Dataset ----
    transform = transforms.Compose([
        transforms.Resize(image_size),
        transforms.CenterCrop(image_size),
        transforms.ToTensor(),
        transforms.Normalize([0.5] * n_channels, [0.5] * n_channels)  # → [-1, 1]
    ])
    dataset = datasets.ImageFolder(data_root, transform=transform)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4)

    # ---- Models ----
    G = DCGANGenerator(z_dim=z_dim, n_features=n_features, n_channels=n_channels).to(device)
    D = DCGANDiscriminator(n_channels=n_channels, n_features=n_features).to(device)

    # ---- Loss and Optimizers ----
    criterion = nn.BCELoss()  # Binary cross-entropy for real/fake
    # Adam with lower beta1 (0.5 instead of 0.9) - DCGAN recommendation for stability
    opt_G = optim.Adam(G.parameters(), lr=lr, betas=(beta1, 0.999))
    opt_D = optim.Adam(D.parameters(), lr=lr, betas=(beta1, 0.999))

    # Fixed noise for visualization (always the same → track G progress)
    fixed_noise = torch.randn(64, z_dim, device=device)

    REAL_LABEL = 1.0
    FAKE_LABEL = 0.0

    G_losses = []
    D_losses = []

    for epoch in range(num_epochs):
        for i, (real_images, _) in enumerate(dataloader):
            real_images = real_images.to(device)
            B = real_images.size(0)

            # ============================================================
            # Update Discriminator: maximize log D(x) + log(1 - D(G(z)))
            # ============================================================
            D.zero_grad()

            # Real images → D should output 1
            real_labels = torch.full((B,), REAL_LABEL, dtype=torch.float32, device=device)
            d_real_output = D(real_images)
            d_real_loss = criterion(d_real_output, real_labels)
            d_real_loss.backward()

            # Fake images → D should output 0
            noise = torch.randn(B, z_dim, device=device)
            fake_images = G(noise)
            fake_labels = torch.full((B,), FAKE_LABEL, dtype=torch.float32, device=device)
            # Detach: don't backprop through G when updating D
            d_fake_output = D(fake_images.detach())
            d_fake_loss = criterion(d_fake_output, fake_labels)
            d_fake_loss.backward()

            d_total_loss = d_real_loss + d_fake_loss
            opt_D.step()

            # ============================================================
            # Update Generator: maximize log D(G(z)) (non-saturating)
            # ============================================================
            G.zero_grad()

            # Generator wants D to output 1 for fake images (fool D)
            # Use same fake_images, but now DO backprop through G
            g_output = D(fake_images)
            # Label fake images as REAL for G's loss - trains G to fool D
            g_loss = criterion(g_output, real_labels)
            g_loss.backward()
            opt_G.step()

            G_losses.append(g_loss.item())
            D_losses.append(d_total_loss.item())

            if i % 100 == 0:
                d_real_mean = d_real_output.mean().item()
                d_fake_mean = d_fake_output.mean().item()
                print(
                    f"Epoch [{epoch}/{num_epochs}] Step [{i}/{len(dataloader)}] "
                    f"D_loss: {d_total_loss:.4f} G_loss: {g_loss:.4f} "
                    f"D(x): {d_real_mean:.4f} D(G(z)): {d_fake_mean:.4f}"
                )
                # D(x) should stay near 0.5-0.8
                # D(G(z)) should be near 0 initially, rise toward 0.5 as G improves

        # Save sample images at end of each epoch
        with torch.no_grad():
            fake_sample = G(fixed_noise).detach().cpu()
        grid = vutils.make_grid(fake_sample, nrow=8, normalize=True, value_range=(-1, 1))
        vutils.save_image(grid, f"{output_dir}/epoch_{epoch:03d}.png")

    # Save checkpoints
    torch.save(G.state_dict(), f"{output_dir}/generator.pth")
    torch.save(D.state_dict(), f"{output_dir}/discriminator.pth")
    print(f"Training complete. Checkpoints saved to {output_dir}")
    return G, D


# ============================================================
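# WGAN-GP gradient penalty (add to the critic loss)
# Note: a WGAN-GP critic outputs unbounded scores, so it should not end in
# Sigmoid, and Gulrajani et al. advise against BatchNorm in the critic.
# DCGANDiscriminator above would need both changes before using this.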
# ============================================================

def compute_gradient_penalty(
    discriminator: nn.Module,
    real_images: torch.Tensor,
    fake_images: torch.Tensor,
    device: str,
    lambda_gp: float = 10.0
) -> torch.Tensor:
    """
    Compute WGAN-GP gradient penalty.

    Enforces 1-Lipschitz constraint on the critic by penalizing
    the gradient norm deviation from 1 on interpolated samples.

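    Args:
        discriminator: the critic network
        real_images: real training images, shape (B, C, H, W)
        fake_images: generated images, same shape as real_images; detached
            inside this function, so the penalty never backpropagates into G
        device: device on which to sample the interpolation coefficients
        lambda_gp: penalty weight (10 is the value used in the WGAN-GP paper)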

    Returns:
        gradient_penalty: scalar loss term to add to critic loss
    """
    B = real_images.size(0)

    # Sample interpolation coefficients alpha ~ Uniform[0, 1]
    alpha = torch.rand(B, 1, 1, 1, device=device)

    # Interpolated samples: points on the line between real and fake
    interpolated = alpha * real_images + (1 - alpha) * fake_images.detach()
    interpolated.requires_grad_(True)

    # Critic score at interpolated points
    d_interpolated = discriminator(interpolated)

    # Compute gradient of critic score with respect to interpolated samples
    gradients = torch.autograd.grad(
        outputs=d_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(d_interpolated),
        create_graph=True,  # keep the graph so the penalty itself can be backpropagated
    )[0]

    # Gradient norm: (B, n_channels * H * W) → (B,)
    gradients = gradients.reshape(B, -1)
    gradient_norm = gradients.norm(2, dim=1)

    # Penalty: (||gradient|| - 1)^2
    gradient_penalty = lambda_gp * ((gradient_norm - 1) ** 2).mean()
    return gradient_penalty

11. GAN vs Diffusion - Why Diffusion Won

By 2022-2023, diffusion models (DALL-E 2, Stable Diffusion) surpassed GANs on most image quality benchmarks. The reasons:

| Dimension | GAN | Diffusion Model |
| --- | --- | --- |
| Training stability | Notoriously difficult, requires tricks | Stable with standard Adam + cosine LR |
| Mode coverage | Mode collapse is common | Excellent coverage, high diversity |
| Training objective | Adversarial - indirect, can vanish | Direct MSE regression - always well-defined |
| Sample quality ceiling | Limited by discriminator capacity | Scales cleanly with U-Net/DiT size |
| Text conditioning | Hard to integrate naturally | Cross-attention integrates naturally |
| Inference speed | Very fast (1 forward pass) | Slow (50-1000 forward passes) |
| Training data required | More efficient on small datasets | Needs more data for same quality |
| Latent space | Structured in StyleGAN | Noise vector, less interpretable |

GANs' primary remaining advantage is inference speed - one forward pass through the generator produces an image in milliseconds. This keeps GANs attractive for real-time applications (video game character generation, real-time face animation). For the highest-quality generation and for text-to-image, diffusion is now dominant.

12. YouTube Resources

| Video | Channel | What You Learn |
| --- | --- | --- |
| GAN Paper Explained | Yannic Kilcher | Original GAN paper, Nash equilibrium, JS divergence |
| Wasserstein GAN Explained | Arxiv Insights | Earth mover's distance, weight clipping, WGAN-GP |
| StyleGAN Architecture Deep Dive | Henry AI Labs | Mapping network, AdaIN, disentanglement, W space |
| GAN Training Tricks | Aladdin Persson | Practical DCGAN implementation, mode collapse, tips |
| GANs vs Diffusion Models | Computerphile | Why diffusion won, quality comparison, trade-offs |

13. Common Pitfalls

:::danger Balanced discriminator-generator training is non-trivial

If the discriminator is too strong (trained for many more steps, or with too high a learning rate relative to the generator), the generator's gradients vanish and it cannot improve. If the discriminator is too weak, it provides no useful training signal. The standard 1:1 alternating update (one discriminator step, one generator step) is a starting point, not a guarantee. Monitor the batch means of D(x) and D(G(z)) throughout training: as a rule of thumb, D(x) should stay in roughly 0.5-0.9, and D(G(z)) should start near 0 and gradually rise toward 0.5 as training progresses.

:::
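
The monitoring advice above can be turned into a small logging helper. This is a minimal sketch, assuming the discriminator outputs probabilities and that you already log the batch means of D(x) and D(G(z)); the function name `diagnose_balance` and the exact thresholds are illustrative assumptions based on the rule of thumb above, not universal constants.

```python
def diagnose_balance(d_real_mean, d_fake_mean, real_range=(0.5, 0.9), fake_floor=0.05):
    """Classify discriminator/generator balance from mean discriminator outputs."""
    low, high = real_range
    if d_real_mean > high and d_fake_mean < fake_floor:
        # D separates real and fake almost perfectly: generator gradients may vanish
        return "discriminator too strong"
    if d_real_mean < low:
        # D cannot even recognize real images: its signal is not useful
        return "discriminator too weak"
    return "ok"


# Example: values logged at three different points in a hypothetical run
history = [(0.99, 0.01), (0.45, 0.50), (0.72, 0.31)]
reports = [diagnose_balance(d_real, d_fake) for d_real, d_fake in history]
```

A check like this cannot prove that training is healthy. It only flags the two failure patterns described above early enough to adjust learning rates or update ratios.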

:::warning Checkerboard artifacts indicate transposed convolution issues

Generated images often show a regular grid-like pattern ("checkerboard artifacts") when transposed convolutions use a kernel size that is not divisible by the stride, because neighboring output pixels then receive an uneven number of kernel contributions. The standard fix: replace each transposed convolution with an upsampling step (nearest-neighbor or bilinear) followed by a regular convolution. This removes the periodic overlap pattern that creates checkerboards.

:::
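
The upsample-then-convolve fix can be sketched as a small PyTorch block. The helper name `upsample_conv_block`, the channel counts, and the BatchNorm + ReLU tail are assumptions chosen to resemble a DCGAN-style generator stage. Treat it as one possible layout, not the required one.

```python
import torch
import torch.nn as nn


def upsample_conv_block(in_ch, out_ch, mode="nearest"):
    """2x upsampling block: resize first, then a stride-1 convolution."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode=mode),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


# Same output shape as nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)
block = upsample_conv_block(128, 64)
x = torch.randn(2, 128, 16, 16)
y = block(x)  # spatial size doubles: 16x16 -> 32x32
```

Because every output pixel of the stride-1 convolution sees the same number of kernel taps, the uneven overlap that produces the grid pattern does not occur.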

:::danger Mode collapse is silent unless you look for it

A GAN can appear to be training normally (losses behave, generated images look plausible) while quietly collapsing to a small set of modes. Always generate a large grid (100+ images) from different noise vectors and check for visual diversity. Also monitor Recall from the Precision-Recall framework - mode collapse produces high Precision but very low Recall. Looking only at a few cherry-picked images or FID will not catch it.

:::

14. Interview Q&A

Q1: Derive the optimal discriminator for the original GAN objective, and show it leads to JS divergence minimization.

For a fixed generator (fixed $p_g$), the value function is an integral over $x$, and the integrand at each point $x$ is:

$p_{data}(x)\log D(x) + p_g(x)\log(1-D(x))$

This expression is concave in $D(x)$, so it is maximized by setting its derivative with respect to $D(x)$ to zero: $p_{data}(x)/D(x) = p_g(x)/(1-D(x))$, giving $D^*(x) = p_{data}(x)/(p_{data}(x)+p_g(x))$.

Substituting $D^*$ back: $V(D^*, G) = \mathbb{E}_{x\sim p_{data}}\!\left[\log\frac{p_{data}}{p_{data}+p_g}\right] + \mathbb{E}_{x\sim p_g}\!\left[\log\frac{p_g}{p_{data}+p_g}\right]$

Let $m = (p_{data}+p_g)/2$. Each logarithm has the form $\log\frac{p}{2m} = \log\frac{p}{m} - \log 2$, so $V(D^*, G) = -\log 4 + KL(p_{data}\|m) + KL(p_g\|m) = -\log 4 + 2\cdot JS(p_{data}\|p_g)$. Since $JS \geq 0$ with equality iff $p_g = p_{data}$, the global minimum requires the generator to match the data distribution exactly.

The problem: when $p_{data}$ and $p_g$ have disjoint supports (almost always true early in training), $D^*(x) \in \{0,1\}$ and the JS divergence equals $\log 2$ everywhere - a constant with zero gradient. The generator cannot improve.
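
Both claims in this answer can be checked numerically on small discrete distributions. The sketch below uses only the standard library. The helper names (`kl`, `js`, `value_at_optimal_d`) and the example probability vectors are made up for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence in nats (its maximum value is log 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def value_at_optimal_d(p_data, p_g):
    """V(D*, G) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))  # E_data[log D*(x)]
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))  # E_g[log(1 - D*(x))]
    return total


# Overlapping supports: V(D*, G) should equal -log 4 + 2 * JS
p_data = [0.5, 0.3, 0.2, 0.0]
p_gen = [0.1, 0.2, 0.3, 0.4]
lhs = value_at_optimal_d(p_data, p_gen)
rhs = -math.log(4) + 2 * js(p_data, p_gen)

# Disjoint supports: JS stays at log 2 however the generator mass is arranged
real_only = [0.5, 0.5, 0.0, 0.0]
js_a = js(real_only, [0.0, 0.0, 0.5, 0.5])
js_b = js(real_only, [0.0, 0.0, 0.9, 0.1])
```

Here `lhs` and `rhs` agree up to floating-point error, and `js_a` and `js_b` both equal $\log 2$: changing the generator's distribution does not change the loss while the supports stay disjoint, which is the zero-gradient problem in miniature.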

Q2: What is the Earth Mover's distance and why does it fix the vanishing gradient problem?

The Earth Mover's (Wasserstein-1) distance is the minimum amount of "work" to transform distribution $p_g$ into $p_{data}$, where work = mass × distance moved. By Kantorovich-Rubinstein duality it can be written as a supremum over 1-Lipschitz functions:

$W(p_{data}, p_g) = \sup_{\|f\|_L \leq 1} \left[\mathbb{E}_{x\sim p_{data}}[f(x)] - \mathbb{E}_{x\sim p_g}[f(x)]\right]$

Unlike JS divergence, $W$ is continuous and (under mild assumptions on the generator) differentiable almost everywhere, even when the supports do not overlap. When $p_g$ generates images far from the real distribution, the value of $W$ grows with how far apart the distributions are, so the generator receives a usable gradient pointing toward the data. As the distributions converge, $W$ decreases smoothly to zero. There is no sudden loss of gradient when supports become disjoint.

WGAN approximates $W$ by training a Lipschitz-constrained critic $f_w$ (weight clipping or gradient penalty) to maximize the difference in expectations. The generator then maximizes the critic's expected output on generated images, i.e. it minimizes $-\mathbb{E}_z[f_w(G(z))]$.
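
A one-dimensional toy case makes the contrast with JS concrete. For two equal-size 1-D samples, the optimal transport plan matches sorted values, so $W$ can be computed directly without a critic. The sketch below assumes that setting; the function name `wasserstein_1d` and the uniform toy data are illustrative.

```python
import random


def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1-D empirical distributions.

    In one dimension the optimal plan pairs sorted samples, so W1 is the
    mean absolute difference between order statistics.
    """
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


random.seed(0)
real = [random.uniform(0.0, 1.0) for _ in range(1000)]

distances = {}
for shift in (8.0, 4.0, 2.0):
    fake = [x + shift for x in real]  # same shape as real, but disjoint support
    distances[shift] = wasserstein_1d(real, fake)
```

In all three cases the supports are disjoint, so the JS divergence would sit at $\log 2$, yet `distances` tracks the shift itself (8, 4, 2 up to rounding). A generator that moves its samples closer is rewarded with a smaller loss, which is exactly the signal JS fails to provide.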

Q3: Explain StyleGAN's AdaIN mechanism and why it enables style control.

AdaIN (Adaptive Instance Normalization): at each layer of the generator, the feature maps $x_i$ are first normalized (zero mean, unit variance), then scaled and shifted by style-derived parameters:

$\text{AdaIN}(x_i, y) = y_{s,i} \cdot \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$

The scale $y_{s,i}$ and bias $y_{b,i}$ are computed by learned affine transformations of the style code $w$. This means the style code controls the statistical distribution (mean and variance) of features at each layer - and feature statistics in CNNs are known to encode style (from neural style transfer research).

Each resolution level gets its own affine transform of $w$. Low-resolution layers (4x4, 8x8) control coarse style (overall pose, face shape, age). High-resolution layers control fine style (hair texture, skin pores, freckles). This hierarchical style injection, combined with the disentangled $\mathcal{W}$ space, enables precise style control by interpolating $w$ at different resolutions.
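
The AdaIN formula above translates almost line for line into a PyTorch module. This is a minimal sketch that follows the equation as written in this answer. The class name, `w_dim=512`, and the single `nn.Linear` producing both scale and bias are assumptions for illustration, and released StyleGAN code differs in details such as how the scale is initialized.

```python
import torch
import torch.nn as nn


class AdaIN(nn.Module):
    """Adaptive instance normalization driven by a style vector w."""

    def __init__(self, w_dim, num_channels, eps=1e-5):
        super().__init__()
        # Per-sample, per-channel normalization with no learned affine parameters
        self.norm = nn.InstanceNorm2d(num_channels, affine=False, eps=eps)
        # Learned affine map: w -> (scale, bias) for every channel
        self.affine = nn.Linear(w_dim, 2 * num_channels)

    def forward(self, x, w):
        y_s, y_b = self.affine(w).chunk(2, dim=1)  # each of shape (B, C)
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(x) + y_b


torch.manual_seed(0)
layer = AdaIN(w_dim=512, num_channels=64)
x = torch.randn(4, 64, 8, 8)
w = torch.randn(4, 512)
out = layer(x, w)  # same shape as x
```

After this layer, the spatial mean of each output channel equals its style bias $y_{b,i}$ and its spread is set by $y_{s,i}$, regardless of what statistics the incoming features had. That is the sense in which $w$ "controls" the feature statistics.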

Q4: What is mode collapse and how do you detect and mitigate it?

Mode collapse occurs when the generator learns to produce a small subset of the real data distribution - it finds a few images that reliably fool the discriminator and gets stuck producing variations of those images. In extreme cases, all noise vectors produce nearly identical images.

Detection: (1) Generate a large grid (100+ images) from different noise vectors and look for visual repetition. (2) Compute Recall from the Precision-Recall framework - mode collapse produces very low Recall (high quality, low coverage). (3) Inspect the two terms of FID separately (the mean difference and the covariance term): a small mean term alongside a large covariance term suggests the generated features cover too little of the real distribution's variance. FID on its own can look acceptable under partial collapse, so treat it as a supplement to (1) and (2).

Mitigation: (1) Mini-batch discrimination: add a layer that computes statistics across the batch - the discriminator can detect when all images in a batch look similar and penalize it. (2) WGAN-GP: EMD is less susceptible to mode collapse than JS because it provides gradient even when supports are disjoint. (3) Spectral normalization: bounds the discriminator's Lipschitz constant, preventing it from becoming too confident. (4) Experience replay: maintain a buffer of previously generated images and occasionally show them to the discriminator - this discourages the generator from cycling through modes.
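
Of these mitigations, experience replay is the simplest to sketch in isolation. The buffer below is a hypothetical helper, loosely modeled on the image history pools used in some image-to-image translation codebases. The class name, capacity, and 50% swap probability are assumptions, and in real training the stored items would be detached image tensors, not strings.

```python
import random


class ReplayBuffer:
    """Fixed-size pool of past generator outputs (illustrative helper)."""

    def __init__(self, capacity=50, swap_prob=0.5, seed=None):
        self.capacity = capacity
        self.swap_prob = swap_prob
        self.items = []
        self.rng = random.Random(seed)

    def push_and_sample(self, fresh):
        """Return one item per fresh sample, mixing in stored older samples."""
        out = []
        for item in fresh:
            if len(self.items) < self.capacity:
                # Buffer still filling: store the fresh sample and pass it through
                self.items.append(item)
                out.append(item)
            elif self.rng.random() < self.swap_prob:
                # Hand back an old sample and store the fresh one in its place
                idx = self.rng.randrange(self.capacity)
                out.append(self.items[idx])
                self.items[idx] = item
            else:
                out.append(item)
        return out


buffer = ReplayBuffer(capacity=4, seed=0)
first = buffer.push_and_sample(["g0", "g1", "g2", "g3"])  # buffer fills, batch unchanged
later = buffer.push_and_sample(["g4", "g5", "g6", "g7"])  # may include older samples
```

The discriminator is then trained on the returned batch instead of the raw generator output, so it keeps seeing modes the generator produced earlier and cannot simply "forget" them.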

Q5: Why did diffusion models surpass GANs for photorealistic image synthesis?

The root cause is training stability and optimization landscape. GAN training is a two-player minimax game with a notoriously complex, non-convex optimization landscape. The generator and discriminator can enter cycles, collapse, or diverge depending on initialization and hyperparameters. A decade of engineering tricks (BN, spectral norm, WGAN-GP, progressive growing, ADA) was needed to stabilize GANs at high resolution.

Diffusion models have a simple, well-conditioned training objective: predict the noise added to a clean image. This is a standard regression problem with a well-defined loss, trainable with a standard optimizer setup (Adam + cosine LR). Training is far more predictable: adding data or increasing model size tends to improve results without re-tuning a delicate two-player balance.

The quality ceiling: GANs are fundamentally limited by the discriminator's capacity to evaluate image quality. As image resolution increases, the discriminator must capture increasingly subtle statistics - and this becomes the bottleneck. Diffusion models scale more cleanly: larger U-Net → better denoising → better samples, with no adversarial bottleneck. DALL-E 2 and Stable Diffusion's quality, diversity, and text alignment in 2022 exceeded the best GANs while being much simpler to train.

15. GAN Applications Where They Remain Competitive

Despite diffusion models dominating for highest-quality image synthesis, GANs remain the preferred architecture in several production scenarios:

Real-Time Image Synthesis

GANs require a single forward pass through the generator to produce an image - typically 10-50ms for a 512x512 image on a consumer GPU. Diffusion models require 10-1000 forward passes. For real-time applications where latency is measured in milliseconds, GANs are still the right tool:

Virtual try-on: generate clothing on a person in real-time for e-commerce
Real-time face animation: DeepFake-style face swapping at video frame rates
Game character generation: procedural avatar customization at runtime
Interactive image editing: brush strokes that immediately generate realistic textures

High-Resolution Face Generation

StyleGAN2-ADA remains competitive for pure face synthesis quality, especially when training data is limited. Its Adaptive Discriminator Augmentation (ADA) adjusts the augmentation probability during training based on a measured discriminator-overfitting signal, not on a fixed schedule - enabling high-quality training on as few as 1,000 images. Diffusion models generally need larger datasets to achieve comparable quality on small, specialized domains.

Super-Resolution and Image Enhancement

GAN-based super-resolution models (Real-ESRGAN, ESRGAN, GFPGAN) are production standards for upscaling, face restoration, and artifact removal. The discriminator's adversarial loss produces sharper, more realistic textures than diffusion-based upscaling at the same inference speed. These models power commercial upscaling tools (Topaz AI, Gigapixel) and video restoration pipelines.

Data Augmentation for Discriminative Models

GANs can generate synthetic training data to augment small datasets for classification, detection, or segmentation tasks. Because GAN inference is fast, you can generate augmented training examples on-the-fly during discriminative model training. Diffusion-based augmentation is also used but at lower volume due to inference cost.

16. Conditional GANs and Class-Conditional Synthesis

Class-Conditional GAN

In a class-conditional GAN, both generator and discriminator receive a class label $c$ as additional input. The generator learns $G(z, c)$ - given noise $z$ and class label $c$, produce an image of class $c$. The discriminator learns $D(x, c)$ - assess whether image $x$ is a real example of class $c$.

The conditioning is typically implemented by embedding the class label and concatenating or adding it to intermediate features in both networks. For ImageNet class-conditional generation (1000 classes), BigGAN (Brock et al. 2019) was state-of-the-art, reporting an FID of 7.4 at 128x128 resolution. It was later surpassed by diffusion models such as DiT, which reports an FID of 2.27 at 256x256.

Class-conditional GANs support class mixing - interpolating between class embeddings to generate hybrid images (e.g., a mix between "tabby cat" and "Persian cat"). This is analogous to style mixing in StyleGAN but in semantic class space.
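
The embed-and-concatenate conditioning and the class-mixing trick can be sketched with a toy generator. Everything here is an assumption for illustration: the class name `ConditionalGenerator`, the small MLP body, and the dimensions stand in for what would be a convolutional generator in practice.

```python
import torch
import torch.nn as nn


class ConditionalGenerator(nn.Module):
    """Toy G(z, c): concatenates a class embedding with the noise vector."""

    def __init__(self, z_dim=64, num_classes=10, embed_dim=16, out_dim=28 * 28):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(z_dim + embed_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, out_dim),
            nn.Tanh(),
        )

    def forward(self, z, labels=None, class_embedding=None):
        # Look up embeddings from integer labels, or accept an embedding directly
        e = self.embed(labels) if class_embedding is None else class_embedding
        return self.net(torch.cat([z, e], dim=1))


torch.manual_seed(0)
G = ConditionalGenerator()
z = torch.randn(5, 64)
images = G(z, labels=torch.tensor([0, 1, 2, 3, 4]))  # one image per requested class

# Class mixing: blend the embeddings of classes 3 and 7 while keeping z fixed
e3, e7 = G.embed(torch.tensor([3])), G.embed(torch.tensor([7]))
mixed = G(z[:1], class_embedding=0.5 * e3 + 0.5 * e7)
```

Accepting an explicit `class_embedding` is a design choice made here so that interpolated embeddings can be fed in without changing the lookup table. Sweeping the blend weight from 0 to 1 traces a path between the two classes.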

Projection Discriminator

For conditional GANs, a key architectural choice is how the discriminator uses the class label. Early approaches concatenated the label embedding to features. The projection discriminator (Miyato & Koyama, 2018) uses a more principled approach: compute the inner product between the label embedding and the discriminator's penultimate features. This projects the conditioning into the same space as the features, providing stronger conditioning signal and improving stability on complex conditional distributions.
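
In code, the projection idea amounts to adding an inner-product term to the usual unconditional logit. The sketch below is a toy version under stated assumptions: the class name, the one-layer feature extractor `phi`, and the flattened 28x28 inputs are placeholders for a real convolutional discriminator.

```python
import torch
import torch.nn as nn


class ProjectionDiscriminator(nn.Module):
    """Toy D(x, c) = psi(phi(x)) + <embed(c), phi(x)>."""

    def __init__(self, in_dim=28 * 28, feat_dim=128, num_classes=10):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.LeakyReLU(0.2))
        self.psi = nn.Linear(feat_dim, 1)                 # unconditional logit
        self.embed = nn.Embedding(num_classes, feat_dim)  # one vector per class

    def forward(self, x, labels):
        h = self.phi(x)  # penultimate features, shape (B, feat_dim)
        # Inner product between the class embedding and the features
        projection = (self.embed(labels) * h).sum(dim=1, keepdim=True)
        return self.psi(h) + projection  # logits of shape (B, 1)


torch.manual_seed(0)
D = ProjectionDiscriminator()
x = torch.randn(4, 28 * 28)
logits_a = D(x, torch.tensor([0, 1, 2, 3]))
logits_b = D(x, torch.tensor([9, 8, 7, 6]))  # same images, different claimed classes
```

The same images receive different scores under different claimed labels, because the label only enters through the inner product with the features. This avoids the extra input channels that concatenation-based conditioning requires.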

This lesson is part of the Unsupervised Learning module. Continue to the Diffusion Models module for the modern successor to GANs: Diffusion Models Overview.

