A complete guide to evaluating generative models - from the mathematics of FID and Inception Score to Precision/Recall manifolds, CLIP-based metrics, DINO similarity, human preference studies, metric gaming, and building production evaluation pipelines.

Evaluating Generative Models - FID, IS, Precision/Recall, Human Evaluation

:::note Reading time: ~55 minutes | Interview relevance: High | Target roles: ML Engineer, Research Engineer, Applied Scientist, AI Engineer :::

The Real Interview Moment

The research team at a generative AI company has just trained a new diffusion model. They claim it beats the previous version. The manager asks you to design an evaluation. You ask: "Better by what metric?" The team says: "Lower FID." You pause.

"FID alone is insufficient. A model with very low FID might be trading off diversity for quality - high precision, low recall. We need to measure quality and coverage separately. And for our text-to-image use case, CLIP-T is as important as FID - a beautiful but off-prompt image is useless. Here is the full evaluation suite I would run: FID for overall distribution match, Precision and Recall separately for quality and coverage, CLIP-T for text alignment, and human ELO for flagship comparisons. Each metric measures something the others miss."

The follow-up: "How many samples do you need for a statistically reliable FID comparison?" You know this cold: at least 10,000, ideally 50,000. The finite-sample FID estimate is biased upward, with a bias that scales roughly as $1/N$ - the same model evaluated on 2,000 samples appears worse (higher FID) than on 50,000.

This conversation separates generative ML engineers from practitioners who copy-paste benchmark numbers. Understanding what each metric measures, what it systematically misses, and when results become statistically meaningful is a mark of research maturity.

Why This Exists - The Fundamental Difficulty

Evaluating a classifier is straightforward: hold-out test set, ground truth labels, measure accuracy. There is a single unambiguous right answer for each input.

Evaluating a generative model is fundamentally different: there is no single correct output. A model asked to "generate a cat" can validly produce any of billions of different cat images. The evaluation question is not "is this correct?" but rather a set of competing objectives:

Quality: is each generated image high quality, artifact-free, photorealistic?
Diversity: does the generator cover many different modes, or does it collapse to a few?
Fidelity: does the generated distribution match the real data distribution?
Text alignment: for text-to-image, does the image match the text prompt?
Human preference: would a human prefer this output over a competitor's?
Downstream utility: do images generated by this model make downstream classifiers perform better when used as training data?

These objectives conflict. A model can achieve perfect "quality" by replaying a handful of memorized training images - near-zero diversity. A model can achieve maximal "diversity" by generating random noise - zero quality. A model can achieve high CLIP-T by generating simple, generic images that CLIP reliably recognizes - not useful for open-domain generation. The ideal model is Pareto-optimal across all dimensions simultaneously. No single number captures this.

Understanding these tradeoffs and choosing the right combination of metrics for your use case is the core competency this lesson builds.

Historical Context

Before FID, evaluation relied on visual inspection and user preference studies - subjective and difficult to reproduce. The Inception Score (IS, Salimans et al. 2016) was the first widely-adopted automatic metric, using Inception v3's classification confidence and diversity to proxy quality and coverage.

FID (Heusel et al. 2017, NeurIPS) improved dramatically on IS by comparing the real and generated distributions rather than measuring absolute properties. This comparison-based approach is robust to many failure modes of IS.

The precision-recall framework for generative models (Kynkäänniemi et al. 2019, NeurIPS) introduced the manifold perspective: model quality and diversity as two separate geometric properties (what fraction of generated samples are realistic vs what fraction of real modes are covered). This explicitly decomposed the single-number FID into the two components that designers actually want to control independently.

CLIP-based metrics (Hessel et al. 2021 for CLIP-S; adopted broadly in 2022) enabled text-image alignment evaluation, which FID completely ignores. By 2024, best practice is to report FID, Precision, Recall, CLIP-T, and human preference ELO for flagship model comparisons - no single metric is trusted alone.

1. Inception Score (IS)

Definition and Formula

The Inception Score (Salimans et al. 2016) uses a pretrained Inception v3 network to measure two properties of generated images simultaneously:

$IS(G) = \exp\!\left(\mathbb{E}_{x \sim p_g}\left[KL\!\left(p(y|x) \;\|\; p(y)\right)\right]\right)$

where:

$p(y|x)$ : the class probability distribution from Inception v3 given generated image $x$
$p(y) = \int p(y|x) p_g(x) dx$ : the marginal class distribution averaged over all generated images
$KL(p(y|x) \| p(y))$ : KL divergence - maximized when $p(y|x)$ is peaked (quality) and $p(y)$ is flat (diversity)

Quality term: Inception v3 should be very confident about the class of each individual image. A high-quality dog image produces a peaked $p(y|x)$ concentrated near "dog." A blurry or unrealistic image produces a flat $p(y|x)$ across all 1000 ImageNet classes.

Diversity term: across all generated images, the model should produce examples of many different classes. If the model always generates dogs, $p(y) = [1.0, 0, 0, ...]$ - peaked, low entropy, low IS even if each individual dog is perfect. If the model generates all 1000 ImageNet classes equally, $p(y)$ is uniform - high entropy, high IS contribution.
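The formula above can be checked on toy class probabilities without running Inception at all. The following is a minimal sketch, assuming the `(N, C)` array `probs` already holds softmax outputs $p(y|x)$ for each generated image; the function name `inception_score` and the toy inputs are illustrative, not taken from any library.

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Inception Score from an (N, C) array of per-image class probabilities."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over images
    # KL(p(y|x) || p(y)) for every image; eps guards log(0)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Toy inputs with 10 classes (hypothetical, for illustration only)
confident_diverse = np.eye(10)                # one sharp prediction per class
collapsed = np.tile(np.eye(10)[0], (10, 1))   # always the same class
blurry = np.full((10, 10), 0.1)               # flat p(y|x) for every image

for name, p in [("confident+diverse", confident_diverse),
                ("collapsed", collapsed),
                ("blurry", blurry)]:
    print(name, round(inception_score(p), 3))
```

The confident and diverse case reaches the number of classes, while both failure modes (one class only, or flat predictions) fall to 1, which matches the quality and diversity reading of the two terms.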

IS Values in Practice

Distribution	Typical IS (CIFAR-10)
Random Gaussian noise	~1.0
Single repeated image	~1.0 (no diversity)
GAN, 2016	~3-5
StyleGAN v2 (2020)	~10.0
Real CIFAR-10 data	~11.2
State-of-the-art diffusion (2023)	~10-11

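The theoretical maximum of IS equals the number of classifier classes ( $e^{\log(1000)} = 1000$ for the 1000-class ImageNet Inception v3). In practice, even real CIFAR-10 data scores only around 11: the images come from 10 categories, so the ImageNet classifier's predictions are neither spread across all 1000 classes nor fully confident on small, ambiguous images.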

Critical Limitations

No comparison to real data: IS measures properties of generated samples alone. A model can achieve very high IS by generating sharp, confident, diverse hallucinations that look nothing like the training distribution. There is no reference anchor.

Dataset-specific Inception features: Inception v3 was trained on ImageNet. IS is only meaningful for datasets with similar distribution (natural images of everyday objects). On FFHQ (human faces), medical images, satellite imagery, or molecular structure visualizations, Inception's 1000 ImageNet classes are semantically meaningless.

Easy to game: generate exactly one image per ImageNet class, all maximally sharp and unambiguous. IS = maximum value. The model is not useful for any real task.

No within-class diversity: generating 50,000 identical-pose golden retrievers achieves the same IS as generating 50,000 diverse golden retrievers in different poses, contexts, and lighting - as long as Inception confidently predicts "golden retriever" for all of them.

IS was largely superseded by FID after 2018 but is still reported for historical benchmark comparability.

2. Fréchet Inception Distance (FID)

The Core Idea

FID (Heusel et al. 2017) fundamentally improves on IS by comparing the generated distribution to a reference distribution of real images. Rather than measuring absolute quality in isolation, FID asks: "how different is the generated distribution from the real data distribution?"

The comparison is made in the feature space of a pretrained Inception v3 network, using the 2048-dimensional pool layer activations (not the classification logits). Both the real and generated image distributions are approximated as multivariate Gaussians, and the squared Fréchet (Wasserstein-2) distance between them is computed.

Mathematical Formulation

$FID = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\!\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$

where:

$\mu_r, \Sigma_r$ : mean and covariance of Inception v3 features from real images (2048-dim mean, $2048 \times 2048$ covariance)
$\mu_g, \Sigma_g$ : mean and covariance of Inception v3 features from generated images
$(\Sigma_r \Sigma_g)^{1/2}$ : matrix square root of the product $\Sigma_r \Sigma_g$ , computed numerically (for example with a Schur-based routine such as scipy.linalg.sqrtm)

The squared Wasserstein-2 distance between two Gaussians $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ is exactly this formula. FID = 0 when the generated distribution perfectly matches the real distribution (identical mean and covariance in feature space).

What Each Term Measures

Mean term $\|\mu_r - \mu_g\|^2$ : measures whether the average feature vector of generated images matches the average feature vector of real images. If generated images cluster in a different semantic region (e.g., always generating indoor scenes when the real distribution is mostly outdoor), this term is large.

Covariance term $\mathrm{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$ : measures whether the spread and correlation structure of generated features matches the real distribution. A mode-collapsed model (generates only a few types of images) has a covariance $\Sigma_g$ with many small eigenvalues where $\Sigma_r$ has large ones - this contributes significantly to FID. A well-calibrated diverse model matches the full covariance structure.
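To see the two terms separately, it helps to look at the special case where both covariances are diagonal: the matrix square root is then elementwise, and the trace term reduces to $\sum_i (\sqrt{\sigma_{r,i}^2} - \sqrt{\sigma_{g,i}^2})^2$. The sketch below assumes this diagonal case only (real Inception features are not diagonal), and `fid_diagonal` is an illustrative helper name.

```python
import numpy as np

def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """Return (mean_term, covariance_term) of FID for diagonal covariances.

    var_r and var_g are vectors of per-dimension variances.
    """
    mu_r, var_r, mu_g, var_g = (
        np.asarray(a, dtype=np.float64) for a in (mu_r, var_r, mu_g, var_g)
    )
    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    # For diagonal matrices, (Sigma_r Sigma_g)^{1/2} is elementwise sqrt(var_r * var_g)
    cov_term = float(np.sum(var_r + var_g - 2.0 * np.sqrt(var_r * var_g)))
    return mean_term, cov_term

dim = 4
mu_real, var_real = np.zeros(dim), np.ones(dim)

# Case 1: generated features are shifted but have the right spread
print(fid_diagonal(mu_real, var_real, np.full(dim, 0.5), np.ones(dim)))

# Case 2: right mean, but variance collapsed to 1% of the real variance
print(fid_diagonal(mu_real, var_real, np.zeros(dim), np.full(dim, 0.01)))
```

In the first case only the mean term is non-zero; in the second, the collapsed spread shows up entirely in the covariance term, which is how mode collapse raises FID even when the average image looks right.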

FID Progress on CIFAR-10

Model (Year)	FID (CIFAR-10)
GAN baseline (2016)	~40
ProgressiveGAN (2018)	~8.0
StyleGAN (2019)	~2.84
DDPM (Ho et al. 2020)	3.17
Improved DDPM (2021)	2.94
LDM-4 (Rombach et al. 2022)	2.95
DiT-XL/2 (Peebles et al. 2023)	1.79
EDM2-XXL (Karras et al. 2023)	1.58

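Lower is better. FID below 2.0 on CIFAR-10 is considered state-of-the-art as of 2023. Caveat: not every row above is a CIFAR-10 result. The LDM, DiT, and EDM2 papers report FID on other datasets (LSUN- and ImageNet-scale benchmarks), so treat the table as a rough timeline rather than a single-benchmark leaderboard, and check each paper for the dataset and resolution behind its number.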

FID Pitfalls - Critical Knowledge

Sample size bias (most important): the FID estimate is biased upward by a term that scales roughly as $1/N$ . The same model evaluated on $N=2000$ samples will report a higher (worse) FID than on $N=50000$ samples - even though the model has not changed. Small samples give noisy estimates of $\mu$ and especially of the $2048 \times 2048$ covariance matrix $\Sigma$ , and that estimation noise adds to the distance instead of averaging out.

Required sample counts for reliable FID:

Use Case	Minimum Samples	Standard
CIFAR-10 benchmark	10,000	50,000
ImageNet 256px	10,000	50,000
Custom high-res dataset	5,000	10,000+
Quick development sanity check	2,000	(results not comparable to published work)

Always report sample count. Comparing FID values computed at different sample sizes is invalid.
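The sample-size effect is easy to reproduce on synthetic data. The sketch below draws both "real" and "generated" features from the same Gaussian, so the true FID is exactly 0 and anything the estimator reports is bias. It assumes low-dimensional toy features (32 dimensions, not 2048) and uses the identity $\mathrm{Tr}((\Sigma_a\Sigma_b)^{1/2}) = \sum_i \sqrt{\lambda_i(\Sigma_a\Sigma_b)}$ to avoid a matrix square root; the function names are illustrative.

```python
import numpy as np

def fid_from_features(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """FID between two (N, D) feature arrays under the Gaussian approximation."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sigma_a = np.cov(feats_a, rowvar=False)
    sigma_b = np.cov(feats_b, rowvar=False)
    # Eigenvalues of a product of PSD matrices are real and non-negative
    # up to numerical noise, so keep the real part and clip at zero.
    eigvals = np.linalg.eigvals(sigma_a @ sigma_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(sigma_a) + np.trace(sigma_b) - 2.0 * tr_sqrt)

def same_distribution_fid(n: int, dim: int = 32, seed: int = 0) -> float:
    """FID between two independent samples of size n from the SAME Gaussian."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, dim))
    b = rng.standard_normal((n, dim))
    return fid_from_features(a, b)

fid_small = same_distribution_fid(n=100)
fid_large = same_distribution_fid(n=10_000)
print(f"N=100: {fid_small:.3f}   N=10000: {fid_large:.3f}")
```

Both numbers are estimates of a true distance of zero, yet the small-sample estimate is far larger. With 2048-dimensional Inception features the covariance has many more entries to estimate, which is why the recommended sample counts are so much higher than in this toy setting.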

Inception checkpoint and preprocessing version: multiple incompatible implementations exist. They differ in the Inception v3 weights they load and, just as importantly, in how images are resized and quantized before feature extraction. The pytorch-fid library uses a specific ported Inception v3 checkpoint. The clean-fid library standardizes the resize step (a correctly antialiased resize instead of the aliased resize functions found in common libraries), which yields systematically different values. TensorFlow-based FID uses yet another pipeline. Differences of 0.5-2.0 FID points across implementations are common for the same model. Never compare FID values from papers that used different implementations.

Gaussian approximation: the Gaussian fit is an approximation. Real image features are not multivariate Gaussian - they have heavy tails, multimodal structure, and non-linear correlations. FID's Gaussian assumption works well on average but can give misleading results for distributions with unusual structure.

Cannot measure text alignment: for text-to-image models, FID is computed against a reference dataset without considering the prompt. A model that generates beautiful, realistic images that consistently ignore the text prompt will have excellent FID. This is a critical blind spot - always pair FID with CLIP-T for text-to-image evaluation.

3. Precision and Recall for Generative Models

The Fundamental Limitation of FID

FID collapses quality and diversity into a single number. This hides important information: a model with FID=3.0 might achieve this via high quality + moderate diversity, or via moderate quality + high diversity. These models have very different failure modes and different appropriate use cases.

Kynkäänniemi et al. (2019) introduced a framework that explicitly separates quality from diversity using the geometric concept of manifolds.

Definitions

Precision: the fraction of generated samples that fall inside the real data manifold. Measures sample quality - are individual generated images realistic? Precision = 1.0 means every generated image is realistic (indistinguishable from real data). Precision = 0.0 means every generated image is unrealistic.

Recall: the fraction of the real data manifold covered by the generated distribution. Measures diversity / coverage - does the generator produce images covering all modes of the real distribution? Recall = 1.0 means the generated distribution covers all types of real images. Recall = 0.0 means no overlap with the real distribution.

k-NN Manifold Estimation

Manifolds are approximated using k-nearest-neighbor balls in feature space:

Extract features from $N$ real images using a VGG or Inception network: $\{f_r^i\}_{i=1}^{N}$
Extract features from $M$ generated images: $\{f_g^j\}_{j=1}^{M}$
For each real feature $f_r^i$ , compute the distance to its $k$ -th nearest neighbor among real features: $r_r^i = d_k(f_r^i)$ . The "real manifold ball" around $f_r^i$ is the ball of radius $r_r^i$ . Do the same among the generated features to obtain a radius $r_g^j$ for each $f_g^j$ .
Precision = fraction of generated features $f_g^j$ that fall inside at least one real feature ball: $f_g^j$ is "inside the real manifold" if $\exists\, i: \|f_g^j - f_r^i\|_2 \leq r_r^i$
Recall = fraction of real features $f_r^i$ that fall inside at least one generated feature ball: $f_r^i$ is "covered" if $\exists\, j: \|f_g^j - f_r^i\|_2 \leq r_g^j$ (using the generated manifold radii)

Standard $k=3$ from the original paper.
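The k-NN procedure above can be sketched in a few lines of NumPy. This is a brute-force illustration that assumes the feature arrays are small enough for a full pairwise distance matrix to fit in memory; it is not the authors' reference implementation, and the helper names are made up for this example.

```python
import numpy as np

def pairwise_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between rows of a and rows of b, shape (len(a), len(b))."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def knn_radii(features: np.ndarray, k: int = 3) -> np.ndarray:
    """Distance from each feature to its k-th nearest neighbor in the same set."""
    d = pairwise_distances(features, features)
    # After sorting each row, column 0 is the zero self-distance,
    # so column k is the k-th nearest neighbor.
    return np.sort(d, axis=1)[:, k]

def manifold_fraction(queries: np.ndarray, support: np.ndarray, k: int = 3) -> float:
    """Fraction of query points inside at least one k-NN ball of the support set."""
    radii = knn_radii(support, k)
    d = pairwise_distances(queries, support)  # (n_queries, n_support)
    return float((d <= radii[None, :]).any(axis=1).mean())

def precision_recall(real: np.ndarray, gen: np.ndarray, k: int = 3):
    precision = manifold_fraction(gen, real, k)  # generated inside real manifold
    recall = manifold_fraction(real, gen, k)     # real inside generated manifold
    return precision, recall

# Toy check: a "mode-collapsed" generator that only reproduces 5 real points
rng = np.random.default_rng(0)
real = rng.standard_normal((200, 8))
collapsed = np.repeat(real[:5], 40, axis=0) + 1e-6 * rng.standard_normal((200, 8))
print(precision_recall(real, collapsed))
```

The collapsed generator keeps precision high, because every sample sits on the real manifold, while recall falls toward zero. A single FID number would blur exactly this distinction.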

The Guidance Scale Tradeoff

Different CFG (Classifier-Free Guidance) scales trace different Precision-Recall curves:

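Low CFG (1-3): the model generates diverse samples near the unconditional distribution. High Recall (covers many modes), lower Precision (some samples are unrealistic or off-prompt). FID is usually reasonable here because coverage is good, though the exact FID-optimal scale depends on the model and sampler.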

Medium CFG (5-9): balanced quality and diversity. Optimal FID range. Good Precision and Recall simultaneously.

High CFG (12-20): the model generates only the most representative, "prototypical" images for each prompt. Very high Precision (all samples are realistic), but low Recall (misses uncommon modes). Higher FID because diversity drops.

Precision (Quality)
    ^
    |  ★ high CFG (12+)
    |   \  High quality, low diversity
    |    \
    |     ●──────●
    |      optimal CFG (5-9)
    |             \
    |              \────●
    |                   low CFG (1-3)
    |                   High diversity, lower quality
    +──────────────────────────────→ Recall (Diversity)

Production implication: when a model team reports that their new model "achieves better FID," ask whether they measured at the same CFG scale, and whether Precision and Recall were measured separately. A model might improve FID by increasing CFG scale, which improves Precision at the cost of Recall - this is a parameter tuning result, not a model improvement.

Density and Coverage (Improved Variant)

Naeem et al. (2020) proposed Density and Coverage as more robust alternatives:

Density: measures how many real manifold balls contain each generated sample - not just whether at least one ball contains it. A generated sample deep inside the real manifold (many real neighbors) gets a higher density score than one barely inside. More robust to outliers than Precision.

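Coverage: measures the fraction of real samples whose ball contains at least one generated sample. Conceptually this plays the role of Recall, but the neighborhoods are built around real samples instead of generated ones, which makes the estimate less sensitive to outliers among the generated samples and to the exact value of $k$ .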

For most practical comparisons, Precision/Recall and Density/Coverage give similar directional conclusions. Use whichever has cleaner implementation support for your codebase.
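Both quantities reuse the same real-sample balls, so they can be computed together. The following is a brute-force sketch of the definitions above, assuming small in-memory feature arrays; `density_coverage` is an illustrative name and `k=5` is an assumed default, not a value stated in this lesson.

```python
import numpy as np

def density_coverage(real: np.ndarray, gen: np.ndarray, k: int = 5):
    """Density and Coverage from (N, D) real and (M, D) generated features."""
    real = np.asarray(real, dtype=np.float64)
    gen = np.asarray(gen, dtype=np.float64)

    # Radius of each real ball: distance to the k-th nearest real neighbor
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    radii = np.sort(d_rr, axis=1)[:, k]  # column 0 is the self-distance

    # inside[i, j] is True when generated sample j lies in real ball i
    d_rg = np.linalg.norm(real[:, None, :] - gen[None, :, :], axis=-1)
    inside = d_rg <= radii[:, None]

    # Density: average number of real balls containing a generated sample, over k
    density = inside.sum() / (k * gen.shape[0])
    # Coverage: fraction of real balls containing at least one generated sample
    coverage = inside.any(axis=1).mean()
    return float(density), float(coverage)

rng = np.random.default_rng(0)
real = rng.standard_normal((200, 8))
gen = rng.standard_normal((200, 8))
print(density_coverage(real, gen))
```

Unlike Precision, Density is not capped at 1: samples concentrated in dense regions of the real manifold are counted by many balls at once.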

4. CLIP Score - Text-Image Alignment

Definition

For text-to-image models, photorealism alone is insufficient evaluation. The central product requirement is that generated images accurately represent the text prompt. CLIP Score (Hessel et al. 2021) measures this text-image alignment:

$\text{CLIP-S}(c, x) = w \cdot \max\!\left(\cos\!\left(\text{CLIP}_\text{text}(c),\, \text{CLIP}_\text{img}(x)\right),\; 0\right)$

where $c$ is the text prompt, $x$ is the generated image, $w = 2.5$ is a rescaling constant from the CLIP-S paper, and the cosine similarity is between normalized CLIP text and image embeddings. Some libraries instead report $\max(100 \cdot \cos(\cdot,\cdot),\, 0)$ without the factor $w$ , so check which convention a reported number uses before comparing.
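Once the embeddings exist, the score itself is a clipped, rescaled cosine similarity. The sketch below assumes the text and image embeddings have already been produced by a CLIP model and are passed in as plain vectors. The `scale` argument is a hypothetical convenience for implementations that multiply the cosine by 100; with the default `scale=1.0` the function computes $w \cdot \max(\cos, 0)$.

```python
import numpy as np

def clip_s(text_emb, image_emb, w: float = 2.5, scale: float = 1.0) -> float:
    """Clipped, rescaled cosine similarity between one text and one image embedding."""
    t = np.asarray(text_emb, dtype=np.float64)
    v = np.asarray(image_emb, dtype=np.float64)
    cos = float(t @ v / (np.linalg.norm(t) * np.linalg.norm(v)))
    # Negative similarities are clipped to zero before rescaling
    return w * max(scale * cos, 0.0)

# Toy 2-D "embeddings" (illustrative only; real CLIP embeddings have hundreds of dims)
print(clip_s([1.0, 0.0], [1.0, 0.0]))   # perfectly aligned
print(clip_s([1.0, 0.0], [1.0, 1.0]))   # partially aligned
print(clip_s([1.0, 0.0], [-1.0, 0.0]))  # opposite direction, clipped to 0
```

For a benchmark, the per-pair scores are typically averaged over all prompt-image pairs; report the CLIP variant and the scaling convention alongside the number.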

CLIP Score correlates with human text-image alignment judgments with Pearson $r \approx 0.72$ on DrawBench prompts - reasonable but not perfect.

CLIP Score Weaknesses

Counting objects: CLIP is poor at distinguishing "three dogs" from "five dogs" - CLIP embeddings do not reliably encode exact quantities. A model that generates two dogs when prompted for five may score highly on CLIP-T.

Spatial relationships: "a ball to the left of a box" vs "a ball above a box" - CLIP embeddings encode presence of objects better than their spatial arrangements. A model that places objects correctly on 50% of trials may score similarly to one that places them correctly 80% of the time.

Attribute binding: "a red chair and a blue table" - CLIP must determine not just that there is a red thing, a blue thing, a chair, and a table, but that the red-chair binding and blue-table binding are both correct. CLIP does this imperfectly.

English-centric: CLIP was trained primarily on English-language web data. CLIP Score is poorly calibrated for non-English prompts, rare terminology, and domain-specific concepts outside web-scale training data.

Adversarial examples: because CLIP is a fixed pretrained model, a generator can theoretically learn to produce images that are CLIP-optimal but perceptually wrong. CLIP-T measures what CLIP thinks is alignment, not what humans think.

Text-Image Evaluation Benchmarks

DrawBench (Saharia et al. 2022, Imagen): 200 prompts across categories designed to stress-test models - counting ("5 dogs"), colors ("a red horse"), descriptions, DALL-E-original prompts, text rendering (text in the image). Models rated by human evaluators for alignment and fidelity. The standard benchmark for text-to-image alignment at model release.

PartiPrompts (Yu et al. 2022, Parti): 1,632 English prompts across 12 categories and 11 challenge aspects. More systematic and comprehensive than DrawBench. Categories include abstract concepts, world knowledge, activities, animals, food, indoor/outdoor scenes.

T2I-CompBench (Huang et al. 2023): specifically designed to test compositional text-image alignment - attribute binding, spatial relationships (left/right/above/below), non-spatial relationships (wearing, riding, holding). Uses specialized automatic evaluators (BLIP-VQA for attribute binding, an object detector for spatial relationships, CLIP-based scoring for non-spatial relationships) to measure specific compositional properties instead of just overall alignment.

GenAI-Bench (Li et al. 2024): 1,600 prompts that correlate well with human ELO ratings on the HEIM benchmark. Provides more fine-grained skill decomposition than previous benchmarks.

5. DINO Similarity - Fine-Grained Identity

When CLIP Fails for Identity Preservation

For personalization tasks (DreamBooth, LoRA fine-tuning), the key evaluation question is: "does the generated image show the same specific character/object as the reference images?" This is distinct from semantic category agreement ("is this a dog?") - it requires fine-grained visual identity comparison ("is this the same specific dog with that white patch?").

CLIP-I (CLIP image-to-image cosine similarity) is commonly used but is suboptimal for this task. CLIP's image features are optimized for text-image alignment - they capture semantic category and style robustly but are relatively weak on fine-grained visual identity. Two dogs of the same breed in the same pose may have similar CLIP-I scores even if they are clearly different dogs.

DINO Features for Identity

DINO (Caron et al. 2021) trains a ViT with self-supervised self-distillation ("self-distillation with no labels"): a student network learns to match a momentum teacher across augmented views, without labels or explicit negative pairs. DINO features capture visual structure, texture, and local appearance at a finer level of detail than CLIP. Crucially, DINO is not trained to ignore differences between instances of the same category, whereas CLIP's language supervision makes it largely invariant to visual details that captions do not describe.

DINO-I (DINO image-to-image cosine similarity using ViT-B/16) correlates better with human judgments of "does this look like the same specific character?" than CLIP-I, particularly for the fine-grained identity cases common in personalization evaluation.

Using DINO-I in practice:

Generate 30-50 images of the personalized concept across diverse prompts
Compute DINO ViT-B/16 features for all generated images and all reference images
Average pairwise cosine similarity between generated and reference feature vectors
Target DINO-I > 0.6 for acceptable identity preservation
DINO-I > 0.75 indicates strong identity preservation
DINO-I > 0.85 may indicate overfitting (outputs look too similar to training images)
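The averaging step in the list above is a mean over all generated-reference pairs of cosine similarities. A minimal sketch, assuming the DINO features have already been extracted into two arrays (feature extraction itself is not shown, and `mean_pairwise_cosine` is an illustrative name):

```python
import numpy as np

def mean_pairwise_cosine(gen_feats, ref_feats) -> float:
    """Average cosine similarity over all (generated, reference) feature pairs.

    gen_feats: (M, D) features of generated images
    ref_feats: (R, D) features of reference images
    """
    g = np.asarray(gen_feats, dtype=np.float64)
    r = np.asarray(ref_feats, dtype=np.float64)
    # L2-normalize rows so the dot product equals the cosine similarity
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    return float((g @ r.T).mean())

# Toy features (illustrative only): one generated image matches the reference,
# the other is orthogonal to it.
gen = np.array([[1.0, 0.0], [0.0, 1.0]])
ref = np.array([[2.0, 0.0]])
print(mean_pairwise_cosine(gen, ref))
```

The same function works for CLIP-I if CLIP image features are passed instead, which makes it easy to report both scores side by side.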

DINO vs CLIP Feature Comparison

Aspect	CLIP-I	DINO-I
Semantic category agreement	Excellent	Good
Fine-grained visual identity	Moderate	Excellent
Style/texture similarity	Good	Excellent
Pose invariance	High	Lower (more pose-sensitive)
Use case	Coarse semantic match	Fine-grained identity match
Standard benchmark use	T2I evaluation	Personalization evaluation

6. Human Evaluation

When Automatic Metrics Fail

Every automatic metric has systematic gaps:

FID: does not measure text alignment, cannot detect style-appropriate failures, Gaussian assumption breaks down
CLIP-T: poor at counting, spatial relations, attribute binding, non-English prompts
Precision/Recall: captures statistical distribution coverage, not aesthetic quality or physical plausibility
DINO-I: measures visual similarity but not whether the concept is deployed in a contextually correct way
IS: all the problems discussed above

No automatic metric currently captures: long-range coherence in video (does the story make sense?), physical plausibility (does this obey gravity?), artistic style fidelity (does this look like it was made by the referenced artist?), or subtle semantic correctness (is the animal doing exactly what the prompt says?).

Human evaluation remains the gold standard for flagship model comparisons, new paper claims, and any evaluation that will influence major product decisions.

ELO Rating for Pairwise Comparisons

The Elo rating system (named after Arpad Elo and often written "ELO"), adapted from chess, is the standard approach for comparing models on human preference:

$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \quad R_A' = R_A + K(S_A - E_A)$

where $E_A$ is the expected win probability for model A, $R_A, R_B$ are current ratings, $K$ is the update constant (typically 32), and $S_A$ is 1 (win), 0.5 (tie), or 0 (loss).

How it works in practice:

Present annotators with pairs of images from two different models, generated from the same prompt
Ask "which image better matches the prompt?" (or "which do you prefer?")
Update both models' ELO scores based on the result
After many comparisons, models settle into a stable ranking
ELO differences are meaningful: 200 ELO points ≈ 76% win rate for the higher-rated model
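The update rule can be written directly from the two formulas above. This is a minimal sketch of a single pairwise update, assuming both models use the same constant $K$; the function names are illustrative.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B given current ratings."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, s_a: float, k: float = 32.0):
    """Return updated (r_a, r_b) after one comparison.

    s_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (s_a - e_a)
    # B's actual and expected scores are the complements of A's
    new_b = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

print(expected_score(1200.0, 1000.0))   # win probability at a 200-point gap
print(elo_update(1000.0, 1000.0, 1.0))  # equal ratings, A wins
```

Because sequential updates depend on the order in which comparisons arrive, it is worth shuffling the comparison log and repeating the fit a few times to see how stable the final ranking is.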

Sample size for stability: 5,000-20,000 pairwise comparisons per model comparison for stable ELO rankings. Chatbot Arena (LMSYS) and ImagenHub use this approach.

MOS (Mean Opinion Score) for Audio

For audio generation, human evaluation uses MOS (Mean Opinion Score): annotators rate audio samples on a 1-5 integer scale:

5: Excellent - like the best audio I have heard
4: Good - some perceptible imperfections but not annoying
3: Fair - somewhat annoying imperfections
2: Poor - annoying, hard to listen to
1: Bad - very annoying, nearly unusable

MOS is standard in TTS and audio generation evaluation. Key challenges: annotators calibrate differently (one person's 4 is another's 3), and MOS conflates multiple quality dimensions (naturalness, expressiveness, clarity, absence of artifacts). Use anchor clips at known quality levels to help annotators calibrate their scales.
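A MOS is only interpretable together with its uncertainty. The sketch below computes the mean rating and a 95% confidence half-width using a normal approximation. It assumes all ratings for one system are pooled and treated as independent, which ignores the per-annotator calibration effects described above; `mos_with_ci` is an illustrative name.

```python
import numpy as np

def mos_with_ci(ratings, z: float = 1.96):
    """Return (mean opinion score, confidence half-width) for 1-5 ratings."""
    r = np.asarray(ratings, dtype=np.float64)
    mean = float(r.mean())
    if r.size < 2:
        return mean, float("nan")  # no spread estimate from a single rating
    # Standard error of the mean with the sample (ddof=1) standard deviation
    half_width = float(z * r.std(ddof=1) / np.sqrt(r.size))
    return mean, half_width

# Hypothetical ratings for one system
mean, hw = mos_with_ci([4, 5, 3, 4, 4, 5, 3, 4])
print(f"MOS = {mean:.2f} +/- {hw:.2f}")
```

Two systems whose intervals overlap heavily should not be ranked on MOS alone; collecting more ratings or switching to a paired comparison is the usual next step.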

Rigorous Study Design

For a publishable or production-grade human evaluation:

1. Forced binary choice, not Likert scale: "which image better matches the prompt?" is more reliable than "rate this image 1-7." Binary forced choice has lower variance across annotators. Use Likert only when you need absolute quality judgments (MOS-style).

2. Separate dimensions: rate "matches the prompt" and "image quality" separately. An image can be photorealistic but off-prompt, or cartoonish but perfectly prompt-following. Conflating in one rating reduces signal.

3. Counterbalancing: show each model in position A (left) and position B (right) equally often. Annotators have a position bias toward the first image shown - this bias is real and consistent across annotators.

4. Prompt sampling: sample prompts systematically from the benchmark (e.g., stratified by DrawBench category) rather than cherry-picking easy cases. Cherry-picked prompts produce overoptimistic results.

5. Inter-annotator agreement: compute Krippendorff's $\alpha$ (any number of annotators), Cohen's $\kappa$ (two annotators), or Fleiss' $\kappa$ (more than two). For text-image preference, $\alpha \approx 0.5-0.7$ is typical. If $\alpha < 0.4$ , the task is too subjective or poorly defined - redesign the annotation instructions or task decomposition.

6. Statistical power: for a 95% confidence interval of ±3%, you need at least 1,000 pairwise comparisons per model pair with 3+ independent ratings each. Use a power analysis calculator before designing the study.

7. Annotator qualification: use qualification tasks with known-answer pairs to screen annotators before including their ratings. Annotators who consistently prefer lower-quality images may have different aesthetic sensibilities or may be rushing through the task.
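Two of the checks above reduce to short calculations: the number of comparisons needed for a target margin of error, and agreement between two annotators. The sketch below assumes a normal approximation to the binomial for the sample size (worst case at a 50% win rate) and plain Cohen's $\kappa$ for exactly two annotators; the function names are illustrative.

```python
import math
from collections import Counter

def comparisons_for_margin(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Comparisons needed so a win-rate estimate has the given +/- margin.

    Normal approximation: n = z^2 * p * (1 - p) / margin^2.
    """
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))

def cohen_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label frequencies
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both annotators always give the same single label
    return (observed - expected) / (1.0 - expected)

print(comparisons_for_margin(0.03))  # comparisons for a +/-3% margin at 95% confidence
print(cohen_kappa(["A", "B", "A", "A"], ["A", "B", "B", "A"]))
```

The sample-size figure counts independent comparisons; repeated ratings of the same image pair by different annotators are correlated and mainly help with agreement estimates, not with narrowing the win-rate interval.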

7. The Quality-Diversity Pareto Frontier

The key practical insight: FID-optimal and human-preferred CFG scales differ. FID tends to be minimized at low-to-moderate CFG (at or below the 5-9 range, depending on the model and sampler) because it rewards both quality and diversity. Humans prefer higher CFG (9-12) because they find saturated, sharp images more impressive even when diversity is lower. Models optimized purely for human preference via RLHF often show higher FID than models optimized for FID - the metrics systematically diverge.

FID does not equal human preference. Both are valid metrics, but for different goals.

8. Complete Python Evaluation Pipeline

"""
Complete evaluation suite for generative image models.
Covers: FID (from scratch), CLIP-T score, Precision/Recall,
and the clean-fid library for production-grade evaluation.
"""

import torch
import numpy as np
from scipy import linalg
from typing import List, Tuple, Dict, Optional
from PIL import Image


# ============================================================
# FID implementation from scratch
# ============================================================

class FIDScorer:
    """
    FID scorer using InceptionV3 pool features.

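    Critical implementation details:
    - Use the SAME Inception checkpoint as the papers you compare against.
      The torchvision ImageNet weights loaded below are NOT the ported FID
      weights used by pytorch-fid / clean-fid, so values from this class are
      only meaningful for relative comparisons within this class.
    - Resize to 299x299 (Inception input size) before feature extraction
    - Compute the covariance in float64 (np.cov promotes automatically; float16 loses precision)
    - Need at least 10,000 samples for reliable results (50,000 is standard)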
    """

    def __init__(self, device: str = "cuda"):
        import torchvision.models as models
        from torchvision import transforms

        self.device = device

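        # InceptionV3 with pretrained ImageNet weights (torchvision >= 0.13 API;
        # the older pretrained=True flag is deprecated).
        # transform_input=False: self.preprocess already maps pixels to [-1, 1],
        # and torchvision would otherwise re-normalize the input a second time.
        # Remove final classifier - we want 2048-dim pool features
        self.inception = models.inception_v3(
            weights=models.Inception_V3_Weights.IMAGENET1K_V1,
            transform_input=False,
        )
        self.inception.fc = torch.nn.Identity()
        self.inception.eval().to(device)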

        # Preprocessing: resize to 299x299, normalize to [0, 1], then to [-1, 1]
        self.preprocess = transforms.Compose([
            transforms.Resize((299, 299), interpolation=transforms.InterpolationMode.BILINEAR),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
        ])

    @torch.no_grad()
    def extract_features(
        self,
        images: List[Image.Image],
        batch_size: int = 64
    ) -> np.ndarray:
        """
        Extract 2048-dim InceptionV3 pool features from a list of PIL images.

        Args:
            images: list of PIL Images
            batch_size: GPU batch size

        Returns:
            features: (N, 2048) float32 numpy array
        """
        all_features = []

        for i in range(0, len(images), batch_size):
            batch_pil = images[i : i + batch_size]
            batch = torch.stack([self.preprocess(img) for img in batch_pil])
            batch = batch.to(self.device)

            # InceptionV3 forward
            features = self.inception(batch)
            if isinstance(features, tuple):
                features = features[0]  # Some versions return (logits, aux_logits)

            all_features.append(features.cpu().numpy())

        return np.concatenate(all_features, axis=0)  # (N, 2048)

    def compute_statistics(self, features: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """
        Fit a Gaussian to the feature distribution.

        Returns: (mu, sigma) where mu: (2048,), sigma: (2048, 2048)
        This is the Gaussian approximation used in FID.
        """
        mu = np.mean(features, axis=0)           # (2048,)
        sigma = np.cov(features, rowvar=False)   # (2048, 2048) - uses Bessel correction
        return mu, sigma

    def frechet_distance(
        self,
        mu1: np.ndarray, sigma1: np.ndarray,
        mu2: np.ndarray, sigma2: np.ndarray,
    ) -> float:
        """
        Compute the squared Fréchet (Wasserstein-2) distance between two Gaussians:
        FID = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2*(sigma1 @ sigma2)^{1/2})

        The matrix square root is computed via scipy's sqrtm (a Schur-based algorithm).
        Small imaginary components from numerical error are discarded.
        """
        # Mean difference squared
        mean_diff_sq = np.sum((mu1 - mu2) ** 2)

        # Matrix square root of sigma1 @ sigma2
        product = sigma1 @ sigma2
        # Called without `disp` so the same call works across SciPy versions
        sqrt_product = linalg.sqrtm(product)

        # Handle numerical imaginary parts
        if np.iscomplexobj(sqrt_product):
            if not np.allclose(np.imag(sqrt_product), 0, atol=1e-3):
                raise ValueError(
                    "Matrix square root has large imaginary components - "
                    "check that sigma matrices are positive semi-definite"
                )
            sqrt_product = np.real(sqrt_product)

        # Trace term
        trace_term = np.trace(sigma1) + np.trace(sigma2) - 2 * np.trace(sqrt_product)

        fid = mean_diff_sq + trace_term
        return float(fid)

    def compute_fid(
        self,
        real_images: List[Image.Image],
        generated_images: List[Image.Image],
        batch_size: int = 64,
    ) -> float:
        """
        Compute FID between two sets of images.

        IMPORTANT: Use at least 10,000 images per set (50,000 for paper-quality results).
        The finite-sample FID estimate is biased upward (roughly as 1/N), which makes
        small-sample evaluations appear worse than they are.
        """
        # Warn at the same threshold the message refers to (10,000 samples)
        if len(real_images) < 10_000 or len(generated_images) < 10_000:
            import warnings
            warnings.warn(
                f"FID computed on {min(len(real_images), len(generated_images))} samples. "
                f"Results are unreliable below 10,000. "
                f"Do not compare this number to published benchmarks.",
                UserWarning
            )

        print(f"Extracting features from {len(real_images)} real images...")
        real_features = self.extract_features(real_images, batch_size)

        print(f"Extracting features from {len(generated_images)} generated images...")
        gen_features = self.extract_features(generated_images, batch_size)

        mu_r, sigma_r = self.compute_statistics(real_features)
        mu_g, sigma_g = self.compute_statistics(gen_features)

        fid = self.frechet_distance(mu_r, sigma_r, mu_g, sigma_g)
        return fid


# ============================================================
# Production FID with clean-fid library
# ============================================================

def compute_fid_cleanfid(
    real_image_dir: str,
    generated_image_dir: str,
    model_name: str = "inception_v3",  # or "clip_vit_b_32" for domain-shifted data
    num_workers: int = 4,
) -> float:
    """
    Production-grade FID using the clean-fid library.

    clean-fid fixes the Inception preprocessing inconsistency (mode="clean"
    resizes with an antialiased PIL bicubic filter instead of the
    non-antialiased resizing used by older implementations) and provides
    precomputed statistics for standard benchmarks.

    install: pip install clean-fid

    Note: Use model_name="inception_v3" for comparability with most papers.
    Use model_name="clip_vit_b_32" for domain-shifted datasets (non-ImageNet-like).
    """
    # Only the import is guarded, so errors raised inside compute_fid are not masked.
    try:
        from cleanfid import fid
    except ImportError as exc:
        raise ImportError("Install with: pip install clean-fid") from exc

    return fid.compute_fid(
        real_image_dir,
        generated_image_dir,
        mode="clean",
        model_name=model_name,
        num_workers=num_workers,
    )


# ============================================================
# CLIP Score - text-image alignment
# ============================================================

class CLIPScorer:
    """
    Measures text-image alignment using CLIP embeddings.

    score(c, x) = 2.5 * max(100 * cos(CLIP_text(c), CLIP_img(x)), 0)

    The CLIPScore paper defines CLIP-S = w * max(cos, 0) with w = 2.5. This
    class also multiplies by 100, so scores are 250x the raw cosine and lie
    in [0, 250]. A raw cosine of 0.20-0.32 therefore maps to 50-80 here.
    """

    def __init__(
        self,
        model_name: str = "openai/clip-vit-large-patch14",  # ViT-L/14 is best quality
        device: str = "cuda"
    ):
        from transformers import CLIPProcessor, CLIPModel
        self.device = device
        self.model = CLIPModel.from_pretrained(model_name).to(device)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.model.eval()

    @torch.no_grad()
    def compute_clip_score(
        self,
        images: List[Image.Image],
        prompts: List[str],
        calibration_weight: float = 2.5,
    ) -> Tuple[float, List[float]]:
        """
        Compute CLIP-T score for a list of (image, prompt) pairs.

        Args:
            images: generated images
            prompts: corresponding text prompts used to generate them
            calibration_weight: 2.5 from the CLIP-S paper

        Returns:
            (mean_score, per_pair_scores)
        """
        assert len(images) == len(prompts), "images and prompts must have the same length"
        scores = []

        for img, prompt in zip(images, prompts):
            inputs = self.processor(
                text=[prompt], images=[img],
                return_tensors="pt", padding=True, truncation=True
            ).to(self.device)

            outputs = self.model(**inputs)
            img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
            txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)

            cos_sim = (img_emb * txt_emb).sum(dim=-1).item()
            clip_s = calibration_weight * max(100.0 * cos_sim, 0.0)
            scores.append(clip_s)

        return float(np.mean(scores)), scores


# ============================================================
# Precision and Recall for generative models
# ============================================================

class PrecisionRecallScorer:
    """
    Manifold-based Precision and Recall (Kynkäänniemi et al. 2019).

    Precision = fraction of generated samples inside real data manifold (quality)
    Recall = fraction of real data manifold covered by generated samples (diversity)

    Manifold approximation: k-NN balls in feature space.
    Standard: k=3, VGG or Inception features.
    """

    def __init__(self, k: int = 3):
        self.k = k

    def _kth_nn_distances(self, features: np.ndarray) -> np.ndarray:
        """
        For each sample, compute the distance to its k-th nearest neighbor.
        This is the radius of the k-NN ball approximating the manifold at that point.
        """
        from sklearn.neighbors import NearestNeighbors

        nn = NearestNeighbors(
            n_neighbors=self.k + 1,  # +1 because the sample is its own nearest neighbor
            metric='euclidean',
            algorithm='ball_tree',
            n_jobs=-1,
        )
        nn.fit(features)
        distances, _ = nn.kneighbors(features)
        # distances[:, 0] = 0 (distance to self), distances[:, k] = k-th neighbor
        return distances[:, -1]  # (N,) - manifold radius at each sample

    def _fraction_in_manifold(
        self,
        probe_features: np.ndarray,
        manifold_features: np.ndarray,
        manifold_radii: np.ndarray,
        batch_size: int = 512,
    ) -> float:
        """
        Count fraction of probe samples that fall inside the manifold.

        A probe sample p is "inside the manifold" if there exists a manifold
        sample m such that distance(p, m) <= radius(m).
        """
        from sklearn.metrics import pairwise_distances

        in_manifold = np.zeros(len(probe_features), dtype=bool)

        # Checking only the nearest manifold sample would undercount: a probe can
        # lie outside its nearest neighbor's ball yet inside a farther sample's
        # larger ball. Compare against every ball, in batches to bound memory.
        for start in range(0, len(probe_features), batch_size):
            batch = probe_features[start:start + batch_size]
            dists = pairwise_distances(batch, manifold_features, metric="euclidean")  # (B, N)
            in_manifold[start:start + batch_size] = np.any(
                dists <= manifold_radii[None, :], axis=1
            )

        return float(np.mean(in_manifold))

    def compute(
        self,
        real_features: np.ndarray,
        gen_features: np.ndarray,
    ) -> Tuple[float, float]:
        """
        Compute Precision and Recall.

        Args:
            real_features: (N, D) feature vectors from real images
            gen_features: (M, D) feature vectors from generated images

        Returns:
            (precision, recall) - both in [0, 1]
        """
        print(f"Computing k-NN radii for real manifold (k={self.k})...")
        real_radii = self._kth_nn_distances(real_features)

        print(f"Computing k-NN radii for generated manifold (k={self.k})...")
        gen_radii = self._kth_nn_distances(gen_features)

        print("Computing Precision (generated inside real manifold)...")
        precision = self._fraction_in_manifold(gen_features, real_features, real_radii)

        print("Computing Recall (real manifold covered by generated)...")
        recall = self._fraction_in_manifold(real_features, gen_features, gen_radii)

        return precision, recall


# ============================================================
# Complete evaluation runner
# ============================================================

def run_comprehensive_evaluation(
    real_images: List[Image.Image],
    generated_images: List[Image.Image],
    prompts: Optional[List[str]] = None,
    device: str = "cuda",
) -> Dict[str, float]:
    """
    Run all standard generative model metrics.

    Minimum recommended: 10,000 images per set.
    For publishable FID: 50,000 images.

    Returns dictionary of metric names to values.
    """
    results = {}

    # FID (features are extracted once here and reused for Precision/Recall)
    print("=== Computing FID ===")
    fid_scorer = FIDScorer(device=device)
    real_features = fid_scorer.extract_features(real_images)
    gen_features = fid_scorer.extract_features(generated_images)
    mu_r, sigma_r = fid_scorer.compute_statistics(real_features)
    mu_g, sigma_g = fid_scorer.compute_statistics(gen_features)
    results["FID"] = fid_scorer.frechet_distance(mu_r, sigma_r, mu_g, sigma_g)
    print(f"FID: {results['FID']:.3f}")

    # Precision and Recall (reuse the Inception features extracted above)
    print("=== Computing Precision/Recall ===")
    pr_scorer = PrecisionRecallScorer(k=3)
    results["Precision"], results["Recall"] = pr_scorer.compute(real_features, gen_features)
    print(f"Precision: {results['Precision']:.3f}, Recall: {results['Recall']:.3f}")

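    # CLIP Score (requires prompt-image pairs); fail loudly on a length mismatch
    if prompts is not None:
        if len(prompts) != len(generated_images):
            raise ValueError("prompts and generated_images must have the same length")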
        print("=== Computing CLIP-T Score ===")
        clip_scorer = CLIPScorer(device=device)
        results["CLIP-T"], _ = clip_scorer.compute_clip_score(generated_images, prompts)
        print(f"CLIP-T Score: {results['CLIP-T']:.2f}")

    print("\n=== Summary ===")
    for k, v in results.items():
        print(f"  {k}: {v:.4f}")

    return results


# ============================================================
# Statistical significance for FID comparisons
# ============================================================

def bootstrap_fid_confidence_interval(
    real_features: np.ndarray,
    gen_features: np.ndarray,
    n_bootstrap: int = 100,
    confidence: float = 0.95,
) -> Tuple[float, float, float]:
    """
    Compute bootstrap confidence interval for FID to assess statistical significance.

    When comparing two models:
    - Model A FID: compute bootstrap CI
    - Model B FID: compute bootstrap CI
    - If the CIs do not overlap, the difference is statistically significant;
      overlapping CIs are inconclusive rather than proof of equivalence

    Returns: (fid_mean, ci_lower, ci_upper)

    Note: This requires a large number of samples to be meaningful.
    Use at least 10,000 samples per set.
    """
    # The Fréchet distance is computed inline below, so no Inception model is loaded here.

    fid_bootstrap = []
    n_real = len(real_features)
    n_gen = len(gen_features)

    for _ in range(n_bootstrap):
        # Sample with replacement
        real_sample = real_features[np.random.randint(0, n_real, n_real)]
        gen_sample = gen_features[np.random.randint(0, n_gen, n_gen)]

        mu_r, sigma_r = np.mean(real_sample, axis=0), np.cov(real_sample, rowvar=False)
        mu_g, sigma_g = np.mean(gen_sample, axis=0), np.cov(gen_sample, rowvar=False)

        product = sigma_r @ sigma_g
        sqrt_product, _ = linalg.sqrtm(product, disp=False)
        if np.iscomplexobj(sqrt_product):
            sqrt_product = np.real(sqrt_product)

        fid = (
            np.sum((mu_r - mu_g) ** 2)
            + np.trace(sigma_r) + np.trace(sigma_g) - 2 * np.trace(sqrt_product)
        )
        fid_bootstrap.append(float(fid))

    fid_bootstrap = np.array(fid_bootstrap)
    alpha = 1 - confidence
    ci_lower = np.percentile(fid_bootstrap, 100 * alpha / 2)
    ci_upper = np.percentile(fid_bootstrap, 100 * (1 - alpha / 2))

    return float(np.mean(fid_bootstrap)), float(ci_lower), float(ci_upper)

9. Metric Gaming and Evaluation Failure Modes

How FID Can Be Gamed

Understanding metric gaming is important both for detecting fraudulent benchmarks and for understanding why optimizing directly for FID is dangerous:

Memorization attack: a model that memorizes training images and generates them with small random perturbations will achieve near-zero FID. It is completely useless for generation but scores perfectly on the metric.

Resolution gaming: FID is computed after resizing to 299×299 for Inception. A model that generates sharp high-frequency noise at 512px but becomes smooth at 299px can appear to have low FID despite generating images that look wrong at native resolution.

Sample size gaming: compute FID on 50,000 samples for your own model, then compare it against competitors' published numbers that were computed on 2,000 samples. Small-sample FID is inflated by the upward bias, so the competitors look worse than they are. The comparison is invalid but difficult to detect without access to the original evaluation code.

Mode dropping detection failure: FID summarizes each feature distribution by only a mean and a covariance matrix (the Gaussian approximation). A model that drops 20% of modes (generates 80% of the diversity of the real distribution) may score a similar FID to a model that covers all modes, because the mean and covariance estimated from 80% coverage can be close to those from 100% coverage.
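The memorization attack can be illustrated with a small synthetic experiment. The sketch below is not an Inception-based evaluation: it assumes low-dimensional Gaussian-mixture vectors as stand-ins for image features, and the helper names (`frechet_distance`, `sample_mixture`) are hypothetical. It compares a "memorizer" (training features plus tiny noise) with an honest sampler from the same distribution, both scored against the training set.

```python
import numpy as np
from scipy import linalg


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Squared Frechet distance between Gaussians fitted to two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sigma_a = np.cov(feats_a, rowvar=False)
    sigma_b = np.cov(feats_b, rowvar=False)
    covmean = np.real(linalg.sqrtm(sigma_a @ sigma_b))  # drop tiny imaginary parts
    return float(
        np.sum((mu_a - mu_b) ** 2)
        + np.trace(sigma_a) + np.trace(sigma_b) - 2.0 * np.trace(covmean)
    )


def sample_mixture(rng: np.random.Generator, centers: np.ndarray, n: int) -> np.ndarray:
    """Draw n points from an equal-weight mixture of unit Gaussians (assumed toy data)."""
    idx = rng.integers(0, len(centers), size=n)
    return centers[idx] + rng.normal(size=(n, centers.shape[1]))


rng = np.random.default_rng(0)
centers = 3.0 * rng.normal(size=(5, 16))      # 5 modes in a 16-dim toy feature space

train = sample_mixture(rng, centers, 5000)    # reference set used for FID
memorized = train + 0.01 * rng.normal(size=train.shape)  # copies plus small noise
fresh = sample_mixture(rng, centers, 5000)    # honest samples from the true distribution

fid_memorized = frechet_distance(train, memorized)
fid_fresh = frechet_distance(train, fresh)
print(f"memorizer FID:      {fid_memorized:.6f}")
print(f"honest sampler FID: {fid_fresh:.6f}")
```

In this toy setup the memorizer scores lower than a model that samples the true distribution perfectly, which is why a nearest-neighbor check against the training set is a useful companion to FID.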

The Alignment Tax

There is a consistent empirical pattern: models optimized via RLHF for human preference achieve higher human preference scores but worse FID than models optimized for FID. The reason: human preferences systematically favor high-CFG outputs (sharp, saturated, "impressive") that have lower diversity and worse FID. Models that learn this preference via RLHF shift toward that region of the quality-diversity Pareto frontier.

This means: FID and human preference are not the same thing, and optimizing one will hurt the other. Both are valid objectives for different evaluation contexts. FID is appropriate for scientific benchmarking and distribution matching. Human preference is appropriate for product quality evaluation.

Evaluation Budget: How Many Samples for Reliable FID?

The FID sample size bias follows approximately:

$\mathbb{E}[FID_{N}] \approx FID_{\infty} + \frac{C}{N}$

where $C > 0$ depends on the specific dataset and model, so a smaller $N$ gives a higher (worse) expected FID. The bias matters most for low-FID models: the same $C/N$ term is a larger fraction of a small true FID.
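If the $C/N$ form holds, the bias can be estimated empirically: evaluate FID at several sample sizes, fit a line in $1/N$, and read off the intercept as an estimate of $FID_{\infty}$. The following is a minimal sketch under stated assumptions: synthetic Gaussian vectors stand in for Inception features, only the generated set is subsampled while the reference set is held fixed, and `extrapolate_fid` is a hypothetical helper name.

```python
import numpy as np
from scipy import linalg


def fid_from_features(real: np.ndarray, gen: np.ndarray) -> float:
    """Squared Frechet distance between Gaussians fitted to two feature sets."""
    mu_r, mu_g = real.mean(axis=0), gen.mean(axis=0)
    sigma_r = np.cov(real, rowvar=False)
    sigma_g = np.cov(gen, rowvar=False)
    covmean = np.real(linalg.sqrtm(sigma_r @ sigma_g))
    return float(
        np.sum((mu_r - mu_g) ** 2)
        + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * np.trace(covmean)
    )


def extrapolate_fid(real, gen, sample_sizes, n_repeats=5, seed=0):
    """Fit FID_N = FID_inf + C / N by least squares; return (FID_inf, C)."""
    rng = np.random.default_rng(seed)
    inv_n, fids = [], []
    for n in sample_sizes:
        for _ in range(n_repeats):
            idx = rng.choice(len(gen), size=n, replace=False)  # subsample generated set
            inv_n.append(1.0 / n)
            fids.append(fid_from_features(real, gen[idx]))
    slope, intercept = np.polyfit(inv_n, fids, deg=1)  # highest-degree coefficient first
    return float(intercept), float(slope)


# Toy data: both sets come from the same distribution, so the true FID is 0
# and every nonzero value measured at finite N is estimation bias.
rng = np.random.default_rng(0)
real = rng.normal(size=(20_000, 32))
gen = rng.normal(size=(20_000, 32))

fid_inf, slope = extrapolate_fid(real, gen, sample_sizes=[500, 1000, 2000, 5000])
print(f"estimated FID_inf: {fid_inf:.4f}, fitted C: {slope:.1f}")
```

A positive fitted slope confirms the direction of the bias. With real Inception features the same fit gives a rough way to compare models that were evaluated at different sample sizes.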

Practical guidelines derived from the bias analysis:

| Difference You Want to Detect | Required Samples (each set) |
| --- | --- |
| 0.1 FID points (fine comparison) | 100,000+ |
| 0.5 FID points (typical paper improvement) | 50,000 |
| 1.0 FID points (major improvement) | 20,000 |
| 2.0 FID points (architectural change) | 10,000 |
| Sanity check only | 2,000-5,000 |

For detecting differences at the frontier (FID below 3.0), even 50,000 samples may be insufficient without bootstrap confidence intervals. Small FID improvements at the frontier may fall within the sampling noise of the estimate.

10. Domain-Specific Evaluation Metrics

Audio Generation

FAD (Fréchet Audio Distance): the FID equivalent for audio. Extract features from a pretrained audio classifier (VGGish trained on AudioSet), fit Gaussians to real and generated audio features, compute Fréchet distance. Same interpretation as FID: lower is better. Same pitfalls: sample size bias, classifier-specific features.

KL divergence on AudioSet classes: compute AudioSet class probability distributions for real and generated audio clips, compute KL divergence. Measures whether generated audio contains the right categories of sound.

CLAP Score: for text-to-audio generation, compute cosine similarity between CLAP text and audio embeddings - the audio equivalent of CLIP-T.

MOS-LQO: automated MOS prediction using a pretrained quality assessment model (for example DNSMOS for noise-suppressed speech, or PLCMOS for packet-loss-concealed speech). Correlates reasonably with human MOS without requiring human annotators.

Video Generation

FVD (Fréchet Video Distance): FID for videos using features from I3D (Inflated 3D ConvNet) trained on Kinetics-400 action recognition. Captures both spatial quality and temporal motion statistics. Same sample size pitfalls as FID.

Temporal consistency score: compute optical flow between adjacent frames, measure the variance of flow magnitude. Smooth optical flow indicates temporally consistent motion; high variance indicates frame-to-frame incoherence.
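One possible reading of the temporal consistency score is sketched below. It assumes the optical flow has already been estimated by some external method and is supplied as an array of shape `(T-1, H, W, 2)`, and it interprets "variance of flow magnitude" as the variance over time of the per-step mean magnitude. Both the input layout and the function name `flow_magnitude_variance` are assumptions for illustration.

```python
import numpy as np


def flow_magnitude_variance(flows: np.ndarray) -> float:
    """
    flows: (T-1, H, W, 2) array of per-pixel (dx, dy) displacements between
    adjacent frames. Returns the variance across time of the mean flow
    magnitude; lower values mean steadier motion.
    """
    if flows.ndim != 4 or flows.shape[-1] != 2:
        raise ValueError("expected flows with shape (T-1, H, W, 2)")
    magnitude = np.linalg.norm(flows, axis=-1)   # (T-1, H, W)
    per_step = magnitude.mean(axis=(1, 2))       # (T-1,) mean motion per frame pair
    return float(per_step.var())


# Toy flows: steady motion versus motion whose speed jumps between frame pairs.
steady = np.ones((8, 4, 4, 2))
speed = np.array([0.2, 3.0] * 4)[:, None, None, None]
jittery = steady * speed

score_steady = flow_magnitude_variance(steady)
score_jittery = flow_magnitude_variance(jittery)
print(f"steady: {score_steady:.4f}, jittery: {score_jittery:.4f}")
```

This score rewards constant-speed motion, so clips with legitimate acceleration will also score higher. It is best used to compare models on the same prompts rather than as an absolute measure.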

CLIP-based video-text alignment: compute CLIP-T between the text prompt and individual sampled frames, average across frames. Does not capture temporal coherence of the content across frames.

Action recognition accuracy: for action-conditioned video generation, measure whether a pretrained action recognition classifier (I3D, TimeSformer) predicts the correct action class on generated videos.

Protein and Molecule Generation

Validity rate: fraction of generated molecules satisfying chemical valency rules. Target > 95%.

Uniqueness: fraction of valid generated molecules that are structurally distinct (different SMILES strings after canonicalization).

Novelty: fraction of valid, unique molecules not found in the training set.

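QED (Quantitative Estimate of Drug-likeness): composite metric measuring resemblance to known oral drugs on a 0-1 scale. It combines desirability functions over eight molecular properties (molecular weight, logP, hydrogen-bond donors and acceptors, polar surface area, rotatable bonds, aromatic rings, structural alerts), several of which overlap with Lipinski's Rule of Five.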

SA Score (Synthetic Accessibility): predicts how difficult a molecule would be to synthesize, from 1 (easy) to 10 (practically impossible to make). Generated molecules should score below 4 for practical utility.

Protein designability: for protein backbone generation, measures whether a sequence can be designed to fold into the generated backbone (using ESMFold or AlphaFold2 to check if the predicted structure matches the generated backbone).
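The validity, uniqueness, and novelty definitions above chain together: each is computed on the survivors of the previous filter. The sketch below assumes parsing and canonicalization happened upstream (for example with RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles`), so each generated molecule arrives as a canonical SMILES string or `None` if it failed to parse. The function name `molecule_set_metrics` is hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def molecule_set_metrics(
    canonical_smiles: List[Optional[str]],
    training_smiles: Iterable[str],
) -> Dict[str, float]:
    """
    canonical_smiles: one entry per generated molecule; None marks an invalid one.
    training_smiles: canonical SMILES of the training set.
    """
    n_total = len(canonical_smiles)
    valid = [s for s in canonical_smiles if s is not None]
    unique = set(valid)                      # distinct structures among valid ones
    novel = unique - set(training_smiles)    # distinct structures unseen in training
    return {
        "validity": len(valid) / n_total if n_total else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy example: 5 generated molecules, one invalid, one duplicate, one seen in training.
generated = ["CCO", "CCO", None, "c1ccccc1", "CC(=O)O"]
training = {"CCO"}

metrics = molecule_set_metrics(generated, training)
print(metrics)
```

Here validity is 4/5, uniqueness is 3/4, and novelty is 2/3. Reporting the three together matters because each denominator shrinks: a model can show high novelty on a tiny set of valid, unique molecules.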

11. YouTube Resources

| Video | Channel | What You Learn |
| --- | --- | --- |
| FID Explained - Fréchet Inception Distance | Yannic Kilcher | Full FID derivation, Gaussian approximation, sample bias |
| Evaluating Generative Models | Outlier | IS, FID, Precision/Recall - visual walkthrough, intuition |
| Human Evaluation of Text-to-Image Models | AI Coffee Break | ELO ratings, DrawBench, annotation study design |
| CLIP Score and Text-Image Alignment | Andrej Karpathy | CLIP features, alignment metrics, limitations, CLIP-S formula |
| Generative Model Evaluation in Practice | ML Street Talk | Research perspective on evaluation gaps, metric gaming |

12. Common Pitfalls

:::danger Never compare FID values computed with different Inception checkpoints

Different libraries wrap different InceptionV3 weights and preprocessing pipelines. clean-fid resizes with an antialiased PIL bicubic filter; pytorch-fid uses PyTorch bilinear interpolation without antialiasing; TensorFlow-based implementations use yet another variant. FID values from different implementations can differ by 0.5-2.0 for the same model - larger than many claimed improvements in papers. Always specify the exact library and version used. When rerunning competitor baselines, use the same code used for your own model evaluation.

:::

:::warning FID computed on fewer than 10,000 samples is misleading

FID has a positive (upward) bias of roughly C/N. A model evaluated on 2,000 samples will report higher (worse) FID than the same model evaluated on 50,000 samples. The bias can be several FID points for small sample sizes, completely obscuring real differences between models. If you need to run quick development checks, label these clearly as "rough sanity checks" and never compare them to published benchmark numbers. The standard is 50,000 samples for CIFAR-10 and ImageNet-class benchmarks.

:::

:::danger Low FID does not imply good text-image alignment

FID measures whether the generated image distribution matches the real image distribution, with no reference to the text prompt. A model trained to generate photorealistic random images with no text conditioning would have excellent FID while being completely useless for text-to-image. For any text-to-image model, always report CLIP-T alongside FID. A model with FID=2.0 and CLIP-T=0.20 (raw cosine similarity) is worse for the text-to-image task than a model with FID=4.0 and CLIP-T=0.32.

:::

:::warning High Precision without Recall means the model has mode collapse

A model that always generates the same few "safe" images (very high quality, maximally generic cat photos when asked for cats) will achieve excellent Precision - every generated image is realistic. But Recall will be near zero - it covers almost none of the diversity of real cat photos. FID might even be decent because the mean and covariance from those few modes may happen to approximately match the Gaussian fit of the real distribution. Always check Recall separately. If Recall is below 0.5 while Precision is above 0.9, your model has severe mode collapse that FID alone is not catching.

:::

13. Interview Q&A

Q1: Explain FID from first principles. What does it actually measure, and what are its main failure modes?

FID measures the Fréchet (Wasserstein-2) distance between two multivariate Gaussians fitted to the Inception v3 feature distributions of real and generated images. It has two terms: the mean difference squared $\|\mu_r - \mu_g\|^2$ (does the average generated image occupy the same region of feature space as the average real image?) and the trace term (does the spread and correlation structure of generated image features match the real distribution?).

Main failure modes: (1) Gaussian approximation: real image feature distributions are not multivariate Gaussian - they are multi-modal, heavy-tailed, and have complex correlation structure. FID works well on average but misleads for distributions with unusual structure. (2) Inception-anchored features: FID is most meaningful for distributions similar to ImageNet. On medical, satellite, or artistic images, Inception features capture little of what matters semantically, so FID becomes far less informative. (3) Sample size bias: FID is biased upward (worse) for small samples - comparing models at different sample sizes is invalid. (4) No text alignment measurement: for text-to-image models, FID ignores whether generated images match their prompts - a critical quality axis. (5) Gaming vulnerability: a model that memorizes the training set achieves near-zero FID but is useless for generation.

Q2: Why is Inception Score insufficient, and when would a high IS indicate a bad model?

IS measures quality (Inception is confident about each image's class) and diversity (images span many classes). It fails when: (1) the model generates sharp, confident hallucinations that Inception classifies confidently but that look nothing like real data - IS is high, quality is actually low; (2) the model is evaluated on non-ImageNet domains where Inception's class predictions are meaningless; (3) there is within-class mode collapse - generating 1000 identical golden retrievers in the same pose receives the same IS as generating 1000 diverse golden retrievers, because both produce confident "golden retriever" predictions.

High IS indicating a bad model: if a model is adversarially trained to produce images that maximize Inception's classification confidence and class diversity without any constraint on realism, it can achieve extremely high IS while generating images that look like maximally class-discriminative activations rather than natural images.
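The within-class blind spot follows directly from the formula: IS is a function of the classifier's output probabilities only. Below is a minimal sketch of the standard definition, $IS = \exp(\mathbb{E}_x[KL(p(y|x) \,\|\, p(y))])$, applied to hand-made probability matrices rather than real Inception outputs (the toy inputs are assumptions for illustration).

```python
import numpy as np


def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """
    probs: (N, C) matrix whose rows are class posteriors p(y|x) summing to 1.
    Returns exp(mean over x of KL(p(y|x) || p(y))).
    """
    marginal = probs.mean(axis=0, keepdims=True)   # p(y), shape (1, C)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))


n_classes = 10

# Confident predictions spread evenly over all classes: IS approaches n_classes.
spread = np.eye(n_classes)[np.arange(1000) % n_classes]
# Confident predictions that all land on one class: IS collapses to 1.
single = np.eye(n_classes)[np.zeros(1000, dtype=int)]

is_spread = inception_score(spread)
is_single = inception_score(single)
print(f"spread over classes: {is_spread:.3f}")
print(f"single class:        {is_single:.3f}")
```

Nothing in the function sees the images themselves, so 100 identical golden retrievers and 100 varied ones produce the same rows of `probs` and therefore the same score.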

Q3: What is Precision in the Kynkäänniemi framework, and how does it differ from FID?

Precision measures the fraction of generated samples that fall inside the real data manifold, approximated using k-nearest-neighbor balls in feature space. A generated sample is "inside the real manifold" if it is within the k-NN radius of at least one real data point. High Precision means most generated images land in regions of feature space occupied by real data, i.e. they look realistic under the chosen feature extractor.

The key difference from FID: Precision measures only sample quality, with no consideration of whether the generated distribution covers all modes of the real distribution. FID combines quality and diversity in a single number (via the mean and covariance terms) but cannot separate them. A model with Precision=0.95 but Recall=0.2 (misses 80% of real data modes) would have a mediocre FID that doesn't reveal the severity of mode collapse. Precision/Recall explicitly decomposes the quality-diversity tradeoff, making the evaluation interpretable and actionable.

Q4: How do you design a rigorous human evaluation study for a text-to-image model?

Key design decisions: (1) Task decomposition: separate "does the image match the prompt?" from "is the image photorealistic/high quality?" - a single rating mixes orthogonal signals. (2) Forced binary choice: pairwise "which image better matches the prompt?" is more reliable than 1-7 Likert scales, which have high annotator variance. (3) Position counterbalancing: show each model on the left and right equally often - consistent position bias toward the first image is real and substantial (5-10%). (4) Systematic prompt sampling: stratified sampling from DrawBench or PartiPrompts categories, not cherry-picked. (5) Inter-annotator agreement: compute Krippendorff's alpha - report it, don't hide it. If alpha is below 0.4, the annotation task is too subjective. (6) Sample size: for a 95% CI of ±3% on win rate, you need roughly 1,070 pairwise comparisons per model pair (1.96² × 0.25 / 0.03²), with 3+ independent ratings each - use a power analysis. (7) Annotator qualification: screening tasks with known-answer pairs to filter careless annotators.

For flagship model comparisons, use an ELO-based leaderboard (Chatbot Arena style): each pairwise comparison updates ratings, and 5,000-20,000 comparisons yield stable rankings with confidence intervals.
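The sample-size figure for pairwise studies comes from the normal approximation to a binomial proportion. A minimal sketch, assuming independent comparisons and the worst-case win rate of 0.5 (the function name `comparisons_for_margin` is hypothetical):

```python
import math


def comparisons_for_margin(margin: float, z: float = 1.96, p: float = 0.5) -> int:
    """
    Number of independent pairwise comparisons needed so that the confidence
    interval on the win rate has half-width `margin`.
    z = 1.96 corresponds to 95% confidence; p = 0.5 is the worst case.
    """
    return math.ceil(z ** 2 * p * (1.0 - p) / margin ** 2)


n_3pct = comparisons_for_margin(0.03)
n_5pct = comparisons_for_margin(0.05)
print(n_3pct, n_5pct)  # 1068 385
```

Multiple ratings of the same image pair are correlated, so they do not count as separate independent comparisons here. They reduce annotator noise per pair instead of shrinking this interval.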

Q5: Why does guidance scale affect Precision-Recall, and what does this mean for evaluation?

Classifier-Free Guidance scale controls how aggressively the denoising model steers samples toward high-probability regions of the conditional distribution. At high CFG: the model generates images in the most "prototypical" or "representative" regions - sharp, confident, maximally class-representative images. These score high on Precision (clearly within the real manifold) but low on Recall (cluster in a few modes, missing the full diversity). At low CFG: the model is closer to unconditional generation, producing diverse samples across the full distribution. Recall increases, Precision decreases.

For evaluation, this means: always compare models at the same CFG scale, or report the full Precision-Recall curve across CFG values. A model with higher FID at a given CFG might dominate on Recall while the competitor model dominates on Precision - they serve different use cases. A product requiring diverse creative outputs needs high Recall; a product requiring consistent, safe outputs prefers high Precision. Reporting a single FID number obscures which model is better for which use case.

Q6: What sample count is needed for statistically reliable FID, and why?

The FID bias approximately follows $\mathbb{E}[FID_N] \approx FID_\infty + C/N$. For detecting a difference of 0.5 FID points (a typical claimed improvement in papers), 50,000 samples per set are required. For detecting 0.1 FID point differences (fine-grained comparison at the frontier), 100,000+ samples are needed. Below 10,000 samples, bias can be several FID points, making comparisons between models meaningless.

In practice: use 50,000 samples for benchmark-comparable results. Provide bootstrap confidence intervals to show statistical significance - two models with FID=2.0±0.3 and FID=2.2±0.3 cannot be distinguished. Reviewers increasingly require confidence intervals for FID comparisons in ML papers. Bootstrap CIs can be computed by resampling cached Inception features, as bootstrap_fid_confidence_interval does in the pipeline code above.

14. Building a Production Evaluation Dashboard

For a text-to-image service, the evaluation infrastructure should track:

| Metric | Update Frequency | Sample Size | Alert Threshold |
| --- | --- | --- | --- |
| FID (vs held-out real set) | Every major release | 50,000 | Greater than 10% regression |
| CLIP-T score | Continuous (1% sample) | 1,000/day | Greater than 3% regression |
| Precision | Every major release | 50,000 | Drop greater than 0.05 absolute |
| Recall | Every major release | 50,000 | Drop greater than 0.05 absolute |
| Human ELO (vs previous version) | Monthly | 2,000 comparisons | Less than -50 ELO points |
| P99 inference latency | Continuous | All requests | Greater than 10% increase |
| Prompt-following failure rate | Daily (human QA sample) | 200/day | Greater than 5% failure rate |

The FID regression alert catches model quality degradation (e.g., after quantization, fine-tuning corruption, or infrastructure changes). CLIP-T regression catches prompt-following degradation - a separate failure mode that FID completely misses. Human ELO catches cases where automatic metrics improved but human preference declined - this happens regularly when models are optimized directly for FID, which can reduce diversity in ways that humans notice even when automatic metrics do not.

Recall deserves special attention in production monitoring. Mode collapse is easy to miss with FID alone but catastrophic for user experience: a model with collapsed Recall generates the same types of images for all prompts, making it feel repetitive and useless for creative tasks.
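The alert thresholds for the four automatic metrics can be encoded as a small release gate. The sketch below assumes metric dictionaries keyed like the output of run_comprehensive_evaluation ("FID", "CLIP-T", "Precision", "Recall"). The function name `check_regressions` and the example numbers are hypothetical.

```python
from typing import Dict


def check_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
) -> Dict[str, str]:
    """Return {metric: message} for every metric that crosses its alert threshold."""
    alerts: Dict[str, str] = {}

    # FID: lower is better, alert on a relative increase above 10%.
    if current["FID"] > baseline["FID"] * 1.10:
        alerts["FID"] = f"FID rose from {baseline['FID']:.2f} to {current['FID']:.2f}"

    # CLIP-T: higher is better, alert on a relative drop above 3%.
    if current["CLIP-T"] < baseline["CLIP-T"] * 0.97:
        alerts["CLIP-T"] = (
            f"CLIP-T fell from {baseline['CLIP-T']:.2f} to {current['CLIP-T']:.2f}"
        )

    # Precision and Recall: alert on an absolute drop above 0.05.
    for name in ("Precision", "Recall"):
        if baseline[name] - current[name] > 0.05:
            alerts[name] = f"{name} fell from {baseline[name]:.2f} to {current[name]:.2f}"

    return alerts


# Illustrative numbers only: FID and Recall regress, the other two stay in range.
baseline = {"FID": 8.0, "CLIP-T": 75.0, "Precision": 0.80, "Recall": 0.60}
current = {"FID": 9.2, "CLIP-T": 74.0, "Precision": 0.79, "Recall": 0.50}

alerts = check_regressions(baseline, current)
for metric, message in alerts.items():
    print(f"ALERT [{metric}]: {message}")
```

Keeping Recall as its own gate means a release that improves FID by collapsing diversity still gets flagged.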

This lesson concludes the Diffusion Models module. Explore the Evaluation module for cross-domain ML evaluation techniques applied across classification, detection, and generation.

