Classifier-Free Guidance - Steering Diffusion with Text

:::note Reading time: ~55 minutes | Interview relevance: Very High | Target roles: ML Engineer, Research Engineer, AI Engineer, Applied Scientist :::

The Prompt That Changed Everything

It is late 2022. You are building a text-to-image product. The model generates beautiful images, but when you type "a red dragon breathing fire over a medieval castle at sunset, cinematic lighting, dramatic composition," it produces something that vaguely resembles a blurry orange blob. The model does not care about your prompt enough.

You increase the conditioning weight in training. The image sharpens around the description. The dragon appears. The castle appears. The sunset lights the composition. But something goes wrong - the image looks overexposed, almost burned out, with strange high-frequency artifacts at edges. Colors are saturated beyond what exists in photographs. You have pushed the model too hard toward the text and it has started to produce images that look distinctly artificial.

This is the central tension in every text-to-image system: prompt fidelity versus image naturalness. A model that perfectly follows every word produces images that look technically wrong - too vivid, too sharp, too committed to every token in the prompt. A model that ignores the prompt produces diverse but useless outputs. Somewhere between these extremes is the region where images feel both photorealistic and precisely described.

The technique that solved this problem - Classifier-Free Guidance (CFG), introduced by Jonathan Ho and Tim Salimans at Google Brain in late 2021 - is arguably the single most impactful technique in modern generative AI. It is why every major text-to-image system (Stable Diffusion, DALL-E 2, Midjourney, Imagen) produces outputs that feel simultaneously natural and prompt-faithful. Understanding CFG in depth - including where it comes from, why it works, and where it fails - is foundational knowledge for any engineer working with diffusion models.

Why This Exists - The Conditioning Problem

Before CFG, two approaches existed for steering a diffusion model toward a text prompt or class label.

Approach 1 - Conditional training only. Train the U-Net with text embeddings via cross-attention and rely on the conditioning alone. This works, but weakly. The model has learned to generate good images unconditionally; it does not heavily rely on the text signal. The text conditioning acts more like a soft preference than a hard constraint. In practice, conditional training alone gives moderate CLIP scores but does not dramatically improve FID over unconditional generation.

Approach 2 - Classifier guidance (Dhariwal and Nichol, 2021). During sampling, augment the score function with the gradient of a noise-aware classifier:

$\tilde{s}_\theta(x_t, t, c) = s_\theta(x_t, t) + w \cdot \nabla_{x_t} \log p_\phi(c \mid x_t)$

Here $s_\theta(x_t, t) = -\varepsilon_\theta(x_t, t)/\sqrt{1-\bar\alpha_t}$ is the score function, $p_\phi(c \mid x_t)$ is a classifier that has been trained on images at all noise levels $t$ , and $w$ is the guidance scale. The classifier gradient pushes sampling toward regions where the class $c$ is likely.
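A minimal sketch of the guided score on a one-dimensional toy problem may help here. It assumes a standard normal prior and a hypothetical logistic classifier $p(c \mid x) = \sigma(a x)$; neither comes from the paper, and the noise level $t$ is ignored. The point is only to show how the classifier gradient shifts the score toward the region the classifier prefers:

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def unconditional_score(x: float) -> float:
    """Score of a standard normal prior: d/dx log N(x; 0, 1) = -x."""
    return -x


def classifier_grad(x: float, a: float = 1.0) -> float:
    """d/dx log p(c | x) for the toy classifier p(c | x) = sigmoid(a * x)."""
    return a * (1.0 - sigmoid(a * x))


def guided_score(x: float, w: float, a: float = 1.0) -> float:
    """Classifier-guided score: s(x) + w * grad_x log p(c | x)."""
    return unconditional_score(x) + w * classifier_grad(x, a)


if __name__ == "__main__":
    for w in (0.0, 1.0, 4.0):
        print(f"w={w}: guided score at x=0 is {guided_score(0.0, w):+.3f}")
```

With $w = 0$ the score at $x = 0$ is zero (the prior's mode); any positive $w$ makes it positive, so sampling drifts toward larger $x$, where the toy classifier is more confident in $c$.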

Dhariwal and Nichol demonstrated spectacular results: with ADM-G (ADM + classifier guidance), a diffusion model surpassed BigGAN on FID for the first time. The guidance made class conditioning so strong that images looked genuinely class-conditional and photorealistic simultaneously.

But classifier guidance has fundamental problems that make it impractical for open-vocabulary text conditioning:

Requires a noise-aware classifier: a standard ImageNet classifier trained on clean images produces nearly random gradients at high noise levels (where $x_t$ is mostly noise). The classifier must be trained on images at every noise level $t = 1, \ldots, T$, which adds a second training run on top of the diffusion model.
Couples to the specific noise schedule: the classifier is trained on a particular $\{\beta_t\}$ schedule. Change the diffusion model's schedule and the classifier must be retrained.
Cannot generalize to open-vocabulary text: you can train a classifier for 1000 ImageNet classes, but you cannot train one for all possible natural language prompts. The space of text descriptions is infinite.
Extra classifier pass at inference: every sampling step needs the diffusion model forward pass plus a classifier forward pass and a backward pass through the classifier to compute $\nabla_{x_t}$.

Something fundamentally better was needed.

Historical Context - The Implicit Classifier Insight

Jonathan Ho (first author of DDPM) and Tim Salimans published "Classifier-Free Diffusion Guidance" as a NeurIPS 2021 workshop paper. The key insight came from Bayes' theorem.

The classifier gradient that makes classifier guidance work is:

$\nabla_{x_t} \log p_\phi(c \mid x_t) = \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t)$

by Bayes' theorem: $\log p(c \mid x_t) = \log p(x_t \mid c) + \log p(c) - \log p(x_t)$, and $\log p(c)$ does not depend on $x_t$, so its gradient vanishes.

The right-hand side is the difference between the conditional score $\nabla_{x_t} \log p(x_t|c)$ and the unconditional score $\nabla_{x_t} \log p(x_t)$ . These are both score functions - exactly what diffusion models estimate. A single diffusion model trained to handle both conditional and unconditional inputs estimates both scores simultaneously.
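The identity can be checked numerically. The sketch below assumes a hypothetical two-class problem in one dimension (class-conditional Gaussians at $-2$ and $+2$ with equal priors; these numbers are illustrative only) and compares the classifier gradient with the difference of the two scores using finite differences:

```python
import math

MEANS = {0: -2.0, 1: 2.0}   # hypothetical class-conditional means
PRIOR = {0: 0.5, 1: 0.5}    # hypothetical class prior


def normal_pdf(x: float, mean: float, std: float = 1.0) -> float:
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def p_x_given_c(x: float, c: int) -> float:
    return normal_pdf(x, MEANS[c])


def p_x(x: float) -> float:
    return sum(PRIOR[c] * p_x_given_c(x, c) for c in MEANS)


def p_c_given_x(x: float, c: int) -> float:
    return PRIOR[c] * p_x_given_c(x, c) / p_x(x)


def grad(f, x: float, h: float = 1e-5) -> float:
    """Central finite-difference derivative."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def classifier_gradient(x: float, c: int) -> float:
    return grad(lambda u: math.log(p_c_given_x(u, c)), x)


def score_difference(x: float, c: int) -> float:
    cond = grad(lambda u: math.log(p_x_given_c(u, c)), x)
    uncond = grad(lambda u: math.log(p_x(u)), x)
    return cond - uncond


if __name__ == "__main__":
    for x in (-1.5, 0.0, 0.7):
        print(x, classifier_gradient(x, 1), score_difference(x, 1))
```

The two columns agree up to floating-point error at every point, which is the whole content of the implicit classifier insight: no classifier has to be evaluated to obtain its gradient.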

Therefore: if you train one U-Net that can predict noise both with and without conditioning, the classifier gradient is free - it is just the difference between two predictions from the same model. No separate classifier. No noise-aware training data. No backprop at inference time.

The implementation requires one change to training: with probability $p_\text{uncond}$ (typically 10-20%), replace the text conditioning with a null embedding. The model learns to predict noise both with and without conditioning. At inference, run the model twice and combine the results.

This insight changed the requirement for guided text-to-image generation from "needs a classifier, impractical for open vocab" to "one extra forward pass, works for any text." The impact was immediate.

1. The Core CFG Derivation

From Score Functions to Noise Predictions

The score function $s_\theta(x_t, t) = \nabla_{x_t} \log p(x_t)$ relates to the noise prediction by:

$s_\theta(x_t, t) = -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}}$
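The relation follows from the forward process. A short sketch for a fixed clean sample $x_0$, assuming the standard DDPM parameterization:

```latex
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, I)

\nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{1-\bar\alpha_t}
  = -\frac{\varepsilon}{\sqrt{1-\bar\alpha_t}}
```

Averaging over $x_0$ given $x_t$ turns the left side into the marginal score $\nabla_{x_t} \log p(x_t)$ and the right side into the expected noise, which is the quantity $\varepsilon_\theta$ is trained to predict.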

Using the Bayes' theorem decomposition:

$\nabla_{x_t} \log p(c \mid x_t) = \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t)$

The guided score becomes:

$\tilde{s}(x_t, t, c) = s_\theta(x_t, t) + w \cdot \left(s_\theta^{cond}(x_t, t, c) - s_\theta(x_t, t)\right)$

Converting back from scores to noise predictions (multiplying by $-\sqrt{1-\bar\alpha_t}$ ):

$\boxed{\tilde{\varepsilon}_\theta(x_t, c) = \varepsilon_\theta(x_t, \emptyset) + w \cdot \left(\varepsilon_\theta(x_t, c) - \varepsilon_\theta(x_t, \emptyset)\right)}$

This is the CFG formula. The terms:

$\varepsilon_\theta(x_t, c)$ : conditional prediction - noise estimated given text prompt $c$
$\varepsilon_\theta(x_t, \emptyset)$ : unconditional prediction - noise estimated with null/empty embedding $\emptyset$
$w$ : guidance scale - amplification factor for the conditional direction
$\tilde{\varepsilon}_\theta$ : guided prediction used for the denoising step
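As a quick sanity check of the boxed formula, here is a dependency-free sketch that applies it element-wise to two small made-up noise predictions (plain Python lists stand in for tensors; the numbers are illustrative):

```python
from typing import List


def cfg_combine(eps_cond: List[float], eps_uncond: List[float], w: float) -> List[float]:
    """Boxed CFG formula, element-wise: uncond + w * (cond - uncond)."""
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


eps_cond = [0.5, -0.2]     # hypothetical conditional prediction
eps_uncond = [0.3, -0.1]   # hypothetical unconditional prediction

if __name__ == "__main__":
    for w in (0.0, 1.0, 7.5):
        print(f"w={w}: {cfg_combine(eps_cond, eps_uncond, w)}")
```

Under this formula $w = 0$ returns the unconditional prediction, $w = 1$ returns the conditional prediction, and larger values move past the conditional prediction along the same direction.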

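Rearranging:

$\tilde{\varepsilon}_\theta(x_t, c) = w \cdot \varepsilon_\theta(x_t, c) + (1 - w) \cdot \varepsilon_\theta(x_t, \emptyset)$

This form makes the role of $w$ explicit. When $w = 0$ : pure unconditional (the prompt is ignored). When $w = 1$ : pure conditional (no amplification). When $w = 7.5$ : strongly amplified conditional direction. When $w < 0$ : move away from the prompt - occasionally useful for adversarial analysis or inverted conditioning. Note that Ho and Salimans write the formula as $(1 + w) \cdot \varepsilon_\theta(x_t, c) - w \cdot \varepsilon_\theta(x_t, \emptyset)$ , so their $w$ equals this lesson's $w - 1$ ; the convention used here matches the guidance_scale parameter of Stable Diffusion pipelines, where 1 means no guidance.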

The Extrapolation Perspective

For $w > 1$ , CFG is an extrapolation beyond the conditional prediction. The guided prediction $\tilde\varepsilon$ lies outside the segment between $\varepsilon_\theta(x_t, c)$ and $\varepsilon_\theta(x_t, \emptyset)$ - it is amplified beyond what the model would naturally predict. This extrapolation pushes toward high-probability regions of the conditional distribution $p(x|c)$ much more aggressively than the model alone, at the cost of leaving the natural image distribution.

For $w = 1$ : pure conditional, no extrapolation. For $w = 7.5$ : $\tilde\varepsilon = 7.5 \cdot \varepsilon_\theta^{cond} - 6.5 \cdot \varepsilon_\theta^{uncond}$ - the conditional prediction is weighted 7.5x and 6.5x the unconditional prediction is subtracted. For $w > 15$ : extreme extrapolation - the guided prediction is far from the image manifold, causing artifacts.

2. The Guidance Scale Trade-Off - FID vs CLIP Score

The guidance scale $w$ controls a fundamental trade-off:

FID Score - Image Quality and Diversity

FID (Fréchet Inception Distance) measures how close the generated image distribution is to the real image distribution. It captures both quality (do individual images look real?) and diversity (does the set of generated images cover the full range of real images?).

At low $w$ : generated images are diverse but have weak prompt adherence. FID is low (good) because the distribution closely matches real images.

At high $w$ : the model aggressively pursues images that match the text, ignoring diversity. Some images are highly detailed and sharp, but the distribution is narrower than the real distribution. FID increases (worsens).

CLIP Score - Text-Image Alignment

CLIP score measures the cosine similarity between the CLIP image embedding of generated images and the CLIP text embedding of the prompt. Higher is better - it measures how well the image matches the text.
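The metric itself is a plain cosine similarity. A minimal sketch, assuming the image and text embeddings have already been produced by a CLIP model (the short vectors below are hypothetical stand-ins, not real CLIP outputs):

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    image_embedding = [0.2, 0.9, -0.1]   # hypothetical image embedding
    text_embedding = [0.1, 0.8, 0.3]     # hypothetical text embedding
    print(f"CLIP-style score: {cosine_similarity(image_embedding, text_embedding):.3f}")
```

Because only the angle matters, rescaling either embedding leaves the score unchanged.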

At low $w$ : low CLIP score - the model ignores the prompt. At high $w$ : high CLIP score - the model is maximally prompt-faithful.

The Pareto Frontier

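Empirically, the trend reported by Ho and Salimans (2021) and subsequent papers looks like this (treat the numbers as indicative; exact values depend on the model, dataset, and sampler):

| Guidance scale | FID (CIFAR-10 class-cond) | CLIP score | Perceptual quality |
|---|---|---|---|
| $w = 0$ | 3.21 | 0.21 | Unconditional: natural, diverse, prompt ignored |
| $w = 1.0$ | 3.5 | 0.24 | Pure conditional sampling, no amplification |
| $w = 3.0$ | 4.1 | 0.27 | Clear subject, somewhat stylized |
| $w = 7.5$ | 5.9 | 0.31 | Default SD - strong conditioning |
| $w = 12.0$ | 7.8 | 0.33 | Very prompt-faithful, somewhat artificial |
| $w = 20.0$ | 14.2 | 0.34 | Artifacts, oversaturation, uncanny valley |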

The sweet spot depends on the use case. For photorealistic portrait generation: $w = 6-8$ . For stylized illustration: $w = 9-12$ . For exact prompt following with known artifact tolerance: $w = 12-15$ .

Why High Guidance Causes Artifacts

At high $w$ , the extrapolation $\tilde\varepsilon = \varepsilon^{uncond} + w \cdot (\varepsilon^{cond} - \varepsilon^{uncond})$ produces prediction values well outside the range typical of unit Gaussian noise ( $\mathcal{N}(0,1)$ ). The predicted $x_0$ derived from this guided prediction can then have values outside $[-1, 1]$ . When these are clipped to $[-1,1]$ (the range of the training images), color information is lost and saturation artifacts appear. Dynamic thresholding (Section 6) addresses this.
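A scalar worked example makes the range problem concrete. The sketch below uses the boxed CFG formula and the usual $x_0$ reconstruction for a single pixel; all input values are made up for illustration:

```python
import math


def guided_eps(eps_cond: float, eps_uncond: float, w: float) -> float:
    """Boxed CFG formula for a single value."""
    return eps_uncond + w * (eps_cond - eps_uncond)


def predict_x0(x_t: float, eps: float, alpha_bar: float) -> float:
    """x0_hat = (x_t - sqrt(1 - alpha_bar) * eps) / sqrt(alpha_bar)."""
    return (x_t - math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(alpha_bar)


def static_clip(x: float) -> float:
    return min(max(x, -1.0), 1.0)


# Hypothetical single-pixel values at a mid-trajectory timestep.
X_T, ALPHA_BAR = 0.2, 0.5
EPS_COND, EPS_UNCOND = -0.4, -0.3

if __name__ == "__main__":
    for w in (1.0, 7.5, 15.0):
        x0 = predict_x0(X_T, guided_eps(EPS_COND, EPS_UNCOND, w), ALPHA_BAR)
        print(f"w={w:>4}: x0_hat={x0:.3f}  clipped={static_clip(x0):.3f}")
```

At $w = 1$ the predicted pixel stays inside $[-1, 1]$; at $w = 7.5$ and $w = 15$ it lands at different values above 1, and static clipping maps both to exactly 1, erasing the difference between them.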

3. Training for CFG - Unconditional Dropout

The Minimal Training Change

CFG requires exactly one change to standard conditional diffusion training: unconditional dropout. With probability $p_\text{uncond}$ (typically 0.10 to 0.20), replace the text conditioning $c$ with the null embedding $\emptyset$ :

import torch


def apply_cfg_dropout(
    text_embeddings: torch.Tensor,
    null_embedding: torch.Tensor,
    dropout_prob: float = 0.1,
) -> torch.Tensor:
    """
    Apply unconditional dropout for CFG training.

    Each sample's conditioning is independently replaced with the null
    embedding with probability dropout_prob (10% by default).
    This is the ONLY change needed to make a model CFG-capable.

    text_embeddings: (B, 77, 768)
    null_embedding:  (77, 768) or (1, 77, 768), broadcast over the batch
    """
    batch_size = text_embeddings.shape[0]
    # For each sample in the batch, independently decide whether to keep conditioning.
    # Sampling directly on the embeddings' device avoids a host-to-device copy.
    keep = torch.rand(batch_size, device=text_embeddings.device) >= dropout_prob  # True = keep
    keep = keep.view(-1, 1, 1)  # shape (B, 1, 1) for broadcasting over (B, 77, 768)

    # Mix: keep conditioning or replace with null
    return torch.where(keep, text_embeddings, null_embedding.to(text_embeddings.dtype))

No other changes: same loss function, same optimizer, same architecture. The model simply learns to also predict good noise without conditioning. At inference, running it with the null embedding gives the unconditional baseline needed for CFG.
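The same dropout can also be applied one step earlier, on the raw captions, before they reach the text encoder. The sketch below shows that variant in plain Python; it assumes the empty string is used as the null prompt, and `drop_captions` is a hypothetical helper name, not part of any library:

```python
import random
from typing import List, Optional


def drop_captions(
    captions: List[str],
    dropout_prob: float = 0.1,
    rng: Optional[random.Random] = None,
    null_caption: str = "",
) -> List[str]:
    """Independently replace each caption with the null caption with probability dropout_prob."""
    rng = rng or random.Random()
    return [null_caption if rng.random() < dropout_prob else cap for cap in captions]


if __name__ == "__main__":
    batch = ["a red dragon", "a medieval castle", "a sunset", "a cat"] * 2500
    dropped = drop_captions(batch, dropout_prob=0.1, rng=random.Random(0))
    fraction = sum(1 for cap in dropped if cap == "") / len(dropped)
    print(f"fraction of unconditional samples: {fraction:.3f}")
```

Dropping at the caption level means the null embedding is always whatever the text encoder produces for the empty string, so no separate null tensor has to be stored.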

The Null Embedding

The unconditional prediction $\varepsilon_\theta(x_t, \emptyset)$ requires a null conditioning signal. In Stable Diffusion, this is the CLIP text embedding of an empty string - a start token followed only by end-of-text/padding tokens, producing a specific fixed embedding the U-Net has learned to interpret as "no prompt."

The null embedding matters more than it seems. Research has shown:

A learned null embedding (trainable vector optimized to produce good unconditional outputs) can outperform the empty-string CLIP embedding for image quality.
The null embedding acts as a "style baseline" - images generated with CFG diverge from it toward the prompt. The visual style of null-guided generation (typically blurry, generic) affects what the "prompt direction" looks like.
Null-Text Inversion (Mokady et al. 2022) showed that optimizing the null embedding per image (and per timestep) lets DDIM inversion reconstruct a real image accurately under CFG, which enables precise image editing - the edited image stays close to the source by design.

Choosing the Dropout Probability

$p_\text{uncond} = 0.05$ (5%): the model rarely sees null conditioning during training, so unconditional generation quality is low. CFG at inference uses a poor unconditional baseline. Avoid.
$p_\text{uncond} = 0.10$ (10%): the SD 1.5 setting. Good balance - 90% of steps improve conditional understanding, 10% establish the unconditional baseline.
$p_\text{uncond} = 0.20$ (20%): stronger unconditional quality. Slightly weaker conditional quality. Use when image editing or DDIM inversion are primary use cases.
$p_\text{uncond} = 0.50$ (50%): equal time on conditional and unconditional. Training is slower for each capability. Rarely optimal.

4. Original Classifier Guidance - Dhariwal and Nichol 2021

Understanding classifier guidance is important for interview questions that ask you to compare CFG to its predecessor.

The Dhariwal-Nichol ADM+G Setup

Their model had three components:

ADM (Ablated Diffusion Model) - a U-Net diffusion model, class-conditional via adaptive group normalization
A separate noise-aware classifier $p_\phi(y|x_t)$ trained on images at all noise levels
Guidance at inference: $\tilde{s} = s_\theta(x_t, y) + w \cdot \nabla_{x_t} \log p_\phi(y|x_t)$

The noise-aware classifier reused the downsampling half of the diffusion U-Net architecture: it took noisy images at all $t$ and predicted class probabilities. This was not a standard classifier - it required training on noisy images with the specific noise schedule used during diffusion training.

Results and Problems

The results were impressive: ADM-G achieved FID 4.59 on ImageNet 256x256 (beating BigGAN's 6.95 at the time). Class-conditional samples looked genuinely class-specific and photorealistic.

Problems that made it impractical for production:

Cost: the noise-aware classifier is a second model that must be trained on noisy images across all timesteps, a substantial extra training run on top of the diffusion model itself.
Inflexibility: the classifier is tightly coupled to the noise schedule and image resolution. Change either and you need a new classifier.
Text scalability: you cannot train a classifier over all possible natural language prompts. Classifier guidance fundamentally cannot handle open-vocabulary conditioning.
Adversarial-like artifacts at high guidance: computing $\nabla_{x_t} \log p_\phi(y|x_t)$ via backprop through the classifier is like computing adversarial examples. At high $w$ , the gradient can push the image toward classifier-fooling patterns that look unnatural.

CFG eliminated all four problems: no separate model, decoupled from noise schedule, works for any text, no backprop.

5. Negative Prompting - Inverted Guidance

Mechanics

The standard CFG formula uses an empty string as the null embedding $\emptyset$ . You can replace this with any text description - including descriptions of things you do not want:

$\tilde{\varepsilon}_\theta(x_t, c^+, c^-) = \varepsilon_\theta(x_t, c^-) + w \cdot \left(\varepsilon_\theta(x_t, c^+) - \varepsilon_\theta(x_t, c^-)\right)$

Where $c^+$ is the positive prompt and $c^-$ is the negative prompt. The guidance simultaneously pushes toward $c^+$ and away from $c^-$ .

This is not a special feature - it is a direct consequence of the CFG formula being linear. Any text embedding can serve as the "unconditional baseline." When you use a negative prompt, you are just providing a more specific "direction to move away from."
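To see the "move away" effect in numbers, the sketch below applies the negative-prompt formula to two made-up two-dimensional predictions and measures how far the result sits from the positive prediction. The vectors are illustrative, with plain lists standing in for tensors:

```python
from typing import List


def negative_prompt_cfg(eps_pos: List[float], eps_neg: List[float], w: float) -> List[float]:
    """guided = neg + w * (pos - neg), element-wise."""
    return [n + w * (p - n) for p, n in zip(eps_pos, eps_neg)]


def offset_from_positive(eps_pos: List[float], eps_neg: List[float], w: float) -> List[float]:
    """How far the guided prediction lies beyond the positive prediction."""
    guided = negative_prompt_cfg(eps_pos, eps_neg, w)
    return [g - p for g, p in zip(guided, eps_pos)]


if __name__ == "__main__":
    eps_pos = [1.0, 0.0]   # hypothetical prediction for the positive prompt
    eps_neg = [0.0, 1.0]   # hypothetical prediction for the negative prompt
    print(negative_prompt_cfg(eps_pos, eps_neg, 3.0))    # [3.0, -2.0]
    print(offset_from_positive(eps_pos, eps_neg, 3.0))   # [2.0, -2.0]
```

The offset equals $(w - 1) \cdot (\varepsilon^{+} - \varepsilon^{-})$ : the guided prediction overshoots the positive prediction along the direction that points away from the negative one, and the overshoot grows linearly with $w$ .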

What Negative Prompting Does and Does Not Do

Effective for:

Concrete visual artifacts: "blurry," "deformed hands," "extra fingers," "watermark," "text overlay" - these have well-defined CLIP embeddings in visual space
Specific visual styles to avoid: "cartoon," "anime," "oil painting," "low-resolution"
Obvious quality issues: "low quality," "jpeg artifacts," "overexposed"

Ineffective for:

Abstract emotional concepts: "sad," "anxious" - CLIP embeddings for abstract emotions are noisy and context-dependent
Negative logical constructions: "not a dog" - CLIP does not have a clear embedding for negated concepts; the model may actually attend to the "dog" concept
Fine-grained compositional control: "put the cat on the left" - spatial control requires explicit layout conditioning (e.g., ControlNet) or attention manipulation, not negative prompting

Common Negative Prompt Patterns

import torch

# Domain-specific negative prompts

NEGATIVE_PROMPTS = {
    "photorealistic_portrait": (
        "blurry, out of focus, extra fingers, deformed hands, "
        "ugly face, bad anatomy, watermark, text overlay, logo, "
        "jpeg artifacts, low resolution, grainy"
    ),

    "landscape": (
        "cartoon, anime, illustration, painting, oversaturated, "
        "HDR, unrealistic colors, fog, haze, overexposed"
    ),

    "architecture": (
        "people, blurry, distorted perspective, unrealistic proportions, "
        "graffiti, modern elements in historical scenes"
    ),

    "product_photography": (
        "background clutter, shadows, reflections, dust, scratches, "
        "blurry, poor lighting, low quality"
    ),

    # Universal quality negative prompt (SD community standard)
    "universal": (
        "worst quality, low quality, normal quality, lowres, bad anatomy, "
        "bad hands, extra fingers, fewer fingers, missing fingers, "
        "text, watermark, artist name, signature, username"
    ),
}

# Dual-prompt CFG: positive pushes toward, negative pushes away
def dual_prompt_cfg(
    noise_cond: torch.Tensor,      # eps_theta(x_t, positive_prompt)
    noise_neg: torch.Tensor,       # eps_theta(x_t, negative_prompt)
    guidance_scale: float = 7.5
) -> torch.Tensor:
    """
    CFG with negative prompt replacing null/empty embedding.
    Equivalent to: guided = neg + w * (pos - neg)
    """
    return noise_neg + guidance_scale * (noise_cond - noise_neg)

6. Dynamic Thresholding - Fixing High-Guidance Artifacts

The Problem at High Guidance Scale

Google's Imagen paper (Saharia et al. 2022) identified and solved the main artifact problem at high guidance scales. When $w$ is large, the CFG formula amplifies $\varepsilon_\theta$ values beyond the range expected during training. Translating back to predicted $x_0$ :

$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t} \cdot \tilde\varepsilon}{\sqrt{\bar\alpha_t}}$

With large $w$ , $\tilde\varepsilon$ is large, so $\hat{x}_0$ takes values far outside $[-1, 1]$ . Static clipping (clamp(-1, 1)) flattens every out-of-range value onto the boundary: values of $-1.5$ and $+3.0$ become $-1$ and $+1$ , and any two values beyond the same boundary become identical, so the relative differences between them are lost. The result: color saturation artifacts and loss of texture detail.

The Dynamic Thresholding Solution

Imagen's dynamic thresholding clips adaptively:

Compute $\hat{x}_0$ from $\tilde\varepsilon$ (the guided noise prediction)
Compute $s = \max(1, \text{quantile}_{p}(|\hat{x}_0|))$ where $p = 0.995$ typically
If $s > 1$ : clip to $[-s, s]$ and rescale by $s$ : $\hat{x}_0 \leftarrow \text{clip}(\hat{x}_0, -s, s) / s$
Re-derive $\tilde\varepsilon$ from the thresholded $\hat{x}_0$

Mathematically:

$\hat{x}_0^{thresh} = \frac{\text{clip}(\hat{x}_0, -s, s)}{s}, \qquad s = \max\!\left(1, \text{percentile}_{99.5}(|\hat{x}_0|)\right)$

This preserves the direction of the prediction (where it points) while preventing the magnitude from exploding. The result: Imagen can use $w = 7.5-10$ with dynamic thresholding and produce images that are both strongly conditioned and free of color saturation artifacts.

Implementation note: the $s$ must be computed per-sample in the batch (not across the batch), since different images may have different $\hat{x}_0$ magnitudes.
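The steps above can be sketched for a single sample in plain Python. The `quantile` helper is a hypothetical linear-interpolation implementation written for this example, and the five "pixels" are made-up values:

```python
import math
from typing import List


def quantile(values: List[float], q: float) -> float:
    """Quantile with linear interpolation between the two nearest order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def dynamic_threshold(x0: List[float], p: float = 0.995) -> List[float]:
    """Clip one sample to [-s, s] with s = max(1, p-quantile of |x0|), then divide by s."""
    s = max(1.0, quantile([abs(v) for v in x0], p))
    return [min(max(v, -s), s) / s for v in x0]


def static_clip(x0: List[float]) -> List[float]:
    return [min(max(v, -1.0), 1.0) for v in x0]


if __name__ == "__main__":
    x0_hat = [-0.5, 0.0, 0.5, 1.5, 3.0]   # hypothetical predicted pixels, two out of range
    print("static :", static_clip(x0_hat))
    print("dynamic:", [round(v, 3) for v in dynamic_threshold(x0_hat)])
```

Static clipping maps both 1.5 and 3.0 to 1.0, while dynamic thresholding keeps them distinct and still returns values inside $[-1, 1]$ . A sample that is already in range is returned unchanged because $s = 1$ .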

7. CFG++ and Variants

CFG++ (Chung et al. 2024)

Standard CFG uses the guided noise prediction $\tilde\varepsilon$ everywhere in the sampler update. CFG++ (Chung et al. 2024) keeps the guided prediction for the denoising part of the step but uses the unconditional prediction for the renoising part, motivated by the observation that standard CFG biases the sampling trajectory away from the data manifold.

In DDIM form, the CFG++ update is:

$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t} \cdot \tilde{\varepsilon}_\theta(x_t, c)}{\sqrt{\bar\alpha_t}}, \qquad x_{t-1} = \sqrt{\bar\alpha_{t-1}} \cdot \hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}} \cdot \varepsilon_\theta(x_t, \emptyset)$

The only difference from a standard DDIM step with CFG is the last term: standard CFG renoises with $\tilde{\varepsilon}_\theta(x_t, c)$ , while CFG++ renoises with $\varepsilon_\theta(x_t, \emptyset)$ . The paper pairs this with a guidance weight in $[0, 1]$ , so the guided prediction interpolates between the unconditional and conditional predictions instead of extrapolating. The reported benefit: a more consistent sampling trajectory, less guidance-induced drift, and a slightly better FID/CLIP trade-off.

CFG++ coincides with standard CFG at $w = 0$ , where both reduce to unconditional sampling. It has been adopted in some production systems but has not fully displaced standard CFG.

Auto CFG (Automatic Guidance Scale Scheduling)

Several works have explored time-varying guidance scales. The observation: guidance at high noise levels ( $t$ near $T$ ) mainly shapes global composition and whether the main subject matches the prompt, while guidance at low noise levels ( $t$ near 0) mainly affects fine detail and texture fidelity. A fixed $w$ is suboptimal because the best guidance strength differs between these regimes.

Simple schedule: high $w$ (e.g., 12) for $t > 0.5T$ , low $w$ (e.g., 5) for $t < 0.5T$ . This can improve FID with little loss in CLIP score.

Linear schedule: linearly decrease $w$ from $w_{max}$ to $w_{min}$ across the sampling trajectory.

Perturbed Attention Guidance (PAG): instead of running the model with null conditioning, run it with perturbed (e.g., identity-replaced) self-attention as the "unconditional" baseline. This sidesteps the need for a null text embedding entirely and can provide better structural guidance.
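The simple and linear schedules described above are easy to generate as per-step lists. A minimal sketch follows; the function names are hypothetical, and index 0 is assumed to be the first sampling step (highest noise), which is the order a `guidance_scale_schedule` argument like the one in the sampler in Section 9 would consume:

```python
from typing import List


def linear_guidance_schedule(w_max: float, w_min: float, num_steps: int) -> List[float]:
    """Linearly decrease the guidance scale from w_max (first step) to w_min (last step)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if num_steps == 1:
        return [w_max]
    step = (w_min - w_max) / (num_steps - 1)
    return [w_max + i * step for i in range(num_steps)]


def two_stage_guidance_schedule(
    w_high: float, w_low: float, num_steps: int, switch_fraction: float = 0.5
) -> List[float]:
    """Use w_high for the first switch_fraction of the steps, then w_low."""
    switch = int(num_steps * switch_fraction)
    return [w_high] * switch + [w_low] * (num_steps - switch)


if __name__ == "__main__":
    print(linear_guidance_schedule(12.0, 5.0, 8))
    print(two_stage_guidance_schedule(12.0, 5.0, 10))
```

Both helpers return exactly `num_steps` values, so the result can be checked against the number of inference steps before sampling starts.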

8. Architecture Diagram

9. Complete PyTorch CFG Implementation

import torch
import torch.nn as nn
import math
from typing import Optional, Union, List
from transformers import CLIPTextModel, CLIPTokenizer


# ============================================================
# CFG Sampler - complete production implementation
# ============================================================

class CFGSampler:
    """
    Production-grade CFG sampler for Stable Diffusion-style models.

    Features:
    - Batched CFG (single U-Net forward pass for both branches)
    - Negative prompting as custom unconditional baseline
    - Dynamic thresholding (Imagen-style, for high guidance scales)
    - Time-varying guidance scale schedule
    """

    def __init__(
        self,
        unet: nn.Module,
        noise_scheduler,
        text_encoder: CLIPTextModel,
        tokenizer: CLIPTokenizer,
        device: str = "cuda",
    ):
        self.unet = unet
        self.scheduler = noise_scheduler
        self.text_encoder = text_encoder
        self.tokenizer = tokenizer
        self.device = device

    @torch.no_grad()
    def encode_prompts(
        self,
        prompts: Union[str, List[str]],
        negative_prompts: Optional[Union[str, List[str]]] = None,
        max_length: int = 77,
    ) -> tuple:
        """
        Encode positive and negative prompts.

        Returns:
            pos_embeddings: (B, 77, 768) - conditional embeddings
            neg_embeddings: (B, 77, 768) - unconditional/negative embeddings
        """
        if isinstance(prompts, str):
            prompts = [prompts]
        B = len(prompts)

        # Positive prompts
        pos_tokens = self.tokenizer(
            prompts,
            padding="max_length",
            max_length=max_length,
            truncation=True,
            return_tensors="pt"
        )
        pos_embeddings = self.text_encoder(
            pos_tokens.input_ids.to(self.device)
        ).last_hidden_state  # (B, 77, 768)

        # Negative prompts (or empty string for null embedding)
        if negative_prompts is None:
            negative_prompts = [""] * B
        elif isinstance(negative_prompts, str):
            negative_prompts = [negative_prompts] * B

        neg_tokens = self.tokenizer(
            negative_prompts,
            padding="max_length",
            max_length=max_length,
            truncation=True,
            return_tensors="pt"
        )
        neg_embeddings = self.text_encoder(
            neg_tokens.input_ids.to(self.device)
        ).last_hidden_state  # (B, 77, 768)

        return pos_embeddings, neg_embeddings

    @torch.no_grad()
    def cfg_forward_pass(
        self,
        latents: torch.Tensor,
        timestep: torch.Tensor,
        pos_embeddings: torch.Tensor,
        neg_embeddings: torch.Tensor,
        guidance_scale: float,
    ) -> torch.Tensor:
        """
        Batched CFG forward pass: ONE U-Net call for both branches.

        Concatenating along the batch dimension and running once is
        usually faster than two sequential calls on a GPU, for the same
        total compute but roughly double the activation memory.

        Args:
            latents:        (B, 4, 64, 64)
            timestep:       (B,) or scalar
            pos_embeddings: (B, 77, 768) - conditional
            neg_embeddings: (B, 77, 768) - unconditional/negative
            guidance_scale: w

        Returns:
            guided_noise: (B, 4, 64, 64)
        """
        B = latents.shape[0]

        # Batch both branches: shape (2B, 4, 64, 64)
        # Order: [negative/uncond, positive/cond] - splitting after is cleaner
        latents_2x = torch.cat([latents, latents], dim=0)

        # Batch text embeddings: shape (2B, 77, 768)
        text_embeddings_2x = torch.cat([neg_embeddings, pos_embeddings], dim=0)

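        # Single forward pass for both branches.
        # Broadcast the timestep to the doubled batch: expand() handles both a
        # shape-(1,) and a shape-(B,) timestep; a 0-dim scalar is passed through.
        t_2x = timestep.expand(B).repeat(2) if timestep.dim() > 0 else timestep
        noise_pred = self.unet(
            latents_2x,
            t_2x,
            encoder_hidden_states=text_embeddings_2x
        ).sample  # (2B, 4, 64, 64)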

        # Split: first B = unconditional/negative, last B = conditional
        noise_uncond = noise_pred[:B]   # (B, 4, 64, 64)
        noise_cond   = noise_pred[B:]   # (B, 4, 64, 64)

        # CFG combination: guided = uncond + w * (cond - uncond)
        guided = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

        return guided

    def dynamic_threshold(
        self,
        x_pred: torch.Tensor,
        percentile: float = 0.995,
    ) -> torch.Tensor:
        """
        Imagen-style dynamic thresholding.
        Prevents oversaturation at high guidance scales.

        Input: predicted x_0 (B, C, H, W), values may exceed [-1, 1]
        Output: thresholded x_0, values rescaled to lie in [-1, 1]
        """
        B = x_pred.shape[0]

        # Compute per-sample percentile of absolute values
        # Flatten spatial and channel dims for percentile computation.
        # torch.quantile requires float32/float64, so upcast half-precision inputs.
        x_flat = x_pred.reshape(B, -1).abs().float()  # (B, C*H*W)

        # s = max(1.0, percentile_p(|x_pred|))
        # s > 1 means some values exceed training range [-1, 1]
        s = torch.quantile(x_flat, percentile, dim=1)  # (B,)
        s = torch.maximum(s, torch.ones_like(s))        # s >= 1.0
        s = s.view(B, 1, 1, 1).to(x_pred.dtype)         # (B, 1, 1, 1) for broadcasting

        # Clip to [-s, s] and rescale to [-1, 1]
        x_thresholded = x_pred.clamp(-s, s) / s

        return x_thresholded

    @torch.no_grad()
    def sample(
        self,
        prompts: Union[str, List[str]],
        negative_prompts: Optional[Union[str, List[str]]] = None,
        num_inference_steps: int = 50,
        guidance_scale: float = 7.5,
        height: int = 512,
        width: int = 512,
        latent_channels: int = 4,
        use_dynamic_thresholding: bool = False,
        dynamic_threshold_percentile: float = 0.995,
        guidance_scale_schedule: Optional[List[float]] = None,
        generator: Optional[torch.Generator] = None,
    ) -> torch.Tensor:
        """
        Full CFG sampling loop.

        Args:
            guidance_scale_schedule: if provided, overrides guidance_scale
                per step. Length must equal num_inference_steps.
                e.g. [12, 12, 10, 10, 8, 8, 7.5, ..., 5] for decreasing schedule.

        Returns:
            latents: (B, 4, H//8, W//8) - decode with VAE for images
        """
        if isinstance(prompts, str):
            prompts = [prompts]
        B = len(prompts)

        # 1. Encode prompts
        pos_embeddings, neg_embeddings = self.encode_prompts(
            prompts, negative_prompts
        )

        # 2. Initialize latent noise
        latent_h = height // 8
        latent_w = width // 8
        latents = torch.randn(
            (B, latent_channels, latent_h, latent_w),
            generator=generator,
            device=self.device,
            dtype=pos_embeddings.dtype,
        )

        # Scale by scheduler's initial noise sigma
        self.scheduler.set_timesteps(num_inference_steps, device=self.device)
        latents = latents * self.scheduler.init_noise_sigma

        # 3. Denoising loop
        for step_idx, t in enumerate(self.scheduler.timesteps):
            timestep = torch.as_tensor(t, device=self.device).reshape(1)

            # Get guidance scale for this step (allow per-step scheduling)
            if guidance_scale_schedule is not None:
                w = guidance_scale_schedule[step_idx]
            else:
                w = guidance_scale

            # Some schedulers (e.g. Euler, LMS) expect a rescaled model input;
            # for DDPM/DDIM-style schedulers this call returns latents unchanged.
            model_input = self.scheduler.scale_model_input(latents, t)

            # CFG forward pass (one batched U-Net call)
            noise_pred = self.cfg_forward_pass(
                model_input, timestep, pos_embeddings, neg_embeddings, w
            )

            # Optional: Imagen dynamic thresholding.
            # Assumes an epsilon-prediction model and a DDPM/DDIM-style scheduler
            # that exposes alphas_cumprod. The [-1, 1] target range comes from
            # pixel-space training; latent-space values need not share it.
            if use_dynamic_thresholding:
                # Convert noise pred to x_0 prediction for thresholding
                alpha_bar = self.scheduler.alphas_cumprod[int(t)].to(latents.device)
                x0_pred = (latents - (1 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()

                # Apply dynamic threshold
                x0_thresholded = self.dynamic_threshold(
                    x0_pred, dynamic_threshold_percentile
                )

                # Re-derive noise prediction from thresholded x0
                noise_pred = (latents - alpha_bar.sqrt() * x0_thresholded) / (1 - alpha_bar).sqrt()

            # Scheduler step
            latents = self.scheduler.step(
                noise_pred, t, latents
            ).prev_sample

        return latents

    @torch.no_grad()
    def guidance_scale_sweep(
        self,
        prompt: str,
        scales: List[float],
        seed: int = 42,
        num_steps: int = 30,
    ) -> dict:
        """
        Generate one image per guidance scale for comparison.
        Uses same seed so only guidance scale varies.
        Returns dict mapping scale → latent.
        """
        results = {}
        for w in scales:
            gen = torch.Generator(self.device).manual_seed(seed)
            latent = self.sample(
                [prompt],
                num_inference_steps=num_steps,
                guidance_scale=w,
                generator=gen,
            )
            results[w] = latent
            print(f"  Guidance scale {w:.1f}: latent std={latent.std():.3f}")
        return results


# ============================================================
# Training with CFG conditioning dropout
# ============================================================

class CFGTrainer:
    """
    Training wrapper that adds CFG-compatible unconditional dropout.
    Only one change from standard conditional training.
    """

    def __init__(
        self,
        unet: nn.Module,
        text_encoder: CLIPTextModel,
        noise_scheduler,
        uncond_prob: float = 0.1,
        device: str = "cuda",
    ):
        self.unet = unet
        self.text_encoder = text_encoder
        self.scheduler = noise_scheduler
        self.uncond_prob = uncond_prob
        self.device = device

        # Null embedding: CLIP encoding of empty string
        # Fixed for the training run (some implementations learn it instead)
        self._null_embedding: Optional[torch.Tensor] = None

    def get_null_embedding(self, tokenizer, batch_size: int) -> torch.Tensor:
        """
        Returns CLIP embedding of empty string as the null conditioning.
        Cached after first computation.
        """
        if self._null_embedding is None:
            tokens = tokenizer(
                [""],
                padding="max_length",
                max_length=77,
                truncation=True,
                return_tensors="pt"
            )
            with torch.no_grad():
                self._null_embedding = self.text_encoder(
                    tokens.input_ids.to(self.device)
                ).last_hidden_state  # (1, 77, 768)

        # Expand to batch size
        return self._null_embedding.expand(batch_size, -1, -1)

    def apply_cfg_dropout(
        self,
        text_embeddings: torch.Tensor,
        tokenizer,
    ) -> torch.Tensor:
        """
        Randomly replace text conditioning with null embedding.
        This is the ONLY training change needed for CFG capability.
        """
        B = text_embeddings.shape[0]
        null_embed = self.get_null_embedding(tokenizer, B)

        # Independent Bernoulli dropout per sample
        # keep_prob = 1 - uncond_prob
        keep_mask = torch.rand(B, device=self.device) > self.uncond_prob
        keep_mask = keep_mask.view(B, 1, 1)  # broadcast over (B, 77, 768)

        return torch.where(keep_mask, text_embeddings, null_embed)

    def training_step(
        self,
        x_0: torch.Tensor,
        text_embeddings: torch.Tensor,
        tokenizer,
    ) -> torch.Tensor:
        """
        Single CFG-enabled training step.
        Same as standard conditional training except for cfg_dropout.
        """
        B = x_0.shape[0]

        # Sample random timestep
        t = torch.randint(0, self.scheduler.config.num_train_timesteps, (B,), device=self.device)

        # Sample noise and compute noisy image
        noise = torch.randn_like(x_0)
        noisy_x = self.scheduler.add_noise(x_0, noise, t)

        # CRITICAL: apply CFG dropout to text conditioning
        # This is the only change from standard conditional training
        conditioned_text = self.apply_cfg_dropout(text_embeddings, tokenizer)

        # Predict noise with U-Net
        noise_pred = self.unet(
            noisy_x,
            t,
            encoder_hidden_states=conditioned_text
        ).sample

        # Standard MSE loss
        loss = nn.functional.mse_loss(noise_pred, noise)
        return loss


# ============================================================
# Guidance scale analysis
# ============================================================

def analyze_cfg_effect():
    """
    Demonstrates CFG math: how guidance scale affects prediction direction.
    """

    print("CFG Guidance Scale Analysis")
    print("=" * 60)
    print()
    print("Formula: eps_guided = eps_uncond + w * (eps_cond - eps_uncond)")
    print()

    # Simulate noise predictions
    torch.manual_seed(42)
    eps_cond  = torch.randn(4, 64, 64)  # conditional prediction
    eps_uncond = torch.randn(4, 64, 64)  # unconditional prediction

    # Compute guided predictions at different scales
    print(f"{'Scale w':>10} | {'||guided||':>12} | {'Align w/ cond':>15} | Interpretation")
    print("-" * 75)

    for w in [0.0, 1.0, 3.0, 7.5, 12.0, 20.0]:
        guided = eps_uncond + w * (eps_cond - eps_uncond)

        norm = guided.norm().item()
        # Cosine similarity with conditional prediction
        cos_sim = torch.nn.functional.cosine_similarity(
            guided.flatten().unsqueeze(0),
            eps_cond.flatten().unsqueeze(0)
        ).item()

        if w == 0.0:
            interp = "pure unconditional (prompt ignored)"
        elif w == 1.0:
            interp = "pure conditional (no guidance)"
        elif w <= 5.0:
            interp = "balanced quality/diversity"
        elif w <= 10.0:
            interp = "strong conditioning (SD default range)"
        elif w <= 15.0:
            interp = "very strong - risk of artifacts"
        else:
            interp = "extreme - artifacts likely"

        print(f"{w:>10.1f} | {norm:>12.2f} | {cos_sim:>15.3f} | {interp}")

    print()
    print("Key insight:")
    print("  At w=7.5: guided = 7.5*eps_cond - 6.5*eps_uncond - significant extrapolation")
    print("  At w>15: prediction leaves the natural image manifold → artifacts")
    print("  Dynamic thresholding clips and rescales the x0 prediction at high w")


if __name__ == "__main__":
    analyze_cfg_effect()

10. YouTube Resources

| Title | Channel | Why Watch |
| --- | --- | --- |
| Diffusion Models Beat GANs - Classifier Guidance | Yannic Kilcher | Dhariwal and Nichol paper - original derivation of classifier guidance |
| Classifier-Free Diffusion Guidance Explained | Outlier | CFG derivation with intuitive math and visual examples |
| Stable Diffusion from Scratch | Umar Jamil | Full implementation including CFG sampling loop with shapes |
| Imagen: Photorealistic T2I with Dynamic Thresholding | Two Minute Papers | Dynamic thresholding and high-guidance quality improvements |
| Negative Prompts in Stable Diffusion | Sebastian Kamph | Practical negative prompting strategies and mechanics |

11. Production Engineering Notes

:::warning CFG doubles inference compute - the standard optimization Running two U-Net forward passes doubles compute per step. Batch both branches instead: torch.cat([latents, latents], dim=0) gives batch size 2B, so the conditional and unconditional predictions come out of one forward pass. This is typically faster than two sequential single-batch calls (on the order of 30-40%, depending on hardware and resolution) because GPU utilization is better at larger batch sizes. Production toolkits such as Diffusers, ComfyUI, and InvokeAI use this optimization. Avoid running two separate forward passes in a loop unless the doubled batch does not fit in memory. :::

:::tip Skipping CFG on early high-noise steps saves compute Guidance has diminishing quality returns in the high-noise timesteps (large $t$). At $t$ near $T$, the input is almost pure noise, and the conditional prediction alone is usually enough to set the global composition. Kynkäänniemi et al. (2024) showed that skipping CFG for approximately the first 40% of timesteps (the high-noise regime) causes minimal FID/CLIP-score degradation. Each skipped step runs one U-Net pass instead of two, so this saves about $0.4 \times 0.5 = 20\%$ of total inference compute. Practical implementation: use full CFG only when t < 0.6 * T. :::
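
A minimal sketch of the step-skipping arithmetic, assuming the convention $\tilde\varepsilon = \varepsilon^{uncond} + w(\varepsilon^{cond} - \varepsilon^{uncond})$, under which $w = 1$ reduces to the conditional prediction and the unconditional pass can be dropped. The helper names `interval_guidance_schedule` and `unet_calls` are hypothetical. The returned list is meant to be indexed per step, like the guidance_scale_schedule argument in the sampling loop, and a real loop would still need a branch that skips the unconditional pass when the scale is 1.0.

```python
from typing import List


def interval_guidance_schedule(
    num_steps: int, w: float, skip_fraction: float = 0.4
) -> List[float]:
    """Per-step guidance scales in sampling order (high noise first).

    The first `skip_fraction` of steps get scale 1.0 (CFG off), the rest get `w`.
    """
    n_skip = int(round(skip_fraction * num_steps))
    return [1.0] * n_skip + [w] * (num_steps - n_skip)


def unet_calls(schedule: List[float]) -> int:
    """Count U-Net passes: one where the scale is 1.0, two where CFG is active."""
    return sum(1 if scale == 1.0 else 2 for scale in schedule)


schedule = interval_guidance_schedule(num_steps=30, w=7.5)
baseline_calls = 2 * len(schedule)  # CFG on every step
saved_fraction = 1 - unet_calls(schedule) / baseline_calls
print(f"skipped steps: {schedule.count(1.0)}, compute saved: {saved_fraction:.0%}")
```

With 30 steps and a 40% skip fraction, 12 steps run a single pass and 18 run two, so 48 U-Net calls replace the baseline 60.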

:::note Guidance scale is not universal - retune per model A guidance scale of $w=7.5$ that works well for SD 1.5 tends to produce oversaturated results with SDXL (which works better at $w=5-7$) and SD3/FLUX (which work better at $w=3-5$). Rectified-flow-based models (SD3, FLUX) use a different noise schedule and training objective, so a given $w$ does not correspond to the same effective guidance strength. Always calibrate guidance scale per model by examining FID vs CLIP-score curves on a validation set, not by transferring values from other models. :::

12. Common Mistakes

:::danger Two separate forward passes instead of one batched pass Running eps_cond = unet(x, t, cond_embed) and then eps_uncond = unet(x, t, null_embed) as separate calls is wasteful. Modern GPUs are heavily underutilized at batch size 1. Concatenate both inputs along the batch dimension and run one forward pass. The GPU processes both halves in parallel, in well under twice the wall-clock time of a single call at batch size 1. This is the most impactful single optimization in a CFG sampling loop. :::

:::danger Reversed unconditional/conditional batch ordering The standard pattern is torch.cat([null_embed, cond_embed]) as the text input, meaning the first half of the output batch (noise_pred[:B]) is unconditional and the second half (noise_pred[B:]) is conditional. Swapping the order without adjusting the CFG formula inverts the guidance direction: guided = cond + w * (uncond - cond) moves away from the prompt. The model is steered away from the text description instead of toward it. This is a silent failure: no error is raised, and the images can still look plausible while ignoring the prompt. Always annotate which half of the batch corresponds to which conditioning. :::
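
The ordering convention can be sketched as follows. This is an illustrative NumPy version, not the lesson's PyTorch `cfg_forward_pass`: `toy_unet` is a hypothetical stand-in for the real U-Net, and the shapes are assumptions chosen to mirror SD 1.5. The point is only that the split after the batched call must match the order used in the concatenation.

```python
import numpy as np


def toy_unet(latents: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a U-Net: shifts latents by the mean of each embedding."""
    shift = embeddings.mean(axis=(1, 2)).reshape(-1, 1, 1, 1)
    return 0.5 * latents + shift


def batched_cfg(latents, cond_embed, null_embed, w):
    """One batched call; batch order is [unconditional, conditional]."""
    B = latents.shape[0]
    latent_in = np.concatenate([latents, latents], axis=0)       # (2B, C, H, W)
    embed_in = np.concatenate([null_embed, cond_embed], axis=0)  # uncond FIRST
    eps = toy_unet(latent_in, embed_in)
    eps_uncond, eps_cond = eps[:B], eps[B:]                      # same order as embed_in
    return eps_uncond + w * (eps_cond - eps_uncond)


latents = np.ones((2, 4, 8, 8))
null_embed = np.zeros((2, 77, 768))
cond_embed = np.ones((2, 77, 768))

guided = batched_cfg(latents, cond_embed, null_embed, w=7.5)

# Reference: two separate calls give the same result, just less efficiently.
eps_u = toy_unet(latents, null_embed)
eps_c = toy_unet(latents, cond_embed)
reference = eps_u + 7.5 * (eps_c - eps_u)
print(np.allclose(guided, reference))
```

Comparing the batched result against two separate calls, as done at the end, is a cheap regression test for catching a swapped split.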

:::warning Guidance scale too high for the noise schedule At very low step counts (fewer than 20 steps with DDIM), a high guidance scale amplifies discretization errors. Each first-order (Euler/DDIM) step incurs $O(h^2)$ local error, and at high $w$ this error is amplified along with the guidance term. The result: oversaturated images with edge artifacts at $w > 10$ with fewer than 20 steps. Use DPM-Solver-2 or DPM-Solver-3 when combining high guidance scale with low step count - they are more robust to coarse discretization. :::

:::warning Prompt truncation is silent CLIP's text encoder has a 77-token context window. Exceeding this limit truncates the prompt, typically with no error and at most a log warning, so fewer tokens are processed than you wrote. A descriptive 150-word prompt loses everything after the first ~77 tokens. The words at the end of the prompt are the most likely to be cut. Reorder prompts to put the most important descriptors first. Use a token counter during development. Some implementations support long prompts by encoding multiple CLIP windows and combining the results, but this loses cross-window attention context. :::

13. Interview Q&A

Q1: Derive the CFG formula from Bayes' theorem.

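Classifier guidance adds a scaled classifier gradient to the unconditional score: $\nabla_{x_t}\log p(x_t) + w\,\nabla_{x_t}\log p(c|x_t)$. By Bayes: $\log p(c|x_t) = \log p(x_t|c) - \log p(x_t) + \log p(c)$, where $\log p(c)$ does not depend on $x_t$. Taking gradients: $\nabla_{x_t}\log p(c|x_t) = \nabla_{x_t}\log p(x_t|c) - \nabla_{x_t}\log p(x_t)$. The right-hand side is the conditional score minus the unconditional score, so the guided score becomes $\nabla_{x_t}\log p(x_t) + w\left(\nabla_{x_t}\log p(x_t|c) - \nabla_{x_t}\log p(x_t)\right)$. Converting to noise predictions (using $s_\theta = -\varepsilon_\theta/\sqrt{1-\bar\alpha_t}$):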

$\tilde\varepsilon_\theta(x_t, c) = \varepsilon_\theta(x_t, \emptyset) + w\left(\varepsilon_\theta(x_t, c) - \varepsilon_\theta(x_t, \emptyset)\right)$

The same model produces both the conditional and unconditional prediction by training with unconditional dropout - no separate classifier needed. The implicit classifier gradient is derived from the difference between two outputs of the same network.
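
The identity behind this derivation can be checked numerically on a toy problem where every term has a closed form. The sketch below assumes a 1-D, two-class Gaussian mixture (class 1 ~ N(+MU, 1), class 0 ~ N(-MU, 1), equal priors). This setup is an illustrative assumption, not part of the original derivation, and it works with scores directly instead of noise predictions.

```python
import math

MU = 2.0  # toy assumption: class 1 ~ N(+MU, 1), class 0 ~ N(-MU, 1), equal priors


def score_cond(x: float) -> float:
    """d/dx log p(x | c=1) for a unit-variance Gaussian centred at +MU."""
    return -(x - MU)


def score_uncond(x: float) -> float:
    """d/dx log p(x) for the equal-weight mixture of both classes."""
    return -x + MU * math.tanh(MU * x)


def classifier_grad(x: float) -> float:
    """d/dx log p(c=1 | x), where the exact posterior is sigmoid(2 * MU * x)."""
    posterior = 1.0 / (1.0 + math.exp(-2.0 * MU * x))
    return 2.0 * MU * (1.0 - posterior)


def guided_score_classifier(x: float, w: float) -> float:
    """Classifier guidance: unconditional score plus w times the classifier gradient."""
    return score_uncond(x) + w * classifier_grad(x)


def guided_score_cfg(x: float, w: float) -> float:
    """CFG: the same quantity built only from the two scores."""
    return score_uncond(x) + w * (score_cond(x) - score_uncond(x))


for x in (-1.5, 0.0, 0.7, 2.5):
    gap = abs(guided_score_classifier(x, 7.5) - guided_score_cfg(x, 7.5))
    print(f"x={x:+.1f}  |classifier guidance - CFG| = {gap:.2e}")
```

The two guided scores agree to floating-point precision, which is the "implicit classifier" claim in miniature: the score difference carries the same information as the explicit classifier gradient.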

Q2: What are the failure modes of classifier guidance that CFG avoids?

Four problems. First: classifier guidance requires a separate noise-aware classifier trained at all noise levels, which adds a second training pipeline. CFG uses the same model twice. Second: the classifier is coupled to the specific noise schedule; change the schedule and you must retrain the classifier. CFG has no such coupling. Third: for open-vocabulary text conditioning, you cannot train a classifier over all possible prompts. CFG handles any text embedding. Fourth: the classifier gradient is computed by backpropagating through the classifier to its input, which can behave like an adversarial perturbation and produce artifacts at high guidance scale. CFG uses a simple linear combination of two forward passes, with no backprop at inference.

Q3: What is the effect of guidance scale on FID and CLIP score, and what scale would you use in production?

There is a fundamental trade-off: higher $w$ increases CLIP score (better prompt adherence) and, beyond a small optimal value, increases FID (worse image quality/diversity). The Pareto-optimal region for most SD-scale models is $w = 5-10$. At $w < 3$: low CLIP score, weak conditioning. At $w = 7.5$: the SD default, strong conditioning with acceptable FID. At $w > 15$: high CLIP score, but FID worsens sharply due to mode-seeking behavior and out-of-distribution predictions. Production choice: $w = 7-8$ for SD 1.5, $w = 5-7$ for SDXL, calibrated per model.

Q4: How does negative prompting work mathematically?

Negative prompting replaces the null/empty embedding $\emptyset$ with a negative text description $c^-$: $\tilde\varepsilon = \varepsilon_\theta(x_t, c^-) + w(\varepsilon_\theta(x_t, c^+) - \varepsilon_\theta(x_t, c^-))$. The guidance simultaneously pushes toward $c^+$ and away from $c^-$. This works because the CFG formula is a linear combination of two noise predictions: any pair of conditionings defines a direction in prediction space, and the empty prompt is just one choice of baseline. Practically effective for concrete visual attributes (blurry, deformed, watermark) because CLIP embeds these with strong visual meaning. Less effective for abstract concepts (emotion, negation) where CLIP embeddings are noisy.
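
A small numeric sketch of the baseline swap, using hypothetical 2-D "noise predictions" in which each axis stands for one concept. The values and the helper name `cfg_combine` are assumptions for illustration only; real predictions are full latent tensors.

```python
import numpy as np


def cfg_combine(eps_base: np.ndarray, eps_pos: np.ndarray, w: float) -> np.ndarray:
    """Start from the baseline prediction and move w times along (eps_pos - eps_base)."""
    return eps_base + w * (eps_pos - eps_base)


# Toy 2-D predictions (hypothetical values, chosen so each axis is one concept).
eps_pos = np.array([1.0, 0.0])   # prediction for the positive prompt c+
eps_null = np.array([0.0, 0.0])  # prediction for the empty prompt
eps_neg = np.array([0.0, 1.0])   # prediction for the negative prompt c-

w = 7.5
standard = cfg_combine(eps_null, eps_pos, w)  # baseline = empty prompt
negative = cfg_combine(eps_neg, eps_pos, w)   # baseline = negative prompt

print("standard CFG:   ", standard)
print("negative prompt:", negative)
```

The first component (the positive concept) is the same in both cases, while the second component (the negative concept) drops from 0 to a large negative value. That drop is the push away from $c^-$.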

Q5: What is dynamic thresholding and when is it needed?

Dynamic thresholding (Imagen, Saharia et al. 2022) addresses oversaturation at high $w$. With a large guidance scale, the guided noise prediction $\tilde\varepsilon$ is large, causing the predicted $\hat{x}_0$ to exceed $[-1, 1]$. Static clipping discards information: every value beyond $\pm 1$ becomes exactly $\pm 1$, so the relative differences between saturated pixels are lost. Dynamic thresholding instead clips $\hat{x}_0$ to $[-s, s]$, where $s = \max(1, \text{percentile}_{99.5}(|\hat{x}_0|))$, and then divides by $s$. This preserves the relative structure of the prediction while preventing magnitude explosion. Use it when deploying with $w > 8$ and high-realism requirements. The compute overhead is minimal: one percentile computation per step.
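
The clip-and-rescale rule can be sketched in a few lines. This is an illustrative NumPy version under the formula stated above, not the lesson's own `dynamic_threshold` method. The per-sample percentile over all non-batch dimensions and the random test data are assumptions.

```python
import numpy as np


def dynamic_threshold(x0_pred: np.ndarray, percentile: float = 99.5) -> np.ndarray:
    """Clip each sample to [-s, s] with s = max(1, percentile(|x0|)), then divide by s."""
    batch = x0_pred.shape[0]
    flat_abs = np.abs(x0_pred).reshape(batch, -1)
    s = np.percentile(flat_abs, percentile, axis=1)            # one threshold per sample
    s = np.maximum(s, 1.0).reshape((batch,) + (1,) * (x0_pred.ndim - 1))
    return np.clip(x0_pred, -s, s) / s


rng = np.random.default_rng(0)
x0_in_range = rng.uniform(-1.0, 1.0, size=(2, 3, 8, 8))  # already inside [-1, 1]
x0_blown = 3.0 * x0_in_range                             # simulates high-guidance overshoot

out_in_range = dynamic_threshold(x0_in_range)  # s = 1, so nothing changes
out_blown = dynamic_threshold(x0_blown)        # rescaled back into [-1, 1]
static = np.clip(x0_blown, -1.0, 1.0)          # static clipping for comparison

print("fraction saturated, static :", float((np.abs(static) >= 1.0).mean()))
print("fraction saturated, dynamic:", float((np.abs(out_blown) >= 1.0).mean()))
```

Static clipping pins every overshooting value to exactly $\pm 1$, while the dynamic version saturates only the values above the chosen percentile and keeps the rest distinguishable.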

Q6: How does CFG training differ from standard conditional training?

The only difference is unconditional dropout: with probability $p_\text{uncond} \approx 0.1$, replace the text conditioning with the null/empty-string CLIP embedding. No change to the loss function, optimizer, model architecture, or anything else. This forces the model to learn unconditional generation in addition to conditional generation. At inference, the same model run with null conditioning provides the unconditional baseline. A dropout rate of around 10% is a common choice: enough to learn good unconditional generation without significantly degrading conditional quality. The result: a CFG-capable model with zero architectural changes and negligible training overhead.

Q7: Why is CFG at $w > 1$ considered extrapolation, and what does this mean practically?

For $w > 1$, the guided prediction $\tilde\varepsilon = \varepsilon^{uncond} + w(\varepsilon^{cond} - \varepsilon^{uncond}) = w\,\varepsilon^{cond} - (w-1)\,\varepsilon^{uncond}$ lies beyond the conditional prediction in the direction away from the unconditional. It is an extrapolation past $\varepsilon^{cond}$ rather than an interpolation between $\varepsilon^{cond}$ and $\varepsilon^{uncond}$. Practically: at $w = 7.5$, the guided prediction weights the conditional prediction 7.5x and subtracts 6.5x the unconditional prediction. The result points far outside the convex hull of $\varepsilon^{cond}$ and $\varepsilon^{uncond}$. This produces more extreme images in the conditional direction - stronger, more saturated, more committed to the prompt. Above $w \approx 12-15$, the extrapolation leaves the natural image manifold entirely, producing images that match the prompt but look artificial, with colors and textures more intense than physically possible.
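
The weights can be made explicit with a short sketch. It assumes the convention used in this lesson's sampling code and in Q1, $\tilde\varepsilon = \varepsilon^{uncond} + w(\varepsilon^{cond} - \varepsilon^{uncond})$. The helper names are hypothetical. The last helper converts to the Ho-Salimans parameterization $(1 + w')\varepsilon^{cond} - w'\varepsilon^{uncond}$, which describes the same prediction with $w' = w - 1$.

```python
from typing import Tuple


def cfg_weights(w: float) -> Tuple[float, float]:
    """Weights (a_cond, a_uncond) with guided = a_cond * eps_cond + a_uncond * eps_uncond."""
    return w, 1.0 - w


def is_interpolation(w: float) -> bool:
    """True when both weights lie in [0, 1], so the result stays between the two predictions."""
    a_cond, a_uncond = cfg_weights(w)
    return 0.0 <= a_cond <= 1.0 and 0.0 <= a_uncond <= 1.0


def to_ho_salimans(w: float) -> float:
    """Equivalent scale w' in the (1 + w') * eps_cond - w' * eps_uncond parameterization."""
    return w - 1.0


for w in (0.0, 0.5, 1.0, 7.5):
    a_cond, a_uncond = cfg_weights(w)
    kind = "interpolation" if is_interpolation(w) else "extrapolation"
    print(f"w={w:>4}: {a_cond:+.1f} * eps_cond {a_uncond:+.1f} * eps_uncond  ({kind})")
```

The two weights always sum to 1, so the guided prediction stays on the line through the two predictions. Only for $0 \le w \le 1$ does it stay on the segment between them.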

This lesson is part of the Diffusion Models module. Next: Fine-Tuning Diffusion Models - DreamBooth, LoRA, Textual Inversion, ControlNet.

