Classifier-Free Guidance - Steering Diffusion with Text
:::note Reading time: ~55 minutes | Interview relevance: Very High | Target roles: ML Engineer, Research Engineer, AI Engineer, Applied Scientist :::
The Prompt That Changed Everything
It is late 2022. You are building a text-to-image product. The model generates beautiful images, but when you type "a red dragon breathing fire over a medieval castle at sunset, cinematic lighting, dramatic composition," it produces something that vaguely resembles a blurry orange blob. The model does not care about your prompt enough.
You increase the conditioning weight in training. The image sharpens around the description. The dragon appears. The castle appears. The sunset lights the composition. But something goes wrong - the image looks overexposed, almost burned out, with strange high-frequency artifacts at edges. Colors are saturated beyond what exists in photographs. You have pushed the model too hard toward the text and it has started to produce images that look distinctly artificial.
This is the central tension in every text-to-image system: prompt fidelity versus image naturalness. A model that perfectly follows every word produces images that look technically wrong - too vivid, too sharp, too committed to every token in the prompt. A model that ignores the prompt produces diverse but useless outputs. Somewhere between these extremes is the region where images feel both photorealistic and precisely described.
The technique that solved this problem - Classifier-Free Guidance (CFG), introduced by Jonathan Ho and Tim Salimans at Google Brain in late 2021 - is arguably the single most impactful technique in modern generative AI. It is why every major text-to-image system (Stable Diffusion, DALL-E 2, Midjourney, Imagen) produces outputs that feel simultaneously natural and prompt-faithful. Understanding CFG in depth - including where it comes from, why it works, and where it fails - is foundational knowledge for any engineer working with diffusion models.
Why This Exists - The Conditioning Problem
Before CFG, two approaches existed for steering a diffusion model toward a text prompt or class label.
Approach 1 - Conditional training only. Train the U-Net with text embeddings via cross-attention and rely on the conditioning alone. This works, but weakly. The model has learned to generate good images unconditionally; it does not heavily rely on the text signal. The text conditioning acts more like a soft preference than a hard constraint. Quantitatively: conditional training alone gives moderate CLIP scores but does not dramatically improve FID over unconditional.
Approach 2 - Classifier guidance (Dhariwal and Nichol, 2021). During sampling, augment the score function with the gradient of a noise-aware classifier:
$$\tilde{s}(x_t, c) = \nabla_{x_t} \log p(x_t) + w \, \nabla_{x_t} \log p_\phi(c \mid x_t)$$
Here $\nabla_{x_t} \log p(x_t)$ is the score function, $p_\phi(c \mid x_t)$ is a classifier that has been trained on images at all noise levels $t$, and $w$ is the guidance scale. The classifier gradient pushes sampling toward regions where the class $c$ is likely.
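As a rough illustration of the guided score above, the sketch below computes the classifier gradient with autograd and adds it to a placeholder unconditional score. `ToyNoisyClassifier`, `classifier_grad`, and `guided_score` are hypothetical names invented for this example, and the linear classifier is only a stand-in for a real noise-aware classifier.

```python
import torch
import torch.nn as nn


class ToyNoisyClassifier(nn.Module):
    """Hypothetical stand-in for a noise-aware classifier p_phi(c | x_t, t)."""

    def __init__(self, dim: int = 8, num_classes: int = 3):
        super().__init__()
        self.net = nn.Linear(dim + 1, num_classes)

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Append the timestep as one extra input feature
        return self.net(torch.cat([x_t, t.float().unsqueeze(-1)], dim=-1))


def classifier_grad(classifier, x_t, t, labels):
    """Gradient of log p_phi(c | x_t) with respect to x_t, via autograd."""
    x_in = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
    selected = log_probs.gather(1, labels.unsqueeze(1)).sum()
    return torch.autograd.grad(selected, x_in)[0]


def guided_score(score_uncond, classifier, x_t, t, labels, w):
    """Unconditional score plus w times the classifier gradient."""
    return score_uncond + w * classifier_grad(classifier, x_t, t, labels)


torch.manual_seed(0)
clf = ToyNoisyClassifier()
x_t = torch.randn(4, 8)
t = torch.rand(4)
labels = torch.tensor([0, 1, 2, 0])
score = -x_t  # score of a standard normal, used only as a placeholder
out = guided_score(score, clf, x_t, t, labels, w=2.0)
print(out.shape)
```

Note that every sampling step needs a backward pass through the classifier, which is part of why this approach is expensive at inference.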
Dhariwal and Nichol demonstrated spectacular results: the ADM-G model (ADM + Classifier Guidance) surpassed BigGAN on FID for the first time, using a diffusion model. The guidance made class conditioning so strong that images looked genuinely class-conditional and photorealistic simultaneously.
But classifier guidance has fundamental problems that make it impractical for open-vocabulary text conditioning:
- Requires a noise-aware classifier: a standard ImageNet classifier trained on clean images produces nearly random gradients at high noise levels (where $x_t$ is mostly noise). The classifier must be trained on images at every noise level $t$. This roughly doubles training cost.
- Couples to the specific noise schedule: the classifier is trained on a particular schedule. Change the diffusion model's schedule and the classifier must be retrained.
- Cannot generalize to open-vocabulary text: you can train a classifier for 1000 ImageNet classes, but you cannot train one for all possible natural language prompts. The space of text descriptions is infinite.
- Two forward passes at inference: the diffusion model forward pass plus the classifier forward pass (including backprop through the classifier to compute $\nabla_{x_t} \log p_\phi(c \mid x_t)$) at every step.
Something fundamentally better was needed.
Historical Context - The Implicit Classifier Insight
Jonathan Ho (first author of DDPM) and Tim Salimans published "Classifier-Free Diffusion Guidance" as a NeurIPS 2021 workshop paper. The key insight came from Bayes' theorem.
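The classifier gradient that makes classifier guidance work is:
$$\nabla_{x_t} \log p(c \mid x_t) = \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t)$$
by Bayes' theorem (since $p(c \mid x_t) = p(x_t \mid c)\, p(c) / p(x_t)$ and $p(c)$ does not depend on $x_t$).
The right-hand side is the difference between the conditional score $\nabla_{x_t} \log p(x_t \mid c)$ and the unconditional score $\nabla_{x_t} \log p(x_t)$. These are both score functions - exactly what diffusion models estimate. A single diffusion model trained to handle both conditional and unconditional inputs estimates both scores simultaneously.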
Therefore: if you train one U-Net that can predict noise both with and without conditioning, the classifier gradient is free - it is just the difference between two predictions from the same model. No separate classifier. No noise-aware training data. No backprop at inference time.
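The identity behind this insight can be checked numerically on a toy problem. The sketch below assumes a 1D two-class Gaussian mixture (the priors, means, and the test point are arbitrary illustrative values) and compares the classifier gradient with the difference of the two scores using central finite differences.

```python
import math

# Toy 1D mixture: two classes with Gaussian class-conditionals (illustrative values)
PRIOR = {0: 0.3, 1: 0.7}
MEANS = {0: -1.0, 1: 2.0}
STD = 1.0


def log_p_x_given_c(x: float, c: int) -> float:
    """Log density of the class-conditional Gaussian p(x | c)."""
    return -0.5 * ((x - MEANS[c]) / STD) ** 2 - math.log(STD * math.sqrt(2 * math.pi))


def log_p_x(x: float) -> float:
    """Log density of the marginal p(x) = sum_c p(c) p(x | c)."""
    return math.log(sum(PRIOR[c] * math.exp(log_p_x_given_c(x, c)) for c in PRIOR))


def log_p_c_given_x(x: float, c: int) -> float:
    """Log posterior p(c | x): what an ideal classifier would output."""
    return math.log(PRIOR[c]) + log_p_x_given_c(x, c) - log_p_x(x)


def ddx(f, x: float, h: float = 1e-5) -> float:
    """Central finite-difference derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)


x, c = 0.4, 1
classifier_grad = ddx(lambda v: log_p_c_given_x(v, c), x)
score_difference = ddx(lambda v: log_p_x_given_c(v, c), x) - ddx(log_p_x, x)
print(classifier_grad, score_difference)
```

The two printed numbers agree up to floating-point error, which is the reason a model that estimates both scores makes a separate classifier unnecessary.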
The implementation requires one change to training: with probability $p_{\text{uncond}}$ (typically 10-20%), replace the text conditioning with a null embedding. The model learns to predict noise both with and without conditioning. At inference, run the model twice and combine the results.
This insight moved guided text-to-image generation from "needs a classifier, impractical for open vocab" to "one extra forward pass, works for any text." The impact was immediate.
1. The Core CFG Derivation
From Score Functions to Noise Predictions
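The score function relates to the noise prediction by:
$$\nabla_{x_t} \log p(x_t) \approx -\frac{\epsilon_\theta(x_t)}{\sqrt{1 - \bar{\alpha}_t}}$$
Using the Bayes' theorem decomposition:
$$\nabla_{x_t} \log p(c \mid x_t) = \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t)$$
The guided score becomes:
$$\nabla_{x_t} \log p(x_t \mid c) + w \, \nabla_{x_t} \log p(c \mid x_t) = (1 + w)\, \nabla_{x_t} \log p(x_t \mid c) - w \, \nabla_{x_t} \log p(x_t)$$
Converting back from scores to noise predictions (multiplying by $-\sqrt{1 - \bar{\alpha}_t}$):
$$\tilde{\epsilon}_\theta(x_t, c) = (1 + w)\, \epsilon_\theta(x_t, c) - w \, \epsilon_\theta(x_t, \varnothing)$$
This is the CFG formula. The terms:
- $\epsilon_\theta(x_t, c)$: conditional prediction - noise estimated given text prompt
- $\epsilon_\theta(x_t, \varnothing)$: unconditional prediction - noise estimated with null/empty embedding
- $w$: guidance scale - amplification factor for the conditional direction
- $\tilde{\epsilon}_\theta(x_t, c)$: guided prediction used for the denoising step
Rearranging:
$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, c) + w \, \big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big)$$
This form makes the role of $w$ explicit. When $w = 0$: pure conditional (weak guidance). When $w > 0$: strongly amplified conditional direction. When $w < 0$: move away from the prompt - occasionally useful for adversarial analysis or inverted conditioning. Stable Diffusion-style code usually writes the same formula as $\epsilon_\theta(x_t, \varnothing) + s \, (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))$ with $s = 1 + w$, so in that convention $s = 1$ is pure conditional and $s = 0$ is pure unconditional.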
The Extrapolation Perspective
For $w > 0$, CFG is an extrapolation beyond the conditional prediction. The guided prediction lies outside the segment between $\epsilon_\theta(x_t, \varnothing)$ and $\epsilon_\theta(x_t, c)$ - it is amplified beyond what the model would naturally predict. This extrapolation pushes toward high-probability regions of the conditional distribution much more aggressively than the model alone, at the cost of leaving the natural image distribution.
For $w = 0$: pure conditional, no extrapolation. For $w = 7.5$: $\tilde{\epsilon} = 8.5\, \epsilon_\theta(x_t, c) - 7.5\, \epsilon_\theta(x_t, \varnothing)$ - the conditional prediction is weighted 8.5x and the unconditional is subtracted. For much larger $w$: extreme extrapolation - the guided prediction is far from the image manifold, causing artifacts.
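To make the extrapolation concrete, here is a minimal sketch of the guidance formula in the form used in this section, next to the equivalent form common in Stable Diffusion code. The function names `cfg_paper` and `cfg_sd_style` are made up for this example, and the random tensors merely stand in for real model outputs.

```python
import torch


def cfg_paper(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, w: float) -> torch.Tensor:
    """Form used in this section: w = 0 is pure conditional."""
    return (1.0 + w) * eps_cond - w * eps_uncond


def cfg_sd_style(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, s: float) -> torch.Tensor:
    """Form common in Stable Diffusion code: s = 1 is pure conditional."""
    return eps_uncond + s * (eps_cond - eps_uncond)


torch.manual_seed(0)
eps_cond = torch.randn(4, 8, 8)    # placeholder for the conditional prediction
eps_uncond = torch.randn(4, 8, 8)  # placeholder for the unconditional prediction

w = 7.5
guided = cfg_paper(eps_cond, eps_uncond, w)
same = cfg_sd_style(eps_cond, eps_uncond, 1.0 + w)

# How far the guided prediction sits from the unconditional one,
# relative to the distance between conditional and unconditional
ratio = (guided - eps_uncond).norm() / (eps_cond - eps_uncond).norm()
print(torch.allclose(guided, same, atol=1e-5), ratio.item())
```

The ratio comes out at about 8.5, that is $1 + w$: the guided prediction lies well beyond the conditional prediction on the line through the two model outputs.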
2. The Guidance Scale Trade-Off - FID vs CLIP Score
The guidance scale controls a fundamental trade-off:
FID Score - Image Quality and Diversity
FID (Fréchet Inception Distance) measures how close the generated image distribution is to the real image distribution. It captures both quality (do individual images look real?) and diversity (does the set of generated images cover the full range of real images?).
At low $w$: generated images are diverse but have weak prompt adherence. FID is low (good) because the distribution closely matches real images.
At high $w$: the model aggressively pursues images that match the text, ignoring diversity. Some images are highly detailed and sharp, but the distribution is narrower than the real distribution. FID increases (worsens).
CLIP Score - Text-Image Alignment
CLIP score measures the cosine similarity between the CLIP image embedding of generated images and the CLIP text embedding of the prompt. Higher is better - it measures how well the image matches the text.
At low $w$: low CLIP score - the model ignores the prompt. At high $w$: high CLIP score - the model is maximally prompt-faithful.
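As a small sketch of the metric just described, the function below averages the cosine similarity between paired image and text embeddings. It assumes the embeddings were already produced by a CLIP model (not loaded here), and `clip_score` is a hypothetical helper name; published CLIP-score variants may apply additional scaling or clamping.

```python
import torch
import torch.nn.functional as F


def clip_score(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between paired embeddings of shape (N, D)."""
    return F.cosine_similarity(image_emb, text_emb, dim=-1).mean()


# Toy embeddings standing in for real CLIP outputs
text_emb = torch.tensor([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
aligned_images = 3.0 * text_emb   # same direction, different magnitude
opposed_images = -text_emb        # opposite direction

print(clip_score(aligned_images, text_emb).item())
print(clip_score(opposed_images, text_emb).item())
```

Cosine similarity ignores embedding magnitude, so only the direction of the image embedding relative to the text embedding affects the score.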
The Pareto Frontier
Empirically (from Ho and Salimans 2021 and subsequent papers):
| Guidance Scale | FID (CIFAR-10 class-cond) | CLIP Score | Perceptual quality |
|---|---|---|---|
| | 3.21 | 0.21 | Natural, diverse, weakly conditioned |
| | 3.5 | 0.24 | Balanced, good starting point |
| | 4.1 | 0.27 | Clear subject, somewhat stylized |
| | 5.9 | 0.31 | Default SD - strong conditioning |
| | 7.8 | 0.33 | Very prompt-faithful, somewhat artificial |
| | 14.2 | 0.34 | Artifacts, oversaturation, uncanny valley |
The sweet spot depends on the use case. For photorealistic portrait generation: . For stylized illustration: . For exact prompt following with known artifact tolerance: .
Why High Guidance Causes Artifacts
At high $w$, the extrapolation produces prediction values that are outside the range of normal Gaussian noise ($\epsilon \sim \mathcal{N}(0, I)$). The predicted $\hat{x}_0$ from this guidance will have values outside $[-1, 1]$. When clipped to $[-1, 1]$ (the training distribution's range), color information is lost and saturation artifacts appear. Dynamic thresholding (Section 6) addresses this.
3. Training for CFG - Unconditional Dropout
The Minimal Training Change
CFG requires exactly one change to standard conditional diffusion training: unconditional dropout. With probability $p_{\text{uncond}}$ (typically 0.10 to 0.20), replace the text conditioning $c$ with the null embedding $\varnothing$:

```python
import torch


def apply_cfg_dropout(
    text_embeddings: torch.Tensor,
    null_embedding: torch.Tensor,
    dropout_prob: float = 0.1,
) -> torch.Tensor:
    """
    Apply unconditional dropout for CFG training.

    With probability dropout_prob (10% by default), replace a sample's
    conditioning with the null embedding.
    This is the ONLY change needed to make a model CFG-capable.
    """
    batch_size = text_embeddings.shape[0]
    # For each sample in the batch, independently decide whether to drop conditioning
    keep = torch.rand(batch_size, device=text_embeddings.device) > dropout_prob  # True = keep
    keep = keep.view(-1, 1, 1)  # shape (B, 1, 1) for broadcasting over (B, 77, 768)
    # null_embedding must broadcast against text_embeddings, e.g. shape (1, 77, 768)
    null_embedding = null_embedding.to(text_embeddings.device, text_embeddings.dtype)
    # Mix: keep conditioning or replace with null
    return torch.where(keep, text_embeddings, null_embedding)
```
No other changes: same loss function, same optimizer, same architecture. The model simply learns to also predict good noise without conditioning. At inference, running it with the null embedding gives the unconditional baseline needed for CFG.
The Null Embedding
The unconditional prediction requires a null conditioning signal. In Stable Diffusion, this is the CLIP text embedding of an empty string - all padding tokens, producing a specific fixed vector the U-Net has learned to interpret as "no prompt."
The null embedding matters more than it seems. Research has shown:
- A learned null embedding (trainable vector optimized to produce good unconditional outputs) can outperform the empty-string CLIP embedding for image quality.
- The null embedding acts as a "style baseline" - images generated with CFG diverge from it toward the prompt. The visual style of null-guided generation (typically blurry, generic) affects what the "prompt direction" looks like.
- Null-Text Inversion (Mokady et al. 2022) showed that optimizing the null embedding per image enables precise image editing - the edited image stays close to the source by design.
Choosing the Dropout Probability
- $p_{\text{uncond}} = 0.05$ (5%): the model rarely sees null conditioning during training, so unconditional generation quality is low. CFG at inference uses a poor unconditional baseline. Avoid.
- $p_{\text{uncond}} = 0.10$ (10%): the SD 1.5 setting. Good balance - 90% of steps improve conditional understanding, 10% establish the unconditional baseline.
- $p_{\text{uncond}} = 0.20$ (20%): stronger unconditional quality. Slightly weaker conditional quality. Use when image editing or DDIM inversion are primary use cases.
- $p_{\text{uncond}} = 0.50$ (50%): equal time on conditional and unconditional. Training is slower for each capability. Rarely optimal.
4. Original Classifier Guidance - Dhariwal and Nichol 2021
Understanding classifier guidance is important for interview questions that ask you to compare CFG to its predecessor.
The Dhariwal-Nichol ADM+G Setup
Their model had three components:
- ADM (Ablated Diffusion Model) - a U-Net diffusion model, class-conditional via adaptive group normalization
- A separate noise-aware classifier trained on images at all noise levels
- Guidance at inference: $\hat{\epsilon} = \epsilon_\theta(x_t) - w \sqrt{1 - \bar{\alpha}_t}\, \nabla_{x_t} \log p_\phi(c \mid x_t)$
The noise-aware classifier was architecturally similar to the diffusion U-Net: it took noisy images $x_t$ at all $t$ and predicted class probabilities. This was not a standard classifier - it required training on noisy images with the specific noise schedule used during diffusion training.
Results and Problems
The results were impressive: ADM-G achieved FID 4.59 on ImageNet 256x256 (beating BigGAN's 6.95 at the time). Class-conditional samples looked genuinely class-specific and photorealistic.
Problems that made it impractical for production:
- Cost: training the noise-aware classifier requires roughly as much compute as training the diffusion model itself.
- Inflexibility: the classifier is tightly coupled to the noise schedule and image resolution. Change either and you need a new classifier.
- Text scalability: you cannot train a classifier over all possible natural language prompts. Classifier guidance fundamentally cannot handle open-vocabulary conditioning.
- Adversarial-like artifacts at high guidance: computing $\nabla_{x_t} \log p_\phi(c \mid x_t)$ via backprop through the classifier is like computing adversarial examples. At high $w$, the gradient can push the image toward classifier-fooling patterns that look unnatural.
CFG eliminated all four problems: no separate model, decoupled from noise schedule, works for any text, no backprop.
5. Negative Prompting - Inverted Guidance
Mechanics
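The standard CFG formula uses an empty string as the null embedding $\varnothing$. You can replace this with any text description - including descriptions of things you do not want:
$$\tilde{\epsilon}_\theta = \epsilon_\theta(x_t, c_{\text{pos}}) + w \, \big(\epsilon_\theta(x_t, c_{\text{pos}}) - \epsilon_\theta(x_t, c_{\text{neg}})\big)$$
Where $c_{\text{pos}}$ is the positive prompt and $c_{\text{neg}}$ is the negative prompt. The guidance simultaneously pushes toward $c_{\text{pos}}$ and away from $c_{\text{neg}}$.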
This is not a special feature - it is a direct consequence of the CFG formula being linear. Any text embedding can serve as the "unconditional baseline." When you use a negative prompt, you are just providing a more specific "direction to move away from."
What Negative Prompting Does and Does Not Do
Effective for:
- Concrete visual artifacts: "blurry," "deformed hands," "extra fingers," "watermark," "text overlay" - these have well-defined CLIP embeddings in visual space
- Specific visual styles to avoid: "cartoon," "anime," "oil painting," "low-resolution"
- Obvious quality issues: "low quality," "jpeg artifacts," "overexposed"
Ineffective for:
- Abstract emotional concepts: "sad," "anxious" - CLIP embeddings for abstract emotions are noisy and context-dependent
- Negative logical constructions: "not a dog" - CLIP does not have a clear embedding for negated concepts; the model may actually attend to the "dog" concept
- Fine-grained compositional control: "put the cat on the left" - spatial control requires attention manipulation (ControlNet) not negative prompting
Common Negative Prompt Patterns
```python
import torch

# Domain-specific negative prompts
NEGATIVE_PROMPTS = {
    "photorealistic_portrait": (
        "blurry, out of focus, extra fingers, deformed hands, "
        "ugly face, bad anatomy, watermark, text overlay, logo, "
        "jpeg artifacts, low resolution, grainy"
    ),
    "landscape": (
        "cartoon, anime, illustration, painting, oversaturated, "
        "HDR, unrealistic colors, fog, haze, overexposed"
    ),
    "architecture": (
        "people, blurry, distorted perspective, unrealistic proportions, "
        "graffiti, modern elements in historical scenes"
    ),
    "product_photography": (
        "background clutter, shadows, reflections, dust, scratches, "
        "blurry, poor lighting, low quality"
    ),
    # Universal quality negative prompt (SD community standard)
    "universal": (
        "worst quality, low quality, normal quality, lowres, bad anatomy, "
        "bad hands, extra fingers, fewer fingers, missing fingers, "
        "text, watermark, artist name, signature, username"
    ),
}


# Dual-prompt CFG: positive pushes toward, negative pushes away
def dual_prompt_cfg(
    noise_cond: torch.Tensor,  # eps_theta(x_t, positive_prompt)
    noise_neg: torch.Tensor,   # eps_theta(x_t, negative_prompt)
    guidance_scale: float = 7.5,
) -> torch.Tensor:
    """
    CFG with negative prompt replacing null/empty embedding.

    Uses the Stable Diffusion convention: guided = neg + s * (pos - neg),
    where s = 1 + w, so guidance_scale = 1 returns the positive-prompt
    prediction unchanged.
    """
    return noise_neg + guidance_scale * (noise_cond - noise_neg)
```
6. Dynamic Thresholding - Fixing High-Guidance Artifacts
The Problem at High Guidance Scale
Google's Imagen paper (Saharia et al. 2022) identified and solved the main artifact problem at high guidance scales. When $w$ is large, the CFG formula amplifies $\tilde{\epsilon}$ values beyond the range expected during training. Translating back to predicted $\hat{x}_0$:
$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \tilde{\epsilon}}{\sqrt{\bar{\alpha}_t}}$$
With large $w$, $\|\tilde{\epsilon}\|$ is large, making $\hat{x}_0$ have values far outside $[-1, 1]$. Static clipping (clamp(-1, 1)) flattens all out-of-range values to the same limit - every value that would have been above $1$ becomes exactly $1$, so the relative differences between those pixels are lost. The result: color saturation artifacts, loss of texture detail.
The Dynamic Thresholding Solution
Imagen's dynamic thresholding clips adaptively:
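- Compute $\hat{x}_0$ from $\tilde{\epsilon}$ (the guided noise prediction)
- Compute $s = \text{percentile}(|\hat{x}_0|, p)$ where typically $p = 99.5\%$
- If $s > 1$: clip $\hat{x}_0$ to $[-s, s]$ and rescale by $1/s$
- Re-derive $\tilde{\epsilon}$ from the thresholded $\hat{x}_0$
Mathematically:
$$\hat{x}_0 \leftarrow \frac{\text{clip}(\hat{x}_0, -s, s)}{s}, \qquad s = \max\big(1, \text{percentile}(|\hat{x}_0|, p)\big)$$
This preserves the direction of the prediction (where it points) while preventing the magnitude from exploding. The result: Imagen can use high guidance weights with dynamic thresholding and produce images that are both strongly conditioned and free of color saturation artifacts.
Implementation note: $s$ must be computed per-sample in the batch (not across the batch), since different images may have different magnitudes.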
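A tiny numeric comparison may help here. The sketch below contrasts static clipping with per-sample dynamic thresholding on a hand-picked batch. The function names are illustrative, and the example passes `p=1.0` (the per-sample maximum) instead of the usual 99.5th percentile so that the numbers are easy to follow.

```python
import torch


def static_threshold(x0: torch.Tensor) -> torch.Tensor:
    """Hard clip to the training range [-1, 1]."""
    return x0.clamp(-1.0, 1.0)


def dynamic_threshold(x0: torch.Tensor, p: float = 0.995) -> torch.Tensor:
    """Per-sample percentile clip followed by rescaling into [-1, 1]."""
    flat = x0.reshape(x0.shape[0], -1).abs().float()
    s = torch.quantile(flat, p, dim=1).clamp(min=1.0)   # one s per sample, at least 1
    s = s.view(-1, *([1] * (x0.dim() - 1)))              # broadcastable shape
    clipped = torch.maximum(torch.minimum(x0, s), -s)    # clip to [-s, s]
    return clipped / s


x0 = torch.tensor([
    [0.2, 0.9, 1.5, 3.0],    # sample with out-of-range values
    [0.1, -0.4, 0.5, 0.8],   # sample already inside [-1, 1]
])

static = static_threshold(x0)
dynamic = dynamic_threshold(x0, p=1.0)
print(static[0])   # 1.5 and 3.0 both collapse to 1.0
print(dynamic[0])  # ordering of 1.5 and 3.0 is preserved after rescaling
print(dynamic[1])  # in-range sample is left unchanged (s = 1)
```

Because `s` is computed per row, the in-range second sample is untouched even though the first sample needs heavy rescaling.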
7. CFG++ and Variants
CFG++ (Chung et al. 2024)
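Standard CFG modifies the noise prediction $\tilde{\epsilon}$, which is then used for the entire update. CFG++ (Chung et al. 2024) instead applies the guidance to the denoised estimate and keeps the unconditional noise prediction for the renoising part of the step, motivated by the observation that standard CFG creates bias in the sampling trajectory.
The CFG++ update (DDIM form):
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \big[\hat{x}_0^{\varnothing} + \lambda \, (\hat{x}_0^{c} - \hat{x}_0^{\varnothing})\big] + \sqrt{1 - \bar{\alpha}_{t-1}}\; \epsilon_\theta(x_t, \varnothing)$$
Here $\hat{x}_0^{c}$ and $\hat{x}_0^{\varnothing}$ are the denoised estimates obtained from the conditional and unconditional noise predictions, and $\lambda \in [0, 1]$ is the CFG++ guidance scale. This modifies the latent directly rather than the noise prediction, using the difference in denoised samples rather than in noise predictions. The benefit: more consistent sampling trajectory, less guidance-induced drift at high guidance scales, slightly better FID/CLIP trade-off.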
CFG++ is equivalent to standard CFG at and . At high guidance scales, it produces cleaner trajectories. It has been adopted in some production systems but has not fully displaced standard CFG due to its more complex implementation.
Auto CFG (Automatic Guidance Scale Scheduling)
Several works have explored time-varying guidance scales. The observation: guidance scale at high noise levels ($t$ near $T$) primarily affects global composition and prompt adherence for major subject identification. Guidance at low noise levels ($t$ near 0) affects fine detail and texture fidelity. Fixed $w$ is suboptimal because the optimal guidance strength differs between these regimes.
Simple schedule: high $w$ (e.g., 12) for the early, high-noise steps, low $w$ (e.g., 5) for the later, low-noise steps. This can achieve better FID without sacrificing CLIP score.
Linear schedule: linearly decrease $w$ from $w_{\max}$ to $w_{\min}$ across the sampling trajectory.
Perturbed Attention Guidance (PAG): instead of running the model with null conditioning, run it with perturbed (e.g., identity-replaced) self-attention as the "unconditional" baseline. This sidesteps the need for a null text embedding entirely and can provide better structural guidance.
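The simple and linear schedules above can be sketched as plain per-step lists. The function names, the `switch_fraction` parameter, and the default values are assumptions made for this example; index 0 corresponds to the first (highest-noise) sampling step.

```python
from typing import List


def step_schedule(
    num_steps: int,
    w_high: float = 12.0,
    w_low: float = 5.0,
    switch_fraction: float = 0.5,  # assumed switch point, not taken from any paper
) -> List[float]:
    """High guidance for the first part of sampling, low guidance afterwards."""
    n_high = round(num_steps * switch_fraction)
    return [w_high] * n_high + [w_low] * (num_steps - n_high)


def linear_schedule(num_steps: int, w_max: float = 12.0, w_min: float = 5.0) -> List[float]:
    """Linearly decrease guidance from w_max (first step) to w_min (last step)."""
    if num_steps == 1:
        return [w_max]
    return [w_max + (w_min - w_max) * i / (num_steps - 1) for i in range(num_steps)]


print(step_schedule(10))
print(linear_schedule(5))
```

A list like this has one entry per sampling step, which is the shape the `guidance_scale_schedule` argument in the Section 9 sampler expects.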
8. Architecture Diagram
9. Complete PyTorch CFG Implementation
```python
import torch
import torch.nn as nn
from typing import Optional, Union, List
from transformers import CLIPTextModel, CLIPTokenizer

# ============================================================
# CFG Sampler - complete production implementation
# ============================================================


class CFGSampler:
    """
    Production-grade CFG sampler for Stable Diffusion-style models.
    Assumes a diffusers-style U-Net and noise scheduler.

    Features:
    - Batched CFG (single U-Net forward pass for both branches)
    - Negative prompting as custom unconditional baseline
    - Dynamic thresholding (Imagen-style, for high guidance scales)
    - Time-varying guidance scale schedule

    Convention: guidance_scale follows the Stable Diffusion form
    guided = uncond + w * (cond - uncond), so w = 1 is pure conditional.
    """

    def __init__(
        self,
        unet: nn.Module,
        noise_scheduler,
        text_encoder: CLIPTextModel,
        tokenizer: CLIPTokenizer,
        device: str = "cuda",
    ):
        self.unet = unet
        self.scheduler = noise_scheduler
        self.text_encoder = text_encoder
        self.tokenizer = tokenizer
        self.device = device

    @torch.no_grad()
    def encode_prompts(
        self,
        prompts: Union[str, List[str]],
        negative_prompts: Optional[Union[str, List[str]]] = None,
        max_length: int = 77,
    ) -> tuple:
        """
        Encode positive and negative prompts.

        Returns:
            pos_embeddings: (B, 77, 768) - conditional embeddings
            neg_embeddings: (B, 77, 768) - unconditional/negative embeddings
        """
        if isinstance(prompts, str):
            prompts = [prompts]
        B = len(prompts)

        # Positive prompts
        pos_tokens = self.tokenizer(
            prompts,
            padding="max_length",
            max_length=max_length,
            truncation=True,
            return_tensors="pt",
        )
        pos_embeddings = self.text_encoder(
            pos_tokens.input_ids.to(self.device)
        ).last_hidden_state  # (B, 77, 768)

        # Negative prompts (or empty string for null embedding)
        if negative_prompts is None:
            negative_prompts = [""] * B
        elif isinstance(negative_prompts, str):
            negative_prompts = [negative_prompts] * B
        neg_tokens = self.tokenizer(
            negative_prompts,
            padding="max_length",
            max_length=max_length,
            truncation=True,
            return_tensors="pt",
        )
        neg_embeddings = self.text_encoder(
            neg_tokens.input_ids.to(self.device)
        ).last_hidden_state  # (B, 77, 768)

        return pos_embeddings, neg_embeddings

    @torch.no_grad()
    def cfg_forward_pass(
        self,
        latents: torch.Tensor,
        timestep: torch.Tensor,
        pos_embeddings: torch.Tensor,
        neg_embeddings: torch.Tensor,
        guidance_scale: float,
    ) -> torch.Tensor:
        """
        Batched CFG forward pass: ONE U-Net call for both branches.

        Concatenating along the batch dimension and running once is usually
        faster than two sequential calls because the GPU is better utilized,
        at the price of holding activations for 2B samples at once.

        Args:
            latents: (B, 4, 64, 64)
            timestep: (B,), (1,), or scalar
            pos_embeddings: (B, 77, 768) - conditional
            neg_embeddings: (B, 77, 768) - unconditional/negative
            guidance_scale: w
        Returns:
            guided_noise: (B, 4, 64, 64)
        """
        B = latents.shape[0]

        # Batch both branches: shape (2B, 4, 64, 64)
        # Order: [negative/uncond, positive/cond] - splitting after is cleaner
        latents_2x = torch.cat([latents, latents], dim=0)

        # Batch text embeddings: shape (2B, 77, 768)
        text_embeddings_2x = torch.cat([neg_embeddings, pos_embeddings], dim=0)

        # Broadcast the timestep to one entry per sample, then duplicate: (2B,)
        timestep = torch.as_tensor(timestep, device=latents.device)
        timestep_2x = timestep.reshape(-1).expand(B).repeat(2)

        # Single forward pass for both branches
        noise_pred = self.unet(
            latents_2x,
            timestep_2x,
            encoder_hidden_states=text_embeddings_2x,
        ).sample  # (2B, 4, 64, 64)

        # Split: first B = unconditional/negative, last B = conditional
        noise_uncond = noise_pred[:B]  # (B, 4, 64, 64)
        noise_cond = noise_pred[B:]    # (B, 4, 64, 64)

        # CFG combination: guided = uncond + w * (cond - uncond)
        guided = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
        return guided

    def dynamic_threshold(
        self,
        x_pred: torch.Tensor,
        percentile: float = 0.995,
    ) -> torch.Tensor:
        """
        Imagen-style dynamic thresholding.
        Prevents oversaturation at high guidance scales.

        Input: predicted x_0 (B, C, H, W), values may exceed [-1, 1]
        Output: thresholded x_0, values rescaled to lie in [-1, 1]
        """
        B = x_pred.shape[0]

        # Compute per-sample percentile of absolute values
        # Flatten spatial and channel dims; torch.quantile needs float32/float64
        x_flat = x_pred.reshape(B, -1).abs().float()  # (B, C*H*W)

        # s = max(1.0, percentile_p(|x_pred|))
        # s > 1 means some values exceed training range [-1, 1]
        s = torch.quantile(x_flat, percentile, dim=1)  # (B,)
        s = torch.maximum(s, torch.ones_like(s))        # s >= 1.0
        s = s.view(B, 1, 1, 1).to(x_pred.dtype)         # (B, 1, 1, 1) for broadcasting

        # Clip to [-s, s] and rescale to [-1, 1]
        x_thresholded = torch.maximum(torch.minimum(x_pred, s), -s) / s
        return x_thresholded

    @torch.no_grad()
    def sample(
        self,
        prompts: Union[str, List[str]],
        negative_prompts: Optional[Union[str, List[str]]] = None,
        num_inference_steps: int = 50,
        guidance_scale: float = 7.5,
        height: int = 512,
        width: int = 512,
        latent_channels: int = 4,
        use_dynamic_thresholding: bool = False,
        dynamic_threshold_percentile: float = 0.995,
        guidance_scale_schedule: Optional[List[float]] = None,
        generator: Optional[torch.Generator] = None,
    ) -> torch.Tensor:
        """
        Full CFG sampling loop.

        Args:
            guidance_scale_schedule: if provided, overrides guidance_scale
                per step. Length must equal the number of scheduler timesteps.
                e.g. [12, 12, 10, 10, 8, 8, 7.5, ..., 5] for decreasing schedule.
        Returns:
            latents: (B, 4, H//8, W//8) - decode with VAE for images
        """
        if isinstance(prompts, str):
            prompts = [prompts]
        B = len(prompts)

        # 1. Encode prompts
        pos_embeddings, neg_embeddings = self.encode_prompts(
            prompts, negative_prompts
        )

        # 2. Initialize latent noise
        latent_h = height // 8
        latent_w = width // 8
        latents = torch.randn(
            (B, latent_channels, latent_h, latent_w),
            generator=generator,
            device=self.device,
            dtype=pos_embeddings.dtype,
        )

        # Scale by scheduler's initial noise sigma
        self.scheduler.set_timesteps(num_inference_steps, device=self.device)
        latents = latents * self.scheduler.init_noise_sigma

        if (
            guidance_scale_schedule is not None
            and len(guidance_scale_schedule) != len(self.scheduler.timesteps)
        ):
            raise ValueError("guidance_scale_schedule must have one entry per timestep")

        # 3. Denoising loop
        for step_idx, t in enumerate(self.scheduler.timesteps):
            # Get guidance scale for this step (allow per-step scheduling)
            if guidance_scale_schedule is not None:
                w = guidance_scale_schedule[step_idx]
            else:
                w = guidance_scale

            # Some schedulers (e.g. Euler, LMS) rescale the model input;
            # for DDIM/PNDM this call is the identity
            model_input = self.scheduler.scale_model_input(latents, t)

            # CFG forward pass (one batched U-Net call)
            noise_pred = self.cfg_forward_pass(
                model_input, t, pos_embeddings, neg_embeddings, w
            )

            # Optional: Imagen dynamic thresholding
            # Note: Imagen thresholds in pixel space; applying it to latents,
            # which are not bounded to [-1, 1], is a heuristic. The conversion
            # below assumes a DDPM/DDIM-style epsilon parameterization.
            if use_dynamic_thresholding:
                # Convert noise pred to x_0 prediction for thresholding
                alpha_bar = self.scheduler.alphas_cumprod[int(t)].to(
                    device=latents.device, dtype=latents.dtype
                )
                x0_pred = (latents - (1 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()

                # Apply dynamic threshold
                x0_thresholded = self.dynamic_threshold(
                    x0_pred, dynamic_threshold_percentile
                )

                # Re-derive noise prediction from thresholded x0
                noise_pred = (latents - alpha_bar.sqrt() * x0_thresholded) / (1 - alpha_bar).sqrt()

            # Scheduler step
            latents = self.scheduler.step(
                noise_pred, t, latents
            ).prev_sample

        return latents

    @torch.no_grad()
    def guidance_scale_sweep(
        self,
        prompt: str,
        scales: List[float],
        seed: int = 42,
        num_steps: int = 30,
    ) -> dict:
        """
        Generate one image per guidance scale for comparison.
        Uses same seed so only guidance scale varies.
        Returns dict mapping scale → latent.
        """
        results = {}
        for w in scales:
            gen = torch.Generator(self.device).manual_seed(seed)
            latent = self.sample(
                [prompt],
                num_inference_steps=num_steps,
                guidance_scale=w,
                generator=gen,
            )
            results[w] = latent
            print(f"  Guidance scale {w:.1f}: latent std={latent.std().item():.3f}")
        return results
```
```python
import torch
import torch.nn as nn
from typing import Optional
from transformers import CLIPTextModel

# ============================================================
# Training with CFG conditioning dropout
# ============================================================


class CFGTrainer:
    """
    Training wrapper that adds CFG-compatible unconditional dropout.
    Only one change from standard conditional training.
    """

    def __init__(
        self,
        unet: nn.Module,
        text_encoder: CLIPTextModel,
        noise_scheduler,
        uncond_prob: float = 0.1,
        device: str = "cuda",
    ):
        self.unet = unet
        self.text_encoder = text_encoder
        self.scheduler = noise_scheduler
        self.uncond_prob = uncond_prob
        self.device = device
        # Null embedding: CLIP encoding of empty string
        # Fixed for the training run (a learned null embedding is an alternative)
        self._null_embedding: Optional[torch.Tensor] = None

    def get_null_embedding(self, tokenizer, batch_size: int) -> torch.Tensor:
        """
        Returns CLIP embedding of empty string as the null conditioning.
        Cached after first computation.
        """
        if self._null_embedding is None:
            tokens = tokenizer(
                [""],
                padding="max_length",
                max_length=77,
                truncation=True,
                return_tensors="pt",
            )
            with torch.no_grad():
                self._null_embedding = self.text_encoder(
                    tokens.input_ids.to(self.device)
                ).last_hidden_state  # (1, 77, 768)
        # Expand to batch size (a view, no copy)
        return self._null_embedding.expand(batch_size, -1, -1)

    def apply_cfg_dropout(
        self,
        text_embeddings: torch.Tensor,
        tokenizer,
    ) -> torch.Tensor:
        """
        Randomly replace text conditioning with null embedding.
        This is the ONLY training change needed for CFG capability.
        """
        B = text_embeddings.shape[0]
        null_embed = self.get_null_embedding(tokenizer, B).to(text_embeddings.dtype)

        # Independent Bernoulli dropout per sample
        # keep_prob = 1 - uncond_prob
        keep_mask = torch.rand(B, device=text_embeddings.device) > self.uncond_prob
        keep_mask = keep_mask.view(B, 1, 1)  # broadcast over (B, 77, 768)
        return torch.where(keep_mask, text_embeddings, null_embed)

    def training_step(
        self,
        x_0: torch.Tensor,
        text_embeddings: torch.Tensor,
        tokenizer,
    ) -> torch.Tensor:
        """
        Single CFG-enabled training step.
        Same as standard conditional training except for cfg_dropout.
        """
        B = x_0.shape[0]

        # Sample random timestep
        t = torch.randint(
            0, self.scheduler.config.num_train_timesteps, (B,), device=self.device
        )

        # Sample noise and compute noisy image
        noise = torch.randn_like(x_0)
        noisy_x = self.scheduler.add_noise(x_0, noise, t)

        # CRITICAL: apply CFG dropout to text conditioning
        # This is the only change from standard conditional training
        conditioned_text = self.apply_cfg_dropout(text_embeddings, tokenizer)

        # Predict noise with U-Net
        noise_pred = self.unet(
            noisy_x,
            t,
            encoder_hidden_states=conditioned_text,
        ).sample

        # Standard MSE loss
        loss = nn.functional.mse_loss(noise_pred, noise)
        return loss
```
```python
import torch

# ============================================================
# Guidance scale analysis
# ============================================================


def analyze_cfg_effect():
    """
    Demonstrates CFG math: how guidance scale affects prediction direction.
    Uses the Stable Diffusion convention, where w = 1 is pure conditional.
    """
    print("CFG Guidance Scale Analysis")
    print("=" * 60)
    print()
    print("Formula: eps_guided = eps_uncond + w * (eps_cond - eps_uncond)")
    print()

    # Simulate noise predictions with independent random tensors.
    # This is a toy stand-in: real conditional and unconditional
    # predictions from one model are strongly correlated.
    torch.manual_seed(42)
    eps_cond = torch.randn(4, 64, 64)    # conditional prediction
    eps_uncond = torch.randn(4, 64, 64)  # unconditional prediction

    # Compute guided predictions at different scales
    print(f"{'Scale w':>10} | {'||guided||':>12} | {'Align w/ cond':>15} | Interpretation")
    print("-" * 75)
    for w in [0.0, 1.0, 3.0, 7.5, 12.0, 20.0]:
        guided = eps_uncond + w * (eps_cond - eps_uncond)
        norm = guided.norm().item()
        # Cosine similarity with conditional prediction
        cos_sim = torch.nn.functional.cosine_similarity(
            guided.flatten().unsqueeze(0),
            eps_cond.flatten().unsqueeze(0),
        ).item()
        if w == 0.0:
            interp = "pure unconditional (prompt ignored)"
        elif w == 1.0:
            interp = "pure conditional (no extrapolation)"
        elif w <= 5.0:
            interp = "balanced quality/diversity"
        elif w <= 10.0:
            interp = "strong conditioning (SD default range)"
        elif w <= 15.0:
            interp = "very strong - risk of artifacts"
        else:
            interp = "extreme - artifacts likely"
        print(f"{w:>10.1f} | {norm:>12.2f} | {cos_sim:>15.3f} | {interp}")

    print()
    print("Key insight:")
    print("  At w=7.5: guided = 7.5*cond - 6.5*uncond - significant extrapolation")
    print("  At w>15: prediction tends to leave the natural image manifold → artifacts")
    print("  Dynamic thresholding rescales out-of-range x0 predictions back to [-1, 1]")


if __name__ == "__main__":
    analyze_cfg_effect()
```
10. YouTube Resources
| Title | Channel | Why Watch |
|---|---|---|
| Diffusion Models Beat GANs - Classifier Guidance | Yannic Kilcher | Dhariwal and Nichol paper - classifier guidance original derivation |
| Classifier-Free Diffusion Guidance Explained | Outlier | CFG derivation with intuitive math and visual examples |
| Stable Diffusion from Scratch | Umar Jamil | Full implementation including CFG sampling loop with shapes |
| Imagen: Photorealistic T2I with Dynamic Thresholding | Two Minute Papers | Dynamic thresholding and high-guidance quality improvements |
| Negative Prompts in Stable Diffusion | Sebastian Kamph | Practical negative prompting strategies and mechanics |
11. Production Engineering Notes
:::warning CFG doubles inference compute - the standard optimization
Running two U-Net forward passes doubles compute per step. Always batch both branches: torch.cat([latents, latents], dim=0) gives batch size 2B, and both conditional and unconditional are processed in one forward pass. This is 30-40% faster than two sequential single-batch calls because GPU utilization is better at larger batch sizes. Every production deployment (Diffusers, ComfyUI, InvokeAI) uses this optimization. Never run two separate forward passes in a loop.
:::
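The batching pattern can be sketched as follows. This is a minimal illustration rather than the code of any particular library: `toy_unet` is a hypothetical stand-in for a real denoiser, the tensor shapes are arbitrary, and the combination uses the `cond + w * (cond - uncond)` form in which `w = 0` means no guidance.

```python
import torch


def toy_unet(latents, t, text_embed):
    """Hypothetical stand-in for a U-Net (ignores t); any callable with this signature works."""
    shift = text_embed.mean(dim=(1, 2)).view(-1, 1, 1, 1)  # one scalar per sample
    return 0.1 * latents + shift


def cfg_noise_pred(latents, t, cond_embed, null_embed, w):
    """Compute the guided prediction with ONE forward pass at batch size 2B."""
    # Duplicate the latents so the batch is laid out as [uncond | cond].
    latent_in = torch.cat([latents, latents], dim=0)
    text_in = torch.cat([null_embed, cond_embed], dim=0)
    noise_pred = toy_unet(latent_in, t, text_in)
    # Split in the same order the batch was built: first half uncond, second half cond.
    eps_uncond, eps_cond = noise_pred.chunk(2, dim=0)
    return eps_cond + w * (eps_cond - eps_uncond)


B = 2
latents = torch.randn(B, 4, 8, 8)
cond_embed = torch.ones(B, 77, 16)    # placeholder "prompt" embedding
null_embed = torch.zeros(B, 77, 16)   # placeholder "empty prompt" embedding
guided = cfg_noise_pred(latents, t=10, cond_embed=cond_embed, null_embed=null_embed, w=7.5)
print(guided.shape)
```

Building the latent batch and the text batch in the same order, and unpacking with named variables right after the forward pass, keeps the two halves from being confused later in the loop.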
:::tip Skipping CFG on early high-noise steps saves compute
Guidance scale has diminishing quality returns in the high-noise timesteps (large $t$). At $t$ near $T$, the image is pure noise and even without guidance the model correctly identifies global composition. Kynkäänniemi et al. (2024) showed that skipping CFG for approximately the first 40% of timesteps (high-noise regime) saves 20% of total inference compute with minimal FID/CLIP-score degradation. Practical implementation: use full CFG only when `t < 0.6 * T`.
:::
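The 40% / 20% relationship can be checked by counting denoiser evaluations. The sketch below assumes that a CFG step costs two evaluations and a non-CFG step costs one, which treats evaluation count as a proxy for compute; `count_unet_evals` and `cfg_fraction` are illustrative names, not part of any library.

```python
def count_unet_evals(num_steps, cfg_fraction=0.6):
    """Count denoiser evaluations when CFG is applied only for t < cfg_fraction * T.

    Assumes sampling visits t = T-1, ..., 0, that a CFG step costs two
    evaluations (cond + uncond), and that a non-CFG step costs one (cond only).
    """
    evals = 0
    for t in reversed(range(num_steps)):
        use_cfg = t < cfg_fraction * num_steps
        evals += 2 if use_cfg else 1
    return evals


T = 50
baseline = 2 * T              # CFG on every step
gated = count_unet_evals(T)   # CFG only on the low-noise 60% of steps
saving = 1 - gated / baseline
print(f"{gated} vs {baseline} evaluations, saving {saving:.0%}")
# 80 vs 100 evaluations, saving 20%
```

Skipping guidance on 40% of the steps removes one of the two evaluations on each of those steps, which is where the 20% figure comes from. With batched passes the real wall-clock saving can differ from this count.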
:::note Guidance scale is not universal - retune per model
A guidance scale that works well for SD 1.5 (such as the $w = 7.5$ default) will produce oversaturated results with SDXL and with SD3/FLUX, which work better at lower scales. Rectified-flow-based models (SD3, FLUX) have different noise schedules that make the effective guidance strength nonlinear in $w$. Always calibrate guidance scale per model by examining FID-CLIP score curves on a validation set, not by transferring values from other models.
:::
12. Common Mistakes
:::danger Two separate forward passes instead of one batched pass
Running `eps_cond = unet(x, t, cond_embed)` and then `eps_uncond = unet(x, t, null_embed)` as separate calls is wasteful. Modern GPUs are heavily underutilized at batch size 1. Concatenate both inputs along the batch dimension and run one forward pass: the GPU processes both halves in parallel, so one call at batch size 2 takes close to the wall-clock time of one call at batch size 1. This is the most impactful single optimization in a CFG sampling loop.
:::
:::danger Reversed unconditional/conditional batch ordering
The standard pattern is `torch.cat([null_embed, cond_embed])` as the text input, meaning `noise_pred[0]` is unconditional and `noise_pred[1]` is conditional. Swapping the order without adjusting the CFG formula inverts the guidance direction: `guided = cond + w * (uncond - cond)` moves away from the prompt, so the model is steered toward images that do not match the text description. This is a silent failure: no error is raised, and the images can look plausible while ignoring the prompt. Always annotate which half of the batch corresponds to which conditioning.
:::
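A toy one-dimensional check shows the sign flip. The values are made up for illustration, and `cfg_from_batch` is a hypothetical helper; it uses the `uncond + w * (cond - uncond)` form, whose mirror image is the swapped expression quoted in the warning.

```python
def cfg_from_batch(first_half, second_half, w):
    """Combine predictions, ASSUMING first_half is unconditional and second_half conditional."""
    eps_uncond, eps_cond = first_half, second_half
    return eps_uncond + w * (eps_cond - eps_uncond)


# Toy scalar "noise predictions": the prompt direction is +1 relative to the unconditional one.
eps_uncond, eps_cond = 0.0, 1.0
w = 7.5

correct = cfg_from_batch(eps_uncond, eps_cond, w)  # batch built as [null, cond]
swapped = cfg_from_batch(eps_cond, eps_uncond, w)  # batch built as [cond, null] by mistake

print(correct)  # 7.5
print(swapped)  # -6.5
```

The swapped call runs without any error and returns a value of similar magnitude, just on the wrong side of the unconditional prediction, which is why the bug is easy to miss.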
:::warning Guidance scale too high for the noise schedule
At very low step counts (fewer than 20 steps with DDIM), a high guidance scale amplifies discretization errors. Each Euler step accumulates error, and at high $w$ this error is multiplied. The result: oversaturated images with edge artifacts when a high $w$ is combined with very few steps. Use DPM-Solver-2 or DPM-Solver-3 when combining high guidance scale with low step count - they are more robust to coarse discretization.
:::
:::warning Prompt truncation is silent
CLIP's text encoder has a fixed context of 77 tokens, including the start and end tokens. Exceeding this limit truncates the prompt without raising an error; at most a log message is emitted, and the remaining tokens are simply dropped. A descriptive 150-word prompt loses everything after the first ~77 tokens. The words at the end of the prompt are the most likely to be cut. Reorder prompts to put the most important descriptors first. Use a token counter during development. Some implementations support long prompts via averaging multiple CLIP windows, but this loses cross-window attention context.
:::
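A development-time budget check might look like the sketch below. It is deliberately tokenizer-agnostic: `rough_tokenize` is a crude whitespace placeholder and not the CLIP BPE tokenizer, so in practice the token list should come from the tokenizer that the pipeline actually uses. The helper names are hypothetical.

```python
CLIP_CONTEXT = 77                  # total positions, including start and end tokens
CONTENT_BUDGET = CLIP_CONTEXT - 2  # positions left for the prompt itself


def split_at_budget(tokens, budget=CONTENT_BUDGET):
    """Return (kept, dropped) for a list of prompt tokens."""
    return tokens[:budget], tokens[budget:]


def rough_tokenize(prompt):
    """Whitespace split: a placeholder only, NOT the CLIP tokenizer.

    Real BPE tokenization usually yields more tokens than words, so this undercounts.
    """
    return prompt.split()


prompt = " ".join(f"word{i}" for i in range(100))
kept, dropped = split_at_budget(rough_tokenize(prompt))
if dropped:
    print(f"{len(dropped)} tokens would be truncated, starting at: {dropped[0]!r}")
```

Reporting the first dropped token makes it obvious which descriptors need to move earlier in the prompt.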
13. Interview Q&A
Q1: Derive the CFG formula from Bayes' theorem.
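Classifier guidance modifies the score function with a scaled classifier gradient: $\nabla_{x_t} \log p(x_t \mid c) + w\,\nabla_{x_t} \log p(c \mid x_t)$. By Bayes: $p(c \mid x_t) \propto p(x_t \mid c) / p(x_t)$. Taking gradients: $\nabla_{x_t} \log p(c \mid x_t) = \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t)$. The right-hand side is the conditional score minus the unconditional score. Substituting it into the guided score gives $(1 + w)\,\nabla_{x_t} \log p(x_t \mid c) - w\,\nabla_{x_t} \log p(x_t)$. Converting to noise predictions (using $\epsilon_\theta \approx -\sigma_t \nabla_{x_t} \log p$):

$$\tilde{\epsilon} = (1 + w)\,\epsilon_\theta(x_t, c) - w\,\epsilon_\theta(x_t, \varnothing) = \epsilon_\theta(x_t, c) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big)$$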
The same model produces both the conditional and unconditional prediction by training with unconditional dropout - no separate classifier needed. The implicit classifier gradient is derived from the difference between two outputs of the same network.
Q2: What are the failure modes of classifier guidance that CFG avoids?
Four problems. First: classifier guidance requires a separate noise-aware classifier trained at all noise levels - doubling training cost. CFG uses the same model twice. Second: the classifier is coupled to the specific noise schedule; change the schedule and retrain the classifier. CFG has no such coupling. Third: for open-vocabulary text conditioning, you cannot train a classifier over all possible prompts. CFG handles any text embedding. Fourth: the classifier gradient is computed via backprop, which creates adversarial-like artifacts at high guidance scale. CFG uses a simple linear combination of forward passes, no backprop at inference.
Q3: What is the effect of guidance scale on FID and CLIP score, and what scale would you use in production?
There is a fundamental trade-off: higher $w$ increases CLIP score (better prompt adherence) and increases FID (worse image quality/diversity). The Pareto-optimal region for most SD-scale models lies at moderate $w$, between the following extremes. At low $w$: low CLIP score, weak conditioning. At $w = 7.5$: SD default - strong conditioning with acceptable FID. At very high $w$: high CLIP score but FID worsens sharply due to mode-seeking behavior and out-of-distribution predictions. Production choice: start from the model's default (for example $w = 7.5$ for SD 1.5) and calibrate per model, since SDXL has a different optimum.
Q4: How does negative prompting work mathematically?
Negative prompting replaces the null/empty embedding with the embedding of a negative text description $c_{\text{neg}}$:

$$\tilde{\epsilon} = \epsilon_\theta(x_t, c) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, c_{\text{neg}})\big)$$

The guidance simultaneously pushes toward $c$ and away from $c_{\text{neg}}$. This works because the CFG formula is linear in the two noise predictions - any two conditionings define a direction in prediction space. Practically effective for concrete visual attributes (blurry, deformed, watermark) because CLIP embeds these with strong visual meaning. Less effective for abstract concepts (emotion, negation) where CLIP embeddings are noisy.
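The only change from plain CFG is which prediction serves as the reference, as this small sketch suggests. The vectors are made-up two-dimensional stand-ins for noise predictions, and `guided_prediction` is a hypothetical helper that uses the `cond + w * (cond - ref)` form.

```python
def guided_prediction(eps_cond, eps_ref, w):
    """eps_cond + w * (eps_cond - eps_ref), applied element-wise to flat lists.

    eps_ref is the null-prompt prediction for plain CFG, or the
    negative-prompt prediction for negative prompting.
    """
    return [c + w * (c - r) for c, r in zip(eps_cond, eps_ref)]


# Toy 2-D predictions (illustrative numbers, not model outputs).
eps_cond = [1.0, 0.0]  # positive prompt
eps_null = [0.0, 0.0]  # empty prompt
eps_neg = [0.0, 1.0]   # negative prompt, e.g. "blurry"

plain = guided_prediction(eps_cond, eps_null, w=2.0)
with_negative = guided_prediction(eps_cond, eps_neg, w=2.0)
print(plain)          # [3.0, 0.0]
print(with_negative)  # [3.0, -2.0]
```

In the second result the component aligned with the negative prompt is pushed below zero, while the component aligned with the positive prompt is unchanged.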
Q5: What is dynamic thresholding and when is it needed?
Dynamic thresholding (Imagen, Saharia et al. 2022) addresses oversaturation at high $w$. With large guidance scale, the guided noise prediction is large, causing the predicted $\hat{x}_0$ to exceed $[-1, 1]$. Static clipping discards directional information - everywhere that exceeds $\pm 1$ becomes $\pm 1$, losing the gradient. Dynamic thresholding instead clips to $[-s, s]$, where $s$ is a high percentile of $|\hat{x}_0|$, and rescales by $1/s$. This preserves the direction of the prediction while preventing magnitude explosion. Use it when deploying with $w > 10$ and high-realism requirements. Minimal compute overhead - just one percentile computation per step.
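The difference from static clipping is easiest to see on a handful of values. The sketch below works on a flat Python list for readability and uses a nearest-rank percentile, which is a simplification of what a tensor library would compute per sample; the function names and the sample values are illustrative assumptions.

```python
import math


def dynamic_threshold(x0, percentile=0.995):
    """Clip predicted pixel values to [-s, s], then divide by s.

    s is a nearest-rank percentile of |x0|, floored at 1.0 so that
    predictions already inside [-1, 1] pass through unchanged.
    """
    magnitudes = sorted(abs(v) for v in x0)
    rank = max(0, math.ceil(percentile * len(magnitudes)) - 1)
    s = max(1.0, magnitudes[rank])
    return [max(-s, min(s, v)) / s for v in x0]


def static_clip(x0):
    """Baseline: hard clip to [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in x0]


x0_pred = [0.2, -0.4, 0.6, -0.8, 1.0, 1.5, 2.0, 8.0]  # made-up oversaturated prediction
print(static_clip(x0_pred))
# [0.2, -0.4, 0.6, -0.8, 1.0, 1.0, 1.0, 1.0]
print(dynamic_threshold(x0_pred, percentile=0.875))
# [0.1, -0.2, 0.3, -0.4, 0.5, 0.75, 1.0, 1.0]
```

Static clipping collapses 1.0, 1.5, 2.0, and 8.0 to the same value, while the dynamic version keeps the first three distinct and only saturates the outlier.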
Q6: How does CFG training differ from standard conditional training?
The only difference is unconditional dropout: with probability $p_{\text{uncond}}$ (typically 0.1), replace the text conditioning with the null/empty-string CLIP embedding. No change to the loss function, optimizer, model architecture, or anything else. This forces the model to learn unconditional generation in addition to conditional generation. At inference, the same model run with null conditioning provides the unconditional baseline. 10% unconditional steps is typically optimal - enough to learn good unconditional generation without significantly degrading conditional quality. The result: a CFG-capable model with zero architectural changes and negligible training overhead.
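In a data pipeline, the dropout step can be as small as the sketch below. It operates on prompt strings for readability, on the assumption that the empty string is later encoded to the null embedding; a real training loop may instead swap the embeddings directly. `apply_unconditional_dropout` and `NULL_PROMPT` are hypothetical names.

```python
import random

NULL_PROMPT = ""  # assumed to encode to the null/empty-string embedding


def apply_unconditional_dropout(prompts, p_uncond=0.1, rng=random):
    """Replace each prompt with the null prompt independently with probability p_uncond."""
    return [NULL_PROMPT if rng.random() < p_uncond else p for p in prompts]


rng = random.Random(0)  # seeded only to make the demo repeatable
batch = ["a red dragon over a castle"] * 10_000
dropped = apply_unconditional_dropout(batch, p_uncond=0.1, rng=rng)
fraction_null = sum(p == NULL_PROMPT for p in dropped) / len(dropped)
print(f"fraction trained unconditionally: {fraction_null:.3f}")
```

Dropping per sample rather than per batch means every batch mixes conditional and unconditional examples, and nothing else in the training step has to change.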
Q7: Why is CFG at $w > 0$ considered extrapolation, and what does this mean practically?
For $w > 0$, the guided prediction lies beyond the conditional prediction in the direction away from the unconditional. It is an extrapolation past $\epsilon_\theta(x_t, c)$ rather than interpolation between $\epsilon_\theta(x_t, \varnothing)$ and $\epsilon_\theta(x_t, c)$. Practically: at $w = 7.5$, the guided prediction weights the conditional prediction 8.5x and subtracts 7.5x the unconditional prediction. The result points far outside the convex hull of $\epsilon_\theta(x_t, \varnothing)$ and $\epsilon_\theta(x_t, c)$. This produces more extreme images in the conditional direction - stronger, more saturated, more committed to the prompt. Above $w = 15$, the extrapolation leaves the natural image manifold entirely, producing images that match the prompt but look artificial, with colors and textures more intense than physically possible.
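A one-dimensional worked example makes the extrapolation concrete. The numbers are illustrative assumptions, not model outputs: take $\epsilon_\theta(x_t, \varnothing) = 0$ and $\epsilon_\theta(x_t, c) = 1$, so any interpolation would stay inside $[0, 1]$.

$$
\begin{aligned}
\tilde{\epsilon} &= (1 + w)\,\epsilon_\theta(x_t, c) - w\,\epsilon_\theta(x_t, \varnothing) \\
&= 8.5 \cdot 1 - 7.5 \cdot 0 = 8.5 \qquad (w = 7.5)
\end{aligned}
$$

The guided value sits 7.5 units past the conditional prediction, well outside the interval spanned by the two model outputs.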
This lesson is part of the Diffusion Models module. Next: Fine-Tuning Diffusion Models - DreamBooth, LoRA, Textual Inversion, ControlNet.
