Diffusion Models Beyond Images - Audio, Video, 3D, Molecules, Text
:::note
Reading time: ~55 minutes | Interview relevance: High | Target roles: ML Engineer, Research Engineer, AI Engineer, Applied Scientist
:::
The Real Interview Moment
The interviewer has just watched the Sora demo - 60-second photorealistic videos with consistent physics, objects that respect gravity, cameras that track action coherently. "Walk me through what makes video diffusion different from image diffusion. Then explain how they apparently scaled this to minute-long videos at 1080p."
Then a harder question: "How would you apply diffusion to generate protein structures? What changes - the data representation, the noise process, the denoising network architecture?"
These questions test whether you understand diffusion as a general framework or just as "the thing inside Stable Diffusion." The mathematical core - forward noising, reverse denoising, reconstruction loss - is modality-agnostic. What changes for each new domain is: the data representation (waveforms, voxels, point clouds, token sequences, amino acid backbone frames), the architecture (1D convolutions, 3D U-Net, SE(3)-equivariant graph network), and the noise process (Gaussian on continuous spaces, Brownian motion on SO(3) for rotation groups, absorbing-state masking for discrete tokens).
Understanding these generalizations is what separates engineers who can deploy existing diffusion systems from engineers who can design new ones for novel domains. This lesson builds the unified framework for all of them.
Why This Exists - Diffusion as a Universal Framework
The success of DDPM and LDMs created a template that researchers immediately began applying to every data modality where generation is a hard problem. The appeal is clear: diffusion models offer stable training (no adversarial dynamics), high sample quality, and well-understood likelihood bounds - advantages over GAN-based approaches that had previously dominated audio synthesis and 3D generation.
Three conditions must hold for diffusion to apply to a new domain:
1. A forward noising process: a way to gradually corrupt data into a known prior. For continuous data, Gaussian diffusion works directly. For data embedded in a curved space (rotation groups, graph topology), you need Riemannian diffusion adapted to the geometry. For discrete data (tokens), you need a categorical corruption process.
2. A denoising network: a neural architecture that takes corrupted data and a noise level as input and predicts the clean data or the noise. The architecture must respect the geometry of the data - 1D convolutions for audio sequences, 3D convolutions for video, equivariant networks for molecular geometry that should not change under rotations.
3. A training loss: a reconstruction objective that is well-behaved and provides dense gradient signal. MSE on noise prediction works for continuous domains. Cross-entropy on masked tokens works for discrete domains.
The adaptation challenge for each domain is in how you satisfy these three conditions, particularly when the data is not a continuous vector in $\mathbb{R}^d$.
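As a rough illustration of condition 1, the sketch below contrasts a Gaussian forward process for continuous data with an absorbing-state (masking) process for discrete tokens. The function names, the `MASK` symbol, and the list-based data layout are illustrative assumptions, not taken from any particular library.

```python
import math
import random

MASK = "[MASK]"  # hypothetical absorbing-state symbol


def gaussian_forward(x0, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I) for a list of floats."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps  # eps is the regression target for a noise-prediction loss


def absorbing_forward(tokens, mask_prob, rng):
    """Independently replace each token with MASK with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


rng = random.Random(0)
noisy, target_noise = gaussian_forward([0.5, -1.0, 2.0], alpha_bar=0.9, rng=rng)
print(noisy)
print(absorbing_forward(["the", "cat", "sat"], mask_prob=0.5, rng=rng))
```

In both cases the corruption level (`alpha_bar` or `mask_prob`) plays the role of the timestep: one extreme leaves the data untouched, the other reaches the known prior (pure Gaussian noise or an all-mask sequence).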
Historical Context
The generalization of diffusion beyond images happened with striking speed in 2020-2024. WaveGrad (Chen et al. 2020) was among the first to apply diffusion to raw audio waveforms for neural text-to-speech. DiffWave (Kong et al. 2021) improved quality and conditioning flexibility. AudioLDM (Liu et al. 2023) brought the latent diffusion approach to general audio with text conditioning via CLAP embeddings.
DreamFusion (Poole et al. 2022, Google Brain) introduced Score Distillation Sampling - a conceptually profound idea that allowed a pretrained 2D diffusion model to serve as a prior for optimizing a 3D NeRF, requiring no 3D training data. This sidestepped the scarcity of 3D training data and deepened our understanding of what a diffusion model's score function actually represents.
Video Diffusion Models (Ho et al. 2022, Google Brain) introduced a 3D U-Net that interleaves temporal attention layers with the spatial layers of a 2D image diffusion architecture. Make-A-Video (Meta, 2022) and Imagen Video (Google, 2022) refined this approach. Sora (OpenAI, 2024) scaled it dramatically with spatiotemporal DiT patches and compression via a 3D VAE, producing videos of up to 60 seconds at up to 1080p.
In the biological sciences, FrameDiff (Yim et al. 2023) and RFdiffusion (Watson et al. 2023, David Baker lab) brought diffusion to protein backbone generation using SE(3)-equivariant networks on residue frame representations. DiffSBDD (Schneuing et al. 2022) applied equivariant diffusion to structure-based drug design - generating drug-like molecules conditioned on protein binding pockets.
Diffusion-LM (Li et al. 2022) tackled the hard problem of applying diffusion to discrete text tokens via embedding-space diffusion. MDLM (Sahoo et al. 2024) improved on this with masked diffusion that more naturally handles discrete vocabularies.
The Unified Framework: What Stays the Same, What Changes
Before diving into individual domains, it is worth making the unifying structure explicit.
What stays the same across all domains:
- The training loss structure: optimize a denoising network to undo corruption
- The inference procedure: iterative denoising from noise toward data
- The conditioning mechanism: cross-attention or concatenation of conditioning signals
- The guidance mechanism: classifier-free guidance is domain-agnostic
What changes for each domain:
| Component | Images | Audio | Video | Molecules | Text |
|---|---|---|---|---|---|
| Data representation | 2D pixel grid / 2D latent | 1D waveform or 2D mel | 3D spatiotemporal | 3D point set + graph | Discrete token sequence |
| Forward noise | Gaussian on pixels/latents | Gaussian on waveform/spectrogram | Gaussian on spatiotemporal latents | Gaussian (translation) + Brownian on SO(3) | Masking / token absorption |
| Denoising architecture | 2D U-Net or DiT | 1D U-Net or 2D U-Net on spectrogram | 3D U-Net or spatiotemporal DiT | E(3)-equivariant GNN | Bidirectional transformer |
| Key challenge | Resolution scaling | High temporal resolution (44.1 kHz) | Temporal consistency across frames | Chemical validity, symmetry equivariance | Discrete tokens, rounding error |
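Classifier-free guidance is listed above as domain-agnostic because it only combines two predictions of the same shape. A minimal sketch, assuming the predictions are plain lists of floats (in practice they would be tensors of noise predictions, and `classifier_free_guidance` is a hypothetical helper name):

```python
def classifier_free_guidance(pred_uncond, pred_cond, scale):
    """Elementwise pred_uncond + scale * (pred_cond - pred_uncond)."""
    return [u + scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


uncond = [1.0, 0.0]
cond = [3.0, -1.0]
print(classifier_free_guidance(uncond, cond, scale=0.0))  # unconditional prediction
print(classifier_free_guidance(uncond, cond, scale=1.0))  # conditional prediction
print(classifier_free_guidance(uncond, cond, scale=7.5))  # extrapolates past the conditional
```

Nothing in the combination depends on whether the prediction describes pixels, spectrogram bins, or atom coordinates, which is why the same mechanism carries over between modalities.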
1. Audio Diffusion
The Two Representations: Waveform vs Spectrogram
Audio generation faces a fundamental choice that shapes every architectural decision.
Raw waveform: a 1-second clip at 44.1kHz stereo = 88,200 float values. This is a 1D signal with multi-scale temporal structure: individual wave cycles at sub-millisecond to millisecond timescales carry timbre and pitch; phrases and melodies at second timescales carry musical content. The challenge: modeling all these scales simultaneously with a single architecture requires 1D U-Nets with very large receptive fields and many downsampling levels.
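Mel spectrogram: a 2D time-frequency representation that shortens the time axis by the STFT hop length (typically a few hundred samples per frame) while preserving perceptually relevant information. A 1-second clip at 44.1kHz becomes roughly a $128 \times 87$ mel spectrogram (128 frequency bins × 87 time frames, which corresponds to a hop length of 512 samples). Because each frame carries 128 values, the total value count drops by only about 4x (44,100 samples versus 11,136 bins); the main gains are a much shorter sequence and the removal of phase. This looks like an image - frequency on y-axis, time on x-axis, amplitude as pixel intensity. Standard 2D U-Net architectures from image diffusion can be applied directly, with a separate vocoder (HiFi-GAN, BigVGAN) converting spectrograms back to waveforms.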
The tradeoff: waveform diffusion requires more complex architectures but enables end-to-end learning of all audio details. Spectrogram diffusion is simpler and leverages the mature 2D U-Net literature, but vocoder artifacts can limit quality, especially for music.
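To make the sizes of the two representations concrete, the following sketch computes the number of values in each. The hop length of 512 and the centered framing rule (one extra frame) are assumptions chosen to reproduce the 87-frame figure quoted above; other STFT settings give different counts.

```python
def waveform_values(duration_sec, sample_rate=44_100, channels=2):
    """Number of float values in a raw waveform clip."""
    return int(duration_sec * sample_rate) * channels


def mel_shape(duration_sec, sample_rate=44_100, hop_length=512, n_mels=128):
    """(n_mels, frames) for a mono clip, assuming centered STFT framing."""
    frames = int(duration_sec * sample_rate) // hop_length + 1
    return n_mels, frames


print(waveform_values(1.0))  # 88200
print(mel_shape(1.0))        # (128, 87)
```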
WaveGrad - Waveform Diffusion for Neural TTS
WaveGrad (Chen et al. 2020, Google Brain) was one of the first demonstrations that Gaussian diffusion on raw waveforms produces competitive text-to-speech quality. The architecture is a 1D U-Net with multiple encoder and decoder levels, conditioned on mel spectrograms via feature-wise linear modulation (FiLM).
The conditioning structure is important: the mel spectrogram provides a coarse acoustic plan (what speech sounds like), while diffusion models the fine-grained waveform details (exact sample-level audio texture that makes speech sound natural rather than synthetic).
Key results: WaveGrad at 1000 diffusion steps achieved near-WaveNet quality for TTS. Later variants with optimized schedules reduced this to 6-50 steps with minimal quality loss, making it practical for real-time inference.
The 1D architecture has downsampling and upsampling operations along the time dimension, with skip connections across resolution levels. Timestep conditioning is injected via normalization layer scales (analogous to adaptive instance normalization in image generation).
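The conditioning described above can be sketched in two small pieces: a sinusoidal embedding of the timestep, and a feature-wise scale and shift applied per channel. This is an illustrative sketch only; in a real network `gamma` and `beta` would come from learned projections of the embedding, and the function names here are hypothetical.

```python
import math


def sinusoidal_embedding(t, dim, max_period=10_000.0):
    """Map a scalar timestep to `dim` features (sines first, then cosines)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def film(features, gamma, beta):
    """Feature-wise linear modulation: one (gamma, beta) pair per channel.

    features: list of channels, each a list of values along time.
    """
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


print(sinusoidal_embedding(0, 4))
print(film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[0.0, 1.0]))
```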
AudioLDM - Latent Diffusion for General Audio
AudioLDM (Liu et al. 2023) applied the LDM framework to audio, enabling text-conditioned generation of arbitrary sounds, music, and speech:
Audio → VAE encoder → compact latent (e.g., 8×64 from 128×512 spectrogram)
↓
Diffusion in latent space
Conditioned via CLAP text-audio embeddings
↓
VAE decoder → mel spectrogram
↓
HiFi-GAN vocoder → waveform
CLAP (Contrastive Language-Audio Pretraining): analogous to CLIP for images, CLAP aligns text and audio embeddings in a shared space. It is trained on 300K+ audio-text pairs from AudioCaps, WavCaps, and FreeSound. CLAP enables prompts like "rain on a metal roof," "a jazz trumpet solo at 120 BPM," "crowd cheering in a stadium."
AudioLDM 2 extended this to music generation, sound effect synthesis, and speech continuation. The key improvement: a more powerful audio VAE with better spectrogram reconstruction fidelity, and larger CLAP conditioning for richer semantic understanding.
Stable Audio - Long-Form Music Generation
Stability AI's Stable Audio (Evans et al. 2023) generates up to 90-second audio at 44.1kHz stereo by pushing the latent compression to a higher ratio, allowing diffusion over the entire audio structure in a compact latent space:
Key innovations:
- Autoencoder with large compression factor: the temporal compression ratio is around 2048x relative to the raw waveform, enabling even long clips to be represented as manageable latent sequences
- Timing conditioning: the model is conditioned on start time, total duration, and sample rate as Fourier-encoded features. This allows generating specific sections of longer tracks ("generate bars 8-16") and precise control over output length
- T5 text conditioning: uses a large text encoder (T5-XXL) for richer semantic conditioning than CLAP alone, enabling detailed prompts like "cinematic orchestral music, dramatic, with French horns and cello, 140 BPM, minor key, for a battle scene"
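The effect of the compression factor in the first bullet can be checked with simple arithmetic. The sketch below uses the figures quoted in this section (44.1kHz, roughly 2048x temporal compression, 90 seconds); `latent_frames` is a hypothetical helper name.

```python
def latent_frames(duration_sec, sample_rate=44_100, compression=2048):
    """Length of the latent sequence for a clip of the given duration."""
    return int(duration_sec * sample_rate) // compression


print(latent_frames(90))   # 1937 latent frames for a 90-second clip
print(44_100 / 2048)       # about 21.5 latent frames per second
```

A sequence of roughly two thousand latent frames is short enough for a single diffusion model to attend over the whole track, which is what makes long-form structure tractable.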
Code: Audio Spectrogram Diffusion
"""
Reference sketches for this lesson: mel spectrogram utilities and a WaveGrad-style
block for audio, temporal attention for video, Score Distillation Sampling for 3D,
and a masked diffusion language model for text.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import Optional, Tuple
# ============================================================
# Mel spectrogram conversion utilities
# ============================================================
class MelSpectrogramConverter:
    """
    Convert a waveform to its mel spectrogram representation.
    The mel spectrogram is the "image" that diffusion will operate on.
    """
    def __init__(
        self,
        sample_rate: int = 22050,
        n_fft: int = 1024,
        hop_length: int = 256,
        n_mels: int = 128,
    ):
        self.sample_rate = sample_rate
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.n_mels = n_mels
    def waveform_to_mel(self, waveform: torch.Tensor) -> torch.Tensor:
        """
        Convert waveform to a standardized log-mel spectrogram.
        Args:
            waveform: (T,) mono audio samples at sample_rate Hz
        Returns:
            mel: (n_mels, T // hop_length + 1) log-mel spectrogram
                 (torchaudio pads the signal by default, center=True)
        """
        # Import lazily so the rest of the module works without torchaudio
        try:
            import torchaudio
        except ImportError as exc:
            raise ImportError("torchaudio required for mel spectrogram conversion") from exc
        mel_transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=self.sample_rate,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            n_mels=self.n_mels,
        )
        mel = mel_transform(waveform)
        # Log-compress, then standardize per clip to zero mean and unit variance
        # (this is not a hard [-1, 1] range)
        mel = torch.log(mel + 1e-6)
        mel = (mel - mel.mean()) / (mel.std() + 1e-6)
        return mel
    def mel_shape_for_duration(self, duration_sec: float) -> Tuple[int, int]:
        """Returns the (n_mels, time_frames) shape for a given audio duration."""
        time_samples = int(duration_sec * self.sample_rate)
        # +1 matches the centered framing used by waveform_to_mel
        time_frames = time_samples // self.hop_length + 1
        return (self.n_mels, time_frames)
# ============================================================
# 1D U-Net for waveform diffusion (WaveGrad-style)
# ============================================================
class WaveformDiffusionBlock(nn.Module):
    """
    1D U-Net residual block for waveform diffusion.
    Conditioned on mel spectrogram features via FiLM (feature-wise linear modulation).
    """
    def __init__(self, channels: int, dilation: int = 1, cond_dim: int = 128):
        super().__init__()
        # Dilated convolution for large receptive field; padding=dilation keeps T unchanged
        self.conv = nn.Conv1d(
            channels, channels * 2,
            kernel_size=3, padding=dilation, dilation=dilation
        )
        # FiLM conditioning: mel features -> per-channel scale and shift
        self.film_scale = nn.Linear(cond_dim, channels)  # cond_dim = mel conditioning channels
        self.film_shift = nn.Linear(cond_dim, channels)
        self.res_proj = nn.Conv1d(channels, channels, kernel_size=1)
    def forward(self, x: torch.Tensor, mel_cond: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (B, C, T) waveform features at current resolution
            mel_cond: (B, cond_dim) mel conditioning vector (mean-pooled)
        Returns:
            (B, C, T) updated features
        """
        # FiLM parameters, broadcast over the time axis
        scale = self.film_scale(mel_cond).unsqueeze(-1)  # (B, C, 1)
        shift = self.film_shift(mel_cond).unsqueeze(-1)  # (B, C, 1)
        # Gated activation
        h = self.conv(x)  # (B, 2C, T)
        h_tanh, h_sigmoid = h.chunk(2, dim=1)  # (B, C, T) each
        h = torch.tanh(h_tanh) * torch.sigmoid(h_sigmoid)
        # FiLM: multiplicative scale and additive shift per channel.
        # Using (1 + scale) centers the modulation around the identity.
        h = (1.0 + scale) * h + shift
        return x + self.res_proj(h)  # residual connection
# ============================================================
# Temporal attention for video diffusion
# ============================================================
class TemporalAttentionBlock(nn.Module):
    """
    Temporal self-attention block for video diffusion.
    Added to each spatial attention block in the 2D U-Net to create a 3D U-Net.
    The 2D U-Net processes (B*T, C, H, W) - each frame independently.
    Temporal attention reshapes to (B*H*W, T, C) - each spatial location over time.
    This allows each frame's features at position (h,w) to attend to
    all other frames' features at the same spatial position.
    """
    def __init__(self, channels: int, num_heads: int = 8, num_frames: int = 16):
        super().__init__()
        self.channels = channels
        self.num_heads = num_heads
        self.num_frames = num_frames
        self.head_dim = channels // num_heads
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        self.out_proj = nn.Linear(channels, channels)
        self.norm = nn.LayerNorm(channels)
        # Critical: zero-initialize output projection
        # The temporal attention starts as a no-op so the pretrained
        # image model behavior is preserved at training start.
        nn.init.zeros_(self.out_proj.weight)
        nn.init.zeros_(self.out_proj.bias)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (B*T, C, H, W) spatial features for all frames
        Returns:
            (B*T, C, H, W) temporally attended features
        """
        BT, C, H, W = x.shape
        T = self.num_frames
        B = BT // T
        # Reshape to temporal sequence: each spatial position over all frames
        # (B*T, C, H, W) -> (B, T, H, W, C) -> (B, H, W, T, C) -> (B*H*W, T, C)
        x_flat = x.permute(0, 2, 3, 1)  # (B*T, H, W, C)
        x_flat = x_flat.reshape(B, T, H, W, C)  # (B, T, H, W, C)
        x_flat = x_flat.permute(0, 2, 3, 1, 4)  # (B, H, W, T, C)
        x_flat = x_flat.reshape(B * H * W, T, C)  # (B*H*W, T, C)
        x_norm = self.norm(x_flat)
        q = self.q(x_norm).reshape(B * H * W, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k(x_norm).reshape(B * H * W, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v(x_norm).reshape(B * H * W, T, self.num_heads, self.head_dim).transpose(1, 2)
        # Self-attention over T temporal dimension
        scale = self.head_dim ** -0.5
        attn = torch.softmax(torch.matmul(q, k.transpose(-2, -1)) * scale, dim=-1)
        out = torch.matmul(attn, v)  # (B*H*W, heads, T, head_dim)
        out = out.transpose(1, 2).reshape(B * H * W, T, C)
        out = self.out_proj(out)  # zero-initialized -> no-op at start
        x_flat = x_flat + out  # residual
        # Reshape back: (B*H*W, T, C) -> (B, H, W, T, C) -> (B*T, C, H, W)
        x_out = x_flat.reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2).reshape(B * T, C, H, W)
        return x_out
# ============================================================
# Score Distillation Sampling (SDS) for 3D generation
# ============================================================
class ScoreDistillationSampling:
    """
    SDS loss for DreamFusion-style 3D optimization using a 2D diffusion model.
    Key idea: render the 3D scene from a random viewpoint, then use the 2D
    diffusion model's score function to push the rendered image toward
    the high-quality image manifold matching the text prompt.
    The gradient flows: loss -> rendered_image -> differentiable renderer -> NeRF theta
    """
    def __init__(
        self,
        diffusion_model,
        guidance_scale: float = 100.0,  # MUCH higher than image generation (7.5)
        min_timestep: float = 0.02,
        max_timestep: float = 0.98,
    ):
        self.model = diffusion_model
        self.guidance_scale = guidance_scale
        self.min_t = min_timestep
        self.max_t = max_timestep
    def compute_sds_loss(
        self,
        rendered_image: torch.Tensor,  # (B, 3, H, W) rendered view
        text_embeddings: torch.Tensor,  # (2B, 77, 768) stacked as [unconditional, conditional]
    ) -> torch.Tensor:
        """
        The SDS gradient:
        nabla_theta L_SDS = E_{t, eps} [ w(t) * (eps_phi(z_t; y, t) - eps) * dz/dtheta ]
        where:
        - eps_phi is the diffusion model's noise prediction (with CFG)
        - eps is the actual noise added
        - dz/dtheta is the gradient through the renderer to NeRF parameters
        The SDS gradient points in the direction that makes the rendered image
        "look less noisy" according to the diffusion model's learned score.
        This sketch takes w(t) = 1 for simplicity.
        """
        B = rendered_image.shape[0]
        device = rendered_image.device
        # Encode to latent WITH gradients enabled: the loss has to backpropagate
        # through the VAE encoder into the renderer and the NeRF parameters.
        latents = self.model.vae.encode(rendered_image).latent_dist.sample()
        latents = latents * self.model.vae.config.scaling_factor
        # The diffusion model acts as a frozen critic, so no gradients are needed here
        with torch.no_grad():
            # Sample timestep (uniformly - can weight toward middle for stability)
            t_val = torch.randint(
                int(self.min_t * 1000), int(self.max_t * 1000),
                (B,), device=device
            )
            # Add noise
            noise = torch.randn_like(latents)
            noisy_latents = self.model.scheduler.add_noise(latents, noise, t_val)
            # Predict noise (unconditioned + conditioned for CFG)
            noisy_stacked = torch.cat([noisy_latents] * 2)
            t_stacked = torch.cat([t_val] * 2)
            noise_pred = self.model.unet(noisy_stacked, t_stacked, text_embeddings).sample
            noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
            noise_pred_guided = noise_pred_uncond + self.guidance_scale * (
                noise_pred_cond - noise_pred_uncond
            )
            # SDS gradient: how much the predicted noise differs from actual noise
            # This delta, when backpropagated through the renderer, tells the NeRF
            # "update your geometry/appearance so this view looks more natural"
            sds_gradient = noise_pred_guided - noise
        # Construct a loss whose gradient w.r.t. latents equals sds_gradient / B:
        # d/dz [0.5 * sum((z - target)^2)] = z - target = sds_gradient
        target = (latents - sds_gradient).detach()
        loss = 0.5 * F.mse_loss(latents, target, reduction="sum") / B
        return loss
# ============================================================
# Masked diffusion language model (MDLM approach)
# ============================================================
class MaskedDiffusionLM(nn.Module):
    """
    Masked Diffusion Language Model - discrete token diffusion.
    Forward process: progressively mask tokens (token -> [MASK])
    Reverse process: unmask tokens (predict which token replaces [MASK])
    Advantage over embedding-space diffusion (Diffusion-LM):
    - No rounding step error (tokens stay discrete)
    - Natural "noise" state: [MASK] is clearly defined
    - Bidirectional context: model sees all positions simultaneously
    """
    MASK_TOKEN_ID = 103  # [MASK] in standard BERT tokenizer
    def __init__(self, vocab_size: int, d_model: int = 768, num_layers: int = 12, max_len: int = 512):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Learned positions: without them the encoder cannot distinguish token order
        self.pos_embedding = nn.Embedding(max_len, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=12, dim_feedforward=3072, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.output_proj = nn.Linear(d_model, vocab_size)
    def mask_tokens(self, input_ids: torch.Tensor, mask_prob: float) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Forward process: replace tokens with [MASK] with probability mask_prob.
        mask_prob ~ t/T analogous to noise level in continuous diffusion.
        """
        mask = torch.bernoulli(
            torch.full(input_ids.shape, mask_prob, device=input_ids.device)
        ).bool()
        masked_ids = input_ids.clone()
        masked_ids[mask] = self.MASK_TOKEN_ID
        return masked_ids, mask
    def forward(self, masked_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(masked_ids.shape[1], device=masked_ids.device)
        x = self.embedding(masked_ids) + self.pos_embedding(positions)
        x = self.transformer(x)
        return self.output_proj(x)  # (B, L, vocab_size)
    def training_step(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Sample mask rate uniformly (analogous to sampling diffusion timestep)
        t = torch.rand(1).item()
        masked_ids, mask = self.mask_tokens(input_ids, mask_prob=t)
        logits = self.forward(masked_ids)
        if not mask.any():
            # Nothing was masked (possible for small t); return a zero loss that keeps the graph
            return logits.sum() * 0.0
        # Cross-entropy only on masked positions - bidirectional context from all others.
        # Simplification: the full MDLM objective also weights this term by the noise schedule.
        return F.cross_entropy(logits[mask], input_ids[mask])
2. Video Diffusion
The Core Challenge - Temporal Consistency
Video generation adds the dimension of time. A naive approach - generate frames independently with an image diffusion model - produces temporally incoherent results: each frame is individually photorealistic but objects change appearance between frames, people teleport, motion is not physically plausible, lighting changes randomly. The world the model generates is incoherent across time.
Temporal consistency requires the denoising network to process multiple frames simultaneously, so that the noise prediction at each frame is informed by what happens in adjacent frames. This requires extending the model's architecture from 2D (spatial) to 3D (spatial + temporal).
Video Diffusion Models - Inflated 3D U-Net
Ho et al. (2022) extended the 2D image U-Net to a "3D U-Net" by adding temporal attention layers to each spatial attention block. The model processes videos as tensors of shape $(B, T, C, H, W)$ with $T$ frames:
Spatial attention: standard 2D self-attention over the spatial dimensions $(H, W)$ for each frame independently. Captures within-frame structure. Input reshaped from $(B, T, C, H, W)$ to $(B \cdot T, C, H, W)$ to process each frame.
Temporal attention: 1D self-attention over the temporal dimension $T$ at each spatial position $(h, w)$. Input reshaped from $(B \cdot T, C, H, W)$ to $(B \cdot H \cdot W, T, C)$ - each spatial location's feature vector over all frames as a temporal sequence.
The alternating spatial/temporal attention allows the model to reason about both spatial coherence (each frame looks realistic) and temporal coherence (motion is smooth and physically plausible across frames).
Initialization from pretrained image model: in the inflation recipe popularized by follow-up work such as Make-A-Video, the temporal attention layers are added to a pretrained 2D image model and initialized to output zero, so each new layer acts as an identity map through its residual connection. The model starts with the same behavior as the image model - processing each frame independently - and gradually learns temporal dependencies during training. This transfer learning dramatically speeds up training and improves quality.
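The reshape between the spatial and temporal views is easy to get wrong, so here is a small NumPy sketch (an illustration, not the layout code of any specific model) that performs the round trip and checks that a zero-initialized output projection leaves the features unchanged. The helper names `to_temporal` and `from_temporal` are hypothetical.

```python
import numpy as np


def to_temporal(x, num_frames):
    """(B*T, C, H, W) -> (B*H*W, T, C): one length-T sequence per spatial position."""
    BT, C, H, W = x.shape
    B = BT // num_frames
    x = x.reshape(B, num_frames, C, H, W)
    return x.transpose(0, 3, 4, 1, 2).reshape(B * H * W, num_frames, C)


def from_temporal(seq, batch, height, width):
    """Inverse of to_temporal: (B*H*W, T, C) -> (B*T, C, H, W)."""
    _, T, C = seq.shape
    x = seq.reshape(batch, height, width, T, C)
    return x.transpose(0, 3, 4, 1, 2).reshape(batch * T, C, height, width)


B, T, C, H, W = 2, 4, 3, 5, 6
frames = np.arange(B * T * C * H * W, dtype=np.float64).reshape(B * T, C, H, W)
seq = to_temporal(frames, T)

# A zero-initialized output projection contributes nothing to the residual sum
w_out = np.zeros((C, C))
seq_after = seq + seq @ w_out

print(seq.shape)                                                   # (60, 4, 3)
print(np.array_equal(from_temporal(seq_after, B, H, W), frames))   # True
```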
Make-A-Video - Decoupled Image and Motion Learning
Meta's Make-A-Video (Singer et al. 2022) proposes a key insight about data availability: large-scale text-image datasets are abundant, but text-video datasets are scarce and expensive to curate. The solution: decouple the learning of text-image correspondence from the learning of temporal motion dynamics.
Step 1: Train a high-quality text-to-image model on a large text-image dataset (e.g., LAION-5B with 5 billion image-text pairs). This gives the model rich semantic understanding of how text maps to visual content.
Step 2: Add temporal attention layers and train these only on video data - but without text labels. The video data teaches what natural motion looks like. Text does not need to accompany the video at this stage.
Inference: text conditioning drives semantic content (from step 1), temporal layers drive motion coherence (from step 2). The model generates semantically correct, temporally coherent videos from text prompts without requiring text-video paired training data.
This is an elegant use of data separation to solve the data scarcity problem in video diffusion.
Sora - Spatiotemporal DiT at Scale
OpenAI's Sora (2024) represents the scaled-up endpoint of video diffusion research, generating 60-second videos at up to 1080p. Several architectural choices enable this:
Spatiotemporal patches and compression: Sora first compresses video through a spatiotemporal VAE that compresses both spatial dimensions and temporal dimension simultaneously. This creates a compact 3D latent volume. The latent is then divided into non-overlapping 3D "patches" (spatiotemporal tokens), analogous to how ViT divides images into 2D patches.
Diffusion Transformer (DiT) over spatiotemporal tokens: instead of a U-Net with inductive spatial biases, Sora uses a pure transformer that attends globally over all spatiotemporal patch tokens. This is more scalable - transformers scale cleanly with compute via standard scaling laws, and the architecture handles variable video lengths natively by adjusting the number of tokens.
Variable aspect ratios and durations: Sora trains on videos at their native aspect ratios and frame rates, conditioning the model on these properties at inference. This prevents the "letterbox artifacts" and temporal stride inconsistencies of models trained at fixed resolutions.
Emergent physics: OpenAI described Sora as learning to simulate the physical world - the model learns not just visual style but physical plausibility (gravity, object permanence, camera motion, cause and effect). This physics understanding emerges from training on diverse video data at sufficient scale; it is not explicitly programmed.
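To get a feel for why heavy compression matters before global attention over spatiotemporal patches, the sketch below counts tokens for a clip. OpenAI has not published Sora's compression factors or patch sizes, so every default value here is a hypothetical placeholder chosen only to illustrate the arithmetic.

```python
def spatiotemporal_tokens(frames, height, width,
                          t_compress=4, s_compress=8,  # hypothetical VAE factors
                          t_patch=1, s_patch=2):       # hypothetical patch sizes
    """Number of transformer tokens after 3D VAE compression and patchification."""
    lat_t = frames // t_compress
    lat_h = height // s_compress
    lat_w = width // s_compress
    return (lat_t // t_patch) * (lat_h // s_patch) * (lat_w // s_patch)


# 10 seconds at 24 fps, 1080p, under the assumed factors
print(spatiotemporal_tokens(240, 1080, 1920))  # 482400
# Token count grows linearly with duration
print(spatiotemporal_tokens(480, 1080, 1920))  # 964800
```

Even under these fairly aggressive assumed factors, a short clip yields hundreds of thousands of tokens, which is why attention cost and latent compression dominate the engineering discussion for long videos.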
Temporal Consistency Approaches Comparison
| Approach | Method | Memory per Frame | Max Coherent Length | Notes |
|---|---|---|---|---|
| 3D U-Net (Ho et al.) | Spatial + temporal attention alternating | High | ~32 frames | Strong coherence, memory bottleneck |
| Inflated attention | Insert temporal attn into pretrained 2D | High | ~32 frames | Good pretrained initialization |
| Autoregressive | Condition each chunk on previous chunk | Low | Unlimited | Error accumulates across chunks |
| DiT (Sora-style) | Global attention over spatiotemporal patches | Very high (but heavy compression) | 60+ seconds | Best coherence, requires 3D VAE compression |
| Cascaded | Low-res temporal model + spatial SR | Moderate | Variable | Two-stage training complexity |
3. 3D Generation
The 3D Representation Challenge
Generating 3D objects requires choosing a representation. Each has different properties affecting how well Gaussian diffusion applies:
Voxel grids: 3D arrays of occupancy or color values. Natural extension of 2D pixels. Easy to apply diffusion directly. Problem: cubic memory scaling - a $256^3$ voxel grid is about 16.8M cells, so high-resolution 3D quickly becomes infeasible.
Point clouds: unordered sets of 3D points with optional color/normal attributes. Memory-efficient. Problem: points are unordered (no canonical ordering), so standard convolutional operations don't apply directly.
Meshes: vertices, edges, and faces defining a surface. Variable topology makes this hard to parameterize for diffusion.
Neural implicit functions (NeRF, SDF): continuous functions mapping 3D coordinates to density/color or signed distance. No explicit topology constraints. Compatible with differentiable rendering. This is what DreamFusion uses.
Triplane: compact hybrid representation - three axis-aligned feature planes (XY, XZ, YZ) that together define a 3D volume. Each plane is a 2D image that can be processed with standard 2D diffusion. Feature at 3D point = sum of features sampled from each plane at the corresponding 2D projection.
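The triplane lookup described above can be sketched in a few lines of NumPy. This version uses nearest-neighbour sampling for brevity, which is an assumption made here; practical implementations typically interpolate bilinearly on each plane. The dictionary layout and function name are illustrative.

```python
import numpy as np


def triplane_feature(planes, point):
    """Sum the features sampled from the XY, XZ and YZ planes.

    planes: dict with keys 'xy', 'xz', 'yz', each an array of shape (R, R, F).
    point:  (x, y, z) with each coordinate in [0, 1).
    """
    res = planes["xy"].shape[0]
    i, j, k = (min(int(c * res), res - 1) for c in point)
    return planes["xy"][i, j] + planes["xz"][i, k] + planes["yz"][j, k]


res, feat = 256, 32
rng = np.random.default_rng(0)
planes = {name: rng.normal(size=(res, res, feat)) for name in ("xy", "xz", "yz")}
print(triplane_feature(planes, (0.3, 0.6, 0.9)).shape)  # (32,)

# Storage comparison at the same resolution and feature width
print(3 * res * res * feat)   # triplane values
print(res ** 3 * feat)        # dense voxel grid values
```

Storage grows quadratically with resolution for the triplane and cubically for the voxel grid, which is the main reason the hybrid representation is attractive for diffusion.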
DreamFusion - 2D Diffusion as a 3D Prior
DreamFusion (Poole et al. 2022, Google) is one of the most conceptually elegant ideas in recent ML research. The question it answers: how do you generate high-quality 3D objects without 3D training data?
The key insight: a pretrained 2D image diffusion model implicitly encodes knowledge about what 3D objects look like from every viewpoint, because its training data contains images from millions of different viewpoints. You can extract this 3D knowledge without 3D training data using Score Distillation Sampling (SDS).
The SDS procedure:
1. Initialize a NeRF with random parameters $\theta$
2. Sample a random camera viewpoint
3. Render a 2D image via differentiable rendering
4. Ask the diffusion model: "how should this rendered image change to better match the text prompt $y$?"
5. The diffusion model's answer is a gradient signal: the SDS loss gradient
6. Backpropagate through the renderer to update the NeRF parameters
7. Repeat from step 2
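The SDS gradient for a rendered image with latent $z$ at timestep $t$:

$$\nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat{\epsilon}_\phi(z_t; y, t) - \epsilon\big)\,\frac{\partial z}{\partial \theta} \right]$$

where $\hat{\epsilon}_\phi(z_t; y, t)$ is the noise predicted by the diffusion model with classifier-free guidance, $\epsilon$ is the noise that was actually added, $w(t)$ is a timestep-dependent weighting, and $\partial z / \partial \theta$ is the gradient from the differentiable renderer to the NeRF parameters $\theta$.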
Why it works: the diffusion model's score function points toward high-probability regions of the image distribution matching prompt $y$. When the rendered image looks unrealistic or off-prompt, the score points away from it. SDS uses this signal to update the 3D representation so that rendered views become increasingly realistic and prompt-consistent from every viewpoint.
The guidance scale caveat: SDS requires much higher guidance scale (100-200) than image generation (7-12). At low guidance scale, the gradient signal is too weak for the NeRF to converge. The price of such strong guidance is a tendency toward over-saturated colors, "over-smoothing" artifacts, and textureless surfaces.
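SDS is usually implemented as a surrogate loss whose gradient with respect to the latent equals the SDS term, with the target treated as a constant. The NumPy sketch below checks that identity numerically on a toy linear "renderer" $z = A\theta$. The renderer, the random stand-ins for the noise predictions, and the choice $w(t) = 1$ are all assumptions made for illustration.

```python
import numpy as np


def sds_grad_latent(eps_uncond, eps_cond, eps, guidance_scale, w=1.0):
    """w(t) * (guided noise prediction - true noise)."""
    eps_hat = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    return w * (eps_hat - eps)


def surrogate_loss(z, target):
    # target is held constant (the role of .detach() in autograd frameworks)
    return 0.5 * np.sum((z - target) ** 2)


rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))      # toy linear renderer: z = A @ theta
theta = rng.normal(size=3)
z = A @ theta
g = sds_grad_latent(rng.normal(size=4), rng.normal(size=4),
                    rng.normal(size=4), guidance_scale=100.0)
target = z - g                   # frozen target

analytic = A.T @ g               # chain rule: (dz/dtheta)^T g
step = 1e-5
numeric = np.array([
    (surrogate_loss(A @ (theta + h), target)
     - surrogate_loss(A @ (theta - h), target)) / (2 * step)
    for h in step * np.eye(3)
])
print(np.allclose(analytic, numeric, rtol=1e-4, atol=1e-3))
```

The diffusion model itself never receives gradients in this construction; only the path from the latent back to the scene parameters is differentiated.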
Point-E and Shap-E
OpenAI released two direct 3D generation models that avoid the iterative SDS optimization:
Point-E (2022): generates colored point clouds via a two-stage approach:
- Text → single high-quality "conditioning image" using an image diffusion model
- Conditioning image → 3D point cloud using a transformer-based point cloud diffusion model
The point cloud diffusion model treats the set of points as a sequence and uses a transformer to model point-point relationships. Because point sets are orderless, the model must be permutation-equivariant - the output should not depend on the ordering of input points. This is handled by treating all points as a bag-of-vectors with cross-attention.
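The permutation property can be verified directly: self-attention without positional encodings maps a reordered set of points to the same outputs in the same new order. The single-head NumPy implementation below is a minimal sketch with random weights, not the Point-E architecture.

```python
import numpy as np


def self_attention(x, wq, wk, wv):
    """Single-head self-attention over a set of points, with no positional encoding."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))
wq, wk, wv = (rng.normal(size=(3, 4)) for _ in range(3))
perm = rng.permutation(5)

out = self_attention(points, wq, wk, wv)
out_perm = self_attention(points[perm], wq, wk, wv)

# Permuting the inputs permutes the outputs in exactly the same way
print(np.allclose(out[perm], out_perm))
```

Adding a positional encoding indexed by sequence position would break this property, which is why point cloud transformers of this kind omit it.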
Shap-E (2023): generates implicit 3D functions (a small NeRF MLP that represents the object via signed distance + color) conditioned on text or images. The diffusion is in the weight space of the small MLP - a fixed-size vector of NeRF parameters. Higher quality than Point-E and supports textured mesh extraction.
Both models generate 3D content in seconds (no iterative optimization), but at lower fidelity than DreamFusion-style SDS which optimizes for minutes per object.
4. Molecular Generation and Protein Design
Why Molecules Need Special Treatment
Molecules are not images. They are graphs - atoms as nodes, chemical bonds as edges - embedded in 3D space. The key symmetry of molecules: rotating or translating the entire molecule does not change its chemical identity, biological activity, or binding properties. This is SE(3) symmetry (the symmetry group of rotations and translations in 3D space).
A standard Gaussian diffusion model trained on atom coordinates would see the same molecule as different inputs depending on where it is in space or how it is oriented. This would waste enormous model capacity on learning to be invariant to these irrelevant transformations. The solution: use SE(3)-equivariant network architectures that respect the symmetry by construction.
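The symmetry argument can be made concrete with a quick NumPy check: raw coordinates change under a rigid motion, while pairwise distances do not. The random rotation is built from a QR decomposition, which is one convenient construction among several; the helper names are illustrative.

```python
import numpy as np


def random_rotation(rng):
    """A proper rotation matrix (orthogonal, determinant +1) from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0  # flip one axis to turn a reflection into a rotation
    return q


def pairwise_distances(coords):
    """(N, 3) coordinates -> (N, N) matrix of Euclidean distances."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


rng = np.random.default_rng(0)
atoms = rng.normal(size=(6, 3))
rotation = random_rotation(rng)
shift = rng.normal(size=3)
moved = atoms @ rotation.T + shift  # same molecule, different pose

print(np.allclose(atoms, moved))                                          # False
print(np.allclose(pairwise_distances(atoms), pairwise_distances(moved)))  # True
```

A network that consumes raw coordinates has to learn this equivalence from data, whereas one built on invariant or equivariant features gets it for free.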
SE(3)-Equivariant Diffusion for Proteins
FrameDiff (Yim et al. 2023) and RFdiffusion (Watson et al. 2023, David Baker lab) represent each protein residue as a rigid body frame: a rotation $R \in SO(3)$ (orientation of the residue backbone) and a translation $x \in \mathbb{R}^3$ (position of the $C_\alpha$ atom).
The forward diffusion process must operate on this frame representation:
- Translation noise: standard Gaussian in $\mathbb{R}^3$ - the same as for image diffusion
- Rotation noise: Brownian motion on $SO(3)$ - the natural analog of Gaussian noise on the rotation manifold. This is Riemannian diffusion, where the "noise" respects the curved geometry of the rotation group
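A rough sketch of noising one residue frame follows. It approximates Brownian motion on the rotation group with a geodesic random walk (many small random rotations composed together) and uses a simple variance-exploding Gaussian for the translation. The function names, the schedule, and the step count are illustrative assumptions, not the FrameDiff or RFdiffusion implementation.

```python
import numpy as np

def hat(v):
    """Map a 3-vector to its skew-symmetric matrix (an element of so(3))."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def exp_so3(v):
    """Rodrigues' formula: axis-angle vector -> rotation matrix."""
    theta = np.linalg.norm(v)
    if theta < 1e-8:
        return np.eye(3) + hat(v)  # first-order approximation near the identity
    k = hat(v / theta)
    return np.eye(3) + np.sin(theta) * k + (1.0 - np.cos(theta)) * (k @ k)

def noise_frame(rot, trans, t, sigma_rot=1.0, sigma_trans=1.0, n_steps=100, rng=None):
    """Noise a rigid frame (rot, trans) up to time t in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    # Translation: ordinary Gaussian noise in R^3 (toy variance-exploding schedule).
    noisy_trans = trans + sigma_trans * np.sqrt(t) * rng.normal(size=3)
    # Rotation: compose small random rotations, which stays on the manifold.
    dt = t / n_steps
    noisy_rot = rot.copy()
    for _ in range(n_steps):
        step = sigma_rot * np.sqrt(dt) * rng.normal(size=3)
        noisy_rot = noisy_rot @ exp_so3(step)
    return noisy_rot, noisy_trans

noisy_rot, noisy_trans = noise_frame(np.eye(3), np.zeros(3), t=0.5,
                                     rng=np.random.default_rng(0))
```

The point of the design is that the noised rotation is still a valid rotation matrix at every step, which adding Gaussian noise directly to the nine matrix entries would not guarantee.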
The denoising network uses SE(3)-equivariant message passing: atom positions are represented as pairwise distance vectors (invariant to global translation), and attention weights are computed from these invariant features. The output predictions transform consistently with input transformations.
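To make "transform consistently" concrete, here is a minimal sketch of one EGNN-style position update with a numerical equivariance check. The weight vector `w` and the `tanh` edge function are toy stand-ins chosen for illustration; real models use learned MLPs and richer features.

```python
import numpy as np

def egnn_position_update(x, h, w):
    """One EGNN-style update. x: (N, 3) positions, h: (N, F) invariant features,
    w: (2F + 1,) toy edge weights."""
    n, f = h.shape
    diff = x[:, None, :] - x[None, :, :]              # (N, N, 3) relative vectors
    dist2 = (diff ** 2).sum(axis=-1, keepdims=True)   # (N, N, 1), rotation-invariant
    h_i = np.broadcast_to(h[:, None, :], (n, n, f))
    h_j = np.broadcast_to(h[None, :, :], (n, n, f))
    edge_feats = np.concatenate([h_i, h_j, dist2], axis=-1)
    edge_scale = np.tanh(edge_feats @ w)[..., None]   # invariant scalar per edge
    # Invariant scalars times relative vectors: the update rotates with the input.
    return x + (diff * edge_scale).sum(axis=1) / (n - 1)

def random_rotation(rng):
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1  # flip one axis so the determinant is +1
    return q

rng = np.random.default_rng(0)
x, h, w = rng.normal(size=(5, 3)), rng.normal(size=(5, 4)), rng.normal(size=9)
rot, shift = random_rotation(rng), rng.normal(size=3)

update_then_move = egnn_position_update(x, h, w) @ rot.T + shift
move_then_update = egnn_position_update(x @ rot.T + shift, h, w)
is_equivariant = np.allclose(update_then_move, move_then_update)
```

Rotating and translating the input and then updating gives the same result as updating first and transforming afterwards, with no data augmentation involved.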
RFdiffusion achievements:
- Designs protein scaffolds with entirely novel folds not found in the Protein Data Bank
- Generates binders that attach to specified target proteins
- Conditions on arbitrary structural motifs - generates proteins that contain a given functional site in a new scaffold context
- Used in practice by pharmaceutical companies for therapeutic protein design
DiffSBDD - Molecule Generation for Drug Discovery
DiffSBDD (Schneuing et al. 2022) generates drug-like small molecule ligands conditioned on a protein binding pocket (structure-based drug design). The model:
- Encodes the 3D protein pocket as a set of atom features and positions using an equivariant graph neural network
- Applies diffusion over joint atom type (discrete: C, N, O, S...) + position (continuous: $\mathbb{R}^3$) space for the generated ligand
- Conditions the reverse denoising process on protein pocket features via cross-attention
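Atom types are discrete, so they need their own treatment. DiffSBDD one-hot encodes each atom type and applies the same Gaussian diffusion used for positions, taking an argmax over the denoised type vector at the end. Other models instead use a categorical noise process that gradually corrupts atoms toward a uniform or "absorbing" state and trains the reverse process to predict the correct atom type. For positions (continuous), standard Gaussian diffusion applies.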
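The intended output is a 3D molecule that fits geometrically into the protein pocket, has correct chemical valency, and exhibits drug-like properties (plausible bond lengths and angles, suitable pharmacophore features). The sampler does not guarantee any of this, so raw samples still need validity filtering.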
Practical Impact
Companies including Genentech, Roche, Recursion, and Isomorphic Labs (DeepMind spinout) use diffusion-based molecular generation in their drug discovery pipelines. The key advantage: generative models can explore chemical space beyond existing databases, creating novel molecules that are not in any known dataset. This combinatorial chemical space is commonly estimated at around $10^{60}$ drug-like molecules - no database can cover it.
RFdiffusion has been used to design proteins with new folds that outperform natural proteins as binding agents, reported in multiple Nature publications in 2023-2024.
5. Text Diffusion - The Discrete Challenge
The Core Problem: Tokens Are Not Real Numbers
Text is fundamentally discrete - words or subword tokens from a finite vocabulary (e.g., roughly 50,000 tokens in GPT-2's vocabulary). Gaussian diffusion is designed for continuous spaces: adding Gaussian noise to a real-valued vector makes mathematical sense. But adding Gaussian noise to an integer token ID (say, 4523 for "cat") produces a meaningless non-integer, and rounding it to a neighboring ID gives an unrelated token, because the order of IDs carries no semantics.
There is no "slightly noisy version of the token for 'cat'" in the same way there is a slightly noisy version of a pixel value. This discrete nature is the core challenge for text diffusion.
Approach 1: Embed-Then-Diffuse (Diffusion-LM)
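Diffusion-LM (Li et al. 2022) maps discrete tokens to continuous embedding vectors, then applies Gaussian diffusion in embedding space:
$$x_0 = \mathrm{Emb}(w), \qquad q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right)$$
The forward process adds Gaussian noise to token embeddings. The reverse process denoises embeddings back toward the original token embeddings. A "rounding" step at the end projects the denoised embedding $\hat{x}_{0,i}$ at each position $i$ to the nearest token in the vocabulary:
$$\hat{w}_i = \arg\min_{j} \left\lVert \hat{x}_{0,i} - e_j \right\rVert_2$$
where $e_j$ is the embedding of vocabulary token $j$.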
Problems with rounding: the rounding step is non-differentiable and introduces a systematic error. For common tokens whose embeddings are well-separated, rounding works reliably. For rare tokens with embeddings close to common tokens in the embedding space, the rounding step frequently produces the wrong token. This "rounding mismatch" degrades text generation quality, especially for specialized vocabulary.
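The rounding failure mode is easy to reproduce. The sketch below assumes a random embedding table and nearest-neighbor (L2) rounding, and uses leftover Gaussian noise as a stand-in for imperfect denoising; the vocabulary size and noise levels are arbitrary illustrative choices.

```python
import numpy as np

def round_to_tokens(denoised, embedding_table):
    """Return the index of the nearest embedding (L2 distance) for each vector."""
    # (L, 1, D) - (1, V, D) -> (L, V) squared distances
    d2 = ((denoised[:, None, :] - embedding_table[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=-1)

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(1000, 16))  # hypothetical 1000-token vocabulary
token_ids = rng.integers(0, 1000, size=32)
clean = embedding_table[token_ids]

# Residual error after denoising, modeled here as leftover Gaussian noise.
nearly_clean = clean + 0.05 * rng.normal(size=clean.shape)
poorly_denoised = clean + 1.5 * rng.normal(size=clean.shape)

acc_nearly_clean = (round_to_tokens(nearly_clean, embedding_table) == token_ids).mean()
acc_poorly_denoised = (round_to_tokens(poorly_denoised, embedding_table) == token_ids).mean()
```

When the residual error is small relative to the spacing between embeddings, rounding recovers the tokens; once the error is comparable to that spacing, the nearest embedding is often a different token. Tokens whose embeddings sit close together hit that regime first.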
Approach 2: Masked Diffusion (MDLM)
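MDLM (Sahoo et al. 2024) uses a discrete forward process that replaces tokens with a [MASK] token (an absorbing state):
$$q(x_t \mid x_0) = \mathrm{Cat}\left(x_t;\ \alpha_t\, x_0 + (1 - \alpha_t)\, m\right)$$
Here $x_0$ is the one-hot clean token, $m$ is the one-hot [MASK] vector, and $\alpha_t$ decreases from 1 to 0, so each token is independently kept with probability $\alpha_t$ and masked otherwise. At full noise ($t = 1$), all tokens are masked. The reverse process is a bidirectional transformer that unmasks tokens by predicting which token should replace each [MASK] given surrounding context. This is essentially BERT-style masked language modeling but with a gradual, multi-step denoising schedule.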
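Sampling from the masking forward process takes a few lines. This sketch assumes a linear schedule in which each token is kept with probability `1 - t` and a hypothetical `MASK_ID` of -1; actual schedules and vocabularies differ by implementation.

```python
import numpy as np

MASK_ID = -1  # hypothetical id reserved for [MASK]

def mask_forward(tokens, t, rng):
    """Sample a noised sequence at time t in [0, 1]: keep each token w.p. 1 - t."""
    keep = rng.random(tokens.shape) < (1.0 - t)
    return np.where(keep, tokens, MASK_ID)

rng = np.random.default_rng(0)
tokens = np.array([12, 7, 301, 45, 9, 88, 3, 150])

x_clean = mask_forward(tokens, 0.0, rng)  # nothing masked
x_half = mask_forward(tokens, 0.5, rng)   # roughly half masked
x_full = mask_forward(tokens, 1.0, rng)   # everything masked
```

Every intermediate state is a valid token sequence (real tokens plus [MASK]), so no rounding step is needed anywhere in the pipeline.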
Advantages of masked diffusion:
- No rounding error - tokens remain discrete throughout
- The bidirectional transformer can use full context for each prediction
- Natural handling of variable-length sequences
- Compatible with standard transformer architectures
The controllable generation advantage: unlike autoregressive LLMs that generate left-to-right, masked diffusion generates the entire sequence simultaneously. This enables:
- Infilling: given surrounding context, generate the missing middle section
- Global coherence constraints: apply constraints to the entire sequence at inference time
- Non-causal reasoning: the model sees and reasons about all positions at once
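Infilling with a masked model can be sketched as an iterative unmasking loop. The example below uses confidence-based unmasking (a MaskGIT-style heuristic, which is an assumption here and not MDLM's ancestral sampler) and a `stub_denoiser` that returns random probabilities in place of a trained bidirectional transformer.

```python
import numpy as np

MASK_ID = -1     # hypothetical id reserved for [MASK]
VOCAB_SIZE = 20  # hypothetical toy vocabulary

def stub_denoiser(tokens, rng):
    """Stand-in for a bidirectional transformer: random per-position probabilities."""
    logits = rng.normal(size=(tokens.shape[0], VOCAB_SIZE))
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def unmask_step(tokens, probs, n_to_unmask):
    """Fill the most confident masked positions with their argmax prediction."""
    masked = np.flatnonzero(tokens == MASK_ID)
    confidence = probs[masked].max(axis=-1)
    chosen = masked[np.argsort(-confidence)[:n_to_unmask]]
    out = tokens.copy()
    out[chosen] = probs[chosen].argmax(axis=-1)
    return out

def infill(tokens, n_steps, rng):
    tokens = tokens.copy()
    for step in range(n_steps):
        n_masked = int((tokens == MASK_ID).sum())
        if n_masked == 0:
            break
        # Spread the remaining masked positions evenly over the remaining steps.
        n_to_unmask = int(np.ceil(n_masked / (n_steps - step)))
        tokens = unmask_step(tokens, stub_denoiser(tokens, rng), n_to_unmask)
    return tokens

rng = np.random.default_rng(0)
prompt = np.array([5, 7, MASK_ID, MASK_ID, MASK_ID, MASK_ID, 2, 9])
filled = infill(prompt, n_steps=2, rng=rng)
```

The left and right context are never touched, only masked positions are filled, which is what makes middle-infilling natural in this setup.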
Current Status of Text Diffusion
Despite theoretical advantages in controllability, text diffusion models in 2024 still lag significantly behind large autoregressive LLMs (GPT-4, Claude, Gemini) in raw language generation quality. Autoregressive LLMs benefit from years of scaling work, efficient attention implementations (FlashAttention), and training at scales (trillions of tokens) that text diffusion has not yet reached. The gap is narrowing, particularly for specialized applications like protein sequence generation and structured text generation.
6. Cross-Domain Architecture Comparison
7. Time Series and Tabular Diffusion
TabDDPM - Tabular Data Synthesis
Real-world tabular datasets have mixed types (continuous numeric, binary, nominal categorical, ordinal), skewed distributions, and typically small sample sizes. Generating synthetic tabular data that preserves the statistical properties of real data is valuable for privacy-preserving data sharing and data augmentation.
TabDDPM (Kotelnikov et al. 2023) applies diffusion to tabular data, with a separate noise process per feature type:
- Continuous features: standard Gaussian diffusion directly on normalized feature values
- Categorical features: one-hot encoded, then multinomial diffusion, which gradually replaces each category with a uniform draw over that feature's categories
- Architecture: simple MLP (not CNN) as the denoising network - no spatial structure to exploit
- Class conditioning: class label injected as additional input for class-conditional synthesis
The simplicity is a feature: tabular data has no spatial or temporal structure that CNNs exploit, so the MLP denoising network is appropriate and efficient.
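A sketch of noising one mixed-type batch follows: Gaussian noise for numeric columns and a multinomial (uniform-mixing) corruption for a categorical column. The function names are hypothetical and a single scalar `alpha_bar` stands in for the cumulative noise schedule; this is not the tab-ddpm library API.

```python
import numpy as np

def noise_numeric(x0, alpha_bar, rng):
    """Gaussian forward process on normalized numeric features."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def noise_categorical(labels, n_classes, alpha_bar, rng):
    """Multinomial forward process: keep the label w.p. alpha_bar, else mix to uniform."""
    onehot = np.eye(n_classes)[labels]
    probs = alpha_bar * onehot + (1.0 - alpha_bar) / n_classes
    # Inverse-CDF sampling, one draw per row.
    cum = probs.cumsum(axis=-1)
    u = rng.random((labels.shape[0], 1))
    return np.minimum((u >= cum).sum(axis=-1), n_classes - 1)

rng = np.random.default_rng(0)
numeric = rng.normal(size=(6, 3))        # 6 rows, 3 normalized numeric columns
category = np.array([0, 2, 1, 3, 3, 0])  # one categorical column with 4 levels

noisy_numeric = noise_numeric(numeric, alpha_bar=0.5, rng=rng)
noisy_category = noise_categorical(category, 4, alpha_bar=0.5, rng=rng)
unchanged_category = noise_categorical(category, 4, alpha_bar=1.0, rng=rng)
```

The denoiser then receives the concatenated noisy numeric values and noisy one-hot categories as a flat vector, which is why a plain MLP is enough.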
TabDDPM outperforms GAN- and VAE-based tabular generators (CTGAN, TVAE) on most benchmark datasets in the paper's evaluation, suggesting that diffusion models' stable training (no mode collapse, no adversarial instability) transfers well to the tabular domain.
CSDI - Time Series Imputation
CSDI (Conditional Score-based Diffusion Models, Tashiro et al. 2021) applies diffusion to multivariate time series imputation - filling in missing observations. The model treats observed timesteps as conditioning and missing timesteps as the generation target.
The key architectural choice: the denoising network alternates temporal self-attention (attends over time) with feature self-attention (attends across the variables at each timestep). This captures the two key structure types in multivariate time series: temporal dependencies (how each variable evolves over time) and cross-variable correlations (how different variables co-vary at the same timestep).
CSDI frames imputation as conditioned generation: given observed values $x_{\mathcal{O}}$ at timesteps in set $\mathcal{O}$, generate values $x_{\mathcal{M}}$ at missing timesteps in set $\mathcal{M}$. At inference, the model denoises from pure noise toward the conditional distribution $p(x_{\mathcal{M}} \mid x_{\mathcal{O}})$.
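During training the true missing values are unknown, so a common trick is to hide part of the observed entries and treat them as imputation targets. The sketch below shows that masking step under simplifying assumptions: a single scalar `alpha_bar`, one merged input array instead of separate conditioning and target channels, and hypothetical function and argument names.

```python
import numpy as np

def make_training_example(series, observed_mask, target_ratio, alpha_bar, rng):
    """series: (T, K) values, observed_mask: (T, K) bool.
    Returns the model input, the conditioning mask, the target mask, and the noise."""
    # Hold out a random subset of the observed entries as imputation targets.
    target_mask = observed_mask & (rng.random(series.shape) < target_ratio)
    cond_mask = observed_mask & ~target_mask
    eps = rng.normal(size=series.shape)
    noisy = np.sqrt(alpha_bar) * series + np.sqrt(1.0 - alpha_bar) * eps
    # Conditioning entries stay clean, targets are noised, unobserved entries are zeroed.
    model_input = np.where(cond_mask, series, np.where(target_mask, noisy, 0.0))
    return model_input, cond_mask, target_mask, eps

rng = np.random.default_rng(0)
series = rng.normal(size=(24, 3))          # 24 timesteps, 3 variables
observed = rng.random(series.shape) < 0.8  # about 20% truly missing
model_input, cond_mask, target_mask, eps = make_training_example(
    series, observed, target_ratio=0.3, alpha_bar=0.5, rng=rng)
```

The denoising loss would then be computed only where `target_mask` is true, so the network learns to reconstruct hidden values from the clean conditioning entries.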
TimeGrad - Probabilistic Time Series Forecasting
TimeGrad (Rasul et al. 2021) applies autoregressive diffusion to time series forecasting. Unlike CSDI which generates all timesteps simultaneously, TimeGrad generates future values autoregressively - each future timestep is generated conditioned on the observed history and on the values already generated.
The approach: a recurrent neural network (an LSTM or GRU) processes the history and produces a hidden state that serves as the context vector. Diffusion generates the distribution over the next timestep conditioned on this context. This gives a full probabilistic forecast - not just a point estimate but a distribution over future values.
TimeGrad outperformed previous probabilistic forecasting methods (DeepAR, GP-forecasting) on benchmark datasets by capturing complex multi-modal forecast distributions that Gaussian approximations miss.
8. YouTube Resources
| Video | Channel | What You Learn |
|---|---|---|
| DreamFusion Explained | Yannic Kilcher | Score Distillation Sampling derivation, 2D-to-3D knowledge transfer |
| Video Diffusion Models Paper | Yannic Kilcher | 3D U-Net architecture, temporal attention, factorized attention |
| Sora: Video Generation Explained | Two Minute Papers | Sora capabilities, spatiotemporal patches, DiT architecture |
| RFdiffusion for Protein Design | Sergey Ovchinnikov | SE(3) diffusion on protein frames, Baker lab experimental results |
| AudioLDM and Audio Diffusion | AI Coffee Break | CLAP conditioning, spectrogram diffusion, vocoder integration |
9. Production Engineering Notes
Domain-Specific Deployment Considerations
Audio generation in production: latency is the critical constraint. A 5-second audio clip at 44.1kHz is 220,500 samples, so you either run diffusion directly on the waveform (slow, high memory) or diffuse a much smaller spectrogram and convert it with a vocoder (faster). A neural vocoder such as HiFi-GAN synthesizes audio more than 100x faster than real time on a GPU, making it suitable for streaming. For real-time TTS with waveform diffusion, use a short sampling schedule (WaveGrad reports usable quality with as few as 6 steps) rather than full 1000-step generation.
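The size difference between the two representations is simple arithmetic. The hop length of 512 and the 80 mel bins below are common defaults assumed for illustration, as is the frame-count formula for a centered STFT.

```python
sample_rate = 44_100
duration_s = 5
n_samples = sample_rate * duration_s  # values a waveform model must generate

hop_length = 512  # assumption: a common STFT hop
n_mels = 80       # assumption: a common mel resolution
n_frames = n_samples // hop_length + 1  # frame count for a centered STFT
n_spec_values = n_mels * n_frames

reduction = n_samples / n_spec_values  # how much smaller the spectrogram is

# Time the vocoder needs for this clip at 100x faster than real time.
vocoder_seconds = duration_s / 100
```

With these settings the waveform has 220,500 values and the spectrogram has 431 frames of 80 bins, so the diffusion model works on a much smaller array and the vocoder adds only a small fixed cost.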
Video generation at scale: spatiotemporal VAE compression is essential - without compression, the latent sequence of a 10-second 720p video would be too large for any practical transformer to process. The 3D VAE must preserve temporal coherence in the latent space, not just spatial fidelity. Test this by encoding and decoding clips and checking for temporal artifacts in the reconstruction.
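A back-of-the-envelope token count shows why compression matters. The frame rate, the 8x spatial and 4x temporal VAE factors, and the 2x2 patch size are all assumptions picked for illustration, not published figures for any specific model.

```python
fps, seconds = 24, 10      # assumption: 24 frames per second
width, height = 1280, 720  # 720p
frames = fps * seconds

patch = 2                  # assumption: 2x2 spatial patches, one frame deep
spatial_down = 8           # assumption: VAE spatial compression
temporal_down = 4          # assumption: VAE temporal compression

# Patchifying raw pixels directly.
tokens_raw = frames * (height // patch) * (width // patch)

# Patchifying the compressed latent instead.
latent_frames = frames // temporal_down
tokens_latent = (latent_frames
                 * (height // spatial_down // patch)
                 * (width // spatial_down // patch))

# Full self-attention cost grows with the square of the token count.
attention_entries_latent = tokens_latent ** 2
```

Under these assumptions the raw video is about 55 million tokens and the latent is 216,000, a 256x reduction in sequence length and a far larger one in attention cost.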
Molecular generation in drug discovery: generated molecules must pass a validity filter pipeline before any downstream use:
- Chemical valency check (correct number of bonds per atom)
- Fragment connectivity check (no disconnected atoms)
- Force field minimization (MMFF94 or UFF) to relax the 3D geometry; ETKDG conformer generation can rebuild a geometry that fails to relax
- ADMET property prediction (absorption, distribution, metabolism, excretion, toxicity)
- Docking score against target protein (AutoDock Vina or Glide)
Only about 30-70% of raw generated molecules pass these filters, depending on the model. Always report validity rate as a model quality metric.
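The first two filters and the validity rate can be illustrated without any chemistry toolkit. This is a deliberately simplified sketch: the valence table covers only a few neutral elements, hydrogens are left implicit, and production pipelines would normally rely on a cheminformatics library for sanitization instead.

```python
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "S": 6}  # simplified, neutral atoms

def is_valid(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (i, j, bond_order)."""
    if not atoms:
        return False
    valence = [0] * len(atoms)
    neighbors = [[] for _ in atoms]
    for i, j, order in bonds:
        valence[i] += order
        valence[j] += order
        neighbors[i].append(j)
        neighbors[j].append(i)
    # Valency check: no atom may exceed its maximum number of bonds.
    if any(v > MAX_VALENCE[a] for a, v in zip(atoms, valence)):
        return False
    # Connectivity check: every atom must be reachable from atom 0.
    seen, stack = {0}, [0]
    while stack:
        for nb in neighbors[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(atoms)

generated = [
    (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)]),                  # ethanol heavy atoms
    (["C", "O", "C", "C"], [(0, 1, 1), (1, 2, 1), (1, 3, 1)]),  # oxygen with 3 bonds
    (["C", "C", "N"], [(0, 1, 2)]),                             # nitrogen is disconnected
]
validity_rate = sum(is_valid(a, b) for a, b in generated) / len(generated)
```

Only the first toy molecule passes, so the validity rate here is one in three; the same ratio computed over a real sample set is the metric worth reporting.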
Practical Domain Selection Guide
| Data Type | Recommended Approach | Key Library |
|---|---|---|
| Audio (speech) | Waveform diffusion or spec LDM | diffusers + torchaudio |
| Audio (music, long) | Latent diffusion with timing conditioning | Stable Audio (stable-audio-tools) |
| Video (short clips) | 3D U-Net with temporal attention | diffusers (e.g. `TextToVideoSDPipeline`, `AnimateDiffPipeline`) |
| 3D objects (fast) | Point-E or Shap-E | openai/point-e, openai/shap-e |
| 3D objects (high quality) | DreamFusion / SDS over NeRF | threestudio framework |
| Proteins (backbone) | RFdiffusion | RosettaCommons/RFdiffusion (Baker lab) |
| Drug molecules | DiffSBDD or TargetDiff | DiffSBDD GitHub |
| Text | Autoregressive LLM (diffusion not yet competitive) | transformers |
| Tabular | TabDDPM | tab-ddpm |
| Time series (imputation) | CSDI | CSDI GitHub |
| Time series (forecasting) | TimeGrad | PyTorchTS (built on GluonTS) |
10. Common Pitfalls
:::danger SDS guidance scale must be much higher than image generation
DreamFusion uses a guidance scale of about 100, far higher than the 7-12 used for image generation. At low guidance scale (7-12), the SDS gradient is too weak for the NeRF to converge - the signal from "this rendered view doesn't look like a high-quality image" is not strong enough to overcome the random initialization of the NeRF. The result is a flat, textureless, colorless surface. At extremely high scale (above 300), the SDS gradient saturates, producing oversaturated, oversmoothed results. A reasonable starting range is 50-150 for most 3D scenes with Stable Diffusion as the prior.
:::
:::warning Temporal attention zero-initialization is not optional for video models
When adding temporal attention to a pretrained 2D image diffusion model, zero-initialize the output projection of the temporal attention block. Without zero-initialization, random initial values immediately corrupt the pretrained spatial attention's activations at training step 1, preventing the model from leveraging its pretrained weights. The training signal becomes chaotic and convergence is slow. Zero-initialization ensures the model begins training with exactly the same behavior as the original image model. Temporal reasoning is learned gradually from a stable starting point.
:::
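The effect of zero-initialization is easy to verify on a toy residual block. The sketch assumes a single-head temporal attention layer applied to the features of one spatial location, with hypothetical weight names `w_qkv` and `w_out`.

```python
import numpy as np

def temporal_block(x, w_qkv, w_out):
    """x: (T, C) features of one spatial location across T frames.
    Returns x + attention(x) @ w_out (a residual temporal attention block)."""
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return x + (weights @ v) @ w_out

rng = np.random.default_rng(0)
frames, channels = 8, 16
x = rng.normal(size=(frames, channels))  # stand-in for pretrained spatial features
w_qkv = rng.normal(size=(channels, 3 * channels))

# Zero-initialized output projection: the block is an exact identity at step 0.
out_zero_init = temporal_block(x, w_qkv, np.zeros((channels, channels)))

# Randomly initialized output projection: the pretrained features are perturbed.
out_random_init = temporal_block(x, w_qkv, rng.normal(size=(channels, channels)))
```

With the zero projection the output equals the input exactly, so the pretrained image model is untouched until gradients start moving `w_out` away from zero.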
:::warning Molecular generation requires validity constraints at sampling time
Diffusion models generate atom positions and types as continuous values that must be discretized and validated. Standard post-processing pipeline: (1) round continuous atom type predictions to the nearest valid atom type using argmax; (2) infer bonds from interatomic distances using distance thresholds; (3) check valency constraints and discard or repair molecules that violate them, then relax the geometry with the MMFF94 force field; (4) remove disconnected fragments; (5) filter by QED and SA score for drug-likeness. Skipping any of these steps and passing raw model outputs to downstream docking calculations will produce invalid results. Always report the fraction of valid molecules as a primary model metric.
:::
:::danger Spectrogram diffusion vocoder quality determines the perceptual ceiling
In a spectrogram diffusion pipeline, you have two quality ceilings: the diffusion model's spectrogram quality and the vocoder's waveform reconstruction quality. If the vocoder cannot faithfully reconstruct the spectrogram, even a perfect spectrogram diffusion model will produce artifacts - metallic ringing, phasing, or unnatural formant transitions. Always evaluate the vocoder independently (convert a real audio clip to a mel spectrogram and vocode it back) before attributing audio quality problems to the diffusion model. HiFi-GAN and BigVGAN are high-quality vocoders; WaveGlow is acceptable; Griffin-Lim is unacceptably bad for production use.
:::
11. Interview Q&A
Q1: Why does SDS allow 3D generation without 3D training data?
A pretrained 2D diffusion model has learned a score function that points toward high-quality images consistent with a text prompt. This score function implicitly encodes knowledge about how objects look from different viewpoints - because its training images contain objects photographed from many angles, under many lighting conditions, and at many distances.
SDS exploits this by using the 2D score function as a 3D training signal. When a NeRF renders a view that looks like low-quality noise to the diffusion model, the SDS gradient tells the NeRF: "update your 3D geometry and texture so that this rendered view looks more like a high-quality image matching the prompt." Since this works from any randomly sampled camera position, the NeRF learns a 3D representation that looks plausible from all viewpoints. The key insight: multi-view consistency is implicitly encoded in the 2D training distribution - natural images are not arbitrarily random, they follow 3D projective geometry - and SDS extracts this 3D prior without explicit 3D supervision.
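The SDS update can be sketched on a toy problem. Everything below is a stand-in: a linear map plays the role of the differentiable renderer, `stub_noise_predictor` replaces the pretrained diffusion model with an oracle whose "prior" is a single target image, and the schedule and weighting are simple assumed choices.

```python
import numpy as np

def stub_noise_predictor(x_t, alpha_bar, prior_image):
    """Hypothetical stand-in for a pretrained denoiser whose prior is one image."""
    return (x_t - np.sqrt(alpha_bar) * prior_image) / np.sqrt(1.0 - alpha_bar)

def sds_gradient(theta, render_matrix, prior_image, rng):
    x = render_matrix @ theta      # "render" the current 3D parameters
    t = rng.uniform(0.02, 0.98)    # random noise level
    alpha_bar = 1.0 - t            # toy schedule (assumption)
    eps = rng.normal(size=x.shape)
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = stub_noise_predictor(x_t, alpha_bar, prior_image)
    weight = 1.0 - alpha_bar       # w(t) proportional to the noise variance
    # SDS skips the denoiser's Jacobian and backpropagates only through the renderer.
    return render_matrix.T @ (weight * (eps_hat - eps))

rng = np.random.default_rng(0)
render_matrix = rng.normal(size=(8, 4))             # toy renderer: 4 params -> 8 pixels
prior_image = render_matrix @ rng.normal(size=4)    # an image the prior "likes"
theta = np.zeros(4)

initial_error = np.linalg.norm(render_matrix @ theta - prior_image)
for _ in range(500):
    theta = theta - 0.01 * sds_gradient(theta, render_matrix, prior_image, rng)
final_error = np.linalg.norm(render_matrix @ theta - prior_image)
```

No 3D supervision appears anywhere: the parameters move only because the 2D noise predictor disagrees with the injected noise, and that disagreement points toward images the prior considers likely.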
Q2: How does temporal attention achieve frame-to-frame consistency in video diffusion?
Temporal attention reshapes the feature maps from $(B, T, C, H, W)$ to $(B \cdot H \cdot W, T, C)$ - treating each spatial location's feature vector over all frames as a temporal sequence of $T$ tokens. Self-attention over the temporal dimension allows each frame's feature at position $(h, w)$ to attend to - and be informed by - the features at the same spatial position in all other frames.
This creates a soft global temporal constraint: the denoising prediction at frame $i$ at position $(h, w)$ is conditioned on what happens at $(h, w)$ in frames $1$ through $T$. When the network predicts how to denoise frame $i$'s features, it can "see" whether the same spatial position is stable or moving in adjacent frames and generate consistent content. Without temporal attention, each frame is denoised independently with no information about neighboring frames, producing temporally incoherent results.
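The reshape is the whole trick, so it is worth seeing once in code. This is a minimal NumPy sketch that assumes a feature tensor laid out as (batch, frames, channels, height, width) and omits the learned query, key, value, and output projections for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, C, H, W = 2, 4, 8, 3, 5
feats = rng.normal(size=(B, T, C, H, W))

# (B, T, C, H, W) -> (B, H, W, T, C) -> (B*H*W, T, C): one sequence per pixel.
seq = feats.transpose(0, 3, 4, 1, 2).reshape(B * H * W, T, C)

# Self-attention along the frame axis, independently for every spatial location.
scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(C)  # (B*H*W, T, T)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
attended = weights @ seq                            # (B*H*W, T, C)

# Restore the original layout so the next spatial layer can consume it.
out = attended.reshape(B, H, W, T, C).transpose(0, 3, 4, 1, 2)

# The sequence for batch b, location (h, w) is that pixel's features over time.
b, h, w = 1, 2, 4
pixel_sequence = seq[(b * H + h) * W + w]
```

Each attention matrix is only T x T, so temporal attention is cheap per location; the cost comes from running it at every spatial position.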
Q3: What are the key challenges of applying diffusion to discrete text tokens?
Two fundamental challenges: (1) The standard Gaussian forward process is not defined for discrete tokens. Adding Gaussian noise to an integer token ID produces a non-integer that has no vocabulary interpretation - you cannot decode it back to a word. (2) Any scheme that maps tokens to continuous space (embedding-space diffusion) requires a final rounding step to recover discrete tokens, and this rounding introduces systematic errors, especially for tokens with embeddings close to other tokens in the embedding space.
Masked diffusion (MDLM) addresses both challenges: the forward process is categorical (mask each token with probability $1 - \alpha_t$), the noise state is clearly defined ([MASK] token), and the reverse process unmasks using cross-entropy on discrete tokens - no rounding needed. The tradeoff is that the SDE and score-matching machinery of Gaussian diffusion does not carry over directly; masked diffusion rests on discrete Markov chain theory instead, with a training bound that reduces to a weighted masked-language-modeling loss. It is more natural for discrete data and produces competitive results.
Q4: Why must diffusion models for molecular geometry be SE(3)-equivariant?
A molecule's energy, binding affinity, and chemical validity are functions of the relative positions and orientations of its atoms - not of where the molecule is in absolute space or which direction it faces. Rotating the entire molecule 90 degrees does not change any chemical property.
A network that is not SE(3)-equivariant will treat the same molecule in different orientations as different inputs, wasting capacity on learning orientation invariance from data and producing orientation-dependent predictions. SE(3)-equivariance enforces $f(Rx + t) = R\,f(x) + t$ for all rotations $R$ and translations $t$ - the predicted atom positions transform consistently with the input. This is implemented using equivariant message passing (EGNN, SE3-Transformer, EquiformerV2) where messages between atoms are computed from rotationally invariant features (interatomic distances, dot products of direction vectors) and the output vectors (forces or position updates) are built from relative position vectors, so they rotate with the input.
Q5: How does Sora's spatiotemporal DiT approach differ from the 3D U-Net approach, and why does it scale better?
3D U-Net (Ho et al. style): processes video as a 4D tensor $(T, H, W, C)$ using hierarchical resolution levels. At each level, spatial attention over $H \times W$ and temporal attention over $T$ are applied alternately. Memory scales as $O(T \cdot (HW)^2 + HW \cdot T^2)$ for the attention maps. For 60-second 1080p video, this is completely intractable - at full resolution a single frame is already about two million positions, so the sequence length in the spatial attention layers would be in the millions.
Sora's approach: first compress the video aggressively via a 3D spatiotemporal VAE that reduces both spatial and temporal dimensions simultaneously (the exact compression ratio is not published, but likely 8x-32x in each dimension). This creates a compact 3D latent. The latent is then patchified into spatiotemporal tokens - a fixed-size 3D patch maps to one token. A pure transformer (DiT) attends globally over all these tokens. This scales because: (1) the 3D VAE compression makes the token count manageable even for long videos; (2) transformer architecture scales via standard scaling laws; (3) variable sequence lengths (different video durations) are handled naturally by simply having more or fewer tokens, without any architectural changes.
The DiT also handles variable aspect ratios and frame rates by including these as conditioning information (via learned embeddings), avoiding the letterbox artifacts of fixed-resolution models.
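Patchifying a latent video into tokens is a reshape plus a transpose. The latent shape and patch sizes below are arbitrary illustrative values, since Sora's actual configuration is not published.

```python
import numpy as np

def patchify(latent, pt, ph, pw):
    """latent: (T, H, W, C) -> tokens: (N, pt*ph*pw*C), one row per 3D patch."""
    t, h, w, c = latent.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0
    x = latent.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # group the patch-grid axes first
    return x.reshape(-1, pt * ph * pw * c)

rng = np.random.default_rng(0)
short_clip = rng.normal(size=(4, 8, 12, 16))  # hypothetical latent: 4 frames
long_clip = rng.normal(size=(16, 8, 12, 16))  # same layout, 4x longer

tokens_short = patchify(short_clip, pt=2, ph=2, pw=2)
tokens_long = patchify(long_clip, pt=2, ph=2, pw=2)
```

A longer or larger video only changes the number of rows, not the token width, which is why the same transformer can process different durations and resolutions without architectural changes.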
Q6: What is the key advantage of masked diffusion over autoregressive generation for text, and why hasn't it displaced LLMs?
The key advantage of masked diffusion (MDLM-style) is global coherence and controllability. An autoregressive model generates tokens left-to-right; position $i$ is generated before seeing positions $i+1$ through $N$. This means the beginning of a sentence constrains the end, making it hard to satisfy global constraints (fill in the middle of text, generate a sonnet with a specific rhyme scheme, impose semantic constraints on the whole output). Masked diffusion generates all positions simultaneously, with each position having access to all others at every denoising step - enabling true bidirectional reasoning and easier constraint satisfaction.
Why it hasn't displaced LLMs: autoregressive models have a 10-year head start in scaling infrastructure, training pipelines, and RLHF alignment. They scale predictably with compute. They are efficiently sampled (key-value cache). Text diffusion models are not yet competitive at the 70B+ parameter scale that defines frontier LLMs. The compute per token at inference is also higher for diffusion (multiple denoising steps vs one forward pass per autoregressive token). The gap is narrowing but text diffusion remains a research direction rather than a production-ready LLM replacement in 2024-2025.
12. Emerging Directions - 2025 and Beyond
Flow Matching for All Modalities
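Flow matching (Lipman et al. 2022, Liu et al. 2022) replaces the stochastic DDPM forward process with a deterministic flow along a straight line between a noise sample $\epsilon \sim \mathcal{N}(0, I)$ and a data sample $x$:
$$x_t = (1 - t)\,\epsilon + t\,x, \qquad t \in [0, 1]$$
The model learns the velocity $v_\theta(x_t, t) \approx x - \epsilon$ - how fast and in what direction to move from noise toward data. Because the training trajectories are straight lines, the learned ODE tends to have lower curvature and can be integrated with fewer steps than DDPM sampling. This is the theoretical foundation for FLUX.1's efficiency claims.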
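The training target and the sampler both fit in a few lines. In this sketch the learned network is replaced by an oracle velocity for a dataset containing a single point, which is an assumption made so the example runs without training; function names are illustrative.

```python
import numpy as np

def interpolate(noise, data, t):
    """Point on the straight path from noise (t = 0) to data (t = 1)."""
    return (1.0 - t) * noise + t * data

def target_velocity(noise, data):
    """Regression target for the velocity network: constant along the path."""
    return data - noise

def euler_sample(velocity_fn, noise, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x, dt = noise.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
data_point = np.array([2.0, -1.0])
noise = rng.normal(size=2)

def oracle_velocity(x, t):
    # Exact velocity field when the data distribution is a single point.
    return (data_point - x) / (1.0 - t)

one_step = euler_sample(oracle_velocity, noise, n_steps=1)
four_steps = euler_sample(oracle_velocity, noise, n_steps=4)
```

Because the path is a straight line, a single Euler step already lands on the data point in this toy case. Real learned fields are only approximately straight, so a handful of steps is typical.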
Flow matching generalizes to non-Gaussian targets and non-Euclidean geometries. For molecular generation, you can define flows on the SO(3) rotation manifold directly. For audio, you can flow from a silence-filled spectrogram rather than Gaussian noise, providing a better initialization.
World Models and Interactive Simulation
Video diffusion as a world model is an active research frontier. Genie (Google DeepMind, 2024) generates interactive simulations - you can "play" them with keyboard controls. The model generates the next video frame conditioned on the previous frame and an action token. This is directly applicable to robotics: train the world model on robot manipulation video, then use it for model-based planning without physical hardware.
DIAMOND (Alonso et al. 2024) demonstrates a video diffusion model as an interactive Atari game environment - the game runs entirely inside the generative model. The gap between "realistic video generation" and "simulated world for robot training" is narrowing rapidly.
Diffusion for Scientific Discovery
Beyond drug discovery, diffusion models are being applied to materials science (generating crystal structures with target electronic or mechanical properties using SE(3)-equivariant diffusion over periodic unit cells), climate modeling (generating physically consistent atmospheric states conditioned on observations), and genomics (generating DNA sequences with target regulatory properties using masked diffusion over the 4-letter nucleotide alphabet).
The pattern is consistent: wherever there is a high-dimensional structured distribution with domain-specific symmetries that is expensive to sample from conventionally, diffusion provides a powerful generative framework - provided you design the forward process, noise model, and denoising architecture to respect the domain's structure.
13. Interview Q&A
Q2: How does temporal attention achieve frame-to-frame consistency in video diffusion?
Temporal attention reshapes the feature maps from $(B, T, C, H, W)$ to $(B \cdot H \cdot W, T, C)$ - treating each spatial location's feature vector over all frames as a temporal sequence of $T$ tokens. Self-attention over the temporal dimension allows each frame's feature at position $(h, w)$ to attend to - and be informed by - the features at the same spatial position in all other frames.
This creates a soft global temporal constraint: the denoising prediction at frame $i$ at position $(h, w)$ is conditioned on what happens at $(h, w)$ in frames $1$ through $T$. When the network predicts how to denoise frame $i$'s features, it "sees" whether the same spatial position is stable or moving in adjacent frames and generates consistent content. Without temporal attention, each frame is denoised independently - producing per-frame photorealism with no coherent motion.
The zero-initialization of the temporal attention output projection is the critical training detail: the model starts with the 2D image model's behavior fully intact, and temporal coherence is learned gradually from a stable pretrained initialization.
Q3: What are the key challenges of applying diffusion to discrete text tokens?
Two fundamental challenges: (1) The standard Gaussian forward process is not defined for discrete tokens. Adding Gaussian noise to an integer token ID (e.g., 4523 for "cat") produces a non-integer that has no valid vocabulary interpretation. (2) Any scheme that maps tokens to continuous space requires a final rounding step to recover discrete tokens, introducing systematic errors - particularly for tokens with embeddings close to other tokens.
Masked diffusion (MDLM) addresses both: the forward process is categorical (mask each token with probability $1 - \alpha_t$, which grows with $t$), the "noise" state is clearly defined ([MASK] token), and the reverse process unmasks tokens via cross-entropy on discrete tokens - no rounding needed. Masked diffusion also enables bidirectional context: the transformer sees all positions at each denoising step, unlike autoregressive LLMs that generate left-to-right with only leftward context.
Q4: Why must diffusion models for molecular geometry be SE(3)-equivariant?
A molecule's energy, binding affinity, and chemical validity are functions of the relative positions and orientations of its atoms - not of where the molecule sits in absolute space. Rotating the entire molecule 90 degrees changes none of its chemical properties.
A non-equivariant network will treat the same molecule differently depending on its orientation, wasting capacity on spurious orientation-dependent patterns and producing inconsistent predictions. SE(3)-equivariance enforces $f(Rx + t) = R\,f(x) + t$ for all rotations $R$ and translations $t$ - the predicted atom positions transform consistently with the input. This is implemented using equivariant message passing (EGNN, SE3-Transformer) where messages are computed from rotation-invariant features (interatomic distances, dot products of direction vectors) and output vectors (forces or position updates) are built from relative position vectors, so they rotate with the input.
Q5: How does Sora's spatiotemporal DiT approach differ from the 3D U-Net approach, and why does it scale better?
3D U-Net processes video as a 4D tensor $(T, H, W, C)$ using hierarchical resolution levels with alternating spatial and temporal attention. Memory scales as $O(T \cdot (HW)^2 + HW \cdot T^2)$ for attention maps. For 60-second 1080p video, this makes the naive 3D U-Net completely intractable - at full resolution the spatial attention sequence length would be about two million tokens per frame.
Sora's approach: first compress the video aggressively via a 3D spatiotemporal VAE reducing both spatial and temporal dimensions simultaneously. The compact 3D latent is then patchified into spatiotemporal tokens. A pure transformer (DiT) attends globally over all these tokens. This scales because: (1) the 3D VAE compression makes token count manageable even for long videos; (2) transformer architecture scales cleanly via standard scaling laws with no architectural changes needed for different lengths; (3) variable aspect ratios and frame rates are handled naturally by conditioning on these properties rather than baking them into the architecture. The global attention over all spatiotemporal tokens (vs factorized spatial+temporal) also enables better long-range consistency.
Q6: What would you change if applying diffusion to generate DNA sequences?
DNA sequences are discrete - four nucleotides (A, T, G, C) at each position. The approach mirrors text diffusion but with a smaller vocabulary and domain-specific structure.
Forward process: use masked diffusion (absorbing state = [MASK] token) rather than Gaussian. The forward process gradually masks nucleotides; the reverse process predicts which nucleotide fills each masked position from bidirectional sequence context.
Architecture: a bidirectional transformer (similar to BERT) as the denoising network. Unlike language, DNA has important long-range dependencies (enhancer-promoter interactions spanning thousands of bases) and is inherently bidirectional - neither strand is "the beginning." Bidirectional attention naturally captures these dependencies.
Conditioning: condition on desired properties - gene expression level, transcription factor binding motifs, species origin - injected via cross-attention or concatenated embeddings.
Validity constraints: generated sequences must satisfy biological validity: correct codon structure (if coding sequence), GC content within viable range, no premature stop codons (for open reading frames). These constraints are applied as post-processing filters or incorporated as training objectives.
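A post-processing filter for generated coding sequences might look like the sketch below. The GC-content bounds and the requirement of an `ATG` start codon are illustrative assumptions; appropriate thresholds depend on the organism and the application.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def has_premature_stop(orf):
    """True if any codon before the final one is a stop codon."""
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 3, 3)]
    return any(codon in STOP_CODONS for codon in codons)

def passes_filters(orf, gc_min=0.3, gc_max=0.7):
    """Basic validity checks for a generated open reading frame."""
    return (
        len(orf) >= 6
        and len(orf) % 3 == 0
        and orf.startswith("ATG")
        and orf[-3:] in STOP_CODONS
        and gc_min <= gc_content(orf) <= gc_max
        and not has_premature_stop(orf)
    )

samples = ["ATGGCCGAATAA", "ATGTAAGCCTAA", "ATGGCCGAATA"]
kept = [s for s in samples if passes_filters(s)]
```

The first sample passes, the second is rejected for an in-frame stop codon right after the start, and the third is rejected because its length is not a multiple of three.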
Practical systems like Evo (Arc Institute, 2024) use a different approach - a large autoregressive model trained on genomic sequences - but diffusion offers the bidirectional generation advantage for applications like motif insertion (generating a sequence that contains a specified functional motif in a natural context).
14. Practical Guidance: Choosing the Right Architecture by Domain
When a new generative modeling task arrives, use this decision guide:
Is the data continuous?
Yes, and it has 2D spatial structure (images, spectrograms, depth maps): use latent diffusion with a 2D U-Net or DiT. Compress with a VAE to a compact latent before diffusing.
Yes, and it is a 1D temporal signal (raw audio waveforms, ECG, vibration signals): use 1D U-Net diffusion. Consider mel spectrogram conversion to 2D if processing infrastructure for images is more mature.
Yes, and it is a 3D geometric structure (molecules, protein coordinates): use SE(3)-equivariant graph network. Apply Gaussian diffusion for translations and Riemannian diffusion (Brownian motion on SO(3)) for rotations.
Yes, and it is a tabular vector of mixed features: use MLP-based diffusion (TabDDPM approach). Apply Gaussian diffusion to the normalized numeric features and multinomial diffusion to the one-hot encoded categoricals, then feed the concatenated vector to the MLP.
No, it is discrete tokens (text, DNA, SMILES strings): use masked diffusion with a bidirectional transformer. Avoid embedding-space Gaussian diffusion unless you have a specific need - rounding errors degrade quality.
How long is the sequence?
Under 1,000 elements: direct diffusion in data space is feasible.
1,000–100,000 elements: compress via a learned autoencoder (VAE), then diffuse in the compact latent.
Over 100,000 elements (long videos, genome-scale DNA): patch/segment approach, or autoregressive generation of segments with diffusion within each segment.
Is compute the primary constraint?
Reduce diffusion steps via DPM-Solver++ (requires 10-20 steps with minimal quality loss), Consistency Models (1-4 steps), or DDIM (deterministic, ~50 steps). Use latent diffusion to reduce the resolution at which diffusion operates.
Do you need a generative prior for optimization (like SDS for 3D)?
Use an existing pretrained domain-appropriate diffusion model as the score function. Define a differentiable mapping from your optimization target (the thing you want to generate/optimize) to the domain the diffusion model understands. Use SDS or variational score distillation to update the target via the pretrained model's gradient.
This lesson is part of the Diffusion Models module. Next: Evaluating Generative Models - FID, IS, Precision/Recall, Human Evaluation.
