
Autoencoders


:::note Interview Relevance - High Autoencoders appear in interviews for ML engineering, research, and AI engineering roles. Key topics: bottleneck and reconstruction loss derivation, why denoising is better than undercomplete alone, how sparse autoencoders differ from LASSO, reconstruction error for anomaly detection, and - for research roles - the Anthropic monosemanticity connection to sparse autoencoders. :::

The Real Interview Moment

You are interviewing at a fraud detection company. The interviewer describes a painful problem: "We have 200 million transactions per day. We know fraud accounts for 0.01% of them - about 20,000 fraudulent transactions. But we only have labels on about 500 of those. Our supervised model is starving for data. What do you do?"

You think for a moment. Then: "I'd train an autoencoder on the unlabeled transactions, which are about 99.99% legitimate. The autoencoder learns what 'normal' looks like - its reconstruction loss captures the statistical structure of legitimate transactions. When a fraudulent transaction comes in, the autoencoder fails to reconstruct it well. High reconstruction error becomes an anomaly score. I don't need any labels to train it. Then I use the 500 labeled frauds to calibrate the threshold."

The interviewer leans forward. "What if fraudsters learn to mimic normal transaction patterns?"

"Then reconstruction error degrades as an anomaly signal - but I can monitor that. And I can use the autoencoder's latent representation as input to a classifier trained on the 500 labeled examples. The latent space learned from roughly 200 million mostly-normal samples gives the classifier far richer features than training on 500 examples alone."

That exchange describes two of the most powerful uses of autoencoders in production: anomaly detection via reconstruction error, and semi-supervised representation learning.

Why This Exists - The Representation Learning Problem

Supervised learning requires labels. Labels are expensive. For a 200-million-transaction dataset, getting even 1% labeled requires 2 million human annotations. The question: can a neural network learn useful representations of data without any labels?

The autoencoder's answer: make the network reconstruct its own input through a bottleneck. To do this well, it must learn to compress the input into a compact representation that captures the essential structure - the factors of variation that matter for distinguishing different inputs. This is unsupervised representation learning.

The bottleneck is the key: without it, the network could simply copy the input (identity mapping). The bottleneck forces compression - the network must decide what to remember and what to discard.

Historical Context

Autoencoders were first proposed by Rumelhart, Hinton, and Williams in 1986 in the backpropagation paper. The denoising autoencoder was formalized by Vincent et al. (2008, "Extracting and Composing Robust Features with Denoising Autoencoders"). Sparse autoencoders were studied extensively by Ng et al. around 2011 in the context of sparse coding for visual cortex models. The recent resurgence comes from two directions: variational autoencoders (VAEs, Kingma & Welling 2013) for generation, and sparse autoencoders for LLM interpretability (Anthropic, 2023–2025).

Architecture: Encoder → Bottleneck → Decoder

An autoencoder is a neural network trained to reconstruct its input through a low-dimensional bottleneck:

Input x (d dims)
    │
[Encoder f: x → z]      W_1, W_2, ..., W_k (learnable)
    │
    ▼
z (bottleneck, latent_dim << d)   ← the learned representation
    │
[Decoder g: z → x̂]      W_k+1, ..., W_n (learnable)
    │
    ▼
x̂ (reconstruction, d dims)

The input x is also passed directly to the loss, where it is compared with x̂:

Loss: L(x, x̂) = ||x - x̂||²    (MSE, continuous data)
      L(x, x̂) = BCE(x, x̂)     (binary/image data in [0,1])

The information bottleneck principle: To minimize $L(x, \hat{x})$, the encoder must compress $x$ into $z$ such that $z$ retains sufficient information to reconstruct $x$. The smaller the bottleneck, the more the network is forced to learn a compact, meaningful representation.

Reconstruction loss formulas:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{nd}\sum_{i=1}^n \|x_i - \hat{x}_i\|^2 = \frac{1}{nd}\sum_{i=1}^n\sum_{j=1}^d (x_{ij} - \hat{x}_{ij})^2$$

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^n \sum_{j=1}^d \left[x_{ij}\log\hat{x}_{ij} + (1-x_{ij})\log(1-\hat{x}_{ij})\right]$$

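Use MSE for continuous-valued data (sensor readings, standardized features). Use BCE for data in $[0, 1]$ with Sigmoid output (pixel intensities, binary features). BCE penalizes confident wrong predictions more strongly than MSE, because its per-entry cost grows without bound as the prediction approaches the wrong extreme, while the squared error on $[0, 1]$ data is capped at 1. In practice this often yields sharper reconstructions on near-binary images.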
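To make the comparison concrete, here is a minimal NumPy sketch of the two losses. The helper names `mse_loss` and `bce_loss` and the toy values are illustrative assumptions, not part of any library; the MSE helper averages over all entries (as `nn.MSELoss` does by default) and the BCE helper sums over features and averages over samples.

```python
import numpy as np

def mse_loss(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Squared error averaged over all n*d entries."""
    return float(np.mean((x - x_hat) ** 2))

def bce_loss(x: np.ndarray, x_hat: np.ndarray, eps: float = 1e-12) -> float:
    """Binary cross-entropy: summed over features, averaged over samples."""
    x_hat = np.clip(x_hat, eps, 1 - eps)  # avoid log(0)
    per_entry = -(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
    return float(per_entry.sum(axis=1).mean())

# One "pixel" whose true value is 1 (hypothetical toy data)
x = np.array([[1.0]])
mild_error = np.array([[0.6]])        # slightly wrong prediction
confident_wrong = np.array([[0.01]])  # confidently wrong prediction

mse_mild, mse_wrong = mse_loss(x, mild_error), mse_loss(x, confident_wrong)
bce_mild, bce_wrong = bce_loss(x, mild_error), bce_loss(x, confident_wrong)

print(f"MSE: mild={mse_mild:.4f}  confident-wrong={mse_wrong:.4f}")
print(f"BCE: mild={bce_mild:.4f}  confident-wrong={bce_wrong:.4f}")
```

The squared error can never exceed 1 for a target in [0, 1], while the cross-entropy term keeps growing as the prediction approaches the wrong extreme.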

Basic PyTorch Autoencoder

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import matplotlib.pyplot as plt

class Autoencoder(nn.Module):
    """
    Fully connected autoencoder with symmetric encoder-decoder architecture.

    Architecture: input → [hidden_dims] → latent → [hidden_dims reversed] → input
    BatchNorm + ReLU on hidden layers; no activation on final encoder/decoder layers.
    """
    def __init__(self, input_dim: int, latent_dim: int,
                 hidden_dims: list = None, dropout: float = 0.0):
        super().__init__()
        if hidden_dims is None:
            hidden_dims = [256, 128]

        # ── Encoder ──────────────────────────────────────────────────────────
        encoder_layers = []
        in_dim = input_dim
        for h_dim in hidden_dims:
            encoder_layers.extend([
                nn.Linear(in_dim, h_dim),
                nn.BatchNorm1d(h_dim),
                nn.ReLU(inplace=True),
            ])
            if dropout > 0:
                encoder_layers.append(nn.Dropout(dropout))
            in_dim = h_dim

        encoder_layers.append(nn.Linear(in_dim, latent_dim))
        # No activation at bottleneck - allow full real-valued latent space
        self.encoder = nn.Sequential(*encoder_layers)

        # ── Decoder (mirror of encoder) ────────────────────────────────────
        decoder_layers = []
        in_dim = latent_dim
        for h_dim in reversed(hidden_dims):
            decoder_layers.extend([
                nn.Linear(in_dim, h_dim),
                nn.BatchNorm1d(h_dim),
                nn.ReLU(inplace=True),
            ])
            if dropout > 0:
                decoder_layers.append(nn.Dropout(dropout))
            in_dim = h_dim

        decoder_layers.append(nn.Linear(in_dim, input_dim))
        # No output activation for MSE loss (continuous data)
        # Use Sigmoid() if using BCE loss with [0,1] data
        self.decoder = nn.Sequential(*decoder_layers)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

    def forward(self, x: torch.Tensor):
        z = self.encode(x)
        x_hat = self.decode(z)
        return x_hat, z


def train_autoencoder(model: nn.Module, X_train: np.ndarray,
                      X_val: np.ndarray = None,
                      n_epochs: int = 50, batch_size: int = 256,
                      lr: float = 1e-3, device: str = 'cpu') -> dict:
    """
    Training loop with validation loss tracking and learning rate scheduling.
    Returns: dict with 'train' and 'val' loss lists.
    """
    model = model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    # `verbose` is omitted: it is deprecated in recent PyTorch releases
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5,
                                                     factor=0.5)
    criterion = nn.MSELoss()

    X_tensor = torch.as_tensor(X_train, dtype=torch.float32).to(device)
    dataset = TensorDataset(X_tensor)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Build the validation tensor once rather than on every epoch
    X_val_tensor = None
    if X_val is not None:
        X_val_tensor = torch.as_tensor(X_val, dtype=torch.float32).to(device)

    history = {'train': [], 'val': []}

    for epoch in range(n_epochs):
        # Training
        model.train()
        epoch_loss = 0.0
        for (batch,) in loader:
            optimizer.zero_grad()
            x_hat, _ = model(batch)
            loss = criterion(x_hat, batch)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            epoch_loss += loss.item() * len(batch)

        train_loss = epoch_loss / len(X_train)
        history['train'].append(train_loss)

        # Validation
        if X_val_tensor is not None:
            model.eval()
            with torch.no_grad():
                x_hat_val, _ = model(X_val_tensor)
                val_loss = criterion(x_hat_val, X_val_tensor).item()
            history['val'].append(val_loss)
            scheduler.step(val_loss)
        else:
            scheduler.step(train_loss)

        if (epoch + 1) % 10 == 0:
            val_str = f" Val: {val_loss:.6f}" if X_val_tensor is not None else ""
            print(f"Epoch {epoch+1:3d}/{n_epochs} Train: {train_loss:.6f}{val_str}")

    return history

MNIST Autoencoder: Visualizing Reconstructions

from torchvision import datasets, transforms
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

def build_mnist_autoencoder():
    """Convolutional autoencoder for 28x28 MNIST images."""

    class ConvAutoencoder(nn.Module):
        def __init__(self, latent_dim: int = 32):
            super().__init__()
            # Encoder: 28×28 → 14×14 → 7×7 → latent_dim
            self.encoder_conv = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1),   # → 14×14×32
                nn.BatchNorm2d(32),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(32, 64, 3, stride=2, padding=1),  # → 7×7×64
                nn.BatchNorm2d(64),
                nn.LeakyReLU(0.2, inplace=True),
            )
            self.encoder_fc = nn.Linear(64 * 7 * 7, latent_dim)

            # Decoder: latent_dim → 7×7×64 → 14×14×32 → 28×28×1
            self.decoder_fc = nn.Linear(latent_dim, 64 * 7 * 7)
            self.decoder_conv = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),  # → 14×14
                nn.BatchNorm2d(32),
                nn.LeakyReLU(0.2, inplace=True),
                nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1),   # → 28×28
                nn.Sigmoid(),  # pixel values in [0, 1] - use with BCE loss
            )

        def encode(self, x):
            h = self.encoder_conv(x)
            return self.encoder_fc(h.view(h.size(0), -1))

        def decode(self, z):
            h = self.decoder_fc(z).view(-1, 64, 7, 7)
            return self.decoder_conv(h)

        def forward(self, x):
            z = self.encode(x)
            return self.decode(z), z

    return ConvAutoencoder(latent_dim=32)


def train_mnist_ae(n_epochs: int = 20, batch_size: int = 256):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Training on {device}")

    # Load MNIST
    transform = transforms.Compose([transforms.ToTensor()])
    mnist_train = datasets.MNIST('./data', train=True, download=True, transform=transform)
    mnist_test = datasets.MNIST('./data', train=False, download=True, transform=transform)

    train_loader = DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=2)
    test_loader = DataLoader(mnist_test, batch_size=512, shuffle=False)

    model = build_mnist_autoencoder().to(device)
    optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    criterion = nn.BCELoss()  # BCE for pixel values in [0, 1]

    train_losses = []
    for epoch in range(n_epochs):
        model.train()
        total_loss = 0.0
        for x, _ in train_loader:
            x = x.to(device)
            optimizer.zero_grad()
            x_hat, _ = model(x)
            loss = criterion(x_hat, x)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg = total_loss / len(train_loader)
        train_losses.append(avg)
        if (epoch + 1) % 5 == 0:
            print(f"Epoch {epoch+1}/{n_epochs} BCE Loss: {avg:.4f}")

    # Visualize reconstructions
    model.eval()
    x_batch, y_batch = next(iter(test_loader))
    x_sample = x_batch[:10].to(device)

    with torch.no_grad():
        x_recon, z = model(x_sample)

    fig, axes = plt.subplots(3, 10, figsize=(20, 6))
    for i in range(10):
        # Original
        axes[0, i].imshow(x_sample[i, 0].cpu(), cmap='gray', vmin=0, vmax=1)

        # Reconstruction
        axes[1, i].imshow(x_recon[i, 0].cpu(), cmap='gray', vmin=0, vmax=1)

        # Residual (absolute error)
        residual = (x_sample[i, 0] - x_recon[i, 0]).abs().cpu()
        axes[2, i].imshow(residual, cmap='hot', vmin=0, vmax=0.5)

    # Hide ticks but keep the axes on: axis('off') would also hide the row labels
    for ax in axes.ravel():
        ax.set_xticks([])
        ax.set_yticks([])
    for row, name in enumerate(["Original", "Reconstructed", "Residual"]):
        axes[row, 0].set_ylabel(name, fontsize=11)

    axes[0, 5].set_title("Autoencoder Reconstruction - latent_dim=32", fontsize=12, pad=20)
    plt.tight_layout()
    plt.show()

    return model, train_losses

Reconstruction Error as Anomaly Score

The reconstruction error is the most direct anomaly signal from an autoencoder:

import torch
import torch.nn as nn
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Uses Autoencoder and train_autoencoder from the "Basic PyTorch Autoencoder" section.

def reconstruction_error(model: nn.Module, X: np.ndarray,
                         device: str = 'cpu') -> np.ndarray:
    """Per-sample reconstruction MSE - higher = more anomalous."""
    model.eval()
    X_tensor = torch.as_tensor(X, dtype=torch.float32).to(device)
    with torch.no_grad():
        x_hat, _ = model(X_tensor)
        errors = ((X_tensor - x_hat) ** 2).mean(dim=1).cpu().numpy()
    return errors


class AnomalyDetector:
    """
    Autoencoder-based anomaly detector.
    Fit on normal data only; score new samples by reconstruction error.
    """

    def __init__(self, input_dim: int, latent_dim: int,
                 hidden_dims: list = None, threshold_percentile: float = 99.0):
        self.model = Autoencoder(input_dim, latent_dim, hidden_dims)
        self.scaler = StandardScaler()
        self.threshold_percentile = threshold_percentile
        self.threshold_ = None
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

    def fit(self, X_normal: np.ndarray, n_epochs: int = 50,
            batch_size: int = 256, lr: float = 1e-3) -> 'AnomalyDetector':
        """Fit on normal data only."""
        X_scaled = self.scaler.fit_transform(X_normal)

        self.history_ = train_autoencoder(
            self.model, X_scaled, n_epochs=n_epochs,
            batch_size=batch_size, lr=lr, device=self.device
        )

        # Set threshold from training errors
        train_errors = reconstruction_error(self.model, X_scaled, self.device)
        self.threshold_ = np.percentile(train_errors, self.threshold_percentile)
        self.train_error_p50_ = np.percentile(train_errors, 50)

        print(f"\nThreshold ({self.threshold_percentile}th pct): {self.threshold_:.6f}")
        print(f"Median train error: {self.train_error_p50_:.6f}")

        return self

    def score(self, X: np.ndarray) -> np.ndarray:
        X_scaled = self.scaler.transform(X)
        return reconstruction_error(self.model, X_scaled, self.device)

    def predict(self, X: np.ndarray) -> np.ndarray:
        """True = anomaly."""
        return self.score(X) > self.threshold_

    def evaluate_roc(self, X_normal: np.ndarray, X_anomaly: np.ndarray):
        normal_scores = self.score(X_normal)
        anomaly_scores = self.score(X_anomaly)

        y_true = np.array([0]*len(X_normal) + [1]*len(X_anomaly))
        y_score = np.concatenate([normal_scores, anomaly_scores])

        auc = roc_auc_score(y_true, y_score)
        fpr, tpr, thresholds = roc_curve(y_true, y_score)

        fig, axes = plt.subplots(1, 2, figsize=(14, 5))

        # Score distributions
        axes[0].hist(normal_scores, bins=60, alpha=0.6, color='#3b82f6',
                     density=True, label=f'Normal (n={len(X_normal)})')
        axes[0].hist(anomaly_scores, bins=60, alpha=0.6, color='#ef4444',
                     density=True, label=f'Anomaly (n={len(X_anomaly)})')
        axes[0].axvline(self.threshold_, color='#1e293b', linestyle='--',
                        linewidth=2, label=f'Threshold={self.threshold_:.4f}')
        axes[0].set_xlabel("Reconstruction Error (MSE)")
        axes[0].set_ylabel("Density")
        axes[0].set_title(f"Score Distribution - AUC-ROC={auc:.4f}")
        axes[0].legend()

        # ROC curve
        axes[1].plot(fpr, tpr, color='#7c3aed', linewidth=2, label=f'AUC={auc:.4f}')
        axes[1].plot([0, 1], [0, 1], 'k--', linewidth=1, alpha=0.5)
        axes[1].set_xlabel("False Positive Rate")
        axes[1].set_ylabel("True Positive Rate")
        axes[1].set_title("ROC Curve")
        axes[1].legend()

        plt.tight_layout()
        plt.show()

        print(f"AUC-ROC: {auc:.4f}")
        print(f"Detection rate: {(anomaly_scores > self.threshold_).mean():.1%}")
        print(f"False alarm rate: {(normal_scores > self.threshold_).mean():.1%}")

        return auc

Denoising Autoencoder (DAE)

A denoising autoencoder (Vincent et al., 2008) corrupts the input with noise during training and learns to reconstruct the clean input. This forces the model to learn the underlying data manifold rather than a near-identity mapping.

Mathematical motivation: Training a DAE is equivalent to learning to estimate $\mathbb{E}[x \mid \tilde{x}]$ - the clean signal given its noisy observation. This is related to score matching and forms the theoretical foundation of score-based generative models (diffusion models).

Objective: $\mathcal{L}_{\text{DAE}} = \frac{1}{n}\sum_i \|x_i - g(f(\tilde{x}_i))\|^2$

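where $f$ is the encoder and $g$ the decoder, and the corrupted input is either $\tilde{x}_i = x_i + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2 I)$ (Gaussian noise) or $\tilde{x}_i = x_i \odot m_i$ (masking noise), where each entry of $m_i$ is drawn from $\text{Bernoulli}(1 - p)$ so that a fraction $p$ of the inputs is zeroed on average.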
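The claim that a denoiser trained with squared error estimates the conditional mean follows from a one-line minimization. The sketch below writes $r$ for the network's reconstruction of a given noisy input (a symbol introduced here for illustration) and assumes the network is flexible enough to reach the pointwise optimum.

```latex
% Expected squared error for a fixed noisy input \tilde{x}
\mathbb{E}\big[\|x - r\|^2 \mid \tilde{x}\big]
  = \|r\|^2 - 2\, r^\top \mathbb{E}[x \mid \tilde{x}]
    + \mathbb{E}\big[\|x\|^2 \mid \tilde{x}\big]

% Setting the gradient with respect to r to zero
2r - 2\,\mathbb{E}[x \mid \tilde{x}] = 0
  \quad\Longrightarrow\quad
  r^*(\tilde{x}) = \mathbb{E}[x \mid \tilde{x}]
```

Because the objective is quadratic in $r$, this stationary point is the unique minimizer.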

import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim

class DenoisingAutoencoder(nn.Module):
    """
    Denoising autoencoder: corrupts input with noise, learns to recover clean input.

    noise_type: 'gaussian' | 'masking' | 'salt_pepper'
    noise_level: std for gaussian, drop probability for masking,
                 corruption probability for salt_pepper
    """
    def __init__(self, input_dim: int, latent_dim: int,
                 hidden_dims: list = None,
                 noise_type: str = 'gaussian',
                 noise_level: float = 0.2):
        super().__init__()
        if hidden_dims is None:
            hidden_dims = [256, 128]

        self.noise_type = noise_type
        self.noise_level = noise_level

        # Same architecture as standard AE
        # Encoder
        enc = []
        in_d = input_dim
        for h in hidden_dims:
            enc.extend([nn.Linear(in_d, h), nn.BatchNorm1d(h), nn.ReLU(inplace=True)])
            in_d = h
        enc.append(nn.Linear(in_d, latent_dim))
        self.encoder = nn.Sequential(*enc)

        # Decoder
        dec = []
        in_d = latent_dim
        for h in reversed(hidden_dims):
            dec.extend([nn.Linear(in_d, h), nn.BatchNorm1d(h), nn.ReLU(inplace=True)])
            in_d = h
        dec.append(nn.Linear(in_d, input_dim))
        self.decoder = nn.Sequential(*dec)

    def corrupt(self, x: torch.Tensor) -> torch.Tensor:
        """Apply noise only during training."""
        if not self.training:
            return x

        if self.noise_type == 'gaussian':
            return x + torch.randn_like(x) * self.noise_level

        elif self.noise_type == 'masking':
            # Set random fraction of inputs to zero
            mask = torch.bernoulli(torch.ones_like(x) * (1 - self.noise_level))
            return x * mask

        elif self.noise_type == 'salt_pepper':
            # Randomly set pixels to 0 or 1 (assumes inputs scaled to [0, 1])
            noise = torch.randint(0, 2, x.shape, dtype=x.dtype, device=x.device)
            mask = torch.bernoulli(torch.ones_like(x) * self.noise_level)
            return x * (1 - mask) + noise * mask

        return x

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(self.corrupt(x))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

    def forward(self, x: torch.Tensor):
        z = self.encode(x)      # encodes from noisy input
        x_hat = self.decode(z)
        return x_hat, z         # x_hat reconstructed from noisy, compared to clean x


def train_denoising_ae(model: DenoisingAutoencoder,
                       X_train: np.ndarray,
                       n_epochs: int = 50,
                       batch_size: int = 256,
                       lr: float = 1e-3) -> list:
    """
    Training loop: CRUCIAL - loss compares x_hat to CLEAN x (not noisy).
    The model sees noisy input; the target is clean input.
    """
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    criterion = nn.MSELoss()
    X_tensor = torch.as_tensor(X_train, dtype=torch.float32)
    loader = DataLoader(TensorDataset(X_tensor), batch_size=batch_size, shuffle=True)

    losses = []
    for epoch in range(n_epochs):
        model.train()
        total = 0.0
        for (batch,) in loader:
            optimizer.zero_grad()
            x_hat, _ = model(batch)         # internally corrupts batch
            loss = criterion(x_hat, batch)  # compare to CLEAN batch - critical!
            loss.backward()
            optimizer.step()
            total += loss.item() * len(batch)

        avg = total / len(X_train)
        losses.append(avg)
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{n_epochs} DAE Loss: {avg:.6f}")

    return losses

Sparse Autoencoder

A sparse autoencoder adds an $\ell_1$ penalty on the bottleneck activations, forcing most neurons to be near-zero for any given input. This produces sparse codes where each neuron responds to a specific, interpretable pattern:

Loss function: $\mathcal{L}_{\text{sparse}} = \underbrace{\|x - \hat{x}\|^2}_{\text{reconstruction}} + \underbrace{\lambda \|z\|_1}_{\text{sparsity penalty}}$

The $\ell_1$ penalty promotes sparse activations (most entries of $z$ near zero) - similar to LASSO regression, which uses $\ell_1$ to promote sparse coefficients. The difference: in a sparse AE the sparsity is on neural activations, not on model weights.
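The distinction can be sketched in a few lines of PyTorch. The helper names `activation_l1` and `weight_l1`, the layer sizes, and the value of `lam` are illustrative assumptions; the point is only that the first penalty changes with the input batch while the second depends on the parameters alone.

```python
import torch
import torch.nn as nn

def activation_l1(encoder: nn.Module, x: torch.Tensor, lam: float) -> torch.Tensor:
    """Sparse-AE style penalty: L1 norm of the codes z, averaged over the batch."""
    z = torch.relu(encoder(x))
    return lam * z.abs().sum(dim=1).mean()

def weight_l1(encoder: nn.Linear, lam: float) -> torch.Tensor:
    """LASSO-style penalty: L1 norm of the weights, independent of the input."""
    return lam * encoder.weight.abs().sum()

torch.manual_seed(0)
encoder = nn.Linear(20, 64)   # hypothetical sizes
lam = 1e-3
batch_a = torch.randn(8, 20)
batch_b = torch.randn(8, 20)

act_a = activation_l1(encoder, batch_a, lam)
act_b = activation_l1(encoder, batch_b, lam)
w_a = weight_l1(encoder, lam)
w_b = weight_l1(encoder, lam)

print(f"activation penalty: batch A={act_a.item():.5f}, batch B={act_b.item():.5f}")
print(f"weight penalty:     {w_a.item():.5f} (same for every batch)")
```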

Polysemanticity and superposition: Language models are believed to represent more features than they have neurons by "superposing" multiple features onto the same neurons, which makes individual neurons polysemantic. A sparse autoencoder trained on the model's activations can disentangle these superposed features into approximately monosemantic directions - ideally one latent unit per concept.
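The geometric intuition behind superposition is that a high-dimensional space has room for many more nearly-orthogonal directions than it has dimensions. The following NumPy sketch uses random unit vectors as a stand-in for learned feature directions; the sizes (512 features in 128 dimensions) are arbitrary illustrative choices, and a trained model's features need not look like random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 128, 512   # 4x more "features" than dimensions (hypothetical sizes)

# Random unit vectors standing in for feature directions
directions = rng.normal(size=(n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Pairwise cosine similarities, excluding each vector with itself
cos = directions @ directions.T
off_diag = cos[~np.eye(n_features, dtype=bool)]

mean_abs_cos = float(np.abs(off_diag).mean())
max_abs_cos = float(np.abs(off_diag).max())

print(f"mean |cos| between distinct directions: {mean_abs_cos:.3f}")
print(f"max  |cos| between distinct directions: {max_abs_cos:.3f}")
```

The typical overlap is small, so if only a few features are active at once, each can be read off with limited interference from the others. That sparsity assumption is what the $\ell_1$ penalty exploits.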

import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim

class SparseAutoencoder(nn.Module):
    """
    Sparse autoencoder with L1 sparsity on latent activations.

    Two variants:
    - undercomplete: latent_dim < input_dim (bottleneck alone forces compression)
    - overcomplete: latent_dim >> input_dim (need L1 to prevent identity mapping)

    The Anthropic monosemanticity paper uses overcomplete SAEs (latent_dim = 4096-16384)
    trained on MLP layer activations of small language models.
    """

    def __init__(self, input_dim: int, latent_dim: int,
                 sparsity_weight: float = 1e-3,
                 tied_weights: bool = False):
        """
        tied_weights: decoder weights = encoder weights transposed
        Reduces parameters, acts as additional regularizer.
        """
        super().__init__()
        self.sparsity_weight = sparsity_weight
        self.tied_weights = tied_weights

        # Encoder: input → latent (with ReLU to produce sparse non-negative activations)
        self.encoder_weight = nn.Parameter(torch.randn(latent_dim, input_dim) * 0.01)
        self.encoder_bias = nn.Parameter(torch.zeros(latent_dim))

        if not tied_weights:
            # Shape (input_dim, latent_dim): applied as z @ decoder_weight.T
            self.decoder_weight = nn.Parameter(torch.randn(input_dim, latent_dim) * 0.01)
        self.decoder_bias = nn.Parameter(torch.zeros(input_dim))

        # Pre-encoder bias (used in Anthropic's architecture)
        self.pre_bias = nn.Parameter(torch.zeros(input_dim))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        """Encode: ReLU activation produces non-negative sparse codes."""
        x_centered = x - self.pre_bias
        z = torch.relu(x_centered @ self.encoder_weight.T + self.encoder_bias)
        return z

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        if self.tied_weights:
            # Tied weights: decoder = encoder.T
            return z @ self.encoder_weight + self.decoder_bias
        else:
            # (batch, latent) @ (latent, input) - the transpose is required
            return z @ self.decoder_weight.T + self.decoder_bias

    def forward(self, x: torch.Tensor):
        z = self.encode(x)
        x_hat = self.decode(z)
        return x_hat, z

    def loss(self, x: torch.Tensor, x_hat: torch.Tensor, z: torch.Tensor) -> dict:
        """Composite loss: reconstruction + L1 sparsity."""
        recon_loss = ((x - x_hat) ** 2).mean()
        sparsity_loss = self.sparsity_weight * z.abs().mean()
        total_loss = recon_loss + sparsity_loss

        # Sparsity metrics
        with torch.no_grad():
            l0 = (z.abs() > 1e-3).float().mean()  # fraction of active latent units
            l1 = z.abs().mean()                   # average activation magnitude

        return {
            'total': total_loss,
            'recon': recon_loss,
            'sparsity': sparsity_loss,
            'l0_active': l0,
            'l1_mean': l1,
        }


def train_sparse_ae(model: SparseAutoencoder, X_train: np.ndarray,
                    n_epochs: int = 50, batch_size: int = 256,
                    lr: float = 1e-3) -> list:
    optimizer = optim.Adam(model.parameters(), lr=lr)
    X_tensor = torch.as_tensor(X_train, dtype=torch.float32)
    loader = DataLoader(TensorDataset(X_tensor), batch_size=batch_size, shuffle=True)

    history = []

    for epoch in range(n_epochs):
        model.train()
        epoch_stats = {'total': 0.0, 'recon': 0.0, 'sparsity': 0.0, 'l0': 0.0}

        for (batch,) in loader:
            optimizer.zero_grad()
            x_hat, z = model(batch)
            losses = model.loss(batch, x_hat, z)
            losses['total'].backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            for k in epoch_stats:
                key = k if k != 'l0' else 'l0_active'
                epoch_stats[k] += losses[key].item() * len(batch)

        n = len(X_train)
        history.append({k: v / n for k, v in epoch_stats.items()})

        if (epoch + 1) % 10 == 0:
            s = history[-1]
            print(f"Epoch {epoch+1:3d} "
                  f"Total={s['total']:.5f} "
                  f"Recon={s['recon']:.5f} "
                  f"Sparsity={s['sparsity']:.5f} "
                  f"L0={s['l0']:.1%}")

    return history

Sparse Autoencoders for LLM Interpretability

:::note Sparse Autoencoders and Monosemanticity This is one of the most active research areas in AI safety and interpretability (2023–2025). The core problem: transformer neurons are polysemantic - a single neuron activates for semantically unrelated concepts (e.g., "Python programming," "snakes," "DNA"). This is believed to arise from superposition: models pack more features than they have dimensions by representing features as nearly-orthogonal directions in a high-dimensional space.

Anthropic's 2023 "Towards Monosemanticity" paper trained a sparse autoencoder on the MLP layer activations of a 1-layer transformer to find monosemantic directions - directions in activation space corresponding to a single interpretable concept.

The architecture:

  • Input: MLP layer activations from a forward pass, shape (n_tokens, d_model)
  • Encoder: linear + ReLU → overcomplete latent space (typically 4–16× wider than input)
  • Decoder: linear back to d_model
  • Loss: reconstruction MSE + $\lambda \|z\|_1$

The overcomplete latent space (wider than input) allows the SAE to allocate one neuron per feature even when features are superposed in the model. After training, each neuron in the SAE corresponds (approximately) to one interpretable concept. :::

import torch
import torch.nn as nn
import numpy as np
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class LLMSparseAutoencoder(nn.Module):
    """
    Sparse autoencoder for LLM activation interpretability.
    Based on Anthropic's architecture in "Towards Monosemanticity" (2023).

    Key differences from standard SAE:
    - Overcomplete: latent_dim >> d_model (typically 4x-16x wider)
    - Pre-encoder bias subtraction (removes mean activation)
    - L1 on post-ReLU activations
    - Decoder feature directions normalized to unit norm (stored here as rows of W_dec)
    """

    def __init__(self, d_model: int, n_features: int, l1_coeff: float = 1e-3):
        """
        d_model: dimension of LLM activations (e.g., 512 for small models)
        n_features: number of SAE features (typically 4096-65536)
        l1_coeff: sparsity regularization strength
        """
        super().__init__()
        self.d_model = d_model
        self.n_features = n_features
        self.l1_coeff = l1_coeff

        # Pre-encoder bias: subtracted before encoding
        self.b_pre = nn.Parameter(torch.zeros(d_model))

        # Encoder
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) / (d_model ** 0.5))
        self.b_enc = nn.Parameter(torch.zeros(n_features))

        # Decoder (rows = feature directions in activation space)
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) / (n_features ** 0.5))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

        # Normalize decoder feature directions at init
        self._normalize_decoder()

    @torch.no_grad()
    def _normalize_decoder(self):
        """Normalize each feature direction (row of W_dec) to unit norm."""
        norms = self.W_dec.norm(dim=1, keepdim=True).clamp(min=1e-8)
        self.W_dec.div_(norms)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        """Encode LLM activations to sparse feature coefficients."""
        x_centered = x - self.b_pre
        pre_act = x_centered @ self.W_enc + self.b_enc
        return torch.relu(pre_act)  # non-negative sparse activations

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        """Reconstruct LLM activations from sparse feature coefficients."""
        return z @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor):
        z = self.encode(x)
        x_hat = self.decode(z)
        return x_hat, z

    def loss(self, x: torch.Tensor, x_hat: torch.Tensor, z: torch.Tensor) -> dict:
        recon = ((x - x_hat) ** 2).mean()
        sparsity = self.l1_coeff * z.abs().mean()
        total = recon + sparsity

        with torch.no_grad():
            # L0: average number of active features per token
            l0 = (z > 0).float().sum(dim=-1).mean()
            # Dead features: features that never activate within this batch
            dead_frac = (z.max(dim=0).values == 0).float().mean()

        return {
            'loss': total,
            'recon_loss': recon,
            'l1_loss': sparsity,
            'l0_per_token': l0,
            'dead_feature_frac': dead_frac,
        }

    def get_top_activating_examples(self, z: torch.Tensor,
                                    feature_idx: int, top_k: int = 10):
        """
        For a given feature (neuron), return indices of examples that
        activate it most strongly - used for interpretability analysis.
        """
        feature_acts = z[:, feature_idx]
        top_k = min(top_k, feature_acts.numel())  # topk fails if k exceeds the size
        top_k_idx = feature_acts.topk(top_k).indices
        return top_k_idx, feature_acts[top_k_idx]


# Training loop for LLM SAE
def train_llm_sae(sae: LLMSparseAutoencoder,
                  activations: np.ndarray,  # (n_tokens, d_model)
                  n_epochs: int = 10,
                  batch_size: int = 4096,
                  lr: float = 5e-4) -> list:
    """
    Train SAE on cached LLM activations.
    In practice: activations are cached from a forward pass over a text corpus.
    """
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    sae = sae.to(device)
    optimizer = optim.Adam(sae.parameters(), lr=lr)

    X_tensor = torch.as_tensor(activations, dtype=torch.float32).to(device)
    loader = DataLoader(TensorDataset(X_tensor), batch_size=batch_size, shuffle=True)

    history = []

    for epoch in range(n_epochs):
        sae.train()
        epoch_losses = []

        for (batch,) in loader:
            optimizer.zero_grad()
            x_hat, z = sae(batch)
            metrics = sae.loss(batch, x_hat, z)
            metrics['loss'].backward()

            # Gradient step
            optimizer.step()

            # CRITICAL: renormalize decoder after each step. Without the unit-norm
            # constraint the model can shrink z and grow W_dec, lowering the L1
            # penalty without changing the reconstruction.
            sae._normalize_decoder()

            epoch_losses.append({k: v.item() for k, v in metrics.items()})

        avg = {k: np.mean([m[k] for m in epoch_losses]) for k in epoch_losses[0]}
        history.append(avg)

        print(f"Epoch {epoch+1}/{n_epochs} "
              f"Loss={avg['loss']:.5f} "
              f"Recon={avg['recon_loss']:.5f} "
              f"L0={avg['l0_per_token']:.1f} "
              f"Dead={avg['dead_feature_frac']:.1%}")

    return history
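The decoder renormalization step is easier to motivate with a small numeric check. The sketch below, using made-up shapes and a made-up scale factor `c`, shows that dividing the codes by `c` while multiplying the decoder by `c` leaves the reconstruction unchanged but shrinks the L1 penalty. With unnormalized decoder directions, an optimizer could exploit this to look sparse without being sparse.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_features, d_model = 16, 32, 8   # hypothetical toy sizes

z = np.abs(rng.normal(size=(n_tokens, n_features)))   # non-negative codes
W_dec = rng.normal(size=(n_features, d_model))        # rows = feature directions

# Rescale: smaller codes, larger decoder weights
c = 10.0
z_small = z / c
W_big = W_dec * c

same_reconstruction = np.allclose(z @ W_dec, z_small @ W_big)
l1_ratio = float(np.abs(z_small).sum() / np.abs(z).sum())

def normalize_rows(W: np.ndarray) -> np.ndarray:
    """Project each feature direction back onto the unit sphere."""
    return W / np.linalg.norm(W, axis=1, keepdims=True).clip(min=1e-8)

# After normalization the scale is fixed, so the rescaling trick is unavailable
row_norms = np.linalg.norm(normalize_rows(W_big), axis=1)

print(f"same reconstruction: {same_reconstruction}")
print(f"L1 penalty ratio after rescaling: {l1_ratio:.2f}")
```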


Autoencoder Variants Comparison

| Variant | Regularization | Key Idea | Best For |
| --- | --- | --- | --- |
| Undercomplete | Bottleneck size ($k < d$) | Compression forces meaningful codes | Feature learning, compression |
| Denoising (DAE) | Noise corruption during training | Learn data manifold, not identity | Robust representations, pre-training |
| Sparse (SAE) | $\lambda \lVert z \rVert_1$ on activations | One neuron per concept | Interpretability, monosemanticity |
| Contractive (CAE) | Squared Frobenius norm of the Jacobian, $\lVert \partial z / \partial x \rVert_F^2$ | Stable representations | Transfer learning |
| Variational (VAE) | KL divergence on latent distribution | Latent space is a probability distribution | Generative modeling |
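The contractive penalty is the only regularizer in the table not implemented elsewhere on this page. A minimal sketch using `torch.autograd.functional.jacobian` follows; the helper name `contractive_penalty` is an illustrative assumption, and the per-sample loop is written for clarity rather than speed.

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

def contractive_penalty(encoder: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius norm of dz/dx, averaged over the batch."""
    total = x.new_zeros(())
    for xi in x:  # one Jacobian per sample; slow but easy to follow
        # create_graph=True keeps the penalty differentiable w.r.t. the weights
        J = jacobian(lambda inp: encoder(inp.unsqueeze(0)).squeeze(0),
                     xi, create_graph=True)
        total = total + (J ** 2).sum()
    return total / x.shape[0]

# Sanity check on a linear encoder, whose Jacobian is its weight matrix
torch.manual_seed(0)
linear_encoder = nn.Linear(5, 3)
x = torch.randn(4, 5)

penalty = contractive_penalty(linear_encoder, x)
expected = (linear_encoder.weight ** 2).sum()
print(f"penalty={penalty.item():.5f}  ||W||_F^2={expected.item():.5f}")
```

In training, this term would be added to the reconstruction loss with a weighting coefficient, in the same way the sparse autoencoder adds its $\ell_1$ term.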

:::danger Do Not Fit the Scaler or Autoencoder on Test Data The autoencoder's threshold for anomaly detection is calibrated on training reconstruction errors. If you accidentally include test (or anomalous) data in training, the network will partially learn to reconstruct anomalies - degrading the anomaly score. Fit the autoencoder and scaler exclusively on known-normal training data. This is especially important for security applications, where attackers may try to poison the training data with attack-like traffic so that the model learns to reconstruct it and scores it as normal. :::

:::warning Reconstruction Loss Choice: MSE vs BCE Use BCE only when the decoder output is bounded to $[0, 1]$ via Sigmoid. BCE is undefined for predictions outside $[0, 1]$ because it takes the log of a negative number: depending on the implementation you get NaN losses or a runtime error (PyTorch's `nn.BCELoss` rejects out-of-range inputs). Use MSE for standardized continuous features (after StandardScaler). Using BCE on data standardized to zero-mean produces incorrect loss values - BCE assumes targets are probabilities, not arbitrary real values. :::
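One way to avoid the out-of-range problem entirely is to leave the decoder output as raw logits and let the loss apply the sigmoid. The sketch below checks that `nn.BCEWithLogitsLoss` on logits matches `nn.BCELoss` on sigmoid outputs; the tensor shapes and values are arbitrary stand-ins for a decoder's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 6) * 3   # unbounded decoder output (hypothetical)
target = torch.rand(4, 6)        # targets in [0, 1], e.g. pixel intensities

# Option A: explicit Sigmoid followed by BCELoss
loss_sigmoid_bce = nn.BCELoss()(torch.sigmoid(logits), target)

# Option B: BCEWithLogitsLoss applies the sigmoid internally,
# in a numerically more stable form
loss_with_logits = nn.BCEWithLogitsLoss()(logits, target)

print(f"Sigmoid + BCELoss:  {loss_sigmoid_bce.item():.6f}")
print(f"BCEWithLogitsLoss:  {loss_with_logits.item():.6f}")
```

With option B the decoder has no final Sigmoid, so `torch.sigmoid` must be applied at inference time to obtain reconstructions in [0, 1].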

YouTube Resources

| Channel | Video Title | Why Watch |
| --- | --- | --- |
| StatQuest with Josh Starmer | Autoencoders - simply explained | Best visual introduction to the bottleneck concept |
| Andrej Karpathy (CS231n 2016) | Autoencoders lecture | Classic lecture covering undercomplete, denoising, variational |
| Yannic Kilcher | Anthropic Monosemanticity paper | Explains sparse AEs for LLM interpretability with paper walkthrough |
| Umar Jamil | Coding autoencoders from scratch | Full PyTorch implementation walkthrough |

Interview Q&A

Q1: What is the difference between an undercomplete and an overcomplete autoencoder?

An undercomplete autoencoder has a bottleneck smaller than the input dimension (latent_dim < input_dim). The compression itself forces the network to learn meaningful structure - you cannot memorize the identity function through a narrow bottleneck. An overcomplete autoencoder has a larger bottleneck than the input (latent_dim > input_dim or latent_dim >> input_dim). Without additional regularization, an overcomplete AE trivially learns the identity - each neuron copies one input feature or memorizes training examples. To learn useful representations in an overcomplete setting, you must add regularization: noise (denoising AE) or $\ell_1$ penalty (sparse AE). Overcomplete sparse AEs are particularly useful for interpretability because the expanded latent space allows many more disentangled features to be represented simultaneously.

Q2: Why is a denoising autoencoder better than a standard undercomplete autoencoder for representation learning?

A standard undercomplete AE can learn a degenerate solution: if the bottleneck is only slightly smaller than the input, the network can learn a near-identity mapping (for example, compressing 100 dimensions to 95) without capturing meaningful structure. The denoising objective forces the network to learn the underlying data manifold: to reconstruct the clean input from a corrupted version, the model must learn which directions of variation are "real" signal vs noise. This produces representations that generalize better and are more robust to input perturbations. Theoretically (Vincent et al., 2011), training a denoising AE with Gaussian corruption is equivalent to score matching: the network learns to approximate the score function $\nabla_x \log p(x)$ of the noise-smoothed data density - connecting it to score-based generative models and diffusion models.
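The score connection can be checked numerically in the simplest possible case. The sketch below assumes 1-D Gaussian data and replaces the autoencoder with a one-parameter linear denoiser (which is the optimal denoiser for Gaussian data); the values of `data_std` and `noise_std` are arbitrary choices for illustration.

```python
import random

random.seed(0)

data_std, noise_std = 2.0, 1.0  # assumed toy values
n = 50_000

clean = [random.gauss(0.0, data_std) for _ in range(n)]
noisy = [x + random.gauss(0.0, noise_std) for x in clean]

# One-parameter linear denoiser r(x_noisy) = a * x_noisy, fit by least squares
# on the denoising objective sum((a * x_noisy - x_clean)^2).
a = sum(xn * xc for xn, xc in zip(noisy, clean)) / sum(xn * xn for xn in noisy)

# The noise-smoothed density is N(0, data_std^2 + noise_std^2), whose score is
# d/dx log p(x) = -x / (data_std^2 + noise_std^2).
smoothed_var = data_std ** 2 + noise_std ** 2

x_query = 1.5
score_from_denoiser = (a * x_query - x_query) / noise_std ** 2
score_exact = -x_query / smoothed_var

print(f"fitted slope a:        {a:.3f}")
print(f"score from denoiser:   {score_from_denoiser:.3f}")
print(f"exact smoothed score:  {score_exact:.3f}")
```

The reconstruction residual divided by the noise variance lands close to the exact score, which is the relationship that score-based models build on.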

Q3: How do you set the reconstruction error threshold for anomaly detection with an autoencoder?

Fit the autoencoder exclusively on normal data. Compute reconstruction errors on the training set (or a held-out normal validation set). Set the threshold at the 99th or 99.5th percentile of these errors - this allows 0.5–1% false positives on normal data, which is usually acceptable. The exact percentile depends on the business cost of false positives vs missed detections. To calibrate: if you have a small labeled anomaly set, compute precision and recall at multiple thresholds and select based on the F1 score or the business-specified FPR constraint. In production: monitor the threshold's stability - if normal data distribution shifts, the reconstruction error distribution shifts too. Refit the autoencoder on a rolling window of recent normal data and recalibrate the threshold periodically.
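The thresholding procedure above can be sketched as follows. The reconstruction errors here are simulated from assumed log-normal distributions purely for illustration, and `percentile` and `precision_recall_f1` are hypothetical helpers, not library functions.

```python
import math
import random

random.seed(0)

def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def precision_recall_f1(threshold, normal_errors, anomaly_errors):
    """Score a threshold, flagging every error strictly above it as anomalous."""
    tp = sum(e > threshold for e in anomaly_errors)
    fp = sum(e > threshold for e in normal_errors)
    fn = len(anomaly_errors) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Simulated reconstruction errors (assumed distributions, for illustration only).
normal_val_errors = [random.lognormvariate(0.0, 0.5) for _ in range(10_000)]
labeled_anomaly_errors = [random.lognormvariate(2.0, 0.5) for _ in range(50)]

# Unsupervised choice: 99th percentile of errors on held-out normal data.
threshold = percentile(normal_val_errors, 99)
false_positive_rate = sum(e > threshold for e in normal_val_errors) / len(normal_val_errors)
print(f"threshold={threshold:.3f}, FPR on normal data={false_positive_rate:.4f}")

# Calibration with a small labeled anomaly set: compare candidate percentiles.
for q in (95, 99, 99.5):
    t = percentile(normal_val_errors, q)
    p, r, f1 = precision_recall_f1(t, normal_val_errors, labeled_anomaly_errors)
    print(f"q={q}: threshold={t:.3f} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In a rolling-window setup, the same two steps would be rerun on recent normal data each time the autoencoder is refit.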

Q4: What reconstruction loss should you use - MSE or BCE - and why?

MSE for continuous-valued data: sensor readings, standardized features, real-valued embeddings. MSE assumes a Gaussian noise model - minimizing MSE is equivalent to maximizing the likelihood under $p(x \mid z) = \mathcal{N}(x; g(z), \sigma^2 I)$, where $g(z) = \hat{x}$ is the decoder output. BCE for data bounded in $[0, 1]$: pixel intensities (image autoencoders with Sigmoid output), binary features. BCE assumes a Bernoulli noise model - minimizing BCE is maximum likelihood under $p(x \mid z) = \text{Bernoulli}(g(z))$. The practical rule: always use Sigmoid as the final decoder activation when using BCE. Never apply BCE to unbounded output - the logarithms are undefined outside $[0, 1]$ and produce NaN losses and gradients. Never apply BCE to features normalized with StandardScaler, which can be negative.
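Both likelihood claims can be checked by writing out the negative log-likelihoods. A short derivation, assuming $d$ input dimensions, a fixed variance $\sigma^2$, and decoder output $\hat{x} = g(z)$:

$$
-\log \mathcal{N}(x; \hat{x}, \sigma^2 I) = \frac{1}{2\sigma^2}\|x - \hat{x}\|^2 + \frac{d}{2}\log(2\pi\sigma^2)
$$

$$
-\log \prod_{i=1}^{d} \hat{x}_i^{\,x_i}(1 - \hat{x}_i)^{1 - x_i} = -\sum_{i=1}^{d}\left[x_i \log \hat{x}_i + (1 - x_i)\log(1 - \hat{x}_i)\right]
$$

With $\sigma$ fixed, the second Gaussian term is a constant, so minimizing the Gaussian negative log-likelihood is the same as minimizing squared error. The Bernoulli expression is exactly the BCE loss, and it is only defined when every $\hat{x}_i$ lies in $(0, 1)$, which is what the Sigmoid guarantees.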

Q5: What are sparse autoencoders and why are they used for LLM interpretability?

Sparse autoencoders add an $\ell_1$ penalty on bottleneck activations ($\mathcal{L} = \|x - \hat{x}\|^2 + \lambda\|z\|_1$), forcing most activations to be near-zero for any given input. In the context of LLM interpretability, the problem being solved is polysemanticity: individual neurons in language models respond to multiple unrelated concepts (e.g., the same neuron activates for academic citations, English dialogue, HTTP requests, and Korean text). This is thought to arise from superposition - the model represents more features than it has neurons by placing features in nearly-orthogonal directions in activation space. A sparse AE trained on a language model's MLP activations, with a latent space much larger than the activation space (e.g., 16,384 latent units for a 512-dimensional activation vector), can disentangle these superposed features into approximately monosemantic directions: roughly one SAE latent per concept. The Anthropic "Towards Monosemanticity" (2023) paper demonstrated this for a 1-layer transformer, finding thousands of interpretable features.
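The sparse objective itself is small enough to compute by hand. This is a toy sketch assuming a ReLU encoder and hand-set (not trained) weights; `sae_forward` and the weight matrices are hypothetical, chosen so that a 2-D input maps onto a 4-unit overcomplete dictionary.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, lam):
    """Return (loss, z, x_hat) for loss = ||x - x_hat||^2 + lam * ||z||_1."""
    pre = [h + b for h, b in zip(matvec(W_enc, x), b_enc)]
    z = [max(0.0, p) for p in pre]  # ReLU keeps latent activations non-negative
    x_hat = matvec(W_dec, z)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    l1 = sum(abs(a) for a in z)
    return recon + lam * l1, z, x_hat

# Hand-set toy weights (not trained): 2-D input, 4 latent units (2x overcomplete).
# Each latent unit detects one signed direction: +x0, -x0, +x1, -x1.
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, -1.0]]

loss, z, x_hat = sae_forward([3.0, 0.0], W_enc, b_enc, W_dec, lam=0.1)
active = [i for i, a in enumerate(z) if a > 0]

print("latent code:", z)
print("active units:", active)
print("reconstruction:", x_hat)
```

Only one latent unit fires for this input while reconstruction stays exact, which is the behavior the $\ell_1$ term is meant to encourage at scale.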

Q6: How would you use an autoencoder to build a semi-supervised fraud detection system when you have 10 million unlabeled transactions and only 500 labeled fraudulent examples?

A two-stage approach: Stage 1 - train an autoencoder on the 10 million unlabeled transactions (treating them all as normal). The bottleneck representation $z = f(x)$ learns the statistical structure of normal transactions. Reconstruction error provides an unsupervised anomaly score. Stage 2 - use the encoder as a feature extractor. For each of your 500 labeled fraudulent examples and an equal number of labeled normal examples, compute $z = f(x)$ and train a supervised classifier (gradient boosting or logistic regression) on the latent representations. The latent representations learned from 10 million samples give the classifier far richer features than training on 500 raw transaction feature vectors would allow. Optionally combine both signals: a high reconstruction error OR a high classifier score triggers a fraud alert, with different thresholds calibrated to the desired precision-recall operating point.
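Stage 2 and the OR-combination can be sketched in a few lines. Everything here is an assumption for illustration: `encode` is a fixed linear map standing in for a trained encoder, the transactions are simulated 2-feature points, the reconstruction error is passed in as a given number, and the thresholds are arbitrary.

```python
import math
import random

random.seed(0)

def encode(x):
    """Hypothetical stand-in for the trained encoder z = f(x)."""
    return [0.5 * (x[0] + x[1]), x[0] - x[1]]

def sigmoid(a):
    a = max(-30.0, min(30.0, a))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-a))

def train_logreg(Z, y, lr=0.1, epochs=100):
    """Logistic regression on latent features, trained with plain SGD."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        for z, t in zip(Z, y):
            g = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - t
            w = [wi - lr * g * zi for wi, zi in zip(w, z)]
            b -= lr * g
    return w, b

def fraud_alert(x, recon_error, w, b, recon_threshold, clf_threshold):
    """Alert if either the anomaly score or the classifier score is high."""
    score = sigmoid(sum(wi * zi for wi, zi in zip(w, encode(x))) + b)
    return recon_error > recon_threshold or score > clf_threshold

# Simulated balanced labeled set (assumed distributions, for illustration only).
normal = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
fraud = [[random.gauss(3, 1), random.gauss(3, 1)] for _ in range(50)]
labeled = [(encode(x), 0) for x in normal] + [(encode(x), 1) for x in fraud]
random.shuffle(labeled)  # avoid presenting one class at a time to SGD

w, b = train_logreg([z for z, _ in labeled], [t for _, t in labeled])

print(fraud_alert([3.0, 3.0], 0.1, w, b, recon_threshold=1.0, clf_threshold=0.5))
print(fraud_alert([0.0, 0.0], 0.1, w, b, recon_threshold=1.0, clf_threshold=0.5))
print(fraud_alert([0.0, 0.0], 5.0, w, b, recon_threshold=1.0, clf_threshold=0.5))
```

The third call shows why the OR rule is useful: a transaction the classifier considers normal can still be flagged when its reconstruction error is unusually high.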

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Autoencoder Latent Space demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.