t-SNE and UMAP
import ReadingTime from '@site/src/components/ReadingTime';
:::note Interview Relevance - High t-SNE and UMAP are standard interview topics for roles involving embeddings, NLP, computer vision, and model interpretability. Interviewers commonly ask about the crowding problem, the perplexity hyperparameter, why t-SNE distances are meaningless, and when you would use UMAP vs t-SNE in production. :::
The Real Interview Moment
You are presenting your BERT-based customer support routing system to the VP of Engineering. You claim the model understands the difference between billing complaints, technical issues, and feature requests. He asks: "Can you show me that?" You cannot show him 768-dimensional embeddings as a table. You fire up a Jupyter notebook, run UMAP on 10,000 support ticket embeddings, and project to 2D. Billing complaints cluster in the top-left. Technical issues form a tight cluster in the center. Feature requests scatter across the right side. A small isolated cluster near the bottom turns out to be outage reports - a category the routing system was misclassifying as technical issues.
"That cluster," you say, pointing to the outliers, "those are outage reports. They look like technical issues to the classifier because the text is similar - 'the system is not working' - but the intent is completely different. We need a separate class."
Two tools made that conversation possible: UMAP (to produce the visualization in 4 seconds instead of 4 minutes) and your understanding of what the visualization does and does not mean.
Why This Exists - The Crowding Problem
PCA projects data onto a linear subspace. For data with non-linear structure - word embeddings, image feature vectors, single-cell RNA profiles - the linear projection folds the manifold back on itself, placing points that are intrinsically far apart near each other in 2D.
The naive fix - SNE (Stochastic Neighbor Embedding, Hinton & Roweis 2002) - computes Gaussian similarity in high-D and Gaussian similarity in low-D, minimizing the KL divergence between the two distributions. But it has a fundamental flaw: the crowding problem.
In 2D, the area of an annulus at radius scales as . In 10D, the volume of the corresponding shell scales as . There is vastly more volume near moderate distances in high-D than in 2D. When you project 10D data to 2D, moderately-far neighbors in 10D have nowhere to go - they crowd into the center of the 2D plot, obliterating the structure you are trying to see.
t-SNE (van der Maaten & Hinton, 2008) solves this by using a heavy-tailed t-distribution in the low-dimensional space. The t-distribution gives moderately-far points more room to spread out, resolving the crowding problem.
SNE: The Foundation
SNE defines a conditional similarity from point to point using a Gaussian centered at :
The bandwidth is per-point - each point gets its own bandwidth chosen so that the entropy of equals the user-specified perplexity:
Perplexity effective number of neighbors. Points in dense regions get small (tight neighborhoods); points in sparse regions get large (broader neighborhoods). This adaptive bandwidth is one of t-SNE's key advantages over distance-based methods.
Symmetric SNE symmetrizes: . This ensures and the gradient is simpler.
In the low-dimensional space, SNE uses another Gaussian:
The objective is to minimize the KL divergence between and :
Problem: the Gaussian in the low-D space creates the crowding problem. Moderately-distant points in high-D have , but in 2D the Gaussian forces them close together.
t-SNE: Replacing the Low-Dimensional Gaussian
The key modification in t-SNE: replace the low-dimensional Gaussian with a Student t-distribution with 1 degree of freedom (a Cauchy distribution):
The t-distribution has heavy tails: at distance , its density falls off as instead of . This means moderately-distant points in high-D can be placed at moderate distances in 2D - they no longer need to crowd the center.
Comparison:
| Distribution | Density at distance | Effect |
|---|---|---|
| Gaussian | Rapid falloff - moderately distant points crowd together | |
| Student-t (1 dof) | Heavy tail - moderately distant points can spread out |
The t-SNE objective:
The gradient with respect to has a clean form:
This gradient has an intuitive interpretation: if (points are closer in high-D than in 2D), the gradient pulls toward . If (points are farther in high-D), the gradient pushes away from .
Optimization uses gradient descent with momentum and an important trick called early exaggeration: multiply by 4–12 in the first 250 iterations. This forces clusters to form early (points with high are strongly attracted), and then the exaggeration factor is removed to fine-tune the cluster structure.
The Perplexity Hyperparameter
Perplexity controls the effective neighborhood size. Intuitively: "how many neighbors should each point consider?"
Effect of perplexity on the embedding:
| Perplexity | Local vs global balance | When to use |
|---|---|---|
| 5–10 | Very local - tight microclusters, may fragment real clusters | When you want to see fine-grained substructure |
| 20–50 | Balanced - standard starting point | Default for most tasks |
| 100–200 | More global - clusters may merge | Large datasets where global structure matters |
Rules:
- Perplexity must be less than (more neighbors than points makes no sense)
- Always run t-SNE at multiple perplexity values and check stability
- For large datasets (), perplexity of 50–100 is typical
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
# Load MNIST digits (8×8, 64 dimensions)
digits = load_digits()
X, y = digits.data, digits.target
# BEST PRACTICE: PCA pre-reduction to 50 dims before t-SNE
# Removes noise, speeds up t-SNE significantly (O(n²) cost dominated by pairwise dist)
X_pca50 = PCA(n_components=50, random_state=42).fit_transform(
StandardScaler().fit_transform(X)
)
# Compare perplexity values
perplexities = [5, 15, 30, 50, 100]
fig, axes = plt.subplots(1, len(perplexities), figsize=(25, 5))
for ax, perp in zip(axes, perplexities):
tsne = TSNE(
n_components=2,
perplexity=perp,
learning_rate='auto',
n_iter=1000,
random_state=42,
)
X_tsne = tsne.fit_transform(X_pca50)
ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=5, alpha=0.7)
ax.set_title(f"Perplexity={perp}\nKL={tsne.kl_divergence_:.3f}", fontsize=10)
ax.axis('off')
plt.suptitle("t-SNE: Effect of Perplexity on MNIST Digits", fontsize=13)
plt.tight_layout()
plt.show()
Barnes-Hut Approximation: Making t-SNE Scalable
Naive t-SNE is per gradient step - impossible for . The Barnes-Hut approximation reduces this to using the same idea as the Barnes-Hut N-body simulation in astrophysics:
- Build a quadtree (2D) or octree (3D) over the current embedding positions
- For each point, cells far away are treated as a single "super-point" at their center of mass
- A cell is considered "far" if its angular size (width/distance) is below a threshold (typically 0.5)
This approximates the repulsive forces (negative gradient term) without computing all pairwise interactions, at the cost of some approximation error.
from sklearn.manifold import TSNE
import numpy as np
# Barnes-Hut t-SNE (sklearn's default for n_components <= 3)
tsne_bh = TSNE(
n_components=2,
perplexity=30,
learning_rate='auto',
n_iter=1000,
n_iter_without_progress=150, # early stopping
early_exaggeration=12.0, # multiplier for p_ij in first ~250 iters
method='barnes_hut', # O(n log n) - default for 2D/3D
angle=0.5, # Barnes-Hut approximation angle (lower=more accurate)
random_state=42,
n_jobs=-1,
verbose=1,
)
# For small n (< 3000) or when you need exact results:
tsne_exact = TSNE(
n_components=2,
perplexity=30,
method='exact', # O(n²) - exact computation
random_state=42,
)
# Production-scale t-SNE: PCA first, then Barnes-Hut
def production_tsne(X: np.ndarray, n_pca_dims: int = 50, perplexity: int = 30,
random_state: int = 42) -> np.ndarray:
"""
Production-ready t-SNE:
1. Standardize
2. PCA to 50 dims (noise removal + speed)
3. Barnes-Hut t-SNE
"""
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
print(f"Input shape: {X.shape}")
# Step 1: Standardize
X_scaled = StandardScaler().fit_transform(X)
# Step 2: PCA pre-reduction
n_pca = min(n_pca_dims, X.shape[1], X.shape[0]-1)
X_pca = PCA(n_components=n_pca, random_state=random_state).fit_transform(X_scaled)
print(f"After PCA: {X_pca.shape}")
# Step 3: t-SNE
tsne = TSNE(
n_components=2,
perplexity=min(perplexity, X.shape[0] // 3), # perplexity < n/3
learning_rate='auto',
n_iter=1000,
method='barnes_hut',
random_state=random_state,
n_jobs=-1,
)
X_2d = tsne.fit_transform(X_pca)
print(f"KL divergence: {tsne.kl_divergence_:.4f}")
return X_2d
The Critical t-SNE Misconceptions
:::danger t-SNE Distances and Sizes Are Meaningless t-SNE distorts distances to preserve local neighborhood structure. Cluster sizes, distances between clusters, and relative positions of clusters are all arbitrary. You cannot make quantitative claims about distances or proportions from a t-SNE plot. :::
Misconception 1: Cluster size = cluster importance
t-SNE stretches regions of low density and compresses regions of high density. A large cluster in the plot may simply be a region where t-SNE expanded the layout. A tight cluster may be compressed. Neither reflects the actual number of points or spread.
Misconception 2: Distance between clusters = similarity in original space
t-SNE preserves local neighborhood structure but destroys global structure. Two clusters that are far apart in the t-SNE plot may be close in the original space. Two clusters that appear adjacent may be far apart.
# Demonstration: t-SNE distance vs actual distance
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
import numpy as np
def check_distance_preservation(X_orig: np.ndarray, X_2d: np.ndarray,
n_sample: int = 500) -> dict:
"""
Check how well 2D embedding preserves pairwise distances.
A good embedding has high correlation (close to 1.0) for PCA,
but t-SNE will show low correlation - by design.
"""
rng = np.random.default_rng(42)
idx = rng.choice(len(X_orig), min(n_sample, len(X_orig)), replace=False)
d_orig = pdist(X_orig[idx], metric='euclidean')
d_2d = pdist(X_2d[idx], metric='euclidean')
spearman_r, pvalue = spearmanr(d_orig, d_2d)
pearson_r = np.corrcoef(d_orig, d_2d)[0, 1]
# Local vs global preservation
# For local: only consider pairs within top 10th percentile of proximity
local_threshold = np.percentile(d_orig, 10)
local_mask = d_orig < local_threshold
local_spearman, _ = spearmanr(d_orig[local_mask], d_2d[local_mask])
print(f"Global Spearman rank correlation: {spearman_r:.3f}")
print(f"Global Pearson correlation: {pearson_r:.3f}")
print(f"Local Spearman (near pairs): {local_spearman:.3f}")
print()
print("Interpretation:")
print(" Global ~0.3-0.5 for t-SNE: EXPECTED - global structure not preserved")
print(" Local ~0.8-0.9 for t-SNE: GOOD - local structure well preserved")
print(" Global ~0.9+ for PCA: EXPECTED - PCA preserves global distances")
return {'global_spearman': spearman_r, 'local_spearman': local_spearman}
Misconception 3: One t-SNE run is definitive
t-SNE uses random initialization. Different seeds produce different layouts. A cluster structure is only meaningful if it appears consistently across multiple random seeds.
def check_tsne_stability(X_pca: np.ndarray, y: np.ndarray,
n_seeds: int = 5, perplexity: int = 30):
"""Run t-SNE with multiple seeds to check layout stability."""
fig, axes = plt.subplots(1, n_seeds, figsize=(5*n_seeds, 4))
for i, seed in enumerate(range(n_seeds)):
tsne = TSNE(n_components=2, perplexity=perplexity,
learning_rate='auto', random_state=seed)
X_2d = tsne.fit_transform(X_pca)
axes[i].scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=5, alpha=0.7)
axes[i].set_title(f"Seed {seed}\nKL={tsne.kl_divergence_:.3f}", fontsize=9)
axes[i].axis('off')
plt.suptitle("t-SNE Stability Check (5 random seeds)\n"
"Same structures appearing in all runs = real signal", fontsize=11)
plt.tight_layout()
plt.show()
UMAP: Faster, More Faithful, Parametric
UMAP (Uniform Manifold Approximation and Projection, McInnes et al. 2018) solves t-SNE's core limitations through a fundamentally different theoretical foundation: Riemannian geometry and topological data analysis.
UMAP Theory: Fuzzy Simplicial Complexes
Where t-SNE uses Gaussian distributions to model local structure, UMAP builds a weighted graph (fuzzy simplicial complex) over the data:
Step 1: Build the high-dimensional graph
For each point , find its nearest neighbors. Define the weight of the edge as:
where is the distance to the nearest neighbor (ensures the closest neighbor always has weight 1 - making the representation locally adaptive) and is chosen so the weights sum to .
The symmetrized graph weight:
Step 2: Build the low-dimensional graph
The low-dimensional edge weights use:
Parameters and are determined by min_dist (the minimum distance between points in the embedding - smaller values produce tighter clusters).
Step 3: Optimize the embedding
Minimize the cross-entropy between the high-D and low-D graphs (binary cross-entropy over edges):
Optimization uses stochastic gradient descent with negative sampling (like word2vec) - efficient and .
Key algorithmic advances over t-SNE:
- NNDescent for approximate nearest neighbor graph: vs for exact NN
- Cross-entropy with negative sampling: avoids normalization sum (the "Z" denominator in t-SNE's )
- Parametric mode: the encoder is a neural network - new points get a single forward pass
UMAP Key Parameters
| Parameter | Effect | Default | Tuning Guide |
|---|---|---|---|
n_neighbors | Size of local neighborhood | 15 | Small (5): fine local structure; Large (100+): global structure |
min_dist | Tightness of clusters in 2D | 0.1 | Small (0.0): tight clusters; Large (0.5): more spread out |
n_components | Output dimensions | 2 | 2 or 3 for viz; 16–64 for downstream ML |
metric | Distance in high-D | 'euclidean' | 'cosine' for embeddings; 'hamming' for binary |
n_epochs | Optimization steps | auto | More = better layout, slower |
# pip install umap-learn
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
digits = load_digits()
X, y = digits.data, digits.target
X_scaled = StandardScaler().fit_transform(X)
X_pca50 = PCA(n_components=50, random_state=42).fit_transform(X_scaled)
# ── Full UMAP parameter exploration ───────────────────────────────────────────
fig, axes = plt.subplots(2, 3, figsize=(18, 11))
configs = [
{'n_neighbors': 5, 'min_dist': 0.1, 'label': 'n_neighbors=5\n(very local)'},
{'n_neighbors': 15, 'min_dist': 0.1, 'label': 'n_neighbors=15\n(default)'},
{'n_neighbors': 50, 'min_dist': 0.1, 'label': 'n_neighbors=50\n(more global)'},
{'n_neighbors': 15, 'min_dist': 0.0, 'label': 'min_dist=0.0\n(tight clusters)'},
{'n_neighbors': 15, 'min_dist': 0.5, 'label': 'min_dist=0.5\n(spread out)'},
{'n_neighbors': 15, 'min_dist': 0.99, 'label': 'min_dist=0.99\n(almost uniform)'},
]
for ax, cfg in zip(axes.flat, configs):
reducer = umap.UMAP(
n_components=2,
n_neighbors=cfg['n_neighbors'],
min_dist=cfg['min_dist'],
random_state=42,
n_jobs=-1,
)
X_2d = reducer.fit_transform(X_pca50)
ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=8, alpha=0.7)
ax.set_title(cfg['label'], fontsize=11)
ax.axis('off')
plt.suptitle("UMAP Hyperparameter Effects on MNIST Digits", fontsize=14)
plt.tight_layout()
plt.show()
UMAP with Non-Euclidean Metrics
One of UMAP's major advantages over t-SNE: it natively supports non-Euclidean distance metrics.
import umap
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
# Text data: 20 newsgroups - use cosine distance (text-appropriate)
newsgroups = fetch_20newsgroups(subset='train', categories=[
'sci.space', 'rec.sport.hockey', 'talk.politics.guns', 'comp.graphics'
], remove=('headers', 'footers', 'quotes'), random_state=42)
# TF-IDF features
tfidf = TfidfVectorizer(max_features=5000, stop_words='english', sublinear_tf=True)
X_tfidf = tfidf.fit_transform(newsgroups.data) # sparse matrix
# UMAP with cosine distance (correct for sparse text features)
reducer_cosine = umap.UMAP(
n_components=2,
metric='cosine', # correct for text - invariant to document length
n_neighbors=20,
min_dist=0.1,
random_state=42,
n_jobs=-1,
verbose=True,
)
# UMAP accepts sparse matrices natively
X_2d_text = reducer_cosine.fit_transform(X_tfidf)
fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(X_2d_text[:, 0], X_2d_text[:, 1],
c=newsgroups.target, cmap='tab10', s=5, alpha=0.7)
plt.colorbar(scatter, ax=ax, label='Category')
ax.set_title("UMAP of 20 Newsgroups (cosine distance, TF-IDF)", fontsize=12)
ax.axis('off')
plt.tight_layout()
plt.show()
# ── Other useful metrics ───────────────────────────────────────────────────────
# 'hamming' - binary features (bag-of-words presence/absence)
# 'jaccard' - set similarity (tags, categories)
# 'manhattan' - L1 - more robust to outliers than L2 for tabular data
# 'precomputed' - pass a precomputed distance matrix
Parametric UMAP: Transform New Points
Unlike t-SNE, UMAP learns a mapping that can be applied to new points. There are two modes:
import umap
import numpy as np
# ── Mode 1: Standard UMAP - fit once, transform later ──────────────────────────
reducer = umap.UMAP(n_components=2, random_state=42)
reducer.fit(X_train_pca)
# Transform training data
X_train_2d = reducer.transform(X_train_pca)
# Transform new test data - consistent with training layout!
X_test_2d = reducer.transform(X_test_pca)
print("Can transform new points:", True)
# ── Mode 2: Parametric UMAP - neural network encoder ──────────────────────────
# Trains a neural network as the embedding function
# Enables very fast inference on new points (single forward pass)
from umap.parametric_umap import ParametricUMAP
param_reducer = ParametricUMAP(
n_components=2,
n_neighbors=15,
min_dist=0.1,
n_training_epochs=10, # number of epochs to train the encoder network
encoder=None, # auto-builds MLP: input_dim → 100 → 100 → n_components
verbose=True,
)
param_reducer.fit(X_train_pca)
X_train_2d = param_reducer.transform(X_train_pca) # batch forward pass
X_test_2d = param_reducer.transform(X_test_pca) # fast inference
# Check consistency: same points should map to same location
X_train_2d_check = param_reducer.transform(X_train_pca[:10])
# Standard UMAP: X_train_2d_check ≈ X_train_2d[:10] (deterministic)
# Parametric UMAP: X_train_2d_check == X_train_2d[:10] (exact, it's a function)
UMAP for Dimensionality Reduction (Not Just Visualization)
UMAP at higher dimensions () is a powerful general-purpose dimensionality reduction technique - not just for visualization:
import umap
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import numpy as np
def compare_reduction_methods(X_train, y_train, X_test, y_test, target_dim=32):
"""Compare PCA vs UMAP for downstream classification."""
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
results = {}
# Baseline: raw features
lr_raw = LogisticRegression(max_iter=1000, random_state=42)
lr_raw.fit(X_train, y_train)
results['raw'] = accuracy_score(y_test, lr_raw.predict(X_test))
# PCA reduction
pca_pipe = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=target_dim, random_state=42)),
('lr', LogisticRegression(max_iter=1000, random_state=42)),
])
pca_pipe.fit(X_train, y_train)
results[f'pca_{target_dim}'] = accuracy_score(y_test, pca_pipe.predict(X_test))
# UMAP reduction
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
reducer = umap.UMAP(
n_components=target_dim,
n_neighbors=30,
min_dist=0.0, # tight clusters: good for downstream classification
metric='euclidean',
random_state=42,
)
X_train_umap = reducer.fit_transform(X_train_scaled)
X_test_umap = reducer.transform(X_test_scaled)
lr_umap = LogisticRegression(max_iter=1000, random_state=42)
lr_umap.fit(X_train_umap, y_train)
results[f'umap_{target_dim}'] = accuracy_score(y_test, lr_umap.predict(X_test_umap))
for method, acc in results.items():
print(f"{method:20s}: {acc:.4f}")
return results
t-SNE vs UMAP: Flow and Decision
Production: Embedding Space Monitoring
import umap
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp
class EmbeddingDriftMonitor:
"""
Monitor embedding space drift in production using UMAP.
Fits once on baseline, transforms new batches consistently.
"""
def __init__(self, n_neighbors: int = 15, min_dist: float = 0.1,
drift_threshold: float = 0.5):
self.reducer = umap.UMAP(
n_components=2,
n_neighbors=n_neighbors,
min_dist=min_dist,
random_state=42
)
self.drift_threshold = drift_threshold
self.baseline_2d_ = None
self.baseline_stats_ = None
def fit_baseline(self, embeddings: np.ndarray,
labels: np.ndarray = None) -> 'EmbeddingDriftMonitor':
"""Fit UMAP reducer on baseline (training/reference) embeddings."""
self.reducer.fit(embeddings)
self.baseline_2d_ = self.reducer.transform(embeddings)
self.baseline_labels_ = labels
# Compute reference statistics for drift detection
self.baseline_stats_ = {
'mean': self.baseline_2d_.mean(axis=0),
'std': self.baseline_2d_.std(axis=0),
'x_dist': self.baseline_2d_[:, 0],
'y_dist': self.baseline_2d_[:, 1],
}
print(f"Baseline fitted: {embeddings.shape[0]} samples")
print(f"2D bounds: x=[{self.baseline_2d_[:,0].min():.2f}, "
f"{self.baseline_2d_[:,0].max():.2f}], "
f"y=[{self.baseline_2d_[:,1].min():.2f}, "
f"{self.baseline_2d_[:,1].max():.2f}]")
return self
def check_drift(self, new_embeddings: np.ndarray,
period_label: str = "Production") -> dict:
"""
Transform new embeddings and check for distribution drift.
Returns drift metrics and raises alert if drift exceeds threshold.
"""
new_2d = self.reducer.transform(new_embeddings)
# Centroid shift
centroid_shift = np.linalg.norm(
new_2d.mean(axis=0) - self.baseline_stats_['mean']
)
# KS test on marginal distributions
ks_x = ks_2samp(self.baseline_stats_['x_dist'], new_2d[:, 0])
ks_y = ks_2samp(self.baseline_stats_['y_dist'], new_2d[:, 1])
# Coverage: fraction of new points outside baseline bounding box + margin
x_min, x_max = self.baseline_2d_[:,0].min(), self.baseline_2d_[:,0].max()
y_min, y_max = self.baseline_2d_[:,1].min(), self.baseline_2d_[:,1].max()
margin = 0.2 * max(x_max - x_min, y_max - y_min)
out_of_distribution = (
(new_2d[:, 0] < x_min - margin) |
(new_2d[:, 0] > x_max + margin) |
(new_2d[:, 1] < y_min - margin) |
(new_2d[:, 1] > y_max + margin)
).mean()
metrics = {
'centroid_shift': centroid_shift,
'ks_stat_x': ks_x.statistic,
'ks_stat_y': ks_y.statistic,
'ks_pvalue_x': ks_x.pvalue,
'ks_pvalue_y': ks_y.pvalue,
'out_of_distribution_rate': out_of_distribution,
}
drift_detected = (
centroid_shift > self.drift_threshold or
ks_x.pvalue < 0.01 or ks_y.pvalue < 0.01 or
out_of_distribution > 0.05
)
if drift_detected:
print(f"DRIFT ALERT for {period_label}:")
print(f" Centroid shift: {centroid_shift:.4f} (threshold={self.drift_threshold})")
print(f" KS p-values: x={ks_x.pvalue:.4f}, y={ks_y.pvalue:.4f}")
print(f" OOD rate: {out_of_distribution:.1%}")
else:
print(f"{period_label}: No significant drift detected")
return metrics, new_2d
def plot_drift(self, new_2d: np.ndarray, period_label: str = "Production"):
fig, axes = plt.subplots(1, 2, figsize=(16, 7))
# Baseline
if hasattr(self, 'baseline_labels_') and self.baseline_labels_ is not None:
scatter = axes[0].scatter(
self.baseline_2d_[:, 0], self.baseline_2d_[:, 1],
c=self.baseline_labels_, cmap='tab10', s=5, alpha=0.6
)
else:
axes[0].scatter(self.baseline_2d_[:, 0], self.baseline_2d_[:, 1],
s=5, alpha=0.5, color='#3b82f6')
axes[0].set_title("Baseline Embedding Space", fontsize=12)
axes[0].axis('off')
# Overlay: baseline (gray) + new points (red)
axes[1].scatter(self.baseline_2d_[:, 0], self.baseline_2d_[:, 1],
s=3, alpha=0.2, color='#94a3b8', label='Baseline')
axes[1].scatter(new_2d[:, 0], new_2d[:, 1],
s=15, alpha=0.7, color='#ef4444', label=period_label)
axes[1].set_title(f"Drift: Baseline vs {period_label}", fontsize=12)
axes[1].legend()
axes[1].axis('off')
plt.tight_layout()
plt.show()
:::tip UMAP for Production, t-SNE for Exploration Use UMAP in production: it can transform new points, runs in , and handles millions of samples. Use t-SNE for one-time exploratory visualizations on static datasets where you want the cleanest possible cluster separation. Always run t-SNE on PCA-pre-reduced data (50 dimensions) for speed and noise reduction. :::
:::warning Do Not Use t-SNE or UMAP Coordinates as Features
Never feed t-SNE or UMAP 2D coordinates into a downstream classifier. t-SNE is non-parametric - you cannot transform new points reproducibly. UMAP coordinates are stable for the fitted model but the 2D space is not semantically meaningful for classification (distances are distorted). Use PCA, raw embeddings, or UMAP at higher dimensions (, min_dist=0.0) for feature extraction.
:::
YouTube Resources
| Channel | Video Title | Why Watch |
|---|---|---|
| StatQuest with Josh Starmer | t-SNE, Clearly Explained | Best visual walkthrough of perplexity and the algorithm steps |
| ritvikmath | t-SNE Explained | Math derivation of the KL objective and gradient |
| Leland McInnes (UMAP author) | UMAP: Theory and Practice | First-principles explanation of the topological foundation |
| Two Minute Papers | UMAP paper overview | Quick intuition for the crowding problem vs UMAP approach |
Interview Q&A
Q1: Explain the crowding problem in SNE and how t-SNE solves it.
In 10 or more dimensions, the volume available at moderate distances grows much faster than in 2D. When you project high-D data to 2D using a Gaussian similarity in both spaces, the 2D space cannot accommodate all the moderately-distant neighbors - they all crowd into the center of the plot, obliterating local structure. t-SNE replaces the low-dimensional Gaussian with a Student t-distribution (1 degree of freedom), which has a heavy tail: its density at distance falls off as instead of . The heavy tail gives moderately-distant points more room to spread apart, resolving the crowding. The mismatch between the narrow Gaussian (high-D) and the heavy-tailed t-distribution (low-D) is intentional - it produces clear cluster separation.
Q2: What does the perplexity parameter in t-SNE actually control? What happens at extreme values?
Perplexity controls the effective number of neighbors each point considers. Mathematically, for each point , the bandwidth is chosen so that the entropy of the conditional distribution satisfies . This makes the bandwidth adaptive: dense regions get smaller (tight neighborhoods), sparse regions get larger (broader neighborhoods). At perplexity 5: only immediate neighbors matter - produces fragmented microclusters that may split what should be one cluster. At perplexity 200: each point considers many neighbors - produces smooth global layout but loses fine-grained local structure. Start with perplexity 30, then explore 5, 15, 50, and 100. Clusters that appear stable across all perplexities are real.
Q3: A colleague reports that cluster A is three times larger than cluster B in the t-SNE plot and concludes that class A is three times more common. What is wrong?
Cluster sizes in t-SNE are meaningless. t-SNE stretches regions of low density and compresses regions of high density to preserve local neighborhood structure. A cluster's area in the 2D plot does not correspond to the number of points, the spread in original space, or any other quantitative property. To assess cluster size, count the data points assigned to each cluster by running a separate algorithm (K-means, DBSCAN, or a distance threshold) on the original high-dimensional embeddings. Never use t-SNE geometrical properties for quantitative conclusions - only for qualitative pattern identification.
Q4: How does UMAP differ from t-SNE algorithmically? When would you choose UMAP over t-SNE?
Algorithmically, t-SNE models similarities as Gaussians (high-D) and t-distributions (low-D) and minimizes KL divergence; it runs in naively or with Barnes-Hut, and cannot transform new points. UMAP builds a weighted graph using fuzzy topological methods (local distances normalized by nearest-neighbor distance, symmetrized), then minimizes cross-entropy between the high-D and low-D graphs using negative sampling - via NNDescent. Choose UMAP over t-SNE when: (1) - t-SNE becomes too slow, (2) you need to transform new data points at inference, (3) you want to use the reduction for downstream ML (not just visualization), (4) you need a non-Euclidean metric (cosine, Hamming, precomputed). Choose t-SNE when: you want the very clearest cluster separation for a publication figure on a static dataset.
Q5: Why can't you use t-SNE coordinates as input features for a downstream classifier?
t-SNE is non-parametric - it optimizes 2D positions for a specific fixed dataset. Given a new point, there is no function to apply. You must include the new point in the dataset and rerun the entire optimization from scratch - but then the entire layout changes, so you cannot compare to the old embedding or train a classifier on one run's coordinates and predict on another. Additionally, the 2D coordinates have no semantic meaning: the same real-world concept may land in completely different positions in two different t-SNE runs. For production feature extraction, use PCA (parametric, stable), UMAP's standard fit/transform (stable for the fitted model), or UMAP at higher dimensions.
Q6: How would you use UMAP in a production ML monitoring system to detect concept drift?
Fit UMAP on a representative baseline dataset of model input embeddings (e.g., the first month of production inference). Each week, transform new production embeddings using the fitted reducer. Apply three drift detectors: (1) centroid shift - if the mean 2D position of this week's points shifts significantly from baseline ( standard deviations), distribution shift may be occurring; (2) KS test on marginal distributions of the 2D coordinates - indicates significant change; (3) out-of-distribution rate - what fraction of new points fall outside the bounding box of baseline points (with some margin), indicating novel input types. Alert when any threshold is exceeded. This is more interpretable than MMD or statistical tests on raw high-dimensional embeddings because you can visualize which part of the embedding space the new points are landing in.
Comparison Table: PCA vs t-SNE vs UMAP
| Property | PCA | t-SNE | UMAP |
|---|---|---|---|
| Type | Linear | Non-linear | Non-linear |
| Preserves global structure | Yes | No | Better than t-SNE |
| Preserves local structure | Partially | Very well | Very well |
| Parametric (transform new pts) | Yes | No | Yes (standard mode) |
| Speed | / | ||
| Max practical | Millions | ~100K (BH) | Millions |
| Reproducible | Yes | No (random init) | More stable |
| Cluster sizes meaningful | Yes | No | Slightly more |
| Distances meaningful | Yes | No | Better than t-SNE |
| Non-Euclidean metrics | No (linear only) | Yes | Yes |
| Supports sparse input | Yes (TruncatedSVD) | No | Yes |
| Higher-dim output for ML | Yes | Not useful | Yes () |
| Best for | Production reduction | Publication figures | Production monitoring |
Common t-SNE and UMAP Pitfalls
# ── PITFALL 1: Not using PCA pre-reduction before t-SNE ───────────────────────
# WRONG: t-SNE on raw 768-dim embeddings - slow and noisy
tsne_bad = TSNE(n_components=2, perplexity=30, random_state=42)
X_bad = tsne_bad.fit_transform(X_raw_768dim) # O(n² × 768) - very slow
# RIGHT: PCA first, then t-SNE
X_pca50 = PCA(n_components=50, random_state=42).fit_transform(
StandardScaler().fit_transform(X_raw_768dim)
)
X_good = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_pca50)
# ── PITFALL 2: Only one t-SNE run ─────────────────────────────────────────────
# WRONG: trust a single run
X_single = TSNE(perplexity=30, random_state=42).fit_transform(X_pca50)
# RIGHT: run multiple seeds, verify cluster stability
for seed in [42, 123, 7, 99, 555]:
X_seed = TSNE(perplexity=30, random_state=seed).fit_transform(X_pca50[:500])
# Visually verify the same 10 digit clusters appear in each run
# ── PITFALL 3: UMAP min_dist=0 for all tasks ─────────────────────────────────
# min_dist=0 creates very tight clusters - great for cluster visualization
# but can create artificial structure in noisy data
reducer_viz = umap.UMAP(n_components=2, min_dist=0.0, n_neighbors=15) # visualization
reducer_ml = umap.UMAP(n_components=32, min_dist=0.1, n_neighbors=30) # ML features
# ── PITFALL 4: Comparing embeddings from different t-SNE runs ─────────────────
# WRONG: train classifier on t-SNE run 1, evaluate on run 2
X_run1 = TSNE(random_state=42).fit_transform(X_pca50)
X_run2 = TSNE(random_state=99).fit_transform(X_pca50)
# The coordinate spaces are completely unrelated - any classifier trained on run1
# will produce meaningless predictions on run2 coordinates.
# RIGHT: use UMAP which is consistent across fit() and transform() calls
import umap
reducer = umap.UMAP(random_state=42)
reducer.fit(X_train_pca)
X_train_2d = reducer.transform(X_train_pca) # consistent
X_test_2d = reducer.transform(X_test_pca) # consistent with training layout
# ── PITFALL 5: Wrong metric for text embeddings ───────────────────────────────
# WRONG: Euclidean distance for text embeddings (biased by vector magnitude)
reducer_bad = umap.UMAP(metric='euclidean') # magnitude dominates
# RIGHT: cosine distance for normalized embeddings
reducer_good = umap.UMAP(metric='cosine') # direction matters, not magnitude
# Or: L2-normalize first, then Euclidean (equivalent to cosine for normalized vecs)
X_normalized = X_embeddings / np.linalg.norm(X_embeddings, axis=1, keepdims=True)
reducer_ok = umap.UMAP(metric='euclidean') # works correctly on normalized data
Practical Workflow: From Raw Embeddings to Production Visualization
import numpy as np
import umap
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize
import matplotlib.pyplot as plt
def embedding_visualization_pipeline(
embeddings: np.ndarray,
labels: np.ndarray = None,
label_names: dict = None,
embedding_type: str = 'dense', # 'dense' or 'text'
n_pca_dims: int = 50,
method: str = 'umap', # 'umap' or 'tsne'
n_neighbors: int = 15,
min_dist: float = 0.1,
perplexity: int = 30,
random_state: int = 42,
title: str = "Embedding Visualization",
) -> tuple:
"""
Production-ready pipeline: embeddings → 2D visualization.
Steps:
1. Normalize (if text embeddings)
2. PCA pre-reduction (if n_dims > n_pca_dims)
3. t-SNE or UMAP to 2D
4. Visualization with labels and centroids
"""
print(f"Input shape: {embeddings.shape}")
# Step 1: Normalize for text/sentence embeddings
if embedding_type == 'text':
embeddings = normalize(embeddings, norm='l2')
print("Applied L2 normalization (text embeddings)")
# Step 2: PCA pre-reduction
if embeddings.shape[1] > n_pca_dims:
pca = PCA(n_components=n_pca_dims, svd_solver='randomized',
random_state=random_state)
if embedding_type != 'text':
embeddings = StandardScaler().fit_transform(embeddings)
embeddings_pca = pca.fit_transform(embeddings)
print(f"PCA: {embeddings.shape[1]}D → {n_pca_dims}D "
f"({pca.explained_variance_ratio_.sum():.1%} variance)")
else:
embeddings_pca = embeddings
# Step 3: 2D projection
if method == 'umap':
metric = 'cosine' if embedding_type == 'text' else 'euclidean'
reducer = umap.UMAP(
n_components=2,
n_neighbors=n_neighbors,
min_dist=min_dist,
metric=metric,
random_state=random_state,
n_jobs=-1,
)
X_2d = reducer.fit_transform(embeddings_pca)
print(f"UMAP: {embeddings_pca.shape[1]}D → 2D")
else: # tsne
from sklearn.manifold import TSNE
tsne = TSNE(
n_components=2,
perplexity=min(perplexity, len(embeddings) // 3),
learning_rate='auto',
random_state=random_state,
n_jobs=-1,
)
X_2d = tsne.fit_transform(embeddings_pca)
print(f"t-SNE: {embeddings_pca.shape[1]}D → 2D, "
f"KL={tsne.kl_divergence_:.4f}")
# Step 4: Visualization
fig, ax = plt.subplots(figsize=(12, 9))
if labels is not None:
unique_labels = np.unique(labels)
cmap = plt.cm.tab20 if len(unique_labels) > 10 else plt.cm.tab10
colors = cmap(np.linspace(0, 1, len(unique_labels)))
for i, (label, color) in enumerate(zip(unique_labels, colors)):
mask = labels == label
lname = label_names.get(label, str(label)) if label_names else str(label)
ax.scatter(X_2d[mask, 0], X_2d[mask, 1], color=color,
s=8, alpha=0.6, label=lname)
# Plot centroid
centroid = X_2d[mask].mean(axis=0)
ax.annotate(lname, centroid, fontsize=9, fontweight='bold',
ha='center', va='center',
bbox=dict(boxstyle='round,pad=0.3', facecolor='white',
alpha=0.8, edgecolor=color))
ax.legend(markerscale=2, bbox_to_anchor=(1.01, 1), loc='upper left',
fontsize=9, title="Classes")
else:
ax.scatter(X_2d[:, 0], X_2d[:, 1], s=5, alpha=0.5, color='#3b82f6')
ax.set_title(title, fontsize=14)
ax.axis('off')
plt.tight_layout()
plt.show()
return X_2d, reducer if method == 'umap' else None
:::tip 🎮 Interactive Playground
Visualize this concept: Try the t-SNE & UMAP demo on the EngineersOfAI Playground - no code required.
:::
