Skip to main content

t-SNE and UMAP

import ReadingTime from '@site/src/components/ReadingTime';

:::note Interview Relevance - High t-SNE and UMAP are standard interview topics for roles involving embeddings, NLP, computer vision, and model interpretability. Interviewers commonly ask about the crowding problem, the perplexity hyperparameter, why t-SNE distances are meaningless, and when you would use UMAP vs t-SNE in production. :::

The Real Interview Moment

You are presenting your BERT-based customer support routing system to the VP of Engineering. You claim the model understands the difference between billing complaints, technical issues, and feature requests. He asks: "Can you show me that?" You cannot show him 768-dimensional embeddings as a table. You fire up a Jupyter notebook, run UMAP on 10,000 support ticket embeddings, and project to 2D. Billing complaints cluster in the top-left. Technical issues form a tight cluster in the center. Feature requests scatter across the right side. A small isolated cluster near the bottom turns out to be outage reports - a category the routing system was misclassifying as technical issues.

"That cluster," you say, pointing to the outliers, "those are outage reports. They look like technical issues to the classifier because the text is similar - 'the system is not working' - but the intent is completely different. We need a separate class."

Two tools made that conversation possible: UMAP (to produce the visualization in 4 seconds instead of 4 minutes) and your understanding of what the visualization does and does not mean.

Why This Exists - The Crowding Problem

PCA projects data onto a linear subspace. For data with non-linear structure - word embeddings, image feature vectors, single-cell RNA profiles - the linear projection folds the manifold back on itself, placing points that are intrinsically far apart near each other in 2D.

The naive fix - SNE (Stochastic Neighbor Embedding, Hinton & Roweis 2002) - computes Gaussian similarity in high-D and Gaussian similarity in low-D, minimizing the KL divergence between the two distributions. But it has a fundamental flaw: the crowding problem.

In 2D, the area of an annulus at radius rr scales as rr. In 10D, the volume of the corresponding shell scales as r9r^9. There is vastly more volume near moderate distances in high-D than in 2D. When you project 10D data to 2D, moderately-far neighbors in 10D have nowhere to go - they crowd into the center of the 2D plot, obliterating the structure you are trying to see.

t-SNE (van der Maaten & Hinton, 2008) solves this by using a heavy-tailed t-distribution in the low-dimensional space. The t-distribution gives moderately-far points more room to spread out, resolving the crowding problem.

SNE: The Foundation

SNE defines a conditional similarity from point xix_i to point xjx_j using a Gaussian centered at xix_i:

pji=exp(xixj2/2σi2)kiexp(xixk2/2σi2)p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}

The bandwidth σi\sigma_i is per-point - each point gets its own bandwidth chosen so that the entropy of PiP_i equals the user-specified perplexity:

Perplexity(Pi)=2H(Pi)whereH(Pi)=jpjilog2pji\text{Perplexity}(P_i) = 2^{H(P_i)} \quad \text{where} \quad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}

Perplexity \approx effective number of neighbors. Points in dense regions get small σi\sigma_i (tight neighborhoods); points in sparse regions get large σi\sigma_i (broader neighborhoods). This adaptive bandwidth is one of t-SNE's key advantages over distance-based methods.

Symmetric SNE symmetrizes: pij=(pji+pij)/2np_{ij} = (p_{j|i} + p_{i|j}) / 2n. This ensures pij=pjip_{ij} = p_{ji} and the gradient is simpler.

In the low-dimensional space, SNE uses another Gaussian:

qij=exp(yiyj2)klexp(ykyl2)q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_k - y_l\|^2)}

The objective is to minimize the KL divergence between PP and QQ:

LSNE=KL(PQ)=ijpijlogpijqij\mathcal{L}_{\text{SNE}} = \text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

Problem: the Gaussian in the low-D space creates the crowding problem. Moderately-distant points in high-D have pij>0p_{ij} > 0, but in 2D the Gaussian forces them close together.

t-SNE: Replacing the Low-Dimensional Gaussian

The key modification in t-SNE: replace the low-dimensional Gaussian with a Student t-distribution with 1 degree of freedom (a Cauchy distribution):

qij=(1+yiyj2)1kl(1+ykyl2)1q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \|y_k - y_l\|^2\right)^{-1}}

The t-distribution has heavy tails: at distance rr, its density falls off as 1/r21/r^2 instead of exp(r2)\exp(-r^2). This means moderately-distant points in high-D can be placed at moderate distances in 2D - they no longer need to crowd the center.

Comparison:

DistributionDensity at distance rrEffect
Gaussianer2\propto e^{-r^2}Rapid falloff - moderately distant points crowd together
Student-t (1 dof)(1+r2)1\propto (1+r^2)^{-1}Heavy tail - moderately distant points can spread out

The t-SNE objective:

L=KL(PQ)=ijpijlogpijqij=ijpijlogpijijpijlogqij\mathcal{L} = \text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} = \sum_{i \neq j} p_{ij} \log p_{ij} - \sum_{i \neq j} p_{ij} \log q_{ij}

The gradient with respect to yiy_i has a clean form:

Lyi=4ji(pijqij)(yiyj)(1+yiyj2)1\frac{\partial \mathcal{L}}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}

This gradient has an intuitive interpretation: if pij>qijp_{ij} > q_{ij} (points are closer in high-D than in 2D), the gradient pulls yiy_i toward yjy_j. If pij<qijp_{ij} < q_{ij} (points are farther in high-D), the gradient pushes yiy_i away from yjy_j.

Optimization uses gradient descent with momentum and an important trick called early exaggeration: multiply pijp_{ij} by 4–12 in the first 250 iterations. This forces clusters to form early (points with high pijp_{ij} are strongly attracted), and then the exaggeration factor is removed to fine-tune the cluster structure.

The Perplexity Hyperparameter

Perplexity controls the effective neighborhood size. Intuitively: "how many neighbors should each point consider?"

perplexity=2H(Pi)number of effective neighbors of point i\text{perplexity} = 2^{H(P_i)} \approx \text{number of effective neighbors of point } i

Effect of perplexity on the embedding:

PerplexityLocal vs global balanceWhen to use
5–10Very local - tight microclusters, may fragment real clustersWhen you want to see fine-grained substructure
20–50Balanced - standard starting pointDefault for most tasks
100–200More global - clusters may mergeLarge datasets where global structure matters

Rules:

  1. Perplexity must be less than n/3n/3 (more neighbors than points makes no sense)
  2. Always run t-SNE at multiple perplexity values and check stability
  3. For large datasets (n>10,000n > 10{,}000), perplexity of 50–100 is typical
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

# Load MNIST digits (8×8, 64 dimensions)
digits = load_digits()
X, y = digits.data, digits.target

# BEST PRACTICE: PCA pre-reduction to 50 dims before t-SNE
# Removes noise, speeds up t-SNE significantly (O(n²) cost dominated by pairwise dist)
X_pca50 = PCA(n_components=50, random_state=42).fit_transform(
StandardScaler().fit_transform(X)
)

# Compare perplexity values
perplexities = [5, 15, 30, 50, 100]
fig, axes = plt.subplots(1, len(perplexities), figsize=(25, 5))

for ax, perp in zip(axes, perplexities):
tsne = TSNE(
n_components=2,
perplexity=perp,
learning_rate='auto',
n_iter=1000,
random_state=42,
)
X_tsne = tsne.fit_transform(X_pca50)
ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=5, alpha=0.7)
ax.set_title(f"Perplexity={perp}\nKL={tsne.kl_divergence_:.3f}", fontsize=10)
ax.axis('off')

plt.suptitle("t-SNE: Effect of Perplexity on MNIST Digits", fontsize=13)
plt.tight_layout()
plt.show()

Barnes-Hut Approximation: Making t-SNE Scalable

Naive t-SNE is O(n2)O(n^2) per gradient step - impossible for n>50,000n > 50{,}000. The Barnes-Hut approximation reduces this to O(nlogn)O(n \log n) using the same idea as the Barnes-Hut N-body simulation in astrophysics:

  1. Build a quadtree (2D) or octree (3D) over the current embedding positions
  2. For each point, cells far away are treated as a single "super-point" at their center of mass
  3. A cell is considered "far" if its angular size θ=w/r\theta = w/r (width/distance) is below a threshold (typically 0.5)

This approximates the repulsive forces (negative gradient term) without computing all O(n2)O(n^2) pairwise interactions, at the cost of some approximation error.

from sklearn.manifold import TSNE
import numpy as np

# Barnes-Hut t-SNE (sklearn's default for n_components <= 3)
tsne_bh = TSNE(
n_components=2,
perplexity=30,
learning_rate='auto',
n_iter=1000,
n_iter_without_progress=150, # early stopping
early_exaggeration=12.0, # multiplier for p_ij in first ~250 iters
method='barnes_hut', # O(n log n) - default for 2D/3D
angle=0.5, # Barnes-Hut approximation angle (lower=more accurate)
random_state=42,
n_jobs=-1,
verbose=1,
)

# For small n (< 3000) or when you need exact results:
tsne_exact = TSNE(
n_components=2,
perplexity=30,
method='exact', # O(n²) - exact computation
random_state=42,
)

# Production-scale t-SNE: PCA first, then Barnes-Hut
def production_tsne(X: np.ndarray, n_pca_dims: int = 50, perplexity: int = 30,
random_state: int = 42) -> np.ndarray:
"""
Production-ready t-SNE:
1. Standardize
2. PCA to 50 dims (noise removal + speed)
3. Barnes-Hut t-SNE
"""
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

print(f"Input shape: {X.shape}")

# Step 1: Standardize
X_scaled = StandardScaler().fit_transform(X)

# Step 2: PCA pre-reduction
n_pca = min(n_pca_dims, X.shape[1], X.shape[0]-1)
X_pca = PCA(n_components=n_pca, random_state=random_state).fit_transform(X_scaled)
print(f"After PCA: {X_pca.shape}")

# Step 3: t-SNE
tsne = TSNE(
n_components=2,
perplexity=min(perplexity, X.shape[0] // 3), # perplexity < n/3
learning_rate='auto',
n_iter=1000,
method='barnes_hut',
random_state=random_state,
n_jobs=-1,
)
X_2d = tsne.fit_transform(X_pca)
print(f"KL divergence: {tsne.kl_divergence_:.4f}")

return X_2d

The Critical t-SNE Misconceptions

:::danger t-SNE Distances and Sizes Are Meaningless t-SNE distorts distances to preserve local neighborhood structure. Cluster sizes, distances between clusters, and relative positions of clusters are all arbitrary. You cannot make quantitative claims about distances or proportions from a t-SNE plot. :::

Misconception 1: Cluster size = cluster importance

t-SNE stretches regions of low density and compresses regions of high density. A large cluster in the plot may simply be a region where t-SNE expanded the layout. A tight cluster may be compressed. Neither reflects the actual number of points or spread.

Misconception 2: Distance between clusters = similarity in original space

t-SNE preserves local neighborhood structure but destroys global structure. Two clusters that are far apart in the t-SNE plot may be close in the original space. Two clusters that appear adjacent may be far apart.

# Demonstration: t-SNE distance vs actual distance
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
import numpy as np

def check_distance_preservation(X_orig: np.ndarray, X_2d: np.ndarray,
n_sample: int = 500) -> dict:
"""
Check how well 2D embedding preserves pairwise distances.
A good embedding has high correlation (close to 1.0) for PCA,
but t-SNE will show low correlation - by design.
"""
rng = np.random.default_rng(42)
idx = rng.choice(len(X_orig), min(n_sample, len(X_orig)), replace=False)

d_orig = pdist(X_orig[idx], metric='euclidean')
d_2d = pdist(X_2d[idx], metric='euclidean')

spearman_r, pvalue = spearmanr(d_orig, d_2d)
pearson_r = np.corrcoef(d_orig, d_2d)[0, 1]

# Local vs global preservation
# For local: only consider pairs within top 10th percentile of proximity
local_threshold = np.percentile(d_orig, 10)
local_mask = d_orig < local_threshold

local_spearman, _ = spearmanr(d_orig[local_mask], d_2d[local_mask])

print(f"Global Spearman rank correlation: {spearman_r:.3f}")
print(f"Global Pearson correlation: {pearson_r:.3f}")
print(f"Local Spearman (near pairs): {local_spearman:.3f}")
print()
print("Interpretation:")
print(" Global ~0.3-0.5 for t-SNE: EXPECTED - global structure not preserved")
print(" Local ~0.8-0.9 for t-SNE: GOOD - local structure well preserved")
print(" Global ~0.9+ for PCA: EXPECTED - PCA preserves global distances")

return {'global_spearman': spearman_r, 'local_spearman': local_spearman}

Misconception 3: One t-SNE run is definitive

t-SNE uses random initialization. Different seeds produce different layouts. A cluster structure is only meaningful if it appears consistently across multiple random seeds.

def check_tsne_stability(X_pca: np.ndarray, y: np.ndarray,
n_seeds: int = 5, perplexity: int = 30):
"""Run t-SNE with multiple seeds to check layout stability."""
fig, axes = plt.subplots(1, n_seeds, figsize=(5*n_seeds, 4))

for i, seed in enumerate(range(n_seeds)):
tsne = TSNE(n_components=2, perplexity=perplexity,
learning_rate='auto', random_state=seed)
X_2d = tsne.fit_transform(X_pca)

axes[i].scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=5, alpha=0.7)
axes[i].set_title(f"Seed {seed}\nKL={tsne.kl_divergence_:.3f}", fontsize=9)
axes[i].axis('off')

plt.suptitle("t-SNE Stability Check (5 random seeds)\n"
"Same structures appearing in all runs = real signal", fontsize=11)
plt.tight_layout()
plt.show()

UMAP: Faster, More Faithful, Parametric

UMAP (Uniform Manifold Approximation and Projection, McInnes et al. 2018) solves t-SNE's core limitations through a fundamentally different theoretical foundation: Riemannian geometry and topological data analysis.

UMAP Theory: Fuzzy Simplicial Complexes

Where t-SNE uses Gaussian distributions to model local structure, UMAP builds a weighted graph (fuzzy simplicial complex) over the data:

Step 1: Build the high-dimensional graph

For each point xix_i, find its kk nearest neighbors. Define the weight of the edge (xi,xj)(x_i, x_j) as:

wijhigh=exp(d(xi,xj)ρiσi)w_{ij}^{\text{high}} = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)

where ρi\rho_i is the distance to the nearest neighbor (ensures the closest neighbor always has weight 1 - making the representation locally adaptive) and σi\sigma_i is chosen so the weights sum to log2(k)\log_2(k).

The symmetrized graph weight:

wij=wijhigh+wjihighwijhighwjihighw_{ij} = w_{ij}^{\text{high}} + w_{ji}^{\text{high}} - w_{ij}^{\text{high}} \cdot w_{ji}^{\text{high}}

Step 2: Build the low-dimensional graph

The low-dimensional edge weights use:

wijlow=(1+ayiyj2b)1w_{ij}^{\text{low}} = \left(1 + a \cdot \|y_i - y_j\|^{2b}\right)^{-1}

Parameters aa and bb are determined by min_dist (the minimum distance between points in the embedding - smaller values produce tighter clusters).

Step 3: Optimize the embedding

Minimize the cross-entropy between the high-D and low-D graphs (binary cross-entropy over edges):

L=ij[wijlogwijwijlow+(1wij)log1wij1wijlow]\mathcal{L} = \sum_{i \neq j} \left[ w_{ij} \log \frac{w_{ij}}{w_{ij}^{\text{low}}} + (1 - w_{ij}) \log \frac{1 - w_{ij}}{1 - w_{ij}^{\text{low}}} \right]

Optimization uses stochastic gradient descent with negative sampling (like word2vec) - efficient and O(nlogn)O(n \log n).

Key algorithmic advances over t-SNE:

  1. NNDescent for approximate nearest neighbor graph: O(nlogn)O(n \log n) vs O(n2)O(n^2) for exact NN
  2. Cross-entropy with negative sampling: avoids O(n2)O(n^2) normalization sum (the "Z" denominator in t-SNE's qijq_{ij})
  3. Parametric mode: the encoder is a neural network - new points get a single forward pass

UMAP Key Parameters

ParameterEffectDefaultTuning Guide
n_neighborsSize of local neighborhood15Small (5): fine local structure; Large (100+): global structure
min_distTightness of clusters in 2D0.1Small (0.0): tight clusters; Large (0.5): more spread out
n_componentsOutput dimensions22 or 3 for viz; 16–64 for downstream ML
metricDistance in high-D'euclidean''cosine' for embeddings; 'hamming' for binary
n_epochsOptimization stepsautoMore = better layout, slower
# pip install umap-learn
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

digits = load_digits()
X, y = digits.data, digits.target
X_scaled = StandardScaler().fit_transform(X)
X_pca50 = PCA(n_components=50, random_state=42).fit_transform(X_scaled)

# ── Full UMAP parameter exploration ───────────────────────────────────────────
fig, axes = plt.subplots(2, 3, figsize=(18, 11))

configs = [
{'n_neighbors': 5, 'min_dist': 0.1, 'label': 'n_neighbors=5\n(very local)'},
{'n_neighbors': 15, 'min_dist': 0.1, 'label': 'n_neighbors=15\n(default)'},
{'n_neighbors': 50, 'min_dist': 0.1, 'label': 'n_neighbors=50\n(more global)'},
{'n_neighbors': 15, 'min_dist': 0.0, 'label': 'min_dist=0.0\n(tight clusters)'},
{'n_neighbors': 15, 'min_dist': 0.5, 'label': 'min_dist=0.5\n(spread out)'},
{'n_neighbors': 15, 'min_dist': 0.99, 'label': 'min_dist=0.99\n(almost uniform)'},
]

for ax, cfg in zip(axes.flat, configs):
reducer = umap.UMAP(
n_components=2,
n_neighbors=cfg['n_neighbors'],
min_dist=cfg['min_dist'],
random_state=42,
n_jobs=-1,
)
X_2d = reducer.fit_transform(X_pca50)
ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=8, alpha=0.7)
ax.set_title(cfg['label'], fontsize=11)
ax.axis('off')

plt.suptitle("UMAP Hyperparameter Effects on MNIST Digits", fontsize=14)
plt.tight_layout()
plt.show()

UMAP with Non-Euclidean Metrics

One of UMAP's major advantages over t-SNE: it natively supports non-Euclidean distance metrics.

import umap
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

# Text data: 20 newsgroups - use cosine distance (text-appropriate)
newsgroups = fetch_20newsgroups(subset='train', categories=[
'sci.space', 'rec.sport.hockey', 'talk.politics.guns', 'comp.graphics'
], remove=('headers', 'footers', 'quotes'), random_state=42)

# TF-IDF features
tfidf = TfidfVectorizer(max_features=5000, stop_words='english', sublinear_tf=True)
X_tfidf = tfidf.fit_transform(newsgroups.data) # sparse matrix

# UMAP with cosine distance (correct for sparse text features)
reducer_cosine = umap.UMAP(
n_components=2,
metric='cosine', # correct for text - invariant to document length
n_neighbors=20,
min_dist=0.1,
random_state=42,
n_jobs=-1,
verbose=True,
)
# UMAP accepts sparse matrices natively
X_2d_text = reducer_cosine.fit_transform(X_tfidf)

fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(X_2d_text[:, 0], X_2d_text[:, 1],
c=newsgroups.target, cmap='tab10', s=5, alpha=0.7)
plt.colorbar(scatter, ax=ax, label='Category')
ax.set_title("UMAP of 20 Newsgroups (cosine distance, TF-IDF)", fontsize=12)
ax.axis('off')
plt.tight_layout()
plt.show()

# ── Other useful metrics ───────────────────────────────────────────────────────
# 'hamming' - binary features (bag-of-words presence/absence)
# 'jaccard' - set similarity (tags, categories)
# 'manhattan' - L1 - more robust to outliers than L2 for tabular data
# 'precomputed' - pass a precomputed distance matrix

Parametric UMAP: Transform New Points

Unlike t-SNE, UMAP learns a mapping that can be applied to new points. There are two modes:

import umap
import numpy as np

# ── Mode 1: Standard UMAP - fit once, transform later ──────────────────────────
reducer = umap.UMAP(n_components=2, random_state=42)
reducer.fit(X_train_pca)

# Transform training data
X_train_2d = reducer.transform(X_train_pca)

# Transform new test data - consistent with training layout!
X_test_2d = reducer.transform(X_test_pca)

print("Can transform new points:", True)

# ── Mode 2: Parametric UMAP - neural network encoder ──────────────────────────
# Trains a neural network as the embedding function
# Enables very fast inference on new points (single forward pass)
from umap.parametric_umap import ParametricUMAP

param_reducer = ParametricUMAP(
n_components=2,
n_neighbors=15,
min_dist=0.1,
n_training_epochs=10, # number of epochs to train the encoder network
encoder=None, # auto-builds MLP: input_dim → 100 → 100 → n_components
verbose=True,
)
param_reducer.fit(X_train_pca)

X_train_2d = param_reducer.transform(X_train_pca) # batch forward pass
X_test_2d = param_reducer.transform(X_test_pca) # fast inference

# Check consistency: same points should map to same location
X_train_2d_check = param_reducer.transform(X_train_pca[:10])
# Standard UMAP: X_train_2d_check ≈ X_train_2d[:10] (deterministic)
# Parametric UMAP: X_train_2d_check == X_train_2d[:10] (exact, it's a function)

UMAP for Dimensionality Reduction (Not Just Visualization)

UMAP at higher dimensions (k=16,32,64k = 16, 32, 64) is a powerful general-purpose dimensionality reduction technique - not just for visualization:

import umap
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import numpy as np

def compare_reduction_methods(X_train, y_train, X_test, y_test, target_dim=32):
"""Compare PCA vs UMAP for downstream classification."""
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

results = {}

# Baseline: raw features
lr_raw = LogisticRegression(max_iter=1000, random_state=42)
lr_raw.fit(X_train, y_train)
results['raw'] = accuracy_score(y_test, lr_raw.predict(X_test))

# PCA reduction
pca_pipe = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=target_dim, random_state=42)),
('lr', LogisticRegression(max_iter=1000, random_state=42)),
])
pca_pipe.fit(X_train, y_train)
results[f'pca_{target_dim}'] = accuracy_score(y_test, pca_pipe.predict(X_test))

# UMAP reduction
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

reducer = umap.UMAP(
n_components=target_dim,
n_neighbors=30,
min_dist=0.0, # tight clusters: good for downstream classification
metric='euclidean',
random_state=42,
)
X_train_umap = reducer.fit_transform(X_train_scaled)
X_test_umap = reducer.transform(X_test_scaled)

lr_umap = LogisticRegression(max_iter=1000, random_state=42)
lr_umap.fit(X_train_umap, y_train)
results[f'umap_{target_dim}'] = accuracy_score(y_test, lr_umap.predict(X_test_umap))

for method, acc in results.items():
print(f"{method:20s}: {acc:.4f}")

return results

t-SNE vs UMAP: Flow and Decision

Production: Embedding Space Monitoring

import umap
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp

class EmbeddingDriftMonitor:
"""
Monitor embedding space drift in production using UMAP.
Fits once on baseline, transforms new batches consistently.
"""

def __init__(self, n_neighbors: int = 15, min_dist: float = 0.1,
drift_threshold: float = 0.5):
self.reducer = umap.UMAP(
n_components=2,
n_neighbors=n_neighbors,
min_dist=min_dist,
random_state=42
)
self.drift_threshold = drift_threshold
self.baseline_2d_ = None
self.baseline_stats_ = None

def fit_baseline(self, embeddings: np.ndarray,
labels: np.ndarray = None) -> 'EmbeddingDriftMonitor':
"""Fit UMAP reducer on baseline (training/reference) embeddings."""
self.reducer.fit(embeddings)
self.baseline_2d_ = self.reducer.transform(embeddings)
self.baseline_labels_ = labels

# Compute reference statistics for drift detection
self.baseline_stats_ = {
'mean': self.baseline_2d_.mean(axis=0),
'std': self.baseline_2d_.std(axis=0),
'x_dist': self.baseline_2d_[:, 0],
'y_dist': self.baseline_2d_[:, 1],
}
print(f"Baseline fitted: {embeddings.shape[0]} samples")
print(f"2D bounds: x=[{self.baseline_2d_[:,0].min():.2f}, "
f"{self.baseline_2d_[:,0].max():.2f}], "
f"y=[{self.baseline_2d_[:,1].min():.2f}, "
f"{self.baseline_2d_[:,1].max():.2f}]")
return self

def check_drift(self, new_embeddings: np.ndarray,
period_label: str = "Production") -> dict:
"""
Transform new embeddings and check for distribution drift.

Returns drift metrics and raises alert if drift exceeds threshold.
"""
new_2d = self.reducer.transform(new_embeddings)

# Centroid shift
centroid_shift = np.linalg.norm(
new_2d.mean(axis=0) - self.baseline_stats_['mean']
)

# KS test on marginal distributions
ks_x = ks_2samp(self.baseline_stats_['x_dist'], new_2d[:, 0])
ks_y = ks_2samp(self.baseline_stats_['y_dist'], new_2d[:, 1])

# Coverage: fraction of new points outside baseline bounding box + margin
x_min, x_max = self.baseline_2d_[:,0].min(), self.baseline_2d_[:,0].max()
y_min, y_max = self.baseline_2d_[:,1].min(), self.baseline_2d_[:,1].max()
margin = 0.2 * max(x_max - x_min, y_max - y_min)

out_of_distribution = (
(new_2d[:, 0] < x_min - margin) |
(new_2d[:, 0] > x_max + margin) |
(new_2d[:, 1] < y_min - margin) |
(new_2d[:, 1] > y_max + margin)
).mean()

metrics = {
'centroid_shift': centroid_shift,
'ks_stat_x': ks_x.statistic,
'ks_stat_y': ks_y.statistic,
'ks_pvalue_x': ks_x.pvalue,
'ks_pvalue_y': ks_y.pvalue,
'out_of_distribution_rate': out_of_distribution,
}

drift_detected = (
centroid_shift > self.drift_threshold or
ks_x.pvalue < 0.01 or ks_y.pvalue < 0.01 or
out_of_distribution > 0.05
)

if drift_detected:
print(f"DRIFT ALERT for {period_label}:")
print(f" Centroid shift: {centroid_shift:.4f} (threshold={self.drift_threshold})")
print(f" KS p-values: x={ks_x.pvalue:.4f}, y={ks_y.pvalue:.4f}")
print(f" OOD rate: {out_of_distribution:.1%}")
else:
print(f"{period_label}: No significant drift detected")

return metrics, new_2d

def plot_drift(self, new_2d: np.ndarray, period_label: str = "Production"):
fig, axes = plt.subplots(1, 2, figsize=(16, 7))

# Baseline
if hasattr(self, 'baseline_labels_') and self.baseline_labels_ is not None:
scatter = axes[0].scatter(
self.baseline_2d_[:, 0], self.baseline_2d_[:, 1],
c=self.baseline_labels_, cmap='tab10', s=5, alpha=0.6
)
else:
axes[0].scatter(self.baseline_2d_[:, 0], self.baseline_2d_[:, 1],
s=5, alpha=0.5, color='#3b82f6')
axes[0].set_title("Baseline Embedding Space", fontsize=12)
axes[0].axis('off')

# Overlay: baseline (gray) + new points (red)
axes[1].scatter(self.baseline_2d_[:, 0], self.baseline_2d_[:, 1],
s=3, alpha=0.2, color='#94a3b8', label='Baseline')
axes[1].scatter(new_2d[:, 0], new_2d[:, 1],
s=15, alpha=0.7, color='#ef4444', label=period_label)
axes[1].set_title(f"Drift: Baseline vs {period_label}", fontsize=12)
axes[1].legend()
axes[1].axis('off')

plt.tight_layout()
plt.show()

:::tip UMAP for Production, t-SNE for Exploration Use UMAP in production: it can transform new points, runs in O(nlogn)O(n \log n), and handles millions of samples. Use t-SNE for one-time exploratory visualizations on static datasets where you want the cleanest possible cluster separation. Always run t-SNE on PCA-pre-reduced data (50 dimensions) for speed and noise reduction. :::

:::warning Do Not Use t-SNE or UMAP Coordinates as Features Never feed t-SNE or UMAP 2D coordinates into a downstream classifier. t-SNE is non-parametric - you cannot transform new points reproducibly. UMAP coordinates are stable for the fitted model but the 2D space is not semantically meaningful for classification (distances are distorted). Use PCA, raw embeddings, or UMAP at higher dimensions (k16k \geq 16, min_dist=0.0) for feature extraction. :::

YouTube Resources

ChannelVideo TitleWhy Watch
StatQuest with Josh Starmert-SNE, Clearly ExplainedBest visual walkthrough of perplexity and the algorithm steps
ritvikmatht-SNE ExplainedMath derivation of the KL objective and gradient
Leland McInnes (UMAP author)UMAP: Theory and PracticeFirst-principles explanation of the topological foundation
Two Minute PapersUMAP paper overviewQuick intuition for the crowding problem vs UMAP approach

Interview Q&A

Q1: Explain the crowding problem in SNE and how t-SNE solves it.

In 10 or more dimensions, the volume available at moderate distances grows much faster than in 2D. When you project high-D data to 2D using a Gaussian similarity in both spaces, the 2D space cannot accommodate all the moderately-distant neighbors - they all crowd into the center of the plot, obliterating local structure. t-SNE replaces the low-dimensional Gaussian with a Student t-distribution (1 degree of freedom), which has a heavy tail: its density at distance rr falls off as (1+r2)1(1+r^2)^{-1} instead of er2e^{-r^2}. The heavy tail gives moderately-distant points more room to spread apart, resolving the crowding. The mismatch between the narrow Gaussian (high-D) and the heavy-tailed t-distribution (low-D) is intentional - it produces clear cluster separation.

Q2: What does the perplexity parameter in t-SNE actually control? What happens at extreme values?

Perplexity controls the effective number of neighbors each point considers. Mathematically, for each point xix_i, the bandwidth σi\sigma_i is chosen so that the entropy of the conditional distribution PiP_i satisfies perplexity=2H(Pi)\text{perplexity} = 2^{H(P_i)}. This makes the bandwidth adaptive: dense regions get smaller σi\sigma_i (tight neighborhoods), sparse regions get larger σi\sigma_i (broader neighborhoods). At perplexity 5: only immediate neighbors matter - produces fragmented microclusters that may split what should be one cluster. At perplexity 200: each point considers many neighbors - produces smooth global layout but loses fine-grained local structure. Start with perplexity 30, then explore 5, 15, 50, and 100. Clusters that appear stable across all perplexities are real.

Q3: A colleague reports that cluster A is three times larger than cluster B in the t-SNE plot and concludes that class A is three times more common. What is wrong?

Cluster sizes in t-SNE are meaningless. t-SNE stretches regions of low density and compresses regions of high density to preserve local neighborhood structure. A cluster's area in the 2D plot does not correspond to the number of points, the spread in original space, or any other quantitative property. To assess cluster size, count the data points assigned to each cluster by running a separate algorithm (K-means, DBSCAN, or a distance threshold) on the original high-dimensional embeddings. Never use t-SNE geometrical properties for quantitative conclusions - only for qualitative pattern identification.

Q4: How does UMAP differ from t-SNE algorithmically? When would you choose UMAP over t-SNE?

Algorithmically, t-SNE models similarities as Gaussians (high-D) and t-distributions (low-D) and minimizes KL divergence; it runs in O(n2)O(n^2) naively or O(nlogn)O(n \log n) with Barnes-Hut, and cannot transform new points. UMAP builds a weighted graph using fuzzy topological methods (local distances normalized by nearest-neighbor distance, symmetrized), then minimizes cross-entropy between the high-D and low-D graphs using negative sampling - O(nlogn)O(n \log n) via NNDescent. Choose UMAP over t-SNE when: (1) n>50,000n > 50{,}000 - t-SNE becomes too slow, (2) you need to transform new data points at inference, (3) you want to use the reduction for downstream ML (not just visualization), (4) you need a non-Euclidean metric (cosine, Hamming, precomputed). Choose t-SNE when: you want the very clearest cluster separation for a publication figure on a static dataset.

Q5: Why can't you use t-SNE coordinates as input features for a downstream classifier?

t-SNE is non-parametric - it optimizes 2D positions for a specific fixed dataset. Given a new point, there is no function to apply. You must include the new point in the dataset and rerun the entire optimization from scratch - but then the entire layout changes, so you cannot compare to the old embedding or train a classifier on one run's coordinates and predict on another. Additionally, the 2D coordinates have no semantic meaning: the same real-world concept may land in completely different positions in two different t-SNE runs. For production feature extraction, use PCA (parametric, stable), UMAP's standard fit/transform (stable for the fitted model), or UMAP at higher dimensions.

Q6: How would you use UMAP in a production ML monitoring system to detect concept drift?

Fit UMAP on a representative baseline dataset of model input embeddings (e.g., the first month of production inference). Each week, transform new production embeddings using the fitted reducer. Apply three drift detectors: (1) centroid shift - if the mean 2D position of this week's points shifts significantly from baseline (>0.5> 0.5 standard deviations), distribution shift may be occurring; (2) KS test on marginal distributions of the 2D coordinates - p<0.01p < 0.01 indicates significant change; (3) out-of-distribution rate - what fraction of new points fall outside the bounding box of baseline points (with some margin), indicating novel input types. Alert when any threshold is exceeded. This is more interpretable than MMD or statistical tests on raw high-dimensional embeddings because you can visualize which part of the embedding space the new points are landing in.

Comparison Table: PCA vs t-SNE vs UMAP

PropertyPCAt-SNEUMAP
TypeLinearNon-linearNon-linear
Preserves global structureYesNoBetter than t-SNE
Preserves local structurePartiallyVery wellVery well
Parametric (transform new pts)YesNoYes (standard mode)
SpeedO(ndmin(n,d))O(nd\min(n,d))O(n2)O(n^2) / O(nlogn)O(n\log n)O(nlogn)O(n\log n)
Max practical nnMillions~100K (BH)Millions
ReproducibleYesNo (random init)More stable
Cluster sizes meaningfulYesNoSlightly more
Distances meaningfulYesNoBetter than t-SNE
Non-Euclidean metricsNo (linear only)YesYes
Supports sparse inputYes (TruncatedSVD)NoYes
Higher-dim output for MLYesNot usefulYes (k16k \geq 16)
Best forProduction reductionPublication figuresProduction monitoring

Common t-SNE and UMAP Pitfalls

# ── PITFALL 1: Not using PCA pre-reduction before t-SNE ───────────────────────
# WRONG: t-SNE on raw 768-dim embeddings - slow and noisy
tsne_bad = TSNE(n_components=2, perplexity=30, random_state=42)
X_bad = tsne_bad.fit_transform(X_raw_768dim) # O(n² × 768) - very slow

# RIGHT: PCA first, then t-SNE
X_pca50 = PCA(n_components=50, random_state=42).fit_transform(
StandardScaler().fit_transform(X_raw_768dim)
)
X_good = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_pca50)

# ── PITFALL 2: Only one t-SNE run ─────────────────────────────────────────────
# WRONG: trust a single run
X_single = TSNE(perplexity=30, random_state=42).fit_transform(X_pca50)
# RIGHT: run multiple seeds, verify cluster stability
for seed in [42, 123, 7, 99, 555]:
X_seed = TSNE(perplexity=30, random_state=seed).fit_transform(X_pca50[:500])
# Visually verify the same 10 digit clusters appear in each run

# ── PITFALL 3: UMAP min_dist=0 for all tasks ─────────────────────────────────
# min_dist=0 creates very tight clusters - great for cluster visualization
# but can create artificial structure in noisy data
reducer_viz = umap.UMAP(n_components=2, min_dist=0.0, n_neighbors=15) # visualization
reducer_ml = umap.UMAP(n_components=32, min_dist=0.1, n_neighbors=30) # ML features

# ── PITFALL 4: Comparing embeddings from different t-SNE runs ─────────────────
# WRONG: train classifier on t-SNE run 1, evaluate on run 2
X_run1 = TSNE(random_state=42).fit_transform(X_pca50)
X_run2 = TSNE(random_state=99).fit_transform(X_pca50)
# The coordinate spaces are completely unrelated - any classifier trained on run1
# will produce meaningless predictions on run2 coordinates.

# RIGHT: use UMAP which is consistent across fit() and transform() calls
import umap
reducer = umap.UMAP(random_state=42)
reducer.fit(X_train_pca)
X_train_2d = reducer.transform(X_train_pca) # consistent
X_test_2d = reducer.transform(X_test_pca) # consistent with training layout

# ── PITFALL 5: Wrong metric for text embeddings ───────────────────────────────
# WRONG: Euclidean distance for text embeddings (biased by vector magnitude)
reducer_bad = umap.UMAP(metric='euclidean') # magnitude dominates

# RIGHT: cosine distance for normalized embeddings
reducer_good = umap.UMAP(metric='cosine') # direction matters, not magnitude
# Or: L2-normalize first, then Euclidean (equivalent to cosine for normalized vecs)
X_normalized = X_embeddings / np.linalg.norm(X_embeddings, axis=1, keepdims=True)
reducer_ok = umap.UMAP(metric='euclidean') # works correctly on normalized data

Practical Workflow: From Raw Embeddings to Production Visualization

import numpy as np
import umap
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize
import matplotlib.pyplot as plt

def embedding_visualization_pipeline(
embeddings: np.ndarray,
labels: np.ndarray = None,
label_names: dict = None,
embedding_type: str = 'dense', # 'dense' or 'text'
n_pca_dims: int = 50,
method: str = 'umap', # 'umap' or 'tsne'
n_neighbors: int = 15,
min_dist: float = 0.1,
perplexity: int = 30,
random_state: int = 42,
title: str = "Embedding Visualization",
) -> tuple:
"""
Production-ready pipeline: embeddings → 2D visualization.

Steps:
1. Normalize (if text embeddings)
2. PCA pre-reduction (if n_dims > n_pca_dims)
3. t-SNE or UMAP to 2D
4. Visualization with labels and centroids
"""
print(f"Input shape: {embeddings.shape}")

# Step 1: Normalize for text/sentence embeddings
if embedding_type == 'text':
embeddings = normalize(embeddings, norm='l2')
print("Applied L2 normalization (text embeddings)")

# Step 2: PCA pre-reduction
if embeddings.shape[1] > n_pca_dims:
pca = PCA(n_components=n_pca_dims, svd_solver='randomized',
random_state=random_state)
if embedding_type != 'text':
embeddings = StandardScaler().fit_transform(embeddings)
embeddings_pca = pca.fit_transform(embeddings)
print(f"PCA: {embeddings.shape[1]}D → {n_pca_dims}D "
f"({pca.explained_variance_ratio_.sum():.1%} variance)")
else:
embeddings_pca = embeddings

# Step 3: 2D projection
if method == 'umap':
metric = 'cosine' if embedding_type == 'text' else 'euclidean'
reducer = umap.UMAP(
n_components=2,
n_neighbors=n_neighbors,
min_dist=min_dist,
metric=metric,
random_state=random_state,
n_jobs=-1,
)
X_2d = reducer.fit_transform(embeddings_pca)
print(f"UMAP: {embeddings_pca.shape[1]}D → 2D")

else: # tsne
from sklearn.manifold import TSNE
tsne = TSNE(
n_components=2,
perplexity=min(perplexity, len(embeddings) // 3),
learning_rate='auto',
random_state=random_state,
n_jobs=-1,
)
X_2d = tsne.fit_transform(embeddings_pca)
print(f"t-SNE: {embeddings_pca.shape[1]}D → 2D, "
f"KL={tsne.kl_divergence_:.4f}")

# Step 4: Visualization
fig, ax = plt.subplots(figsize=(12, 9))

if labels is not None:
unique_labels = np.unique(labels)
cmap = plt.cm.tab20 if len(unique_labels) > 10 else plt.cm.tab10
colors = cmap(np.linspace(0, 1, len(unique_labels)))

for i, (label, color) in enumerate(zip(unique_labels, colors)):
mask = labels == label
lname = label_names.get(label, str(label)) if label_names else str(label)
ax.scatter(X_2d[mask, 0], X_2d[mask, 1], color=color,
s=8, alpha=0.6, label=lname)

# Plot centroid
centroid = X_2d[mask].mean(axis=0)
ax.annotate(lname, centroid, fontsize=9, fontweight='bold',
ha='center', va='center',
bbox=dict(boxstyle='round,pad=0.3', facecolor='white',
alpha=0.8, edgecolor=color))

ax.legend(markerscale=2, bbox_to_anchor=(1.01, 1), loc='upper left',
fontsize=9, title="Classes")
else:
ax.scatter(X_2d[:, 0], X_2d[:, 1], s=5, alpha=0.5, color='#3b82f6')

ax.set_title(title, fontsize=14)
ax.axis('off')
plt.tight_layout()
plt.show()

return X_2d, reducer if method == 'umap' else None

:::tip 🎮 Interactive Playground

Visualize this concept: Try the t-SNE & UMAP demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.