Non-linear dimensionality reduction with t-SNE and UMAP - crowding problem, KL divergence optimization, perplexity, Barnes-Hut approximation, UMAP topological foundations, and production-safe usage.

How does UMAP work in practice?

t-SNE and UMAP covers t-SNE, UMAP, dimensionality reduction from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/unsupervised-learning/t-sne-and-umap

What is the difference between t-SNE and dimensionality reduction?

See the full breakdown at https://engineersofai.com/docs/ml/unsupervised-learning/t-sne-and-umap

t-SNE and UMAP

import ReadingTime from '@site/src/components/ReadingTime';

:::note Interview Relevance - High t-SNE and UMAP are standard interview topics for roles involving embeddings, NLP, computer vision, and model interpretability. Interviewers commonly ask about the crowding problem, the perplexity hyperparameter, why t-SNE distances are meaningless, and when you would use UMAP vs t-SNE in production. :::

The Real Interview Moment

You are presenting your BERT-based customer support routing system to the VP of Engineering. You claim the model understands the difference between billing complaints, technical issues, and feature requests. He asks: "Can you show me that?" You cannot show him 768-dimensional embeddings as a table. You fire up a Jupyter notebook, run UMAP on 10,000 support ticket embeddings, and project to 2D. Billing complaints cluster in the top-left. Technical issues form a tight cluster in the center. Feature requests scatter across the right side. A small isolated cluster near the bottom turns out to be outage reports - a category the routing system was misclassifying as technical issues.

"That cluster," you say, pointing to the outliers, "those are outage reports. They look like technical issues to the classifier because the text is similar - 'the system is not working' - but the intent is completely different. We need a separate class."

Two tools made that conversation possible: UMAP (to produce the visualization in 4 seconds instead of 4 minutes) and your understanding of what the visualization does and does not mean.

Why This Exists - The Crowding Problem

PCA projects data onto a linear subspace. For data with non-linear structure - word embeddings, image feature vectors, single-cell RNA profiles - the linear projection folds the manifold back on itself, placing points that are intrinsically far apart near each other in 2D.

The naive fix - SNE (Stochastic Neighbor Embedding, Hinton & Roweis 2002) - computes Gaussian similarity in high-D and Gaussian similarity in low-D, minimizing the KL divergence between the two distributions. But it has a fundamental flaw: the crowding problem.

In 2D, the area of an annulus at radius $r$ scales as $r$ . In 10D, the volume of the corresponding shell scales as $r^9$ . There is vastly more volume near moderate distances in high-D than in 2D. When you project 10D data to 2D, moderately-far neighbors in 10D have nowhere to go - they crowd into the center of the 2D plot, obliterating the structure you are trying to see.

t-SNE (van der Maaten & Hinton, 2008) solves this by using a heavy-tailed t-distribution in the low-dimensional space. The t-distribution gives moderately-far points more room to spread out, resolving the crowding problem.

SNE: The Foundation

SNE defines a conditional similarity from point $x_i$ to point $x_j$ using a Gaussian centered at $x_i$ :

$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$

The bandwidth $\sigma_i$ is per-point - each point gets its own bandwidth chosen so that the entropy of $P_i$ equals the user-specified perplexity:

$\text{Perplexity}(P_i) = 2^{H(P_i)} \quad \text{where} \quad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}$

Perplexity $\approx$ effective number of neighbors. Points in dense regions get small $\sigma_i$ (tight neighborhoods); points in sparse regions get large $\sigma_i$ (broader neighborhoods). This adaptive bandwidth is one of t-SNE's key advantages over distance-based methods.

Symmetric SNE symmetrizes: $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$ . This ensures $p_{ij} = p_{ji}$ and the gradient is simpler.

In the low-dimensional space, SNE uses another Gaussian:

$q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_k - y_l\|^2)}$

The objective is to minimize the KL divergence between $P$ and $Q$ :

$\mathcal{L}_{\text{SNE}} = \text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$

Problem: the Gaussian in the low-D space creates the crowding problem. Moderately-distant points in high-D have $p_{ij} > 0$ , but in 2D the Gaussian forces them close together.

t-SNE: Replacing the Low-Dimensional Gaussian

The key modification in t-SNE: replace the low-dimensional Gaussian with a Student t-distribution with 1 degree of freedom (a Cauchy distribution):

$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \|y_k - y_l\|^2\right)^{-1}}$

The t-distribution has heavy tails: at distance $r$ , its density falls off as $1/r^2$ instead of $\exp(-r^2)$ . This means moderately-distant points in high-D can be placed at moderate distances in 2D - they no longer need to crowd the center.

Comparison:

Distribution	Density at distance $r$	Effect
Gaussian	$\propto e^{-r^2}$	Rapid falloff - moderately distant points crowd together
Student-t (1 dof)	$\propto (1+r^2)^{-1}$	Heavy tail - moderately distant points can spread out

The t-SNE objective:

$\mathcal{L} = \text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} = \sum_{i \neq j} p_{ij} \log p_{ij} - \sum_{i \neq j} p_{ij} \log q_{ij}$

The gradient with respect to $y_i$ has a clean form:

$\frac{\partial \mathcal{L}}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}$

This gradient has an intuitive interpretation: if $p_{ij} > q_{ij}$ (points are closer in high-D than in 2D), the gradient pulls $y_i$ toward $y_j$ . If $p_{ij} < q_{ij}$ (points are farther in high-D), the gradient pushes $y_i$ away from $y_j$ .

Optimization uses gradient descent with momentum and an important trick called early exaggeration: multiply $p_{ij}$ by 4–12 in the first 250 iterations. This forces clusters to form early (points with high $p_{ij}$ are strongly attracted), and then the exaggeration factor is removed to fine-tune the cluster structure.

The Perplexity Hyperparameter

Perplexity controls the effective neighborhood size. Intuitively: "how many neighbors should each point consider?"

$\text{perplexity} = 2^{H(P_i)} \approx \text{number of effective neighbors of point } i$

Effect of perplexity on the embedding:

Perplexity	Local vs global balance	When to use
5–10	Very local - tight microclusters, may fragment real clusters	When you want to see fine-grained substructure
20–50	Balanced - standard starting point	Default for most tasks
100–200	More global - clusters may merge	Large datasets where global structure matters

Rules:

Perplexity must be less than $n/3$ (more neighbors than points makes no sense)
Always run t-SNE at multiple perplexity values and check stability
For large datasets ( $n > 10{,}000$ ), perplexity of 50–100 is typical

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

# Load MNIST digits (8×8, 64 dimensions)
digits = load_digits()
X, y = digits.data, digits.target

# BEST PRACTICE: PCA pre-reduction to 50 dims before t-SNE
# Removes noise, speeds up t-SNE significantly (O(n²) cost dominated by pairwise dist)
X_pca50 = PCA(n_components=50, random_state=42).fit_transform(
    StandardScaler().fit_transform(X)
)

# Compare perplexity values
perplexities = [5, 15, 30, 50, 100]
fig, axes = plt.subplots(1, len(perplexities), figsize=(25, 5))

for ax, perp in zip(axes, perplexities):
    tsne = TSNE(
        n_components=2,
        perplexity=perp,
        learning_rate='auto',
        n_iter=1000,
        random_state=42,
    )
    X_tsne = tsne.fit_transform(X_pca50)
    ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=5, alpha=0.7)
    ax.set_title(f"Perplexity={perp}\nKL={tsne.kl_divergence_:.3f}", fontsize=10)
    ax.axis('off')

plt.suptitle("t-SNE: Effect of Perplexity on MNIST Digits", fontsize=13)
plt.tight_layout()
plt.show()

Barnes-Hut Approximation: Making t-SNE Scalable

Naive t-SNE is $O(n^2)$ per gradient step - impossible for $n > 50{,}000$ . The Barnes-Hut approximation reduces this to $O(n \log n)$ using the same idea as the Barnes-Hut N-body simulation in astrophysics:

Build a quadtree (2D) or octree (3D) over the current embedding positions
For each point, cells far away are treated as a single "super-point" at their center of mass
A cell is considered "far" if its angular size $\theta = w/r$ (width/distance) is below a threshold (typically 0.5)

This approximates the repulsive forces (negative gradient term) without computing all $O(n^2)$ pairwise interactions, at the cost of some approximation error.

from sklearn.manifold import TSNE
import numpy as np

# Barnes-Hut t-SNE (sklearn's default for n_components <= 3)
tsne_bh = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate='auto',
    n_iter=1000,
    n_iter_without_progress=150,   # early stopping
    early_exaggeration=12.0,       # multiplier for p_ij in first ~250 iters
    method='barnes_hut',           # O(n log n) - default for 2D/3D
    angle=0.5,                     # Barnes-Hut approximation angle (lower=more accurate)
    random_state=42,
    n_jobs=-1,
    verbose=1,
)

# For small n (< 3000) or when you need exact results:
tsne_exact = TSNE(
    n_components=2,
    perplexity=30,
    method='exact',      # O(n²) - exact computation
    random_state=42,
)

# Production-scale t-SNE: PCA first, then Barnes-Hut
def production_tsne(X: np.ndarray, n_pca_dims: int = 50, perplexity: int = 30,
                     random_state: int = 42) -> np.ndarray:
    """
    Production-ready t-SNE:
    1. Standardize
    2. PCA to 50 dims (noise removal + speed)
    3. Barnes-Hut t-SNE
    """
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    print(f"Input shape: {X.shape}")

    # Step 1: Standardize
    X_scaled = StandardScaler().fit_transform(X)

    # Step 2: PCA pre-reduction
    n_pca = min(n_pca_dims, X.shape[1], X.shape[0]-1)
    X_pca = PCA(n_components=n_pca, random_state=random_state).fit_transform(X_scaled)
    print(f"After PCA: {X_pca.shape}")

    # Step 3: t-SNE
    tsne = TSNE(
        n_components=2,
        perplexity=min(perplexity, X.shape[0] // 3),  # perplexity < n/3
        learning_rate='auto',
        n_iter=1000,
        method='barnes_hut',
        random_state=random_state,
        n_jobs=-1,
    )
    X_2d = tsne.fit_transform(X_pca)
    print(f"KL divergence: {tsne.kl_divergence_:.4f}")

    return X_2d

The Critical t-SNE Misconceptions

:::danger t-SNE Distances and Sizes Are Meaningless t-SNE distorts distances to preserve local neighborhood structure. Cluster sizes, distances between clusters, and relative positions of clusters are all arbitrary. You cannot make quantitative claims about distances or proportions from a t-SNE plot. :::

Misconception 1: Cluster size = cluster importance

t-SNE stretches regions of low density and compresses regions of high density. A large cluster in the plot may simply be a region where t-SNE expanded the layout. A tight cluster may be compressed. Neither reflects the actual number of points or spread.

Misconception 2: Distance between clusters = similarity in original space

t-SNE preserves local neighborhood structure but destroys global structure. Two clusters that are far apart in the t-SNE plot may be close in the original space. Two clusters that appear adjacent may be far apart.

# Demonstration: t-SNE distance vs actual distance
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
import numpy as np

def check_distance_preservation(X_orig: np.ndarray, X_2d: np.ndarray,
                                  n_sample: int = 500) -> dict:
    """
    Check how well 2D embedding preserves pairwise distances.
    A good embedding has high correlation (close to 1.0) for PCA,
    but t-SNE will show low correlation - by design.
    """
    rng = np.random.default_rng(42)
    idx = rng.choice(len(X_orig), min(n_sample, len(X_orig)), replace=False)

    d_orig = pdist(X_orig[idx], metric='euclidean')
    d_2d   = pdist(X_2d[idx],   metric='euclidean')

    spearman_r, pvalue = spearmanr(d_orig, d_2d)
    pearson_r = np.corrcoef(d_orig, d_2d)[0, 1]

    # Local vs global preservation
    # For local: only consider pairs within top 10th percentile of proximity
    local_threshold = np.percentile(d_orig, 10)
    local_mask = d_orig < local_threshold

    local_spearman, _ = spearmanr(d_orig[local_mask], d_2d[local_mask])

    print(f"Global Spearman rank correlation: {spearman_r:.3f}")
    print(f"Global Pearson correlation:       {pearson_r:.3f}")
    print(f"Local Spearman (near pairs):      {local_spearman:.3f}")
    print()
    print("Interpretation:")
    print("  Global ~0.3-0.5 for t-SNE: EXPECTED - global structure not preserved")
    print("  Local  ~0.8-0.9 for t-SNE: GOOD - local structure well preserved")
    print("  Global ~0.9+ for PCA: EXPECTED - PCA preserves global distances")

    return {'global_spearman': spearman_r, 'local_spearman': local_spearman}

Misconception 3: One t-SNE run is definitive

t-SNE uses random initialization. Different seeds produce different layouts. A cluster structure is only meaningful if it appears consistently across multiple random seeds.

def check_tsne_stability(X_pca: np.ndarray, y: np.ndarray,
                           n_seeds: int = 5, perplexity: int = 30):
    """Run t-SNE with multiple seeds to check layout stability."""
    fig, axes = plt.subplots(1, n_seeds, figsize=(5*n_seeds, 4))

    for i, seed in enumerate(range(n_seeds)):
        tsne = TSNE(n_components=2, perplexity=perplexity,
                    learning_rate='auto', random_state=seed)
        X_2d = tsne.fit_transform(X_pca)

        axes[i].scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=5, alpha=0.7)
        axes[i].set_title(f"Seed {seed}\nKL={tsne.kl_divergence_:.3f}", fontsize=9)
        axes[i].axis('off')

    plt.suptitle("t-SNE Stability Check (5 random seeds)\n"
                  "Same structures appearing in all runs = real signal", fontsize=11)
    plt.tight_layout()
    plt.show()

UMAP: Faster, More Faithful, Parametric

UMAP (Uniform Manifold Approximation and Projection, McInnes et al. 2018) solves t-SNE's core limitations through a fundamentally different theoretical foundation: Riemannian geometry and topological data analysis.

UMAP Theory: Fuzzy Simplicial Complexes

Where t-SNE uses Gaussian distributions to model local structure, UMAP builds a weighted graph (fuzzy simplicial complex) over the data:

Step 1: Build the high-dimensional graph

For each point $x_i$ , find its $k$ nearest neighbors. Define the weight of the edge $(x_i, x_j)$ as:

$w_{ij}^{\text{high}} = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)$

where $\rho_i$ is the distance to the nearest neighbor (ensures the closest neighbor always has weight 1 - making the representation locally adaptive) and $\sigma_i$ is chosen so the weights sum to $\log_2(k)$ .

The symmetrized graph weight:

$w_{ij} = w_{ij}^{\text{high}} + w_{ji}^{\text{high}} - w_{ij}^{\text{high}} \cdot w_{ji}^{\text{high}}$

Step 2: Build the low-dimensional graph

The low-dimensional edge weights use:

$w_{ij}^{\text{low}} = \left(1 + a \cdot \|y_i - y_j\|^{2b}\right)^{-1}$

Parameters $a$ and $b$ are determined by min_dist (the minimum distance between points in the embedding - smaller values produce tighter clusters).

Step 3: Optimize the embedding

Minimize the cross-entropy between the high-D and low-D graphs (binary cross-entropy over edges):

$\mathcal{L} = \sum_{i \neq j} \left[ w_{ij} \log \frac{w_{ij}}{w_{ij}^{\text{low}}} + (1 - w_{ij}) \log \frac{1 - w_{ij}}{1 - w_{ij}^{\text{low}}} \right]$

Optimization uses stochastic gradient descent with negative sampling (like word2vec) - efficient and $O(n \log n)$ .

Key algorithmic advances over t-SNE:

NNDescent for approximate nearest neighbor graph: $O(n \log n)$ vs $O(n^2)$ for exact NN
Cross-entropy with negative sampling: avoids $O(n^2)$ normalization sum (the "Z" denominator in t-SNE's $q_{ij}$ )
Parametric mode: the encoder is a neural network - new points get a single forward pass

UMAP Key Parameters

Parameter	Effect	Default	Tuning Guide
`n_neighbors`	Size of local neighborhood	15	Small (5): fine local structure; Large (100+): global structure
`min_dist`	Tightness of clusters in 2D	0.1	Small (0.0): tight clusters; Large (0.5): more spread out
`n_components`	Output dimensions	2	2 or 3 for viz; 16–64 for downstream ML
`metric`	Distance in high-D	'euclidean'	'cosine' for embeddings; 'hamming' for binary
`n_epochs`	Optimization steps	auto	More = better layout, slower

# pip install umap-learn
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

digits = load_digits()
X, y = digits.data, digits.target
X_scaled = StandardScaler().fit_transform(X)
X_pca50 = PCA(n_components=50, random_state=42).fit_transform(X_scaled)

# ── Full UMAP parameter exploration ───────────────────────────────────────────
fig, axes = plt.subplots(2, 3, figsize=(18, 11))

configs = [
    {'n_neighbors': 5,  'min_dist': 0.1,  'label': 'n_neighbors=5\n(very local)'},
    {'n_neighbors': 15, 'min_dist': 0.1,  'label': 'n_neighbors=15\n(default)'},
    {'n_neighbors': 50, 'min_dist': 0.1,  'label': 'n_neighbors=50\n(more global)'},
    {'n_neighbors': 15, 'min_dist': 0.0,  'label': 'min_dist=0.0\n(tight clusters)'},
    {'n_neighbors': 15, 'min_dist': 0.5,  'label': 'min_dist=0.5\n(spread out)'},
    {'n_neighbors': 15, 'min_dist': 0.99, 'label': 'min_dist=0.99\n(almost uniform)'},
]

for ax, cfg in zip(axes.flat, configs):
    reducer = umap.UMAP(
        n_components=2,
        n_neighbors=cfg['n_neighbors'],
        min_dist=cfg['min_dist'],
        random_state=42,
        n_jobs=-1,
    )
    X_2d = reducer.fit_transform(X_pca50)
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=8, alpha=0.7)
    ax.set_title(cfg['label'], fontsize=11)
    ax.axis('off')

plt.suptitle("UMAP Hyperparameter Effects on MNIST Digits", fontsize=14)
plt.tight_layout()
plt.show()

UMAP with Non-Euclidean Metrics

One of UMAP's major advantages over t-SNE: it natively supports non-Euclidean distance metrics.

import umap
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

# Text data: 20 newsgroups - use cosine distance (text-appropriate)
newsgroups = fetch_20newsgroups(subset='train', categories=[
    'sci.space', 'rec.sport.hockey', 'talk.politics.guns', 'comp.graphics'
], remove=('headers', 'footers', 'quotes'), random_state=42)

# TF-IDF features
tfidf = TfidfVectorizer(max_features=5000, stop_words='english', sublinear_tf=True)
X_tfidf = tfidf.fit_transform(newsgroups.data)  # sparse matrix

# UMAP with cosine distance (correct for sparse text features)
reducer_cosine = umap.UMAP(
    n_components=2,
    metric='cosine',       # correct for text - invariant to document length
    n_neighbors=20,
    min_dist=0.1,
    random_state=42,
    n_jobs=-1,
    verbose=True,
)
# UMAP accepts sparse matrices natively
X_2d_text = reducer_cosine.fit_transform(X_tfidf)

fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(X_2d_text[:, 0], X_2d_text[:, 1],
                      c=newsgroups.target, cmap='tab10', s=5, alpha=0.7)
plt.colorbar(scatter, ax=ax, label='Category')
ax.set_title("UMAP of 20 Newsgroups (cosine distance, TF-IDF)", fontsize=12)
ax.axis('off')
plt.tight_layout()
plt.show()

# ── Other useful metrics ───────────────────────────────────────────────────────
# 'hamming'  - binary features (bag-of-words presence/absence)
# 'jaccard'  - set similarity (tags, categories)
# 'manhattan' - L1 - more robust to outliers than L2 for tabular data
# 'precomputed' - pass a precomputed distance matrix

Parametric UMAP: Transform New Points

Unlike t-SNE, UMAP learns a mapping that can be applied to new points. There are two modes:

import umap
import numpy as np

# ── Mode 1: Standard UMAP - fit once, transform later ──────────────────────────
reducer = umap.UMAP(n_components=2, random_state=42)
reducer.fit(X_train_pca)

# Transform training data
X_train_2d = reducer.transform(X_train_pca)

# Transform new test data - consistent with training layout!
X_test_2d  = reducer.transform(X_test_pca)

print("Can transform new points:", True)

# ── Mode 2: Parametric UMAP - neural network encoder ──────────────────────────
# Trains a neural network as the embedding function
# Enables very fast inference on new points (single forward pass)
from umap.parametric_umap import ParametricUMAP

param_reducer = ParametricUMAP(
    n_components=2,
    n_neighbors=15,
    min_dist=0.1,
    n_training_epochs=10,   # number of epochs to train the encoder network
    encoder=None,           # auto-builds MLP: input_dim → 100 → 100 → n_components
    verbose=True,
)
param_reducer.fit(X_train_pca)

X_train_2d = param_reducer.transform(X_train_pca)   # batch forward pass
X_test_2d  = param_reducer.transform(X_test_pca)    # fast inference

# Check consistency: same points should map to same location
X_train_2d_check = param_reducer.transform(X_train_pca[:10])
# Standard UMAP: X_train_2d_check ≈ X_train_2d[:10] (deterministic)
# Parametric UMAP: X_train_2d_check == X_train_2d[:10] (exact, it's a function)

UMAP for Dimensionality Reduction (Not Just Visualization)

UMAP at higher dimensions ( $k = 16, 32, 64$ ) is a powerful general-purpose dimensionality reduction technique - not just for visualization:

import umap
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import numpy as np

def compare_reduction_methods(X_train, y_train, X_test, y_test, target_dim=32):
    """Compare PCA vs UMAP for downstream classification."""
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline

    results = {}

    # Baseline: raw features
    lr_raw = LogisticRegression(max_iter=1000, random_state=42)
    lr_raw.fit(X_train, y_train)
    results['raw'] = accuracy_score(y_test, lr_raw.predict(X_test))

    # PCA reduction
    pca_pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=target_dim, random_state=42)),
        ('lr', LogisticRegression(max_iter=1000, random_state=42)),
    ])
    pca_pipe.fit(X_train, y_train)
    results[f'pca_{target_dim}'] = accuracy_score(y_test, pca_pipe.predict(X_test))

    # UMAP reduction
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled  = scaler.transform(X_test)

    reducer = umap.UMAP(
        n_components=target_dim,
        n_neighbors=30,
        min_dist=0.0,          # tight clusters: good for downstream classification
        metric='euclidean',
        random_state=42,
    )
    X_train_umap = reducer.fit_transform(X_train_scaled)
    X_test_umap  = reducer.transform(X_test_scaled)

    lr_umap = LogisticRegression(max_iter=1000, random_state=42)
    lr_umap.fit(X_train_umap, y_train)
    results[f'umap_{target_dim}'] = accuracy_score(y_test, lr_umap.predict(X_test_umap))

    for method, acc in results.items():
        print(f"{method:20s}: {acc:.4f}")

    return results

t-SNE vs UMAP: Flow and Decision

Production: Embedding Space Monitoring

import umap
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp

class EmbeddingDriftMonitor:
    """
    Monitor embedding space drift in production using UMAP.
    Fits once on baseline, transforms new batches consistently.
    """

    def __init__(self, n_neighbors: int = 15, min_dist: float = 0.1,
                 drift_threshold: float = 0.5):
        self.reducer = umap.UMAP(
            n_components=2,
            n_neighbors=n_neighbors,
            min_dist=min_dist,
            random_state=42
        )
        self.drift_threshold = drift_threshold
        self.baseline_2d_ = None
        self.baseline_stats_ = None

    def fit_baseline(self, embeddings: np.ndarray,
                      labels: np.ndarray = None) -> 'EmbeddingDriftMonitor':
        """Fit UMAP reducer on baseline (training/reference) embeddings."""
        self.reducer.fit(embeddings)
        self.baseline_2d_ = self.reducer.transform(embeddings)
        self.baseline_labels_ = labels

        # Compute reference statistics for drift detection
        self.baseline_stats_ = {
            'mean': self.baseline_2d_.mean(axis=0),
            'std': self.baseline_2d_.std(axis=0),
            'x_dist': self.baseline_2d_[:, 0],
            'y_dist': self.baseline_2d_[:, 1],
        }
        print(f"Baseline fitted: {embeddings.shape[0]} samples")
        print(f"2D bounds: x=[{self.baseline_2d_[:,0].min():.2f}, "
              f"{self.baseline_2d_[:,0].max():.2f}], "
              f"y=[{self.baseline_2d_[:,1].min():.2f}, "
              f"{self.baseline_2d_[:,1].max():.2f}]")
        return self

    def check_drift(self, new_embeddings: np.ndarray,
                     period_label: str = "Production") -> dict:
        """
        Transform new embeddings and check for distribution drift.

        Returns drift metrics and raises alert if drift exceeds threshold.
        """
        new_2d = self.reducer.transform(new_embeddings)

        # Centroid shift
        centroid_shift = np.linalg.norm(
            new_2d.mean(axis=0) - self.baseline_stats_['mean']
        )

        # KS test on marginal distributions
        ks_x = ks_2samp(self.baseline_stats_['x_dist'], new_2d[:, 0])
        ks_y = ks_2samp(self.baseline_stats_['y_dist'], new_2d[:, 1])

        # Coverage: fraction of new points outside baseline bounding box + margin
        x_min, x_max = self.baseline_2d_[:,0].min(), self.baseline_2d_[:,0].max()
        y_min, y_max = self.baseline_2d_[:,1].min(), self.baseline_2d_[:,1].max()
        margin = 0.2 * max(x_max - x_min, y_max - y_min)

        out_of_distribution = (
            (new_2d[:, 0] < x_min - margin) |
            (new_2d[:, 0] > x_max + margin) |
            (new_2d[:, 1] < y_min - margin) |
            (new_2d[:, 1] > y_max + margin)
        ).mean()

        metrics = {
            'centroid_shift': centroid_shift,
            'ks_stat_x': ks_x.statistic,
            'ks_stat_y': ks_y.statistic,
            'ks_pvalue_x': ks_x.pvalue,
            'ks_pvalue_y': ks_y.pvalue,
            'out_of_distribution_rate': out_of_distribution,
        }

        drift_detected = (
            centroid_shift > self.drift_threshold or
            ks_x.pvalue < 0.01 or ks_y.pvalue < 0.01 or
            out_of_distribution > 0.05
        )

        if drift_detected:
            print(f"DRIFT ALERT for {period_label}:")
            print(f"  Centroid shift: {centroid_shift:.4f} (threshold={self.drift_threshold})")
            print(f"  KS p-values: x={ks_x.pvalue:.4f}, y={ks_y.pvalue:.4f}")
            print(f"  OOD rate: {out_of_distribution:.1%}")
        else:
            print(f"{period_label}: No significant drift detected")

        return metrics, new_2d

    def plot_drift(self, new_2d: np.ndarray, period_label: str = "Production"):
        fig, axes = plt.subplots(1, 2, figsize=(16, 7))

        # Baseline
        if hasattr(self, 'baseline_labels_') and self.baseline_labels_ is not None:
            scatter = axes[0].scatter(
                self.baseline_2d_[:, 0], self.baseline_2d_[:, 1],
                c=self.baseline_labels_, cmap='tab10', s=5, alpha=0.6
            )
        else:
            axes[0].scatter(self.baseline_2d_[:, 0], self.baseline_2d_[:, 1],
                             s=5, alpha=0.5, color='#3b82f6')
        axes[0].set_title("Baseline Embedding Space", fontsize=12)
        axes[0].axis('off')

        # Overlay: baseline (gray) + new points (red)
        axes[1].scatter(self.baseline_2d_[:, 0], self.baseline_2d_[:, 1],
                         s=3, alpha=0.2, color='#94a3b8', label='Baseline')
        axes[1].scatter(new_2d[:, 0], new_2d[:, 1],
                         s=15, alpha=0.7, color='#ef4444', label=period_label)
        axes[1].set_title(f"Drift: Baseline vs {period_label}", fontsize=12)
        axes[1].legend()
        axes[1].axis('off')

        plt.tight_layout()
        plt.show()

:::tip UMAP for Production, t-SNE for Exploration Use UMAP in production: it can transform new points, runs in $O(n \log n)$ , and handles millions of samples. Use t-SNE for one-time exploratory visualizations on static datasets where you want the cleanest possible cluster separation. Always run t-SNE on PCA-pre-reduced data (50 dimensions) for speed and noise reduction. :::

:::warning Do Not Use t-SNE or UMAP Coordinates as Features Never feed t-SNE or UMAP 2D coordinates into a downstream classifier. t-SNE is non-parametric - you cannot transform new points reproducibly. UMAP coordinates are stable for the fitted model but the 2D space is not semantically meaningful for classification (distances are distorted). Use PCA, raw embeddings, or UMAP at higher dimensions ( $k \geq 16$ , min_dist=0.0) for feature extraction. :::

YouTube Resources

Channel	Video Title	Why Watch
StatQuest with Josh Starmer	t-SNE, Clearly Explained	Best visual walkthrough of perplexity and the algorithm steps
ritvikmath	t-SNE Explained	Math derivation of the KL objective and gradient
Leland McInnes (UMAP author)	UMAP: Theory and Practice	First-principles explanation of the topological foundation
Two Minute Papers	UMAP paper overview	Quick intuition for the crowding problem vs UMAP approach

Interview Q&A

Q1: Explain the crowding problem in SNE and how t-SNE solves it.

In 10 or more dimensions, the volume available at moderate distances grows much faster than in 2D. When you project high-D data to 2D using a Gaussian similarity in both spaces, the 2D space cannot accommodate all the moderately-distant neighbors - they all crowd into the center of the plot, obliterating local structure. t-SNE replaces the low-dimensional Gaussian with a Student t-distribution (1 degree of freedom), which has a heavy tail: its density at distance $r$ falls off as $(1+r^2)^{-1}$ instead of $e^{-r^2}$ . The heavy tail gives moderately-distant points more room to spread apart, resolving the crowding. The mismatch between the narrow Gaussian (high-D) and the heavy-tailed t-distribution (low-D) is intentional - it produces clear cluster separation.

Q2: What does the perplexity parameter in t-SNE actually control? What happens at extreme values?

Perplexity controls the effective number of neighbors each point considers. Mathematically, for each point $x_i$ , the bandwidth $\sigma_i$ is chosen so that the entropy of the conditional distribution $P_i$ satisfies $\text{perplexity} = 2^{H(P_i)}$ . This makes the bandwidth adaptive: dense regions get smaller $\sigma_i$ (tight neighborhoods), sparse regions get larger $\sigma_i$ (broader neighborhoods). At perplexity 5: only immediate neighbors matter - produces fragmented microclusters that may split what should be one cluster. At perplexity 200: each point considers many neighbors - produces smooth global layout but loses fine-grained local structure. Start with perplexity 30, then explore 5, 15, 50, and 100. Clusters that appear stable across all perplexities are real.

Q3: A colleague reports that cluster A is three times larger than cluster B in the t-SNE plot and concludes that class A is three times more common. What is wrong?

Cluster sizes in t-SNE are meaningless. t-SNE stretches regions of low density and compresses regions of high density to preserve local neighborhood structure. A cluster's area in the 2D plot does not correspond to the number of points, the spread in original space, or any other quantitative property. To assess cluster size, count the data points assigned to each cluster by running a separate algorithm (K-means, DBSCAN, or a distance threshold) on the original high-dimensional embeddings. Never use t-SNE geometrical properties for quantitative conclusions - only for qualitative pattern identification.

Q4: How does UMAP differ from t-SNE algorithmically? When would you choose UMAP over t-SNE?

Algorithmically, t-SNE models similarities as Gaussians (high-D) and t-distributions (low-D) and minimizes KL divergence; it runs in $O(n^2)$ naively or $O(n \log n)$ with Barnes-Hut, and cannot transform new points. UMAP builds a weighted graph using fuzzy topological methods (local distances normalized by nearest-neighbor distance, symmetrized), then minimizes cross-entropy between the high-D and low-D graphs using negative sampling - $O(n \log n)$ via NNDescent. Choose UMAP over t-SNE when: (1) $n > 50{,}000$ - t-SNE becomes too slow, (2) you need to transform new data points at inference, (3) you want to use the reduction for downstream ML (not just visualization), (4) you need a non-Euclidean metric (cosine, Hamming, precomputed). Choose t-SNE when: you want the very clearest cluster separation for a publication figure on a static dataset.

Q5: Why can't you use t-SNE coordinates as input features for a downstream classifier?

t-SNE is non-parametric - it optimizes 2D positions for a specific fixed dataset. Given a new point, there is no function to apply. You must include the new point in the dataset and rerun the entire optimization from scratch - but then the entire layout changes, so you cannot compare to the old embedding or train a classifier on one run's coordinates and predict on another. Additionally, the 2D coordinates have no semantic meaning: the same real-world concept may land in completely different positions in two different t-SNE runs. For production feature extraction, use PCA (parametric, stable), UMAP's standard fit/transform (stable for the fitted model), or UMAP at higher dimensions.

Q6: How would you use UMAP in a production ML monitoring system to detect concept drift?

Fit UMAP on a representative baseline dataset of model input embeddings (e.g., the first month of production inference). Each week, transform new production embeddings using the fitted reducer. Apply three drift detectors: (1) centroid shift - if the mean 2D position of this week's points shifts significantly from baseline ( $> 0.5$ standard deviations), distribution shift may be occurring; (2) KS test on marginal distributions of the 2D coordinates - $p < 0.01$ indicates significant change; (3) out-of-distribution rate - what fraction of new points fall outside the bounding box of baseline points (with some margin), indicating novel input types. Alert when any threshold is exceeded. This is more interpretable than MMD or statistical tests on raw high-dimensional embeddings because you can visualize which part of the embedding space the new points are landing in.

Comparison Table: PCA vs t-SNE vs UMAP

Property	PCA	t-SNE	UMAP
Type	Linear	Non-linear	Non-linear
Preserves global structure	Yes	No	Better than t-SNE
Preserves local structure	Partially	Very well	Very well
Parametric (transform new pts)	Yes	No	Yes (standard mode)
Speed	$O(nd\min(n,d))$	$O(n^2)$ / $O(n\log n)$	$O(n\log n)$
Max practical $n$	Millions	~100K (BH)	Millions
Reproducible	Yes	No (random init)	More stable
Cluster sizes meaningful	Yes	No	Slightly more
Distances meaningful	Yes	No	Better than t-SNE
Non-Euclidean metrics	No (linear only)	Yes	Yes
Supports sparse input	Yes (TruncatedSVD)	No	Yes
Higher-dim output for ML	Yes	Not useful	Yes ( $k \geq 16$ )
Best for	Production reduction	Publication figures	Production monitoring

Common t-SNE and UMAP Pitfalls

# ── PITFALL 1: Not using PCA pre-reduction before t-SNE ───────────────────────
# WRONG: t-SNE on raw 768-dim embeddings - slow and noisy
tsne_bad = TSNE(n_components=2, perplexity=30, random_state=42)
X_bad = tsne_bad.fit_transform(X_raw_768dim)  # O(n² × 768) - very slow

# RIGHT: PCA first, then t-SNE
X_pca50 = PCA(n_components=50, random_state=42).fit_transform(
    StandardScaler().fit_transform(X_raw_768dim)
)
X_good = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_pca50)

# ── PITFALL 2: Only one t-SNE run ─────────────────────────────────────────────
# WRONG: trust a single run
X_single = TSNE(perplexity=30, random_state=42).fit_transform(X_pca50)
# RIGHT: run multiple seeds, verify cluster stability
for seed in [42, 123, 7, 99, 555]:
    X_seed = TSNE(perplexity=30, random_state=seed).fit_transform(X_pca50[:500])
    # Visually verify the same 10 digit clusters appear in each run

# ── PITFALL 3: UMAP min_dist=0 for all tasks ─────────────────────────────────
# min_dist=0 creates very tight clusters - great for cluster visualization
# but can create artificial structure in noisy data
reducer_viz = umap.UMAP(n_components=2, min_dist=0.0, n_neighbors=15)  # visualization
reducer_ml  = umap.UMAP(n_components=32, min_dist=0.1, n_neighbors=30)  # ML features

# ── PITFALL 4: Comparing embeddings from different t-SNE runs ─────────────────
# WRONG: train classifier on t-SNE run 1, evaluate on run 2
X_run1 = TSNE(random_state=42).fit_transform(X_pca50)
X_run2 = TSNE(random_state=99).fit_transform(X_pca50)
# The coordinate spaces are completely unrelated - any classifier trained on run1
# will produce meaningless predictions on run2 coordinates.

# RIGHT: use UMAP which is consistent across fit() and transform() calls
import umap
reducer = umap.UMAP(random_state=42)
reducer.fit(X_train_pca)
X_train_2d = reducer.transform(X_train_pca)   # consistent
X_test_2d  = reducer.transform(X_test_pca)    # consistent with training layout

# ── PITFALL 5: Wrong metric for text embeddings ───────────────────────────────
# WRONG: Euclidean distance for text embeddings (biased by vector magnitude)
reducer_bad = umap.UMAP(metric='euclidean')    # magnitude dominates

# RIGHT: cosine distance for normalized embeddings
reducer_good = umap.UMAP(metric='cosine')      # direction matters, not magnitude
# Or: L2-normalize first, then Euclidean (equivalent to cosine for normalized vecs)
X_normalized = X_embeddings / np.linalg.norm(X_embeddings, axis=1, keepdims=True)
reducer_ok = umap.UMAP(metric='euclidean')     # works correctly on normalized data

Practical Workflow: From Raw Embeddings to Production Visualization

import numpy as np
import umap
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize
import matplotlib.pyplot as plt

def embedding_visualization_pipeline(
    embeddings: np.ndarray,
    labels: np.ndarray = None,
    label_names: dict = None,
    embedding_type: str = 'dense',   # 'dense' or 'text'
    n_pca_dims: int = 50,
    method: str = 'umap',            # 'umap' or 'tsne'
    n_neighbors: int = 15,
    min_dist: float = 0.1,
    perplexity: int = 30,
    random_state: int = 42,
    title: str = "Embedding Visualization",
) -> tuple:
    """
    Production-ready pipeline: embeddings → 2D visualization.

    Steps:
    1. Normalize (if text embeddings)
    2. PCA pre-reduction (if n_dims > n_pca_dims)
    3. t-SNE or UMAP to 2D
    4. Visualization with labels and centroids
    """
    print(f"Input shape: {embeddings.shape}")

    # Step 1: Normalize for text/sentence embeddings
    if embedding_type == 'text':
        embeddings = normalize(embeddings, norm='l2')
        print("Applied L2 normalization (text embeddings)")

    # Step 2: PCA pre-reduction
    if embeddings.shape[1] > n_pca_dims:
        pca = PCA(n_components=n_pca_dims, svd_solver='randomized',
                   random_state=random_state)
        if embedding_type != 'text':
            embeddings = StandardScaler().fit_transform(embeddings)
        embeddings_pca = pca.fit_transform(embeddings)
        print(f"PCA: {embeddings.shape[1]}D → {n_pca_dims}D "
              f"({pca.explained_variance_ratio_.sum():.1%} variance)")
    else:
        embeddings_pca = embeddings

    # Step 3: 2D projection
    if method == 'umap':
        metric = 'cosine' if embedding_type == 'text' else 'euclidean'
        reducer = umap.UMAP(
            n_components=2,
            n_neighbors=n_neighbors,
            min_dist=min_dist,
            metric=metric,
            random_state=random_state,
            n_jobs=-1,
        )
        X_2d = reducer.fit_transform(embeddings_pca)
        print(f"UMAP: {embeddings_pca.shape[1]}D → 2D")

    else:  # tsne
        from sklearn.manifold import TSNE
        tsne = TSNE(
            n_components=2,
            perplexity=min(perplexity, len(embeddings) // 3),
            learning_rate='auto',
            random_state=random_state,
            n_jobs=-1,
        )
        X_2d = tsne.fit_transform(embeddings_pca)
        print(f"t-SNE: {embeddings_pca.shape[1]}D → 2D, "
              f"KL={tsne.kl_divergence_:.4f}")

    # Step 4: Visualization
    fig, ax = plt.subplots(figsize=(12, 9))

    if labels is not None:
        unique_labels = np.unique(labels)
        cmap = plt.cm.tab20 if len(unique_labels) > 10 else plt.cm.tab10
        colors = cmap(np.linspace(0, 1, len(unique_labels)))

        for i, (label, color) in enumerate(zip(unique_labels, colors)):
            mask = labels == label
            lname = label_names.get(label, str(label)) if label_names else str(label)
            ax.scatter(X_2d[mask, 0], X_2d[mask, 1], color=color,
                        s=8, alpha=0.6, label=lname)

            # Plot centroid
            centroid = X_2d[mask].mean(axis=0)
            ax.annotate(lname, centroid, fontsize=9, fontweight='bold',
                        ha='center', va='center',
                        bbox=dict(boxstyle='round,pad=0.3', facecolor='white',
                                   alpha=0.8, edgecolor=color))

        ax.legend(markerscale=2, bbox_to_anchor=(1.01, 1), loc='upper left',
                   fontsize=9, title="Classes")
    else:
        ax.scatter(X_2d[:, 0], X_2d[:, 1], s=5, alpha=0.5, color='#3b82f6')

    ax.set_title(title, fontsize=14)
    ax.axis('off')
    plt.tight_layout()
    plt.show()

    return X_2d, reducer if method == 'umap' else None

:::tip 🎮 Interactive Playground

Visualize this concept: Try the t-SNE & UMAP demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why This Exists - The Crowding Problem​

SNE: The Foundation​

t-SNE: Replacing the Low-Dimensional Gaussian​

The Perplexity Hyperparameter​

Barnes-Hut Approximation: Making t-SNE Scalable​

The Critical t-SNE Misconceptions​

Misconception 1: Cluster size = cluster importance​

Misconception 2: Distance between clusters = similarity in original space​

Misconception 3: One t-SNE run is definitive​

UMAP: Faster, More Faithful, Parametric​

UMAP Theory: Fuzzy Simplicial Complexes​

UMAP Key Parameters​

UMAP with Non-Euclidean Metrics​

Parametric UMAP: Transform New Points​

UMAP for Dimensionality Reduction (Not Just Visualization)​

t-SNE vs UMAP: Flow and Decision​

Production: Embedding Space Monitoring​

YouTube Resources​

Interview Q&A​

Comparison Table: PCA vs t-SNE vs UMAP​

Common t-SNE and UMAP Pitfalls​

Practical Workflow: From Raw Embeddings to Production Visualization​