Skip to main content

Dimensionality Reduction - Taming the Curse of High Dimensions

Reading time: ~30 min | Interview relevance: High | Roles: MLE, Data Scientist, AI Engineer, Research Scientist

The Real Interview Moment

The interviewer asks: "You have a dataset with 50,000 features from gene expression data and only 200 patient samples. Your model is performing worse than a simple baseline. What's going on, and how would you approach this?"

You immediately recognize the curse of dimensionality - with 50,000 features and 200 samples, the model has far more parameters to learn than data points to learn from. The feature space is so vast that any two points are nearly equidistant, making it impossible for distance-based algorithms to find meaningful structure. But the interviewer doesn't just want you to say "use PCA." They want you to reason about which dimensionality reduction approach to use, how many dimensions to keep, and whether you should select features or extract new ones.

This question tests whether you understand the mathematical foundations of dimensionality reduction and can make principled decisions about when and how to apply it.

What You Will Master

  • The curse of dimensionality - why it matters and how it manifests
  • PCA: mathematical derivation, eigenvalue interpretation, choosing components
  • t-SNE: how it works, limitations, and proper usage
  • UMAP: advantages over t-SNE, when to choose it
  • Autoencoders for non-linear dimensionality reduction
  • Feature selection vs. feature extraction - when to use each
  • Practical decision framework for choosing the right method
  • Explained variance and the elbow method for component selection

Self-Assessment: Where Are You Now?

LevelDescriptionTarget
Beginner"I know PCA reduces dimensions somehow"Read all parts, focus on PCA math
Intermediate"I can explain PCA and have used t-SNE for visualization"Focus on Parts 2-3, decision flowchart, practice problems
Advanced"I know the methods but want to nail the mathematical proofs and tradeoffs"Jump to PCA derivation, UMAP vs t-SNE, and practice problems

Part 1 - The Curse of Dimensionality

Why High Dimensions Break Everything

As dimensionality increases, several counterintuitive phenomena emerge:

1. Volume concentrates at the edges

In a d-dimensional unit hypercube, the fraction of volume within distance epsilon of the surface is:

1(12ϵ)d1 - (1 - 2\epsilon)^d

For d=100 and epsilon=0.01: 10.98100=10.133=0.8671 - 0.98^{100} = 1 - 0.133 = 0.867. Over 86% of the volume is within 1% of the surface. Almost all points are "edge cases."

2. Distances become meaningless

As dimensions grow, the ratio of the nearest to farthest neighbor distance converges to 1:

limddmaxdmindmin0\lim_{d \to \infty} \frac{d_{max} - d_{min}}{d_{min}} \to 0

When all points are roughly equidistant, algorithms that rely on distance (k-NN, SVM, clustering) lose discriminative power.

3. Data becomes sparse

To maintain the same density of data points in a d-dimensional space, you need exponentially more data. Going from 10 samples per unit in 1D to 10D requires 101010^{10} samples to maintain the same density.

60-Second Answer

"The curse of dimensionality means that as features increase, the data becomes increasingly sparse in the feature space. Distances between points converge, making similarity-based algorithms ineffective. You need exponentially more data to maintain statistical significance. Dimensionality reduction addresses this by projecting data into a lower-dimensional space that preserves the most important structure - either through feature selection (picking the best existing features) or feature extraction (creating new features as combinations of originals). PCA is the most common extraction method; it finds orthogonal directions of maximum variance."

When Dimensionality Reduction Helps

ScenarioBenefit
Features >> samples (p >> n)Prevents overfitting, enables learning
VisualizationProject to 2D/3D for human understanding
Noise reductionRemove low-variance noise dimensions
Computational efficiencyFewer features = faster training and inference
MulticollinearityPCA produces uncorrelated features
Storage and bandwidthCompressed representations for production

When It Hurts

ScenarioProblem
Interpretability requiredExtracted features are hard to explain
All features are importantReduction discards signal
Non-linear relationshipsLinear methods (PCA) miss curved structure
Downstream task suffersReduced representation loses task-relevant info

Part 2 - PCA (Principal Component Analysis)

The Mathematical Foundation

Goal: Find a linear projection that maximizes variance in the projected space.

Setup: Given data matrix XRn×dX \in \mathbb{R}^{n \times d} (n samples, d features), centered so each column has mean 0.

Step 1: Compute the covariance matrix

C=1n1XTXRd×dC = \frac{1}{n-1} X^T X \in \mathbb{R}^{d \times d}

Step 2: Eigendecomposition

C=VΛVTC = V \Lambda V^T

Where:

  • V=[v1,v2,...,vd]V = [v_1, v_2, ..., v_d] - eigenvectors (principal components), orthonormal
  • Λ=diag(λ1,λ2,...,λd)\Lambda = \text{diag}(\lambda_1, \lambda_2, ..., \lambda_d) - eigenvalues, sorted λ1λ2...λd\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_d

Step 3: Project onto top k components

Z=XVkRn×kZ = X V_k \in \mathbb{R}^{n \times k}

Where Vk=[v1,...,vk]V_k = [v_1, ..., v_k] is the matrix of the top k eigenvectors.

Why this works: The first eigenvector v1v_1 is the direction of maximum variance. The second v2v_2 is the direction of maximum variance orthogonal to v1v_1. And so on. By keeping the top k, you preserve the most "informative" directions and discard noise.

import numpy as np
from sklearn.decomposition import PCA

# Manual PCA
X_centered = X - X.mean(axis=0)
cov_matrix = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort by eigenvalue (descending)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Project onto top k components
k = 50
X_reduced = X_centered @ eigenvectors[:, :k]

# Using scikit-learn (equivalent, but also handles SVD internally)
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)
Interviewer's Perspective

"Walk me through PCA step by step." I want to hear: (1) center the data, (2) compute covariance matrix, (3) eigendecomposition, (4) project onto top eigenvectors. Bonus points for mentioning SVD as the practical implementation (more numerically stable than eigendecomposition) and explaining that eigenvalues represent variance captured by each component.

SVD vs. Eigendecomposition

In practice, PCA is computed via SVD (Singular Value Decomposition) rather than eigendecomposition:

X=UΣVTX = U \Sigma V^T

Where:

  • URn×nU \in \mathbb{R}^{n \times n} - left singular vectors
  • ΣRn×d\Sigma \in \mathbb{R}^{n \times d} - singular values (diagonal)
  • VRd×dV \in \mathbb{R}^{d \times d} - right singular vectors = principal components

The relationship: λi=σi2n1\lambda_i = \frac{\sigma_i^2}{n-1} (eigenvalues = squared singular values / (n-1))

Why SVD?

  • Numerically more stable than computing XTXX^T X (which squares condition numbers)
  • Can use truncated SVD for efficiency (only compute top k components)
  • Works for non-square matrices

Explained Variance and Choosing k

The explained variance ratio for component i:

EVRi=λij=1dλj\text{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}

The elbow method: Plot cumulative explained variance vs. number of components. Choose k at the "elbow" - where adding more components gives diminishing returns.

pca = PCA()
pca.fit(X)

# Cumulative explained variance
cumulative_var = np.cumsum(pca.explained_variance_ratio_)

# Find k for 95% variance retained
k_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"Components for 95% variance: {k_95}")

# Common thresholds
# 90% - aggressive reduction, might lose some signal
# 95% - good balance for most applications
# 99% - conservative, minimal information loss
Common Trap

"I'll keep components until I reach 95% explained variance." This is a reasonable heuristic, but it assumes that high-variance directions are the most informative for your task. In some cases, the signal lives in low-variance directions. For example, in a dataset where all features vary widely due to noise but the class-discriminative signal is subtle, PCA might discard the very dimensions that matter. Always validate with downstream task performance.

PCA Assumptions and Limitations

AssumptionImplication
LinearityPCA finds linear projections; misses curved/manifold structure
Variance = importanceMaximizes variance, but high variance might be noise
OrthogonalityComponents are orthogonal; real structure might not be
Gaussian-like distributionWorks best when data is roughly Gaussian; skewed data may need transformation first
Centered dataPCA subtracts the mean; if mean is informative, this is lost
Scale sensitivityFeatures must be standardized; otherwise large-scale features dominate
Instant Rejection

"I ran PCA on my features without standardizing them first." If feature A ranges from 0-1000 and feature B ranges from 0-1, PCA will be dominated by feature A regardless of feature B's importance. Always StandardScaler().fit_transform(X) before PCA, unless all features are on the same scale.

Kernel PCA

For non-linear relationships, Kernel PCA applies PCA in a high-dimensional feature space induced by a kernel function, without explicitly computing the transformation.

from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=50, kernel='rbf', gamma=0.01)
X_reduced = kpca.fit_transform(X)

Kernels:

  • linear - equivalent to standard PCA
  • rbf - captures non-linear structure; gamma controls locality
  • poly - polynomial relationships; degree parameter

Part 3 - Visualization Methods: t-SNE and UMAP

t-SNE (t-distributed Stochastic Neighbor Embedding)

Purpose: Primarily for visualization - projects high-dimensional data to 2D/3D while preserving local neighborhood structure.

How it works (intuition):

  1. Compute pairwise similarities in high-dimensional space using Gaussian kernels
  2. Initialize random 2D positions
  3. Compute pairwise similarities in 2D using Student's t-distribution (heavy tails)
  4. Minimize the KL divergence between high-D and low-D similarity distributions using gradient descent

Why t-distribution in low-D? The "crowding problem" - in 2D, there's less room for moderately distant points. The heavy-tailed t-distribution allows moderately similar points to be farther apart in 2D without being penalized, creating well-separated clusters.

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, learning_rate='auto',
n_iter=1000, random_state=42)
X_2d = tsne.fit_transform(X)

Critical parameters:

  • perplexity (5-50): Roughly how many neighbors each point considers. Low perplexity = tight, many clusters. High perplexity = broader structure visible. Try multiple values.
  • learning_rate: Too low = compressed ball. Too high = uniform ball. 'auto' is usually fine.
  • n_iter: At least 1000. Premature stopping produces artifacts.
Common Trap

The most common t-SNE misinterpretation mistakes:

  1. Cluster sizes are meaningless. t-SNE expands dense clusters and contracts sparse ones. You cannot compare cluster sizes.
  2. Distances between clusters are meaningless. Two well-separated clusters in t-SNE might be close in the original space and vice versa.
  3. Multiple runs produce different layouts. t-SNE is non-convex; different random initializations give different results. Always run multiple times.
  4. Don't use it for downstream ML. t-SNE is for visualization only - the mapping is non-parametric and can't be applied to new data.

UMAP (Uniform Manifold Approximation and Projection)

Purpose: Visualization AND dimensionality reduction for downstream tasks.

How it works (intuition):

  1. Build a weighted k-nearest-neighbor graph in high-dimensional space
  2. Optimize a low-dimensional layout to preserve the graph structure
  3. Uses cross-entropy loss between high-D and low-D fuzzy topological representations

Advantages over t-SNE:

  • Much faster - scales to millions of points (t-SNE is O(n^2) without approximations)
  • Preserves more global structure - cluster-to-cluster relationships are more meaningful
  • Can embed to higher dimensions - not limited to 2D/3D; use for actual dimensionality reduction
  • Parametric version available - can transform new unseen data
import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
metric='euclidean', random_state=42)
X_2d = reducer.fit_transform(X)

# For dimensionality reduction (not just visualization)
reducer_50d = umap.UMAP(n_components=50, random_state=42)
X_reduced = reducer_50d.fit_transform(X_train)
X_test_reduced = reducer_50d.transform(X_test) # Can transform new data!

Critical parameters:

  • n_neighbors (5-50): Controls local vs. global structure. Low = fine detail, may fragment. High = broader patterns, may merge clusters.
  • min_dist (0.0-0.99): How tightly points are packed. 0.0 = tight clusters. 0.5 = spread out.
  • metric: euclidean, cosine, manhattan, etc. Choose based on your data type.

Method Selection Flowchart

Dimensionality Reduction Method Selection

Head-to-Head Comparison

FeaturePCAt-SNEUMAP
TypeLinearNon-linearNon-linear
SpeedFast O(nd^2)Slow O(n^2)Fast O(n log n)
ScalabilityMillions of samples<50K practicalMillions of samples
Global structurePreservedNot preservedPartially preserved
Local structureApproximateExcellentExcellent
New dataYes (projection)No (re-run needed)Yes (transform)
Target dimensionsAny k2-3 onlyAny k
DeterministicYesNoNo
InterpretableYes (loadings)NoNo
Use for MLYesNoYes
Company Variation

Google: Heavy use of PCA for feature preprocessing in large-scale ML systems. Random projections (Johnson-Lindenstrauss) for approximate dimensionality reduction at scale.

Meta: UMAP for embedding visualization in recommendation systems. PCA as a denoising step before training.

Biotech/Pharma: t-SNE and UMAP are standard for single-cell genomics visualization. PCA for initial denoising before t-SNE/UMAP (PCA first to 50D, then t-SNE to 2D).

OpenAI: Autoencoders and learned representations dominate; classical DR methods used mainly for analysis/debugging.

Part 4 - Autoencoders for Dimensionality Reduction

Architecture

An autoencoder learns a compressed representation by training a neural network to reconstruct its input:

Input (d dimensions) → Encoder → Bottleneck (k dimensions) → Decoder → Output (d dimensions)

Loss = ||input - output||^2 (reconstruction error)

The bottleneck layer IS the dimensionality reduction - it's a k-dimensional representation that captures enough information to reconstruct the input.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
def __init__(self, input_dim, latent_dim):
super().__init__()
self.encoder = nn.Sequential(
nn.Linear(input_dim, 256),
nn.ReLU(),
nn.Linear(256, 128),
nn.ReLU(),
nn.Linear(128, latent_dim)
)
self.decoder = nn.Sequential(
nn.Linear(latent_dim, 128),
nn.ReLU(),
nn.Linear(128, 256),
nn.ReLU(),
nn.Linear(256, input_dim)
)

def forward(self, x):
z = self.encoder(x) # Compressed representation
x_hat = self.decoder(z) # Reconstruction
return x_hat, z

# Training
model = Autoencoder(input_dim=500, latent_dim=50)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
x_hat, z = model(X_train_tensor)
loss = criterion(x_hat, X_train_tensor)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Use encoder for dimensionality reduction
with torch.no_grad():
X_reduced = model.encoder(X_tensor).numpy()

Autoencoder vs. PCA

AspectPCAAutoencoder
RelationshipsLinear onlyNon-linear (with non-linear activations)
OptimalityOptimal linear reduction (provably)Depends on architecture and training
With linear activationEquivalent to PCASame solution as PCA
TrainingClosed-form (eigendecomposition)Iterative (gradient descent)
Overfitting riskNone (deterministic)Yes (regularization needed)
ScalabilityExcellent (truncated SVD)Requires GPU for large data
InterpretabilityHigh (loadings)Low (black box)
Interviewer's Perspective

"When would you use an autoencoder instead of PCA?" Strong answer: "When the data has non-linear structure that PCA can't capture. For example, images on a manifold - pixel-space PCA blurs images, while an autoencoder learns meaningful features. But autoencoders are harder to train, prone to overfitting, and not guaranteed to find the optimal reduction. I'd start with PCA and only switch to autoencoders if PCA's reconstruction error is unacceptably high."

Variational Autoencoders (VAE) for Reduction

VAEs add a probabilistic twist - the latent space is regularized to follow a Gaussian distribution, producing a smooth, continuous latent space.

class VAE(nn.Module):
def __init__(self, input_dim, latent_dim):
super().__init__()
self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
self.fc_mu = nn.Linear(256, latent_dim)
self.fc_logvar = nn.Linear(256, latent_dim)
self.decoder = nn.Sequential(
nn.Linear(latent_dim, 256), nn.ReLU(),
nn.Linear(256, input_dim)
)

def encode(self, x):
h = self.encoder(x)
return self.fc_mu(h), self.fc_logvar(h)

def reparameterize(self, mu, logvar):
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
return mu + eps * std

def forward(self, x):
mu, logvar = self.encode(x)
z = self.reparameterize(mu, logvar)
return self.decoder(z), mu, logvar

Part 5 - Feature Selection vs. Feature Extraction

The Fundamental Difference

  • Feature extraction (PCA, t-SNE, UMAP, autoencoders): Creates new features as combinations of originals. The new features may not be interpretable.
  • Feature selection: Chooses a subset of original features. The selected features retain their original meaning.

Feature Selection Methods

Feature Selection Taxonomy

Filter methods - score each feature independently, fast but ignore feature interactions:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Mutual information - captures non-linear relationships
selector = SelectKBest(mutual_info_classif, k=50)
X_selected = selector.fit_transform(X_train, y_train)

# Get selected feature names
selected_features = X_train.columns[selector.get_support()]

Wrapper methods - use a model to evaluate subsets, powerful but slow:

from sklearn.feature_selection import RFE

# Recursive Feature Elimination
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100),
n_features_to_select=50, step=10)
rfe.fit(X_train, y_train)
selected_mask = rfe.support_

Embedded methods - feature selection is built into the model training:

from sklearn.linear_model import LassoCV

# L1 regularization drives irrelevant feature weights to zero
lasso = LassoCV(cv=5)
lasso.fit(X_train, y_train)
selected = np.abs(lasso.coef_) > 0 # Non-zero coefficients = selected features
print(f"Selected {selected.sum()} out of {X_train.shape[1]} features")

# Tree-based importance
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
top_50 = np.argsort(importances)[-50:] # Top 50 most important features

When to Use Selection vs. Extraction

CriterionFeature SelectionFeature Extraction
Interpretability neededYesNo
Non-linear relationshipsLimitedYes (kernel PCA, autoencoders)
SpeedVaries (filters fast, wrappers slow)Fast (PCA), slow (autoencoders)
Handles redundancyPartially (can select correlated features)Yes (PCA decorrelates)
Domain knowledge helpsYesLess so
Regulatory requirementsOften required (explainability)May not satisfy

Practice Problems

Problem 1: PCA from Scratch (Mid-Level)

Scenario: You're asked to implement PCA without using any library. Your dataset has 1000 samples and 10 features.

Question: Walk through the complete implementation, explain each step, and compute the explained variance for each component.

Hint 1 - Direction

Start with centering the data. Then compute the covariance matrix. You need eigendecomposition, then sorting, then projection.

Hint 2 - Insight

The covariance matrix is (XTX)/(n1)(X^T X) / (n-1) after centering. Each eigenvalue represents the variance explained by its corresponding eigenvector. Sort by eigenvalue in descending order.

Hint 3 - Full Solution
import numpy as np

def pca_from_scratch(X, n_components):
# Step 1: Center the data
mean = X.mean(axis=0)
X_centered = X - mean

# Step 2: Compute covariance matrix
n = X.shape[0]
cov_matrix = (X_centered.T @ X_centered) / (n - 1)

# Step 3: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Step 4: Sort by eigenvalue (descending)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Step 5: Select top k components
components = eigenvectors[:, :n_components]

# Step 6: Project
X_reduced = X_centered @ components

# Explained variance
total_variance = eigenvalues.sum()
explained_variance_ratio = eigenvalues[:n_components] / total_variance

return X_reduced, components, explained_variance_ratio

# Usage
X_reduced, components, evr = pca_from_scratch(X, n_components=3)
print(f"Explained variance ratios: {evr}")
print(f"Total variance retained: {evr.sum():.3f}")

Key points to articulate:

  1. Centering is essential - PCA finds directions of maximum variance from the mean
  2. Covariance matrix is symmetric positive semi-definite
  3. eigh is used (not eig) because the covariance matrix is symmetric - more numerically stable
  4. Eigenvectors are the principal directions; eigenvalues are the variance along each direction
  5. In practice, use SVD for numerical stability: U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

Scoring Rubric:

  • Strong Hire: Correct implementation, explains why centering is needed, mentions SVD as preferred alternative, discusses numerical stability
  • Lean Hire: Correct implementation but can't explain the mathematical reasoning
  • No Hire: Can't implement without library or makes errors in matrix operations

Problem 2: PCA vs. t-SNE Debate (Mid-Level)

Scenario: Your manager asks you to use t-SNE to reduce 500 features to 50 for model training. You disagree. Explain why and propose alternatives.

Hint 1 - Direction

Think about what t-SNE preserves (local neighborhoods) and what it doesn't (global distances, feature interpretability). Can you apply t-SNE to new test data?

Hint 2 - Insight

t-SNE is designed for 2-3 dimensional visualization, not for 50D feature extraction. It's non-parametric (can't transform new data), non-deterministic, and doesn't preserve distances needed for downstream ML.

Hint 3 - Full Solution

Why t-SNE is wrong for this task:

  1. Non-parametric: t-SNE doesn't learn a mapping function. You can't apply it to new test data. You'd have to re-run t-SNE on train+test together, which leaks test information.

  2. Designed for 2-3D: t-SNE's perplexity-based similarity computation is designed for very low dimensions. At 50D, the crowding problem it solves doesn't exist, and the method provides no theoretical guarantees.

  3. Non-deterministic: Different runs produce different embeddings. Model training becomes non-reproducible.

  4. Distances are meaningless: t-SNE preserves local neighborhoods, not distances. A downstream model relying on distances (SVM, k-NN, linear regression) will receive meaningless features.

  5. Computational cost: O(n^2) per iteration, prohibitive for large datasets.

Alternatives (in order of recommendation):

  1. PCA to 50D: Fast, deterministic, preserves global variance, works on new data
  2. UMAP to 50D: Non-linear, preserves more structure than PCA, has .transform() for new data
  3. Autoencoder with 50D bottleneck: Non-linear, learnable, works on new data
  4. Feature selection (Lasso/RF importance): If interpretability matters

Scoring Rubric:

  • Strong Hire: Lists multiple specific reasons t-SNE fails for this use case, proposes ranked alternatives, explains when t-SNE IS appropriate
  • Lean Hire: Knows t-SNE is for visualization but can't articulate all the reasons why
  • No Hire: Agrees with the manager or doesn't see the problem

Problem 3: The Variance Paradox (Senior-Level)

Scenario: You apply PCA to a fraud detection dataset and retain 95% of variance with 20 components (down from 200). But your model's performance drops significantly compared to using all 200 features. What happened?

Hint 1 - Direction

PCA maximizes variance, but what if the signal for fraud detection is in low-variance directions?

Hint 2 - Insight

Consider: if 99% of transactions are legitimate, the high-variance directions describe variation within the legitimate class. The subtle patterns distinguishing fraud live in low-variance components that PCA discarded.

Hint 3 - Full Solution

The problem: PCA assumes high variance = important signal. In fraud detection:

  • The top principal components capture variance in legitimate transactions (different purchase amounts, locations, times)
  • Fraudulent transactions differ from legitimate ones in subtle, low-variance ways
  • By keeping only 95% variance, PCA discarded the very dimensions that distinguish fraud

Solutions:

  1. Supervised dimensionality reduction: Use LDA (Linear Discriminant Analysis) instead of PCA. LDA maximizes class separation, not variance.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1) # Binary: max 1 component
X_lda = lda.fit_transform(X_train, y_train)
  1. Keep more components: Retain 99.9% of variance, or better yet, evaluate downstream performance for different k values.

  2. Feature selection instead: Use a supervised method (mutual information, tree importance) that considers the target.

  3. Inspect discarded components: Check which components PCA dropped and whether they correlate with the target.

# Check correlation of each PC with the target
pca = PCA()
X_pca = pca.fit_transform(X_train)
for i in range(X_pca.shape[1]):
corr = np.corrcoef(X_pca[:, i], y_train)[0, 1]
if abs(corr) > 0.1:
print(f"PC{i+1}: variance_ratio={pca.explained_variance_ratio_[i]:.4f}, "
f"target_corr={corr:.3f}")

Key insight: PCA is unsupervised - it doesn't consider the target variable. For tasks where the discriminative signal is in low-variance directions, PCA can actually hurt performance.

Scoring Rubric:

  • Strong Hire: Identifies the variance-importance disconnect, explains with the fraud detection context, proposes LDA or supervised alternatives, discusses how to diagnose the issue
  • Lean Hire: Knows PCA might discard useful features but can't explain the mechanism
  • No Hire: Suggests "keep more components" without understanding why the issue occurred

Problem 4: Choosing the Right Method (Senior-Level)

Scenario: You have four datasets. For each, recommend a dimensionality reduction strategy and justify:

  • (A) 100K text documents, TF-IDF with 50K features, topic modeling task
  • (B) Single-cell RNA sequencing: 10K cells, 20K genes, want to visualize cell types
  • (C) Sensor data: 1M time series samples, 500 features, latency-critical anomaly detection
  • (D) Customer features: 200 features, 50K customers, interpretable credit scoring model needed
Hint 1 - Direction

Match the method to the constraints: interpretability, scale, linearity, visualization vs. ML preprocessing.

Hint 2 - Insight

Dataset A: sparse, large, topic-oriented. Dataset B: biology standard practice. Dataset C: scale and latency matter. Dataset D: interpretability is legally required for credit scoring.

Hint 3 - Full Solution

(A) Text documents - Truncated SVD (LSA)

  • TF-IDF matrices are sparse; standard PCA would densify them in memory
  • Truncated SVD works directly on sparse matrices
  • Components are interpretable as "topics" (Latent Semantic Analysis)
  • 100-300 components is typical for topic modeling
  • Alternative: NMF (Non-negative Matrix Factorization) for more interpretable topics

(B) Single-cell RNA sequencing - PCA then UMAP

  • Standard bioinformatics pipeline: PCA to 50D (denoising step), then UMAP to 2D (visualization)
  • PCA first removes noise in high-variance technical artifacts (batch effects)
  • UMAP then reveals cell type clusters in the denoised space
  • t-SNE also common but UMAP preserves more global structure and is faster for 10K cells
  • Use Leiden/Louvain clustering on the PCA-reduced (not UMAP-reduced) data for quantitative analysis

(C) Sensor data - PCA with online updates

  • Latency-critical: PCA projection is a simple matrix multiply (very fast)
  • 500 features to ~50 PCA components is typical
  • Use incremental PCA for streaming updates
  • Alternative: random projections (even faster, theoretical guarantees from Johnson-Lindenstrauss lemma)

(D) Credit scoring - Feature selection (Lasso + domain expertise)

  • Regulatory requirements (ECOA, FCRA) demand interpretable models
  • Feature extraction (PCA) creates uninterpretable components - rejected by regulators
  • Use L1 regularization (Lasso) to select a sparse set of original features
  • Validate with domain experts: selected features should be explainable
  • Typical: reduce 200 to 20-30 interpretable features

Scoring Rubric:

  • Strong Hire: Correct method for all four with detailed justification, mentions domain-specific conventions (bioinformatics pipeline, regulatory requirements), discusses alternatives
  • Lean Hire: Gets 3 out of 4 right with reasonable justification
  • No Hire: Recommends PCA for everything or doesn't consider the specific constraints

Problem 5: The Johnson-Lindenstrauss Lemma (Staff-Level)

Scenario: You need to reduce 1 million feature vectors from 10,000 dimensions to a lower dimension for approximate nearest neighbor search. The system must process 100K queries per second. PCA is too slow to train on this data.

Question: What approach would you use, and what theoretical guarantee can you give about the quality of the reduction?

Hint 1 - Direction

Random projections don't require any training - just multiply by a random matrix. There's a theorem that guarantees distance preservation.

Hint 2 - Insight

The Johnson-Lindenstrauss lemma states that n points in high-dimensional space can be embedded into O(log n / epsilon^2) dimensions while preserving all pairwise distances within a factor of (1 +/- epsilon).

Hint 3 - Full Solution

Solution: Random Projections

Multiply by a random matrix RRd×kR \in \mathbb{R}^{d \times k} where each entry is drawn from N(0,1/k)N(0, 1/k):

Xreduced=XRX_{reduced} = X \cdot R

Johnson-Lindenstrauss Lemma: For any set of n points in Rd\mathbb{R}^d, a random projection to k=O(lognϵ2)k = O(\frac{\log n}{\epsilon^2}) dimensions preserves all pairwise distances within (1±ϵ)(1 \pm \epsilon) with high probability:

(1ϵ)uv2RuRv2(1+ϵ)uv2(1 - \epsilon) \|u - v\|^2 \leq \|Ru - Rv\|^2 \leq (1 + \epsilon) \|u - v\|^2

For this problem:

  • n = 1,000,000 points
  • epsilon = 0.1 (10% distance distortion acceptable)
  • k = O(log(1,000,000) / 0.01) = O(1,382) - roughly 1,400 dimensions
from sklearn.random_projection import GaussianRandomProjection

# JL-lemma guided dimension
rp = GaussianRandomProjection(n_components=1400, random_state=42)
X_reduced = rp.fit_transform(X) # "fit" just generates the random matrix

# Even faster: sparse random projection
from sklearn.random_projection import SparseRandomProjection
rp_sparse = SparseRandomProjection(n_components=1400, density='auto')
X_reduced = rp_sparse.fit_transform(X)

Why this works for ANN search:

  1. No training needed: Random matrix generation is O(d * k), not data-dependent
  2. Distance preservation: JL guarantee means nearest neighbors are approximately preserved
  3. Fast inference: Matrix multiply is O(d * k) per query, highly parallelizable
  4. Composable: Can combine with approximate nearest neighbor indices (FAISS, Annoy)

Production architecture:

  • Pre-compute X_reduced for all 1M vectors
  • Store reduced vectors in FAISS index
  • For each query: random project (0.01ms) then ANN search (0.1ms) = well under latency budget

Scoring Rubric:

  • Strong Hire: States JL lemma correctly, computes dimension bound, implements random projection, designs production system with ANN index
  • Lean Hire: Knows about random projections but can't state the JL guarantee precisely
  • No Hire: Insists on PCA or doesn't know about random projections

Interview Cheat Sheet

QuestionKey Points
What is the curse of dimensionality?Distances converge, volume concentrates at edges, need exponentially more data
Explain PCA step by stepCenter data, compute covariance, eigendecompose, project onto top k eigenvectors
How to choose number of components?Elbow method on cumulative explained variance; common thresholds: 90%, 95%, 99%; validate with downstream task
PCA vs. t-SNE?PCA: linear, global structure, ML preprocessing. t-SNE: non-linear, local structure, visualization only
PCA vs. UMAP?PCA: linear, deterministic, interpretable. UMAP: non-linear, preserves manifold, has transform()
When PCA fails?Non-linear structure, signal in low-variance directions, unscaled features
Must you scale before PCA?Yes (StandardScaler), unless all features already on same scale
t-SNE cluster sizes meaningful?No - t-SNE distorts sizes and inter-cluster distances; only neighborhoods reliable
Feature selection vs. extraction?Selection: interpretable, subset of originals. Extraction: creates new features, captures more structure
Random projections?JL lemma guarantees distance preservation in O(log n / eps^2) dims; no training needed
When autoencoders?Non-linear structure, enough data to train, interpretability not required
Kernel PCA?PCA in kernel-induced feature space; captures non-linear relationships; expensive for large datasets

Spaced Repetition Checkpoints

Day 0 - Initial Learning

  • Explain the curse of dimensionality with three concrete manifestations
  • Walk through PCA step by step (center, covariance, eigen, project)
  • List 3 differences between PCA, t-SNE, and UMAP
  • Explain why you must standardize features before PCA

Day 3 - Recall

  • Derive PCA from the variance maximization objective
  • Explain why t-SNE uses a t-distribution in the low-dimensional space
  • Draw the feature selection taxonomy (filter, wrapper, embedded)
  • Explain when PCA's explained variance heuristic fails

Day 7 - Application

  • Implement PCA from scratch (centering, covariance, eigendecomposition)
  • Given a dataset description, choose the right DR method and justify
  • Explain the JL lemma and compute a dimension bound
  • Solve Practice Problem 2 without hints

Day 14 - Integration

  • Compare 6 dimensionality reduction methods with pros/cons
  • Design a complete DR pipeline for a real-world problem (genomics, NLP, etc.)
  • Explain when unsupervised DR (PCA) vs. supervised DR (LDA) is appropriate
  • Solve Practice Problem 4 with all four datasets

Day 21 - Mastery

  • Teach PCA derivation and the curse of dimensionality from scratch
  • Design a production dimensionality reduction system with latency constraints
  • Explain VAE latent space properties vs. standard autoencoder
  • Confidently handle: "Why not just use PCA for everything?"
© 2026 EngineersOfAI. All rights reserved.