Dimensionality Reduction - Taming the Curse of High Dimensions
Reading time: ~30 min | Interview relevance: High | Roles: MLE, Data Scientist, AI Engineer, Research Scientist
The Real Interview Moment
The interviewer asks: "You have a dataset with 50,000 features from gene expression data and only 200 patient samples. Your model is performing worse than a simple baseline. What's going on, and how would you approach this?"
You immediately recognize the curse of dimensionality - with 50,000 features and 200 samples, the model has far more parameters to learn than data points to learn from. The feature space is so vast that any two points are nearly equidistant, making it impossible for distance-based algorithms to find meaningful structure. But the interviewer doesn't just want you to say "use PCA." They want you to reason about which dimensionality reduction approach to use, how many dimensions to keep, and whether you should select features or extract new ones.
This question tests whether you understand the mathematical foundations of dimensionality reduction and can make principled decisions about when and how to apply it.
What You Will Master
- The curse of dimensionality - why it matters and how it manifests
- PCA: mathematical derivation, eigenvalue interpretation, choosing components
- t-SNE: how it works, limitations, and proper usage
- UMAP: advantages over t-SNE, when to choose it
- Autoencoders for non-linear dimensionality reduction
- Feature selection vs. feature extraction - when to use each
- Practical decision framework for choosing the right method
- Explained variance and the elbow method for component selection
Self-Assessment: Where Are You Now?
| Level | Description | Target |
|---|---|---|
| Beginner | "I know PCA reduces dimensions somehow" | Read all parts, focus on PCA math |
| Intermediate | "I can explain PCA and have used t-SNE for visualization" | Focus on Parts 2-3, decision flowchart, practice problems |
| Advanced | "I know the methods but want to nail the mathematical proofs and tradeoffs" | Jump to PCA derivation, UMAP vs t-SNE, and practice problems |
Part 1 - The Curse of Dimensionality
Why High Dimensions Break Everything
As dimensionality increases, several counterintuitive phenomena emerge:
1. Volume concentrates at the edges
In a d-dimensional unit hypercube, the fraction of volume within distance epsilon of the surface is:
For d=100 and epsilon=0.01: . Over 86% of the volume is within 1% of the surface. Almost all points are "edge cases."
2. Distances become meaningless
As dimensions grow, the ratio of the nearest to farthest neighbor distance converges to 1:
When all points are roughly equidistant, algorithms that rely on distance (k-NN, SVM, clustering) lose discriminative power.
3. Data becomes sparse
To maintain the same density of data points in a d-dimensional space, you need exponentially more data. Going from 10 samples per unit in 1D to 10D requires samples to maintain the same density.
"The curse of dimensionality means that as features increase, the data becomes increasingly sparse in the feature space. Distances between points converge, making similarity-based algorithms ineffective. You need exponentially more data to maintain statistical significance. Dimensionality reduction addresses this by projecting data into a lower-dimensional space that preserves the most important structure - either through feature selection (picking the best existing features) or feature extraction (creating new features as combinations of originals). PCA is the most common extraction method; it finds orthogonal directions of maximum variance."
When Dimensionality Reduction Helps
| Scenario | Benefit |
|---|---|
| Features >> samples (p >> n) | Prevents overfitting, enables learning |
| Visualization | Project to 2D/3D for human understanding |
| Noise reduction | Remove low-variance noise dimensions |
| Computational efficiency | Fewer features = faster training and inference |
| Multicollinearity | PCA produces uncorrelated features |
| Storage and bandwidth | Compressed representations for production |
When It Hurts
| Scenario | Problem |
|---|---|
| Interpretability required | Extracted features are hard to explain |
| All features are important | Reduction discards signal |
| Non-linear relationships | Linear methods (PCA) miss curved structure |
| Downstream task suffers | Reduced representation loses task-relevant info |
Part 2 - PCA (Principal Component Analysis)
The Mathematical Foundation
Goal: Find a linear projection that maximizes variance in the projected space.
Setup: Given data matrix (n samples, d features), centered so each column has mean 0.
Step 1: Compute the covariance matrix
Step 2: Eigendecomposition
Where:
- - eigenvectors (principal components), orthonormal
- - eigenvalues, sorted
Step 3: Project onto top k components
Where is the matrix of the top k eigenvectors.
Why this works: The first eigenvector is the direction of maximum variance. The second is the direction of maximum variance orthogonal to . And so on. By keeping the top k, you preserve the most "informative" directions and discard noise.
import numpy as np
from sklearn.decomposition import PCA
# Manual PCA
X_centered = X - X.mean(axis=0)
cov_matrix = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort by eigenvalue (descending)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Project onto top k components
k = 50
X_reduced = X_centered @ eigenvectors[:, :k]
# Using scikit-learn (equivalent, but also handles SVD internally)
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)
"Walk me through PCA step by step." I want to hear: (1) center the data, (2) compute covariance matrix, (3) eigendecomposition, (4) project onto top eigenvectors. Bonus points for mentioning SVD as the practical implementation (more numerically stable than eigendecomposition) and explaining that eigenvalues represent variance captured by each component.
SVD vs. Eigendecomposition
In practice, PCA is computed via SVD (Singular Value Decomposition) rather than eigendecomposition:
Where:
- - left singular vectors
- - singular values (diagonal)
- - right singular vectors = principal components
The relationship: (eigenvalues = squared singular values / (n-1))
Why SVD?
- Numerically more stable than computing (which squares condition numbers)
- Can use truncated SVD for efficiency (only compute top k components)
- Works for non-square matrices
Explained Variance and Choosing k
The explained variance ratio for component i:
The elbow method: Plot cumulative explained variance vs. number of components. Choose k at the "elbow" - where adding more components gives diminishing returns.
pca = PCA()
pca.fit(X)
# Cumulative explained variance
cumulative_var = np.cumsum(pca.explained_variance_ratio_)
# Find k for 95% variance retained
k_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"Components for 95% variance: {k_95}")
# Common thresholds
# 90% - aggressive reduction, might lose some signal
# 95% - good balance for most applications
# 99% - conservative, minimal information loss
"I'll keep components until I reach 95% explained variance." This is a reasonable heuristic, but it assumes that high-variance directions are the most informative for your task. In some cases, the signal lives in low-variance directions. For example, in a dataset where all features vary widely due to noise but the class-discriminative signal is subtle, PCA might discard the very dimensions that matter. Always validate with downstream task performance.
PCA Assumptions and Limitations
| Assumption | Implication |
|---|---|
| Linearity | PCA finds linear projections; misses curved/manifold structure |
| Variance = importance | Maximizes variance, but high variance might be noise |
| Orthogonality | Components are orthogonal; real structure might not be |
| Gaussian-like distribution | Works best when data is roughly Gaussian; skewed data may need transformation first |
| Centered data | PCA subtracts the mean; if mean is informative, this is lost |
| Scale sensitivity | Features must be standardized; otherwise large-scale features dominate |
"I ran PCA on my features without standardizing them first." If feature A ranges from 0-1000 and feature B ranges from 0-1, PCA will be dominated by feature A regardless of feature B's importance. Always StandardScaler().fit_transform(X) before PCA, unless all features are on the same scale.
Kernel PCA
For non-linear relationships, Kernel PCA applies PCA in a high-dimensional feature space induced by a kernel function, without explicitly computing the transformation.
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=50, kernel='rbf', gamma=0.01)
X_reduced = kpca.fit_transform(X)
Kernels:
linear- equivalent to standard PCArbf- captures non-linear structure; gamma controls localitypoly- polynomial relationships; degree parameter
Part 3 - Visualization Methods: t-SNE and UMAP
t-SNE (t-distributed Stochastic Neighbor Embedding)
Purpose: Primarily for visualization - projects high-dimensional data to 2D/3D while preserving local neighborhood structure.
How it works (intuition):
- Compute pairwise similarities in high-dimensional space using Gaussian kernels
- Initialize random 2D positions
- Compute pairwise similarities in 2D using Student's t-distribution (heavy tails)
- Minimize the KL divergence between high-D and low-D similarity distributions using gradient descent
Why t-distribution in low-D? The "crowding problem" - in 2D, there's less room for moderately distant points. The heavy-tailed t-distribution allows moderately similar points to be farther apart in 2D without being penalized, creating well-separated clusters.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, learning_rate='auto',
n_iter=1000, random_state=42)
X_2d = tsne.fit_transform(X)
Critical parameters:
- perplexity (5-50): Roughly how many neighbors each point considers. Low perplexity = tight, many clusters. High perplexity = broader structure visible. Try multiple values.
- learning_rate: Too low = compressed ball. Too high = uniform ball.
'auto'is usually fine. - n_iter: At least 1000. Premature stopping produces artifacts.
The most common t-SNE misinterpretation mistakes:
- Cluster sizes are meaningless. t-SNE expands dense clusters and contracts sparse ones. You cannot compare cluster sizes.
- Distances between clusters are meaningless. Two well-separated clusters in t-SNE might be close in the original space and vice versa.
- Multiple runs produce different layouts. t-SNE is non-convex; different random initializations give different results. Always run multiple times.
- Don't use it for downstream ML. t-SNE is for visualization only - the mapping is non-parametric and can't be applied to new data.
UMAP (Uniform Manifold Approximation and Projection)
Purpose: Visualization AND dimensionality reduction for downstream tasks.
How it works (intuition):
- Build a weighted k-nearest-neighbor graph in high-dimensional space
- Optimize a low-dimensional layout to preserve the graph structure
- Uses cross-entropy loss between high-D and low-D fuzzy topological representations
Advantages over t-SNE:
- Much faster - scales to millions of points (t-SNE is O(n^2) without approximations)
- Preserves more global structure - cluster-to-cluster relationships are more meaningful
- Can embed to higher dimensions - not limited to 2D/3D; use for actual dimensionality reduction
- Parametric version available - can transform new unseen data
import umap
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
metric='euclidean', random_state=42)
X_2d = reducer.fit_transform(X)
# For dimensionality reduction (not just visualization)
reducer_50d = umap.UMAP(n_components=50, random_state=42)
X_reduced = reducer_50d.fit_transform(X_train)
X_test_reduced = reducer_50d.transform(X_test) # Can transform new data!
Critical parameters:
- n_neighbors (5-50): Controls local vs. global structure. Low = fine detail, may fragment. High = broader patterns, may merge clusters.
- min_dist (0.0-0.99): How tightly points are packed. 0.0 = tight clusters. 0.5 = spread out.
- metric:
euclidean,cosine,manhattan, etc. Choose based on your data type.
Method Selection Flowchart
Head-to-Head Comparison
| Feature | PCA | t-SNE | UMAP |
|---|---|---|---|
| Type | Linear | Non-linear | Non-linear |
| Speed | Fast O(nd^2) | Slow O(n^2) | Fast O(n log n) |
| Scalability | Millions of samples | <50K practical | Millions of samples |
| Global structure | Preserved | Not preserved | Partially preserved |
| Local structure | Approximate | Excellent | Excellent |
| New data | Yes (projection) | No (re-run needed) | Yes (transform) |
| Target dimensions | Any k | 2-3 only | Any k |
| Deterministic | Yes | No | No |
| Interpretable | Yes (loadings) | No | No |
| Use for ML | Yes | No | Yes |
Google: Heavy use of PCA for feature preprocessing in large-scale ML systems. Random projections (Johnson-Lindenstrauss) for approximate dimensionality reduction at scale.
Meta: UMAP for embedding visualization in recommendation systems. PCA as a denoising step before training.
Biotech/Pharma: t-SNE and UMAP are standard for single-cell genomics visualization. PCA for initial denoising before t-SNE/UMAP (PCA first to 50D, then t-SNE to 2D).
OpenAI: Autoencoders and learned representations dominate; classical DR methods used mainly for analysis/debugging.
Part 4 - Autoencoders for Dimensionality Reduction
Architecture
An autoencoder learns a compressed representation by training a neural network to reconstruct its input:
Input (d dimensions) → Encoder → Bottleneck (k dimensions) → Decoder → Output (d dimensions)
Loss = ||input - output||^2 (reconstruction error)
The bottleneck layer IS the dimensionality reduction - it's a k-dimensional representation that captures enough information to reconstruct the input.
import torch
import torch.nn as nn
class Autoencoder(nn.Module):
def __init__(self, input_dim, latent_dim):
super().__init__()
self.encoder = nn.Sequential(
nn.Linear(input_dim, 256),
nn.ReLU(),
nn.Linear(256, 128),
nn.ReLU(),
nn.Linear(128, latent_dim)
)
self.decoder = nn.Sequential(
nn.Linear(latent_dim, 128),
nn.ReLU(),
nn.Linear(128, 256),
nn.ReLU(),
nn.Linear(256, input_dim)
)
def forward(self, x):
z = self.encoder(x) # Compressed representation
x_hat = self.decoder(z) # Reconstruction
return x_hat, z
# Training
model = Autoencoder(input_dim=500, latent_dim=50)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
x_hat, z = model(X_train_tensor)
loss = criterion(x_hat, X_train_tensor)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Use encoder for dimensionality reduction
with torch.no_grad():
X_reduced = model.encoder(X_tensor).numpy()
Autoencoder vs. PCA
| Aspect | PCA | Autoencoder |
|---|---|---|
| Relationships | Linear only | Non-linear (with non-linear activations) |
| Optimality | Optimal linear reduction (provably) | Depends on architecture and training |
| With linear activation | Equivalent to PCA | Same solution as PCA |
| Training | Closed-form (eigendecomposition) | Iterative (gradient descent) |
| Overfitting risk | None (deterministic) | Yes (regularization needed) |
| Scalability | Excellent (truncated SVD) | Requires GPU for large data |
| Interpretability | High (loadings) | Low (black box) |
"When would you use an autoencoder instead of PCA?" Strong answer: "When the data has non-linear structure that PCA can't capture. For example, images on a manifold - pixel-space PCA blurs images, while an autoencoder learns meaningful features. But autoencoders are harder to train, prone to overfitting, and not guaranteed to find the optimal reduction. I'd start with PCA and only switch to autoencoders if PCA's reconstruction error is unacceptably high."
Variational Autoencoders (VAE) for Reduction
VAEs add a probabilistic twist - the latent space is regularized to follow a Gaussian distribution, producing a smooth, continuous latent space.
class VAE(nn.Module):
def __init__(self, input_dim, latent_dim):
super().__init__()
self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
self.fc_mu = nn.Linear(256, latent_dim)
self.fc_logvar = nn.Linear(256, latent_dim)
self.decoder = nn.Sequential(
nn.Linear(latent_dim, 256), nn.ReLU(),
nn.Linear(256, input_dim)
)
def encode(self, x):
h = self.encoder(x)
return self.fc_mu(h), self.fc_logvar(h)
def reparameterize(self, mu, logvar):
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
return mu + eps * std
def forward(self, x):
mu, logvar = self.encode(x)
z = self.reparameterize(mu, logvar)
return self.decoder(z), mu, logvar
Part 5 - Feature Selection vs. Feature Extraction
The Fundamental Difference
- Feature extraction (PCA, t-SNE, UMAP, autoencoders): Creates new features as combinations of originals. The new features may not be interpretable.
- Feature selection: Chooses a subset of original features. The selected features retain their original meaning.
Feature Selection Methods
Filter methods - score each feature independently, fast but ignore feature interactions:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
# Mutual information - captures non-linear relationships
selector = SelectKBest(mutual_info_classif, k=50)
X_selected = selector.fit_transform(X_train, y_train)
# Get selected feature names
selected_features = X_train.columns[selector.get_support()]
Wrapper methods - use a model to evaluate subsets, powerful but slow:
from sklearn.feature_selection import RFE
# Recursive Feature Elimination
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100),
n_features_to_select=50, step=10)
rfe.fit(X_train, y_train)
selected_mask = rfe.support_
Embedded methods - feature selection is built into the model training:
from sklearn.linear_model import LassoCV
# L1 regularization drives irrelevant feature weights to zero
lasso = LassoCV(cv=5)
lasso.fit(X_train, y_train)
selected = np.abs(lasso.coef_) > 0 # Non-zero coefficients = selected features
print(f"Selected {selected.sum()} out of {X_train.shape[1]} features")
# Tree-based importance
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
top_50 = np.argsort(importances)[-50:] # Top 50 most important features
When to Use Selection vs. Extraction
| Criterion | Feature Selection | Feature Extraction |
|---|---|---|
| Interpretability needed | Yes | No |
| Non-linear relationships | Limited | Yes (kernel PCA, autoencoders) |
| Speed | Varies (filters fast, wrappers slow) | Fast (PCA), slow (autoencoders) |
| Handles redundancy | Partially (can select correlated features) | Yes (PCA decorrelates) |
| Domain knowledge helps | Yes | Less so |
| Regulatory requirements | Often required (explainability) | May not satisfy |
Practice Problems
Problem 1: PCA from Scratch (Mid-Level)
Scenario: You're asked to implement PCA without using any library. Your dataset has 1000 samples and 10 features.
Question: Walk through the complete implementation, explain each step, and compute the explained variance for each component.
Hint 1 - Direction
Start with centering the data. Then compute the covariance matrix. You need eigendecomposition, then sorting, then projection.
Hint 2 - Insight
The covariance matrix is after centering. Each eigenvalue represents the variance explained by its corresponding eigenvector. Sort by eigenvalue in descending order.
Hint 3 - Full Solution
import numpy as np
def pca_from_scratch(X, n_components):
# Step 1: Center the data
mean = X.mean(axis=0)
X_centered = X - mean
# Step 2: Compute covariance matrix
n = X.shape[0]
cov_matrix = (X_centered.T @ X_centered) / (n - 1)
# Step 3: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Step 4: Sort by eigenvalue (descending)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Step 5: Select top k components
components = eigenvectors[:, :n_components]
# Step 6: Project
X_reduced = X_centered @ components
# Explained variance
total_variance = eigenvalues.sum()
explained_variance_ratio = eigenvalues[:n_components] / total_variance
return X_reduced, components, explained_variance_ratio
# Usage
X_reduced, components, evr = pca_from_scratch(X, n_components=3)
print(f"Explained variance ratios: {evr}")
print(f"Total variance retained: {evr.sum():.3f}")
Key points to articulate:
- Centering is essential - PCA finds directions of maximum variance from the mean
- Covariance matrix is symmetric positive semi-definite
eighis used (noteig) because the covariance matrix is symmetric - more numerically stable- Eigenvectors are the principal directions; eigenvalues are the variance along each direction
- In practice, use SVD for numerical stability:
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
Scoring Rubric:
- Strong Hire: Correct implementation, explains why centering is needed, mentions SVD as preferred alternative, discusses numerical stability
- Lean Hire: Correct implementation but can't explain the mathematical reasoning
- No Hire: Can't implement without library or makes errors in matrix operations
Problem 2: PCA vs. t-SNE Debate (Mid-Level)
Scenario: Your manager asks you to use t-SNE to reduce 500 features to 50 for model training. You disagree. Explain why and propose alternatives.
Hint 1 - Direction
Think about what t-SNE preserves (local neighborhoods) and what it doesn't (global distances, feature interpretability). Can you apply t-SNE to new test data?
Hint 2 - Insight
t-SNE is designed for 2-3 dimensional visualization, not for 50D feature extraction. It's non-parametric (can't transform new data), non-deterministic, and doesn't preserve distances needed for downstream ML.
Hint 3 - Full Solution
Why t-SNE is wrong for this task:
-
Non-parametric: t-SNE doesn't learn a mapping function. You can't apply it to new test data. You'd have to re-run t-SNE on train+test together, which leaks test information.
-
Designed for 2-3D: t-SNE's perplexity-based similarity computation is designed for very low dimensions. At 50D, the crowding problem it solves doesn't exist, and the method provides no theoretical guarantees.
-
Non-deterministic: Different runs produce different embeddings. Model training becomes non-reproducible.
-
Distances are meaningless: t-SNE preserves local neighborhoods, not distances. A downstream model relying on distances (SVM, k-NN, linear regression) will receive meaningless features.
-
Computational cost: O(n^2) per iteration, prohibitive for large datasets.
Alternatives (in order of recommendation):
- PCA to 50D: Fast, deterministic, preserves global variance, works on new data
- UMAP to 50D: Non-linear, preserves more structure than PCA, has
.transform()for new data - Autoencoder with 50D bottleneck: Non-linear, learnable, works on new data
- Feature selection (Lasso/RF importance): If interpretability matters
Scoring Rubric:
- Strong Hire: Lists multiple specific reasons t-SNE fails for this use case, proposes ranked alternatives, explains when t-SNE IS appropriate
- Lean Hire: Knows t-SNE is for visualization but can't articulate all the reasons why
- No Hire: Agrees with the manager or doesn't see the problem
Problem 3: The Variance Paradox (Senior-Level)
Scenario: You apply PCA to a fraud detection dataset and retain 95% of variance with 20 components (down from 200). But your model's performance drops significantly compared to using all 200 features. What happened?
Hint 1 - Direction
PCA maximizes variance, but what if the signal for fraud detection is in low-variance directions?
Hint 2 - Insight
Consider: if 99% of transactions are legitimate, the high-variance directions describe variation within the legitimate class. The subtle patterns distinguishing fraud live in low-variance components that PCA discarded.
Hint 3 - Full Solution
The problem: PCA assumes high variance = important signal. In fraud detection:
- The top principal components capture variance in legitimate transactions (different purchase amounts, locations, times)
- Fraudulent transactions differ from legitimate ones in subtle, low-variance ways
- By keeping only 95% variance, PCA discarded the very dimensions that distinguish fraud
Solutions:
- Supervised dimensionality reduction: Use LDA (Linear Discriminant Analysis) instead of PCA. LDA maximizes class separation, not variance.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=1) # Binary: max 1 component
X_lda = lda.fit_transform(X_train, y_train)
-
Keep more components: Retain 99.9% of variance, or better yet, evaluate downstream performance for different k values.
-
Feature selection instead: Use a supervised method (mutual information, tree importance) that considers the target.
-
Inspect discarded components: Check which components PCA dropped and whether they correlate with the target.
# Check correlation of each PC with the target
pca = PCA()
X_pca = pca.fit_transform(X_train)
for i in range(X_pca.shape[1]):
corr = np.corrcoef(X_pca[:, i], y_train)[0, 1]
if abs(corr) > 0.1:
print(f"PC{i+1}: variance_ratio={pca.explained_variance_ratio_[i]:.4f}, "
f"target_corr={corr:.3f}")
Key insight: PCA is unsupervised - it doesn't consider the target variable. For tasks where the discriminative signal is in low-variance directions, PCA can actually hurt performance.
Scoring Rubric:
- Strong Hire: Identifies the variance-importance disconnect, explains with the fraud detection context, proposes LDA or supervised alternatives, discusses how to diagnose the issue
- Lean Hire: Knows PCA might discard useful features but can't explain the mechanism
- No Hire: Suggests "keep more components" without understanding why the issue occurred
Problem 4: Choosing the Right Method (Senior-Level)
Scenario: You have four datasets. For each, recommend a dimensionality reduction strategy and justify:
- (A) 100K text documents, TF-IDF with 50K features, topic modeling task
- (B) Single-cell RNA sequencing: 10K cells, 20K genes, want to visualize cell types
- (C) Sensor data: 1M time series samples, 500 features, latency-critical anomaly detection
- (D) Customer features: 200 features, 50K customers, interpretable credit scoring model needed
Hint 1 - Direction
Match the method to the constraints: interpretability, scale, linearity, visualization vs. ML preprocessing.
Hint 2 - Insight
Dataset A: sparse, large, topic-oriented. Dataset B: biology standard practice. Dataset C: scale and latency matter. Dataset D: interpretability is legally required for credit scoring.
Hint 3 - Full Solution
(A) Text documents - Truncated SVD (LSA)
- TF-IDF matrices are sparse; standard PCA would densify them in memory
- Truncated SVD works directly on sparse matrices
- Components are interpretable as "topics" (Latent Semantic Analysis)
- 100-300 components is typical for topic modeling
- Alternative: NMF (Non-negative Matrix Factorization) for more interpretable topics
(B) Single-cell RNA sequencing - PCA then UMAP
- Standard bioinformatics pipeline: PCA to 50D (denoising step), then UMAP to 2D (visualization)
- PCA first removes noise in high-variance technical artifacts (batch effects)
- UMAP then reveals cell type clusters in the denoised space
- t-SNE also common but UMAP preserves more global structure and is faster for 10K cells
- Use Leiden/Louvain clustering on the PCA-reduced (not UMAP-reduced) data for quantitative analysis
(C) Sensor data - PCA with online updates
- Latency-critical: PCA projection is a simple matrix multiply (very fast)
- 500 features to ~50 PCA components is typical
- Use incremental PCA for streaming updates
- Alternative: random projections (even faster, theoretical guarantees from Johnson-Lindenstrauss lemma)
(D) Credit scoring - Feature selection (Lasso + domain expertise)
- Regulatory requirements (ECOA, FCRA) demand interpretable models
- Feature extraction (PCA) creates uninterpretable components - rejected by regulators
- Use L1 regularization (Lasso) to select a sparse set of original features
- Validate with domain experts: selected features should be explainable
- Typical: reduce 200 to 20-30 interpretable features
Scoring Rubric:
- Strong Hire: Correct method for all four with detailed justification, mentions domain-specific conventions (bioinformatics pipeline, regulatory requirements), discusses alternatives
- Lean Hire: Gets 3 out of 4 right with reasonable justification
- No Hire: Recommends PCA for everything or doesn't consider the specific constraints
Problem 5: The Johnson-Lindenstrauss Lemma (Staff-Level)
Scenario: You need to reduce 1 million feature vectors from 10,000 dimensions to a lower dimension for approximate nearest neighbor search. The system must process 100K queries per second. PCA is too slow to train on this data.
Question: What approach would you use, and what theoretical guarantee can you give about the quality of the reduction?
Hint 1 - Direction
Random projections don't require any training - just multiply by a random matrix. There's a theorem that guarantees distance preservation.
Hint 2 - Insight
The Johnson-Lindenstrauss lemma states that n points in high-dimensional space can be embedded into O(log n / epsilon^2) dimensions while preserving all pairwise distances within a factor of (1 +/- epsilon).
Hint 3 - Full Solution
Solution: Random Projections
Multiply by a random matrix where each entry is drawn from :
Johnson-Lindenstrauss Lemma: For any set of n points in , a random projection to dimensions preserves all pairwise distances within with high probability:
For this problem:
- n = 1,000,000 points
- epsilon = 0.1 (10% distance distortion acceptable)
- k = O(log(1,000,000) / 0.01) = O(1,382) - roughly 1,400 dimensions
from sklearn.random_projection import GaussianRandomProjection
# JL-lemma guided dimension
rp = GaussianRandomProjection(n_components=1400, random_state=42)
X_reduced = rp.fit_transform(X) # "fit" just generates the random matrix
# Even faster: sparse random projection
from sklearn.random_projection import SparseRandomProjection
rp_sparse = SparseRandomProjection(n_components=1400, density='auto')
X_reduced = rp_sparse.fit_transform(X)
Why this works for ANN search:
- No training needed: Random matrix generation is O(d * k), not data-dependent
- Distance preservation: JL guarantee means nearest neighbors are approximately preserved
- Fast inference: Matrix multiply is O(d * k) per query, highly parallelizable
- Composable: Can combine with approximate nearest neighbor indices (FAISS, Annoy)
Production architecture:
- Pre-compute X_reduced for all 1M vectors
- Store reduced vectors in FAISS index
- For each query: random project (0.01ms) then ANN search (0.1ms) = well under latency budget
Scoring Rubric:
- Strong Hire: States JL lemma correctly, computes dimension bound, implements random projection, designs production system with ANN index
- Lean Hire: Knows about random projections but can't state the JL guarantee precisely
- No Hire: Insists on PCA or doesn't know about random projections
Interview Cheat Sheet
| Question | Key Points |
|---|---|
| What is the curse of dimensionality? | Distances converge, volume concentrates at edges, need exponentially more data |
| Explain PCA step by step | Center data, compute covariance, eigendecompose, project onto top k eigenvectors |
| How to choose number of components? | Elbow method on cumulative explained variance; common thresholds: 90%, 95%, 99%; validate with downstream task |
| PCA vs. t-SNE? | PCA: linear, global structure, ML preprocessing. t-SNE: non-linear, local structure, visualization only |
| PCA vs. UMAP? | PCA: linear, deterministic, interpretable. UMAP: non-linear, preserves manifold, has transform() |
| When PCA fails? | Non-linear structure, signal in low-variance directions, unscaled features |
| Must you scale before PCA? | Yes (StandardScaler), unless all features already on same scale |
| t-SNE cluster sizes meaningful? | No - t-SNE distorts sizes and inter-cluster distances; only neighborhoods reliable |
| Feature selection vs. extraction? | Selection: interpretable, subset of originals. Extraction: creates new features, captures more structure |
| Random projections? | JL lemma guarantees distance preservation in O(log n / eps^2) dims; no training needed |
| When autoencoders? | Non-linear structure, enough data to train, interpretability not required |
| Kernel PCA? | PCA in kernel-induced feature space; captures non-linear relationships; expensive for large datasets |
Spaced Repetition Checkpoints
Day 0 - Initial Learning
- Explain the curse of dimensionality with three concrete manifestations
- Walk through PCA step by step (center, covariance, eigen, project)
- List 3 differences between PCA, t-SNE, and UMAP
- Explain why you must standardize features before PCA
Day 3 - Recall
- Derive PCA from the variance maximization objective
- Explain why t-SNE uses a t-distribution in the low-dimensional space
- Draw the feature selection taxonomy (filter, wrapper, embedded)
- Explain when PCA's explained variance heuristic fails
Day 7 - Application
- Implement PCA from scratch (centering, covariance, eigendecomposition)
- Given a dataset description, choose the right DR method and justify
- Explain the JL lemma and compute a dimension bound
- Solve Practice Problem 2 without hints
Day 14 - Integration
- Compare 6 dimensionality reduction methods with pros/cons
- Design a complete DR pipeline for a real-world problem (genomics, NLP, etc.)
- Explain when unsupervised DR (PCA) vs. supervised DR (LDA) is appropriate
- Solve Practice Problem 4 with all four datasets
Day 21 - Mastery
- Teach PCA derivation and the curse of dimensionality from scratch
- Design a production dimensionality reduction system with latency constraints
- Explain VAE latent space properties vs. standard autoencoder
- Confidently handle: "Why not just use PCA for everything?"
