PCA Dimensionality Reduction
import ReadingTime from '@site/src/components/ReadingTime';
:::note Interview Relevance - Very High PCA appears in nearly every ML interview that touches feature engineering, model efficiency, or data exploration. The eigendecomposition derivation, SVD connection, and the critical fit-on-train-only rule are standard MLE/DS interview questions. Expect at least one PCA question in any senior ML interview. :::
The Real Interview Moment
You are three months into a new role at a medical device company. The ML team has a sensor array that produces 2,000 measurements per second - temperature, pressure, vibration, acoustic emission, and current draw across 400 sensors. The company wants to predict bearing failures 30 minutes before they happen.
You load the data and the first thing you notice: the sensors are highly correlated. The 12 temperature sensors in the motor housing all move together within 0.1°C of each other. The 8 pressure sensors show with each other. You have 2,000 features but maybe 50 independent sources of variation.
You run PCA on a month of training data. 95% of the variance is explained by the first 47 components. You reduce from 2,000 to 47 features. Training time drops from 8 hours to 4 minutes. Model accuracy improves because you removed correlated noise. And as a bonus: reconstruction error on future sensor readings flags the anomalous states before your failure detector even sees them.
This lesson is about how that transformation works - from the geometric intuition, through the eigendecomposition math, all the way to the production engineering patterns you need to use it correctly.
Why This Exists - The Curse of Dimensionality
In high-dimensional spaces, machine learning breaks down in counterintuitive ways. In 1,000 dimensions, the volume of a unit hypersphere is essentially zero compared to the unit cube - almost all points cluster near the boundary. Distance metrics become meaningless: the ratio of the max to min distance between random points approaches 1 as . The nearest neighbor is barely nearer than any random point.
This "curse of dimensionality" has three practical consequences:
- More data required: the number of training examples needed grows exponentially with
- Overfitting: models fit noise in irrelevant dimensions
- Distance concentration: clustering and nearest-neighbor methods fail because distances stop discriminating
The insight behind PCA: real-world high-dimensional data does not fill the full -dimensional space uniformly. It lives near a low-dimensional linear subspace - a manifold of much lower intrinsic dimensionality. PCA finds that subspace.
Historical Context
PCA was invented by Karl Pearson in 1901 ("On lines and planes of closest fit to systems of points in space") and independently reformulated by Harold Hotelling in 1933 who connected it to the eigenvectors of the correlation matrix. The connection to SVD was established later and is the foundation of modern numerical implementations. Every production ML library (sklearn, NumPy, LAPACK) computes PCA via SVD for numerical stability reasons.
PCA as Variance Maximization - The Geometric Intuition
Imagine a cloud of points in 2D. The cloud is elongated - it stretches further along one direction than the other. PCA finds the axis of maximum stretch (the first principal component) and the perpendicular axis (the second component). When you project all points onto the first axis, you retain the most information about the original positions.
In dimensions: the first principal component is the direction along which the variance of the projected data is maximized:
where is the sample covariance matrix of the centered data.
Using Lagrange multipliers to enforce :
This gives - the first principal component is the eigenvector of the covariance matrix corresponding to its largest eigenvalue . The maximized variance is .
The -th principal component is the -th eigenvector (sorted by decreasing eigenvalue). The explained variance ratio of component :
The Covariance Matrix and Eigendecomposition
The sample covariance matrix is:
where is the mean-centered data matrix ().
Reconstruction formula: Given the top- components , the projection is:
The reconstruction (inverse transform) is:
The reconstruction error (total information lost by using components instead of ) is:
It equals the sum of discarded eigenvalues - a clean result that lets you choose by specifying an acceptable reconstruction error.
import numpy as np
from sklearn.preprocessing import StandardScaler
def pca_from_scratch(X: np.ndarray, n_components: int):
"""
PCA via eigendecomposition of the covariance matrix.
Numerically correct but less stable than SVD for ill-conditioned data.
"""
# Step 1: Center the data (CRITICAL - PCA requires zero-mean features)
X_c = X - X.mean(axis=0) # (n, d)
# Step 2: Sample covariance matrix
n = X.shape[0]
Sigma = (X_c.T @ X_c) / (n - 1) # (d, d)
# Step 3: Eigendecomposition
# np.linalg.eigh is faster and more stable than eig for symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(Sigma) # eigenvalues are real
# Step 4: Sort by descending eigenvalue
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Step 5: Select top-k components
W_k = eigenvectors[:, :n_components].T # (k, d) - rows are components
# Step 6: Explained variance ratio
evr = eigenvalues[:n_components] / eigenvalues.sum()
cumulative_evr = evr.cumsum()
# Step 7: Project data
Z = X_c @ W_k.T # (n, k)
# Step 8: Reconstruction error (Frobenius norm squared)
X_hat = Z @ W_k # (n, d) - reconstructed in centered space
recon_error = np.mean((X_c - X_hat) ** 2)
return {
'components': W_k, # (k, d)
'eigenvalues': eigenvalues[:n_components],
'explained_variance_ratio': evr,
'cumulative_evr': cumulative_evr,
'Z': Z, # (n, k) projected data
'reconstruction_error': recon_error,
}
# Test on correlated data with known structure
np.random.seed(42)
n, d = 1000, 20
# 3 latent factors drive all 20 features
latent_factors = np.random.randn(n, 3)
W_true = np.random.randn(3, d) # mixing matrix
X_correlated = latent_factors @ W_true + 0.2 * np.random.randn(n, d)
X_scaled = StandardScaler().fit_transform(X_correlated)
result = pca_from_scratch(X_scaled, n_components=5)
print("Eigenvalues (top 5):", result['eigenvalues'].round(4))
print("EVR (top 5): ", result['explained_variance_ratio'].round(4))
print("Cumulative EVR: ", result['cumulative_evr'].round(4))
print(f"Reconstruction MSE (5 components): {result['reconstruction_error']:.4f}")
SVD: The Numerically Stable Implementation
Computing the covariance matrix squares the condition number of . If has condition number , then has condition number . For near-collinear features, this causes catastrophic numerical errors in the eigendecomposition.
SVD is the correct approach: decompose directly:
where:
- : left singular vectors (sample coordinates)
- : diagonal matrix of singular values
- : right singular vectors - these are the principal components
Connection to eigendecomposition:
so eigenvalues of the covariance are , and eigenvectors are the columns of .
The projected data (scores) can be computed without explicitly forming :
where and are the top- left singular vectors and values.
def pca_svd(X: np.ndarray, n_components: int):
"""
PCA via SVD - numerically stable, preferred in production.
Equivalent to sklearn.decomposition.PCA internals.
"""
X_c = X - X.mean(axis=0)
n = X.shape[0]
# Economy SVD: only compute n_components singular vectors
# full_matrices=False: thin/economy SVD - O(n*d*min(n,d)) instead of O(n^2*d)
U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
# U: (n, min(n,d)), S: (min(n,d),), Vt: (min(n,d), d)
# Top-k components (rows of Vt)
components = Vt[:n_components] # (k, d)
# Eigenvalues from singular values
eigenvalues = (S[:n_components] ** 2) / (n - 1)
total_variance = (S ** 2).sum() / (n - 1)
evr = eigenvalues / total_variance
# Projection: U[:, :k] * S[:k] == X_c @ components.T
Z = U[:, :n_components] * S[:n_components] # (n, k)
# Reconstruction error
X_hat = Z @ components
recon_error_total = ((X_c - X_hat) ** 2).sum()
return {
'components': components,
'eigenvalues': eigenvalues,
'explained_variance_ratio': evr,
'Z': Z,
'reconstruction_error_total': recon_error_total,
}
# Verify SVD matches eigendecomposition result
result_svd = pca_svd(X_scaled, n_components=5)
print("SVD EVR:", result_svd['explained_variance_ratio'].round(4))
# Should match eigendecomposition result above
Sign convention: PCA components have an arbitrary sign flip - and are both valid. sklearn deterministically chooses the sign by making the largest-magnitude element positive. If you implement PCA from scratch and compare to sklearn, you may see sign flips - this is expected and correct.
Sklearn PCA: Production API
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
# ── Always use a Pipeline - prevents data leakage ──────────────────────────────
pipe = Pipeline([
('scaler', StandardScaler()), # fit on train, transform both
('pca', PCA(n_components=0.95, # keep 95% of variance (pass float for auto-k)
svd_solver='full', # 'randomized' for large d with small k
random_state=42)),
])
pipe.fit(X_train)
X_train_pca = pipe.transform(X_train)
X_test_pca = pipe.transform(X_test) # uses training mean/std/components
pca = pipe.named_steps['pca']
print(f"Original features: {X_train.shape[1]}")
print(f"Components kept: {pca.n_components_}")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.4f}")
# Access the components (principal directions in feature space)
print(f"Components shape: {pca.components_.shape}") # (k, d)
# Reconstruction (inverse transform back to original space)
X_reconstructed = pipe.inverse_transform(X_train_pca)
recon_error = np.mean((StandardScaler().fit_transform(X_train) - X_reconstructed) ** 2)
print(f"Reconstruction MSE: {recon_error:.4f}")
SVD solver selection:
| Solver | When to use | Time | Notes |
|---|---|---|---|
'full' | or need exact results | LAPACK full SVD | |
'randomized' | , large | Halko-Martinsson-Tropp | |
'arpack' | , sparse | ARPACK eigensolver | |
'auto' | Default | Picks above based on shape | Good default |
For a 10,000-sample, 10,000-feature dataset keeping 64 components: 'randomized' is faster than 'full'.
Scree Plot and Choosing the Number of Components
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
def plot_explained_variance(X_train: np.ndarray, max_components: int = None):
"""
Full scree plot + cumulative EVR to guide component selection.
Returns a dict with k values for common thresholds.
"""
X_scaled = StandardScaler().fit_transform(X_train)
max_k = min(X_train.shape[0], X_train.shape[1])
if max_components:
max_k = min(max_k, max_components)
pca_full = PCA(n_components=max_k, svd_solver='full').fit(X_scaled)
evr = pca_full.explained_variance_ratio_
cumulative = evr.cumsum()
# Find k for common thresholds
thresholds = {0.80: None, 0.90: None, 0.95: None, 0.99: None}
for t in thresholds:
thresholds[t] = int(np.argmax(cumulative >= t)) + 1
# Kaiser criterion: keep components with eigenvalue > 1
# (only meaningful when data is standardized)
eigenvalues = pca_full.explained_variance_ # = eigenvalues of cov matrix
kaiser_k = int((eigenvalues > 1.0).sum())
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Scree plot
axes[0].bar(range(1, len(evr)+1), evr, color='#3b82f6', alpha=0.7)
axes[0].plot(range(1, len(evr)+1), evr, 'o-', color='#1e40af', ms=3)
axes[0].axhline(y=0.01, color='#dc2626', linestyle='--', linewidth=1,
label='1% threshold')
axes[0].set_xlabel("Principal Component")
axes[0].set_ylabel("Explained Variance Ratio")
axes[0].set_title("Scree Plot")
axes[0].legend()
# Cumulative EVR
axes[1].plot(range(1, len(cumulative)+1), cumulative, 'o-',
color='#059669', ms=3, linewidth=1.5)
colors = {'80%': '#f59e0b', '90%': '#f97316', '95%': '#dc2626', '99%': '#7c3aed'}
for t, k in thresholds.items():
label = f"{int(t*100)}%"
c = colors[label]
axes[1].axhline(y=t, color=c, linestyle='--', linewidth=1, label=f'{label} @ k={k}')
axes[1].axvline(x=k, color=c, linestyle=':', linewidth=0.8, alpha=0.7)
axes[1].set_xlabel("Number of Components")
axes[1].set_ylabel("Cumulative Explained Variance")
axes[1].set_title("Cumulative EVR")
axes[1].legend(fontsize=8)
# Eigenvalue spectrum
axes[2].semilogy(range(1, len(eigenvalues)+1), eigenvalues, 'o-',
color='#7c3aed', ms=3, linewidth=1.5)
axes[2].axhline(y=1.0, color='#dc2626', linestyle='--', linewidth=1,
label=f'Kaiser criterion (k={kaiser_k})')
axes[2].set_xlabel("Principal Component")
axes[2].set_ylabel("Eigenvalue (log scale)")
axes[2].set_title("Eigenvalue Spectrum")
axes[2].legend()
plt.tight_layout()
plt.show()
print("Component selection guide:")
for t, k in thresholds.items():
print(f" {int(t*100)}% variance: k={k}")
print(f" Kaiser criterion (eigenvalue > 1): k={kaiser_k}")
return thresholds, kaiser_k
# Usage
thresholds, kaiser_k = plot_explained_variance(X_train, max_components=50)
Practical selection rules:
| Use case | Recommended threshold |
|---|---|
| 2D/3D visualization | Always 2 or 3, regardless of EVR |
| Preprocessing for supervised ML | 95% EVR - start here, tune |
| Compression / storage | Balance EVR vs storage budget |
| Anomaly detection | 99% EVR - need to capture rare normal patterns |
| Downstream clustering | 80–90% EVR - noise removal is the priority |
| Neural network input preprocessing | 95% EVR + whitening |
PCA Whitening
Whitening scales each principal component to have unit variance. After projection to , each component is divided by its standard deviation:
This produces a spherical (isotropic) distribution: all components have variance 1 and are uncorrelated. The prevents division by zero for near-zero eigenvalues and acts as regularization.
When to use whitening:
- Preprocessing for ICA (independent component analysis) - whitening is a required step
- Before training neural networks on high-dimensional inputs where feature scales differ wildly
- Before distance-based algorithms when you want equal weighting of all directions
When NOT to use whitening:
- Whitening amplifies noise: low-variance components (often noise) are boosted by dividing by small . If your data has many noise dimensions, whitening makes things worse.
from sklearn.decomposition import PCA
import numpy as np
# sklearn supports whitening natively
pca_white = PCA(n_components=50, whiten=True)
X_whitened = pca_white.fit_transform(X_train_scaled)
print("Whitened component stds:", X_whitened.std(axis=0)[:5].round(4))
# Should all be ~1.0
# Without whitening, components have very different scales
pca_nowhite = PCA(n_components=50, whiten=False)
X_nowhite = pca_nowhite.fit_transform(X_train_scaled)
print("Unwhitened component stds:", X_nowhite.std(axis=0)[:5].round(4))
# Decreasing: first component has the most variance
# Manual whitening (equivalent)
Z = pca_nowhite.transform(X_train_scaled)
lambdas = pca_nowhite.explained_variance_ # eigenvalues = variances of components
epsilon = 1e-8
Z_whitened_manual = Z / np.sqrt(lambdas + epsilon)
Kernel PCA for Non-Linear Structure
Standard PCA finds linear subspaces. If your data lives on a curved manifold (concentric circles, swiss roll, non-linear clusters), PCA will project it onto a flat subspace that does not capture the manifold's structure.
Kernel PCA applies the kernel trick: instead of computing PCA in the -dimensional input space, it implicitly computes PCA in a (possibly infinite-dimensional) feature space defined by a kernel function:
The kernel matrix with is centered and then its top- eigenvectors give the projected coordinates.
Common kernels:
- RBF (Gaussian): - infinite-dimensional feature space, captures any smooth non-linearity
- Polynomial: - polynomial features up to degree
- Sigmoid: - neural network connection
from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_moons, make_circles
import matplotlib.pyplot as plt
import numpy as np
fig, axes = plt.subplots(2, 4, figsize=(20, 9))
datasets = [
('Moons', make_moons(n_samples=400, noise=0.08, random_state=42)),
('Circles', make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=42)),
]
for row, (name, (X_d, y_d)) in enumerate(datasets):
# Original data
axes[row, 0].scatter(X_d[:, 0], X_d[:, 1], c=y_d, cmap='bwr', s=15, alpha=0.8)
axes[row, 0].set_title(f"{name} - Original", fontsize=10)
axes[row, 0].axis('off')
# Linear PCA
from sklearn.decomposition import PCA as LinearPCA
X_lin = LinearPCA(n_components=1).fit_transform(X_d)
axes[row, 1].scatter(X_lin, np.zeros_like(X_lin), c=y_d, cmap='bwr', s=15, alpha=0.8)
axes[row, 1].set_title("Linear PCA (1D)\n- classes overlap", fontsize=9)
axes[row, 1].axis('off')
# RBF Kernel PCA
kpca_rbf = KernelPCA(n_components=2, kernel='rbf', gamma=15,
fit_inverse_transform=True, random_state=42)
X_rbf = kpca_rbf.fit_transform(X_d)
axes[row, 2].scatter(X_rbf[:, 0], X_rbf[:, 1], c=y_d, cmap='bwr', s=15, alpha=0.8)
axes[row, 2].set_title("RBF Kernel PCA\n- classes separate!", fontsize=9)
axes[row, 2].axis('off')
# Polynomial Kernel PCA
kpca_poly = KernelPCA(n_components=2, kernel='poly', degree=3, gamma=1,
coef0=1, random_state=42)
X_poly = kpca_poly.fit_transform(X_d)
axes[row, 3].scatter(X_poly[:, 0], X_poly[:, 1], c=y_d, cmap='bwr', s=15, alpha=0.8)
axes[row, 3].set_title("Poly Kernel PCA\ndegree=3", fontsize=9)
axes[row, 3].axis('off')
plt.suptitle("Kernel PCA vs Linear PCA on Non-Linear Data", fontsize=14)
plt.tight_layout()
plt.show()
Kernel PCA limitations:
- Requires computing the kernel matrix - memory and time
- Cannot transform new points without the training kernel matrix (unless
fit_inverse_transform=True) - Impractical for . For large-scale non-linear reduction, use UMAP instead.
Incremental PCA for Data That Doesn't Fit in Memory
Standard PCA loads all data into RAM to compute the covariance matrix. For very large datasets, use IncrementalPCA:
from sklearn.decomposition import IncrementalPCA
import numpy as np
def fit_incremental_pca(data_generator, n_components: int,
n_samples_total: int, batch_size: int = 1000):
"""
Fit PCA incrementally on data too large to fit in memory.
data_generator: yields batches of shape (batch_size, d)
"""
ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size)
# Process batches one at a time
for batch_idx, X_batch in enumerate(data_generator):
ipca.partial_fit(X_batch)
if (batch_idx + 1) % 10 == 0:
n_processed = (batch_idx + 1) * batch_size
print(f"Processed {n_processed}/{n_samples_total} samples "
f"({100*n_processed/n_samples_total:.1f}%)")
print(f"EVR (top {n_components}): {ipca.explained_variance_ratio_[:5].round(4)}")
print(f"Cumulative EVR: {ipca.explained_variance_ratio_.cumsum()[-1]:.4f}")
return ipca
# Simulate a large dataset loaded in chunks
def large_data_generator(n_total: int, d: int, batch_size: int):
n_batches = n_total // batch_size
for _ in range(n_batches):
yield np.random.randn(batch_size, d)
n_total = 500_000 # 500K samples
d = 200 # 200 features
batch_size = 2_000
n_components = 50
gen = large_data_generator(n_total, d, batch_size)
ipca = fit_incremental_pca(gen, n_components=n_components,
n_samples_total=n_total, batch_size=batch_size)
# Transform new batches
X_new_batch = np.random.randn(1000, d)
X_new_reduced = ipca.transform(X_new_batch)
print(f"Transformed shape: {X_new_reduced.shape}")
Biplot: Visualizing Variables in PCA Space
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
def pca_biplot(X: np.ndarray, y: np.ndarray = None,
feature_names: list = None, label_names: dict = None):
"""
Biplot: samples in PCA space + original feature loading arrows.
Feature arrows show which original features drive each component.
"""
scaler = StandardScaler()
pca_2d = PCA(n_components=2, svd_solver='full')
X_2d = pca_2d.fit_transform(scaler.fit_transform(X))
# Loadings = how much each feature contributes to each component
# pca_2d.components_ shape: (2, d)
# loadings[i, :] = contribution of feature i to [PC1, PC2]
loadings = pca_2d.components_.T # (d, 2)
if feature_names is None:
feature_names = [f"X{i}" for i in range(X.shape[1])]
fig, ax = plt.subplots(figsize=(11, 8))
# Plot samples
if y is not None:
unique_labels = np.unique(y)
colors = plt.cm.tab10(np.linspace(0, 1, len(unique_labels)))
for label, color in zip(unique_labels, colors):
mask = y == label
lname = label_names.get(label, str(label)) if label_names else str(label)
ax.scatter(X_2d[mask, 0], X_2d[mask, 1], color=color, s=15,
alpha=0.6, label=lname)
ax.legend(title="Class", fontsize=8, loc='upper left')
else:
ax.scatter(X_2d[:, 0], X_2d[:, 1], s=10, alpha=0.4, color='#3b82f6')
# Plot loading arrows (scaled for visibility)
scale = 2.0 * np.abs(X_2d).max() / np.abs(loadings).max()
for i, (name, loading) in enumerate(zip(feature_names, loadings)):
ax.annotate(
'', xy=(loading[0]*scale, loading[1]*scale), xytext=(0, 0),
arrowprops=dict(arrowstyle='->', color='#dc2626', lw=1.5)
)
ax.text(loading[0]*scale*1.1, loading[1]*scale*1.1, name,
fontsize=8, color='#7f1d1d', ha='center', va='center')
ax.axhline(0, color='gray', linestyle='--', linewidth=0.5, alpha=0.5)
ax.axvline(0, color='gray', linestyle='--', linewidth=0.5, alpha=0.5)
pc1_evr = pca_2d.explained_variance_ratio_[0]
pc2_evr = pca_2d.explained_variance_ratio_[1]
ax.set_xlabel(f"PC1 ({pc1_evr:.1%} variance)", fontsize=12)
ax.set_ylabel(f"PC2 ({pc2_evr:.1%} variance)", fontsize=12)
ax.set_title(f"PCA Biplot - {pc1_evr + pc2_evr:.1%} total variance shown", fontsize=13)
return fig, loadings
PCA Flow: Architecture Overview
Production: PCA Anomaly Detection
PCA is one of the most effective anomaly detection methods for tabular data: fit on normal samples only, compute reconstruction error for each new point, flag high-error points as anomalies.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, precision_recall_curve
import matplotlib.pyplot as plt
class PCAAnomalyDetector:
"""
Anomaly detection via PCA reconstruction error.
Fit on normal data, flag high reconstruction error as anomalous.
"""
def __init__(self, n_components: int = None, variance_threshold: float = 0.95,
threshold_percentile: float = 99.5):
"""
n_components: if None, auto-select to retain variance_threshold of variance
threshold_percentile: percentile of training errors used as anomaly threshold
"""
self.n_components = n_components
self.variance_threshold = variance_threshold
self.threshold_percentile = threshold_percentile
self.pca = None
self.scaler = StandardScaler()
self.threshold_ = None
self.train_error_mean_ = None
self.train_error_std_ = None
def fit(self, X_normal: np.ndarray) -> 'PCAAnomalyDetector':
"""Fit exclusively on known-normal data."""
X_scaled = self.scaler.fit_transform(X_normal)
if self.n_components is None:
self.pca = PCA(n_components=self.variance_threshold,
svd_solver='full', random_state=42)
else:
self.pca = PCA(n_components=self.n_components,
svd_solver='full', random_state=42)
self.pca.fit(X_scaled)
# Compute training reconstruction errors
errors = self._reconstruction_error(X_scaled)
self.train_error_mean_ = errors.mean()
self.train_error_std_ = errors.std()
self.threshold_ = np.percentile(errors, self.threshold_percentile)
print(f"PCA components: {self.pca.n_components_}")
print(f"Variance retained: {self.pca.explained_variance_ratio_.sum():.4f}")
print(f"Train error - mean: {self.train_error_mean_:.4f}, "
f"std: {self.train_error_std_:.4f}")
print(f"Anomaly threshold ({self.threshold_percentile}th pct): {self.threshold_:.4f}")
return self
def _reconstruction_error(self, X_scaled: np.ndarray) -> np.ndarray:
"""Per-sample reconstruction MSE."""
Z = self.pca.transform(X_scaled)
X_recon = self.pca.inverse_transform(Z)
return np.mean((X_scaled - X_recon) ** 2, axis=1)
def anomaly_score(self, X: np.ndarray) -> np.ndarray:
"""Higher = more anomalous (raw reconstruction error)."""
return self._reconstruction_error(self.scaler.transform(X))
def predict(self, X: np.ndarray) -> np.ndarray:
"""True = anomaly."""
return self.anomaly_score(X) > self.threshold_
def evaluate(self, X_normal_test: np.ndarray,
X_anomaly_test: np.ndarray) -> dict:
"""Evaluate detector on labeled test sets."""
normal_scores = self.anomaly_score(X_normal_test)
anomaly_scores = self.anomaly_score(X_anomaly_test)
y_true = np.array([0]*len(X_normal_test) + [1]*len(X_anomaly_test))
y_score = np.concatenate([normal_scores, anomaly_scores])
auc = roc_auc_score(y_true, y_score)
# Precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Score distribution
axes[0].hist(normal_scores, bins=50, alpha=0.6, color='#3b82f6',
label=f'Normal (n={len(X_normal_test)})', density=True)
axes[0].hist(anomaly_scores, bins=50, alpha=0.6, color='#ef4444',
label=f'Anomaly (n={len(X_anomaly_test)})', density=True)
axes[0].axvline(self.threshold_, color='black', linestyle='--',
linewidth=2, label=f'Threshold={self.threshold_:.4f}')
axes[0].set_xlabel("Reconstruction Error (MSE)")
axes[0].set_ylabel("Density")
axes[0].set_title(f"Score Distribution - AUC={auc:.4f}")
axes[0].legend()
# Precision-Recall
axes[1].plot(recall, precision, 'o-', color='#7c3aed', ms=3, linewidth=2)
axes[1].set_xlabel("Recall")
axes[1].set_ylabel("Precision")
axes[1].set_title("Precision-Recall Curve")
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return {
'auc_roc': auc,
'normal_error_mean': normal_scores.mean(),
'anomaly_error_mean': anomaly_scores.mean(),
'detection_rate': (anomaly_scores > self.threshold_).mean(),
'false_positive_rate': (normal_scores > self.threshold_).mean(),
}
# Usage
from sklearn.datasets import make_classification
import numpy as np
# Simulate normal + anomalous data
X_all, _ = make_classification(n_samples=12000, n_features=30,
n_informative=15, n_redundant=10, random_state=42)
X_normal_train = X_all[:10000]
X_normal_test = X_all[10000:11000]
# Anomalies: samples from different distribution
rng = np.random.default_rng(42)
X_anomaly = rng.normal(loc=3.0, scale=0.5, size=(1000, 30))
detector = PCAAnomalyDetector(variance_threshold=0.95, threshold_percentile=99.5)
detector.fit(X_normal_train)
metrics = detector.evaluate(X_normal_test, X_anomaly)
print("\nEvaluation:")
for k, v in metrics.items():
print(f" {k}: {v:.4f}")
:::danger Fit PCA on Training Data Only
NEVER call pca.fit() or pca.fit_transform() on validation or test data. Doing so leaks information about the test distribution into your model - a form of data leakage. The scaler and PCA must be fit exclusively on training data, then transform() applied to all other splits. Always use a sklearn Pipeline with fit() on train and transform() on test to enforce this automatically.
:::
:::warning PCA Assumes Linear Structure PCA finds linear subspaces. If your data lives on a non-linear manifold (Swiss roll, concentric circles, word embeddings), PCA will project it poorly. Check: if the cumulative EVR is still below 70% after retaining many components, your data likely has non-linear structure. Solutions: Kernel PCA (small ), UMAP or t-SNE (visualization), or autoencoders (deep non-linear reduction). :::
YouTube Resources
| Channel | Video Title | Why Watch |
|---|---|---|
| StatQuest with Josh Starmer | PCA Step-by-Step | Best visual explanation of eigenvectors and variance maximization |
| 3Blue1Brown | Essence of Linear Algebra (eigenvectors) | Geometric intuition for eigendecomposition |
| ritvikmath | PCA from Scratch | Walks through the matrix algebra clearly |
| Andrej Karpathy (CS231n) | PCA and Whitening lecture | ML engineering perspective, whitening walkthrough |
Interview Q&A
Q1: Explain PCA as an eigendecomposition problem. What do the eigenvalues represent?
PCA seeks the directions (principal components) along which the data has maximum variance. The sample covariance matrix captures pairwise feature co-variances. Its eigenvectors are the principal components - orthogonal directions in the -dimensional feature space. The -th eigenvector is the direction of the -th most variance. The corresponding eigenvalue is the variance of the data along - i.e., the variance of the projected coordinates . The explained variance ratio is . Choosing the top- eigenvectors minimizes reconstruction error: .
Q2: Why is PCA computed via SVD rather than direct eigendecomposition of the covariance matrix?
Computing the covariance matrix squares the condition number of : if has condition number , then has . For nearly-collinear features (common in sensor data, text embeddings), this causes catastrophic cancellation in the eigendecomposition. SVD decomposes directly - - without squaring the condition number. The right singular vectors are the principal components, and eigenvalues are . Every production library (sklearn, scipy, MATLAB) uses SVD for this reason. For very large datasets with , randomized SVD (Halko et al.) computes approximate SVD in instead of .
Q3: You have 10,000-dimensional text embeddings and want to reduce to 64 dimensions. How do you do this efficiently and how do you evaluate quality?
Use PCA(n_components=64, svd_solver='randomized'). Randomized SVD computes approximate SVD in - roughly 150× faster than full SVD for . Always fit on the training corpus only, then transform() new documents. Evaluate quality: (1) explained variance ratio - if 64 components explain less than 50% of variance, the embedding space has more structure; consider more components. (2) Downstream task quality - measure accuracy or recall@K on a labeled validation set before and after reduction. (3) Reconstruction error on held-out normal points. Note: PCA is preferred over UMAP for production reduction because it is parametric - you can transform new documents at inference with a simple matrix multiply.
Q4: What does PCA whitening do and when should you use it?
Whitening divides each projected component by its standard deviation: . This produces components with unit variance - a spherical (isotropic) distribution. Use it for: (1) ICA - whitening is mathematically required before running ICA. (2) Neural network preprocessing when features have vastly different scales after PCA. (3) Distance-based methods where you want all components to contribute equally. Do NOT use it when data has many low-variance dimensions (often noise), because whitening amplifies noise by dividing by small eigenvalues. The regularization (default in sklearn's whiten=True) mitigates this but does not eliminate it.
Q5: What is the difference between standard PCA and Kernel PCA? When would you choose each?
Standard PCA finds linear subspaces - directions in the original input space. It fails if data has non-linear structure (concentric circles, swiss roll). Kernel PCA implicitly maps data to a high-dimensional feature space via and applies PCA there. The key insight: you never compute explicitly - you only need the kernel matrix . With RBF kernel , it maps to an infinite-dimensional space and can separate any non-linearly-separable structure. Use standard PCA when: is large (millions), need new-point inference, data has approximately linear structure. Use Kernel PCA when: data has non-linear structure, . For large-scale non-linear reduction, use UMAP - it scales better and preserves global structure.
Q6: A model you trained with PCA preprocessing is degrading in production. What do you check?
First, check if the data distribution has shifted: fit a new PCA on recent production data and compare the explained variance ratios and principal component directions to the original. If the directions have rotated significantly ( similarity between old and new ), the data distribution has shifted and you need to refit. Second, monitor reconstruction errors over time: if average reconstruction error on production data increases, new patterns are appearing that the original PCA does not represent well - a sign of distribution shift. Third, check the scaler: if the mean or standard deviation of raw features has shifted, your scaler is miscalibrated. Solution: retrain PCA (and scaler) on a sliding window of recent data, and re-validate downstream model performance.
PCA Limitations: When to Use Something Else
PCA is a powerful baseline but fails in predictable ways:
1. Non-linear structure: PCA finds the best linear subspace. If the data lives on a curved manifold (swiss roll, concentric circles), the linear projection folds the manifold rather than unrolling it. Use Kernel PCA, t-SNE, UMAP, or autoencoders.
2. Discrete or categorical features: PCA assumes continuous numeric features. Never apply PCA directly to one-hot encoded categoricals or binary features - use MCA (Multiple Correspondence Analysis) or other appropriate methods.
3. Very high cardinality with sparse structure: For 100,000-dimensional sparse text features (bag-of-words), forming the covariance matrix is impossible. Use truncated SVD (sklearn.decomposition.TruncatedSVD) which operates directly on the sparse matrix without centering.
4. Interpretability requirement: PCA components are linear combinations of all original features - they are "eigengenes" or "eigenfaces" but rarely interpretable in domain terms. If you need to identify which specific features matter most, use supervised feature selection or SHAP.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
# ── Truncated SVD for sparse text data ───────────────────────────────────────
# TruncatedSVD ≈ PCA but does NOT center the data first
# Centering a sparse matrix makes it dense - destroys sparsity, kills memory
# For text: TruncatedSVD on TF-IDF = Latent Semantic Analysis (LSA)
corpus = [
"machine learning and deep learning for NLP",
"support vector machines for classification",
"neural networks and backpropagation",
"random forests and gradient boosting",
"natural language processing with transformers",
]
tfidf = TfidfVectorizer(max_features=1000, sublinear_tf=True)
X_sparse = tfidf.fit_transform(corpus) # sparse: (5, 1000)
print(f"Sparse matrix: {X_sparse.shape}, density={X_sparse.nnz / (X_sparse.shape[0]*X_sparse.shape[1]):.3f}")
# LSA: TruncatedSVD on TF-IDF
svd = TruncatedSVD(n_components=50, n_iter=7, random_state=42)
X_lsa = svd.fit_transform(X_sparse) # (5, 50) - dense latent representation
print(f"LSA reduced shape: {X_lsa.shape}")
print(f"Explained variance: {svd.explained_variance_ratio_.sum():.4f}")
# For production information retrieval:
# 1. fit TruncatedSVD on training corpus
# 2. transform query TF-IDF vector with same SVD
# 3. cosine similarity in LSA space (faster + better than raw TF-IDF cosine)
PCA for Eigenfaces: A Classic Application
Eigenfaces (Turk & Pentland, 1991) is the canonical demonstration that PCA can learn meaningful visual representations without any labels. Each face image is treated as a high-dimensional vector; PCA finds the directions of maximum variance across the face image space - these "eigenfaces" form a basis for the face subspace.
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
def eigenfaces_demo(n_components: int = 150, n_images: int = 500):
"""
Eigenfaces: PCA on face images.
Each principal component = 'eigenface' (a ghostly average face direction).
New faces reconstructed as linear combinations of eigenfaces.
"""
# Load Labeled Faces in the Wild
lfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X = lfw.data # (n_images, h*w) pixel matrices flattened
y = lfw.target
h, w = lfw.images.shape[1:]
print(f"Dataset: {X.shape[0]} faces, {X.shape[1]} pixels each ({h}×{w})")
# Fit PCA - find eigenfaces
pca = PCA(n_components=n_components, svd_solver='randomized', random_state=42)
X_reduced = pca.fit_transform(X)
evr_cumulative = pca.explained_variance_ratio_.cumsum()
print(f"{n_components} eigenfaces explain {evr_cumulative[-1]:.1%} of variance")
# Plot first 20 eigenfaces
eigenfaces = pca.components_.reshape((n_components, h, w))
fig, axes = plt.subplots(4, 5, figsize=(14, 11))
for i, ax in enumerate(axes.flat):
ax.imshow(eigenfaces[i], cmap='gray', interpolation='nearest')
ax.set_title(f"PC {i+1}", fontsize=9)
ax.axis('off')
plt.suptitle("Top 20 Eigenfaces (Principal Components of Face Images)", fontsize=13)
plt.tight_layout()
plt.show()
# Visualize reconstruction at different numbers of components
face_idx = 42
original_face = X[face_idx]
fig, axes = plt.subplots(1, 6, figsize=(18, 4))
axes[0].imshow(original_face.reshape(h, w), cmap='gray')
axes[0].set_title("Original")
axes[0].axis('off')
for i, k in enumerate([5, 20, 50, 100, 150]):
pca_k = PCA(n_components=k, svd_solver='randomized', random_state=42)
pca_k.fit(X)
face_k = pca_k.inverse_transform(pca_k.transform([original_face]))
axes[i+1].imshow(face_k.reshape(h, w), cmap='gray')
axes[i+1].set_title(f"k={k}\n({pca_k.explained_variance_ratio_.sum():.0%})")
axes[i+1].axis('off')
plt.suptitle(f"Face Reconstruction at Different k Values", fontsize=12)
plt.tight_layout()
plt.show()
return pca, X_reduced
# Run (requires internet connection to download LFW dataset)
# pca, X_reduced = eigenfaces_demo(n_components=150)
Production Monitoring: PCA Component Drift
In production, the principal components of your data can drift as the underlying distribution shifts. Monitoring this provides early warning before model performance degrades:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
class PCAdriftMonitor:
"""
Monitor PCA component drift in production.
Alert when principal directions or explained variance ratios shift significantly.
"""
def __init__(self, n_components: int = 20, cosine_threshold: float = 0.95):
self.n_components = n_components
self.cosine_threshold = cosine_threshold
self.baseline_pca_ = None
self.baseline_scaler_ = None
self.baseline_evr_ = None
def fit_baseline(self, X: np.ndarray) -> 'PCAdriftMonitor':
self.baseline_scaler_ = StandardScaler().fit(X)
X_scaled = self.baseline_scaler_.transform(X)
self.baseline_pca_ = PCA(n_components=self.n_components).fit(X_scaled)
self.baseline_evr_ = self.baseline_pca_.explained_variance_ratio_.copy()
print(f"Baseline fitted: {X.shape[0]} samples, "
f"{self.n_components} components")
print(f"Cumulative EVR: {self.baseline_evr_.sum():.4f}")
return self
def check_drift(self, X_new: np.ndarray, period_label: str = "Production") -> dict:
X_scaled = self.baseline_scaler_.transform(X_new)
# Fit PCA on new data
new_pca = PCA(n_components=self.n_components).fit(X_scaled)
# Compare principal components via cosine similarity
cosine_sims = []
for k in range(self.n_components):
v_base = self.baseline_pca_.components_[k]
v_new = new_pca.components_[k]
# Absolute cosine sim (sign flip is arbitrary)
cos_sim = abs(np.dot(v_base, v_new) / (np.linalg.norm(v_base) * np.linalg.norm(v_new)))
cosine_sims.append(cos_sim)
min_cosine = min(cosine_sims[:5]) # focus on top 5 components
# Compare EVR
evr_shift = np.abs(new_pca.explained_variance_ratio_ - self.baseline_evr_).mean()
# Reconstruction error shift
X_recon_base = self.baseline_pca_.inverse_transform(
self.baseline_pca_.transform(X_scaled)
)
recon_error = np.mean((X_scaled - X_recon_base) ** 2)
drift_detected = (
min_cosine < self.cosine_threshold or
evr_shift > 0.02 or
recon_error > 2 * (1 - self.baseline_evr_.sum())
)
report = {
'period': period_label,
'min_cosine_similarity_top5': min_cosine,
'evr_shift_mean': evr_shift,
'reconstruction_error': recon_error,
'drift_detected': drift_detected,
}
status = "DRIFT ALERT" if drift_detected else "OK"
print(f"[{status}] {period_label}: "
f"cos_sim={min_cosine:.4f}, "
f"evr_shift={evr_shift:.4f}, "
f"recon_err={recon_error:.4f}")
return report
:::tip 🎮 Interactive Playground
Visualize this concept: Try the PCA Explorer demo on the EngineersOfAI Playground - no code required.
:::
