Skip to main content

PCA Dimensionality Reduction

import ReadingTime from '@site/src/components/ReadingTime';

:::note Interview Relevance - Very High PCA appears in nearly every ML interview that touches feature engineering, model efficiency, or data exploration. The eigendecomposition derivation, SVD connection, and the critical fit-on-train-only rule are standard MLE/DS interview questions. Expect at least one PCA question in any senior ML interview. :::

The Real Interview Moment

You are three months into a new role at a medical device company. The ML team has a sensor array that produces 2,000 measurements per second - temperature, pressure, vibration, acoustic emission, and current draw across 400 sensors. The company wants to predict bearing failures 30 minutes before they happen.

You load the data and the first thing you notice: the sensors are highly correlated. The 12 temperature sensors in the motor housing all move together within 0.1°C of each other. The 8 pressure sensors show r=0.97r = 0.97 with each other. You have 2,000 features but maybe 50 independent sources of variation.

You run PCA on a month of training data. 95% of the variance is explained by the first 47 components. You reduce from 2,000 to 47 features. Training time drops from 8 hours to 4 minutes. Model accuracy improves because you removed correlated noise. And as a bonus: reconstruction error on future sensor readings flags the anomalous states before your failure detector even sees them.

This lesson is about how that transformation works - from the geometric intuition, through the eigendecomposition math, all the way to the production engineering patterns you need to use it correctly.

Why This Exists - The Curse of Dimensionality

In high-dimensional spaces, machine learning breaks down in counterintuitive ways. In 1,000 dimensions, the volume of a unit hypersphere is essentially zero compared to the unit cube - almost all points cluster near the boundary. Distance metrics become meaningless: the ratio of the max to min distance between random points approaches 1 as dd \to \infty. The nearest neighbor is barely nearer than any random point.

This "curse of dimensionality" has three practical consequences:

  1. More data required: the number of training examples needed grows exponentially with dd
  2. Overfitting: models fit noise in irrelevant dimensions
  3. Distance concentration: clustering and nearest-neighbor methods fail because distances stop discriminating

The insight behind PCA: real-world high-dimensional data does not fill the full dd-dimensional space uniformly. It lives near a low-dimensional linear subspace - a manifold of much lower intrinsic dimensionality. PCA finds that subspace.

Historical Context

PCA was invented by Karl Pearson in 1901 ("On lines and planes of closest fit to systems of points in space") and independently reformulated by Harold Hotelling in 1933 who connected it to the eigenvectors of the correlation matrix. The connection to SVD was established later and is the foundation of modern numerical implementations. Every production ML library (sklearn, NumPy, LAPACK) computes PCA via SVD for numerical stability reasons.

PCA as Variance Maximization - The Geometric Intuition

Imagine a cloud of points in 2D. The cloud is elongated - it stretches further along one direction than the other. PCA finds the axis of maximum stretch (the first principal component) and the perpendicular axis (the second component). When you project all points onto the first axis, you retain the most information about the original positions.

In dd dimensions: the first principal component w1\mathbf{w}_1 is the direction along which the variance of the projected data is maximized:

w1=argmaxw=1Var(Xw)=argmaxw=1wTΣw\mathbf{w}_1 = \arg\max_{\|\mathbf{w}\|=1} \operatorname{Var}(\mathbf{X}\mathbf{w}) = \arg\max_{\|\mathbf{w}\|=1} \mathbf{w}^T \mathbf{\Sigma} \mathbf{w}

where Σ\mathbf{\Sigma} is the d×dd \times d sample covariance matrix of the centered data.

Using Lagrange multipliers to enforce w=1\|\mathbf{w}\|=1:

w[wTΣwλ(wTw1)]=2Σw2λw=0\frac{\partial}{\partial \mathbf{w}}\left[\mathbf{w}^T \mathbf{\Sigma} \mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)\right] = 2\mathbf{\Sigma}\mathbf{w} - 2\lambda\mathbf{w} = \mathbf{0}

This gives Σw1=λ1w1\mathbf{\Sigma}\mathbf{w}_1 = \lambda_1 \mathbf{w}_1 - the first principal component is the eigenvector of the covariance matrix corresponding to its largest eigenvalue λ1\lambda_1. The maximized variance is λ1\lambda_1.

The kk-th principal component is the kk-th eigenvector (sorted by decreasing eigenvalue). The explained variance ratio of component kk:

EVRk=λkj=1dλj\text{EVR}_k = \frac{\lambda_k}{\sum_{j=1}^{d} \lambda_j}

The Covariance Matrix and Eigendecomposition

The sample covariance matrix is:

Σ=1n1XcTXc\mathbf{\Sigma} = \frac{1}{n-1}\mathbf{X}_c^T \mathbf{X}_c

where Xc=XXˉ\mathbf{X}_c = \mathbf{X} - \bar{\mathbf{X}} is the mean-centered data matrix (n×dn \times d).

Reconstruction formula: Given the top-kk components WkRk×d\mathbf{W}_k \in \mathbb{R}^{k \times d}, the projection is:

Z=XcWkTRn×k\mathbf{Z} = \mathbf{X}_c \mathbf{W}_k^T \in \mathbb{R}^{n \times k}

The reconstruction (inverse transform) is:

X^c=ZWk=XcWkTWk\hat{\mathbf{X}}_c = \mathbf{Z} \mathbf{W}_k = \mathbf{X}_c \mathbf{W}_k^T \mathbf{W}_k

The reconstruction error (total information lost by using kk components instead of dd) is:

Error=XcX^cF2=j=k+1dλj\text{Error} = \|\mathbf{X}_c - \hat{\mathbf{X}}_c\|_F^2 = \sum_{j=k+1}^{d} \lambda_j

It equals the sum of discarded eigenvalues - a clean result that lets you choose kk by specifying an acceptable reconstruction error.

import numpy as np
from sklearn.preprocessing import StandardScaler

def pca_from_scratch(X: np.ndarray, n_components: int):
"""
PCA via eigendecomposition of the covariance matrix.
Numerically correct but less stable than SVD for ill-conditioned data.
"""
# Step 1: Center the data (CRITICAL - PCA requires zero-mean features)
X_c = X - X.mean(axis=0) # (n, d)

# Step 2: Sample covariance matrix
n = X.shape[0]
Sigma = (X_c.T @ X_c) / (n - 1) # (d, d)

# Step 3: Eigendecomposition
# np.linalg.eigh is faster and more stable than eig for symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(Sigma) # eigenvalues are real

# Step 4: Sort by descending eigenvalue
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Step 5: Select top-k components
W_k = eigenvectors[:, :n_components].T # (k, d) - rows are components

# Step 6: Explained variance ratio
evr = eigenvalues[:n_components] / eigenvalues.sum()
cumulative_evr = evr.cumsum()

# Step 7: Project data
Z = X_c @ W_k.T # (n, k)

# Step 8: Reconstruction error (Frobenius norm squared)
X_hat = Z @ W_k # (n, d) - reconstructed in centered space
recon_error = np.mean((X_c - X_hat) ** 2)

return {
'components': W_k, # (k, d)
'eigenvalues': eigenvalues[:n_components],
'explained_variance_ratio': evr,
'cumulative_evr': cumulative_evr,
'Z': Z, # (n, k) projected data
'reconstruction_error': recon_error,
}


# Test on correlated data with known structure
np.random.seed(42)
n, d = 1000, 20

# 3 latent factors drive all 20 features
latent_factors = np.random.randn(n, 3)
W_true = np.random.randn(3, d) # mixing matrix
X_correlated = latent_factors @ W_true + 0.2 * np.random.randn(n, d)
X_scaled = StandardScaler().fit_transform(X_correlated)

result = pca_from_scratch(X_scaled, n_components=5)

print("Eigenvalues (top 5):", result['eigenvalues'].round(4))
print("EVR (top 5): ", result['explained_variance_ratio'].round(4))
print("Cumulative EVR: ", result['cumulative_evr'].round(4))
print(f"Reconstruction MSE (5 components): {result['reconstruction_error']:.4f}")

SVD: The Numerically Stable Implementation

Computing the covariance matrix Σ=XcTXc/(n1)\mathbf{\Sigma} = \mathbf{X}_c^T\mathbf{X}_c/(n-1) squares the condition number of Xc\mathbf{X}_c. If Xc\mathbf{X}_c has condition number κ\kappa, then Σ\mathbf{\Sigma} has condition number κ2\kappa^2. For near-collinear features, this causes catastrophic numerical errors in the eigendecomposition.

SVD is the correct approach: decompose Xc\mathbf{X}_c directly:

Xc=USVT\mathbf{X}_c = \mathbf{U} \mathbf{S} \mathbf{V}^T

where:

  • URn×r\mathbf{U} \in \mathbb{R}^{n \times r}: left singular vectors (sample coordinates)
  • SRr×r\mathbf{S} \in \mathbb{R}^{r \times r}: diagonal matrix of singular values s1s2sr0s_1 \geq s_2 \geq \ldots \geq s_r \geq 0
  • VRd×r\mathbf{V} \in \mathbb{R}^{d \times r}: right singular vectors - these are the principal components
  • r=min(n,d)r = \min(n, d)

Connection to eigendecomposition: XcTXc=VSTUTUSVT=VS2VT\mathbf{X}_c^T\mathbf{X}_c = \mathbf{V}\mathbf{S}^T\mathbf{U}^T\mathbf{U}\mathbf{S}\mathbf{V}^T = \mathbf{V}\mathbf{S}^2\mathbf{V}^T

so eigenvalues of the covariance are λk=sk2/(n1)\lambda_k = s_k^2/(n-1), and eigenvectors are the columns of V\mathbf{V}.

The projected data (scores) can be computed without explicitly forming VTXc\mathbf{V}^T\mathbf{X}_c: Z=UkSk\mathbf{Z} = \mathbf{U}_k \mathbf{S}_k

where Uk\mathbf{U}_k and Sk\mathbf{S}_k are the top-kk left singular vectors and values.

def pca_svd(X: np.ndarray, n_components: int):
"""
PCA via SVD - numerically stable, preferred in production.
Equivalent to sklearn.decomposition.PCA internals.
"""
X_c = X - X.mean(axis=0)
n = X.shape[0]

# Economy SVD: only compute n_components singular vectors
# full_matrices=False: thin/economy SVD - O(n*d*min(n,d)) instead of O(n^2*d)
U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
# U: (n, min(n,d)), S: (min(n,d),), Vt: (min(n,d), d)

# Top-k components (rows of Vt)
components = Vt[:n_components] # (k, d)

# Eigenvalues from singular values
eigenvalues = (S[:n_components] ** 2) / (n - 1)
total_variance = (S ** 2).sum() / (n - 1)
evr = eigenvalues / total_variance

# Projection: U[:, :k] * S[:k] == X_c @ components.T
Z = U[:, :n_components] * S[:n_components] # (n, k)

# Reconstruction error
X_hat = Z @ components
recon_error_total = ((X_c - X_hat) ** 2).sum()

return {
'components': components,
'eigenvalues': eigenvalues,
'explained_variance_ratio': evr,
'Z': Z,
'reconstruction_error_total': recon_error_total,
}


# Verify SVD matches eigendecomposition result
result_svd = pca_svd(X_scaled, n_components=5)
print("SVD EVR:", result_svd['explained_variance_ratio'].round(4))
# Should match eigendecomposition result above

Sign convention: PCA components have an arbitrary sign flip - w\mathbf{w} and w-\mathbf{w} are both valid. sklearn deterministically chooses the sign by making the largest-magnitude element positive. If you implement PCA from scratch and compare to sklearn, you may see sign flips - this is expected and correct.

Sklearn PCA: Production API

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# ── Always use a Pipeline - prevents data leakage ──────────────────────────────
pipe = Pipeline([
('scaler', StandardScaler()), # fit on train, transform both
('pca', PCA(n_components=0.95, # keep 95% of variance (pass float for auto-k)
svd_solver='full', # 'randomized' for large d with small k
random_state=42)),
])

pipe.fit(X_train)
X_train_pca = pipe.transform(X_train)
X_test_pca = pipe.transform(X_test) # uses training mean/std/components

pca = pipe.named_steps['pca']
print(f"Original features: {X_train.shape[1]}")
print(f"Components kept: {pca.n_components_}")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.4f}")

# Access the components (principal directions in feature space)
print(f"Components shape: {pca.components_.shape}") # (k, d)

# Reconstruction (inverse transform back to original space)
X_reconstructed = pipe.inverse_transform(X_train_pca)
recon_error = np.mean((StandardScaler().fit_transform(X_train) - X_reconstructed) ** 2)
print(f"Reconstruction MSE: {recon_error:.4f}")

SVD solver selection:

SolverWhen to useTimeNotes
'full'd<500d < 500 or need exact resultsO(min(n,d)3)O(\min(n,d)^3)LAPACK full SVD
'randomized'kmin(n,d)k \ll \min(n,d), large ddO(ndk)O(n \cdot d \cdot k)Halko-Martinsson-Tropp
'arpack'kmin(n,d)k \ll \min(n,d), sparse XXO(ndk)O(n \cdot d \cdot k)ARPACK eigensolver
'auto'DefaultPicks above based on shapeGood default

For a 10,000-sample, 10,000-feature dataset keeping 64 components: 'randomized' is 100×\sim 100\times faster than 'full'.

Scree Plot and Choosing the Number of Components

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def plot_explained_variance(X_train: np.ndarray, max_components: int = None):
"""
Full scree plot + cumulative EVR to guide component selection.
Returns a dict with k values for common thresholds.
"""
X_scaled = StandardScaler().fit_transform(X_train)

max_k = min(X_train.shape[0], X_train.shape[1])
if max_components:
max_k = min(max_k, max_components)

pca_full = PCA(n_components=max_k, svd_solver='full').fit(X_scaled)
evr = pca_full.explained_variance_ratio_
cumulative = evr.cumsum()

# Find k for common thresholds
thresholds = {0.80: None, 0.90: None, 0.95: None, 0.99: None}
for t in thresholds:
thresholds[t] = int(np.argmax(cumulative >= t)) + 1

# Kaiser criterion: keep components with eigenvalue > 1
# (only meaningful when data is standardized)
eigenvalues = pca_full.explained_variance_ # = eigenvalues of cov matrix
kaiser_k = int((eigenvalues > 1.0).sum())

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Scree plot
axes[0].bar(range(1, len(evr)+1), evr, color='#3b82f6', alpha=0.7)
axes[0].plot(range(1, len(evr)+1), evr, 'o-', color='#1e40af', ms=3)
axes[0].axhline(y=0.01, color='#dc2626', linestyle='--', linewidth=1,
label='1% threshold')
axes[0].set_xlabel("Principal Component")
axes[0].set_ylabel("Explained Variance Ratio")
axes[0].set_title("Scree Plot")
axes[0].legend()

# Cumulative EVR
axes[1].plot(range(1, len(cumulative)+1), cumulative, 'o-',
color='#059669', ms=3, linewidth=1.5)
colors = {'80%': '#f59e0b', '90%': '#f97316', '95%': '#dc2626', '99%': '#7c3aed'}
for t, k in thresholds.items():
label = f"{int(t*100)}%"
c = colors[label]
axes[1].axhline(y=t, color=c, linestyle='--', linewidth=1, label=f'{label} @ k={k}')
axes[1].axvline(x=k, color=c, linestyle=':', linewidth=0.8, alpha=0.7)
axes[1].set_xlabel("Number of Components")
axes[1].set_ylabel("Cumulative Explained Variance")
axes[1].set_title("Cumulative EVR")
axes[1].legend(fontsize=8)

# Eigenvalue spectrum
axes[2].semilogy(range(1, len(eigenvalues)+1), eigenvalues, 'o-',
color='#7c3aed', ms=3, linewidth=1.5)
axes[2].axhline(y=1.0, color='#dc2626', linestyle='--', linewidth=1,
label=f'Kaiser criterion (k={kaiser_k})')
axes[2].set_xlabel("Principal Component")
axes[2].set_ylabel("Eigenvalue (log scale)")
axes[2].set_title("Eigenvalue Spectrum")
axes[2].legend()

plt.tight_layout()
plt.show()

print("Component selection guide:")
for t, k in thresholds.items():
print(f" {int(t*100)}% variance: k={k}")
print(f" Kaiser criterion (eigenvalue > 1): k={kaiser_k}")

return thresholds, kaiser_k

# Usage
thresholds, kaiser_k = plot_explained_variance(X_train, max_components=50)

Practical selection rules:

Use caseRecommended threshold
2D/3D visualizationAlways 2 or 3, regardless of EVR
Preprocessing for supervised ML95% EVR - start here, tune
Compression / storageBalance EVR vs storage budget
Anomaly detection99% EVR - need to capture rare normal patterns
Downstream clustering80–90% EVR - noise removal is the priority
Neural network input preprocessing95% EVR + whitening

PCA Whitening

Whitening scales each principal component to have unit variance. After projection to zkz_k, each component is divided by its standard deviation:

z~k=zkλk+ϵ\tilde{z}_k = \frac{z_k}{\sqrt{\lambda_k + \epsilon}}

This produces a spherical (isotropic) distribution: all components have variance 1 and are uncorrelated. The ϵ\epsilon prevents division by zero for near-zero eigenvalues and acts as regularization.

When to use whitening:

  • Preprocessing for ICA (independent component analysis) - whitening is a required step
  • Before training neural networks on high-dimensional inputs where feature scales differ wildly
  • Before distance-based algorithms when you want equal weighting of all directions

When NOT to use whitening:

  • Whitening amplifies noise: low-variance components (often noise) are boosted by dividing by small λk\sqrt{\lambda_k}. If your data has many noise dimensions, whitening makes things worse.
from sklearn.decomposition import PCA
import numpy as np

# sklearn supports whitening natively
pca_white = PCA(n_components=50, whiten=True)
X_whitened = pca_white.fit_transform(X_train_scaled)

print("Whitened component stds:", X_whitened.std(axis=0)[:5].round(4))
# Should all be ~1.0

# Without whitening, components have very different scales
pca_nowhite = PCA(n_components=50, whiten=False)
X_nowhite = pca_nowhite.fit_transform(X_train_scaled)
print("Unwhitened component stds:", X_nowhite.std(axis=0)[:5].round(4))
# Decreasing: first component has the most variance

# Manual whitening (equivalent)
Z = pca_nowhite.transform(X_train_scaled)
lambdas = pca_nowhite.explained_variance_ # eigenvalues = variances of components
epsilon = 1e-8
Z_whitened_manual = Z / np.sqrt(lambdas + epsilon)

Kernel PCA for Non-Linear Structure

Standard PCA finds linear subspaces. If your data lives on a curved manifold (concentric circles, swiss roll, non-linear clusters), PCA will project it onto a flat subspace that does not capture the manifold's structure.

Kernel PCA applies the kernel trick: instead of computing PCA in the dd-dimensional input space, it implicitly computes PCA in a (possibly infinite-dimensional) feature space ϕ:RdH\phi: \mathbb{R}^d \to \mathcal{H} defined by a kernel function:

K(xi,xj)=ϕ(xi),ϕ(xj)HK(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}

The kernel matrix KRn×n\mathbf{K} \in \mathbb{R}^{n \times n} with Kij=K(xi,xj)K_{ij} = K(x_i, x_j) is centered and then its top-kk eigenvectors give the projected coordinates.

Common kernels:

  • RBF (Gaussian): K(x,y)=exp(γxy2)K(x, y) = \exp(-\gamma \|x-y\|^2) - infinite-dimensional feature space, captures any smooth non-linearity
  • Polynomial: K(x,y)=(γxTy+r)pK(x, y) = (\gamma x^T y + r)^p - polynomial features up to degree pp
  • Sigmoid: K(x,y)=tanh(γxTy+r)K(x, y) = \tanh(\gamma x^T y + r) - neural network connection
from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_moons, make_circles
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 4, figsize=(20, 9))

datasets = [
('Moons', make_moons(n_samples=400, noise=0.08, random_state=42)),
('Circles', make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=42)),
]

for row, (name, (X_d, y_d)) in enumerate(datasets):
# Original data
axes[row, 0].scatter(X_d[:, 0], X_d[:, 1], c=y_d, cmap='bwr', s=15, alpha=0.8)
axes[row, 0].set_title(f"{name} - Original", fontsize=10)
axes[row, 0].axis('off')

# Linear PCA
from sklearn.decomposition import PCA as LinearPCA
X_lin = LinearPCA(n_components=1).fit_transform(X_d)
axes[row, 1].scatter(X_lin, np.zeros_like(X_lin), c=y_d, cmap='bwr', s=15, alpha=0.8)
axes[row, 1].set_title("Linear PCA (1D)\n- classes overlap", fontsize=9)
axes[row, 1].axis('off')

# RBF Kernel PCA
kpca_rbf = KernelPCA(n_components=2, kernel='rbf', gamma=15,
fit_inverse_transform=True, random_state=42)
X_rbf = kpca_rbf.fit_transform(X_d)
axes[row, 2].scatter(X_rbf[:, 0], X_rbf[:, 1], c=y_d, cmap='bwr', s=15, alpha=0.8)
axes[row, 2].set_title("RBF Kernel PCA\n- classes separate!", fontsize=9)
axes[row, 2].axis('off')

# Polynomial Kernel PCA
kpca_poly = KernelPCA(n_components=2, kernel='poly', degree=3, gamma=1,
coef0=1, random_state=42)
X_poly = kpca_poly.fit_transform(X_d)
axes[row, 3].scatter(X_poly[:, 0], X_poly[:, 1], c=y_d, cmap='bwr', s=15, alpha=0.8)
axes[row, 3].set_title("Poly Kernel PCA\ndegree=3", fontsize=9)
axes[row, 3].axis('off')

plt.suptitle("Kernel PCA vs Linear PCA on Non-Linear Data", fontsize=14)
plt.tight_layout()
plt.show()

Kernel PCA limitations:

  • Requires computing the n×nn \times n kernel matrix - O(n2)O(n^2) memory and O(n2d)O(n^2 d) time
  • Cannot transform new points without the training kernel matrix (unless fit_inverse_transform=True)
  • Impractical for n>10,000n > 10{,}000. For large-scale non-linear reduction, use UMAP instead.

Incremental PCA for Data That Doesn't Fit in Memory

Standard PCA loads all data into RAM to compute the covariance matrix. For very large datasets, use IncrementalPCA:

from sklearn.decomposition import IncrementalPCA
import numpy as np

def fit_incremental_pca(data_generator, n_components: int,
n_samples_total: int, batch_size: int = 1000):
"""
Fit PCA incrementally on data too large to fit in memory.

data_generator: yields batches of shape (batch_size, d)
"""
ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size)

# Process batches one at a time
for batch_idx, X_batch in enumerate(data_generator):
ipca.partial_fit(X_batch)
if (batch_idx + 1) % 10 == 0:
n_processed = (batch_idx + 1) * batch_size
print(f"Processed {n_processed}/{n_samples_total} samples "
f"({100*n_processed/n_samples_total:.1f}%)")

print(f"EVR (top {n_components}): {ipca.explained_variance_ratio_[:5].round(4)}")
print(f"Cumulative EVR: {ipca.explained_variance_ratio_.cumsum()[-1]:.4f}")

return ipca


# Simulate a large dataset loaded in chunks
def large_data_generator(n_total: int, d: int, batch_size: int):
n_batches = n_total // batch_size
for _ in range(n_batches):
yield np.random.randn(batch_size, d)

n_total = 500_000 # 500K samples
d = 200 # 200 features
batch_size = 2_000
n_components = 50

gen = large_data_generator(n_total, d, batch_size)
ipca = fit_incremental_pca(gen, n_components=n_components,
n_samples_total=n_total, batch_size=batch_size)

# Transform new batches
X_new_batch = np.random.randn(1000, d)
X_new_reduced = ipca.transform(X_new_batch)
print(f"Transformed shape: {X_new_reduced.shape}")

Biplot: Visualizing Variables in PCA Space

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

def pca_biplot(X: np.ndarray, y: np.ndarray = None,
feature_names: list = None, label_names: dict = None):
"""
Biplot: samples in PCA space + original feature loading arrows.
Feature arrows show which original features drive each component.
"""
scaler = StandardScaler()
pca_2d = PCA(n_components=2, svd_solver='full')

X_2d = pca_2d.fit_transform(scaler.fit_transform(X))

# Loadings = how much each feature contributes to each component
# pca_2d.components_ shape: (2, d)
# loadings[i, :] = contribution of feature i to [PC1, PC2]
loadings = pca_2d.components_.T # (d, 2)

if feature_names is None:
feature_names = [f"X{i}" for i in range(X.shape[1])]

fig, ax = plt.subplots(figsize=(11, 8))

# Plot samples
if y is not None:
unique_labels = np.unique(y)
colors = plt.cm.tab10(np.linspace(0, 1, len(unique_labels)))
for label, color in zip(unique_labels, colors):
mask = y == label
lname = label_names.get(label, str(label)) if label_names else str(label)
ax.scatter(X_2d[mask, 0], X_2d[mask, 1], color=color, s=15,
alpha=0.6, label=lname)
ax.legend(title="Class", fontsize=8, loc='upper left')
else:
ax.scatter(X_2d[:, 0], X_2d[:, 1], s=10, alpha=0.4, color='#3b82f6')

# Plot loading arrows (scaled for visibility)
scale = 2.0 * np.abs(X_2d).max() / np.abs(loadings).max()
for i, (name, loading) in enumerate(zip(feature_names, loadings)):
ax.annotate(
'', xy=(loading[0]*scale, loading[1]*scale), xytext=(0, 0),
arrowprops=dict(arrowstyle='->', color='#dc2626', lw=1.5)
)
ax.text(loading[0]*scale*1.1, loading[1]*scale*1.1, name,
fontsize=8, color='#7f1d1d', ha='center', va='center')

ax.axhline(0, color='gray', linestyle='--', linewidth=0.5, alpha=0.5)
ax.axvline(0, color='gray', linestyle='--', linewidth=0.5, alpha=0.5)

pc1_evr = pca_2d.explained_variance_ratio_[0]
pc2_evr = pca_2d.explained_variance_ratio_[1]
ax.set_xlabel(f"PC1 ({pc1_evr:.1%} variance)", fontsize=12)
ax.set_ylabel(f"PC2 ({pc2_evr:.1%} variance)", fontsize=12)
ax.set_title(f"PCA Biplot - {pc1_evr + pc2_evr:.1%} total variance shown", fontsize=13)

return fig, loadings

PCA Flow: Architecture Overview

Production: PCA Anomaly Detection

PCA is one of the most effective anomaly detection methods for tabular data: fit on normal samples only, compute reconstruction error for each new point, flag high-error points as anomalies.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, precision_recall_curve
import matplotlib.pyplot as plt

class PCAAnomalyDetector:
"""
Anomaly detection via PCA reconstruction error.
Fit on normal data, flag high reconstruction error as anomalous.
"""

def __init__(self, n_components: int = None, variance_threshold: float = 0.95,
threshold_percentile: float = 99.5):
"""
n_components: if None, auto-select to retain variance_threshold of variance
threshold_percentile: percentile of training errors used as anomaly threshold
"""
self.n_components = n_components
self.variance_threshold = variance_threshold
self.threshold_percentile = threshold_percentile
self.pca = None
self.scaler = StandardScaler()
self.threshold_ = None
self.train_error_mean_ = None
self.train_error_std_ = None

def fit(self, X_normal: np.ndarray) -> 'PCAAnomalyDetector':
"""Fit exclusively on known-normal data."""
X_scaled = self.scaler.fit_transform(X_normal)

if self.n_components is None:
self.pca = PCA(n_components=self.variance_threshold,
svd_solver='full', random_state=42)
else:
self.pca = PCA(n_components=self.n_components,
svd_solver='full', random_state=42)

self.pca.fit(X_scaled)

# Compute training reconstruction errors
errors = self._reconstruction_error(X_scaled)
self.train_error_mean_ = errors.mean()
self.train_error_std_ = errors.std()
self.threshold_ = np.percentile(errors, self.threshold_percentile)

print(f"PCA components: {self.pca.n_components_}")
print(f"Variance retained: {self.pca.explained_variance_ratio_.sum():.4f}")
print(f"Train error - mean: {self.train_error_mean_:.4f}, "
f"std: {self.train_error_std_:.4f}")
print(f"Anomaly threshold ({self.threshold_percentile}th pct): {self.threshold_:.4f}")

return self

def _reconstruction_error(self, X_scaled: np.ndarray) -> np.ndarray:
"""Per-sample reconstruction MSE."""
Z = self.pca.transform(X_scaled)
X_recon = self.pca.inverse_transform(Z)
return np.mean((X_scaled - X_recon) ** 2, axis=1)

def anomaly_score(self, X: np.ndarray) -> np.ndarray:
"""Higher = more anomalous (raw reconstruction error)."""
return self._reconstruction_error(self.scaler.transform(X))

def predict(self, X: np.ndarray) -> np.ndarray:
"""True = anomaly."""
return self.anomaly_score(X) > self.threshold_

def evaluate(self, X_normal_test: np.ndarray,
X_anomaly_test: np.ndarray) -> dict:
"""Evaluate detector on labeled test sets."""
normal_scores = self.anomaly_score(X_normal_test)
anomaly_scores = self.anomaly_score(X_anomaly_test)

y_true = np.array([0]*len(X_normal_test) + [1]*len(X_anomaly_test))
y_score = np.concatenate([normal_scores, anomaly_scores])

auc = roc_auc_score(y_true, y_score)

# Precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Score distribution
axes[0].hist(normal_scores, bins=50, alpha=0.6, color='#3b82f6',
label=f'Normal (n={len(X_normal_test)})', density=True)
axes[0].hist(anomaly_scores, bins=50, alpha=0.6, color='#ef4444',
label=f'Anomaly (n={len(X_anomaly_test)})', density=True)
axes[0].axvline(self.threshold_, color='black', linestyle='--',
linewidth=2, label=f'Threshold={self.threshold_:.4f}')
axes[0].set_xlabel("Reconstruction Error (MSE)")
axes[0].set_ylabel("Density")
axes[0].set_title(f"Score Distribution - AUC={auc:.4f}")
axes[0].legend()

# Precision-Recall
axes[1].plot(recall, precision, 'o-', color='#7c3aed', ms=3, linewidth=2)
axes[1].set_xlabel("Recall")
axes[1].set_ylabel("Precision")
axes[1].set_title("Precision-Recall Curve")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

return {
'auc_roc': auc,
'normal_error_mean': normal_scores.mean(),
'anomaly_error_mean': anomaly_scores.mean(),
'detection_rate': (anomaly_scores > self.threshold_).mean(),
'false_positive_rate': (normal_scores > self.threshold_).mean(),
}


# Usage
from sklearn.datasets import make_classification
import numpy as np

# Simulate normal + anomalous data
X_all, _ = make_classification(n_samples=12000, n_features=30,
n_informative=15, n_redundant=10, random_state=42)
X_normal_train = X_all[:10000]
X_normal_test = X_all[10000:11000]

# Anomalies: samples from different distribution
rng = np.random.default_rng(42)
X_anomaly = rng.normal(loc=3.0, scale=0.5, size=(1000, 30))

detector = PCAAnomalyDetector(variance_threshold=0.95, threshold_percentile=99.5)
detector.fit(X_normal_train)
metrics = detector.evaluate(X_normal_test, X_anomaly)

print("\nEvaluation:")
for k, v in metrics.items():
print(f" {k}: {v:.4f}")

:::danger Fit PCA on Training Data Only NEVER call pca.fit() or pca.fit_transform() on validation or test data. Doing so leaks information about the test distribution into your model - a form of data leakage. The scaler and PCA must be fit exclusively on training data, then transform() applied to all other splits. Always use a sklearn Pipeline with fit() on train and transform() on test to enforce this automatically. :::

:::warning PCA Assumes Linear Structure PCA finds linear subspaces. If your data lives on a non-linear manifold (Swiss roll, concentric circles, word embeddings), PCA will project it poorly. Check: if the cumulative EVR is still below 70% after retaining many components, your data likely has non-linear structure. Solutions: Kernel PCA (small nn), UMAP or t-SNE (visualization), or autoencoders (deep non-linear reduction). :::

YouTube Resources

ChannelVideo TitleWhy Watch
StatQuest with Josh StarmerPCA Step-by-StepBest visual explanation of eigenvectors and variance maximization
3Blue1BrownEssence of Linear Algebra (eigenvectors)Geometric intuition for eigendecomposition
ritvikmathPCA from ScratchWalks through the matrix algebra clearly
Andrej Karpathy (CS231n)PCA and Whitening lectureML engineering perspective, whitening walkthrough

Interview Q&A

Q1: Explain PCA as an eigendecomposition problem. What do the eigenvalues represent?

PCA seeks the directions (principal components) along which the data has maximum variance. The sample covariance matrix Σ=XcTXc/(n1)\mathbf{\Sigma} = \mathbf{X}_c^T\mathbf{X}_c/(n-1) captures pairwise feature co-variances. Its eigenvectors are the principal components - orthogonal directions in the dd-dimensional feature space. The kk-th eigenvector wk\mathbf{w}_k is the direction of the kk-th most variance. The corresponding eigenvalue λk\lambda_k is the variance of the data along wk\mathbf{w}_k - i.e., the variance of the projected coordinates Xcwk\mathbf{X}_c \mathbf{w}_k. The explained variance ratio is λk/jλj\lambda_k / \sum_j \lambda_j. Choosing the top-kk eigenvectors minimizes reconstruction error: XcX^cF2=j=k+1dλj\|\mathbf{X}_c - \hat{\mathbf{X}}_c\|_F^2 = \sum_{j=k+1}^d \lambda_j.

Q2: Why is PCA computed via SVD rather than direct eigendecomposition of the covariance matrix?

Computing the covariance matrix Σ=XcTXc/(n1)\mathbf{\Sigma} = \mathbf{X}_c^T\mathbf{X}_c/(n-1) squares the condition number of Xc\mathbf{X}_c: if Xc\mathbf{X}_c has condition number κ\kappa, then Σ\mathbf{\Sigma} has κ2\kappa^2. For nearly-collinear features (common in sensor data, text embeddings), this causes catastrophic cancellation in the eigendecomposition. SVD decomposes Xc\mathbf{X}_c directly - Xc=USVT\mathbf{X}_c = \mathbf{U}\mathbf{S}\mathbf{V}^T - without squaring the condition number. The right singular vectors V\mathbf{V} are the principal components, and eigenvalues are λk=sk2/(n1)\lambda_k = s_k^2/(n-1). Every production library (sklearn, scipy, MATLAB) uses SVD for this reason. For very large datasets with kdk \ll d, randomized SVD (Halko et al.) computes approximate SVD in O(ndk)O(ndk) instead of O(min(n,d)3)O(\min(n,d)^3).

Q3: You have 10,000-dimensional text embeddings and want to reduce to 64 dimensions. How do you do this efficiently and how do you evaluate quality?

Use PCA(n_components=64, svd_solver='randomized'). Randomized SVD computes approximate SVD in O(nd64)O(n \cdot d \cdot 64) - roughly 150× faster than full SVD for d=10,000d=10{,}000. Always fit on the training corpus only, then transform() new documents. Evaluate quality: (1) explained variance ratio - if 64 components explain less than 50% of variance, the embedding space has more structure; consider more components. (2) Downstream task quality - measure accuracy or recall@K on a labeled validation set before and after reduction. (3) Reconstruction error on held-out normal points. Note: PCA is preferred over UMAP for production reduction because it is parametric - you can transform new documents at inference with a simple matrix multiply.

Q4: What does PCA whitening do and when should you use it?

Whitening divides each projected component by its standard deviation: z~k=zk/λk+ϵ\tilde{z}_k = z_k / \sqrt{\lambda_k + \epsilon}. This produces components with unit variance - a spherical (isotropic) distribution. Use it for: (1) ICA - whitening is mathematically required before running ICA. (2) Neural network preprocessing when features have vastly different scales after PCA. (3) Distance-based methods where you want all components to contribute equally. Do NOT use it when data has many low-variance dimensions (often noise), because whitening amplifies noise by dividing by small eigenvalues. The ϵ\epsilon regularization (default in sklearn's whiten=True) mitigates this but does not eliminate it.

Q5: What is the difference between standard PCA and Kernel PCA? When would you choose each?

Standard PCA finds linear subspaces - directions in the original input space. It fails if data has non-linear structure (concentric circles, swiss roll). Kernel PCA implicitly maps data to a high-dimensional feature space via ϕ(x)\phi(x) and applies PCA there. The key insight: you never compute ϕ(x)\phi(x) explicitly - you only need the kernel matrix Kij=k(xi,xj)K_{ij} = k(x_i, x_j). With RBF kernel k(x,y)=exp(γxy2)k(x,y) = \exp(-\gamma\|x-y\|^2), it maps to an infinite-dimensional space and can separate any non-linearly-separable structure. Use standard PCA when: nn is large (millions), need new-point inference, data has approximately linear structure. Use Kernel PCA when: data has non-linear structure, n5,000n \leq 5{,}000. For large-scale non-linear reduction, use UMAP - it scales better and preserves global structure.

Q6: A model you trained with PCA preprocessing is degrading in production. What do you check?

First, check if the data distribution has shifted: fit a new PCA on recent production data and compare the explained variance ratios and principal component directions to the original. If the directions have rotated significantly (cos\cos similarity between old and new w1<0.9\mathbf{w}_1 < 0.9), the data distribution has shifted and you need to refit. Second, monitor reconstruction errors over time: if average reconstruction error on production data increases, new patterns are appearing that the original PCA does not represent well - a sign of distribution shift. Third, check the scaler: if the mean or standard deviation of raw features has shifted, your scaler is miscalibrated. Solution: retrain PCA (and scaler) on a sliding window of recent data, and re-validate downstream model performance.

PCA Limitations: When to Use Something Else

PCA is a powerful baseline but fails in predictable ways:

1. Non-linear structure: PCA finds the best linear subspace. If the data lives on a curved manifold (swiss roll, concentric circles), the linear projection folds the manifold rather than unrolling it. Use Kernel PCA, t-SNE, UMAP, or autoencoders.

2. Discrete or categorical features: PCA assumes continuous numeric features. Never apply PCA directly to one-hot encoded categoricals or binary features - use MCA (Multiple Correspondence Analysis) or other appropriate methods.

3. Very high cardinality with sparse structure: For 100,000-dimensional sparse text features (bag-of-words), forming the 100,000×100,000100{,}000 \times 100{,}000 covariance matrix is impossible. Use truncated SVD (sklearn.decomposition.TruncatedSVD) which operates directly on the sparse matrix without centering.

4. Interpretability requirement: PCA components are linear combinations of all original features - they are "eigengenes" or "eigenfaces" but rarely interpretable in domain terms. If you need to identify which specific features matter most, use supervised feature selection or SHAP.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# ── Truncated SVD for sparse text data ───────────────────────────────────────
# TruncatedSVD ≈ PCA but does NOT center the data first
# Centering a sparse matrix makes it dense - destroys sparsity, kills memory
# For text: TruncatedSVD on TF-IDF = Latent Semantic Analysis (LSA)

corpus = [
"machine learning and deep learning for NLP",
"support vector machines for classification",
"neural networks and backpropagation",
"random forests and gradient boosting",
"natural language processing with transformers",
]

tfidf = TfidfVectorizer(max_features=1000, sublinear_tf=True)
X_sparse = tfidf.fit_transform(corpus) # sparse: (5, 1000)

print(f"Sparse matrix: {X_sparse.shape}, density={X_sparse.nnz / (X_sparse.shape[0]*X_sparse.shape[1]):.3f}")

# LSA: TruncatedSVD on TF-IDF
svd = TruncatedSVD(n_components=50, n_iter=7, random_state=42)
X_lsa = svd.fit_transform(X_sparse) # (5, 50) - dense latent representation

print(f"LSA reduced shape: {X_lsa.shape}")
print(f"Explained variance: {svd.explained_variance_ratio_.sum():.4f}")

# For production information retrieval:
# 1. fit TruncatedSVD on training corpus
# 2. transform query TF-IDF vector with same SVD
# 3. cosine similarity in LSA space (faster + better than raw TF-IDF cosine)

PCA for Eigenfaces: A Classic Application

Eigenfaces (Turk & Pentland, 1991) is the canonical demonstration that PCA can learn meaningful visual representations without any labels. Each face image is treated as a high-dimensional vector; PCA finds the directions of maximum variance across the face image space - these "eigenfaces" form a basis for the face subspace.

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

def eigenfaces_demo(n_components: int = 150, n_images: int = 500):
"""
Eigenfaces: PCA on face images.
Each principal component = 'eigenface' (a ghostly average face direction).
New faces reconstructed as linear combinations of eigenfaces.
"""
# Load Labeled Faces in the Wild
lfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X = lfw.data # (n_images, h*w) pixel matrices flattened
y = lfw.target
h, w = lfw.images.shape[1:]

print(f"Dataset: {X.shape[0]} faces, {X.shape[1]} pixels each ({h}×{w})")

# Fit PCA - find eigenfaces
pca = PCA(n_components=n_components, svd_solver='randomized', random_state=42)
X_reduced = pca.fit_transform(X)

evr_cumulative = pca.explained_variance_ratio_.cumsum()
print(f"{n_components} eigenfaces explain {evr_cumulative[-1]:.1%} of variance")

# Plot first 20 eigenfaces
eigenfaces = pca.components_.reshape((n_components, h, w))

fig, axes = plt.subplots(4, 5, figsize=(14, 11))
for i, ax in enumerate(axes.flat):
ax.imshow(eigenfaces[i], cmap='gray', interpolation='nearest')
ax.set_title(f"PC {i+1}", fontsize=9)
ax.axis('off')
plt.suptitle("Top 20 Eigenfaces (Principal Components of Face Images)", fontsize=13)
plt.tight_layout()
plt.show()

# Visualize reconstruction at different numbers of components
face_idx = 42
original_face = X[face_idx]

fig, axes = plt.subplots(1, 6, figsize=(18, 4))
axes[0].imshow(original_face.reshape(h, w), cmap='gray')
axes[0].set_title("Original")
axes[0].axis('off')

for i, k in enumerate([5, 20, 50, 100, 150]):
pca_k = PCA(n_components=k, svd_solver='randomized', random_state=42)
pca_k.fit(X)
face_k = pca_k.inverse_transform(pca_k.transform([original_face]))
axes[i+1].imshow(face_k.reshape(h, w), cmap='gray')
axes[i+1].set_title(f"k={k}\n({pca_k.explained_variance_ratio_.sum():.0%})")
axes[i+1].axis('off')

plt.suptitle(f"Face Reconstruction at Different k Values", fontsize=12)
plt.tight_layout()
plt.show()

return pca, X_reduced

# Run (requires internet connection to download LFW dataset)
# pca, X_reduced = eigenfaces_demo(n_components=150)

Production Monitoring: PCA Component Drift

In production, the principal components of your data can drift as the underlying distribution shifts. Monitoring this provides early warning before model performance degrades:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

class PCAdriftMonitor:
"""
Monitor PCA component drift in production.
Alert when principal directions or explained variance ratios shift significantly.
"""

def __init__(self, n_components: int = 20, cosine_threshold: float = 0.95):
self.n_components = n_components
self.cosine_threshold = cosine_threshold
self.baseline_pca_ = None
self.baseline_scaler_ = None
self.baseline_evr_ = None

def fit_baseline(self, X: np.ndarray) -> 'PCAdriftMonitor':
self.baseline_scaler_ = StandardScaler().fit(X)
X_scaled = self.baseline_scaler_.transform(X)
self.baseline_pca_ = PCA(n_components=self.n_components).fit(X_scaled)
self.baseline_evr_ = self.baseline_pca_.explained_variance_ratio_.copy()
print(f"Baseline fitted: {X.shape[0]} samples, "
f"{self.n_components} components")
print(f"Cumulative EVR: {self.baseline_evr_.sum():.4f}")
return self

def check_drift(self, X_new: np.ndarray, period_label: str = "Production") -> dict:
X_scaled = self.baseline_scaler_.transform(X_new)

# Fit PCA on new data
new_pca = PCA(n_components=self.n_components).fit(X_scaled)

# Compare principal components via cosine similarity
cosine_sims = []
for k in range(self.n_components):
v_base = self.baseline_pca_.components_[k]
v_new = new_pca.components_[k]
# Absolute cosine sim (sign flip is arbitrary)
cos_sim = abs(np.dot(v_base, v_new) / (np.linalg.norm(v_base) * np.linalg.norm(v_new)))
cosine_sims.append(cos_sim)

min_cosine = min(cosine_sims[:5]) # focus on top 5 components

# Compare EVR
evr_shift = np.abs(new_pca.explained_variance_ratio_ - self.baseline_evr_).mean()

# Reconstruction error shift
X_recon_base = self.baseline_pca_.inverse_transform(
self.baseline_pca_.transform(X_scaled)
)
recon_error = np.mean((X_scaled - X_recon_base) ** 2)

drift_detected = (
min_cosine < self.cosine_threshold or
evr_shift > 0.02 or
recon_error > 2 * (1 - self.baseline_evr_.sum())
)

report = {
'period': period_label,
'min_cosine_similarity_top5': min_cosine,
'evr_shift_mean': evr_shift,
'reconstruction_error': recon_error,
'drift_detected': drift_detected,
}

status = "DRIFT ALERT" if drift_detected else "OK"
print(f"[{status}] {period_label}: "
f"cos_sim={min_cosine:.4f}, "
f"evr_shift={evr_shift:.4f}, "
f"recon_err={recon_error:.4f}")

return report

:::tip 🎮 Interactive Playground

Visualize this concept: Try the PCA Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.