Skip to main content

PCA from Linear Algebra - Eigendecomposition of Covariance Matrices

Reading time: ~26 minutes | Level: Mathematical Foundations → ML Engineering

PCA is not a machine learning algorithm. It is the direct application of eigendecomposition to a covariance matrix. If you do not understand eigenvalues, you do not understand PCA - you are just calling sklearn.decomposition.PCA.

This matters more than you might think. When PCA "fails" - when your features still look correlated after projection, when the scree plot shows no clear elbow, when t-SNE shows structure that PCA misses - you need to diagnose the problem from the underlying math, not from the API documentation.

In ML, PCA appears everywhere:

  • Feature preprocessing before linear models (reduces noise, removes multicollinearity)
  • Dimensionality reduction before clustering (k-means in PCA space)
  • Eigenfaces for face recognition (still competitive for small datasets)
  • Visualization of high-dimensional embeddings (before t-SNE became dominant)
  • Whitening as a preprocessing step for ICA and other methods
  • Low-rank approximation for data compression

This lesson derives PCA from first principles and connects every step to the linear algebra you already know.

What You Will Learn

  • What PCA is trying to do: the variance maximization objective
  • The covariance matrix: what Σᵢⱼ encodes and why it is symmetric
  • Eigendecomposition of the covariance matrix: why eigenvectors are principal components
  • Explained variance ratio and the scree plot: how much information each component captures
  • When to use PCA and when NOT to
  • PCA via SVD: the numerically stable route (what sklearn actually computes)
  • NumPy: PCA from scratch; comparison to sklearn
  • ML applications: Eigenfaces, feature preprocessing, t-SNE vs PCA

Prerequisites

Part 1 - What PCA Is Trying to Do

The optimization problem

Given a dataset X ∈ ℝⁿˣᵈ (n samples, d features), PCA finds a sequence of directions w₁, w₂, ..., wₖ in ℝᵈ such that projecting the data onto these directions:

  1. Maximizes the variance of the projected data
  2. Each subsequent direction is orthogonal to all previous directions

Formally, the first principal component is:

w₁ = argmax_{‖w‖=1} Var(Xw) = argmax_{‖w‖=1} wᵀ Σ w

where Σ = XᵀX / (n-1) is the sample covariance matrix (after centering X).

The second principal component is:

w₂ = argmax_{‖w‖=1, w⊥w₁} wᵀ Σ w

And so on. The solution: wᵢ is the i-th eigenvector of Σ, and the maximum variance along wᵢ equals the i-th eigenvalue λᵢ.

Why variance maximization?

Variance measures information content. A projection direction with high variance captures real spread in the data - different data points map to different projected values. A direction with near-zero variance means all data points project to nearly the same value - the direction carries no discriminating information.

import numpy as np

def variance_along_direction(X: np.ndarray, w: np.ndarray) -> float:
"""
Compute the variance of the projection of X onto unit vector w.
This is what PCA maximizes.
"""
w = w / np.linalg.norm(w) # ensure unit vector
X_centered = X - X.mean(axis=0)
projections = X_centered @ w # (n,) - scalar projection of each sample
return float(np.var(projections, ddof=1))

np.random.seed(42)

# Data with known structure: high variance along [1, 2]/sqrt(5)
cov_true = np.array([[5.0, 4.0], [4.0, 5.0]])
X = np.random.multivariate_normal([0, 0], cov_true, size=500)

# Variance along several directions
directions = {
"axis x=[1,0]": np.array([1.0, 0.0]),
"axis y=[0,1]": np.array([0.0, 1.0]),
"diagonal [1,1]/sqrt(2)": np.array([1.0, 1.0]) / np.sqrt(2),
"diagonal [1,-1]/sqrt(2)":np.array([1.0, -1.0]) / np.sqrt(2),
}

print("Variance along each direction:")
for name, w in directions.items():
var = variance_along_direction(X, w)
print(f" {name}: {var:.3f}")

# The PCA direction should have maximum variance
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X.T))
# eigh returns ascending -> reverse for descending
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print(f"\nPCA direction (PC1): {eigenvectors[:, 0].round(4)}")
print(f"Variance along PC1: {variance_along_direction(X, eigenvectors[:, 0]):.3f}")
print(f"Eigenvalue lambda_1: {eigenvalues[0]:.3f}")
# These should match: Var(X w₁) = w₁ᵀ Σ w₁ = λ₁ when w₁ is an eigenvector

Part 2 - The Covariance Matrix: What It Encodes

The sample covariance matrix of a centered dataset X_c ∈ ℝⁿˣᵈ is:

Sigma = (1/(n-1)) * X_c^T @ X_c in R^(d x d)

The (i, j) entry Σᵢⱼ = Cov(feature_i, feature_j):

Sigma_ij = (1/(n-1)) * sum_k (x_ki - mu_i)(x_kj - mu_j)

Properties that matter for PCA

  1. Symmetric: Σᵢⱼ = Σⱼᵢ (covariance is commutative)
  2. Positive semidefinite: wᵀΣw ≥ 0 for all w (variances are non-negative)
  3. Diagonal entries are variances: Σᵢᵢ = Var(feature_i)
  4. Off-diagonal entries encode linear correlation: Σᵢⱼ = 0 iff features i and j are uncorrelated

Because Σ is symmetric positive semidefinite, by the spectral theorem:

  • All eigenvalues are real and non-negative
  • Eigenvectors are orthogonal
  • Eigendecomposition always exists: Σ = Q Λ Qᵀ
import numpy as np

def covariance_matrix_demo():
"""Explore what the covariance matrix encodes."""
np.random.seed(42)

# Synthetic data: feature 1 has high variance, features 2 and 3 are correlated
n = 500
f1 = np.random.randn(n) * 3.0 # high variance
f2 = np.random.randn(n) * 0.5 # low variance
f3 = 0.9 * f1 + 0.1 * np.random.randn(n) # strongly correlated with f1

X = np.column_stack([f1, f2, f3])

# Center
X_c = X - X.mean(axis=0)
n_samples = X_c.shape[0]

# Manual covariance computation
Sigma = X_c.T @ X_c / (n_samples - 1) # (3, 3)

print("Covariance matrix Sigma:")
print(Sigma.round(3))
print()
print(f"Diagonal (variances): {np.diag(Sigma).round(3)}")
print(f"Sigma[0,2] (cov f1 f3): {Sigma[0,2]:.3f} <- should be large (correlated)")
print(f"Sigma[0,1] (cov f1 f2): {Sigma[0,1]:.3f} <- should be near 0 (independent)")

# Verify with numpy
Sigma_np = np.cov(X.T, ddof=1)
print(f"\nMatches np.cov: {np.allclose(Sigma, Sigma_np)}")

# Properties
eigenvalues = np.linalg.eigvalsh(Sigma)
print(f"\nEigenvalues (ascending): {eigenvalues.round(4)}")
print(f"All non-negative: {np.all(eigenvalues >= -1e-10)}")

# Symmetric
print(f"Symmetric: {np.allclose(Sigma, Sigma.T)}")

return Sigma

Sigma = covariance_matrix_demo()

The correlation matrix vs the covariance matrix

The correlation matrix normalizes the covariance matrix so diagonal entries are 1:

Rho_ij = Sigma_ij / sqrt(Sigma_ii * Sigma_jj) = Corr(feature_i, feature_j)

PCA on the correlation matrix (standardized PCA) treats all features as equally important regardless of their variance. Use it when features have very different scales. PCA on the covariance matrix is appropriate when features are already on the same scale.

import numpy as np

def covariance_vs_correlation_pca(X: np.ndarray) -> None:
"""Compare PCA on covariance vs correlation matrix."""
X_c = X - X.mean(axis=0)
n = X_c.shape[0]

# Covariance matrix
Sigma_cov = X_c.T @ X_c / (n - 1)

# Correlation matrix: standardize first
std = X_c.std(axis=0, ddof=1)
X_std = X_c / std
Sigma_corr = X_std.T @ X_std / (n - 1)

print("PCA on covariance matrix (raw features):")
eigs_cov = np.linalg.eigvalsh(Sigma_cov)[::-1]
print(f" Eigenvalues: {eigs_cov.round(3)}")
print(f" Explained variance: {(eigs_cov/eigs_cov.sum()).round(3)}")

print("\nPCA on correlation matrix (standardized features):")
eigs_corr = np.linalg.eigvalsh(Sigma_corr)[::-1]
print(f" Eigenvalues: {eigs_corr.round(3)}")
print(f" Explained variance: {(eigs_corr/eigs_corr.sum()).round(3)}")

np.random.seed(0)
# Features with very different scales
X_mixed = np.column_stack([
np.random.randn(200) * 100, # large scale
np.random.randn(200) * 0.01, # small scale
np.random.randn(200), # unit scale
])
covariance_vs_correlation_pca(X_mixed)

Part 3 - Eigendecomposition of the Covariance Matrix

Since Σ is symmetric positive semidefinite, the spectral theorem gives:

Sigma = Q Lambda Q^T

where:

  • Q = [q₁ | q₂ | ... | qₐ] ∈ ℝᵈˣᵈ - orthonormal eigenvectors (Qᵀ = Q⁻¹)
  • Λ = diag(λ₁, λ₂, ..., λₐ) - eigenvalues in descending order (λ₁ ≥ λ₂ ≥ ... ≥ 0)

Why eigenvectors ARE the principal components

Claim: the direction w that maximizes wᵀΣw subject to ‖w‖ = 1 is the eigenvector of Σ corresponding to the largest eigenvalue.

Proof (Lagrange multipliers):

Maximize wᵀΣw subject to wᵀw = 1.

Lagrangian: L(w, λ) = wᵀΣw - λ(wᵀw - 1)

Setting gradient to zero: 2Σw - 2λw = 0 → Σw = λw

So w must be an eigenvector of Σ. Substituting back:

w^T Sigma w = w^T (lambda w) = lambda (w^T w) = lambda

The objective value equals the eigenvalue. To maximize, choose the eigenvector with the largest eigenvalue λ₁. QED.

import numpy as np

def pca_via_eigendecomposition(
X: np.ndarray,
n_components: int
) -> tuple:
"""
PCA via eigendecomposition of the covariance matrix.
Mathematically explicit - matches the derivation step by step.

Args:
X: data matrix (n_samples, n_features)
n_components: number of principal components to retain

Returns:
X_projected: projected data (n_samples, n_components)
components: principal components (n_components, n_features)
explained_variance: eigenvalues for retained components
explained_variance_ratio: fraction of variance per component
"""
n, d = X.shape

# Step 1: center
mean = X.mean(axis=0)
X_c = X - mean

# Step 2: covariance matrix (d x d)
Sigma = X_c.T @ X_c / (n - 1)

# Step 3: eigendecomposition (symmetric matrix -> use eigh)
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
# eigh returns ascending order -> sort descending
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx] # columns are eigenvectors

# Step 4: select top-k components
# components_.shape = (n_components, d) -- each row is a principal direction
components = eigenvectors[:, :n_components].T

# Step 5: project data
X_projected = X_c @ eigenvectors[:, :n_components] # (n, k)

# Explained variance
explained_variance = eigenvalues[:n_components]
explained_variance_ratio = eigenvalues[:n_components] / eigenvalues.sum()

return X_projected, components, explained_variance, explained_variance_ratio

# Test
np.random.seed(42)
n, d, k = 300, 10, 3

# Data with known rank-3 structure
true_components = np.random.randn(d, k)
true_components, _ = np.linalg.qr(true_components) # orthogonalize
scores = np.random.randn(n, k)
X = scores @ true_components.T + 0.1 * np.random.randn(n, d)

X_proj, comps, ev, evr = pca_via_eigendecomposition(X, n_components=k)
print(f"Projected shape: {X_proj.shape}")
print(f"Explained variance: {ev.round(4)}")
print(f"Explained variance ratio: {evr.round(4)}")
print(f"Total explained: {evr.sum():.4f}")

Part 4 - Explained Variance Ratio and the Scree Plot

The explained variance ratio for component i is:

EVR_i = lambda_i / (lambda_1 + lambda_2 + ... + lambda_d)

It measures what fraction of the total data variance is captured by the i-th principal component.

The scree plot visualizes these ratios to guide the choice of k (number of components to keep).

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def scree_plot_and_selection(X: np.ndarray, max_components: int = None) -> dict:
"""
Compute explained variance ratio and generate scree plot data.

Returns:
dict with eigenvalues, cumulative EVR, and recommended k
"""
n, d = X.shape
max_components = max_components or d

X_c = X - X.mean(axis=0)
Sigma = X_c.T @ X_c / (n - 1)

eigenvalues = np.linalg.eigvalsh(Sigma)[::-1] # descending

evr = eigenvalues / eigenvalues.sum()
cumulative_evr = np.cumsum(evr)

# Common selection heuristics:
# 1. Elbow rule: find the point where marginal gain drops sharply
# 2. 80% threshold: choose k such that cumulative EVR >= 0.80
# 3. Kaiser rule: keep components with eigenvalue > 1 (for correlation PCA)

k_80 = int(np.searchsorted(cumulative_evr, 0.80)) + 1
k_95 = int(np.searchsorted(cumulative_evr, 0.95)) + 1

print(f"Total features: {d}")
print(f"Components for 80% variance: {k_80}")
print(f"Components for 95% variance: {k_95}")
print(f"Top-5 explained variance ratios: {evr[:5].round(4)}")
print(f"Cumulative (top 5): {cumulative_evr[:5].round(4)}")

# Plot scree plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

n_plot = min(max_components, d)

# Individual EVR
axes[0].bar(range(1, n_plot + 1), evr[:n_plot], alpha=0.7, color='#6366f1')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Scree Plot (Individual)')
axes[0].axhline(y=evr[0]*0.1, color='red', linestyle='--', alpha=0.5,
label='10% of PC1')
axes[0].legend()

# Cumulative EVR
axes[1].plot(range(1, n_plot + 1), cumulative_evr[:n_plot],
'o-', color='#10b981', linewidth=2)
axes[1].axhline(y=0.80, color='orange', linestyle='--', label='80%')
axes[1].axhline(y=0.95, color='red', linestyle='--', label='95%')
axes[1].axvline(x=k_80, color='orange', linestyle=':', alpha=0.7)
axes[1].axvline(x=k_95, color='red', linestyle=':', alpha=0.7)
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Scree Plot')
axes[1].legend()

plt.tight_layout()
plt.savefig('/tmp/pca_scree_plot.png', dpi=100, bbox_inches='tight')
print("\nSaved: /tmp/pca_scree_plot.png")

return {
"eigenvalues": eigenvalues,
"evr": evr,
"cumulative_evr": cumulative_evr,
"k_80": k_80,
"k_95": k_95,
}

# Example with synthetic data
np.random.seed(42)
# 5 strong components, 15 noise components
X_signal = np.random.randn(500, 5) @ np.random.randn(5, 20)
X_noise = 0.1 * np.random.randn(500, 20)
X = X_signal + X_noise

result = scree_plot_and_selection(X, max_components=20)

Part 5 - When to Use PCA and When NOT To

Use PCA when:

SituationWhy PCA helps
Multicollinearity in linear regressionPCA components are orthogonal by construction
High-dimensional features before k-meansCurse of dimensionality; distance metrics degrade in high-d
Visualization of embeddingsProject to 2D/3D while preserving maximal variance
Removing noiseSmall eigenvalues often correspond to noise
Reducing computational costFewer features = faster downstream models

Do NOT use PCA when:

:::danger PCA destroys label information PCA maximizes variance, not class separability. A direction with high variance may be useless for classification. Use LDA (Linear Discriminant Analysis) instead if you want the projection to preserve class structure. PCA can actually make classification harder by discarding low-variance directions that happen to be discriminative. :::

:::danger PCA cannot handle nonlinear manifolds PCA finds the best linear subspace. If your data lies on a nonlinear manifold (a Swiss roll, a circle, a curved surface), PCA will perform poorly. Use UMAP or t-SNE for visualization, or Kernel PCA for nonlinear dimensionality reduction. :::

import numpy as np

def when_pca_fails_demo():
"""Two cases where PCA gives suboptimal results."""

# Case 1: PCA ignores label structure
np.random.seed(42)
n = 200

# Two classes, separated along a low-variance direction
class1 = np.random.randn(n, 2) * np.array([10, 0.1]) + np.array([0, 1.0])
class2 = np.random.randn(n, 2) * np.array([10, 0.1]) + np.array([0, -1.0])
X = np.vstack([class1, class2])
y = np.array([0]*n + [1]*n)

X_c = X - X.mean(axis=0)
Sigma = X_c.T @ X_c / (len(X) - 1)
eigs, vecs = np.linalg.eigh(Sigma)
idx = np.argsort(eigs)[::-1]
pc1 = vecs[:, idx[0]]

print("Case 1: PCA vs LDA on class-structured data")
print(f" PC1 direction: {pc1.round(4)}")
# PC1 will point along high-variance x-axis (useless for separating classes)
# True separating direction is y-axis (low variance, but discriminative)

proj_pc1 = X_c @ pc1
class1_proj = proj_pc1[:n]
class2_proj = proj_pc1[n:]
overlap = (class1_proj.min() < class2_proj.max() and
class2_proj.min() < class1_proj.max())
print(f" Classes overlapping after PCA projection: {overlap}")
# True -> PCA failed to separate the classes

# Case 2: nonlinear structure (Swiss roll-like)
t = np.linspace(0, 4 * np.pi, 500)
# Points on a spiral in 2D
X_spiral = np.column_stack([t * np.cos(t), t * np.sin(t)])
X_spiral += 0.3 * np.random.randn(*X_spiral.shape)

X_c2 = X_spiral - X_spiral.mean(axis=0)
Sigma2 = X_c2.T @ X_c2 / (len(X_c2) - 1)
eigs2, vecs2 = np.linalg.eigh(Sigma2)
idx2 = np.argsort(eigs2)[::-1]
evr2 = eigs2[idx2] / eigs2.sum()

print("\nCase 2: PCA on spiral (nonlinear) data")
print(f" EVR: {evr2.round(4)}")
# Both components needed -> PCA cannot compress a spiral
# t-SNE or UMAP would show the manifold structure

when_pca_fails_demo()

Part 6 - PCA via SVD: The Numerically Stable Route

sklearn.decomposition.PCA does not compute the covariance matrix eigendecomposition. It uses SVD of the centered data matrix directly. Here is why.

The connection between SVD and PCA

Given the centered data X_c ∈ ℝⁿˣᵈ, the SVD is:

X_c = U Sigma_sv V^T

where Σ_sv = diag(σ₁, σ₂, ...) are singular values.

The covariance matrix is:

Cov = X_c^T X_c / (n-1)
= (V Sigma_sv U^T) (U Sigma_sv V^T) / (n-1)
= V (Sigma_sv^2 / (n-1)) V^T

This is the eigendecomposition of Cov with:

  • Eigenvectors = columns of V (right singular vectors of X_c)
  • Eigenvalues = σᵢ² / (n-1) (squared singular values divided by n-1)

So SVD of X_c directly yields the PCA solution - without forming the d×d covariance matrix.

Why SVD is better

MethodOperationCostIssues
Eigen of CovForm X_cᵀX_c, then eighO(nd² + d³)Squaring X_c loses precision; d×d matrix can be huge
SVD of X_cDirect SVDO(nd²) or O(nd·k) for truncatedNumerically stable; avoids squaring; truncated SVD scales to large d

:::tip Always use SVD-based PCA for numerical stability When features have very different scales (condition number of X_c is large), forming XᵀX squares the condition number. If cond(X_c) = 10⁸, then cond(XᵀX) ≈ 10¹⁶ - near the floating-point limit. SVD operates directly on X_c and avoids this squaring. :::

import numpy as np

def pca_via_svd(
X: np.ndarray,
n_components: int
) -> tuple:
"""
PCA via SVD of the centered data matrix.
This is what sklearn.decomposition.PCA does internally.

Args:
X: data matrix (n_samples, n_features)
n_components: number of components to retain

Returns:
X_projected: (n_samples, n_components)
components: (n_components, n_features) -- each row is a PC direction
explained_variance: eigenvalues (variance along each PC)
explained_variance_ratio: fraction of variance per PC
"""
n, d = X.shape

# Step 1: center
mean = X.mean(axis=0)
X_c = X - mean

# Step 2: full SVD
# U: (n, min(n,d)), s: (min(n,d),), Vt: (min(n,d), d)
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)

# Step 3: extract components
# Vt rows are right singular vectors = principal directions
components = Vt[:n_components] # (k, d)

# Step 4: project (scores = U[:, :k] @ diag(s[:k]))
X_projected = U[:, :n_components] * s[:n_components] # (n, k)

# Step 5: explained variance
# sigma_i^2 / (n-1) = eigenvalue of covariance matrix
explained_variance = s[:n_components]**2 / (n - 1)
total_variance = (s**2).sum() / (n - 1)
explained_variance_ratio = explained_variance / total_variance

return X_projected, components, explained_variance, explained_variance_ratio

# Compare SVD-based PCA vs eigendecomposition-based PCA
np.random.seed(42)
n, d, k = 300, 50, 5
X = np.random.randn(n, d) @ np.random.randn(d, k) # rank-5 data

# SVD-based
X_svd, comps_svd, ev_svd, evr_svd = pca_via_svd(X, n_components=k)

# Eigendecomposition-based
X_c = X - X.mean(axis=0)
Sigma = X_c.T @ X_c / (n - 1)
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
X_eig = X_c @ eigenvectors[:, :k]

print("SVD-based EVR: ", evr_svd.round(5))
print("Eig-based EVR: ", (eigenvalues[:k] / eigenvalues.sum()).round(5))
print(f"Results match (up to sign): "
f"{np.allclose(np.abs(X_svd), np.abs(X_eig))}")

Part 7 - NumPy: PCA from Scratch vs sklearn

Complete PCA class from scratch

import numpy as np

class PCAFromScratch:
"""
PCA implementation using SVD of the centered data matrix.
API matches sklearn.decomposition.PCA.
"""

def __init__(self, n_components: int = None):
self.n_components = n_components
self.mean_ = None
self.components_ = None # (n_components, n_features)
self.explained_variance_ = None # eigenvalues
self.explained_variance_ratio_ = None
self.singular_values_ = None

def fit(self, X: np.ndarray) -> 'PCAFromScratch':
n, d = X.shape
k = self.n_components or min(n, d)

# Center
self.mean_ = X.mean(axis=0)
X_c = X - self.mean_

# SVD (use truncated SVD for large k < min(n,d))
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)

# Store results
self.singular_values_ = s[:k]
self.components_ = Vt[:k] # (k, d)
self.explained_variance_ = s[:k]**2 / (n - 1)
self.explained_variance_ratio_ = (s[:k]**2 / (s**2).sum())

return self

def transform(self, X: np.ndarray) -> np.ndarray:
"""Project X into the principal component space."""
assert self.mean_ is not None, "Must call fit() first"
X_c = X - self.mean_
return X_c @ self.components_.T # (n, k)

def fit_transform(self, X: np.ndarray) -> np.ndarray:
return self.fit(X).transform(X)

def inverse_transform(self, X_proj: np.ndarray) -> np.ndarray:
"""Reconstruct from principal component space to original space."""
return X_proj @ self.components_ + self.mean_

def reconstruction_error(self, X: np.ndarray) -> float:
"""Frobenius norm of reconstruction error (measures information loss)."""
X_proj = self.transform(X)
X_recon = self.inverse_transform(X_proj)
return float(np.linalg.norm(X - X_recon, 'fro'))

# Compare to sklearn
from sklearn.decomposition import PCA as SklearnPCA

np.random.seed(42)
X = np.random.randn(200, 30)
k = 10

my_pca = PCAFromScratch(n_components=k)
X_my = my_pca.fit_transform(X)

sk_pca = SklearnPCA(n_components=k, svd_solver='full')
X_sk = sk_pca.fit_transform(X)

print(f"Shape match: {X_my.shape == X_sk.shape}")
print(f"EVR match: {np.allclose(my_pca.explained_variance_ratio_,\n"
f" sk_pca.explained_variance_ratio_)}")
# Note: projections may differ in sign (eigenvectors are defined up to sign)
# but the subspace spanned is identical
print(f"EVR (mine): {my_pca.explained_variance_ratio_[:5].round(4)}")
print(f"EVR (sklearn): {sk_pca.explained_variance_ratio_[:5].round(4)}")

# Reconstruction error
err = my_pca.reconstruction_error(X)
total_var = np.sum((X - X.mean(axis=0))**2)
print(f"\nReconstruction error (Frobenius): {err:.4f}")
print(f"Fraction of total norm lost: {err**2 / total_var:.4f}")
print(f"Expected (1 - sum(EVR[:k])): "
f"{(1 - my_pca.explained_variance_ratio_.sum()):.4f}")

Truncated SVD for large-scale PCA

When d is large (e.g., d = 10,000 for image pixels), computing the full SVD is expensive. Use randomized SVD (the actual algorithm in sklearn.decomposition.PCA with svd_solver='randomized'):

import numpy as np
from sklearn.utils.extmath import randomized_svd

def pca_large_scale(X: np.ndarray, n_components: int) -> tuple:
"""
PCA for large datasets using randomized SVD.
O(n * d * k) instead of O(n * d^2).

Used by sklearn when n_features > 500 or n_components < 0.8 * min(n, d).
"""
n, d = X.shape
mean = X.mean(axis=0)
X_c = X - mean

# Randomized SVD: much faster for k << min(n, d)
U, s, Vt = randomized_svd(X_c, n_components=n_components,
n_iter=4, random_state=42)

components = Vt # (k, d)
X_projected = U * s # (n, k)
explained_variance = s**2 / (n - 1)

# Approximate total variance (not exact with randomized SVD)
total_var = np.sum(X_c**2) / (n - 1)
explained_variance_ratio = explained_variance / total_var

return X_projected, components, explained_variance, explained_variance_ratio

# Example: image-scale data (simulating 1000 images of 64x64 pixels)
np.random.seed(42)
X_large = np.random.randn(1000, 4096) # 4096 = 64*64 pixel dimensions

X_proj, comps, ev, evr = pca_large_scale(X_large, n_components=50)
print(f"Large-scale PCA: {X_large.shape} -> {X_proj.shape}")
print(f"Top-5 EVR: {evr[:5].round(4)}")

Part 8 - ML Applications: Eigenfaces, Preprocessing, t-SNE vs PCA

Eigenfaces: PCA for face recognition

The classic Eigenfaces algorithm applies PCA to a dataset of face images. Each image (d pixels) is a point in ℝᵈ. The principal components of the face image dataset are called eigenfaces - the "principal directions of variation" in face space.

import numpy as np

def eigenfaces_demo():
"""
Simulate the Eigenfaces algorithm with synthetic face-like data.
In practice: use sklearn's fetch_olivetti_faces dataset.
"""
np.random.seed(42)

# Simulate: 100 face images, each 64x64 pixels = 4096 dimensions
n_faces = 100
image_size = (32, 32)
d = image_size[0] * image_size[1]

# Synthetic: faces as random linear combinations of 10 "archetypes"
archetypes = np.random.randn(10, d) # 10 base face patterns
weights = np.random.randn(n_faces, 10)
faces = weights @ archetypes + 0.1 * np.random.randn(n_faces, d)

# Step 1: center (compute mean face)
mean_face = faces.mean(axis=0) # (d,) -- the "average face"
faces_c = faces - mean_face

# Step 2: SVD-based PCA
U, s, Vt = np.linalg.svd(faces_c, full_matrices=False)
eigenfaces = Vt # (min(n,d), d) -- each row is an eigenface

# Step 3: represent each face as a linear combination of top-k eigenfaces
k = 10
face_coords = U[:, :k] * s[:k] # (n_faces, k) -- "face space coordinates"

print(f"Original face dimensionality: {d}")
print(f"Face space dimensionality: {k}")
print(f"Compression ratio: {d / k:.0f}x")

evr = s**2 / (s**2).sum()
print(f"\nTop-{k} components explain {evr[:k].sum()*100:.1f}% of variance")

# Step 4: reconstruct a face from k eigenfaces
face_idx = 5
face_original = faces[face_idx]
face_coords_single = face_coords[face_idx] # (k,)
face_reconstructed = (face_coords_single @ eigenfaces[:k]) + mean_face

recon_error = np.linalg.norm(face_original - face_reconstructed)
print(f"\nReconstruction error (face {face_idx}): {recon_error:.4f}")

# Step 5: recognition -- find nearest neighbor in face space
# (This is how Eigenfaces recognition works)
query_face = faces[0] + 0.05 * np.random.randn(d) # slightly perturbed
query_c = query_face - mean_face
query_coords = query_c @ eigenfaces[:k].T # project onto eigenface space

distances = np.linalg.norm(face_coords - query_coords, axis=1)
nearest = np.argmin(distances)
print(f"\nFace recognition: query nearest to face {nearest} "
f"(should be 0): {nearest == 0}")

eigenfaces_demo()

Feature preprocessing with PCA

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

def pca_preprocessing_pipeline():
"""
Show how PCA preprocessing affects downstream model performance.
"""
np.random.seed(42)
n = 500
d = 100 # high-dimensional features
true_rank = 5 # data actually lives in 5 dimensions

# Synthetic classification: label based on top-5 principal components
true_directions = np.random.randn(d, true_rank)
true_directions, _ = np.linalg.qr(true_directions)

X_low = np.random.randn(n, true_rank)
X = X_low @ true_directions.T + 0.1 * np.random.randn(n, d)

# Binary label: based on first true direction
y = (X_low[:, 0] > 0).astype(int)

# Baseline: logistic regression on all 100 features (standardized)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lr = LogisticRegression(max_iter=1000)
scores_raw = cross_val_score(lr, X_scaled, y, cv=5)
print(f"Logistic regression on {d} features: {scores_raw.mean():.3f} +/- {scores_raw.std():.3f}")

# With PCA: reduce to 5 components
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)
scores_pca = cross_val_score(lr, X_pca, y, cv=5)
print(f"Logistic regression on 5 PCA components: {scores_pca.mean():.3f} +/- {scores_pca.std():.3f}")
print(f"EVR top-5: {pca.explained_variance_ratio_.round(4)}")

pca_preprocessing_pipeline()

t-SNE vs PCA: when to use which

PropertyPCAt-SNE
TypeLinear projectionNonlinear embedding
ObjectiveMaximize variancePreserve neighborhood structure
InterpretabilityPC directions have meaningAxes have no meaning
SpeedFast (O(nd²))Slow (O(n²) naive, O(n log n) with BH)
ReproducibilityDeterministicStochastic (set random_state)
Handles nonlinear structureNoYes
Works for n > 10,000YesSlow; use UMAP instead
Preserves global structureYesNo (only local structure)
Use for preprocessingYesNo (non-parametric, no transform())
import numpy as np

def pca_vs_tsne_guidance():
"""Rule of thumb: when to use PCA vs t-SNE."""

guidelines = {
"Reduce dimensions before k-means": "PCA (t-SNE is non-parametric)",
"Visualize clusters in 2D (linear separable)": "PCA (interpretable axes)",
"Visualize clusters in 2D (nonlinear)": "t-SNE or UMAP",
"Feature preprocessing for linear model": "PCA (orthogonal, removes collinearity)",
"Preserve global structure": "PCA",
"Preserve local neighborhood structure": "t-SNE / UMAP",
"n > 100,000 samples": "PCA (t-SNE too slow; use UMAP)",
"Streaming / online data": "IncrementalPCA (not t-SNE)",
"Need to transform new data": "PCA (has transform()); t-SNE does not",
}

print("PCA vs t-SNE decision guide:")
for task, recommendation in guidelines.items():
print(f" {task}:\n -> {recommendation}\n")

pca_vs_tsne_guidance()

Interview Questions

Q1: Derive PCA from scratch - what are the mathematical steps?

PCA solves the variance maximization problem. Here is the complete derivation:

Step 1: Setup Given data X ∈ ℝⁿˣᵈ. Center: X_c = X - mean(X).

Step 2: Covariance matrix Σ = X_cᵀ X_c / (n-1) ∈ ℝᵈˣᵈ. Symmetric positive semidefinite.

Step 3: Optimization problem Find w₁ = argmax ‖w‖=1 wᵀΣw.

Step 4: Lagrangian L(w, λ) = wᵀΣw - λ(wᵀw - 1)

∂L/∂w = 2Σw - 2λw = 0 → Σw = λw

So w must be an eigenvector of Σ, and the objective value = λ. Maximize by choosing the largest eigenvalue's eigenvector.

Step 5: Subsequent components By the same argument (with the constraint of orthogonality to previous components), the k-th component is the k-th eigenvector of Σ (k-th largest eigenvalue).

Step 6: Projection X_projected = X_c @ Q_k where Q_k = [q₁ | ... | qₖ].

Step 7: Explained variance EVR_i = λᵢ / Σⱼ λⱼ. Since wᵀΣw = λ when w is the eigenvector, the eigenvalue equals the variance along that direction.

Numerical implementation: use SVD of X_c (not eigendecomposition of Σ) for numerical stability. The right singular vectors of X_c are the eigenvectors of Σ.

Q2: Why does PCA use eigendecomposition of the covariance matrix?

PCA finds directions that maximize variance in projected data. The variance of projections of X_c onto unit vector w is:

Var(X_c w) = (1/(n-1)) ||X_c w||^2 = (1/(n-1)) w^T X_c^T X_c w = w^T Sigma w

Maximizing wᵀΣw subject to ‖w‖ = 1 is a constrained quadratic optimization. The Lagrangian gives Σw = λw - the eigenvector equation.

The eigenvectors of Σ solve the optimization because:

  1. Σ is symmetric positive semidefinite → eigenvectors are orthogonal (good: we want uncorrelated components)
  2. The maximum value of wᵀΣw at the optimum equals the corresponding eigenvalue λ
  3. Subsequent components (orthogonal to previous) correspond to subsequent eigenvectors in descending eigenvalue order

In short: eigendecomposition of Σ is exactly the solution to the variance maximization problem that defines PCA.

Why the covariance matrix specifically?: Because Var(X_c w) = wᵀΣw by definition. Any other symmetric matrix would solve a different optimization problem.

Q3: When would you use t-SNE instead of PCA?

Use t-SNE (or UMAP) when:

  1. The data lies on a nonlinear manifold: PCA finds the best linear subspace. A circle, a Swiss roll, or a curved high-dimensional surface cannot be well-represented by a linear projection. t-SNE preserves the local neighborhood graph, capturing nonlinear structure.

  2. You want to visualize cluster separation: t-SNE explicitly preserves local neighborhoods, making clusters visually distinct. PCA preserves global variance, which may not align with cluster boundaries.

  3. You are exploring data: t-SNE is excellent for EDA of high-dimensional data (word embeddings, image features, single-cell RNA-seq).

Use PCA instead of t-SNE when:

  1. You need to transform new data: t-SNE is non-parametric - there is no transform() function for new samples. PCA has an explicit linear projection.

  2. Speed matters: PCA is O(nd²); t-SNE is O(n² log n) or O(n log n) with Barnes-Hut. For n > 50,000, use PCA or UMAP.

  3. Interpretability is required: PCA axes have meaning (they are principal directions in feature space). t-SNE axes have no interpretation.

  4. You want to preserve global structure: t-SNE only preserves local neighborhoods. Distances between clusters in a t-SNE plot are not meaningful. PCA preserves the direction and magnitude of global variance.

  5. Preprocessing for a downstream model: Use PCA (or IncrementalPCA). t-SNE cannot be used as a feature extractor.

Rule of thumb: use PCA for preprocessing, use t-SNE or UMAP for visualization.

Q4: How does sklearn.decomposition.PCA actually compute principal components?

sklearn.decomposition.PCA uses SVD of the centered data matrix, not eigendecomposition of the covariance matrix. The specific algorithm depends on svd_solver:

svd_solver='full' (exact, default for small data):

  1. Center X: X_c = X - mean(X)
  2. Full SVD: X_c = U Σ Vᵀ via LAPACK dgesdd
  3. Principal components = rows of Vᵀ (right singular vectors)
  4. Explained variance = σᵢ² / (n-1)
  5. Scores = U[:, :k] * σ[:k]

svd_solver='randomized' (default when n_components < 0.8 × min(n, d)):

  1. Center X_c
  2. Randomized SVD (Halko et al. 2011): draw a random projection matrix Ω, compute Y = X_c Ω, orthogonalize Y = QR, then SVD on the small matrix Qᵀ X_c
  3. O(n × d × k) time instead of O(n × d²)

svd_solver='arpack' (for sparse data): Uses ARPACK's implicitly restarted Lanczos iteration - best for very sparse X.

svd_solver='covariance_eigh' (new in sklearn 1.1): Explicitly computes X_cᵀX_c then eigh - useful when d ≪ n (more samples than features).

Why not eigendecompose Σ directly?

  • Forming X_cᵀX_c squares the condition number (cond(Σ) = cond(X_c)²)
  • On nearly-singular data, this loses numerical precision
  • SVD is more numerically stable and directly works on X_c

Numerical connection: The right singular vectors of X_c are mathematically identical to the eigenvectors of X_cᵀX_c = (n-1)Σ. sklearn exploits this equivalence while using the more stable SVD path.

Practice Challenges

Level 1: Predict

Challenge: Given a dataset X ∈ ℝ¹⁰⁰ˣ⁵ generated as X = Z @ A where Z ∈ ℝ¹⁰⁰ˣ³ (random) and A ∈ ℝ³ˣ⁵ (random), predict:

  1. How many non-zero eigenvalues will the covariance matrix have?
  2. What will the explained variance ratios look like?
  3. How many PCA components are needed to perfectly reconstruct X?
Answer
  1. Non-zero eigenvalues: 3. X = ZA has rank at most min(rank(Z), rank(A)) = min(3, 3) = 3. The covariance matrix Σ = XᵀX/(n-1) has rank = rank(X) = 3. Only 3 of its 5 eigenvalues will be non-zero.

  2. Explained variance: The top 3 components together explain 100% of the variance. The bottom 2 eigenvalues are exactly 0 (rank-deficient). In practice (floating point), they will be very small but positive.

  3. Components for perfect reconstruction: 3 components are sufficient - the data lives in a 3-dimensional subspace of ℝ⁵.

import numpy as np

np.random.seed(42)
Z = np.random.randn(100, 3)
A = np.random.randn(3, 5)
X = Z @ A

X_c = X - X.mean(axis=0)
Sigma = X_c.T @ X_c / (len(X) - 1)
eigenvalues = np.linalg.eigvalsh(Sigma)[::-1]

print(f"Eigenvalues: {eigenvalues.round(4)}")
print(f"Non-zero (>1e-10): {np.sum(eigenvalues > 1e-10)}") # 3
print(f"EVR: {(eigenvalues / eigenvalues.sum()).round(4)}")
# [0.xx, 0.xx, 0.xx, 0.0000, 0.0000]

Level 2: Debug

Challenge: The following PCA implementation gives wrong explained variance ratios. Find and fix the bugs:

import numpy as np

def buggy_pca(X, k):
# Bug 1: missing centering
Sigma = X.T @ X / len(X)

# Bug 2: using eig instead of eigh
eigenvalues, eigenvectors = np.linalg.eig(Sigma)

# Bug 3: not sorting eigenvalues
components = eigenvectors[:, :k].T
X_proj = X @ eigenvectors[:, :k]

evr = eigenvalues[:k] / eigenvalues.sum()
return X_proj, evr
Answer

Three bugs:

Bug 1: Not centering the data. Sigma = X.T @ X / len(X) computes the second-moment matrix, not the covariance matrix. Fix: X_c = X - X.mean(axis=0) and use X_c.T @ X_c / (len(X) - 1).

Bug 2: np.linalg.eig returns complex eigenvalues due to floating-point errors (even for symmetric matrices). Fix: use np.linalg.eigh for symmetric matrices.

Bug 3: np.linalg.eigh returns eigenvalues in ascending order, so eigenvectors[:, :k] gives the smallest-variance directions. Fix: sort in descending order.

import numpy as np

def fixed_pca(X, k):
n = X.shape[0]

# Fix 1: center the data
X_c = X - X.mean(axis=0)

# Fix 1 (continued): use centered data for covariance
Sigma = X_c.T @ X_c / (n - 1)

# Fix 2: use eigh for symmetric matrix (guaranteed real eigenvalues)
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Fix 3: sort in descending order (eigh returns ascending)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Correct projection (onto centered data)
components = eigenvectors[:, :k].T
X_proj = X_c @ eigenvectors[:, :k]

evr = eigenvalues[:k] / eigenvalues.sum()
return X_proj, evr

# Verify
np.random.seed(42)
X = np.random.randn(100, 10)
from sklearn.decomposition import PCA
sk_pca = PCA(n_components=3)
sk_pca.fit_transform(X)

_, evr = fixed_pca(X, k=3)
print(f"Fixed EVR: {evr.round(5)}")
print(f"sklearn EVR: {sk_pca.explained_variance_ratio_.round(5)}")
print(f"Match: {np.allclose(evr, sk_pca.explained_variance_ratio_)}") # True

Level 3: Design

Challenge: Implement WhiteningPCA - PCA followed by whitening (scaling each component to unit variance). Whitening produces decorrelated features with unit variance, which is useful as preprocessing for ICA and some neural network architectures. Show that the output is indeed uncorrelated with unit variance.

Answer
import numpy as np

class WhiteningPCA:
"""
PCA followed by whitening: transform data so that
- Components are uncorrelated
- Each component has unit variance

Two whitening approaches:
1. ZCA whitening: W = Q Lambda^{-1/2} Q^T (preserves original space orientation)
2. PCA whitening: W = Lambda^{-1/2} Q^T (in PC space)
"""

def __init__(self, n_components: int = None, epsilon: float = 1e-5):
"""
Args:
n_components: number of components (None = keep all)
epsilon: small constant for numerical stability (prevents divide by zero)
"""
self.n_components = n_components
self.epsilon = epsilon
self.mean_ = None
self.components_ = None # eigenvectors (descending eigenvalue order)
self.eigenvalues_ = None # for scaling

def fit(self, X: np.ndarray) -> 'WhiteningPCA':
n, d = X.shape
k = self.n_components or d

self.mean_ = X.mean(axis=0)
X_c = X - self.mean_

Sigma = X_c.T @ X_c / (n - 1)
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Descending order
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Keep top k
self.eigenvalues_ = eigenvalues[:k]
self.components_ = eigenvectors[:, :k] # (d, k) -- columns are PCs
return self

def transform(self, X: np.ndarray) -> np.ndarray:
"""
PCA whitening: project onto PCs, then scale by 1/sqrt(lambda).
Result: uncorrelated features with unit variance.
"""
X_c = X - self.mean_
# Project onto principal components
Z = X_c @ self.components_ # (n, k)
# Scale to unit variance
scale = 1.0 / np.sqrt(self.eigenvalues_ + self.epsilon)
return Z * scale # (n, k)

def fit_transform(self, X: np.ndarray) -> np.ndarray:
return self.fit(X).transform(X)

def zca_transform(self, X: np.ndarray) -> np.ndarray:
"""
ZCA whitening: rotates whitened data back to original feature space.
Result: looks most like original data while being whitened.
"""
X_c = X - self.mean_
Z = self.transform(X) # PCA whitened in PC space
# Rotate back to original space: multiply by Q^T inverse = Q
return Z @ self.components_.T # (n, d)

# Test
np.random.seed(42)
# Correlated data
cov = np.array([[4, 3], [3, 4]], dtype=float)
X = np.random.multivariate_normal([0, 0], cov, size=1000)

wpca = WhiteningPCA(n_components=2)
X_whitened = wpca.fit_transform(X)

# Verify: covariance of whitened data should be identity
cov_whitened = np.cov(X_whitened.T, ddof=1)
print("Covariance of whitened data (should be I):")
print(cov_whitened.round(3))
print(f"Max off-diagonal: {np.abs(cov_whitened - np.eye(2)).max():.4f}") # near 0

# ZCA whitening
X_zca = wpca.zca_transform(X)
cov_zca = np.cov(X_zca.T, ddof=1)
print("\nCovariance of ZCA-whitened data (should be I):")
print(cov_zca.round(3))

Quick Reference Cheatsheet

OperationMathNumPyNotes
Covariance matrixΣ = X_cᵀX_c / (n-1)np.cov(X.T, ddof=1)Must center X first
Eigendecompose ΣΣ = QΛQᵀnp.linalg.eigh(Sigma)Use eigh (not eig) for symmetric
Sort eigenvectorsλ₁ ≥ λ₂ ≥ ...idx = np.argsort(eigs)[::-1]eigh returns ascending
PCA via SVDX_c = U Σ_sv Vᵀnp.linalg.svd(X_c, full_matrices=False)Most stable method
PC directionscolumns of QVt[:k] from SVDRows of Vt = right singular vectors
ProjectionZ = X_c @ Q_kX_c @ eigenvectors[:, :k](n, k) projected data
Explained varianceλᵢ / Σλⱼeigenvalues[:k] / eigenvalues.sum()Each PC's fraction of total variance
ReconstructionX_c @ Q_k @ Q_kᵀ + meanX_proj @ components + mean_Low-rank approximation
sklearn-PCA(n_components=k).fit_transform(X)Uses randomized SVD for large data

Key Takeaways

  • PCA is eigendecomposition of the covariance matrix. It is not a black-box algorithm - it is a direct application of the spectral theorem to the sample covariance.
  • The covariance matrix Σ is symmetric positive semidefinite. Its eigenvectors are orthogonal, its eigenvalues are non-negative, and its eigendecomposition always exists.
  • The i-th principal component is the i-th eigenvector of Σ. The corresponding eigenvalue equals the variance of projections onto that direction.
  • Explained variance ratio = λᵢ / Σλⱼ. The scree plot of explained variance ratios guides the choice of k.
  • PCA fails on nonlinear manifolds and does not preserve label structure. Use t-SNE or UMAP for visualization; use LDA when class separability matters.
  • sklearn.decomposition.PCA uses SVD of the centered data matrix - not explicit eigendecomposition of Σ. This is more numerically stable because forming XᵀX squares the condition number.
  • For large datasets, randomized SVD reduces cost from O(nd²) to O(ndk). For streaming data, use IncrementalPCA.

Next: Dot Products and Projections →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the PCA Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.