Principal Component Analysis via eigendecomposition and SVD - covariance geometry, reconstruction error, Kernel PCA, Incremental PCA, whitening, and production use for preprocessing and anomaly detection.

How does principal component analysis work in practice?

PCA Dimensionality Reduction covers PCA, principal component analysis, dimensionality reduction from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/unsupervised-learning/pca-dimensionality-reduction

What is the difference between PCA and dimensionality reduction?

See the full breakdown at https://engineersofai.com/docs/ml/unsupervised-learning/pca-dimensionality-reduction

PCA Dimensionality Reduction

import ReadingTime from '@site/src/components/ReadingTime';

:::note Interview Relevance - Very High PCA appears in nearly every ML interview that touches feature engineering, model efficiency, or data exploration. The eigendecomposition derivation, SVD connection, and the critical fit-on-train-only rule are standard MLE/DS interview questions. Expect at least one PCA question in any senior ML interview. :::

The Real Interview Moment

You are three months into a new role at a medical device company. The ML team has a sensor array that produces 2,000 measurements per second - temperature, pressure, vibration, acoustic emission, and current draw across 400 sensors. The company wants to predict bearing failures 30 minutes before they happen.

You load the data and the first thing you notice: the sensors are highly correlated. The 12 temperature sensors in the motor housing all move together within 0.1°C of each other. The 8 pressure sensors show $r = 0.97$ with each other. You have 2,000 features but maybe 50 independent sources of variation.

You run PCA on a month of training data. 95% of the variance is explained by the first 47 components. You reduce from 2,000 to 47 features. Training time drops from 8 hours to 4 minutes. Model accuracy improves because you removed correlated noise. And as a bonus: reconstruction error on future sensor readings flags the anomalous states before your failure detector even sees them.

This lesson is about how that transformation works - from the geometric intuition, through the eigendecomposition math, all the way to the production engineering patterns you need to use it correctly.

Why This Exists - The Curse of Dimensionality

In high-dimensional spaces, machine learning breaks down in counterintuitive ways. In 1,000 dimensions, the volume of a unit hypersphere is essentially zero compared to the unit cube - almost all points cluster near the boundary. Distance metrics become meaningless: the ratio of the max to min distance between random points approaches 1 as $d \to \infty$ . The nearest neighbor is barely nearer than any random point.

This "curse of dimensionality" has three practical consequences:

More data required: the number of training examples needed grows exponentially with $d$
Overfitting: models fit noise in irrelevant dimensions
Distance concentration: clustering and nearest-neighbor methods fail because distances stop discriminating

The insight behind PCA: real-world high-dimensional data does not fill the full $d$ -dimensional space uniformly. It lives near a low-dimensional linear subspace - a manifold of much lower intrinsic dimensionality. PCA finds that subspace.

Historical Context

PCA was invented by Karl Pearson in 1901 ("On lines and planes of closest fit to systems of points in space") and independently reformulated by Harold Hotelling in 1933 who connected it to the eigenvectors of the correlation matrix. The connection to SVD was established later and is the foundation of modern numerical implementations. Every production ML library (sklearn, NumPy, LAPACK) computes PCA via SVD for numerical stability reasons.

PCA as Variance Maximization - The Geometric Intuition

Imagine a cloud of points in 2D. The cloud is elongated - it stretches further along one direction than the other. PCA finds the axis of maximum stretch (the first principal component) and the perpendicular axis (the second component). When you project all points onto the first axis, you retain the most information about the original positions.

In $d$ dimensions: the first principal component $\mathbf{w}_1$ is the direction along which the variance of the projected data is maximized:

$\mathbf{w}_1 = \arg\max_{\|\mathbf{w}\|=1} \operatorname{Var}(\mathbf{X}\mathbf{w}) = \arg\max_{\|\mathbf{w}\|=1} \mathbf{w}^T \mathbf{\Sigma} \mathbf{w}$

where $\mathbf{\Sigma}$ is the $d \times d$ sample covariance matrix of the centered data.

Using Lagrange multipliers to enforce $\|\mathbf{w}\|=1$ :

$\frac{\partial}{\partial \mathbf{w}}\left[\mathbf{w}^T \mathbf{\Sigma} \mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)\right] = 2\mathbf{\Sigma}\mathbf{w} - 2\lambda\mathbf{w} = \mathbf{0}$

This gives $\mathbf{\Sigma}\mathbf{w}_1 = \lambda_1 \mathbf{w}_1$ - the first principal component is the eigenvector of the covariance matrix corresponding to its largest eigenvalue $\lambda_1$ . The maximized variance is $\lambda_1$ .

The $k$ -th principal component is the $k$ -th eigenvector (sorted by decreasing eigenvalue). The explained variance ratio of component $k$ :

$\text{EVR}_k = \frac{\lambda_k}{\sum_{j=1}^{d} \lambda_j}$

The Covariance Matrix and Eigendecomposition

The sample covariance matrix is:

$\mathbf{\Sigma} = \frac{1}{n-1}\mathbf{X}_c^T \mathbf{X}_c$

where $\mathbf{X}_c = \mathbf{X} - \bar{\mathbf{X}}$ is the mean-centered data matrix ( $n \times d$ ).

Reconstruction formula: Given the top- $k$ components $\mathbf{W}_k \in \mathbb{R}^{k \times d}$ , the projection is:

$\mathbf{Z} = \mathbf{X}_c \mathbf{W}_k^T \in \mathbb{R}^{n \times k}$

The reconstruction (inverse transform) is:

$\hat{\mathbf{X}}_c = \mathbf{Z} \mathbf{W}_k = \mathbf{X}_c \mathbf{W}_k^T \mathbf{W}_k$

The reconstruction error (total information lost by using $k$ components instead of $d$ ) is:

$\text{Error} = \|\mathbf{X}_c - \hat{\mathbf{X}}_c\|_F^2 = \sum_{j=k+1}^{d} \lambda_j$

It equals the sum of discarded eigenvalues - a clean result that lets you choose $k$ by specifying an acceptable reconstruction error.

import numpy as np
from sklearn.preprocessing import StandardScaler

def pca_from_scratch(X: np.ndarray, n_components: int):
    """
    PCA via eigendecomposition of the covariance matrix.
    Numerically correct but less stable than SVD for ill-conditioned data.
    """
    # Step 1: Center the data (CRITICAL - PCA requires zero-mean features)
    X_c = X - X.mean(axis=0)     # (n, d)

    # Step 2: Sample covariance matrix
    n = X.shape[0]
    Sigma = (X_c.T @ X_c) / (n - 1)   # (d, d)

    # Step 3: Eigendecomposition
    # np.linalg.eigh is faster and more stable than eig for symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(Sigma)  # eigenvalues are real

    # Step 4: Sort by descending eigenvalue
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues  = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Step 5: Select top-k components
    W_k = eigenvectors[:, :n_components].T   # (k, d) - rows are components

    # Step 6: Explained variance ratio
    evr = eigenvalues[:n_components] / eigenvalues.sum()
    cumulative_evr = evr.cumsum()

    # Step 7: Project data
    Z = X_c @ W_k.T    # (n, k)

    # Step 8: Reconstruction error (Frobenius norm squared)
    X_hat = Z @ W_k    # (n, d) - reconstructed in centered space
    recon_error = np.mean((X_c - X_hat) ** 2)

    return {
        'components': W_k,                              # (k, d)
        'eigenvalues': eigenvalues[:n_components],
        'explained_variance_ratio': evr,
        'cumulative_evr': cumulative_evr,
        'Z': Z,                                         # (n, k) projected data
        'reconstruction_error': recon_error,
    }


# Test on correlated data with known structure
np.random.seed(42)
n, d = 1000, 20

# 3 latent factors drive all 20 features
latent_factors = np.random.randn(n, 3)
W_true = np.random.randn(3, d)          # mixing matrix
X_correlated = latent_factors @ W_true + 0.2 * np.random.randn(n, d)
X_scaled = StandardScaler().fit_transform(X_correlated)

result = pca_from_scratch(X_scaled, n_components=5)

print("Eigenvalues (top 5):", result['eigenvalues'].round(4))
print("EVR (top 5):        ", result['explained_variance_ratio'].round(4))
print("Cumulative EVR:     ", result['cumulative_evr'].round(4))
print(f"Reconstruction MSE (5 components): {result['reconstruction_error']:.4f}")

SVD: The Numerically Stable Implementation

Computing the covariance matrix $\mathbf{\Sigma} = \mathbf{X}_c^T\mathbf{X}_c/(n-1)$ squares the condition number of $\mathbf{X}_c$ . If $\mathbf{X}_c$ has condition number $\kappa$ , then $\mathbf{\Sigma}$ has condition number $\kappa^2$ . For near-collinear features, this causes catastrophic numerical errors in the eigendecomposition.

SVD is the correct approach: decompose $\mathbf{X}_c$ directly:

$\mathbf{X}_c = \mathbf{U} \mathbf{S} \mathbf{V}^T$

where:

$\mathbf{U} \in \mathbb{R}^{n \times r}$ : left singular vectors (sample coordinates)
$\mathbf{S} \in \mathbb{R}^{r \times r}$ : diagonal matrix of singular values $s_1 \geq s_2 \geq \ldots \geq s_r \geq 0$
$\mathbf{V} \in \mathbb{R}^{d \times r}$ : right singular vectors - these are the principal components
$r = \min(n, d)$

Connection to eigendecomposition: $\mathbf{X}_c^T\mathbf{X}_c = \mathbf{V}\mathbf{S}^T\mathbf{U}^T\mathbf{U}\mathbf{S}\mathbf{V}^T = \mathbf{V}\mathbf{S}^2\mathbf{V}^T$

so eigenvalues of the covariance are $\lambda_k = s_k^2/(n-1)$ , and eigenvectors are the columns of $\mathbf{V}$ .

The projected data (scores) can be computed without explicitly forming $\mathbf{V}^T\mathbf{X}_c$ : $\mathbf{Z} = \mathbf{U}_k \mathbf{S}_k$

where $\mathbf{U}_k$ and $\mathbf{S}_k$ are the top- $k$ left singular vectors and values.

def pca_svd(X: np.ndarray, n_components: int):
    """
    PCA via SVD - numerically stable, preferred in production.
    Equivalent to sklearn.decomposition.PCA internals.
    """
    X_c = X - X.mean(axis=0)
    n = X.shape[0]

    # Economy SVD: only compute n_components singular vectors
    # full_matrices=False: thin/economy SVD - O(n*d*min(n,d)) instead of O(n^2*d)
    U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
    # U: (n, min(n,d)), S: (min(n,d),), Vt: (min(n,d), d)

    # Top-k components (rows of Vt)
    components = Vt[:n_components]        # (k, d)

    # Eigenvalues from singular values
    eigenvalues   = (S[:n_components] ** 2) / (n - 1)
    total_variance = (S ** 2).sum() / (n - 1)
    evr = eigenvalues / total_variance

    # Projection: U[:, :k] * S[:k] == X_c @ components.T
    Z = U[:, :n_components] * S[:n_components]    # (n, k)

    # Reconstruction error
    X_hat = Z @ components
    recon_error_total = ((X_c - X_hat) ** 2).sum()

    return {
        'components': components,
        'eigenvalues': eigenvalues,
        'explained_variance_ratio': evr,
        'Z': Z,
        'reconstruction_error_total': recon_error_total,
    }


# Verify SVD matches eigendecomposition result
result_svd = pca_svd(X_scaled, n_components=5)
print("SVD EVR:", result_svd['explained_variance_ratio'].round(4))
# Should match eigendecomposition result above

Sign convention: PCA components have an arbitrary sign flip - $\mathbf{w}$ and $-\mathbf{w}$ are both valid. sklearn deterministically chooses the sign by making the largest-magnitude element positive. If you implement PCA from scratch and compare to sklearn, you may see sign flips - this is expected and correct.

Sklearn PCA: Production API

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# ── Always use a Pipeline - prevents data leakage ──────────────────────────────
pipe = Pipeline([
    ('scaler', StandardScaler()),      # fit on train, transform both
    ('pca', PCA(n_components=0.95,     # keep 95% of variance (pass float for auto-k)
                svd_solver='full',     # 'randomized' for large d with small k
                random_state=42)),
])

pipe.fit(X_train)
X_train_pca = pipe.transform(X_train)
X_test_pca  = pipe.transform(X_test)   # uses training mean/std/components

pca = pipe.named_steps['pca']
print(f"Original features:  {X_train.shape[1]}")
print(f"Components kept:    {pca.n_components_}")
print(f"Variance retained:  {pca.explained_variance_ratio_.sum():.4f}")

# Access the components (principal directions in feature space)
print(f"Components shape: {pca.components_.shape}")    # (k, d)

# Reconstruction (inverse transform back to original space)
X_reconstructed = pipe.inverse_transform(X_train_pca)
recon_error = np.mean((StandardScaler().fit_transform(X_train) - X_reconstructed) ** 2)
print(f"Reconstruction MSE: {recon_error:.4f}")

SVD solver selection:

Solver	When to use	Time	Notes
`'full'`	$d < 500$ or need exact results	$O(\min(n,d)^3)$	LAPACK full SVD
`'randomized'`	$k \ll \min(n,d)$ , large $d$	$O(n \cdot d \cdot k)$	Halko-Martinsson-Tropp
`'arpack'`	$k \ll \min(n,d)$ , sparse $X$	$O(n \cdot d \cdot k)$	ARPACK eigensolver
`'auto'`	Default	Picks above based on shape	Good default

For a 10,000-sample, 10,000-feature dataset keeping 64 components: 'randomized' is $\sim 100\times$ faster than 'full'.

Scree Plot and Choosing the Number of Components

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def plot_explained_variance(X_train: np.ndarray, max_components: int = None):
    """
    Full scree plot + cumulative EVR to guide component selection.
    Returns a dict with k values for common thresholds.
    """
    X_scaled = StandardScaler().fit_transform(X_train)

    max_k = min(X_train.shape[0], X_train.shape[1])
    if max_components:
        max_k = min(max_k, max_components)

    pca_full = PCA(n_components=max_k, svd_solver='full').fit(X_scaled)
    evr = pca_full.explained_variance_ratio_
    cumulative = evr.cumsum()

    # Find k for common thresholds
    thresholds = {0.80: None, 0.90: None, 0.95: None, 0.99: None}
    for t in thresholds:
        thresholds[t] = int(np.argmax(cumulative >= t)) + 1

    # Kaiser criterion: keep components with eigenvalue > 1
    # (only meaningful when data is standardized)
    eigenvalues = pca_full.explained_variance_   # = eigenvalues of cov matrix
    kaiser_k = int((eigenvalues > 1.0).sum())

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # Scree plot
    axes[0].bar(range(1, len(evr)+1), evr, color='#3b82f6', alpha=0.7)
    axes[0].plot(range(1, len(evr)+1), evr, 'o-', color='#1e40af', ms=3)
    axes[0].axhline(y=0.01, color='#dc2626', linestyle='--', linewidth=1,
                     label='1% threshold')
    axes[0].set_xlabel("Principal Component")
    axes[0].set_ylabel("Explained Variance Ratio")
    axes[0].set_title("Scree Plot")
    axes[0].legend()

    # Cumulative EVR
    axes[1].plot(range(1, len(cumulative)+1), cumulative, 'o-',
                  color='#059669', ms=3, linewidth=1.5)
    colors = {'80%': '#f59e0b', '90%': '#f97316', '95%': '#dc2626', '99%': '#7c3aed'}
    for t, k in thresholds.items():
        label = f"{int(t*100)}%"
        c = colors[label]
        axes[1].axhline(y=t, color=c, linestyle='--', linewidth=1, label=f'{label} @ k={k}')
        axes[1].axvline(x=k, color=c, linestyle=':', linewidth=0.8, alpha=0.7)
    axes[1].set_xlabel("Number of Components")
    axes[1].set_ylabel("Cumulative Explained Variance")
    axes[1].set_title("Cumulative EVR")
    axes[1].legend(fontsize=8)

    # Eigenvalue spectrum
    axes[2].semilogy(range(1, len(eigenvalues)+1), eigenvalues, 'o-',
                      color='#7c3aed', ms=3, linewidth=1.5)
    axes[2].axhline(y=1.0, color='#dc2626', linestyle='--', linewidth=1,
                     label=f'Kaiser criterion (k={kaiser_k})')
    axes[2].set_xlabel("Principal Component")
    axes[2].set_ylabel("Eigenvalue (log scale)")
    axes[2].set_title("Eigenvalue Spectrum")
    axes[2].legend()

    plt.tight_layout()
    plt.show()

    print("Component selection guide:")
    for t, k in thresholds.items():
        print(f"  {int(t*100)}% variance: k={k}")
    print(f"  Kaiser criterion (eigenvalue > 1): k={kaiser_k}")

    return thresholds, kaiser_k

# Usage
thresholds, kaiser_k = plot_explained_variance(X_train, max_components=50)

Practical selection rules:

Use case	Recommended threshold
2D/3D visualization	Always 2 or 3, regardless of EVR
Preprocessing for supervised ML	95% EVR - start here, tune
Compression / storage	Balance EVR vs storage budget
Anomaly detection	99% EVR - need to capture rare normal patterns
Downstream clustering	80–90% EVR - noise removal is the priority
Neural network input preprocessing	95% EVR + whitening

PCA Whitening

Whitening scales each principal component to have unit variance. After projection to $z_k$ , each component is divided by its standard deviation:

$\tilde{z}_k = \frac{z_k}{\sqrt{\lambda_k + \epsilon}}$

This produces a spherical (isotropic) distribution: all components have variance 1 and are uncorrelated. The $\epsilon$ prevents division by zero for near-zero eigenvalues and acts as regularization.

When to use whitening:

Preprocessing for ICA (independent component analysis) - whitening is a required step
Before training neural networks on high-dimensional inputs where feature scales differ wildly
Before distance-based algorithms when you want equal weighting of all directions

When NOT to use whitening:

Whitening amplifies noise: low-variance components (often noise) are boosted by dividing by small $\sqrt{\lambda_k}$ . If your data has many noise dimensions, whitening makes things worse.

from sklearn.decomposition import PCA
import numpy as np

# sklearn supports whitening natively
pca_white = PCA(n_components=50, whiten=True)
X_whitened = pca_white.fit_transform(X_train_scaled)

print("Whitened component stds:", X_whitened.std(axis=0)[:5].round(4))
# Should all be ~1.0

# Without whitening, components have very different scales
pca_nowhite = PCA(n_components=50, whiten=False)
X_nowhite = pca_nowhite.fit_transform(X_train_scaled)
print("Unwhitened component stds:", X_nowhite.std(axis=0)[:5].round(4))
# Decreasing: first component has the most variance

# Manual whitening (equivalent)
Z = pca_nowhite.transform(X_train_scaled)
lambdas = pca_nowhite.explained_variance_   # eigenvalues = variances of components
epsilon = 1e-8
Z_whitened_manual = Z / np.sqrt(lambdas + epsilon)

Kernel PCA for Non-Linear Structure

Standard PCA finds linear subspaces. If your data lives on a curved manifold (concentric circles, swiss roll, non-linear clusters), PCA will project it onto a flat subspace that does not capture the manifold's structure.

Kernel PCA applies the kernel trick: instead of computing PCA in the $d$ -dimensional input space, it implicitly computes PCA in a (possibly infinite-dimensional) feature space $\phi: \mathbb{R}^d \to \mathcal{H}$ defined by a kernel function:

$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}$

The kernel matrix $\mathbf{K} \in \mathbb{R}^{n \times n}$ with $K_{ij} = K(x_i, x_j)$ is centered and then its top- $k$ eigenvectors give the projected coordinates.

Common kernels:

RBF (Gaussian): $K(x, y) = \exp(-\gamma \|x-y\|^2)$ - infinite-dimensional feature space, captures any smooth non-linearity
Polynomial: $K(x, y) = (\gamma x^T y + r)^p$ - polynomial features up to degree $p$
Sigmoid: $K(x, y) = \tanh(\gamma x^T y + r)$ - neural network connection

from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_moons, make_circles
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 4, figsize=(20, 9))

datasets = [
    ('Moons', make_moons(n_samples=400, noise=0.08, random_state=42)),
    ('Circles', make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=42)),
]

for row, (name, (X_d, y_d)) in enumerate(datasets):
    # Original data
    axes[row, 0].scatter(X_d[:, 0], X_d[:, 1], c=y_d, cmap='bwr', s=15, alpha=0.8)
    axes[row, 0].set_title(f"{name} - Original", fontsize=10)
    axes[row, 0].axis('off')

    # Linear PCA
    from sklearn.decomposition import PCA as LinearPCA
    X_lin = LinearPCA(n_components=1).fit_transform(X_d)
    axes[row, 1].scatter(X_lin, np.zeros_like(X_lin), c=y_d, cmap='bwr', s=15, alpha=0.8)
    axes[row, 1].set_title("Linear PCA (1D)\n- classes overlap", fontsize=9)
    axes[row, 1].axis('off')

    # RBF Kernel PCA
    kpca_rbf = KernelPCA(n_components=2, kernel='rbf', gamma=15,
                          fit_inverse_transform=True, random_state=42)
    X_rbf = kpca_rbf.fit_transform(X_d)
    axes[row, 2].scatter(X_rbf[:, 0], X_rbf[:, 1], c=y_d, cmap='bwr', s=15, alpha=0.8)
    axes[row, 2].set_title("RBF Kernel PCA\n- classes separate!", fontsize=9)
    axes[row, 2].axis('off')

    # Polynomial Kernel PCA
    kpca_poly = KernelPCA(n_components=2, kernel='poly', degree=3, gamma=1,
                           coef0=1, random_state=42)
    X_poly = kpca_poly.fit_transform(X_d)
    axes[row, 3].scatter(X_poly[:, 0], X_poly[:, 1], c=y_d, cmap='bwr', s=15, alpha=0.8)
    axes[row, 3].set_title("Poly Kernel PCA\ndegree=3", fontsize=9)
    axes[row, 3].axis('off')

plt.suptitle("Kernel PCA vs Linear PCA on Non-Linear Data", fontsize=14)
plt.tight_layout()
plt.show()

Kernel PCA limitations:

Requires computing the $n \times n$ kernel matrix - $O(n^2)$ memory and $O(n^2 d)$ time
Cannot transform new points without the training kernel matrix (unless fit_inverse_transform=True)
Impractical for $n > 10{,}000$ . For large-scale non-linear reduction, use UMAP instead.

Incremental PCA for Data That Doesn't Fit in Memory

Standard PCA loads all data into RAM to compute the covariance matrix. For very large datasets, use IncrementalPCA:

from sklearn.decomposition import IncrementalPCA
import numpy as np

def fit_incremental_pca(data_generator, n_components: int,
                         n_samples_total: int, batch_size: int = 1000):
    """
    Fit PCA incrementally on data too large to fit in memory.

    data_generator: yields batches of shape (batch_size, d)
    """
    ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size)

    # Process batches one at a time
    for batch_idx, X_batch in enumerate(data_generator):
        ipca.partial_fit(X_batch)
        if (batch_idx + 1) % 10 == 0:
            n_processed = (batch_idx + 1) * batch_size
            print(f"Processed {n_processed}/{n_samples_total} samples "
                  f"({100*n_processed/n_samples_total:.1f}%)")

    print(f"EVR (top {n_components}): {ipca.explained_variance_ratio_[:5].round(4)}")
    print(f"Cumulative EVR:         {ipca.explained_variance_ratio_.cumsum()[-1]:.4f}")

    return ipca


# Simulate a large dataset loaded in chunks
def large_data_generator(n_total: int, d: int, batch_size: int):
    n_batches = n_total // batch_size
    for _ in range(n_batches):
        yield np.random.randn(batch_size, d)

n_total = 500_000   # 500K samples
d = 200             # 200 features
batch_size = 2_000
n_components = 50

gen = large_data_generator(n_total, d, batch_size)
ipca = fit_incremental_pca(gen, n_components=n_components,
                            n_samples_total=n_total, batch_size=batch_size)

# Transform new batches
X_new_batch = np.random.randn(1000, d)
X_new_reduced = ipca.transform(X_new_batch)
print(f"Transformed shape: {X_new_reduced.shape}")

Biplot: Visualizing Variables in PCA Space

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

def pca_biplot(X: np.ndarray, y: np.ndarray = None,
               feature_names: list = None, label_names: dict = None):
    """
    Biplot: samples in PCA space + original feature loading arrows.
    Feature arrows show which original features drive each component.
    """
    scaler = StandardScaler()
    pca_2d = PCA(n_components=2, svd_solver='full')

    X_2d = pca_2d.fit_transform(scaler.fit_transform(X))

    # Loadings = how much each feature contributes to each component
    # pca_2d.components_ shape: (2, d)
    # loadings[i, :] = contribution of feature i to [PC1, PC2]
    loadings = pca_2d.components_.T   # (d, 2)

    if feature_names is None:
        feature_names = [f"X{i}" for i in range(X.shape[1])]

    fig, ax = plt.subplots(figsize=(11, 8))

    # Plot samples
    if y is not None:
        unique_labels = np.unique(y)
        colors = plt.cm.tab10(np.linspace(0, 1, len(unique_labels)))
        for label, color in zip(unique_labels, colors):
            mask = y == label
            lname = label_names.get(label, str(label)) if label_names else str(label)
            ax.scatter(X_2d[mask, 0], X_2d[mask, 1], color=color, s=15,
                        alpha=0.6, label=lname)
        ax.legend(title="Class", fontsize=8, loc='upper left')
    else:
        ax.scatter(X_2d[:, 0], X_2d[:, 1], s=10, alpha=0.4, color='#3b82f6')

    # Plot loading arrows (scaled for visibility)
    scale = 2.0 * np.abs(X_2d).max() / np.abs(loadings).max()
    for i, (name, loading) in enumerate(zip(feature_names, loadings)):
        ax.annotate(
            '', xy=(loading[0]*scale, loading[1]*scale), xytext=(0, 0),
            arrowprops=dict(arrowstyle='->', color='#dc2626', lw=1.5)
        )
        ax.text(loading[0]*scale*1.1, loading[1]*scale*1.1, name,
                fontsize=8, color='#7f1d1d', ha='center', va='center')

    ax.axhline(0, color='gray', linestyle='--', linewidth=0.5, alpha=0.5)
    ax.axvline(0, color='gray', linestyle='--', linewidth=0.5, alpha=0.5)

    pc1_evr = pca_2d.explained_variance_ratio_[0]
    pc2_evr = pca_2d.explained_variance_ratio_[1]
    ax.set_xlabel(f"PC1 ({pc1_evr:.1%} variance)", fontsize=12)
    ax.set_ylabel(f"PC2 ({pc2_evr:.1%} variance)", fontsize=12)
    ax.set_title(f"PCA Biplot - {pc1_evr + pc2_evr:.1%} total variance shown", fontsize=13)

    return fig, loadings

PCA Flow: Architecture Overview

Production: PCA Anomaly Detection

PCA is one of the most effective anomaly detection methods for tabular data: fit on normal samples only, compute reconstruction error for each new point, flag high-error points as anomalies.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, precision_recall_curve
import matplotlib.pyplot as plt

class PCAAnomalyDetector:
    """
    Anomaly detection via PCA reconstruction error.
    Fit on normal data, flag high reconstruction error as anomalous.
    """

    def __init__(self, n_components: int = None, variance_threshold: float = 0.95,
                 threshold_percentile: float = 99.5):
        """
        n_components: if None, auto-select to retain variance_threshold of variance
        threshold_percentile: percentile of training errors used as anomaly threshold
        """
        self.n_components = n_components
        self.variance_threshold = variance_threshold
        self.threshold_percentile = threshold_percentile
        self.pca = None
        self.scaler = StandardScaler()
        self.threshold_ = None
        self.train_error_mean_ = None
        self.train_error_std_  = None

    def fit(self, X_normal: np.ndarray) -> 'PCAAnomalyDetector':
        """Fit exclusively on known-normal data."""
        X_scaled = self.scaler.fit_transform(X_normal)

        if self.n_components is None:
            self.pca = PCA(n_components=self.variance_threshold,
                           svd_solver='full', random_state=42)
        else:
            self.pca = PCA(n_components=self.n_components,
                           svd_solver='full', random_state=42)

        self.pca.fit(X_scaled)

        # Compute training reconstruction errors
        errors = self._reconstruction_error(X_scaled)
        self.train_error_mean_ = errors.mean()
        self.train_error_std_  = errors.std()
        self.threshold_ = np.percentile(errors, self.threshold_percentile)

        print(f"PCA components: {self.pca.n_components_}")
        print(f"Variance retained: {self.pca.explained_variance_ratio_.sum():.4f}")
        print(f"Train error - mean: {self.train_error_mean_:.4f}, "
              f"std: {self.train_error_std_:.4f}")
        print(f"Anomaly threshold ({self.threshold_percentile}th pct): {self.threshold_:.4f}")

        return self

    def _reconstruction_error(self, X_scaled: np.ndarray) -> np.ndarray:
        """Per-sample reconstruction MSE."""
        Z = self.pca.transform(X_scaled)
        X_recon = self.pca.inverse_transform(Z)
        return np.mean((X_scaled - X_recon) ** 2, axis=1)

    def anomaly_score(self, X: np.ndarray) -> np.ndarray:
        """Higher = more anomalous (raw reconstruction error)."""
        return self._reconstruction_error(self.scaler.transform(X))

    def predict(self, X: np.ndarray) -> np.ndarray:
        """True = anomaly."""
        return self.anomaly_score(X) > self.threshold_

    def evaluate(self, X_normal_test: np.ndarray,
                  X_anomaly_test: np.ndarray) -> dict:
        """Evaluate detector on labeled test sets."""
        normal_scores  = self.anomaly_score(X_normal_test)
        anomaly_scores = self.anomaly_score(X_anomaly_test)

        y_true  = np.array([0]*len(X_normal_test) + [1]*len(X_anomaly_test))
        y_score = np.concatenate([normal_scores, anomaly_scores])

        auc = roc_auc_score(y_true, y_score)

        # Precision-recall curve
        precision, recall, thresholds = precision_recall_curve(y_true, y_score)

        fig, axes = plt.subplots(1, 2, figsize=(14, 5))

        # Score distribution
        axes[0].hist(normal_scores, bins=50, alpha=0.6, color='#3b82f6',
                      label=f'Normal (n={len(X_normal_test)})', density=True)
        axes[0].hist(anomaly_scores, bins=50, alpha=0.6, color='#ef4444',
                      label=f'Anomaly (n={len(X_anomaly_test)})', density=True)
        axes[0].axvline(self.threshold_, color='black', linestyle='--',
                         linewidth=2, label=f'Threshold={self.threshold_:.4f}')
        axes[0].set_xlabel("Reconstruction Error (MSE)")
        axes[0].set_ylabel("Density")
        axes[0].set_title(f"Score Distribution - AUC={auc:.4f}")
        axes[0].legend()

        # Precision-Recall
        axes[1].plot(recall, precision, 'o-', color='#7c3aed', ms=3, linewidth=2)
        axes[1].set_xlabel("Recall")
        axes[1].set_ylabel("Precision")
        axes[1].set_title("Precision-Recall Curve")
        axes[1].grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

        return {
            'auc_roc': auc,
            'normal_error_mean': normal_scores.mean(),
            'anomaly_error_mean': anomaly_scores.mean(),
            'detection_rate': (anomaly_scores > self.threshold_).mean(),
            'false_positive_rate': (normal_scores > self.threshold_).mean(),
        }


# Usage
from sklearn.datasets import make_classification
import numpy as np

# Simulate normal + anomalous data
X_all, _ = make_classification(n_samples=12000, n_features=30,
                                 n_informative=15, n_redundant=10, random_state=42)
X_normal_train = X_all[:10000]
X_normal_test  = X_all[10000:11000]

# Anomalies: samples from different distribution
rng = np.random.default_rng(42)
X_anomaly = rng.normal(loc=3.0, scale=0.5, size=(1000, 30))

detector = PCAAnomalyDetector(variance_threshold=0.95, threshold_percentile=99.5)
detector.fit(X_normal_train)
metrics = detector.evaluate(X_normal_test, X_anomaly)

print("\nEvaluation:")
for k, v in metrics.items():
    print(f"  {k}: {v:.4f}")

:::danger Fit PCA on Training Data Only NEVER call pca.fit() or pca.fit_transform() on validation or test data. Doing so leaks information about the test distribution into your model - a form of data leakage. The scaler and PCA must be fit exclusively on training data, then transform() applied to all other splits. Always use a sklearn Pipeline with fit() on train and transform() on test to enforce this automatically. :::

:::warning PCA Assumes Linear Structure PCA finds linear subspaces. If your data lives on a non-linear manifold (Swiss roll, concentric circles, word embeddings), PCA will project it poorly. Check: if the cumulative EVR is still below 70% after retaining many components, your data likely has non-linear structure. Solutions: Kernel PCA (small $n$ ), UMAP or t-SNE (visualization), or autoencoders (deep non-linear reduction). :::

YouTube Resources

Channel	Video Title	Why Watch
StatQuest with Josh Starmer	PCA Step-by-Step	Best visual explanation of eigenvectors and variance maximization
3Blue1Brown	Essence of Linear Algebra (eigenvectors)	Geometric intuition for eigendecomposition
ritvikmath	PCA from Scratch	Walks through the matrix algebra clearly
Andrej Karpathy (CS231n)	PCA and Whitening lecture	ML engineering perspective, whitening walkthrough

Interview Q&A

Q1: Explain PCA as an eigendecomposition problem. What do the eigenvalues represent?

PCA seeks the directions (principal components) along which the data has maximum variance. The sample covariance matrix $\mathbf{\Sigma} = \mathbf{X}_c^T\mathbf{X}_c/(n-1)$ captures pairwise feature co-variances. Its eigenvectors are the principal components - orthogonal directions in the $d$ -dimensional feature space. The $k$ -th eigenvector $\mathbf{w}_k$ is the direction of the $k$ -th most variance. The corresponding eigenvalue $\lambda_k$ is the variance of the data along $\mathbf{w}_k$ - i.e., the variance of the projected coordinates $\mathbf{X}_c \mathbf{w}_k$ . The explained variance ratio is $\lambda_k / \sum_j \lambda_j$ . Choosing the top- $k$ eigenvectors minimizes reconstruction error: $\|\mathbf{X}_c - \hat{\mathbf{X}}_c\|_F^2 = \sum_{j=k+1}^d \lambda_j$ .

Q2: Why is PCA computed via SVD rather than direct eigendecomposition of the covariance matrix?

Computing the covariance matrix $\mathbf{\Sigma} = \mathbf{X}_c^T\mathbf{X}_c/(n-1)$ squares the condition number of $\mathbf{X}_c$ : if $\mathbf{X}_c$ has condition number $\kappa$ , then $\mathbf{\Sigma}$ has $\kappa^2$ . For nearly-collinear features (common in sensor data, text embeddings), this causes catastrophic cancellation in the eigendecomposition. SVD decomposes $\mathbf{X}_c$ directly - $\mathbf{X}_c = \mathbf{U}\mathbf{S}\mathbf{V}^T$ - without squaring the condition number. The right singular vectors $\mathbf{V}$ are the principal components, and eigenvalues are $\lambda_k = s_k^2/(n-1)$ . Every production library (sklearn, scipy, MATLAB) uses SVD for this reason. For very large datasets with $k \ll d$ , randomized SVD (Halko et al.) computes approximate SVD in $O(ndk)$ instead of $O(\min(n,d)^3)$ .

Q3: You have 10,000-dimensional text embeddings and want to reduce to 64 dimensions. How do you do this efficiently and how do you evaluate quality?

Use PCA(n_components=64, svd_solver='randomized'). Randomized SVD computes approximate SVD in $O(n \cdot d \cdot 64)$ - roughly 150× faster than full SVD for $d=10{,}000$ . Always fit on the training corpus only, then transform() new documents. Evaluate quality: (1) explained variance ratio - if 64 components explain less than 50% of variance, the embedding space has more structure; consider more components. (2) Downstream task quality - measure accuracy or recall@K on a labeled validation set before and after reduction. (3) Reconstruction error on held-out normal points. Note: PCA is preferred over UMAP for production reduction because it is parametric - you can transform new documents at inference with a simple matrix multiply.

Q4: What does PCA whitening do and when should you use it?

Whitening divides each projected component by its standard deviation: $\tilde{z}_k = z_k / \sqrt{\lambda_k + \epsilon}$ . This produces components with unit variance - a spherical (isotropic) distribution. Use it for: (1) ICA - whitening is mathematically required before running ICA. (2) Neural network preprocessing when features have vastly different scales after PCA. (3) Distance-based methods where you want all components to contribute equally. Do NOT use it when data has many low-variance dimensions (often noise), because whitening amplifies noise by dividing by small eigenvalues. The $\epsilon$ regularization (default in sklearn's whiten=True) mitigates this but does not eliminate it.

Q5: What is the difference between standard PCA and Kernel PCA? When would you choose each?

Standard PCA finds linear subspaces - directions in the original input space. It fails if data has non-linear structure (concentric circles, swiss roll). Kernel PCA implicitly maps data to a high-dimensional feature space via $\phi(x)$ and applies PCA there. The key insight: you never compute $\phi(x)$ explicitly - you only need the kernel matrix $K_{ij} = k(x_i, x_j)$ . With RBF kernel $k(x,y) = \exp(-\gamma\|x-y\|^2)$ , it maps to an infinite-dimensional space and can separate any non-linearly-separable structure. Use standard PCA when: $n$ is large (millions), need new-point inference, data has approximately linear structure. Use Kernel PCA when: data has non-linear structure, $n \leq 5{,}000$ . For large-scale non-linear reduction, use UMAP - it scales better and preserves global structure.

Q6: A model you trained with PCA preprocessing is degrading in production. What do you check?

First, check if the data distribution has shifted: fit a new PCA on recent production data and compare the explained variance ratios and principal component directions to the original. If the directions have rotated significantly ( $\cos$ similarity between old and new $\mathbf{w}_1 < 0.9$ ), the data distribution has shifted and you need to refit. Second, monitor reconstruction errors over time: if average reconstruction error on production data increases, new patterns are appearing that the original PCA does not represent well - a sign of distribution shift. Third, check the scaler: if the mean or standard deviation of raw features has shifted, your scaler is miscalibrated. Solution: retrain PCA (and scaler) on a sliding window of recent data, and re-validate downstream model performance.

PCA Limitations: When to Use Something Else

PCA is a powerful baseline but fails in predictable ways:

1. Non-linear structure: PCA finds the best linear subspace. If the data lives on a curved manifold (swiss roll, concentric circles), the linear projection folds the manifold rather than unrolling it. Use Kernel PCA, t-SNE, UMAP, or autoencoders.

2. Discrete or categorical features: PCA assumes continuous numeric features. Never apply PCA directly to one-hot encoded categoricals or binary features - use MCA (Multiple Correspondence Analysis) or other appropriate methods.

3. Very high cardinality with sparse structure: For 100,000-dimensional sparse text features (bag-of-words), forming the $100{,}000 \times 100{,}000$ covariance matrix is impossible. Use truncated SVD (sklearn.decomposition.TruncatedSVD) which operates directly on the sparse matrix without centering.

4. Interpretability requirement: PCA components are linear combinations of all original features - they are "eigengenes" or "eigenfaces" but rarely interpretable in domain terms. If you need to identify which specific features matter most, use supervised feature selection or SHAP.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# ── Truncated SVD for sparse text data ───────────────────────────────────────
# TruncatedSVD ≈ PCA but does NOT center the data first
# Centering a sparse matrix makes it dense - destroys sparsity, kills memory
# For text: TruncatedSVD on TF-IDF = Latent Semantic Analysis (LSA)

corpus = [
    "machine learning and deep learning for NLP",
    "support vector machines for classification",
    "neural networks and backpropagation",
    "random forests and gradient boosting",
    "natural language processing with transformers",
]

tfidf = TfidfVectorizer(max_features=1000, sublinear_tf=True)
X_sparse = tfidf.fit_transform(corpus)  # sparse: (5, 1000)

print(f"Sparse matrix: {X_sparse.shape}, density={X_sparse.nnz / (X_sparse.shape[0]*X_sparse.shape[1]):.3f}")

# LSA: TruncatedSVD on TF-IDF
svd = TruncatedSVD(n_components=50, n_iter=7, random_state=42)
X_lsa = svd.fit_transform(X_sparse)   # (5, 50) - dense latent representation

print(f"LSA reduced shape: {X_lsa.shape}")
print(f"Explained variance: {svd.explained_variance_ratio_.sum():.4f}")

# For production information retrieval:
# 1. fit TruncatedSVD on training corpus
# 2. transform query TF-IDF vector with same SVD
# 3. cosine similarity in LSA space (faster + better than raw TF-IDF cosine)

PCA for Eigenfaces: A Classic Application

Eigenfaces (Turk & Pentland, 1991) is the canonical demonstration that PCA can learn meaningful visual representations without any labels. Each face image is treated as a high-dimensional vector; PCA finds the directions of maximum variance across the face image space - these "eigenfaces" form a basis for the face subspace.

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

def eigenfaces_demo(n_components: int = 150, n_images: int = 500):
    """
    Eigenfaces: PCA on face images.
    Each principal component = 'eigenface' (a ghostly average face direction).
    New faces reconstructed as linear combinations of eigenfaces.
    """
    # Load Labeled Faces in the Wild
    lfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
    X = lfw.data                    # (n_images, h*w) pixel matrices flattened
    y = lfw.target
    h, w = lfw.images.shape[1:]

    print(f"Dataset: {X.shape[0]} faces, {X.shape[1]} pixels each ({h}×{w})")

    # Fit PCA - find eigenfaces
    pca = PCA(n_components=n_components, svd_solver='randomized', random_state=42)
    X_reduced = pca.fit_transform(X)

    evr_cumulative = pca.explained_variance_ratio_.cumsum()
    print(f"{n_components} eigenfaces explain {evr_cumulative[-1]:.1%} of variance")

    # Plot first 20 eigenfaces
    eigenfaces = pca.components_.reshape((n_components, h, w))

    fig, axes = plt.subplots(4, 5, figsize=(14, 11))
    for i, ax in enumerate(axes.flat):
        ax.imshow(eigenfaces[i], cmap='gray', interpolation='nearest')
        ax.set_title(f"PC {i+1}", fontsize=9)
        ax.axis('off')
    plt.suptitle("Top 20 Eigenfaces (Principal Components of Face Images)", fontsize=13)
    plt.tight_layout()
    plt.show()

    # Visualize reconstruction at different numbers of components
    face_idx = 42
    original_face = X[face_idx]

    fig, axes = plt.subplots(1, 6, figsize=(18, 4))
    axes[0].imshow(original_face.reshape(h, w), cmap='gray')
    axes[0].set_title("Original")
    axes[0].axis('off')

    for i, k in enumerate([5, 20, 50, 100, 150]):
        pca_k = PCA(n_components=k, svd_solver='randomized', random_state=42)
        pca_k.fit(X)
        face_k = pca_k.inverse_transform(pca_k.transform([original_face]))
        axes[i+1].imshow(face_k.reshape(h, w), cmap='gray')
        axes[i+1].set_title(f"k={k}\n({pca_k.explained_variance_ratio_.sum():.0%})")
        axes[i+1].axis('off')

    plt.suptitle(f"Face Reconstruction at Different k Values", fontsize=12)
    plt.tight_layout()
    plt.show()

    return pca, X_reduced

# Run (requires internet connection to download LFW dataset)
# pca, X_reduced = eigenfaces_demo(n_components=150)

Production Monitoring: PCA Component Drift

In production, the principal components of your data can drift as the underlying distribution shifts. Monitoring this provides early warning before model performance degrades:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

class PCAdriftMonitor:
    """
    Monitor PCA component drift in production.
    Alert when principal directions or explained variance ratios shift significantly.
    """

    def __init__(self, n_components: int = 20, cosine_threshold: float = 0.95):
        self.n_components = n_components
        self.cosine_threshold = cosine_threshold
        self.baseline_pca_ = None
        self.baseline_scaler_ = None
        self.baseline_evr_ = None

    def fit_baseline(self, X: np.ndarray) -> 'PCAdriftMonitor':
        self.baseline_scaler_ = StandardScaler().fit(X)
        X_scaled = self.baseline_scaler_.transform(X)
        self.baseline_pca_ = PCA(n_components=self.n_components).fit(X_scaled)
        self.baseline_evr_ = self.baseline_pca_.explained_variance_ratio_.copy()
        print(f"Baseline fitted: {X.shape[0]} samples, "
              f"{self.n_components} components")
        print(f"Cumulative EVR: {self.baseline_evr_.sum():.4f}")
        return self

    def check_drift(self, X_new: np.ndarray, period_label: str = "Production") -> dict:
        X_scaled = self.baseline_scaler_.transform(X_new)

        # Fit PCA on new data
        new_pca = PCA(n_components=self.n_components).fit(X_scaled)

        # Compare principal components via cosine similarity
        cosine_sims = []
        for k in range(self.n_components):
            v_base = self.baseline_pca_.components_[k]
            v_new  = new_pca.components_[k]
            # Absolute cosine sim (sign flip is arbitrary)
            cos_sim = abs(np.dot(v_base, v_new) / (np.linalg.norm(v_base) * np.linalg.norm(v_new)))
            cosine_sims.append(cos_sim)

        min_cosine = min(cosine_sims[:5])  # focus on top 5 components

        # Compare EVR
        evr_shift = np.abs(new_pca.explained_variance_ratio_ - self.baseline_evr_).mean()

        # Reconstruction error shift
        X_recon_base = self.baseline_pca_.inverse_transform(
            self.baseline_pca_.transform(X_scaled)
        )
        recon_error = np.mean((X_scaled - X_recon_base) ** 2)

        drift_detected = (
            min_cosine < self.cosine_threshold or
            evr_shift > 0.02 or
            recon_error > 2 * (1 - self.baseline_evr_.sum())
        )

        report = {
            'period': period_label,
            'min_cosine_similarity_top5': min_cosine,
            'evr_shift_mean': evr_shift,
            'reconstruction_error': recon_error,
            'drift_detected': drift_detected,
        }

        status = "DRIFT ALERT" if drift_detected else "OK"
        print(f"[{status}] {period_label}: "
              f"cos_sim={min_cosine:.4f}, "
              f"evr_shift={evr_shift:.4f}, "
              f"recon_err={recon_error:.4f}")

        return report

:::tip 🎮 Interactive Playground

Visualize this concept: Try the PCA Explorer demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why This Exists - The Curse of Dimensionality​

Historical Context​

PCA as Variance Maximization - The Geometric Intuition​

The Covariance Matrix and Eigendecomposition​

SVD: The Numerically Stable Implementation​

Sklearn PCA: Production API​

Scree Plot and Choosing the Number of Components​

PCA Whitening​

Kernel PCA for Non-Linear Structure​

Incremental PCA for Data That Doesn't Fit in Memory​

Biplot: Visualizing Variables in PCA Space​

PCA Flow: Architecture Overview​

Production: PCA Anomaly Detection​

YouTube Resources​

Interview Q&A​

PCA Limitations: When to Use Something Else​

PCA for Eigenfaces: A Classic Application​

Production Monitoring: PCA Component Drift​