What is polynomial features?

Extend linear models to nonlinear patterns - polynomial basis expansion, curse of dimensionality, Mercer's theorem for valid kernels, RBF kernel via infinite-dimensional feature space, kernel ridge regression dual form, Nyström and random Fourier features for scalability.

How does kernel methods work in practice?

Polynomial Features and Kernel Methods covers polynomial features, kernel methods, kernel trick from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/linear-models/polynomial-features-and-kernels

What is the difference between polynomial features and kernel trick?

See the full breakdown at https://engineersofai.com/docs/ml/linear-models/polynomial-features-and-kernels

Polynomial Features and Kernel Methods

:::note Reading time: ~45 minutes | Interview relevance: High | Target roles: ML Engineer, Research Engineer, Applied Scientist, AI Engineer :::

The Real Interview Moment

You are interviewing at a biotech company. Their ML team has been using linear regression with raw features to predict protein binding affinity from amino acid sequences - R² of 0.38. The interviewer says: "We think there are interaction effects between amino acid positions. How would you capture those? And if we have 1000 amino acid positions, what is the problem with explicit polynomial features?"

The candidate who says "add polynomial features" gets the follow-up: "Degree-2 polynomial expansion of 1000 features gives how many features? What about degree-3? And how does the kernel trick solve this?"

The answer: $\binom{1002}{2} = 501{,}501$ features for degree-2. For degree-3: $\binom{1003}{3} = 167{,}167{,}000$ features. Even storing a single training matrix at degree-3 requires terabytes. The kernel trick computes the equivalent of working in that 167-million-dimensional space with just a dot product in the original space - $O(d)$ cost instead of $O(D)$ . The RBF kernel extends this to an infinite-dimensional feature space, with the same $O(d)$ cost.

Understanding this distinction - and the precise mathematical conditions that make it possible (Mercer's theorem) - separates engineers who "use kernels" from engineers who understand why they work.

Why This Exists - The Nonlinearity Problem

Linear models draw hyperplanes. A linear regression or logistic regression with raw features cannot capture curves, interaction effects, or decision boundaries that are not planar. For the protein problem, an amino acid at position 50 might interact synergistically with an amino acid at position 300 - their joint effect on binding affinity exceeds the sum of their individual effects. No linear model in the original features can represent this.

Two approaches handle nonlinearity while preserving linear machinery:

Explicit feature engineering: transform inputs (add $x_1^2$ , $x_1 x_2$ , $\sin(x_3)$ , etc.) and apply linear models in the expanded space. Standard linear model theory applies without modification.
Kernel methods: implicitly work in the expanded feature space without ever computing the features. The kernel function computes inner products in the high-dimensional space directly.

Both approaches are mathematically equivalent. They differ dramatically in computational cost when features are high-dimensional.

Historical Context

Polynomial features were used in statistics long before modern machine learning - astronomers fit polynomial curves to orbital data in the 19th century, and polynomial regression appears in Gauss's work on orbit prediction. The modern kernel framework emerged from support vector machines (Boser, Guyon, Vapnik 1992; Cortes and Vapnik 1995). Mercer's theorem itself dates to 1909, predating all of modern machine learning by decades. The key insight - that algorithms depending only on inner products can be "kernelized" - was explicitly formalized in the SVM context. Rahimi and Recht (2007) introduced random Fourier features ("random kitchen sinks"), enabling scalable approximation of kernel methods. Williams and Seeger (2001) developed Nyström approximation for kernel matrices.

Polynomial Feature Expansion

A degree- $p$ polynomial feature expansion maps $\mathbf{x} \in \mathbb{R}^d$ to a higher-dimensional vector $\phi(\mathbf{x}) \in \mathbb{R}^D$ containing all monomials up to degree $p$ .

For $d = 2$ , $p = 2$ :

$\phi(x_1, x_2) = \left(1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2\right)$

The model in the expanded space is still linear in $\phi(\mathbf{x})$ :

$\hat{y} = \mathbf{w}^\top \phi(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2$

This is nonlinear in $\mathbf{x}$ (it can represent a quadratic response surface) but linear in $\phi(\mathbf{x})$ . OLS, Ridge, LASSO - all linear model theory applies without modification to the expanded features.

The Curse of Dimensionality

The number of features after a degree- $p$ expansion of $d$ inputs (with intercept) is:

$D = \binom{d + p}{p}$

For large $d$ and small $p$ , this grows as $\approx d^p/p!$ .

$d$	$p=2$	$p=3$	$p=4$
10	66	286	1,001
100	5,151	176,851	4,598,126
500	125,751	20,917,501	Infeasible
1,000	501,501	167,167,000	Infeasible

For $d = 1000$ , $p = 3$ : 167 million features. The training matrix $\mathbf{X}\boldsymbol{\Phi} \in \mathbb{R}^{n \times 167M}$ requires - for $n = 10{,}000$ training points - approximately $167M \times 10{,}000 \times 4$ bytes = 6.7 TB of storage. Each gradient step costs $O(n \cdot D) = O(1.67 \times 10^{12})$ operations. This is completely infeasible.

The kernel trick avoids all of this.

Polynomial Features in Practice

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.datasets import make_regression


# ── Degree selection via cross-validation on 1D nonlinear data ────────────────
np.random.seed(42)
n = 200
X_1d = np.sort(np.random.uniform(-3, 3, n)).reshape(-1, 1)
# True function: quadratic + linear
y_1d = (0.5 * X_1d.ravel()**2 - 1.5 * X_1d.ravel()
        + np.random.randn(n) * 0.8)

print("Polynomial degree selection via 5-fold CV R²:")
print(f"{'Degree':>8} | {'CV R²':>10} | {'CV Std':>8} | {'N features':>12}")
print("-" * 48)

for degree in [1, 2, 3, 5, 8, 12, 20]:
    pipe = Pipeline([
        ("poly",   PolynomialFeatures(degree=degree, include_bias=True)),
        ("scaler", StandardScaler()),
        ("ridge",  Ridge(alpha=0.01)),
    ])
    cv_scores = cross_val_score(pipe, X_1d, y_1d, cv=5, scoring="r2")
    n_feats = PolynomialFeatures(degree=degree).fit_transform(X_1d).shape[1]
    print(f"{degree:>8} | {cv_scores.mean():>10.4f} | "
          f"{cv_scores.std():>8.4f} | {n_feats:>12}")

# ── Visualize underfitting vs overfitting ─────────────────────────────────────
fig, axes = plt.subplots(1, 4, figsize=(17, 4))
X_plot = np.linspace(-3.5, 3.5, 600).reshape(-1, 1)

for ax, degree in zip(axes, [1, 2, 6, 15]):
    pipe = Pipeline([
        ("poly",  PolynomialFeatures(degree=degree)),
        ("sc",    StandardScaler()),
        ("lr",    LinearRegression()),
    ])
    pipe.fit(X_1d, y_1d)
    cv_r2 = cross_val_score(pipe, X_1d, y_1d, cv=5, scoring="r2").mean()

    ax.scatter(X_1d, y_1d, s=12, alpha=0.45, color="steelblue")
    ax.plot(X_plot, pipe.predict(X_plot), "r-", linewidth=2)
    ax.set_title(f"Degree {degree}\nCV R²={cv_r2:.3f}")
    ax.set_ylim(-8, 10)
    ax.grid(True, alpha=0.2)

plt.suptitle("Polynomial Degree: Underfitting (deg 1) → Optimal (deg 2) → Overfitting (deg 15)",
             y=1.02)
plt.tight_layout()
plt.savefig("polynomial_overfitting.png", dpi=150, bbox_inches="tight")
plt.show()


# ── Feature count explosion for higher dimensions ─────────────────────────────
print("\nFeature count explosion:")
print(f"{'d':>6} | {'p=2':>10} | {'p=3':>15} | {'p=4':>18}")
from math import comb
for d in [10, 50, 100, 500, 1000]:
    p2 = comb(d+2, 2)
    p3 = comb(d+3, 3)
    p4 = comb(d+4, 4)
    print(f"{d:>6} | {p2:>10,} | {p3:>15,} | {p4:>18,}")

The Kernel Trick

The Core Insight

Many ML algorithms only need inner products $\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$ between training examples - never the feature vectors themselves. Ridge regression, SVM, PCA, K-means, Gaussian processes - all can be formulated using only pairwise inner products. If we can compute $\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$ directly without first computing $\phi(\mathbf{x}_i)$ , we get the benefit of the high-dimensional feature space at the cost of operating in the original space.

A kernel function does exactly this:

$k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$

Polynomial Kernel - Explicit Proof

For $d = 2$ , degree $p = 2$ , constant $c = 1$ :

$k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + 1)^2 = (x_1 z_1 + x_2 z_2 + 1)^2$

Expanding:

$= x_1^2 z_1^2 + x_2^2 z_2^2 + 1 + 2x_1 z_1 x_2 z_2 + 2x_1 z_1 + 2x_2 z_2$

$= \phi(\mathbf{x})^\top\phi(\mathbf{z}) \quad \text{where} \quad \phi(\mathbf{x}) = (x_1^2,\; x_2^2,\; 1,\; \sqrt{2}x_1 x_2,\; \sqrt{2}x_1,\; \sqrt{2}x_2)$

The $\sqrt{2}$ scaling factors ensure the inner product matches. Computing the kernel: $O(d)$ - one dot product plus one addition and squaring. Computing $\phi$ then the dot product: $O(D) = O(d^2)$ . For $d = 1000$ : 1,000 vs 1,000,000 operations per pair.

In general: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + c)^p$ implicitly works with all degree- $p$ monomials in $d$ variables, a $D$ -dimensional space, in $O(d)$ time.

Common Kernels

Linear kernel: $k(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top\mathbf{z}$

The feature map is the identity: $\phi(\mathbf{x}) = \mathbf{x}$ . No nonlinearity. Kernelizing a linear model recovers the original formulation - useful for code structure and to enable the dual formulation when $d \gg n$ .

Polynomial kernel: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + c)^p$

Implicit features: all monomials of degree exactly $p$ (homogeneous) or up to degree $p$ (inhomogeneous, $c > 0$ ). Standard for NLP (character $n$ -grams, document similarity) where all degree- $p$ term interactions matter.

RBF / Gaussian kernel: $k(\mathbf{x}, \mathbf{z}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2}\right) = \exp(-\gamma\|\mathbf{x} - \mathbf{z}\|^2)$

The implicit feature space is infinite-dimensional. To see this, use the Taylor expansion of $\exp$ . With $\gamma = 1$ and writing $\mathbf{x}^\top\mathbf{z} = \sum_k x_k z_k$ :

$\exp(-\|\mathbf{x}\|^2)\exp(-\|\mathbf{z}\|^2)\exp(2\mathbf{x}^\top\mathbf{z}) = \exp(-\|\mathbf{x}\|^2)\exp(-\|\mathbf{z}\|^2) \sum_{k=0}^{\infty} \frac{(2\mathbf{x}^\top\mathbf{z})^k}{k!}$

Each term $(2\mathbf{x}^\top\mathbf{z})^k/k!$ corresponds to degree- $k$ polynomial interactions, scaled. Summing from $k=0$ to $\infty$ gives an infinite polynomial feature space. Yet computing $k(\mathbf{x}, \mathbf{z})$ costs only $O(d)$ .

The RBF kernel also has an intuitive geometric interpretation: it measures similarity via distance. $k(\mathbf{x}, \mathbf{z}) = 1$ when $\mathbf{x} = \mathbf{z}$ (identical points), and $k(\mathbf{x}, \mathbf{z}) \to 0$ as $\|\mathbf{x} - \mathbf{z}\| \to \infty$ (infinitely dissimilar).

Matern kernel (used in Gaussian processes):

$k_\nu(\mathbf{x}, \mathbf{z}) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\|\mathbf{x}-\mathbf{z}\|}{\ell}\right)^\nu K_\nu\!\left(\frac{\sqrt{2\nu}\|\mathbf{x}-\mathbf{z}\|}{\ell}\right)$

where $K_\nu$ is the modified Bessel function of the second kind, $\ell$ is the length scale, and $\nu$ controls smoothness. At $\nu = 1/2$ : Laplacian kernel $\exp(-\|\mathbf{x}-\mathbf{z}\|/\ell)$ . At $\nu \to \infty$ : RBF kernel. Matern-3/2 and Matern-5/2 are popular in GP regression for physically plausible smoothness assumptions.

Mercer's Theorem: What Makes a Valid Kernel

Not every symmetric function $k(\mathbf{x}, \mathbf{z})$ corresponds to an inner product in some feature space. Mercer's theorem (1909) characterizes exactly which functions do.

Mercer's theorem: A symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a valid (Mercer) kernel - i.e., it corresponds to an inner product $\langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle_\mathcal{H}$ in some Hilbert space $\mathcal{H}$ - if and only if the Gram matrix $\mathbf{K}$ with $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is positive semi-definite for every finite set of input points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ .

Equivalently: $k$ is a valid kernel $\iff$ for all finite sets of inputs and all $\boldsymbol{\alpha} \in \mathbb{R}^n$ :

$\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$

This is the PSD condition. It ensures that we are genuinely computing inner products, not some other bilinear form, and it guarantees that optimization problems using the kernel have unique solutions (convexity from PSD Gram matrices).

Kernel algebra - constructing new valid kernels from known ones:

If $k_1$ and $k_2$ are valid kernels, then so are:

$k_1 + k_2$ (sum of PSD matrices is PSD)
$k_1 \cdot k_2$ (Schur/Hadamard product of PSD matrices is PSD)
$c \cdot k_1$ for any $c > 0$ (positive scaling)
$f(k_1)$ where $f$ is a polynomial with non-negative coefficients
$\exp(k_1)$ (exponential of any PSD kernel is PSD)

This algebra lets you build domain-specific kernels by combining simpler ones.

# ── Kernel validity check: is the Gram matrix PSD? ────────────────────────────

def gram_matrix(kernel_fn, X: np.ndarray) -> np.ndarray:
    """Compute n×n Gram matrix for a kernel function."""
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_fn(X[i], X[j])
    return K


def is_psd(K: np.ndarray, tol: float = 1e-9) -> tuple[bool, float]:
    """Check PSD by computing minimum eigenvalue."""
    eigvals = np.linalg.eigvalsh(K)
    return eigvals.min() >= -tol, float(eigvals.min())


np.random.seed(42)
X_check = np.random.randn(40, 3)

# Define common kernels
kernels = {
    "Linear   k(x,z) = xᵀz":
        lambda x, z: x @ z,
    "Poly-3   k(x,z) = (xᵀz+1)³":
        lambda x, z: (x @ z + 1) ** 3,
    "RBF      k(x,z) = exp(-0.5‖x-z‖²)":
        lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2)),
    "Laplacian k(x,z) = exp(-‖x-z‖₁)":
        lambda x, z: np.exp(-np.sum(np.abs(x - z))),
    "Sigmoid  k(x,z) = tanh(xᵀz - 1)  [NOT always PSD]":
        lambda x, z: np.tanh(x @ z - 1.0),
    "Sum RBF+Poly [valid composition]":
        lambda x, z: (np.exp(-0.5 * np.sum((x - z) ** 2))
                      + (x @ z + 1) ** 2),
}

print("Mercer's theorem - Gram matrix PSD check:")
print("-" * 70)
for name, kfn in kernels.items():
    K = gram_matrix(kfn, X_check)
    valid, min_eig = is_psd(K)
    status = "VALID  " if valid else "INVALID"
    print(f"  {status}  min_eig={min_eig:+.4f}  {name}")

Kernel Ridge Regression: Dual Formulation

From Primal to Dual

Primal ridge regression (standard form):

$\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X} + n\lambda\mathbf{I}_d)^{-1}\mathbf{X}^\top\mathbf{y} \quad \text{- requires inverting a } d \times d \text{ matrix}$

Representer theorem: for any regularized empirical risk minimization over a RKHS (Reproducing Kernel Hilbert Space), the optimal solution lies in the span of the training data:

$\hat{\mathbf{w}} = \mathbf{X}^\top\boldsymbol{\alpha} = \sum_{i=1}^n \alpha_i \mathbf{x}_i$

Substituting $\mathbf{w} = \mathbf{X}^\top\boldsymbol{\alpha}$ into the ridge objective and differentiating:

$\hat{\boldsymbol{\alpha}} = (\mathbf{X}\mathbf{X}^\top + n\lambda\mathbf{I}_n)^{-1}\mathbf{y} = (\mathbf{K} + n\lambda\mathbf{I}_n)^{-1}\mathbf{y}$

where $\mathbf{K} = \mathbf{X}\mathbf{X}^\top$ is the $n \times n$ Gram matrix with $K_{ij} = \mathbf{x}_i^\top\mathbf{x}_j$ .

Prediction for a new point $\mathbf{x}_*$ :

$\hat{y}_* = \hat{\mathbf{w}}^\top\mathbf{x}_* = \boldsymbol{\alpha}^\top\mathbf{X}\mathbf{x}_* = \sum_{i=1}^n \alpha_i \mathbf{x}_i^\top\mathbf{x}_*$

Kernelization: replace every inner product $\mathbf{x}_i^\top\mathbf{x}_j$ with $k(\mathbf{x}_i, \mathbf{x}_j)$ :

$\boxed{\hat{\boldsymbol{\alpha}} = (K + n\lambda\mathbf{I})^{-1}\mathbf{y}, \qquad \hat{y}_* = \sum_{i=1}^n \alpha_i k(\mathbf{x}_i, \mathbf{x}_*) = \mathbf{k}_*^\top\hat{\boldsymbol{\alpha}}}$

where $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ and $\mathbf{k}_* = [k(\mathbf{x}_1, \mathbf{x}_*), \ldots, k(\mathbf{x}_n, \mathbf{x}_*)]^\top$ .

This dual form works with any valid kernel - including RBF, which implicitly uses an infinite-dimensional feature map. The primal problem would be $\infty \times \infty$ ; the dual is always $n \times n$ .

Primal vs Dual: When to Use Which

Criterion	Primal	Dual
System to invert	$d \times d$	$n \times n$
Prefer when	$n \gg d$	$d \gg n$ or $d = \infty$ (kernel)
Kernel required	No	Yes
Training complexity	$O(nd^2 + d^3)$	$O(n^2d + n^3)$
Prediction complexity	$O(d)$	$O(nd)$ (sum over training)

The crossover: if $n = d$ , both are equivalent. If using kernels (especially RBF), always use the dual - there is no choice, since the primal is infinite-dimensional.

Full NumPy Kernel Ridge Regression

class KernelRidgeRegression:
    """
    Kernel Ridge Regression via the dual formulation.
    Solves: α = (K + n·λ·I)⁻¹ y
    Predicts: ŷ(x*) = kₓ*ᵀ α where kₓ* = [k(x1,x*),...,k(xn,x*)]

    Supports: 'linear', 'polynomial', 'rbf', or any callable kernel.
    """

    def __init__(self, kernel: str = "rbf", alpha: float = 1.0,
                 gamma: float = 1.0, degree: int = 3, coef0: float = 1.0):
        self.kernel  = kernel
        self.alpha   = alpha    # regularization strength λ
        self.gamma   = gamma    # RBF: 1/(2σ²); Poly: scaling
        self.degree  = degree
        self.coef0   = coef0
        self._alpha_dual: np.ndarray | None = None
        self._X_train: np.ndarray | None    = None

    def _k(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute kernel matrix between X1 [n1×d] and X2 [n2×d]."""
        if self.kernel == "linear":
            return X1 @ X2.T

        elif self.kernel == "polynomial":
            return (X1 @ X2.T + self.coef0) ** self.degree

        elif self.kernel == "rbf":
            # Vectorized squared Euclidean distances (no for-loop)
            # ||xi - xj||² = ||xi||² - 2 xi·xj + ||xj||²
            X1_sq = np.sum(X1 ** 2, axis=1, keepdims=True)   # [n1, 1]
            X2_sq = np.sum(X2 ** 2, axis=1, keepdims=True)   # [n2, 1]
            sq_dists = X1_sq + X2_sq.T - 2.0 * (X1 @ X2.T)  # [n1, n2]
            return np.exp(-self.gamma * sq_dists)

        elif callable(self.kernel):
            n1, n2 = len(X1), len(X2)
            K = np.zeros((n1, n2))
            for i in range(n1):
                for j in range(n2):
                    K[i, j] = self.kernel(X1[i], X2[j])
            return K

        else:
            raise ValueError(f"Unknown kernel: {self.kernel!r}")

    def fit(self, X: np.ndarray, y: np.ndarray) -> "KernelRidgeRegression":
        """Solve dual: α = (K + n·λ·I)⁻¹ y via lstsq for numerical stability."""
        self._X_train = X.copy()
        n = len(X)
        K = self._k(X, X)
        # Note: sklearn uses (K + alpha*n*I); we follow that convention
        K_reg = K + self.alpha * n * np.eye(n)
        self._alpha_dual, *_ = np.linalg.lstsq(K_reg, y, rcond=None)
        return self

    def predict(self, X_new: np.ndarray) -> np.ndarray:
        """ŷ = K(X_new, X_train) · α_dual"""
        K_new = self._k(X_new, self._X_train)  # [n_new, n_train]
        return K_new @ self._alpha_dual

    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return float(1.0 - ss_res / ss_tot)


# ── Demo on sinusoidal data ───────────────────────────────────────────────────
np.random.seed(42)
X_nl = np.sort(np.random.uniform(-3, 3, 300)).reshape(-1, 1)
y_nl = np.sin(X_nl.ravel()) + 0.15 * np.random.randn(300)

X_tr, X_te, y_tr, y_te = train_test_split(X_nl, y_nl, test_size=0.2,
                                            random_state=42)
sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc  = sc.transform(X_te)

print("\nKernel Ridge Regression on sin(x) + noise:")
print("-" * 50)
configs = [
    ("Linear",              dict(kernel="linear")),
    ("Polynomial (d=3)",    dict(kernel="polynomial", degree=3)),
    ("RBF (γ=0.1)",         dict(kernel="rbf", gamma=0.1)),
    ("RBF (γ=1.0)",         dict(kernel="rbf", gamma=1.0)),
    ("RBF (γ=10.0)",        dict(kernel="rbf", gamma=10.0)),
]
for name, kwargs in configs:
    krr = KernelRidgeRegression(alpha=0.1, **kwargs)
    krr.fit(X_tr_sc, y_tr)
    r2 = krr.score(X_te_sc, y_te)
    print(f"  {name:<22s}: Test R² = {r2:.4f}")

# ── Visualize the three regimes of RBF bandwidth ─────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
X_plot = np.linspace(-3.5, 3.5, 600).reshape(-1, 1)
X_plot_sc = sc.transform(X_plot)

for ax, (gamma, title) in zip(axes, [
    (0.05, "γ=0.05 (over-smooth, underfitting)"),
    (1.0,  "γ=1.0  (good fit)"),
    (15.0, "γ=15.0 (over-sharp, overfitting)"),
]):
    krr = KernelRidgeRegression(alpha=0.1, kernel="rbf", gamma=gamma)
    krr.fit(X_tr_sc, y_tr)
    r2_te = krr.score(X_te_sc, y_te)

    ax.scatter(X_tr, y_tr, s=10, alpha=0.4, color="steelblue")
    ax.scatter(X_te, y_te, s=10, alpha=0.4, color="orange")
    ax.plot(X_plot, krr.predict(X_plot_sc), "r-", linewidth=2,
            label=f"KRR R²={r2_te:.3f}")
    ax.plot(X_plot, np.sin(X_plot), "g--", linewidth=1.2,
            alpha=0.7, label="True sin(x)")
    ax.set_title(title)
    ax.legend(fontsize=7)
    ax.set_ylim(-1.8, 1.8)
    ax.grid(True, alpha=0.2)

plt.tight_layout()
plt.savefig("kernel_ridge_gamma.png", dpi=150, bbox_inches="tight")
plt.show()

Sklearn KernelRidge and Grid Search

from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# ── Multivariate nonlinear regression ─────────────────────────────────────────
np.random.seed(42)
n_m = 600
X_m = np.random.randn(n_m, 5)
y_m = (np.sin(X_m[:, 0]) * np.cos(X_m[:, 1])
       + X_m[:, 2] ** 2 - X_m[:, 3] * X_m[:, 4]
       + np.random.randn(n_m) * 0.2)

X_tr_m, X_te_m, y_tr_m, y_te_m = train_test_split(
    X_m, y_m, test_size=0.2, random_state=42)
sc_m = StandardScaler()
X_tr_m_sc = sc_m.fit_transform(X_tr_m)
X_te_m_sc  = sc_m.transform(X_te_m)

# Baseline: linear Ridge
lr_base = Ridge(alpha=1.0).fit(X_tr_m_sc, y_tr_m)
print(f"Linear Ridge baseline: Test R²={lr_base.score(X_te_m_sc, y_te_m):.4f}")

# KernelRidge with grid search over alpha and gamma
param_grid = {
    "alpha": [0.001, 0.01, 0.1, 1.0, 10.0],
    "gamma": [0.05, 0.1, 0.5, 1.0, 2.0, 5.0],
}
gs = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid, cv=5, scoring="r2", n_jobs=-1, verbose=0
)
gs.fit(X_tr_m_sc, y_tr_m)
print(f"KernelRidge (RBF, grid search):")
print(f"  Best params: {gs.best_params_}")
print(f"  CV R²: {gs.best_score_:.4f}")
print(f"  Test R²: {gs.best_estimator_.score(X_te_m_sc, y_te_m):.4f}")

Nyström Approximation: Scaling Kernel Methods

Kernel ridge regression has two scalability bottlenecks: computing the Gram matrix ( $O(n^2 d)$ time, $O(n^2)$ memory) and inverting it ( $O(n^3)$ time). For $n = 100{,}000$ and $d = 100$ : the Gram matrix alone requires $100{,}000^2 \times 8$ bytes = 80 GB.

Nyström approximation (Williams and Seeger, 2001): choose $m \ll n$ landmark points $\{\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m\}$ (typically a random subset of training points). The full $n \times n$ Gram matrix is approximated as:

$\mathbf{K} \approx \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn} = \widetilde{\mathbf{K}}$

where $\mathbf{K}_{nm} \in \mathbb{R}^{n \times m}$ is the kernel matrix between all $n$ training points and the $m$ landmarks, and $\mathbf{K}_{mm} \in \mathbb{R}^{m \times m}$ is the kernel matrix between landmarks. Storage: $O(nm)$ instead of $O(n^2)$ . The approximation quality improves as $m$ increases - at $m = n$ , it is exact.

Random Fourier Features (Rahimi and Recht, 2007): approximates the RBF kernel via random projections. By Bochner's theorem, a stationary kernel $k(\mathbf{x}, \mathbf{z}) = k(\mathbf{x} - \mathbf{z})$ can be written as the expectation of $z_\omega(\mathbf{x})z_\omega(\mathbf{z})$ over random frequencies $\omega \sim p(\omega)$ :

$k(\mathbf{x}, \mathbf{z}) = \mathbb{E}_\omega\left[\cos(\omega^\top\mathbf{x} + b)\cos(\omega^\top\mathbf{z} + b)\right] \cdot 2$

Sample $D$ random frequencies $\omega_j \sim \mathcal{N}(0, \gamma \mathbf{I})$ and random phases $b_j \sim \text{Uniform}(0, 2\pi)$ . Map each input to:

$z(\mathbf{x}) = \sqrt{\frac{2}{D}}\left[\cos(\omega_1^\top\mathbf{x} + b_1), \ldots, \cos(\omega_D^\top\mathbf{x} + b_D)\right] \in \mathbb{R}^D$

Then $k(\mathbf{x}, \mathbf{z}) \approx z(\mathbf{x})^\top z(\mathbf{z})$ . With $D$ random features, fit a standard linear model on $z(\mathbf{X})$ - cost $O(nD)$ instead of $O(n^2)$ .

from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
import time

print("\nKernel approximation quality vs computation:")
print(f"{'Method':35s} | {'Test R²':>8} | {'Time (s)':>10}")
print("-" * 60)

# Exact kernel ridge (reference)
t0 = time.time()
krr_exact = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5)
krr_exact.fit(X_tr_m_sc, y_tr_m)
r2_exact = krr_exact.score(X_te_m_sc, y_te_m)
print(f"{'Exact KRR':35s} | {r2_exact:>8.4f} | {time.time()-t0:>10.3f}")

# Nyström approximation with varying m
for m in [50, 100, 200, 500]:
    t0 = time.time()
    pipe = Pipeline([
        ("nys",   Nystroem(kernel="rbf", gamma=0.5,
                           n_components=m, random_state=42)),
        ("ridge", Ridge(alpha=0.1)),
    ])
    pipe.fit(X_tr_m_sc, y_tr_m)
    r2 = pipe.score(X_te_m_sc, y_te_m)
    print(f"{'Nyström (m='+str(m)+')':35s} | {r2:>8.4f} | {time.time()-t0:>10.3f}")

# Random Fourier Features
for D in [100, 500, 1000]:
    t0 = time.time()
    pipe = Pipeline([
        ("rff",   RBFSampler(gamma=0.5, n_components=D, random_state=42)),
        ("ridge", Ridge(alpha=0.1)),
    ])
    pipe.fit(X_tr_m_sc, y_tr_m)
    r2 = pipe.score(X_te_m_sc, y_te_m)
    print(f"{'RFF (D='+str(D)+')':35s} | {r2:>8.4f} | {time.time()-t0:>10.3f}")

Kernel Selection Guide

print("\nKernel Selection Guide:")
print("=" * 70)

guide = [
    ("Linear  k(x,z) = xᵀz",
     "Feature space is already the right space; high-d text (d≫n works via dual)"),
    ("Polynomial  k(x,z) = (xᵀz+c)ᵖ",
     "Known interaction degree; NLP n-gram similarity; computer vision patch matching"),
    ("RBF  k(x,z) = exp(-γ‖x-z‖²)",
     "Default first choice; local similarity; tabular data after standardization"),
    ("Laplacian  k(x,z) = exp(-γ‖x-z‖₁)",
     "More robust to outliers in individual dims; sparser locality than RBF"),
    ("Matern-5/2",
     "GP regression for physical processes; assumption of finite differentiability"),
    ("String kernel  k(s,t) = #common substrings",
     "DNA/protein sequences; text classification without bag-of-words encoding"),
]

for name, use_when in guide:
    print(f"  {name}")
    print(f"    Use when: {use_when}\n")

print("RBF γ parameter guide:")
print("  Start with: γ = 1 / (d · Var[X])  ('scale' in sklearn)")
print("  Small γ → broad similarity, smooth function (risk: underfitting)")
print("  Large γ → narrow similarity, spiky function (risk: overfitting)")
print("  Always cross-validate γ jointly with the regularization α")

print("\nSVM vs KRR - same kernel, different loss:")
print("  KRR   loss = MSE        → regression, smooth solution")
print("  SVM   loss = hinge      → classification/regression, sparse support vectors")
print("  Both solve a quadratic optimization in the dual; both kernelizable")

Common Mistakes

:::danger Not scaling features before kernel methods

# WRONG - different feature scales distort RBF distances
krr = KernelRidge(kernel="rbf", gamma=0.5).fit(X_raw, y)

# CORRECT - always standardize first
sc = StandardScaler()
X_sc = sc.fit_transform(X_train)
krr = KernelRidge(kernel="rbf", gamma=0.5).fit(X_sc, y)

The RBF kernel computes $\exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)$ . A feature with variance 1000 contributes $1000\times$ more to the Euclidean distance than a feature with variance 1, effectively ignoring the latter. After StandardScaler, all features have unit variance and contribute equally. Apply transform() (not fit_transform()) to test data. :::

:::danger Kernel ridge regression with large n - memory and compute explosion For $n = 10{,}000$ : Gram matrix = $10{,}000^2 \times 8$ bytes = 800 MB. For $n = 100{,}000$ : 80 GB. For $n > 10{,}000$ , use Nyström (sklearn.kernel_approximation.Nystroem + Ridge) or Random Fourier Features (RBFSampler + Ridge). These scale as $O(nm)$ instead of $O(n^2)$ and usually achieve 95%+ of the performance at a fraction of the cost. :::

:::warning High-degree polynomial features without regularization Degree- $p$ polynomial features create $O(d^p/p!)$ features. Without regularization, the model perfectly interpolates training data (R² = 1.0 if $D \geq n$ ) and fails catastrophically on test data. Always combine polynomial features with Ridge or Lasso regularization, and cross-validate both the degree and the regularization strength jointly in a Pipeline. :::

:::warning Choosing γ for RBF without cross-validation Too large a $\gamma$ : the model effectively interpolates training points (RBF kernel becomes an identity matrix) - perfect training fit, terrible test performance. Too small: the model is over-smooth, effectively linear. Always cross-validate over a logarithmic grid: param_grid = {"gamma": np.logspace(-4, 2, 30)}. sklearn's gamma="scale" is a good starting point: $\gamma = 1/(d \cdot \text{Var}[\mathbf{X}])$ . :::

YouTube Resources

Resource	Channel	Why Watch
The Kernel Trick Explained	StatQuest with Josh Starmer	Best visual introduction to kernel functions and the kernel trick - start here
MIT 6.034: Kernel Machines	MIT OpenCourseWare	Mercer's theorem, SVM, and kernel methods - rigorous mathematical treatment
Random Features for Kernel Machines (Rahimi & Recht)	NIPS Conference Talk	The random Fourier features paper - the scalability breakthrough for large kernels
Gaussian Processes from the Kernel Perspective	Richard Turner (Cambridge)	How kernels define function spaces - connects KRR to Gaussian process regression
Polynomial Features and Overfitting	StatQuest with Josh Starmer	Visual intuition for polynomial expansion and the bias-variance tradeoff with degree

Interview Q&A

Q1: Explain the kernel trick. Why is it computationally valuable?

The kernel trick exploits the fact that many ML algorithms need only inner products $\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$ between feature vectors - never the vectors themselves. A kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$ computes this inner product implicitly: for the polynomial kernel $(\mathbf{x}^\top\mathbf{z} + c)^p$ , computing $k$ costs $O(d)$ (one dot product plus scalar operations), while computing $\phi$ then its dot product costs $O(D) = O(d^p/p!)$ . For $d = 1000$ , $p = 3$ : 1,000 vs 167,000,000 operations per pair. For the RBF kernel, $D = \infty$ , yet computing $k$ still costs $O(d)$ . The kernel trick gives infinite expressiveness at the cost of a single dot product in the original space.

Q2: State Mercer's theorem. How do you verify a function is a valid kernel?

Mercer's theorem: a symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a valid (Mercer) kernel - corresponds to an inner product in some Hilbert space - if and only if the Gram matrix $\mathbf{K}$ with $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is positive semi-definite (all eigenvalues $\geq 0$ ) for every finite set of input points.

Verification strategies: (1) Show $k(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^\top\phi(\mathbf{z})$ for an explicit $\phi$ (constructive proof for polynomial kernel); (2) Compute eigenvalues of the Gram matrix on a sample of points - all should be $\geq 0$ ; (3) Apply kernel algebra: sums, products, positive scalar multiples, and exponentials of valid kernels are valid. The sigmoid kernel $\tanh(\alpha\mathbf{x}^\top\mathbf{z} + c)$ is not always PSD - its validity depends on $\alpha$ and $c$ .

Q3: Derive the dual form of kernel ridge regression. When is the dual preferred over the primal?

Primal ridge: $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X} + n\lambda\mathbf{I}_d)^{-1}\mathbf{X}^\top\mathbf{y}$ - requires inverting a $d \times d$ matrix. By the representer theorem, the optimal solution lies in the span of training data: $\hat{\mathbf{w}} = \mathbf{X}^\top\boldsymbol{\alpha}$ . Substituting and simplifying: $(\mathbf{X}\mathbf{X}^\top + n\lambda\mathbf{I}_n)\boldsymbol{\alpha} = \mathbf{y}$ , giving $\hat{\boldsymbol{\alpha}} = (\mathbf{K} + n\lambda\mathbf{I}_n)^{-1}\mathbf{y}$ where $\mathbf{K} = \mathbf{X}\mathbf{X}^\top$ is $n \times n$ .

Use dual when: (1) $d \gg n$ - the $n \times n$ dual system is smaller than the $d \times d$ primal; (2) using a kernel (including RBF with $d = \infty$ ) - the primal is impossible; (3) you need to kernelize the algorithm for nonlinear patterns. Use primal when $n \gg d$ - inverting $d \times d$ is cheaper than $n \times n$ .

Q4: Why does the RBF kernel correspond to an infinite-dimensional feature space?

Using the Taylor expansion of $\exp(2\gamma\mathbf{x}^\top\mathbf{z}) = \sum_{k=0}^\infty \frac{(2\gamma\mathbf{x}^\top\mathbf{z})^k}{k!}$ , and writing $k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}\|^2)\exp(-\gamma\|\mathbf{z}\|^2)\exp(2\gamma\mathbf{x}^\top\mathbf{z})$ :

$k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}\|^2)\exp(-\gamma\|\mathbf{z}\|^2)\sum_{k=0}^\infty \frac{(2\gamma)^k}{k!}(\mathbf{x}^\top\mathbf{z})^k$

Each term corresponds to degree- $k$ polynomial interactions (monomials of degree $k$ in $\mathbf{x}$ dotted with monomials of degree $k$ in $\mathbf{z}$ ). The sum is infinite - all polynomial degrees contribute. The feature map $\phi(\mathbf{x})$ is therefore infinite-dimensional. Yet evaluating $k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)$ costs $O(d)$ .

Q5: When would you choose polynomial features + linear model over kernel ridge regression?

Choose polynomial features + linear model when: (1) interpretability - PolynomialFeatures.get_feature_names_out() gives human-readable names like x1^2, x1 x2; polynomial coefficients have physical meaning (e.g., a quadratic displacement model); (2) large $n$ - for $n > 50{,}000$ , the $n \times n$ KRR system becomes infeasible while linear models on explicit features scale as $O(nD)$ ; (3) online learning - linear models support stochastic gradient descent and streaming updates; (4) known degree - physics or domain knowledge says the true relationship is quadratic (use degree 2, nothing higher); (5) downstream deployment - linear models are trivially serializable and deployable.

Choose kernel ridge regression when: (1) $d$ is large and the interaction structure is unknown - RBF covers all degrees; (2) $n$ is small enough (less than 10K) that $n \times n$ inversion is feasible; (3) you need a Gaussian process interpretation for uncertainty quantification (KRR is the MAP estimate of a GP with the corresponding kernel as the covariance function); (4) the degree is not known and you want to avoid committing to a fixed polynomial.

Q6: What is the Nyström approximation and when should you use it?

The Nyström approximation represents the full $n \times n$ Gram matrix as a rank- $m$ approximation using $m \ll n$ landmark points: $\mathbf{K} \approx \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}$ . This reduces storage from $O(n^2)$ to $O(nm)$ and computation from $O(n^3)$ to $O(nm^2 + m^3)$ . For $n = 100{,}000$ and $m = 1000$ : storage drops from 80 GB to 800 MB, computation from $O(10^{15})$ to $O(10^{10})$ - a factor of $10^5$ reduction.

Use Nyström when $n > 10{,}000$ and you need a nonlinear kernel. Alternative: Random Fourier Features (RBFSampler in sklearn) approximates the RBF kernel via random cosine projections - same asymptotic scalability but different approximation error characteristics. In practice, Nyström has better approximation quality for the same number of components $m$ , but RFF is simpler to implement and more numerically stable.

Kernel Methods vs Neural Networks

A natural question: kernel methods were the state of the art before deep learning. Why do we still teach them?

The kernel method perspective on neural networks: deep neural networks with infinite width converge to Gaussian Processes with a kernel determined by the network architecture - the Neural Tangent Kernel (NTK, Jacot et al. 2018). At initialization, an infinitely-wide neural network's training dynamics are exactly kernel regression with the NTK. This connection helps explain why neural networks generalize despite massive overparameterization.

When kernels still win: for small datasets ( $n < 10{,}000$ ), well-tuned kernel methods routinely outperform neural networks because neural networks need large data to learn useful representations. For tabular data with few features ( $d < 100$ ), kernel ridge regression with RBF often matches gradient-boosted trees. For structured data (graphs, strings, molecules), domain-specific kernels encode prior knowledge that is difficult to learn from data alone.

# ── Kernel methods vs neural networks on small data ───────────────────────────
import torch
import torch.nn as nn
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score
import numpy as np

np.random.seed(42)

# Small dataset scenario: n=200, d=10, nonlinear true function
n_small, d_small = 200, 10
X_small = np.random.randn(n_small, d_small)
# True function: sin of first feature × cos of second, plus product interaction
y_small = (np.sin(X_small[:, 0]) * np.cos(X_small[:, 1])
           + X_small[:, 0] * X_small[:, 2]
           + np.random.randn(n_small) * 0.1)

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

sc_s = StandardScaler()
X_small_sc = sc_s.fit_transform(X_small)

# Kernel Ridge Regression
krr_cv = cross_val_score(
    KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5),
    X_small_sc, y_small, cv=5, scoring="r2"
)

# Simple MLP (too little data to train well)
class SmallMLP(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1)
        )
    def forward(self, x): return self.net(x).squeeze(-1)

# Manual 5-fold CV for MLP
from sklearn.model_selection import KFold

mlp_scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
X_t = torch.tensor(X_small_sc, dtype=torch.float32)
y_t = torch.tensor(y_small,    dtype=torch.float32)

for train_idx, val_idx in kf.split(X_small_sc):
    model = SmallMLP(d_small)
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(500):
        pred = model(X_t[train_idx])
        loss = nn.functional.mse_loss(pred, y_t[train_idx])
        optim.zero_grad(); loss.backward(); optim.step()

    model.eval()
    with torch.no_grad():
        pred_val = model(X_t[val_idx]).numpy()
    y_val = y_small[val_idx]
    ss_res = np.sum((y_val - pred_val)**2)
    ss_tot = np.sum((y_val - y_val.mean())**2)
    mlp_scores.append(1 - ss_res / ss_tot)

print(f"\nSmall data comparison (n={n_small}, d={d_small}):")
print(f"  Kernel Ridge (RBF): CV R² = {krr_cv.mean():.4f} ± {krr_cv.std():.4f}")
print(f"  MLP (2-layer):      CV R² = {np.mean(mlp_scores):.4f} ± {np.std(mlp_scores):.4f}")
print(f"  Kernel usually wins on small tabular data - no representation learning needed")

# ── When kernel fails: large n ─────────────────────────────────────────────────
print("\nScalability limit of exact KRR:")
for n_large in [1_000, 5_000, 10_000, 50_000]:
    gram_mb = n_large ** 2 * 8 / 1e6
    print(f"  n={n_large:>6,}: Gram matrix = {gram_mb:>8.1f} MB "
          f"{'(feasible)' if gram_mb < 1000 else '(too large - use Nystroem)'}")

Reproducing Kernel Hilbert Spaces (RKHS) - A Brief Overview

The mathematical foundation of kernel methods is the Reproducing Kernel Hilbert Space (RKHS). Every valid kernel $k$ defines a unique RKHS $\mathcal{H}_k$ - a Hilbert space of functions where:

For every $x$ , the function $k(x, \cdot): x' \mapsto k(x, x')$ belongs to $\mathcal{H}_k$ .
The reproducing property holds: $\langle f, k(x, \cdot)\rangle_{\mathcal{H}_k} = f(x)$ for all $f \in \mathcal{H}_k$ and all $x$ .

The representer theorem - which we used to derive the kernel ridge regression dual - holds in any RKHS: the minimizer of any regularized empirical risk functional over $\mathcal{H}_k$ is a finite linear combination of kernel functions centered at training points:

$\hat{f}(x) = \sum_{i=1}^n \alpha_i k(\mathbf{x}_i, x)$

This is exactly the kernel ridge regression prediction formula $\hat{y}(x_*) = \sum_i \alpha_i k(\mathbf{x}_i, x_*)$ . The RKHS provides the theoretical guarantee that this dual representation is complete - no information is lost by working in the dual rather than the primal.

Connection to Gaussian Processes: kernel ridge regression with kernel $k$ is equivalent to the MAP estimate of a Gaussian Process with prior covariance $k$ . The GP posterior mean is exactly $\hat{f}$ from KRR, and the GP posterior variance provides uncertainty quantification that KRR alone does not give. This connection makes kernels interpretable as prior beliefs about function smoothness and structure.

Bias-Variance Tradeoff in Polynomial Degree and Kernel Bandwidth

Both polynomial degree and RBF bandwidth $\gamma$ control the bias-variance tradeoff - but through very different mechanisms. Understanding this connection helps you cross-validate more efficiently.

Polynomial degree: increasing $p$ increases model flexibility (more features, more potential patterns captured). Low $p$ : high bias (underfitting), low variance. High $p$ : low bias, high variance (overfitting). Optimal $p$ is typically found by CV in $\{1, 2, 3, 4, 5\}$ - values above 5 are almost never optimal for practical regression.

RBF bandwidth $\gamma = 1/(2\sigma^2)$ : larger $\gamma$ (smaller bandwidth $\sigma$ ) means each training point influences only nearby test points - high variance, low bias locally. Smaller $\gamma$ (larger bandwidth) means each training point influences test points far away - smooth global function, higher bias, lower variance.

# ── Bias-variance as a function of polynomial degree ─────────────────────────
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt

np.random.seed(42)
# True function: cubic with noise
n_bv = 300
X_bv = np.sort(np.random.uniform(-2, 2, n_bv)).reshape(-1, 1)
y_bv = X_bv.ravel()**3 - 2 * X_bv.ravel() + np.random.randn(n_bv) * 0.5

degrees = [1, 2, 3, 4, 5, 7, 10, 15]
train_mses, val_mses = [], []

for deg in degrees:
    pipe = Pipeline([
        ("poly",   PolynomialFeatures(degree=deg, include_bias=True)),
        ("scaler", StandardScaler()),
        ("lr",     LinearRegression()),
    ])
    cv_res = cross_validate(pipe, X_bv, y_bv, cv=5,
                             scoring="neg_mean_squared_error",
                             return_train_score=True)
    train_mses.append(-cv_res["train_score"].mean())
    val_mses.append(-cv_res["test_score"].mean())

opt_deg = degrees[int(np.argmin(val_mses))]
print("Polynomial Degree - Bias-Variance Tradeoff:")
print(f"{'Degree':>8} | {'Train MSE':>12} | {'Val MSE':>12}")
print("-" * 38)
for deg, tr, va in zip(degrees, train_mses, val_mses):
    flag = " ← optimal" if deg == opt_deg else ""
    print(f"{deg:>8} | {tr:>12.4f} | {va:>12.4f}{flag}")


# ── Bias-variance as a function of RBF gamma ─────────────────────────────────
from sklearn.kernel_ridge import KernelRidge

gammas = np.logspace(-3, 2, 20)
train_mses_g, val_mses_g = [], []

for gamma in gammas:
    cv_res = cross_validate(
        KernelRidge(kernel="rbf", alpha=0.01, gamma=gamma),
        X_bv, y_bv, cv=5,
        scoring="neg_mean_squared_error",
        return_train_score=True
    )
    train_mses_g.append(-cv_res["train_score"].mean())
    val_mses_g.append(-cv_res["test_score"].mean())

opt_gamma = gammas[int(np.argmin(val_mses_g))]
print(f"\nRBF Kernel - Optimal γ from CV: {opt_gamma:.4f}")
print(f"  (sklearn 'scale' default: 1/(d·Var[X]) = "
      f"{1/(1*np.var(X_bv)):.4f})")

String Kernels and Structured Data

The kernel framework extends beyond numeric feature vectors. String kernels compute similarity between sequences - DNA, protein chains, text - without any vectorization step.

The spectrum kernel (Leslie et al., 2002) for strings counts shared $k$ -mers (substrings of length $k$ ):

$k_\text{spectrum}(s, t) = \sum_{\text{u: all k-mers}} \phi_u(s) \cdot \phi_u(t)$

where $\phi_u(s)$ counts how many times k-mer $u$ appears in string $s$ . This is a valid (Mercer) kernel - it is a dot product of count vectors. With SVM or kernel ridge regression, it gives state-of-the-art performance on protein homology detection and remote sensing classification tasks, without any alignment or deep learning.

from collections import Counter

def spectrum_kernel(s: str, t: str, k: int = 3) -> int:
    """
    Spectrum kernel: count of shared k-mers between strings s and t.
    k(s,t) = φ(s)ᵀφ(t) where φ(s)_u = count of k-mer u in s.
    Valid (Mercer) kernel: it is an inner product of count vectors.
    """
    def count_kmers(seq, k):
        return Counter(seq[i:i+k] for i in range(len(seq)-k+1))

    kmers_s = count_kmers(s, k)
    kmers_t = count_kmers(t, k)

    # Dot product of count vectors over shared k-mers
    return sum(kmers_s[u] * kmers_t[u] for u in kmers_s if u in kmers_t)


# Example: protein sequence similarity
seq1 = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids
seq2 = "ACDEFGHIKLMVPQRSTVWY"   # one substitution
seq3 = "AAAAAAAAAAAAAAAAAAAA"   # all-alanine (very different)

print("Spectrum kernel (k=3) - protein sequence similarity:")
print(f"  k(seq1, seq1) = {spectrum_kernel(seq1, seq1, k=3)}")
print(f"  k(seq1, seq2) = {spectrum_kernel(seq1, seq2, k=3)}  (one substitution)")
print(f"  k(seq1, seq3) = {spectrum_kernel(seq1, seq3, k=3)}  (very different)")
print("  Higher value = more similar sequences (shared 3-gram patterns)")

# Verify PSD on a small set of sequences
seqs = [seq1, seq2, seq3, "MVLSPADKTNVKAAW", "GSHSMRYFYTSVSRPG"]
n_seqs = len(seqs)
K_str = np.array([[spectrum_kernel(seqs[i], seqs[j], k=3)
                   for j in range(n_seqs)]
                  for i in range(n_seqs)], dtype=float)

eigvals = np.linalg.eigvalsh(K_str)
print(f"\nSpectrum kernel Gram matrix eigenvalues: {eigvals.round(2)}")
print(f"All non-negative (PSD): {(eigvals >= -1e-9).all()}")  # Should be True

:::tip 🎮 Interactive Playground

Visualize this concept: Try the SVM Kernels & Margin demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why This Exists - The Nonlinearity Problem​

Historical Context​

Polynomial Feature Expansion​

The Curse of Dimensionality​

Polynomial Features in Practice​

The Kernel Trick​

The Core Insight​

Polynomial Kernel - Explicit Proof​

Common Kernels​

Mercer's Theorem: What Makes a Valid Kernel​

Kernel Ridge Regression: Dual Formulation​

From Primal to Dual​

Primal vs Dual: When to Use Which​

Full NumPy Kernel Ridge Regression​

Sklearn KernelRidge and Grid Search​

Nyström Approximation: Scaling Kernel Methods​

Kernel Selection Guide​

Common Mistakes​

YouTube Resources​

Interview Q&A​

Kernel Methods vs Neural Networks​

Reproducing Kernel Hilbert Spaces (RKHS) - A Brief Overview​

Bias-Variance Tradeoff in Polynomial Degree and Kernel Bandwidth​

String Kernels and Structured Data​