Skip to main content

Polynomial Features and Kernel Methods

:::note Reading time: ~45 minutes | Interview relevance: High | Target roles: ML Engineer, Research Engineer, Applied Scientist, AI Engineer :::

The Real Interview Moment

You are interviewing at a biotech company. Their ML team has been using linear regression with raw features to predict protein binding affinity from amino acid sequences - R² of 0.38. The interviewer says: "We think there are interaction effects between amino acid positions. How would you capture those? And if we have 1000 amino acid positions, what is the problem with explicit polynomial features?"

The candidate who says "add polynomial features" gets the follow-up: "Degree-2 polynomial expansion of 1000 features gives how many features? What about degree-3? And how does the kernel trick solve this?"

The answer: (10022)=501,501\binom{1002}{2} = 501{,}501 features for degree-2. For degree-3: (10033)=167,167,000\binom{1003}{3} = 167{,}167{,}000 features. Even storing a single training matrix at degree-3 requires terabytes. The kernel trick computes the equivalent of working in that 167-million-dimensional space with just a dot product in the original space - O(d)O(d) cost instead of O(D)O(D). The RBF kernel extends this to an infinite-dimensional feature space, with the same O(d)O(d) cost.

Understanding this distinction - and the precise mathematical conditions that make it possible (Mercer's theorem) - separates engineers who "use kernels" from engineers who understand why they work.

Why This Exists - The Nonlinearity Problem

Linear models draw hyperplanes. A linear regression or logistic regression with raw features cannot capture curves, interaction effects, or decision boundaries that are not planar. For the protein problem, an amino acid at position 50 might interact synergistically with an amino acid at position 300 - their joint effect on binding affinity exceeds the sum of their individual effects. No linear model in the original features can represent this.

Two approaches handle nonlinearity while preserving linear machinery:

  1. Explicit feature engineering: transform inputs (add x12x_1^2, x1x2x_1 x_2, sin(x3)\sin(x_3), etc.) and apply linear models in the expanded space. Standard linear model theory applies without modification.
  2. Kernel methods: implicitly work in the expanded feature space without ever computing the features. The kernel function computes inner products in the high-dimensional space directly.

Both approaches are mathematically equivalent. They differ dramatically in computational cost when features are high-dimensional.

Historical Context

Polynomial features were used in statistics long before modern machine learning - astronomers fit polynomial curves to orbital data in the 19th century, and polynomial regression appears in Gauss's work on orbit prediction. The modern kernel framework emerged from support vector machines (Boser, Guyon, Vapnik 1992; Cortes and Vapnik 1995). Mercer's theorem itself dates to 1909, predating all of modern machine learning by decades. The key insight - that algorithms depending only on inner products can be "kernelized" - was explicitly formalized in the SVM context. Rahimi and Recht (2007) introduced random Fourier features ("random kitchen sinks"), enabling scalable approximation of kernel methods. Williams and Seeger (2001) developed Nyström approximation for kernel matrices.

Polynomial Feature Expansion

A degree-pp polynomial feature expansion maps xRd\mathbf{x} \in \mathbb{R}^d to a higher-dimensional vector ϕ(x)RD\phi(\mathbf{x}) \in \mathbb{R}^D containing all monomials up to degree pp.

For d=2d = 2, p=2p = 2:

ϕ(x1,x2)=(1,  x1,  x2,  x12,  x1x2,  x22)\phi(x_1, x_2) = \left(1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2\right)

The model in the expanded space is still linear in ϕ(x)\phi(\mathbf{x}):

y^=wϕ(x)=w0+w1x1+w2x2+w3x12+w4x1x2+w5x22\hat{y} = \mathbf{w}^\top \phi(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2

This is nonlinear in x\mathbf{x} (it can represent a quadratic response surface) but linear in ϕ(x)\phi(\mathbf{x}). OLS, Ridge, LASSO - all linear model theory applies without modification to the expanded features.

The Curse of Dimensionality

The number of features after a degree-pp expansion of dd inputs (with intercept) is:

D=(d+pp)D = \binom{d + p}{p}

For large dd and small pp, this grows as dp/p!\approx d^p/p!.

ddp=2p=2p=3p=3p=4p=4
10662861,001
1005,151176,8514,598,126
500125,75120,917,501Infeasible
1,000501,501167,167,000Infeasible

For d=1000d = 1000, p=3p = 3: 167 million features. The training matrix XΦRn×167M\mathbf{X}\boldsymbol{\Phi} \in \mathbb{R}^{n \times 167M} requires - for n=10,000n = 10{,}000 training points - approximately 167M×10,000×4167M \times 10{,}000 \times 4 bytes = 6.7 TB of storage. Each gradient step costs O(nD)=O(1.67×1012)O(n \cdot D) = O(1.67 \times 10^{12}) operations. This is completely infeasible.

The kernel trick avoids all of this.

Polynomial Features in Practice

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.datasets import make_regression


# ── Degree selection via cross-validation on 1D nonlinear data ────────────────
np.random.seed(42)
n = 200
X_1d = np.sort(np.random.uniform(-3, 3, n)).reshape(-1, 1)
# True function: quadratic + linear
y_1d = (0.5 * X_1d.ravel()**2 - 1.5 * X_1d.ravel()
+ np.random.randn(n) * 0.8)

print("Polynomial degree selection via 5-fold CV R²:")
print(f"{'Degree':>8} | {'CV R²':>10} | {'CV Std':>8} | {'N features':>12}")
print("-" * 48)

for degree in [1, 2, 3, 5, 8, 12, 20]:
pipe = Pipeline([
("poly", PolynomialFeatures(degree=degree, include_bias=True)),
("scaler", StandardScaler()),
("ridge", Ridge(alpha=0.01)),
])
cv_scores = cross_val_score(pipe, X_1d, y_1d, cv=5, scoring="r2")
n_feats = PolynomialFeatures(degree=degree).fit_transform(X_1d).shape[1]
print(f"{degree:>8} | {cv_scores.mean():>10.4f} | "
f"{cv_scores.std():>8.4f} | {n_feats:>12}")

# ── Visualize underfitting vs overfitting ─────────────────────────────────────
fig, axes = plt.subplots(1, 4, figsize=(17, 4))
X_plot = np.linspace(-3.5, 3.5, 600).reshape(-1, 1)

for ax, degree in zip(axes, [1, 2, 6, 15]):
pipe = Pipeline([
("poly", PolynomialFeatures(degree=degree)),
("sc", StandardScaler()),
("lr", LinearRegression()),
])
pipe.fit(X_1d, y_1d)
cv_r2 = cross_val_score(pipe, X_1d, y_1d, cv=5, scoring="r2").mean()

ax.scatter(X_1d, y_1d, s=12, alpha=0.45, color="steelblue")
ax.plot(X_plot, pipe.predict(X_plot), "r-", linewidth=2)
ax.set_title(f"Degree {degree}\nCV R²={cv_r2:.3f}")
ax.set_ylim(-8, 10)
ax.grid(True, alpha=0.2)

plt.suptitle("Polynomial Degree: Underfitting (deg 1) → Optimal (deg 2) → Overfitting (deg 15)",
y=1.02)
plt.tight_layout()
plt.savefig("polynomial_overfitting.png", dpi=150, bbox_inches="tight")
plt.show()


# ── Feature count explosion for higher dimensions ─────────────────────────────
print("\nFeature count explosion:")
print(f"{'d':>6} | {'p=2':>10} | {'p=3':>15} | {'p=4':>18}")
from math import comb
for d in [10, 50, 100, 500, 1000]:
p2 = comb(d+2, 2)
p3 = comb(d+3, 3)
p4 = comb(d+4, 4)
print(f"{d:>6} | {p2:>10,} | {p3:>15,} | {p4:>18,}")

The Kernel Trick

The Core Insight

Many ML algorithms only need inner products ϕ(xi)ϕ(xj)\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j) between training examples - never the feature vectors themselves. Ridge regression, SVM, PCA, K-means, Gaussian processes - all can be formulated using only pairwise inner products. If we can compute ϕ(xi)ϕ(xj)\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j) directly without first computing ϕ(xi)\phi(\mathbf{x}_i), we get the benefit of the high-dimensional feature space at the cost of operating in the original space.

A kernel function does exactly this:

k(xi,xj)=ϕ(xi)ϕ(xj)k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)

Polynomial Kernel - Explicit Proof

For d=2d = 2, degree p=2p = 2, constant c=1c = 1:

k(x,z)=(xz+1)2=(x1z1+x2z2+1)2k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + 1)^2 = (x_1 z_1 + x_2 z_2 + 1)^2

Expanding:

=x12z12+x22z22+1+2x1z1x2z2+2x1z1+2x2z2= x_1^2 z_1^2 + x_2^2 z_2^2 + 1 + 2x_1 z_1 x_2 z_2 + 2x_1 z_1 + 2x_2 z_2

=ϕ(x)ϕ(z)whereϕ(x)=(x12,  x22,  1,  2x1x2,  2x1,  2x2)= \phi(\mathbf{x})^\top\phi(\mathbf{z}) \quad \text{where} \quad \phi(\mathbf{x}) = (x_1^2,\; x_2^2,\; 1,\; \sqrt{2}x_1 x_2,\; \sqrt{2}x_1,\; \sqrt{2}x_2)

The 2\sqrt{2} scaling factors ensure the inner product matches. Computing the kernel: O(d)O(d) - one dot product plus one addition and squaring. Computing ϕ\phi then the dot product: O(D)=O(d2)O(D) = O(d^2). For d=1000d = 1000: 1,000 vs 1,000,000 operations per pair.

In general: k(x,z)=(xz+c)pk(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + c)^p implicitly works with all degree-pp monomials in dd variables, a DD-dimensional space, in O(d)O(d) time.

Common Kernels

Linear kernel: k(x,z)=xzk(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top\mathbf{z}

The feature map is the identity: ϕ(x)=x\phi(\mathbf{x}) = \mathbf{x}. No nonlinearity. Kernelizing a linear model recovers the original formulation - useful for code structure and to enable the dual formulation when dnd \gg n.

Polynomial kernel: k(x,z)=(xz+c)pk(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + c)^p

Implicit features: all monomials of degree exactly pp (homogeneous) or up to degree pp (inhomogeneous, c>0c > 0). Standard for NLP (character nn-grams, document similarity) where all degree-pp term interactions matter.

RBF / Gaussian kernel: k(x,z)=exp ⁣(xz22σ2)=exp(γxz2)k(\mathbf{x}, \mathbf{z}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2}\right) = \exp(-\gamma\|\mathbf{x} - \mathbf{z}\|^2)

The implicit feature space is infinite-dimensional. To see this, use the Taylor expansion of exp\exp. With γ=1\gamma = 1 and writing xz=kxkzk\mathbf{x}^\top\mathbf{z} = \sum_k x_k z_k:

exp(x2)exp(z2)exp(2xz)=exp(x2)exp(z2)k=0(2xz)kk!\exp(-\|\mathbf{x}\|^2)\exp(-\|\mathbf{z}\|^2)\exp(2\mathbf{x}^\top\mathbf{z}) = \exp(-\|\mathbf{x}\|^2)\exp(-\|\mathbf{z}\|^2) \sum_{k=0}^{\infty} \frac{(2\mathbf{x}^\top\mathbf{z})^k}{k!}

Each term (2xz)k/k!(2\mathbf{x}^\top\mathbf{z})^k/k! corresponds to degree-kk polynomial interactions, scaled. Summing from k=0k=0 to \infty gives an infinite polynomial feature space. Yet computing k(x,z)k(\mathbf{x}, \mathbf{z}) costs only O(d)O(d).

The RBF kernel also has an intuitive geometric interpretation: it measures similarity via distance. k(x,z)=1k(\mathbf{x}, \mathbf{z}) = 1 when x=z\mathbf{x} = \mathbf{z} (identical points), and k(x,z)0k(\mathbf{x}, \mathbf{z}) \to 0 as xz\|\mathbf{x} - \mathbf{z}\| \to \infty (infinitely dissimilar).

Matern kernel (used in Gaussian processes):

kν(x,z)=21νΓ(ν)(2νxz)νKν ⁣(2νxz)k_\nu(\mathbf{x}, \mathbf{z}) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\|\mathbf{x}-\mathbf{z}\|}{\ell}\right)^\nu K_\nu\!\left(\frac{\sqrt{2\nu}\|\mathbf{x}-\mathbf{z}\|}{\ell}\right)

where KνK_\nu is the modified Bessel function of the second kind, \ell is the length scale, and ν\nu controls smoothness. At ν=1/2\nu = 1/2: Laplacian kernel exp(xz/)\exp(-\|\mathbf{x}-\mathbf{z}\|/\ell). At ν\nu \to \infty: RBF kernel. Matern-3/2 and Matern-5/2 are popular in GP regression for physically plausible smoothness assumptions.

Mercer's Theorem: What Makes a Valid Kernel

Not every symmetric function k(x,z)k(\mathbf{x}, \mathbf{z}) corresponds to an inner product in some feature space. Mercer's theorem (1909) characterizes exactly which functions do.

Mercer's theorem: A symmetric function k:X×XRk: \mathcal{X} \times \mathcal{X} \to \mathbb{R} is a valid (Mercer) kernel - i.e., it corresponds to an inner product ϕ(x),ϕ(z)H\langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle_\mathcal{H} in some Hilbert space H\mathcal{H} - if and only if the Gram matrix K\mathbf{K} with Kij=k(xi,xj)K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) is positive semi-definite for every finite set of input points {x1,,xn}\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}.

Equivalently: kk is a valid kernel     \iff for all finite sets of inputs and all αRn\boldsymbol{\alpha} \in \mathbb{R}^n:

i=1nj=1nαiαjk(xi,xj)0\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0

This is the PSD condition. It ensures that we are genuinely computing inner products, not some other bilinear form, and it guarantees that optimization problems using the kernel have unique solutions (convexity from PSD Gram matrices).

Kernel algebra - constructing new valid kernels from known ones:

If k1k_1 and k2k_2 are valid kernels, then so are:

  • k1+k2k_1 + k_2 (sum of PSD matrices is PSD)
  • k1k2k_1 \cdot k_2 (Schur/Hadamard product of PSD matrices is PSD)
  • ck1c \cdot k_1 for any c>0c > 0 (positive scaling)
  • f(k1)f(k_1) where ff is a polynomial with non-negative coefficients
  • exp(k1)\exp(k_1) (exponential of any PSD kernel is PSD)

This algebra lets you build domain-specific kernels by combining simpler ones.

# ── Kernel validity check: is the Gram matrix PSD? ────────────────────────────

def gram_matrix(kernel_fn, X: np.ndarray) -> np.ndarray:
"""Compute n×n Gram matrix for a kernel function."""
n = len(X)
K = np.zeros((n, n))
for i in range(n):
for j in range(n):
K[i, j] = kernel_fn(X[i], X[j])
return K


def is_psd(K: np.ndarray, tol: float = 1e-9) -> tuple[bool, float]:
"""Check PSD by computing minimum eigenvalue."""
eigvals = np.linalg.eigvalsh(K)
return eigvals.min() >= -tol, float(eigvals.min())


np.random.seed(42)
X_check = np.random.randn(40, 3)

# Define common kernels
kernels = {
"Linear k(x,z) = xᵀz":
lambda x, z: x @ z,
"Poly-3 k(x,z) = (xᵀz+1)³":
lambda x, z: (x @ z + 1) ** 3,
"RBF k(x,z) = exp(-0.5‖x-z‖²)":
lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2)),
"Laplacian k(x,z) = exp(-‖x-z‖₁)":
lambda x, z: np.exp(-np.sum(np.abs(x - z))),
"Sigmoid k(x,z) = tanh(xᵀz - 1) [NOT always PSD]":
lambda x, z: np.tanh(x @ z - 1.0),
"Sum RBF+Poly [valid composition]":
lambda x, z: (np.exp(-0.5 * np.sum((x - z) ** 2))
+ (x @ z + 1) ** 2),
}

print("Mercer's theorem - Gram matrix PSD check:")
print("-" * 70)
for name, kfn in kernels.items():
K = gram_matrix(kfn, X_check)
valid, min_eig = is_psd(K)
status = "VALID " if valid else "INVALID"
print(f" {status} min_eig={min_eig:+.4f} {name}")

Kernel Ridge Regression: Dual Formulation

From Primal to Dual

Primal ridge regression (standard form):

w^=(XX+nλId)1Xy- requires inverting a d×d matrix\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X} + n\lambda\mathbf{I}_d)^{-1}\mathbf{X}^\top\mathbf{y} \quad \text{- requires inverting a } d \times d \text{ matrix}

Representer theorem: for any regularized empirical risk minimization over a RKHS (Reproducing Kernel Hilbert Space), the optimal solution lies in the span of the training data:

w^=Xα=i=1nαixi\hat{\mathbf{w}} = \mathbf{X}^\top\boldsymbol{\alpha} = \sum_{i=1}^n \alpha_i \mathbf{x}_i

Substituting w=Xα\mathbf{w} = \mathbf{X}^\top\boldsymbol{\alpha} into the ridge objective and differentiating:

α^=(XX+nλIn)1y=(K+nλIn)1y\hat{\boldsymbol{\alpha}} = (\mathbf{X}\mathbf{X}^\top + n\lambda\mathbf{I}_n)^{-1}\mathbf{y} = (\mathbf{K} + n\lambda\mathbf{I}_n)^{-1}\mathbf{y}

where K=XX\mathbf{K} = \mathbf{X}\mathbf{X}^\top is the n×nn \times n Gram matrix with Kij=xixjK_{ij} = \mathbf{x}_i^\top\mathbf{x}_j.

Prediction for a new point x\mathbf{x}_*:

y^=w^x=αXx=i=1nαixix\hat{y}_* = \hat{\mathbf{w}}^\top\mathbf{x}_* = \boldsymbol{\alpha}^\top\mathbf{X}\mathbf{x}_* = \sum_{i=1}^n \alpha_i \mathbf{x}_i^\top\mathbf{x}_*

Kernelization: replace every inner product xixj\mathbf{x}_i^\top\mathbf{x}_j with k(xi,xj)k(\mathbf{x}_i, \mathbf{x}_j):

α^=(K+nλI)1y,y^=i=1nαik(xi,x)=kα^\boxed{\hat{\boldsymbol{\alpha}} = (K + n\lambda\mathbf{I})^{-1}\mathbf{y}, \qquad \hat{y}_* = \sum_{i=1}^n \alpha_i k(\mathbf{x}_i, \mathbf{x}_*) = \mathbf{k}_*^\top\hat{\boldsymbol{\alpha}}}

where Kij=k(xi,xj)K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) and k=[k(x1,x),,k(xn,x)]\mathbf{k}_* = [k(\mathbf{x}_1, \mathbf{x}_*), \ldots, k(\mathbf{x}_n, \mathbf{x}_*)]^\top.

This dual form works with any valid kernel - including RBF, which implicitly uses an infinite-dimensional feature map. The primal problem would be ×\infty \times \infty; the dual is always n×nn \times n.

Primal vs Dual: When to Use Which

CriterionPrimalDual
System to invertd×dd \times dn×nn \times n
Prefer whenndn \gg ddnd \gg n or d=d = \infty (kernel)
Kernel requiredNoYes
Training complexityO(nd2+d3)O(nd^2 + d^3)O(n2d+n3)O(n^2d + n^3)
Prediction complexityO(d)O(d)O(nd)O(nd) (sum over training)

The crossover: if n=dn = d, both are equivalent. If using kernels (especially RBF), always use the dual - there is no choice, since the primal is infinite-dimensional.

Full NumPy Kernel Ridge Regression

class KernelRidgeRegression:
"""
Kernel Ridge Regression via the dual formulation.
Solves: α = (K + n·λ·I)⁻¹ y
Predicts: ŷ(x*) = kₓ*ᵀ α where kₓ* = [k(x1,x*),...,k(xn,x*)]

Supports: 'linear', 'polynomial', 'rbf', or any callable kernel.
"""

def __init__(self, kernel: str = "rbf", alpha: float = 1.0,
gamma: float = 1.0, degree: int = 3, coef0: float = 1.0):
self.kernel = kernel
self.alpha = alpha # regularization strength λ
self.gamma = gamma # RBF: 1/(2σ²); Poly: scaling
self.degree = degree
self.coef0 = coef0
self._alpha_dual: np.ndarray | None = None
self._X_train: np.ndarray | None = None

def _k(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
"""Compute kernel matrix between X1 [n1×d] and X2 [n2×d]."""
if self.kernel == "linear":
return X1 @ X2.T

elif self.kernel == "polynomial":
return (X1 @ X2.T + self.coef0) ** self.degree

elif self.kernel == "rbf":
# Vectorized squared Euclidean distances (no for-loop)
# ||xi - xj||² = ||xi||² - 2 xi·xj + ||xj||²
X1_sq = np.sum(X1 ** 2, axis=1, keepdims=True) # [n1, 1]
X2_sq = np.sum(X2 ** 2, axis=1, keepdims=True) # [n2, 1]
sq_dists = X1_sq + X2_sq.T - 2.0 * (X1 @ X2.T) # [n1, n2]
return np.exp(-self.gamma * sq_dists)

elif callable(self.kernel):
n1, n2 = len(X1), len(X2)
K = np.zeros((n1, n2))
for i in range(n1):
for j in range(n2):
K[i, j] = self.kernel(X1[i], X2[j])
return K

else:
raise ValueError(f"Unknown kernel: {self.kernel!r}")

def fit(self, X: np.ndarray, y: np.ndarray) -> "KernelRidgeRegression":
"""Solve dual: α = (K + n·λ·I)⁻¹ y via lstsq for numerical stability."""
self._X_train = X.copy()
n = len(X)
K = self._k(X, X)
# Note: sklearn uses (K + alpha*n*I); we follow that convention
K_reg = K + self.alpha * n * np.eye(n)
self._alpha_dual, *_ = np.linalg.lstsq(K_reg, y, rcond=None)
return self

def predict(self, X_new: np.ndarray) -> np.ndarray:
"""ŷ = K(X_new, X_train) · α_dual"""
K_new = self._k(X_new, self._X_train) # [n_new, n_train]
return K_new @ self._alpha_dual

def score(self, X: np.ndarray, y: np.ndarray) -> float:
y_pred = self.predict(X)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
return float(1.0 - ss_res / ss_tot)


# ── Demo on sinusoidal data ───────────────────────────────────────────────────
np.random.seed(42)
X_nl = np.sort(np.random.uniform(-3, 3, 300)).reshape(-1, 1)
y_nl = np.sin(X_nl.ravel()) + 0.15 * np.random.randn(300)

X_tr, X_te, y_tr, y_te = train_test_split(X_nl, y_nl, test_size=0.2,
random_state=42)
sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)

print("\nKernel Ridge Regression on sin(x) + noise:")
print("-" * 50)
configs = [
("Linear", dict(kernel="linear")),
("Polynomial (d=3)", dict(kernel="polynomial", degree=3)),
("RBF (γ=0.1)", dict(kernel="rbf", gamma=0.1)),
("RBF (γ=1.0)", dict(kernel="rbf", gamma=1.0)),
("RBF (γ=10.0)", dict(kernel="rbf", gamma=10.0)),
]
for name, kwargs in configs:
krr = KernelRidgeRegression(alpha=0.1, **kwargs)
krr.fit(X_tr_sc, y_tr)
r2 = krr.score(X_te_sc, y_te)
print(f" {name:<22s}: Test R² = {r2:.4f}")

# ── Visualize the three regimes of RBF bandwidth ─────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
X_plot = np.linspace(-3.5, 3.5, 600).reshape(-1, 1)
X_plot_sc = sc.transform(X_plot)

for ax, (gamma, title) in zip(axes, [
(0.05, "γ=0.05 (over-smooth, underfitting)"),
(1.0, "γ=1.0 (good fit)"),
(15.0, "γ=15.0 (over-sharp, overfitting)"),
]):
krr = KernelRidgeRegression(alpha=0.1, kernel="rbf", gamma=gamma)
krr.fit(X_tr_sc, y_tr)
r2_te = krr.score(X_te_sc, y_te)

ax.scatter(X_tr, y_tr, s=10, alpha=0.4, color="steelblue")
ax.scatter(X_te, y_te, s=10, alpha=0.4, color="orange")
ax.plot(X_plot, krr.predict(X_plot_sc), "r-", linewidth=2,
label=f"KRR R²={r2_te:.3f}")
ax.plot(X_plot, np.sin(X_plot), "g--", linewidth=1.2,
alpha=0.7, label="True sin(x)")
ax.set_title(title)
ax.legend(fontsize=7)
ax.set_ylim(-1.8, 1.8)
ax.grid(True, alpha=0.2)

plt.tight_layout()
plt.savefig("kernel_ridge_gamma.png", dpi=150, bbox_inches="tight")
plt.show()
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# ── Multivariate nonlinear regression ─────────────────────────────────────────
np.random.seed(42)
n_m = 600
X_m = np.random.randn(n_m, 5)
y_m = (np.sin(X_m[:, 0]) * np.cos(X_m[:, 1])
+ X_m[:, 2] ** 2 - X_m[:, 3] * X_m[:, 4]
+ np.random.randn(n_m) * 0.2)

X_tr_m, X_te_m, y_tr_m, y_te_m = train_test_split(
X_m, y_m, test_size=0.2, random_state=42)
sc_m = StandardScaler()
X_tr_m_sc = sc_m.fit_transform(X_tr_m)
X_te_m_sc = sc_m.transform(X_te_m)

# Baseline: linear Ridge
lr_base = Ridge(alpha=1.0).fit(X_tr_m_sc, y_tr_m)
print(f"Linear Ridge baseline: Test R²={lr_base.score(X_te_m_sc, y_te_m):.4f}")

# KernelRidge with grid search over alpha and gamma
param_grid = {
"alpha": [0.001, 0.01, 0.1, 1.0, 10.0],
"gamma": [0.05, 0.1, 0.5, 1.0, 2.0, 5.0],
}
gs = GridSearchCV(
KernelRidge(kernel="rbf"),
param_grid, cv=5, scoring="r2", n_jobs=-1, verbose=0
)
gs.fit(X_tr_m_sc, y_tr_m)
print(f"KernelRidge (RBF, grid search):")
print(f" Best params: {gs.best_params_}")
print(f" CV R²: {gs.best_score_:.4f}")
print(f" Test R²: {gs.best_estimator_.score(X_te_m_sc, y_te_m):.4f}")

Nyström Approximation: Scaling Kernel Methods

Kernel ridge regression has two scalability bottlenecks: computing the Gram matrix (O(n2d)O(n^2 d) time, O(n2)O(n^2) memory) and inverting it (O(n3)O(n^3) time). For n=100,000n = 100{,}000 and d=100d = 100: the Gram matrix alone requires 100,0002×8100{,}000^2 \times 8 bytes = 80 GB.

Nyström approximation (Williams and Seeger, 2001): choose mnm \ll n landmark points {x~1,,x~m}\{\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m\} (typically a random subset of training points). The full n×nn \times n Gram matrix is approximated as:

KKnmKmm1Kmn=K~\mathbf{K} \approx \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn} = \widetilde{\mathbf{K}}

where KnmRn×m\mathbf{K}_{nm} \in \mathbb{R}^{n \times m} is the kernel matrix between all nn training points and the mm landmarks, and KmmRm×m\mathbf{K}_{mm} \in \mathbb{R}^{m \times m} is the kernel matrix between landmarks. Storage: O(nm)O(nm) instead of O(n2)O(n^2). The approximation quality improves as mm increases - at m=nm = n, it is exact.

Random Fourier Features (Rahimi and Recht, 2007): approximates the RBF kernel via random projections. By Bochner's theorem, a stationary kernel k(x,z)=k(xz)k(\mathbf{x}, \mathbf{z}) = k(\mathbf{x} - \mathbf{z}) can be written as the expectation of zω(x)zω(z)z_\omega(\mathbf{x})z_\omega(\mathbf{z}) over random frequencies ωp(ω)\omega \sim p(\omega):

k(x,z)=Eω[cos(ωx+b)cos(ωz+b)]2k(\mathbf{x}, \mathbf{z}) = \mathbb{E}_\omega\left[\cos(\omega^\top\mathbf{x} + b)\cos(\omega^\top\mathbf{z} + b)\right] \cdot 2

Sample DD random frequencies ωjN(0,γI)\omega_j \sim \mathcal{N}(0, \gamma \mathbf{I}) and random phases bjUniform(0,2π)b_j \sim \text{Uniform}(0, 2\pi). Map each input to:

z(x)=2D[cos(ω1x+b1),,cos(ωDx+bD)]RDz(\mathbf{x}) = \sqrt{\frac{2}{D}}\left[\cos(\omega_1^\top\mathbf{x} + b_1), \ldots, \cos(\omega_D^\top\mathbf{x} + b_D)\right] \in \mathbb{R}^D

Then k(x,z)z(x)z(z)k(\mathbf{x}, \mathbf{z}) \approx z(\mathbf{x})^\top z(\mathbf{z}). With DD random features, fit a standard linear model on z(X)z(\mathbf{X}) - cost O(nD)O(nD) instead of O(n2)O(n^2).

from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
import time

print("\nKernel approximation quality vs computation:")
print(f"{'Method':35s} | {'Test R²':>8} | {'Time (s)':>10}")
print("-" * 60)

# Exact kernel ridge (reference)
t0 = time.time()
krr_exact = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5)
krr_exact.fit(X_tr_m_sc, y_tr_m)
r2_exact = krr_exact.score(X_te_m_sc, y_te_m)
print(f"{'Exact KRR':35s} | {r2_exact:>8.4f} | {time.time()-t0:>10.3f}")

# Nyström approximation with varying m
for m in [50, 100, 200, 500]:
t0 = time.time()
pipe = Pipeline([
("nys", Nystroem(kernel="rbf", gamma=0.5,
n_components=m, random_state=42)),
("ridge", Ridge(alpha=0.1)),
])
pipe.fit(X_tr_m_sc, y_tr_m)
r2 = pipe.score(X_te_m_sc, y_te_m)
print(f"{'Nyström (m='+str(m)+')':35s} | {r2:>8.4f} | {time.time()-t0:>10.3f}")

# Random Fourier Features
for D in [100, 500, 1000]:
t0 = time.time()
pipe = Pipeline([
("rff", RBFSampler(gamma=0.5, n_components=D, random_state=42)),
("ridge", Ridge(alpha=0.1)),
])
pipe.fit(X_tr_m_sc, y_tr_m)
r2 = pipe.score(X_te_m_sc, y_te_m)
print(f"{'RFF (D='+str(D)+')':35s} | {r2:>8.4f} | {time.time()-t0:>10.3f}")

Kernel Selection Guide

print("\nKernel Selection Guide:")
print("=" * 70)

guide = [
("Linear k(x,z) = xᵀz",
"Feature space is already the right space; high-d text (d≫n works via dual)"),
("Polynomial k(x,z) = (xᵀz+c)ᵖ",
"Known interaction degree; NLP n-gram similarity; computer vision patch matching"),
("RBF k(x,z) = exp(-γ‖x-z‖²)",
"Default first choice; local similarity; tabular data after standardization"),
("Laplacian k(x,z) = exp(-γ‖x-z‖₁)",
"More robust to outliers in individual dims; sparser locality than RBF"),
("Matern-5/2",
"GP regression for physical processes; assumption of finite differentiability"),
("String kernel k(s,t) = #common substrings",
"DNA/protein sequences; text classification without bag-of-words encoding"),
]

for name, use_when in guide:
print(f" {name}")
print(f" Use when: {use_when}\n")

print("RBF γ parameter guide:")
print(" Start with: γ = 1 / (d · Var[X]) ('scale' in sklearn)")
print(" Small γ → broad similarity, smooth function (risk: underfitting)")
print(" Large γ → narrow similarity, spiky function (risk: overfitting)")
print(" Always cross-validate γ jointly with the regularization α")

print("\nSVM vs KRR - same kernel, different loss:")
print(" KRR loss = MSE → regression, smooth solution")
print(" SVM loss = hinge → classification/regression, sparse support vectors")
print(" Both solve a quadratic optimization in the dual; both kernelizable")

Common Mistakes

:::danger Not scaling features before kernel methods

# WRONG - different feature scales distort RBF distances
krr = KernelRidge(kernel="rbf", gamma=0.5).fit(X_raw, y)

# CORRECT - always standardize first
sc = StandardScaler()
X_sc = sc.fit_transform(X_train)
krr = KernelRidge(kernel="rbf", gamma=0.5).fit(X_sc, y)

The RBF kernel computes exp(γxz2)\exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2). A feature with variance 1000 contributes 1000×1000\times more to the Euclidean distance than a feature with variance 1, effectively ignoring the latter. After StandardScaler, all features have unit variance and contribute equally. Apply transform() (not fit_transform()) to test data. :::

:::danger Kernel ridge regression with large n - memory and compute explosion For n=10,000n = 10{,}000: Gram matrix = 10,0002×810{,}000^2 \times 8 bytes = 800 MB. For n=100,000n = 100{,}000: 80 GB. For n>10,000n > 10{,}000, use Nyström (sklearn.kernel_approximation.Nystroem + Ridge) or Random Fourier Features (RBFSampler + Ridge). These scale as O(nm)O(nm) instead of O(n2)O(n^2) and usually achieve 95%+ of the performance at a fraction of the cost. :::

:::warning High-degree polynomial features without regularization Degree-pp polynomial features create O(dp/p!)O(d^p/p!) features. Without regularization, the model perfectly interpolates training data (R² = 1.0 if DnD \geq n) and fails catastrophically on test data. Always combine polynomial features with Ridge or Lasso regularization, and cross-validate both the degree and the regularization strength jointly in a Pipeline. :::

:::warning Choosing γ for RBF without cross-validation Too large a γ\gamma: the model effectively interpolates training points (RBF kernel becomes an identity matrix) - perfect training fit, terrible test performance. Too small: the model is over-smooth, effectively linear. Always cross-validate over a logarithmic grid: param_grid = {"gamma": np.logspace(-4, 2, 30)}. sklearn's gamma="scale" is a good starting point: γ=1/(dVar[X])\gamma = 1/(d \cdot \text{Var}[\mathbf{X}]). :::

YouTube Resources

ResourceChannelWhy Watch
The Kernel Trick ExplainedStatQuest with Josh StarmerBest visual introduction to kernel functions and the kernel trick - start here
MIT 6.034: Kernel MachinesMIT OpenCourseWareMercer's theorem, SVM, and kernel methods - rigorous mathematical treatment
Random Features for Kernel Machines (Rahimi & Recht)NIPS Conference TalkThe random Fourier features paper - the scalability breakthrough for large kernels
Gaussian Processes from the Kernel PerspectiveRichard Turner (Cambridge)How kernels define function spaces - connects KRR to Gaussian process regression
Polynomial Features and OverfittingStatQuest with Josh StarmerVisual intuition for polynomial expansion and the bias-variance tradeoff with degree

Interview Q&A

Q1: Explain the kernel trick. Why is it computationally valuable?

The kernel trick exploits the fact that many ML algorithms need only inner products ϕ(xi)ϕ(xj)\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j) between feature vectors - never the vectors themselves. A kernel function k(xi,xj)k(\mathbf{x}_i, \mathbf{x}_j) computes this inner product implicitly: for the polynomial kernel (xz+c)p(\mathbf{x}^\top\mathbf{z} + c)^p, computing kk costs O(d)O(d) (one dot product plus scalar operations), while computing ϕ\phi then its dot product costs O(D)=O(dp/p!)O(D) = O(d^p/p!). For d=1000d = 1000, p=3p = 3: 1,000 vs 167,000,000 operations per pair. For the RBF kernel, D=D = \infty, yet computing kk still costs O(d)O(d). The kernel trick gives infinite expressiveness at the cost of a single dot product in the original space.

Q2: State Mercer's theorem. How do you verify a function is a valid kernel?

Mercer's theorem: a symmetric function k:X×XRk: \mathcal{X} \times \mathcal{X} \to \mathbb{R} is a valid (Mercer) kernel - corresponds to an inner product in some Hilbert space - if and only if the Gram matrix K\mathbf{K} with Kij=k(xi,xj)K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) is positive semi-definite (all eigenvalues 0\geq 0) for every finite set of input points.

Verification strategies: (1) Show k(x,z)=ϕ(x)ϕ(z)k(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^\top\phi(\mathbf{z}) for an explicit ϕ\phi (constructive proof for polynomial kernel); (2) Compute eigenvalues of the Gram matrix on a sample of points - all should be 0\geq 0; (3) Apply kernel algebra: sums, products, positive scalar multiples, and exponentials of valid kernels are valid. The sigmoid kernel tanh(αxz+c)\tanh(\alpha\mathbf{x}^\top\mathbf{z} + c) is not always PSD - its validity depends on α\alpha and cc.

Q3: Derive the dual form of kernel ridge regression. When is the dual preferred over the primal?

Primal ridge: w^=(XX+nλId)1Xy\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X} + n\lambda\mathbf{I}_d)^{-1}\mathbf{X}^\top\mathbf{y} - requires inverting a d×dd \times d matrix. By the representer theorem, the optimal solution lies in the span of training data: w^=Xα\hat{\mathbf{w}} = \mathbf{X}^\top\boldsymbol{\alpha}. Substituting and simplifying: (XX+nλIn)α=y(\mathbf{X}\mathbf{X}^\top + n\lambda\mathbf{I}_n)\boldsymbol{\alpha} = \mathbf{y}, giving α^=(K+nλIn)1y\hat{\boldsymbol{\alpha}} = (\mathbf{K} + n\lambda\mathbf{I}_n)^{-1}\mathbf{y} where K=XX\mathbf{K} = \mathbf{X}\mathbf{X}^\top is n×nn \times n.

Use dual when: (1) dnd \gg n - the n×nn \times n dual system is smaller than the d×dd \times d primal; (2) using a kernel (including RBF with d=d = \infty) - the primal is impossible; (3) you need to kernelize the algorithm for nonlinear patterns. Use primal when ndn \gg d - inverting d×dd \times d is cheaper than n×nn \times n.

Q4: Why does the RBF kernel correspond to an infinite-dimensional feature space?

Using the Taylor expansion of exp(2γxz)=k=0(2γxz)kk!\exp(2\gamma\mathbf{x}^\top\mathbf{z}) = \sum_{k=0}^\infty \frac{(2\gamma\mathbf{x}^\top\mathbf{z})^k}{k!}, and writing k(x,z)=exp(γx2)exp(γz2)exp(2γxz)k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}\|^2)\exp(-\gamma\|\mathbf{z}\|^2)\exp(2\gamma\mathbf{x}^\top\mathbf{z}):

k(x,z)=exp(γx2)exp(γz2)k=0(2γ)kk!(xz)kk(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}\|^2)\exp(-\gamma\|\mathbf{z}\|^2)\sum_{k=0}^\infty \frac{(2\gamma)^k}{k!}(\mathbf{x}^\top\mathbf{z})^k

Each term corresponds to degree-kk polynomial interactions (monomials of degree kk in x\mathbf{x} dotted with monomials of degree kk in z\mathbf{z}). The sum is infinite - all polynomial degrees contribute. The feature map ϕ(x)\phi(\mathbf{x}) is therefore infinite-dimensional. Yet evaluating k(x,z)=exp(γxz2)k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2) costs O(d)O(d).

Q5: When would you choose polynomial features + linear model over kernel ridge regression?

Choose polynomial features + linear model when: (1) interpretability - PolynomialFeatures.get_feature_names_out() gives human-readable names like x1^2, x1 x2; polynomial coefficients have physical meaning (e.g., a quadratic displacement model); (2) large nn - for n>50,000n > 50{,}000, the n×nn \times n KRR system becomes infeasible while linear models on explicit features scale as O(nD)O(nD); (3) online learning - linear models support stochastic gradient descent and streaming updates; (4) known degree - physics or domain knowledge says the true relationship is quadratic (use degree 2, nothing higher); (5) downstream deployment - linear models are trivially serializable and deployable.

Choose kernel ridge regression when: (1) dd is large and the interaction structure is unknown - RBF covers all degrees; (2) nn is small enough (less than 10K) that n×nn \times n inversion is feasible; (3) you need a Gaussian process interpretation for uncertainty quantification (KRR is the MAP estimate of a GP with the corresponding kernel as the covariance function); (4) the degree is not known and you want to avoid committing to a fixed polynomial.

Q6: What is the Nyström approximation and when should you use it?

The Nyström approximation represents the full n×nn \times n Gram matrix as a rank-mm approximation using mnm \ll n landmark points: KKnmKmm1Kmn\mathbf{K} \approx \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}. This reduces storage from O(n2)O(n^2) to O(nm)O(nm) and computation from O(n3)O(n^3) to O(nm2+m3)O(nm^2 + m^3). For n=100,000n = 100{,}000 and m=1000m = 1000: storage drops from 80 GB to 800 MB, computation from O(1015)O(10^{15}) to O(1010)O(10^{10}) - a factor of 10510^5 reduction.

Use Nyström when n>10,000n > 10{,}000 and you need a nonlinear kernel. Alternative: Random Fourier Features (RBFSampler in sklearn) approximates the RBF kernel via random cosine projections - same asymptotic scalability but different approximation error characteristics. In practice, Nyström has better approximation quality for the same number of components mm, but RFF is simpler to implement and more numerically stable.

Kernel Methods vs Neural Networks

A natural question: kernel methods were the state of the art before deep learning. Why do we still teach them?

The kernel method perspective on neural networks: deep neural networks with infinite width converge to Gaussian Processes with a kernel determined by the network architecture - the Neural Tangent Kernel (NTK, Jacot et al. 2018). At initialization, an infinitely-wide neural network's training dynamics are exactly kernel regression with the NTK. This connection helps explain why neural networks generalize despite massive overparameterization.

When kernels still win: for small datasets (n<10,000n < 10{,}000), well-tuned kernel methods routinely outperform neural networks because neural networks need large data to learn useful representations. For tabular data with few features (d<100d < 100), kernel ridge regression with RBF often matches gradient-boosted trees. For structured data (graphs, strings, molecules), domain-specific kernels encode prior knowledge that is difficult to learn from data alone.

# ── Kernel methods vs neural networks on small data ───────────────────────────
import torch
import torch.nn as nn
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score
import numpy as np

np.random.seed(42)

# Small dataset scenario: n=200, d=10, nonlinear true function
n_small, d_small = 200, 10
X_small = np.random.randn(n_small, d_small)
# True function: sin of first feature × cos of second, plus product interaction
y_small = (np.sin(X_small[:, 0]) * np.cos(X_small[:, 1])
+ X_small[:, 0] * X_small[:, 2]
+ np.random.randn(n_small) * 0.1)

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

sc_s = StandardScaler()
X_small_sc = sc_s.fit_transform(X_small)

# Kernel Ridge Regression
krr_cv = cross_val_score(
KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5),
X_small_sc, y_small, cv=5, scoring="r2"
)

# Simple MLP (too little data to train well)
class SmallMLP(nn.Module):
def __init__(self, d):
super().__init__()
self.net = nn.Sequential(
nn.Linear(d, 64), nn.ReLU(),
nn.Linear(64, 64), nn.ReLU(),
nn.Linear(64, 1)
)
def forward(self, x): return self.net(x).squeeze(-1)

# Manual 5-fold CV for MLP
from sklearn.model_selection import KFold

mlp_scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
X_t = torch.tensor(X_small_sc, dtype=torch.float32)
y_t = torch.tensor(y_small, dtype=torch.float32)

for train_idx, val_idx in kf.split(X_small_sc):
model = SmallMLP(d_small)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(500):
pred = model(X_t[train_idx])
loss = nn.functional.mse_loss(pred, y_t[train_idx])
optim.zero_grad(); loss.backward(); optim.step()

model.eval()
with torch.no_grad():
pred_val = model(X_t[val_idx]).numpy()
y_val = y_small[val_idx]
ss_res = np.sum((y_val - pred_val)**2)
ss_tot = np.sum((y_val - y_val.mean())**2)
mlp_scores.append(1 - ss_res / ss_tot)

print(f"\nSmall data comparison (n={n_small}, d={d_small}):")
print(f" Kernel Ridge (RBF): CV R² = {krr_cv.mean():.4f} ± {krr_cv.std():.4f}")
print(f" MLP (2-layer): CV R² = {np.mean(mlp_scores):.4f} ± {np.std(mlp_scores):.4f}")
print(f" Kernel usually wins on small tabular data - no representation learning needed")

# ── When kernel fails: large n ─────────────────────────────────────────────────
print("\nScalability limit of exact KRR:")
for n_large in [1_000, 5_000, 10_000, 50_000]:
gram_mb = n_large ** 2 * 8 / 1e6
print(f" n={n_large:>6,}: Gram matrix = {gram_mb:>8.1f} MB "
f"{'(feasible)' if gram_mb < 1000 else '(too large - use Nystroem)'}")

Reproducing Kernel Hilbert Spaces (RKHS) - A Brief Overview

The mathematical foundation of kernel methods is the Reproducing Kernel Hilbert Space (RKHS). Every valid kernel kk defines a unique RKHS Hk\mathcal{H}_k - a Hilbert space of functions where:

  1. For every xx, the function k(x,):xk(x,x)k(x, \cdot): x' \mapsto k(x, x') belongs to Hk\mathcal{H}_k.
  2. The reproducing property holds: f,k(x,)Hk=f(x)\langle f, k(x, \cdot)\rangle_{\mathcal{H}_k} = f(x) for all fHkf \in \mathcal{H}_k and all xx.

The representer theorem - which we used to derive the kernel ridge regression dual - holds in any RKHS: the minimizer of any regularized empirical risk functional over Hk\mathcal{H}_k is a finite linear combination of kernel functions centered at training points:

f^(x)=i=1nαik(xi,x)\hat{f}(x) = \sum_{i=1}^n \alpha_i k(\mathbf{x}_i, x)

This is exactly the kernel ridge regression prediction formula y^(x)=iαik(xi,x)\hat{y}(x_*) = \sum_i \alpha_i k(\mathbf{x}_i, x_*). The RKHS provides the theoretical guarantee that this dual representation is complete - no information is lost by working in the dual rather than the primal.

Connection to Gaussian Processes: kernel ridge regression with kernel kk is equivalent to the MAP estimate of a Gaussian Process with prior covariance kk. The GP posterior mean is exactly f^\hat{f} from KRR, and the GP posterior variance provides uncertainty quantification that KRR alone does not give. This connection makes kernels interpretable as prior beliefs about function smoothness and structure.

Bias-Variance Tradeoff in Polynomial Degree and Kernel Bandwidth

Both polynomial degree and RBF bandwidth γ\gamma control the bias-variance tradeoff - but through very different mechanisms. Understanding this connection helps you cross-validate more efficiently.

Polynomial degree: increasing pp increases model flexibility (more features, more potential patterns captured). Low pp: high bias (underfitting), low variance. High pp: low bias, high variance (overfitting). Optimal pp is typically found by CV in {1,2,3,4,5}\{1, 2, 3, 4, 5\} - values above 5 are almost never optimal for practical regression.

RBF bandwidth γ=1/(2σ2)\gamma = 1/(2\sigma^2): larger γ\gamma (smaller bandwidth σ\sigma) means each training point influences only nearby test points - high variance, low bias locally. Smaller γ\gamma (larger bandwidth) means each training point influences test points far away - smooth global function, higher bias, lower variance.

# ── Bias-variance as a function of polynomial degree ─────────────────────────
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt

np.random.seed(42)
# True function: cubic with noise
n_bv = 300
X_bv = np.sort(np.random.uniform(-2, 2, n_bv)).reshape(-1, 1)
y_bv = X_bv.ravel()**3 - 2 * X_bv.ravel() + np.random.randn(n_bv) * 0.5

degrees = [1, 2, 3, 4, 5, 7, 10, 15]
train_mses, val_mses = [], []

for deg in degrees:
pipe = Pipeline([
("poly", PolynomialFeatures(degree=deg, include_bias=True)),
("scaler", StandardScaler()),
("lr", LinearRegression()),
])
cv_res = cross_validate(pipe, X_bv, y_bv, cv=5,
scoring="neg_mean_squared_error",
return_train_score=True)
train_mses.append(-cv_res["train_score"].mean())
val_mses.append(-cv_res["test_score"].mean())

opt_deg = degrees[int(np.argmin(val_mses))]
print("Polynomial Degree - Bias-Variance Tradeoff:")
print(f"{'Degree':>8} | {'Train MSE':>12} | {'Val MSE':>12}")
print("-" * 38)
for deg, tr, va in zip(degrees, train_mses, val_mses):
flag = " ← optimal" if deg == opt_deg else ""
print(f"{deg:>8} | {tr:>12.4f} | {va:>12.4f}{flag}")


# ── Bias-variance as a function of RBF gamma ─────────────────────────────────
from sklearn.kernel_ridge import KernelRidge

gammas = np.logspace(-3, 2, 20)
train_mses_g, val_mses_g = [], []

for gamma in gammas:
cv_res = cross_validate(
KernelRidge(kernel="rbf", alpha=0.01, gamma=gamma),
X_bv, y_bv, cv=5,
scoring="neg_mean_squared_error",
return_train_score=True
)
train_mses_g.append(-cv_res["train_score"].mean())
val_mses_g.append(-cv_res["test_score"].mean())

opt_gamma = gammas[int(np.argmin(val_mses_g))]
print(f"\nRBF Kernel - Optimal γ from CV: {opt_gamma:.4f}")
print(f" (sklearn 'scale' default: 1/(d·Var[X]) = "
f"{1/(1*np.var(X_bv)):.4f})")

String Kernels and Structured Data

The kernel framework extends beyond numeric feature vectors. String kernels compute similarity between sequences - DNA, protein chains, text - without any vectorization step.

The spectrum kernel (Leslie et al., 2002) for strings counts shared kk-mers (substrings of length kk):

kspectrum(s,t)=u: all k-mersϕu(s)ϕu(t)k_\text{spectrum}(s, t) = \sum_{\text{u: all k-mers}} \phi_u(s) \cdot \phi_u(t)

where ϕu(s)\phi_u(s) counts how many times k-mer uu appears in string ss. This is a valid (Mercer) kernel - it is a dot product of count vectors. With SVM or kernel ridge regression, it gives state-of-the-art performance on protein homology detection and remote sensing classification tasks, without any alignment or deep learning.

from collections import Counter

def spectrum_kernel(s: str, t: str, k: int = 3) -> int:
"""
Spectrum kernel: count of shared k-mers between strings s and t.
k(s,t) = φ(s)ᵀφ(t) where φ(s)_u = count of k-mer u in s.
Valid (Mercer) kernel: it is an inner product of count vectors.
"""
def count_kmers(seq, k):
return Counter(seq[i:i+k] for i in range(len(seq)-k+1))

kmers_s = count_kmers(s, k)
kmers_t = count_kmers(t, k)

# Dot product of count vectors over shared k-mers
return sum(kmers_s[u] * kmers_t[u] for u in kmers_s if u in kmers_t)


# Example: protein sequence similarity
seq1 = "ACDEFGHIKLMNPQRSTVWY" # 20 standard amino acids
seq2 = "ACDEFGHIKLMVPQRSTVWY" # one substitution
seq3 = "AAAAAAAAAAAAAAAAAAAA" # all-alanine (very different)

print("Spectrum kernel (k=3) - protein sequence similarity:")
print(f" k(seq1, seq1) = {spectrum_kernel(seq1, seq1, k=3)}")
print(f" k(seq1, seq2) = {spectrum_kernel(seq1, seq2, k=3)} (one substitution)")
print(f" k(seq1, seq3) = {spectrum_kernel(seq1, seq3, k=3)} (very different)")
print(" Higher value = more similar sequences (shared 3-gram patterns)")

# Verify PSD on a small set of sequences
seqs = [seq1, seq2, seq3, "MVLSPADKTNVKAAW", "GSHSMRYFYTSVSRPG"]
n_seqs = len(seqs)
K_str = np.array([[spectrum_kernel(seqs[i], seqs[j], k=3)
for j in range(n_seqs)]
for i in range(n_seqs)], dtype=float)

eigvals = np.linalg.eigvalsh(K_str)
print(f"\nSpectrum kernel Gram matrix eigenvalues: {eigvals.round(2)}")
print(f"All non-negative (PSD): {(eigvals >= -1e-9).all()}") # Should be True

:::tip 🎮 Interactive Playground

Visualize this concept: Try the SVM Kernels & Margin demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.