Polynomial Features and Kernel Methods
:::note Reading time: ~45 minutes | Interview relevance: High | Target roles: ML Engineer, Research Engineer, Applied Scientist, AI Engineer :::
The Real Interview Moment
You are interviewing at a biotech company. Their ML team has been using linear regression with raw features to predict protein binding affinity from amino acid sequences - R² of 0.38. The interviewer says: "We think there are interaction effects between amino acid positions. How would you capture those? And if we have 1000 amino acid positions, what is the problem with explicit polynomial features?"
The candidate who says "add polynomial features" gets the follow-up: "Degree-2 polynomial expansion of 1000 features gives how many features? What about degree-3? And how does the kernel trick solve this?"
The answer: features for degree-2. For degree-3: features. Even storing a single training matrix at degree-3 requires terabytes. The kernel trick computes the equivalent of working in that 167-million-dimensional space with just a dot product in the original space - cost instead of . The RBF kernel extends this to an infinite-dimensional feature space, with the same cost.
Understanding this distinction - and the precise mathematical conditions that make it possible (Mercer's theorem) - separates engineers who "use kernels" from engineers who understand why they work.
Why This Exists - The Nonlinearity Problem
Linear models draw hyperplanes. A linear regression or logistic regression with raw features cannot capture curves, interaction effects, or decision boundaries that are not planar. For the protein problem, an amino acid at position 50 might interact synergistically with an amino acid at position 300 - their joint effect on binding affinity exceeds the sum of their individual effects. No linear model in the original features can represent this.
Two approaches handle nonlinearity while preserving linear machinery:
- Explicit feature engineering: transform inputs (add , , , etc.) and apply linear models in the expanded space. Standard linear model theory applies without modification.
- Kernel methods: implicitly work in the expanded feature space without ever computing the features. The kernel function computes inner products in the high-dimensional space directly.
Both approaches are mathematically equivalent. They differ dramatically in computational cost when features are high-dimensional.
Historical Context
Polynomial features were used in statistics long before modern machine learning - astronomers fit polynomial curves to orbital data in the 19th century, and polynomial regression appears in Gauss's work on orbit prediction. The modern kernel framework emerged from support vector machines (Boser, Guyon, Vapnik 1992; Cortes and Vapnik 1995). Mercer's theorem itself dates to 1909, predating all of modern machine learning by decades. The key insight - that algorithms depending only on inner products can be "kernelized" - was explicitly formalized in the SVM context. Rahimi and Recht (2007) introduced random Fourier features ("random kitchen sinks"), enabling scalable approximation of kernel methods. Williams and Seeger (2001) developed Nyström approximation for kernel matrices.
Polynomial Feature Expansion
A degree- polynomial feature expansion maps to a higher-dimensional vector containing all monomials up to degree .
For , :
The model in the expanded space is still linear in :
This is nonlinear in (it can represent a quadratic response surface) but linear in . OLS, Ridge, LASSO - all linear model theory applies without modification to the expanded features.
The Curse of Dimensionality
The number of features after a degree- expansion of inputs (with intercept) is:
For large and small , this grows as .
| 10 | 66 | 286 | 1,001 |
| 100 | 5,151 | 176,851 | 4,598,126 |
| 500 | 125,751 | 20,917,501 | Infeasible |
| 1,000 | 501,501 | 167,167,000 | Infeasible |
For , : 167 million features. The training matrix requires - for training points - approximately bytes = 6.7 TB of storage. Each gradient step costs operations. This is completely infeasible.
The kernel trick avoids all of this.
Polynomial Features in Practice
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.datasets import make_regression
# ── Degree selection via cross-validation on 1D nonlinear data ────────────────
np.random.seed(42)
n = 200
X_1d = np.sort(np.random.uniform(-3, 3, n)).reshape(-1, 1)
# True function: quadratic + linear
y_1d = (0.5 * X_1d.ravel()**2 - 1.5 * X_1d.ravel()
+ np.random.randn(n) * 0.8)
print("Polynomial degree selection via 5-fold CV R²:")
print(f"{'Degree':>8} | {'CV R²':>10} | {'CV Std':>8} | {'N features':>12}")
print("-" * 48)
for degree in [1, 2, 3, 5, 8, 12, 20]:
pipe = Pipeline([
("poly", PolynomialFeatures(degree=degree, include_bias=True)),
("scaler", StandardScaler()),
("ridge", Ridge(alpha=0.01)),
])
cv_scores = cross_val_score(pipe, X_1d, y_1d, cv=5, scoring="r2")
n_feats = PolynomialFeatures(degree=degree).fit_transform(X_1d).shape[1]
print(f"{degree:>8} | {cv_scores.mean():>10.4f} | "
f"{cv_scores.std():>8.4f} | {n_feats:>12}")
# ── Visualize underfitting vs overfitting ─────────────────────────────────────
fig, axes = plt.subplots(1, 4, figsize=(17, 4))
X_plot = np.linspace(-3.5, 3.5, 600).reshape(-1, 1)
for ax, degree in zip(axes, [1, 2, 6, 15]):
pipe = Pipeline([
("poly", PolynomialFeatures(degree=degree)),
("sc", StandardScaler()),
("lr", LinearRegression()),
])
pipe.fit(X_1d, y_1d)
cv_r2 = cross_val_score(pipe, X_1d, y_1d, cv=5, scoring="r2").mean()
ax.scatter(X_1d, y_1d, s=12, alpha=0.45, color="steelblue")
ax.plot(X_plot, pipe.predict(X_plot), "r-", linewidth=2)
ax.set_title(f"Degree {degree}\nCV R²={cv_r2:.3f}")
ax.set_ylim(-8, 10)
ax.grid(True, alpha=0.2)
plt.suptitle("Polynomial Degree: Underfitting (deg 1) → Optimal (deg 2) → Overfitting (deg 15)",
y=1.02)
plt.tight_layout()
plt.savefig("polynomial_overfitting.png", dpi=150, bbox_inches="tight")
plt.show()
# ── Feature count explosion for higher dimensions ─────────────────────────────
print("\nFeature count explosion:")
print(f"{'d':>6} | {'p=2':>10} | {'p=3':>15} | {'p=4':>18}")
from math import comb
for d in [10, 50, 100, 500, 1000]:
p2 = comb(d+2, 2)
p3 = comb(d+3, 3)
p4 = comb(d+4, 4)
print(f"{d:>6} | {p2:>10,} | {p3:>15,} | {p4:>18,}")
The Kernel Trick
The Core Insight
Many ML algorithms only need inner products between training examples - never the feature vectors themselves. Ridge regression, SVM, PCA, K-means, Gaussian processes - all can be formulated using only pairwise inner products. If we can compute directly without first computing , we get the benefit of the high-dimensional feature space at the cost of operating in the original space.
A kernel function does exactly this:
Polynomial Kernel - Explicit Proof
For , degree , constant :
Expanding:
The scaling factors ensure the inner product matches. Computing the kernel: - one dot product plus one addition and squaring. Computing then the dot product: . For : 1,000 vs 1,000,000 operations per pair.
In general: implicitly works with all degree- monomials in variables, a -dimensional space, in time.
Common Kernels
Linear kernel:
The feature map is the identity: . No nonlinearity. Kernelizing a linear model recovers the original formulation - useful for code structure and to enable the dual formulation when .
Polynomial kernel:
Implicit features: all monomials of degree exactly (homogeneous) or up to degree (inhomogeneous, ). Standard for NLP (character -grams, document similarity) where all degree- term interactions matter.
RBF / Gaussian kernel:
The implicit feature space is infinite-dimensional. To see this, use the Taylor expansion of . With and writing :
Each term corresponds to degree- polynomial interactions, scaled. Summing from to gives an infinite polynomial feature space. Yet computing costs only .
The RBF kernel also has an intuitive geometric interpretation: it measures similarity via distance. when (identical points), and as (infinitely dissimilar).
Matern kernel (used in Gaussian processes):
where is the modified Bessel function of the second kind, is the length scale, and controls smoothness. At : Laplacian kernel . At : RBF kernel. Matern-3/2 and Matern-5/2 are popular in GP regression for physically plausible smoothness assumptions.
Mercer's Theorem: What Makes a Valid Kernel
Not every symmetric function corresponds to an inner product in some feature space. Mercer's theorem (1909) characterizes exactly which functions do.
Mercer's theorem: A symmetric function is a valid (Mercer) kernel - i.e., it corresponds to an inner product in some Hilbert space - if and only if the Gram matrix with is positive semi-definite for every finite set of input points .
Equivalently: is a valid kernel for all finite sets of inputs and all :
This is the PSD condition. It ensures that we are genuinely computing inner products, not some other bilinear form, and it guarantees that optimization problems using the kernel have unique solutions (convexity from PSD Gram matrices).
Kernel algebra - constructing new valid kernels from known ones:
If and are valid kernels, then so are:
- (sum of PSD matrices is PSD)
- (Schur/Hadamard product of PSD matrices is PSD)
- for any (positive scaling)
- where is a polynomial with non-negative coefficients
- (exponential of any PSD kernel is PSD)
This algebra lets you build domain-specific kernels by combining simpler ones.
# ── Kernel validity check: is the Gram matrix PSD? ────────────────────────────
def gram_matrix(kernel_fn, X: np.ndarray) -> np.ndarray:
"""Compute n×n Gram matrix for a kernel function."""
n = len(X)
K = np.zeros((n, n))
for i in range(n):
for j in range(n):
K[i, j] = kernel_fn(X[i], X[j])
return K
def is_psd(K: np.ndarray, tol: float = 1e-9) -> tuple[bool, float]:
"""Check PSD by computing minimum eigenvalue."""
eigvals = np.linalg.eigvalsh(K)
return eigvals.min() >= -tol, float(eigvals.min())
np.random.seed(42)
X_check = np.random.randn(40, 3)
# Define common kernels
kernels = {
"Linear k(x,z) = xᵀz":
lambda x, z: x @ z,
"Poly-3 k(x,z) = (xᵀz+1)³":
lambda x, z: (x @ z + 1) ** 3,
"RBF k(x,z) = exp(-0.5‖x-z‖²)":
lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2)),
"Laplacian k(x,z) = exp(-‖x-z‖₁)":
lambda x, z: np.exp(-np.sum(np.abs(x - z))),
"Sigmoid k(x,z) = tanh(xᵀz - 1) [NOT always PSD]":
lambda x, z: np.tanh(x @ z - 1.0),
"Sum RBF+Poly [valid composition]":
lambda x, z: (np.exp(-0.5 * np.sum((x - z) ** 2))
+ (x @ z + 1) ** 2),
}
print("Mercer's theorem - Gram matrix PSD check:")
print("-" * 70)
for name, kfn in kernels.items():
K = gram_matrix(kfn, X_check)
valid, min_eig = is_psd(K)
status = "VALID " if valid else "INVALID"
print(f" {status} min_eig={min_eig:+.4f} {name}")
Kernel Ridge Regression: Dual Formulation
From Primal to Dual
Primal ridge regression (standard form):
Representer theorem: for any regularized empirical risk minimization over a RKHS (Reproducing Kernel Hilbert Space), the optimal solution lies in the span of the training data:
Substituting into the ridge objective and differentiating:
where is the Gram matrix with .
Prediction for a new point :
Kernelization: replace every inner product with :
where and .
This dual form works with any valid kernel - including RBF, which implicitly uses an infinite-dimensional feature map. The primal problem would be ; the dual is always .
Primal vs Dual: When to Use Which
| Criterion | Primal | Dual |
|---|---|---|
| System to invert | ||
| Prefer when | or (kernel) | |
| Kernel required | No | Yes |
| Training complexity | ||
| Prediction complexity | (sum over training) |
The crossover: if , both are equivalent. If using kernels (especially RBF), always use the dual - there is no choice, since the primal is infinite-dimensional.
Full NumPy Kernel Ridge Regression
class KernelRidgeRegression:
"""
Kernel Ridge Regression via the dual formulation.
Solves: α = (K + n·λ·I)⁻¹ y
Predicts: ŷ(x*) = kₓ*ᵀ α where kₓ* = [k(x1,x*),...,k(xn,x*)]
Supports: 'linear', 'polynomial', 'rbf', or any callable kernel.
"""
def __init__(self, kernel: str = "rbf", alpha: float = 1.0,
gamma: float = 1.0, degree: int = 3, coef0: float = 1.0):
self.kernel = kernel
self.alpha = alpha # regularization strength λ
self.gamma = gamma # RBF: 1/(2σ²); Poly: scaling
self.degree = degree
self.coef0 = coef0
self._alpha_dual: np.ndarray | None = None
self._X_train: np.ndarray | None = None
def _k(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
"""Compute kernel matrix between X1 [n1×d] and X2 [n2×d]."""
if self.kernel == "linear":
return X1 @ X2.T
elif self.kernel == "polynomial":
return (X1 @ X2.T + self.coef0) ** self.degree
elif self.kernel == "rbf":
# Vectorized squared Euclidean distances (no for-loop)
# ||xi - xj||² = ||xi||² - 2 xi·xj + ||xj||²
X1_sq = np.sum(X1 ** 2, axis=1, keepdims=True) # [n1, 1]
X2_sq = np.sum(X2 ** 2, axis=1, keepdims=True) # [n2, 1]
sq_dists = X1_sq + X2_sq.T - 2.0 * (X1 @ X2.T) # [n1, n2]
return np.exp(-self.gamma * sq_dists)
elif callable(self.kernel):
n1, n2 = len(X1), len(X2)
K = np.zeros((n1, n2))
for i in range(n1):
for j in range(n2):
K[i, j] = self.kernel(X1[i], X2[j])
return K
else:
raise ValueError(f"Unknown kernel: {self.kernel!r}")
def fit(self, X: np.ndarray, y: np.ndarray) -> "KernelRidgeRegression":
"""Solve dual: α = (K + n·λ·I)⁻¹ y via lstsq for numerical stability."""
self._X_train = X.copy()
n = len(X)
K = self._k(X, X)
# Note: sklearn uses (K + alpha*n*I); we follow that convention
K_reg = K + self.alpha * n * np.eye(n)
self._alpha_dual, *_ = np.linalg.lstsq(K_reg, y, rcond=None)
return self
def predict(self, X_new: np.ndarray) -> np.ndarray:
"""ŷ = K(X_new, X_train) · α_dual"""
K_new = self._k(X_new, self._X_train) # [n_new, n_train]
return K_new @ self._alpha_dual
def score(self, X: np.ndarray, y: np.ndarray) -> float:
y_pred = self.predict(X)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
return float(1.0 - ss_res / ss_tot)
# ── Demo on sinusoidal data ───────────────────────────────────────────────────
np.random.seed(42)
X_nl = np.sort(np.random.uniform(-3, 3, 300)).reshape(-1, 1)
y_nl = np.sin(X_nl.ravel()) + 0.15 * np.random.randn(300)
X_tr, X_te, y_tr, y_te = train_test_split(X_nl, y_nl, test_size=0.2,
random_state=42)
sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)
print("\nKernel Ridge Regression on sin(x) + noise:")
print("-" * 50)
configs = [
("Linear", dict(kernel="linear")),
("Polynomial (d=3)", dict(kernel="polynomial", degree=3)),
("RBF (γ=0.1)", dict(kernel="rbf", gamma=0.1)),
("RBF (γ=1.0)", dict(kernel="rbf", gamma=1.0)),
("RBF (γ=10.0)", dict(kernel="rbf", gamma=10.0)),
]
for name, kwargs in configs:
krr = KernelRidgeRegression(alpha=0.1, **kwargs)
krr.fit(X_tr_sc, y_tr)
r2 = krr.score(X_te_sc, y_te)
print(f" {name:<22s}: Test R² = {r2:.4f}")
# ── Visualize the three regimes of RBF bandwidth ─────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
X_plot = np.linspace(-3.5, 3.5, 600).reshape(-1, 1)
X_plot_sc = sc.transform(X_plot)
for ax, (gamma, title) in zip(axes, [
(0.05, "γ=0.05 (over-smooth, underfitting)"),
(1.0, "γ=1.0 (good fit)"),
(15.0, "γ=15.0 (over-sharp, overfitting)"),
]):
krr = KernelRidgeRegression(alpha=0.1, kernel="rbf", gamma=gamma)
krr.fit(X_tr_sc, y_tr)
r2_te = krr.score(X_te_sc, y_te)
ax.scatter(X_tr, y_tr, s=10, alpha=0.4, color="steelblue")
ax.scatter(X_te, y_te, s=10, alpha=0.4, color="orange")
ax.plot(X_plot, krr.predict(X_plot_sc), "r-", linewidth=2,
label=f"KRR R²={r2_te:.3f}")
ax.plot(X_plot, np.sin(X_plot), "g--", linewidth=1.2,
alpha=0.7, label="True sin(x)")
ax.set_title(title)
ax.legend(fontsize=7)
ax.set_ylim(-1.8, 1.8)
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.savefig("kernel_ridge_gamma.png", dpi=150, bbox_inches="tight")
plt.show()
Sklearn KernelRidge and Grid Search
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
# ── Multivariate nonlinear regression ─────────────────────────────────────────
np.random.seed(42)
n_m = 600
X_m = np.random.randn(n_m, 5)
y_m = (np.sin(X_m[:, 0]) * np.cos(X_m[:, 1])
+ X_m[:, 2] ** 2 - X_m[:, 3] * X_m[:, 4]
+ np.random.randn(n_m) * 0.2)
X_tr_m, X_te_m, y_tr_m, y_te_m = train_test_split(
X_m, y_m, test_size=0.2, random_state=42)
sc_m = StandardScaler()
X_tr_m_sc = sc_m.fit_transform(X_tr_m)
X_te_m_sc = sc_m.transform(X_te_m)
# Baseline: linear Ridge
lr_base = Ridge(alpha=1.0).fit(X_tr_m_sc, y_tr_m)
print(f"Linear Ridge baseline: Test R²={lr_base.score(X_te_m_sc, y_te_m):.4f}")
# KernelRidge with grid search over alpha and gamma
param_grid = {
"alpha": [0.001, 0.01, 0.1, 1.0, 10.0],
"gamma": [0.05, 0.1, 0.5, 1.0, 2.0, 5.0],
}
gs = GridSearchCV(
KernelRidge(kernel="rbf"),
param_grid, cv=5, scoring="r2", n_jobs=-1, verbose=0
)
gs.fit(X_tr_m_sc, y_tr_m)
print(f"KernelRidge (RBF, grid search):")
print(f" Best params: {gs.best_params_}")
print(f" CV R²: {gs.best_score_:.4f}")
print(f" Test R²: {gs.best_estimator_.score(X_te_m_sc, y_te_m):.4f}")
Nyström Approximation: Scaling Kernel Methods
Kernel ridge regression has two scalability bottlenecks: computing the Gram matrix ( time, memory) and inverting it ( time). For and : the Gram matrix alone requires bytes = 80 GB.
Nyström approximation (Williams and Seeger, 2001): choose landmark points (typically a random subset of training points). The full Gram matrix is approximated as:
where is the kernel matrix between all training points and the landmarks, and is the kernel matrix between landmarks. Storage: instead of . The approximation quality improves as increases - at , it is exact.
Random Fourier Features (Rahimi and Recht, 2007): approximates the RBF kernel via random projections. By Bochner's theorem, a stationary kernel can be written as the expectation of over random frequencies :
Sample random frequencies and random phases . Map each input to:
Then . With random features, fit a standard linear model on - cost instead of .
from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
import time
print("\nKernel approximation quality vs computation:")
print(f"{'Method':35s} | {'Test R²':>8} | {'Time (s)':>10}")
print("-" * 60)
# Exact kernel ridge (reference)
t0 = time.time()
krr_exact = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5)
krr_exact.fit(X_tr_m_sc, y_tr_m)
r2_exact = krr_exact.score(X_te_m_sc, y_te_m)
print(f"{'Exact KRR':35s} | {r2_exact:>8.4f} | {time.time()-t0:>10.3f}")
# Nyström approximation with varying m
for m in [50, 100, 200, 500]:
t0 = time.time()
pipe = Pipeline([
("nys", Nystroem(kernel="rbf", gamma=0.5,
n_components=m, random_state=42)),
("ridge", Ridge(alpha=0.1)),
])
pipe.fit(X_tr_m_sc, y_tr_m)
r2 = pipe.score(X_te_m_sc, y_te_m)
print(f"{'Nyström (m='+str(m)+')':35s} | {r2:>8.4f} | {time.time()-t0:>10.3f}")
# Random Fourier Features
for D in [100, 500, 1000]:
t0 = time.time()
pipe = Pipeline([
("rff", RBFSampler(gamma=0.5, n_components=D, random_state=42)),
("ridge", Ridge(alpha=0.1)),
])
pipe.fit(X_tr_m_sc, y_tr_m)
r2 = pipe.score(X_te_m_sc, y_te_m)
print(f"{'RFF (D='+str(D)+')':35s} | {r2:>8.4f} | {time.time()-t0:>10.3f}")
Kernel Selection Guide
print("\nKernel Selection Guide:")
print("=" * 70)
guide = [
("Linear k(x,z) = xᵀz",
"Feature space is already the right space; high-d text (d≫n works via dual)"),
("Polynomial k(x,z) = (xᵀz+c)ᵖ",
"Known interaction degree; NLP n-gram similarity; computer vision patch matching"),
("RBF k(x,z) = exp(-γ‖x-z‖²)",
"Default first choice; local similarity; tabular data after standardization"),
("Laplacian k(x,z) = exp(-γ‖x-z‖₁)",
"More robust to outliers in individual dims; sparser locality than RBF"),
("Matern-5/2",
"GP regression for physical processes; assumption of finite differentiability"),
("String kernel k(s,t) = #common substrings",
"DNA/protein sequences; text classification without bag-of-words encoding"),
]
for name, use_when in guide:
print(f" {name}")
print(f" Use when: {use_when}\n")
print("RBF γ parameter guide:")
print(" Start with: γ = 1 / (d · Var[X]) ('scale' in sklearn)")
print(" Small γ → broad similarity, smooth function (risk: underfitting)")
print(" Large γ → narrow similarity, spiky function (risk: overfitting)")
print(" Always cross-validate γ jointly with the regularization α")
print("\nSVM vs KRR - same kernel, different loss:")
print(" KRR loss = MSE → regression, smooth solution")
print(" SVM loss = hinge → classification/regression, sparse support vectors")
print(" Both solve a quadratic optimization in the dual; both kernelizable")
Common Mistakes
:::danger Not scaling features before kernel methods
# WRONG - different feature scales distort RBF distances
krr = KernelRidge(kernel="rbf", gamma=0.5).fit(X_raw, y)
# CORRECT - always standardize first
sc = StandardScaler()
X_sc = sc.fit_transform(X_train)
krr = KernelRidge(kernel="rbf", gamma=0.5).fit(X_sc, y)
The RBF kernel computes . A feature with variance 1000 contributes more to the Euclidean distance than a feature with variance 1, effectively ignoring the latter. After StandardScaler, all features have unit variance and contribute equally. Apply transform() (not fit_transform()) to test data.
:::
:::danger Kernel ridge regression with large n - memory and compute explosion
For : Gram matrix = bytes = 800 MB. For : 80 GB. For , use Nyström (sklearn.kernel_approximation.Nystroem + Ridge) or Random Fourier Features (RBFSampler + Ridge). These scale as instead of and usually achieve 95%+ of the performance at a fraction of the cost.
:::
:::warning High-degree polynomial features without regularization Degree- polynomial features create features. Without regularization, the model perfectly interpolates training data (R² = 1.0 if ) and fails catastrophically on test data. Always combine polynomial features with Ridge or Lasso regularization, and cross-validate both the degree and the regularization strength jointly in a Pipeline. :::
:::warning Choosing γ for RBF without cross-validation
Too large a : the model effectively interpolates training points (RBF kernel becomes an identity matrix) - perfect training fit, terrible test performance. Too small: the model is over-smooth, effectively linear. Always cross-validate over a logarithmic grid: param_grid = {"gamma": np.logspace(-4, 2, 30)}. sklearn's gamma="scale" is a good starting point: .
:::
YouTube Resources
| Resource | Channel | Why Watch |
|---|---|---|
| The Kernel Trick Explained | StatQuest with Josh Starmer | Best visual introduction to kernel functions and the kernel trick - start here |
| MIT 6.034: Kernel Machines | MIT OpenCourseWare | Mercer's theorem, SVM, and kernel methods - rigorous mathematical treatment |
| Random Features for Kernel Machines (Rahimi & Recht) | NIPS Conference Talk | The random Fourier features paper - the scalability breakthrough for large kernels |
| Gaussian Processes from the Kernel Perspective | Richard Turner (Cambridge) | How kernels define function spaces - connects KRR to Gaussian process regression |
| Polynomial Features and Overfitting | StatQuest with Josh Starmer | Visual intuition for polynomial expansion and the bias-variance tradeoff with degree |
Interview Q&A
Q1: Explain the kernel trick. Why is it computationally valuable?
The kernel trick exploits the fact that many ML algorithms need only inner products between feature vectors - never the vectors themselves. A kernel function computes this inner product implicitly: for the polynomial kernel , computing costs (one dot product plus scalar operations), while computing then its dot product costs . For , : 1,000 vs 167,000,000 operations per pair. For the RBF kernel, , yet computing still costs . The kernel trick gives infinite expressiveness at the cost of a single dot product in the original space.
Q2: State Mercer's theorem. How do you verify a function is a valid kernel?
Mercer's theorem: a symmetric function is a valid (Mercer) kernel - corresponds to an inner product in some Hilbert space - if and only if the Gram matrix with is positive semi-definite (all eigenvalues ) for every finite set of input points.
Verification strategies: (1) Show for an explicit (constructive proof for polynomial kernel); (2) Compute eigenvalues of the Gram matrix on a sample of points - all should be ; (3) Apply kernel algebra: sums, products, positive scalar multiples, and exponentials of valid kernels are valid. The sigmoid kernel is not always PSD - its validity depends on and .
Q3: Derive the dual form of kernel ridge regression. When is the dual preferred over the primal?
Primal ridge: - requires inverting a matrix. By the representer theorem, the optimal solution lies in the span of training data: . Substituting and simplifying: , giving where is .
Use dual when: (1) - the dual system is smaller than the primal; (2) using a kernel (including RBF with ) - the primal is impossible; (3) you need to kernelize the algorithm for nonlinear patterns. Use primal when - inverting is cheaper than .
Q4: Why does the RBF kernel correspond to an infinite-dimensional feature space?
Using the Taylor expansion of , and writing :
Each term corresponds to degree- polynomial interactions (monomials of degree in dotted with monomials of degree in ). The sum is infinite - all polynomial degrees contribute. The feature map is therefore infinite-dimensional. Yet evaluating costs .
Q5: When would you choose polynomial features + linear model over kernel ridge regression?
Choose polynomial features + linear model when: (1) interpretability - PolynomialFeatures.get_feature_names_out() gives human-readable names like x1^2, x1 x2; polynomial coefficients have physical meaning (e.g., a quadratic displacement model); (2) large - for , the KRR system becomes infeasible while linear models on explicit features scale as ; (3) online learning - linear models support stochastic gradient descent and streaming updates; (4) known degree - physics or domain knowledge says the true relationship is quadratic (use degree 2, nothing higher); (5) downstream deployment - linear models are trivially serializable and deployable.
Choose kernel ridge regression when: (1) is large and the interaction structure is unknown - RBF covers all degrees; (2) is small enough (less than 10K) that inversion is feasible; (3) you need a Gaussian process interpretation for uncertainty quantification (KRR is the MAP estimate of a GP with the corresponding kernel as the covariance function); (4) the degree is not known and you want to avoid committing to a fixed polynomial.
Q6: What is the Nyström approximation and when should you use it?
The Nyström approximation represents the full Gram matrix as a rank- approximation using landmark points: . This reduces storage from to and computation from to . For and : storage drops from 80 GB to 800 MB, computation from to - a factor of reduction.
Use Nyström when and you need a nonlinear kernel. Alternative: Random Fourier Features (RBFSampler in sklearn) approximates the RBF kernel via random cosine projections - same asymptotic scalability but different approximation error characteristics. In practice, Nyström has better approximation quality for the same number of components , but RFF is simpler to implement and more numerically stable.
Kernel Methods vs Neural Networks
A natural question: kernel methods were the state of the art before deep learning. Why do we still teach them?
The kernel method perspective on neural networks: deep neural networks with infinite width converge to Gaussian Processes with a kernel determined by the network architecture - the Neural Tangent Kernel (NTK, Jacot et al. 2018). At initialization, an infinitely-wide neural network's training dynamics are exactly kernel regression with the NTK. This connection helps explain why neural networks generalize despite massive overparameterization.
When kernels still win: for small datasets (), well-tuned kernel methods routinely outperform neural networks because neural networks need large data to learn useful representations. For tabular data with few features (), kernel ridge regression with RBF often matches gradient-boosted trees. For structured data (graphs, strings, molecules), domain-specific kernels encode prior knowledge that is difficult to learn from data alone.
# ── Kernel methods vs neural networks on small data ───────────────────────────
import torch
import torch.nn as nn
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score
import numpy as np
np.random.seed(42)
# Small dataset scenario: n=200, d=10, nonlinear true function
n_small, d_small = 200, 10
X_small = np.random.randn(n_small, d_small)
# True function: sin of first feature × cos of second, plus product interaction
y_small = (np.sin(X_small[:, 0]) * np.cos(X_small[:, 1])
+ X_small[:, 0] * X_small[:, 2]
+ np.random.randn(n_small) * 0.1)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
sc_s = StandardScaler()
X_small_sc = sc_s.fit_transform(X_small)
# Kernel Ridge Regression
krr_cv = cross_val_score(
KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5),
X_small_sc, y_small, cv=5, scoring="r2"
)
# Simple MLP (too little data to train well)
class SmallMLP(nn.Module):
def __init__(self, d):
super().__init__()
self.net = nn.Sequential(
nn.Linear(d, 64), nn.ReLU(),
nn.Linear(64, 64), nn.ReLU(),
nn.Linear(64, 1)
)
def forward(self, x): return self.net(x).squeeze(-1)
# Manual 5-fold CV for MLP
from sklearn.model_selection import KFold
mlp_scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
X_t = torch.tensor(X_small_sc, dtype=torch.float32)
y_t = torch.tensor(y_small, dtype=torch.float32)
for train_idx, val_idx in kf.split(X_small_sc):
model = SmallMLP(d_small)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):
pred = model(X_t[train_idx])
loss = nn.functional.mse_loss(pred, y_t[train_idx])
optim.zero_grad(); loss.backward(); optim.step()
model.eval()
with torch.no_grad():
pred_val = model(X_t[val_idx]).numpy()
y_val = y_small[val_idx]
ss_res = np.sum((y_val - pred_val)**2)
ss_tot = np.sum((y_val - y_val.mean())**2)
mlp_scores.append(1 - ss_res / ss_tot)
print(f"\nSmall data comparison (n={n_small}, d={d_small}):")
print(f" Kernel Ridge (RBF): CV R² = {krr_cv.mean():.4f} ± {krr_cv.std():.4f}")
print(f" MLP (2-layer): CV R² = {np.mean(mlp_scores):.4f} ± {np.std(mlp_scores):.4f}")
print(f" Kernel usually wins on small tabular data - no representation learning needed")
# ── When kernel fails: large n ─────────────────────────────────────────────────
print("\nScalability limit of exact KRR:")
for n_large in [1_000, 5_000, 10_000, 50_000]:
gram_mb = n_large ** 2 * 8 / 1e6
print(f" n={n_large:>6,}: Gram matrix = {gram_mb:>8.1f} MB "
f"{'(feasible)' if gram_mb < 1000 else '(too large - use Nystroem)'}")
Reproducing Kernel Hilbert Spaces (RKHS) - A Brief Overview
The mathematical foundation of kernel methods is the Reproducing Kernel Hilbert Space (RKHS). Every valid kernel defines a unique RKHS - a Hilbert space of functions where:
- For every , the function belongs to .
- The reproducing property holds: for all and all .
The representer theorem - which we used to derive the kernel ridge regression dual - holds in any RKHS: the minimizer of any regularized empirical risk functional over is a finite linear combination of kernel functions centered at training points:
This is exactly the kernel ridge regression prediction formula . The RKHS provides the theoretical guarantee that this dual representation is complete - no information is lost by working in the dual rather than the primal.
Connection to Gaussian Processes: kernel ridge regression with kernel is equivalent to the MAP estimate of a Gaussian Process with prior covariance . The GP posterior mean is exactly from KRR, and the GP posterior variance provides uncertainty quantification that KRR alone does not give. This connection makes kernels interpretable as prior beliefs about function smoothness and structure.
Bias-Variance Tradeoff in Polynomial Degree and Kernel Bandwidth
Both polynomial degree and RBF bandwidth control the bias-variance tradeoff - but through very different mechanisms. Understanding this connection helps you cross-validate more efficiently.
Polynomial degree: increasing increases model flexibility (more features, more potential patterns captured). Low : high bias (underfitting), low variance. High : low bias, high variance (overfitting). Optimal is typically found by CV in - values above 5 are almost never optimal for practical regression.
RBF bandwidth : larger (smaller bandwidth ) means each training point influences only nearby test points - high variance, low bias locally. Smaller (larger bandwidth) means each training point influences test points far away - smooth global function, higher bias, lower variance.
# ── Bias-variance as a function of polynomial degree ─────────────────────────
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
np.random.seed(42)
# True function: cubic with noise
n_bv = 300
X_bv = np.sort(np.random.uniform(-2, 2, n_bv)).reshape(-1, 1)
y_bv = X_bv.ravel()**3 - 2 * X_bv.ravel() + np.random.randn(n_bv) * 0.5
degrees = [1, 2, 3, 4, 5, 7, 10, 15]
train_mses, val_mses = [], []
for deg in degrees:
pipe = Pipeline([
("poly", PolynomialFeatures(degree=deg, include_bias=True)),
("scaler", StandardScaler()),
("lr", LinearRegression()),
])
cv_res = cross_validate(pipe, X_bv, y_bv, cv=5,
scoring="neg_mean_squared_error",
return_train_score=True)
train_mses.append(-cv_res["train_score"].mean())
val_mses.append(-cv_res["test_score"].mean())
opt_deg = degrees[int(np.argmin(val_mses))]
print("Polynomial Degree - Bias-Variance Tradeoff:")
print(f"{'Degree':>8} | {'Train MSE':>12} | {'Val MSE':>12}")
print("-" * 38)
for deg, tr, va in zip(degrees, train_mses, val_mses):
flag = " ← optimal" if deg == opt_deg else ""
print(f"{deg:>8} | {tr:>12.4f} | {va:>12.4f}{flag}")
# ── Bias-variance as a function of RBF gamma ─────────────────────────────────
from sklearn.kernel_ridge import KernelRidge
gammas = np.logspace(-3, 2, 20)
train_mses_g, val_mses_g = [], []
for gamma in gammas:
cv_res = cross_validate(
KernelRidge(kernel="rbf", alpha=0.01, gamma=gamma),
X_bv, y_bv, cv=5,
scoring="neg_mean_squared_error",
return_train_score=True
)
train_mses_g.append(-cv_res["train_score"].mean())
val_mses_g.append(-cv_res["test_score"].mean())
opt_gamma = gammas[int(np.argmin(val_mses_g))]
print(f"\nRBF Kernel - Optimal γ from CV: {opt_gamma:.4f}")
print(f" (sklearn 'scale' default: 1/(d·Var[X]) = "
f"{1/(1*np.var(X_bv)):.4f})")
String Kernels and Structured Data
The kernel framework extends beyond numeric feature vectors. String kernels compute similarity between sequences - DNA, protein chains, text - without any vectorization step.
The spectrum kernel (Leslie et al., 2002) for strings counts shared -mers (substrings of length ):
where counts how many times k-mer appears in string . This is a valid (Mercer) kernel - it is a dot product of count vectors. With SVM or kernel ridge regression, it gives state-of-the-art performance on protein homology detection and remote sensing classification tasks, without any alignment or deep learning.
from collections import Counter
def spectrum_kernel(s: str, t: str, k: int = 3) -> int:
"""
Spectrum kernel: count of shared k-mers between strings s and t.
k(s,t) = φ(s)ᵀφ(t) where φ(s)_u = count of k-mer u in s.
Valid (Mercer) kernel: it is an inner product of count vectors.
"""
def count_kmers(seq, k):
return Counter(seq[i:i+k] for i in range(len(seq)-k+1))
kmers_s = count_kmers(s, k)
kmers_t = count_kmers(t, k)
# Dot product of count vectors over shared k-mers
return sum(kmers_s[u] * kmers_t[u] for u in kmers_s if u in kmers_t)
# Example: protein sequence similarity
seq1 = "ACDEFGHIKLMNPQRSTVWY" # 20 standard amino acids
seq2 = "ACDEFGHIKLMVPQRSTVWY" # one substitution
seq3 = "AAAAAAAAAAAAAAAAAAAA" # all-alanine (very different)
print("Spectrum kernel (k=3) - protein sequence similarity:")
print(f" k(seq1, seq1) = {spectrum_kernel(seq1, seq1, k=3)}")
print(f" k(seq1, seq2) = {spectrum_kernel(seq1, seq2, k=3)} (one substitution)")
print(f" k(seq1, seq3) = {spectrum_kernel(seq1, seq3, k=3)} (very different)")
print(" Higher value = more similar sequences (shared 3-gram patterns)")
# Verify PSD on a small set of sequences
seqs = [seq1, seq2, seq3, "MVLSPADKTNVKAAW", "GSHSMRYFYTSVSRPG"]
n_seqs = len(seqs)
K_str = np.array([[spectrum_kernel(seqs[i], seqs[j], k=3)
for j in range(n_seqs)]
for i in range(n_seqs)], dtype=float)
eigvals = np.linalg.eigvalsh(K_str)
print(f"\nSpectrum kernel Gram matrix eigenvalues: {eigvals.round(2)}")
print(f"All non-negative (PSD): {(eigvals >= -1e-9).all()}") # Should be True
:::tip 🎮 Interactive Playground
Visualize this concept: Try the SVM Kernels & Margin demo on the EngineersOfAI Playground - no code required.
:::
