Skip to main content

Gaussian Processes - Non-Parametric Bayesian Regression with Calibrated Uncertainty

:::note Reading time: ~40 minutes | Interview relevance: High | Target roles: MLE, Research Engineer, AI Engineer, Data Scientist :::

The Real Interview Moment

A pharmaceutical company is running drug discovery experiments. Each experiment tests one candidate compound for efficacy against a target protein. Each experiment costs $10,000 and takes three days to run. The company has a library of 50,000 candidate compounds. They can afford to test perhaps 500 over the next six months. The goal is to find the highest-efficacy compound in the library.

After 50 experiments, they have 50 labeled data points: compound feature vector to measured efficacy score. They want to predict which of the remaining 49,950 compounds to test next. A deep neural network trained on 50 data points is wildly overfit - it has hundreds of thousands of parameters and effectively memorises its 50 training examples, providing no useful generalisation. A linear model is underfitting - drug-compound relationships are highly nonlinear.

A Gaussian process does something neither approach can. It places a prior directly over functions - over all possible relationships between compound features and efficacy. After conditioning on the 50 observations, it gives a posterior predictive distribution at every candidate compound: not just a predicted efficacy score, but a full probability distribution. Compounds where the model is uncertain (high posterior variance) are promising exploration targets. Compounds where the model is confident and predicts high efficacy are promising exploitation targets. The interplay between these two drives Bayesian optimisation - the principled strategy for expensive experiment selection (covered in Lesson 08).

Gaussian processes are the mathematical machinery behind this entire workflow. Understanding them is essential for anyone working on hyperparameter tuning, experimental design, materials science, drug discovery, or any domain where experiments are expensive and data is scarce.


Why GPs? The Parametric vs Non-Parametric Divide

Standard ML models are parametric: they have a fixed number of parameters θ\theta (weights, biases, tree splits). Learning means finding the best parameter values. Once trained, the model's capacity is fixed - the number of parameters does not grow with the amount of data.

Gaussian processes are non-parametric: the model complexity grows with the data. There is no fixed-dimension weight vector - instead, the training examples themselves encode the model. Given a new input, the model computes similarity to every training point and makes a prediction weighted by those similarities.

More precisely, a GP is a prior directly over the space of functions f:XRf: \mathcal{X} \to \mathbb{R}. Rather than placing uncertainty over a fixed set of weights and then deriving a distribution over functions, a GP directly specifies what smooth, complex, and periodic functions are plausible a priori, and how that prior should update when we observe labeled examples.

The result is a posterior that is also a distribution over functions - we can sample plausible explanations for the data and make predictions with calibrated uncertainty.


The GP Formal Definition

A Gaussian process is defined by two components: a mean function and a covariance (kernel) function.

f(x)GP ⁣(μ(x),  k(x,x))f(x) \sim \mathcal{GP}\!\left(\mu(x),\; k(x, x')\right)

What this means: any finite collection of function values [f(x1),f(x2),,f(xn)][f(x_1), f(x_2), \ldots, f(x_n)] follows a multivariate Gaussian distribution:

[f(x1)f(xn)]N ⁣([μ(x1)μ(xn)],  [k(x1,x1)k(x1,xn)k(xn,x1)k(xn,xn)])\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \mu(x_1) \\ \vdots \\ \mu(x_n) \end{bmatrix},\; \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix}\right)

The mean function μ(x)\mu(x) is typically set to zero - we let the kernel encode all structure. The kernel k(x,x)k(x, x') encodes our beliefs about function smoothness, periodicity, and scale.


Kernel Functions: Encoding Beliefs About Functions

The kernel is the most important design choice in a GP. It encodes assumptions about the true function: how smooth is it? Does it have periodic structure? Does it vary at different rates in different directions?

Radial Basis Function (RBF) / Squared Exponential

kRBF(x,x)=σ2exp ⁣(xx222)k_{\text{RBF}}(x, x') = \sigma^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)

Parameters:

  • σ2\sigma^2 (output scale): overall variance of function values
  • \ell (length scale): how quickly the function changes. Large \ell means smooth, slowly varying. Small \ell means rapidly varying.

The RBF kernel produces infinitely differentiable functions - extremely smooth. Use this when you believe the underlying function is very smooth with no kinks. Oversmoothing is possible if the true function has discontinuities.

Matérn Kernel (Most Common in Practice)

kν=5/2(x,x)=σ2 ⁣(1+5xx+5xx232)exp ⁣(5xx)k_{\nu=5/2}(x, x') = \sigma^2\!\left(1 + \frac{\sqrt{5}\,\|x-x'\|}{\ell} + \frac{5\,\|x-x'\|^2}{3\ell^2}\right)\exp\!\left(-\frac{\sqrt{5}\,\|x-x'\|}{\ell}\right)

The Matérn kernel has a smoothness parameter ν\nu. Common choices:

  • ν=1/2\nu = 1/2: continuous but not differentiable (rough functions)
  • ν=3/2\nu = 3/2: once differentiable
  • ν=5/2\nu = 5/2: twice differentiable (most common - smooth enough for most physical processes but allows some roughness)
  • ν\nu \to \infty: converges to RBF

Why Matérn over RBF? Real-world functions (temperature, stock prices, drug efficacy) are rarely infinitely smooth. Matérn 5/2 is a better default - it is smooth but allows the function to be more expressive at short length scales.

Periodic Kernel

kper(x,x)=σ2exp ⁣(2sin2 ⁣(πxx/p)2)k_{\text{per}}(x, x') = \sigma^2 \exp\!\left(-\frac{2\sin^2\!\left(\pi|x - x'|/p\right)}{\ell^2}\right)

Parameters: period pp, length scale \ell, output scale σ2\sigma^2.

Use for functions with known periodicity: seasonal demand, circadian rhythms, recurring patterns in time series.

Linear Kernel (Recovering Linear Regression)

klin(x,x)=σb2+σv2(xc)(xc)k_{\text{lin}}(x, x') = \sigma_b^2 + \sigma_v^2\,(x - c)(x' - c)

A GP with a linear kernel is exactly Bayesian linear regression. Kernels can be composed: k1+k2k_1 + k_2 (additive structure), k1×k2k_1 \times k_2 (interacting effects).

KernelFunction ShapeUse Case
RBFInfinitely smoothPhysical simulations, optical systems
Matérn 5/2Twice differentiableDefault for most regression tasks
Matérn 3/2Once differentiableRougher processes, terrain elevation
PeriodicOscillatorySeasonality, circadian data
LinearLinear trendsWhen linearity is a plausible assumption
Rational QuadraticMulti-scaleMixtures of length scales

GP Posterior: The Core Derivation

We have training data D={(xi,yi)}i=1n\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n with yi=f(xi)+εiy_i = f(x_i) + \varepsilon_i, εiN(0,σn2)\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2).

Let K=K(X,X)K = K(X, X) denote the n×nn \times n kernel matrix with Kij=k(xi,xj)K_{ij} = k(x_i, x_j). Let K=K(x,X)K_* = K(x_*, X) be the 1×n1 \times n vector of kernel evaluations between the test point and training points. Let K=k(x,x)K_{**} = k(x_*, x_*) be the prior variance at the test point.

The joint distribution of training outputs and test output under the GP prior is:

[yf]N ⁣(0,  [K+σn2IKKK])\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K + \sigma_n^2 I & K_*^\top \\ K_* & K_{**} \end{bmatrix}\right)

Conditioning on training data yy using the Gaussian conditioning formula gives the posterior predictive:

μ(x)=K[K+σn2I]1y\boxed{\mu_*(x_*) = K_* \left[K + \sigma_n^2 I\right]^{-1} y}

σ2(x)=KK[K+σn2I]1K\boxed{\sigma_*^2(x_*) = K_{**} - K_* \left[K + \sigma_n^2 I\right]^{-1} K_*^\top}

Interpretation:

  • μ(x)\mu_*(x_*): posterior mean prediction - a weighted sum of training targets, where weights reflect similarity (via the kernel) to the test point
  • σ2(x)\sigma_*^2(x_*): posterior variance - prior variance KK_{**} minus the variance explained by training data

The posterior variance has a beautiful structure: it is the prior variance minus the "information gained" from training data. Near training points, this term is large and variance collapses. Far from training points, this term is small and variance approaches the prior.


Hyperparameter Learning via Marginal Likelihood

Kernel hyperparameters (\ell, σ2\sigma^2, σn2\sigma_n^2) are learned by maximising the log marginal likelihood:

logp(yX,,σ2,σn2)=12yKy1y12logdetKyn2log2π\log p(y|X, \ell, \sigma^2, \sigma_n^2) = -\frac{1}{2}y^\top K_y^{-1} y - \frac{1}{2}\log\det K_y - \frac{n}{2}\log 2\pi

where Ky=K+σn2IK_y = K + \sigma_n^2 I.

This objective balances data fit (the yKy1yy^\top K_y^{-1} y term, penalised when predictions are far from observations) and model complexity (the logdetKy\log \det K_y term, penalised for complex models that explain data by memorising it).

Maximising the marginal likelihood automatically implements Occam's razor - it prefers the simplest model that explains the data. Gradients of the log marginal likelihood with respect to hyperparameters are:

θjlogp(yX,θ)=12tr ⁣[(ααKy1)Kyθj]\frac{\partial}{\partial \theta_j}\log p(y|X, \theta) = \frac{1}{2}\text{tr}\!\left[\left(\alpha\alpha^\top - K_y^{-1}\right)\frac{\partial K_y}{\partial \theta_j}\right]

where α=Ky1y\alpha = K_y^{-1} y.


Computational Bottleneck: O(n³) Scaling

The biggest practical limitation of exact GPs is the need to compute Ky1K_y^{-1} (or equivalently, solve Kyα=yK_y \alpha = y). This requires:

  • Time: O(n3)O(n^3) for Cholesky decomposition of KyK_y
  • Memory: O(n2)O(n^2) to store KyK_y

For n=1,000n = 1{,}000: instant. For n=10,000n = 10{,}000: a few seconds. For n=100,000n = 100{,}000: hours and gigabytes. For n=1,000,000n = 1{,}000{,}000: impossible with exact GPs.

This scaling limits exact GPs to datasets of roughly n10,000n \leq 10{,}000. For larger datasets, sparse approximations are needed.


Sparse GPs: Inducing Points

The key idea of sparse GPs is to summarise the training data with mnm \ll n inducing points Z=[z1,,zm]Z = [z_1, \ldots, z_m]. These are pseudo-inputs placed in the input space - typically learned jointly with kernel hyperparameters.

The sparse approximation replaces Ky1K_y^{-1} with a low-rank plus diagonal approximation:

KyQnn+diag(KnnQnn)K_y \approx Q_{nn} + \text{diag}(K_{nn} - Q_{nn})

where Qnn=KnmKmm1KmnQ_{nn} = K_{nm} K_{mm}^{-1} K_{mn} and KnmK_{nm} is the n×mn \times m cross-kernel matrix.

The computational complexity drops from O(n3)O(n^3) to O(nm2)O(nm^2) - with m=500m = 500 and n=100,000n = 100{,}000, this is roughly an 8,000×8{,}000\times speedup.

The posterior predictive becomes:

μ(x)=Km[Kmm+σn2KmnKnm]1σn2Kmny\mu_*(x_*) = K_{*m}\left[K_{mm} + \sigma_n^{-2}K_{mn}K_{nm}\right]^{-1}\sigma_n^{-2}K_{mn}y

σ2(x)=kKmKmm1Km+KmΣKm\sigma_*^2(x_*) = k_{**} - K_{*m}K_{mm}^{-1}K_{m*} + K_{*m}\Sigma K_{m*}

where Σ=[Kmm+σn2KmnKnm]1\Sigma = \left[K_{mm} + \sigma_n^{-2}K_{mn}K_{nm}\right]^{-1}.


From Scratch: GP Regression with NumPy

import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import minimize

# ─── Kernel functions ────────────────────────────────────────────────────────

def rbf_kernel(X1, X2, length_scale=1.0, output_scale=1.0):
"""RBF (squared exponential) kernel. X1: (n1, d), X2: (n2, d)."""
X1 = np.atleast_2d(X1)
X2 = np.atleast_2d(X2)
diff = X1[:, None, :] - X2[None, :, :] # (n1, n2, d)
sq_dist = np.sum(diff ** 2, axis=-1) # (n1, n2)
return output_scale ** 2 * np.exp(-sq_dist / (2 * length_scale ** 2))

def matern52_kernel(X1, X2, length_scale=1.0, output_scale=1.0):
"""Matern 5/2 kernel - twice differentiable."""
X1 = np.atleast_2d(X1)
X2 = np.atleast_2d(X2)
diff = X1[:, None, :] - X2[None, :, :]
dist = np.sqrt(np.sum(diff ** 2, axis=-1) + 1e-12)
r = np.sqrt(5) * dist / length_scale
return output_scale ** 2 * (1 + r + r ** 2 / 3) * np.exp(-r)


# ─── GP Regression class ─────────────────────────────────────────────────────

class GaussianProcessRegressor:
"""
Exact GP regression from scratch using Cholesky decomposition.
Numerically stable O(n^3) fit, O(n^2) prediction.
"""

def __init__(self, kernel='matern52', length_scale=1.0,
output_scale=1.0, noise_var=0.01):
self.kernel_name = kernel
self.length_scale = length_scale
self.output_scale = output_scale
self.noise_var = noise_var
self._fitted = False

def _kernel(self, X1, X2):
if self.kernel_name == 'rbf':
return rbf_kernel(X1, X2, self.length_scale, self.output_scale)
return matern52_kernel(X1, X2, self.length_scale, self.output_scale)

def fit(self, X_train, y_train):
"""Cache Cholesky factor for O(n^2) prediction."""
self.X_train = np.atleast_2d(X_train)
self.y_train = y_train.ravel()
n = len(y_train)

K = self._kernel(self.X_train, self.X_train)
K_y = K + self.noise_var * np.eye(n)

# Cholesky: K_y = L @ L.T - numerically stable
try:
self.L_ = cholesky(K_y, lower=True)
except np.linalg.LinAlgError:
K_y += 1e-6 * np.eye(n)
self.L_ = cholesky(K_y, lower=True)

# alpha = K_y^{-1} y via forward + backward substitution
v = solve_triangular(self.L_, self.y_train, lower=True)
self.alpha_ = solve_triangular(self.L_.T, v, lower=False)
self._fitted = True
return self

def predict(self, X_test, return_std=True):
"""Posterior predictive mean and standard deviation."""
assert self._fitted, "Call fit() first."
X_test = np.atleast_2d(X_test)

K_star = self._kernel(X_test, self.X_train) # (n_test, n_train)
mu = K_star @ self.alpha_ # posterior mean

if not return_std:
return mu

# Posterior variance: K** - K_star @ K_y^{-1} @ K_star.T
K_star_star = self._kernel(X_test, X_test)
v = solve_triangular(self.L_, K_star.T, lower=True) # (n_train, n_test)
var = np.diag(K_star_star) - np.sum(v ** 2, axis=0)
var = np.maximum(var, 0.0) # clip tiny negatives from numerics
return mu, np.sqrt(var)

def log_marginal_likelihood(self):
"""
log p(y | X, hyperparams)
= -0.5 y.T @ K_y^{-1} @ y - 0.5 log det(K_y) - n/2 log(2pi)
"""
assert self._fitted
n = len(self.y_train)
log_det = 2 * np.sum(np.log(np.diag(self.L_))) # 2*sum(log diag(L))
data_fit = -0.5 * self.y_train @ self.alpha_
complexity = -0.5 * log_det
constant = -n / 2 * np.log(2 * np.pi)
return data_fit + complexity + constant


# ─── Hyperparameter optimisation ─────────────────────────────────────────────

def optimise_hyperparameters(X_train, y_train, kernel='matern52', n_restarts=5):
"""
Find length_scale, output_scale, noise_var by maximising log marginal likelihood.
Uses multiple random restarts to escape local optima.
"""
best_lml = -np.inf
best_params = None

for _ in range(n_restarts):
log_params0 = np.random.uniform(-2, 2, size=3) # random start in log space

def neg_lml(log_params):
ls, os, nv = np.exp(log_params)
gp = GaussianProcessRegressor(kernel=kernel, length_scale=ls,
output_scale=os, noise_var=nv)
gp.fit(X_train, y_train)
return -gp.log_marginal_likelihood()

result = minimize(neg_lml, log_params0, method='L-BFGS-B',
options={'maxiter': 200})

if -result.fun > best_lml:
best_lml = -result.fun
best_params = np.exp(result.x)

ls, os, nv = best_params
print(f"Best hyperparams: length_scale={ls:.4f}, "
f"output_scale={os:.4f}, noise_var={nv:.6f}")
print(f"Log marginal likelihood: {best_lml:.4f}")

return GaussianProcessRegressor(kernel=kernel, length_scale=ls,
output_scale=os, noise_var=nv).fit(X_train, y_train)


# ─── Demo: 1D regression with uncertainty quantification ─────────────────────

np.random.seed(42)

def true_fn(x):
return np.sin(3 * x) + 0.5 * np.sin(7 * x)

X_train = np.sort(np.random.choice(np.linspace(-3, 3, 200), size=25, replace=False))
y_train = true_fn(X_train) + np.random.randn(len(X_train)) * 0.15
X_test = np.linspace(-3.5, 3.5, 300)

# Manual hyperparameters (production: optimise via marginal likelihood above)
gp = GaussianProcessRegressor(kernel='matern52', length_scale=0.5,
output_scale=1.0, noise_var=0.02)
gp.fit(X_train[:, None], y_train)
mu, sigma = gp.predict(X_test[:, None])

print("GP Regression Results:")
print(f" Training points: {len(X_train)}")
rmse = np.sqrt(np.mean((mu - true_fn(X_test)) ** 2))
print(f" Prediction RMSE: {rmse:.4f}")
print(f" Avg predictive std: {sigma.mean():.4f}")
print(f" Max uncertainty (extrapolation): {sigma.max():.4f}")

# Uncertainty should be higher far from training data
near_train_mask = np.any(np.abs(X_test[:, None] - X_train[None, :]) < 0.3, axis=1)
print(f"\n Avg std near training data: {sigma[near_train_mask].mean():.4f}")
print(f" Avg std far from data: {sigma[~near_train_mask].mean():.4f}")

GPyTorch: Production GP Regression

For real applications, use GPyTorch - it supports GPU acceleration, automatic differentiation for hyperparameter optimisation, and efficient Cholesky-based inference.

import torch
import gpytorch

# ─── Exact GP Model ──────────────────────────────────────────────────────────

class ExactGPModel(gpytorch.models.ExactGP):
"""Standard exact GP with Matern 5/2 kernel. For n up to ~10,000."""

def __init__(self, train_x, train_y, likelihood):
super().__init__(train_x, train_y, likelihood)
self.mean_module = gpytorch.means.ZeroMean()
# ScaleKernel wraps any kernel to learn output scale separately
self.covar_module = gpytorch.kernels.ScaleKernel(
gpytorch.kernels.MaternKernel(nu=2.5)
)

def forward(self, x):
mean_x = self.mean_module(x)
covar_x = self.covar_module(x)
return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


def train_exact_gp(X_train, y_train, n_epochs=200, lr=0.1):
"""Train by maximising log marginal likelihood using Adam."""
train_x = torch.FloatTensor(X_train)
train_y = torch.FloatTensor(y_train)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

model.train()
likelihood.train()

optimizer = torch.optim.Adam(model.parameters(), lr=lr)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for epoch in range(n_epochs):
optimizer.zero_grad()
output = model(train_x)
loss = -mll(output, train_y)
loss.backward()
optimizer.step()

if (epoch + 1) % 50 == 0:
ls = model.covar_module.base_kernel.lengthscale.item()
noise = likelihood.noise.item()
print(f"Epoch {epoch+1}: LML={-loss.item():.3f}, "
f"length_scale={ls:.3f}, noise={noise:.4f}")

return model, likelihood


def predict(model, likelihood, X_test):
"""Posterior predictive: mean + 95% credible interval."""
test_x = torch.FloatTensor(X_test)
model.eval()
likelihood.eval()

with torch.no_grad(), gpytorch.settings.fast_pred_var():
pred = likelihood(model(test_x))
mean = pred.mean.numpy()
lower, upper = pred.confidence_region()
return mean, lower.numpy(), upper.numpy()


# ─── Sparse GP with Variational Inference (for large datasets) ───────────────

class SparseGPModel(gpytorch.models.ApproximateGP):
"""
Sparse Variational GP with learned inducing points.
Scales as O(nm^2) - viable for n up to ~1,000,000.
"""

def __init__(self, inducing_points):
variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
inducing_points.size(0)
)
variational_strategy = gpytorch.variational.VariationalStrategy(
self, inducing_points, variational_distribution,
learn_inducing_locations=True
)
super().__init__(variational_strategy)
self.mean_module = gpytorch.means.ConstantMean()
self.covar_module = gpytorch.kernels.ScaleKernel(
gpytorch.kernels.RBFKernel()
)

def forward(self, x):
mean_x = self.mean_module(x)
covar_x = self.covar_module(x)
return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


def train_sparse_gp(X_train, y_train, n_inducing=100, n_epochs=500, lr=0.01):
"""Train sparse GP by maximising variational ELBO."""
train_x = torch.FloatTensor(X_train)
train_y = torch.FloatTensor(y_train)

# Initialise inducing points at random subset of training data
idx = torch.randperm(len(train_x))[:n_inducing]
inducing_points = train_x[idx].clone()

model = SparseGPModel(inducing_points)
likelihood = gpytorch.likelihoods.GaussianLikelihood()

model.train()
likelihood.train()

optimizer = torch.optim.Adam(
list(model.parameters()) + list(likelihood.parameters()), lr=lr
)
# ELBO = -KL[q(u)||p(u)] + E_q[log p(y|f)]
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=len(train_y))

for epoch in range(n_epochs):
optimizer.zero_grad()
output = model(train_x)
loss = -mll(output, train_y)
loss.backward()
optimizer.step()

print(f"Sparse GP trained: {n_inducing} inducing points, {len(X_train)} observations")
return model, likelihood

Kernel Engineering: Composing Kernels for Real Problems

Complex phenomena can be decomposed into simpler components using kernel addition and multiplication.

Additive structure - independent components:

k(x,x)=ktrend(x,x)+kseasonal(x,x)+knoise(x,x)k(x, x') = k_{\text{trend}}(x, x') + k_{\text{seasonal}}(x, x') + k_{\text{noise}}(x, x')

Example: Time series with trend + seasonality + short-term variation:

import gpytorch

class CompositeGPModel(gpytorch.models.ExactGP):
"""
Composite kernel: long-term trend + seasonal component + irregular noise.
Inspired by the Automatic Statistician (Duvenaud et al., 2013).
"""

def __init__(self, train_x, train_y, likelihood):
super().__init__(train_x, train_y, likelihood)
self.mean_module = gpytorch.means.ConstantMean()

# Long-term trend: linear kernel
trend = gpytorch.kernels.LinearKernel()

# Seasonal: periodic kernel with RBF modulation
# PeriodicKernel × RBFKernel = periodicity that slowly decays
seasonal = (
gpytorch.kernels.PeriodicKernel() *
gpytorch.kernels.RBFKernel()
)

# Short-term irregular variation: Matern 3/2
irregular = gpytorch.kernels.MaternKernel(nu=1.5)

# Combine: sum of independent components
self.covar_module = (
gpytorch.kernels.ScaleKernel(trend) +
gpytorch.kernels.ScaleKernel(seasonal) +
gpytorch.kernels.ScaleKernel(irregular)
)

def forward(self, x):
return gpytorch.distributions.MultivariateNormal(
self.mean_module(x), self.covar_module(x)
)

The additive structure is interpretable - you can inspect each component's contribution separately and understand what the model has learned about trend, seasonality, and noise.


GPs vs Neural Networks: When to Use Each

This is the most common interview question about GPs. The answer requires nuance.

DimensionGPNeural Network
Data requirementsExcellent with n<10,000n < 10{,}000Needs n>10,000n > 10{,}000 typically
ScalingO(n3)O(n^3) - bottleneckO(n)O(n) mini-batch
UncertaintyExact posterior predictiveRequires approximation
CalibrationWell-calibrated by constructionOften overconfident
Input typeStructured, tabular, fixed-dimImages, text, arbitrary
Hyperparameters2–5 per kernel - marginal LMLMillions - need validation set
InterpretabilityKernel choice is interpretableBlack box
GPU accelerationLimited (GPyTorch helps)Native

Deep Kernel Learning (Wilson et al., 2016) bridges the gap: use a neural network as a feature extractor, then place a GP on the learned feature space. This gives GP uncertainty estimates on top of neural network representations - gaining the scalability of neural features with GP calibration.


Uncertainty Decomposition: What GP Posterior Variance Tells You

The posterior variance σ2(x)\sigma_*^2(x_*) decomposes as:

σ2(x)=k(x,x)prior varianceKKy1Kreduction from data\sigma_*^2(x_*) = \underbrace{k(x_*, x_*)}_{\text{prior variance}} - \underbrace{K_* K_y^{-1} K_*^\top}_{\text{reduction from data}}

Key behaviours:

  • At a training point: variance collapses to σn2\sigma_n^2 (noise level) - the model is certain modulo irreducible noise
  • Between training points: variance depends on how densely training points cover the region
  • Far from training data: variance returns to the prior - the model says "I know nothing here"
  • Extrapolation region: variance is prior-level - this is the honest admission that the model should not be trusted
import numpy as np

def demonstrate_uncertainty_structure(gp, X_train, X_test):
"""Show how GP uncertainty grows away from training data."""
mu, sigma = gp.predict(X_test[:, None])

# Find min distance from each test point to nearest training point
dists = np.min(
np.abs(X_test[:, None] - X_train[None, :]), axis=1
)

# Bin test points by distance to nearest training point
bins = [0, 0.2, 0.5, 1.0, 2.0, np.inf]
labels = ['Very near', 'Near', 'Medium', 'Far', 'Very far']

print(f"\n{'Distance to nearest training point':>35} | {'Avg Uncertainty':>16}")
print("-" * 55)
for i in range(len(bins) - 1):
mask = (dists >= bins[i]) & (dists < bins[i + 1])
if mask.sum() > 0:
avg_std = sigma[mask].mean()
print(f"{labels[i]:>35} [{bins[i]:.1f}, {bins[i+1]:.1f}) | {avg_std:>16.4f}")

Production Engineering Notes

:::tip Sparse GPs for large-scale applications For n>10,000n > 10{,}000, use sparse GPs with m=100m = 1001,0001{,}000 inducing points. GPyTorch's VariationalStrategy handles this efficiently. The number of inducing points mm is a bias-variance trade-off: more inducing points gives better approximation but slower training. Start with m=min(n,500)m = \min(n, 500). :::

:::warning Kernel matrix conditioning GP computations require solving Kyα=yK_y \alpha = y. When KyK_y is nearly singular (duplicate training points, very small noise variance), the Cholesky decomposition becomes numerically unstable. Always add a small jitter term (10610^{-6} to 10410^{-4}) to the diagonal. GPyTorch handles this automatically via gpytorch.settings.cholesky_jitter. :::

:::danger GPs revert to prior in extrapolation regions Far from training data, the GP posterior predictive reverts to the prior - zero mean, prior variance. This is correct behaviour. But it means GP predictions in extrapolation regions are prior-dominated, not data-driven. Always visualise uncertainty and communicate that extrapolation predictions are uncertain. A model confidently predicting in a region it has never seen is miscalibrated; a GP correctly becomes uncertain there. :::

:::note Multi-output GPs for correlated targets When predicting multiple related outputs (temperature and humidity, multiple asset prices, protein expression levels), use multi-output GPs with cross-covariance kernels. GPyTorch's MultitaskKernel or LCMKernel handles this. If output correlations are known, encode them directly; if unknown, let the model learn them from data. :::


YouTube Resources

  • Gaussian Processes for Machine Learning - Richard Turner (Cambridge ML lectures): the clearest derivation of the GP posterior from the multivariate Gaussian conditioning formula
  • Gaussian Processes - Neil Lawrence (Sheffield ML lectures): intuition-first approach from a pioneer of GP methods
  • A Visual Exploration of Gaussian Processes - Jochen Görtler (distill.pub): highly recommended for building kernel intuition interactively
  • GPyTorch Tutorial - official GPyTorch team at ICML 2019: full practical walkthrough with sparse GPs and Deep Kernel Learning

Interview Q&A

Q1: What is a Gaussian process and how is it different from a neural network?

A Gaussian process is a prior distribution directly over functions. Rather than parameterising a function family with a fixed number of weights, a GP specifies that any finite collection of function values jointly follows a multivariate Gaussian distribution. The kernel function encodes beliefs about the function's properties - smoothness, periodicity, scale. After observing training data, the GP gives an exact closed-form posterior distribution over functions, including calibrated uncertainty estimates at every test point. A neural network is parametric - fixed capacity, point estimates, no principled uncertainty. GPs are non-parametric - complexity grows with data, exact posterior uncertainty. GPs win on small datasets and when uncertainty quantification is critical. Neural networks win on large datasets, raw high-dimensional inputs like images and text, and when low inference latency matters.

Q2: How does the kernel function affect GP predictions?

The kernel completely determines the GP's prior beliefs about functions. The RBF kernel produces infinitely smooth functions - any function sampled from this prior is continuous and differentiable everywhere. Matérn 5/2 produces twice-differentiable functions - more expressive, allowing mild roughness. The periodic kernel produces functions with a specific period. The length scale controls how quickly the function varies - a short length scale allows rapid changes; a long length scale enforces slow variation. Choosing the wrong kernel leads to systematically wrong predictions. In practice, use Matérn 5/2 as a default for physical processes, add periodic kernels when seasonality is known, and learn hyperparameters by maximising the log marginal likelihood rather than by cross-validation.

Q3: What is the computational bottleneck of exact GPs, and how do sparse GPs address it?

Exact GP inference requires computing Ky1K_y^{-1} via Cholesky decomposition, costing O(n3)O(n^3) time and O(n2)O(n^2) memory. For n=10,000n = 10{,}000 this is feasible. For n=100,000n = 100{,}000 it takes hours and tens of gigabytes. Sparse GPs introduce mnm \ll n inducing points - pseudo-observations that summarise the training data. The kernel matrix is approximated as low-rank plus diagonal, reducing cost to O(nm2)O(nm^2). With m=500m = 500 inducing points on n=100,000n = 100{,}000 observations, the speedup is roughly 8,000×8{,}000\times. The trade-off is approximation error - the sparse GP is biased relative to the exact GP, but with enough inducing points the approximation is excellent in practice.

Q4: How do you learn kernel hyperparameters in a GP?

Kernel hyperparameters (length scale, output scale, noise variance) are learned by maximising the log marginal likelihood: logp(yX)=12yKy1y12logdetKyn2log2π\log p(y|X) = -\frac{1}{2}y^\top K_y^{-1}y - \frac{1}{2}\log\det K_y - \frac{n}{2}\log 2\pi. This objective automatically balances data fit and model complexity - penalising both underfitting and overfitting without a held-out validation set. Gradients are computed analytically using the trace formula and optimised with L-BFGS-B or Adam. Multiple random restarts are recommended since the objective is non-convex. This is a key advantage over neural networks, where all hyperparameters must be tuned via cross-validation on a separate validation set.

Q5: How do GPs connect to Bayesian optimisation?

Bayesian optimisation uses a GP as a surrogate model for an expensive black-box objective function. At each iteration: fit a GP to all observed function evaluations; use the GP posterior mean and variance to compute an acquisition function (e.g., Expected Improvement); select the next evaluation point where the acquisition function is maximised; run the expensive evaluation and add it to observations. The GP posterior uncertainty plays a central role: regions with high uncertainty and reasonable predicted value are exploration candidates; regions with high predicted value and low uncertainty are exploitation candidates. The acquisition function balances these two. Without calibrated GP uncertainty, you cannot run principled Bayesian optimisation - which is why GPs are the default surrogate model for hyperparameter search and experimental design.


Common Mistakes

:::danger Too few inducing points in sparse GP Using m=10m = 10 inducing points when your function has complex structure forces the model to represent all variation through 10 points - the result is a severely underfit model with inappropriately small uncertainty. As a rule of thumb, use at least m=min(n,100)m = \min(n, 100) inducing points and increase until the ELBO plateaus. Plot the inducing point locations to verify they cover the input space well. :::

:::warning Not normalising inputs and outputs GPs are sensitive to input scale. If one feature spans 0–1 and another spans 0–10,000, the RBF kernel will be dominated by the large-scale feature. Always standardise inputs to zero mean and unit variance. Similarly, standardise outputs - the GP prior assumes zero-mean functions, and a large offset or very large variance will require very large output scale hyperparameters that create numerical instability. :::

:::warning Interpreting GP uncertainty as the only source of uncertainty GP posterior variance accounts for model uncertainty given the chosen kernel and optimised hyperparameters. It does not account for kernel misspecification (wrong structural assumptions about the function) or hyperparameter uncertainty (the learned hyperparameters themselves have uncertainty). For critical decisions, run sensitivity analyses with multiple kernels and report how conclusions depend on kernel choice. :::


Practice Problems

  1. Derive the GP posterior predictive mean and variance from first principles using the multivariate Gaussian conditioning formula. If [y,f]N(0,Σ)[y, f_*]^\top \sim \mathcal{N}(0, \Sigma) with block covariance structure, write the closed-form conditional distribution p(fy)p(f_*|y) and identify each term.

  2. Implement the periodic kernel from scratch and fit a GP to a synthetic sinusoidal time series. Compare the predictive uncertainty of the periodic kernel vs RBF kernel when extrapolating 2 full periods beyond the training window. Which kernel is better calibrated?

  3. The log marginal likelihood has two terms: data fit and complexity penalty. Generate 20 points from a noisy sinusoid. Fit GPs with length scales from 0.01 to 10. Plot both terms and the total LML. Identify the length scale that maximises LML and verify it visually matches a good fit.

  4. A GP with RBF kernel is trained on n=1,000n = 1{,}000 points in 20-dimensional input space. Estimate the time to compute the Cholesky factorisation and the kernel matrix memory footprint. If each observation costs $1,000 to collect, what is the break-even point for switching to a sparse GP with m=50m = 50 inducing points?

  5. Deep Kernel Learning uses a neural network ϕ(x)\phi(x) as a feature extractor, then places a GP on ϕ(x)\phi(x). What are the advantages over a pure GP? What are the disadvantages? When would you prefer Deep Kernel Learning over either a pure GP or a pure neural network?

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Gaussian Process demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.