Gaussian Processes

The Motivation: Uncertainty Over Functions

You're tuning a deep learning model. You need to find the optimal learning rate, batch size, and dropout rate. Training each configuration takes 4 hours on 8 GPUs. You can afford maybe 30 evaluations.

Random search and grid search don't care about what you've already learned. If configurations near learning rate 0.001 all performed well, you should search there more carefully - but random search doesn't use that information.

Bayesian optimization uses a Gaussian process to maintain a probabilistic model of the objective function surface, balancing exploration (trying configurations you're uncertain about) and exploitation (trying configurations that look good). It's the most sample-efficient approach to black-box function optimization, routinely finding hyperparameters 10-50x faster than grid search.

Gaussian processes are the mathematical foundation. A GP places a probability distribution over functions - not over parameters, but over entire functions. This lets you reason about uncertainty in your model of the objective surface.

What Is a Gaussian Process?

Definition: A Gaussian process is a collection of random variables, any finite subset of which has a joint Gaussian distribution.

$f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$

where:

$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$ is the mean function (usually $m(\mathbf{x}) = 0$ )
$k(\mathbf{x}, \mathbf{x}') = \text{Cov}[f(\mathbf{x}), f(\mathbf{x}')]$ is the kernel function

For any finite set of inputs $\mathbf{X} = \{x_1, \ldots, x_n\}$ , the corresponding function values have a joint Gaussian distribution:

$\mathbf{f} = [f(x_1), \ldots, f(x_n)]^\top \sim \mathcal{N}(\mathbf{m}, \mathbf{K})$

where $\mathbf{K}_{ij} = k(x_i, x_j)$ is the Gram matrix (kernel matrix).

Key intuition: The kernel function encodes our prior beliefs about the function's smoothness. Points close together in input space should have similar function values. The kernel quantifies this similarity.

Kernel Functions: Encoding Prior Beliefs

The choice of kernel is everything in GP - it determines the qualitative behavior of the function prior.

RBF / Squared Exponential Kernel

$k_{SE}(x, x') = \sigma^2 \exp\left(-\frac{(x-x')^2}{2\ell^2}\right)$

$\sigma^2$ : output scale (variance)
$\ell$ : length scale (how quickly correlations decay)
Properties: Infinitely differentiable, produces extremely smooth functions
When to use: Believed to be smooth and slowly varying

Matérn Kernel

$k_{\nu}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}|x-x'|}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}|x-x'|}{\ell}\right)$

$\nu = 1/2$ : Exponential kernel (Ornstein-Uhlenbeck process) - continuous but not differentiable
$\nu = 3/2$ : Once-differentiable functions
$\nu = 5/2$ : Twice-differentiable functions
$\nu \to \infty$ : RBF kernel (infinitely differentiable)
When to use: When you don't want to assume infinite smoothness (Matérn 5/2 is a popular default)

Periodic Kernel

$k_{per}(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi|x-x'|/p)}{\ell^2}\right)$

$p$ : period length
When to use: Time series with known periodicity (daily, weekly, annual patterns)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, ExpSineSquared, ConstantKernel

def plot_kernel_samples(kernel, kernel_name, ax, n_samples=5):
    """Draw samples from a GP with the given kernel to visualize its behavior."""
    X = np.linspace(0, 5, 200).reshape(-1, 1)
    gp = GaussianProcessRegressor(kernel=kernel)

    # Draw samples from the prior
    y_samples = gp.sample_y(X, n_samples=n_samples, random_state=42)

    for s in y_samples.T:
        ax.plot(X, s, alpha=0.6, linewidth=1.5)
    ax.set_title(f'{kernel_name}\n{kernel}')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

kernels = [
    (1.0 * RBF(length_scale=1.0), 'RBF Kernel\n(smooth functions)'),
    (1.0 * Matern(length_scale=1.0, nu=1.5), 'Matérn 3/2\n(once differentiable)'),
    (1.0 * ExpSineSquared(length_scale=1.0, periodicity=2.0), 'Periodic Kernel\n(p=2)')
]

for ax, (kernel, name) in zip(axes, kernels):
    plot_kernel_samples(kernel, name, ax)

plt.suptitle('GP Prior Samples Under Different Kernels', fontsize=13)
plt.tight_layout()
plt.savefig('gp_kernel_comparison.png', dpi=150)
print("Kernel comparison plot saved")

GP Regression: Posterior Derivation

Given training data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ with noisy observations $y_i = f(\mathbf{x}_i) + \epsilon_i$ , $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$ , we want the posterior predictive distribution at new points $\mathbf{X}_*$ .

Prior: $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}_{XX})$

Likelihood: $\mathbf{y} \mid \mathbf{f} \sim \mathcal{N}(\mathbf{f}, \sigma_n^2 \mathbf{I})$

Joint prior over training and test outputs:

$\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{pmatrix} \mathbf{K}_{XX} + \sigma_n^2 \mathbf{I} & \mathbf{K}_{X*} \\ \mathbf{K}_{*X} & \mathbf{K}_{**} \end{pmatrix}\right)$

Posterior predictive (by conditioning multivariate Gaussian):

$\mathbf{f}_* \mid \mathbf{X}, \mathbf{y}, \mathbf{X}_* \sim \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$

$\boxed{\boldsymbol{\mu}_* = \mathbf{K}_{*X}(\mathbf{K}_{XX} + \sigma_n^2 \mathbf{I})^{-1}\mathbf{y}}$

$\boxed{\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_{*X}(\mathbf{K}_{XX} + \sigma_n^2 \mathbf{I})^{-1}\mathbf{K}_{X*}}$

where $\mathbf{K}_{XX} = k(\mathbf{X}, \mathbf{X})$ , $\mathbf{K}_{X*} = k(\mathbf{X}, \mathbf{X}_*)$ , $\mathbf{K}_{**} = k(\mathbf{X}_*, \mathbf{X}_*)$ .

Key properties:

The posterior mean $\boldsymbol{\mu}_*$ is a linear combination of training outputs
The posterior variance $\boldsymbol{\Sigma}_*$ decreases near training points (where we have data) and increases far from them (where we're uncertain)
Complexity: $O(n^3)$ for matrix inversion - the main scalability bottleneck

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel
import matplotlib.pyplot as plt

# Generate training data from a noisy function
np.random.seed(42)
def true_function(x):
    return np.sin(2*x) + 0.5*np.sin(5*x)

# Training data: sparse, noisy observations
n_train = 15
X_train = np.random.uniform(0, 4, n_train).reshape(-1, 1)
y_train = true_function(X_train.ravel()) + np.random.normal(0, 0.3, n_train)

# Test points
X_test = np.linspace(-0.5, 4.5, 300).reshape(-1, 1)

# GP with RBF kernel + noise term
kernel = 1.0 * RBF(length_scale=0.5, length_scale_bounds=(0.1, 10)) + \
         WhiteKernel(noise_level=0.1, noise_level_bounds=(1e-3, 1.0))

gp = GaussianProcessRegressor(
    kernel=kernel,
    n_restarts_optimizer=10,  # avoid local optima in hyperparameter optimization
    normalize_y=True
)
gp.fit(X_train, y_train)

# Posterior predictions
y_pred, y_std = gp.predict(X_test, return_std=True)

print(f"Learned kernel: {gp.kernel_}")
print(f"Log marginal likelihood: {gp.log_marginal_likelihood_value_:.4f}")

# Plot
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(X_train, y_train, c='red', zorder=5, label='Training data', s=50)
ax.plot(X_test, true_function(X_test.ravel()), 'g--', linewidth=2, label='True function')
ax.plot(X_test, y_pred, 'b-', linewidth=2, label='GP posterior mean')
ax.fill_between(X_test.ravel(),
                y_pred - 2*y_std,
                y_pred + 2*y_std,
                alpha=0.3, color='blue', label='95% credible band')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('GP Regression: Posterior Mean and Uncertainty')
ax.legend()
plt.tight_layout()
plt.savefig('gp_regression.png', dpi=150)

Hyperparameter Optimization: Learning the Kernel Parameters

A GP has hyperparameters $\psi = (\sigma^2, \ell, \sigma_n^2)$ (output scale, length scale, noise level). These are optimized by maximizing the log marginal likelihood:

$\log p(\mathbf{y} \mid \mathbf{X}, \psi) = -\frac{1}{2}\mathbf{y}^\top (\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K} + \sigma_n^2\mathbf{I}| - \frac{n}{2}\log 2\pi$

Term	Role
$-\frac{1}{2}\mathbf{y}^\top(\mathbf{K}+\sigma_n^2\mathbf{I})^{-1}\mathbf{y}$	Data fit: how well the GP explains the observations
$-\frac{1}{2}\log	\mathbf{K}+\sigma_n^2\mathbf{I}

This is Bayesian model selection: the marginal likelihood automatically balances fit and complexity (Occam's razor). Unlike cross-validation, it uses all data and works in closed form.

Bayesian Optimization: GPs for Hyperparameter Tuning

Bayesian optimization uses a GP as a surrogate model for the expensive objective function (e.g., validation loss after training a neural network). It chooses the next configuration to evaluate using an acquisition function.

Acquisition Functions

Expected Improvement (EI) - most popular:

$\text{EI}(x) = \mathbb{E}[\max(0, f^* - f(x))]$

where $f^*$ is the current best observed value. With Gaussian posterior:

$\text{EI}(x) = (\mu^* - \mu(x))\Phi(Z) + \sigma(x)\phi(Z), \quad Z = \frac{\mu^* - \mu(x)}{\sigma(x)}$

where $\Phi$ and $\phi$ are the standard Gaussian CDF and PDF.

Upper Confidence Bound (UCB): $\text{UCB}(x) = \mu(x) + \kappa \sigma(x)$

Higher $\kappa$ = more exploration.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
import matplotlib.pyplot as plt

class BayesianOptimizer:
    """
    Simple Bayesian optimization using GP with EI acquisition.
    Demonstrates the core algorithm used in Optuna, SMAC, BoTorch.
    """
    def __init__(self, bounds, kernel=None):
        self.bounds = bounds  # [(low1, high1), (low2, high2), ...]
        self.kernel = kernel or (1.0 * Matern(length_scale=1.0, nu=2.5))
        self.gp = GaussianProcessRegressor(
            kernel=self.kernel,
            n_restarts_optimizer=5,
            normalize_y=True,
            alpha=1e-6  # numerical stability
        )
        self.X_observed = []
        self.y_observed = []

    def suggest(self, n_candidates=1000):
        """Suggest next point to evaluate using EI acquisition."""
        if len(self.X_observed) < 2:
            # Random exploration at start
            return np.random.uniform(
                [b[0] for b in self.bounds],
                [b[1] for b in self.bounds]
            )

        # Sample candidate points
        candidates = np.random.uniform(
            [b[0] for b in self.bounds],
            [b[1] for b in self.bounds],
            size=(n_candidates, len(self.bounds))
        )

        # Compute EI for all candidates
        mu, sigma = self.gp.predict(candidates, return_std=True)
        y_best = np.min(self.y_observed)  # assuming minimization

        # EI formula
        z = (y_best - mu) / (sigma + 1e-9)
        ei = (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
        ei[sigma < 1e-10] = 0.0  # no improvement possible where variance is 0

        return candidates[np.argmax(ei)]

    def observe(self, x, y):
        """Record a new observation."""
        self.X_observed.append(x)
        self.y_observed.append(y)
        X = np.array(self.X_observed)
        y_arr = np.array(self.y_observed)
        self.gp.fit(X, y_arr)

    def best(self):
        """Return the best observed configuration."""
        idx = np.argmin(self.y_observed)
        return self.X_observed[idx], self.y_observed[idx]


# Example: Optimize a 1D function (representing "train neural net with lr=x")
def expensive_objective(lr):
    """Simulated validation loss as function of log learning rate."""
    return (np.sin(5 * lr) + 0.5 * np.sin(3 * lr) +
            0.3 * (lr - 0.3)**2 + np.random.normal(0, 0.05))

# Run Bayesian optimization
np.random.seed(42)
bounds = [(0.0, 1.0)]
optimizer = BayesianOptimizer(bounds=bounds)

print("Bayesian Optimization: Finding optimal learning rate")
print(f"{'Iteration':<12} {'LR':<12} {'Val Loss':<12}")
print("-" * 36)

# Initial random observations
for _ in range(3):
    x = np.random.uniform(0, 1, size=(1,))
    y = expensive_objective(x[0])
    optimizer.observe(x, y)

# Bayesian optimization loop
for i in range(15):
    x_next = optimizer.suggest().reshape(1, -1)
    y_next = expensive_objective(x_next[0, 0])
    optimizer.observe(x_next, y_next)
    print(f"{i+1:<12} {x_next[0,0]:<12.4f} {y_next:<12.4f}")

best_x, best_y = optimizer.best()
print(f"\nBest configuration found: lr = {best_x[0]:.4f}, loss = {best_y:.4f}")

Bayesian Optimization in Practice

Tool	Algorithm	Used By
Optuna	TPE (tree-structured Parzen estimator)	PyTorch ecosystem, very popular
SMAC	GP / Random Forest surrogate	AutoML community
BoTorch	GP with multi-point EI, GPU-accelerated	Meta AI, production scale
Hyperopt	TPE	Spark-based ML
Ray Tune	Multiple, including BOHB	Distributed HPO

Scalable GPs

Standard GP regression has $O(n^3)$ training cost. For large datasets, several approximations exist:

Method	Idea	Cost
Sparse GPs (inducing points)	Approximate posterior using $m \ll n$ inducing points	$O(nm^2)$
Random Fourier Features	Approximate kernel with random feature maps	$O(nd)$ training
Deep Kernel Learning	Use neural network features as GP inputs	$O(n)$ feature extraction + GP
Variational GPs (SVGP)	VI with inducing points + minibatch training	Scales to millions of points

# Illustration: sparse GP concept
# Instead of conditioning on all n training points, use m << n inducing points
# z_1, ..., z_m (optimized locations) as a summary of the data

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Standard GP: O(n^3) - breaks at n > 10,000
# Sparse GP pseudocode:
"""
1. Choose m inducing points Z (can be a random subset or optimized)
2. Compute u = f(Z) -- function values at inducing points
3. Approximate posterior: p(f*|y) ≈ p(f*|u) * q(u|y)
4. Optimize inducing point locations Z along with kernel hyperparameters
5. Cost: O(nm^2 + m^3) instead of O(n^3)
"""
print("Sparse GP reduces O(n^3) to O(nm^2)")
print("With n=100,000 and m=500, this is a 400x speedup")

Interview Questions

Q1: What is a Gaussian process, and how does it differ from parametric regression?

A Gaussian process is a distribution over functions: any finite set of function evaluations $[f(x_1), \ldots, f(x_n)]$ has a joint Gaussian distribution. The GP is fully specified by a mean function $m(x)$ and kernel $k(x, x')$ . Unlike parametric regression (which assumes a specific functional form with finite parameters), GPs are nonparametric - they can represent any smooth function given enough data, and their complexity grows with the data. The kernel encodes assumptions about the function (smoothness, periodicity, etc.) without committing to a specific functional form. Practically: GPs give uncertainty estimates (posterior variance) at every prediction point, not just point predictions. They're exact Bayesian inference for regression with Gaussian likelihood.

Q2: What does the kernel function encode, and how do you choose it?

The kernel $k(x, x')$ encodes the prior covariance between function values at two input points - i.e., how strongly you believe $f(x)$ and $f(x')$ are related. It encodes: (1) smoothness - RBF implies infinitely smooth functions; Matérn $\nu=3/2$ implies once-differentiable functions; (2) length scale $\ell$ - how quickly correlations decay; small $\ell$ = wiggly function, large $\ell$ = slowly varying; (3) periodicity - periodic kernel for functions known to repeat; (4) scale $\sigma^2$ - overall variance of the function. Choose based on domain knowledge: for neural network hyperparameter optimization, Matérn 5/2 is standard (smooth enough to exploit, not too smooth). For time series with trends and seasonality, combine polynomial + periodic kernels. Kernel hyperparameters are optimized by maximizing the log marginal likelihood.

Q3: Explain how Bayesian optimization works and why it's more efficient than grid search.

Bayesian optimization maintains a GP surrogate model of the objective function (e.g., validation loss as a function of hyperparameters). At each step: (1) fit the GP to all previous (hyperparameter config, validation loss) observations; (2) use the GP posterior to compute an acquisition function (e.g., Expected Improvement) that balances exploitation (try configs the GP thinks are good) and exploration (try configs where the GP is uncertain); (3) evaluate the next config chosen by maximizing the acquisition function; (4) update the GP with the new observation. This is efficient because: the GP learns which regions of hyperparameter space are promising from previous evaluations; it avoids re-testing bad regions; it quantifies uncertainty to guide exploration. Grid search evaluates configs independently, ignoring all learned information. Bayesian optimization is typically 10-50x more sample-efficient, critical when each evaluation requires hours of training.

Q4: What is the computational complexity of GP regression and how does it scale?

Training (fitting): $O(n^3)$ due to matrix inversion of the $n \times n$ kernel matrix $(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}$ . Storage: $O(n^2)$ to store the matrix. Prediction: $O(n)$ per test point for the posterior mean, $O(n^2)$ per test point for the posterior variance. This makes standard GPs infeasible for $n > 10,000$ - $50,000$ . Solutions for large $n$ : (1) Sparse GPs with $m$ inducing points reduce to $O(nm^2 + m^3)$ ; (2) Random Fourier Features (Rahimi & Recht) approximate the kernel, reducing to $O(nD)$ with $D$ random features; (3) Structure kernel interpolation (SKI) exploits grid structure; (4) Scalable GP implementations in GPyTorch use GPU parallelism and the conjugate gradient method to avoid explicit inversion.

Q5: What is the log marginal likelihood, and why is it used for kernel hyperparameter optimization?

The log marginal likelihood $\log p(\mathbf{y}|\mathbf{X}, \psi) = -\frac{1}{2}\mathbf{y}^\top(\mathbf{K}_\psi + \sigma_n^2\mathbf{I})^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K}_\psi + \sigma_n^2\mathbf{I}| - \frac{n}{2}\log 2\pi$ has three terms: data fit (quadratic term), complexity penalty (log determinant), and a constant. It's used for hyperparameter optimization because: (1) it's a proper Bayesian objective - it averages the likelihood over all functions in the GP prior, not just the MAP function; (2) it automatically implements Occam's razor - the log-determinant term penalizes overly complex kernels (large kernel matrices have large determinants), while the quadratic term rewards good data fit; (3) it can be differentiated with respect to $\psi$ and optimized with gradient ascent, making it computationally tractable; (4) it uses all data without the data splitting required by cross-validation. The marginal likelihood is equivalent to Bayesian model comparison when comparing different kernel choices.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Gaussian Process demo on the EngineersOfAI Playground - no code required.

:::

The Motivation: Uncertainty Over Functions​

What Is a Gaussian Process?​

Kernel Functions: Encoding Prior Beliefs​

RBF / Squared Exponential Kernel​

Matérn Kernel​

Periodic Kernel​

GP Regression: Posterior Derivation​

Hyperparameter Optimization: Learning the Kernel Parameters​

Bayesian Optimization: GPs for Hyperparameter Tuning​

Acquisition Functions​

Bayesian Optimization in Practice​

Scalable GPs​

Interview Questions​