Skip to main content

Bayesian Linear Regression - Uncertainty Estimates for Every Prediction

:::note Reading time: ~38 minutes | Interview relevance: Very High | Target roles: MLE, Data Scientist, Research Engineer, AI Engineer :::

The Real Interview Moment

A semiconductor fabrication facility needs to predict the resistance of a chip component from voltage measurements taken at five test points during manufacturing. Each resistance measurement is inherently noisy - the measurement apparatus introduces roughly ±0.2 ohms of random error. The measurements are correlated because all five voltage readings come from the same physical component.

A standard linear regression model fits y=Xθ+εy = X\theta + \varepsilon and reports a single number: "predicted resistance: 47.3 ohms." The process engineer receiving this output needs to make a decision: does this component pass quality control? But 47.3 ohms with what confidence? The specification is 47.0±1.547.0 \pm 1.5 ohms. Is the true value likely within spec, or is the 47.3 estimate itself uncertain?

Bayesian linear regression answers this directly. It reports: "predicted resistance: 47.3 ohms, 95% credible interval [45.8, 48.7] ohms." The engineer can now compute: given this interval, what fraction of the posterior predictive distribution falls within the specification band? That is the probability the component passes - a number the engineer can act on. The decision to pass, fail, or request re-measurement is now grounded in calibrated uncertainty.

This lesson derives Bayesian linear regression from first principles, shows exactly how the posterior is computed in closed form, connects it to ridge regression, and builds the full implementation from scratch. Bayesian linear regression is also the foundation for understanding Gaussian processes (the kernel methods of Lesson 02 are Bayesian linear regression in infinite-dimensional feature space) and for understanding variational inference for neural networks.


Why Bayesian Linear Regression Exists

Standard linear regression (OLS) solves:

θ^OLS=argminθyXθ2=(XX)1Xy\hat{\theta}_{\text{OLS}} = \arg\min_\theta \|y - X\theta\|^2 = (X^\top X)^{-1} X^\top y

This is clean, fast, and analytically exact. But it gives a single point estimate with no uncertainty. When you compute predictions y^=Xtestθ^\hat{y} = X_{\text{test}}\hat{\theta}, you get a single number per test point.

Where does uncertainty come from? Two sources:

  1. Observation noise σ2\sigma^2: even if we knew the true θ\theta exactly, predictions would be noisy because y=Xθ+εy = X\theta + \varepsilon
  2. Parameter uncertainty: we estimated θ\theta from finite noisy data, so θ^\hat{\theta} is itself uncertain

OLS accounts for source 1 (confidence intervals on predictions use the residual variance) but ignores source 2 - it treats θ^\hat{\theta} as if it were exact. With large datasets, this is fine: parameter uncertainty is negligible. With small datasets, it is not: the estimate of θ\theta is itself highly uncertain and that uncertainty must propagate to predictions.

Bayesian linear regression explicitly models both sources. It places a prior on θ\theta, computes the exact posterior p(θX,y)p(\theta | X, y), and then marginalises over θ\theta to give a predictive distribution that correctly accounts for all uncertainty.


Model Setup

The standard Bayesian linear regression setup is:

Likelihood (data model): y=Xθ+ε,εN(0,σ2I)y = X\theta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)

Equivalently: p(yX,θ)=N(yXθ,σ2I)p(y | X, \theta) = \mathcal{N}(y | X\theta, \sigma^2 I)

Prior (beliefs about weights before seeing data): p(θ)=N(θ0,τ2I)p(\theta) = \mathcal{N}(\theta | \mathbf{0}, \tau^2 I)

The prior says: we believe the true weights are centred at zero, with scale τ\tau. This encodes the regularisation belief that features should not have arbitrarily large effects.


The Posterior: Exact Derivation

We want the posterior p(θX,y)p(yX,θ)p(θ)p(\theta | X, y) \propto p(y | X, \theta)\,p(\theta).

Taking logarithms (since log is monotone, maximising the log is equivalent to maximising the probability):

logp(θX,y)=logp(yX,θ)+logp(θ)+const\log p(\theta | X, y) = \log p(y | X, \theta) + \log p(\theta) + \text{const}

=12σ2yXθ212τ2θ2+const= -\frac{1}{2\sigma^2}\|y - X\theta\|^2 - \frac{1}{2\tau^2}\|\theta\|^2 + \text{const}

Expanding and collecting terms in θ\theta:

=12θ(1σ2XX+1τ2I)SN1θ+θ1σ2XySN1mN+const= -\frac{1}{2}\theta^\top \underbrace{\left(\frac{1}{\sigma^2}X^\top X + \frac{1}{\tau^2}I\right)}_{S_N^{-1}} \theta + \theta^\top \underbrace{\frac{1}{\sigma^2}X^\top y}_{S_N^{-1} m_N} + \text{const}

This is a quadratic form in θ\theta - the log of a Gaussian. Completing the square, the posterior is:

p(θX,y)=N(θmN,SN)\boxed{p(\theta | X, y) = \mathcal{N}(\theta | m_N, S_N)}

where:

SN=(1τ2I+1σ2XX)1\boxed{S_N = \left(\frac{1}{\tau^2}I + \frac{1}{\sigma^2}X^\top X\right)^{-1}}

mN=1σ2SNXy\boxed{m_N = \frac{1}{\sigma^2} S_N X^\top y}

The matrix SNS_N is the posterior covariance of the weights - a d×dd \times d matrix encoding how uncertain we are about each weight and how weights co-vary. The vector mNm_N is the posterior mean - our best single-point estimate of θ\theta.

Sequential Updating

A beautiful property: if we observe data in batches, the posterior updates identically to processing all data at once. After batch 1 giving posterior N(m1,S1)\mathcal{N}(m_1, S_1), the prior for batch 2 is N(m1,S1)\mathcal{N}(m_1, S_1), and the new posterior is:

S21=S11+1σ2X2X2,m2=S2(S11m1+1σ2X2y2)S_2^{-1} = S_1^{-1} + \frac{1}{\sigma^2}X_2^\top X_2, \qquad m_2 = S_2\left(S_1^{-1} m_1 + \frac{1}{\sigma^2}X_2^\top y_2\right)

This is the basis of the Kalman filter - recursive Bayesian linear regression for sequential state estimation.


Connection to Ridge Regression

The MAP estimate (mode of the posterior, which for a Gaussian equals the mean) is:

θ^MAP=mN=1σ2SNXy=(σ2τ2I+XX)1Xy\hat{\theta}_{\text{MAP}} = m_N = \frac{1}{\sigma^2} S_N X^\top y = \left(\frac{\sigma^2}{\tau^2}I + X^\top X\right)^{-1} X^\top y

Compare to the ridge regression solution:

θ^ridge=(λI+XX)1Xy\hat{\theta}_{\text{ridge}} = \left(\lambda I + X^\top X\right)^{-1} X^\top y

They are identical with λ=σ2/τ2\lambda = \sigma^2 / \tau^2. Ridge regression is MAP inference under a Gaussian prior.

This gives a clean Bayesian interpretation of the regularisation coefficient:

  • Large τ2\tau^2 (broad prior, weak regularisation): λ0\lambda \to 0, solution converges to OLS
  • Small τ2\tau^2 (tight prior, strong regularisation): large λ\lambda, solution shrinks toward zero
  • σ2/τ2=λ\sigma^2 / \tau^2 = \lambda: the ratio of noise variance to prior variance sets the regularisation strength

The key insight: MAP (and ridge) gives only the mean, discarding the full posterior covariance SNS_N. The full Bayesian approach keeps SNS_N and uses it to compute predictive uncertainty.


The Posterior Predictive Distribution

Given a new input xx^*, the predictive distribution marginalises over all possible weight vectors:

p(yx,X,y)=p(yx,θ)p(θX,y)dθp(y^*|x^*, X, y) = \int p(y^*|x^*, \theta)\,p(\theta|X,y)\,d\theta

Since both the likelihood and posterior are Gaussian, this integral has a closed form:

p(yx,X,y)=N ⁣(y  |  mNx,    σ2+xSNx)\boxed{p(y^*|x^*, X, y) = \mathcal{N}\!\left(y^* \;\middle|\; m_N^\top x^*,\;\; \sigma^2 + x^{*\top} S_N x^*\right)}

The predictive variance has two components:

Var[y]=σ2aleatoric+xSNxepistemic\text{Var}[y^*] = \underbrace{\sigma^2}_{\text{aleatoric}} + \underbrace{x^{*\top} S_N x^*}_{\text{epistemic}}

TermNameMeaningReducible?
σ2\sigma^2Aleatoric (noise)Observation noise - irreducibleNo
xSNxx^{*\top} S_N x^*Epistemic (model)Uncertainty about parametersYes - more data shrinks SNS_N

As nn \to \infty, the posterior covariance SN0S_N \to 0 (data overwhelms the prior), and epistemic uncertainty vanishes. The only remaining uncertainty is the irreducible noise σ2\sigma^2.

The 95% credible interval for a new prediction is:

[mNx1.96σ2+xSNx,    mNx+1.96σ2+xSNx]\left[m_N^\top x^* - 1.96\sqrt{\sigma^2 + x^{*\top} S_N x^*},\;\; m_N^\top x^* + 1.96\sqrt{\sigma^2 + x^{*\top} S_N x^*}\right]

This is what the sensor calibration engineer needs - a honest statement of uncertainty, not a single point.


Evidence Maximisation (Empirical Bayes)

How do we set τ2\tau^2 (prior variance) and σ2\sigma^2 (noise variance)? Two options:

  1. Specify manually: use domain knowledge. If you believe weights should typically be within ±1, set τ2=1\tau^2 = 1.
  2. Empirical Bayes (evidence maximisation): learn τ2\tau^2 and σ2\sigma^2 from data by maximising the marginal likelihood p(yX,τ2,σ2)p(y|X, \tau^2, \sigma^2).

The marginal likelihood integrates out the weights:

p(yX,τ2,σ2)=p(yX,θ)p(θτ2)dθp(y|X, \tau^2, \sigma^2) = \int p(y|X,\theta)\,p(\theta|\tau^2)\,d\theta

For Gaussian-Gaussian conjugate models, this integral is tractable:

p(yX,τ2,σ2)=N ⁣(y  |  0,  σ2I+τ2XX)p(y|X, \tau^2, \sigma^2) = \mathcal{N}\!\left(y \;\middle|\; \mathbf{0},\; \sigma^2 I + \tau^2 X X^\top\right)

Taking the log:

logp(yX)=12y(σ2I+τ2XX)1y12logdet ⁣(σ2I+τ2XX)n2log2π\log p(y|X) = -\frac{1}{2}y^\top \left(\sigma^2 I + \tau^2 X X^\top\right)^{-1} y - \frac{1}{2}\log\det\!\left(\sigma^2 I + \tau^2 X X^\top\right) - \frac{n}{2}\log 2\pi

This is the same form as the GP marginal likelihood from Lesson 02 - and for good reason: a GP with a linear kernel is exactly Bayesian linear regression.

Gradients with respect to logσ2\log \sigma^2 and logτ2\log \tau^2 can be computed analytically and used with L-BFGS-B. This automatically selects the regularisation strength from data - no cross-validation needed.


From Scratch: Full Implementation

import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import minimize
from scipy.stats import norm

# ─── Bayesian Linear Regression ──────────────────────────────────────────────

class BayesianLinearRegression:
"""
Exact Bayesian linear regression with Gaussian prior.

Model:
y = X @ theta + eps, eps ~ N(0, sigma^2)
theta ~ N(0, tau^2 * I)

Posterior: theta | X, y ~ N(m_N, S_N)
Predictive: y* | x*, X, y ~ N(m_N.T @ x*, sigma^2 + x*.T @ S_N @ x*)
"""

def __init__(self, tau_sq=1.0, sigma_sq=0.1):
"""
tau_sq: prior variance on weights
sigma_sq: observation noise variance
"""
self.tau_sq = tau_sq
self.sigma_sq = sigma_sq
self._fitted = False

def fit(self, X, y):
"""
Compute posterior distribution over weights.

S_N = (I/tau^2 + X.T @ X / sigma^2)^{-1}
m_N = (1/sigma^2) * S_N @ X.T @ y
"""
X = np.atleast_2d(X)
if X.shape[0] == 1 and X.shape[1] != X.shape[0]:
X = X.T # handle 1D input
y = y.ravel()
n, d = X.shape

# Posterior precision matrix: Lambda_N = I/tau^2 + X.T X / sigma^2
precision_prior = np.eye(d) / self.tau_sq
precision_likelihood = X.T @ X / self.sigma_sq
Lambda_N = precision_prior + precision_likelihood

# Posterior covariance: S_N = Lambda_N^{-1}
# Use Cholesky for numerical stability
L = cholesky(Lambda_N, lower=True)
# S_N @ Lambda_N = I => solve two triangular systems
identity = np.eye(d)
tmp = solve_triangular(L, identity, lower=True)
self.S_N = solve_triangular(L.T, tmp, lower=False) # (d, d)

# Posterior mean: m_N = (1/sigma^2) * S_N @ X.T @ y
self.m_N = (self.S_N @ X.T @ y) / self.sigma_sq

self.X_train = X
self.y_train = y
self._fitted = True
return self

def predict(self, X_test, return_std=True, return_interval=True, alpha=0.05):
"""
Posterior predictive distribution.

Returns:
mean: posterior predictive mean (n_test,)
std: posterior predictive std (n_test,) [aleatoric + epistemic]
lower, upper: (1-alpha) credible interval
"""
assert self._fitted, "Call fit() first."
X_test = np.atleast_2d(X_test)

# Predictive mean: m_N.T @ x* for each test point
mean = X_test @ self.m_N # (n_test,)

if not return_std:
return mean

# Predictive variance: sigma^2 + x*^T @ S_N @ x*
# Vectorised: diag(X_test @ S_N @ X_test.T) + sigma^2
epistemic = np.sum((X_test @ self.S_N) * X_test, axis=1) # (n_test,)
aleatoric = self.sigma_sq
pred_var = aleatoric + epistemic
pred_std = np.sqrt(pred_var)

if not return_interval:
return mean, pred_std

# Credible interval
z = norm.ppf(1 - alpha / 2)
lower = mean - z * pred_std
upper = mean + z * pred_std

return mean, pred_std, lower, upper

def posterior_samples(self, X_test, n_samples=100):
"""
Draw samples from the posterior predictive distribution.
Each sample corresponds to drawing theta ~ posterior, then y* ~ N(x*.T theta, sigma^2).
"""
assert self._fitted
# Sample weight vectors from posterior
theta_samples = np.random.multivariate_normal(
self.m_N, self.S_N, size=n_samples
) # (n_samples, d)

X_test = np.atleast_2d(X_test)
# Predicted function values for each sample (before noise)
f_samples = (X_test @ theta_samples.T).T # (n_samples, n_test)

# Add observation noise
noise = np.random.randn(*f_samples.shape) * np.sqrt(self.sigma_sq)
return f_samples + noise

def log_marginal_likelihood(self):
"""
log p(y | X, tau^2, sigma^2)
= log N(y | 0, sigma^2 I + tau^2 X X.T)

Used for empirical Bayes hyperparameter optimisation.
"""
assert self._fitted
n = len(self.y_train)
# Marginal covariance: C = sigma^2 I + tau^2 X X.T
C = self.sigma_sq * np.eye(n) + self.tau_sq * (self.X_train @ self.X_train.T)

try:
L = cholesky(C, lower=True)
except np.linalg.LinAlgError:
return -np.inf

# log N(y | 0, C) = -0.5 y.T C^{-1} y - 0.5 log det C - n/2 log(2pi)
v = solve_triangular(L, self.y_train, lower=True)
log_det = 2 * np.sum(np.log(np.diag(L)))
data_fit = -0.5 * v @ v
complexity = -0.5 * log_det
constant = -n / 2 * np.log(2 * np.pi)
return data_fit + complexity + constant


# ─── Empirical Bayes: learn hyperparameters from data ────────────────────────

def empirical_bayes(X, y, n_restarts=5):
"""
Maximise log marginal likelihood over (tau^2, sigma^2).
Returns BayesianLinearRegression with optimised hyperparameters.
"""
best_lml = -np.inf
best_params = None

for _ in range(n_restarts):
log_params0 = np.random.uniform(-3, 3, size=2)

def neg_lml(log_params):
tau_sq = np.exp(log_params[0])
sigma_sq = np.exp(log_params[1])
blr = BayesianLinearRegression(tau_sq=tau_sq, sigma_sq=sigma_sq)
blr.fit(X, y)
return -blr.log_marginal_likelihood()

result = minimize(neg_lml, log_params0, method='L-BFGS-B',
options={'maxiter': 500})

if -result.fun > best_lml:
best_lml = -result.fun
best_params = np.exp(result.x)

tau_sq_opt, sigma_sq_opt = best_params
print(f"Empirical Bayes:")
print(f" tau^2 (prior variance): {tau_sq_opt:.6f}")
print(f" sigma^2 (noise variance): {sigma_sq_opt:.6f}")
print(f" log marginal likelihood: {best_lml:.4f}")
print(f" Implied ridge lambda: {sigma_sq_opt / tau_sq_opt:.4f}")

return BayesianLinearRegression(tau_sq=tau_sq_opt, sigma_sq=sigma_sq_opt).fit(X, y)


# ─── Demo: 1D regression with uncertainty bands ──────────────────────────────

np.random.seed(42)

# True model: y = 0.5 + 2.0*x + noise
def true_fn(x):
return 0.5 + 2.0 * x

NOISE_STD = 0.4
N_TRAIN = 20

X_train_1d = np.random.uniform(-2, 2, N_TRAIN)
y_train = true_fn(X_train_1d) + np.random.randn(N_TRAIN) * NOISE_STD

# Design matrix with intercept: [1, x]
def make_features(x):
x = np.asarray(x).ravel()
return np.column_stack([np.ones_like(x), x])

X_train = make_features(X_train_1d)

# Fit Bayesian linear regression
blr = BayesianLinearRegression(tau_sq=10.0, sigma_sq=NOISE_STD**2)
blr.fit(X_train, y_train)

# Predict on test grid
X_test_1d = np.linspace(-3, 3, 200)
X_test = make_features(X_test_1d)

mean, std, lower, upper = blr.predict(X_test)

print("Bayesian Linear Regression Results:")
print(f" True weights: [0.5, 2.0]")
print(f" Posterior mean: {blr.m_N}")
print(f" Posterior std: {np.sqrt(np.diag(blr.S_N))}")
print(f"\nPredictive uncertainty at x=0 (in-distribution):")
i_zero = np.argmin(np.abs(X_test_1d))
print(f" Mean: {mean[i_zero]:.3f}, Std: {std[i_zero]:.3f}")
print(f" 95% CI: [{lower[i_zero]:.3f}, {upper[i_zero]:.3f}]")

print(f"\nPredictive uncertainty at x=2.9 (extrapolation):")
i_far = np.argmin(np.abs(X_test_1d - 2.9))
print(f" Mean: {mean[i_far]:.3f}, Std: {std[i_far]:.3f}")
print(f" 95% CI: [{lower[i_far]:.3f}, {upper[i_far]:.3f}]")
print(f" (Wider interval - extrapolating beyond training data)")

Uncertainty Grows With Distance: Visualising the Predictive Bands

A key property of Bayesian linear regression: predictive uncertainty is not uniform. It grows as xx^* moves away from the centroid of training data.

This is captured by the epistemic term xSNxx^{*\top} S_N x^*. In regions where the training data clusters, XXX^\top X is large, SNS_N is small (tight posterior), and epistemic uncertainty is low. In regions away from the training centroid, XXX^\top X does not constrain θ\theta along that direction, and epistemic uncertainty is high.

import numpy as np

def uncertainty_profile(blr, X_test_1d, X_train_1d):
"""
Decompose predictive uncertainty into aleatoric and epistemic components.
"""
X_test = np.column_stack([np.ones_like(X_test_1d), X_test_1d])

# Full predictive variance
epistemic = np.sum((X_test @ blr.S_N) * X_test, axis=1)
aleatoric = blr.sigma_sq * np.ones_like(epistemic)
total = epistemic + aleatoric

# Centroid of training data
x_centroid = X_train_1d.mean()

print(f"\nUncertainty Decomposition (tau^2={blr.tau_sq:.2f}, sigma^2={blr.sigma_sq:.4f}):")
print(f"{'x*':>8} | {'Aleatoric':>12} | {'Epistemic':>12} | {'Total std':>10}")
print("-" * 50)
checkpoints = [-2.5, -1.0, x_centroid, 1.0, 2.5]
for x_val in checkpoints:
idx = np.argmin(np.abs(X_test_1d - x_val))
print(f"{x_val:>8.2f} | {np.sqrt(aleatoric[idx]):>12.4f} | "
f"{np.sqrt(epistemic[idx]):>12.4f} | "
f"{np.sqrt(total[idx]):>10.4f}")

return epistemic, aleatoric

# Run with our fitted model
epistemic, aleatoric = uncertainty_profile(blr, X_test_1d, X_train_1d)

How the Posterior Collapses With More Data

import numpy as np

def posterior_convergence_demo(true_theta, sigma_sq, tau_sq, dataset_sizes):
"""
Show how posterior mean approaches MLE and covariance shrinks with more data.
"""
np.random.seed(0)
d = len(true_theta)

print(f"{'n':>6} | {'Posterior Mean':>20} | {'Posterior Std (avg)':>20} | {'LML':>10}")
print("-" * 65)

for n in dataset_sizes:
X = np.random.randn(n, d)
y = X @ true_theta + np.random.randn(n) * np.sqrt(sigma_sq)

blr = BayesianLinearRegression(tau_sq=tau_sq, sigma_sq=sigma_sq)
blr.fit(X, y)

avg_std = np.sqrt(np.diag(blr.S_N)).mean()
lml = blr.log_marginal_likelihood()
mean_str = ", ".join(f"{m:.3f}" for m in blr.m_N)
print(f"{n:>6} | [{mean_str}] | {avg_std:>20.6f} | {lml:>10.2f}")

# Example: 2D problem
true_weights = np.array([1.5, -0.7])
posterior_convergence_demo(true_weights, sigma_sq=0.25, tau_sq=5.0,
dataset_sizes=[5, 10, 25, 50, 100, 500])

Heteroscedastic Bayesian Regression

Standard Bayesian linear regression assumes homoscedastic noise - the same σ2\sigma^2 everywhere. In many problems, noise varies with xx (heteroscedastic). A microphone records louder noise at higher frequencies. A financial model has higher volatility in certain market regimes.

Heteroscedastic extension: model logσ2(x)=wσϕ(x)\log \sigma^2(x) = w_{\sigma}^\top \phi(x) - learn a separate linear model for the log noise variance. This requires iterative optimisation (no closed form) but is tractable.

Alternatively, use a Student-t likelihood instead of Gaussian. Student-t has heavier tails - it is more robust to outliers, assigning less weight to extreme observations:

p(yixi,θ)=Student-t(ν,xiθ,σ2)p(y_i | x_i, \theta) = \text{Student-t}(\nu, x_i^\top \theta, \sigma^2)

where ν\nu is the degrees of freedom parameter. As ν\nu \to \infty, this converges to the Gaussian likelihood. For ν=3\nu = 355, it is dramatically more robust to outliers.

import numpy as np
from scipy.special import gammaln

def student_t_log_likelihood(y, mu, sigma, nu=4.0):
"""
Log likelihood of Student-t distribution.
More robust to outliers than Gaussian.
nu: degrees of freedom (larger = closer to Gaussian)
"""
n = len(y)
z = (y - mu) / sigma
log_p = (gammaln((nu + 1) / 2) - gammaln(nu / 2)
- 0.5 * np.log(nu * np.pi) - np.log(sigma)
- (nu + 1) / 2 * np.log(1 + z**2 / nu))
return log_p.sum()


def compare_gaussian_vs_student_t(X_train, y_clean, y_corrupted, sigma=0.5, nu=4.0):
"""
Compare Gaussian vs Student-t likelihood on corrupted data.
Shows robustness of heavy-tailed likelihood.
"""
# Gaussian MAP (ridge regression)
lam = sigma**2 # assume tau^2 = 1
theta_gaussian = np.linalg.solve(X_train.T @ X_train + lam * np.eye(X_train.shape[1]),
X_train.T @ y_corrupted)

print("Effect of outliers on parameter estimates:")
print(f" Gaussian MAP (corrupted data): {theta_gaussian}")

# For comparison: Gaussian MAP on clean data
theta_clean = np.linalg.solve(X_train.T @ X_train + lam * np.eye(X_train.shape[1]),
X_train.T @ y_clean)
print(f" Gaussian MAP (clean data): {theta_clean}")
print(f" Error from outliers: {np.linalg.norm(theta_gaussian - theta_clean):.4f}")

sklearn-Compatible Wrapper

from sklearn.base import BaseEstimator, RegressorMixin
import numpy as np

class BayesianRidgeFromScratch(BaseEstimator, RegressorMixin):
"""
sklearn-compatible Bayesian linear regression.
Fits hyperparameters via empirical Bayes (evidence maximisation).
"""

def __init__(self, tau_sq=1.0, sigma_sq=0.1, fit_hyperparams=True, n_restarts=3):
self.tau_sq = tau_sq
self.sigma_sq = sigma_sq
self.fit_hyperparams = fit_hyperparams
self.n_restarts = n_restarts

def fit(self, X, y):
if self.fit_hyperparams:
# Optimise hyperparameters via marginal likelihood
self.blr_ = empirical_bayes(X, y, n_restarts=self.n_restarts)
else:
self.blr_ = BayesianLinearRegression(
tau_sq=self.tau_sq, sigma_sq=self.sigma_sq
).fit(X, y)
return self

def predict(self, X, return_std=False):
if return_std:
mean, std, _, _ = self.blr_.predict(X)
return mean, std
return self.blr_.predict(X, return_std=False)

def score(self, X, y):
"""R^2 score using posterior mean predictions."""
y_pred = self.predict(X)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
return 1 - ss_res / ss_tot


# ─── Compare with sklearn's BayesianRidge ────────────────────────────────────

import numpy as np
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

np.random.seed(42)
X, y = make_regression(n_samples=100, n_features=10, noise=15.0, random_state=42)

# Our implementation
our_model = BayesianRidgeFromScratch(fit_hyperparams=True)
our_model.fit(X, y)

# sklearn's BayesianRidge (uses different parameterisation but same concept)
sklearn_br = BayesianRidge()

# Ridge regression baseline
ridge = Ridge()

for name, model in [("Our BLR", our_model), ("sklearn BayesianRidge", sklearn_br),
("Ridge", ridge)]:
scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')
print(f"{name:>25}: RMSE = {-scores.mean():.3f} ± {scores.std():.3f}")

Basis Function Expansion: Nonlinear Bayesian Regression

Bayesian linear regression is linear in the parameters, not necessarily in the inputs. By expanding inputs through a fixed feature map ϕ:RdRm\phi: \mathbb{R}^d \to \mathbb{R}^m, we get nonlinear regression with all the same Bayesian machinery:

y=ϕ(x)θ+εy = \phi(x)^\top \theta + \varepsilon

Common feature maps:

Feature map ϕ(x)\phi(x)EffectExample
[1,x,x2,,xp][1, x, x^2, \ldots, x^p]Polynomial regressionSmooth curves
[sin(kx),cos(kx)][\sin(kx), \cos(kx)]Fourier featuresPeriodic functions
[RBF(x,ci)][\text{RBF}(x, c_i)]Radial basis functionsFlexible nonlinear
Neural network layersDeep featuresImage/text tasks

With a Gaussian prior on θ\theta and any fixed ϕ\phi, the posterior is still Gaussian - all closed-form results carry over directly. This is why Bayesian linear regression with RBF features is equivalent to a GP with an RBF kernel - and explains why GPs are sometimes called "Bayesian linear regression in infinite-dimensional feature space."

import numpy as np

def polynomial_features(x, degree=3):
"""Expand scalar x to polynomial features [1, x, x^2, ..., x^degree]."""
x = np.asarray(x).ravel()
return np.column_stack([x ** k for k in range(degree + 1)])

def rbf_features(x, centres, length_scale=1.0):
"""Expand x to RBF features centred at each point in centres."""
x = np.asarray(x).ravel()
centres = np.asarray(centres).ravel()
diff = x[:, None] - centres[None, :] # (n, m)
return np.exp(-diff**2 / (2 * length_scale**2))

np.random.seed(42)

# Nonlinear true function
def nonlinear_fn(x):
return np.sin(2 * x) + 0.3 * x**2

X_1d = np.sort(np.random.uniform(-3, 3, 30))
y_nl = nonlinear_fn(X_1d) + np.random.randn(30) * 0.3

# Polynomial Bayesian regression
X_poly = polynomial_features(X_1d, degree=5)
blr_poly = BayesianLinearRegression(tau_sq=10.0, sigma_sq=0.09)
blr_poly.fit(X_poly, y_nl)

# RBF Bayesian regression
centres = np.linspace(-3, 3, 15)
X_rbf = rbf_features(X_1d, centres, length_scale=0.5)
blr_rbf = BayesianLinearRegression(tau_sq=10.0, sigma_sq=0.09)
blr_rbf.fit(X_rbf, y_nl)

# Evaluate on test set
X_test_1d = np.linspace(-3.5, 3.5, 200)
X_test_poly = polynomial_features(X_test_1d, degree=5)
X_test_rbf = rbf_features(X_test_1d, centres, length_scale=0.5)

mean_poly, std_poly, _, _ = blr_poly.predict(X_test_poly)
mean_rbf, std_rbf, _, _ = blr_rbf.predict(X_test_rbf)

true_test = nonlinear_fn(X_test_1d)
rmse_poly = np.sqrt(np.mean((mean_poly - true_test)**2))
rmse_rbf = np.sqrt(np.mean((mean_rbf - true_test)**2))

print(f"Polynomial features (degree 5): RMSE = {rmse_poly:.4f}")
print(f"RBF features (15 centres): RMSE = {rmse_rbf:.4f}")
print("\nNote: RBF features essentially approximate GP regression,")
print("recovering calibrated uncertainty similar to a GP with RBF kernel.")

Active Learning with Bayesian Linear Regression

A direct application of predictive uncertainty: query the point where prediction uncertainty is highest.

import numpy as np

def active_learning_query(blr, X_candidates, n_query=5):
"""
Select the n_query most uncertain candidate points.
Uncertainty = total predictive std (epistemic + aleatoric).
"""
_, std, _, _ = blr.predict(X_candidates)
# Epistemic only: more data reduces this - the right criterion for active learning
epistemic = np.sum((X_candidates @ blr.S_N) * X_candidates, axis=1)
epistemic_std = np.sqrt(epistemic)

# Rank by epistemic uncertainty (most uncertain = most informative)
ranked_idx = np.argsort(epistemic_std)[::-1]
top_idx = ranked_idx[:n_query]

print(f"Top {n_query} points to query (highest epistemic uncertainty):")
for rank, i in enumerate(top_idx):
print(f" {rank+1}. Index {i}: epistemic std = {epistemic_std[i]:.4f}")

return top_idx


np.random.seed(0)
# Initial 10 training points
X_init = np.random.uniform(-2, 2, 10)
y_init = true_fn(X_init) + np.random.randn(10) * 0.4
X_init_feat = make_features(X_init)

blr_active = BayesianLinearRegression(tau_sq=5.0, sigma_sq=0.16)
blr_active.fit(X_init_feat, y_init)

# Candidate pool: 100 unlabelled points
X_pool = np.linspace(-3, 3, 100)
X_pool_feat = make_features(X_pool)

active_idx = active_learning_query(blr_active, X_pool_feat, n_query=5)

Production Engineering Notes

:::tip When to choose Bayesian linear regression over ridge Use Bayesian linear regression when you need calibrated uncertainty intervals alongside predictions - sensor fusion, engineering tolerance checks, financial risk intervals. Use ridge regression when you only need the point estimate and computational overhead matters. Ridge is a special case (MAP) that discards the posterior covariance and is therefore cheaper. :::

:::warning Computing S_N for high-dimensional problems The posterior covariance SNS_N is a d×dd \times d matrix. For d=10,000d = 10{,}000 features, storing and inverting SNS_N requires 10,000280010{,}000^2 \approx 800 MB of memory and O(d3)O(d^3) computation. Use the Woodbury identity to compute the n×nn \times n dual form instead: SN=τ2Iτ4X(XXτ2+σ2I)1XS_N = \tau^2 I - \tau^4 X^\top (X X^\top \tau^2 + \sigma^2 I)^{-1} X. When ndn \ll d, the dual form is dramatically cheaper. :::

:::danger Posterior covariance and prediction intervals are not prediction error bounds A 95% Bayesian credible interval means: given the model assumptions and data, there is a 95% posterior probability the true parameter (or future observation) falls in this range. It is not a frequentist confidence interval. If the model is misspecified (wrong prior, wrong likelihood), the credible interval can be badly miscalibrated. Always perform posterior predictive checks: do observed data look like samples from your model? :::

:::note The Woodbury matrix identity for efficient inference When dnd \gg n, compute the posterior using the dual form via Woodbury:

SN=τ2 ⁣[IX ⁣(XX+σ2τ2I)1 ⁣X]S_N = \tau^2\!\left[I - X^\top\!\left(XX^\top + \frac{\sigma^2}{\tau^2}I\right)^{-1}\!X\right]

This inverts an n×nn \times n matrix instead of a d×dd \times d matrix - crucial when features are high-dimensional (e.g., polynomial or RBF features with many centres). :::


YouTube Resources

  • Bayesian Linear Regression - Mathematicalmonk (YouTube): rigorous derivation of the posterior from scratch, highly recommended for the full derivation
  • Bayesian Methods for Machine Learning - Coursera (Higher School of Economics): week-by-week treatment with Bayesian linear regression as the foundation
  • Linear Regression as a Bayesian Model - Brandon Rohrer: accessible to practitioners, connects L2 regularisation to Gaussian prior clearly
  • Evidence Approximation and Empirical Bayes - Carl Edward Rasmussen (Cambridge GP lectures): connects Bayesian linear regression to GP marginal likelihood maximisation

Interview Q&A

Q1: What is the difference between the ridge regression solution and Bayesian linear regression?

Ridge regression gives a single point estimate: θ^=(XX+λI)1Xy\hat{\theta} = (X^\top X + \lambda I)^{-1}X^\top y. This is exactly the MAP estimate (mode of the posterior) from Bayesian linear regression with a zero-mean Gaussian prior where λ=σ2/τ2\lambda = \sigma^2/\tau^2. Bayesian linear regression goes further: it maintains the full posterior covariance SN=(1τ2I+1σ2XX)1S_N = (\frac{1}{\tau^2}I + \frac{1}{\sigma^2}X^\top X)^{-1}. This covariance enables computing the predictive distribution at test points - including both the aleatoric noise σ2\sigma^2 and epistemic uncertainty xSNxx^{*\top}S_N x^*. Ridge discards SNS_N and cannot provide uncertainty intervals. If you only need point estimates and care about speed, use ridge. If you need calibrated uncertainty, use full Bayesian.

Q2: How does predictive uncertainty in Bayesian linear regression decompose?

The posterior predictive variance at a test point xx^* is Var[y]=σ2+xSNx\text{Var}[y^*] = \sigma^2 + x^{*\top}S_N x^*. The first term σ2\sigma^2 is aleatoric uncertainty - irreducible measurement noise. It is the same everywhere regardless of how much training data exists or how far the test point is from training data. The second term xSNxx^{*\top}S_N x^* is epistemic uncertainty - uncertainty about the weights. It shrinks toward zero as more data arrives (because SN0S_N \to 0 as nn \to \infty) and is larger in directions not well-constrained by the training data. Epistemic uncertainty is highest far from the training data centroid - the model is extrapolating and correctly says so.

Q3: How does evidence maximisation (empirical Bayes) differ from cross-validation for hyperparameter selection?

Evidence maximisation maximises the marginal likelihood p(yX,τ2,σ2)=p(yX,θ)p(θτ2)dθp(y|X, \tau^2, \sigma^2) = \int p(y|X,\theta)p(\theta|\tau^2) d\theta - integrating out the weights analytically. This gives a principled single-objective over hyperparameters that does not require a held-out validation set. Cross-validation splits data, trains on subsets, and measures held-out performance - it is more general (works for any model) but wastes data and is computationally expensive. For conjugate Bayesian models like Gaussian-Gaussian, evidence maximisation is preferred: it uses all data for both training and hyperparameter selection, and it optimises the right objective (marginal likelihood implements automatic Occam's razor). For non-conjugate models, cross-validation is usually the practical choice.

Q4: Why does Bayesian linear regression with RBF features approximate a Gaussian process?

A Gaussian process with RBF kernel k(x,x)=σ2exp(xx2/22)k(x, x') = \sigma^2 \exp(-\|x-x'\|^2/2\ell^2) can be interpreted as Bayesian linear regression with an infinite-dimensional feature map. Specifically, any kernel satisfying Mercer's theorem can be written as k(x,x)=ϕ(x)ϕ(x)k(x, x') = \phi(x)^\top \phi(x') for some (possibly infinite-dimensional) feature map ϕ\phi. With a finite set of RBF basis functions ϕi(x)=exp(xci2/22)\phi_i(x) = \exp(-\|x-c_i\|^2/2\ell^2) centred at mm locations, Bayesian linear regression in this feature space approximates a GP with RBF kernel - the approximation quality improves with more basis functions. This connection is exploited in random Fourier features (Rahimi & Recht, 2007): approximate a GP with a finite random feature map, enabling GP-quality uncertainty at O(nm)O(nm) cost instead of O(n3)O(n^3).

Q5: How would you use Bayesian linear regression for online learning as new data arrives?

Bayesian linear regression supports exact sequential updating. After observing batch 1 and computing posterior N(m1,S1)\mathcal{N}(m_1, S_1), treat this posterior as the prior for batch 2. The update is: S21=S11+X2X2/σ2S_2^{-1} = S_1^{-1} + X_2^\top X_2/\sigma^2 and m2=S2(S11m1+X2y2/σ2)m_2 = S_2(S_1^{-1}m_1 + X_2^\top y_2/\sigma^2). This is the Kalman filter for linear regression - no need to store historical data. Each new batch updates the precision matrix and mean vector in O(d2nbatch)O(d^2 n_{\text{batch}}) time, compared to O(d3)O(d^3) for a full re-fit. In production, this enables streaming inference: as sensor readings arrive, update beliefs about calibration parameters in real time without reprocessing historical data.


Common Mistakes

:::danger Using point predictions without uncertainty for downstream decisions Bayesian linear regression's main value is the predictive distribution, not just the mean. Using only mNxm_N^\top x^* and discarding the posterior covariance SNS_N is equivalent to using ridge regression - the Bayesian computation was wasted. Always propagate uncertainty to downstream decisions: compute the probability of exceeding a threshold, the expected utility under the predictive distribution, or the credible interval for a compliance check. :::

:::warning Assuming the Gaussian prior is appropriate A Gaussian prior on weights works well when you have no strong domain knowledge and want a generic regulariser. But if you know some weights should be exactly zero (sparse signal), a Laplace prior (giving MAP = Lasso) or a spike-and-slab prior is more appropriate. If you know weights should be positive (e.g., a price coefficient), a log-normal or gamma prior is better. The choice of prior family should reflect actual beliefs - defaulting to Gaussian is convenient but not always correct. :::

:::danger Forgetting to include an intercept term If the true function has a non-zero mean and you do not include an intercept (column of ones in XX), the Gaussian prior pulls predictions toward zero everywhere. This is prior misspecification - the zero-mean prior applies to the linear coefficients, not to the function values. Always include an intercept unless you have explicitly verified that the true function has zero mean at x=0x = 0. :::

:::warning High-dimensional SNS_N storage The posterior covariance SNS_N is d×dd \times d. For d=1,000d = 1{,}000 features, this is 8 MB - fine. For d=100,000d = 100{,}000 features, this is 80 GB - impossible. Use the dual (Woodbury) form when d>nd > n, or use sparse approximations to SNS_N (diagonal approximation, low-rank approximation). Mean-field variational inference (Lesson 04) uses a diagonal approximation to avoid this issue in deep learning. :::


Practice Problems

  1. Derive the posterior mean mNm_N and covariance SNS_N for Bayesian linear regression from scratch. Start with logp(θX,y)logp(yX,θ)+logp(θ)\log p(\theta|X,y) \propto \log p(y|X,\theta) + \log p(\theta), expand the quadratic, and complete the square.

  2. Show that the MAP estimate of Bayesian linear regression with Gaussian prior is exactly the ridge regression solution. Write out both expressions and match terms.

  3. Implement Bayesian linear regression for a 1D problem with polynomial features of degree 4. Plot the predictive mean and 95% credible interval as a function of ntrain{5,10,20,50}n_{\text{train}} \in \{5, 10, 20, 50\}. Observe how the credible interval narrows with more data and how it widens in extrapolation regions.

  4. Implement the sequential Bayesian update rule and verify that processing data in two batches gives the same posterior as processing all data at once. Use a 2D regression problem with 50 total observations split as 20 + 30.

  5. Compare the predictive intervals of Bayesian linear regression with those of standard linear regression (OLS with bootstrap confidence intervals). Use 100 Monte Carlo simulations, generating fresh training sets of size 20 each time, and plot the empirical coverage of each method's claimed 95% intervals. Which method is better calibrated and why?

:::tip 🎮 Interactive Playground

Visualize this concept: Try the MLE & MAP Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.