Bayesian Linear Regression - Uncertainty Estimates for Every Prediction
:::note Reading time: ~38 minutes | Interview relevance: Very High | Target roles: MLE, Data Scientist, Research Engineer, AI Engineer :::
The Real Interview Moment
A semiconductor fabrication facility needs to predict the resistance of a chip component from voltage measurements taken at five test points during manufacturing. Each resistance measurement is inherently noisy - the measurement apparatus introduces roughly ±0.2 ohms of random error. The measurements are correlated because all five voltage readings come from the same physical component.
A standard linear regression model fits and reports a single number: "predicted resistance: 47.3 ohms." The process engineer receiving this output needs to make a decision: does this component pass quality control? But 47.3 ohms with what confidence? The specification is ohms. Is the true value likely within spec, or is the 47.3 estimate itself uncertain?
Bayesian linear regression answers this directly. It reports: "predicted resistance: 47.3 ohms, 95% credible interval [45.8, 48.7] ohms." The engineer can now compute: given this interval, what fraction of the posterior predictive distribution falls within the specification band? That is the probability the component passes - a number the engineer can act on. The decision to pass, fail, or request re-measurement is now grounded in calibrated uncertainty.
This lesson derives Bayesian linear regression from first principles, shows exactly how the posterior is computed in closed form, connects it to ridge regression, and builds the full implementation from scratch. Bayesian linear regression is also the foundation for understanding Gaussian processes (the kernel methods of Lesson 02 are Bayesian linear regression in infinite-dimensional feature space) and for understanding variational inference for neural networks.
Why Bayesian Linear Regression Exists
Standard linear regression (OLS) solves:
This is clean, fast, and analytically exact. But it gives a single point estimate with no uncertainty. When you compute predictions , you get a single number per test point.
Where does uncertainty come from? Two sources:
- Observation noise : even if we knew the true exactly, predictions would be noisy because
- Parameter uncertainty: we estimated from finite noisy data, so is itself uncertain
OLS accounts for source 1 (confidence intervals on predictions use the residual variance) but ignores source 2 - it treats as if it were exact. With large datasets, this is fine: parameter uncertainty is negligible. With small datasets, it is not: the estimate of is itself highly uncertain and that uncertainty must propagate to predictions.
Bayesian linear regression explicitly models both sources. It places a prior on , computes the exact posterior , and then marginalises over to give a predictive distribution that correctly accounts for all uncertainty.
Model Setup
The standard Bayesian linear regression setup is:
Likelihood (data model):
Equivalently:
Prior (beliefs about weights before seeing data):
The prior says: we believe the true weights are centred at zero, with scale . This encodes the regularisation belief that features should not have arbitrarily large effects.
The Posterior: Exact Derivation
We want the posterior .
Taking logarithms (since log is monotone, maximising the log is equivalent to maximising the probability):
Expanding and collecting terms in :
This is a quadratic form in - the log of a Gaussian. Completing the square, the posterior is:
where:
The matrix is the posterior covariance of the weights - a matrix encoding how uncertain we are about each weight and how weights co-vary. The vector is the posterior mean - our best single-point estimate of .
Sequential Updating
A beautiful property: if we observe data in batches, the posterior updates identically to processing all data at once. After batch 1 giving posterior , the prior for batch 2 is , and the new posterior is:
This is the basis of the Kalman filter - recursive Bayesian linear regression for sequential state estimation.
Connection to Ridge Regression
The MAP estimate (mode of the posterior, which for a Gaussian equals the mean) is:
Compare to the ridge regression solution:
They are identical with . Ridge regression is MAP inference under a Gaussian prior.
This gives a clean Bayesian interpretation of the regularisation coefficient:
- Large (broad prior, weak regularisation): , solution converges to OLS
- Small (tight prior, strong regularisation): large , solution shrinks toward zero
- : the ratio of noise variance to prior variance sets the regularisation strength
The key insight: MAP (and ridge) gives only the mean, discarding the full posterior covariance . The full Bayesian approach keeps and uses it to compute predictive uncertainty.
The Posterior Predictive Distribution
Given a new input , the predictive distribution marginalises over all possible weight vectors:
Since both the likelihood and posterior are Gaussian, this integral has a closed form:
The predictive variance has two components:
| Term | Name | Meaning | Reducible? |
|---|---|---|---|
| Aleatoric (noise) | Observation noise - irreducible | No | |
| Epistemic (model) | Uncertainty about parameters | Yes - more data shrinks |
As , the posterior covariance (data overwhelms the prior), and epistemic uncertainty vanishes. The only remaining uncertainty is the irreducible noise .
The 95% credible interval for a new prediction is:
This is what the sensor calibration engineer needs - a honest statement of uncertainty, not a single point.
Evidence Maximisation (Empirical Bayes)
How do we set (prior variance) and (noise variance)? Two options:
- Specify manually: use domain knowledge. If you believe weights should typically be within ±1, set .
- Empirical Bayes (evidence maximisation): learn and from data by maximising the marginal likelihood .
The marginal likelihood integrates out the weights:
For Gaussian-Gaussian conjugate models, this integral is tractable:
Taking the log:
This is the same form as the GP marginal likelihood from Lesson 02 - and for good reason: a GP with a linear kernel is exactly Bayesian linear regression.
Gradients with respect to and can be computed analytically and used with L-BFGS-B. This automatically selects the regularisation strength from data - no cross-validation needed.
From Scratch: Full Implementation
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import minimize
from scipy.stats import norm
# ─── Bayesian Linear Regression ──────────────────────────────────────────────
class BayesianLinearRegression:
"""
Exact Bayesian linear regression with Gaussian prior.
Model:
y = X @ theta + eps, eps ~ N(0, sigma^2)
theta ~ N(0, tau^2 * I)
Posterior: theta | X, y ~ N(m_N, S_N)
Predictive: y* | x*, X, y ~ N(m_N.T @ x*, sigma^2 + x*.T @ S_N @ x*)
"""
def __init__(self, tau_sq=1.0, sigma_sq=0.1):
"""
tau_sq: prior variance on weights
sigma_sq: observation noise variance
"""
self.tau_sq = tau_sq
self.sigma_sq = sigma_sq
self._fitted = False
def fit(self, X, y):
"""
Compute posterior distribution over weights.
S_N = (I/tau^2 + X.T @ X / sigma^2)^{-1}
m_N = (1/sigma^2) * S_N @ X.T @ y
"""
X = np.atleast_2d(X)
if X.shape[0] == 1 and X.shape[1] != X.shape[0]:
X = X.T # handle 1D input
y = y.ravel()
n, d = X.shape
# Posterior precision matrix: Lambda_N = I/tau^2 + X.T X / sigma^2
precision_prior = np.eye(d) / self.tau_sq
precision_likelihood = X.T @ X / self.sigma_sq
Lambda_N = precision_prior + precision_likelihood
# Posterior covariance: S_N = Lambda_N^{-1}
# Use Cholesky for numerical stability
L = cholesky(Lambda_N, lower=True)
# S_N @ Lambda_N = I => solve two triangular systems
identity = np.eye(d)
tmp = solve_triangular(L, identity, lower=True)
self.S_N = solve_triangular(L.T, tmp, lower=False) # (d, d)
# Posterior mean: m_N = (1/sigma^2) * S_N @ X.T @ y
self.m_N = (self.S_N @ X.T @ y) / self.sigma_sq
self.X_train = X
self.y_train = y
self._fitted = True
return self
def predict(self, X_test, return_std=True, return_interval=True, alpha=0.05):
"""
Posterior predictive distribution.
Returns:
mean: posterior predictive mean (n_test,)
std: posterior predictive std (n_test,) [aleatoric + epistemic]
lower, upper: (1-alpha) credible interval
"""
assert self._fitted, "Call fit() first."
X_test = np.atleast_2d(X_test)
# Predictive mean: m_N.T @ x* for each test point
mean = X_test @ self.m_N # (n_test,)
if not return_std:
return mean
# Predictive variance: sigma^2 + x*^T @ S_N @ x*
# Vectorised: diag(X_test @ S_N @ X_test.T) + sigma^2
epistemic = np.sum((X_test @ self.S_N) * X_test, axis=1) # (n_test,)
aleatoric = self.sigma_sq
pred_var = aleatoric + epistemic
pred_std = np.sqrt(pred_var)
if not return_interval:
return mean, pred_std
# Credible interval
z = norm.ppf(1 - alpha / 2)
lower = mean - z * pred_std
upper = mean + z * pred_std
return mean, pred_std, lower, upper
def posterior_samples(self, X_test, n_samples=100):
"""
Draw samples from the posterior predictive distribution.
Each sample corresponds to drawing theta ~ posterior, then y* ~ N(x*.T theta, sigma^2).
"""
assert self._fitted
# Sample weight vectors from posterior
theta_samples = np.random.multivariate_normal(
self.m_N, self.S_N, size=n_samples
) # (n_samples, d)
X_test = np.atleast_2d(X_test)
# Predicted function values for each sample (before noise)
f_samples = (X_test @ theta_samples.T).T # (n_samples, n_test)
# Add observation noise
noise = np.random.randn(*f_samples.shape) * np.sqrt(self.sigma_sq)
return f_samples + noise
def log_marginal_likelihood(self):
"""
log p(y | X, tau^2, sigma^2)
= log N(y | 0, sigma^2 I + tau^2 X X.T)
Used for empirical Bayes hyperparameter optimisation.
"""
assert self._fitted
n = len(self.y_train)
# Marginal covariance: C = sigma^2 I + tau^2 X X.T
C = self.sigma_sq * np.eye(n) + self.tau_sq * (self.X_train @ self.X_train.T)
try:
L = cholesky(C, lower=True)
except np.linalg.LinAlgError:
return -np.inf
# log N(y | 0, C) = -0.5 y.T C^{-1} y - 0.5 log det C - n/2 log(2pi)
v = solve_triangular(L, self.y_train, lower=True)
log_det = 2 * np.sum(np.log(np.diag(L)))
data_fit = -0.5 * v @ v
complexity = -0.5 * log_det
constant = -n / 2 * np.log(2 * np.pi)
return data_fit + complexity + constant
# ─── Empirical Bayes: learn hyperparameters from data ────────────────────────
def empirical_bayes(X, y, n_restarts=5):
"""
Maximise log marginal likelihood over (tau^2, sigma^2).
Returns BayesianLinearRegression with optimised hyperparameters.
"""
best_lml = -np.inf
best_params = None
for _ in range(n_restarts):
log_params0 = np.random.uniform(-3, 3, size=2)
def neg_lml(log_params):
tau_sq = np.exp(log_params[0])
sigma_sq = np.exp(log_params[1])
blr = BayesianLinearRegression(tau_sq=tau_sq, sigma_sq=sigma_sq)
blr.fit(X, y)
return -blr.log_marginal_likelihood()
result = minimize(neg_lml, log_params0, method='L-BFGS-B',
options={'maxiter': 500})
if -result.fun > best_lml:
best_lml = -result.fun
best_params = np.exp(result.x)
tau_sq_opt, sigma_sq_opt = best_params
print(f"Empirical Bayes:")
print(f" tau^2 (prior variance): {tau_sq_opt:.6f}")
print(f" sigma^2 (noise variance): {sigma_sq_opt:.6f}")
print(f" log marginal likelihood: {best_lml:.4f}")
print(f" Implied ridge lambda: {sigma_sq_opt / tau_sq_opt:.4f}")
return BayesianLinearRegression(tau_sq=tau_sq_opt, sigma_sq=sigma_sq_opt).fit(X, y)
# ─── Demo: 1D regression with uncertainty bands ──────────────────────────────
np.random.seed(42)
# True model: y = 0.5 + 2.0*x + noise
def true_fn(x):
return 0.5 + 2.0 * x
NOISE_STD = 0.4
N_TRAIN = 20
X_train_1d = np.random.uniform(-2, 2, N_TRAIN)
y_train = true_fn(X_train_1d) + np.random.randn(N_TRAIN) * NOISE_STD
# Design matrix with intercept: [1, x]
def make_features(x):
x = np.asarray(x).ravel()
return np.column_stack([np.ones_like(x), x])
X_train = make_features(X_train_1d)
# Fit Bayesian linear regression
blr = BayesianLinearRegression(tau_sq=10.0, sigma_sq=NOISE_STD**2)
blr.fit(X_train, y_train)
# Predict on test grid
X_test_1d = np.linspace(-3, 3, 200)
X_test = make_features(X_test_1d)
mean, std, lower, upper = blr.predict(X_test)
print("Bayesian Linear Regression Results:")
print(f" True weights: [0.5, 2.0]")
print(f" Posterior mean: {blr.m_N}")
print(f" Posterior std: {np.sqrt(np.diag(blr.S_N))}")
print(f"\nPredictive uncertainty at x=0 (in-distribution):")
i_zero = np.argmin(np.abs(X_test_1d))
print(f" Mean: {mean[i_zero]:.3f}, Std: {std[i_zero]:.3f}")
print(f" 95% CI: [{lower[i_zero]:.3f}, {upper[i_zero]:.3f}]")
print(f"\nPredictive uncertainty at x=2.9 (extrapolation):")
i_far = np.argmin(np.abs(X_test_1d - 2.9))
print(f" Mean: {mean[i_far]:.3f}, Std: {std[i_far]:.3f}")
print(f" 95% CI: [{lower[i_far]:.3f}, {upper[i_far]:.3f}]")
print(f" (Wider interval - extrapolating beyond training data)")
Uncertainty Grows With Distance: Visualising the Predictive Bands
A key property of Bayesian linear regression: predictive uncertainty is not uniform. It grows as moves away from the centroid of training data.
This is captured by the epistemic term . In regions where the training data clusters, is large, is small (tight posterior), and epistemic uncertainty is low. In regions away from the training centroid, does not constrain along that direction, and epistemic uncertainty is high.
import numpy as np
def uncertainty_profile(blr, X_test_1d, X_train_1d):
"""
Decompose predictive uncertainty into aleatoric and epistemic components.
"""
X_test = np.column_stack([np.ones_like(X_test_1d), X_test_1d])
# Full predictive variance
epistemic = np.sum((X_test @ blr.S_N) * X_test, axis=1)
aleatoric = blr.sigma_sq * np.ones_like(epistemic)
total = epistemic + aleatoric
# Centroid of training data
x_centroid = X_train_1d.mean()
print(f"\nUncertainty Decomposition (tau^2={blr.tau_sq:.2f}, sigma^2={blr.sigma_sq:.4f}):")
print(f"{'x*':>8} | {'Aleatoric':>12} | {'Epistemic':>12} | {'Total std':>10}")
print("-" * 50)
checkpoints = [-2.5, -1.0, x_centroid, 1.0, 2.5]
for x_val in checkpoints:
idx = np.argmin(np.abs(X_test_1d - x_val))
print(f"{x_val:>8.2f} | {np.sqrt(aleatoric[idx]):>12.4f} | "
f"{np.sqrt(epistemic[idx]):>12.4f} | "
f"{np.sqrt(total[idx]):>10.4f}")
return epistemic, aleatoric
# Run with our fitted model
epistemic, aleatoric = uncertainty_profile(blr, X_test_1d, X_train_1d)
How the Posterior Collapses With More Data
import numpy as np
def posterior_convergence_demo(true_theta, sigma_sq, tau_sq, dataset_sizes):
"""
Show how posterior mean approaches MLE and covariance shrinks with more data.
"""
np.random.seed(0)
d = len(true_theta)
print(f"{'n':>6} | {'Posterior Mean':>20} | {'Posterior Std (avg)':>20} | {'LML':>10}")
print("-" * 65)
for n in dataset_sizes:
X = np.random.randn(n, d)
y = X @ true_theta + np.random.randn(n) * np.sqrt(sigma_sq)
blr = BayesianLinearRegression(tau_sq=tau_sq, sigma_sq=sigma_sq)
blr.fit(X, y)
avg_std = np.sqrt(np.diag(blr.S_N)).mean()
lml = blr.log_marginal_likelihood()
mean_str = ", ".join(f"{m:.3f}" for m in blr.m_N)
print(f"{n:>6} | [{mean_str}] | {avg_std:>20.6f} | {lml:>10.2f}")
# Example: 2D problem
true_weights = np.array([1.5, -0.7])
posterior_convergence_demo(true_weights, sigma_sq=0.25, tau_sq=5.0,
dataset_sizes=[5, 10, 25, 50, 100, 500])
Heteroscedastic Bayesian Regression
Standard Bayesian linear regression assumes homoscedastic noise - the same everywhere. In many problems, noise varies with (heteroscedastic). A microphone records louder noise at higher frequencies. A financial model has higher volatility in certain market regimes.
Heteroscedastic extension: model - learn a separate linear model for the log noise variance. This requires iterative optimisation (no closed form) but is tractable.
Alternatively, use a Student-t likelihood instead of Gaussian. Student-t has heavier tails - it is more robust to outliers, assigning less weight to extreme observations:
where is the degrees of freedom parameter. As , this converges to the Gaussian likelihood. For –, it is dramatically more robust to outliers.
import numpy as np
from scipy.special import gammaln
def student_t_log_likelihood(y, mu, sigma, nu=4.0):
"""
Log likelihood of Student-t distribution.
More robust to outliers than Gaussian.
nu: degrees of freedom (larger = closer to Gaussian)
"""
n = len(y)
z = (y - mu) / sigma
log_p = (gammaln((nu + 1) / 2) - gammaln(nu / 2)
- 0.5 * np.log(nu * np.pi) - np.log(sigma)
- (nu + 1) / 2 * np.log(1 + z**2 / nu))
return log_p.sum()
def compare_gaussian_vs_student_t(X_train, y_clean, y_corrupted, sigma=0.5, nu=4.0):
"""
Compare Gaussian vs Student-t likelihood on corrupted data.
Shows robustness of heavy-tailed likelihood.
"""
# Gaussian MAP (ridge regression)
lam = sigma**2 # assume tau^2 = 1
theta_gaussian = np.linalg.solve(X_train.T @ X_train + lam * np.eye(X_train.shape[1]),
X_train.T @ y_corrupted)
print("Effect of outliers on parameter estimates:")
print(f" Gaussian MAP (corrupted data): {theta_gaussian}")
# For comparison: Gaussian MAP on clean data
theta_clean = np.linalg.solve(X_train.T @ X_train + lam * np.eye(X_train.shape[1]),
X_train.T @ y_clean)
print(f" Gaussian MAP (clean data): {theta_clean}")
print(f" Error from outliers: {np.linalg.norm(theta_gaussian - theta_clean):.4f}")
sklearn-Compatible Wrapper
from sklearn.base import BaseEstimator, RegressorMixin
import numpy as np
class BayesianRidgeFromScratch(BaseEstimator, RegressorMixin):
"""
sklearn-compatible Bayesian linear regression.
Fits hyperparameters via empirical Bayes (evidence maximisation).
"""
def __init__(self, tau_sq=1.0, sigma_sq=0.1, fit_hyperparams=True, n_restarts=3):
self.tau_sq = tau_sq
self.sigma_sq = sigma_sq
self.fit_hyperparams = fit_hyperparams
self.n_restarts = n_restarts
def fit(self, X, y):
if self.fit_hyperparams:
# Optimise hyperparameters via marginal likelihood
self.blr_ = empirical_bayes(X, y, n_restarts=self.n_restarts)
else:
self.blr_ = BayesianLinearRegression(
tau_sq=self.tau_sq, sigma_sq=self.sigma_sq
).fit(X, y)
return self
def predict(self, X, return_std=False):
if return_std:
mean, std, _, _ = self.blr_.predict(X)
return mean, std
return self.blr_.predict(X, return_std=False)
def score(self, X, y):
"""R^2 score using posterior mean predictions."""
y_pred = self.predict(X)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
return 1 - ss_res / ss_tot
# ─── Compare with sklearn's BayesianRidge ────────────────────────────────────
import numpy as np
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
np.random.seed(42)
X, y = make_regression(n_samples=100, n_features=10, noise=15.0, random_state=42)
# Our implementation
our_model = BayesianRidgeFromScratch(fit_hyperparams=True)
our_model.fit(X, y)
# sklearn's BayesianRidge (uses different parameterisation but same concept)
sklearn_br = BayesianRidge()
# Ridge regression baseline
ridge = Ridge()
for name, model in [("Our BLR", our_model), ("sklearn BayesianRidge", sklearn_br),
("Ridge", ridge)]:
scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')
print(f"{name:>25}: RMSE = {-scores.mean():.3f} ± {scores.std():.3f}")
Basis Function Expansion: Nonlinear Bayesian Regression
Bayesian linear regression is linear in the parameters, not necessarily in the inputs. By expanding inputs through a fixed feature map , we get nonlinear regression with all the same Bayesian machinery:
Common feature maps:
| Feature map | Effect | Example |
|---|---|---|
| Polynomial regression | Smooth curves | |
| Fourier features | Periodic functions | |
| Radial basis functions | Flexible nonlinear | |
| Neural network layers | Deep features | Image/text tasks |
With a Gaussian prior on and any fixed , the posterior is still Gaussian - all closed-form results carry over directly. This is why Bayesian linear regression with RBF features is equivalent to a GP with an RBF kernel - and explains why GPs are sometimes called "Bayesian linear regression in infinite-dimensional feature space."
import numpy as np
def polynomial_features(x, degree=3):
"""Expand scalar x to polynomial features [1, x, x^2, ..., x^degree]."""
x = np.asarray(x).ravel()
return np.column_stack([x ** k for k in range(degree + 1)])
def rbf_features(x, centres, length_scale=1.0):
"""Expand x to RBF features centred at each point in centres."""
x = np.asarray(x).ravel()
centres = np.asarray(centres).ravel()
diff = x[:, None] - centres[None, :] # (n, m)
return np.exp(-diff**2 / (2 * length_scale**2))
np.random.seed(42)
# Nonlinear true function
def nonlinear_fn(x):
return np.sin(2 * x) + 0.3 * x**2
X_1d = np.sort(np.random.uniform(-3, 3, 30))
y_nl = nonlinear_fn(X_1d) + np.random.randn(30) * 0.3
# Polynomial Bayesian regression
X_poly = polynomial_features(X_1d, degree=5)
blr_poly = BayesianLinearRegression(tau_sq=10.0, sigma_sq=0.09)
blr_poly.fit(X_poly, y_nl)
# RBF Bayesian regression
centres = np.linspace(-3, 3, 15)
X_rbf = rbf_features(X_1d, centres, length_scale=0.5)
blr_rbf = BayesianLinearRegression(tau_sq=10.0, sigma_sq=0.09)
blr_rbf.fit(X_rbf, y_nl)
# Evaluate on test set
X_test_1d = np.linspace(-3.5, 3.5, 200)
X_test_poly = polynomial_features(X_test_1d, degree=5)
X_test_rbf = rbf_features(X_test_1d, centres, length_scale=0.5)
mean_poly, std_poly, _, _ = blr_poly.predict(X_test_poly)
mean_rbf, std_rbf, _, _ = blr_rbf.predict(X_test_rbf)
true_test = nonlinear_fn(X_test_1d)
rmse_poly = np.sqrt(np.mean((mean_poly - true_test)**2))
rmse_rbf = np.sqrt(np.mean((mean_rbf - true_test)**2))
print(f"Polynomial features (degree 5): RMSE = {rmse_poly:.4f}")
print(f"RBF features (15 centres): RMSE = {rmse_rbf:.4f}")
print("\nNote: RBF features essentially approximate GP regression,")
print("recovering calibrated uncertainty similar to a GP with RBF kernel.")
Active Learning with Bayesian Linear Regression
A direct application of predictive uncertainty: query the point where prediction uncertainty is highest.
import numpy as np
def active_learning_query(blr, X_candidates, n_query=5):
"""
Select the n_query most uncertain candidate points.
Uncertainty = total predictive std (epistemic + aleatoric).
"""
_, std, _, _ = blr.predict(X_candidates)
# Epistemic only: more data reduces this - the right criterion for active learning
epistemic = np.sum((X_candidates @ blr.S_N) * X_candidates, axis=1)
epistemic_std = np.sqrt(epistemic)
# Rank by epistemic uncertainty (most uncertain = most informative)
ranked_idx = np.argsort(epistemic_std)[::-1]
top_idx = ranked_idx[:n_query]
print(f"Top {n_query} points to query (highest epistemic uncertainty):")
for rank, i in enumerate(top_idx):
print(f" {rank+1}. Index {i}: epistemic std = {epistemic_std[i]:.4f}")
return top_idx
np.random.seed(0)
# Initial 10 training points
X_init = np.random.uniform(-2, 2, 10)
y_init = true_fn(X_init) + np.random.randn(10) * 0.4
X_init_feat = make_features(X_init)
blr_active = BayesianLinearRegression(tau_sq=5.0, sigma_sq=0.16)
blr_active.fit(X_init_feat, y_init)
# Candidate pool: 100 unlabelled points
X_pool = np.linspace(-3, 3, 100)
X_pool_feat = make_features(X_pool)
active_idx = active_learning_query(blr_active, X_pool_feat, n_query=5)
Production Engineering Notes
:::tip When to choose Bayesian linear regression over ridge Use Bayesian linear regression when you need calibrated uncertainty intervals alongside predictions - sensor fusion, engineering tolerance checks, financial risk intervals. Use ridge regression when you only need the point estimate and computational overhead matters. Ridge is a special case (MAP) that discards the posterior covariance and is therefore cheaper. :::
:::warning Computing S_N for high-dimensional problems The posterior covariance is a matrix. For features, storing and inverting requires MB of memory and computation. Use the Woodbury identity to compute the dual form instead: . When , the dual form is dramatically cheaper. :::
:::danger Posterior covariance and prediction intervals are not prediction error bounds A 95% Bayesian credible interval means: given the model assumptions and data, there is a 95% posterior probability the true parameter (or future observation) falls in this range. It is not a frequentist confidence interval. If the model is misspecified (wrong prior, wrong likelihood), the credible interval can be badly miscalibrated. Always perform posterior predictive checks: do observed data look like samples from your model? :::
:::note The Woodbury matrix identity for efficient inference When , compute the posterior using the dual form via Woodbury:
This inverts an matrix instead of a matrix - crucial when features are high-dimensional (e.g., polynomial or RBF features with many centres). :::
YouTube Resources
- Bayesian Linear Regression - Mathematicalmonk (YouTube): rigorous derivation of the posterior from scratch, highly recommended for the full derivation
- Bayesian Methods for Machine Learning - Coursera (Higher School of Economics): week-by-week treatment with Bayesian linear regression as the foundation
- Linear Regression as a Bayesian Model - Brandon Rohrer: accessible to practitioners, connects L2 regularisation to Gaussian prior clearly
- Evidence Approximation and Empirical Bayes - Carl Edward Rasmussen (Cambridge GP lectures): connects Bayesian linear regression to GP marginal likelihood maximisation
Interview Q&A
Q1: What is the difference between the ridge regression solution and Bayesian linear regression?
Ridge regression gives a single point estimate: . This is exactly the MAP estimate (mode of the posterior) from Bayesian linear regression with a zero-mean Gaussian prior where . Bayesian linear regression goes further: it maintains the full posterior covariance . This covariance enables computing the predictive distribution at test points - including both the aleatoric noise and epistemic uncertainty . Ridge discards and cannot provide uncertainty intervals. If you only need point estimates and care about speed, use ridge. If you need calibrated uncertainty, use full Bayesian.
Q2: How does predictive uncertainty in Bayesian linear regression decompose?
The posterior predictive variance at a test point is . The first term is aleatoric uncertainty - irreducible measurement noise. It is the same everywhere regardless of how much training data exists or how far the test point is from training data. The second term is epistemic uncertainty - uncertainty about the weights. It shrinks toward zero as more data arrives (because as ) and is larger in directions not well-constrained by the training data. Epistemic uncertainty is highest far from the training data centroid - the model is extrapolating and correctly says so.
Q3: How does evidence maximisation (empirical Bayes) differ from cross-validation for hyperparameter selection?
Evidence maximisation maximises the marginal likelihood - integrating out the weights analytically. This gives a principled single-objective over hyperparameters that does not require a held-out validation set. Cross-validation splits data, trains on subsets, and measures held-out performance - it is more general (works for any model) but wastes data and is computationally expensive. For conjugate Bayesian models like Gaussian-Gaussian, evidence maximisation is preferred: it uses all data for both training and hyperparameter selection, and it optimises the right objective (marginal likelihood implements automatic Occam's razor). For non-conjugate models, cross-validation is usually the practical choice.
Q4: Why does Bayesian linear regression with RBF features approximate a Gaussian process?
A Gaussian process with RBF kernel can be interpreted as Bayesian linear regression with an infinite-dimensional feature map. Specifically, any kernel satisfying Mercer's theorem can be written as for some (possibly infinite-dimensional) feature map . With a finite set of RBF basis functions centred at locations, Bayesian linear regression in this feature space approximates a GP with RBF kernel - the approximation quality improves with more basis functions. This connection is exploited in random Fourier features (Rahimi & Recht, 2007): approximate a GP with a finite random feature map, enabling GP-quality uncertainty at cost instead of .
Q5: How would you use Bayesian linear regression for online learning as new data arrives?
Bayesian linear regression supports exact sequential updating. After observing batch 1 and computing posterior , treat this posterior as the prior for batch 2. The update is: and . This is the Kalman filter for linear regression - no need to store historical data. Each new batch updates the precision matrix and mean vector in time, compared to for a full re-fit. In production, this enables streaming inference: as sensor readings arrive, update beliefs about calibration parameters in real time without reprocessing historical data.
Common Mistakes
:::danger Using point predictions without uncertainty for downstream decisions Bayesian linear regression's main value is the predictive distribution, not just the mean. Using only and discarding the posterior covariance is equivalent to using ridge regression - the Bayesian computation was wasted. Always propagate uncertainty to downstream decisions: compute the probability of exceeding a threshold, the expected utility under the predictive distribution, or the credible interval for a compliance check. :::
:::warning Assuming the Gaussian prior is appropriate A Gaussian prior on weights works well when you have no strong domain knowledge and want a generic regulariser. But if you know some weights should be exactly zero (sparse signal), a Laplace prior (giving MAP = Lasso) or a spike-and-slab prior is more appropriate. If you know weights should be positive (e.g., a price coefficient), a log-normal or gamma prior is better. The choice of prior family should reflect actual beliefs - defaulting to Gaussian is convenient but not always correct. :::
:::danger Forgetting to include an intercept term If the true function has a non-zero mean and you do not include an intercept (column of ones in ), the Gaussian prior pulls predictions toward zero everywhere. This is prior misspecification - the zero-mean prior applies to the linear coefficients, not to the function values. Always include an intercept unless you have explicitly verified that the true function has zero mean at . :::
:::warning High-dimensional storage The posterior covariance is . For features, this is 8 MB - fine. For features, this is 80 GB - impossible. Use the dual (Woodbury) form when , or use sparse approximations to (diagonal approximation, low-rank approximation). Mean-field variational inference (Lesson 04) uses a diagonal approximation to avoid this issue in deep learning. :::
Practice Problems
-
Derive the posterior mean and covariance for Bayesian linear regression from scratch. Start with , expand the quadratic, and complete the square.
-
Show that the MAP estimate of Bayesian linear regression with Gaussian prior is exactly the ridge regression solution. Write out both expressions and match terms.
-
Implement Bayesian linear regression for a 1D problem with polynomial features of degree 4. Plot the predictive mean and 95% credible interval as a function of . Observe how the credible interval narrows with more data and how it widens in extrapolation regions.
-
Implement the sequential Bayesian update rule and verify that processing data in two batches gives the same posterior as processing all data at once. Use a 2D regression problem with 50 total observations split as 20 + 30.
-
Compare the predictive intervals of Bayesian linear regression with those of standard linear regression (OLS with bootstrap confidence intervals). Use 100 Monte Carlo simulations, generating fresh training sets of size 20 each time, and plot the empirical coverage of each method's claimed 95% intervals. Which method is better calibrated and why?
:::tip 🎮 Interactive Playground
Visualize this concept: Try the MLE & MAP Explorer demo on the EngineersOfAI Playground - no code required.
:::
