What is Bayesian linear regression?

How placing a prior on linear regression weights gives a full posterior distribution over predictions - with closed-form solutions, predictive uncertainty, and connections to ridge regression.

How does conjugate prior work in practice?

Bayesian Linear Regression - Uncertainty Estimates for Every Prediction covers Bayesian linear regression, conjugate prior, posterior predictive from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/bayesian-ml/bayesian-linear-regression

What is the difference between Bayesian linear regression and posterior predictive?

See the full breakdown at https://engineersofai.com/docs/ml/bayesian-ml/bayesian-linear-regression

Bayesian Linear Regression - Uncertainty Estimates for Every Prediction

:::note Reading time: ~38 minutes | Interview relevance: Very High | Target roles: MLE, Data Scientist, Research Engineer, AI Engineer :::

The Real Interview Moment

A semiconductor fabrication facility needs to predict the resistance of a chip component from voltage measurements taken at five test points during manufacturing. Each resistance measurement is inherently noisy - the measurement apparatus introduces roughly ±0.2 ohms of random error. The measurements are correlated because all five voltage readings come from the same physical component.

A standard linear regression model fits $y = X\theta + \varepsilon$ and reports a single number: "predicted resistance: 47.3 ohms." The process engineer receiving this output needs to make a decision: does this component pass quality control? But 47.3 ohms with what confidence? The specification is $47.0 \pm 1.5$ ohms. Is the true value likely within spec, or is the 47.3 estimate itself uncertain?

Bayesian linear regression answers this directly. It reports: "predicted resistance: 47.3 ohms, 95% credible interval [45.8, 48.7] ohms." The engineer can now compute: given this interval, what fraction of the posterior predictive distribution falls within the specification band? That is the probability the component passes - a number the engineer can act on. The decision to pass, fail, or request re-measurement is now grounded in calibrated uncertainty.

This lesson derives Bayesian linear regression from first principles, shows exactly how the posterior is computed in closed form, connects it to ridge regression, and builds the full implementation from scratch. Bayesian linear regression is also the foundation for understanding Gaussian processes (the kernel methods of Lesson 02 are Bayesian linear regression in infinite-dimensional feature space) and for understanding variational inference for neural networks.

Why Bayesian Linear Regression Exists

Standard linear regression (OLS) solves:

$\hat{\theta}_{\text{OLS}} = \arg\min_\theta \|y - X\theta\|^2 = (X^\top X)^{-1} X^\top y$

This is clean, fast, and analytically exact. But it gives a single point estimate with no uncertainty. When you compute predictions $\hat{y} = X_{\text{test}}\hat{\theta}$ , you get a single number per test point.

Where does uncertainty come from? Two sources:

Observation noise $\sigma^2$ : even if we knew the true $\theta$ exactly, predictions would be noisy because $y = X\theta + \varepsilon$
Parameter uncertainty: we estimated $\theta$ from finite noisy data, so $\hat{\theta}$ is itself uncertain

OLS accounts for source 1 (confidence intervals on predictions use the residual variance) but ignores source 2 - it treats $\hat{\theta}$ as if it were exact. With large datasets, this is fine: parameter uncertainty is negligible. With small datasets, it is not: the estimate of $\theta$ is itself highly uncertain and that uncertainty must propagate to predictions.

Bayesian linear regression explicitly models both sources. It places a prior on $\theta$ , computes the exact posterior $p(\theta | X, y)$ , and then marginalises over $\theta$ to give a predictive distribution that correctly accounts for all uncertainty.

Model Setup

The standard Bayesian linear regression setup is:

Likelihood (data model): $y = X\theta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)$

Equivalently: $p(y | X, \theta) = \mathcal{N}(y | X\theta, \sigma^2 I)$

Prior (beliefs about weights before seeing data): $p(\theta) = \mathcal{N}(\theta | \mathbf{0}, \tau^2 I)$

The prior says: we believe the true weights are centred at zero, with scale $\tau$ . This encodes the regularisation belief that features should not have arbitrarily large effects.

The Posterior: Exact Derivation

We want the posterior $p(\theta | X, y) \propto p(y | X, \theta)\,p(\theta)$ .

Taking logarithms (since log is monotone, maximising the log is equivalent to maximising the probability):

$\log p(\theta | X, y) = \log p(y | X, \theta) + \log p(\theta) + \text{const}$

$= -\frac{1}{2\sigma^2}\|y - X\theta\|^2 - \frac{1}{2\tau^2}\|\theta\|^2 + \text{const}$

Expanding and collecting terms in $\theta$ :

$= -\frac{1}{2}\theta^\top \underbrace{\left(\frac{1}{\sigma^2}X^\top X + \frac{1}{\tau^2}I\right)}_{S_N^{-1}} \theta + \theta^\top \underbrace{\frac{1}{\sigma^2}X^\top y}_{S_N^{-1} m_N} + \text{const}$

This is a quadratic form in $\theta$ - the log of a Gaussian. Completing the square, the posterior is:

$\boxed{p(\theta | X, y) = \mathcal{N}(\theta | m_N, S_N)}$

where:

$\boxed{S_N = \left(\frac{1}{\tau^2}I + \frac{1}{\sigma^2}X^\top X\right)^{-1}}$

$\boxed{m_N = \frac{1}{\sigma^2} S_N X^\top y}$

The matrix $S_N$ is the posterior covariance of the weights - a $d \times d$ matrix encoding how uncertain we are about each weight and how weights co-vary. The vector $m_N$ is the posterior mean - our best single-point estimate of $\theta$ .

Sequential Updating

A beautiful property: if we observe data in batches, the posterior updates identically to processing all data at once. After batch 1 giving posterior $\mathcal{N}(m_1, S_1)$ , the prior for batch 2 is $\mathcal{N}(m_1, S_1)$ , and the new posterior is:

$S_2^{-1} = S_1^{-1} + \frac{1}{\sigma^2}X_2^\top X_2, \qquad m_2 = S_2\left(S_1^{-1} m_1 + \frac{1}{\sigma^2}X_2^\top y_2\right)$

This is the basis of the Kalman filter - recursive Bayesian linear regression for sequential state estimation.

Connection to Ridge Regression

The MAP estimate (mode of the posterior, which for a Gaussian equals the mean) is:

$\hat{\theta}_{\text{MAP}} = m_N = \frac{1}{\sigma^2} S_N X^\top y = \left(\frac{\sigma^2}{\tau^2}I + X^\top X\right)^{-1} X^\top y$

Compare to the ridge regression solution:

$\hat{\theta}_{\text{ridge}} = \left(\lambda I + X^\top X\right)^{-1} X^\top y$

They are identical with $\lambda = \sigma^2 / \tau^2$ . Ridge regression is MAP inference under a Gaussian prior.

This gives a clean Bayesian interpretation of the regularisation coefficient:

Large $\tau^2$ (broad prior, weak regularisation): $\lambda \to 0$ , solution converges to OLS
Small $\tau^2$ (tight prior, strong regularisation): large $\lambda$ , solution shrinks toward zero
$\sigma^2 / \tau^2 = \lambda$ : the ratio of noise variance to prior variance sets the regularisation strength

The key insight: MAP (and ridge) gives only the mean, discarding the full posterior covariance $S_N$ . The full Bayesian approach keeps $S_N$ and uses it to compute predictive uncertainty.

The Posterior Predictive Distribution

Given a new input $x^*$ , the predictive distribution marginalises over all possible weight vectors:

$p(y^*|x^*, X, y) = \int p(y^*|x^*, \theta)\,p(\theta|X,y)\,d\theta$

Since both the likelihood and posterior are Gaussian, this integral has a closed form:

$\boxed{p(y^*|x^*, X, y) = \mathcal{N}\!\left(y^* \;\middle|\; m_N^\top x^*,\;\; \sigma^2 + x^{*\top} S_N x^*\right)}$

The predictive variance has two components:

$\text{Var}[y^*] = \underbrace{\sigma^2}_{\text{aleatoric}} + \underbrace{x^{*\top} S_N x^*}_{\text{epistemic}}$

Term	Name	Meaning	Reducible?
$\sigma^2$	Aleatoric (noise)	Observation noise - irreducible	No
$x^{\top} S_N x^$	Epistemic (model)	Uncertainty about parameters	Yes - more data shrinks $S_N$

As $n \to \infty$ , the posterior covariance $S_N \to 0$ (data overwhelms the prior), and epistemic uncertainty vanishes. The only remaining uncertainty is the irreducible noise $\sigma^2$ .

The 95% credible interval for a new prediction is:

$\left[m_N^\top x^* - 1.96\sqrt{\sigma^2 + x^{*\top} S_N x^*},\;\; m_N^\top x^* + 1.96\sqrt{\sigma^2 + x^{*\top} S_N x^*}\right]$

This is what the sensor calibration engineer needs - a honest statement of uncertainty, not a single point.

Evidence Maximisation (Empirical Bayes)

How do we set $\tau^2$ (prior variance) and $\sigma^2$ (noise variance)? Two options:

Specify manually: use domain knowledge. If you believe weights should typically be within ±1, set $\tau^2 = 1$ .
Empirical Bayes (evidence maximisation): learn $\tau^2$ and $\sigma^2$ from data by maximising the marginal likelihood $p(y|X, \tau^2, \sigma^2)$ .

The marginal likelihood integrates out the weights:

$p(y|X, \tau^2, \sigma^2) = \int p(y|X,\theta)\,p(\theta|\tau^2)\,d\theta$

For Gaussian-Gaussian conjugate models, this integral is tractable:

$p(y|X, \tau^2, \sigma^2) = \mathcal{N}\!\left(y \;\middle|\; \mathbf{0},\; \sigma^2 I + \tau^2 X X^\top\right)$

Taking the log:

$\log p(y|X) = -\frac{1}{2}y^\top \left(\sigma^2 I + \tau^2 X X^\top\right)^{-1} y - \frac{1}{2}\log\det\!\left(\sigma^2 I + \tau^2 X X^\top\right) - \frac{n}{2}\log 2\pi$

This is the same form as the GP marginal likelihood from Lesson 02 - and for good reason: a GP with a linear kernel is exactly Bayesian linear regression.

Gradients with respect to $\log \sigma^2$ and $\log \tau^2$ can be computed analytically and used with L-BFGS-B. This automatically selects the regularisation strength from data - no cross-validation needed.

From Scratch: Full Implementation

import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import minimize
from scipy.stats import norm

# ─── Bayesian Linear Regression ──────────────────────────────────────────────

class BayesianLinearRegression:
    """
    Exact Bayesian linear regression with Gaussian prior.

    Model:
        y = X @ theta + eps,  eps ~ N(0, sigma^2)
        theta ~ N(0, tau^2 * I)

    Posterior: theta | X, y ~ N(m_N, S_N)
    Predictive: y* | x*, X, y ~ N(m_N.T @ x*, sigma^2 + x*.T @ S_N @ x*)
    """

    def __init__(self, tau_sq=1.0, sigma_sq=0.1):
        """
        tau_sq:   prior variance on weights
        sigma_sq: observation noise variance
        """
        self.tau_sq = tau_sq
        self.sigma_sq = sigma_sq
        self._fitted = False

    def fit(self, X, y):
        """
        Compute posterior distribution over weights.

        S_N = (I/tau^2 + X.T @ X / sigma^2)^{-1}
        m_N = (1/sigma^2) * S_N @ X.T @ y
        """
        X = np.atleast_2d(X)
        if X.shape[0] == 1 and X.shape[1] != X.shape[0]:
            X = X.T   # handle 1D input
        y = y.ravel()
        n, d = X.shape

        # Posterior precision matrix: Lambda_N = I/tau^2 + X.T X / sigma^2
        precision_prior = np.eye(d) / self.tau_sq
        precision_likelihood = X.T @ X / self.sigma_sq
        Lambda_N = precision_prior + precision_likelihood

        # Posterior covariance: S_N = Lambda_N^{-1}
        # Use Cholesky for numerical stability
        L = cholesky(Lambda_N, lower=True)
        # S_N @ Lambda_N = I  =>  solve two triangular systems
        identity = np.eye(d)
        tmp = solve_triangular(L, identity, lower=True)
        self.S_N = solve_triangular(L.T, tmp, lower=False)  # (d, d)

        # Posterior mean: m_N = (1/sigma^2) * S_N @ X.T @ y
        self.m_N = (self.S_N @ X.T @ y) / self.sigma_sq

        self.X_train = X
        self.y_train = y
        self._fitted = True
        return self

    def predict(self, X_test, return_std=True, return_interval=True, alpha=0.05):
        """
        Posterior predictive distribution.

        Returns:
            mean: posterior predictive mean (n_test,)
            std:  posterior predictive std (n_test,)  [aleatoric + epistemic]
            lower, upper: (1-alpha) credible interval
        """
        assert self._fitted, "Call fit() first."
        X_test = np.atleast_2d(X_test)

        # Predictive mean: m_N.T @ x*  for each test point
        mean = X_test @ self.m_N   # (n_test,)

        if not return_std:
            return mean

        # Predictive variance: sigma^2 + x*^T @ S_N @ x*
        # Vectorised: diag(X_test @ S_N @ X_test.T) + sigma^2
        epistemic = np.sum((X_test @ self.S_N) * X_test, axis=1)   # (n_test,)
        aleatoric = self.sigma_sq
        pred_var = aleatoric + epistemic
        pred_std = np.sqrt(pred_var)

        if not return_interval:
            return mean, pred_std

        # Credible interval
        z = norm.ppf(1 - alpha / 2)
        lower = mean - z * pred_std
        upper = mean + z * pred_std

        return mean, pred_std, lower, upper

    def posterior_samples(self, X_test, n_samples=100):
        """
        Draw samples from the posterior predictive distribution.
        Each sample corresponds to drawing theta ~ posterior, then y* ~ N(x*.T theta, sigma^2).
        """
        assert self._fitted
        # Sample weight vectors from posterior
        theta_samples = np.random.multivariate_normal(
            self.m_N, self.S_N, size=n_samples
        )   # (n_samples, d)

        X_test = np.atleast_2d(X_test)
        # Predicted function values for each sample (before noise)
        f_samples = (X_test @ theta_samples.T).T   # (n_samples, n_test)

        # Add observation noise
        noise = np.random.randn(*f_samples.shape) * np.sqrt(self.sigma_sq)
        return f_samples + noise

    def log_marginal_likelihood(self):
        """
        log p(y | X, tau^2, sigma^2)
        = log N(y | 0, sigma^2 I + tau^2 X X.T)

        Used for empirical Bayes hyperparameter optimisation.
        """
        assert self._fitted
        n = len(self.y_train)
        # Marginal covariance: C = sigma^2 I + tau^2 X X.T
        C = self.sigma_sq * np.eye(n) + self.tau_sq * (self.X_train @ self.X_train.T)

        try:
            L = cholesky(C, lower=True)
        except np.linalg.LinAlgError:
            return -np.inf

        # log N(y | 0, C) = -0.5 y.T C^{-1} y - 0.5 log det C - n/2 log(2pi)
        v = solve_triangular(L, self.y_train, lower=True)
        log_det = 2 * np.sum(np.log(np.diag(L)))
        data_fit = -0.5 * v @ v
        complexity = -0.5 * log_det
        constant = -n / 2 * np.log(2 * np.pi)
        return data_fit + complexity + constant


# ─── Empirical Bayes: learn hyperparameters from data ────────────────────────

def empirical_bayes(X, y, n_restarts=5):
    """
    Maximise log marginal likelihood over (tau^2, sigma^2).
    Returns BayesianLinearRegression with optimised hyperparameters.
    """
    best_lml = -np.inf
    best_params = None

    for _ in range(n_restarts):
        log_params0 = np.random.uniform(-3, 3, size=2)

        def neg_lml(log_params):
            tau_sq = np.exp(log_params[0])
            sigma_sq = np.exp(log_params[1])
            blr = BayesianLinearRegression(tau_sq=tau_sq, sigma_sq=sigma_sq)
            blr.fit(X, y)
            return -blr.log_marginal_likelihood()

        result = minimize(neg_lml, log_params0, method='L-BFGS-B',
                          options={'maxiter': 500})

        if -result.fun > best_lml:
            best_lml = -result.fun
            best_params = np.exp(result.x)

    tau_sq_opt, sigma_sq_opt = best_params
    print(f"Empirical Bayes:")
    print(f"  tau^2 (prior variance):   {tau_sq_opt:.6f}")
    print(f"  sigma^2 (noise variance): {sigma_sq_opt:.6f}")
    print(f"  log marginal likelihood:  {best_lml:.4f}")
    print(f"  Implied ridge lambda:     {sigma_sq_opt / tau_sq_opt:.4f}")

    return BayesianLinearRegression(tau_sq=tau_sq_opt, sigma_sq=sigma_sq_opt).fit(X, y)


# ─── Demo: 1D regression with uncertainty bands ──────────────────────────────

np.random.seed(42)

# True model: y = 0.5 + 2.0*x + noise
def true_fn(x):
    return 0.5 + 2.0 * x

NOISE_STD = 0.4
N_TRAIN = 20

X_train_1d = np.random.uniform(-2, 2, N_TRAIN)
y_train = true_fn(X_train_1d) + np.random.randn(N_TRAIN) * NOISE_STD

# Design matrix with intercept: [1, x]
def make_features(x):
    x = np.asarray(x).ravel()
    return np.column_stack([np.ones_like(x), x])

X_train = make_features(X_train_1d)

# Fit Bayesian linear regression
blr = BayesianLinearRegression(tau_sq=10.0, sigma_sq=NOISE_STD**2)
blr.fit(X_train, y_train)

# Predict on test grid
X_test_1d = np.linspace(-3, 3, 200)
X_test = make_features(X_test_1d)

mean, std, lower, upper = blr.predict(X_test)

print("Bayesian Linear Regression Results:")
print(f"  True weights:     [0.5, 2.0]")
print(f"  Posterior mean:   {blr.m_N}")
print(f"  Posterior std:    {np.sqrt(np.diag(blr.S_N))}")
print(f"\nPredictive uncertainty at x=0 (in-distribution):")
i_zero = np.argmin(np.abs(X_test_1d))
print(f"  Mean: {mean[i_zero]:.3f}, Std: {std[i_zero]:.3f}")
print(f"  95% CI: [{lower[i_zero]:.3f}, {upper[i_zero]:.3f}]")

print(f"\nPredictive uncertainty at x=2.9 (extrapolation):")
i_far = np.argmin(np.abs(X_test_1d - 2.9))
print(f"  Mean: {mean[i_far]:.3f}, Std: {std[i_far]:.3f}")
print(f"  95% CI: [{lower[i_far]:.3f}, {upper[i_far]:.3f}]")
print(f"  (Wider interval - extrapolating beyond training data)")

Uncertainty Grows With Distance: Visualising the Predictive Bands

A key property of Bayesian linear regression: predictive uncertainty is not uniform. It grows as $x^*$ moves away from the centroid of training data.

This is captured by the epistemic term $x^{*\top} S_N x^*$ . In regions where the training data clusters, $X^\top X$ is large, $S_N$ is small (tight posterior), and epistemic uncertainty is low. In regions away from the training centroid, $X^\top X$ does not constrain $\theta$ along that direction, and epistemic uncertainty is high.

import numpy as np

def uncertainty_profile(blr, X_test_1d, X_train_1d):
    """
    Decompose predictive uncertainty into aleatoric and epistemic components.
    """
    X_test = np.column_stack([np.ones_like(X_test_1d), X_test_1d])

    # Full predictive variance
    epistemic = np.sum((X_test @ blr.S_N) * X_test, axis=1)
    aleatoric = blr.sigma_sq * np.ones_like(epistemic)
    total = epistemic + aleatoric

    # Centroid of training data
    x_centroid = X_train_1d.mean()

    print(f"\nUncertainty Decomposition (tau^2={blr.tau_sq:.2f}, sigma^2={blr.sigma_sq:.4f}):")
    print(f"{'x*':>8} | {'Aleatoric':>12} | {'Epistemic':>12} | {'Total std':>10}")
    print("-" * 50)
    checkpoints = [-2.5, -1.0, x_centroid, 1.0, 2.5]
    for x_val in checkpoints:
        idx = np.argmin(np.abs(X_test_1d - x_val))
        print(f"{x_val:>8.2f} | {np.sqrt(aleatoric[idx]):>12.4f} | "
              f"{np.sqrt(epistemic[idx]):>12.4f} | "
              f"{np.sqrt(total[idx]):>10.4f}")

    return epistemic, aleatoric

# Run with our fitted model
epistemic, aleatoric = uncertainty_profile(blr, X_test_1d, X_train_1d)

How the Posterior Collapses With More Data

import numpy as np

def posterior_convergence_demo(true_theta, sigma_sq, tau_sq, dataset_sizes):
    """
    Show how posterior mean approaches MLE and covariance shrinks with more data.
    """
    np.random.seed(0)
    d = len(true_theta)

    print(f"{'n':>6} | {'Posterior Mean':>20} | {'Posterior Std (avg)':>20} | {'LML':>10}")
    print("-" * 65)

    for n in dataset_sizes:
        X = np.random.randn(n, d)
        y = X @ true_theta + np.random.randn(n) * np.sqrt(sigma_sq)

        blr = BayesianLinearRegression(tau_sq=tau_sq, sigma_sq=sigma_sq)
        blr.fit(X, y)

        avg_std = np.sqrt(np.diag(blr.S_N)).mean()
        lml = blr.log_marginal_likelihood()
        mean_str = ", ".join(f"{m:.3f}" for m in blr.m_N)
        print(f"{n:>6} | [{mean_str}] | {avg_std:>20.6f} | {lml:>10.2f}")

# Example: 2D problem
true_weights = np.array([1.5, -0.7])
posterior_convergence_demo(true_weights, sigma_sq=0.25, tau_sq=5.0,
                           dataset_sizes=[5, 10, 25, 50, 100, 500])

Heteroscedastic Bayesian Regression

Standard Bayesian linear regression assumes homoscedastic noise - the same $\sigma^2$ everywhere. In many problems, noise varies with $x$ (heteroscedastic). A microphone records louder noise at higher frequencies. A financial model has higher volatility in certain market regimes.

Heteroscedastic extension: model $\log \sigma^2(x) = w_{\sigma}^\top \phi(x)$ - learn a separate linear model for the log noise variance. This requires iterative optimisation (no closed form) but is tractable.

Alternatively, use a Student-t likelihood instead of Gaussian. Student-t has heavier tails - it is more robust to outliers, assigning less weight to extreme observations:

$p(y_i | x_i, \theta) = \text{Student-t}(\nu, x_i^\top \theta, \sigma^2)$

where $\nu$ is the degrees of freedom parameter. As $\nu \to \infty$ , this converges to the Gaussian likelihood. For $\nu = 3$ – $5$ , it is dramatically more robust to outliers.

import numpy as np
from scipy.special import gammaln

def student_t_log_likelihood(y, mu, sigma, nu=4.0):
    """
    Log likelihood of Student-t distribution.
    More robust to outliers than Gaussian.
    nu: degrees of freedom (larger = closer to Gaussian)
    """
    n = len(y)
    z = (y - mu) / sigma
    log_p = (gammaln((nu + 1) / 2) - gammaln(nu / 2)
             - 0.5 * np.log(nu * np.pi) - np.log(sigma)
             - (nu + 1) / 2 * np.log(1 + z**2 / nu))
    return log_p.sum()


def compare_gaussian_vs_student_t(X_train, y_clean, y_corrupted, sigma=0.5, nu=4.0):
    """
    Compare Gaussian vs Student-t likelihood on corrupted data.
    Shows robustness of heavy-tailed likelihood.
    """
    # Gaussian MAP (ridge regression)
    lam = sigma**2  # assume tau^2 = 1
    theta_gaussian = np.linalg.solve(X_train.T @ X_train + lam * np.eye(X_train.shape[1]),
                                     X_train.T @ y_corrupted)

    print("Effect of outliers on parameter estimates:")
    print(f"  Gaussian MAP (corrupted data):  {theta_gaussian}")

    # For comparison: Gaussian MAP on clean data
    theta_clean = np.linalg.solve(X_train.T @ X_train + lam * np.eye(X_train.shape[1]),
                                  X_train.T @ y_clean)
    print(f"  Gaussian MAP (clean data):      {theta_clean}")
    print(f"  Error from outliers: {np.linalg.norm(theta_gaussian - theta_clean):.4f}")

sklearn-Compatible Wrapper

from sklearn.base import BaseEstimator, RegressorMixin
import numpy as np

class BayesianRidgeFromScratch(BaseEstimator, RegressorMixin):
    """
    sklearn-compatible Bayesian linear regression.
    Fits hyperparameters via empirical Bayes (evidence maximisation).
    """

    def __init__(self, tau_sq=1.0, sigma_sq=0.1, fit_hyperparams=True, n_restarts=3):
        self.tau_sq = tau_sq
        self.sigma_sq = sigma_sq
        self.fit_hyperparams = fit_hyperparams
        self.n_restarts = n_restarts

    def fit(self, X, y):
        if self.fit_hyperparams:
            # Optimise hyperparameters via marginal likelihood
            self.blr_ = empirical_bayes(X, y, n_restarts=self.n_restarts)
        else:
            self.blr_ = BayesianLinearRegression(
                tau_sq=self.tau_sq, sigma_sq=self.sigma_sq
            ).fit(X, y)
        return self

    def predict(self, X, return_std=False):
        if return_std:
            mean, std, _, _ = self.blr_.predict(X)
            return mean, std
        return self.blr_.predict(X, return_std=False)

    def score(self, X, y):
        """R^2 score using posterior mean predictions."""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1 - ss_res / ss_tot


# ─── Compare with sklearn's BayesianRidge ────────────────────────────────────

import numpy as np
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

np.random.seed(42)
X, y = make_regression(n_samples=100, n_features=10, noise=15.0, random_state=42)

# Our implementation
our_model = BayesianRidgeFromScratch(fit_hyperparams=True)
our_model.fit(X, y)

# sklearn's BayesianRidge (uses different parameterisation but same concept)
sklearn_br = BayesianRidge()

# Ridge regression baseline
ridge = Ridge()

for name, model in [("Our BLR", our_model), ("sklearn BayesianRidge", sklearn_br),
                    ("Ridge", ridge)]:
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')
    print(f"{name:>25}: RMSE = {-scores.mean():.3f} ± {scores.std():.3f}")

Basis Function Expansion: Nonlinear Bayesian Regression

Bayesian linear regression is linear in the parameters, not necessarily in the inputs. By expanding inputs through a fixed feature map $\phi: \mathbb{R}^d \to \mathbb{R}^m$ , we get nonlinear regression with all the same Bayesian machinery:

$y = \phi(x)^\top \theta + \varepsilon$

Common feature maps:

Feature map $\phi(x)$	Effect	Example
$[1, x, x^2, \ldots, x^p]$	Polynomial regression	Smooth curves
$[\sin(kx), \cos(kx)]$	Fourier features	Periodic functions
$[\text{RBF}(x, c_i)]$	Radial basis functions	Flexible nonlinear
Neural network layers	Deep features	Image/text tasks

With a Gaussian prior on $\theta$ and any fixed $\phi$ , the posterior is still Gaussian - all closed-form results carry over directly. This is why Bayesian linear regression with RBF features is equivalent to a GP with an RBF kernel - and explains why GPs are sometimes called "Bayesian linear regression in infinite-dimensional feature space."

import numpy as np

def polynomial_features(x, degree=3):
    """Expand scalar x to polynomial features [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x).ravel()
    return np.column_stack([x ** k for k in range(degree + 1)])

def rbf_features(x, centres, length_scale=1.0):
    """Expand x to RBF features centred at each point in centres."""
    x = np.asarray(x).ravel()
    centres = np.asarray(centres).ravel()
    diff = x[:, None] - centres[None, :]   # (n, m)
    return np.exp(-diff**2 / (2 * length_scale**2))

np.random.seed(42)

# Nonlinear true function
def nonlinear_fn(x):
    return np.sin(2 * x) + 0.3 * x**2

X_1d = np.sort(np.random.uniform(-3, 3, 30))
y_nl = nonlinear_fn(X_1d) + np.random.randn(30) * 0.3

# Polynomial Bayesian regression
X_poly = polynomial_features(X_1d, degree=5)
blr_poly = BayesianLinearRegression(tau_sq=10.0, sigma_sq=0.09)
blr_poly.fit(X_poly, y_nl)

# RBF Bayesian regression
centres = np.linspace(-3, 3, 15)
X_rbf = rbf_features(X_1d, centres, length_scale=0.5)
blr_rbf = BayesianLinearRegression(tau_sq=10.0, sigma_sq=0.09)
blr_rbf.fit(X_rbf, y_nl)

# Evaluate on test set
X_test_1d = np.linspace(-3.5, 3.5, 200)
X_test_poly = polynomial_features(X_test_1d, degree=5)
X_test_rbf = rbf_features(X_test_1d, centres, length_scale=0.5)

mean_poly, std_poly, _, _ = blr_poly.predict(X_test_poly)
mean_rbf, std_rbf, _, _ = blr_rbf.predict(X_test_rbf)

true_test = nonlinear_fn(X_test_1d)
rmse_poly = np.sqrt(np.mean((mean_poly - true_test)**2))
rmse_rbf = np.sqrt(np.mean((mean_rbf - true_test)**2))

print(f"Polynomial features (degree 5): RMSE = {rmse_poly:.4f}")
print(f"RBF features (15 centres):      RMSE = {rmse_rbf:.4f}")
print("\nNote: RBF features essentially approximate GP regression,")
print("recovering calibrated uncertainty similar to a GP with RBF kernel.")

Active Learning with Bayesian Linear Regression

A direct application of predictive uncertainty: query the point where prediction uncertainty is highest.

import numpy as np

def active_learning_query(blr, X_candidates, n_query=5):
    """
    Select the n_query most uncertain candidate points.
    Uncertainty = total predictive std (epistemic + aleatoric).
    """
    _, std, _, _ = blr.predict(X_candidates)
    # Epistemic only: more data reduces this - the right criterion for active learning
    epistemic = np.sum((X_candidates @ blr.S_N) * X_candidates, axis=1)
    epistemic_std = np.sqrt(epistemic)

    # Rank by epistemic uncertainty (most uncertain = most informative)
    ranked_idx = np.argsort(epistemic_std)[::-1]
    top_idx = ranked_idx[:n_query]

    print(f"Top {n_query} points to query (highest epistemic uncertainty):")
    for rank, i in enumerate(top_idx):
        print(f"  {rank+1}. Index {i}: epistemic std = {epistemic_std[i]:.4f}")

    return top_idx


np.random.seed(0)
# Initial 10 training points
X_init = np.random.uniform(-2, 2, 10)
y_init = true_fn(X_init) + np.random.randn(10) * 0.4
X_init_feat = make_features(X_init)

blr_active = BayesianLinearRegression(tau_sq=5.0, sigma_sq=0.16)
blr_active.fit(X_init_feat, y_init)

# Candidate pool: 100 unlabelled points
X_pool = np.linspace(-3, 3, 100)
X_pool_feat = make_features(X_pool)

active_idx = active_learning_query(blr_active, X_pool_feat, n_query=5)

Production Engineering Notes

:::tip When to choose Bayesian linear regression over ridge Use Bayesian linear regression when you need calibrated uncertainty intervals alongside predictions - sensor fusion, engineering tolerance checks, financial risk intervals. Use ridge regression when you only need the point estimate and computational overhead matters. Ridge is a special case (MAP) that discards the posterior covariance and is therefore cheaper. :::

:::warning Computing S_N for high-dimensional problems The posterior covariance $S_N$ is a $d \times d$ matrix. For $d = 10{,}000$ features, storing and inverting $S_N$ requires $10{,}000^2 \approx 800$ MB of memory and $O(d^3)$ computation. Use the Woodbury identity to compute the $n \times n$ dual form instead: $S_N = \tau^2 I - \tau^4 X^\top (X X^\top \tau^2 + \sigma^2 I)^{-1} X$ . When $n \ll d$ , the dual form is dramatically cheaper. :::

:::danger Posterior covariance and prediction intervals are not prediction error bounds A 95% Bayesian credible interval means: given the model assumptions and data, there is a 95% posterior probability the true parameter (or future observation) falls in this range. It is not a frequentist confidence interval. If the model is misspecified (wrong prior, wrong likelihood), the credible interval can be badly miscalibrated. Always perform posterior predictive checks: do observed data look like samples from your model? :::

:::note The Woodbury matrix identity for efficient inference When $d \gg n$ , compute the posterior using the dual form via Woodbury:

$S_N = \tau^2\!\left[I - X^\top\!\left(XX^\top + \frac{\sigma^2}{\tau^2}I\right)^{-1}\!X\right]$

This inverts an $n \times n$ matrix instead of a $d \times d$ matrix - crucial when features are high-dimensional (e.g., polynomial or RBF features with many centres). :::

YouTube Resources

Bayesian Linear Regression - Mathematicalmonk (YouTube): rigorous derivation of the posterior from scratch, highly recommended for the full derivation
Bayesian Methods for Machine Learning - Coursera (Higher School of Economics): week-by-week treatment with Bayesian linear regression as the foundation
Linear Regression as a Bayesian Model - Brandon Rohrer: accessible to practitioners, connects L2 regularisation to Gaussian prior clearly
Evidence Approximation and Empirical Bayes - Carl Edward Rasmussen (Cambridge GP lectures): connects Bayesian linear regression to GP marginal likelihood maximisation

Interview Q&A

Q1: What is the difference between the ridge regression solution and Bayesian linear regression?

Ridge regression gives a single point estimate: $\hat{\theta} = (X^\top X + \lambda I)^{-1}X^\top y$ . This is exactly the MAP estimate (mode of the posterior) from Bayesian linear regression with a zero-mean Gaussian prior where $\lambda = \sigma^2/\tau^2$ . Bayesian linear regression goes further: it maintains the full posterior covariance $S_N = (\frac{1}{\tau^2}I + \frac{1}{\sigma^2}X^\top X)^{-1}$ . This covariance enables computing the predictive distribution at test points - including both the aleatoric noise $\sigma^2$ and epistemic uncertainty $x^{*\top}S_N x^*$ . Ridge discards $S_N$ and cannot provide uncertainty intervals. If you only need point estimates and care about speed, use ridge. If you need calibrated uncertainty, use full Bayesian.

Q2: How does predictive uncertainty in Bayesian linear regression decompose?

The posterior predictive variance at a test point $x^*$ is $\text{Var}[y^*] = \sigma^2 + x^{*\top}S_N x^*$ . The first term $\sigma^2$ is aleatoric uncertainty - irreducible measurement noise. It is the same everywhere regardless of how much training data exists or how far the test point is from training data. The second term $x^{*\top}S_N x^*$ is epistemic uncertainty - uncertainty about the weights. It shrinks toward zero as more data arrives (because $S_N \to 0$ as $n \to \infty$ ) and is larger in directions not well-constrained by the training data. Epistemic uncertainty is highest far from the training data centroid - the model is extrapolating and correctly says so.

Q3: How does evidence maximisation (empirical Bayes) differ from cross-validation for hyperparameter selection?

Evidence maximisation maximises the marginal likelihood $p(y|X, \tau^2, \sigma^2) = \int p(y|X,\theta)p(\theta|\tau^2) d\theta$ - integrating out the weights analytically. This gives a principled single-objective over hyperparameters that does not require a held-out validation set. Cross-validation splits data, trains on subsets, and measures held-out performance - it is more general (works for any model) but wastes data and is computationally expensive. For conjugate Bayesian models like Gaussian-Gaussian, evidence maximisation is preferred: it uses all data for both training and hyperparameter selection, and it optimises the right objective (marginal likelihood implements automatic Occam's razor). For non-conjugate models, cross-validation is usually the practical choice.

Q4: Why does Bayesian linear regression with RBF features approximate a Gaussian process?

A Gaussian process with RBF kernel $k(x, x') = \sigma^2 \exp(-\|x-x'\|^2/2\ell^2)$ can be interpreted as Bayesian linear regression with an infinite-dimensional feature map. Specifically, any kernel satisfying Mercer's theorem can be written as $k(x, x') = \phi(x)^\top \phi(x')$ for some (possibly infinite-dimensional) feature map $\phi$ . With a finite set of RBF basis functions $\phi_i(x) = \exp(-\|x-c_i\|^2/2\ell^2)$ centred at $m$ locations, Bayesian linear regression in this feature space approximates a GP with RBF kernel - the approximation quality improves with more basis functions. This connection is exploited in random Fourier features (Rahimi & Recht, 2007): approximate a GP with a finite random feature map, enabling GP-quality uncertainty at $O(nm)$ cost instead of $O(n^3)$ .

Q5: How would you use Bayesian linear regression for online learning as new data arrives?

Bayesian linear regression supports exact sequential updating. After observing batch 1 and computing posterior $\mathcal{N}(m_1, S_1)$ , treat this posterior as the prior for batch 2. The update is: $S_2^{-1} = S_1^{-1} + X_2^\top X_2/\sigma^2$ and $m_2 = S_2(S_1^{-1}m_1 + X_2^\top y_2/\sigma^2)$ . This is the Kalman filter for linear regression - no need to store historical data. Each new batch updates the precision matrix and mean vector in $O(d^2 n_{\text{batch}})$ time, compared to $O(d^3)$ for a full re-fit. In production, this enables streaming inference: as sensor readings arrive, update beliefs about calibration parameters in real time without reprocessing historical data.

Common Mistakes

:::danger Using point predictions without uncertainty for downstream decisions Bayesian linear regression's main value is the predictive distribution, not just the mean. Using only $m_N^\top x^*$ and discarding the posterior covariance $S_N$ is equivalent to using ridge regression - the Bayesian computation was wasted. Always propagate uncertainty to downstream decisions: compute the probability of exceeding a threshold, the expected utility under the predictive distribution, or the credible interval for a compliance check. :::

:::warning Assuming the Gaussian prior is appropriate A Gaussian prior on weights works well when you have no strong domain knowledge and want a generic regulariser. But if you know some weights should be exactly zero (sparse signal), a Laplace prior (giving MAP = Lasso) or a spike-and-slab prior is more appropriate. If you know weights should be positive (e.g., a price coefficient), a log-normal or gamma prior is better. The choice of prior family should reflect actual beliefs - defaulting to Gaussian is convenient but not always correct. :::

:::danger Forgetting to include an intercept term If the true function has a non-zero mean and you do not include an intercept (column of ones in $X$ ), the Gaussian prior pulls predictions toward zero everywhere. This is prior misspecification - the zero-mean prior applies to the linear coefficients, not to the function values. Always include an intercept unless you have explicitly verified that the true function has zero mean at $x = 0$ . :::

:::warning High-dimensional $S_N$ storage The posterior covariance $S_N$ is $d \times d$ . For $d = 1{,}000$ features, this is 8 MB - fine. For $d = 100{,}000$ features, this is 80 GB - impossible. Use the dual (Woodbury) form when $d > n$ , or use sparse approximations to $S_N$ (diagonal approximation, low-rank approximation). Mean-field variational inference (Lesson 04) uses a diagonal approximation to avoid this issue in deep learning. :::

Practice Problems

Derive the posterior mean $m_N$ and covariance $S_N$ for Bayesian linear regression from scratch. Start with $\log p(\theta|X,y) \propto \log p(y|X,\theta) + \log p(\theta)$ , expand the quadratic, and complete the square.
Show that the MAP estimate of Bayesian linear regression with Gaussian prior is exactly the ridge regression solution. Write out both expressions and match terms.
Implement Bayesian linear regression for a 1D problem with polynomial features of degree 4. Plot the predictive mean and 95% credible interval as a function of $n_{\text{train}} \in \{5, 10, 20, 50\}$ . Observe how the credible interval narrows with more data and how it widens in extrapolation regions.
Implement the sequential Bayesian update rule and verify that processing data in two batches gives the same posterior as processing all data at once. Use a 2D regression problem with 50 total observations split as 20 + 30.
Compare the predictive intervals of Bayesian linear regression with those of standard linear regression (OLS with bootstrap confidence intervals). Use 100 Monte Carlo simulations, generating fresh training sets of size 20 each time, and plot the empirical coverage of each method's claimed 95% intervals. Which method is better calibrated and why?

:::tip 🎮 Interactive Playground

Visualize this concept: Try the MLE & MAP Explorer demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why Bayesian Linear Regression Exists​

Model Setup​

The Posterior: Exact Derivation​

Sequential Updating​

Connection to Ridge Regression​

The Posterior Predictive Distribution​

Evidence Maximisation (Empirical Bayes)​

From Scratch: Full Implementation​

Uncertainty Grows With Distance: Visualising the Predictive Bands​

How the Posterior Collapses With More Data​

Heteroscedastic Bayesian Regression​

sklearn-Compatible Wrapper​

Basis Function Expansion: Nonlinear Bayesian Regression​

Active Learning with Bayesian Linear Regression​

Production Engineering Notes​

YouTube Resources​

Interview Q&A​

Common Mistakes​

Practice Problems​