Skip to main content

Prior and Posterior Distributions

The Problem With Ignoring What You Know

You're building a spam classifier. You have 500 emails, of which 50 are spam - a 10% base rate. Your model is trained and achieves good accuracy.

Now a new user signs up. They've received 3 emails, 2 of which were spam. What's your best estimate of the spam rate for this user?

The MLE answer: p^=2/3=0.667\hat{p} = 2/3 = 0.667. Two thirds of their emails are spam.

But wait. You know from your entire dataset that spam rates for individual users range from 5% to 40%. The prior knowledge you've accumulated says that 0.667 is an extreme outlier. With only 3 data points, you should not trust the sample estimate - you should blend it with what you know about typical spam rates.

This blending of prior knowledge with observed data is exactly what Bayesian posterior computation achieves. The prior encodes what you know before seeing data. The posterior is what you know after.

Bayes Theorem: The Engine of Belief Updating

The posterior distribution is computed via Bayes theorem:

P(θD)posterior=P(Dθ)likelihoodP(θ)priorP(D)evidence\underbrace{P(\theta \mid \mathcal{D})}_{\text{posterior}} = \frac{\underbrace{P(\mathcal{D} \mid \theta)}_{\text{likelihood}} \cdot \underbrace{P(\theta)}_{\text{prior}}}{\underbrace{P(\mathcal{D})}_{\text{evidence}}}

In practice, we often work with the unnormalized posterior and ignore the evidence (since it doesn't depend on θ\theta):

P(θD)P(Dθ)P(θ)P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) \cdot P(\theta)

Or in log space (more numerically stable):

logP(θD)=logP(Dθ)+logP(θ)+const\log P(\theta \mid \mathcal{D}) = \log P(\mathcal{D} \mid \theta) + \log P(\theta) + \text{const}

Maximizing this gives MAP (Maximum A Posteriori) estimation, which is the Bayesian generalization of MLE.

MAP vs MLE: The Core Distinction

Maximum Likelihood Estimation (MLE)

θ^MLE=argmaxθP(Dθ)=argmaxθlogP(Dθ)\hat{\theta}_{MLE} = \arg\max_\theta P(\mathcal{D} \mid \theta) = \arg\max_\theta \log P(\mathcal{D} \mid \theta)

MLE ignores prior knowledge. It finds the parameter value that makes the observed data most probable.

Maximum A Posteriori (MAP)

θ^MAP=argmaxθP(θD)=argmaxθ[logP(Dθ)+logP(θ)]\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid \mathcal{D}) = \arg\max_\theta \left[\log P(\mathcal{D} \mid \theta) + \log P(\theta)\right]

MAP finds the parameter value that is most probable given both the data AND the prior.

:::tip ML Connection - Regularization IS MAP The MAP objective has a direct connection to regularized ML:

  • L2 regularization (Ridge): Gaussian prior P(θ)=N(0,σ2I)P(\theta) = \mathcal{N}(0, \sigma^2 I)

    • logP(θ)=12σ2θ2+const\log P(\theta) = -\frac{1}{2\sigma^2}\|\theta\|^2 + \text{const}
    • MAP objective: ilogP(yixi,θ)λ2θ2\sum_i \log P(y_i \mid x_i, \theta) - \frac{\lambda}{2}\|\theta\|^2
    • This IS Ridge regression / L2-regularized cross-entropy
  • L1 regularization (Lasso): Laplace prior P(θ)=Laplace(0,b)P(\theta) = \text{Laplace}(0, b)

    • logP(θ)=1bθ1+const\log P(\theta) = -\frac{1}{b}\|\theta\|_1 + \text{const}
    • MAP objective: ilogP(yixi,θ)λθ1\sum_i \log P(y_i \mid x_i, \theta) - \lambda\|\theta\|_1
    • This IS Lasso regression / L1-regularized loss

Every time you use weight decay in your optimizer, you are computing MAP with a Gaussian prior. :::

Conjugate Priors: When the Math Works Out Beautifully

A prior P(θ)P(\theta) is conjugate to a likelihood P(Dθ)P(\mathcal{D} \mid \theta) if the posterior P(θD)P(\theta \mid \mathcal{D}) is in the same distribution family as the prior.

Conjugate pairs allow closed-form posterior computation - no MCMC or numerical methods needed. This makes Bayesian updating extremely fast.

The Beta-Bernoulli Conjugate Pair

This is the most important conjugate pair for ML engineers.

Likelihood: Data are Bernoulli trials (binary outcomes, e.g., click/no-click) P(Dθ)=θh(1θ)tP(\mathcal{D} \mid \theta) = \theta^h (1-\theta)^t where hh = number of heads/successes, tt = number of tails/failures.

Prior: Beta distribution P(θ)=Beta(α,β)θα1(1θ)β1P(\theta) = \text{Beta}(\alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}

Posterior (conjugate!): P(θD)=Beta(α+h,β+t)P(\theta \mid \mathcal{D}) = \text{Beta}(\alpha + h, \beta + t)

The parameters α\alpha and β\beta have a beautiful interpretation: they count pseudo-observations. Beta(5, 5) means "I've seen 5 successes and 5 failures in my prior experience." This gives the posterior mean:

E[θD]=α+hα+h+β+tE[\theta \mid \mathcal{D}] = \frac{\alpha + h}{\alpha + h + \beta + t}

This is a weighted average of the prior mean αα+β\frac{\alpha}{\alpha+\beta} and the MLE hh+t\frac{h}{h+t}.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def bayesian_beta_update(prior_alpha, prior_beta, successes, failures):
"""Update Beta prior with binomial observations."""
post_alpha = prior_alpha + successes
post_beta = prior_beta + failures
posterior = stats.beta(post_alpha, post_beta)

post_mean = post_alpha / (post_alpha + post_beta)
ci = posterior.ppf([0.025, 0.975])
mle = successes / (successes + failures) if (successes + failures) > 0 else 0

return {
'posterior': posterior,
'mean': post_mean,
'mle': mle,
'credible_interval': ci,
'post_alpha': post_alpha,
'post_beta': post_beta
}

# Spam classifier example
# Prior: typical user spam rate is around 10%, encoded as Beta(2, 18)
# This says: "I've seen about 2 spam / 18 non-spam in my prior experience"
prior_alpha, prior_beta = 2, 18
print(f"Prior mean: {prior_alpha/(prior_alpha+prior_beta):.3f}")

# New user: 2 spam out of 3 emails
result = bayesian_beta_update(prior_alpha, prior_beta, successes=2, failures=1)
print(f"\nNew user (3 emails, 2 spam):")
print(f"MLE: {result['mle']:.3f}")
print(f"Bayesian posterior mean: {result['mean']:.3f}")
print(f"95% credible interval: ({result['credible_interval'][0]:.3f}, {result['credible_interval'][1]:.3f})")

# With more data: 20 spam out of 30 emails
result2 = bayesian_beta_update(prior_alpha, prior_beta, successes=20, failures=10)
print(f"\nEstablished user (30 emails, 20 spam):")
print(f"MLE: {result2['mle']:.3f}")
print(f"Bayesian posterior mean: {result2['mean']:.3f}")
print(f"95% credible interval: ({result2['credible_interval'][0]:.3f}, {result2['credible_interval'][1]:.3f})")

Notice how with only 3 observations, the Bayesian estimate (≈ 0.19) is pulled substantially toward the prior mean (0.10), while with 30 observations, it's much closer to the MLE (0.667 → ≈ 0.60). The prior matters less as data accumulates.

The Gaussian-Gaussian Conjugate Pair

Likelihood: Gaussian with known variance σ2\sigma^2 P(Dμ)=i=1nN(xiμ,σ2)P(\mathcal{D} \mid \mu) = \prod_{i=1}^n \mathcal{N}(x_i \mid \mu, \sigma^2)

Prior: Gaussian on the mean P(μ)=N(μ0,τ2)P(\mu) = \mathcal{N}(\mu_0, \tau^2)

Posterior (conjugate!): Another Gaussian P(μD)=N(μn,τn2)P(\mu \mid \mathcal{D}) = \mathcal{N}(\mu_n, \tau_n^2)

where: τn2=(1τ2+nσ2)1,μn=τn2(μ0τ2+nxˉσ2)\tau_n^2 = \left(\frac{1}{\tau^2} + \frac{n}{\sigma^2}\right)^{-1}, \quad \mu_n = \tau_n^2 \left(\frac{\mu_0}{\tau^2} + \frac{n\bar{x}}{\sigma^2}\right)

The posterior mean μn\mu_n is a precision-weighted average of the prior mean and the sample mean, where precision = 1/variance1/\text{variance}:

μn=1τ2μ0+nσ2xˉ1τ2+nσ2\mu_n = \frac{\frac{1}{\tau^2} \mu_0 + \frac{n}{\sigma^2} \bar{x}}{\frac{1}{\tau^2} + \frac{n}{\sigma^2}}

def gaussian_gaussian_update(mu0, tau2, sigma2, data):
"""
Update Gaussian prior on mean with Gaussian-likelihood observations.
mu0, tau2: prior mean and variance
sigma2: known likelihood variance
data: array of observations
"""
n = len(data)
x_bar = np.mean(data)

# Posterior precision = prior precision + data precision
prior_precision = 1 / tau2
data_precision = n / sigma2
post_precision = prior_precision + data_precision
post_tau2 = 1 / post_precision

# Posterior mean = precision-weighted average
post_mu = post_tau2 * (prior_precision * mu0 + data_precision * x_bar)

return post_mu, post_tau2

# Example: Estimating user latency
# Prior: latency is typically around 200ms with std=50ms
mu0, tau = 200.0, 50.0
tau2 = tau**2
sigma2 = 30.0**2 # known measurement noise std=30ms

# Observe 5 latency measurements
observations = np.array([180, 175, 190, 185, 170])

post_mu, post_tau2 = gaussian_gaussian_update(mu0, tau2, sigma2, observations)
post_std = np.sqrt(post_tau2)

print(f"Prior: mean={mu0:.1f}ms, std={tau:.1f}ms")
print(f"Observed mean: {np.mean(observations):.1f}ms (n={len(observations)})")
print(f"Posterior: mean={post_mu:.1f}ms, std={post_std:.1f}ms")
print(f"95% credible interval: ({post_mu - 1.96*post_std:.1f}, {post_mu + 1.96*post_std:.1f})ms")

Complete Table of Common Conjugate Pairs

LikelihoodPriorPosteriorML Use Case
Bernoulli/BinomialBetaBetaClick-through rate, spam rate
Gaussian (known var)GaussianGaussianLatency, sensor readings
Gaussian (known mean)Inverse-GammaInverse-GammaVariance estimation
PoissonGammaGammaEvent counts, query frequency
MultinomialDirichletDirichletTopic modeling, class probabilities
ExponentialGammaGammaTime-to-failure, inter-arrival times
Normal-Normal (full)Normal-Inverse-GammaNormal-Inverse-GammaBayesian linear regression

Choosing a Prior: A Practical Guide

Choosing the prior is often the hardest part of Bayesian analysis. Here are the main strategies:

1. Informative Priors (Subjective Bayesian)

Use domain knowledge to construct the prior. If you know from clinical experience that drug efficacy rates range from 5-40%, encode this.

# Informative prior for drug efficacy
# Encode belief: efficacy is between 5-40%, centered around 15%
# Beta(3, 17) has mean = 3/20 = 0.15 and is concentrated in [0.05, 0.40]
from scipy import stats
prior = stats.beta(3, 17)
print(f"Informative prior: mean={prior.mean():.3f}, "
f"95% interval={prior.ppf(0.025):.3f} to {prior.ppf(0.975):.3f}")

2. Weakly Informative Priors (Regularizing Priors)

Use priors that provide mild regularization without encoding strong beliefs. This is the most common choice in modern Bayesian ML (championed by Andrew Gelman).

# Weakly informative prior for logistic regression coefficients
# N(0, 1) is weakly informative: coefficients shouldn't be huge,
# but we're not certain where they should be
# This corresponds to L2 regularization with lambda = 1
import scipy.stats as stats
weakly_informative = stats.norm(0, 1)

3. Non-Informative Priors (Jeffreys Prior)

Jeffreys prior is invariant to reparameterization - it expresses "maximum ignorance":

P(θ)I(θ)P(\theta) \propto \sqrt{I(\theta)}

where I(θ)I(\theta) is the Fisher information.

For the Bernoulli parameter: Jeffreys prior is Beta(0.5,0.5)\text{Beta}(0.5, 0.5) - a U-shaped distribution that assigns more weight to extreme values (0 and 1).

4. Flat/Uniform Priors

P(θ)1P(\theta) \propto 1 over the parameter space. Gives the posterior proportional to the likelihood, so MAP = MLE. Often mathematically improper (doesn't integrate to 1 over an infinite domain) but used as a starting point.

:::warning Prior Sensitivity Always perform prior sensitivity analysis: how much do your conclusions change if you use different reasonable priors? If your conclusions are robust across a range of priors, you have strong evidence. If they depend critically on prior choice, report this uncertainty.

def sensitivity_analysis(successes, failures, priors):
"""Compare posterior means under different priors."""
print(f"Data: {successes} successes, {failures} failures (MLE = {successes/(successes+failures):.3f})")
print(f"\n{'Prior':<25} {'Post. Mean':<15} {'95% CI':<25}")
print("-" * 65)
for name, (a, b) in priors.items():
post = stats.beta(a + successes, b + failures)
ci = post.ppf([0.025, 0.975])
print(f"{name:<25} {post.mean():<15.3f} ({ci[0]:.3f}, {ci[1]:.3f})")

priors = {
"Uniform Beta(1,1)": (1, 1),
"Jeffreys Beta(0.5,0.5)": (0.5, 0.5),
"Weak Beta(2,2)": (2, 2),
"Informative Beta(10,90)": (10, 90), # strong belief ~10%
}
sensitivity_analysis(successes=7, failures=3, priors=priors)

:::

Posterior Predictive Distribution

The full posterior lets you compute the posterior predictive distribution - the distribution over new observations, averaging over parameter uncertainty:

P(x~D)=P(x~θ)P(θD)dθP(\tilde{x} \mid \mathcal{D}) = \int P(\tilde{x} \mid \theta) P(\theta \mid \mathcal{D}) d\theta

This is more principled than plug-in prediction (using θ^\hat{\theta} to predict), because it accounts for uncertainty in θ\theta.

For the Beta-Bernoulli model, the posterior predictive probability of success is simply the posterior mean:

P(x~=1D)=E[θD]=α+hα+h+β+tP(\tilde{x} = 1 \mid \mathcal{D}) = E[\theta \mid \mathcal{D}] = \frac{\alpha + h}{\alpha + h + \beta + t}

This is the Laplace smoothing formula used ubiquitously in NLP:

P(wcontext)=c(w,context)+αwc(w,context)+αVP(w \mid \text{context}) = \frac{c(w, \text{context}) + \alpha}{\sum_{w'} c(w', \text{context}) + \alpha \cdot |V|}

Laplace smoothing IS Bayesian posterior predictive estimation with a Dirichlet prior!

Bayesian Linear Regression: A Complete Example

import numpy as np
import matplotlib.pyplot as plt

class BayesianLinearRegression:
"""
Bayesian linear regression with Gaussian prior on weights.
Computes closed-form posterior using conjugate updates.
"""
def __init__(self, alpha=1.0, beta=1.0):
"""
alpha: prior precision (1/prior_variance) on weights
beta: likelihood precision (1/noise_variance)
"""
self.alpha = alpha # prior: w ~ N(0, alpha^{-1} I)
self.beta = beta # likelihood: y | x,w ~ N(w^T x, beta^{-1})
self.m_N = None # posterior mean
self.S_N = None # posterior covariance

def fit(self, X, y):
"""Compute closed-form posterior over weights."""
n, d = X.shape
# Posterior precision matrix
S_N_inv = self.alpha * np.eye(d) + self.beta * X.T @ X
self.S_N = np.linalg.inv(S_N_inv)
# Posterior mean
self.m_N = self.beta * self.S_N @ X.T @ y
return self

def predict(self, X_new):
"""Return posterior predictive mean and variance."""
y_mean = X_new @ self.m_N
# Predictive variance accounts for parameter uncertainty + noise
y_var = (1/self.beta) + np.diag(X_new @ self.S_N @ X_new.T)
y_std = np.sqrt(y_var)
return y_mean, y_std

# Generate synthetic data
np.random.seed(42)
n_train = 15
X_train = np.random.uniform(-3, 3, (n_train, 1))
y_train = 0.5 * X_train[:, 0] + np.random.normal(0, 0.5, n_train)

# Add bias term
X_train_aug = np.column_stack([X_train, np.ones(n_train)])

# Fit Bayesian linear regression
blr = BayesianLinearRegression(alpha=1.0, beta=4.0) # alpha=1 ~ L2 with lambda=1
blr.fit(X_train_aug, y_train)

# Predict on test range
X_test = np.linspace(-4, 4, 100).reshape(-1, 1)
X_test_aug = np.column_stack([X_test, np.ones(100)])
y_mean, y_std = blr.predict(X_test_aug)

print(f"Posterior weight mean: slope={blr.m_N[0]:.3f}, intercept={blr.m_N[1]:.3f}")
print(f"Posterior weight covariance diagonal: {np.diag(blr.S_N)}")
print(f"Prediction at x=0: {y_mean[50]:.3f} ± {y_std[50]:.3f}")

# Note: wider uncertainty bands at the edges (less data there)
# This is epistemic uncertainty properly quantified

The key insight: the predictive uncertainty (y_stdy\_std) is higher at the edges of the training range - where you have less data. A point estimate model would give the same uncertainty everywhere. The Bayesian model correctly reflects where it knows less.

Interview Questions

Q1: What is the difference between MLE, MAP, and full Bayesian inference?

MLE finds the parameter maximizing the likelihood: θ^MLE=argmaxP(Dθ)\hat{\theta}_{MLE} = \arg\max P(\mathcal{D}|\theta). It ignores any prior knowledge and can overfit with small data. MAP adds a prior term: θ^MAP=argmax[P(Dθ)P(θ)]\hat{\theta}_{MAP} = \arg\max [P(\mathcal{D}|\theta) P(\theta)], which is equivalent to regularized MLE (L2 regularization = Gaussian prior). Both MLE and MAP return a single point estimate - they discard uncertainty information. Full Bayesian inference computes the entire posterior distribution P(θD)P(\theta|\mathcal{D}), enabling uncertainty quantification, posterior predictive distributions, and model comparison. The tradeoff: MAP is computationally cheap, full Bayes is more expensive but more principled.

Q2: What is a conjugate prior and why does it matter in practice?

A conjugate prior is one where the posterior is in the same distribution family as the prior. For example, Beta is conjugate to Binomial: if the prior on pp is Beta(α,β)\text{Beta}(\alpha, \beta) and you observe hh successes and tt failures, the posterior is Beta(α+h,β+t)\text{Beta}(\alpha+h, \beta+t). This matters because: (1) the posterior update is a simple formula - no numerical integration or MCMC required; (2) updates are interpretable - the prior parameters act as pseudo-observation counts; (3) updates are composable - you can apply them sequentially. In production systems where you're updating beliefs continuously (e.g., per-user spam rates, per-item click rates), conjugate models enable real-time Bayesian updates with O(1) storage and computation.

Q3: How do you choose a prior when you have little domain knowledge?

Several strategies: (1) Weakly informative priors encode only broad constraints - for logistic regression coefficients, N(0,1)\mathcal{N}(0, 1) says "I don't expect huge coefficients" without being strongly opinionated; (2) Jeffreys prior I(θ)\propto \sqrt{I(\theta)} is invariant to reparameterization, representing "maximum ignorance" mathematically; (3) Empirical Bayes uses the data to estimate hyperparameters of the prior - this can cause overfitting but is pragmatic; (4) Always perform sensitivity analysis: vary the prior and check if conclusions change. If your posterior is sensitive to prior choice with little data, report that uncertainty honestly. The worst choice is an arbitrary flat prior with no thought - this is not "uninformative," it's often strongly informative in a bad way (e.g., a uniform prior on [0,106][0, 10^6] is very different from one on [0,1][0, 1]).

Q4: Explain the posterior predictive distribution and why it's better than plug-in prediction.

The posterior predictive distribution is P(x~D)=P(x~θ)P(θD)dθP(\tilde{x} | \mathcal{D}) = \int P(\tilde{x}|\theta) P(\theta|\mathcal{D}) d\theta - it averages predictions over all plausible parameter values, weighted by their posterior probability. Plug-in prediction uses a single estimate θ^\hat{\theta}: P(x~θ^)P(\tilde{x}|\hat{\theta}). The posterior predictive is better because: (1) it correctly reflects parameter uncertainty - predictions are uncertain not just from data noise but also from uncertainty about which parameters are correct; (2) it automatically gives wider prediction intervals in regions with less data; (3) it gives coherent probability estimates - the predicted probabilities properly account for model uncertainty. In practice, for neural networks this is approximated via Monte Carlo Dropout or ensembles.

Q5: What does it mean for a prior to be "improper" and is it a problem?

An improper prior is one that doesn't integrate to a finite constant - it's not a valid probability distribution. Example: the flat prior P(θ)1P(\theta) \propto 1 over R\mathbb{R} is improper because 1dθ=\int_{-\infty}^{\infty} 1 \, d\theta = \infty. This is sometimes used because: (1) it's the limit of a very wide Gaussian; (2) it ensures MAP = MLE (no regularization). Whether it's a problem depends on whether the posterior is proper. For standard models with sufficient data, the posterior P(θD)P(Dθ)1P(\theta|\mathcal{D}) \propto P(\mathcal{D}|\theta) \cdot 1 integrates to a finite value and is a valid distribution - the improper prior "washes out." However, improper priors can cause problems: (1) Bayes factors are undefined (model comparison requires proper priors); (2) some parameters may still have improper posteriors in hierarchical models; (3) they can make MCMC samplers fail in subtle ways. Best practice: use weakly informative proper priors that are nearly flat in the region of interest.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Prior to Posterior demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.