Skip to main content

Maximum Likelihood Estimation

The Real Interview Moment

You are in the final round at a top-tier AI lab. The interviewer slides a sheet of paper across the table: "We have a neural network trained with MSE loss on a classification task. Accuracy is 92%. Our competitor uses cross-entropy and gets 93%. Explain exactly why cross-entropy is better - mathematically, not intuitively."

You pause. You know cross-entropy is "better for classification," but do you know why? The honest answer requires tracing a chain of reasoning from probability theory all the way to gradient flow. MSE assumes the residuals are Gaussian. Classification residuals are Bernoulli, not Gaussian. When you use MSE for classification, you are solving the wrong optimization problem. Cross-entropy is the correct loss because it is the negative log-likelihood of the Bernoulli distribution. Using the right probabilistic model for your data is not a heuristic; it is a theorem.

This is the subject of Maximum Likelihood Estimation. Every loss function in machine learning is either the negative log-likelihood of some noise model, or it is an approximation to one. MSE is the NLL of Gaussian noise. Cross-entropy is the NLL of Bernoulli noise. Huber loss approximates the NLL of a distribution with Gaussian body and Laplace tails. Knowing this unifies everything - regression, classification, density estimation, survival analysis - under a single framework.

The principle is deceptively simple: given observed data, find the parameters that make the data most probable. Yet the implications cascade through the entire field. It explains why regularization works (MAP estimation with a prior). It quantifies the fundamental limits of estimation (Cramér-Rao bound). It justifies the use of gradient descent on loss functions (because maximizing likelihood is the same as minimizing average loss). And it gives you the vocabulary to design new loss functions for new problems from first principles.

This lesson derives MLE carefully, connects it to every major model you have already seen, builds the Fisher information framework, and ends with the MAP duality that explains L1 and L2 regularization as Bayesian inference. After this lesson, you will never wonder why a loss function looks the way it does.


Why This Exists - The Problem Before MLE

Before MLE was formalized, statisticians fit curves by minimizing "error" in ad hoc ways. The method of least squares (Gauss, Legendre, ~1805) was proposed purely because it led to tractable algebra. Nobody asked: "Under what probabilistic assumptions does minimizing squared error give the best estimates?"

The uncomfortable truth is that least squares is optimal only under Gaussian noise. For count data, binary data, survival times, or heavy-tailed measurements, squared error is a poor choice - it assigns high loss to large residuals that are actually probable under the true data distribution.

Fisher (1912–1922) formalized MLE as a general estimation principle: rather than defining "best fit" geometrically, define it probabilistically. The best parameters are those that assign the highest probability to the data you actually observed. This immediately tells you:

  • For Gaussian noise → minimize squared errors (MSE)
  • For Bernoulli outputs → minimize cross-entropy
  • For Poisson counts → minimize Poisson deviance
  • For heavy-tailed noise → minimize absolute errors (MAE)

The loss function is not a design choice - it is derived from the data distribution.


Historical Context

YearMilestone
1805Legendre proposes method of least squares (no probabilistic justification)
1809Gauss derives OLS from Gaussian error assumption - first proto-MLE argument
1912R.A. Fisher introduces "likelihood" as distinct from probability
1922Fisher proves MLE is asymptotically efficient (achieves Cramér-Rao bound)
1945–46Cramér and Rao independently derive the variance lower bound
1972Nelder & Wedderburn formalize GLMs - MLE applied to exponential family
1986Rumelhart et al. backprop = gradient of cross-entropy NLL
2010sDeep learning: every modern loss function is a (negative) log-likelihood

Fisher's key insight: probability is a function of data given fixed parameters; likelihood is the same function viewed as varying over parameters given fixed data. They are numerically equal but conceptually opposite. We want to find the parameter value that makes the observed data most probable.


The MLE Principle: Probability vs Likelihood

For a data point yy and parameter θ\theta:

  • P(yθ)P(y | \theta): probability of observing yy if the true parameter is θ\theta - a function of yy, θ\theta fixed
  • L(θy)=P(yθ)\mathcal{L}(\theta | y) = P(y | \theta): likelihood of θ\theta given we observed yy - same number, but viewed as a function of θ\theta, yy fixed

This distinction matters deeply. A likelihood is not a probability distribution over θ\theta - it need not integrate to 1. It is just a measure of how well a parameter value explains the data.

Given nn i.i.d. observations {yi}i=1n\{y_i\}_{i=1}^n:

L(θ)=i=1nP(yiθ)\mathcal{L}(\theta) = \prod_{i=1}^n P(y_i | \theta)

The log-likelihood (monotone transformation, same maximizer):

(θ)=logL(θ)=i=1nlogP(yiθ)\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^n \log P(y_i | \theta)

The MLE is:

θ^MLE=argmaxθ(θ)=argmaxθi=1nlogP(yiθ)\hat{\theta}_\text{MLE} = \arg\max_\theta \ell(\theta) = \arg\max_\theta \sum_{i=1}^n \log P(y_i | \theta)

The log converts products to sums, making both computation and differentiation tractable.


MLE for Common Distributions

Gaussian: Recovering Sample Mean and Variance

Given nn i.i.d. observations from N(μ,σ2)\mathcal{N}(\mu, \sigma^2):

P(yiμ,σ2)=12πσ2exp((yiμ)22σ2)P(y_i | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)

Log-likelihood:

(μ,σ2)=n2log(2π)n2logσ212σ2i=1n(yiμ)2\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2

MLE for μ\mu: set /μ=0\partial\ell/\partial\mu = 0:

μ=1σ2i=1n(yiμ)=0    μ^MLE=1ni=1nyi=yˉ\frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu) = 0 \implies \hat{\mu}_\text{MLE} = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}

The sample mean is the MLE for Gaussian location.

MLE for σ2\sigma^2: set /σ2=0\partial\ell/\partial\sigma^2 = 0:

σ2=n2σ2+12σ4i=1n(yiμ)2=0    σ^MLE2=1ni=1n(yiyˉ)2\frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n(y_i - \mu)^2 = 0 \implies \hat{\sigma}^2_\text{MLE} = \frac{1}{n}\sum_{i=1}^n(y_i - \bar{y})^2

Note: this is the biased estimator (divides by nn, not n1n-1). MLE is slightly biased for σ2\sigma^2 in finite samples. The unbiased estimator s2=1n1(yiyˉ)2s^2 = \frac{1}{n-1}\sum(y_i-\bar{y})^2 is not the MLE.

Bernoulli: Recovering Sample Proportion

Given nn i.i.d. binary observations yi{0,1}y_i \in \{0, 1\} from Bernoulli(p)\text{Bernoulli}(p):

P(yip)=pyi(1p)1yiP(y_i | p) = p^{y_i}(1-p)^{1-y_i}

Log-likelihood:

(p)=i=1nyilogp+(1yi)log(1p)=klogp+(nk)log(1p)\ell(p) = \sum_{i=1}^n y_i \log p + (1 - y_i)\log(1-p) = k\log p + (n-k)\log(1-p)

where k=yik = \sum y_i is the number of successes.

MLE: set /p=0\partial\ell/\partial p = 0:

p=kpnk1p=0    p^MLE=kn=yˉ\frac{\partial\ell}{\partial p} = \frac{k}{p} - \frac{n-k}{1-p} = 0 \implies \hat{p}_\text{MLE} = \frac{k}{n} = \bar{y}

The sample proportion is the MLE for Bernoulli parameter - the most natural estimator, justified by MLE.

Poisson: Recovering Sample Mean

Given nn i.i.d. count observations yi{0,1,2,}y_i \in \{0, 1, 2, \ldots\} from Poisson(λ)\text{Poisson}(\lambda):

P(yiλ)=eλλyiyi!P(y_i | \lambda) = \frac{e^{-\lambda}\lambda^{y_i}}{y_i!}

Log-likelihood:

(λ)=i=1n(λ+yilogλlog(yi!))=nλ+(iyi)logλconst\ell(\lambda) = \sum_{i=1}^n \left(-\lambda + y_i\log\lambda - \log(y_i!)\right) = -n\lambda + \left(\sum_i y_i\right)\log\lambda - \text{const}

MLE: /λ=n+iyiλ=0    λ^MLE=yˉ\partial\ell/\partial\lambda = -n + \frac{\sum_i y_i}{\lambda} = 0 \implies \hat{\lambda}_\text{MLE} = \bar{y}

For Gaussian, Bernoulli, and Poisson alike: the MLE is the sample mean. This generalizes to all one-parameter exponential family distributions - the MLE is always the empirical average of the sufficient statistic.


MLE for Linear Regression → Derives OLS

The Gaussian noise model for regression:

yi=wxi+ϵi,ϵiN(0,σ2)y_i = \mathbf{w}^\top\mathbf{x}_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)

implies:

P(yixi,w)=12πσ2exp((yiwxi)22σ2)P(y_i | \mathbf{x}_i, \mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{w}^\top\mathbf{x}_i)^2}{2\sigma^2}\right)

Log-likelihood over nn samples:

(w)=n2log(2πσ2)12σ2i=1n(yiwxi)2\ell(\mathbf{w}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{w}^\top\mathbf{x}_i)^2

Maximizing over w\mathbf{w} is equivalent to minimizing:

i=1n(yiwxi)2\sum_{i=1}^n (y_i - \mathbf{w}^\top\mathbf{x}_i)^2

This is exactly OLS. The Gaussian noise assumption uniquely derives the squared loss. The insight works in reverse too: whenever you use MSE, you are implicitly assuming your residuals are i.i.d. Gaussian.


MLE for Logistic Regression → Derives Cross-Entropy

Binary label yi{0,1}y_i \in \{0, 1\} modeled as Bernoulli:

P(yixi,w)=y^iyi(1y^i)1yi,y^i=σ(wxi)P(y_i | \mathbf{x}_i, \mathbf{w}) = \hat{y}_i^{y_i}(1 - \hat{y}_i)^{1-y_i}, \quad \hat{y}_i = \sigma(\mathbf{w}^\top\mathbf{x}_i)

Log-likelihood:

(w)=i=1n[yilogy^i+(1yi)log(1y^i)]\ell(\mathbf{w}) = \sum_{i=1}^n \left[y_i \log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]

Maximizing (w)\ell(\mathbf{w}) is equivalent to minimizing (w)-\ell(\mathbf{w}), the binary cross-entropy:

LCE(w)=1ni=1n[yilogy^i+(1yi)log(1y^i)]\mathcal{L}_\text{CE}(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^n \left[y_i \log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]

Cross-entropy is the NLL of Bernoulli distribution. The Bernoulli assumption derives cross-entropy; the Gaussian assumption derives MSE. The pattern is universal.

The NLL ↔ Loss Function Dictionary

Data distributionNoise modelNegative log-likelihoodLoss name
Linear regressionGaussian N(μ,σ2)\mathcal{N}(\mu, \sigma^2)12σ2(yy^)2\frac{1}{2\sigma^2}(y - \hat{y})^2MSE / squared loss
Binary classificationBernoulli(p)(p)ylogy^(1y)log(1y^)-y\log\hat{y} - (1-y)\log(1-\hat{y})Binary cross-entropy
MulticlassCategorical(p1,,pK)(p_1,\ldots,p_K)kyklogy^k-\sum_k y_k \log\hat{y}_kCategorical cross-entropy
Count predictionPoisson(λ)(\lambda)y^ylogy^\hat{y} - y\log\hat{y}Poisson deviance
Robust regressionLaplace(b)(b)yy^/b\|y - \hat{y}\|/bMAE / L1 loss
Heavy-tailedCauchylog(1+(yy^)2)\log(1 + (y-\hat{y})^2)Log-loss (robust)

Choosing a loss function is choosing a noise model. Get the noise model right and the loss function follows.


Python: MLE From Scratch

Gaussian MLE with Likelihood Surface

import numpy as np
from scipy import stats
from scipy.optimize import minimize
import matplotlib.pyplot as plt


def mle_gaussian_full(data: np.ndarray) -> tuple[float, float]:
"""
MLE for Gaussian: find mu, sigma maximizing P(data | mu, sigma).

Closed-form MLE:
mu_hat = mean(data)
sigma_hat = std(data, ddof=0) # divide by n, not n-1
"""
n = len(data)
mu_hat = np.mean(data)
sigma_hat = np.std(data, ddof=0) # MLE: biased, divides by n

# Numerical verification via scipy.optimize
def neg_log_likelihood(params):
mu, log_sigma = params
sigma = np.exp(log_sigma) # log-parameterize to keep sigma > 0
return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method='L-BFGS-B')
mu_num = result.x[0]
sigma_num = np.exp(result.x[1])

# Compute log-likelihood at MLE
ll_at_mle = np.sum(stats.norm.logpdf(data, loc=mu_hat, scale=sigma_hat))

print("Gaussian MLE Results:")
print(f" n = {n}")
print(f" Closed-form: mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
print(f" Numerical: mu = {mu_num:.4f}, sigma = {sigma_num:.4f}")
print(f" Unbiased std (ddof=1): {np.std(data, ddof=1):.4f} (NOT the MLE)")
print(f" Log-likelihood at MLE: {ll_at_mle:.3f}")

# Visualize likelihood surface
mu_grid = np.linspace(mu_hat - 1.5, mu_hat + 1.5, 80)
sig_grid = np.linspace(max(0.1, sigma_hat - 1.0), sigma_hat + 1.0, 80)
MU, SIG = np.meshgrid(mu_grid, sig_grid)

LL = np.vectorize(
lambda m, s: np.sum(stats.norm.logpdf(data, loc=m, scale=s))
)(MU, SIG)

plt.figure(figsize=(7, 5))
cs = plt.contourf(MU, SIG, LL, levels=25, cmap='Blues')
plt.colorbar(cs, label='Log-Likelihood')
plt.plot(mu_hat, sigma_hat, 'r*', markersize=16, label='MLE')
plt.xlabel('$\\mu$'); plt.ylabel('$\\sigma$')
plt.title('Gaussian Log-Likelihood Surface')
plt.legend()
plt.tight_layout()
plt.savefig('gaussian_mle_surface.png', dpi=150)
plt.show()

return mu_hat, sigma_hat


np.random.seed(42)
true_mu, true_sigma = 3.5, 1.2
data = np.random.normal(true_mu, true_sigma, size=500)
mu_hat, sigma_hat = mle_gaussian_full(data)

Bernoulli and Poisson MLE

def mle_bernoulli(data: np.ndarray) -> float:
"""
MLE for Bernoulli parameter p.
Closed-form: p_hat = mean(data)
"""
n = len(data)
k = np.sum(data)
p_hat = k / n

# Plot log-likelihood as a function of p
p_vals = np.linspace(0.01, 0.99, 300)
log_lik = k * np.log(p_vals) + (n - k) * np.log(1 - p_vals)

plt.figure(figsize=(7, 4))
plt.plot(p_vals, log_lik, 'b-', lw=2)
plt.axvline(p_hat, color='r', ls='--', label=f'MLE p̂ = {p_hat:.3f}')
plt.xlabel('p'); plt.ylabel('Log-Likelihood ℓ(p)')
plt.title('Bernoulli Log-Likelihood')
plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('bernoulli_mle.png', dpi=150)
plt.show()

print(f"\nBernoulli MLE: p̂ = k/n = {k}/{n} = {p_hat:.4f}")
return p_hat


def mle_poisson(data: np.ndarray) -> float:
"""
MLE for Poisson rate lambda.
Closed-form: lambda_hat = mean(data)
"""
lambda_hat = np.mean(data)

lambda_vals = np.linspace(0.5, 2 * lambda_hat, 200)
log_lik = np.sum(data) * np.log(lambda_vals) - len(data) * lambda_vals

plt.figure(figsize=(7, 4))
plt.plot(lambda_vals, log_lik, 'g-', lw=2)
plt.axvline(lambda_hat, color='r', ls='--', label=f'MLE λ̂ = {lambda_hat:.3f}')
plt.xlabel('λ'); plt.ylabel('Log-Likelihood ℓ(λ)')
plt.title('Poisson Log-Likelihood')
plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('poisson_mle.png', dpi=150)
plt.show()

print(f"\nPoisson MLE: λ̂ = mean = {lambda_hat:.4f}")
return lambda_hat


np.random.seed(42)
binary_data = np.random.binomial(1, p=0.35, size=1000)
count_data = np.random.poisson(lam=4.2, size=800)

p_hat = mle_bernoulli(binary_data)
lambda_hat = mle_poisson(count_data)

Numerical MLE for Custom Distributions

from scipy.optimize import minimize
from scipy.stats import t as t_dist


def mle_student_t(data: np.ndarray) -> dict:
"""
MLE for Student-t distribution: fit (nu, mu, sigma) numerically.
No closed-form solution exists - must use numerical optimization.
Shows how MLE applies to any distribution.
"""
def neg_log_likelihood(params):
log_nu, mu, log_sigma = params
nu = np.exp(log_nu) + 2 # degrees of freedom > 2
sigma = np.exp(log_sigma)
# scipy t-dist: scale = sigma, loc = mu
return -np.sum(t_dist.logpdf(data, df=nu, loc=mu, scale=sigma))

# Initial guess: Gaussian-like starting point
x0 = [np.log(5.0), np.mean(data), np.log(np.std(data))]

result = minimize(
neg_log_likelihood,
x0=x0,
method='L-BFGS-B',
options={'maxiter': 1000, 'ftol': 1e-12}
)

nu_hat = np.exp(result.x[0]) + 2
mu_hat = result.x[1]
sigma_hat = np.exp(result.x[2])

print("\nStudent-t MLE (numerical):")
print(f" ν̂ (degrees of freedom) = {nu_hat:.3f}")
print(f" μ̂ = {mu_hat:.4f}")
print(f" σ̂ = {sigma_hat:.4f}")
print(f" Optimizer success: {result.success}")
print(f" Neg log-likelihood at MLE: {result.fun:.3f}")

return {'nu': nu_hat, 'mu': mu_hat, 'sigma': sigma_hat}


# Simulate heavy-tailed data
np.random.seed(42)
heavy_data = t_dist.rvs(df=4, loc=2.0, scale=1.5, size=500)
params = mle_student_t(heavy_data)

MLE for Linear Regression with Full Statistics

from sklearn.datasets import make_regression
import numpy as np


class MLELinearRegression:
"""
Linear regression via MLE under Gaussian noise assumption.
Provides:
- MLE weights (= OLS)
- MLE sigma^2 (biased) and unbiased sigma^2
- Standard errors via Fisher information matrix
- 95% confidence intervals for each coefficient
- Wald test p-values
"""

def __init__(self):
self.w_ = None
self.sigma2_ = None
self.se_ = None
self.ci95_ = None
self.p_values_ = None

def fit(self, X: np.ndarray, y: np.ndarray):
n, d = X.shape
# Augment with intercept
Xa = np.column_stack([np.ones(n), X]) # shape (n, d+1)

# MLE weights = OLS normal equation
self.w_ = np.linalg.lstsq(Xa, y, rcond=None)[0] # shape (d+1,)

# Residuals and sigma^2
resid = y - Xa @ self.w_
self.sigma2_ = np.mean(resid ** 2) # MLE (biased, /n)
sigma2_unb = np.sum(resid ** 2) / (n - d - 1) # unbiased, /n-d-1

# Fisher information matrix for w: I(w) = X'X / sigma^2
# Covariance of MLE: Cov(ŵ) ≈ sigma^2 * (X'X)^{-1}
# Using unbiased sigma^2 for better finite-sample CI coverage
gram = Xa.T @ Xa
cov_w = sigma2_unb * np.linalg.inv(gram)
self.se_ = np.sqrt(np.diag(cov_w))

# 95% CI: ŵ ± 1.96 * se (asymptotic normality of MLE)
z95 = 1.96
self.ci95_ = np.stack([self.w_ - z95 * self.se_,
self.w_ + z95 * self.se_], axis=1)

# Wald test: H0: w_j = 0, test stat = w_j / se_j ~ N(0,1)
z_stats = self.w_ / self.se_
from scipy import stats as sc_stats
self.p_values_ = 2 * (1 - sc_stats.norm.cdf(np.abs(z_stats)))

# Log-likelihood at MLE
ll = (-n / 2) * np.log(2 * np.pi * self.sigma2_) \
- (1 / (2 * self.sigma2_)) * np.sum(resid ** 2)

print("MLE Linear Regression:")
print(f" n={n}, d={d}")
print(f" sigma2_MLE (biased, /n): {self.sigma2_:.4f}")
print(f" sigma2_unb (unbiased, /n-d-1): {sigma2_unb:.4f}")
print(f" Log-likelihood at MLE: {ll:.3f}")
print()
print(f"{'Param':10s} {'Estimate':10s} {'Std Err':10s} {'95% CI':22s} {'p-value':8s}")
print("-" * 70)
names = ['intercept'] + [f'x{j+1}' for j in range(d)]
for i, name in enumerate(names):
ci_str = f"[{self.ci95_[i,0]:7.3f}, {self.ci95_[i,1]:7.3f}]"
sig = "***" if self.p_values_[i] < 0.001 else \
"**" if self.p_values_[i] < 0.01 else \
"*" if self.p_values_[i] < 0.05 else ""
print(f"{name:10s} {self.w_[i]:10.4f} {self.se_[i]:10.4f} {ci_str} {self.p_values_[i]:.5f} {sig}")
return self

def predict(self, X: np.ndarray) -> np.ndarray:
n = X.shape[0]
Xa = np.column_stack([np.ones(n), X])
return Xa @ self.w_

def log_likelihood(self, X: np.ndarray, y: np.ndarray) -> float:
resid = y - self.predict(X)
n = len(y)
return -(n/2)*np.log(2*np.pi*self.sigma2_) \
- (1/(2*self.sigma2_)) * np.sum(resid**2)


np.random.seed(42)
X_reg, y_reg = make_regression(n_samples=200, n_features=4,
noise=8.0, random_state=42)
mle_lr = MLELinearRegression().fit(X_reg, y_reg)

MLE vs MAP: Regularization as Bayesian Inference

MLE maximizes only the likelihood. Maximum A Posteriori (MAP) also incorporates a prior P(θ)P(\boldsymbol{\theta}):

θ^MAP=argmaxθP(yX;θ)likelihoodP(θ)prior=argmaxθ(θ)log-likelihood+logP(θ)log-prior\hat{\boldsymbol{\theta}}_\text{MAP} = \arg\max_{\boldsymbol{\theta}} \underbrace{P(\mathbf{y}|\mathbf{X};\boldsymbol{\theta})}_\text{likelihood} \cdot \underbrace{P(\boldsymbol{\theta})}_\text{prior} = \arg\max_{\boldsymbol{\theta}} \underbrace{\ell(\boldsymbol{\theta})}_\text{log-likelihood} + \underbrace{\log P(\boldsymbol{\theta})}_\text{log-prior}

Gaussian Prior → Ridge Regression (L2)

Assign prior wN(0,1λI)\mathbf{w} \sim \mathcal{N}\left(\mathbf{0},\, \frac{1}{\lambda}\mathbf{I}\right):

logP(w)=λ2w22+const\log P(\mathbf{w}) = -\frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \text{const}

MAP objective:

w^MAP=argminwyXw2NLL (Gaussian noise)+λw22negative log Gaussian prior\hat{\mathbf{w}}_\text{MAP} = \arg\min_\mathbf{w} \underbrace{\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2}_\text{NLL (Gaussian noise)} + \underbrace{\lambda\|\mathbf{w}\|_2^2}_\text{negative log Gaussian prior}

This is exactly Ridge regression. L2 regularization = MAP with Gaussian prior.

Laplace Prior → Lasso (L1)

Assign prior wjLaplace(0,1/λ)w_j \sim \text{Laplace}(0, 1/\lambda) independently:

logP(wj)=λwj+const\log P(w_j) = -\lambda|w_j| + \text{const}

MAP objective:

w^MAP=argminwyXw2+λw1\hat{\mathbf{w}}_\text{MAP} = \arg\min_\mathbf{w} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1

This is exactly Lasso. L1 regularization = MAP with Laplace prior. The Laplace distribution has sharp peaks at zero and heavy tails, which is why the Lasso prior promotes sparsity - zero is much more probable under the Laplace than under the Gaussian.

The MAP ↔ Regularization Correspondence

Prior on wjw_jLog-prior termRegularizerPenalty name
N(0,1/λ)\mathcal{N}(0, 1/\lambda)λwj2/2-\lambda w_j^2 / 2λw22\lambda\|\mathbf{w}\|_2^2L2 / Ridge
Laplace(0,1/λ)\text{Laplace}(0, 1/\lambda)$-\lambdaw_j$
Uniform (improper)00noneMLE = OLS
Elastic net prior$-\alpha\lambdaw_j- (1-\alpha)\lambda w_j^2$

As nn \to \infty, the prior is overwhelmed by the likelihood and MAP → MLE regardless of the prior chosen.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso


def compare_mle_map(X: np.ndarray, y: np.ndarray, lambda_: float = 1.0):
"""
Demonstrate MLE = OLS, MAP+Gaussian prior = Ridge, MAP+Laplace prior = Lasso.
"""
n, d = X.shape

# MLE (no prior)
mle = LinearRegression().fit(X, y)

# MAP + Gaussian prior = Ridge
# sklearn Ridge: minimizes ||y - Xw||^2 + alpha * ||w||^2
ridge = Ridge(alpha=lambda_).fit(X, y)

# MAP + Laplace prior = Lasso
lasso = Lasso(alpha=lambda_ / (2 * n), max_iter=5000).fit(X, y)

print(f"\nMLE vs MAP Comparison (lambda={lambda_}, n={n}, d={d}):")
print(f"{'Feature':10s} {'MLE/OLS':10s} {'MAP/Ridge':10s} {'MAP/Lasso':10s}")
print("-" * 50)
for j in range(min(d, 8)):
print(f"x{j+1:8s} {mle.coef_[j]:10.4f} {ridge.coef_[j]:10.4f} {lasso.coef_[j]:10.4f}")

print(f"\n Ridge l2 norm: {np.linalg.norm(ridge.coef_):.4f} "
f"(vs MLE: {np.linalg.norm(mle.coef_):.4f})")
print(f" Lasso zeros: {np.sum(lasso.coef_ == 0)} of {d} features zeroed out")


np.random.seed(42)
X_sparse, y_sparse = make_regression(n_samples=150, n_features=20,
n_informative=5, noise=10.0,
random_state=42)
compare_mle_map(X_sparse, y_sparse, lambda_=50.0)

Fisher Information and the Cramér-Rao Bound

The Score Function

The score function is the gradient of the log-likelihood with respect to θ\theta:

s(θ;y)=θlogP(yθ)s(\theta; y) = \frac{\partial}{\partial\theta}\log P(y|\theta)

Under regularity conditions, E[s(θ;y)]=0\mathbb{E}[s(\theta; y)] = 0 - the expected score is zero at the true parameter.

Fisher Information

The Fisher information measures how much information a single observation carries about θ\theta:

I(θ)=E[s(θ;y)2]=E[(logP(yθ)θ)2]\mathcal{I}(\theta) = \mathbb{E}\left[s(\theta; y)^2\right] = \mathbb{E}\left[\left(\frac{\partial\log P(y|\theta)}{\partial\theta}\right)^2\right]

Equivalently (second form, more useful for computation):

I(θ)=E[2logP(yθ)θ2]\mathcal{I}(\theta) = -\mathbb{E}\left[\frac{\partial^2\log P(y|\theta)}{\partial\theta^2}\right]

The second form says: Fisher information equals the expected curvature (negative Hessian) of the log-likelihood. High curvature = sharply peaked likelihood = precise estimation possible.

For nn i.i.d. samples, total Fisher information is nI(θ)n \cdot \mathcal{I}(\theta).

Cramér-Rao Lower Bound

For any unbiased estimator θ^\hat{\theta} of θ\theta:

Var(θ^)1nI(θ)\text{Var}(\hat{\theta}) \geq \frac{1}{n \cdot \mathcal{I}(\theta)}

This is the Cramér-Rao Lower Bound (CRLB). It establishes a fundamental floor on estimation variance - no unbiased estimator can do better.

The MLE is asymptotically efficient: as nn \to \infty, the MLE achieves the CRLB. No consistent estimator can be more efficient asymptotically.

Fisher Information: Concrete Examples

Gaussian (known σ2\sigma^2):

logP(yμ)=(yμ)22σ2+const\log P(y|\mu) = -\frac{(y-\mu)^2}{2\sigma^2} + \text{const}

2logPμ2=1σ2    I(μ)=1σ2\frac{\partial^2 \log P}{\partial\mu^2} = -\frac{1}{\sigma^2} \implies \mathcal{I}(\mu) = \frac{1}{\sigma^2}

CRLB: Var(μ^)σ2n\text{Var}(\hat{\mu}) \geq \frac{\sigma^2}{n}. The sample mean achieves exactly σ2/n\sigma^2/n - it is the UMVUE (uniformly minimum variance unbiased estimator).

Bernoulli:

logP(yp)=ylogp+(1y)log(1p)\log P(y|p) = y\log p + (1-y)\log(1-p)

2logPp2=yp21y(1p)2    I(p)=1p(1p)\frac{\partial^2 \log P}{\partial p^2} = -\frac{y}{p^2} - \frac{1-y}{(1-p)^2} \implies \mathcal{I}(p) = \frac{1}{p(1-p)}

CRLB: Var(p^)p(1p)n\text{Var}(\hat{p}) \geq \frac{p(1-p)}{n}. Sample proportion achieves exactly p(1p)/np(1-p)/n - efficient.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


def fisher_information_analysis():
"""
Demonstrate Fisher information, CRLB, and MLE efficiency.
"""
print("=" * 60)
print("Fisher Information and Cramér-Rao Bound")
print("=" * 60)

# --- Gaussian ---
sigma = 2.0
ns = [10, 50, 100, 500, 1000]

print(f"\nGaussian (sigma={sigma}): Fisher info I(mu) = 1/sigma^2 = {1/sigma**2:.4f}")
print(f"\n{'n':6s} {'CRLB std':10s} {'Simul std':10s} {'Ratio':8s}")
print("-" * 40)

for n in ns:
crlb_std = sigma / np.sqrt(n) # sqrt(1 / (n * I(mu)))
# Simulate 5000 experiments
samples = np.random.normal(3.5, sigma, size=(5000, n))
mle_means = samples.mean(axis=1)
sim_std = np.std(mle_means)
ratio = sim_std / crlb_std
print(f"{n:6d} {crlb_std:10.5f} {sim_std:10.5f} {ratio:8.4f}")

# --- Bernoulli ---
p_true = 0.3
I_bern = 1.0 / (p_true * (1 - p_true))
print(f"\nBernoulli (p={p_true}): Fisher info I(p) = 1/(p(1-p)) = {I_bern:.4f}")
print(f"\n{'n':6s} {'CRLB std':10s} {'Simul std':10s} {'Ratio':8s}")
print("-" * 40)

for n in ns:
crlb_std = np.sqrt(p_true * (1 - p_true) / n)
samples = np.random.binomial(1, p_true, size=(5000, n))
mle_props = samples.mean(axis=1)
sim_std = np.std(mle_props)
ratio = sim_std / crlb_std
print(f"{n:6d} {crlb_std:10.5f} {sim_std:10.5f} {ratio:8.4f}")

print("\nConclusion: ratio -> 1.0 as n grows - MLE achieves CRLB asymptotically.")


np.random.seed(42)
fisher_information_analysis()

The Observed Fisher Information Matrix

In practice, the Fisher information for a vector parameter θ\boldsymbol{\theta} (the Fisher information matrix) is approximated by the observed Fisher information - the negative Hessian of the log-likelihood at the MLE:

I^(θ^)=2(θ)θθθ=θ^\hat{\mathcal{I}}(\hat{\boldsymbol{\theta}}) = -\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^\top}\Bigg|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}}

The inverse gives the asymptotic covariance matrix of the MLE:

Cov(θ^)I^(θ^)1\text{Cov}(\hat{\boldsymbol{\theta}}) \approx \hat{\mathcal{I}}(\hat{\boldsymbol{\theta}})^{-1}

Standard errors = square root of diagonal entries. This is how statsmodels, R, and SciPy report confidence intervals for MLE-fit models.


Asymptotic Properties of MLE

Consistency

Under regularity conditions, θ^MLEpθ\hat{\boldsymbol{\theta}}_\text{MLE} \xrightarrow{p} \boldsymbol{\theta}^* as nn \to \infty. More data always improves MLE accuracy.

Asymptotic Normality

n(θ^MLEθ)dN(0,I(θ)1)\sqrt{n}\left(\hat{\boldsymbol{\theta}}_\text{MLE} - \boldsymbol{\theta}^*\right) \xrightarrow{d} \mathcal{N}\left(\mathbf{0},\, \mathcal{I}(\boldsymbol{\theta}^*)^{-1}\right)

This gives the theoretical basis for confidence intervals and hypothesis tests from MLE estimates. At large nn, CIs computed from the observed Fisher information are valid.

Invariance

If θ^MLE\hat{\theta}_\text{MLE} is the MLE of θ\theta, then g(θ^MLE)g(\hat{\theta}_\text{MLE}) is the MLE of g(θ)g(\theta) for any function gg.

Examples:

  • MLE of σ2\sigma^2 is σ^2=1n(yiyˉ)2\hat{\sigma}^2 = \frac{1}{n}\sum(y_i - \bar{y})^2, so MLE of σ\sigma is σ^2\sqrt{\hat{\sigma}^2}
  • MLE of pp is p^=k/n\hat{p} = k/n, so MLE of log-odds is log(p^/(1p^))\log(\hat{p}/(1-\hat{p}))

Asymptotic Efficiency

No consistent estimator can have lower asymptotic variance than the MLE. This is a strong result: the MLE extracts all the information from the data.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


def mle_asymptotic_normality(true_p: float = 0.4, n_values: list = None):
"""
Demonstrate asymptotic normality of MLE for Bernoulli p.
As n grows, the sampling distribution of p_hat converges to N(p, p(1-p)/n).
"""
if n_values is None:
n_values = [10, 30, 100, 500]

n_sim = 10_000
fig, axes = plt.subplots(1, len(n_values), figsize=(14, 4))

for ax, n in zip(axes, n_values):
# Simulate MLE sampling distribution
samples = np.random.binomial(1, true_p, size=(n_sim, n))
p_hats = samples.mean(axis=1)

# Theoretical asymptotic distribution
sigma_theory = np.sqrt(true_p * (1 - true_p) / n)
x_range = np.linspace(true_p - 4 * sigma_theory,
true_p + 4 * sigma_theory, 200)
density = stats.norm.pdf(x_range, loc=true_p, scale=sigma_theory)

ax.hist(p_hats, bins=50, density=True, alpha=0.6, color='steelblue',
label='Simulated MLE')
ax.plot(x_range, density, 'r-', lw=2, label='Asymptotic N')
ax.axvline(true_p, color='k', ls='--', alpha=0.5)
ax.set_title(f'n = {n}')
ax.set_xlabel('p̂')
ax.legend(fontsize=8)

plt.suptitle(f'Asymptotic Normality of Bernoulli MLE (true p = {true_p})',
fontsize=12)
plt.tight_layout()
plt.savefig('mle_asymptotic_normality.png', dpi=150)
plt.show()


np.random.seed(42)
mle_asymptotic_normality(true_p=0.4, n_values=[10, 30, 100, 500])

MLE vs ERM: The Theoretical Bridge

Empirical Risk Minimization (ERM) minimizes average loss over training data:

θ^ERM=argminθ1ni=1n(yi,f(xi;θ))\hat{\boldsymbol{\theta}}_\text{ERM} = \arg\min_{\boldsymbol{\theta}} \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(\mathbf{x}_i; \boldsymbol{\theta}))

MLE maximizes average log-likelihood:

θ^MLE=argmaxθ1ni=1nlogP(yixi;θ)\hat{\boldsymbol{\theta}}_\text{MLE} = \arg\max_{\boldsymbol{\theta}} \frac{1}{n}\sum_{i=1}^n \log P(y_i | \mathbf{x}_i; \boldsymbol{\theta})

These are the same when loss = negative log-likelihood: (y,y^)=logP(yy^)\ell(y, \hat{y}) = -\log P(y|\hat{y}).

Furthermore, MLE is equivalent to minimizing KL divergence between the empirical distribution p^\hat{p} and the model pθp_\theta:

θ^MLE=argminθDKL(p^(yx)pθ(yx))\hat{\boldsymbol{\theta}}_\text{MLE} = \arg\min_\theta D_\text{KL}\left(\hat{p}(y|\mathbf{x}) \,\|\, p_\theta(y|\mathbf{x})\right)

This is a profound connection: MLE finds the model that is closest (in KL divergence) to the true data distribution, as estimated from the training set.


Production Notes

:::tip Use numerical MLE when no closed form exists Most distributions used in production (Student-t, Beta, Negative Binomial) have no closed-form MLE. Use scipy.optimize.minimize with method='L-BFGS-B' and log-parameterize positive quantities. Always verify by checking the gradient at the solution: optimize.check_grad or manually. :::

:::warning MLE sigma^2 is biased for linear regression The MLE estimate σ^2=1ni(yiy^i)2\hat{\sigma}^2 = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2 is biased. For inference (confidence intervals, hypothesis tests), use the unbiased estimate s2=1nd1i(yiy^i)2s^2 = \frac{1}{n-d-1}\sum_i (y_i - \hat{y}_i)^2. The degrees-of-freedom correction nd1n-d-1 accounts for the fact that d+1d+1 parameters were estimated from the same data used to compute residuals. :::

:::danger Do not confuse likelihood and probability L(θy)=P(yθ)\mathcal{L}(\theta | y) = P(y | \theta) numerically, but they are different functions. Likelihood is not a probability distribution over θ\theta - it does not integrate to 1 and cannot be normalized to get a distribution (that is Bayesian posterior, not likelihood). Reporting L(θ^)\mathcal{L}(\hat\theta) as a "probability" of the parameter is wrong. :::

:::warning MSE for classification: wrong noise model Using MSE loss for binary classification assumes Gaussian residuals, which is misspecified. The outputs are probabilities in [0,1][0,1] but MSE assigns no penalty for predictions outside [0,1][0,1], leads to saturated gradients with sigmoid activation, and produces a non-convex objective. Use cross-entropy (correct Bernoulli NLL) instead. :::


YouTube Resources

TitleChannelWhy Watch
Maximum Likelihood For the Normal Distribution, step-by-stepStatQuestExcellent visual derivation of Gaussian MLE
MLE and MAP estimationAurelien GeronConnects MLE to neural network training
Fisher Information and the Cramér-Rao Lower BoundMathematical MonkRigorous CRLB derivation
Bayesian vs Frequentist Statistics3Blue1BrownConceptual foundation, MLE vs MAP
Likelihood and Log-LikelihoodBrandon FoltzClear introduction for practitioners

Interview Q&A

Q1: Explain why the Gaussian noise assumption leads to OLS, and why Bernoulli leads to cross-entropy.

The key insight is that the loss function is the negative log-likelihood of the noise model. If we assume yi=wxi+ϵiy_i = \mathbf{w}^\top\mathbf{x}_i + \epsilon_i with ϵiN(0,σ2)\epsilon_i \sim \mathcal{N}(0,\sigma^2), then P(yixi,w)=12πσ2exp((yiwxi)22σ2)P(y_i|\mathbf{x}_i,\mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - \mathbf{w}^\top\mathbf{x}_i)^2}{2\sigma^2}\right). Taking the log and summing: =n2log(2πσ2)12σ2i(yiwxi)2\ell = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i(y_i - \mathbf{w}^\top\mathbf{x}_i)^2. Maximizing over w\mathbf{w} means minimizing i(yiy^i)2\sum_i(y_i - \hat{y}_i)^2 - OLS. For binary classification, yiBernoulli(y^i)y_i \sim \text{Bernoulli}(\hat{y}_i), so P(yiy^i)=y^iyi(1y^i)1yiP(y_i|\hat{y}_i) = \hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i}. Log-likelihood: i[yilogy^i+(1yi)log(1y^i)]\sum_i[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]. Maximizing this is minimizing binary cross-entropy. The loss function is not an arbitrary design choice - it is the negative log-likelihood of the correct generative model.

Q2: What is the Cramér-Rao bound and why does it matter for ML?

The CRLB is Var(θ^)1/(nI(θ))\text{Var}(\hat\theta) \geq 1/(n \cdot \mathcal{I}(\theta)) for any unbiased estimator θ^\hat\theta. It sets a fundamental lower bound on estimation variance - no matter how clever your estimator, you cannot beat this. The MLE achieves this bound asymptotically, making it the most efficient estimator (among consistent unbiased estimators) for large nn. In ML practice: (1) CRLB explains why more data always helps - the bound tightens as nn grows; (2) The inverse Fisher information gives asymptotic standard errors for MLE, used for confidence intervals and p-values in statsmodels/R; (3) CRLB is the theoretical justification for using MLE rather than some ad-hoc estimator - it is statistically optimal.

Q3: How does L2 regularization relate to MAP estimation, and what prior does it correspond to?

L2 regularization (Ridge) = MAP estimation with a Gaussian prior P(wj)=N(0,1/λ)P(w_j) = \mathcal{N}(0, 1/\lambda) on each weight. Proof: MAP objective is logP(yX,w)+logP(w)\log P(\mathbf{y}|\mathbf{X},\mathbf{w}) + \log P(\mathbf{w}). Substituting Gaussian NLL and Gaussian prior: 12σ2yXw2λ2w2-\frac{1}{2\sigma^2}\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 - \frac{\lambda}{2}\|\mathbf{w}\|^2. Minimizing the negative = argminwyXw2+λw2\arg\min_\mathbf{w} \|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2 - exactly Ridge. The regularization parameter λ\lambda is the precision of the Gaussian prior: large λ\lambda = strong prior belief that weights are near zero = heavy regularization. L1 regularization corresponds to a Laplace prior, which has heavier tails and a sharper peak at zero - hence promoting sparsity.

Q4: What is Fisher information intuitively, and how is it related to the curvature of the log-likelihood?

Fisher information I(θ)\mathcal{I}(\theta) measures how much information each data point carries about the parameter θ\theta. Intuitively: if you perturb θ\theta slightly, how much does the likelihood change? High Fisher information = the likelihood is sensitive to θ\theta = we can pin down θ\theta precisely from the data. Formally, I(θ)=E[2logP/θ2]\mathcal{I}(\theta) = -\mathbb{E}[\partial^2\log P/\partial\theta^2] = expected negative Hessian of log-likelihood. High Fisher information = high curvature = narrow, peaked log-likelihood surface = small estimation uncertainty. This is exactly the connection to the CRLB: high curvature means the log-likelihood concentrates around the true θ\theta, allowing very low variance estimates. For the Gaussian mean: I(μ)=1/σ2\mathcal{I}(\mu) = 1/\sigma^2 - smaller variance → higher information → more precise estimation.

Q5: Why does MLE sometimes overfit, and what is the connection to the bias-variance tradeoff?

MLE is a point estimate that maximizes probability of the training data. For complex models with many parameters, MLE can perfectly fit the training data while generalizing poorly - a classic overfitting pattern. This is because MLE is equivalent to ERM (empirical risk minimization), which optimizes on the training distribution, not the test distribution. The MLE has zero bias asymptotically (consistent) but high variance for small nn. MAP estimation (MLE + prior) adds a regularization term that introduces a small bias in exchange for lower variance - the bias-variance tradeoff. As nn \to \infty, the prior is overwhelmed by data and MAP → MLE. For small nn, a carefully chosen prior (e.g., Gaussian for Ridge, Laplace for Lasso) dramatically reduces variance with minimal bias cost, improving generalization.

Q6: You're fitting a model to customer purchase counts per day. You fit a Poisson regression and find the variance of counts is 8× the mean. What does this tell you and how do you fix it?

Poisson regression assumes equidispersion: Var(y)=μ\text{Var}(y) = \mu. If the observed variance/mean ratio is 8, the data is overdispersed - the Poisson assumption is violated. This typically arises from unobserved heterogeneity (some customers are systematically high-purchasers, not captured by your features) or clustered events. Consequences: the Poisson likelihood is misspecified, which means MLE gives inconsistent estimates of coefficient standard errors (they are too small), leading to artificially narrow confidence intervals and inflated significance. Fix: (1) Negative Binomial regression - adds a per-observation random effect with variance Var(y)=μ+αμ2\text{Var}(y) = \mu + \alpha\mu^2, fitting α\alpha from data; (2) Quasi-Poisson - uses Poisson point estimates but inflates standard errors by dispersion\sqrt{\text{dispersion}}; (3) Add features to absorb heterogeneity. Model selection: compare Poisson AIC vs Negative Binomial AIC - lower is better.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the MLE & MAP Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.