Maximum Likelihood Estimation
The Real Interview Moment
You are in the final round at a top-tier AI lab. The interviewer slides a sheet of paper across the table: "We have a neural network trained with MSE loss on a classification task. Accuracy is 92%. Our competitor uses cross-entropy and gets 93%. Explain exactly why cross-entropy is better - mathematically, not intuitively."
You pause. You know cross-entropy is "better for classification," but do you know why? The honest answer requires tracing a chain of reasoning from probability theory all the way to gradient flow. MSE assumes the residuals are Gaussian. Classification residuals are Bernoulli, not Gaussian. When you use MSE for classification, you are solving the wrong optimization problem. Cross-entropy is the correct loss because it is the negative log-likelihood of the Bernoulli distribution. Using the right probabilistic model for your data is not a heuristic; it is a theorem.
This is the subject of Maximum Likelihood Estimation. Every loss function in machine learning is either the negative log-likelihood of some noise model, or it is an approximation to one. MSE is the NLL of Gaussian noise. Cross-entropy is the NLL of Bernoulli noise. Huber loss approximates the NLL of a distribution with Gaussian body and Laplace tails. Knowing this unifies everything - regression, classification, density estimation, survival analysis - under a single framework.
The principle is deceptively simple: given observed data, find the parameters that make the data most probable. Yet the implications cascade through the entire field. It explains why regularization works (MAP estimation with a prior). It quantifies the fundamental limits of estimation (Cramér-Rao bound). It justifies the use of gradient descent on loss functions (because maximizing likelihood is the same as minimizing average loss). And it gives you the vocabulary to design new loss functions for new problems from first principles.
This lesson derives MLE carefully, connects it to every major model you have already seen, builds the Fisher information framework, and ends with the MAP duality that explains L1 and L2 regularization as Bayesian inference. After this lesson, you will never wonder why a loss function looks the way it does.
Why This Exists - The Problem Before MLE
Before MLE was formalized, statisticians fit curves by minimizing "error" in ad hoc ways. The method of least squares (Gauss, Legendre, ~1805) was proposed purely because it led to tractable algebra. Nobody asked: "Under what probabilistic assumptions does minimizing squared error give the best estimates?"
The uncomfortable truth is that least squares is optimal only under Gaussian noise. For count data, binary data, survival times, or heavy-tailed measurements, squared error is a poor choice - it assigns high loss to large residuals that are actually probable under the true data distribution.
Fisher (1912–1922) formalized MLE as a general estimation principle: rather than defining "best fit" geometrically, define it probabilistically. The best parameters are those that assign the highest probability to the data you actually observed. This immediately tells you:
- For Gaussian noise → minimize squared errors (MSE)
- For Bernoulli outputs → minimize cross-entropy
- For Poisson counts → minimize Poisson deviance
- For heavy-tailed noise → minimize absolute errors (MAE)
The loss function is not a design choice - it is derived from the data distribution.
Historical Context
| Year | Milestone |
|---|---|
| 1805 | Legendre proposes method of least squares (no probabilistic justification) |
| 1809 | Gauss derives OLS from Gaussian error assumption - first proto-MLE argument |
| 1912 | R.A. Fisher introduces "likelihood" as distinct from probability |
| 1922 | Fisher proves MLE is asymptotically efficient (achieves Cramér-Rao bound) |
| 1945–46 | Cramér and Rao independently derive the variance lower bound |
| 1972 | Nelder & Wedderburn formalize GLMs - MLE applied to exponential family |
| 1986 | Rumelhart et al. backprop = gradient of cross-entropy NLL |
| 2010s | Deep learning: every modern loss function is a (negative) log-likelihood |
Fisher's key insight: probability is a function of data given fixed parameters; likelihood is the same function viewed as varying over parameters given fixed data. They are numerically equal but conceptually opposite. We want to find the parameter value that makes the observed data most probable.
The MLE Principle: Probability vs Likelihood
For a data point and parameter :
- : probability of observing if the true parameter is - a function of , fixed
- : likelihood of given we observed - same number, but viewed as a function of , fixed
This distinction matters deeply. A likelihood is not a probability distribution over - it need not integrate to 1. It is just a measure of how well a parameter value explains the data.
Given i.i.d. observations :
The log-likelihood (monotone transformation, same maximizer):
The MLE is:
The log converts products to sums, making both computation and differentiation tractable.
MLE for Common Distributions
Gaussian: Recovering Sample Mean and Variance
Given i.i.d. observations from :
Log-likelihood:
MLE for : set :
The sample mean is the MLE for Gaussian location.
MLE for : set :
Note: this is the biased estimator (divides by , not ). MLE is slightly biased for in finite samples. The unbiased estimator is not the MLE.
Bernoulli: Recovering Sample Proportion
Given i.i.d. binary observations from :
Log-likelihood:
where is the number of successes.
MLE: set :
The sample proportion is the MLE for Bernoulli parameter - the most natural estimator, justified by MLE.
Poisson: Recovering Sample Mean
Given i.i.d. count observations from :
Log-likelihood:
MLE:
For Gaussian, Bernoulli, and Poisson alike: the MLE is the sample mean. This generalizes to all one-parameter exponential family distributions - the MLE is always the empirical average of the sufficient statistic.
MLE for Linear Regression → Derives OLS
The Gaussian noise model for regression:
implies:
Log-likelihood over samples:
Maximizing over is equivalent to minimizing:
This is exactly OLS. The Gaussian noise assumption uniquely derives the squared loss. The insight works in reverse too: whenever you use MSE, you are implicitly assuming your residuals are i.i.d. Gaussian.
MLE for Logistic Regression → Derives Cross-Entropy
Binary label modeled as Bernoulli:
Log-likelihood:
Maximizing is equivalent to minimizing , the binary cross-entropy:
Cross-entropy is the NLL of Bernoulli distribution. The Bernoulli assumption derives cross-entropy; the Gaussian assumption derives MSE. The pattern is universal.
The NLL ↔ Loss Function Dictionary
| Data distribution | Noise model | Negative log-likelihood | Loss name |
|---|---|---|---|
| Linear regression | Gaussian | MSE / squared loss | |
| Binary classification | Bernoulli | Binary cross-entropy | |
| Multiclass | Categorical | Categorical cross-entropy | |
| Count prediction | Poisson | Poisson deviance | |
| Robust regression | Laplace | MAE / L1 loss | |
| Heavy-tailed | Cauchy | Log-loss (robust) |
Choosing a loss function is choosing a noise model. Get the noise model right and the loss function follows.
Python: MLE From Scratch
Gaussian MLE with Likelihood Surface
import numpy as np
from scipy import stats
from scipy.optimize import minimize
import matplotlib.pyplot as plt
def mle_gaussian_full(data: np.ndarray) -> tuple[float, float]:
"""
MLE for Gaussian: find mu, sigma maximizing P(data | mu, sigma).
Closed-form MLE:
mu_hat = mean(data)
sigma_hat = std(data, ddof=0) # divide by n, not n-1
"""
n = len(data)
mu_hat = np.mean(data)
sigma_hat = np.std(data, ddof=0) # MLE: biased, divides by n
# Numerical verification via scipy.optimize
def neg_log_likelihood(params):
mu, log_sigma = params
sigma = np.exp(log_sigma) # log-parameterize to keep sigma > 0
return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))
result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method='L-BFGS-B')
mu_num = result.x[0]
sigma_num = np.exp(result.x[1])
# Compute log-likelihood at MLE
ll_at_mle = np.sum(stats.norm.logpdf(data, loc=mu_hat, scale=sigma_hat))
print("Gaussian MLE Results:")
print(f" n = {n}")
print(f" Closed-form: mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
print(f" Numerical: mu = {mu_num:.4f}, sigma = {sigma_num:.4f}")
print(f" Unbiased std (ddof=1): {np.std(data, ddof=1):.4f} (NOT the MLE)")
print(f" Log-likelihood at MLE: {ll_at_mle:.3f}")
# Visualize likelihood surface
mu_grid = np.linspace(mu_hat - 1.5, mu_hat + 1.5, 80)
sig_grid = np.linspace(max(0.1, sigma_hat - 1.0), sigma_hat + 1.0, 80)
MU, SIG = np.meshgrid(mu_grid, sig_grid)
LL = np.vectorize(
lambda m, s: np.sum(stats.norm.logpdf(data, loc=m, scale=s))
)(MU, SIG)
plt.figure(figsize=(7, 5))
cs = plt.contourf(MU, SIG, LL, levels=25, cmap='Blues')
plt.colorbar(cs, label='Log-Likelihood')
plt.plot(mu_hat, sigma_hat, 'r*', markersize=16, label='MLE')
plt.xlabel('$\\mu$'); plt.ylabel('$\\sigma$')
plt.title('Gaussian Log-Likelihood Surface')
plt.legend()
plt.tight_layout()
plt.savefig('gaussian_mle_surface.png', dpi=150)
plt.show()
return mu_hat, sigma_hat
np.random.seed(42)
true_mu, true_sigma = 3.5, 1.2
data = np.random.normal(true_mu, true_sigma, size=500)
mu_hat, sigma_hat = mle_gaussian_full(data)
Bernoulli and Poisson MLE
def mle_bernoulli(data: np.ndarray) -> float:
"""
MLE for Bernoulli parameter p.
Closed-form: p_hat = mean(data)
"""
n = len(data)
k = np.sum(data)
p_hat = k / n
# Plot log-likelihood as a function of p
p_vals = np.linspace(0.01, 0.99, 300)
log_lik = k * np.log(p_vals) + (n - k) * np.log(1 - p_vals)
plt.figure(figsize=(7, 4))
plt.plot(p_vals, log_lik, 'b-', lw=2)
plt.axvline(p_hat, color='r', ls='--', label=f'MLE p̂ = {p_hat:.3f}')
plt.xlabel('p'); plt.ylabel('Log-Likelihood ℓ(p)')
plt.title('Bernoulli Log-Likelihood')
plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('bernoulli_mle.png', dpi=150)
plt.show()
print(f"\nBernoulli MLE: p̂ = k/n = {k}/{n} = {p_hat:.4f}")
return p_hat
def mle_poisson(data: np.ndarray) -> float:
"""
MLE for Poisson rate lambda.
Closed-form: lambda_hat = mean(data)
"""
lambda_hat = np.mean(data)
lambda_vals = np.linspace(0.5, 2 * lambda_hat, 200)
log_lik = np.sum(data) * np.log(lambda_vals) - len(data) * lambda_vals
plt.figure(figsize=(7, 4))
plt.plot(lambda_vals, log_lik, 'g-', lw=2)
plt.axvline(lambda_hat, color='r', ls='--', label=f'MLE λ̂ = {lambda_hat:.3f}')
plt.xlabel('λ'); plt.ylabel('Log-Likelihood ℓ(λ)')
plt.title('Poisson Log-Likelihood')
plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('poisson_mle.png', dpi=150)
plt.show()
print(f"\nPoisson MLE: λ̂ = mean = {lambda_hat:.4f}")
return lambda_hat
np.random.seed(42)
binary_data = np.random.binomial(1, p=0.35, size=1000)
count_data = np.random.poisson(lam=4.2, size=800)
p_hat = mle_bernoulli(binary_data)
lambda_hat = mle_poisson(count_data)
Numerical MLE for Custom Distributions
from scipy.optimize import minimize
from scipy.stats import t as t_dist
def mle_student_t(data: np.ndarray) -> dict:
"""
MLE for Student-t distribution: fit (nu, mu, sigma) numerically.
No closed-form solution exists - must use numerical optimization.
Shows how MLE applies to any distribution.
"""
def neg_log_likelihood(params):
log_nu, mu, log_sigma = params
nu = np.exp(log_nu) + 2 # degrees of freedom > 2
sigma = np.exp(log_sigma)
# scipy t-dist: scale = sigma, loc = mu
return -np.sum(t_dist.logpdf(data, df=nu, loc=mu, scale=sigma))
# Initial guess: Gaussian-like starting point
x0 = [np.log(5.0), np.mean(data), np.log(np.std(data))]
result = minimize(
neg_log_likelihood,
x0=x0,
method='L-BFGS-B',
options={'maxiter': 1000, 'ftol': 1e-12}
)
nu_hat = np.exp(result.x[0]) + 2
mu_hat = result.x[1]
sigma_hat = np.exp(result.x[2])
print("\nStudent-t MLE (numerical):")
print(f" ν̂ (degrees of freedom) = {nu_hat:.3f}")
print(f" μ̂ = {mu_hat:.4f}")
print(f" σ̂ = {sigma_hat:.4f}")
print(f" Optimizer success: {result.success}")
print(f" Neg log-likelihood at MLE: {result.fun:.3f}")
return {'nu': nu_hat, 'mu': mu_hat, 'sigma': sigma_hat}
# Simulate heavy-tailed data
np.random.seed(42)
heavy_data = t_dist.rvs(df=4, loc=2.0, scale=1.5, size=500)
params = mle_student_t(heavy_data)
MLE for Linear Regression with Full Statistics
from sklearn.datasets import make_regression
import numpy as np
class MLELinearRegression:
"""
Linear regression via MLE under Gaussian noise assumption.
Provides:
- MLE weights (= OLS)
- MLE sigma^2 (biased) and unbiased sigma^2
- Standard errors via Fisher information matrix
- 95% confidence intervals for each coefficient
- Wald test p-values
"""
def __init__(self):
self.w_ = None
self.sigma2_ = None
self.se_ = None
self.ci95_ = None
self.p_values_ = None
def fit(self, X: np.ndarray, y: np.ndarray):
n, d = X.shape
# Augment with intercept
Xa = np.column_stack([np.ones(n), X]) # shape (n, d+1)
# MLE weights = OLS normal equation
self.w_ = np.linalg.lstsq(Xa, y, rcond=None)[0] # shape (d+1,)
# Residuals and sigma^2
resid = y - Xa @ self.w_
self.sigma2_ = np.mean(resid ** 2) # MLE (biased, /n)
sigma2_unb = np.sum(resid ** 2) / (n - d - 1) # unbiased, /n-d-1
# Fisher information matrix for w: I(w) = X'X / sigma^2
# Covariance of MLE: Cov(ŵ) ≈ sigma^2 * (X'X)^{-1}
# Using unbiased sigma^2 for better finite-sample CI coverage
gram = Xa.T @ Xa
cov_w = sigma2_unb * np.linalg.inv(gram)
self.se_ = np.sqrt(np.diag(cov_w))
# 95% CI: ŵ ± 1.96 * se (asymptotic normality of MLE)
z95 = 1.96
self.ci95_ = np.stack([self.w_ - z95 * self.se_,
self.w_ + z95 * self.se_], axis=1)
# Wald test: H0: w_j = 0, test stat = w_j / se_j ~ N(0,1)
z_stats = self.w_ / self.se_
from scipy import stats as sc_stats
self.p_values_ = 2 * (1 - sc_stats.norm.cdf(np.abs(z_stats)))
# Log-likelihood at MLE
ll = (-n / 2) * np.log(2 * np.pi * self.sigma2_) \
- (1 / (2 * self.sigma2_)) * np.sum(resid ** 2)
print("MLE Linear Regression:")
print(f" n={n}, d={d}")
print(f" sigma2_MLE (biased, /n): {self.sigma2_:.4f}")
print(f" sigma2_unb (unbiased, /n-d-1): {sigma2_unb:.4f}")
print(f" Log-likelihood at MLE: {ll:.3f}")
print()
print(f"{'Param':10s} {'Estimate':10s} {'Std Err':10s} {'95% CI':22s} {'p-value':8s}")
print("-" * 70)
names = ['intercept'] + [f'x{j+1}' for j in range(d)]
for i, name in enumerate(names):
ci_str = f"[{self.ci95_[i,0]:7.3f}, {self.ci95_[i,1]:7.3f}]"
sig = "***" if self.p_values_[i] < 0.001 else \
"**" if self.p_values_[i] < 0.01 else \
"*" if self.p_values_[i] < 0.05 else ""
print(f"{name:10s} {self.w_[i]:10.4f} {self.se_[i]:10.4f} {ci_str} {self.p_values_[i]:.5f} {sig}")
return self
def predict(self, X: np.ndarray) -> np.ndarray:
n = X.shape[0]
Xa = np.column_stack([np.ones(n), X])
return Xa @ self.w_
def log_likelihood(self, X: np.ndarray, y: np.ndarray) -> float:
resid = y - self.predict(X)
n = len(y)
return -(n/2)*np.log(2*np.pi*self.sigma2_) \
- (1/(2*self.sigma2_)) * np.sum(resid**2)
np.random.seed(42)
X_reg, y_reg = make_regression(n_samples=200, n_features=4,
noise=8.0, random_state=42)
mle_lr = MLELinearRegression().fit(X_reg, y_reg)
MLE vs MAP: Regularization as Bayesian Inference
MLE maximizes only the likelihood. Maximum A Posteriori (MAP) also incorporates a prior :
Gaussian Prior → Ridge Regression (L2)
Assign prior :
MAP objective:
This is exactly Ridge regression. L2 regularization = MAP with Gaussian prior.
Laplace Prior → Lasso (L1)
Assign prior independently:
MAP objective:
This is exactly Lasso. L1 regularization = MAP with Laplace prior. The Laplace distribution has sharp peaks at zero and heavy tails, which is why the Lasso prior promotes sparsity - zero is much more probable under the Laplace than under the Gaussian.
The MAP ↔ Regularization Correspondence
| Prior on | Log-prior term | Regularizer | Penalty name |
|---|---|---|---|
| L2 / Ridge | |||
| $-\lambda | w_j | $ | |
| Uniform (improper) | none | MLE = OLS | |
| Elastic net prior | $-\alpha\lambda | w_j | - (1-\alpha)\lambda w_j^2$ |
As , the prior is overwhelmed by the likelihood and MAP → MLE regardless of the prior chosen.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
def compare_mle_map(X: np.ndarray, y: np.ndarray, lambda_: float = 1.0):
"""
Demonstrate MLE = OLS, MAP+Gaussian prior = Ridge, MAP+Laplace prior = Lasso.
"""
n, d = X.shape
# MLE (no prior)
mle = LinearRegression().fit(X, y)
# MAP + Gaussian prior = Ridge
# sklearn Ridge: minimizes ||y - Xw||^2 + alpha * ||w||^2
ridge = Ridge(alpha=lambda_).fit(X, y)
# MAP + Laplace prior = Lasso
lasso = Lasso(alpha=lambda_ / (2 * n), max_iter=5000).fit(X, y)
print(f"\nMLE vs MAP Comparison (lambda={lambda_}, n={n}, d={d}):")
print(f"{'Feature':10s} {'MLE/OLS':10s} {'MAP/Ridge':10s} {'MAP/Lasso':10s}")
print("-" * 50)
for j in range(min(d, 8)):
print(f"x{j+1:8s} {mle.coef_[j]:10.4f} {ridge.coef_[j]:10.4f} {lasso.coef_[j]:10.4f}")
print(f"\n Ridge l2 norm: {np.linalg.norm(ridge.coef_):.4f} "
f"(vs MLE: {np.linalg.norm(mle.coef_):.4f})")
print(f" Lasso zeros: {np.sum(lasso.coef_ == 0)} of {d} features zeroed out")
np.random.seed(42)
X_sparse, y_sparse = make_regression(n_samples=150, n_features=20,
n_informative=5, noise=10.0,
random_state=42)
compare_mle_map(X_sparse, y_sparse, lambda_=50.0)
Fisher Information and the Cramér-Rao Bound
The Score Function
The score function is the gradient of the log-likelihood with respect to :
Under regularity conditions, - the expected score is zero at the true parameter.
Fisher Information
The Fisher information measures how much information a single observation carries about :
Equivalently (second form, more useful for computation):
The second form says: Fisher information equals the expected curvature (negative Hessian) of the log-likelihood. High curvature = sharply peaked likelihood = precise estimation possible.
For i.i.d. samples, total Fisher information is .
Cramér-Rao Lower Bound
For any unbiased estimator of :
This is the Cramér-Rao Lower Bound (CRLB). It establishes a fundamental floor on estimation variance - no unbiased estimator can do better.
The MLE is asymptotically efficient: as , the MLE achieves the CRLB. No consistent estimator can be more efficient asymptotically.
Fisher Information: Concrete Examples
Gaussian (known ):
CRLB: . The sample mean achieves exactly - it is the UMVUE (uniformly minimum variance unbiased estimator).
Bernoulli:
CRLB: . Sample proportion achieves exactly - efficient.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def fisher_information_analysis():
"""
Demonstrate Fisher information, CRLB, and MLE efficiency.
"""
print("=" * 60)
print("Fisher Information and Cramér-Rao Bound")
print("=" * 60)
# --- Gaussian ---
sigma = 2.0
ns = [10, 50, 100, 500, 1000]
print(f"\nGaussian (sigma={sigma}): Fisher info I(mu) = 1/sigma^2 = {1/sigma**2:.4f}")
print(f"\n{'n':6s} {'CRLB std':10s} {'Simul std':10s} {'Ratio':8s}")
print("-" * 40)
for n in ns:
crlb_std = sigma / np.sqrt(n) # sqrt(1 / (n * I(mu)))
# Simulate 5000 experiments
samples = np.random.normal(3.5, sigma, size=(5000, n))
mle_means = samples.mean(axis=1)
sim_std = np.std(mle_means)
ratio = sim_std / crlb_std
print(f"{n:6d} {crlb_std:10.5f} {sim_std:10.5f} {ratio:8.4f}")
# --- Bernoulli ---
p_true = 0.3
I_bern = 1.0 / (p_true * (1 - p_true))
print(f"\nBernoulli (p={p_true}): Fisher info I(p) = 1/(p(1-p)) = {I_bern:.4f}")
print(f"\n{'n':6s} {'CRLB std':10s} {'Simul std':10s} {'Ratio':8s}")
print("-" * 40)
for n in ns:
crlb_std = np.sqrt(p_true * (1 - p_true) / n)
samples = np.random.binomial(1, p_true, size=(5000, n))
mle_props = samples.mean(axis=1)
sim_std = np.std(mle_props)
ratio = sim_std / crlb_std
print(f"{n:6d} {crlb_std:10.5f} {sim_std:10.5f} {ratio:8.4f}")
print("\nConclusion: ratio -> 1.0 as n grows - MLE achieves CRLB asymptotically.")
np.random.seed(42)
fisher_information_analysis()
The Observed Fisher Information Matrix
In practice, the Fisher information for a vector parameter (the Fisher information matrix) is approximated by the observed Fisher information - the negative Hessian of the log-likelihood at the MLE:
The inverse gives the asymptotic covariance matrix of the MLE:
Standard errors = square root of diagonal entries. This is how statsmodels, R, and SciPy report confidence intervals for MLE-fit models.
Asymptotic Properties of MLE
Consistency
Under regularity conditions, as . More data always improves MLE accuracy.
Asymptotic Normality
This gives the theoretical basis for confidence intervals and hypothesis tests from MLE estimates. At large , CIs computed from the observed Fisher information are valid.
Invariance
If is the MLE of , then is the MLE of for any function .
Examples:
- MLE of is , so MLE of is
- MLE of is , so MLE of log-odds is
Asymptotic Efficiency
No consistent estimator can have lower asymptotic variance than the MLE. This is a strong result: the MLE extracts all the information from the data.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def mle_asymptotic_normality(true_p: float = 0.4, n_values: list = None):
"""
Demonstrate asymptotic normality of MLE for Bernoulli p.
As n grows, the sampling distribution of p_hat converges to N(p, p(1-p)/n).
"""
if n_values is None:
n_values = [10, 30, 100, 500]
n_sim = 10_000
fig, axes = plt.subplots(1, len(n_values), figsize=(14, 4))
for ax, n in zip(axes, n_values):
# Simulate MLE sampling distribution
samples = np.random.binomial(1, true_p, size=(n_sim, n))
p_hats = samples.mean(axis=1)
# Theoretical asymptotic distribution
sigma_theory = np.sqrt(true_p * (1 - true_p) / n)
x_range = np.linspace(true_p - 4 * sigma_theory,
true_p + 4 * sigma_theory, 200)
density = stats.norm.pdf(x_range, loc=true_p, scale=sigma_theory)
ax.hist(p_hats, bins=50, density=True, alpha=0.6, color='steelblue',
label='Simulated MLE')
ax.plot(x_range, density, 'r-', lw=2, label='Asymptotic N')
ax.axvline(true_p, color='k', ls='--', alpha=0.5)
ax.set_title(f'n = {n}')
ax.set_xlabel('p̂')
ax.legend(fontsize=8)
plt.suptitle(f'Asymptotic Normality of Bernoulli MLE (true p = {true_p})',
fontsize=12)
plt.tight_layout()
plt.savefig('mle_asymptotic_normality.png', dpi=150)
plt.show()
np.random.seed(42)
mle_asymptotic_normality(true_p=0.4, n_values=[10, 30, 100, 500])
MLE vs ERM: The Theoretical Bridge
Empirical Risk Minimization (ERM) minimizes average loss over training data:
MLE maximizes average log-likelihood:
These are the same when loss = negative log-likelihood: .
Furthermore, MLE is equivalent to minimizing KL divergence between the empirical distribution and the model :
This is a profound connection: MLE finds the model that is closest (in KL divergence) to the true data distribution, as estimated from the training set.
Production Notes
:::tip Use numerical MLE when no closed form exists
Most distributions used in production (Student-t, Beta, Negative Binomial) have no closed-form MLE. Use scipy.optimize.minimize with method='L-BFGS-B' and log-parameterize positive quantities. Always verify by checking the gradient at the solution: optimize.check_grad or manually.
:::
:::warning MLE sigma^2 is biased for linear regression The MLE estimate is biased. For inference (confidence intervals, hypothesis tests), use the unbiased estimate . The degrees-of-freedom correction accounts for the fact that parameters were estimated from the same data used to compute residuals. :::
:::danger Do not confuse likelihood and probability numerically, but they are different functions. Likelihood is not a probability distribution over - it does not integrate to 1 and cannot be normalized to get a distribution (that is Bayesian posterior, not likelihood). Reporting as a "probability" of the parameter is wrong. :::
:::warning MSE for classification: wrong noise model Using MSE loss for binary classification assumes Gaussian residuals, which is misspecified. The outputs are probabilities in but MSE assigns no penalty for predictions outside , leads to saturated gradients with sigmoid activation, and produces a non-convex objective. Use cross-entropy (correct Bernoulli NLL) instead. :::
YouTube Resources
| Title | Channel | Why Watch |
|---|---|---|
| Maximum Likelihood For the Normal Distribution, step-by-step | StatQuest | Excellent visual derivation of Gaussian MLE |
| MLE and MAP estimation | Aurelien Geron | Connects MLE to neural network training |
| Fisher Information and the Cramér-Rao Lower Bound | Mathematical Monk | Rigorous CRLB derivation |
| Bayesian vs Frequentist Statistics | 3Blue1Brown | Conceptual foundation, MLE vs MAP |
| Likelihood and Log-Likelihood | Brandon Foltz | Clear introduction for practitioners |
Interview Q&A
Q1: Explain why the Gaussian noise assumption leads to OLS, and why Bernoulli leads to cross-entropy.
The key insight is that the loss function is the negative log-likelihood of the noise model. If we assume with , then . Taking the log and summing: . Maximizing over means minimizing - OLS. For binary classification, , so . Log-likelihood: . Maximizing this is minimizing binary cross-entropy. The loss function is not an arbitrary design choice - it is the negative log-likelihood of the correct generative model.
Q2: What is the Cramér-Rao bound and why does it matter for ML?
The CRLB is for any unbiased estimator . It sets a fundamental lower bound on estimation variance - no matter how clever your estimator, you cannot beat this. The MLE achieves this bound asymptotically, making it the most efficient estimator (among consistent unbiased estimators) for large . In ML practice: (1) CRLB explains why more data always helps - the bound tightens as grows; (2) The inverse Fisher information gives asymptotic standard errors for MLE, used for confidence intervals and p-values in statsmodels/R; (3) CRLB is the theoretical justification for using MLE rather than some ad-hoc estimator - it is statistically optimal.
Q3: How does L2 regularization relate to MAP estimation, and what prior does it correspond to?
L2 regularization (Ridge) = MAP estimation with a Gaussian prior on each weight. Proof: MAP objective is . Substituting Gaussian NLL and Gaussian prior: . Minimizing the negative = - exactly Ridge. The regularization parameter is the precision of the Gaussian prior: large = strong prior belief that weights are near zero = heavy regularization. L1 regularization corresponds to a Laplace prior, which has heavier tails and a sharper peak at zero - hence promoting sparsity.
Q4: What is Fisher information intuitively, and how is it related to the curvature of the log-likelihood?
Fisher information measures how much information each data point carries about the parameter . Intuitively: if you perturb slightly, how much does the likelihood change? High Fisher information = the likelihood is sensitive to = we can pin down precisely from the data. Formally, = expected negative Hessian of log-likelihood. High Fisher information = high curvature = narrow, peaked log-likelihood surface = small estimation uncertainty. This is exactly the connection to the CRLB: high curvature means the log-likelihood concentrates around the true , allowing very low variance estimates. For the Gaussian mean: - smaller variance → higher information → more precise estimation.
Q5: Why does MLE sometimes overfit, and what is the connection to the bias-variance tradeoff?
MLE is a point estimate that maximizes probability of the training data. For complex models with many parameters, MLE can perfectly fit the training data while generalizing poorly - a classic overfitting pattern. This is because MLE is equivalent to ERM (empirical risk minimization), which optimizes on the training distribution, not the test distribution. The MLE has zero bias asymptotically (consistent) but high variance for small . MAP estimation (MLE + prior) adds a regularization term that introduces a small bias in exchange for lower variance - the bias-variance tradeoff. As , the prior is overwhelmed by data and MAP → MLE. For small , a carefully chosen prior (e.g., Gaussian for Ridge, Laplace for Lasso) dramatically reduces variance with minimal bias cost, improving generalization.
Q6: You're fitting a model to customer purchase counts per day. You fit a Poisson regression and find the variance of counts is 8× the mean. What does this tell you and how do you fix it?
Poisson regression assumes equidispersion: . If the observed variance/mean ratio is 8, the data is overdispersed - the Poisson assumption is violated. This typically arises from unobserved heterogeneity (some customers are systematically high-purchasers, not captured by your features) or clustered events. Consequences: the Poisson likelihood is misspecified, which means MLE gives inconsistent estimates of coefficient standard errors (they are too small), leading to artificially narrow confidence intervals and inflated significance. Fix: (1) Negative Binomial regression - adds a per-observation random effect with variance , fitting from data; (2) Quasi-Poisson - uses Poisson point estimates but inflates standard errors by ; (3) Add features to absorb heterogeneity. Model selection: compare Poisson AIC vs Negative Binomial AIC - lower is better.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the MLE & MAP Explorer demo on the EngineersOfAI Playground - no code required.
:::
