What is maximum likelihood estimation?

Understand MLE from first principles - derive OLS from Gaussian noise, cross-entropy from Bernoulli, Fisher information, Cramér-Rao bound, and the deep connection between MLE and empirical risk minimization.

How does MLE work in practice?

Maximum Likelihood Estimation covers maximum likelihood estimation, MLE, statistical inference from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/linear-models/maximum-likelihood-estimation

What is the difference between maximum likelihood estimation and statistical inference?

See the full breakdown at https://engineersofai.com/docs/ml/linear-models/maximum-likelihood-estimation

Maximum Likelihood Estimation

The Real Interview Moment

You are in the final round at a top-tier AI lab. The interviewer slides a sheet of paper across the table: "We have a neural network trained with MSE loss on a classification task. Accuracy is 92%. Our competitor uses cross-entropy and gets 93%. Explain exactly why cross-entropy is better - mathematically, not intuitively."

You pause. You know cross-entropy is "better for classification," but do you know why? The honest answer requires tracing a chain of reasoning from probability theory all the way to gradient flow. MSE assumes the residuals are Gaussian. Classification residuals are Bernoulli, not Gaussian. When you use MSE for classification, you are solving the wrong optimization problem. Cross-entropy is the correct loss because it is the negative log-likelihood of the Bernoulli distribution. Using the right probabilistic model for your data is not a heuristic; it is a theorem.

This is the subject of Maximum Likelihood Estimation. Every loss function in machine learning is either the negative log-likelihood of some noise model, or it is an approximation to one. MSE is the NLL of Gaussian noise. Cross-entropy is the NLL of Bernoulli noise. Huber loss approximates the NLL of a distribution with Gaussian body and Laplace tails. Knowing this unifies everything - regression, classification, density estimation, survival analysis - under a single framework.

The principle is deceptively simple: given observed data, find the parameters that make the data most probable. Yet the implications cascade through the entire field. It explains why regularization works (MAP estimation with a prior). It quantifies the fundamental limits of estimation (Cramér-Rao bound). It justifies the use of gradient descent on loss functions (because maximizing likelihood is the same as minimizing average loss). And it gives you the vocabulary to design new loss functions for new problems from first principles.

This lesson derives MLE carefully, connects it to every major model you have already seen, builds the Fisher information framework, and ends with the MAP duality that explains L1 and L2 regularization as Bayesian inference. After this lesson, you will never wonder why a loss function looks the way it does.

Why This Exists - The Problem Before MLE

Before MLE was formalized, statisticians fit curves by minimizing "error" in ad hoc ways. The method of least squares (Gauss, Legendre, ~1805) was proposed purely because it led to tractable algebra. Nobody asked: "Under what probabilistic assumptions does minimizing squared error give the best estimates?"

The uncomfortable truth is that least squares is optimal only under Gaussian noise. For count data, binary data, survival times, or heavy-tailed measurements, squared error is a poor choice - it assigns high loss to large residuals that are actually probable under the true data distribution.

Fisher (1912–1922) formalized MLE as a general estimation principle: rather than defining "best fit" geometrically, define it probabilistically. The best parameters are those that assign the highest probability to the data you actually observed. This immediately tells you:

For Gaussian noise → minimize squared errors (MSE)
For Bernoulli outputs → minimize cross-entropy
For Poisson counts → minimize Poisson deviance
For heavy-tailed noise → minimize absolute errors (MAE)

The loss function is not a design choice - it is derived from the data distribution.

Historical Context

Year	Milestone
1805	Legendre proposes method of least squares (no probabilistic justification)
1809	Gauss derives OLS from Gaussian error assumption - first proto-MLE argument
1912	R.A. Fisher introduces "likelihood" as distinct from probability
1922	Fisher proves MLE is asymptotically efficient (achieves Cramér-Rao bound)
1945–46	Cramér and Rao independently derive the variance lower bound
1972	Nelder & Wedderburn formalize GLMs - MLE applied to exponential family
1986	Rumelhart et al. backprop = gradient of cross-entropy NLL
2010s	Deep learning: every modern loss function is a (negative) log-likelihood

Fisher's key insight: probability is a function of data given fixed parameters; likelihood is the same function viewed as varying over parameters given fixed data. They are numerically equal but conceptually opposite. We want to find the parameter value that makes the observed data most probable.

The MLE Principle: Probability vs Likelihood

For a data point $y$ and parameter $\theta$ :

$P(y | \theta)$ : probability of observing $y$ if the true parameter is $\theta$ - a function of $y$ , $\theta$ fixed
$\mathcal{L}(\theta | y) = P(y | \theta)$ : likelihood of $\theta$ given we observed $y$ - same number, but viewed as a function of $\theta$ , $y$ fixed

This distinction matters deeply. A likelihood is not a probability distribution over $\theta$ - it need not integrate to 1. It is just a measure of how well a parameter value explains the data.

Given $n$ i.i.d. observations $\{y_i\}_{i=1}^n$ :

$\mathcal{L}(\theta) = \prod_{i=1}^n P(y_i | \theta)$

The log-likelihood (monotone transformation, same maximizer):

$\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^n \log P(y_i | \theta)$

The MLE is:

$\hat{\theta}_\text{MLE} = \arg\max_\theta \ell(\theta) = \arg\max_\theta \sum_{i=1}^n \log P(y_i | \theta)$

The log converts products to sums, making both computation and differentiation tractable.

MLE for Common Distributions

Gaussian: Recovering Sample Mean and Variance

Given $n$ i.i.d. observations from $\mathcal{N}(\mu, \sigma^2)$ :

$P(y_i | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)$

Log-likelihood:

$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2$

MLE for $\mu$ : set $\partial\ell/\partial\mu = 0$ :

$\frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu) = 0 \implies \hat{\mu}_\text{MLE} = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}$

The sample mean is the MLE for Gaussian location.

MLE for $\sigma^2$ : set $\partial\ell/\partial\sigma^2 = 0$ :

$\frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n(y_i - \mu)^2 = 0 \implies \hat{\sigma}^2_\text{MLE} = \frac{1}{n}\sum_{i=1}^n(y_i - \bar{y})^2$

Note: this is the biased estimator (divides by $n$ , not $n-1$ ). MLE is slightly biased for $\sigma^2$ in finite samples. The unbiased estimator $s^2 = \frac{1}{n-1}\sum(y_i-\bar{y})^2$ is not the MLE.

Bernoulli: Recovering Sample Proportion

Given $n$ i.i.d. binary observations $y_i \in \{0, 1\}$ from $\text{Bernoulli}(p)$ :

$P(y_i | p) = p^{y_i}(1-p)^{1-y_i}$

Log-likelihood:

$\ell(p) = \sum_{i=1}^n y_i \log p + (1 - y_i)\log(1-p) = k\log p + (n-k)\log(1-p)$

where $k = \sum y_i$ is the number of successes.

MLE: set $\partial\ell/\partial p = 0$ :

$\frac{\partial\ell}{\partial p} = \frac{k}{p} - \frac{n-k}{1-p} = 0 \implies \hat{p}_\text{MLE} = \frac{k}{n} = \bar{y}$

The sample proportion is the MLE for Bernoulli parameter - the most natural estimator, justified by MLE.

Poisson: Recovering Sample Mean

Given $n$ i.i.d. count observations $y_i \in \{0, 1, 2, \ldots\}$ from $\text{Poisson}(\lambda)$ :

$P(y_i | \lambda) = \frac{e^{-\lambda}\lambda^{y_i}}{y_i!}$

Log-likelihood:

$\ell(\lambda) = \sum_{i=1}^n \left(-\lambda + y_i\log\lambda - \log(y_i!)\right) = -n\lambda + \left(\sum_i y_i\right)\log\lambda - \text{const}$

MLE: $\partial\ell/\partial\lambda = -n + \frac{\sum_i y_i}{\lambda} = 0 \implies \hat{\lambda}_\text{MLE} = \bar{y}$

For Gaussian, Bernoulli, and Poisson alike: the MLE is the sample mean. This generalizes to all one-parameter exponential family distributions - the MLE is always the empirical average of the sufficient statistic.

MLE for Linear Regression → Derives OLS

The Gaussian noise model for regression:

$y_i = \mathbf{w}^\top\mathbf{x}_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$

implies:

$P(y_i | \mathbf{x}_i, \mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{w}^\top\mathbf{x}_i)^2}{2\sigma^2}\right)$

Log-likelihood over $n$ samples:

$\ell(\mathbf{w}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{w}^\top\mathbf{x}_i)^2$

Maximizing over $\mathbf{w}$ is equivalent to minimizing:

$\sum_{i=1}^n (y_i - \mathbf{w}^\top\mathbf{x}_i)^2$

This is exactly OLS. The Gaussian noise assumption uniquely derives the squared loss. The insight works in reverse too: whenever you use MSE, you are implicitly assuming your residuals are i.i.d. Gaussian.

MLE for Logistic Regression → Derives Cross-Entropy

Binary label $y_i \in \{0, 1\}$ modeled as Bernoulli:

$P(y_i | \mathbf{x}_i, \mathbf{w}) = \hat{y}_i^{y_i}(1 - \hat{y}_i)^{1-y_i}, \quad \hat{y}_i = \sigma(\mathbf{w}^\top\mathbf{x}_i)$

Log-likelihood:

$\ell(\mathbf{w}) = \sum_{i=1}^n \left[y_i \log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$

Maximizing $\ell(\mathbf{w})$ is equivalent to minimizing $-\ell(\mathbf{w})$ , the binary cross-entropy:

$\mathcal{L}_\text{CE}(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^n \left[y_i \log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$

Cross-entropy is the NLL of Bernoulli distribution. The Bernoulli assumption derives cross-entropy; the Gaussian assumption derives MSE. The pattern is universal.

The NLL ↔ Loss Function Dictionary

Data distribution	Noise model	Negative log-likelihood	Loss name
Linear regression	Gaussian $\mathcal{N}(\mu, \sigma^2)$	$\frac{1}{2\sigma^2}(y - \hat{y})^2$	MSE / squared loss
Binary classification	Bernoulli $(p)$	$-y\log\hat{y} - (1-y)\log(1-\hat{y})$	Binary cross-entropy
Multiclass	Categorical $(p_1,\ldots,p_K)$	$-\sum_k y_k \log\hat{y}_k$	Categorical cross-entropy
Count prediction	Poisson $(\lambda)$	$\hat{y} - y\log\hat{y}$	Poisson deviance
Robust regression	Laplace $(b)$	$\\|y - \hat{y}\\|/b$	MAE / L1 loss
Heavy-tailed	Cauchy	$\log(1 + (y-\hat{y})^2)$	Log-loss (robust)

Choosing a loss function is choosing a noise model. Get the noise model right and the loss function follows.

Python: MLE From Scratch

Gaussian MLE with Likelihood Surface

import numpy as np
from scipy import stats
from scipy.optimize import minimize
import matplotlib.pyplot as plt


def mle_gaussian_full(data: np.ndarray) -> tuple[float, float]:
    """
    MLE for Gaussian: find mu, sigma maximizing P(data | mu, sigma).

    Closed-form MLE:
        mu_hat    = mean(data)
        sigma_hat = std(data, ddof=0)   # divide by n, not n-1
    """
    n = len(data)
    mu_hat    = np.mean(data)
    sigma_hat = np.std(data, ddof=0)   # MLE: biased, divides by n

    # Numerical verification via scipy.optimize
    def neg_log_likelihood(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)      # log-parameterize to keep sigma > 0
        return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

    result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method='L-BFGS-B')
    mu_num    = result.x[0]
    sigma_num = np.exp(result.x[1])

    # Compute log-likelihood at MLE
    ll_at_mle = np.sum(stats.norm.logpdf(data, loc=mu_hat, scale=sigma_hat))

    print("Gaussian MLE Results:")
    print(f"  n = {n}")
    print(f"  Closed-form: mu = {mu_hat:.4f},  sigma = {sigma_hat:.4f}")
    print(f"  Numerical:   mu = {mu_num:.4f},  sigma = {sigma_num:.4f}")
    print(f"  Unbiased std (ddof=1): {np.std(data, ddof=1):.4f}  (NOT the MLE)")
    print(f"  Log-likelihood at MLE: {ll_at_mle:.3f}")

    # Visualize likelihood surface
    mu_grid  = np.linspace(mu_hat - 1.5, mu_hat + 1.5, 80)
    sig_grid = np.linspace(max(0.1, sigma_hat - 1.0), sigma_hat + 1.0, 80)
    MU, SIG  = np.meshgrid(mu_grid, sig_grid)

    LL = np.vectorize(
        lambda m, s: np.sum(stats.norm.logpdf(data, loc=m, scale=s))
    )(MU, SIG)

    plt.figure(figsize=(7, 5))
    cs = plt.contourf(MU, SIG, LL, levels=25, cmap='Blues')
    plt.colorbar(cs, label='Log-Likelihood')
    plt.plot(mu_hat, sigma_hat, 'r*', markersize=16, label='MLE')
    plt.xlabel('$\\mu$'); plt.ylabel('$\\sigma$')
    plt.title('Gaussian Log-Likelihood Surface')
    plt.legend()
    plt.tight_layout()
    plt.savefig('gaussian_mle_surface.png', dpi=150)
    plt.show()

    return mu_hat, sigma_hat


np.random.seed(42)
true_mu, true_sigma = 3.5, 1.2
data = np.random.normal(true_mu, true_sigma, size=500)
mu_hat, sigma_hat = mle_gaussian_full(data)

Bernoulli and Poisson MLE

def mle_bernoulli(data: np.ndarray) -> float:
    """
    MLE for Bernoulli parameter p.
    Closed-form: p_hat = mean(data)
    """
    n          = len(data)
    k          = np.sum(data)
    p_hat      = k / n

    # Plot log-likelihood as a function of p
    p_vals     = np.linspace(0.01, 0.99, 300)
    log_lik    = k * np.log(p_vals) + (n - k) * np.log(1 - p_vals)

    plt.figure(figsize=(7, 4))
    plt.plot(p_vals, log_lik, 'b-', lw=2)
    plt.axvline(p_hat, color='r', ls='--', label=f'MLE p̂ = {p_hat:.3f}')
    plt.xlabel('p'); plt.ylabel('Log-Likelihood  ℓ(p)')
    plt.title('Bernoulli Log-Likelihood')
    plt.legend(); plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.savefig('bernoulli_mle.png', dpi=150)
    plt.show()

    print(f"\nBernoulli MLE: p̂ = k/n = {k}/{n} = {p_hat:.4f}")
    return p_hat


def mle_poisson(data: np.ndarray) -> float:
    """
    MLE for Poisson rate lambda.
    Closed-form: lambda_hat = mean(data)
    """
    lambda_hat = np.mean(data)

    lambda_vals = np.linspace(0.5, 2 * lambda_hat, 200)
    log_lik = np.sum(data) * np.log(lambda_vals) - len(data) * lambda_vals

    plt.figure(figsize=(7, 4))
    plt.plot(lambda_vals, log_lik, 'g-', lw=2)
    plt.axvline(lambda_hat, color='r', ls='--', label=f'MLE λ̂ = {lambda_hat:.3f}')
    plt.xlabel('λ'); plt.ylabel('Log-Likelihood  ℓ(λ)')
    plt.title('Poisson Log-Likelihood')
    plt.legend(); plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.savefig('poisson_mle.png', dpi=150)
    plt.show()

    print(f"\nPoisson MLE: λ̂ = mean = {lambda_hat:.4f}")
    return lambda_hat


np.random.seed(42)
binary_data = np.random.binomial(1, p=0.35, size=1000)
count_data  = np.random.poisson(lam=4.2, size=800)

p_hat      = mle_bernoulli(binary_data)
lambda_hat = mle_poisson(count_data)

Numerical MLE for Custom Distributions

from scipy.optimize import minimize
from scipy.stats import t as t_dist


def mle_student_t(data: np.ndarray) -> dict:
    """
    MLE for Student-t distribution: fit (nu, mu, sigma) numerically.
    No closed-form solution exists - must use numerical optimization.
    Shows how MLE applies to any distribution.
    """
    def neg_log_likelihood(params):
        log_nu, mu, log_sigma = params
        nu    = np.exp(log_nu) + 2       # degrees of freedom > 2
        sigma = np.exp(log_sigma)
        # scipy t-dist: scale = sigma, loc = mu
        return -np.sum(t_dist.logpdf(data, df=nu, loc=mu, scale=sigma))

    # Initial guess: Gaussian-like starting point
    x0 = [np.log(5.0), np.mean(data), np.log(np.std(data))]

    result = minimize(
        neg_log_likelihood,
        x0=x0,
        method='L-BFGS-B',
        options={'maxiter': 1000, 'ftol': 1e-12}
    )

    nu_hat    = np.exp(result.x[0]) + 2
    mu_hat    = result.x[1]
    sigma_hat = np.exp(result.x[2])

    print("\nStudent-t MLE (numerical):")
    print(f"  ν̂ (degrees of freedom) = {nu_hat:.3f}")
    print(f"  μ̂ = {mu_hat:.4f}")
    print(f"  σ̂ = {sigma_hat:.4f}")
    print(f"  Optimizer success: {result.success}")
    print(f"  Neg log-likelihood at MLE: {result.fun:.3f}")

    return {'nu': nu_hat, 'mu': mu_hat, 'sigma': sigma_hat}


# Simulate heavy-tailed data
np.random.seed(42)
heavy_data = t_dist.rvs(df=4, loc=2.0, scale=1.5, size=500)
params = mle_student_t(heavy_data)

MLE for Linear Regression with Full Statistics

from sklearn.datasets import make_regression
import numpy as np


class MLELinearRegression:
    """
    Linear regression via MLE under Gaussian noise assumption.
    Provides:
      - MLE weights (= OLS)
      - MLE sigma^2 (biased) and unbiased sigma^2
      - Standard errors via Fisher information matrix
      - 95% confidence intervals for each coefficient
      - Wald test p-values
    """

    def __init__(self):
        self.w_       = None
        self.sigma2_  = None
        self.se_      = None
        self.ci95_    = None
        self.p_values_ = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        n, d  = X.shape
        # Augment with intercept
        Xa    = np.column_stack([np.ones(n), X])  # shape (n, d+1)

        # MLE weights = OLS normal equation
        self.w_ = np.linalg.lstsq(Xa, y, rcond=None)[0]   # shape (d+1,)

        # Residuals and sigma^2
        resid       = y - Xa @ self.w_
        self.sigma2_ = np.mean(resid ** 2)            # MLE (biased, /n)
        sigma2_unb  = np.sum(resid ** 2) / (n - d - 1)  # unbiased, /n-d-1

        # Fisher information matrix for w: I(w) = X'X / sigma^2
        # Covariance of MLE: Cov(ŵ) ≈ sigma^2 * (X'X)^{-1}
        # Using unbiased sigma^2 for better finite-sample CI coverage
        gram        = Xa.T @ Xa
        cov_w       = sigma2_unb * np.linalg.inv(gram)
        self.se_    = np.sqrt(np.diag(cov_w))

        # 95% CI: ŵ ± 1.96 * se  (asymptotic normality of MLE)
        z95         = 1.96
        self.ci95_  = np.stack([self.w_ - z95 * self.se_,
                                self.w_ + z95 * self.se_], axis=1)

        # Wald test: H0: w_j = 0,  test stat = w_j / se_j ~ N(0,1)
        z_stats     = self.w_ / self.se_
        from scipy import stats as sc_stats
        self.p_values_ = 2 * (1 - sc_stats.norm.cdf(np.abs(z_stats)))

        # Log-likelihood at MLE
        ll = (-n / 2) * np.log(2 * np.pi * self.sigma2_) \
             - (1 / (2 * self.sigma2_)) * np.sum(resid ** 2)

        print("MLE Linear Regression:")
        print(f"  n={n}, d={d}")
        print(f"  sigma2_MLE  (biased, /n):   {self.sigma2_:.4f}")
        print(f"  sigma2_unb  (unbiased, /n-d-1): {sigma2_unb:.4f}")
        print(f"  Log-likelihood at MLE: {ll:.3f}")
        print()
        print(f"{'Param':10s}  {'Estimate':10s}  {'Std Err':10s}  {'95% CI':22s}  {'p-value':8s}")
        print("-" * 70)
        names = ['intercept'] + [f'x{j+1}' for j in range(d)]
        for i, name in enumerate(names):
            ci_str = f"[{self.ci95_[i,0]:7.3f}, {self.ci95_[i,1]:7.3f}]"
            sig    = "***" if self.p_values_[i] < 0.001 else \
                     "**"  if self.p_values_[i] < 0.01  else \
                     "*"   if self.p_values_[i] < 0.05  else ""
            print(f"{name:10s}  {self.w_[i]:10.4f}  {self.se_[i]:10.4f}  {ci_str}  {self.p_values_[i]:.5f} {sig}")
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        n  = X.shape[0]
        Xa = np.column_stack([np.ones(n), X])
        return Xa @ self.w_

    def log_likelihood(self, X: np.ndarray, y: np.ndarray) -> float:
        resid = y - self.predict(X)
        n     = len(y)
        return -(n/2)*np.log(2*np.pi*self.sigma2_) \
               - (1/(2*self.sigma2_)) * np.sum(resid**2)


np.random.seed(42)
X_reg, y_reg = make_regression(n_samples=200, n_features=4,
                                noise=8.0, random_state=42)
mle_lr = MLELinearRegression().fit(X_reg, y_reg)

MLE vs MAP: Regularization as Bayesian Inference

MLE maximizes only the likelihood. Maximum A Posteriori (MAP) also incorporates a prior $P(\boldsymbol{\theta})$ :

$\hat{\boldsymbol{\theta}}_\text{MAP} = \arg\max_{\boldsymbol{\theta}} \underbrace{P(\mathbf{y}|\mathbf{X};\boldsymbol{\theta})}_\text{likelihood} \cdot \underbrace{P(\boldsymbol{\theta})}_\text{prior} = \arg\max_{\boldsymbol{\theta}} \underbrace{\ell(\boldsymbol{\theta})}_\text{log-likelihood} + \underbrace{\log P(\boldsymbol{\theta})}_\text{log-prior}$

Gaussian Prior → Ridge Regression (L2)

Assign prior $\mathbf{w} \sim \mathcal{N}\left(\mathbf{0},\, \frac{1}{\lambda}\mathbf{I}\right)$ :

$\log P(\mathbf{w}) = -\frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \text{const}$

MAP objective:

$\hat{\mathbf{w}}_\text{MAP} = \arg\min_\mathbf{w} \underbrace{\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2}_\text{NLL (Gaussian noise)} + \underbrace{\lambda\|\mathbf{w}\|_2^2}_\text{negative log Gaussian prior}$

This is exactly Ridge regression. L2 regularization = MAP with Gaussian prior.

Laplace Prior → Lasso (L1)

Assign prior $w_j \sim \text{Laplace}(0, 1/\lambda)$ independently:

$\log P(w_j) = -\lambda|w_j| + \text{const}$

MAP objective:

$\hat{\mathbf{w}}_\text{MAP} = \arg\min_\mathbf{w} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1$

This is exactly Lasso. L1 regularization = MAP with Laplace prior. The Laplace distribution has sharp peaks at zero and heavy tails, which is why the Lasso prior promotes sparsity - zero is much more probable under the Laplace than under the Gaussian.

The MAP ↔ Regularization Correspondence

Prior on $w_j$	Log-prior term	Regularizer	Penalty name
$\mathcal{N}(0, 1/\lambda)$	$-\lambda w_j^2 / 2$	$\lambda\\|\mathbf{w}\\|_2^2$	L2 / Ridge
$\text{Laplace}(0, 1/\lambda)$	$-\lambda	w_j	$
Uniform (improper)	$0$	none	MLE = OLS
Elastic net prior	$-\alpha\lambda	w_j	- (1-\alpha)\lambda w_j^2$

As $n \to \infty$ , the prior is overwhelmed by the likelihood and MAP → MLE regardless of the prior chosen.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso


def compare_mle_map(X: np.ndarray, y: np.ndarray, lambda_: float = 1.0):
    """
    Demonstrate MLE = OLS, MAP+Gaussian prior = Ridge, MAP+Laplace prior = Lasso.
    """
    n, d = X.shape

    # MLE (no prior)
    mle   = LinearRegression().fit(X, y)

    # MAP + Gaussian prior = Ridge
    # sklearn Ridge: minimizes ||y - Xw||^2 + alpha * ||w||^2
    ridge = Ridge(alpha=lambda_).fit(X, y)

    # MAP + Laplace prior = Lasso
    lasso = Lasso(alpha=lambda_ / (2 * n), max_iter=5000).fit(X, y)

    print(f"\nMLE vs MAP Comparison (lambda={lambda_}, n={n}, d={d}):")
    print(f"{'Feature':10s}  {'MLE/OLS':10s}  {'MAP/Ridge':10s}  {'MAP/Lasso':10s}")
    print("-" * 50)
    for j in range(min(d, 8)):
        print(f"x{j+1:8s}  {mle.coef_[j]:10.4f}  {ridge.coef_[j]:10.4f}  {lasso.coef_[j]:10.4f}")

    print(f"\n  Ridge l2 norm: {np.linalg.norm(ridge.coef_):.4f}  "
          f"(vs MLE: {np.linalg.norm(mle.coef_):.4f})")
    print(f"  Lasso zeros:   {np.sum(lasso.coef_ == 0)} of {d} features zeroed out")


np.random.seed(42)
X_sparse, y_sparse = make_regression(n_samples=150, n_features=20,
                                      n_informative=5, noise=10.0,
                                      random_state=42)
compare_mle_map(X_sparse, y_sparse, lambda_=50.0)

Fisher Information and the Cramér-Rao Bound

The Score Function

The score function is the gradient of the log-likelihood with respect to $\theta$ :

$s(\theta; y) = \frac{\partial}{\partial\theta}\log P(y|\theta)$

Under regularity conditions, $\mathbb{E}[s(\theta; y)] = 0$ - the expected score is zero at the true parameter.

Fisher Information

The Fisher information measures how much information a single observation carries about $\theta$ :

$\mathcal{I}(\theta) = \mathbb{E}\left[s(\theta; y)^2\right] = \mathbb{E}\left[\left(\frac{\partial\log P(y|\theta)}{\partial\theta}\right)^2\right]$

Equivalently (second form, more useful for computation):

$\mathcal{I}(\theta) = -\mathbb{E}\left[\frac{\partial^2\log P(y|\theta)}{\partial\theta^2}\right]$

The second form says: Fisher information equals the expected curvature (negative Hessian) of the log-likelihood. High curvature = sharply peaked likelihood = precise estimation possible.

For $n$ i.i.d. samples, total Fisher information is $n \cdot \mathcal{I}(\theta)$ .

Cramér-Rao Lower Bound

For any unbiased estimator $\hat{\theta}$ of $\theta$ :

$\text{Var}(\hat{\theta}) \geq \frac{1}{n \cdot \mathcal{I}(\theta)}$

This is the Cramér-Rao Lower Bound (CRLB). It establishes a fundamental floor on estimation variance - no unbiased estimator can do better.

The MLE is asymptotically efficient: as $n \to \infty$ , the MLE achieves the CRLB. No consistent estimator can be more efficient asymptotically.

Fisher Information: Concrete Examples

Gaussian (known $\sigma^2$ ):

$\log P(y|\mu) = -\frac{(y-\mu)^2}{2\sigma^2} + \text{const}$

$\frac{\partial^2 \log P}{\partial\mu^2} = -\frac{1}{\sigma^2} \implies \mathcal{I}(\mu) = \frac{1}{\sigma^2}$

CRLB: $\text{Var}(\hat{\mu}) \geq \frac{\sigma^2}{n}$ . The sample mean achieves exactly $\sigma^2/n$ - it is the UMVUE (uniformly minimum variance unbiased estimator).

Bernoulli:

$\log P(y|p) = y\log p + (1-y)\log(1-p)$

$\frac{\partial^2 \log P}{\partial p^2} = -\frac{y}{p^2} - \frac{1-y}{(1-p)^2} \implies \mathcal{I}(p) = \frac{1}{p(1-p)}$

CRLB: $\text{Var}(\hat{p}) \geq \frac{p(1-p)}{n}$ . Sample proportion achieves exactly $p(1-p)/n$ - efficient.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


def fisher_information_analysis():
    """
    Demonstrate Fisher information, CRLB, and MLE efficiency.
    """
    print("=" * 60)
    print("Fisher Information and Cramér-Rao Bound")
    print("=" * 60)

    # --- Gaussian ---
    sigma = 2.0
    ns    = [10, 50, 100, 500, 1000]

    print(f"\nGaussian (sigma={sigma}): Fisher info I(mu) = 1/sigma^2 = {1/sigma**2:.4f}")
    print(f"\n{'n':6s}  {'CRLB std':10s}  {'Simul std':10s}  {'Ratio':8s}")
    print("-" * 40)

    for n in ns:
        crlb_std = sigma / np.sqrt(n)  # sqrt(1 / (n * I(mu)))
        # Simulate 5000 experiments
        samples   = np.random.normal(3.5, sigma, size=(5000, n))
        mle_means = samples.mean(axis=1)
        sim_std   = np.std(mle_means)
        ratio     = sim_std / crlb_std
        print(f"{n:6d}  {crlb_std:10.5f}  {sim_std:10.5f}  {ratio:8.4f}")

    # --- Bernoulli ---
    p_true = 0.3
    I_bern = 1.0 / (p_true * (1 - p_true))
    print(f"\nBernoulli (p={p_true}): Fisher info I(p) = 1/(p(1-p)) = {I_bern:.4f}")
    print(f"\n{'n':6s}  {'CRLB std':10s}  {'Simul std':10s}  {'Ratio':8s}")
    print("-" * 40)

    for n in ns:
        crlb_std  = np.sqrt(p_true * (1 - p_true) / n)
        samples   = np.random.binomial(1, p_true, size=(5000, n))
        mle_props = samples.mean(axis=1)
        sim_std   = np.std(mle_props)
        ratio     = sim_std / crlb_std
        print(f"{n:6d}  {crlb_std:10.5f}  {sim_std:10.5f}  {ratio:8.4f}")

    print("\nConclusion: ratio -> 1.0 as n grows - MLE achieves CRLB asymptotically.")


np.random.seed(42)
fisher_information_analysis()

The Observed Fisher Information Matrix

In practice, the Fisher information for a vector parameter $\boldsymbol{\theta}$ (the Fisher information matrix) is approximated by the observed Fisher information - the negative Hessian of the log-likelihood at the MLE:

$\hat{\mathcal{I}}(\hat{\boldsymbol{\theta}}) = -\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^\top}\Bigg|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}}$

The inverse gives the asymptotic covariance matrix of the MLE:

$\text{Cov}(\hat{\boldsymbol{\theta}}) \approx \hat{\mathcal{I}}(\hat{\boldsymbol{\theta}})^{-1}$

Standard errors = square root of diagonal entries. This is how statsmodels, R, and SciPy report confidence intervals for MLE-fit models.

Asymptotic Properties of MLE

Consistency

Under regularity conditions, $\hat{\boldsymbol{\theta}}_\text{MLE} \xrightarrow{p} \boldsymbol{\theta}^*$ as $n \to \infty$ . More data always improves MLE accuracy.

Asymptotic Normality

$\sqrt{n}\left(\hat{\boldsymbol{\theta}}_\text{MLE} - \boldsymbol{\theta}^*\right) \xrightarrow{d} \mathcal{N}\left(\mathbf{0},\, \mathcal{I}(\boldsymbol{\theta}^*)^{-1}\right)$

This gives the theoretical basis for confidence intervals and hypothesis tests from MLE estimates. At large $n$ , CIs computed from the observed Fisher information are valid.

Invariance

If $\hat{\theta}_\text{MLE}$ is the MLE of $\theta$ , then $g(\hat{\theta}_\text{MLE})$ is the MLE of $g(\theta)$ for any function $g$ .

Examples:

MLE of $\sigma^2$ is $\hat{\sigma}^2 = \frac{1}{n}\sum(y_i - \bar{y})^2$ , so MLE of $\sigma$ is $\sqrt{\hat{\sigma}^2}$
MLE of $p$ is $\hat{p} = k/n$ , so MLE of log-odds is $\log(\hat{p}/(1-\hat{p}))$

Asymptotic Efficiency

No consistent estimator can have lower asymptotic variance than the MLE. This is a strong result: the MLE extracts all the information from the data.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


def mle_asymptotic_normality(true_p: float = 0.4, n_values: list = None):
    """
    Demonstrate asymptotic normality of MLE for Bernoulli p.
    As n grows, the sampling distribution of p_hat converges to N(p, p(1-p)/n).
    """
    if n_values is None:
        n_values = [10, 30, 100, 500]

    n_sim = 10_000
    fig, axes = plt.subplots(1, len(n_values), figsize=(14, 4))

    for ax, n in zip(axes, n_values):
        # Simulate MLE sampling distribution
        samples = np.random.binomial(1, true_p, size=(n_sim, n))
        p_hats  = samples.mean(axis=1)

        # Theoretical asymptotic distribution
        sigma_theory = np.sqrt(true_p * (1 - true_p) / n)
        x_range      = np.linspace(true_p - 4 * sigma_theory,
                                   true_p + 4 * sigma_theory, 200)
        density      = stats.norm.pdf(x_range, loc=true_p, scale=sigma_theory)

        ax.hist(p_hats, bins=50, density=True, alpha=0.6, color='steelblue',
                label='Simulated MLE')
        ax.plot(x_range, density, 'r-', lw=2, label='Asymptotic N')
        ax.axvline(true_p, color='k', ls='--', alpha=0.5)
        ax.set_title(f'n = {n}')
        ax.set_xlabel('p̂')
        ax.legend(fontsize=8)

    plt.suptitle(f'Asymptotic Normality of Bernoulli MLE (true p = {true_p})',
                 fontsize=12)
    plt.tight_layout()
    plt.savefig('mle_asymptotic_normality.png', dpi=150)
    plt.show()


np.random.seed(42)
mle_asymptotic_normality(true_p=0.4, n_values=[10, 30, 100, 500])

MLE vs ERM: The Theoretical Bridge

Empirical Risk Minimization (ERM) minimizes average loss over training data:

$\hat{\boldsymbol{\theta}}_\text{ERM} = \arg\min_{\boldsymbol{\theta}} \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(\mathbf{x}_i; \boldsymbol{\theta}))$

MLE maximizes average log-likelihood:

$\hat{\boldsymbol{\theta}}_\text{MLE} = \arg\max_{\boldsymbol{\theta}} \frac{1}{n}\sum_{i=1}^n \log P(y_i | \mathbf{x}_i; \boldsymbol{\theta})$

These are the same when loss = negative log-likelihood: $\ell(y, \hat{y}) = -\log P(y|\hat{y})$ .

Furthermore, MLE is equivalent to minimizing KL divergence between the empirical distribution $\hat{p}$ and the model $p_\theta$ :

$\hat{\boldsymbol{\theta}}_\text{MLE} = \arg\min_\theta D_\text{KL}\left(\hat{p}(y|\mathbf{x}) \,\|\, p_\theta(y|\mathbf{x})\right)$

This is a profound connection: MLE finds the model that is closest (in KL divergence) to the true data distribution, as estimated from the training set.

Production Notes

:::tip Use numerical MLE when no closed form exists Most distributions used in production (Student-t, Beta, Negative Binomial) have no closed-form MLE. Use scipy.optimize.minimize with method='L-BFGS-B' and log-parameterize positive quantities. Always verify by checking the gradient at the solution: optimize.check_grad or manually. :::

:::warning MLE sigma^2 is biased for linear regression The MLE estimate $\hat{\sigma}^2 = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ is biased. For inference (confidence intervals, hypothesis tests), use the unbiased estimate $s^2 = \frac{1}{n-d-1}\sum_i (y_i - \hat{y}_i)^2$ . The degrees-of-freedom correction $n-d-1$ accounts for the fact that $d+1$ parameters were estimated from the same data used to compute residuals. :::

:::danger Do not confuse likelihood and probability $\mathcal{L}(\theta | y) = P(y | \theta)$ numerically, but they are different functions. Likelihood is not a probability distribution over $\theta$ - it does not integrate to 1 and cannot be normalized to get a distribution (that is Bayesian posterior, not likelihood). Reporting $\mathcal{L}(\hat\theta)$ as a "probability" of the parameter is wrong. :::

:::warning MSE for classification: wrong noise model Using MSE loss for binary classification assumes Gaussian residuals, which is misspecified. The outputs are probabilities in $[0,1]$ but MSE assigns no penalty for predictions outside $[0,1]$ , leads to saturated gradients with sigmoid activation, and produces a non-convex objective. Use cross-entropy (correct Bernoulli NLL) instead. :::

YouTube Resources

Title	Channel	Why Watch
Maximum Likelihood For the Normal Distribution, step-by-step	StatQuest	Excellent visual derivation of Gaussian MLE
MLE and MAP estimation	Aurelien Geron	Connects MLE to neural network training
Fisher Information and the Cramér-Rao Lower Bound	Mathematical Monk	Rigorous CRLB derivation
Bayesian vs Frequentist Statistics	3Blue1Brown	Conceptual foundation, MLE vs MAP
Likelihood and Log-Likelihood	Brandon Foltz	Clear introduction for practitioners

Interview Q&A

Q1: Explain why the Gaussian noise assumption leads to OLS, and why Bernoulli leads to cross-entropy.

The key insight is that the loss function is the negative log-likelihood of the noise model. If we assume $y_i = \mathbf{w}^\top\mathbf{x}_i + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0,\sigma^2)$ , then $P(y_i|\mathbf{x}_i,\mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - \mathbf{w}^\top\mathbf{x}_i)^2}{2\sigma^2}\right)$ . Taking the log and summing: $\ell = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i(y_i - \mathbf{w}^\top\mathbf{x}_i)^2$ . Maximizing over $\mathbf{w}$ means minimizing $\sum_i(y_i - \hat{y}_i)^2$ - OLS. For binary classification, $y_i \sim \text{Bernoulli}(\hat{y}_i)$ , so $P(y_i|\hat{y}_i) = \hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i}$ . Log-likelihood: $\sum_i[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]$ . Maximizing this is minimizing binary cross-entropy. The loss function is not an arbitrary design choice - it is the negative log-likelihood of the correct generative model.

Q2: What is the Cramér-Rao bound and why does it matter for ML?

The CRLB is $\text{Var}(\hat\theta) \geq 1/(n \cdot \mathcal{I}(\theta))$ for any unbiased estimator $\hat\theta$ . It sets a fundamental lower bound on estimation variance - no matter how clever your estimator, you cannot beat this. The MLE achieves this bound asymptotically, making it the most efficient estimator (among consistent unbiased estimators) for large $n$ . In ML practice: (1) CRLB explains why more data always helps - the bound tightens as $n$ grows; (2) The inverse Fisher information gives asymptotic standard errors for MLE, used for confidence intervals and p-values in statsmodels/R; (3) CRLB is the theoretical justification for using MLE rather than some ad-hoc estimator - it is statistically optimal.

Q3: How does L2 regularization relate to MAP estimation, and what prior does it correspond to?

L2 regularization (Ridge) = MAP estimation with a Gaussian prior $P(w_j) = \mathcal{N}(0, 1/\lambda)$ on each weight. Proof: MAP objective is $\log P(\mathbf{y}|\mathbf{X},\mathbf{w}) + \log P(\mathbf{w})$ . Substituting Gaussian NLL and Gaussian prior: $-\frac{1}{2\sigma^2}\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 - \frac{\lambda}{2}\|\mathbf{w}\|^2$ . Minimizing the negative = $\arg\min_\mathbf{w} \|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2$ - exactly Ridge. The regularization parameter $\lambda$ is the precision of the Gaussian prior: large $\lambda$ = strong prior belief that weights are near zero = heavy regularization. L1 regularization corresponds to a Laplace prior, which has heavier tails and a sharper peak at zero - hence promoting sparsity.

Q4: What is Fisher information intuitively, and how is it related to the curvature of the log-likelihood?

Fisher information $\mathcal{I}(\theta)$ measures how much information each data point carries about the parameter $\theta$ . Intuitively: if you perturb $\theta$ slightly, how much does the likelihood change? High Fisher information = the likelihood is sensitive to $\theta$ = we can pin down $\theta$ precisely from the data. Formally, $\mathcal{I}(\theta) = -\mathbb{E}[\partial^2\log P/\partial\theta^2]$ = expected negative Hessian of log-likelihood. High Fisher information = high curvature = narrow, peaked log-likelihood surface = small estimation uncertainty. This is exactly the connection to the CRLB: high curvature means the log-likelihood concentrates around the true $\theta$ , allowing very low variance estimates. For the Gaussian mean: $\mathcal{I}(\mu) = 1/\sigma^2$ - smaller variance → higher information → more precise estimation.

Q5: Why does MLE sometimes overfit, and what is the connection to the bias-variance tradeoff?

MLE is a point estimate that maximizes probability of the training data. For complex models with many parameters, MLE can perfectly fit the training data while generalizing poorly - a classic overfitting pattern. This is because MLE is equivalent to ERM (empirical risk minimization), which optimizes on the training distribution, not the test distribution. The MLE has zero bias asymptotically (consistent) but high variance for small $n$ . MAP estimation (MLE + prior) adds a regularization term that introduces a small bias in exchange for lower variance - the bias-variance tradeoff. As $n \to \infty$ , the prior is overwhelmed by data and MAP → MLE. For small $n$ , a carefully chosen prior (e.g., Gaussian for Ridge, Laplace for Lasso) dramatically reduces variance with minimal bias cost, improving generalization.

Q6: You're fitting a model to customer purchase counts per day. You fit a Poisson regression and find the variance of counts is 8× the mean. What does this tell you and how do you fix it?

Poisson regression assumes equidispersion: $\text{Var}(y) = \mu$ . If the observed variance/mean ratio is 8, the data is overdispersed - the Poisson assumption is violated. This typically arises from unobserved heterogeneity (some customers are systematically high-purchasers, not captured by your features) or clustered events. Consequences: the Poisson likelihood is misspecified, which means MLE gives inconsistent estimates of coefficient standard errors (they are too small), leading to artificially narrow confidence intervals and inflated significance. Fix: (1) Negative Binomial regression - adds a per-observation random effect with variance $\text{Var}(y) = \mu + \alpha\mu^2$ , fitting $\alpha$ from data; (2) Quasi-Poisson - uses Poisson point estimates but inflates standard errors by $\sqrt{\text{dispersion}}$ ; (3) Add features to absorb heterogeneity. Model selection: compare Poisson AIC vs Negative Binomial AIC - lower is better.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the MLE & MAP Explorer demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why This Exists - The Problem Before MLE​

Historical Context​

The MLE Principle: Probability vs Likelihood​

MLE for Common Distributions​

Gaussian: Recovering Sample Mean and Variance​

Bernoulli: Recovering Sample Proportion​

Poisson: Recovering Sample Mean​

MLE for Linear Regression → Derives OLS​

MLE for Logistic Regression → Derives Cross-Entropy​

The NLL ↔ Loss Function Dictionary​

Python: MLE From Scratch​

Gaussian MLE with Likelihood Surface​

Bernoulli and Poisson MLE​

Numerical MLE for Custom Distributions​

MLE for Linear Regression with Full Statistics​

MLE vs MAP: Regularization as Bayesian Inference​

Gaussian Prior → Ridge Regression (L2)​

Laplace Prior → Lasso (L1)​

The MAP ↔ Regularization Correspondence​

Fisher Information and the Cramér-Rao Bound​

The Score Function​

Fisher Information​

Cramér-Rao Lower Bound​

Fisher Information: Concrete Examples​

The Observed Fisher Information Matrix​

Asymptotic Properties of MLE​

Consistency​

Asymptotic Normality​

Invariance​

Asymptotic Efficiency​

MLE vs ERM: The Theoretical Bridge​

Production Notes​

YouTube Resources​

Interview Q&A​

The Real Interview Moment

Why This Exists - The Problem Before MLE

Historical Context

The MLE Principle: Probability vs Likelihood

MLE for Common Distributions

Gaussian: Recovering Sample Mean and Variance

Bernoulli: Recovering Sample Proportion

Poisson: Recovering Sample Mean

MLE for Linear Regression → Derives OLS

MLE for Logistic Regression → Derives Cross-Entropy

The NLL ↔ Loss Function Dictionary

Python: MLE From Scratch

Gaussian MLE with Likelihood Surface

Bernoulli and Poisson MLE

Numerical MLE for Custom Distributions

MLE for Linear Regression with Full Statistics

MLE vs MAP: Regularization as Bayesian Inference

Gaussian Prior → Ridge Regression (L2)

Laplace Prior → Lasso (L1)

The MAP ↔ Regularization Correspondence

Fisher Information and the Cramér-Rao Bound

The Score Function

Fisher Information

Cramér-Rao Lower Bound

Fisher Information: Concrete Examples

The Observed Fisher Information Matrix

Asymptotic Properties of MLE

Consistency

Asymptotic Normality

Invariance

Asymptotic Efficiency

MLE vs ERM: The Theoretical Bridge

Production Notes

YouTube Resources

Interview Q&A