Estimation Theory: The Math Behind How Models Learn

Reading time: ~35 min | Interview relevance: Very High | Target roles: MLE, AI Engineer, Research Engineer

The Production Scenario

Your team is training a language model. The loss function in your training script reads:

loss = F.cross_entropy(logits, targets)

A new grad joins and asks: "Why cross-entropy? Why not mean squared error? Why not something else?"

Most engineers say "that's just standard for classification" and move on. But a senior ML engineer knows the real answer: cross-entropy IS maximum likelihood estimation. The loss function you minimise during training is the negative log-likelihood of your training data under the model. This is not a coincidence or a convention - it is the mathematically principled answer to the question "what parameters best explain the data I observed?"

Estimation theory gives you this deeper understanding. It also tells you exactly where L2 regularisation comes from (it is a Bayesian prior), and it gives you the formal language to reason about whether your model's parameter estimates are good ones.

What Is an Estimator?

You observe data $x_1, x_2, \ldots, x_n$ drawn from some distribution $p(x | \theta)$ where $\theta$ is an unknown parameter (or parameter vector). An estimator $\hat{\theta}$ is any function of the data that produces a guess for $\theta$ :

$\hat{\theta} = g(x_1, x_2, \ldots, x_n)$

Examples:

Sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ estimates the population mean $\mu$
Sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ estimates population variance $\sigma^2$
The weights $\hat{w}$ from gradient descent estimate the "true" best weights $w^*$

Different estimators have different properties. Two of the most important: bias and variance.

Bias and Variance of Estimators

Bias

The bias of an estimator is the systematic error - how far off on average is our estimate from the true value?

$\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$

An estimator is unbiased if $\mathbb{E}[\hat{\theta}] = \theta$ .

Example - why divide by $n-1$ for sample variance?

Consider two estimators for variance:

$\hat{\sigma}^2_{\text{biased}} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 \quad \text{(MLE estimator)}$

$\hat{\sigma}^2_{\text{unbiased}} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 \quad \text{(Bessel's correction)}$

The biased version is the MLE solution, but it systematically underestimates variance because $\bar{x}$ is itself estimated from the data. The correction factor $\frac{n}{n-1}$ compensates. In NumPy:

import numpy as np

data = np.array([2.1, 2.5, 3.0, 2.8, 2.3, 3.1, 2.7])

# Biased (MLE) - divides by n
var_biased = np.var(data, ddof=0)

# Unbiased (Bessel's correction) - divides by n-1
var_unbiased = np.var(data, ddof=1)

print(f"Biased variance:   {var_biased:.4f}")
print(f"Unbiased variance: {var_unbiased:.4f}")
print(f"Ratio (n/(n-1)):   {len(data)/(len(data)-1):.4f}")

# Verify: unbiased = biased * n/(n-1)
print(f"Check:             {var_biased * len(data)/(len(data)-1):.4f}")
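
A single dataset cannot show bias directly, because bias is an average over many datasets. The following is a minimal simulation sketch of that average; the true variance, sample size, and number of simulated datasets are values assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_true = 4.0      # assumed true variance for the simulation
n = 5                  # small n makes the bias easy to see
n_datasets = 200_000   # number of simulated datasets

# Each row is one dataset of size n drawn from N(0, sigma2_true)
samples = rng.normal(0.0, np.sqrt(sigma2_true), size=(n_datasets, n))

# Average each estimator over all datasets to approximate its expectation
mean_biased = np.var(samples, axis=1, ddof=0).mean()
mean_unbiased = np.var(samples, axis=1, ddof=1).mean()

print(f"Average of 1/n estimator:     {mean_biased:.3f}")
print(f"Theory (n-1)/n * sigma^2:     {(n - 1) / n * sigma2_true:.3f}")
print(f"Average of 1/(n-1) estimator: {mean_unbiased:.3f}")
```

The 1/n average should sit close to $\frac{n-1}{n}\sigma^2$, while the 1/(n-1) average should sit close to $\sigma^2$.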

Variance of an Estimator

The variance of an estimator measures how much it fluctuates across different datasets:

$\text{Var}(\hat{\theta}) = \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right]$

A low-variance estimator gives similar answers no matter which random sample you happen to observe. This is exactly what you want in ML - a training procedure that gives you roughly the same model regardless of which mini-batches you sampled.
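
To make "fluctuates across different datasets" concrete, here is a small sketch comparing two estimators of the centre of a Gaussian: the sample mean and the sample median. The dataset size and number of repetitions are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_datasets = 50, 20_000

# Simulate many datasets from N(0, 1); each row is one dataset
samples = rng.normal(0.0, 1.0, size=(n_datasets, n))

# Apply each estimator to every dataset
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

# Variance of each estimator across datasets
var_mean = means.var()
var_median = medians.var()

print(f"Var of sample mean:   {var_mean:.5f}  (theory: {1 / n:.5f})")
print(f"Var of sample median: {var_median:.5f}")
```

Both estimators are centred on the true value here, but for Gaussian data the median fluctuates more from dataset to dataset, so the mean is the lower-variance choice.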

Mean Squared Error Decomposition

The Mean Squared Error (MSE) of an estimator decomposes into bias and variance:

$\text{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] = \text{Bias}^2(\hat{\theta}) + \text{Var}(\hat{\theta})$

Proof:

$\text{MSE} = \mathbb{E}[(\hat{\theta} - \theta)^2]$

$= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2\right]$

$= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right] + 2\underbrace{\mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])(\mathbb{E}[\hat{\theta}] - \theta)\right]}_{=0} + (\mathbb{E}[\hat{\theta}] - \theta)^2$

$= \text{Var}(\hat{\theta}) + \text{Bias}^2(\hat{\theta})$

The cross term vanishes because $\mathbb{E}[\hat{\theta}] - \theta$ is a constant that can be pulled out of the expectation, and $\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right] = 0$.

This is the bias-variance tradeoff at the estimator level - the same tradeoff you encounter in ML models.
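
The decomposition can be checked numerically. The sketch below uses a deliberately biased estimator, a shrunken sample mean; the shrinkage factor and the other constants are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma, n = 2.0, 3.0, 10
shrink = 0.8            # hypothetical shrinkage factor, chosen for illustration
n_datasets = 100_000

# Each row is one dataset; the estimator is a shrunken sample mean
samples = rng.normal(mu_true, sigma, size=(n_datasets, n))
estimates = shrink * samples.mean(axis=1)

mse = np.mean((estimates - mu_true) ** 2)
bias = estimates.mean() - mu_true
var = estimates.var()   # ddof=0, matching the definition of Var above

print(f"MSE:           {mse:.4f}")
print(f"Bias^2 + Var:  {bias**2 + var:.4f}")
print(f"Bias:          {bias:.4f}")
```

Shrinking trades a non-zero bias for a smaller variance, and the two printed totals agree because the identity holds exactly for the empirical averages as well.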

:::tip ML Engineering Connection The bias-variance tradeoff of ML models (underfitting vs overfitting) is literally the bias-variance tradeoff of estimators. A model with high bias (underfitting) has a biased estimator for the true function. A model with high variance (overfitting) is a high-variance estimator that is sensitive to which training set you happened to sample. :::

Visualising Bias-Variance

Low Bias,     Low Bias,     High Bias,    High Bias,
Low Variance  High Variance Low Variance  High Variance

     *               * *         *             *  *
     *           *     *           *         *   *  *
     *              *               *          *
    [X]             [X]            [X]          [X]

[X] = true parameter value
 *  = estimates from different data samples

Maximum Likelihood Estimation (MLE)

MLE answers the question: which parameter value makes the observed data most probable?

Given data $\mathcal{D} = \{x_1, \ldots, x_n\}$ assumed i.i.d. from $p(x|\theta)$ , the likelihood is:

$\mathcal{L}(\theta) = p(\mathcal{D}|\theta) = \prod_{i=1}^n p(x_i|\theta)$

The MLE is:

$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \mathcal{L}(\theta) = \arg\max_\theta \prod_{i=1}^n p(x_i|\theta)$

In practice, we maximise the log-likelihood (which has the same maximiser, but products become sums - much better for numerical stability and calculus):

$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \log \mathcal{L}(\theta) = \arg\max_\theta \sum_{i=1}^n \log p(x_i|\theta)$

And since minimisation is more familiar in ML, we equivalently minimise the negative log-likelihood (NLL):

$\hat{\theta}_{\text{MLE}} = \arg\min_\theta -\sum_{i=1}^n \log p(x_i|\theta)$
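
The numerical-stability argument for working in log space is easy to demonstrate. The following sketch assumes standard-normal data and a standard-normal model, purely for illustration, and evaluates the likelihood both as a raw product and as a sum of logs.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, size=2000)

# Log-density of N(0, 1) at each data point
log_densities = -0.5 * np.log(2 * np.pi) - 0.5 * data**2

# Raw likelihood: a product of 2000 numbers, each below 0.4
likelihood = np.prod(np.exp(log_densities))

# Log-likelihood: the same information as a sum
log_likelihood = np.sum(log_densities)

print(f"Likelihood (product):  {likelihood}")
print(f"Log-likelihood (sum):  {log_likelihood:.2f}")
```

The product underflows to zero in double precision, so it cannot be compared across parameter values, while the sum of logs stays finite.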

MLE for the Gaussian Distribution

Suppose $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ . The log-likelihood is:

$\log \mathcal{L}(\mu, \sigma^2) = \sum_{i=1}^n \log \left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\right]$

$= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$

Deriving $\hat{\mu}_{\text{MLE}}$ : Take the partial derivative with respect to $\mu$ and set to zero:

$\frac{\partial \log \mathcal{L}}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0$

$\Rightarrow \sum_{i=1}^n (x_i - \mu) = 0 \Rightarrow \hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$

The MLE for $\mu$ is the sample mean. This is the intuitive answer - and now we know it is the principled MLE answer.

Deriving $\hat{\sigma}^2_{\text{MLE}}$ : Take the partial derivative with respect to $\sigma^2$ (let $v = \sigma^2$ ):

$\frac{\partial \log \mathcal{L}}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_{i=1}^n (x_i - \mu)^2 = 0$

$\Rightarrow \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$

Note: this is the biased estimator (divides by $n$ , not $n-1$ ). MLE is consistent but not always unbiased.

import numpy as np
from scipy.stats import norm

np.random.seed(42)
mu_true, sigma_true = 3.5, 1.2

n = 1000
data = np.random.normal(mu_true, sigma_true, n)

# MLE estimates (manual)
mu_mle = np.mean(data)
sigma2_mle = np.mean((data - mu_mle)**2)   # biased (divides by n)
sigma2_unbiased = np.var(data, ddof=1)     # unbiased (divides by n-1)

print(f"True mu:          {mu_true}")
print(f"MLE mu:           {mu_mle:.4f}")
print(f"True sigma^2:     {sigma_true**2:.4f}")
print(f"MLE sigma^2:      {sigma2_mle:.4f}  (biased)")
print(f"Unbiased sigma^2: {sigma2_unbiased:.4f}  (Bessel's correction)")

# Verify with scipy MLE
mu_scipy, sigma_scipy = norm.fit(data)
print(f"\nScipy MLE mu:     {mu_scipy:.4f}")
print(f"Scipy MLE sigma:  {sigma_scipy:.4f}")

MLE for the Bernoulli Distribution

If $x_i \in \{0, 1\}$ with $p(x=1) = p$ :

$\log \mathcal{L}(p) = \sum_{i=1}^n [x_i \log p + (1-x_i)\log(1-p)]$

$\frac{\partial \log \mathcal{L}}{\partial p} = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p} = 0$

$\hat{p}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$

The MLE for a Bernoulli proportion is just the sample proportion. Again: the intuitive answer IS the principled MLE answer.
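
As a sanity check on the derivation, the log-likelihood can be maximised by brute force over a grid and compared with the closed form. This is a sketch under assumed values (a true $p$ of 0.3 and 200 samples); the helper name bernoulli_log_likelihood is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
p_true = 0.3
x = (rng.random(200) < p_true).astype(float)

def bernoulli_log_likelihood(p, x):
    """Log-likelihood of i.i.d. Bernoulli(p) data."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Evaluate the log-likelihood on a grid with step 0.001
grid = np.linspace(0.001, 0.999, 999)
log_liks = np.array([bernoulli_log_likelihood(p, x) for p in grid])

p_grid = grid[np.argmax(log_liks)]
p_closed_form = x.mean()

print(f"Grid-search MLE:  {p_grid:.3f}")
print(f"Closed-form MLE:  {p_closed_form:.3f}")
```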

Cross-Entropy Loss IS MLE - The Critical Connection

This is one of the most important connections in all of ML engineering.

Suppose you have a classification model $f_\theta$ that outputs probabilities $\hat{y}_i = p(y=1 | x_i, \theta)$ for binary classification. The likelihood of the training data under this model is:

$\mathcal{L}(\theta) = \prod_{i=1}^n \hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i}$

The log-likelihood:

$\log \mathcal{L}(\theta) = \sum_{i=1}^n \left[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$

Maximising log-likelihood is equivalent to minimising negative log-likelihood:

$-\frac{1}{n}\log \mathcal{L}(\theta) = -\frac{1}{n}\sum_{i=1}^n \left[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$

This is exactly binary cross-entropy loss. For multi-class with $K$ classes:

$\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K y_{ik}\log \hat{y}_{ik}$

This is exactly negative log-likelihood under a categorical distribution.

:::tip ML Engineering Connection Every time you write F.cross_entropy(logits, targets) in PyTorch, you are minimising a negative log-likelihood, which is Maximum Likelihood Estimation (F.cross_entropy applies log-softmax to the logits internally). The model parameters that minimise cross-entropy are exactly the parameters that make your training labels most probable under the model's predicted distribution. This is not a heuristic - it is the maximum-likelihood answer to "what parameters best fit this data?" :::

import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy = negative log-likelihood."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def negative_log_likelihood(y_true, y_pred, eps=1e-15):
    """Bernoulli negative log-likelihood."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    log_lik = np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return -log_lik / len(y_true)

# These are identical
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.3])

print(f"Cross-entropy loss:       {cross_entropy_loss(y_true, y_pred):.6f}")
print(f"Negative log-likelihood:  {negative_log_likelihood(y_true, y_pred):.6f}")
# Output will be identical
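
The multi-class identity stated earlier can be checked the same way. The sketch below is a NumPy illustration rather than PyTorch's implementation; the helper names and the example logits are assumptions made here.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy_one_hot(logits, y_one_hot):
    """Multi-class cross-entropy with one-hot targets."""
    return -np.mean(np.sum(y_one_hot * log_softmax(logits), axis=1))

def categorical_nll(logits, labels):
    """Mean negative log-probability assigned to the observed class."""
    log_probs = log_softmax(logits)
    return -np.mean(log_probs[np.arange(len(labels)), labels])

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5, 0.3],
                   [-0.5, 0.0, 2.2]])
labels = np.array([0, 1, 2])
y_one_hot = np.eye(3)[labels]

ce = cross_entropy_one_hot(logits, y_one_hot)
nll = categorical_nll(logits, labels)
print(f"Cross-entropy (one-hot):  {ce:.6f}")
print(f"Categorical NLL:          {nll:.6f}")
```

Because each one-hot row has a single 1, the inner sum over classes just picks out the log-probability of the observed class, which is why the two values coincide.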

MSE Loss IS MLE for Gaussian Noise

Similarly, minimising Mean Squared Error (used for regression) is MLE under the assumption of Gaussian noise:

If $y_i = f_\theta(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ , then:

$p(y_i | x_i, \theta) = \mathcal{N}(f_\theta(x_i), \sigma^2)$

The log-likelihood:

$\log \mathcal{L}(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - f_\theta(x_i))^2$

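The first term and the factor $\frac{1}{2\sigma^2}$ do not depend on $\theta$ , so maximising this with respect to $\theta$ is equivalent to minimising the sum of squared errors, which is $n$ times the MSE loss:

$\sum_{i=1}^n (y_i - f_\theta(x_i))^2 = n \cdot \text{MSE}(\theta)$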

MSE loss = MLE under Gaussian noise assumption. When you switch from MSE to Huber loss, you are implicitly changing the noise distribution from Gaussian to something with heavier tails (more robust to outliers).

Loss Function	Implicit Noise Model
MSE	Gaussian $\mathcal{N}(0, \sigma^2)$
MAE	Laplace distribution
Huber loss	Gaussian for small errors, Laplace for large
Cross-entropy	Bernoulli / Categorical
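
One practical consequence of the table above: for a model that predicts a single constant, the MSE minimiser is the mean (the Gaussian MLE of location) and the MAE minimiser is the median (the Laplace MLE of location). A small brute-force sketch, using made-up data with one outlier:

```python
import numpy as np

# Made-up targets; the last value is an outlier
y = np.array([1.0, 1.2, 0.9, 1.1, 8.0])

# Candidate constant predictions c on a grid with step 0.001
candidates = np.linspace(0.0, 9.0, 9001)

# Total loss of each candidate under squared and absolute error
sq_loss = ((y[None, :] - candidates[:, None]) ** 2).sum(axis=1)
abs_loss = np.abs(y[None, :] - candidates[:, None]).sum(axis=1)

c_mse = candidates[np.argmin(sq_loss)]
c_mae = candidates[np.argmin(abs_loss)]

print(f"MSE minimiser: {c_mse:.3f}  (mean   = {y.mean():.3f})")
print(f"MAE minimiser: {c_mae:.3f}  (median = {np.median(y):.3f})")
```

The outlier drags the MSE solution upward, while the MAE solution stays with the bulk of the data. That is the robustness the heavier-tailed noise model buys.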

Maximum A Posteriori (MAP) Estimation

MLE only uses the likelihood - it ignores any prior knowledge about $\theta$ . MAP estimation incorporates a prior distribution $p(\theta)$ :

$\hat{\theta}_{\text{MAP}} = \arg\max_\theta p(\theta | \mathcal{D}) = \arg\max_\theta p(\mathcal{D}|\theta) \cdot p(\theta)$

Taking the log:

$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[\log \mathcal{L}(\theta) + \log p(\theta)\right]$

This is equivalent to minimising the regularised loss:

$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[-\log \mathcal{L}(\theta) - \log p(\theta)\right]$
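
A small worked example shows how the prior changes the answer when data is scarce. The sketch assumes a Bernoulli likelihood with a Beta(a, b) prior on $p$, a choice made here for illustration, for which the MAP estimate has the closed form $(k + a - 1)/(n + a + b - 2)$ with $k$ successes in $n$ trials.

```python
import numpy as np

x = np.array([1, 1, 1])   # three coin flips, all heads
a, b = 2.0, 2.0           # hypothetical Beta(2, 2) prior, mildly favouring p near 0.5

k, n = x.sum(), len(x)
p_mle = k / n
p_map = (k + a - 1) / (n + a + b - 2)

# Numerical check: maximise log-likelihood + log-prior on a grid
grid = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(grid) + (n - k) * np.log(1 - grid)
log_prior = (a - 1) * np.log(grid) + (b - 1) * np.log(1 - grid)
p_map_grid = grid[np.argmax(log_lik + log_prior)]

print(f"MLE:              {p_mle:.3f}")
print(f"MAP (closed form): {p_map:.3f}")
print(f"MAP (grid search): {p_map_grid:.3f}")
```

With three heads in three flips, MLE concludes the coin always lands heads, while the prior pulls the MAP estimate back towards 0.5.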

L2 Regularisation = Gaussian Prior

If $p(\theta) = \mathcal{N}(0, \lambda^{-1}I)$ (a Gaussian prior centred at zero):

$\log p(\theta) = -\frac{\lambda}{2}\|\theta\|^2 + \text{const}$

The MAP objective becomes:

$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[\text{NLL}(\theta) + \frac{\lambda}{2}\|\theta\|^2\right]$

This is exactly L2 regularisation (Ridge). The regularisation strength $\lambda$ is the precision of the Gaussian prior - a stronger prior means we are more confident weights should be near zero.

L1 Regularisation = Laplace Prior

If $p(\theta) \propto \exp(-\lambda|\theta|)$ (a Laplace prior):

$\log p(\theta) = -\lambda\|\theta\|_1 + \text{const}$

The MAP objective:

$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[\text{NLL}(\theta) + \lambda\|\theta\|_1\right]$

This is L1 regularisation (Lasso). The Laplace prior has heavier tails than the Gaussian and a sharp peak at zero. The resulting L1 penalty is non-differentiable at zero, which is what drives some weights to exactly zero and so promotes sparsity.

import numpy as np
from numpy.linalg import norm

np.random.seed(42)
n, d = 100, 50  # n samples, d features (overdetermined: n > d)

X = np.random.randn(n, d)
w_true = np.zeros(d)
w_true[:5] = [1.0, -0.5, 0.8, -1.2, 0.3]  # sparse true weights
y = X @ w_true + 0.1 * np.random.randn(n)

# MLE solution (OLS) - use lstsq for numerical stability
w_mle, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# MAP solution (Ridge regression) - L2 prior
# Normal equations with regularization: w_map = (X^T X + lambda I)^{-1} X^T y
lambda_reg = 1.0
A = X.T @ X + lambda_reg * np.eye(d)
b = X.T @ y
w_map = np.linalg.solve(A, b)

print("MLE vs MAP estimation (overdetermined system, n > d):")
print(f"MLE weights norm:  {norm(w_mle):.3f}")
print(f"MAP weights norm:  {norm(w_map):.3f}")
print(f"True weights norm: {norm(w_true):.3f}")

mle_error = norm(w_mle - w_true)
map_error = norm(w_map - w_true)
print(f"\nMLE estimation error: {mle_error:.3f}")
print(f"MAP estimation error: {map_error:.3f}")
print(f"MAP improvement: {((mle_error - map_error)/mle_error)*100:.1f}%")
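
The difference between the two priors is easiest to see in one dimension. The sketch below assumes a unit-variance Gaussian likelihood centred at an unregularised estimate z, an assumption made here for illustration, so the MAP objectives have closed-form minimisers. The function names map_l2 and map_l1 are hypothetical.

```python
import numpy as np

def map_l2(z, lam):
    """argmin over theta of 0.5*(theta - z)**2 + 0.5*lam*theta**2 (Gaussian prior)."""
    return z / (1.0 + lam)

def map_l1(z, lam):
    """argmin over theta of 0.5*(theta - z)**2 + lam*|theta| (Laplace prior).

    This is the soft-thresholding operator.
    """
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-2.0, -0.3, 0.1, 0.4, 1.5])   # hypothetical unregularised estimates
lam = 0.5

theta_l2 = map_l2(z, lam)
theta_l1 = map_l1(z, lam)

print(f"Unregularised:   {z}")
print(f"L2 (Gaussian):   {theta_l2.round(3)}")
print(f"L1 (Laplace):    {theta_l1.round(3)}")
print(f"Exact zeros: L2 = {np.sum(theta_l2 == 0)}, L1 = {np.sum(theta_l1 == 0)}")
```

The Gaussian prior rescales every coordinate but never sets one to zero, whereas the Laplace prior zeroes every coordinate whose magnitude is below lam.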

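:::note The Regularisation Story Many regularisation techniques in ML have a Bayesian interpretation. When you add an L2 penalty to the loss (which is what weight decay amounts to under plain SGD), you are performing MAP estimation with a Gaussian prior. With adaptive optimisers such as Adam, decoupled weight decay (as in AdamW) is no longer exactly equivalent to an L2 penalty, so the correspondence is only approximate. When you add L1 to a linear model, you are performing MAP with a Laplace prior. The hyperparameter $\lambda$ controls how strong your prior belief is that weights should be small. :::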

Properties of Good Estimators

Consistency

An estimator $\hat{\theta}_n$ is consistent if it converges to the true value as $n \to \infty$ :

$\hat{\theta}_n \xrightarrow{p} \theta \quad \text{as } n \to \infty$

Maximum likelihood estimators are consistent under mild regularity conditions, provided the model is correctly specified and identifiable. This is the formal version of "more data helps": the parameter estimates converge to the true values as the dataset grows.
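
Consistency can be watched in a simulation. The sketch below tracks the MLE of a Gaussian variance as the sample size grows; the true parameter values and sample sizes are assumptions chosen for illustration, and a single random run is shown, so the error need not shrink monotonically.

```python
import numpy as np

rng = np.random.default_rng(5)
mu_true, sigma2_true = 1.0, 4.0

errors = {}
for n in [10, 100, 1_000, 10_000, 100_000]:
    data = rng.normal(mu_true, np.sqrt(sigma2_true), size=n)
    # MLE of the variance: divides by n, so it is biased but consistent
    sigma2_mle = np.mean((data - data.mean()) ** 2)
    errors[n] = abs(sigma2_mle - sigma2_true)
    print(f"n = {n:>7} | sigma^2 MLE = {sigma2_mle:.4f} | abs error = {errors[n]:.4f}")
```

The bias of this estimator is $-\sigma^2/n$, which vanishes as $n$ grows, so being biased at finite $n$ does not prevent consistency.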

Efficiency and the Cramér-Rao Bound

An estimator is efficient if it achieves the lowest possible variance among all unbiased estimators. For $n$ i.i.d. observations, the Cramér-Rao lower bound establishes this minimum variance:

$\text{Var}(\hat{\theta}) \geq \frac{1}{n\,I(\theta)}$

where $I(\theta)$ is the Fisher information of a single observation:

$I(\theta) = -\mathbb{E}\left[\frac{\partial^2 \log p(x|\theta)}{\partial \theta^2}\right]$

MLE is asymptotically efficient - for large $n$ , the variance of the MLE approaches the Cramér-Rao bound. Under the usual regularity conditions, no other well-behaved (regular) consistent estimator has lower asymptotic variance.
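
Fisher information can be estimated directly from its definition. The sketch below does this for a Bernoulli($p$) observation, whose closed form is $I(p) = \frac{1}{p(1-p)}$. It also uses the standard identity that, under regularity conditions, Fisher information equals the variance of the score, which is not derived in this article. The value of $p$ and the sample count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
p = 0.3
x = (rng.random(500_000) < p).astype(float)

# Score: first derivative of log p(x|p) with respect to p
score = x / p - (1 - x) / (1 - p)

# Second derivative of log p(x|p) with respect to p
second_deriv = -x / p**2 - (1 - x) / (1 - p) ** 2

fisher_from_curvature = -second_deriv.mean()   # definition used above
fisher_from_score = score.var()                # variance-of-score identity
fisher_closed_form = 1.0 / (p * (1 - p))

print(f"From curvature:  {fisher_from_curvature:.4f}")
print(f"From score var:  {fisher_from_score:.4f}")
print(f"Closed form:     {fisher_closed_form:.4f}")
```

Sharper curvature of the log-likelihood means more information per observation and therefore a tighter lower bound on the variance.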

import numpy as np

def cramer_rao_bound(sigma2, n):
    """Minimum variance from Cramér-Rao bound for Gaussian mean."""
    # Total Fisher information of n samples for the mean: n * I(mu) = n / sigma^2
    return sigma2 / n

sigma2_true = 4.0
sample_sizes = [10, 50, 100, 500, 1000]

print("Cramér-Rao Bound vs Empirical Variance of Sample Mean:")
print(f"{'n':>8} | {'CRB (min var)':>14} | {'Empirical Var':>14} | {'Ratio':>8}")
print("-" * 54)

for n in sample_sizes:
    crb = cramer_rao_bound(sigma2_true, n)
    n_experiments = 10_000
    means = [np.mean(np.random.normal(0, np.sqrt(sigma2_true), n))
             for _ in range(n_experiments)]
    empirical_var = np.var(means)
    print(f"{n:>8} | {crb:>14.6f} | {empirical_var:>14.6f} | {empirical_var/crb:>8.4f}")
# Ratio should be ~1.0 - sample mean achieves the CRB

Summary of Estimator Properties

Property	Definition	ML Implication
Unbiased	$\mathbb{E}[\hat{\theta}] = \theta$	Sample mean is unbiased for $\mu$
Consistent	$\hat{\theta}_n \to \theta$ as $n\to\infty$	MLE converges with more data
Efficient	Achieves Cramér-Rao bound	MLE is asymptotically efficient
Sufficient	Uses all info in data	MLE uses sufficient statistics

MLE vs MAP vs Full Bayesian

Approach	Formula	Output	Regularisation	Use Case
MLE	$\arg\max_\theta p(\mathcal{D}|\theta)$	Point estimate	None	Large data, no prior
MAP	$\arg\max_\theta p(\mathcal{D}|\theta)p(\theta)$	Point estimate	Yes (via prior)	Small data, prior known
Full Bayesian	$p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)p(\theta)$	Distribution	Yes (via prior)	Uncertainty needed

MLE tends to overfit in small-data regimes. MAP counteracts this by adding regularisation through the prior. Full Bayesian inference (covered in Module 06) gives a distribution over parameters - uncertainty quantification rather than just a point estimate.

:::warning Common Interview Trap When asked "what is MLE?", many candidates say "you maximise the likelihood." That is correct but incomplete. The full answer: MLE finds the parameters $\hat{\theta}$ such that the observed data is most probable. It is equivalent to minimising cross-entropy loss for classification and MSE loss for regression with Gaussian noise. The regularisation you add converts it from MLE to MAP. :::

Putting It Together: The Training Loop as MLE

import numpy as np

# This simplified training loop IS MLE
def logistic_regression_mle(X, y, learning_rate=0.01, epochs=1000):
    """
    Train logistic regression by minimising the mean NLL, i.e. by MLE.
    The loss is the quantity torch.nn.BCELoss() computes on y_hat
    (up to how log(0) is clamped); the gradient step is done by hand.
    """
    n, d = X.shape
    theta = np.zeros(d)

    for epoch in range(epochs):
        # Forward pass: predict probabilities
        logits = X @ theta
        y_hat = 1 / (1 + np.exp(-logits))  # sigmoid

        # Loss = negative log-likelihood = cross-entropy
        eps = 1e-15
        nll = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

        # Gradient of NLL w.r.t. theta
        grad = X.T @ (y_hat - y) / n

        # Gradient descent step (MLE update)
        theta -= learning_rate * grad

        if epoch % 100 == 0:
            print(f"Epoch {epoch:4d} | NLL: {nll:.4f}")

    return theta

# Generate synthetic data
np.random.seed(42)
n = 500
X = np.column_stack([np.ones(n), np.random.randn(n, 2)])
w_true = np.array([0.5, 1.2, -0.8])
probs = 1 / (1 + np.exp(-X @ w_true))
y = (np.random.rand(n) < probs).astype(float)

w_mle = logistic_regression_mle(X, y)
print(f"\nTrue weights: {w_true}")
print(f"MLE weights:  {w_mle.round(3)}")

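The training loop above performs MLE by gradient descent. Adding an L2 penalty such as (lam / 2) * np.sum(theta**2) to the NLL, and the matching term lam * theta to the gradient, converts it to MAP with a Gaussian prior - which is L2-regularised logistic regression. Note that lambda is a reserved word in Python, so the variable needs another name such as lam.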
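
A minimal sketch of that MAP variant follows. The function name logistic_regression_map, the penalty strength, and the learning rate are assumptions for illustration. For simplicity it also penalises the intercept, which is often left unpenalised in practice, and it adds the penalty to the mean NLL rather than the summed NLL.

```python
import numpy as np

def logistic_regression_map(X, y, lam=0.1, learning_rate=0.1, epochs=2000):
    """Gradient descent on mean NLL + (lam / 2) * ||theta||^2.

    With lam=0.0 this reduces to plain MLE.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        y_hat = 1 / (1 + np.exp(-(X @ theta)))   # sigmoid
        # Gradient of the mean NLL plus gradient of the L2 penalty
        grad = X.T @ (y_hat - y) / n + lam * theta
        theta -= learning_rate * grad
    return theta

# Synthetic data with an intercept column and two features
rng = np.random.default_rng(7)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
w_true = np.array([0.5, 1.2, -0.8])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

theta_mle = logistic_regression_map(X, y, lam=0.0)
theta_map = logistic_regression_map(X, y, lam=0.1)

print(f"MLE weights: {theta_mle.round(3)}  (norm {np.linalg.norm(theta_mle):.3f})")
print(f"MAP weights: {theta_map.round(3)}  (norm {np.linalg.norm(theta_map):.3f})")
```

The only change from the MLE loop is the extra lam * theta term in the gradient, which pulls the weights towards zero at every step.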

Interview Q&A

Q1: Why does training a neural network with cross-entropy loss correspond to MLE?

Cross-entropy loss is $-\frac{1}{n}\sum_i \sum_k y_{ik}\log\hat{y}_{ik}$ . If the model outputs a categorical distribution $\hat{y}_{ik} = p(y=k|x_i;\theta)$ , then the log-likelihood of the training data under this model is $\sum_i \sum_k y_{ik}\log p(y=k|x_i;\theta)$ . Minimising cross-entropy is equivalent to maximising log-likelihood, which is MLE. The model parameters that minimise cross-entropy are exactly those that make the training labels most probable under the model's predicted distribution.

Q2: What is the difference between MLE and MAP? Give a practical example.

MLE finds the parameters that maximise the likelihood $p(\mathcal{D}|\theta)$ . MAP finds the parameters that maximise the posterior $p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)p(\theta)$ , incorporating a prior belief about $\theta$ . Practical example: in linear regression, MLE gives ordinary least squares (OLS). MAP with a Gaussian prior on the weights gives Ridge regression (L2 regularisation). MAP with a Laplace prior gives Lasso (L1 regularisation). The regularisation term in your loss function IS the negative log of your prior.

Q3: What does it mean for an estimator to be biased? Is the MLE for Gaussian variance biased?

An estimator is biased if its expectation differs from the true parameter: $\mathbb{E}[\hat{\theta}] \neq \theta$ . Yes, the MLE for Gaussian variance is biased - it divides by $n$ rather than $n-1$ , giving $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum(x_i-\bar{x})^2$ . The expected value is $\frac{n-1}{n}\sigma^2 < \sigma^2$ . Bessel's correction multiplies by $\frac{n}{n-1}$ to produce the unbiased $\frac{1}{n-1}\sum(x_i-\bar{x})^2$ .

Q4: How does the bias-variance tradeoff of estimators relate to the bias-variance tradeoff in ML models?

They are the same tradeoff at different scales. For an estimator, $\text{MSE}(\hat{\theta}) = \text{Bias}^2(\hat{\theta}) + \text{Var}(\hat{\theta})$ . For an ML model, expected generalisation error decomposes the same way: $\text{Bias}^2 + \text{Variance} + \text{Irreducible noise}$ . A model with high bias (underfitting) is using a function class too simple to represent the true relationship. A model with high variance (overfitting) is sensitive to which training dataset was sampled - the estimator has high variance across datasets.

Q5: What is the Cramér-Rao lower bound and why does it matter in ML?

The Cramér-Rao lower bound states that for any unbiased estimator $\hat{\theta}$ built from $n$ i.i.d. observations, $\text{Var}(\hat{\theta}) \geq 1/(n\,I(\theta))$ where $I(\theta)$ is the Fisher information of a single observation. No unbiased estimator can be more precise than this. It matters in ML because: (1) it tells us the fundamental limits of parameter estimation from data, (2) MLE achieves this bound asymptotically (it is asymptotically efficient), and (3) the Fisher information matrix is the basis for natural gradient methods: K-FAC approximates it with Kronecker-factored blocks, and Adam's second-moment estimate is often interpreted as a rough diagonal approximation of the empirical Fisher. Understanding Fisher information connects training dynamics to statistical estimation theory.

