Expectation, Variance, and Moments

Reading time: ~45 minutes | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist, MLOps Engineer

The ML Scenario That Motivates This Lesson

You are training a transformer model and you notice the training loss oscillates wildly. Someone suggests increasing the batch size. Why does that help?

The answer requires understanding variance of estimators. The gradient of the true loss function (over the full training set) is the expected gradient over the data distribution. The mini-batch gradient is an estimate of this expectation. Its variance decreases as $1/B$ where $B$ is the batch size - doubling the batch size halves the gradient variance, leading to more stable updates.

Similarly, Batch Normalization works by computing the mean and variance of activations within each mini-batch and normalizing by them. These are moment statistics. The learnable parameters $\gamma$ and $\beta$ re-scale and re-shift to optimal mean and variance. Without understanding expectation and variance, BatchNorm is a mystery; with these tools, it's a natural solution.

1. Expected Value

The expected value (or expectation, or mean) $\mathbb{E}[X]$ is the probability-weighted average of all possible values of $X$ .

Discrete RV

$\mathbb{E}[X] = \sum_{x} x \cdot P(X = x)$

Example: Fair die: $\mathbb{E}[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = 3.5$

Continuous RV

$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) \, dx$

Example: Standard normal: $\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot \frac{1}{\sqrt{2\pi}} e^{-x^2/2} dx = 0$ (by symmetry)

Expected Value of a Function

For any function $g$ :

$\mathbb{E}[g(X)] = \sum_x g(x) P(X=x) \quad \text{(discrete)}$ $\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx \quad \text{(continuous)}$

This is called the Law of the Unconscious Statistician (LOTUS) - you don't need the distribution of $g(X)$ to compute $\mathbb{E}[g(X)]$ .

import numpy as np
from scipy.stats import norm, binom

# Discrete: expected value of a die
die_faces = np.arange(1, 7)
die_probs = np.ones(6) / 6
ev_die = np.sum(die_faces * die_probs)
print(f"E[X] for fair die: {ev_die:.4f}")  # 3.5

# E[X^2] for a fair die (using LOTUS)
ev_x_squared = np.sum(die_faces**2 * die_probs)
print(f"E[X^2] for fair die: {ev_x_squared:.4f}")  # 15.1667

# Continuous: verify E[X] = 0 for N(0,1) via integration
from scipy import integrate

integrand = lambda x: x * norm.pdf(x)
ev_normal, _ = integrate.quad(integrand, -np.inf, np.inf)
print(f"E[X] for N(0,1): {ev_normal:.8f}")  # ~0

2. Linearity of Expectation

The most important property of expectation:

$\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$

This holds even when $X$ and $Y$ are dependent - a remarkable fact.

More generally, for any constants $a_1, \ldots, a_n$ and random variables $X_1, \ldots, X_n$ :

$\mathbb{E}\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n a_i \mathbb{E}[X_i]$

Why This Is Powerful

Example 1: Expected number of heads in $n$ flips

Instead of computing from the Binomial PMF, write $S_n = X_1 + X_2 + \cdots + X_n$ where each $X_i \sim \text{Bernoulli}(p)$ . Then: $\mathbb{E}[S_n] = \sum_{i=1}^n \mathbb{E}[X_i] = n \cdot p$

Example 2: Expected loss over a dataset

If the total loss is $\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \ell(\hat{y}_i, y_i)$ , then: $\mathbb{E}[\mathcal{L}] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}[\ell(\hat{y}_i, y_i)]$

By the Law of Large Numbers (Lesson 07), as $N \to \infty$ , the empirical average converges to this expectation.

:::tip Linearity of Expectation in SGD The stochastic gradient $g_B = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla_\theta \ell_i$ is an unbiased estimator of the full gradient:

$\mathbb{E}[g_B] = \mathbb{E}\left[\frac{1}{B}\sum_{i \in \mathcal{B}} \nabla_\theta \ell_i\right] = \frac{1}{B} \sum_{i \in \mathcal{B}} \mathbb{E}[\nabla_\theta \ell_i] = \nabla_\theta \mathcal{L}$

This unbiasedness is guaranteed by linearity of expectation. It is why SGD converges to a minimum of the true loss function (under appropriate learning rate schedules), not just a minimum of the training batch. :::

3. Variance

Variance measures the spread of a distribution around its mean:

$\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]$

Computational Formula

A simpler formula for computing variance:

$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$

Proof: $\mathbb{E}[(X-\mu)^2] = \mathbb{E}[X^2 - 2\mu X + \mu^2] = \mathbb{E}[X^2] - 2\mu^2 + \mu^2 = \mathbb{E}[X^2] - \mu^2$

Standard Deviation

$\sigma_X = \sqrt{\text{Var}(X)}$

Standard deviation has the same units as $X$ , making it more interpretable than variance.

Properties of Variance

Property	Formula	Notes
Constant shift	$\text{Var}(X + c) = \text{Var}(X)$	Shifting doesn't change spread
Scaling	$\text{Var}(aX) = a^2 \text{Var}(X)$	Scaling squares the variance
Sum of independent	$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$	Only if $X \perp Y$
Non-negative	$\text{Var}(X) \geq 0$	Zero iff $X$ is a constant

import numpy as np

np.random.seed(42)
n = 1_000_000

X = np.random.randn(n)   # N(0,1)

# Variance properties
c = 5.0
a = 3.0

print(f"Var(X)     = {X.var():.4f}  (expected: 1.0)")
print(f"Var(X + c) = {(X + c).var():.4f}  (expected: 1.0)")
print(f"Var(a*X)   = {(a*X).var():.4f}  (expected: {a**2:.1f})")
print(f"Std(X)     = {X.std():.4f}  (expected: 1.0)")

# Computational formula: Var(X) = E[X^2] - E[X]^2
ev_x2 = (X**2).mean()
ev_x  = X.mean()
var_formula = ev_x2 - ev_x**2
print(f"\nE[X^2] - E[X]^2 = {var_formula:.6f}  (direct: {X.var():.6f})")

# Variance of sum: independent vs dependent
Y = np.random.randn(n)   # independent of X

Z_indep = X + Y
print(f"\nVar(X+Y) where X,Y independent:")
print(f"  Direct: {Z_indep.var():.4f}, Formula: {X.var() + Y.var():.4f}")

# Make Y dependent on X
Y_dep = 0.8 * X + 0.6 * np.random.randn(n)
Z_dep = X + Y_dep
print(f"\nVar(X+Y) where Y = 0.8X + noise:")
print(f"  Direct: {Z_dep.var():.4f}")
print(f"  (higher than {X.var() + Y_dep.var():.4f} because of positive covariance)")

4. Covariance and Correlation

Covariance

Covariance measures the joint variability of two random variables:

$\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$

Computational formula:

$\text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$

Covariance Value	Interpretation
$> 0$	$X$ and $Y$ tend to increase together
$= 0$	$X$ and $Y$ are uncorrelated
$< 0$	When $X$ increases, $Y$ tends to decrease

Key identity: $\text{Var}(X) = \text{Cov}(X, X)$

Sum of correlated variables:

$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$

Correlation Coefficient

Normalized covariance that lives in $[-1, 1]$ :

$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)} \cdot \sqrt{\text{Var}(Y)}}$

$\rho = 1$ : perfect positive linear relationship
$\rho = -1$ : perfect negative linear relationship
$\rho = 0$ : uncorrelated (but may still be dependent!)

:::warning Uncorrelated Does Not Mean Independent $\rho = 0$ means no linear relationship. Two random variables can be perfectly dependent but uncorrelated. Example: $X \sim \mathcal{N}(0,1)$ , $Y = X^2$ . Then $\text{Cov}(X, Y) = \mathbb{E}[X \cdot X^2] - \mathbb{E}[X]\mathbb{E}[X^2] = \mathbb{E}[X^3] - 0 = 0$ (odd moments of symmetric distributions are zero). Yet $Y$ is completely determined by $X$ .

In ML: features can be uncorrelated but still redundant or dependent in nonlinear ways. :::

import numpy as np

np.random.seed(42)
n = 100_000

# X and Y with controlled correlation
rho = 0.8
X = np.random.randn(n)
Y = rho * X + np.sqrt(1 - rho**2) * np.random.randn(n)

print(f"Target rho = {rho}")
print(f"Empirical Cov(X,Y) = {np.cov(X, Y)[0,1]:.4f}")
print(f"Empirical Corr(X,Y) = {np.corrcoef(X, Y)[0,1]:.4f}")

# Demonstrate: uncorrelated but dependent
X2 = np.random.randn(n)
Y2 = X2**2
print(f"\nX and Y=X^2:")
print(f"Correlation = {np.corrcoef(X2, Y2)[0,1]:.4f}  (near 0 - uncorrelated)")
print(f"But Y is completely determined by X  (dependent!)")

Covariance Matrix

For a random vector $\mathbf{X} = [X_1, \ldots, X_d]^T$ , the covariance matrix is:

$\Sigma_{ij} = \text{Cov}(X_i, X_j)$

$\boldsymbol{\Sigma} = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]$

Properties:

$\boldsymbol{\Sigma}$ is symmetric: $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^T$
$\boldsymbol{\Sigma}$ is positive semi-definite: $\mathbf{v}^T \boldsymbol{\Sigma} \mathbf{v} \geq 0$ for all $\mathbf{v}$
Diagonal entries are variances: $\Sigma_{ii} = \text{Var}(X_i)$

import numpy as np

# Compute empirical covariance matrix
np.random.seed(42)
n_samples, n_features = 1000, 4

# Create correlated features
A = np.array([[1.0, 0.5, 0.0, -0.3],
              [0.5, 1.0, 0.2,  0.0],
              [0.0, 0.2, 1.0,  0.4],
              [-0.3, 0.0, 0.4, 1.0]])

# Generate data with this covariance structure
L = np.linalg.cholesky(A)  # Cholesky decomposition
X = (L @ np.random.randn(n_features, n_samples)).T  # (n_samples, n_features)

Sigma_empirical = np.cov(X.T)
print("Empirical covariance matrix:")
print(Sigma_empirical.round(3))
print(f"\nSymmetric: {np.allclose(Sigma_empirical, Sigma_empirical.T)}")
print(f"PSD (all eigenvalues >= 0): {np.all(np.linalg.eigvals(Sigma_empirical) >= 0)}")

5. Higher Moments

The $k$ -th moment of $X$ about the origin:

$\mu_k = \mathbb{E}[X^k]$

The $k$ -th central moment (about the mean $\mu = \mathbb{E}[X]$ ):

$\tilde{\mu}_k = \mathbb{E}[(X - \mu)^k]$

Moment	Quantity	Formula
1st central	Mean	$\mu = \mathbb{E}[X]$
2nd central	Variance	$\sigma^2 = \mathbb{E}[(X-\mu)^2]$
3rd standardized	Skewness	$\gamma_1 = \mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]$
4th standardized	Kurtosis	$\gamma_2 = \mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^4\right]$

Skewness

$\text{Skewness}(X) = \frac{\mathbb{E}[(X - \mu)^3]}{\sigma^3}$

Positive skewness: long right tail (e.g., income distributions, word frequencies)
Negative skewness: long left tail
Zero skewness: symmetric distribution (e.g., Gaussian)

Kurtosis

$\text{Kurtosis}(X) = \frac{\mathbb{E}[(X - \mu)^4]}{\sigma^4}$

Gaussian: kurtosis = 3 (or 0 in excess kurtosis convention)
Higher kurtosis ("leptokurtic"): heavier tails, more extreme outliers
Lower kurtosis ("platykurtic"): lighter tails

Skewness vs Kurtosis:

Normal:           Positive Skew:        High Kurtosis:
    ┌───┐              ┌─┐               ┌─┐
    │   │              │  \              ││ │
   ─┤   ├─            ─┤   \─         ──┤ ├──
    └───┘              └─────           └─┘
  Symmetric          Long right tail   Heavy tails

from scipy.stats import norm, expon, t as t_dist
import numpy as np

np.random.seed(42)
n = 100_000

distributions = {
    'Normal N(0,1)': np.random.randn(n),
    'Exponential(1)': np.random.exponential(1, n),
    't-distribution(df=3)': np.random.standard_t(df=3, size=n),
}

from scipy.stats import skew, kurtosis

print(f"{'Distribution':<25} {'Skewness':>10} {'Kurtosis':>10}")
print("-" * 47)
for name, samples in distributions.items():
    sk = skew(samples)
    ku = kurtosis(samples, fisher=False)  # Fisher=False gives Pearson kurtosis
    print(f"{name:<25} {sk:>10.4f} {ku:>10.4f}")

# Normal: skewness≈0, kurtosis≈3
# Exponential: skewness≈2, kurtosis≈9 (right-skewed, heavy tails)
# t(df=3): skewness≈0 (symmetric), kurtosis>>3 (very heavy tails)

6. Batch Normalization: Moments in Practice

Batch Normalization normalizes activations using the empirical mean and variance of each feature across the mini-batch:

$\hat{x}_i = \frac{x_i - \hat{\mu}_\mathcal{B}}{\sqrt{\hat{\sigma}^2_\mathcal{B} + \epsilon}}$

where:

$\hat{\mu}_\mathcal{B} = \frac{1}{B} \sum_{i \in \mathcal{B}} x_i, \quad \hat{\sigma}^2_\mathcal{B} = \frac{1}{B} \sum_{i \in \mathcal{B}} (x_i - \hat{\mu}_\mathcal{B})^2$

Then it applies learnable affine transform:

$y_i = \gamma \hat{x}_i + \beta$

BatchNorm steps:

  x_i (activations)
      │
      ▼  Compute batch mean μ and variance σ²
      │
      ▼  Normalize: x_hat = (x - μ) / sqrt(σ² + ε)
      │
      ▼  Rescale: y = γ * x_hat + β   (learned γ, β)
      │
      ▼  y_i (normalized activations)

Why does it work? Deep networks suffer from internal covariate shift - the distribution of activations changes as weights update. BatchNorm forces each layer's input to have approximately zero mean and unit variance (before the $\gamma, \beta$ transform), stabilizing training.

import numpy as np

def batch_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    """
    X: shape (batch_size, features)
    Computes batch normalization along the batch dimension.
    """
    mu     = X.mean(axis=0)          # mean over batch: shape (features,)
    sigma2 = X.var(axis=0)           # variance over batch
    X_norm = (X - mu) / np.sqrt(sigma2 + eps)  # normalize
    return gamma * X_norm + beta     # scale and shift

# Simulate a mini-batch of activations
np.random.seed(42)
batch = np.random.randn(32, 64) * 5 + 3  # Non-standard: mean=3, std=5

print(f"Before BatchNorm: mean={batch.mean():.2f}, std={batch.std():.2f}")
normed = batch_norm(batch, gamma=1.0, beta=0.0)
print(f"After  BatchNorm: mean={normed.mean():.4f}, std={normed.std():.4f}")

# After BN with gamma=2, beta=0.5
normed_gs = batch_norm(batch, gamma=2.0, beta=0.5)
print(f"With gamma=2, beta=0.5: mean={normed_gs.mean():.4f}, std={normed_gs.std():.4f}")

7. Gradient Variance and Batch Size

Recall from SGD: the stochastic gradient is:

$g_B = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_\theta \ell_i$

By linearity of expectation: $\mathbb{E}[g_B] = \nabla_\theta \mathcal{L}$ (unbiased).

What about variance? Assuming the per-sample gradients are independent:

$\text{Var}(g_B) = \frac{1}{B^2} \sum_{i \in \mathcal{B}} \text{Var}(\nabla_\theta \ell_i) = \frac{\text{Var}(\nabla_\theta \ell)}{B}$

Gradient variance scales as $1/B$ .

This explains:

Larger batches → lower gradient variance → more stable, larger learning rates possible
Very small batches → high variance → acts as noise/regularization (can help escape local minima)
The "linear scaling rule": if you multiply batch size by $k$ , multiply learning rate by $k$ (to keep the effective step size consistent with the variance reduction)

import numpy as np

np.random.seed(42)
n_params = 1

# Simulate per-sample gradients (iid draws from N(1, 4))
true_gradient = 1.0
sample_std = 2.0
N = 10_000  # full dataset

per_sample_grads = true_gradient + sample_std * np.random.randn(N)

def estimate_gradient_variance(per_sample_grads, batch_size, n_estimates=5000):
    """Simulate gradient variance for given batch size."""
    estimates = []
    for _ in range(n_estimates):
        batch_idx = np.random.choice(len(per_sample_grads), batch_size, replace=False)
        batch_grad = per_sample_grads[batch_idx].mean()
        estimates.append(batch_grad)
    return np.var(estimates)

print(f"True gradient: {true_gradient}, Per-sample std: {sample_std}")
print(f"\nBatch size | Gradient Var | Expected Var (σ²/B)")
print("-" * 50)
for B in [1, 4, 16, 64, 256]:
    empirical_var = estimate_gradient_variance(per_sample_grads, B, n_estimates=2000)
    expected_var  = sample_std**2 / B
    print(f"  B={B:<5} | {empirical_var:.5f}     | {expected_var:.5f}")

8. ML Connection Summary

ML Concept	Probabilistic Quantity	Lesson
Training loss	$\mathbb{E}_{\mathbf{x},y}[\ell(\hat{y}, y)]$	Expectation
SGD gradient	Unbiased estimate of $\mathbb{E}[\nabla \ell]$	Linearity of expectation
Gradient stability	$\text{Var}(g_B) = \text{Var}(g_1)/B$	Variance
Batch Normalization	Normalize by $\hat{\mu}$ and $\hat{\sigma}^2$	Mean and variance
Feature redundancy	High $\text{Cov}(X_i, X_j)$	Covariance
PCA	Eigendecomposition of covariance matrix	Covariance matrix
Activation distribution	Skewness, kurtosis	Higher moments
Weight initialization	Variance of weights → variance of activations	Variance propagation

9. Interview Q&A

Q1: What is the linearity of expectation and why does it matter for SGD?

A: Linearity of expectation states that $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ for any random variables $X, Y$ (even if they are dependent) and any constants $a, b$ . For SGD, the mini-batch gradient is $g_B = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla_\theta \ell_i$ . By linearity: $\mathbb{E}[g_B] = \frac{1}{B}\sum_{i} \mathbb{E}[\nabla_\theta \ell_i] = \nabla_\theta \mathcal{L}$ . This means the mini-batch gradient is an unbiased estimator of the true gradient - in expectation, we move in exactly the right direction. Without this property, SGD would not converge to the correct optimum.

Q2: What is the relationship between variance and batch size? What does this imply for training?

A: The variance of the mini-batch gradient estimate is $\text{Var}(g_B) = \text{Var}(g_1) / B$ , where $\text{Var}(g_1)$ is the per-sample gradient variance. Gradient variance decreases linearly with batch size. This implies: (1) small batches have high gradient variance - updates are noisy but this noise can help escape local minima; (2) large batches have low variance - updates are more accurate but the model may converge to sharper minima that generalize worse ("generalization gap"); (3) the linear scaling rule for learning rate: if batch size increases by $k$ , multiply learning rate by $k$ to maintain the same effective update magnitude relative to gradient variance; (4) gradient accumulation (simulating large batches by accumulating gradients over multiple small batches) reduces variance exactly as larger batches would.

Q3: What is the computational formula for variance and why is it preferred numerically?

A: The computational formula $\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ allows you to compute mean and mean-of-squares in a single pass, then subtract. The definitional formula $\text{Var}(X) = \mathbb{E}[(X-\mu)^2]$ requires two passes (first compute $\mu$ , then compute squared deviations). However, the computational formula can have catastrophic cancellation issues: if $\mathbb{E}[X]$ is very large and $\text{Var}(X)$ is small, subtracting two large numbers of similar magnitude causes loss of significant digits. In practice, Welford's online algorithm computes a numerically stable variance in a single pass by maintaining running mean and sum of squared deviations. NumPy's np.var() uses a two-pass algorithm to avoid this issue.

Q4: What is the covariance matrix and why is it important in ML?

A: The covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ has $\Sigma_{ij} = \text{Cov}(X_i, X_j)$ . It captures all pairwise linear relationships between features. It matters in ML for several reasons: (1) PCA finds the eigenvectors of the covariance matrix of the data - they are the principal directions of variance; (2) Gaussian distributions are fully characterized by their mean vector and covariance matrix - so the covariance matrix completely determines a multivariate Gaussian; (3) Linear regression - the covariance matrix of features appears in the normal equations $(X^TX)^{-1}X^Ty$ , and its condition number determines numerical stability; (4) Mahalanobis distance uses the inverse covariance matrix to measure distances in a feature space that accounts for correlations and different scales.

Q5: How does Batch Normalization use moments, and what problem does it solve?

A: Batch Normalization computes the empirical mean $\hat{\mu}_\mathcal{B} = \frac{1}{B}\sum_i x_i$ and variance $\hat{\sigma}^2_\mathcal{B} = \frac{1}{B}\sum_i (x_i - \hat{\mu})^2$ over each mini-batch, then normalizes: $\hat{x}_i = (x_i - \hat{\mu})/\sqrt{\hat{\sigma}^2 + \epsilon}$ . Finally, it applies learnable scale $\gamma$ and shift $\beta$ . The problem it solves is internal covariate shift: as weights update during training, the distribution of activations at each layer shifts, forcing subsequent layers to constantly re-adapt. By forcing each layer's inputs to have approximately zero mean and unit variance (before $\gamma, \beta$ correction), BatchNorm stabilizes these distributions, allows higher learning rates, reduces sensitivity to weight initialization, and acts as mild regularization (the mean/variance statistics are computed from the batch, introducing noise). At inference time, running statistics (accumulated exponential moving averages of $\hat{\mu}$ and $\hat{\sigma}^2$ over training batches) are used instead of batch statistics.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Moments Explorer demo on the EngineersOfAI Playground - no code required.

:::

The ML Scenario That Motivates This Lesson​

1. Expected Value​

Discrete RV​

Continuous RV​

Expected Value of a Function​

2. Linearity of Expectation​

Why This Is Powerful​

3. Variance​

Computational Formula​

Standard Deviation​

Properties of Variance​

4. Covariance and Correlation​

Covariance​

Correlation Coefficient​

Covariance Matrix​

5. Higher Moments​

Skewness​

Kurtosis​

6. Batch Normalization: Moments in Practice​

7. Gradient Variance and Batch Size​

8. ML Connection Summary​

9. Interview Q&A​