Skip to main content

Expectation, Variance, and Moments

Reading time: ~45 minutes | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist, MLOps Engineer

The ML Scenario That Motivates This Lesson

You are training a transformer model and you notice the training loss oscillates wildly. Someone suggests increasing the batch size. Why does that help?

The answer requires understanding variance of estimators. The gradient of the true loss function (over the full training set) is the expected gradient over the data distribution. The mini-batch gradient is an estimate of this expectation. Its variance decreases as 1/B1/B where BB is the batch size - doubling the batch size halves the gradient variance, leading to more stable updates.

Similarly, Batch Normalization works by computing the mean and variance of activations within each mini-batch and normalizing by them. These are moment statistics. The learnable parameters γ\gamma and β\beta re-scale and re-shift to optimal mean and variance. Without understanding expectation and variance, BatchNorm is a mystery; with these tools, it's a natural solution.

1. Expected Value

The expected value (or expectation, or mean) E[X]\mathbb{E}[X] is the probability-weighted average of all possible values of XX.

Discrete RV

E[X]=xxP(X=x)\mathbb{E}[X] = \sum_{x} x \cdot P(X = x)

Example: Fair die: E[X]=16(1+2+3+4+5+6)=216=3.5\mathbb{E}[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = 3.5

Continuous RV

E[X]=xfX(x)dx\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) \, dx

Example: Standard normal: E[X]=x12πex2/2dx=0\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot \frac{1}{\sqrt{2\pi}} e^{-x^2/2} dx = 0 (by symmetry)

Expected Value of a Function

For any function gg:

E[g(X)]=xg(x)P(X=x)(discrete)\mathbb{E}[g(X)] = \sum_x g(x) P(X=x) \quad \text{(discrete)} E[g(X)]=g(x)fX(x)dx(continuous)\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx \quad \text{(continuous)}

This is called the Law of the Unconscious Statistician (LOTUS) - you don't need the distribution of g(X)g(X) to compute E[g(X)]\mathbb{E}[g(X)].

import numpy as np
from scipy.stats import norm, binom

# Discrete: expected value of a die
die_faces = np.arange(1, 7)
die_probs = np.ones(6) / 6
ev_die = np.sum(die_faces * die_probs)
print(f"E[X] for fair die: {ev_die:.4f}") # 3.5

# E[X^2] for a fair die (using LOTUS)
ev_x_squared = np.sum(die_faces**2 * die_probs)
print(f"E[X^2] for fair die: {ev_x_squared:.4f}") # 15.1667

# Continuous: verify E[X] = 0 for N(0,1) via integration
from scipy import integrate

integrand = lambda x: x * norm.pdf(x)
ev_normal, _ = integrate.quad(integrand, -np.inf, np.inf)
print(f"E[X] for N(0,1): {ev_normal:.8f}") # ~0

2. Linearity of Expectation

The most important property of expectation:

E[aX+bY]=aE[X]+bE[Y]\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]

This holds even when XX and YY are dependent - a remarkable fact.

More generally, for any constants a1,,ana_1, \ldots, a_n and random variables X1,,XnX_1, \ldots, X_n:

E[i=1naiXi]=i=1naiE[Xi]\mathbb{E}\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n a_i \mathbb{E}[X_i]

Why This Is Powerful

Example 1: Expected number of heads in nn flips

Instead of computing from the Binomial PMF, write Sn=X1+X2++XnS_n = X_1 + X_2 + \cdots + X_n where each XiBernoulli(p)X_i \sim \text{Bernoulli}(p). Then: E[Sn]=i=1nE[Xi]=np\mathbb{E}[S_n] = \sum_{i=1}^n \mathbb{E}[X_i] = n \cdot p

Example 2: Expected loss over a dataset

If the total loss is L=1Ni=1N(y^i,yi)\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \ell(\hat{y}_i, y_i), then: E[L]=1Ni=1NE[(y^i,yi)]\mathbb{E}[\mathcal{L}] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}[\ell(\hat{y}_i, y_i)]

By the Law of Large Numbers (Lesson 07), as NN \to \infty, the empirical average converges to this expectation.

:::tip Linearity of Expectation in SGD The stochastic gradient gB=1BiBθig_B = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla_\theta \ell_i is an unbiased estimator of the full gradient:

E[gB]=E[1BiBθi]=1BiBE[θi]=θL\mathbb{E}[g_B] = \mathbb{E}\left[\frac{1}{B}\sum_{i \in \mathcal{B}} \nabla_\theta \ell_i\right] = \frac{1}{B} \sum_{i \in \mathcal{B}} \mathbb{E}[\nabla_\theta \ell_i] = \nabla_\theta \mathcal{L}

This unbiasedness is guaranteed by linearity of expectation. It is why SGD converges to a minimum of the true loss function (under appropriate learning rate schedules), not just a minimum of the training batch. :::

3. Variance

Variance measures the spread of a distribution around its mean:

Var(X)=E[(XE[X])2]\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]

Computational Formula

A simpler formula for computing variance:

Var(X)=E[X2](E[X])2\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

Proof: E[(Xμ)2]=E[X22μX+μ2]=E[X2]2μ2+μ2=E[X2]μ2\mathbb{E}[(X-\mu)^2] = \mathbb{E}[X^2 - 2\mu X + \mu^2] = \mathbb{E}[X^2] - 2\mu^2 + \mu^2 = \mathbb{E}[X^2] - \mu^2

Standard Deviation

σX=Var(X)\sigma_X = \sqrt{\text{Var}(X)}

Standard deviation has the same units as XX, making it more interpretable than variance.

Properties of Variance

PropertyFormulaNotes
Constant shiftVar(X+c)=Var(X)\text{Var}(X + c) = \text{Var}(X)Shifting doesn't change spread
ScalingVar(aX)=a2Var(X)\text{Var}(aX) = a^2 \text{Var}(X)Scaling squares the variance
Sum of independentVar(X+Y)=Var(X)+Var(Y)\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)Only if XYX \perp Y
Non-negativeVar(X)0\text{Var}(X) \geq 0Zero iff XX is a constant
import numpy as np

np.random.seed(42)
n = 1_000_000

X = np.random.randn(n) # N(0,1)

# Variance properties
c = 5.0
a = 3.0

print(f"Var(X) = {X.var():.4f} (expected: 1.0)")
print(f"Var(X + c) = {(X + c).var():.4f} (expected: 1.0)")
print(f"Var(a*X) = {(a*X).var():.4f} (expected: {a**2:.1f})")
print(f"Std(X) = {X.std():.4f} (expected: 1.0)")

# Computational formula: Var(X) = E[X^2] - E[X]^2
ev_x2 = (X**2).mean()
ev_x = X.mean()
var_formula = ev_x2 - ev_x**2
print(f"\nE[X^2] - E[X]^2 = {var_formula:.6f} (direct: {X.var():.6f})")
# Variance of sum: independent vs dependent
Y = np.random.randn(n) # independent of X

Z_indep = X + Y
print(f"\nVar(X+Y) where X,Y independent:")
print(f" Direct: {Z_indep.var():.4f}, Formula: {X.var() + Y.var():.4f}")

# Make Y dependent on X
Y_dep = 0.8 * X + 0.6 * np.random.randn(n)
Z_dep = X + Y_dep
print(f"\nVar(X+Y) where Y = 0.8X + noise:")
print(f" Direct: {Z_dep.var():.4f}")
print(f" (higher than {X.var() + Y_dep.var():.4f} because of positive covariance)")

4. Covariance and Correlation

Covariance

Covariance measures the joint variability of two random variables:

Cov(X,Y)=E[(XE[X])(YE[Y])]\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]

Computational formula:

Cov(X,Y)=E[XY]E[X]E[Y]\text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]

Covariance ValueInterpretation
>0> 0XX and YY tend to increase together
=0= 0XX and YY are uncorrelated
<0< 0When XX increases, YY tends to decrease

Key identity: Var(X)=Cov(X,X)\text{Var}(X) = \text{Cov}(X, X)

Sum of correlated variables:

Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)

Correlation Coefficient

Normalized covariance that lives in [1,1][-1, 1]:

ρ(X,Y)=Cov(X,Y)Var(X)Var(Y)\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)} \cdot \sqrt{\text{Var}(Y)}}

  • ρ=1\rho = 1: perfect positive linear relationship
  • ρ=1\rho = -1: perfect negative linear relationship
  • ρ=0\rho = 0: uncorrelated (but may still be dependent!)

:::warning Uncorrelated Does Not Mean Independent ρ=0\rho = 0 means no linear relationship. Two random variables can be perfectly dependent but uncorrelated. Example: XN(0,1)X \sim \mathcal{N}(0,1), Y=X2Y = X^2. Then Cov(X,Y)=E[XX2]E[X]E[X2]=E[X3]0=0\text{Cov}(X, Y) = \mathbb{E}[X \cdot X^2] - \mathbb{E}[X]\mathbb{E}[X^2] = \mathbb{E}[X^3] - 0 = 0 (odd moments of symmetric distributions are zero). Yet YY is completely determined by XX.

In ML: features can be uncorrelated but still redundant or dependent in nonlinear ways. :::

import numpy as np

np.random.seed(42)
n = 100_000

# X and Y with controlled correlation
rho = 0.8
X = np.random.randn(n)
Y = rho * X + np.sqrt(1 - rho**2) * np.random.randn(n)

print(f"Target rho = {rho}")
print(f"Empirical Cov(X,Y) = {np.cov(X, Y)[0,1]:.4f}")
print(f"Empirical Corr(X,Y) = {np.corrcoef(X, Y)[0,1]:.4f}")

# Demonstrate: uncorrelated but dependent
X2 = np.random.randn(n)
Y2 = X2**2
print(f"\nX and Y=X^2:")
print(f"Correlation = {np.corrcoef(X2, Y2)[0,1]:.4f} (near 0 - uncorrelated)")
print(f"But Y is completely determined by X (dependent!)")

Covariance Matrix

For a random vector X=[X1,,Xd]T\mathbf{X} = [X_1, \ldots, X_d]^T, the covariance matrix is:

Σij=Cov(Xi,Xj)\Sigma_{ij} = \text{Cov}(X_i, X_j)

Σ=E[(Xμ)(Xμ)T]\boldsymbol{\Sigma} = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]

Properties:

  • Σ\boldsymbol{\Sigma} is symmetric: Σ=ΣT\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^T
  • Σ\boldsymbol{\Sigma} is positive semi-definite: vTΣv0\mathbf{v}^T \boldsymbol{\Sigma} \mathbf{v} \geq 0 for all v\mathbf{v}
  • Diagonal entries are variances: Σii=Var(Xi)\Sigma_{ii} = \text{Var}(X_i)
import numpy as np

# Compute empirical covariance matrix
np.random.seed(42)
n_samples, n_features = 1000, 4

# Create correlated features
A = np.array([[1.0, 0.5, 0.0, -0.3],
[0.5, 1.0, 0.2, 0.0],
[0.0, 0.2, 1.0, 0.4],
[-0.3, 0.0, 0.4, 1.0]])

# Generate data with this covariance structure
L = np.linalg.cholesky(A) # Cholesky decomposition
X = (L @ np.random.randn(n_features, n_samples)).T # (n_samples, n_features)

Sigma_empirical = np.cov(X.T)
print("Empirical covariance matrix:")
print(Sigma_empirical.round(3))
print(f"\nSymmetric: {np.allclose(Sigma_empirical, Sigma_empirical.T)}")
print(f"PSD (all eigenvalues >= 0): {np.all(np.linalg.eigvals(Sigma_empirical) >= 0)}")

5. Higher Moments

The kk-th moment of XX about the origin:

μk=E[Xk]\mu_k = \mathbb{E}[X^k]

The kk-th central moment (about the mean μ=E[X]\mu = \mathbb{E}[X]):

μ~k=E[(Xμ)k]\tilde{\mu}_k = \mathbb{E}[(X - \mu)^k]

MomentQuantityFormula
1st centralMeanμ=E[X]\mu = \mathbb{E}[X]
2nd centralVarianceσ2=E[(Xμ)2]\sigma^2 = \mathbb{E}[(X-\mu)^2]
3rd standardizedSkewnessγ1=E[(Xμσ)3]\gamma_1 = \mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]
4th standardizedKurtosisγ2=E[(Xμσ)4]\gamma_2 = \mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^4\right]

Skewness

Skewness(X)=E[(Xμ)3]σ3\text{Skewness}(X) = \frac{\mathbb{E}[(X - \mu)^3]}{\sigma^3}

  • Positive skewness: long right tail (e.g., income distributions, word frequencies)
  • Negative skewness: long left tail
  • Zero skewness: symmetric distribution (e.g., Gaussian)

Kurtosis

Kurtosis(X)=E[(Xμ)4]σ4\text{Kurtosis}(X) = \frac{\mathbb{E}[(X - \mu)^4]}{\sigma^4}

  • Gaussian: kurtosis = 3 (or 0 in excess kurtosis convention)
  • Higher kurtosis ("leptokurtic"): heavier tails, more extreme outliers
  • Lower kurtosis ("platykurtic"): lighter tails
Skewness vs Kurtosis:

Normal: Positive Skew: High Kurtosis:
┌───┐ ┌─┐ ┌─┐
│ │ │ \ ││ │
─┤ ├─ ─┤ \─ ──┤ ├──
└───┘ └───── └─┘
Symmetric Long right tail Heavy tails
from scipy.stats import norm, expon, t as t_dist
import numpy as np

np.random.seed(42)
n = 100_000

distributions = {
'Normal N(0,1)': np.random.randn(n),
'Exponential(1)': np.random.exponential(1, n),
't-distribution(df=3)': np.random.standard_t(df=3, size=n),
}

from scipy.stats import skew, kurtosis

print(f"{'Distribution':<25} {'Skewness':>10} {'Kurtosis':>10}")
print("-" * 47)
for name, samples in distributions.items():
sk = skew(samples)
ku = kurtosis(samples, fisher=False) # Fisher=False gives Pearson kurtosis
print(f"{name:<25} {sk:>10.4f} {ku:>10.4f}")

# Normal: skewness≈0, kurtosis≈3
# Exponential: skewness≈2, kurtosis≈9 (right-skewed, heavy tails)
# t(df=3): skewness≈0 (symmetric), kurtosis>>3 (very heavy tails)

6. Batch Normalization: Moments in Practice

Batch Normalization normalizes activations using the empirical mean and variance of each feature across the mini-batch:

x^i=xiμ^Bσ^B2+ϵ\hat{x}_i = \frac{x_i - \hat{\mu}_\mathcal{B}}{\sqrt{\hat{\sigma}^2_\mathcal{B} + \epsilon}}

where:

μ^B=1BiBxi,σ^B2=1BiB(xiμ^B)2\hat{\mu}_\mathcal{B} = \frac{1}{B} \sum_{i \in \mathcal{B}} x_i, \quad \hat{\sigma}^2_\mathcal{B} = \frac{1}{B} \sum_{i \in \mathcal{B}} (x_i - \hat{\mu}_\mathcal{B})^2

Then it applies learnable affine transform:

yi=γx^i+βy_i = \gamma \hat{x}_i + \beta

BatchNorm steps:

x_i (activations)

▼ Compute batch mean μ and variance σ²

▼ Normalize: x_hat = (x - μ) / sqrt(σ² + ε)

▼ Rescale: y = γ * x_hat + β (learned γ, β)

▼ y_i (normalized activations)

Why does it work? Deep networks suffer from internal covariate shift - the distribution of activations changes as weights update. BatchNorm forces each layer's input to have approximately zero mean and unit variance (before the γ,β\gamma, \beta transform), stabilizing training.

import numpy as np

def batch_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
"""
X: shape (batch_size, features)
Computes batch normalization along the batch dimension.
"""
mu = X.mean(axis=0) # mean over batch: shape (features,)
sigma2 = X.var(axis=0) # variance over batch
X_norm = (X - mu) / np.sqrt(sigma2 + eps) # normalize
return gamma * X_norm + beta # scale and shift

# Simulate a mini-batch of activations
np.random.seed(42)
batch = np.random.randn(32, 64) * 5 + 3 # Non-standard: mean=3, std=5

print(f"Before BatchNorm: mean={batch.mean():.2f}, std={batch.std():.2f}")
normed = batch_norm(batch, gamma=1.0, beta=0.0)
print(f"After BatchNorm: mean={normed.mean():.4f}, std={normed.std():.4f}")

# After BN with gamma=2, beta=0.5
normed_gs = batch_norm(batch, gamma=2.0, beta=0.5)
print(f"With gamma=2, beta=0.5: mean={normed_gs.mean():.4f}, std={normed_gs.std():.4f}")

7. Gradient Variance and Batch Size

Recall from SGD: the stochastic gradient is:

gB=1BiBθig_B = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_\theta \ell_i

By linearity of expectation: E[gB]=θL\mathbb{E}[g_B] = \nabla_\theta \mathcal{L} (unbiased).

What about variance? Assuming the per-sample gradients are independent:

Var(gB)=1B2iBVar(θi)=Var(θ)B\text{Var}(g_B) = \frac{1}{B^2} \sum_{i \in \mathcal{B}} \text{Var}(\nabla_\theta \ell_i) = \frac{\text{Var}(\nabla_\theta \ell)}{B}

Gradient variance scales as 1/B1/B.

This explains:

  • Larger batches → lower gradient variance → more stable, larger learning rates possible
  • Very small batches → high variance → acts as noise/regularization (can help escape local minima)
  • The "linear scaling rule": if you multiply batch size by kk, multiply learning rate by kk (to keep the effective step size consistent with the variance reduction)
import numpy as np

np.random.seed(42)
n_params = 1

# Simulate per-sample gradients (iid draws from N(1, 4))
true_gradient = 1.0
sample_std = 2.0
N = 10_000 # full dataset

per_sample_grads = true_gradient + sample_std * np.random.randn(N)

def estimate_gradient_variance(per_sample_grads, batch_size, n_estimates=5000):
"""Simulate gradient variance for given batch size."""
estimates = []
for _ in range(n_estimates):
batch_idx = np.random.choice(len(per_sample_grads), batch_size, replace=False)
batch_grad = per_sample_grads[batch_idx].mean()
estimates.append(batch_grad)
return np.var(estimates)

print(f"True gradient: {true_gradient}, Per-sample std: {sample_std}")
print(f"\nBatch size | Gradient Var | Expected Var (σ²/B)")
print("-" * 50)
for B in [1, 4, 16, 64, 256]:
empirical_var = estimate_gradient_variance(per_sample_grads, B, n_estimates=2000)
expected_var = sample_std**2 / B
print(f" B={B:<5} | {empirical_var:.5f} | {expected_var:.5f}")

8. ML Connection Summary

ML ConceptProbabilistic QuantityLesson
Training lossEx,y[(y^,y)]\mathbb{E}_{\mathbf{x},y}[\ell(\hat{y}, y)]Expectation
SGD gradientUnbiased estimate of E[]\mathbb{E}[\nabla \ell]Linearity of expectation
Gradient stabilityVar(gB)=Var(g1)/B\text{Var}(g_B) = \text{Var}(g_1)/BVariance
Batch NormalizationNormalize by μ^\hat{\mu} and σ^2\hat{\sigma}^2Mean and variance
Feature redundancyHigh Cov(Xi,Xj)\text{Cov}(X_i, X_j)Covariance
PCAEigendecomposition of covariance matrixCovariance matrix
Activation distributionSkewness, kurtosisHigher moments
Weight initializationVariance of weights → variance of activationsVariance propagation

9. Interview Q&A

Q1: What is the linearity of expectation and why does it matter for SGD?

A: Linearity of expectation states that E[aX+bY]=aE[X]+bE[Y]\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y] for any random variables X,YX, Y (even if they are dependent) and any constants a,ba, b. For SGD, the mini-batch gradient is gB=1BiBθig_B = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla_\theta \ell_i. By linearity: E[gB]=1BiE[θi]=θL\mathbb{E}[g_B] = \frac{1}{B}\sum_{i} \mathbb{E}[\nabla_\theta \ell_i] = \nabla_\theta \mathcal{L}. This means the mini-batch gradient is an unbiased estimator of the true gradient - in expectation, we move in exactly the right direction. Without this property, SGD would not converge to the correct optimum.

Q2: What is the relationship between variance and batch size? What does this imply for training?

A: The variance of the mini-batch gradient estimate is Var(gB)=Var(g1)/B\text{Var}(g_B) = \text{Var}(g_1) / B, where Var(g1)\text{Var}(g_1) is the per-sample gradient variance. Gradient variance decreases linearly with batch size. This implies: (1) small batches have high gradient variance - updates are noisy but this noise can help escape local minima; (2) large batches have low variance - updates are more accurate but the model may converge to sharper minima that generalize worse ("generalization gap"); (3) the linear scaling rule for learning rate: if batch size increases by kk, multiply learning rate by kk to maintain the same effective update magnitude relative to gradient variance; (4) gradient accumulation (simulating large batches by accumulating gradients over multiple small batches) reduces variance exactly as larger batches would.

Q3: What is the computational formula for variance and why is it preferred numerically?

A: The computational formula Var(X)=E[X2](E[X])2\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 allows you to compute mean and mean-of-squares in a single pass, then subtract. The definitional formula Var(X)=E[(Xμ)2]\text{Var}(X) = \mathbb{E}[(X-\mu)^2] requires two passes (first compute μ\mu, then compute squared deviations). However, the computational formula can have catastrophic cancellation issues: if E[X]\mathbb{E}[X] is very large and Var(X)\text{Var}(X) is small, subtracting two large numbers of similar magnitude causes loss of significant digits. In practice, Welford's online algorithm computes a numerically stable variance in a single pass by maintaining running mean and sum of squared deviations. NumPy's np.var() uses a two-pass algorithm to avoid this issue.

Q4: What is the covariance matrix and why is it important in ML?

A: The covariance matrix ΣRd×d\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d} has Σij=Cov(Xi,Xj)\Sigma_{ij} = \text{Cov}(X_i, X_j). It captures all pairwise linear relationships between features. It matters in ML for several reasons: (1) PCA finds the eigenvectors of the covariance matrix of the data - they are the principal directions of variance; (2) Gaussian distributions are fully characterized by their mean vector and covariance matrix - so the covariance matrix completely determines a multivariate Gaussian; (3) Linear regression - the covariance matrix of features appears in the normal equations (XTX)1XTy(X^TX)^{-1}X^Ty, and its condition number determines numerical stability; (4) Mahalanobis distance uses the inverse covariance matrix to measure distances in a feature space that accounts for correlations and different scales.

Q5: How does Batch Normalization use moments, and what problem does it solve?

A: Batch Normalization computes the empirical mean μ^B=1Bixi\hat{\mu}_\mathcal{B} = \frac{1}{B}\sum_i x_i and variance σ^B2=1Bi(xiμ^)2\hat{\sigma}^2_\mathcal{B} = \frac{1}{B}\sum_i (x_i - \hat{\mu})^2 over each mini-batch, then normalizes: x^i=(xiμ^)/σ^2+ϵ\hat{x}_i = (x_i - \hat{\mu})/\sqrt{\hat{\sigma}^2 + \epsilon}. Finally, it applies learnable scale γ\gamma and shift β\beta. The problem it solves is internal covariate shift: as weights update during training, the distribution of activations at each layer shifts, forcing subsequent layers to constantly re-adapt. By forcing each layer's inputs to have approximately zero mean and unit variance (before γ,β\gamma, \beta correction), BatchNorm stabilizes these distributions, allows higher learning rates, reduces sensitivity to weight initialization, and acts as mild regularization (the mean/variance statistics are computed from the batch, introducing noise). At inference time, running statistics (accumulated exponential moving averages of μ^\hat{\mu} and σ^2\hat{\sigma}^2 over training batches) are used instead of batch statistics.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Moments Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.