Module 04: Statistics for ML

"Without data, you're just another person with an opinion. But without statistics, data is just noise."

The Production Reality

You've trained a new model. Your offline metrics look great - accuracy is up 1.2%, NDCG improved. But your engineering manager asks: "Is that improvement real, or just lucky random variation?" You push to production. Two weeks later: "Did this model actually lift conversion, or did we just happen to run the experiment during the holiday season?"

Every ML engineer eventually faces these questions. The answers live in statistics.

Statistics is not a collection of formulas to memorize. It is the formal language for reasoning under uncertainty - and ML systems are uncertainty machines. You train on noisy data, you evaluate on finite test sets, you deploy into a world that shifts. Statistical thinking lets you separate signal from noise at every stage of the ML lifecycle.

This module bridges probability theory (Module 03) and the full machine learning curriculum. By the end, you will have the statistical toolkit to:

  • Design rigorous experiments that prove your model works
  • Calculate how many samples you need to detect a real improvement
  • Understand why your offline evaluation results often disagree with online A/B test results
  • Communicate uncertainty in model performance to non-technical stakeholders
  • Avoid the statistical traps that lead to shipping models that don't actually help users

Module Map

How Statistics Powers ML Engineering

1. Model Training

| Concept | Where it appears in ML |
| --- | --- |
| Maximum Likelihood Estimation (MLE) | Cross-entropy loss IS negative log-likelihood |
| Regularisation as MAP estimation | L2 penalty IS a Gaussian prior on weights |
| Bias-Variance tradeoff | Underfitting vs overfitting |
| Consistency of estimators | "More data always helps" - but how much? |
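
The first row of the table can be checked numerically. The sketch below uses made-up labels and predicted probabilities (they are illustrative assumptions, not course data) and computes binary cross-entropy in its usual ML form alongside the average negative log-likelihood of a Bernoulli model. The two function names are hypothetical helpers written for this example.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred):
    """Mean binary cross-entropy, in the form most ML libraries use."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

def bernoulli_neg_log_likelihood(y_true, p_pred):
    """Average negative log-likelihood of the labels under Bernoulli(p_pred)."""
    y_true = np.asarray(y_true)
    p_pred = np.asarray(p_pred, dtype=float)
    # Probability the model assigned to the label that was actually observed.
    likelihoods = np.where(y_true == 1, p_pred, 1 - p_pred)
    return -np.mean(np.log(likelihoods))

# Illustrative labels and predicted probabilities (assumed values).
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

bce = binary_cross_entropy(y, p)
nll = bernoulli_neg_log_likelihood(y, p)
print(f"cross-entropy = {bce:.6f}, Bernoulli NLL = {nll:.6f}")
```

The two numbers agree up to floating-point error, which is why minimising cross-entropy amounts to maximising the Bernoulli likelihood.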

2. Model Evaluation

| Concept | Where it appears in ML |
| --- | --- |
| Confidence intervals | "Our model achieves 87.3% accuracy ± 0.4%" |
| Hypothesis testing | "Is model A significantly better than model B?" |
| Bootstrap resampling | Robust metric estimation on small test sets |
| Multiple testing correction | Comparing dozens of hyperparameter configurations |
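
To make the confidence-interval and bootstrap rows concrete, here is a minimal sketch that puts an interval around test-set accuracy in two ways: a normal-approximation interval and a percentile bootstrap. The per-example correctness vector is simulated, and the names `normal_ci` and `bootstrap_ci` are hypothetical helpers for this illustration only.

```python
import numpy as np
from scipy.stats import norm

def normal_ci(correct, confidence=0.95):
    """Normal-approximation (Wald) interval for accuracy."""
    correct = np.asarray(correct, dtype=float)
    acc = correct.mean()
    z = norm.ppf(0.5 + confidence / 2)
    half_width = z * np.sqrt(acc * (1 - acc) / correct.size)
    return acc - half_width, acc + half_width

def bootstrap_ci(correct, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile bootstrap interval: resample examples with replacement."""
    correct = np.asarray(correct, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, correct.size, size=(n_resamples, correct.size))
    resampled_acc = correct[idx].mean(axis=1)
    tail = (1 - confidence) / 2
    lo, hi = np.quantile(resampled_acc, [tail, 1 - tail])
    return float(lo), float(hi)

# Simulated per-example correctness for a 1,000-example test set (assumed data).
rng = np.random.default_rng(42)
correct = rng.random(1000) < 0.87

accuracy = correct.mean()
wald_lo, wald_hi = normal_ci(correct)
boot_lo, boot_hi = bootstrap_ci(correct)
print(f"accuracy  = {accuracy:.3f}")
print(f"normal CI = ({wald_lo:.3f}, {wald_hi:.3f})")
print(f"bootstrap = ({boot_lo:.3f}, {boot_hi:.3f})")
```

For a simple metric like accuracy on a reasonably large test set, the two intervals should be close. The bootstrap becomes more useful for metrics such as F1 that have no convenient closed-form standard error.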

3. Experimentation & Deployment

| Concept | Where it appears in ML |
| --- | --- |
| A/B testing (ANOVA) | Controlled online experiments |
| Statistical power | "How long should we run the experiment?" |
| Causal inference | "Did the model cause the improvement or did confounders?" |
| Effect size (Cohen's d) | Minimum detectable effect for business KPIs |
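
The "how long should we run the experiment?" question usually reduces to a sample size calculation. The sketch below uses one common normal-approximation formula for comparing two conversion rates with a two-sided test. The baseline rate, target rate, and the function name `samples_per_arm` are assumptions chosen for illustration; other formulas (for example, with pooled variance) give slightly different numbers.

```python
import math
from scipy.stats import norm

def samples_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed in EACH arm to detect p_control -> p_treatment."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # quantile matching the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Assumed scenario: 10% baseline conversion, hoping to detect a lift to 11%.
n_one_point = samples_per_arm(0.10, 0.11)
# Halving the detectable lift roughly quadruples the required sample.
n_half_point = samples_per_arm(0.10, 0.105)
print(f"lift to 11.0%: {n_one_point} users per arm")
print(f"lift to 10.5%: {n_half_point} users per arm")
```

Dividing the per-arm sample size by expected daily traffic per arm gives a rough experiment duration, which is why the minimum detectable effect has to be agreed on before launch.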

Lesson-by-Lesson Real-World Use Cases

| Lesson | Real-World ML Use Case |
| --- | --- |
| 01 Estimation Theory | Training neural networks (cross-entropy = MLE); Bayesian regularisation |
| 02 Hypothesis Testing | Model comparison tests; feature importance validation; detecting data drift |
| 03 Confidence Intervals | Reporting model performance with uncertainty bounds |
| 04 Bootstrap & Resampling | Evaluating variance in F1-score; k-fold cross-validation as resampling |
| 05 Regression Analysis | Linear models as ML foundation; understanding logistic regression deeply |
| 06 ANOVA & Experimental Design | A/B testing new model variants; hyperparameter ablation studies |
| 07 Causal Inference | Why offline recommendation metrics lie; online A/B as ground truth |
| 08 Statistical Power | Deciding sample size before launching an experiment |

Prerequisites

Before starting this module, you should be comfortable with:

  • Module 01 - Linear Algebra: Matrix operations, eigendecomposition (for regression)
  • Module 02 - Calculus: Derivatives, optimization (for MLE derivations)
  • Module 03 - Probability Theory: Random variables, distributions, expectation, Bayes' theorem

You do not need to have taken a formal statistics course. This module is self-contained from a statistics perspective, building everything from probability foundations.

:::note Required Python Libraries

# All code in this module uses:
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats import t, norm, chi2, f

Install with: pip install numpy scipy statsmodels matplotlib

:::

Learning Objectives

By the end of this module, you will be able to:

Conceptual Understanding

  • Explain why cross-entropy loss is equivalent to Maximum Likelihood Estimation
  • Correctly interpret a p-value (and identify common misconceptions)
  • Explain what a 95% confidence interval means - and what it does NOT mean
  • Distinguish correlation from causation using the potential outcomes framework

Mathematical Skills

  • Derive the MLE estimator for Gaussian and Bernoulli distributions
  • Compute t-tests, chi-squared tests, and F-statistics by hand
  • Construct bootstrap confidence intervals from scratch
  • Calculate required sample size given power, effect size, and significance level

Engineering Skills

  • Write production-quality A/B test analysis code in Python
  • Choose the right statistical test for model comparison
  • Apply multiple testing corrections when comparing many model variants
  • Detect confounders in offline evaluation scenarios

Interview Readiness

  • Answer "What is a p-value?" without falling into the "probability that the null is true" trap
  • Explain the bias-variance tradeoff in terms of estimation theory
  • Design a sample size calculation for a new ML experiment

How Statistics Connects to the Rest of the Curriculum

Probability Theory (Module 03)
│
▼
Statistics for ML (Module 04) ◄──── This module
│
├──► Bayesian Statistics (Module 06)
│    └─ Priors, posteriors, MCMC
│
├──► Statistical Learning Theory (Module 07)
│    └─ PAC learning, VC dimension, generalisation bounds
│
└──► Information Theory (Module 05)
     └─ Entropy, KL divergence, cross-entropy

Statistics is the connective tissue of the math curriculum. MLE from this module explains why cross-entropy loss works. Confidence intervals connect to PAC learning bounds. Hypothesis testing IS what you're doing every time you compare models.

The Three Core Questions of ML Statistics

Every statistical concept in this module answers one of three fundamental ML questions:

Question 1: What should I estimate? Estimation Theory answers this - how to extract parameter values from data, and how to quantify uncertainty in those estimates.

Question 2: Is this result real or noise? Hypothesis Testing, Confidence Intervals, Bootstrap, and Power Analysis answer this - the formal machinery for distinguishing signal from sampling variation.

Question 3: Did my intervention cause the outcome? Causal Inference answers this - the hardest question in ML, and the one most engineers get wrong.
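
Question 2 can be made concrete with a paired permutation test, one of several ways to ask whether model A's accuracy gain over model B on the same test set could plausibly be noise. This is a minimal sketch on simulated per-example correctness. The data-generating rates and the function name `paired_permutation_test` are assumptions for illustration, not a method prescribed by this module.

```python
import numpy as np

def paired_permutation_test(correct_a, correct_b, n_permutations=2000, seed=0):
    """Two-sided sign-flip permutation test on per-example accuracy differences."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(correct_a, dtype=float) - np.asarray(correct_b, dtype=float)
    observed = diff.mean()
    # Under the null, A and B are interchangeable, so each difference
    # is equally likely to carry either sign.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diff.size))
    null_means = (signs * diff).mean(axis=1)
    # The +1 terms keep the estimated p-value strictly above zero.
    p_value = (np.sum(np.abs(null_means) >= abs(observed)) + 1) / (n_permutations + 1)
    return observed, p_value

# Simulated test set: B is ~80% accurate; A fixes some of B's errors
# and breaks a few of B's correct answers (assumed rates).
rng = np.random.default_rng(7)
n = 1000
correct_b = rng.random(n) < 0.80
fixed = rng.random(n) < 0.40     # A gets right what B got wrong
broken = rng.random(n) < 0.02    # A gets wrong what B got right
correct_a = np.where(correct_b, ~broken, fixed)

observed, p_value = paired_permutation_test(correct_a, correct_b)
print(f"accuracy gain = {observed:.3f}, p-value = {p_value:.4f}")
```

A small p-value here says only that the gain is unlikely under the "no difference" null on this test set. It does not answer Question 3, since it says nothing about whether the gain will hold up once the model is deployed.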

Work through each lesson in order. The lessons build on each other: you need hypothesis testing to understand ANOVA, you need ANOVA to understand A/B testing design, and you need all of it to understand why sample size calculation matters.

Let's begin.

© 2026 EngineersOfAI. All rights reserved.