Module 3 - Probability Theory for ML Engineers
Why Probability Theory Is the Language of Machine Learning
Every time a neural network outputs a confidence score, every time you compute cross-entropy loss, every time a model says "I'm 87% confident this is a cat" - you are doing probability theory. ML is, at its core, an attempt to model uncertainty in data, and probability theory is the formal language for doing that.
Consider what happens when a model such as GPT-4 generates the next token. With typical decoding settings it doesn't pick a word deterministically - it samples from a probability distribution over its entire vocabulary (tens of thousands of tokens or more). The logits become a distribution via softmax, and generation is sampling from that distribution. Without understanding probability distributions, sampling, and expectations, you cannot reason about how language models actually work.
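To make the logits-to-token step concrete, here is a minimal NumPy sketch. The five-token vocabulary, the logit values, and the `temperature` parameter are made up for illustration; real models work the same way in principle, but over a far larger vocabulary and often with extra decoding tricks not shown here.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()              # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

rng = np.random.default_rng(0)

# Hypothetical vocabulary and logits, for illustration only.
vocab = ["cat", "dog", "car", "tree", "the"]
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5])

probs = softmax(logits)

# Stochastic decoding: draw one token according to the distribution.
next_token = rng.choice(vocab, p=probs)

# Deterministic (greedy) decoding: always take the most likely token.
greedy_token = vocab[int(np.argmax(probs))]

print(dict(zip(vocab, probs.round(3))), next_token, greedy_token)
```

Lower temperatures concentrate probability on the top token, so sampling behaves more like greedy decoding. Higher temperatures flatten the distribution.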
Or consider training. For classification with one-hot labels, minimizing cross-entropy loss is equivalent to maximizing the log-likelihood of your training data under your model. Much of the machinery of supervised learning can be stated in probabilistic language. L2 regularization corresponds to a Gaussian prior on the weights, and dropout and early stopping have approximate Bayesian interpretations. Batch normalization computes empirical mean and variance - moments of a distribution. Mini-batch gradient descent is stochastic optimization of an expected loss.
This module builds the probabilistic toolkit that every ML engineer needs. By the end, you will be able to:
- Reason formally about events, random variables, and distributions
- Derive and apply Bayes' theorem to generative vs discriminative models
- Understand why the Gaussian distribution appears everywhere in ML
- Compute expectations and variances - the quantities that define model behavior
- Understand generalization bounds from concentration inequalities
- Implement sampling algorithms that power Bayesian inference and data augmentation
Module Map
The module flows from foundations (axioms, random variables) through computational tools (expectation, common distributions) to advanced topics (Bayes, joint distributions, concentration inequalities, and sampling). Each lesson unlocks a new class of ML algorithms.
Lesson-to-Algorithm Map
| Lesson | Core Concept | ML Algorithms Unlocked |
|---|---|---|
| 01 · Probability Axioms | Sample spaces, events, independence | Any probabilistic classifier, A/B testing |
| 02 · Random Variables | PMF, PDF, CDF, transformations | Model outputs, softmax, sigmoid |
| 03 · Expectation & Variance | Moments, covariance, correlation | Loss minimization, BatchNorm, gradient variance |
| 04 · Common Distributions | Gaussian, Bernoulli, Dirichlet, etc. | Linear regression, logistic regression, LDA, VAE |
| 05 · Conditional Probability & Bayes | Prior / posterior / likelihood | Naive Bayes, Bayesian neural nets, GPT sampling |
| 06 · Joint & Marginal Distributions | Multivariate distributions, marginalization | Graphical models, VAEs, latent variable models |
| 07 · Concentration Inequalities | Hoeffding, CLT, LLN | Generalization bounds, PAC learning, mini-batch SGD |
| 08 · Sampling Methods | MCMC, importance sampling, rejection | Bayesian inference, data augmentation, Monte Carlo |
How Probability Underpins ML - A Conceptual Map
1. Supervised Learning as Probability
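Standard supervised learning finds the parameters that maximize the log-likelihood of the training labels:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta)$$

This is Maximum Likelihood Estimation (MLE). With a prior $p(\theta)$, it becomes Maximum A Posteriori (MAP):

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[ \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta) + \log p(\theta) \right]$$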
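Under MAP estimation, L2 regularization corresponds to a zero-mean Gaussian prior on the weights, and L1 regularization corresponds to a Laplace prior. Many common regularizers can be read as a negative log-prior in this way, although not every regularizer has such a clean Bayesian interpretation.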
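The link between L2 regularization and a Gaussian prior can be checked numerically. The sketch below assumes a linear model with Gaussian noise of standard deviation `sigma` and an independent zero-mean Gaussian prior of standard deviation `tau` on each weight. These names, the synthetic data, and the helper `neg_log_posterior_grad` are illustrative choices, not part of the module. Under those assumptions the MAP estimate should coincide with ridge regression at strength `lam = sigma**2 / tau**2`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
sigma, tau = 0.5, 2.0          # assumed noise std and prior std

# Synthetic regression data, for illustration only.
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + sigma * rng.normal(size=n)

# Ridge regression in closed form, with strength implied by the prior.
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def neg_log_posterior_grad(w):
    """Gradient of ||y - Xw||^2 / (2 sigma^2) + ||w||^2 / (2 tau^2)."""
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau**2

# If ridge is the MAP estimate, the gradient vanishes at w_ridge.
grad_at_ridge = neg_log_posterior_grad(w_ridge)
print(w_ridge, np.abs(grad_at_ridge).max())
```

The gradient of the negative log-posterior is zero (up to floating-point error) at the ridge solution, which is the sense in which the two objectives agree.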
2. Classification as Conditional Distribution Modeling
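A classifier learns a conditional distribution over labels given inputs:

$$p(y = k \mid x; \theta) = \frac{\exp(f_\theta(x)_k)}{\sum_{j} \exp(f_\theta(x)_j)}$$

where $f_\theta(x)$ is the vector of logits. Softmax turns logits into a proper probability distribution. Cross-entropy loss is the negative log-likelihood of this conditional distribution. Predicting a class means taking the argmax of this distribution (or, less commonly, sampling from it).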
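The claim that cross-entropy is a negative log-likelihood is easy to verify on a toy batch. The sketch below uses made-up logits and labels for a three-class problem and computes the loss two ways: as cross-entropy against one-hot targets, and as the average negative log-probability assigned to the true class.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, computed stably."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

# Made-up logits for 3 examples and 3 classes, with their true labels.
logits = np.array([[ 2.0, 0.5, -1.0],
                   [ 0.1, 1.2,  0.3],
                   [-0.5, 0.0,  2.5]])
labels = np.array([0, 1, 2])

log_probs = log_softmax(logits)

# View 1: cross-entropy between one-hot targets and predicted distribution.
one_hot = np.eye(logits.shape[1])[labels]
cross_entropy = -(one_hot * log_probs).sum(axis=1).mean()

# View 2: average negative log-likelihood of the true labels.
nll = -log_probs[np.arange(len(labels)), labels].mean()

print(cross_entropy, nll)
```

The two numbers match because a one-hot target zeroes out every term except the log-probability of the true class.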
3. Generative Models as Joint Distribution Modeling
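Generative models learn a distribution over the data itself:

$$p_\theta(x) \quad \text{or, with labels,} \quad p_\theta(x, y)$$

VAEs learn an encoder $q_\phi(z \mid x)$ and a decoder $p_\theta(x \mid z)$. The ELBO objective is a lower bound on the log-likelihood $\log p_\theta(x)$. Diffusion models define Markov chains with Gaussian transitions. GANs frame generation as a two-player minimax game between a generator and a discriminator.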
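For reference, the ELBO in its standard form is written out below. The symbols are the conventional ones and are assumed here rather than taken from this module: $q_\phi(z \mid x)$ for the encoder, $p_\theta(x \mid z)$ for the decoder, and $p(z)$ for the prior over latents.

$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
$$

The first term rewards accurate reconstruction. The second penalizes an encoder whose output drifts far from the prior.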
4. Uncertainty Quantification
Production ML systems need calibrated uncertainty. A model that says "90% confident" should be right 90% of the time. Calibration, Platt scaling, and Bayesian deep learning all require probability theory to formalize and measure uncertainty.
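One common way to measure calibration is a binned expected calibration error (ECE): group predictions by confidence, then compare average confidence with empirical accuracy in each group. The sketch below is a minimal version under simple assumptions (equal-width bins, simulated predictions). The function name and the data are illustrative, not a reference implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0 .. n_bins - 1 (1.0 lands in the last bin).
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of samples in bin
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=20_000)   # simulated confidence scores

# Calibrated model: correct with probability equal to its confidence.
calibrated_hits = rng.random(conf.size) < conf
# Overconfident model: true accuracy is 0.2 below the stated confidence.
overconfident_hits = rng.random(conf.size) < conf - 0.2

ece_calibrated = expected_calibration_error(conf, calibrated_hits)
ece_overconfident = expected_calibration_error(conf, overconfident_hits)
print(ece_calibrated, ece_overconfident)
```

The calibrated simulation gives an ECE near zero, while the overconfident one gives an ECE close to the 0.2 gap that was built in.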
5. Optimization Theory
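Why does SGD work? When mini-batches are sampled uniformly at random, linearity of expectation makes the mini-batch gradient an unbiased estimate of the full gradient. The Law of Large Numbers says that averaging over more examples brings that estimate closer to the true gradient. The Central Limit Theorem describes the approximate distribution of those estimates for large batches. Concentration inequalities bound how far an estimate can deviate from the truth.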
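A small simulation illustrates both properties of mini-batch gradients. It assumes a least-squares loss on synthetic data and batches drawn uniformly with replacement. The helper `minibatch_grads` and all sizes are arbitrary illustrative choices. Averaged over many batches, the mini-batch gradient should match the full gradient, and its variance should shrink roughly in proportion to 1 / batch size.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 4

# Synthetic regression data, for illustration only.
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)   # arbitrary point at which to evaluate gradients

# Per-example gradients of 0.5 * (x_i . w - y_i)^2 with respect to w.
per_example = (X @ w - y)[:, None] * X          # shape (n, d)
full_grad = per_example.mean(axis=0)

def minibatch_grads(batch_size, n_trials=5000):
    """Gradients from n_trials batches sampled uniformly with replacement."""
    idx = rng.integers(0, n, size=(n_trials, batch_size))
    return per_example[idx].mean(axis=1)        # shape (n_trials, d)

g8 = minibatch_grads(8)
g64 = minibatch_grads(64)

# Unbiasedness: the average mini-batch gradient approaches the full gradient.
bias_8 = np.abs(g8.mean(axis=0) - full_grad).max()

# Variance scaling: an 8x larger batch gives roughly 8x less variance.
var_ratio = g8.var(axis=0).sum() / g64.var(axis=0).sum()
print(bias_8, var_ratio)
```

With these settings `var_ratio` comes out close to 8, the ratio of the two batch sizes.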
Prerequisites
Before starting this module, you should be comfortable with:
| Topic | Where to Review |
|---|---|
| Set theory (union, intersection, complement) | Module 01 - Linear Algebra (sets briefly) |
| Real-valued functions | Module 02 - Calculus |
| Integration (single and multivariate) | Module 02 - Calculus |
| Summation notation | Module 01 |
| Python + NumPy basics | Module 00 - Python Foundations |
Learning Objectives
By the end of this module, you will be able to:
- Define probability spaces, random variables, and distributions formally
- Compute probabilities using axioms, conditional probability, and Bayes' theorem
- Derive expectations, variances, covariances, and moments from first principles
- Identify which probability distribution governs each type of ML output
- Apply Bayes' theorem to understand generative vs discriminative models
- Work with joint and marginal distributions over multiple random variables
- State and apply Markov, Chebyshev, and Hoeffding inequalities for ML bounds
- Implement inverse CDF, rejection, importance, and MCMC sampling in Python
Notation Reference
| Symbol | Meaning |
|---|---|
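| $\Omega$ | Sample space |
| $\mathcal{F}$ | Event space (sigma-algebra) |
| $P$ | Probability measure |
| $X, Y$ | Random variables |
| $p(x)$ | Probability mass / density function |
| $F(x)$ | Cumulative distribution function |
| $\mathbb{E}[X]$ | Expected value of $X$ |
| $\mathrm{Var}(X)$ | Variance of $X$ |
| $\mathrm{Cov}(X, Y)$ | Covariance of $X$ and $Y$ |
| $\mathcal{N}(\mu, \sigma^2)$ | Normal distribution with mean $\mu$, variance $\sigma^2$ |
| $P(A \mid B)$ | Conditional probability of $A$ given $B$ |
| $X \perp Y$ | Independence |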
A Note on Rigor vs Intuition
This module takes an engineering approach to probability theory. We will state formal definitions - the Kolmogorov axioms, measure-theoretic foundations - but will not dwell on the full machinery of measure theory. Our goal is deep intuition backed by enough rigor to reason correctly in ML contexts and pass technical interviews at top AI companies.
When you see a Gaussian prior in a Bayesian neural network, you should think: "I know that prior, I know its density function, I know what it implies about our beliefs about the weights." That level of fluency - not abstract theorem-proving - is the goal.
Let's begin.
