Module 06: Bayesian Statistics
"In Bayesian statistics, probability is not a property of the world - it is a state of knowledge about the world."
- E.T. Jaynes, Probability Theory: The Logic of Science
The Production Reality
You've deployed a recommendation model. It's performing well on average. But your team gets paged: a new user segment is getting terrible recommendations. Your model gives a single point estimate - it doesn't know what it doesn't know.
Or you're training a model on medical imaging data with only 200 labelled examples. A model fit by point estimation produces predictions with no indication of how reliable they are. A Bayesian model says: "I'm uncertain here - these are the cases where you should get more labels." That uncertainty signal is clinically critical.
Or you're tuning hyperparameters for a large language model. Random search wastes GPU budget on bad regions. Bayesian optimization maintains a probabilistic model of the objective landscape and queries the points where expected improvement is highest, so it often finds good hyperparameters in far fewer evaluations than random search.
Bayesian statistics is the formal machinery for:
- Quantifying uncertainty - models that know what they don't know
- Incorporating prior knowledge - using expert knowledge to regularize under-constrained problems
- Principled model comparison - not just "which model scored higher?" but "how much evidence favors one model over another?"
- Sequential learning - updating beliefs as new data arrives, without storing all past data
This module gives you the full Bayesian toolkit, from philosophical foundations through practical algorithms that power modern ML.
Module Map
How Bayesian Thinking Changes ML Engineering
The Core Bayesian Equation
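Everything in this module flows from one equation, Bayes' theorem, written here for parameters $\theta$ and data $D$:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$

| Term | Name | ML Interpretation |
|---|---|---|
| $P(\theta \mid D)$ | Posterior | What we believe about parameters after seeing data |
| $P(D \mid \theta)$ | Likelihood | How well parameters explain the data |
| $P(\theta)$ | Prior | What we believed before seeing data |
| $P(D)$ | Evidence / Marginal Likelihood | Normalizing constant; key for model comparison |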
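To make the roles of the four terms in the table concrete, here is a minimal sketch that computes a posterior numerically on a grid for a coin-flip (Bernoulli) model and compares it with the known conjugate result. The data (7 successes in 10 trials), the Beta(2, 2) prior, and the helper name `grid_posterior` are illustrative assumptions, not something prescribed by this module.

```python
import math

def grid_posterior(successes, trials, alpha=2.0, beta=2.0, n_grid=1001):
    """Posterior over a Bernoulli parameter theta on an evenly spaced grid.

    Assumes a Beta(alpha, beta) prior (an illustrative choice).
    Returns (grid, posterior_weights, evidence_estimate).
    """
    # Midpoints of n_grid equal-width cells; avoids log(0) at theta = 0 or 1.
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]
    width = 1.0 / n_grid
    failures = trials - successes

    # Prior density: theta^(alpha-1) * (1-theta)^(beta-1) / B(alpha, beta)
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    prior = [
        math.exp(log_norm + (alpha - 1) * math.log(t) + (beta - 1) * math.log(1 - t))
        for t in grid
    ]
    # Likelihood of one specific sequence of outcomes with these counts.
    likelihood = [t ** successes * (1 - t) ** failures for t in grid]

    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    # Evidence: numerical integral of likelihood x prior over theta.
    evidence = sum(unnormalized) * width
    # Posterior probability mass per grid cell (sums to 1).
    posterior = [u * width / evidence for u in unnormalized]
    return grid, posterior, evidence

grid, posterior, evidence = grid_posterior(successes=7, trials=10)
grid_mean = sum(t * w for t, w in zip(grid, posterior))
exact_mean = (2.0 + 7) / (2.0 + 2.0 + 10)  # mean of the conjugate Beta(9, 5) posterior

print(f"grid posterior mean  = {grid_mean:.4f}")
print(f"exact posterior mean = {exact_mean:.4f}")
print(f"evidence estimate    = {evidence:.6f}")
```

The evidence appears only as the normalizer here, which is why it can be ignored when estimating parameters but not when comparing models.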
Priors as Regularization
The most immediately practical Bayesian insight for ML engineers: regularization IS a prior.
| Regularization Technique | Bayesian Equivalent |
|---|---|
| L2 regularization (Ridge) | Gaussian prior on weights: $w \sim \mathcal{N}(0, \tau^2 I)$ |
| L1 regularization (Lasso) | Laplace prior on weights: $w_i \sim \text{Laplace}(0, b)$ |
| Dropout | Approximate Bayesian inference (Gal & Ghahramani, 2016) |
| Weight decay | MAP estimation with Gaussian prior |
| Early stopping | Implicit regularization; approximately equivalent to L2 for linear models trained by gradient descent |
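When you set weight_decay=0.01 in an optimizer that implements it as an L2 penalty on the loss (plain SGD, or torch.optim.Adam), you are performing MAP (Maximum A Posteriori) estimation with a zero-mean Gaussian prior on the weights. Decoupled weight decay, as in AdamW, is closely related but does not correspond to exactly the same objective under adaptive optimizers. You are already doing Bayesian ML - you just may not have known it.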
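The L2 row of the table can be checked numerically. For linear regression with Gaussian noise of standard deviation sigma and a prior w ~ N(0, tau^2 I), the negative log posterior is (1 / (2 sigma^2)) ||y - Xw||^2 + (1 / (2 tau^2)) ||w||^2 plus a constant, so its minimizer is the ridge solution with penalty lambda = sigma^2 / tau^2. The sketch below is one way to verify this; the synthetic data, the values of `sigma` and `tau`, and the helper names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem (illustrative values only).
n, d = 50, 3
sigma, tau = 0.5, 2.0  # assumed noise std and prior std
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + sigma * rng.normal(size=n)

def ridge_closed_form(X, y, lam):
    """Minimizer of ||y - Xw||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def neg_log_posterior_grad(w, X, y, sigma, tau):
    """Gradient of -log p(w | X, y) for a Gaussian likelihood and N(0, tau^2 I) prior."""
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau**2

# MAP estimate by plain gradient descent on the negative log posterior.
w_map = np.zeros(d)
step = 1e-3
for _ in range(5000):
    w_map -= step * neg_log_posterior_grad(w_map, X, y, sigma, tau)

# Ridge estimate with the penalty implied by the prior.
w_ridge = ridge_closed_form(X, y, lam=sigma**2 / tau**2)

print("MAP (gradient descent):", np.round(w_map, 4))
print("Ridge (closed form):   ", np.round(w_ridge, 4))
```

Read the other way, choosing a regularization strength is choosing a prior width: a smaller `tau` (tighter prior) gives a larger penalty.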
Uncertainty Quantification in Production
| Uncertainty Type | What It Means | Bayesian Tool |
|---|---|---|
| Aleatoric uncertainty | Irreducible noise in data | Noise variance of the likelihood (observation model) |
| Epistemic uncertainty | Model lacks sufficient data | Posterior variance over parameters |
| Out-of-distribution detection | Input unlike training data | GP uncertainty, Bayesian NN predictive variance |
This distinction matters enormously in production. Aleatoric uncertainty cannot be reduced with more data. Epistemic uncertainty can - it tells you exactly where to collect more labels.
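The split has a closed form in Bayesian linear regression with known noise standard deviation sigma and prior w ~ N(0, tau^2 I): the predictive variance at a new input x is sigma^2 + x^T S x, where S is the posterior covariance of the weights. The first term is aleatoric and the second is epistemic. The following sketch, with assumed values for `sigma`, `tau`, and the query point, shows the epistemic term shrinking as data accumulates while the aleatoric term stays fixed. In this model the variance depends only on the inputs, so no targets are generated.

```python
import numpy as np

def posterior_covariance(X, sigma, tau):
    """Posterior covariance of the weights for Bayesian linear regression
    with known noise std `sigma` and prior w ~ N(0, tau^2 I)."""
    d = X.shape[1]
    precision = X.T @ X / sigma**2 + np.eye(d) / tau**2
    return np.linalg.inv(precision)

def predictive_variance_parts(x_new, X, sigma, tau):
    """Split the predictive variance at x_new into (aleatoric, epistemic)."""
    S = posterior_covariance(X, sigma, tau)
    aleatoric = sigma**2                   # irreducible observation noise
    epistemic = float(x_new @ S @ x_new)   # uncertainty about the weights
    return aleatoric, epistemic

rng = np.random.default_rng(1)
sigma, tau = 0.3, 1.0                      # assumed noise std and prior std
x_new = np.array([1.0, 2.0])               # [bias feature, input value outside the data range]

# Draw all inputs once, then use growing prefixes as the dataset grows.
all_inputs = rng.uniform(-1.0, 1.0, size=500)
results = {}
for n in (5, 50, 500):
    X = np.column_stack([np.ones(n), all_inputs[:n]])
    results[n] = predictive_variance_parts(x_new, X, sigma, tau)
    aleatoric, epistemic = results[n]
    print(f"n={n:4d}  aleatoric={aleatoric:.4f}  epistemic={epistemic:.4f}")
```

An active-learning loop would rank candidate inputs by the epistemic term and request labels where it is largest.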
Lesson-by-Lesson ML Connections
| Lesson | The ML Engineering Payoff |
|---|---|
| 01 Bayesian vs Frequentist | Why "probability 0.7 that this model is better" is meaningful in Bayesian terms but not frequentist |
| 02 Prior & Posterior | Conjugate priors for closed-form posteriors; L2/L1 as MAP with Gaussian/Laplace prior |
| 03 Bayesian Updating | Online learning; streaming data without history; Kalman filter for object tracking |
| 04 MCMC | Full posterior inference; PyMC for practical Bayesian modeling; NUTS sampler |
| 05 Variational Inference | VAEs; scalable approximate inference; reparameterization trick |
| 06 Gaussian Processes | Bayesian optimization for hyperparameter tuning; uncertainty-aware regression |
| 07 Hierarchical Models | Multi-task learning; partial pooling across sparse user groups |
| 08 Model Comparison | Principled model selection; Bayes factors vs AIC/BIC |
Prerequisites
Before starting this module, you should be comfortable with:
- Module 03 - Probability Theory: Random variables, distributions, Bayes' theorem, conditional probability
- Module 04 - Statistics for ML: MLE, MAP, likelihood functions, statistical inference
- Module 05 - Information Theory: KL divergence (essential for variational inference)
- Module 02 - Calculus: Gradients and optimization (for VI and MCMC proposals)
:::note Required Python Libraries
```python
# Core numerical stack
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Probabilistic programming
import pymc as pm  # pip install pymc
import arviz as az  # pip install arviz (MCMC diagnostics)

# Gaussian processes
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern

# Deep learning (for VAEs, Bayesian NN)
import torch
import torch.nn as nn
```
Install: pip install pymc arviz scikit-learn torch
:::
Learning Objectives
By the end of this module, you will be able to:
Conceptual Understanding
- Articulate the philosophical difference between Bayesian and frequentist probability, and choose appropriately
- Explain why L2 regularization is equivalent to MAP estimation with a Gaussian prior
- Describe why Bayesian methods quantify epistemic uncertainty while point-estimate methods cannot
- Explain the ELBO and why variational inference minimizes the reverse KL divergence, KL(q || p), rather than the forward KL
Mathematical Skills
- Derive the posterior for conjugate prior-likelihood pairs (Beta-Bernoulli, Gaussian-Gaussian)
- Derive the ELBO as a lower bound on the log evidence (via Jensen's inequality or the KL decomposition)
- Write down the Metropolis-Hastings acceptance criterion and explain why it satisfies detailed balance
- Compute GP posterior mean and variance given a kernel function and observed data
Engineering Skills
- Implement Bayesian linear regression with conjugate priors in NumPy
- Run MCMC with PyMC and diagnose convergence using R-hat and trace plots
- Implement a variational autoencoder with the reparameterization trick in PyTorch
- Use GP-based Bayesian optimization to tune hyperparameters
Interview Readiness
- Explain the bias-variance tradeoff from a Bayesian perspective
- Describe how dropout relates to Bayesian inference
- Walk through the ELBO derivation from scratch
- Compare MCMC vs variational inference: when would you use each?
The Central Tension: Exactness vs Scalability
A recurring theme in this module is the tension between exact and approximate Bayesian inference:
Exact Bayesian Inference
↓ (requires conjugate priors or analytical tractability)
Conjugate Models (Beta-Bernoulli, Gaussian-Gaussian, Dirichlet-Multinomial)
↓ (breaks down for complex likelihoods)
Approximate Inference needed
├── MCMC: Asymptotically exact but slow; scales poorly to large datasets
└── Variational Inference: Fast, biased, scales to neural networks
↓
Modern ML overwhelmingly uses VI (VAEs, Bayesian deep learning at scale)
Understanding this tension - and why deep learning practitioners choose VI over MCMC - is one of the most practically valuable insights in this module.
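The "asymptotically exact but slow" side of this trade-off can be seen on a problem where the exact answer is known. The sketch below runs a random-walk Metropolis sampler (the special case of Metropolis-Hastings with a symmetric proposal) on a Beta-Bernoulli posterior and compares the sample mean with the conjugate closed form. The data counts, prior, proposal width, and burn-in length are assumptions chosen for illustration, not tuned recommendations.

```python
import math
import random

def log_unnormalized_posterior(theta, successes, failures, alpha, beta):
    """log(likelihood x prior) for a Bernoulli model with a Beta(alpha, beta) prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero density outside the unit interval
    return ((successes + alpha - 1) * math.log(theta)
            + (failures + beta - 1) * math.log(1 - theta))

def metropolis(log_target, n_samples, start=0.5, proposal_std=0.1, seed=0):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, proposal_std)
        proposal_lp = log_target(proposal)
        # Symmetric proposal: accept with probability min(1, p(proposal) / p(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

# Illustrative data: 30 successes, 10 failures, Beta(2, 2) prior.
successes, failures, alpha, beta = 30, 10, 2.0, 2.0
samples = metropolis(
    lambda t: log_unnormalized_posterior(t, successes, failures, alpha, beta),
    n_samples=20000,
)
kept = samples[2000:]  # discard burn-in
mcmc_mean = sum(kept) / len(kept)

# Conjugacy gives the exact posterior Beta(alpha + successes, beta + failures).
exact_mean = (alpha + successes) / (alpha + beta + successes + failures)

print(f"MCMC posterior mean  = {mcmc_mean:.4f}")
print(f"exact posterior mean = {exact_mean:.4f}")
```

The sampler needs only the unnormalized posterior, so the evidence never has to be computed. The cost is thousands of target evaluations for a one-parameter model, which hints at why sampling becomes expensive for large models and datasets.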
Bayesian Methods in the Modern ML Stack
| ML System | Where Bayesian Methods Appear |
|---|---|
| Hyperparameter tuning | Bayesian optimization: GP-based (BoTorch), plus other surrogates such as TPE (Optuna's default sampler) and random forests (commonly used in SMAC) |
| Variational autoencoders | ELBO objective, reparameterization trick |
| Uncertainty estimation | Monte Carlo Dropout, Deep Ensembles, BNNs |
| Recommendation systems | Thompson sampling (Bayesian bandit algorithms) |
| Natural language processing | Bayesian topic models (LDA); Bayesian nonparametric extensions such as HDP |
| Drug discovery | Bayesian optimization over molecular property space |
| Robotics & control | Kalman filter, particle filters (sequential Bayesian inference) |
| Continual learning | Approximate Bayesian updating to mitigate catastrophic forgetting (EWC) |
These systems are built directly on the mathematical foundations in this module. Work through the lessons in order, since each builds on the previous one.
Let's begin.
