Module 14 - Bayesian ML
Why Probability Is Not Optional
Every ML model makes predictions. But not every model tells you how confident it is - or how far that confidence can be trusted. A gradient boosted tree says "fraud probability: 0.87." A neural network says "cat: 0.99." A regression model says "house price: $412,000." Each of these is a point estimate: a single number, with no indication of how much it would change under different training data or equally plausible model parameters.
In production, this is a problem. A self-driving car that reports 99% confidence is dangerous if that confidence is miscalibrated. A drug discovery model that ignores uncertainty can misallocate $10M in experiments. A sensor that gives a point estimate without error bounds is of little use to engineers who need to know whether a reading is trustworthy.
Bayesian ML is the principled answer. It replaces point estimates with probability distributions - not just "what is the most likely answer" but "what is the full range of plausible answers, and how confident am I?" This unlocks active learning (query the most uncertain points), out-of-distribution detection (high uncertainty signals novel inputs), and calibrated decision-making (act differently when uncertain).
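To make the idea of "query the most uncertain points" concrete, here is a minimal sketch using Bayesian linear regression with a Gaussian prior and known noise level. The feature map, the prior precision `alpha`, the noise precision `beta`, and the synthetic data are all illustrative assumptions, not part of the module's material. The point is only that the parameter posterior yields a per-input variance, and that this variance grows for inputs far from the training data.

```python
import numpy as np

def features(x):
    # Hypothetical feature map: a bias term plus the raw input.
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x], axis=1)

def fit_posterior(x, y, alpha=1.0, beta=25.0):
    # Prior: w ~ N(0, (1/alpha) I). Likelihood: y ~ N(w^T phi(x), 1/beta).
    # The posterior over w is Gaussian with the mean and covariance below.
    Phi = features(x)
    precision = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    cov = np.linalg.inv(precision)
    mean = beta * cov @ Phi.T @ y
    return mean, cov

def predict(x_new, mean, cov, beta=25.0):
    Phi = features(x_new)
    pred_mean = Phi @ mean
    # Variance due to parameter uncertainty: phi(x)^T cov phi(x) per input.
    epistemic_var = np.einsum("ij,jk,ik->i", Phi, cov, Phi)
    # Observation-noise variance is constant under this model.
    noise_var = 1.0 / beta
    return pred_mean, epistemic_var, noise_var

# Synthetic training data, all drawn from the interval [-1, 1].
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=30)
y_train = 0.5 + 2.0 * x_train + rng.normal(0.0, 0.2, size=30)
mean, cov = fit_posterior(x_train, y_train)

# Two candidates inside the training range and one far outside it.
candidates = np.array([0.0, 0.9, 5.0])
pred_mean, epistemic_var, noise_var = predict(candidates, mean, cov)

# An active learner would label the candidate with the largest
# parameter-driven variance; an OOD detector would flag it.
most_uncertain = candidates[np.argmax(epistemic_var)]
```

In this sketch the candidate at 5.0 lies well outside the training range, so it receives the largest parameter-driven variance and would be the one queried or flagged.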
This module covers the full Bayesian toolkit from first principles to production deployment. You will understand when Bayesian methods are worth their computational cost, and when simpler uncertainty heuristics suffice.
Lessons in This Module
| # | Lesson | Key Concepts |
|---|---|---|
| 01 | The Probabilistic Perspective on ML | Bayes' theorem, MLE vs MAP vs full Bayesian, conjugate priors, epistemic vs aleatoric uncertainty |
| 02 | Gaussian Processes | GP prior, kernel functions, posterior predictive, sparse GPs, kernel engineering |
| 03 | Bayesian Linear Regression | Conjugate Gaussian prior, closed-form posterior, ridge connection, predictive uncertainty, evidence maximisation |
| 04 | Bayesian Neural Networks | Variational inference, ELBO, mean-field approximation, MC Dropout, Laplace approximation |
| 05 | Variational Autoencoders | VAE as latent variable model, ELBO derivation, reparameterisation trick, posterior collapse |
| 06 | Uncertainty Quantification | Epistemic vs aleatoric decomposition, calibration, ECE, reliability diagrams, temperature scaling |
| 07 | Conformal Prediction | Distribution-free coverage, split conformal, Mondrian conformal, prediction sets |
| 08 | Bayesian Optimisation | Surrogate model, acquisition functions (EI, UCB, PI), hyperparameter tuning, BoTorch |
Core Bayesian Concepts
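Prior: What you believe about parameters before seeing data. Can encode domain knowledge or be uninformative.
Likelihood: How probable is the observed data given a particular parameter setting? Many standard ML loss functions are negative log-likelihoods: squared error corresponds to Gaussian noise, cross-entropy to a categorical likelihood.
Posterior: Updated beliefs after seeing data. Combines prior and likelihood via Bayes' theorem:
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$
Here $\theta$ denotes the parameters and $\mathcal{D}$ the observed data; $p(\theta)$ is the prior, $p(\mathcal{D} \mid \theta)$ the likelihood, and $p(\mathcal{D})$ the normalising evidence.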
Epistemic uncertainty: Uncertainty about model parameters - reducible with more data. High in low-data regions. The "I don't know" kind.
Aleatoric uncertainty: Irreducible noise inherent in the data - sensor noise, measurement error, fundamental stochasticity. Cannot be reduced by gathering more data.
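The definitions above can be illustrated with the simplest conjugate model: a Beta prior on the success probability of a Bernoulli outcome. The sketch below is a toy example under assumed values (a uniform Beta(1, 1) prior and made-up counts), and `beta_posterior` is a hypothetical helper name. It updates the prior with observed counts, then splits the predictive variance of the next outcome into an epistemic part (variance of the success probability under the posterior) and an aleatoric part (expected Bernoulli noise), using the law of total variance.

```python
def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    # Conjugate update: Beta(a, b) prior + Bernoulli data -> Beta posterior.
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

def decompose(mean, var):
    # The next outcome is Bernoulli(mean), so its total variance is
    # mean * (1 - mean). By the law of total variance:
    #   epistemic = Var[p]        (shrinks as data accumulates)
    #   aleatoric = E[p * (1 - p)] = total - epistemic
    total = mean * (1.0 - mean)
    epistemic = var
    aleatoric = total - epistemic
    return epistemic, aleatoric

# Same observed success rate (70%) at two very different sample sizes.
mean_small, var_small = beta_posterior(successes=7, failures=3)
mean_large, var_large = beta_posterior(successes=700, failures=300)

epistemic_small, aleatoric_small = decompose(mean_small, var_small)
epistemic_large, aleatoric_large = decompose(mean_large, var_large)
```

With 100 times more data the epistemic term shrinks by roughly two orders of magnitude, while the aleatoric term stays close to 0.21 in both cases: more observations sharpen the estimate of the success probability but do not make individual outcomes less random.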
When Is Bayesian ML Worth the Cost?
Bayesian methods are often computationally expensive. Exact posterior inference is intractable for most non-conjugate models, so in practice you pay for an approximation. Here is a practical decision guide:
| Scenario | Use Bayesian? | Why |
|---|---|---|
| Safety-critical decisions (medical, autonomous) | Yes | Calibrated uncertainty prevents overconfident errors |
| Small datasets (fewer than 1,000 samples) | Yes | Prior regularises; reduces overfitting |
| Active learning / sequential experiments | Yes | Uncertainty drives exploration |
| Hyperparameter optimisation | Yes | Bayesian optimisation typically needs far fewer evaluations than grid search when each evaluation is expensive |
| Large-scale classification (ImageNet) | No | Point estimates usually suffice; MC Dropout is a cheap approximation if uncertainty is needed |
| Online production inference (low latency) | Usually no | Posterior sampling is too slow; use post-hoc calibration |
| Anomaly / OOD detection | Partial | Uncertainty score as OOD signal; conformal for coverage guarantees |
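As an example of the cheaper, non-Bayesian end of the table, the following is a minimal sketch of split conformal prediction for regression. It wraps an ordinary point predictor with intervals that have a marginal coverage guarantee, provided the calibration and test points are exchangeable. The synthetic data, the straight-line model, and the helper name `conformal_radius` are illustrative assumptions; any fitted point predictor could be substituted.

```python
import math
import numpy as np

def conformal_radius(abs_residuals, alpha=0.1):
    # Finite-sample corrected quantile of the calibration residuals:
    # take the k-th smallest residual with k = ceil((n + 1) * (1 - alpha)).
    n = len(abs_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: interval is unbounded.
        return float("inf")
    return float(np.sort(abs_residuals)[k - 1])

# Synthetic data: a linear signal with Gaussian noise (assumed for the demo).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=3000)
y = 3.0 * x + rng.normal(0.0, 2.0, size=3000)

# Three disjoint splits: fit the model, calibrate the radius, evaluate.
x_fit, y_fit = x[:1000], y[:1000]
x_cal, y_cal = x[1000:2000], y[1000:2000]
x_test, y_test = x[2000:], y[2000:]

# Any point predictor works here; a least-squares line keeps the demo small.
slope, intercept = np.polyfit(x_fit, y_fit, deg=1)

def predict(inputs):
    return slope * inputs + intercept

# Calibrate on held-out residuals, then form symmetric intervals.
radius = conformal_radius(np.abs(y_cal - predict(x_cal)), alpha=0.1)
lower = predict(x_test) - radius
upper = predict(x_test) + radius

# Empirical coverage on the test split; the target is about 1 - alpha.
coverage = float(np.mean((y_test >= lower) & (y_test <= upper)))
```

The interval width here is the same for every input, which is the main limitation of this basic variant: it guarantees coverage on average, not per input. Unlike the Bayesian approaches in the table, it also says nothing about where the uncertainty comes from.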
