Module 06: Bayesian Statistics

"In Bayesian statistics, probability is not a property of the world - it is a state of knowledge about the world."

E.T. Jaynes, Probability Theory: The Logic of Science

The Production Reality

You've deployed a recommendation model. It's performing well on average. But your team gets paged: a new user segment is getting terrible recommendations. Your model gives a single point estimate - it doesn't know what it doesn't know.

Or you're training a model on medical imaging data. You have 200 labelled examples. A model fitted to a single point estimate confidently produces predictions. A Bayesian model says: "I'm uncertain here - here are the cases where you should get more labels." That uncertainty signal is clinically critical.

Or you're tuning hyperparameters for a large language model. Random search wastes GPU budget on bad regions. Bayesian optimization maintains a probabilistic model of the objective landscape and queries the points where expected improvement is highest. It often finds good hyperparameters in substantially fewer evaluations, sometimes by an order of magnitude.

Bayesian statistics is the formal machinery for:

Quantifying uncertainty - models that know what they don't know
Incorporating prior knowledge - using expert knowledge to regularize under-constrained problems
Principled model comparison - not just "which model scored higher?" but "how much evidence favors one model over another?"
Sequential learning - updating beliefs as new data arrives, without storing all past data

This module gives you the full Bayesian toolkit, from philosophical foundations through practical algorithms that power modern ML.

Module Map

How Bayesian Thinking Changes ML Engineering

The Core Bayesian Equation

Everything in this module flows from one equation:

$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) \cdot P(\theta)}{P(\mathcal{D})}$

Term	Name	ML Interpretation
$P(\theta \mid \mathcal{D})$	Posterior	What we believe about parameters after seeing data
$P(\mathcal{D} \mid \theta)$	Likelihood	How well parameters explain the data
$P(\theta)$	Prior	What we believed before seeing data
$P(\mathcal{D})$	Evidence / Marginal Likelihood	Normalizing constant; key for model comparison
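To see the four terms interact, here is a minimal sketch that applies the equation on a grid of candidate values for a coin's bias. The data (7 heads in 10 flips), the flat prior, and the grid size are illustrative assumptions, not values from this module.

```python
# Grid approximation of Bayes' rule for a coin's bias theta.
# Assumed toy data (not from the module): 7 heads in 10 flips, flat prior.
heads, flips = 7, 10

n_grid = 1001
thetas = [i / (n_grid - 1) for i in range(n_grid)]  # candidate parameter values in [0, 1]

prior = [1.0 for _ in thetas]  # P(theta): flat, unnormalized
likelihood = [t**heads * (1 - t) ** (flips - heads) for t in thetas]  # P(D | theta)

unnormalized = [lik * p for lik, p in zip(likelihood, prior)]  # numerator of Bayes' rule
evidence = sum(unnormalized)  # proportional to P(D); acts as the normalizing constant
posterior = [u / evidence for u in unnormalized]  # P(theta | D) as weights on the grid

posterior_mean = sum(t * w for t, w in zip(thetas, posterior))
print(f"posterior mean of theta: {posterior_mean:.3f}")
```

With a flat prior this matches the conjugate Beta(8, 4) posterior, whose mean is 8/12. The grid approach only works for one or two parameters, which is why later lessons turn to MCMC and variational inference.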

Priors as Regularization

The most immediately practical Bayesian insight for ML engineers: regularization IS a prior.

Regularization Technique	Bayesian Equivalent
L2 regularization (Ridge)	Gaussian prior on weights: $\theta \sim \mathcal{N}(0, \sigma^2 I)$
L1 regularization (Lasso)	Laplace prior on weights: $\theta \sim \text{Laplace}(0, b)$
Dropout	Approximate Bayesian inference (Gal & Ghahramani, 2016)
Weight decay	MAP estimation with Gaussian prior
Early stopping	Implicit regularization, approximately equivalent to L2 for linear models with quadratic loss

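When you set weight_decay=0.01 in a plain SGD optimizer, you are performing MAP (Maximum A Posteriori) estimation with a Gaussian prior. With adaptive optimizers such as Adam the correspondence is only approximate, and the decoupled weight decay in AdamW differs from an L2 penalty. You are already doing Bayesian ML - you just may not have known it.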
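The equivalence takes only a few lines to write out. This is a sketch assuming the zero-mean isotropic Gaussian prior from the table above; the symbol $\lambda$ for the penalty strength is notation introduced here.

$$
\begin{aligned}
\hat{\theta}_{\text{MAP}}
  &= \arg\max_{\theta} \; \log P(\mathcal{D} \mid \theta) + \log P(\theta) \\
  &= \arg\max_{\theta} \; \log P(\mathcal{D} \mid \theta) - \frac{1}{2\sigma^2} \lVert \theta \rVert_2^2 + \text{const} \\
  &= \arg\min_{\theta} \; \underbrace{-\log P(\mathcal{D} \mid \theta)}_{\text{training loss}} + \lambda \lVert \theta \rVert_2^2,
  \qquad \lambda = \frac{1}{2\sigma^2}
\end{aligned}
$$

A tighter prior (smaller $\sigma^2$) means a larger $\lambda$ and stronger shrinkage toward zero. Replacing the Gaussian with a Laplace prior turns the squared norm into $\lVert \theta \rVert_1$, which gives the L1 row of the table.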

Uncertainty Quantification in Production

Uncertainty Type	What It Means	Bayesian Tool
Aleatoric uncertainty	Irreducible noise in data	Noise term of the likelihood (the part of predictive variance that remains with unlimited data)
Epistemic uncertainty	Model lacks sufficient data	Posterior variance over parameters
Out-of-distribution detection	Input unlike training data	GP uncertainty, Bayesian NN predictive variance

This distinction matters enormously in production. Aleatoric uncertainty cannot be reduced with more data. Epistemic uncertainty can, and it indicates where collecting more labels will help most.
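The split can be illustrated with the simplest possible model. The sketch below assumes a Bernoulli outcome with a Beta posterior over its rate and uses the law of total variance to separate the two parts. The function name, the 30% success rate, and the sample sizes are hypothetical choices made for illustration.

```python
def beta_bernoulli_uncertainty(successes, failures, a0=1.0, b0=1.0):
    """Split the predictive variance of a 0/1 outcome under a Beta posterior.

    Law of total variance: Var[y] = E[theta * (1 - theta)] + Var[theta].
    The first term is aleatoric (noise), the second epistemic (parameter doubt).
    """
    a, b = a0 + successes, b0 + failures
    n = a + b
    mean = a / n
    epistemic = a * b / (n**2 * (n + 1))       # Var[theta]: shrinks as data grows
    aleatoric = mean * (1 - mean) - epistemic  # E[theta * (1 - theta)]: noise floor
    return aleatoric, epistemic


results = []
for n_obs in (10, 100, 1000, 10000):
    successes = 3 * n_obs // 10  # assumed 30% observed success rate
    aleatoric, epistemic = beta_bernoulli_uncertainty(successes, n_obs - successes)
    results.append((n_obs, aleatoric, epistemic))
    print(f"n={n_obs:>5}  aleatoric={aleatoric:.4f}  epistemic={epistemic:.6f}")
```

As the sample size grows, the epistemic term falls toward zero while the aleatoric term settles near 0.3 × 0.7 = 0.21, the noise that no amount of data removes.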

Lesson-by-Lesson ML Connections

Lesson	The ML Engineering Payoff
01 Bayesian vs Frequentist	Why "probability 0.7 that this model is better" is meaningful in Bayesian terms but not frequentist
02 Prior & Posterior	Conjugate priors for closed-form posteriors; L2/L1 as MAP with Gaussian/Laplace prior
03 Bayesian Updating	Online learning; streaming data without history; Kalman filter for object tracking
04 MCMC	Full posterior inference; PyMC for practical Bayesian modeling; NUTS sampler
05 Variational Inference	VAEs; scalable approximate inference; reparameterization trick
06 Gaussian Processes	Bayesian optimization for hyperparameter tuning; uncertainty-aware regression
07 Hierarchical Models	Multi-task learning; partial pooling across sparse user groups
08 Model Comparison	Principled model selection; Bayes factors vs AIC/BIC
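The "streaming data without history" idea from Lesson 03 can be sketched with a conjugate Beta-Bernoulli model, where the posterior after each observation becomes the prior for the next. The simulated click stream, its 30% rate, and the `update` helper are assumptions made for illustration.

```python
import random


def update(belief, observation):
    """One Bayesian update of a Beta(a, b) belief with a single 0/1 observation."""
    a, b = belief
    return (a + observation, b + (1 - observation))


random.seed(0)
# Simulated binary events (for example clicks) with an assumed 30% rate.
stream = [1 if random.random() < 0.3 else 0 for _ in range(500)]

belief = (1, 1)  # Beta(1, 1) prior
for y in stream:
    belief = update(belief, y)  # only two numbers are carried forward

# Batch posterior computed from the full history, for comparison.
batch = (1 + sum(stream), 1 + len(stream) - sum(stream))
print(belief, batch, belief == batch)
```

The streaming and batch posteriors are identical, so the two counts are a sufficient summary of everything seen so far. The Kalman filter applies the same pattern to Gaussian state estimates.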

Prerequisites

Before starting this module, you should be comfortable with:

Module 03 - Probability Theory: Random variables, distributions, Bayes theorem, conditional probability
Module 04 - Statistics for ML: MLE, MAP, likelihood functions, statistical inference
Module 05 - Information Theory: KL divergence (essential for variational inference)
Module 02 - Calculus: Gradients and optimization (for VI and MCMC proposals)

:::note Required Python Libraries

# Core numerical stack
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Probabilistic programming
import pymc as pm          # pip install pymc
import arviz as az         # pip install arviz (MCMC diagnostics)

# Gaussian processes
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern

# Deep learning (for VAEs, Bayesian NN)
import torch
import torch.nn as nn

Install: pip install numpy scipy matplotlib pymc arviz scikit-learn torch

:::

Learning Objectives

By the end of this module, you will be able to:

Conceptual Understanding

Articulate the philosophical difference between Bayesian and frequentist probability, and choose appropriately
Explain why L2 regularization is equivalent to MAP estimation with a Gaussian prior
Describe why Bayesian methods quantify epistemic uncertainty while point-estimate methods cannot
Explain the ELBO and why variational inference minimizes the reverse KL divergence, KL(q || p), rather than the forward KL divergence, KL(p || q)

Mathematical Skills

Derive the posterior for conjugate prior-likelihood pairs (Beta-Bernoulli, Gaussian-Gaussian)
Derive the ELBO as a lower bound on the log evidence using Jensen's inequality
Write down the Metropolis-Hastings acceptance criterion and explain why it satisfies detailed balance
Compute GP posterior mean and variance given a kernel function and observed data

Engineering Skills

Implement Bayesian linear regression with conjugate priors in NumPy
Run MCMC with PyMC and diagnose convergence using R-hat and trace plots
Implement a variational autoencoder with the reparameterization trick in PyTorch
Use GP-based Bayesian optimization to tune hyperparameters

Interview Readiness

Explain the bias-variance tradeoff from a Bayesian perspective
Describe how dropout relates to Bayesian inference
Walk through the ELBO derivation from scratch
Compare MCMC vs variational inference: when would you use each?

The Central Tension: Exactness vs Scalability

A recurring theme in this module is the tension between exact and approximate Bayesian inference:

Exact Bayesian Inference
    ↓ (requires conjugate priors or analytical tractability)
Conjugate Models (Beta-Bernoulli, Gaussian-Gaussian, Dirichlet-Multinomial)
    ↓ (breaks down for complex likelihoods)
Approximate Inference needed
    ├── MCMC: Asymptotically exact, slow, scales poorly to large datasets and models
    └── Variational Inference: Fast, biased, scales to neural networks
              ↓
    Modern ML overwhelmingly uses VI (VAEs, Bayesian deep learning at scale)

Understanding this tension - and why deep learning practitioners choose VI over MCMC - is one of the most practically valuable insights in this module.
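To make the exact versus approximate contrast concrete, the sketch below runs a random-walk Metropolis sampler on a problem where the exact conjugate answer is known, so the two can be compared. The toy data (7 heads, 3 tails), the flat prior, the step size, and the burn-in length are all illustrative assumptions. Lesson 04 covers far better samplers such as NUTS.

```python
import math
import random

heads, tails = 7, 3  # assumed toy data, flat prior on theta


def log_posterior(theta):
    """Unnormalized log posterior; the evidence P(D) is never needed."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(n_samples, step=0.2, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta = 0.5
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = log_posterior(proposal) - log_posterior(theta)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        samples.append(theta)
    return samples


samples = metropolis(20000)
kept = samples[2000:]  # discard burn-in
mcmc_mean = sum(kept) / len(kept)
exact_mean = (heads + 1) / (heads + tails + 2)  # mean of the Beta(8, 4) posterior
print(f"MCMC mean: {mcmc_mean:.3f}   exact mean: {exact_mean:.3f}")
```

Thousands of sequential likelihood evaluations are needed to match a quantity the conjugate model gives in closed form. Each evaluation touches the full dataset, which is the cost that pushes large models toward variational methods.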

Bayesian Methods in the Modern ML Stack

ML System	Where Bayesian Methods Appear
Hyperparameter tuning	Bayesian optimization (BoTorch with GP surrogates, Optuna's default TPE sampler, SMAC's random-forest surrogate)
Variational autoencoders	ELBO objective, reparameterization trick
Uncertainty estimation	Monte Carlo Dropout, Deep Ensembles, BNNs
Recommendation systems	Thompson sampling (Bayesian bandit algorithms)
Natural language processing	Bayesian topic models (LDA)
Drug discovery	Bayesian optimization over molecular property space
Robotics & control	Kalman filter, particle filters (sequential Bayesian inference)
Continual learning	Approximate Bayesian updating to mitigate catastrophic forgetting (EWC)
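As one concrete instance from the table, Thompson sampling for a Bernoulli bandit fits in a few lines. This is a minimal sketch, not a production recommender. The function name, the three click-through rates, and the number of rounds are hypothetical.

```python
import random


def thompson_sampling(true_rates, n_rounds=2000, seed=0):
    """Bernoulli bandit with an independent Beta(1, 1) prior on each arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    alpha = [1] * n_arms  # 1 + observed successes per arm
    beta = [1] * n_arms   # 1 + observed failures per arm
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible rate per arm from its posterior, then act greedily on the draws.
        draws = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=draws.__getitem__)
        reward = 1 if rng.random() < true_rates[arm] else 0  # simulated user feedback
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling([0.05, 0.10, 0.25])  # assumed click-through rates
print("pulls per arm:", pulls)
```

Exploration comes from posterior uncertainty alone. Arms with little data have wide posteriors and are occasionally sampled high, while well-measured poor arms are pulled less and less.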

These systems are built directly on the mathematical foundations in this module. Work through each lesson in order - they build on each other sequentially.

Let's begin.
