
Module 06: Bayesian Statistics

"In Bayesian statistics, probability is not a property of the world - it is a state of knowledge about the world."

  • E.T. Jaynes, Probability Theory: The Logic of Science

The Production Reality

You've deployed a recommendation model. It's performing well on average. But your team gets paged: a new user segment is getting terrible recommendations. Your model gives a single point estimate - it doesn't know what it doesn't know.

Or you're training a model on medical imaging data. You have 200 labelled examples. A model fitted to a single point estimate produces equally confident-looking predictions everywhere. A Bayesian model says: "I'm uncertain here - here are the cases where you should get more labels." That uncertainty signal is clinically critical.

Or you're tuning hyperparameters for a large language model. Random search spends GPU budget on regions that earlier trials already showed to be bad. Bayesian optimization maintains a probabilistic model of the objective landscape and queries points where expected improvement is highest. When each evaluation is expensive, it often finds good hyperparameters in far fewer evaluations, although the exact saving depends on the problem.

Bayesian statistics is the formal machinery for:

  • Quantifying uncertainty - models that know what they don't know
  • Incorporating prior knowledge - using expert knowledge to regularize under-constrained problems
  • Principled model comparison - not just "which model scored higher?" but "how much evidence favors one model over another?"
  • Sequential learning - updating beliefs as new data arrives, without storing all past data

This module gives you the full Bayesian toolkit, from philosophical foundations through practical algorithms that power modern ML.

Module Map

How Bayesian Thinking Changes ML Engineering

The Core Bayesian Equation

Everything in this module flows from one equation:

$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) \cdot P(\theta)}{P(\mathcal{D})}$$

| Term | Name | ML Interpretation |
| --- | --- | --- |
| $P(\theta \mid \mathcal{D})$ | Posterior | What we believe about parameters after seeing data |
| $P(\mathcal{D} \mid \theta)$ | Likelihood | How well parameters explain the data |
| $P(\theta)$ | Prior | What we believed before seeing data |
| $P(\mathcal{D})$ | Evidence / Marginal Likelihood | Normalizing constant; key for model comparison |
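To make the four terms concrete, here is a minimal sketch using the Beta-Bernoulli conjugate pair, where the posterior has a closed form. The click-through rate, sample size, and variable names are illustrative assumptions, not values from this module. The sketch also shows that updating one observation at a time gives the same posterior as updating on the whole batch.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.3  # hypothetical click-through rate, unknown in practice
data = (rng.random(500) < true_rate).astype(int)  # simulated 0/1 outcomes

# Prior: theta ~ Beta(a0, b0). Beta(1, 1) is uniform on [0, 1].
a0, b0 = 1.0, 1.0

# Batch update: the Bernoulli likelihood times a Beta prior is again a Beta,
# so the posterior is Beta(a0 + successes, b0 + failures).
a_batch = a0 + data.sum()
b_batch = b0 + len(data) - data.sum()

# Sequential update: yesterday's posterior becomes today's prior.
# Only two numbers are stored, never the raw history.
a_seq, b_seq = a0, b0
for x in data:
    a_seq += x
    b_seq += 1 - x

posterior_mean = a_batch / (a_batch + b_batch)
posterior_var = (a_batch * b_batch) / ((a_batch + b_batch) ** 2 * (a_batch + b_batch + 1))

print(f"batch posterior:      Beta({a_batch:.0f}, {b_batch:.0f})")
print(f"sequential posterior: Beta({a_seq:.0f}, {b_seq:.0f})")
print(f"posterior mean = {posterior_mean:.3f}, posterior std = {posterior_var ** 0.5:.3f}")
```

In this conjugate case the evidence $P(\mathcal{D})$ never has to be computed explicitly; it is absorbed into the normalizing constant of the Beta distribution.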

Priors as Regularization

The most immediately practical Bayesian insight for ML engineers: regularization IS a prior.

| Regularization Technique | Bayesian Equivalent |
| --- | --- |
| L2 regularization (Ridge) | Gaussian prior on weights: $\theta \sim \mathcal{N}(0, \sigma^2 I)$ |
| L1 regularization (Lasso) | Laplace prior on weights: $\theta \sim \text{Laplace}(0, b)$ |
| Dropout | Approximate Bayesian inference (Gal & Ghahramani, 2016) |
| Weight decay | MAP estimation with Gaussian prior |
| Early stopping | Implicit regularization, approximately equivalent to L2 for linear models with quadratic loss |

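When you set weight_decay=0.01 in an optimizer that implements weight decay as an L2 penalty (as PyTorch's Adam does) and your loss is a negative log-likelihood, the penalty is the negative log of a zero-mean Gaussian prior, so the minimizer is a MAP (Maximum A Posteriori) estimate. With decoupled weight decay (AdamW) the update is no longer exactly an L2 penalty, so the correspondence is only approximate. You are already doing Bayesian ML - you just may not have known it.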
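The L2 row of the table can be checked numerically. The sketch below assumes a linear model with known Gaussian noise and a zero-mean isotropic Gaussian prior; the data, `noise_std`, and `prior_std` are made-up illustrative values. Under those assumptions the ridge solution with penalty `lam = noise_std**2 / prior_std**2` coincides with the posterior mode (which for a Gaussian posterior is also the posterior mean).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])  # hypothetical ground-truth weights
noise_std = 0.5   # likelihood: y ~ N(X w, noise_std^2 I)
prior_std = 1.0   # prior:      w ~ N(0, prior_std^2 I)
y = X @ w_true + noise_std * rng.normal(size=n)

# Ridge regression: argmin_w ||y - X w||^2 + lam * ||w||^2
lam = noise_std**2 / prior_std**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Bayesian linear regression: the posterior over w is Gaussian with
#   cov  = (X^T X / noise_std^2 + I / prior_std^2)^-1
#   mean = cov @ X^T y / noise_std^2
post_cov = np.linalg.inv(X.T @ X / noise_std**2 + np.eye(d) / prior_std**2)
w_map = post_cov @ X.T @ y / noise_std**2

print("ridge weights:", np.round(w_ridge, 4))
print("MAP weights:  ", np.round(w_map, 4))
print("identical up to float error:", np.allclose(w_ridge, w_map))
```

The difference between the two views is what you keep afterwards: ridge returns only `w_ridge`, while the Bayesian treatment also gives you `post_cov`, the uncertainty around that estimate.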

Uncertainty Quantification in Production

| Uncertainty Type | What It Means | Bayesian Tool |
| --- | --- | --- |
| Aleatoric uncertainty | Irreducible noise in data | Predictive distribution variance |
| Epistemic uncertainty | Model lacks sufficient data | Posterior variance over parameters |
| Out-of-distribution detection | Input unlike training data | GP uncertainty, Bayesian NN predictive variance |

This distinction matters enormously in production. Aleatoric uncertainty cannot be reduced with more data. Epistemic uncertainty can - it tells you exactly where to collect more labels.
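The split between the two kinds of uncertainty is easiest to see in Bayesian linear regression with known noise, where the predictive variance at an input `x_new` is the sum of a noise term and a term from the parameter posterior. The following is a sketch under that simplifying assumption; `predictive_variance`, `noise_std`, and `prior_std` are hypothetical names and values chosen for illustration.

```python
import numpy as np

def predictive_variance(X_train, x_new, noise_std=0.5, prior_std=1.0):
    """Return (aleatoric, epistemic) variance for Bayesian linear regression.

    Assumes known Gaussian noise and a zero-mean isotropic Gaussian prior.
    In this model the posterior covariance depends only on the inputs.
    """
    d = X_train.shape[1]
    post_cov = np.linalg.inv(
        X_train.T @ X_train / noise_std**2 + np.eye(d) / prior_std**2
    )
    aleatoric = noise_std**2                      # irreducible observation noise
    epistemic = float(x_new @ post_cov @ x_new)   # uncertainty about the weights
    return aleatoric, epistemic

rng = np.random.default_rng(2)
x_new = np.array([1.0, 2.0])

results = []
for n in (10, 100, 1000):
    X_train = rng.normal(size=(n, 2))
    aleatoric, epistemic = predictive_variance(X_train, x_new)
    results.append((n, aleatoric, epistemic))
    print(f"n={n:5d}  aleatoric={aleatoric:.4f}  epistemic={epistemic:.5f}")
```

As the training set grows, the epistemic term shrinks toward zero while the aleatoric term stays fixed, which is the behaviour the paragraph above describes.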

Lesson-by-Lesson ML Connections

| Lesson | The ML Engineering Payoff |
| --- | --- |
| 01 Bayesian vs Frequentist | Why "probability 0.7 that this model is better" is meaningful in Bayesian terms but not frequentist |
| 02 Prior & Posterior | Conjugate priors for closed-form posteriors; L2/L1 as MAP with Gaussian/Laplace prior |
| 03 Bayesian Updating | Online learning; streaming data without history; Kalman filter for object tracking |
| 04 MCMC | Full posterior inference; PyMC for practical Bayesian modeling; NUTS sampler |
| 05 Variational Inference | VAEs; scalable approximate inference; reparameterization trick |
| 06 Gaussian Processes | Bayesian optimization for hyperparameter tuning; uncertainty-aware regression |
| 07 Hierarchical Models | Multi-task learning; partial pooling across sparse user groups |
| 08 Model Comparison | Principled model selection; Bayes factors vs AIC/BIC |

Prerequisites

Before starting this module, you should be comfortable with:

  • Module 03 - Probability Theory: Random variables, distributions, Bayes' theorem, conditional probability
  • Module 04 - Statistics for ML: MLE, MAP, likelihood functions, statistical inference
  • Module 05 - Information Theory: KL divergence (essential for variational inference)
  • Module 02 - Calculus: Gradients and optimization (for VI and MCMC proposals)

:::note Required Python Libraries

# Core numerical stack
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Probabilistic programming
import pymc as pm # pip install pymc
import arviz as az # pip install arviz (MCMC diagnostics)

# Gaussian processes
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern

# Deep learning (for VAEs, Bayesian NN)
import torch
import torch.nn as nn

Install: pip install numpy scipy matplotlib pymc arviz scikit-learn torch

:::

Learning Objectives

By the end of this module, you will be able to:

Conceptual Understanding

  • Articulate the philosophical difference between Bayesian and frequentist probability, and choose appropriately
  • Explain why L2 regularization is equivalent to MAP estimation with a Gaussian prior
  • Describe why Bayesian methods quantify epistemic uncertainty while point-estimate methods cannot
  • Explain the ELBO and why variational inference minimizes the reverse KL divergence, KL(q ‖ p), rather than the forward KL divergence, KL(p ‖ q)

Mathematical Skills

  • Derive the posterior for conjugate prior-likelihood pairs (Beta-Bernoulli, Gaussian-Gaussian)
  • Derive the ELBO as a lower bound on the log evidence, log P(D)
  • Write down the Metropolis-Hastings acceptance criterion and explain why it satisfies detailed balance
  • Compute GP posterior mean and variance given a kernel function and observed data

Engineering Skills

  • Implement Bayesian linear regression with conjugate priors in NumPy
  • Run MCMC with PyMC and diagnose convergence using R-hat and trace plots
  • Implement a variational autoencoder with the reparameterization trick in PyTorch
  • Use GP-based Bayesian optimization to tune hyperparameters

Interview Readiness

  • Explain the bias-variance tradeoff from a Bayesian perspective
  • Describe how dropout relates to Bayesian inference
  • Walk through the ELBO derivation from scratch
  • Compare MCMC vs variational inference: when would you use each?

The Central Tension: Exactness vs Scalability

A recurring theme in this module is the tension between exact and approximate Bayesian inference:

Exact Bayesian Inference
↓ (requires conjugate priors or analytical tractability)
Conjugate Models (Beta-Bernoulli, Gaussian-Gaussian, Dirichlet-Multinomial)
↓ (breaks down for complex likelihoods)
Approximate Inference needed
├── MCMC: Asymptotically exact, slow, does not scale to large datasets
└── Variational Inference: Fast, biased, scales to neural networks

Modern ML overwhelmingly uses VI (VAEs, Bayesian deep learning at scale)

Understanding this tension - and why deep learning practitioners choose VI over MCMC - is one of the most practically valuable insights in this module.
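To see the two ends of this spectrum side by side, the sketch below takes a problem where the exact posterior is known (Beta-Bernoulli) and approximates it with random-walk Metropolis, the symmetric-proposal special case of Metropolis-Hastings. The data counts, prior, step size, and chain length are arbitrary illustrative choices, and this is a teaching sketch rather than a tuned sampler.

```python
import numpy as np

rng = np.random.default_rng(3)
heads, tails = 14, 6   # hypothetical coin-flip data
a0, b0 = 2.0, 2.0      # Beta(2, 2) prior on theta

def log_unnorm_posterior(theta):
    """Log of likelihood times prior, up to an additive constant."""
    if not 0.0 < theta < 1.0:
        return -np.inf  # zero posterior mass outside (0, 1)
    return (a0 + heads - 1) * np.log(theta) + (b0 + tails - 1) * np.log(1 - theta)

# Exact answer via conjugacy: posterior is Beta(a0 + heads, b0 + tails).
exact_mean = (a0 + heads) / (a0 + b0 + heads + tails)  # = 16 / 24

# Random-walk Metropolis: propose a nearby point, accept it with
# probability min(1, posterior ratio). The evidence P(D) cancels in the ratio.
theta = 0.5
samples = []
for _ in range(20000):
    proposal = theta + 0.1 * rng.normal()
    log_accept = log_unnorm_posterior(proposal) - log_unnorm_posterior(theta)
    if np.log(rng.random()) < log_accept:
        theta = proposal
    samples.append(theta)

mcmc_mean = np.mean(samples[2000:])  # discard the first 2000 draws as burn-in

print(f"exact posterior mean: {exact_mean:.4f}")
print(f"MCMC posterior mean:  {mcmc_mean:.4f}")
```

The sampler needed 20,000 evaluations of the unnormalized posterior to approximate a quantity that conjugacy gives in one line. That cost is negligible here, but each evaluation touches the full dataset, which is the scaling problem the diagram above refers to.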

Bayesian Methods in the Modern ML Stack

| ML System | Where Bayesian Methods Appear |
| --- | --- |
| Hyperparameter tuning | Bayesian optimization (GP surrogates in BoTorch; TPE by default in Optuna; random-forest surrogates in SMAC) |
| Variational autoencoders | ELBO objective, reparameterization trick |
| Uncertainty estimation | Monte Carlo Dropout, Deep Ensembles, BNNs |
| Recommendation systems | Thompson sampling (Bayesian bandit algorithms) |
| Natural language processing | Bayesian latent-variable models (LDA topic modeling) |
| Drug discovery | Bayesian optimization over molecular property space |
| Robotics & control | Kalman filter, particle filters (sequential Bayesian inference) |
| Continual learning | Approximate Bayesian updating to mitigate catastrophic forgetting (EWC) |

These systems are built directly on the mathematical foundations in this module. Work through each lesson in order - they build on each other sequentially.
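As a taste of how little code one of these systems needs, here is a minimal sketch of Thompson sampling for a three-armed Bernoulli bandit, the idea behind the recommendation-systems row. The `true_rates` are invented for the simulation and would be unknown in a real system; the uniform Beta(1, 1) priors and the number of rounds are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
true_rates = np.array([0.05, 0.10, 0.20])  # hypothetical click rates per item
n_arms = len(true_rates)

# One Beta posterior per arm, starting from a uniform Beta(1, 1) prior.
successes = np.ones(n_arms)
failures = np.ones(n_arms)
pulls = np.zeros(n_arms, dtype=int)

for _ in range(5000):
    # Draw one plausible rate per arm from its current posterior ...
    sampled_rates = rng.beta(successes, failures)
    # ... and act greedily with respect to that draw.
    arm = int(np.argmax(sampled_rates))
    reward = int(rng.random() < true_rates[arm])
    # Conjugate Beta-Bernoulli update for the chosen arm only.
    successes[arm] += reward
    failures[arm] += 1 - reward
    pulls[arm] += 1

posterior_means = successes / (successes + failures)
print("pulls per arm:  ", pulls)
print("posterior means:", np.round(posterior_means, 3))
```

Exploration comes for free: an arm with little data has a wide posterior, so it occasionally produces a high sample and gets tried, while well-explored poor arms are chosen less and less often.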

Let's begin.

© 2026 EngineersOfAI. All rights reserved.