Conditional Probability and Bayes' Theorem
Reading time: ~45 minutes | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist
The ML Scenario That Motivates This Lesson
You're designing a spam filter. You have a dataset of emails, and you want to predict whether an email is spam given the words it contains.
Two strategies emerge immediately:
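Strategy A: Learn $P(\text{spam} \mid \text{words})$ directly. Train a model that takes words as input and outputs a spam probability. This is a discriminative approach.
Strategy B: Model how spam emails are generated - $P(\text{words} \mid \text{spam})$ and $P(\text{spam})$ - then use Bayes' theorem to flip it: $P(\text{spam} \mid \text{words}) \propto P(\text{words} \mid \text{spam}) \, P(\text{spam})$. This is a generative approach.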
Both strategies are valid, both are widely used, and understanding when to use which is a core ML engineering skill. This lesson gives you the theoretical foundation - Bayes' theorem - that connects these two perspectives.
1. Conditional Probability: Formal Definition
The conditional probability of $A$ given $B$ is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$
This tells us: given that $B$ has occurred, what is the probability that $A$ also occurred?
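As a quick sanity check of the definition, the sketch below enumerates two fair dice and computes a conditional probability by counting outcomes. The events chosen here (A = "the sum is 8", B = "the first die is even") are illustrative assumptions, not part of the lesson's running example.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under the uniform measure on omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B); requires P(B) > 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(lambda w: a(w) and b(w)) / p_b

# Illustrative events (assumptions for this example)
def is_sum_8(w):
    return w[0] + w[1] == 8

def first_even(w):
    return w[0] % 2 == 0

p_a = prob(is_sum_8)
p_a_given_b = cond_prob(is_sum_8, first_even)
print(f"P(sum = 8)              = {p_a}")
print(f"P(sum = 8 | first even) = {p_a_given_b}")
```

Conditioning shrinks the sample space from 36 outcomes to the 18 where the first die is even, and the probability is recomputed inside that smaller space.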
Intuition
Think of conditioning on $B$ as "zooming in" on the universe to only consider worlds where $B$ is true, then asking how much of that restricted universe is also in $A$:
```
Full probability space:        After conditioning on B:

┌───────────────────┐          ┌──────────────┐
│ Ω                 │          │ B            │
│  ┌────┐  ┌────┐   │          │  ┌──────┐    │
│  │ A  ├──┤ B  │   │  ─────►  │  │ A∩B  │    │
│  └────┘  └────┘   │          │  └──────┘    │
└───────────────────┘          └──────────────┘
P(A) = P(A)/1                  P(A|B) = P(A∩B)/P(B)
```
Properties of Conditional Probability
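For fixed $B$ with $P(B) > 0$, $P(\cdot \mid B)$ is itself a valid probability measure on $\Omega$:
- $P(A \mid B) \geq 0$ for all $A$
- $P(\Omega \mid B) = 1$
- For disjoint $A_1, A_2$: $P(A_1 \cup A_2 \mid B) = P(A_1 \mid B) + P(A_2 \mid B)$
This means all standard probability rules (complement, inclusion-exclusion, etc.) apply to conditional probabilities with $B$ held fixed.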
2. The Total Probability Theorem
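If $B_1, B_2, \ldots, B_n$ partition the sample space (mutually exclusive and exhaustive):
$$B_i \cap B_j = \emptyset \ \text{ for } i \neq j, \qquad \bigcup_{i=1}^{n} B_i = \Omega$$
Then for any event $A$:
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i) \, P(B_i)$$
This is the Law of Total Probability.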
```
Law of Total Probability:

      P(A|B₁)      P(A|B₂)      P(A|B₃)
         │            │            │
P(B₁)───►│   P(B₂)───►│   P(B₃)───►│
         │            │            │
         └────────────┴────────────┘
                      │
           P(A) = Σᵢ P(A|Bᵢ)P(Bᵢ)
```
Example: Medical Test
- 1% of the population has the disease ($P(D) = 0.01$)
- Test is 99% sensitive: $P(+ \mid D) = 0.99$
- Test is 95% specific: $P(- \mid \neg D) = 0.95$, so $P(+ \mid \neg D) = 0.05$
```python
# Total Probability Theorem
p_disease = 0.01
p_pos_given_disease = 0.99
p_pos_given_no_disease = 0.05

p_pos = (p_pos_given_disease * p_disease +
         p_pos_given_no_disease * (1 - p_disease))

print(f"P(positive test) = {p_pos:.4f}")
print(f"  from disease:    {p_pos_given_disease * p_disease:.4f}")
print(f"  from no disease: {p_pos_given_no_disease * (1 - p_disease):.4f}")
```
3. Bayes' Theorem
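Combining the definition of conditional probability with the multiplication rule $P(A \cap B) = P(B \mid A) \, P(A)$:
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
This allows us to flip the conditioning: if we know $P(B \mid A)$ but want $P(A \mid B)$, Bayes' theorem gives us the answer.
Expanded Form
Using the total probability theorem for $P(B)$:
$$P(A_i \mid B) = \frac{P(B \mid A_i) \, P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j) \, P(A_j)}$$
where $A_1, \ldots, A_n$ are a partition of the sample space.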
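The expanded form translates almost line for line into code. The following is a minimal sketch, assuming hypotheses are stored as dictionary keys; the function name `bayes_posterior` and the three-machine defect numbers are hypothetical and chosen only for illustration.

```python
def bayes_posterior(priors, likelihoods):
    """Posterior P(A_i | B) for each hypothesis A_i in a partition.

    priors:      dict mapping hypothesis -> P(A_i), summing to 1
    likelihoods: dict mapping hypothesis -> P(B | A_i)
    """
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(joint.values())  # P(B) via the law of total probability
    if evidence == 0:
        raise ValueError("observed event has zero probability under every hypothesis")
    return {h: joint[h] / evidence for h in joint}

# Hypothetical example: three machines produce 50%, 30%, 20% of parts
# with defect rates 1%, 2%, 5%. Which machine made a defective part?
priors = {"M1": 0.5, "M2": 0.3, "M3": 0.2}
defect_rates = {"M1": 0.01, "M2": 0.02, "M3": 0.05}

posterior = bayes_posterior(priors, defect_rates)
for machine, p in posterior.items():
    print(f"P({machine} | defective) = {p:.4f}")
```

Although M3 makes the fewest parts, its higher defect rate makes it the most probable source once a defect is observed.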
The Bayesian Vocabulary
| Term | Symbol | Meaning |
|---|---|---|
| Prior | $P(H)$ | Our belief about hypothesis $H$ before seeing data |
| Likelihood | $P(E \mid H)$ | How probable is data $E$ if $H$ is true? |
| Posterior | $P(H \mid E)$ | Updated belief after seeing $E$ |
| Evidence | $P(E)$ | Marginal probability of the observed data |
The evidence $P(E)$ does not depend on $H$, so it is often just a normalization constant:
$$P(H \mid E) \propto P(E \mid H) \, P(H)$$
"Posterior is proportional to likelihood times prior."
4. The Medical Test Example with Bayes
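Back to our test: given a positive result, what is the probability of disease? Writing $D$ for "has the disease" and $+$ for "tests positive":
$$P(D \mid +) = \frac{P(+ \mid D) \, P(D)}{P(+)} = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.05 \times 0.99} = \frac{0.0099}{0.0594} \approx 0.167$$
Even with a 99% sensitive test, a positive result only gives a 16.7% probability of disease! Overlooking this is the base rate fallacy - when the disease is rare, most positive results are false positives.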
```python
# Bayes' theorem: medical test
p_disease = 0.01
p_pos_given_disease = 0.99
p_pos_given_no_disease = 0.05

p_pos = (p_pos_given_disease * p_disease +
         p_pos_given_no_disease * (1 - p_disease))
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.4f}")

print(f"\nIntuition: Out of 10,000 people:")
n = 10000
sick = int(n * p_disease)
not_sick = n - sick
tp = int(sick * p_pos_given_disease)         # true positives
fp = int(not_sick * p_pos_given_no_disease)  # false positives
print(f"  {sick} have disease, {not_sick} don't")
print(f"  {tp} test positive AND have disease (TP)")
print(f"  {fp} test positive but don't have disease (FP)")
print(f"  P(disease | positive) ≈ {tp}/{tp+fp} = {tp/(tp+fp):.4f}")
```
:::warning The Base Rate Fallacy in ML
This phenomenon appears in ML whenever classes are imbalanced. A fraud detection model that predicts "fraud" might have 90% precision on a balanced test set but terrible precision in production where only 0.1% of transactions are fraud. Always evaluate models with the realistic class prior in mind. Precision on balanced data can be wildly optimistic for rare events.
:::
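To put a number on the fraud example, the sketch below recomputes precision under a different class prior. It assumes the model's true positive rate and false positive rate stay the same between the test set and production, which holds only if the class-conditional feature distributions do not shift. The operating point (TPR = 0.90, FPR = 0.10) and the helper name `precision_at_prior` are hypothetical.

```python
def precision_at_prior(tpr, fpr, prior):
    """Precision P(fraud | flagged) when the positive class has the given prior."""
    true_pos = tpr * prior
    false_pos = fpr * (1.0 - prior)
    return true_pos / (true_pos + false_pos)

# Assumed operating point: TPR = 0.90, FPR = 0.10, which gives
# 90% precision when classes are balanced (prior = 0.5).
tpr, fpr = 0.90, 0.10

balanced = precision_at_prior(tpr, fpr, prior=0.5)
production = precision_at_prior(tpr, fpr, prior=0.001)
print(f"Precision at 50% fraud rate:  {balanced:.4f}")
print(f"Precision at 0.1% fraud rate: {production:.4f}")
```

Under these assumptions, the same model that looks 90% precise on balanced data is right about fewer than 1 in 100 of its fraud flags in production.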
5. Bayesian Inference for Model Parameters
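In Bayesian ML, the parameters $\theta$ of a model are treated as random variables. Given a dataset $\mathcal{D}$, Bayes' theorem gives:
$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) \, P(\theta)}{P(\mathcal{D})}$$
| Approach | Objective | Bayesian Interpretation |
|---|---|---|
| MLE | $\hat{\theta} = \arg\max_\theta P(\mathcal{D} \mid \theta)$ | No prior (or uniform prior) |
| MAP | $\hat{\theta} = \arg\max_\theta P(\mathcal{D} \mid \theta) \, P(\theta)$ | Uses prior, ignores posterior uncertainty |
| Full Bayes | Compute $P(\theta \mid \mathcal{D})$ | Full posterior distribution over $\theta$ |
| Bayesian NN | Approximate $P(\theta \mid \mathcal{D})$ via MCMC or variational inference | Uncertainty-aware neural networks |
The MAP estimate with a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$ gives:
$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[ -\log P(\mathcal{D} \mid \theta) + \frac{1}{2\tau^2} \lVert \theta \rVert_2^2 \right]$$
This is exactly L2-regularized loss minimization, with regularization strength $1/(2\tau^2)$.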
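The step from a Gaussian prior to an L2 penalty can be spelled out as a short derivation. This is a sketch assuming an isotropic zero-mean prior $\theta \sim \mathcal{N}(0, \tau^2 I)$; the symbols $\tau$ and $\lambda$ are notation chosen here for illustration.

```latex
\begin{aligned}
\hat{\theta}_{\text{MAP}}
  &= \arg\max_{\theta} \; \bigl[ \log P(\mathcal{D} \mid \theta) + \log P(\theta) \bigr] \\
  &= \arg\max_{\theta} \; \Bigl[ \log P(\mathcal{D} \mid \theta)
       - \frac{1}{2\tau^{2}} \lVert \theta \rVert_2^{2} + \text{const} \Bigr] \\
  &= \arg\min_{\theta} \; \Bigl[ \underbrace{-\log P(\mathcal{D} \mid \theta)}_{\text{loss}}
       + \lambda \lVert \theta \rVert_2^{2} \Bigr],
  \qquad \lambda = \frac{1}{2\tau^{2}}
\end{aligned}
```

A tighter prior (smaller $\tau$) corresponds to a larger $\lambda$, that is, stronger regularization. The evidence $P(\mathcal{D})$ drops out in the first line because it does not depend on $\theta$.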
```python
import numpy as np

# Bayesian linear regression (1D) with conjugate normal prior
np.random.seed(42)
n_train = 50

# True model: y = 2x + noise
x_train = np.random.uniform(-3, 3, n_train)
y_train = 2.0 * x_train + np.random.randn(n_train)

# Prior: w ~ N(0, tau^2), tau = 2
# Likelihood: y_i | w ~ N(w*x_i, sigma^2), sigma = 1
tau2 = 4.0    # prior variance
sigma2 = 1.0  # noise variance

# Closed-form Bayesian update for 1D linear regression
# posterior: w | D ~ N(mu_post, sigma_post^2)
sigma_post_sq = 1.0 / (1.0/tau2 + np.sum(x_train**2)/sigma2)
mu_post = sigma_post_sq * (np.sum(x_train * y_train) / sigma2)

print(f"Prior:     w ~ N(0, {tau2})")
print(f"Posterior: w ~ N({mu_post:.4f}, {sigma_post_sq:.4f})")
print(f"  Posterior mean (MAP estimate): {mu_post:.4f}")
print(f"  Posterior std:                 {np.sqrt(sigma_post_sq):.4f}")
print(f"  True w = 2.0, MLE = {np.dot(x_train, y_train)/np.dot(x_train, x_train):.4f}")
```
6. Naive Bayes Classifier
Naive Bayes applies Bayes' theorem directly to classification:
$$P(y \mid x_1, \ldots, x_d) \propto P(y) \, P(x_1, \ldots, x_d \mid y)$$
The "naive" assumption: features are conditionally independent given the class:
$$P(x_1, \ldots, x_d \mid y) = \prod_{j=1}^{d} P(x_j \mid y)$$
This makes the model tractable: instead of estimating one $d$-dimensional distribution per class, we estimate $d$ one-dimensional distributions per class.
Training
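Estimate from data:
- $P(y = k) = N_k / N$ (class proportions, where $N_k$ of the $N$ training examples belong to class $k$)
- $P(x_j \mid y = k)$ - depends on feature type:
  - Discrete (word counts): Multinomial or Bernoulli Naive Bayes
  - Continuous: Gaussian Naive Bayes assumes $x_j \mid y = k \sim \mathcal{N}(\mu_{jk}, \sigma_{jk}^2)$
Prediction
$$\hat{y} = \arg\max_k \left[ \log P(y = k) + \sum_{j=1}^{d} \log P(x_j \mid y = k) \right]$$
We work with logs to avoid numerical underflow from multiplying many small probabilities.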
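The underflow problem is easy to reproduce. The sketch below uses made-up per-feature likelihoods (1,000 features at 0.01 for class A and 0.02 for class B, both assumptions for illustration) to show the raw product collapsing to zero while the sum of logs still separates the classes.

```python
import math

# Hypothetical setup: 1,000 features, each with likelihood 0.01 under
# class A and 0.02 under class B.
n_features = 1000
lik_a = [0.01] * n_features
lik_b = [0.02] * n_features

# The naive product underflows to exactly 0.0 for both classes,
# so the two classes can no longer be compared.
prod_a = math.prod(lik_a)
prod_b = math.prod(lik_b)
print(f"product A = {prod_a}, product B = {prod_b}")

# The sum of logs stays finite and still ranks the classes.
log_a = sum(math.log(p) for p in lik_a)
log_b = sum(math.log(p) for p in lik_b)
print(f"log-score A = {log_a:.1f}, log-score B = {log_b:.1f}")
print("winner:", "A" if log_a > log_b else "B")
```

Because the logarithm is monotonic, taking the argmax over log scores selects the same class as the argmax over the original products would, had they been representable.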
```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Gaussian Naive Bayes on Iris
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Gaussian Naive Bayes Accuracy: {acc:.4f}")

# Inspect the model parameters
print(f"\nClass priors: {gnb.class_prior_.round(4)}")
print(f"\nPer-class feature means (mu_jk):")
print(gnb.theta_.round(3))
print(f"\nPer-class feature variances (sigma_jk^2):")
print(gnb.var_.round(3))
```

```python
# Implement Naive Bayes from scratch to see the Bayes mechanics
class GaussianNaiveBayesScratch:
    """Gaussian Naive Bayes from scratch."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = {}
        self.means_ = {}
        self.vars_ = {}
        for k in self.classes_:
            X_k = X[y == k]
            self.priors_[k] = len(X_k) / len(X)
            self.means_[k] = X_k.mean(axis=0)
            self.vars_[k] = X_k.var(axis=0) + 1e-9  # add epsilon for stability

    def log_likelihood(self, x, k):
        """log P(x | y=k) = sum_j log N(x_j; mu_jk, sigma_jk^2)"""
        mu = self.means_[k]
        var = self.vars_[k]
        return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu)**2 / (2 * var))

    def predict(self, X):
        predictions = []
        for x in X:
            scores = {}
            for k in self.classes_:
                # log posterior = log prior + log likelihood (Bayes' theorem)
                scores[k] = np.log(self.priors_[k]) + self.log_likelihood(x, k)
            predictions.append(max(scores, key=scores.get))
        return np.array(predictions)

gnb_scratch = GaussianNaiveBayesScratch()
gnb_scratch.fit(X_train, y_train)
y_pred_scratch = gnb_scratch.predict(X_test)
print(f"\nFrom-scratch Naive Bayes Accuracy: {accuracy_score(y_test, y_pred_scratch):.4f}")
```
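The same mechanics carry over to the spam filter that opened this lesson, with word counts in place of Gaussian features. The following is a rough sketch of a multinomial Naive Bayes classifier using only the standard library. The four-message corpus is invented for illustration, and the Laplace smoothing parameter `alpha` is an added assumption (the lesson does not discuss smoothing) that keeps unseen word and class pairs from producing zero probabilities.

```python
import math
from collections import Counter

# Tiny hypothetical corpus (illustrative data, not from a real dataset)
train = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch with the team tomorrow", "ham"),
]

# Count words per class and documents per class.
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior_scores(text, alpha=1.0):
    """Unnormalized log P(y | words) = log P(y) + sum_j log P(word_j | y)."""
    scores = {}
    for label in word_counts:
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / len(train))  # log prior
        for w in text.split():
            if w not in vocab:
                continue  # ignore words never seen in training
            # Laplace-smoothed estimate of P(word | class) (assumption)
            p = (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
        scores[label] = score
    return scores

def posterior(text):
    """Normalize the log scores into P(y | words)."""
    scores = log_posterior_scores(text)
    m = max(scores.values())  # subtract the max for numerical stability
    exp = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

for msg in ["free money", "team meeting tomorrow"]:
    post = posterior(msg)
    print(f"{msg!r}: " + ", ".join(f"P({k})={v:.3f}" for k, v in post.items()))
```

The prior, likelihood, and normalization steps map directly onto the three terms of Bayes' theorem; a production filter would add proper tokenization and far more data.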
7. Generative vs Discriminative Models
Bayes' theorem illuminates a fundamental divide in ML model design:
```
Two paths to P(y | x):

GENERATIVE MODEL:                 DISCRIMINATIVE MODEL:
Learn P(x, y) or P(x|y)           Learn P(y|x) directly
and P(y), then apply Bayes

P(y|x) ∝ P(x|y) · P(y)            P(y|x) = f(x; θ)

Examples:                         Examples:
- Naive Bayes                     - Logistic Regression
- LDA (Linear Discriminant)       - Neural Networks
- HMMs, GMMs                      - SVMs, Decision Trees
- VAEs, Diffusion Models          - Transformers (as classifiers)
```
Trade-offs
| Aspect | Generative | Discriminative |
|---|---|---|
| Models | Full joint $P(x, y)$ | Only $P(y \mid x)$ |
| Sample generation | Can generate new $x$ | Cannot generate $x$ |
| Data efficiency | Better with less data | Needs more data |
| Accuracy (large data) | Often lower | Often higher |
| Handles missing features | Naturally | Difficult |
| Interpretability | Often more interpretable | Can be black box |
:::tip When to Use Each Approach
Use generative when you need to generate new samples (VAE for image synthesis, latent Dirichlet allocation for topic modeling), when you have limited data (generative models can incorporate priors), or when you need to handle missing features at test time.

Use discriminative when prediction accuracy is the primary goal and you have lots of labeled data. Most state-of-the-art classifiers are discriminative - directly optimizing the conditional distribution leads to better classifiers when data is plentiful.
:::
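The claim that generative models handle missing features naturally can be made concrete with Gaussian Naive Bayes. In the sketch below, a missing feature is marginalized out, which under the naive factorization simply means dropping its term from the sum. The two-class parameters and the convention of marking missing values with `None` are assumptions made for this illustration.

```python
import math

# Hypothetical fitted Gaussian Naive Bayes parameters for two classes
# and two features: (mean, variance) per feature.
params = {
    "A": {"prior": 0.5, "features": [(0.0, 1.0), (0.0, 1.0)]},
    "B": {"prior": 0.5, "features": [(2.0, 1.0), (0.0, 1.0)]},
}

def log_gaussian(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_proba(x):
    """Class posterior for x, where a missing feature is given as None.

    Under the naive factorization, marginalizing out a missing feature
    integrates its one-dimensional density to 1, so its term is skipped.
    """
    log_scores = {}
    for label, p in params.items():
        score = math.log(p["prior"])
        for value, (mean, var) in zip(x, p["features"]):
            if value is not None:
                score += log_gaussian(value, mean, var)
        log_scores[label] = score
    m = max(log_scores.values())  # subtract the max for numerical stability
    exp = {k: math.exp(v - m) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

print(predict_proba([1.8, 0.3]))   # both features observed
print(predict_proba([1.8, None]))  # second feature missing
print(predict_proba([None, 0.3]))  # only the uninformative feature observed
```

A discriminative model such as logistic regression has no equivalent shortcut: it needs a value for every input, so missing features have to be imputed before prediction.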
8. Bayesian Updating: Sequential Learning
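One of the most powerful aspects of Bayes' theorem is that it enables sequential updating: each new observation refines our beliefs. For observations $x_1, \ldots, x_n$ that are conditionally independent given $\theta$:
$$P(\theta \mid x_1, \ldots, x_n) \propto P(\theta) \prod_{i=1}^{n} P(x_i \mid \theta)$$
Or sequentially: the posterior after seeing $x_1, \ldots, x_{n-1}$ becomes the prior for incorporating $x_n$:
$$P(\theta \mid x_1, \ldots, x_n) \propto P(x_n \mid \theta) \, P(\theta \mid x_1, \ldots, x_{n-1})$$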
```python
import numpy as np

# Sequential Bayesian updating for coin flip probability
# Coin with unknown P(heads) = p
# Prior: p ~ Beta(1, 1) (uniform)
np.random.seed(42)
true_p = 0.65
n_flips = 50
flips = (np.random.rand(n_flips) < true_p).astype(int)  # 1=heads, 0=tails

# Sequential update
alpha, beta_param = 1.0, 1.0  # start with uniform prior
print(f"True p = {true_p}")
print(f"\nSequential Bayesian updates:")
print(f"{'Flip':>5} | {'Result':>6} | {'alpha':>6} | {'beta':>6} | {'Posterior Mean':>14}")
print("-" * 55)

# Iterate over ALL flips so the final posterior really reflects n_flips
# observations; only a subset of rows is printed to keep the table short.
for i, flip in enumerate(flips):
    # Update: observe one flip
    alpha += flip           # success
    beta_param += 1 - flip  # failure
    posterior_mean = alpha / (alpha + beta_param)
    if i < 10 or i % 5 == 4:
        print(f"{i+1:>5} | {'H' if flip else 'T':>6} | {alpha:>6.1f} | {beta_param:>6.1f} | {posterior_mean:>14.4f}")

print(f"\nFinal posterior after {n_flips} flips: Beta({alpha:.0f}, {beta_param:.0f})")
print(f"Posterior mean: {alpha/(alpha+beta_param):.4f} (true p = {true_p})")
```
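One consequence of sequential updating worth checking by hand: for conditionally i.i.d. data, updating one observation at a time, updating in a single batch, and updating in a different order all land on the same posterior. The short sketch below verifies this for the Beta-Bernoulli model, using a hypothetical list of ten flips.

```python
# Hypothetical sequence of coin flips (1 = heads, 0 = tails)
flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def update(alpha, beta, flip):
    """One conjugate Beta-Bernoulli update: heads -> alpha+1, tails -> beta+1."""
    return alpha + flip, beta + (1 - flip)

# Sequential: the posterior after each flip is the prior for the next.
a_seq, b_seq = 1, 1  # Beta(1, 1) uniform prior
for f in flips:
    a_seq, b_seq = update(a_seq, b_seq, f)

# Batch: add all counts to the prior at once.
heads = sum(flips)
a_batch, b_batch = 1 + heads, 1 + len(flips) - heads

# Reversed order: with conditionally i.i.d. data the order is irrelevant.
a_rev, b_rev = 1, 1
for f in reversed(flips):
    a_rev, b_rev = update(a_rev, b_rev, f)

print(f"sequential: Beta({a_seq}, {b_seq})")
print(f"batch:      Beta({a_batch}, {b_batch})")
print(f"reversed:   Beta({a_rev}, {b_rev})")
```

This is what makes Bayesian updating attractive for streaming data: only the current posterior parameters need to be stored, not the full history.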
9. Interview Q&A
Q1: Explain Bayes' theorem and give an ML example.
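A: Bayes' theorem states $P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$. It lets us "flip" the conditioning - given that we know $P(B \mid A)$, compute $P(A \mid B)$. In ML: suppose we have a spam classifier. We can model $P(\text{word} \mid \text{spam})$ (how common each word is in spam emails) and $P(\text{spam})$ (fraction of emails that are spam). Bayes' theorem, combined with the assumption that words are conditionally independent given the class, gives $P(\text{spam} \mid \text{words}) \propto P(\text{spam}) \prod_j P(\text{word}_j \mid \text{spam})$. This is the Naive Bayes classifier. The prior encodes our baseline expectation; the likelihood captures the evidence; the posterior is our updated belief after seeing the email.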
Q2: What is the "naive" assumption in Naive Bayes and when does it fail?
A: Naive Bayes assumes features are conditionally independent given the class label: $P(x_1, \ldots, x_d \mid y) = \prod_{j=1}^{d} P(x_j \mid y)$. This allows the joint $d$-dimensional distribution to factorize into $d$ one-dimensional distributions, making it tractable. The assumption fails whenever features are correlated given the class. In text classification: "New York" appearing as a pair has a different meaning than "New" and "York" independently. In image classification: adjacent pixels are correlated. Despite the assumption being wrong in practice, Naive Bayes often achieves surprisingly good classification accuracy because: (1) the argmax of the posterior is often correct even if the posterior probabilities are miscalibrated; (2) the decision boundary can still be accurate even with wrong assumptions; (3) with little data, the naive assumption acts as strong regularization.
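A small numeric sketch shows what goes wrong when the independence assumption fails, and why the argmax often survives anyway. It works in odds form and feeds the model the same feature several times, the extreme case of correlated features. The likelihood ratio of 3 and the helper name `posterior_spam` are hypothetical.

```python
def posterior_spam(likelihood_ratios, prior_spam=0.5):
    """P(spam | features) under the naive assumption.

    likelihood_ratios: list of P(x_j | spam) / P(x_j | ham), one per feature.
    """
    odds = prior_spam / (1.0 - prior_spam)
    for lr in likelihood_ratios:
        odds *= lr  # each feature is treated as independent evidence
    return odds / (1.0 + odds)

# One feature that is 3x more likely under spam (hypothetical number).
print(f"feature counted once:  {posterior_spam([3.0]):.3f}")

# The same feature duplicated: perfectly correlated, no new information,
# yet the naive model counts the evidence again each time.
print(f"feature counted twice: {posterior_spam([3.0, 3.0]):.3f}")
print(f"feature counted 5x:    {posterior_spam([3.0] * 5):.3f}")
```

The predicted class is "spam" in all three cases, so accuracy is unaffected, but the reported probability drifts toward 1 as the evidence is double-counted. That is the miscalibration referred to in point (1).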
Q3: What is the difference between MAP and full Bayesian inference?
A: MAP (Maximum A Posteriori) finds the single parameter value that maximizes the posterior: $\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid \mathcal{D})$. Full Bayesian inference computes the entire posterior distribution $P(\theta \mid \mathcal{D})$. The key differences are: (1) Uncertainty: MAP gives a point estimate with no uncertainty; full Bayes gives a distribution that quantifies uncertainty about parameters. (2) Predictions: MAP uses $P(y \mid x, \hat{\theta}_{\text{MAP}})$; full Bayes uses $P(y \mid x, \mathcal{D}) = \int P(y \mid x, \theta) \, P(\theta \mid \mathcal{D}) \, d\theta$ - averaging over all possible parameters weighted by their posterior. (3) Computational cost: MAP is optimization (fast); full Bayes requires computing an integral (hard - typically requires MCMC or variational methods). (4) Practical use: MAP is equivalent to regularized optimization (L2 = Gaussian prior). Full Bayes is used when uncertainty quantification is critical (medical, safety-critical applications).
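The difference in predictions is easiest to see in a model where the integral has a closed form. The sketch below uses the Beta-Bernoulli coin model with a uniform prior and a deliberately tiny, hypothetical dataset of three heads and no tails; the function names are illustrative.

```python
def map_estimate(alpha, beta):
    """Mode of Beta(alpha, beta); valid for alpha, beta >= 1 and alpha + beta > 2."""
    return (alpha - 1) / (alpha + beta - 2)

def posterior_predictive_heads(alpha, beta):
    """P(next flip = heads | data) = integral of p * Beta(p; alpha, beta) dp."""
    return alpha / (alpha + beta)

# Uniform Beta(1, 1) prior, then observe 3 heads and 0 tails (hypothetical data).
alpha, beta = 1 + 3, 1 + 0

p_map = map_estimate(alpha, beta)
p_bayes = posterior_predictive_heads(alpha, beta)
print(f"MAP plug-in prediction:       P(heads) = {p_map:.2f}")
print(f"Posterior predictive (Bayes): P(heads) = {p_bayes:.2f}")
```

With a uniform prior the MAP estimate coincides with the MLE and declares tails impossible after three flips, while averaging over the posterior keeps some probability on tails. As data accumulates the posterior concentrates and the two predictions converge.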
Q4: How do generative and discriminative models differ, and when should you use each?
A: Generative models learn the joint distribution $P(x, y)$ and use Bayes' theorem to compute $P(y \mid x)$. Discriminative models directly learn $P(y \mid x)$ without modeling $P(x)$. Key trade-offs: generative models can synthesize new data (sample from $P(x \mid y)$), handle missing features naturally, incorporate prior knowledge, and work with less labeled data. Discriminative models (logistic regression, neural networks) typically achieve better classification accuracy with large labeled datasets because they optimize the quantity of interest ($P(y \mid x)$) directly rather than modeling the entire joint. Use generative when: generating new data (VAE, GAN, diffusion), limited labeled data (semi-supervised), or you need a probabilistic model of the data. Use discriminative when: maximizing classification accuracy on labeled data is the primary goal.
Q5: Why can a highly accurate test still have low positive predictive value?
A: This is the base rate fallacy (or false positive paradox). Positive predictive value (PPV) = $P(D \mid +) = \frac{\text{TP}}{\text{TP} + \text{FP}}$. When the disease prevalence is very low, even a highly accurate test will have many false positives relative to true positives. Example: test with 99% sensitivity and 99% specificity, disease with 0.1% prevalence. PPV = $\frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} = \frac{0.00099}{0.01098} \approx 0.09$. So 91% of positive tests are false positives! This directly impacts ML: a fraud detection model with 99% precision on a balanced test set may have terrible PPV in production where fraud rate is 0.1%. Always evaluate with realistic class priors, not balanced test sets.
