Conditional Probability and Bayes' Theorem

Reading time: ~45 minutes | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist

The ML Scenario That Motivates This Lesson

You're designing a spam filter. You have a dataset of emails, and you want to predict whether an email is spam given the words it contains.

Two strategies emerge immediately:

Strategy A: Learn $P(\text{spam} \mid \text{words})$ directly. Train a model that takes words as input and outputs a spam probability. This is a discriminative approach.

Strategy B: Model how spam emails are generated - $P(\text{words} \mid \text{spam})$ and $P(\text{spam})$ - then use Bayes' theorem to flip it: $P(\text{spam} \mid \text{words}) \propto P(\text{words} \mid \text{spam}) \cdot P(\text{spam})$. This is a generative approach.

Both strategies are valid, both are widely used, and understanding when to use which is a core ML engineering skill. This lesson gives you the theoretical foundation - Bayes' theorem - that connects these two perspectives.

1. Conditional Probability: Formal Definition

The conditional probability of $A$ given $B$ is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

This tells us: given that $B$ has occurred, what is the probability that $A$ also occurred?

Intuition

Think of $P(A \mid B)$ as "zooming in" on the universe to only consider worlds where $B$ is true, then asking how much of that restricted universe is also $A$:

Full probability space:          After conditioning on B:

┌───────────────────┐            ┌──────────────┐
│ Ω                 │            │ B            │
│  ┌────┐  ┌────┐   │            │  ┌──────┐    │
│  │ A  ├──┤ B  │   │   ─────►   │  │ A∩B  │    │
│  └────┘  └────┘   │            │  └──────┘    │
└───────────────────┘            └──────────────┘

P(A) = P(A)/1                    P(A|B) = P(A∩B)/P(B)
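
A minimal sketch of the definition, assuming a hypothetical two-dice experiment that is not part of this lesson's running examples: $A$ is "the sum is 8" and $B$ is "the first die is even". Conditioning on $B$ amounts to counting only the outcomes inside $B$.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice (illustrative assumption).
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under the uniform measure on omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def is_a(w):
    return w[0] + w[1] == 8   # A: the sum is 8

def is_b(w):
    return w[0] % 2 == 0      # B: the first die is even

p_b = prob(is_b)
p_a_and_b = prob(lambda w: is_a(w) and is_b(w))
p_a_given_b = p_a_and_b / p_b   # definition: P(A | B) = P(A ∩ B) / P(B)

print(f"P(A) = {prob(is_a)}, P(B) = {p_b}, P(A ∩ B) = {p_a_and_b}")
print(f"P(A | B) = {p_a_given_b}")
```

Exact fractions are used here so the ratio can be read off without floating-point noise.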

Properties of Conditional Probability

For fixed $B$ with $P(B) > 0$, $P(\cdot \mid B)$ is itself a valid probability measure on $\Omega$:

  1. $P(A \mid B) \geq 0$ for all $A$
  2. $P(\Omega \mid B) = 1$
  3. For disjoint $A_1, A_2, \ldots$: $P(\bigcup_i A_i \mid B) = \sum_i P(A_i \mid B)$

This means all standard probability rules (complement, inclusion-exclusion, etc.) apply to conditional probabilities with $B$ held fixed.
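
As a quick sanity check of these properties, the sketch below uses a hypothetical single fair die with $B$ = "the roll is even". The events and the `cond_prob` helper are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical experiment: one fair six-sided die.
omega = set(range(1, 7))
B = {2, 4, 6}            # conditioning event: the roll is even

def cond_prob(A, B):
    """P(A | B) = |A ∩ B| / |B| for equally likely outcomes."""
    return Fraction(len(A & B), len(B))

A = {4, 5, 6}            # the roll is at least 4
A1, A2 = {1, 2}, {5, 6}  # two disjoint events

total = cond_prob(omega, B)   # normalization: should be 1
complement_gap = cond_prob(omega - A, B) - (1 - cond_prob(A, B))
additivity_gap = cond_prob(A1 | A2, B) - (cond_prob(A1, B) + cond_prob(A2, B))

print(total, complement_gap, additivity_gap)
```

Both gaps come out as zero: the complement rule and additivity hold inside the restricted universe $B$.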

2. The Total Probability Theorem

If $B_1, B_2, \ldots, B_n$ partition the sample space (mutually exclusive and exhaustive):

$$\sum_{i=1}^n P(B_i) = 1, \quad B_i \cap B_j = \emptyset \text{ for } i \neq j$$

Then for any event $A$:

$$P(A) = \sum_{i=1}^n P(A \mid B_i) P(B_i)$$

This is the Law of Total Probability.

Law of Total Probability:

         P(A|B₁)      P(A|B₂)      P(A|B₃)
            │            │            │
P(B₁) ────► │ P(B₂) ───► │ P(B₃) ───► │
            │            │            │
            └────────────┴────────────┘

P(A) = Σᵢ P(A|Bᵢ)P(Bᵢ)

Example: Medical Test

  • 1% of population has disease ($P(D) = 0.01$)
  • Test is 99% sensitive: $P(\text{pos} \mid D) = 0.99$
  • Test is 95% specific: $P(\text{neg} \mid D^c) = 0.95$, so $P(\text{pos} \mid D^c) = 0.05$

$$P(\text{pos}) = P(\text{pos} \mid D) P(D) + P(\text{pos} \mid D^c) P(D^c)$$

$$= 0.99 \times 0.01 + 0.05 \times 0.99 = 0.0099 + 0.0495 = 0.0594$$

# Total Probability Theorem
p_disease = 0.01
p_pos_given_disease = 0.99
p_pos_given_no_disease = 0.05

p_pos = (p_pos_given_disease * p_disease +
         p_pos_given_no_disease * (1 - p_disease))

print(f"P(positive test) = {p_pos:.4f}")
print(f" from disease: {p_pos_given_disease * p_disease:.4f}")
print(f" from no disease: {p_pos_given_no_disease * (1-p_disease):.4f}")

3. Bayes' Theorem

Combining the definition of conditional probability with the multiplication rule:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

This allows us to flip the conditioning: if we know $P(B \mid A)$ but want $P(A \mid B)$, Bayes' theorem gives us the answer.

Expanded Form

Using the total probability theorem for $P(B)$, with $A = A_i$ one element of a partition $A_1, \ldots, A_n$ of the sample space:

$$P(A_i \mid B) = \frac{P(B \mid A_i) \, P(A_i)}{\sum_{j=1}^n P(B \mid A_j) P(A_j)}$$

The Bayesian Vocabulary

$$\underbrace{P(A \mid B)}_{\text{posterior}} = \frac{\underbrace{P(B \mid A)}_{\text{likelihood}} \cdot \underbrace{P(A)}_{\text{prior}}}{\underbrace{P(B)}_{\text{evidence}}}$$

| Term | Symbol | Meaning |
|---|---|---|
| Prior | $P(A)$ | Our belief about $A$ before seeing data $B$ |
| Likelihood | $P(B \mid A)$ | How probable is data $B$ if $A$ is true? |
| Posterior | $P(A \mid B)$ | Updated belief after seeing $B$ |
| Evidence | $P(B)$ | Marginal probability of the observed data |
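
To make the four terms concrete, here is a small sketch over three competing hypotheses. The machines and their defect rates are hypothetical numbers chosen for illustration: $A_i$ is "the part came from machine $i$" and $B$ is "the part is defective".

```python
import numpy as np

# Hypothetical numbers: three machines and their defect rates.
prior = np.array([0.5, 0.3, 0.2])          # P(A_i): share of parts from each machine
likelihood = np.array([0.01, 0.02, 0.05])  # P(B | A_i): defect rate of each machine

unnormalized = likelihood * prior          # numerator of Bayes' theorem
evidence = unnormalized.sum()              # P(B) via the law of total probability
posterior = unnormalized / evidence        # P(A_i | B)

for i, (pr, po) in enumerate(zip(prior, posterior), start=1):
    print(f"machine {i}: prior = {pr:.2f}, posterior = {po:.3f}")
print(f"evidence P(B) = {evidence:.3f}")
```

Machine 3 produces the fewest parts but has the highest defect rate, so observing a defect moves most of the posterior mass onto it.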

The evidence $P(B)$ does not depend on $A$, so when comparing hypotheses it acts as a normalization constant:

$$P(A \mid B) \propto P(B \mid A) \cdot P(A)$$

"Posterior is proportional to likelihood times prior."

4. The Medical Test Example with Bayes

Back to our test: given a positive result, what is the probability of disease?

$$P(D \mid \text{pos}) = \frac{P(\text{pos} \mid D) \, P(D)}{P(\text{pos})}$$

$$= \frac{0.99 \times 0.01}{0.0594} = \frac{0.0099}{0.0594} \approx 0.167$$

Even with a 99% sensitive test, a positive result only gives a 16.7% probability of disease! This is the base rate fallacy - when the disease is rare, most positive results are false positives.

# Bayes' theorem: medical test
p_disease = 0.01
p_pos_given_disease = 0.99
p_pos_given_no_disease = 0.05

p_pos = (p_pos_given_disease * p_disease +
         p_pos_given_no_disease * (1 - p_disease))

p_disease_given_pos = (p_pos_given_disease * p_disease) / p_pos

print(f"P(disease | positive test) = {p_disease_given_pos:.4f}")
print("\nIntuition: Out of 10,000 people:")
n = 10000
# round() rather than int(): int() truncates, so a float product that lands
# just below a whole number would silently drop one person
sick = round(n * p_disease)
not_sick = n - sick
tp = round(sick * p_pos_given_disease)         # true positives
fp = round(not_sick * p_pos_given_no_disease)  # false positives
print(f" {sick} have disease, {not_sick} don't")
print(f" {tp} test positive AND have disease (TP)")
print(f" {fp} test positive but don't have disease (FP)")
print(f" P(disease | positive) ≈ {tp}/{tp+fp} = {tp/(tp+fp):.4f}")

:::warning The Base Rate Fallacy in ML
This phenomenon appears in ML whenever classes are imbalanced. A fraud detection model that predicts "fraud" might have 90% precision on a balanced test set but terrible precision in production where only 0.1% of transactions are fraud. Always evaluate models with the realistic class prior in mind. Precision on balanced data can be wildly optimistic for rare events.
:::
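
The effect of the class prior on precision can be sketched directly from Bayes' theorem. The classifier below is hypothetical: its true positive rate (0.90) and false positive rate (0.10) are assumed values, chosen so that precision is 90% on a balanced set.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision = P(fraud | flagged), from Bayes' theorem with the given class prior."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical classifier: catches 90% of fraud, flags 10% of legitimate transactions.
tpr, fpr = 0.90, 0.10

for prevalence in (0.5, 0.1, 0.01, 0.001):
    p = precision_at_prevalence(tpr, fpr, prevalence)
    print(f"prevalence = {prevalence:>6.3f}  ->  precision = {p:.4f}")
```

The classifier itself never changes here. Only the prior does, and precision falls from 90% on balanced data to below 1% at a 0.1% fraud rate.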

5. Bayesian Inference for Model Parameters

In Bayesian ML, the parameters $\theta$ of a model are treated as random variables. Bayes' theorem gives:

$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) \, P(\theta)}{P(\mathcal{D})}$$

$$\underbrace{P(\theta \mid \mathcal{D})}_{\text{posterior over params}} \propto \underbrace{P(\mathcal{D} \mid \theta)}_{\text{likelihood}} \cdot \underbrace{P(\theta)}_{\text{prior}}$$

| Approach | Objective | Bayesian Interpretation |
|---|---|---|
| MLE | $\max_\theta P(\mathcal{D} \mid \theta)$ | No prior (or uniform prior) |
| MAP | $\max_\theta P(\mathcal{D} \mid \theta) P(\theta)$ | Uses prior, ignores posterior uncertainty |
| Full Bayes | Compute $P(\theta \mid \mathcal{D})$ | Full posterior distribution over $\theta$ |
| Bayesian NN | Approximate $P(\theta \mid \mathcal{D})$ via MCMC or variational inference | Uncertainty-aware neural networks |

The MAP estimate with a Gaussian prior $P(\theta) \propto e^{-\lambda\|\theta\|^2}$ gives:

$$\hat{\theta}_{MAP} = \arg\min_\theta \left[-\log P(\mathcal{D} \mid \theta) + \lambda\|\theta\|^2\right]$$

This is exactly L2-regularized loss minimization.
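
The step from the posterior to the regularized objective can be written out as follows. This is a sketch that drops additive constants which do not depend on $\theta$:

```latex
\begin{aligned}
\hat{\theta}_{MAP}
  &= \arg\max_\theta \; P(\mathcal{D} \mid \theta)\, P(\theta) \\
  &= \arg\max_\theta \; \left[ \log P(\mathcal{D} \mid \theta) + \log P(\theta) \right] \\
  &= \arg\max_\theta \; \left[ \log P(\mathcal{D} \mid \theta) - \lambda \|\theta\|^2 + \text{const} \right] \\
  &= \arg\min_\theta \; \left[ -\log P(\mathcal{D} \mid \theta) + \lambda \|\theta\|^2 \right]
\end{aligned}
```

The logarithm is monotonic, so it does not move the maximizer, and flipping the sign turns the maximization into a minimization.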

import numpy as np

# Bayesian linear regression (1D) with conjugate normal prior
np.random.seed(42)
n_train = 50

# True model: y = 2x + noise
x_train = np.random.uniform(-3, 3, n_train)
y_train = 2.0 * x_train + np.random.randn(n_train)

# Prior: w ~ N(0, tau^2), tau = 2
# Likelihood: y_i | w ~ N(w*x_i, sigma^2), sigma = 1
tau2 = 4.0 # prior variance
sigma2 = 1.0 # noise variance

# Closed-form Bayesian update for 1D linear regression
# posterior: w | D ~ N(mu_post, sigma_post^2)
sigma_post_sq = 1.0 / (1.0/tau2 + np.sum(x_train**2)/sigma2)
mu_post = sigma_post_sq * (np.sum(x_train * y_train) / sigma2)

print(f"Prior: w ~ N(0, {tau2})")
print(f"Posterior: w ~ N({mu_post:.4f}, {sigma_post_sq:.4f})")
print(f" Posterior mean (MAP estimate): {mu_post:.4f}")
print(f" Posterior std: {np.sqrt(sigma_post_sq):.4f}")
print(f" True w = 2.0, MLE = {np.dot(x_train, y_train)/np.dot(x_train, x_train):.4f}")
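
The link between the Gaussian prior and L2 regularization can also be checked numerically. The sketch below regenerates similar synthetic data with its own seed and compares the conjugate posterior mean with a ridge estimate. The name `ridge_lambda` is introduced here for illustration: multiplying the MAP objective by $2\sigma^2$ gives a squared-error loss with penalty strength $\sigma^2 / \tau^2$.

```python
import numpy as np

# Hypothetical 1D regression through the origin: y = w*x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = 2.0 * x + rng.normal(size=50)

tau2 = 4.0    # prior variance of w
sigma2 = 1.0  # noise variance

# Posterior mean from the conjugate Gaussian update.
sigma_post_sq = 1.0 / (1.0 / tau2 + np.sum(x**2) / sigma2)
mu_post = sigma_post_sq * np.sum(x * y) / sigma2

# Ridge (L2-regularized least squares) solution with penalty sigma^2 / tau^2:
#   argmin_w  sum_i (y_i - w * x_i)^2 + ridge_lambda * w^2
ridge_lambda = sigma2 / tau2
w_ridge = np.sum(x * y) / (np.sum(x**2) + ridge_lambda)

print(f"posterior mean = {mu_post:.6f}")
print(f"ridge estimate = {w_ridge:.6f}")
```

The two numbers agree up to floating-point error, because for a Gaussian posterior the mean and the mode (the MAP estimate) coincide.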

6. Naive Bayes Classifier

Naive Bayes applies Bayes' theorem directly to classification:

$$P(y = k \mid \mathbf{x}) \propto P(\mathbf{x} \mid y = k) \cdot P(y = k)$$

The "naive" assumption: features are conditionally independent given the class:

$$P(\mathbf{x} \mid y = k) = \prod_{j=1}^d P(x_j \mid y = k)$$

This makes the model tractable: instead of estimating one $d$-dimensional distribution per class, we estimate $d$ one-dimensional distributions per class.

Training

Estimate from data:

  • $P(y = k) = \frac{\text{count}(y=k)}{N}$ (class proportions)
  • $P(x_j \mid y = k)$ - depends on feature type:
    • Discrete (word counts): Multinomial or Bernoulli Naive Bayes
    • Continuous: Gaussian Naive Bayes assumes $x_j \mid y=k \sim \mathcal{N}(\mu_{jk}, \sigma_{jk}^2)$
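
For the discrete case, the estimates are just normalized counts. Here is a minimal multinomial-style sketch on a hypothetical five-email corpus. The corpus, the `predict` helper, and the add-one smoothing are illustrative assumptions rather than part of the lesson. Scores are combined in log space.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, label) pairs.
train = [
    (["win", "cash", "now"], "spam"),
    (["cheap", "cash", "offer"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "offer", "at", "noon"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]

classes = sorted({label for _, label in train})
vocab = sorted({tok for tokens, _ in train for tok in tokens})

# Class priors P(y = k) as class proportions.
label_counts = Counter(label for _, label in train)
log_prior = {k: math.log(label_counts[k] / len(train)) for k in classes}

# Word likelihoods P(word | y = k) from counts, with add-one smoothing
# (an assumption added here so unseen words do not get probability zero).
word_counts = {k: Counter() for k in classes}
for tokens, label in train:
    word_counts[label].update(tokens)

log_lik = {}
for k in classes:
    total = sum(word_counts[k].values()) + len(vocab)
    log_lik[k] = {w: math.log((word_counts[k][w] + 1) / total) for w in vocab}

def predict(tokens):
    """Return the class with the highest log prior + summed log likelihoods."""
    scores = {
        k: log_prior[k] + sum(log_lik[k][t] for t in tokens if t in log_lik[k])
        for k in classes
    }
    return max(scores, key=scores.get)

print(predict(["cash", "offer", "now"]))
print(predict(["meeting", "notes"]))
```

Words outside the training vocabulary are skipped in this sketch. A production implementation would need an explicit policy for them.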

Prediction

$$\hat{y} = \arg\max_k \left[ \log P(y=k) + \sum_{j=1}^d \log P(x_j \mid y=k) \right]$$

The sum of logs replaces the product of probabilities to avoid numerical underflow from multiplying many small probabilities.
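
The underflow problem is easy to reproduce. The sketch below assumes a hypothetical input with 1,000 features, each with likelihood 0.01 under some class:

```python
import math

# Hypothetical document: 1,000 features, each with likelihood 0.01 under the class.
probs = [0.01] * 1000

naive_product = math.prod(probs)           # underflows to 0.0 in double precision
log_sum = sum(math.log(p) for p in probs)  # stays finite

print(f"product of probabilities = {naive_product}")
print(f"sum of log-probabilities = {log_sum:.2f}")
```

Once every class score underflows to zero, the argmax is meaningless, while the log-space scores remain comparable.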

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Gaussian Naive Bayes on Iris
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Gaussian Naive Bayes Accuracy: {acc:.4f}")

# Inspect the model parameters
print(f"\nClass priors: {gnb.class_prior_.round(4)}")
print(f"\nPer-class feature means (mu_jk):")
print(gnb.theta_.round(3))
print(f"\nPer-class feature variances (sigma_jk^2):")
print(gnb.var_.round(3))

# Implement Naive Bayes from scratch to see the Bayes mechanics
class GaussianNaiveBayesScratch:
    """Gaussian Naive Bayes from scratch."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = {}
        self.means_ = {}
        self.vars_ = {}

        for k in self.classes_:
            X_k = X[y == k]
            self.priors_[k] = len(X_k) / len(X)
            self.means_[k] = X_k.mean(axis=0)
            self.vars_[k] = X_k.var(axis=0) + 1e-9  # add epsilon for stability
        return self

    def log_likelihood(self, x, k):
        """log P(x | y=k) = sum_j log N(x_j; mu_jk, sigma_jk^2)"""
        mu = self.means_[k]
        var = self.vars_[k]
        return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu)**2 / (2 * var))

    def predict(self, X):
        predictions = []
        for x in X:
            scores = {}
            for k in self.classes_:
                # log posterior = log prior + log likelihood (Bayes' theorem),
                # up to the constant log P(x), which does not affect the argmax
                scores[k] = np.log(self.priors_[k]) + self.log_likelihood(x, k)
            predictions.append(max(scores, key=scores.get))
        return np.array(predictions)

gnb_scratch = GaussianNaiveBayesScratch()
gnb_scratch.fit(X_train, y_train)
y_pred_scratch = gnb_scratch.predict(X_test)
print(f"\nFrom-scratch Naive Bayes Accuracy: {accuracy_score(y_test, y_pred_scratch):.4f}")

7. Generative vs Discriminative Models

Bayes' theorem illuminates a fundamental divide in ML model design:

Two paths to P(y | x):

GENERATIVE MODEL:                          DISCRIMINATIVE MODEL:
Learn P(x, y) or P(x|y)                    Learn P(y|x) directly
and P(y), then apply Bayes

P(y|x) ∝ P(x|y) · P(y)                     P(y|x) = f(x; θ)

Examples:                                  Examples:
- Naive Bayes                              - Logistic Regression
- Linear Discriminant Analysis (LDA)       - Neural Networks
- HMMs, GMMs                               - SVMs, Decision Trees
- VAEs, Diffusion Models                   - Transformers (as classifiers)
Trade-offs

| Aspect | Generative | Discriminative |
|---|---|---|
| Models | Full joint $P(x, y)$ | Only $P(y \mid x)$ |
| Sample generation | Can generate new $x$ | Cannot generate $x$ |
| Data efficiency | Better with less data | Needs more data |
| Accuracy (large data) | Often lower | Often higher |
| Handles missing features | Naturally | Difficult |
| Interpretability | Often more interpretable | Can be black box |

:::tip When to Use Each Approach
Use generative when you need to generate new samples (VAE for image synthesis, Latent Dirichlet Allocation for topic modeling), when you have limited data (generative models can incorporate priors), or when you need to handle missing features at test time.

Use discriminative when prediction accuracy is the primary goal and you have lots of labeled data. Most state-of-the-art classifiers are discriminative - directly optimizing the conditional distribution $P(y \mid x)$ leads to better classifiers when data is plentiful.
:::

8. Bayesian Updating: Sequential Learning

One of the most powerful aspects of Bayes' theorem is that it enables sequential updating: each new observation refines our beliefs.

$$P(\theta \mid x_1, \ldots, x_n) \propto P(\theta) \cdot \prod_{i=1}^n P(x_i \mid \theta)$$

Or sequentially: the posterior after seeing $x_1, \ldots, x_{t-1}$ becomes the prior for incorporating $x_t$:

$$P(\theta \mid x_{1:t}) \propto P(x_t \mid \theta) \cdot P(\theta \mid x_{1:t-1})$$

import numpy as np

# Sequential Bayesian updating for coin flip probability
# Coin with unknown P(heads) = p
# Prior: p ~ Beta(1, 1) (uniform)

np.random.seed(42)
true_p = 0.65
n_flips = 50
flips = (np.random.rand(n_flips) < true_p).astype(int) # 1=heads, 0=tails

# Sequential update
alpha, beta_param = 1.0, 1.0 # start with uniform prior

print(f"True p = {true_p}")
print(f"\nSequential Bayesian updates:")
print(f"{'Flip':>5} | {'Result':>6} | {'alpha':>6} | {'beta':>6} | {'Posterior Mean':>14}")
print("-" * 55)

for i, flip in enumerate(flips):
    # Update: observe one flip
    alpha += flip           # success
    beta_param += 1 - flip  # failure

    posterior_mean = alpha / (alpha + beta_param)
    # Print the first 10 flips, then every 5th flip
    if i < 10 or i % 5 == 4:
        print(f"{i+1:>5} | {'H' if flip else 'T':>6} | {alpha:>6.1f} | {beta_param:>6.1f} | {posterior_mean:>14.4f}")

print(f"\nFinal posterior after {n_flips} flips: Beta({alpha:.0f}, {beta_param:.0f})")
print(f"Posterior mean: {alpha/(alpha+beta_param):.4f} (true p = {true_p})")
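
Sequential and batch updating should land on the same posterior. The sketch below checks this for a hypothetical coin (generated with its own seed) and reads a 95% credible interval off the final Beta posterior with `scipy.stats.beta.ppf`.

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
flips = (rng.random(50) < 0.65).astype(int)   # hypothetical coin, 1 = heads

# Sequential: update the Beta parameters one flip at a time.
a_seq, b_seq = 1.0, 1.0
for flip in flips:
    a_seq += flip
    b_seq += 1 - flip

# Batch: add all heads and tails to the prior counts at once.
a_batch = 1.0 + flips.sum()
b_batch = 1.0 + len(flips) - flips.sum()

# 95% equal-tailed credible interval from the final posterior.
lo, hi = beta_dist.ppf([0.025, 0.975], a_batch, b_batch)

print(f"sequential: Beta({a_seq:.0f}, {b_seq:.0f})")
print(f"batch:      Beta({a_batch:.0f}, {b_batch:.0f})")
print(f"95% credible interval for p: [{lo:.3f}, {hi:.3f}]")
```

The order of the observations does not matter either, since the posterior depends only on the total counts of heads and tails.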

9. Interview Q&A

Q1: Explain Bayes' theorem and give an ML example.

A: Bayes' theorem states $P(A \mid B) = P(B \mid A) P(A) / P(B)$. It lets us "flip" the conditioning: given that we know $P(B \mid A)$, compute $P(A \mid B)$. In ML: suppose we have a spam classifier. We can model $P(\text{word} \mid \text{spam})$ (how common each word is in spam emails) and $P(\text{spam})$ (fraction of emails that are spam). Bayes' theorem gives $P(\text{spam} \mid \text{words}) \propto P(\text{words} \mid \text{spam}) \cdot P(\text{spam})$. Combined with the assumption that words are conditionally independent given the class, this is the Naive Bayes classifier. The prior $P(\text{spam})$ encodes our baseline expectation; the likelihood $P(\text{words} \mid \text{spam})$ captures the evidence; the posterior $P(\text{spam} \mid \text{words})$ is our updated belief after seeing the email.

Q2: What is the "naive" assumption in Naive Bayes and when does it fail?

A: Naive Bayes assumes features are conditionally independent given the class label: $P(\mathbf{x} \mid y) = \prod_j P(x_j \mid y)$. This allows the joint $d$-dimensional distribution to factorize into $d$ one-dimensional distributions, making it tractable. The assumption fails whenever features are correlated given the class. In text classification: "New York" appearing as a pair has different meaning than "New" and "York" independently. In image classification: adjacent pixels are correlated. Despite the assumption being wrong in practice, Naive Bayes often achieves surprisingly good classification accuracy because: (1) the argmax of the posterior is often correct even if the posterior probabilities are miscalibrated; (2) the decision boundary can still be accurate even with wrong assumptions; (3) with little data, the naive assumption acts as strong regularization.
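
Point (1) can be illustrated with a small sketch. Suppose a single piece of evidence is 3x more likely under spam than under non-spam (a hypothetical number), and the same evidence shows up as several perfectly correlated features. Naive Bayes counts it once per copy. The `posterior_spam` helper is an assumption made for this illustration.

```python
import math

def posterior_spam(log_likelihood_ratio, copies, prior_spam=0.5):
    """Naive Bayes posterior when the same evidence is counted `copies` times."""
    prior_log_odds = math.log(prior_spam / (1.0 - prior_spam))
    log_odds = prior_log_odds + copies * log_likelihood_ratio
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical feature that is 3x more likely in spam than in non-spam.
llr = math.log(3.0)

for copies in (1, 2, 5):
    p = posterior_spam(llr, copies)
    print(f"{copies} perfectly correlated copies -> P(spam | x) = {p:.4f}")
```

The predicted class stays the same, but the posterior drifts toward 1 as the evidence is double-counted: the argmax is right while the probability is overconfident.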

Q3: What is the difference between MAP and full Bayesian inference?

A: MAP (Maximum A Posteriori) finds the single parameter value that maximizes the posterior: $\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid \mathcal{D})$. Full Bayesian inference computes the entire posterior distribution $P(\theta \mid \mathcal{D})$. The key differences are: (1) Uncertainty: MAP gives a point estimate with no uncertainty; full Bayes gives a distribution that quantifies uncertainty about parameters. (2) Predictions: MAP uses $P(y \mid x, \hat{\theta}_{MAP})$; full Bayes uses $P(y \mid x, \mathcal{D}) = \int P(y \mid x, \theta) P(\theta \mid \mathcal{D}) \, d\theta$ - averaging over all possible parameters weighted by their posterior. (3) Computational cost: MAP is optimization (fast); full Bayes requires computing an integral (hard - typically requires MCMC or variational methods). (4) Practical use: MAP is equivalent to regularized optimization (L2 = Gaussian prior). Full Bayes is used when uncertainty quantification is critical (medical, safety-critical applications).
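
The difference in predictions shows up even in a coin-flip model. The sketch below assumes hypothetical data (3 heads in 3 flips) and a uniform Beta(1, 1) prior, then compares the MAP plug-in prediction with the full posterior predictive.

```python
# Hypothetical data: 3 heads in 3 flips, with a uniform Beta(1, 1) prior on p.
heads, tails = 3, 0
alpha = 1 + heads   # the posterior is Beta(alpha, beta)
beta = 1 + tails

# MAP plug-in: use the posterior mode as if it were the true p.
# Here the posterior density is proportional to p^3, so the mode is at p = 1,
# which matches the formula (alpha - 1) / (alpha + beta - 2).
p_map = (alpha - 1) / (alpha + beta - 2)

# Full Bayes: P(next flip = heads | data) is the integral of
# p * Beta(p; alpha, beta) over p, i.e. the posterior mean alpha / (alpha + beta).
p_bayes = alpha / (alpha + beta)

print(f"MAP plug-in prediction:      P(heads) = {p_map:.2f}")
print(f"Full Bayes (post. predictive): P(heads) = {p_bayes:.2f}")
```

The plug-in estimate claims that tails is impossible after three flips, while averaging over the posterior keeps some probability on tails.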

Q4: How do generative and discriminative models differ, and when should you use each?

A: Generative models learn the joint distribution $P(x, y) = P(x \mid y) P(y)$ and use Bayes' theorem to compute $P(y \mid x) = P(x \mid y) P(y) / P(x)$. Discriminative models directly learn $P(y \mid x)$ without modeling $P(x)$. Key trade-offs: generative models can synthesize new data (sample from $P(x \mid y)$), handle missing features naturally, incorporate prior knowledge, and work with less labeled data. Discriminative models (logistic regression, neural networks) typically achieve better classification accuracy with large labeled datasets because they optimize the quantity of interest ($P(y \mid x)$) directly rather than modeling the entire joint. Use generative when: generating new data (VAE, GAN, diffusion), limited labeled data (semi-supervised), or you need a probabilistic model of the data. Use discriminative when: maximizing classification accuracy on labeled data is the primary goal.

Q5: Why can a highly accurate test still have low positive predictive value?

A: This is the base rate fallacy (or false positive paradox). Positive predictive value (PPV) = $P(\text{disease} \mid \text{positive test}) = P(\text{positive} \mid \text{disease}) \cdot P(\text{disease}) / P(\text{positive})$. When the disease prevalence $P(\text{disease})$ is very low, even a highly accurate test will have many false positives relative to true positives. Example: test with 99% sensitivity and 99% specificity, disease with 0.1% prevalence. PPV = $(0.99 \times 0.001) / (0.99 \times 0.001 + 0.01 \times 0.999) \approx 0.00099 / 0.01098 \approx 9\%$. So 91% of positive tests are false positives! This directly impacts ML: precision is the PPV of a classifier, so a fraud detection model with 99% precision on a balanced test set may have far lower precision in production where the fraud rate is 0.1%. Always evaluate with realistic class priors, not balanced test sets.

© 2026 EngineersOfAI. All rights reserved.