Probability Axioms and Events
Reading time: ~35 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist
The ML Scenario That Motivates This Lesson
You're interviewing at a top AI lab. The interviewer asks: "Your binary classifier outputs 0.87 for a sample. What does that number actually mean?"
If you say "it means an 87% chance of being the positive class", that's correct in spirit. But the follow-up, "What conditions must hold for that interpretation to be valid?", exposes whether you understand the formalism.
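For 0.87 to be read as a probability, the model's outputs must form a valid probability distribution, which means satisfying three axioms (the Kolmogorov axioms). Softmax outputs satisfy them: each is non-negative, none exceeds 1, and they sum to 1 across classes (a sigmoid output $p$ for a binary classifier does the same through the pair $p$, $1 - p$). This isn't accidental; these functions were designed precisely to produce valid probability distributions.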
Understanding probability axioms lets you reason about when ML outputs are valid probabilities, when they're not (miscalibrated models), and what it means to model uncertainty correctly.
1. Probability Space: The Three Components
A probability space is a triple $(\Omega, \mathcal{F}, P)$:
Probability Space = (Ω, F, P)
                     │  │  │
                     │  │  └─ Probability Measure: assigns numbers to events
                     │  └──── Event Space: collection of subsets we care about
                     └──────── Sample Space: all possible outcomes
Sample Space
The sample space $\Omega$ is the set of all possible outcomes of an experiment.
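| Experiment | Sample Space $\Omega$ |
|---|---|
| Flip a coin | $\{H, T\}$ |
| Roll a die | $\{1, 2, 3, 4, 5, 6\}$ |
| Classify an image | $\{c_1, \ldots, c_K\}$ (the $K$ class labels) |
| Model predicts a real value | $\mathbb{R}$ |
| Draw from a 2D distribution | $\mathbb{R}^2$ |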
In ML, the sample space is often the space of model outputs, or the data-generating process's outcome space.
Event Space
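An event is any subset of $\Omega$ that we can assign a probability to. The collection of all such subsets forms the event space $\mathcal{F}$ (formally, a sigma-algebra).
For a finite sample space, $\mathcal{F}$ is usually all subsets (the power set $2^\Omega$). For continuous spaces (e.g., $\Omega = \mathbb{R}$), we use the Borel sigma-algebra: the smallest sigma-algebra containing all open sets, which therefore also contains all closed sets and everything built from them by countable unions, intersections, and complements.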
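:::note Sigma-Algebra (for completeness)
A sigma-algebra $\mathcal{F}$ on $\Omega$ satisfies:
- $\Omega \in \mathcal{F}$ (contains the whole sample space)
- If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closed under complement)
- If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$ (closed under countable union)

You won't need to construct sigma-algebras in practice, but knowing they exist explains why not every subset of an uncountable $\Omega$ (such as $\mathbb{R}$) can be assigned a probability.
:::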
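For a finite sample space the power set is small enough to enumerate, so the sigma-algebra conditions can be checked by brute force. The sketch below is illustrative only: `power_set` and `is_sigma_algebra` are helper names invented here, and the check uses pairwise unions, which is sufficient only because the collection is finite.

```python
from itertools import chain, combinations

def power_set(omega):
    """Return all subsets of omega as a set of frozensets."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}

def is_sigma_algebra(events, omega):
    """Brute-force check of the sigma-algebra conditions on a finite collection."""
    omega = frozenset(omega)
    # Must contain the whole sample space.
    if omega not in events:
        return False
    # Closed under complement.
    if any(omega - a not in events for a in events):
        return False
    # Closed under pairwise (hence finite) union.
    return all(a | b in events for a in events for b in events)

omega = {1, 2, 3}
full = power_set(omega)
print(len(full))                        # 2**3 = 8 events
print(is_sigma_algebra(full, omega))    # True

# Dropping a single event breaks closure under complement.
broken = full - {frozenset({1})}
print(is_sigma_algebra(broken, omega))  # False
```

The trivial collection containing only the empty set and the whole sample space also passes the check, which is why the power set is described as the usual choice rather than the only one.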
2. The Kolmogorov Axioms
The probability measure $P: \mathcal{F} \to [0, 1]$ must satisfy three axioms, first formalized by Andrei Kolmogorov in 1933.
Axiom 1: Non-Negativity
$$P(A) \geq 0 \quad \text{for all } A \in \mathcal{F}$$

Probabilities are never negative.
Axiom 2: Normalization
$$P(\Omega) = 1$$

Something must happen. The total probability over all outcomes is exactly 1.
Axiom 3: Countable Additivity
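For mutually exclusive (disjoint) events $A_1, A_2, \ldots$ (i.e., $A_i \cap A_j = \emptyset$ for $i \neq j$):

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

If two events cannot both happen, the probability of either happening is the sum of their individual probabilities.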
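:::tip Why These Axioms for ML Engineers
These three axioms are why softmax is designed the way it is. For a $K$-class classifier with logits $z_1, \ldots, z_K$:

$$\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

- Axiom 1: $e^{z_k} > 0$ always, so each output is positive.
- Axiom 2: Sum over $k$ equals $\frac{\sum_{k} e^{z_k}}{\sum_{j} e^{z_j}} = 1$.
- Axiom 3: Classes are mutually exclusive, so the probability of a set of classes is the sum of their individual outputs.

Softmax produces a valid probability distribution by construction.
:::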
3. Consequences of the Axioms
From just three axioms, we derive all the rules of probability you will use in ML.
The Complement Rule
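$$P(A^c) = 1 - P(A)$$

Proof: $A$ and $A^c$ are disjoint, so by Axiom 3: $P(A \cup A^c) = P(A) + P(A^c)$. Since $A \cup A^c = \Omega$, by Axiom 2: $P(A) + P(A^c) = 1$.

ML Connection: In binary classification, $P(y = 0 \mid x) = 1 - P(y = 1 \mid x)$. If the sigmoid output for the positive class is $\sigma(z)$, the probability of the negative class is $1 - \sigma(z)$.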
The Impossible Event
$$P(\emptyset) = 0$$

Proof: $\emptyset = \Omega^c$, so $P(\emptyset) = 1 - P(\Omega) = 1 - 1 = 0$.
Monotonicity
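If $A \subseteq B$, then $P(A) \leq P(B)$.

Proof: $B = A \cup (B \setminus A)$, and these are disjoint. So $P(B) = P(A) + P(B \setminus A) \geq P(A)$.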
Inclusion-Exclusion
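For two arbitrary (not necessarily disjoint) events:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

For three events:

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$$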
Inclusion-Exclusion (two events):
┌─────────────┐
│  A   │  B   │
│ (a)  │ (b)  │
│     (c)     │
└─────────────┘
(a) = A only, (b) = B only, (c) = A ∩ B (the overlap)
P(A ∪ B) = P(a) + P(c) + P(b)
         = P(A) + P(B) - P(A ∩ B)   <-- subtract overlap counted twice
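The three-event identity can be verified exactly, with no sampling noise, by enumerating a small sample space. The sketch below uses two fair dice and `fractions.Fraction`; the events `A`, `B`, `C` and the helper `prob` are chosen here purely for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of two fair dice, each with probability 1/36.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under the uniform measure on omega; event is a predicate on (d1, d2)."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] % 2 == 0      # first die is even
B = lambda w: w[0] + w[1] > 7    # sum exceeds 7
C = lambda w: w[1] == 6          # second die shows 6

# Left side: probability of the union, computed directly.
lhs = prob(lambda w: A(w) or B(w) or C(w))

# Right side: singles minus pairwise overlaps plus the triple overlap.
rhs = (
    prob(A) + prob(B) + prob(C)
    - prob(lambda w: A(w) and B(w))
    - prob(lambda w: A(w) and C(w))
    - prob(lambda w: B(w) and C(w))
    + prob(lambda w: A(w) and B(w) and C(w))
)
print(lhs, rhs, lhs == rhs)
```

Because the arithmetic is done with exact fractions, the two sides agree exactly rather than approximately.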
4. Conditional Probability
Conditional probability captures how our belief about event $A$ changes given that we know event $B$ occurred.

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

Intuition: We restrict our universe to $B$ (we know $B$ happened), and then ask: what fraction of $B$ also has $A$?
Without conditioning:        After conditioning on B:
┌──────────────────┐         ┌─────┐
│ Ω                │         │  B  │
│  ┌──┐  ┌──┐      │         │ ┌──┐│
│  │A ├──┤B │      │   =>    │ │∩ ││
│  └──┘  └──┘      │         │ └──┘│
└──────────────────┘         └─────┘
P(A) = area(A)/area(Ω)       P(A|B) = area(A∩B)/area(B)
Example: Disease Testing
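- $D$ = person has disease, $P(D) = 0.01$ (1% prevalence)
- $T$ = test is positive, $P(T \mid D) = 0.95$ (95% sensitivity)
- $P(T \mid D^c) = 0.05$ (5% false positive rate)

What is $P(D \mid T)$?

$$P(D \mid T) = \frac{P(T \mid D)\,P(D)}{P(T \mid D)\,P(D) + P(T \mid D^c)\,P(D^c)} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} = \frac{0.0095}{0.0590} \approx 0.161$$

Even after a positive result from a 95%-sensitive test, the probability of disease is only about 16%: the disease is rare, so false positives from the healthy 99% outnumber true positives.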
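The answer follows from Bayes' theorem with the law of total probability in the denominator. A small sketch (the function name `posterior` and its argument names are chosen here for illustration) computes it and shows how strongly the result depends on prevalence:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(D | T) from P(D), P(T | D), and P(T | not D)."""
    # Law of total probability: P(T) = P(T|D)P(D) + P(T|not D)P(not D)
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    # Bayes' theorem: P(D|T) = P(T|D)P(D) / P(T)
    return sensitivity * prior / p_positive

# Numbers from the example: 1% prevalence, 95% sensitivity, 5% false positives.
p_rare = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(D | T) at  1% prevalence: {p_rare:.4f}")    # 0.1610

# Same test, hypothetical 10% prevalence.
p_common = posterior(prior=0.10, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(D | T) at 10% prevalence: {p_common:.4f}")  # 0.6786
```

The test itself is unchanged between the two calls; only the prior moves, which is why base rates matter so much when interpreting classifier outputs on rare classes.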
The Multiplication Rule
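Rearranging the conditional probability definition:

$$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$

This extends to chains:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1})$$

This chain rule is the foundation of autoregressive language models:

$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$$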
:::tip Language Models and the Chain Rule
GPT-style models learn $P(w_t \mid w_1, \ldots, w_{t-1})$ for every position $t$. The probability of a complete sentence is the product of these conditional probabilities. Training maximizes the log of this joint probability (log-likelihood), which decomposes into a sum of conditional log-probabilities - making it tractable even for very long sequences.
:::
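To make the product-to-sum step concrete, the sketch below scores a three-token sequence. The conditional probabilities are made-up numbers standing in for what a model might output; no real model is involved.

```python
import math

# Hypothetical values of P(w_t | w_1, ..., w_{t-1}) for t = 1, 2, 3.
tokens = ["the", "cat", "sat"]
cond_probs = [0.20, 0.05, 0.10]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(cond_probs)

# In log space the product becomes a sum, which avoids numerical underflow
# when sequences are long.
log_joint = sum(math.log(p) for p in cond_probs)

print(f"P(sequence)     = {joint:.6f}")                # 0.001000
print(f"log P(sequence) = {log_joint:.4f}")
print(f"exp(log P)      = {math.exp(log_joint):.6f}")  # matches the product

# Average negative log-likelihood per token, the quantity minimized in training.
nll_per_token = -log_joint / len(tokens)
print(f"NLL per token   = {nll_per_token:.4f}")
```

With thousands of tokens the raw product would underflow to zero in floating point, while the sum of logs stays well within range.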
5. Independence
Events $A$ and $B$ are independent if knowing one tells you nothing about the other:

$$P(A \cap B) = P(A)\,P(B)$$

Equivalently (when $P(B) > 0$):

$$P(A \mid B) = P(A)$$
Mutual Independence vs Pairwise Independence
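For events $A_1, \ldots, A_n$, mutual independence requires that for every subset $S \subseteq \{1, \ldots, n\}$:

$$P\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i)$$

Pairwise independence ($P(A_i \cap A_j) = P(A_i)\,P(A_j)$ for all pairs $i \neq j$) does not imply mutual independence.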
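A standard counterexample uses two fair coin flips and a third event that depends on both. The sketch below enumerates the four outcomes exactly; the event names and the `prob` helper are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# Two fair coin flips: four equally likely outcomes.
omega = list(product("HT", repeat=2))

def prob(event):
    """P(event) under the uniform measure on omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == "H"     # first flip is heads
B = lambda w: w[1] == "H"     # second flip is heads
C = lambda w: w[0] != w[1]    # the two flips differ

# Every pair factorizes ...
for X, Y in [(A, B), (A, C), (B, C)]:
    assert prob(lambda w: X(w) and Y(w)) == prob(X) * prob(Y)

# ... but the triple does not: A and B together force C to be false.
p_abc = prob(lambda w: A(w) and B(w) and C(w))
p_product = prob(A) * prob(B) * prob(C)
print(p_abc, p_product)   # 0 1/8
```

Each pair looks independent in isolation, yet knowing any two of the events determines the third completely.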
Conditional Independence
$A$ and $B$ are conditionally independent given $C$ (written $A \perp B \mid C$) if:

$$P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C)$$
This is central to graphical models (Bayesian networks, Markov Random Fields) and the Naive Bayes classifier.
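Conditional independence does not imply ordinary independence. In the hypothetical setup sketched below, one of two coins is picked at random and flipped twice: given the coin, the flips are independent by construction, but marginally they are not, because the first flip carries information about which coin is in play. All numbers are invented for illustration.

```python
from fractions import Fraction

# Hypothetical setup: choose a coin uniformly at random, then flip it twice.
p_coin = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}    # P(coin)
p_heads = {"fair": Fraction(1, 2), "biased": Fraction(9, 10)}  # P(H | coin)

# Given the coin, the two flips are independent, so P(H1, H2 | coin) factorizes.
p_hh_given = {c: p_heads[c] * p_heads[c] for c in p_coin}

# Marginalize over the coin with the law of total probability.
p_h1 = sum(p_coin[c] * p_heads[c] for c in p_coin)     # P(H1) = P(H2)
p_hh = sum(p_coin[c] * p_hh_given[c] for c in p_coin)  # P(H1, H2)

# Marginally the flips are dependent: P(H1, H2) != P(H1) * P(H2).
print(p_hh, p_h1 * p_h1)   # 53/100 49/100
```

This is the same structure a Bayesian network encodes when two children share a single parent.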
:::note Naive Bayes and Conditional Independence
Naive Bayes assumes that all features $x_1, \ldots, x_d$ are conditionally independent given the class label $y$:

$$P(x_1, \ldots, x_d \mid y) = \prod_{i=1}^{d} P(x_i \mid y)$$

This assumption is almost always false in practice (words in a sentence are correlated), yet Naive Bayes often works surprisingly well. The "naive" refers to this optimistic independence assumption.
:::
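The factorization turns classification into a sum of per-feature log terms. The sketch below is a toy two-class example with three binary features; the class names, priors, and likelihoods are all invented for illustration and are not fitted to any data.

```python
import math

# Hypothetical parameters: every number here is made up for illustration.
prior = {"spam": 0.4, "ham": 0.6}   # P(y)
likelihood = {                       # P(x_i = 1 | y) for three binary features
    "spam": [0.8, 0.6, 0.1],
    "ham":  [0.1, 0.3, 0.5],
}

def log_joint(x, y):
    """log P(y) + sum_i log P(x_i | y), using the Naive Bayes factorization."""
    total = math.log(prior[y])
    for x_i, p in zip(x, likelihood[y]):
        total += math.log(p if x_i == 1 else 1.0 - p)
    return total

def posterior(x):
    """P(y | x) for each class, normalized so the values sum to 1."""
    logs = {y: log_joint(x, y) for y in prior}
    m = max(logs.values())          # subtract max for numerical stability
    unnorm = {y: math.exp(v - m) for y, v in logs.items()}
    z = sum(unnorm.values())
    return {y: u / z for y, u in unnorm.items()}

post = posterior([1, 1, 0])
print(post)
```

Normalizing at the end is what makes the output satisfy Axiom 2; the unnormalized scores alone would still rank the classes correctly but would not be probabilities.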
6. Python: Working with Probability Axioms
import numpy as np
np.random.seed(42)
# --- Axiom 1: Non-negativity ---
die_probs = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
print(f"Non-negativity: all probabilities >= 0: {np.all(die_probs >= 0)}")
# --- Axiom 2: Normalization ---
print(f"Normalization: sum = {die_probs.sum():.6f}")
# --- Axiom 3: Additivity ---
p_low = die_probs[:3].sum() # P({1, 2, 3})
p_high = die_probs[3:].sum() # P({4, 5, 6})
print(f"Additivity: P(<=3) + P(>=4) = {p_low:.4f} + {p_high:.4f} = {p_low + p_high:.4f}")
# --- Conditional Probability via simulation ---
n_samples = 1_000_000
die1 = np.random.randint(1, 7, n_samples)
die2 = np.random.randint(1, 7, n_samples)
total = die1 + die2
# Unconditional P(sum > 7) - expected: 15/36 ≈ 0.4167
p_sum_gt7 = (total > 7).mean()
print(f"P(sum > 7) = {p_sum_gt7:.4f}")
# Conditional P(sum > 7 | die1 = 4)
# die1=4 means we need die2 > 3, so P = 3/6 = 0.5
mask = (die1 == 4)
p_cond = (total[mask] > 7).mean()
print(f"P(sum > 7 | die1=4) = {p_cond:.4f} (expected: 0.5000)")
# --- Independence check ---
# P(die1=3 AND die2=5) should equal P(die1=3) * P(die2=5)
p_joint = ((die1 == 3) & (die2 == 5)).mean()
p_d1 = (die1 == 3).mean()
p_d2 = (die2 == 5).mean()
print(f"\nP(die1=3, die2=5) = {p_joint:.5f}")
print(f"P(die1=3) * P(die2=5) = {p_d1 * p_d2:.5f}")
print(f"Approximately equal: {np.isclose(p_joint, p_d1 * p_d2, atol=0.001)}")
# --- Inclusion-Exclusion ---
A = (die1 % 2 == 0) # die1 is even
B = (total > 7) # sum > 7
p_A = A.mean()
p_B = B.mean()
p_AB = (A & B).mean()
p_AorB_direct = (A | B).mean()
p_AorB_formula = p_A + p_B - p_AB
print(f"\nInclusion-Exclusion:")
print(f"P(A) = {p_A:.4f}, P(B) = {p_B:.4f}, P(A∩B) = {p_AB:.4f}")
print(f"P(A∪B) direct : {p_AorB_direct:.4f}")
print(f"P(A∪B) formula : {p_AorB_formula:.4f}")
# --- Softmax satisfies all three Kolmogorov axioms ---
def softmax(x):
    e_x = np.exp(x - x.max())  # subtract max for numerical stability
    return e_x / e_x.sum()
logits = np.array([2.5, 1.0, -0.5, 3.1, 0.8])
probs = softmax(logits)
print(f"\nLogits : {logits}")
print(f"Softmax out : {probs.round(4)}")
print(f"Axiom 1 (non-negative) : {np.all(probs >= 0)}")
print(f"Axiom 2 (sums to 1) : {probs.sum():.8f}")
7. ML Connection: Classification Probabilities and Calibration
In ML, the conditional probability $P(y \mid x)$ is what every probabilistic classifier tries to learn.
Input Image x
      │
      ▼
┌───────────────────┐
│  Neural Network   │  => logits: [2.17, -0.5, 1.03]
│     f(x; θ)       │
└───────────────────┘
      │
      ▼ softmax
      │
P(y=cat  | x) = 0.72
P(y=dog  | x) = 0.05
P(y=other| x) = 0.23
                ──────
                 1.00  <- valid probability distribution
Calibration: A model's output of 0.72 is interpretable as "72% probability" only if the model is calibrated - that is, among all inputs where it outputs 0.72, it is right about 72% of the time. Many modern neural networks are overconfident. Post-hoc calibration methods such as Platt scaling and temperature scaling correct this by rescaling the model's scores on held-out data, using - again - probability theory.
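Temperature scaling divides the logits by a scalar $T$ before applying softmax. In practice $T$ is usually fit on a held-out validation set by minimizing negative log-likelihood; the sketch below skips the fitting step and only illustrates the effect of $T$ on hypothetical logits. The function name `temperature_scale` is chosen here for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def temperature_scale(logits, T):
    """Softmax of logits / T; T > 1 softens the distribution, T < 1 sharpens it."""
    return softmax(logits / T)

logits = np.array([3.0, 1.0, 0.2])   # hypothetical logits for three classes
for T in (0.5, 1.0, 2.0):
    p = temperature_scale(logits, T)
    print(f"T={T}: probs={p.round(3)}, sum={p.sum():.6f}, argmax={p.argmax()}")
```

Dividing by a positive $T$ never changes the argmax, so accuracy is unaffected; only the confidence values move, and the output remains a valid probability distribution for every $T$.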
8. Common Probability Rules Reference
| Rule | Formula |
|---|---|
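| Complement | $P(A^c) = 1 - P(A)$ |
| Addition (general) | $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ |
| Addition (disjoint) | $P(A \cup B) = P(A) + P(B)$ |
| Multiplication | $P(A \cap B) = P(A \mid B)\,P(B)$ |
| Conditional | $P(A \mid B) = P(A \cap B) / P(B)$ |
| Independence | $P(A \cap B) = P(A)\,P(B)$ |
| Total Probability | $P(A) = \sum_i P(A \mid B_i)\,P(B_i)$ for a partition $\{B_i\}$ of $\Omega$ |
| Bayes' Theorem | $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$ |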
9. Interview Q&A
Q1: What does it mean for two events to be independent, and how is this different from mutually exclusive?
A: Two events are independent if $P(A \cap B) = P(A)\,P(B)$ - knowing one occurred does not change the probability of the other. Two events are mutually exclusive if $A \cap B = \emptyset$, so $P(A \cap B) = 0$ - they cannot both occur. These are actually opposite extremes. If $A$ and $B$ are mutually exclusive and both have positive probability, knowing $A$ occurred tells you with certainty that $B$ did not - meaning they are maximally dependent, not independent. In ML: classes in a multiclass classifier are mutually exclusive (an image is either a cat or a dog, not both). Features fed to a model might be approximately independent (Naive Bayes assumption) or highly dependent (most real data).
Q2: What are the Kolmogorov axioms and why do they matter for ML?
A: The three axioms are: (1) non-negativity: $P(A) \geq 0$; (2) normalization: $P(\Omega) = 1$; (3) countable additivity for disjoint events. They matter because they define what constitutes a valid probability distribution. The softmax function satisfies all three by construction - that is not an accident. When evaluating model outputs (does my model produce valid probabilities?), you are checking these axioms. They also underpin every rule we use to reason about model predictions: complement, inclusion-exclusion, Bayes' theorem all follow as theorems from these three axioms.
Q3: How does the chain rule of probability connect to autoregressive language models?
A: The chain rule states $P(A_1, \ldots, A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1})$. For text, this becomes $P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$. GPT-style models learn each conditional via a transformer. During training, we maximize the log of this joint probability, which is equivalent to minimizing the sum of per-position cross-entropy losses - tractable to compute. During inference, we sample from each conditional to generate the next token. The entire training objective of GPT is directly the chain rule of probability applied to sequences.
Q4: What is conditional independence and why does it matter in graphical models?
A: $A \perp B \mid C$ means $P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C)$ - given $C$, knowing $B$ tells you nothing additional about $A$. In Bayesian networks, each node is conditionally independent of its non-descendants given its parents. This structure allows the joint distribution over $n$ variables to factorize into a product of small conditional distributions, making inference tractable. Without conditional independence structure, specifying the joint distribution over $n$ binary variables requires $2^n - 1$ parameters. In Naive Bayes, features are conditionally independent given the class label, allowing $P(x_1, \ldots, x_d \mid y)$ to be computed efficiently even in high dimensions.
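The gap in parameter counts grows quickly. A rough sketch, assuming binary features and a binary class label (the function names are illustrative):

```python
def full_joint_params(num_binary_vars):
    """Free parameters of an unrestricted joint over binary variables."""
    return 2 ** num_binary_vars - 1

def naive_bayes_params(num_features):
    """Binary class, binary features: 1 prior parameter plus 2 per feature."""
    return 1 + 2 * num_features

for d in (5, 10, 20, 30):
    # The full joint covers the d features plus the class label.
    print(d, full_joint_params(d + 1), naive_bayes_params(d))
```

The unrestricted joint grows exponentially in the number of features, while the Naive Bayes count grows linearly.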
Q5: What goes wrong if a model's output does not satisfy the probability axioms?
A: Several practical problems arise. If outputs do not sum to 1, you cannot interpret them as class probabilities: argmax still picks a class, but thresholds, expected-cost decisions, and cross-entropy loss lose their meaning. If any output is zero or negative, its log is undefined, so log-likelihood cannot be computed. If outputs satisfy the axioms but are not calibrated (they do not reflect true conditional probabilities), then confidence scores are misleading in production: a model saying "95% confident" might be right only 60% of the time. Miscalibrated models cause serious problems in high-stakes applications - medical diagnosis, autonomous driving, financial risk modeling - where downstream systems rely on these uncertainty estimates for decision-making.
