
Probability Axioms and Events

Reading time: ~35 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist

The ML Scenario That Motivates This Lesson

You're interviewing at a top AI lab. The interviewer asks: "Your binary classifier outputs 0.87 for a sample. What does that number actually mean?"

If you say "it means an 87% chance of being the positive class", that is correct in spirit. But the follow-up, "What conditions must hold for that interpretation to be valid?", exposes whether you understand the formalism.

For 0.87 to be read as a probability, the model's outputs must form a valid probability distribution, which means satisfying three axioms (the Kolmogorov axioms). Sigmoid and softmax outputs satisfy them: every value is non-negative and at most 1, and the values sum to 1 across classes. This isn't accidental; these output layers are chosen precisely because they produce valid probability distributions.

Understanding probability axioms lets you reason about when ML outputs are valid probabilities, when they are valid but not trustworthy (miscalibrated models), and what it means to model uncertainty correctly.

1. Probability Space: The Three Components

A probability space is a triple $(\Omega, \mathcal{F}, P)$:

Probability Space = (Ω, F, P)
                     │  │  │
                     │  │  └─ Probability Measure: assigns numbers to events
                     │  └──── Event Space: collection of subsets we care about
                     └──────── Sample Space: all possible outcomes

Sample Space $\Omega$

The sample space $\Omega$ is the set of all possible outcomes of an experiment.

| Experiment | Sample Space $\Omega$ |
| --- | --- |
| Flip a coin | $\{H, T\}$ |
| Roll a die | $\{1, 2, 3, 4, 5, 6\}$ |
| Classify an image | $\{\text{cat}, \text{dog}, \text{other}\}$ |
| Model predicts a real value | $\mathbb{R}$ |
| Draw from a 2D distribution | $\mathbb{R}^2$ |

In ML, the sample space is often the space of model outputs, or the data-generating process's outcome space.

Event Space $\mathcal{F}$

An event is any subset of $\Omega$ that we can assign a probability to. The collection of all such subsets forms the event space $\mathcal{F}$ (formally, a sigma-algebra).

For a finite sample space, $\mathcal{F}$ is usually all subsets (the power set). For continuous spaces ($\mathbb{R}^n$), we use the Borel sigma-algebra: the smallest sigma-algebra containing the open sets, which therefore also contains all closed sets and their countable unions, intersections, and complements.

:::note Sigma-Algebra (for completeness)
A sigma-algebra $\mathcal{F}$ satisfies:

  1. $\Omega \in \mathcal{F}$
  2. If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closed under complement)
  3. If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$ (closed under countable union)

You won't need to construct sigma-algebras in practice, but knowing they exist explains why not every subset of $\mathbb{R}$ can be assigned a probability.
:::
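For a finite sample space the power set is small enough to enumerate, so the three sigma-algebra conditions can be checked directly. The following is a minimal sketch, not a general-purpose tool: the helper names `power_set` and `is_sigma_algebra` are illustrative, and it relies on the fact that for a finite collection, closure under pairwise union is enough.

```python
from itertools import chain, combinations

def power_set(omega):
    """Return every subset of omega as a frozenset."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}

def is_sigma_algebra(omega, events):
    """Check the three closure conditions on a finite collection of events."""
    omega = frozenset(omega)
    has_omega = omega in events
    closed_complement = all((omega - a) in events for a in events)
    # Finite collection: pairwise unions cover every possible union.
    closed_union = all((a | b) in events for a in events for b in events)
    return has_omega and closed_complement and closed_union

omega = {"cat", "dog", "other"}
full = power_set(omega)                          # the usual choice for finite spaces
trivial = {frozenset(), frozenset(omega)}        # smallest possible event space
broken = {frozenset(omega), frozenset({"cat"})}  # missing complements

print(len(full))                         # 8 subsets for 3 outcomes
print(is_sigma_algebra(omega, full))     # True
print(is_sigma_algebra(omega, trivial))  # True
print(is_sigma_algebra(omega, broken))   # False
```

The `broken` collection fails because neither the empty set nor the complement of `{cat}` is included.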

2. The Kolmogorov Axioms

The probability measure $P: \mathcal{F} \to [0, 1]$ must satisfy three axioms, first formalized by Andrei Kolmogorov in 1933.

Axiom 1: Non-Negativity

$$P(A) \geq 0 \quad \text{for all } A \in \mathcal{F}$$

Probabilities are never negative.

Axiom 2: Normalization

$$P(\Omega) = 1$$

Something must happen. The total probability over all outcomes is exactly 1.

Axiom 3: Countable Additivity

For mutually exclusive (disjoint) events $A_1, A_2, \ldots$ (i.e., $A_i \cap A_j = \emptyset$ for $i \neq j$):

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

In particular, for two events: if they cannot both happen, the probability that either happens is the sum of their individual probabilities.

:::tip Why These Axioms for ML Engineers
These three axioms are why softmax is designed the way it is. For a $K$-class classifier with logits $z_1, \ldots, z_K$:

$$\text{softmax}(z_k) = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$

  • Axiom 1: $e^{z_k} > 0$ always, so each output is positive.
  • Axiom 2: The sum over $k$ equals $\frac{\sum_k e^{z_k}}{\sum_j e^{z_j}} = 1$.
  • Axiom 3: Classes are mutually exclusive, so the probability of a set of classes is the sum of their individual outputs.

Softmax produces a valid probability distribution by construction.
:::

3. Consequences of the Axioms

From just three axioms, we derive all the rules of probability you will use in ML.

The Complement Rule

$$P(A^c) = 1 - P(A)$$

Proof: $A$ and $A^c$ are disjoint, so by Axiom 3: $P(A \cup A^c) = P(A) + P(A^c)$. Since $A \cup A^c = \Omega$, by Axiom 2: $P(A) + P(A^c) = 1$.

ML Connection: In binary classification, $P(\hat{y} = 1) + P(\hat{y} = 0) = 1$. The sigmoid output for the negative class is $1 - \sigma(z)$.
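The complement rule for a sigmoid output can be checked numerically. This sketch uses illustrative logits (the value 1.9 is chosen only because it gives roughly the 0.87 from the opening scenario) and also shows the identity $1 - \sigma(z) = \sigma(-z)$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 1.9])  # illustrative logits, not from a real model
p_pos = sigmoid(z)              # P(y_hat = 1)
p_neg = 1.0 - p_pos             # P(y_hat = 0) by the complement rule

print(p_pos.round(4))
print(np.allclose(p_pos + p_neg, 1.0))  # True
print(np.allclose(p_neg, sigmoid(-z)))  # True
```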

The Impossible Event

$$P(\emptyset) = 0$$

Proof: $\emptyset = \Omega^c$, so $P(\emptyset) = 1 - P(\Omega) = 1 - 1 = 0$.

Monotonicity

If $A \subseteq B$, then $P(A) \leq P(B)$.

Proof: $B = A \cup (B \setminus A)$, and these are disjoint. So $P(B) = P(A) + P(B \setminus A) \geq P(A)$.
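The three consequences so far can be verified exactly on a small example. This is a sketch assuming a fair six-sided die with the uniform measure; `prob` is an illustrative helper, and `Fraction` is used so the equalities hold exactly rather than up to floating-point error.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # fair six-sided die

def prob(event):
    """Uniform probability measure: |event| / |OMEGA|."""
    event = frozenset(event)
    assert event <= OMEGA, "an event must be a subset of the sample space"
    return Fraction(len(event), len(OMEGA))

A = {2, 4}     # A is a subset of B
B = {2, 4, 6}  # "roll is even"

print(prob(OMEGA - A) == 1 - prob(A))       # complement rule: True
print(prob(set()))                          # impossible event: 0
print(prob(A) <= prob(B))                   # monotonicity: True
print(prob(B) == prob(A) + prob(B - A))     # B = A ∪ (B \ A), disjoint: True
```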

Inclusion-Exclusion

For two arbitrary (not necessarily disjoint) events:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

For three events:

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$$

Inclusion-Exclusion (two events):

Regions: (a) = A only, (b) = B only, (c) = A ∩ B (the overlap)

P(A ∪ B) = P(a) + P(c) + P(b)
         = P(A) + P(B) - P(A ∩ B)   <-- subtract overlap counted twice
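The three-event formula is easy to get wrong by a sign, so it helps to check it by brute force. The sketch below enumerates all 36 outcomes of two fair dice and compares both sides exactly; the events `A`, `B`, and `C` are arbitrary choices for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of two fair dice.
OMEGA = set(product(range(1, 7), repeat=2))

def prob(event):
    """Uniform measure over the 36 equally likely outcomes."""
    return Fraction(len(event), len(OMEGA))

A = {w for w in OMEGA if w[0] % 2 == 0}  # first die is even
B = {w for w in OMEGA if sum(w) > 7}     # sum exceeds 7
C = {w for w in OMEGA if w[1] == 6}      # second die shows 6

lhs = prob(A | B | C)
rhs = (prob(A) + prob(B) + prob(C)
       - prob(A & B) - prob(A & C) - prob(B & C)
       + prob(A & B & C))

print(lhs, rhs, lhs == rhs)
```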

4. Conditional Probability

Conditional probability captures how our belief about event $A$ changes given that we know event $B$ occurred.

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad \text{provided } P(B) > 0$$

Intuition: We restrict our universe to $B$ (we know $B$ happened), and then ask: what fraction of $B$ also has $A$?

Without conditioning (the universe is Ω): P(A) = area(A)/area(Ω)

After conditioning on B (the universe shrinks to B): P(A|B) = area(A∩B)/area(B)

Example: Disease Testing

  • $D$ = person has disease, $P(D) = 0.01$ (1% prevalence)
  • $T$ = test is positive, $P(T \mid D) = 0.95$ (95% sensitivity)
  • $P(T \mid D^c) = 0.05$ (5% false positive rate)

What is $P(T)$? Since $D$ and $D^c$ partition the sample space, the law of total probability gives:

$$P(T) = P(T \mid D) P(D) + P(T \mid D^c) P(D^c) = 0.95 \times 0.01 + 0.05 \times 0.99 = 0.0095 + 0.0495 = 0.059$$
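The same arithmetic in code, using only the three numbers given above. As an extension that goes one step beyond the example, the sketch also applies Bayes' theorem (listed in the rules reference later in this lesson) to get $P(D \mid T)$; the variable names are illustrative.

```python
p_d = 0.01              # P(D): prevalence
p_t_given_d = 0.95      # P(T | D): sensitivity
p_t_given_not_d = 0.05  # P(T | D^c): false positive rate

# Law of total probability over the partition {D, D^c}
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

# Bayes' theorem: P(D | T) = P(T | D) P(D) / P(T)
p_d_given_t = p_t_given_d * p_d / p_t

print(f"P(T)     = {p_t:.4f}")          # 0.0590
print(f"P(D | T) = {p_d_given_t:.4f}")  # 0.1610
```

Even with a 95% sensitive test, a positive result here implies only about a 16% chance of disease, because the 1% prevalence means most positives come from the much larger healthy group.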

The Multiplication Rule

Rearranging the conditional probability definition:

$$P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$$

This extends to chains:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2 \mid A_1) \cdot P(A_3 \mid A_1, A_2) \cdots$$

This chain rule is the foundation of autoregressive language models:

$$P(\text{sentence}) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1, w_2) \cdots$$

:::tip Language Models and the Chain Rule
GPT-style models learn $P(w_t \mid w_1, w_2, \ldots, w_{t-1})$ for every position $t$. The probability of a complete sentence is the product of these conditional probabilities. Training maximizes the log of this joint probability (log-likelihood), which decomposes into a sum of conditional log-probabilities - making it tractable even for very long sequences.
:::
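A small numeric sketch of why the log form matters in practice. The conditional probabilities below are made-up values standing in for what a model might assign to each token; no real model is involved.

```python
import math

# Hypothetical values of P(w_t | w_1, ..., w_{t-1}) for a 4-token sentence
cond_probs = [0.20, 0.50, 0.10, 0.80]

joint = math.prod(cond_probs)                     # chain rule: product of conditionals
log_joint = sum(math.log(p) for p in cond_probs)  # same quantity in log space

print(joint)
print(math.isclose(math.exp(log_joint), joint))   # True

# For long sequences the raw product underflows, while the log-sum stays finite.
long_probs = [0.1] * 400
long_joint = math.prod(long_probs)
long_log = sum(math.log(p) for p in long_probs)

print(long_joint)  # 0.0 (underflows in double precision)
print(long_log)    # about -921.03
```

This is one reason implementations typically work with summed log-probabilities rather than multiplying raw probabilities.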

5. Independence

Events $A$ and $B$ are independent if knowing one tells you nothing about the other:

$$P(A \cap B) = P(A) \cdot P(B)$$

Equivalently (when $P(B) > 0$):

$$P(A \mid B) = P(A)$$

Mutual Independence vs Pairwise Independence

For $n$ events $A_1, \ldots, A_n$, mutual independence requires that for every subset $S \subseteq \{1, \ldots, n\}$ with at least two elements:

$$P\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i)$$

Pairwise independence ($P(A_i \cap A_j) = P(A_i)P(A_j)$ for all pairs) does not imply mutual independence.
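A standard counterexample makes the gap concrete: flip two fair coins and let the third event be "the flips differ". Each pair of events is independent, but the three together are not. The sketch below checks this exactly; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product([0, 1], repeat=2))  # two fair coin flips (x, y)

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))

def both(*events):
    """Intersection of events expressed as predicates."""
    return lambda w: all(e(w) for e in events)

A = lambda w: w[0] == 1     # first flip is heads
B = lambda w: w[1] == 1     # second flip is heads
C = lambda w: w[0] != w[1]  # the two flips differ (XOR)

pairwise = all(
    prob(both(e, f)) == prob(e) * prob(f)
    for e, f in [(A, B), (A, C), (B, C)]
)
triple = prob(both(A, B, C)) == prob(A) * prob(B) * prob(C)

print(pairwise)  # True
print(triple)    # False: P(A ∩ B ∩ C) = 0, but the product is 1/8
```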

Conditional Independence

$A$ and $B$ are conditionally independent given $C$ (written $A \perp B \mid C$) if:

$$P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C)$$

This is central to graphical models (Bayesian networks, Markov Random Fields) and the Naive Bayes classifier.

:::note Naive Bayes and Conditional Independence
Naive Bayes assumes that all features $x_1, x_2, \ldots, x_d$ are conditionally independent given the class label $y$:

$$P(x_1, x_2, \ldots, x_d \mid y) = \prod_{j=1}^d P(x_j \mid y)$$

This assumption is almost always false in practice (words in a sentence are correlated). Yet Naive Bayes works surprisingly well despite this. The "naive" refers to this naively optimistic independence assumption.
:::
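The factorization can be sketched with a toy Bernoulli Naive Bayes model. All parameters below (`prior`, `theta`) are hypothetical numbers chosen for illustration, not fitted to any data.

```python
import numpy as np

# Hypothetical parameters: two classes, three binary features.
prior = np.array([0.6, 0.4])        # P(y=0), P(y=1)
theta = np.array([[0.8, 0.1, 0.3],  # P(x_j = 1 | y=0)
                  [0.2, 0.7, 0.5]]) # P(x_j = 1 | y=1)

x = np.array([1, 0, 1])             # observed feature vector

# P(x | y) = prod_j P(x_j | y), using theta where x_j = 1 and 1 - theta where x_j = 0
likelihood = np.prod(np.where(x == 1, theta, 1 - theta), axis=1)

# Posterior over classes: prior times likelihood, normalized to sum to 1
posterior = prior * likelihood
posterior /= posterior.sum()

print(likelihood)  # per-class P(x | y)
print(posterior)   # per-class P(y | x)
```

With $d$ binary features, each class needs only $d$ numbers here, instead of a full table over all $2^d$ feature combinations.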

6. Python: Working with Probability Axioms

import numpy as np

np.random.seed(42)

# --- Axiom 1: Non-negativity ---
die_probs = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
print(f"Non-negativity: all probabilities >= 0: {np.all(die_probs >= 0)}")

# --- Axiom 2: Normalization ---
print(f"Normalization: sum = {die_probs.sum():.6f}")

# --- Axiom 3: Additivity ---
p_low = die_probs[:3].sum() # P({1, 2, 3})
p_high = die_probs[3:].sum() # P({4, 5, 6})
print(f"Additivity: P(<=3) + P(>=4) = {p_low:.4f} + {p_high:.4f} = {p_low + p_high:.4f}")

# --- Conditional Probability via simulation ---
n_samples = 1_000_000
die1 = np.random.randint(1, 7, n_samples)
die2 = np.random.randint(1, 7, n_samples)
total = die1 + die2

# Unconditional P(sum > 7) - expected: 15/36 ≈ 0.4167
p_sum_gt7 = (total > 7).mean()
print(f"P(sum > 7) = {p_sum_gt7:.4f}")

# Conditional P(sum > 7 | die1 = 4)
# die1=4 means we need die2 > 3, so P = 3/6 = 0.5
mask = (die1 == 4)
p_cond = (total[mask] > 7).mean()
print(f"P(sum > 7 | die1=4) = {p_cond:.4f} (expected: 0.5000)")

# --- Independence check ---
# P(die1=3 AND die2=5) should equal P(die1=3) * P(die2=5)
p_joint = ((die1 == 3) & (die2 == 5)).mean()
p_d1 = (die1 == 3).mean()
p_d2 = (die2 == 5).mean()

print(f"\nP(die1=3, die2=5) = {p_joint:.5f}")
print(f"P(die1=3) * P(die2=5) = {p_d1 * p_d2:.5f}")
print(f"Approximately equal: {np.isclose(p_joint, p_d1 * p_d2, atol=0.001)}")

# --- Inclusion-Exclusion ---
A = (die1 % 2 == 0) # die1 is even
B = (total > 7) # sum > 7

p_A = A.mean()
p_B = B.mean()
p_AB = (A & B).mean()

p_AorB_direct = (A | B).mean()
p_AorB_formula = p_A + p_B - p_AB

print(f"\nInclusion-Exclusion:")
print(f"P(A) = {p_A:.4f}, P(B) = {p_B:.4f}, P(A∩B) = {p_AB:.4f}")
print(f"P(A∪B) direct : {p_AorB_direct:.4f}")
print(f"P(A∪B) formula : {p_AorB_formula:.4f}")

# --- Softmax satisfies all three Kolmogorov axioms ---
def softmax(x):
    e_x = np.exp(x - x.max())  # subtract max for numerical stability
    return e_x / e_x.sum()

logits = np.array([2.5, 1.0, -0.5, 3.1, 0.8])
probs = softmax(logits)

print(f"\nLogits : {logits}")
print(f"Softmax out : {probs.round(4)}")
print(f"Axiom 1 (non-negative) : {np.all(probs >= 0)}")
print(f"Axiom 2 (sums to 1) : {probs.sum():.8f}")

7. ML Connection: Classification Probabilities and Calibration

In ML, the conditional probability $P(y = k \mid \mathbf{x})$ is what every classifier tries to learn.

Input Image x
      │
      ▼
┌───────────────────┐
│  Neural Network   │  =>  logits: [2.17, -0.5, 1.03]
│     f(x; θ)       │
└───────────────────┘
      │
      ▼ softmax

P(y=cat   | x) = 0.72
P(y=dog   | x) = 0.05
P(y=other | x) = 0.23
                 ──────
                 1.00   <- valid probability distribution

Calibration: A model's output of 0.72 is interpretable as "72% probability" only if the model is calibrated - that is, among all inputs where it outputs 0.72, it is right about 72% of the time. Many neural networks are overconfident: their reported confidence is higher than their actual accuracy. Post-hoc calibration methods such as Platt scaling and temperature scaling adjust the outputs so that they better match observed frequencies.
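One way to make "right about 72% of the time" measurable is to bin predictions by confidence and compare each bin's average confidence with its accuracy. The sketch below does this on synthetic data; the setup (a known `true_p`, and an overconfident model that reports `sqrt(true_p)`) and the helper `calibration_gap` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Synthetic assumption: true_p is the real chance that each prediction is correct.
true_p = rng.uniform(0.5, 1.0, n)
correct = rng.random(n) < true_p

calibrated_conf = true_p              # reports the true probability
overconfident_conf = np.sqrt(true_p)  # pushes every score towards 1

def calibration_gap(conf, correct, n_bins=10):
    """Bin by confidence; return the bin-weighted mean |accuracy - confidence|."""
    bins = np.linspace(0.5, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, bins) - 1, 0, n_bins - 1)
    gap = 0.0
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            gap += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return gap

gap_cal = calibration_gap(calibrated_conf, correct)
gap_over = calibration_gap(overconfident_conf, correct)

print(f"calibrated gap   : {gap_cal:.3f}")
print(f"overconfident gap: {gap_over:.3f}")
```

Both models output numbers that satisfy the probability axioms; only the calibrated one produces confidences that match how often it is actually right.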

8. Common Probability Rules Reference

| Rule | Formula |
| --- | --- |
| Complement | $P(A^c) = 1 - P(A)$ |
| Addition (general) | $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ |
| Addition (disjoint) | $P(A \cup B) = P(A) + P(B)$ |
| Multiplication | $P(A \cap B) = P(A \mid B) P(B)$ |
| Conditional | $P(A \mid B) = P(A \cap B) / P(B)$ |
| Independence | $P(A \cap B) = P(A) P(B)$ |
| Total Probability | $P(A) = \sum_i P(A \mid B_i) P(B_i)$ |
| Bayes' Theorem | $P(A \mid B) = P(B \mid A) P(A) / P(B)$ |

9. Interview Q&A

Q1: What does it mean for two events to be independent, and how is this different from mutually exclusive?

A: Two events are independent if $P(A \cap B) = P(A) P(B)$ - knowing one occurred does not change the probability of the other. Two events are mutually exclusive if $P(A \cap B) = 0$ - they cannot both occur. These are actually opposite extremes. If $A$ and $B$ are mutually exclusive and both have positive probability, knowing $A$ occurred tells you with certainty that $B$ did not - meaning they are maximally dependent, not independent. In ML: classes in a multiclass classifier are mutually exclusive (an image is either a cat or a dog, not both). Features fed to a model might be approximately independent (Naive Bayes assumption) or highly dependent (most real data).

Q2: What are the Kolmogorov axioms and why do they matter for ML?

A: The three axioms are: (1) non-negativity: $P(A) \geq 0$; (2) normalization: $P(\Omega) = 1$; (3) countable additivity for disjoint events. They matter because they define what constitutes a valid probability distribution. The softmax function satisfies all three by construction - that is not an accident. When evaluating model outputs (does my model produce valid probabilities?), you are checking these axioms. They also underpin every rule we use to reason about model predictions: complement, inclusion-exclusion, Bayes' theorem all follow as theorems from these three axioms.

Q3: How does the chain rule of probability connect to autoregressive language models?

A: The chain rule states $P(A_1, A_2, \ldots, A_n) = \prod_{t=1}^n P(A_t \mid A_1, \ldots, A_{t-1})$. For text, this becomes $P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^T P(w_t \mid w_1, \ldots, w_{t-1})$. GPT-style models learn each conditional $P(w_t \mid \text{context})$ via a transformer. During training, we maximize the log of this joint probability, which is equivalent to minimizing the sum of cross-entropy losses over positions - tractable to compute. During inference, we sample from each conditional to generate the next token. The training objective of GPT is the chain rule of probability applied directly to sequences.

Q4: What is conditional independence and why does it matter in graphical models?

A: $A \perp B \mid C$ means $P(A, B \mid C) = P(A \mid C) P(B \mid C)$ - given $C$, knowing $A$ tells you nothing additional about $B$. In Bayesian networks, each node is conditionally independent of its non-descendants given its parents. This structure allows the joint distribution over $n$ variables to factorize into a product of small conditional distributions, making inference tractable. Without conditional independence structure, representing the joint distribution over $n$ binary variables requires $2^n - 1$ free parameters. In Naive Bayes, features are conditionally independent given the class label, allowing $P(\mathbf{x} \mid y) = \prod_j P(x_j \mid y)$ to be computed efficiently even in high dimensions.

Q5: What goes wrong if a model's output does not satisfy the probability axioms?

A: Several practical problems arise. If outputs do not sum to 1, argmax still picks a class, but the scores can no longer be read as class probabilities, and cross-entropy loses its interpretation as a negative log-likelihood. If any output is negative, its log is undefined, so the log-likelihood cannot be computed. If outputs satisfy the axioms but are not calibrated (they do not reflect true conditional probabilities), then confidence scores are misleading in production: a model saying "95% confident" might be right only 60% of the time. Miscalibrated models cause serious problems in high-stakes applications - medical diagnosis, autonomous driving, financial risk modeling - where downstream systems rely on these uncertainty estimates for decision-making.

