Random Variables and Distributions
Reading time: ~40 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist
The ML Scenario That Motivates This Lesson
You're building a language model and someone asks: "What distribution does the model output follow?"
The raw answer is: a categorical distribution - softmax turns logits into a probability vector over a vocabulary. But more deeply: the logits themselves, before softmax, can be thought of as a random variable that follows a distribution shaped by the training data. The weights of the model are random variables under a Bayesian interpretation. Even the loss on a new batch is a random variable - its expectation and variance determine how reliable our training signal is.
Every quantity in ML that varies due to randomness in data, initialization, dropout, or sampling is a random variable. The formal tools in this lesson - PMF, PDF, CDF, transformations - let you reason precisely about these quantities, compute their probabilities, and understand how they change when you apply functions to them (like softmax, log, sigmoid).
1. What Is a Random Variable?
A random variable is a function from the sample space to a measurable space (usually ):
It maps outcomes of a random experiment to numbers.
Example: Roll two dice. The sample space is . Define (the sum). Then is a random variable that takes values in .
:::note Random Variables Are Functions, Not Numbers A random variable is NOT a fixed number - it is a function. However, we often talk about "the value of " in context of a specific outcome. The randomness comes from the underlying sample space, not from itself (which is a deterministic function). Once an outcome is drawn, is a fixed number. :::
In ML:
| Random Variable | What It Represents |
|---|---|
| Model output | Predicted class or score for a random input |
| Loss | Cross-entropy or MSE on a random mini-batch |
| Gradient | Stochastic gradient on a mini-batch |
| Latent code | Sampled representation in a VAE |
| Dropout mask | Random binary vector applied to activations |
2. Discrete Random Variables
A random variable is discrete if it takes a countable set of values.
Probability Mass Function (PMF)
The PMF assigns a probability to each possible value.
Requirements for a valid PMF:
- for all
- (sum over all possible values)
Example: Fair die
| 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|
| 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 |
Example: Biased coin (heads), (tails). This is a Bernoulli distribution.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
# PMF of Binomial(n=10, p=0.4)
n, p = 10, 0.4
x_vals = np.arange(0, n+1)
pmf_vals = binom.pmf(x_vals, n, p)
print("PMF of Binomial(10, 0.4):")
for x, px in zip(x_vals, pmf_vals):
bar = '#' * int(px * 100)
print(f" P(X={x:2d}) = {px:.4f} {bar}")
print(f"\nSum of PMF = {pmf_vals.sum():.6f}") # Must be 1
3. Continuous Random Variables
A random variable is continuous if it takes values in a continuous range (e.g., , ).
For continuous RVs, for any single point (the probability of hitting exactly one value out of infinitely many is zero). Instead we use densities.
Probability Density Function (PDF)
The PDF satisfies:
Requirements for a valid PDF:
- for all
Note: can be greater than 1 (it is a density, not a probability). Only integrals of give probabilities.
Example: Standard Gaussian
from scipy.stats import norm
import numpy as np
# P(-1 <= X <= 1) for standard normal
prob = norm.cdf(1) - norm.cdf(-1)
print(f"P(-1 <= X <= 1) for N(0,1): {prob:.4f}") # ~0.6827
prob_2sigma = norm.cdf(2) - norm.cdf(-2)
print(f"P(-2 <= X <= 2) for N(0,1): {prob_2sigma:.4f}") # ~0.9545
prob_3sigma = norm.cdf(3) - norm.cdf(-3)
print(f"P(-3 <= X <= 3) for N(0,1): {prob_3sigma:.4f}") # ~0.9973
# This is the famous 68-95-99.7 rule
4. Cumulative Distribution Function (CDF)
The CDF works for both discrete and continuous RVs:
For continuous RVs:
For discrete RVs:
Properties of the CDF
- Monotone non-decreasing: if
- Right-continuous:
- Limits: and
- Recovers probabilities:
CDF of Standard Normal:
F(x)
1.0 │ ╭──────────────
│ ╭────╯
0.5 │ ────╯
│ ╭────╯
0.0 │──────────╯
└──────────────────────────────────── x
-3 -2 -1 0 1 2 3
Relationship Between PDF and CDF
The PDF is the derivative of the CDF. This means the CDF is the antiderivative of the PDF.
from scipy.stats import norm
import numpy as np
# CDF and PDF of standard normal
x = np.linspace(-4, 4, 1000)
cdf_vals = norm.cdf(x)
pdf_vals = norm.pdf(x)
# Numerical derivative of CDF should match PDF
dx = x[1] - x[0]
numerical_pdf = np.gradient(cdf_vals, dx)
# Check agreement
max_error = np.max(np.abs(numerical_pdf[10:-10] - pdf_vals[10:-10]))
print(f"Max error between d/dx(CDF) and PDF: {max_error:.6f}")
# Computing probabilities from CDF
p_between = norm.cdf(1.5) - norm.cdf(-0.5)
print(f"P(-0.5 < X < 1.5) = {p_between:.4f}")
5. The Support of a Distribution
The support of a distribution is the set of values where the PMF/PDF is non-zero:
| Distribution | Support |
|---|---|
| Bernoulli() | |
| Binomial(, ) | |
| Poisson() | |
| Normal(, ) | |
| Exponential() | |
| Beta(, ) | |
| Uniform(, ) |
:::warning Support Matters in ML The sigmoid function maps , making it appropriate for outputs where the true distribution has support on (e.g., probability of positive class). Using sigmoid for regression targets outside is a support mismatch - the model cannot predict values outside that range. Similarly, if your loss function is , you need , which requires the support of to be strictly positive. This is why we add a small (e.g., ) before taking logs in practice. :::
6. Transformations of Random Variables
Given with known distribution, what is the distribution of ?
Discrete Case: Direct Substitution
Continuous Case: Change of Variables
If is monotone and differentiable, the PDF of is:
The term is the Jacobian - it accounts for how the transformation stretches or compresses the density.
Example: Log transformation
If and , then follows a log-normal distribution:
Stock prices, income distributions, and word frequencies in text corpora follow approximately log-normal distributions.
import numpy as np
from scipy.stats import norm, lognorm
np.random.seed(42)
n_samples = 100_000
# X ~ N(0, 1)
X = np.random.randn(n_samples)
# Y = exp(X) ~ LogNormal(mu=0, sigma=1)
Y = np.exp(X)
# Compare with scipy's lognormal
theoretical_mean = np.exp(0 + 0.5 * 1**2) # e^(mu + sigma^2/2)
theoretical_var = (np.exp(1**2) - 1) * np.exp(2*0 + 1**2)
print(f"Empirical mean of Y: {Y.mean():.4f}, Theoretical: {theoretical_mean:.4f}")
print(f"Empirical var of Y: {Y.var():.4f}, Theoretical: {theoretical_var:.4f}")
Affine Transformations
For :
The distribution shape (e.g., Gaussian) is preserved under affine transformation for some families.
# Affine transformation of a standard normal
X = np.random.randn(100_000) # X ~ N(0, 1)
mu, sigma = 3.0, 2.0
Y = sigma * X + mu # Y ~ N(mu, sigma^2)
print(f"Y = {sigma}*X + {mu}")
print(f"E[Y] = {Y.mean():.4f} (expected: {mu})")
print(f"Std[Y] = {Y.std():.4f} (expected: {sigma})")
7. The Quantile Function (Inverse CDF)
The quantile function is the inverse of the CDF:
It gives the value such that .
| (quantile) | for |
|---|---|
| 0.025 | -1.96 |
| 0.05 | -1.645 |
| 0.5 | 0 (median) |
| 0.95 | 1.645 |
| 0.975 | 1.96 |
from scipy.stats import norm
# The famous 1.96 in confidence intervals
q025 = norm.ppf(0.025) # ppf = percent point function = quantile function
q975 = norm.ppf(0.975)
print(f"2.5th percentile: {q025:.4f}")
print(f"97.5th percentile: {q975:.4f}")
print(f"95% CI: [{q025:.4f}, {q975:.4f}]")
The quantile function is used in the inverse CDF sampling method (Lesson 08) - a fundamental technique for sampling from any distribution if you can sample uniform random variables.
8. ML Connection: Model Outputs as Random Variables
The Softmax Distribution
The softmax output of a -class classifier is a random variable taking values in the probability simplex :
Given a fixed model and a random input drawn from the data distribution, is a random vector.
Temperature Scaling and Distribution Sharpness
In language models, we often use a temperature parameter to control the sharpness of the sampling distribution:
- : distribution becomes a point mass at the argmax (greedy decoding)
- : standard softmax
- : distribution approaches uniform (maximum entropy)
import numpy as np
def softmax_temperature(logits, T=1.0):
"""Apply softmax with temperature T."""
z = logits / T
e_z = np.exp(z - z.max())
return e_z / e_z.sum()
logits = np.array([3.0, 1.0, 0.5, 0.1])
for T in [0.1, 0.5, 1.0, 2.0, 10.0]:
p = softmax_temperature(logits, T)
entropy = -np.sum(p * np.log(p + 1e-10))
print(f"T={T:4.1f}: probs={p.round(3)}, entropy={entropy:.3f}")
Output (conceptual):
T= 0.1: probs=[1.000, 0.000, 0.000, 0.000], entropy=0.000 (deterministic)
T= 1.0: probs=[0.779, 0.106, 0.064, 0.043], entropy=0.746 (standard)
T=10.0: probs=[0.264, 0.251, 0.246, 0.239], entropy=1.384 (near uniform)
Dropout as a Random Variable
Dropout replaces each activation with where .
The output of a dropout layer is a random variable even at test time conceptually (though at test time we use the expectation). Monte Carlo Dropout keeps dropout active at test time and samples multiple predictions to estimate uncertainty - treating the model output explicitly as a random variable.
import numpy as np
def mc_dropout_predict(model_outputs_list):
"""
Simulate Monte Carlo Dropout:
Run the model T times with dropout active,
treat predictions as samples from a distribution.
"""
outputs = np.array(model_outputs_list) # shape: (T, n_classes)
mean_pred = outputs.mean(axis=0)
uncertainty = outputs.std(axis=0) # epistemic uncertainty estimate
return mean_pred, uncertainty
# Simulate 10 stochastic forward passes
T = 10
n_classes = 3
np.random.seed(42)
# Simulate: each pass perturbs the logits slightly
base_logits = np.array([2.0, 0.5, -1.0])
def softmax(x):
e = np.exp(x - x.max())
return e / e.sum()
mc_preds = []
for _ in range(T):
noisy_logits = base_logits + 0.5 * np.random.randn(n_classes)
mc_preds.append(softmax(noisy_logits))
mean_p, std_p = mc_dropout_predict(mc_preds)
print(f"Mean prediction: {mean_p.round(4)}")
print(f"Uncertainty : {std_p.round(4)}")
9. Summary: PMF vs PDF vs CDF
Type | Symbol | Definition | Valid when
---------|-----------|-------------------------------|---------------------------
PMF | p(x) | P(X = x) | X is discrete
PDF | f(x) | P(a≤X≤b) = ∫ f(x)dx | X is continuous
CDF | F(x) | P(X ≤ x) | Both discrete & continuous
Quantile | F⁻¹(u) | x such that P(X ≤ x) = u | Both discrete & continuous
10. Interview Q&A
Q1: What is the difference between a PMF and a PDF, and can a PDF value exceed 1?
A: A PMF (probability mass function) assigns exact probabilities to discrete values: . A PDF (probability density function) describes continuous distributions: is not a probability itself, but is. Yes, a PDF value can exceed 1 - density is not probability. For example, the Uniform distribution on has PDF for , since the area under the curve (the probability) is . The key constraint is that the total integral must equal 1, not that each density value must be .
Q2: How is the CDF useful in practice for ML?
A: The CDF is useful in three major ways. First, it lets you compute interval probabilities: . Second, the inverse CDF (quantile function) is the foundation of the inverse transform sampling method - if , then has distribution - enabling sampling from any distribution. Third, computing p-values in statistical testing requires the CDF: a p-value is for a test statistic , computed directly from the CDF of the null distribution.
Q3: If , what is the distribution of ?
A: follows a chi-squared distribution with 1 degree of freedom: . We can derive this using the change-of-variables formula. Since is not monotone on , we split: . Differentiating: , which is indeed . This comes up in hypothesis testing and in the analysis of squared errors in regression.
Q4: What does it mean for a model output to have a distribution, and why does this matter?
A: When we feed a random input (drawn from the data distribution) through a model, the output is a function of a random variable, making it a random variable itself. Its distribution over the data distribution tells us: how spread out are predictions, what is the typical confidence level, are there systematic biases. This matters for: (1) calibration - the empirical distribution of confidences should match empirical accuracy; (2) uncertainty quantification - high-variance predictions signal low confidence; (3) distributional shift detection - if test inputs come from a different distribution, the output distribution shifts in detectable ways; (4) Monte Carlo Dropout explicitly models the output as a distribution by sampling multiple stochastic forward passes.
Q5: Explain the relationship between the sigmoid function and the Bernoulli distribution.
A: The sigmoid function outputs a value in , which parameterizes a Bernoulli distribution: . Specifically, and . The binary cross-entropy loss in logistic regression is exactly the negative log-likelihood of this Bernoulli model: . Minimizing cross-entropy is equivalent to maximizing the log-likelihood of a Bernoulli generative model. This connects logistic regression to probabilistic modeling: we are not just fitting a classifier, we are fitting the conditional distribution assuming it is Bernoulli with parameter .
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Probability Distributions demo on the EngineersOfAI Playground - no code required.
:::
