Skip to main content

Random Variables and Distributions

Reading time: ~40 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist

The ML Scenario That Motivates This Lesson

You're building a language model and someone asks: "What distribution does the model output follow?"

The raw answer is: a categorical distribution - softmax turns logits into a probability vector over a vocabulary. But more deeply: the logits themselves, before softmax, can be thought of as a random variable that follows a distribution shaped by the training data. The weights of the model are random variables under a Bayesian interpretation. Even the loss on a new batch is a random variable - its expectation and variance determine how reliable our training signal is.

Every quantity in ML that varies due to randomness in data, initialization, dropout, or sampling is a random variable. The formal tools in this lesson - PMF, PDF, CDF, transformations - let you reason precisely about these quantities, compute their probabilities, and understand how they change when you apply functions to them (like softmax, log, sigmoid).

1. What Is a Random Variable?

A random variable XX is a function from the sample space Ω\Omega to a measurable space (usually R\mathbb{R}):

X:ΩRX: \Omega \to \mathbb{R}

It maps outcomes of a random experiment to numbers.

Example: Roll two dice. The sample space is Ω={(i,j):i,j{1,,6}}\Omega = \{(i, j) : i, j \in \{1,\ldots,6\}\}. Define X(i,j)=i+jX(i, j) = i + j (the sum). Then XX is a random variable that takes values in {2,3,,12}\{2, 3, \ldots, 12\}.

:::note Random Variables Are Functions, Not Numbers A random variable is NOT a fixed number - it is a function. However, we often talk about "the value of XX" in context of a specific outcome. The randomness comes from the underlying sample space, not from XX itself (which is a deterministic function). Once an outcome ω\omega is drawn, X(ω)X(\omega) is a fixed number. :::

In ML:

Random VariableWhat It Represents
Model output y^\hat{y}Predicted class or score for a random input
Loss \ellCross-entropy or MSE on a random mini-batch
Gradient ggStochastic gradient on a mini-batch
Latent code z\mathbf{z}Sampled representation in a VAE
Dropout maskRandom binary vector applied to activations

2. Discrete Random Variables

A random variable is discrete if it takes a countable set of values.

Probability Mass Function (PMF)

The PMF pX(x)=P(X=x)p_X(x) = P(X = x) assigns a probability to each possible value.

Requirements for a valid PMF:

  1. pX(x)0p_X(x) \geq 0 for all xx
  2. xpX(x)=1\sum_{x} p_X(x) = 1 (sum over all possible values)

Example: Fair die

xx123456
P(X=x)P(X=x)1/61/61/61/61/61/6

Example: Biased coin P(X=1)=0.7P(X=1) = 0.7 (heads), P(X=0)=0.3P(X=0) = 0.3 (tails). This is a Bernoulli distribution.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# PMF of Binomial(n=10, p=0.4)
n, p = 10, 0.4
x_vals = np.arange(0, n+1)
pmf_vals = binom.pmf(x_vals, n, p)

print("PMF of Binomial(10, 0.4):")
for x, px in zip(x_vals, pmf_vals):
bar = '#' * int(px * 100)
print(f" P(X={x:2d}) = {px:.4f} {bar}")

print(f"\nSum of PMF = {pmf_vals.sum():.6f}") # Must be 1

3. Continuous Random Variables

A random variable is continuous if it takes values in a continuous range (e.g., R\mathbb{R}, [0,1][0,1]).

For continuous RVs, P(X=x)=0P(X = x) = 0 for any single point (the probability of hitting exactly one value out of infinitely many is zero). Instead we use densities.

Probability Density Function (PDF)

The PDF fX(x)f_X(x) satisfies:

P(aXb)=abfX(x)dxP(a \leq X \leq b) = \int_a^b f_X(x) \, dx

Requirements for a valid PDF:

  1. fX(x)0f_X(x) \geq 0 for all xx
  2. fX(x)dx=1\int_{-\infty}^{\infty} f_X(x) \, dx = 1

Note: fX(x)f_X(x) can be greater than 1 (it is a density, not a probability). Only integrals of ff give probabilities.

Example: Standard Gaussian

f(x)=12πex2/2f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}

from scipy.stats import norm
import numpy as np

# P(-1 <= X <= 1) for standard normal
prob = norm.cdf(1) - norm.cdf(-1)
print(f"P(-1 <= X <= 1) for N(0,1): {prob:.4f}") # ~0.6827

prob_2sigma = norm.cdf(2) - norm.cdf(-2)
print(f"P(-2 <= X <= 2) for N(0,1): {prob_2sigma:.4f}") # ~0.9545

prob_3sigma = norm.cdf(3) - norm.cdf(-3)
print(f"P(-3 <= X <= 3) for N(0,1): {prob_3sigma:.4f}") # ~0.9973

# This is the famous 68-95-99.7 rule

4. Cumulative Distribution Function (CDF)

The CDF works for both discrete and continuous RVs:

FX(x)=P(Xx)F_X(x) = P(X \leq x)

For continuous RVs: FX(x)=xfX(t)dtF_X(x) = \int_{-\infty}^x f_X(t) \, dt

For discrete RVs: FX(x)=txP(X=t)F_X(x) = \sum_{t \leq x} P(X = t)

Properties of the CDF

  1. Monotone non-decreasing: F(x1)F(x2)F(x_1) \leq F(x_2) if x1x2x_1 \leq x_2
  2. Right-continuous: limtx+F(t)=F(x)\lim_{t \to x^+} F(t) = F(x)
  3. Limits: limxF(x)=0\lim_{x \to -\infty} F(x) = 0 and limxF(x)=1\lim_{x \to \infty} F(x) = 1
  4. Recovers probabilities: P(a<Xb)=F(b)F(a)P(a < X \leq b) = F(b) - F(a)
CDF of Standard Normal:

F(x)
1.0 │ ╭──────────────
│ ╭────╯
0.5 │ ────╯
│ ╭────╯
0.0 │──────────╯
└──────────────────────────────────── x
-3 -2 -1 0 1 2 3

Relationship Between PDF and CDF

fX(x)=ddxFX(x)f_X(x) = \frac{d}{dx} F_X(x)

The PDF is the derivative of the CDF. This means the CDF is the antiderivative of the PDF.

from scipy.stats import norm
import numpy as np

# CDF and PDF of standard normal
x = np.linspace(-4, 4, 1000)
cdf_vals = norm.cdf(x)
pdf_vals = norm.pdf(x)

# Numerical derivative of CDF should match PDF
dx = x[1] - x[0]
numerical_pdf = np.gradient(cdf_vals, dx)

# Check agreement
max_error = np.max(np.abs(numerical_pdf[10:-10] - pdf_vals[10:-10]))
print(f"Max error between d/dx(CDF) and PDF: {max_error:.6f}")

# Computing probabilities from CDF
p_between = norm.cdf(1.5) - norm.cdf(-0.5)
print(f"P(-0.5 < X < 1.5) = {p_between:.4f}")

5. The Support of a Distribution

The support of a distribution is the set of values where the PMF/PDF is non-zero:

supp(X)={x:P(X=x)>0}(discrete)\text{supp}(X) = \{x : P(X = x) > 0\} \quad \text{(discrete)} supp(X)={x:fX(x)>0}(continuous)\text{supp}(X) = \{x : f_X(x) > 0\} \quad \text{(continuous)}

DistributionSupport
Bernoulli(pp){0,1}\{0, 1\}
Binomial(nn, pp){0,1,,n}\{0, 1, \ldots, n\}
Poisson(λ\lambda){0,1,2,}\{0, 1, 2, \ldots\}
Normal(μ\mu, σ2\sigma^2)(,)(-\infty, \infty)
Exponential(λ\lambda)[0,)[0, \infty)
Beta(α\alpha, β\beta)[0,1][0, 1]
Uniform(aa, bb)[a,b][a, b]

:::warning Support Matters in ML The sigmoid function maps R(0,1)\mathbb{R} \to (0, 1), making it appropriate for outputs where the true distribution has support on [0,1][0, 1] (e.g., probability of positive class). Using sigmoid for regression targets outside [0,1][0,1] is a support mismatch - the model cannot predict values outside that range. Similarly, if your loss function is logp-\log p, you need p>0p > 0, which requires the support of pp to be strictly positive. This is why we add a small ϵ\epsilon (e.g., 10810^{-8}) before taking logs in practice. :::

6. Transformations of Random Variables

Given XX with known distribution, what is the distribution of Y=g(X)Y = g(X)?

Discrete Case: Direct Substitution

P(Y=y)=x:g(x)=yP(X=x)P(Y = y) = \sum_{x : g(x) = y} P(X = x)

Continuous Case: Change of Variables

If gg is monotone and differentiable, the PDF of Y=g(X)Y = g(X) is:

fY(y)=fX(g1(y))ddyg1(y)f_Y(y) = f_X\left(g^{-1}(y)\right) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|

The term ddyg1(y)\left|\frac{d}{dy} g^{-1}(y)\right| is the Jacobian - it accounts for how the transformation stretches or compresses the density.

Example: Log transformation

If XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2) and Y=eXY = e^X, then YY follows a log-normal distribution:

fY(y)=1yσ2πexp((lnyμ)22σ2),y>0f_Y(y) = \frac{1}{y \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right), \quad y > 0

Stock prices, income distributions, and word frequencies in text corpora follow approximately log-normal distributions.

import numpy as np
from scipy.stats import norm, lognorm

np.random.seed(42)
n_samples = 100_000

# X ~ N(0, 1)
X = np.random.randn(n_samples)

# Y = exp(X) ~ LogNormal(mu=0, sigma=1)
Y = np.exp(X)

# Compare with scipy's lognormal
theoretical_mean = np.exp(0 + 0.5 * 1**2) # e^(mu + sigma^2/2)
theoretical_var = (np.exp(1**2) - 1) * np.exp(2*0 + 1**2)

print(f"Empirical mean of Y: {Y.mean():.4f}, Theoretical: {theoretical_mean:.4f}")
print(f"Empirical var of Y: {Y.var():.4f}, Theoretical: {theoretical_var:.4f}")

Affine Transformations

For Y=aX+bY = aX + b:

E[Y]=aE[X]+b\mathbb{E}[Y] = a\mathbb{E}[X] + b Var(Y)=a2Var(X)\text{Var}(Y) = a^2 \text{Var}(X)

The distribution shape (e.g., Gaussian) is preserved under affine transformation for some families.

# Affine transformation of a standard normal
X = np.random.randn(100_000) # X ~ N(0, 1)
mu, sigma = 3.0, 2.0
Y = sigma * X + mu # Y ~ N(mu, sigma^2)

print(f"Y = {sigma}*X + {mu}")
print(f"E[Y] = {Y.mean():.4f} (expected: {mu})")
print(f"Std[Y] = {Y.std():.4f} (expected: {sigma})")

7. The Quantile Function (Inverse CDF)

The quantile function F1(u)F^{-1}(u) is the inverse of the CDF:

F1(u)=inf{x:F(x)u}F^{-1}(u) = \inf\{x : F(x) \geq u\}

It gives the value xx such that P(Xx)=uP(X \leq x) = u.

uu (quantile)F1(u)F^{-1}(u) for N(0,1)\mathcal{N}(0,1)
0.025-1.96
0.05-1.645
0.50 (median)
0.951.645
0.9751.96
from scipy.stats import norm

# The famous 1.96 in confidence intervals
q025 = norm.ppf(0.025) # ppf = percent point function = quantile function
q975 = norm.ppf(0.975)
print(f"2.5th percentile: {q025:.4f}")
print(f"97.5th percentile: {q975:.4f}")
print(f"95% CI: [{q025:.4f}, {q975:.4f}]")

The quantile function is used in the inverse CDF sampling method (Lesson 08) - a fundamental technique for sampling from any distribution if you can sample uniform random variables.

8. ML Connection: Model Outputs as Random Variables

The Softmax Distribution

The softmax output of a KK-class classifier is a random variable taking values in the probability simplex ΔK1\Delta^{K-1}:

p^=softmax(z)=ez1Tez\hat{\mathbf{p}} = \text{softmax}(\mathbf{z}) = \frac{e^{\mathbf{z}}}{\mathbf{1}^T e^{\mathbf{z}}}

Given a fixed model and a random input x\mathbf{x} drawn from the data distribution, p^\hat{\mathbf{p}} is a random vector.

Temperature Scaling and Distribution Sharpness

In language models, we often use a temperature parameter TT to control the sharpness of the sampling distribution:

p^kezk/T\hat{p}_k \propto e^{z_k / T}

  • T0T \to 0: distribution becomes a point mass at the argmax (greedy decoding)
  • T=1T = 1: standard softmax
  • TT \to \infty: distribution approaches uniform (maximum entropy)
import numpy as np

def softmax_temperature(logits, T=1.0):
"""Apply softmax with temperature T."""
z = logits / T
e_z = np.exp(z - z.max())
return e_z / e_z.sum()

logits = np.array([3.0, 1.0, 0.5, 0.1])

for T in [0.1, 0.5, 1.0, 2.0, 10.0]:
p = softmax_temperature(logits, T)
entropy = -np.sum(p * np.log(p + 1e-10))
print(f"T={T:4.1f}: probs={p.round(3)}, entropy={entropy:.3f}")

Output (conceptual):

T= 0.1: probs=[1.000, 0.000, 0.000, 0.000], entropy=0.000 (deterministic)
T= 1.0: probs=[0.779, 0.106, 0.064, 0.043], entropy=0.746 (standard)
T=10.0: probs=[0.264, 0.251, 0.246, 0.239], entropy=1.384 (near uniform)

Dropout as a Random Variable

Dropout replaces each activation aia_i with a~i=aimi/(1p)\tilde{a}_i = a_i \cdot m_i / (1-p) where miBernoulli(1p)m_i \sim \text{Bernoulli}(1-p).

The output of a dropout layer is a random variable even at test time conceptually (though at test time we use the expectation). Monte Carlo Dropout keeps dropout active at test time and samples multiple predictions to estimate uncertainty - treating the model output explicitly as a random variable.

import numpy as np

def mc_dropout_predict(model_outputs_list):
"""
Simulate Monte Carlo Dropout:
Run the model T times with dropout active,
treat predictions as samples from a distribution.
"""
outputs = np.array(model_outputs_list) # shape: (T, n_classes)
mean_pred = outputs.mean(axis=0)
uncertainty = outputs.std(axis=0) # epistemic uncertainty estimate
return mean_pred, uncertainty

# Simulate 10 stochastic forward passes
T = 10
n_classes = 3
np.random.seed(42)

# Simulate: each pass perturbs the logits slightly
base_logits = np.array([2.0, 0.5, -1.0])

def softmax(x):
e = np.exp(x - x.max())
return e / e.sum()

mc_preds = []
for _ in range(T):
noisy_logits = base_logits + 0.5 * np.random.randn(n_classes)
mc_preds.append(softmax(noisy_logits))

mean_p, std_p = mc_dropout_predict(mc_preds)
print(f"Mean prediction: {mean_p.round(4)}")
print(f"Uncertainty : {std_p.round(4)}")

9. Summary: PMF vs PDF vs CDF

Type | Symbol | Definition | Valid when
---------|-----------|-------------------------------|---------------------------
PMF | p(x) | P(X = x) | X is discrete
PDF | f(x) | P(a≤X≤b) = ∫ f(x)dx | X is continuous
CDF | F(x) | P(X ≤ x) | Both discrete & continuous
Quantile | F⁻¹(u) | x such that P(X ≤ x) = u | Both discrete & continuous

10. Interview Q&A

Q1: What is the difference between a PMF and a PDF, and can a PDF value exceed 1?

A: A PMF (probability mass function) assigns exact probabilities to discrete values: p(x)=P(X=x)p(x) = P(X = x). A PDF (probability density function) describes continuous distributions: f(x)f(x) is not a probability itself, but P(aXb)=abf(x)dxP(a \leq X \leq b) = \int_a^b f(x) dx is. Yes, a PDF value can exceed 1 - density is not probability. For example, the Uniform distribution on [0,0.1][0, 0.1] has PDF f(x)=10f(x) = 10 for x[0,0.1]x \in [0, 0.1], since the area under the curve (the probability) is 10×0.1=110 \times 0.1 = 1. The key constraint is that the total integral must equal 1, not that each density value must be 1\leq 1.

Q2: How is the CDF useful in practice for ML?

A: The CDF F(x)=P(Xx)F(x) = P(X \leq x) is useful in three major ways. First, it lets you compute interval probabilities: P(a<Xb)=F(b)F(a)P(a < X \leq b) = F(b) - F(a). Second, the inverse CDF (quantile function) is the foundation of the inverse transform sampling method - if UUniform(0,1)U \sim \text{Uniform}(0,1), then F1(U)F^{-1}(U) has distribution FF - enabling sampling from any distribution. Third, computing p-values in statistical testing requires the CDF: a p-value is P(Ttobserved)P(T \geq t_\text{observed}) for a test statistic TT, computed directly from the CDF of the null distribution.

Q3: If XN(0,1)X \sim \mathcal{N}(0, 1), what is the distribution of Y=X2Y = X^2?

A: Y=X2Y = X^2 follows a chi-squared distribution with 1 degree of freedom: Yχ2(1)Y \sim \chi^2(1). We can derive this using the change-of-variables formula. Since g(x)=x2g(x) = x^2 is not monotone on R\mathbb{R}, we split: P(Yy)=P(X2y)=P(yXy)=FX(y)FX(y)P(Y \leq y) = P(X^2 \leq y) = P(-\sqrt{y} \leq X \leq \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}). Differentiating: fY(y)=12y[fX(y)+fX(y)]=12πyey/2f_Y(y) = \frac{1}{2\sqrt{y}}[f_X(\sqrt{y}) + f_X(-\sqrt{y})] = \frac{1}{\sqrt{2\pi y}} e^{-y/2}, which is indeed χ2(1)\chi^2(1). This comes up in hypothesis testing and in the analysis of squared errors in regression.

Q4: What does it mean for a model output to have a distribution, and why does this matter?

A: When we feed a random input x\mathbf{x} (drawn from the data distribution) through a model, the output y^=f(x;θ)\hat{y} = f(\mathbf{x}; \theta) is a function of a random variable, making it a random variable itself. Its distribution over the data distribution tells us: how spread out are predictions, what is the typical confidence level, are there systematic biases. This matters for: (1) calibration - the empirical distribution of confidences should match empirical accuracy; (2) uncertainty quantification - high-variance predictions signal low confidence; (3) distributional shift detection - if test inputs come from a different distribution, the output distribution shifts in detectable ways; (4) Monte Carlo Dropout explicitly models the output as a distribution by sampling multiple stochastic forward passes.

Q5: Explain the relationship between the sigmoid function and the Bernoulli distribution.

A: The sigmoid function σ(z)=1/(1+ez)\sigma(z) = 1/(1 + e^{-z}) outputs a value in (0,1)(0, 1), which parameterizes a Bernoulli distribution: y^Bernoulli(σ(z))\hat{y} \sim \text{Bernoulli}(\sigma(z)). Specifically, P(y=1)=σ(z)P(y=1) = \sigma(z) and P(y=0)=1σ(z)P(y=0) = 1 - \sigma(z). The binary cross-entropy loss in logistic regression is exactly the negative log-likelihood of this Bernoulli model: L=[ylogσ(z)+(1y)log(1σ(z))]\mathcal{L} = -[y \log \sigma(z) + (1-y) \log(1-\sigma(z))]. Minimizing cross-entropy is equivalent to maximizing the log-likelihood of a Bernoulli generative model. This connects logistic regression to probabilistic modeling: we are not just fitting a classifier, we are fitting the conditional distribution P(yx)P(y \mid \mathbf{x}) assuming it is Bernoulli with parameter σ(wTx)\sigma(\mathbf{w}^T \mathbf{x}).

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Probability Distributions demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.