Common Probability Distributions
Reading time: ~50 minutes | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist
The ML Scenario That Motivates This Lesson
You're designing a new model architecture. For each design choice, you need to pick the right output distribution:
- Binary classification? → Bernoulli
- Multiclass classification? → Categorical (Multinomial with $n = 1$)
- Regression with Gaussian noise? → Gaussian
- Counting events (how many spam words in this email)? → Poisson
- Modeling a probability value (Bayesian prior over $p$)? → Beta
- Modeling a probability vector (prior over class proportions)? → Dirichlet
Every generative model in ML is defined by its choice of distribution. A VAE generates images from $p(x \mid z)$, which might be Gaussian (for real-valued images) or Bernoulli (for binary images). Understanding which distribution to use is a fundamental ML engineering skill.
1. Bernoulli Distribution
Definition
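PMF: $P(X = x) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$
| Parameter | $p \in [0, 1]$ |
|---|---|
| Mean | $p$ |
| Variance | $p(1-p)$ |
| Support | $\{0, 1\}$ |
Variance is maximized at $p = 0.5$ (maximum uncertainty) and is 0 at $p = 0$ or $p = 1$ (no uncertainty).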
ML Connection
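- Binary classification output: $P(y = 1 \mid x) = \sigma(z)$, the sigmoid of the model's logit $z$
- Dropout masks: each neuron mask $m_i \sim \text{Bernoulli}(1 - p_{\text{drop}})$, where $p_{\text{drop}}$ is the drop probability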
- Binary cross-entropy loss: negative log-likelihood of Bernoulli model
import numpy as np
from scipy.stats import bernoulli
p = 0.7
rv = bernoulli(p)
print(f"Bernoulli(p={p}):")
print(f" P(X=1) = {rv.pmf(1):.4f}")
print(f" P(X=0) = {rv.pmf(0):.4f}")
print(f" Mean = {rv.mean():.4f} (expected: {p})")
print(f" Var = {rv.var():.4f} (expected: {p*(1-p):.4f})")
# Variance is maximized at p=0.5
# Use 101 points so that p=0.5 lies exactly on the grid
p_vals = np.linspace(0, 1, 101)
variances = p_vals * (1 - p_vals)
max_var_idx = np.argmax(variances)
print(f"\nMax variance at p = {p_vals[max_var_idx]:.2f}, Var = {variances[max_var_idx]:.4f}")
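The claim that binary cross-entropy is the negative log-likelihood of a Bernoulli model can be checked numerically. The sketch below uses only the standard library; the helper names `bernoulli_pmf` and `binary_cross_entropy`, and the labels and probabilities, are illustrative assumptions rather than anything from a specific framework.

```python
import math

def bernoulli_pmf(y, p):
    """P(Y = y) for Y ~ Bernoulli(p), written as p^y (1-p)^(1-y)."""
    return p**y * (1 - p) ** (1 - y)

def binary_cross_entropy(labels, probs):
    """Mean binary cross-entropy between 0/1 labels and predicted P(y=1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.6, 0.7]  # hypothetical model outputs P(y=1 | x)

# Average negative log-likelihood of the labels under the Bernoulli model
nll = -sum(math.log(bernoulli_pmf(y, p)) for y, p in zip(labels, probs)) / len(labels)
bce = binary_cross_entropy(labels, probs)

print(f"Bernoulli NLL: {nll:.6f}")
print(f"BCE loss:      {bce:.6f}")
```

The two numbers agree up to floating-point error, because taking the log of $p^y (1-p)^{1-y}$ gives exactly the two terms inside the cross-entropy sum.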
2. Binomial Distribution
Definition
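$X \sim \text{Binomial}(n, p)$ counts the number of successes in $n$ independent Bernoulli($p$) trials.
PMF: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k \in \{0, 1, \dots, n\}$
| Parameter | $n \in \{1, 2, \dots\}$, $p \in [0, 1]$ |
|---|---|
| Mean | $np$ |
| Variance | $np(1-p)$ |
| Support | $\{0, 1, \dots, n\}$ |
Note: Bernoulli($p$) = Binomial($1, p$).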
ML Connection
- Reliability testing: How many of $n$ predictions are correct?
- A/B testing: Given $n$ users, how many click the new button?
- Hypothesis testing: Is observed accuracy significantly different from baseline?
from scipy.stats import binom
import numpy as np
n, p = 20, 0.6
rv = binom(n, p)
print(f"Binomial(n={n}, p={p}):")
print(f" Mean = {rv.mean():.2f} (expected: {n*p})")
print(f" Var = {rv.var():.2f} (expected: {n*p*(1-p):.2f})")
# P(X >= 15) - probability of 15 or more successes
p_15_plus = 1 - rv.cdf(14)
print(f" P(X >= 15) = {p_15_plus:.4f}")
# One-sided binomial test for observed accuracy (a p-value, not a confidence interval)
# 100 predictions, 73 correct - is p significantly > 0.6?
n_test, n_correct = 100, 73
p_value = 1 - binom(n_test, 0.6).cdf(n_correct - 1)
print(f"\n P(X >= {n_correct} | p=0.6, n={n_test}) = {p_value:.4f}")
print(f" Significant at alpha=0.05? {p_value < 0.05}")
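As a cross-check on the Binomial mean and variance, the PMF can be evaluated directly from the counting formula $\binom{n}{k} p^k (1-p)^{n-k}$. This is a minimal standard-library sketch; `binomial_pmf` is an illustrative helper, not a SciPy function.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.6
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                        # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))          # should be n*p
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # should be n*p*(1-p)
tail = sum(pmf[15:])                                    # P(X >= 15)

print(f"sum of PMF = {total:.6f}")
print(f"mean = {mean:.4f} (n*p = {n * p})")
print(f"var  = {var:.4f} (n*p*(1-p) = {n * p * (1 - p):.4f})")
print(f"P(X >= 15) = {tail:.4f}")
```

Summing the PMF over the upper tail is the same quantity that a survival-function or `1 - cdf` call returns, so it is a handy sanity check when a p-value looks suspicious.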
3. Categorical (Multinoulli) Distribution
Definition
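The Categorical distribution is the multiclass generalization of Bernoulli:
$$P(X = k) = p_k, \quad k \in \{1, \dots, K\}, \quad p_k \geq 0, \quad \sum_{k=1}^{K} p_k = 1$$
This can also be written with a one-hot vector $\mathbf{x}$ (where $x_k = 1$ for the observed class and 0 otherwise):
$$P(\mathbf{x}) = \prod_{k=1}^{K} p_k^{x_k}$$
Multinomial Distribution
Draw $n$ samples from a Categorical($\mathbf{p}$) distribution. The Multinomial counts how many times each category appears in the $n$ trials:
$$P(X_1 = n_1, \dots, X_K = n_K) = \frac{n!}{n_1! \cdots n_K!} \prod_{k=1}^{K} p_k^{n_k}, \quad \sum_{k=1}^{K} n_k = n$$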
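The relationship between the two distributions can be sketched in a few lines: draw repeated Categorical samples and tally them, and the tally is one Multinomial draw. The function name `multinomial_draw`, the probabilities and the seed are assumptions chosen for illustration.

```python
import random
from collections import Counter

def multinomial_draw(probs, n, rng):
    """One Multinomial(n, probs) draw, built by counting n Categorical samples."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n)
    tally = Counter(draws)
    return [tally.get(k, 0) for k in range(len(probs))]

rng = random.Random(0)      # fixed seed so the run is repeatable
probs = [0.5, 0.3, 0.2]     # hypothetical class probabilities
n_trials = 1000

counts = multinomial_draw(probs, n_trials, rng)
print(f"counts = {counts}, total = {sum(counts)}")
print(f"empirical frequencies = {[c / n_trials for c in counts]}")
```

The counts always sum to the number of trials, and the empirical frequencies approach `probs` as the number of trials grows.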
ML Connection
- Multiclass classification output: softmax → Categorical distribution
- Language model next-token prediction: Categorical over vocabulary
- Cross-entropy loss: negative log-likelihood of Categorical model
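$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log \hat{p}_k = -\log \hat{p}_c$$
where $c$ is the true class (only one term survives because $\mathbf{y}$ is one-hot).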
import numpy as np
def sample_categorical(probs, n_samples=10):
    """Sample from a Categorical distribution."""
    return np.random.choice(len(probs), size=n_samples, p=probs)
# Softmax output from a 5-class classifier
logits = np.array([1.2, 0.5, 2.1, -0.3, 0.8])
def softmax(x):
    # Subtract the max logit for numerical stability
    e = np.exp(x - x.max())
    return e / e.sum()
probs = softmax(logits)
print(f"Class probabilities: {probs.round(4)}")
# Sample predictions
np.random.seed(42)
preds = sample_categorical(probs, n_samples=10000)
print(f"\nEmpirical class frequencies over 10,000 samples:")
for k in range(5):
    freq = (preds == k).mean()
    print(f" Class {k}: {freq:.4f} (expected: {probs[k]:.4f})")
# Cross-entropy loss: -log p(true_class)
true_class = 2
ce_loss = -np.log(probs[true_class])
print(f"\nCross-entropy loss for true_class={true_class}: {ce_loss:.4f}")
4. Poisson Distribution
Definition
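Models the number of events in a fixed interval, given events occur independently at a constant average rate $\lambda$.
PMF: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \in \{0, 1, 2, \dots\}$
| Parameter | $\lambda > 0$ (rate) |
|---|---|
| Mean | $\lambda$ |
| Variance | $\lambda$ |
| Support | $\{0, 1, 2, \dots\}$ |
Key property: mean equals variance. If empirical data shows mean $\approx$ variance, Poisson is a plausible model; a variance well above the mean (overdispersion) is a sign that it fits poorly.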
ML Connection
- Natural language: Word counts in documents (approximately Poisson for rare words)
- Event modeling: Number of model API calls per minute, number of anomalies detected
- Neural Poisson processes: Modeling event sequences in time (used in recommendation systems, neuroscience)
from scipy.stats import poisson
import numpy as np
lam = 3.5 # average 3.5 events per interval
rv = poisson(lam)
print(f"Poisson(λ={lam}):")
print(f" Mean = {rv.mean():.2f} (= λ = {lam})")
print(f" Var = {rv.var():.2f} (= λ = {lam})")
print("\n PMF:")
for k in range(10):
    bar = '#' * int(rv.pmf(k) * 100)
    print(f" P(X={k}) = {rv.pmf(k):.4f} {bar}")
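Because a Poisson variable has equal mean and variance, a quick diagnostic for count data is the ratio of sample variance to sample mean (often called the dispersion index). The sketch below assumes two small hand-made count series; the variable names and the data are hypothetical.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson-like data."""
    return variance(counts) / mean(counts)

# Hypothetical per-minute API call counts, both with mean 3.5
steady_traffic = [3, 4, 2, 5, 3, 0, 4, 7, 3, 1, 6, 4]
bursty_traffic = [0, 0, 12, 1, 0, 15, 0, 2, 0, 11, 1, 0]

steady_index = dispersion_index(steady_traffic)
bursty_index = dispersion_index(bursty_traffic)

print(f"steady: mean = {mean(steady_traffic):.2f}, index = {steady_index:.2f}")
print(f"bursty: mean = {mean(bursty_traffic):.2f}, index = {bursty_index:.2f}")
```

An index near 1 is compatible with a Poisson model. An index far above 1, as in the bursty series, suggests an overdispersed alternative such as the negative binomial would fit better.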
5. Uniform Distribution
Continuous Uniform
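PDF: $f(x) = \frac{1}{b - a}$ for $x \in [a, b]$
| Parameter | $a$, $b$ with $a < b$ |
|---|---|
| Mean | $\frac{a + b}{2}$ |
| Variance | $\frac{(b - a)^2}{12}$ |
| Support | $[a, b]$ |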
ML Connection
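- Weight initialization: Xavier/Glorot initialization draws from $U\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$
- Random search: Hyperparameter search often draws from uniform distributions
- Sampling foundations: The inverse CDF method starts with $U \sim \text{Uniform}(0, 1)$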
import numpy as np
# Xavier uniform initialization
def xavier_uniform(n_in, n_out):
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))
# Initialize a weight matrix for a 256->128 layer
W = xavier_uniform(256, 128)
print(f"Xavier Uniform initialization:")
print(f" Shape: {W.shape}")
print(f" Mean: {W.mean():.6f} (expected: 0)")
# Var of U(-limit, limit) = limit^2 / 3 = 2 / (n_in + n_out)
print(f" Std: {W.std():.4f} (expected: {np.sqrt(2.0/(256+128)):.4f})")
# The limit ensures variance of activations is preserved
# under the assumption of linear activations
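The inverse CDF method mentioned above turns a Uniform(0, 1) draw into a sample from another distribution by solving $u = F(x)$ for $x$. The sketch below uses the Exponential distribution (section 7) as the target because its CDF, $F(x) = 1 - e^{-\lambda x}$, inverts in closed form. The function name, rate and seed are illustrative assumptions.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse-CDF sampling: solve u = 1 - exp(-rate * x) for x."""
    u = rng.random()                     # u in [0, 1), so 1 - u is in (0, 1]
    return -math.log(1.0 - u) / rate

rng = random.Random(0)                   # fixed seed for repeatability
rate = 2.0
samples = [sample_exponential(rate, rng) for _ in range(100_000)]

sample_mean = sum(samples) / len(samples)
print(f"Sample mean: {sample_mean:.4f} (theory: {1 / rate:.4f})")
```

The same recipe works for any distribution whose CDF can be inverted, which is why uniform random numbers are the starting point for most samplers.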
6. Gaussian (Normal) Distribution
The most important distribution in ML and statistics.
Definition
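PDF: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
| Parameter | $\mu$ (mean), $\sigma^2$ (variance) |
|---|---|
| Mean | $\mu$ |
| Variance | $\sigma^2$ |
| Support | $(-\infty, \infty)$ |
| Skewness | 0 |
| Kurtosis | 3 |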
Why Gaussian Is Everywhere
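- Central Limit Theorem: The standardized sum/average of many independent random variables with finite variance converges to a Gaussian
- Maximum entropy: Given fixed mean and variance, Gaussian has the highest entropy (the least informative choice)
- Conjugacy: A Gaussian prior on the mean of a Gaussian likelihood (with known variance) gives a Gaussian posterior, so Bayesian updates stay in closed form
- Closed under linear operations: If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$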
The 68-95-99.7 Rule
For $X \sim \mathcal{N}(\mu, \sigma^2)$:
- 68.27% of the probability mass lies within $\mu \pm \sigma$
- 95.45% lies within $\mu \pm 2\sigma$
- 99.73% lies within $\mu \pm 3\sigma$
Multivariate Gaussian
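For random vector $\mathbf{x} \in \mathbb{R}^d$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$:
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$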
ML Connection
| ML Application | Role of Gaussian |
|---|---|
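| Linear Regression | Noise model: $y = \mathbf{w}^\top \mathbf{x} + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$ |
| Weight Initialization | He initialization: $W \sim \mathcal{N}(0, 2/n_{\text{in}})$ |
| L2 Regularization | Equivalent to Gaussian prior on weights |
| VAE Latent Space | $q(z \mid x) = \mathcal{N}(\mu(x), \sigma^2(x))$, prior $p(z) = \mathcal{N}(0, I)$ |
| Gaussian Processes | Entire functions distributed as multivariate Gaussian |
| BatchNorm | Activations are standardized to zero mean and unit variance per batch (often roughly Gaussian, though this is not guaranteed) |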
| Diffusion Models | Forward process adds Gaussian noise; reverse learns to denoise |
import numpy as np
from scipy.stats import norm, multivariate_normal
# Standard Normal
print("Standard Normal N(0,1):")
print(f" P(|X| < 1) = {norm.cdf(1) - norm.cdf(-1):.4f} (68.27% rule)")
print(f" P(|X| < 2) = {norm.cdf(2) - norm.cdf(-2):.4f} (95.45% rule)")
print(f" P(|X| < 3) = {norm.cdf(3) - norm.cdf(-3):.4f} (99.73% rule)")
# Linear regression noise model: MSE loss = MLE under Gaussian noise
np.random.seed(42)
n = 200
X = np.random.randn(n, 2)
w_true = np.array([2.0, -1.5])
sigma_noise = 0.5
y = X @ w_true + sigma_noise * np.random.randn(n)
# MLE for linear regression = OLS
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"\nLinear Regression (MLE = OLS):")
print(f" True weights: {w_true}")
print(f" Estimated weights: {w_hat.round(4)}")
# L2 regularization = zero-mean Gaussian prior on weights
# Ridge solution = MAP estimate with prior N(0, sigma_noise^2 / lambda) on each weight
lam = 0.1 # regularization strength
n_feat = X.shape[1]
w_map = np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)
print(f" MAP (L2 reg, lambda={lam}): {w_map.round(4)}")
# VAE: reparameterization trick with Gaussian latent space
def reparameterize(mu, log_var):
    """
    Sample z ~ N(mu, exp(log_var)) using reparameterization trick.
    z = mu + std * epsilon, epsilon ~ N(0, I)
    Enables backpropagation through the sampling step.
    """
    std = np.exp(0.5 * log_var)
    epsilon = np.random.randn(*mu.shape)
    return mu + std * epsilon
# Encoder outputs
mu = np.array([1.0, -0.5, 0.3])
log_var = np.array([-0.5, 0.2, -1.0])
z_sample = reparameterize(mu, log_var)
print(f"\nVAE Reparameterization:")
print(f" mu = {mu}")
print(f" std = {np.exp(0.5 * log_var).round(4)}")
print(f" z sample= {z_sample.round(4)}")
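The comment "MSE loss = MLE under Gaussian noise" can be made explicit: for a fixed noise level, the Gaussian negative log-likelihood is a constant plus the sum of squared errors divided by $2\sigma^2$, so ranking predictions by NLL is the same as ranking them by MSE. The following sketch uses only the standard library; the targets, the two prediction vectors and the noise level are hypothetical.

```python
import math

def gaussian_nll(y_true, y_pred, sigma):
    """Summed negative log-likelihood of y_true under N(y_pred, sigma^2)."""
    n = len(y_true)
    sse = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    return 0.5 * n * math.log(2 * math.pi * sigma**2) + sse / (2 * sigma**2)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.1, 1.9, 3.2, 3.8]   # hypothetical close fit
pred_b = [0.5, 2.5, 2.0, 5.0]   # hypothetical worse fit
sigma = 0.5                     # assumed known noise standard deviation

nll_a, nll_b = gaussian_nll(y_true, pred_a, sigma), gaussian_nll(y_true, pred_b, sigma)
mse_a, mse_b = mse(y_true, pred_a), mse(y_true, pred_b)

# The NLL gap equals the MSE gap scaled by n / (2 sigma^2)
scaled_gap = len(y_true) * (mse_a - mse_b) / (2 * sigma**2)
print(f"MSE: a = {mse_a:.4f}, b = {mse_b:.4f}")
print(f"NLL: a = {nll_a:.4f}, b = {nll_b:.4f}")
print(f"NLL gap = {nll_a - nll_b:.4f}, scaled MSE gap = {scaled_gap:.4f}")
```

Since the log-normalizer does not depend on the predictions, minimizing one objective minimizes the other.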
7. Exponential Distribution
Definition
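PDF: $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$
| Parameter | $\lambda > 0$ (rate) |
|---|---|
| Mean | $1/\lambda$ |
| Variance | $1/\lambda^2$ |
| Support | $[0, \infty)$ |
The memoryless property: $P(X > s + t \mid X > s) = P(X > t)$ - the future waiting time doesn't depend on how long you've already waited.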
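The memoryless property follows directly from the survival function $P(X > t) = e^{-\lambda t}$: conditioning on having already waited $s$ divides out the factor $e^{-\lambda s}$. A small numeric check, with an arbitrarily chosen rate and waiting times, looks like this:

```python
import math

def survival(t, rate):
    """P(X > t) = exp(-rate * t) for X ~ Exponential(rate)."""
    return math.exp(-rate * t)

rate = 0.5   # hypothetical rate (events per unit time)
s, t = 3.0, 2.0

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = survival(s + t, rate) / survival(s, rate)
unconditional = survival(t, rate)

print(f"P(X > s+t | X > s) = {conditional:.6f}")
print(f"P(X > t)           = {unconditional:.6f}")
```

Both probabilities are the same value, so having already waited 3 units tells you nothing about the remaining wait.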
ML Connection
- Survival analysis: Time until model failure or user churn
- Neural ODE connections: Exponential decay in continuous-time systems
- Activation statistics: Frequencies in certain sparse models
8. Beta Distribution
Definition
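PDF: $f(x; \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}$ for $x \in [0, 1]$
where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is the Beta function.
| Parameter | $\alpha > 0$, $\beta > 0$ |
|---|---|
| Mean | $\frac{\alpha}{\alpha + \beta}$ |
| Variance | $\frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$ |
| Support | $[0, 1]$ |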
Shape Families
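| $\alpha$, $\beta$ | Shape |
|---|---|
| $\alpha = \beta = 1$ | Uniform on $[0, 1]$ |
| $\alpha = \beta > 1$ | Symmetric, bell-shaped, centered at 0.5 |
| $\alpha > \beta$ | Skewed toward 1 |
| $\alpha < 1$, $\beta < 1$ | U-shaped (concentrated near 0 and 1) |
| $\alpha, \beta \gg 1$ | Concentrated around $\frac{\alpha}{\alpha + \beta}$ |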
Beta distribution shapes:
- Beta(1,1) = Uniform: flat density on $[0, 1]$
- Beta(2,5): single peak left of center (mode at 0.2) with a long right tail
- Beta(0.5,0.5): U-shape, with density rising sharply toward 0 and 1
ML Connection
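- Conjugate prior for Bernoulli/Binomial: If $p \sim \text{Beta}(\alpha, \beta)$ and you observe $k$ successes in $n$ trials, the posterior is $\text{Beta}(\alpha + k, \beta + n - k)$
- Bayesian A/B testing: Model click-through rate as Beta, update with observations
- Beta-VAE: A name collision only. Here $\beta$ is a scalar weight on the KL term of the ELBO, not the Beta distribution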
from scipy.stats import beta
import numpy as np
# Bayesian A/B testing with Beta-Bernoulli conjugacy
# Prior: Beta(1, 1) = Uniform (uninformative)
# Observe: 40 clicks out of 100 visits
alpha_prior, beta_prior = 1, 1
n_visits, n_clicks = 100, 40
# Posterior update: Beta(alpha + n_clicks, beta + n_visits - n_clicks)
alpha_post = alpha_prior + n_clicks
beta_post = beta_prior + (n_visits - n_clicks)
posterior = beta(alpha_post, beta_post)
print(f"Prior: Beta({alpha_prior}, {beta_prior})")
print(f"Data: {n_clicks} clicks in {n_visits} visits")
print(f"Posterior: Beta({alpha_post}, {beta_post})")
print(f" Posterior mean: {posterior.mean():.4f} (MLE: {n_clicks/n_visits})")
print(f" Posterior std: {posterior.std():.4f}")
print(f" 95% credible interval: [{posterior.ppf(0.025):.4f}, {posterior.ppf(0.975):.4f}]")
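A common follow-up question in Bayesian A/B testing is the probability that one variant beats the other. With Beta posteriors this has no simple closed form, but it is easy to estimate by sampling. The sketch below assumes uniform Beta(1, 1) priors and invented click counts for two variants; `prob_b_beats_a` is an illustrative helper name.

```python
import random

def prob_b_beats_a(clicks_a, visits_a, clicks_b, visits_b, n_draws=20_000, seed=0):
    """Monte Carlo estimate of P(p_B > p_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        # Posterior for each variant: Beta(1 + clicks, 1 + non-clicks)
        p_a = rng.betavariate(1 + clicks_a, 1 + visits_a - clicks_a)
        p_b = rng.betavariate(1 + clicks_b, 1 + visits_b - clicks_b)
        if p_b > p_a:
            wins += 1
    return wins / n_draws

# Hypothetical experiment: A gets 40/100 clicks, B gets 55/100
prob = prob_b_beats_a(40, 100, 55, 100)
print(f"Estimated P(B beats A) = {prob:.3f}")
```

The posterior update itself stays exact; only the comparison between the two posteriors is approximated, and the Monte Carlo error shrinks as `n_draws` grows.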
9. Dirichlet Distribution
Definition
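The Dirichlet is the multivariate generalization of the Beta distribution, defined over the probability simplex $\Delta^{K-1}$:
$$f(\mathbf{p}; \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} p_k^{\alpha_k - 1}, \quad \alpha_0 = \sum_{k=1}^{K} \alpha_k$$
Support: $\mathbf{p} \in \Delta^{K-1}$ (all $p_k \geq 0$, $\sum_k p_k = 1$)
| Property | Value |
|---|---|
| Mean of $k$-th component | $\alpha_k / \alpha_0$ |
| Marginals | Each $p_k \sim \text{Beta}(\alpha_k, \alpha_0 - \alpha_k)$ |
| Concentration | Controlled by $\alpha_0 = \sum_k \alpha_k$ |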
Concentration Parameter
The total concentration $\alpha_0 = \sum_k \alpha_k$ controls the spread:
- $\alpha_k = 1$ for all $k$ (i.e., $\alpha_0 = K$): Uniform over the simplex
- $\alpha_k \gg 1$: Concentrated near the mean (low variance)
- $\alpha_k \ll 1$: Sparse - distribution concentrated near corners of the simplex
Three symmetric examples with 3 classes:
| Dirichlet(1,1,1) | Dirichlet(10,10,10) | Dirichlet(0.1,0.1,0.1) |
|---|---|---|
| Uniform over triangle | Concentrated center | Concentrated at corners |
| (any mix is likely) | (balanced likely) | (one class dominates) |
ML Connection
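- Conjugate prior for Categorical/Multinomial: If $\mathbf{p} \sim \text{Dirichlet}(\boldsymbol{\alpha})$ and you observe counts $\mathbf{n} = (n_1, \dots, n_K)$, the posterior is $\text{Dirichlet}(\boldsymbol{\alpha} + \mathbf{n})$
- Latent Dirichlet Allocation (LDA): Topics are Dirichlet-distributed mixtures of words; documents are Dirichlet-distributed mixtures of topics
- Prior over softmax outputs: In Bayesian neural networks, a Dirichlet prior over class probabilities
- Data augmentation (Mixup): Mixing coefficients drawn from $\text{Beta}(\alpha, \alpha)$, the two-component case of a symmetric Dirichlet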
import numpy as np
from scipy.stats import dirichlet
# Dirichlet as prior for multinomial observations
K = 4 # number of classes
alpha_prior = np.ones(K) # symmetric Dirichlet(1,1,1,1) = uniform
# Observed counts: 30 A, 25 B, 20 C, 25 D
counts = np.array([30, 25, 20, 25])
# Posterior update (Dirichlet-Multinomial conjugacy)
alpha_post = alpha_prior + counts
post_rv = dirichlet(alpha_post)
print(f"Dirichlet-Multinomial Bayesian Update:")
print(f" Prior: Dirichlet({alpha_prior})")
print(f" Observed: counts = {counts}")
print(f" Posterior: Dirichlet({alpha_post})")
print(f" Posterior mean: {(alpha_post / alpha_post.sum()).round(4)}")
print(f" MLE: {(counts / counts.sum()).round(4)}")
# Sample from posterior
samples = post_rv.rvs(size=5)
print(f"\n 5 samples from posterior:")
for s in samples:
    print(f" {s.round(4)}")
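To see how the concentration parameter changes the shape of Dirichlet samples, one option is to sample by normalizing independent Gamma draws (a standard construction) and compare how dominant the largest component is. This sketch uses only the standard library; the helper names, the sample size and the seed are assumptions for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw p ~ Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def mean_max_component(alpha, n_samples, rng):
    """Average size of the largest component; near 1 means 'one class dominates'."""
    return sum(max(sample_dirichlet(alpha, rng)) for _ in range(n_samples)) / n_samples

rng = random.Random(0)  # fixed seed for repeatability
one_sample = sample_dirichlet([1.0, 1.0, 1.0], rng)
sparse_max = mean_max_component([0.1, 0.1, 0.1], 2000, rng)
dense_max = mean_max_component([10.0, 10.0, 10.0], 2000, rng)

print(f"one Dirichlet(1,1,1) sample: {[round(p, 4) for p in one_sample]}")
print(f"mean largest component, alpha=0.1: {sparse_max:.3f}")
print(f"mean largest component, alpha=10:  {dense_max:.3f}")
```

Small concentration values push samples toward the corners of the simplex, so the largest component is much bigger on average than it is for large concentration values, where samples cluster near the uniform vector.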
10. Distribution Cheat Sheet for ML
DISTRIBUTION QUICK REFERENCE
═══════════════════════════════════════════════════════════════════
Bernoulli(p) → Binary output, dropout mask
P(X=1)=p, P(X=0)=1-p, E=p, Var=p(1-p)
Binomial(n,p) → Count of successes in n trials
P(X=k)=C(n,k)p^k(1-p)^(n-k), E=np, Var=np(1-p)
Categorical(p) → Multiclass output (softmax)
P(X=k)=p_k, generalizes Bernoulli to K classes
Poisson(λ) → Event count per interval
P(X=k)=λ^k e^-λ / k!, E=Var=λ
Uniform(a,b) → Weight initialization, random search
f(x)=1/(b-a), E=(a+b)/2, Var=(b-a)²/12
Normal(μ,σ²) → Regression noise, latent space, CLT limit
f(x)=exp(-(x-μ)²/(2σ²))/(σ√(2π)), E=μ, Var=σ²
Exponential(λ) → Waiting times, survival analysis
f(x)=λe^-λx, E=1/λ, Var=1/λ², memoryless
Beta(α,β) → Prior over probability, Bayesian A/B test
Support=[0,1], conjugate to Bernoulli, E=α/(α+β)
Dirichlet(α) → Prior over probability vector
Support=simplex, conjugate to Categorical, generalizes Beta
11. Interview Q&A
Q1: Why is the Gaussian distribution so prevalent in machine learning?
A: The Gaussian appears everywhere for four distinct reasons. First, the Central Limit Theorem: standardized averages of many independent random variables converge to a Gaussian, regardless of the underlying distribution, provided it has finite variance. Since many ML quantities (gradients, losses, activations) are sums over many data points or neurons, they tend to be approximately Gaussian. Second, maximum entropy: among all distributions with fixed mean $\mu$ and variance $\sigma^2$, the Gaussian has the maximum entropy (is the least informative), making it a natural default. Third, mathematical tractability: sums of independent Gaussians are Gaussian; the Gaussian is closed under linear transformations; the multivariate Gaussian has a closed-form density and closed-form marginals and conditionals. Fourth, conjugacy: the Gaussian is self-conjugate for the mean (Gaussian prior + Gaussian likelihood = Gaussian posterior), enabling closed-form Bayesian updates.
Q2: What is the relationship between L2 regularization and a Gaussian prior on weights?
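A: From a Bayesian perspective, L2 regularization is equivalent to placing an independent zero-mean Gaussian prior on each weight. The MAP estimate maximizes the log posterior: $\log p(\mathbf{w} \mid \mathcal{D}) = \log p(\mathcal{D} \mid \mathbf{w}) + \log p(\mathbf{w}) + \text{const}$. The Gaussian prior $\mathbf{w} \sim \mathcal{N}(0, \tau^2 I)$ contributes $\log p(\mathbf{w}) = -\frac{1}{2\tau^2} \|\mathbf{w}\|^2 + \text{const}$. So MAP is equivalent to minimizing $-\log p(\mathcal{D} \mid \mathbf{w}) + \lambda \|\mathbf{w}\|^2$ with $\lambda = \frac{1}{2\tau^2}$, which is exactly L2 regularized loss minimization. The regularization strength $\lambda$ controls the tightness of the Gaussian prior: larger $\lambda$ means a narrower prior (weights pulled more strongly toward zero, stronger regularization).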
Q3: What is the Dirichlet distribution and how does it appear in NLP models?
A: The Dirichlet distribution is a distribution over probability vectors (the probability simplex). If $\boldsymbol{\theta} \sim \text{Dirichlet}(\boldsymbol{\alpha})$, then $\boldsymbol{\theta}$ is a random vector with non-negative components summing to 1. It is the conjugate prior for the Categorical/Multinomial distribution. In NLP, Latent Dirichlet Allocation (LDA) uses Dirichlet distributions to model the generative process of documents: topics are Dirichlet-distributed mixtures of words ($\boldsymbol{\phi}_k \sim \text{Dirichlet}(\boldsymbol{\beta})$), and documents are Dirichlet-distributed mixtures of topics ($\boldsymbol{\theta}_d \sim \text{Dirichlet}(\boldsymbol{\alpha})$). The concentration parameter $\boldsymbol{\alpha}$ controls how sparse or spread-out the topic mixture is.
Q4: What is the Beta-Bernoulli conjugacy and why is it useful for Bayesian A/B testing?
A: Conjugacy means that when the prior and likelihood come from matching distribution families, the posterior is in the same family as the prior, enabling closed-form updates. For Beta-Bernoulli: if the prior on conversion rate $p$ is $\text{Beta}(\alpha, \beta)$, and we observe $k$ conversions in $n$ trials (Binomial likelihood), the posterior is $\text{Beta}(\alpha + k, \beta + n - k)$. This is extremely useful for A/B testing: we can initialize with an uninformative prior $\text{Beta}(1, 1)$, observe user interactions sequentially, update the posterior with each observation, and at any point query the posterior to compute: probability that variant A beats variant B, expected conversion rate with uncertainty, credible intervals. No approximation or MCMC is needed for the updates themselves: they are exact and $O(1)$.
Q5: Why does cross-entropy loss correspond to maximum likelihood under a Categorical distribution?
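A: Given a model that outputs class probabilities $\hat{\mathbf{p}} = (\hat{p}_1, \dots, \hat{p}_K)$, we model the true label as $y \sim \text{Categorical}(\hat{\mathbf{p}})$. The likelihood of observing the true label $c$ is $\hat{p}_c$. The log-likelihood is $\log \hat{p}_c$. MLE maximizes the log-likelihood over the dataset: $\max_\theta \sum_{i=1}^{N} \log \hat{p}^{(i)}_{c_i}$. Negating to get a minimization objective: $\min_\theta -\sum_{i=1}^{N} \log \hat{p}^{(i)}_{c_i}$. This is exactly the cross-entropy loss: $-\sum_{i=1}^{N} \sum_{k=1}^{K} y^{(i)}_k \log \hat{p}^{(i)}_k$, where $\mathbf{y}^{(i)}$ is the one-hot label (so only the true class term survives). Cross-entropy minimization is therefore exactly MLE under a Categorical output distribution - a direct consequence of the probabilistic interpretation.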
