Common Probability Distributions

Reading time: ~50 minutes | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist

The ML Scenario That Motivates This Lesson

You're designing a new model architecture. For each design choice, you need to pick the right output distribution:

Binary classification? → Bernoulli
Multiclass classification? → Categorical (Multinomial with $n=1$ )
Regression with Gaussian noise? → Gaussian
Counting events (how many spam words in this email)? → Poisson
Modeling a probability value (Bayesian prior over $p$ )? → Beta
Modeling a probability vector (prior over class proportions)? → Dirichlet

Every generative model in ML is defined by its choice of distribution. A VAE generates images from $p(\mathbf{x} \mid \mathbf{z})$ , which might be Gaussian (for real-valued images) or Bernoulli (for binary images). Knowing which distribution to use is a fundamental ML engineering skill.

1. Bernoulli Distribution

Definition

$X \sim \text{Bernoulli}(p)$

$P(X = 1) = p, \quad P(X = 0) = 1 - p$

PMF: $P(X = x) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$

Parameter	$p \in [0, 1]$
Mean	$p$
Variance	$p(1-p)$
Support	$\{0, 1\}$

Variance is maximized at $p = 0.5$ (maximum uncertainty) and is 0 at $p = 0$ or $p = 1$ (no uncertainty).

ML Connection

Binary classification output: $y \mid \mathbf{x} \sim \text{Bernoulli}(p)$ with $p = P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})$
Dropout masks: each neuron mask $m_i \sim \text{Bernoulli}(1-p_\text{drop})$
Binary cross-entropy loss: negative log-likelihood of Bernoulli model

$\mathcal{L} = -[y \log p + (1-y) \log(1-p)]$

import numpy as np
from scipy.stats import bernoulli

p = 0.7
rv = bernoulli(p)

print(f"Bernoulli(p={p}):")
print(f"  P(X=1) = {rv.pmf(1):.4f}")
print(f"  P(X=0) = {rv.pmf(0):.4f}")
print(f"  Mean   = {rv.mean():.4f}  (expected: {p})")
print(f"  Var    = {rv.var():.4f}  (expected: {p*(1-p):.4f})")

# Variance is maximized at p=0.5
p_vals = np.linspace(0, 1, 101)  # 101 points so that p = 0.5 lies exactly on the grid
variances = p_vals * (1 - p_vals)
max_var_idx = np.argmax(variances)
print(f"\nMax variance at p = {p_vals[max_var_idx]:.2f}, Var = {variances[max_var_idx]:.4f}")
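The claim that binary cross-entropy is the negative log-likelihood of a Bernoulli model can be checked numerically. The sketch below is illustrative only: the labels and predicted probabilities are made-up values, and the `eps` clipping is an added safeguard that is not part of the formula above.

```python
import numpy as np
from scipy.stats import bernoulli

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy; eps clipping guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Illustrative labels and predicted probabilities P(y=1 | x)
y = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.6, 0.7, 0.1])

bce = binary_cross_entropy(y, p_hat)
# Average negative log-likelihood under y_i ~ Bernoulli(p_hat_i)
nll = -np.mean(bernoulli.logpmf(y, p_hat))

print(f"BCE loss      = {bce:.6f}")
print(f"Bernoulli NLL = {nll:.6f}")
```

The two numbers agree up to floating-point error, which is why minimizing BCE is the same as maximum likelihood for a Bernoulli output.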

2. Binomial Distribution

Definition

$X \sim \text{Binomial}(n, p)$

$X$ counts the number of successes in $n$ independent Bernoulli( $p$ ) trials.

$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$

Parameter	$n \in \mathbb{N}$ , $p \in [0,1]$
Mean	$np$
Variance	$np(1-p)$
Support	$\{0, 1, \ldots, n\}$

Note: Bernoulli( $p$ ) = Binomial( $1, p$ ).
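The definition of the Binomial as a count of successes over independent Bernoulli trials can be illustrated by simulation. This is a minimal sketch; the seed, the number of simulated experiments, and the values of `n` and `p` are arbitrary choices.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n, p, n_sim = 20, 0.6, 100_000

# Each row holds n independent Bernoulli(p) trials
trials = rng.random((n_sim, n)) < p
# Summing a row gives one Binomial(n, p) draw
successes = trials.sum(axis=1)

emp_mean = successes.mean()
emp_var = successes.var()
emp_p12 = (successes == 12).mean()

print(f"Empirical mean = {emp_mean:.3f}  (np = {n * p})")
print(f"Empirical var  = {emp_var:.3f}  (np(1-p) = {n * p * (1 - p):.2f})")
print(f"Empirical P(X=12) = {emp_p12:.4f}  (PMF: {binom.pmf(12, n, p):.4f})")
```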

ML Connection

Reliability testing: How many of $n$ predictions are correct?
A/B testing: Given $n$ users, how many click the new button?
Hypothesis testing: Is observed accuracy significantly different from baseline?

from scipy.stats import binom
import numpy as np

n, p = 20, 0.6
rv = binom(n, p)

print(f"Binomial(n={n}, p={p}):")
print(f"  Mean = {rv.mean():.2f}  (expected: {n*p})")
print(f"  Var  = {rv.var():.2f}  (expected: {n*p*(1-p):.2f})")

# P(X >= 15) - probability of 15 or more successes
p_15_plus = 1 - rv.cdf(14)
print(f"  P(X >= 15) = {p_15_plus:.4f}")

# One-sided binomial test for observed accuracy (a p-value, not a confidence interval)
# 100 predictions, 73 correct - is p significantly > 0.6?
n_test, n_correct = 100, 73
p_value = 1 - binom(n_test, 0.6).cdf(n_correct - 1)
print(f"\n  P(X >= {n_correct} | p=0.6, n={n_test}) = {p_value:.4f}")
print(f"  Significant at alpha=0.05? {p_value < 0.05}")

3. Categorical (Multinoulli) Distribution

Definition

The Categorical distribution is the multiclass generalization of Bernoulli:

$X \sim \text{Categorical}(\mathbf{p}), \quad \mathbf{p} = [p_1, \ldots, p_K]$

$P(X = k) = p_k, \quad \sum_{k=1}^K p_k = 1$

This can also be written with one-hot vector $\mathbf{y} = \mathbf{e}_k$ :

$P(\mathbf{y}) = \prod_{k=1}^K p_k^{y_k}$

Multinomial Distribution

$N$ independent draws from a Categorical( $\mathbf{p}$ ) distribution: $\mathbf{c} \sim \text{Multinomial}(N, \mathbf{p})$ counts how many times each category appears in $N$ trials, so $\sum_k c_k = N$ .

$P(\mathbf{c}) = \binom{N}{c_1, c_2, \ldots, c_K} \prod_{k=1}^K p_k^{c_k}$
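To make the Multinomial PMF concrete, the sketch below evaluates the formula by hand for one small example and compares it with `scipy.stats.multinomial`. The probabilities and counts are illustrative, and `multinomial_pmf` is a helper written for this example.

```python
import numpy as np
from math import factorial
from scipy.stats import multinomial

def multinomial_pmf(counts, probs):
    """Multinomial coefficient times the product of p_k ** c_k."""
    counts = np.asarray(counts)
    coef = factorial(int(counts.sum()))
    for c in counts:
        coef //= factorial(int(c))  # exact integer division at every step
    return coef * np.prod(np.asarray(probs, dtype=float) ** counts)

probs = [0.5, 0.3, 0.2]
counts = [3, 2, 1]          # N = 6 trials in total

manual = multinomial_pmf(counts, probs)
ref = multinomial.pmf(counts, n=6, p=probs)

print(f"Manual PMF = {manual:.6f}")
print(f"SciPy PMF  = {ref:.6f}")
```

Here the coefficient is $6!/(3!\,2!\,1!) = 60$ and the product term is $0.5^3 \cdot 0.3^2 \cdot 0.2 = 0.00225$ , giving a probability of $0.135$ .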

ML Connection

Multiclass classification output: softmax → Categorical distribution
Language model next-token prediction: Categorical over vocabulary
Cross-entropy loss: negative log-likelihood of Categorical model

$\mathcal{L} = -\sum_{k=1}^K y_k \log p_k = -\log p_{y^*}$

where $y^*$ is the true class (only one term survives because $\mathbf{y}$ is one-hot).

import numpy as np

def sample_categorical(probs, n_samples=10):
    """Sample from a Categorical distribution."""
    return np.random.choice(len(probs), size=n_samples, p=probs)

# Softmax output from a 5-class classifier
logits = np.array([1.2, 0.5, 2.1, -0.3, 0.8])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(logits)
print(f"Class probabilities: {probs.round(4)}")

# Sample predictions
np.random.seed(42)
preds = sample_categorical(probs, n_samples=10000)
print(f"\nEmpirical class frequencies over 10,000 samples:")
for k in range(5):
    freq = (preds == k).mean()
    print(f"  Class {k}: {freq:.4f}  (expected: {probs[k]:.4f})")

# Cross-entropy loss: -log p(true_class)
true_class = 2
ce_loss = -np.log(probs[true_class])
print(f"\nCross-entropy loss for true_class={true_class}: {ce_loss:.4f}")

4. Poisson Distribution

Definition

$X \sim \text{Poisson}(\lambda)$

Models the number of events in a fixed interval, given events occur at rate $\lambda$ .

$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$

Parameter	$\lambda > 0$ (rate)
Mean	$\lambda$
Variance	$\lambda$
Support	$\{0, 1, 2, \ldots\}$

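Key property: mean equals variance. If empirical counts show mean $\approx$ variance, Poisson is a plausible model; a variance well above the mean (overdispersion) suggests a different count model, such as the negative binomial.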
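One rough way to apply the mean-equals-variance check is to compute the ratio of sample variance to sample mean (often called the dispersion index). The sketch below compares Poisson counts with negative binomial counts that have the same mean; the helper name `dispersion_index`, the seed, and all parameter values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dispersion_index(x):
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    x = np.asarray(x)
    return x.var(ddof=1) / x.mean()

mean_count = 3.5
poisson_counts = rng.poisson(lam=mean_count, size=50_000)

# Negative binomial with the same mean but a larger variance (overdispersed)
r = 2.0
nb_p = r / (r + mean_count)
overdispersed_counts = rng.negative_binomial(r, nb_p, size=50_000)

d_poisson = dispersion_index(poisson_counts)
d_overdispersed = dispersion_index(overdispersed_counts)

print(f"Poisson counts:       var/mean = {d_poisson:.3f}")
print(f"Overdispersed counts: var/mean = {d_overdispersed:.3f}")
```

A ratio near 1 is consistent with a Poisson model; a ratio well above 1 indicates that the Poisson variance assumption is too restrictive for the data.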

ML Connection

Natural language: Word counts in documents (approximately Poisson for rare words)
Event modeling: Number of model API calls per minute, number of anomalies detected
Neural Poisson processes: Modeling event sequences in time (used in recommendation systems, neuroscience)

from scipy.stats import poisson
import numpy as np

lam = 3.5   # average 3.5 events per interval
rv = poisson(lam)

print(f"Poisson(λ={lam}):")
print(f"  Mean = {rv.mean():.2f}  (= λ = {lam})")
print(f"  Var  = {rv.var():.2f}  (= λ = {lam})")

print("\n  PMF:")
for k in range(10):
    bar = '#' * int(rv.pmf(k) * 100)
    print(f"    P(X={k}) = {rv.pmf(k):.4f}  {bar}")

5. Uniform Distribution

Continuous Uniform

$X \sim \text{Uniform}(a, b)$

$f(x) = \frac{1}{b-a}, \quad x \in [a, b]$

Parameter	$a < b$
Mean	$(a+b)/2$
Variance	$(b-a)^2/12$
Support	$[a, b]$

ML Connection

Weight initialization: Xavier/Glorot initialization draws from $\text{Uniform}(-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})})$
Random search: Hyperparameter search often draws from uniform distributions
Sampling foundations: The inverse CDF method starts with $U \sim \text{Uniform}(0,1)$

import numpy as np

# Xavier uniform initialization
def xavier_uniform(n_in, n_out):
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))

# Initialize a weight matrix for a 256->128 layer
W = xavier_uniform(256, 128)
print(f"Xavier Uniform initialization:")
print(f"  Shape: {W.shape}")
print(f"  Mean: {W.mean():.6f}  (expected: 0)")
print(f"  Std:  {W.std():.4f}  (expected: {np.sqrt(2.0/(256+128)):.4f})")

# The limit ensures variance of activations is preserved
# under the assumption of linear activations
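The inverse CDF method mentioned in the ML Connection list can be sketched in a few lines. The example targets a distribution with CDF $F(x) = 1 - e^{-\lambda x}$ (the Exponential distribution), whose inverse is $F^{-1}(u) = -\ln(1-u)/\lambda$ . The rate `lam`, the seed, and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0

# Step 1: draw U ~ Uniform(0, 1); rng.random() returns values in [0, 1)
u = rng.random(100_000)

# Step 2: push U through the inverse CDF, F^{-1}(u) = -ln(1 - u) / lam
x = -np.log1p(-u) / lam

print(f"Sample mean = {x.mean():.4f}  (expected 1/lam   = {1 / lam:.4f})")
print(f"Sample var  = {x.var():.4f}  (expected 1/lam^2 = {1 / lam**2:.4f})")
```

The same two steps work for any distribution whose inverse CDF can be evaluated.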

6. Gaussian (Normal) Distribution

The most important distribution in ML and statistics.

Definition

$X \sim \mathcal{N}(\mu, \sigma^2)$

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

Parameter	$\mu \in \mathbb{R}$ (mean), $\sigma^2 > 0$ (variance)
Mean	$\mu$
Variance	$\sigma^2$
Support	$(-\infty, \infty)$
Skewness	0
Kurtosis	3

Why Gaussian Is Everywhere

Central Limit Theorem: The standardized sum (or average) of many independent random variables with finite variance converges in distribution to a Gaussian
Maximum entropy: Given fixed mean and variance, Gaussian has the highest entropy (least informative prior)
Conjugacy: A Gaussian prior on the mean of a Gaussian likelihood (with known variance) yields a Gaussian posterior
Closed under linear operations: If $X \sim \mathcal{N}(\mu, \sigma^2)$ , then $aX + b \sim \mathcal{N}(a\mu+b, a^2\sigma^2)$
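The Central Limit Theorem item above can be illustrated by standardizing sums of Uniform(0, 1) draws and comparing them with the standard Normal probabilities. This is a simulation sketch; the number of terms, the number of simulations, and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms, n_sim = 30, 50_000

# Uniform(0, 1) has mean 1/2 and variance 1/12, and is clearly not Gaussian
sums = rng.random((n_sim, n_terms)).sum(axis=1)

# Standardize: subtract the mean of the sum, divide by its standard deviation
z = (sums - n_terms * 0.5) / np.sqrt(n_terms / 12.0)

within_1 = np.mean(np.abs(z) < 1)
within_2 = np.mean(np.abs(z) < 2)

print(f"P(|Z| < 1) = {within_1:.4f}  (Gaussian: 0.6827)")
print(f"P(|Z| < 2) = {within_2:.4f}  (Gaussian: 0.9545)")
```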

The 68-95-99.7 Rule

                     N(μ, σ²)

                      68.27%
                  ├─────────────┤
                      95.45%
           ├───────────────────────────┤
                      99.73%
    ├─────────────────────────────────────────┤

  μ-3σ   μ-2σ    μ-σ     μ     μ+σ   μ+2σ   μ+3σ

Multivariate Gaussian

For random vector $\mathbf{X} \in \mathbb{R}^d$ :

$\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$

$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$
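The multivariate density can be evaluated term by term and compared with `scipy.stats.multivariate_normal`. In the sketch below the mean, covariance, and query point are illustrative values, and `mvn_pdf` is a helper written for this example; it works in log space and uses `solve` rather than an explicit matrix inverse for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, evaluated in log space."""
    d = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    maha = diff @ np.linalg.solve(Sigma, diff)
    # log|Sigma|, computed without forming the determinant directly
    _, logdet = np.linalg.slogdet(Sigma)
    log_pdf = -0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    return np.exp(log_pdf)

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, 0.0])

manual = mvn_pdf(x, mu, Sigma)
ref = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

print(f"Manual density = {manual:.6f}")
print(f"SciPy density  = {ref:.6f}")
```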

ML Connection

ML Application	Role of Gaussian
Linear Regression	Noise model: $y = \mathbf{w}^T\mathbf{x} + \epsilon$ , $\epsilon \sim \mathcal{N}(0, \sigma^2)$
Weight Initialization	He initialization: $W \sim \mathcal{N}(0, 2/n_{in})$
L2 Regularization	Equivalent to Gaussian prior on weights
VAE Latent Space	$\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ , prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$
Gaussian Processes	Entire functions distributed as multivariate Gaussian
BatchNorm	Standardizes activations to zero mean and unit variance per feature (this does not by itself make them Gaussian)
Diffusion Models	Forward process adds Gaussian noise; reverse learns to denoise

import numpy as np
from scipy.stats import norm, multivariate_normal

# Standard Normal
print("Standard Normal N(0,1):")
print(f"  P(|X| < 1) = {norm.cdf(1) - norm.cdf(-1):.4f}  (68.27% rule)")
print(f"  P(|X| < 2) = {norm.cdf(2) - norm.cdf(-2):.4f}  (95.45% rule)")
print(f"  P(|X| < 3) = {norm.cdf(3) - norm.cdf(-3):.4f}  (99.73% rule)")

# Linear regression noise model: MSE loss = MLE under Gaussian noise
np.random.seed(42)
n = 200
X = np.random.randn(n, 2)
w_true = np.array([2.0, -1.5])
sigma_noise = 0.5
y = X @ w_true + sigma_noise * np.random.randn(n)

# MLE for linear regression = OLS
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"\nLinear Regression (MLE = OLS):")
print(f"  True weights: {w_true}")
print(f"  Estimated weights: {w_hat.round(4)}")

# L2 regularization = Gaussian prior on weights
# Ridge solution = posterior mode (MAP) under a N(0, sigma^2/lambda) prior,
# where sigma^2 is the noise variance (N(0, 1/lambda) when sigma^2 = 1)
lam = 0.1  # regularization strength
n_feat = X.shape[1]
w_map = np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)
print(f"  MAP (L2 reg, lambda={lam}): {w_map.round(4)}")

# VAE: reparameterization trick with Gaussian latent space
def reparameterize(mu, log_var):
    """
    Sample z ~ N(mu, exp(log_var)) using reparameterization trick.
    z = mu + std * epsilon, epsilon ~ N(0, I)
    Enables backpropagation through the sampling step.
    """
    std = np.exp(0.5 * log_var)
    epsilon = np.random.randn(*mu.shape)
    return mu + std * epsilon

# Encoder outputs
mu     = np.array([1.0, -0.5, 0.3])
log_var = np.array([-0.5, 0.2, -1.0])

z_sample = reparameterize(mu, log_var)
print(f"\nVAE Reparameterization:")
print(f"  mu      = {mu}")
print(f"  std     = {np.exp(0.5 * log_var).round(4)}")
print(f"  z sample= {z_sample.round(4)}")

7. Exponential Distribution

Definition

$X \sim \text{Exponential}(\lambda)$

$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0$

Parameter	$\lambda > 0$ (rate)
Mean	$1/\lambda$
Variance	$1/\lambda^2$
Support	$[0, \infty)$

The memoryless property: $P(X > s + t \mid X > s) = P(X > t)$ - the future waiting time doesn't depend on how long you've already waited.
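The memoryless property can be verified directly from the survival function $P(X > x) = e^{-\lambda x}$ . The sketch below uses `scipy.stats.expon`, which is parameterized by `scale = 1/λ`; the values of `lam`, `s`, and `t` are arbitrary.

```python
import numpy as np
from scipy.stats import expon

lam, s, t = 0.5, 2.0, 3.0
rv = expon(scale=1.0 / lam)   # SciPy uses scale = 1 / rate

# Conditional survival: P(X > s + t | X > s) = P(X > s + t) / P(X > s)
lhs = rv.sf(s + t) / rv.sf(s)
# Unconditional survival: P(X > t)
rhs = rv.sf(t)

print(f"P(X > s+t | X > s) = {lhs:.6f}")
print(f"P(X > t)           = {rhs:.6f}")
```

Both values equal $e^{-\lambda t}$ , because the factor $e^{-\lambda s}$ cancels in the ratio.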

ML Connection

Survival analysis: Time until model failure or user churn
Neural ODE connections: Exponential decay in continuous-time systems
Activation statistics: Frequencies in certain sparse models

8. Beta Distribution

Definition

$X \sim \text{Beta}(\alpha, \beta)$

$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad x \in [0, 1]$

where $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ is the Beta function.

Parameter	$\alpha, \beta > 0$
Mean	$\alpha/(\alpha+\beta)$
Variance	$\alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$
Support	$[0, 1]$

Shape Families

$\alpha$ , $\beta$	Shape
$\alpha = \beta = 1$	Uniform on $[0,1]$
$\alpha = \beta > 1$	Symmetric, bell-shaped, centered at 0.5
$\alpha > \beta$	Mass shifted toward 1 (mean above 0.5)
$\alpha < 1$ , $\beta < 1$	U-shaped (concentrated near 0 and 1)
$\alpha, \beta \gg 1$	Concentrated around $\alpha/(\alpha+\beta)$

Beta distribution shapes:

Beta(1,1) = Uniform:    ────────────────
Beta(2,5):              /\
                       /  \
                      /    \──────────
Beta(0.5,0.5) U-shape:│              │
                      \/──────────────\/

ML Connection

Conjugate prior for Bernoulli/Binomial: If $p \sim \text{Beta}(\alpha, \beta)$ and you observe $k$ successes in $n$ trials, the posterior is $\text{Beta}(\alpha+k, \beta+n-k)$
Bayesian A/B testing: Model click-through rate as Beta, update with observations
β-VAE (name clash only): the $\beta$ in β-VAE is a scalar weight on the KL term of the ELBO, not a Beta distribution

from scipy.stats import beta
import numpy as np

# Bayesian A/B testing with Beta-Bernoulli conjugacy
# Prior: Beta(1, 1) = Uniform (uninformative)
# Observe: 40 clicks out of 100 visits

alpha_prior, beta_prior = 1, 1
n_visits, n_clicks = 100, 40

# Posterior update: Beta(alpha + n_clicks, beta + n_visits - n_clicks)
alpha_post = alpha_prior + n_clicks
beta_post  = beta_prior  + (n_visits - n_clicks)

posterior = beta(alpha_post, beta_post)
print(f"Prior: Beta({alpha_prior}, {beta_prior})")
print(f"Data:  {n_clicks} clicks in {n_visits} visits")
print(f"Posterior: Beta({alpha_post}, {beta_post})")
print(f"  Posterior mean: {posterior.mean():.4f}  (MLE: {n_clicks/n_visits})")
print(f"  Posterior std:  {posterior.std():.4f}")
print(f"  95% credible interval: [{posterior.ppf(0.025):.4f}, {posterior.ppf(0.975):.4f}]")

9. Dirichlet Distribution

Definition

The Dirichlet is the multivariate generalization of the Beta distribution, defined over the probability simplex $\Delta^{K-1}$ :

$\mathbf{p} \sim \text{Dirichlet}(\boldsymbol{\alpha}), \quad \boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_K]$

$f(\mathbf{p}) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^K p_k^{\alpha_k - 1}$

Support: $\mathbf{p} \in \Delta^{K-1}$ (all $p_k \geq 0$ , $\sum_k p_k = 1$ )

Property	Value
Mean of $k$ -th component	$\alpha_k / \sum_j \alpha_j$
Marginals	Each $p_k \sim \text{Beta}(\alpha_k, \sum_{j \neq k} \alpha_j)$
Concentration	Controlled by $\alpha_0 = \sum_k \alpha_k$

Concentration Parameter

The total concentration $\alpha_0 = \sum_k \alpha_k$ controls the spread:

$\alpha_k = 1$ for all $k$ (i.e., $\alpha_0 = K$ ): Uniform over the simplex
$\alpha_k \gg 1$ : Concentrated near the mean $\boldsymbol{\alpha}/\alpha_0$ (low variance)
$\alpha_k < 1$ : Sparse - distribution concentrated near corners of the simplex

Dirichlet(1,1,1)         Dirichlet(10,10,10)    Dirichlet(0.1,0.1,0.1)
(3 classes, symmetric)

Uniform over triangle    Concentrated center     Concentrated at corners
(any mix is likely)      (balanced likely)       (one class dominates)
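The three regimes above can be compared by sampling and looking at the average size of the largest component: values near 1 mean one class dominates, values near $1/K$ mean a balanced mix. This is a sketch; `mean_largest_component` is a helper written for this example, and the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_largest_component(alpha, n_samples=20_000):
    """Average size of the largest coordinate across Dirichlet samples."""
    samples = rng.dirichlet(alpha, size=n_samples)
    return samples.max(axis=1).mean()

uniform = mean_largest_component([1.0, 1.0, 1.0])
concentrated = mean_largest_component([10.0, 10.0, 10.0])
sparse = mean_largest_component([0.1, 0.1, 0.1])

print(f"Dirichlet(1,1,1):       mean largest component = {uniform:.3f}")
print(f"Dirichlet(10,10,10):    mean largest component = {concentrated:.3f}")
print(f"Dirichlet(0.1,0.1,0.1): mean largest component = {sparse:.3f}")
```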

ML Connection

Conjugate prior for Categorical/Multinomial: If $\mathbf{p} \sim \text{Dirichlet}(\boldsymbol{\alpha})$ and you observe counts $\mathbf{c}$ , the posterior is $\text{Dirichlet}(\boldsymbol{\alpha} + \mathbf{c})$
Latent Dirichlet Allocation (LDA): Topics are Dirichlet-distributed mixtures of words; documents are Dirichlet-distributed mixtures of topics
Prior over softmax outputs: In Bayesian neural networks, a Dirichlet prior over class probabilities
Data augmentation: Mixup draws its mixing coefficient from $\text{Beta}(\alpha, \alpha)$ , the two-component case of a symmetric Dirichlet; AugMix draws its mixing weights from a Dirichlet

import numpy as np
from scipy.stats import dirichlet

# Dirichlet as prior for multinomial observations
K = 4   # number of classes
alpha_prior = np.ones(K)   # symmetric Dirichlet(1,1,1,1) = uniform

# Observed counts: 30 A, 25 B, 20 C, 25 D
counts = np.array([30, 25, 20, 25])

# Posterior update (Dirichlet-Multinomial conjugacy)
alpha_post = alpha_prior + counts
post_rv = dirichlet(alpha_post)

print(f"Dirichlet-Multinomial Bayesian Update:")
print(f"  Prior:     Dirichlet({alpha_prior})")
print(f"  Observed:  counts = {counts}")
print(f"  Posterior: Dirichlet({alpha_post})")
print(f"  Posterior mean: {(alpha_post / alpha_post.sum()).round(4)}")
print(f"  MLE:            {(counts / counts.sum()).round(4)}")

# Sample from posterior
samples = post_rv.rvs(size=5)
print(f"\n  5 samples from posterior:")
for s in samples:
    print(f"    {s.round(4)}")

10. Distribution Cheat Sheet for ML

DISTRIBUTION QUICK REFERENCE
═══════════════════════════════════════════════════════════════════

Bernoulli(p)          → Binary output, dropout mask
  P(X=1)=p, P(X=0)=1-p, E=p, Var=p(1-p)

Binomial(n,p)         → Count of successes in n trials
  P(X=k)=C(n,k)p^k(1-p)^(n-k), E=np, Var=np(1-p)

Categorical(p)        → Multiclass output (softmax)
  P(X=k)=p_k, generalizes Bernoulli to K classes

Poisson(λ)            → Event count per interval
  P(X=k)=λ^k e^-λ / k!, E=Var=λ

Uniform(a,b)          → Weight initialization, random search
  f(x)=1/(b-a), E=(a+b)/2, Var=(b-a)²/12

Normal(μ,σ²)          → Regression noise, latent space, CLT limit
  f(x)=exp(-(x-μ)²/2σ²)/(σ√2π), E=μ, Var=σ²

Exponential(λ)        → Waiting times, survival analysis
  f(x)=λe^-λx, E=1/λ, Var=1/λ², memoryless

Beta(α,β)             → Prior over probability, Bayesian A/B test
  Support=[0,1], conjugate to Bernoulli, E=α/(α+β)

Dirichlet(α)          → Prior over probability vector
  Support=simplex, conjugate to Categorical, generalizes Beta

11. Interview Q&A

Q1: Why is the Gaussian distribution so prevalent in machine learning?

A: The Gaussian appears everywhere for four distinct reasons. First, the Central Limit Theorem: standardized averages of many independent random variables converge to Gaussian, regardless of the underlying distribution, provided it has finite variance. Since many ML quantities (gradients, losses, activations) are sums over many data points or neurons, they are often approximately Gaussian. Second, maximum entropy: among all distributions with fixed mean $\mu$ and variance $\sigma^2$ , the Gaussian has the maximum entropy (is the least informative), making it a natural default. Third, mathematical tractability: sums of independent Gaussians are Gaussian; Gaussian is closed under linear transformations; the multivariate Gaussian has a closed-form density and closed-form marginals and conditionals. Fourth, conjugacy: Gaussian is self-conjugate for the mean (Gaussian prior + Gaussian likelihood with known variance = Gaussian posterior), enabling closed-form Bayesian updates.

Q2: What is the relationship between L2 regularization and a Gaussian prior on weights?

A: From a Bayesian perspective, L2 regularization is equivalent to placing an independent Gaussian prior $w_j \sim \mathcal{N}(0, 1/\lambda)$ on each weight. The MAP estimate maximizes the log posterior: $\log P(\mathbf{w} \mid \mathcal{D}) = \log P(\mathcal{D} \mid \mathbf{w}) + \log P(\mathbf{w})$ . The Gaussian prior contributes $\log P(\mathbf{w}) = -\frac{\lambda}{2}\|\mathbf{w}\|^2 + \text{const}$ . So MAP is equivalent to minimizing $\mathcal{L}(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2$ , which is exactly L2 regularized loss minimization. The regularization strength $\lambda$ controls the tightness of the Gaussian prior: larger $\lambda$ means narrower prior (more weight toward zero, stronger regularization).
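The equivalence in this answer can be checked for linear regression, where the loss $\mathcal{L}(\mathbf{w}) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$ is the Gaussian negative log-likelihood with unit noise variance. The sketch below uses synthetic data with made-up weights; it confirms that the closed-form ridge solution is a stationary point of the regularized objective and that it has a smaller norm than the unregularized estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 3, 0.5

# Synthetic data with unit noise variance (so the prior variance is 1/lam)
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.5, 0.5])
y = X @ w_true + rng.normal(size=n)

def neg_log_posterior_grad(w):
    """Gradient of 0.5*||y - Xw||^2 + 0.5*lam*||w||^2."""
    return -X.T @ (y - X @ w) + lam * w

# Closed-form MAP estimate (ridge regression)
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
# Unregularized MLE (ordinary least squares)
w_mle = np.linalg.lstsq(X, y, rcond=None)[0]

grad_norm = np.linalg.norm(neg_log_posterior_grad(w_map))
print(f"Gradient norm at w_map: {grad_norm:.2e}")
print(f"||w_map|| = {np.linalg.norm(w_map):.4f}, ||w_mle|| = {np.linalg.norm(w_mle):.4f}")
```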

Q3: What is the Dirichlet distribution and how does it appear in NLP models?

A: The Dirichlet distribution is a distribution over probability vectors (the probability simplex). If $\mathbf{p} \sim \text{Dirichlet}(\boldsymbol{\alpha})$ , then $\mathbf{p}$ is a random vector with non-negative components summing to 1. It is the conjugate prior for the Categorical/Multinomial distribution. In NLP, Latent Dirichlet Allocation (LDA) uses Dirichlet distributions to model the generative process of documents: topics are Dirichlet-distributed mixtures of words ( $\boldsymbol{\phi}_k \sim \text{Dirichlet}(\boldsymbol{\beta})$ ), and documents are Dirichlet-distributed mixtures of topics ( $\boldsymbol{\theta}_d \sim \text{Dirichlet}(\boldsymbol{\alpha})$ ). The concentration parameter $\alpha_0 = \sum_k \alpha_k$ controls how sparse or spread-out the topic mixture is.

Q4: What is the Beta-Bernoulli conjugacy and why is it useful for Bayesian A/B testing?

A: Conjugacy means that when the prior and likelihood come from matching distribution families, the posterior is in the same family as the prior - enabling closed-form updates. For Beta-Bernoulli: if the prior on conversion rate is $p \sim \text{Beta}(\alpha, \beta)$ , and we observe $k$ conversions in $n$ trials (Binomial likelihood), the posterior is $p \mid \text{data} \sim \text{Beta}(\alpha + k, \beta + n - k)$ . This is extremely useful for A/B testing: we can initialize with an uninformative prior $\text{Beta}(1,1)$ , observe user interactions sequentially, update the posterior with each observation, and at any point query the posterior to compute: probability that variant A beats variant B, expected conversion rate with uncertainty, credible intervals. No approximation or MCMC needed - all updates are exact and $O(1)$ .
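The query "probability that variant A beats variant B" has no simple closed form, but it is easy to estimate by sampling from the two Beta posteriors. The click counts below are hypothetical, and the $\text{Beta}(1,1)$ prior, seed, and number of draws are choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical experiment data: (clicks, visits) per variant
clicks_a, visits_a = 55, 100
clicks_b, visits_b = 40, 100

# Beta(1, 1) prior updated to Beta(1 + clicks, 1 + non-clicks)
n_draws = 200_000
post_a = rng.beta(1 + clicks_a, 1 + visits_a - clicks_a, size=n_draws)
post_b = rng.beta(1 + clicks_b, 1 + visits_b - clicks_b, size=n_draws)

# Monte Carlo estimate of P(p_A > p_B | data)
prob_a_beats_b = np.mean(post_a > post_b)
print(f"P(A beats B | data) ~= {prob_a_beats_b:.4f}")
```

The posterior updates themselves are exact; only this comparison between two posteriors is estimated by sampling.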

Q5: Why does cross-entropy loss correspond to maximum likelihood under a Categorical distribution?

A: Given a model that outputs class probabilities $\mathbf{p} = \text{softmax}(\mathbf{z})$ , we model the true label $y$ as $y \sim \text{Categorical}(\mathbf{p})$ . The likelihood of observing the true label $y^* = k$ is $P(y = k \mid \mathbf{x}) = p_k$ . The log-likelihood is $\log p_k$ . MLE maximizes the log-likelihood over the dataset: $\arg\max_\theta \frac{1}{N}\sum_i \log p_{y_i^*}$ . Negating to get a minimization objective: $\arg\min_\theta -\frac{1}{N}\sum_i \log p_{y_i^*}$ . This is exactly the cross-entropy loss: $\mathcal{L} = -\frac{1}{N}\sum_i \sum_k y_{ik} \log p_{ik}$ , where $y_{ik}$ is the one-hot label (so only the true class term survives). Cross-entropy minimization is therefore exactly MLE under a Categorical output distribution - a direct consequence of the probabilistic interpretation.

