
Common Probability Distributions

Reading time: ~50 minutes | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist

The ML Scenario That Motivates This Lesson

You're designing a new model architecture. For each design choice, you need to pick the right output distribution:

  • Binary classification? → Bernoulli
  • Multiclass classification? → Categorical (Multinomial with $n=1$)
  • Regression with Gaussian noise? → Gaussian
  • Counting events (how many spam words in this email)? → Poisson
  • Modeling a probability value (Bayesian prior over $p$)? → Beta
  • Modeling a probability vector (prior over class proportions)? → Dirichlet

Every generative model in ML is defined by its choice of distribution. A VAE generates images from $p(\mathbf{x} \mid \mathbf{z})$, which might be Gaussian (for real-valued images) or Bernoulli (for binary images). Understanding which distribution to use is a fundamental ML engineering skill.

1. Bernoulli Distribution

Definition

$$X \sim \text{Bernoulli}(p)$$

$$P(X = 1) = p, \quad P(X = 0) = 1 - p$$

PMF: $P(X = x) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$

| Property | Value |
|---|---|
| Parameter | $p \in [0, 1]$ |
| Mean | $p$ |
| Variance | $p(1-p)$ |
| Support | $\{0, 1\}$ |

Variance is maximized at $p = 0.5$ (maximum uncertainty) and is 0 at $p = 0$ or $p = 1$ (no uncertainty).

ML Connection

  • Binary classification output: $y \mid \mathbf{x} \sim \text{Bernoulli}(p)$ with $p = P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})$
  • Dropout masks: each neuron mask $m_i \sim \text{Bernoulli}(1-p_\text{drop})$
  • Binary cross-entropy loss: negative log-likelihood of Bernoulli model

$$\mathcal{L} = -[y \log p + (1-y) \log(1-p)]$$

import numpy as np
from scipy.stats import bernoulli

p = 0.7
rv = bernoulli(p)

print(f"Bernoulli(p={p}):")
print(f" P(X=1) = {rv.pmf(1):.4f}")
print(f" P(X=0) = {rv.pmf(0):.4f}")
print(f" Mean = {rv.mean():.4f} (expected: {p})")
print(f" Var = {rv.var():.4f} (expected: {p*(1-p):.4f})")

# Variance is maximized at p=0.5 (101 grid points so that 0.5 lies exactly on the grid)
p_vals = np.linspace(0, 1, 101)
variances = p_vals * (1 - p_vals)
max_var_idx = np.argmax(variances)
print(f"\nMax variance at p = {p_vals[max_var_idx]:.2f}, Var = {variances[max_var_idx]:.4f}")
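The link between the Bernoulli PMF, binary cross-entropy, and dropout masks can be checked numerically. The following is a minimal sketch; the helper name `bce`, the value `p_pred = 0.7`, and the drop rate are illustrative assumptions rather than part of the lesson.

```python
import numpy as np
from scipy.stats import bernoulli

def bce(y, p):
    """Binary cross-entropy for a label y in {0, 1} and predicted P(y=1) = p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

p_pred = 0.7  # hypothetical model output sigma(w^T x)
for y in (0, 1):
    nll = -bernoulli(p_pred).logpmf(y)  # negative log-likelihood of the label
    print(f"y={y}: BCE = {bce(y, p_pred):.4f}, Bernoulli NLL = {nll:.4f}")

# Dropout mask: keep each unit with probability 1 - p_drop
rng = np.random.default_rng(0)
p_drop = 0.2
mask = rng.binomial(1, 1 - p_drop, size=10_000)
print(f"Fraction of units kept: {mask.mean():.3f} (expected: {1 - p_drop})")
```

The two columns printed in the loop should agree, which is the sense in which binary cross-entropy is the Bernoulli negative log-likelihood.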

2. Binomial Distribution

Definition

$$X \sim \text{Binomial}(n, p)$$

$X$ counts the number of successes in $n$ independent Bernoulli($p$) trials.

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$$

| Property | Value |
|---|---|
| Parameter | $n \in \mathbb{N}$, $p \in [0,1]$ |
| Mean | $np$ |
| Variance | $np(1-p)$ |
| Support | $\{0, 1, \ldots, n\}$ |

Note: Bernoulli($p$) = Binomial($1, p$).

ML Connection

  • Reliability testing: How many of $n$ predictions are correct?
  • A/B testing: Given $n$ users, how many click the new button?
  • Hypothesis testing: Is observed accuracy significantly different from baseline?

from scipy.stats import binom
import numpy as np

n, p = 20, 0.6
rv = binom(n, p)

print(f"Binomial(n={n}, p={p}):")
print(f" Mean = {rv.mean():.2f} (expected: {n*p})")
print(f" Var = {rv.var():.2f} (expected: {n*p*(1-p):.2f})")

# P(X >= 15) - probability of 15 or more successes
p_15_plus = 1 - rv.cdf(14)
print(f" P(X >= 15) = {p_15_plus:.4f}")

# One-sided exact binomial test for observed accuracy
# 100 predictions, 73 correct - is p significantly > 0.6?
n_test, n_correct = 100, 73
# sf(k - 1) = P(X >= k); avoids the cancellation in 1 - cdf for small tail probabilities
p_value = binom(n_test, 0.6).sf(n_correct - 1)
print(f"\n P(X >= {n_correct} | p=0.6, n={n_test}) = {p_value:.4f}")
print(f" Significant at alpha=0.05? {p_value < 0.05}")
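The definition of the Binomial as a count of successes over independent Bernoulli trials can be illustrated by simulation. This is a rough sketch; the number of simulated experiments and the seed are arbitrary choices made for the example.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n, p = 20, 0.6

# Each row holds n independent Bernoulli(p) trials; the row sum is one Binomial draw
trials = rng.random(size=(100_000, n)) < p
successes = trials.sum(axis=1)

emp_p12 = np.mean(successes == 12)
print(f"Empirical mean: {successes.mean():.3f} (n*p = {n * p})")
print(f"Empirical var:  {successes.var():.3f} (n*p*(1-p) = {n * p * (1 - p):.2f})")
print(f"Empirical P(X=12): {emp_p12:.4f} (exact: {binom.pmf(12, n, p):.4f})")
```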

3. Categorical (Multinoulli) Distribution

Definition

The Categorical distribution is the multiclass generalization of Bernoulli:

$$X \sim \text{Categorical}(\mathbf{p}), \quad \mathbf{p} = [p_1, \ldots, p_K]$$

$$P(X = k) = p_k, \quad \sum_{k=1}^K p_k = 1$$

This can also be written with one-hot vector $\mathbf{y} = \mathbf{e}_k$:

$$P(\mathbf{y}) = \prod_{k=1}^K p_k^{y_k}$$

Multinomial Distribution

Given $N$ independent samples from a Categorical($\mathbf{p}$) distribution, $\mathbf{c} \sim \text{Multinomial}(N, \mathbf{p})$ counts how many times each category appears in the $N$ trials.

$$P(\mathbf{c}) = \binom{N}{c_1, c_2, \ldots, c_K} \prod_{k=1}^K p_k^{c_k}$$
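As a sanity check on the Multinomial PMF, the sketch below evaluates the formula by hand and compares it with `scipy.stats.multinomial`. The probabilities, `N`, and the count vector are made-up illustrative values.

```python
import numpy as np
from math import factorial
from scipy.stats import multinomial

p = np.array([0.5, 0.3, 0.2])   # illustrative class probabilities
N = 10
c = np.array([5, 3, 2])         # one possible count vector, sums to N

# PMF from the formula: N! / (c_1! ... c_K!) * prod_k p_k^c_k
coef = factorial(N)
for ck in c:
    coef //= factorial(int(ck))
pmf_manual = coef * np.prod(p ** c)

pmf_scipy = multinomial(N, p).pmf(c)
print(f"Manual PMF: {pmf_manual:.6f}, scipy PMF: {pmf_scipy:.6f}")

# Counting N Categorical draws gives one Multinomial sample
rng = np.random.default_rng(0)
draws = rng.choice(len(p), size=N, p=p)
counts = np.bincount(draws, minlength=len(p))
print(f"Counts from {N} categorical draws: {counts} (sum = {counts.sum()})")
```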

ML Connection

  • Multiclass classification output: softmax → Categorical distribution
  • Language model next-token prediction: Categorical over vocabulary
  • Cross-entropy loss: negative log-likelihood of Categorical model

$$\mathcal{L} = -\sum_{k=1}^K y_k \log p_k = -\log p_{y^*}$$

where $y^*$ is the true class (only one term survives because $\mathbf{y}$ is one-hot).

import numpy as np

def sample_categorical(probs, n_samples=10):
    """Sample from a Categorical distribution."""
    return np.random.choice(len(probs), size=n_samples, p=probs)

# Softmax output from a 5-class classifier
logits = np.array([1.2, 0.5, 2.1, -0.3, 0.8])

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(logits)
print(f"Class probabilities: {probs.round(4)}")

# Sample predictions
np.random.seed(42)
preds = sample_categorical(probs, n_samples=10000)
print(f"\nEmpirical class frequencies over 10,000 samples:")
for k in range(5):
    freq = (preds == k).mean()
    print(f" Class {k}: {freq:.4f} (expected: {probs[k]:.4f})")

# Cross-entropy loss: -log p(true_class)
true_class = 2
ce_loss = -np.log(probs[true_class])
print(f"\nCross-entropy loss for true_class={true_class}: {ce_loss:.4f}")

4. Poisson Distribution

Definition

$$X \sim \text{Poisson}(\lambda)$$

Models the number of events in a fixed interval, given that events occur independently at a constant average rate $\lambda$.

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$

| Property | Value |
|---|---|
| Parameter | $\lambda > 0$ (rate) |
| Mean | $\lambda$ |
| Variance | $\lambda$ |
| Support | $\{0, 1, 2, \ldots\}$ |

Key property: mean equals variance. If empirical count data shows mean $\approx$ variance, Poisson is a plausible model; a variance well above the mean (overdispersion) suggests it is not.

ML Connection

  • Natural language: Word counts in documents (approximately Poisson for rare words)
  • Event modeling: Number of model API calls per minute, number of anomalies detected
  • Neural Poisson processes: Modeling event sequences in time (used in recommendation systems, neuroscience)

from scipy.stats import poisson
import numpy as np

lam = 3.5 # average 3.5 events per interval
rv = poisson(lam)

print(f"Poisson(λ={lam}):")
print(f" Mean = {rv.mean():.2f} (= λ = {lam})")
print(f" Var = {rv.var():.2f} (= λ = {lam})")

print("\n PMF:")
for k in range(10):
    bar = '#' * int(rv.pmf(k) * 100)
    print(f" P(X={k}) = {rv.pmf(k):.4f} {bar}")
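The "mean equals variance" property suggests a quick diagnostic: the variance-to-mean ratio of observed counts. The sketch below is one way to compute it; the helper name `dispersion_index` and the negative binomial comparison data are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dispersion_index(counts):
    """Variance-to-mean ratio; close to 1 for Poisson-like counts."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

poisson_counts = rng.poisson(lam=3.5, size=50_000)
# Negative binomial with the same mean (3.5) but a larger variance
overdispersed = rng.negative_binomial(n=2, p=2 / (2 + 3.5), size=50_000)

di_poisson = dispersion_index(poisson_counts)
di_over = dispersion_index(overdispersed)
print(f"Poisson counts:       mean = {poisson_counts.mean():.3f}, var/mean = {di_poisson:.3f}")
print(f"Overdispersed counts: mean = {overdispersed.mean():.3f}, var/mean = {di_over:.3f}")
```

A ratio near 1 is consistent with a Poisson model; a ratio well above 1 points toward an overdispersed alternative.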

5. Uniform Distribution

Continuous Uniform

$$X \sim \text{Uniform}(a, b)$$

$$f(x) = \frac{1}{b-a}, \quad x \in [a, b]$$

| Property | Value |
|---|---|
| Parameter | $a < b$ |
| Mean | $(a+b)/2$ |
| Variance | $(b-a)^2/12$ |
| Support | $[a, b]$ |

ML Connection

  • Weight initialization: Xavier/Glorot initialization draws from $\text{Uniform}(-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})})$
  • Random search: Hyperparameter search often draws from uniform distributions
  • Sampling foundations: The inverse CDF method starts with $U \sim \text{Uniform}(0,1)$

import numpy as np

# Xavier uniform initialization
def xavier_uniform(n_in, n_out):
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))

# Initialize a weight matrix for a 256->128 layer
W = xavier_uniform(256, 128)
print(f"Xavier Uniform initialization:")
print(f" Shape: {W.shape}")
print(f" Mean: {W.mean():.6f} (expected: 0)")
# Var of Uniform(-limit, limit) is limit^2 / 3 = 2 / (n_in + n_out)
print(f" Std: {W.std():.4f} (expected: {np.sqrt(2.0/(256+128)):.4f})")

# The limit is chosen so that the variance of activations is roughly preserved
# under the assumption of linear activations
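The inverse CDF method mentioned in the list above can be sketched in a few lines: draw $U \sim \text{Uniform}(0,1)$ and apply the inverse CDF of the target distribution. The example targets an Exponential distribution because its inverse CDF has a closed form; the rate value and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0  # illustrative rate parameter

# Exponential CDF: F(x) = 1 - exp(-rate * x)  =>  F^{-1}(u) = -ln(1 - u) / rate
u = rng.uniform(0.0, 1.0, size=100_000)
x = -np.log1p(-u) / rate  # log1p(-u) computes ln(1 - u) accurately for small u

print(f"Sample mean: {x.mean():.4f} (expected 1/rate = {1 / rate:.4f})")
print(f"Sample var:  {x.var():.4f} (expected 1/rate^2 = {1 / rate**2:.4f})")
```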

6. Gaussian (Normal) Distribution

The most important distribution in ML and statistics.

Definition

$$X \sim \mathcal{N}(\mu, \sigma^2)$$

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

| Property | Value |
|---|---|
| Parameter | $\mu \in \mathbb{R}$ (mean), $\sigma^2 > 0$ (variance) |
| Mean | $\mu$ |
| Variance | $\sigma^2$ |
| Support | $(-\infty, \infty)$ |
| Skewness | 0 |
| Kurtosis | 3 (excess kurtosis 0) |

Why Gaussian Is Everywhere

  1. Central Limit Theorem: The suitably normalized sum (or average) of many independent random variables with finite variance converges in distribution to a Gaussian
  2. Maximum entropy: Among all distributions with a given mean and variance, the Gaussian has the highest entropy (the least informative choice under those constraints)
  3. Conjugacy: A Gaussian prior on the mean of a Gaussian likelihood with known variance yields a Gaussian posterior
  4. Closed under linear operations: If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $aX + b \sim \mathcal{N}(a\mu+b, a^2\sigma^2)$ for $a \neq 0$
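The Central Limit Theorem claim can be illustrated empirically. The sketch below standardizes sample means of Uniform(0, 1) draws and checks how much mass falls within one and two standard deviations; the sample size `n`, the number of trials, and the choice of Uniform inputs are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 20_000

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the mean of n draws
# has mean 1/2 and variance 1/(12 n)
samples = rng.uniform(0.0, 1.0, size=(trials, n))
z = (samples.mean(axis=1) - 0.5) / np.sqrt(1.0 / 12.0 / n)  # standardized means

within_1 = np.mean(np.abs(z) < 1)
within_2 = np.mean(np.abs(z) < 2)
print(f"P(|Z| < 1) ~ {within_1:.4f} (standard normal: 0.6827)")
print(f"P(|Z| < 2) ~ {within_2:.4f} (standard normal: 0.9545)")
```

Although each input is uniform rather than bell-shaped, the standardized averages already behave much like a standard normal at this sample size.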

The 68-95-99.7 Rule

For $X \sim \mathcal{N}(\mu, \sigma^2)$:

  • 68.27% of the probability mass lies in $[\mu - \sigma, \mu + \sigma]$
  • 95.45% lies in $[\mu - 2\sigma, \mu + 2\sigma]$
  • 99.73% lies in $[\mu - 3\sigma, \mu + 3\sigma]$

Multivariate Gaussian

For random vector $\mathbf{X} \in \mathbb{R}^d$:

$$\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
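To connect the density formula to code, the sketch below evaluates the log-density term by term and compares it with `scipy.stats.multivariate_normal`. The mean, covariance, and query point are made-up values for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])   # symmetric positive definite
x = np.array([0.5, 0.0])

d = len(mu)
diff = x - mu
# Solve Sigma @ v = diff instead of forming the explicit inverse
maha = diff @ np.linalg.solve(Sigma, diff)       # (x - mu)^T Sigma^{-1} (x - mu)
sign, logdet = np.linalg.slogdet(Sigma)          # log |Sigma|, computed stably
log_pdf_manual = -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

log_pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).logpdf(x)
print(f"Manual log-density: {log_pdf_manual:.6f}")
print(f"scipy log-density:  {log_pdf_scipy:.6f}")
```

Working in log space with `slogdet` and `solve` is a common design choice because the raw determinant and inverse can be numerically fragile in higher dimensions.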

ML Connection

| ML Application | Role of Gaussian |
|---|---|
| Linear Regression | Noise model: $y = \mathbf{w}^T\mathbf{x} + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$ |
| Weight Initialization | He initialization: $W \sim \mathcal{N}(0, 2/n_{in})$ |
| L2 Regularization | Equivalent to Gaussian prior on weights |
| VAE Latent Space | $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$, prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ |
| Gaussian Processes | Any finite set of function values is jointly multivariate Gaussian |
| BatchNorm | Activations are standardized to zero mean and unit variance per feature (this does not by itself make them Gaussian) |
| Diffusion Models | Forward process adds Gaussian noise; reverse process learns to denoise |

import numpy as np
from scipy.stats import norm, multivariate_normal

# Standard Normal
print("Standard Normal N(0,1):")
print(f" P(|X| < 1) = {norm.cdf(1) - norm.cdf(-1):.4f} (68.27% rule)")
print(f" P(|X| < 2) = {norm.cdf(2) - norm.cdf(-2):.4f} (95.45% rule)")
print(f" P(|X| < 3) = {norm.cdf(3) - norm.cdf(-3):.4f} (99.73% rule)")

# Linear regression noise model: MSE loss = MLE under Gaussian noise
np.random.seed(42)
n = 200
X = np.random.randn(n, 2)
w_true = np.array([2.0, -1.5])
sigma_noise = 0.5
y = X @ w_true + sigma_noise * np.random.randn(n)

# MLE for linear regression = OLS
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"\nLinear Regression (MLE = OLS):")
print(f" True weights: {w_true}")
print(f" Estimated weights: {w_hat.round(4)}")

# L2 regularization = Gaussian prior on weights
# Ridge with strength lambda is the MAP estimate under a N(0, sigma_noise^2 / lambda) prior
lam = 0.1 # regularization strength
n_feat = X.shape[1]
w_map = np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)
print(f" MAP (L2 reg, lambda={lam}): {w_map.round(4)}")

# VAE: reparameterization trick with Gaussian latent space
def reparameterize(mu, log_var):
    """
    Sample z ~ N(mu, exp(log_var)) using reparameterization trick.
    z = mu + std * epsilon, epsilon ~ N(0, I)
    Enables backpropagation through the sampling step.
    """
    std = np.exp(0.5 * log_var)
    epsilon = np.random.randn(*mu.shape)
    return mu + std * epsilon

# Encoder outputs
mu = np.array([1.0, -0.5, 0.3])
log_var = np.array([-0.5, 0.2, -1.0])

z_sample = reparameterize(mu, log_var)
print(f"\nVAE Reparameterization:")
print(f" mu = {mu}")
print(f" std = {np.exp(0.5 * log_var).round(4)}")
print(f" z sample= {z_sample.round(4)}")

7. Exponential Distribution

Definition

$$X \sim \text{Exponential}(\lambda)$$

$$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$

| Property | Value |
|---|---|
| Parameter | $\lambda > 0$ (rate) |
| Mean | $1/\lambda$ |
| Variance | $1/\lambda^2$ |
| Support | $[0, \infty)$ |

The memoryless property: $P(X > s + t \mid X > s) = P(X > t)$ - the future waiting time doesn't depend on how long you've already waited.
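The memoryless property can be verified numerically from the survival function $P(X > x) = e^{-\lambda x}$. A small sketch, with the rate and the values of `s` and `t` chosen arbitrarily:

```python
from scipy.stats import expon

rate = 0.5
rv = expon(scale=1 / rate)  # scipy parameterizes the Exponential by scale = 1 / rate

s, t = 2.0, 3.0
# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
cond = rv.sf(s + t) / rv.sf(s)
uncond = rv.sf(t)

print(f"P(X > s+t | X > s) = {cond:.6f}")
print(f"P(X > t)           = {uncond:.6f}")
```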

ML Connection

  • Survival analysis: Time until model failure or user churn
  • Neural ODE connections: Exponential decay in continuous-time systems
  • Activation statistics: Frequencies in certain sparse models

8. Beta Distribution

Definition

$$X \sim \text{Beta}(\alpha, \beta)$$

$$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad x \in [0, 1]$$

where $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ is the Beta function.

| Property | Value |
|---|---|
| Parameter | $\alpha, \beta > 0$ |
| Mean | $\alpha/(\alpha+\beta)$ |
| Variance | $\alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$ |
| Support | $[0, 1]$ |

Shape Families

| $\alpha$, $\beta$ | Shape |
|---|---|
| $\alpha = \beta = 1$ | Uniform on $[0,1]$ |
| $\alpha = \beta > 1$ | Symmetric, bell-shaped, centered at 0.5 |
| $\alpha > \beta$ | Mass shifted toward 1 (mean above 0.5) |
| $\alpha < 1$, $\beta < 1$ | U-shaped (concentrated near 0 and 1) |
| $\alpha, \beta \gg 1$ | Concentrated around $\alpha/(\alpha+\beta)$ |

Beta distribution shapes:

  • Beta(1,1): flat (uniform) density on $[0,1]$
  • Beta(2,5): single peak at $x = 0.2$, with a long tail toward 1
  • Beta(0.5,0.5): U-shaped, with density rising sharply near 0 and 1

ML Connection

  • Conjugate prior for Bernoulli/Binomial: If $p \sim \text{Beta}(\alpha, \beta)$ and you observe $k$ successes in $n$ trials, the posterior is $\text{Beta}(\alpha+k, \beta+n-k)$
  • Bayesian A/B testing: Model click-through rate as Beta, update with observations
  • $\beta$-VAE (shared name only): the $\beta$ there is a scalar weight on the KL term of the ELBO, not a Beta distribution

from scipy.stats import beta
import numpy as np

# Bayesian A/B testing with Beta-Bernoulli conjugacy
# Prior: Beta(1, 1) = Uniform (uninformative)
# Observe: 40 clicks out of 100 visits

alpha_prior, beta_prior = 1, 1
n_visits, n_clicks = 100, 40

# Posterior update: Beta(alpha + n_clicks, beta + n_visits - n_clicks)
alpha_post = alpha_prior + n_clicks
beta_post = beta_prior + (n_visits - n_clicks)

posterior = beta(alpha_post, beta_post)
print(f"Prior: Beta({alpha_prior}, {beta_prior})")
print(f"Data: {n_clicks} clicks in {n_visits} visits")
print(f"Posterior: Beta({alpha_post}, {beta_post})")
print(f" Posterior mean: {posterior.mean():.4f} (MLE: {n_clicks/n_visits})")
print(f" Posterior std: {posterior.std():.4f}")
print(f" 95% credible interval: [{posterior.ppf(0.025):.4f}, {posterior.ppf(0.975):.4f}]")

9. Dirichlet Distribution

Definition

The Dirichlet is the multivariate generalization of the Beta distribution, defined over the probability simplex $\Delta^{K-1}$:

$$\mathbf{p} \sim \text{Dirichlet}(\boldsymbol{\alpha}), \quad \boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_K]$$

$$f(\mathbf{p}) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^K p_k^{\alpha_k - 1}$$

Support: $\mathbf{p} \in \Delta^{K-1}$ (all $p_k \geq 0$, $\sum_k p_k = 1$)

| Property | Value |
|---|---|
| Mean of $k$-th component | $\alpha_k / \sum_j \alpha_j$ |
| Marginals | Each $p_k \sim \text{Beta}(\alpha_k, \sum_{j \neq k} \alpha_j)$ |
| Concentration | Controlled by $\alpha_0 = \sum_k \alpha_k$ |

Concentration Parameter

The total concentration $\alpha_0 = \sum_k \alpha_k$ controls the spread:

  • $\alpha_k = 1$ for all $k$ (i.e., $\alpha_0 = K$): Uniform over the simplex
  • $\alpha_k \gg 1$: Concentrated near the mean $\boldsymbol{\alpha}/\alpha_0$ (low variance)
  • $\alpha_k < 1$: Sparse - distribution concentrated near corners of the simplex

Symmetric examples with 3 classes:

| Dirichlet(1,1,1) | Dirichlet(10,10,10) | Dirichlet(0.1,0.1,0.1) |
|---|---|---|
| Uniform over triangle (any mix is likely) | Concentrated center (balanced likely) | Concentrated at corners (one class dominates) |
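The effect of the concentration parameter can be seen by sampling. The sketch below draws from symmetric Dirichlet distributions with three classes and reports the average size of the largest component; that summary statistic is an illustrative choice, not something defined in the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3

avg_max = {}
for a in (0.1, 1.0, 10.0):
    samples = rng.dirichlet(np.full(K, a), size=20_000)
    # Largest component per sample: near 1 means one class dominates,
    # near 1/K means the classes are balanced
    avg_max[a] = samples.max(axis=1).mean()
    print(f"alpha_k = {a:>4}: average largest component = {avg_max[a]:.3f}")
```

Smaller $\alpha_k$ should push the average largest component toward 1 (sparse, corner-heavy samples), while larger $\alpha_k$ should pull it toward $1/K$.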

ML Connection

  • Conjugate prior for Categorical/Multinomial: If $\mathbf{p} \sim \text{Dirichlet}(\boldsymbol{\alpha})$ and you observe counts $\mathbf{c}$, the posterior is $\text{Dirichlet}(\boldsymbol{\alpha} + \mathbf{c})$
  • Latent Dirichlet Allocation (LDA): Topics are Dirichlet-distributed mixtures of words; documents are Dirichlet-distributed mixtures of topics
  • Prior over softmax outputs: In Bayesian neural networks, a Dirichlet prior over class probabilities
  • Data augmentation (Mixup): The mixing coefficient for a pair of examples is drawn from $\text{Beta}(\alpha, \alpha)$, the two-component case of a symmetric Dirichlet; mixing more than two examples uses $\text{Dirichlet}(\boldsymbol{\alpha})$ weights

import numpy as np
from scipy.stats import dirichlet

# Dirichlet as prior for multinomial observations
K = 4 # number of classes
alpha_prior = np.ones(K) # symmetric Dirichlet(1,1,1,1) = uniform

# Observed counts: 30 A, 25 B, 20 C, 25 D
counts = np.array([30, 25, 20, 25])

# Posterior update (Dirichlet-Multinomial conjugacy)
alpha_post = alpha_prior + counts
post_rv = dirichlet(alpha_post)

print(f"Dirichlet-Multinomial Bayesian Update:")
print(f" Prior: Dirichlet({alpha_prior})")
print(f" Observed: counts = {counts}")
print(f" Posterior: Dirichlet({alpha_post})")
print(f" Posterior mean: {(alpha_post / alpha_post.sum()).round(4)}")
print(f" MLE: {(counts / counts.sum()).round(4)}")

# Sample from posterior
samples = post_rv.rvs(size=5)
print(f"\n 5 samples from posterior:")
for s in samples:
    print(f" {s.round(4)}")

10. Distribution Cheat Sheet for ML

DISTRIBUTION QUICK REFERENCE
═══════════════════════════════════════════════════════════════════

Bernoulli(p)    → Binary output, dropout mask
    P(X=1)=p, P(X=0)=1-p, E=p, Var=p(1-p)

Binomial(n,p)   → Count of successes in n trials
    P(X=k)=C(n,k)p^k(1-p)^(n-k), E=np, Var=np(1-p)

Categorical(p)  → Multiclass output (softmax)
    P(X=k)=p_k, generalizes Bernoulli to K classes

Poisson(λ)      → Event count per interval
    P(X=k)=λ^k e^(-λ) / k!, E=Var=λ

Uniform(a,b)    → Weight initialization, random search
    f(x)=1/(b-a), E=(a+b)/2, Var=(b-a)²/12

Normal(μ,σ²)    → Regression noise, latent space, CLT limit
    f(x)=exp(-(x-μ)²/(2σ²))/(σ√(2π)), E=μ, Var=σ²

Exponential(λ)  → Waiting times, survival analysis
    f(x)=λe^(-λx), E=1/λ, Var=1/λ², memoryless

Beta(α,β)       → Prior over probability, Bayesian A/B test
    Support=[0,1], conjugate to Bernoulli, E=α/(α+β)

Dirichlet(α)    → Prior over probability vector
    Support=simplex, conjugate to Categorical, generalizes Beta

11. Interview Q&A

Q1: Why is the Gaussian distribution so prevalent in machine learning?

A: The Gaussian appears everywhere for four distinct reasons. First, the Central Limit Theorem: suitably normalized averages of many independent random variables with finite variance converge to a Gaussian, regardless of the shape of the underlying distribution. Since many ML quantities (gradients, losses, activations) are sums over many data points or neurons, they are often approximately Gaussian. Second, maximum entropy: among all distributions with fixed mean $\mu$ and variance $\sigma^2$, the Gaussian has the maximum entropy (is the least informative), making it a natural default. Third, mathematical tractability: sums of independent Gaussians are Gaussian; the Gaussian is closed under linear transformations; the multivariate Gaussian has a closed-form density and closed-form marginals and conditionals. Fourth, conjugacy: Gaussian is self-conjugate (Gaussian prior on the mean + Gaussian likelihood = Gaussian posterior), enabling closed-form Bayesian updates.

Q2: What is the relationship between L2 regularization and a Gaussian prior on weights?

A: From a Bayesian perspective, L2 regularization is equivalent to placing an independent Gaussian prior $w_j \sim \mathcal{N}(0, 1/\lambda)$ on each weight. The MAP estimate maximizes the log posterior: $\log P(\mathbf{w} \mid \mathcal{D}) = \log P(\mathcal{D} \mid \mathbf{w}) + \log P(\mathbf{w}) + \text{const}$. The Gaussian prior contributes $\log P(\mathbf{w}) = -\frac{\lambda}{2}\|\mathbf{w}\|^2 + \text{const}$. So, with $\mathcal{L}(\mathbf{w}) = -\log P(\mathcal{D} \mid \mathbf{w})$ the negative log-likelihood, MAP is equivalent to minimizing $\mathcal{L}(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2$, which is exactly L2 regularized loss minimization. The regularization strength $\lambda$ controls the tightness of the Gaussian prior: larger $\lambda$ means a narrower prior (more weight toward zero, stronger regularization).

Q3: What is the Dirichlet distribution and how does it appear in NLP models?

A: The Dirichlet distribution is a distribution over probability vectors (the probability simplex). If $\mathbf{p} \sim \text{Dirichlet}(\boldsymbol{\alpha})$, then $\mathbf{p}$ is a random vector with non-negative components summing to 1. It is the conjugate prior for the Categorical/Multinomial distribution. In NLP, Latent Dirichlet Allocation (LDA) uses Dirichlet distributions to model the generative process of documents: topics are Dirichlet-distributed mixtures of words ($\boldsymbol{\phi}_k \sim \text{Dirichlet}(\boldsymbol{\beta})$), and documents are Dirichlet-distributed mixtures of topics ($\boldsymbol{\theta}_d \sim \text{Dirichlet}(\boldsymbol{\alpha})$). The concentration parameter $\alpha_0 = \sum_k \alpha_k$ controls how sparse or spread-out the topic mixture is.

Q4: What is the Beta-Bernoulli conjugacy and why is it useful for Bayesian A/B testing?

A: Conjugacy means that when the prior and likelihood come from matching distribution families, the posterior is in the same family as the prior - enabling closed-form updates. For Beta-Bernoulli: if the prior on conversion rate is $p \sim \text{Beta}(\alpha, \beta)$, and we observe $k$ conversions in $n$ trials (Binomial likelihood), the posterior is $p \mid \text{data} \sim \text{Beta}(\alpha + k, \beta + n - k)$. This is extremely useful for A/B testing: we can initialize with an uninformative prior $\text{Beta}(1,1)$, observe user interactions sequentially, update the posterior with each observation, and at any point query the posterior to compute: probability that variant A beats variant B, expected conversion rate with uncertainty, credible intervals. No approximation or MCMC is needed for the posterior itself - each update is exact and $O(1)$.
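One of the queries mentioned in this answer, the probability that variant A beats variant B, can be estimated by sampling from the two exact Beta posteriors. The sketch below uses made-up conversion counts and a Beta(1, 1) prior; the posteriors are exact, while the comparison itself is a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: (conversions, trials) per variant
data = {"A": (48, 400), "B": (36, 400)}
prior_alpha, prior_beta = 1.0, 1.0  # uninformative Beta(1, 1) prior

posterior_samples = {}
for name, (k, n) in data.items():
    # Conjugate update: Beta(alpha + k, beta + n - k)
    posterior_samples[name] = rng.beta(prior_alpha + k, prior_beta + n - k, size=200_000)
    print(f"Variant {name}: posterior mean = {posterior_samples[name].mean():.4f}")

prob_a_beats_b = np.mean(posterior_samples["A"] > posterior_samples["B"])
print(f"Estimated P(rate_A > rate_B) = {prob_a_beats_b:.3f}")
```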

Q5: Why does cross-entropy loss correspond to maximum likelihood under a Categorical distribution?

A: Given a model that outputs class probabilities $\mathbf{p} = \text{softmax}(\mathbf{z})$, we model the true label $y$ as $y \sim \text{Categorical}(\mathbf{p})$. The likelihood of observing the true label $y^* = k$ is $P(y = k \mid \mathbf{x}) = p_k$. The log-likelihood is $\log p_k$. MLE maximizes the log-likelihood over the dataset: $\arg\max_\theta \frac{1}{N}\sum_i \log p_{y_i^*}$. Negating to get a minimization objective: $\arg\min_\theta -\frac{1}{N}\sum_i \log p_{y_i^*}$. This is exactly the cross-entropy loss: $\mathcal{L} = -\frac{1}{N}\sum_i \sum_k y_{ik} \log p_{ik}$, where $y_{ik}$ is the one-hot label (so only the true class term survives). Cross-entropy minimization is therefore exactly MLE under a Categorical output distribution - a direct consequence of the probabilistic interpretation.
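The equality between the one-hot cross-entropy sum and the average negative log-probability of the true class can be checked on a small batch. The logits and labels below are arbitrary illustrative values, and `log_softmax` is a helper written for this sketch.

```python
import numpy as np

def log_softmax(z):
    """Row-wise log-softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0],
                   [1.0, 1.0, 1.0]])
labels = np.array([0, 2, 1])  # true class index per example

log_p = log_softmax(logits)

# Cross-entropy written with one-hot labels: -(1/N) sum_i sum_k y_ik log p_ik
one_hot = np.eye(logits.shape[1])[labels]
ce_one_hot = -(one_hot * log_p).sum(axis=1).mean()

# Categorical negative log-likelihood: -(1/N) sum_i log p_{y_i*}
nll = -log_p[np.arange(len(labels)), labels].mean()

print(f"Cross-entropy (one-hot form): {ce_one_hot:.6f}")
print(f"Categorical NLL:              {nll:.6f}")
```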


© 2026 EngineersOfAI. All rights reserved.