Joint and Marginal Distributions

Reading time: ~45 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist

The ML Scenario That Motivates This Lesson

You're building a Variational Autoencoder (VAE). The model has:

  • Observed variable: $\mathbf{x}$ (the image)
  • Latent variable: $\mathbf{z}$ (the compressed representation)

The VAE defines a joint distribution $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z})$. Maximum-likelihood training requires the marginal $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$, which is intractable when $p(\mathbf{x} \mid \mathbf{z})$ is parameterized by a neural network. The ELBO (evidence lower bound) is a tractable lower bound on $\log p(\mathbf{x})$, derived using Jensen's inequality.

Understanding joint and marginal distributions is the prerequisite for understanding virtually every generative model: VAEs, GANs, diffusion models, Bayesian networks, hidden Markov models, and Gaussian processes all reason about joint distributions over multiple variables.

1. Joint Distributions

A joint distribution describes the probability behavior of two or more random variables simultaneously.

Discrete Joint Distribution

For discrete random variables $X$ and $Y$, the joint PMF is:

$$p_{X,Y}(x, y) = P(X = x, Y = y)$$

Requirements: $p_{X,Y}(x, y) \geq 0$ for all $(x, y)$, and $\sum_x \sum_y p_{X,Y}(x, y) = 1$.

Example:

| | $Y=0$ | $Y=1$ | $Y=2$ |
|---|---|---|---|
| $X=0$ | 0.10 | 0.05 | 0.05 |
| $X=1$ | 0.20 | 0.15 | 0.05 |
| $X=2$ | 0.05 | 0.20 | 0.15 |

Table entries sum to 1.0.

Continuous Joint Distribution

For continuous random variables $X$ and $Y$, the joint PDF $f_{X,Y}(x, y)$ satisfies:

$$P((X,Y) \in A) = \iint_A f_{X,Y}(x, y) \, dx \, dy$$

with $f_{X,Y}(x, y) \geq 0$ and $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx \, dy = 1$.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Bivariate Gaussian joint distribution
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.5]])

rv = multivariate_normal(mean=mu, cov=Sigma)

# Evaluate joint PDF on a grid
x1 = np.linspace(-3, 5, 100)
x2 = np.linspace(-2, 6, 100)
X1, X2 = np.meshgrid(x1, x2)
pos = np.dstack([X1, X2])

Z = rv.pdf(pos) # joint PDF values

print(f"Joint PDF max value: {Z.max():.4f}")
print(f"Approximate integral (should be 1): {Z.sum() * (x1[1]-x1[0]) * (x2[1]-x2[0]):.4f}")

# Sample from the joint distribution
samples = rv.rvs(size=1000, random_state=42)
print(f"\nSamples shape: {samples.shape}")
print(f"Empirical mean: {samples.mean(axis=0).round(4)}")
print(f"Empirical cov:\n{np.cov(samples.T).round(4)}")

2. Marginal Distributions

The marginal distribution of $X$ is obtained by "integrating out" (or, for discrete variables, "summing out") the other variable $Y$:

Discrete Marginalization

$$p_X(x) = \sum_y p_{X,Y}(x, y)$$

"Sum along rows (or columns) of the joint probability table."

Using our example table:

  • $P(X=0) = 0.10 + 0.05 + 0.05 = 0.20$
  • $P(X=1) = 0.20 + 0.15 + 0.05 = 0.40$
  • $P(X=2) = 0.05 + 0.20 + 0.15 = 0.40$
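The row sums above can be reproduced with NumPy. This is a minimal sketch that stores the example table as an array (the variable names `joint`, `p_x`, and `p_y` are illustrative choices, not from the lesson) and sums along one axis to get each marginal.

```python
import numpy as np

# Joint PMF from the example table: rows index X = 0, 1, 2; columns index Y = 0, 1, 2.
joint = np.array([
    [0.10, 0.05, 0.05],
    [0.20, 0.15, 0.05],
    [0.05, 0.20, 0.15],
])

# Summing over axis=1 sums out Y and leaves the marginal of X.
p_x = joint.sum(axis=1)
# Summing over axis=0 sums out X and leaves the marginal of Y.
p_y = joint.sum(axis=0)

print("Total probability:", round(joint.sum(), 10))
print("P(X):", p_x.round(2))
print("P(Y):", p_y.round(2))
```

The same pattern extends to more variables: the marginal over any subset is the sum of the joint array over all the other axes.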

Continuous Marginalization

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy$$

[Figure: the joint density f(x, y) over the (x, y) plane, next to the marginal f_X(x) = ∫ f(x, y) dy. The marginal is higher at values of x where the joint has more mass.]

from scipy.stats import multivariate_normal, norm
import numpy as np

# Bivariate Gaussian marginals
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.5]])

# Marginals of bivariate Gaussian are univariate Gaussian
# X ~ N(mu[0], Sigma[0,0]), Y ~ N(mu[1], Sigma[1,1])
print("Bivariate Gaussian marginals:")
print(f" X ~ N(mu={mu[0]}, sigma^2={Sigma[0,0]}) -> std={np.sqrt(Sigma[0,0]):.4f}")
print(f" Y ~ N(mu={mu[1]}, sigma^2={Sigma[1,1]}) -> std={np.sqrt(Sigma[1,1]):.4f}")

# Verify numerically
samples = multivariate_normal(mean=mu, cov=Sigma).rvs(size=100_000, random_state=42)
print(f"\nEmpirical marginal of X: mean={samples[:,0].mean():.4f}, std={samples[:,0].std():.4f}")
print(f"Empirical marginal of Y: mean={samples[:,1].mean():.4f}, std={samples[:,1].std():.4f}")
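The continuous marginalization integral can also be checked directly, without sampling. The sketch below approximates the integral over y with a Riemann sum on a dense grid and compares it with the closed-form univariate Gaussian. The grid width (10 standard deviations either side) and the helper name `marginal_x_numeric` are assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.5]])
rv = multivariate_normal(mean=mu, cov=Sigma)

# Dense grid over y; +/- 10 standard deviations captures essentially all the mass.
sigma_y = np.sqrt(Sigma[1, 1])
y_grid = np.linspace(mu[1] - 10 * sigma_y, mu[1] + 10 * sigma_y, 4001)
dy = y_grid[1] - y_grid[0]

def marginal_x_numeric(x):
    """Approximate f_X(x) = integral of f_{X,Y}(x, y) dy with a Riemann sum."""
    points = np.column_stack([np.full_like(y_grid, x), y_grid])
    return rv.pdf(points).sum() * dy

for x in [-1.0, 1.0, 2.5]:
    numeric = marginal_x_numeric(x)
    exact = norm.pdf(x, loc=mu[0], scale=np.sqrt(Sigma[0, 0]))
    print(f"x={x:5.2f}  numeric={numeric:.6f}  closed form={exact:.6f}")
```

Grid integration like this only works in low dimensions; the cost grows exponentially with the number of variables being integrated out, which is the practical reason marginalization becomes hard.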

3. Conditional Distribution from Joint

The conditional distribution of $X$ given $Y = y$ is:

$$f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

This is the definition of the conditional density, valid wherever $f_Y(y) > 0$. Bayes' theorem follows from it by writing the joint as $f_{Y \mid X}(y \mid x) \, f_X(x)$. Given the joint, we can compute any conditional by dividing by the appropriate marginal.

import numpy as np
from scipy.stats import multivariate_normal, norm

# Conditional distribution of bivariate Gaussian
# If (X,Y) ~ N(mu, Sigma), then:
# X | Y=y ~ N(mu_X + rho*(sigma_X/sigma_Y)*(y - mu_Y), sigma_X^2*(1-rho^2))

mu_x, mu_y = 1.0, 2.0
sigma_x, sigma_y = np.sqrt(2.0), np.sqrt(1.5)
rho = 1.0 / (sigma_x * sigma_y) # Sigma[0,1] / (sigma_x * sigma_y)

def conditional_gaussian(y_obs, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Compute parameters of X | Y=y_obs for bivariate Gaussian."""
    mu_cond = mu_x + rho * (sigma_x / sigma_y) * (y_obs - mu_y)
    var_cond = sigma_x**2 * (1 - rho**2)
    return mu_cond, var_cond

y_obs = 3.0
mu_c, var_c = conditional_gaussian(y_obs, mu_x, mu_y, sigma_x, sigma_y, rho)
print(f"X | Y={y_obs}: mean={mu_c:.4f}, std={np.sqrt(var_c):.4f}")
print(f" (unconditional X: mean={mu_x}, std={sigma_x:.4f})")
print(f" Conditioning on Y narrows our uncertainty about X")

Key Relationship

$$f_{X,Y}(x, y) = f_{X \mid Y}(x \mid y) \cdot f_Y(y) = f_{Y \mid X}(y \mid x) \cdot f_X(x)$$

From the joint, you can compute:

  • Marginals: integrate/sum out the other variable
  • Conditionals: divide joint by the appropriate marginal
  • Everything you need for Bayesian inference
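To make these operations concrete, here is a small sketch on the discrete example table from Section 1. It computes both conditionals by dividing by the relevant marginal, checks that both factorizations rebuild the joint, and recovers $P(X \mid Y)$ from $P(Y \mid X)$ via Bayes' rule. The array names are illustrative.

```python
import numpy as np

# Joint PMF: rows index X = 0, 1, 2; columns index Y = 0, 1, 2.
joint = np.array([
    [0.10, 0.05, 0.05],
    [0.20, 0.15, 0.05],
    [0.05, 0.20, 0.15],
])
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# Conditionals: divide the joint by the appropriate marginal.
p_x_given_y = joint / p_y[np.newaxis, :]   # column y holds P(X | Y=y)
p_y_given_x = joint / p_x[:, np.newaxis]   # row x holds P(Y | X=x)

# Both factorizations rebuild the same joint.
joint_from_xy = p_x_given_y * p_y[np.newaxis, :]
joint_from_yx = p_y_given_x * p_x[:, np.newaxis]

# Bayes' rule: P(X | Y) = P(Y | X) P(X) / P(Y).
bayes = p_y_given_x * p_x[:, np.newaxis] / p_y[np.newaxis, :]

print("P(X | Y=1):", p_x_given_y[:, 1].round(4))
print("Factorizations agree:", np.allclose(joint_from_xy, joint_from_yx))
print("Bayes' rule matches:", np.allclose(bayes, p_x_given_y))
```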

4. Independence in Terms of Joint Distribution

Random variables $X$ and $Y$ are independent if and only if their joint distribution factorizes:

$$f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y) \quad \text{for all } (x, y)$$

Equivalently:

  • $f_{X \mid Y}(x \mid y) = f_X(x)$ (conditioning on $Y$ tells us nothing about $X$)
  • The joint PDF/PMF can be written as a product of marginals

import numpy as np

np.random.seed(42)
n = 100_000

# Independent case
X_ind = np.random.randn(n)
Y_ind = np.random.randn(n)

# Dependent case (positive correlation)
X_dep = np.random.randn(n)
Y_dep = 0.8 * X_dep + 0.6 * np.random.randn(n)

# Test independence: for independent RVs,
# P(X in A, Y in B) = P(X in A) * P(Y in B)

A = (X_ind > 0)
B_ind = (Y_ind > 0)
B_dep = (Y_dep > 0)

# Independent case
p_A = A.mean()
p_B_ind = B_ind.mean()
p_AB_ind = (A & B_ind).mean()
print("Independent case:")
print(f" P(X>0) * P(Y>0) = {p_A * p_B_ind:.4f}")
print(f" P(X>0, Y>0) = {p_AB_ind:.4f}")
print(f" Difference: {abs(p_A * p_B_ind - p_AB_ind):.5f} (near 0 = independent)")

# Dependent case (use the marginal of X_dep, not X_ind)
A2 = (X_dep > 0)
p_A2 = A2.mean()
p_B_dep = B_dep.mean()
p_AB_dep = (A2 & B_dep).mean()
print("\nDependent case (rho=0.8):")
print(f" P(X>0) * P(Y>0) = {p_A2 * p_B_dep:.4f}")
print(f" P(X>0, Y>0) = {p_AB_dep:.4f}")
print(f" Difference: {abs(p_A2 * p_B_dep - p_AB_dep):.5f} (clearly > 0 = dependent)")
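For a discrete joint, the factorization test is exact rather than statistical: compare the table with the outer product of its marginals. A minimal sketch using the example table from Section 1 (names such as `product` and `max_gap` are illustrative):

```python
import numpy as np

joint = np.array([
    [0.10, 0.05, 0.05],
    [0.20, 0.15, 0.05],
    [0.05, 0.20, 0.15],
])
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# Under independence the joint would equal the outer product of the marginals.
product = np.outer(p_x, p_y)
max_gap = np.abs(joint - product).max()
independent = np.allclose(joint, product)

print("p(x) p(y):\n", product.round(3))
print(f"Largest |p(x,y) - p(x)p(y)|: {max_gap:.3f}")
print("Independent:", independent)
```

Here the largest gap is at $(X=2, Y=0)$, where the joint assigns 0.05 but the product of marginals gives 0.14, so $X$ and $Y$ in the example table are dependent.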

5. The Multivariate Gaussian: The Central Distribution for ML

The multivariate Gaussian $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is the most important joint distribution in ML.

Definition

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

Beautiful Properties

| Property | Statement |
|---|---|
| Marginals | Any marginal of a multivariate Gaussian is Gaussian |
| Conditionals | Any conditional of a multivariate Gaussian is Gaussian |
| Linear transformations | $A\mathbf{X} + \mathbf{b} \sim \mathcal{N}(A\boldsymbol{\mu}+\mathbf{b}, A\boldsymbol{\Sigma}A^T)$ |
| Sum of independent Gaussians | Also Gaussian |
| Uncorrelated = Independent | For jointly Gaussian variables: $\Sigma_{ij}=0$ implies $X_i \perp X_j$ |

The last property relies on joint Gaussianity - in general, uncorrelated $\neq$ independent, even when each variable is individually Gaussian.
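A standard counterexample shows why uncorrelated does not imply independent in general. In this sketch, $Y = X^2$ is a deterministic function of a standard normal $X$, yet their covariance is $E[X^3] = 0$. The seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
y = x**2   # fully determined by x, so clearly dependent on it

# Linear correlation is (approximately) zero because Cov(X, X^2) = E[X^3] = 0.
corr = np.corrcoef(x, y)[0, 1]

# Independence would require P(Y > 1 | |X| > 1) == P(Y > 1).
p_y_big = (y > 1).mean()
p_y_big_given_x_big = (y[np.abs(x) > 1] > 1).mean()

print(f"corr(X, Y)           = {corr:.4f}")
print(f"P(Y > 1)             = {p_y_big:.4f}")
print(f"P(Y > 1 | |X| > 1)   = {p_y_big_given_x_big:.4f}")
```

The pair $(X, X^2)$ is not jointly Gaussian, which is why the Gaussian shortcut of reading independence off the covariance does not apply.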

Marginal of Multivariate Gaussian

For $\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2\end{bmatrix}, \begin{bmatrix}\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22}\end{bmatrix}\right)$:

$$\mathbf{X}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$$
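One way to see why the marginal is just the corresponding block is to treat marginalization as a linear transformation with a selection matrix $A$, then apply the rule $A\mathbf{X} \sim \mathcal{N}(A\boldsymbol{\mu}, A\boldsymbol{\Sigma}A^T)$. The 3-dimensional mean and covariance below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical 3-D Gaussian parameters (illustrative values, positive definite).
mu = np.array([1.0, 2.0, -0.5])
Sigma = np.array([[2.0, 1.0, 0.3],
                  [1.0, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])

# Selection matrix that keeps the first two coordinates and drops the third.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

# Linear-transformation rule: A X ~ N(A mu, A Sigma A^T).
mu_marg = A @ mu
Sigma_marg = A @ Sigma @ A.T

print("Marginal mean:", mu_marg)
print("Marginal covariance:\n", Sigma_marg)
print("Equals the top-left block:", np.allclose(Sigma_marg, Sigma[:2, :2]))
```

In practice you would simply slice `mu[:2]` and `Sigma[:2, :2]`; the matrix form is shown only to connect the marginal to the linear-transformation property.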

Conditional of Multivariate Gaussian

$$\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2})$$

$$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$$

$$\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}$$

This is the Gaussian conditioning formula, and it underpins Gaussian Processes and Kalman filters.

import numpy as np
from scipy.stats import multivariate_normal

# Gaussian conditioning
# (X1, X2) ~ N([1,2], [[2,1],[1,1.5]])
# Compute X1 | X2 = x2_obs

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.5]])

x2_obs = 3.0

# Partition
mu1, mu2 = mu[0], mu[1]
S11 = Sigma[0,0]; S12 = Sigma[0,1]; S22 = Sigma[1,1]

# Conditional mean and variance
mu_cond = mu1 + S12 / S22 * (x2_obs - mu2)
sigma2_cond = S11 - S12**2 / S22

print(f"X1 | X2={x2_obs}:")
print(f" Conditional mean: {mu_cond:.4f} (marginal mean: {mu1})")
print(f" Conditional variance: {sigma2_cond:.4f} (marginal var: {S11})")
print(f" Variance reduction: {(1 - sigma2_cond/S11)*100:.1f}%")
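The scalar example generalizes to arbitrary blocks. The following sketch implements the block conditioning formulas for any index split; the function name `condition_gaussian` and its signature are assumptions made for this illustration. It uses `np.linalg.solve` rather than an explicit inverse, which is usually the more numerically stable choice.

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx1, idx2, x2):
    """Return mean and covariance of X[idx1] | X[idx2] = x2 for X ~ N(mu, Sigma)."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    x2 = np.atleast_1d(np.asarray(x2, dtype=float))

    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]

    # K = S12 @ inv(S22), computed by solving S22 @ K.T = S12.T (S22 is symmetric).
    K = np.linalg.solve(S22, S12.T).T

    mu_cond = mu[idx1] + K @ (x2 - mu[idx2])
    Sigma_cond = S11 - K @ S12.T   # S12.T is Sigma_21
    return mu_cond, Sigma_cond

# Reproduce the bivariate example: X1 | X2 = 3.0
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.5]])
mu_c, Sigma_c = condition_gaussian(mu, Sigma, idx1=[0], idx2=[1], x2=3.0)
print(f"Conditional mean: {mu_c[0]:.4f}")
print(f"Conditional variance: {Sigma_c[0, 0]:.4f}")
```

Note that the conditional covariance does not depend on the observed value `x2`, only on which variables are observed.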

6. Covariance Matrix as Joint Distribution Statistic

The covariance matrix $\boldsymbol{\Sigma}$ captures the second-order structure of a joint distribution. For ML:

Feature vector x = [x1, x2, ..., xd]
            │
       Σ = Cov(x)
            │
   ┌────────┴──────────────────────┐
   │                               │
Σ diagonal                   Off-diagonal Σ_ij
= feature variances          = covariance between features i, j
   │                               │
Used in:                     Used in:
- Feature scaling            - PCA (find axes of max variance)
- Normalization              - LDA (class-conditional cov)
- BatchNorm                  - Mahalanobis distance

Mahalanobis Distance

The Mahalanobis distance uses the inverse covariance to measure distances that account for correlations and different scales:

$$d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})}$$

The exponent of the Gaussian density is $-\frac{1}{2} d_M^2$, so the density depends on $\mathbf{x}$ only through its Mahalanobis distance from the mean. Points at equal Mahalanobis distance form ellipses aligned with the covariance structure (not circles as with Euclidean distance).

import numpy as np

# Mahalanobis vs Euclidean distance
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 2.0],
                  [2.0, 1.5]])

Sigma_inv = np.linalg.inv(Sigma)

points = np.array([[2.0, 0.0],   # point along x1 axis
                   [0.0, 1.0],   # point along x2 axis
                   [1.0, 0.8]])  # diagonal point

print(f"{'Point':<15} {'Euclidean':>12} {'Mahalanobis':>13}")
print("-" * 42)
for p in points:
    diff = p - mu
    d_euc = np.sqrt(diff @ diff)
    d_mah = np.sqrt(diff @ Sigma_inv @ diff)
    print(f"{str(p):<15} {d_euc:>12.4f} {d_mah:>13.4f}")

# In anomaly detection, Mahalanobis distance is used to score outliers
# under a multivariate Gaussian assumption
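Another way to read the Mahalanobis distance is as plain Euclidean distance after whitening the data. The sketch below, which assumes a Cholesky factorization $\boldsymbol{\Sigma} = LL^T$ as the whitening transform (one of several valid choices), checks that the two computations agree.

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 2.0],
                  [2.0, 1.5]])
point = np.array([1.0, 0.8])
diff = point - mu

# Direct definition, using a linear solve instead of an explicit inverse.
d_mah = np.sqrt(diff @ np.linalg.solve(Sigma, diff))

# Whitening: with Sigma = L @ L.T, the vector w = inv(L) @ diff has identity covariance.
L = np.linalg.cholesky(Sigma)
w = np.linalg.solve(L, diff)
d_whitened = np.linalg.norm(w)

print(f"Mahalanobis distance:          {d_mah:.4f}")
print(f"Euclidean distance (whitened): {d_whitened:.4f}")
```

The equality holds because $\mathbf{d}^T \boldsymbol{\Sigma}^{-1} \mathbf{d} = \mathbf{d}^T L^{-T} L^{-1} \mathbf{d} = \lVert L^{-1}\mathbf{d} \rVert^2$.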

7. ML Connection: Latent Variable Models

Latent variable models define a joint distribution $p(\mathbf{x}, \mathbf{z})$ where $\mathbf{x}$ is observed and $\mathbf{z}$ is latent (unobserved):

$$p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x} \mid \mathbf{z}) \cdot p(\mathbf{z})$$

The Marginalization Problem

The marginal likelihood of observed data requires integrating out the latent variable:

$$p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$$

For neural network models $p(\mathbf{x} \mid \mathbf{z}; \theta)$, this integral is intractable - there is no closed form.
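To see what the integral involves, it helps to look at a toy model where the answer is known. In the sketch below, the model $z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, \sigma_x^2)$ is an assumption chosen because its marginal is exactly $\mathcal{N}(0, 1 + \sigma_x^2)$. The naive Monte Carlo estimator averages the likelihood over prior samples; the values of `sigma_x`, `x_obs`, and `n_samples` are arbitrary.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma_x = 0.5        # decoder noise std (illustrative)
x_obs = 1.2          # a single observed value (illustrative)
n_samples = 200_000

# Toy latent variable model: z ~ N(0, 1), x | z ~ N(z, sigma_x^2).
z = rng.standard_normal(n_samples)
log_lik = norm.logpdf(x_obs, loc=z, scale=sigma_x)

# p(x) = E_{p(z)}[p(x | z)], estimated in log space for numerical stability.
log_p_mc = logsumexp(log_lik) - np.log(n_samples)

# Exact marginal for this linear-Gaussian model: x ~ N(0, 1 + sigma_x^2).
log_p_exact = norm.logpdf(x_obs, loc=0.0, scale=np.sqrt(1.0 + sigma_x**2))

print(f"Monte Carlo log p(x): {log_p_mc:.4f}")
print(f"Exact log p(x):       {log_p_exact:.4f}")
```

This estimator works here because the latent space is one-dimensional. With a high-dimensional $\mathbf{z}$ and a sharply peaked likelihood, almost no prior samples land where $p(\mathbf{x} \mid \mathbf{z})$ is non-negligible, so the variance becomes impractically large.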

VAE: ELBO as Lower Bound on $\log p(\mathbf{x})$

VAEs introduce an approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ and maximize:

$$\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - D_{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$$

The first term is the expected reconstruction log-likelihood (the negative of the reconstruction loss); the second is a regularizer that pushes the approximate posterior toward the prior.
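For reference, the bound follows in a few lines from Jensen's inequality. This is the standard derivation, written with the symbols used in this section, and it assumes $q_\phi(\mathbf{z} \mid \mathbf{x}) > 0$ wherever $p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) > 0$:

$$
\begin{aligned}
\log p_\theta(\mathbf{x})
&= \log \int p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z} \\
&= \log \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[ \frac{p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})} \right] \\
&\geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[ \log \frac{p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})} \right] \\
&= \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - D_{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))
= \mathcal{L}(\theta, \phi; \mathbf{x})
\end{aligned}
$$

The inequality is Jensen's inequality applied to the concave $\log$, and the last step splits the log ratio into the likelihood term and the log ratio of $p(\mathbf{z})$ to $q_\phi(\mathbf{z} \mid \mathbf{x})$, whose expectation under $q_\phi$ is the negative KL divergence.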

VAE joint distribution:

p(x, z) = p(x | z) · p(z)
            │          │
            │          └── Prior: z ~ N(0, I)
            │
            └── Decoder: p(x | z; θ) - neural network

q(z | x; φ) = N(μ_φ(x), diag(σ_φ(x)²))
  └── Encoder (approximate posterior)

Training: maximize ELBO = E_q[log p(x|z)] - KL(q || p)

import numpy as np

def elbo_gaussian_vae(x, mu_enc, log_var_enc, x_recon, sigma_dec=1.0):
    """
    Compute a single-sample ELBO estimate for one data point in a Gaussian VAE.

    Args:
        x: observed data (d_x,)
        mu_enc, log_var_enc: encoder parameters (d_z,)
        x_recon: decoder output for one sampled z (d_x,)
        sigma_dec: decoder noise std

    Returns:
        Tuple (elbo, reconstruction term, negative KL), each a scalar
    """
    # Reconstruction term: Gaussian log-likelihood log p(x | z),
    # = -0.5 * sum((x - x_recon)^2) / sigma_dec^2, dropping the additive constant.
    # Despite the name, this is a log-likelihood (higher is better), not a loss.
    recon_loss = -0.5 * np.sum((x - x_recon)**2) / sigma_dec**2

    # KL divergence: KL(N(mu, diag(sigma^2)) || N(0, I))
    # = 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2))
    kl_div = 0.5 * np.sum(np.exp(log_var_enc) + mu_enc**2 - 1 - log_var_enc)

    elbo = recon_loss - kl_div
    return elbo, recon_loss, -kl_div

# Example
d_x, d_z = 784, 16
np.random.seed(42)

x = np.random.randn(d_x) # original image (flattened)
mu_enc = np.random.randn(d_z) * 0.1 # encoder mean
log_var = np.random.randn(d_z) * 0.1 # encoder log-variance
x_recon = x + 0.1 * np.random.randn(d_x) # reconstructed image

elbo, recon, kl = elbo_gaussian_vae(x, mu_enc, log_var, x_recon)
print(f"ELBO components:")
print(f" Reconstruction: {recon:.4f}")
print(f" -KL divergence: {kl:.4f}")
print(f" ELBO: {elbo:.4f}")
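The claim that the ELBO is a lower bound, tight when the approximate posterior equals the true posterior, can be checked in a model where everything is available in closed form. This sketch assumes the toy model $z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z, 1)$, for which $p(x) = \mathcal{N}(0, 2)$ and $p(z \mid x) = \mathcal{N}(x/2, 1/2)$. The helper `elbo_linear_gaussian` is a hypothetical name.

```python
import numpy as np
from scipy.stats import norm

def elbo_linear_gaussian(x, m, s2):
    """Closed-form ELBO for z ~ N(0,1), x|z ~ N(z,1), with q(z|x) = N(m, s2)."""
    # E_q[log N(x; z, 1)] = -0.5*log(2*pi) - 0.5*((x - m)^2 + s2)
    expected_log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    # KL(N(m, s2) || N(0, 1))
    kl = 0.5 * (s2 + m**2 - 1.0 - np.log(s2))
    return expected_log_lik - kl

x = 1.0
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))

elbo_at_posterior = elbo_linear_gaussian(x, m=x / 2, s2=0.5)  # q = true posterior
elbo_at_prior = elbo_linear_gaussian(x, m=0.0, s2=1.0)        # q = prior

print(f"log p(x):                 {log_px:.4f}")
print(f"ELBO with q = posterior:  {elbo_at_posterior:.4f}")
print(f"ELBO with q = prior:      {elbo_at_prior:.4f}")
```

With `q` equal to the true posterior the bound is tight; any other `q` gives a strictly smaller value, and the difference is the KL divergence from `q` to the true posterior.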

8. Graphical Models and Conditional Independence Structure

Graphical models use graphs to represent the conditional independence structure of a joint distribution. They allow large joint distributions to factorize into products of small factors.

Bayesian Network (Directed Graphical Model)

A Bayesian network represents the joint distribution as a product of per-node conditionals:

$$p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i \mid \text{parents}(x_i))$$

For example, a network with edges Season → Sprinkler, Season → Rain, Sprinkler → Wet Grass, Rain → Wet Grass, and Rain → Slippery represents:

$$p(\text{Season}, \text{Sprinkler}, \text{Rain}, \text{Wet Grass}, \text{Slippery}) = p(\text{Season}) \cdot p(\text{Sprinkler} \mid \text{Season}) \cdot p(\text{Rain} \mid \text{Season}) \cdot p(\text{Wet Grass} \mid \text{Sprinkler}, \text{Rain}) \cdot p(\text{Slippery} \mid \text{Rain})$$

The graph encodes that "Wet Grass" is conditionally independent of "Season" given "Sprinkler" and "Rain."
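The factorization and the conditional independence it implies can be verified numerically. In this sketch every variable is binary and all conditional probability tables are hypothetical numbers invented for illustration (including treating Season as a two-valued dry/wet variable). The joint is assembled with `np.einsum` and the independence claim is checked by brute force.

```python
import numpy as np

# Hypothetical CPTs. All variables are binary; index 0 = False, 1 = True.
# Season is simplified to 0 = dry, 1 = wet.
p_season = np.array([0.6, 0.4])                      # P(Season)
p_sprinkler = np.array([[0.5, 0.5],                  # P(Sprinkler | Season=dry)
                        [0.9, 0.1]])                 # P(Sprinkler | Season=wet)
p_rain = np.array([[0.8, 0.2],                       # P(Rain | Season=dry)
                   [0.3, 0.7]])                      # P(Rain | Season=wet)
p_wet = np.array([[[1.00, 0.00], [0.10, 0.90]],      # P(Wet | Sprinkler=0, Rain=0/1)
                  [[0.20, 0.80], [0.01, 0.99]]])     # P(Wet | Sprinkler=1, Rain=0/1)
p_slip = np.array([[0.95, 0.05],                     # P(Slippery | Rain=0)
                   [0.30, 0.70]])                    # P(Slippery | Rain=1)

# Joint via the factorization; axes are (season, sprinKler, rain, wet, sLippery).
joint = np.einsum("s,sk,sr,krw,rl->skrwl",
                  p_season, p_sprinkler, p_rain, p_wet, p_slip)

# P(Wet | Season, Sprinkler, Rain): marginalize out Slippery, then normalize.
p_skrw = joint.sum(axis=4)
p_w_given_skr = p_skrw / p_skrw.sum(axis=3, keepdims=True)

# Conditional independence: the result must not depend on Season.
ci_holds = np.allclose(p_w_given_skr[0], p_w_given_skr[1])

print("Joint sums to:", round(joint.sum(), 10))
print("Wet Grass independent of Season given Sprinkler, Rain:", ci_holds)
```

With binary variables the full joint table has $2^5 - 1 = 31$ free parameters, while the factorized form needs only $1 + 2 + 2 + 4 + 2 = 11$, which is the saving that the graph structure buys.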

Markov Random Field (Undirected Graphical Model)

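$$p(\mathbf{x}) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{x}_c)$$

Here $\mathcal{C}$ is the set of cliques of the graph, each $\psi_c$ is a non-negative potential function over the variables in clique $c$, and $Z$ is the normalizing constant (partition function). Markov random fields are used in image segmentation, and CRFs (Conditional Random Fields) are used for sequence labeling in NLP.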
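A tiny example makes the role of $Z$ concrete. The sketch below builds a three-node chain $x_0 - x_1 - x_2$ of binary variables with a hypothetical pairwise potential that rewards neighbors for agreeing, then computes $Z$ by enumerating all $2^3$ configurations. The potential `psi` and its `coupling` parameter are illustrative assumptions.

```python
import itertools
import numpy as np

def psi(a, b, coupling=1.0):
    """Pairwise potential: larger when neighboring variables agree."""
    return np.exp(coupling if a == b else -coupling)

# Enumerate every configuration of the chain x0 - x1 - x2.
states = list(itertools.product([0, 1], repeat=3))

# Unnormalized probability: product of potentials over the two edges (cliques).
unnorm = np.array([psi(x0, x1) * psi(x1, x2) for x0, x1, x2 in states])

# Partition function and normalized distribution.
Z = unnorm.sum()
probs = unnorm / Z

for state, p in zip(states, probs):
    print(state, f"{p:.4f}")
print(f"Z = {Z:.4f}")
```

Brute-force enumeration costs $2^n$ terms for $n$ binary variables, so computing $Z$ exactly is feasible only for small or specially structured graphs such as chains and trees.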

9. Summary: Joint Distribution Toolkit

Given joint distribution p(x, y):

OPERATION              FORMULA                        PURPOSE
──────────────────────────────────────────────────────────────────────────────
Marginal of X          p(x) = Σ_y p(x,y)              Ignore Y
Marginal of Y          p(y) = Σ_x p(x,y)              Ignore X
Conditional X|Y=y      p(x|y) = p(x,y)/p(y)           Update on Y
Conditional Y|X=x      p(y|x) = p(x,y)/p(x)           Update on X
Independence check     p(x,y) = p(x)p(y)?             Is Y informative about X?
Expectation of g(X,Y)  E[g] = ΣΣ g(x,y)p(x,y)         Average over joint
Covariance             Cov(X,Y) = E[XY] - E[X]E[Y]    Linear relationship
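The last two rows of the toolkit can be applied to the example table from Section 1. This sketch computes $E[X]$, $E[Y]$, $E[XY]$, and the covariance by weighting each cell of the joint table; the variable names are illustrative.

```python
import numpy as np

joint = np.array([
    [0.10, 0.05, 0.05],
    [0.20, 0.15, 0.05],
    [0.05, 0.20, 0.15],
])
x_vals = np.array([0, 1, 2])   # values taken by X (rows)
y_vals = np.array([0, 1, 2])   # values taken by Y (columns)

# E[g(X, Y)] = sum over all cells of g(x, y) * p(x, y).
e_x = (x_vals[:, np.newaxis] * joint).sum()
e_y = (y_vals[np.newaxis, :] * joint).sum()
e_xy = (np.outer(x_vals, y_vals) * joint).sum()

cov_xy = e_xy - e_x * e_y

print(f"E[X] = {e_x:.2f}, E[Y] = {e_y:.2f}, E[XY] = {e_xy:.2f}")
print(f"Cov(X, Y) = {cov_xy:.2f}")
```

The covariance is positive (0.17), consistent with the table putting more mass on cells where $X$ and $Y$ are both large or both small than independence would.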

10. Interview Q&A

Q1: What is a marginal distribution and how do you compute it from a joint distribution?

A: The marginal distribution of $X$ describes $X$ alone, without regard to $Y$. It is computed by "summing out" or "integrating out" $Y$ from the joint: $p_X(x) = \sum_y p_{X,Y}(x, y)$ for discrete, $f_X(x) = \int f_{X,Y}(x,y) \, dy$ for continuous. Intuitively, it collapses the two-dimensional table into one dimension by summing along the other axis. In ML, marginalization is critical in latent variable models: the marginal likelihood $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$ is what we want to maximize, but this integral is often intractable for complex models. The entire field of variational inference (ELBO, VAEs) and approximate inference (MCMC, EP) exists because of the difficulty of this marginalization.

Q2: Why is the multivariate Gaussian so important in machine learning?

A: The multivariate Gaussian $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is central for several reasons. First, it is fully characterized by just two statistics, the mean and the covariance, which makes it tractable to estimate and work with. Second, all marginals and conditionals of a multivariate Gaussian are again Gaussian, enabling closed-form Bayesian updates (Kalman filters, Gaussian processes). Third, the Central Limit Theorem implies that sums and averages of many independent, finite-variance effects are approximately Gaussian, which is why it often fits real-world quantities well. Fourth, for jointly Gaussian variables, uncorrelated implies independent, so independence can be read off the covariance matrix. Fifth, linear transformations of Gaussians are Gaussian, making it easy to propagate distributions through linear layers. Applications include: multivariate regression, PCA, Gaussian processes, Kalman filtering, and the prior/posterior in Bayesian neural networks.

Q3: What is the difference between marginal and conditional independence?

A: Marginal independence ($X \perp Y$) means $p(x, y) = p(x) \, p(y)$: knowing $Y$ tells you nothing about $X$. Conditional independence ($X \perp Y \mid Z$) means $p(x, y \mid z) = p(x \mid z) \, p(y \mid z)$: once $Z$ is known, $Y$ tells you nothing more about $X$. Neither implies the other. Common cause example: among children, $X$ = shoe size, $Y$ = reading ability, $Z$ = age. $X$ and $Y$ are positively correlated because both increase with age, so they are NOT marginally independent, but conditional on age they are (approximately) independent: $X \perp Y \mid Z$. Common effect example (the reverse situation): $X$ = burglary, $Y$ = earthquake, $Z$ = alarm. $X$ and $Y$ are marginally independent, but conditioning on the alarm having gone off makes them dependent, because evidence for one cause "explains away" the other. Berkson's paradox is an instance of this common-effect (collider) pattern.
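Both directions can be reproduced in a short simulation. The sketch below uses made-up generative models: a common cause $Z$ driving two noisy measurements, and a common effect $C$ defined as the logical OR of two independent coin flips. Conditioning on the continuous $Z$ is approximated by restricting to a narrow slice of its values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Common cause: Z drives both X and Y, so they are marginally correlated.
z = rng.standard_normal(n)
x = z + 0.5 * rng.standard_normal(n)
y = z + 0.5 * rng.standard_normal(n)
corr_marginal = np.corrcoef(x, y)[0, 1]

# Approximate conditioning on Z by keeping only samples with Z close to 0.
near_zero = np.abs(z) < 0.05
corr_given_z = np.corrcoef(x[near_zero], y[near_zero])[0, 1]

# Common effect (collider): A and B are independent coin flips, C = A or B.
a = (rng.random(n) < 0.5).astype(float)
b = (rng.random(n) < 0.5).astype(float)
c = (a + b) > 0
corr_ab = np.corrcoef(a, b)[0, 1]
corr_ab_given_c = np.corrcoef(a[c], b[c])[0, 1]

print(f"Common cause:  corr(X, Y) = {corr_marginal:.3f}, given Z near 0 = {corr_given_z:.3f}")
print(f"Common effect: corr(A, B) = {corr_ab:.3f}, given C = 1   = {corr_ab_given_c:.3f}")
```

Conditioning removes the dependence in the common-cause case and creates a negative dependence in the common-effect case: once the effect is known to have occurred, learning that one cause is absent makes the other more likely.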

Q4: How does the ELBO in a VAE relate to joint and marginal distributions?

A: The VAE defines a joint distribution $p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z})$. Training would ideally maximize the log marginal likelihood $\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$, but this integral is intractable for neural network decoders. The ELBO (Evidence Lower BOund) is derived by introducing an approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ and applying Jensen's inequality: $\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - D_{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$. The gap between the ELBO and the true log-likelihood is $D_{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z} \mid \mathbf{x}))$ - zero when the approximate posterior equals the true posterior. The ELBO trades off reconstruction quality (first term) with regularization (KL term keeps latent codes close to the prior).

Q5: What is the Mahalanobis distance and when would you use it over Euclidean distance?

A: The Mahalanobis distance $d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})}$ measures distance relative to the covariance structure of the data distribution. Use it over Euclidean when: (1) features have very different scales - with a diagonal covariance it reduces to Euclidean distance after dividing each feature by its standard deviation; (2) features are correlated - Euclidean ignores correlations, while Mahalanobis is Euclidean distance in the whitened space (rotate onto the covariance eigenvectors, then rescale by the square roots of the eigenvalues); (3) anomaly detection under a Gaussian assumption - the squared Mahalanobis distance equals twice the Gaussian negative log-likelihood up to an additive constant, so high Mahalanobis distance = low density under the model. Common applications: multivariate outlier detection, LDA (Linear Discriminant Analysis, which with a shared covariance and equal class priors assigns each point to the class mean nearest in Mahalanobis distance), and Gaussian process regression.
