Joint and Marginal Distributions
Reading time: ~45 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist
The ML Scenario That Motivates This Lesson
You're building a Variational Autoencoder (VAE). The model has:
- Observed variable: (the image)
- Latent variable: (the compressed representation)
The VAE defines a joint distribution . Training requires computing the marginal , which is intractable for neural networks. The ELBO (evidence lower bound) is a tractable lower bound on , derived using Jensen's inequality.
Understanding joint and marginal distributions is the prerequisite for understanding virtually every generative model: VAEs, GANs, diffusion models, Bayesian networks, hidden Markov models, and Gaussian processes all reason about joint distributions over multiple variables.
1. Joint Distributions
A joint distribution describes the probability behavior of two or more random variables simultaneously.
Discrete Joint Distribution
For discrete random variables and , the joint PMF is:
Requirements: for all , and .
Example:
| 0.10 | 0.05 | 0.05 | |
| 0.20 | 0.15 | 0.05 | |
| 0.05 | 0.20 | 0.15 |
Table entries sum to 1.0.
Continuous Joint Distribution
For continuous random variables and , the joint PDF satisfies:
with and .
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
# Bivariate Gaussian joint distribution
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.0],
[1.0, 1.5]])
rv = multivariate_normal(mean=mu, cov=Sigma)
# Evaluate joint PDF on a grid
x1 = np.linspace(-3, 5, 100)
x2 = np.linspace(-2, 6, 100)
X1, X2 = np.meshgrid(x1, x2)
pos = np.dstack([X1, X2])
Z = rv.pdf(pos) # joint PDF values
print(f"Joint PDF max value: {Z.max():.4f}")
print(f"Approximate integral (should be 1): {Z.sum() * (x1[1]-x1[0]) * (x2[1]-x2[0]):.4f}")
# Sample from the joint distribution
samples = rv.rvs(size=1000, random_state=42)
print(f"\nSamples shape: {samples.shape}")
print(f"Empirical mean: {samples.mean(axis=0).round(4)}")
print(f"Empirical cov:\n{np.cov(samples.T).round(4)}")
2. Marginal Distributions
The marginal distribution of is obtained by "integrating out" (summing out) the other variable :
Discrete Marginalization
"Sum along rows (or columns) of the joint probability table."
Using our example table:
Continuous Marginalization
Joint distribution Marginal of X (integrate over Y)
f(x, y): f_X(x) = ∫ f(x,y) dy
y │ │
4 │ .... │
3 │ ...... ▲ higher here where joint has
2 │ ....... │ more mass
1 │ ...... │
0 │ .... │
└──────── x ───┴──────── x
from scipy.stats import multivariate_normal, norm
import numpy as np
# Bivariate Gaussian marginals
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.0],
[1.0, 1.5]])
# Marginals of bivariate Gaussian are univariate Gaussian
# X ~ N(mu[0], Sigma[0,0]), Y ~ N(mu[1], Sigma[1,1])
print("Bivariate Gaussian marginals:")
print(f" X ~ N(mu={mu[0]}, sigma^2={Sigma[0,0]}) -> std={np.sqrt(Sigma[0,0]):.4f}")
print(f" Y ~ N(mu={mu[1]}, sigma^2={Sigma[1,1]}) -> std={np.sqrt(Sigma[1,1]):.4f}")
# Verify numerically
samples = multivariate_normal(mean=mu, cov=Sigma).rvs(size=100_000, random_state=42)
print(f"\nEmpirical marginal of X: mean={samples[:,0].mean():.4f}, std={samples[:,0].std():.4f}")
print(f"Empirical marginal of Y: mean={samples[:,1].mean():.4f}, std={samples[:,1].std():.4f}")
3. Conditional Distribution from Joint
The conditional distribution of given is:
This is Bayes' theorem in the continuous setting. Given the joint, we can compute any conditional by dividing by the appropriate marginal.
import numpy as np
from scipy.stats import multivariate_normal, norm
# Conditional distribution of bivariate Gaussian
# If (X,Y) ~ N(mu, Sigma), then:
# X | Y=y ~ N(mu_X + rho*(sigma_X/sigma_Y)*(y - mu_Y), sigma_X^2*(1-rho^2))
mu_x, mu_y = 1.0, 2.0
sigma_x, sigma_y = np.sqrt(2.0), np.sqrt(1.5)
rho = 1.0 / (sigma_x * sigma_y) # Sigma[0,1] / (sigma_x * sigma_y)
def conditional_gaussian(y_obs, mu_x, mu_y, sigma_x, sigma_y, rho):
"""Compute parameters of X | Y=y_obs for bivariate Gaussian."""
mu_cond = mu_x + rho * (sigma_x / sigma_y) * (y_obs - mu_y)
var_cond = sigma_x**2 * (1 - rho**2)
return mu_cond, var_cond
y_obs = 3.0
mu_c, var_c = conditional_gaussian(y_obs, mu_x, mu_y, sigma_x, sigma_y, rho)
print(f"X | Y={y_obs}: mean={mu_c:.4f}, std={np.sqrt(var_c):.4f}")
print(f" (unconditional X: mean={mu_x}, std={sigma_x:.4f})")
print(f" Conditioning on Y narrows our uncertainty about X")
Key Relationship
From the joint, you can compute:
- Marginals: integrate/sum out the other variable
- Conditionals: divide joint by the appropriate marginal
- Everything you need for Bayesian inference
4. Independence in Terms of Joint Distribution
Random variables and are independent if and only if their joint distribution factorizes:
Equivalently:
- (conditioning on tells us nothing about )
- The joint PDF/PMF can be written as a product of marginals
import numpy as np
np.random.seed(42)
n = 100_000
# Independent case
X_ind = np.random.randn(n)
Y_ind = np.random.randn(n)
# Dependent case (positive correlation)
X_dep = np.random.randn(n)
Y_dep = 0.8 * X_dep + 0.6 * np.random.randn(n)
# Test independence: for independent RVs,
# P(X in A, Y in B) = P(X in A) * P(Y in B)
A = (X_ind > 0)
B_ind = (Y_ind > 0)
B_dep = (Y_dep > 0)
# Independent case
p_A = A.mean()
p_B_ind = B_ind.mean()
p_AB_ind = (A & B_ind).mean()
print("Independent case:")
print(f" P(X>0) * P(Y>0) = {p_A * p_B_ind:.4f}")
print(f" P(X>0, Y>0) = {p_AB_ind:.4f}")
print(f" Difference: {abs(p_A * p_B_ind - p_AB_ind):.5f} (near 0 = independent)")
# Dependent case
A2 = (X_dep > 0)
p_B_dep = B_dep.mean()
p_AB_dep = (A2 & B_dep).mean()
print("\nDependent case (rho=0.8):")
print(f" P(X>0) * P(Y>0) = {p_A * p_B_dep:.4f}")
print(f" P(X>0, Y>0) = {p_AB_dep:.4f}")
print(f" Difference: {abs(p_A * p_B_dep - p_AB_dep):.5f} (>0 = dependent)")
5. The Multivariate Gaussian: The Central Distribution for ML
The multivariate Gaussian is the most important joint distribution in ML.
Definition
Beautiful Properties
| Property | Statement |
|---|---|
| Marginals | Any marginal of a multivariate Gaussian is Gaussian |
| Conditionals | Any conditional of a multivariate Gaussian is Gaussian |
| Linear transformations | |
| Sum of independent Gaussians | Also Gaussian |
| Uncorrelated = Independent | For Gaussians only: implies |
The last property is unique to the Gaussian - in general, uncorrelated independent.
Marginal of Multivariate Gaussian
For :
Conditional of Multivariate Gaussian
This is the Gaussian conditioning formula, and it underpins Gaussian Processes and Kalman filters.
import numpy as np
from scipy.stats import multivariate_normal
# Gaussian conditioning
# (X1, X2) ~ N([1,2], [[2,1],[1,1.5]])
# Compute X1 | X2 = x2_obs
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.0],
[1.0, 1.5]])
x2_obs = 3.0
# Partition
mu1, mu2 = mu[0], mu[1]
S11 = Sigma[0,0]; S12 = Sigma[0,1]; S22 = Sigma[1,1]
# Conditional mean and variance
mu_cond = mu1 + S12 / S22 * (x2_obs - mu2)
sigma2_cond = S11 - S12**2 / S22
print(f"X1 | X2={x2_obs}:")
print(f" Conditional mean: {mu_cond:.4f} (marginal mean: {mu1})")
print(f" Conditional variance: {sigma2_cond:.4f} (marginal var: {S11})")
print(f" Variance reduction: {(1 - sigma2_cond/S11)*100:.1f}%")
6. Covariance Matrix as Joint Distribution Statistic
The covariance matrix captures the second-order structure of a joint distribution. For ML:
Feature vector x = [x1, x2, ..., xd]
│
Σ = Cov(x)
│
┌────────────────────┴──────────────────────────────┐
│ │
Σ diagonal Off-diagonal Σ_ij
= feature variances = covariance between features i,j
│ │
Used in: Used in:
- Feature scaling - PCA (find axes of max variance)
- Normalization - LDA (class-conditional cov)
- BatchNorm - Mahalanobis distance
Mahalanobis Distance
The Mahalanobis distance uses the inverse covariance to measure distances that account for correlations and different scales:
This is the exponent in the Gaussian density. Points at equal Mahalanobis distance form ellipses aligned with the covariance structure (not circles as with Euclidean distance).
import numpy as np
# Mahalanobis vs Euclidean distance
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 2.0],
[2.0, 1.5]])
Sigma_inv = np.linalg.inv(Sigma)
points = np.array([[2.0, 0.0], # point along x1 axis
[0.0, 1.0], # point along x2 axis
[1.0, 0.8]]) # diagonal point
print(f"{'Point':<15} {'Euclidean':>12} {'Mahalanobis':>13}")
print("-" * 42)
for p in points:
diff = p - mu
d_euc = np.sqrt(diff @ diff)
d_mah = np.sqrt(diff @ Sigma_inv @ diff)
print(f"{str(p):<15} {d_euc:>12.4f} {d_mah:>13.4f}")
# In anomaly detection, Mahalanobis distance is used to score outliers
# under a multivariate Gaussian assumption
7. ML Connection: Latent Variable Models
Latent variable models define a joint distribution where is observed and is latent (unobserved):
The Marginalization Problem
The marginal likelihood of observed data requires integrating out the latent variable:
For neural network models , this integral is intractable - there is no closed form.
VAE: ELBO as Lower Bound on
VAEs introduce an approximate posterior and maximize:
The first term is the reconstruction loss; the second is a regularizer that pushes the approximate posterior toward the prior.
VAE joint distribution:
p(x, z) = p(x | z) · p(z)
│ │
│ └── Prior: z ~ N(0, I)
│
└── Decoder: p(x | z; θ) - neural network
q(z | x; φ) = N(μ_φ(x), diag(σ_φ(x)²))
└── Encoder (approximate posterior)
Training: maximize ELBO = E_q[log p(x|z)] - KL(q || p)
import numpy as np
def elbo_gaussian_vae(x, mu_enc, log_var_enc, x_recon, sigma_dec=1.0):
"""
Compute ELBO for a single data point in a Gaussian VAE.
Args:
x: observed data (d_x,)
mu_enc, log_var_enc: encoder parameters (d_z,)
x_recon: decoder output (d_x,)
sigma_dec: decoder noise std
Returns:
ELBO value (scalar)
"""
# Reconstruction term: log p(x | z) ~ Gaussian
# = -0.5 * sum((x - x_recon)^2) / sigma_dec^2 - constant
recon_loss = -0.5 * np.sum((x - x_recon)**2) / sigma_dec**2
# KL divergence: KL(N(mu, sigma^2) || N(0, I))
# = 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2))
kl_div = 0.5 * np.sum(np.exp(log_var_enc) + mu_enc**2 - 1 - log_var_enc)
elbo = recon_loss - kl_div
return elbo, recon_loss, -kl_div
# Example
d_x, d_z = 784, 16
np.random.seed(42)
x = np.random.randn(d_x) # original image (flattened)
mu_enc = np.random.randn(d_z) * 0.1 # encoder mean
log_var = np.random.randn(d_z) * 0.1 # encoder log-variance
x_recon = x + 0.1 * np.random.randn(d_x) # reconstructed image
elbo, recon, kl = elbo_gaussian_vae(x, mu_enc, log_var, x_recon)
print(f"ELBO components:")
print(f" Reconstruction: {recon:.4f}")
print(f" -KL divergence: {kl:.4f}")
print(f" ELBO: {elbo:.4f}")
8. Graphical Models and Conditional Independence Structure
Graphical models use graphs to represent the conditional independence structure of a joint distribution. They allow large joint distributions to factorize into products of small factors.
Bayesian Network (Directed Graphical Model)
A Bayesian network represents:
This represents:
The graph encodes that "Wet Grass" is conditionally independent of "Season" given "Sprinkler" and "Rain."
Markov Random Field (Undirected Graphical Model)
Used in image segmentation, CRFs (Conditional Random Fields) for sequence labeling in NLP.
9. Summary: Joint Distribution Toolkit
Given joint distribution p(x, y):
OPERATION FORMULA PURPOSE
──────────────────────────────────────────────────────────────
Marginal of X p(x) = Σ_y p(x,y) Ignore Y
Marginal of Y p(y) = Σ_x p(x,y) Ignore X
Conditional X|Y=y p(x|y) = p(x,y)/p(y) Update on Y
Conditional Y|X=x p(y|x) = p(x,y)/p(x) Update on X
Independence check p(x,y) = p(x)p(y)? Is Y informative about X?
Expectation of g(X,Y) E[g] = ΣΣ g(x,y)p(x,y) Average over joint
Covariance Cov(X,Y)=E[XY]-E[X]E[Y] Linear relationship
10. Interview Q&A
Q1: What is a marginal distribution and how do you compute it from a joint distribution?
A: The marginal distribution of describes alone, without regard to . It is computed by "summing out" or "integrating out" from the joint: for discrete, for continuous. Intuitively, it collapses the two-dimensional table into one dimension by summing along the other axis. In ML, marginalization is critical in latent variable models: the marginal likelihood is what we want to maximize, but this integral is often intractable for complex models. The entire field of variational inference (ELBO, VAEs) and approximate inference (MCMC, EP) exists because of the difficulty of this marginalization.
Q2: Why is the multivariate Gaussian so important in machine learning?
A: The multivariate Gaussian is central for several reasons. First, it is fully characterized by just two statistics: mean and covariance - making it tractable to estimate and work with. Second, all marginals and conditionals of a multivariate Gaussian are again Gaussian - enabling closed-form Bayesian updates (Kalman filters, Gaussian processes). Third, the Central Limit Theorem ensures that many real-world quantities are approximately Gaussian. Fourth, for Gaussians, uncorrelated implies independent - simplifying independence testing. Fifth, linear transformations of Gaussians are Gaussian, making it easy to propagate distributions through linear layers. Applications include: multivariate regression, PCA, Gaussian processes, Kalman filtering, and the prior/posterior in Bayesian neural networks.
Q3: What is the difference between marginal and conditional independence?
A: Marginal independence () means - knowing tells you nothing about . Conditional independence () means - given , knowing tells you nothing about . These can go in opposite directions. Example: = shoe size, = reading ability, = age. and are positively correlated (both increase with age) - NOT marginally independent. But - conditional on age, shoe size and reading ability are independent. This is a "common cause" (Berkson's paradox in reverse). The opposite: = fire ( = smoke), they can be marginally independent but become dependent when conditioned on a common effect.
Q4: How does the ELBO in a VAE relate to joint and marginal distributions?
A: The VAE defines a joint distribution . Training would ideally maximize the log marginal likelihood , but this integral is intractable for neural network decoders. The ELBO (Evidence Lower BOund) is derived by introducing an approximate posterior and applying Jensen's inequality: . The gap between the ELBO and the true log-likelihood is - zero when the approximate posterior equals the true posterior. The ELBO trades off reconstruction quality (first term) with regularization (KL term keeps latent codes close to the prior).
Q5: What is the Mahalanobis distance and when would you use it over Euclidean distance?
A: The Mahalanobis distance measures distance relative to the covariance structure of the data distribution. Use it over Euclidean when: (1) features have very different scales - Mahalanobis automatically accounts for scale differences (equivalent to normalizing by each feature's variance); (2) features are correlated - Euclidean ignores correlations, Mahalanobis accounts for them (it measures distance in the space rotated by the covariance eigenvectors); (3) anomaly detection under a Gaussian assumption - Mahalanobis distance equals the negative log-likelihood of the Gaussian up to constants, so high Mahalanobis distance = low probability under the model. Common applications: multivariate outlier detection, LDA (Linear Discriminant Analysis which minimizes within-class Mahalanobis distance), and Gaussian process regression.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Joint Distributions demo on the EngineersOfAI Playground - no code required.
:::
