Joint and Marginal Distributions

Reading time: ~45 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Scientist, Data Scientist

The ML Scenario That Motivates This Lesson

You're building a Variational Autoencoder (VAE). The model has:

Observed variable: $\mathbf{x}$ (the image)
Latent variable: $\mathbf{z}$ (the compressed representation)

The VAE defines a joint distribution $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x} \mid \mathbf{z}) \cdot p(\mathbf{z})$ . Training requires computing the marginal $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z}) d\mathbf{z}$ , which is intractable for neural networks. The ELBO (evidence lower bound) is a tractable lower bound on $\log p(\mathbf{x})$ , derived using Jensen's inequality.

Understanding joint and marginal distributions is the prerequisite for understanding virtually every generative model: VAEs, GANs, diffusion models, Bayesian networks, hidden Markov models, and Gaussian processes all reason about joint distributions over multiple variables.

1. Joint Distributions

A joint distribution describes the probability behavior of two or more random variables simultaneously.

Discrete Joint Distribution

For discrete random variables $X$ and $Y$ , the joint PMF is:

$p_{X,Y}(x, y) = P(X = x, Y = y)$

Requirements: $p_{X,Y}(x, y) \geq 0$ for all $(x, y)$ , and $\sum_x \sum_y p_{X,Y}(x, y) = 1$ .

Example:

	$Y=0$	$Y=1$	$Y=2$
$X=0$	0.10	0.05	0.05
$X=1$	0.20	0.15	0.05
$X=2$	0.05	0.20	0.15

Table entries sum to 1.0.

Continuous Joint Distribution

For continuous random variables $X$ and $Y$ , the joint PDF $f_{X,Y}(x, y)$ satisfies:

$P((X,Y) \in A) = \iint_A f_{X,Y}(x, y) \, dx \, dy$

with $f_{X,Y}(x, y) \geq 0$ and $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx \, dy = 1$ .

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Bivariate Gaussian joint distribution
mu    = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.5]])

rv = multivariate_normal(mean=mu, cov=Sigma)

# Evaluate joint PDF on a grid
x1 = np.linspace(-3, 5, 100)
x2 = np.linspace(-2, 6, 100)
X1, X2 = np.meshgrid(x1, x2)
pos = np.dstack([X1, X2])

Z = rv.pdf(pos)  # joint PDF values

print(f"Joint PDF max value: {Z.max():.4f}")
print(f"Approximate integral (should be 1): {Z.sum() * (x1[1]-x1[0]) * (x2[1]-x2[0]):.4f}")

# Sample from the joint distribution
samples = rv.rvs(size=1000, random_state=42)
print(f"\nSamples shape: {samples.shape}")
print(f"Empirical mean: {samples.mean(axis=0).round(4)}")
print(f"Empirical cov:\n{np.cov(samples.T).round(4)}")

2. Marginal Distributions

The marginal distribution of $X$ is obtained by "integrating out" (summing out) the other variable $Y$ :

Discrete Marginalization

$p_X(x) = \sum_y p_{X,Y}(x, y)$

"Sum along rows (or columns) of the joint probability table."

Using our example table:

$P(X=0) = 0.10 + 0.05 + 0.05 = 0.20$
$P(X=1) = 0.20 + 0.15 + 0.05 = 0.40$
$P(X=2) = 0.05 + 0.20 + 0.15 = 0.40$

Continuous Marginalization

$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy$

Joint distribution         Marginal of X (integrate over Y)
f(x, y):                   f_X(x) = ∫ f(x,y) dy

   y │                        │
   4 │   ....                 │
   3 │  ......                ▲ higher here where joint has
   2 │ .......                │ more mass
   1 │  ......                │
   0 │   ....                 │
     └──────── x           ───┴──────── x

from scipy.stats import multivariate_normal, norm
import numpy as np

# Bivariate Gaussian marginals
mu    = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.5]])

# Marginals of bivariate Gaussian are univariate Gaussian
# X ~ N(mu[0], Sigma[0,0]),  Y ~ N(mu[1], Sigma[1,1])
print("Bivariate Gaussian marginals:")
print(f"  X ~ N(mu={mu[0]}, sigma^2={Sigma[0,0]})  -> std={np.sqrt(Sigma[0,0]):.4f}")
print(f"  Y ~ N(mu={mu[1]}, sigma^2={Sigma[1,1]})  -> std={np.sqrt(Sigma[1,1]):.4f}")

# Verify numerically
samples = multivariate_normal(mean=mu, cov=Sigma).rvs(size=100_000, random_state=42)
print(f"\nEmpirical marginal of X: mean={samples[:,0].mean():.4f}, std={samples[:,0].std():.4f}")
print(f"Empirical marginal of Y: mean={samples[:,1].mean():.4f}, std={samples[:,1].std():.4f}")

3. Conditional Distribution from Joint

The conditional distribution of $X$ given $Y = y$ is:

$f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$

This is Bayes' theorem in the continuous setting. Given the joint, we can compute any conditional by dividing by the appropriate marginal.

import numpy as np
from scipy.stats import multivariate_normal, norm

# Conditional distribution of bivariate Gaussian
# If (X,Y) ~ N(mu, Sigma), then:
# X | Y=y ~ N(mu_X + rho*(sigma_X/sigma_Y)*(y - mu_Y),  sigma_X^2*(1-rho^2))

mu_x, mu_y = 1.0, 2.0
sigma_x, sigma_y = np.sqrt(2.0), np.sqrt(1.5)
rho = 1.0 / (sigma_x * sigma_y)  # Sigma[0,1] / (sigma_x * sigma_y)

def conditional_gaussian(y_obs, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Compute parameters of X | Y=y_obs for bivariate Gaussian."""
    mu_cond    = mu_x + rho * (sigma_x / sigma_y) * (y_obs - mu_y)
    var_cond   = sigma_x**2 * (1 - rho**2)
    return mu_cond, var_cond

y_obs = 3.0
mu_c, var_c = conditional_gaussian(y_obs, mu_x, mu_y, sigma_x, sigma_y, rho)
print(f"X | Y={y_obs}: mean={mu_c:.4f}, std={np.sqrt(var_c):.4f}")
print(f"  (unconditional X: mean={mu_x}, std={sigma_x:.4f})")
print(f"  Conditioning on Y narrows our uncertainty about X")

Key Relationship

$f_{X,Y}(x, y) = f_{X \mid Y}(x \mid y) \cdot f_Y(y) = f_{Y \mid X}(y \mid x) \cdot f_X(x)$

From the joint, you can compute:

Marginals: integrate/sum out the other variable
Conditionals: divide joint by the appropriate marginal
Everything you need for Bayesian inference

4. Independence in Terms of Joint Distribution

Random variables $X$ and $Y$ are independent if and only if their joint distribution factorizes:

$f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y) \quad \text{for all } (x, y)$

Equivalently:

$f_{X \mid Y}(x \mid y) = f_X(x)$ (conditioning on $Y$ tells us nothing about $X$ )
The joint PDF/PMF can be written as a product of marginals

import numpy as np

np.random.seed(42)
n = 100_000

# Independent case
X_ind = np.random.randn(n)
Y_ind = np.random.randn(n)

# Dependent case (positive correlation)
X_dep = np.random.randn(n)
Y_dep = 0.8 * X_dep + 0.6 * np.random.randn(n)

# Test independence: for independent RVs,
# P(X in A, Y in B) = P(X in A) * P(Y in B)

A = (X_ind > 0)
B_ind = (Y_ind > 0)
B_dep = (Y_dep > 0)

# Independent case
p_A   = A.mean()
p_B_ind = B_ind.mean()
p_AB_ind = (A & B_ind).mean()
print("Independent case:")
print(f"  P(X>0) * P(Y>0) = {p_A * p_B_ind:.4f}")
print(f"  P(X>0, Y>0)     = {p_AB_ind:.4f}")
print(f"  Difference: {abs(p_A * p_B_ind - p_AB_ind):.5f}  (near 0 = independent)")

# Dependent case
A2 = (X_dep > 0)
p_B_dep = B_dep.mean()
p_AB_dep = (A2 & B_dep).mean()
print("\nDependent case (rho=0.8):")
print(f"  P(X>0) * P(Y>0) = {p_A * p_B_dep:.4f}")
print(f"  P(X>0, Y>0)     = {p_AB_dep:.4f}")
print(f"  Difference: {abs(p_A * p_B_dep - p_AB_dep):.5f}  (>0 = dependent)")

5. The Multivariate Gaussian: The Central Distribution for ML

The multivariate Gaussian $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is the most important joint distribution in ML.

Definition

$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$

Beautiful Properties

Property	Statement
Marginals	Any marginal of a multivariate Gaussian is Gaussian
Conditionals	Any conditional of a multivariate Gaussian is Gaussian
Linear transformations	$A\mathbf{X} + \mathbf{b} \sim \mathcal{N}(A\boldsymbol{\mu}+\mathbf{b}, A\boldsymbol{\Sigma}A^T)$
Sum of independent Gaussians	Also Gaussian
Uncorrelated = Independent	For Gaussians only: $\Sigma_{ij}=0$ implies $X_i \perp X_j$

The last property is unique to the Gaussian - in general, uncorrelated $\neq$ independent.

Marginal of Multivariate Gaussian

For $\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2\end{bmatrix}, \begin{bmatrix}\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22}\end{bmatrix}\right)$ :

$\mathbf{X}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$

Conditional of Multivariate Gaussian

$\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2})$

$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$

$\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}$

This is the Gaussian conditioning formula, and it underpins Gaussian Processes and Kalman filters.

import numpy as np
from scipy.stats import multivariate_normal

# Gaussian conditioning
# (X1, X2) ~ N([1,2], [[2,1],[1,1.5]])
# Compute X1 | X2 = x2_obs

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 1.5]])

x2_obs = 3.0

# Partition
mu1, mu2 = mu[0], mu[1]
S11 = Sigma[0,0]; S12 = Sigma[0,1]; S22 = Sigma[1,1]

# Conditional mean and variance
mu_cond    = mu1 + S12 / S22 * (x2_obs - mu2)
sigma2_cond = S11 - S12**2 / S22

print(f"X1 | X2={x2_obs}:")
print(f"  Conditional mean:     {mu_cond:.4f}  (marginal mean: {mu1})")
print(f"  Conditional variance: {sigma2_cond:.4f}  (marginal var: {S11})")
print(f"  Variance reduction:   {(1 - sigma2_cond/S11)*100:.1f}%")

6. Covariance Matrix as Joint Distribution Statistic

The covariance matrix $\boldsymbol{\Sigma}$ captures the second-order structure of a joint distribution. For ML:

                    Feature vector x = [x1, x2, ..., xd]
                              │
                         Σ = Cov(x)
                              │
         ┌────────────────────┴──────────────────────────────┐
         │                                                    │
    Σ diagonal               Off-diagonal Σ_ij
    = feature variances      = covariance between features i,j
         │                                                    │
    Used in:                                             Used in:
    - Feature scaling        - PCA (find axes of max variance)
    - Normalization          - LDA (class-conditional cov)
    - BatchNorm              - Mahalanobis distance

Mahalanobis Distance

The Mahalanobis distance uses the inverse covariance to measure distances that account for correlations and different scales:

$d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})}$

This is the exponent in the Gaussian density. Points at equal Mahalanobis distance form ellipses aligned with the covariance structure (not circles as with Euclidean distance).

import numpy as np

# Mahalanobis vs Euclidean distance
mu    = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 2.0],
                  [2.0, 1.5]])

Sigma_inv = np.linalg.inv(Sigma)

points = np.array([[2.0, 0.0],    # point along x1 axis
                   [0.0, 1.0],    # point along x2 axis
                   [1.0, 0.8]])   # diagonal point

print(f"{'Point':<15} {'Euclidean':>12} {'Mahalanobis':>13}")
print("-" * 42)
for p in points:
    diff  = p - mu
    d_euc = np.sqrt(diff @ diff)
    d_mah = np.sqrt(diff @ Sigma_inv @ diff)
    print(f"{str(p):<15} {d_euc:>12.4f} {d_mah:>13.4f}")

# In anomaly detection, Mahalanobis distance is used to score outliers
# under a multivariate Gaussian assumption

7. ML Connection: Latent Variable Models

Latent variable models define a joint distribution $p(\mathbf{x}, \mathbf{z})$ where $\mathbf{x}$ is observed and $\mathbf{z}$ is latent (unobserved):

$p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x} \mid \mathbf{z}) \cdot p(\mathbf{z})$

The Marginalization Problem

The marginal likelihood of observed data requires integrating out the latent variable:

$p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$

For neural network models $p(\mathbf{x} \mid \mathbf{z}; \theta)$ , this integral is intractable - there is no closed form.

VAE: ELBO as Lower Bound on $\log p(\mathbf{x})$

VAEs introduce an approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ and maximize:

$\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - D_{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$

The first term is the reconstruction loss; the second is a regularizer that pushes the approximate posterior toward the prior.

VAE joint distribution:

  p(x, z) = p(x | z) · p(z)
               │            │
               │            └── Prior: z ~ N(0, I)
               │
               └── Decoder: p(x | z; θ) - neural network

  q(z | x; φ) = N(μ_φ(x), diag(σ_φ(x)²))
                       └── Encoder (approximate posterior)

  Training: maximize ELBO = E_q[log p(x|z)] - KL(q || p)

import numpy as np

def elbo_gaussian_vae(x, mu_enc, log_var_enc, x_recon, sigma_dec=1.0):
    """
    Compute ELBO for a single data point in a Gaussian VAE.

    Args:
        x: observed data (d_x,)
        mu_enc, log_var_enc: encoder parameters (d_z,)
        x_recon: decoder output (d_x,)
        sigma_dec: decoder noise std

    Returns:
        ELBO value (scalar)
    """
    # Reconstruction term: log p(x | z) ~ Gaussian
    # = -0.5 * sum((x - x_recon)^2) / sigma_dec^2 - constant
    recon_loss = -0.5 * np.sum((x - x_recon)**2) / sigma_dec**2

    # KL divergence: KL(N(mu, sigma^2) || N(0, I))
    # = 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2))
    kl_div = 0.5 * np.sum(np.exp(log_var_enc) + mu_enc**2 - 1 - log_var_enc)

    elbo = recon_loss - kl_div
    return elbo, recon_loss, -kl_div

# Example
d_x, d_z = 784, 16
np.random.seed(42)

x         = np.random.randn(d_x)          # original image (flattened)
mu_enc    = np.random.randn(d_z) * 0.1   # encoder mean
log_var   = np.random.randn(d_z) * 0.1   # encoder log-variance
x_recon   = x + 0.1 * np.random.randn(d_x)  # reconstructed image

elbo, recon, kl = elbo_gaussian_vae(x, mu_enc, log_var, x_recon)
print(f"ELBO components:")
print(f"  Reconstruction: {recon:.4f}")
print(f"  -KL divergence: {kl:.4f}")
print(f"  ELBO:           {elbo:.4f}")

8. Graphical Models and Conditional Independence Structure

Graphical models use graphs to represent the conditional independence structure of a joint distribution. They allow large joint distributions to factorize into products of small factors.

Bayesian Network (Directed Graphical Model)

A Bayesian network represents:

$p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i \mid \text{parents}(x_i))$

This represents:

$p(\text{Season}, \text{Sprinkler}, \text{Rain}, \text{Wet Grass}, \text{Slippery}) = p(\text{Season}) \cdot p(\text{Sprinkler} \mid \text{Season}) \cdot p(\text{Rain} \mid \text{Season}) \cdot p(\text{Wet} \mid \text{Sprinkler}, \text{Rain}) \cdot p(\text{Slippery} \mid \text{Rain})$

The graph encodes that "Wet Grass" is conditionally independent of "Season" given "Sprinkler" and "Rain."

Markov Random Field (Undirected Graphical Model)

$p(\mathbf{x}) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{x}_c)$

Used in image segmentation, CRFs (Conditional Random Fields) for sequence labeling in NLP.

9. Summary: Joint Distribution Toolkit

Given joint distribution p(x, y):

OPERATION             FORMULA                 PURPOSE
──────────────────────────────────────────────────────────────
Marginal of X         p(x) = Σ_y p(x,y)       Ignore Y
Marginal of Y         p(y) = Σ_x p(x,y)       Ignore X
Conditional X|Y=y     p(x|y) = p(x,y)/p(y)   Update on Y
Conditional Y|X=x     p(y|x) = p(x,y)/p(x)   Update on X
Independence check    p(x,y) = p(x)p(y)?      Is Y informative about X?
Expectation of g(X,Y) E[g] = ΣΣ g(x,y)p(x,y) Average over joint
Covariance            Cov(X,Y)=E[XY]-E[X]E[Y] Linear relationship

10. Interview Q&A

Q1: What is a marginal distribution and how do you compute it from a joint distribution?

A: The marginal distribution of $X$ describes $X$ alone, without regard to $Y$ . It is computed by "summing out" or "integrating out" $Y$ from the joint: $p_X(x) = \sum_y p_{X,Y}(x, y)$ for discrete, $f_X(x) = \int f_{X,Y}(x,y) dy$ for continuous. Intuitively, it collapses the two-dimensional table into one dimension by summing along the other axis. In ML, marginalization is critical in latent variable models: the marginal likelihood $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) d\mathbf{z}$ is what we want to maximize, but this integral is often intractable for complex models. The entire field of variational inference (ELBO, VAEs) and approximate inference (MCMC, EP) exists because of the difficulty of this marginalization.

Q2: Why is the multivariate Gaussian so important in machine learning?

A: The multivariate Gaussian $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is central for several reasons. First, it is fully characterized by just two statistics: mean and covariance - making it tractable to estimate and work with. Second, all marginals and conditionals of a multivariate Gaussian are again Gaussian - enabling closed-form Bayesian updates (Kalman filters, Gaussian processes). Third, the Central Limit Theorem ensures that many real-world quantities are approximately Gaussian. Fourth, for Gaussians, uncorrelated implies independent - simplifying independence testing. Fifth, linear transformations of Gaussians are Gaussian, making it easy to propagate distributions through linear layers. Applications include: multivariate regression, PCA, Gaussian processes, Kalman filtering, and the prior/posterior in Bayesian neural networks.

Q3: What is the difference between marginal and conditional independence?

A: Marginal independence ( $X \perp Y$ ) means $p(x, y) = p(x) p(y)$ - knowing $Y$ tells you nothing about $X$ . Conditional independence ( $X \perp Y \mid Z$ ) means $p(x, y \mid z) = p(x \mid z) p(y \mid z)$ - given $Z$ , knowing $Y$ tells you nothing about $X$ . These can go in opposite directions. Example: $X$ = shoe size, $Y$ = reading ability, $Z$ = age. $X$ and $Y$ are positively correlated (both increase with age) - NOT marginally independent. But $X \perp Y \mid Z$ - conditional on age, shoe size and reading ability are independent. This is a "common cause" (Berkson's paradox in reverse). The opposite: $X$ = fire ( $Y$ = smoke), they can be marginally independent but become dependent when conditioned on a common effect.

Q4: How does the ELBO in a VAE relate to joint and marginal distributions?

A: The VAE defines a joint distribution $p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z})$ . Training would ideally maximize the log marginal likelihood $\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z}) d\mathbf{z}$ , but this integral is intractable for neural network decoders. The ELBO (Evidence Lower BOund) is derived by introducing an approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ and applying Jensen's inequality: $\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - D_{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$ . The gap between the ELBO and the true log-likelihood is $D_{KL}(q_\phi \| p_\theta(\mathbf{z} \mid \mathbf{x}))$ - zero when the approximate posterior equals the true posterior. The ELBO trades off reconstruction quality (first term) with regularization (KL term keeps latent codes close to the prior).

Q5: What is the Mahalanobis distance and when would you use it over Euclidean distance?

A: The Mahalanobis distance $d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})}$ measures distance relative to the covariance structure of the data distribution. Use it over Euclidean when: (1) features have very different scales - Mahalanobis automatically accounts for scale differences (equivalent to normalizing by each feature's variance); (2) features are correlated - Euclidean ignores correlations, Mahalanobis accounts for them (it measures distance in the space rotated by the covariance eigenvectors); (3) anomaly detection under a Gaussian assumption - Mahalanobis distance equals the negative log-likelihood of the Gaussian up to constants, so high Mahalanobis distance = low probability under the model. Common applications: multivariate outlier detection, LDA (Linear Discriminant Analysis which minimizes within-class Mahalanobis distance), and Gaussian process regression.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Joint Distributions demo on the EngineersOfAI Playground - no code required.

:::

The ML Scenario That Motivates This Lesson​

1. Joint Distributions​

Discrete Joint Distribution​

Continuous Joint Distribution​

2. Marginal Distributions​

Discrete Marginalization​

Continuous Marginalization​

3. Conditional Distribution from Joint​

Key Relationship​

4. Independence in Terms of Joint Distribution​

5. The Multivariate Gaussian: The Central Distribution for ML​

Definition​

Beautiful Properties​

Marginal of Multivariate Gaussian​

Conditional of Multivariate Gaussian​

6. Covariance Matrix as Joint Distribution Statistic​

Mahalanobis Distance​

7. ML Connection: Latent Variable Models​

The Marginalization Problem​

VAE: ELBO as Lower Bound on log⁡p(x)\log p(\mathbf{x})logp(x)​

8. Graphical Models and Conditional Independence Structure​

Bayesian Network (Directed Graphical Model)​

Markov Random Field (Undirected Graphical Model)​

9. Summary: Joint Distribution Toolkit​

10. Interview Q&A​