KL Divergence

The Scenario That Motivates This Lesson

You are training a Variational Autoencoder (VAE). The loss function has two terms: a reconstruction loss and a "KL loss." Your training crashes - the KL loss explodes to thousands while reconstruction is still poor.

Your colleague says: "The encoder is drifting far from the prior - check the beta coefficient." You nod, but do you actually understand why the KL divergence appears in the VAE objective, what it measures, and why it can explode?

This lesson answers those questions. KL divergence is a fundamental measure of how much one probability distribution differs from another, and it appears throughout modern ML.

Definition: KL Divergence

The Kullback-Leibler divergence from distribution $Q$ to distribution $P$ is:

$D_{\text{KL}}(P \| Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$

For continuous distributions:

$D_{\text{KL}}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$

Read as: "The KL divergence of $P$ from $Q$" or "the KL divergence when using $Q$ to approximate $P$."

Alternative Expressions

$D_{\text{KL}}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{p(x)}{q(x)}\right] = \mathbb{E}_{x \sim P}[\log p(x)] - \mathbb{E}_{x \sim P}[\log q(x)]$

Or equivalently using entropy:

$D_{\text{KL}}(P \| Q) = -H(P) + H(P, Q)$

where $H(P,Q)$ is the cross-entropy. This identity - $D_{\text{KL}}(P \| Q) = H(P, Q) - H(P)$ - is one of the most important equations in this entire module.
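The identity can be checked numerically. The sketch below is a minimal illustration with made-up distributions `p` and `q`; the helper names `entropy`, `cross_entropy`, and `kl` are chosen for this example only.

```python
import numpy as np


def entropy(p):
    """Shannon entropy H(P) in nats, skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))


def cross_entropy(p, q):
    """Cross-entropy H(P, Q) = -sum_x p(x) log q(x), in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(q[nz])))


def kl(p, q):
    """D_KL(P||Q) computed directly from the definition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))


# Illustrative distributions (not taken from any dataset)
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

lhs = kl(p, q)
rhs = cross_entropy(p, q) - entropy(p)
print(f"D_KL(P||Q)      = {lhs:.6f}")
print(f"H(P,Q) - H(P)   = {rhs:.6f}")
```

Because $H(P)$ does not depend on $Q$, minimizing cross-entropy over $Q$ and minimizing $D_{\text{KL}}(P \| Q)$ over $Q$ select the same $Q$.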

:::note By Convention: 0 log(0/0) = 0 and 0 log(0/q) = 0
If $p(x) = 0$, the term contributes 0 regardless of $q(x)$. But if $p(x) > 0$ and $q(x) = 0$, the KL divergence is $+\infty$. This means: $Q$ must have support everywhere $P$ does, or the KL divergence is infinite.
:::

Key Properties

Non-Negativity (Gibbs' Inequality)

$D_{\text{KL}}(P \| Q) \geq 0 \quad \text{with equality iff } P = Q \text{ almost everywhere}$

Proof by Jensen's inequality:

$D_{\text{KL}}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{p(x)}{q(x)}\right] = -\mathbb{E}_{x \sim P}\left[\log \frac{q(x)}{p(x)}\right]$

Since $-\log$ is convex, by Jensen's inequality:

$-\mathbb{E}\left[\log \frac{q(x)}{p(x)}\right] \geq -\log \mathbb{E}\left[\frac{q(x)}{p(x)}\right] = -\log \int p(x) \cdot \frac{q(x)}{p(x)} dx = -\log 1 = 0$
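A quick empirical sanity check of the inequality, not a substitute for the proof: the sketch below draws random pairs of distributions from a Dirichlet (an arbitrary choice made for this example) and records the smallest KL value seen.

```python
import numpy as np

rng = np.random.default_rng(0)


def kl(p, q):
    """D_KL(P||Q) for strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))


# Dirichlet samples are strictly positive with probability 1,
# so no special handling of zeros is needed here.
min_kl = np.inf
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    min_kl = min(min_kl, kl(p, q))

kl_self = kl(p, p)  # equality case: P = Q

print(f"smallest KL over 1000 random pairs: {min_kl:.6f}")
print(f"KL of a distribution with itself:   {kl_self:.6f}")
```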

Asymmetry: Why KL Is Not a Distance

$D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P) \quad \text{in general}$

This is the most important property to internalize. KL divergence is not a metric - it does not satisfy the symmetry axiom, and it does not satisfy the triangle inequality.

Asymmetry Example:

P = [0.9, 0.1]   (concentrated on outcome 1)
Q = [0.5, 0.5]   (uniform)

D_KL(P||Q) = 0.9 * log(0.9/0.5) + 0.1 * log(0.1/0.5)
           = 0.9 * 0.5878 - 0.1 * 1.6094
           = 0.368 nats

D_KL(Q||P) = 0.5 * log(0.5/0.9) + 0.5 * log(0.5/0.1)
           = 0.5 * (-0.5878) + 0.5 * 1.6094
           = 0.511 nats

(The two directions can coincide in special cases. For example, with
Q = [0.1, 0.9], which is P with its outcomes swapped, both directions
equal 0.8 * log(9) = 1.758 nats. In general D_KL(P||Q) ≠ D_KL(Q||P).)

No Triangle Inequality

$D_{\text{KL}}(P \| R) \not\leq D_{\text{KL}}(P \| Q) + D_{\text{KL}}(Q \| R)$ in general.
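A concrete counterexample makes the failure visible. The three two-outcome distributions below were picked by hand for this illustration; going from `P` to `R` directly costs more than the detour through `Q`.

```python
import math


def kl(p, q):
    """D_KL(P||Q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.5, 0.5]
Q = [0.25, 0.75]
R = [0.1, 0.9]

direct = kl(P, R)              # about 0.511 nats
detour = kl(P, Q) + kl(Q, R)   # about 0.144 + 0.092 = 0.236 nats

print(f"D_KL(P||R)              = {direct:.4f}")
print(f"D_KL(P||Q) + D_KL(Q||R) = {detour:.4f}")
print("Triangle inequality holds:", direct <= detour)
```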

Forward KL vs. Reverse KL: Geometric Intuition

This asymmetry has profound consequences for variational inference. There are two ways to use KL divergence to fit an approximate distribution $q$ to a target $p$:

Forward KL: $D_{\text{KL}}(P \| Q)$ - "Mean-Seeking"

Minimize the expectation under $P$:

$\min_Q D_{\text{KL}}(P \| Q) = \min_Q \mathbb{E}_{x \sim P}\left[\log \frac{p(x)}{q(x)}\right]$

The loss is large when $q(x)$ is small but $p(x)$ is large. This forces $Q$ to cover all modes of $P$. If $P$ is bimodal, $Q$ must place probability mass at both modes.

When $P$ has a mode where $q(x) \approx 0$ but $p(x) > 0$, the term $p(x)\log(p(x)/q(x)) \to \infty$. So $Q$ is penalized severely for ignoring any region where $P$ has mass.

Result: $Q$ tends to be diffuse, covering all modes of $P$ (mode-covering behavior).

Reverse KL: $D_{\text{KL}}(Q \| P)$ - "Mode-Seeking"

Minimize the expectation under $Q$:

$\min_Q D_{\text{KL}}(Q \| P) = \min_Q \mathbb{E}_{x \sim Q}\left[\log \frac{q(x)}{p(x)}\right]$

The loss is large when $q(x)$ is large but $p(x)$ is small. This forces $Q$ to not place mass where $P$ has none. But when $q(x) = 0$, the term contributes 0 regardless of $p(x)$.

So $Q$ is free to ignore modes of $P$ (by setting $q(x) = 0$ there), as long as it concentrates on a region where $p(x) > 0$.

Result: $Q$ tends to be a sharp approximation concentrated on one mode of $P$ (mode-seeking behavior).

Forward KL (P||Q): Q must cover all modes of P

True P (bimodal):   |  *       *  |
Q fitted by fwd KL: |    *****    |  ← spreads across both modes
                    +---------+-+

Reverse KL (Q||P): Q can pick one mode and ignore the rest

True P (bimodal):   |  *       *  |
Q fitted by rev KL: |  ***        |  ← concentrates on one mode
                    +---------+-+
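The two behaviors sketched above can be reproduced numerically. The following is a minimal sketch under assumptions chosen for this example: the target is an equal mixture of $\mathcal{N}(-4, 1)$ and $\mathcal{N}(4, 1)$, the approximation is a single Gaussian, integrals are replaced by Riemann sums on a grid, and the fit is a brute-force grid search rather than gradient descent.

```python
import numpy as np


def log_normal_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2), evaluated elementwise."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)


# Integration grid, wide enough that both densities are negligible at the edges
x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]

# Bimodal target P: equal mixture of N(-4, 1) and N(4, 1), kept in log space
log_p = np.logaddexp(log_normal_pdf(x, -4.0, 1.0),
                     log_normal_pdf(x, 4.0, 1.0)) + np.log(0.5)
p = np.exp(log_p)

best_fwd = (np.inf, None, None)  # (KL value, mu, sigma)
best_rev = (np.inf, None, None)

for mu in np.linspace(-5.0, 5.0, 41):
    for sigma in np.linspace(0.5, 5.0, 46):
        log_q = log_normal_pdf(x, mu, sigma)
        q = np.exp(log_q)
        fwd = np.sum(p * (log_p - log_q)) * dx  # D_KL(P||Q), expectation under P
        rev = np.sum(q * (log_q - log_p)) * dx  # D_KL(Q||P), expectation under Q
        if fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

fwd_kl, mu_fwd, sigma_fwd = best_fwd
rev_kl, mu_rev, sigma_rev = best_rev
print(f"Forward KL fit: mu={mu_fwd:+.2f}, sigma={sigma_fwd:.2f}, KL={fwd_kl:.3f}")
print(f"Reverse KL fit: mu={mu_rev:+.2f}, sigma={sigma_rev:.2f}, KL={rev_kl:.3f}")
```

The forward-KL fit should sit between the modes with a large standard deviation (it matches the mean and variance of the mixture), while the reverse-KL fit should lock onto one of the two modes with a standard deviation near 1. Which mode it picks is arbitrary, since the target is symmetric.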

:::info ML Connection - VAEs Use Reverse KL
The KL term in the VAE objective puts the approximate posterior first: $D_{\text{KL}}(q_\phi(z|x) \| p(z))$. Because the expectation is taken under $q_\phi$, the encoder is penalized for placing mass where the prior $p(z) = \mathcal{N}(0, I)$ has little, but not for leaving parts of the prior uncovered. Each $q_\phi(z|x)$ can therefore be much narrower than the prior. Posterior collapse is a different failure mode: the KL term is driven to zero, $q_\phi(z|x) \approx p(z)$ for every $x$, and the latent code carries no information about the input.
:::

KL Divergence Between Gaussians

A particularly important closed-form result used everywhere in deep learning:

For two univariate Gaussians $P = \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q = \mathcal{N}(\mu_2, \sigma_2^2)$:

$D_{\text{KL}}(P \| Q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$

Special case for VAEs, with $P = \mathcal{N}(\mu, \sigma^2)$ and $Q$ the standard normal $\mathcal{N}(0, 1)$ (set $\mu_2 = 0$, $\sigma_2 = 1$ above):

$D_{\text{KL}}\bigl(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)\bigr) = \frac{1}{2}\left(\mu^2 + \sigma^2 - \ln\sigma^2 - 1\right)$

For a diagonal multivariate Gaussian of dimension $d$ :

$D_{\text{KL}}\bigl(\mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)) \| \mathcal{N}(\mathbf{0}, \mathbf{I})\bigr) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \ln\sigma_j^2 - 1\right)$

This is the formula used in VAE implementations that pair a diagonal Gaussian encoder with a standard normal prior.
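One way to gain confidence in the closed form is to compare it with a Monte Carlo estimate of $\mathbb{E}_{x \sim P}[\log p(x) - \log q(x)]$. The sketch below does this for one arbitrary parameter choice; the function names, sample size, and seed are assumptions of this example.

```python
import numpy as np


def log_normal_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2), evaluated elementwise."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)


def kl_gaussian_closed_form(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))."""
    return (np.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)


rng = np.random.default_rng(0)
mu1, sigma1 = 1.0, 0.5   # P
mu2, sigma2 = 0.0, 1.0   # Q (standard normal)

# Monte Carlo: average the log-density ratio over samples drawn from P
samples = rng.normal(mu1, sigma1, size=200_000)
kl_mc = float(np.mean(log_normal_pdf(samples, mu1, sigma1)
                      - log_normal_pdf(samples, mu2, sigma2)))

kl_exact = float(kl_gaussian_closed_form(mu1, sigma1, mu2, sigma2))

# VAE special case written in terms of mu and sigma^2
kl_vae = float(0.5 * (mu1 ** 2 + sigma1 ** 2 - np.log(sigma1 ** 2) - 1.0))

print(f"Monte Carlo estimate: {kl_mc:.4f}")
print(f"General closed form:  {kl_exact:.4f}")
print(f"VAE special case:     {kl_vae:.4f}")
```

The two closed-form values agree exactly, and the Monte Carlo estimate should match them to roughly two decimal places at this sample size.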

Python: Computing KL Divergence

import numpy as np


def kl_divergence_discrete(p: np.ndarray, q: np.ndarray) -> float:
    """
    KL divergence D_KL(P||Q) for discrete distributions.

    Args:
        p: true distribution (reference)
        q: approximate distribution (model)

    Returns:
        KL divergence in nats
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)

    # Ensure valid distributions
    assert np.isclose(p.sum(), 1.0) and np.isclose(q.sum(), 1.0)

    # Only sum where p > 0 (0 * log(0/q) = 0 by convention)
    mask = p > 0
    if np.any(q[mask] == 0):
        return float('inf')  # KL is infinite if q is 0 where p > 0

    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def kl_divergence_gaussian(mu1, sigma1, mu2=0.0, sigma2=1.0) -> float:
    """
    KL divergence D_KL(N(mu1,sigma1^2) || N(mu2,sigma2^2)).
    Default Q = N(0,1) for VAE KL term.
    """
    return (
        np.log(sigma2 / sigma1)
        + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
        - 0.5
    )


def vae_kl_loss(mu: np.ndarray, log_var: np.ndarray) -> float:
    """
    KL divergence for VAE: D_KL(N(mu, sigma^2) || N(0, I)).
    Input log_var = log(sigma^2) (standard VAE parameterization).
    """
    # Per-dimension: 0.5 * (mu^2 + sigma^2 - log_sigma^2 - 1)
    sigma_sq = np.exp(log_var)
    per_dim = 0.5 * (mu**2 + sigma_sq - log_var - 1)
    return float(np.sum(per_dim))  # sum over latent dimensions


# --- Demo 1: Asymmetry of KL divergence ---
print("=== Asymmetry of KL Divergence ===")
distributions = [
    ([0.9, 0.1], [0.5, 0.5], "Peaked P, Uniform Q"),
    ([0.5, 0.5], [0.9, 0.1], "Uniform P, Peaked Q"),
    ([0.7, 0.2, 0.1], [0.1, 0.2, 0.7], "Reversed peaks"),
    ([0.99, 0.01], [0.5, 0.5], "Very peaked P"),
]

print(f"{'Description':<30} | {'D_KL(P||Q)':>12} | {'D_KL(Q||P)':>12}")
print("-" * 60)
for p, q, desc in distributions:
    kl_fwd = kl_divergence_discrete(p, q)
    kl_rev = kl_divergence_discrete(q, p)
    print(f"{desc:<30} | {kl_fwd:>12.4f} | {kl_rev:>12.4f}")


# --- Demo 2: KL divergence goes infinite ---
p_has_support = [0.7, 0.2, 0.1]
q_missing_support = [0.8, 0.2, 0.0]  # q=0 where p>0!

kl = kl_divergence_discrete(p_has_support, q_missing_support)
print(f"\nD_KL(P||Q) when Q has zero where P>0: {kl}")
# Output: inf


# --- Demo 3: Gaussian KL for VAE ---
print("\n=== VAE KL Loss (D_KL(q||N(0,1))) ===")
test_cases = [
    (0.0, 0.0,  "Perfect match: mu=0, log_var=0 (sigma=1)"),
    (1.0, 0.0,  "Shifted: mu=1, sigma=1"),
    (0.0, 1.0,  "Wider: mu=0, sigma=sqrt(e)≈1.65"),
    (2.0, 1.0,  "Shifted + wider: mu=2, sigma=sqrt(e)"),
    (0.0, -2.0, "Narrow: mu=0, sigma=exp(-1)≈0.37"),
]

for mu, log_var, desc in test_cases:
    kl = vae_kl_loss(np.array([mu]), np.array([log_var]))
    print(f"  {desc}")
    print(f"    KL = {kl:.4f} nats")

KL Divergence in the VAE Loss

The VAE maximizes the Evidence Lower BOund (ELBO):

$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|x)}\bigl[\log p_\theta(x|z)\bigr] - D_{\text{KL}}\bigl(q_\phi(z|x) \| p(z)\bigr)$

Term 1 (Reconstruction): How well does the decoder reconstruct $x$ from latent $z$? Measures the likelihood of the data.

Term 2 (KL Penalty): How much does the encoder posterior $q_\phi(z|x)$ deviate from the prior $p(z) = \mathcal{N}(0, I)$?

The KL term acts as regularization: if the encoder encodes too much specific information about $x$ (departing from the prior), it is penalized.

import torch
import torch.nn.functional as F


def vae_loss(
    x: torch.Tensor,
    x_recon: torch.Tensor,
    mu: torch.Tensor,
    log_var: torch.Tensor,
    beta: float = 1.0,
) -> dict:
    """
    Negative ELBO to minimize = reconstruction + beta * KL, summed over the batch.

    Args:
        x:        original input (batch, dim)
        x_recon:  decoder output (batch, dim)
        mu:       encoder mean (batch, latent_dim)
        log_var:  encoder log variance (batch, latent_dim)
        beta:     KL weight (beta-VAE: beta>1 encourages disentanglement)

    Returns:
        dict with 'total', 'reconstruction', 'kl' losses
    """
    # Reconstruction: binary cross-entropy (for images in [0,1])
    recon_loss = F.binary_cross_entropy(x_recon, x, reduction='sum')

    # KL: -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
    kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    total = recon_loss + beta * kl_loss

    return {
        'total': total,
        'reconstruction': recon_loss,
        'kl': kl_loss
    }


# Example: batch of 4 samples, latent dim 2
torch.manual_seed(42)
batch_size, latent_dim = 4, 2

mu = torch.randn(batch_size, latent_dim) * 0.5
log_var = torch.randn(batch_size, latent_dim) * 0.5

x = torch.rand(batch_size, 16)
x_recon = torch.sigmoid(torch.randn(batch_size, 16))

losses = vae_loss(x, x_recon, mu, log_var, beta=1.0)
print(f"VAE losses:")
print(f"  Reconstruction: {losses['reconstruction'].item():.4f}")
print(f"  KL Divergence:  {losses['kl'].item():.4f}")
print(f"  Total:          {losses['total'].item():.4f}")
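Returning to the opening scenario, the closed-form KL term shows one plausible way the KL loss can reach the thousands: it contains $e^{\log\sigma^2}$, so it grows exponentially in the encoder's log-variance output and only quadratically in the mean. The sketch below is an illustration of that growth, not a diagnosis of any particular training run; `kl_per_dim` is a helper written for this example.

```python
import math


def kl_per_dim(mu, log_var):
    """Per-dimension D_KL(N(mu, sigma^2) || N(0, 1)) with log_var = log(sigma^2)."""
    return 0.5 * (mu ** 2 + math.exp(log_var) - log_var - 1.0)


print("Effect of log_var (mu fixed at 0):")
for log_var in [0.0, 2.0, 5.0, 10.0]:
    print(f"  log_var={log_var:5.1f} -> KL = {kl_per_dim(0.0, log_var):12.2f}")

print("Effect of mu (log_var fixed at 0):")
for mu in [0.0, 2.0, 5.0, 10.0]:
    print(f"  mu={mu:5.1f}      -> KL = {kl_per_dim(mu, 0.0):12.2f}")
```

A single latent dimension with `log_var = 10` already contributes more than 10,000 nats, whereas a mean of 10 contributes 50. This is why unbounded log-variance outputs are a common first place to look when the KL term blows up.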

KL Divergence in Reinforcement Learning: PPO

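Proximal Policy Optimization (PPO) limits how far the policy moves at each update. Its most common variant does this by clipping the probability ratio $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$ rather than by computing a KL divergence:

$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$

Clipping is not equivalent to a KL constraint, but it serves the same purpose as the trust-region constraint used in TRPO, which PPO was designed to approximate cheaply: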

$D_{\text{KL}}\bigl(\pi_{\theta_{\text{old}}}(\cdot|s_t) \| \pi_\theta(\cdot|s_t)\bigr) \leq \delta$

The KL-penalty version of PPO makes this explicit:

$L^{\text{KL}}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \hat{A}_t - \beta_t \, D_{\text{KL}}(\pi_{\theta_{\text{old}}} \| \pi_\theta)\right]$

Why? Large policy changes can destabilize training. By measuring the distributional change (not just parameter change), we ensure the new policy doesn't behave too differently from the old one - regardless of the scale of parameter space.
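A small example of why parameter distance is a poor proxy for behavioral change. The sketch treats the logits of a softmax policy as a stand-in for its parameters, which is a simplification made for this illustration; the specific numbers are arbitrary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """D_KL(P||Q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def l2(a, b):
    """Euclidean distance between two logit vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


base = [2.0, 0.5, -1.0]
shifted = [z + 5.0 for z in base]   # large move in logit space, same policy
nudged = [2.0, 2.0, -1.0]           # smaller move in logit space, different policy

l2_shift, kl_shift = l2(base, shifted), kl(softmax(base), softmax(shifted))
l2_nudge, kl_nudge = l2(base, nudged), kl(softmax(base), softmax(nudged))

print(f"shifted: L2 = {l2_shift:.3f}, KL = {kl_shift:.4f}")
print(f"nudged:  L2 = {l2_nudge:.3f}, KL = {kl_nudge:.4f}")
```

The shifted logits are far away in Euclidean terms yet define exactly the same action distribution, while the nudged logits are closer but change the policy noticeably. A KL-based constraint ranks these two updates the right way around.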

import torch
import torch.nn.functional as F


def ppo_kl_loss(
    old_log_probs: torch.Tensor,
    new_log_probs: torch.Tensor,
    actions: torch.Tensor,
    advantages: torch.Tensor,
    beta: float = 0.01,
    kl_target: float = 0.01,
) -> dict:
    """
    PPO with KL penalty.

    Args:
        old_log_probs: log probs under old policy, shape (batch, num_actions)
        new_log_probs: log probs under new policy, shape (batch, num_actions)
        actions:       indices of the actions actually taken, shape (batch,)
        advantages:    advantage estimates, shape (batch,)
        beta:          KL penalty coefficient used for this update
        kl_target:     target KL for adaptive beta

    Returns:
        dict with policy loss, kl, total loss, and beta for the next update
    """
    # Importance sampling ratio for the action taken in each state
    idx = actions.unsqueeze(-1)
    old_lp = old_log_probs.gather(-1, idx).squeeze(-1)
    new_lp = new_log_probs.gather(-1, idx).squeeze(-1)
    ratio = torch.exp(new_lp - old_lp)  # shape (batch,)

    # Policy gradient objective, negated because we minimize
    pg_loss = -(ratio * advantages).mean()

    # KL divergence: D_KL(old || new)
    # = sum_a pi_old(a) * log(pi_old(a) / pi_new(a))
    old_probs = torch.exp(old_log_probs)
    kl = torch.sum(old_probs * (old_log_probs - new_log_probs), dim=-1).mean()

    # The penalized loss uses the current beta
    total_loss = pg_loss + beta * kl

    # Adaptive beta for the next update (adaptive KL penalty from the PPO paper)
    next_beta = beta
    if kl.item() > 1.5 * kl_target:
        next_beta = beta * 2
    elif kl.item() < kl_target / 1.5:
        next_beta = beta / 2

    return {'policy_loss': pg_loss, 'kl': kl, 'beta': next_beta, 'total': total_loss}


# Simulate a small policy update
torch.manual_seed(0)
batch = 8
num_actions = 4

old_logits = torch.randn(batch, num_actions)
new_logits = old_logits + 0.1 * torch.randn(batch, num_actions)  # small update
advantages = torch.randn(batch)

old_log_probs = F.log_softmax(old_logits, dim=-1)
new_log_probs = F.log_softmax(new_logits, dim=-1)

# Sample the actions taken from the old (behavior) policy
actions = torch.multinomial(old_log_probs.exp(), num_samples=1).squeeze(-1)

result = ppo_kl_loss(old_log_probs, new_log_probs, actions, advantages)
print(f"KL divergence (old||new): {result['kl'].item():.6f}")
print(f"Policy loss:              {result['policy_loss'].item():.4f}")
print(f"Beta for next update:     {result['beta']:.4f}")

Jensen-Shannon Divergence: The Symmetric Alternative

The Jensen-Shannon divergence is a symmetrized, bounded version of KL:

$D_{\text{JS}}(P \| Q) = \frac{1}{2}D_{\text{KL}}\left(P \| M\right) + \frac{1}{2}D_{\text{KL}}\left(Q \| M\right)$

where $M = \frac{1}{2}(P + Q)$ is the mixture distribution.

Properties:

Symmetric: $D_{\text{JS}}(P \| Q) = D_{\text{JS}}(Q \| P)$
Bounded: $0 \leq D_{\text{JS}}(P \| Q) \leq \ln 2$ (nats) or $\leq 1$ (bits)
Well-defined even when distributions don't share support (unlike KL)
$\sqrt{D_{\text{JS}}}$ is a proper metric (satisfies triangle inequality)
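The boundedness and shared-support properties are easiest to see in the extreme case of two distributions with disjoint support. The sketch below is self-contained, so it redefines small `kl` and `js` helpers for this example.

```python
import numpy as np


def kl(p, q):
    """D_KL(P||Q) in nats; returns inf if q is zero where p is positive."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    if np.any(q[nz] == 0):
        return float("inf")
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))


def js(p, q):
    """Jensen-Shannon divergence in nats, via the mixture M = (P + Q) / 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Disjoint support: P puts all mass on outcome 0, Q on outcome 1
p = [1.0, 0.0]
q = [0.0, 1.0]

kl_pq = kl(p, q)
js_pq = js(p, q)
js_qp = js(q, p)

print(f"KL(P||Q) = {kl_pq}")
print(f"JS(P,Q)  = {js_pq:.6f}")
print(f"JS(Q,P)  = {js_qp:.6f}")
print(f"ln(2)    = {np.log(2):.6f}")
```

KL is infinite here, while JS is finite, symmetric, and sits exactly at its upper bound $\ln 2$. The mixture $M$ is positive wherever $P$ or $Q$ is, which is what keeps both inner KL terms finite.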

:::info ML Connection - GANs and Jensen-Shannon Divergence
The original GAN paper (Goodfellow et al., 2014) showed that, when the discriminator is optimal, the generator's training criterion reduces to the Jensen-Shannon divergence between the real data distribution $p_{\text{data}}$ and the generator distribution $p_g$, up to a constant:

$C(G) = -\log 4 + 2 \cdot D_{\text{JS}}(p_{\text{data}} \| p_g)$

This is why standard GANs can suffer from vanishing gradients when the two distributions don't overlap: $D_{\text{JS}}$ saturates at $\log 2$ for non-overlapping distributions, so it gives the generator no useful gradient. The Wasserstein GAN replaced JS divergence with the Earth Mover's distance to address this.
:::

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence (nats). Always in [0, ln(2)]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence_discrete(p, m) + 0.5 * kl_divergence_discrete(q, m)


# Compare KL vs JS for distributions with varying overlap
print("\n=== KL vs JS Divergence ===")
test_pairs = [
    ([0.5, 0.5], [0.5, 0.5], "Identical"),
    ([0.9, 0.1], [0.1, 0.9], "Opposite"),
    ([0.7, 0.3], [0.5, 0.5], "Moderate"),
    ([0.99, 0.01], [0.5, 0.5], "Very different"),
]

print(f"{'Case':<20} | {'KL(P||Q)':>10} | {'KL(Q||P)':>10} | {'JS':>8}")
print("-" * 55)
for p, q, name in test_pairs:
    kl_pq = kl_divergence_discrete(p, q)
    kl_qp = kl_divergence_discrete(q, p)
    js    = js_divergence(p, q)
    print(f"{name:<20} | {kl_pq:>10.4f} | {kl_qp:>10.4f} | {js:>8.4f}")
print(f"\nln(2) = {np.log(2):.4f} nats (maximum JS)")

Summary: KL Variants Used in ML

Divergence          | Formula                       | ML Application
--------------------+-------------------------------+---------------------------
Forward KL          | E_P[log(P/Q)]                 | EM algorithm, ADF
Reverse KL          | E_Q[log(Q/P)]                 | VAE, VI (mode-seeking)
JS divergence       | 0.5*KL(P||M) + 0.5*KL(Q||M)   | GAN training
Alpha divergence    | family parameterized by α     | EP, power EP
Rényi divergence    | 1/(α-1) log E_Q[(P/Q)^α]      | Min-max RL
Total variation     | 0.5 Σ |p(x)-q(x)|             | Theoretical bounds

Interview Questions and Answers

Q1: What is KL divergence and why is it not a true distance metric?

KL divergence $D_{\text{KL}}(P \| Q) = \mathbb{E}_{x \sim P}[\log(p(x)/q(x))]$ measures how much information is lost when using $Q$ to approximate $P$. It is not a metric because it violates two metric axioms: (1) it is not symmetric - $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ in general, and (2) it does not satisfy the triangle inequality. We still use it because it has excellent theoretical properties: it equals zero iff $P = Q$, is always non-negative, and has a natural information-theoretic interpretation.

Q2: What is the difference between forward and reverse KL, and which does a VAE use?

Forward KL: $D_{\text{KL}}(P \| Q)$ - minimize by placing $Q$'s mass everywhere $P$ has mass (mode-covering). If you ignore a mode of $P$ with $q(x) \approx 0$, the term $p(x)\log(p(x)/q(x))$ becomes huge.

Reverse KL: $D_{\text{KL}}(Q \| P)$ - minimize by not placing $Q$'s mass where $P$ has none (mode-seeking). Regions where $q(x) = 0$ contribute nothing, so $Q$ can safely ignore parts of $P$'s support.

VAEs use reverse KL: $D_{\text{KL}}(q_\phi(z|x) \| p(z))$. The encoder posterior $q_\phi$ is penalized for deviating from the prior $p(z)$, but not for failing to cover all of $p(z)$'s support. Posterior collapse, a known VAE failure mode, is the opposite extreme: the KL term is driven to zero, so $q_\phi(z|x)$ matches the prior for every $x$ and the latent code carries no information about the input.

Q3: Why does the KL divergence go to infinity when Q assigns zero probability to a region where P has nonzero probability?

In $D_{\text{KL}}(P \| Q) = \sum_x p(x)\log(p(x)/q(x))$, when $p(x) > 0$ but $q(x) = 0$, the term becomes $p(x)\log(\infty) = +\infty$. Intuitively: $Q$ is asserting that event $x$ is impossible, but $P$ says it can happen. If $x$ does occur (sampled from $P$), we have infinite surprise - $Q$ is completely unprepared for it. This is why in practice, we add label smoothing or use $\epsilon$-regularization to prevent any $q(x) = 0$.
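The $\epsilon$-regularization mentioned above can be sketched in a few lines. The `smooth` helper and the value of `eps` are assumptions of this example; the point is only that adding a small constant and renormalizing turns an infinite KL into a finite one whose size depends on `eps`.

```python
import numpy as np


def kl(p, q):
    """D_KL(P||Q) in nats; returns inf if q is zero where p is positive."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    if np.any(q[nz] == 0):
        return float("inf")
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))


def smooth(q, eps=1e-6):
    """Add eps to every entry and renormalize so the result sums to 1."""
    q = np.asarray(q, dtype=float) + eps
    return q / q.sum()


p = [0.7, 0.2, 0.1]
q = [0.8, 0.2, 0.0]   # q is zero where p is positive

kl_raw = kl(p, q)
kl_smooth = kl(p, smooth(q, eps=1e-6))
kl_smoother = kl(p, smooth(q, eps=1e-9))

print(f"KL without smoothing: {kl_raw}")
print(f"KL with eps=1e-6:     {kl_smooth:.4f}")
print(f"KL with eps=1e-9:     {kl_smoother:.4f}")
```

The smoothed value is finite but not canonical: shrinking `eps` makes it grow without bound, so `eps` should be treated as a modeling choice rather than a numerical detail.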

Q4: How does KL divergence appear in the PPO algorithm, and why use it there instead of L2 distance between parameters?

PPO's KL-penalty variant limits policy updates by penalizing $D_{\text{KL}}(\pi_{\text{old}} \| \pi_{\text{new}})$, and the clipped variant pursues the same trust-region goal by clipping the probability ratio. Either way, the quantity being controlled is the behavioral difference between policies - how differently they would act in the environment - not the Euclidean distance between parameters.

Parameters can have very different scales across layers, so L2 distance in parameter space is not behaviorally meaningful: a tiny change in a highly sensitive parameter can drastically change policy behavior. KL divergence directly measures the change in the action distribution, which is what matters for stability. Two policies can have very different parameters but zero KL divergence between them (and thus identical behavior), and it is the behavior that we want to constrain.

Q5: What is the relationship between KL divergence and information gain in Bayesian inference?

In Bayesian inference, after observing data $D$, you update from prior $p(\theta)$ to posterior $p(\theta|D)$. The KL divergence from prior to posterior:

$D_{\text{KL}}\bigl(p(\theta|D) \| p(\theta)\bigr) = \mathbb{E}_{\theta \sim \text{posterior}}\left[\log\frac{p(\theta|D)}{p(\theta)}\right]$

measures how much information you gained about $\theta$ from $D$. This quantity is called the information gain or the Bayesian surprise: it quantifies how much the data shifted your beliefs. Small KL divergence = data was not very informative (consistent with many parameter values). Large KL = data strongly ruled out large regions of parameter space. Bayesian experimental design averages this quantity over the possible outcomes of an experiment, giving the expected information gain, and chooses the experiment that maximizes it.
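As a rough illustration, the sketch below computes the information gain for a coin-flip model on a discretized grid of bias values. The uniform prior, the grid resolution, and the flip counts are all assumptions chosen for this example, and the grid turns the integral into a discrete sum.

```python
import numpy as np

# Discretized parameter space for the coin bias theta (endpoints excluded)
theta = np.linspace(0.001, 0.999, 999)
prior = np.full(theta.size, 1.0 / theta.size)   # uniform prior on the grid


def posterior(heads, flips):
    """Grid posterior over theta after observing `heads` in `flips` tosses."""
    log_like = heads * np.log(theta) + (flips - heads) * np.log(1.0 - theta)
    weights = prior * np.exp(log_like - log_like.max())  # shift for stability
    return weights / weights.sum()


def information_gain(post):
    """D_KL(posterior || prior) in nats on the grid."""
    nz = post > 0
    return float(np.sum(post[nz] * np.log(post[nz] / prior[nz])))


gain_none = information_gain(posterior(0, 0))      # no data observed
gain_small = information_gain(posterior(7, 10))    # 7 heads in 10 flips
gain_large = information_gain(posterior(70, 100))  # 70 heads in 100 flips

print(f"No data:       {gain_none:.4f} nats")
print(f"10 flips:      {gain_small:.4f} nats")
print(f"100 flips:     {gain_large:.4f} nats")
```

With no data the posterior equals the prior and the gain is zero. The same observed frequency over ten times as many flips concentrates the posterior further and yields a larger gain.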
