What is dynamic programming?

Policy evaluation, policy iteration, and value iteration - solving MDPs exactly when you know the environment model. Master the theoretical foundation that all model-free RL approximates.

How does policy evaluation work in practice?

Dynamic Programming for RL covers dynamic programming, policy evaluation, policy iteration from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/reinforcement-learning/dynamic-programming

What is the difference between dynamic programming and policy iteration?

See the full breakdown at https://engineersofai.com/docs/ml/reinforcement-learning/dynamic-programming

Dynamic Programming for RL

Reading time: ~45 min | Interview relevance: High (conceptual foundation) | Target roles: ML Engineer, Research Engineer, AI Engineer

The Real Engineering Moment

Before AlphaGo, before DQN, before RLHF - game-playing AI was built on dynamic programming. Early chess programs evaluated board positions by backward induction: look ahead several moves, compute the value at the leaves, and back-propagate to the root. IBM's Deep Blue, which defeated Garry Kasparov in 1997, searched 200 million positions per second using minimax with alpha-beta pruning - a specialized form of DP. The technique dates to Bellman in the 1950s.

Understanding dynamic programming is not just historically useful. When you see a bug in your PPO training loop where values are collapsing, understanding what the Bellman backup is supposed to do tells you where to look. When you design a reward shaping scheme for a new task, you are essentially solving a modified MDP analytically. When an interviewer asks "why does Q-learning converge?" - the answer lives in the DP contraction mapping proof. The theory lives here.

The mental model to carry through this lesson: DP is exact RL. Model-free methods (Q-learning, SARSA, PPO) are all approximations of what DP would compute if we had the model. Every temporal difference update is a sampled, single-step approximation of the Bellman backup. Once you see this, the zoo of RL algorithms becomes much more coherent.

:::note The Key Limitation Dynamic programming requires full knowledge of the environment model - the transition probabilities $P(s'|s,a)$ and the reward function $R(s,a)$ . In most real problems, you don't have this. Model-free methods exist precisely because DP is often infeasible. But you need to understand what DP computes before understanding why model-free methods approximate it. :::

Why This Exists: The Problem with Brute Force

Before dynamic programming, solving a sequential decision problem required exhaustive search: enumerate all possible sequences of actions, compute the return for each, pick the best. For a problem with $|A|$ actions and $T$ time steps, that is $|A|^T$ candidate sequences - exponential in the horizon.

Bellman's insight (1954): An optimal policy has a recursive structure. Whatever the first action was, the remaining actions must be optimal for the resulting state. This is the principle of optimality:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

This recursive structure means you can decompose the global optimization into a sequence of local optimizations. Instead of $|A|^T$ candidates, you only need $|A| \times |S|$ computations per iteration. The exponential becomes polynomial.

Historical Context

1953: Richard Bellman joins RAND Corporation and begins working on multi-stage decision problems. He needs to find optimal control policies for dynamic systems. Classical calculus methods fail - they require differentiability and can't handle discrete decisions or constraints.

1954: Bellman formalizes the principle of optimality and derives the recursive equations. He deliberately calls it "dynamic programming" - "dynamic" to convey the time-varying nature, "programming" in the mathematical/planning sense (not computer programming). He later admitted the name was chosen partly to obscure the mathematical nature of the work from military funding agencies who distrusted "mathematics."

1957: Bellman publishes "Dynamic Programming" - the foundational textbook. His equations for MDPs appear, though MDPs as a formal model are crystallized by Howard in 1960.

1980s–1990s: Watkins, Sutton, and others connect DP to online, model-free learning. Temporal difference (TD) learning emerges as "DP without the model" - using sampled transitions instead of the full $P$ matrix.

2013: DQN shows that the Bellman optimality backup, implemented with neural function approximation and experience replay, achieves superhuman Atari performance. The DP ideas survive; the tabular tables do not.

The Three DP Algorithms: Overview

Dynamic programming gives us three related algorithms, all based on iteratively applying the Bellman equation:

Algorithm	What it computes	Key update	When to use
Policy Evaluation	$V^\pi$ for fixed $\pi$	Bellman expectation	Assess a given policy
Policy Iteration	Optimal $\pi^*$	Evaluation + greedy improvement	Small-medium problems
Value Iteration	$V^*$ directly	Bellman optimality	Faster in practice for large problems

Algorithm 1: Policy Evaluation

Problem: Given a policy $\pi$ , compute its value function $V^\pi$ .

The Bellman expectation equation as an iterative update:

$V_{k+1}(s) \leftarrow \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[R(s,a) + \gamma V_k(s')\right]$

Start with $V_0 = \mathbf{0}$ and iterate. This is iterative policy evaluation.

Convergence Proof: Contraction Mapping

Why does this converge? The Bellman expectation operator $\mathcal{T}^\pi$ maps one value function to another:

$(\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V(s')]$

Claim: $\mathcal{T}^\pi$ is a contraction mapping in the $L^\infty$ (sup-norm) sense.

Proof: Let $V$ and $V'$ be any two value functions. Then:

$|(\mathcal{T}^\pi V)(s) - (\mathcal{T}^\pi V')(s)|$

$= \left|\gamma \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[V(s') - V'(s')]\right|$

$\leq \gamma \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) |V(s') - V'(s')|$

$\leq \gamma \|V - V'\|_\infty \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)$

$= \gamma \|V - V'\|_\infty$

Taking the sup over $s$ : $\|\mathcal{T}^\pi V - \mathcal{T}^\pi V'\|_\infty \leq \gamma \|V - V'\|_\infty$ .

Since $\gamma < 1$ , each iteration shrinks the error by factor $\gamma$ . By the Banach fixed-point theorem, there exists a unique fixed point $V^\pi$ such that $\mathcal{T}^\pi V^\pi = V^\pi$ , and iterating $\mathcal{T}^\pi$ from any starting point converges to $V^\pi$ .

Convergence rate: After $k$ iterations, the error is $\gamma^k \|V_0 - V^\pi\|_\infty$ . For $\gamma = 0.9$ and tolerance $10^{-8}$ starting from a zero initialization with maximum reward $r_{max}$ : you need $k = \log(r_{max}/10^{-8}) / \log(1/\gamma) \approx 175$ iterations for $r_{max} = 1$ . More for larger rewards.

Full Implementation

import numpy as np
from typing import Optional


def policy_evaluation(
    policy: np.ndarray,       # shape (n_states, n_actions) - π(a|s)
    P: np.ndarray,            # shape (n_states, n_actions, n_states) - P(s'|s,a)
    R: np.ndarray,            # shape (n_states, n_actions) - R(s,a)
    gamma: float = 0.9,
    tol: float = 1e-8,
    max_iter: int = 10_000,
    in_place: bool = False,   # use in-place updates (faster convergence)
) -> tuple[np.ndarray, list[float]]:
    """
    Iterative policy evaluation.

    Convergence guarantee: By Banach fixed-point theorem, iterating
    the Bellman operator T^pi (a gamma-contraction in L-inf norm)
    converges to the unique fixed point V^pi.

    Returns (V^pi, convergence_deltas)
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    deltas = []

    for iteration in range(max_iter):
        if in_place:
            # In-place update: immediately use new V values
            # Faster convergence, same fixed point
            V_old = V.copy()
            for s in range(n_states):
                Q_s = R[s] + gamma * P[s] @ V   # shape: (n_actions,)
                V[s] = policy[s] @ Q_s           # scalar
            delta = np.max(np.abs(V - V_old))
        else:
            # Synchronous update: use old V for all updates
            Q = R + gamma * (P @ V)       # (n_states, n_actions)
            V_new = np.einsum('sa,sa->s', policy, Q)   # (n_states,)
            delta = np.max(np.abs(V_new - V))
            V = V_new

        deltas.append(delta)

        if delta < tol:
            print(f"Policy evaluation converged: {iteration + 1} iterations, "
                  f"delta={delta:.2e}")
            break

    return V, deltas


def policy_evaluation_matrix(
    policy: np.ndarray,
    P: np.ndarray,
    R: np.ndarray,
    gamma: float = 0.9,
) -> np.ndarray:
    """
    Direct matrix solution for V^pi (exact, not iterative).

    The Bellman expectation equation in matrix form:
       V = R_pi + gamma * P_pi * V
    =>  (I - gamma * P_pi) V = R_pi
    =>  V = (I - gamma * P_pi)^{-1} R_pi

    Where R_pi[s] = sum_a pi(a|s) R(s,a)
    and   P_pi[s,s'] = sum_a pi(a|s) P(s'|s,a)

    O(|S|^3) - only feasible for very small state spaces.
    """
    n_states = P.shape[0]

    # Compute policy-weighted reward and transition matrix
    R_pi = np.einsum('sa,sa->s', policy, R)         # (n_states,)
    P_pi = np.einsum('sa,sas->ss', policy,           # (n_states, n_states)
                     P.transpose(0, 1, 2))

    # Actually easier notation:
    # P_pi[s, s'] = sum_a pi[s,a] * P[s,a,s']
    P_pi = np.einsum('sa,san->sn', policy, P)        # (n_states, n_states)

    # Solve linear system
    A = np.eye(n_states) - gamma * P_pi
    V = np.linalg.solve(A, R_pi)
    return V

Algorithm 2: Policy Iteration

Problem: Find the optimal policy $\pi^*$ .

Idea: Alternately evaluate the current policy (policy evaluation) and improve it greedily (policy improvement). Each improvement step is guaranteed not to make the policy worse.

Policy Improvement: The Greedy Step

Given $V^\pi$ , we compute the Q-function and take the greedy action:

$\pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} P(s'|s,a)\left[R(s,a) + \gamma V^\pi(s')\right]$

Policy Improvement Theorem: Formal Proof Sketch

Theorem (Sutton & Barto): If $\pi'$ is the greedy policy w.r.t. $V^\pi$ , then $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$ .

Proof: By definition of greedy selection:

$Q^\pi(s, \pi'(s)) = \max_a Q^\pi(s, a) \geq Q^\pi(s, \pi(s)) = V^\pi(s)$

Now expand by rolling out one more step under $\pi'$ :

$V^\pi(s) \leq Q^\pi(s, \pi'(s)) = \mathbb{E}[r_t + \gamma V^\pi(s_{t+1}) \mid s_t=s, a_t=\pi'(s)]$

$\leq \mathbb{E}[r_t + \gamma Q^\pi(s_{t+1}, \pi'(s_{t+1})) \mid s_t=s, a_t=\pi'(s)]$

$\leq \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 V^\pi(s_{t+2}) \mid \ldots]$

$\leq \ldots \leq \mathbb{E}_{\pi'}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t=s\right] = V^{\pi'}(s)$

The inequality is strict if $\pi'(s) \neq \pi(s)$ for at least one state. If $\pi'(s) = \pi(s)$ everywhere, then $V^\pi$ satisfies the Bellman optimality equation - $\pi$ is already optimal.

Full Policy Iteration Implementation

def policy_iteration(
    P: np.ndarray,       # (n_states, n_actions, n_states)
    R: np.ndarray,       # (n_states, n_actions)
    gamma: float = 0.9,
    eval_tol: float = 1e-8,
    max_outer: int = 1000,
    verbose: bool = True,
) -> tuple[np.ndarray, np.ndarray, list[int]]:
    """
    Policy iteration.

    Alternates between:
    1. Policy evaluation: compute V^pi exactly (iterative)
    2. Policy improvement: π' = greedy(V^pi)

    Convergence: guaranteed in finite steps (≤ |A|^|S| policies).
    In practice: converges in very few iterations (< 50 even for large problems).

    Returns (optimal_policy, V*, evaluation_iterations_per_round)
    """
    n_states, n_actions, _ = P.shape
    eval_counts = []

    # Initialize: uniform random policy
    policy = np.ones((n_states, n_actions)) / n_actions

    for outer_iter in range(max_outer):
        # ── Step 1: Policy Evaluation ────────────────────────────────────────
        V, deltas = policy_evaluation(policy, P, R, gamma, tol=eval_tol)
        eval_counts.append(len(deltas))

        # ── Step 2: Policy Improvement (greedy) ─────────────────────────────
        # Q[s,a] = R[s,a] + gamma * sum_{s'} P[s,a,s'] * V[s']
        Q = R + gamma * (P @ V)       # (n_states, n_actions)
        best_actions = np.argmax(Q, axis=1)  # (n_states,)

        # Build new deterministic policy
        new_policy = np.zeros((n_states, n_actions))
        new_policy[np.arange(n_states), best_actions] = 1.0

        # ── Convergence Check ────────────────────────────────────────────────
        if np.array_equal(new_policy, policy):
            if verbose:
                print(f"Policy iteration converged after {outer_iter + 1} "
                      f"outer iterations")
            return best_actions, V, eval_counts

        policy = new_policy

    return np.argmax(policy, axis=1), V, eval_counts

Convergence bound: There are at most $|A|^{|S|}$ deterministic policies. Since each iteration strictly improves or maintains $V^\pi$ (and $V$ is bounded), policy iteration must converge in at most $|A|^{|S|}$ steps. In practice, it converges in far fewer - typically $O(|S|)$ iterations.

Algorithm 3: Value Iteration

Problem: Find $V^*$ directly, without explicit policy evaluation steps.

Key insight: Instead of waiting for policy evaluation to converge before improving, apply a single Bellman optimality backup at each state:

$V_{k+1}(s) \leftarrow \max_a \left[R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s')\right]$

This is the Bellman optimality operator $\mathcal{T}^*$ applied iteratively. It also forms a contraction with factor $\gamma$ , converging to $V^*$ .

Why this works: The Bellman optimality operator:

$(\mathcal{T}^* V)(s) = \max_a \left[R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')\right]$

is also a $\gamma$ -contraction in $L^\infty$ . Same proof as for $\mathcal{T}^\pi$ , but with $\max$ instead of $\sum_a \pi(a|s)$ . The $\max$ operator doesn't affect the contraction because $|\max f - \max g| \leq \max |f - g|$ .

Full Value Iteration Implementation

def value_iteration(
    P: np.ndarray,          # (n_states, n_actions, n_states)
    R: np.ndarray,          # (n_states, n_actions)
    gamma: float = 0.9,
    tol: float = 1e-8,
    max_iter: int = 100_000,
    verbose: bool = True,
) -> tuple[np.ndarray, np.ndarray, list[float]]:
    """
    Value iteration.

    Applies the Bellman optimality operator T* iteratively:
       V_{k+1}(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) * V_k(s')]

    Convergence: T* is a gamma-contraction in L-inf norm.
    After k iterations: ||V_k - V*||_inf <= gamma^k / (1-gamma) * ||V_1 - V_0||_inf

    Returns (optimal_policy, V*, convergence_deltas)
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    deltas = []

    for iteration in range(max_iter):
        # Q[s,a] = R[s,a] + gamma * sum_{s'} P[s,a,s'] * V[s']
        Q = R + gamma * (P @ V)    # (n_states, n_actions)

        # V_new[s] = max_a Q[s,a]
        V_new = np.max(Q, axis=1)  # (n_states,)

        delta = np.max(np.abs(V_new - V))
        deltas.append(delta)
        V = V_new

        if delta < tol:
            if verbose:
                print(f"Value iteration converged: {iteration + 1} iterations, "
                      f"delta={delta:.2e}")
            break

    # Extract greedy optimal policy
    Q_final = R + gamma * (P @ V)
    optimal_policy = np.argmax(Q_final, axis=1)

    return optimal_policy, V, deltas

Error bound at convergence: If the iteration terminates because $\|V_{k+1} - V_k\|_\infty \leq \epsilon$ , then the true error satisfies:

$\|V_k - V^*\|_\infty \leq \frac{\gamma}{1-\gamma} \epsilon$

For $\gamma = 0.9$ and $\epsilon = 10^{-8}$ : the true error is bounded by $9 \times 10^{-8}$ .

GridWorld: Full DP Solution with Visualization

Let's implement and solve a classic 4×4 GridWorld to see DP in action:

┌───┬───┬───┬───┐
│ S │   │   │   │    S = Start (state 0)
├───┼───┼───┼───┤
│   │ X │   │   │    X = Wall (state 5, impassable)
├───┼───┼───┼───┤
│   │   │   │   │
├───┼───┼───┼───┤
│   │   │   │ G │    G = Goal (state 15, reward +10)
└───┴───┴───┴───┘

import numpy as np

class GridWorld:
    """
    4x4 GridWorld with a wall and a goal state.
    Actions: 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT
    Reward: +10 for reaching goal, -0.1 per step (time penalty).
    Transitions: deterministic (no stochasticity).
    """

    ACTION_NAMES = {0: '↑', 1: '→', 2: '↓', 3: '←'}
    MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

    def __init__(self, size: int = 4, gamma: float = 0.9):
        self.size = size
        self.gamma = gamma
        self.n_states = size * size
        self.n_actions = 4
        self.goal = self.n_states - 1     # bottom-right: state 15
        self.walls = {5}                   # position (1,1)
        self.P, self.R = self._build_mdp()

    def _idx(self, r: int, c: int) -> int:
        return r * self.size + c

    def _rc(self, s: int) -> tuple[int, int]:
        return s // self.size, s % self.size

    def _build_mdp(self):
        n, na = self.n_states, self.n_actions
        P = np.zeros((n, na, n))
        R = np.full((n, na), -0.1)   # step cost

        for s in range(n):
            if s == self.goal:
                P[s, :, s] = 1.0      # absorbing terminal
                R[s, :] = 0.0
                continue
            if s in self.walls:
                P[s, :, s] = 1.0      # walls absorb too
                R[s, :] = -1.0        # penalty for entering wall (shouldn't happen)
                continue

            r, c = self._rc(s)
            for a, (dr, dc) in self.MOVES.items():
                nr, nc = r + dr, c + dc
                # Clamp to grid boundaries
                if not (0 <= nr < self.size and 0 <= nc < self.size):
                    ns = s      # bounce off boundary
                else:
                    ns = self._idx(nr, nc)
                    if ns in self.walls:
                        ns = s  # bounce off wall

                P[s, a, ns] += 1.0
                if ns == self.goal:
                    R[s, a] += 10.0

        return P, R

    def display_values(self, V: np.ndarray, title: str = "Value Function"):
        print(f"\n{title}:")
        border = "+" + "--------+" * self.size
        print(border)
        for r in range(self.size):
            row = "|"
            for c in range(self.size):
                s = self._idx(r, c)
                if s == self.goal:
                    row += "  GOAL  |"
                elif s in self.walls:
                    row += "  WALL  |"
                else:
                    row += f" {V[s]:6.3f} |"
            print(row)
            print(border)

    def display_policy(self, policy: np.ndarray, title: str = "Optimal Policy"):
        print(f"\n{title}:")
        border = "+" + "------+" * self.size
        print(border)
        for r in range(self.size):
            row = "|"
            for c in range(self.size):
                s = self._idx(r, c)
                if s == self.goal:
                    row += "  G   |"
                elif s in self.walls:
                    row += "  X   |"
                else:
                    row += f"  {self.ACTION_NAMES[policy[s]]}   |"
            print(row)
            print(border)


def compare_algorithms(gamma: float = 0.9):
    """Run all three DP algorithms and compare results."""
    grid = GridWorld(size=4, gamma=gamma)
    P, R = grid.P, grid.R

    # ── Value Iteration ──────────────────────────────────────────────────────
    print("=" * 60)
    print("VALUE ITERATION")
    pi_vi, V_vi, deltas_vi = value_iteration(P, R, gamma=gamma, verbose=True)
    grid.display_values(V_vi, "V* (Value Iteration)")
    grid.display_policy(pi_vi, "π* (Value Iteration)")
    print(f"Iterations: {len(deltas_vi)}")

    # ── Policy Iteration ─────────────────────────────────────────────────────
    print("\n" + "=" * 60)
    print("POLICY ITERATION")
    pi_poli, V_poli, eval_counts = policy_iteration(P, R, gamma=gamma, verbose=True)
    grid.display_values(V_poli, "V* (Policy Iteration)")
    grid.display_policy(pi_poli, "π* (Policy Iteration)")
    print(f"Outer iterations: {len(eval_counts)}")
    print(f"Total inner iterations: {sum(eval_counts)}")

    # ── Verify Equivalence ──────────────────────────────────────────────────
    print("\n" + "=" * 60)
    print("COMPARISON:")
    print(f"Max V difference: {np.max(np.abs(V_vi - V_poli)):.2e}")
    print(f"Policy agreement: {np.mean(pi_vi == pi_poli) * 100:.1f}%")


if __name__ == "__main__":
    compare_algorithms(gamma=0.9)

Expected output summary:

Value Function (4×4 GridWorld, γ=0.9):
+--------+--------+--------+--------+
|  3.352 |  4.437 |  5.735 |  7.186 |
+--------+--------+--------+--------+
|  2.643 |  WALL  |  7.186 |  8.651 |
+--------+--------+--------+--------+
|  3.352 |  4.437 |  7.186 | 10.000 |
+--------+--------+--------+--------+
|  2.643 |  3.352 |  7.186 |  GOAL  |
+--------+--------+--------+--------+

States adjacent to the goal have high values. The wall forces a detour that reduces values on the left side. The optimal policy navigates around the wall.

Asynchronous DP Variants

The standard DP algorithms update all states simultaneously (synchronous). Asynchronous variants update states in a different order, which can be more efficient.

In-Place DP

Instead of computing a completely new $V$ each iteration, immediately use updated values:

def value_iteration_inplace(P, R, gamma=0.9, tol=1e-8):
    """
    In-place value iteration: immediately use updated V[s] for subsequent states.
    Converges in fewer iterations (though each iteration is same cost).
    The update order within each sweep can affect speed significantly.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)

    for iteration in range(100_000):
        delta = 0.0
        for s in range(n_states):
            Q_s = R[s] + gamma * P[s] @ V   # (n_actions,)
            v_new = np.max(Q_s)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                      # immediate update

        if delta < tol:
            print(f"In-place VI converged: {iteration + 1} iterations")
            break

    return np.argmax(R + gamma * (P @ V), axis=1), V

Prioritized Sweeping

Update states in order of how much their value estimate changed - prioritize states where the Bellman error (current estimate vs backed-up value) is largest. This focuses computation where it's most needed.

import heapq

def prioritized_sweeping(P, R, gamma=0.9, tol=1e-8, theta=1e-5):
    """
    Prioritized sweeping: update states in order of Bellman error magnitude.
    Much more efficient when only a few states have large errors.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)

    # Priority queue: (negative_error, state_index)
    pq = []
    for s in range(n_states):
        heapq.heappush(pq, (-1.0, s))   # initialize all with high priority

    total_updates = 0
    while pq:
        neg_priority, s = heapq.heappop(pq)
        if -neg_priority < tol:
            break

        # Update state s
        Q_s = R[s] + gamma * P[s] @ V
        V_new_s = np.max(Q_s)
        V[s] = V_new_s
        total_updates += 1

        # Re-prioritize predecessors of s
        for pred in range(n_states):
            for a in range(n_actions):
                if P[pred, a, s] > 0:
                    Q_pred = R[pred] + gamma * P[pred] @ V
                    error = abs(np.max(Q_pred) - V[pred])
                    if error > theta:
                        heapq.heappush(pq, (-error, pred))

    print(f"Prioritized sweeping: {total_updates} updates (vs {n_states}×iters for standard VI)")
    return np.argmax(R + gamma * (P @ V), axis=1), V

Real-Time DP (RTDP)

Only update states that the agent actually visits during policy execution. Converges to optimal for reachable states. Useful when only a small fraction of states are ever visited.

Generalized Policy Iteration (GPI)

The concept that unifies all RL algorithms - not just the three DP methods:

:::note Generalized Policy Iteration Any interleaving of policy evaluation (making the value function consistent with the current policy) and policy improvement (making the policy greedy with respect to the current value function) converges to $V^*$ and $\pi^*$ . :::

The two processes compete: making $V$ accurate for $\pi$ destroys the optimality of $\pi$ w.r.t. $V$ ; making $\pi$ greedy for $V$ makes $V$ inaccurate again. But they both push toward the same equilibrium: $V^*$ and $\pi^*$ .

GPI explains every RL algorithm:

Algorithm	Evaluation step	Improvement step
Policy Iteration	Full convergence of $V^\pi$	Greedy $\pi' = \arg\max Q^\pi$
Value Iteration	One Bellman optimality backup	Implicit (max in backup)
Q-Learning	TD(0) update: $Q \leftarrow Q + \alpha \delta$	Greedy: use $\max Q$
SARSA	TD(0) update with on-policy target	On-policy (softer) improvement
PPO	Value function regression	Policy gradient update

Comparison: All Three DP Algorithms

	Policy Evaluation	Policy Iteration	Value Iteration
Computes	$V^\pi$ for fixed $\pi$	$\pi^, V^$	$V^, \pi^$
Convergence guarantee	Exact (within $\epsilon$ of $V^\pi$ )	Exact (finite iterations)	Approximate (within $\epsilon$ of $V^*$ )
Iterations needed	$O(\log(1/\epsilon) / \log(1/\gamma))$	Few outer (many inner)	$O(\log(1/\epsilon) / \log(1/\gamma))$
Cost per iteration	$O(\\|S\\|^2 \\|A\\|)$	Same	Same
Memory	$O(\\|S\\|)$ for $V$	$O(\\|S\\|)$ for $V$ + $\pi$	$O(\\|S\\|)$ for $V$
When faster	Never standalone	Small problems (<1K states)	Large problems (policy eval costly)
Requires model	Yes	Yes	Yes
Practical use	Sub-routine in Policy Iter	Robotics, small MDPs	Games, planning problems

The Curse of Dimensionality

DP is exact and elegant. In practice, it breaks down in three ways:

1. Model required: You need $P(s'|s,a)$ . For most interesting problems - Atari, robot manipulation, RLHF - you don't have this. You can only sample from the environment.

2. State space explosion: The $P$ matrix has size $|S|^2 \times |A|$ . For Atari (84×84×4 frames), there are approximately $256^{84 \times 84 \times 4} \approx 10^{67970}$ distinct states. You cannot store a table. For an LLM with vocabulary 50,000 and context length 2048, state space is $50000^{2048}$ - utterly intractable.

3. Continuous state/action spaces: Joint angles, velocities, temperatures - continuous variables mean infinite state spaces. Even with discretization, the curse of dimensionality means the table grows exponentially in the number of dimensions.

GridWorld (16 states):    → Tabular DP works fine
Atari (10^67970 states):  → Function approximation required (DQN)
Robotics (continuous):    → Function approx + policy gradient (PPO, SAC)
LLMs (token sequences):   → Function approx + policy gradient (RLHF)

These limitations motivate everything that follows:

Connection to Model-Based RL: Dyna-Q

An important bridge between model-based DP and model-free learning: the Dyna architecture (Sutton, 1990).

Dyna learns both a model of the environment and a value function simultaneously:

Take action, observe $(s, a, r, s')$ from real environment
Update value function directly from this experience (model-free)
Update the learned model $\hat{P}$ and $\hat{R}$ from this experience
Simulate $k$ additional experiences from the learned model
Update value function from simulated experiences (DP-like)

class DynaQ:
    """
    Dyna-Q: model-free Q-learning augmented with model-based planning.
    After each real step, plan k steps using the learned model.
    """
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9,
                 epsilon=0.1, n_planning=10):
        self.Q = np.zeros((n_states, n_actions))
        self.model = {}         # model[(s,a)] = (r, s')
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_planning = n_planning
        self.visited = []       # list of (s,a) pairs tried

    def act(self, s: int) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.Q.shape[1])
        return np.argmax(self.Q[s])

    def real_update(self, s, a, r, s_next):
        """Direct RL update from real experience."""
        td_target = r + self.gamma * np.max(self.Q[s_next])
        self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])

        # Update model
        self.model[(s, a)] = (r, s_next)
        if (s, a) not in self.visited:
            self.visited.append((s, a))

    def planning_updates(self):
        """Plan k steps using the learned model (DP-like)."""
        for _ in range(self.n_planning):
            if not self.visited:
                break
            # Sample a previously visited state-action pair
            idx = np.random.randint(len(self.visited))
            s, a = self.visited[idx]
            r, s_next = self.model[(s, a)]

            # Same Q-learning update, but using simulated experience
            td_target = r + self.gamma * np.max(self.Q[s_next])
            self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])

Why Dyna is powerful: With $k=50$ planning steps per real step, the agent can achieve the same performance as $k$ real steps' worth of model-free learning, using only 1 real step. If the environment is expensive to interact with (e.g., physical robots), this is a huge advantage.

Common Mistakes

:::danger Forgetting to Normalize the Policy Matrix In policy_evaluation, the policy must be a proper probability distribution: np.sum(policy, axis=1) should equal 1 for every state. If you pass an un-normalized policy (e.g., raw Q-values or logits), the Bellman equation is wrong and evaluation diverges. Always normalize: policy = softmax(logits, axis=1) or build a one-hot from greedy actions. :::

:::danger Using Reward R[s] Instead of R[s,a] Many textbook formulations use $R(s)$ (reward depends only on state). NumPy implementations often slip between $R[s]$ (shape (n_states,)) and $R[s,a]$ (shape (n_states, n_actions)). The Bellman backup requires $R[s,a]$ : Q = R + gamma * (P @ V). If your R has the wrong shape, you get broadcasting bugs that are silent - the code runs but computes the wrong thing. :::

:::warning In-Place vs Synchronous Updates Change Convergence Behavior In-place value iteration (immediately using updated $V[s]$ for subsequent state updates within the same sweep) converges faster than synchronous updates (keeping a separate $V_{new}$ ). Both converge to the same $V^*$ . But in distributed DP or when debugging convergence, make sure you know which variant you're running - comparing convergence curves between in-place and synchronous runs is misleading. :::

:::warning Policy Iteration Can Cycle with Ties When multiple actions achieve the same Q-value, argmax returns the first one by convention. Different runs may break ties differently, leading to different but equally optimal policies. If you see policy iteration not converging (flipping between two equivalent policies), add a small random tiebreaker or check for tie-equal Q-values explicitly. :::

YouTube Resources

Video	Creator	Why Watch
David Silver RL Lecture 3 - Planning by DP	David Silver (DeepMind)	Definitive coverage of policy evaluation, iteration, and value iteration with clear proofs
Reinforcement Learning - Dynamic Programming	Stanford CS234	Clean academic lecture connecting DP to model-free RL
Policy Iteration vs Value Iteration Explained	deeplizard	Intuitive comparison with visual GridWorld examples
Bellman Equation and Dynamic Programming	Mutual Information	Excellent mathematical walkthrough, connects Bellman to modern deep RL
Sutton & Barto Book Club - Chapter 4: DP	Matthew Bain	Goes through the Sutton & Barto textbook chapter on DP with commentary

Interview Q&A

Q1: What is the difference between policy iteration and value iteration?

Policy iteration has two explicit phases per outer iteration: (1) full policy evaluation - run the Bellman expectation equation until convergence to get $V^\pi$ ; (2) greedy policy improvement using $V^\pi$ , constructing $\pi'(s) = \arg\max_a Q^\pi(s,a)$ . These phases alternate until the policy doesn't change. Value iteration merges both: $V_{k+1}(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s')]$ . This is the Bellman optimality backup - one step of evaluation and implicit improvement per state per sweep. Both converge to $V^*$ and $\pi^*$ . Trade-offs: policy iteration often converges in fewer outer iterations (because exact evaluation gives a better improvement step) but each outer iteration is expensive (inner convergence). Value iteration's iterations are cheap (one sweep) but it may need many. For most practical problems, value iteration is preferred. A key insight: policy iteration with $k=1$ inner steps is exactly value iteration. So both are special cases of Generalized Policy Iteration.

Q2: Why does policy improvement always improve or maintain performance? Prove it.

This is the Policy Improvement Theorem. Let $\pi'$ be the greedy policy w.r.t. $V^\pi$ , so $Q^\pi(s, \pi'(s)) = \max_a Q^\pi(s,a) \geq Q^\pi(s, \pi(s)) = V^\pi(s)$ . Starting from this inequality, unroll one step: $V^\pi(s) \leq Q^\pi(s, \pi'(s)) = \mathbb{E}[r_t + \gamma V^\pi(s_{t+1}) | s_t=s, a_t=\pi'(s)]$ . Apply the inequality again to $V^\pi(s_{t+1})$ : $\leq \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 V^\pi(s_{t+2}) | \pi']$ . Continue unrolling to infinity: $\leq \mathbb{E}_{\pi'}[\sum_k \gamma^k r_{t+k} | s_t=s] = V^{\pi'}(s)$ . Therefore $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$ . Strict improvement holds whenever $\pi'(s) \neq \pi(s)$ for some $s$ . When $\pi' = \pi$ everywhere, $V^\pi$ satisfies the Bellman optimality equations - $\pi$ is optimal.

Q3: What is the curse of dimensionality in RL, and how does deep RL address it?

The curse of dimensionality refers to the exponential growth of state (and action) spaces as dimensions increase. In tabular DP, you store one value per state - feasible for small discrete spaces (GridWorld: 16 states), infeasible for Atari ( $\approx 10^{70000}$ states) or robot arms (continuous). Three manifestations: (1) Memory: $P$ matrix has $|S|^2 |A|$ entries - for $|S|=10^6$ , $|A|=1000$ , this is $10^{15}$ entries, far beyond RAM. (2) Computation: each Bellman sweep is $O(|S|^2 |A|)$ - again infeasible. (3) Sample complexity: even with sampling (model-free RL), you need to visit every state enough times for reliable estimates. Deep RL addresses this with function approximation: instead of a lookup table, train a neural network $V_\theta(s)$ or $Q_\theta(s,a)$ that generalizes across similar states. The Bellman update becomes a supervised learning target: $Q_\theta$ minimized against $[r + \gamma \max_{a'} Q_\theta(s',a')]$ . The network generalizes - seeing state $s$ improves estimates for nearby states. This is what DQN exploits.

Q4: What does the contraction mapping theorem guarantee for DP?

The Bellman operator $\mathcal{T}$ (either $\mathcal{T}^\pi$ for evaluation or $\mathcal{T}^*$ for optimality) satisfies $\|\mathcal{T}V - \mathcal{T}V'\|_\infty \leq \gamma \|V - V'\|_\infty$ for any $V, V'$ . This makes it a $\gamma$ -contraction in the $L^\infty$ norm. The Banach fixed-point theorem then guarantees: (1) there exists a unique fixed point $V^*$ (or $V^\pi$ ); (2) iterating $\mathcal{T}$ from any starting $V_0$ converges to this fixed point; (3) the convergence rate is geometric - after $k$ iterations, error $\leq \gamma^k \cdot \|V_0 - V^*\|_\infty$ . Higher $\gamma$ means slower convergence but longer planning horizon. The contraction property is also why TD learning with function approximation can be unstable (the "deadly triad"): when $\mathcal{T}$ is approximated by a neural network with gradient descent, the contraction property is lost, and divergence can occur.

Q5: Where does DP fit in the context of modern LLM training?

DP itself isn't directly used in LLM training - LLMs have astronomically large state spaces (all possible token sequences) and we don't have the transition model. However, the DP framework provides the conceptual foundation. In RLHF: the reward model provides $R$ , and PPO approximates the policy improvement step without knowing $P$ explicitly. PPO's value function update is a model-free approximation of policy evaluation. In MCTS-based systems (AlphaGo, AlphaZero, and recent LLM planning methods): DP is approximated via tree search using neural network value estimates to guide the search and prune the tree. The value network essentially provides $V^\pi(s)$ estimates that would otherwise require expensive rollouts to compute. Understanding DP makes you a better RL practitioner: when PPO's value estimates diverge, you know it's because the Bellman backup is not contracting properly with function approximation - the "deadly triad" of off-policy learning, function approximation, and bootstrapping.

Q6: Explain Generalized Policy Iteration. How does it connect DP to model-free RL?

GPI is the observation that any algorithm that alternates between two processes - (1) making the value function consistent with the current policy (evaluation direction), and (2) making the policy greedy with respect to the current value function (improvement direction) - converges to $V^*$ and $\pi^*$ , regardless of the exact details of each step. Policy iteration and value iteration are extremes: policy iteration runs evaluation to full convergence; value iteration uses a single backup. Q-learning is GPI: the update $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$ is a noisy, sampled approximation of one step of the Bellman optimality backup (evaluation), and the argmax policy is greedy improvement. PPO is GPI: the value network provides evaluation, and the policy gradient update provides improvement. Recognizing that all RL algorithms are GPI variants helps when debugging - if your algorithm isn't converging, ask whether the evaluation step is accurate enough and whether the improvement step is actually improving the policy.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the MDP GridWorld demo on the EngineersOfAI Playground - no code required.

:::

The Real Engineering Moment​

Why This Exists: The Problem with Brute Force​

Historical Context​

The Three DP Algorithms: Overview​

Algorithm 1: Policy Evaluation​

Convergence Proof: Contraction Mapping​

Full Implementation​

Algorithm 2: Policy Iteration​

Policy Improvement: The Greedy Step​

Policy Improvement Theorem: Formal Proof Sketch​

Full Policy Iteration Implementation​

Algorithm 3: Value Iteration​

Full Value Iteration Implementation​

GridWorld: Full DP Solution with Visualization​

Asynchronous DP Variants​

In-Place DP​

Prioritized Sweeping​

Real-Time DP (RTDP)​

Generalized Policy Iteration (GPI)​

Comparison: All Three DP Algorithms​

The Curse of Dimensionality​

Connection to Model-Based RL: Dyna-Q​

Common Mistakes​

YouTube Resources​

Interview Q&A​