Dynamic Programming for RL
Reading time: ~45 min | Interview relevance: High (conceptual foundation) | Target roles: ML Engineer, Research Engineer, AI Engineer
The Real Engineering Moment
Before AlphaGo, before DQN, before RLHF - game-playing AI was built on dynamic programming. Early chess programs evaluated board positions by backward induction: look ahead several moves, compute the value at the leaves, and back-propagate to the root. IBM's Deep Blue, which defeated Garry Kasparov in 1997, searched 200 million positions per second using minimax with alpha-beta pruning - a specialized form of DP. The technique dates to Bellman in the 1950s.
Understanding dynamic programming is not just historically useful. When you see a bug in your PPO training loop where values are collapsing, understanding what the Bellman backup is supposed to do tells you where to look. When you design a reward shaping scheme for a new task, you are essentially solving a modified MDP analytically. When an interviewer asks "why does Q-learning converge?" - the answer lives in the DP contraction mapping proof. The theory lives here.
The mental model to carry through this lesson: DP is exact RL. Model-free methods (Q-learning, SARSA, PPO) are all approximations of what DP would compute if we had the model. Every temporal difference update is a sampled, single-step approximation of the Bellman backup. Once you see this, the zoo of RL algorithms becomes much more coherent.
:::note The Key Limitation Dynamic programming requires full knowledge of the environment model - the transition probabilities and the reward function . In most real problems, you don't have this. Model-free methods exist precisely because DP is often infeasible. But you need to understand what DP computes before understanding why model-free methods approximate it. :::
Why This Exists: The Problem with Brute Force
Before dynamic programming, solving a sequential decision problem required exhaustive search: enumerate all possible sequences of actions, compute the return for each, pick the best. For a problem with actions and time steps, that is candidate sequences - exponential in the horizon.
Bellman's insight (1954): An optimal policy has a recursive structure. Whatever the first action was, the remaining actions must be optimal for the resulting state. This is the principle of optimality:
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
This recursive structure means you can decompose the global optimization into a sequence of local optimizations. Instead of candidates, you only need computations per iteration. The exponential becomes polynomial.
Historical Context
1953: Richard Bellman joins RAND Corporation and begins working on multi-stage decision problems. He needs to find optimal control policies for dynamic systems. Classical calculus methods fail - they require differentiability and can't handle discrete decisions or constraints.
1954: Bellman formalizes the principle of optimality and derives the recursive equations. He deliberately calls it "dynamic programming" - "dynamic" to convey the time-varying nature, "programming" in the mathematical/planning sense (not computer programming). He later admitted the name was chosen partly to obscure the mathematical nature of the work from military funding agencies who distrusted "mathematics."
1957: Bellman publishes "Dynamic Programming" - the foundational textbook. His equations for MDPs appear, though MDPs as a formal model are crystallized by Howard in 1960.
1980s–1990s: Watkins, Sutton, and others connect DP to online, model-free learning. Temporal difference (TD) learning emerges as "DP without the model" - using sampled transitions instead of the full matrix.
2013: DQN shows that the Bellman optimality backup, implemented with neural function approximation and experience replay, achieves superhuman Atari performance. The DP ideas survive; the tabular tables do not.
The Three DP Algorithms: Overview
Dynamic programming gives us three related algorithms, all based on iteratively applying the Bellman equation:
| Algorithm | What it computes | Key update | When to use |
|---|---|---|---|
| Policy Evaluation | for fixed | Bellman expectation | Assess a given policy |
| Policy Iteration | Optimal | Evaluation + greedy improvement | Small-medium problems |
| Value Iteration | directly | Bellman optimality | Faster in practice for large problems |
Algorithm 1: Policy Evaluation
Problem: Given a policy , compute its value function .
The Bellman expectation equation as an iterative update:
Start with and iterate. This is iterative policy evaluation.
Convergence Proof: Contraction Mapping
Why does this converge? The Bellman expectation operator maps one value function to another:
Claim: is a contraction mapping in the (sup-norm) sense.
Proof: Let and be any two value functions. Then:
Taking the sup over : .
Since , each iteration shrinks the error by factor . By the Banach fixed-point theorem, there exists a unique fixed point such that , and iterating from any starting point converges to .
Convergence rate: After iterations, the error is . For and tolerance starting from a zero initialization with maximum reward : you need iterations for . More for larger rewards.
Full Implementation
import numpy as np
from typing import Optional
def policy_evaluation(
policy: np.ndarray, # shape (n_states, n_actions) - π(a|s)
P: np.ndarray, # shape (n_states, n_actions, n_states) - P(s'|s,a)
R: np.ndarray, # shape (n_states, n_actions) - R(s,a)
gamma: float = 0.9,
tol: float = 1e-8,
max_iter: int = 10_000,
in_place: bool = False, # use in-place updates (faster convergence)
) -> tuple[np.ndarray, list[float]]:
"""
Iterative policy evaluation.
Convergence guarantee: By Banach fixed-point theorem, iterating
the Bellman operator T^pi (a gamma-contraction in L-inf norm)
converges to the unique fixed point V^pi.
Returns (V^pi, convergence_deltas)
"""
n_states = P.shape[0]
V = np.zeros(n_states)
deltas = []
for iteration in range(max_iter):
if in_place:
# In-place update: immediately use new V values
# Faster convergence, same fixed point
V_old = V.copy()
for s in range(n_states):
Q_s = R[s] + gamma * P[s] @ V # shape: (n_actions,)
V[s] = policy[s] @ Q_s # scalar
delta = np.max(np.abs(V - V_old))
else:
# Synchronous update: use old V for all updates
Q = R + gamma * (P @ V) # (n_states, n_actions)
V_new = np.einsum('sa,sa->s', policy, Q) # (n_states,)
delta = np.max(np.abs(V_new - V))
V = V_new
deltas.append(delta)
if delta < tol:
print(f"Policy evaluation converged: {iteration + 1} iterations, "
f"delta={delta:.2e}")
break
return V, deltas
def policy_evaluation_matrix(
policy: np.ndarray,
P: np.ndarray,
R: np.ndarray,
gamma: float = 0.9,
) -> np.ndarray:
"""
Direct matrix solution for V^pi (exact, not iterative).
The Bellman expectation equation in matrix form:
V = R_pi + gamma * P_pi * V
=> (I - gamma * P_pi) V = R_pi
=> V = (I - gamma * P_pi)^{-1} R_pi
Where R_pi[s] = sum_a pi(a|s) R(s,a)
and P_pi[s,s'] = sum_a pi(a|s) P(s'|s,a)
O(|S|^3) - only feasible for very small state spaces.
"""
n_states = P.shape[0]
# Compute policy-weighted reward and transition matrix
R_pi = np.einsum('sa,sa->s', policy, R) # (n_states,)
P_pi = np.einsum('sa,sas->ss', policy, # (n_states, n_states)
P.transpose(0, 1, 2))
# Actually easier notation:
# P_pi[s, s'] = sum_a pi[s,a] * P[s,a,s']
P_pi = np.einsum('sa,san->sn', policy, P) # (n_states, n_states)
# Solve linear system
A = np.eye(n_states) - gamma * P_pi
V = np.linalg.solve(A, R_pi)
return V
Algorithm 2: Policy Iteration
Problem: Find the optimal policy .
Idea: Alternately evaluate the current policy (policy evaluation) and improve it greedily (policy improvement). Each improvement step is guaranteed not to make the policy worse.
Policy Improvement: The Greedy Step
Given , we compute the Q-function and take the greedy action:
Policy Improvement Theorem: Formal Proof Sketch
Theorem (Sutton & Barto): If is the greedy policy w.r.t. , then for all .
Proof: By definition of greedy selection:
Now expand by rolling out one more step under :
The inequality is strict if for at least one state. If everywhere, then satisfies the Bellman optimality equation - is already optimal.
Full Policy Iteration Implementation
def policy_iteration(
P: np.ndarray, # (n_states, n_actions, n_states)
R: np.ndarray, # (n_states, n_actions)
gamma: float = 0.9,
eval_tol: float = 1e-8,
max_outer: int = 1000,
verbose: bool = True,
) -> tuple[np.ndarray, np.ndarray, list[int]]:
"""
Policy iteration.
Alternates between:
1. Policy evaluation: compute V^pi exactly (iterative)
2. Policy improvement: π' = greedy(V^pi)
Convergence: guaranteed in finite steps (≤ |A|^|S| policies).
In practice: converges in very few iterations (< 50 even for large problems).
Returns (optimal_policy, V*, evaluation_iterations_per_round)
"""
n_states, n_actions, _ = P.shape
eval_counts = []
# Initialize: uniform random policy
policy = np.ones((n_states, n_actions)) / n_actions
for outer_iter in range(max_outer):
# ── Step 1: Policy Evaluation ────────────────────────────────────────
V, deltas = policy_evaluation(policy, P, R, gamma, tol=eval_tol)
eval_counts.append(len(deltas))
# ── Step 2: Policy Improvement (greedy) ─────────────────────────────
# Q[s,a] = R[s,a] + gamma * sum_{s'} P[s,a,s'] * V[s']
Q = R + gamma * (P @ V) # (n_states, n_actions)
best_actions = np.argmax(Q, axis=1) # (n_states,)
# Build new deterministic policy
new_policy = np.zeros((n_states, n_actions))
new_policy[np.arange(n_states), best_actions] = 1.0
# ── Convergence Check ────────────────────────────────────────────────
if np.array_equal(new_policy, policy):
if verbose:
print(f"Policy iteration converged after {outer_iter + 1} "
f"outer iterations")
return best_actions, V, eval_counts
policy = new_policy
return np.argmax(policy, axis=1), V, eval_counts
Convergence bound: There are at most deterministic policies. Since each iteration strictly improves or maintains (and is bounded), policy iteration must converge in at most steps. In practice, it converges in far fewer - typically iterations.
Algorithm 3: Value Iteration
Problem: Find directly, without explicit policy evaluation steps.
Key insight: Instead of waiting for policy evaluation to converge before improving, apply a single Bellman optimality backup at each state:
This is the Bellman optimality operator applied iteratively. It also forms a contraction with factor , converging to .
Why this works: The Bellman optimality operator:
is also a -contraction in . Same proof as for , but with instead of . The operator doesn't affect the contraction because .
Full Value Iteration Implementation
def value_iteration(
P: np.ndarray, # (n_states, n_actions, n_states)
R: np.ndarray, # (n_states, n_actions)
gamma: float = 0.9,
tol: float = 1e-8,
max_iter: int = 100_000,
verbose: bool = True,
) -> tuple[np.ndarray, np.ndarray, list[float]]:
"""
Value iteration.
Applies the Bellman optimality operator T* iteratively:
V_{k+1}(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) * V_k(s')]
Convergence: T* is a gamma-contraction in L-inf norm.
After k iterations: ||V_k - V*||_inf <= gamma^k / (1-gamma) * ||V_1 - V_0||_inf
Returns (optimal_policy, V*, convergence_deltas)
"""
n_states, n_actions, _ = P.shape
V = np.zeros(n_states)
deltas = []
for iteration in range(max_iter):
# Q[s,a] = R[s,a] + gamma * sum_{s'} P[s,a,s'] * V[s']
Q = R + gamma * (P @ V) # (n_states, n_actions)
# V_new[s] = max_a Q[s,a]
V_new = np.max(Q, axis=1) # (n_states,)
delta = np.max(np.abs(V_new - V))
deltas.append(delta)
V = V_new
if delta < tol:
if verbose:
print(f"Value iteration converged: {iteration + 1} iterations, "
f"delta={delta:.2e}")
break
# Extract greedy optimal policy
Q_final = R + gamma * (P @ V)
optimal_policy = np.argmax(Q_final, axis=1)
return optimal_policy, V, deltas
Error bound at convergence: If the iteration terminates because , then the true error satisfies:
For and : the true error is bounded by .
GridWorld: Full DP Solution with Visualization
Let's implement and solve a classic 4×4 GridWorld to see DP in action:
┌───┬───┬───┬───┐
│ S │ │ │ │ S = Start (state 0)
├───┼───┼───┼───┤
│ │ X │ │ │ X = Wall (state 5, impassable)
├───┼───┼───┼───┤
│ │ │ │ │
├───┼───┼───┼───┤
│ │ │ │ G │ G = Goal (state 15, reward +10)
└───┴───┴───┴───┘
import numpy as np
class GridWorld:
"""
4x4 GridWorld with a wall and a goal state.
Actions: 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT
Reward: +10 for reaching goal, -0.1 per step (time penalty).
Transitions: deterministic (no stochasticity).
"""
ACTION_NAMES = {0: '↑', 1: '→', 2: '↓', 3: '←'}
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
def __init__(self, size: int = 4, gamma: float = 0.9):
self.size = size
self.gamma = gamma
self.n_states = size * size
self.n_actions = 4
self.goal = self.n_states - 1 # bottom-right: state 15
self.walls = {5} # position (1,1)
self.P, self.R = self._build_mdp()
def _idx(self, r: int, c: int) -> int:
return r * self.size + c
def _rc(self, s: int) -> tuple[int, int]:
return s // self.size, s % self.size
def _build_mdp(self):
n, na = self.n_states, self.n_actions
P = np.zeros((n, na, n))
R = np.full((n, na), -0.1) # step cost
for s in range(n):
if s == self.goal:
P[s, :, s] = 1.0 # absorbing terminal
R[s, :] = 0.0
continue
if s in self.walls:
P[s, :, s] = 1.0 # walls absorb too
R[s, :] = -1.0 # penalty for entering wall (shouldn't happen)
continue
r, c = self._rc(s)
for a, (dr, dc) in self.MOVES.items():
nr, nc = r + dr, c + dc
# Clamp to grid boundaries
if not (0 <= nr < self.size and 0 <= nc < self.size):
ns = s # bounce off boundary
else:
ns = self._idx(nr, nc)
if ns in self.walls:
ns = s # bounce off wall
P[s, a, ns] += 1.0
if ns == self.goal:
R[s, a] += 10.0
return P, R
def display_values(self, V: np.ndarray, title: str = "Value Function"):
print(f"\n{title}:")
border = "+" + "--------+" * self.size
print(border)
for r in range(self.size):
row = "|"
for c in range(self.size):
s = self._idx(r, c)
if s == self.goal:
row += " GOAL |"
elif s in self.walls:
row += " WALL |"
else:
row += f" {V[s]:6.3f} |"
print(row)
print(border)
def display_policy(self, policy: np.ndarray, title: str = "Optimal Policy"):
print(f"\n{title}:")
border = "+" + "------+" * self.size
print(border)
for r in range(self.size):
row = "|"
for c in range(self.size):
s = self._idx(r, c)
if s == self.goal:
row += " G |"
elif s in self.walls:
row += " X |"
else:
row += f" {self.ACTION_NAMES[policy[s]]} |"
print(row)
print(border)
def compare_algorithms(gamma: float = 0.9):
"""Run all three DP algorithms and compare results."""
grid = GridWorld(size=4, gamma=gamma)
P, R = grid.P, grid.R
# ── Value Iteration ──────────────────────────────────────────────────────
print("=" * 60)
print("VALUE ITERATION")
pi_vi, V_vi, deltas_vi = value_iteration(P, R, gamma=gamma, verbose=True)
grid.display_values(V_vi, "V* (Value Iteration)")
grid.display_policy(pi_vi, "π* (Value Iteration)")
print(f"Iterations: {len(deltas_vi)}")
# ── Policy Iteration ─────────────────────────────────────────────────────
print("\n" + "=" * 60)
print("POLICY ITERATION")
pi_poli, V_poli, eval_counts = policy_iteration(P, R, gamma=gamma, verbose=True)
grid.display_values(V_poli, "V* (Policy Iteration)")
grid.display_policy(pi_poli, "π* (Policy Iteration)")
print(f"Outer iterations: {len(eval_counts)}")
print(f"Total inner iterations: {sum(eval_counts)}")
# ── Verify Equivalence ──────────────────────────────────────────────────
print("\n" + "=" * 60)
print("COMPARISON:")
print(f"Max V difference: {np.max(np.abs(V_vi - V_poli)):.2e}")
print(f"Policy agreement: {np.mean(pi_vi == pi_poli) * 100:.1f}%")
if __name__ == "__main__":
compare_algorithms(gamma=0.9)
Expected output summary:
Value Function (4×4 GridWorld, γ=0.9):
+--------+--------+--------+--------+
| 3.352 | 4.437 | 5.735 | 7.186 |
+--------+--------+--------+--------+
| 2.643 | WALL | 7.186 | 8.651 |
+--------+--------+--------+--------+
| 3.352 | 4.437 | 7.186 | 10.000 |
+--------+--------+--------+--------+
| 2.643 | 3.352 | 7.186 | GOAL |
+--------+--------+--------+--------+
States adjacent to the goal have high values. The wall forces a detour that reduces values on the left side. The optimal policy navigates around the wall.
Asynchronous DP Variants
The standard DP algorithms update all states simultaneously (synchronous). Asynchronous variants update states in a different order, which can be more efficient.
In-Place DP
Instead of computing a completely new each iteration, immediately use updated values:
def value_iteration_inplace(P, R, gamma=0.9, tol=1e-8):
"""
In-place value iteration: immediately use updated V[s] for subsequent states.
Converges in fewer iterations (though each iteration is same cost).
The update order within each sweep can affect speed significantly.
"""
n_states = P.shape[0]
V = np.zeros(n_states)
for iteration in range(100_000):
delta = 0.0
for s in range(n_states):
Q_s = R[s] + gamma * P[s] @ V # (n_actions,)
v_new = np.max(Q_s)
delta = max(delta, abs(v_new - V[s]))
V[s] = v_new # immediate update
if delta < tol:
print(f"In-place VI converged: {iteration + 1} iterations")
break
return np.argmax(R + gamma * (P @ V), axis=1), V
Prioritized Sweeping
Update states in order of how much their value estimate changed - prioritize states where the Bellman error (current estimate vs backed-up value) is largest. This focuses computation where it's most needed.
import heapq
def prioritized_sweeping(P, R, gamma=0.9, tol=1e-8, theta=1e-5):
"""
Prioritized sweeping: update states in order of Bellman error magnitude.
Much more efficient when only a few states have large errors.
"""
n_states, n_actions, _ = P.shape
V = np.zeros(n_states)
# Priority queue: (negative_error, state_index)
pq = []
for s in range(n_states):
heapq.heappush(pq, (-1.0, s)) # initialize all with high priority
total_updates = 0
while pq:
neg_priority, s = heapq.heappop(pq)
if -neg_priority < tol:
break
# Update state s
Q_s = R[s] + gamma * P[s] @ V
V_new_s = np.max(Q_s)
V[s] = V_new_s
total_updates += 1
# Re-prioritize predecessors of s
for pred in range(n_states):
for a in range(n_actions):
if P[pred, a, s] > 0:
Q_pred = R[pred] + gamma * P[pred] @ V
error = abs(np.max(Q_pred) - V[pred])
if error > theta:
heapq.heappush(pq, (-error, pred))
print(f"Prioritized sweeping: {total_updates} updates (vs {n_states}×iters for standard VI)")
return np.argmax(R + gamma * (P @ V), axis=1), V
Real-Time DP (RTDP)
Only update states that the agent actually visits during policy execution. Converges to optimal for reachable states. Useful when only a small fraction of states are ever visited.
Generalized Policy Iteration (GPI)
The concept that unifies all RL algorithms - not just the three DP methods:
:::note Generalized Policy Iteration Any interleaving of policy evaluation (making the value function consistent with the current policy) and policy improvement (making the policy greedy with respect to the current value function) converges to and . :::
The two processes compete: making accurate for destroys the optimality of w.r.t. ; making greedy for makes inaccurate again. But they both push toward the same equilibrium: and .
GPI explains every RL algorithm:
| Algorithm | Evaluation step | Improvement step |
|---|---|---|
| Policy Iteration | Full convergence of | Greedy |
| Value Iteration | One Bellman optimality backup | Implicit (max in backup) |
| Q-Learning | TD(0) update: | Greedy: use |
| SARSA | TD(0) update with on-policy target | On-policy (softer) improvement |
| PPO | Value function regression | Policy gradient update |
Comparison: All Three DP Algorithms
| Policy Evaluation | Policy Iteration | Value Iteration | |
|---|---|---|---|
| Computes | for fixed | ||
| Convergence guarantee | Exact (within of ) | Exact (finite iterations) | Approximate (within of ) |
| Iterations needed | Few outer (many inner) | ||
| Cost per iteration | Same | Same | |
| Memory | for | for + | for |
| When faster | Never standalone | Small problems (<1K states) | Large problems (policy eval costly) |
| Requires model | Yes | Yes | Yes |
| Practical use | Sub-routine in Policy Iter | Robotics, small MDPs | Games, planning problems |
The Curse of Dimensionality
DP is exact and elegant. In practice, it breaks down in three ways:
1. Model required: You need . For most interesting problems - Atari, robot manipulation, RLHF - you don't have this. You can only sample from the environment.
2. State space explosion: The matrix has size . For Atari (84×84×4 frames), there are approximately distinct states. You cannot store a table. For an LLM with vocabulary 50,000 and context length 2048, state space is - utterly intractable.
3. Continuous state/action spaces: Joint angles, velocities, temperatures - continuous variables mean infinite state spaces. Even with discretization, the curse of dimensionality means the table grows exponentially in the number of dimensions.
GridWorld (16 states): → Tabular DP works fine
Atari (10^67970 states): → Function approximation required (DQN)
Robotics (continuous): → Function approx + policy gradient (PPO, SAC)
LLMs (token sequences): → Function approx + policy gradient (RLHF)
These limitations motivate everything that follows:
Connection to Model-Based RL: Dyna-Q
An important bridge between model-based DP and model-free learning: the Dyna architecture (Sutton, 1990).
Dyna learns both a model of the environment and a value function simultaneously:
- Take action, observe from real environment
- Update value function directly from this experience (model-free)
- Update the learned model and from this experience
- Simulate additional experiences from the learned model
- Update value function from simulated experiences (DP-like)
class DynaQ:
"""
Dyna-Q: model-free Q-learning augmented with model-based planning.
After each real step, plan k steps using the learned model.
"""
def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9,
epsilon=0.1, n_planning=10):
self.Q = np.zeros((n_states, n_actions))
self.model = {} # model[(s,a)] = (r, s')
self.alpha = alpha
self.gamma = gamma
self.epsilon = epsilon
self.n_planning = n_planning
self.visited = [] # list of (s,a) pairs tried
def act(self, s: int) -> int:
if np.random.random() < self.epsilon:
return np.random.randint(self.Q.shape[1])
return np.argmax(self.Q[s])
def real_update(self, s, a, r, s_next):
"""Direct RL update from real experience."""
td_target = r + self.gamma * np.max(self.Q[s_next])
self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])
# Update model
self.model[(s, a)] = (r, s_next)
if (s, a) not in self.visited:
self.visited.append((s, a))
def planning_updates(self):
"""Plan k steps using the learned model (DP-like)."""
for _ in range(self.n_planning):
if not self.visited:
break
# Sample a previously visited state-action pair
idx = np.random.randint(len(self.visited))
s, a = self.visited[idx]
r, s_next = self.model[(s, a)]
# Same Q-learning update, but using simulated experience
td_target = r + self.gamma * np.max(self.Q[s_next])
self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])
Why Dyna is powerful: With planning steps per real step, the agent can achieve the same performance as real steps' worth of model-free learning, using only 1 real step. If the environment is expensive to interact with (e.g., physical robots), this is a huge advantage.
Common Mistakes
:::danger Forgetting to Normalize the Policy Matrix
In policy_evaluation, the policy must be a proper probability distribution: np.sum(policy, axis=1) should equal 1 for every state. If you pass an un-normalized policy (e.g., raw Q-values or logits), the Bellman equation is wrong and evaluation diverges. Always normalize: policy = softmax(logits, axis=1) or build a one-hot from greedy actions.
:::
:::danger Using Reward R[s] Instead of R[s,a]
Many textbook formulations use (reward depends only on state). NumPy implementations often slip between (shape (n_states,)) and (shape (n_states, n_actions)). The Bellman backup requires : Q = R + gamma * (P @ V). If your R has the wrong shape, you get broadcasting bugs that are silent - the code runs but computes the wrong thing.
:::
:::warning In-Place vs Synchronous Updates Change Convergence Behavior In-place value iteration (immediately using updated for subsequent state updates within the same sweep) converges faster than synchronous updates (keeping a separate ). Both converge to the same . But in distributed DP or when debugging convergence, make sure you know which variant you're running - comparing convergence curves between in-place and synchronous runs is misleading. :::
:::warning Policy Iteration Can Cycle with Ties
When multiple actions achieve the same Q-value, argmax returns the first one by convention. Different runs may break ties differently, leading to different but equally optimal policies. If you see policy iteration not converging (flipping between two equivalent policies), add a small random tiebreaker or check for tie-equal Q-values explicitly.
:::
YouTube Resources
| Video | Creator | Why Watch |
|---|---|---|
| David Silver RL Lecture 3 - Planning by DP | David Silver (DeepMind) | Definitive coverage of policy evaluation, iteration, and value iteration with clear proofs |
| Reinforcement Learning - Dynamic Programming | Stanford CS234 | Clean academic lecture connecting DP to model-free RL |
| Policy Iteration vs Value Iteration Explained | deeplizard | Intuitive comparison with visual GridWorld examples |
| Bellman Equation and Dynamic Programming | Mutual Information | Excellent mathematical walkthrough, connects Bellman to modern deep RL |
| Sutton & Barto Book Club - Chapter 4: DP | Matthew Bain | Goes through the Sutton & Barto textbook chapter on DP with commentary |
Interview Q&A
Q1: What is the difference between policy iteration and value iteration?
Policy iteration has two explicit phases per outer iteration: (1) full policy evaluation - run the Bellman expectation equation until convergence to get ; (2) greedy policy improvement using , constructing . These phases alternate until the policy doesn't change. Value iteration merges both: . This is the Bellman optimality backup - one step of evaluation and implicit improvement per state per sweep. Both converge to and . Trade-offs: policy iteration often converges in fewer outer iterations (because exact evaluation gives a better improvement step) but each outer iteration is expensive (inner convergence). Value iteration's iterations are cheap (one sweep) but it may need many. For most practical problems, value iteration is preferred. A key insight: policy iteration with inner steps is exactly value iteration. So both are special cases of Generalized Policy Iteration.
Q2: Why does policy improvement always improve or maintain performance? Prove it.
This is the Policy Improvement Theorem. Let be the greedy policy w.r.t. , so . Starting from this inequality, unroll one step: . Apply the inequality again to : . Continue unrolling to infinity: . Therefore for all . Strict improvement holds whenever for some . When everywhere, satisfies the Bellman optimality equations - is optimal.
Q3: What is the curse of dimensionality in RL, and how does deep RL address it?
The curse of dimensionality refers to the exponential growth of state (and action) spaces as dimensions increase. In tabular DP, you store one value per state - feasible for small discrete spaces (GridWorld: 16 states), infeasible for Atari ( states) or robot arms (continuous). Three manifestations: (1) Memory: matrix has entries - for , , this is entries, far beyond RAM. (2) Computation: each Bellman sweep is - again infeasible. (3) Sample complexity: even with sampling (model-free RL), you need to visit every state enough times for reliable estimates. Deep RL addresses this with function approximation: instead of a lookup table, train a neural network or that generalizes across similar states. The Bellman update becomes a supervised learning target: minimized against . The network generalizes - seeing state improves estimates for nearby states. This is what DQN exploits.
Q4: What does the contraction mapping theorem guarantee for DP?
The Bellman operator (either for evaluation or for optimality) satisfies for any . This makes it a -contraction in the norm. The Banach fixed-point theorem then guarantees: (1) there exists a unique fixed point (or ); (2) iterating from any starting converges to this fixed point; (3) the convergence rate is geometric - after iterations, error . Higher means slower convergence but longer planning horizon. The contraction property is also why TD learning with function approximation can be unstable (the "deadly triad"): when is approximated by a neural network with gradient descent, the contraction property is lost, and divergence can occur.
Q5: Where does DP fit in the context of modern LLM training?
DP itself isn't directly used in LLM training - LLMs have astronomically large state spaces (all possible token sequences) and we don't have the transition model. However, the DP framework provides the conceptual foundation. In RLHF: the reward model provides , and PPO approximates the policy improvement step without knowing explicitly. PPO's value function update is a model-free approximation of policy evaluation. In MCTS-based systems (AlphaGo, AlphaZero, and recent LLM planning methods): DP is approximated via tree search using neural network value estimates to guide the search and prune the tree. The value network essentially provides estimates that would otherwise require expensive rollouts to compute. Understanding DP makes you a better RL practitioner: when PPO's value estimates diverge, you know it's because the Bellman backup is not contracting properly with function approximation - the "deadly triad" of off-policy learning, function approximation, and bootstrapping.
Q6: Explain Generalized Policy Iteration. How does it connect DP to model-free RL?
GPI is the observation that any algorithm that alternates between two processes - (1) making the value function consistent with the current policy (evaluation direction), and (2) making the policy greedy with respect to the current value function (improvement direction) - converges to and , regardless of the exact details of each step. Policy iteration and value iteration are extremes: policy iteration runs evaluation to full convergence; value iteration uses a single backup. Q-learning is GPI: the update is a noisy, sampled approximation of one step of the Bellman optimality backup (evaluation), and the argmax policy is greedy improvement. PPO is GPI: the value network provides evaluation, and the policy gradient update provides improvement. Recognizing that all RL algorithms are GPI variants helps when debugging - if your algorithm isn't converging, ask whether the evaluation step is accurate enough and whether the improvement step is actually improving the policy.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the MDP GridWorld demo on the EngineersOfAI Playground - no code required.
:::
