Skip to main content

MDP and the RL Framework

import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';

Reading time: ~45 min | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Research Engineer


The Real Engineering Moment

It's 2019 at OpenAI. The team training GPT-2 is discussing what comes next. Raw language model pre-training produces impressive text completion - but the model cheerfully writes misinformation, follows harmful instructions, and ignores the user's actual intent. The problem is not the model's capability. It's the objective. Predicting the next token is not the same as being helpful.

The question they grapple with: how do you teach a model to optimize for something as fuzzy as "what the user actually wants"? You can't label every output as good or bad - there are too many possible responses. You can't write down a mathematical reward function that captures helpfulness precisely. But you can get humans to compare pairs of outputs and say which one they prefer.

The answer turns out to be the same mathematical framework that DeepMind used to teach an agent to play Go, Atari games, and robot locomotion: the Markov Decision Process. In 2022, this framework powered the training of ChatGPT through RLHF. In 2023 and beyond, it underpins every alignment technique applied to frontier models - including Claude. Understanding this framework is not optional for AI engineers: it is the prerequisite for understanding every modern LLM alignment technique.

But the MDP framework is older and broader than LLMs. Howard formalized it in 1960 for operations research. Bellman derived the optimality equations in the 1950s. The same math governs a robot deciding how to step on uneven terrain, a trading algorithm deciding when to buy, and a recommendation system deciding what content to surface next. One framework, thousands of applications.

This lesson builds the formal MDP foundation from scratch. By the end, you will be able to formulate any sequential decision problem as an MDP, write the Bellman equations, implement policy evaluation in NumPy, and understand exactly how this connects to modern deep RL and RLHF.


Why This Exists: The Sequential Decision Problem

Before MDPs, decision-making problems were often modeled in one of two unsatisfactory ways.

Static optimization: Find the best action given the current state, ignoring the future. A thermostat that only looks at current temperature, not the trajectory. A recommendation system that maximizes immediate click probability, ignoring what users actually want after clicking. These approaches are myopic - they optimize the present at the cost of the future.

Game tree search (minimax): Works for two-player zero-sum games with perfect information and a finite game tree. But most real problems are stochastic (the environment is random), have enormous state spaces (chess is manageable; robot locomotion is not), and have a single agent rather than two opponents.

The MDP framework provides a mathematically rigorous way to model any sequential decision problem under uncertainty. It handles stochasticity (random environment transitions), long-term consequences (the agent cares about total future reward, not just immediate), and partial information (POMDPs extend the framework). Most importantly, it comes with provably correct algorithms for finding optimal solutions.


Historical Context

1950s: Richard Bellman, working at RAND Corporation, develops the principle of optimality and the recursive equations now bearing his name. He coins the term "dynamic programming" - deliberately vague to avoid military scrutiny of mathematical research. His insight: an optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

1960: Ronald Howard publishes "Dynamic Programming and Markov Processes," formally defining the MDP as a decision-making framework and proving that optimal policies exist and can be computed.

1988: Chris Watkins invents Q-learning in his PhD thesis, extending DP to the model-free setting where the transition function is unknown.

1992: Ronald Williams derives the REINFORCE policy gradient algorithm, showing how to optimize policies without needing the model at all.

2013: DeepMind combines Q-learning with deep neural networks to create DQN, achieving superhuman performance on Atari games. The state space is no longer a table - it's raw pixels.

2017–2022: OpenAI, Anthropic, DeepMind apply RL to language models via RLHF, directly using the MDP framework with language tokens as actions and human preferences as rewards.

The theoretical foundation hasn't changed since Bellman. What changed is the function approximation (neural networks) and the scale (billions of parameters).


The RL Setup: Five Components

Reinforcement learning formalizes the interaction between a decision-making agent and an environment. The agent observes the world, takes actions, and receives feedback.

┌─────────────────────────────────────────────────────────────┐
│ │
│ ┌────────┐ action a_t ┌─────────────┐ │
│ │ │ ─────────────────▶│ │ │
│ │ Agent │ │ Environment │ │
│ │ │ ◀─────────────────│ │ │
│ └────────┘ state s_{t+1} └─────────────┘ │
│ reward r_{t+1} │
│ │
└─────────────────────────────────────────────────────────────┘

The five components of any RL problem:

1. State (S): What the agent observes about the world. In chess: board position. In robotics: joint angles and velocities. In LLM training: the current conversation so far.

2. Action (A): What the agent can do. In chess: legal moves. In robotics: motor torques. In LLM training: the next token to generate.

3. Reward (R): The scalar feedback signal the agent receives. In chess: +1 for win, -1 for loss, 0 otherwise. In LLM training: a score from a reward model.

4. Policy (π): The agent's decision-making function - given state, what action to take.

5. Environment dynamics (P): How the world transitions from one state to the next given an action.

The agent's goal: find a policy that maximizes cumulative reward over time.


Formal MDP Definition

A Markov Decision Process is a tuple M=(S,A,P,R,γ)\mathcal{M} = (S, A, P, R, \gamma):

State Space SS

The set of all possible states the environment can be in. Can be:

  • Discrete and finite: Chess positions, GridWorld cells, inventory levels
  • Continuous: Robot joint angles Rn\in \mathbb{R}^n, velocity, temperature
  • High-dimensional: Raw pixel frames in Atari (84×84×4 = 28,224 dimensions)
  • Structured: Graphs, sequences (token histories in LLMs)

The Markov property requires: the state sts_t must contain all information needed to predict the future. If it doesn't, you have a partially observable MDP (POMDP), discussed later.

Action Space AA

The set of actions available to the agent. Can be:

  • Discrete: A={up,down,left,right}A = \{up, down, left, right\}, A={buy,sell,hold}A = \{buy, sell, hold\}, A={0,1,,V1}A = \{0, 1, \ldots, V-1\} for token vocabulary
  • Continuous: A=[1,1]24A = [-1, 1]^{24} for robot joint torques
  • Parameterized: Discrete action type with continuous parameters (move location, force magnitude)

In continuous action spaces, you cannot enumerate actions - this is why Q-learning fails and policy gradient methods are needed.

Transition Function P(ss,a)P(s' | s, a)

The probability of transitioning to state ss' after taking action aa in state ss:

P:S×A×S[0,1]P: S \times A \times S \rightarrow [0, 1]

sSP(ss,a)=1sS,aA\sum_{s' \in S} P(s' | s, a) = 1 \quad \forall s \in S, a \in A

This is the model of the environment. It captures everything the environment can do in response to an action.

Deterministic environment: P(ss,a)=1P(s'|s,a) = 1 for exactly one ss'. Chess (ignoring illegal moves) is deterministic. So is a simple GridWorld where walls always block movement.

Stochastic environment: Multiple ss' have non-zero probability. A robot slipping on a surface. A game with random wind. In most real-world deployments, environments are stochastic.

Reward Function R(s,a)R(s, a)

The expected scalar reward received after taking action aa in state ss:

R:S×ARR: S \times A \rightarrow \mathbb{R}

Some formulations use R(s,a,s)R(s, a, s') (reward that depends on the next state too) or R(s)R(s) (reward depends only on state). For MDPs, these are equivalent in the sense that the same algorithms apply.

Reward design is the hardest part of applied RL. The reward function must:

  • Be well-specified (captures what you actually want)
  • Be computable from available information
  • Not be gameable (the agent shouldn't find a loophole that scores high but doesn't solve the task)

Reward hacking - the agent optimizing a proxy reward that diverges from the true objective - is the central challenge in RLHF. You want "helpfulness" but reward model scores are a proxy. The agent finds ways to get high proxy reward that don't correspond to genuine helpfulness.

Discount Factor γ[0,1)\gamma \in [0, 1)

Controls how much the agent values future rewards relative to immediate rewards. The total return from time step tt is:

Gt=rt+γrt+1+γ2rt+2+=k=0γkrt+kG_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}

This is a geometric series. For bounded rewards rtrmax|r_t| \leq r_{max} and γ<1\gamma < 1:

Gtk=0γkrmax=rmax1γG_t \leq \sum_{k=0}^{\infty} \gamma^k r_{max} = \frac{r_{max}}{1 - \gamma}

The sum is always finite - a critical mathematical property that guarantees value functions are well-defined.


The Discount Factor γ: Deep Dive

The discount factor has three distinct interpretations:

1. Time preference (economic): A dollar today is worth more than a dollar tomorrow. γ\gamma captures this: reward now is worth 1, reward next step is worth γ\gamma, reward kk steps later is worth γk\gamma^k.

2. Uncertainty about the future: The further into the future, the less certain the prediction. Discounting reflects that distant futures are speculative.

3. Probability of termination: If at each step the episode could end with probability (1γ)(1-\gamma), the expected return under this termination model equals the discounted return. This interpretation makes γ\gamma concrete: γ=0.99\gamma = 0.99 means the episode ends with 1% probability each step.

What γ\gamma controls in practice:

γ = 0.0 → Agent only cares about immediate reward. Greedy.
γ = 0.5 → Reward halves every step. Step 10 reward worth 0.1%
γ = 0.9 → Balances near and far. Common for episodic tasks.
γ = 0.99 → Far-sighted. Reward 100 steps away still worth 37%
γ = 1.0 → Equal weight forever. Only safe for episodic, finite MDPs.

Geometric series interpretation. With γ=0.99\gamma = 0.99, reward at step kk contributes 0.99k0.99^k to the total return. The "effective horizon" - the number of steps within which 95% of the total return accrues - is approximately 11γ\frac{1}{1-\gamma}. At γ=0.9\gamma=0.9, horizon ≈ 10 steps. At γ=0.99\gamma=0.99, horizon ≈ 100 steps.

:::note RLHF Setting For LLM RLHF: γ\gamma is often set to 1.0 because episodes (full responses) are short (hundreds of tokens), finite, and we care equally about all tokens. For robotics or long-horizon planning: γ=0.99\gamma = 0.99 is common. :::


The Markov Property

A critical assumption that makes RL mathematically tractable:

:::note The Markov Property The future is independent of the past, given the present.

P(st+1st,at,st1,at1,,s0,a0)=P(st+1st,at)P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} | s_t, a_t) :::

In plain English: to decide what to do next, you only need the current state - not the entire history of how you got here. The current state contains all relevant information.

When is this reasonable?

  • Board games: the current board captures everything about the game state
  • CartPole: current position, velocity, angle, and angular velocity are sufficient
  • Fully observable robot: joint angles and velocities define the configuration

When does it fail?

  • Partially observable environments: the agent only sees part of the true state
  • History-dependent dynamics: past actions influence current state in ways not captured in observations
  • Non-stationary environments: the rules change over time

Finite vs Infinite Horizon MDPs

Episodic (finite horizon) MDPs: Episodes have a defined end - a terminal state or a maximum number of steps. The return is Gt=k=0Ttγkrt+kG_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k} where TT is the episode length. Examples: games (chess ends with checkmate), conversations in RLHF, navigation tasks (robot reaches goal).

Continuing (infinite horizon) MDPs: No natural episode boundary. The agent runs indefinitely. Requires γ<1\gamma < 1 to keep returns finite. Examples: trading strategies, inventory management, a robot that must operate continuously.

Key difference for algorithms: Episodic tasks allow Monte Carlo returns (wait until episode end, compute exact return). Continuing tasks require bootstrapping - using value function estimates to compute partial returns.


Policy: What the Agent Does

A policy π\pi maps states to actions. There are two types:

Deterministic Policy

π:SAa=π(s)\pi: S \rightarrow A \qquad a = \pi(s)

Given state ss, always take the same action aa. Simple and sufficient for fully observable, non-adversarial environments. The optimal policy in any fully observable MDP is always deterministic.

Stochastic Policy

π:SΔ(A)aπ(s)\pi: S \rightarrow \Delta(A) \qquad a \sim \pi(\cdot | s)

Given state ss, sample an action from a probability distribution over actions. Crucial for:

  • Exploration: Stochastic policies naturally explore different actions
  • Partially observable environments: When the true state is hidden, a stochastic policy over observations can be optimal
  • Adversarial settings: Mixed strategies prevent exploitation (Rock-Paper-Scissors has no deterministic optimal policy)
  • Policy gradients: Gradient methods require differentiable stochastic policies

Tabular vs Function Approximation

Tabular policies store an explicit mapping for every state: a lookup table of size S|S|. Feasible only for small, discrete state spaces (GridWorld with 16 states). Exact and provably correct when it applies.

Function approximation represents the policy as a parameterized function πθ(as)\pi_\theta(a|s) - typically a neural network. Generalizes across states. Required for continuous or high-dimensional state spaces.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
"""Stochastic policy for discrete action spaces."""
def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, action_dim),
)

def forward(self, state: torch.Tensor) -> torch.distributions.Distribution:
logits = self.net(state)
return torch.distributions.Categorical(logits=logits)

def act(self, state: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
dist = self.forward(state)
action = dist.sample()
log_prob = dist.log_prob(action)
return action, log_prob


class GaussianPolicyNetwork(nn.Module):
"""Stochastic policy for continuous action spaces."""
def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
super().__init__()
self.backbone = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
)
self.mean_head = nn.Linear(hidden_dim, action_dim)
self.log_std_head = nn.Linear(hidden_dim, action_dim)

def forward(self, state: torch.Tensor) -> torch.distributions.Distribution:
features = self.backbone(state)
mean = self.mean_head(features)
# Clamp log_std for numerical stability
log_std = self.log_std_head(features).clamp(-20, 2)
std = log_std.exp()
return torch.distributions.Normal(mean, std)

Value Functions

Value functions answer: "How good is it to be in a particular state (or take a particular action)?" They are the mechanism by which an agent plans ahead.

State Value Function Vπ(s)V^\pi(s)

Vπ(s)=Eπ[Gtst=s]=Eπ[k=0γkrt+kst=s]V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid s_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s\right]

The expected total return starting from state ss and following policy π\pi forever after.

Intuition: Vπ(s)V^\pi(s) tells you how good a state is, assuming you keep following your current policy. A state where you're about to win a chess game has high VV; a state where you're in checkmate has zero VV (the game is already lost).

Action-Value Function Qπ(s,a)Q^\pi(s, a)

Qπ(s,a)=Eπ[Gtst=s,at=a]Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid s_t = s, a_t = a\right]

The expected total return starting from state ss, taking action aa (which may not follow π\pi), and then following policy π\pi thereafter.

Intuition: Qπ(s,a)Q^\pi(s, a) tells you how good it is to take a specific action from a specific state, before returning to the policy. It is more useful than VV for action selection: the optimal action is a=argmaxaQ(s,a)a^* = \arg\max_a Q^*(s, a) without needing to know the transition function.

Relationship Between V and Q

Vπ(s)=aπ(as)Qπ(s,a)=Eaπ[Qπ(s,a)]V^\pi(s) = \sum_a \pi(a|s) \, Q^\pi(s, a) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]

The value of a state is the expected Q-value over all actions weighted by the policy. For a deterministic policy π(s)=a\pi(s) = a^*: Vπ(s)=Qπ(s,a)V^\pi(s) = Q^\pi(s, a^*).


The Bellman Equations

The Bellman equations express the recursive structure of value functions. They are the most important equations in RL.

Bellman Expectation Equation for V

Expanding the definition of VπV^\pi using the one-step dynamics:

Vπ(s)=aπ(as)[R(s,a)+γsP(ss,a)Vπ(s)]\boxed{V^\pi(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^\pi(s') \right]}

Read this as: the value of a state equals (for each possible action weighted by the policy) the immediate reward plus the discounted expected value of the next state.

Step-by-step derivation:

Vπ(s)=Eπ[Gtst=s]V^\pi(s) = \mathbb{E}_\pi[G_t | s_t = s]

=Eπ[rt+γGt+1st=s]= \mathbb{E}_\pi[r_t + \gamma G_{t+1} | s_t = s]

=aπ(as)sP(ss,a)[R(s,a)+γEπ[Gt+1st+1=s]]= \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a) + \gamma \mathbb{E}_\pi[G_{t+1} | s_{t+1} = s'] \right]

=aπ(as)sP(ss,a)[R(s,a)+γVπ(s)]= \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a) + \gamma V^\pi(s') \right]

The third line uses the Markov property - the future expected return from ss' depends only on ss', not on the history.

Bellman Expectation Equation for Q

Qπ(s,a)=sP(ss,a)[R(s,a)+γaπ(as)Qπ(s,a)]\boxed{Q^\pi(s, a) = \sum_{s'} P(s'|s,a) \left[ R(s,a) + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s', a') \right]}

The Q-value of (s,a)(s, a) equals the expected immediate reward plus the discounted expected Q-value in the next state under the policy.

Bellman Optimality Equations

For the optimal policy π\pi^* and its value functions VV^* and QQ^*:

V(s)=maxa[R(s,a)+γsP(ss,a)V(s)]\boxed{V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^*(s') \right]}

Q(s,a)=R(s,a)+γsP(ss,a)maxaQ(s,a)\boxed{Q^*(s, a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s', a')}

These equations are fixed-point conditions - VV^* and QQ^* are the unique functions satisfying them simultaneously for all states.

Once you have QQ^*, the optimal policy is trivially recovered:

π(s)=argmaxaQ(s,a)\pi^*(s) = \arg\max_a Q^*(s, a)

This is why most RL algorithms focus on learning Q-functions - the optimal policy follows directly without needing to know the transition model.


The Bellman Backup: Visual Intuition

The Bellman equation defines a backup operation: to update the value of a state, look one step ahead at its successors. This is the tree diagram:

V(s) ← value of current state
|
──────┼──────
│ │
π(a₁|s) π(a₂|s) ← action choices weighted by policy
│ │
───┼─── ───┼───
│ │ │ │
s'₁ s'₂ s'₃ s'₄ ← stochastic next states
P(s'|s,a) P(s'|s,a)
│ │ │ │
R+γV R+γV R+γV R+γV ← reward + discounted next value

Key insight: You can bootstrap - use existing (possibly imperfect) value estimates for successor states to improve the value estimate for the current state. This is the core of all temporal difference methods: you don't need to wait for the end of the episode.


MDP Visualization as a Directed Graph

Each edge is annotated with action, transition probability, and reward. The agent must learn to navigate this graph to maximize total discounted reward.


GridWorld: Hand-Computed Value Function

Let's compute value functions by hand for a simple 3×3 grid. This builds intuition before code.

┌───┬───┬───┐
│ S │ │ │ S = Start (state 0)
├───┼───┼───┤
│ │ X │ │ X = Wall (impassable, state 4)
├───┼───┼───┤
│ │ │ G │ G = Goal (state 8, reward +1)
└───┴───┴───┘

State indices: 0 1 2
3 4 5
6 7 8

Policy: Always go right if possible, else go down.

Assumptions: Deterministic transitions, γ=0.9\gamma = 0.9, terminal reward R(78)=1R(7 \to 8) = 1, step cost 0.1-0.1 elsewhere.

Working backwards from the goal:

  • V(8)=0V(8) = 0 (terminal)
  • V(7)=R(7,right)+γV(8)=0.9×0+1=1.0V(7) = R(7, right) + \gamma V(8) = 0.9 \times 0 + 1 = 1.0 - one step from goal
  • V(6)=R(6,right)+γV(7)=0.1+0.9×1.0=0.8V(6) = R(6, right) + \gamma V(7) = -0.1 + 0.9 \times 1.0 = 0.8
  • V(5)=R(5,down)+γV(8)=0.1+0.9×0=0.1V(5) = R(5, down) + \gamma V(8) = -0.1 + 0.9 \times 0 = -0.1 - wait, 5 goes to 8 directly? With our policy (right): state 5 is at row 1 col 2, going right exits the grid so we go down → state 8. V(5)=0.1+0.9×0=0.1V(5) = -0.1 + 0.9 \times 0 = -0.1... that seems wrong.

Adjusting: state 5 right = state 8 (terminal), reward = 1. V(5)=R(5,right)+γV(8)=0.9+0.9×0=0.9V(5) = R(5,right) + \gamma V(8) = 0.9 + 0.9 \times 0 = 0.9.

The key insight: states closer to the goal have higher value, and the discount factor creates a gradient that guides the agent toward reward.


OpenAI Gym: Environment Interaction Loop

The standard way to interact with any RL environment in Python:

import gymnasium as gym
import numpy as np

def run_episode(env_name: str = "CartPole-v1", render: bool = False) -> float:
"""
Run one episode of a Gym environment with a random policy.
Returns total reward.
"""
env = gym.make(env_name, render_mode="human" if render else None)

# Reset - get initial observation
obs, info = env.reset(seed=42)
total_reward = 0.0
done = False
step = 0

while not done:
# Random action from the action space
action = env.action_space.sample()

# Step: take action, get next observation
obs, reward, terminated, truncated, info = env.step(action)
total_reward += reward
done = terminated or truncated
step += 1

print(f"Step {step:3d} | Obs: {obs} | Action: {action} | Reward: {reward:.2f}")

env.close()
print(f"\nEpisode ended after {step} steps | Total reward: {total_reward:.2f}")
return total_reward


def inspect_environment(env_name: str = "CartPole-v1"):
"""Print key properties of a Gym environment."""
env = gym.make(env_name)

print(f"Environment: {env_name}")
print(f"Observation space: {env.observation_space}")
print(f" Type: {type(env.observation_space).__name__}")
if hasattr(env.observation_space, 'shape'):
print(f" Shape: {env.observation_space.shape}")
if hasattr(env.observation_space, 'low'):
print(f" Low: {env.observation_space.low}")
print(f" High: {env.observation_space.high}")

print(f"\nAction space: {env.action_space}")
print(f" Type: {type(env.action_space).__name__}")
if hasattr(env.action_space, 'n'):
print(f" N actions: {env.action_space.n}")

env.close()

# Common Gym environments and what they model
ENVIRONMENTS = {
"CartPole-v1": {
"state": "4D: [cart_pos, cart_vel, pole_angle, pole_vel]",
"actions": "Discrete(2): push left or right",
"reward": "+1 for every step pole stays upright",
"episode_end": "pole > 12 degrees OR cart > 2.4 units",
},
"LunarLander-v2": {
"state": "8D: position, velocity, angle, angular vel, leg contacts",
"actions": "Discrete(4): nothing, left engine, main engine, right engine",
"reward": "landing pad +100 to +140, crash -100, each leg contact +10",
"episode_end": "lander crashes or lands successfully",
},
"HalfCheetah-v4": {
"state": "17D: joint positions and velocities",
"actions": "Continuous 6D: torques for 6 joints",
"reward": "velocity forward - control cost",
"episode_end": "1000 steps",
},
}

NumPy Policy Evaluation: Full Implementation

The core algorithm for computing VπV^\pi given a policy π\pi and the full MDP model:

import numpy as np
from typing import Optional


def policy_evaluation(
policy: np.ndarray, # shape (n_states, n_actions) - stochastic policy
P: np.ndarray, # shape (n_states, n_actions, n_states) - transitions
R: np.ndarray, # shape (n_states, n_actions) - rewards
gamma: float = 0.9,
tol: float = 1e-8,
max_iter: int = 10_000,
verbose: bool = True,
) -> np.ndarray:
"""
Iterative policy evaluation.
Applies the Bellman expectation equation repeatedly until convergence.

Convergence is guaranteed by the Banach fixed-point theorem:
The Bellman operator T^pi is a contraction with factor gamma.
Therefore, repeated application converges to the unique fixed point V^pi.

Time complexity per iteration: O(|S|^2 * |A|)
Memory: O(|S| * |A| * |S|) for P, O(|S|) for V
"""
n_states = P.shape[0]
V = np.zeros(n_states)
deltas = []

for iteration in range(max_iter):
# Bellman expectation backup (vectorized):
# Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] * V[s']
Q = R + gamma * (P @ V) # shape: (n_states, n_actions)

# V_new[s] = sum_a pi[s,a] * Q[s,a]
V_new = np.einsum('sa,sa->s', policy, Q) # shape: (n_states,)

# Check convergence: max absolute change
delta = np.max(np.abs(V_new - V))
deltas.append(delta)
V = V_new

if delta < tol:
if verbose:
print(f"Converged in {iteration + 1} iterations (delta={delta:.2e})")
break
else:
if verbose:
print(f"Did not converge in {max_iter} iterations (final delta={delta:.2e})")

return V


def compute_q_from_v(
V: np.ndarray,
P: np.ndarray,
R: np.ndarray,
gamma: float = 0.9,
) -> np.ndarray:
"""
Given V^pi, compute Q^pi using one Bellman backup.
Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] * V[s']
"""
return R + gamma * (P @ V)


# ── Demo: 5-state chain MDP ──────────────────────────────────────────────────

def build_chain_mdp(n_states: int = 5, gamma: float = 0.9):
"""
Build a chain MDP: states 0..n_states-1.
Actions: 0=left, 1=right.
Goal: reach state n_states-1 (terminal, reward +10).
Step cost: -0.1 per step.
Stochastic: 80% intended direction, 20% stay.
"""
n_actions = 2
P = np.zeros((n_states, n_actions, n_states))
R = np.full((n_states, n_actions), -0.1)

for s in range(n_states - 1):
# Action 0 (left): 80% go to s-1, 20% stay
if s > 0:
P[s, 0, s - 1] = 0.8
P[s, 0, s] = 0.2
else:
P[s, 0, s] = 1.0 # can't go left of start

# Action 1 (right): 80% go to s+1, 20% stay
P[s, 1, s + 1] = 0.8
P[s, 1, s] = 0.2

# Terminal state: absorbing
P[n_states - 1, :, n_states - 1] = 1.0
R[n_states - 1, :] = 0.0 # no reward from terminal

# Reward for reaching terminal state
for s in range(n_states - 1):
R[s, 1] += P[s, 1, n_states - 1] * 10.0 # partial reward for probability of reaching goal

return P, R


if __name__ == "__main__":
P, R = build_chain_mdp(n_states=5, gamma=0.9)

# Always-right policy
always_right = np.zeros((5, 2))
always_right[:, 1] = 1.0 # probability 1 for action "right"

V = policy_evaluation(always_right, P, R, gamma=0.9, verbose=True)
Q = compute_q_from_v(V, P, R)

print("\nValue Function V^pi (always right):")
for s, v in enumerate(V):
print(f" V({s}) = {v:7.4f}")

print("\nQ-values Q^pi(s, a):")
for s in range(5):
print(f" Q({s}, left) = {Q[s,0]:7.4f} | Q({s}, right) = {Q[s,1]:7.4f}")

Advantage Function A(s,a)A(s, a)

A derived quantity that becomes critical in policy gradient methods and actor-critic architectures:

Aπ(s,a)=Qπ(s,a)Vπ(s)A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

The advantage function measures how much better action aa is compared to the average action taken by the policy in state ss.

  • A(s,a)>0A(s, a) > 0: action aa is better than the policy's average - reinforce it
  • A(s,a)<0A(s, a) < 0: action aa is worse than the policy's average - discourage it
  • A(s,a)=0A(s, a) = 0: action aa is exactly as good as average

Key property: Eaπ[Aπ(s,a)]=0\mathbb{E}_{a \sim \pi}[A^\pi(s, a)] = 0 for all ss. The advantage is zero-mean by construction.

Why this matters for training: policy gradient methods work better when trained on advantages rather than raw Q-values. Advantages have lower variance because the baseline V(s)V(s) removes the part of the return that doesn't depend on the specific action chosen.


The Exploration-Exploitation Dilemma

A fundamental tension in every RL problem:

Exploitation: Take the action that currently seems best according to your value estimates. Get the reward you know you can get.

Exploration: Try less-certain actions to gather information that might reveal better strategies. Sacrifice some immediate reward for potential future gain.

Why this is hard: You don't know which unexplored actions are good until you try them. But trying them wastes time that could be spent exploiting the best known action. The optimal solution depends on the true environment, which you're trying to learn.

The ϵ\epsilon-Greedy Strategy

The simplest exploration strategy: with probability ϵ\epsilon, take a random action; with probability 1ϵ1-\epsilon, take the greedy action.

def epsilon_greedy_action(
Q: np.ndarray, # Q[a] = Q-value for each action
epsilon: float = 0.1,
) -> int:
"""
Epsilon-greedy action selection.
Balances exploration (random) and exploitation (greedy).
"""
if np.random.random() < epsilon:
return np.random.randint(len(Q)) # explore: random action
else:
return np.argmax(Q) # exploit: best known action

Formal treatment. The optimal exploration strategy for a bandit problem (one-state MDP) is given by the Upper Confidence Bound (UCB) algorithm:

at=argmaxa[Q(a)+clntNt(a)]a_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]

where Nt(a)N_t(a) is the number of times action aa has been taken and cc is an exploration constant. UCB balances exploitation (high Q(a)Q(a)) with exploration (high uncertainty, i.e., low Nt(a)N_t(a)).

For the full MDP setting, exploration is much harder - the optimal exploration strategy depends on the entire MDP structure. In practice, ϵ\epsilon-greedy with annealing (reducing ϵ\epsilon over training) is the most common approach.


Partial Observability: POMDPs

When the Markov property fails - the agent cannot observe the full state - we work with Partially Observable MDPs (POMDPs).

Formal definition: A POMDP is a tuple (S,A,P,R,Ω,O,γ)(S, A, P, R, \Omega, O, \gamma) where:

  • S,A,P,R,γS, A, P, R, \gamma: same as MDP
  • Ω\Omega: set of possible observations
  • O(os,a)O(o | s', a): probability of observation oo given that the agent took action aa and landed in state ss'

The agent cannot observe ss directly - only the noisy observation oo. To plan optimally, the agent must maintain a belief state b(s)b(s) - a probability distribution over true states - and update it using Bayes' rule:

b(s)=ηO(os,a)sP(ss,a)b(s)b'(s') = \eta \cdot O(o | s', a) \sum_s P(s' | s, a) \, b(s)

where η\eta is a normalizing constant.

Why POMDPs are harder: The belief state is a probability distribution - a continuous, infinite-dimensional object. The optimal solution for a POMDP requires planning in belief space, which is computationally intractable in general.

Practical solutions:

  • Recurrent neural networks: Maintain a hidden state hth_t that captures history. The hidden state acts as a learned approximate belief state.
  • Transformers: The attention mechanism over all past observations functions as a belief state in disguise.
  • Frame stacking: In Atari, stacking the last 4 frames approximately satisfies the Markov property for velocity information.

:::note RLHF and Partial Observability In LLM RLHF, the "state" is the full conversation including the model's internal representations. But the reward model only sees the text. This is partial observability - the reward model must infer quality from incomplete information. :::


Real-World MDP Formulations

Understanding how to map real problems to MDPs is a core engineering skill.

Recommendation System MDP

State s: User's full watch/read/purchase history + current session context
Action a: Which item to recommend next (from a catalog of millions)
Reward r: User engagement - click (+1), watch full video (+5), like (+2), skip (-0.5)
Transition P: New user state after interaction with recommended item
γ: 0.99 - we care about long-term engagement, not just next click
Challenge: Action space is enormous (millions of items); reward hacking (clickbait)

Algorithmic Trading MDP

State s: Price history, order book, portfolio positions, macro indicators
Action a: {buy N shares, sell N shares, hold} for each asset
Reward r: Mark-to-market PnL minus transaction costs
Transition P: Market dynamics (partially observable, non-stationary)
γ: 0.999 - daily trading, care about long-run Sharpe ratio
Challenge: Non-stationarity (market regime changes), partial observability

Robot Locomotion MDP

State s: Joint angles, angular velocities, foot contact sensors, IMU data
Action a: Torques for each joint (continuous, high-dimensional)
Reward r: Forward velocity - energy consumption - fall penalty
Transition P: Physics simulation (MuJoCo, IsaacGym)
γ: 0.99 - robot must plan several gait cycles ahead
Challenge: Sim-to-real gap, continuous action space, contact discontinuities

When to Use RL vs Supervised Learning

This is a critical design decision. RL is not always the right tool.

Use Supervised Learning when:

  • You have labeled examples of correct behavior
  • Actions are independent (no sequential dependency)
  • The mapping input → output is well-defined

Use Reinforcement Learning when:

  • Correct behavior is hard to specify but easy to evaluate (chess: hard to say what move is right, easy to say who won)
  • Decisions are sequential and actions have long-term consequences
  • You need the agent to discover novel strategies not in training data
  • You can sample from the environment (real or simulated) cheaply

Common Mistakes

:::danger Confusing V and Q in Code V(s)V(s) has shape (n_states,). Q(s,a)Q(s,a) has shape (n_states, n_actions). A common bug: computing Q = P @ V gives shape (n_states, n_actions) - this is Q[s,a]=sP[s,a,s]V[s]Q[s,a] = \sum_{s'} P[s,a,s'] V[s'], which is the Bellman backup without the reward. Always add R: Q = R + gamma * (P @ V). :::

:::danger Setting γ = 1 for Continuing Tasks If your task has no terminal state and γ=1\gamma = 1, value functions are infinite (or undefined). This causes NaN in value function computation and unstable training. Always use γ<1\gamma < 1 for continuing tasks, or ensure your episodes terminate. :::

:::warning Reward Scale Matters More Than You Think If rewards are too large (e.g., millions of dollars in a trading system), gradients explode. If too small (e.g., 1e-6 per step), learning is slow and value functions are numerically indistinguishable. Normalize rewards to roughly [1,1][-1, 1] or at least ensure rt10|r_t| \leq 10. :::

:::warning The Sparse Reward Problem If rewards are only given at the very end of long episodes (e.g., win/lose in chess), the agent gets almost no learning signal. Solutions: reward shaping (add intermediate rewards), curriculum learning (start with easier tasks), hierarchical RL (decompose into subgoals), or imitation learning to bootstrap. :::


Three Algorithm Families

The RL learning problem - given only interaction with the environment, find π\pi^* - is solved by three broad families of algorithms:


YouTube Resources

VideoCreatorWhy Watch
Lecture 1: Introduction to Reinforcement LearningDavid Silver (DeepMind)The gold standard intro - covers MDP framework and Bellman equations rigorously
Reinforcement Learning Course - Full Lecture 1Pieter Abbeel (Berkeley)Excellent coverage of MDP formalism, great for mathematical intuition
RL Tutorial - Lecture 2: MDPs, Value FunctionsYannis KalogeropoulosApproachable walkthrough of value functions and Bellman with code
Reinforcement Learning - Introduction to MDPsMutual InformationClean visual explanation of the MDP framework and discount factor
Deep RL Bootcamp Lecture 1 - Motivation + OverviewPieter Abbeel & Rocky DuanMotivates RL from applications - great for seeing why it matters

Interview Q&A

Q1: Explain the Bellman equation and why it is the foundation of all RL algorithms.

The Bellman equation expresses a recursive consistency condition: the value of any state equals the expected immediate reward plus the discounted value of the next state. For the expectation equation: Vπ(s)=aπ(as)sP(ss,a)[R(s,a)+γVπ(s)]V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^\pi(s')]. Its importance is threefold. First, it gives a self-consistency requirement - any correct value function must satisfy this for all states simultaneously. Second, it enables bootstrapping: you can update the value estimate for one state using value estimates of neighboring states, without waiting for full episode rollouts. Third, it defines the Bellman operator T\mathcal{T}, which is a contraction mapping - repeated application converges to the unique fixed point VπV^\pi by the Banach theorem. All temporal difference algorithms - Q-learning, SARSA, TD(λ), DQN, PPO's value update - are variants of this update applied online with function approximation.

Q2: What is the difference between V(s)V(s) and Q(s,a)Q(s,a)? When would you use each?

Vπ(s)V^\pi(s) is the expected return starting from state ss and following π\pi forever. Qπ(s,a)Q^\pi(s,a) is the expected return starting from ss, taking action aa first, then following π\pi. QQ is strictly more informative: you can derive VV from QQ (as V(s)=aπ(as)Q(s,a)V(s) = \sum_a \pi(a|s) Q(s,a)), and you can derive the optimal action directly as argmaxaQ(s,a)\arg\max_a Q^*(s,a) without knowing the transition model. In practice: Q-learning and DQN learn QQ because the optimal policy is recoverable greedily. Actor-critic methods learn VV for the critic because it's cheaper (no action dimension) and use QVQ - V as the advantage estimate for policy updates. You'd choose VV when you have a separate policy you're improving; QQ when you want the value function to directly determine the policy.

Q3: Why does the Markov property matter? What happens when it's violated?

The Markov property - that the future depends only on the current state, not the history - allows compact value functions V(s)V(s) and Q(s,a)Q(s,a) that depend only on the current state. Without it, you'd need V(h)V(h) where hh is the entire history, making the problem exponentially harder (the space of all histories is astronomically large). When the Markov property is violated (partial observability): naive RL with state-only policies can be suboptimal or even fail to converge. The agent may not have enough information to determine the right action from the current observation. Solutions include: (1) using recurrent networks (LSTM, Transformer) that compress history into a hidden state, effectively acting on a learned belief state; (2) frame stacking - including the last kk observations as a single input, approximately satisfying the Markov property for velocity information; (3) explicitly maintaining and updating a belief state using Bayes' rule (formal POMDP solution, computationally expensive).

Q4: How does the discount factor γ affect agent behavior and algorithm convergence?

γ\gamma controls the effective planning horizon. Small γ\gamma (0.5): agent is myopic, optimizes immediate reward, may sacrifice long-term outcomes. Large γ\gamma (0.99): agent is far-sighted, will forgo immediate reward for better future outcomes. Two mathematical effects: (1) Convergence speed - the Bellman operator's contraction rate is γ\gamma. After kk iterations of value iteration, error is γk\gamma^k times initial error. Higher γ\gamma means slower convergence. (2) Credit assignment difficulty - with large γ\gamma, a reward 100 steps in the future still contributes significantly to the present value. Propagating this information backwards through many steps requires many training iterations. In practice: γ=0.99\gamma = 0.99 is standard for most deep RL; γ=1.0\gamma = 1.0 is safe for episodic tasks with guaranteed termination; γ=0.9\gamma = 0.9 gives faster learning but a myopic policy.

Q5: What is the advantage function and why is it used in practice?

The advantage function Aπ(s,a)=Qπ(s,a)Vπ(s)A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s) measures how much better action aa is relative to the average action from state ss under the current policy. It has two key properties: (1) Zero-mean - Eaπ[A(s,a)]=0\mathbb{E}_{a \sim \pi}[A(s,a)] = 0 by construction, since Eaπ[Q(s,a)]=V(s)\mathbb{E}_{a \sim \pi}[Q(s,a)] = V(s). (2) Lower variance - using AA instead of raw QQ or GtG_t in policy gradient updates removes the baseline V(s)V(s) which captures the part of the return independent of the action. This variance reduction speeds training substantially. Practically: policy gradient updates use θlogπθ(as)A(s,a)\nabla_\theta \log \pi_\theta(a|s) \cdot A(s,a). When A>0A > 0 (action was better than average), the action probability increases. When A<0A < 0 (worse than average), it decreases. This is much more stable than updating on raw returns, which can push all action probabilities in the same direction when overall returns are high.

Q6: How would you formulate an LLM chat assistant as an MDP?

State SS: the conversation history (all tokens generated so far) - in practice, the hidden states of the transformer. Action AA: the next token to generate from a vocabulary of size ~50,000. Transition PP: deterministic from the LLM's perspective - given a state and action (token choice), the new state is the old state concatenated with that token. Reward RR: typically sparse - a reward model score r(x,y)r(x, y) for the full response yy given prompt xx, received only at the end of generation. Episode: one full turn of the conversation (prompt → complete response). Discount γ\gamma: often 1.0 for RLHF since episodes are short. The main challenges: (1) sparse reward (only at episode end) makes credit assignment hard; (2) enormous action space (50K tokens per step); (3) reward hacking (the model learns to game the reward model rather than genuinely improving); (4) the KL-from-reference penalty added in PPO-RLHF is needed to prevent the policy from drifting too far from the pretrained model.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the MDP GridWorld demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.