MDP and the RL Framework
import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';
Reading time: ~45 min | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Research Engineer
The Real Engineering Moment
It's 2019 at OpenAI. The team training GPT-2 is discussing what comes next. Raw language model pre-training produces impressive text completion - but the model cheerfully writes misinformation, follows harmful instructions, and ignores the user's actual intent. The problem is not the model's capability. It's the objective. Predicting the next token is not the same as being helpful.
The question they grapple with: how do you teach a model to optimize for something as fuzzy as "what the user actually wants"? You can't label every output as good or bad - there are too many possible responses. You can't write down a mathematical reward function that captures helpfulness precisely. But you can get humans to compare pairs of outputs and say which one they prefer.
The answer turns out to be the same mathematical framework that DeepMind used to teach an agent to play Go, Atari games, and robot locomotion: the Markov Decision Process. In 2022, this framework powered the training of ChatGPT through RLHF. In 2023 and beyond, it underpins every alignment technique applied to frontier models - including Claude. Understanding this framework is not optional for AI engineers: it is the prerequisite for understanding every modern LLM alignment technique.
But the MDP framework is older and broader than LLMs. Howard formalized it in 1960 for operations research. Bellman derived the optimality equations in the 1950s. The same math governs a robot deciding how to step on uneven terrain, a trading algorithm deciding when to buy, and a recommendation system deciding what content to surface next. One framework, thousands of applications.
This lesson builds the formal MDP foundation from scratch. By the end, you will be able to formulate any sequential decision problem as an MDP, write the Bellman equations, implement policy evaluation in NumPy, and understand exactly how this connects to modern deep RL and RLHF.
Why This Exists: The Sequential Decision Problem
Before MDPs, decision-making problems were often modeled in one of two unsatisfactory ways.
Static optimization: Find the best action given the current state, ignoring the future. A thermostat that only looks at current temperature, not the trajectory. A recommendation system that maximizes immediate click probability, ignoring what users actually want after clicking. These approaches are myopic - they optimize the present at the cost of the future.
Game tree search (minimax): Works for two-player zero-sum games with perfect information and a finite game tree. But most real problems are stochastic (the environment is random), have enormous state spaces (chess is manageable; robot locomotion is not), and have a single agent rather than two opponents.
The MDP framework provides a mathematically rigorous way to model any sequential decision problem under uncertainty. It handles stochasticity (random environment transitions), long-term consequences (the agent cares about total future reward, not just immediate), and partial information (POMDPs extend the framework). Most importantly, it comes with provably correct algorithms for finding optimal solutions.
Historical Context
1950s: Richard Bellman, working at RAND Corporation, develops the principle of optimality and the recursive equations now bearing his name. He coins the term "dynamic programming" - deliberately vague to avoid military scrutiny of mathematical research. His insight: an optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
1960: Ronald Howard publishes "Dynamic Programming and Markov Processes," formally defining the MDP as a decision-making framework and proving that optimal policies exist and can be computed.
1988: Chris Watkins invents Q-learning in his PhD thesis, extending DP to the model-free setting where the transition function is unknown.
1992: Ronald Williams derives the REINFORCE policy gradient algorithm, showing how to optimize policies without needing the model at all.
2013: DeepMind combines Q-learning with deep neural networks to create DQN, achieving superhuman performance on Atari games. The state space is no longer a table - it's raw pixels.
2017–2022: OpenAI, Anthropic, DeepMind apply RL to language models via RLHF, directly using the MDP framework with language tokens as actions and human preferences as rewards.
The theoretical foundation hasn't changed since Bellman. What changed is the function approximation (neural networks) and the scale (billions of parameters).
The RL Setup: Five Components
Reinforcement learning formalizes the interaction between a decision-making agent and an environment. The agent observes the world, takes actions, and receives feedback.
┌─────────────────────────────────────────────────────────────┐
│ │
│ ┌────────┐ action a_t ┌─────────────┐ │
│ │ │ ─────────────────▶│ │ │
│ │ Agent │ │ Environment │ │
│ │ │ ◀─────────────────│ │ │
│ └────────┘ state s_{t+1} └─────────────┘ │
│ reward r_{t+1} │
│ │
└─────────────────────────────────────────────────────────────┘
The five components of any RL problem:
1. State (S): What the agent observes about the world. In chess: board position. In robotics: joint angles and velocities. In LLM training: the current conversation so far.
2. Action (A): What the agent can do. In chess: legal moves. In robotics: motor torques. In LLM training: the next token to generate.
3. Reward (R): The scalar feedback signal the agent receives. In chess: +1 for win, -1 for loss, 0 otherwise. In LLM training: a score from a reward model.
4. Policy (π): The agent's decision-making function - given state, what action to take.
5. Environment dynamics (P): How the world transitions from one state to the next given an action.
The agent's goal: find a policy that maximizes cumulative reward over time.
Formal MDP Definition
A Markov Decision Process is a tuple :
State Space
The set of all possible states the environment can be in. Can be:
- Discrete and finite: Chess positions, GridWorld cells, inventory levels
- Continuous: Robot joint angles , velocity, temperature
- High-dimensional: Raw pixel frames in Atari (84×84×4 = 28,224 dimensions)
- Structured: Graphs, sequences (token histories in LLMs)
The Markov property requires: the state must contain all information needed to predict the future. If it doesn't, you have a partially observable MDP (POMDP), discussed later.
Action Space
The set of actions available to the agent. Can be:
- Discrete: , , for token vocabulary
- Continuous: for robot joint torques
- Parameterized: Discrete action type with continuous parameters (move location, force magnitude)
In continuous action spaces, you cannot enumerate actions - this is why Q-learning fails and policy gradient methods are needed.
Transition Function
The probability of transitioning to state after taking action in state :
This is the model of the environment. It captures everything the environment can do in response to an action.
Deterministic environment: for exactly one . Chess (ignoring illegal moves) is deterministic. So is a simple GridWorld where walls always block movement.
Stochastic environment: Multiple have non-zero probability. A robot slipping on a surface. A game with random wind. In most real-world deployments, environments are stochastic.
Reward Function
The expected scalar reward received after taking action in state :
Some formulations use (reward that depends on the next state too) or (reward depends only on state). For MDPs, these are equivalent in the sense that the same algorithms apply.
Reward design is the hardest part of applied RL. The reward function must:
- Be well-specified (captures what you actually want)
- Be computable from available information
- Not be gameable (the agent shouldn't find a loophole that scores high but doesn't solve the task)
Reward hacking - the agent optimizing a proxy reward that diverges from the true objective - is the central challenge in RLHF. You want "helpfulness" but reward model scores are a proxy. The agent finds ways to get high proxy reward that don't correspond to genuine helpfulness.
Discount Factor
Controls how much the agent values future rewards relative to immediate rewards. The total return from time step is:
This is a geometric series. For bounded rewards and :
The sum is always finite - a critical mathematical property that guarantees value functions are well-defined.
The Discount Factor γ: Deep Dive
The discount factor has three distinct interpretations:
1. Time preference (economic): A dollar today is worth more than a dollar tomorrow. captures this: reward now is worth 1, reward next step is worth , reward steps later is worth .
2. Uncertainty about the future: The further into the future, the less certain the prediction. Discounting reflects that distant futures are speculative.
3. Probability of termination: If at each step the episode could end with probability , the expected return under this termination model equals the discounted return. This interpretation makes concrete: means the episode ends with 1% probability each step.
What controls in practice:
γ = 0.0 → Agent only cares about immediate reward. Greedy.
γ = 0.5 → Reward halves every step. Step 10 reward worth 0.1%
γ = 0.9 → Balances near and far. Common for episodic tasks.
γ = 0.99 → Far-sighted. Reward 100 steps away still worth 37%
γ = 1.0 → Equal weight forever. Only safe for episodic, finite MDPs.
Geometric series interpretation. With , reward at step contributes to the total return. The "effective horizon" - the number of steps within which 95% of the total return accrues - is approximately . At , horizon ≈ 10 steps. At , horizon ≈ 100 steps.
:::note RLHF Setting For LLM RLHF: is often set to 1.0 because episodes (full responses) are short (hundreds of tokens), finite, and we care equally about all tokens. For robotics or long-horizon planning: is common. :::
The Markov Property
A critical assumption that makes RL mathematically tractable:
:::note The Markov Property The future is independent of the past, given the present.
:::
In plain English: to decide what to do next, you only need the current state - not the entire history of how you got here. The current state contains all relevant information.
When is this reasonable?
- Board games: the current board captures everything about the game state
- CartPole: current position, velocity, angle, and angular velocity are sufficient
- Fully observable robot: joint angles and velocities define the configuration
When does it fail?
- Partially observable environments: the agent only sees part of the true state
- History-dependent dynamics: past actions influence current state in ways not captured in observations
- Non-stationary environments: the rules change over time
Finite vs Infinite Horizon MDPs
Episodic (finite horizon) MDPs: Episodes have a defined end - a terminal state or a maximum number of steps. The return is where is the episode length. Examples: games (chess ends with checkmate), conversations in RLHF, navigation tasks (robot reaches goal).
Continuing (infinite horizon) MDPs: No natural episode boundary. The agent runs indefinitely. Requires to keep returns finite. Examples: trading strategies, inventory management, a robot that must operate continuously.
Key difference for algorithms: Episodic tasks allow Monte Carlo returns (wait until episode end, compute exact return). Continuing tasks require bootstrapping - using value function estimates to compute partial returns.
Policy: What the Agent Does
A policy maps states to actions. There are two types:
Deterministic Policy
Given state , always take the same action . Simple and sufficient for fully observable, non-adversarial environments. The optimal policy in any fully observable MDP is always deterministic.
Stochastic Policy
Given state , sample an action from a probability distribution over actions. Crucial for:
- Exploration: Stochastic policies naturally explore different actions
- Partially observable environments: When the true state is hidden, a stochastic policy over observations can be optimal
- Adversarial settings: Mixed strategies prevent exploitation (Rock-Paper-Scissors has no deterministic optimal policy)
- Policy gradients: Gradient methods require differentiable stochastic policies
Tabular vs Function Approximation
Tabular policies store an explicit mapping for every state: a lookup table of size . Feasible only for small, discrete state spaces (GridWorld with 16 states). Exact and provably correct when it applies.
Function approximation represents the policy as a parameterized function - typically a neural network. Generalizes across states. Required for continuous or high-dimensional state spaces.
import torch
import torch.nn as nn
class PolicyNetwork(nn.Module):
"""Stochastic policy for discrete action spaces."""
def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, action_dim),
)
def forward(self, state: torch.Tensor) -> torch.distributions.Distribution:
logits = self.net(state)
return torch.distributions.Categorical(logits=logits)
def act(self, state: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
dist = self.forward(state)
action = dist.sample()
log_prob = dist.log_prob(action)
return action, log_prob
class GaussianPolicyNetwork(nn.Module):
"""Stochastic policy for continuous action spaces."""
def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
super().__init__()
self.backbone = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
)
self.mean_head = nn.Linear(hidden_dim, action_dim)
self.log_std_head = nn.Linear(hidden_dim, action_dim)
def forward(self, state: torch.Tensor) -> torch.distributions.Distribution:
features = self.backbone(state)
mean = self.mean_head(features)
# Clamp log_std for numerical stability
log_std = self.log_std_head(features).clamp(-20, 2)
std = log_std.exp()
return torch.distributions.Normal(mean, std)
Value Functions
Value functions answer: "How good is it to be in a particular state (or take a particular action)?" They are the mechanism by which an agent plans ahead.
State Value Function
The expected total return starting from state and following policy forever after.
Intuition: tells you how good a state is, assuming you keep following your current policy. A state where you're about to win a chess game has high ; a state where you're in checkmate has zero (the game is already lost).
Action-Value Function
The expected total return starting from state , taking action (which may not follow ), and then following policy thereafter.
Intuition: tells you how good it is to take a specific action from a specific state, before returning to the policy. It is more useful than for action selection: the optimal action is without needing to know the transition function.
Relationship Between V and Q
The value of a state is the expected Q-value over all actions weighted by the policy. For a deterministic policy : .
The Bellman Equations
The Bellman equations express the recursive structure of value functions. They are the most important equations in RL.
Bellman Expectation Equation for V
Expanding the definition of using the one-step dynamics:
Read this as: the value of a state equals (for each possible action weighted by the policy) the immediate reward plus the discounted expected value of the next state.
Step-by-step derivation:
The third line uses the Markov property - the future expected return from depends only on , not on the history.
Bellman Expectation Equation for Q
The Q-value of equals the expected immediate reward plus the discounted expected Q-value in the next state under the policy.
Bellman Optimality Equations
For the optimal policy and its value functions and :
These equations are fixed-point conditions - and are the unique functions satisfying them simultaneously for all states.
Once you have , the optimal policy is trivially recovered:
This is why most RL algorithms focus on learning Q-functions - the optimal policy follows directly without needing to know the transition model.
The Bellman Backup: Visual Intuition
The Bellman equation defines a backup operation: to update the value of a state, look one step ahead at its successors. This is the tree diagram:
V(s) ← value of current state
|
──────┼──────
│ │
π(a₁|s) π(a₂|s) ← action choices weighted by policy
│ │
───┼─── ───┼───
│ │ │ │
s'₁ s'₂ s'₃ s'₄ ← stochastic next states
P(s'|s,a) P(s'|s,a)
│ │ │ │
R+γV R+γV R+γV R+γV ← reward + discounted next value
Key insight: You can bootstrap - use existing (possibly imperfect) value estimates for successor states to improve the value estimate for the current state. This is the core of all temporal difference methods: you don't need to wait for the end of the episode.
MDP Visualization as a Directed Graph
Each edge is annotated with action, transition probability, and reward. The agent must learn to navigate this graph to maximize total discounted reward.
GridWorld: Hand-Computed Value Function
Let's compute value functions by hand for a simple 3×3 grid. This builds intuition before code.
┌───┬───┬───┐
│ S │ │ │ S = Start (state 0)
├───┼───┼───┤
│ │ X │ │ X = Wall (impassable, state 4)
├───┼───┼───┤
│ │ │ G │ G = Goal (state 8, reward +1)
└───┴───┴───┘
State indices: 0 1 2
3 4 5
6 7 8
Policy: Always go right if possible, else go down.
Assumptions: Deterministic transitions, , terminal reward , step cost elsewhere.
Working backwards from the goal:
- (terminal)
- - one step from goal
- - wait, 5 goes to 8 directly? With our policy (right): state 5 is at row 1 col 2, going right exits the grid so we go down → state 8. ... that seems wrong.
Adjusting: state 5 right = state 8 (terminal), reward = 1. .
The key insight: states closer to the goal have higher value, and the discount factor creates a gradient that guides the agent toward reward.
OpenAI Gym: Environment Interaction Loop
The standard way to interact with any RL environment in Python:
import gymnasium as gym
import numpy as np
def run_episode(env_name: str = "CartPole-v1", render: bool = False) -> float:
"""
Run one episode of a Gym environment with a random policy.
Returns total reward.
"""
env = gym.make(env_name, render_mode="human" if render else None)
# Reset - get initial observation
obs, info = env.reset(seed=42)
total_reward = 0.0
done = False
step = 0
while not done:
# Random action from the action space
action = env.action_space.sample()
# Step: take action, get next observation
obs, reward, terminated, truncated, info = env.step(action)
total_reward += reward
done = terminated or truncated
step += 1
print(f"Step {step:3d} | Obs: {obs} | Action: {action} | Reward: {reward:.2f}")
env.close()
print(f"\nEpisode ended after {step} steps | Total reward: {total_reward:.2f}")
return total_reward
def inspect_environment(env_name: str = "CartPole-v1"):
"""Print key properties of a Gym environment."""
env = gym.make(env_name)
print(f"Environment: {env_name}")
print(f"Observation space: {env.observation_space}")
print(f" Type: {type(env.observation_space).__name__}")
if hasattr(env.observation_space, 'shape'):
print(f" Shape: {env.observation_space.shape}")
if hasattr(env.observation_space, 'low'):
print(f" Low: {env.observation_space.low}")
print(f" High: {env.observation_space.high}")
print(f"\nAction space: {env.action_space}")
print(f" Type: {type(env.action_space).__name__}")
if hasattr(env.action_space, 'n'):
print(f" N actions: {env.action_space.n}")
env.close()
# Common Gym environments and what they model
ENVIRONMENTS = {
"CartPole-v1": {
"state": "4D: [cart_pos, cart_vel, pole_angle, pole_vel]",
"actions": "Discrete(2): push left or right",
"reward": "+1 for every step pole stays upright",
"episode_end": "pole > 12 degrees OR cart > 2.4 units",
},
"LunarLander-v2": {
"state": "8D: position, velocity, angle, angular vel, leg contacts",
"actions": "Discrete(4): nothing, left engine, main engine, right engine",
"reward": "landing pad +100 to +140, crash -100, each leg contact +10",
"episode_end": "lander crashes or lands successfully",
},
"HalfCheetah-v4": {
"state": "17D: joint positions and velocities",
"actions": "Continuous 6D: torques for 6 joints",
"reward": "velocity forward - control cost",
"episode_end": "1000 steps",
},
}
NumPy Policy Evaluation: Full Implementation
The core algorithm for computing given a policy and the full MDP model:
import numpy as np
from typing import Optional
def policy_evaluation(
policy: np.ndarray, # shape (n_states, n_actions) - stochastic policy
P: np.ndarray, # shape (n_states, n_actions, n_states) - transitions
R: np.ndarray, # shape (n_states, n_actions) - rewards
gamma: float = 0.9,
tol: float = 1e-8,
max_iter: int = 10_000,
verbose: bool = True,
) -> np.ndarray:
"""
Iterative policy evaluation.
Applies the Bellman expectation equation repeatedly until convergence.
Convergence is guaranteed by the Banach fixed-point theorem:
The Bellman operator T^pi is a contraction with factor gamma.
Therefore, repeated application converges to the unique fixed point V^pi.
Time complexity per iteration: O(|S|^2 * |A|)
Memory: O(|S| * |A| * |S|) for P, O(|S|) for V
"""
n_states = P.shape[0]
V = np.zeros(n_states)
deltas = []
for iteration in range(max_iter):
# Bellman expectation backup (vectorized):
# Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] * V[s']
Q = R + gamma * (P @ V) # shape: (n_states, n_actions)
# V_new[s] = sum_a pi[s,a] * Q[s,a]
V_new = np.einsum('sa,sa->s', policy, Q) # shape: (n_states,)
# Check convergence: max absolute change
delta = np.max(np.abs(V_new - V))
deltas.append(delta)
V = V_new
if delta < tol:
if verbose:
print(f"Converged in {iteration + 1} iterations (delta={delta:.2e})")
break
else:
if verbose:
print(f"Did not converge in {max_iter} iterations (final delta={delta:.2e})")
return V
def compute_q_from_v(
V: np.ndarray,
P: np.ndarray,
R: np.ndarray,
gamma: float = 0.9,
) -> np.ndarray:
"""
Given V^pi, compute Q^pi using one Bellman backup.
Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] * V[s']
"""
return R + gamma * (P @ V)
# ── Demo: 5-state chain MDP ──────────────────────────────────────────────────
def build_chain_mdp(n_states: int = 5, gamma: float = 0.9):
"""
Build a chain MDP: states 0..n_states-1.
Actions: 0=left, 1=right.
Goal: reach state n_states-1 (terminal, reward +10).
Step cost: -0.1 per step.
Stochastic: 80% intended direction, 20% stay.
"""
n_actions = 2
P = np.zeros((n_states, n_actions, n_states))
R = np.full((n_states, n_actions), -0.1)
for s in range(n_states - 1):
# Action 0 (left): 80% go to s-1, 20% stay
if s > 0:
P[s, 0, s - 1] = 0.8
P[s, 0, s] = 0.2
else:
P[s, 0, s] = 1.0 # can't go left of start
# Action 1 (right): 80% go to s+1, 20% stay
P[s, 1, s + 1] = 0.8
P[s, 1, s] = 0.2
# Terminal state: absorbing
P[n_states - 1, :, n_states - 1] = 1.0
R[n_states - 1, :] = 0.0 # no reward from terminal
# Reward for reaching terminal state
for s in range(n_states - 1):
R[s, 1] += P[s, 1, n_states - 1] * 10.0 # partial reward for probability of reaching goal
return P, R
if __name__ == "__main__":
P, R = build_chain_mdp(n_states=5, gamma=0.9)
# Always-right policy
always_right = np.zeros((5, 2))
always_right[:, 1] = 1.0 # probability 1 for action "right"
V = policy_evaluation(always_right, P, R, gamma=0.9, verbose=True)
Q = compute_q_from_v(V, P, R)
print("\nValue Function V^pi (always right):")
for s, v in enumerate(V):
print(f" V({s}) = {v:7.4f}")
print("\nQ-values Q^pi(s, a):")
for s in range(5):
print(f" Q({s}, left) = {Q[s,0]:7.4f} | Q({s}, right) = {Q[s,1]:7.4f}")
Advantage Function
A derived quantity that becomes critical in policy gradient methods and actor-critic architectures:
The advantage function measures how much better action is compared to the average action taken by the policy in state .
- : action is better than the policy's average - reinforce it
- : action is worse than the policy's average - discourage it
- : action is exactly as good as average
Key property: for all . The advantage is zero-mean by construction.
Why this matters for training: policy gradient methods work better when trained on advantages rather than raw Q-values. Advantages have lower variance because the baseline removes the part of the return that doesn't depend on the specific action chosen.
The Exploration-Exploitation Dilemma
A fundamental tension in every RL problem:
Exploitation: Take the action that currently seems best according to your value estimates. Get the reward you know you can get.
Exploration: Try less-certain actions to gather information that might reveal better strategies. Sacrifice some immediate reward for potential future gain.
Why this is hard: You don't know which unexplored actions are good until you try them. But trying them wastes time that could be spent exploiting the best known action. The optimal solution depends on the true environment, which you're trying to learn.
The -Greedy Strategy
The simplest exploration strategy: with probability , take a random action; with probability , take the greedy action.
def epsilon_greedy_action(
Q: np.ndarray, # Q[a] = Q-value for each action
epsilon: float = 0.1,
) -> int:
"""
Epsilon-greedy action selection.
Balances exploration (random) and exploitation (greedy).
"""
if np.random.random() < epsilon:
return np.random.randint(len(Q)) # explore: random action
else:
return np.argmax(Q) # exploit: best known action
Formal treatment. The optimal exploration strategy for a bandit problem (one-state MDP) is given by the Upper Confidence Bound (UCB) algorithm:
where is the number of times action has been taken and is an exploration constant. UCB balances exploitation (high ) with exploration (high uncertainty, i.e., low ).
For the full MDP setting, exploration is much harder - the optimal exploration strategy depends on the entire MDP structure. In practice, -greedy with annealing (reducing over training) is the most common approach.
Partial Observability: POMDPs
When the Markov property fails - the agent cannot observe the full state - we work with Partially Observable MDPs (POMDPs).
Formal definition: A POMDP is a tuple where:
- : same as MDP
- : set of possible observations
- : probability of observation given that the agent took action and landed in state
The agent cannot observe directly - only the noisy observation . To plan optimally, the agent must maintain a belief state - a probability distribution over true states - and update it using Bayes' rule:
where is a normalizing constant.
Why POMDPs are harder: The belief state is a probability distribution - a continuous, infinite-dimensional object. The optimal solution for a POMDP requires planning in belief space, which is computationally intractable in general.
Practical solutions:
- Recurrent neural networks: Maintain a hidden state that captures history. The hidden state acts as a learned approximate belief state.
- Transformers: The attention mechanism over all past observations functions as a belief state in disguise.
- Frame stacking: In Atari, stacking the last 4 frames approximately satisfies the Markov property for velocity information.
:::note RLHF and Partial Observability In LLM RLHF, the "state" is the full conversation including the model's internal representations. But the reward model only sees the text. This is partial observability - the reward model must infer quality from incomplete information. :::
Real-World MDP Formulations
Understanding how to map real problems to MDPs is a core engineering skill.
Recommendation System MDP
State s: User's full watch/read/purchase history + current session context
Action a: Which item to recommend next (from a catalog of millions)
Reward r: User engagement - click (+1), watch full video (+5), like (+2), skip (-0.5)
Transition P: New user state after interaction with recommended item
γ: 0.99 - we care about long-term engagement, not just next click
Challenge: Action space is enormous (millions of items); reward hacking (clickbait)
Algorithmic Trading MDP
State s: Price history, order book, portfolio positions, macro indicators
Action a: {buy N shares, sell N shares, hold} for each asset
Reward r: Mark-to-market PnL minus transaction costs
Transition P: Market dynamics (partially observable, non-stationary)
γ: 0.999 - daily trading, care about long-run Sharpe ratio
Challenge: Non-stationarity (market regime changes), partial observability
Robot Locomotion MDP
State s: Joint angles, angular velocities, foot contact sensors, IMU data
Action a: Torques for each joint (continuous, high-dimensional)
Reward r: Forward velocity - energy consumption - fall penalty
Transition P: Physics simulation (MuJoCo, IsaacGym)
γ: 0.99 - robot must plan several gait cycles ahead
Challenge: Sim-to-real gap, continuous action space, contact discontinuities
When to Use RL vs Supervised Learning
This is a critical design decision. RL is not always the right tool.
Use Supervised Learning when:
- You have labeled examples of correct behavior
- Actions are independent (no sequential dependency)
- The mapping input → output is well-defined
Use Reinforcement Learning when:
- Correct behavior is hard to specify but easy to evaluate (chess: hard to say what move is right, easy to say who won)
- Decisions are sequential and actions have long-term consequences
- You need the agent to discover novel strategies not in training data
- You can sample from the environment (real or simulated) cheaply
Common Mistakes
:::danger Confusing V and Q in Code
has shape (n_states,). has shape (n_states, n_actions). A common bug: computing Q = P @ V gives shape (n_states, n_actions) - this is , which is the Bellman backup without the reward. Always add R: Q = R + gamma * (P @ V).
:::
:::danger Setting γ = 1 for Continuing Tasks If your task has no terminal state and , value functions are infinite (or undefined). This causes NaN in value function computation and unstable training. Always use for continuing tasks, or ensure your episodes terminate. :::
:::warning Reward Scale Matters More Than You Think If rewards are too large (e.g., millions of dollars in a trading system), gradients explode. If too small (e.g., 1e-6 per step), learning is slow and value functions are numerically indistinguishable. Normalize rewards to roughly or at least ensure . :::
:::warning The Sparse Reward Problem If rewards are only given at the very end of long episodes (e.g., win/lose in chess), the agent gets almost no learning signal. Solutions: reward shaping (add intermediate rewards), curriculum learning (start with easier tasks), hierarchical RL (decompose into subgoals), or imitation learning to bootstrap. :::
Three Algorithm Families
The RL learning problem - given only interaction with the environment, find - is solved by three broad families of algorithms:
YouTube Resources
| Video | Creator | Why Watch |
|---|---|---|
| Lecture 1: Introduction to Reinforcement Learning | David Silver (DeepMind) | The gold standard intro - covers MDP framework and Bellman equations rigorously |
| Reinforcement Learning Course - Full Lecture 1 | Pieter Abbeel (Berkeley) | Excellent coverage of MDP formalism, great for mathematical intuition |
| RL Tutorial - Lecture 2: MDPs, Value Functions | Yannis Kalogeropoulos | Approachable walkthrough of value functions and Bellman with code |
| Reinforcement Learning - Introduction to MDPs | Mutual Information | Clean visual explanation of the MDP framework and discount factor |
| Deep RL Bootcamp Lecture 1 - Motivation + Overview | Pieter Abbeel & Rocky Duan | Motivates RL from applications - great for seeing why it matters |
Interview Q&A
Q1: Explain the Bellman equation and why it is the foundation of all RL algorithms.
The Bellman equation expresses a recursive consistency condition: the value of any state equals the expected immediate reward plus the discounted value of the next state. For the expectation equation: . Its importance is threefold. First, it gives a self-consistency requirement - any correct value function must satisfy this for all states simultaneously. Second, it enables bootstrapping: you can update the value estimate for one state using value estimates of neighboring states, without waiting for full episode rollouts. Third, it defines the Bellman operator , which is a contraction mapping - repeated application converges to the unique fixed point by the Banach theorem. All temporal difference algorithms - Q-learning, SARSA, TD(λ), DQN, PPO's value update - are variants of this update applied online with function approximation.
Q2: What is the difference between and ? When would you use each?
is the expected return starting from state and following forever. is the expected return starting from , taking action first, then following . is strictly more informative: you can derive from (as ), and you can derive the optimal action directly as without knowing the transition model. In practice: Q-learning and DQN learn because the optimal policy is recoverable greedily. Actor-critic methods learn for the critic because it's cheaper (no action dimension) and use as the advantage estimate for policy updates. You'd choose when you have a separate policy you're improving; when you want the value function to directly determine the policy.
Q3: Why does the Markov property matter? What happens when it's violated?
The Markov property - that the future depends only on the current state, not the history - allows compact value functions and that depend only on the current state. Without it, you'd need where is the entire history, making the problem exponentially harder (the space of all histories is astronomically large). When the Markov property is violated (partial observability): naive RL with state-only policies can be suboptimal or even fail to converge. The agent may not have enough information to determine the right action from the current observation. Solutions include: (1) using recurrent networks (LSTM, Transformer) that compress history into a hidden state, effectively acting on a learned belief state; (2) frame stacking - including the last observations as a single input, approximately satisfying the Markov property for velocity information; (3) explicitly maintaining and updating a belief state using Bayes' rule (formal POMDP solution, computationally expensive).
Q4: How does the discount factor γ affect agent behavior and algorithm convergence?
controls the effective planning horizon. Small (0.5): agent is myopic, optimizes immediate reward, may sacrifice long-term outcomes. Large (0.99): agent is far-sighted, will forgo immediate reward for better future outcomes. Two mathematical effects: (1) Convergence speed - the Bellman operator's contraction rate is . After iterations of value iteration, error is times initial error. Higher means slower convergence. (2) Credit assignment difficulty - with large , a reward 100 steps in the future still contributes significantly to the present value. Propagating this information backwards through many steps requires many training iterations. In practice: is standard for most deep RL; is safe for episodic tasks with guaranteed termination; gives faster learning but a myopic policy.
Q5: What is the advantage function and why is it used in practice?
The advantage function measures how much better action is relative to the average action from state under the current policy. It has two key properties: (1) Zero-mean - by construction, since . (2) Lower variance - using instead of raw or in policy gradient updates removes the baseline which captures the part of the return independent of the action. This variance reduction speeds training substantially. Practically: policy gradient updates use . When (action was better than average), the action probability increases. When (worse than average), it decreases. This is much more stable than updating on raw returns, which can push all action probabilities in the same direction when overall returns are high.
Q6: How would you formulate an LLM chat assistant as an MDP?
State : the conversation history (all tokens generated so far) - in practice, the hidden states of the transformer. Action : the next token to generate from a vocabulary of size ~50,000. Transition : deterministic from the LLM's perspective - given a state and action (token choice), the new state is the old state concatenated with that token. Reward : typically sparse - a reward model score for the full response given prompt , received only at the end of generation. Episode: one full turn of the conversation (prompt → complete response). Discount : often 1.0 for RLHF since episodes are short. The main challenges: (1) sparse reward (only at episode end) makes credit assignment hard; (2) enormous action space (50K tokens per step); (3) reward hacking (the model learns to game the reward model rather than genuinely improving); (4) the KL-from-reference penalty added in PPO-RLHF is needed to prevent the policy from drifting too far from the pretrained model.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the MDP GridWorld demo on the EngineersOfAI Playground - no code required.
:::
