What is Markov decision process?

Master Markov Decision Processes - the mathematical foundation of all reinforcement learning. Understand states, actions, rewards, value functions, the Bellman equations, and how real-world systems are modeled as MDPs.

How does MDP work in practice?

MDP and the RL Framework covers Markov decision process, MDP, Bellman equation from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/reinforcement-learning/mdp-and-the-rl-framework

What is the difference between Markov decision process and Bellman equation?

See the full breakdown at https://engineersofai.com/docs/ml/reinforcement-learning/mdp-and-the-rl-framework

MDP and the RL Framework

import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';

Reading time: ~45 min | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Research Engineer

The Real Engineering Moment

It's 2019 at OpenAI. The team training GPT-2 is discussing what comes next. Raw language model pre-training produces impressive text completion - but the model cheerfully writes misinformation, follows harmful instructions, and ignores the user's actual intent. The problem is not the model's capability. It's the objective. Predicting the next token is not the same as being helpful.

The question they grapple with: how do you teach a model to optimize for something as fuzzy as "what the user actually wants"? You can't label every output as good or bad - there are too many possible responses. You can't write down a mathematical reward function that captures helpfulness precisely. But you can get humans to compare pairs of outputs and say which one they prefer.

The answer turns out to be the same mathematical framework that DeepMind used to teach an agent to play Go, Atari games, and robot locomotion: the Markov Decision Process. In 2022, this framework powered the training of ChatGPT through RLHF. In 2023 and beyond, it underpins every alignment technique applied to frontier models - including Claude. Understanding this framework is not optional for AI engineers: it is the prerequisite for understanding every modern LLM alignment technique.

But the MDP framework is older and broader than LLMs. Howard formalized it in 1960 for operations research. Bellman derived the optimality equations in the 1950s. The same math governs a robot deciding how to step on uneven terrain, a trading algorithm deciding when to buy, and a recommendation system deciding what content to surface next. One framework, thousands of applications.

This lesson builds the formal MDP foundation from scratch. By the end, you will be able to formulate any sequential decision problem as an MDP, write the Bellman equations, implement policy evaluation in NumPy, and understand exactly how this connects to modern deep RL and RLHF.

Why This Exists: The Sequential Decision Problem

Before MDPs, decision-making problems were often modeled in one of two unsatisfactory ways.

Static optimization: Find the best action given the current state, ignoring the future. A thermostat that only looks at current temperature, not the trajectory. A recommendation system that maximizes immediate click probability, ignoring what users actually want after clicking. These approaches are myopic - they optimize the present at the cost of the future.

Game tree search (minimax): Works for two-player zero-sum games with perfect information and a finite game tree. But most real problems are stochastic (the environment is random), have enormous state spaces (chess is manageable; robot locomotion is not), and have a single agent rather than two opponents.

The MDP framework provides a mathematically rigorous way to model any sequential decision problem under uncertainty. It handles stochasticity (random environment transitions), long-term consequences (the agent cares about total future reward, not just immediate), and partial information (POMDPs extend the framework). Most importantly, it comes with provably correct algorithms for finding optimal solutions.

Historical Context

1950s: Richard Bellman, working at RAND Corporation, develops the principle of optimality and the recursive equations now bearing his name. He coins the term "dynamic programming" - deliberately vague to avoid military scrutiny of mathematical research. His insight: an optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

1960: Ronald Howard publishes "Dynamic Programming and Markov Processes," formally defining the MDP as a decision-making framework and proving that optimal policies exist and can be computed.

1988: Chris Watkins invents Q-learning in his PhD thesis, extending DP to the model-free setting where the transition function is unknown.

1992: Ronald Williams derives the REINFORCE policy gradient algorithm, showing how to optimize policies without needing the model at all.

2013: DeepMind combines Q-learning with deep neural networks to create DQN, achieving superhuman performance on Atari games. The state space is no longer a table - it's raw pixels.

2017–2022: OpenAI, Anthropic, DeepMind apply RL to language models via RLHF, directly using the MDP framework with language tokens as actions and human preferences as rewards.

The theoretical foundation hasn't changed since Bellman. What changed is the function approximation (neural networks) and the scale (billions of parameters).

The RL Setup: Five Components

Reinforcement learning formalizes the interaction between a decision-making agent and an environment. The agent observes the world, takes actions, and receives feedback.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   ┌────────┐    action a_t     ┌─────────────┐             │
│   │        │ ─────────────────▶│             │             │
│   │ Agent  │                   │ Environment │             │
│   │        │ ◀─────────────────│             │             │
│   └────────┘  state s_{t+1}    └─────────────┘             │
│                reward r_{t+1}                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The five components of any RL problem:

1. State (S): What the agent observes about the world. In chess: board position. In robotics: joint angles and velocities. In LLM training: the current conversation so far.

2. Action (A): What the agent can do. In chess: legal moves. In robotics: motor torques. In LLM training: the next token to generate.

3. Reward (R): The scalar feedback signal the agent receives. In chess: +1 for win, -1 for loss, 0 otherwise. In LLM training: a score from a reward model.

4. Policy (π): The agent's decision-making function - given state, what action to take.

5. Environment dynamics (P): How the world transitions from one state to the next given an action.

The agent's goal: find a policy that maximizes cumulative reward over time.

Formal MDP Definition

A Markov Decision Process is a tuple $\mathcal{M} = (S, A, P, R, \gamma)$ :

State Space $S$

The set of all possible states the environment can be in. Can be:

Discrete and finite: Chess positions, GridWorld cells, inventory levels
Continuous: Robot joint angles $\in \mathbb{R}^n$ , velocity, temperature
High-dimensional: Raw pixel frames in Atari (84×84×4 = 28,224 dimensions)
Structured: Graphs, sequences (token histories in LLMs)

The Markov property requires: the state $s_t$ must contain all information needed to predict the future. If it doesn't, you have a partially observable MDP (POMDP), discussed later.

Action Space $A$

The set of actions available to the agent. Can be:

Discrete: $A = \{up, down, left, right\}$ , $A = \{buy, sell, hold\}$ , $A = \{0, 1, \ldots, V-1\}$ for token vocabulary
Continuous: $A = [-1, 1]^{24}$ for robot joint torques
Parameterized: Discrete action type with continuous parameters (move location, force magnitude)

In continuous action spaces, you cannot enumerate actions - this is why Q-learning fails and policy gradient methods are needed.

Transition Function $P(s' | s, a)$

The probability of transitioning to state $s'$ after taking action $a$ in state $s$ :

$P: S \times A \times S \rightarrow [0, 1]$

$\sum_{s' \in S} P(s' | s, a) = 1 \quad \forall s \in S, a \in A$

This is the model of the environment. It captures everything the environment can do in response to an action.

Deterministic environment: $P(s'|s,a) = 1$ for exactly one $s'$ . Chess (ignoring illegal moves) is deterministic. So is a simple GridWorld where walls always block movement.

Stochastic environment: Multiple $s'$ have non-zero probability. A robot slipping on a surface. A game with random wind. In most real-world deployments, environments are stochastic.

Reward Function $R(s, a)$

The expected scalar reward received after taking action $a$ in state $s$ :

$R: S \times A \rightarrow \mathbb{R}$

Some formulations use $R(s, a, s')$ (reward that depends on the next state too) or $R(s)$ (reward depends only on state). For MDPs, these are equivalent in the sense that the same algorithms apply.

Reward design is the hardest part of applied RL. The reward function must:

Be well-specified (captures what you actually want)
Be computable from available information
Not be gameable (the agent shouldn't find a loophole that scores high but doesn't solve the task)

Reward hacking - the agent optimizing a proxy reward that diverges from the true objective - is the central challenge in RLHF. You want "helpfulness" but reward model scores are a proxy. The agent finds ways to get high proxy reward that don't correspond to genuine helpfulness.

Discount Factor $\gamma \in [0, 1)$

Controls how much the agent values future rewards relative to immediate rewards. The total return from time step $t$ is:

$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$

This is a geometric series. For bounded rewards $|r_t| \leq r_{max}$ and $\gamma < 1$ :

$G_t \leq \sum_{k=0}^{\infty} \gamma^k r_{max} = \frac{r_{max}}{1 - \gamma}$

The sum is always finite - a critical mathematical property that guarantees value functions are well-defined.

The Discount Factor γ: Deep Dive

The discount factor has three distinct interpretations:

1. Time preference (economic): A dollar today is worth more than a dollar tomorrow. $\gamma$ captures this: reward now is worth 1, reward next step is worth $\gamma$ , reward $k$ steps later is worth $\gamma^k$ .

2. Uncertainty about the future: The further into the future, the less certain the prediction. Discounting reflects that distant futures are speculative.

3. Probability of termination: If at each step the episode could end with probability $(1-\gamma)$ , the expected return under this termination model equals the discounted return. This interpretation makes $\gamma$ concrete: $\gamma = 0.99$ means the episode ends with 1% probability each step.

What $\gamma$ controls in practice:

γ = 0.0  →  Agent only cares about immediate reward. Greedy.
γ = 0.5  →  Reward halves every step. Step 10 reward worth 0.1%
γ = 0.9  →  Balances near and far. Common for episodic tasks.
γ = 0.99 →  Far-sighted. Reward 100 steps away still worth 37%
γ = 1.0  →  Equal weight forever. Only safe for episodic, finite MDPs.

Geometric series interpretation. With $\gamma = 0.99$ , reward at step $k$ contributes $0.99^k$ to the total return. The "effective horizon" - the number of steps within which 95% of the total return accrues - is approximately $\frac{1}{1-\gamma}$ . At $\gamma=0.9$ , horizon ≈ 10 steps. At $\gamma=0.99$ , horizon ≈ 100 steps.

:::note RLHF Setting For LLM RLHF: $\gamma$ is often set to 1.0 because episodes (full responses) are short (hundreds of tokens), finite, and we care equally about all tokens. For robotics or long-horizon planning: $\gamma = 0.99$ is common. :::

The Markov Property

A critical assumption that makes RL mathematically tractable:

:::note The Markov Property The future is independent of the past, given the present.

$P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} | s_t, a_t)$ :::

In plain English: to decide what to do next, you only need the current state - not the entire history of how you got here. The current state contains all relevant information.

When is this reasonable?

Board games: the current board captures everything about the game state
CartPole: current position, velocity, angle, and angular velocity are sufficient
Fully observable robot: joint angles and velocities define the configuration

When does it fail?

Partially observable environments: the agent only sees part of the true state
History-dependent dynamics: past actions influence current state in ways not captured in observations
Non-stationary environments: the rules change over time

Finite vs Infinite Horizon MDPs

Episodic (finite horizon) MDPs: Episodes have a defined end - a terminal state or a maximum number of steps. The return is $G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$ where $T$ is the episode length. Examples: games (chess ends with checkmate), conversations in RLHF, navigation tasks (robot reaches goal).

Continuing (infinite horizon) MDPs: No natural episode boundary. The agent runs indefinitely. Requires $\gamma < 1$ to keep returns finite. Examples: trading strategies, inventory management, a robot that must operate continuously.

Key difference for algorithms: Episodic tasks allow Monte Carlo returns (wait until episode end, compute exact return). Continuing tasks require bootstrapping - using value function estimates to compute partial returns.

Policy: What the Agent Does

A policy $\pi$ maps states to actions. There are two types:

Deterministic Policy

$\pi: S \rightarrow A \qquad a = \pi(s)$

Given state $s$ , always take the same action $a$ . Simple and sufficient for fully observable, non-adversarial environments. The optimal policy in any fully observable MDP is always deterministic.

Stochastic Policy

$\pi: S \rightarrow \Delta(A) \qquad a \sim \pi(\cdot | s)$

Given state $s$ , sample an action from a probability distribution over actions. Crucial for:

Exploration: Stochastic policies naturally explore different actions
Partially observable environments: When the true state is hidden, a stochastic policy over observations can be optimal
Adversarial settings: Mixed strategies prevent exploitation (Rock-Paper-Scissors has no deterministic optimal policy)
Policy gradients: Gradient methods require differentiable stochastic policies

Tabular vs Function Approximation

Tabular policies store an explicit mapping for every state: a lookup table of size $|S|$ . Feasible only for small, discrete state spaces (GridWorld with 16 states). Exact and provably correct when it applies.

Function approximation represents the policy as a parameterized function $\pi_\theta(a|s)$ - typically a neural network. Generalizes across states. Required for continuous or high-dimensional state spaces.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Stochastic policy for discrete action spaces."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Distribution:
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

    def act(self, state: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        dist = self.forward(state)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob


class GaussianPolicyNetwork(nn.Module):
    """Stochastic policy for continuous action spaces."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, state: torch.Tensor) -> torch.distributions.Distribution:
        features = self.backbone(state)
        mean = self.mean_head(features)
        # Clamp log_std for numerical stability
        log_std = self.log_std_head(features).clamp(-20, 2)
        std = log_std.exp()
        return torch.distributions.Normal(mean, std)

Value Functions

Value functions answer: "How good is it to be in a particular state (or take a particular action)?" They are the mechanism by which an agent plans ahead.

State Value Function $V^\pi(s)$

$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid s_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s\right]$

The expected total return starting from state $s$ and following policy $\pi$ forever after.

Intuition: $V^\pi(s)$ tells you how good a state is, assuming you keep following your current policy. A state where you're about to win a chess game has high $V$ ; a state where you're in checkmate has zero $V$ (the game is already lost).

Action-Value Function $Q^\pi(s, a)$

$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid s_t = s, a_t = a\right]$

The expected total return starting from state $s$ , taking action $a$ (which may not follow $\pi$ ), and then following policy $\pi$ thereafter.

Intuition: $Q^\pi(s, a)$ tells you how good it is to take a specific action from a specific state, before returning to the policy. It is more useful than $V$ for action selection: the optimal action is $a^* = \arg\max_a Q^*(s, a)$ without needing to know the transition function.

Relationship Between V and Q

$V^\pi(s) = \sum_a \pi(a|s) \, Q^\pi(s, a) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$

The value of a state is the expected Q-value over all actions weighted by the policy. For a deterministic policy $\pi(s) = a^*$ : $V^\pi(s) = Q^\pi(s, a^*)$ .

The Bellman Equations

The Bellman equations express the recursive structure of value functions. They are the most important equations in RL.

Bellman Expectation Equation for V

Expanding the definition of $V^\pi$ using the one-step dynamics:

$\boxed{V^\pi(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^\pi(s') \right]}$

Read this as: the value of a state equals (for each possible action weighted by the policy) the immediate reward plus the discounted expected value of the next state.

Step-by-step derivation:

$V^\pi(s) = \mathbb{E}_\pi[G_t | s_t = s]$

$= \mathbb{E}_\pi[r_t + \gamma G_{t+1} | s_t = s]$

$= \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a) + \gamma \mathbb{E}_\pi[G_{t+1} | s_{t+1} = s'] \right]$

$= \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a) + \gamma V^\pi(s') \right]$

The third line uses the Markov property - the future expected return from $s'$ depends only on $s'$ , not on the history.

Bellman Expectation Equation for Q

$\boxed{Q^\pi(s, a) = \sum_{s'} P(s'|s,a) \left[ R(s,a) + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s', a') \right]}$

The Q-value of $(s, a)$ equals the expected immediate reward plus the discounted expected Q-value in the next state under the policy.

Bellman Optimality Equations

For the optimal policy $\pi^*$ and its value functions $V^*$ and $Q^*$ :

$\boxed{V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^*(s') \right]}$

$\boxed{Q^*(s, a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s', a')}$

These equations are fixed-point conditions - $V^*$ and $Q^*$ are the unique functions satisfying them simultaneously for all states.

Once you have $Q^*$ , the optimal policy is trivially recovered:

$\pi^*(s) = \arg\max_a Q^*(s, a)$

This is why most RL algorithms focus on learning Q-functions - the optimal policy follows directly without needing to know the transition model.

The Bellman Backup: Visual Intuition

The Bellman equation defines a backup operation: to update the value of a state, look one step ahead at its successors. This is the tree diagram:

         V(s)               ← value of current state
          |
    ──────┼──────
    │            │
  π(a₁|s)     π(a₂|s)      ← action choices weighted by policy
    │            │
 ───┼───      ───┼───
 │     │      │     │
s'₁  s'₂   s'₃  s'₄        ← stochastic next states
P(s'|s,a)   P(s'|s,a)
 │     │      │     │
R+γV  R+γV  R+γV  R+γV      ← reward + discounted next value

Key insight: You can bootstrap - use existing (possibly imperfect) value estimates for successor states to improve the value estimate for the current state. This is the core of all temporal difference methods: you don't need to wait for the end of the episode.

MDP Visualization as a Directed Graph

Each edge is annotated with action, transition probability, and reward. The agent must learn to navigate this graph to maximize total discounted reward.

GridWorld: Hand-Computed Value Function

Let's compute value functions by hand for a simple 3×3 grid. This builds intuition before code.

┌───┬───┬───┐
│ S │   │   │   S = Start (state 0)
├───┼───┼───┤
│   │ X │   │   X = Wall (impassable, state 4)
├───┼───┼───┤
│   │   │ G │   G = Goal (state 8, reward +1)
└───┴───┴───┘

State indices:  0  1  2
                3  4  5
                6  7  8

Policy: Always go right if possible, else go down.

Assumptions: Deterministic transitions, $\gamma = 0.9$ , terminal reward $R(7 \to 8) = 1$ , step cost $-0.1$ elsewhere.

Working backwards from the goal:

$V(8) = 0$ (terminal)
$V(7) = R(7, right) + \gamma V(8) = 0.9 \times 0 + 1 = 1.0$ - one step from goal
$V(6) = R(6, right) + \gamma V(7) = -0.1 + 0.9 \times 1.0 = 0.8$
$V(5) = R(5, down) + \gamma V(8) = -0.1 + 0.9 \times 0 = -0.1$ - wait, 5 goes to 8 directly? With our policy (right): state 5 is at row 1 col 2, going right exits the grid so we go down → state 8. $V(5) = -0.1 + 0.9 \times 0 = -0.1$ ... that seems wrong.

Adjusting: state 5 right = state 8 (terminal), reward = 1. $V(5) = R(5,right) + \gamma V(8) = 0.9 + 0.9 \times 0 = 0.9$ .

The key insight: states closer to the goal have higher value, and the discount factor creates a gradient that guides the agent toward reward.

OpenAI Gym: Environment Interaction Loop

The standard way to interact with any RL environment in Python:

import gymnasium as gym
import numpy as np

def run_episode(env_name: str = "CartPole-v1", render: bool = False) -> float:
    """
    Run one episode of a Gym environment with a random policy.
    Returns total reward.
    """
    env = gym.make(env_name, render_mode="human" if render else None)

    # Reset - get initial observation
    obs, info = env.reset(seed=42)
    total_reward = 0.0
    done = False
    step = 0

    while not done:
        # Random action from the action space
        action = env.action_space.sample()

        # Step: take action, get next observation
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
        step += 1

        print(f"Step {step:3d} | Obs: {obs} | Action: {action} | Reward: {reward:.2f}")

    env.close()
    print(f"\nEpisode ended after {step} steps | Total reward: {total_reward:.2f}")
    return total_reward


def inspect_environment(env_name: str = "CartPole-v1"):
    """Print key properties of a Gym environment."""
    env = gym.make(env_name)

    print(f"Environment: {env_name}")
    print(f"Observation space: {env.observation_space}")
    print(f"  Type: {type(env.observation_space).__name__}")
    if hasattr(env.observation_space, 'shape'):
        print(f"  Shape: {env.observation_space.shape}")
    if hasattr(env.observation_space, 'low'):
        print(f"  Low:  {env.observation_space.low}")
        print(f"  High: {env.observation_space.high}")

    print(f"\nAction space: {env.action_space}")
    print(f"  Type: {type(env.action_space).__name__}")
    if hasattr(env.action_space, 'n'):
        print(f"  N actions: {env.action_space.n}")

    env.close()

# Common Gym environments and what they model
ENVIRONMENTS = {
    "CartPole-v1": {
        "state": "4D: [cart_pos, cart_vel, pole_angle, pole_vel]",
        "actions": "Discrete(2): push left or right",
        "reward": "+1 for every step pole stays upright",
        "episode_end": "pole > 12 degrees OR cart > 2.4 units",
    },
    "LunarLander-v2": {
        "state": "8D: position, velocity, angle, angular vel, leg contacts",
        "actions": "Discrete(4): nothing, left engine, main engine, right engine",
        "reward": "landing pad +100 to +140, crash -100, each leg contact +10",
        "episode_end": "lander crashes or lands successfully",
    },
    "HalfCheetah-v4": {
        "state": "17D: joint positions and velocities",
        "actions": "Continuous 6D: torques for 6 joints",
        "reward": "velocity forward - control cost",
        "episode_end": "1000 steps",
    },
}

NumPy Policy Evaluation: Full Implementation

The core algorithm for computing $V^\pi$ given a policy $\pi$ and the full MDP model:

import numpy as np
from typing import Optional


def policy_evaluation(
    policy: np.ndarray,       # shape (n_states, n_actions) - stochastic policy
    P: np.ndarray,            # shape (n_states, n_actions, n_states) - transitions
    R: np.ndarray,            # shape (n_states, n_actions) - rewards
    gamma: float = 0.9,
    tol: float = 1e-8,
    max_iter: int = 10_000,
    verbose: bool = True,
) -> np.ndarray:
    """
    Iterative policy evaluation.
    Applies the Bellman expectation equation repeatedly until convergence.

    Convergence is guaranteed by the Banach fixed-point theorem:
    The Bellman operator T^pi is a contraction with factor gamma.
    Therefore, repeated application converges to the unique fixed point V^pi.

    Time complexity per iteration: O(|S|^2 * |A|)
    Memory: O(|S| * |A| * |S|) for P, O(|S|) for V
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    deltas = []

    for iteration in range(max_iter):
        # Bellman expectation backup (vectorized):
        # Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] * V[s']
        Q = R + gamma * (P @ V)   # shape: (n_states, n_actions)

        # V_new[s] = sum_a pi[s,a] * Q[s,a]
        V_new = np.einsum('sa,sa->s', policy, Q)  # shape: (n_states,)

        # Check convergence: max absolute change
        delta = np.max(np.abs(V_new - V))
        deltas.append(delta)
        V = V_new

        if delta < tol:
            if verbose:
                print(f"Converged in {iteration + 1} iterations (delta={delta:.2e})")
            break
    else:
        if verbose:
            print(f"Did not converge in {max_iter} iterations (final delta={delta:.2e})")

    return V


def compute_q_from_v(
    V: np.ndarray,
    P: np.ndarray,
    R: np.ndarray,
    gamma: float = 0.9,
) -> np.ndarray:
    """
    Given V^pi, compute Q^pi using one Bellman backup.
    Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] * V[s']
    """
    return R + gamma * (P @ V)


# ── Demo: 5-state chain MDP ──────────────────────────────────────────────────

def build_chain_mdp(n_states: int = 5, gamma: float = 0.9):
    """
    Build a chain MDP: states 0..n_states-1.
    Actions: 0=left, 1=right.
    Goal: reach state n_states-1 (terminal, reward +10).
    Step cost: -0.1 per step.
    Stochastic: 80% intended direction, 20% stay.
    """
    n_actions = 2
    P = np.zeros((n_states, n_actions, n_states))
    R = np.full((n_states, n_actions), -0.1)

    for s in range(n_states - 1):
        # Action 0 (left): 80% go to s-1, 20% stay
        if s > 0:
            P[s, 0, s - 1] = 0.8
            P[s, 0, s] = 0.2
        else:
            P[s, 0, s] = 1.0  # can't go left of start

        # Action 1 (right): 80% go to s+1, 20% stay
        P[s, 1, s + 1] = 0.8
        P[s, 1, s] = 0.2

    # Terminal state: absorbing
    P[n_states - 1, :, n_states - 1] = 1.0
    R[n_states - 1, :] = 0.0  # no reward from terminal

    # Reward for reaching terminal state
    for s in range(n_states - 1):
        R[s, 1] += P[s, 1, n_states - 1] * 10.0  # partial reward for probability of reaching goal

    return P, R


if __name__ == "__main__":
    P, R = build_chain_mdp(n_states=5, gamma=0.9)

    # Always-right policy
    always_right = np.zeros((5, 2))
    always_right[:, 1] = 1.0  # probability 1 for action "right"

    V = policy_evaluation(always_right, P, R, gamma=0.9, verbose=True)
    Q = compute_q_from_v(V, P, R)

    print("\nValue Function V^pi (always right):")
    for s, v in enumerate(V):
        print(f"  V({s}) = {v:7.4f}")

    print("\nQ-values Q^pi(s, a):")
    for s in range(5):
        print(f"  Q({s}, left) = {Q[s,0]:7.4f}  |  Q({s}, right) = {Q[s,1]:7.4f}")

Advantage Function $A(s, a)$

A derived quantity that becomes critical in policy gradient methods and actor-critic architectures:

$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$

The advantage function measures how much better action $a$ is compared to the average action taken by the policy in state $s$ .

$A(s, a) > 0$ : action $a$ is better than the policy's average - reinforce it
$A(s, a) < 0$ : action $a$ is worse than the policy's average - discourage it
$A(s, a) = 0$ : action $a$ is exactly as good as average

Key property: $\mathbb{E}_{a \sim \pi}[A^\pi(s, a)] = 0$ for all $s$ . The advantage is zero-mean by construction.

Why this matters for training: policy gradient methods work better when trained on advantages rather than raw Q-values. Advantages have lower variance because the baseline $V(s)$ removes the part of the return that doesn't depend on the specific action chosen.

The Exploration-Exploitation Dilemma

A fundamental tension in every RL problem:

Exploitation: Take the action that currently seems best according to your value estimates. Get the reward you know you can get.

Exploration: Try less-certain actions to gather information that might reveal better strategies. Sacrifice some immediate reward for potential future gain.

Why this is hard: You don't know which unexplored actions are good until you try them. But trying them wastes time that could be spent exploiting the best known action. The optimal solution depends on the true environment, which you're trying to learn.

The $\epsilon$ -Greedy Strategy

The simplest exploration strategy: with probability $\epsilon$ , take a random action; with probability $1-\epsilon$ , take the greedy action.

def epsilon_greedy_action(
    Q: np.ndarray,        # Q[a] = Q-value for each action
    epsilon: float = 0.1,
) -> int:
    """
    Epsilon-greedy action selection.
    Balances exploration (random) and exploitation (greedy).
    """
    if np.random.random() < epsilon:
        return np.random.randint(len(Q))   # explore: random action
    else:
        return np.argmax(Q)                 # exploit: best known action

Formal treatment. The optimal exploration strategy for a bandit problem (one-state MDP) is given by the Upper Confidence Bound (UCB) algorithm:

$a_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$

where $N_t(a)$ is the number of times action $a$ has been taken and $c$ is an exploration constant. UCB balances exploitation (high $Q(a)$ ) with exploration (high uncertainty, i.e., low $N_t(a)$ ).

For the full MDP setting, exploration is much harder - the optimal exploration strategy depends on the entire MDP structure. In practice, $\epsilon$ -greedy with annealing (reducing $\epsilon$ over training) is the most common approach.

Partial Observability: POMDPs

When the Markov property fails - the agent cannot observe the full state - we work with Partially Observable MDPs (POMDPs).

Formal definition: A POMDP is a tuple $(S, A, P, R, \Omega, O, \gamma)$ where:

$S, A, P, R, \gamma$ : same as MDP
$\Omega$ : set of possible observations
$O(o | s', a)$ : probability of observation $o$ given that the agent took action $a$ and landed in state $s'$

The agent cannot observe $s$ directly - only the noisy observation $o$ . To plan optimally, the agent must maintain a belief state $b(s)$ - a probability distribution over true states - and update it using Bayes' rule:

$b'(s') = \eta \cdot O(o | s', a) \sum_s P(s' | s, a) \, b(s)$

where $\eta$ is a normalizing constant.

Why POMDPs are harder: The belief state is a probability distribution - a continuous, infinite-dimensional object. The optimal solution for a POMDP requires planning in belief space, which is computationally intractable in general.

Practical solutions:

Recurrent neural networks: Maintain a hidden state $h_t$ that captures history. The hidden state acts as a learned approximate belief state.
Transformers: The attention mechanism over all past observations functions as a belief state in disguise.
Frame stacking: In Atari, stacking the last 4 frames approximately satisfies the Markov property for velocity information.

:::note RLHF and Partial Observability In LLM RLHF, the "state" is the full conversation including the model's internal representations. But the reward model only sees the text. This is partial observability - the reward model must infer quality from incomplete information. :::

Real-World MDP Formulations

Understanding how to map real problems to MDPs is a core engineering skill.

Recommendation System MDP

State s:       User's full watch/read/purchase history + current session context
Action a:      Which item to recommend next (from a catalog of millions)
Reward r:      User engagement - click (+1), watch full video (+5), like (+2), skip (-0.5)
Transition P:  New user state after interaction with recommended item
γ:             0.99 - we care about long-term engagement, not just next click
Challenge:     Action space is enormous (millions of items); reward hacking (clickbait)

Algorithmic Trading MDP

State s:       Price history, order book, portfolio positions, macro indicators
Action a:      {buy N shares, sell N shares, hold} for each asset
Reward r:      Mark-to-market PnL minus transaction costs
Transition P:  Market dynamics (partially observable, non-stationary)
γ:             0.999 - daily trading, care about long-run Sharpe ratio
Challenge:     Non-stationarity (market regime changes), partial observability

Robot Locomotion MDP

State s:       Joint angles, angular velocities, foot contact sensors, IMU data
Action a:      Torques for each joint (continuous, high-dimensional)
Reward r:      Forward velocity - energy consumption - fall penalty
Transition P:  Physics simulation (MuJoCo, IsaacGym)
γ:             0.99 - robot must plan several gait cycles ahead
Challenge:     Sim-to-real gap, continuous action space, contact discontinuities

When to Use RL vs Supervised Learning

This is a critical design decision. RL is not always the right tool.

Use Supervised Learning when:

You have labeled examples of correct behavior
Actions are independent (no sequential dependency)
The mapping input → output is well-defined

Use Reinforcement Learning when:

Correct behavior is hard to specify but easy to evaluate (chess: hard to say what move is right, easy to say who won)
Decisions are sequential and actions have long-term consequences
You need the agent to discover novel strategies not in training data
You can sample from the environment (real or simulated) cheaply

Common Mistakes

:::danger Confusing V and Q in Code $V(s)$ has shape (n_states,). $Q(s,a)$ has shape (n_states, n_actions). A common bug: computing Q = P @ V gives shape (n_states, n_actions) - this is $Q[s,a] = \sum_{s'} P[s,a,s'] V[s']$ , which is the Bellman backup without the reward. Always add R: Q = R + gamma * (P @ V). :::

:::danger Setting γ = 1 for Continuing Tasks If your task has no terminal state and $\gamma = 1$ , value functions are infinite (or undefined). This causes NaN in value function computation and unstable training. Always use $\gamma < 1$ for continuing tasks, or ensure your episodes terminate. :::

:::warning Reward Scale Matters More Than You Think If rewards are too large (e.g., millions of dollars in a trading system), gradients explode. If too small (e.g., 1e-6 per step), learning is slow and value functions are numerically indistinguishable. Normalize rewards to roughly $[-1, 1]$ or at least ensure $|r_t| \leq 10$ . :::

:::warning The Sparse Reward Problem If rewards are only given at the very end of long episodes (e.g., win/lose in chess), the agent gets almost no learning signal. Solutions: reward shaping (add intermediate rewards), curriculum learning (start with easier tasks), hierarchical RL (decompose into subgoals), or imitation learning to bootstrap. :::

Three Algorithm Families

The RL learning problem - given only interaction with the environment, find $\pi^*$ - is solved by three broad families of algorithms:

YouTube Resources

Video	Creator	Why Watch
Lecture 1: Introduction to Reinforcement Learning	David Silver (DeepMind)	The gold standard intro - covers MDP framework and Bellman equations rigorously
Reinforcement Learning Course - Full Lecture 1	Pieter Abbeel (Berkeley)	Excellent coverage of MDP formalism, great for mathematical intuition
RL Tutorial - Lecture 2: MDPs, Value Functions	Yannis Kalogeropoulos	Approachable walkthrough of value functions and Bellman with code
Reinforcement Learning - Introduction to MDPs	Mutual Information	Clean visual explanation of the MDP framework and discount factor
Deep RL Bootcamp Lecture 1 - Motivation + Overview	Pieter Abbeel & Rocky Duan	Motivates RL from applications - great for seeing why it matters

Interview Q&A

Q1: Explain the Bellman equation and why it is the foundation of all RL algorithms.

The Bellman equation expresses a recursive consistency condition: the value of any state equals the expected immediate reward plus the discounted value of the next state. For the expectation equation: $V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^\pi(s')]$ . Its importance is threefold. First, it gives a self-consistency requirement - any correct value function must satisfy this for all states simultaneously. Second, it enables bootstrapping: you can update the value estimate for one state using value estimates of neighboring states, without waiting for full episode rollouts. Third, it defines the Bellman operator $\mathcal{T}$ , which is a contraction mapping - repeated application converges to the unique fixed point $V^\pi$ by the Banach theorem. All temporal difference algorithms - Q-learning, SARSA, TD(λ), DQN, PPO's value update - are variants of this update applied online with function approximation.

Q2: What is the difference between $V(s)$ and $Q(s,a)$ ? When would you use each?

$V^\pi(s)$ is the expected return starting from state $s$ and following $\pi$ forever. $Q^\pi(s,a)$ is the expected return starting from $s$ , taking action $a$ first, then following $\pi$ . $Q$ is strictly more informative: you can derive $V$ from $Q$ (as $V(s) = \sum_a \pi(a|s) Q(s,a)$ ), and you can derive the optimal action directly as $\arg\max_a Q^*(s,a)$ without knowing the transition model. In practice: Q-learning and DQN learn $Q$ because the optimal policy is recoverable greedily. Actor-critic methods learn $V$ for the critic because it's cheaper (no action dimension) and use $Q - V$ as the advantage estimate for policy updates. You'd choose $V$ when you have a separate policy you're improving; $Q$ when you want the value function to directly determine the policy.

Q3: Why does the Markov property matter? What happens when it's violated?

The Markov property - that the future depends only on the current state, not the history - allows compact value functions $V(s)$ and $Q(s,a)$ that depend only on the current state. Without it, you'd need $V(h)$ where $h$ is the entire history, making the problem exponentially harder (the space of all histories is astronomically large). When the Markov property is violated (partial observability): naive RL with state-only policies can be suboptimal or even fail to converge. The agent may not have enough information to determine the right action from the current observation. Solutions include: (1) using recurrent networks (LSTM, Transformer) that compress history into a hidden state, effectively acting on a learned belief state; (2) frame stacking - including the last $k$ observations as a single input, approximately satisfying the Markov property for velocity information; (3) explicitly maintaining and updating a belief state using Bayes' rule (formal POMDP solution, computationally expensive).

Q4: How does the discount factor γ affect agent behavior and algorithm convergence?

$\gamma$ controls the effective planning horizon. Small $\gamma$ (0.5): agent is myopic, optimizes immediate reward, may sacrifice long-term outcomes. Large $\gamma$ (0.99): agent is far-sighted, will forgo immediate reward for better future outcomes. Two mathematical effects: (1) Convergence speed - the Bellman operator's contraction rate is $\gamma$ . After $k$ iterations of value iteration, error is $\gamma^k$ times initial error. Higher $\gamma$ means slower convergence. (2) Credit assignment difficulty - with large $\gamma$ , a reward 100 steps in the future still contributes significantly to the present value. Propagating this information backwards through many steps requires many training iterations. In practice: $\gamma = 0.99$ is standard for most deep RL; $\gamma = 1.0$ is safe for episodic tasks with guaranteed termination; $\gamma = 0.9$ gives faster learning but a myopic policy.

Q5: What is the advantage function and why is it used in practice?

The advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ measures how much better action $a$ is relative to the average action from state $s$ under the current policy. It has two key properties: (1) Zero-mean - $\mathbb{E}_{a \sim \pi}[A(s,a)] = 0$ by construction, since $\mathbb{E}_{a \sim \pi}[Q(s,a)] = V(s)$ . (2) Lower variance - using $A$ instead of raw $Q$ or $G_t$ in policy gradient updates removes the baseline $V(s)$ which captures the part of the return independent of the action. This variance reduction speeds training substantially. Practically: policy gradient updates use $\nabla_\theta \log \pi_\theta(a|s) \cdot A(s,a)$ . When $A > 0$ (action was better than average), the action probability increases. When $A < 0$ (worse than average), it decreases. This is much more stable than updating on raw returns, which can push all action probabilities in the same direction when overall returns are high.

Q6: How would you formulate an LLM chat assistant as an MDP?

State $S$ : the conversation history (all tokens generated so far) - in practice, the hidden states of the transformer. Action $A$ : the next token to generate from a vocabulary of size ~50,000. Transition $P$ : deterministic from the LLM's perspective - given a state and action (token choice), the new state is the old state concatenated with that token. Reward $R$ : typically sparse - a reward model score $r(x, y)$ for the full response $y$ given prompt $x$ , received only at the end of generation. Episode: one full turn of the conversation (prompt → complete response). Discount $\gamma$ : often 1.0 for RLHF since episodes are short. The main challenges: (1) sparse reward (only at episode end) makes credit assignment hard; (2) enormous action space (50K tokens per step); (3) reward hacking (the model learns to game the reward model rather than genuinely improving); (4) the KL-from-reference penalty added in PPO-RLHF is needed to prevent the policy from drifting too far from the pretrained model.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the MDP GridWorld demo on the EngineersOfAI Playground - no code required.

:::

The Real Engineering Moment​

Why This Exists: The Sequential Decision Problem​

Historical Context​

The RL Setup: Five Components​

Formal MDP Definition​

State Space SSS​

Action Space AAA​

Transition Function P(s′∣s,a)P(s' | s, a)P(s′∣s,a)​

Reward Function R(s,a)R(s, a)R(s,a)​

Discount Factor γ∈[0,1)\gamma \in [0, 1)γ∈[0,1)​

The Discount Factor γ: Deep Dive​

The Markov Property​

Finite vs Infinite Horizon MDPs​

Policy: What the Agent Does​

Deterministic Policy​

Stochastic Policy​

Tabular vs Function Approximation​

Value Functions​

State Value Function Vπ(s)V^\pi(s)Vπ(s)​

Action-Value Function Qπ(s,a)Q^\pi(s, a)Qπ(s,a)​

Relationship Between V and Q​

The Bellman Equations​

Bellman Expectation Equation for V​

Bellman Expectation Equation for Q​

Bellman Optimality Equations​

The Bellman Backup: Visual Intuition​

MDP Visualization as a Directed Graph​

GridWorld: Hand-Computed Value Function​

OpenAI Gym: Environment Interaction Loop​

NumPy Policy Evaluation: Full Implementation​

Advantage Function A(s,a)A(s, a)A(s,a)​

The Exploration-Exploitation Dilemma​

The ϵ\epsilonϵ-Greedy Strategy​

Partial Observability: POMDPs​

Real-World MDP Formulations​

Recommendation System MDP​

Algorithmic Trading MDP​

Robot Locomotion MDP​

When to Use RL vs Supervised Learning​

Common Mistakes​

Three Algorithm Families​

YouTube Resources​

Interview Q&A​