Model-free temporal difference learning - Q-learning for off-policy control and SARSA for on-policy control. Understand TD vs MC vs DP, convergence conditions, eligibility traces, Double Q-learning, and implement Q-tables in NumPy.

How does SARSA work in practice?

Q-Learning and SARSA covers Q-learning, SARSA, temporal difference learning from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/reinforcement-learning/q-learning-and-sarsa

What is the difference between Q-learning and temporal difference learning?

See the full breakdown at https://engineersofai.com/docs/ml/reinforcement-learning/q-learning-and-sarsa

Q-Learning and SARSA

Reading time: ~40 min | Interview relevance: High | Target roles: ML Engineer, Research Engineer, AI Engineer

The Real Engineering Moment

It is 1989. Chris Watkins, a PhD student at Cambridge, is working on a problem that has stumped researchers for a decade: how do you teach a machine to make good decisions when you have no model of the world? You cannot enumerate transitions. You cannot compute the Bellman equation exactly. You only have one tool - the ability to try things and observe what happens.

The idea that emerges from his thesis is deceptively simple. After taking an action and observing the reward and next state, you update your estimate of how good that action was - not by looking at the full future, but by comparing your prediction to a one-step improvement of it. You bootstrap. You use your current estimates as targets for updating your current estimates. This is circular reasoning that somehow converges.

Twenty-five years later, DeepMind's team would take this exact update rule, attach it to a convolutional neural network reading raw pixels from an Atari 2600 game, and produce an agent that beat human professionals at Breakout and Pong. The algorithm was Q-learning. The paper was DQN. The equation at its core was Watkins' 1989 insight, unchanged.

What makes Q-learning so powerful is what it does NOT need. No transition model. No reward model. No knowledge of the environment's internal workings whatsoever. Just experience tuples - what happened, what reward came, where you ended up. From those raw observations, Q-learning recovers the optimal policy. That is a remarkable guarantee, and understanding why it holds - and when it fails - is the foundation for all modern deep RL.

In this lesson we go deep. We trace the lineage from dynamic programming through Monte Carlo to TD learning. We implement Q-learning and SARSA from scratch and see why they behave differently on the Cliff Walking problem. We examine the convergence conditions that make tabular Q-learning reliable. And we see how these ideas extend to eligibility traces, n-step returns, and Double Q-learning - the building blocks that became DQN.

Why This Exists: The DP-to-Model-Free Gap

To understand why Q-learning matters, you need to understand what it replaced.

Dynamic Programming (DP) - Value Iteration and Policy Iteration - gives you exact solutions to MDPs. Value Iteration computes the true $Q^*(s,a)$ by iterating the Bellman optimality equation:

$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')$

This is beautiful mathematics. It converges. It is provably correct. There is one problem: it requires $P(s'|s,a)$ - the full transition probability for every state-action pair - and $R(s,a)$ for all states. For real problems, you almost never have this.

Monte Carlo (MC) methods broke away from needing a model. Instead of using $P(s'|s,a)$ , they run actual episodes and use the observed returns as estimates of value:

$V(s_t) \leftarrow V(s_t) + \alpha \left[G_t - V(s_t)\right]$

where $G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$ is the actual return from time $t$ to end of episode.

MC is model-free - it learns from experience. But it has two problems: (1) it requires waiting until the end of the episode before updating, which is slow and impractical for long or continuing tasks; (2) the return $G_t$ has very high variance because it accumulates all future randomness.

Temporal Difference (TD) learning - the family that includes Q-learning and SARSA - combines the best of both. Like MC, it is model-free: it learns from experience. Like DP, it bootstraps: it updates estimates using other estimates without waiting for the full trajectory. This is the core idea of TD learning, and it is what makes Q-learning practical.

Historical Context

Year	Paper	Contribution
1957	Bellman - Dynamic Programming	Bellman equations, optimal substructure
1988	Sutton - Learning to predict by the methods of TD	First formal TD methods
1989	Watkins - Learning from delayed rewards (PhD thesis)	Q-learning algorithm
1992	Watkins & Dayan - Q-learning	Convergence proof
1994	Rummery & Niranjan - On-line Q-learning using connectionist systems	SARSA (called modified Q-learning)
1995	Singh - Reinforcement learning with replacing eligibility traces	Eligibility traces formalization
2010	van Hasselt - Double Q-learning	Maximization bias fix
2013	Mnih et al. - Playing Atari with deep RL	DQN (Q-learning + neural nets)

The naming of SARSA is an interesting story. The update rule involves a quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ - taking each first letter gives SARSA. Rummery and Niranjan called it "modified connectionist Q-learning," but Sutton renamed it SARSA in his 1996 paper, and the name stuck.

The Model-Free Setting

In dynamic programming, we had the full environment model: $P(s'|s,a)$ and $R(s,a)$ . With it, we could compute the Bellman update exactly.

In the model-free setting, we only have access to experience tuples: $(s_t, a_t, r_t, s_{t+1})$

We observe what happened but cannot enumerate all possible transitions. The key insight of TD learning: you can still update value estimates using these sampled transitions, even without the full model.

The core idea - Temporal Difference (TD) Error:

$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$

For Q-values:

$\delta_t = r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$

This is the difference between:

What you expected: $Q(s_t, a_t)$
What you got plus what you now estimate: $r_t + \gamma Q(s_{t+1}, a_{t+1})$

If $\delta_t > 0$ : this (state, action) was better than expected - increase its Q-value. If $\delta_t < 0$ : this (state, action) was worse than expected - decrease its Q-value.

The update: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \cdot \delta_t$

where $\alpha \in (0,1]$ is the learning rate.

TD vs MC vs DP: The Bias-Variance Tradeoff

This is one of the most important conceptual frameworks in all of RL. Every algorithm sits somewhere on this spectrum.

DP: Uses the true Bellman equation. Targets $r + \gamma \max_{a'} Q^*(s', a')$ computed from the true model. Zero variance (no sampling), zero bias (exact model). Requires full model - rarely available.

Monte Carlo: Targets are actual episode returns $G_t$ . Unbiased - the return is the true value under the current policy (no approximation). High variance - $G_t$ is the sum of all future random rewards.

TD(0): Targets are $r_t + \gamma Q(s_{t+1}, a_{t+1})$ - a one-step estimate. Biased because $Q(s_{t+1}, a_{t+1})$ is an approximation of the true future value. Low variance because it only involves one step of randomness.

The practical answer: TD methods almost always outperform MC in practice, despite the bias. The variance reduction is worth more than the bias introduced, especially in environments with long horizons. The sweet spot is n-step TD, covered later.

:::tip Intuition for Bias vs Variance Think of shooting free throws. High variance = your shots land all over the place but center on the basket. High bias = your shots consistently miss to the left. TD learning introduces bias (systematic error from bootstrapping) but dramatically reduces variance (variance from all future randomness). In practice, you can correct bias over time with better value estimates. You cannot easily average away extreme variance. :::

Exploration vs Exploitation

Before introducing the algorithms, a critical engineering challenge: how does the agent decide what action to take while learning?

Pure exploitation: Always take $\arg\max_a Q(s, a)$ . Problem: never tries new actions, gets stuck in suboptimal policies.

Pure exploration: Take random actions. Problem: never uses what it has learned.

ε-greedy: The standard practical solution.

$a_t = \begin{cases} \arg\max_a Q(s_t, a) & \text{with probability } 1-\varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}$

In practice, $\varepsilon$ is annealed over training - start high (1.0) to explore broadly, decay to a small value (0.01) to mostly exploit once the Q-values are reasonable.

import numpy as np

class EpsilonGreedy:
    def __init__(
        self,
        n_actions: int,
        epsilon_start: float = 1.0,
        epsilon_end: float = 0.01,
        epsilon_decay: float = 0.995,
    ):
        self.n_actions = n_actions
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay

    def select_action(self, q_values: np.ndarray) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(q_values))

    def decay(self):
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)

Other exploration strategies:

Softmax / Boltzmann: Sample actions with probability proportional to $\exp(Q(s,a)/\tau)$ . Temperature $\tau$ controls exploration. Naturally explores more for actions with similar Q-values.
UCB (Upper Confidence Bound): Select $\arg\max_a [Q(s,a) + c\sqrt{\ln t / N(s,a)}]$ where $N(s,a)$ is the visit count. Prefers less-visited actions. Theoretically principled for bandits.
Thompson Sampling: Maintain a distribution over Q-values, sample from it, then act greedily. Bayesian and principled.
Entropy regularization: Used in Soft Actor-Critic - add entropy bonus to reward, so the agent is intrinsically motivated to stay uncertain (explored more in the Actor-Critic lesson).

Q-Learning: Off-Policy TD Control

Q-learning updates toward the maximum Q-value in the next state, regardless of what action the policy actually takes:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$

The target is $r_t + \gamma \max_{a'} Q(s_{t+1}, a')$ - this is the Bellman optimality equation treated as a regression target.

Why "off-policy"? The update target uses $\max_{a'} Q(s_{t+1}, a')$ - the greedy action - but the agent may be taking a different (exploratory) action due to ε-greedy. The learned Q-function converges to $Q^*$ regardless of the behavior policy, as long as all state-action pairs are visited sufficiently often.

Intuition for the update: You are at state $s$ . You take action $a$ . You observe reward $r$ and land in $s'$ . Your old estimate was $Q(s,a)$ . A better estimate is: take the reward you just received ( $r$ ) plus the discounted best you can do from here ( $\gamma \max_{a'} Q(s', a')$ ). Move your estimate a fraction $\alpha$ toward this better estimate. Repeat until the estimate is accurate.

import numpy as np
import gymnasium as gym
from collections import defaultdict

class QLearningAgent:
    def __init__(
        self,
        n_actions: int,
        alpha: float = 0.1,
        gamma: float = 0.99,
        epsilon_start: float = 1.0,
        epsilon_end: float = 0.01,
        epsilon_decay: float = 0.995,
    ):
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay

        # Q-table: default to 0 for any unseen (state, action)
        self.Q = defaultdict(lambda: np.zeros(n_actions))

    def act(self, state) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.Q[state]))

    def update(self, state, action: int, reward: float, next_state, done: bool):
        """Q-learning update rule."""
        current_q = self.Q[state][action]

        if done:
            target = reward  # No future reward if episode ended
        else:
            # KEY: use max over all actions in next state - greedy, not actual
            target = reward + self.gamma * np.max(self.Q[next_state])

        td_error = target - current_q
        self.Q[state][action] += self.alpha * td_error

    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)


def train_q_learning(env_name: str = "FrozenLake-v1", n_episodes: int = 5000):
    env = gym.make(env_name)
    agent = QLearningAgent(n_actions=env.action_space.n)

    episode_rewards = []

    for episode in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0.0
        done = False

        while not done:
            action = agent.act(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            agent.update(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward

        agent.decay_epsilon()
        episode_rewards.append(total_reward)

        if (episode + 1) % 500 == 0:
            avg = np.mean(episode_rewards[-500:])
            print(f"Episode {episode+1:5d} | ε={agent.epsilon:.3f} | Avg reward: {avg:.3f}")

    return agent, episode_rewards


agent, rewards = train_q_learning("FrozenLake-v1", n_episodes=5000)

Typical output on FrozenLake (4×4 grid, slippery):

Episode  500 | ε=0.075 | Avg reward: 0.234
Episode 1000 | ε=0.010 | Avg reward: 0.628
Episode 1500 | ε=0.010 | Avg reward: 0.712
Episode 2000 | ε=0.010 | Avg reward: 0.738
Episode 5000 | ε=0.010 | Avg reward: 0.761

SARSA: On-Policy TD Control

SARSA is named after the quintuple it uses: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ :

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$

The key difference from Q-learning: the target uses $Q(s_{t+1}, a_{t+1})$ - the action actually taken in the next state under the current policy (not the max).

Why "on-policy"? SARSA learns the Q-function for the policy it is currently following. If you use ε-greedy, SARSA learns $Q^{\varepsilon\text{-greedy}}$ , not $Q^*$ . The Q-values reflect the cost of occasionally taking random actions. This is a crucial distinction.

class SARSAAgent:
    def __init__(
        self,
        n_actions: int,
        alpha: float = 0.1,
        gamma: float = 0.99,
        epsilon_start: float = 1.0,
        epsilon_end: float = 0.01,
        epsilon_decay: float = 0.995,
    ):
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.Q = defaultdict(lambda: np.zeros(n_actions))

    def act(self, state) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.Q[state]))

    def update(
        self,
        state, action: int,
        reward: float,
        next_state, next_action: int,
        done: bool,
    ):
        """SARSA update rule - uses next_action, not max."""
        current_q = self.Q[state][action]

        if done:
            target = reward
        else:
            # SARSA: use the actual next action taken - NOT argmax
            target = reward + self.gamma * self.Q[next_state][next_action]

        self.Q[state][action] += self.alpha * (target - current_q)


def train_sarsa(env_name: str = "FrozenLake-v1", n_episodes: int = 5000):
    env = gym.make(env_name)
    agent = SARSAAgent(n_actions=env.action_space.n)

    episode_rewards = []

    for episode in range(n_episodes):
        state, _ = env.reset()
        action = agent.act(state)  # Choose first action before the loop
        total_reward = 0.0
        done = False

        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            if done:
                next_action = 0  # Doesn't matter for terminal state
            else:
                next_action = agent.act(next_state)  # Choose NEXT action NOW

            # SARSA update: uses (s, a, r, s', a') - the actual next action
            agent.update(state, action, reward, next_state, next_action, done)

            state = next_state
            action = next_action  # Carry the chosen action forward
            total_reward += reward

        agent.epsilon = max(agent.epsilon_end, agent.epsilon * agent.epsilon_decay)
        episode_rewards.append(total_reward)

    return agent, episode_rewards

Notice the structural difference in the training loop: SARSA must choose the next action before the update, so it can include $a_{t+1}$ in the update. Q-learning does not need this - it just takes the max.

Q-Learning vs SARSA: The Cliff Walking Example

The classic illustration of the on-policy vs off-policy difference uses Cliff Walking (Sutton & Barto, Example 6.6):

Start                              Goal
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│   │   │   │   │   │   │   │   │   │   │   │   │  row 0 (top)
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │   │   │   │   │   │   │   │   │   │   │  row 1
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ Q-learning path (optimal, dangerous)           │  row 2  ← Q-learning learns
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤         this row
│ S │ C │ C │ C │ C │ C │ C │ C │ C │ C │ C │ G │  row 3
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
  Start              Cliff (-100 reward, reset)   Goal

Environment: Every step costs -1. Falling off the cliff costs -100 and resets to Start. Goal has reward 0 (episode ends).

Q-learning (off-policy): Learns the optimal path - right along the cliff edge (row 2). The optimal policy walks the cliff edge because every step is -1 and the cliff is never in the optimal path. During training, ε-greedy causes occasional random steps into the cliff, producing terrible training episodes. But the learned $Q^*$ correctly identifies the edge path as optimal.

SARSA (on-policy): Learns the safe path - further from the cliff (row 1 or 2 but more cautious). Because SARSA evaluates the policy being executed (including its ε-greedy random steps), it learns that states near the cliff are dangerous - a random step from there costs -100. So SARSA naturally moves away from the cliff, accepting a longer path in exchange for safety.

Empirical result: During training, SARSA achieves much higher average reward than Q-learning (because Q-learning keeps falling off the cliff during exploration). But Q-learning's final converged policy is better than SARSA's (because Q-learning converged to $Q^*$ , the truly optimal policy).

import gymnasium as gym
import numpy as np
from collections import defaultdict
import matplotlib.pyplot as plt

def run_cliff_comparison(n_episodes: int = 500):
    """Compare Q-learning vs SARSA on CliffWalking-v0."""
    env = gym.make("CliffWalking-v0")
    n_actions = env.action_space.n

    q_agent = QLearningAgent(n_actions=n_actions, epsilon_start=0.1, epsilon_end=0.1)
    s_agent = SARSAAgent(n_actions=n_actions, epsilon_start=0.1, epsilon_end=0.1)

    q_rewards, s_rewards = [], []

    for episode in range(n_episodes):
        # Q-Learning episode
        state, _ = env.reset()
        total_r, done = 0.0, False
        while not done:
            a = q_agent.act(state)
            ns, r, term, trunc, _ = env.step(a)
            done = term or trunc
            q_agent.update(state, a, r, ns, done)
            state, total_r = ns, total_r + r
        q_rewards.append(total_r)

        # SARSA episode
        state, _ = env.reset()
        action = s_agent.act(state)
        total_r, done = 0.0, False
        while not done:
            ns, r, term, trunc, _ = env.step(action)
            done = term or trunc
            na = s_agent.act(ns) if not done else 0
            s_agent.update(state, action, r, ns, na, done)
            state, action, total_r = ns, na, total_r + r
        s_rewards.append(total_r)

    # Smooth rewards for plotting
    window = 10
    q_smooth = np.convolve(q_rewards, np.ones(window)/window, mode='valid')
    s_smooth = np.convolve(s_rewards, np.ones(window)/window, mode='valid')

    print(f"Q-Learning final policy avg: {np.mean(q_rewards[-100:]):.1f}")
    print(f"SARSA final policy avg:      {np.mean(s_rewards[-100:]):.1f}")
    # Expected: Q-Learning ≈ -13 (optimal) but noisy; SARSA ≈ -17 (safe path) but stable

run_cliff_comparison()

The engineering tradeoff:

If you're deploying the learned policy (and exploration stops): Q-learning gives a better final policy
If the learning policy IS the deployment policy (online learning, never stops exploring): SARSA is safer
This distinction maps directly to RLHF: during fine-tuning, you eventually stop training and deploy. So off-policy methods are appropriate - you care about the final policy, not what happens during training.

TD Error as a Learning Signal

The TD error is not just a training signal - it has a deep interpretation:

$\delta_t = r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$

Neuroscience connection: In 1997, Schultz et al. discovered that dopamine neurons in the brain fire according to a signal that looks exactly like TD error. Unexpected rewards (positive $\delta$ ) cause dopamine release; expected rewards do not; missing expected rewards (negative $\delta$ ) suppress dopamine. RL is not just a machine learning technique - it may describe how biological brains learn.

Engineering implication: Monitoring TD errors during training tells you a lot:

Large TD errors: model is learning a lot, or the environment is very noisy
TD errors consistently near zero: model has converged (or is stuck - check if Q-values are sensible)
TD errors oscillating wildly: potential instability; might need lower learning rate or gradient clipping
TD error suddenly spikes: distribution shift or bug in the environment

Convergence Guarantees

Q-learning converges to $Q^*$ under these conditions (Watkins & Dayan, 1992):

All state-action pairs are visited infinitely often (sufficient exploration - e.g., $\varepsilon > 0$ always)
Step sizes satisfy the Robbins-Monro conditions: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
Rewards are bounded: $|R| < \infty$

Condition 1 in practice: ε-greedy with $\varepsilon > 0$ satisfies this as long as the graph of reachable states is connected. In large state spaces, this requires many episodes.

Condition 2 in practice: A decaying schedule like $\alpha_t = 1/t$ satisfies this. Constant $\alpha$ (e.g., 0.1) does NOT satisfy condition 2 - but it still works in practice. With constant $\alpha$ , Q-learning converges to a neighborhood of $Q^*$ rather than exactly $Q^*$ . This is a acceptable tradeoff for non-stationary environments.

SARSA convergence: SARSA converges to $Q^*$ if: (1) same conditions as above, plus (2) the policy is GLIE (Greedy in the Limit with Infinite Exploration) - i.e., $\varepsilon_t \to 0$ but $\sum_t \varepsilon_t = \infty$ . With constant $\varepsilon$ , SARSA converges to the optimal ε-soft policy, not $Q^*$ .

:::warning With function approximation, convergence is not guaranteed The tabular convergence proofs do NOT extend to function approximation (neural networks). With a neural network Q-function, training can and does diverge. This is why DQN needed experience replay and target networks - covered in the next lesson. :::

n-Step TD and TD(λ)

Standard TD(0) uses a 1-step return: $r_t + \gamma Q(s_{t+1}, a_{t+1})$ .

n-step TD uses a longer lookahead before bootstrapping:

$G_t^{(n)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma^n Q(s_{t+n}, a_{t+n})$

$n$	Target	Bias	Variance
1 (TD)	$r_t + \gamma Q(s_{t+1})$	High	Low
5	$r_t + \ldots + \gamma^4 r_{t+4} + \gamma^5 Q(s_{t+5})$	Medium	Medium
$\infty$ (MC)	$G_t$ (actual return)	Zero	High

In practice, $n = 5$ to $n = 20$ often works better than either extreme. Rainbow DQN uses $n = 3$ .

TD(λ) elegantly interpolates using eligibility traces - a weighted average of ALL $n$ -step returns:

$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$

The weight given to the $n$ -step return decays geometrically with $\lambda$ . At $\lambda=0$ : pure TD(0). At $\lambda=1$ : pure Monte Carlo.

Eligibility traces implement this efficiently without storing all future returns. Each state-action pair has a trace $e(s,a)$ that accumulates evidence:

$e_t(s,a) = \begin{cases} \gamma \lambda \cdot e_{t-1}(s,a) + 1 & \text{if } s=s_t, a=a_t \\ \gamma \lambda \cdot e_{t-1}(s,a) & \text{otherwise} \end{cases}$

The update uses all traces simultaneously:

$Q(s,a) \leftarrow Q(s,a) + \alpha \delta_t e_t(s,a) \quad \text{for all } (s,a)$

class SARSALambdaAgent:
    """SARSA with eligibility traces (TD(λ))."""

    def __init__(
        self,
        n_states: int,
        n_actions: int,
        alpha: float = 0.1,
        gamma: float = 0.99,
        lam: float = 0.9,  # λ - the trace decay parameter
        epsilon: float = 0.1,
    ):
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.lam = lam
        self.epsilon = epsilon

        self.Q = np.zeros((n_states, n_actions))
        self.e = np.zeros((n_states, n_actions))  # eligibility traces

    def act(self, state: int) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.Q[state]))

    def reset_traces(self):
        """Reset eligibility traces at the start of each episode."""
        self.e[:] = 0.0

    def update(self, s: int, a: int, r: float, s_next: int, a_next: int, done: bool):
        # TD error
        if done:
            delta = r - self.Q[s, a]
        else:
            delta = r + self.gamma * self.Q[s_next, a_next] - self.Q[s, a]

        # Increment trace for current (s, a)
        self.e[s, a] += 1.0

        # Update ALL Q-values proportional to their eligibility trace
        self.Q += self.alpha * delta * self.e

        # Decay all traces
        self.e *= self.gamma * self.lam

        if done:
            self.reset_traces()

In practice, $\lambda \in [0.9, 0.99]$ often works best. TD(λ) is rarely used with neural networks because maintaining per-parameter traces is memory-intensive, but it is important conceptually and works very well in tabular settings.

Double Q-Learning: Fixing Maximization Bias

Standard Q-learning has a subtle but significant problem: the $\max$ in the target overestimates Q-values.

Why? The target is $r + \gamma \max_{a'} Q(s', a')$ . If the Q-values have any estimation noise (which they always do early in training), the maximum over noisy values is systematically higher than the maximum of the true values. This is maximization bias.

Proof by simple example: Suppose there are 5 actions, all with true Q-value 0. Your estimates are $[0.1, -0.2, 0.3, -0.1, 0.2]$ (noise around 0). The true max is 0, but $\max$ of your estimates is 0.3. You consistently overestimate by using the maximum of noisy estimates.

Double Q-Learning (van Hasselt, 2010) fixes this by maintaining two independent Q-tables, $Q_A$ and $Q_B$ :

Use $Q_A$ to select the best action: $a^* = \arg\max_{a'} Q_A(s', a')$
Use $Q_B$ to evaluate it: target = $r + \gamma Q_B(s', a^*)$
Alternate which table is updated

Because the action selection and value evaluation use independent estimates, the overestimation bias is substantially reduced.

class DoubleQLearningAgent:
    """Double Q-learning with two independent Q-tables."""

    def __init__(
        self,
        n_states: int,
        n_actions: int,
        alpha: float = 0.1,
        gamma: float = 0.99,
        epsilon_start: float = 1.0,
        epsilon_end: float = 0.01,
        epsilon_decay: float = 0.995,
    ):
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay

        # Two independent Q-tables
        self.Q_A = np.zeros((n_states, n_actions))
        self.Q_B = np.zeros((n_states, n_actions))

    def act(self, state: int) -> int:
        # Act based on sum of both tables
        q_combined = self.Q_A[state] + self.Q_B[state]
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(q_combined))

    def update(self, s: int, a: int, r: float, s_next: int, done: bool):
        # Randomly decide which table to update
        if np.random.random() < 0.5:
            # Update Q_A using Q_B for evaluation
            if done:
                target = r
            else:
                best_action = np.argmax(self.Q_A[s_next])  # Select with Q_A
                target = r + self.gamma * self.Q_B[s_next, best_action]  # Evaluate with Q_B
            self.Q_A[s, a] += self.alpha * (target - self.Q_A[s, a])
        else:
            # Update Q_B using Q_A for evaluation
            if done:
                target = r
            else:
                best_action = np.argmax(self.Q_B[s_next])  # Select with Q_B
                target = r + self.gamma * self.Q_A[s_next, best_action]  # Evaluate with Q_A
            self.Q_B[s, a] += self.alpha * (target - self.Q_B[s, a])

    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)

Double Q-learning is the tabular precursor to Double DQN. In DQN, the two tables become the online network and the target network - an elegant reuse of the existing infrastructure.

Function Approximation and the Deadly Triad

All of the above assumes a tabular Q-function - one value per (state, action) pair. This breaks down when state spaces are continuous or too large to enumerate (e.g., Atari pixel inputs: $210 \times 160 \times 3 = 100{,}800$ dimensions).

Linear function approximation: Replace the Q-table with a linear model:

$Q(s, a; \mathbf{w}) = \mathbf{w}^\top \phi(s, a)$

where $\phi(s,a)$ is a feature vector. The update becomes a gradient descent step:

$\mathbf{w} \leftarrow \mathbf{w} + \alpha \delta_t \nabla_\mathbf{w} Q(s_t, a_t; \mathbf{w}) = \mathbf{w} + \alpha \delta_t \phi(s_t, a_t)$

Linear FA is guaranteed to converge with TD learning (under on-policy sampling). But neural networks are different - and this is where the deadly triad appears.

The Deadly Triad (Sutton & Barto, Chapter 11): Three elements that together cause divergence:

Function approximation - generalizes: updating one Q-value affects neighboring states
Bootstrapping - targets depend on current estimates, which are being updated
Off-policy learning - distribution of updates doesn't match the distribution of visits

All three are present in standard DQN. Why does DQN succeed despite the deadly triad? Experience replay partially addresses distribution mismatch, and target networks stabilize bootstrapping. But DQN can still diverge with bad hyperparameters - the deadly triad is a warning, not a guarantee of failure.

:::danger The Deadly Triad in production RL If you are training a neural network with Q-learning and see Q-values grow without bound (loss blows up, Q-values → ∞), the deadly triad is your diagnosis. Fixes in order of impact: (1) add or reduce target network update frequency, (2) reduce learning rate, (3) add gradient clipping, (4) use Double DQN to reduce overestimation feedback loops. Never ignore diverging Q-values - they will not stabilize on their own. :::

GridWorld Implementation: Full NumPy Example

A complete GridWorld environment and Q-learning agent using only NumPy, useful for interviews and understanding the mechanics:

import numpy as np
from typing import Tuple

class GridWorld:
    """
    Simple 4x4 GridWorld:
    S: start (0,0)
    G: goal (3,3), reward +1
    H: hole (1,1) and (2,2), reward -1, terminal
    All other steps: reward -0.01
    """
    def __init__(self, size: int = 4):
        self.size = size
        self.holes = {(1, 1), (2, 2)}
        self.goal = (size - 1, size - 1)
        self.state = (0, 0)

        # Actions: 0=up, 1=right, 2=down, 3=left
        self.action_deltas = [(-1, 0), (0, 1), (1, 0), (0, -1)]
        self.n_actions = 4

    def reset(self) -> Tuple[int, int]:
        self.state = (0, 0)
        return self.state

    def step(self, action: int) -> Tuple[Tuple, float, bool]:
        dr, dc = self.action_deltas[action]
        r, c = self.state
        nr = max(0, min(self.size - 1, r + dr))
        nc = max(0, min(self.size - 1, c + dc))
        self.state = (nr, nc)

        if self.state == self.goal:
            return self.state, 1.0, True
        elif self.state in self.holes:
            return self.state, -1.0, True
        else:
            return self.state, -0.01, False

    def state_to_idx(self, state: Tuple) -> int:
        return state[0] * self.size + state[1]

    @property
    def n_states(self):
        return self.size * self.size


def train_gridworld_q_learning(n_episodes: int = 2000):
    env = GridWorld(size=4)
    n_states = env.n_states
    n_actions = env.n_actions

    Q = np.zeros((n_states, n_actions))
    alpha, gamma = 0.1, 0.99
    epsilon = 1.0
    epsilon_min, epsilon_decay = 0.01, 0.995

    episode_rewards = []

    for ep in range(n_episodes):
        state = env.reset()
        s_idx = env.state_to_idx(state)
        total_reward = 0.0
        done = False

        while not done:
            # ε-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = np.argmax(Q[s_idx])

            next_state, reward, done = env.step(action)
            ns_idx = env.state_to_idx(next_state)

            # Q-learning update
            if done:
                target = reward
            else:
                target = reward + gamma * np.max(Q[ns_idx])

            Q[s_idx, action] += alpha * (target - Q[s_idx, action])

            s_idx = ns_idx
            total_reward += reward

        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        episode_rewards.append(total_reward)

    # Print learned Q-table as policy arrows
    action_symbols = ['↑', '→', '↓', '←']
    print("\nLearned Policy:")
    for r in range(env.size):
        row = ""
        for c in range(env.size):
            if (r, c) == env.goal:
                row += " G "
            elif (r, c) in env.holes:
                row += " H "
            else:
                s_idx = env.state_to_idx((r, c))
                best_action = np.argmax(Q[s_idx])
                row += f" {action_symbols[best_action]} "
        print(row)

    return Q, episode_rewards


Q, rewards = train_gridworld_q_learning()

Expected output:

Learned Policy:
 →  →  →  ↓
 ↑  H  →  ↓
 ↑  →  H  ↓
 ↑  →  →  G

Production Engineering Notes

Hyperparameter guidelines for tabular Q-learning:

Hyperparameter	Typical range	Effect of increasing
α (learning rate)	0.01 – 0.5	Faster but less stable convergence
γ (discount)	0.9 – 0.999	More long-term planning; slower convergence
ε start	0.5 – 1.0	More exploration early
ε end	0.01 – 0.1	More vs less exploration at convergence
ε decay	0.99 – 0.9999	Slower vs faster transition to exploitation

Monitoring signals to track:

Episode reward (smoothed over 100 episodes): should trend upward
TD error magnitude: should decrease over training
Epsilon: should decrease smoothly, not drop too fast
Policy greedy reward (occasional evaluation episodes with ε=0): true policy quality

When to use which algorithm:

Scenario	Algorithm
Tabular, offline evaluation	Q-learning
Online deployment, safety-critical	SARSA
Long horizons, dense rewards	SARSA(λ) or n-step TD
Large discrete action space	Double Q-learning
Continuous state space	DQN (next lesson)

Common Mistakes

:::danger Forgetting to reset eligibility traces between episodes In SARSA(λ), traces carry over within an episode but MUST be reset at the start of each new episode. If you forget, the traces from the previous episode corrupt the updates at the start of the next episode. Always call reset_traces() in the episode loop's initialization. :::

:::danger Updating SARSA with next_action chosen AFTER the update SARSA requires choosing $a_{t+1}$ BEFORE the update so that the actual next action is used in the target. If you choose $a_{t+1}$ after the update, you're accidentally implementing Q-learning on-policy (which is almost always suboptimal). The training loop must look like: choose $a_t$ , step, choose $a_{t+1}$ , update, advance. :::

:::warning Using Q-learning in safety-critical online settings Q-learning is off-policy - it ignores exploratory actions in its value estimates. In safety-critical online learning (e.g., a physical robot that never stops running), the exploration policy IS the deployment policy. Q-learning learns the optimal policy but the exploration behavior may be dangerous. Use SARSA, which accounts for the exploratory actions in its estimates. :::

:::warning Decaying ε too fast If ε reaches 0.01 within the first 5% of training, the agent stops exploring before Q-values have had time to propagate through the state space. States far from the goal may never receive positive gradient signal. A general rule: ε should reach its final value at roughly the 50–70% mark of training, not earlier. :::

:::tip Debugging tip: inspect the Q-table For tabular problems, always print the learned Q-table and visualize the greedy policy. A policy that looks random or cycles is diagnostic: (1) if random → Q-values haven't converged (train longer or reduce ε faster); (2) if cycling → conflicting Q-values (check for state aliasing or very small alpha); (3) if stuck in local region → insufficient exploration (increase ε or ε-decay timescale). :::

YouTube Resources

Title	Channel	Why Watch
Q-Learning Explained	The Coding Train	Visual, beginner-friendly walkthrough with a maze example - great for building intuition before the math
Temporal Difference Learning - David Silver RL Course Lecture 4	DeepMind / UCL	The canonical lecture - Silver covers TD(0), SARSA, Q-learning with full derivations and proofs
Q-Learning and SARSA Comparison	sentdex	Side-by-side Python implementation on FrozenLake, good for seeing practical differences
Eligibility Traces and TD Lambda	Mutual Information	Clear explanation of why TD(λ) interpolates MC and TD, with visual intuition for traces
Reinforcement Learning Explained - Full Course	Andrej Karpathy (Stanford)	Broader context - how tabular RL fits into the modern deep RL landscape

Interview Q&A

Q1: What is the difference between Q-learning and SARSA? When would you prefer each?

Both are TD methods that learn Q-functions from experience without a model. Q-learning is off-policy: the update target uses $\max_{a'} Q(s', a')$ - the greedy action - regardless of what the agent actually does during exploration. This means Q-learning learns $Q^*$ , the optimal Q-function, even while behaving non-optimally. SARSA is on-policy: the update target uses $Q(s', a')$ where $a'$ is the action actually taken by the current (exploratory) policy. SARSA converges to $Q^\pi$ where $\pi$ is the current behavior policy, not necessarily $Q^*$ .

Prefer Q-learning when: you want to learn the optimal policy and will stop exploring before deployment (this is the typical offline training case). Prefer SARSA in online settings where the exploration policy continues operating indefinitely - SARSA accounts for exploration mistakes and learns a safer policy in the presence of dangerous states. The Cliff Walking example perfectly illustrates this: Q-learning learns the fast-but-risky cliff-edge path; SARSA learns the safe inland path that avoids occasional random steps off the cliff.

Q2: What is the temporal difference error and why does it matter?

The TD error $\delta_t = r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$ is the difference between the "target" (what you just observed plus the discounted future estimate) and the "prediction" (your current Q-value estimate). It is the core learning signal in all TD methods. If $\delta > 0$ , the outcome was better than predicted - increase Q for this (state, action). If $\delta < 0$ , decrease Q.

The TD error matters for two reasons. First, it allows learning without waiting for the full episode - you update after each step using bootstrapping, which makes learning online and efficient. Second, in neuroscience, Schultz et al. (1997) showed that dopamine neuron firing patterns in primates match the TD error signal: unexpected rewards trigger dopamine release (positive TD error), expected rewards do not (zero TD error), and omission of expected rewards suppresses dopamine (negative TD error). This suggests TD learning may be how biological brains solve the reward learning problem.

Q3: Why does Q-learning sometimes fail with function approximation, and what is the deadly triad?

With tabular Q-learning, each update affects exactly one entry in the table. With a neural network, each update affects the entire function - nearby states' Q-values shift too, because the network generalizes. This creates three interacting failure modes that Sutton calls the deadly triad: (1) function approximation - generalization means updates have side effects on other states; (2) bootstrapping - targets depend on current estimates, which are being updated simultaneously, creating a moving target problem; (3) off-policy learning - the distribution of states being trained on differs from the distribution encountered during the current policy.

Any two of these three components is manageable. All three together cause divergence. DQN manages the triad with: experience replay (partially addresses distribution mismatch, breaks temporal correlation) and target networks (stabilizes the moving target problem by freezing bootstrap targets for thousands of steps). Even so, DQN can diverge with bad hyperparameters - the deadly triad is a fundamental instability, not a solved problem.

Q4: Explain ε-greedy and why ε is annealed. What are the alternatives?

ε-greedy selects the greedy (highest Q-value) action with probability $1-\varepsilon$ , and a random action with probability $\varepsilon$ . Annealing means starting with large $\varepsilon$ (typically 1.0 - fully random) and gradually reducing it to a small value (typically 0.01). The rationale: early in training, Q-values are random initialization noise - greedy selection has no meaning, so exploration is essential for the agent to discover what rewards are available. As training progresses, Q-values become meaningful estimates, so exploitation of learned knowledge becomes increasingly valuable. The final small $\varepsilon > 0$ maintains convergence guarantees (all states must be visited infinitely often) and helps in slightly non-stationary environments.

Alternatives: UCB (Upper Confidence Bound) selects $\arg\max_a [Q(s,a) + c\sqrt{\ln t / N(s,a)}]$ , preferring less-visited state-action pairs - theoretically principled for bandits. Boltzmann / softmax sampling selects actions proportional to $\exp(Q(s,a)/\tau)$ , providing more nuanced exploration that preferentially explores actions with similar Q-values. Thompson Sampling maintains uncertainty estimates over Q-values and samples from the posterior. In deep RL, entropy regularization (Soft Actor-Critic) is considered the most principled approach - the agent is rewarded for maintaining high entropy in its policy.

Q5: Explain Double Q-Learning. What problem does it solve?

Standard Q-learning uses $\max_{a'} Q(s', a')$ in the target. When Q-values have estimation noise (which they always do), the maximum of noisy estimates is systematically higher than the true maximum - this is maximization bias. Simple example: 5 actions all with true value 0, estimated as $[0.1, -0.2, 0.3, -0.1, 0.2]$ . True max is 0; estimated max is 0.3. The bias propagates: overestimated actions are selected more often, generating more data that reinforces the overestimation.

Double Q-Learning (van Hasselt, 2010) maintains two independent Q-tables $Q_A$ and $Q_B$ . For each update, it uses one to select the best action ( $a^* = \arg\max_{a'} Q_A(s', a')$ ) and the other to evaluate it (target = $r + \gamma Q_B(s', a^*)$ ). Because the selection and evaluation use independent (thus uncorrelated) noise, the overestimation bias is substantially reduced. Double DQN (2015) adapts this to neural networks using the online network for selection and the target network for evaluation - requiring no additional networks, just a two-line change to the target computation. Double DQN consistently outperforms vanilla DQN and is now considered the default.

Q6: How does tabular Q-learning connect to modern LLM training with RLHF?

They share the same objective - maximize expected cumulative reward - but operate at vastly different scales. RLHF uses PPO (a policy gradient method), not Q-learning, because the action space (50k+ token vocabulary) is too large for tabular Q-values and the continuous generation process suits policy gradient better. But the conceptual lineage is direct: (1) the reward model in RLHF plays the role of $R(s,a)$ in the tabular MDP; (2) the language model is the policy; (3) the KL penalty from the reference model prevents reward hacking just as the discount factor and bounded rewards prevent Q-values from diverging.

Q-learning's core insight - that you can learn an optimal policy from sampled interactions without knowing the environment model - is exactly what RLHF does. The environment (human preferences, real-world effects of generated text) is unknown and cannot be modeled. The agent (LLM) interacts with it, observations are collected (human ratings), and the policy is updated to maximize reward. Tabular Q-learning to RLHF is a straight conceptual line: same mathematics, 35 years of engineering in between.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Q-Learning in GridWorld demo on the EngineersOfAI Playground - no code required.

:::

The Real Engineering Moment​

Why This Exists: The DP-to-Model-Free Gap​

Historical Context​

The Model-Free Setting​

TD vs MC vs DP: The Bias-Variance Tradeoff​

Exploration vs Exploitation​

Q-Learning: Off-Policy TD Control​

SARSA: On-Policy TD Control​

Q-Learning vs SARSA: The Cliff Walking Example​

TD Error as a Learning Signal​

Convergence Guarantees​

n-Step TD and TD(λ)​

Double Q-Learning: Fixing Maximization Bias​

Function Approximation and the Deadly Triad​

GridWorld Implementation: Full NumPy Example​

Production Engineering Notes​

Common Mistakes​

YouTube Resources​

Interview Q&A​