Q-Learning and SARSA
Reading time: ~40 min | Interview relevance: High | Target roles: ML Engineer, Research Engineer, AI Engineer
The Real Engineering Moment
It is 1989. Chris Watkins, a PhD student at Cambridge, is working on a problem that has stumped researchers for a decade: how do you teach a machine to make good decisions when you have no model of the world? You cannot enumerate transitions. You cannot compute the Bellman equation exactly. You only have one tool - the ability to try things and observe what happens.
The idea that emerges from his thesis is deceptively simple. After taking an action and observing the reward and next state, you update your estimate of how good that action was - not by looking at the full future, but by comparing your prediction to a one-step improvement of it. You bootstrap. You use your current estimates as targets for updating your current estimates. This is circular reasoning that somehow converges.
Twenty-five years later, DeepMind's team would take this exact update rule, attach it to a convolutional neural network reading raw pixels from an Atari 2600 game, and produce an agent that beat human professionals at Breakout and Pong. The algorithm was Q-learning. The paper was DQN. The equation at its core was Watkins' 1989 insight, unchanged.
What makes Q-learning so powerful is what it does NOT need. No transition model. No reward model. No knowledge of the environment's internal workings whatsoever. Just experience tuples - what happened, what reward came, where you ended up. From those raw observations, Q-learning recovers the optimal policy. That is a remarkable guarantee, and understanding why it holds - and when it fails - is the foundation for all modern deep RL.
In this lesson we go deep. We trace the lineage from dynamic programming through Monte Carlo to TD learning. We implement Q-learning and SARSA from scratch and see why they behave differently on the Cliff Walking problem. We examine the convergence conditions that make tabular Q-learning reliable. And we see how these ideas extend to eligibility traces, n-step returns, and Double Q-learning - the building blocks that became DQN.
Why This Exists: The DP-to-Model-Free Gap
To understand why Q-learning matters, you need to understand what it replaced.
Dynamic Programming (DP) - Value Iteration and Policy Iteration - gives you exact solutions to MDPs. Value Iteration computes the true by iterating the Bellman optimality equation:
This is beautiful mathematics. It converges. It is provably correct. There is one problem: it requires - the full transition probability for every state-action pair - and for all states. For real problems, you almost never have this.
Monte Carlo (MC) methods broke away from needing a model. Instead of using , they run actual episodes and use the observed returns as estimates of value:
where is the actual return from time to end of episode.
MC is model-free - it learns from experience. But it has two problems: (1) it requires waiting until the end of the episode before updating, which is slow and impractical for long or continuing tasks; (2) the return has very high variance because it accumulates all future randomness.
Temporal Difference (TD) learning - the family that includes Q-learning and SARSA - combines the best of both. Like MC, it is model-free: it learns from experience. Like DP, it bootstraps: it updates estimates using other estimates without waiting for the full trajectory. This is the core idea of TD learning, and it is what makes Q-learning practical.
Historical Context
| Year | Paper | Contribution |
|---|---|---|
| 1957 | Bellman - Dynamic Programming | Bellman equations, optimal substructure |
| 1988 | Sutton - Learning to predict by the methods of TD | First formal TD methods |
| 1989 | Watkins - Learning from delayed rewards (PhD thesis) | Q-learning algorithm |
| 1992 | Watkins & Dayan - Q-learning | Convergence proof |
| 1994 | Rummery & Niranjan - On-line Q-learning using connectionist systems | SARSA (called modified Q-learning) |
| 1995 | Singh - Reinforcement learning with replacing eligibility traces | Eligibility traces formalization |
| 2010 | van Hasselt - Double Q-learning | Maximization bias fix |
| 2013 | Mnih et al. - Playing Atari with deep RL | DQN (Q-learning + neural nets) |
The naming of SARSA is an interesting story. The update rule involves a quintuple - taking each first letter gives SARSA. Rummery and Niranjan called it "modified connectionist Q-learning," but Sutton renamed it SARSA in his 1996 paper, and the name stuck.
The Model-Free Setting
In dynamic programming, we had the full environment model: and . With it, we could compute the Bellman update exactly.
In the model-free setting, we only have access to experience tuples:
We observe what happened but cannot enumerate all possible transitions. The key insight of TD learning: you can still update value estimates using these sampled transitions, even without the full model.
The core idea - Temporal Difference (TD) Error:
For Q-values:
This is the difference between:
- What you expected:
- What you got plus what you now estimate:
If : this (state, action) was better than expected - increase its Q-value. If : this (state, action) was worse than expected - decrease its Q-value.
The update:
where is the learning rate.
TD vs MC vs DP: The Bias-Variance Tradeoff
This is one of the most important conceptual frameworks in all of RL. Every algorithm sits somewhere on this spectrum.
DP: Uses the true Bellman equation. Targets computed from the true model. Zero variance (no sampling), zero bias (exact model). Requires full model - rarely available.
Monte Carlo: Targets are actual episode returns . Unbiased - the return is the true value under the current policy (no approximation). High variance - is the sum of all future random rewards.
TD(0): Targets are - a one-step estimate. Biased because is an approximation of the true future value. Low variance because it only involves one step of randomness.
The practical answer: TD methods almost always outperform MC in practice, despite the bias. The variance reduction is worth more than the bias introduced, especially in environments with long horizons. The sweet spot is n-step TD, covered later.
:::tip Intuition for Bias vs Variance Think of shooting free throws. High variance = your shots land all over the place but center on the basket. High bias = your shots consistently miss to the left. TD learning introduces bias (systematic error from bootstrapping) but dramatically reduces variance (variance from all future randomness). In practice, you can correct bias over time with better value estimates. You cannot easily average away extreme variance. :::
Exploration vs Exploitation
Before introducing the algorithms, a critical engineering challenge: how does the agent decide what action to take while learning?
Pure exploitation: Always take . Problem: never tries new actions, gets stuck in suboptimal policies.
Pure exploration: Take random actions. Problem: never uses what it has learned.
ε-greedy: The standard practical solution.
In practice, is annealed over training - start high (1.0) to explore broadly, decay to a small value (0.01) to mostly exploit once the Q-values are reasonable.
import numpy as np
class EpsilonGreedy:
def __init__(
self,
n_actions: int,
epsilon_start: float = 1.0,
epsilon_end: float = 0.01,
epsilon_decay: float = 0.995,
):
self.n_actions = n_actions
self.epsilon = epsilon_start
self.epsilon_end = epsilon_end
self.epsilon_decay = epsilon_decay
def select_action(self, q_values: np.ndarray) -> int:
if np.random.random() < self.epsilon:
return np.random.randint(self.n_actions)
return int(np.argmax(q_values))
def decay(self):
self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
Other exploration strategies:
- Softmax / Boltzmann: Sample actions with probability proportional to . Temperature controls exploration. Naturally explores more for actions with similar Q-values.
- UCB (Upper Confidence Bound): Select where is the visit count. Prefers less-visited actions. Theoretically principled for bandits.
- Thompson Sampling: Maintain a distribution over Q-values, sample from it, then act greedily. Bayesian and principled.
- Entropy regularization: Used in Soft Actor-Critic - add entropy bonus to reward, so the agent is intrinsically motivated to stay uncertain (explored more in the Actor-Critic lesson).
Q-Learning: Off-Policy TD Control
Q-learning updates toward the maximum Q-value in the next state, regardless of what action the policy actually takes:
The target is - this is the Bellman optimality equation treated as a regression target.
Why "off-policy"? The update target uses - the greedy action - but the agent may be taking a different (exploratory) action due to ε-greedy. The learned Q-function converges to regardless of the behavior policy, as long as all state-action pairs are visited sufficiently often.
Intuition for the update: You are at state . You take action . You observe reward and land in . Your old estimate was . A better estimate is: take the reward you just received () plus the discounted best you can do from here (). Move your estimate a fraction toward this better estimate. Repeat until the estimate is accurate.
import numpy as np
import gymnasium as gym
from collections import defaultdict
class QLearningAgent:
def __init__(
self,
n_actions: int,
alpha: float = 0.1,
gamma: float = 0.99,
epsilon_start: float = 1.0,
epsilon_end: float = 0.01,
epsilon_decay: float = 0.995,
):
self.n_actions = n_actions
self.alpha = alpha
self.gamma = gamma
self.epsilon = epsilon_start
self.epsilon_end = epsilon_end
self.epsilon_decay = epsilon_decay
# Q-table: default to 0 for any unseen (state, action)
self.Q = defaultdict(lambda: np.zeros(n_actions))
def act(self, state) -> int:
if np.random.random() < self.epsilon:
return np.random.randint(self.n_actions)
return int(np.argmax(self.Q[state]))
def update(self, state, action: int, reward: float, next_state, done: bool):
"""Q-learning update rule."""
current_q = self.Q[state][action]
if done:
target = reward # No future reward if episode ended
else:
# KEY: use max over all actions in next state - greedy, not actual
target = reward + self.gamma * np.max(self.Q[next_state])
td_error = target - current_q
self.Q[state][action] += self.alpha * td_error
def decay_epsilon(self):
self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
def train_q_learning(env_name: str = "FrozenLake-v1", n_episodes: int = 5000):
env = gym.make(env_name)
agent = QLearningAgent(n_actions=env.action_space.n)
episode_rewards = []
for episode in range(n_episodes):
state, _ = env.reset()
total_reward = 0.0
done = False
while not done:
action = agent.act(state)
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
agent.update(state, action, reward, next_state, done)
state = next_state
total_reward += reward
agent.decay_epsilon()
episode_rewards.append(total_reward)
if (episode + 1) % 500 == 0:
avg = np.mean(episode_rewards[-500:])
print(f"Episode {episode+1:5d} | ε={agent.epsilon:.3f} | Avg reward: {avg:.3f}")
return agent, episode_rewards
agent, rewards = train_q_learning("FrozenLake-v1", n_episodes=5000)
Typical output on FrozenLake (4×4 grid, slippery):
Episode 500 | ε=0.075 | Avg reward: 0.234
Episode 1000 | ε=0.010 | Avg reward: 0.628
Episode 1500 | ε=0.010 | Avg reward: 0.712
Episode 2000 | ε=0.010 | Avg reward: 0.738
Episode 5000 | ε=0.010 | Avg reward: 0.761
SARSA: On-Policy TD Control
SARSA is named after the quintuple it uses: :
The key difference from Q-learning: the target uses - the action actually taken in the next state under the current policy (not the max).
Why "on-policy"? SARSA learns the Q-function for the policy it is currently following. If you use ε-greedy, SARSA learns , not . The Q-values reflect the cost of occasionally taking random actions. This is a crucial distinction.
class SARSAAgent:
def __init__(
self,
n_actions: int,
alpha: float = 0.1,
gamma: float = 0.99,
epsilon_start: float = 1.0,
epsilon_end: float = 0.01,
epsilon_decay: float = 0.995,
):
self.n_actions = n_actions
self.alpha = alpha
self.gamma = gamma
self.epsilon = epsilon_start
self.epsilon_end = epsilon_end
self.epsilon_decay = epsilon_decay
self.Q = defaultdict(lambda: np.zeros(n_actions))
def act(self, state) -> int:
if np.random.random() < self.epsilon:
return np.random.randint(self.n_actions)
return int(np.argmax(self.Q[state]))
def update(
self,
state, action: int,
reward: float,
next_state, next_action: int,
done: bool,
):
"""SARSA update rule - uses next_action, not max."""
current_q = self.Q[state][action]
if done:
target = reward
else:
# SARSA: use the actual next action taken - NOT argmax
target = reward + self.gamma * self.Q[next_state][next_action]
self.Q[state][action] += self.alpha * (target - current_q)
def train_sarsa(env_name: str = "FrozenLake-v1", n_episodes: int = 5000):
env = gym.make(env_name)
agent = SARSAAgent(n_actions=env.action_space.n)
episode_rewards = []
for episode in range(n_episodes):
state, _ = env.reset()
action = agent.act(state) # Choose first action before the loop
total_reward = 0.0
done = False
while not done:
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
if done:
next_action = 0 # Doesn't matter for terminal state
else:
next_action = agent.act(next_state) # Choose NEXT action NOW
# SARSA update: uses (s, a, r, s', a') - the actual next action
agent.update(state, action, reward, next_state, next_action, done)
state = next_state
action = next_action # Carry the chosen action forward
total_reward += reward
agent.epsilon = max(agent.epsilon_end, agent.epsilon * agent.epsilon_decay)
episode_rewards.append(total_reward)
return agent, episode_rewards
Notice the structural difference in the training loop: SARSA must choose the next action before the update, so it can include in the update. Q-learning does not need this - it just takes the max.
Q-Learning vs SARSA: The Cliff Walking Example
The classic illustration of the on-policy vs off-policy difference uses Cliff Walking (Sutton & Barto, Example 6.6):
Start Goal
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ │ │ │ │ │ │ │ │ │ │ │ │ row 0 (top)
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ │ │ │ │ │ │ │ │ │ │ │ │ row 1
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ Q-learning path (optimal, dangerous) │ row 2 ← Q-learning learns
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤ this row
│ S │ C │ C │ C │ C │ C │ C │ C │ C │ C │ C │ G │ row 3
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
Start Cliff (-100 reward, reset) Goal
Environment: Every step costs -1. Falling off the cliff costs -100 and resets to Start. Goal has reward 0 (episode ends).
Q-learning (off-policy): Learns the optimal path - right along the cliff edge (row 2). The optimal policy walks the cliff edge because every step is -1 and the cliff is never in the optimal path. During training, ε-greedy causes occasional random steps into the cliff, producing terrible training episodes. But the learned correctly identifies the edge path as optimal.
SARSA (on-policy): Learns the safe path - further from the cliff (row 1 or 2 but more cautious). Because SARSA evaluates the policy being executed (including its ε-greedy random steps), it learns that states near the cliff are dangerous - a random step from there costs -100. So SARSA naturally moves away from the cliff, accepting a longer path in exchange for safety.
Empirical result: During training, SARSA achieves much higher average reward than Q-learning (because Q-learning keeps falling off the cliff during exploration). But Q-learning's final converged policy is better than SARSA's (because Q-learning converged to , the truly optimal policy).
import gymnasium as gym
import numpy as np
from collections import defaultdict
import matplotlib.pyplot as plt
def run_cliff_comparison(n_episodes: int = 500):
"""Compare Q-learning vs SARSA on CliffWalking-v0."""
env = gym.make("CliffWalking-v0")
n_actions = env.action_space.n
q_agent = QLearningAgent(n_actions=n_actions, epsilon_start=0.1, epsilon_end=0.1)
s_agent = SARSAAgent(n_actions=n_actions, epsilon_start=0.1, epsilon_end=0.1)
q_rewards, s_rewards = [], []
for episode in range(n_episodes):
# Q-Learning episode
state, _ = env.reset()
total_r, done = 0.0, False
while not done:
a = q_agent.act(state)
ns, r, term, trunc, _ = env.step(a)
done = term or trunc
q_agent.update(state, a, r, ns, done)
state, total_r = ns, total_r + r
q_rewards.append(total_r)
# SARSA episode
state, _ = env.reset()
action = s_agent.act(state)
total_r, done = 0.0, False
while not done:
ns, r, term, trunc, _ = env.step(action)
done = term or trunc
na = s_agent.act(ns) if not done else 0
s_agent.update(state, action, r, ns, na, done)
state, action, total_r = ns, na, total_r + r
s_rewards.append(total_r)
# Smooth rewards for plotting
window = 10
q_smooth = np.convolve(q_rewards, np.ones(window)/window, mode='valid')
s_smooth = np.convolve(s_rewards, np.ones(window)/window, mode='valid')
print(f"Q-Learning final policy avg: {np.mean(q_rewards[-100:]):.1f}")
print(f"SARSA final policy avg: {np.mean(s_rewards[-100:]):.1f}")
# Expected: Q-Learning ≈ -13 (optimal) but noisy; SARSA ≈ -17 (safe path) but stable
run_cliff_comparison()
The engineering tradeoff:
- If you're deploying the learned policy (and exploration stops): Q-learning gives a better final policy
- If the learning policy IS the deployment policy (online learning, never stops exploring): SARSA is safer
- This distinction maps directly to RLHF: during fine-tuning, you eventually stop training and deploy. So off-policy methods are appropriate - you care about the final policy, not what happens during training.
TD Error as a Learning Signal
The TD error is not just a training signal - it has a deep interpretation:
Neuroscience connection: In 1997, Schultz et al. discovered that dopamine neurons in the brain fire according to a signal that looks exactly like TD error. Unexpected rewards (positive ) cause dopamine release; expected rewards do not; missing expected rewards (negative ) suppress dopamine. RL is not just a machine learning technique - it may describe how biological brains learn.
Engineering implication: Monitoring TD errors during training tells you a lot:
- Large TD errors: model is learning a lot, or the environment is very noisy
- TD errors consistently near zero: model has converged (or is stuck - check if Q-values are sensible)
- TD errors oscillating wildly: potential instability; might need lower learning rate or gradient clipping
- TD error suddenly spikes: distribution shift or bug in the environment
Convergence Guarantees
Q-learning converges to under these conditions (Watkins & Dayan, 1992):
- All state-action pairs are visited infinitely often (sufficient exploration - e.g., always)
- Step sizes satisfy the Robbins-Monro conditions: and
- Rewards are bounded:
Condition 1 in practice: ε-greedy with satisfies this as long as the graph of reachable states is connected. In large state spaces, this requires many episodes.
Condition 2 in practice: A decaying schedule like satisfies this. Constant (e.g., 0.1) does NOT satisfy condition 2 - but it still works in practice. With constant , Q-learning converges to a neighborhood of rather than exactly . This is a acceptable tradeoff for non-stationary environments.
SARSA convergence: SARSA converges to if: (1) same conditions as above, plus (2) the policy is GLIE (Greedy in the Limit with Infinite Exploration) - i.e., but . With constant , SARSA converges to the optimal ε-soft policy, not .
:::warning With function approximation, convergence is not guaranteed The tabular convergence proofs do NOT extend to function approximation (neural networks). With a neural network Q-function, training can and does diverge. This is why DQN needed experience replay and target networks - covered in the next lesson. :::
n-Step TD and TD(λ)
Standard TD(0) uses a 1-step return: .
n-step TD uses a longer lookahead before bootstrapping:
| Target | Bias | Variance | |
|---|---|---|---|
| 1 (TD) | High | Low | |
| 5 | Medium | Medium | |
| (MC) | (actual return) | Zero | High |
In practice, to often works better than either extreme. Rainbow DQN uses .
TD(λ) elegantly interpolates using eligibility traces - a weighted average of ALL -step returns:
The weight given to the -step return decays geometrically with . At : pure TD(0). At : pure Monte Carlo.
Eligibility traces implement this efficiently without storing all future returns. Each state-action pair has a trace that accumulates evidence:
The update uses all traces simultaneously:
class SARSALambdaAgent:
"""SARSA with eligibility traces (TD(λ))."""
def __init__(
self,
n_states: int,
n_actions: int,
alpha: float = 0.1,
gamma: float = 0.99,
lam: float = 0.9, # λ - the trace decay parameter
epsilon: float = 0.1,
):
self.n_actions = n_actions
self.alpha = alpha
self.gamma = gamma
self.lam = lam
self.epsilon = epsilon
self.Q = np.zeros((n_states, n_actions))
self.e = np.zeros((n_states, n_actions)) # eligibility traces
def act(self, state: int) -> int:
if np.random.random() < self.epsilon:
return np.random.randint(self.n_actions)
return int(np.argmax(self.Q[state]))
def reset_traces(self):
"""Reset eligibility traces at the start of each episode."""
self.e[:] = 0.0
def update(self, s: int, a: int, r: float, s_next: int, a_next: int, done: bool):
# TD error
if done:
delta = r - self.Q[s, a]
else:
delta = r + self.gamma * self.Q[s_next, a_next] - self.Q[s, a]
# Increment trace for current (s, a)
self.e[s, a] += 1.0
# Update ALL Q-values proportional to their eligibility trace
self.Q += self.alpha * delta * self.e
# Decay all traces
self.e *= self.gamma * self.lam
if done:
self.reset_traces()
In practice, often works best. TD(λ) is rarely used with neural networks because maintaining per-parameter traces is memory-intensive, but it is important conceptually and works very well in tabular settings.
Double Q-Learning: Fixing Maximization Bias
Standard Q-learning has a subtle but significant problem: the in the target overestimates Q-values.
Why? The target is . If the Q-values have any estimation noise (which they always do early in training), the maximum over noisy values is systematically higher than the maximum of the true values. This is maximization bias.
Proof by simple example: Suppose there are 5 actions, all with true Q-value 0. Your estimates are (noise around 0). The true max is 0, but of your estimates is 0.3. You consistently overestimate by using the maximum of noisy estimates.
Double Q-Learning (van Hasselt, 2010) fixes this by maintaining two independent Q-tables, and :
- Use to select the best action:
- Use to evaluate it: target =
- Alternate which table is updated
Because the action selection and value evaluation use independent estimates, the overestimation bias is substantially reduced.
class DoubleQLearningAgent:
"""Double Q-learning with two independent Q-tables."""
def __init__(
self,
n_states: int,
n_actions: int,
alpha: float = 0.1,
gamma: float = 0.99,
epsilon_start: float = 1.0,
epsilon_end: float = 0.01,
epsilon_decay: float = 0.995,
):
self.n_actions = n_actions
self.alpha = alpha
self.gamma = gamma
self.epsilon = epsilon_start
self.epsilon_end = epsilon_end
self.epsilon_decay = epsilon_decay
# Two independent Q-tables
self.Q_A = np.zeros((n_states, n_actions))
self.Q_B = np.zeros((n_states, n_actions))
def act(self, state: int) -> int:
# Act based on sum of both tables
q_combined = self.Q_A[state] + self.Q_B[state]
if np.random.random() < self.epsilon:
return np.random.randint(self.n_actions)
return int(np.argmax(q_combined))
def update(self, s: int, a: int, r: float, s_next: int, done: bool):
# Randomly decide which table to update
if np.random.random() < 0.5:
# Update Q_A using Q_B for evaluation
if done:
target = r
else:
best_action = np.argmax(self.Q_A[s_next]) # Select with Q_A
target = r + self.gamma * self.Q_B[s_next, best_action] # Evaluate with Q_B
self.Q_A[s, a] += self.alpha * (target - self.Q_A[s, a])
else:
# Update Q_B using Q_A for evaluation
if done:
target = r
else:
best_action = np.argmax(self.Q_B[s_next]) # Select with Q_B
target = r + self.gamma * self.Q_A[s_next, best_action] # Evaluate with Q_A
self.Q_B[s, a] += self.alpha * (target - self.Q_B[s, a])
def decay_epsilon(self):
self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
Double Q-learning is the tabular precursor to Double DQN. In DQN, the two tables become the online network and the target network - an elegant reuse of the existing infrastructure.
Function Approximation and the Deadly Triad
All of the above assumes a tabular Q-function - one value per (state, action) pair. This breaks down when state spaces are continuous or too large to enumerate (e.g., Atari pixel inputs: dimensions).
Linear function approximation: Replace the Q-table with a linear model:
where is a feature vector. The update becomes a gradient descent step:
Linear FA is guaranteed to converge with TD learning (under on-policy sampling). But neural networks are different - and this is where the deadly triad appears.
The Deadly Triad (Sutton & Barto, Chapter 11): Three elements that together cause divergence:
- Function approximation - generalizes: updating one Q-value affects neighboring states
- Bootstrapping - targets depend on current estimates, which are being updated
- Off-policy learning - distribution of updates doesn't match the distribution of visits
All three are present in standard DQN. Why does DQN succeed despite the deadly triad? Experience replay partially addresses distribution mismatch, and target networks stabilize bootstrapping. But DQN can still diverge with bad hyperparameters - the deadly triad is a warning, not a guarantee of failure.
:::danger The Deadly Triad in production RL If you are training a neural network with Q-learning and see Q-values grow without bound (loss blows up, Q-values → ∞), the deadly triad is your diagnosis. Fixes in order of impact: (1) add or reduce target network update frequency, (2) reduce learning rate, (3) add gradient clipping, (4) use Double DQN to reduce overestimation feedback loops. Never ignore diverging Q-values - they will not stabilize on their own. :::
GridWorld Implementation: Full NumPy Example
A complete GridWorld environment and Q-learning agent using only NumPy, useful for interviews and understanding the mechanics:
import numpy as np
from typing import Tuple
class GridWorld:
"""
Simple 4x4 GridWorld:
S: start (0,0)
G: goal (3,3), reward +1
H: hole (1,1) and (2,2), reward -1, terminal
All other steps: reward -0.01
"""
def __init__(self, size: int = 4):
self.size = size
self.holes = {(1, 1), (2, 2)}
self.goal = (size - 1, size - 1)
self.state = (0, 0)
# Actions: 0=up, 1=right, 2=down, 3=left
self.action_deltas = [(-1, 0), (0, 1), (1, 0), (0, -1)]
self.n_actions = 4
def reset(self) -> Tuple[int, int]:
self.state = (0, 0)
return self.state
def step(self, action: int) -> Tuple[Tuple, float, bool]:
dr, dc = self.action_deltas[action]
r, c = self.state
nr = max(0, min(self.size - 1, r + dr))
nc = max(0, min(self.size - 1, c + dc))
self.state = (nr, nc)
if self.state == self.goal:
return self.state, 1.0, True
elif self.state in self.holes:
return self.state, -1.0, True
else:
return self.state, -0.01, False
def state_to_idx(self, state: Tuple) -> int:
return state[0] * self.size + state[1]
@property
def n_states(self):
return self.size * self.size
def train_gridworld_q_learning(n_episodes: int = 2000):
env = GridWorld(size=4)
n_states = env.n_states
n_actions = env.n_actions
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99
epsilon = 1.0
epsilon_min, epsilon_decay = 0.01, 0.995
episode_rewards = []
for ep in range(n_episodes):
state = env.reset()
s_idx = env.state_to_idx(state)
total_reward = 0.0
done = False
while not done:
# ε-greedy action selection
if np.random.random() < epsilon:
action = np.random.randint(n_actions)
else:
action = np.argmax(Q[s_idx])
next_state, reward, done = env.step(action)
ns_idx = env.state_to_idx(next_state)
# Q-learning update
if done:
target = reward
else:
target = reward + gamma * np.max(Q[ns_idx])
Q[s_idx, action] += alpha * (target - Q[s_idx, action])
s_idx = ns_idx
total_reward += reward
epsilon = max(epsilon_min, epsilon * epsilon_decay)
episode_rewards.append(total_reward)
# Print learned Q-table as policy arrows
action_symbols = ['↑', '→', '↓', '←']
print("\nLearned Policy:")
for r in range(env.size):
row = ""
for c in range(env.size):
if (r, c) == env.goal:
row += " G "
elif (r, c) in env.holes:
row += " H "
else:
s_idx = env.state_to_idx((r, c))
best_action = np.argmax(Q[s_idx])
row += f" {action_symbols[best_action]} "
print(row)
return Q, episode_rewards
Q, rewards = train_gridworld_q_learning()
Expected output:
Learned Policy:
→ → → ↓
↑ H → ↓
↑ → H ↓
↑ → → G
Production Engineering Notes
Hyperparameter guidelines for tabular Q-learning:
| Hyperparameter | Typical range | Effect of increasing |
|---|---|---|
| α (learning rate) | 0.01 – 0.5 | Faster but less stable convergence |
| γ (discount) | 0.9 – 0.999 | More long-term planning; slower convergence |
| ε start | 0.5 – 1.0 | More exploration early |
| ε end | 0.01 – 0.1 | More vs less exploration at convergence |
| ε decay | 0.99 – 0.9999 | Slower vs faster transition to exploitation |
Monitoring signals to track:
- Episode reward (smoothed over 100 episodes): should trend upward
- TD error magnitude: should decrease over training
- Epsilon: should decrease smoothly, not drop too fast
- Policy greedy reward (occasional evaluation episodes with ε=0): true policy quality
When to use which algorithm:
| Scenario | Algorithm |
|---|---|
| Tabular, offline evaluation | Q-learning |
| Online deployment, safety-critical | SARSA |
| Long horizons, dense rewards | SARSA(λ) or n-step TD |
| Large discrete action space | Double Q-learning |
| Continuous state space | DQN (next lesson) |
Common Mistakes
:::danger Forgetting to reset eligibility traces between episodes
In SARSA(λ), traces carry over within an episode but MUST be reset at the start of each new episode. If you forget, the traces from the previous episode corrupt the updates at the start of the next episode. Always call reset_traces() in the episode loop's initialization.
:::
:::danger Updating SARSA with next_action chosen AFTER the update SARSA requires choosing BEFORE the update so that the actual next action is used in the target. If you choose after the update, you're accidentally implementing Q-learning on-policy (which is almost always suboptimal). The training loop must look like: choose , step, choose , update, advance. :::
:::warning Using Q-learning in safety-critical online settings Q-learning is off-policy - it ignores exploratory actions in its value estimates. In safety-critical online learning (e.g., a physical robot that never stops running), the exploration policy IS the deployment policy. Q-learning learns the optimal policy but the exploration behavior may be dangerous. Use SARSA, which accounts for the exploratory actions in its estimates. :::
:::warning Decaying ε too fast If ε reaches 0.01 within the first 5% of training, the agent stops exploring before Q-values have had time to propagate through the state space. States far from the goal may never receive positive gradient signal. A general rule: ε should reach its final value at roughly the 50–70% mark of training, not earlier. :::
:::tip Debugging tip: inspect the Q-table For tabular problems, always print the learned Q-table and visualize the greedy policy. A policy that looks random or cycles is diagnostic: (1) if random → Q-values haven't converged (train longer or reduce ε faster); (2) if cycling → conflicting Q-values (check for state aliasing or very small alpha); (3) if stuck in local region → insufficient exploration (increase ε or ε-decay timescale). :::
YouTube Resources
| Title | Channel | Why Watch |
|---|---|---|
| Q-Learning Explained | The Coding Train | Visual, beginner-friendly walkthrough with a maze example - great for building intuition before the math |
| Temporal Difference Learning - David Silver RL Course Lecture 4 | DeepMind / UCL | The canonical lecture - Silver covers TD(0), SARSA, Q-learning with full derivations and proofs |
| Q-Learning and SARSA Comparison | sentdex | Side-by-side Python implementation on FrozenLake, good for seeing practical differences |
| Eligibility Traces and TD Lambda | Mutual Information | Clear explanation of why TD(λ) interpolates MC and TD, with visual intuition for traces |
| Reinforcement Learning Explained - Full Course | Andrej Karpathy (Stanford) | Broader context - how tabular RL fits into the modern deep RL landscape |
Interview Q&A
Q1: What is the difference between Q-learning and SARSA? When would you prefer each?
Both are TD methods that learn Q-functions from experience without a model. Q-learning is off-policy: the update target uses - the greedy action - regardless of what the agent actually does during exploration. This means Q-learning learns , the optimal Q-function, even while behaving non-optimally. SARSA is on-policy: the update target uses where is the action actually taken by the current (exploratory) policy. SARSA converges to where is the current behavior policy, not necessarily .
Prefer Q-learning when: you want to learn the optimal policy and will stop exploring before deployment (this is the typical offline training case). Prefer SARSA in online settings where the exploration policy continues operating indefinitely - SARSA accounts for exploration mistakes and learns a safer policy in the presence of dangerous states. The Cliff Walking example perfectly illustrates this: Q-learning learns the fast-but-risky cliff-edge path; SARSA learns the safe inland path that avoids occasional random steps off the cliff.
Q2: What is the temporal difference error and why does it matter?
The TD error is the difference between the "target" (what you just observed plus the discounted future estimate) and the "prediction" (your current Q-value estimate). It is the core learning signal in all TD methods. If , the outcome was better than predicted - increase Q for this (state, action). If , decrease Q.
The TD error matters for two reasons. First, it allows learning without waiting for the full episode - you update after each step using bootstrapping, which makes learning online and efficient. Second, in neuroscience, Schultz et al. (1997) showed that dopamine neuron firing patterns in primates match the TD error signal: unexpected rewards trigger dopamine release (positive TD error), expected rewards do not (zero TD error), and omission of expected rewards suppresses dopamine (negative TD error). This suggests TD learning may be how biological brains solve the reward learning problem.
Q3: Why does Q-learning sometimes fail with function approximation, and what is the deadly triad?
With tabular Q-learning, each update affects exactly one entry in the table. With a neural network, each update affects the entire function - nearby states' Q-values shift too, because the network generalizes. This creates three interacting failure modes that Sutton calls the deadly triad: (1) function approximation - generalization means updates have side effects on other states; (2) bootstrapping - targets depend on current estimates, which are being updated simultaneously, creating a moving target problem; (3) off-policy learning - the distribution of states being trained on differs from the distribution encountered during the current policy.
Any two of these three components is manageable. All three together cause divergence. DQN manages the triad with: experience replay (partially addresses distribution mismatch, breaks temporal correlation) and target networks (stabilizes the moving target problem by freezing bootstrap targets for thousands of steps). Even so, DQN can diverge with bad hyperparameters - the deadly triad is a fundamental instability, not a solved problem.
Q4: Explain ε-greedy and why ε is annealed. What are the alternatives?
ε-greedy selects the greedy (highest Q-value) action with probability , and a random action with probability . Annealing means starting with large (typically 1.0 - fully random) and gradually reducing it to a small value (typically 0.01). The rationale: early in training, Q-values are random initialization noise - greedy selection has no meaning, so exploration is essential for the agent to discover what rewards are available. As training progresses, Q-values become meaningful estimates, so exploitation of learned knowledge becomes increasingly valuable. The final small maintains convergence guarantees (all states must be visited infinitely often) and helps in slightly non-stationary environments.
Alternatives: UCB (Upper Confidence Bound) selects , preferring less-visited state-action pairs - theoretically principled for bandits. Boltzmann / softmax sampling selects actions proportional to , providing more nuanced exploration that preferentially explores actions with similar Q-values. Thompson Sampling maintains uncertainty estimates over Q-values and samples from the posterior. In deep RL, entropy regularization (Soft Actor-Critic) is considered the most principled approach - the agent is rewarded for maintaining high entropy in its policy.
Q5: Explain Double Q-Learning. What problem does it solve?
Standard Q-learning uses in the target. When Q-values have estimation noise (which they always do), the maximum of noisy estimates is systematically higher than the true maximum - this is maximization bias. Simple example: 5 actions all with true value 0, estimated as . True max is 0; estimated max is 0.3. The bias propagates: overestimated actions are selected more often, generating more data that reinforces the overestimation.
Double Q-Learning (van Hasselt, 2010) maintains two independent Q-tables and . For each update, it uses one to select the best action () and the other to evaluate it (target = ). Because the selection and evaluation use independent (thus uncorrelated) noise, the overestimation bias is substantially reduced. Double DQN (2015) adapts this to neural networks using the online network for selection and the target network for evaluation - requiring no additional networks, just a two-line change to the target computation. Double DQN consistently outperforms vanilla DQN and is now considered the default.
Q6: How does tabular Q-learning connect to modern LLM training with RLHF?
They share the same objective - maximize expected cumulative reward - but operate at vastly different scales. RLHF uses PPO (a policy gradient method), not Q-learning, because the action space (50k+ token vocabulary) is too large for tabular Q-values and the continuous generation process suits policy gradient better. But the conceptual lineage is direct: (1) the reward model in RLHF plays the role of in the tabular MDP; (2) the language model is the policy; (3) the KL penalty from the reference model prevents reward hacking just as the discount factor and bounded rewards prevent Q-values from diverging.
Q-learning's core insight - that you can learn an optimal policy from sampled interactions without knowing the environment model - is exactly what RLHF does. The environment (human preferences, real-world effects of generated text) is unknown and cannot be modeled. The agent (LLM) interacts with it, observations are collected (human ratings), and the policy is updated to maximize reward. Tabular Q-learning to RLHF is a straight conceptual line: same mathematics, 35 years of engineering in between.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Q-Learning in GridWorld demo on the EngineersOfAI Playground - no code required.
:::
