Module 11: Reinforcement Learning

Why This Module Matters Now

Reinforcement learning was once confined to game-playing agents and robotics labs. Today it sits at the center of how the most capable AI systems are trained. ChatGPT, Claude, and Gemini were all shaped by RL techniques. DeepMind's AlphaGo combined Monte Carlo tree search with RL to beat world champion Lee Sedol at Go in 2016. AlphaStar reached Grandmaster level in StarCraft II. GPT-4 was fine-tuned with RLHF, the recipe that InstructGPT implemented as PPO over a learned reward model. DeepSeek-R1 was trained with Group Relative Policy Optimization (GRPO) to improve reasoning.

If you want to build, fine-tune, or deploy large language models at a professional level, you need to understand RL - not just RLHF as a buzzword, but the actual machinery: how value functions work, why policy gradients are unstable, what PPO's clipping actually does, and why DPO can sidestep the RL loop entirely.

This module takes you from the mathematical foundations through state-of-the-art alignment techniques. Every lesson connects theory to engineering practice.

RL vs Supervised Learning

The table below summarizes the core distinctions that separate RL from the supervised learning used elsewhere in this curriculum:

Dimension	Supervised Learning	Reinforcement Learning
Feedback	Labels on every example	Scalar reward (often delayed)
Data	Fixed dataset	Generated by agent interaction
Objective	Minimize loss on given labels	Maximize cumulative reward
Environment	Static	Dynamic - agent affects state
Credit assignment	Per example, immediate	Across sequence of actions

In supervised learning, you always know the right answer - you minimize the distance to it. In RL, you don't know what the right action is. You only know whether things went well or badly, often many steps after the fact. This makes RL substantially harder to train and debug.
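To make "delayed" feedback concrete, here is a minimal sketch of how a single end-of-episode reward turns into a learning signal for every earlier step. The discount factor `gamma`, the function name `discounted_returns`, and the five-step episode are illustrative assumptions, not something this module prescribes.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step's return reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical 5-step episode: the only feedback is a reward of 1.0 at the very end.
episode_rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
episode_returns = discounted_returns(episode_rewards, gamma=0.9)
```

Every step ends up with a nonzero return even though only the last step was rewarded, and earlier steps receive a smaller share. Unlike a supervised label, none of these numbers says which action was the right one.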

The Alignment Connection

Every ML engineer needs RL literacy in 2026 because RL sits inside the standard LLM training pipeline:

Pre-training  →  SFT  →  Reward Model  →  PPO / DPO  →  Deployed Model
                                              ↑
                                     RL sits right here

RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) are the techniques that transform a next-token predictor into an assistant that follows instructions, avoids harmful outputs, and aligns with human preferences. Without understanding the RL machinery underneath, you cannot:

Debug reward hacking in your fine-tuning pipeline
Choose between PPO and DPO for your use case
Understand why KL penalties matter
Design reward functions for custom tasks
Evaluate whether your alignment is working

Module Map

The module has two foundational tracks that converge at the alignment lessons, followed by an applications segment:

Value-based track (01 → 02 → 03 → 04): Covers Q-functions, tabular RL, and how deep networks approximate value functions.

Policy-based track (01 → 05 → 06): Covers optimizing policies directly, the instability of vanilla policy gradients, and PPO as the fix.

Alignment track (07 → 08): Applies PPO to LLM fine-tuning (RLHF), then shows how DPO avoids the RL loop entirely.

Applications (09 → 10): Production RL systems and the emerging frontier of agentic RL.

Lesson Table

#	Topic	Key Concepts	Difficulty
01	MDP and the RL Framework	States, actions, rewards, Bellman equations	Foundational
02	Dynamic Programming	Policy evaluation, policy iteration, value iteration	Intermediate
03	Q-Learning and SARSA	Model-free TD control, exploration	Intermediate
04	Deep Q-Networks	DQN, experience replay, target networks	Advanced
05	Policy Gradient Methods	REINFORCE, actor-critic, A2C	Advanced
06	Proximal Policy Optimization	PPO-Clip, GAE, stable training	Advanced
07	RL from Human Feedback	RLHF pipeline, reward models, KL penalty	Advanced
08	Direct Preference Optimization	DPO loss, closed-form optimality	Advanced
09	RL in Production	Bandits, sim-to-real, safety	Practical
10	RL for AI Agents	Agentic policies, tool use, MCTS	Frontier

Prerequisites

Before diving in, make sure you are comfortable with:

Calculus: Gradients, chain rule - used extensively in policy gradients
Probability: Expectations, conditional distributions - the Bellman equation is an expectation
Neural networks: Forward pass, backpropagation - DQN and policy networks are standard neural nets
PyTorch: Basic autograd - we implement DQN and PPO from scratch

If you have worked through Modules 01–06 of this curriculum, you have everything you need.
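As a preview of why comfort with expectations matters, the Bellman expectation equation for the state-value function of a policy can be written as below. The notation (policy π, transition distribution P, reward r, discount factor γ) is the common textbook convention and is an assumption here; Lesson 01 may use different symbols.

```latex
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim P(\cdot \mid s, a)}
  \left[ r(s, a) + \gamma \, V^{\pi}(s') \right]
```

The value of a state is an average over what the policy might do and where the environment might send it, which is why sampling-based estimates of this expectation recur throughout the module.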

Five Key Intuitions to Build

1. The exploration-exploitation tradeoff is fundamental. Every RL algorithm must balance trying new things (exploration) against doing what already works (exploitation). ε-greedy, softmax sampling, and entropy bonuses are all engineering answers to this problem.

2. Credit assignment is the hard problem. When your agent wins a game after 500 moves, which of those moves caused the win? Temporal difference learning, Monte Carlo returns, and advantage estimation are all ways to attack this problem, and none of them solves it completely.

3. Sample efficiency matters. RL is notoriously data-hungry. Experience replay, model-based methods, and offline RL all address the cost of collecting experience.

4. Stability is the engineer's problem. RL training can diverge, oscillate, or collapse. Target networks, PPO clipping, and KL penalties are engineering solutions to training instability.

5. Reward hacking is real. If your reward function does not capture your actual goal precisely, the agent will find ways to maximize the reward while failing your goal. This is why RLHF adds a KL penalty against a reference policy: it keeps the model from drifting too far from sensible behavior while chasing reward.
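The KL penalty in the last intuition can be sketched in a few lines. This is a simplified, per-response illustration rather than a full RLHF implementation: the function name `kl_penalized_reward`, the coefficient `beta`, and all of the numbers are hypothetical, and the log-probability difference is used as a single-sample estimate of the KL divergence.

```python
def kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus beta times a single-sample KL estimate."""
    # log pi(y|x) - log pi_ref(y|x): large when the policy assigns far more
    # probability to this response than the reference model does.
    kl_estimate = logp_policy - logp_reference
    return reward - beta * kl_estimate


# Hypothetical response that stays close to the reference model.
on_distribution = kl_penalized_reward(
    reward=2.0, logp_policy=-12.0, logp_reference=-12.5
)

# Hypothetical response that scores higher on the reward model but that the
# reference model considers extremely unlikely (a possible reward hack).
drifted = kl_penalized_reward(
    reward=2.5, logp_policy=-5.0, logp_reference=-40.0
)
```

The drifted response wins on raw reward (2.5 versus 2.0) but loses once the penalty is applied, which is the mechanism that discourages the policy from exploiting quirks of the reward model.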
