Module 11: Reinforcement Learning
Why This Module Matters Now
Reinforcement learning was once confined to game-playing agents and robotics labs. Today it sits at the center of how the most capable AI systems are trained. ChatGPT, Claude, and Gemini were all shaped by RL techniques. DeepMind's AlphaGo combined Monte Carlo tree search with RL to beat Lee Sedol, one of the world's strongest Go players, in 2016. AlphaStar reached Grandmaster level in StarCraft II. GPT-4 was aligned with RLHF, the recipe that InstructGPT implemented as PPO over a learned reward model. DeepSeek-R1 was trained with Group Relative Policy Optimization (GRPO) to improve reasoning.
If you want to build, fine-tune, or deploy large language models at a professional level, you need to understand RL - not just RLHF as a buzzword, but the actual machinery: how value functions work, why policy gradients are unstable, what PPO's clipping actually does, and why DPO can sidestep the RL loop entirely.
This module takes you from the mathematical foundations through state-of-the-art alignment techniques. Every lesson connects theory to engineering practice.
RL vs Supervised Learning
A handful of core distinctions separate RL from the supervised methods covered elsewhere in this curriculum:
| Dimension | Supervised Learning | Reinforcement Learning |
|---|---|---|
| Feedback | Labels on every example | Scalar reward (often delayed) |
| Data | Fixed dataset | Generated by agent interaction |
| Objective | Minimize loss on given labels | Maximize cumulative reward |
| Environment | Static | Dynamic - agent affects state |
| Credit assignment | Per example, immediate | Across sequence of actions |
In supervised learning, you always know the right answer - you minimize the distance to it. In RL, you don't know what the right action is. You only know whether things went well or badly, often many steps after the fact. This makes RL substantially harder to train and debug.
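To make the delayed-feedback point concrete, here is a minimal sketch of how a single late reward is propagated back to earlier steps through discounted returns. The episode, the function name `discounted_returns`, and the discount factor of 0.9 are illustrative assumptions, not part of the module's own code.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step, working backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: no feedback for four steps, then one reward at the end.
episode_rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(episode_rewards, gamma=0.9)
print([round(g, 4) for g in returns])  # [0.6561, 0.729, 0.81, 0.9, 1.0]
```

Every step receives some credit for the final reward, with earlier steps discounted more heavily. This is only one of several ways to assign credit; it says nothing about which of those steps actually mattered.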
The Alignment Connection
The reason every ML engineer needs RL literacy in 2026:
Pre-training → SFT → Reward Model → PPO / DPO → Deployed Model
                                        ↑
                                RL sits right here
RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) are the techniques that transform a next-token predictor into an assistant that follows instructions, avoids harmful outputs, and aligns with human preferences. Without understanding the RL machinery underneath, you cannot:
- Debug reward hacking in your fine-tuning pipeline
- Choose between PPO and DPO for your use case
- Understand why KL penalties matter
- Design reward functions for custom tasks
- Evaluate whether your alignment is working
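As a rough illustration of why KL penalties matter, the sketch below shapes a reward-model score with a penalty on how far the policy's log-probabilities have moved from a reference model's. The function name, the coefficient `beta=0.1`, and the log-probability values are all assumptions made for this example; real pipelines differ in whether the penalty is applied per token or per sequence.

```python
def kl_penalized_reward(reward_model_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus beta times a sequence-level log-ratio estimate.

    The log-ratio sum(log pi(token) - log pi_ref(token)) is a single-sample
    estimate of the KL divergence between the policy and the reference model.
    """
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_model_score - beta * log_ratio


# Hypothetical per-token log-probabilities for one sampled response.
policy_logprobs = [-1.0, -0.5, -0.2]
reference_logprobs = [-1.2, -0.9, -1.0]

shaped = kl_penalized_reward(2.0, policy_logprobs, reference_logprobs, beta=0.1)
print(round(shaped, 4))  # 1.86
```

The further the policy drifts toward outputs the reference model finds unlikely, the more of the reward-model score is taken away, which limits how aggressively the policy can exploit flaws in the reward model.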
Module Map
The module has two foundational tracks that feed into the alignment lessons, followed by an applications track:
Value-based track (01 → 02 → 03 → 04): Covers Q-functions, tabular RL, and how deep networks approximate value functions.
Policy-based track (01 → 05 → 06): Covers directly optimizing policies, the instability of vanilla policy gradients, and PPO as the fix.
Alignment track (07 → 08): Applies PPO to LLM fine-tuning (RLHF), then shows how DPO avoids the RL loop entirely.
Applications (09 → 10): Production RL systems and the emerging frontier of agentic RL.
Lesson Table
| # | Topic | Key Concepts | Difficulty |
|---|---|---|---|
| 01 | MDP and the RL Framework | States, actions, rewards, Bellman equations | Foundational |
| 02 | Dynamic Programming | Policy evaluation, policy iteration, value iteration | Intermediate |
| 03 | Q-Learning and SARSA | Model-free TD control, exploration | Intermediate |
| 04 | Deep Q-Networks | DQN, experience replay, target networks | Advanced |
| 05 | Policy Gradient Methods | REINFORCE, actor-critic, A2C | Advanced |
| 06 | Proximal Policy Optimization | PPO-Clip, GAE, stable training | Advanced |
| 07 | RL from Human Feedback | RLHF pipeline, reward models, KL penalty | Advanced |
| 08 | Direct Preference Optimization | DPO loss, closed-form optimality | Advanced |
| 09 | RL in Production | Bandits, sim-to-real, safety | Practical |
| 10 | RL for AI Agents | Agentic policies, tool use, MCTS | Frontier |
Prerequisites
Before diving in, make sure you are comfortable with:
- Calculus: Gradients, chain rule - used extensively in policy gradients
- Probability: Expectations, conditional distributions - the Bellman equation is an expectation
- Neural networks: Forward pass, backpropagation - DQN and policy networks are standard neural nets
- PyTorch: Basic autograd - we implement DQN and PPO from scratch
If you have worked through Modules 01–06 of this curriculum, you have everything you need.
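For reference, the sense in which the Bellman equation is an expectation can be written out for the state-value function of a policy. The notation below follows common textbook usage and is an assumption here, since this overview does not define its own symbols: $\pi$ is the policy, $P$ the transition distribution, $r$ the reward, and $\gamma$ the discount factor.

```latex
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim P(\cdot \mid s, a)}
  \left[ r(s, a) + \gamma \, V^{\pi}(s') \right]
```

The expectation averages over both the policy's choice of action and the environment's choice of next state, which is why comfort with conditional distributions is listed as a prerequisite.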
Five Key Intuitions to Build
1. The exploration-exploitation tradeoff is fundamental. Every RL algorithm must balance trying new things (exploration) against doing what already works (exploitation). ε-greedy, softmax sampling, and entropy bonuses are all engineering answers to this problem.
2. Credit assignment is the hard problem. When your agent wins a game after 500 moves, which of those moves caused the win? Temporal difference learning, Monte Carlo returns, and advantage estimation are all ways to attack this problem, each with its own bias-variance tradeoff.
3. Sample efficiency matters. RL is notoriously data-hungry. Experience replay, model-based methods, and offline RL all address the cost of collecting experience.
4. Stability is the engineer's problem. RL training can diverge, oscillate, or collapse. Target networks, PPO clipping, and KL penalties are engineering solutions to training instability.
5. Reward hacking is real. If your reward function does not capture your actual goal precisely, the agent will find ways to maximize the reward while failing your goal. This is why RLHF applies a KL penalty against a reference policy: it keeps the model from drifting too far from sensible behavior while chasing reward.
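Intuition 4 mentions PPO clipping as a stabilizer. A minimal single-sample sketch of the clipped surrogate objective is shown below, assuming the standard PPO-Clip form with a clip range of 0.2; the function name and the example numbers are illustrative, and a real implementation would average this quantity over a batch and use an autograd framework.

```python
import math


def ppo_clip_objective(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """Clipped surrogate for one (state, action) sample; higher is better."""
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound.
    return min(ratio * advantage, clipped_ratio * advantage)


# The policy has made a good action 1.5x more likely: the gain is capped at 1.2.
capped_gain = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)
# The policy has made a bad action 1.5x more likely: the full penalty applies.
full_penalty = ppo_clip_objective(math.log(1.5), 0.0, advantage=-1.0)

print(round(capped_gain, 4), round(full_penalty, 4))  # 1.2 -1.5
```

The asymmetry is the point: once the ratio leaves the clip range in the direction the advantage favors, the objective stops rewarding further movement, so a single batch cannot push the policy arbitrarily far.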
Recommended Reading
- Sutton & Barto, "Reinforcement Learning: An Introduction" (2018) - The standard textbook. Free at incompleteideas.net.
- OpenAI Spinning Up - Practical RL implementations and clear write-ups of the key algorithms.
- InstructGPT paper (Ouyang et al., 2022) - The RLHF paper that defined modern LLM alignment.
- DPO paper (Rafailov et al., 2023) - An alternative to PPO-based RLHF that optimizes directly on preference pairs, with no separate reward model or RL loop.
- Anthropic's Constitutional AI paper (Bai et al., 2022) - How AI-generated feedback can stand in for much of the human feedback at scale.
Let's start with the mathematical foundation that underlies all of RL: the Markov Decision Process.
