Deep Q-Networks (DQN)
Scale Q-learning to high-dimensional inputs with neural networks. Learn the DQN architecture, experience replay, target networks, Double DQN, Dueling DQN, Prioritized Experience Replay, and Rainbow. Full PyTorch implementation included.
Direct Preference Optimization - RLHF Without the RL
DPO: how Rafailov et al. (2023) showed that the KL-constrained RLHF objective has a closed-form optimal policy, so the policy can be trained directly on preference pairs with a simple classification loss - no separate reward model, no PPO.
Dynamic Programming for RL
Policy evaluation, policy iteration, and value iteration - solving MDPs exactly when you know the environment model. Master the theoretical foundation that all model-free RL approximates.
MDP and the RL Framework
Master Markov Decision Processes - the mathematical foundation of all reinforcement learning. Understand states, actions, rewards, value functions, the Bellman equations, and how real-world systems are modeled as MDPs.
Module 11 - Reinforcement Learning
A module that runs from RL fundamentals to modern alignment techniques such as RLHF and DPO, connecting classical theory to LLM training.
Policy Gradient Methods
Directly optimize policies with gradient ascent - REINFORCE derivation, the log-derivative trick, variance reduction with baselines, actor-critic, A2C/A3C, and entropy regularization. The foundation for PPO and RLHF.
Proximal Policy Optimization - The Algorithm That Runs ChatGPT's RLHF
PPO: a widely used policy gradient algorithm - how clipping the probability ratio prevents destructively large policy updates while still allowing several gradient epochs on each batch of on-policy data.
Q-Learning and SARSA
Model-free temporal difference learning - Q-learning for off-policy control and SARSA for on-policy control. Understand TD vs MC vs DP, convergence conditions, eligibility traces, Double Q-learning, and implement Q-tables in NumPy.
RL for AI Agents - Teaching Models to Act in the World
How RL enables autonomous AI agents: ReAct, tool use, MCTS planning, AlphaCode, SWE-bench, and the emerging agent-RL paradigm powering Claude, GPT-4o, and Gemini.
RL from Human Feedback - How ChatGPT Learned to Be Helpful
The complete RLHF pipeline: supervised fine-tuning, reward model training from human preferences, and PPO fine-tuning - the technique behind InstructGPT, ChatGPT, and Claude.
RL in Production - Where Theory Meets Reality
Engineering challenges of deploying RL: offline RL, reward shaping, safe RL, exploration in production, and real-world case studies from DeepMind, Google, and Netflix.