Module 11: Reinforcement Learning

Why This Module Matters Now

Reinforcement learning was once confined to game-playing agents and robotics labs. Today it sits at the center of how the most capable AI systems are trained. ChatGPT, Claude, and Gemini were all shaped by RL techniques. DeepMind's AlphaGo combined Monte Carlo tree search with RL to beat world champion Lee Sedol at Go in 2016. AlphaStar reached Grandmaster level in StarCraft II. InstructGPT, the precursor to ChatGPT, was aligned using PPO over a learned reward model, and GPT-4 was fine-tuned with RLHF. DeepSeek-R1 uses Group Relative Policy Optimization (GRPO) to improve reasoning.

If you want to build, fine-tune, or deploy large language models at a professional level, you need to understand RL - not just RLHF as a buzzword, but the actual machinery: how value functions work, why policy gradients are unstable, what PPO's clipping actually does, and why DPO can sidestep the RL loop entirely.

This module takes you from the mathematical foundations through state-of-the-art alignment techniques. Every lesson connects theory to engineering practice.

RL vs Supervised Learning

The core distinction separates RL from everything else in this curriculum:

| Dimension | Supervised Learning | Reinforcement Learning |
| --- | --- | --- |
| Feedback | Labels on every example | Scalar reward (often delayed) |
| Data | Fixed dataset | Generated by agent interaction |
| Objective | Minimize loss on given labels | Maximize cumulative reward |
| Environment | Static | Dynamic - agent affects state |
| Credit assignment | Per example, immediate | Across sequence of actions |

In supervised learning, you always know the right answer - you minimize the distance to it. In RL, you don't know what the right action is. You only know whether things went well or badly, often many steps after the fact. This makes RL substantially harder to train and debug.
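To make the contrast concrete, here is a minimal sketch of an agent-environment loop with a delayed reward. `CorridorEnv`, `run_episode`, and `discounted_return` are hypothetical names invented for this illustration, not part of any library or of the lessons that follow. The point is that the data is produced by the agent's own actions, and the only feedback is a scalar that arrives at the end.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action: 0 = left, 1 = right. Moving left at position 0 stays put.
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position == self.length
        reward = 1.0 if done else 0.0  # the only reward arrives at the goal
        return self.position, reward, done


def run_episode(env, policy, max_steps=200):
    """Collect one trajectory of (state, action, reward) by interacting with env."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by gamma per step of delay."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rng = random.Random(0)
trajectory = run_episode(CorridorEnv(), lambda state: rng.choice([0, 1]))
rewards = [reward for _, _, reward in trajectory]
print(len(trajectory), sum(rewards), discounted_return(rewards))
```

Nothing in the trajectory says which individual action was "correct". There is no per-step label to minimize against, which is the gap the rest of the module works to close.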

The Alignment Connection

The reason every ML engineer needs RL literacy in 2026:

Pre-training → SFT → Reward Model → PPO / DPO → Deployed Model

RL sits at the PPO / DPO stage of this pipeline.

RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimisation) are the techniques that transform a next-token predictor into an assistant that follows instructions, avoids harmful outputs, and aligns with human preferences. Without understanding the RL machinery underneath, you cannot:

  • Debug reward hacking in your fine-tuning pipeline
  • Choose between PPO and DPO for your use case
  • Understand why KL penalties matter
  • Design reward functions for custom tasks
  • Evaluate whether your alignment is working
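As a rough illustration of why KL penalties matter, the sketch below computes a KL-penalized reward in the style of the InstructGPT objective: the reward model score minus a penalty on how far the policy's log-probabilities have moved from the reference model's. The function name, argument names, and the default `beta` are assumptions made for this example. Real pipelines often apply the penalty per token rather than per sequence.

```python
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Sequence-level reward with a KL-style penalty (illustrative sketch).

    rm_score:        scalar score from a learned reward model
    policy_logprobs: log-probs of the sampled tokens under the current policy
    ref_logprobs:    log-probs of the same tokens under the frozen reference model
    beta:            penalty strength (hypothetical default)
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # The summed log-ratio over sampled tokens is a single-sample estimate
    # of KL(policy || reference) for this response.
    log_ratio = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * log_ratio


# Same reward model score, but the second response drifts far from the reference.
print(kl_penalized_reward(1.0, [-1.0, -1.2], [-1.0, -1.2]))
print(kl_penalized_reward(1.0, [-0.1, -0.2], [-3.0, -3.5]))
```

A response that scores well with the reward model only by moving into regions the reference model considers very unlikely has part of that score taken back, which is one guard against reward hacking.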

Module Map

The module has two foundational tracks that converge at the alignment lessons, followed by an applications track:

Value-based track (01 → 02 → 03 → 04): Covers Q-functions, tabular RL, and how deep networks approximate value functions.

Policy-based track (01 → 05 → 06): Covers directly optimizing policies, the instability of vanilla policy gradients, and PPO as the fix.

Alignment track (07 → 08): Applies PPO to LLM fine-tuning (RLHF), then shows how DPO avoids the RL loop entirely.

Applications (09 → 10): Production RL systems and the emerging frontier of agentic RL.
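To give a sense of what "avoids the RL loop" means in the alignment track, here is a minimal sketch of the DPO loss from Rafailov et al. (2023) for a single preference pair. It needs only log-probabilities from the policy and a frozen reference model, with no sampling and no separate reward model. The function and argument names are hypothetical, and a real implementation would operate on batched tensors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: no preference has been learned yet.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has raised the chosen response relative to the rejected one.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss is an ordinary supervised objective over preference pairs, which is why DPO can be trained with a standard optimizer instead of an RL rollout loop.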

Lesson Table

| # | Topic | Key Concepts | Difficulty |
| --- | --- | --- | --- |
| 01 | MDP and the RL Framework | States, actions, rewards, Bellman equations | Foundational |
| 02 | Dynamic Programming | Policy evaluation, policy iteration, value iteration | Intermediate |
| 03 | Q-Learning and SARSA | Model-free TD control, exploration | Intermediate |
| 04 | Deep Q-Networks | DQN, experience replay, target networks | Advanced |
| 05 | Policy Gradient Methods | REINFORCE, actor-critic, A2C | Advanced |
| 06 | Proximal Policy Optimisation | PPO-Clip, GAE, stable training | Advanced |
| 07 | RL from Human Feedback | RLHF pipeline, reward models, KL penalty | Advanced |
| 08 | Direct Preference Optimisation | DPO loss, closed-form optimality | Advanced |
| 09 | RL in Production | Bandits, sim-to-real, safety | Practical |
| 10 | RL for AI Agents | Agentic policies, tool use, MCTS | Frontier |

Prerequisites

Before diving in, make sure you are comfortable with:

  • Calculus: Gradients, chain rule - used extensively in policy gradients
  • Probability: Expectations, conditional distributions - the Bellman equation is an expectation
  • Neural networks: Forward pass, backpropagation - DQN and policy networks are standard neural nets
  • PyTorch: Basic autograd - we implement DQN and PPO from scratch

If you have worked through Modules 01–06 of this curriculum, you have everything you need.
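As a quick self-check on the probability prerequisite, this is the Bellman expectation equation for a state-value function, written in the notation of Sutton & Barto. The symbols are assumptions here, since the module's own notation is introduced in Lesson 01:

```latex
v_\pi(s)
  = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \,\middle|\, S_t = s \right]
  = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_\pi(s') \bigr]
```

If you can read the second form as "average over actions, then average over next states and rewards", you have the probability background this module assumes.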

Five Key Intuitions to Build

1. The exploration-exploitation tradeoff is fundamental. Every RL algorithm must balance trying new things (exploration) against doing what already works (exploitation). ε-greedy, softmax sampling, and entropy bonuses are all engineering answers to this problem.

2. Credit assignment is the hard problem. When your agent wins a game after 500 moves, which of those moves caused the win? Temporal difference learning, Monte Carlo returns, and advantage estimation are all ways to solve this.

3. Sample efficiency matters. RL is notoriously data-hungry. Experience replay, model-based methods, and offline RL all address the cost of collecting experience.

4. Stability is the engineer's problem. RL training can diverge, oscillate, or collapse. Target networks, PPO clipping, and KL penalties are engineering solutions to training instability.

5. Reward hacking is real. If your reward function does not capture your actual goal precisely, the agent will find ways to maximize the reward while failing your goal. This is why RLHF uses KL penalties from a reference policy - to prevent the model from drifting too far from sensible behavior while chasing reward.
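Intuition 4 mentions PPO clipping as a stability tool. As a minimal sketch, the function below evaluates the PPO-Clip surrogate objective for a single action. The function and argument names are hypothetical, and `clip_eps=0.2` is a commonly used value rather than a requirement.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one (state, action) sample (illustrative sketch).

    new_logp:  log-prob of the action under the policy being updated
    old_logp:  log-prob of the same action under the policy that collected the data
    advantage: estimated advantage of the action
    """
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum makes the objective a pessimistic bound: the policy
    # gains nothing by pushing the ratio beyond the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, ratio already 1.5: the objective is capped at 1.2 * advantage.
print(ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0))
# Bad action, ratio 1.5: no cap, so the full penalty applies.
print(ppo_clip_objective(math.log(1.5), 0.0, advantage=-1.0))
```

Once the ratio leaves the clip range in the direction that would increase the objective, the objective goes flat and the gradient vanishes, which limits how far one batch of data can move the policy.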

Recommended Resources

  • Sutton & Barto, "Reinforcement Learning: An Introduction" (2nd ed., 2018) - The standard textbook. Free at incompleteideas.net.
  • OpenAI Spinning Up - Practical RL implementations and clear write-ups of the core algorithms.
  • InstructGPT paper (Ouyang et al., 2022) - The RLHF paper that defined modern LLM alignment.
  • DPO paper (Rafailov et al., 2023) - Closed-form alternative to RLHF.
  • Anthropic's Constitutional AI paper (Bai et al., 2022) - How AI feedback replaces human feedback at scale.

Let's start with the mathematical foundation that underlies all of RL: the Markov Decision Process.

© 2026 EngineersOfAI. All rights reserved.