11 docs tagged with "reinforcement-learning"

Deep Q-Networks (DQN)

Scale Q-learning to high-dimensional inputs with neural networks. Learn the DQN architecture, experience replay, target networks, Double DQN, Dueling DQN, Prioritized Experience Replay, and Rainbow. Full PyTorch implementation included.

Direct Preference Optimisation - RLHF Without the RL

DPO: how Rafailov et al. (2023) showed that the KL-constrained RLHF objective has a closed-form optimal policy - no explicit reward model, no PPO, just a supervised classification loss on preference pairs.

Dynamic Programming for RL

Policy evaluation, policy iteration, and value iteration - solving MDPs exactly when you know the environment model. Master the theoretical foundation that model-free RL methods approximate.

MDP and the RL Framework

Master Markov Decision Processes - the mathematical foundation of all reinforcement learning. Understand states, actions, rewards, value functions, the Bellman equations, and how real-world systems are modeled as MDPs.

Module 11 - Reinforcement Learning

A comprehensive module covering RL fundamentals through modern alignment techniques including RLHF and DPO, connecting classical theory to LLM training.

Policy Gradient Methods

Directly optimize policies with gradient ascent - REINFORCE derivation, the log-derivative trick, variance reduction with baselines, actor-critic, A2C/A3C, and entropy regularization. The foundation for PPO and RLHF.

Proximal Policy Optimisation - The Algorithm That Runs ChatGPT's RLHF

PPO: the dominant policy gradient algorithm - how clipping the probability ratio prevents destructive policy updates while still allowing several gradient epochs on each batch of on-policy data.

Q-Learning and SARSA

Model-free temporal difference learning - Q-learning for off-policy control and SARSA for on-policy control. Understand TD vs MC vs DP, convergence conditions, eligibility traces, Double Q-learning, and implement Q-tables in NumPy.

RL for AI Agents - Teaching Models to Act in the World

How RL enables autonomous AI agents: ReAct, tool use, MCTS planning, AlphaCode, SWE-bench, and the emerging agent-RL paradigm powering Claude, GPT-4o, and Gemini.

RL from Human Feedback - How ChatGPT Learned to Be Helpful

The complete RLHF pipeline: supervised fine-tuning, reward model training from human preferences, and PPO fine-tuning - the technique behind InstructGPT, ChatGPT, and Claude.

RL in Production - Where Theory Meets Reality

Engineering challenges of deploying RL: offline RL, reward shaping, safe RL, exploration in production, and real-world case studies from DeepMind, Google, and Netflix.