Module 10: Reasoning Models
The biggest shift in LLM capability in 2024–2025 was not a bigger training run. It was a new question: what if you gave the model more time to think at inference?
This module covers the ideas, architectures, and engineering decisions that power today's reasoning models - from the theoretical basis of test-time compute scaling, to the concrete training algorithms behind OpenAI o1 and DeepSeek-R1, to the practical questions of when to use these models in production and how to evaluate them honestly.
Lessons at a Glance
| # | Lesson | What You Will Learn |
|---|---|---|
| 01 | Test-Time Compute | The paradigm shift from training-time scaling to inference-time scaling; best-of-N, majority voting, and how accuracy scales with compute |
| 02 | Chain-of-Thought at Inference | Wei et al. 2022, zero-shot and few-shot CoT, self-consistency, process vs outcome supervision, when CoT hurts |
| 03 | OpenAI o1 and o3 | Hidden chain-of-thought, RL from process rewards, reasoning-token compute budgets, o3's ARC-AGI results |
| 04 | DeepSeek-R1 | Pure RL with GRPO, R1-Zero experiment, SFT cold start, distillation to smaller models, open weights |
| 05 | Process Reward Models | Outcome vs process supervision, Lightman et al. 2023, Math-Shepherd, using PRMs for search |
| 06 | MCTS for LLMs | Classic MCTS adapted to token sequences, AlphaCode 2, Tree-of-Thought, compute vs quality trade-offs |
| 07 | When to Use Reasoning Models | Task taxonomy, latency/cost analysis, hybrid routing patterns, production pipelines |
| 08 | Evaluating Reasoning Models | AIME, MATH-500, Codeforces, ARC-AGI, GPQA Diamond, contamination concerns |
Key Concepts
Test-time compute - spending more inference-time compute (more tokens, more sampling passes) to improve output quality, as an alternative to training a bigger model.
Chain-of-thought (CoT) - prompting a model to produce intermediate reasoning steps before arriving at a final answer, which often substantially improves accuracy on multi-step problems but can hurt on tasks that do not need it.
Process reward model (PRM) - a reward model trained to score individual reasoning steps, not just final answers, enabling dense supervision for complex problem solving.
MCTS for LLMs - applying Monte Carlo Tree Search over the space of reasoning steps, using a value function (for example, a PRM) to guide exploration.
o1 / o3 - OpenAI's reasoning model series, trained with RL to produce an extended hidden chain-of-thought before answering. At release, these models reported leading results on competition math, coding, and PhD-level science benchmarks.
DeepSeek-R1 - DeepSeek's open-weights reasoning model, trained using Group Relative Policy Optimization (GRPO). Its R1-Zero precursor showed that strong reasoning can emerge from pure RL on verifiable outcome rewards, without a supervised fine-tuning stage or step-level labels.
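
Two of the concepts above are small enough to sketch in a few lines of Python. The snippet below is a minimal illustration, not any lab's actual implementation: `majority_vote` shows the simplest form of test-time compute (sample N answers, keep the most common one), and `group_relative_advantages` shows the normalization at the heart of GRPO, where each sampled completion's reward is compared against the mean and standard deviation of its own group. The function names, the sample data, the use of the population standard deviation, and the all-zeros fallback for a zero-variance group are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote(answers):
    """Return the most common final answer among N sampled completions."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Ties are broken in favor of the answer that was sampled first.
    return Counter(answers).most_common(1)[0][0]


def group_relative_advantages(rewards):
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation, and an all-zero result
    when every completion in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # No completion is better than another, so there is no signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Hypothetical final answers extracted from five sampled completions.
sampled_answers = ["42", "41", "42", "42", "17"]
print(majority_vote(sampled_answers))

# Hypothetical rule-based outcome rewards (1.0 = correct, 0.0 = wrong).
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
```

Because the advantage is computed relative to the group, GRPO does not need a separate learned value model as a baseline; the other samples for the same prompt play that role.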
