Module 10: Reasoning Models

The biggest shift in LLM capability in 2024–2025 was not a bigger training run. It was a new question: what if you gave the model more time to think at inference?

This module covers the ideas, architectures, and engineering decisions that power today's reasoning models - from the theoretical basis of test-time compute scaling, to the concrete training algorithms behind OpenAI o1 and DeepSeek-R1, to the practical questions of when to use these models in production and how to evaluate them honestly.

Module Map

Lessons at a Glance

#	Lesson	What You Will Learn
01	Test-Time Compute	The paradigm shift from training-time scaling to inference-time scaling; best-of-N, majority voting, and how accuracy scales with compute
02	Chain-of-Thought at Inference	Wei et al. 2022, zero-shot and few-shot CoT, self-consistency, process vs outcome supervision, when CoT hurts
03	OpenAI o1 and o3	Hidden chain-of-thought, RL from process rewards, compute budget tokens, o3's ARC-AGI results
04	DeepSeek-R1	Pure RL with GRPO, R1-Zero experiment, SFT cold start, distillation to smaller models, open weights
05	Process Reward Models	Outcome vs process supervision, Lightman et al. 2023, Math-Shepherd, using PRMs for search
06	MCTS for LLMs	Classic MCTS adapted to token sequences, AlphaCode 2, Tree-of-Thought, compute vs quality trade-offs
07	When to Use Reasoning Models	Task taxonomy, latency/cost analysis, hybrid routing patterns, production pipelines
08	Evaluating Reasoning Models	AIME, MATH-500, Codeforces, ARC-AGI, GPQA Diamond, contamination concerns

Key Concepts

Test-time compute - spending more inference-time compute (more tokens, more sampling passes) to improve output quality, as an alternative to training a bigger model.
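
As a rough illustration of the "more sampling passes" idea, the sketch below samples several answers and returns the most common one (majority voting). The `generate` callable is a hypothetical stand-in for whatever model call you use; it is not part of any specific library.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def answer_with_voting(generate: Callable[[str], str], prompt: str, n: int) -> str:
    """Sample n answers from a (hypothetical) generate function and vote."""
    samples = [generate(prompt) for _ in range(n)]
    return majority_vote(samples)
```

Raising `n` spends more inference compute on the same question; whether accuracy improves depends on how often the model's individual samples are correct.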

Chain-of-thought (CoT) - prompting a model to produce intermediate reasoning steps before arriving at a final answer, which often improves accuracy substantially on multi-step problems.

Process reward model (PRM) - a reward model trained to score individual reasoning steps, not just final answers, enabling dense supervision for complex problem solving.
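
A minimal sketch of how per-step scores might be used to rerank candidate solutions. The `score_step` callable is a hypothetical placeholder for a trained PRM, and the two aggregation rules shown (minimum and product of step scores) are common choices rather than the only options.

```python
import math
from typing import Callable, List, Sequence


def solution_score(step_scores: Sequence[float], how: str = "min") -> float:
    """Collapse per-step scores in [0, 1] into one solution-level score."""
    if not step_scores:
        raise ValueError("need at least one step score")
    if how == "min":
        return min(step_scores)        # a solution is as good as its weakest step
    if how == "prod":
        return math.prod(step_scores)  # treat scores as independent probabilities
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(
    candidates: List[List[str]],
    score_step: Callable[[List[str], str], float],  # hypothetical PRM: (prefix, step) -> score
    how: str = "min",
) -> List[str]:
    """Return the candidate (a list of steps) with the highest aggregated score."""
    def score(steps: List[str]) -> float:
        per_step = [score_step(steps[:i], s) for i, s in enumerate(steps)]
        return solution_score(per_step, how)

    return max(candidates, key=score)
```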

MCTS for LLMs - applying Monte Carlo Tree Search over the space of reasoning steps, using a value estimate (for example, PRM scores) to guide exploration.
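
The selection and backup steps of classic MCTS can be sketched as below, with each node standing for one reasoning step. This is a simplified illustration using the standard UCT rule; the `Node` class, the exploration constant `c`, and the idea that `value` comes from a PRM or final-answer check are assumptions for the example, not a description of any particular system.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    step: str                      # text of the reasoning step at this node
    visits: int = 0
    value_sum: float = 0.0         # sum of values backed up through this node
    children: List["Node"] = field(default_factory=list)


def uct(parent_visits: int, child: Node, c: float = 1.4) -> float:
    """Mean value plus an exploration bonus that shrinks as visits grow."""
    if child.visits == 0:
        return math.inf            # always try unvisited children first
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(parent: Node, c: float = 1.4) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct(parent.visits, ch, c))


def backup(path: List[Node], value: float) -> None:
    """Propagate a value (e.g. from a PRM or answer check) up the visited path."""
    for node in path:
        node.visits += 1
        node.value_sum += value
```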

o1 / o3 - OpenAI's reasoning model series, trained with RL to produce extended hidden chain-of-thought before answering. At release, these models set state-of-the-art results on competition math, coding, and PhD-level science benchmarks.

DeepSeek-R1 - DeepSeek's open-weights reasoning model, trained using Group Relative Policy Optimization (GRPO). Its R1-Zero precursor showed that strong reasoning can emerge from pure RL on rule-based outcome rewards, without supervised fine-tuning on reasoning traces.
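
The "group relative" part of GRPO can be illustrated with a short sketch: several responses are sampled for the same prompt, and each response's advantage is its reward normalized by the group's mean and standard deviation, so no separate value network is needed. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for this example; the full objective (clipped policy ratio, KL penalty) is omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)        # assumption: population std over the group
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, a group with rewards `[1, 0, 0, 1]` yields advantages of roughly `[1, -1, -1, 1]`, while a group where every response earns the same reward yields all zeros and so contributes no learning signal.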
