Module 10: Reasoning Models

The biggest shift in LLM capability in 2024–2025 was not a bigger training run. It was a new question: what if you gave the model more time to think at inference?

This module covers the ideas, architectures, and engineering decisions that power today's reasoning models - from the theoretical basis of test-time compute scaling, to the concrete training algorithms behind OpenAI o1 and DeepSeek-R1, to the practical questions of when to use these models in production and how to evaluate them honestly.

Lessons at a Glance

# | Lesson | What You Will Learn
01 | Test-Time Compute | The paradigm shift from training-time scaling to inference-time scaling; best-of-N, majority voting, and how accuracy scales with compute
02 | Chain-of-Thought at Inference | Wei et al. 2022, zero-shot and few-shot CoT, self-consistency, process vs outcome supervision, when CoT hurts
03 | OpenAI o1 and o3 | Hidden chain-of-thought, RL from process rewards, compute budget tokens, o3's ARC-AGI results
04 | DeepSeek-R1 | Pure RL with GRPO, R1-Zero experiment, SFT cold start, distillation to smaller models, open weights
05 | Process Reward Models | Outcome vs process supervision, Lightman et al. 2023, Math-Shepherd, using PRMs for search
06 | MCTS for LLMs | Classic MCTS adapted to token sequences, AlphaCode 2, Tree-of-Thought, compute vs quality trade-offs
07 | When to Use Reasoning Models | Task taxonomy, latency/cost analysis, hybrid routing patterns, production pipelines
08 | Evaluating Reasoning Models | AIME, MATH-500, Codeforces, ARC-AGI, GPQA Diamond, contamination concerns

Key Concepts

Test-time compute - spending more inference-time compute (more tokens, more sampling passes) to improve output quality, as an alternative to training a bigger model.
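
As a rough illustration of the "more sampling passes" idea, the sketch below applies majority voting to the final answers of several sampled completions. The sample answers are made up for the example, and `majority_vote` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer and the fraction of samples that agree.

    `answers` holds the final answers extracted from N independently
    sampled completions (hypothetical data in this sketch).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    answer, count = counts.most_common(1)[0]
    return answer, count / len(answers)


samples = ["42", "41", "42", "42", "17"]
winner, agreement = majority_vote(samples)
print(winner, agreement)
```

Raising N costs more inference compute, and the agreement fraction gives a crude confidence signal for the chosen answer.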

Chain-of-thought (CoT) - prompting a model to produce intermediate reasoning steps before arriving at a final answer, which often improves accuracy substantially on multi-step problems.

Process reward model (PRM) - a reward model trained to score individual reasoning steps, not just final answers, enabling dense supervision for complex problem solving.
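
One way to picture step-level scoring is as a reranker over candidate solutions. The sketch below assumes per-step scores in [0, 1] have already been produced by some PRM; the scores, the candidate answers, and the helper names are all hypothetical. Taking the minimum or the product over steps are two commonly used ways to turn step scores into one solution score.

```python
import math


def aggregate_step_scores(step_scores, how="min"):
    """Collapse per-step PRM scores into a single solution-level score."""
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    if how == "min":
        # A chain is only as good as its weakest step.
        return min(step_scores)
    if how == "prod":
        # Treat scores as independent per-step correctness probabilities.
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def rerank(candidates, how="min"):
    """Pick the (answer, step_scores) pair with the best aggregated score."""
    return max(candidates, key=lambda c: aggregate_step_scores(c[1], how))


# Hypothetical candidates: the second has one badly scored middle step.
candidates = [
    ("x = 4", [0.9, 0.8, 0.95]),
    ("x = 5", [0.99, 0.2, 0.99]),
]
best_answer, best_steps = rerank(candidates)
print(best_answer)
```

An outcome-only reward model would see just the final answer, so it could not single out the weak middle step in the second candidate.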

MCTS for LLMs - applying Monte Carlo Tree Search over the space of reasoning steps, using a value function (for example, a PRM) to guide exploration.
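
The selection step of such a search can be sketched with the standard UCT rule, which balances a node's average value against how rarely it has been visited. This is a minimal sketch: the node dictionaries, the exploration constant `c=1.4`, and the example values are assumptions made for illustration.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=1.4):
    """UCT score: mean value plus an exploration bonus for rarely visited nodes."""
    if visits == 0:
        # Unvisited children are always tried first.
        return math.inf
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.4):
    """Choose the candidate next reasoning step with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["value_sum"], ch["visits"], parent_visits, c),
    )


# Hypothetical children of one node; value_sum accumulates value estimates
# (for example, PRM scores) backed up from earlier simulations.
children = [
    {"step": "A", "value_sum": 2.7, "visits": 3},
    {"step": "B", "value_sum": 0.5, "visits": 1},
    {"step": "C", "value_sum": 0.0, "visits": 0},
]
print(select_child(children, parent_visits=4)["step"])
```

A larger `c` spends more compute exploring unfamiliar branches, while a smaller one concentrates on steps that already look good.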

o1 / o3 - OpenAI's reasoning model series, trained with RL to produce an extended hidden chain-of-thought before answering. At release, these models reported state-of-the-art results on competition math, coding, and PhD-level science benchmarks.

DeepSeek-R1 - DeepSeek's open-weights reasoning model, trained using Group Relative Policy Optimization (GRPO). Its R1-Zero precursor showed that strong reasoning can emerge from pure RL with rule-based outcome rewards, without supervised fine-tuning on labeled reasoning traces.
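
The "group relative" part of GRPO can be sketched in a few lines: several responses are sampled for the same prompt, and each response's advantage is its reward normalized by the group's mean and standard deviation, so no separate value network is needed. The sketch below covers only this advantage computation, not the full policy update. The reward values, the use of the population standard deviation, and the `eps` stabilizer are assumptions of the example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for a single
    prompt. `eps` (an assumption of this sketch) avoids division by zero
    when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat their group average get a positive advantage and are reinforced; a group where every response earns the same reward contributes no learning signal.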

© 2026 EngineersOfAI. All rights reserved.