Chain-of-Thought Reasoning at Inference Time
The Moment the Model Started Showing Its Work
You're working at an edtech company. Your team built a tutoring assistant that helps students solve algebra problems. The model - a capable 2021-era LLM - was confident, fast, and wrong about 40% of the time. Not random wrong. Confidently, fluently, definitively wrong. It would produce a beautiful-sounding explanation for an incorrect answer, and a struggling student had no way to know they'd been misled.
Your team tried fine-tuning. You tried better system prompts. You added a "double-check your work" instruction. Nothing moved the needle more than a few points. The model would breeze through a problem, reach the wrong answer, and state it with the same confident tone as when it was right.
Then one of your engineers read a new preprint on zero-shot prompting (Kojima et al., 2022, discussed below). The paper showed something almost comically simple: if you append "Let's think step by step" to a math problem prompt, accuracy on multi-step problems jumps dramatically. Not because the model learned anything new. Not because the model is bigger. Just because those five words change how the model approaches the problem - it starts generating intermediate reasoning steps, working through the problem explicitly, rather than leaping to a final answer.
Your team tried it. Accuracy on multi-step algebra went from 38% to 61% overnight. No retraining. No new data. Just a different prompting strategy.
This is the story of chain-of-thought prompting - one of the most important ideas in modern LLM engineering.
Why This Exists - The Problem with Direct Answer Prompting
When you prompt a language model with a question and ask for a direct answer, you're asking it to do something very unnatural: compress a complex multi-step reasoning process into a single token sequence that jumps directly from question to answer.
For simple questions - "What is the capital of France?" - this works perfectly. The answer "Paris" is a single fact encoded directly during pre-training. No reasoning required.
But for a problem like: "A store is offering a 20% discount. The original price is $85. After the discount, there's an 8% sales tax. What is the final price?"
The correct answer requires:
- Calculate the discount amount: $85 \times 0.20 = \$17$
- Apply the discount: $85 - 17 = \$68$
- Calculate the tax: $68 \times 0.08 = \$5.44$
- Add the tax: $68 + 5.44 = \$73.44$
When a model is asked for the direct answer, it's trying to compute $73.44 in a single generation step. Each token is a single forward pass through the network - there's no "working memory" between tokens beyond what's in the context window. The model has to encode all of this into the token probability distribution for the answer.
Chain-of-thought solves this by making the reasoning process explicit in the token sequence itself. When the model generates "Step 1: The discount is 17..." that intermediate result is now part of the context window, which the model can attend to when computing subsequent steps. The model's "working memory" is the context window, and chain-of-thought fills that working memory with the right intermediate computations.
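The four steps above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `discounted_price_trace` is hypothetical, and `Decimal` is used so the cents come out exactly. Each intermediate result is written into a list of steps before the next one is computed, which mirrors how a chain-of-thought response leaves its intermediate values in the context window for later steps to read.

```python
from decimal import Decimal
from typing import List, Tuple


def discounted_price_trace(price: str, discount: str, tax: str) -> Tuple[List[str], Decimal]:
    """Work through a discount-then-tax price, recording every intermediate step.

    Hypothetical helper for illustration; arguments are decimal strings
    such as "85", "0.20", "0.08".
    """
    p, d, t = Decimal(price), Decimal(discount), Decimal(tax)
    steps: List[str] = []

    discount_amount = p * d
    steps.append(f"Step 1: discount = {p} x {d} = {discount_amount:.2f}")

    discounted = p - discount_amount
    steps.append(f"Step 2: discounted price = {p} - {discount_amount:.2f} = {discounted:.2f}")

    tax_amount = discounted * t
    steps.append(f"Step 3: tax = {discounted:.2f} x {t} = {tax_amount:.2f}")

    total = discounted + tax_amount
    steps.append(f"Step 4: final price = {discounted:.2f} + {tax_amount:.2f} = {total:.2f}")

    return steps, total


steps, total = discounted_price_trace("85", "0.20", "0.08")
for line in steps:
    print(line)
```

A direct-answer prompt asks the model to produce only the last number; the trace makes the three values it depends on explicit.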
Historical Context - The Wei et al. (2022) Breakthrough
The chain-of-thought paper - "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" by Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou at Google Brain - was released as a preprint in January 2022 and published at NeurIPS 2022.
The key experimental finding that surprised the field: chain-of-thought prompting only helped on sufficiently large models. On the smaller models tested, adding chain-of-thought had no effect or actually hurt performance, because those models tended to produce fluent but illogical reasoning chains. Substantial gains appeared only at around 100B parameters and above.
This "emergent" behavior was unexpected. It suggested that the ability to reason through intermediate steps is not present in small models - it's a capability that emerges from scale. The model needs sufficient representation capacity to track and compute through intermediate steps.
The second major finding: chain-of-thought effects are largest on the hardest problems. For simple single-step arithmetic, direct answers are often fine. On GSM8K, a benchmark of multi-step grade-school word problems, chain-of-thought raised PaLM 540B from 17.9% to 56.9%, a gain of about 39 percentage points.
Around the same time, Kojima et al. (2022, "Large Language Models are Zero-Shot Reasoners") showed that you don't even need exemplars. Simply appending "Let's think step by step" to a prompt - zero-shot, no examples - induces the model to generate a chain of reasoning. This "zero-shot CoT" substantially outperformed standard zero-shot prompting on arithmetic and symbolic reasoning benchmarks. It generally trailed few-shot CoT with hand-written exemplars, but it is dramatically simpler to implement.
The Two Flavors: Few-Shot and Zero-Shot CoT
Few-Shot Chain-of-Thought
In few-shot CoT (Wei et al., 2022), you provide several example (question, chain-of-thought, answer) triples in the prompt. These examples demonstrate what reasoning looks like.
import re

FEW_SHOT_COT_PROMPT = """
Solve each math word problem by thinking step by step.

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls.
How many tennis balls does he have now?
A: Roger starts with 5 balls. He buys 2 cans of 3 balls each, so 2 x 3 = 6 more balls.
5 + 6 = 11 balls total. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A: The cafeteria starts with 23 apples. They use 20, leaving 23 - 20 = 3 apples.
Then they buy 6 more: 3 + 6 = 9 apples. The answer is 9.

Q: {question}
A:"""


def few_shot_cot(question: str, model_fn) -> str:
    """
    Few-shot CoT: provide examples of reasoning, then solve the new question.

    Args:
        question: The math problem to solve
        model_fn: A callable that takes a prompt and returns a completion

    Returns:
        The model's chain-of-thought + answer
    """
    prompt = FEW_SHOT_COT_PROMPT.format(question=question)
    return model_fn(prompt)


def extract_answer_from_cot(cot_text: str) -> str:
    """
    Extract the final numerical answer from a chain-of-thought response.

    Looks for patterns like "The answer is X" or "= X" at the end of a line,
    and falls back to the last number in the text. Returns "unknown" if no
    number is found.
    """
    # A number with optional thousands separators and decimals, e.g. 1,234.50
    num = r"\d[\d,]*(?:\.\d+)?"

    # Try "The answer is X" first; use the last occurrence, since the final
    # answer is stated at the end of the reasoning
    matches = re.findall(rf"[Tt]he answer is\s*:?\s*\$?(-?{num})", cot_text)
    if matches:
        return matches[-1].replace(",", "")

    # Try "= X" at the end of a line, again preferring the last one
    matches = re.findall(rf"=\s*\$?(-?{num})\s*\.?\s*$", cot_text, re.MULTILINE)
    if matches:
        return matches[-1].replace(",", "")

    # Fall back to the last number anywhere in the text
    numbers = re.findall(num, cot_text)
    if numbers:
        return numbers[-1].replace(",", "")

    return "unknown"
Zero-Shot Chain-of-Thought
Zero-shot CoT (Kojima et al., 2022) is simpler, and with capable models it is often competitive:
def zero_shot_cot(question: str, model_fn) -> dict:
    """
    Two-stage zero-shot CoT:
    Stage 1: Elicit the chain of thought with "Let's think step by step"
    Stage 2: Extract the final answer with "Therefore, the answer is:"
    """
    # Stage 1: Generate the reasoning
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = model_fn(reasoning_prompt)

    # Stage 2: Feed the reasoning back and ask only for the final answer
    answer_prompt = (
        f"Q: {question}\n"
        f"A: Let's think step by step. {reasoning}\n"
        f"Therefore, the answer (arabic numerals) is:"
    )
    answer = model_fn(answer_prompt)

    return {
        "question": question,
        "reasoning": reasoning,
        "answer": answer.strip(),
    }


# Modern version with a single prompt (works well with capable models)
def zero_shot_cot_simple(question: str) -> str:
    """
    Build a modern zero-shot CoT prompt -- works well with Claude, GPT-4, etc.
    Returns the prompt string; the caller sends it to the model.
    """
    return f"""Solve this step by step.

Problem: {question}

Work through each step carefully. Show your reasoning. State your final answer clearly at the end."""
The elegance of zero-shot CoT is that it works without any task-specific examples. You can use the same "Let's think step by step" instruction across wildly different problem types and get consistent improvements.
Self-Consistency - Taking a Majority Vote
Wang et al. (2022), "Self-Consistency Improves Chain of Thought Reasoning in Language Models," built on the CoT foundation with a powerful insight: there are many valid reasoning paths to a correct answer, but usually only one correct answer.
If you sample multiple chain-of-thought completions at high temperature (to get diversity), the correct answer tends to appear more often than any specific wrong answer. Taking a majority vote over the final answers is remarkably effective.
from collections import Counter
from typing import Callable, List, Optional, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """
    Take a majority vote over a list of answers.
    Returns the plurality answer and its confidence (fraction of votes).
    """
    if not answers:
        raise ValueError("Need at least one answer")

    # Normalize so that " 11" and "11" count as the same vote
    normalized = [a.strip().lower() for a in answers]
    most_common_answer, count = Counter(normalized).most_common(1)[0]
    confidence = count / len(answers)

    # Return the first original answer that matches the winning normalized form
    winner = next(a for a, n in zip(answers, normalized) if n == most_common_answer)
    return winner.strip(), confidence


def self_consistency(
    problem: str,
    model_fn: Callable,
    n_samples: int = 20,
    temperature: float = 0.7,
    answer_extractor: Optional[Callable] = None,
) -> dict:
    """
    Self-consistency: sample N CoT completions, take majority vote.

    Args:
        problem: The problem to solve
        model_fn: Callable(prompt, temperature) -> completion
        n_samples: Number of independent samples to draw
        temperature: Sampling temperature (must be > 0 for diversity)
        answer_extractor: Function to extract final answer from CoT text

    Returns:
        dict with final_answer, confidence, vote_counts, all_answers,
        all_samples, n_samples
    """
    if temperature <= 0:
        raise ValueError("Temperature must be > 0 for self-consistency to work")
    if answer_extractor is None:
        answer_extractor = extract_answer_from_cot

    prompt = f"""Solve this problem step by step.

Problem: {problem}

Work through each step. State your final answer clearly at the end as "The answer is [number]"."""

    completions = []
    answers = []
    for _ in range(n_samples):
        completion = model_fn(prompt, temperature=temperature)
        completions.append(completion)
        answers.append(answer_extractor(completion))

    # Count votes on normalized answers, matching what majority_vote does,
    # so the reported counts and the confidence always agree
    vote_counts = Counter(a.strip().lower() for a in answers)
    final_answer, confidence = majority_vote(answers)

    return {
        "final_answer": final_answer,
        "confidence": confidence,
        "vote_counts": dict(vote_counts),
        "all_answers": answers,
        "all_samples": completions,
        "n_samples": n_samples,
    }
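In Wang et al. (2022), self-consistency improved PaLM 540B accuracy over single-sample CoT by roughly 12 to 18 percentage points on arithmetic benchmarks such as AQuA and GSM8K. Intuitively, the gains tend to be largest when single-sample accuracy is middling: if the model is almost always right there is little left to fix, and if it is almost always wrong the vote has nothing correct to converge on.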
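To see why voting helps, and when it does not, here is a small simulation under a deliberately simplified model. The assumptions are mine, not from the paper: each sample is independently correct with probability `p_correct`, and otherwise lands uniformly on one of `n_wrong_answers` distinct wrong answers. The function name `simulate_vote_accuracy` is hypothetical.

```python
import random
from collections import Counter


def simulate_vote_accuracy(
    p_correct: float,
    n_wrong_answers: int,
    n_samples: int,
    trials: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate how often a plurality vote over n_samples picks the right answer.

    Toy model (an assumption for illustration): samples are independent,
    correct with probability p_correct, otherwise spread uniformly over
    n_wrong_answers different wrong answers.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        votes = Counter()
        for _ in range(n_samples):
            if rng.random() < p_correct:
                votes["correct"] += 1
            else:
                votes[f"wrong_{rng.randrange(n_wrong_answers)}"] += 1
        # Ties are broken by whichever answer was seen first
        top_answer, _ = votes.most_common(1)[0]
        wins += top_answer == "correct"
    return wins / trials


# Wrong answers are scattered: voting lifts accuracy well above the single-sample rate
print(simulate_vote_accuracy(p_correct=0.5, n_wrong_answers=4, n_samples=1))
print(simulate_vote_accuracy(p_correct=0.5, n_wrong_answers=4, n_samples=21))

# Wrong answers all agree (a systematic error): voting makes things worse
print(simulate_vote_accuracy(p_correct=0.4, n_wrong_answers=1, n_samples=21))
```

The last call illustrates the main failure mode: if the model's errors are concentrated on one wrong answer that is more likely than the right one, more samples only make the vote more confidently wrong.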
Process Supervision vs. Outcome Supervision
One of the most important conceptual distinctions in reasoning model research is between outcome supervision and process supervision.
Outcome Supervision
In outcome supervision, you only give the model feedback on whether the final answer was correct. The model generates a solution, you check the last line, and you signal "right" or "wrong." This is simple to implement but has a major flaw: it provides no signal about where the reasoning went wrong.
Consider a student who writes 5 steps of algebra and makes an error in step 3. Outcome supervision tells them "wrong." It gives no information about whether steps 1, 2, 4, and 5 were correct.
Process Supervision
Process supervision gives feedback on each intermediate reasoning step. For a 5-step solution, you get 5 pieces of feedback, not one. The model receives dense training signal - it knows that step 1 was correct, step 2 was correct, step 3 was wrong.
Lightman et al. (2023) "Let's Verify Step by Step" showed that process reward models (PRMs) trained with step-level supervision significantly outperform outcome reward models (ORMs) at selecting correct solutions, especially on challenging math problems.
The tradeoff: process supervision requires much more expensive annotation. You need human (or automated) verification at each reasoning step, not just at the final answer. For math, this can be partially automated (checking algebraic equivalence step by step), but for open-ended reasoning, human annotation is typically still required.
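The difference in signal density can be made concrete with a toy sketch. All names here are hypothetical, and the per-step probabilities stand in for the output of a trained process reward model. Multiplying step probabilities is one simple way to turn them into a single solution score (taking the minimum is another); it is shown here as an assumption, not as the only option.

```python
from math import prod
from typing import List, Optional, Tuple


def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome supervision: one signal for the whole solution."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_rewards(step_is_correct: List[bool]) -> List[float]:
    """Process supervision: one signal per reasoning step."""
    return [1.0 if ok else 0.0 for ok in step_is_correct]


def first_error(step_is_correct: List[bool]) -> Optional[int]:
    """Index of the first incorrect step, or None if every step is correct."""
    for index, ok in enumerate(step_is_correct):
        if not ok:
            return index
    return None


def solution_score(step_probs: List[float]) -> float:
    """Aggregate per-step correctness probabilities into one score (product)."""
    return prod(step_probs)


def best_of_n(candidates: List[Tuple[str, List[float]]]) -> str:
    """Pick the candidate answer whose reasoning steps score highest."""
    return max(candidates, key=lambda c: solution_score(c[1]))[0]


# A five-step solution with a mistake in step 3 (index 2)
labels = [True, True, False, True, True]
print(outcome_reward("42", "41"))   # a single bit: wrong
print(process_rewards(labels))      # five signals, one per step
print(first_error(labels))          # locates where the reasoning broke
```

With step-level scores, a reranker can prefer a solution whose steps are all moderately likely over one that contains a single very doubtful step.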
Faithfulness of Chain-of-Thought
This is one of the most intellectually interesting and practically important questions in the field: is the model's generated chain-of-thought actually causal to its final answer, or is it post-hoc rationalization?
The disturbing finding from several papers (Turpin et al., 2023; Lanham et al., 2023) is that models sometimes produce plausible-looking chains of thought that do not faithfully reflect their internal reasoning process. The model may have already "decided" on an answer through its internal computations, and the chain of thought is a narrative explanation generated to be consistent with that answer - not the actual reasoning that produced it.
Evidence for unfaithfulness:
- Adding misleading information to the problem (but not to the chain of thought) shifts the final answer, even when the chain of thought doesn't acknowledge the misleading information
- Models maintain biased final answers even when their chain of thought correctly identifies the unbiased approach
- Counterfactual interventions on chain-of-thought steps don't always change final answers as expected
Evidence for faithfulness:
- Removing chain-of-thought from the context reduces accuracy on tasks where CoT helps
- Models that produce longer, more detailed chains of thought tend to have higher accuracy
- Chain-of-thought text does encode problem-relevant information that the model attends to
The current consensus: CoT is partially faithful. Some of the reasoning is genuinely used by the model; some is generated as a post-hoc narrative. The ratio varies by model, task, and whether the model was specifically trained to reason faithfully (as in o1-style training).
def test_cot_faithfulness(
    problem: str,
    model_fn: Callable,
    n_counterfactuals: int = 5,
) -> dict:
    """
    Test whether a model's CoT is faithful via counterfactual interventions.

    Strategy:
    1. Generate CoT + answer for original problem
    2. Inject a corrupted reasoning fragment pointing toward a different value
    3. Check if the final answer changes when the model continues from it

    If the model is faithful, corrupting the reasoning should change the answer.
    If unfaithful, the model will "fix" the CoT to reach its predetermined answer.

    This is a crude proxy: the injected hint is arbitrary, so a changed answer
    shows sensitivity to the context rather than proving faithful reasoning.
    """
    if n_counterfactuals < 1:
        raise ValueError("n_counterfactuals must be at least 1")

    original_prompt = f"Solve step by step:\n\n{problem}\n\nStep 1:"
    original_cot = model_fn(original_prompt)
    original_answer = extract_answer_from_cot(original_cot)

    faithfulness_results = []
    for i in range(n_counterfactuals):
        # Feed back a corrupted reasoning chain
        wrong_intermediate = i * 10 + 5  # Some (probably) wrong value
        corrupted_prompt = (
            f"Solve step by step:\n\n{problem}\n\n"
            f"[Reasoning so far: I note the key value here is {wrong_intermediate}. "
            f"Working from that...]\n\nContinue and give the final answer:"
        )
        corrupted_response = model_fn(corrupted_prompt)
        corrupted_answer = extract_answer_from_cot(corrupted_response)
        changed = corrupted_answer != original_answer
        faithfulness_results.append({
            "corrupted_hint": wrong_intermediate,
            "answer_changed": changed,
            "corrupted_answer": corrupted_answer,
        })

    # Fraction of interventions that changed the final answer
    faithfulness_rate = (
        sum(r["answer_changed"] for r in faithfulness_results) / n_counterfactuals
    )

    return {
        "original_answer": original_answer,
        "faithfulness_rate": faithfulness_rate,
        "counterfactual_results": faithfulness_results,
        "interpretation": (
            "Likely faithful" if faithfulness_rate > 0.7
            else "Possibly unfaithful" if faithfulness_rate > 0.3
            else "Likely unfaithful (post-hoc rationalization)"
        ),
    }
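A complementary probe, in the spirit of the early-answering test described by Lanham et al. (2023), truncates the chain of thought after each step and asks for an answer from the partial reasoning. The sketch below is a simplified illustration, not their implementation: `early_answering_probe` and the two toy answer functions are hypothetical, and `answer_fn` stands in for a model call that receives the question plus a reasoning prefix.

```python
from typing import Callable, List


def early_answering_probe(
    question: str,
    steps: List[str],
    answer_fn: Callable[[str, List[str]], str],
) -> dict:
    """Ask for an answer after 0, 1, ..., len(steps) reasoning steps.

    If the answer from truncated reasoning already matches the final answer
    most of the time, the later steps are not changing the outcome, which
    hints at post-hoc rationalization.
    """
    answers = [answer_fn(question, steps[:k]) for k in range(len(steps) + 1)]
    final_answer = answers[-1]
    truncated = answers[:-1]  # everything except the full chain
    matching = sum(a == final_answer for a in truncated)
    fraction = matching / len(truncated) if truncated else 0.0
    return {
        "answers_by_prefix": answers,
        "final_answer": final_answer,
        "fraction_matching_final": fraction,
    }


# Toy stand-ins for a model (assumptions for illustration only)
def ignores_reasoning(question: str, steps: List[str]) -> str:
    """Always gives the same answer, whatever reasoning it is shown."""
    return "73.44"


def reads_reasoning(question: str, steps: List[str]) -> str:
    """Answers with the result of the last step it was shown."""
    return steps[-1].split("=")[-1].strip() if steps else "?"


steps = ["85 x 0.20 = 17", "85 - 17 = 68", "68 x 0.08 = 5.44", "68 + 5.44 = 73.44"]
print(early_answering_probe("final price?", steps, ignores_reasoning))
print(early_answering_probe("final price?", steps, reads_reasoning))
```

A high `fraction_matching_final` is also what you would see on an easy problem the model can solve without reasoning, so the number is best compared across tasks rather than read in isolation.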
When Chain-of-Thought Hurts
There are well-documented cases where CoT reduces accuracy:
Simple tasks: For problems where the answer is a direct lookup (capital cities, historical dates, unit conversions), CoT can introduce unnecessary steps that create opportunities for errors.
Perceptual tasks: Tasks that require pattern recognition or classification don't benefit from step-by-step verbal reasoning.
Tasks with short optimal reasoning paths: Some problems have one-step solutions. Forcing step-by-step reasoning makes the model generate filler steps.
When the model is wrong about the domain: If the model has incorrect beliefs, a chain of thought makes those beliefs explicit and amplifies their influence.
def adaptive_cot_selector(
    question: str,
    question_classifier: Callable,
) -> str:
    """
    Decide whether to use CoT based on question type.
    Returns the appropriate prompt.
    """
    question_type = question_classifier(question)

    cot_beneficial_types = {
        "multi_step_math", "logical_reasoning", "complex_code",
        "multi_hop_qa", "planning", "scientific_calculation"
    }
    cot_harmful_types = {
        "simple_factual", "direct_lookup", "perceptual", "short_answer"
    }

    if question_type in cot_beneficial_types:
        return f"Solve this step by step:\n\n{question}"
    elif question_type in cot_harmful_types:
        return f"Answer concisely:\n\n{question}"
    else:
        # Unknown type: allow a little reasoning without forcing a long chain
        return f"Think briefly, then answer:\n\n{question}"
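The selector above takes a `question_classifier` but leaves its implementation open. A minimal keyword-based stand-in might look like the following. It is a hypothetical sketch with hand-picked rules, useful for wiring up a pipeline; a real system would more likely use a small trained classifier or a cheap model call. The labels it returns are chosen to match the type names used in the selector.

```python
import re


def keyword_question_classifier(question: str) -> str:
    """Very rough heuristic classifier (hypothetical; rules are illustrative).

    Returns one of: "complex_code", "multi_step_math", "simple_factual", "other".
    """
    text = question.lower()

    # Mentions of runtime failures suggest a debugging question
    if re.search(r"\b(traceback|exception|segfault|stack trace)\b", text):
        return "complex_code"

    # Two or more numbers usually means there is arithmetic to chain together
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    if len(numbers) >= 2:
        return "multi_step_math"

    # Short "what is / who was ..." questions are usually direct lookups
    is_wh_question = re.match(r"\s*(what|who|when|where) (is|was|are|were)\b", text)
    if is_wh_question and len(text.split()) <= 12:
        return "simple_factual"

    return "other"


print(keyword_question_classifier("What is the capital of France?"))
print(keyword_question_classifier(
    "A store offers a 20% discount on an $85 item, then adds 8% tax. Final price?"
))
```

Anything the rules do not recognize falls through to `"other"`, which the selector maps to its middle-ground "think briefly" prompt.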
Performance on Key Benchmarks
| Benchmark | Direct Answer | Few-shot CoT | Self-Consistency (N=40) |
|---|---|---|---|
| GSM8K (grade school math) | 17.9% | 56.9% | 74.4% |
| AQuA (algebraic word problems) | 25.2% | 35.8% | 48.3% |
| CommonsenseQA | 78.1% | 79.9% | 80.7% |
Direct-answer and few-shot CoT figures are from Wei et al. (2022); self-consistency figures are from Wang et al. (2022). All use PaLM 540B.
The pattern is consistent: multi-step math benefits most, commonsense question answering benefits only slightly at this model scale, and self-consistency adds further gains on top of CoT.
Production Engineering Notes
Prompt Templates That Work
COT_TEMPLATES = {
    "math_cot": (
        "Solve this step by step, showing all work. "
        "State your final answer as 'The answer is [value].'\n\n{problem}"
    ),
    "code_debug_cot": (
        "Debug this code step by step:\n"
        "1. Identify what the code is supposed to do\n"
        "2. Trace through the execution\n"
        "3. Identify the bug\n"
        "4. Write the fix\n\n{code_and_error}"
    ),
    "multihop_cot": (
        "Answer this by identifying and reasoning through each piece of "
        "information needed, then combining them:\n\n{question}"
    ),
    "zero_shot": (
        "{question}\n\nLet's think step by step."
    ),
}
Structured Output Extraction
When using CoT in production, parsing the final answer reliably is crucial:
import re
from typing import Optional


def extract_structured_answer(
    cot_text: str,
    answer_format: str = "number",
) -> Optional[str]:
    """
    Extract the final answer from a CoT completion.
    Returns None if nothing matching answer_format is found.
    """
    if answer_format == "number":
        # Patterns are tried in order of reliability; each allows an optional
        # "$", a leading minus sign, and thousands separators
        patterns = [
            r"[Tt]he answer is\s*:?\s*\$?(-?\d[\d,]*(?:\.\d+)?)",
            r"[Ff]inal answer\s*:?\s*\$?(-?\d[\d,]*(?:\.\d+)?)",
            r"=\s*\$?(-?\d[\d,]*(?:\.\d+)?)\s*\.?\s*$",
        ]
        for pattern in patterns:
            matches = re.findall(pattern, cot_text, re.MULTILINE)
            if matches:
                # The final answer comes last, so prefer the last match
                return matches[-1].replace(",", "")
    elif answer_format == "boolean":
        match = re.search(
            r"(?:answer|conclusion|result)\s*:?\s*(yes|no|true|false)",
            cot_text, re.IGNORECASE
        )
        if match:
            return match.group(1).lower()
    elif answer_format == "code":
        # First fenced code block, with or without a "python" language tag
        match = re.search(r"```(?:python)?\n(.*?)```", cot_text, re.DOTALL)
        if match:
            return match.group(1).strip()
    return None
Cost Analysis for Self-Consistency
Self-consistency at N=20 costs roughly 20x as much as a single chain-of-thought pass, because every sample generates a full reasoning chain. Whether it's worth it depends entirely on the value of getting the right answer:
| Use Case | N Recommended | Justification |
|---|---|---|
| Math homework assistant | 5–10 | Low stakes, medium cost sensitivity |
| Financial calculation | 20–40 | High stakes, cost acceptable |
| Medical decision support | 40+ or specialized model | Very high stakes |
| Customer support reply | 1 (no CoT needed) | Low complexity, cost sensitive |
| Code security audit | 10–20 | High stakes, medium volume |
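A back-of-the-envelope cost estimate helps when choosing N. The sketch below assumes simple per-token pricing with separate input and output rates; the function name, the prices, and the `prompt_billed_once` switch (for APIs that return several samples from one request or cache the prompt) are all hypothetical placeholders to be replaced with your provider's actual terms.

```python
def self_consistency_cost(
    prompt_tokens: int,
    completion_tokens: int,
    n_samples: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
    prompt_billed_once: bool = False,
) -> float:
    """Estimated cost of one self-consistency query.

    completion_tokens is the average length of one reasoning chain.
    Prices are per 1,000 tokens and are placeholders, not real rates.
    """
    # Output is always paid once per sample; input may or may not be
    prompt_calls = 1 if prompt_billed_once else n_samples
    input_cost = prompt_calls * prompt_tokens / 1000 * price_in_per_1k
    output_cost = n_samples * completion_tokens / 1000 * price_out_per_1k
    return input_cost + output_cost


# Hypothetical numbers: 200-token prompt, 400-token reasoning chain
single = self_consistency_cost(200, 400, 1, price_in_per_1k=0.001, price_out_per_1k=0.002)
voted = self_consistency_cost(200, 400, 20, price_in_per_1k=0.001, price_out_per_1k=0.002)
print(f"single pass: {single:.4f}, N=20: {voted:.4f}, ratio: {voted / single:.1f}x")
```

Because reasoning chains are usually much longer than the prompt, output tokens dominate, and billing the prompt only once changes the total far less than reducing N does.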
:::danger Common Mistake: Self-Consistency with Temperature=0
Self-consistency requires diversity among samples. Running self-consistency with temperature=0 (greedy decoding) will produce identical, or nearly identical, outputs for each sample, giving you essentially no benefit from multiple samples. Always use temperature greater than 0 (typically 0.6–0.8) when doing self-consistency.
:::
:::warning Post-Hoc Rationalization Risk
Do not treat a model's chain of thought as ground-truth explanation of its reasoning. Research shows that CoT is at best partially faithful. This matters for high-stakes applications: a model may produce a plausible-sounding reasoning chain that leads to the right answer through internally different processes than described. Use CoT for accuracy improvement, but be cautious about using it for interpretability or auditing.
:::
:::tip When to Use Few-Shot vs. Zero-Shot CoT
Few-shot CoT works better when you have a specific format you need the model to follow (structured step labels, specific notation). Zero-shot CoT ("Let's think step by step") is better when you want flexibility, are working across many different problem types, or want to minimize prompt length. Modern capable models (GPT-4, Claude 3+, Gemini Ultra) respond well to zero-shot CoT and often don't need examples.
:::
:::note The Emergence Threshold
Wei et al. (2022) found that CoT only helps above a model capability threshold - roughly 100B parameters in the 2022 models they tested. More efficiently trained later models appear to show the effect at much smaller sizes; a figure of 7–13B for 2024-era models is a rough rule of thumb, not a result from the paper. If you're using a small model and CoT isn't helping, the model may simply be below the threshold where intermediate reasoning steps can be computed reliably.
:::
Interview Questions and Answers
Q1: Explain why chain-of-thought prompting improves reasoning in LLMs.
Chain-of-thought works by externalizing the intermediate reasoning process into the token sequence. Transformers compute one token at a time; each token is a single forward pass with no persistent "working memory" across tokens beyond what's in the context window. For multi-step problems, intermediate results need to be stored somewhere. By generating chain-of-thought text, the model places intermediate results explicitly in its context window, where they can be attended to by subsequent token generations. This is the computational equivalent of a human writing out work on scratch paper: not because writing it down magically makes you smarter, but because having the work written down allows you to reference it reliably in later steps.
Q2: What is self-consistency and why does it work?
Self-consistency generates multiple chain-of-thought completions at temperature greater than 0 and takes a majority vote over final answers. It works because: (1) Correct answers tend to be unique - there's usually one right answer. (2) Wrong answers tend to be diverse - different reasoning errors produce different wrong answers. (3) With enough samples, correct answers will statistically dominate in the vote. The key assumptions are that the model has a non-trivial probability of getting the right answer on any single pass, and that errors are not systematically biased toward one specific wrong answer.
Q3: What is the difference between process supervision and outcome supervision?
Outcome supervision provides feedback only on whether the final answer is correct (one signal per problem). Process supervision provides feedback on each individual reasoning step (one signal per step). Process supervision is much more informative: instead of a binary correct/incorrect signal for the entire solution, you get fine-grained feedback on exactly which step failed. The tradeoff is annotation cost: outcome labels are easy to collect (just check the final answer); process labels require verifying each step, which is expensive. Lightman et al. (2023) showed that process reward models trained with step-level supervision significantly outperform outcome reward models on difficult reasoning benchmarks.
Q4: Is chain-of-thought always beneficial? When does it hurt?
CoT can hurt on simple factual tasks (where the direct answer is more accurate than forcing step-by-step reasoning), perceptual tasks (image classification, pattern matching), and tasks with single-step solutions (where CoT just adds noise). The key heuristic: CoT helps when the task genuinely requires multi-step reasoning and when the model's uncertainty is high enough that working through steps explicitly catches errors. If a model has high single-pass accuracy on a task, CoT adds latency and cost without meaningful benefit.
Q5: How would you implement a production CoT system with self-consistency for a math tutoring app?
A production implementation would: (1) Classify each problem by difficulty - hard problems get self-consistency, easy ones get single-pass. (2) For hard problems, generate N=10–20 samples in parallel at temperature=0.7. (3) Extract final answers using robust regex or asking the model to output a structured format. (4) Take majority vote; if one answer dominates with high confidence (greater than 70%), return it. (5) If no clear majority, escalate to a reasoning model or flag for human review. (6) Log accuracy metrics: track when self-consistency confidence correlates with actual correctness, and use that to calibrate thresholds over time.
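Steps (4) and (5) of that answer can be sketched as a small routing function. The 0.7 threshold comes from the answer above; everything else (the function name, the `"return"`/`"escalate"` action labels, and treating failed extractions as votes against confidence) is an assumption made for illustration.

```python
from collections import Counter
from typing import List, Optional


def route_by_vote(answers: List[Optional[str]], threshold: float = 0.7) -> dict:
    """Return the majority answer if it is confident enough, else escalate.

    answers holds one extracted answer per sample; None marks a sample
    where extraction failed. Failed extractions stay in the denominator,
    so they lower confidence instead of being silently ignored.
    """
    valid = [a.strip().lower() for a in answers if a is not None and a.strip()]
    if not valid:
        return {"action": "escalate", "answer": None, "confidence": 0.0}

    answer, count = Counter(valid).most_common(1)[0]
    confidence = count / len(answers)
    action = "return" if confidence > threshold else "escalate"
    return {"action": action, "answer": answer, "confidence": confidence}


# Clear majority: 8 of 10 samples agree
print(route_by_vote(["12"] * 8 + ["15", "9"]))

# Split vote with two failed extractions: hand off to a stronger model or a human
print(route_by_vote(["12"] * 5 + ["15"] * 3 + [None, None]))
```

Logging the returned `confidence` alongside eventual correctness is what makes step (6), calibrating the threshold over time, possible.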
Q6: What does CoT faithfulness mean and why does it matter in production?
CoT faithfulness refers to whether the chain-of-thought text actually reflects the model's internal computational process, or whether it's a plausible-sounding narrative generated after the model has already determined an answer (post-hoc rationalization). It matters for two reasons: (1) If you're using CoT outputs to explain model decisions to users (interpretability), unfaithful CoT is misleading - you're showing users a rationalization, not the actual reasoning. (2) If you're using CoT for process supervision training, using unfaithful CoT as training signal can reinforce generating confident-sounding explanations for wrong answers. In production, treat CoT primarily as an accuracy tool, not as a window into the model's true reasoning process.
