Chain-of-Thought Prompting
The Model That Couldn't Count
Your company's AI assistant handles financial reports. A client asks: "If our Q1 revenue was $2.4M and grew 15% in Q2, then dropped 8% in Q3, what is Q3 revenue?"
The model answers: $2.49M
Wrong. The correct answer is $2.4M × 1.15 = $2.76M, then $2.76M × 0.92 = $2.5392M ≈ $2.54M.
The model is wrong because it's pattern-matching to the answer format rather than computing. It's seen thousands of percentage problems and learned to produce plausible-looking numbers - without actually doing the math.
You modify the prompt: "Think through this step by step before giving the final answer."
Now the model outputs:
Step 1: Calculate Q2 revenue
Q1 revenue = $2.4M
Q2 growth = 15%
Q2 revenue = $2.4M × 1.15 = $2.76M
Step 2: Calculate Q3 revenue
Q3 decline = 8%
Q3 revenue = $2.76M × 0.92 = $2.5392M ≈ $2.54M
Final answer: Q3 revenue is approximately $2.54M
Correct. The same model, the same weights, the same knowledge - but now it gets the answer right. The only difference is that you gave it space to think.
This is chain-of-thought prompting. It's one of the most impactful discoveries in prompt engineering, and understanding why it works changes how you design every LLM system.
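The arithmetic in the example above can be checked directly. This is a minimal sketch using Python's `decimal` module so the intermediate values stay exact; the variable names are illustrative and do not come from any particular system.

```python
from decimal import Decimal

q1 = Decimal("2.4")            # Q1 revenue, in millions of dollars
q2 = q1 * Decimal("1.15")      # Q2: 15% growth applied to Q1
q3 = q2 * Decimal("0.92")      # Q3: 8% decline applied to Q2, not to Q1

print(f"Q2 = ${q2}M")
print(f"Q3 = ${q3}M, roughly ${round(q3, 2)}M")
```

The 8% drop applies to the Q2 figure, so the intermediate value has to exist before the final step can be computed. That is the same dependency the step-by-step output makes explicit.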
Why This Exists: The Reasoning Bottleneck
Language models generate text one token at a time. Each token is predicted based on all previous tokens. When a model answers a complex question in a single prediction step - jumping straight from the question to "$2.49M" - it has to compress a multi-step computation into one forward pass.
That's not how computation works. Multi-step reasoning requires intermediate state. A human doing this calculation would write down the intermediate steps. A computer program would store intermediate values in variables.
The model's architecture doesn't have an explicit "memory" for intermediate computation steps. It can only use the tokens in its context window as its scratchpad. Without intermediate steps in the context, it has to guess the answer - and for complex enough tasks, guessing fails.
Chain-of-thought fixes this by turning the context window into a scratchpad. When you instruct the model to "think step by step," you're giving it token budget to work through intermediate steps. Each intermediate step is a token (or several), and those tokens become the "memory" the model uses to compute subsequent steps.
Historical Context: The Wei et al. 2022 Paper
The chain-of-thought technique was formalized in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Wei et al., 2022, Google Brain).
The paper's contribution was both the technique itself and the discovery of its boundary conditions. The authors showed:
- Adding reasoning traces to few-shot examples dramatically improved accuracy on arithmetic, commonsense, and symbolic reasoning tasks
- Chain-of-thought is an emergent capability - it only works above a certain model size threshold (~100B parameters)
- For smaller models, chain-of-thought examples actually hurt performance
The most striking result: on the GSM8K grade school math benchmark, PaLM 540B with 8 chain-of-thought exemplars achieved 56.9% accuracy. Without chain-of-thought, the same model achieved 17.9% accuracy. That's a 3x improvement from a prompt change.
Then, a follow-up paper changed the game: "Large Language Models are Zero-Shot Reasoners" (Kojima et al., 2022). The finding: you don't even need few-shot examples. Simply appending "Let's think step by step." to your prompt triggers chain-of-thought reasoning. Zero-shot CoT was born.
Two Forms of Chain-of-Thought
Form 1: Few-Shot CoT (Wei et al., 2022)
You provide examples that include explicit reasoning chains:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger starts with 5 balls. He buys 2 cans × 3 balls = 6 more balls.
Total: 5 + 6 = 11 tennis balls.
Q: The cafeteria had 23 apples. They used 20 to make lunch, then bought 6 more.
How many apples do they have now?
A: They started with 23. Used 20, so 23 - 20 = 3 left.
Then bought 6 more: 3 + 6 = 9 apples.
Q: A store had 15 shirts. They sold 8 and received a shipment of 12.
How many shirts do they have?
A:
The model sees the pattern: answer with reasoning steps, not just the final number. It mirrors this format for the new question.
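The few-shot layout above can be assembled programmatically. The sketch below is one possible way to do it, not a prescribed format: `EXEMPLARS` and `build_few_shot_cot_prompt` are hypothetical names, and the `Q:` / `A:` layout simply mirrors the example.

```python
from typing import List, Tuple

# Each exemplar pairs a question with a worked answer that includes the reasoning.
EXEMPLARS: List[Tuple[str, str]] = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger starts with 5 balls. He buys 2 cans × 3 balls = 6 more balls. "
        "Total: 5 + 6 = 11 tennis balls.",
    ),
    (
        "The cafeteria had 23 apples. They used 20 to make lunch, then bought 6 more. "
        "How many apples do they have now?",
        "They started with 23. Used 20, so 23 - 20 = 3 left. "
        "Then bought 6 more: 3 + 6 = 9 apples.",
    ),
]


def build_few_shot_cot_prompt(
    question: str, exemplars: List[Tuple[str, str]] = EXEMPLARS
) -> str:
    """Join the worked exemplars, then leave the final 'A:' open for the model."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


prompt = build_few_shot_cot_prompt(
    "A store had 15 shirts. They sold 8 and received a shipment of 12. "
    "How many shirts do they have?"
)
print(prompt)
```

Keeping exemplars as data rather than as one hard-coded string makes it easier to swap in domain-specific examples later.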
Form 2: Zero-Shot CoT (Kojima et al., 2022)
No examples needed. Just append the magic phrase:
Q: A store had 15 shirts. They sold 8 and received a shipment of 12.
How many shirts do they have?
A: Let's think step by step.
Output:
Starting shirts: 15
Sold: 8
After selling: 15 - 8 = 7
Received: 12
Final count: 7 + 12 = 19 shirts
The phrase "Let's think step by step" isn't magic - it's a strong signal that the expected response format includes intermediate reasoning. The model has likely seen many documents in which people use this phrase before working through a problem.
Other trigger phrases that work:
- "Think through this carefully."
- "Work through this step by step."
- "Show your reasoning."
- "Let me work through this."
- "First, let me think..."
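Building a zero-shot CoT prompt amounts to placing a trigger phrase where the answer begins. The sketch below assumes the `Q:` / `A:` layout from the example; `TRIGGER_PHRASES` and `build_zero_shot_cot_prompt` are hypothetical names, and which phrase works best for a given model is something to measure rather than assume.

```python
from typing import Dict

TRIGGER_PHRASES = [
    "Let's think step by step.",
    "Think through this carefully.",
    "Work through this step by step.",
    "Show your reasoning.",
]


def build_zero_shot_cot_prompt(question: str, trigger: str = TRIGGER_PHRASES[0]) -> str:
    """Put the trigger right after 'A:' so the model continues with reasoning."""
    return f"Q: {question.strip()}\nA: {trigger}"


question = (
    "A store had 15 shirts. They sold 8 and received a shipment of 12. "
    "How many shirts do they have?"
)

# One prompt per trigger phrase, for example to compare them on a held-out set.
variants: Dict[str, str] = {
    t: build_zero_shot_cot_prompt(question, t) for t in TRIGGER_PHRASES
}
print(variants[TRIGGER_PHRASES[0]])
```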
Why CoT Works: The Token Budget Hypothesis
The most coherent explanation for why chain-of-thought works is the token budget hypothesis:
Language model computation is bounded by the number of forward passes executed. Each token generation is one forward pass. A model computing a complex multi-step answer in a single token is doing the equivalent of a computer trying to run a sorting algorithm in one CPU instruction - there simply isn't enough computational budget.
When the model generates intermediate reasoning tokens, each token:
- Represents an intermediate result that gets incorporated into the context
- Gives the model additional "compute steps" for the next token prediction
- Creates a chain of conditioning - each step is conditioned on the intermediate results written before it
This is why chain-of-thought is analogous to showing your work in math class: the intermediate steps aren't just helpful for the reader - they're necessary for the reasoning process itself.
Emergent Capability: The Scale Threshold
In the experiments of Wei et al. (2022), chain-of-thought prompting helped only for models of roughly 100B parameters or more. This was one of the paper's key findings.
For the smaller models in that study, adding reasoning examples to the prompt hurt performance. Those models did not have the capacity to follow multi-step reasoning chains - they produced fluent but illogical reasoning.
This has a practical implication: if you're using a small or quantized model, chain-of-thought may not help. Use GPT-4, Claude Sonnet, or another large frontier model for tasks that require it.
Newer, more efficient models (like Llama 3.1 at 8B with instruction tuning) have pushed the threshold lower. But the general principle holds: the ability to follow long reasoning chains tends to improve with model size.
Self-Consistency: Sampling Multiple Chains
A single chain-of-thought reasoning path can still go wrong. Self-consistency (Wang et al., 2022) improves reliability by sampling multiple reasoning chains and taking the majority vote on the final answer.
The intuition: if 7 out of 10 independent reasoning chains arrive at the same answer, that answer is likely correct - even if some individual chains contain errors.
Self-consistency typically samples 5-40 chains and aggregates. It's expensive (N × cost of a single call) but dramatically improves reliability for high-stakes reasoning tasks.
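The voting step itself is small. The sketch below shows only the aggregation, with hard-coded answers standing in for sampled chains; `normalize` is a hypothetical helper, and real answers may need task-specific normalization (units, number formats) before they can be compared.

```python
from collections import Counter
from typing import List, Tuple


def normalize(answer: str) -> str:
    """Illustrative normalization: trim, lowercase, drop a trailing period."""
    return answer.strip().lower().rstrip(".")


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common normalized answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Final answers from 10 hypothetical reasoning chains for the Q3 revenue question.
sampled = ["$2.54M"] * 6 + ["$2.54m.", "$2.49M", "$2.52M", "$2.49M"]
answer, agreement = majority_vote(sampled)
print(answer, agreement)
```

Without normalization, `"$2.54M"` and `"$2.54m."` would be counted as different answers and the vote share would understate the real agreement.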
Least-to-Most Prompting
Least-to-Most Prompting (Zhou et al., 2022) is a CoT variant for complex problems: decompose the problem into subproblems, solve them in order from simplest to most complex, using earlier answers to solve later ones.
Problem: "Write a function that counts the number of vowels in all words
in a list that have more than 5 characters."
First, let me break this into subproblems:
1. How do I count vowels in a single word?
2. How do I filter words longer than 5 characters?
3. How do I combine these for a whole list?
Step 1: Count vowels in a word
vowels = 'aeiouAEIOU'
count = sum(1 for c in word if c in vowels)
Step 2: Filter words by length
long_words = [w for w in words if len(w) > 5]
Step 3: Combine
total = sum(sum(1 for c in w if c in 'aeiouAEIOU') for w in words if len(w) > 5)
This approach is particularly powerful for programming tasks and math word problems with multiple steps.
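Written out as runnable code, the three subproblems above map onto three small functions, each building on the previous one. The function names and the sample word list are illustrative only.

```python
from typing import List

VOWELS = set("aeiouAEIOU")


def count_vowels(word: str) -> int:
    """Subproblem 1: count the vowels in a single word."""
    return sum(1 for c in word if c in VOWELS)


def long_words(words: List[str], min_length: int = 5) -> List[str]:
    """Subproblem 2: keep words with more than `min_length` characters."""
    return [w for w in words if len(w) > min_length]


def vowels_in_long_words(words: List[str]) -> int:
    """Subproblem 3: combine the two helpers over the whole list."""
    return sum(count_vowels(w) for w in long_words(words))


words = ["chain", "thought", "prompting", "model", "reasoning"]
print(vowels_in_long_words(words))  # 2 + 2 + 4 = 8
```

Each helper can be tested on its own, which is the practical benefit of solving the simplest subproblem first.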
When NOT to Use Chain-of-Thought
CoT adds tokens (cost and latency). Don't use it for:
- Simple factual questions: "What is the capital of France?" - CoT adds nothing.
- Direct classification: "Is this spam? Yes/No" - reasoning doesn't help if the classification is obvious.
- Format transformation: "Convert this CSV to JSON" - the task is mechanical, not reasoning.
- Small models: In Wei et al. (2022), CoT often hurt below ~100B parameters. Newer small models may behave differently, so measure before relying on it.
Use CoT for:
- Multi-step arithmetic and algebra
- Logic puzzles and constraint satisfaction
- Code generation with complex requirements
- Medical diagnosis reasoning
- Any task where "explain your reasoning" would naturally apply
Code: Implementing Self-Consistency CoT
```python
import anthropic
from collections import Counter
import re

client = anthropic.Anthropic()


def chain_of_thought_answer(question: str, n_samples: int = 5) -> dict:
    """
    Implements self-consistency CoT:
    1. Sample N reasoning chains
    2. Extract final answers
    3. Return majority vote answer with confidence

    Args:
        question: The question to answer
        n_samples: Number of reasoning chains to sample

    Returns:
        dict with 'answer', 'confidence', 'all_answers', 'reasoning_chains'
    """
    chains = []
    answers = []

    for i in range(n_samples):
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=500,
            temperature=0.7,  # Non-zero to get diverse chains
            messages=[
                {
                    "role": "user",
                    "content": f"""{question}
Let's think step by step. Work through each step carefully, then state your final answer
on the last line in the format: "Final answer: [answer]"
"""
                }
            ]
        )
        chain = message.content[0].text.strip()
        chains.append(chain)

        # Extract final answer
        final_answer_match = re.search(
            r"[Ff]inal answer:\s*(.+?)(?:\n|$)",
            chain
        )
        if final_answer_match:
            answers.append(final_answer_match.group(1).strip())
        else:
            # Fallback: take last non-empty line
            lines = [l.strip() for l in chain.split('\n') if l.strip()]
            if lines:
                answers.append(lines[-1])

    # Majority vote
    if answers:
        answer_counts = Counter(answers)
        majority_answer, majority_count = answer_counts.most_common(1)[0]
        confidence = majority_count / len(answers)
    else:
        majority_answer = "Unable to determine"
        confidence = 0.0

    return {
        "answer": majority_answer,
        "confidence": confidence,
        "all_answers": answers,
        "reasoning_chains": chains,
        "vote_distribution": dict(Counter(answers))
    }


# Test on math reasoning
result = chain_of_thought_answer(
    question="""A train leaves Chicago at 2 PM traveling east at 60 mph.
Another train leaves New York at 3 PM traveling west at 80 mph.
Chicago and New York are 790 miles apart.
At what time do the trains meet?""",
    n_samples=5
)

print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"Vote distribution: {result['vote_distribution']}")
print("\nSample reasoning chain:")
print(result['reasoning_chains'][0])
```
Zero-Shot CoT for Code Review
```python
import anthropic

client = anthropic.Anthropic()

# Markdown fence for the prompt, built here so it cannot close this code block.
FENCE = "`" * 3


def code_review_with_reasoning(code: str, language: str = "Python") -> str:
    """
    Performs a thorough code review by making the model reason step by step
    through different quality dimensions before giving its assessment.
    """
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": f"""Review the following {language} code for issues.
Think through each of these dimensions step by step:
1. Security vulnerabilities (SQL injection, XSS, auth bypass, etc.)
2. Performance issues (N+1 queries, unnecessary loops, memory leaks)
3. Error handling (uncaught exceptions, missing validation)
4. Code quality (naming, readability, SOLID principles)
For each dimension, reason through what you see before stating any findings.
Code:
{FENCE}{language.lower()}
{code}
{FENCE}
After your analysis, provide a prioritized list of issues with severity (critical/high/medium/low)."""
            }
        ]
    )
    return message.content[0].text


# Example usage
vulnerable_code = """
def get_user_by_email(email):
    conn = sqlite3.connect('users.db')
    cursor = conn.cursor()
    query = f"SELECT * FROM users WHERE email = '{email}'"
    cursor.execute(query)
    return cursor.fetchone()
"""

review = code_review_with_reasoning(vulnerable_code)
print(review)
```
### Few-Shot CoT for Domain-Specific Reasoning
```python
import anthropic

client = anthropic.Anthropic()

# Medical triage reasoning (illustrative - not for actual medical use)
FEW_SHOT_MEDICAL_TRIAGE = """
Assess the urgency level of the following patient presentations.
Use: IMMEDIATE (life-threatening), URGENT (needs care within 1 hour), STANDARD (routine)

Patient: 65-year-old male, crushing chest pain radiating to left arm, sweating, nausea.
Reasoning: Crushing chest pain radiating to left arm + diaphoresis + nausea in older male
= classic MI presentation. This is a life-threatening cardiac emergency.
Urgency: IMMEDIATE

Patient: 28-year-old female, headache for 3 days, mild fever 99.1°F, no neck stiffness.
Reasoning: Low-grade fever with headache without meningeal signs. Duration suggests tension
headache or viral illness. No red flags for meningitis (no neck stiffness, photophobia).
Urgency: STANDARD

Patient: 45-year-old male, sudden severe headache "worst of life", stiff neck.
Reasoning: "Thunderclap" worst-of-life headache + neck stiffness = red flags for
subarachnoid hemorrhage or meningitis. Both are neurological emergencies.
Urgency: IMMEDIATE

Patient: {patient_presentation}
Reasoning:"""


def assess_patient_urgency(patient_presentation: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=300,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": FEW_SHOT_MEDICAL_TRIAGE.format(
                    patient_presentation=patient_presentation
                )
            }
        ]
    )
    return message.content[0].text.strip()


# This is illustrative only - do not use for actual medical decisions
presentation = "32-year-old female, ankle pain after sports injury, moderate swelling, able to bear weight."
result = assess_patient_urgency(presentation)
print(result)
```
Production Engineering Notes
1. Token Cost of CoT
CoT reasoning chains can be 5-20x longer than direct answers. Frontier models cost roughly $3-15 per million tokens; even at the $3 end of that range, this adds up:
- Direct answer: ~20 tokens → ~$0.00006 per call
- CoT answer: ~200 tokens → ~$0.0006 per call
- Self-consistency (5 samples): ~1000 tokens → ~$0.003 per call
Budget accordingly. Use CoT only where accuracy improvement justifies cost.
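The per-call figures above follow from a one-line formula. The sketch below reproduces them under the same simplifying assumptions: a flat $3 per million tokens and output tokens only. `estimate_cost` is a hypothetical helper, and real pricing usually differs for input and output tokens.

```python
def estimate_cost(
    output_tokens: int, price_per_million: float = 3.0, n_samples: int = 1
) -> float:
    """Dollar cost of the generated tokens only; prompt tokens are ignored."""
    return output_tokens * n_samples * price_per_million / 1_000_000


# (tokens per response, number of samples) for the three scenarios above
scenarios = {
    "direct answer": (20, 1),
    "CoT answer": (200, 1),
    "self-consistency (5 samples)": (200, 5),
}

for name, (tokens, samples) in scenarios.items():
    print(f"{name}: ${estimate_cost(tokens, n_samples=samples):.5f} per call")
```

Multiplying the result by the expected daily call volume gives a quick budget check before enabling CoT on a high-traffic endpoint.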
2. Extracting the Final Answer
Always extract the final answer programmatically. CoT responses contain reasoning + answer - you typically need only the answer for downstream processing:
```python
import re


def extract_final_answer(cot_response: str) -> str:
    """Extract just the final answer from a CoT response."""
    # Try the explicit "Final answer:" marker first, then weaker cues.
    # Use the last occurrence: earlier matches may be mid-reasoning.
    for marker in ("final answer", "therefore", "in conclusion"):
        matches = re.findall(rf"{marker}[:\s]+(.+?)(?:\n|$)",
                             cot_response, re.IGNORECASE)
        if matches:
            return matches[-1].strip()
    # Fallback: last non-empty line
    lines = [l.strip() for l in cot_response.split('\n') if l.strip()]
    return lines[-1] if lines else cot_response
```
3. When to Use Self-Consistency
Self-consistency (multiple samples + majority vote) is worth the 5-10x cost when:
- The task is high-stakes (medical, financial, legal)
- You're seeing inconsistent single-chain answers
- Accuracy improvements of 10-20% are worth the cost
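One way to contain that cost is to reserve self-consistency for queries where a cheaper first pass is uncertain. The sketch below shows only the routing logic. Both callables are hypothetical and injected by the caller, each is assumed to return a dict with `answer` and `confidence` keys, and the 0.8 threshold is an arbitrary placeholder to tune on your own data.

```python
from typing import Callable, Dict


def answer_with_cascade(
    question: str,
    cheap_answer: Callable[[str], Dict],
    self_consistent_answer: Callable[[str], Dict],
    threshold: float = 0.8,
) -> Dict:
    """Try the cheap path first; escalate only when its confidence is low."""
    first = cheap_answer(question)
    if first["confidence"] >= threshold:
        return {**first, "escalated": False}
    return {**self_consistent_answer(question), "escalated": True}


# Stub callables standing in for real model calls.
def cheap_stub(question: str) -> Dict:
    return {"answer": "8:13 PM", "confidence": 0.6}


def expensive_stub(question: str) -> Dict:
    return {"answer": "8:13 PM", "confidence": 0.9}


result = answer_with_cascade("When do the trains meet?", cheap_stub, expensive_stub)
print(result)
```

How the cheap path produces a confidence score is left open here; agreement across two or three low-cost samples is one option.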
Common Mistakes
:::danger Mistake 1: Using CoT for Simple Tasks
Adding "Let's think step by step" to a simple classification prompt wastes tokens and introduces unnecessary variance. Use CoT only when multi-step reasoning is genuinely needed.
:::
:::danger Mistake 2: Extracting Answers Without Validation
CoT responses require answer extraction. Always build a robust extraction step. If the extraction fails, log it and handle the failure gracefully - don't let a regex failure break your pipeline.
:::
:::warning Mistake 3: Setting Temperature=0 for Self-Consistency
Self-consistency requires diverse reasoning chains. At temperature=0, you get the same chain every time - defeating the purpose. Use temperature=0.5–0.9 when sampling multiple chains.
:::
:::warning Mistake 4: Assuming CoT Eliminates Hallucination
CoT reduces certain types of errors (arithmetic, multi-step logic) but doesn't eliminate hallucination. The model can still produce confident-sounding but wrong reasoning chains, especially for factual claims.
:::
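For Mistake 2, a validating extractor might look like the sketch below. It assumes the prompt asked for a `Final answer:` line and that the task expects a number; `extract_numeric_answer` and the logger name are hypothetical. It returns `None` on failure instead of raising, so the caller decides whether to retry or fall back.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger("cot_extraction")

FINAL_ANSWER = re.compile(r"final answer:\s*(.+)", re.IGNORECASE)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_numeric_answer(cot_response: str) -> Optional[float]:
    """Return the number on the last 'Final answer:' line, or None if absent."""
    matches = FINAL_ANSWER.findall(cot_response)
    if not matches:
        logger.warning("no 'Final answer:' marker in response")
        return None
    # Strip thousands separators before looking for the first number.
    number = NUMBER.search(matches[-1].replace(",", ""))
    if number is None:
        logger.warning("final answer is not numeric: %r", matches[-1])
        return None
    return float(number.group())


print(extract_numeric_answer("15 - 8 = 7\n7 + 12 = 19\nFinal answer: 19 shirts"))
print(extract_numeric_answer("I could not work this out."))
```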
Interview Q&A
Q1: What is chain-of-thought prompting and why does it improve LLM performance on reasoning tasks?
Chain-of-thought prompting instructs LLMs to generate intermediate reasoning steps before producing a final answer. It improves performance because language models generate text one token at a time - complex multi-step reasoning requires intermediate computational state, which the model can only maintain by writing intermediate results into its context. Without CoT, the model must compress multi-step computation into a single token prediction, which fails for complex problems. With CoT, the context window acts as a scratchpad, giving the model the computational "budget" to work through problems step by step.
Q2: What is the difference between few-shot CoT and zero-shot CoT?
Few-shot CoT (Wei et al., 2022) provides examples in the prompt that include explicit reasoning chains - the model learns the expected reasoning format from the examples. Zero-shot CoT (Kojima et al., 2022) uses no examples; instead, a simple phrase like "Let's think step by step" is appended to the prompt, which triggers chain-of-thought reasoning because the model has seen this phrase used in this way in its training data. Zero-shot CoT is simpler and requires no example curation; few-shot CoT is more controlled and can specify the exact reasoning format and style you need.
Q3: What is self-consistency and when would you use it?
Self-consistency (Wang et al., 2022) samples multiple independent chain-of-thought reasoning paths for the same question, then takes a majority vote on the final answer. It improves accuracy by leveraging the diversity of reasoning paths - if multiple independent paths arrive at the same answer, that answer is more likely to be correct. Use it for high-stakes tasks (medical, financial, legal) where you need higher confidence, or when single-chain answers are inconsistent. The cost is N × the cost of a single call, so it's not always practical.
Q4: Why does CoT only work above a certain model size threshold?
This is the emergent capability finding from Wei et al. (2022): CoT is only helpful for models above approximately 100B parameters. Smaller models either produce incoherent reasoning chains or follow them incorrectly. The hypothesis is that following multi-step reasoning chains requires sufficient model capacity to maintain coherence across a long reasoning trace and correctly condition each step on previous steps. Below the threshold, the model doesn't have enough capacity to execute this reliably - and the examples actually confuse it.
Q5: What is least-to-most prompting and when does it outperform standard CoT?
Least-to-most prompting (Zhou et al., 2022) decomposes a complex problem into subproblems sorted from simplest to hardest, then solves each in sequence, using earlier answers to solve later ones. It outperforms standard CoT on tasks with clear dependency structure - programming problems (you need the helper function before you can write the main function), math word problems with multiple steps, and any task where a complex goal can be systematically broken into simpler goals. Standard CoT treats the problem as one reasoning sequence; least-to-most explicitly structures the problem decomposition.
Q6: How would you implement CoT in a production system where cost is a concern?
First, evaluate whether CoT is actually needed - measure accuracy on your task with zero-shot, zero-shot CoT, and few-shot CoT to see if the accuracy improvement justifies the cost. Second, use zero-shot CoT rather than few-shot to reduce prompt tokens. Third, if self-consistency is needed, run it as a cascade: answer each query with a cheap first pass, and escalate to self-consistency only when confidence falls below a threshold. Fourth, cache CoT results for identical or near-identical queries. Fifth, consider using CoT in an offline evaluation or training-data generation pipeline, while serving direct predictions in production from a model fine-tuned on the CoT-generated data.
