Reasoning and Math Evaluation
Reading time: ~40 min - Interview relevance: Very High - Target roles: ML Engineer, Research Engineer
The difference between a model that can reason and a model that pattern-matches its way through a benchmark is often invisible until you deploy it. Math and reasoning evaluation is where that distinction gets exposed - but only if you design your evaluation to test the process, not just check the final answer.
The Education Platform That Almost Used the Wrong Model
In late 2023, an EdTech startup was building an AI tutoring assistant for middle and high school students. The product would help students work through math problems, explain reasoning steps, and identify where students went wrong. The ML team evaluated three open-source models on GSM8K, the standard grade-school math benchmark. Model A scored 72%. Model B scored 68%. Model C scored 79%. The choice looked easy.
During user testing three weeks later, something strange emerged. Model C - the highest scorer - frequently gave correct final answers with incorrect explanations. A student would ask "why did you multiply here?" and Model C would produce a plausible-sounding justification that was actually wrong. Model A, which scored lower on GSM8K, gave answers that were occasionally incorrect but almost always with reasoning that was either clearly right or clearly wrong. Students could tell when Model A was wrong and learn from the correction. With Model C, students absorbed incorrect reasoning presented with confidence.
The fundamental problem: GSM8K measures final-answer accuracy. The product needed explanation quality and reasoning faithfulness. A model that had learned to output the correct answer through pattern matching - rather than through genuine step-by-step reasoning - could score well on GSM8K while being nearly useless for a tutoring use case that depended on correct intermediate steps.
The team went back to evaluation. They ran a step-level correctness analysis, evaluating whether each intermediate reasoning step was logically valid given the previous steps. Model C's step-level accuracy was 61%. Model A's was 84%. Model A shipped. The GSM8K score difference of 7 percentage points had pointed in exactly the wrong direction for this application.
This is not a fringe case. Reasoning evaluation - more than almost any other area of LLM evaluation - requires thinking carefully about what you are actually measuring, because the gap between "correct final answer" and "correct reasoning process" is wide and practically important.
Why This Exists - The Limits of Answer-Only Evaluation
The earliest approaches to evaluating language model mathematical ability were simple: ask the model a question, check if the output contains the right number. This works for arithmetic problems where a model either computes correctly or not. It breaks down the moment problems require multiple steps.
Consider a problem: "A train travels 120 miles in 2 hours, then 90 miles in 1 hour. What is its average speed for the entire journey?" The correct answer is 70 mph (210 miles / 3 hours). But a model could reach 70 mph through several paths: correct total-distance-over-total-time reasoning, an arithmetic error that happens to cancel out, or pattern-matching "average speed" to "average the two leg speeds" - which gives (60 + 90)/2 = 75 mph, a wrong method that could still land on 70 through a lucky slip.
Answer-only evaluation cannot distinguish these cases. A model that gets 80% of multi-step problems right by occasionally getting lucky on the arithmetic and frequently getting lucky on the setup provides fundamentally different capabilities than a model that reasons correctly 80% of the time. The second model will generalize to harder problems; the first will not.
The field recognized this problem in 2022, when Wei et al. demonstrated chain-of-thought (CoT) prompting: showing models examples with step-by-step reasoning before asking them to solve a new problem dramatically improved accuracy on reasoning tasks. This discovery had an important side effect for evaluation: it made the reasoning trace visible and evaluatable. You could now check not just whether the answer was correct but whether the steps made sense.
The subsequent few years saw an explosion in reasoning benchmarks designed to probe specific capabilities: multi-step arithmetic (GSM8K), competition mathematics (MATH, AIME), abstract reasoning (ARC, BIG-Bench Hard), logical deduction (LogiQA), and graduate-level science reasoning (GPQA). Each benchmark probes a different point on the difficulty spectrum and different types of reasoning failure.
Historical Context - The Chain-of-Thought Revolution
The story of reasoning evaluation is inseparable from the story of chain-of-thought prompting.
In 2022, Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou at Google published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." The paper showed that simply demonstrating step-by-step reasoning in few-shot examples caused models to produce step-by-step reasoning themselves - and this dramatically improved accuracy on arithmetic, commonsense, and symbolic reasoning tasks.
The "aha moment" was the emergent nature of the effect. Chain-of-thought prompting had essentially no benefit for models smaller than about 100B parameters. Above that threshold, performance jumps sharply. This suggested that multi-step reasoning required a minimum amount of model capacity to emerge - the steps need to interact with each other in the forward pass, requiring sufficient depth and width to "remember" intermediate results.
GSM8K (Cobbe et al., 2021, OpenAI) was published around the same time and became the standard grade-school math benchmark. The 8,500 problems require 2-8 steps of arithmetic reasoning. What made GSM8K influential was the deliberate design to require reasoning chains: you cannot solve GSM8K problems reliably without some form of step-by-step computation. This made it a natural fit for evaluating CoT methods.
MATH (Hendrycks et al., 2021, UC Berkeley) pushed difficulty dramatically further, collecting 12,500 competition mathematics problems across five difficulty levels, from AMC-8 level up to AIME and beyond. The gap between state-of-the-art model performance on GSM8K (now nearly saturated at 90%+) and on the hardest MATH problems (still under 80% for most models) reveals the difference between grade-school reasoning and genuine mathematical capability.
The process reward model concept (Lightman et al., 2023, OpenAI, "Let's Verify Step by Step") introduced a framework for training models to evaluate the correctness of individual reasoning steps, not just final answers. This work directly connected reasoning evaluation to the training-time feedback signal - if you want models that reason correctly, you need evaluation frameworks that measure correctness of process.
Core Concepts
GSM8K - Grade School Math as a Reasoning Probe
GSM8K contains roughly 8,500 grade-school math problems, split into 7,473 training and 1,319 test problems, created by human writers specifically to require multi-step reasoning. The problems deliberately avoid tricks, cultural knowledge, and ambiguity - they test whether a model can execute arithmetic reasoning chains correctly.
A typical GSM8K problem:
"Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
The correct answer is 72. The reasoning chain is: May sales = 48/2 = 24. Total = 48 + 24 = 72. This is genuinely multi-step: you must compute an intermediate value and use it in a second computation. Models that memorize "add two numbers" cannot solve this without the intermediate step.
GSM8K evaluation protocol matters:
- Zero-shot CoT: "Let's think step by step." appended to the question. Tests whether the model spontaneously reasons well.
- Few-shot CoT: 8 examples with full reasoning chains in the prompt. Tests whether the model follows demonstrated reasoning patterns.
- Calculator augmented: model can call a calculator tool for arithmetic. Separates arithmetic errors from reasoning errors.
By 2024, frontier models (GPT-4, Claude 3, Gemini Ultra) score 90-95% on GSM8K with CoT. Small models (7B) score 60-75%. GSM8K is becoming saturated for large models but remains useful for smaller model evaluation and for measuring whether fine-tuning degrades reasoning.
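To make the protocols above concrete, here is a minimal sketch of how the zero-shot and few-shot CoT prompts are typically constructed, and how the GSM8K gold answer can be pulled from the dataset's "#### <number>" convention. The single exemplar shown is a placeholder (a real few-shot run would use the standard 8 demonstrations), and the helper names are illustrative rather than taken from any particular framework:

```python
# Illustrative sketch of the GSM8K prompting protocols described above.
# The few-shot exemplar below is a placeholder - in practice you would use the
# standard 8 demonstrations from the CoT literature or your own held-out problems.

FEW_SHOT_EXEMPLARS = [
    {
        "question": "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?",
        "reasoning": "It takes 2 / 2 = 1 bolt of white fiber. So the total is 2 + 1 = 3 bolts.",
        "answer": "3",
    },
    # ... in practice, 8 exemplars with full reasoning chains
]

def build_zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: just append the step-by-step instruction."""
    return f"Question: {question}\nLet's think step by step.\n"

def build_few_shot_cot_prompt(question: str) -> str:
    """Few-shot CoT: prepend worked examples with full reasoning chains."""
    demos = "\n\n".join(
        f"Question: {ex['question']}\n"
        f"Answer: {ex['reasoning']} The answer is {ex['answer']}."
        for ex in FEW_SHOT_EXEMPLARS
    )
    return f"{demos}\n\nQuestion: {question}\nAnswer:"

def gsm8k_gold_answer(answer_field: str) -> str:
    """GSM8K ground-truth answer fields end with '#### <number>'."""
    return answer_field.split("####")[-1].strip().replace(",", "")
```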
The MATH Dataset - Difficulty Stratification
MATH (Hendrycks et al., 2021) provides 12,500 competition math problems across 7 subjects (Algebra, Number Theory, Combinatorics, Geometry, Probability, Precalculus, Intermediate Algebra) and 5 difficulty levels (Level 1 through Level 5).
Level 1 problems are roughly AMC-8 difficulty. Level 5 problems include AIME-style problems. The difficulty stratification allows fine-grained analysis: a model might solve 90% of Level 1 but only 20% of Level 5. This is far more diagnostic than a single aggregate score.
MATH evaluation has a well-known formatting problem. Answers must be expressed in a specific form - fractions as \frac{a}{b}, roots as \sqrt{n}, intervals in specific bracket notation. A model that computes the correct value but formats it as a decimal instead of a fraction is marked wrong. The MATH-500 subset (Lightman et al., 2023) is a hand-curated 500-problem subset with cleaner answer parsing. For comparative evaluation, MATH-500 is preferred because it reduces noise from formatting errors.
The LaTeX-based answer format in MATH means you need careful parsing in your evaluation pipeline. The sympy library with a LaTeX parser is the standard approach for comparing mathematical expressions that are formatted differently but mathematically equal:
```python
import sympy
from sympy.parsing.latex import parse_latex

def math_answers_equal(predicted: str, ground_truth: str) -> bool:
    """
    Check if two LaTeX math expressions represent the same value.
    Handles cases like '\\frac{1}{2}' vs '0.5' vs '1/2'.
    """
    try:
        pred_expr = parse_latex(predicted.strip())
        gt_expr = parse_latex(ground_truth.strip())
        # Simplify the difference - if it equals zero, the expressions are equal
        diff = sympy.simplify(pred_expr - gt_expr)
        return diff == 0
    except Exception:
        # Fall back to string comparison if parsing fails
        return predicted.strip() == ground_truth.strip()

# Examples
print(math_answers_equal(r"\frac{1}{2}", "0.5"))   # True
print(math_answers_equal(r"\sqrt{4}", "2"))        # True
print(math_answers_equal("3", r"\frac{6}{2}"))     # True
print(math_answers_equal("3", "4"))                # False
```
AIME - The Competition Math Frontier
AIME (American Invitational Mathematics Examination) problems represent a step beyond MATH. AIME problems have integer answers from 0 to 999 (eliminating formatting issues), are drawn from actual math competition exams, and require sophisticated multi-step mathematical reasoning.
As of 2024, the strongest open-source models score 10-30% on AIME problems, while frontier closed models with test-time compute (OpenAI o1) can reach 70-80%. AIME is useful as the high end of a difficulty spectrum - it separates models that can solve competition math from those that cannot, which is relevant for evaluating models intended for advanced math tutoring or research assistance.
A critical evaluation note: AIME answer validation is clean (integer in 0-999) but problem contamination is severe. AIME problems are publicly posted with full solutions. Always check whether AIME test sets use recent exams (post model training cutoff) for contamination-free evaluation.
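Because AIME answers are integers in 0-999, answer checking reduces to extracting and range-validating a single integer. A minimal sketch (the helper name and fallback heuristic are illustrative, not from any established harness):

```python
import re
from typing import Optional

def extract_aime_answer(reasoning_trace: str) -> Optional[int]:
    """
    Extract an AIME-style final answer: an integer in [0, 999].
    Prefers a \\boxed{...} answer, then falls back to the last short integer in the trace.
    """
    boxed = re.findall(r'\\boxed\{(\d{1,3})\}', reasoning_trace)
    candidates = boxed if boxed else re.findall(r'\b\d{1,3}\b', reasoning_trace)
    if not candidates:
        return None
    value = int(candidates[-1])
    return value if 0 <= value <= 999 else None

# Usage: a prediction counts as correct iff extract_aime_answer(trace) == gold_answer
```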
BIG-Bench Hard - 23 Tasks Where Models Previously Failed
BIG-Bench Hard (BBH, Suzgun et al., 2022) is a subset of 23 tasks from the BIG-Bench benchmark where even large language models performed below average human performance prior to chain-of-thought prompting. The tasks span diverse reasoning types:
- Formal reasoning: logical deduction, tracking shuffled objects, causal judgment
- Language understanding: disambiguation, sarcasm detection, movie recommendation
- Mathematical reasoning: word problems, date understanding, multistep arithmetic
- Algorithmic tasks: dyck language (bracket matching), word sorting, boolean expressions
What makes BBH important for evaluation is that it was specifically designed to resist gaming: problems require multiple distinct reasoning steps, and the tasks were selected precisely because simple pattern matching fails on them.
By 2024, models with CoT prompting solve 60-80% of BBH tasks. Zero-shot (no CoT) typically scores 40-55% on the same models. The CoT improvement gap on BBH is a direct measurement of how much explicit step-by-step reasoning benefits a given model.
GPQA - Graduate-Level Science
GPQA (Graduate-Level Google-Proof Q&A, Rein et al., 2023) measures whether models can answer questions that require graduate-level scientific expertise in biology, chemistry, and physics. The problems are designed to be "Google-proof" - you cannot find the answer by searching, because the answer requires genuine understanding of the domain.
GPQA problems were written by PhD students and verified by domain experts. Even without time pressure, non-expert humans (people with some science background but not in the specific domain) answer only 34% correctly. Domain experts get about 65%. Models have ranged from 40-75% depending on size and training.
GPQA is relevant for evaluating models intended for scientific applications: research assistance, lab automation, scientific literature synthesis. The difficulty level ensures the benchmark does not saturate quickly and the graduate-level reasoning requirement differentiates models that have genuine scientific understanding from those with surface-level pattern matching.
LogiQA - Logical Reasoning Under Natural Language
LogiQA (Liu et al., 2020) provides 8,678 multiple-choice questions derived from Chinese civil service examinations, requiring logical deduction from natural language passages. The tasks include:
- Categorical reasoning: All A are B, some B are C, what follows about A and C?
- Sufficient and necessary conditions: If P then Q. Not Q. What can you conclude?
- Disjunctive reasoning: Either A or B (or both). Not A. Therefore?
- Conjunctive reasoning: Both A and B. What follows?
LogiQA is valuable because it tests formal logic in naturalistic language, which surfaces a specific failure mode: models that have memorized logical rules but cannot apply them when the logical structure is embedded in a naturalistic passage rather than presented symbolically.
Chain-of-Thought Evaluation - Testing Process, Not Just Output
Zero-Shot vs Few-Shot CoT
The original Wei et al. (2022) chain-of-thought paper used few-shot CoT: 8 examples with step-by-step reasoning chains were included in the prompt before the question. Kojima et al. (2022) showed that simply appending "Let's think step by step." to a question (zero-shot CoT) also elicits reasoning chains and improves accuracy substantially.
For evaluation, the choice between zero-shot and few-shot CoT affects your scores and what you are measuring:
Zero-shot CoT measures the model's intrinsic tendency to reason step-by-step when prompted. This is closer to how most users interact with models and is the relevant setting for most production use cases.
Few-shot CoT measures the model's ability to follow demonstrated reasoning patterns. This is a higher ceiling measurement - you are giving the model the template for how to reason - but it is also more sensitive to prompt quality and less representative of zero-shot deployment.
The protocol recommendation for benchmark evaluation is to report both, because the gap between them is informative:
- Large gap (few-shot >> zero-shot): model can reason correctly when shown examples, but does not spontaneously reason well
- Small gap: model has internalized step-by-step reasoning
- Negative gap (zero-shot > few-shot): rare, but indicates few-shot examples may be poorly constructed or confusing the model
Evaluating Step-Level Correctness
Final-answer evaluation tells you nothing about reasoning quality. A model can reach the right answer through a faulty reasoning chain and the wrong answer through a mostly-correct chain with an arithmetic error at the end. For applications where reasoning quality matters (tutoring, scientific assistance, decision support), you need step-level evaluation.
The practical approach to step-level evaluation:
- Generate reasoning traces from the model
- Use a capable evaluation model (GPT-4 or an equivalent) to score each step
- Identify the first incorrect step in chains that fail
- Categorize error types: arithmetic error, wrong formula, misreading the problem, logical fallacy
```python
from openai import OpenAI

client = OpenAI()

STEP_EVALUATION_PROMPT = """You are evaluating the correctness of individual reasoning steps in a math problem solution.
Problem: {problem}
Reasoning chain:
{steps}
For each step, evaluate:
1. Is the step logically valid given the previous steps and the problem?
2. Is the arithmetic/computation correct?
3. Does it make progress toward solving the problem?
Return a JSON list where each element has:
- "step_number": integer
- "step_text": the step content
- "is_correct": boolean
- "error_type": null if correct, otherwise one of ["arithmetic_error", "wrong_formula", "misread_problem", "logical_error", "notation_error"]
- "explanation": brief explanation if incorrect
Return ONLY the JSON, no preamble."""

def evaluate_reasoning_steps(
    problem: str,
    reasoning_trace: str,
    evaluator_model: str = "gpt-4o"
) -> list[dict]:
    """
    Evaluate each step in a reasoning trace for correctness.
    Returns list of step evaluations.
    """
    import json

    # Parse reasoning trace into steps
    # Common patterns: numbered steps, newline-separated, "Step N:" prefix
    steps = parse_steps(reasoning_trace)

    response = client.chat.completions.create(
        model=evaluator_model,
        messages=[
            {
                "role": "user",
                "content": STEP_EVALUATION_PROMPT.format(
                    problem=problem,
                    steps="\n".join(
                        f"Step {i+1}: {step}"
                        for i, step in enumerate(steps)
                    )
                )
            }
        ],
        temperature=0.0  # Deterministic evaluation
    )

    try:
        evaluations = json.loads(response.choices[0].message.content)
        return evaluations
    except json.JSONDecodeError:
        return []

def parse_steps(reasoning_trace: str) -> list[str]:
    """Extract individual steps from a reasoning trace."""
    import re

    # Try numbered steps pattern first
    numbered = re.findall(
        r'(?:Step \d+:|^\d+\.)\s*(.+?)(?=(?:Step \d+:|^\d+\.)|$)',
        reasoning_trace, re.MULTILINE | re.DOTALL
    )
    if len(numbered) >= 2:
        return [s.strip() for s in numbered]

    # Fall back to splitting on newlines
    lines = [l.strip() for l in reasoning_trace.split('\n') if l.strip()]
    return lines

def analyze_step_correctness(evaluations: list[dict]) -> dict:
    """Compute step-level correctness statistics."""
    if not evaluations:
        return {"error": "No evaluations"}

    n_steps = len(evaluations)
    n_correct = sum(1 for e in evaluations if e.get("is_correct", False))

    # Find first error
    first_error_step = None
    first_error_type = None
    for e in evaluations:
        if not e.get("is_correct", True):
            first_error_step = e["step_number"]
            first_error_type = e.get("error_type")
            break

    # Count error types
    error_types = {}
    for e in evaluations:
        if not e.get("is_correct", True):
            et = e.get("error_type", "unknown")
            error_types[et] = error_types.get(et, 0) + 1

    return {
        "n_steps": n_steps,
        "n_correct": n_correct,
        "step_accuracy": n_correct / n_steps,
        "first_error_step": first_error_step,
        "first_error_type": first_error_type,
        "error_type_counts": error_types
    }
```
Self-Consistency - Majority Voting Over Multiple Chains
Wang et al. (2022) showed that generating multiple diverse reasoning chains and taking a majority vote over final answers substantially improves accuracy - often by 5-15 percentage points above single-sample CoT. The intuition: if the model independently reasons through a problem 20 different ways and 15 of those chains arrive at the same answer, that convergence is strong evidence the answer is correct.
Self-consistency is both an inference strategy and an evaluation tool:
- As inference: at deployment time, generate k chains, take majority vote
- As evaluation: self-consistency accuracy (after majority voting) measures the model's capability ceiling under multiple attempts
The formal definition of self-consistency accuracy: generate k reasoning chains for each problem, compute the most common final answer, check if that answer is correct.
$$\text{SC@}k = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left[\operatorname{majority}\left(a_{i,1}, \ldots, a_{i,k}\right) = a_i^{*}\right]$$

where $N$ is the number of problems, $a_{i,j}$ is the final answer from the $j$-th chain for problem $i$, $a_i^{*}$ is the correct answer, and $\operatorname{majority}(\cdot)$ returns the most common answer among the $k$ chains.
```python
from collections import Counter
from typing import List, Optional
import numpy as np

def extract_final_answer(reasoning_trace: str) -> Optional[str]:
    """
    Extract the final answer from a reasoning trace.
    Handles common patterns: "The answer is X", "= X", "\\boxed{X}"
    """
    import re

    # LaTeX boxed answer (MATH dataset style)
    boxed = re.findall(r'\\boxed\{([^}]+)\}', reasoning_trace)
    if boxed:
        return boxed[-1].strip()

    # "The answer is X" pattern
    answer_is = re.findall(
        r'(?:the answer is|answer:|therefore,?)\s*([^\n.]+)',
        reasoning_trace.lower()
    )
    if answer_is:
        return answer_is[-1].strip()

    # Final number in the trace (fallback for GSM8K)
    numbers = re.findall(r'[\d,]+(?:\.\d+)?', reasoning_trace)
    if numbers:
        return numbers[-1].replace(',', '')

    return None

def self_consistency_accuracy(
    problems: list[dict],
    model,
    tokenizer,
    k: int = 20,
    temperature: float = 0.7,
    device: str = "cuda"
) -> dict:
    """
    Compute self-consistency accuracy on a reasoning dataset.
    Each problem should have keys: 'question', 'answer'

    Returns:
        {
            "greedy_accuracy": float,   # Single greedy decode accuracy
            "sc_accuracy": float,       # Self-consistency @ k accuracy
            "consistency_rate": float,  # How often majority agrees
            "improvement": float        # SC - greedy
        }
    """
    greedy_correct = 0
    sc_correct = 0
    consistency_rates = []

    for problem in problems:
        question = problem["question"]
        correct_answer = str(problem["answer"]).strip()

        # Greedy decode (single answer)
        greedy_output = generate_reasoning(
            model, tokenizer, question,
            temperature=0.0, device=device
        )
        greedy_answer = extract_final_answer(greedy_output)
        if greedy_answer and greedy_answer.strip() == correct_answer:
            greedy_correct += 1

        # Self-consistency: k diverse samples
        answers = []
        for _ in range(k):
            output = generate_reasoning(
                model, tokenizer, question,
                temperature=temperature, device=device
            )
            answer = extract_final_answer(output)
            if answer:
                answers.append(answer.strip())

        if not answers:
            continue

        # Majority vote
        counter = Counter(answers)
        majority_answer, majority_count = counter.most_common(1)[0]
        consistency_rate = majority_count / len(answers)
        consistency_rates.append(consistency_rate)

        if majority_answer == correct_answer:
            sc_correct += 1

    n = len(problems)
    greedy_acc = greedy_correct / n
    sc_acc = sc_correct / n

    return {
        "greedy_accuracy": greedy_acc,
        "sc_accuracy": sc_acc,
        "consistency_rate": np.mean(consistency_rates),
        "improvement": sc_acc - greedy_acc,
        "n_problems": n,
        "k": k
    }

def generate_reasoning(model, tokenizer, question, temperature, device):
    """Generate a single reasoning trace for a question."""
    import torch

    prompt = f"Question: {question}\n\nLet's think step by step.\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        if temperature == 0.0:
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id
            )
        else:
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=temperature,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )

    return tokenizer.decode(
        output[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True
    )
```
Process Reward Models vs Outcome Reward Models
The distinction between process reward models (PRMs) and outcome reward models (ORMs) is fundamental for both evaluation and training of reasoning models.
An outcome reward model answers the question: "Did the model get the right final answer?" It evaluates only the last token - the answer. ORMs are easy to train (you just need correct/incorrect labels) and straightforward to use in evaluation.
A process reward model answers the question: "Is each intermediate step in this reasoning chain correct?" PRMs assign a reward to each step in the chain. Lightman et al. (2023) in "Let's Verify Step by Step" trained a PRM on a dataset of human-annotated step correctness labels for MATH problems and showed that using a PRM to select among multiple generated chains significantly outperforms using an ORM.
The key finding: if you generate 8 reasoning chains and use an ORM to pick the best (the chain the ORM scores as most likely to end in a correct answer), you do worse than if you use a PRM to pick the chain whose every step is scored as correct. This is because:
- Multiple chains can lead to the same correct answer through different paths - not all paths are equally generalizable
- A chain with all correct steps is more likely to be correct for the right reasons
- PRMs can identify chains where an error early on happened to cancel out, producing a correct answer through faulty reasoning
For evaluation purposes, PRMs offer a richer signal than simple correctness:
- Step-level accuracy: what fraction of steps are correct (even in chains that get the wrong final answer)?
- Error localization: where in the chain does the first error occur?
- Error type distribution: arithmetic errors vs. setup errors vs. logical errors
```python
class ProcessRewardEvaluator:
    """
    Use a trained PRM to evaluate reasoning chain quality.
    Uses the Math-Shepherd Mistral-based PRM from HuggingFace.
    """

    def __init__(self, model_name: str = "peiyi9979/math-shepherd-mistral-7b-prm"):
        from transformers import AutoTokenizer, AutoModelForCausalLM
        import torch

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.model.eval()

        # Math-Shepherd format: each step ends with a step-tag token, and the PRM
        # scores the step by comparing the logits of a "good" vs a "bad" candidate
        # token at that position. Exact tokens vary by PRM implementation.
        self.step_tag = "ки"    # Position at which a step is scored
        self.good_token = "+"   # Candidate token indicating a correct step
        self.bad_token = "-"    # Candidate token indicating an incorrect step

    def score_steps(
        self,
        problem: str,
        reasoning_trace: str
    ) -> list[float]:
        """
        Score each step in a reasoning trace.
        Returns list of per-step correctness scores in [0, 1].
        """
        import math
        import torch

        # Format input for the PRM, scoring one step at a time
        steps = parse_steps(reasoning_trace)
        step_scores = []

        for i, step in enumerate(steps):
            # Build context up to and including this step
            context = f"{problem}\n" + "\n".join(
                f"Step {j+1}: {s}"
                for j, s in enumerate(steps[:i+1])
            )
            # Append the step tag at which the PRM scores the step
            query = context + f" {self.step_tag}"

            inputs = self.tokenizer(query, return_tensors="pt")
            inputs = {k: v.to(self.model.device) for k, v in inputs.items()}

            with torch.no_grad():
                outputs = self.model(**inputs)
                logits = outputs.logits[0, -1, :]  # Logits for the next token

            # Get probability of the "good" vs "bad" step token
            good_id = self.tokenizer.encode(
                self.good_token, add_special_tokens=False
            )[0]
            bad_id = self.tokenizer.encode(
                self.bad_token, add_special_tokens=False
            )[0]
            good_logit = logits[good_id].item()
            bad_logit = logits[bad_id].item()

            # Softmax over just these two logits (numerically stable)
            m = max(good_logit, bad_logit)
            good_prob = math.exp(good_logit - m) / (
                math.exp(good_logit - m) + math.exp(bad_logit - m)
            )
            step_scores.append(good_prob)

        return step_scores

    def select_best_chain(
        self,
        problem: str,
        chains: list[str]
    ) -> tuple[int, list[list[float]]]:
        """
        Select the best reasoning chain using PRM scores.
        Returns (best_chain_index, all_step_scores).

        Uses minimum step score as the chain quality metric
        (weakest link in the reasoning chain).
        """
        all_scores = []
        chain_min_scores = []

        for chain in chains:
            scores = self.score_steps(problem, chain)
            all_scores.append(scores)
            chain_min_scores.append(min(scores) if scores else 0.0)

        best_idx = chain_min_scores.index(max(chain_min_scores))
        return best_idx, all_scores
```
Diagrams (not reproduced): Reasoning Benchmark Difficulty Spectrum; Chain-of-Thought Evaluation Flow; PRM vs ORM Selection Strategy.
Production Engineering Notes
Running GSM8K Evaluation Efficiently
GSM8K's 1,319 test problems can be evaluated quickly with careful batching. The standard evaluation script from EleutherAI's lm-evaluation-harness handles this cleanly:
```bash
# Install lm-evaluation-harness
pip install lm-eval

# Evaluate on GSM8K with 8-shot CoT
lm_eval --model hf \
    --model_args pretrained=deepseek-ai/deepseek-math-7b-instruct \
    --tasks gsm8k \
    --num_fewshot 8 \
    --batch_size 8 \
    --output_path results/gsm8k_deepseek.json

# Zero-shot CoT evaluation
lm_eval --model hf \
    --model_args pretrained=deepseek-ai/deepseek-math-7b-instruct \
    --tasks gsm8k_cot_zeroshot \
    --batch_size 8 \
    --output_path results/gsm8k_zeroshot.json
```
For MATH evaluation, the answer normalization is critical. Always use the normalization functions from the Hendrycks MATH repository or equivalent, not a simple string comparison:
```python
import re

def normalize_math_answer(answer: str) -> str:
    """Normalize a MATH dataset answer for comparison."""
    # Remove LaTeX formatting
    answer = answer.strip()
    answer = answer.replace("$", "")
    answer = answer.replace("\\left(", "(")
    answer = answer.replace("\\right)", ")")
    answer = answer.replace("\\left[", "[")
    answer = answer.replace("\\right]", "]")

    # Normalize fractions: \frac{a}{b} -> (a)/(b)
    frac_pattern = r'\\frac\{([^}]+)\}\{([^}]+)\}'
    answer = re.sub(frac_pattern, r'(\1)/(\2)', answer)

    # Remove \text{} wrappers
    answer = re.sub(r'\\text\{([^}]+)\}', r'\1', answer)

    # Normalize spaces
    answer = re.sub(r'\s+', ' ', answer).strip()
    return answer

def math_answers_match(predicted: str, ground_truth: str) -> bool:
    """Check if predicted and ground truth MATH answers match."""
    # First try normalized string comparison
    pred_norm = normalize_math_answer(predicted)
    gt_norm = normalize_math_answer(ground_truth)
    if pred_norm == gt_norm:
        return True

    # Fall back to symbolic equality via sympy on the raw LaTeX strings
    # (math_answers_equal expects LaTeX input, not the normalized form)
    return math_answers_equal(predicted, ground_truth)
```
Measuring CoT Quality Degradation After Fine-Tuning
A common and underdetected problem: fine-tuning on domain data can degrade reasoning quality even when final-answer accuracy on domain tasks improves. The model learns shortcuts specific to your domain data, at the cost of the general reasoning chains it learned during pretraining.
Detection protocol:
- Before fine-tuning: run GSM8K with CoT, record per-problem reasoning traces
- After fine-tuning: run the same problems, compare reasoning traces
- Check: are reasoning chains still complete? Are they getting shorter? Are intermediate steps still logically connected?
A quick diagnostic is reasoning chain length. If post-fine-tuning GSM8K reasoning traces are significantly shorter on average, the model may have learned to skip steps rather than reason through them.
```python
def measure_reasoning_chain_quality(
    model,
    tokenizer,
    problems: list[dict],
    device: str = "cuda"
) -> dict:
    """
    Measure quality signals of generated reasoning chains.
    Use this before and after fine-tuning to detect regression.
    """
    import re
    import statistics
    import torch

    chain_lengths = []
    step_counts = []
    has_intermediate_results = []

    for problem in problems:
        prompt = (
            f"Question: {problem['question']}\n"
            "Let's think step by step.\n"
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(device)

        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id
            )

        trace = tokenizer.decode(
            output[0][inputs.input_ids.shape[1]:],
            skip_special_tokens=True
        )

        # Length in tokens
        chain_lengths.append(
            output.shape[1] - inputs.input_ids.shape[1]
        )

        # Number of reasoning steps
        steps = parse_steps(trace)
        step_counts.append(len(steps))

        # Does the trace compute intermediate numerical results?
        has_numbers = bool(re.search(r'=\s*\d', trace))
        has_intermediate_results.append(has_numbers)

    return {
        "mean_chain_length_tokens": statistics.mean(chain_lengths),
        "mean_step_count": statistics.mean(step_counts),
        "fraction_with_intermediate_results": (
            sum(has_intermediate_results) / len(has_intermediate_results)
        ),
        "n_problems": len(problems)
    }
```
GSM8K vs Internal Domain Math - Predicting Transfer
GSM8K accuracy does not directly predict math accuracy on domain-specific problems. A model fine-tuned for financial analysis may learn to handle compound interest and amortization correctly while losing some general arithmetic reasoning. Track both GSM8K (general math) and an internal domain math set in parallel.
The signal to watch: if GSM8K improves after fine-tuning, that is suspicious - your domain data should not improve general math capability. If GSM8K drops by more than 3-5 percentage points, fine-tuning may be degrading general reasoning. If internal domain math improves while GSM8K is stable, that is the expected behavior.
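One way to operationalize this is a paired regression check run before and after each fine-tuning job. The function name and thresholds below are illustrative defaults under the assumption that accuracies are expressed as fractions, not established standards:

```python
def check_finetuning_regression(
    gsm8k_before: float,
    gsm8k_after: float,
    domain_before: float,
    domain_after: float,
    max_gsm8k_drop: float = 0.03,  # Illustrative threshold: 3 percentage points
) -> dict:
    """Flag suspicious patterns in general vs. domain math accuracy after fine-tuning."""
    gsm8k_delta = gsm8k_after - gsm8k_before
    domain_delta = domain_after - domain_before
    return {
        "gsm8k_delta": gsm8k_delta,
        "domain_delta": domain_delta,
        # Large GSM8K gains from domain data are suspicious (possible leakage/contamination)
        "suspicious_gsm8k_gain": gsm8k_delta > max_gsm8k_drop,
        # GSM8K drops beyond the threshold suggest degraded general reasoning
        "general_reasoning_regression": gsm8k_delta < -max_gsm8k_drop,
        # Expected outcome: domain improves while GSM8K stays roughly stable
        "expected_pattern": domain_delta > 0 and abs(gsm8k_delta) <= max_gsm8k_drop,
    }
```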
Common Mistakes
:::danger Evaluating Reasoning Models Without Chain-of-Thought
If you evaluate a model's mathematical capabilities without CoT prompting, your scores will be dramatically lower than what the model is actually capable of, and the ordering of models may be wrong. A model that scores 45% on GSM8K zero-shot (no CoT) may score 72% with few-shot CoT. If you compare one model's 45% no-CoT score against another model's 60% score that was produced with CoT prompting, you will reach an incorrect conclusion.
Always use a consistent prompting protocol across all models in a comparison. State the protocol explicitly when reporting results. The difference between "GSM8K: 72%" and "GSM8K (8-shot CoT): 72%" can be 20+ percentage points and completely changes the interpretation. :::
:::danger Treating MATH Aggregate Score as a Single Number
MATH has 5 difficulty levels and 7 subject areas. Two models with identical aggregate scores of "52% on MATH" can have radically different capability profiles. One might be strong on Number Theory but weak on Geometry. Another might solve all Level 1-3 problems but fail on Level 4-5. These represent completely different mathematical capabilities.
Always report MATH scores broken down by difficulty level, and ideally by subject. If you are evaluating for a specific use case (calculus tutoring vs. statistics analysis vs. competition math coaching), filter to the relevant subset. An aggregate score that hides the difficulty breakdown is close to useless for practical decision-making. :::
:::warning Self-Consistency Requires Temperature Calibration
Self-consistency with k=20 chains at temperature 0.0 (greedy) will generate nearly identical chains - majority voting over k copies of the same answer is just greedy with extra steps and extra compute. Self-consistency requires diverse chains to be meaningful.
Standard temperature for self-consistency is 0.7-0.9. At this temperature, different chains will explore different reasoning paths and occasionally reach different answers. The convergence signal from majority voting is only meaningful if there was genuine diversity in the generation process.
Check your implementation: if 18 out of 20 chains are identical, you have misconfigured temperature, not evidence of model confidence. :::
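A quick sanity check for this misconfiguration, assuming you already have the k sampled chains for a problem in hand (the helper below is an illustrative sketch, not part of any framework):

```python
from collections import Counter

def check_chain_diversity(chains: list[str], max_identical_fraction: float = 0.8) -> dict:
    """
    Flag self-consistency runs where the sampled chains are nearly all identical,
    which usually means temperature was set too low or sampling was disabled.
    """
    counts = Counter(chains)
    most_common_fraction = counts.most_common(1)[0][1] / len(chains)
    return {
        "n_chains": len(chains),
        "n_unique_chains": len(counts),
        "most_common_fraction": most_common_fraction,
        "likely_misconfigured_temperature": most_common_fraction >= max_identical_fraction,
    }
```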
:::warning Assuming Answer Extraction Is Reliable
Extracting the final answer from a reasoning trace is harder than it looks. Models write "The answer is 42", "= 42", "$42", "42 clips", "42.0", and "\boxed{42}" as distinct formats that all represent the same answer. Poorly written answer extraction will artificially lower scores by failing to match semantically identical answers.
Test your answer extraction on 50 problems manually. Verify the extracted answer matches what a human would identify as the final answer. A 5% answer extraction failure rate translates directly into 5% lower accuracy scores - this is a measurement artifact, not a model capability difference. Use the normalization functions from established evaluation frameworks rather than writing your own regex from scratch. :::
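One way to run that manual check is to dump a random sample of traces with the extracted answer next to the gold answer and eyeball the mismatches. A minimal sketch, reusing the extract_final_answer helper defined earlier (sample size and output format are arbitrary choices):

```python
import random

def spot_check_answer_extraction(problems: list[dict], traces: list[str], n: int = 50) -> None:
    """
    Print extracted vs. gold answers for a random sample so a human can verify
    that extraction failures are not silently lowering accuracy.
    """
    indices = random.sample(range(len(problems)), min(n, len(problems)))
    for i in indices:
        extracted = extract_final_answer(traces[i])
        gold = str(problems[i]["answer"]).strip()
        flag = "" if extracted == gold else "  <-- CHECK"
        print(f"[{i}] extracted={extracted!r}  gold={gold!r}{flag}")
```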
:::warning Contamination Is Severe for Competition Math Problems
AIME, AMC, and competition math problems are extensively discussed online with full solutions. Models trained on internet data have likely seen many of these problems and their solutions. When a model scores 60% on AIME problems from 2018-2022 but only 25% on AIME 2024, the gap more likely reflects contamination inflating the older scores than a genuine difference in problem difficulty.
For reliable reasoning evaluation, use either:
- Fresh competition problems from after the model's training cutoff
- MATH problems at Level 3-5 (less discussed online than competition finals)
- GPT-generated novel reasoning problems in the style of competition math but not taken from actual competitions
- Your own internal reasoning test set drawn from domain problems not available online :::
Interview Questions
1. What is the difference between GSM8K and MATH, and when would you use each?
Answer: GSM8K and MATH differ in difficulty, problem type, and what capability they measure.
GSM8K contains 8,500 grade-school word problems requiring 2-8 steps of arithmetic reasoning. Problems are accessible - they involve addition, subtraction, multiplication, and division applied to everyday scenarios. By 2024, frontier models score 90-95% on GSM8K with chain-of-thought prompting, meaning it is approaching saturation for large models. GSM8K is most useful for: evaluating small models (7B range) where the harder problems are too difficult to show signal, measuring whether fine-tuning has degraded basic reasoning, and providing a quick regression test.
MATH contains 12,500 competition-level problems across 7 subjects and 5 difficulty levels. The hardest problems (Level 4-5) require mathematical insight beyond step-by-step arithmetic - recognizing problem structure, applying non-obvious transformations, and executing multi-page computations. Frontier models score 50-80% on MATH, with difficulty varying dramatically by level and subject. MATH is most useful for: differentiating between strong models, evaluating mathematical capability for research assistance or advanced tutoring, and tracking progress on genuinely hard reasoning.
In practice, use GSM8K as a baseline sanity check and MATH Level 3+ as the discriminating signal for reasoning capability. Always report MATH broken down by difficulty level, not just as an aggregate.
2. Explain chain-of-thought prompting and why it works. Under what conditions does it not help?
Answer: Chain-of-thought prompting provides examples that include intermediate reasoning steps before the final answer. When a model sees 8 examples of "here is the problem, here is the step-by-step reasoning, here is the answer", it learns to produce similar reasoning traces on new problems.
The why: chain-of-thought works because complex problems require intermediate computation that does not fit in a single forward pass without explicit representation. Writing out a step allocates attention to the intermediate result, making it available for subsequent steps. The model is essentially using its own output as working memory.
The emergent threshold: CoT helps substantially only above roughly 100B parameters (Wei et al., 2022). Below this threshold, models lack the capacity to produce coherent reasoning chains - the intermediate steps are incoherent and do not improve accuracy, sometimes making it worse.
Conditions where CoT does not help or hurts:
- Simple tasks: for one-step problems, CoT adds verbosity without benefit. Asking "What is 7+3?" does not benefit from a reasoning chain.
- Small models: sub-7B models often produce irrelevant reasoning chains that confuse rather than help. Zero-shot is often better for very small models.
- Misleading few-shot examples: poorly constructed CoT examples can lead models away from correct reasoning patterns
- Hallucinated reasoning chains: for factual questions, models may produce confident but incorrect reasoning chains that lead to wrong answers with high confidence - worse than a direct uncertain answer
3. What is self-consistency and when should you use it in production?
Answer: Self-consistency (Wang et al., 2022) generates k diverse reasoning chains at elevated temperature (0.7-0.9) and takes a majority vote over the final answers. If 15 of 20 independently generated chains all reach the same answer, that convergence is strong evidence the answer is correct.
Self-consistency improves accuracy by 5-15 percentage points on math and reasoning tasks compared to single-chain greedy decoding. The improvement is largest on problems of intermediate difficulty - problems that are easy (model is always right) or very hard (model is always wrong) show less improvement.
Production use cases where self-consistency makes sense:
- High-stakes automated decisions: when accuracy matters more than latency and cost (financial calculations, medical dosage calculations, legal analysis)
- Confidence estimation: the fraction of chains agreeing on an answer is a useful confidence signal; agreement below roughly 60% indicates low reliability.
- Evaluation benchmarks: SC@20 measures a model's capability ceiling more reliably than greedy, which is relevant for understanding true capability
Production use cases where self-consistency is inappropriate:
- Interactive conversations: generating 20 chains before every response adds 20x latency. Users will not wait.
- Cheap commodity tasks: self-consistency costs 20x the compute. For simple factual lookups, it is wasteful.
- Streaming responses: you cannot vote before generating, so streaming is incompatible with self-consistency
The practical approach: use self-consistency in batch pipelines that run overnight and in high-value single-use evaluations. Do not use it in interactive serving unless you have a two-stage pipeline that separates fast draft generation from careful high-stakes final answers.
4. What are process reward models and how do they differ from outcome reward models?
Answer: Outcome reward models (ORMs) evaluate only the final answer: is the conclusion correct or not? They assign a single scalar reward to the entire reasoning chain based on whether it ends in the right answer.
Process reward models (PRMs) evaluate each individual step in the reasoning chain: is this step logically valid given the problem and the steps that preceded it? They assign a reward to every step, providing dense feedback throughout the reasoning process.
The practical difference (Lightman et al., 2023): when you use a reward model to select the best chain from k generated candidates, PRM-based selection outperforms ORM-based selection by roughly 5-10 percentage points on MATH. The mechanism: ORMs can be fooled by chains that make errors that happen to cancel out (producing the right answer through wrong reasoning). PRMs correctly penalize these chains because step-level errors are explicitly scored.
For evaluation purposes, PRMs enable:
- Richer diagnostic signals: step accuracy, first error location, error type distribution
- Better chain selection in best-of-N sampling scenarios
- Honest capability measurement: a model that gets 80% final-answer accuracy but only 65% step accuracy is less reliable than these numbers suggest
For training purposes, PRMs provide a more informative training signal than sparse final-answer reward, which is why they are used in training more capable reasoning models (OpenAI o1's training process incorporates PRM-like step evaluation).
Training a PRM requires step-level human annotations or a synthetic annotation process, making it more expensive than an ORM. The Lightman et al. paper required human labelers to mark each step as positive, negative, or neutral - a significant annotation investment. The math-shepherd approach (Wang et al., 2023) automates PRM annotation by using Monte Carlo estimation: for each prefix of steps, estimate whether the chain can be completed correctly from that point.
5. How would you evaluate whether a model is genuinely reasoning vs. pattern matching on math benchmarks?
Answer: The distinction between genuine reasoning and pattern matching shows up most clearly through:
Perturbation sensitivity: Take GSM8K problems and change one number or operation in a semantically neutral way (change "3 apples" to "7 apples" where the answer should scale proportionally). A reasoning model adjusts its answer accordingly. A pattern-matching model may produce an answer calibrated to the original problem structure.
Step-shuffled problems: Present the problem information in a different order than typical word problems. Reasoning models should be robust to this; pattern-matching models may fail because they have learned to extract numbers in a specific positional pattern.
Novel problem types: Create problems with the same structure as GSM8K but in unusual domains (economics problems, physics word problems, abstract scenarios). Reasoning models transfer; pattern-matching models fail.
Reasoning chain coherence: Generate chains and have a PRM or expert human evaluate whether each step follows logically from the previous. A model that has memorized problem-answer patterns but not internalized the reasoning process will produce chains with logical gaps that happen to end in correct answers.
Out-of-distribution difficulty: Test on problems slightly harder than training data. A model that genuinely reasons can make progress on harder problems; a model that pattern-matched training data hits a wall.
Practically: compare performance on known benchmarks (potentially contaminated) vs. fresh competition problems (post-training-cutoff AIME/AMC) and vs. internally constructed novel problems. A large gap between known and novel benchmark performance is a strong signal of pattern matching over genuine reasoning.
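A lightweight way to approximate the perturbation test is to template a known problem, vary its numbers, and check that the model's answers track the recomputed ground truth. Everything in this sketch (the template, the helper name, the ranges) is illustrative:

```python
import random

def perturbed_gsm8k_style_problems(n: int = 20) -> list[dict]:
    """
    Generate numeric perturbations of a fixed GSM8K-style template.
    A model that genuinely reasons should track the recomputed answer;
    a pattern-matcher tends to stay anchored to the original numbers.
    """
    problems = []
    for _ in range(n):
        april = random.randint(10, 90) * 2  # Keep it even so "half as many" is an integer
        may = april // 2
        problems.append({
            "question": (
                f"Natalia sold clips to {april} of her friends in April, and then she sold "
                f"half as many clips in May. How many clips did Natalia sell altogether?"
            ),
            "answer": str(april + may),
        })
    return problems

# Feed these to your existing accuracy harness and compare against the score
# on the original (potentially memorized) benchmark problems.
```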
6. How does temperature affect reasoning evaluation and what protocol should you use?
Answer: Temperature controls the randomness of token sampling and has a large effect on reasoning evaluation quality.
For single-answer accuracy (greedy evaluation): use temperature 0.0 (deterministic greedy decoding). This gives reproducible, stable results. The same model on the same problem always produces the same answer. Useful for regression testing and for final reported numbers where you want minimal variance.
For pass@k or self-consistency evaluation: use temperature 0.7-0.9. You need diverse chains to make majority voting meaningful. Temperature 0.0 produces identical chains - voting over copies is pointless. Temperature above 1.0 adds excessive randomness that generates incoherent chains. The 0.7-0.8 range is empirically the sweet spot for diverse but coherent reasoning.
For estimating capability ceiling: use temperature 0.8 with k=20+ samples. This shows what the model is capable of under ideal conditions and is useful for comparing models' maximum potential.
Protocol recommendation for reporting:
- Report greedy accuracy (temperature=0.0) as the primary metric - this is most relevant for single-response applications
- Report SC@20 (temperature=0.8) as the capability ceiling metric
- State both protocols explicitly in any comparison - two papers reporting "72% on GSM8K" may be measuring very different things if one uses greedy and one uses self-consistency
The improvement from greedy to SC@20 is itself informative: a large gap (greedy 60%, SC@20 80%) means the model frequently knows how to solve the problem but does not reliably execute it. A small gap means the model's per-attempt quality is already high. For production systems where you can afford multiple samples, the SC@20 number is the relevant deployment metric.
Summary
Reasoning and math evaluation measures capabilities that separate models that have internalized problem-solving skills from those that pattern-match their way through familiar problem types.
The benchmark landscape spans a clear difficulty spectrum: GSM8K for grade-school multi-step arithmetic (now approaching saturation for large models), MATH for competition mathematics with stratified difficulty levels, AIME for competition math at the frontier of current capability, BIG-Bench Hard for diverse reasoning tasks specifically selected to require multi-step logic, and GPQA for graduate-level science requiring genuine domain expertise.
Chain-of-thought evaluation - comparing zero-shot CoT, few-shot CoT, and self-consistency voting - reveals not just whether a model is accurate but whether it reasons in ways that generalize. Step-level evaluation with process reward models goes further, identifying where reasoning breaks down rather than just whether it ultimately succeeds.
The evaluation approach should match the deployment need: if you are building a math tutoring tool where explanation quality matters, step-level accuracy is more important than final-answer accuracy. If you are building an automated calculation pipeline where only the result matters, greedy accuracy with tight regression testing on GSM8K is sufficient.
The worst mistake in reasoning evaluation is treating a single aggregate score as a capability summary. GSM8K accuracy tells you almost nothing about MATH Level 5 capability. MATH aggregate score hides dramatic variance across difficulty levels and subjects. Report full distributions, use appropriate difficulty levels for your use case, and always validate that benchmark performance predicts performance on the actual tasks you care about.
