
Evaluating Reasoning Models

The Problem With Easy Benchmarks

In 2019, BERT-style models matched the human baseline on the GLUE benchmark. The NLP community celebrated. A year later, new harder benchmarks showed that BERT, and all its successors, still failed at reasoning tasks that any thoughtful human found straightforward. The benchmarks had been "solved" - but the underlying capability hadn't been built.

This is the recurring challenge in AI evaluation: as soon as a benchmark is widely used, models begin to approach or exceed human performance on it, and it stops telling you anything interesting. Sometimes this is because models genuinely learned the capability. More often, it's because the benchmark leaked into training data, or the problems weren't hard enough to begin with, or the evaluation protocol had systematic flaws.

Evaluating reasoning models is particularly hard because the very thing they're supposed to be good at - novel logical reasoning - is also the thing that's hardest to evaluate without contamination. Competition math problems are available online. Competitive programming problems are on public datasets. PhD-level science questions have been on the internet for years.

This lesson covers the key benchmarks for reasoning models, what they measure, their limitations, and how to design reliable evaluations for your own systems.


Why This Exists - The Race to the Bottom in Benchmarks

When a benchmark becomes famous, it becomes a training target. Labs optimize their models against public benchmarks, both intentionally (through targeted fine-tuning) and unintentionally (through data contamination - training data scraped from the internet includes benchmark problems and solutions).

The consequence is a well-documented pattern:

  1. A new hard benchmark is released
  2. Models score poorly (genuine capability gap)
  3. Researchers celebrate the benchmark as a meaningful measure
  4. Labs train on problems from the benchmark domain (sometimes the exact problems)
  5. Scores improve rapidly
  6. It becomes unclear whether improvement reflects genuine capability or benchmark-specific optimization
  7. The community needs a new, harder benchmark
  8. Repeat

Understanding this cycle is essential for correctly interpreting the dramatic numbers you'll see in reasoning model comparisons.


The Key Benchmarks

AIME - American Invitational Mathematics Examination

What it is: AIME is a real math competition administered annually to high school students. Problems require deep mathematical insight across algebra, combinatorics, geometry, and number theory, but no written proofs. Each answer is a single integer from 0–999, making automated verification trivial.

Why it's good: (1) Problems are hard - they were designed to distinguish exceptional human students. (2) The integer-answer format makes evaluation unambiguous. (3) New problems are released each year, making contamination measurable. (4) Performance is historically well-calibrated to human ability levels.

Key results:

  • GPT-4o (2024): ~9–13% accuracy
  • o1 (2024): ~74% accuracy
  • DeepSeek-R1: ~79% accuracy
  • o3 (2024): ~96% accuracy
  • Top human students at competition: typically 7–15 correct out of 15 (AIME 2024)

What AIME measures: multi-step mathematical reasoning, number theory, combinatorics, geometry. It measures genuine mathematical problem-solving, not pattern matching.

Contamination risk: AIME problems are new each year, so the 2024 set could not have been in the training data of models released before the 2024 competition. Problems from AIME 2023 and earlier are in public datasets and likely in training data.

Limitations: (1) Only tests competition math - a narrow slice of mathematical reasoning. (2) The integer-answer format means models can avoid showing work and guess. (3) Doesn't test proof writing or mathematical communication.

```python
# AIME evaluation harness
import re
from collections import Counter
from typing import Optional


def extract_aime_answer(completion: str) -> Optional[int]:
    """AIME answers are integers 0-999."""
    # Look for explicit answer patterns
    patterns = [
        r"(?:answer|result|solution)\s*(?:is|=|:)?\s*(\d{1,3})\b",
        r"\\boxed\{(\d{1,3})\}",
        r"=\s*(\d{1,3})\s*$",
    ]
    for pattern in patterns:
        # Use the last match: the final answer usually appears at the end
        matches = re.findall(pattern, completion, re.IGNORECASE | re.MULTILINE)
        if matches:
            return int(matches[-1])  # \d{1,3} already guarantees 0 <= value <= 999

    # Last 3-digit-or-fewer number as fallback
    numbers = re.findall(r"\b(\d{1,3})\b", completion)
    if numbers:
        return int(numbers[-1])
    return None


def evaluate_on_aime(
    model_fn,
    aime_problems: list,  # List of {problem, answer, year} dicts
    n_samples_per_problem: int = 1,  # Set to >1 for majority voting over samples
    temperature: Optional[float] = None,  # None: 0.0 for one sample, 0.7 otherwise
) -> dict:
    """
    Evaluate a model on AIME problems.

    Args:
        model_fn: Function that takes a prompt and a temperature and returns a completion
        aime_problems: AIME problem set
        n_samples_per_problem: 1 for greedy decoding, >1 for majority voting
        temperature: Sampling temperature; derived from n_samples_per_problem if None

    Returns:
        Accuracy and per-problem results
    """
    # A default argument cannot refer to another parameter, so resolve it here
    if temperature is None:
        temperature = 0.0 if n_samples_per_problem == 1 else 0.7

    correct = 0
    total = len(aime_problems)
    results = []

    for problem_data in aime_problems:
        problem_text = problem_data["problem"]
        ground_truth = int(problem_data["answer"])  # tolerate answers stored as strings

        prompt = f"""Solve this AIME problem. Show your work step by step.
Express your final answer as a single integer from 0 to 999.

Problem: {problem_text}"""

        # Sample once, or several times when majority voting
        answers = []
        for _ in range(n_samples_per_problem):
            completion = model_fn(prompt, temperature=temperature)
            answer = extract_aime_answer(completion)
            if answer is not None:
                answers.append(answer)

        # Select answer: majority vote over extracted answers (trivial when n=1)
        final_answer = Counter(answers).most_common(1)[0][0] if answers else None

        is_correct = final_answer == ground_truth
        if is_correct:
            correct += 1

        results.append({
            "problem": problem_text[:100] + "...",
            "ground_truth": ground_truth,
            "model_answer": final_answer,
            "correct": is_correct,
            "year": problem_data.get("year"),
        })

    return {
        "accuracy": correct / total if total else 0.0,  # avoid dividing by zero
        "correct": correct,
        "total": total,
        "per_problem": results,
        "n_samples": n_samples_per_problem,
    }
```
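Because each per-problem record carries a `year`, the harness output can be split into pre-cutoff and post-cutoff accuracy, which is what makes AIME contamination measurable. The sketch below is one way to do that grouping; the function name `accuracy_by_year` and the sample records are hypothetical, and it only assumes records shaped like the `per_problem` entries above.

```python
from collections import defaultdict


def accuracy_by_year(per_problem: list) -> dict:
    """Group per-problem results by year and return the accuracy for each year."""
    counts = defaultdict(lambda: [0, 0])  # year -> [n_correct, n_total]
    for record in per_problem:
        bucket = counts[record.get("year")]
        bucket[0] += int(bool(record["correct"]))
        bucket[1] += 1
    return {year: n_correct / n_total for year, (n_correct, n_total) in counts.items()}


# Hypothetical records, shaped like the "per_problem" entries of the harness
records = [
    {"year": 2023, "correct": True},
    {"year": 2023, "correct": True},
    {"year": 2024, "correct": True},
    {"year": 2024, "correct": False},
]
by_year = accuracy_by_year(records)
# A large positive gap between older and newer years hints at contamination
gap = by_year[2023] - by_year[2024]
```

A gap on its own is only a hint: problem difficulty also varies from year to year, so it is worth checking against human score distributions for the same years.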

MATH-500

What it is: A subset of 500 problems from the MATH benchmark (Hendrycks et al., 2021), spanning competition math from AMC 10 and AMC 12 through AIME. Problems cover prealgebra, algebra, intermediate algebra, geometry, number theory, counting and probability, and precalculus.

Why it's good: Covers a broad range of difficulty within competition math. The 500-problem subset was specifically designed to be representative across difficulty levels and topics.

Key results:

  • GPT-4o: 74.6%
  • o1: 96.4%
  • DeepSeek-R1: 97.3%

Contamination risk: HIGH. MATH-500 problems are from well-known competition sources and have been on the internet for years. Most frontier models have likely seen many of these problems (or near-identical ones) in training. Take MATH-500 results with significant skepticism for models trained after 2023.

What MATH-500 measures: problem-solving in the standard competition-math style. Scores correlate strongly with AIME performance.

Codeforces Percentile Rating

What it is: A real competitive programming platform where users solve algorithmic problems and receive a rating based on their performance. Ratings are competitive - your rating reflects performance against other actual human competitors on problems you haven't seen before.

Why it's good: (1) New problems are constantly added - contamination is less severe than for fixed benchmarks. (2) Rating is comparative against humans, giving intuitive context. (3) Problems span all difficulty levels. (4) Verifiable through code execution.

Key results (as Codeforces rating, approximately):

  • GPT-4o: ~800 (roughly the 11th percentile)
  • o1: ~1670 (roughly the 89th percentile)
  • DeepSeek-R1: ~2030 (roughly the 96th percentile)
  • o3: ~2700 (International Grandmaster range, above the 99th percentile)

Limitations: A model's Codeforces rating depends on which problems the evaluation includes and on whether the model has access to the same testing infrastructure as human contestants. Comparison to human ratings requires careful methodology.
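A percentile claim such as "top 4%" is only as meaningful as the reference population behind it. As a minimal sketch, assuming you have a list of human ratings to compare against, the conversion looks like this; `percentile_of_rating` and the `population` values are hypothetical and not taken from Codeforces data.

```python
from bisect import bisect_left


def percentile_of_rating(model_rating: float, human_ratings: list) -> float:
    """Percentage of the reference population rated strictly below the model."""
    ordered = sorted(human_ratings)
    if not ordered:
        raise ValueError("human_ratings must not be empty")
    below = bisect_left(ordered, model_rating)  # count of ratings < model_rating
    return 100.0 * below / len(ordered)


# Hypothetical ratings; a real comparison needs the actual active-user distribution
population = [800, 950, 1100, 1200, 1350, 1400, 1500, 1650, 1900, 2300]
print(percentile_of_rating(1700, population))
```

The same rating can map to quite different percentiles depending on whether the population is all registered accounts or only recently active contestants, so a reported percentile should state which one was used.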

ARC-AGI - Abstraction and Reasoning Corpus

What it is: Created by François Chollet (2019), ARC-AGI consists of tasks that require identifying abstract visual patterns from 3–5 demonstration pairs and applying the pattern to a new input. Tasks are deliberately designed to resist memorization - they test genuine pattern abstraction.

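Example: Given demonstrations in which a 2x2 blue square becomes a 2x2 red square and a 3x3 blue square becomes a 3x3 red square, and a 4x4 blue square as the test input, the expected output is a 4x4 red square. The rule to infer is "recolor blue to red, keep shape and size".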
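To make the task format concrete, an ARC-style task can be represented as small integer grids, one integer per cell color. The sketch below infers and applies a single hypothetical rule family (pure recoloring). Real ARC-AGI tasks are far more varied, so this is an illustration of the input/output shape and not a solver; the color encoding and function names are assumptions.

```python
def infer_color_map(demo_pairs: list) -> dict:
    """Infer a per-cell color substitution from (input, output) grid pairs."""
    mapping = {}
    for grid_in, grid_out in demo_pairs:
        for row_in, row_out in zip(grid_in, grid_out):
            for cell_in, cell_out in zip(row_in, row_out):
                # Every demonstration must agree on how each color is rewritten
                if mapping.setdefault(cell_in, cell_out) != cell_out:
                    raise ValueError("demonstrations are not a pure recoloring")
    return mapping


def apply_color_map(grid: list, mapping: dict) -> list:
    """Apply the substitution, leaving colors unseen in the demos unchanged."""
    return [[mapping.get(cell, cell) for cell in row] for row in grid]


# 0 = background, 1 = blue, 2 = red (an arbitrary encoding for this sketch)
demos = [
    ([[1, 1], [1, 1]], [[2, 2], [2, 2]]),
    ([[0, 1, 0], [1, 1, 1], [0, 1, 0]], [[0, 2, 0], [2, 2, 2], [0, 2, 0]]),
]
test_input = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
prediction = apply_color_map(test_input, infer_color_map(demos))
```

Scoring is exact match on the whole output grid, which is why partial pattern recognition earns no credit.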

Why it's significant: ARC-AGI was specifically designed as a measure of reasoning that would not be "solved" by training data memorization or scale. For years, models scored terribly on it. The o3 breakthrough (87.5% with high compute, 2024) was therefore a significant event.

Key results:

  • All models before o3: under 35%
  • GPT-4o: ~5%
  • o1 (high reasoning effort): ~32%
  • o3 (low compute): 75.7%
  • o3 (high compute): 87.5%

Contamination risk: ARC-AGI tasks are hand-designed to be novel, and the held-out test set is not public. Contamination risk is lower than for benchmarks whose problems circulate online, though not zero.

What ARC-AGI measures: systematic generalization - applying a pattern learned from a few examples to a new instance. This is distinct from both memorization (applying stored knowledge) and multi-step math reasoning (following logical rules). It's closer to what Chollet calls "developer-aware generalization" - the ability to handle tasks that neither the system nor its developers anticipated.

Limitations: (1) The tasks are visual pattern matching, which may not capture all aspects of reasoning. (2) o3's high-compute setting is very expensive (reportedly on the order of thousands of dollars per task), which makes it impractical for routine use. (3) ARC-AGI-2 has been developed to be harder and is replacing the original.

GPQA Diamond - Graduate-Level Scientific Reasoning

What it is: GPQA (Graduate-Level Google-Proof Q&A) contains multiple-choice questions at PhD level in chemistry, biology, and physics. "Diamond" refers to the hardest subset. These questions were created by PhD students and validated by domain experts.

Why it's good: Tests domain expertise that goes beyond what can be memorized from Wikipedia or textbooks. Questions require synthesizing knowledge across sub-fields and making inference under uncertainty.

Key results:

  • Human domain experts: ~65%
  • GPT-4o: ~53%
  • o1: ~77.3%
  • DeepSeek-R1: ~71.5%
  • Claude Opus 4.6: ~79%

Contamination risk: Moderate. Questions were vetted to be "Google-proof" (not directly findable by search), but advanced models trained on academic papers may have seen equivalent reasoning in training data.

Limitations: Multiple-choice format means 25% random baseline. The "diamond" subset is small (198 questions), so results can be noisy.

FrontierMath

What it is: A dataset of genuinely novel, expert-level mathematics problems created by professional mathematicians, specifically designed to be resistant to memorization. Released by Epoch AI in late 2024.

Why it's significant: Unlike MATH-500, FrontierMath problems are not recycled from known competitions. They are constructed to require research-level mathematical reasoning.

Key results:

  • GPT-4o, Claude 3.5 Sonnet: under 2%
  • o3: approximately 25%

The dramatic difficulty of FrontierMath (compared to MATH-500 where o1 scores 96%) suggests that current reasoning models, despite their impressive performance on competition math, are still very far from the frontier of mathematical research.


Process-Level vs. Outcome-Level Evaluation

Most benchmarks above use outcome evaluation: is the final answer correct? This is necessary for scale but misses important information.

Process-level evaluation asks: was the reasoning path correct, step by step?

```python
def process_level_evaluation(
    problem: str,
    solution_steps: list,
    ground_truth_steps: list,
    prm,  # Process Reward Model: callable returning one score per step
    tolerance: float = 0.3,  # Allow minor step variations (unused in this simplified version)
) -> dict:
    """
    Evaluate a solution at the process level, not just the outcome.

    Args:
        problem: The problem statement
        solution_steps: Model's step-by-step solution
        ground_truth_steps: Reference correct step-by-step solution
        prm: Process Reward Model for step scoring
        tolerance: Acceptable deviation from ground truth per step

    Returns:
        Process-level evaluation results
    """
    if not solution_steps:
        raise ValueError("solution_steps must contain at least one step")

    # PRM scoring of model's solution; float() accepts scalar tensors and plain floats
    model_step_scores = [float(score) for score in prm(problem, solution_steps)]

    # Find first incorrect step
    first_error = None
    for i, score in enumerate(model_step_scores):
        if score < 0.5:  # PRM says this step is wrong
            first_error = i
            break

    # Check if correct steps match ground truth structure
    # (simplified comparison - real implementation would use symbolic math)
    length_ratio = len(solution_steps) / max(len(ground_truth_steps), 1)

    # Exact string match on the final step is a crude outcome check
    outcome_correct = bool(ground_truth_steps) and (
        solution_steps[-1].strip() == ground_truth_steps[-1].strip()
    )

    return {
        "outcome_correct": outcome_correct,
        "first_error_at_step": first_error,
        "n_steps": len(solution_steps),
        "n_reference_steps": len(ground_truth_steps),
        "length_ratio": length_ratio,
        "average_step_quality": sum(model_step_scores) / len(model_step_scores),
        "min_step_quality": min(model_step_scores),
        "all_step_scores": model_step_scores,
    }
```

Why process-level evaluation matters:

  1. A model might arrive at a correct answer through faulty reasoning (lucky guess)
  2. A model might have nearly correct reasoning but make a small arithmetic error at the end
  3. For teaching and tutoring applications, you need to know where the reasoning went wrong, not just that it was wrong
  4. Process-level metrics may be more predictive of generalization than outcome metrics, since they reward sound reasoning rather than lucky answers
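The first three cases above can be told apart by combining the outcome flag with the position of the first flagged step. The sketch below is one possible way to do that; the function `diagnose` and its label strings are hypothetical, and it assumes the `outcome_correct`, `first_error_at_step`, and `n_steps` fields returned by the evaluation function.

```python
def diagnose(outcome_correct: bool, first_error_at_step, n_steps: int) -> str:
    """Combine outcome and process signals into a coarse diagnosis label."""
    if first_error_at_step is None:
        # No step was flagged: a wrong outcome then points at answer
        # extraction or final formatting, not at the reasoning itself
        return "sound" if outcome_correct else "answer_slip"
    if outcome_correct:
        return "right_answer_flawed_reasoning"  # case 1: possible lucky guess
    if first_error_at_step >= n_steps - 1:
        return "late_error"  # case 2: reasoning held until the last step
    return "early_error"  # case 3: the step index says where to look


print(diagnose(outcome_correct=True, first_error_at_step=2, n_steps=5))
```

Counting these labels across a test set gives a quick breakdown of how failures are distributed, which an accuracy number alone cannot show.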

Contamination - The Elephant in the Room

Contamination occurs when benchmark problems (or very similar problems) appear in model training data. For reasoning benchmarks, this is a severe concern.

```python
def estimate_contamination_risk(
    benchmark_problems: list,
    model_training_cutoff: str,
    problem_source: str,
) -> dict:
    """
    Estimate contamination risk for a benchmark.
    This is a heuristic framework - exact contamination is unknowable.

    The scores are illustrative judgment calls, not measurements. This static
    lookup does not inspect benchmark_problems or model_training_cutoff.
    """
    # Factors to assess by hand for each benchmark (not computed here)
    contamination_factors = {
        "problem_age_years": 0,
        "source_accessibility": "unknown",
        "is_on_popular_platforms": False,
        "has_solutions_online": False,
        "is_paywalled": False,
    }

    risk_scores = {
        "MATH-500": 0.85,  # Very high - problems are old, on internet
        "AIME (current year)": 0.05,  # Very low - new each year
        "AIME (2020 and earlier)": 0.80,  # High - widely available
        "ARC-AGI": 0.20,  # Low - novel hand-designed tasks, test set private
        "GPQA Diamond": 0.30,  # Moderate - designed to be Google-proof
        "FrontierMath": 0.05,  # Very low - created specifically to be novel
        "Codeforces (live)": 0.20,  # Low-moderate - new problems but old ones in data
    }

    # Mitigation strategies
    mitigations = [
        "Report n-gram overlap between training data and benchmark",
        "Use held-out test sets not released until after model training",
        "Compare performance on pre-training-cutoff vs. post-cutoff benchmark versions",
        "Report canary test: accuracy on intentionally corrupted problems",
        "Test on paraphrased versions of benchmark problems",
    ]

    # Derive the summary lists from the scores so names stay consistent
    ranked = sorted(risk_scores, key=risk_scores.get)

    return {
        "source": problem_source,
        "risk_score": risk_scores.get(problem_source),  # None for an unlisted source
        "factors_to_assess": contamination_factors,
        "risk_scores": risk_scores,
        "recommended_mitigations": mitigations,
        "most_reliable_benchmarks": [name for name in ranked if risk_scores[name] <= 0.20],
        "least_reliable": [name for name in ranked if risk_scores[name] >= 0.80],
    }
```

How to Detect Contamination

  1. N-gram overlap: compute the overlap between benchmark problems and the model's training data (if accessible). High overlap suggests contamination.

  2. Calibration check: a contaminated model will be overconfident on contaminated problems. Compare model confidence on benchmark problems vs. on held-out problems from the same domain.

  3. Paraphrase test: paraphrase benchmark problems (same math, different wording). If a model's performance drops dramatically on paraphrased problems, it may have memorized the originals.

  4. Canary test: deliberately corrupt some benchmark problems (change numbers, flip conditions) and check if the model "corrects" the problem back to the original. If it does, the original problem is in its training data.

```python
def canary_contamination_test(
    model_fn,
    benchmark_problem: str,
    corruption_fn,
    extract_answer,  # Function mapping a completion to its final answer (or None)
    n_corruptions: int = 5,
) -> dict:
    """
    Test for contamination by seeing if the model corrects corrupted problems
    back to their original form.

    A model that has memorized the benchmark problem may:
    1. "Correct" the corruption in its reasoning
    2. Produce the answer to the original problem despite the corruption
    3. Note the "error" in the problem

    The signal is only meaningful if corruption_fn changes the correct answer
    and returns a different corruption on each call.
    """
    if n_corruptions < 1:
        raise ValueError("n_corruptions must be at least 1")

    original_response = model_fn(benchmark_problem)
    original_answer = extract_answer(original_response)

    contamination_signals = []
    for _ in range(n_corruptions):
        corrupted_problem = corruption_fn(benchmark_problem)
        corrupted_response = model_fn(corrupted_problem)
        corrupted_answer = extract_answer(corrupted_response)
        response_lower = corrupted_response.lower()

        # If model gives the same answer to a corrupted problem as the original,
        # it may have ignored the corruption (memorization signal)
        contamination_signals.append({
            "corrupted_problem": corrupted_problem[:200],
            "original_answer": original_answer,
            "corrupted_answer": corrupted_answer,
            # Two failed extractions (None == None) do not count as a match
            "answers_match": original_answer is not None
            and original_answer == corrupted_answer,
            # Crude keyword check; expect false positives
            "model_noted_corruption": "error" in response_lower
            or "typo" in response_lower,
        })

    match_rate = sum(c["answers_match"] for c in contamination_signals) / n_corruptions

    return {
        "match_rate": match_rate,
        "contamination_suspected": match_rate > 0.6,
        "evidence": contamination_signals,
    }
```
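The n-gram overlap check (method 1 in the list above) can be sketched in a few lines when some slice of the training corpus is available. This is a minimal version: the function names, the choice of word-level 8-grams, and the sample texts are all assumptions for illustration, and a real check would need an index over the corpus instead of rebuilding n-gram sets for every problem.

```python
import re


def word_ngrams(text: str, n: int) -> set:
    """Lowercased word n-grams of a text, ignoring punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(problem: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the problem's n-grams that occur in any corpus document."""
    problem_grams = word_ngrams(problem, n)
    if not problem_grams:
        return 0.0  # problem is shorter than n words
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= word_ngrams(doc, n)
    return len(problem_grams & corpus_grams) / len(problem_grams)


# Hypothetical problem and corpus documents
problem = "Find the number of ordered pairs of positive integers whose product is 2024."
corpus = [
    "Forum post: find the number of ordered pairs of positive integers whose product is 2024. Any hints?",
    "An unrelated document about sorting algorithms and their running times.",
]
overlap_with_leak = ngram_overlap(problem, corpus)
overlap_without_leak = ngram_overlap(problem, corpus[1:])
```

Exact n-gram matching misses paraphrased or translated copies, which is why it is usually paired with the paraphrase and canary tests.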

Designing Your Own Evaluation

For production systems, you need an evaluation suite tailored to your actual use cases, not generic benchmarks. Here's how to design it:

```python
def design_production_evaluation(
    task_distribution: dict,
    accuracy_requirement: float,
    latency_budget_seconds: float,
) -> dict:
    """
    Design a production evaluation suite.

    Args:
        task_distribution: Dict mapping task_type -> (fraction, difficulty)
        accuracy_requirement: Required accuracy (0-1)
        latency_budget_seconds: Max acceptable latency

    Returns:
        Recommended evaluation approach
    """
    # Step 1: Collect test cases representative of production distribution
    # - Sample from actual user queries (anonymized)
    # - Augment with hand-crafted edge cases
    # - Include adversarial examples
    # Target: 200-500 problems with verified answers

    evaluation_design = {
        "test_set_size": "200-500 problems",
        "collection_method": [
            "Sample from production logs (anonymize first)",
            "Hand-craft hard cases for each task type",
            "Include near-misses from known failure modes",
        ],
        "evaluation_metrics": [
            "Primary: accuracy (correct/total)",
            "Secondary: partial credit (if answers have structure)",
            "Latency: p50 and p95",
            "Cost: average tokens per query",
        ],
        "contamination_protection": [
            "Use queries from after model training cutoff",
            "For math: use new problems not on public datasets",
            "Rotate test sets quarterly",
        ],
        "statistical_considerations": {
            "minimum_problems_for_significance": 100,
            "confidence_interval_method": "Wilson interval for proportions",
            "significance_level": 0.05,
        },
    }

    # Step 2: Define success criteria
    # Not "maximize accuracy" but "meet accuracy requirement at cost/latency budget"
    evaluation_design["success_criteria"] = {
        "primary": f"Accuracy >= {accuracy_requirement:.0%} on held-out test set",
        "latency": f"p95 latency < {latency_budget_seconds}s",
        "cost": "Cost per query within budget (calculate from expected volume)",
        "regression": "No accuracy degradation > 2% on any task category",
    }

    return evaluation_design
```
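The design above names the Wilson interval as its confidence-interval method. As a reference point, here is a minimal sketch of that calculation using only the standard library; the function name `wilson_interval` is an assumption, and `z = 1.96` corresponds to an approximately 95% interval.

```python
import math


def wilson_interval(n_correct: int, n_total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p_hat = n_correct / n_total
    denom = 1 + z * z / n_total
    center = (p_hat + z * z / (2 * n_total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n_total + z * z / (4 * n_total * n_total)
    )
    return (max(0.0, center - half_width), min(1.0, center + half_width))


low, high = wilson_interval(80, 100)
print(f"80/100 correct: 95% CI roughly {low:.3f} to {high:.3f}")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when accuracy is close to 0% or 100%, which is common for reasoning benchmarks at either extreme of difficulty.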

Key Evaluation Principles

  1. Use a held-out test set, always: if your test set is public, models will eventually train on it

  2. Stratify by difficulty: aggregate accuracy hides important information. A model that's 95% on easy problems and 30% on hard problems is different from one that's 80% across the board

  3. Report confidence intervals: with 100 test problems, the 95% confidence interval on your accuracy estimate can be as wide as ±10 percentage points (the worst case, near 50% accuracy). Report this

  4. Measure calibration, not just accuracy: a well-calibrated model that expresses 90% confidence is right 90% of the time. Test whether your model's confidence scores match its actual accuracy

  5. Track regression over time: as you update models, track whether accuracy on your evaluation suite stays stable
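Principle 4 can be checked with expected calibration error (ECE): bucket predictions by stated confidence and compare each bucket's mean confidence with its actual accuracy. The sketch below assumes you can obtain a confidence in [0, 1] for each answer; the function name and the equal-width binning with 10 bins are choices made for this example.

```python
def expected_calibration_error(confidences: list, correct: list, n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, bool(is_correct)))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# A model that says 90% and is right 9 times out of 10 is well calibrated
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# The same confidence with only 5 correct answers is overconfident
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

With small test sets many bins end up nearly empty, so fewer bins (or reporting per-bin counts alongside the ECE) gives a more stable picture.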


The Benchmark Landscape at a Glance

| Benchmark | Type | Contamination Risk | Best For | Avoid When |
| --- | --- | --- | --- | --- |
| AIME (current year) | Math | Low | Competition math eval | Not released yet |
| AIME (2020 and earlier) | Math | High | Historical comparison only | Evaluating recent models |
| MATH-500 | Math | Very High | Quick comparison | Claiming "new SOTA" |
| FrontierMath | Math | Very Low | Genuine mathematical capability | Needing easy tasks |
| Codeforces (live) | Code | Low-Medium | Code reasoning | Fixed snapshots |
| ARC-AGI | Abstract | Low | Novel reasoning | Narrow math focus |
| GPQA Diamond | Science | Medium | PhD-level science | Narrow domain |

:::danger Common Mistake: Treating MATH-500 as Ground Truth

MATH-500 is severely contaminated for models trained after 2022. A model scoring 97% on MATH-500 has very likely seen many of those exact problems during training. For reliable evaluation of mathematical reasoning, use AIME from the current year, FrontierMath, or problems you construct yourself with verified solutions.

:::

:::warning Small Test Sets and Statistical Noise

A 100-problem evaluation has a 95% confidence interval of approximately ±10 percentage points. When comparing two models that score 82% and 79% on 100 problems, the difference is not statistically significant. Use at least 300 problems for meaningful comparisons, and always report confidence intervals.

:::
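The 82% vs. 79% comparison can be checked with a two-proportion z-test. The sketch below uses only the standard library and treats the two scores as independent samples, which is a simplifying assumption: when both models answer the same problems, a paired test such as McNemar's is the better fit. The function name is hypothetical.

```python
import math


def two_proportion_z_test(correct_a: int, correct_b: int, n_a: int, n_b: int) -> tuple:
    """Return (z, two-sided p-value) for the difference of two accuracies."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both models all-correct or all-wrong: no evidence of a difference
        return (0.0, 1.0)
    z = (p_a - p_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return (z, p_value)


z, p_value = two_proportion_z_test(82, 79, 100, 100)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

For these inputs the p-value is far above 0.05, so a 3-point gap on 100 problems gives no grounds for ranking one model above the other.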

:::tip Build Domain-Specific Evaluations

For production systems, generic benchmarks like MATH-500 are poor proxies for actual product performance. Build a test set of 200–500 problems representative of your actual user distribution, with verified answers. Run this evaluation before every model update. This is more predictive of your users' experience than any public benchmark.

:::


Interview Questions and Answers

Q1: Why is AIME preferred over MATH-500 for evaluating reasoning models?

AIME is preferred for three reasons: (1) New AIME problems are released each year, making contamination measurable and controllable. MATH-500 problems, by contrast, come from well-known competitions that have been online for years and are very likely in training data. (2) AIME has a natural difficulty calibration - it's designed to identify exceptional mathematical talent, so performance is meaningful relative to strong human competitors. (3) The integer-answer format (0–999) eliminates ambiguity in evaluation and, unlike multiple choice, makes lucky guesses unlikely. MATH-500's advantages are breadth (it covers more mathematical topics) and size, but for ranking reasoning models in 2024–2025, AIME (current year) provides more reliable signal.

Q2: What does ARC-AGI measure and why was o3's performance significant?

ARC-AGI measures systematic generalization - the ability to infer an abstract rule from a small number of demonstrations and apply it to a new case. This is distinct from memorization (applying stored knowledge) or logical deduction (following known rules). Chollet designed it specifically to be resistant to being "solved" by scale and memorization. o3's 87.5% score (high compute, 2024) was significant because: (a) all prior models scored under 35%, suggesting this was a genuine capability barrier; (b) ARC-AGI's design should make memorization-based improvement minimal; (c) it suggests o3 can perform a form of in-context pattern learning that approaches human-like abstract reasoning on these structured tasks.

Q3: Explain the contamination problem and how you would detect it.

Contamination occurs when benchmark problems or their solutions appear in model training data, inflating performance beyond genuine capability. Detection methods: (1) N-gram overlap analysis - compute the fraction of benchmark problems that appear verbatim or near-verbatim in the training corpus. (2) Paraphrase testing - paraphrase problems (same math, different wording) and compare performance on originals vs. paraphrases; large drops suggest memorization. (3) Canary testing - intentionally corrupt benchmark problems (change key numbers, flip conditions) and check if the model "corrects" the corruptions; if it produces the original answer despite the corruption, the original is memorized. (4) Calibration analysis - a model that memorized a problem will be more confident than its accuracy warrants.

Q4: How do you design a reliable internal evaluation suite for a reasoning model in production?

Key steps: (1) Collect test cases representative of actual production distribution - sample from real user queries, augment with hand-crafted hard cases. (2) Ensure all test cases have verified ground-truth answers (use a domain expert or formal verifier). (3) Stratify by difficulty: report accuracy separately for easy/medium/hard problems. (4) Use contamination protection: include problems created after the model's training cutoff. (5) Calculate confidence intervals: with N=200 problems, report 95% CI, not just point estimates. (6) Define clear success criteria before running the evaluation. (7) Run regression tests: compare against a baseline model to ensure you haven't regressed on any category. (8) Rotate the test set quarterly to prevent overfitting in the development process.

Q5: Why does Codeforces provide a better competitive programming evaluation than LeetCode hard?

Codeforces is preferable for three reasons: (1) Continuous stream of new problems - contamination risk is lower than for static datasets. LeetCode hard problems have been on the internet for years. (2) Competitive rating - performance is relative to actual human competitors on the same problems, giving intuitive context. A rating of about 2000, for example, places a model ahead of roughly 96% of rated human competitors. (3) Variety of difficulty and problem types - Codeforces spans easy to world-champion difficulty, while LeetCode hard is a narrower slice. Limitation: comparing model vs. human Codeforces ratings requires careful methodology - models don't compete under the same live-contest conditions as humans, and the evaluation must control for which problems are selected.

Q6: What is the difference between outcome-level and process-level evaluation, and when does each matter?

Outcome evaluation checks only the final answer (correct or incorrect). Process evaluation checks each intermediate reasoning step. Outcome evaluation is appropriate when: you only care about final answers, you have limited evaluation resources, or you're doing large-scale benchmarking. Process evaluation is important when: the correctness of reasoning matters (not just the answer), you're evaluating for teaching/tutoring applications (need to know where reasoning breaks down), you're training process reward models (need step-level labels), or you're investigating model failure modes (outcome evaluation doesn't tell you whether the model failed due to a wrong setup, wrong computation, or wrong reasoning approach). Process evaluation requires a process reward model or human annotation at each step, making it significantly more expensive.


© 2026 EngineersOfAI. All rights reserved.