Skip to main content

Test-Time Compute - Scaling at Inference

The Night the Math Didn't Add Up

It's 11 PM. You're a senior ML engineer at a fintech startup, and your GPT-4-based code review pipeline just flagged a critical bug in a payment calculation. The pipeline runs a single inference pass, looks at the output, and calls it done. Tonight the model agreed with flawed logic - a subtle off-by-one error in interest compounding - because it never really "checked its work." It generated a confident-sounding wrong answer in 800 milliseconds and moved on.

You know a junior engineer would have made the same mistake, but a senior engineer would have caught it - not because they knew more at the start, but because they slowed down. They worked through the math on a scratch pad. They double-checked. They tried it from a different angle. They caught the flaw on the third attempt.

That gap - between a fast, confident first guess and a slower, more careful answer - is exactly what the test-time compute paradigm is trying to close. The question researchers started asking in 2023 is: what happens if we let the model think longer? Not train it more, not make it bigger, but genuinely give it more computational budget at inference time?

The answer turned out to be more interesting than anyone expected. On many hard reasoning tasks - competition math, complex code, multi-step logical inference - spending ten times more compute at inference time gives you gains that would require training a model several times larger to achieve by pre-training alone. The implications for how we build and deploy AI systems are enormous.

This lesson is about how test-time compute works, why it works, what the math looks like, and what it means to actually deploy systems that use it.


Why This Exists - The Limits of Training-Time Scaling

For most of the deep learning era, the answer to "how do you make a model smarter?" was: train a bigger model on more data. This worked extraordinarily well. The Kaplan et al. (2020) scaling laws showed that model performance follows predictable power laws as you scale parameters and training tokens. OpenAI, Google, Anthropic, and Meta spent years and hundreds of millions of dollars following this curve.

But by late 2022 and into 2023, two cracks appeared in this strategy.

Crack one: diminishing returns on data. Quality training data is not unlimited. The Hoffmann et al. (2022) "Chinchilla" paper showed that prior models had been undertrained - you should scale data and parameters together. But even with this correction, there is a ceiling: the internet has roughly 10–15 trillion tokens of high-quality text, and the frontier models are approaching that. Running a second pass over the same data helps marginally, but synthetic data pipelines needed to produce genuinely novel training signal are still nascent.

Crack two: compute cost at training scale. Training GPT-4 reportedly cost somewhere between 50Mand50M and 100M. Training a model 10x larger would cost 10–100x more (depending on efficiency gains). That is not a sustainable growth curve for most organizations, and even for the frontier labs it creates a significant speed-of-iteration problem.

So the question became: can we get smarter outputs without building a fundamentally bigger model? Can we trade inference-time compute for output quality?

The answer is yes - but only for certain types of problems. Understanding why and when it works requires understanding how models actually fail.


The Core Insight - Why Thinking Longer Helps

When a language model generates a response, it produces tokens one at a time, left to right. Each token is generated by a single forward pass through the network. The model cannot go back. It cannot check its work mid-generation (at least not without architectural changes). Whatever knowledge and reasoning is encoded in those billions of parameters has to manifest in a single sweeping left-to-right pass.

For simple tasks - completing a sentence, translating a phrase, retrieving a well-known fact - this is fine. But for tasks that require multi-step reasoning, the model is being asked to maintain a complex chain of logic entirely in its forward pass, without the ability to course-correct.

Think about how a human solves a math competition problem versus how they answer "what is 2+2." The second is immediate recall. The first involves exploration, backtracking, checking intermediate results, and trying alternative approaches. The human is doing something that takes time and involves multiple mental acts.

Test-time compute gives the language model an analogous ability - not through architectural changes, but through generating more tokens and potentially multiple independent attempts.


The Mechanisms - How to Spend More Compute at Inference

There are several distinct strategies, each with different cost/quality profiles.

Best-of-N Sampling

The simplest form: generate NN independent completions for a given prompt, then select the best one using some scoring mechanism.

If you have a verifier that can tell you whether an answer is correct (a math problem with a ground-truth answer, a code solution that passes or fails tests), best-of-N is extremely powerful. You sample NN candidate solutions, run them through the verifier, and return the first correct one (or the highest-scoring one if your verifier produces scores).

The math of best-of-N: suppose each independent sample has probability pp of being correct. The probability that at least one of NN samples is correct is:

P(at least one correct)=1(1p)NP(\text{at least one correct}) = 1 - (1 - p)^N

If p=0.2p = 0.2 (20% single-pass accuracy), then:

  • N=1N = 1: 20% accuracy
  • N=5N = 5: 10.85=67.2%1 - 0.8^5 = 67.2\%
  • N=10N = 10: 10.810=89.3%1 - 0.8^{10} = 89.3\%
  • N=100N = 100: 10.810099.99%1 - 0.8^{100} \approx 99.99\%

This is a dramatic improvement with no change to the underlying model. The only cost is linear in NN - NN forward passes instead of one.

The critical dependency here is the verifier. Best-of-N only works when you can reliably score or verify outputs. For math problems with deterministic answers, this is easy. For open-ended writing or reasoning, it is much harder.

import anthropic
from typing import Callable

def best_of_n(
prompt: str,
n: int,
verifier: Callable[[str], float],
model: str = "claude-3-5-sonnet-20241022",
temperature: float = 0.8,
) -> tuple[str, float]:
"""
Generate N completions and return the one with the highest verifier score.

Args:
prompt: The input prompt
n: Number of samples to generate
verifier: A function that scores a completion (higher = better)
temperature: Sampling temperature (>0 for diversity)

Returns:
(best_completion, best_score)
"""
client = anthropic.Anthropic()

candidates = []
for i in range(n):
response = client.messages.create(
model=model,
max_tokens=2048,
temperature=temperature,
messages=[{"role": "user", "content": prompt}]
)
completion = response.content[0].text
score = verifier(completion)
candidates.append((completion, score))
print(f"Sample {i+1}/{n}: score={score:.3f}")

# Return the highest-scoring candidate
best_completion, best_score = max(candidates, key=lambda x: x[1])
return best_completion, best_score


def math_verifier(completion: str) -> float:
"""
Simple verifier for math problems that checks if the answer is correct.
In practice, you would parse the answer and compare to ground truth.
"""
# Example: check if "42" appears as the final answer
import re
# Look for common answer patterns
patterns = [
r"(?:answer|result|solution)(?:\s+is)?\s*:?\s*(\d+)",
r"=\s*(\d+)\s*$",
r"\*\*(\d+)\*\*",
]
for pattern in patterns:
match = re.search(pattern, completion, re.IGNORECASE | re.MULTILINE)
if match:
value = int(match.group(1))
return 1.0 if value == 42 else 0.0
return 0.0 # Could not find an answer


# Simpler demonstration with a local model approach
def best_of_n_simple(
prompt: str,
n: int,
correct_answer: str,
) -> dict:
"""Simulate best-of-N with a mock model for demonstration."""
import random

# Simulate sampling with p=0.25 per sample
p_correct = 0.25
results = []

for i in range(n):
is_correct = random.random() < p_correct
answer = correct_answer if is_correct else f"wrong_answer_{i}"
results.append({"answer": answer, "correct": is_correct})

# Best-of-N: return first correct answer if any
for result in results:
if result["correct"]:
return {
"answer": result["answer"],
"found_correct": True,
"samples_tried": results.index(result) + 1
}

return {
"answer": results[-1]["answer"],
"found_correct": False,
"samples_tried": n
}


# Demonstrate the probability curve
import math

def best_of_n_probability(p: float, n: int) -> float:
"""Probability of at least one correct in N samples."""
return 1 - (1 - p) ** n

print("Best-of-N accuracy with p=0.25 base rate:")
for n in [1, 2, 4, 8, 16, 32, 64, 128]:
acc = best_of_n_probability(0.25, n)
bar = "=" * int(acc * 40)
print(f" N={n:4d}: {acc:.1%} |{bar}")

Majority Voting (Self-Consistency)

Wang et al. (2022) introduced self-consistency: generate multiple independent chain-of-thought reasoning paths, and take a majority vote over the final answers. Instead of needing a verifier, you rely on the assumption that correct answers will cluster together.

The intuition is simple: if you ask the model a hard math problem 20 times, and 14 of those times it arrives at 47, that answer is probably right - even if the reasoning paths were different. Wrong answers tend to disagree with each other; correct answers tend to converge.

final answer=argmaxai=1N1[answeri=a]\text{final answer} = \arg\max_{a} \sum_{i=1}^{N} \mathbf{1}[\text{answer}_i = a]

This is remarkably effective on math and reasoning benchmarks. Wang et al. showed that self-consistency with chain-of-thought improved accuracy by 10–20 percentage points on GSM8K and other math benchmarks over single-pass CoT.

from collections import Counter
from typing import List

def majority_vote(answers: List[str]) -> tuple[str, float]:
"""
Take a majority vote over a list of answers.

Returns the plurality answer and its confidence (fraction of votes).
"""
if not answers:
raise ValueError("Need at least one answer")

# Normalize answers (strip whitespace, lowercase for comparison)
normalized = [a.strip().lower() for a in answers]

counter = Counter(normalized)
most_common_answer, count = counter.most_common(1)[0]
confidence = count / len(answers)

# Return the original (non-normalized) version of the winning answer
for original, norm in zip(answers, normalized):
if norm == most_common_answer:
return original.strip(), confidence

return answers[0], confidence


def self_consistency_pipeline(
problem: str,
n_samples: int = 20,
temperature: float = 0.7,
) -> dict:
"""
Self-consistency: sample N chain-of-thought completions,
extract final answers, take majority vote.

This is a pseudocode demonstration - in practice you'd call an LLM API.
"""
# In a real implementation:
# 1. Prompt model with CoT instructions: "Think step by step, then give your final answer."
# 2. Sample n_samples completions at temperature > 0
# 3. Parse final answer from each completion (often after "Therefore" or "The answer is")
# 4. Take majority vote

cot_prompt = f"""Solve this step by step.

Problem: {problem}

Work through each step carefully, then state your final answer clearly."""

# Simulated answers for demonstration
simulated_answers = ["47", "47", "47", "52", "47", "47", "52", "47", "47", "47",
"47", "52", "47", "47", "47", "52", "47", "47", "47", "52"]

# In reality you'd call: answers = [extract_answer(call_llm(cot_prompt)) for _ in range(n_samples)]
answers = simulated_answers[:n_samples]

best_answer, confidence = majority_vote(answers)

vote_distribution = Counter(a.strip().lower() for a in answers)

return {
"answer": best_answer,
"confidence": confidence,
"vote_distribution": dict(vote_distribution),
"n_samples": n_samples,
}


result = self_consistency_pipeline(
"A train leaves Chicago at 8 AM traveling at 60 mph. Another train leaves Detroit at 9 AM "
"traveling at 80 mph toward Chicago. Chicago to Detroit is 280 miles. Where do they meet?",
n_samples=20
)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.1%}")
print(f"Vote distribution: {result['vote_distribution']}")

Beam Search over Reasoning Steps

Rather than sampling independently and aggregating, beam search maintains a set of kk partial solutions and greedily expands the most promising ones. Traditional beam search over tokens is not very helpful for reasoning (it tends to produce repetitive outputs), but beam search over reasoning steps - where each step is a full sentence or paragraph - can be powerful when combined with a process reward model to score intermediate steps.

Sequential Refinement

A fourth approach is self-refinement or iterative correction: generate an initial answer, have the model critique its own answer, and generate a revised answer. This can be done in a loop. Research shows this helps in some settings but the model can get "stuck" - confidently wrong models tend to produce confident critiques that reinforce wrong answers.


The Scaling Curve - Compute vs. Accuracy

One of the most important empirical findings of the past two years is that test-time compute scaling follows its own power law. Snell et al. (2024) showed that on difficult reasoning tasks, accuracy scales roughly as a power law in compute spent at inference:

accuracyaCtestb\text{accuracy} \approx a \cdot C_{\text{test}}^b

where CtestC_{\text{test}} is the total compute spent at inference (proportional to total tokens generated across all samples), aa and bb are task-specific constants, and b>0b > 0 for hard reasoning tasks.

This has a crucial implication: a smaller model with more test-time compute can match or exceed a larger model with single-pass inference, on the right tasks.

Jones (2021) showed that on competitive programming problems, compute-optimal scaling at inference (allocating more compute to harder problems) can dramatically improve performance. The key insight is that you should spend compute adaptively: easy problems don't need many samples; hard problems benefit enormously from more.


Compute Budget Tokens - Teaching Models to Self-Regulate

OpenAI's o1 introduced an interesting technique: the model is given a "compute budget" as part of its context - essentially a signal of how much "thinking" it should do. Longer budgets lead to more extended chain-of-thought. Shorter budgets lead to faster, more concise reasoning.

This is conceptually similar to the idea of adaptive computation - allocating more compute to inputs that require it. In the o1 paradigm, the model learns, through reinforcement learning, to generate more reasoning tokens on harder problems.

The key insight is that the number of tokens a model generates is itself a form of compute. Each token generation is one forward pass through the transformer. Generating 2000 tokens "costs" 2000 forward passes, compared to 200 tokens for a simple answer. This is the "thinking" that happens in extended chain-of-thought reasoning models.


When Test-Time Compute Helps and When It Doesn't

This is the most practically important question for engineers building production systems.

Tasks Where It Helps Enormously

  • Mathematical reasoning: algebra, calculus, number theory, competition math
  • Multi-step code generation: writing non-trivial algorithms, debugging complex programs
  • Formal logic and planning: satisfiability problems, constraint satisfaction, scheduling
  • Scientific reasoning: chemistry stoichiometry, physics problems with multiple laws
  • Complex decision-making: analyzing trade-offs across many variables

The common thread: these are tasks where the correct answer can be verified (either externally or by the model self-checking), where there exist multiple valid reasoning paths, and where errors compound - a mistake in step 2 makes step 3, 4, and 5 wrong too.

Tasks Where It Doesn't Help (or Hurts)

  • Simple factual retrieval: "What year was the Eiffel Tower built?" - spending more compute doesn't help
  • Stylistic writing: creative writing quality doesn't improve with majority voting
  • Perceptual tasks: image description doesn't improve by sampling 20 times
  • Tasks requiring external knowledge: if the model doesn't know a fact, sampling it 100 times doesn't create the knowledge
  • Short, well-formed questions: the model's first answer is usually fine

Comparing Training-Time vs. Inference-Time Scaling

DimensionTraining-Time ScalingTest-Time Compute
What scalesParameters + dataTokens generated per query
Cost structureOne-time (amortized)Per-query (ongoing)
Latency impactNone (baked into model)Significant (linear in samples)
Task coverageBroad - all tasksNarrow - hard reasoning tasks
Improvement typeAcross-the-boardTask-specific
Capital requirementVery high ($M+)Moderate (compute at serving)
ReversibleNo - once trainedYes - tunable per request

The key economic point: training-time improvements are amortized across all queries - you pay once, every query benefits. Test-time compute costs are per-query - you pay for every request that uses extra compute. For high-value, low-volume queries (medical diagnosis, legal analysis, complex financial modeling), the per-query cost may be acceptable. For high-volume commodity queries, it's often not.


Production Engineering Notes

Parallelizing Best-of-N

In production, best-of-N sampling is trivially parallelizable. You can fire off NN API calls simultaneously (constrained by rate limits). The wall-clock latency of best-of-N can be similar to a single call if you have enough parallel capacity.

import asyncio
import anthropic
from typing import Callable

async def best_of_n_parallel(
prompt: str,
n: int,
verifier: Callable[[str], float],
model: str = "claude-3-5-sonnet-20241022",
temperature: float = 0.8,
) -> tuple[str, float]:
"""
Generate N completions in parallel, return the best.
Wall-clock latency ~ single call (if enough capacity).
"""
client = anthropic.AsyncAnthropic()

async def single_sample(i: int) -> tuple[str, float]:
response = await client.messages.create(
model=model,
max_tokens=2048,
temperature=temperature,
messages=[{"role": "user", "content": prompt}]
)
completion = response.content[0].text
score = verifier(completion)
return completion, score

# Fire all N requests simultaneously
tasks = [single_sample(i) for i in range(n)]
results = await asyncio.gather(*tasks)

# Return the best
best_completion, best_score = max(results, key=lambda x: x[1])
return best_completion, best_score


# Run with: asyncio.run(best_of_n_parallel(prompt, n=8, verifier=my_verifier))

Adaptive Compute Allocation

Not every query is equally hard. A production system should route:

  • Easy queries → single-pass inference (cheapest)
  • Medium queries → self-consistency with N=5–10
  • Hard queries → best-of-N with N=20–50 + verifier, or a dedicated reasoning model
def compute_budget_router(
problem: str,
difficulty_classifier: Callable[[str], str],
) -> dict:
"""
Route to different compute strategies based on estimated difficulty.
"""
difficulty = difficulty_classifier(problem)

strategies = {
"easy": {
"approach": "single_pass",
"n_samples": 1,
"use_cot": False,
"estimated_tokens": 200,
},
"medium": {
"approach": "self_consistency",
"n_samples": 8,
"use_cot": True,
"estimated_tokens": 2000, # 8 * 250
},
"hard": {
"approach": "best_of_n_with_verifier",
"n_samples": 32,
"use_cot": True,
"estimated_tokens": 16000, # 32 * 500
},
"very_hard": {
"approach": "reasoning_model",
"n_samples": 1,
"use_cot": True, # hidden in o1
"estimated_tokens": 10000, # o1 extended thinking
},
}

return strategies.get(difficulty, strategies["medium"])

Latency Budgets

In many production settings, latency matters more than accuracy. The best-of-N strategy with N=32N=32 parallel calls will have similar latency to N=1N=1 if you have the API capacity - but it costs 32x more. Sequential MCTS with 50 steps might take 2–3 minutes. That may be acceptable for overnight batch jobs, but not for interactive user-facing applications.

Measure: accuracy-at-latency-budget. "What's the best accuracy I can achieve in under 5 seconds for this task type?"


:::danger Common Mistake: Using Test-Time Compute for Simple Tasks Applying best-of-N or self-consistency to simple factual retrieval or short-form writing wastes money and time. Test-time compute only helps when the task has meaningful variance across samples - i.e., when the model sometimes gets it right and sometimes gets it wrong. Profile your tasks first. :::

:::warning Verifier Dependency Best-of-N's performance depends critically on verifier quality. A weak verifier (one that gives high scores to plausible-sounding but wrong answers) can actually hurt performance compared to just taking the first sample, because you end up selecting confidently wrong answers. :::

:::tip The Snell et al. (2024) Rule of Thumb Snell et al. found that test-time compute is most beneficial when the task has a "difficulty gap" - when the model's single-pass accuracy is somewhere between 20% and 80%. Below 20%, the model doesn't know enough to succeed even with many tries. Above 80%, the first try is usually right and extra compute is wasted. :::


Interview Questions and Answers

Q1: What is test-time compute and why does it matter?

Test-time compute is the practice of spending more computational resources at inference time - generating more tokens, more samples, or exploring more reasoning paths - rather than (or in addition to) training a larger model. It matters because it provides a way to trade inference cost for output quality, particularly on hard reasoning tasks. The practical implication is that you can achieve higher accuracy on difficult problems by running a smaller model multiple times, which can be more cost-effective than training a bigger model.

Q2: Explain best-of-N sampling. What is required for it to work?

Best-of-N sampling generates NN independent completions and selects the best one according to some scoring criterion. The math shows that if each sample has probability pp of being correct, the probability that at least one of NN samples is correct is 1(1p)N1 - (1-p)^N, which grows quickly with NN. For it to work, you need: (1) a reliable scoring or verification mechanism, (2) sufficient output diversity across samples (requires temperature > 0), and (3) tasks where the model can sometimes succeed but sometimes fails (not too easy, not too hard).

Q3: What is self-consistency and how does it differ from best-of-N?

Self-consistency (Wang et al., 2022) generates multiple chain-of-thought completions and takes a majority vote over final answers, without needing an external verifier. It relies on the observation that correct answers tend to converge across different reasoning paths, while wrong answers tend to disagree. Best-of-N requires a verifier (which is a separate model or rule-based system that scores answers). Self-consistency is verifier-free but requires that answers be discrete and comparable (works well for math, less well for open-ended generation).

Q4: How does test-time compute compare to pre-training compute scaling?

Pre-training compute is a one-time cost amortized across all queries - once the model is trained, every query benefits. Test-time compute is a per-query cost - you pay for each request that uses extra compute. Pre-training scaling improves performance broadly across all tasks; test-time compute scaling primarily helps on hard reasoning tasks where the model has some non-zero success rate. The key trade-off: pre-training is capital-intensive upfront; test-time compute is operationally expensive ongoing. For a high-volume consumer product, pre-training scaling is usually preferable; for low-volume, high-value reasoning tasks, test-time compute can be more practical.

Q5: In a production system, how would you decide when to use test-time compute?

A good production strategy involves: (1) Classify query difficulty - use a lightweight classifier or heuristics (query length, keywords indicating math/code/logic). (2) Set a latency budget - determine the maximum acceptable response time for this use case. (3) Profile task variance - measure how much the model's accuracy varies across single-pass samples; high variance means test-time compute helps. (4) Calculate cost/quality ROI - compare the improvement in accuracy against the increase in per-query cost, then decide if the improvement justifies the cost for your use case. (5) Implement adaptive routing - easy queries get single-pass inference, hard queries get best-of-N or a reasoning model.

Q6: What is the "difficulty sweet spot" for test-time compute?

Test-time compute is most effective when single-pass accuracy is in the 20–80% range for a given task. Below 20%, the model lacks the knowledge or capability to solve the problem reliably even with many tries - you need a better model or more training data. Above 80%, the model usually gets it right on the first try, so extra sampling is wasteful. The 20–80% range is where the model "knows enough to sometimes get it right" - and test-time compute can shift that from "sometimes" to "usually" or "almost always."

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Monte Carlo Tree Search for LLM Reasoning demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.