Process Reward Models (PRMs)
The Teaching Problem
Picture a first-year university student attempting a statistics problem on an exam. They write 7 lines of work. The final number is wrong. The professor puts a red X at the bottom and moves on.
Now picture the same student in a study session with a good tutor. The tutor watches them work through the problem step by step. The tutor stops them at line 3: "This step is right. But look at what you wrote on line 4 - you forgot to account for the sample size correction here." The student fixes line 4, continues, and gets the right answer.
Same student. Same baseline knowledge. One learns almost nothing from the exam; the other understands exactly where their reasoning broke down.
This is the intuition behind process reward models. The "professor with a red X" is an outcome reward model: binary feedback on the final answer. The "good tutor" is a process reward model: step-by-step evaluation of the reasoning quality.
For training AI systems to reason well, the difference turns out to matter enormously.
Why This Exists - The Sparse Reward Problem
When training an LLM with reinforcement learning on reasoning tasks, the reward signal is typically: was the final answer correct? This creates what is called a sparse reward problem.
Consider a math problem that requires 8 steps. The model generates 8 steps of reasoning. If the final answer is wrong, the model receives reward 0. But which step was wrong? Was step 1 wrong (in which case steps 2–8 are all wrong, despite potentially being internally consistent)? Was step 7 wrong and steps 1–6 were all correct?
The outcome reward gives you one bit of information for the entire trajectory. This is an extremely weak training signal for long-horizon reasoning tasks. Credit assignment - figuring out which steps caused the failure - becomes nearly impossible.
The consequence: RL with pure outcome rewards learns slowly, is prone to reward hacking (finding solutions that verify as correct through lucky shortcuts), and fails to improve the quality of intermediate reasoning even as it improves final accuracy.
Process reward models address this directly by providing a reward signal at each step, turning the sparse reward problem into a dense reward problem.
Historical Context - From ORMs to PRMs
The distinction between outcome and process supervision in the context of LLMs was formalized by Lightman et al. (2023) at OpenAI: "Let's Verify Step by Step."
Prior work (Cobbe et al., 2021, "Training Verifiers to Solve Math Word Problems") had shown the value of outcome reward models (ORMs): a model trained to predict whether a complete solution is correct or not. ORMs are used at inference time to re-rank multiple solutions - generate solutions, score them with the ORM, return the highest-scored one.
The Lightman et al. contribution was to show that PRMs - which score each individual step - significantly outperform ORMs for solution re-ranking, especially on hard problems. The intuition: an ORM has to look at a potentially incorrect solution and globally assess whether it's right. A PRM evaluates each step independently and can identify exactly which step went wrong.
The key experimental setup: they hired human annotators to go through solutions to MATH competition problems, step by step, and label each step as:
- Positive: the step is mathematically correct and makes progress toward the solution
- Negative: the step contains a mathematical error
- Neutral: the step is valid but makes little progress
They then trained a PRM on these annotations and showed it substantially outperformed ORMs on selecting correct solutions.
The Architecture of a PRM
A process reward model is a transformer that takes as input the problem statement and a partial solution (all steps up to and including the current step) and outputs a scalar score estimating the quality of the current step.
In practice, PRMs are typically initialized from the same base model as the policy (or from a slightly smaller model) and fine-tuned on step-level annotation data. The key architectural difference from a standard LLM is the scalar output head that predicts step quality.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from typing import List, Optional
class ProcessRewardModel(nn.Module):
"""
Process Reward Model (PRM) for evaluating reasoning steps.
Takes a problem and a partial solution (list of steps),
returns a quality score for each step.
"""
def __init__(
self,
base_model_name: str = "deepseek-ai/deepseek-math-7b-base",
step_separator: str = "\n\nStep ",
):
super().__init__()
self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
self.encoder = AutoModel.from_pretrained(base_model_name)
self.step_separator = step_separator
# Scalar scoring head
hidden_size = self.encoder.config.hidden_size
self.score_head = nn.Sequential(
nn.Linear(hidden_size, hidden_size // 4),
nn.GELU(),
nn.Linear(hidden_size // 4, 1),
nn.Sigmoid(), # Output in [0, 1]
)
def forward(
self,
problem: str,
steps: List[str],
) -> torch.Tensor:
"""
Score each step in the solution.
Args:
problem: The math problem text
steps: List of solution steps
Returns:
Tensor of shape [num_steps] with scores in [0, 1]
"""
# Construct the full text with step separators
full_text = f"Problem: {problem}\n\nSolution:"
step_positions = []
for i, step in enumerate(steps):
full_text += f"\n\nStep {i+1}: {step}"
# Record where each step ends (the position we'll extract score from)
step_positions.append(len(self.tokenizer.encode(full_text)) - 1)
# Tokenize the full solution
inputs = self.tokenizer(
full_text,
return_tensors="pt",
truncation=True,
max_length=4096,
)
# Forward pass through encoder
with torch.no_grad():
outputs = self.encoder(**inputs)
# Extract hidden states at step-end positions
hidden_states = outputs.last_hidden_state.squeeze(0) # [seq_len, hidden]
step_scores = []
for pos in step_positions:
# Clamp to valid range in case of truncation
pos = min(pos, hidden_states.shape[0] - 1)
step_repr = hidden_states[pos] # [hidden]
score = self.score_head(step_repr.unsqueeze(0)) # [1, 1]
step_scores.append(score.squeeze())
return torch.stack(step_scores) # [num_steps]
def score_solution(
self,
problem: str,
steps: List[str],
) -> dict:
"""
Score a complete solution and return step-level and aggregate results.
"""
with torch.no_grad():
step_scores = self.forward(problem, steps)
scores_list = step_scores.tolist()
# Find first wrong step (score below threshold)
threshold = 0.5
first_error = None
for i, score in enumerate(scores_list):
if score < threshold:
first_error = i
break
# Aggregate score: minimum step score (any wrong step = wrong solution)
min_score = min(scores_list)
# Alternatively: product of step scores (probability all steps correct)
product_score = 1.0
for s in scores_list:
product_score *= s
return {
"step_scores": scores_list,
"min_step_score": min_score,
"product_score": product_score,
"first_error_at_step": first_error,
"predicted_correct": min_score > threshold,
}
The Lightman et al. (2023) Experiment
The key finding from "Let's Verify Step by Step" (Lightman et al., 2023, OpenAI):
On MATH benchmark with PaLM 2 as the policy model:
- ORM + best-of-N: 72.4% accuracy with N=1860 samples
- PRM + best-of-N: 78.2% accuracy with the same N=1860 samples
The PRM consistently outperforms the ORM across all values. The gap is largest for hard problems - precisely where you need better verification most.
Why? An ORM looks at a potentially long, incorrect solution and has to holistically judge "is this right?" This is a hard prediction task - the solution may look fluent and plausible even if step 6 contains a subtle error. A PRM evaluates each step in context and can identify the specific step where reasoning went wrong.
The annotation cost was significant: they created a dataset of 800K step-level annotations, requiring significant human annotation effort on competition math problems. This cost is one of the key limitations of PRMs.
Math-Shepherd - Automatic PRM Training
The cost of human step-level annotation is a major bottleneck. Wang et al. (2024) "Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations" addressed this with an automatic annotation approach.
The key insight: for math problems with verifiable answers, you can estimate the quality of each intermediate step by asking "from this partial solution, what fraction of completions eventually produce the correct answer?"
The Math-Shepherd annotation procedure:
- For each problem in the dataset, generate many complete solutions
- For each partial solution (first steps), generate completions and check which produce the correct final answer
- The "step quality" label for step is the fraction of completions from step that eventually succeed
This is Monte Carlo estimation of the value function at each reasoning step - essentially doing best-of-N sampling from each intermediate point.
from typing import List, Tuple, Callable
import random
def math_shepherd_annotate(
problem: str,
partial_solution: List[str], # First k steps
policy_model: Callable, # Function that generates completions
verifier: Callable, # Function that checks if answer is correct
n_completions: int = 16,
max_steps: int = 10,
) -> float:
"""
Estimate step quality using Math-Shepherd's automatic annotation.
The quality of a partial solution (first k steps) is estimated as
the fraction of completions from that point that eventually produce
the correct answer.
Args:
problem: The math problem
partial_solution: List of steps completed so far
policy_model: Generates a completion given partial solution
verifier: Checks if a complete solution is correct
n_completions: Number of MC rollouts to estimate quality
max_steps: Maximum steps to generate per completion
Returns:
Estimated step quality in [0, 1]
"""
correct_completions = 0
for _ in range(n_completions):
# Generate a completion from the current partial solution
completion_steps = policy_model.complete_from_partial(
problem=problem,
partial_solution=partial_solution,
max_steps=max_steps,
)
# Full solution = partial + completion
full_solution = partial_solution + completion_steps
# Check if the final answer is correct
final_answer = extract_final_answer(full_solution)
if verifier(problem, final_answer):
correct_completions += 1
# Quality = fraction of completions that succeed
return correct_completions / n_completions
def build_prm_dataset_with_math_shepherd(
problems: list,
policy_model: Callable,
verifier: Callable,
n_completions_per_step: int = 16,
) -> List[dict]:
"""
Build a PRM training dataset using Math-Shepherd automatic annotation.
For each problem, annotate the quality of each step without human labels.
"""
dataset = []
for problem in problems:
# Generate a reference solution
reference_solution = policy_model.generate(problem)
steps = parse_steps(reference_solution)
if not steps:
continue
# Annotate each step
step_annotations = []
for k in range(len(steps)):
partial = steps[:k+1]
quality = math_shepherd_annotate(
problem=problem.question,
partial_solution=partial,
policy_model=policy_model,
verifier=verifier,
n_completions=n_completions_per_step,
)
step_annotations.append({
"step_index": k,
"step_text": steps[k],
"quality_score": quality,
"estimated_correct": quality > 0.5,
})
dataset.append({
"problem": problem.question,
"steps": steps,
"step_annotations": step_annotations,
})
return dataset
Math-Shepherd is significant because it makes PRM training scalable without expensive human annotation. The quality of annotations is lower than human labels (humans can catch conceptual errors that MC estimation might miss), but the scale advantage more than compensates on most benchmarks.
Using PRMs for Search
The most powerful application of PRMs is as a value function for search algorithms. Instead of just re-ranking complete solutions, you can use a PRM to guide generation - selecting which reasoning step to expand next.
Best-of-N with PRM Scoring
The simplest approach: generate complete solutions, score them with the PRM, return the highest-scored one.
def prm_best_of_n(
problem: str,
policy_model,
prm: ProcessRewardModel,
n: int = 32,
temperature: float = 0.8,
aggregation: str = "min", # "min" or "product"
) -> dict:
"""
Best-of-N selection using PRM scoring.
Aggregation strategies:
- "min": return the solution where the minimum step score is highest
(pessimistic: the worst step determines the solution quality)
- "product": return the solution with highest product of step scores
(probability that ALL steps are correct)
"""
candidates = []
for i in range(n):
# Generate complete solution
solution_text = policy_model.generate(
problem=problem,
temperature=temperature,
)
steps = parse_steps(solution_text)
if not steps:
continue
# Score with PRM
result = prm.score_solution(problem, steps)
if aggregation == "min":
aggregate_score = result["min_step_score"]
elif aggregation == "product":
aggregate_score = result["product_score"]
else:
raise ValueError(f"Unknown aggregation: {aggregation}")
candidates.append({
"solution": solution_text,
"steps": steps,
"step_scores": result["step_scores"],
"aggregate_score": aggregate_score,
})
# Return highest-scoring solution
best = max(candidates, key=lambda x: x["aggregate_score"])
return {
"best_solution": best["solution"],
"best_score": best["aggregate_score"],
"step_scores": best["step_scores"],
"n_candidates": len(candidates),
}
Beam Search with PRM
A more sophisticated approach: maintain a beam of partial solutions, expand each, score with PRM, keep top-k.
import heapq
from dataclasses import dataclass, field
from typing import Any
@dataclass(order=True)
class BeamState:
"""State in the reasoning beam search."""
score: float
steps: List[str] = field(compare=False)
step_scores: List[float] = field(compare=False)
def __post_init__(self):
# Negate score so heapq (min-heap) works as max-heap
self.score = -self.score
def prm_beam_search(
problem: str,
policy_model,
prm: ProcessRewardModel,
beam_size: int = 8,
max_steps: int = 12,
temperature: float = 0.6,
) -> dict:
"""
Beam search over reasoning steps, guided by PRM scores.
At each step:
1. Expand each beam state by sampling next step from policy
2. Score new states with PRM
3. Keep top beam_size states
This finds high-quality reasoning paths more efficiently than
independent best-of-N sampling.
"""
# Initialize beam with empty solutions
beam = [BeamState(score=-0.0, steps=[], step_scores=[])]
for step_idx in range(max_steps):
candidates = []
for state in beam:
# Sample multiple next steps from policy
next_step_candidates = policy_model.sample_next_step(
problem=problem,
completed_steps=state.steps,
n_candidates=beam_size * 2, # Oversample, then prune
temperature=temperature,
)
for next_step in next_step_candidates:
if next_step is None: # Model indicated solution complete
# Score the complete solution
if state.steps: # Don't accept empty solutions
candidates.append(state)
break
new_steps = state.steps + [next_step]
step_score_tensor = prm(problem, new_steps)
latest_score = step_score_tensor[-1].item()
# Aggregate: use min score of all steps so far
all_scores = state.step_scores + [latest_score]
min_score = min(all_scores)
candidates.append(BeamState(
score=-min_score, # Negate for min-heap
steps=new_steps,
step_scores=all_scores,
))
# Keep top beam_size candidates
beam = heapq.nsmallest(beam_size, candidates)
# Check if all beams have terminated
all_complete = all(
policy_model.is_complete(state.steps) for state in beam
)
if all_complete:
break
# Return best solution
best = min(beam) # min because we negated scores
return {
"solution_steps": best.steps,
"step_scores": best.step_scores,
"aggregate_score": -best.score,
}
Limitations of PRMs
Despite their power, PRMs have several significant limitations:
Annotation Cost
Human step-level annotation is expensive. For math, a single problem might require 8–12 step annotations from a human expert. Creating a training dataset of hundreds of thousands of such annotations (as Lightman et al. did) requires significant resources. Math-Shepherd partially addresses this but introduces noise.
Step Granularity
PRMs require defining what a "step" is. For formal math proofs, this is clear. For open-ended reasoning about complex topics, it's not. Should a paragraph be one step or five? The choice affects both training and inference.
Reward Hacking at Step Level
Models trained with process rewards can learn to game the PRM - generating steps that look good to the PRM scorer without actually making progress. This is analogous to outcome reward hacking, but harder to detect because the steps can be locally plausible while globally misleading.
Distribution Mismatch
PRMs are trained on solutions generated by a specific policy. When used to guide a different policy (or the same policy after RL training changes its behavior), the PRM scores may be miscalibrated. This is the distribution shift problem: the PRM has learned to score reasoning steps in the distribution it was trained on, and may give misleading scores for novel reasoning patterns.
def detect_reward_hacking(
prm: ProcessRewardModel,
solutions: List[dict],
outcome_verifier: Callable,
) -> dict:
"""
Check whether high PRM scores correlate with actual correctness.
If not, the model may be reward hacking the PRM.
"""
correct_with_high_prm = 0
incorrect_with_high_prm = 0
correct_with_low_prm = 0
incorrect_with_low_prm = 0
threshold = 0.7
for solution in solutions:
prm_score = solution["prm_aggregate_score"]
actually_correct = outcome_verifier(
solution["problem"],
solution["final_answer"]
)
if prm_score >= threshold:
if actually_correct:
correct_with_high_prm += 1
else:
incorrect_with_high_prm += 1
else:
if actually_correct:
correct_with_low_prm += 1
else:
incorrect_with_low_prm += 1
total_high_prm = correct_with_high_prm + incorrect_with_high_prm
precision = (
correct_with_high_prm / total_high_prm
if total_high_prm > 0 else 0
)
# High precision means PRM correctly identifies good solutions
# Low precision means model is hacking the PRM
return {
"precision_at_high_prm": precision,
"high_prm_count": total_high_prm,
"reward_hacking_suspected": precision < 0.7,
"correct_with_high_prm": correct_with_high_prm,
"incorrect_with_high_prm": incorrect_with_high_prm,
}
Production Engineering Notes
Serving PRMs Efficiently
PRMs add inference-time cost. Each candidate solution requires a PRM forward pass. For best-of-N with N=32 and 8 steps per solution, that's 256 PRM forward passes per query.
Optimizations:
- Batch scoring: score all steps across all solutions in a single batched forward pass
- Early termination: if a step scores below a threshold, stop scoring that solution
- Smaller PRM: use a smaller model for the PRM than for the policy
def batch_prm_scoring(
prm: ProcessRewardModel,
solutions: List[dict],
early_stop_threshold: float = 0.2,
batch_size: int = 32,
) -> List[float]:
"""
Score multiple solutions efficiently with batched PRM inference.
Args:
prm: The process reward model
solutions: List of dicts with 'problem' and 'steps' keys
early_stop_threshold: Stop scoring a solution if any step scores below this
batch_size: Number of steps to score in parallel
Returns:
List of aggregate scores (one per solution)
"""
aggregate_scores = []
for solution in solutions:
problem = solution["problem"]
steps = solution["steps"]
min_score = 1.0
early_stopped = False
# Score steps in sequence (each step needs previous context)
for k in range(len(steps)):
with torch.no_grad():
step_scores = prm(problem, steps[:k+1])
latest_score = step_scores[-1].item()
min_score = min(min_score, latest_score)
# Early termination: no point continuing if this step is wrong
if latest_score < early_stop_threshold:
early_stopped = True
break
aggregate_scores.append(min_score)
return aggregate_scores
:::danger Common Mistake: Using PRM Scores as Ground Truth PRM scores are noisy estimates of step quality. A step with a PRM score of 0.9 is not certainly correct - the PRM has finite accuracy, especially on reasoning patterns outside its training distribution. Always maintain a separate outcome verifier as the final arbiter of correctness. Use PRM scores for ranking and search guidance, not as definitive quality labels. :::
:::warning The Step Boundary Problem The quality of a PRM depends heavily on how steps are segmented. If your solution format uses very long steps (multiple sub-computations per step), the PRM loses resolution - a step might be partially correct in ways the PRM can't distinguish. If steps are too short (one arithmetic operation each), the context for each step becomes very thin. Aim for natural semantic units: one logical inference per step. :::
:::tip Math-Shepherd vs. Human Annotation For building your own PRM, Math-Shepherd's automatic annotation is the right starting point - it's scalable and surprisingly effective. Use it to build a large training set, then optionally add a smaller human-annotated validation set to calibrate the PRM's scores. The combination gives you the scale of automatic annotation with the reliability verification of human labels. :::
Interview Questions and Answers
Q1: What is a process reward model and how does it differ from an outcome reward model?
A process reward model (PRM) scores individual reasoning steps, while an outcome reward model (ORM) scores complete solutions. A PRM takes the problem and all steps up to step as input and outputs a quality score for step . An ORM takes the complete solution and outputs a binary correct/incorrect prediction. PRMs provide dense, step-level training signal; ORMs provide sparse, solution-level signal. Lightman et al. (2023) showed PRMs substantially outperform ORMs for solution re-ranking, especially on hard math problems, because they can identify exactly which step went wrong rather than just knowing that the final answer was wrong.
Q2: Why is the sparse reward problem particularly severe for reasoning tasks?
In reasoning tasks, the reward only comes at the end: did the model get the right final answer? For a solution with 10 steps, the model receives one binary signal for what may be 500+ tokens of generation. Credit assignment - determining which specific tokens or steps caused the success or failure - is extremely difficult with such sparse signal. The model might get lucky on some problems, unlucky on others, and the RL training can't identify which specific reasoning patterns led to success. Dense process rewards, one per step, dramatically improve credit assignment.
Q3: What is Math-Shepherd and why was it significant?
Math-Shepherd (Wang et al., 2024) is an automatic PRM annotation method that eliminates the need for expensive human step-level annotations. The key idea: estimate the quality of a partial solution (first steps) by sampling many completions from that point and measuring the fraction that produce the correct final answer. This is Monte Carlo estimation of the value function at each step. Significance: it makes PRM training scalable from thousands of human-annotated examples (Lightman et al.) to millions of automatically annotated examples, enabling PRMs to generalize to a much wider range of reasoning styles and problem types.
Q4: How would you use a PRM to improve inference-time accuracy?
The primary use is solution re-ranking: generate solutions with the policy, score each with the PRM, return the highest-scored solution. This is significantly better than ORM-based re-ranking. A more sophisticated use is beam search: maintain partial solutions at each step, expand each by generating candidate next steps, score with the PRM, keep top-k. This explores the reasoning tree more efficiently than independent sampling. For maximum accuracy on the hardest problems, combine PRM-guided beam search with outcome verification: use beam search to identify promising paths, then verify the top candidate's final answer with an external checker.
Q5: What are the main failure modes of PRMs in production?
Three main failure modes: (1) Reward hacking - the policy learns to generate steps that score well on the PRM but don't actually advance the solution. Detection: check whether PRM-high-scored solutions actually have higher outcome accuracy. (2) Distribution shift - if the policy changes significantly from what the PRM was trained on (through additional RL training), PRM scores become miscalibrated. Mitigation: periodically retrain the PRM on samples from the current policy. (3) Step boundary sensitivity - PRM scores vary significantly depending on how steps are segmented. A single step containing an error plus subsequent work may score differently than if the error step and subsequent steps are separate. Mitigation: enforce consistent step segmentation in both training and inference.
Q6: Compare PRM-guided best-of-N to self-consistency (majority voting) - when would you choose each?
Self-consistency (majority voting) requires no verifier or reward model - it just needs multiple samples and discrete comparable outputs. It's simple to implement and effective when the correct answer is one of a finite set of discrete values (numbers, categories, code). PRM-guided best-of-N requires a trained PRM but provides much richer guidance - it can identify the best solution even when all solutions agree on a wrong answer (which self-consistency can't detect). Choose self-consistency when: no verifier available, answers are discrete and comparable, cost of training a PRM is prohibitive. Choose PRM when: you have a trained PRM or can build one, maximum accuracy is critical, solutions might be complex enough that simple answer comparison misses quality differences.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the RLHF Pipeline demo on the EngineersOfAI Playground - no code required.
:::
