What is process reward model?

How process reward models provide step-level supervision for reasoning - the Lightman et al. 2023 paper, Math-Shepherd, using PRMs for search, and their limitations.

How does PRM work in practice?

Process Reward Models (PRMs) covers process reward model, PRM, outcome reward model from first principles with code examples. Free lesson at https://engineersofai.com/docs/llms/reasoning-models/process-reward-models

What is the difference between process reward model and outcome reward model?

See the full breakdown at https://engineersofai.com/docs/llms/reasoning-models/process-reward-models

Process Reward Models (PRMs)

The Teaching Problem

Picture a first-year university student attempting a statistics problem on an exam. They write 7 lines of work. The final number is wrong. The professor puts a red X at the bottom and moves on.

Now picture the same student in a study session with a good tutor. The tutor watches them work through the problem step by step. The tutor stops them at line 3: "This step is right. But look at what you wrote on line 4 - you forgot to account for the sample size correction here." The student fixes line 4, continues, and gets the right answer.

Same student. Same baseline knowledge. One learns almost nothing from the exam; the other understands exactly where their reasoning broke down.

This is the intuition behind process reward models. The "professor with a red X" is an outcome reward model: binary feedback on the final answer. The "good tutor" is a process reward model: step-by-step evaluation of the reasoning quality.

For training AI systems to reason well, the difference turns out to matter enormously.

Why This Exists - The Sparse Reward Problem

When training an LLM with reinforcement learning on reasoning tasks, the reward signal is typically: was the final answer correct? This creates what is called a sparse reward problem.

Consider a math problem that requires 8 steps. The model generates 8 steps of reasoning. If the final answer is wrong, the model receives reward 0. But which step was wrong? Was step 1 wrong (in which case steps 2–8 are all wrong, despite potentially being internally consistent)? Was step 7 wrong and steps 1–6 were all correct?

The outcome reward gives you one bit of information for the entire trajectory. This is an extremely weak training signal for long-horizon reasoning tasks. Credit assignment - figuring out which steps caused the failure - becomes nearly impossible.

The consequence: RL with pure outcome rewards learns slowly, is prone to reward hacking (finding solutions that verify as correct through lucky shortcuts), and fails to improve the quality of intermediate reasoning even as it improves final accuracy.

Process reward models address this directly by providing a reward signal at each step, turning the sparse reward problem into a dense reward problem.

Historical Context - From ORMs to PRMs

The distinction between outcome and process supervision in the context of LLMs was formalized by Lightman et al. (2023) at OpenAI: "Let's Verify Step by Step."

Prior work (Cobbe et al., 2021, "Training Verifiers to Solve Math Word Problems") had shown the value of outcome reward models (ORMs): a model trained to predict whether a complete solution is correct or not. ORMs are used at inference time to re-rank multiple solutions - generate $N$ solutions, score them with the ORM, return the highest-scored one.

The Lightman et al. contribution was to show that PRMs - which score each individual step - significantly outperform ORMs for solution re-ranking, especially on hard problems. The intuition: an ORM has to look at a potentially incorrect solution and globally assess whether it's right. A PRM evaluates each step independently and can identify exactly which step went wrong.

The key experimental setup: they hired human annotators to go through solutions to MATH competition problems, step by step, and label each step as:

Positive: the step is mathematically correct and makes progress toward the solution
Negative: the step contains a mathematical error
Neutral: the step is valid but makes little progress

They then trained a PRM on these annotations and showed it substantially outperformed ORMs on selecting correct solutions.

The Architecture of a PRM

A process reward model is a transformer that takes as input the problem statement and a partial solution (all steps up to and including the current step) and outputs a scalar score estimating the quality of the current step.

In practice, PRMs are typically initialized from the same base model as the policy (or from a slightly smaller model) and fine-tuned on step-level annotation data. The key architectural difference from a standard LLM is the scalar output head that predicts step quality.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from typing import List, Optional


class ProcessRewardModel(nn.Module):
    """
    Process Reward Model (PRM) for evaluating reasoning steps.

    Takes a problem and a partial solution (list of steps),
    returns a quality score for each step.
    """

    def __init__(
        self,
        base_model_name: str = "deepseek-ai/deepseek-math-7b-base",
        step_separator: str = "\n\nStep ",
    ):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        self.encoder = AutoModel.from_pretrained(base_model_name)
        self.step_separator = step_separator

        # Scalar scoring head
        hidden_size = self.encoder.config.hidden_size
        self.score_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, 1),
            nn.Sigmoid(),  # Output in [0, 1]
        )

    def forward(
        self,
        problem: str,
        steps: List[str],
    ) -> torch.Tensor:
        """
        Score each step in the solution.

        Args:
            problem: The math problem text
            steps: List of solution steps

        Returns:
            Tensor of shape [num_steps] with scores in [0, 1]
        """
        # Construct the full text with step separators
        full_text = f"Problem: {problem}\n\nSolution:"
        step_positions = []

        for i, step in enumerate(steps):
            full_text += f"\n\nStep {i+1}: {step}"
            # Record where each step ends (the position we'll extract score from)
            step_positions.append(len(self.tokenizer.encode(full_text)) - 1)

        # Tokenize the full solution
        inputs = self.tokenizer(
            full_text,
            return_tensors="pt",
            truncation=True,
            max_length=4096,
        )

        # Forward pass through encoder
        with torch.no_grad():
            outputs = self.encoder(**inputs)

        # Extract hidden states at step-end positions
        hidden_states = outputs.last_hidden_state.squeeze(0)  # [seq_len, hidden]

        step_scores = []
        for pos in step_positions:
            # Clamp to valid range in case of truncation
            pos = min(pos, hidden_states.shape[0] - 1)
            step_repr = hidden_states[pos]  # [hidden]
            score = self.score_head(step_repr.unsqueeze(0))  # [1, 1]
            step_scores.append(score.squeeze())

        return torch.stack(step_scores)  # [num_steps]

    def score_solution(
        self,
        problem: str,
        steps: List[str],
    ) -> dict:
        """
        Score a complete solution and return step-level and aggregate results.
        """
        with torch.no_grad():
            step_scores = self.forward(problem, steps)

        scores_list = step_scores.tolist()

        # Find first wrong step (score below threshold)
        threshold = 0.5
        first_error = None
        for i, score in enumerate(scores_list):
            if score < threshold:
                first_error = i
                break

        # Aggregate score: minimum step score (any wrong step = wrong solution)
        min_score = min(scores_list)

        # Alternatively: product of step scores (probability all steps correct)
        product_score = 1.0
        for s in scores_list:
            product_score *= s

        return {
            "step_scores": scores_list,
            "min_step_score": min_score,
            "product_score": product_score,
            "first_error_at_step": first_error,
            "predicted_correct": min_score > threshold,
        }

The Lightman et al. (2023) Experiment

The key finding from "Let's Verify Step by Step" (Lightman et al., 2023, OpenAI):

On MATH benchmark with PaLM 2 as the policy model:

ORM + best-of-N: 72.4% accuracy with N=1860 samples
PRM + best-of-N: 78.2% accuracy with the same N=1860 samples

The PRM consistently outperforms the ORM across all $N$ values. The gap is largest for hard problems - precisely where you need better verification most.

Why? An ORM looks at a potentially long, incorrect solution and has to holistically judge "is this right?" This is a hard prediction task - the solution may look fluent and plausible even if step 6 contains a subtle error. A PRM evaluates each step in context and can identify the specific step where reasoning went wrong.

The annotation cost was significant: they created a dataset of 800K step-level annotations, requiring significant human annotation effort on competition math problems. This cost is one of the key limitations of PRMs.

Math-Shepherd - Automatic PRM Training

The cost of human step-level annotation is a major bottleneck. Wang et al. (2024) "Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations" addressed this with an automatic annotation approach.

The key insight: for math problems with verifiable answers, you can estimate the quality of each intermediate step by asking "from this partial solution, what fraction of completions eventually produce the correct answer?"

The Math-Shepherd annotation procedure:

For each problem in the dataset, generate many complete solutions
For each partial solution (first $k$ steps), generate $M$ completions and check which produce the correct final answer
The "step quality" label for step $k$ is the fraction of completions from step $k$ that eventually succeed

This is Monte Carlo estimation of the value function at each reasoning step - essentially doing best-of-N sampling from each intermediate point.

from typing import List, Tuple, Callable
import random


def math_shepherd_annotate(
    problem: str,
    partial_solution: List[str],  # First k steps
    policy_model: Callable,       # Function that generates completions
    verifier: Callable,           # Function that checks if answer is correct
    n_completions: int = 16,
    max_steps: int = 10,
) -> float:
    """
    Estimate step quality using Math-Shepherd's automatic annotation.

    The quality of a partial solution (first k steps) is estimated as
    the fraction of completions from that point that eventually produce
    the correct answer.

    Args:
        problem: The math problem
        partial_solution: List of steps completed so far
        policy_model: Generates a completion given partial solution
        verifier: Checks if a complete solution is correct
        n_completions: Number of MC rollouts to estimate quality
        max_steps: Maximum steps to generate per completion

    Returns:
        Estimated step quality in [0, 1]
    """
    correct_completions = 0

    for _ in range(n_completions):
        # Generate a completion from the current partial solution
        completion_steps = policy_model.complete_from_partial(
            problem=problem,
            partial_solution=partial_solution,
            max_steps=max_steps,
        )

        # Full solution = partial + completion
        full_solution = partial_solution + completion_steps

        # Check if the final answer is correct
        final_answer = extract_final_answer(full_solution)
        if verifier(problem, final_answer):
            correct_completions += 1

    # Quality = fraction of completions that succeed
    return correct_completions / n_completions


def build_prm_dataset_with_math_shepherd(
    problems: list,
    policy_model: Callable,
    verifier: Callable,
    n_completions_per_step: int = 16,
) -> List[dict]:
    """
    Build a PRM training dataset using Math-Shepherd automatic annotation.

    For each problem, annotate the quality of each step without human labels.
    """
    dataset = []

    for problem in problems:
        # Generate a reference solution
        reference_solution = policy_model.generate(problem)
        steps = parse_steps(reference_solution)

        if not steps:
            continue

        # Annotate each step
        step_annotations = []
        for k in range(len(steps)):
            partial = steps[:k+1]
            quality = math_shepherd_annotate(
                problem=problem.question,
                partial_solution=partial,
                policy_model=policy_model,
                verifier=verifier,
                n_completions=n_completions_per_step,
            )
            step_annotations.append({
                "step_index": k,
                "step_text": steps[k],
                "quality_score": quality,
                "estimated_correct": quality > 0.5,
            })

        dataset.append({
            "problem": problem.question,
            "steps": steps,
            "step_annotations": step_annotations,
        })

    return dataset

Math-Shepherd is significant because it makes PRM training scalable without expensive human annotation. The quality of annotations is lower than human labels (humans can catch conceptual errors that MC estimation might miss), but the scale advantage more than compensates on most benchmarks.

Using PRMs for Search

The most powerful application of PRMs is as a value function for search algorithms. Instead of just re-ranking complete solutions, you can use a PRM to guide generation - selecting which reasoning step to expand next.

Best-of-N with PRM Scoring

The simplest approach: generate $N$ complete solutions, score them with the PRM, return the highest-scored one.

def prm_best_of_n(
    problem: str,
    policy_model,
    prm: ProcessRewardModel,
    n: int = 32,
    temperature: float = 0.8,
    aggregation: str = "min",  # "min" or "product"
) -> dict:
    """
    Best-of-N selection using PRM scoring.

    Aggregation strategies:
    - "min": return the solution where the minimum step score is highest
      (pessimistic: the worst step determines the solution quality)
    - "product": return the solution with highest product of step scores
      (probability that ALL steps are correct)
    """
    candidates = []

    for i in range(n):
        # Generate complete solution
        solution_text = policy_model.generate(
            problem=problem,
            temperature=temperature,
        )
        steps = parse_steps(solution_text)

        if not steps:
            continue

        # Score with PRM
        result = prm.score_solution(problem, steps)

        if aggregation == "min":
            aggregate_score = result["min_step_score"]
        elif aggregation == "product":
            aggregate_score = result["product_score"]
        else:
            raise ValueError(f"Unknown aggregation: {aggregation}")

        candidates.append({
            "solution": solution_text,
            "steps": steps,
            "step_scores": result["step_scores"],
            "aggregate_score": aggregate_score,
        })

    # Return highest-scoring solution
    best = max(candidates, key=lambda x: x["aggregate_score"])

    return {
        "best_solution": best["solution"],
        "best_score": best["aggregate_score"],
        "step_scores": best["step_scores"],
        "n_candidates": len(candidates),
    }

Beam Search with PRM

A more sophisticated approach: maintain a beam of partial solutions, expand each, score with PRM, keep top-k.

import heapq
from dataclasses import dataclass, field
from typing import Any


@dataclass(order=True)
class BeamState:
    """State in the reasoning beam search."""
    score: float
    steps: List[str] = field(compare=False)
    step_scores: List[float] = field(compare=False)

    def __post_init__(self):
        # Negate score so heapq (min-heap) works as max-heap
        self.score = -self.score


def prm_beam_search(
    problem: str,
    policy_model,
    prm: ProcessRewardModel,
    beam_size: int = 8,
    max_steps: int = 12,
    temperature: float = 0.6,
) -> dict:
    """
    Beam search over reasoning steps, guided by PRM scores.

    At each step:
    1. Expand each beam state by sampling next step from policy
    2. Score new states with PRM
    3. Keep top beam_size states

    This finds high-quality reasoning paths more efficiently than
    independent best-of-N sampling.
    """
    # Initialize beam with empty solutions
    beam = [BeamState(score=-0.0, steps=[], step_scores=[])]

    for step_idx in range(max_steps):
        candidates = []

        for state in beam:
            # Sample multiple next steps from policy
            next_step_candidates = policy_model.sample_next_step(
                problem=problem,
                completed_steps=state.steps,
                n_candidates=beam_size * 2,  # Oversample, then prune
                temperature=temperature,
            )

            for next_step in next_step_candidates:
                if next_step is None:  # Model indicated solution complete
                    # Score the complete solution
                    if state.steps:  # Don't accept empty solutions
                        candidates.append(state)
                    break

                new_steps = state.steps + [next_step]
                step_score_tensor = prm(problem, new_steps)
                latest_score = step_score_tensor[-1].item()

                # Aggregate: use min score of all steps so far
                all_scores = state.step_scores + [latest_score]
                min_score = min(all_scores)

                candidates.append(BeamState(
                    score=-min_score,  # Negate for min-heap
                    steps=new_steps,
                    step_scores=all_scores,
                ))

        # Keep top beam_size candidates
        beam = heapq.nsmallest(beam_size, candidates)

        # Check if all beams have terminated
        all_complete = all(
            policy_model.is_complete(state.steps) for state in beam
        )
        if all_complete:
            break

    # Return best solution
    best = min(beam)  # min because we negated scores
    return {
        "solution_steps": best.steps,
        "step_scores": best.step_scores,
        "aggregate_score": -best.score,
    }

Limitations of PRMs

Despite their power, PRMs have several significant limitations:

Annotation Cost

Human step-level annotation is expensive. For math, a single problem might require 8–12 step annotations from a human expert. Creating a training dataset of hundreds of thousands of such annotations (as Lightman et al. did) requires significant resources. Math-Shepherd partially addresses this but introduces noise.

Step Granularity

PRMs require defining what a "step" is. For formal math proofs, this is clear. For open-ended reasoning about complex topics, it's not. Should a paragraph be one step or five? The choice affects both training and inference.

Reward Hacking at Step Level

Models trained with process rewards can learn to game the PRM - generating steps that look good to the PRM scorer without actually making progress. This is analogous to outcome reward hacking, but harder to detect because the steps can be locally plausible while globally misleading.

Distribution Mismatch

PRMs are trained on solutions generated by a specific policy. When used to guide a different policy (or the same policy after RL training changes its behavior), the PRM scores may be miscalibrated. This is the distribution shift problem: the PRM has learned to score reasoning steps in the distribution it was trained on, and may give misleading scores for novel reasoning patterns.

def detect_reward_hacking(
    prm: ProcessRewardModel,
    solutions: List[dict],
    outcome_verifier: Callable,
) -> dict:
    """
    Check whether high PRM scores correlate with actual correctness.
    If not, the model may be reward hacking the PRM.
    """
    correct_with_high_prm = 0
    incorrect_with_high_prm = 0
    correct_with_low_prm = 0
    incorrect_with_low_prm = 0

    threshold = 0.7

    for solution in solutions:
        prm_score = solution["prm_aggregate_score"]
        actually_correct = outcome_verifier(
            solution["problem"],
            solution["final_answer"]
        )

        if prm_score >= threshold:
            if actually_correct:
                correct_with_high_prm += 1
            else:
                incorrect_with_high_prm += 1
        else:
            if actually_correct:
                correct_with_low_prm += 1
            else:
                incorrect_with_low_prm += 1

    total_high_prm = correct_with_high_prm + incorrect_with_high_prm
    precision = (
        correct_with_high_prm / total_high_prm
        if total_high_prm > 0 else 0
    )

    # High precision means PRM correctly identifies good solutions
    # Low precision means model is hacking the PRM
    return {
        "precision_at_high_prm": precision,
        "high_prm_count": total_high_prm,
        "reward_hacking_suspected": precision < 0.7,
        "correct_with_high_prm": correct_with_high_prm,
        "incorrect_with_high_prm": incorrect_with_high_prm,
    }

Production Engineering Notes

Serving PRMs Efficiently

PRMs add inference-time cost. Each candidate solution requires a PRM forward pass. For best-of-N with N=32 and 8 steps per solution, that's 256 PRM forward passes per query.

Optimizations:

Batch scoring: score all steps across all solutions in a single batched forward pass
Early termination: if a step scores below a threshold, stop scoring that solution
Smaller PRM: use a smaller model for the PRM than for the policy

def batch_prm_scoring(
    prm: ProcessRewardModel,
    solutions: List[dict],
    early_stop_threshold: float = 0.2,
    batch_size: int = 32,
) -> List[float]:
    """
    Score multiple solutions efficiently with batched PRM inference.

    Args:
        prm: The process reward model
        solutions: List of dicts with 'problem' and 'steps' keys
        early_stop_threshold: Stop scoring a solution if any step scores below this
        batch_size: Number of steps to score in parallel

    Returns:
        List of aggregate scores (one per solution)
    """
    aggregate_scores = []

    for solution in solutions:
        problem = solution["problem"]
        steps = solution["steps"]

        min_score = 1.0
        early_stopped = False

        # Score steps in sequence (each step needs previous context)
        for k in range(len(steps)):
            with torch.no_grad():
                step_scores = prm(problem, steps[:k+1])
            latest_score = step_scores[-1].item()
            min_score = min(min_score, latest_score)

            # Early termination: no point continuing if this step is wrong
            if latest_score < early_stop_threshold:
                early_stopped = True
                break

        aggregate_scores.append(min_score)

    return aggregate_scores

:::danger Common Mistake: Using PRM Scores as Ground Truth PRM scores are noisy estimates of step quality. A step with a PRM score of 0.9 is not certainly correct - the PRM has finite accuracy, especially on reasoning patterns outside its training distribution. Always maintain a separate outcome verifier as the final arbiter of correctness. Use PRM scores for ranking and search guidance, not as definitive quality labels. :::

:::warning The Step Boundary Problem The quality of a PRM depends heavily on how steps are segmented. If your solution format uses very long steps (multiple sub-computations per step), the PRM loses resolution - a step might be partially correct in ways the PRM can't distinguish. If steps are too short (one arithmetic operation each), the context for each step becomes very thin. Aim for natural semantic units: one logical inference per step. :::

:::tip Math-Shepherd vs. Human Annotation For building your own PRM, Math-Shepherd's automatic annotation is the right starting point - it's scalable and surprisingly effective. Use it to build a large training set, then optionally add a smaller human-annotated validation set to calibrate the PRM's scores. The combination gives you the scale of automatic annotation with the reliability verification of human labels. :::

Interview Questions and Answers

Q1: What is a process reward model and how does it differ from an outcome reward model?

A process reward model (PRM) scores individual reasoning steps, while an outcome reward model (ORM) scores complete solutions. A PRM takes the problem and all steps up to step $k$ as input and outputs a quality score for step $k$ . An ORM takes the complete solution and outputs a binary correct/incorrect prediction. PRMs provide dense, step-level training signal; ORMs provide sparse, solution-level signal. Lightman et al. (2023) showed PRMs substantially outperform ORMs for solution re-ranking, especially on hard math problems, because they can identify exactly which step went wrong rather than just knowing that the final answer was wrong.

Q2: Why is the sparse reward problem particularly severe for reasoning tasks?

In reasoning tasks, the reward only comes at the end: did the model get the right final answer? For a solution with 10 steps, the model receives one binary signal for what may be 500+ tokens of generation. Credit assignment - determining which specific tokens or steps caused the success or failure - is extremely difficult with such sparse signal. The model might get lucky on some problems, unlucky on others, and the RL training can't identify which specific reasoning patterns led to success. Dense process rewards, one per step, dramatically improve credit assignment.

Q3: What is Math-Shepherd and why was it significant?

Math-Shepherd (Wang et al., 2024) is an automatic PRM annotation method that eliminates the need for expensive human step-level annotations. The key idea: estimate the quality of a partial solution (first $k$ steps) by sampling many completions from that point and measuring the fraction that produce the correct final answer. This is Monte Carlo estimation of the value function at each step. Significance: it makes PRM training scalable from thousands of human-annotated examples (Lightman et al.) to millions of automatically annotated examples, enabling PRMs to generalize to a much wider range of reasoning styles and problem types.

Q4: How would you use a PRM to improve inference-time accuracy?

The primary use is solution re-ranking: generate $N$ solutions with the policy, score each with the PRM, return the highest-scored solution. This is significantly better than ORM-based re-ranking. A more sophisticated use is beam search: maintain $k$ partial solutions at each step, expand each by generating candidate next steps, score with the PRM, keep top-k. This explores the reasoning tree more efficiently than independent sampling. For maximum accuracy on the hardest problems, combine PRM-guided beam search with outcome verification: use beam search to identify promising paths, then verify the top candidate's final answer with an external checker.

Q5: What are the main failure modes of PRMs in production?

Three main failure modes: (1) Reward hacking - the policy learns to generate steps that score well on the PRM but don't actually advance the solution. Detection: check whether PRM-high-scored solutions actually have higher outcome accuracy. (2) Distribution shift - if the policy changes significantly from what the PRM was trained on (through additional RL training), PRM scores become miscalibrated. Mitigation: periodically retrain the PRM on samples from the current policy. (3) Step boundary sensitivity - PRM scores vary significantly depending on how steps are segmented. A single step containing an error plus subsequent work may score differently than if the error step and subsequent steps are separate. Mitigation: enforce consistent step segmentation in both training and inference.

Q6: Compare PRM-guided best-of-N to self-consistency (majority voting) - when would you choose each?

Self-consistency (majority voting) requires no verifier or reward model - it just needs multiple samples and discrete comparable outputs. It's simple to implement and effective when the correct answer is one of a finite set of discrete values (numbers, categories, code). PRM-guided best-of-N requires a trained PRM but provides much richer guidance - it can identify the best solution even when all solutions agree on a wrong answer (which self-consistency can't detect). Choose self-consistency when: no verifier available, answers are discrete and comparable, cost of training a PRM is prohibitive. Choose PRM when: you have a trained PRM or can build one, maximum accuracy is critical, solutions might be complex enough that simple answer comparison misses quality differences.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the RLHF Pipeline demo on the EngineersOfAI Playground - no code required.

:::

The Teaching Problem​

Why This Exists - The Sparse Reward Problem​

Historical Context - From ORMs to PRMs​

The Architecture of a PRM​

The Lightman et al. (2023) Experiment​

Math-Shepherd - Automatic PRM Training​

Using PRMs for Search​

Best-of-N with PRM Scoring​

Beam Search with PRM​

Limitations of PRMs​

Annotation Cost​

Step Granularity​

Reward Hacking at Step Level​

Distribution Mismatch​

Production Engineering Notes​

Serving PRMs Efficiently​

Interview Questions and Answers​