

Human Feedback Collection

The Preference Data That Trained Harmful Outputs

In late 2022, a team building a commercial customer service LLM collected 50,000 preference annotations using a simple interface: show annotators two responses, ask which one they prefer, record the answer. The process ran for four months, cost $200,000 in contractor time, and produced what looked like a clean dataset. The reward model trained on it produced a fine-tuned model that, in A/B testing, received significantly higher user ratings than the previous version.

Six months post-deployment, the team noticed a pattern in escalated customer complaints. The model tended to make overconfident claims about product capabilities - claiming features existed that did not, and stating incorrect warranty information with high certainty. Annotators had preferred the confident responses because confidence felt more helpful. But confident incorrectness was worse than acknowledged uncertainty.

The audit found two root causes. The first was position bias. The annotation interface always showed "Response A" first, and studies of position bias in preference annotation have reported annotators choosing the first-shown response 55-65% of the time regardless of quality. In the team's dataset, "Response A preferred" appeared 61% of the time - across 50,000 annotations, far too large a skew to be sampling noise. The second was confidence bias layered on top: annotators preferred responses that sounded more certain, regardless of accuracy.

The reward model had learned to optimize for a proxy of human preference that was corrupted by two systematic biases. The fine-tuned model performed well on the biased human preference metric while degrading on the actual quality dimensions that mattered. The team had to discard the four months of preference data, redesign the annotation interface, and start over with position randomization and blinded accuracy assessment.

Human feedback collection is more subtle than it appears. The interface design, the annotator instructions, the aggregation method, and the quality controls all encode assumptions that directly shape the reward model - which shapes the behavior of every model trained on it. Getting this wrong at scale is expensive and slow to detect.
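The 61% figure in the story above can be checked with a few lines of arithmetic. The sketch below is a minimal one-sample z-test against a fair 50/50 split; the function name `position_bias_z_score` is hypothetical, and it assumes all 50,000 annotations were decisive (no ties), which the story does not state.

```python
import math


def position_bias_z_score(a_preferred: int, total: int) -> float:
    """z-score of the observed A-preferred count against a fair 50/50 split."""
    if total <= 0:
        raise ValueError("total must be positive")
    observed_rate = a_preferred / total
    # Standard error of a proportion under the null p = 0.5: sqrt(p * (1 - p) / n)
    standard_error = math.sqrt(0.25 / total)
    return (observed_rate - 0.5) / standard_error


# Assumption: 61% of 50,000 annotations means 30,500 A-preferred votes.
z = position_bias_z_score(a_preferred=30_500, total=50_000)
print(f"z = {z:.1f}")  # z = 49.2
```

A z-score near 49 rules out chance, but it does not by itself separate position bias from a genuinely stronger response in the A slot. That separation only becomes possible once positions are randomized.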


The RLHF Feedback Pipeline Architecture

Each stage encodes quality decisions that compound through the pipeline. An error in prompt sampling (biased toward simple queries) produces a reward model that does not generalize to complex cases. An interface design flaw (consistent position bias) corrupts all downstream training. A poor aggregation strategy (ignoring annotator quality differences) averages in noise. Design each stage explicitly.


Generating Response Pairs for Comparison

The quality of the comparison dataset depends heavily on which responses you compare. Comparing a very good response against a clearly bad one produces easy annotation tasks that provide little signal - a reward model trained on easy comparisons does not generalize to nuanced distinctions. The most informative comparisons are between responses that differ on the specific dimensions you care about.

import anthropic
import json
import uuid
import time
import random
from dataclasses import dataclass, field
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


@dataclass
class ResponsePair:
    """A pair of responses to the same prompt for human comparison."""
    pair_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    prompt: str = ""
    response_a: str = ""
    response_b: str = ""

    # Hidden from annotators - for analysis only
    model_a: str = ""
    model_b: str = ""
    system_a: str = ""
    system_b: str = ""
    generation_config_a: dict = field(default_factory=dict)
    generation_config_b: dict = field(default_factory=dict)

    # Metadata
    prompt_category: str = ""  # helpfulness, safety, honesty, coding, etc.
    difficulty_estimate: str = "medium"  # easy, medium, hard
    created_at: float = field(default_factory=time.time)

    # Position randomization tracking (never shown to annotators)
    original_a_is_response_a: bool = True  # After randomization, which original is shown as A


class ResponsePairGenerator:
    """
    Generates response pairs designed to produce informative preference signals.

    Comparison strategies:
    1. Same model, different system prompts - measures prompt sensitivity
    2. Different models, same prompt - measures model quality differences
    3. Same model, different temperatures - measures generation stability
    4. Original vs. edited - measures impact of specific changes

    Each strategy illuminates different aspects of model behavior.
    Mixing strategies in the dataset produces a more robust reward model.
    """

    def __init__(self):
        self.client = anthropic.Anthropic()

    def generate_pair_same_model_different_systems(
        self,
        prompt: str,
        system_a: str,
        system_b: str,
        model: str = "claude-opus-4-6",
        prompt_category: str = "general",
    ) -> ResponsePair:
        """
        Compare two system prompts on the same model.
        Useful for: measuring impact of instruction style, safety guardrails, etc.
        """
        def call(system):
            return self.client.messages.create(
                model=model,
                max_tokens=1024,
                system=system,
                messages=[{"role": "user", "content": prompt}]
            ).content[0].text

        with ThreadPoolExecutor(max_workers=2) as executor:
            future_a = executor.submit(call, system_a)
            future_b = executor.submit(call, system_b)
            resp_a = future_a.result()
            resp_b = future_b.result()

        # Randomize which response is shown as A vs B
        if random.random() > 0.5:
            resp_a, resp_b = resp_b, resp_a
            system_a, system_b = system_b, system_a
            original_a_is_a = False
        else:
            original_a_is_a = True

        return ResponsePair(
            prompt=prompt,
            response_a=resp_a,
            response_b=resp_b,
            model_a=model,
            model_b=model,
            system_a=system_a,
            system_b=system_b,
            prompt_category=prompt_category,
            original_a_is_response_a=original_a_is_a,
        )

    def generate_pair_different_models(
        self,
        prompt: str,
        model_a: str,
        model_b: str,
        system: str = "You are a helpful assistant.",
        prompt_category: str = "general",
    ) -> ResponsePair:
        """Compare two different models on the same prompt."""
        def call(model):
            return self.client.messages.create(
                model=model,
                max_tokens=1024,
                system=system,
                messages=[{"role": "user", "content": prompt}]
            ).content[0].text

        with ThreadPoolExecutor(max_workers=2) as executor:
            future_a = executor.submit(call, model_a)
            future_b = executor.submit(call, model_b)
            resp_a = future_a.result()
            resp_b = future_b.result()

        # Randomize position assignment
        if random.random() > 0.5:
            resp_a, resp_b = resp_b, resp_a
            model_a, model_b = model_b, model_a
            original_a_is_a = False
        else:
            original_a_is_a = True

        return ResponsePair(
            prompt=prompt,
            response_a=resp_a,
            response_b=resp_b,
            model_a=model_a,
            model_b=model_b,
            system_a=system,
            system_b=system,
            prompt_category=prompt_category,
            original_a_is_response_a=original_a_is_a,
        )

    def generate_pair_quality_contrast(
        self,
        prompt: str,
        model: str = "claude-opus-4-6",
        prompt_category: str = "general",
    ) -> ResponsePair:
        """
        Generate high-quality vs. degraded response pair.
        Useful for creating easy-to-label calibration tasks.

        The degraded response is generated by prompting the model
        to provide a less helpful version, which is cheaper than
        manually writing bad examples.
        """
        quality_system = "You are an exceptionally helpful, accurate, and concise assistant."
        degraded_system = (
            "You are an assistant. Provide overly verbose responses. "
            "Hedge everything. Avoid direct answers. Add unnecessary disclaimers."
        )

        return self.generate_pair_same_model_different_systems(
            prompt=prompt,
            system_a=quality_system,
            system_b=degraded_system,
            model=model,
            prompt_category=prompt_category,
        )
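The generator methods above need API access, so the randomization step is hard to test in isolation. The sketch below pulls the coin flip into a standalone helper (the name `randomize_position` is hypothetical) and checks that, over many pairs, each source lands in the A slot about half the time. The seed is there only to make the check reproducible.

```python
import random


def randomize_position(first: str, second: str, rng: random.Random) -> tuple[str, str, bool]:
    """Return (shown_as_a, shown_as_b, first_is_a) using a fair coin flip."""
    if rng.random() < 0.5:
        return first, second, True
    return second, first, False


rng = random.Random(0)  # seeded only so the balance check is reproducible
flags = [randomize_position("baseline", "candidate", rng)[2] for _ in range(10_000)]
a_slot_rate = sum(flags) / len(flags)
print(f"baseline shown as A in {a_slot_rate:.1%} of pairs")
```

Passing the random generator in as an argument, rather than calling the module-level `random.random()`, makes the assignment auditable: the same seed reproduces the same A/B layout.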

The Preference Annotation Interface

The annotation interface design directly affects label quality. Small decisions - whether to show confidence scores, whether to require written rationale, how to present the responses - have large effects on the resulting preference data quality.

from dataclasses import dataclass, field
import time
import uuid
from typing import Optional


@dataclass
class PreferenceFeedback:
    """
    A single human preference annotation.

    Required fields reflect minimum quality data.
    Optional fields improve signal quality when collected.
    """
    feedback_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    pair_id: str = ""
    annotator_id: str = ""

    # Core preference signal
    preferred: str = ""  # "A", "B", or "tie"
    confidence: int = 3  # 1-5 (1=guessing, 5=certain)

    # Why this preference? (multi-select)
    preferred_aspects: list[str] = field(default_factory=list)
    # Options: "more_accurate", "more_helpful", "better_format",
    # "more_complete", "more_concise", "safer", "more_honest"

    # Quality signals
    time_spent_seconds: float = 0.0
    rationale: Optional[str] = None  # Free text - required for high-stakes tasks

    # Per-response dimension ratings (optional, improves reward model)
    scores_a: dict = field(default_factory=dict)  # {dimension: 1-5}
    scores_b: dict = field(default_factory=dict)  # {dimension: 1-5}

    created_at: float = field(default_factory=time.time)


# Annotation interface design guidelines (documented as constants)
INTERFACE_DESIGN_PRINCIPLES = {
    "position_randomization": (
        "Always randomize which response is shown as A vs B. "
        "Never consistently show one model/system as A. "
        "Verify statistically: 'A preferred' rate should be 48-52% across dataset."
    ),
    "model_blinding": (
        "Never reveal which model generated each response. "
        "If annotators know Claude generated Response A, they may prefer it "
        "for brand reasons rather than quality reasons."
    ),
    "sequential_presentation": (
        "For highest quality: show each response separately, require initial rating "
        "before showing the other response, then allow comparison. "
        "Prevents anchoring bias from seeing both responses simultaneously."
    ),
    "confidence_collection": (
        "Collect annotator confidence (1-5) alongside preference. "
        "Weight aggregated preferences by confidence. "
        "Exclude confidence=1 responses (guessing)."
    ),
    "rationale_requirement": (
        "For high-stakes preferences (safety, medical, legal): require free text rationale. "
        "For routine preferences: optional rationale improves signal but adds 30-60s per task."
    ),
    "context_display": (
        "Always show the full conversation history, not just the last turn. "
        "A response that is excellent given full context may be poor without it."
    ),
}

FEEDBACK_DIMENSIONS = {
    "helpfulness": "How well does this response address the user's actual need?",
    "accuracy": "Is the factual content correct and free of errors?",
    "honesty": "Does the response appropriately acknowledge uncertainty and limitations?",
    "safety": "Does the response avoid harmful, inappropriate, or dangerous content?",
    "conciseness": "Is the response an appropriate length - neither too brief nor verbose?",
    "format": "Is the structure, formatting, and presentation appropriate for the context?",
}
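One way to enforce the model-blinding principle is to build the annotator-facing record from an allow-list, so hidden fields cannot leak by accident when new ones are added. The sketch below is a minimal illustration; `StoredPair`, `ANNOTATOR_VISIBLE_FIELDS`, and `to_annotator_view` are hypothetical names, and `StoredPair` is a trimmed stand-in for the `ResponsePair` dataclass shown earlier.

```python
from dataclasses import asdict, dataclass


@dataclass
class StoredPair:
    """Hypothetical stand-in for ResponsePair with a few of its hidden fields."""
    pair_id: str
    prompt: str
    response_a: str
    response_b: str
    model_a: str = ""
    model_b: str = ""
    system_a: str = ""
    system_b: str = ""


# Allow-list: only these fields may reach the annotation UI.
ANNOTATOR_VISIBLE_FIELDS = ("pair_id", "prompt", "response_a", "response_b")


def to_annotator_view(pair: StoredPair) -> dict:
    """Copy only allow-listed fields into the record sent to annotators."""
    record = asdict(pair)
    return {name: record[name] for name in ANNOTATOR_VISIBLE_FIELDS}


pair = StoredPair(
    pair_id="p-1",
    prompt="How do I reset my password?",
    response_a="Go to Settings and choose Reset password.",
    response_b="Please contact support.",
    model_a="model-x",
    model_b="model-y",
)
view = to_annotator_view(pair)
print(sorted(view))  # ['pair_id', 'prompt', 'response_a', 'response_b']
```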

Feedback Quality Controls

Systematic quality filters must run on every annotation before it enters the training dataset. Without filters, rushed annotations, distracted annotators, and gaming behavior corrupt the reward model's learned preferences.

from typing import Optional
from collections import defaultdict
import statistics


class FeedbackQualityFilter:
    """
    Validates preference feedback before it enters training data.

    Filter criteria are based on observable behavioral signals:
    - Time spent: too fast means not read, too slow means distracted
    - Confidence: explicit annotator self-assessment
    - Consistency: test-retest on identical pairs
    - Response format: valid preference value
    """

    def __init__(
        self,
        min_time_seconds: float = 12.0,  # < 12s: impossible to read both responses
        max_time_seconds: float = 600.0,  # > 10min: likely distracted/interrupted
        min_confidence: int = 2,  # confidence=1 means "just guessing"
        require_rationale_for_confidence_4_5: bool = False,
    ):
        self.min_time = min_time_seconds
        self.max_time = max_time_seconds
        self.min_confidence = min_confidence
        self.require_rationale_for_high_confidence = require_rationale_for_confidence_4_5

    def validate(self, feedback: PreferenceFeedback) -> tuple[bool, list[str]]:
        """
        Validate a single feedback entry.
        Returns (is_valid, list_of_rejection_reasons).
        """
        reasons = []

        if feedback.time_spent_seconds < self.min_time:
            reasons.append(
                f"Too fast: {feedback.time_spent_seconds:.1f}s < {self.min_time}s minimum. "
                f"Not enough time to read both responses."
            )

        if feedback.time_spent_seconds > self.max_time:
            reasons.append(
                f"Too slow: {feedback.time_spent_seconds:.1f}s > {self.max_time}s maximum. "
                f"Annotator likely interrupted or distracted."
            )

        if feedback.confidence < self.min_confidence:
            reasons.append(
                f"Confidence too low: {feedback.confidence} < {self.min_confidence}. "
                f"Confidence=1 indicates annotator was guessing."
            )

        if feedback.preferred not in ("A", "B", "tie"):
            reasons.append(f"Invalid preference value: '{feedback.preferred}'")

        if (self.require_rationale_for_high_confidence
                and feedback.confidence >= 4
                and not feedback.rationale):
            reasons.append(
                "High confidence (4-5) without rationale. "
                "High-confidence preferences require written justification."
            )

        return len(reasons) == 0, reasons

    def filter_batch(
        self,
        feedbacks: list[PreferenceFeedback],
    ) -> tuple[list[PreferenceFeedback], list[dict]]:
        """Filter a batch. Returns (valid, rejected_with_reasons)."""
        valid = []
        rejected = []
        for fb in feedbacks:
            ok, reasons = self.validate(fb)
            if ok:
                valid.append(fb)
            else:
                rejected.append({
                    "feedback_id": fb.feedback_id,
                    "annotator_id": fb.annotator_id,
                    "reasons": reasons,
                })
        return valid, rejected


class AnnotatorConsistencyChecker:
    """
    Detects annotators with low test-retest reliability.

    Method: inject duplicate pairs (same pair shown twice to same annotator
    weeks apart). If an annotator gives different answers on the same pair,
    their reliability is low.

    Target: >= 75% consistency on duplicate pairs.
    Below 75%: flag for review.
    Below 65%: flag for retraining.
    Below 55%: flag for removal (close to chance for a two-way choice).
    """

    CONSISTENCY_THRESHOLDS = {
        "flag_for_review": 0.75,
        "flag_for_retraining": 0.65,
        "flag_for_removal": 0.55,
    }

    def __init__(self):
        # {pair_id: {annotator_id: [preference_1, preference_2, ...]}}
        self._responses: dict = defaultdict(lambda: defaultdict(list))

    def record_response(
        self,
        annotator_id: str,
        pair_id: str,
        preferred: str,
    ) -> None:
        self._responses[pair_id][annotator_id].append(preferred)

    def compute_annotator_consistency(self, annotator_id: str) -> Optional[dict]:
        """
        Compute test-retest consistency for an annotator.
        Only meaningful when the annotator has labeled duplicate pairs.
        """
        consistent = 0
        total_duplicates = 0

        for pair_id, responses_by_annotator in self._responses.items():
            responses = responses_by_annotator.get(annotator_id, [])
            if len(responses) >= 2:
                # All responses agree = consistent
                if len(set(responses)) == 1:
                    consistent += 1
                total_duplicates += 1

        if total_duplicates < 10:
            return None  # Not enough duplicate exposure

        consistency = consistent / total_duplicates
        status = "ok"
        if consistency < self.CONSISTENCY_THRESHOLDS["flag_for_removal"]:
            status = "flag_for_removal"
        elif consistency < self.CONSISTENCY_THRESHOLDS["flag_for_retraining"]:
            status = "flag_for_retraining"
        elif consistency < self.CONSISTENCY_THRESHOLDS["flag_for_review"]:
            status = "flag_for_review"

        return {
            "annotator_id": annotator_id,
            "consistency": round(consistency, 3),
            "n_duplicate_pairs": total_duplicates,
            "status": status,
        }

    def detect_position_bias(self, annotator_id: str) -> Optional[dict]:
        """
        Detect if an annotator shows systematic position bias.
        Only meaningful when A/B positions were randomized; otherwise a skewed
        rate may reflect real quality differences.

        The A-preferred rate is computed over decisive (non-tie) votes so that
        a high tie rate is not mistaken for a preference for B.

        Target: 45-55% A-preferred rate.
        Mild: outside 40-60%.
        Moderate: outside 35-65%.
        Severe: outside 25-75%.
        """
        all_preferences = []
        for pair_responses in self._responses.values():
            ann_prefs = pair_responses.get(annotator_id, [])
            all_preferences.extend(ann_prefs)

        if len(all_preferences) < 50:
            return None

        a_count = all_preferences.count("A")
        b_count = all_preferences.count("B")
        decisive = a_count + b_count
        if decisive == 0:
            return None  # Only ties recorded - no positional signal

        a_rate = a_count / decisive
        b_rate = b_count / decisive
        tie_rate = all_preferences.count("tie") / len(all_preferences)

        deviation = abs(a_rate - 0.5)
        bias_detected = deviation > 0.15  # >15 points away from 50%
        severity = "none"
        if deviation > 0.25:
            severity = "severe"
        elif deviation > 0.15:
            severity = "moderate"
        elif deviation > 0.10:
            severity = "mild"

        return {
            "annotator_id": annotator_id,
            "a_preference_rate": round(a_rate, 3),
            "b_preference_rate": round(b_rate, 3),
            "tie_rate": round(tie_rate, 3),
            "bias_detected": bias_detected,
            "severity": severity,
            "n_preferences": len(all_preferences),
        }
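The fixed 12-second floor in `FeedbackQualityFilter` treats a two-sentence pair and a two-page pair the same way. As a hedged alternative, the minimum time can be scaled with the amount of text shown. The sketch below is illustrative only: `min_reading_time_seconds` is a hypothetical helper, and the 600 words-per-minute ceiling and 5-second floor are assumed values, not figures from this article.

```python
def min_reading_time_seconds(
    response_a: str,
    response_b: str,
    max_words_per_minute: float = 600.0,  # assumed upper bound on skimming speed
    floor_seconds: float = 5.0,           # assumed minimum for very short pairs
) -> float:
    """Shortest plausible time to read both responses, based on word count."""
    total_words = len(response_a.split()) + len(response_b.split())
    return max(floor_seconds, total_words / max_words_per_minute * 60.0)


long_pair = min_reading_time_seconds("word " * 300, "word " * 300)
short_pair = min_reading_time_seconds("Yes.", "No.")
print(long_pair, short_pair)  # 60.0 5.0
```

The result could be passed as `min_time_seconds` when a filter is built for a specific pair, or compared directly with `time_spent_seconds`.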

Preference Aggregation

Judgments from multiple annotators on the same pair must be combined into a single training signal. The aggregation strategy determines how annotator disagreement is handled.

from collections import Counter
from typing import Optional


def aggregate_preferences(
    feedbacks: list[PreferenceFeedback],
    annotator_weights: Optional[dict] = None,
    strategy: str = "weighted_majority",
) -> Optional[dict]:
    """
    Aggregate multiple annotator preferences for a response pair.

    Arguments:
    - feedbacks: list of PreferenceFeedback for the same pair
    - annotator_weights: {annotator_id: weight} from calibration task accuracy
    - strategy: "majority" or "weighted_majority"

    Returns: aggregated preference dict or None if insufficient signal.
    Raises ValueError for any other strategy name.
    """
    if not feedbacks:
        return None

    if strategy == "weighted_majority":
        return _weighted_majority(feedbacks, annotator_weights)
    elif strategy == "majority":
        return _simple_majority(feedbacks)
    else:
        raise ValueError(f"Unknown strategy: {strategy}")


def _simple_majority(feedbacks: list[PreferenceFeedback]) -> Optional[dict]:
    """Simple majority vote - equal weight for all annotators."""
    votes = Counter(fb.preferred for fb in feedbacks)
    total = sum(votes.values())
    winner, winner_count = votes.most_common(1)[0]
    agreement_rate = winner_count / total

    if agreement_rate < 0.55 and len(feedbacks) >= 3:
        return {
            "preferred": "disputed",
            "agreement_rate": agreement_rate,
            "n_annotators": len(feedbacks),
            "votes": dict(votes),
            "flag": "low_consensus",
            "include_in_training": False,
        }

    return {
        "preferred": winner,
        "agreement_rate": agreement_rate,
        "n_annotators": len(feedbacks),
        "votes": dict(votes),
        "include_in_training": True,
    }


def _weighted_majority(
    feedbacks: list[PreferenceFeedback],
    annotator_weights: Optional[dict] = None,
) -> Optional[dict]:
    """
    Weighted majority vote.
    Weights: annotator's gold task accuracy (higher accuracy = more weight).
    Also weights by self-reported confidence.
    """
    weighted_votes = {"A": 0.0, "B": 0.0, "tie": 0.0}
    total_weight = 0.0

    for fb in feedbacks:
        annotator_weight = (
            annotator_weights.get(fb.annotator_id, 1.0)
            if annotator_weights else 1.0
        )
        confidence_weight = fb.confidence / 5.0  # Normalize 1-5 to 0.2-1.0

        combined_weight = annotator_weight * confidence_weight
        weighted_votes[fb.preferred] += combined_weight
        total_weight += combined_weight

    if total_weight == 0:
        return None

    normalized = {pref: w / total_weight for pref, w in weighted_votes.items()}
    winner = max(normalized, key=normalized.get)
    winner_weight = normalized[winner]

    low_consensus = winner_weight < 0.55 and len(feedbacks) >= 3

    return {
        "preferred": "disputed" if low_consensus else winner,
        "winner_weight": round(winner_weight, 3),
        "weighted_votes": {k: round(v, 3) for k, v in normalized.items()},
        "n_annotators": len(feedbacks),
        "include_in_training": not low_consensus,
        "flag": "low_consensus" if low_consensus else None,
        "avg_annotator_confidence": round(
            sum(fb.confidence for fb in feedbacks) / len(feedbacks), 2
        ),
    }
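The `annotator_weights` argument above is described as coming from calibration (gold) task accuracy, but no code shows how to compute it. The sketch below is one simple possibility, not the author's method: `gold_accuracy_weights` is a hypothetical helper that uses raw accuracy as the weight and falls back to a default for annotators with too few gold tasks. The `min_tasks=5` cutoff is an assumption.

```python
from collections import defaultdict


def gold_accuracy_weights(
    gold_results: list[tuple[str, bool]],
    min_tasks: int = 5,           # assumed minimum before accuracy is trusted
    default_weight: float = 1.0,  # mirrors the 1.0 fallback used in aggregation
) -> dict[str, float]:
    """gold_results holds one (annotator_id, answered_correctly) entry per gold task."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for annotator_id, is_correct in gold_results:
        seen[annotator_id] += 1
        correct[annotator_id] += int(is_correct)

    weights = {}
    for annotator_id, n_tasks in seen.items():
        if n_tasks >= min_tasks:
            weights[annotator_id] = correct[annotator_id] / n_tasks
        else:
            weights[annotator_id] = default_weight
    return weights


results = (
    [("ann_1", True)] * 9 + [("ann_1", False)]
    + [("ann_2", True)] * 3 + [("ann_2", False)] * 3
    + [("ann_3", True)] * 2
)
weights = gold_accuracy_weights(results)
print(weights)  # {'ann_1': 0.9, 'ann_2': 0.5, 'ann_3': 1.0}
```

Note the side effect of the 1.0 fallback: an uncalibrated annotator outweighs a calibrated one with 90% accuracy. A lower default may be safer if many annotators lack gold-task history.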

Building the RLHF Training Dataset

The final step converts raw preference judgments into the format required by reward model training or direct preference optimization (DPO).

from collections import defaultdict
from dataclasses import dataclass, field, asdict
from typing import Optional
import json
import uuid
import time


@dataclass
class RLHFSample:
    """A single training sample for reward model training or DPO."""
    sample_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    prompt: str = ""
    chosen: str = ""  # The preferred response
    rejected: str = ""  # The non-preferred response
    chosen_confidence: float = 1.0  # How confident the consensus was
    rejected_confidence: float = 0.0
    n_annotators: int = 0
    agreement_rate: float = 1.0
    prompt_category: str = ""
    source: str = "human_comparison"
    created_at: float = field(default_factory=time.time)

    def to_trl_format(self) -> dict:
        """Plain-string prompt/chosen/rejected format for TRL (HuggingFace) trainers."""
        return {
            "prompt": self.prompt,
            "chosen": self.chosen,
            "rejected": self.rejected,
        }

    def to_dpo_format(self) -> dict:
        """Conversational (message-list) preference format for DPO training."""
        return {
            "prompt": [{"role": "user", "content": self.prompt}],
            "chosen": [{"role": "assistant", "content": self.chosen}],
            "rejected": [{"role": "assistant", "content": self.rejected}],
        }

    def to_orpo_format(self) -> dict:
        """
        Plain-string format for ORPO (Odds Ratio Preference Optimization).
        The score fields are extra metadata derived from annotator agreement;
        the ORPO objective itself only needs prompt, chosen, and rejected.
        """
        return {
            "prompt": self.prompt,
            "chosen": self.chosen,
            "rejected": self.rejected,
            "chosen_score": self.agreement_rate,
            "rejected_score": 1.0 - self.agreement_rate,
        }


class RLHFDatasetBuilder:
    """
    Builds RLHF training datasets from collected preference feedback.

    Applies quality filters, aggregates annotator preferences,
    and exports in DPO, TRL, or ORPO format.
    """

    def __init__(
        self,
        quality_filter: FeedbackQualityFilter,
        annotator_weights: Optional[dict] = None,
    ):
        self.quality_filter = quality_filter
        self.annotator_weights = annotator_weights or {}
        self._pairs: dict[str, ResponsePair] = {}
        self._feedbacks: dict[str, list[PreferenceFeedback]] = defaultdict(list)

    def add_pair(self, pair: ResponsePair) -> None:
        self._pairs[pair.pair_id] = pair

    def add_feedback(self, feedback: PreferenceFeedback) -> None:
        self._feedbacks[feedback.pair_id].append(feedback)

    def build_dataset(
        self,
        min_annotators: int = 3,
        min_agreement_rate: float = 0.60,
        exclude_ties: bool = True,
    ) -> tuple[list[RLHFSample], dict]:
        """
        Build clean RLHF dataset from collected pairs and feedback.

        Pairs whose winning share falls below min_agreement_rate are
        counted as low consensus and excluded.

        Returns: (samples, statistics)
        """
        samples = []
        stats = {
            "total_pairs": len(self._pairs),
            "insufficient_annotations": 0,
            "failed_quality_filter": 0,
            "low_consensus": 0,
            "ties_excluded": 0,
            "included_in_training": 0,
        }

        for pair_id, pair in self._pairs.items():
            feedbacks = self._feedbacks.get(pair_id, [])

            # Require minimum annotators
            if len(feedbacks) < min_annotators:
                stats["insufficient_annotations"] += 1
                continue

            # Quality filter (rejection details are not needed here)
            valid_feedbacks, _ = self.quality_filter.filter_batch(feedbacks)
            if len(valid_feedbacks) < min_annotators:
                stats["failed_quality_filter"] += 1
                continue

            # Aggregate preferences
            consensus = aggregate_preferences(
                valid_feedbacks,
                annotator_weights=self.annotator_weights,
                strategy="weighted_majority",
            )

            if not consensus:
                # No usable vote weight (e.g., all annotator weights are zero)
                stats["low_consensus"] += 1
                continue

            winner_weight = consensus.get("winner_weight", 0.0)
            if (not consensus.get("include_in_training", True)
                    or winner_weight < min_agreement_rate):
                stats["low_consensus"] += 1
                continue

            preferred = consensus.get("preferred", "disputed")

            if preferred == "tie":
                if exclude_ties:
                    stats["ties_excluded"] += 1
                    continue
                # Include tie as equal-weight training signal
                # (less common approach - only for specific training objectives).
                # Which side is labeled "chosen" is arbitrary, so confidence is 0.5.
                chosen, rejected = pair.response_a, pair.response_b
                chosen_confidence = 0.5
            elif preferred == "A":
                chosen, rejected = pair.response_a, pair.response_b
                chosen_confidence = winner_weight
            elif preferred == "B":
                chosen, rejected = pair.response_b, pair.response_a
                chosen_confidence = winner_weight
            else:
                stats["low_consensus"] += 1
                continue

            sample = RLHFSample(
                prompt=pair.prompt,
                chosen=chosen,
                rejected=rejected,
                chosen_confidence=chosen_confidence,
                rejected_confidence=1.0 - chosen_confidence,
                n_annotators=consensus.get("n_annotators", 0),
                agreement_rate=winner_weight,
                prompt_category=pair.prompt_category,
            )
            samples.append(sample)
            stats["included_in_training"] += 1

        inclusion_rate = stats["included_in_training"] / max(stats["total_pairs"], 1)
        stats["inclusion_rate"] = round(inclusion_rate, 3)

        return samples, stats

    def export_jsonl(
        self,
        samples: list[RLHFSample],
        output_path: str,
        format: str = "dpo",
    ) -> int:
        """Export dataset to JSONL file."""
        with open(output_path, "w") as f:
            for sample in samples:
                if format == "dpo":
                    record = sample.to_dpo_format()
                elif format == "trl":
                    record = sample.to_trl_format()
                elif format == "orpo":
                    record = sample.to_orpo_format()
                else:
                    record = asdict(sample)
                f.write(json.dumps(record) + "\n")

        print(f"Exported {len(samples)} RLHF samples to {output_path} ({format} format)")
        return len(samples)

    def get_category_breakdown(self, samples: list[RLHFSample]) -> dict:
        """Analyze the category distribution of the dataset."""
        by_category: dict = defaultdict(int)
        for sample in samples:
            by_category[sample.prompt_category] += 1

        total = len(samples)
        return {
            cat: {"count": count, "fraction": round(count / max(total, 1), 3)}
            for cat, count in sorted(by_category.items(), key=lambda x: -x[1])
        }
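Before an exported file goes to a trainer, a quick structural check can catch empty fields or pairs where both sides are identical. The sketch below is a minimal, assumed validator (`validate_preference_jsonl` and `REQUIRED_KEYS` are hypothetical names) that works on any of the export formats above, since each has `prompt`, `chosen`, and `rejected` keys.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_preference_jsonl(path: str) -> list[str]:
    """Return human-readable problems; an empty list means the file passed."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)
            missing = [key for key in REQUIRED_KEYS if not record.get(key)]
            if missing:
                problems.append(f"line {line_number}: missing or empty {missing}")
            elif record["chosen"] == record["rejected"]:
                problems.append(f"line {line_number}: chosen and rejected are identical")
    return problems


# Made-up records, written to a temporary file purely for illustration.
records = [
    {"prompt": "What is the warranty period?", "chosen": "Let me check.", "rejected": "Ten years."},
    {"prompt": "Does it support export?", "chosen": "Yes.", "rejected": "Yes."},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prefs.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    problems = validate_preference_jsonl(path)

print(problems)  # ['line 2: chosen and rejected are identical']
```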

Elo Rating for Model Ranking

When preference data covers multiple models, Elo rating provides a principled ranking that accounts for the strength of the opponent in each comparison.

import math
from collections import defaultdict


def compute_elo_ratings(
    preferences: list[dict],  # [{"model_a": "...", "model_b": "...", "preferred": "A"/"B"/"tie"}]
    initial_rating: float = 1500.0,
    k_factor: float = 32.0,
    n_bootstrap: int = 0,
) -> dict:
    """
    Compute ELO ratings for models from head-to-head comparisons.

    The Bradley-Terry model is the principled statistical foundation;
    ELO is an efficient online approximation of it. Ratings depend on the
    order in which comparisons are processed and on k_factor.

    Parameters:
    - initial_rating: starting ELO for all models (1500 is standard)
    - k_factor: learning rate per comparison (32 is standard for new models)
    - n_bootstrap: intended for bootstrap confidence intervals; not
      implemented here, so the value is ignored

    Returns: {model_name: {"elo": rating, "n_comparisons": count}},
    sorted by rating descending
    """
    ratings: dict[str, float] = {}
    match_counts: dict[str, int] = defaultdict(int)

    for pref in preferences:
        model_a = pref.get("model_a", "model_a")
        model_b = pref.get("model_b", "model_b")

        if model_a not in ratings:
            ratings[model_a] = initial_rating
        if model_b not in ratings:
            ratings[model_b] = initial_rating

        ra = ratings[model_a]
        rb = ratings[model_b]

        # Expected win probability
        ea = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
        eb = 1.0 - ea

        # Actual outcome
        winner = pref.get("preferred", "tie")
        if winner == "A":
            sa, sb = 1.0, 0.0
        elif winner == "B":
            sa, sb = 0.0, 1.0
        else:  # tie
            sa, sb = 0.5, 0.5

        # Update ratings
        ratings[model_a] += k_factor * (sa - ea)
        ratings[model_b] += k_factor * (sb - eb)
        match_counts[model_a] += 1
        match_counts[model_b] += 1

    result = {
        model: {
            "elo": round(rating, 1),
            "n_comparisons": match_counts[model],
        }
        for model, rating in sorted(ratings.items(), key=lambda x: -x[1])
    }

    return result


def analyze_position_bias_in_dataset(
    feedbacks: list[PreferenceFeedback],
    pairs: dict,  # {pair_id: ResponsePair} - not used by this check
) -> dict:
    """
    Analyze dataset-wide position bias.

    If the A-preference rate differs significantly from 50%, position bias
    is contaminating the dataset. This requires either re-annotation
    with fixed randomization or statistical correction.

    A healthy dataset should have 48-52% A-preferred rate among non-tie votes.
    """
    total = len(feedbacks)
    a_count = sum(1 for fb in feedbacks if fb.preferred == "A")
    b_count = sum(1 for fb in feedbacks if fb.preferred == "B")
    tie_count = sum(1 for fb in feedbacks if fb.preferred == "tie")

    a_rate = a_count / max(total, 1)
    b_rate = b_count / max(total, 1)

    # Chi-squared test for deviation from 50/50 (excluding ties)
    non_tie = a_count + b_count
    expected = non_tie / 2
    if expected > 0:
        chi_squared = ((a_count - expected) ** 2 + (b_count - expected) ** 2) / expected
        # p < 0.05 corresponds to chi_squared > 3.84 for 1 degree of freedom
        statistically_significant = chi_squared > 3.84
        a_share = a_count / non_tie
    else:
        chi_squared = 0.0
        statistically_significant = False
        a_share = 0.5

    # Effect size is measured on non-tie votes, matching the chi-squared test,
    # so a high tie rate cannot hide or fake a skew.
    bias_detected = statistically_significant and abs(a_share - 0.5) > 0.05

    return {
        "total_preferences": total,
        "a_preferred": a_count,
        "b_preferred": b_count,
        "tie": tie_count,
        "a_rate": round(a_rate, 4),
        "b_rate": round(b_rate, 4),
        "a_share_excluding_ties": round(a_share, 4),
        "bias_detected": bias_detected,
        "chi_squared": round(chi_squared, 3),
        "recommendation": (
            "Position bias detected - re-annotate with verified randomization"
            if bias_detected
            else "Dataset passes position bias check"
        ),
    }
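The `compute_elo_ratings` docstring names Bradley-Terry as the underlying statistical model. For comparison, the sketch below fits Bradley-Terry strengths directly with the standard MM (minorization-maximization) update, which gives the same answer whatever order the comparisons arrive in. `fit_bradley_terry` is a hypothetical helper. Counting a tie as half a win for each side is a common heuristic rather than part of the classic model, and the fit assumes every model has at least one win.

```python
from collections import defaultdict


def fit_bradley_terry(preferences: list[dict], n_iterations: int = 200) -> dict[str, float]:
    """Return Bradley-Terry strengths normalized to sum to 1."""
    wins = defaultdict(float)   # (fractional) wins per model
    games = defaultdict(float)  # games[(i, j)] = comparisons between i and j
    models = set()
    for pref in preferences:
        a, b = pref["model_a"], pref["model_b"]
        models.update((a, b))
        games[(a, b)] += 1.0
        games[(b, a)] += 1.0
        score_a = {"A": 1.0, "B": 0.0}.get(pref["preferred"], 0.5)  # tie -> 0.5
        wins[a] += score_a
        wins[b] += 1.0 - score_a

    strengths = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iterations):
        updated = {}
        for i in models:
            # MM update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                games[(i, j)] / (strengths[i] + strengths[j])
                for j in models
                if j != i and games[(i, j)] > 0
            )
            updated[i] = wins[i] / denom if denom > 0 else strengths[i]
        total = sum(updated.values())
        strengths = {m: s / total for m, s in updated.items()}
    return strengths


# Made-up data: model "x" beats model "y" in 3 of 4 comparisons.
prefs = [{"model_a": "x", "model_b": "y", "preferred": "A"}] * 3
prefs += [{"model_a": "x", "model_b": "y", "preferred": "B"}]
strengths = fit_bradley_terry(prefs)
print({m: round(s, 3) for m, s in sorted(strengths.items())})  # {'x': 0.75, 'y': 0.25}
```

The fitted strengths imply a win probability of `p_x / (p_x + p_y)`, which here is 0.75, matching the observed 3-of-4 record.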

Feedback Dimension Design

| Dimension | What It Measures | Most Important For |
| --- | --- | --- |
| Helpfulness | Does it solve the user's actual need? | Customer service, coding |
| Accuracy | Is the information correct? | Factual QA, medical, legal |
| Honesty | Does it appropriately hedge uncertainty? | High-stakes decisions |
| Safety | Does it avoid harmful content? | All applications |
| Conciseness | Is length appropriate for the task? | Chat interfaces |
| Format | Is structure appropriate for context? | Document generation |
| Tone | Does register match the context? | Creative, marketing |

For reward model training, collect preferences on the dimensions most important to your application rather than all dimensions. A customer service reward model needs helpfulness and safety weights. A coding assistant needs accuracy and format weights. Over-weighting irrelevant dimensions adds noise.

:::tip Collect Multidimensional Ratings Alongside Comparisons
Binary preference ("A is better") is informative but coarse. Adding per-response dimension ratings (helpfulness 1-5, accuracy 1-5) provides richer training signal. The reward model can learn that Response A was preferred primarily due to safety while both responses had similar helpfulness. This allows fine-grained reward shaping - particularly useful when one dimension (safety) should have veto power.
:::
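To make the "veto power" idea concrete, the sketch below combines per-dimension 1-5 ratings into one score using application-specific weights, and forces the minimum score whenever the safety rating is at or below a threshold. `combine_dimension_scores`, the example weights, and the threshold of 2 are all illustrative assumptions.

```python
def combine_dimension_scores(
    scores: dict[str, float],
    weights: dict[str, float],
    veto_dimension: str = "safety",
    veto_threshold: float = 2.0,  # assumed: ratings of 1-2 trigger the veto
) -> float:
    """Weighted mean of 1-5 ratings; returns the 1.0 minimum if the veto fires."""
    # A missing veto rating is treated as safe (5.0); stricter handling may be preferable.
    if scores.get(veto_dimension, 5.0) <= veto_threshold:
        return 1.0
    total_weight = sum(weights.get(dim, 0.0) for dim in scores)
    if total_weight == 0:
        raise ValueError("no weighted dimensions present in scores")
    weighted_sum = sum(score * weights.get(dim, 0.0) for dim, score in scores.items())
    return weighted_sum / total_weight


# Hypothetical weights for a customer service application.
weights = {"helpfulness": 0.5, "accuracy": 0.3, "safety": 0.2}
safe = combine_dimension_scores({"helpfulness": 5, "accuracy": 4, "safety": 5}, weights)
unsafe = combine_dimension_scores({"helpfulness": 5, "accuracy": 5, "safety": 1}, weights)
print(round(safe, 2), unsafe)  # 4.7 1.0
```

Without the veto, the second response would score 4.2 on helpfulness and accuracy alone, which is the failure a plain weighted average allows.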

:::danger Never Reveal Which Model Generated Each Response
Annotators have systematic biases toward (or against) specific models. If they know Response A came from GPT-4 and Response B from an internal model, their preference reflects brand awareness, not response quality. Keep model identities completely hidden through the entire annotation process. Use opaque response identifiers. Audit your annotation interface for any accidental model leakage.
:::

:::warning Annotator Preference != User Preference
Studies consistently show that trained annotators prefer verbose, formal, disclaimer-heavy responses significantly more than actual users. Users prefer concise, direct answers. If you optimize exclusively for annotator preference, you will produce a model users find frustrating. Supplement annotator preference with in-product behavioral signals: copy rate, thumbs-up rate, session completion, follow-up question rate.
:::


Interview Q&A

Q1: What is RLHF and why is human feedback collection the most operationally complex part?

RLHF (Reinforcement Learning from Human Feedback) is a training approach that uses human preference judgments to train language models to produce outputs that humans prefer. The pipeline has three stages: collect human preferences between pairs of model responses, train a reward model that learns to predict human preferences, then fine-tune the language model using reinforcement learning to maximize the reward signal.

Human feedback collection is the bottleneck because there is no substitute for genuine human judgment when defining what "good" means. You cannot derive "helpful, harmless, honest" from a mathematical loss function without embedding human values. The collection process must scale to tens of thousands of labeled pairs to train a robust reward model, must maintain high quality across that scale, and must be designed to avoid the systematic biases (position bias, brand bias, confidence bias) that corrupt the preference signal.

Operationally, collection requires: building and maintaining an annotation interface, training and calibrating annotators, running quality filters, managing disagreement, and building the pipeline to convert raw preferences into training data. Each of these has failure modes that are slow to detect and expensive to correct once the reward model is trained.

Q2: How do you handle position bias in preference annotation?

Position bias is the systematic tendency for annotators to prefer the first-shown response regardless of quality. In controlled experiments, A-preferred rates of 55-65% are common in non-randomized annotation interfaces. That is 5 to 15 percentage points above the 50% baseline you would expect if position had no effect, so a meaningful share of "preferences" reflects position rather than quality.

Prevention: randomize which response is shown as A and which as B for each annotation task. This is implemented by randomly swapping the assignment at pair generation time, tracking which original response was shown as A, and verifying statistically that the A-preferred rate in the dataset falls within 48-52%.
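A minimal sketch of the swap-and-track step described above, assuming each pair arrives as two response strings in a fixed original order. The names `DisplayPair`, `make_display_pair`, and `original_winner` are illustrative, not part of any existing tool.

```python
import random
from dataclasses import dataclass


@dataclass
class DisplayPair:
    pair_id: str
    shown_a: str   # text displayed as "Response A"
    shown_b: str   # text displayed as "Response B"
    swapped: bool  # True if the original second response is shown as A


def make_display_pair(pair_id, first, second, rng):
    """Randomly decide which original response is shown in slot A."""
    swapped = rng.random() < 0.5
    if swapped:
        return DisplayPair(pair_id, second, first, swapped)
    return DisplayPair(pair_id, first, second, swapped)


def original_winner(pair, chosen_slot):
    """Map the annotator's slot choice ("A" or "B") back to "first" or "second"."""
    if chosen_slot not in ("A", "B"):
        raise ValueError("chosen_slot must be 'A' or 'B'")
    chose_a = chosen_slot == "A"
    # Without a swap, slot A holds the first response; with a swap, the second.
    return "first" if chose_a != pair.swapped else "second"


rng = random.Random(0)  # fixed seed only so the sketch is reproducible
pairs = [make_display_pair(f"p{i}", "resp-1", "resp-2", rng) for i in range(1000)]
swap_rate = sum(p.swapped for p in pairs) / len(pairs)
```

Storing `swapped` alongside each task is what lets the pipeline recover the original response identity later, so the randomization never leaks into the training labels.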

Detection: compute the dataset-level A-preference rate and run a chi-squared test against the 50/50 null hypothesis. If the A-rate is above 54% (or below 46%) with statistical significance, position bias is likely present. Per-annotator position bias is also detectable: individual annotators with A-rates above 65% show strong individual position bias.
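The dataset-level check can be sketched with the standard library alone. This assumes a one-degree-of-freedom chi-squared goodness-of-fit test, for which the p-value reduces to `erfc(sqrt(chi2 / 2))`. The function names and the `alpha=0.01` significance level are assumptions for illustration; the 46%/54% bounds come from the text above.

```python
import math


def position_bias_test(a_count, b_count):
    """Chi-squared goodness-of-fit test (1 degree of freedom) against a 50/50 split."""
    n = a_count + b_count
    if n == 0:
        raise ValueError("no annotations")
    expected = n / 2
    chi2 = ((a_count - expected) ** 2 + (b_count - expected) ** 2) / expected
    # For 1 degree of freedom, the chi-squared survival function is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"a_rate": a_count / n, "chi2": chi2, "p_value": p_value}


def flag_position_bias(a_count, b_count, alpha=0.01, low=0.46, high=0.54):
    """Flag only when the A-rate is outside [low, high] AND the deviation is significant."""
    result = position_bias_test(a_count, b_count)
    outside = result["a_rate"] < low or result["a_rate"] > high
    result["flagged"] = outside and result["p_value"] < alpha
    return result


# Example: 61% A-preferred across 50,000 annotations.
audit = flag_position_bias(30_500, 19_500)
```

Requiring both conditions avoids two failure modes: small samples that drift past 54% by chance, and very large samples where a practically negligible deviation is still statistically significant.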

Correction (post-hoc): if position bias is detected in an existing dataset, partial correction is possible by down-weighting annotations from annotators with extreme A-preference rates, or by creating balanced subsets that match A-preferred and B-preferred counts. However, re-annotation with a corrected interface is more reliable than statistical correction of biased data.
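Both post-hoc options can be sketched as follows, assuming each annotation is a dict with hypothetical `annotator_id` and `chosen_slot` keys. The `penalty=0.5` weight is an arbitrary illustrative value, and the 35-65% band is borrowed from the per-annotator thresholds used elsewhere in this lesson.

```python
import random
from collections import defaultdict


def balanced_subset(annotations, rng):
    """Downsample the over-represented slot so A-preferred and B-preferred counts match."""
    a_items = [x for x in annotations if x["chosen_slot"] == "A"]
    b_items = [x for x in annotations if x["chosen_slot"] == "B"]
    k = min(len(a_items), len(b_items))
    subset = rng.sample(a_items, k) + rng.sample(b_items, k)
    rng.shuffle(subset)
    return subset


def annotator_weights(annotations, low=0.35, high=0.65, penalty=0.5):
    """Assign a reduced weight to annotators whose A-preference rate is extreme."""
    counts = defaultdict(lambda: [0, 0])  # annotator_id -> [a_chosen, total]
    for x in annotations:
        counts[x["annotator_id"]][0] += x["chosen_slot"] == "A"
        counts[x["annotator_id"]][1] += 1
    weights = {}
    for annotator_id, (a_chosen, total) in counts.items():
        a_rate = a_chosen / total
        weights[annotator_id] = penalty if (a_rate < low or a_rate > high) else 1.0
    return weights


data = (
    [{"annotator_id": "u1", "chosen_slot": "A"}] * 9
    + [{"annotator_id": "u1", "chosen_slot": "B"}] * 1
    + [{"annotator_id": "u2", "chosen_slot": "A"}] * 5
    + [{"annotator_id": "u2", "chosen_slot": "B"}] * 5
)
weights = annotator_weights(data)
subset = balanced_subset(data, random.Random(0))
```

Neither approach can tell which individual judgments were driven by position, which is why they only partially correct the signal.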

Q3: What is DPO and how does it differ from PPO-based RLHF?

DPO (Direct Preference Optimization, Rafailov et al. 2023) is a training method that fine-tunes language models directly from preference pairs without training a separate reward model or using RL.

The PPO-based RLHF pipeline has three components: the reference model, the reward model, and the policy model. Training requires: (1) fine-tuning the reference model with supervised learning, (2) training the reward model on human preferences, (3) running PPO to optimize the policy model against the reward signal while applying a KL divergence penalty to prevent the policy from drifting too far from the reference model. This is complex infrastructure: at least three models in memory simultaneously (typical PPO implementations add a value model as a fourth), a reward signal that can be noisy, and PPO hyperparameters that are difficult to tune.

DPO re-parameterizes the reward model as a closed-form expression of the policy and reference model ratio. This eliminates the need for a separate reward model and the RL training loop. The DPO objective can be optimized with standard supervised fine-tuning infrastructure. Training is more stable, faster, and requires less infrastructure.
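To make the re-parameterization concrete, here is a minimal per-pair sketch of the DPO loss from Rafailov et al. 2023, written with plain floats rather than a tensor library. The inputs are assumed to be summed sequence log-probabilities of the chosen and rejected responses under the policy and the frozen reference model. The function name and the `beta=0.1` default are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The implicit reward here is `beta` times the policy-to-reference log-ratio, so no separate reward model is evaluated: the loss depends only on four log-probabilities per pair.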

The practical tradeoffs: DPO is often preferred for instruction tuning and harmlessness training where the preference signal is relatively straightforward. PPO remains preferred when you need fine-grained reward shaping (different weights for different dimensions), when you have a strong existing reward model, or when the task requires the flexibility of RL policy optimization. Many teams now default to DPO for new training runs and reserve PPO for specialized cases.

Q4: How do you design the annotation interface to minimize systematic biases?

Five design decisions have the largest impact on annotation quality:

Position randomization: Implement randomized A/B assignment at the database level, verified with a weekly statistical audit of the A-preference rate. This is the single most impactful change for preference data quality.

Model blinding: Never reveal model identities. Use opaque response IDs. Audit the interface code for any metadata leakage (response timestamp patterns, formatting differences that signal model identity).

Context completeness: Always display the full conversation history before the response pair. A response that looks better in isolation can be the wrong choice once the annotator sees context from two turns earlier, so judgments made on a single turn are unreliable for multi-turn conversations.

Confidence collection: Require confidence ratings alongside preference. This enables confidence-weighted aggregation, which is more robust than simple majority vote. Also collect "reasons for preference" as a multi-select checklist - this provides diagnostic signal about which dimensions drive preference decisions.

Rationale requirements: For high-stakes preferences (safety, medical, legal), require free text justification. This slows annotation by 30-60 seconds per task but dramatically improves quality - annotators engage more carefully when they must articulate their reasoning. Track annotators who submit identical boilerplate rationales across multiple tasks - they are not engaging genuinely.
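The confidence-weighted aggregation mentioned under confidence collection can be sketched like this. It assumes each vote has already been mapped back to the original response order (`"first"` or `"second"`) and carries a confidence in (0, 1]. The function name and the returned `margin` field are illustrative.

```python
from collections import defaultdict


def aggregate_preference(votes):
    """votes: list of (choice, confidence), choice in {"first", "second"}."""
    totals = defaultdict(float)
    for choice, confidence in votes:
        if choice not in ("first", "second"):
            raise ValueError(f"unexpected choice: {choice!r}")
        totals[choice] += confidence
    first, second = totals["first"], totals["second"]
    if first == second:
        # Covers exact ties and the empty-vote case.
        return {"winner": None, "margin": 0.0}
    winner = "first" if first > second else "second"
    # margin is the normalized confidence gap between the two options.
    return {"winner": winner, "margin": abs(first - second) / (first + second)}


# Two low-confidence votes for "second" against one high-confidence vote for "first".
result = aggregate_preference([("first", 0.9), ("second", 0.4), ("second", 0.4)])
```

In this example a simple majority vote would pick `"second"`, while the weighted total favors `"first"` by a narrow margin. A small margin like this is a reasonable signal to route the pair for additional annotation rather than train on it directly.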

Q5: How do you detect and handle low-quality annotators in a preference collection pipeline?

Multi-layer detection catches different failure modes:

Gold pair calibration: Create pairs where the "correct" preference is unambiguous (e.g., a factually accurate response versus one with clear errors). Inject these into the annotation queue. Track each annotator's accuracy on gold pairs. Below 80% accuracy on gold pairs with clear correct answers indicates the annotator is not reading carefully.

Test-retest consistency: Show the same response pair to the same annotator twice, several weeks apart. Annotators with less than 65% consistency on duplicate pairs are unreliable. Target consistency above 75% for inclusion in training data.

Position bias monitoring: Track per-annotator A-preference rates. Rates outside 35-65% indicate position bias. Rates outside 30-70% are severe enough to exclude the annotator from training data entirely.

Speed monitoring: Very fast annotations (under 12 seconds for average-length responses) indicate the annotator is not reading both responses. Very slow annotations (over 10 minutes) indicate distraction or interruption. Time-gate filters should run automatically on every submission.

Agreement with peers: For items where 3+ annotators labeled the same pair, track each annotator's agreement rate with the majority consensus. Annotators below 50% agreement with majority consensus are either applying different criteria or not reading carefully - both require investigation.

When a low-quality annotator is detected: (1) remove their annotations from the training dataset, (2) retrain the annotator or remove them from the annotator pool, (3) have reliable annotators re-annotate any items that only the flagged annotator had labeled.
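The detection layers above can be combined into a single per-annotator check. This sketch assumes the four rates have already been computed upstream. The `AnnotatorStats` fields and flag names are hypothetical, while the thresholds are the ones stated in this answer.

```python
from dataclasses import dataclass


@dataclass
class AnnotatorStats:
    gold_accuracy: float       # share of gold pairs answered correctly
    retest_consistency: float  # share of duplicate pairs answered the same way
    a_rate: float              # share of tasks where slot A was chosen
    peer_agreement: float      # share of multi-annotator items matching the majority


def quality_flags(s):
    """Return the list of checks this annotator fails."""
    flags = []
    if s.gold_accuracy < 0.80:
        flags.append("gold_accuracy")
    if s.retest_consistency < 0.65:
        flags.append("retest_consistency")
    if s.a_rate < 0.30 or s.a_rate > 0.70:
        flags.append("position_bias_severe")
    elif s.a_rate < 0.35 or s.a_rate > 0.65:
        flags.append("position_bias")
    if s.peer_agreement < 0.50:
        flags.append("peer_agreement")
    return flags


def time_gate(seconds, min_seconds=12.0, max_seconds=600.0):
    """Classify a single submission by how long the annotator spent on it."""
    if seconds < min_seconds:
        return "too_fast"
    if seconds > max_seconds:
        return "too_slow"
    return "ok"


good = quality_flags(AnnotatorStats(0.90, 0.80, 0.50, 0.70))
bad = quality_flags(AnnotatorStats(0.75, 0.60, 0.72, 0.40))
```

Returning named flags rather than a single pass/fail value keeps the reason for exclusion auditable, which helps when deciding between retraining an annotator and removing them.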

© 2026 EngineersOfAI. All rights reserved.