Human Evaluation

The Leaderboard Anomaly

It is Q3 review season at a major AI lab. Two models are on the internal leaderboard. Model A scores 87.2% on MMLU, 84.1% on HellaSwag, and 0.82 on BERTScore for the summarization benchmark. Model B scores 82.4%, 81.3%, and 0.79 - worse on every automatic metric. The team prepares to ship Model A.

Then the human evaluation results come back. Two hundred annotators rated 1,000 side-by-side response pairs. Model B wins 62% of comparisons. Users find it more helpful, clearer, and more honest. It also hallucinates less in the categories that matter most to real users - medical and legal queries where confident-sounding wrong answers are dangerous.

Model A had learned to game the benchmarks. Its responses were tuned to the surface patterns that benchmarks and automatic metrics reward - formal vocabulary, longer responses, benchmark-style phrasing - but those patterns did not translate to real user value. Model B had been fine-tuned with a focus on accuracy and clarity, which scored slightly worse on metrics but clearly better in human preference.

This is why human evaluation remains the gold standard for LLM quality. It is expensive, slow, and hard to scale - but it measures what actually matters: whether real humans find the model's outputs useful, accurate, and appropriate. Everything else is a proxy.

Why This Exists - The Limits of Automation

By the time you finish Lesson 02 on BLEU and ROUGE, you understand what automatic metrics measure. The harder question is what they miss.

What automatic metrics cannot capture:

Factual accuracy vs confident-sounding text: A model can produce grammatically perfect, semantically coherent falsehoods that score well on all automatic metrics.
Task completion: Did the model actually do what was asked? Answering a different, easier question with high-quality text is not success.
User experience: Tone, warmth, clarity, whether the response is appropriate for the context - none of these have reliable automatic proxies.
Safety: A response can be technically accurate but inappropriate, harmful in context, or violating implicit expectations.
Long-form coherence: A 2,000-word essay can have low perplexity and high ROUGE scores while being logically incoherent.
Cultural and contextual appropriateness: What counts as helpful in one cultural context may be rude in another.

Human evaluation is not just a fallback when automatic metrics are unavailable. It is the calibration standard that validates (or invalidates) every automatic metric we use.

Historical Context

The field of human evaluation methodology for NLP was largely developed in the machine translation community. Early MT systems were evaluated by bilingual human experts using adequacy (does it convey the same information?) and fluency (is it natural text?) scales.

The shift from MT to general LLM evaluation brought new challenges. MT evaluation is relatively constrained - you can define correctness against the source text. LLM evaluation is open-ended - helpfulness, honesty, and harmlessness are multidimensional and subjective.

Key milestones:

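2017: Christiano et al. (OpenAI and DeepMind) introduce reinforcement learning from human preferences
2022: OpenAI's InstructGPT applies RLHF with systematic preference evaluation to LLMs
2022: Anthropic's RLHF and Constitutional AI papers document preference data collection methodology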
2023: LMSYS Chatbot Arena launches - crowdsourced preference evaluation at scale
2023: MT-Bench establishes LLM-as-judge as a complement to human evaluation
2024: Scale AI and other annotation platforms develop specialized LLM evaluation protocols

The Three Evaluation Paradigms

Absolute Scoring (Likert Scale)

The simplest approach: present a single response to an annotator and ask them to rate it on a scale. Common scales:

5-point helpfulness scale:

5: Exceptionally helpful - fully addresses the request with accurate, well-organized, appropriately concise information
4: Mostly helpful - addresses the request with minor gaps or minor inaccuracies
3: Partially helpful - addresses part of the request but misses key components
2: Minimally helpful - tangentially related but largely misses the request
1: Not helpful - fails to address the request, incorrect, or harmful

7-point scales are used when finer granularity is needed (clinical settings, high-stakes evaluations).

Advantages: Simple to implement, enables longitudinal tracking (how is quality changing over time?), supports multi-dimensional evaluation.

Disadvantages: Annotators develop different internal standards (one annotator's "3" is another's "4"). Absolute scores are not reliable across annotator pools without calibration.
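One common way to soften the "one annotator's 3 is another's 4" problem is to standardize each annotator's ratings against their own mean and spread before pooling them. The sketch below is a minimal illustration, not a prescribed method; the function name `zscore_by_annotator` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_annotator(ratings):
    """ratings: list of (annotator_id, item_id, score) tuples.

    Returns {(annotator_id, item_id): z}, where z is the score standardized
    against that annotator's own mean and standard deviation.
    """
    by_annotator = defaultdict(list)
    for annotator, _item, score in ratings:
        by_annotator[annotator].append(score)

    normalized = {}
    for annotator, item, score in ratings:
        scores = by_annotator[annotator]
        mu = mean(scores)
        sigma = pstdev(scores)
        # An annotator who gives every item the same score carries no signal;
        # map those ratings to 0 rather than dividing by zero.
        normalized[(annotator, item)] = 0.0 if sigma == 0 else (score - mu) / sigma
    return normalized


# Hypothetical data: "strict" rates one point lower than "lenient" on every item.
ratings = [
    ("lenient", "q1", 5), ("lenient", "q2", 4), ("lenient", "q3", 3),
    ("strict", "q1", 4), ("strict", "q2", 3), ("strict", "q3", 2),
]
z = zscore_by_annotator(ratings)
# After normalization both annotators assign the same value to each item.
print(z[("lenient", "q1")], z[("strict", "q1")])
```

Standardizing removes a constant offset between annotators, but it also assumes each annotator saw a comparable mix of items. If one annotator only rated weak responses, their low mean is real signal and should not be normalized away.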

Side-by-Side Comparison (Preference Evaluation)

Show annotators two responses to the same prompt - Response A and Response B - and ask which they prefer and why.

Variants:

Binary preference: A is better, B is better, or tie
Graded preference: A is much better, A is slightly better, tie, B is slightly better, B is much better (5-point)
Forced choice: Ties not allowed - must pick one

Side-by-side comparison is more reliable than absolute scoring because it anchors the judgment. Instead of "is this a 3 or a 4?", annotators answer "is this better or worse than that?" - a much easier cognitive task with higher inter-annotator agreement.

This is the basis for RLHF preference data collection. OpenAI's InstructGPT trained on human preferences collected this way. The Elo rating system (used in Chatbot Arena) aggregates these pairwise comparisons into a global ranking.
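The simplest summary of a side-by-side study is a win rate. The sketch below assumes votes are recorded as the strings "A", "B", or "tie" (a hypothetical encoding) and counts a tie as half a win for each side, which is one common convention; dropping ties is another.

```python
from collections import Counter


def win_rate(votes, tie_weight=0.5):
    """votes: iterable of "A", "B", or "tie". Returns A's win rate in [0, 1].

    Each tie contributes tie_weight of a win to A (0.5 by default).
    """
    counts = Counter(votes)
    total = counts["A"] + counts["B"] + counts["tie"]
    if total == 0:
        raise ValueError("no votes")
    return (counts["A"] + tie_weight * counts["tie"]) / total


# Hypothetical study: 55 votes for A, 35 for B, 10 ties.
votes = ["A"] * 55 + ["B"] * 35 + ["tie"] * 10
print(win_rate(votes))  # 0.6
```

Report the tie rate alongside the win rate. A 60% win rate with 10% ties and a 60% win rate with 50% ties describe very different studies.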

Error Annotation

The most detailed approach: give annotators a full response and ask them to identify and label specific failure types.

Common error taxonomies:

Factual errors: incorrect claims, wrong dates/numbers, false attributions
Logical errors: non-sequitur conclusions, circular reasoning, contradictions
Safety violations: harmful content, privacy violations, policy violations
Instruction non-compliance: failing to follow formatting instructions, ignoring constraints
Hallucinated citations: claiming to cite sources that don't exist or don't say what's claimed
Verbosity: unnecessary padding, repeating information, not getting to the point

Error annotation requires more expert annotators and more time per response, but produces the most actionable feedback for model improvement.
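Error annotations are easiest to analyze when each flagged span is stored as a structured record. The sketch below is one possible shape, assuming character offsets into the response text; the class `ErrorSpan`, the label strings, and `error_rates` are all hypothetical names chosen to mirror the taxonomy above.

```python
from collections import Counter
from dataclasses import dataclass

# Label names are illustrative stand-ins for the taxonomy above.
ERROR_TYPES = {
    "factual", "logical", "safety", "instruction_non_compliance",
    "hallucinated_citation", "verbosity",
}


@dataclass(frozen=True)
class ErrorSpan:
    response_id: str
    error_type: str   # one of ERROR_TYPES
    start: int        # character offset where the flagged span begins
    end: int          # character offset where it ends (exclusive)
    note: str = ""    # annotator's free-text justification

    def __post_init__(self):
        if self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {self.error_type}")
        if not 0 <= self.start < self.end:
            raise ValueError("span must satisfy 0 <= start < end")


def error_rates(spans, n_responses):
    """Fraction of responses with at least one error of each type."""
    seen = {(s.response_id, s.error_type) for s in spans}
    counts = Counter(error_type for _rid, error_type in seen)
    return {t: counts[t] / n_responses for t in sorted(ERROR_TYPES)}


spans = [
    ErrorSpan("r1", "factual", 10, 42, "wrong year"),
    ErrorSpan("r1", "factual", 80, 95),
    ErrorSpan("r2", "verbosity", 0, 120),
]
rates = error_rates(spans, n_responses=4)
print(rates["factual"], rates["verbosity"])  # 0.25 0.25
```

Counting responses rather than spans keeps one error-dense response from dominating the rate; span counts are still worth keeping for severity analysis.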

Evaluation Dimensions

The dimensions you ask humans to evaluate should match your product goals. Common frameworks:

The H3 Framework (Helpfulness, Harmlessness, Honesty)

Used extensively at Anthropic:

Helpfulness: Does the response do what the user needs?
Harmlessness: Does the response avoid causing harm?
Honesty: Is the response accurate and appropriately calibrated?

These three dimensions can conflict. A maximally helpful response to "tell me how to make explosives" would be harmful. A maximally honest response might be distressingly blunt. Good models navigate these tensions - and human evaluation is the only way to measure how well they do.

Extended Dimensions

For production LLMs, common additional dimensions:

Coherence: Does the response flow logically? Is it well-organized?
Fluency: Is the language natural and grammatically correct?
Groundedness: Are claims supported by the provided context (for RAG)?
Calibration: Does the model express appropriate uncertainty?
Relevance: Does the response stay on topic?

Annotator Selection and Training

Who Should Annotate

The right annotators depend on the task:

Task	Annotator Type	Why
General helpfulness	Native speakers, general audience	Represents typical users
Medical QA	MD or medical students	Domain knowledge required for accuracy
Code generation	Software engineers	Must run code to verify correctness
Legal analysis	JD or paralegal	Domain knowledge for accuracy
Safety evaluation	Trained safety researchers	Specialized framework required
Multilingual	Native speakers of target language	Not fluent second-language speakers

Annotation Training Process

Written guidelines: A detailed annotation guide covering every dimension, with 10–20 worked examples showing the full range of scores.
Calibration sessions: Annotators independently rate a set of "gold" examples where expert consensus is known. Discuss disagreements as a group.
Agreement gates: Require annotators to achieve minimum inter-annotator agreement (e.g., kappa > 0.6) on the calibration set before annotating production data.
Ongoing calibration: Periodically insert calibration examples into the annotation queue to monitor for annotator drift.
Feedback loops: Regularly audit samples, flag outlier annotators, provide individual feedback.
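The agreement gate in the process above can be sketched as a comparison between a candidate annotator's calibration labels and the expert gold labels. This is a minimal, dependency-free illustration using unweighted Cohen's kappa; the function names, the sample labels, and the handling of the degenerate single-label case are assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both sequences use one identical label throughout. Kappa is formally
        # undefined here; this sketch treats it as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def passes_gate(annotator_labels, gold_labels, threshold=0.6):
    """True if the annotator's kappa against gold exceeds the threshold."""
    return cohen_kappa(annotator_labels, gold_labels) > threshold


# Hypothetical calibration set of 10 items with 3 categories.
gold = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1]
careful = [1, 1, 2, 2, 3, 3, 1, 2, 3, 2]    # one mistake
careless = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]   # four mistakes

print(passes_gate(careful, gold))    # True
print(passes_gate(careless, gold))   # False
```

Ten items is only enough to illustrate the mechanics. A real calibration set needs to be large enough that a single slip does not decide whether an annotator passes.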

Inter-Annotator Agreement

How do you know if your annotation process is reliable? You need multiple annotators per item and a statistical measure of agreement.

Cohen's Kappa

For two annotators with categorical labels:

$\kappa = \frac{P_o - P_e}{1 - P_e}$

where $P_o$ is observed agreement and $P_e$ is expected agreement by chance.

$\kappa$ ranges from -1 to 1:

less than 0: worse than chance
0.0–0.20: slight agreement
0.21–0.40: fair agreement
0.41–0.60: moderate agreement
0.61–0.80: substantial agreement
0.81–1.00: almost perfect agreement

For LLM evaluation, $\kappa$ > 0.6 is typically required to trust the data. Achieving $\kappa$ > 0.7 on open-ended helpfulness ratings is considered excellent.
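A worked example with hypothetical numbers shows why raw agreement overstates reliability. Two annotators label 100 responses as helpful or not helpful. They both say helpful on 60, both say not helpful on 20, and split on the remaining 20 (10 each way), so each annotator marks 70 responses as helpful.

$P_o = \frac{60 + 20}{100} = 0.80$

$P_e = (0.70)(0.70) + (0.30)(0.30) = 0.58$

$\kappa = \frac{0.80 - 0.58}{1 - 0.58} = \frac{0.22}{0.42} \approx 0.52$

Raw agreement of 80% shrinks to moderate agreement once chance is removed, because both annotators choose "helpful" most of the time and would often coincide even if they were guessing.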

Krippendorff's Alpha

For more than two annotators, ordinal or interval scales, or missing data:

$\alpha = 1 - \frac{D_o}{D_e}$

where $D_o$ is observed disagreement and $D_e$ is expected disagreement by chance.

Krippendorff's alpha handles:

Any number of annotators
Ordinal scales (where 1 vs 5 disagreement is worse than 1 vs 2)
Missing data (not every annotator needs to rate every item)

Target: $\alpha$ > 0.667 for "tentative conclusions," $\alpha$ > 0.800 for "confident conclusions" (Krippendorff's recommendation).
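To make the $D_o$ and $D_e$ terms concrete, here is a from-scratch sketch of alpha for the nominal case only, built on the standard coincidence-matrix construction. It is meant for understanding rather than production use: ordinal and interval data need a different distance function, which an established package handles. The function name and input layout are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: one inner list per item, holding the labels that item received
    (None marks a missing rating). Returns Krippendorff's alpha, nominal metric.
    """
    coincidences = Counter()   # (label_c, label_k) -> weighted pair count
    for labels in units:
        present = [x for x in labels if x is not None]
        m = len(present)
        if m < 2:
            continue           # items with fewer than two ratings cannot be paired
        # Every ordered pair of ratings within an item counts 1 / (m - 1).
        for c, k in permutations(present, 2):
            coincidences[(c, k)] += 1 / (m - 1)

    totals = Counter()         # n_c: marginal total per label
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")

    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        return 1.0             # only one label was ever used
    return 1 - observed / expected


# Illustrative data: 15 items, 3 annotators, labels 1-4, many missing ratings.
units = [
    [None, 1, None], [None, None, None], [None, 2, 2], [None, 1, 1],
    [None, 3, 3], [3, 3, 4], [4, 4, 4], [1, 3, None], [2, None, 2],
    [1, None, 1], [1, None, 1], [3, None, 3], [3, None, 3],
    [None, None, None], [3, None, 4],
]
print(round(krippendorff_alpha_nominal(units), 3))  # 0.691
```

Items with a single rating contribute nothing, which is how alpha tolerates missing data without imputation.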

Fleiss' Kappa

For multiple annotators with categorical labels:

$\kappa_F = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$

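where $\bar{P}$ is the mean per-item agreement and $\bar{P}_e$ is the agreement expected by chance from the overall category proportions. Use it when every item receives the same number of ratings (the raters themselves may differ from item to item) and the labels are categorical (not ordinal).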
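The Fleiss formula can be computed directly from a table of per-item category counts. The sketch below assumes `counts[i][j]` holds how many annotators put item `i` in category `j`; the function name and the example table are illustrative.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of annotators who assigned item i to category j.

    Every row must sum to the same number of ratings n (n >= 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Per-item agreement P_i, averaged into P-bar.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Overall category proportions p_j, then chance agreement P_e-bar.
    n_categories = len(counts[0])
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e_bar = sum(p * p for p in p_j)
    if p_e_bar == 1.0:
        return 1.0  # a single category was used everywhere; kappa is undefined
    return (p_bar - p_e_bar) / (1 - p_e_bar)


# Illustrative table: 10 items, 14 ratings each, 5 categories.
counts = [
    [0, 0, 0, 0, 14], [0, 2, 6, 4, 2], [0, 0, 3, 5, 6], [0, 3, 9, 2, 0],
    [2, 2, 8, 1, 1], [7, 7, 0, 0, 0], [3, 2, 6, 3, 0], [2, 5, 3, 2, 2],
    [6, 5, 2, 1, 0], [0, 2, 2, 3, 7],
]
print(round(fleiss_kappa(counts), 3))  # 0.21
```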

Computing Inter-Annotator Agreement in Python

import numpy as np
from typing import List, Optional
from sklearn.metrics import cohen_kappa_score
import krippendorff  # pip install krippendorff

def compute_pairwise_kappa(
    annotator_labels: dict[str, List[int]],
    item_ids: Optional[List[str]] = None,
) -> dict:
    """
    Compute all pairwise Cohen's kappa values between annotators.

    Args:
        annotator_labels: {annotator_id: [label_for_item_1, label_for_item_2, ...]}
        item_ids: Optional list of item identifiers

    Returns:
        dict with pairwise kappa values and mean kappa
    """
    annotators = list(annotator_labels.keys())
    n_annotators = len(annotators)

    pairwise_kappas = {}
    kappa_values = []

    for i in range(n_annotators):
        for j in range(i + 1, n_annotators):
            a1 = annotators[i]
            a2 = annotators[j]

            labels1 = annotator_labels[a1]
            labels2 = annotator_labels[a2]

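            # Filter to items both annotators rated. A missing rating may be
            # None (hand-built dicts) or NaN (values taken from a pandas pivot).
            both_rated = [
                (l1, l2) for l1, l2 in zip(labels1, labels2)
                if l1 is not None and l2 is not None
                and not (np.isnan(l1) or np.isnan(l2))
            ]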

            if len(both_rated) < 10:
                print(f"Warning: {a1} vs {a2}: only {len(both_rated)} common items")
                continue

            y1, y2 = zip(*both_rated)
            kappa = cohen_kappa_score(y1, y2, weights="linear")  # Linear for ordinal
            pairwise_kappas[f"{a1}_vs_{a2}"] = round(kappa, 4)
            kappa_values.append(kappa)

    mean_kappa = np.mean(kappa_values) if kappa_values else 0.0

    interpretation = _interpret_kappa(mean_kappa)

    return {
        "pairwise_kappas": pairwise_kappas,
        "mean_kappa": round(mean_kappa, 4),
        "interpretation": interpretation,
        "n_annotators": n_annotators,
    }


def _interpret_kappa(kappa: float) -> str:
    if kappa < 0:
        return "worse than chance - check annotation setup"
    elif kappa < 0.20:
        return "slight agreement - not reliable for decisions"
    elif kappa < 0.40:
        return "fair agreement - use with caution"
    elif kappa < 0.60:
        return "moderate agreement - acceptable for research"
    elif kappa < 0.80:
        return "substantial agreement - good for production use"
    else:
        return "almost perfect agreement - excellent"


def compute_krippendorff_alpha(
    annotator_labels: dict[str, List[Optional[int]]],
    level_of_measurement: str = "ordinal",
) -> dict:
    """
    Compute Krippendorff's alpha for multiple annotators.

    Args:
        annotator_labels: {annotator_id: [label_or_None, ...]}
        level_of_measurement: "nominal", "ordinal", "interval", or "ratio"

    Returns:
        dict with alpha value and interpretation
    """
    # Convert to reliability data matrix (annotators x items)
    annotator_list = list(annotator_labels.keys())
    data = np.array([
        [label if label is not None else np.nan for label in annotator_labels[a]]
        for a in annotator_list
    ], dtype=float)

    alpha = krippendorff.alpha(
        reliability_data=data,
        level_of_measurement=level_of_measurement,
    )

    interpretation = (
        "not reliable (< 0.667)" if alpha < 0.667
        else "tentatively reliable (0.667–0.800)" if alpha < 0.800
        else "reliable (> 0.800)"
    )

    return {
        "krippendorff_alpha": round(alpha, 4),
        "interpretation": interpretation,
        "level_of_measurement": level_of_measurement,
        "n_annotators": len(annotator_list),
    }


def analyze_annotation_dataset(
    annotations_df,  # pandas DataFrame: columns = [item_id, annotator_id, label]
    label_column: str = "label",
) -> dict:
    """
    Full analysis of an annotation dataset.
    Reports agreement, identifies problematic items, flags outlier annotators.
    """
    import pandas as pd

    # Pivot to annotator x item matrix
    pivot = annotations_df.pivot(
        index="annotator_id",
        columns="item_id",
        values=label_column,
    )

    annotator_labels = {
        annotator: list(pivot.loc[annotator])
        for annotator in pivot.index
    }

    kappa_results = compute_pairwise_kappa(annotator_labels)
    alpha_results = compute_krippendorff_alpha(annotator_labels)

    # Find high-disagreement items
    item_std = pivot.std(axis=0)
    high_disagreement_items = item_std.nlargest(10).index.tolist()

    # Find outlier annotators
    annotator_means = pivot.mean(axis=1)
    overall_mean = annotator_means.mean()
    outlier_annotators = annotator_means[
        abs(annotator_means - overall_mean) > 2 * annotator_means.std()
    ].index.tolist()

    return {
        "kappa": kappa_results,
        "alpha": alpha_results,
        "high_disagreement_items": high_disagreement_items,
        "outlier_annotators": outlier_annotators,
        "mean_label": round(float(annotations_df[label_column].mean()), 4),
        "label_distribution": annotations_df[label_column].value_counts().to_dict(),
    }


# Example usage
if __name__ == "__main__":
    # Simulate annotation data: 3 annotators, 20 items, 1-5 scale
    np.random.seed(42)

    n_items = 20
    # Ground truth scores
    true_scores = np.random.randint(1, 6, n_items)

    # Annotators have different noise levels
    def add_noise(scores, noise_level):
        noisy = scores + np.random.randint(-noise_level, noise_level + 1, len(scores))
        return np.clip(noisy, 1, 5).tolist()

    annotator_labels = {
        "annotator_1": add_noise(true_scores, 1),
        "annotator_2": add_noise(true_scores, 1),
        "annotator_3": add_noise(true_scores, 2),  # This one is noisier
    }

    kappa = compute_pairwise_kappa(annotator_labels)
    alpha = compute_krippendorff_alpha(annotator_labels)

    print(f"Mean kappa: {kappa['mean_kappa']} ({kappa['interpretation']})")
    print(f"Krippendorff alpha: {alpha['krippendorff_alpha']} ({alpha['interpretation']})")
    print(f"Pairwise kappas: {kappa['pairwise_kappas']}")

LMSYS Chatbot Arena

Chatbot Arena (launched in May 2023 by LMSYS, a research organization with roots at UC Berkeley) is one of the most widely cited public evaluation platforms for LLMs. Its key innovation is crowdsourced, real-user preference evaluation at scale.

How It Works

Users submit a prompt to the arena
Two anonymous models respond simultaneously
The user votes for the better response (or declares a tie)
Model identities are revealed after voting
An Elo rating is computed from all comparisons

Elo Rating System

Borrowed from chess, Elo converts pairwise preferences into a ranking. The expected score of model A against model B (its win probability, with a tie counted as half a win) is:

$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$

After each comparison, ratings update:

$R_A' = R_A + K \cdot (S_A - E_A)$

where $S_A$ is the actual outcome (1 if A won, 0 if lost, 0.5 if tied) and $K$ is the update factor (typically 32).
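The two formulas above translate directly into code. This is a minimal sketch of online Elo, not the Arena's production pipeline; the starting rating of 1000, the outcome encoding ("a", "b", "tie"), and the model names are assumptions.

```python
def expected_score(r_a, r_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(r_a, r_b, s_a, k=32):
    """Return updated (r_a, r_b) after one comparison.

    s_a is A's actual outcome: 1.0 for a win, 0.0 for a loss, 0.5 for a tie.
    """
    delta = k * (s_a - expected_score(r_a, r_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return r_a + delta, r_b - delta


def rate(battles, initial=1000.0, k=32):
    """battles: iterable of (model_a, model_b, outcome) in the order they occurred."""
    outcome_to_score = {"a": 1.0, "b": 0.0, "tie": 0.5}
    ratings = {}
    for model_a, model_b, outcome in battles:
        r_a = ratings.get(model_a, initial)
        r_b = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(
            r_a, r_b, outcome_to_score[outcome], k
        )
    return ratings


print(update_elo(1000, 1000, 1.0))  # (1016.0, 984.0)

battles = [
    ("model_x", "model_y", "a"),
    ("model_y", "model_z", "tie"),
    ("model_x", "model_z", "a"),
]
print(rate(battles))
```

Online updates depend on the order in which comparisons arrive, so the same set of votes replayed in a different order gives slightly different ratings. Averaging over shuffled orderings, or fitting all votes at once, reduces that sensitivity.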

Why Chatbot Arena Is More Trustworthy Than Academic Benchmarks

Real users, real prompts: The prompt distribution reflects what people actually want from LLMs, not what researchers construct for benchmarks.
No contamination: Users write new prompts continuously. There is no test set to contaminate.
Scale: Millions of comparisons aggregate into robust statistics. Small benchmark test sets are noisy.
Organic hard examples: Users naturally gravitate toward interesting, difficult queries where model differences are most visible.
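Harder to game: There is no fixed answer key to overfit to, so a model cannot easily be tuned to raise its Elo without users actually preferring its responses. Stylistic factors such as length and formatting can still sway votes, which is why the annotation biases covered later in this lesson matter here too.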

The downside: Chatbot Arena evaluates general-purpose helpfulness. Specialized domains (medical, legal, code) may require targeted arenas with domain-expert users.

Annotation Bias

The biggest threat to human evaluation validity is systematic bias in annotators. Key biases to guard against:

Position Bias

When comparing two responses side-by-side, annotators tend to prefer the first response they see. This is well-documented in research and practice. The fix: randomize which response appears on the left and which on the right, then aggregate. You can also test for bias directly: if you swap positions and re-evaluate, you should get the same relative preference. If you do not, you have position bias.
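A quick diagnostic for the position bias described above: when display order is randomized and there is no position effect, the response shown first should win about half of the decided comparisons regardless of which model is stronger. The sketch below assumes each judgment records which response was shown first; the function name and data layout are hypothetical.

```python
import math


def position_bias_report(judgments):
    """judgments: list of (first_shown, winner) pairs, where first_shown is the
    response displayed first ("A" or "B") and winner is "A", "B", or "tie".
    """
    decided = [(first, win) for first, win in judgments if win != "tie"]
    if not decided:
        raise ValueError("no non-tie judgments")
    n = len(decided)
    rate = sum(first == win for first, win in decided) / n
    return {
        "n_decided": n,
        "first_position_win_rate": rate,
        # Normal-approximation z-score against the no-bias value of 0.5.
        "z_score": (rate - 0.5) / math.sqrt(0.25 / n),
    }


# Hypothetical study with randomized order: A shown first 50 times, B 50 times.
judgments = (
    [("A", "A")] * 35 + [("A", "B")] * 15
    + [("B", "B")] * 25 + [("B", "A")] * 25
)
print(position_bias_report(judgments))
```

In this made-up data the first-shown response wins 60% of the time, with a z-score of about 2, which would justify a closer look. The check is only valid when order was actually randomized; if one model was always shown first, position and quality are confounded.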

Verbosity Bias

Annotators tend to prefer longer, more detailed responses, even when the longer response contains more errors or padding. Mitigation: train annotators to explicitly consider whether length is justified; show length-matched alternatives; include specific evaluation dimensions for conciseness.
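One way to screen for this is to measure how often the longer response wins. The sketch below assumes each comparison stores both response lengths (in words or tokens) and the winner; the function name and numbers are illustrative.

```python
def longer_response_win_rate(comparisons):
    """comparisons: list of (len_a, len_b, winner) with winner "A", "B", or "tie".

    Returns the share of decided, unequal-length comparisons won by the
    longer response.
    """
    longer_wins = 0
    decided = 0
    for len_a, len_b, winner in comparisons:
        if winner == "tie" or len_a == len_b:
            continue  # no length signal in these rows
        decided += 1
        longer = "A" if len_a > len_b else "B"
        longer_wins += winner == longer
    if decided == 0:
        raise ValueError("no decided comparisons with unequal lengths")
    return longer_wins / decided


# Hypothetical comparisons: (words in A, words in B, winner).
comparisons = [
    (320, 120, "A"), (300, 150, "A"), (90, 260, "B"),
    (200, 200, "A"), (110, 280, "A"), (400, 180, "tie"),
]
print(longer_response_win_rate(comparisons))  # 0.75
```

A rate well above 0.5 is a prompt to investigate, not proof of bias: longer answers are sometimes genuinely better. Checking the rate within length-matched or quality-controlled subsets helps separate the two.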

Self-Preference

LLMs acting as judges (covered in Lesson 04) tend to prefer outputs from their own model family. Human annotators show an analogous effect: preferences can vary systematically with cultural and linguistic background. Track annotator metadata and check for systematic differences.

Authority Bias

If annotators know which company made the model, they may favor the more prestigious name. Chatbot Arena's approach of hiding model identity until after the vote is the standard fix.

Crowdsourcing vs Expert Annotation

Dimension	Crowdsourcing (e.g., MTurk)	Expert Annotation
Cost	Low ($0.05–$1 per rating)	High ($50–$200/hour)
Scale	Very high (thousands/day)	Limited
Consistency	Lower (variable annotator quality)	Higher
Domain expertise	Limited	High
Use case	General helpfulness, preference	Safety, medical, legal, code
Quality control	Majority vote, spam filtering	High-agreement selection

When crowdsourcing works: General text quality, preference between responses, basic factual accuracy. Use majority vote (3–5 annotators per item) and reject annotators with suspiciously short completion times.

When expert annotation is required: Medical advice accuracy, legal analysis quality, code correctness (must run it), safety classification of edge cases, technical writing accuracy.
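The crowdsourcing quality controls described above (several annotators per item, a vote, and rejection of suspiciously fast ratings) can be sketched in a few lines. The 10-second cutoff, the tuple layout, and the choice to return `None` on a tied vote are assumptions for illustration.

```python
from collections import Counter


def majority_label(labels):
    """Most common label, or None when the top two labels are tied."""
    ranked = Counter(labels).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear winner; route the item to adjudication
    return ranked[0][0]


def aggregate(ratings, min_seconds=10.0):
    """ratings: list of (item_id, label, seconds_spent) tuples.

    Drops ratings completed faster than min_seconds, then votes per item.
    """
    by_item = {}
    for item_id, label, seconds in ratings:
        if seconds < min_seconds:
            continue  # too fast to have read the responses
        by_item.setdefault(item_id, []).append(label)
    return {item: majority_label(labels) for item, labels in by_item.items()}


# Hypothetical ratings for three items.
ratings = [
    ("q1", "A", 42.0), ("q1", "A", 38.5), ("q1", "B", 3.1),
    ("q2", "A", 40.0), ("q2", "B", 55.2),
    ("q3", "B", 61.0), ("q3", "B", 47.3), ("q3", "A", 52.8),
]
print(aggregate(ratings))  # {'q1': 'A', 'q2': None, 'q3': 'B'}
```

An odd number of annotators per item makes ties less likely for binary labels, but ties can still appear after fast ratings are dropped, so the aggregation step needs an explicit tie path.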

Production Engineering Notes

Annotation Interface Design

The annotation interface directly affects data quality. Key design principles:

Show only what annotators need: Do not show the model name, temperature, or other metadata that could bias ratings.
Random response ordering: For A/B comparisons, randomize which appears first.
Require explanation: Ask annotators to briefly explain their choice in free text. This prevents autopilot clicking and creates valuable diagnostic data.
Time tracking: Very fast responses (under 10 seconds for a 500-word reading task) are likely unreliable.
Calibration items: Include known-good and known-bad examples in every batch as quality checks.

Sample Size Calculation

How many annotations do you need for a statistically meaningful result?

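For preference evaluation (A vs B), detecting a 5 percentage point shift in win rate (55% vs 50%) with 80% power at a two-sided significance level of 0.05 takes roughly 1,600 comparisons under the two-proportion formula below. A 10 point difference needs about 400, and a 2 point difference needs close to 10,000. The required sample grows with the inverse square of the effect size, so halving the detectable difference roughly quadruples the cost.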

Use power analysis:

from scipy.stats import norm
import math

def preference_sample_size(
    effect_size_pct: float,  # e.g., 5 for 5 percentage points
    power: float = 0.80,
    alpha: float = 0.05,
) -> int:
    """
    Calculate sample size for preference evaluation.
    Assumes binary outcome (Model A wins or loses).
    """
    p1 = 0.50  # Null hypothesis: equal preference
    p2 = 0.50 + effect_size_pct / 100  # Alternative: Model A preferred by effect_size%

    # Z-scores for alpha and beta
    z_alpha = norm.ppf(1 - alpha / 2)  # Two-tailed
    z_beta = norm.ppf(power)

    # Standard formula for two proportions
    pooled_p = (p1 + p2) / 2
    n = (
        (z_alpha * math.sqrt(2 * pooled_p * (1 - pooled_p)) +
         z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    ) / (p2 - p1) ** 2

    return math.ceil(n)


# Examples
for effect in [2, 5, 10, 15]:
    n = preference_sample_size(effect)
    print(f"{effect}pp difference → {n} comparisons needed")

# Output:
# 2pp difference → 9806 comparisons needed
# 5pp difference → 1565 comparisons needed
# 10pp difference → 388 comparisons needed
# 15pp difference → 170 comparisons needed
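The calculation above uses the formula for comparing two independent proportions. When the data is a single set of pairwise votes and the question is whether Model A's win rate differs from 50%, a one-sample proportion test is the closer match and needs roughly half as many comparisons. The sketch below is an alternative under that assumption, using only the standard library; the function name is hypothetical, and ties are assumed to be excluded or split before counting.

```python
import math
from statistics import NormalDist


def one_sample_preference_n(effect_size_pct, power=0.80, alpha=0.05):
    """Comparisons needed to detect a win rate of 50% + effect_size_pct
    against the null of 50%, two-sided, using the normal approximation.
    """
    p0 = 0.50                            # null: no preference
    p1 = 0.50 + effect_size_pct / 100    # alternative win rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = (
        (z_alpha * math.sqrt(p0 * (1 - p0)) + z_beta * math.sqrt(p1 * (1 - p1)))
        / (p1 - p0)
    ) ** 2
    return math.ceil(n)


for effect in [2, 5, 10, 15]:
    print(f"{effect}pp difference: {one_sample_preference_n(effect)} comparisons")
```

Either formula counts independent comparisons. If the same annotator rates many pairs, or many pairs share a prompt, the effective sample size is smaller than the raw count.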

How OpenAI and Anthropic Run Internal Human Eval

OpenAI's approach (from InstructGPT paper):

About 40 contractors hired through Upwork and Scale AI
Detailed labeler instructions with worked examples
Preference data: labelers rank several responses to the same prompt, which yields pairwise comparisons
Side-by-side format, randomized ordering
Multiple labelers per item, majority vote
Ongoing calibration via "gold" examples

Anthropic's approach (from Claude's Constitution and RLHF papers):

Heavy emphasis on Constitutional AI principles in annotation guidelines
Annotators trained on Anthropic's explicit values framework
Three-way tradeoff evaluation: helpful vs harmless vs honest
Regular red-teaming integrated into evaluation pipeline

:::tip The 90/10 Rule

Spend 90% of your annotation budget on clear, representative cases and 10% on edge cases. Edge cases are critical for safety but rare enough that your overall dataset needs to be representative of normal usage. Over-indexing on edge cases creates models that over-refuse or are paranoid about normal queries.

:::

:::danger Do Not Conflate Annotator Agreement with Annotation Quality

High inter-annotator agreement means annotators agree with each other, not that they are correct. If you train annotators with biased guidelines, they will consistently agree on biased labels. Agreement measures consistency; correctness requires external validation against ground truth (where ground truth exists) or expert review.

:::

Common Mistakes

:::warning Insufficient Annotator Training

Throwing annotators at an annotation task without calibration produces low-quality, inconsistent data. Budget at least 2–4 hours of training and calibration per annotator before production annotation begins.

:::

:::danger Single Annotator Per Item

With a single annotator per item, you cannot measure reliability, detect annotator errors, or compute agreement. Use at minimum 2 annotators per item (3 for disagreement resolution) for any data that will inform product or model decisions.

:::

:::warning Evaluating on Unrepresentative Prompts

If your evaluation prompts come from one source (e.g., only academic questions, or only simple factual queries), your evaluation results will not generalize to production usage. Sample evaluation prompts from the same distribution as your actual users.

:::

Interview Q&A

Q1: Design a human evaluation study for a new LLM-based customer support feature.

Start with clear goals: what does success mean? Probably helpfulness (did the customer get their problem resolved?) and appropriateness (was the response professional, safe, and accurate?). Evaluation dimensions: helpfulness (1-5), accuracy (1-5 or binary correct/incorrect), tone appropriateness (1-5), and a safety flag (safe/unsafe/borderline). Annotators: use customer support domain experts plus a sample of real customers. Sample 100–200 real support tickets plus 50 adversarial cases. Three annotators per item. Train annotators with 20 calibration examples. Compute Krippendorff's alpha before going live with annotations. For comparison: run side-by-side against the current support system with real customers via A/B test (measuring resolution rate and CSAT) as the ground truth.

Q2: What is the difference between Cohen's kappa and Krippendorff's alpha? When would you use each?

Cohen's kappa is for exactly two annotators with categorical labels and no missing data. It corrects for chance agreement. Krippendorff's alpha is more general: it handles any number of annotators, ordinal or interval measurement levels, and missing data (not every annotator needs to rate every item). For ordinal scales (1–5 ratings), Krippendorff's alpha with ordinal measurement is more appropriate than unweighted Cohen's kappa because it treats disagreement between 1 and 5 as worse than disagreement between 1 and 2; weighted kappa can do the same, but only for a pair of annotators. In practice: use Cohen's kappa for quick two-annotator agreement checks; use Krippendorff's alpha for reporting reliability of production annotation pipelines.

Q3: What is the Chatbot Arena and why do practitioners trust it more than academic benchmarks?

Chatbot Arena is a crowdsourced preference evaluation platform where real users compare two anonymous models and vote for the better one. Elo ratings aggregate millions of comparisons into model rankings. It is more trusted than academic benchmarks because: (1) prompts come from real users with genuine needs, not from benchmark designers; (2) there is no test set to contaminate; (3) the scale (millions of votes) produces robust statistics; (4) models cannot easily be fine-tuned to game Elo without actually being better. Limitations: it biases toward general-purpose helpfulness and toward user demographics who visit the arena website; it may not capture specialized capabilities well.

Q4: How would you handle annotator disagreement in a safety evaluation task?

Safety annotation is particularly difficult because edge cases are ambiguous and context-dependent. My approach: (1) Require three annotators per item, not two. (2) For borderline cases (annotators disagree), escalate to a senior safety researcher who reviews with full context. (3) Use tiered labels: "clearly safe," "borderline/ambiguous," "clearly unsafe" - disagreements on "clearly" categories are more concerning than disagreements on "borderline." (4) Track disagreement rates by category - categories with chronic high disagreement indicate the annotation guidelines need refinement. (5) Do not resolve disagreements by majority vote for safety classification - if any annotator labels something as unsafe, it warrants review. The cost of a false negative (unsafe content not caught) exceeds the cost of a false positive (unnecessary review).

Q5: A model performs better in side-by-side evaluation but worse on absolute Likert scale ratings. What explains this?

This discrepancy is common and has several explanations. First, Likert scale ratings are subject to annotator calibration differences - what one annotator calls "4" another calls "3." Side-by-side comparisons anchor to a specific baseline, eliminating this calibration noise. Second, the model might be making relative improvements in areas that are hard to express on an absolute scale - slightly better structure, slightly more appropriate tone - but these improvements push it over the "I prefer this one" threshold without dramatically changing the absolute quality number. Third, scale compression: annotators tend to avoid the extremes of a Likert scale, so if the baseline averages 3, a model that is clearly better might average only 3.5 - not significantly higher in absolute terms, but clearly better in comparison. In general, side-by-side comparisons are more sensitive to small differences and more reliable than absolute Likert scores.
