Human Evaluation for Agents

The Irreplaceable Signal

Eventually, humans must evaluate your agent. Not just because LLM judges are imperfect - though they are - but because the ultimate test of any agent is whether real humans, with real needs, find it useful and trustworthy.

An agent that scores 88% on your automated metrics and satisfies your LLM judge may still frustrate users with its phrasing, annoy domain experts with its oversimplifications, or alarm safety reviewers with subtle policy violations that no automated system caught. Automated metrics measure proxies. Human evaluation measures the real thing.

This lesson is about doing human evaluation well. The difference between poorly designed human evaluation (which produces noise) and well-designed human evaluation (which produces actionable signal) is entirely in the protocol design. Get the protocol wrong and you spend weeks collecting data that tells you nothing. Get it right and a 50-person annotation study can guide six months of product development.


When Human Evaluation Is Mandatory

Not every evaluation requires humans. But some situations absolutely do:

Safety-critical deployments: Any agent in healthcare, legal, financial, or safety-related domains must be evaluated by domain experts before deployment. Automated metrics cannot catch the subtle ways an agent might mislead a patient, create legal liability, or trigger a dangerous financial action. Human expert review is not optional.

Novel task types: When your agent handles a task type it has not been extensively evaluated on - new domains, new tool sets, new user populations - existing automated metrics may not be calibrated for that task type. Human evaluation establishes the baseline from which you can calibrate automated metrics.

Calibrating LLM judges: Every LLM judge requires periodic human calibration (see Lesson 05). Without human ground truth, you cannot know whether your judge is measuring what you think it measures.

Regulatory requirements: In regulated industries (financial services, medical devices, public sector AI), human expert evaluation may be legally required before deployment.

Significant model changes: When switching model families (not just versions), the failure modes may change in ways your existing automated metrics do not capture. Human evaluation of the new failure modes is required before trusting automated metrics again.

Human Eval Design Principles

Good human evaluation design is harder than it looks. Every decision - task selection, annotator choice, question design, interface design - affects the reliability of your data.

Principle 1: Define Your Evaluation Goal First

Before designing the protocol, answer: what decision is this evaluation meant to inform?

"Should we ship this agent version?" → Absolute quality evaluation
"Is version B better than version A?" → Comparative evaluation
"Where are the biggest quality gaps?" → Diagnostic evaluation
"Is this safe to deploy?" → Safety evaluation

Different goals require different designs. Comparative evaluation needs paired ratings. Diagnostic evaluation needs fine-grained dimensions. Safety evaluation needs adversarial task selection and specialized annotators.

Principle 2: Task Selection Determines Everything

The tasks you evaluate on constrain everything you can learn. A carefully selected task set of 200 examples produces more useful signal than a carelessly selected set of 2000.

Task selection principles:

Representative: Cover the full distribution of real user queries, not just easy ones
Diverse: Include different task types, lengths, domains, and difficulty levels
Edge cases: Specifically include known hard cases, ambiguous requests, and error-prone scenarios
Adversarial: Include tasks designed to probe specific failure modes
Fresh: Use tasks from production that the model has not been trained on

Avoid curating tasks that you know the agent handles well - this produces inflated, unrepresentative scores.
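One way to make these selection principles concrete is stratified sampling from a production log. The sketch below is a minimal illustration, not a prescribed method: the `stratified_sample` helper, the `type` field, and the 90/10 log are all hypothetical. It draws a fixed number of tasks per stratum so that rare but important task types are not drowned out by common ones.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(tasks, key, per_stratum, seed=0):
    """Sample up to `per_stratum` tasks from each stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the task set is reproducible
    strata = defaultdict(list)
    for task in tasks:
        strata[key(task)].append(task)
    sample = []
    for name in sorted(strata):
        bucket = strata[name]
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample

# Hypothetical production log: 90% simple lookups, 10% multi-step tasks.
log = [{"id": i, "type": "lookup" if i % 10 else "multi_step"} for i in range(1000)]

picked = stratified_sample(log, key=lambda t: t["type"], per_stratum=50)
counts = Counter(t["type"] for t in picked)
print(dict(counts))
```

Equal-sized strata deliberately oversample rare task types, so reweight by production frequency when you report a single overall score.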

Annotator Selection

The annotator population determines the validity of your results. Wrong annotators produce confident but irrelevant data.

Domain Experts

When to use: safety review, technical accuracy verification, regulatory compliance, calibration set construction.

Cost: $50–$500/hour for qualified domain experts. A 200-task evaluation study might cost $5,000–$15,000.

Quality characteristics: High accuracy on domain-specific judgments, low throughput (10–20 tasks/hour), high variability in what they notice (different experts focus on different aspects of quality).

Practical considerations: Domain experts are hard to find, harder to schedule, and often have strong opinions that make inter-annotator agreement difficult. Invest in rubric alignment sessions before the study.

Crowdworkers

When to use: large-scale annotation, simple quality tasks (helpfulness, clarity), comparative preference, non-specialist judgments.

Platforms: Amazon Mechanical Turk (high volume, lower quality), Scale AI (managed, higher quality), Prolific (academic-focused, good quality for general tasks).

Cost: $0.05–$0.50/task for simple tasks, $1–$5/task for complex tasks.

Quality characteristics: High throughput (50–100 tasks/hour), high variance (some workers are excellent, many are careless), requires quality control mechanisms.

Quality control for crowdworkers:

Gold standard tasks: Insert known-answer items (10–15% of tasks). Workers who answer incorrectly get filtered.
Attention checks: Include obvious questions that attentive workers answer correctly.
Agreement thresholds: Flag workers whose agreement with peers is below 60%.
Time filters: Flag workers who complete tasks in less than 30 seconds; they are unlikely to be reading carefully.
Redundancy: 3–5 workers per task, use majority vote.
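The controls above can be combined into a small filtering pipeline. This is a sketch under stated assumptions rather than a reference implementation: the response record layout, the `trusted_workers` and `majority_vote` helpers, the 0.8 gold-accuracy cutoff, and the use of the median time are all illustrative choices. Only the 30-second threshold and the majority vote come from the list above.

```python
from collections import Counter, defaultdict

def trusted_workers(responses, gold, min_gold_accuracy=0.8, min_median_seconds=30):
    """Return worker ids that pass both the gold-standard and the time filter."""
    gold_hits = defaultdict(list)   # worker -> [True/False per gold item]
    seconds = defaultdict(list)     # worker -> time spent per task
    for r in responses:
        seconds[r["worker"]].append(r["seconds"])
        if r["task"] in gold:
            gold_hits[r["worker"]].append(r["label"] == gold[r["task"]])

    passed = set()
    for worker, secs in seconds.items():
        hits = gold_hits[worker]
        accuracy = sum(hits) / len(hits) if hits else 0.0  # no gold seen: not trusted
        median = sorted(secs)[len(secs) // 2]              # upper median
        if accuracy >= min_gold_accuracy and median >= min_median_seconds:
            passed.add(worker)
    return passed

def majority_vote(responses, trusted, gold):
    """Majority label per non-gold task, counting trusted workers only.
    Ties resolve to the first-seen label, so prefer an odd number of workers."""
    votes = defaultdict(Counter)
    for r in responses:
        if r["worker"] in trusted and r["task"] not in gold:
            votes[r["task"]][r["label"]] += 1
    return {task: c.most_common(1)[0][0] for task, c in votes.items()}

# Made-up data: g1 is a gold item, t1 is a real task; w4 is fast and wrong on gold.
gold = {"g1": "yes"}
responses = [
    {"worker": "w1", "task": "g1", "label": "yes", "seconds": 40},
    {"worker": "w1", "task": "t1", "label": "yes", "seconds": 55},
    {"worker": "w2", "task": "g1", "label": "yes", "seconds": 35},
    {"worker": "w2", "task": "t1", "label": "yes", "seconds": 60},
    {"worker": "w3", "task": "g1", "label": "yes", "seconds": 50},
    {"worker": "w3", "task": "t1", "label": "no", "seconds": 45},
    {"worker": "w4", "task": "g1", "label": "no", "seconds": 8},
    {"worker": "w4", "task": "t1", "label": "no", "seconds": 6},
]

trusted = trusted_workers(responses, gold)
labels = majority_vote(responses, trusted, gold)
print(sorted(trusted), labels)
```

Without the filters, task `t1` would be a 2-2 tie; once the careless worker is dropped, the remaining three produce a clear majority.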

Target Users

When to use: usability evaluation, final acceptance testing for new user segments, measuring helpfulness from the user's perspective.

Cost: Recruiting target users for paid studies costs $100–$300/participant for 60-minute sessions.

Quality characteristics: High ecological validity (you are measuring what matters to actual users), low throughput, difficult to recruit at scale.

Best used for: small qualitative studies (10–20 users) to understand the user experience, complementing quantitative automated evaluation.
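The cost figures in this section are easier to compare once they are turned into a study budget. The following is a back-of-the-envelope sketch: the function names are hypothetical, and treating gold items as extra tasks on top of the real ones is an assumption, not something the figures above specify.

```python
def crowd_cost(n_tasks, cost_per_task, redundancy=3, gold_fraction=0.10):
    """Crowdworker budget: every task (plus gold items) is rated `redundancy` times."""
    n_gold = round(n_tasks * gold_fraction)  # assumption: gold items are added on top
    return (n_tasks + n_gold) * redundancy * cost_per_task

def expert_cost(n_tasks, hourly_rate, tasks_per_hour, n_experts=1):
    """Expert budget: hours needed per expert times rate, for each expert."""
    return n_tasks / tasks_per_hour * hourly_rate * n_experts

# 2,000 simple tasks at $0.25 each, 3 workers per task, 10% gold items.
crowd_total = crowd_cost(2000, 0.25)

# 200 tasks, two experts at $150/hour, each rating 10 tasks/hour.
expert_total = expert_cost(200, hourly_rate=150, tasks_per_hour=10, n_experts=2)

print(crowd_total, expert_total)
```

With these illustrative inputs the expert study lands inside the $5,000–$15,000 range quoted earlier, while the crowdworker study covers ten times as many tasks for less money.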

What to Show Annotators

A critical and often overlooked design decision: what do annotators see?

| What to Show | Pros | Cons | Best For |
|---|---|---|---|
| Final output only | Fast, low cognitive load | Misses trajectory quality | Output quality evaluation |
| Full trajectory | Captures process quality | Slow, cognitively demanding | In-depth trajectory review |
| Step-by-step | Can rate each decision | Very slow (20+ minutes/task) | Research, failure analysis |
| Redacted trajectory | Removes irrelevant details | May remove useful context | Scaled trajectory review |

For most production evaluations, show annotators the final output only, plus the query. For diagnostic or safety evaluations, show the full trajectory. For large-scale crowdwork, showing the trajectory is usually not practical.
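A simple way to support several of these views from one stored record is a renderer with a mode switch. The sketch below assumes a hypothetical step schema (`tool`, `input`, `observation`) and a hypothetical `render_for_annotator` helper. The redaction rule (drop tool inputs, truncate long observations) is one possible choice among many.

```python
def render_for_annotator(query, steps, final_output, mode="final_only", max_chars=80):
    """Build the text an annotator sees for one task."""
    lines = [f"Query: {query}"]
    if mode == "full":
        for i, s in enumerate(steps, 1):
            lines.append(f"Step {i}: {s['tool']}({s['input']}) -> {s['observation']}")
    elif mode == "redacted":
        for i, s in enumerate(steps, 1):
            obs = s["observation"]
            if len(obs) > max_chars:
                obs = obs[:max_chars] + " [truncated]"
            lines.append(f"Step {i}: {s['tool']} -> {obs}")  # tool input omitted
    elif mode != "final_only":
        raise ValueError(f"unknown mode: {mode}")
    lines.append(f"Final output: {final_output}")
    return "\n".join(lines)

# Made-up trajectory with one long tool observation.
steps = [{"tool": "search", "input": "refund policy", "observation": "x" * 200}]

final_view = render_for_annotator("Can I get a refund?", steps, "Yes, within 30 days.")
redacted_view = render_for_annotator("Can I get a refund?", steps,
                                     "Yes, within 30 days.", mode="redacted")
full_view = render_for_annotator("Can I get a refund?", steps,
                                 "Yes, within 30 days.", mode="full")
print(redacted_view)
```

Keeping the view a rendering decision rather than a storage decision means the same annotated task set can later be re-reviewed with the full trajectory if a diagnostic question comes up.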

Question Design

The questions you ask annotators determine what you learn. Each question type has strengths and weaknesses:

Binary Questions (Yes/No)

Simple, high agreement, but coarse:

"Does this response answer the user's question?" (Y/N)
"Does this response contain any factually incorrect information?" (Y/N)
"Is this response safe and appropriate?" (Y/N)

Use binary questions for: safety checks, basic quality gates, high-volume crowdwork.

Likert Scale Questions (1-5 or 1-7)

Finer-grained but lower agreement:

"How helpful is this response? (1=Not at all helpful, 5=Extremely helpful)"
"How accurate is the information? (1=Very inaccurate, 5=Very accurate)"
"How clear and well-organized is this response? (1=Very confusing, 5=Extremely clear)"

Use Likert for: nuanced quality dimensions, when you need to distinguish between levels of good.

Comparative Questions

"Which response is better?" gives stronger signal than "Is this response good?":

"Response A or Response B: which is more helpful to the user?"
"Response A or Response B: which would you trust more?"
"Response A or Response B: which would you prefer to receive?"

Use comparative for: comparing two agent versions, where absolute judgment is difficult.
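Comparative judgments are usually summarized as a win rate, and it helps to check whether that win rate could plausibly be a coin flip. The sketch below uses an exact two-sided sign test (a binomial test against 0.5) and drops ties before testing. The function names and the preference counts are made up for illustration.

```python
from math import comb

def win_counts(preferences):
    """Count decisive judgments. `preferences` holds 'A', 'B', or 'tie'."""
    decisive = [p for p in preferences if p != "tie"]
    wins_b = sum(p == "B" for p in decisive)
    return wins_b, len(decisive)

def sign_test_p(wins, n):
    """Two-sided exact binomial test of `wins` out of `n` against p = 0.5."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Made-up study: 20 paired judgments comparing version A with version B.
preferences = ["B"] * 14 + ["A"] * 4 + ["tie"] * 2

wins_b, n_decisive = win_counts(preferences)
p_value = sign_test_p(wins_b, n_decisive)
print(f"B wins {wins_b}/{n_decisive} decisive comparisons, p = {p_value:.3f}")
```

Here version B wins 14 of 18 decisive comparisons, and the test gives a p-value of about 0.03, so the preference is unlikely to be chance alone. With so few items the estimate is still coarse, which is one reason to collect more pairs before making a ship decision.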

Open-Ended Questions

Richest qualitative signal, hardest to aggregate:

"What is the most important thing wrong with this response?"
"What would make this response better?"
"What would you change to make this more trustworthy?"

Use open-ended for: diagnostic evaluation, identifying failure modes, understanding user needs.

Rubric Design

Rubrics specify exactly what each score level means. Without rubrics, annotators interpret questions differently, producing low inter-annotator agreement.

HELPFULNESS RUBRIC

5 - Excellent: Response fully addresses the user's need, anticipates likely follow-up
questions, provides appropriate depth, and is organized for the user's context.

4 - Good: Response fully addresses the main need with appropriate depth. Minor aspects
could be improved (slightly more detail, better organization, etc.).

3 - Acceptable: Response addresses the main need but misses notable details or context
that the user would want. The core information is present.

2 - Poor: Response partially addresses the need. Significant aspects are missing,
incorrect, or off-topic. User would need to ask follow-up questions.

1 - Failing: Response fails to address the user's need. Off-topic, incorrect,
or completely unhelpful.

Note: Do not score based on response length. A brief accurate answer scores higher
than a lengthy inaccurate one.

Good rubric properties:

Each level has a concrete description with observable criteria
Edge cases and exceptions are noted
What NOT to consider is explicitly stated
Calibration examples accompany the rubric (show 2–3 examples at each score level)

Inter-Annotator Agreement

Inter-annotator agreement (IAA) measures how consistently your annotators rate the same items. Low IAA means your evaluation is producing noise, not signal.

Cohen's Kappa (2 annotators)

For two annotators on categorical ratings:

$\kappa = \frac{P_o - P_e}{1 - P_e}$

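Where $P_o$ is the observed proportion of items on which the two annotators agree and $P_e$ is the agreement expected by chance, computed from each annotator's own category frequencies.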

Interpretation:

$\kappa < 0.40$ : Poor agreement - your rubric or annotators need significant work
$0.40 \leq \kappa < 0.60$ : Moderate agreement - acceptable for exploration, not for decisions
$0.60 \leq \kappa < 0.80$ : Substantial agreement - acceptable for most production evaluations
$\kappa \geq 0.80$ : Near-perfect agreement - excellent
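As a worked example with made-up numbers, suppose two annotators give Y/N labels to 10 responses. Both say Y on 4 and both say N on 3, so they agree on 7 items. Annotator A says Y on 6 items in total and annotator B says Y on 5.

$P_o = \frac{7}{10} = 0.70$

$P_e = (0.6)(0.5) + (0.4)(0.5) = 0.50$

$\kappa = \frac{0.70 - 0.50}{1 - 0.50} = 0.40$

A raw agreement of 70% sounds reasonable, but after correcting for chance it sits at the very bottom of the moderate band.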

Fleiss' Kappa (multiple annotators)

For more than two annotators on the same items:

$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$

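Where $\bar{P}$ is the mean, over items, of the proportion of annotator pairs that agree on that item, and $\bar{P}_e$ is the chance agreement computed from the overall category proportions.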
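Written out with $N$ items, $n$ annotators per item, and $n_{ij}$ the number of annotators who assigned item $i$ to category $j$, the standard computation is:

$P_i = \frac{1}{n(n-1)} \sum_j n_{ij}(n_{ij} - 1), \quad \bar{P} = \frac{1}{N} \sum_i P_i$

$p_j = \frac{1}{Nn} \sum_i n_{ij}, \quad \bar{P}_e = \sum_j p_j^2$

As a small made-up example, three annotators give Y/N labels to three responses: all Y, two Y and one N, and all N. Then $P_1 = 1$, $P_2 = \frac{2 \cdot 1 + 1 \cdot 0}{3 \cdot 2} = \frac{1}{3}$, and $P_3 = 1$, so $\bar{P} = \frac{7}{9} \approx 0.78$. The category proportions are $p_Y = \frac{5}{9}$ and $p_N = \frac{4}{9}$, so $\bar{P}_e = \frac{41}{81} \approx 0.51$ and

$\kappa = \frac{7/9 - 41/81}{1 - 41/81} = \frac{22}{40} = 0.55$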

Krippendorff's Alpha

More general than kappa - handles missing data, works for ordinal/interval scales:

$\alpha = 1 - \frac{D_o}{D_e}$

Where $D_o$ is observed disagreement and $D_e$ is expected disagreement. Alpha is more appropriate than kappa for Likert-scale data, where the size of a disagreement matters (disagreeing by 4 points is worse than disagreeing by 1 point).

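Target values: For Likert scale agent evaluation data, target Krippendorff's alpha > 0.67. Krippendorff's own guidance treats roughly 0.67 as the floor for tentative conclusions and 0.80 as the bar for relying on the data.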
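For readers who want to see where $D_o$ and $D_e$ come from, here is a compact sketch of the coincidence-matrix formulation. It assumes the interval metric (squared difference), which treats Likert points as equally spaced. The function name and the ratings are illustrative. For production analysis, a vetted statistics library is the safer choice.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_interval(units):
    """units: one list of ratings per item; None marks a missing rating."""
    coincidences = Counter()  # (value_a, value_b) -> weighted pair count
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # a lone rating cannot be paired, so it is ignored
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()  # value -> number of pairable ratings with that value
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable ratings")

    d_o = sum(w * (a - b) ** 2 for (a, b), w in coincidences.items()) / n
    d_e = sum(totals[a] * totals[b] * (a - b) ** 2
              for a in totals for b in totals) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1 - d_o / d_e

# Made-up 1-5 ratings: five items, three annotators, two missing entries.
ratings = [
    [1, 1, None],
    [2, 2, 2],
    [3, 4, 3],
    [5, 5, 4],
    [4, None, 5],
]
alpha = krippendorff_alpha_interval(ratings)
print(round(alpha, 3))
```

On this data the annotators never differ by more than one point, and alpha comes out at about 0.89. Each item's pairs are weighted by $1/(m-1)$, where $m$ is the number of ratings the item received, which is what lets the statistic handle missing entries.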

Full Python: Human Evaluation Toolkit

"""
Human evaluation toolkit for agent outputs.
Includes: dataset manager, CLI annotation interface,
inter-annotator agreement calculator, result analyzer.
"""

import json
import os
import time
from dataclasses import dataclass, field, asdict
from statistics import mean, stdev
from typing import Optional
import math


# ── Data models ────────────────────────────────────────────────────────────────

@dataclass
class EvalTask:
    task_id: str
    query: str
    agent_response: str
    trajectory_summary: Optional[list[str]] = None
    metadata: dict = field(default_factory=dict)
    is_gold_standard: bool = False   # Known-answer item for quality control
    gold_rating: Optional[float] = None  # Expected composite score for gold standard (may be fractional, e.g. 4.5)


@dataclass
class Annotation:
    annotation_id: str
    task_id: str
    annotator_id: str
    timestamp: float

    # Ratings
    task_completion: int     # 1-5
    factual_accuracy: int    # 1-5
    helpfulness: int         # 1-5
    safety: int              # 1-5 (5 = very safe)

    # Qualitative
    primary_issue: Optional[str] = None
    improvement_suggestion: Optional[str] = None
    overall_notes: Optional[str] = None

    # Quality
    confidence: int = 3        # Annotator's confidence 1-5
    time_spent_seconds: float = 0.0

    def composite_score(self, weights: Optional[dict] = None) -> float:
        weights = weights or {
            "task_completion": 0.35,
            "factual_accuracy": 0.30,
            "helpfulness": 0.25,
            "safety": 0.10,
        }
        total = (
            weights["task_completion"] * self.task_completion +
            weights["factual_accuracy"] * self.factual_accuracy +
            weights["helpfulness"] * self.helpfulness +
            weights["safety"] * self.safety
        )
        return total / sum(weights.values())


# ── Dataset manager ────────────────────────────────────────────────────────────

class EvalDatasetManager:
    """
    Manages eval tasks and annotations, stored as JSON files.
    In production, replace with a database backend.
    """

    def __init__(self, data_dir: str):
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)
        self.tasks_path = os.path.join(data_dir, "tasks.json")
        self.annotations_path = os.path.join(data_dir, "annotations.json")
        self._tasks: dict[str, EvalTask] = {}
        self._annotations: dict[str, list[Annotation]] = {}
        self._load()

    def add_task(self, task: EvalTask):
        self._tasks[task.task_id] = task
        self._save()

    def add_annotation(self, annotation: Annotation):
        task_id = annotation.task_id
        if task_id not in self._annotations:
            self._annotations[task_id] = []
        self._annotations[task_id].append(annotation)
        self._save()

    def get_unannotated_tasks(self, annotator_id: str, limit: int = 10) -> list[EvalTask]:
        """Return tasks this annotator has not yet rated."""
        annotated = {
            ann.task_id
            for anns in self._annotations.values()
            for ann in anns
            if ann.annotator_id == annotator_id
        }
        unannotated = [t for t_id, t in self._tasks.items() if t_id not in annotated]
        # Put gold standard tasks first
        gold = [t for t in unannotated if t.is_gold_standard]
        regular = [t for t in unannotated if not t.is_gold_standard]
        return (gold + regular)[:limit]

    def get_multi_annotated_tasks(self, min_annotations: int = 2) -> list[EvalTask]:
        """Return tasks with at least min_annotations annotations."""
        return [
            self._tasks[task_id]
            for task_id, anns in self._annotations.items()
            if len(anns) >= min_annotations and task_id in self._tasks
        ]

    def all_annotations(self) -> list[Annotation]:
        return [ann for anns in self._annotations.values() for ann in anns]

    def _load(self):
        if os.path.exists(self.tasks_path):
            with open(self.tasks_path) as f:
                data = json.load(f)
            self._tasks = {k: EvalTask(**v) for k, v in data.items()}
        if os.path.exists(self.annotations_path):
            with open(self.annotations_path) as f:
                data = json.load(f)
            self._annotations = {
                k: [Annotation(**a) for a in v]
                for k, v in data.items()
            }

    def _save(self):
        with open(self.tasks_path, "w") as f:
            json.dump({k: asdict(v) for k, v in self._tasks.items()}, f, indent=2)
        with open(self.annotations_path, "w") as f:
            json.dump(
                {k: [asdict(a) for a in v] for k, v in self._annotations.items()},
                f, indent=2
            )


# ── CLI annotation interface ───────────────────────────────────────────────────

class CLIAnnotationInterface:
    """
    Simple command-line interface for human annotation.
    For production, replace with a web-based annotation tool.
    """

    def __init__(self, dataset: EvalDatasetManager, annotator_id: str):
        self.dataset = dataset
        self.annotator_id = annotator_id

    def start_session(self, num_tasks: int = 10):
        tasks = self.dataset.get_unannotated_tasks(self.annotator_id, limit=num_tasks)

        if not tasks:
            print("No tasks remaining to annotate. Great work!")
            return

        print(f"\nStarting annotation session for {self.annotator_id}")
        print(f"Tasks to annotate: {len(tasks)}")
        print("─" * 60)

        for i, task in enumerate(tasks):
            print(f"\nTask {i+1}/{len(tasks)} - ID: {task.task_id}")
            if task.is_gold_standard:
                print("[QUALITY CHECK TASK]")

            annotation = self._annotate_task(task)
            self.dataset.add_annotation(annotation)

            if task.is_gold_standard and task.gold_rating is not None:
                composite = annotation.composite_score()
                expected = task.gold_rating
                diff = abs(composite - expected)
                if diff > 1.5:
                    print(f"Warning: Your rating ({composite:.1f}) differs significantly "
                          f"from expected ({expected:.1f}). Please review the rubric.")

        print(f"\nSession complete! Annotated {len(tasks)} tasks.")

    def _annotate_task(self, task: EvalTask) -> Annotation:
        import uuid

        print(f"\nQuery: {task.query}")
        print(f"\nAgent Response:")
        print("─" * 40)
        print(task.agent_response)
        print("─" * 40)

        if task.trajectory_summary:
            show_traj = input("\nShow trajectory? (y/N): ").strip().lower()
            if show_traj == "y":
                for step in task.trajectory_summary:
                    print(f"  {step}")

        print("\nRate this response (all scores 1-5):")

        start = time.time()

        tc = self._get_rating(
            "Task completion (1=failed, 5=perfect)",
            valid_range=(1, 5)
        )
        fa = self._get_rating(
            "Factual accuracy (1=many errors, 5=fully accurate)",
            valid_range=(1, 5)
        )
        h = self._get_rating(
            "Helpfulness (1=not helpful, 5=very helpful)",
            valid_range=(1, 5)
        )
        s = self._get_rating(
            "Safety (1=unsafe, 5=fully safe)",
            valid_range=(1, 5)
        )
        conf = self._get_rating(
            "Your confidence in these ratings (1=very unsure, 5=very sure)",
            valid_range=(1, 5)
        )

        issue = input("Primary issue (or press Enter to skip): ").strip() or None
        suggestion = input("Improvement suggestion (or press Enter to skip): ").strip() or None

        elapsed = time.time() - start

        return Annotation(
            annotation_id=str(uuid.uuid4())[:8],
            task_id=task.task_id,
            annotator_id=self.annotator_id,
            timestamp=time.time(),
            task_completion=tc,
            factual_accuracy=fa,
            helpfulness=h,
            safety=s,
            primary_issue=issue,
            improvement_suggestion=suggestion,
            confidence=conf,
            time_spent_seconds=elapsed,
        )

    def _get_rating(self, prompt: str, valid_range: tuple[int, int]) -> int:
        lo, hi = valid_range
        while True:
            try:
                val = int(input(f"{prompt} [{lo}-{hi}]: ").strip())
                if lo <= val <= hi:
                    return val
                print(f"Please enter a number between {lo} and {hi}")
            except ValueError:
                print("Please enter a number")


# ── Inter-annotator agreement calculator ──────────────────────────────────────

class IAACalculator:
    """
    Computes inter-annotator agreement metrics.
    Supports Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha.
    """

    def cohen_kappa(
        self,
        ratings_a: list[int],
        ratings_b: list[int],
    ) -> float:
        """Cohen's kappa for two annotators on categorical ratings."""
        assert len(ratings_a) == len(ratings_b), "Lists must be same length"
        n = len(ratings_a)
        if n == 0:
            return 0.0

        categories = sorted(set(ratings_a) | set(ratings_b))
        k = len(categories)
        cat_to_idx = {c: i for i, c in enumerate(categories)}

        # Observed agreement
        p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

        # Expected agreement
        counts_a = [0] * k
        counts_b = [0] * k
        for a, b in zip(ratings_a, ratings_b):
            counts_a[cat_to_idx[a]] += 1
            counts_b[cat_to_idx[b]] += 1

        p_e = sum(
            (counts_a[i] / n) * (counts_b[i] / n)
            for i in range(k)
        )

        if p_e == 1.0:
            return 1.0

        return (p_o - p_e) / (1 - p_e)

    def fleiss_kappa(
        self,
        ratings_matrix: list[list[int]],
    ) -> float:
        """
        Fleiss' kappa for multiple annotators.
        ratings_matrix: rows = items, columns = annotators
        """
        n_items = len(ratings_matrix)
        if n_items == 0:
            return 0.0

        n_annotators = len(ratings_matrix[0])
        all_cats = sorted({r for row in ratings_matrix for r in row if r is not None})
        k = len(all_cats)
        cat_to_idx = {c: i for i, c in enumerate(all_cats)}

        # Count matrix: n_items x k
        count_matrix = [[0] * k for _ in range(n_items)]
        for i, row in enumerate(ratings_matrix):
            for r in row:
                if r is not None:
                    count_matrix[i][cat_to_idx[r]] += 1

        # P_i: proportion of agreeing pairs per item
        p_i_list = []
        for i in range(n_items):
            n_i = sum(count_matrix[i])
            if n_i < 2:
                continue
            p_i = sum(count_matrix[i][j] * (count_matrix[i][j] - 1)
                      for j in range(k)) / (n_i * (n_i - 1))
            p_i_list.append(p_i)

        p_bar = mean(p_i_list) if p_i_list else 0.0

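        # p_j: proportion of all assignments in each category.
        # Count the ratings actually present so that missing (None)
        # entries do not deflate the proportions.
        total_assignments = sum(sum(row) for row in count_matrix)
        p_j = [
            sum(count_matrix[i][j] for i in range(n_items)) / total_assignments
            for j in range(k)
        ]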

        p_e = sum(p ** 2 for p in p_j)

        if p_e == 1.0:
            return 1.0

        return (p_bar - p_e) / (1 - p_e)

    def krippendorff_alpha(
        self,
        ratings_matrix: list[list[Optional[int]]],
        level: str = "ordinal",
    ) -> float:
        """
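        Krippendorff's alpha (simplified) - handles missing data and ordinal scales.
        ratings_matrix: rows = items, columns = annotators (None for missing)
        level: "nominal", "ordinal", or "interval"
        Note: all rating pairs are weighted equally, which matches the standard
        definition only when every item has the same number of ratings.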
        """
        # Flatten all valid ratings
        pairs = []
        for row in ratings_matrix:
            valid_ratings = [r for r in row if r is not None]
            for i in range(len(valid_ratings)):
                for j in range(i + 1, len(valid_ratings)):
                    pairs.append((valid_ratings[i], valid_ratings[j]))

        if len(pairs) < 2:
            return 0.0

        def distance(a, b) -> float:
            if level == "nominal":
                return 0.0 if a == b else 1.0
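            elif level == "ordinal":
                # Simplification: ordinal ratings are treated as equally spaced,
                # so this is the same squared difference as the interval metric.
                return float((a - b) ** 2)
            else:  # interval
                return float((a - b) ** 2)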

        # Observed disagreement
        d_o = mean(distance(a, b) for a, b in pairs)

        # Expected disagreement
        all_values = [r for row in ratings_matrix for r in row if r is not None]
        all_pairs = [
            (all_values[i], all_values[j])
            for i in range(len(all_values))
            for j in range(len(all_values))
            if i != j
        ]
        d_e = mean(distance(a, b) for a, b in all_pairs) if all_pairs else 1.0

        if d_e == 0:
            return 1.0

        return 1 - d_o / d_e


# ── Result analyzer ────────────────────────────────────────────────────────────

class HumanEvalAnalyzer:
    """Analyzes human evaluation results to extract actionable insights."""

    def __init__(self, dataset: EvalDatasetManager):
        self.dataset = dataset
        self.iaa = IAACalculator()

    def compute_iaa(self) -> dict:
        """Compute IAA across all multi-annotated tasks."""
        multi_tasks = self.dataset.get_multi_annotated_tasks(min_annotations=2)

        if not multi_tasks:
            return {"error": "No tasks with multiple annotations"}

        # Build ratings matrices per dimension
        task_ids = [t.task_id for t in multi_tasks]
        dimensions = ["task_completion", "factual_accuracy", "helpfulness", "safety"]

        iaa_results = {}
        for dim in dimensions:
            # Collect all annotators' ratings per task (as a matrix)
            ratings_matrix = []
            for task_id in task_ids:
                anns = self.dataset._annotations.get(task_id, [])
                ratings = [getattr(a, dim) for a in anns if getattr(a, dim) is not None]
                if len(ratings) >= 2:
                    ratings_matrix.append(ratings)

            if not ratings_matrix:
                continue

            # Pad to same length with None
            max_len = max(len(r) for r in ratings_matrix)
            padded = [r + [None] * (max_len - len(r)) for r in ratings_matrix]

            alpha = self.iaa.krippendorff_alpha(padded, level="ordinal")
            iaa_results[dim] = round(alpha, 3)

            # Interpret
            if alpha < 0.40:
                interpretation = "Poor - rubric needs revision"
            elif alpha < 0.60:
                interpretation = "Moderate - acceptable for exploration"
            elif alpha < 0.80:
                interpretation = "Substantial - acceptable for decisions"
            else:
                interpretation = "Near-perfect - excellent"
            iaa_results[f"{dim}_interpretation"] = interpretation

        return iaa_results

    def quality_control_report(self) -> dict:
        """Identify low-quality annotators via gold standard performance."""
        gold_tasks = [
            t for t in self.dataset._tasks.values()
            if t.is_gold_standard and t.gold_rating is not None
        ]

        if not gold_tasks:
            return {"error": "No gold standard tasks configured"}

        gold_ids = {t.task_id for t in gold_tasks}
        gold_ratings = {t.task_id: t.gold_rating for t in gold_tasks}

        # Collect annotator performance on gold tasks
        annotator_performance = {}
        for task_id, anns in self.dataset._annotations.items():
            if task_id not in gold_ids:
                continue
            expected = gold_ratings[task_id]
            for ann in anns:
                aid = ann.annotator_id
                if aid not in annotator_performance:
                    annotator_performance[aid] = []
                composite = ann.composite_score()
                annotator_performance[aid].append(abs(composite - expected))

        report = {}
        for annotator_id, errors in annotator_performance.items():
            avg_error = mean(errors)
            report[annotator_id] = {
                "gold_tasks_completed": len(errors),
                "avg_error": round(avg_error, 2),
                "quality": "PASS" if avg_error < 1.0 else "FAIL",
            }

        return report

    def summary_report(self) -> dict:
        """Overall quality summary across all annotations."""
        all_anns = self.dataset.all_annotations()

        if not all_anns:
            return {"error": "No annotations yet"}

        composites = [a.composite_score() for a in all_anns]
        issues = [a.primary_issue for a in all_anns if a.primary_issue]
        avg_time = mean(a.time_spent_seconds for a in all_anns)

        # Count issues
        issue_counts = {}
        for issue in issues:
            issue_counts[issue] = issue_counts.get(issue, 0) + 1
        top_issues = sorted(issue_counts.items(), key=lambda x: x[1], reverse=True)[:5]

        return {
            "total_annotations": len(all_anns),
            "unique_tasks": len(self.dataset._tasks),
            "mean_composite_score": round(mean(composites), 3),
            "std_composite_score": round(stdev(composites) if len(composites) > 1 else 0, 3),
            "min_score": round(min(composites), 3),
            "max_score": round(max(composites), 3),
            "avg_annotation_time_seconds": round(avg_time, 1),
            "top_issues": top_issues,
            "iaa": self.compute_iaa(),
        }


# ── Feedback loop ──────────────────────────────────────────────────────────────

class FeedbackLoop:
    """
    Converts human eval insights into agent improvement actions.
    """

    @staticmethod
    def generate_improvement_actions(summary: dict) -> list[str]:
        """Translate human eval findings into concrete agent improvement actions."""
        actions = []
        mean_score = summary.get("mean_composite_score", 3.0)

        if mean_score < 2.5:
            actions.append("CRITICAL: Overall quality below threshold. "
                           "Consider delaying deployment until > 3.5.")

        # Dimension-specific actions
        top_issues = summary.get("top_issues", [])
        for issue, count in top_issues:
            if "accuracy" in issue.lower() or "incorrect" in issue.lower():
                actions.append(f"Factual errors reported {count}x - "
                               "add fact-checking tool or external verification step")
            elif "incomplete" in issue.lower() or "missing" in issue.lower():
                actions.append(f"Incomplete responses reported {count}x - "
                               "revise system prompt to require comprehensive coverage")
            elif "confusing" in issue.lower() or "unclear" in issue.lower():
                actions.append(f"Clarity issues reported {count}x - "
                               "add formatting guidelines to system prompt")
            elif "safe" in issue.lower() or "appropriate" in issue.lower():
                actions.append(f"Safety concerns reported {count}x - "
                               "escalate to safety review before deployment")

        # IAA issues
        iaa = summary.get("iaa", {})
        for dim, alpha in iaa.items():
            if not isinstance(alpha, (int, float)):
                continue
            if alpha < 0.40:
                actions.append(f"Low IAA on {dim} (alpha={alpha:.2f}) - "
                               "revise rubric criteria and hold annotator calibration session")

        return actions


# ── Demo ───────────────────────────────────────────────────────────────────────

def demo():
    import tempfile

    with tempfile.TemporaryDirectory() as tmpdir:
        dataset = EvalDatasetManager(tmpdir)

        # Add sample tasks
        tasks = [
            EvalTask(
                task_id="task_001",
                query="What is the capital of France?",
                agent_response="The capital of France is Paris.",
                is_gold_standard=True,
                gold_rating=4.5,
            ),
            EvalTask(
                task_id="task_002",
                query="Explain the transformer architecture.",
                agent_response="Transformers use attention mechanisms.",
            ),
        ]
        for task in tasks:
            dataset.add_task(task)

        # Simulate annotations
        import uuid
        for annotator_id in ["annotator_a", "annotator_b"]:
            for task in tasks:
                annotation = Annotation(
                    annotation_id=str(uuid.uuid4())[:8],
                    task_id=task.task_id,
                    annotator_id=annotator_id,
                    timestamp=time.time(),
                    task_completion=4 if task.task_id == "task_001" else 2,
                    factual_accuracy=5 if task.task_id == "task_001" else 2,
                    helpfulness=4 if task.task_id == "task_001" else 2,
                    safety=5,
                    confidence=4,
                    time_spent_seconds=45.0,
                )
                dataset.add_annotation(annotation)

        analyzer = HumanEvalAnalyzer(dataset)
        summary = analyzer.summary_report()

        print("\n── Human Evaluation Summary ─────────────────────")
        print(f"Total annotations: {summary['total_annotations']}")
        print(f"Mean composite score: {summary['mean_composite_score']}")
        print(f"IAA: {summary['iaa']}")

        actions = FeedbackLoop.generate_improvement_actions(summary)
        print("\nRecommended actions:")
        for action in actions:
            print(f"  - {action}")


if __name__ == "__main__":
    demo()

The Feedback Flywheel

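Human evaluation is most powerful when it is a repeating cycle, not a one-time exercise: select representative tasks, run human evaluation, analyze failure patterns, improve the agent, re-evaluate, and add new production failures to the task set.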

Each cycle tightens the loop: the insights from one evaluation directly guide the next improvement, and the next evaluation verifies that improvement worked. Over time, the eval set becomes a comprehensive library of challenging tasks, and the agent becomes reliably good at all of them.
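The step that makes the cycle compound is carrying tasks forward between rounds. A minimal sketch of that bookkeeping follows; the function name `next_cycle_eval_set`, the plain-dict data shapes, and the `fail_below=3.0` cutoff are illustrative assumptions, not part of the toolkit above.

```python
def next_cycle_eval_set(
    current_tasks: dict[str, str],
    mean_scores: dict[str, float],
    production_failures: dict[str, str],
    fail_below: float = 3.0,  # assumed cutoff on the 1-5 scale; tune for your rubric
) -> tuple[dict[str, str], list[str]]:
    """Build the task set for the next evaluation round.

    current_tasks:        {task_id: query} evaluated this round
    mean_scores:          {task_id: mean human score} from this round
    production_failures:  {task_id: query} newly observed failures
    Returns (next_tasks, regression_ids).
    """
    # Keep every existing task: old tasks guard against regressions.
    next_tasks = dict(current_tasks)

    # Add new production failures without overwriting existing entries.
    for task_id, query in production_failures.items():
        next_tasks.setdefault(task_id, query)

    # Tasks that scored poorly this round are the ones to re-check first.
    regression_ids = sorted(
        task_id for task_id, score in mean_scores.items() if score < fail_below
    )
    return next_tasks, regression_ids
```

Returning the low-scoring task ids separately lets the next round report "did the fix work?" on exactly those tasks, while the full set still catches regressions elsewhere.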

:::danger Annotation Fatigue Destroys Data Quality Annotators rating their 50th task in a session produce lower-quality annotations than they did on their first 10. Cognitive fatigue causes rating drift, less attention to edge cases, and mechanical responses. Limit sessions to 30–40 tasks maximum. Enforce breaks. Monitor annotation time per task - very fast (under 30 seconds) or very slow (over 15 minutes) annotations should be flagged for review. For crowdwork, distribute tasks across many short sessions rather than a few long ones. :::
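The timing and session-length limits in this callout are easy to check mechanically. The sketch below is one way to do it; `TimedAnnotation` is a hypothetical minimal record (the `session_id` field in particular is an assumption, since the toolkit's `Annotation` class is not shown with one), and the thresholds are simply the numbers quoted above.

```python
from dataclasses import dataclass

# Thresholds taken from the callout above; tune them for your own tasks.
MIN_SECONDS = 30.0
MAX_SECONDS = 15 * 60.0
MAX_TASKS_PER_SESSION = 40


@dataclass
class TimedAnnotation:
    """Hypothetical minimal record, not the toolkit's Annotation class."""
    annotation_id: str
    annotator_id: str
    session_id: str
    timestamp: float
    time_spent_seconds: float


def flag_for_review(annotations: list[TimedAnnotation]) -> dict[str, list[str]]:
    """Return {annotation_id: [reasons]} for annotations that need a second look."""
    flags: dict[str, list[str]] = {}
    # (annotator, session) -> number of tasks seen so far in that session
    position: dict[tuple[str, str], int] = {}

    # Process in time order so "position in session" is meaningful.
    for ann in sorted(annotations, key=lambda a: a.timestamp):
        reasons = []
        if ann.time_spent_seconds < MIN_SECONDS:
            reasons.append("too_fast")
        elif ann.time_spent_seconds > MAX_SECONDS:
            reasons.append("too_slow")

        key = (ann.annotator_id, ann.session_id)
        position[key] = position.get(key, 0) + 1
        if position[key] > MAX_TASKS_PER_SESSION:
            reasons.append("session_too_long")

        if reasons:
            flags[ann.annotation_id] = reasons
    return flags
```

Flagged annotations are candidates for review, not automatic rejection: a fast rating on a trivial task can be perfectly valid.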

:::warning Comparative Evaluation Requires Careful Blinding When annotators evaluate agent version A vs version B, any signal that reveals which version is which introduces bias. Blind your annotators to version information. If they can detect which version is which from response style, format, or length, your comparative results will be confounded. This is especially important when comparing a new model against an existing one that annotators are familiar with. :::
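A minimal sketch of label and order blinding for a pairwise comparison is shown below. The helper names `blind_pair` and `unblind` and the dict layout are assumptions made for illustration. Note that this only hides version names and randomizes position; it does nothing about style, format, or length cues, which have to be handled separately.

```python
import random


def blind_pair(task_id: str, response_a: str, response_b: str, rng: random.Random):
    """Randomize presentation order and hide version names.

    Returns (annotator_view, answer_key). Only annotator_view is shown to raters;
    answer_key stays with the study owner until ratings are collected.
    """
    order = ["A", "B"]
    rng.shuffle(order)  # random position per task, so position carries no signal
    responses = {"A": response_a, "B": response_b}

    annotator_view = {
        "task_id": task_id,
        "response_1": responses[order[0]],
        "response_2": responses[order[1]],
    }
    answer_key = {
        "task_id": task_id,
        "response_1": order[0],
        "response_2": order[1],
    }
    return annotator_view, answer_key


def unblind(preference: str, answer_key: dict) -> str:
    """Map a rater's 'response_1' / 'response_2' choice back to version A or B."""
    return answer_key[preference]
```

Passing in a seeded `random.Random` keeps the assignment reproducible for auditing while still being unpredictable to annotators.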

Interview Q&A

Q: When is human evaluation mandatory, and when can automated metrics substitute?

A: Human evaluation is mandatory in four situations. First, safety-critical deployments - no automated system can reliably catch all ways an agent might mislead in healthcare, legal, or financial domains; domain expert review is required. Second, calibrating LLM judges - without human ground truth, you cannot verify whether your automated judge measures what you intend. Third, novel task types outside your existing calibration distribution. Fourth, regulatory requirements in some industries. Automated metrics can substitute when: the task has clear objective correctness (code tests, exact match), the automated metric is well-calibrated against human judgment for that task type, and stakes are low enough that a 10–15% error rate in the evaluation metric is acceptable.

Q: How do you design a rubric that produces high inter-annotator agreement?

A: Four principles. First, define each score level with concrete observable criteria, not vague adjectives - "response contains no verifiable factual errors" rather than "response is accurate." Second, add calibration examples: show annotators 2–3 actual examples at each score level. Third, explicitly state what NOT to consider - length, formatting preferences, stylistic choices - to prevent annotators from using different implicit criteria. Fourth, conduct a calibration session before the study: have all annotators rate the same 10 tasks independently, discuss disagreements, and refine the rubric until IAA exceeds 0.67 (Krippendorff's alpha) on the calibration set. Do not skip the pre-study calibration.

Q: Explain Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha. When do you use each?

A: Cohen's kappa measures agreement between exactly two annotators on categorical data. It corrects for chance agreement. Use it when you have exactly two annotators per item. Fleiss' kappa covers the multi-annotator case on categorical data (strictly speaking it generalizes Scott's pi rather than Cohen's kappa) - use it when every item is rated by the same number of annotators, three or more, with no missing ratings. Krippendorff's alpha is the most general: it handles any number of annotators, missing data (not every annotator rates every item), and different scale types - nominal (categories), ordinal (ordered categories like Likert), and interval (true numeric scales). For agent evaluation with Likert scale ratings, Krippendorff's alpha with ordinal distance is most appropriate because a disagreement between a 1 and a 5 should count for more than a disagreement between a 3 and a 4.
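To make the ordinal-distance point concrete, the sketch below computes Krippendorff's alpha from scratch using the standard coincidence-matrix formulation, with both the nominal and the ordinal distance. It is an illustrative stand-alone function, not the toolkit's implementation, and is worth cross-checking against an established library before relying on it. The two tiny datasets are made up for the comparison.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha(units, metric="ordinal"):
    """Krippendorff's alpha for integer ratings.

    units:  one list of ratings per item; lists may differ in length (missing data).
    metric: "nominal" or "ordinal".
    """
    if metric not in ("nominal", "ordinal"):
        raise ValueError("metric must be 'nominal' or 'ordinal'")

    # Coincidence matrix: every ordered pair of ratings from different
    # annotators on the same item, weighted by 1 / (ratings on item - 1).
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating carries no agreement information
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()  # marginal count per rating value
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item rated by two annotators")
    values = sorted(totals)

    def delta_sq(c, k):
        if c == k:
            return 0.0
        if metric == "nominal":
            return 1.0
        # Ordinal distance: mass between the two ranks, minus half of each endpoint.
        lo, hi = min(c, k), max(c, k)
        span = sum(totals[g] for g in values if lo <= g <= hi)
        return (span - (totals[lo] + totals[hi]) / 2.0) ** 2

    observed = sum(w * delta_sq(a, b) for (a, b), w in coincidences.items()) / n
    expected = sum(
        totals[c] * totals[k] * delta_sq(c, k) for c in values for k in values
    ) / (n * (n - 1))
    if expected == 0:
        return 1.0  # convention chosen here: only one value was ever used
    return 1.0 - observed / expected


# Two annotators, four items, exactly one disagreement in each dataset.
near_miss = [[1, 1], [2, 2], [3, 4], [5, 5]]  # disagreement: 3 vs 4
far_miss = [[1, 5], [2, 2], [3, 3], [4, 4]]   # disagreement: 1 vs 5

for name, data in (("3 vs 4", near_miss), ("1 vs 5", far_miss)):
    print(name,
          round(krippendorff_alpha(data, "nominal"), 3),
          round(krippendorff_alpha(data, "ordinal"), 3))
```

With the nominal distance both datasets score 0.72, because a disagreement is a disagreement. With the ordinal distance the 3-vs-4 dataset scores about 0.98 while the 1-vs-5 dataset drops to about -0.06, which is the behavior you want for Likert ratings.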

Q: How do you control for quality in crowdworker annotation?

A: Five mechanisms work together. First, gold standard tasks: insert known-answer items (10–15% of tasks) where you know the correct rating. Filter workers whose gold standard performance falls below a threshold. Second, attention checks: simple questions that require reading to answer. Third, minimum time filters: flag tasks completed in under 30 seconds as potentially rushed. Fourth, redundancy: have 3–5 workers rate each item and flag items with high disagreement for expert review. Fifth, ongoing monitoring: track each worker's agreement rate with peers - workers consistently below 60% agreement should be blocked. On managed platforms like Scale AI, quality control is built in; on MTurk, you implement it yourself.
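The gold-standard filter (the first mechanism) can be sketched in a few lines. Everything specific here is an assumption for illustration: the `(worker_id, task_id, score)` tuple format, the one-point tolerance for counting a rating as correct, and the 0.8 pass threshold. The lesson only says to filter below "a threshold", so pick values that fit your rubric.

```python
from collections import defaultdict


def gold_accuracy(ratings, gold, tolerance=1.0):
    """Fraction of each worker's gold-task ratings within `tolerance` of the known score.

    ratings: iterable of (worker_id, task_id, score)
    gold:    {task_id: known_correct_score}
    Workers who saw no gold tasks are omitted from the result.
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, task_id, score in ratings:
        if task_id not in gold:
            continue  # ordinary task, nothing to check against
        seen[worker_id] += 1
        if abs(score - gold[task_id]) <= tolerance:
            hits[worker_id] += 1
    return {worker: hits[worker] / seen[worker] for worker in seen}


def workers_to_exclude(ratings, gold, min_accuracy=0.8, tolerance=1.0):
    """Worker ids whose gold-task accuracy falls below `min_accuracy` (assumed threshold)."""
    accuracy = gold_accuracy(ratings, gold, tolerance)
    return sorted(w for w, acc in accuracy.items() if acc < min_accuracy)
```

In practice you would also require a minimum number of gold items per worker before excluding anyone, since accuracy over one or two items is too noisy to act on.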

Q: What is the feedback flywheel for human evaluation, and why does it matter?

A: The feedback flywheel is the iterative cycle: select representative tasks → run human evaluation → analyze failure patterns → improve the agent → re-evaluate → add new production failures to the task set → repeat. It matters because a single human evaluation study tells you where the agent fails today. A flywheel tells you whether your improvements are working, catches new failure modes as user behavior evolves, and continuously builds a more challenging and representative evaluation set. Without the flywheel, eval is a one-time gate. With it, eval becomes the engine of continuous quality improvement. The key investment: after every evaluation cycle, add the most important failure cases to the eval set so the next evaluation catches them automatically.

Quick Reference: Human Evaluation Checklist

Before launching any human evaluation study, verify:

This checklist converts the principles in this lesson into a pre-study verification routine. Any item that fails before launch indicates a gap that will reduce label quality and ultimately limit what you can learn from the study.
