Human Evaluation for Agents
The Irreplaceable Signal
Eventually, humans must evaluate your agent. Not just because LLM judges are imperfect - though they are - but because the ultimate test of any agent is whether real humans, with real needs, find it useful and trustworthy.
An agent that scores 88% on your automated metrics and satisfies your LLM judge may still frustrate users with its phrasing, annoy domain experts with its oversimplifications, or alarm safety reviewers with subtle policy violations that no automated system caught. Automated metrics measure proxies. Human evaluation measures the real thing.
This lesson is about doing human evaluation well. The difference between poorly designed human evaluation (which produces noise) and well-designed human evaluation (which produces actionable signal) is entirely in the protocol design. Get the protocol wrong and you spend weeks collecting data that tells you nothing. Get it right and a 50-person annotation study can guide six months of product development.
When Human Evaluation Is Mandatory
Not every evaluation requires humans. But some situations absolutely do:
Safety-critical deployments: Any agent in healthcare, legal, financial, or safety-related domains must be evaluated by domain experts before deployment. Automated metrics cannot catch the subtle ways an agent might mislead a patient, create legal liability, or trigger a dangerous financial action. Human expert review is not optional.
Novel task types: When your agent handles a task type it has not been extensively evaluated on - new domains, new tool sets, new user populations - existing automated metrics may not be calibrated for that task type. Human evaluation establishes the baseline from which you can calibrate automated metrics.
Calibrating LLM judges: Every LLM judge requires periodic human calibration (see Lesson 05). Without human ground truth, you cannot know whether your judge is measuring what you think it measures.
Regulatory requirements: In regulated industries (financial services, medical devices, public sector AI), human expert evaluation may be legally required before deployment.
Significant model changes: When switching model families (not just versions), the failure modes may change in ways your existing automated metrics do not capture. Human evaluation of the new failure modes is required before trusting automated metrics again.
Human Eval Design Principles
Good human evaluation design is harder than it looks. Every decision - task selection, annotator choice, question design, interface design - affects the reliability of your data.
Principle 1: Define Your Evaluation Goal First
Before designing the protocol, answer: what decision is this evaluation meant to inform?
- "Should we ship this agent version?" → Absolute quality evaluation
- "Is version B better than version A?" → Comparative evaluation
- "Where are the biggest quality gaps?" → Diagnostic evaluation
- "Is this safe to deploy?" → Safety evaluation
Different goals require different designs. Comparative evaluation needs paired ratings. Diagnostic evaluation needs fine-grained dimensions. Safety evaluation needs adversarial task selection and specialized annotators.
Principle 2: Task Selection Determines Everything
The tasks you evaluate on constrain everything you can learn. A carefully selected task set of 200 examples produces more useful signal than a carelessly selected set of 2000.
Task selection principles:
- Representative: Cover the full distribution of real user queries, not just easy ones
- Diverse: Include different task types, lengths, domains, and difficulty levels
- Edge cases: Specifically include known hard cases, ambiguous requests, and error-prone scenarios
- Adversarial: Include tasks designed to probe specific failure modes
- Fresh: Use tasks from production that the model has not been trained on
Avoid curating tasks that you know the agent handles well - this produces inflated, unrepresentative scores.
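One way to keep a small task set representative without drowning the hard cases is to sample per stratum rather than uniformly. The sketch below is illustrative only: the `difficulty` field, the `stratified_sample` helper, and the synthetic log are assumptions, not part of the toolkit later in this lesson.

```python
import random
from collections import defaultdict


def stratified_sample(tasks, key, per_stratum, seed=0):
    """Draw up to `per_stratum` tasks from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    buckets = defaultdict(list)
    for task in tasks:
        buckets[task[key]].append(task)
    sample = []
    for stratum in sorted(buckets):
        items = buckets[stratum]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample


# Hypothetical production log: 90 easy tasks and 10 hard ones.
log = [{"id": i, "difficulty": "hard" if i % 10 == 0 else "easy"} for i in range(100)]

picked = stratified_sample(log, key="difficulty", per_stratum=5)
counts = {d: sum(t["difficulty"] == d for t in picked) for d in ("easy", "hard")}
print(counts)  # {'easy': 5, 'hard': 5}
```

A uniform sample of 10 from this log would contain about one hard task on average; stratifying guarantees five. Remember to reweight by the true stratum sizes if you later report a single overall score.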
Annotator Selection
The annotator population determines the validity of your results. Wrong annotators produce confident but irrelevant data.
Domain Experts
When to use: safety review, technical accuracy verification, regulatory compliance, calibration set construction.
Cost: 500/hour for qualified domain experts. A 200-task evaluation study might cost 15,000.
Quality characteristics: High accuracy on domain-specific judgments, low throughput (10–20 tasks/hour), high variability in what they notice (different experts focus on different aspects of quality).
Practical considerations: Domain experts are hard to find, harder to schedule, and often have strong opinions that make inter-annotator agreement difficult. Invest in rubric alignment sessions before the study.
Crowdworkers
When to use: large-scale annotation, simple quality tasks (helpfulness, clarity), comparative preference, non-specialist judgments.
Platforms: Amazon Mechanical Turk (high volume, lower quality), Scale AI (managed, higher quality), Prolific (academic-focused, good quality for general tasks).
Cost: 0.50/task for simple tasks, 5/task for complex tasks.
Quality characteristics: High throughput (50–100 tasks/hour), high variance (some workers are excellent, many are careless), requires quality control mechanisms.
Quality control for crowdworkers:
- Gold standard tasks: Insert known-answer items (10–15% of tasks). Workers who answer incorrectly get filtered.
- Attention checks: Include obvious questions that attentive workers answer correctly.
- Agreement thresholds: Flag workers whose agreement with peers is below 60%.
- Time filters: Workers who complete tasks in less than 30 seconds are not reading carefully.
- Redundancy: 3–5 workers per task, use majority vote.
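The gold-standard, time-filter, and redundancy checks above can be combined into a small filtering pass. This is a minimal sketch under stated assumptions: the row format, the helper names, and the 0.8 gold-accuracy cutoff are hypothetical (only the 30-second threshold comes from the list above), and the time filter is applied to each worker's median time rather than to single tasks.

```python
from collections import Counter, defaultdict
from statistics import median


def passing_workers(responses, gold_answers, min_gold_accuracy=0.8, min_median_seconds=30):
    """Keep workers who answer gold items correctly and are not rushing."""
    gold_seen, gold_hits = Counter(), Counter()
    times = defaultdict(list)
    for r in responses:
        times[r["worker"]].append(r["seconds"])
        if r["task"] in gold_answers:
            gold_seen[r["worker"]] += 1
            if r["label"] == gold_answers[r["task"]]:
                gold_hits[r["worker"]] += 1
    # Workers who never saw a gold item cannot be verified and are excluded.
    return {
        w for w in gold_seen
        if gold_hits[w] / gold_seen[w] >= min_gold_accuracy
        and median(times[w]) >= min_median_seconds
    }


def majority_vote(responses, workers, gold_answers):
    """Majority label per non-gold task among trusted workers (None on a tie)."""
    votes = defaultdict(Counter)
    for r in responses:
        if r["worker"] in workers and r["task"] not in gold_answers:
            votes[r["task"]][r["label"]] += 1
    result = {}
    for task, counter in votes.items():
        ranked = counter.most_common(2)
        tied = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        result[task] = None if tied else ranked[0][0]
    return result


gold = {"g1": "yes"}  # hypothetical known-answer item
rows = [
    ("w1", "g1", "yes", 60), ("w1", "t1", "yes", 50),
    ("w2", "g1", "yes", 45), ("w2", "t1", "yes", 40),
    ("w3", "g1", "no", 40), ("w3", "t1", "no", 35),   # fails the gold item
    ("w4", "g1", "yes", 5), ("w4", "t1", "no", 4),    # rushing
    ("w5", "g1", "yes", 70), ("w5", "t1", "no", 55),
]
responses = [dict(zip(("worker", "task", "label", "seconds"), row)) for row in rows]

kept = passing_workers(responses, gold)
labels = majority_vote(responses, kept, gold)
print(sorted(kept))  # ['w1', 'w2', 'w5']
print(labels)        # {'t1': 'yes'}
```

Filtering before voting matters here: with all five workers counted, `t1` would come out as "no" by three votes to two.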
Target Users
When to use: usability evaluation, final acceptance testing for new user segments, measuring helpfulness from the user's perspective.
Cost: Recruiting target users for paid studies costs 300/participant for 60-minute sessions.
Quality characteristics: High ecological validity (you are measuring what matters to actual users), low throughput, difficult to recruit at scale.
Best used for: small qualitative studies (10–20 users) to understand the user experience, complementing quantitative automated evaluation.
What to Show Annotators
A critical and often overlooked design decision: what do annotators see?
| What to Show | Pros | Cons | Best For |
|---|---|---|---|
| Final output only | Fast, low cognitive load | Misses trajectory quality | Output quality evaluation |
| Full trajectory | Captures process quality | Slow, cognitively demanding | In-depth trajectory review |
| Step-by-step | Can rate each decision | Very slow (20+ minutes/task) | Research, failure analysis |
| Redacted trajectory | Removes irrelevant details | May remove useful context | Scaled trajectory review |
For most production evaluations, show annotators the final output only, plus the query. For diagnostic or safety evaluations, show the full trajectory. For large-scale crowdwork, showing the trajectory is usually not practical.
Question Design
The questions you ask annotators determine what you learn. Each question type has strengths and weaknesses:
Binary Questions (Yes/No)
Simple, high agreement, but coarse:
- "Does this response answer the user's question?" (Y/N)
- "Does this response contain any factually incorrect information?" (Y/N)
- "Is this response safe and appropriate?" (Y/N)
Use binary questions for: safety checks, basic quality gates, high-volume crowdwork.
Likert Scale Questions (1-5 or 1-7)
Finer-grained but lower agreement:
- "How helpful is this response? (1=Not at all helpful, 5=Extremely helpful)"
- "How accurate is the information? (1=Very inaccurate, 5=Very accurate)"
- "How clear and well-organized is this response? (1=Very confusing, 5=Extremely clear)"
Use Likert for: nuanced quality dimensions, when you need to distinguish between levels of good.
Comparative Questions
"Which response is better?" gives stronger signal than "Is this response good?":
- "Response A or Response B: which is more helpful to the user?"
- "Response A or Response B: which would you trust more?"
- "Response A or Response B: which would you prefer to receive?"
Use comparative for: comparing two agent versions, where absolute judgment is difficult.
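Pairwise preferences are usually summarized as a win rate, and a simple exact sign test indicates whether the observed preference could plausibly be chance. The sketch below is one possible analysis, not a prescribed method: the function name and the vote counts are made up for illustration, and ties are dropped before testing.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial test of 'A and B are equally preferred'."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical study: 20 paired comparisons, one judgment per task.
preferences = ["B"] * 14 + ["A"] * 4 + ["tie"] * 2

wins_a = preferences.count("A")
wins_b = preferences.count("B")
win_rate_b = wins_b / (wins_a + wins_b)
p_value = sign_test_p_value(wins_a, wins_b)

print(f"B preferred in {win_rate_b:.0%} of non-tied comparisons, p = {p_value:.3f}")
```

With several annotators per task, aggregate to one preference per task first (for example by majority vote); treating correlated judgments as independent makes the test overconfident.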
Open-Ended Questions
Richest qualitative signal, hardest to aggregate:
- "What is the most important thing wrong with this response?"
- "What would make this response better?"
- "What would you change to make this more trustworthy?"
Use open-ended for: diagnostic evaluation, identifying failure modes, understanding user needs.
Rubric Design
Rubrics specify exactly what each score level means. Without rubrics, annotators interpret questions differently, producing low inter-annotator agreement.
HELPFULNESS RUBRIC
5 - Excellent: Response fully addresses the user's need, anticipates likely follow-up
questions, provides appropriate depth, and is organized for the user's context.
4 - Good: Response fully addresses the main need with appropriate depth. Minor aspects
could be improved (slightly more detail, better organization, etc.).
3 - Acceptable: Response addresses the main need but misses notable details or context
that the user would want. The core information is present.
2 - Poor: Response partially addresses the need. Significant aspects are missing,
incorrect, or off-topic. User would need to ask follow-up questions.
1 - Failing: Response fails to address the user's need. Off-topic, incorrect,
or completely unhelpful.
Note: Do not score based on response length. A brief accurate answer scores higher
than a lengthy inaccurate one.
Good rubric properties:
- Each level has a concrete description with observable criteria
- Edge cases and exceptions are noted
- What NOT to consider is explicitly stated
- Calibration examples accompany the rubric (show 2–3 examples at each score level)
Inter-Annotator Agreement
Inter-annotator agreement (IAA) measures how consistently your annotators rate the same items. Low IAA means your evaluation is producing noise, not signal.
Cohen's Kappa (2 annotators)
For two annotators on categorical ratings:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Where $p_o$ is observed agreement and $p_e$ is expected agreement by chance.
Interpretation:
- $\kappa < 0.40$: Poor agreement - your rubric or annotators need significant work
- $0.40 \le \kappa < 0.60$: Moderate agreement - acceptable for exploration, not for decisions
- $0.60 \le \kappa < 0.80$: Substantial agreement - acceptable for most production evaluations
- $\kappa \ge 0.80$: Near-perfect agreement - excellent
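A short worked example shows why raw agreement overstates reliability. The numbers are hypothetical: two annotators give pass/fail labels to 10 items. Annotator A passes 6 and annotator B passes 5; they agree on 7 items (4 both pass, 3 both fail). Writing $p_o$ for observed agreement and $p_e$ for chance agreement:

$$
\begin{aligned}
p_o &= \frac{4 + 3}{10} = 0.70 \\
p_e &= \underbrace{\frac{6}{10}\cdot\frac{5}{10}}_{\text{both pass}} + \underbrace{\frac{4}{10}\cdot\frac{5}{10}}_{\text{both fail}} = 0.30 + 0.20 = 0.50 \\
\kappa &= \frac{p_o - p_e}{1 - p_e} = \frac{0.70 - 0.50}{1 - 0.50} = 0.40
\end{aligned}
$$

Seventy percent raw agreement shrinks to $\kappa = 0.40$ once the agreement expected by chance is removed, which is why agreement percentages alone should not be reported.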
Fleiss' Kappa (multiple annotators)
For more than two annotators on the same items:

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$

Where $\bar{P}$ is mean annotator agreement and $\bar{P}_e$ is mean chance agreement.
Krippendorff's Alpha
More general than kappa - handles missing data, works for ordinal/interval scales:

$$\alpha = 1 - \frac{D_o}{D_e}$$

Where $D_o$ is observed disagreement and $D_e$ is expected disagreement. More appropriate than kappa for Likert scale data where the distance between disagreements matters (disagreeing by 4 points is worse than disagreeing by 1 point).
Target values: For Likert scale agent evaluation data, target Krippendorff's alpha > 0.67.
Full Python: Human Evaluation Toolkit
"""
Human evaluation toolkit for agent outputs.
Includes: dataset manager, CLI annotation interface,
inter-annotator agreement calculator, result analyzer.
"""
import json
import os
import time
from dataclasses import dataclass, field, asdict
from statistics import mean, stdev
from typing import Optional

# ── Data models ────────────────────────────────────────────────────────────


@dataclass
class EvalTask:
    task_id: str
    query: str
    agent_response: str
    trajectory_summary: Optional[list[str]] = None
    metadata: dict = field(default_factory=dict)
    is_gold_standard: bool = False  # Known-answer item for quality control
    # Expected composite score for gold items (a float: the demo uses 4.5)
    gold_rating: Optional[float] = None


@dataclass
class Annotation:
    annotation_id: str
    task_id: str
    annotator_id: str
    timestamp: float
    # Ratings
    task_completion: int   # 1-5
    factual_accuracy: int  # 1-5
    helpfulness: int       # 1-5
    safety: int            # 1-5 (5 = very safe)
    # Qualitative
    primary_issue: Optional[str] = None
    improvement_suggestion: Optional[str] = None
    overall_notes: Optional[str] = None
    # Quality
    confidence: int = 3  # Annotator's confidence 1-5
    time_spent_seconds: float = 0.0

    def composite_score(self, weights: Optional[dict] = None) -> float:
        """Weighted average of the four ratings, on the same 1-5 scale."""
        weights = weights or {
            "task_completion": 0.35,
            "factual_accuracy": 0.30,
            "helpfulness": 0.25,
            "safety": 0.10,
        }
        total = (
            weights["task_completion"] * self.task_completion +
            weights["factual_accuracy"] * self.factual_accuracy +
            weights["helpfulness"] * self.helpfulness +
            weights["safety"] * self.safety
        )
        # Normalize so custom weights need not sum to 1
        return total / sum(weights.values())


# ── Dataset manager ────────────────────────────────────────────────────────
class EvalDatasetManager:
    """
    Manages eval tasks and annotations, stored as JSON files.
    In production, replace with a database backend.
    """

    def __init__(self, data_dir: str):
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)
        self.tasks_path = os.path.join(data_dir, "tasks.json")
        self.annotations_path = os.path.join(data_dir, "annotations.json")
        self._tasks: dict[str, EvalTask] = {}
        self._annotations: dict[str, list[Annotation]] = {}
        self._load()

    def add_task(self, task: EvalTask):
        self._tasks[task.task_id] = task
        self._save()

    def add_annotation(self, annotation: Annotation):
        task_id = annotation.task_id
        if task_id not in self._annotations:
            self._annotations[task_id] = []
        self._annotations[task_id].append(annotation)
        self._save()

    def get_unannotated_tasks(self, annotator_id: str, limit: int = 10) -> list[EvalTask]:
        """Return tasks this annotator has not yet rated."""
        annotated = {
            ann.task_id
            for anns in self._annotations.values()
            for ann in anns
            if ann.annotator_id == annotator_id
        }
        unannotated = [t for t_id, t in self._tasks.items() if t_id not in annotated]
        # Put gold standard tasks first
        gold = [t for t in unannotated if t.is_gold_standard]
        regular = [t for t in unannotated if not t.is_gold_standard]
        return (gold + regular)[:limit]

    def get_multi_annotated_tasks(self, min_annotations: int = 2) -> list[EvalTask]:
        """Return tasks with at least min_annotations annotations."""
        return [
            self._tasks[task_id]
            for task_id, anns in self._annotations.items()
            if len(anns) >= min_annotations and task_id in self._tasks
        ]

    def all_annotations(self) -> list[Annotation]:
        return [ann for anns in self._annotations.values() for ann in anns]

    def _load(self):
        if os.path.exists(self.tasks_path):
            with open(self.tasks_path) as f:
                data = json.load(f)
            self._tasks = {k: EvalTask(**v) for k, v in data.items()}
        if os.path.exists(self.annotations_path):
            with open(self.annotations_path) as f:
                data = json.load(f)
            self._annotations = {
                k: [Annotation(**a) for a in v]
                for k, v in data.items()
            }

    def _save(self):
        with open(self.tasks_path, "w") as f:
            json.dump({k: asdict(v) for k, v in self._tasks.items()}, f, indent=2)
        with open(self.annotations_path, "w") as f:
            json.dump(
                {k: [asdict(a) for a in v] for k, v in self._annotations.items()},
                f, indent=2
            )


# ── CLI annotation interface ───────────────────────────────────────────────
class CLIAnnotationInterface:
    """
    Simple command-line interface for human annotation.
    For production, replace with a web-based annotation tool.
    """

    def __init__(self, dataset: EvalDatasetManager, annotator_id: str):
        self.dataset = dataset
        self.annotator_id = annotator_id

    def start_session(self, num_tasks: int = 10):
        tasks = self.dataset.get_unannotated_tasks(self.annotator_id, limit=num_tasks)
        if not tasks:
            print("No tasks remaining to annotate. Great work!")
            return
        print(f"\nStarting annotation session for {self.annotator_id}")
        print(f"Tasks to annotate: {len(tasks)}")
        print("─" * 60)
        for i, task in enumerate(tasks):
            print(f"\nTask {i+1}/{len(tasks)} - ID: {task.task_id}")
            if task.is_gold_standard:
                # Labeling gold items is convenient for a demo, but it tells the
                # annotator which items are checked; hide this in a real study.
                print("[QUALITY CHECK TASK]")
            annotation = self._annotate_task(task)
            self.dataset.add_annotation(annotation)
            if task.is_gold_standard and task.gold_rating is not None:
                composite = annotation.composite_score()
                expected = task.gold_rating
                diff = abs(composite - expected)
                if diff > 1.5:
                    print(f"Warning: Your rating ({composite:.1f}) differs significantly "
                          f"from expected ({expected:.1f}). Please review the rubric.")
        print(f"\nSession complete! Annotated {len(tasks)} tasks.")

    def _annotate_task(self, task: EvalTask) -> Annotation:
        import uuid
        # Start the clock before the response is shown so reading time counts
        start = time.time()
        print(f"\nQuery: {task.query}")
        print("\nAgent Response:")
        print("─" * 40)
        print(task.agent_response)
        print("─" * 40)
        if task.trajectory_summary:
            show_traj = input("\nShow trajectory? (y/N): ").strip().lower()
            if show_traj == "y":
                for step in task.trajectory_summary:
                    print(f"  {step}")
        print("\nRate this response (all scores 1-5):")
        tc = self._get_rating(
            "Task completion (1=failed, 5=perfect)",
            valid_range=(1, 5)
        )
        fa = self._get_rating(
            "Factual accuracy (1=many errors, 5=fully accurate)",
            valid_range=(1, 5)
        )
        h = self._get_rating(
            "Helpfulness (1=not helpful, 5=very helpful)",
            valid_range=(1, 5)
        )
        s = self._get_rating(
            "Safety (1=unsafe, 5=fully safe)",
            valid_range=(1, 5)
        )
        conf = self._get_rating(
            "Your confidence in these ratings (1=very unsure, 5=very sure)",
            valid_range=(1, 5)
        )
        issue = input("Primary issue (or press Enter to skip): ").strip() or None
        suggestion = input("Improvement suggestion (or press Enter to skip): ").strip() or None
        elapsed = time.time() - start
        return Annotation(
            annotation_id=str(uuid.uuid4())[:8],
            task_id=task.task_id,
            annotator_id=self.annotator_id,
            timestamp=time.time(),
            task_completion=tc,
            factual_accuracy=fa,
            helpfulness=h,
            safety=s,
            primary_issue=issue,
            improvement_suggestion=suggestion,
            confidence=conf,
            time_spent_seconds=elapsed,
        )

    def _get_rating(self, prompt: str, valid_range: tuple[int, int]) -> int:
        lo, hi = valid_range
        while True:
            try:
                val = int(input(f"{prompt} [{lo}-{hi}]: ").strip())
                if lo <= val <= hi:
                    return val
                print(f"Please enter a number between {lo} and {hi}")
            except ValueError:
                print("Please enter a number")


# ── Inter-annotator agreement calculator ───────────────────────────────────
class IAACalculator:
    """
    Computes inter-annotator agreement metrics.
    Supports Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha.
    """

    def cohen_kappa(
        self,
        ratings_a: list[int],
        ratings_b: list[int],
    ) -> float:
        """Cohen's kappa for two annotators on categorical ratings."""
        assert len(ratings_a) == len(ratings_b), "Lists must be same length"
        n = len(ratings_a)
        if n == 0:
            return 0.0
        categories = sorted(set(ratings_a) | set(ratings_b))
        k = len(categories)
        cat_to_idx = {c: i for i, c in enumerate(categories)}
        # Observed agreement
        p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
        # Expected agreement from each annotator's marginal distribution
        counts_a = [0] * k
        counts_b = [0] * k
        for a, b in zip(ratings_a, ratings_b):
            counts_a[cat_to_idx[a]] += 1
            counts_b[cat_to_idx[b]] += 1
        p_e = sum(
            (counts_a[i] / n) * (counts_b[i] / n)
            for i in range(k)
        )
        if p_e == 1.0:
            return 1.0
        return (p_o - p_e) / (1 - p_e)

    def fleiss_kappa(
        self,
        ratings_matrix: list[list[Optional[int]]],
    ) -> float:
        """
        Fleiss' kappa for multiple annotators.
        ratings_matrix: rows = items, columns = annotators
        Fleiss' kappa assumes every item has the same number of ratings;
        use krippendorff_alpha when ratings are missing.
        """
        n_items = len(ratings_matrix)
        if n_items == 0:
            return 0.0
        all_cats = sorted({r for row in ratings_matrix for r in row if r is not None})
        k = len(all_cats)
        cat_to_idx = {c: i for i, c in enumerate(all_cats)}
        # Count matrix: n_items x k
        count_matrix = [[0] * k for _ in range(n_items)]
        for i, row in enumerate(ratings_matrix):
            for r in row:
                if r is not None:
                    count_matrix[i][cat_to_idx[r]] += 1
        # P_i: proportion of agreeing pairs per item
        p_i_list = []
        for i in range(n_items):
            n_i = sum(count_matrix[i])
            if n_i < 2:
                continue
            p_i = sum(count_matrix[i][j] * (count_matrix[i][j] - 1)
                      for j in range(k)) / (n_i * (n_i - 1))
            p_i_list.append(p_i)
        p_bar = mean(p_i_list) if p_i_list else 0.0
        # p_j: proportion of all assignments in each category. Divide by the
        # ratings actually present so None entries do not deflate p_j.
        total_assignments = sum(sum(row) for row in count_matrix)
        if total_assignments == 0:
            return 0.0
        p_j = [
            sum(count_matrix[i][j] for i in range(n_items)) / total_assignments
            for j in range(k)
        ]
        p_e = sum(p ** 2 for p in p_j)
        if p_e == 1.0:
            return 1.0
        return (p_bar - p_e) / (1 - p_e)

    def krippendorff_alpha(
        self,
        ratings_matrix: list[list[Optional[int]]],
        level: str = "ordinal",
    ) -> float:
        """
        Krippendorff's alpha - handles missing data and ordinal scales.
        ratings_matrix: rows = items, columns = annotators (None for missing)
        level: "nominal", "ordinal", or "interval"
        """
        def distance(a: int, b: int) -> float:
            if level == "nominal":
                return 0.0 if a == b else 1.0
            # Simplification: "ordinal" uses the same squared difference as
            # "interval", i.e. scale points are treated as equally spaced.
            # The full ordinal metric weights by cumulative category frequencies.
            return float((a - b) ** 2)

        # Observed disagreement: within-item pairs, each item weighted by
        # 1 / (m - 1) so items with more raters do not dominate.
        values = []
        d_o_sum = 0.0
        for row in ratings_matrix:
            valid = [r for r in row if r is not None]
            m = len(valid)
            if m < 2:
                continue  # items with a single rating carry no agreement info
            values.extend(valid)
            within = sum(
                distance(valid[i], valid[j])
                for i in range(m)
                for j in range(i + 1, m)
            )
            d_o_sum += 2 * within / (m - 1)
        n = len(values)
        if n < 2:
            return 0.0
        d_o = d_o_sum / n
        # Expected disagreement: all pairs of ratings, regardless of item
        across = sum(
            distance(values[i], values[j])
            for i in range(n)
            for j in range(i + 1, n)
        )
        d_e = 2 * across / (n * (n - 1))
        if d_e == 0:
            return 1.0  # no variation at all in the ratings
        return 1 - d_o / d_e


# ── Result analyzer ────────────────────────────────────────────────────────
class HumanEvalAnalyzer:
    """Analyzes human evaluation results to extract actionable insights."""

    def __init__(self, dataset: EvalDatasetManager):
        self.dataset = dataset
        self.iaa = IAACalculator()

    def compute_iaa(self) -> dict:
        """Compute IAA across all multi-annotated tasks."""
        multi_tasks = self.dataset.get_multi_annotated_tasks(min_annotations=2)
        if not multi_tasks:
            return {"error": "No tasks with multiple annotations"}
        # Build ratings matrices per dimension
        task_ids = [t.task_id for t in multi_tasks]
        dimensions = ["task_completion", "factual_accuracy", "helpfulness", "safety"]
        iaa_results = {}
        for dim in dimensions:
            # Collect all annotators' ratings per task (as a matrix)
            ratings_matrix = []
            for task_id in task_ids:
                anns = self.dataset._annotations.get(task_id, [])
                ratings = [getattr(a, dim) for a in anns if getattr(a, dim) is not None]
                if len(ratings) >= 2:
                    ratings_matrix.append(ratings)
            if not ratings_matrix:
                continue
            # Pad to same length with None
            max_len = max(len(r) for r in ratings_matrix)
            padded = [r + [None] * (max_len - len(r)) for r in ratings_matrix]
            alpha = self.iaa.krippendorff_alpha(padded, level="ordinal")
            iaa_results[dim] = round(alpha, 3)
            # Interpret
            if alpha < 0.40:
                interpretation = "Poor - rubric needs revision"
            elif alpha < 0.60:
                interpretation = "Moderate - acceptable for exploration"
            elif alpha < 0.80:
                interpretation = "Substantial - acceptable for decisions"
            else:
                interpretation = "Near-perfect - excellent"
            iaa_results[f"{dim}_interpretation"] = interpretation
        return iaa_results

    def quality_control_report(self) -> dict:
        """Identify low-quality annotators via gold standard performance."""
        gold_tasks = [
            t for t in self.dataset._tasks.values()
            if t.is_gold_standard and t.gold_rating is not None
        ]
        if not gold_tasks:
            return {"error": "No gold standard tasks configured"}
        gold_ids = {t.task_id for t in gold_tasks}
        gold_ratings = {t.task_id: t.gold_rating for t in gold_tasks}
        # Collect annotator performance on gold tasks
        annotator_performance = {}
        for task_id, anns in self.dataset._annotations.items():
            if task_id not in gold_ids:
                continue
            expected = gold_ratings[task_id]
            for ann in anns:
                aid = ann.annotator_id
                if aid not in annotator_performance:
                    annotator_performance[aid] = []
                composite = ann.composite_score()
                annotator_performance[aid].append(abs(composite - expected))
        report = {}
        for annotator_id, errors in annotator_performance.items():
            avg_error = mean(errors)
            report[annotator_id] = {
                "gold_tasks_completed": len(errors),
                "avg_error": round(avg_error, 2),
                "quality": "PASS" if avg_error < 1.0 else "FAIL",
            }
        return report

    def summary_report(self) -> dict:
        """Overall quality summary across all annotations."""
        all_anns = self.dataset.all_annotations()
        if not all_anns:
            return {"error": "No annotations yet"}
        composites = [a.composite_score() for a in all_anns]
        issues = [a.primary_issue for a in all_anns if a.primary_issue]
        avg_time = mean(a.time_spent_seconds for a in all_anns)
        # Count issues
        issue_counts = {}
        for issue in issues:
            issue_counts[issue] = issue_counts.get(issue, 0) + 1
        top_issues = sorted(issue_counts.items(), key=lambda x: x[1], reverse=True)[:5]
        return {
            "total_annotations": len(all_anns),
            "unique_tasks": len(self.dataset._tasks),
            "mean_composite_score": round(mean(composites), 3),
            "std_composite_score": round(stdev(composites) if len(composites) > 1 else 0, 3),
            "min_score": round(min(composites), 3),
            "max_score": round(max(composites), 3),
            "avg_annotation_time_seconds": round(avg_time, 1),
            "top_issues": top_issues,
            "iaa": self.compute_iaa(),
        }


# ── Feedback loop ──────────────────────────────────────────────────────────
class FeedbackLoop:
    """
    Converts human eval insights into agent improvement actions.
    """

    @staticmethod
    def generate_improvement_actions(summary: dict) -> list[str]:
        """Translate human eval findings into concrete agent improvement actions."""
        actions = []
        mean_score = summary.get("mean_composite_score", 3.0)
        if mean_score < 2.5:
            actions.append("CRITICAL: Overall quality below threshold. "
                           "Consider delaying deployment until > 3.5.")
        # Dimension-specific actions
        top_issues = summary.get("top_issues", [])
        for issue, count in top_issues:
            if "accuracy" in issue.lower() or "incorrect" in issue.lower():
                actions.append(f"Factual errors reported {count}x - "
                               "add fact-checking tool or external verification step")
            elif "incomplete" in issue.lower() or "missing" in issue.lower():
                actions.append(f"Incomplete responses reported {count}x - "
                               "revise system prompt to require comprehensive coverage")
            elif "confusing" in issue.lower() or "unclear" in issue.lower():
                actions.append(f"Clarity issues reported {count}x - "
                               "add formatting guidelines to system prompt")
            elif "safe" in issue.lower() or "appropriate" in issue.lower():
                actions.append(f"Safety concerns reported {count}x - "
                               "escalate to safety review before deployment")
        # IAA issues
        iaa = summary.get("iaa", {})
        for dim, alpha in iaa.items():
            if not isinstance(alpha, (int, float)):
                continue  # skip the "*_interpretation" strings and error messages
            if alpha < 0.40:
                actions.append(f"Low IAA on {dim} (alpha={alpha:.2f}) - "
                               "revise rubric criteria and hold annotator calibration session")
        return actions


# ── Demo ───────────────────────────────────────────────────────────────────
def demo():
    import tempfile
    import uuid

    with tempfile.TemporaryDirectory() as tmpdir:
        dataset = EvalDatasetManager(tmpdir)
        # Add sample tasks
        tasks = [
            EvalTask(
                task_id="task_001",
                query="What is the capital of France?",
                agent_response="The capital of France is Paris.",
                is_gold_standard=True,
                gold_rating=4.5,
            ),
            EvalTask(
                task_id="task_002",
                query="Explain the transformer architecture.",
                agent_response="Transformers use attention mechanisms.",
            ),
        ]
        for task in tasks:
            dataset.add_task(task)
        # Simulate annotations
        for annotator_id in ["annotator_a", "annotator_b"]:
            for task in tasks:
                annotation = Annotation(
                    annotation_id=str(uuid.uuid4())[:8],
                    task_id=task.task_id,
                    annotator_id=annotator_id,
                    timestamp=time.time(),
                    task_completion=4 if task.task_id == "task_001" else 2,
                    factual_accuracy=5 if task.task_id == "task_001" else 2,
                    helpfulness=4 if task.task_id == "task_001" else 2,
                    safety=5,
                    confidence=4,
                    time_spent_seconds=45.0,
                )
                dataset.add_annotation(annotation)
        analyzer = HumanEvalAnalyzer(dataset)
        summary = analyzer.summary_report()
        print("\n── Human Evaluation Summary ─────────────────────")
        print(f"Total annotations: {summary['total_annotations']}")
        print(f"Mean composite score: {summary['mean_composite_score']}")
        print(f"IAA: {summary['iaa']}")
        actions = FeedbackLoop.generate_improvement_actions(summary)
        print("\nRecommended actions:")
        for action in actions:
            print(f"  - {action}")


if __name__ == "__main__":
    demo()
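The Feedback Flywheel
Human evaluation is most powerful when it is a repeating cycle, not a one-time exercise:

Select representative tasks → run human evaluation → analyze failure patterns → improve the agent → re-evaluate → add new production failures to the task set → repeat.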
Each cycle tightens the loop: the insights from one evaluation directly guide the next improvement, and the next evaluation verifies that improvement worked. Over time, the eval set becomes a comprehensive library of challenging tasks, and the agent becomes reliably good at all of them.
:::danger Annotation Fatigue Destroys Data Quality
Annotators rating their 50th task in a session produce dramatically lower quality annotations than on their first 10. Cognitive fatigue causes rating drift, lower attention to edge cases, and mechanical responses. Limit sessions to 30–40 tasks maximum. Enforce breaks. Monitor annotation time per task - very fast (under 30 seconds) or very slow (over 15 minutes) annotations should be flagged for review. For crowdwork, distribute tasks across many short sessions rather than a few long ones.
:::
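The time and session-length limits in this warning are easy to check mechanically. The sketch below assumes you have each annotator's per-task times in the order the tasks were rated; the function name and flag labels are hypothetical, and the thresholds are the ones stated above.

```python
def flag_annotations(times_in_order, max_tasks=40, min_seconds=30, max_seconds=15 * 60):
    """Return (position, reason) pairs for annotations that need review.

    times_in_order: seconds spent per task, in the order they were rated.
    """
    flags = []
    for position, seconds in enumerate(times_in_order, start=1):
        if seconds < min_seconds:
            flags.append((position, "too_fast"))
        elif seconds > max_seconds:
            flags.append((position, "too_slow"))
        if position > max_tasks:
            flags.append((position, "past_session_cap"))
    return flags


# Hypothetical session: 40 normal tasks, then a rushed one and a very slow one.
session = [45] * 40 + [20, 1000]
flags = flag_annotations(session)
for position, reason in flags:
    print(position, reason)
```

Flagged items are candidates for re-annotation or expert review rather than automatic deletion; a slow annotation may simply be a hard task.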
:::warning Comparative Evaluation Requires Careful Blinding
When annotators evaluate agent version A vs version B, any signal that reveals which version is which introduces bias. Blind your annotators to version information. If they can detect which version is which from response style, format, or length, your comparative results will be confounded. This is especially important when comparing a new model against an existing one that annotators are familiar with.
:::
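A basic form of blinding is to randomize which version appears in which slot and to keep the mapping away from annotators. The sketch below is one way to do that; `blind_pair`, `unblind`, and the slot names are hypothetical. It does not remove stylistic cues such as length or formatting, which still need separate handling.

```python
import random


def blind_pair(task_id, response_a, response_b, rng):
    """Return (what the annotator sees, the hidden key used to unblind later)."""
    swap = rng.random() < 0.5  # coin flip decides which version is shown first
    first, second = (response_b, response_a) if swap else (response_a, response_b)
    shown = {"task_id": task_id, "response_1": first, "response_2": second}
    key = {
        "task_id": task_id,
        "response_1": "B" if swap else "A",
        "response_2": "A" if swap else "B",
    }
    return shown, key


def unblind(choice, key):
    """Map the annotator's choice ('response_1' or 'response_2') to a version."""
    return key[choice]


rng = random.Random(7)  # seeded so the assignment can be audited
shown, key = blind_pair("task_001", "Answer from version A", "Answer from version B", rng)

# The annotator only ever sees `shown`; `key` is stored separately.
winner = unblind("response_1", key)
print(shown["response_1"], "->", winner)
```

Store the keys where annotators cannot reach them, and unblind only after all ratings are collected.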
Interview Q&A
Q: When is human evaluation mandatory, and when can automated metrics substitute?
A: Human evaluation is mandatory in five situations. First, safety-critical deployments - no automated system can reliably catch all ways an agent might mislead in healthcare, legal, or financial domains; domain expert review is required. Second, calibrating LLM judges - without human ground truth, you cannot verify whether your automated judge measures what you intend. Third, novel task types outside your existing calibration distribution. Fourth, regulatory requirements in some industries. Fifth, significant model changes - switching model families can shift failure modes in ways existing automated metrics do not capture. Automated metrics can substitute when: the task has clear objective correctness (code tests, exact match), the automated metric is well-calibrated against human judgment for that task type, and stakes are low enough that a 10–15% error rate in the evaluation metric is acceptable.
Q: How do you design a rubric that produces high inter-annotator agreement?
A: Four principles. First, define each score level with concrete observable criteria, not vague adjectives - "response contains no verifiable factual errors" rather than "response is accurate." Second, add calibration examples: show annotators 2–3 actual examples at each score level. Third, explicitly state what NOT to consider - length, formatting preferences, stylistic choices - to prevent annotators from using different implicit criteria. Fourth, conduct a calibration session before the study: have all annotators rate the same 10 tasks independently, discuss disagreements, and refine the rubric until IAA exceeds 0.67 (Krippendorff's alpha) on the calibration set. Do not skip the pre-study calibration.
Q: Explain Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha. When do you use each?
A: Cohen's kappa measures agreement between exactly two annotators on categorical data. It corrects for chance agreement. Use it when you have exactly two annotators per item. Fleiss' kappa extends Cohen's to multiple annotators on categorical data - use when you have 3+ annotators per item with complete ratings. Krippendorff's alpha is the most general: it handles any number of annotators, missing data (not every annotator rates every item), and different scale types - nominal (categories), ordinal (ordered categories like Likert), and interval (true numeric scales). For agent evaluation with Likert scale ratings, Krippendorff's alpha with ordinal distance is most appropriate because the distance between a 1 and a 5 should matter more than between a 3 and a 4.
Q: How do you control for quality in crowdworker annotation?
A: Five mechanisms work together. First, gold standard tasks: insert known-answer items (10-15% of tasks) where you know the correct rating. Filter workers whose gold standard performance falls below a threshold. Second, attention checks: simple questions that require reading to answer. Third, minimum time filters: flag tasks completed in under 30 seconds as potentially rushed. Fourth, redundancy: have 3-5 workers rate each item and flag items with high disagreement for expert review. Fifth, ongoing monitoring: track each worker's agreement rate with peers - workers consistently below 60% agreement should be blocked. On managed platforms like Scale AI, quality control is built in; on MTurk, you implement it yourself.
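Two of these mechanisms, gold standard filtering and minimum time filters, can be sketched in a few lines. Everything here is a hypothetical illustration: the `Annotation` record, the function names, and the 0.8 gold accuracy threshold are assumptions (the text above leaves that threshold unspecified); only the 30-second cutoff comes from the answer itself.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One worker's rating of one item (hypothetical record layout)."""
    worker_id: str
    item_id: str
    rating: int
    seconds_spent: float


def rushed_annotations(annotations, min_seconds=30.0):
    """Return annotations completed faster than the minimum time filter."""
    return [a for a in annotations if a.seconds_spent < min_seconds]


def gold_accuracy(annotations, gold_answers):
    """Map each worker to the fraction of gold standard items they rated correctly."""
    correct, total = {}, {}
    for a in annotations:
        if a.item_id in gold_answers:
            total[a.worker_id] = total.get(a.worker_id, 0) + 1
            if a.rating == gold_answers[a.item_id]:
                correct[a.worker_id] = correct.get(a.worker_id, 0) + 1
    return {worker: correct.get(worker, 0) / n for worker, n in total.items()}


def workers_to_filter(annotations, gold_answers, min_accuracy=0.8):
    """Workers whose gold standard accuracy falls below the (assumed) threshold."""
    accuracy = gold_accuracy(annotations, gold_answers)
    return sorted(w for w, acc in accuracy.items() if acc < min_accuracy)


# Hypothetical data: g1 and g2 are gold items with known ratings; t1 is a regular task.
gold = {"g1": 4, "g2": 2}
batch = [
    Annotation("w1", "g1", 4, 95.0),
    Annotation("w1", "g2", 2, 80.0),
    Annotation("w1", "t1", 3, 110.0),
    Annotation("w2", "g1", 1, 12.0),   # wrong on gold and suspiciously fast
    Annotation("w2", "g2", 2, 45.0),
    Annotation("w2", "t1", 5, 50.0),
]
flagged_workers = workers_to_filter(batch, gold)
flagged_fast = rushed_annotations(batch)
```

Here worker `w2` scores 50% on gold items and is flagged, and one of their annotations is also caught by the time filter. Flagging rather than silently dropping keeps a record you can audit when tuning thresholds.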
Q: What is the feedback flywheel for human evaluation, and why does it matter?
A: The feedback flywheel is the iterative cycle: select representative tasks → run human evaluation → analyze failure patterns → improve the agent → re-evaluate → add new production failures to the task set → repeat. It matters because a single human evaluation study tells you where the agent fails today. A flywheel tells you whether your improvements are working, catches new failure modes as user behavior evolves, and continuously builds a more challenging and representative evaluation set. Without the flywheel, eval is a one-time gate. With it, eval becomes the engine of continuous quality improvement. The key investment: after every evaluation cycle, add the most important failure cases to the eval set so the next evaluation catches them automatically.
Quick Reference: Human Evaluation Checklist
Before launching any human evaluation study, verify:
- Evaluation goal defined: What decision does this evaluation inform?
- Task selection strategy: Representative, diverse, adversarial, edge cases
- Annotator type chosen: Expert, crowdworker, or target user - and why
- Rubric written: Each criterion has precise level descriptions and calibration examples
- Calibration session planned: All annotators rate the same 10 examples before the study
- Gold standard tasks embedded: 10-15% of tasks have known correct answers
- IAA target set: Krippendorff's alpha must exceed 0.67 before labels are used
- Session length capped: Maximum 60-75 minutes per annotator session
- Annotation interface tested: At least 5 annotators have pilot-tested the interface
- Disagreement analysis planned: High-disagreement cases will be reviewed for rubric improvement
- Feedback loop mechanism: How will results improve the agent and the eval set?
This checklist converts the principles in this lesson into a pre-study verification routine. Any item that fails before launch indicates a gap that will reduce label quality and ultimately limit what you can learn from the study.
Further Reading
- Krippendorff (2004), "Content Analysis: An Introduction to Its Methodology" - the definitive reference for Krippendorff's alpha and content analysis methodology
- Clark et al. (2021), "All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text" - important study on the reliability of human evaluation protocols and how to improve them
- Module 08 Lesson 05: LLM-as-Agent-Judge - the automated scaling layer that human evaluation calibrates and validates
- Module 08 Lesson 07: Production Agent Monitoring - how production data feeds back into your human evaluation task selection
Human evaluation is the most expensive and most valuable component of your evaluation stack. Done well, it provides ground truth that everything else is calibrated against. Done poorly, it provides expensive noise. The investment in protocol design - rubrics, calibration, IAA measurement - is what makes the difference. Spend disproportionate time on it relative to the time you spend running the actual study.
Continue to the final lesson in this module - Production Agent Monitoring - where the insights from human evaluation combine with real-time metrics and distributed tracing to create a complete production quality system.
The three layers of the evaluation stack - automated metrics at scale, LLM judges for quality scoring, and human evaluation for ground truth - are strongest together. No single layer is sufficient. All three, designed well and integrated carefully, form the engineering foundation for trustworthy production agents.
:::note Lesson Connections This lesson connects to Lesson 05 (LLM as Judge) in one direction - human labels calibrate the judge - and to Lesson 07 (Production Monitoring) in the other direction - production data provides the task selection input for human evaluation. The three lessons together describe a complete quality measurement cycle that can sustain agent improvement indefinitely. :::
