:::tip Interactive Playground Visualize this concept: Try the Human Evaluation Process demo on the EngineersOfAI Playground - no code required. :::
Annotation Pipelines
When Labeling Bias Becomes Model Bias
A medical AI startup spent six months training a diagnosis-assistance model on carefully collected clinical notes from three hospital networks. They hired fifteen annotators to label symptom severity scores on a 1-5 scale. The team ran inter-annotator agreement checks monthly and found an average pairwise Cohen's Kappa of 0.74 - above the 0.70 threshold they had set internally. Everything looked clean.
After deploying to a pilot hospital, the clinical staff noticed the model was consistently underestimating symptom severity for patients over 65. The model was predicting "mild" at a rate 40% higher than the clinical team's judgments on the same cases. The discrepancy triggered an investigation.
The root cause: four of the fifteen annotators - all younger than 30 and without clinical backgrounds - had a shared misinterpretation of the severity rubric. For elderly patients, they were applying a "relative to baseline health" interpretation rather than the "absolute severity" interpretation the guideline authors intended. The guidelines had not explicitly addressed this distinction. These four annotators had also worked on a disproportionate share of the elderly patient cases due to an unbalanced task assignment.
Cohen's Kappa across the full annotator pool had been high enough to pass the threshold, but the systematic bias within that subgroup was invisible at the aggregate level. Six months of model training, regulatory preparation, and clinical validation - compromised by an annotation guideline that failed to specify one crucial distinction. The team had to relabel 23,000 examples and retrain from scratch.
This is the annotation quality problem in its most expensive form. Your model learns from labels, and labels come from humans following guidelines. Guidelines that are ambiguous, inconsistently interpreted, or silent on important edge cases produce labels that look consistent at the surface while encoding systematic biases at depth. The annotation pipeline is where AI data quality is either secured or destroyed - and it is far cheaper to get right the first time than to fix after training.
The Annotation Pipeline Architecture
Every production annotation pipeline has the same fundamental stages. Understanding each stage is necessary for knowing where failures occur and how to catch them before they propagate.
Stage 1: Sampling Strategy
Annotation budgets are always limited. The sampling strategy determines what goes into the training set, which determines what the model learns. Random sampling ensures distributional coverage but often under-samples rare, important cases. Active learning (covered in the next lesson) lets the model guide sampling toward its uncertainty boundaries. Most production pipelines use a hybrid: random sampling for baseline coverage, stratified sampling to ensure rare category representation, and active learning as the system matures.
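The hybrid approach can be sketched in a few lines. This is a minimal illustration, not a production sampler: the function name `hybrid_sample`, the `category` key, and the `min_per_stratum` parameter are assumptions made for the example, and the active learning component is left out because it depends on a trained model.

```python
import random
from collections import defaultdict


def hybrid_sample(items, budget, min_per_stratum, key="category", seed=0):
    """Pick items for annotation: a per-stratum floor first, then a uniform random fill.

    Note: if the floors alone exceed `budget`, the result is larger than `budget`.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for item in items:
        by_stratum[item[key]].append(item)

    selected = []
    chosen_ids = set()
    # Stratified floor: guarantee that rare categories are represented.
    for stratum in sorted(by_stratum):
        pool = by_stratum[stratum]
        for item in rng.sample(pool, min(min_per_stratum, len(pool))):
            selected.append(item)
            chosen_ids.add(item["id"])

    # Random fill: spend the remaining budget on uniform sampling for baseline coverage.
    remaining = [item for item in items if item["id"] not in chosen_ids]
    n_fill = max(0, budget - len(selected))
    selected.extend(rng.sample(remaining, min(n_fill, len(remaining))))
    return selected


# Hypothetical pool: 95 common items and 5 rare ones.
items = (
    [{"id": f"c{i}", "category": "common"} for i in range(95)]
    + [{"id": f"r{i}", "category": "rare"} for i in range(5)]
)
sample = hybrid_sample(items, budget=20, min_per_stratum=3)
```

With purely random sampling, a 20-item draw from this pool would contain about one rare item on average; the floor guarantees at least three.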
Stage 2: Annotation Guidelines
The single most impactful document in the pipeline. If guidelines are ambiguous, everything downstream is unreliable. Guidelines must be treated as living engineering artifacts - versioned, tested with pilot annotation rounds, and updated when edge cases reveal gaps.
Stage 3: Task Assignment with Overlap
Overlap (multiple annotators labeling the same item) enables quality measurement. Without overlap, you cannot compute inter-annotator agreement. The amount of overlap depends on task criticality: binary classification tasks might use 2-annotator overlap, while medical or safety-critical tasks should use 3-5 annotator overlap.
Stage 4: Quality Control
Continuous measurement of agreement and gold task accuracy. Agreement below threshold triggers conflict resolution. Annotators whose gold task accuracy drops below threshold are flagged for retraining or removal.
Stage 5: Conflict Resolution
When annotators disagree, the conflict must be resolved before the label enters training. Resolution strategies range from simple majority vote to weighted voting (by annotator accuracy) to LLM adjudication to senior expert review.
Annotation Guidelines: Engineering the Foundation
The annotation guideline document is more important than the annotation tool, the annotator pool, or the quality control process. Most downstream quality issues trace back to guideline ambiguity. Good guidelines are specific, full of examples, and explicit about edge cases.
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class AnnotationExample:
"""A single example with label and explicit reasoning."""
text: str
label: str
reason: str
is_edge_case: bool = False
@dataclass
class AnnotationCategory:
"""A single label category with full specification."""
name: str
description: str
criteria: list[str] # Measurable, unambiguous criteria
positive_examples: list[AnnotationExample] # Correctly labeled YES
negative_examples: list[AnnotationExample] # Should NOT receive this label
edge_cases: list[AnnotationExample] # Hard cases with explicit decisions
common_mistakes: list[str] # Errors annotators commonly make for this category
@dataclass
class AnnotationGuideline:
"""Complete annotation guideline specification."""
task_name: str
version: str
created_at: str
updated_at: str
categories: list[AnnotationCategory]
general_rules: list[str]
tie_breaking_rules: list[str] # When uncertain, default to...
quality_bar: str # What distinguishes good from acceptable
skip_criteria: list[str] # When to mark "cannot determine"
review_history: list[dict] # Track changes with reasoning
def to_markdown(self) -> str:
"""Render as annotator-readable markdown document."""
lines = [
f"# Annotation Guidelines: {self.task_name}",
f"**Version**: {self.version} | **Updated**: {self.updated_at}\n",
"---\n",
"## General Rules",
]
for i, rule in enumerate(self.general_rules, 1):
lines.append(f"{i}. {rule}")
lines.append("\n## Tie-Breaking Rules")
for rule in self.tie_breaking_rules:
lines.append(f"- {rule}")
lines.append("\n## When to Skip")
for skip in self.skip_criteria:
lines.append(f"- {skip}")
lines.append("\n## Categories\n")
for cat in self.categories:
lines.append(f"### {cat.name}")
lines.append(f"\n{cat.description}\n")
lines.append("**Criteria (all must apply):**")
for c in cat.criteria:
lines.append(f"- {c}")
if cat.positive_examples:
lines.append("\n**Label YES - Positive Examples:**")
for ex in cat.positive_examples:
lines.append(f'- "{ex.text}"')
lines.append(f" - Label: **{ex.label}** | Reason: {ex.reason}")
if cat.negative_examples:
lines.append("\n**Label NO - Negative Examples:**")
for ex in cat.negative_examples:
lines.append(f'- "{ex.text}"')
lines.append(f" - Label: **{ex.label}** | Reason: {ex.reason}")
if cat.edge_cases:
lines.append("\n**Edge Cases (read carefully):**")
for ex in cat.edge_cases:
lines.append(f'- "{ex.text}"')
lines.append(f" - Correct label: **{ex.label}** | Because: {ex.reason}")
if cat.common_mistakes:
lines.append("\n**Common Annotator Mistakes:**")
for mistake in cat.common_mistakes:
lines.append(f"- {mistake}")
lines.append("")
return "\n".join(lines)
def get_version_hash(self) -> str:
"""Reproducible hash for provenance tracking."""
import hashlib
content = self.task_name + self.version + str(self.categories)
return hashlib.sha256(content.encode()).hexdigest()[:12]
# Example: medical symptom severity guideline
SEVERITY_GUIDELINE = AnnotationGuideline(
task_name="Patient Symptom Severity",
version="v3.0",
created_at="2025-01-15",
updated_at="2025-06-01",
categories=[
AnnotationCategory(
name="Severe (5)",
description=(
"Symptoms that are life-threatening, require immediate intervention, "
"or cause complete inability to perform basic activities of daily living."
),
criteria=[
"Requires immediate medical attention or hospitalization",
"Patient cannot perform basic self-care (eating, hygiene, mobility) without assistance",
"Vital signs outside safe range OR patient reports inability to function",
],
positive_examples=[
AnnotationExample(
text="Patient cannot get out of bed, has not eaten in 3 days, "
"reports severe chest pain radiating to left arm.",
label="5",
reason="Life-threatening presentation, immediate intervention required"
),
],
negative_examples=[
AnnotationExample(
text="Patient reports significant back pain but is still working.",
label="3",
reason="Impactful but not preventing basic function - this is Moderate, not Severe"
),
],
edge_cases=[
AnnotationExample(
text="85-year-old patient says 'I feel a bit off, some chest tightness, "
"nothing too bad for someone my age.'",
label="4",
reason=(
"CRITICAL EDGE CASE: Do NOT adjust severity relative to patient age or "
"baseline health. Chest tightness in any patient should be scored on absolute "
"severity criteria. Patient minimization ('for my age') should not reduce score. "
"Score 4 (High) given active cardiac symptom, not 2 (Low) because patient is "
"minimizing. Age-relative interpretation is the #1 annotation error for elderly patients."
),
is_edge_case=True
),
],
common_mistakes=[
"Adjusting severity downward because the patient seems stoic or minimizes symptoms",
"Adjusting severity downward because patient is elderly and 'expects' some discomfort",
"Confusing functional impairment (can't work) with life-threatening (can't breathe)",
],
),
],
general_rules=[
"Score ABSOLUTE severity - never relative to the patient's age, baseline, or stated expectations",
"Patient's own words about severity are one data point, not the determining factor",
"When in doubt between two adjacent scores, choose the higher one (err toward safety)",
"If you would call 911 for this patient, score 5. If you would tell them to go to urgent care today, score 4.",
],
tie_breaking_rules=[
"Between 4 and 5: choose 4 (hospitalization threshold is high)",
"Between any other adjacent scores: choose the higher score",
"When patient's self-report contradicts clinical indicators: weight clinical indicators",
],
quality_bar=(
"A good severity label reflects the clinical reality of the patient's presentation "
"as understood by an experienced emergency physician, not the patient's own rating."
),
skip_criteria=[
"Clinical notes are corrupted, incomplete, or clearly transcription errors",
"The note does not contain enough information to assess severity",
"You are not confident you understand the medical terminology used",
],
review_history=[
{
"version": "v3.0",
"change": "Added explicit edge case for elderly patient minimization",
"reason": "Systematic annotation error found in v2.x audit",
"affected_categories": ["Severe (5)", "High (4)"],
},
],
)
:::tip Start Guideline Development with Disagreement Mining Before writing guidelines from scratch, have five annotators independently label 100 items. Feed all annotations to an LLM with the prompt: "These annotators disagreed on X items. Identify the specific edge cases and ambiguities that are causing disagreement." The disagreement points reveal exactly which edge cases your guidelines need to address. This process, which takes one day, saves weeks of guideline iteration by surfacing real boundary cases rather than theoretical ones. :::
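Before handing pilot annotations to an LLM, it helps to isolate the contested items and rank them. The sketch below is one possible way to do that, assuming labels are collected as `{item_id: [label, ...]}`; the helper names `label_entropy` and `mine_disagreements` are illustrative, and entropy is only one of several reasonable ways to rank disagreement.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the label distribution for one item."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mine_disagreements(annotations, top_k=20):
    """Return non-unanimous items, most contested first.

    annotations: {item_id: [label from annotator 1, label from annotator 2, ...]}
    """
    contested = [
        {"item_id": item_id, "labels": labels, "entropy": round(label_entropy(labels), 3)}
        for item_id, labels in annotations.items()
        if len(set(labels)) > 1
    ]
    contested.sort(key=lambda row: row["entropy"], reverse=True)
    return contested[:top_k]


# Hypothetical pilot round: five annotators, three clinical notes.
pilot = {
    "note_1": ["3", "3", "3", "3", "3"],  # unanimous, filtered out
    "note_2": ["2", "4", "4", "2", "4"],  # 2-vs-3 split, most contested
    "note_3": ["3", "3", "4", "3", "3"],  # single dissenter
}
hard_cases = mine_disagreements(pilot)
```

The ranked list is what goes into the LLM prompt or a guideline review meeting: the near-even splits usually point at guideline ambiguity, while single dissenters more often point at an individual annotator.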
Task Assignment and Overlap Strategy
Overlap (having multiple annotators label the same item) is the foundation of quality measurement. Without overlap, you cannot compute inter-annotator agreement, detect systematic annotator biases, or validate individual decisions through majority vote.
from dataclasses import dataclass, field
from typing import Optional
import uuid
import random
import hashlib
@dataclass
class AnnotationTask:
"""A single annotation task assigned to one or more annotators."""
task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
data_id: str = ""
data_content: str = ""
data_metadata: dict = field(default_factory=dict)
assigned_annotators: list[str] = field(default_factory=list)
labels: dict = field(default_factory=dict) # {annotator_id: label}
is_gold: bool = False # Gold = known-answer calibration task
gold_answer: str = "" # Correct answer for gold tasks (hidden from annotators)
guideline_version: str = "" # Which guideline version applies
created_at: float = 0.0
status: str = "pending" # pending, in_progress, complete, disputed
class TaskAssigner:
"""
Distributes annotation tasks with appropriate overlap and gold task injection.
Design decisions encoded here:
- Overlap by task type: medical/legal get 5 annotators, binary gets 2
- Gold task ratio: 10% of tasks are known-answer calibration
- Assignment is deterministic by data_id: same item always gets same annotators
- Gold tasks are interleaved randomly to prevent annotators from identifying them
"""
OVERLAP_BY_TASK_TYPE = {
"binary_classification": 2,
"multiclass": 3,
"subjective_rating": 3,
"preference_comparison": 3,
"medical": 5,
"legal": 5,
"safety_critical": 5,
}
MIN_ANNOTATORS_FOR_STATS = {
"binary_classification": 20, # Need 20 shared items for reliable Kappa
"medical": 30,
"legal": 30,
}
def __init__(
self,
annotators: list[str],
task_type: str = "binary_classification",
gold_ratio: float = 0.10,
guideline_version: str = "v1.0",
):
self.annotators = annotators
self.overlap = self.OVERLAP_BY_TASK_TYPE.get(task_type, 3)
self.gold_ratio = gold_ratio
self.guideline_version = guideline_version
self._assignment_cache: dict[str, list[str]] = {}
def assign_batch(
self,
data_items: list[dict],
gold_items: list[dict],
) -> dict[str, list[AnnotationTask]]:
"""
Create annotation tasks with overlap and gold task injection.
Returns: {annotator_id: [AnnotationTask, ...]}
Gold items are randomly interleaved with regular tasks so annotators
cannot identify calibration tasks by position or pattern.
"""
assignments: dict[str, list[AnnotationTask]] = {
ann: [] for ann in self.annotators
}
# Create regular tasks with overlap
for item in data_items:
selected = self._select_annotators(item["id"])
task = AnnotationTask(
data_id=item["id"],
data_content=item.get("text", item.get("content", "")),
data_metadata={k: v for k, v in item.items()
if k not in ("id", "text", "content")},
assigned_annotators=selected,
is_gold=False,
guideline_version=self.guideline_version,
)
for ann in selected:
assignments[ann].append(task)
# Inject gold calibration tasks
n_gold = max(1, int(len(data_items) * self.gold_ratio))
sampled_gold = random.sample(gold_items, min(n_gold, len(gold_items)))
for gold in sampled_gold:
# All annotators get gold tasks to enable cross-annotator calibration
gold_task = AnnotationTask(
data_id=gold["id"],
data_content=gold.get("text", gold.get("content", "")),
assigned_annotators=list(self.annotators),
is_gold=True,
gold_answer=gold.get("correct_label", ""),
guideline_version=self.guideline_version,
)
for ann in self.annotators:
assignments[ann].append(gold_task)
# Shuffle each queue to interleave gold and regular tasks
for ann in assignments:
random.shuffle(assignments[ann])
return assignments
def _select_annotators(self, data_id: str) -> list[str]:
"""
Deterministic annotator selection based on data_id.
Same data item always gets same annotators - enables reproducibility.
"""
if data_id in self._assignment_cache:
return self._assignment_cache[data_id]
seed = int(hashlib.sha256(data_id.encode()).hexdigest(), 16) % (2**31)
rng = random.Random(seed)
selected = rng.sample(self.annotators, min(self.overlap, len(self.annotators)))
self._assignment_cache[data_id] = selected
return selected
def estimate_annotation_cost(
self,
n_items: int,
cost_per_annotation: float,
time_per_annotation_minutes: float,
n_annotators_available: int,
) -> dict:
"""Estimate cost and time for annotation batch."""
total_annotations = n_items * self.overlap
gold_annotations = int(n_items * self.gold_ratio) * len(self.annotators)
total_with_gold = total_annotations + gold_annotations
return {
"n_items": n_items,
"total_annotations": total_with_gold,
"labor_cost_usd": round(total_with_gold * cost_per_annotation, 2),
"total_hours": round(total_with_gold * time_per_annotation_minutes / 60, 1),
"days_with_n_annotators": round(
total_with_gold * time_per_annotation_minutes / 60 / (n_annotators_available * 8),
1
),
}
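To see how overlap and gold tasks drive the budget, the arithmetic can be worked through on its own. The standalone function below mirrors the cost calculation above but takes the gold item count directly; the function name and all the numbers (10,000 items, 15 annotators, $0.50 and 2 minutes per label) are hypothetical values chosen for illustration.

```python
def annotation_budget(n_items, overlap, n_gold, n_annotators,
                      cost_per_label, minutes_per_label):
    """Label counts and cost when every annotator sees every gold item."""
    regular = n_items * overlap      # each regular item is labeled `overlap` times
    gold = n_gold * n_annotators     # each gold item is labeled by the whole pool
    total = regular + gold
    return {
        "regular_labels": regular,
        "gold_labels": gold,
        "total_labels": total,
        "gold_share": round(gold / total, 3),
        "labor_cost_usd": round(total * cost_per_label, 2),
        "annotator_hours": round(total * minutes_per_label / 60, 1),
    }


# Hypothetical medical task: 5x overlap, 1,000 gold items, pool of 15 annotators.
budget = annotation_budget(
    n_items=10_000, overlap=5, n_gold=1_000, n_annotators=15,
    cost_per_label=0.50, minutes_per_label=2.0,
)
```

In this scenario the 10,000 items turn into 65,000 labels, and gold tasks account for roughly 23% of them because every annotator labels every gold item. With a large annotator pool, assigning each gold item to a subset of annotators is one way to bring that share down.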
Inter-Annotator Agreement: The Quality Signal
Cohen's Kappa and Fleiss's Kappa measure whether annotators agree beyond what chance alone would produce. Raw agreement percentage is insufficient - if 90% of items belong to one class and two annotators each assign that class to 90% of items without regard to content, they still agree on about 82% of items (0.9 * 0.9 + 0.1 * 0.1 = 0.82) while encoding no actual information.
The formula for Cohen's Kappa is:
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ is the observed agreement proportion and $p_e$ is the expected agreement by chance.
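A small worked example makes the chance correction concrete. Kappa is the observed agreement minus the chance agreement, divided by one minus the chance agreement. The sketch below uses a hypothetical dataset of 100 items in which two annotators each label 90% of items "neg", but which items they pick is unrelated; the function name `kappa_parts` is an assumption for this example.

```python
from collections import Counter


def kappa_parts(labels_a, labels_b):
    """Return (observed agreement, chance agreement, kappa) for two annotators."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same class given their own label rates.
    p_e = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Annotator A: items 0-89 "neg", items 90-99 "pos".
a = ["neg"] * 90 + ["pos"] * 10
# Annotator B: same 90/10 label rate, but spread independently of A's choices.
b = ["neg"] * 81 + ["pos"] * 9 + ["neg"] * 9 + ["pos"] * 1

p_o, p_e, kappa = kappa_parts(a, b)
```

Here the annotators agree on 82 of 100 items, which looks respectable as a raw percentage, yet the chance agreement is also 0.82, so kappa comes out at zero: the labels carry no shared signal.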
from collections import defaultdict
from typing import Optional
import math
def cohens_kappa(
labels_a: list,
labels_b: list,
) -> float:
"""
Cohen's Kappa: agreement between exactly two annotators.
Interpretation thresholds (Landis & Koch 1977):
kappa < 0.00 → Poor agreement
0.00 - 0.20 → Slight agreement
0.20 - 0.40 → Fair agreement
0.40 - 0.60 → Moderate agreement
0.60 - 0.80 → Substantial agreement
0.80 - 1.00 → Almost perfect agreement
For production HITL:
- Binary tasks: target kappa >= 0.70
- Medical/legal: target kappa >= 0.80
- Subjective preference: kappa >= 0.60 acceptable
"""
assert len(labels_a) == len(labels_b), "Label lists must be same length"
n = len(labels_a)
if n == 0:
return 0.0
categories = sorted(set(labels_a) | set(labels_b))
# Observed agreement
po = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
# Expected agreement by chance
pe = sum(
(labels_a.count(cat) / n) * (labels_b.count(cat) / n)
for cat in categories
)
if pe >= 1.0:
return 1.0 # Edge case: only one category used
return (po - pe) / (1 - pe)
def fleiss_kappa(
annotations: dict, # {item_id: {annotator_id: label}}
) -> float:
"""
Fleiss's Kappa: agreement among multiple annotators.
Generalization of Cohen's Kappa to N > 2 annotators.
Formula:
kappa = (P_bar - P_e) / (1 - P_e)
where:
P_bar = mean agreement per item across all annotators
P_e = expected agreement by chance across all labels
"""
# Keep only items with at least two labels; a single label carries no agreement signal.
# (Returning 0.0 whenever any item has one label would zero out the metric mid-batch.)
items = [item for item, labels in annotations.items() if len(labels) >= 2]
if not items:
return 0.0
all_labels: set = set()
for item in items:
all_labels.update(annotations[item].values())
n_items = len(items)
# Label counts per item
label_counts: dict = {}
for item_id, labels in annotations.items():
label_counts[item_id] = {cat: 0 for cat in all_labels}
for label in labels.values():
label_counts[item_id][label] = label_counts[item_id].get(label, 0) + 1
# Agreement per item P_i
P_i = []
for item_id in items:
counts = label_counts[item_id]
n = len(annotations[item_id])
if n <= 1:
P_i.append(0.0)
else:
p = sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
P_i.append(p)
P_bar = sum(P_i) / n_items
# Expected agreement by chance P_e
total = sum(len(annotations[item]) for item in items)
P_j = {}
for cat in all_labels:
count = sum(label_counts[item][cat] for item in items)
P_j[cat] = count / total
P_e = sum(p ** 2 for p in P_j.values())
if P_e >= 1.0:
return 1.0
return (P_bar - P_e) / (1 - P_e)
class AgreementMonitor:
"""
Real-time inter-annotator agreement monitoring.
Computes pairwise kappa and Fleiss kappa, alerts on threshold violations.
"""
THRESHOLDS = {
"binary_classification": 0.70,
"multiclass": 0.65,
"subjective_rating": 0.60,
"medical": 0.80,
"legal": 0.80,
"safety_critical": 0.85,
"preference_comparison": 0.65,
}
def __init__(self, task_type: str = "binary_classification"):
self.task_type = task_type
self.threshold = self.THRESHOLDS.get(task_type, 0.70)
self._task_labels: dict = defaultdict(dict) # {item_id: {annotator_id: label}}
def record_label(self, item_id: str, annotator_id: str, label) -> None:
"""Record an annotator's label for an item."""
self._task_labels[item_id][annotator_id] = label
def compute_pairwise_agreement(
self,
min_shared_items: int = 20,
) -> dict:
"""
Compute pairwise Cohen's Kappa between all annotator pairs.
Only computed when pairs share >= min_shared_items.
"""
all_annotators = set()
for labels in self._task_labels.values():
all_annotators.update(labels.keys())
annotators = sorted(all_annotators)
pairwise = {}
for i in range(len(annotators)):
for j in range(i + 1, len(annotators)):
ann_a, ann_b = annotators[i], annotators[j]
shared_items = [
item for item, labels in self._task_labels.items()
if ann_a in labels and ann_b in labels
]
if len(shared_items) < min_shared_items:
continue
labels_a = [self._task_labels[item][ann_a] for item in shared_items]
labels_b = [self._task_labels[item][ann_b] for item in shared_items]
kappa = cohens_kappa(labels_a, labels_b)
pairwise[f"{ann_a} vs {ann_b}"] = {
"kappa": round(kappa, 3),
"n_shared": len(shared_items),
"interpretation": self._interpret_kappa(kappa),
"below_threshold": kappa < self.threshold,
}
# Overall Fleiss Kappa
fleiss = fleiss_kappa(dict(self._task_labels))
low_pairs = [k for k, v in pairwise.items() if v["below_threshold"]]
avg_kappa = (
sum(v["kappa"] for v in pairwise.values()) / len(pairwise)
if pairwise else 0.0
)
return {
"task_type": self.task_type,
"threshold": self.threshold,
"fleiss_kappa": round(fleiss, 3),
"avg_pairwise_kappa": round(avg_kappa, 3),
"overall_pass": fleiss >= self.threshold,
"pairwise": pairwise,
"low_agreement_pairs": low_pairs,
"action_required": len(low_pairs) > 0,
}
def _interpret_kappa(self, kappa: float) -> str:
if kappa < 0.20:
return "Poor - guideline rewrite needed"
elif kappa < 0.40:
return "Fair - significant guideline gaps"
elif kappa < 0.60:
return "Moderate - edge cases need clarification"
elif kappa < 0.80:
return "Substantial - acceptable for most tasks"
else:
return "Almost perfect"
def identify_problematic_categories(self) -> dict:
"""
Find which label categories have lowest agreement.
Categories with low intra-category agreement need guideline work.
"""
category_agreements: dict = defaultdict(list)
for item_id, labels in self._task_labels.items():
annotator_list = list(labels.keys())
label_list = list(labels.values())
for i in range(len(annotator_list)):
for j in range(i + 1, len(annotator_list)):
label_a = label_list[i]
label_b = label_list[j]
category_agreements[label_a].append(label_a == label_b)
if label_a != label_b:
category_agreements[label_b].append(False)
return {
cat: {
"agreement_rate": round(sum(agreements) / len(agreements), 3),
"n_comparisons": len(agreements),
}
for cat, agreements in category_agreements.items()
if agreements
}
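Aggregate kappa can pass while one group of annotators is systematically off on one slice of the data, as in the case study at the top of this lesson. One way to look for that, sketched below under stated assumptions, is to compute each annotator's signed deviation from their peers per data slice. The record layout, the `slice` field, and the function name `signed_bias_by_slice` are hypothetical, and the approach assumes ordinal numeric scores.

```python
from collections import defaultdict
from statistics import mean


def signed_bias_by_slice(records, min_items=3):
    """Mean (own score - mean of peers' scores on the same item) per annotator and slice.

    records: [{"item_id": ..., "annotator_id": ..., "score": int, "slice": ...}]
    A clearly negative value means the annotator scores that slice lower than peers do.
    """
    by_item = defaultdict(list)
    for r in records:
        by_item[r["item_id"]].append(r)

    deviations = defaultdict(list)
    for rows in by_item.values():
        if len(rows) < 2:
            continue  # no peers to compare against
        for r in rows:
            peers = [p["score"] for p in rows if p is not r]
            deviations[(r["annotator_id"], r["slice"])].append(r["score"] - mean(peers))

    return {
        key: round(mean(devs), 2)
        for key, devs in deviations.items()
        if len(devs) >= min_items  # skip pairs with too few items to be meaningful
    }


# Hypothetical data: ann_c scores elderly patients one point lower than peers.
records = []
for i in range(4):
    for ann, score in (("ann_a", 4), ("ann_b", 4), ("ann_c", 3)):
        records.append({"item_id": f"old_{i}", "annotator_id": ann,
                        "score": score, "slice": "65+"})
    for ann in ("ann_a", "ann_b", "ann_c"):
        records.append({"item_id": f"young_{i}", "annotator_id": ann,
                        "score": 3, "slice": "under_65"})

bias = signed_bias_by_slice(records)
```

This check only works when items in each slice are also labeled by annotators outside the suspect group. If biased annotators are mostly paired with each other, which is what unbalanced assignment produces, peer comparison will not reveal the bias and gold tasks covering that slice are needed instead.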
Gold Standard Tasks and Annotator Calibration
Gold tasks - items with known correct answers - are injected into every annotator's queue. They serve two purposes: calibrating annotator accuracy over time, and detecting annotators who are not reading carefully. Comparing gold accuracy with peer agreement also helps localize a problem. A consistent 90% accuracy on gold tasks, combined with 60% agreement with peers on regular tasks, is diagnostic of a guideline problem: the annotator handles the cases the gold set covers, but the guidelines leave other cases open to interpretation. A 65% accuracy on gold combined with 90% peer agreement points to annotators sharing an interpretation that differs from that of the expert who created the gold standard.
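The two signals can be combined into a simple triage rule. The sketch below is illustrative only: the function name `diagnose`, the cutoffs `gold_ok` and `peer_ok`, and the labels for the two quadrants the text does not discuss ("healthy" and "annotator_needs_retraining") are assumptions, not thresholds taken from this lesson.

```python
def diagnose(gold_accuracy, peer_agreement, gold_ok=0.85, peer_ok=0.75):
    """Map (gold accuracy, peer agreement) to a likely root cause.

    The cutoffs are hypothetical and should be tuned per task type.
    """
    gold_good = gold_accuracy >= gold_ok
    peers_good = peer_agreement >= peer_ok
    if gold_good and peers_good:
        return "healthy"
    if gold_good:
        # Matches the expert on known cases but diverges from peers elsewhere:
        # the guideline likely leaves non-gold cases open to interpretation.
        return "guideline_gap"
    if peers_good:
        # Annotators agree with each other but not with the gold labels:
        # a shared misreading, or a gold set that no longer matches the guideline.
        return "shared_misreading_or_stale_gold"
    # Off on both signals: most likely an individual annotator issue.
    return "annotator_needs_retraining"


case_a = diagnose(gold_accuracy=0.90, peer_agreement=0.60)
case_b = diagnose(gold_accuracy=0.65, peer_agreement=0.90)
```

The two calls correspond to the two scenarios described above and return `"guideline_gap"` and `"shared_misreading_or_stale_gold"` respectively.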
from dataclasses import dataclass
from typing import Optional
from collections import defaultdict
import time
@dataclass
class GoldTaskResult:
"""Annotator accuracy assessment from gold task performance."""
annotator_id: str
accuracy: float
n_gold_tasks: int
below_threshold: bool
by_category: dict[str, float]
accuracy_trend: list[float] # Accuracy in sliding windows - detect fatigue/drift
worst_categories: list[str]
class AnnotatorCalibrator:
"""
Tracks annotator performance on gold (known-answer) tasks.
Gold tasks should:
- Represent the full distribution of difficulty
- Include clear cases AND edge cases
- Cover all label categories
- Be refreshed periodically to prevent memorization
"""
# Minimum accuracy on gold tasks to continue annotating
THRESHOLD_BY_TASK_TYPE = {
"binary_classification": 0.85,
"multiclass": 0.80,
"medical": 0.90,
"legal": 0.90,
"safety_critical": 0.92,
}
def __init__(
self,
gold_labels: dict[str, dict], # {item_id: {"label": ..., "category": ...}}
task_type: str = "binary_classification",
):
self.gold = gold_labels
self.threshold = self.THRESHOLD_BY_TASK_TYPE.get(task_type, 0.85)
self._annotator_records: dict = defaultdict(list)
# {annotator_id: [{"item_id": ..., "label": ..., "timestamp": ...}]}
def record_gold_label(
self,
annotator_id: str,
item_id: str,
label,
time_spent_seconds: float = 0.0,
) -> bool:
"""
Record an annotator's label on a gold item.
Returns True if correct, False otherwise.
"""
if item_id not in self.gold:
raise ValueError(f"Item {item_id} is not a gold item")
correct_label = self.gold[item_id]["label"]
is_correct = (str(label).strip().lower() == str(correct_label).strip().lower())
self._annotator_records[annotator_id].append({
"item_id": item_id,
"submitted_label": label,
"correct_label": correct_label,
"is_correct": is_correct,
"category": self.gold[item_id].get("category", "unknown"),
"time_spent_seconds": time_spent_seconds,
"timestamp": time.time(),
})
return is_correct
def evaluate_annotator(
self,
annotator_id: str,
min_samples: int = 20,
) -> Optional[GoldTaskResult]:
"""
Full accuracy evaluation for an annotator.
Returns None if fewer than min_samples gold tasks have been completed.
"""
records = self._annotator_records.get(annotator_id, [])
if len(records) < min_samples:
return None
# Overall accuracy
overall_accuracy = sum(r["is_correct"] for r in records) / len(records)
# Accuracy by category
cat_records: dict = defaultdict(list)
for r in records:
cat_records[r["category"]].append(r["is_correct"])
by_category = {
cat: round(sum(vals) / len(vals), 3)
for cat, vals in cat_records.items()
}
# Accuracy trend: compute over sliding windows of 20
window = 20
trend = []
for i in range(0, len(records) - window + 1, window // 2):
window_records = records[i:i + window]
trend.append(round(
sum(r["is_correct"] for r in window_records) / len(window_records), 3
))
# Worst categories
worst = sorted(
[cat for cat, acc in by_category.items() if acc < self.threshold],
key=lambda c: by_category[c]
)
return GoldTaskResult(
annotator_id=annotator_id,
accuracy=round(overall_accuracy, 3),
n_gold_tasks=len(records),
below_threshold=overall_accuracy < self.threshold,
by_category=by_category,
accuracy_trend=trend,
worst_categories=worst[:3],
)
def get_flagged_annotators(self, min_samples: int = 20) -> list[GoldTaskResult]:
"""Return results for annotators who should be reviewed or retrained."""
flagged = []
for ann_id in self._annotator_records:
result = self.evaluate_annotator(ann_id, min_samples)
if result and result.below_threshold:
flagged.append(result)
return sorted(flagged, key=lambda r: r.accuracy)
def detect_fatigue(self, annotator_id: str) -> Optional[dict]:
"""
Detect a recent accuracy drop (a possible sign of fatigue).
Compares the first and second halves of the 40 most recent gold tasks;
session boundaries are not tracked.
"""
records = self._annotator_records.get(annotator_id, [])
if len(records) < 40:
return None
# Sort by timestamp
sorted_records = sorted(records, key=lambda r: r["timestamp"])
recent = sorted_records[-40:]
first_half = recent[:20]
second_half = recent[20:]
acc_first = sum(r["is_correct"] for r in first_half) / 20
acc_second = sum(r["is_correct"] for r in second_half) / 20
drop = acc_first - acc_second
return {
"first_half_accuracy": round(acc_first, 3),
"second_half_accuracy": round(acc_second, 3),
"accuracy_drop": round(drop, 3),
"fatigue_detected": drop > 0.10, # >10% drop suggests fatigue
}
Conflict Resolution: From Disagreement to Clean Label
When annotators disagree, the conflict must be resolved before the label enters training. The resolution strategy should scale with the cost of being wrong: majority vote for routine tasks, weighted voting or expert adjudication for high-stakes decisions.
from collections import Counter
from typing import Optional
import json
import anthropic
def majority_vote(labels: list) -> Optional[str]:
"""
Simple majority vote. Returns None when the top two labels tie, which requires adjudication.
"""
if not labels:
return None
counts = Counter(labels)
most_common = counts.most_common(2)
if len(most_common) == 1:
return most_common[0][0]
if most_common[0][1] > most_common[1][1]:
return most_common[0][0]
return None # Tie - requires adjudication
def weighted_majority_vote(
labels: list,
weights: list[float],
) -> Optional[str]:
"""
Weighted majority vote - annotators with higher historical gold accuracy
receive proportionally more weight.
"""
if not labels or len(labels) != len(weights):
return None
weighted_counts: dict[str, float] = {}
for label, weight in zip(labels, weights):
weighted_counts[label] = weighted_counts.get(label, 0.0) + weight
sorted_labels = sorted(weighted_counts.items(), key=lambda x: x[1], reverse=True)
if len(sorted_labels) == 1:
return sorted_labels[0][0]
# Require clear winner (>10% weight advantage)
if (sorted_labels[0][1] - sorted_labels[1][1]) < 0.1 * sum(weights):
return None # Too close - needs expert adjudication
return sorted_labels[0][0]
class LLMAdjudicator:
"""
Uses Claude to resolve annotation disputes by applying the official guideline.
When to use:
- Majority vote fails (tie or near-tie)
- High-stakes labels (medical, legal) where errors are costly
- Cases where annotators provided conflicting reasoning
When NOT to use:
- Clear majority vote exists (use majority vote - cheaper and faster)
- Items that should simply be marked as ambiguous and excluded
"""
def __init__(self, guideline: AnnotationGuideline, model: str = "claude-opus-4-6"):
self.client = anthropic.Anthropic()
self.guideline = guideline
self.model = model
def adjudicate(
self,
data_content: str,
conflicting_labels: list[str],
annotator_reasoning: Optional[list[str]] = None,
metadata: Optional[dict] = None,
) -> dict:
"""
Resolve a labeling conflict using the annotation guideline as authority.
Returns:
- label: resolved label
- confidence: 0.0-1.0
- reasoning: explanation of the decision
- guideline_reference: which guideline criterion applies
- recommend_guideline_update: whether this case reveals a guideline gap
"""
        labels_str = "\n".join(
            f"- Annotator {i+1}: {label}" for i, label in enumerate(conflicting_labels)
        )

        reasoning_str = ""
        if annotator_reasoning:
            reasoning_str = "\n\nAnnotator reasoning provided:\n" + "\n".join(
                f"- Annotator {i+1}: {r}" for i, r in enumerate(annotator_reasoning)
            )

        metadata_str = ""
        if metadata:
            metadata_str = "\n\nAdditional context:\n" + "\n".join(
                f"- {k}: {v}" for k, v in metadata.items()
            )

        # Only the first 3,000 characters of the guideline are sent. Raise this
        # limit if tie-breaking rules or edge cases appear later in the document.
        guideline_excerpt = self.guideline.to_markdown()[:3000]

        prompt = f"""You are an expert annotation adjudicator. Your job is to determine the correct label
for a disputed annotation case, using the official guideline as authority.
ANNOTATION GUIDELINE:
{guideline_excerpt}
ITEM TO LABEL:
{data_content}
CONFLICTING ANNOTATOR LABELS:
{labels_str}
{reasoning_str}
{metadata_str}
Analyze the case carefully against the guideline criteria. Then respond with JSON:
{{
"label": "the correct label per guideline",
"confidence": 0.0-1.0,
"reasoning": "step-by-step reasoning applying guideline criteria",
"guideline_reference": "which specific criterion or rule applies",
"recommend_guideline_update": true/false,
"guideline_update_suggestion": "if true, what should be added to the guideline"
}}
Apply the guideline literally. If the guideline is ambiguous on this case,
set recommend_guideline_update to true and explain what clarification is needed."""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=600,
            messages=[{"role": "user", "content": prompt}]
        )

        try:
            text = response.content[0].text
            # Extract the outermost JSON object from the response text.
            start = text.find('{')
            end = text.rfind('}') + 1
            result = json.loads(text[start:end])
            # Valid JSON without a label is still unusable: treat it as a parse failure.
            if not isinstance(result, dict) or "label" not in result:
                raise ValueError("Adjudication response is missing the 'label' field")
            return result
        except (json.JSONDecodeError, ValueError, IndexError, AttributeError):
            # IndexError/AttributeError cover empty or non-text response content.
            return {
                "label": majority_vote(conflicting_labels),
                "confidence": 0.4,
                "reasoning": "LLM parse failed - using majority vote as fallback",
                "guideline_reference": "fallback",
                "recommend_guideline_update": False,
                "guideline_update_suggestion": None,
            }

    def adjudicate_batch(
        self,
        disputed_items: list[dict],
    ) -> list[dict]:
        """
        Adjudicate multiple disputed items in one pass.
        Convenience wrapper: items are still adjudicated one at a time
        (one API call per item), so this does not reduce API cost.
        """
        results = []
        for item in disputed_items:
            result = self.adjudicate(
                data_content=item["content"],
                conflicting_labels=item["labels"],
                annotator_reasoning=item.get("reasoning"),
                metadata=item.get("metadata"),
            )
            result["item_id"] = item["id"]
            results.append(result)

            # Surface guideline update recommendations as they come in
            if result.get("recommend_guideline_update"):
                print(f"[GUIDELINE UPDATE RECOMMENDED] Item {item['id']}: "
                      f"{result.get('guideline_update_suggestion', '')}")
        return results
LLM-Assisted Pre-Annotation
Pre-annotation uses an LLM to generate a suggested label before the human annotates. Humans then verify or correct the suggestion. This can increase throughput 3-5x for tasks where the LLM is right most of the time, while keeping humans in control for edge cases.
import anthropic
import json
from typing import Optional


class LLMPreAnnotator:
    """
    Uses Claude Haiku to pre-annotate items before human review.

    Design principle: pre-annotation accelerates human annotation but
    should never prevent humans from overriding. Show the suggestion
    prominently but make override equally accessible.

    Use Haiku for pre-annotation (cheap, fast) and reserve Opus for
    adjudication of genuinely difficult disputes.
    """

    # AnnotationGuideline is the guideline class defined earlier in this document.
    def __init__(self, guideline: AnnotationGuideline):
        self.client = anthropic.Anthropic()
        self.guideline = guideline

    def pre_annotate(
        self,
        items: list[dict],
        categories: list[str],
    ) -> list[dict]:
        """
        Generate suggested labels for a batch of items.
        Returns items with added "suggested_label", "suggested_confidence",
        and "suggested_reasoning" fields.
        """
        results = []
        for item in items:
            suggestion = self._suggest_label(
                content=item.get("text", item.get("content", "")),
                categories=categories,
            )
            results.append({
                **item,
                "suggested_label": suggestion["label"],
                "suggested_confidence": suggestion["confidence"],
                "suggested_reasoning": suggestion["reasoning"],
            })
        return results

    def _suggest_label(self, content: str, categories: list[str]) -> dict:
        """Generate a single label suggestion using Haiku."""
        guideline_excerpt = self.guideline.to_markdown()[:1500]
        prompt = f"""Apply this annotation guideline to classify the following item.
GUIDELINE (excerpt):
{guideline_excerpt}
CATEGORIES: {categories}
ITEM:
{content}
Respond as JSON: {{"label": "one of the categories", "confidence": 0.0-1.0, "reasoning": "brief"}}
If genuinely uncertain between two categories, pick the more conservative one per the tie-breaking rules."""
        try:
            response = self.client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=200,
                messages=[{"role": "user", "content": prompt}]
            )
            text = response.content[0].text
            start = text.find('{')
            end = text.rfind('}') + 1
            suggestion = json.loads(text[start:end])
            # Reject labels outside the allowed set instead of passing them to annotators.
            if suggestion.get("label") not in categories:
                raise ValueError("Suggested label is not one of the allowed categories")
            # Normalize keys so pre_annotate never hits a missing field.
            return {
                "label": suggestion["label"],
                "confidence": float(suggestion.get("confidence", 0.5)),
                "reasoning": suggestion.get("reasoning", ""),
            }
        except Exception:
            # Any API, parsing, or validation failure falls back to a low-confidence placeholder.
            return {
                "label": categories[0] if categories else "unknown",
                "confidence": 0.1,
                "reasoning": "Pre-annotation failed - annotate independently"
            }

    def compute_pre_annotation_accuracy(
        self,
        pre_annotations: list[dict],
        human_labels: list[str],
    ) -> dict:
        """
        Measure how accurate pre-annotation is vs human ground truth.
        Informs whether pre-annotation is helping or creating bias.
        """
        if len(pre_annotations) != len(human_labels):
            raise ValueError("Mismatch in pre-annotation and human label counts")
        if not human_labels:
            raise ValueError("At least one labeled item is required")

        correct = sum(
            1 for pa, hl in zip(pre_annotations, human_labels)
            if pa.get("suggested_label") == hl
        )
        accuracy = correct / len(human_labels)

        # Measure override rate by confidence bucket
        override_by_confidence: dict = {"high": [], "medium": [], "low": []}
        for pa, hl in zip(pre_annotations, human_labels):
            conf = pa.get("suggested_confidence", 0.5)
            override = pa.get("suggested_label") != hl
            if conf >= 0.85:
                override_by_confidence["high"].append(override)
            elif conf >= 0.60:
                override_by_confidence["medium"].append(override)
            else:
                override_by_confidence["low"].append(override)

        if accuracy >= 0.80:
            recommendation = "Pre-annotation is effective - accuracy at or above 80%"
        elif accuracy < 0.65:
            recommendation = "Pre-annotation may be creating bias - consider disabling"
        else:
            recommendation = "Pre-annotation is marginal - monitor for bias introduction"

        return {
            "pre_annotation_accuracy": round(accuracy, 3),
            "total_items": len(human_labels),
            "override_rate_by_confidence": {
                bucket: {
                    # max(..., 1) avoids division by zero for empty buckets
                    "override_rate": round(sum(overrides) / max(len(overrides), 1), 3),
                    "n_items": len(overrides),
                }
                for bucket, overrides in override_by_confidence.items()
            },
            "recommendation": recommendation,
        }
Annotation Quality Comparison Table
| Metric | Poor | Acceptable | Good | Medical/Legal Target |
|---|---|---|---|---|
| Cohen's Kappa | less than 0.40 | 0.40-0.60 | 0.60-0.80 | greater than 0.80 |
| Fleiss Kappa (3+ annotators) | less than 0.40 | 0.40-0.65 | 0.65-0.80 | greater than 0.80 |
| Gold Task Accuracy | less than 0.75 | 0.75-0.85 | 0.85-0.92 | greater than 0.90 |
| Annotator Agreement Rate | less than 60% | 60-75% | 75-90% | greater than 85% |
| Conflict Rate | greater than 40% | 20-40% | less than 20% | less than 15% |
| Pre-annotation Accuracy | less than 65% | 65-80% | 80-90% | greater than 85% |
:::warning Do Not Use Low-Cost Crowdsourcing for High-Stakes Data

Platforms like Amazon MTurk work for simple, clear-cut tasks (binary image classification, simple yes/no questions). For complex tasks - medical, legal, nuanced sentiment, safety-critical decisions - low-cost crowdsourcing produces labels noisy enough to mislead model training. The cost savings are routinely consumed by additional model debugging, retraining, and regulatory risk. Invest in expert annotators for high-stakes domains. The unit economics are better than they appear.

:::
:::tip Version Your Guidelines and Your Data

Every labeled example should be traceable to the specific guideline version in effect when it was labeled. When you update guidelines (which you will, as edge cases surface), you need to know which examples may need relabeling. Without version tracking, a guideline update forces a full relabeling audit of the entire dataset. With version tracking, you can isolate exactly which examples were labeled under the old interpretation and prioritize them for review.

:::
Interview Q&A
Q1: What is Cohen's Kappa and why is it preferred over raw agreement percentage for measuring annotation quality?
Cohen's Kappa corrects for the agreement that would occur by chance if annotators were randomly assigning labels proportional to the class distribution. Raw agreement percentage is misleading because it inflates on imbalanced datasets. If 95% of items belong to class A and two annotators each vote A about 95% of the time regardless of the item, they will agree on roughly 90% of items ($0.95^2 + 0.05^2 \approx 0.905$) by chance alone - but almost no actual information is being encoded. Cohen's Kappa for this scenario is close to 0, correctly reflecting that the annotators are just following the base rate rather than exercising genuine judgment. (In the degenerate case where every annotator always votes A, expected agreement is 1 and Kappa is undefined rather than 0.)
The formula is $\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ is observed agreement and $p_e$ is expected chance agreement computed from the marginal distributions. Standard interpretation: above 0.60 is substantial agreement, above 0.80 is almost perfect. For production HITL systems, target Kappa above 0.70 for general tasks and above 0.80 for medical or legal annotation.
Use Fleiss's Kappa when you have more than two annotators, as Cohen's Kappa is only defined for exactly two. In practice, run pairwise Cohen's Kappa between all annotator pairs to identify specific pairs with low agreement, then use Fleiss's Kappa for the overall system health signal.
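To make the chance correction concrete, here is a minimal from-scratch sketch of Cohen's Kappa for two annotators. The function name `cohens_kappa` and the toy labels are illustrative assumptions, not part of any library; in production a maintained implementation such as scikit-learn's `cohen_kappa_score` is likely the safer choice.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's Kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement p_o: fraction of items where the two labels match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement p_e from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label for every item: Kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: each annotator votes "A" on 95 of 100 items,
# but their five "B" votes land on different items.
annotator_1 = ["B"] * 5 + ["A"] * 95
annotator_2 = ["A"] * 5 + ["B"] * 5 + ["A"] * 90

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.3f}")
```

In this toy case raw agreement is 90% while Kappa lands slightly below zero, which is the gap between "looks consistent" and "carries information" that the answer above describes.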
Q2: How do you design annotation guidelines that minimize ambiguity?
The structure matters as much as the content. Every category needs: (1) a clear, unambiguous definition, (2) measurable criteria that each apply independently, (3) positive examples showing what gets this label and why, (4) negative examples showing what should NOT get this label, (5) an edge case table documenting specific hard cases with explicit decisions, (6) a list of common mistakes, aimed especially at annotators who have worked on similar tasks before, and (7) tie-breaking rules specifying which way to default when uncertain.
The most important guideline development practice is pilot annotation with disagreement mining. Have five annotators independently label 100 items without any guidelines except the basic task description. Collect their annotations, identify every item where annotators disagreed, and use an LLM to help analyze what edge case or ambiguity caused each disagreement. This surfaces the real boundary cases in your data rather than the theoretical ones you would think of independently.
Test your guidelines by having two new annotators use them to label the same 50 items independently. Any items they disagree on are guideline gaps. Fix those gaps before deploying to the full annotator pool.
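The disagreement-mining step can be sketched in a few lines. This is a minimal illustration under assumptions: the function name `mine_disagreements`, the `pilot_labels` structure (item ID mapped to the list of labels from each pilot annotator), and the sample note IDs are all hypothetical.

```python
from collections import Counter


def mine_disagreements(pilot_labels: dict[str, list[str]]) -> list[dict]:
    """Return every item with at least one disagreement, most contested first."""
    disputed = []
    for item_id, labels in pilot_labels.items():
        counts = Counter(labels)
        if len(counts) < 2:
            continue  # unanimous items reveal no guideline gap
        _, top_votes = counts.most_common(1)[0]
        disputed.append({
            "item_id": item_id,
            "label_counts": dict(counts),
            # Lower share = more contested = review first.
            "majority_share": top_votes / len(labels),
        })
    return sorted(disputed, key=lambda d: (d["majority_share"], d["item_id"]))


# Hypothetical pilot round: five annotators, three items.
pilot_labels = {
    "note_001": ["mild", "mild", "mild", "mild", "mild"],
    "note_002": ["mild", "mild", "mild", "mild", "moderate"],
    "note_003": ["mild", "moderate", "severe", "moderate", "mild"],
}
review_queue = mine_disagreements(pilot_labels)
for entry in review_queue:
    print(entry["item_id"], entry["label_counts"])
```

Sorting by majority share puts the most contested items at the top of the review queue, which is where guideline gaps are most likely to show up.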
Q3: What overlap strategy should you use for different task types?
Overlap (multiple annotators per item) is the foundation of quality measurement but adds cost. Match overlap to task criticality and task difficulty.
For binary classification tasks with clear-cut criteria, 2-annotator overlap is sufficient: you can detect disagreement and resolve by tiebreaker or adjudication. For multiclass or subjective tasks, 3-annotator overlap enables majority vote resolution. For medical, legal, or safety-critical annotation, use 5-annotator overlap: majority vote among 5 is more reliable than 3, you have enough signal to compute meaningful per-annotator accuracy, and you can afford to lose one annotator's judgment without recomputing.
Gold task injection should run at 10% of total volume, with all annotators receiving all gold tasks. This gives you cross-annotator calibration data that allows you to weight annotators by accuracy when resolving conflicts through weighted majority vote.
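A weighted majority vote along these lines might look like the sketch below. It assumes each annotator's weight is simply their gold task accuracy; the function name, the `default_weight` for annotators without gold data, and the alphabetical tie-break are illustrative choices rather than a prescribed method.

```python
from collections import defaultdict


def weighted_majority_vote(
    votes: dict[str, str],
    gold_accuracy: dict[str, float],
    default_weight: float = 0.5,
) -> tuple[str, float]:
    """Resolve one item: returns (winning label, winner's share of total weight)."""
    if not votes:
        raise ValueError("At least one vote is required")
    scores: dict[str, float] = defaultdict(float)
    for annotator_id, label in votes.items():
        # Annotators with no gold history get a neutral default weight (assumption).
        scores[label] += gold_accuracy.get(annotator_id, default_weight)
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("Total vote weight must be positive")
    # Iterating over sorted labels makes ties resolve deterministically.
    winner = max(sorted(scores), key=lambda label: scores[label])
    return winner, scores[winner] / total


# Hypothetical item: two low-accuracy annotators outvote one high-accuracy annotator.
votes = {"ann_1": "severe", "ann_2": "mild", "ann_3": "mild"}
gold_accuracy = {"ann_1": 0.95, "ann_2": 0.40, "ann_3": 0.45}
label, share = weighted_majority_vote(votes, gold_accuracy)
print(label, round(share, 3))
```

Here a plain majority vote would pick "mild", while the weighted vote picks "severe" because the single dissenting annotator has a much stronger gold record. A winning share close to 0.5, as in this example, is a reasonable signal to route the item to adjudication instead of accepting the vote.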
Q4: How do you handle annotation guideline updates without invalidating existing labeled data?
Guideline updates are inevitable - edge cases surface that the original guideline did not address. The key practice is version tracking: every labeled example records which guideline version applied at labeling time.
When you update guidelines, first audit how many existing examples fall into the updated category. If you add an edge case clarification that changes how to label "elderly patient minimizing symptoms," identify all labeled examples involving elderly patients and priority-queue them for relabeling review. Do not relabel everything - focus on examples in the affected subspace.
Create a "relabeling required" flag in your dataset with the guideline version that triggered the requirement. Track relabeling progress separately from the overall labeling progress. Once relabeled, remove the flag and update the guideline version on those examples.
Maintain a review history in the guideline document itself: what was changed, why, which categories were affected, and what the new interpretation is. This provides the audit trail for why some examples in your dataset have different provenance.
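The version-tracking workflow above can be sketched as follows. The `LabeledExample` record, its field names, and the `is_affected` predicate are hypothetical; a real dataset would more likely store these as columns in a table or labeling tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LabeledExample:
    item_id: str
    label: str
    guideline_version: str                       # version in effect at labeling time
    metadata: dict = field(default_factory=dict)
    relabel_required_by: Optional[str] = None    # guideline version that triggered review


def flag_for_relabeling(
    examples: list[LabeledExample],
    new_version: str,
    is_affected: Callable[[LabeledExample], bool],
) -> list[str]:
    """Flag only examples labeled under an older version AND in the affected subspace."""
    flagged = []
    for ex in examples:
        if ex.guideline_version != new_version and is_affected(ex):
            ex.relabel_required_by = new_version
            flagged.append(ex.item_id)
    return flagged


def complete_relabel(ex: LabeledExample, new_label: str, new_version: str) -> None:
    """Record the reviewed label, bump the guideline version, and clear the flag."""
    ex.label = new_label
    ex.guideline_version = new_version
    ex.relabel_required_by = None


# Hypothetical update: v1.1 clarifies how to score severity for patients aged 65+.
dataset = [
    LabeledExample("n1", "mild", "v1.0", {"patient_age": 72}),
    LabeledExample("n2", "mild", "v1.0", {"patient_age": 34}),
    LabeledExample("n3", "severe", "v1.1", {"patient_age": 80}),
]
flagged_ids = flag_for_relabeling(
    dataset, "v1.1", lambda ex: ex.metadata.get("patient_age", 0) >= 65
)
complete_relabel(dataset[0], "moderate", "v1.1")
```

Only `n1` is flagged: `n2` falls outside the affected subspace, and `n3` was already labeled under the new version. Relabeling progress is then just the count of examples whose `relabel_required_by` is still set.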
Q5: How do you use LLMs to accelerate annotation without introducing systematic bias?
LLMs can be used in two annotation support roles: pre-annotation (suggesting labels before human review) and adjudication (resolving human disagreements). Both have bias risks that require specific countermeasures.
For pre-annotation: use a cheap, fast model (Claude Haiku) to suggest labels before the annotator sees the item. Show the suggestion clearly but make override equally accessible - avoid UI patterns that make it harder to override than to accept. Measure override rate by confidence bucket weekly. If high-confidence pre-annotations have an override rate below 5%, either the model is very accurate (good) or annotators are rubber-stamping (bad). Distinguish these by running blind annotation sessions where the suggestion is hidden, then comparing results to sessions where it is shown.
For adjudication: use a stronger model (Claude Opus) with the full guideline document as context. The LLM is applying the guideline, not exercising independent judgment - it should reference specific guideline criteria in its reasoning. Track adjudication decisions by category and periodically audit a sample against expert human judgment to verify the LLM is applying the guideline correctly.
The most important safeguard: measure pre-annotation accuracy and override rate weekly. If accuracy drops below 75% or override rate rises above 40%, the pre-annotation is creating more noise than signal and should be disabled for that category until the underlying model improves.
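The blind-versus-shown comparison for detecting rubber-stamping can be sketched like this. It assumes the LLM suggestion is recorded for every item in both sessions (hidden from the annotator in the blind one); the record fields, function names, and the `max_gap` threshold of 0.10 are illustrative assumptions to tune for your own task.

```python
def acceptance_rate(records: list[dict]) -> float:
    """Share of items whose final human label equals the LLM suggestion."""
    if not records:
        raise ValueError("Need at least one record")
    matches = sum(r["human_label"] == r["suggested_label"] for r in records)
    return matches / len(records)


def anchoring_check(shown: list[dict], blind: list[dict], max_gap: float = 0.10) -> dict:
    """Compare agreement with the suggestion when it is shown vs hidden."""
    shown_rate = acceptance_rate(shown)
    blind_rate = acceptance_rate(blind)
    gap = shown_rate - blind_rate
    return {
        "shown_acceptance": round(shown_rate, 3),
        "blind_agreement": round(blind_rate, 3),
        "gap": round(gap, 3),
        # A large gap suggests annotators defer to the suggestion when they can see it.
        "likely_rubber_stamping": gap > max_gap,
    }


def _toy_session(n_match: int, n_total: int) -> list[dict]:
    """Hypothetical session where the first n_match items agree with the suggestion."""
    return [
        {"human_label": "mild", "suggested_label": "mild" if i < n_match else "moderate"}
        for i in range(n_total)
    ]


shown_session = _toy_session(19, 20)   # suggestion visible: 95% acceptance
blind_session = _toy_session(15, 20)   # suggestion hidden: 75% agreement
report = anchoring_check(shown_session, blind_session)
print(report)
```

If acceptance is high in both sessions, the model is probably just accurate. If it is high only when the suggestion is visible, as in this toy data, annotators are likely anchoring on it.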
