

Building Golden Datasets

Reading time: 35 minutes | Interview relevance: Very High | Target roles: AI Engineer, ML Engineer, Research Engineer, MLOps


The Evaluation Collapse

The startup had built something they were genuinely proud of. Their AI writing assistant produced clean, professional prose, summarized documents in seconds, and adapted to different tones on request. The founding engineer had evaluated it carefully - 50 responses, scored on a 1–5 scale across helpfulness, accuracy, and tone. The result: 82% satisfaction. They shipped it.

Three months later, they hired three more engineers. Each new hire had opinions, standards, and mental models shaped by different backgrounds - one had worked at a media company where precision mattered above all, one had come from a startup culture where speed of communication was prized, and one had academic training that valued hedging and qualification. The founding engineer ran the same evaluation again, this time with the full team rating the same 50 responses. The satisfaction score dropped to 61%.

The panic in the Slack channel was immediate: "Did we break something? Did the model get worse? Did an API change hit us?" They spent two days auditing the system, checking API response logs, comparing prompt versions. Nothing had changed in the AI. What had changed was the measurement apparatus. With four different evaluators applying four implicit standards, the 50-response dataset was producing four different answers to the same question.

The root cause was structural. The founding engineer had never written annotation guidelines - the explicit rules that tell every evaluator exactly what a "4" looks like versus a "3", what counts as accurate, what makes a tone appropriate. There were no examples with explained reasoning. There was no calibration session where evaluators would compare notes on the same five responses before rating independently. There was no inter-annotator agreement measurement to detect when evaluators were systematically disagreeing. And the 50-response dataset had been selected opportunistically - cases the founder remembered thinking were good or bad - rather than sampled systematically from production traffic.

What they had wasn't an evaluation framework. It was an artifact of one person's preferences dressed up as a metric. Building a golden dataset is a discipline with methods, failure modes, and quality checks. It sits at the foundation of every downstream evaluation decision. If your golden dataset is flawed, every evaluation metric built on top of it is flawed, and every model comparison, prompt experiment, and regression test is measuring the wrong thing with false confidence.

The startup rebuilt from scratch. They wrote a four-page annotation guide with example responses at every score level. They collected 300 questions sampled from actual production logs, stratified by query type. They had every example rated by two annotators and measured their agreement using Cohen's Kappa, sending any batch whose Kappa fell below 0.6 back for re-annotation. They held weekly calibration sessions. It took three weeks. But when they deployed a new model version six months later and their benchmark showed an improvement, they could trust the number.


Why Golden Datasets Matter

A golden dataset is a curated collection of input-output pairs with verified, high-quality reference answers that serve as ground truth for evaluation. Every other evaluation technique in your stack depends on it.

Ground truth for offline evaluation. Without a reference answer, you cannot compute metrics like faithfulness, accuracy, or completeness in any principled way. An LLM judge can rank two responses against each other, but an absolute quality score needs a reference to compare against. Golden answers provide that anchor.

Anchor for LLM judge calibration. When you use an LLM to evaluate your system, you are trusting the judge's standards. Those standards need to be grounded in your actual quality bar. A golden dataset lets you measure whether the judge agrees with human annotation - if the judge scores a response 4/5 and your annotators scored the same response 2/5, the judge is miscalibrated and all of its downstream scores are meaningless.

Regression test suite foundation. Every golden example is a potential regression test. If your current system produces the expected answer and a new prompt version does not, you have a detected regression. Without golden examples, a regression suite has no stable expected output to test against.

Benchmark for model and prompt comparisons. When you compare two models or two prompt versions, the benchmark's validity depends entirely on the quality of the test cases. A dataset full of easy questions will show both systems performing near-perfectly, making real differences invisible. A biased dataset will favor one system for reasons unrelated to real-world quality.

The garbage in/garbage out problem. Poor golden data is worse than no data. It produces misleading metrics that mask real failures while reporting false confidence. Teams that ship based on a flawed benchmark and then discover the failures in production lose far more than the time they would have spent building the dataset correctly.


Historical Context

The practice of curated evaluation datasets comes from information retrieval research in the 1960s and 1970s, where the Cranfield experiments established the basic methodology: collect a set of documents, write queries with known relevant document sets, and measure retrieval systems against those sets. The Cranfield datasets were small (1,400 documents, 225 queries) but their methodology - exhaustive relevance judgments for every query-document pair - became the template for decades of IR research.

The Text Retrieval Conference (TREC), launched in 1992, scaled this to millions of documents and hundreds of queries per year. TREC introduced pooling (collect top results from many systems, annotate those) and made inter-annotator agreement measurement standard practice. The Natural Language Processing community followed the same pattern, from Penn Treebank in the early 1990s to SQuAD (2016), GLUE (2018), and SuperGLUE (2019) - each with careful annotation guidelines, multiple annotators, and published agreement statistics.

The rise of LLMs shifted the problem. Benchmark contamination (models trained on test data), benchmark saturation (human parity on GLUE), and the difficulty of evaluating open-ended generation made traditional static benchmarks insufficient. The field moved toward dynamic benchmarks, adversarial datasets, and LLM-as-judge evaluation - but all of these still require golden datasets as the calibration anchor. The methods evolved; the discipline remained essential.


Dataset Design Principles

1. Representativeness

The dataset must reflect the actual distribution of inputs your system receives in production. If 40% of your production queries are simple factual lookups but your dataset is 90% complex multi-step reasoning, your benchmark will over-weight rare hard cases and under-weight common easy ones.

Representativeness is achieved by sampling from production logs, stratifying by query type, and continuously measuring whether the dataset distribution drifts away from production distribution over time.
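One way to put a number on that drift is to compare the query-type distribution of the dataset against production. The sketch below uses KL divergence; the function names, the sample labels, and the smoothing constant `eps` are illustrative assumptions, not part of any library or of the implementation later in this lesson.

```python
import math
from collections import Counter


def type_distribution(labels: list[str]) -> dict[str, float]:
    """Turn a list of query-type labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def kl_divergence(
    p: dict[str, float], q: dict[str, float], eps: float = 1e-9
) -> float:
    """KL(p || q) in nats. eps (an assumed smoothing constant) avoids
    division by zero when a category is missing from q."""
    return sum(
        p_k * math.log((p_k + eps) / (q.get(k, 0.0) + eps))
        for k, p_k in p.items()
        if p_k > 0
    )


# Hypothetical label counts: production is mostly factual and procedural,
# while the dataset is dominated by multi-hop questions.
production = type_distribution(
    ["factual"] * 40 + ["procedural"] * 35 + ["multi_hop"] * 25
)
dataset = type_distribution(
    ["factual"] * 10 + ["procedural"] * 10 + ["multi_hop"] * 80
)

drift = kl_divergence(production, dataset)
print(f"KL(production || dataset) = {drift:.3f} nats")
```

A value of zero means the two distributions match. What counts as "too much" drift depends on the system, so any alert threshold should be chosen from your own history rather than taken from this sketch.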

2. Diversity

Within each query type, the dataset must cover the full space of variations. For a coding assistant, diversity means: different languages, different problem types (algorithmic, debugging, documentation), different skill levels, different code lengths, different domains.

Diversity is measured through n-gram analysis, embedding-space coverage, and topic distribution analysis. A dataset that is not diverse will produce evaluation scores that over-represent one slice of your users.

3. Coverage of Failure Modes

Deliberately include cases where the system has historically struggled. If your system fails on questions that contain negations, include negation questions. If it struggles with multi-step reasoning, include multi-step questions. Failure mode coverage is the difference between a dataset that tells you how well you do on easy cases versus one that tells you how robust you actually are.

4. Balance Between Easy and Hard

An evaluation set that is all hard cases is as misleading as one that is all easy cases. The former makes a good system look weak; the latter makes a weak system look good. A well-calibrated dataset spreads examples across the difficulty range, with most of them in the middle and fewer at the extremes.

Target: approximately 20% easy, 50% medium, 20% hard, 10% adversarial.
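To turn that target mix into concrete collection quotas, a small helper is enough. This is a sketch: `difficulty_quota` and `TARGET_MIX` are hypothetical names, and the largest-remainder rounding is one reasonable choice for making the buckets sum exactly to the dataset size.

```python
def difficulty_quota(total: int, mix_percent: dict[str, int]) -> dict[str, int]:
    """Split `total` examples across buckets given integer percentages.

    Uses largest-remainder rounding so the quotas always sum to `total`.
    """
    if sum(mix_percent.values()) != 100:
        raise ValueError("percentages must sum to 100")

    quota = {k: total * pct // 100 for k, pct in mix_percent.items()}
    remainder = {k: total * pct % 100 for k, pct in mix_percent.items()}

    # Hand out the examples lost to rounding, largest remainder first.
    leftover = total - sum(quota.values())
    for k in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        quota[k] += 1
    return quota


TARGET_MIX = {"easy": 20, "medium": 50, "hard": 20, "adversarial": 10}

print(difficulty_quota(300, TARGET_MIX))
# {'easy': 60, 'medium': 150, 'hard': 60, 'adversarial': 30}
```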

5. Temporal Relevance

Datasets rot. A golden dataset built in January 2024 for a legal research assistant may be invalid by January 2025 as regulations change, new cases are decided, and terminology evolves. Dataset maintenance - detecting drift, adding new examples, retiring stale ones - is a continuous operational responsibility.


Question Generation Strategies

Mining from Production Logs

Real user queries are the highest-validity source for evaluation questions because they reflect the actual input distribution. Strategy: collect 30 days of production queries, cluster by semantic similarity, sample proportionally from each cluster, and stratify by user feedback signal (include both high-rated and low-rated interactions).

LLM-Generated Questions

A model like claude-haiku can generate hundreds of diverse questions from a document corpus in minutes. The weakness: LLM-generated questions may cluster around common phrasings and miss the long-tail queries real users ask. Use for diversity gap-filling, not as your primary source.

Template-Based Generation

For systematic coverage of structured domains, define a question schema and fill it programmatically. For a SQL assistant: "Write a query that [action] from [table] where [condition]". Templates guarantee coverage of all action × table × condition combinations but produce questions that feel synthetic.
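As a minimal sketch of the SQL-assistant template, `itertools.product` enumerates every action × table × condition combination. The slot values below are made up for illustration; a real schema would come from your own domain.

```python
from itertools import product

TEMPLATE = "Write a query that {action} from {table} where {condition}"

# Hypothetical slot values for illustration only.
actions = ["selects all columns", "counts rows"]
tables = ["orders", "customers"]
conditions = ["created_at is within the last 7 days", "status is 'active'"]

questions = [
    TEMPLATE.format(action=a, table=t, condition=c)
    for a, t, c in product(actions, tables, conditions)
]

print(len(questions))  # 2 actions x 2 tables x 2 conditions = 8
print(questions[0])
```

The combinatorial growth is the point and also the risk: three slots with ten values each already yield 1,000 questions, so templates are usually sampled rather than exhausted.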

Adversarial Generation

Deliberately try to break the system. Generate questions with false premises, negations, out-of-domain topics, or misleading framing. These surface failure modes that happy-path testing never reaches. Use claude-opus-4-6 for higher-quality adversarial generation.

Human Expert Generation

Domain experts can generate rare, high-stakes questions that no automated process would think of. A physician writing questions for a medical AI will generate cases about drug interactions, contraindications, and differential diagnosis - scenarios the model must handle correctly even if they represent 0.1% of traffic.


Annotation Process

Writing Annotation Guidelines

Annotation guidelines are the document every human annotator reads before rating a single example. Their quality determines the quality of every annotation downstream. Good guidelines:

  • State the task precisely, with the exact question the annotator must answer
  • Define each score level with a crisp description
  • Provide multiple example responses at each score level with explanations of why they received that score
  • Address edge cases explicitly: "If the response is correct but uses deprecated terminology, score it as a 3, not a 4"
  • Include a section on common annotator errors and how to avoid them

Poor guidelines use vague language: "score 4 if the response is mostly good." What is "mostly"? What is "good"? Two annotators reading "mostly good" will apply entirely different thresholds.

Two-Annotator Redundancy

Every example in a production-grade golden dataset is annotated by at least two independent annotators. The redundancy serves two purposes: (1) it detects annotation errors and outliers, (2) it gives you empirical data about the difficulty and ambiguity of each example.

Inter-Annotator Agreement

Cohen's Kappa measures agreement between exactly two annotators for categorical ratings, correcting for chance agreement:

$$\kappa = \frac{P_{observed} - P_{chance}}{1 - P_{chance}}$$

  • $\kappa < 0$: worse than chance (systematic disagreement)
  • $0.0\text{–}0.2$: slight agreement
  • $0.2\text{–}0.4$: fair agreement
  • $0.4\text{–}0.6$: moderate agreement (minimum acceptable)
  • $0.6\text{–}0.8$: substantial agreement (target for AI eval tasks)
  • $0.8\text{–}1.0$: almost perfect agreement
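A small worked example makes the chance correction concrete. The sketch below applies the formula above to ten made-up ratings from two annotators; `cohen_kappa` is a standalone illustration of the arithmetic, not a library function.

```python
from collections import Counter


def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's Kappa for two annotators who rated the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(a)

    # Fraction of items where both annotators gave the same score.
    p_observed = sum(x == y for x, y in zip(a, b)) / n

    # Agreement expected if each annotator rated at random according to
    # their own score frequencies (product of the marginals).
    a_dist, b_dist = Counter(a), Counter(b)
    p_chance = sum(
        (a_dist[c] / n) * (b_dist[c] / n) for c in set(a) | set(b)
    )

    if p_chance == 1.0:  # both annotators used a single identical score
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical ratings on a 1-5 scale.
annotator_a = [5, 4, 4, 3, 2, 5, 1, 3, 4, 2]
annotator_b = [5, 4, 3, 3, 2, 4, 1, 3, 4, 1]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")  # P_observed = 0.70, P_chance = 0.21
```

Here the annotators agree on 7 of 10 items, but 0.21 of that agreement is expected by chance, so Kappa is (0.70 - 0.21) / (1 - 0.21), roughly 0.62: just inside the "substantial" band even though raw agreement is 70%.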

Krippendorff's Alpha generalizes Cohen's Kappa to multiple annotators, multiple scales (ordinal, interval, ratio), and missing data. Use Krippendorff's Alpha when you have more than two annotators or want a single metric across the entire annotation pool.

Adjudication

When annotators disagree by 2+ points on a 5-point scale, the example needs adjudication: a third annotator rates it, and the majority vote determines the final score. If a tie persists after three annotators, the example is either flagged as genuinely ambiguous (useful signal in itself) or excluded from the dataset.
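The adjudication rule can be sketched in a few lines. The helper names here are hypothetical, and the sketch assumes scores are integers on the 1-5 scale.

```python
from collections import Counter
from typing import Optional


def needs_adjudication(scores: list[int], threshold: int = 2) -> bool:
    """True when the spread between annotators reaches the threshold."""
    return len(scores) >= 2 and max(scores) - min(scores) >= threshold


def final_score(scores: list[int]) -> Optional[int]:
    """Most common score, or None when the top two scores are tied."""
    if not scores:
        return None
    ranked = Counter(scores).most_common(2)
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    return None  # no majority: flag as ambiguous or exclude


first_pass = [2, 4]                    # two annotators, 2 points apart
if needs_adjudication(first_pass):
    with_third = first_pass + [4]      # third annotator sides with the 4
    print(final_score(with_third))     # 4

print(final_score([1, 3, 5]))          # None: three-way split
```

Returning `None` instead of averaging keeps the ambiguity visible, which matches the idea that a persistent tie is a signal about the example rather than noise to smooth over.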


Dataset Quality Checks

Duplicate detection. Normalize questions to lowercase, remove punctuation, check for exact and near-duplicate matches (same first 8 words). Duplicates inflate your dataset size while contributing no new coverage.
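A minimal sketch of that check, assuming questions are plain strings. The function names and the `(first_index, duplicate_index, kind)` return shape are choices made for this example.

```python
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(question: str) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace."""
    return " ".join(question.lower().translate(_PUNCT).split())


def find_duplicates(
    questions: list[str], prefix_words: int = 8
) -> list[tuple[int, int, str]]:
    """Return (first_index, duplicate_index, kind) for each duplicate,
    where kind is 'exact' or 'near' (same first `prefix_words` words)."""
    seen_exact: dict[str, int] = {}
    seen_prefix: dict[str, int] = {}
    hits = []
    for i, q in enumerate(questions):
        norm = normalize(q)
        prefix = " ".join(norm.split()[:prefix_words])
        if norm in seen_exact:
            hits.append((seen_exact[norm], i, "exact"))
        elif prefix in seen_prefix:
            hits.append((seen_prefix[prefix], i, "near"))
        else:
            seen_exact[norm] = i
            seen_prefix[prefix] = i
    return hits


sample = [
    "How do I reset my password?",
    "how do i reset my password",
    "What is the refund policy for annual plans purchased in March?",
    "What is the refund policy for annual plans purchased in April?",
]
print(find_duplicates(sample))
# [(0, 1, 'exact'), (2, 3, 'near')]
```

The prefix rule is deliberately crude. It flags the March/April pair above even though the two questions may have different answers, so near-duplicate hits are best treated as candidates for review rather than automatic deletions.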

Annotation consistency. Plot the distribution of scores per annotator. If one annotator has a dramatically different distribution than others (all 5s, or all 3s), they may be applying a different standard - investigate before accepting their annotations.

Coverage analysis. Build a topic × question-type coverage matrix. Cells with zero examples are coverage gaps. Cells with disproportionately many examples (more than 30% of the total) indicate over-sampling.
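The coverage matrix can be sketched with a `Counter` over `(topic, question_type)` pairs. The function name, the sample topics, and the default `max_share` of 0.30 (taken from the 30% figure above) are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def coverage_report(
    examples: list[tuple[str, str]],
    topics: list[str],
    question_types: list[str],
    max_share: float = 0.30,
) -> tuple[list[tuple[str, str]], list[tuple[str, str]]]:
    """Return (gaps, oversampled) cells of the topic x type matrix."""
    counts = Counter(examples)
    total = len(examples)

    # Cells that should exist but have no examples at all.
    gaps = [
        cell for cell in product(topics, question_types) if counts[cell] == 0
    ]
    # Cells holding more than max_share of the whole dataset.
    oversampled = [
        cell for cell, c in counts.items() if total and c / total > max_share
    ]
    return gaps, oversampled


# Hypothetical dataset of ten (topic, question_type) labels.
examples = (
    [("billing", "factual")] * 4
    + [("billing", "procedural")] * 1
    + [("login", "factual")] * 5
)
gaps, oversampled = coverage_report(
    examples, ["billing", "login"], ["factual", "procedural"]
)
print("gaps:", gaps)                # [('login', 'procedural')]
print("oversampled:", oversampled)  # billing/factual and login/factual
```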

Difficulty calibration. Run your current production system on all examples and measure performance. If the system achieves 95%+ on most examples, the dataset is too easy. If it achieves less than 20%, it may be too hard (or your system has a serious problem). Target a system accuracy of 60-80% for a calibrated benchmark.
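Those thresholds translate directly into a check you can run after each benchmark pass. This is a sketch: `calibration_verdict` and its label strings are hypothetical, and it assumes each example has already been graded as pass or fail.

```python
def calibration_verdict(results: list[bool]) -> str:
    """Classify a benchmark by the production system's accuracy on it."""
    if not results:
        raise ValueError("no results to evaluate")
    accuracy = sum(results) / len(results)

    if accuracy >= 0.95:
        return "too_easy"
    if accuracy < 0.20:
        return "too_hard_or_system_broken"
    if 0.60 <= accuracy <= 0.80:
        return "calibrated"
    return "review"  # outside the target band but not at an extreme


print(calibration_verdict([True] * 7 + [False] * 3))   # calibrated
print(calibration_verdict([True] * 19 + [False] * 1))  # too_easy
```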


Complete Implementation

import anthropic
import json
import hashlib
import statistics
import math
import random
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime, timezone
from collections import defaultdict, Counter
from enum import Enum

client = anthropic.Anthropic()


# ---------------------------------------------------------------------------
# Data Structures
# ---------------------------------------------------------------------------

class Difficulty(Enum):
    TRIVIAL = 1
    EASY = 2
    MEDIUM = 3
    HARD = 4
    ADVERSARIAL = 5


class QuestionType(Enum):
    FACTUAL = "factual"
    PROCEDURAL = "procedural"
    COMPARISON = "comparison"
    EDGE_CASE = "edge_case"
    ADVERSARIAL = "adversarial"
    MULTI_HOP = "multi_hop"


@dataclass
class GoldenExample:
    """A single verified input-output pair in the golden dataset."""
    example_id: str
    question: str
    reference_answer: str
    context: Optional[str] = None
    difficulty: Difficulty = Difficulty.MEDIUM
    question_type: QuestionType = QuestionType.FACTUAL
    topic: str = ""
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    annotator_ids: list[str] = field(default_factory=list)
    annotation_agreement: Optional[float] = None
    metadata: dict = field(default_factory=dict)
    is_adversarial: bool = False
    source: str = "human"  # human | production_log | llm_generated | template

    def to_dict(self) -> dict:
        return {
            "example_id": self.example_id,
            "question": self.question,
            "reference_answer": self.reference_answer,
            "context": self.context,
            "difficulty": self.difficulty.value,
            "question_type": self.question_type.value,
            "topic": self.topic,
            "created_at": self.created_at,
            "annotator_ids": self.annotator_ids,
            "annotation_agreement": self.annotation_agreement,
            "metadata": self.metadata,
            "is_adversarial": self.is_adversarial,
            "source": self.source,
        }

    @classmethod
    def from_dict(cls, d: dict) -> "GoldenExample":
        d2 = dict(d)
        d2["difficulty"] = Difficulty(d2.get("difficulty", 3))
        d2["question_type"] = QuestionType(d2.get("question_type", "factual"))
        return cls(**d2)


@dataclass
class Annotation:
    """A single human annotation for one example."""
    example_id: str
    annotator_id: str
    score: int  # 1-5
    reasoning: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    flags: list[str] = field(default_factory=list)


@dataclass
class AnnotationGuideline:
    """Structured guidelines for annotators."""
    task_description: str
    scoring_rubric: dict[int, str]
    examples: dict[int, list[dict]]
    edge_case_rules: list[str]
    common_mistakes: list[str]
    version: str = "1.0.0"

    def render(self) -> str:
        lines = [
            f"# Annotation Guidelines v{self.version}",
            "",
            "## Task Description",
            self.task_description,
            "",
            "## Scoring Rubric",
        ]
        for score in sorted(self.scoring_rubric.keys()):
            lines.append(f" **{score}/5**: {self.scoring_rubric[score]}")

        lines.extend(["", "## Examples by Score Level"])
        for score in sorted(self.examples.keys()):
            lines.append(f"\n### Score {score}")
            for ex in self.examples[score]:
                lines.append(f" Q: {ex['question']}")
                lines.append(f" A: {ex['answer']}")
                lines.append(f" Why {score}: {ex['explanation']}")

        lines.extend(["", "## Edge Case Rules"])
        for rule in self.edge_case_rules:
            lines.append(f" - {rule}")

        lines.extend(["", "## Common Mistakes"])
        for mistake in self.common_mistakes:
            lines.append(f" - {mistake}")

        return "\n".join(lines)


@dataclass
class QualityReport:
    total_examples: int
    duplicate_count: int
    low_agreement_count: int
    topic_distribution: dict[str, float]
    difficulty_distribution: dict[int, float]
    type_distribution: dict[str, float]
    coverage_score: float
    diversity_score: float
    issues: list[str]
    passed: bool


@dataclass
class DriftAlert:
    kl_divergence: float
    drift_score: float
    drifted_topics: list[str]
    recommended_action: str


@dataclass
class DatasetDiff:
    added: list[str]
    removed: list[str]
    modified: list[str]
    agreement_delta: float
    size_delta: int


# ---------------------------------------------------------------------------
# Question Generator
# ---------------------------------------------------------------------------

class QuestionGenerator:
    """
    Generates diverse evaluation questions from source material.
    Uses claude-haiku for cost-effective large-scale generation.
    """

    QUESTION_TYPE_PROMPTS = {
        QuestionType.FACTUAL: (
            "Generate a specific factual question that has one clear correct answer."
        ),
        QuestionType.PROCEDURAL: (
            "Generate a 'how to' or step-by-step procedural question."
        ),
        QuestionType.COMPARISON: (
            "Generate a question that asks to compare or contrast two things."
        ),
        QuestionType.EDGE_CASE: (
            "Generate a question about an unusual, boundary, or edge-case scenario."
        ),
        QuestionType.ADVERSARIAL: (
            "Generate a tricky question designed to expose model weaknesses: "
            "embedded false assumptions, negations, or misleading framing."
        ),
        QuestionType.MULTI_HOP: (
            "Generate a question requiring reasoning through multiple steps "
            "or combining information from multiple sources."
        ),
    }

    def __init__(self):
        self.client = anthropic.Anthropic()

    def generate_from_corpus(
        self,
        corpus_chunks: list[str],
        n: int = 100,
        question_type: Optional[QuestionType] = None,
        topic_hint: str = "",
    ) -> list[dict]:
        """Generate n diverse questions from corpus chunks."""
        all_questions: list[dict] = []
        types = (
            list(QuestionType) if question_type is None else [question_type]
        )
        per_type = max(1, n // len(types))
        # Shuffle a copy so the caller's list is not reordered in place.
        chunks = random.sample(corpus_chunks, k=len(corpus_chunks))

        for q_type in types:
            generated = self._generate_type_batch(
                chunks[:min(10, len(chunks))],
                q_type,
                per_type,
                topic_hint,
            )
            all_questions.extend(generated)

        return self.ensure_diversity(all_questions, max_count=n)

    def _generate_type_batch(
        self,
        chunks: list[str],
        q_type: QuestionType,
        n: int,
        topic_hint: str,
    ) -> list[dict]:
        combined = "\n\n---\n\n".join(chunks[:5])
        type_instruction = self.QUESTION_TYPE_PROMPTS[q_type]

        prompt = f"""You are building an evaluation dataset for an AI system about: {topic_hint or 'the provided content'}.

Context material:
{combined[:3000]}

Task: {type_instruction}

Generate exactly {n} questions based on the context above.
Requirements:
- Each question must be answerable from the context
- Questions must be diverse - no repetition
- Format: numbered list - 1. 2. 3. etc.
- No commentary, just numbered questions

Output {n} questions:"""

        response = self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        )

        raw_text = response.content[0].text
        # One hash per batch: every question here came from the same context.
        chunk_hash = hashlib.md5(combined.encode()).hexdigest()[:8]
        questions = []
        for line in raw_text.strip().split("\n"):
            line = line.strip()
            # Keep only lines shaped like "12. question text".
            if line and line[0].isdigit():
                parts = line.split(".", 1)
                if len(parts) == 2:
                    q_text = parts[1].strip()
                    if len(q_text) > 10:
                        questions.append({
                            "question": q_text,
                            "type": q_type.value,
                            "source_chunk_hash": chunk_hash,
                        })
        return questions

    def generate_adversarial(
        self,
        system_description: str,
        n: int = 50,
    ) -> list[dict]:
        """Generate adversarial questions using claude-opus-4-6."""
        prompt = f"""You are a red-teamer building adversarial test cases for an AI system.

System description: {system_description}

Generate {n} adversarial questions designed to expose weaknesses. Include:
- False premise questions (embed an incorrect assumption the model must reject)
- Leading questions that push toward wrong answers
- Questions with embedded negations or double negatives
- Questions about topics just outside the system's domain
- Questions that exploit common LLM biases (recency bias, authority bias, anchoring)

Format: JSON array with keys: question, attack_type, expected_failure_mode

Output valid JSON array:"""

        # Use the instance client, consistent with _generate_type_batch.
        response = self.client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )

        raw = response.content[0].text
        try:
            # Extract the outermost JSON array, ignoring any surrounding text.
            start = raw.index("[")
            end = raw.rindex("]") + 1
            parsed = json.loads(raw[start:end])
        except (ValueError, json.JSONDecodeError):
            parsed = []

        # Drop malformed items so downstream code can rely on "question".
        questions = [
            q for q in parsed if isinstance(q, dict) and q.get("question")
        ]
        for q in questions:
            q["type"] = QuestionType.ADVERSARIAL.value
            q["is_adversarial"] = True

        return questions

    def ensure_diversity(
        self,
        questions: list[dict],
        max_count: int = 200,
    ) -> list[dict]:
        """Remove near-duplicates and enforce per-type budget."""
        seen_prefixes: set[str] = set()
        type_counts: dict[str, int] = defaultdict(int)
        diverse = []
        # Budget over the types actually present; dividing by every
        # QuestionType would cap a single-type request at max_count // 6.
        types_present = {q.get("type", "factual") for q in questions}
        per_type_budget = math.ceil(max_count / max(1, len(types_present)))

        for q in questions:
            # Near-duplicate key: the first 8 words, lowercased.
            prefix = " ".join(q["question"].split()[:8]).lower()
            q_type = q.get("type", "factual")

            if prefix in seen_prefixes:
                continue
            if type_counts[q_type] >= per_type_budget:
                continue

            seen_prefixes.add(prefix)
            type_counts[q_type] += 1
            diverse.append(q)

            if len(diverse) >= max_count:
                break

        return diverse


# ---------------------------------------------------------------------------
# Annotation Session
# ---------------------------------------------------------------------------

class AnnotationSession:
    """Manages the collection and measurement of human annotations."""

    def __init__(self, guideline: AnnotationGuideline):
        self.guideline = guideline
        self.annotations: list[Annotation] = []
        self._index: dict[str, list[Annotation]] = defaultdict(list)

    def record_annotation(
        self,
        example_id: str,
        annotator_id: str,
        score: int,
        reasoning: str,
        flags: Optional[list[str]] = None,
    ) -> Annotation:
        if score < 1 or score > 5:
            raise ValueError(f"Score must be 1-5, got {score}")
        ann = Annotation(
            example_id=example_id,
            annotator_id=annotator_id,
            score=score,
            reasoning=reasoning,
            flags=flags or [],
        )
        self.annotations.append(ann)
        self._index[example_id].append(ann)
        return ann

    def get_example_annotations(self, example_id: str) -> list[Annotation]:
        return self._index.get(example_id, [])

    def compute_cohen_kappa(
        self, annotator_a: str, annotator_b: str
    ) -> float:
        """Compute Cohen's Kappa between two annotators."""
        a_ratings: dict[str, int] = {}
        b_ratings: dict[str, int] = {}

        for ann in self.annotations:
            if ann.annotator_id == annotator_a:
                a_ratings[ann.example_id] = ann.score
            elif ann.annotator_id == annotator_b:
                b_ratings[ann.example_id] = ann.score

        shared_ids = set(a_ratings.keys()) & set(b_ratings.keys())
        if len(shared_ids) < 2:
            return 0.0

        a_scores = [a_ratings[eid] for eid in sorted(shared_ids)]
        b_scores = [b_ratings[eid] for eid in sorted(shared_ids)]

        n = len(a_scores)
        categories = list(range(1, 6))

        # Observed agreement
        p_observed = sum(
            1 for a, b in zip(a_scores, b_scores) if a == b
        ) / n

        # Expected agreement (product of marginals)
        a_dist = Counter(a_scores)
        b_dist = Counter(b_scores)
        p_chance = sum(
            (a_dist.get(c, 0) / n) * (b_dist.get(c, 0) / n)
            for c in categories
        )

        if p_chance >= 1.0:
            return 1.0

        kappa = (p_observed - p_chance) / (1.0 - p_chance)
        return round(kappa, 4)

    def compute_krippendorff_alpha(self) -> float:
        """
        Compute Krippendorff's Alpha across all annotators.

        Uses the interval (squared-difference) distance on the 1-5 scale.
        All rating pairs are weighted equally, which matches the standard
        formula when every example has the same number of ratings and is
        an approximation otherwise.
        """
        annotator_ids = list({ann.annotator_id for ann in self.annotations})
        example_ids = list({ann.example_id for ann in self.annotations})

        if len(annotator_ids) < 2 or len(example_ids) < 2:
            return 0.0

        ratings: dict[str, dict[str, Optional[int]]] = {
            aid: {} for aid in annotator_ids
        }
        for ann in self.annotations:
            ratings[ann.annotator_id][ann.example_id] = ann.score

        n_categories = 5
        max_dist_sq = (n_categories - 1) ** 2

        def interval_dist(v: int, k: int) -> float:
            # Squared difference, normalized to [0, 1].
            return ((v - k) ** 2) / max_dist_sq

        # Observed disagreement: pairs of ratings within the same example
        D_o_num = 0.0
        D_o_den = 0
        pairable_values: list[int] = []
        for eid in example_ids:
            raters_for_ex = [
                ratings[aid][eid]
                for aid in annotator_ids
                if eid in ratings[aid] and ratings[aid][eid] is not None
            ]
            m = len(raters_for_ex)
            if m < 2:
                # A single rating cannot be paired, so it is excluded from
                # both the observed and the expected disagreement.
                continue
            pairable_values.extend(raters_for_ex)
            for i in range(m):
                for j in range(i + 1, m):
                    D_o_num += interval_dist(raters_for_ex[i], raters_for_ex[j])
                    D_o_den += 1

        if D_o_den == 0:
            return 1.0

        D_o = D_o_num / D_o_den

        # Expected disagreement: all pairs of pairable ratings, ignoring
        # which example they belong to
        n_total = len(pairable_values)
        D_e_num = sum(
            interval_dist(pairable_values[i], pairable_values[j])
            for i in range(n_total)
            for j in range(i + 1, n_total)
        )
        D_e = D_e_num / max(1, (n_total * (n_total - 1) / 2))

        if D_e == 0:
            return 1.0

        alpha = 1.0 - (D_o / D_e)
        return round(alpha, 4)

    def adjudication_needed(self, example_id: str) -> bool:
        """Return True if annotators disagree by 2+ points."""
        anns = self.get_example_annotations(example_id)
        if len(anns) < 2:
            return False
        scores = [a.score for a in anns]
        return max(scores) - min(scores) >= 2

    def resolve_by_majority(self, example_id: str) -> Optional[int]:
        """Return majority score; None if tied."""
        anns = self.get_example_annotations(example_id)
        if not anns:
            return None
        count = Counter(a.score for a in anns)
        most_common = count.most_common(2)
        if len(most_common) == 1 or most_common[0][1] > most_common[1][1]:
            return most_common[0][0]
        return None  # Tied - needs human adjudication


# ---------------------------------------------------------------------------
# Golden Dataset Builder
# ---------------------------------------------------------------------------

class GoldenDatasetBuilder:
    """Orchestrates the full golden dataset construction pipeline."""

    def __init__(self, guideline: AnnotationGuideline):
        self.guideline = guideline
        self.examples: list[GoldenExample] = []
        self.session = AnnotationSession(guideline)
        self.generator = QuestionGenerator()

    def mine_production_logs(
        self,
        logs: list[dict],
        n: int = 200,
        diversity_filter: bool = True,
    ) -> list[GoldenExample]:
        """
        Extract representative examples from production query logs.
        Each log entry: {query, response, expected_answer, feedback_score, topic}
        Stratified: 50% positive, 30% negative, 20% neutral feedback.
        """
        rated = [
            entry for entry in logs
            if entry.get("feedback_score") is not None
        ]

        positive = [e for e in rated if e.get("feedback_score", 0) >= 4]
        negative = [e for e in rated if e.get("feedback_score", 0) <= 2]
        neutral = [
            e for e in rated
            if 2 < e.get("feedback_score", 3) < 4
        ]

        n_pos = int(n * 0.5)
        n_neg = int(n * 0.3)
        n_neu = n - n_pos - n_neg

        selected = (
            random.sample(positive, min(n_pos, len(positive)))
            + random.sample(negative, min(n_neg, len(negative)))
            + random.sample(neutral, min(n_neu, len(neutral)))
        )

        examples = []
        for log in selected:
            eid = hashlib.md5(log["query"].encode()).hexdigest()[:12]
            examples.append(GoldenExample(
                example_id=eid,
                question=log["query"],
                reference_answer=log.get(
                    "expected_answer", log.get("response", "")
                ),
                difficulty=self._estimate_difficulty(log["query"]),
                topic=log.get("topic", "general"),
                source="production_log",
                metadata={
                    "original_feedback": log.get("feedback_score"),
                    "user_id": log.get("user_id", "unknown"),
                },
            ))

        if diversity_filter:
            examples = self._deduplicate(examples)

        return examples[:n]

    def _estimate_difficulty(self, question: str) -> Difficulty:
        """Heuristic difficulty estimate from question text."""
        text = question.lower()
        raw_words = text.split()
        word_count = len(raw_words)
        # Match single keywords against whole words so that "not" does not
        # fire on "notation" or "vs" on an unrelated substring.
        tokens = {w.strip(".,;:!?()[]\"'") for w in raw_words}
        has_negation = any(
            neg in tokens
            for neg in ["not", "never", "except", "without", "unless"]
        )
        has_comparison = any(
            c in tokens
            for c in ["compare", "difference", "versus", "vs", "better"]
        )
        # Multi-word cues are matched as phrases in the full text.
        has_multi = any(
            m in text
            for m in ["and then", "after which", "given that", "assuming"]
        )

        score = (
            (1 if word_count > 30 else 0)
            + (1 if word_count > 60 else 0)
            + (1 if has_negation else 0)
            + (1 if has_comparison else 0)
            + (1 if has_multi else 0)
        )

        mapping = {
            0: Difficulty.EASY,
            1: Difficulty.MEDIUM,
            2: Difficulty.MEDIUM,
            3: Difficulty.HARD,
            4: Difficulty.ADVERSARIAL,
            5: Difficulty.ADVERSARIAL,
        }
        return mapping.get(score, Difficulty.MEDIUM)

    def generate_adversarial(
        self,
        system_description: str,
        n: int = 50,
    ) -> list[GoldenExample]:
        """Generate adversarial examples using claude-opus-4-6."""
        raw = self.generator.generate_adversarial(system_description, n)
        examples = []
        for item in raw:
            question = item.get("question")
            if not question:
                continue  # Skip malformed items from the model output
            eid = hashlib.md5(question.encode()).hexdigest()[:12]
            examples.append(GoldenExample(
                example_id=eid,
                question=question,
                reference_answer="",  # Filled in by annotators or reference gen
                difficulty=Difficulty.ADVERSARIAL,
                question_type=QuestionType.ADVERSARIAL,
                is_adversarial=True,
                source="llm_generated",
                metadata={
                    "attack_type": item.get("attack_type", "unknown"),
                    "expected_failure_mode": item.get(
                        "expected_failure_mode", ""
                    ),
                },
            ))
        return examples

    def add_reference_answers(
        self,
        examples: list[GoldenExample],
        context: Optional[str] = None,
    ) -> list[GoldenExample]:
        """
        Generate reference answers for examples missing them.
        Uses claude-opus-4-6 for high-quality reference generation.
        """
        updated = []
        for ex in examples:
            if ex.reference_answer:
                updated.append(ex)
                continue

            ctx_section = f"\nContext:\n{context}\n" if context else ""
            prompt = f"""You are generating a reference answer for an AI evaluation dataset.
{ctx_section}
Question: {ex.question}

Provide a thorough, accurate reference answer. This will be used as ground truth.
Include: the correct answer, key reasoning steps, and important caveats.

Reference answer:"""

            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            # Model-written draft: it still needs annotator verification
            # before it counts as a golden reference answer.
            ex.reference_answer = response.content[0].text.strip()
            updated.append(ex)

        return updated

    def validate_dataset(
        self, dataset: list[GoldenExample]
    ) -> QualityReport:
        """Compute comprehensive quality metrics for a dataset."""
        issues = []

        # Duplicate detection
        seen_questions: set[str] = set()
        duplicates = 0
        for ex in dataset:
            q_norm = " ".join(ex.question.lower().split())
            if q_norm in seen_questions:
                duplicates += 1
            seen_questions.add(q_norm)
        if duplicates > 0:
            issues.append(f"{duplicates} duplicate questions detected")

        # Annotation agreement check
        low_agreement = [
            ex for ex in dataset
            if ex.annotation_agreement is not None
            and ex.annotation_agreement < 0.4
        ]
        if low_agreement:
            issues.append(
                f"{len(low_agreement)} examples have low annotation "
                f"agreement (Kappa < 0.4)"
            )

        total = len(dataset)

        # Topic distribution
        topic_counts = Counter(ex.topic for ex in dataset)
        topic_dist = {t: c / total for t, c in topic_counts.items()}
        for topic, frac in topic_dist.items():
            if frac > 0.4:
                issues.append(
                    f"Topic '{topic}' is over-represented "
                    f"({frac:.0%} of dataset)"
                )

        # Difficulty distribution
        diff_counts = Counter(ex.difficulty.value for ex in dataset)
        diff_dist = {d: c / total for d, c in diff_counts.items()}
        hard_frac = diff_dist.get(4, 0) + diff_dist.get(5, 0)
        if hard_frac < 0.1:
            issues.append(
                f"Only {hard_frac:.0%} hard/adversarial examples - "
                f"increase difficulty coverage to at least 10%"
            )

        # Type distribution
        type_counts = Counter(ex.question_type.value for ex in dataset)
        type_dist = {t: c / total for t, c in type_counts.items()}

        # Coverage score: unique (topic, type) combinations
        combos = {(ex.topic, ex.question_type.value) for ex in dataset}
        max_combos = len(topic_counts) * len(type_counts)
        coverage_score = len(combos) / max(1, max_combos)

        # Diversity: unique word bigrams, collected within each question so
        # that no bigram spans the boundary between two questions
        unique_bigrams: set[tuple[str, str]] = set()
        for ex in dataset:
            words = ex.question.lower().split()
            unique_bigrams.update(zip(words, words[1:]))
        diversity_score = min(1.0, len(unique_bigrams) / max(1, total * 5))

        return QualityReport(
            total_examples=total,
            duplicate_count=duplicates,
            low_agreement_count=len(low_agreement),
            topic_distribution=topic_dist,
            difficulty_distribution=diff_dist,
            type_distribution=type_dist,
            coverage_score=round(coverage_score, 3),
            diversity_score=round(diversity_score, 3),
            issues=issues,
            passed=len(issues) == 0,
        )

    def split(
        self,
        dataset: list[GoldenExample],
        train_frac: float = 0.6,
        val_frac: float = 0.2,
        test_frac: float = 0.2,
        seed: int = 42,
    ) -> tuple[list[GoldenExample], list[GoldenExample], list[GoldenExample]]:
        """Stratified split preserving topic distribution across splits."""
        assert abs(train_frac + val_frac + test_frac - 1.0) < 1e-6
        # Local RNG: reproducible without reseeding the global random module
        rng = random.Random(seed)

        by_topic: dict[str, list[GoldenExample]] = defaultdict(list)
        for ex in dataset:
            by_topic[ex.topic].append(ex)

        train, val, test = [], [], []
        for topic_examples in by_topic.values():
            rng.shuffle(topic_examples)
            n = len(topic_examples)
            n_train = int(n * train_frac)
            n_val = int(n * val_frac)
            train.extend(topic_examples[:n_train])
            val.extend(topic_examples[n_train:n_train + n_val])
            # Rounding remainders land in test, so very small topics
            # may end up entirely in the test split
            test.extend(topic_examples[n_train + n_val:])

        return train, val, test

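    def _deduplicate(
        self, examples: list[GoldenExample]
    ) -> list[GoldenExample]:
        """Keep the first example for each distinct six-word question prefix."""
        seen: set[str] = set()
        unique = []
        for ex in examples:
            # Coarse key: the first six lowercased words of the question
            key = " ".join(ex.question.lower().split()[:6])
            if key not in seen:
                seen.add(key)
                unique.append(ex)
        return unique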


# ---------------------------------------------------------------------------
# Dataset Drift Detector
# ---------------------------------------------------------------------------

class DatasetDriftDetector:
    """
    Monitors whether the golden dataset's distribution matches production traffic.
    Alerts when KL divergence exceeds threshold.
    """

    def compute_topic_distribution(
        self, dataset: list[GoldenExample]
    ) -> dict[str, float]:
        counts = Counter(ex.topic for ex in dataset)
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}

    def compute_production_distribution(
        self, production_logs: list[dict]
    ) -> dict[str, float]:
        counts = Counter(log.get("topic", "unknown") for log in production_logs)
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}

    def kl_divergence(
        self,
        p: dict[str, float],
        q: dict[str, float],
        epsilon: float = 1e-10,
    ) -> float:
        """KL(P || Q)."""
        all_keys = set(p.keys()) | set(q.keys())
        return round(
            sum(
                p.get(k, epsilon) * math.log(p.get(k, epsilon) / q.get(k, epsilon))
                for k in all_keys
            ),
            4,
        )

    def compare_distributions(
        self,
        dataset: list[GoldenExample],
        production_logs: list[dict],
    ) -> tuple[float, float]:
        """Returns (kl_divergence, drift_score 0-1)."""
        ds_dist = self.compute_topic_distribution(dataset)
        prod_dist = self.compute_production_distribution(production_logs)
        kl = self.kl_divergence(prod_dist, ds_dist)
        drift_score = min(1.0, kl / 2.0)
        return kl, drift_score

    def alert_on_drift(
        self,
        production_logs: list[dict],
        golden_dataset: list[GoldenExample],
        drift_threshold: float = 0.3,
    ) -> Optional[DriftAlert]:
        kl, drift_score = self.compare_distributions(
            golden_dataset, production_logs
        )
        if drift_score < drift_threshold:
            return None

        ds_dist = self.compute_topic_distribution(golden_dataset)
        prod_dist = self.compute_production_distribution(production_logs)

        drifted = []
        for topic in set(ds_dist.keys()) | set(prod_dist.keys()):
            ds_frac = ds_dist.get(topic, 0)
            prod_frac = prod_dist.get(topic, 0)
            if abs(ds_frac - prod_frac) > 0.05:
                drifted.append(
                    f"{topic} "
                    f"(dataset: {ds_frac:.0%}, production: {prod_frac:.0%})"
                )

        action = (
            "Critical drift - update dataset immediately"
            if drift_score > 0.6
            else "Moderate drift - plan dataset update within 2 weeks"
        )

        return DriftAlert(
            kl_divergence=kl,
            drift_score=round(drift_score, 3),
            drifted_topics=drifted,
            recommended_action=action,
        )


# ---------------------------------------------------------------------------
# Dataset Version Manager
# ---------------------------------------------------------------------------

class DatasetVersionManager:
    """
    Manages immutable versioned snapshots of golden datasets.
    Each version is a complete, self-contained JSON file.
    Versions are never modified - only superseded.
    """

    def __init__(self, storage_dir: str = "/tmp/golden_datasets"):
        self.storage_dir = storage_dir
        import os
        os.makedirs(storage_dir, exist_ok=True)

    def _version_path(self, version: str) -> str:
        return f"{self.storage_dir}/golden_v{version}.json"

    def save(
        self,
        dataset: list[GoldenExample],
        version: str,
        metadata: Optional[dict] = None,
    ) -> str:
        payload = {
            "version": version,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "size": len(dataset),
            "metadata": metadata or {},
            "examples": [ex.to_dict() for ex in dataset],
        }
        path = self._version_path(version)
        with open(path, "w") as f:
            json.dump(payload, f, indent=2)
        print(f"Saved {len(dataset)} examples → {path}")
        return path

    def load(self, version: str) -> list[GoldenExample]:
        path = self._version_path(version)
        with open(path) as f:
            payload = json.load(f)
        return [GoldenExample.from_dict(ex) for ex in payload["examples"]]

    def diff(self, version_a: str, version_b: str) -> DatasetDiff:
        ds_a = {ex.example_id: ex for ex in self.load(version_a)}
        ds_b = {ex.example_id: ex for ex in self.load(version_b)}

        ids_a, ids_b = set(ds_a.keys()), set(ds_b.keys())
        added = list(ids_b - ids_a)
        removed = list(ids_a - ids_b)
        shared = ids_a & ids_b

        modified = [
            eid for eid in shared
            if ds_a[eid].reference_answer != ds_b[eid].reference_answer
        ]

        a_agree = [
            ds_a[eid].annotation_agreement
            for eid in shared
            if ds_a[eid].annotation_agreement is not None
        ]
        b_agree = [
            ds_b[eid].annotation_agreement
            for eid in shared
            if ds_b[eid].annotation_agreement is not None
        ]
        agreement_delta = (
            statistics.mean(b_agree) - statistics.mean(a_agree)
            if a_agree and b_agree
            else 0.0
        )

        return DatasetDiff(
            added=added,
            removed=removed,
            modified=modified,
            agreement_delta=round(agreement_delta, 4),
            size_delta=len(ds_b) - len(ds_a),
        )


# ---------------------------------------------------------------------------
# Full Pipeline
# ---------------------------------------------------------------------------

def build_golden_dataset_from_scratch(
    corpus: list[str],
    production_logs: list[dict],
    system_description: str,
    n_target: int = 500,
) -> tuple[list[GoldenExample], QualityReport]:
    """
    End-to-end pipeline:
    1. Generate questions from corpus (40% of target)
    2. Mine production logs (40% of target)
    3. Generate adversarial examples (20% of target)
    4. Add reference answers to unannotated examples
    5. Validate quality and return report
    """
    print(f"Building golden dataset (target: {n_target} examples)")

    guideline = AnnotationGuideline(
        task_description=(
            "Rate the quality of an AI system's response to a user question. "
            "Consider accuracy, completeness, clarity, and appropriateness."
        ),
        scoring_rubric={
            1: "Completely wrong or harmful - incorrect information, "
               "misses the question entirely",
            2: "Mostly wrong - some relevant content but major errors "
               "or omissions",
            3: "Partially correct - addresses the question but with "
               "notable gaps or minor errors",
            4: "Mostly correct - accurate and complete, minor improvements "
               "possible",
            5: "Excellent - fully accurate, complete, clear, and "
               "appropriately detailed",
        },
        examples={
            5: [{
                "question": "What is Python?",
                "answer": (
                    "Python is a high-level, interpreted programming language "
                    "known for clear syntax, dynamic typing, and an extensive "
                    "standard library. It supports procedural, object-oriented, "
                    "and functional programming paradigms."
                ),
                "explanation": "Accurate, complete, covers key characteristics",
            }],
            3: [{
                "question": "What is Python?",
                "answer": "Python is a programming language used for data science.",
                "explanation": "Correct but incomplete - misses key characteristics",
            }],
            1: [{
                "question": "What is Python?",
                "answer": "Python is a snake found in tropical regions.",
                "explanation": "Completely wrong - interprets as the animal",
            }],
        },
        edge_case_rules=[
            "Correct answer using deprecated terminology → score 3, not 4",
            "Appropriate 'I don't know' for out-of-domain questions → score 4",
            "Clarifying question instead of answer → score based on "
            "whether the clarification is appropriate",
        ],
        common_mistakes=[
            "Don't penalize verbosity if all content is accurate",
            "Don't give 5 to any response with a factual error, "
            "no matter how minor",
            "Don't conflate writing style with accuracy",
        ],
    )

    builder = GoldenDatasetBuilder(guideline)
    generator = QuestionGenerator()

    # Generate from corpus
    n_generated = int(n_target * 0.4)
    print(f"  Generating {n_generated} questions from corpus...")
    generated_qs = generator.generate_from_corpus(corpus, n=n_generated)
    generated_examples = [
        GoldenExample(
            example_id=hashlib.md5(q["question"].encode()).hexdigest()[:12],
            question=q["question"],
            reference_answer="",
            question_type=QuestionType(q.get("type", "factual")),
            source="llm_generated",
        )
        for q in generated_qs
    ]

    # Mine production logs
    n_mined = int(n_target * 0.4)
    print(f"  Mining {n_mined} examples from production logs...")
    mined = builder.mine_production_logs(production_logs, n=n_mined)

    # Generate adversarial
    n_adversarial = int(n_target * 0.2)
    print(f"  Generating {n_adversarial} adversarial examples...")
    adversarial = builder.generate_adversarial(system_description, n=n_adversarial)

    all_examples = generated_examples + mined + adversarial

    # Add reference answers only for examples that do not have one yet
    n_missing = sum(1 for ex in all_examples if not ex.reference_answer)
    print(f"  Adding reference answers for {n_missing} examples...")
    all_examples = builder.add_reference_answers(
        all_examples,
        context=corpus[0] if corpus else None,
    )

    # Validate
    print("  Validating dataset quality...")
    report = builder.validate_dataset(all_examples)

    print(f"\nDataset complete: {report.total_examples} examples")
    print(f"  Coverage: {report.coverage_score:.3f}")
    print(f"  Diversity: {report.diversity_score:.3f}")
    if report.issues:
        for issue in report.issues:
            print(f"  Issue: {issue}")

    return all_examples, report


# ---------------------------------------------------------------------------
# Demo
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    corpus = [
        "Python is a high-level programming language with dynamic typing "
        "and garbage collection.",
        "Machine learning involves training models on data to make predictions.",
        "Neural networks are composed of layers of neurons transforming input data.",
        "Gradient descent is an optimization algorithm that minimizes loss functions.",
        "Transformers use self-attention mechanisms to process sequences in parallel.",
    ]

    production_logs = [
        {
            "query": "How does gradient descent work?",
            "response": "It minimizes loss by moving in the negative gradient direction.",
            "expected_answer": (
                "Gradient descent is an optimization algorithm that updates model "
                "parameters by computing the gradient of the loss function and "
                "stepping in the opposite direction (negative gradient). The step "
                "size is the learning rate."
            ),
            "feedback_score": 4,
            "topic": "optimization",
            "user_id": "u001",
        },
        {
            "query": "What is a transformer?",
            "response": "A transformer is a neural network architecture.",
            "expected_answer": (
                "A transformer is a neural network architecture introduced in "
                "'Attention Is All You Need' (2017) using self-attention mechanisms "
                "instead of recurrence. It consists of encoder and decoder stacks "
                "with multi-head attention and feed-forward layers."
            ),
            "feedback_score": 2,
            "topic": "architecture",
            "user_id": "u002",
        },
    ]

    dataset, report = build_golden_dataset_from_scratch(
        corpus=corpus,
        production_logs=production_logs,
        system_description="An AI assistant answering questions about machine learning.",
        n_target=20,  # Small for demo
    )

    # Save versioned snapshot
    manager = DatasetVersionManager()
    path = manager.save(dataset, version="1.0.0", metadata={"project": "ml-qa"})
    print(f"\nSaved to: {path}")

    # Simulate drift detection
    new_logs = [
        {"topic": "deployment", "query": "How to deploy a model?"},
        {"topic": "deployment", "query": "What is model serving?"},
        {"topic": "monitoring", "query": "How to monitor ML models?"},
    ]

    detector = DatasetDriftDetector()
    alert = detector.alert_on_drift(new_logs, dataset)
    if alert:
        print(f"\nDrift detected! Score: {alert.drift_score:.3f}")
        print(f"  Action: {alert.recommended_action}")
    else:
        print("\nNo significant drift detected.")

Architecture Diagrams

Golden Dataset Construction Pipeline

Annotation Agreement Process

Dataset Maintenance Lifecycle


Production Notes

:::warning Annotation Guideline Versioning

Every time you update annotation guidelines, re-annotate a calibration batch of at least 20 examples that were previously annotated under the old guidelines. This lets you measure whether the guideline change shifted your IAA. Without this, you cannot know whether a drop in agreement is due to annotator confusion or a genuine ambiguity in the new guidelines.

:::

:::tip Mining Production Logs Safely

Production logs often contain PII - user names, email addresses, internal project names. Before using production logs for dataset construction, run a PII scrubbing pipeline: named entity recognition, substitution with synthetic placeholders, and human review of a 10% spot check. Never include raw production PII in annotation tasks sent to external annotators.

:::

:::danger Benchmark Contamination

If the same model both generates the reference answers and judges responses against them, it is grading against an answer key it wrote itself, so its own blind spots and phrasing preferences can leak into the scores. Use the strongest available model for reference answer generation, but keep generation completely separate from evaluation: use a different model or human raters as the judge. Have human subject-matter experts review all LLM-generated reference answers before including them.

:::

:::note Dataset Size Rules of Thumb

  • Minimum for meaningful evaluation: 100 examples (95% confidence interval of roughly ±10 percentage points on a pass rate)
  • Good for production: 300–500 examples (95% confidence interval of roughly ±5 percentage points)
  • Full production-grade: 1,000+ examples stratified across all query types
  • Adversarial examples: minimum 10–15% of total dataset
  • Difficulty balance target: 20% easy / 50% medium / 20% hard / 10% adversarial

:::
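The confidence-interval figures in the size rules of thumb can be reproduced with the normal approximation for a proportion. The sketch below assumes a binary pass/fail metric, a 95% level (z = 1.96), and the worst case p = 0.5. `margin_of_error` is a hypothetical helper name, not part of the pipeline above.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1 - p) / n)


# Worst-case (p = 0.5) margins for the dataset sizes discussed above
for n in (100, 300, 500, 1000):
    print(f"n={n:>4}: ±{margin_of_error(n):.1%}")
```

Under these assumptions the margins come out at roughly ±9.8%, ±5.7%, ±4.4%, and ±3.1%. Scores far from 50% give narrower intervals, and per-topic slices of the dataset give much wider ones because each slice has fewer examples.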

Interview Questions and Answers

Q1: What is Cohen's Kappa and why is raw agreement percentage insufficient for measuring annotation quality?

Cohen's Kappa corrects for the agreement that would occur by chance if annotators labeled randomly according to their observed label distributions. Raw agreement percentage can be misleadingly high: if each annotator marks 80% of examples "positive", two annotators labeling independently would still agree about 68% of the time by chance alone (0.8 × 0.8 + 0.2 × 0.2). Kappa subtracts this expected chance agreement p_e from the observed agreement p_o and normalizes: Kappa = (p_o - p_e) / (1 - p_e). Kappa = 0 means no better than chance, and Kappa = 1 means perfect agreement. For binary labels on imbalanced datasets, raw agreement can exceed 90% while Kappa is below 0.2 - revealing that annotators are not actually applying consistent judgments.
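A small worked example makes the gap between raw agreement and Kappa concrete. This is a minimal from-scratch sketch for two annotators with categorical labels. The function name and the toy label lists are illustrative assumptions, not data from a real annotation run.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Unweighted Cohen's Kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label; Kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Imbalanced toy data: each annotator marks 95 of 100 items "pos"
a = ["pos"] * 91 + ["pos"] * 4 + ["neg"] * 4 + ["neg"] * 1
b = ["pos"] * 91 + ["neg"] * 4 + ["pos"] * 4 + ["neg"] * 1

raw_agreement = sum(x == y for x, y in zip(a, b)) / len(a)
print(f"raw agreement: {raw_agreement:.0%}")
print(f"kappa: {cohens_kappa(a, b):.3f}")
```

In this toy data the annotators agree on 92 of 100 items, yet chance agreement is already 0.905, so Kappa is only about 0.16.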

Q2: How do you decide what examples to include in a golden dataset for a system that handles thousands of different query types?

Start with production log analysis: cluster queries by semantic similarity and topic, then sample proportionally from each cluster to ensure representativeness. Add deliberate over-sampling for rare but high-stakes query types - a legal research assistant should have excellent coverage of contract disputes even if they're only 2% of production traffic. Include adversarial examples across all clusters targeting known failure modes. Use a coverage matrix (topics × question types) to identify gaps. Finally, measure n-gram diversity and embedding-space coverage to ensure the dataset is not clustering around a narrow set of phrasings.
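The coverage matrix mentioned above can be as simple as a count per (topic, question type) cell. The sketch below assumes each example is already tagged with both attributes. The topic and type names are invented for illustration, and `coverage_gaps` is a hypothetical helper.

```python
from collections import Counter
from itertools import product


def coverage_gaps(
    examples: list[tuple[str, str]],
    topics: list[str],
    question_types: list[str],
    min_count: int = 1,
) -> list[tuple[str, str]]:
    """Return the (topic, question_type) cells with fewer than min_count examples."""
    counts = Counter(examples)
    return [
        (topic, qtype)
        for topic, qtype in product(topics, question_types)
        if counts[(topic, qtype)] < min_count
    ]


# Illustrative tags only
tagged = [
    ("billing", "factual"),
    ("billing", "factual"),
    ("billing", "reasoning"),
    ("auth", "factual"),
]
gaps = coverage_gaps(tagged, ["billing", "auth"], ["factual", "reasoning"])
print(gaps)
```

Raising `min_count` turns the same check into a floor for thin cells, which is one way to drive the deliberate over-sampling of rare but high-stakes query types.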

Q3: What is dataset drift and how do you detect it automatically?

Dataset drift occurs when the distribution of inputs in production diverges from the distribution represented in your golden dataset, causing benchmark scores to measure performance on a different population than users are experiencing. You detect it by continuously computing the distribution of production traffic (by topic, query type, length, domain) and comparing to your golden dataset's distribution using KL divergence or Jensen-Shannon divergence. Alert thresholds depend on domain stability: a customer support chatbot for a software product may drift with each major release, requiring monthly dataset updates, while an assistant for a stable domain may not need an update for years. Automated drift detection should trigger a dataset review workflow, not necessarily an immediate rebuild.
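Jensen-Shannon divergence is symmetric and, with base-2 logarithms, bounded between 0 and 1, which makes alert thresholds easier to reason about than raw KL. A minimal sketch over topic distributions stored as dictionaries follows. The function name and the example distributions are assumptions for illustration.

```python
import math


def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits; 0 = identical, 1 = disjoint support."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a: dict[str, float]) -> float:
        # m[k] > 0 whenever a[k] > 0, so no smoothing constant is needed
        return sum(v * math.log2(v / m[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


golden = {"optimization": 0.5, "architecture": 0.5}
production = {"optimization": 0.2, "architecture": 0.2, "deployment": 0.6}
print(f"JS divergence: {js_divergence(golden, production):.3f}")
```

Because the mixture is non-zero wherever either distribution has mass, a topic that appears only in production contributes a finite amount, with no epsilon constant to tune.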

Q4: How do you generate adversarial examples systematically, and what types should be prioritized?

Adversarial examples should cover the systematic attack categories your system is vulnerable to, identified through failure analysis: false premise questions (embedding incorrect assumptions the model must reject), negation questions (double negatives that LLMs often mishandle), out-of-domain questions (testing whether the system appropriately declines), leading questions (framed to push toward wrong answers), and numerical questions (where small arithmetic errors create plausible-but-wrong answers). Generate using a strong model with explicit red-teaming instructions, then validate that each question has a clear correct answer and that the adversarial property is genuine. Priority order: attack types that have already produced failures in production logs come first.

Q5: What is Krippendorff's Alpha and when should you use it instead of Cohen's Kappa?

Krippendorff's Alpha is appropriate when: (1) you have more than two annotators, (2) your rating scale is ordinal, interval, or ratio rather than purely categorical, (3) annotators did not rate exactly the same examples (missing annotations exist), or (4) you want a single IAA metric across all annotators. Cohen's Kappa is limited to exactly two annotators, and in its standard unweighted form it treats all disagreements as equally bad - a disagreement between 1 and 2 is penalized the same as between 1 and 5. Krippendorff's Alpha with an ordinal distance metric treats large disagreements as more serious than small ones, which is appropriate for quality rating scales. In practice: use Cohen's Kappa for quick pairwise comparisons during annotation sessions, and Krippendorff's Alpha for the final reported IAA of your complete dataset.
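The sketch below computes Alpha as 1 - D_o / D_e, where D_o is the observed disagreement within items and D_e is the disagreement expected if ratings were shuffled across items. To stay short it uses the interval (squared-difference) distance instead of the ordinal metric, which is a simplifying assumption. It does handle any number of annotators and missing ratings. The function name and input layout are illustrative.

```python
def krippendorff_alpha_interval(units: list[list[float]]) -> float:
    """Krippendorff's Alpha with the interval distance (a - b) ** 2.

    units[i] holds every rating given to item i; items may have different
    numbers of ratings. Items with fewer than two ratings are ignored.
    """
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: squared differences within each item,
    # each item weighted by 1 / (number of ratings - 1)
    d_o = sum(
        sum((a - b) ** 2 for a in u for b in u) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: squared differences across all pooled ratings
    d_e = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))

    if d_e == 0:
        # Every rating is identical, so Alpha is undefined.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return 1.0 - d_o / d_e


ratings = [
    [4, 4, 5],   # three annotators rated this item
    [2, 3],      # one annotator skipped this item
    [5, 5, 5],
    [1, 2, 1],
]
print(f"alpha: {krippendorff_alpha_interval(ratings):.3f}")
```

Systematic disagreement yields a negative value: two annotators who rate every item 1 and 2 respectively score -0.5 under this metric, which is worse than chance.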

Q6: How should golden datasets be versioned, and what triggers a major version bump versus a minor one?

Treat golden datasets with the same versioning discipline as production code. Minor version bumps (1.0 to 1.1): adding new examples without changing existing ones, adjusting metadata, rebalancing difficulty distribution. Major version bumps (1.0 to 2.0): changing the scoring rubric (which makes historical scores non-comparable), removing more than 10% of examples, fundamentally changing what the dataset measures, or re-annotating with updated guidelines. Every version must be immutable - published versions are never modified, only superseded. Maintain a changelog documenting what changed and why. When running regression comparisons between model versions, always use the same dataset version; comparing scores across dataset versions is not valid.
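The bump rules above can be encoded as a small check that runs before a new snapshot is published. This is a sketch under stated assumptions: `classify_bump`, its flags, and the two-part version string are hypothetical, and judgment calls such as "fundamentally changing what the dataset measures" still have to be passed in by a human as a flag.

```python
def classify_bump(
    old_ids: set[str],
    new_ids: set[str],
    rubric_changed: bool = False,
    reannotated: bool = False,
    scope_changed: bool = False,
) -> str:
    """Return "major" or "minor" following the rules described above."""
    removed_frac = len(old_ids - new_ids) / max(1, len(old_ids))
    if rubric_changed or reannotated or scope_changed or removed_frac > 0.10:
        return "major"
    return "minor"


def next_version(current: str, bump: str) -> str:
    """Increment a "MAJOR.MINOR" version string."""
    major, minor = (int(part) for part in current.split("."))
    return f"{major + 1}.0" if bump == "major" else f"{major}.{minor + 1}"


old = {f"ex{i}" for i in range(10)}
added_only = old | {"ex10", "ex11"}
pruned = {f"ex{i}" for i in range(8)}  # 20% of the examples removed

print(next_version("1.0", classify_bump(old, added_only)))
print(next_version("1.0", classify_bump(old, pruned)))
```

Here adding two examples yields 1.1, while removing 20% of the examples yields 2.0.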
