Task-Specific Evaluation Design

The Vendor Name That Changed Everything

The team had done everything right. They pulled three top-ranked models from the Open LLM Leaderboard, ran them through the standard benchmark suite, and picked the model with the strongest combination of MMLU and GSM8K scores. They deployed it to production on a Tuesday afternoon, handling invoice processing for a logistics company - extracting vendor names, invoice amounts, and due dates from scanned PDF text.

By Thursday, the support queue had forty tickets. The model was extracting vendor names correctly on clean, well-formatted invoices but hallucinating plausible-sounding vendor names on invoices with unusual layouts, OCR noise, or non-standard date formats. MMLU had told them the model knew a lot. It had said nothing about how the model behaved when a vendor name was split across two lines with a hyphenated line break.

The engineering lead, a former researcher, spent Friday building a test set. She pulled 200 invoices from the last six months of production traffic - a stratified sample covering different invoice formats, vendor types, and OCR quality levels. She annotated gold-standard extractions for each one. By Monday morning they had scores. By Tuesday they had switched models. The new model scored three points lower on MMLU. It scored eleven points higher on their extraction test set. It had zero hallucinated vendor names on the entire test set.

This is the central argument of this lesson: the only evaluation that predicts production quality is evaluation on data that looks like production. Standard benchmarks are a filter. Task-specific evaluation is the decision. Everything else is opinion.

The infrastructure for task-specific evaluation is not complicated. It is underbuilt at most companies because engineers underestimate how much signal it provides and overestimate how expensive it is to build. A well-designed 200-example test set, maintained over time, will save more engineering hours than any other investment you make in the model development lifecycle.

Why This Exists

The Mismatch Between Benchmarks and Production

The standard benchmarks evaluate capabilities that are broadly useful - factual knowledge, commonsense reasoning, arithmetic, code generation. They were designed by researchers who wanted to measure general intelligence. They were not designed to tell you whether a specific model will perform well on your specific task.

Production tasks have specific characteristics that benchmarks do not capture:

Domain specificity: Your documents use vocabulary, abbreviations, and formats specific to your industry. A logistics company processes bills of lading, packing lists, and customs declarations - documents that do not appear in MMLU's 57 subjects.
Input distribution: Your inputs have specific noise patterns, length distributions, and structural properties. A customer support chatbot receives truncated sentences, misspellings, and mixed English-Spanish queries. Benchmarks use clean, grammatically correct text.
Failure mode distribution: The errors that matter most in your application are not uniformly distributed across the benchmark's error distribution. A medical summarization system failing on drug dosage mentions is catastrophically worse than failing on synonyms.
Instruction format: Your production prompt template is different from the benchmark prompt template. Small differences in system prompt wording can shift model performance by 5-15% on the same underlying task.

Before standardized task-specific evaluation infrastructure existed, teams made deployment decisions based on vibes - developers tested a few examples manually, agreed the model "seemed good," and shipped. The failure rates were high and the debugging was expensive because there was no systematic signal to guide iteration.

What Task-Specific Evaluation Provides

A well-designed task-specific evaluation suite gives you three things that nothing else can:

Reliable comparison signal: You can say with confidence that model A is better than model B for your task, not just in general. The comparison is controlled - same test set, same prompts, same metrics.
Regression detection: When you update your prompt, fine-tune your model, or switch to a new base model, you can immediately measure whether the change improved or degraded performance. Without a fixed test set, prompt changes can silently regress previously working cases.
Failure mode taxonomy: A good test set, analyzed carefully, tells you exactly where the model fails - which input types, which length ranges, which content domains. This focuses debugging effort.

Historical Context

From Manual QA to Systematic Evaluation

Before the deep learning era, NLP evaluation was mostly manual. A researcher would look at 50 outputs, categorize the errors, and write a qualitative description. This worked when models were narrow and errors were few. It became untenable when neural models could produce fluent but subtly wrong outputs at scale.

The shift toward automated evaluation began with machine translation. BLEU (Bilingual Evaluation Understudy), proposed by Papineni et al. at IBM in 2002, was the first widely adopted automated metric. BLEU computes the overlap between n-grams in the generated translation and one or more reference translations. The key insight was simple: if a translation contains the same sequences of words as human-written translations, it is probably good. BLEU was fast, reproducible, and required only reference translations - no human annotators per evaluation run.

BLEU's limitations were clear almost immediately. It measured surface form overlap, not semantic correctness. A translation could score low BLEU by choosing accurate synonyms. It could score high BLEU by copying reference phrases in slightly wrong order. Correlation with human judgment was positive but weak. Researchers spent the next two decades trying to improve on it.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation), introduced by Lin in 2004 for summarization, adapted the n-gram overlap idea to recall - measuring how much of the reference content appeared in the generated summary. ROUGE-L added longest common subsequence matching to capture reorderings.

BERTScore, introduced by Zhang et al. in 2019, was the first metric to use contextual embeddings. Instead of exact token overlap, BERTScore computes cosine similarity between BERT embeddings of the generated text and reference text tokens. This captures semantic equivalence: "vehicle" and "car" in similar contexts will have high cosine similarity even though they share no characters.

The biggest shift came from LLMs themselves. By 2022, researchers noticed that prompting GPT-4 or Claude to score outputs on a rubric produced judgments that correlated with human expert ratings at the level of inter-annotator agreement. LLM-as-judge evaluation was born - using a strong model to evaluate outputs from a weaker model. This changed the economics of evaluation: instead of spending weeks on human annotation, you could score thousands of examples in hours.

Core Concepts

Evaluation Design Principles

Four principles govern good evaluation design. Violating any one of them makes your evaluation unreliable.

Principle 1: Representative inputs

Your test set must sample from the same distribution as production inputs. If production inputs are 60% short queries (under 50 tokens), 30% medium (50-200 tokens), and 10% long (200+ tokens), your test set should have the same proportions. If production has 15% non-English inputs, your test set should too. A test set that overrepresents easy cases will give you optimistic scores. A test set that overrepresents hard cases will give you pessimistic scores. Neither is useful.

The mathematical statement: let $P$ be the true production distribution and $\hat{P}$ be your test distribution. If $\hat{P} \neq P$ , then the expected score under $\hat{P}$ does not estimate the expected score under $P$ . In practice, you want:

$\text{Score}(\text{model}, \hat{P}) \approx \text{Score}(\text{model}, P)$

This approximation improves as $\hat{P} \to P$ and as the test set size grows.

Principle 2: Unambiguous gold labels

Each example in your test set must have a gold label (the correct answer) that two independent annotators would agree on. If annotators disagree, the label is ambiguous, and scoring the model against that label adds noise rather than signal.

Measure inter-annotator agreement (IAA) on a sample of your test set before using it for evaluation. For categorical labels, use Cohen's kappa:

$\kappa = \frac{P_o - P_e}{1 - P_e}$

where $P_o$ is observed agreement and $P_e$ is expected agreement by chance. $\kappa > 0.8$ is considered strong agreement. If $\kappa < 0.6$ , your annotation guidelines are too ambiguous and need revision before the test set is trustworthy.

Principle 3: Coverage of failure modes

A good test set deliberately includes the cases most likely to cause failures. If OCR noise causes extraction failures, include noisy examples. If unusual date formats cause parsing errors, include unusual formats. If the model struggles with multi-page documents where the vendor name appears only on the last page, include those cases.

This principle requires domain knowledge. You must have some hypothesis about where the model will fail in order to ensure those cases are covered. One practical approach: run a preliminary evaluation on 30-50 production examples, identify the failure modes, then ensure your full test set includes at least 20% hard cases drawn from those failure modes.

Principle 4: Appropriate metrics

The metric must measure what matters for the application. For invoice extraction, the metric is exact match on the extracted field - either you got the vendor name right or you did not. For a summarization task, exact match is wrong - there are many valid ways to summarize a document. For conversational applications, fluency and coherence matter in addition to correctness.

Match the metric to the task type, and make sure the metric is actionable: if the model scores 72% on your metric, you should be able to interpret what that means in production terms (e.g., "28% of invoices will require manual review").

Building a Test Set: Size and Sampling Strategy

How large should your test set be?

The minimum useful size is 100 examples. Below 100, confidence intervals are wide enough that you cannot distinguish between a genuinely good model and one that got lucky. With 100 examples, a 2 percentage point difference in accuracy has a 95% confidence interval of approximately $\pm 4.4$ percentage points (using the Wilson score interval for proportions). That means you cannot reliably distinguish 70% from 72% with 100 examples.

With 500 examples, the confidence interval for a 2 point difference narrows to $\pm 2$ points. With 1000 examples, to $\pm 1.4$ points. For most production applications, 300-500 examples gives enough statistical power to detect meaningful differences while keeping annotation cost manageable.

The Wilson score confidence interval for a proportion $p$ over $n$ examples is:

$CI = \frac{p + \frac{z^2}{2n} \pm z\sqrt{\frac{p(1-p)}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$

where $z = 1.96$ for a 95% confidence interval.

Sampling strategy

For a test set that will be used repeatedly over time, random sampling from recent production data is the most defensible approach. "Recent" matters because production data distributions drift. A test set built from data from six months ago may not represent today's production inputs if your product has evolved.

For a new application without production data, you have two options: (1) synthetic generation - have domain experts write examples covering the range of expected inputs; (2) proxy data - find publicly available data from a similar domain and treat it as a proxy until you have production data to replace it with.

Do not use the same data for both development (prompt engineering, hyperparameter tuning) and final evaluation. Split your annotated data: 80% development set for iteration, 20% held-out test set for final scoring. The held-out test set must not influence any design decisions.

Annotation Guidelines

Good annotation guidelines have five components:

Task definition: One sentence describing exactly what the annotator must do. "Extract the legal entity name of the vendor from the invoice text. The vendor is the entity receiving payment."
Input format description: What the input looks like, what variation is expected, what should be ignored.
Output format specification: Exact format requirements. "Return the vendor name exactly as it appears in the document, including capitalization and punctuation. Do not include legal suffixes (LLC, Inc., Corp.) unless they are part of the primary name."
Decision rules for edge cases: Numbered rules for handling ambiguous cases. "If two vendor names appear on the invoice (e.g., parent company and subsidiary), use the name that appears on the 'Bill From' line or closest to the invoice total."
Examples: At least 5 annotated examples covering the most common cases and the most common edge cases.

Without explicit decision rules for edge cases, annotators make different decisions and your IAA will be low. The investment in writing thorough guidelines pays off directly in test set reliability.

Metrics by Task Type

Classification Tasks

For binary or multi-class classification, the primary metrics are:

Accuracy: $\frac{\text{correct predictions}}{\text{total predictions}}$ . Use when classes are balanced. Misleading when one class dominates - a model that always predicts the majority class can have high accuracy.

Precision, Recall, and F1: For binary classification:

$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Precision measures how often positive predictions are correct. Recall measures how often actual positives are caught. F1 is their harmonic mean. Use F1 when you care about both, or if class imbalance makes accuracy misleading.

AUC-ROC: Area under the receiver operating characteristic curve. Measures discriminative ability across all classification thresholds. AUC of 0.5 is random; 1.0 is perfect. Use AUC when the decision threshold is not fixed or when you want a threshold-independent measure.

For LLM classification tasks (e.g., "classify this email as spam or not spam"), you typically extract the model's generated label token and compute these metrics. If the model generates free text (e.g., "This is spam because..."), you extract the label via pattern matching or a secondary classifier.

Generation Tasks

ROUGE is fast and deterministic but measures only n-gram overlap.

ROUGE-1 measures unigram recall: $\text{ROUGE-1} = \frac{\text{count of overlapping unigrams}}{\text{count of unigrams in reference}}$

ROUGE-L measures longest common subsequence: $\text{ROUGE-L} = \frac{LCS(\text{generated}, \text{reference})}{|\text{reference}|}$

ROUGE is useful as a cheap filter - if a model has very low ROUGE scores, it is probably producing irrelevant output. But ROUGE does not capture semantic quality: a summary that says "vehicle" where the reference says "car" will be penalized even if both are correct.

BERTScore computes token-level similarity using contextual embeddings. For each token in the candidate text, it finds the maximum cosine similarity to any token in the reference:

$\text{Precision} = \frac{1}{|\hat{y}|}\sum_{\hat{y}_j \in \hat{y}} \max_{y_i \in y} \cos(\mathbf{e}_{\hat{y}_j}, \mathbf{e}_{y_i})$

$\text{Recall} = \frac{1}{|y|}\sum_{y_i \in y} \max_{\hat{y}_j \in \hat{y}} \cos(\mathbf{e}_{y_i}, \mathbf{e}_{\hat{y}_j})$

BERTScore F1 is the harmonic mean of these two. It correlates better with human judgment than ROUGE on most generation tasks, particularly for domains with rich synonym usage.

LLM-as-judge is now the standard approach for complex generation tasks where there is no single correct reference. You prompt a strong model (GPT-4o, Claude 3.5 Sonnet) with a rubric and ask it to score the output. The key design decisions are:

Rubric design: What dimensions matter? (Factual accuracy, relevance, conciseness, tone, safety) Score each dimension 1-5 with clear descriptions for each point.
Reference provision: Provide the input, the model output, and (optionally) a reference output. Including the reference reduces variance in judge scores.
Judge model choice: Stronger judge models are more reliable but more expensive. For most tasks, a strong frontier model is worth the cost.
Position bias mitigation: LLM judges show preference for outputs presented first when comparing two outputs. Use randomized presentation order and average across both orderings.

Extraction Tasks

For structured extraction (named entities, key-value pairs, schema-compliant JSON), use:

Exact Match: The extracted value exactly matches the gold label after normalization. Normalization should strip leading/trailing whitespace and standardize case. Exact match is appropriate when the extraction target is fully specified in the input text and any valid extraction should produce the same string.

Token-level F1: Treat the extracted string and gold string as token bags. Compute precision and recall over token overlap. This is appropriate when small paraphrases are acceptable - e.g., "Acme Corp." vs "ACME CORP" should score close to 1.0.

Schema Compliance Rate: For JSON or structured output extraction, the fraction of outputs that are valid against the expected schema. A model can extract the right information but format it incorrectly (missing a field, wrong data type, malformed JSON). Schema compliance measures this separately from accuracy.

Conversation and Multi-turn Tasks

For conversational applications, add:

Coherence: Does each turn follow naturally from the previous turn? Evaluate using LLM-as-judge with a specific coherence rubric.

Instruction Adherence: If the user gave specific constraints (length, format, persona), are they followed? Can be automated with rule-based checkers for explicit constraints (word count, banned words, required phrases).

Task Completion Rate: For goal-oriented dialogues, did the conversation achieve the user's stated goal? This requires defining what "goal achieved" means for each test case, which is labor-intensive but the most meaningful metric for task-oriented bots.

Mermaid Diagrams

Evaluation Design Workflow

Metrics Selection by Task Type

A/B Evaluation Architecture

Code: Building a Complete Evaluation Pipeline

Test Set Loader

import json
import random
from dataclasses import dataclass, field
from typing import Any
from pathlib import Path

@dataclass
class EvalExample:
    """A single evaluation example with input, gold label, and optional metadata."""
    id: str
    input_text: str
    gold_label: Any
    metadata: dict = field(default_factory=dict)

    # Filled in during evaluation
    model_output: str = ""
    score: float = 0.0

@dataclass
class EvalDataset:
    """A complete evaluation dataset."""
    name: str
    task_type: str  # "extraction", "classification", "generation", "conversation"
    examples: list[EvalExample]

    def __len__(self) -> int:
        return len(self.examples)

    def sample(self, n: int, seed: int = 42) -> "EvalDataset":
        """Return a random subsample for quick debugging runs."""
        rng = random.Random(seed)
        sampled = rng.sample(self.examples, min(n, len(self.examples)))
        return EvalDataset(name=f"{self.name}_sample_{n}", task_type=self.task_type, examples=sampled)

    @classmethod
    def from_jsonl(cls, path: str, task_type: str, name: str = "") -> "EvalDataset":
        """Load from a JSONL file where each line is a JSON object."""
        examples = []
        with open(path) as f:
            for i, line in enumerate(f):
                data = json.loads(line.strip())
                examples.append(EvalExample(
                    id=data.get("id", str(i)),
                    input_text=data["input"],
                    gold_label=data["gold"],
                    metadata=data.get("metadata", {})
                ))
        return cls(name=name or Path(path).stem, task_type=task_type, examples=examples)

    def to_jsonl(self, path: str) -> None:
        """Save to JSONL file."""
        with open(path, "w") as f:
            for ex in self.examples:
                record = {
                    "id": ex.id,
                    "input": ex.input_text,
                    "gold": ex.gold_label,
                    "metadata": ex.metadata
                }
                if ex.model_output:
                    record["model_output"] = ex.model_output
                    record["score"] = ex.score
                f.write(json.dumps(record) + "\n")

Model Runner

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from tqdm import tqdm

class LocalModelRunner:
    """Run inference for evaluation using a local HuggingFace model."""

    def __init__(
        self,
        model_name: str,
        system_prompt: str = "",
        max_new_tokens: int = 256,
        temperature: float = 0.0,
        device: str = "auto",
    ):
        self.model_name = model_name
        self.system_prompt = system_prompt
        self.max_new_tokens = max_new_tokens
        self.temperature = temperature

        print(f"Loading {model_name}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map=device,
        )
        self.model.eval()

    def format_prompt(self, input_text: str) -> str:
        """Format input with system prompt using the model's chat template if available."""
        if hasattr(self.tokenizer, "apply_chat_template") and self.tokenizer.chat_template:
            messages = []
            if self.system_prompt:
                messages.append({"role": "system", "content": self.system_prompt})
            messages.append({"role": "user", "content": input_text})
            return self.tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
        else:
            if self.system_prompt:
                return f"{self.system_prompt}\n\n{input_text}"
            return input_text

    @torch.no_grad()
    def generate(self, input_text: str) -> str:
        """Generate a single output."""
        prompt = self.format_prompt(input_text)
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        generate_kwargs = {
            "max_new_tokens": self.max_new_tokens,
            "do_sample": self.temperature > 0,
            "pad_token_id": self.tokenizer.eos_token_id,
        }
        if self.temperature > 0:
            generate_kwargs["temperature"] = self.temperature

        output_ids = self.model.generate(**inputs, **generate_kwargs)
        # Decode only the new tokens (not the input)
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    def run_dataset(self, dataset: EvalDataset, batch_size: int = 1) -> EvalDataset:
        """Run inference on all examples in the dataset."""
        for example in tqdm(dataset.examples, desc=f"Running {self.model_name}"):
            example.model_output = self.generate(example.input_text)
        return dataset

Metrics Library

import re
from collections import Counter

def normalize_text(text: str) -> str:
    """Standard normalization for string comparison."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = re.sub(r"\s+", " ", text)      # normalize whitespace
    return text

def exact_match(prediction: str, gold: str, normalize: bool = True) -> float:
    """Binary exact match score."""
    if normalize:
        return float(normalize_text(prediction) == normalize_text(gold))
    return float(prediction.strip() == gold.strip())

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between prediction and gold string."""
    pred_tokens = normalize_text(prediction).split()
    gold_tokens = normalize_text(gold).split()

    if not pred_tokens or not gold_tokens:
        return 0.0

    pred_counter = Counter(pred_tokens)
    gold_counter = Counter(gold_tokens)

    common = sum((pred_counter & gold_counter).values())

    if common == 0:
        return 0.0

    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1

def compute_rouge_l(prediction: str, reference: str) -> float:
    """ROUGE-L using longest common subsequence."""
    pred_tokens = normalize_text(prediction).split()
    ref_tokens = normalize_text(reference).split()

    if not pred_tokens or not ref_tokens:
        return 0.0

    # Dynamic programming for LCS
    m, n = len(pred_tokens), len(ref_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred_tokens[i-1] == ref_tokens[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    lcs_len = dp[m][n]
    precision = lcs_len / m
    recall = lcs_len / n
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def compute_schema_compliance(prediction: str, required_fields: list[str]) -> float:
    """Check what fraction of required fields appear in the prediction.

    For JSON outputs, parse and verify field presence.
    """
    try:
        parsed = json.loads(prediction)
        present = sum(1 for field in required_fields if field in parsed and parsed[field] is not None)
        return present / len(required_fields)
    except json.JSONDecodeError:
        # Not valid JSON - check for field names as a fallback
        present = sum(1 for field in required_fields if field in prediction)
        return present / len(required_fields) * 0.5  # penalize for invalid JSON

LLM-as-Judge

import openai
from string import Template

JUDGE_RUBRIC_TEMPLATE = Template("""You are an expert evaluator. Score the following model output on the specified task.

TASK: $task_description

INPUT:
$input_text

MODEL OUTPUT:
$model_output

REFERENCE (gold standard):
$reference

SCORING RUBRIC:
$rubric

Score the model output on each dimension. Return ONLY a JSON object with the following structure:
{
  "scores": {
    "dimension_name": <integer 1-5>,
    ...
  },
  "overall": <integer 1-5>,
  "reasoning": "<one sentence explanation>"
}
""")

class LLMJudge:
    """Evaluate model outputs using an LLM judge."""

    def __init__(
        self,
        judge_model: str = "gpt-4o",
        task_description: str = "",
        rubric: str = "",
        dimensions: list[str] = None,
    ):
        self.client = openai.OpenAI()
        self.judge_model = judge_model
        self.task_description = task_description
        self.rubric = rubric
        self.dimensions = dimensions or ["quality"]

    def score(self, input_text: str, model_output: str, reference: str = "") -> dict:
        """Score a single output. Returns dict with per-dimension and overall scores."""
        prompt = JUDGE_RUBRIC_TEMPLATE.substitute(
            task_description=self.task_description,
            input_text=input_text,
            model_output=model_output,
            reference=reference if reference else "No reference provided.",
            rubric=self.rubric,
        )

        response = self.client.chat.completions.create(
            model=self.judge_model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # deterministic scoring
            response_format={"type": "json_object"},
        )

        content = response.choices[0].message.content
        return json.loads(content)

    def score_dataset(self, dataset: EvalDataset) -> list[dict]:
        """Score all examples in a dataset."""
        results = []
        for example in tqdm(dataset.examples, desc="LLM judging"):
            try:
                result = self.score(
                    input_text=example.input_text,
                    model_output=example.model_output,
                    reference=str(example.gold_label) if example.gold_label else ""
                )
                results.append({"id": example.id, **result})
            except Exception as e:
                print(f"Judge failed on example {example.id}: {e}")
                results.append({"id": example.id, "scores": {}, "overall": 0, "reasoning": "error"})
        return results

Evaluation Runner and Report

import numpy as np
from scipy import stats as scipy_stats

class EvaluationRunner:
    """Orchestrate a complete evaluation run and produce a report."""

    def __init__(self, dataset: EvalDataset, task_type: str):
        self.dataset = dataset
        self.task_type = task_type

    def score_extraction(self) -> dict:
        """Score an extraction task using exact match and token F1."""
        em_scores = []
        f1_scores = []

        for ex in self.dataset.examples:
            em = exact_match(ex.model_output, str(ex.gold_label))
            f1 = token_f1(ex.model_output, str(ex.gold_label))
            ex.score = em  # primary metric for extraction is exact match
            em_scores.append(em)
            f1_scores.append(f1)

        return {
            "exact_match": float(np.mean(em_scores)),
            "token_f1": float(np.mean(f1_scores)),
            "n_examples": len(em_scores),
        }

    def score_classification(self, classes: list[str]) -> dict:
        """Score classification by extracting class labels from generated text."""
        correct = 0
        per_class_tp = {c: 0 for c in classes}
        per_class_fp = {c: 0 for c in classes}
        per_class_fn = {c: 0 for c in classes}

        for ex in self.dataset.examples:
            # Extract predicted label from free-form output
            pred = ex.model_output.lower().strip()
            gold = str(ex.gold_label).lower().strip()

            # Find the first class name that appears in the output
            matched_class = None
            for c in classes:
                if c.lower() in pred:
                    matched_class = c.lower()
                    break

            is_correct = matched_class == gold
            ex.score = float(is_correct)

            if is_correct:
                correct += 1
                per_class_tp[gold] = per_class_tp.get(gold, 0) + 1
            else:
                if matched_class:
                    per_class_fp[matched_class] = per_class_fp.get(matched_class, 0) + 1
                per_class_fn[gold] = per_class_fn.get(gold, 0) + 1

        accuracy = correct / len(self.dataset)

        # Macro F1
        f1s = []
        for c in classes:
            tp = per_class_tp.get(c, 0)
            fp = per_class_fp.get(c, 0)
            fn = per_class_fn.get(c, 0)
            prec = tp / (tp + fp) if (tp + fp) > 0 else 0
            rec = tp / (tp + fn) if (tp + fn) > 0 else 0
            f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0
            f1s.append(f1)

        return {
            "accuracy": accuracy,
            "macro_f1": float(np.mean(f1s)),
            "n_examples": len(self.dataset),
        }

    def compare_models(
        self,
        scores_a: list[float],
        scores_b: list[float],
        alpha: float = 0.05
    ) -> dict:
        """Compare two models using a paired t-test."""
        assert len(scores_a) == len(scores_b), "Score lists must be same length"

        mean_a = float(np.mean(scores_a))
        mean_b = float(np.mean(scores_b))
        diff = mean_b - mean_a

        # Paired t-test
        t_stat, p_value = scipy_stats.ttest_rel(scores_a, scores_b)

        # 95% CI on the difference via bootstrap
        n_bootstrap = 1000
        diffs = []
        n = len(scores_a)
        for _ in range(n_bootstrap):
            idx = np.random.randint(0, n, n)
            diffs.append(np.mean(np.array(scores_b)[idx]) - np.mean(np.array(scores_a)[idx]))

        ci_low = float(np.percentile(diffs, 2.5))
        ci_high = float(np.percentile(diffs, 97.5))

        return {
            "mean_a": mean_a,
            "mean_b": mean_b,
            "difference": diff,
            "p_value": float(p_value),
            "significant": p_value < alpha,
            "ci_95": (ci_low, ci_high),
            "recommendation": (
                "B significantly better" if diff > 0 and p_value < alpha else
                "A significantly better" if diff < 0 and p_value < alpha else
                "No significant difference"
            )
        }

    def print_report(self, scores: dict, model_name: str = "model") -> None:
        """Print a human-readable evaluation report."""
        print(f"\n{'='*60}")
        print(f"EVALUATION REPORT: {model_name}")
        print(f"Dataset: {self.dataset.name} ({len(self.dataset)} examples)")
        print(f"{'='*60}")
        for metric, value in scores.items():
            if isinstance(value, float):
                print(f"  {metric:<30} {value*100:.1f}%")
            else:
                print(f"  {metric:<30} {value}")
        print(f"{'='*60}\n")

Full Document Processing Evaluation Example

# Example: complete evaluation for an invoice extraction task

import json
from pathlib import Path

# --- Step 1: Define the system prompt for the task ---
EXTRACTION_SYSTEM_PROMPT = """You are an invoice data extraction assistant.
Extract the following fields from the invoice text:
- vendor_name: the legal name of the vendor (company sending the invoice)
- invoice_number: the invoice identifier
- total_amount: the total amount due (number only, no currency symbol)
- due_date: the payment due date in YYYY-MM-DD format

Return ONLY a JSON object with these four fields. If a field is not found, use null."""

# --- Step 2: Build evaluation dataset from JSONL ---
# Each line in the JSONL has: {"id": "...", "input": "invoice text...", "gold": {"vendor_name": ..., ...}}
dataset = EvalDataset.from_jsonl(
    path="./eval_data/invoices_test.jsonl",
    task_type="extraction",
    name="invoice_extraction_v1"
)

print(f"Loaded {len(dataset)} test examples")

# --- Step 3: Run inference with two candidate models ---
model_a = LocalModelRunner(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    system_prompt=EXTRACTION_SYSTEM_PROMPT,
    max_new_tokens=128,
    temperature=0.0,
)

model_b = LocalModelRunner(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    system_prompt=EXTRACTION_SYSTEM_PROMPT,
    max_new_tokens=128,
    temperature=0.0,
)

dataset_a = model_a.run_dataset(dataset.sample(n=50))  # quick test on 50 examples
dataset_b = model_b.run_dataset(dataset.sample(n=50))

# --- Step 4: Score each model ---
REQUIRED_FIELDS = ["vendor_name", "invoice_number", "total_amount", "due_date"]

def score_extraction_output(output: str, gold: dict) -> dict:
    """Score a single extraction output against gold."""
    schema_score = compute_schema_compliance(output, REQUIRED_FIELDS)
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {"schema_compliance": 0.0, "field_f1": 0.0, "vendor_exact": 0.0}

    field_scores = []
    for field in REQUIRED_FIELDS:
        pred_val = str(parsed.get(field, "") or "")
        gold_val = str(gold.get(field, "") or "")
        field_scores.append(token_f1(pred_val, gold_val))

    vendor_em = exact_match(
        str(parsed.get("vendor_name", "") or ""),
        str(gold.get("vendor_name", "") or "")
    )

    return {
        "schema_compliance": schema_score,
        "field_f1": float(np.mean(field_scores)),
        "vendor_exact": vendor_em,
    }

scores_a = []
scores_b = []

for ex_a, ex_b in zip(dataset_a.examples, dataset_b.examples):
    gold = ex_a.gold_label  # same gold for both
    score_a = score_extraction_output(ex_a.model_output, gold)
    score_b = score_extraction_output(ex_b.model_output, gold)
    scores_a.append(score_a["field_f1"])
    scores_b.append(score_b["field_f1"])

# --- Step 5: Compare and report ---
runner = EvaluationRunner(dataset_a, task_type="extraction")
comparison = runner.compare_models(scores_a, scores_b)

print(f"Mistral-7B-Instruct field F1: {comparison['mean_a']*100:.1f}%")
print(f"Llama3-8B-Instruct field F1:  {comparison['mean_b']*100:.1f}%")
print(f"Difference: {comparison['difference']*100:+.1f}pp")
print(f"P-value: {comparison['p_value']:.3f}")
print(f"95% CI: ({comparison['ci_95'][0]*100:.1f}%, {comparison['ci_95'][1]*100:.1f}%)")
print(f"Recommendation: {comparison['recommendation']}")

Evaluation Data Leakage

What Leakage Is and Why It Destroys Your Signal

Evaluation data leakage occurs when examples from your test set influence the model's training or prompting. The canonical form: your test set contains 200 invoice examples. You use 20 of them as few-shot examples in your prompt. You then evaluate the model on the same 200 examples. The model has seen the test examples during inference - in the prompt, not in training - and will perform artificially well on those 20 examples.

A subtler form: you use your test set to select the best prompt phrasing. You try 10 different system prompts and pick the one with the highest test score. Even if you never directly expose the test examples to the model, you have overfit your prompt to the test distribution. The reported test score will be optimistic.

Preventing Leakage

Strict separation is the only prevention. Designate 20% of your annotated data as the held-out test set before any experimentation. Lock it behind a code path that is only callable by the final evaluation script. Use the remaining 80% as a development set for prompt iteration and model selection.

If you must do few-shot prompting, draw few-shot examples exclusively from the development set, never from the test set.

If you run many experiments on the development set, your reported development scores will be optimistic due to selection bias. This is acceptable for the development set - it is expected that you overfit to it during iteration. The test set must remain untouched until you are ready to report final results.

Test Set Maintenance Over Time

Production data distributions drift. Your test set from six months ago may not represent today's inputs. Maintain your test set by:

Adding a fixed number of new examples from recent production traffic each quarter (e.g., 50 new annotated examples per quarter)
Retiring examples that represent use cases that no longer appear in production
Versioning the test set with a timestamp so scores are comparable only within the same version

This versioning practice lets you separate "the model improved" from "the test set got easier." If you update the test set, re-evaluate your current model on the new version to establish a new baseline before running comparative experiments.

Production Engineering Notes

Evaluation Latency and Cost

Running LLM-as-judge evaluation costs money and takes time. For a 300-example test set with GPT-4o as judge at ~800 tokens per evaluation call, the cost is approximately 0.3M input tokens plus 0.3M output tokens per evaluation run. At GPT-4o pricing, this is roughly $3-5 per run. For daily evaluation runs in a CI pipeline, budget accordingly.

To reduce cost, use a tiered approach: automated metrics (exact match, token F1, ROUGE) on every run; LLM judge only on runs where automated metrics show a significant change or improvement. This reduces judge calls by 80-90% with minimal loss of signal quality.

Parallelizing Evaluation

LLM inference is embarrassingly parallel. Use asyncio and batch API calls when running LLM-as-judge evaluation:

import asyncio
from openai import AsyncOpenAI

async def judge_example_async(client, prompt: str) -> dict:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

async def judge_all_async(examples: list, judge: LLMJudge) -> list[dict]:
    client = AsyncOpenAI()
    tasks = [
        judge_example_async(client, judge._build_prompt(ex))
        for ex in examples
    ]
    return await asyncio.gather(*tasks)

# Run 300 examples concurrently (rate-limited by the API)
results = asyncio.run(judge_all_async(dataset.examples, judge))

This reduces wall-clock time for 300 examples from ~30 minutes (serial) to ~2-3 minutes (parallel with rate limiting).

Integrating Evaluation Into CI

Evaluation should run automatically when model code or prompt templates change. A minimal CI evaluation stage:

# .github/workflows/eval.yml (GitHub Actions example)
name: Model Evaluation
on:
  push:
    paths:
      - 'src/prompts/**'
      - 'src/model_config.py'
      - 'eval/**'

jobs:
  evaluate:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Run evaluation
        run: |
          python eval/run_evaluation.py \
            --dataset eval/data/test_v3.jsonl \
            --model-config src/model_config.py \
            --output eval/results/latest.json
      - name: Check regression
        run: |
          python eval/check_regression.py \
            --current eval/results/latest.json \
            --baseline eval/results/baseline.json \
            --threshold 0.02  # fail if score drops more than 2pp

Common Mistakes

:::danger Never evaluate on the same data you used to select your prompt

If you try 10 prompts and pick the one with the highest score on your test set, that score is inflated. You have overfit your prompt to the test distribution, and the reported score does not predict performance on new data. Always separate development data (used for prompt iteration) from test data (used for final reporting). If you have only one pool of annotated data, use 80% for development and 20% for evaluation, and keep the 20% locked until you are ready for final scoring.

:::

:::danger Do not use ROUGE as your primary metric for generation tasks

ROUGE measures n-gram overlap with a reference. A model that generates a semantically identical summary using different words will score poorly. A model that copies phrases from the source document in a different order may score better than a model that wrote a clear, concise summary. ROUGE is useful as a fast sanity check but should not be the primary metric for generation quality. Use BERTScore for semantic similarity and LLM-as-judge for holistic quality.

:::

:::warning Do not draw conclusions from a test set smaller than 100 examples

With 50 examples, a 4-point accuracy difference has a confidence interval of approximately plus or minus 8 points. You cannot tell whether the difference is real or noise. If you cannot annotate at least 100 examples before making a model decision, treat your evaluation as directional (useful for eliminating obviously bad models) rather than definitive (useful for choosing between closely competing models).

:::

:::warning Normalize your extracted text before computing exact match

Exact match is fragile to minor formatting differences. "Acme Corp." and "ACME CORP" are the same vendor name. "2024-01-15" and "January 15, 2024" are the same date. Always define and apply a normalization function before computing exact match or token F1. Normalization should at minimum lower-case the text, strip punctuation, and normalize whitespace. For dates, parse and re-format to a canonical format before comparison.

:::

:::warning LLM judges exhibit position bias and verbosity bias

When using an LLM judge to compare two outputs side by side, it tends to prefer the output presented first (position bias) and the longer output (verbosity bias), independent of actual quality. Mitigate by: (1) always presenting the two outputs in both orderings and averaging the two judgments; (2) instructing the judge explicitly in the rubric to not prefer longer outputs; (3) using absolute scoring (score each output on a rubric independently) rather than relative comparison (which output is better?).

:::

Interview Q&A

Q1: You have been asked to evaluate a new LLM for a customer support ticket classification task. How do you design the evaluation?

First, I define the task precisely: given a customer support ticket, classify it into one of eight categories (billing, shipping, returns, technical issues, account access, product defect, general inquiry, escalation required). The model should output a single category label.

Second, I build a test set. I sample 400 tickets from the last 90 days of production data, stratified by category to ensure all eight categories are represented. I aim for 40-60 examples per category so I have enough statistical power to compute per-category F1. I annotate each ticket with a gold label using a two-annotator protocol and resolve disagreements by committee. I measure IAA (target kappa > 0.7 for this task), then set aside 80 examples as the held-out test set and use the remaining 320 as a development set.

Third, I select metrics: accuracy (for overall performance), macro-F1 (for balanced performance across categories, appropriate since I care about all categories not just the most common ones), and per-category precision and recall (to identify which specific categories the model struggles with).

Fourth, I design the prompt: system prompt describing the categories with clear definitions, then the ticket text as user input. I use the development set to iterate on the prompt, finding the phrasing that maximizes macro-F1 on the development set.

Fifth, I run final evaluation on the held-out test set with the best prompt, report macro-F1 and per-category scores, and compute 95% confidence intervals. I compare at least two candidate models to pick the one with higher macro-F1, checking statistical significance of the difference.

Q2: What is LLM-as-judge evaluation, and what are its main failure modes?

LLM-as-judge uses a strong language model (typically a frontier API model) to score or compare outputs from a weaker model under evaluation. The judge is prompted with the task description, the input, the model's output, optionally a reference, and a scoring rubric. It returns a score or ranking along with brief reasoning.

The main failure modes are:

Position bias: When asked to compare two outputs, the judge tends to prefer whichever is presented first, independent of quality. Mitigation: always run both orderings (A then B, B then A) and average the results.

Verbosity bias: Judges tend to rate longer outputs as higher quality, even when the additional content is not useful. Mitigation: explicitly state in the rubric that concise accurate outputs should score equally to longer accurate outputs. Evaluate using absolute rubrics rather than relative comparison.

Sycophancy: Judges tend to agree with the perspective or claim in the output being evaluated, especially if it is stated confidently. If the model output makes a wrong claim confidently, the judge may agree. Mitigation: provide a ground-truth reference and instruct the judge to score against it.

Model version drift: Judge scores are not stable across judge model versions. If you score 300 examples with GPT-4 today and score them again after a GPT-4 update, the scores may change. This makes historical comparison unreliable. Mitigation: snapshot judge model versions and use consistent versioned endpoints.

Cost and latency: LLM judging is expensive. 300 examples at frontier model pricing can cost $5-10 per evaluation run. This limits how often you can run full evaluations. Use automated metrics as a fast pre-filter.

Q3: Explain token-level F1 and when it is a better metric than exact match for extraction tasks.

Exact match scores 1 if the extracted string is identical to the gold label after normalization and 0 otherwise. It is appropriate when the extraction target is fully specified in the source and any correct extraction should produce identical output - for example, extracting an invoice number where the answer is always a specific alphanumeric code.

Token-level F1 treats the extracted and gold strings as bags of tokens and computes the F1 between them. It scores partial matches: if the gold vendor name is "Acme Manufacturing Corporation" and the model extracts "Acme Manufacturing", it would score 0 on exact match but 0.8 on token F1 (two of three tokens matched).

Token F1 is better when: (1) there is legitimate variation in how the correct answer can be phrased ("New York" vs "NY" for a city name - still somewhat wrong but partially right); (2) the extraction target has variable length and partial extraction is meaningfully better than no extraction; (3) you are evaluating across multiple extracted fields where a model that gets most of every field is better than a model that gets some fields perfectly and others completely wrong.

Exact match is better when: any deviation from the exact correct answer is equally wrong (invoice numbers, dates in a standardized format, phone numbers) and partial credit would obscure real differences in model quality.

For a practical extraction pipeline, I typically report both: exact match as the strict metric (this is what must pass before deployment) and token F1 as the soft metric (this measures how much information is being recovered overall).

Q4: How do you handle evaluation data leakage, and what are the subtler forms it takes?

The obvious form of leakage is direct exposure: few-shot examples drawn from the test set, or test examples appearing in the model's training data. Preventing this requires strict separation: the test set is created once, locked, and never used for training or prompt construction.

Subtler forms:

Selection leakage: You run 20 prompt variants on the test set and report the highest score. Even without exposing the test examples to the model, you have overfit your prompt to the test distribution. Every prompt decision made using test set signal is selection leakage. Prevention: use the development set for all prompt iteration; test set only for final reporting.

Annotation leakage: Your annotation guidelines were informed by looking at model outputs. If you looked at what the model produces and wrote guidelines that define those outputs as correct, your gold labels are biased toward the model being evaluated. Prevention: write annotation guidelines before running any inference; use the guidelines as the spec, not the model's outputs.

Distributional leakage: You collected your test set by sampling from a specific time window, then updated the model using data from a later time window that overlaps with the test window. The model has seen recent versions of the types of inputs in your test set, even if not the exact examples. Prevention: maintain a clear temporal split between training data and test data; test on inputs from after the training data cutoff.

Implicit leakage via tooling: Your evaluation pipeline uses the model's generated outputs to clean or correct gold labels. For example, if you use the model to suggest corrections to annotation disagreements, you have biased the gold labels toward the model's behavior. Prevention: gold labels must come entirely from human judgment, independent of the model being evaluated.

Q5: A colleague says "our model achieves 91% accuracy on our test set so it is ready for production." What questions do you ask before agreeing?

I ask six questions.

First: how was the test set built and what is its size? 91% on 30 examples is meaningless noise. 91% on 500 examples is a real signal. And if the test set was built by the same person who built the model, or if examples from the test set were used in development, the number is not trustworthy.

Second: what is the class distribution in the test set? If 90% of examples belong to class A, a model that always predicts class A achieves 90% accuracy with no ability to distinguish any class. 91% in this case is almost worthless. What is the macro-F1 across classes?

Third: has the test set been evaluated before? If this is the first time anyone is looking at this score, it is more credible. If the model was tuned based on signals from this test set, the 91% is overfit.

Fourth: does the test set match production distribution? Are the length distribution, domain coverage, and format variation in the test set representative of what the model will actually see? A test set that over-represents clean, well-formatted inputs will give optimistic scores for a production system that receives noisy real-world inputs.

Fifth: what failure modes are in the remaining 9%? Is the 9% randomly distributed or concentrated in a specific failure mode - a particular input type, length range, or content domain? Concentrated failure modes at 9% can still be production-breaking if they affect high-priority inputs.

Sixth: what does "ready for production" mean in this context? Is there a human review step that catches the remaining errors? What is the cost of a false positive versus a false negative? 91% accuracy might be fine for low-stakes routing but completely insufficient for medical information extraction.

Q6: How do you measure whether your evaluation suite is actually predictive of production quality?

The fundamental challenge is that production quality is hard to measure directly - you often do not have ground truth for production outputs. But there are several approaches.

Correlation with human judgment: Periodically run a human evaluation on a sample of production outputs (50-100 examples per quarter). Have expert annotators rate the outputs on the same rubric used in your automated evaluation. Compute the correlation between automated scores on your test set and human scores on the production sample. If the correlation is high ( $\rho > 0.7$ ), your automated evaluation is a reasonable proxy.

Correlation with downstream metrics: For applications where you can observe business outcomes (user satisfaction ratings, escalation rates, task completion in production), compute the correlation between evaluation scores and those outcomes across model versions. If evaluating a customer support bot, does a 5-point improvement in your evaluation score correspond to a measurable reduction in escalation rate?

Prospective tracking: Every time you deploy a new model version, record both the pre-deployment evaluation score and the post-deployment production metrics over the next two weeks. Over time, you build a dataset of (evaluation score, production outcome) pairs that lets you measure the predictive validity of your evaluation directly.

Adversarial testing: Deliberately create model variants that score well on your test set but should perform poorly in production - for example, a model fine-tuned heavily to produce exactly the right format for your test examples but lacking general capability. If your evaluation correctly gives these models lower scores than genuinely better models, it has some discriminative power. If your evaluation ranks them highly, you have a leakage or overfitting problem in your test set design.

Summary

Standard benchmarks tell you how a model performs on tasks designed by researchers to measure general capability. Task-specific evaluation tells you how the model performs on your task. Only the second number predicts production quality.

The investment in building a good task-specific evaluation suite is one of the highest-leverage things you can do in an LLM deployment project. A 300-example test set with clear annotation guidelines and automated scoring costs two or three days to build. It then pays dividends every time you update your prompt, switch models, or add a new feature - providing a controlled comparison that catches regressions before they reach production.

The key discipline is separation: test data cannot touch training or prompt development. IAA above 0.6 before trusting gold labels. Metrics matched to task type. Confidence intervals reported alongside point estimates. When these practices are followed, your evaluation scores will correlate with production quality, and model decisions become engineering decisions rather than guesses.

In the next lesson, we examine safety and bias evaluation - what happens when the failure mode is not "wrong answer" but "harmful output."

The Vendor Name That Changed Everything​

Why This Exists​

The Mismatch Between Benchmarks and Production​

What Task-Specific Evaluation Provides​

Historical Context​

From Manual QA to Systematic Evaluation​

Core Concepts​

Evaluation Design Principles​

Building a Test Set: Size and Sampling Strategy​

Annotation Guidelines​

Metrics by Task Type​

Classification Tasks​

Generation Tasks​

Extraction Tasks​

Conversation and Multi-turn Tasks​

Mermaid Diagrams​

Evaluation Design Workflow​

Metrics Selection by Task Type​

A/B Evaluation Architecture​

Code: Building a Complete Evaluation Pipeline​

Test Set Loader​

Model Runner​

Metrics Library​

LLM-as-Judge​

Evaluation Runner and Report​

Full Document Processing Evaluation Example​

Evaluation Data Leakage​

What Leakage Is and Why It Destroys Your Signal​

Preventing Leakage​

Test Set Maintenance Over Time​

Production Engineering Notes​

Evaluation Latency and Cost​

Parallelizing Evaluation​

Integrating Evaluation Into CI​

Common Mistakes​

Interview Q&A​

Summary​