
Evaluating Fine-Tuned Models

The Metric Trap

The loss curve is flat. Training converged cleanly. You run your fine-tuned model on a few test prompts and the outputs look good. You ship it.

Three weeks later, support tickets start coming in. The model is confidently producing wrong answers to factual questions it would have handled correctly before fine-tuning. It is also generating responses in a slightly off format that downstream parsing code cannot handle. None of this showed up in your manual testing because you tested on examples similar to your training data - exactly where a fine-tuned model looks its best.

This scenario plays out repeatedly in production ML. The problem is not that evaluation is hard in theory. The problem is that the most available metric - training loss - measures how well the model fits your training data, which is exactly not what you care about. You care about how the model performs on real user queries it has never seen, on tasks it was not explicitly trained on, and in conditions where your training data might have introduced biases or gaps.

Evaluation for fine-tuned language models is genuinely different from evaluation in classical ML. There is no single ground-truth label to compare against for open-ended generation. A response can be correct, helpful, and well-formatted in dozens of different ways. Two responses to the same prompt can both be good while being completely different strings, making string-match metrics like BLEU or ROUGE nearly meaningless for generation quality.

The field has converged on a practical three-layer evaluation stack: automatic metrics for fast iteration, LLM-as-judge for quality assessment at scale, and human evaluation for high-stakes decisions. Each layer has different cost, latency, and reliability characteristics. Understanding how to combine them correctly - and which mistakes will give you false confidence - is the subject of this lesson.


Why This Exists - The Evaluation Gap

Before the LLM era, NLP model evaluation was tractable. A sentiment classifier either got the label right or wrong. A machine translation model could be scored against a set of reference translations using BLEU scores. Named entity recognition had an exact match against annotated spans. These metrics were imperfect but correlated reasonably well with human judgment.

The transition to generative language models broke this paradigm. The outputs of an instruction-following model are open-ended natural language. Consider the prompt: "Explain why the sky is blue." The correct answer can be expressed in 50 words or 500 words, using different vocabulary, different levels of technical depth, different sentence structures. None of these formulations are more correct than others - they are appropriate for different audiences and contexts. BLEU score between two valid explanations that share no n-grams beyond common words would be essentially zero, falsely suggesting the model failed.

The research community responded with increasingly sophisticated evaluation frameworks. MT-Bench (Zheng et al., 2023) defined a set of 80 multi-turn conversation questions across 8 categories and used GPT-4 as a judge to score responses on a 1-10 scale. AlpacaEval used a similar approach but focused on win rates against a reference model. lm-eval-harness provided a standardized framework for running dozens of academic benchmarks. Each approach solved part of the problem and introduced new failure modes.

What the field learned through collective experience: no single metric is sufficient, data contamination (test set examples appearing in training data) silently inflates benchmark scores, and LLM-as-judge evaluation has systematic biases (verbose responses score higher, outputs that resemble the judge model's style score higher) that can mislead you if you do not control for them.


Historical Context - How Evaluation Evolved

The first LLM fine-tuning evaluations in 2022 were largely qualitative. InstructGPT (Ouyang et al., 2022) established the template: train on human-curated instructions, then have human raters score outputs on helpfulness, honesty, and harmlessness. This was expensive and slow, but grounded. It was the gold standard, not a scalable daily practice.

As open-source models proliferated in 2023, the community needed cheaper evaluation. The Hugging Face Open LLM Leaderboard, launched mid-2023, standardized four academic benchmarks: ARC (grade-school science reasoning), HellaSwag (commonsense sentence completion), MMLU (knowledge questions across 57 subjects), and TruthfulQA (resistance to common misconceptions). These are multiple-choice benchmarks that can be evaluated automatically and reproducibly. They became the de facto measure of "base model quality" in 2023.

Then came LLM-as-judge. The insight from Zheng et al. (2023) with MT-Bench was that GPT-4 (or any strong model) could evaluate free-form responses in ways that correlated highly with human judgments, at a fraction of the cost of human evaluation. This opened the door to scalable quality assessment for instruction-tuned models. By 2024, LLM-as-judge had become standard practice for teams doing RLHF and DPO alignment work.

The field is still actively working through the limitations of these approaches. Benchmark contamination (models trained on data that overlaps with test sets) became a serious problem by 2024, prompting the development of dynamic benchmarks that change over time. LLM-as-judge bias (length bias, self-preference bias) has been quantified and partially corrected for. The current best practice is a combination of approaches, with human evaluation reserved for high-stakes decisions.


Core Concepts - The Evaluation Stack

Layer 1: Automatic Metrics (Fast, Cheap, Imperfect)

Automatic metrics compute scores algorithmically from model outputs without requiring human judgment. They are fast enough to run during training (as part of your eval loop) and cheap enough to run on thousands of examples.

Perplexity is the most fundamental metric for language models. It measures how surprised the model is by a held-out text:

\text{Perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(w_i \mid w_1, \ldots, w_{i-1})\right)

Lower perplexity = more confident = better fit to the data distribution. Perplexity is useful for measuring whether fine-tuning hurt the model's language modeling ability (an indicator of catastrophic forgetting) and for comparing models on the same data distribution. It is not useful for measuring instruction following quality - a model can have low perplexity on a test set while producing unhelpful or incorrectly formatted responses.
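
A minimal sketch of measuring perplexity on a held-out text with Hugging Face Transformers (the model path and text are placeholders; long documents would need a sliding window, omitted here):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model_path: str, text: str, max_length: int = 2048) -> float:
    """Perplexity of a held-out text under a causal LM (single window, no striding)."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=max_length
    ).to(model.device)
    with torch.no_grad():
        # Passing input_ids as labels makes the model return the mean
        # per-token negative log-likelihood as `loss`; exp(loss) is perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()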

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram overlap between generated text and reference text. ROUGE-1 measures unigram overlap, ROUGE-2 measures bigram overlap, ROUGE-L measures longest common subsequence:

\text{ROUGE-N} = \frac{\sum_{s \in \text{Reference}} \sum_{\text{n-gram} \in s} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{s \in \text{Reference}} \sum_{\text{n-gram} \in s} \text{Count}(\text{n-gram})}

ROUGE is appropriate when outputs have clear reference answers - summarization, extraction, or classification tasks phrased as generation. It is inappropriate for open-ended generation where many valid responses exist.

BLEU (Bilingual Evaluation Understudy) is the standard machine translation metric, measuring modified precision of n-grams. It is essentially unused for modern LLM evaluation outside of translation-specific tasks.

BERTScore measures semantic similarity using BERT embeddings rather than exact string matches. It handles paraphrase and synonym matching that ROUGE misses, making it more appropriate for open-ended generation where the content matters more than the exact wording.
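
Both metrics are available through the Hugging Face evaluate library; a short sketch (the example strings are illustrative) shows how a paraphrase can score near zero on ROUGE while scoring high on BERTScore:

import evaluate  # pip install evaluate rouge_score bert_score

predictions = ["Air molecules scatter short-wavelength light most strongly, so the sky looks blue."]
references = ["The sky appears blue because blue light is scattered more than other colors by the atmosphere."]

# N-gram overlap: low for paraphrases that share few exact word sequences
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

# Embedding-based similarity: high for paraphrases with the same meaning
bertscore = evaluate.load("bertscore")
scores = bertscore.compute(predictions=predictions, references=references, lang="en")
print(sum(scores["f1"]) / len(scores["f1"]))  # mean BERTScore F1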

Layer 2: LLM-as-Judge (Scalable, Moderately Reliable)

LLM-as-judge uses a powerful model (typically GPT-4, Claude, or Gemini) to evaluate free-form outputs from your fine-tuned model. The judge can assess helpfulness, accuracy, format compliance, tone, and any other quality dimension you specify - mimicking human evaluation at scale.

Two variants:

Pointwise scoring: The judge evaluates each response independently on a numerical scale (e.g., 1-10 for helpfulness). Good for monitoring absolute quality over time.

Pairwise comparison: The judge compares two responses (e.g., fine-tuned vs. base model) and declares a winner. Less susceptible to calibration drift across models, better for A/B comparisons. The win rate metric comes from pairwise comparisons.

import json

import openai

client = openai.OpenAI()

JUDGE_PROMPT = """You are an expert evaluator of language model responses.

You will be given a user query and two responses (A and B). Your task is to determine
which response is better, considering:
- Accuracy and factual correctness
- Helpfulness and completeness
- Clarity and conciseness
- Format appropriateness

Output a JSON object with:
- "winner": "A", "B", or "tie"
- "reasoning": one sentence explaining your decision
- "scores": {{"A": 1-10, "B": 1-10}}

User Query: {query}

Response A: {response_a}

Response B: {response_b}

Output only the JSON object, no other text."""


def judge_pairwise(
    query: str,
    response_a: str,
    response_b: str,
) -> dict:
    """Use GPT-4 to compare two model responses."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    query=query,
                    response_a=response_a,
                    response_b=response_b,
                ),
            }
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)


def compute_win_rate(
    queries: list[str],
    baseline_responses: list[str],
    model_responses: list[str],
) -> dict:
    """Compute win rate of model_responses vs baseline_responses."""
    results = {"wins": 0, "losses": 0, "ties": 0, "scores": []}

    for query, baseline, model_resp in zip(queries, baseline_responses, model_responses):
        # Run in both orders to control for position bias
        result_ab = judge_pairwise(query, model_resp, baseline)
        result_ba = judge_pairwise(query, baseline, model_resp)

        # Aggregate (A = model in result_ab, B = model in result_ba)
        model_wins = (result_ab["winner"] == "A") + (result_ba["winner"] == "B")
        baseline_wins = (result_ab["winner"] == "B") + (result_ba["winner"] == "A")

        if model_wins > baseline_wins:
            results["wins"] += 1
        elif baseline_wins > model_wins:
            results["losses"] += 1
        else:
            results["ties"] += 1

        results["scores"].append({
            "model": result_ab["scores"]["A"],
            "baseline": result_ab["scores"]["B"],
        })

    total = len(queries)
    results["win_rate"] = results["wins"] / total
    results["loss_rate"] = results["losses"] / total
    results["tie_rate"] = results["ties"] / total
    return results

Layer 3: Human Evaluation (Gold Standard, Expensive)

Human evaluation involves having actual people rate or compare model outputs. It is the only evaluation method that captures the full richness of what makes a response good - tone, trustworthiness, actual correctness of complex claims, appropriateness for a specific user population.

The cost structure for human evaluation:

  • Internal team evaluation: fast, cheap, but limited to questions your team can assess. Appropriate for format and factual accuracy.
  • Crowdsourced evaluation (MTurk, Scale AI): covers larger sample sizes. Requires careful rubric design to get consistent ratings.
  • Domain expert evaluation: expensive but necessary for specialized domains (medical, legal, scientific). Non-experts cannot assess whether a complex answer is actually correct.

Practical guidance: run human evaluation at major milestones (model release decisions, major data updates) rather than continuously. Use LLM-as-judge for continuous monitoring between human evaluation checkpoints.


Setting Up a Proper Train/Eval Split

The most important infrastructure decision for evaluation is creating a hold-out test set before any training. This seems obvious, but production projects routinely make mistakes here.

import hashlib

from datasets import DatasetDict, load_dataset


def create_deterministic_splits(dataset_path: str, val_ratio: float = 0.1, test_ratio: float = 0.1):
    """
    Create train/val/test splits with deterministic assignment.
    Each example always goes to the same split regardless of when you run this.
    """
    dataset = load_dataset("json", data_files=dataset_path, split="train")

    def assign_split(example):
        """Hash-based split assignment - deterministic, no leakage."""
        # Use a unique field from the example (e.g., an ID or the full text hash)
        example_hash = hashlib.md5(str(example).encode()).hexdigest()
        hash_value = int(example_hash, 16) % 100

        if hash_value < int(test_ratio * 100):
            return {"split": "test"}
        elif hash_value < int((val_ratio + test_ratio) * 100):
            return {"split": "val"}
        else:
            return {"split": "train"}

    dataset = dataset.map(assign_split)

    return DatasetDict({
        "train": dataset.filter(lambda x: x["split"] == "train"),
        "validation": dataset.filter(lambda x: x["split"] == "val"),
        "test": dataset.filter(lambda x: x["split"] == "test"),
    })


# The test set must NEVER be touched until final evaluation
splits = create_deterministic_splits("data/instruction_data.jsonl", val_ratio=0.1, test_ratio=0.1)
print(f"Train: {len(splits['train'])} | Val: {len(splits['validation'])} | Test: {len(splits['test'])}")

Key rules for test set integrity:

  • Create the test set before any data exploration or cleaning. Data cleaning can leak information about the test distribution if you inspect test examples to understand the data.
  • Never inspect individual test examples after training begins. Looking at failures and fixing them creates implicit test set leakage.
  • The validation set is for hyperparameter tuning during training. The test set is used exactly once, for final model evaluation.
  • For very small datasets (under 1,000 examples), consider k-fold cross-validation rather than a single held-out test set; a minimal sketch follows below.
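
For that small-dataset case, a minimal k-fold sketch with scikit-learn (the training and evaluation calls are placeholders): every example is used for evaluation in exactly one fold, and the reported metric is the average across folds.

import numpy as np
from sklearn.model_selection import KFold

def kfold_indices(n_examples: int, k: int = 5, seed: int = 42):
    """Yield (train_idx, eval_idx) index arrays for k-fold evaluation."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    yield from kf.split(np.arange(n_examples))

# for fold, (train_idx, eval_idx) in enumerate(kfold_indices(len(examples), k=5)):
#     train_subset = [examples[i] for i in train_idx]
#     eval_subset = [examples[i] for i in eval_idx]
#     # fine-tune on train_subset, evaluate on eval_subset, then average metrics over folds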

Building Your Domain-Specific Evaluation Set

The most important evaluation is on examples that reflect your actual production use case. Academic benchmarks and general metrics will not tell you whether your model works for your specific application.

A practical process for building a domain-specific evaluation set:

Step 1: Sample from production queries. If you have logs of real user queries, sample 200-500 representative examples. Cluster by topic or intent to ensure coverage across your use case space, not just the highest-volume query types.

Step 2: Create high-quality reference answers. For each evaluation query, write or curate the ideal response. This is expensive and requires domain expertise. For most teams, 100-200 high-quality examples is more valuable than 1,000 mediocre ones.

Step 3: Define your evaluation rubric. What specifically makes a response good for your use case? Be concrete: "the response should be under 200 words," "the response should always recommend consulting a doctor for medical questions," "the response should cite sources for factual claims." Concrete rubric criteria translate directly to LLM-as-judge prompts.

Step 4: Establish a baseline. Run the base model (pre-fine-tuning) and a strong prompted baseline (e.g., base model with a carefully engineered system prompt) on your evaluation set before you train anything. These are your comparison points.

"""Build a domain evaluation set and run baseline comparisons."""

import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


def load_eval_queries(path: str) -> list[dict]:
    """Load evaluation queries with reference answers."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def run_model_on_eval_set(
    model_path: str,
    eval_queries: list[dict],
    max_new_tokens: int = 512,
) -> list[dict]:
    """Generate responses for all eval queries."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding for evaluation (reproducible)
        temperature=1.0,
        top_p=1.0,
    )

    results = []
    for example in eval_queries:
        messages = [{"role": "user", "content": example["query"]}]
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        output = pipe(prompt)[0]["generated_text"]
        # Strip the prompt prefix from the output
        response = output[len(prompt):]

        results.append({
            "query": example["query"],
            "reference": example.get("reference_answer", ""),
            "model_response": response,
            "model_path": model_path,
        })

    return results


# Compare base model vs fine-tuned model
eval_queries = load_eval_queries("eval/domain_eval_set.jsonl")

base_results = run_model_on_eval_set("meta-llama/Meta-Llama-3-8B-Instruct", eval_queries)
finetuned_results = run_model_on_eval_set("./outputs/llama3-8b-finetuned/merged", eval_queries)

# Save results for judge evaluation
with open("eval/base_results.jsonl", "w") as f:
    for r in base_results:
        f.write(json.dumps(r) + "\n")

with open("eval/finetuned_results.jsonl", "w") as f:
    for r in finetuned_results:
        f.write(json.dumps(r) + "\n")

MT-Bench and AlpacaEval - Instruction Following Quality

MT-Bench evaluates a model's ability to follow multi-turn instructions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge (STEM), and knowledge (humanities). It is useful for assessing whether fine-tuning improved or degraded general instruction following capability.

AlpacaEval measures win rate against a reference model (originally text-davinci-003, later GPT-4-turbo) on 805 diverse instruction-following prompts. The metric is the percentage of queries where the evaluated model beats the reference.

# Install evaluation tools
pip install lm-eval
pip install alpaca-eval

# Run MT-Bench (requires OpenAI API key for judging)
git clone https://github.com/lm-sys/FastChat.git
cd FastChat/fastchat/llm_judge

# Generate model answers
python gen_model_answer.py \
    --model-path ./outputs/llama3-8b-finetuned/merged \
    --model-id llama3-8b-finetuned

# Judge with GPT-4
python gen_judgment.py \
    --model-list llama3-8b-finetuned \
    --parallel 4

# Show results
python show_result.py --model-list llama3-8b-finetuned

# AlpacaEval
alpaca_eval evaluate \
    --model_outputs outputs/model_outputs.json \
    --annotators_config chatgpt_fn \
    --name "llama3-8b-finetuned"

Catastrophic Forgetting - Evaluating What You Broke

Fine-tuning on a narrow domain always risks degrading the model's performance on tasks outside that domain. If you fine-tune on customer support conversations, the model might become worse at coding. If you fine-tune on formal writing, it might lose ability to generate casual conversational responses. This is catastrophic forgetting - the model overwrites general capabilities with specialized ones.

Detection requires evaluating on benchmarks that cover general capabilities before and after fine-tuning.

Running lm-eval-harness for Capability Benchmarks

# Run before fine-tuning to establish baseline
# lm_eval is lm-evaluation-harness from EleutherAI

import json
import subprocess


def run_benchmark(model_path: str, output_path: str, tasks: list[str] | None = None):
    """Run lm-eval benchmarks on a model."""
    if tasks is None:
        # Standard capability benchmark suite
        tasks = [
            "arc_challenge",   # Grade-school science reasoning
            "hellaswag",       # Commonsense sentence completion
            "mmlu",            # Knowledge breadth
            "truthfulqa_mc2",  # Factual accuracy
            "winogrande",      # Pronoun resolution
            "gsm8k",           # Math reasoning
        ]

    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path},dtype=bfloat16",
        "--tasks", ",".join(tasks),
        "--batch_size", "auto",
        "--output_path", output_path,
        "--log_samples",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result


def compare_benchmarks(baseline_path: str, finetuned_path: str):
    """Compare benchmark results and flag regressions."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(finetuned_path) as f:
        finetuned = json.load(f)

    print(f"{'Task':<30} {'Baseline':>10} {'Fine-tuned':>12} {'Delta':>8} {'Alert':>6}")
    print("-" * 70)

    alerts = []
    for task, base_results in baseline["results"].items():
        if task not in finetuned["results"]:
            continue

        base_acc = base_results.get("acc_norm,none", base_results.get("acc,none", 0))
        ft_acc = finetuned["results"][task].get(
            "acc_norm,none",
            finetuned["results"][task].get("acc,none", 0)
        )
        delta = ft_acc - base_acc
        # Check the larger regression first; otherwise CRIT can never trigger
        alert = "CRIT" if delta < -0.05 else "WARN" if delta < -0.03 else "OK"

        if alert != "OK":
            alerts.append((task, delta, alert))

        print(f"{task:<30} {base_acc:>10.3f} {ft_acc:>12.3f} {delta:>+8.3f} {alert:>6}")

    if alerts:
        print(f"\nRegression alerts: {len(alerts)}")
        for task, delta, level in alerts:
            print(f"  [{level}] {task}: {delta:+.3f}")

    return alerts


# Run before training
run_benchmark(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "eval/benchmarks/baseline/",
)

# Run after training
run_benchmark(
    "./outputs/llama3-8b-finetuned/merged",
    "eval/benchmarks/finetuned/",
)

# Compare
alerts = compare_benchmarks(
    "eval/benchmarks/baseline/results.json",
    "eval/benchmarks/finetuned/results.json",
)

Task-Specific Metrics - Choosing the Right Measure

Different downstream tasks require different evaluation metrics. Using the wrong metric for your task is as harmful as not evaluating at all.

Classification Tasks (Formatted as Generation)

When you fine-tune a model to output a label from a fixed set, you can compute exact match accuracy and per-class F1:

from sklearn.metrics import classification_report


def extract_label(response: str, valid_labels: list[str]) -> str | None:
    """Extract classification label from model response."""
    response_lower = response.lower().strip()

    # Try exact match first
    for label in valid_labels:
        if response_lower.startswith(label.lower()):
            return label

    # Try finding label anywhere in response
    for label in valid_labels:
        if label.lower() in response_lower:
            return label

    return None  # Failed to extract a valid label


def evaluate_classifier(
    model_responses: list[str],
    true_labels: list[str],
    valid_labels: list[str],
) -> dict:
    """Evaluate classification performance from model responses."""
    predicted = [
        extract_label(r, valid_labels) or "UNKNOWN"
        for r in model_responses
    ]

    # Count extraction failures
    extraction_failures = sum(1 for p in predicted if p == "UNKNOWN")
    extraction_rate = 1 - (extraction_failures / len(predicted))

    report = classification_report(true_labels, predicted, output_dict=True, zero_division=0)
    report["label_extraction_rate"] = extraction_rate

    return report

Code Generation - Execution-Based Evaluation

For code generation tasks, the only reliable metric is whether the generated code actually runs and passes tests. String-match metrics are essentially useless for code.

import os
import subprocess
import tempfile


def evaluate_code_solution(code: str, test_cases: list[dict]) -> dict:
    """
    Execute generated code against test cases.
    Returns pass rate and per-case results.
    """
    results = {"passed": 0, "failed": 0, "errors": 0, "cases": []}

    for test in test_cases:
        # Write code + test to temp file
        full_code = code + "\n\n" + test["test_code"]

        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".py", delete=False
        ) as f:
            f.write(full_code)
            tmp_path = f.name

        try:
            result = subprocess.run(
                ["python", tmp_path],
                capture_output=True,
                text=True,
                timeout=10,  # 10 second timeout per test
            )
            passed = result.returncode == 0
            if passed:
                results["passed"] += 1
            else:
                results["failed"] += 1

            results["cases"].append({
                "test_id": test["id"],
                "passed": passed,
                "stderr": result.stderr[:200] if not passed else "",
            })
        except subprocess.TimeoutExpired:
            results["errors"] += 1
            results["cases"].append({"test_id": test["id"], "passed": False, "error": "timeout"})
        finally:
            os.unlink(tmp_path)

    total = len(test_cases)
    results["pass_rate"] = results["passed"] / total if total > 0 else 0
    return results

Comparing Fine-Tuned vs Base vs Prompting - The Right Comparison

Before concluding that fine-tuning added value, you must compare against two baselines, not one:

Baseline 1: The base instruction model with no additional prompting. This establishes whether fine-tuning beat the off-the-shelf model.

Baseline 2: The base instruction model with a well-engineered system prompt. This is the comparison that matters most. Many teams skip this baseline and declare fine-tuning a success when a good system prompt would have achieved the same result at zero training cost.

import torch

SYSTEM_PROMPTS = {
    "no_system": "",

    "basic_system": "You are a helpful assistant.",

    "domain_engineered": """You are an expert customer support agent for AcmeCorp software.
You help users troubleshoot technical issues with our products.
Always be concise - responses should be under 150 words.
If you cannot solve an issue, escalate with: "I'll connect you with our technical team."
Never discuss competitor products.
Format multi-step solutions as numbered lists.""",
}


def run_prompting_baseline(
    model,
    tokenizer,
    eval_queries: list[dict],
    system_prompt: str,
) -> list[dict]:
    """Run evaluation with a specific system prompt."""
    results = []
    for example in eval_queries:
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": example["query"]})

        inputs = tokenizer.apply_chat_template(
            messages, return_tensors="pt", add_generation_prompt=True
        ).to(model.device)

        with torch.no_grad():
            output_ids = model.generate(
                inputs,
                max_new_tokens=512,
                do_sample=False,
            )

        response = tokenizer.decode(
            output_ids[0][inputs.shape[1]:],
            skip_special_tokens=True
        )
        results.append({"query": example["query"], "response": response})

    return results

A good decision framework: if the domain-engineered prompting baseline achieves 80%+ of your target metrics, fine-tuning may not be worth the operational complexity. Fine-tuning adds the most value when: (1) your desired behavior cannot be described in a system prompt (it requires the model to internalize a style or domain vocabulary), (2) you need consistent structured output that prompting achieves only 70-80% of the time, or (3) you have cost/latency requirements that push toward smaller models, and fine-tuning makes a smaller model competitive with a larger prompted one.


Building an Automated Evaluation Pipeline

Production fine-tuning requires repeatable, automated evaluation that runs after every training job without manual intervention.

"""
Automated evaluation pipeline for fine-tuned models.
Runs after every training job as a CI step.
"""

import json
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# judge_pairwise is the position-bias-aware judge function from the LLM-as-judge section above


@dataclass
class EvalConfig:
    model_path: str
    baseline_model_path: str
    eval_set_path: str
    output_dir: str
    run_benchmarks: bool = True
    run_llm_judge: bool = True
    run_task_metrics: bool = True
    openai_api_key: Optional[str] = None


@dataclass
class EvalReport:
    model_path: str
    task_metrics: dict
    benchmark_scores: dict
    win_rate_vs_baseline: Optional[float]
    catastrophic_forgetting_flags: list[str]
    passed: bool
    failure_reasons: list[str]


class EvaluationPipeline:
    """End-to-end evaluation pipeline for fine-tuned models."""

    # Minimum acceptable win rate vs baseline to pass evaluation
    MIN_WIN_RATE = 0.55

    # Maximum acceptable regression on any benchmark task
    MAX_BENCHMARK_REGRESSION = 0.05

    def __init__(self, config: EvalConfig):
        self.config = config
        self.output_dir = Path(config.output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

    def load_eval_set(self) -> list[dict]:
        with open(self.config.eval_set_path) as f:
            return [json.loads(line) for line in f]

    def generate_responses(self, model_path: str, queries: list[dict]) -> list[str]:
        """Generate model responses for all eval queries."""
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )

        pipe = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            max_new_tokens=512,
            do_sample=False,
            batch_size=8,
        )

        responses = []
        for example in queries:
            messages = [{"role": "user", "content": example["query"]}]
            prompt = tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            output = pipe(prompt)[0]["generated_text"]
            responses.append(output[len(prompt):].strip())

        # Clean up to free VRAM before next evaluation
        del model
        torch.cuda.empty_cache()
        return responses

    def compute_win_rate(
        self,
        queries: list[dict],
        finetuned_responses: list[str],
        baseline_responses: list[str],
    ) -> Optional[float]:
        """Compute win rate using LLM-as-judge."""
        if not self.config.openai_api_key:
            logging.warning("No OpenAI key - skipping LLM-judge evaluation")
            return None

        pairs = list(zip(
            [q["query"] for q in queries],
            finetuned_responses,
            baseline_responses,
        ))

        wins, total = 0, len(pairs)
        for query, ft_resp, base_resp in pairs:
            result = judge_pairwise(query, ft_resp, base_resp)
            if result["winner"] == "A":  # A = finetuned
                wins += 1
            elif result["winner"] == "tie":
                wins += 0.5  # count ties as half wins

        return wins / total

    def run(self) -> EvalReport:
        """Execute full evaluation pipeline."""
        eval_set = self.load_eval_set()
        failure_reasons = []
        forgetting_flags = []

        logging.info(f"Generating responses for fine-tuned model: {self.config.model_path}")
        finetuned_responses = self.generate_responses(self.config.model_path, eval_set)

        logging.info(f"Generating responses for baseline: {self.config.baseline_model_path}")
        baseline_responses = self.generate_responses(self.config.baseline_model_path, eval_set)

        # --- Win rate ---
        win_rate = None
        if self.config.run_llm_judge:
            win_rate = self.compute_win_rate(eval_set, finetuned_responses, baseline_responses)
            if win_rate is not None:
                logging.info(f"Win rate vs baseline: {win_rate:.3f}")
                if win_rate < self.MIN_WIN_RATE:
                    failure_reasons.append(
                        f"Win rate {win_rate:.3f} below threshold {self.MIN_WIN_RATE}"
                    )

        # --- Task metrics ---
        task_metrics = {}
        if self.config.run_task_metrics:
            # Add your task-specific metrics here
            pass

        # --- Benchmark scores ---
        benchmark_scores = {}
        if self.config.run_benchmarks:
            # Run lm-eval and compare vs baseline
            pass

        # Save all results
        report = EvalReport(
            model_path=self.config.model_path,
            task_metrics=task_metrics,
            benchmark_scores=benchmark_scores,
            win_rate_vs_baseline=win_rate,
            catastrophic_forgetting_flags=forgetting_flags,
            passed=len(failure_reasons) == 0,
            failure_reasons=failure_reasons,
        )

        with open(self.output_dir / "eval_report.json", "w") as f:
            json.dump(report.__dict__, f, indent=2)

        return report


if __name__ == "__main__":
    config = EvalConfig(
        model_path="./outputs/llama3-8b-finetuned/merged",
        baseline_model_path="meta-llama/Meta-Llama-3-8B-Instruct",
        eval_set_path="eval/domain_eval_set.jsonl",
        output_dir="eval/results/run-001",
        run_llm_judge=True,
        openai_api_key="your-key-here",
    )

    pipeline_runner = EvaluationPipeline(config)
    report = pipeline_runner.run()

    print(f"Evaluation {'PASSED' if report.passed else 'FAILED'}")
    if report.win_rate_vs_baseline is not None:
        print(f"Win rate vs baseline: {report.win_rate_vs_baseline:.1%}")
    if report.failure_reasons:
        for reason in report.failure_reasons:
            print(f"  FAIL: {reason}")



Production Engineering Notes

Evaluation Cost Management

LLM-as-judge with GPT-4o costs approximately $0.005 per evaluation pair (input + output tokens at current pricing). For 500 evaluation examples with position-swapped pairwise comparison, that is 1,000 API calls and roughly $5-15 per full evaluation run. This is affordable for ad-hoc evaluation but adds up in CI pipelines running multiple times per day.

Cost reduction strategies:

Sample your eval set. Running judge evaluation on 100 examples from a 500-example eval set captures 90%+ of the signal at 20% of the cost. Run full evaluation monthly, sampled evaluation in CI.

Use a cheaper judge for filtering. Run GPT-4o-mini (10x cheaper) for initial evaluation. Only run GPT-4o on examples where the mini judge is uncertain (close scores).

Cache responses. If the same query appears across multiple evaluation runs, cache the response for deterministic models (greedy decoding) and reuse it. A 50-run CI pipeline with no caching re-generates responses that rarely change.
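
A minimal caching sketch, assuming deterministic (greedy) generation: responses are keyed by a hash of the model path and query and persisted to a JSONL file, so repeated CI runs only regenerate what actually changed. The cache path and the generate_fn callable are placeholders.

import hashlib
import json
from pathlib import Path
from typing import Callable

CACHE_PATH = Path("eval/response_cache.jsonl")  # placeholder location

def load_cache() -> dict[str, str]:
    """Load previously cached responses into memory."""
    cache = {}
    if CACHE_PATH.exists():
        for line in CACHE_PATH.read_text().splitlines():
            record = json.loads(line)
            cache[record["key"]] = record["response"]
    return cache

def cached_generate(model_path: str, query: str, generate_fn: Callable[[str], str], cache: dict) -> str:
    """Return a cached response if present; otherwise generate, persist, and return it."""
    key = hashlib.sha256(f"{model_path}::{query}".encode()).hexdigest()
    if key in cache:
        return cache[key]
    response = generate_fn(query)  # e.g. greedy decoding with the loaded model
    cache[key] = response
    with CACHE_PATH.open("a") as f:
        f.write(json.dumps({"key": key, "response": response}) + "\n")
    return response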

Evaluation Reproducibility

Evaluation results must be reproducible to be meaningful. Ensure:

  • Generation uses do_sample=False (greedy decoding) for all evaluation. Sampling introduces variance that makes runs incomparable.
  • LLM judge calls use temperature=0.
  • Random seeds are set for any evaluation that involves sampling from the eval set.
  • Model and tokenizer versions are logged alongside results.
  • lm-eval-harness produces reproducible results by default.

Tracking Evaluation History

Store evaluation results with enough metadata to diagnose trends:

eval_record = {
    "timestamp": "2026-04-26T10:00:00Z",
    "model_path": "outputs/llama3-8b-finetuned-v3",
    "training_config_hash": "abc123",  # hash of axolotl YAML
    "eval_set_version": "v2.1",
    "num_eval_examples": 500,
    "metrics": {
        "win_rate_vs_baseline": 0.72,
        "domain_task_f1": 0.89,
        "arc_challenge": 0.581,
        "mmlu": 0.643,
        "gsm8k": 0.712,
    },
    "regression_flags": [],
}

Store these records in a database or append to a JSONL file. Over multiple training runs, you can plot metric trends and catch slow degradation that would be invisible in any single evaluation.
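
A minimal sketch of the append-and-check pattern (the history path, the 5-run window, and the 0.03 drop threshold are illustrative choices):

import json
from pathlib import Path

HISTORY_PATH = Path("eval/history.jsonl")  # placeholder location

def append_eval_record(record: dict) -> None:
    """Append one evaluation record to the history file."""
    with HISTORY_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

def flag_slow_degradation(metric: str, window: int = 5, max_drop: float = 0.03) -> bool:
    """True if `metric` dropped by more than max_drop over the last `window` recorded runs."""
    if not HISTORY_PATH.exists():
        return False
    records = [json.loads(line) for line in HISTORY_PATH.read_text().splitlines()]
    values = [r["metrics"][metric] for r in records if metric in r.get("metrics", {})]
    if len(values) < window:
        return False
    return (values[-window] - values[-1]) > max_drop

# append_eval_record(eval_record)
# if flag_slow_degradation("win_rate_vs_baseline"):
#     print("Investigate: slow regression in win rate across recent runs")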


Common Mistakes

:::danger Evaluating Only on Examples Similar to Training Data The most dangerous evaluation mistake is building an eval set by randomly sampling from the same data source as your training data. If you train on 90% of dataset A and evaluate on the other 10%, your metrics tell you nothing about whether the model generalizes. A fine-tuned model almost always looks excellent on held-out samples from its training distribution and can be simultaneously useless on the actual production distribution.

The eval set must come from a different source than training data - real user queries, manually written examples, or a domain-expert curated set that was assembled independently of your training data collection. :::

:::danger Using LLM-as-Judge Without Position Bias Control When asking a judge model to compare response A vs response B, the judge systematically favors whichever response appears first. Studies have shown this position bias can be as large as 10-15 percentage points in win rate estimates. If you always put your fine-tuned model as response A and the baseline as response B, your win rate estimate is inflated.

Always run pairwise comparisons in both orders (model-first, then baseline-first) and average the results. If the judge says "A wins" when model is A and "A wins" when model is B, that is not a win - it is just position bias. :::

:::warning Evaluating with Sampling Instead of Greedy Decoding Using temperature > 0 or top_p < 1.0 for evaluation generation introduces variance between runs. Two evaluation runs on the same model with the same eval set can produce different scores simply from sampling randomness. This makes it impossible to determine whether a metric change reflects a real difference or just noise.

Always use greedy decoding (do_sample=False) for evaluation. If you need to assess output diversity, run multiple greedy generations with different prompts, not multiple stochastic generations with the same prompt. :::

:::warning Benchmark Contamination - High Scores That Mean Nothing If your training data contains text from the internet, it almost certainly contains some overlap with standard evaluation benchmarks. MMLU questions appear on study websites. ARC questions appear in education forums. A model that "achieves 75% on MMLU" after fine-tuning on web-scraped data may have simply memorized test answers rather than learned to reason.

Indicators of contamination: benchmark scores that are suspiciously high compared to model size, large improvements on multiple benchmarks simultaneously from a small fine-tuning dataset, or benchmark improvements that exceed what the training data could plausibly teach. Use contamination detection tools (overlap analysis between training data and eval sets) and rely more heavily on held-out domain evaluations that you built yourself. :::

:::warning Ignoring Latency and Format in Evaluation A model can score well on quality metrics while being unusable in production because responses are 3x longer than acceptable (high latency, poor UX) or because the output format is inconsistent (JSON vs plain text, inconsistent field naming). Quality evaluation without format and latency evaluation gives an incomplete picture.

Include in your evaluation pipeline: average response length, percentage of responses that match expected format (schema validation for structured outputs), and generation time under expected concurrency. A model with 0.80 win rate and 95% format compliance is often more valuable than one with 0.85 win rate and 70% format compliance. :::
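
A minimal sketch of those format and length checks, assuming structured JSON outputs; the required field names and the 200-word limit are placeholders to adapt to your schema:

import json

REQUIRED_FIELDS = {"answer", "sources"}  # placeholder schema fields

def format_and_length_metrics(responses: list[str], max_words: int = 200) -> dict:
    """Share of responses that parse as JSON with the required fields, plus length stats."""
    valid = 0
    lengths = []
    for resp in responses:
        lengths.append(len(resp.split()))
        try:
            parsed = json.loads(resp)
            if isinstance(parsed, dict) and REQUIRED_FIELDS.issubset(parsed):
                valid += 1
        except json.JSONDecodeError:
            pass
    n = len(responses)
    return {
        "format_compliance": valid / n if n else 0.0,
        "avg_length_words": sum(lengths) / n if n else 0.0,
        "pct_over_length": sum(l > max_words for l in lengths) / n if n else 0.0,
    }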


Interview Questions and Answers

Q1: Your fine-tuned model scores higher on your eval set than the baseline, but users report worse experiences in production. What are the most likely causes?

This is the classic evaluation distribution mismatch problem. The most likely causes in order of frequency:

First, the eval set is drawn from the same distribution as training data. The model memorized training distribution patterns, which happen to match the eval set, but diverges on real user queries that were never in training. Solution: rebuild the eval set from production query logs sampled after training.

Second, the eval set is too small or unrepresentative of the full query space. 200 evaluation examples might miss entire categories of user queries. The model could be excellent on the 200 examples and terrible on the 80% of query types not covered. Solution: cluster production queries and ensure eval set coverage across all major clusters.

Third, format or length changes that affect downstream systems. The model might be more accurate but produces longer responses that time out in production, or slightly different JSON field names that break parsers. Solution: add format compliance and latency metrics to evaluation pipeline.

Fourth, the baseline comparison was wrong. If the "baseline" in evaluation was the uninstructed base model rather than the best-prompted production model, the comparison is misleading. Solution: always compare against a well-engineered prompted baseline that represents actual production.

Q2: How does LLM-as-judge evaluation compare to human evaluation in terms of reliability? When is each appropriate?

LLM-as-judge evaluation correlates with human judgment at roughly 0.7-0.85 Spearman correlation for overall quality assessments, according to studies like MT-Bench (Zheng et al., 2023). The correlation is higher for dimensions like format adherence and lower for factual accuracy (where the judge model may share factual errors with the model being evaluated).

Key failure modes of LLM-as-judge: length bias (longer responses score higher independent of quality), self-preference bias (a GPT-4 judge tends to prefer responses that sound like GPT-4), and inability to verify complex factual claims.

LLM-as-judge is appropriate for: continuous evaluation in CI pipelines, large-scale evaluation (thousands of examples), comparative evaluation between similar-quality models, and format and instruction-following assessment. It is inappropriate for: high-stakes decisions about model safety, evaluation requiring domain expertise (medical, legal), or when you suspect the judge model and the evaluated model share systematic biases.

Human evaluation is appropriate for: final go/no-go decisions before major model releases, calibrating your LLM-as-judge setup (run both on a small sample and verify correlation), domain expert assessment for specialized applications, and evaluating genuinely novel capabilities where you do not trust existing judge models.

Q3: What is catastrophic forgetting in the context of LLM fine-tuning, and how do you detect and mitigate it?

Catastrophic forgetting occurs when fine-tuning on a narrow domain overwrites general capabilities the base model had before training. The gradient updates for domain-specific examples shift model weights in directions that improve domain performance but hurt performance on tasks outside the domain.

Detection: run lm-eval-harness benchmarks on the base model and the fine-tuned model, then compute the delta. A regression of more than 5% on any benchmark task is a warning sign. Pay particular attention to tasks far from your fine-tuning domain - a customer support fine-tuned model should still do math and coding at near-baseline levels.

Mitigation strategies: (1) Lower learning rate - reduces how far weights shift from initialization. (2) Fewer training epochs - avoid overfitting to the training distribution. (3) Replay data - mix a small percentage of general instruction-following data (e.g., ShareGPT) into your training data to maintain general capabilities. Typically 5-15% replay data is sufficient. (4) Lower LoRA rank - smaller rank = less expressive adapter = less capacity to forget general patterns. (5) NEFTune noise - adding noise to embedding layers during training (Jain et al., 2023) has been shown to improve generalization and reduce forgetting as a side effect.

Q4: You need to evaluate a fine-tuned model for a medical Q&A application. What evaluation approach would you use?

Medical Q&A requires a multi-tier evaluation strategy due to the high stakes and specialized domain knowledge required.

First, factual accuracy evaluated by domain experts. A sample of 100-200 responses should be reviewed by medical professionals who can assess clinical accuracy. This is non-negotiable for medical applications - LLM-as-judge evaluation by a general model will not catch subtle medical errors.

Second, safety evaluation. A dedicated pass evaluating responses for: dangerous advice, failure to recommend professional consultation, drug interaction claims, dosage information. This should be done by both human reviewers and a safety-specific judge prompt tuned for medical contexts.

Third, a held-out test set built from MedQA, MedMCQA, or similar validated medical Q&A benchmarks not present in training data, evaluated for exact match accuracy on multiple-choice questions. This provides a quantitative baseline.

Fourth, a production shadow evaluation: deploy the model in shadow mode alongside the production system, log all responses for the first two weeks, and have medical reviewers sample 50 responses per week for quality assessment.

LLM-as-judge alone is insufficient for medical applications because the judge model (even GPT-4) can share factual errors about medical information with the model being evaluated.

Q5: How would you design an evaluation pipeline that scales to evaluating 20 different model checkpoints (from a hyperparameter sweep) efficiently?

The key constraint is that generating responses and running judge evaluation for 20 models would be 20x the cost of evaluating one model. Several strategies reduce this cost:

Funnel evaluation: first run a cheap filter (perplexity on validation set, or a 50-example sample with LLM-judge) on all 20 checkpoints. Identify the top 3-5 by cheap metric, then run full evaluation only on those. This reduces full evaluation runs from 20 to 3-5.

Response caching: all 20 checkpoints see the same evaluation queries. Cache the baseline model responses (used as the comparison in win rate) since they are the same for all 20 comparisons.

Shared infrastructure: batch response generation across checkpoints. Load each checkpoint once, generate responses for the full eval set, unload, and move to the next. Running all 20 checkpoint generations before starting any judge evaluations minimizes memory churn.

Cheap metrics for sweep ranking: train loss on validation set, ROUGE-L on examples with reference answers, and format compliance rate are all computable without LLM-judge API calls. Use these to rank the 20 checkpoints and reserve expensive LLM-judge evaluation for the top candidates.

Q6: What is the difference between win rate and average score in LLM-as-judge evaluation, and when should you use each?

Average score (pointwise evaluation) assigns an absolute numerical score (e.g., 1-10) to each response independently. Win rate (pairwise evaluation) compares two responses head-to-head and declares a winner.

Average scores are useful for tracking a single model's quality over time and detecting absolute degradation. If your model's average helpfulness score drops from 7.2 to 6.8 over successive training runs, that is a meaningful signal even without a comparison model.

Win rate is more reliable for comparing two specific models. It avoids the calibration problem - two different judge sessions might have different implicit scales for what a "7" means, but a pairwise comparison within a single session is more consistent. Win rate is also less susceptible to the tendency for LLMs to give most responses a score of 6-8 (compressing the useful range).

Use win rate when making a binary decision: "is the fine-tuned model better than the baseline?" Use average score when monitoring quality over time for a single model. For production monitoring, run both: average score for trend detection, win rate against a fixed reference checkpoint for comparative analysis.

Q7: Your evaluation shows a fine-tuned model wins 72% of pairwise comparisons against the base model on your eval set, but the business team says quality "feels the same" in their usage. How do you investigate?

This gap between measured win rate and perceived quality usually has one of four causes:

First, the eval set does not match the business team's actual query distribution. A 72% win rate on your eval set means the model is better on your eval queries. If the business team's queries are systematically different (longer, more complex, different topics), the win rate on their usage might be 50% or lower. Solution: build an eval set from the business team's actual query logs.

Second, the improvements are in dimensions the business team does not notice or care about, while regressions are in dimensions they do notice. Maybe the model is more concise (scored as a win by the judge) but the business team wanted more detailed responses. Solution: add rubric criteria to your judge prompt that specifically capture what the business team values.

Third, position bias inflation. If evaluation was not run in both orders (A-B and B-A), the 72% figure may be inflated by position bias. Re-run with bias-controlled evaluation to get a more accurate number.

Fourth, the business team is not a reliable qualitative evaluator for small differences. A 72% win rate corresponds to an improvement that is real but potentially too subtle to notice in casual usage. Run a structured A/B preference test with the business team on 20-30 example pairs with explicit criteria to get a calibrated qualitative signal.
