What is a fine-tuning pipeline?

End-to-end fine-tuning pipeline engineering - from data collection and curation to training, evaluation, and deployment. When to fine-tune vs RAG vs prompt engineering, and how to build the pipeline that makes it repeatable and production-safe.


Fine-Tuning Pipelines

The Fine-Tune That Went Wrong

The team at a B2B SaaS company had a problem that looked solvable. Their product used Claude to generate customer-facing emails - follow-up messages after support tickets, renewal reminders, upsell recommendations. The model was good but not great. The emails sounded generic. They lacked the company's voice. They occasionally hallucinated product names. The support tickets it referenced were sometimes wrong. Customer response rates were 12% below the human-written baseline.

Someone in the engineering team had a reasonable idea: fine-tune. They had 14,000 examples of emails that had been written by their best human agents - emails that had high open rates, high response rates, and zero complaints. They formatted the data into JSONL, uploaded it to their model provider's fine-tuning API, ran the job over a weekend, and deployed the fine-tuned model to production on Monday morning.

By Wednesday, the customer success team was filing tickets. The fine-tuned model was confidently generating emails that referenced products the customer had never purchased. It was hallucinating renewal dates that were months off. In one case, it congratulated a customer on upgrading to a premium tier they had actually cancelled. The emails sounded great - the tone was perfect, the voice was exactly right - but the factual accuracy had collapsed. The model had learned the style so well that it started generating content with the same confident authority, even when it had no factual basis for what it was writing.

They rolled back to the base model within 48 hours. The postmortem was uncomfortable. The fine-tuning data was the problem: the 14,000 examples contained the final emails but not the support ticket context those emails were grounded in. The model learned "write a confident, warm, specific email" but had no signal that the specifics needed to come from real data. It learned the style distribution without learning grounding. This is the failure mode that kills fine-tuning projects in production.

The lesson was not "don't fine-tune." The lesson was: fine-tuning is a pipeline problem, not a model problem. The data preparation, the training setup, the evaluation harness, and the deployment strategy all have to be engineered - not assumed. This lesson covers how to build that pipeline correctly, from the decision to fine-tune through to safe production deployment.

Why Fine-Tuning Exists (and When It Doesn't)

Before diving into how to fine-tune, you need to understand what fine-tuning actually does to a model - and why that matters for deciding whether to use it at all.

A pretrained LLM has learned a rich representation of language, facts, and reasoning from billions of tokens of text. It knows how to write. It knows many facts. It can follow instructions. What it does not know is your domain's specific vocabulary, your organization's preferred response format, your task's implicit constraints, or the edge cases that your users generate every day.

Fine-tuning adapts the model's weights to your domain. It adjusts the internal representations so that your task-specific patterns are more accessible to the model. Done correctly, it makes the model faster (fewer tokens in the prompt), cheaper (shorter prompts mean lower costs per call), and more consistent (the behavior is baked in, not prompted in).

But fine-tuning is not magic. It cannot give the model information it was not trained on. It cannot teach the model to reliably access real-time data. It cannot fix a fundamentally broken task definition. And it introduces new failure modes - overfitting, catastrophic forgetting, distribution shift - that do not exist with prompt engineering.

The question "should I fine-tune?" is one of the most misanswered questions in applied AI. Here is the honest answer.

When to Fine-Tune vs. Prompt Engineer

The decision is not about model quality in isolation. It is about cost, latency, consistency, and data availability - weighed against each other.

The Decision Matrix

Factor	Lean Toward Prompting	Lean Toward Fine-Tuning
Task examples available	Fewer than 500 labeled examples	1,000+ high-quality examples
Task stability	Requirements change frequently	Requirements are stable for 6+ months
Prompt length	Short prompts already work well	Prompts exceed 1,000 tokens
Latency requirement	5–10 seconds acceptable	Sub-second required
Cost per call	Low volume (less than 10K calls/day)	High volume (100K+ calls/day)
Consistency required	Some variation is acceptable	Identical formatting every time
Domain specificity	General-purpose task	Highly specialized domain
Reasoning required	Yes - complex multi-step reasoning	No - pattern matching is enough
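The prompt-length and cost rows can be made concrete with rough arithmetic. The sketch below estimates daily input-token spend before and after a long prompt is replaced by a short one on a fine-tuned model. Every number in it (prices, token counts, call volume) is a hypothetical placeholder, not a real provider rate.

```python
def daily_input_cost(
    calls_per_day: int, prompt_tokens: int, price_per_million: float
) -> float:
    """Input-token cost per day in dollars."""
    return calls_per_day * prompt_tokens * price_per_million / 1_000_000


# Hypothetical numbers for illustration only.
CALLS_PER_DAY = 100_000
BASE_PROMPT_TOKENS = 1_500   # long few-shot prompt on the base model
TUNED_PROMPT_TOKENS = 300    # short prompt once the behavior is trained in
BASE_PRICE = 1.00            # dollars per million input tokens (assumed)
TUNED_PRICE = 2.00           # assumed higher per-token price for the tuned model

base = daily_input_cost(CALLS_PER_DAY, BASE_PROMPT_TOKENS, BASE_PRICE)
tuned = daily_input_cost(CALLS_PER_DAY, TUNED_PROMPT_TOKENS, TUNED_PRICE)
print(f"base:  ${base:.2f}/day")
print(f"tuned: ${tuned:.2f}/day")
print(f"saved: ${base - tuned:.2f}/day")
```

The same calculation at a low call volume shows why the matrix leans toward prompting there: the saving shrinks in proportion to volume, while training and evaluation costs do not.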

The Real Test

Before committing to fine-tuning, answer these three questions honestly:

Have you maxed out prompt engineering? Few-shot examples, chain-of-thought, system prompt iteration, output format specification - if you have not tried all of these, you are not ready to fine-tune.
Do you have 1,000+ high-quality, consistent examples? Not 1,000 examples scraped from a log file. 1,000 examples that a human expert would say "yes, this is the right input-output pair for this task." If you do not, fine-tuning will bake in your noise.
Can you measure what "better" means? If you cannot define a quantitative evaluation that tells you whether the fine-tuned model is better than the base model on your task, you cannot safely deploy a fine-tune. You need the eval before you write a single line of training data.

warning

A fine-tune changes model behavior in ways that are difficult to predict, and unlike a prompt it cannot be patched in place: the only fixes are retraining or rolling back. Always run a comprehensive evaluation against the base model before deploying a fine-tune to production. Assume the fine-tune will break something until you prove otherwise.

Fine-Tuning Pipeline Stages

Fine-tuning is not a single step. It is a six-stage pipeline, and each stage can fail independently. The diagram below shows the full pipeline with the critical decision points.

Each stage has specific engineering requirements. The rest of this lesson covers each one in depth with working code.

Stage 1: Data Preparation

Data preparation is where fine-tuning projects succeed or fail. The model will learn exactly what you show it - including your mistakes, your inconsistencies, and your biases. There is no "it will figure out the pattern." The pattern you encode in training data is the pattern the model learns.

JSONL Format Requirements

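Most chat fine-tuning APIs accept JSONL in which each line of the file is one complete conversation. OpenAI and Mistral use the messages structure shown here; other providers use similar but not identical schemas, so check your provider's documentation before formatting data. The example is wrapped across several lines for readability, but in the file it must sit on a single line: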

{"messages": [
  {"role": "system", "content": "You are a support email assistant. Write concise, friendly emails grounded only in the provided ticket data."},
  {"role": "user", "content": "Ticket #4821: Customer Jane Smith reports she cannot access the dashboard after upgrading to Pro tier on 2024-01-15. Subscription ID: SUB-993421. Write a follow-up email."},
  {"role": "assistant", "content": "Subject: Following up on your dashboard access issue\n\nHi Jane,\n\nThank you for reaching out about the access issue you've experienced since upgrading to Pro on January 15th.\n\nI've reviewed your account (SUB-993421) and can see the upgrade processed successfully, but there's a permissions sync issue on our end that's preventing dashboard access. I've escalated this to our engineering team with high priority.\n\nYou should have full access restored within 2 hours. I'll send you a confirmation email the moment it's resolved.\n\nApologies for the disruption, and thank you for your patience.\n\nBest,\nSupport Team"}
]}
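In the file itself each example occupies exactly one physical line; the record above is wrapped only for readability. A minimal sketch of writing and re-reading such a file with the standard library follows. The helper names and the temporary file path are illustrative.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(examples: list[dict], path: Path) -> None:
    """Write one JSON object per line; json.dumps escapes newlines inside strings."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Parse every non-empty line; raises json.JSONDecodeError on a malformed line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


example = {
    "messages": [
        {"role": "system", "content": "You are a support email assistant."},
        {"role": "user", "content": "Ticket #4821: cannot access dashboard."},
        {"role": "assistant", "content": "Subject: Dashboard access\n\nHi Jane,\n..."},
    ]
}

path = Path(tempfile.mkdtemp()) / "train.jsonl"
write_jsonl([example, example], path)

lines = path.read_text(encoding="utf-8").splitlines()
assert len(lines) == 2  # two examples, two physical lines
assert read_jsonl(path) == [example, example]
```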

What Makes a High-Quality Training Example

Every training example must satisfy all of the following:

Grounded: the assistant turn contains only information present in the user turn or system prompt - no invented details
Consistent format: the response format matches exactly across all examples - same greeting style, same closing, same structure
Appropriate length: not artificially padded to seem thorough, not cut short to seem efficient - the right length for the content
No contradictions: the example should not teach the model two different responses to similar inputs
Representative: the input distribution should match what the model will see in production
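The first criterion, grounding, can be partially automated. The sketch below is a rough heuristic, not a complete check: it extracts identifier-like tokens (ticket numbers, subscription IDs, ISO dates) from the last assistant turn and flags any that never appear in the system or user turns. The regular expression and the name `ungrounded_identifiers` are assumptions for illustration. Names, paraphrased dates, and free-text claims need a stronger check, such as human review or a second model.

```python
import re

# Identifier-like tokens such as "SUB-993421", "#4821", "2024-01-15" (assumed pattern).
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d+\b|#\d+|\b\d{4}-\d{2}-\d{2}\b")


def ungrounded_identifiers(example: dict) -> set[str]:
    """Return identifiers in the last assistant turn that are absent from the context."""
    messages = example["messages"]
    context = " ".join(
        m["content"] for m in messages if m["role"] in ("system", "user")
    )
    assistant_turns = [m["content"] for m in messages if m["role"] == "assistant"]
    if not assistant_turns:
        return set()
    found = set(ID_PATTERN.findall(assistant_turns[-1]))
    allowed = set(ID_PATTERN.findall(context))
    return found - allowed


grounded = {"messages": [
    {"role": "user", "content": "Ticket #4821, subscription SUB-993421, upgraded 2024-01-15."},
    {"role": "assistant", "content": "I've reviewed your account (SUB-993421)."},
]}
invented = {"messages": [
    {"role": "user", "content": "Ticket #4821, subscription SUB-993421."},
    {"role": "assistant", "content": "Your plan SUB-111111 renews on 2025-03-01."},
]}

print(ungrounded_identifiers(grounded))
print(ungrounded_identifiers(invented))
```

An example with a non-empty result is a candidate for rejection or manual review. This is exactly the class of error in the opening story, where the emails contained specifics that the input never supplied.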

Quality Filtering Code

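This is the most important code you will write for a fine-tuning project. It automates the checks that can be automated (schema validity, exact duplicates, response length, marker phrases, and format) and writes the filtered train and validation files plus summary statistics. Grounding, contradictions, and representativeness still need separate checks or manual review.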

import json
import hashlib
import re
from dataclasses import dataclass, field
from typing import Optional
from pathlib import Path
from collections import defaultdict


@dataclass
class QualityScore:
    """Structured quality assessment for a single training example."""
    example_id: str
    total_score: float
    has_system_prompt: bool
    response_length_ok: bool
    no_hallucination_markers: bool
    consistent_format: bool
    dedup_hash: str
    issues: list[str] = field(default_factory=list)

    @property
    def passes(self) -> bool:
        return self.total_score >= 0.75


class FineTuningDataPipeline:
    """
    End-to-end pipeline for preparing fine-tuning data.

    Handles:
    - Loading raw examples from JSONL files or dicts
    - Deduplication (exact duplicates only)
    - Quality scoring against configurable criteria
    - Format validation against target API schema
    - Seeded random train/validation split
    - Output as cleaned JSONL files

    Usage:
        pipeline = FineTuningDataPipeline(
            min_response_tokens=50,
            max_response_tokens=1000,
            expected_format_pattern=r"^Subject:",
        )
        train, val = pipeline.run(
            input_paths=["data/raw_examples.jsonl"],
            output_dir="data/prepared/",
        )
    """

    def __init__(
        self,
        min_response_tokens: int = 50,
        max_response_tokens: int = 1000,
        expected_format_pattern: Optional[str] = None,
        hallucination_markers: Optional[list[str]] = None,
        val_fraction: float = 0.1,
        min_quality_score: float = 0.75,
    ):
        self.min_response_tokens = min_response_tokens
        self.max_response_tokens = max_response_tokens
        self.expected_format_pattern = (
            re.compile(expected_format_pattern)
            if expected_format_pattern
            else None
        )
        self.hallucination_markers = hallucination_markers or [
            "I don't have access to",
            "I cannot verify",
            "as an AI",
            "I don't know your",
            "I'm not sure which",
        ]
        self.val_fraction = val_fraction
        self.min_quality_score = min_quality_score

        # Stats tracking
        self._stats: dict[str, int] = defaultdict(int)

    def run(
        self,
        input_paths: list[str],
        output_dir: str,
    ) -> tuple[list[dict], list[dict]]:
        """
        Full pipeline run. Returns (train_examples, val_examples).
        Also writes train.jsonl and val.jsonl to output_dir.
        """
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        # 1. Load all raw examples
        raw_examples = self._load_examples(input_paths)
        self._stats["total_loaded"] = len(raw_examples)
        print(f"Loaded {len(raw_examples)} raw examples")

        # 2. Validate schema
        schema_valid = [e for e in raw_examples if self._validate_schema(e)]
        self._stats["schema_invalid"] = len(raw_examples) - len(schema_valid)
        print(f"Schema valid: {len(schema_valid)} ({self._stats['schema_invalid']} rejected)")

        # 3. Deduplicate
        deduped = self._deduplicate(schema_valid)
        self._stats["duplicates_removed"] = len(schema_valid) - len(deduped)
        print(f"After dedup: {len(deduped)} ({self._stats['duplicates_removed']} removed)")

        # 4. Score quality
        scored = [self._score_example(e, i) for i, e in enumerate(deduped)]
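        # Use the configured threshold rather than the default baked into QualityScore.passes
        passing = [
            e for e, s in scored
            if s.total_score >= self.min_quality_score
        ]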
        self._stats["quality_failed"] = len(scored) - len(passing)
        print(f"Quality passing: {len(passing)} ({self._stats['quality_failed']} rejected)")

        if len(passing) < 100:
            raise ValueError(
                f"Only {len(passing)} examples passed quality filters. "
                "Fine-tuning with fewer than 100 examples is not recommended. "
                "Review your quality criteria or collect more data."
            )

        # 5. Split train/val
        train, val = self._split(passing)
        self._stats["train_count"] = len(train)
        self._stats["val_count"] = len(val)
        print(f"Split: {len(train)} train / {len(val)} val")

        # 6. Write output
        self._write_jsonl(train, output_path / "train.jsonl")
        self._write_jsonl(val, output_path / "val.jsonl")
        self._write_stats(output_path / "pipeline_stats.json")

        return train, val

    def _load_examples(self, paths: list[str]) -> list[dict]:
        examples = []
        for path in paths:
            with open(path, "r", encoding="utf-8") as f:
                for line_num, line in enumerate(f, 1):
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        examples.append(json.loads(line))
                    except json.JSONDecodeError as e:
                        print(f"Warning: JSON parse error in {path} line {line_num}: {e}")
        return examples

    def _validate_schema(self, example: dict) -> bool:
        """Validate the example conforms to the messages API format."""
        if "messages" not in example:
            return False
        messages = example["messages"]
        if not isinstance(messages, list) or len(messages) < 2:
            return False
        # Must have at least one user and one assistant turn
        roles = {m.get("role") for m in messages}
        if "user" not in roles or "assistant" not in roles:
            return False
        # All messages must have role and content
        for msg in messages:
            if "role" not in msg or "content" not in msg:
                return False
            if not isinstance(msg["content"], str) or not msg["content"].strip():
                return False
        return True

    def _deduplicate(self, examples: list[dict]) -> list[dict]:
        """Remove exact duplicates based on full example hash."""
        seen_hashes: set[str] = set()
        unique = []
        for example in examples:
            h = self._hash_example(example)
            if h not in seen_hashes:
                seen_hashes.add(h)
                unique.append(example)
        return unique

    def _hash_example(self, example: dict) -> str:
        """Deterministic hash of an example for dedup."""
        canonical = json.dumps(example, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def _score_example(
        self, example: dict, idx: int
    ) -> tuple[dict, QualityScore]:
        """Score a single example against quality criteria."""
        issues = []
        score_components = []

        messages = example["messages"]
        assistant_messages = [m for m in messages if m["role"] == "assistant"]
        user_messages = [m for m in messages if m["role"] == "user"]
        system_messages = [m for m in messages if m["role"] == "system"]

        # 1. System prompt present
        has_system = len(system_messages) > 0
        score_components.append(0.2 if has_system else 0.0)
        if not has_system:
            issues.append("No system prompt - model may not learn task framing")

        # 2. Response length check
        last_response = assistant_messages[-1]["content"]
        # Whitespace word count is a cheap stand-in for a real tokenizer count
        approx_tokens = len(last_response.split())
        length_ok = self.min_response_tokens <= approx_tokens <= self.max_response_tokens
        score_components.append(0.3 if length_ok else 0.0)
        if not length_ok:
            issues.append(
                f"Response length {approx_tokens} tokens outside "
                f"[{self.min_response_tokens}, {self.max_response_tokens}]"
            )

        # 3. Hallucination markers
        no_hallucination = not any(
            marker.lower() in last_response.lower()
            for marker in self.hallucination_markers
        )
        score_components.append(0.3 if no_hallucination else 0.0)
        if not no_hallucination:
            issues.append("Response contains hallucination marker phrases")

        # 4. Format consistency
        format_ok = True
        if self.expected_format_pattern:
            format_ok = bool(self.expected_format_pattern.search(last_response))
            if not format_ok:
                issues.append("Response does not match expected format pattern")
        score_components.append(0.2 if format_ok else 0.0)

        total = sum(score_components)
        quality = QualityScore(
            example_id=f"example_{idx}",
            total_score=total,
            has_system_prompt=has_system,
            response_length_ok=length_ok,
            no_hallucination_markers=no_hallucination,
            consistent_format=format_ok,
            dedup_hash=self._hash_example(example),
            issues=issues,
        )
        return example, quality

    def _split(
        self, examples: list[dict]
    ) -> tuple[list[dict], list[dict]]:
        """Deterministic train/val split."""
        import random
        rng = random.Random(42)
        shuffled = examples.copy()
        rng.shuffle(shuffled)
        n_val = max(1, int(len(shuffled) * self.val_fraction))
        return shuffled[n_val:], shuffled[:n_val]

    def _write_jsonl(self, examples: list[dict], path: Path) -> None:
        with open(path, "w", encoding="utf-8") as f:
            for example in examples:
                f.write(json.dumps(example, ensure_ascii=False) + "\n")
        print(f"Wrote {len(examples)} examples to {path}")

    def _write_stats(self, path: Path) -> None:
        with open(path, "w") as f:
            json.dump(dict(self._stats), f, indent=2)

tip

Always inspect the examples that fail quality filtering manually before adjusting the thresholds. If 40% of your examples are failing the response length check, the threshold may be wrong - or your data collection process may be including truncated outputs. Both are worth knowing.
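One way to make that inspection systematic is to tally rejection reasons before reading individual examples. The sketch below assumes each rejected example carries a list of issue strings like the ones `_score_example` produces. The `summarize_issues` helper and its digit-masking step are illustrative additions, not part of the pipeline above.

```python
import re
from collections import Counter


def summarize_issues(issues_per_example: list[list[str]]) -> Counter:
    """Count rejection reasons, masking numbers so similar messages group together."""
    counts: Counter = Counter()
    for issues in issues_per_example:
        for issue in issues:
            counts[re.sub(r"\d+", "N", issue)] += 1
    return counts


rejected = [
    ["Response length 12 tokens outside [50, 1000]"],
    ["Response length 31 tokens outside [50, 1000]",
     "Response does not match expected format pattern"],
    ["Response contains hallucination marker phrases"],
]

for reason, n in summarize_issues(rejected).most_common():
    print(f"{n:>4}  {reason}")
```

If one reason dominates the tally, read a sample of those examples first: that is usually where either the threshold or the data collection process is wrong.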

Stage 2: Baseline Evaluation

You must measure the base model before you fine-tune. This is not optional. Without a baseline, you cannot know whether the fine-tune helped or hurt. You cannot make the rollback decision on evidence. You are flying blind.

The baseline evaluation must use:

The same inputs your production system will see
The same metrics you care about in production
A held-out evaluation set that is not used in training
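The third requirement is easy to violate when the eval set is sampled from the same logs as the training data. A minimal leakage check is sketched below, assuming both sets use the messages format shown earlier. Hashing only the non-assistant turns, so that the same input with a different reference answer still counts as a leak, is a design choice of this sketch; the function names are illustrative.

```python
import hashlib
import json


def input_fingerprint(example: dict) -> str:
    """Hash the system and user turns, ignoring assistant outputs, case, and spacing."""
    inputs = [
        (m["role"], " ".join(m["content"].lower().split()))
        for m in example["messages"]
        if m["role"] != "assistant"
    ]
    canonical = json.dumps(inputs, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_leaks(train: list[dict], eval_set: list[dict]) -> list[int]:
    """Return indices of eval examples whose inputs also appear in the training set."""
    train_prints = {input_fingerprint(e) for e in train}
    return [i for i, e in enumerate(eval_set) if input_fingerprint(e) in train_prints]


train = [{"messages": [
    {"role": "user", "content": "Ticket #1: reset password."},
    {"role": "assistant", "content": "Subject: Password reset"},
]}]
eval_set = [
    {"messages": [
        {"role": "user", "content": "ticket #1:  Reset password."},
        {"role": "assistant", "content": "Subject: A different reference"},
    ]},
    {"messages": [
        {"role": "user", "content": "Ticket #2: billing question."},
        {"role": "assistant", "content": "Subject: Billing"},
    ]},
]

print(find_leaks(train, eval_set))
```

Any index returned should be dropped from the eval set (or from training) before the baseline is measured.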

Here is how to build a baseline evaluation using the Anthropic SDK:

import anthropic
import json
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalResult:
    example_id: str
    model: str
    input_messages: list[dict]
    expected_output: str
    actual_output: str
    latency_ms: float
    metric_scores: dict[str, float]

    @property
    def mean_score(self) -> float:
        if not self.metric_scores:
            return 0.0
        return sum(self.metric_scores.values()) / len(self.metric_scores)


class ModelEvaluator:
    """
    Evaluates a model against a held-out eval set.
    Supports multiple metrics and comparison between models.

    Designed for comparing a fine-tuned model against a base model.
    """

    def __init__(
        self,
        client: anthropic.Anthropic,
        metrics: dict[str, Callable[[str, str], float]],
        max_tokens: int = 1024,
        temperature: float = 0.0,
    ):
        self.client = client
        self.metrics = metrics
        self.max_tokens = max_tokens
        self.temperature = temperature

    def evaluate_model(
        self,
        model_id: str,
        eval_examples: list[dict],
        system_prompt: Optional[str] = None,
    ) -> list[EvalResult]:
        """Run evaluation on a list of examples and return results."""
        results = []

        for i, example in enumerate(eval_examples):
            messages = example["messages"]
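            # The final assistant turn is the reference; everything before it is input.
            # Slicing avoids dropping an earlier assistant turn that happens to be
            # identical to the last one.
            if messages[-1]["role"] != "assistant":
                print(f"Skipping example {i}: last message is not an assistant turn")
                continue
            input_messages = messages[:-1]
            expected = messages[-1]["content"]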

            # Build the message list for inference
            inference_messages = [
                m for m in input_messages if m["role"] != "system"
            ]
            system = system_prompt or next(
                (m["content"] for m in input_messages if m["role"] == "system"),
                None
            )

            # Call the model
            start = time.perf_counter()
            try:
                response = self.client.messages.create(
                    model=model_id,
                    max_tokens=self.max_tokens,
                    temperature=self.temperature,
                    system=system or anthropic.NOT_GIVEN,
                    messages=inference_messages,
                )
                actual = response.content[0].text
            except Exception as e:
                print(f"Error on example {i}: {e}")
                actual = ""
            latency_ms = (time.perf_counter() - start) * 1000

            # Score against each metric
            scores = {
                name: metric_fn(expected, actual)
                for name, metric_fn in self.metrics.items()
            }

            results.append(EvalResult(
                example_id=f"eval_{i}",
                model=model_id,
                input_messages=input_messages,
                expected_output=expected,
                actual_output=actual,
                latency_ms=latency_ms,
                metric_scores=scores,
            ))

            if (i + 1) % 10 == 0:
                print(f"Evaluated {i + 1}/{len(eval_examples)} examples")

        return results

    def compare_models(
        self,
        base_model: str,
        fine_tuned_model: str,
        eval_examples: list[dict],
        system_prompt: Optional[str] = None,
    ) -> dict:
        """
        Head-to-head comparison between base and fine-tuned model.
        Returns a summary dict with per-metric deltas.
        """
        print(f"Evaluating base model: {base_model}")
        base_results = self.evaluate_model(base_model, eval_examples, system_prompt)

        print(f"Evaluating fine-tuned model: {fine_tuned_model}")
        ft_results = self.evaluate_model(fine_tuned_model, eval_examples, system_prompt)

        # Aggregate
        def aggregate(results: list[EvalResult]) -> dict:
            if not results:
                return {}
            metric_names = list(results[0].metric_scores.keys())
            return {
                "mean_overall": sum(r.mean_score for r in results) / len(results),
                "mean_latency_ms": sum(r.latency_ms for r in results) / len(results),
                **{
                    f"mean_{m}": sum(r.metric_scores[m] for r in results) / len(results)
                    for m in metric_names
                },
            }

        base_agg = aggregate(base_results)
        ft_agg = aggregate(ft_results)

        comparison = {
            "base_model": base_model,
            "fine_tuned_model": fine_tuned_model,
            "n_examples": len(eval_examples),
            "base": base_agg,
            "fine_tuned": ft_agg,
            "delta": {
                k: ft_agg.get(k, 0) - base_agg.get(k, 0)
                for k in base_agg
            },
            # .get() keeps this from raising KeyError when the eval set is empty
            "fine_tune_wins": (
                ft_agg.get("mean_overall", 0.0) > base_agg.get("mean_overall", 0.0)
            ),
        }

        self._print_comparison(comparison)
        return comparison

    def _print_comparison(self, comparison: dict) -> None:
        print("\n" + "=" * 60)
        print("MODEL EVALUATION COMPARISON")
        print("=" * 60)
        print(f"Base:        {comparison['base_model']}")
        print(f"Fine-tuned:  {comparison['fine_tuned_model']}")
        print(f"Examples:    {comparison['n_examples']}")
        print("-" * 60)
        for key in comparison["base"]:
            base_val = comparison["base"][key]
            ft_val = comparison["fine_tuned"][key]
            delta = comparison["delta"][key]
            direction = "+" if delta >= 0 else ""
            print(f"{key:<25} base={base_val:.3f}  ft={ft_val:.3f}  delta={direction}{delta:.3f}")
        print("-" * 60)
        verdict = "PASS - fine-tune wins" if comparison["fine_tune_wins"] else "FAIL - base model wins"
        print(f"Verdict: {verdict}")
        print("=" * 60 + "\n")
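The `fine_tune_wins` verdict compares two means, which on a small eval set can flip on a handful of examples. As a hedged sketch, a paired bootstrap over per-example score differences gives a rough confidence interval for the delta. The function name, the roughly 95% level, and the resample count are assumptions of this sketch, and it expects the two score lists to be aligned by example (for instance, `mean_score` values from two `evaluate_model` runs over the same eval set).

```python
import random


def paired_bootstrap_ci(
    base_scores: list[float],
    ft_scores: list[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean_delta, ci_low, ci_high) for ft - base at roughly the 95% level."""
    if len(base_scores) != len(ft_scores) or not base_scores:
        raise ValueError("Score lists must be non-empty and aligned by example")
    diffs = [f - b for b, f in zip(base_scores, ft_scores)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample the per-example differences with replacement and record each mean
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, low, high


base_scores = [0.60, 0.55, 0.70, 0.65, 0.50, 0.62, 0.58, 0.66]
ft_scores = [0.72, 0.60, 0.71, 0.80, 0.64, 0.70, 0.69, 0.75]
delta, low, high = paired_bootstrap_ci(base_scores, ft_scores)
print(f"delta={delta:.3f}  interval=[{low:.3f}, {high:.3f}]")
```

If the interval includes zero, the eval set does not support a clear winner, and the safer reading is "no demonstrated improvement" rather than a pass.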

Defining Your Metrics

The metrics you choose determine whether your evaluation is honest. Generic metrics like BLEU score are almost always wrong for LLM evaluation. You need task-specific metrics.

For a customer email generation task, useful metrics include:

Format compliance rate: does the response start with "Subject:" - binary, easy to compute
Factual consistency score: does the response avoid referencing entities not in the input - requires a secondary LLM call
Length ratio: ratio of output length to reference length - values far from 1.0 suggest pathological behavior
Tone consistency: a classifier that distinguishes on-brand from off-brand language

# Example metric functions for email generation evaluation
import re
from difflib import SequenceMatcher


def format_compliance_metric(expected: str, actual: str) -> float:
    """Binary: does the response start with 'Subject:'?"""
    return 1.0 if actual.strip().startswith("Subject:") else 0.0


def length_ratio_metric(expected: str, actual: str) -> float:
    """
    Score how close response length is to reference.
    1.0 = same length, penalizes responses that are too long or short.
    """
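    expected_words = len(expected.split())
    actual_words = len(actual.split())
    # Guard on word counts so whitespace-only strings cannot cause a division by zero
    if expected_words == 0 or actual_words == 0:
        return 0.0
    ratio = actual_words / expected_words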
    # Score: 1.0 at ratio=1.0, decays toward 0 as ratio diverges
    return max(0.0, 1.0 - abs(1.0 - ratio))


def reference_overlap_metric(expected: str, actual: str) -> float:
    """
    Sequence overlap with reference. Not a replacement for semantic eval
    but useful as a sanity check for format-heavy tasks.
    """
    if not expected or not actual:
        return 0.0
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()


def no_hallucination_marker_metric(expected: str, actual: str) -> float:
    """
    Penalize responses containing known hallucination signal phrases.
    Returns 0.0 if any marker is found, 1.0 otherwise.
    """
    markers = [
        "as an ai",
        "i don't have access",
        "i cannot verify",
        "i'm not sure",
        "based on what you've told me",
    ]
    actual_lower = actual.lower()
    return 0.0 if any(m in actual_lower for m in markers) else 1.0

Stage 3: Training Run Management

Once your data is prepared and you have a baseline, you are ready to run the training job. For managed fine-tuning APIs (OpenAI, Anthropic, Google), most of the training configuration is handled by the provider - but you still need to manage hyperparameters, monitor the run, and checkpoint correctly.

Hyperparameters That Matter

Hyperparameter	What It Controls	Conservative Default	When to Increase
Epochs	How many times the model sees your data	1–2	When training loss is still decreasing at epoch end
Learning rate multiplier	Scales the base LR	1.0	When loss decreases too slowly
Batch size	Examples per gradient step	Provider default	Almost never - let provider choose
Warmup steps	Steps before full LR	Provider default	When training loss spikes early

danger

More epochs are not better. The most common fine-tuning mistake is running too many epochs. After 2–3 epochs on a 1,000-example dataset, most models start overfitting: the training loss keeps decreasing while the validation loss starts rising. Always track a validation loss curve so you can detect this early.
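That overfitting signal can be checked mechanically once you have per-step or per-epoch validation losses. A minimal sketch follows, assuming the losses are already collected into a list; `patience` and the function name `overfit_point` are illustrative choices.

```python
from typing import Optional


def overfit_point(val_losses: list[float], patience: int = 2) -> Optional[int]:
    """
    Return the index of the best validation loss if the loss then failed to
    improve for `patience` consecutive points; return None otherwise.
    """
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx
    return None


rising = [1.20, 0.90, 0.80, 0.85, 0.90]
still_improving = [1.00, 0.90, 0.80]

print(overfit_point(rising))
print(overfit_point(still_improving))
```

A non-None result suggests keeping the checkpoint at that index, where the provider exposes checkpoints, or rerunning with fewer epochs.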

Training Job Manager with W&B Logging

import os
import time
import json
from dataclasses import dataclass, asdict
from typing import Optional
import anthropic


@dataclass
class TrainingConfig:
    """Configuration for a fine-tuning job."""
    model: str
    train_file: str
    val_file: Optional[str]
    n_epochs: int = 2
    learning_rate_multiplier: float = 1.0
    batch_size: Optional[int] = None
    suffix: str = "prod-v1"
    wandb_project: Optional[str] = None
    wandb_run_name: Optional[str] = None


@dataclass
class TrainingJobResult:
    job_id: str
    fine_tuned_model_id: Optional[str]
    status: str
    trained_tokens: int
    training_loss: Optional[float]
    validation_loss: Optional[float]
    duration_seconds: float


class TrainingJobManager:
    """
    Manages a fine-tuning job from submission through completion.
    Handles polling, W&B logging, and result capture.

    Note: This example uses OpenAI's API since Anthropic's managed
    fine-tuning API is accessed through their partner program.
    Adapt the provider-specific calls for your chosen provider.
    """

    def __init__(
        self,
        config: TrainingConfig,
        poll_interval_seconds: int = 30,
    ):
        self.config = config
        self.poll_interval = poll_interval_seconds

        # Initialize W&B if configured
        self._wandb_run = None
        if config.wandb_project:
            self._init_wandb()

    def _init_wandb(self) -> None:
        try:
            import wandb
            self._wandb_run = wandb.init(
                project=self.config.wandb_project,
                name=self.config.wandb_run_name or f"ft-{int(time.time())}",
                config=asdict(self.config),
                tags=["fine-tuning", self.config.model],
            )
            print(f"W&B run initialized: {self._wandb_run.url}")
        except ImportError:
            print("Warning: wandb not installed. Install with: pip install wandb")
        except Exception as e:
            print(f"Warning: W&B initialization failed: {e}")

    def submit_job(self) -> str:
        """Submit the fine-tuning job and return the job ID."""
        import openai
        client = openai.OpenAI()

        # Upload files
        print(f"Uploading training file: {self.config.train_file}")
        with open(self.config.train_file, "rb") as f:
            train_file_obj = client.files.create(file=f, purpose="fine-tune")
        train_file_id = train_file_obj.id
        print(f"Training file ID: {train_file_id}")

        val_file_id = None
        if self.config.val_file:
            print(f"Uploading validation file: {self.config.val_file}")
            with open(self.config.val_file, "rb") as f:
                val_file_obj = client.files.create(file=f, purpose="fine-tune")
            val_file_id = val_file_obj.id
            print(f"Validation file ID: {val_file_id}")

        # Submit job
        hyperparams = {
            "n_epochs": self.config.n_epochs,
        }
        if self.config.learning_rate_multiplier != 1.0:
            hyperparams["learning_rate_multiplier"] = self.config.learning_rate_multiplier
        if self.config.batch_size:
            hyperparams["batch_size"] = self.config.batch_size

        job = client.fine_tuning.jobs.create(
            training_file=train_file_id,
            validation_file=val_file_id,
            model=self.config.model,
            suffix=self.config.suffix,
            hyperparameters=hyperparams,
        )
        print(f"Job submitted: {job.id}")
        if self._wandb_run:
            self._wandb_run.config.update({"job_id": job.id})

        return job.id

    def wait_for_completion(self, job_id: str) -> TrainingJobResult:
        """Poll until the job completes or fails. Returns the result."""
        import openai
        client = openai.OpenAI()

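        start_time = time.time()
        seen_event_ids: set[str] = set()

        while True:
            job = client.fine_tuning.jobs.retrieve(job_id)
            status = job.status

            # Fetch recent events. They arrive newest first, so track the IDs
            # already printed instead of slicing by a running count (which
            # stops showing new events once the count reaches the limit).
            events = client.fine_tuning.jobs.list_events(
                fine_tuning_job_id=job_id, limit=50
            )
            new_events = [e for e in events.data if e.id not in seen_event_ids]
            for event in reversed(new_events):  # print oldest first
                seen_event_ids.add(event.id)
                print(f"[{event.created_at}] {event.message}")
                # Metric events carry a data payload; other events may have none.
                # Assumes the payload is a dict with step/train_loss/valid_loss keys.
                data = event.data if isinstance(event.data, dict) else {}
                if self._wandb_run and "train_loss" in data:
                    self._wandb_run.log({
                        "train_loss": data.get("train_loss"),
                        "valid_loss": data.get("valid_loss"),
                        "step": data.get("step"),
                    })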

            if status in ("succeeded", "failed", "cancelled"):
                break

            print(f"Status: {status} - waiting {self.poll_interval}s...")
            time.sleep(self.poll_interval)

        duration = time.time() - start_time
        result = TrainingJobResult(
            job_id=job_id,
            fine_tuned_model_id=job.fine_tuned_model,
            status=job.status,
            trained_tokens=job.trained_tokens or 0,
            training_loss=None,  # Not populated here; read it from the job's metric events if needed
            validation_loss=None,
            duration_seconds=duration,
        )

        if self._wandb_run:
            self._wandb_run.log({
                "final_status": result.status,
                "trained_tokens": result.trained_tokens,
                "duration_seconds": result.duration_seconds,
            })
            self._wandb_run.finish()

        if result.status != "succeeded":
            raise RuntimeError(
                f"Fine-tuning job {job_id} failed with status: {result.status}"
            )

        print(f"\nFine-tuning complete!")
        print(f"Model ID: {result.fine_tuned_model_id}")
        print(f"Trained tokens: {result.trained_tokens:,}")
        print(f"Duration: {duration/60:.1f} minutes")

        return result

    def run(self) -> TrainingJobResult:
        """Submit and wait for the complete fine-tuning job."""
        job_id = self.submit_job()
        return self.wait_for_completion(job_id)

Experiment Tracking Best Practices

Every training run should be tracked with at minimum:

Training and validation loss curves
Hyperparameters used
Dataset version (hash or version tag)
Base model version
Who ran the job and why
The evaluation results from the comparison against baseline

Without this, you cannot reproduce a good training run, and you cannot diagnose a bad one.
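
The checklist above can be captured as a small record written next to each run. This is a minimal sketch, not a prescribed schema: `RunRecord`, its field names, and `dataset_fingerprint` are hypothetical, and loss curves are assumed to live in your tracking tool rather than in this file.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path


def dataset_fingerprint(path: str) -> str:
    """SHA-256 of the training file bytes, used as the dataset version."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large JSONL files are not loaded at once
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class RunRecord:
    """Hypothetical record of one training run (field names are illustrative)."""
    base_model: str
    dataset_sha256: str
    hyperparameters: dict
    run_by: str
    reason: str
    started_at: float = field(default_factory=time.time)
    eval_vs_baseline: dict = field(default_factory=dict)

    def save(self, path: str) -> None:
        """Write the record as JSON so it can be diffed and archived."""
        Path(path).write_text(json.dumps(asdict(self), indent=2))
```

Hashing the file bytes means any change to the training data, even a reordered line, produces a new dataset version, which is what you want when diagnosing why two runs differ.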

info

Weights & Biases (W&B) is one of the most widely used experiment-tracking tools, and its free tier is generous enough for most fine-tuning projects. If your organization requires self-hosted tracking, MLflow is a solid open-source alternative that supports the same logging patterns shown above.

Stage 4: Deployment with A/B Testing

Never send 100% of traffic to a fine-tuned model immediately after training. Even if the offline evaluation showed improvement, production traffic can expose failure modes that your eval set did not cover.

The correct deployment pattern is:

Start with 5–10% of traffic to the fine-tuned model
Monitor production metrics for 24–48 hours
Expand to 25%, then 50%, then 100% if metrics are stable
Keep the rollback path active for the first 2 weeks
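
The rollout steps above can be expressed as a small decision function. This is a sketch under stated assumptions: the stage list uses 5% as the starting point, `metrics_stable` is a flag you compute elsewhere from your monitoring, and the 24-hour soak default is the lower end of the range above.

```python
# Traffic fractions taken from the rollout steps above; 0.05 is the lower
# end of the 5-10% starting range.
ROLLOUT_STAGES = [0.05, 0.25, 0.50, 1.00]


def next_treatment_fraction(
    current: float,
    hours_at_current: float,
    metrics_stable: bool,
    min_soak_hours: float = 24.0,
) -> float:
    """Return the traffic fraction the fine-tuned model should get next.

    Rolls back to 0.0 if metrics are unstable, holds the current fraction
    until the soak period has elapsed, otherwise advances one stage.
    """
    if not metrics_stable:
        return 0.0  # rollback: all traffic goes to the control model
    if hours_at_current < min_soak_hours:
        return current  # keep observing at the current fraction
    for stage in ROLLOUT_STAGES:
        if stage > current:
            return stage
    return current  # already at 100%
```

Keeping the schedule in code rather than in someone's head makes the rollback condition explicit and reviewable.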

A/B Testing Implementation

import hashlib
import time
from dataclasses import dataclass
from typing import Callable, Optional

import anthropic


@dataclass
class InferenceRequest:
    user_id: str
    messages: list[dict]
    system: Optional[str] = None


@dataclass
class InferenceResponse:
    text: str
    model_used: str
    variant: str  # "control" or "treatment"
    latency_ms: float
    user_id: str


class ABTestRouter:
    """
    Routes requests between a control (base) model and treatment (fine-tuned) model.

    Uses deterministic user-based routing to ensure the same user always
    gets the same model variant - this prevents confusing mixed experiences
    and enables per-user metric analysis.

    Usage:
        router = ABTestRouter(
            client=anthropic.Anthropic(),
            control_model="claude-3-5-haiku-20241022",
            # Placeholder ID: use whatever identifier your provider issues
            # for the fine-tuned model
            treatment_model="ft:claude-3-5-haiku:your-org:suffix:id",
            treatment_fraction=0.1,  # 10% of users get fine-tuned model
            on_response=log_response_to_analytics,
        )
        response = router.route(request)  # route() is synchronous
    """

    def __init__(
        self,
        client: anthropic.Anthropic,
        control_model: str,
        treatment_model: str,
        treatment_fraction: float = 0.1,
        max_tokens: int = 1024,
        temperature: float = 0.0,
        on_response: Optional[Callable[[InferenceResponse], None]] = None,
    ):
        # Use an explicit check: assert statements are stripped under python -O
        if not 0.0 <= treatment_fraction <= 1.0:
            raise ValueError("treatment_fraction must be in [0, 1]")
        self.client = client
        self.control_model = control_model
        self.treatment_model = treatment_model
        self.treatment_fraction = treatment_fraction
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.on_response = on_response

        # Metrics tracking
        self._response_counts = {"control": 0, "treatment": 0}
        self._error_counts = {"control": 0, "treatment": 0}
        self._latencies = {"control": [], "treatment": []}

    def _assign_variant(self, user_id: str) -> str:
        """
        Deterministic variant assignment based on user_id hash.
        Same user_id always gets the same variant.
        """
        h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        bucket = (h % 1000) / 1000.0
        return "treatment" if bucket < self.treatment_fraction else "control"

    def route(self, request: InferenceRequest) -> InferenceResponse:
        """Route a request and return the response with metadata."""
        variant = self._assign_variant(request.user_id)
        model = (
            self.treatment_model if variant == "treatment" else self.control_model
        )

        start = time.perf_counter()
        try:
            api_response = self.client.messages.create(
                model=model,
                max_tokens=self.max_tokens,
                temperature=self.temperature,
                system=request.system or anthropic.NOT_GIVEN,
                messages=request.messages,
            )
            text = api_response.content[0].text
        except Exception as e:
            self._error_counts[variant] += 1
            # Chain the original exception so the provider traceback is preserved
            raise RuntimeError(f"Model call failed for variant={variant}: {e}") from e

        latency_ms = (time.perf_counter() - start) * 1000

        response = InferenceResponse(
            text=text,
            model_used=model,
            variant=variant,
            latency_ms=latency_ms,
            user_id=request.user_id,
        )

        # Update internal stats
        self._response_counts[variant] += 1
        self._latencies[variant].append(latency_ms)

        # Call the analytics callback
        if self.on_response:
            try:
                self.on_response(response)
            except Exception as e:
                print(f"Warning: analytics callback failed: {e}")

        return response

    def get_stats(self) -> dict:
        """Return current A/B test statistics."""
        stats = {}
        for variant in ("control", "treatment"):
            latencies = self._latencies[variant]
            stats[variant] = {
                "total_requests": self._response_counts[variant],
                "errors": self._error_counts[variant],
                "mean_latency_ms": sum(latencies) / len(latencies) if latencies else 0,
                "p95_latency_ms": sorted(latencies)[int(len(latencies) * 0.95)] if latencies else 0,
            }
        return stats

    def print_stats(self) -> None:
        stats = self.get_stats()
        print("\n=== A/B Test Statistics ===")
        for variant, data in stats.items():
            model = self.treatment_model if variant == "treatment" else self.control_model
            print(f"\n{variant.upper()} ({model})")
            print(f"  Requests:     {data['total_requests']:,}")
            print(f"  Errors:       {data['errors']:,}")
            print(f"  Mean latency: {data['mean_latency_ms']:.1f} ms")
            print(f"  P95 latency:  {data['p95_latency_ms']:.1f} ms")


# Example analytics callback for logging to your monitoring system
def log_response_to_analytics(response: InferenceResponse) -> None:
    """
    Example callback. In production, send to your analytics system
    (Datadog, Grafana, Amplitude, Mixpanel, etc.)
    """
    event = {
        "timestamp": time.time(),
        "user_id": response.user_id,
        "variant": response.variant,
        "model": response.model_used,
        "latency_ms": response.latency_ms,
        "response_length": len(response.text),
    }
    # In production: send to your event streaming system
    print(f"Analytics event: {event}")

What to Monitor During A/B Test

During the A/B test, monitor these signals in your analytics system:

Metric	What It Tells You	Red Flag
Task completion rate	Are users accomplishing their goal?	Treatment rate drops more than 5% below control
Error rate	Is the model failing or refusing?	Any increase in refusals or errors
Response length distribution	Is the model being too verbose or too terse?	Mean shifts more than 30%
User rating (if collected)	Explicit quality signal	Treatment rating consistently below control
Downstream business metric	Does the output lead to the desired outcome?	Drop in email response rate, ticket close rate, etc.
Latency P95	Is the fine-tuned model slower?	P95 latency increases more than 20%
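
The numeric red flags in the table can be checked automatically. The sketch below is illustrative only: `VariantMetrics` and `red_flags` are hypothetical names, the "5% below control" rule is read as a relative drop, and "any increase" in errors is applied literally, so on low traffic you would likely want a significance test before acting on it.

```python
from dataclasses import dataclass


@dataclass
class VariantMetrics:
    task_completion_rate: float  # fraction in [0, 1]
    error_rate: float            # fraction in [0, 1]
    mean_response_length: float  # characters or tokens, same unit for both variants
    p95_latency_ms: float


def red_flags(control: VariantMetrics, treatment: VariantMetrics) -> list[str]:
    """Compare treatment against control using the thresholds in the table."""
    flags = []
    # Assumption: "more than 5% below control" is a relative drop
    if treatment.task_completion_rate < control.task_completion_rate * 0.95:
        flags.append("task_completion_rate")
    if treatment.error_rate > control.error_rate:
        flags.append("error_rate")
    if control.mean_response_length > 0:
        shift = abs(treatment.mean_response_length - control.mean_response_length)
        if shift / control.mean_response_length > 0.30:
            flags.append("response_length")
    if treatment.p95_latency_ms > control.p95_latency_ms * 1.20:
        flags.append("p95_latency")
    return flags
```

User ratings and downstream business metrics are left out because they depend on how your product collects them.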

Common Failure Modes

Understanding how fine-tuning fails is as important as understanding how it succeeds. These are the failure modes that appear in production most often.

Failure Mode	What Happens	Root Cause	Prevention
Overfitting	Model performs great on eval set, poorly on new inputs	Too many epochs, too little data, or eval set is too similar to train set	Use genuinely held-out eval data; limit epochs; monitor val loss curve
Catastrophic forgetting	Model loses general capabilities it had before fine-tuning	High learning rate, many epochs, small dataset	Use lower LR multiplier; fewer epochs; check capabilities on out-of-domain prompts
Distribution shift	Fine-tuned model trained on historical data fails on new inputs	Production inputs evolve; training data goes stale	Track input distribution over time; retrain on recent data quarterly
Hallucination amplification	Fine-tuned model hallucinates with more confidence	Training data included confident-sounding hallucinations	Rigorous data quality filtering; include grounding constraints in system prompt
Format lock-in	Model refuses to deviate from trained format	Training data too homogeneous	Include format variation in training data; test with unusual input formats
Instruction following regression	Model stops following system prompt instructions	Training overwrote instruction following behavior	Include diverse instruction-following examples in training data
Sycophancy amplification	Model agrees with user even when wrong	Training data rewarded agreement over accuracy	Audit training data for sycophancy patterns; include corrective examples

danger

Catastrophic forgetting is the most dangerous failure mode because it often only surfaces weeks or months after deployment, when a user asks the model to do something that is not in the training distribution. The model's general reasoning ability has degraded, but this is not visible in task-specific metrics. Always run a capabilities regression test on a diverse benchmark after fine-tuning.
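
A capabilities regression test can be reduced to a simple gate once you have benchmark scores for both models. This sketch assumes you already have per-benchmark scores as plain dictionaries; the function name and the 5% relative threshold are illustrative defaults, not fixed rules.

```python
def capability_regressions(
    baseline: dict[str, float],
    fine_tuned: dict[str, float],
    max_relative_drop: float = 0.05,
) -> dict[str, float]:
    """Return benchmarks whose score fell by more than max_relative_drop.

    Values in the result are relative drops (0.08 means 8% below baseline).
    """
    regressions = {}
    for name, base_score in baseline.items():
        if name not in fine_tuned:
            raise KeyError(f"missing fine-tuned score for {name!r}")
        if base_score <= 0:
            continue  # relative drop is undefined for a zero baseline
        drop = (base_score - fine_tuned[name]) / base_score
        if drop > max_relative_drop:
            regressions[name] = drop
    return regressions
```

An empty result means no benchmark regressed past the threshold; anything else should block deployment until someone has looked at it.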

Fine-Tuning vs. RAG vs. Prompting

The choice between these three approaches depends on your specific constraints. The table below compares them along the dimensions that usually drive the decision.

Dimension	Prompt Engineering	RAG	Fine-Tuning
Data freshness	Real-time	Real-time (retrieval)	Static (training time)
Knowledge type	General reasoning	External facts, documents	Task patterns, style, format
Setup effort	Low (hours)	Medium (days to weeks)	High (weeks to months)
Iteration speed	Fast (minutes)	Medium (hours)	Slow (days per experiment)
Inference cost	High (long prompts)	Medium (retrieved context)	Low (short prompts)
Inference latency	Slow (large context)	Medium	Fast (short prompts, or a smaller model)
Required data	None	Document corpus	1,000+ labeled examples
Consistency	Variable	Variable	High
Hallucination risk	Medium	Low (grounded)	High if data is poor
Debugging complexity	Low	Medium	High
Best for	Prototyping, complex reasoning, one-off tasks	Knowledge-intensive tasks, QA over documents	High-volume, stable tasks, format/style consistency

The Recommended Sequence

Most teams should follow this sequence:

Start with prompt engineering - get to a working solution fast, learn the failure modes
Add RAG if the task is knowledge-intensive and facts need to be current or precise
Fine-tune only after you have stable requirements, 1,000+ labeled examples, and a working eval harness

Skipping steps in this sequence is expensive. Teams that jump straight to fine-tuning without prompt engineering almost always discover that a simpler solution would have worked. Teams that skip RAG for knowledge tasks often build a fine-tune that hallucinates facts a retrieval system would have supplied.

tip

The combination of RAG + fine-tuning often outperforms either alone for production tasks. RAG provides factual grounding; fine-tuning provides consistent format and domain-appropriate language. If you have the data and the use case justifies it, do both.

Admonitions Summary

tip

Run your fine-tuning data through a quality pipeline before training. The 20% of examples that fail quality filters are not just noise - they actively teach the model wrong behaviors. Removing them typically improves the fine-tuned model more than adding 1,000 new examples would.

info

Managed fine-tuning APIs (OpenAI, Google, Anthropic's partner program) handle the distributed training infrastructure for you. For most production use cases, managed APIs are faster, cheaper, and more reliable than self-hosting the training infrastructure. Use self-hosted training only when you have strict data privacy requirements or need full control over training configuration.

warning

Fine-tuning cannot compensate for a bad task definition. If you cannot write down in one sentence what the model should do, and produce three examples that everyone on your team agrees are correct, you are not ready to fine-tune. Go back and define the task.

danger

Never use your fine-tuned model's outputs as training data for the next fine-tuning run without human review. This creates a feedback loop where errors compound across generations - the model gets progressively worse in ways that are difficult to detect until a user complains. The research literature describes this degradation from training on model-generated data as model collapse.

Interview Q&A

Q1: What is catastrophic forgetting in the context of LLM fine-tuning, and how do you prevent it?

Answer: Catastrophic forgetting refers to the phenomenon where fine-tuning a model on a task-specific dataset causes the model to lose general capabilities it had before training. The model's weights shift toward the task-specific distribution, overwriting the broader knowledge encoded during pretraining.

In practice, it manifests as the fine-tuned model performing well on the task it was trained for but failing on adjacent tasks that the base model handled correctly - following complex instructions, reasoning through edge cases, handling unexpected input formats, or maintaining appropriate refusal behavior for harmful requests.

Prevention strategies:

Use a low learning rate multiplier: A multiplier of 0.5–1.0 (relative to the provider's default) limits how much the weights shift during fine-tuning. This keeps the model closer to its pretrained state.

Limit epochs: More than 3 epochs on a small dataset almost always causes overfitting and forgetting. Monitor validation loss and stop when it starts rising.

Include general capability examples in training data: Mix 10–20% of your training examples with diverse, general-purpose instruction-following examples. This technique, sometimes called "replay" or "mixed fine-tuning," prevents the model from forgetting how to handle inputs outside your training distribution.

Run a capabilities regression test: Before deploying any fine-tune, run the model on a benchmark that measures general capabilities - MMLU, HellaSwag, or a custom set of diverse tasks. If scores drop more than 5% relative to baseline, investigate before deploying.

Use PEFT instead of full fine-tuning: Parameter-efficient methods like LoRA (Low-Rank Adaptation) modify only a small fraction of the model's weights, which dramatically reduces catastrophic forgetting because most weights remain frozen at their pretrained values.
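
The replay strategy above amounts to sampling general-purpose examples into the task data at a fixed ratio. A minimal sketch, assuming both datasets are already lists of training records in the same format; `mix_with_replay` and its 15% default are hypothetical choices inside the 10-20% range mentioned above.

```python
import random


def mix_with_replay(
    task_examples: list[dict],
    general_examples: list[dict],
    replay_fraction: float = 0.15,
    seed: int = 0,
) -> list[dict]:
    """Return a shuffled training set in which roughly replay_fraction of
    the examples are general-purpose ("replay") examples."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    # Solve n_replay / (n_task + n_replay) = replay_fraction for n_replay
    n_replay = round(len(task_examples) * replay_fraction / (1.0 - replay_fraction))
    if n_replay > len(general_examples):
        raise ValueError(
            f"need {n_replay} general examples, only {len(general_examples)} available"
        )
    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = task_examples + rng.sample(general_examples, n_replay)
    rng.shuffle(mixed)
    return mixed
```

Fixing the seed keeps the mixed dataset reproducible, so its hash can be recorded as the dataset version for the run.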

Q2: How do you determine how many training examples you need for a fine-tuning project?

Answer: There is no universal number, but the practical guidance from production experience is: fewer than 500 examples rarely produces meaningful improvement over prompt engineering, 1,000–5,000 examples is the common sweet spot for format/style tasks, and 10,000+ examples are needed for tasks that require the model to learn complex domain knowledge.

The more rigorous answer is to do an empirical scaling study:

Split your labeled data into subsets: 100, 200, 500, 1,000, 2,000, etc. (if you have enough)
Fine-tune a separate model on each subset
Evaluate all models on the same held-out set
Plot the performance curve as a function of training examples
Look for the "knee of the curve" - the point where additional examples produce diminishing returns

In most production tasks, this curve flattens around 1,000–3,000 examples. If your curve is still steep at your maximum data count, you need more data. If it flatlined at 500, you may be overfitting to your eval set, or your task is simple enough that prompt engineering would suffice.
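
Once the scaling study has produced a score per subset size, locating the knee can be automated. The sketch below uses one simple heuristic, gain in eval score per doubling of data; the function name and the 0.01 threshold are assumptions to tune for your own metric scale.

```python
import math


def find_knee(
    results: list[tuple[int, float]],
    min_gain_per_doubling: float = 0.01,
) -> int:
    """Return the training-set size after which the eval-score gain per
    doubling of data falls below min_gain_per_doubling.

    results holds (num_examples, eval_score) pairs. If the curve never
    flattens, the largest size is returned, meaning more data may help.
    """
    points = sorted(results)
    if len(points) < 2:
        raise ValueError("need at least two (num_examples, score) points")
    for (n0, s0), (n1, s1) in zip(points, points[1:]):
        if n1 == n0:
            raise ValueError("training-set sizes must be distinct")
        # Normalize by log2 so unevenly spaced subset sizes are comparable
        gain_per_doubling = (s1 - s0) / math.log2(n1 / n0)
        if gain_per_doubling < min_gain_per_doubling:
            return n0
    return points[-1][0]
```

Because each point comes from a separate training run, eval noise can produce a false knee; repeating runs at the candidate size is a reasonable check before trusting it.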

Also consider: data quality beats data quantity at every scale. 500 carefully curated examples almost always outperform 5,000 noisy examples from production logs. Invest in data quality filtering before scaling data collection.

Q3: How do you build a reliable evaluation harness for fine-tuning, and what makes an eval set trustworthy?

Answer: A reliable evaluation harness for fine-tuning has four components:

A held-out eval set that is not used during training: This sounds obvious but is frequently violated. If any examples from your eval set appear in your training set (even as near-duplicates or light paraphrases), your eval scores are optimistic. Run explicit overlap detection using hashing or fuzzy matching before training.

Task-specific metrics, not generic ones: BLEU score and ROUGE measure n-gram overlap and are poor proxies for real quality in most LLM tasks. Build metrics that measure what you care about: format compliance, factual consistency, business outcome correlation. For high-stakes evaluation, use LLM-as-judge with a separate, more capable model evaluating each response against your rubric.

Calibrated difficulty distribution: Your eval set should include easy examples, medium examples, and hard examples in proportion to what you see in production. An eval set of only easy examples will make every model look good. Sample your eval set from actual production traffic when possible.

Repeatability and versioning: The eval set must be versioned and pinned. If you evaluate model v1 on eval set A and model v2 on eval set B, you cannot compare the results. Treat the eval set as a first-class artifact with its own version history.
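
The overlap check from the first component can be sketched with the standard library alone. This is an illustration, not a tuned implementation: `find_overlap`, the normalization rule, and the 0.9 similarity threshold are assumptions, and the pairwise fuzzy pass is only practical for small datasets.

```python
import hashlib
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide overlap."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_overlap(
    train_inputs: list[str],
    eval_inputs: list[str],
    fuzzy_threshold: float = 0.9,
) -> list[tuple[int, int, float]]:
    """Return (eval_index, train_index, similarity) for every eval input that
    matches a training input exactly after normalization, or fuzzily."""
    train_norm = [normalize(t) for t in train_inputs]
    train_hashes = {
        hashlib.sha256(t.encode()).hexdigest(): i for i, t in enumerate(train_norm)
    }
    hits = []
    for ei, raw in enumerate(eval_inputs):
        e = normalize(raw)
        digest = hashlib.sha256(e.encode()).hexdigest()
        if digest in train_hashes:
            hits.append((ei, train_hashes[digest], 1.0))
            continue
        # Pairwise comparison is O(len(train) * len(eval)): small sets only
        for ti, t in enumerate(train_norm):
            ratio = SequenceMatcher(None, e, t).ratio()
            if ratio >= fuzzy_threshold:
                hits.append((ei, ti, ratio))
                break
    return hits
```

Any hit should be removed from one side of the split before training. For larger datasets, an approximate method such as MinHash is the usual replacement for the pairwise loop.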

For LLM-as-judge evaluation using the Anthropic SDK:

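import json

import anthropic


def llm_judge_metric(
    expected: str,
    actual: str,
    rubric: str,
    client: anthropic.Anthropic,
) -> float:
    """
    Use Claude to evaluate a response against a rubric.
    Returns a score from 0.0 to 1.0 (0.0 if the judge output cannot be parsed).
    """
    judge_prompt = f"""You are an expert evaluator. Score the following response on a scale of 0 to 10.

Rubric:
{rubric}

Reference (ideal response):
{expected}

Response to evaluate:
{actual}

Return ONLY a JSON object with two fields:
- "score": integer from 0 to 10
- "reasoning": one sentence explaining the score

JSON:"""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=200,
        temperature=0.0,  # keep scoring as repeatable as possible
        messages=[{"role": "user", "content": judge_prompt}],
    )

    try:
        result = json.loads(response.content[0].text)
        score = float(result["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError, IndexError, AttributeError):
        # Unparseable judge output is scored 0.0. Count these separately in
        # production so parse failures are not mistaken for bad responses.
        return 0.0
    # Clamp in case the judge returns a value outside the 0-10 range
    return max(0.0, min(score, 10.0)) / 10.0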

Q4: What is the difference between supervised fine-tuning (SFT), RLHF, and DPO? When would you use each?

Answer: These three techniques represent a spectrum of how you incorporate human preference signal into model training:

Supervised Fine-Tuning (SFT) trains the model to replicate examples in your dataset using standard next-token prediction. You show the model (input, desired output) pairs and train it to maximize the likelihood of the output. SFT is the foundation - almost every fine-tuning project starts here. It is straightforward, interpretable, and works well when you have high-quality examples and the task is well-defined.

Limitation: SFT trains the model to mimic examples, but does not teach it to distinguish between better and worse responses when both are technically correct. It also requires gold-standard examples, which are expensive to produce.

RLHF (Reinforcement Learning from Human Feedback) trains a reward model on human preference data (response A vs. response B - which is better?) and then uses that reward model to update the main model via reinforcement learning (typically PPO). RLHF is what OpenAI used to produce InstructGPT, and RLHF-style preference training is part of how frontier models such as GPT-4 and Claude are aligned. It allows the model to learn nuanced preferences that are difficult to express as explicit examples.

Limitation: RLHF is operationally complex - you need a separate reward model training pipeline and human labelers with clear guidelines, and PPO is sensitive to hyperparameters and hard to keep stable. This is impractical for most production fine-tuning projects.

DPO (Direct Preference Optimization) is a more recent technique (Rafailov et al., 2023) that achieves RLHF-like results without the separate reward model or reinforcement learning step. You provide preference pairs (chosen response vs. rejected response for the same input) and train directly on the preference signal using a classification-like objective. DPO is significantly simpler to implement than RLHF and often achieves comparable quality.
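
To make the "classification-like objective" concrete, here is the DPO loss for a single preference pair, written with plain floats. It is a sketch for intuition: in real training the inputs are summed token log-probabilities from the policy and a frozen reference model, computed in batches by your training framework, and `beta=0.1` is just a commonly seen default.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed in a numerically stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference, which is why no separate reward model is needed.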

When to use each:

SFT: default choice for any new fine-tuning project with good labeled examples
DPO: use when your task has meaningful quality variation between responses and you can collect preference labels (human or LLM-judged)
RLHF: only if you are training foundation-level alignment and have significant infrastructure investment capacity - almost never the right choice for product teams

Q5: How do you manage the production lifecycle of a fine-tuned model, including retraining and versioning?

Answer: Fine-tuned models have a lifecycle that most teams underplan for. The model you deploy today will gradually become wrong - production inputs will shift, your product will evolve, and the model's behavior will not keep pace. Managing this lifecycle requires explicit processes.

Versioning: Treat every fine-tuned model as a versioned artifact with a clear identifier (e.g., ft-email-v3-2024-11) that encodes the task, version, and training date. Store alongside it: the training data version, base model version, eval score at release, and the hyperparameters used. Never overwrite a deployed model in place - always deploy a new version and keep the old one available for rollback.

Drift monitoring: Track input distribution drift in production using statistical tests (Jensen-Shannon divergence or Maximum Mean Discrepancy on input embeddings). When the distribution of production inputs shifts significantly from the training distribution, schedule a retraining run. In practice, most product teams retrain quarterly regardless, supplementing with drift signals to catch unexpected shifts early.
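
As an illustration of the drift check, the sketch below computes Jensen-Shannon divergence between two histograms. To stay self-contained it bins inputs by word count rather than by embeddings, which is a simplifying assumption; `length_histogram` and the bin edges are hypothetical.

```python
import math
from bisect import bisect_right


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence (log base 2) between two discrete
    distributions over the same bins: 0 means identical, 1 means disjoint."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: list[float], y: list[float]) -> float:
        # Terms with zero probability in x contribute nothing
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def length_histogram(texts: list[str], bin_edges: list[int]) -> list[float]:
    """Fraction of texts whose word count falls in each bin.

    bin_edges=[50, 200] gives three bins: <50, 50-199, and >=200 words.
    """
    if not texts:
        raise ValueError("texts must not be empty")
    counts = [0] * (len(bin_edges) + 1)
    for text in texts:
        counts[bisect_right(bin_edges, len(text.split()))] += 1
    return [c / len(texts) for c in counts]
```

In use, you would compare `length_histogram(training_inputs, edges)` against the same histogram for a recent window of production inputs and alert when the divergence crosses a threshold chosen from historical variation.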

Retraining triggers:

Time-based: retrain every 90 days with recent production data
Drift-based: retrain when input distribution shifts beyond a threshold
Performance-based: retrain when an automated eval score drops below a threshold
Event-based: retrain when a major product change creates new input patterns
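
The four triggers can be combined into one check that runs on a schedule. This is a sketch only: `ModelHealth` and `retraining_reasons` are hypothetical names, the 90-day age comes from the list above, and the drift and eval thresholds are placeholders to set from your own data.

```python
from dataclasses import dataclass


@dataclass
class ModelHealth:
    days_since_training: int
    input_drift: float          # e.g. a divergence score vs. the training inputs
    eval_score: float           # latest automated eval score
    major_product_change: bool  # set manually when the product changes


def retraining_reasons(
    health: ModelHealth,
    max_age_days: int = 90,
    drift_threshold: float = 0.1,   # placeholder value
    min_eval_score: float = 0.8,    # placeholder value
) -> list[str]:
    """Return the triggers that currently call for a retraining run."""
    reasons = []
    if health.days_since_training >= max_age_days:
        reasons.append("time")
    if health.input_drift > drift_threshold:
        reasons.append("drift")
    if health.eval_score < min_eval_score:
        reasons.append("performance")
    if health.major_product_change:
        reasons.append("event")
    return reasons
```

Returning the reasons rather than a boolean makes it easy to record why a retraining run was started, which feeds back into experiment tracking.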

The golden set: Maintain a small, stable set of 100–200 examples (the "golden set") that you evaluate every fine-tuned model version against. This set never changes - it is your longitudinal benchmark that lets you compare models trained years apart. Without it, you cannot determine whether a new model is an improvement over models trained a year ago.

Data accumulation: Use production traffic (with appropriate privacy controls and consent) as training data for future versions. Build an annotation pipeline that routes model outputs to human reviewers when confidence is low or when the user provides negative feedback. These reviewed examples are your most valuable training data for the next version.

Summary

Fine-tuning is a pipeline problem. Every stage - task definition, data collection, quality filtering, baseline evaluation, training, evaluation, and deployment - can fail independently, and failures in early stages compound downstream. The team at the opening of this lesson failed at the data preparation stage: they had the right data source but included only the outputs, not the grounding context, creating a model that learned confident style without factual grounding.

The key principles from this lesson:

Decide carefully: Fine-tune only when prompt engineering has been maxed out, you have 1,000+ quality examples, and you have a quantitative evaluation. The RAG vs. fine-tune vs. prompting decision matrix exists because there is no universal answer - only trade-offs to reason through.

Data is everything: The quality of your training data determines the quality of your fine-tuned model. Build a quality scoring pipeline. Filter aggressively. The 20% of examples you remove are more valuable than the 1,000 new examples you might add.

Measure before you train: Establish a baseline on your eval set before running a single training job. Without a baseline, you cannot know if the fine-tune helped.

Deploy gradually: A/B test at 5–10% traffic before full rollout. Monitor production metrics, not just offline eval scores. Keep the rollback path clear.

Understand the failure modes: Overfitting, catastrophic forgetting, distribution shift, and hallucination amplification are the four failure modes that appear most often in production. Test for each one explicitly before declaring a fine-tune production-ready.

Fine-tuning done well produces models that are faster, cheaper, and more consistent than prompt engineering alone and, on the right task type, more reliable on facts. Done poorly, it produces a model that is confidently wrong in ways that are difficult to detect and expensive to fix. The pipeline is the difference.
