

Fine-Tuning Pipelines

The Fine-Tune That Went Wrong

The team at a B2B SaaS company had a problem that looked solvable. Their product used Claude to generate customer-facing emails - follow-up messages after support tickets, renewal reminders, upsell recommendations. The model was good but not great. The emails sounded generic and lacked the company's voice. The model occasionally hallucinated product names and sometimes referenced the wrong support tickets. Customer response rates were 12% below the human-written baseline.

Someone in the engineering team had a reasonable idea: fine-tune. They had 14,000 examples of emails that had been written by their best human agents - emails that had high open rates, high response rates, and zero complaints. They formatted the data into JSONL, uploaded it to their model provider's fine-tuning API, ran the job over a weekend, and deployed the fine-tuned model to production on Monday morning.

By Wednesday, the customer success team was filing tickets. The fine-tuned model was confidently generating emails that referenced products the customer had never purchased. It was hallucinating renewal dates that were months off. In one case, it congratulated a customer on upgrading to a premium tier they had actually cancelled. The emails sounded great - the tone was perfect, the voice was exactly right - but the factual accuracy had collapsed. The model had learned the style so well that it started generating content with the same confident authority, even when it had no factual basis for what it was writing.

They rolled back to the base model within 48 hours. The postmortem was uncomfortable. The fine-tuning data was the problem: the 14,000 examples contained the final emails but not the support ticket context those emails were grounded in. The model learned "write a confident, warm, specific email" but had no signal that the specifics needed to come from real data. It learned the style distribution without learning grounding. This is the failure mode that kills fine-tuning projects in production.

The lesson was not "don't fine-tune." The lesson was: fine-tuning is a pipeline problem, not a model problem. The data preparation, the training setup, the evaluation harness, and the deployment strategy all have to be engineered - not assumed. This lesson covers how to build that pipeline correctly, from the decision to fine-tune through to safe production deployment.


Why Fine-Tuning Exists (and When It Doesn't)

Before diving into how to fine-tune, you need to understand what fine-tuning actually does to a model - and why that matters for deciding whether to use it at all.

A pretrained LLM has learned a rich representation of language, facts, and reasoning from billions of tokens of text. It knows how to write. It knows many facts. It can follow instructions. What it does not know is your domain's specific vocabulary, your organization's preferred response format, your task's implicit constraints, or the edge cases that your users generate every day.

Fine-tuning adapts the model's weights to your domain. It adjusts the internal representations so that your task-specific patterns are more accessible to the model. Done correctly, it makes the model faster (fewer tokens in the prompt), cheaper (shorter prompts mean lower costs per call), and more consistent (the behavior is baked in, not prompted in).
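To make the cost claim concrete, here is a rough arithmetic sketch. The call volume, token counts, and per-token prices are hypothetical placeholders, not any provider's actual rates, and `daily_prompt_cost` is an illustrative helper. The example assumes the fine-tuned model is billed at a higher per-token price, which is common but not universal.

```python
def daily_prompt_cost(
    calls_per_day: int,
    prompt_tokens: int,
    usd_per_million_tokens: float,
) -> float:
    """Input-token spend per day in USD (output tokens are ignored here)."""
    return calls_per_day * prompt_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical scenario: a long few-shot prompt on the base model versus
# a short prompt on a fine-tuned model billed at twice the token price.
base_cost = daily_prompt_cost(100_000, 1_500, 1.00)
tuned_cost = daily_prompt_cost(100_000, 300, 2.00)
daily_savings = base_cost - tuned_cost

print(f"base=${base_cost:.2f} tuned=${tuned_cost:.2f} savings=${daily_savings:.2f}")
```

With these made-up numbers the shorter prompt more than offsets the higher price. At low volume the same calculation usually shows savings too small to justify the training and maintenance effort.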

But fine-tuning is not magic. It cannot give the model information it was not trained on. It cannot teach the model to reliably access real-time data. It cannot fix a fundamentally broken task definition. And it introduces failure modes - overfitting, catastrophic forgetting, and a mismatch between training data and production traffic - that prompt engineering alone does not.

The question "should I fine-tune?" is one of the most misanswered questions in applied AI. Here is the honest answer.


When to Fine-Tune vs. Prompt Engineer

The decision is not about model quality in isolation. It is about cost, latency, consistency, and data availability - weighed against each other.

The Decision Matrix

| Factor | Lean Toward Prompting | Lean Toward Fine-Tuning |
|---|---|---|
| Task examples available | Fewer than 500 labeled examples | 1,000+ high-quality examples |
| Task stability | Requirements change frequently | Requirements are stable for 6+ months |
| Prompt length | Short prompts already work well | Prompts exceed 1,000 tokens |
| Latency requirement | 5–10 seconds acceptable | Sub-second required |
| Cost per call | Low volume (less than 10K calls/day) | High volume (100K+ calls/day) |
| Consistency required | Some variation is acceptable | Identical formatting every time |
| Domain specificity | General-purpose task | Highly specialized domain |
| Reasoning required | Yes - complex multi-step reasoning | No - pattern matching is enough |

The Real Test

Before committing to fine-tuning, answer these three questions honestly:

  1. Have you maxed out prompt engineering? Few-shot examples, chain-of-thought, system prompt iteration, output format specification - if you have not tried all of these, you are not ready to fine-tune.

  2. Do you have 1,000+ high-quality, consistent examples? Not 1,000 examples scraped from a log file. 1,000 examples that a human expert would say "yes, this is the right input-output pair for this task." If you do not, fine-tuning will bake in your noise.

  3. Can you measure what "better" means? If you cannot define a quantitative evaluation that tells you whether the fine-tuned model is better than the base model on your task, you cannot safely deploy a fine-tune. You need the eval before you write a single line of training data.

:::warning

Fine-tuning changes model behavior in ways that are difficult to predict, and those changes cannot be selectively undone afterward; the only remedies are rolling back or retraining. Always run a comprehensive evaluation against the base model before deploying a fine-tune to production. Assume the fine-tune will break something until you prove otherwise.

:::


Fine-Tuning Pipeline Stages

Fine-tuning is not a single step. It is a six-stage pipeline, and each stage can fail independently. The diagram below shows the full pipeline with the critical decision points.

Each stage has specific engineering requirements. The rest of this lesson covers each one in depth with working code.


Stage 1: Data Preparation

Data preparation is where fine-tuning projects succeed or fail. The model will learn exactly what you show it - including your mistakes, your inconsistencies, and your biases. There is no "it will figure out the pattern." The pattern you encode in training data is the pattern the model learns.

JSONL Format Requirements

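Many fine-tuning APIs (OpenAI and Mistral, for example) accept chat-format JSONL in which each line is one complete conversation. Other providers use similar but not identical schemas, so check your provider's documentation before formatting data. The structure follows the messages format. The example below is wrapped across several lines for readability; in the actual file, each example must occupy exactly one line: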

{"messages": [
{"role": "system", "content": "You are a support email assistant. Write concise, friendly emails grounded only in the provided ticket data."},
{"role": "user", "content": "Ticket #4821: Customer Jane Smith reports she cannot access the dashboard after upgrading to Pro tier on 2024-01-15. Subscription ID: SUB-993421. Write a follow-up email."},
{"role": "assistant", "content": "Subject: Following up on your dashboard access issue\n\nHi Jane,\n\nThank you for reaching out about the access issue you've experienced since upgrading to Pro on January 15th.\n\nI've reviewed your account (SUB-993421) and can see the upgrade processed successfully, but there's a permissions sync issue on our end that's preventing dashboard access. I've escalated this to our engineering team with high priority.\n\nYou should have full access restored within 2 hours. I'll send you a confirmation email the moment it's resolved.\n\nApologies for the disruption, and thank you for your patience.\n\nBest,\nSupport Team"}
]}
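Because the record above is wrapped for display, it helps to see how one example is actually serialized. The following is a minimal sketch using only the standard library; `to_jsonl_line` is a hypothetical helper name, and the message text is abbreviated from the example above.

```python
import json


def to_jsonl_line(system: str, user: str, assistant: str) -> str:
    """Serialize one training conversation as a single JSONL line."""
    record = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }
    # json.dumps escapes newlines inside strings, so the serialized
    # record never spans more than one physical line.
    return json.dumps(record, ensure_ascii=False)


line = to_jsonl_line(
    "You are a support email assistant.",
    "Ticket #4821: customer cannot access the dashboard. Write a follow-up email.",
    "Subject: Following up on your dashboard access issue\n\nHi Jane,",
)
```

Writing the file is then one `f.write(line + "\n")` per example. Building records with `json.dumps` rather than string formatting also avoids unescaped quotes in email bodies.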

What Makes a High-Quality Training Example

Every training example must satisfy all of the following:

  • Grounded: the assistant turn contains only information present in the user turn or system prompt - no invented details
  • Consistent format: the response format matches exactly across all examples - same greeting style, same closing, same structure
  • Appropriate length: not artificially padded to seem thorough, not cut short to seem efficient - the right length for the content
  • No contradictions: the example should not teach the model two different responses to similar inputs
  • Representative: the input distribution should match what the model will see in production
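The grounding criterion is the one the opening story violated, and part of it can be checked mechanically. The sketch below is a crude heuristic, not a full factual-consistency check: it assumes identifiers look like `SUB-993421` or `#4821` (as in the example record) and flags any identifier in the assistant turn that never appears in the system or user turns. The pattern and the function name `ungrounded_identifiers` are illustrative assumptions.

```python
import re

# Assumed identifier shapes: uppercase prefix plus digits (SUB-993421)
# or a hash-prefixed number (#4821). Adjust to your own ID formats.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d+\b|#\d+")


def ungrounded_identifiers(example: dict) -> set[str]:
    """Return identifiers in assistant turns that are absent from the context."""
    messages = example["messages"]
    context = " ".join(m["content"] for m in messages if m["role"] != "assistant")
    reply = " ".join(m["content"] for m in messages if m["role"] == "assistant")
    return set(ID_PATTERN.findall(reply)) - set(ID_PATTERN.findall(context))


grounded = {"messages": [
    {"role": "user", "content": "Ticket #4821, subscription SUB-993421."},
    {"role": "assistant", "content": "I've reviewed your account (SUB-993421)."},
]}
invented = {"messages": [
    {"role": "user", "content": "Ticket #4821, subscription SUB-993421."},
    {"role": "assistant", "content": "I've reviewed SUB-111111 for ticket #4821."},
]}
```

A check like this will not catch invented dates or product names written in free text, so it complements human review rather than replacing it.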

Quality Filtering Code

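This is the most important code you will write for a fine-tuning project. It automates the checks that are cheap to compute (schema, duplicates, length, format, and refusal-style marker phrases) and computes a score and issue list for each example. It does not verify groundedness; that still requires comparing each response against its input, either by a human reviewer or with a separate check.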

import json
import hashlib
import re
from dataclasses import dataclass, field
from typing import Optional
from pathlib import Path
from collections import defaultdict


@dataclass
class QualityScore:
    """Structured quality assessment for a single training example."""
    example_id: str
    total_score: float
    has_system_prompt: bool
    response_length_ok: bool
    no_hallucination_markers: bool
    consistent_format: bool
    dedup_hash: str
    issues: list[str] = field(default_factory=list)

    @property
    def passes(self) -> bool:
        return self.total_score >= 0.75


class FineTuningDataPipeline:
    """
    End-to-end pipeline for preparing fine-tuning data.

    Handles:
    - Loading raw examples from JSONL files
    - Deduplication (exact duplicates only)
    - Quality scoring against configurable criteria
    - Format validation against the messages schema
    - Seeded random train/validation split
    - Output as cleaned JSONL files

    Usage:
        pipeline = FineTuningDataPipeline(
            min_response_tokens=50,
            max_response_tokens=1000,
            expected_format_pattern=r"^Subject:",
        )
        train, val = pipeline.run(
            input_paths=["data/raw_examples.jsonl"],
            output_dir="data/prepared/",
        )
    """

    def __init__(
        self,
        min_response_tokens: int = 50,
        max_response_tokens: int = 1000,
        expected_format_pattern: Optional[str] = None,
        hallucination_markers: Optional[list[str]] = None,
        val_fraction: float = 0.1,
        min_quality_score: float = 0.75,
    ):
        self.min_response_tokens = min_response_tokens
        self.max_response_tokens = max_response_tokens
        self.expected_format_pattern = (
            re.compile(expected_format_pattern)
            if expected_format_pattern
            else None
        )
        self.hallucination_markers = hallucination_markers or [
            "I don't have access to",
            "I cannot verify",
            "as an AI",
            "I don't know your",
            "I'm not sure which",
        ]
        self.val_fraction = val_fraction
        self.min_quality_score = min_quality_score

        # Stats tracking
        self._stats: dict[str, int] = defaultdict(int)

    def run(
        self,
        input_paths: list[str],
        output_dir: str,
    ) -> tuple[list[dict], list[dict]]:
        """
        Full pipeline run. Returns (train_examples, val_examples).
        Also writes train.jsonl, val.jsonl, rejected.jsonl, and
        pipeline_stats.json to output_dir.
        """
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        # 1. Load all raw examples
        raw_examples = self._load_examples(input_paths)
        self._stats["total_loaded"] = len(raw_examples)
        print(f"Loaded {len(raw_examples)} raw examples")

        # 2. Validate schema
        schema_valid = [e for e in raw_examples if self._validate_schema(e)]
        self._stats["schema_invalid"] = len(raw_examples) - len(schema_valid)
        print(f"Schema valid: {len(schema_valid)} ({self._stats['schema_invalid']} rejected)")

        # 3. Deduplicate
        deduped = self._deduplicate(schema_valid)
        self._stats["duplicates_removed"] = len(schema_valid) - len(deduped)
        print(f"After dedup: {len(deduped)} ({self._stats['duplicates_removed']} removed)")

        # 4. Score quality, using the configured threshold
        scored = [self._score_example(e, i) for i, e in enumerate(deduped)]
        passing = [e for e, s in scored if s.total_score >= self.min_quality_score]
        # Keep rejected examples with their issues so they can be inspected
        rejected = [
            {"example": e, "score": s.total_score, "issues": s.issues}
            for e, s in scored
            if s.total_score < self.min_quality_score
        ]
        self._stats["quality_failed"] = len(rejected)
        print(f"Quality passing: {len(passing)} ({self._stats['quality_failed']} rejected)")

        if len(passing) < 100:
            raise ValueError(
                f"Only {len(passing)} examples passed quality filters. "
                "Fine-tuning with fewer than 100 examples is not recommended. "
                "Review your quality criteria or collect more data."
            )

        # 5. Split train/val
        train, val = self._split(passing)
        self._stats["train_count"] = len(train)
        self._stats["val_count"] = len(val)
        print(f"Split: {len(train)} train / {len(val)} val")

        # 6. Write output
        self._write_jsonl(train, output_path / "train.jsonl")
        self._write_jsonl(val, output_path / "val.jsonl")
        self._write_jsonl(rejected, output_path / "rejected.jsonl")
        self._write_stats(output_path / "pipeline_stats.json")

        return train, val

    def _load_examples(self, paths: list[str]) -> list[dict]:
        examples = []
        for path in paths:
            with open(path, "r", encoding="utf-8") as f:
                for line_num, line in enumerate(f, 1):
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        examples.append(json.loads(line))
                    except json.JSONDecodeError as e:
                        print(f"Warning: JSON parse error in {path} line {line_num}: {e}")
        return examples

    def _validate_schema(self, example: dict) -> bool:
        """Validate the example conforms to the messages API format."""
        if "messages" not in example:
            return False
        messages = example["messages"]
        if not isinstance(messages, list) or len(messages) < 2:
            return False
        # Must have at least one user and one assistant turn
        roles = {m.get("role") for m in messages}
        if "user" not in roles or "assistant" not in roles:
            return False
        # All messages must have role and content
        for msg in messages:
            if "role" not in msg or "content" not in msg:
                return False
            if not isinstance(msg["content"], str) or not msg["content"].strip():
                return False
        return True

    def _deduplicate(self, examples: list[dict]) -> list[dict]:
        """Remove exact duplicates based on full example hash."""
        seen_hashes: set[str] = set()
        unique = []
        for example in examples:
            h = self._hash_example(example)
            if h not in seen_hashes:
                seen_hashes.add(h)
                unique.append(example)
        return unique

    def _hash_example(self, example: dict) -> str:
        """Deterministic hash of an example for dedup."""
        canonical = json.dumps(example, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def _score_example(
        self, example: dict, idx: int
    ) -> tuple[dict, QualityScore]:
        """Score a single example against quality criteria."""
        issues = []
        score_components = []

        messages = example["messages"]
        assistant_messages = [m for m in messages if m["role"] == "assistant"]
        system_messages = [m for m in messages if m["role"] == "system"]

        # 1. System prompt present
        has_system = len(system_messages) > 0
        score_components.append(0.2 if has_system else 0.0)
        if not has_system:
            issues.append("No system prompt - model may not learn task framing")

        # 2. Response length check
        # Whitespace-separated words are used as a rough proxy for tokens
        last_response = assistant_messages[-1]["content"]
        approx_tokens = len(last_response.split())
        length_ok = self.min_response_tokens <= approx_tokens <= self.max_response_tokens
        score_components.append(0.3 if length_ok else 0.0)
        if not length_ok:
            issues.append(
                f"Response length {approx_tokens} tokens outside "
                f"[{self.min_response_tokens}, {self.max_response_tokens}]"
            )

        # 3. Marker phrases (refusals and hedges that should not be in training data)
        no_hallucination = not any(
            marker.lower() in last_response.lower()
            for marker in self.hallucination_markers
        )
        score_components.append(0.3 if no_hallucination else 0.0)
        if not no_hallucination:
            issues.append("Response contains hallucination marker phrases")

        # 4. Format consistency
        format_ok = True
        if self.expected_format_pattern:
            format_ok = bool(self.expected_format_pattern.search(last_response))
            if not format_ok:
                issues.append("Response does not match expected format pattern")
        score_components.append(0.2 if format_ok else 0.0)

        total = sum(score_components)
        quality = QualityScore(
            example_id=f"example_{idx}",
            total_score=total,
            has_system_prompt=has_system,
            response_length_ok=length_ok,
            no_hallucination_markers=no_hallucination,
            consistent_format=format_ok,
            dedup_hash=self._hash_example(example),
            issues=issues,
        )
        return example, quality

    def _split(
        self, examples: list[dict]
    ) -> tuple[list[dict], list[dict]]:
        """Deterministic (fixed-seed) random train/val split."""
        import random
        rng = random.Random(42)
        shuffled = examples.copy()
        rng.shuffle(shuffled)
        n_val = max(1, int(len(shuffled) * self.val_fraction))
        return shuffled[n_val:], shuffled[:n_val]

    def _write_jsonl(self, examples: list[dict], path: Path) -> None:
        with open(path, "w", encoding="utf-8") as f:
            for example in examples:
                f.write(json.dumps(example, ensure_ascii=False) + "\n")
        print(f"Wrote {len(examples)} examples to {path}")

    def _write_stats(self, path: Path) -> None:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(dict(self._stats), f, indent=2)

:::tip

Always inspect the examples that fail quality filtering manually before adjusting the thresholds. If 40% of your examples are failing the response length check, the threshold may be wrong - or your data collection process may be including truncated outputs. Both are worth knowing.

:::
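A quick way to decide where to look first is to tally which check fails most often. This is a minimal sketch that assumes each score is available as a dict with the same boolean fields as `QualityScore` (for example via `dataclasses.asdict`); `failure_breakdown` is a hypothetical helper name.

```python
from collections import Counter

# Boolean check fields, mirroring the QualityScore dataclass.
CHECKS = (
    "has_system_prompt",
    "response_length_ok",
    "no_hallucination_markers",
    "consistent_format",
)


def failure_breakdown(scores: list[dict]) -> list[tuple[str, float]]:
    """Return (check_name, failure_rate) pairs, most frequent failure first."""
    counts = Counter(
        check for score in scores for check in CHECKS if not score[check]
    )
    total = len(scores) or 1  # avoid division by zero on an empty list
    return [(check, n / total) for check, n in counts.most_common()]


all_ok = dict.fromkeys(CHECKS, True)
sample_scores = [
    {**all_ok, "response_length_ok": False},
    {**all_ok, "response_length_ok": False},
    {**all_ok, "consistent_format": False},
    all_ok,
]
breakdown = failure_breakdown(sample_scores)
```

If one check dominates the breakdown, read a sample of those rejected examples before touching its threshold.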


Stage 2: Baseline Evaluation

You must measure the base model before you fine-tune. This is not optional. Without a baseline, you cannot know whether the fine-tune helped or hurt. You cannot make the rollback decision on evidence. You are flying blind.

The baseline evaluation must use:

  • The same inputs your production system will see
  • The same metrics you care about in production
  • A held-out evaluation set that is not used in training
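The third requirement is easy to violate by accident when train and eval data come from the same logs. The sketch below checks for leakage by hashing only the input side of each example, on the assumption that an identical input in both sets counts as leakage even if the reference outputs differ. The function names are illustrative.

```python
import hashlib
import json


def input_fingerprint(example: dict) -> str:
    """Hash the non-assistant messages, i.e. what the model is given."""
    prompt = [m for m in example["messages"] if m["role"] != "assistant"]
    canonical = json.dumps(prompt, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_leaked_eval_indices(train: list[dict], eval_set: list[dict]) -> list[int]:
    """Return indices of eval examples whose input also appears in train."""
    train_prints = {input_fingerprint(e) for e in train}
    return [
        i for i, e in enumerate(eval_set)
        if input_fingerprint(e) in train_prints
    ]


def _example(user: str, assistant: str) -> dict:
    return {"messages": [
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}


train_set = [_example("Ticket A", "Reply A"), _example("Ticket B", "Reply B")]
eval_set = [_example("Ticket C", "Reply C"), _example("Ticket A", "Different reply")]
leaked = find_leaked_eval_indices(train_set, eval_set)
```

This catches exact input matches only. Near-duplicates, such as the same ticket with a different timestamp, need fuzzier matching.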

Here is how to build a baseline evaluation using the Anthropic SDK:

import anthropic
import json
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalResult:
    example_id: str
    model: str
    input_messages: list[dict]
    expected_output: str
    actual_output: str
    latency_ms: float
    metric_scores: dict[str, float]

    @property
    def mean_score(self) -> float:
        if not self.metric_scores:
            return 0.0
        return sum(self.metric_scores.values()) / len(self.metric_scores)


class ModelEvaluator:
    """
    Evaluates a model against a held-out eval set.
    Supports multiple metrics and comparison between models.

    Designed for comparing a fine-tuned model against a base model.
    """

    def __init__(
        self,
        client: anthropic.Anthropic,
        metrics: dict[str, Callable[[str, str], float]],
        max_tokens: int = 1024,
        temperature: float = 0.0,
    ):
        self.client = client
        self.metrics = metrics
        self.max_tokens = max_tokens
        self.temperature = temperature

    def evaluate_model(
        self,
        model_id: str,
        eval_examples: list[dict],
        system_prompt: Optional[str] = None,
    ) -> list[EvalResult]:
        """Run evaluation on a list of examples and return results."""
        results = []

        for i, example in enumerate(eval_examples):
            messages = example["messages"]
            if messages[-1]["role"] != "assistant":
                raise ValueError(
                    f"Example {i} does not end with an assistant turn"
                )
            # Context is everything before the final assistant turn
            input_messages = messages[:-1]
            expected = messages[-1]["content"]

            # Build the message list for inference
            # (the system prompt is passed separately, not as a message)
            inference_messages = [
                m for m in input_messages if m["role"] != "system"
            ]
            system = system_prompt or next(
                (m["content"] for m in input_messages if m["role"] == "system"),
                None
            )

            # Call the model
            start = time.perf_counter()
            try:
                response = self.client.messages.create(
                    model=model_id,
                    max_tokens=self.max_tokens,
                    temperature=self.temperature,
                    system=system or anthropic.NOT_GIVEN,
                    messages=inference_messages,
                )
                actual = response.content[0].text
            except Exception as e:
                # A failed call is scored as an empty response
                print(f"Error on example {i}: {e}")
                actual = ""
            latency_ms = (time.perf_counter() - start) * 1000

            # Score against each metric
            scores = {
                name: metric_fn(expected, actual)
                for name, metric_fn in self.metrics.items()
            }

            results.append(EvalResult(
                example_id=f"eval_{i}",
                model=model_id,
                input_messages=input_messages,
                expected_output=expected,
                actual_output=actual,
                latency_ms=latency_ms,
                metric_scores=scores,
            ))

            if (i + 1) % 10 == 0:
                print(f"Evaluated {i + 1}/{len(eval_examples)} examples")

        return results

    def compare_models(
        self,
        base_model: str,
        fine_tuned_model: str,
        eval_examples: list[dict],
        system_prompt: Optional[str] = None,
    ) -> dict:
        """
        Head-to-head comparison between base and fine-tuned model.
        Returns a summary dict with per-metric deltas.
        """
        print(f"Evaluating base model: {base_model}")
        base_results = self.evaluate_model(base_model, eval_examples, system_prompt)

        print(f"Evaluating fine-tuned model: {fine_tuned_model}")
        ft_results = self.evaluate_model(fine_tuned_model, eval_examples, system_prompt)

        # Aggregate
        def aggregate(results: list[EvalResult]) -> dict:
            if not results:
                return {}
            metric_names = list(results[0].metric_scores.keys())
            return {
                "mean_overall": sum(r.mean_score for r in results) / len(results),
                "mean_latency_ms": sum(r.latency_ms for r in results) / len(results),
                **{
                    f"mean_{m}": sum(r.metric_scores[m] for r in results) / len(results)
                    for m in metric_names
                },
            }

        base_agg = aggregate(base_results)
        ft_agg = aggregate(ft_results)

        comparison = {
            "base_model": base_model,
            "fine_tuned_model": fine_tuned_model,
            "n_examples": len(eval_examples),
            "base": base_agg,
            "fine_tuned": ft_agg,
            "delta": {
                k: ft_agg.get(k, 0) - base_agg.get(k, 0)
                for k in base_agg
            },
            # .get() keeps this safe when the eval set is empty
            "fine_tune_wins": (
                ft_agg.get("mean_overall", 0.0) > base_agg.get("mean_overall", 0.0)
            ),
        }

        self._print_comparison(comparison)
        return comparison

    def _print_comparison(self, comparison: dict) -> None:
        print("\n" + "=" * 60)
        print("MODEL EVALUATION COMPARISON")
        print("=" * 60)
        print(f"Base: {comparison['base_model']}")
        print(f"Fine-tuned: {comparison['fine_tuned_model']}")
        print(f"Examples: {comparison['n_examples']}")
        print("-" * 60)
        for key in comparison["base"]:
            base_val = comparison["base"][key]
            ft_val = comparison["fine_tuned"][key]
            delta = comparison["delta"][key]
            direction = "+" if delta >= 0 else ""
            print(f"{key:<25} base={base_val:.3f} ft={ft_val:.3f} delta={direction}{delta:.3f}")
        print("-" * 60)
        verdict = "PASS - fine-tune wins" if comparison["fine_tune_wins"] else "FAIL - base model wins"
        print(f"Verdict: {verdict}")
        print("=" * 60 + "\n")
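A comparison of two means can flip on noise when the eval set is small. One hedged way to check is a paired bootstrap over per-example score differences, sketched below with the standard library only. This is a generic statistical technique rather than part of the evaluator above; the function name and defaults are assumptions, and the inputs are per-example mean scores from the two models in the same example order.

```python
import random


def bootstrap_delta_ci(
    base_scores: list[float],
    ft_scores: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile confidence interval for mean(ft - base), paired by example."""
    if not base_scores or len(base_scores) != len(ft_scores):
        raise ValueError("Score lists must be non-empty and the same length")
    rng = random.Random(seed)
    deltas = [f - b for b, f in zip(base_scores, ft_scores)]
    n = len(deltas)
    # Resample the per-example deltas with replacement and record each mean.
    means = sorted(
        sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-example scores for illustration only.
base = [0.5, 0.6, 0.4, 0.7, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5]
tuned = [0.7, 0.8, 0.6, 0.9, 0.7, 0.8, 0.7, 0.6, 0.8, 0.7]
lower, upper = bootstrap_delta_ci(base, tuned)
```

If the interval includes zero, the eval set does not yet support a "fine-tune wins" verdict; collect more eval examples before deciding.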

Defining Your Metrics

The metrics you choose determine whether your evaluation is honest. Generic metrics like BLEU score are almost always wrong for LLM evaluation. You need task-specific metrics.

For a customer email generation task, useful metrics include:

  • Format compliance rate: does the response start with "Subject:" - binary, easy to compute
  • Factual consistency score: does the response avoid referencing entities not in the input - requires a secondary LLM call
  • Length ratio: ratio of output length to reference length - values far from 1.0 suggest pathological behavior
  • Tone consistency: a classifier that distinguishes on-brand from off-brand language

# Example metric functions for email generation evaluation
from difflib import SequenceMatcher


def format_compliance_metric(expected: str, actual: str) -> float:
    """Binary: does the response start with 'Subject:'?"""
    return 1.0 if actual.strip().startswith("Subject:") else 0.0


def length_ratio_metric(expected: str, actual: str) -> float:
    """
    Score how close response length is to reference.
    1.0 = same length, penalizes responses that are too long or short.
    """
    expected_words = len(expected.split())
    actual_words = len(actual.split())
    # Guard against empty or whitespace-only strings (avoids division by zero)
    if expected_words == 0 or actual_words == 0:
        return 0.0
    ratio = actual_words / expected_words
    # Score: 1.0 at ratio=1.0, decays toward 0 as ratio diverges
    return max(0.0, 1.0 - abs(1.0 - ratio))


def reference_overlap_metric(expected: str, actual: str) -> float:
    """
    Sequence overlap with reference. Not a replacement for semantic eval
    but useful as a sanity check for format-heavy tasks.
    """
    if not expected or not actual:
        return 0.0
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()


def no_hallucination_marker_metric(expected: str, actual: str) -> float:
    """
    Penalize responses containing known hallucination signal phrases.
    Returns 0.0 if any marker is found, 1.0 otherwise.
    """
    markers = [
        "as an ai",
        "i don't have access",
        "i cannot verify",
        "i'm not sure",
        "based on what you've told me",
    ]
    actual_lower = actual.lower()
    return 0.0 if any(m in actual_lower for m in markers) else 1.0

Stage 3: Training Run Management

Once your data is prepared and you have a baseline, you are ready to run the training job. For managed fine-tuning APIs (OpenAI, Anthropic, Google), most of the training configuration is handled by the provider - but you still need to manage hyperparameters, monitor the run, and checkpoint correctly.

Hyperparameters That Matter

| Hyperparameter | What It Controls | Conservative Default | When to Increase |
|---|---|---|---|
| Epochs | How many times the model sees your data | 1–2 | When validation loss is still decreasing at the end of training |
| Learning rate multiplier | Scales the base LR | 1.0 | When loss decreases too slowly |
| Batch size | Examples per gradient step | Provider default | Almost never - let provider choose |
| Warmup steps | Steps before full LR | Provider default | When training loss spikes early |

:::danger

More epochs is not better. The most common fine-tuning mistake is running too many epochs. After 2–3 epochs on a 1,000-example dataset, most models start overfitting. The training loss keeps decreasing but the validation loss starts rising. Always use a validation loss curve to detect this early.

:::
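The overfitting signal described above can be turned into a simple rule. This sketch assumes you have one validation loss value per epoch (where your provider exposes them) and returns the epoch to keep. `early_stop_epoch`, `patience`, and `min_delta` are illustrative names following common early-stopping conventions, and the loss values are made up.

```python
def early_stop_epoch(
    val_losses: list[float],
    patience: int = 1,
    min_delta: float = 0.0,
) -> int:
    """Return the 1-based epoch with the best validation loss.

    Scanning stops once validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_idx = 0
    for i in range(1, len(val_losses)):
        if val_losses[i] < val_losses[best_idx] - min_delta:
            best_idx = i
        elif i - best_idx >= patience:
            break
    return best_idx + 1


# Made-up curve: validation loss bottoms out at epoch 3, then rises.
curve = [1.20, 0.95, 0.91, 0.97, 1.05]
keep_epoch = early_stop_epoch(curve)
```

On a managed API you usually cannot stop mid-run, so the practical use is choosing `n_epochs` for the next run or picking a checkpoint if the provider offers them.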

Training Job Manager with W&B Logging

import os
import time
import json
from dataclasses import dataclass, asdict
from typing import Optional
import anthropic


@dataclass
class TrainingConfig:
    """Configuration for a fine-tuning job."""
    model: str
    train_file: str
    val_file: Optional[str]
    n_epochs: int = 2
    learning_rate_multiplier: float = 1.0
    batch_size: Optional[int] = None
    suffix: str = "prod-v1"
    wandb_project: Optional[str] = None
    wandb_run_name: Optional[str] = None


@dataclass
class TrainingJobResult:
    job_id: str
    fine_tuned_model_id: Optional[str]
    status: str
    trained_tokens: int
    training_loss: Optional[float]
    validation_loss: Optional[float]
    duration_seconds: float


class TrainingJobManager:
    """
    Manages a fine-tuning job from submission through completion.
    Handles polling, W&B logging, and result capture.

    Note: This example uses OpenAI's API since Anthropic's managed
    fine-tuning API is accessed through their partner program.
    Adapt the provider-specific calls for your chosen provider.
    """

    def __init__(
        self,
        config: TrainingConfig,
        poll_interval_seconds: int = 30,
    ):
        self.config = config
        self.poll_interval = poll_interval_seconds

        # Initialize W&B if configured
        self._wandb_run = None
        if config.wandb_project:
            self._init_wandb()

    def _init_wandb(self) -> None:
        try:
            import wandb
            self._wandb_run = wandb.init(
                project=self.config.wandb_project,
                name=self.config.wandb_run_name or f"ft-{int(time.time())}",
                config=asdict(self.config),
                tags=["fine-tuning", self.config.model],
            )
            print(f"W&B run initialized: {self._wandb_run.url}")
        except ImportError:
            print("Warning: wandb not installed. Install with: pip install wandb")
        except Exception as e:
            print(f"Warning: W&B initialization failed: {e}")

    def submit_job(self) -> str:
        """Submit the fine-tuning job and return the job ID."""
        import openai
        client = openai.OpenAI()

        # Upload files
        print(f"Uploading training file: {self.config.train_file}")
        with open(self.config.train_file, "rb") as f:
            train_file_obj = client.files.create(file=f, purpose="fine-tune")
        train_file_id = train_file_obj.id
        print(f"Training file ID: {train_file_id}")

        val_file_id = None
        if self.config.val_file:
            print(f"Uploading validation file: {self.config.val_file}")
            with open(self.config.val_file, "rb") as f:
                val_file_obj = client.files.create(file=f, purpose="fine-tune")
            val_file_id = val_file_obj.id
            print(f"Validation file ID: {val_file_id}")

        # Submit job
        hyperparams = {
            "n_epochs": self.config.n_epochs,
        }
        if self.config.learning_rate_multiplier != 1.0:
            hyperparams["learning_rate_multiplier"] = self.config.learning_rate_multiplier
        if self.config.batch_size:
            hyperparams["batch_size"] = self.config.batch_size

        job = client.fine_tuning.jobs.create(
            training_file=train_file_id,
            validation_file=val_file_id,
            model=self.config.model,
            suffix=self.config.suffix,
            hyperparameters=hyperparams,
        )
        print(f"Job submitted: {job.id}")
        if self._wandb_run:
            self._wandb_run.config.update({"job_id": job.id})

        return job.id

    def wait_for_completion(self, job_id: str) -> TrainingJobResult:
        """Poll until the job completes or fails. Returns the result."""
        import openai
        client = openai.OpenAI()

        start_time = time.time()
        seen_event_ids: set[str] = set()
        last_train_loss: Optional[float] = None
        last_valid_loss: Optional[float] = None

        while True:
            job = client.fine_tuning.jobs.retrieve(job_id)
            status = job.status

            # Fetch the latest events. They arrive newest-first, so track the
            # IDs already printed rather than slicing by a running count.
            # Events beyond `limit` between two polls will be missed.
            events = client.fine_tuning.jobs.list_events(
                fine_tuning_job_id=job_id, limit=10
            )
            new_events = [e for e in events.data if e.id not in seen_event_ids]
            for event in reversed(new_events):
                seen_event_ids.add(event.id)
                print(f"[{event.created_at}] {event.message}")
                # Metric events are assumed to carry step/train_loss/valid_loss
                # in event.data; other event types may have no data at all.
                data = getattr(event, "data", None) or {}
                if isinstance(data, dict) and "train_loss" in data:
                    last_train_loss = data.get("train_loss")
                    last_valid_loss = data.get("valid_loss", last_valid_loss)
                    if self._wandb_run:
                        metrics = {
                            "train_loss": data.get("train_loss"),
                            "valid_loss": data.get("valid_loss"),
                            "step": data.get("step"),
                        }
                        self._wandb_run.log(
                            {k: v for k, v in metrics.items() if v is not None}
                        )

            if status in ("succeeded", "failed", "cancelled"):
                break

            print(f"Status: {status} - waiting {self.poll_interval}s...")
            time.sleep(self.poll_interval)

        duration = time.time() - start_time
        result = TrainingJobResult(
            job_id=job_id,
            fine_tuned_model_id=job.fine_tuned_model,
            status=job.status,
            trained_tokens=job.trained_tokens or 0,
            training_loss=last_train_loss,  # last value seen in metric events
            validation_loss=last_valid_loss,
            duration_seconds=duration,
        )

        if self._wandb_run:
            self._wandb_run.log({
                "final_status": result.status,
                "trained_tokens": result.trained_tokens,
                "duration_seconds": result.duration_seconds,
            })
            self._wandb_run.finish()

        if result.status != "succeeded":
            raise RuntimeError(
                f"Fine-tuning job {job_id} failed with status: {result.status}"
            )

        print("\nFine-tuning complete!")
        print(f"Model ID: {result.fine_tuned_model_id}")
        print(f"Trained tokens: {result.trained_tokens:,}")
        print(f"Duration: {duration/60:.1f} minutes")

        return result

    def run(self) -> TrainingJobResult:
        """Submit and wait for the complete fine-tuning job."""
        job_id = self.submit_job()
        return self.wait_for_completion(job_id)

Experiment Tracking Best Practices

Every training run should be tracked with at minimum:

  • Training and validation loss curves
  • Hyperparameters used
  • Dataset version (hash or version tag)
  • Base model version
  • Who ran the job and why
  • The evaluation results from the comparison against baseline

Without this, you cannot reproduce a good training run, and you cannot diagnose a bad one.
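As one way to capture the items in this list, the sketch below hashes the training file and bundles the remaining fields into a small JSON record. The record layout, field names, and helper names are hypothetical; in practice the same fields could be passed to your tracker's run config instead.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


def hash_dataset_file(path: Path) -> str:
    """SHA-256 of the file's bytes, tying a run to an exact dataset."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large training files are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class RunRecord:
    """Hypothetical minimal record of one training run."""
    dataset_sha256: str
    base_model: str
    hyperparameters: dict
    author: str
    reason: str
    eval_summary: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

A record would be built with something like `RunRecord(hash_dataset_file(Path("data/prepared/train.jsonl")), ...)` and stored next to the evaluation comparison, so a good run can be reproduced and a bad one traced to its data.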

:::info

Weights & Biases (W&B) is one of the most widely used experiment-tracking tools. Its free tier is generous enough for most fine-tuning projects. If your organization requires self-hosted tracking, MLflow is a solid open-source alternative that fits the same patterns shown above.

:::


Stage 4: Deployment with A/B Testing

Never send 100% of traffic to a fine-tuned model immediately after training. Even if the offline evaluation showed improvement, production traffic can expose failure modes that your eval set did not cover.

The correct deployment pattern is:

  1. Start with 5–10% of traffic to the fine-tuned model
  2. Monitor production metrics for 24–48 hours
  3. Expand to 25%, then 50%, then 100% if metrics are stable
  4. Keep the rollback path active for the first 2 weeks
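The staged rollout above can be expressed as a small gate function. This is a sketch under stated assumptions: the step sizes are one choice within the ranges listed, error rate is used as a stand-in for whatever production metric you monitor, and the 10% tolerance and the name `next_treatment_fraction` are illustrative.

```python
# One possible schedule within the ranges described above.
ROLLOUT_STEPS = (0.05, 0.25, 0.50, 1.00)


def next_treatment_fraction(
    current: float,
    control_error_rate: float,
    treatment_error_rate: float,
    max_relative_regression: float = 0.10,
) -> float:
    """Advance one rollout step if treatment is healthy, else roll back to 0."""
    allowed = control_error_rate * (1 + max_relative_regression)
    if treatment_error_rate > allowed:
        return 0.0  # roll back: send all traffic to the control model
    for step in ROLLOUT_STEPS:
        if step > current:
            return step
    return 1.0  # already fully rolled out


expanded = next_treatment_fraction(0.05, control_error_rate=0.020, treatment_error_rate=0.021)
rolled_back = next_treatment_fraction(0.25, control_error_rate=0.020, treatment_error_rate=0.030)
```

Note that with a control error rate of zero, any treatment error triggers a rollback under this rule; an absolute tolerance may suit very low error rates better.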

A/B Testing Implementation

import random
import time
import hashlib
from dataclasses import dataclass
from typing import Optional, Callable
import anthropic


@dataclass
class InferenceRequest:
    user_id: str
    messages: list[dict]
    system: Optional[str] = None


@dataclass
class InferenceResponse:
    text: str
    model_used: str
    variant: str  # "control" or "treatment"
    latency_ms: float
    user_id: str


class ABTestRouter:
    """
    Routes requests between a control (base) model and treatment (fine-tuned) model.

    Uses deterministic user-based routing to ensure the same user always
    gets the same model variant - this prevents confusing mixed experiences
    and enables per-user metric analysis.

    Usage:
        router = ABTestRouter(
            client=anthropic.Anthropic(),
            control_model="claude-3-5-haiku-20241022",
            # Placeholder ID; the real format depends on your provider
            treatment_model="ft:claude-3-5-haiku:your-org:suffix:id",
            treatment_fraction=0.1,  # 10% of users get fine-tuned model
            on_response=log_to_analytics,
        )
        response = router.route(request)
    """

    def __init__(
        self,
        client: anthropic.Anthropic,
        control_model: str,
        treatment_model: str,
        treatment_fraction: float = 0.1,
        max_tokens: int = 1024,
        temperature: float = 0.0,
        on_response: Optional[Callable[[InferenceResponse], None]] = None,
    ):
        # Raise explicitly: assert statements are skipped under `python -O`
        if not 0.0 <= treatment_fraction <= 1.0:
            raise ValueError("treatment_fraction must be in [0, 1]")
        self.client = client
        self.control_model = control_model
        self.treatment_model = treatment_model
        self.treatment_fraction = treatment_fraction
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.on_response = on_response

        # Metrics tracking
        self._response_counts = {"control": 0, "treatment": 0}
        self._error_counts = {"control": 0, "treatment": 0}
        self._latencies = {"control": [], "treatment": []}

    def _assign_variant(self, user_id: str) -> str:
        """
        Deterministic variant assignment based on user_id hash.
        Same user_id always gets the same variant.
        """
        # MD5 is used only for stable bucketing, not for security
        h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        bucket = (h % 1000) / 1000.0
        return "treatment" if bucket < self.treatment_fraction else "control"

    def route(self, request: InferenceRequest) -> InferenceResponse:
        """Route a request and return the response with metadata."""
        variant = self._assign_variant(request.user_id)
        model = (
            self.treatment_model if variant == "treatment" else self.control_model
        )

        start = time.perf_counter()
        try:
            api_response = self.client.messages.create(
                model=model,
                max_tokens=self.max_tokens,
                temperature=self.temperature,
                system=request.system or anthropic.NOT_GIVEN,
                messages=request.messages,
            )
            text = api_response.content[0].text
        except Exception as e:
            self._error_counts[variant] += 1
            # Chain the original exception so the root cause is preserved
            raise RuntimeError(f"Model call failed for variant={variant}: {e}") from e

        latency_ms = (time.perf_counter() - start) * 1000

        response = InferenceResponse(
            text=text,
            model_used=model,
            variant=variant,
            latency_ms=latency_ms,
            user_id=request.user_id,
        )

        # Update internal stats
        self._response_counts[variant] += 1
        self._latencies[variant].append(latency_ms)

        # Call the analytics callback
        if self.on_response:
            try:
                self.on_response(response)
            except Exception as e:
                print(f"Warning: analytics callback failed: {e}")

return response

def get_stats(self) -> dict:
"""Return current A/B test statistics."""
stats = {}
for variant in ("control", "treatment"):
latencies = self._latencies[variant]
stats[variant] = {
"total_requests": self._response_counts[variant],
"errors": self._error_counts[variant],
"mean_latency_ms": sum(latencies) / len(latencies) if latencies else 0,
"p95_latency_ms": sorted(latencies)[int(len(latencies) * 0.95)] if latencies else 0,
}
return stats

def print_stats(self) -> None:
stats = self.get_stats()
print("\n=== A/B Test Statistics ===")
for variant, data in stats.items():
model = self.treatment_model if variant == "treatment" else self.control_model
print(f"\n{variant.upper()} ({model})")
print(f" Requests: {data['total_requests']:,}")
print(f" Errors: {data['errors']:,}")
print(f" Mean latency: {data['mean_latency_ms']:.1f} ms")
print(f" P95 latency: {data['p95_latency_ms']:.1f} ms")


# Example analytics callback for logging to your monitoring system
def log_response_to_analytics(response: InferenceResponse) -> None:
"""
Example callback. In production, send to your analytics system
(Datadog, Grafana, Amplitude, Mixpanel, etc.)
"""
event = {
"timestamp": time.time(),
"user_id": response.user_id,
"variant": response.variant,
"model": response.model_used,
"latency_ms": response.latency_ms,
"response_length": len(response.text),
}
# In production: send to your event streaming system
print(f"Analytics event: {event}")

What to Monitor During the A/B Test

During the A/B test, monitor these signals in your analytics system:

| Metric | What It Tells You | Red Flag |
| --- | --- | --- |
| Task completion rate | Are users accomplishing their goal? | Treatment rate drops more than 5% below control |
| Error rate | Is the model failing or refusing? | Any increase in refusals or errors |
| Response length distribution | Is the model being too verbose or too terse? | Mean shifts more than 30% |
| User rating (if collected) | Explicit quality signal | Treatment rating consistently below control |
| Downstream business metric | Does the output lead to the desired outcome? | Drop in email response rate, ticket close rate, etc. |
| Latency P95 | Is the fine-tuned model slower? | P95 latency increases more than 20% |
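The numeric red flags in the table can be checked automatically. The sketch below is one possible encoding, not a definitive implementation: `VariantMetrics` and `red_flags` are hypothetical names, the "5% below control" threshold is interpreted as a relative drop (an assumption), and the comparison uses point estimates with no significance test, so treat a flag as a prompt to investigate rather than a verdict.

```python
from dataclasses import dataclass


@dataclass
class VariantMetrics:
    completion_rate: float   # fraction of sessions where the user finished the task
    error_rate: float        # fraction of requests that errored or were refused
    mean_length: float       # mean response length in characters
    p95_latency_ms: float


def red_flags(control: VariantMetrics, treatment: VariantMetrics) -> list[str]:
    """Return the names of the table thresholds that the treatment variant violates."""
    flags = []
    # Completion rate more than 5% (relative) below control
    if treatment.completion_rate < control.completion_rate * 0.95:
        flags.append("completion_rate")
    # Any increase in errors or refusals
    if treatment.error_rate > control.error_rate:
        flags.append("error_rate")
    # Mean response length shifted by more than 30% in either direction
    if abs(treatment.mean_length - control.mean_length) > 0.30 * control.mean_length:
        flags.append("response_length")
    # P95 latency more than 20% above control
    if treatment.p95_latency_ms > control.p95_latency_ms * 1.20:
        flags.append("p95_latency")
    return flags
```

User ratings and downstream business metrics are left out because their thresholds depend on the product.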

Common Failure Modes

Understanding how fine-tuning fails is as important as understanding how it succeeds. These are the failure modes that appear in production most often.

| Failure Mode | What Happens | Root Cause | Prevention |
| --- | --- | --- | --- |
| Overfitting | Model performs great on eval set, poorly on new inputs | Too many epochs, too little data, or eval set is too similar to train set | Use genuinely held-out eval data; limit epochs; monitor val loss curve |
| Catastrophic forgetting | Model loses general capabilities it had before fine-tuning | High learning rate, many epochs, small dataset | Use lower LR multiplier; fewer epochs; check capabilities on out-of-domain prompts |
| Distribution shift | Fine-tuned model trained on historical data fails on new inputs | Production inputs evolve; training data goes stale | Track input distribution over time; retrain on recent data quarterly |
| Hallucination amplification | Fine-tuned model hallucinates with more confidence | Training data included confident-sounding hallucinations | Rigorous data quality filtering; include grounding constraints in system prompt |
| Format lock-in | Model refuses to deviate from trained format | Training data too homogeneous | Include format variation in training data; test with unusual input formats |
| Instruction following regression | Model stops following system prompt instructions | Training overwrote instruction following behavior | Include diverse instruction-following examples in training data |
| Sycophancy amplification | Model agrees with user even when wrong | Training data rewarded agreement over accuracy | Audit training data for sycophancy patterns; include corrective examples |

danger

Catastrophic forgetting is the most dangerous failure mode because it often only surfaces weeks or months after deployment, when a user asks the model to do something that is not in the training distribution. The model's general reasoning ability has degraded, but this is not visible in task-specific metrics. Always run a capabilities regression test on a diverse benchmark after fine-tuning.
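A capabilities regression test reduces to comparing per-benchmark scores before and after fine-tuning. The following is a minimal sketch under stated assumptions: `capability_regressions` is a hypothetical helper, the scores are whatever your benchmark harness produces, and the 5% relative threshold is a default chosen for illustration that you should tune.

```python
def capability_regressions(
    baseline: dict[str, float],
    finetuned: dict[str, float],
    max_relative_drop: float = 0.05,
) -> dict[str, float]:
    """Return {benchmark: relative_drop} for benchmarks that regressed too far.

    relative_drop = (baseline - finetuned) / baseline. Benchmarks whose
    baseline score is zero are skipped because the ratio is undefined.
    Every benchmark in `baseline` is expected to be present in `finetuned`.
    """
    regressions = {}
    for name, base_score in baseline.items():
        if base_score <= 0:
            continue
        drop = (base_score - finetuned[name]) / base_score
        if drop > max_relative_drop:
            regressions[name] = drop
    return regressions
```

An empty result means no benchmark dropped past the threshold. It does not prove that nothing was forgotten, only that the chosen benchmarks did not catch it.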


Fine-Tuning vs. RAG vs. Prompting

The choice between these three approaches depends on your specific constraints. The table below is a common framework for making that decision.

| Dimension | Prompt Engineering | RAG | Fine-Tuning |
| --- | --- | --- | --- |
| Data freshness | Real-time | Real-time (retrieval) | Static (training time) |
| Knowledge type | General reasoning | External facts, documents | Task patterns, style, format |
| Setup effort | Low (hours) | Medium (days to weeks) | High (weeks to months) |
| Iteration speed | Fast (minutes) | Medium (hours) | Slow (days per experiment) |
| Inference cost | High (long prompts) | Medium (retrieved context) | Low (short prompts) |
| Inference latency | Slow (large context) | Medium | Fast (compact model) |
| Required data | None | Document corpus | 1,000+ labeled examples |
| Consistency | Variable | Variable | High |
| Hallucination risk | Medium | Low (grounded) | High if data is poor |
| Debugging complexity | Low | Medium | High |
| Best for | Prototyping, complex reasoning, one-off tasks | Knowledge-intensive tasks, QA over documents | High-volume, stable tasks, format/style consistency |

Most teams should follow this sequence:

  1. Start with prompt engineering - get to a working solution fast, learn the failure modes
  2. Add RAG if the task is knowledge-intensive and facts need to be current or precise
  3. Fine-tune only after you have stable requirements, 1,000+ labeled examples, and a working eval harness

Skipping steps in this sequence is expensive. Teams that jump straight to fine-tuning without prompt engineering almost always discover that a simpler solution would have worked. Teams that skip RAG for knowledge tasks often build a fine-tune that hallucinates facts a retrieval system would have supplied.

tip

The combination of RAG + fine-tuning often outperforms either alone for production tasks. RAG provides factual grounding; fine-tuning provides consistent format and domain-appropriate language. If you have the data and the use case justifies it, do both.


Admonitions Summary

tip

Run your fine-tuning data through a quality pipeline before training. The 20% of examples that fail quality filters are not just noise - they actively teach the model wrong behaviors. Removing them typically improves the fine-tuned model more than adding 1,000 new examples would.

info

Managed fine-tuning APIs (OpenAI, Google, Anthropic models via Amazon Bedrock) handle the distributed training infrastructure for you. For most production use cases, managed APIs are faster, cheaper, and more reliable than self-hosting the training infrastructure. Use self-hosted training only when you have strict data privacy requirements or need full control over training configuration.

warning

Fine-tuning cannot compensate for a bad task definition. If you cannot write down in one sentence what the model should do, and produce three examples that everyone on your team agrees are correct, you are not ready to fine-tune. Go back and define the task.

danger

Never use your fine-tuned model's outputs as training data for the next fine-tuning run without human review. This creates a feedback loop where errors compound across generations - the model gets progressively worse in ways that are difficult to detect until a user complains. The research literature calls the extreme form of this degradation "model collapse."


Interview Q&A

Q1: What is catastrophic forgetting in the context of LLM fine-tuning, and how do you prevent it?

Answer: Catastrophic forgetting refers to the phenomenon where fine-tuning a model on a task-specific dataset causes the model to lose general capabilities it had before training. The model's weights shift toward the task-specific distribution, overwriting the broader knowledge encoded during pretraining.

In practice, it manifests as the fine-tuned model performing well on the task it was trained for but failing on adjacent tasks that the base model handled correctly - following complex instructions, reasoning through edge cases, handling unexpected input formats, or maintaining appropriate refusal behavior for harmful requests.

Prevention strategies:

Use a low learning rate multiplier: A multiplier of 0.5–1.0 (relative to the provider's default) limits how much the weights shift during fine-tuning. This keeps the model closer to its pretrained state.

Limit epochs: More than 3 epochs on a small dataset almost always causes overfitting and forgetting. Monitor validation loss and stop when it starts rising.

Include general capability examples in training data: Mix 10–20% of your training examples with diverse, general-purpose instruction-following examples. This technique, sometimes called "replay" or "mixed fine-tuning," prevents the model from forgetting how to handle inputs outside your training distribution.

Run a capabilities regression test: Before deploying any fine-tune, run the model on a benchmark that measures general capabilities - MMLU, HellaSwag, or a custom set of diverse tasks. If scores drop more than 5% relative to baseline, investigate before deploying.

Use PEFT instead of full fine-tuning: Parameter-efficient methods like LoRA (Low-Rank Adaptation) train a small number of added parameters while the pretrained weights stay frozen. This tends to reduce catastrophic forgetting, although it does not eliminate it, so the regression test above still applies.
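Of the strategies above, replay mixing is the one most easily shown in code. The sketch below is illustrative only: `mix_with_replay` is a hypothetical helper, examples are assumed to be plain dicts, and the default of 15% sits inside the 10–20% range mentioned above.

```python
import random


def mix_with_replay(
    task_examples: list[dict],
    general_examples: list[dict],
    replay_fraction: float = 0.15,
    seed: int = 0,
) -> list[dict]:
    """Return a shuffled training set in which roughly `replay_fraction`
    of the examples are general-purpose ("replay") examples."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    # Solve n_replay / (n_task + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(task_examples) * replay_fraction / (1.0 - replay_fraction))
    # Cannot sample more replay examples than are available.
    n_replay = min(n_replay, len(general_examples))
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = task_examples + rng.sample(general_examples, n_replay)
    rng.shuffle(mixed)
    return mixed
```

Fixing the seed matters for experiment tracking: the same inputs always yield the same training file.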


Q2: How do you determine how many training examples you need for a fine-tuning project?

Answer: There is no universal number, but the practical guidance from production experience is: fewer than 500 examples rarely produce a meaningful improvement over prompt engineering, 1,000–5,000 examples is the common sweet spot for format/style tasks, and 10,000+ examples are needed for tasks that require the model to learn complex domain knowledge.

The more rigorous answer is to do an empirical scaling study:

  1. Split your labeled data into subsets: 100, 200, 500, 1,000, 2,000, etc. (if you have enough)
  2. Fine-tune a separate model on each subset
  3. Evaluate all models on the same held-out set
  4. Plot the performance curve as a function of training examples
  5. Look for the "knee of the curve" - the point where additional examples produce diminishing returns

In most production tasks, this curve flattens around 1,000–3,000 examples. If your curve is still steep at your maximum data count, you need more data. If it has already flattened by 500, you may be overfitting to your eval set, or your task is simple enough that prompt engineering would suffice.
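Once the scaling study has produced one score per subset, locating the knee can be automated. This is a deliberately simple sketch: `find_knee` is a hypothetical helper, `min_gain` is an assumed threshold in the units of your eval metric, and the input is assumed to be a non-empty list of (num_examples, eval_score) pairs.

```python
def find_knee(curve: list[tuple[int, float]], min_gain: float = 0.01) -> int:
    """Return the smallest training-set size after which the next, larger
    subset improves the eval score by less than `min_gain`.

    If every step still gains at least `min_gain`, the largest size is
    returned, which suggests that more data is still helping.
    """
    points = sorted(curve)  # order by number of examples
    for (n, score), (_, next_score) in zip(points, points[1:]):
        if next_score - score < min_gain:
            return n
    return points[-1][0]
```

Because single training runs are noisy, it is worth repeating each subset with a few seeds before trusting where the knee falls.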

Also consider: data quality beats data quantity at every scale. 500 carefully curated examples almost always outperform 5,000 noisy examples from production logs. Invest in data quality filtering before scaling data collection.


Q3: How do you build a reliable evaluation harness for fine-tuning, and what makes an eval set trustworthy?

Answer: A reliable evaluation harness for fine-tuning has four components:

A held-out eval set that is not used during training: This sounds obvious but is frequently violated. If any examples from your eval set appear in your training set (even as near-duplicates or paraphrases), your eval scores are optimistic. Run explicit overlap detection using hashing or fuzzy matching before training.

Task-specific metrics, not generic ones: BLEU score and ROUGE measure n-gram overlap and are poor proxies for real quality in most LLM tasks. Build metrics that measure what you care about: format compliance, factual consistency, business outcome correlation. For high-stakes evaluation, use LLM-as-judge with a separate, more capable model evaluating each response against your rubric.

Calibrated difficulty distribution: Your eval set should include easy examples, medium examples, and hard examples in proportion to what you see in production. An eval set of only easy examples will make every model look good. Sample your eval set from actual production traffic when possible.

Repeatability and versioning: The eval set must be versioned and pinned. If you evaluate model v1 on eval set A and model v2 on eval set B, you cannot compare the results. Treat the eval set as a first-class artifact with its own version history.
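The overlap detection described in the first component can be sketched with the standard library alone. This is an illustration, not a production tool: `find_overlap` and the 0.9 similarity threshold are assumptions, and the fuzzy pass compares every eval example against every training example, so large datasets would need an approximate method instead.

```python
import hashlib
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide duplicates."""
    return " ".join(text.lower().split())


def find_overlap(
    train: list[str], eval_set: list[str], fuzzy_threshold: float = 0.9
) -> list[tuple[int, str]]:
    """Return (eval_index, kind) for eval examples that leak from the train set.

    kind is "exact" for a hash match after normalization and "fuzzy" when the
    similarity ratio to some training example reaches `fuzzy_threshold`.
    """
    train_norm = [_normalize(t) for t in train]
    train_hashes = {hashlib.sha256(t.encode()).hexdigest() for t in train_norm}
    leaks = []
    for i, example in enumerate(eval_set):
        norm = _normalize(example)
        if hashlib.sha256(norm.encode()).hexdigest() in train_hashes:
            leaks.append((i, "exact"))
        elif any(
            SequenceMatcher(None, norm, t).ratio() >= fuzzy_threshold
            for t in train_norm
        ):
            leaks.append((i, "fuzzy"))
    return leaks
```

Any eval example this reports should be removed from one of the two sets before training starts.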

For LLM-as-judge evaluation using the Anthropic SDK:

import json

import anthropic


def llm_judge_metric(
    expected: str,
    actual: str,
    rubric: str,
    client: anthropic.Anthropic,
) -> float:
    """
    Use Claude to evaluate a response against a rubric.
    Returns a score from 0.0 to 1.0.
    """
    judge_prompt = f"""You are an expert evaluator. Score the following response on a scale of 0 to 10.

Rubric:
{rubric}

Reference (ideal response):
{expected}

Response to evaluate:
{actual}

Return ONLY a JSON object with two fields:
- "score": integer from 0 to 10
- "reasoning": one sentence explaining the score

JSON:"""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": judge_prompt}],
    )

    try:
        result = json.loads(response.content[0].text)
        # Clamp to [0, 1] in case the judge returns an out-of-range score
        return min(max(result["score"] / 10.0, 0.0), 1.0)
    except (json.JSONDecodeError, KeyError, TypeError, IndexError, AttributeError):
        # Unparseable judge output is scored as 0.0; log these cases in practice
        return 0.0

Q4: What is the difference between supervised fine-tuning (SFT), RLHF, and DPO? When would you use each?

Answer: These three techniques represent a spectrum of how you incorporate human preference signal into model training:

Supervised Fine-Tuning (SFT) trains the model to replicate examples in your dataset using standard next-token prediction. You show the model (input, desired output) pairs and train it to maximize the likelihood of the output. SFT is the foundation - almost every fine-tuning project starts here. It is straightforward, interpretable, and works well when you have high-quality examples and the task is well-defined.

Limitation: SFT trains the model to mimic examples, but does not teach it to distinguish between better and worse responses when both are technically correct. It also requires gold-standard examples, which are expensive to produce.

RLHF (Reinforcement Learning from Human Feedback) trains a reward model on human preference data (response A vs. response B - which is better?) and then uses that reward model to update the main model via reinforcement learning (typically PPO). OpenAI used RLHF to produce InstructGPT, and RLHF-style preference training is part of how GPT-4, Claude, and other frontier models are aligned. It allows the model to learn nuanced preferences that are difficult to express as explicit examples.

Limitation: RLHF is operationally complex - you need a separate reward model training pipeline and human labelers with clear guidelines, and PPO is unstable and hard to tune. This makes it impractical for most production fine-tuning projects.

DPO (Direct Preference Optimization) is a more recent technique (Rafailov et al., 2023) that achieves RLHF-like results without the separate reward model or reinforcement learning step. You provide preference pairs (chosen response vs. rejected response for the same input) and train directly on the preference signal using a classification-like objective. DPO is significantly simpler to implement than RLHF and often achieves comparable quality.
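To make the "classification-like objective" concrete, here is the DPO loss for a single preference pair, written in plain Python. It is a sketch for intuition, not training code: real implementations compute this over batches of tensors, and the function name, argument names, and default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under the
    policy being trained or under the frozen reference model.
    loss = -log(sigmoid(beta * (chosen_margin - rejected_margin)))
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy equals the reference model the loss is log 2. It falls as the policy raises the chosen response's probability relative to the rejected one, measured against the reference.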

When to use each:

  • SFT: default choice for any new fine-tuning project with good labeled examples
  • DPO: use when your task has meaningful quality variation between responses and you can collect preference labels (human or LLM-judged)
  • RLHF: only if you are training foundation-level alignment and have significant infrastructure investment capacity - almost never the right choice for product teams

Q5: How do you manage the production lifecycle of a fine-tuned model, including retraining and versioning?

Answer: Fine-tuned models have a lifecycle that most teams underplan for. The model you deploy today will gradually become wrong - production inputs will shift, your product will evolve, and the model's behavior will not keep pace. Managing this lifecycle requires explicit processes.

Versioning: Treat every fine-tuned model as a versioned artifact with a clear identifier (e.g., ft-email-v3-2024-11) that encodes the task, version, and training date. Store alongside it: the training data version, base model version, eval score at release, and the hyperparameters used. Never overwrite a deployed model in place - always deploy a new version and keep the old one available for rollback.

Drift monitoring: Track input distribution drift in production using statistical tests (Jensen-Shannon divergence or Maximum Mean Discrepancy on input embeddings). When the distribution of production inputs shifts significantly from the training distribution, schedule a retraining run. In practice, most product teams retrain quarterly regardless, supplementing with drift signals to catch unexpected shifts early.
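As a simplified illustration of the drift check, the sketch below computes Jensen-Shannon divergence between two categorical histograms, for example the share of requests per ticket category or per embedding cluster. Bucketing inputs into categories is an assumption of this sketch; `js_divergence` and `histogram` are hypothetical helper names, and `labels` is assumed to be non-empty.

```python
import math
from collections import Counter


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Returns a value in [0, 1]: 0 for identical distributions, 1 for
    distributions with no overlap.
    """
    def kl(a: list[float], b: list[float]) -> float:
        # Terms with a zero probability in `a` contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def histogram(labels: list[str], categories: list[str]) -> list[float]:
    """Normalized frequency of each category in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in categories]
```

Compare the histogram of the training inputs against a recent window of production inputs and alert when the divergence crosses a threshold calibrated on your own historical data.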

Retraining triggers:

  1. Time-based: retrain every 90 days with recent production data
  2. Drift-based: retrain when input distribution shifts beyond a threshold
  3. Performance-based: retrain when an automated eval score drops below a threshold
  4. Event-based: retrain when a major product change creates new input patterns
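The four triggers can be combined into one check that runs on a schedule. This is a minimal sketch under stated assumptions: `retraining_reasons` is a hypothetical function, and the drift and eval thresholds are placeholder defaults, not recommended values.

```python
from datetime import date, timedelta


def retraining_reasons(
    last_trained: date,
    today: date,
    drift_score: float,
    eval_score: float,
    major_product_change: bool,
    max_age_days: int = 90,
    drift_threshold: float = 0.1,   # placeholder; calibrate on your own data
    min_eval_score: float = 0.8,    # placeholder; depends on your metric
) -> list[str]:
    """Return which of the four triggers currently call for a retraining run."""
    reasons = []
    if today - last_trained >= timedelta(days=max_age_days):
        reasons.append("time")
    if drift_score > drift_threshold:
        reasons.append("drift")
    if eval_score < min_eval_score:
        reasons.append("performance")
    if major_product_change:
        reasons.append("event")
    return reasons
```

Returning the reasons rather than a single boolean keeps a record of why each retraining run was scheduled.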

The golden set: Maintain a small, stable set of 100–200 examples (the "golden set") that you evaluate every fine-tuned model version against. This set never changes - it is your longitudinal benchmark that lets you compare models trained years apart. Without it, you cannot determine whether a new model is an improvement over models trained a year ago.

Data accumulation: Use production traffic (with appropriate privacy controls and consent) as training data for future versions. Build an annotation pipeline that routes model outputs to human reviewers when confidence is low or when the user provides negative feedback. These reviewed examples are your most valuable training data for the next version.


Summary

Fine-tuning is a pipeline problem. Every stage - task definition, data collection, quality filtering, baseline evaluation, training, evaluation, and deployment - can fail independently, and failures in early stages compound downstream. The team at the opening of this lesson failed at the data preparation stage: they had the right data source but included only the outputs, not the grounding context, creating a model that learned confident style without factual grounding.

The key principles from this lesson:

Decide carefully: Fine-tune only when prompt engineering has been maxed out, you have 1,000+ quality examples, and you have a quantitative evaluation. The RAG vs. fine-tune vs. prompting decision matrix exists because there is no universal answer - only trade-offs to reason through.

Data is everything: The quality of your training data determines the quality of your fine-tuned model. Build a quality scoring pipeline. Filter aggressively. Removing the 20% of examples that fail your filters often helps more than adding 1,000 new ones.

Measure before you train: Establish a baseline on your eval set before running a single training job. Without a baseline, you cannot know if the fine-tune helped.

Deploy gradually: A/B test at 5–10% traffic before full rollout. Monitor production metrics, not just offline eval scores. Keep the rollback path clear.

Understand the failure modes: Overfitting, catastrophic forgetting, distribution shift, and hallucination amplification are the four failure modes that appear most often in production. Test for each one explicitly before declaring a fine-tune production-ready.

Fine-tuning done well produces models that are faster, cheaper, and more consistent than prompt engineering alone and, on the right task type with grounded training data, more reliable on facts. Done poorly, it produces a model that is confidently wrong in ways that are difficult to detect and expensive to fix. The pipeline is the difference.

© 2026 EngineersOfAI. All rights reserved.