Training Data Preparation for Fine-Tuning

The Model That Learned to Gaslight

A team at a mid-sized healthcare startup spent six weeks fine-tuning Llama 3 8B on clinical notes. They had 80,000 examples. Validation loss was excellent. The model generated fluent, confident-sounding clinical text. They shipped it internally.

Within two days, their medical advisor flagged something alarming. The model was occasionally confabulating lab values - hallucinating plausible-looking numbers that did not appear in the context. Not always. Not in a detectable pattern. Just often enough that it could not be trusted with real patient data.

The postmortem revealed the source: their training data had been scraped from a medical documentation platform where notes were sometimes pre-filled with placeholder values like "WBC 10.5, Hgb 12.3" as templates. The model had learned that "good clinical notes contain specific numbers" and was generating them even when the context did not support them. The data looked clean to a human reviewer. The model learned the wrong lesson anyway.

This story is not unusual. It is the most common failure mode in fine-tuning projects. The model architecture is fine. The LoRA rank is reasonable. The learning rate is tuned. But the data contains a subtle pattern that the model should not have learned, and now it has internalized it at billions of parameters. You cannot unlearn this without retraining.

Data preparation is not a preprocessing step you do once and forget. It is the highest-leverage part of the entire fine-tuning pipeline. A model trained on 1,000 carefully curated examples will outperform a model trained on 100,000 noisy examples almost every time. The community learned this the hard way over hundreds of Alpaca, Vicuna, WizardLM, and similar models released in 2023. The ones that worked were the ones with good data. The ones that did not were the ones that prioritized quantity over quality.

This lesson covers every step in the data pipeline: understanding chat templates, applying instruction masking, filtering for quality, generating synthetic data, and building a robust train/eval split. By the end, you will have a complete data pipeline that you can adapt for any fine-tuning task.

Why This Exists - The Problem Before Proper Data Pipelines

In 2022 and early 2023, the standard approach to fine-tuning was simple and wrong: take a base model, concatenate your input with your output with a separator token, and train on the whole sequence with cross-entropy loss on every token. This was how the original Alpaca dataset was used with LLaMA 1.

The problems with this approach were numerous:

Problem 1 - The model trains to predict the prompt. When you compute cross-entropy loss on both the instruction and the output tokens, the model gets gradient signal for predicting the instruction text. This is wasteful at best. For long instructions, it can dominate the loss and prevent the model from learning the response generation task properly.

Problem 2 - No chat format awareness. Base models were trained on raw text. Fine-tuning them required teaching them a specific turn-taking format. Without consistent delimiters and role markers, multi-turn conversations were impossible. The model had no way to know where user text ended and assistant text began.

Problem 3 - Template inconsistency. Every project used different formatting. Alpaca used ### Instruction: ... ### Response:. Vicuna used USER: ... ASSISTANT:. Open-Orca used system prompts. Mixing these formats produced models that sometimes followed one convention and sometimes another, depending on which format the prompt looked most similar to.

Problem 4 - No separation between pre-training and instruction-following knowledge. The model needed to learn when to switch from completion mode (what base models do) to instruction-following mode. This required a consistent signal in the training data.

The solution was chat templates - standardized formats that encode role, turn, and boundary information in a way that the tokenizer and model can both understand. When combined with instruction masking (computing loss only on assistant tokens), these templates solved all four problems simultaneously.

Historical Context - From Alpaca to ChatML to LLaMA 3

The evolution of chat templates tells the history of open-source instruction fine-tuning.

Alpaca (March 2023): Stanford released the Alpaca dataset - 52,000 instruction-following examples generated by text-davinci-003. The format was:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}

Simple, effective, but limited to single-turn exchanges. No system prompt support. No multi-turn conversations.

Vicuna (March 2023): UC Berkeley's Vicuna fine-tune used ShareGPT data - actual ChatGPT conversations shared publicly. This introduced multi-turn format:

USER: {message}
ASSISTANT: {response}
USER: {follow-up}
ASSISTANT: {response}

Better, but still no system prompt. Inconsistent handling of conversation boundaries.

ChatML (OpenAI, 2023): OpenAI released a chat markup language specification for their API. This became the most widely adopted format in the open-source community because it was clean, extensible, and explicitly handled system prompts and multiple roles:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
Paris.<|im_end|>

The <|im_start|> and <|im_end|> tokens are special tokens added to the vocabulary. The role (system, user, assistant) follows immediately after the start token.

LLaMA 3 (Meta, 2024): Meta's LLaMA 3 instruct models introduced a new format using built-in special tokens:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Paris.<|eot_id|>

The <|eot_id|> (end of turn) token serves as the conversation boundary. Header tokens mark role transitions.

The lesson: format matters. Using the wrong template for your base model is like training on the wrong language. Always use the template that matches your base model's training data.

Core Concepts - Chat Templates and the Tokenizer

How Tokenizers Encode Chat Templates

Modern tokenizer implementations include a chat_template field - a Jinja2 template string that defines how to format a list of messages into a single string that the tokenizer can process. This is stored directly in the tokenizer_config.json file.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Inspect the built-in chat template
print(tokenizer.chat_template)
# Outputs the Jinja2 template defining LLaMA 3 formatting

# Convert a list of messages to the formatted string + token IDs
messages = [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "What are the symptoms of type 2 diabetes?"},
    {"role": "assistant", "content": "Type 2 diabetes symptoms include..."},
]

# tokenize=False returns the formatted string
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)
print(formatted)

# tokenize=True returns input_ids directly
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=False,
    return_tensors="pt",
)

The add_generation_prompt=True flag adds the assistant header at the end without any response content. Use this during inference to prompt the model to continue with an assistant turn. During training, set it to False because you are providing the full response.

Instruction Masking with -100 Labels

The key insight of instruction masking: the model should only predict (and be trained on) the assistant's response tokens, not the instruction tokens. This is implemented by setting the label for all non-assistant tokens to -100.

In PyTorch's cross-entropy loss, label value -100 is ignored by convention:

$\mathcal{L} = -\sum_{t \in \text{assistant tokens}} \log P(y_t | y_{<t}, x)$

Without instruction masking, the gradient includes prediction error on the prompt tokens, which adds noise and wastes capacity. With masking, every gradient step is directly about learning to generate good responses.

import torch
from transformers import AutoTokenizer

def apply_chat_template_with_masking(
    tokenizer,
    messages,
    max_length: int = 2048,
):
    """
    Apply chat template and return input_ids with labels masked for non-assistant turns.

    Returns:
        input_ids: token IDs for the full conversation
        labels: same as input_ids but with -100 for non-assistant tokens
        attention_mask: 1 for real tokens, 0 for padding
    """
    # Tokenize the full conversation
    full_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )

    # Tokenize all turns individually to identify assistant token boundaries
    input_ids = tokenizer.encode(full_text, add_special_tokens=False)

    # Build labels: start with -100 everywhere (ignore everything)
    labels = [-100] * len(input_ids)

    # Re-tokenize each message to find where assistant responses are
    # We find the assistant tokens by building the conversation prefix-by-prefix
    current_pos = 0

    for i, message in enumerate(messages):
        if message["role"] != "assistant":
            # Build prefix up to and including this message
            prefix = tokenizer.apply_chat_template(
                messages[:i+1],
                tokenize=False,
                add_generation_prompt=False,
            )
            prefix_ids = tokenizer.encode(prefix, add_special_tokens=False)
            current_pos = len(prefix_ids)
        else:
            # This is an assistant message
            # Build prefix up to (not including) this message
            prefix = tokenizer.apply_chat_template(
                messages[:i],
                tokenize=False,
                add_generation_prompt=True,  # adds the assistant header
            )
            prefix_ids = tokenizer.encode(prefix, add_special_tokens=False)
            start = len(prefix_ids)

            # The assistant response ends at current_pos
            prefix_with_response = tokenizer.apply_chat_template(
                messages[:i+1],
                tokenize=False,
                add_generation_prompt=False,
            )
            prefix_response_ids = tokenizer.encode(
                prefix_with_response, add_special_tokens=False
            )
            end = len(prefix_response_ids)

            # Set labels for assistant tokens to the actual token IDs
            for j in range(start, min(end, len(labels))):
                labels[j] = input_ids[j]

            current_pos = end

    # Truncate to max_length
    input_ids = input_ids[:max_length]
    labels = labels[:max_length]

    # Pad to max_length
    pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id
    attention_mask = [1] * len(input_ids)
    padding_needed = max_length - len(input_ids)
    if padding_needed > 0:
        input_ids = input_ids + [pad_token_id] * padding_needed
        labels = labels + [-100] * padding_needed
        attention_mask = attention_mask + [0] * padding_needed

    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
    }

Dataset Formats - The Three Standards

Format 1: Alpaca (Single-Turn, Instruction-Output)

The Alpaca format is the simplest: each example has an instruction, an optional input context, and an output. Best for single-turn Q&A, summarization, extraction, and classification tasks.

[
  {
    "instruction": "Summarize the following medical note in one sentence.",
    "input": "Patient is a 67-year-old male presenting with chest pain...",
    "output": "67-year-old male with acute chest pain, possible NSTEMI, admitted for workup."
  },
  {
    "instruction": "Classify the sentiment of this customer review.",
    "input": "The product works exactly as described. Fast shipping too.",
    "output": "Positive"
  }
]

Alpaca format is well-supported by most fine-tuning libraries. When the input field is empty, the template usually omits it. The TRL library handles this automatically.

Format 2: ShareGPT (Multi-Turn Conversations)

ShareGPT format mirrors real chat interactions with multiple turns. Each example is a conversation with a list of turns.

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "I have a patient with elevated troponin but normal ECG. What are the differential diagnoses?"
      },
      {
        "from": "gpt",
        "value": "Elevated troponin with normal ECG can indicate several conditions..."
      },
      {
        "from": "human",
        "value": "What additional tests would you recommend first?"
      },
      {
        "from": "gpt",
        "value": "For immediate workup, I would prioritize: serial troponins at 3 and 6 hours..."
      }
    ]
  }
]

Note the field naming: ShareGPT uses from: "human" and from: "gpt". These get mapped to role: "user" and role: "assistant" when applying modern chat templates.

Format 3: OpenAI Messages (The Current Standard)

The OpenAI messages format is the most versatile and the de facto standard for modern fine-tuning. It directly matches the format used by apply_chat_template.

[
  {
    "messages": [
      {
        "role": "system",
        "content": "You are an expert cardiologist. Provide evidence-based clinical guidance."
      },
      {
        "role": "user",
        "content": "What is the mechanism of action of beta-blockers in heart failure?"
      },
      {
        "role": "assistant",
        "content": "Beta-blockers in heart failure work through several mechanisms..."
      }
    ]
  }
]

This format supports system prompts natively, handles multi-turn easily, and maps directly to tokenizer.apply_chat_template(). Use this format for any new fine-tuning project.

Data Quality - The Signals That Matter

Signal 1: Response Length Distribution

Good instruction fine-tuning data has diverse response lengths. If all responses are 2-3 sentences, the model learns to give short answers. If all responses are 10+ paragraphs, it learns to pad. Check the distribution:

import matplotlib.pyplot as plt
from datasets import load_dataset

def analyze_length_distribution(dataset, tokenizer, column="output"):
    lengths = []
    for example in dataset:
        text = example.get(column, "") or ""
        tokens = tokenizer.encode(text)
        lengths.append(len(tokens))

    import numpy as np
    print(f"Length statistics (in tokens):")
    print(f"  Mean: {np.mean(lengths):.0f}")
    print(f"  Median: {np.median(lengths):.0f}")
    print(f"  P10: {np.percentile(lengths, 10):.0f}")
    print(f"  P90: {np.percentile(lengths, 90):.0f}")
    print(f"  Max: {max(lengths)}")
    print(f"  Examples under 10 tokens: {sum(1 for l in lengths if l < 10)}")
    print(f"  Examples over 2048 tokens: {sum(1 for l in lengths if l > 2048)}")

    return lengths

Red flags:

More than 10% of responses under 10 tokens: likely truncated or low-quality examples
More than 5% of examples over your max sequence length: these get truncated during training, potentially cutting off the end of responses
Bimodal distribution: two clusters of short and long responses usually indicate mixed data quality levels

Signal 2: Deduplication

Exact and near-duplicate examples in your training set cause several problems:

The model over-learns those specific patterns (effectively higher weight on duplicate examples)
Benchmark contamination: if benchmark test questions appear in your training set, your evaluation metrics are invalid
Data leakage: train and eval split contamination inflates eval metrics

from datasets import Dataset
import hashlib
from typing import List


def deduplicate_exact(dataset: Dataset, column: str = "output") -> Dataset:
    """Remove exact duplicate responses."""
    seen_hashes = set()
    keep_indices = []

    for i, example in enumerate(dataset):
        text = example.get(column, "")
        h = hashlib.md5(text.encode()).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            keep_indices.append(i)

    original_size = len(dataset)
    deduplicated = dataset.select(keep_indices)
    print(f"Deduplication: {original_size} -> {len(deduplicated)} "
          f"({original_size - len(deduplicated)} duplicates removed)")
    return deduplicated


def deduplicate_near_duplicates(
    examples: List[str],
    threshold: float = 0.85,
) -> List[int]:
    """
    Remove near-duplicates using MinHash LSH.
    Returns indices of examples to keep.
    Requires: pip install datasketch
    """
    from datasketch import MinHash, MinHashLSH

    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    minhashes = []

    # Create MinHash for each example
    for i, text in enumerate(examples):
        mh = MinHash(num_perm=128)
        for word in text.lower().split():
            mh.update(word.encode("utf-8"))
        minhashes.append(mh)

    # Find duplicates
    keep = []
    removed = set()

    for i, mh in enumerate(minhashes):
        if i in removed:
            continue
        keep.append(i)
        # Query for similar items
        try:
            lsh.insert(str(i), mh)
        except ValueError:
            pass  # already inserted

        # Check for near-duplicates among remaining
        similar = lsh.query(mh)
        for j_str in similar:
            j = int(j_str)
            if j != i and j not in removed:
                removed.add(j)

    print(f"Near-dedup (threshold={threshold}): kept {len(keep)}/{len(examples)}")
    return keep

Signal 3: Quality Filtering

Quality filtering removes examples that will hurt your model's behavior. The most important filters:

def quality_filter(example: dict, config: dict = None) -> bool:
    """
    Return True if example should be kept, False if it should be dropped.
    """
    config = config or {}

    instruction = example.get("instruction", "") or ""
    output = example.get("output", "") or ""

    # 1. Minimum lengths
    if len(instruction.split()) < config.get("min_instruction_words", 3):
        return False
    if len(output.split()) < config.get("min_output_words", 5):
        return False

    # 2. Maximum lengths (very long examples truncate badly)
    if len(output.split()) > config.get("max_output_words", 2000):
        return False

    # 3. Refusal patterns - these teach the model to refuse inappropriately
    refusal_patterns = [
        "i cannot", "i can't", "i'm unable to", "as an ai",
        "as a language model", "i don't have the ability",
        "i apologize, but i cannot",
    ]
    output_lower = output.lower()
    if any(pattern in output_lower for pattern in refusal_patterns):
        # Only filter if the instruction is clearly benign
        # (avoid filtering legitimate safety refusals)
        if not any(word in instruction.lower()
                   for word in ["harmful", "illegal", "dangerous", "weapon"]):
            return False

    # 4. Repetition filter - repeated sentences indicate generation failure
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    if len(sentences) > 3:
        unique_sentences = set(sentences)
        if len(unique_sentences) / len(sentences) < 0.7:
            return False

    # 5. Code quality (if training for code tasks)
    if config.get("is_code_task", False):
        # Ensure code blocks are balanced
        if output.count("```") % 2 != 0:
            return False

    # 6. Encoding issues
    try:
        output.encode("utf-8").decode("utf-8")
        instruction.encode("utf-8").decode("utf-8")
    except UnicodeDecodeError:
        return False

    return True


def filter_dataset(dataset: Dataset, **filter_config) -> Dataset:
    """Apply quality filter to dataset."""
    original_size = len(dataset)
    filtered = dataset.filter(
        lambda ex: quality_filter(ex, filter_config),
        num_proc=4,
    )
    print(f"Quality filter: {original_size} -> {len(filtered)} "
          f"({original_size - len(filtered)} removed, "
          f"{100*(original_size-len(filtered))/original_size:.1f}%)")
    return filtered

Data Pipeline Architecture

Data Quantity vs Quality - The Empirical Picture

The most persistent myth in fine-tuning: more data is always better. The evidence says otherwise.

The landmark paper that shifted community thinking was "LIMA: Less Is More for Alignment" (Zhou et al., 2023). The paper fine-tuned LLaMA 65B on exactly 1,000 carefully curated instruction-following examples and showed it matched or outperformed models trained on 50,000+ examples from standard datasets.

The results were striking:

Model	Training Examples	Win Rate vs GPT-4 (human eval)
Alpaca	52,000	5%
Vicuna	70,000	22%
WizardLM	70,000	15%
LIMA	1,000	43%

The LIMA model was not better because it had less data. It was better because the 1,000 examples were handpicked to be high quality, diverse, and free from noise. The larger datasets contained a mix of good examples and bad examples, and the model learned from both equally.

The practical implication: before you spend time collecting 100,000 examples, spend time curating 2,000 excellent ones. The marginal value of your 10,000th example is much lower than the marginal value of replacing your 100 worst examples with 100 better ones.

The Quality-Quantity Tradeoff Framework

$\text{effective training signal} \approx N_{quality} \times q_{avg}$

where $N_{quality}$ is the number of high-quality examples and $q_{avg}$ is the average quality score (conceptually). Adding noisy examples increases $N$ but decreases $q_{avg}$ . The product can go up or down depending on the noise level.

Rule of thumb from practitioner experience:

Start with quality, not quantity
1,000 to 5,000 high-quality examples is enough to learn most task formats and behaviors
10,000 to 50,000 is enough for moderate domain knowledge injection
Over 100,000 is warranted for comprehensive domain coverage or multi-task fine-tuning
Always hold back 5-10% for evaluation before touching the training data

Synthetic Data Generation

When you do not have enough real data, synthetic data from strong frontier models (GPT-4, Claude Opus) is often the best option. The key is generating data that covers your task space and is hard enough to be informative.

The Evol-Instruct Approach

WizardLM (Xu et al., 2023) introduced Evol-Instruct: start with simple seed instructions, then systematically evolve them into more complex variants using a meta-prompt. This produces a wide distribution of difficulty levels from a small seed set.

import anthropic
import json
from typing import List


EVOLUTION_PROMPTS = {
    "add_constraints": """Take the following instruction and rewrite it to be more specific
    by adding constraints or requirements. Make it harder to answer, but still reasonable.

    Original: {instruction}

    Rewritten (add 2-3 constraints):""",

    "increase_complexity": """Take the following instruction and rewrite it to require
    multi-step reasoning. The answer should require at least 3 logical steps.

    Original: {instruction}

    Rewritten (requires multi-step reasoning):""",

    "domain_specific": """Take the following instruction and rewrite it to be domain-specific
    for the {domain} field. Use appropriate technical terminology.

    Original: {instruction}

    Rewritten (domain-specific):""",

    "code_requirement": """Take the following instruction and rewrite it to require
    a code implementation as part of the answer.

    Original: {instruction}

    Rewritten (requires code):""",
}


def generate_synthetic_example(
    client: anthropic.Anthropic,
    instruction: str,
    evolution_type: str = "add_constraints",
    domain: str = "software engineering",
    model: str = "claude-opus-4-6",
) -> dict:
    """Generate one synthetic training example via Evol-Instruct."""

    # Step 1: Evolve the instruction
    evolution_prompt = EVOLUTION_PROMPTS[evolution_type].format(
        instruction=instruction,
        domain=domain,
    )

    evolved_msg = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": evolution_prompt}]
    )
    evolved_instruction = evolved_msg.content[0].text.strip()

    # Step 2: Generate a high-quality response to the evolved instruction
    response_msg = client.messages.create(
        model=model,
        max_tokens=2048,
        system="You are a highly knowledgeable expert. Provide a thorough, accurate, "
               "well-structured response. Include concrete examples where relevant.",
        messages=[{"role": "user", "content": evolved_instruction}]
    )
    response = response_msg.content[0].text.strip()

    return {
        "messages": [
            {"role": "user", "content": evolved_instruction},
            {"role": "assistant", "content": response},
        ],
        "evolution_type": evolution_type,
        "original_instruction": instruction,
    }


def generate_dataset(
    seed_instructions: List[str],
    client: anthropic.Anthropic,
    examples_per_seed: int = 4,
    output_file: str = "synthetic_dataset.jsonl",
) -> List[dict]:
    """Generate synthetic training data from seed instructions."""

    evolution_types = list(EVOLUTION_PROMPTS.keys())
    dataset = []

    for i, instruction in enumerate(seed_instructions):
        print(f"Processing seed {i+1}/{len(seed_instructions)}: {instruction[:60]}...")

        for j in range(examples_per_seed):
            evolution_type = evolution_types[j % len(evolution_types)]

            try:
                example = generate_synthetic_example(
                    client=client,
                    instruction=instruction,
                    evolution_type=evolution_type,
                )
                dataset.append(example)

                # Save incrementally to avoid losing progress
                with open(output_file, "a") as f:
                    f.write(json.dumps(example) + "\n")

            except Exception as e:
                print(f"  Failed on evolution {evolution_type}: {e}")
                continue

    print(f"\nGenerated {len(dataset)} examples from {len(seed_instructions)} seeds")
    return dataset

Quality Validation for Synthetic Data

Synthetic data from LLMs can contain its own subtle issues: the generator model's biases, its refusal patterns, and its training data's artifacts. Always validate synthetic data before including it in training.

def validate_synthetic_example(
    example: dict,
    min_response_words: int = 50,
    max_response_words: int = 1500,
) -> tuple[bool, str]:
    """
    Validate a synthetic example. Returns (is_valid, reason).
    """
    messages = example.get("messages", [])

    if len(messages) < 2:
        return False, "Too few messages"

    user_msg = next((m for m in messages if m["role"] == "user"), None)
    asst_msg = next((m for m in messages if m["role"] == "assistant"), None)

    if not user_msg or not asst_msg:
        return False, "Missing user or assistant message"

    instruction = user_msg["content"]
    response = asst_msg["content"]

    # Length checks
    response_words = len(response.split())
    if response_words < min_response_words:
        return False, f"Response too short ({response_words} words)"
    if response_words > max_response_words:
        return False, f"Response too long ({response_words} words)"

    # Check for LLM self-references (Claude/GPT mentioning themselves)
    bad_patterns = [
        "as claude", "as an ai assistant", "as a large language model",
        "i'm an ai", "i am an ai", "openai", "anthropic made me",
    ]
    response_lower = response.lower()
    for pattern in bad_patterns:
        if pattern in response_lower:
            return False, f"Contains LLM self-reference: '{pattern}'"

    # Check that response is actually answering the question
    # (very basic check: response should share some vocabulary with the instruction)
    instruction_words = set(instruction.lower().split())
    response_words_set = set(response.lower().split())
    overlap = len(instruction_words & response_words_set) / max(len(instruction_words), 1)
    if overlap < 0.05:
        return False, "Response shares almost no vocabulary with instruction"

    return True, "OK"

Benchmark Decontamination

If your training data contains examples from benchmark test sets, your evaluation numbers are meaningless. This is a serious problem - many datasets scraped from the web inadvertently include questions from MMLU, HellaSwag, GSM8K, and similar benchmarks.

import hashlib
from typing import Set


def load_benchmark_ngrams(benchmark_texts: List[str], n: int = 8) -> Set[str]:
    """
    Build a set of n-grams from benchmark test data.
    We use n=8 for loose matching (catches paraphrases) down to n=13 for exact matching.
    """
    ngrams = set()
    for text in benchmark_texts:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i+n])
            ngrams.add(gram)
    return ngrams


def is_contaminated(
    text: str,
    benchmark_ngrams: Set[str],
    n: int = 8,
    threshold: int = 1,
) -> bool:
    """
    Check if text contains n-grams from benchmark data.
    threshold: minimum number of matching n-grams to flag as contaminated.
    """
    words = text.lower().split()
    matches = 0
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i+n])
        if gram in benchmark_ngrams:
            matches += 1
            if matches >= threshold:
                return True
    return False


def decontaminate_dataset(
    dataset: "Dataset",
    benchmark_ngrams: Set[str],
    text_columns: List[str] = ("instruction", "output"),
    n: int = 8,
) -> "Dataset":
    """Remove benchmark-contaminated examples from dataset."""

    def is_clean(example):
        for col in text_columns:
            text = example.get(col, "") or ""
            if is_contaminated(text, benchmark_ngrams, n=n):
                return False
        return True

    original_size = len(dataset)
    clean_dataset = dataset.filter(is_clean, num_proc=4)
    removed = original_size - len(clean_dataset)
    print(f"Decontamination: removed {removed} examples ({100*removed/original_size:.1f}%)")
    return clean_dataset

The standard practice: download the test splits of MMLU, GSM8K, HumanEval, and HellaSwag (the most commonly used benchmarks), extract their 8-gram fingerprints, and filter any training example that matches.

The Complete Data Pipeline

Full Production Pipeline Code

"""
Complete data preparation pipeline for LoRA fine-tuning.
Handles ingestion, cleaning, formatting, and export.
"""

import json
import hashlib
from pathlib import Path
from typing import List, Dict, Any, Optional

from datasets import Dataset, DatasetDict, load_dataset
from transformers import AutoTokenizer


# ============================================================
# Step 1: Format Normalization
# ============================================================

def normalize_to_messages_format(
    examples: List[Dict[str, Any]],
    source_format: str = "alpaca",
    system_prompt: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """
    Convert any source format to the OpenAI messages format.
    source_format: "alpaca" | "sharegpt" | "messages" (already normalized)
    """
    normalized = []

    for ex in examples:
        messages = []

        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})

        if source_format == "alpaca":
            instruction = ex.get("instruction", "")
            context = ex.get("input", "")
            output = ex.get("output", "")

            if context:
                user_content = f"{instruction}\n\n{context}"
            else:
                user_content = instruction

            messages.append({"role": "user", "content": user_content})
            messages.append({"role": "assistant", "content": output})

        elif source_format == "sharegpt":
            for turn in ex.get("conversations", []):
                role_map = {
                    "human": "user",
                    "gpt": "assistant",
                    "system": "system",
                }
                role = role_map.get(turn.get("from", ""), turn.get("from", "user"))
                content = turn.get("value", "")
                messages.append({"role": role, "content": content})

        elif source_format == "messages":
            messages = ex.get("messages", [])
            if system_prompt and (not messages or messages[0]["role"] != "system"):
                messages = [{"role": "system", "content": system_prompt}] + messages

        normalized.append({"messages": messages})

    return normalized


# ============================================================
# Step 2: Tokenization with Instruction Masking
# ============================================================

def tokenize_with_masking(
    example: Dict[str, Any],
    tokenizer: AutoTokenizer,
    max_length: int = 2048,
) -> Dict[str, Any]:
    """
    Tokenize a messages example and apply instruction masking.
    Returns input_ids, labels (with -100 for non-assistant tokens), attention_mask.
    """
    messages = example["messages"]

    # Full tokenized conversation
    full_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=False,
    )

    labels = [-100] * len(full_ids)

    # Find assistant token spans
    for i, message in enumerate(messages):
        if message["role"] != "assistant":
            continue

        # Prefix up to (not including) this assistant turn, with generation prompt
        prefix_text = tokenizer.apply_chat_template(
            messages[:i],
            tokenize=False,
            add_generation_prompt=True,
        )
        prefix_ids = tokenizer.encode(prefix_text, add_special_tokens=False)
        start_idx = len(prefix_ids)

        # Prefix including this assistant turn
        full_prefix_text = tokenizer.apply_chat_template(
            messages[:i+1],
            tokenize=False,
            add_generation_prompt=False,
        )
        full_prefix_ids = tokenizer.encode(full_prefix_text, add_special_tokens=False)
        end_idx = len(full_prefix_ids)

        # Set labels for assistant tokens
        for j in range(start_idx, min(end_idx, len(labels))):
            labels[j] = full_ids[j]

    # Truncate
    full_ids = full_ids[:max_length]
    labels = labels[:max_length]
    attention_mask = [1] * len(full_ids)

    # Check that at least some labels are not -100
    # (skip examples where the assistant response was entirely truncated)
    if all(l == -100 for l in labels):
        return None  # will be filtered downstream

    return {
        "input_ids": full_ids,
        "labels": labels,
        "attention_mask": attention_mask,
    }


# ============================================================
# Step 3: Full Pipeline
# ============================================================

def build_training_dataset(
    data_files: List[str],
    model_id: str,
    source_format: str = "alpaca",
    system_prompt: Optional[str] = None,
    max_length: int = 2048,
    eval_fraction: float = 0.05,
    output_dir: str = "./processed_dataset",
    push_to_hub: Optional[str] = None,
) -> DatasetDict:
    """
    Complete data pipeline: load, normalize, clean, tokenize, split.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Step 1: Load data
    print("Loading data...")
    raw_examples = []
    for file_path in data_files:
        with open(file_path, "r") as f:
            if file_path.endswith(".jsonl"):
                for line in f:
                    raw_examples.append(json.loads(line.strip()))
            else:
                raw_examples.extend(json.load(f))
    print(f"Loaded {len(raw_examples)} raw examples")

    # Step 2: Normalize format
    print("Normalizing format...")
    normalized = normalize_to_messages_format(
        raw_examples, source_format=source_format, system_prompt=system_prompt
    )

    # Step 3: Exact deduplication on the first assistant response
    print("Deduplicating...")
    seen_hashes = set()
    deduped = []
    for ex in normalized:
        asst = next((m["content"] for m in ex["messages"] if m["role"] == "assistant"), "")
        h = hashlib.md5(asst.encode()).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            deduped.append(ex)
    print(f"After dedup: {len(deduped)} examples")

    # Step 4: Quality filter
    print("Quality filtering...")
    filtered = []
    for ex in deduped:
        asst = next((m["content"] for m in ex["messages"] if m["role"] == "assistant"), "")
        user = next((m["content"] for m in ex["messages"] if m["role"] == "user"), "")

        if len(asst.split()) < 5:
            continue
        if len(user.split()) < 3:
            continue
        if len(asst.split()) > 2000:
            continue

        filtered.append(ex)
    print(f"After quality filter: {len(filtered)} examples")

    # Step 5: Convert to HuggingFace Dataset and tokenize
    print("Tokenizing with instruction masking...")
    dataset = Dataset.from_list(filtered)

    def tokenize_fn(example):
        result = tokenize_with_masking(example, tokenizer, max_length=max_length)
        if result is None:
            return {"input_ids": [], "labels": [], "attention_mask": []}
        return result

    tokenized = dataset.map(
        tokenize_fn,
        remove_columns=["messages"],
        num_proc=4,
    )

    # Remove examples where assistant response was truncated
    tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 0)
    print(f"After tokenization: {len(tokenized)} examples")

    # Step 6: Train/eval split
    split = tokenized.train_test_split(
        test_size=eval_fraction,
        seed=42,
        shuffle=True,
    )
    dataset_dict = DatasetDict({
        "train": split["train"],
        "eval": split["test"],
    })

    print(f"\nFinal dataset:")
    print(f"  Train: {len(dataset_dict['train'])} examples")
    print(f"  Eval: {len(dataset_dict['eval'])} examples")

    # Step 7: Save
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    dataset_dict.save_to_disk(output_dir)
    print(f"Saved to {output_dir}")

    if push_to_hub:
        dataset_dict.push_to_hub(push_to_hub)
        print(f"Pushed to Hub: {push_to_hub}")

    return dataset_dict


# ============================================================
# Example Usage
# ============================================================

if __name__ == "__main__":
    dataset = build_training_dataset(
        data_files=["data/medical_qa.jsonl", "data/general_instruct.json"],
        model_id="meta-llama/Llama-3.1-8B-Instruct",
        source_format="alpaca",
        system_prompt="You are a helpful medical assistant. Provide accurate, "
                      "evidence-based information. Always recommend consulting a "
                      "healthcare professional for personal medical advice.",
        max_length=2048,
        eval_fraction=0.05,
        output_dir="./processed_medical_dataset",
    )

Production Engineering Notes

Dataset Storage and Versioning

Use HuggingFace Hub for dataset versioning. Every time you modify your training data, push a new version with a clear description. When your model performs differently than expected, you want to be able to reproduce the exact dataset that was used.

from datasets import DatasetDict

# Push with metadata
dataset_dict.push_to_hub(
    "your-org/dataset-name",
    commit_message="v2: added decontamination, increased min response length to 30 words",
)

Memory-Efficient Loading with Arrow Format

For large datasets (over 100,000 examples), load from disk using Arrow format rather than loading everything into memory at once. HuggingFace datasets use Apache Arrow under the hood, which supports zero-copy loading:

from datasets import load_from_disk

# Loads metadata only, actual data is memory-mapped
dataset = load_from_disk("./processed_dataset")

# Streaming mode for very large datasets
dataset = load_dataset("your-org/dataset-name", streaming=True)

Packing - Maximizing GPU Utilization

When examples vary widely in length, short examples leave most of the context window empty, wasting GPU compute. Packing combines multiple short examples into a single sequence up to max_length, separated by EOS tokens:

def pack_examples(
    dataset,
    max_length: int = 2048,
    eos_token_id: int = 2,
):
    """
    Pack multiple short examples into single sequences for efficient training.
    Uses a simple greedy bin-packing algorithm.
    """
    packed_input_ids = []
    packed_labels = []
    packed_attention_masks = []

    current_ids = []
    current_labels = []
    current_masks = []

    for example in dataset:
        ids = example["input_ids"]
        labels = example["labels"]
        masks = example["attention_mask"]

        # Add EOS between examples in a pack
        if current_ids and len(current_ids) + len(ids) + 1 <= max_length:
            current_ids.extend([eos_token_id] + ids)
            current_labels.extend([-100] + labels)
            current_masks.extend([1] + masks)
        elif not current_ids:
            current_ids = ids[:max_length]
            current_labels = labels[:max_length]
            current_masks = masks[:max_length]
        else:
            # Pad and store the current pack
            pad_len = max_length - len(current_ids)
            packed_input_ids.append(current_ids + [eos_token_id] * pad_len)
            packed_labels.append(current_labels + [-100] * pad_len)
            packed_attention_masks.append(current_masks + [0] * pad_len)

            # Start new pack
            current_ids = ids[:max_length]
            current_labels = labels[:max_length]
            current_masks = masks[:max_length]

    # Add the last pack
    if current_ids:
        pad_len = max_length - len(current_ids)
        packed_input_ids.append(current_ids + [eos_token_id] * pad_len)
        packed_labels.append(current_labels + [-100] * pad_len)
        packed_attention_masks.append(current_masks + [0] * pad_len)

    return Dataset.from_dict({
        "input_ids": packed_input_ids,
        "labels": packed_labels,
        "attention_mask": packed_attention_masks,
    })

Packing can reduce training time by 2-4x on datasets with short examples. Most modern training frameworks (TRL, Axolotl) have built-in packing support.

Common Mistakes

:::danger Using the Wrong Chat Template

If you fine-tune on data formatted with LLaMA 3's template and then try to run inference with a Mistral tokenizer, the model will not generate coherent responses. The special tokens that delimit turns are model-specific.

Symptoms:

Model repeats the prompt
Model generates from the wrong "role" (writes user-like text when prompted as assistant)
Model terminates too early or runs on indefinitely

Always verify by decoding your tokenized examples back to text and inspecting the output:

tokenizer = AutoTokenizer.from_pretrained(model_id)
example = dataset[0]
decoded = tokenizer.decode(example["input_ids"])
print(decoded)
# Verify the correct template tokens are present

:::

:::danger Training on Instruction Tokens (No Masking)

If labels are not masked for instruction tokens, the model trains to predict the prompt itself. This is inefficient and can cause the model to overfit to the instruction style while underperforming on the actual task.

Verify masking is working:

example = dataset[0]
input_ids = example["input_ids"]
labels = example["labels"]

# Count tokens with real labels (non -100)
real_label_count = sum(1 for l in labels if l != -100)
total_count = len(input_ids)
print(f"Masked tokens: {total_count - real_label_count}/{total_count}")
print(f"Training on: {real_label_count} tokens ({100*real_label_count/total_count:.1f}%)")

# Decode only the unmasked tokens to verify they are assistant responses
unmasked_ids = [iid for iid, lbl in zip(input_ids, labels) if lbl != -100]
print("Training target text:")
print(tokenizer.decode(unmasked_ids))

:::

:::warning Train/Eval Split Contamination

Using a random split without deduplication first means near-duplicate examples can appear in both train and eval. This inflates eval metrics and makes it impossible to detect overfitting.

Always deduplicate before splitting:

# Wrong order
dataset = raw_dataset.train_test_split(test_size=0.05)  # split first
dataset["train"] = deduplicate(dataset["train"])  # now train and eval may share similar examples

# Correct order
raw_dataset = deduplicate(raw_dataset)  # deduplicate first
dataset = raw_dataset.train_test_split(test_size=0.05)  # then split

:::

:::warning Truncation Cutting Off Assistant Responses

When you truncate at max_length, long conversations get cut off. If the truncation point falls in the middle of an assistant response, the model trains on a partial response with no EOS token - it learns to generate incomplete outputs.

Check for this:

def check_truncation_quality(dataset, tokenizer):
    """Check whether truncation is cutting assistant responses."""
    eos_id = tokenizer.eos_token_id
    cut_count = 0

    for example in dataset:
        ids = example["input_ids"]
        labels = example["labels"]

        # Check if the last real label (non -100) is followed by EOS
        last_real_idx = max((i for i, l in enumerate(labels) if l != -100), default=-1)
        if last_real_idx == -1:
            continue

        # If the sequence doesn't end with EOS after the last real label, it was truncated
        remaining = ids[last_real_idx:]
        if eos_id not in remaining:
            cut_count += 1

    print(f"Truncated assistant responses: {cut_count}/{len(dataset)} "
          f"({100*cut_count/len(dataset):.1f}%)")
    if cut_count / len(dataset) > 0.1:
        print("WARNING: Over 10% of examples have truncated assistant responses.")
        print("Consider increasing max_length or filtering very long examples.")

:::

Interview Q&A

Q1: What is instruction masking and why is it important?

A: Instruction masking sets the training labels for all non-assistant tokens to -100, which PyTorch's cross-entropy loss ignores. This means the model only receives gradient signal for the tokens in the assistant's response, not for the system prompt or user instruction.

Without masking, the model trains to predict both the instruction and the response. For a conversation like "Translate to French: 'Hello'" -> "Bonjour", the model trains equally on predicting "Translate to French: 'Hello'" (useless, the prompt is not what we want to generate) and on predicting "Bonjour" (what we actually want). The gradient signal from the instruction tokens adds noise and can cause the model to overfit to the instruction style rather than learning the response task. With masking, every training step is focused purely on learning to generate good responses.

Q2: What is the ChatML format and why has it become the community standard?

A: ChatML (Chat Markup Language) is a format developed by OpenAI that uses special tokens <|im_start|> and <|im_end|> to delimit conversation turns. Each turn begins with <|im_start|>{role} and ends with <|im_end|>. The format supports system prompts, multi-turn conversations, and multiple roles cleanly.

It became the community standard because it is explicit (no ambiguity about where turns begin and end), extensible (easy to add new roles), and widely implemented (every major fine-tuning library supports it). The LLaMA 3 format is a variant of the same idea using <|start_header_id|>, <|end_header_id|>, and <|eot_id|> tokens. The Mistral v0.3 format is also similar. All modern tokenizers have a built-in chat_template field that encodes the model-specific variant.

Q3: How do you decide between Alpaca format and ShareGPT format for a fine-tuning project?

A: Use Alpaca format for single-turn tasks where each example is one question and one answer: summarization, extraction, classification, formatting. Use ShareGPT format for tasks that require multi-turn dialogue: customer service bots, tutoring systems, anything where context from previous turns matters.

For a code assistant, most queries are self-contained (write me a function that...) - Alpaca format is fine. For a medical consultation assistant, context from earlier in the conversation matters ("given the symptoms you mentioned earlier...") - use ShareGPT or OpenAI messages format. The practical reason to prefer the OpenAI messages format over both is that it maps directly to apply_chat_template() and handles all cases, so you do not need format-specific preprocessing logic.

Q4: How many training examples do you need for effective instruction fine-tuning?

A: The LIMA paper (Zhou et al., 2023) showed that 1,000 carefully curated examples can be sufficient to teach instruction following to a 65B model. For smaller 7B-8B models, 2,000-5,000 high-quality examples is a reliable starting point for behavioral alignment. For domain knowledge injection, more examples are needed: 10,000-50,000 for moderate coverage, 100,000+ for comprehensive coverage.

The key variable is quality, not quantity. A dataset of 50,000 noisy examples will underperform a dataset of 5,000 carefully filtered examples. Invest in data quality before investing in data quantity. Use the following heuristic: if you randomly sample 20 examples and find even one that seems low-quality, your dataset needs more filtering. Good training data should have no obviously bad examples.

Q5: What is the risk of synthetic data generation and how do you mitigate it?

A: Synthetic data from frontier models carries three risks:

Model self-reference: the generator model (Claude, GPT-4) may mention itself, generating phrases like "as an AI assistant" that teach the fine-tuned model to also identify as a different AI.
Bias transfer: the generator model's biases, refusal patterns, and stylistic tendencies transfer into your training data. If GPT-4 tends to hedge heavily, your model will too.
Capability laundering: you cannot teach a model capabilities it does not have by fine-tuning on synthetic examples of those capabilities. The fine-tuned model will generate text that looks like it has the capability without actually having it - a particularly dangerous failure mode for factual domains.

Mitigate by: post-filtering for model self-references (regex or classifier), verifying factual claims in synthetic data with external sources, limiting synthetic data to behavioral patterns (format, style, structure) rather than factual knowledge, and always including a real data component even if it is small.

Q6: Why should you run deduplication before splitting into train and eval, not after?

A: If you split first and deduplicate second, near-duplicate examples can appear in both train and eval. For example, if your dataset has 10 similar variations of the same question, a random split will put some in train and some in eval. The model trains on the train variants and then evaluates on the nearly-identical eval variants - artificially inflating eval metrics.

More precisely: the eval set should measure the model's ability to generalize to data it has not seen. If eval examples are near-duplicates of train examples, you are not measuring generalization - you are measuring memorization. Always deduplicate the full dataset first, then split. The split should be the last operation in your pipeline before tokenization.

Summary

Data preparation is where fine-tuning projects succeed or fail. The technical choices that matter most:

Format: use the OpenAI messages format with apply_chat_template() for consistency across model families
Masking: always mask instruction tokens with -100 labels - compute loss only on assistant responses
Quality over quantity: 2,000 clean examples outperform 50,000 noisy examples consistently
Deduplication before splitting: near-duplicates across train and eval produce inflated metrics
Benchmark decontamination: a training set contaminated with benchmark test data renders all evaluation meaningless
Synthetic data: useful for behavioral patterns, risky for factual knowledge - always validate and filter

The data pipeline is not glamorous work. It is careful, methodical, and often tedious. It is also the highest-leverage work in the entire fine-tuning process. A model trained on excellent data will surprise you with its quality. A model trained on poor data will disappoint you regardless of how much time you spend tuning the LoRA rank.

The Model That Learned to Gaslight​

Why This Exists - The Problem Before Proper Data Pipelines​

Historical Context - From Alpaca to ChatML to LLaMA 3​

Core Concepts - Chat Templates and the Tokenizer​

How Tokenizers Encode Chat Templates​

Instruction Masking with -100 Labels​

Dataset Formats - The Three Standards​

Format 1: Alpaca (Single-Turn, Instruction-Output)​

Format 2: ShareGPT (Multi-Turn Conversations)​

Format 3: OpenAI Messages (The Current Standard)​

Data Quality - The Signals That Matter​

Signal 1: Response Length Distribution​

Signal 2: Deduplication​

Signal 3: Quality Filtering​

Data Pipeline Architecture​

Data Quantity vs Quality - The Empirical Picture​

The Quality-Quantity Tradeoff Framework​

Synthetic Data Generation​

The Evol-Instruct Approach​

Quality Validation for Synthetic Data​

Benchmark Decontamination​

The Complete Data Pipeline​

Full Production Pipeline Code​

Production Engineering Notes​

Dataset Storage and Versioning​

Memory-Efficient Loading with Arrow Format​

Packing - Maximizing GPU Utilization​

Common Mistakes​

Interview Q&A​

Q1: What is instruction masking and why is it important?​

Q2: What is the ChatML format and why has it become the community standard?​

Q3: How do you decide between Alpaca format and ShareGPT format for a fine-tuning project?​

Q4: How many training examples do you need for effective instruction fine-tuning?​

Q5: What is the risk of synthetic data generation and how do you mitigate it?​

Q6: Why should you run deduplication before splitting into train and eval, not after?​

Summary​