Synthetic Data and Self-Improvement
The Dataset Crisis Nobody Talks About
It is 11pm on a Thursday. Your team has spent three months negotiating a data licensing agreement with a major financial services firm. The contract finally landed - 50,000 rows of annotated support transcripts, every edge case a domain-specific assistant could need. Legal approved it last week. Then your security review flagged it: customer PII woven through 40% of the rows, consent forms that predate GDPR, and a clause that technically prohibits ML training use if the data is "transformed." The dataset is unusable.
Your fine-tuning timeline just evaporated. You have a base model, a clear task, and exactly zero training examples you can legally touch.
This scenario is not hypothetical. It plays out constantly across enterprise AI projects. Healthcare companies sit on millions of clinical notes they cannot train on without patient consent at a scale they cannot practically obtain. Legal firms have decades of case documents under attorney-client privilege. Fintechs have transaction histories that regulators would scrutinize if used for model training. The data exists - it is just locked behind walls that are not coming down any time soon.
The naive response is to hire annotators. But annotation at scale is brutally expensive: replacing those 50,000 rows with expert-annotated examples would cost somewhere between $750K and $2.5M. For a startup trying to fine-tune a domain-specific assistant, this math does not work. For an enterprise project with a four-week deadline, it does not work either.
The better answer, discovered in a cascade of research papers between 2022 and 2024, is to let the language model generate its own training data. Not blindly - the raw output of a language model is noisy, repetitive, and full of failure modes that will poison your fine-tune. But with the right scaffolding - evolutionary instruction mutation, self-critique loops, constitutional filtering, and careful deduplication - synthetic data generation has become the dominant data strategy for instruction-tuned models. WizardLM-70B, Phi-2, Magicoder, and Orca2 are all products of synthetic data pipelines. This lesson shows you how to build one.
Why This Exists - The Human Annotation Bottleneck
What Came Before
Before synthetic data, instruction-tuned models were trained on human-curated datasets. InstructGPT (2022) used approximately 13,000 prompt-response pairs written and reviewed by OpenAI contractors. FLAN (2021) compiled 62 natural language processing benchmarks into a unified instruction format - but these were existing academic datasets reformatted, not novel instructions. The common thread was that humans either wrote or validated every training example.
This approach worked at the scale OpenAI and Google could afford. For everyone else, it was a dead end. The bottleneck was not compute - it was labeling throughput. You could rent 1,000 A100s, but you could not hire 1,000 expert annotators on short notice without sacrificing quality.
Why Human Annotation Failed to Scale
Human annotation has three failure modes at production scale:
Quality variance. Annotation quality depends on annotator expertise, attention, and the specificity of your guidelines. At scale, you get inconsistency. The annotator who wrote examples 1-500 interpreted "be helpful" differently from the one who wrote examples 3,000-3,500. Models trained on this data learn the noise as signal.
Coverage gaps. Human annotators write the examples that come to mind first. These tend to cluster around common cases. The tail - unusual phrasings, adversarial inputs, edge-case domains - gets underrepresented. Fine-tuned models then fail precisely on the cases where they most need to perform.
Cost and speed. A high-quality annotation project covering 100,000 instruction-response pairs takes months and hundreds of thousands of dollars. When your dataset requirements change mid-project (they always do), you cannot iterate quickly.
What Synthetic Data Solves
The insight behind Self-Instruct (Wang et al., 2022) was disarmingly simple: a capable language model already knows what good instructions look like. You can prompt it to generate new instructions, then filter the outputs for quality. The model becomes its own annotator.
This breaks the bottleneck. A single API call generates dozens of candidate instructions in seconds. A filtering pass removes duplicates and low-quality outputs. What used to take weeks of annotation time now takes hours of compute time. The cost drops by two to three orders of magnitude.
The catch - which the rest of this lesson is about - is that naive self-generation amplifies the model's existing biases. Simple instructions get overrepresented. Certain response styles dominate. Without evolutionary pressure, deduplication, and quality filtering, you end up with a large dataset that teaches the model nothing new.
Historical Context - How We Learned to Teach Models to Teach Themselves
Self-Instruct (Wang et al., December 2022)
Yizhong Wang and colleagues at the University of Washington published Self-Instruct in December 2022, just weeks after ChatGPT launched and changed everyone's priors about what instruction-tuned models could do. The paper's central claim was audacious: you could fine-tune GPT-3 using only GPT-3-generated data and a handful of human-written seed examples, and get performance close to InstructGPT-001 (trained on 13,000 human-curated examples) using almost zero human annotation.
The pipeline was four steps: (1) start with 175 hand-written seed tasks, (2) prompt the model to generate new instructions inspired by the seeds, (3) generate input-output pairs for each new instruction, (4) filter for quality using heuristics (length, ROUGE similarity to existing instructions). The paper showed that GPT-3 fine-tuned on this data gained 33% absolute over vanilla GPT-3 on the Super-NaturalInstructions benchmark, approaching InstructGPT's performance with almost no human annotation effort. This was the "aha moment" that launched an entire research direction.
Stanford Alpaca (March 2023)
Three months after Self-Instruct, the Stanford team applied the technique to LLaMA-7B using text-davinci-003 as the teacher model. They generated 52,000 instruction-response pairs for roughly $500 in API costs and released the Alpaca model and dataset. The project went viral because it demonstrated that you could create a capable instruction-following model for less than a thousand dollars, democratizing fine-tuning in a way that the research community had not seen before.
Evol-Instruct and WizardLM (April 2023)
Can Xu and colleagues at Microsoft Research identified the core weakness of Self-Instruct: the generated instructions cluster around easy tasks. The model generates what it is good at, which is mostly simple requests. Hard instructions - those requiring multi-step reasoning, domain expertise, or unusual constraint combinations - are underrepresented.
Their fix was Evol-Instruct: take existing instructions and "evolve" them to be harder through a set of mutation operations (add constraints, increase reasoning depth, add domain complexity, convert to code, etc.). WizardLM trained on this evolved dataset significantly outperformed Alpaca on complex instruction benchmarks, establishing that instruction complexity - not just quantity - was a critical dimension.
Constitutional AI (Anthropic, December 2022)
Separately, Anthropic developed Constitutional AI (CAI), which addressed a different problem: how do you ensure a model's self-generated outputs are safe and aligned without human raters labeling every example as harmful or not? Their solution was a "constitution" - a set of natural language principles ("be helpful, harmless, and honest") - that the model uses to critique and revise its own outputs. The RLHF stage is replaced (or augmented) by AI feedback from the constitutional principles. This is now the basis of Anthropic's production models and has been replicated in open-source training pipelines.
OSS-Instruct and Magicoder (November 2023)
Wei et al. at UIUC contributed one more key insight: grounding matters. When you generate coding instructions in free-form, the model produces toy examples that do not reflect real code complexity. OSS-Instruct seeded the generation process with real open-source code snippets from GitHub, then asked the model to generate programming problems inspired by those snippets. Magicoder trained on this data outperformed models trained on substantially more synthetic code data, demonstrating that high-quality seeds produce qualitatively better synthetic data.
Core Concepts
Self-Instruct - The Foundation
Self-Instruct is the base algorithm from which most other techniques derive. The formal definition: given a small set of seed tasks $S$ and a capable language model $M$, generate a large set of instruction-input-output triples $(I, x, y)$ where $y = M(I, x)$.
The generation loop operates as follows. At each step, sample $k$ tasks from the current pool (initially just the seeds), format them as in-context examples, and prompt $M$ to generate new instructions. Then, for each new instruction, prompt $M$ to generate a response. Apply filters: discard instructions with ROUGE-L similarity above a threshold to any existing instruction (deduplication), discard instructions shorter than a minimum length, discard instructions containing certain keywords (e.g., "image"). Add surviving pairs to the pool. Repeat.
The ROUGE-L filter is critical. Without it, the model generates near-duplicate instructions that superficially differ but teach the model nothing new. The standard threshold is $\tau = 0.7$ - instructions with ROUGE-L above 0.7 to any existing instruction are discarded.

$$\text{ROUGE-L}(a, b) = \frac{2 \cdot P_{lcs} \cdot R_{lcs}}{P_{lcs} + R_{lcs}}, \qquad P_{lcs} = \frac{\text{LCS}(a, b)}{|a|}, \qquad R_{lcs} = \frac{\text{LCS}(a, b)}{|b|}$$

where $\text{LCS}(a, b)$ is the length of the longest common subsequence of $a$ and $b$.
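As a concrete check, here is a minimal sketch using the rouge_score package (the same scorer the pipeline code below relies on). The filter flags a lightly reworded instruction but passes a genuinely different one:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

a = "Write a Python function that checks if a number is prime."
b = "Write a Python function to check whether a number is prime."  # near-duplicate
c = "Explain how TLS certificate validation works in a browser."   # genuinely new

# fmeasure combines LCS-based precision and recall
print(scorer.score(a, b)["rougeL"].fmeasure)  # high (above 0.7) -> discarded
print(scorer.score(a, c)["rougeL"].fmeasure)  # low -> kept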
Evol-Instruct - Adding Complexity Pressure
Evol-Instruct treats each instruction as an organism that can be mutated. The mutations fall into two categories:
In-depth evolution - makes the instruction harder along a single dimension:
- Add constraints (impose additional requirements or restrictions)
- Deepen (require more detailed or expert knowledge)
- Concretize (replace abstract concepts with specific examples)
- Increase reasoning steps (require multi-hop inference)
- Complicate input (add more variables, conditions, or edge cases)
In-breadth evolution - generates new instructions inspired by the current one but covering different ground:
- Create a new task inspired by the topic of the current instruction
- Mutate to a related domain
- Generate a more creative or unusual variant
For each original instruction $I$, the evolution step produces a mutated instruction $I'$ by prompting the teacher model with:
Rewrite the following instruction to make it more complex.
Original: {I}
Evolved (add constraints and increase reasoning depth):
The evolved instruction then gets a new response generated. The resulting dataset skews toward complexity in a way that pure self-generation does not.
Constitutional AI - Self-Critique Loops
Constitutional AI (CAI) operates in two phases: supervised learning from AI feedback (SL-CAI) and reinforcement learning from AI feedback (RLAIF - RLHF with AI preferences instead of human preferences).
In the SL-CAI phase, the model generates an initial response to a potentially harmful prompt, then critiques that response against each principle in the constitution, then revises the response to address the critique. The final revised response becomes a training example. This creates a dataset of (harmful-prompt, safe-response) pairs without requiring humans to label harm.
The constitutional principles look like:
- "Choose the response that is least likely to contain harmful, unethical, racist, sexist, or dangerous content."
- "Choose the response that is most supportive of long-term human wellbeing, even if it means refusing to answer."
- "Which response is most honest about what the AI knows and does not know?"
The critique-revision cycle can run multiple times. In practice, even one iteration significantly reduces harmful outputs compared to direct generation.
Rejection Sampling - Quality as a Filter
Rejection sampling is the simplest quality control mechanism: generate $k$ candidate responses to a given instruction, then keep only those that pass a quality filter $f$. The filter can be:
- A reward model score above threshold (if you have a reward model)
- A judge LLM rating above some threshold
- Heuristics (response length, format compliance, absence of refusals)
- Task-specific metrics (code execution success, factual accuracy against a knowledge base)
The expected number of accepted samples from $k$ candidates is $kp$, where $p$ is the acceptance probability. If your filter is strict (say $p = 0.1$), you need $1/p = 10$ candidates to get one accepted sample on average. The API cost scales with $k$, so there is a direct tradeoff between quality and cost.
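The cost arithmetic is worth making explicit. A small sketch - the helper and per-call prices below are illustrative, not real rates:
def rejection_sampling_cost(n_examples: int, acceptance_p: float,
                            cost_per_generation: float,
                            cost_per_judge_call: float) -> float:
    """Expected API cost to collect n_examples accepted samples.

    Each accepted example requires ~1/p generations on average,
    and every candidate also needs one judge call.
    """
    candidates_needed = n_examples / acceptance_p
    return candidates_needed * (cost_per_generation + cost_per_judge_call)

# Hypothetical per-call costs: tighten the filter and cost scales as 1/p.
print(rejection_sampling_cost(2_000, 0.5, 0.002, 0.001))  # -> 12.0
print(rejection_sampling_cost(2_000, 0.1, 0.002, 0.001))  # -> 60.0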
Meta's Llama 2 paper reported using rejection sampling fine-tuning as a core component of their RLHF pipeline: generate multiple completions per prompt, use a reward model to score them, keep only the highest-scored completion as a training example. This "RS-FT" approach often outperforms RLHF with PPO at lower compute cost.
Self-Play and SPIN
Self-Play Fine-Tuning (SPIN, Chen et al., 2024) is a clever application of game theory to fine-tuning. The idea: at iteration $t+1$, the model being trained, $p_\theta$, plays against the previous checkpoint $p_{\theta_t}$. The current model acts as a discriminator trying to distinguish real human responses from $p_{\theta_t}$-generated responses. The training objective pushes $p_\theta$ to generate responses that are indistinguishable from human responses.
Formally, SPIN optimizes:

$$\mathcal{L}_{\text{SPIN}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D},\; y' \sim p_{\theta_t}(\cdot \mid x)} \left[ \log \sigma\!\left( \lambda \log \frac{p_\theta(y \mid x)}{p_{\theta_t}(y \mid x)} - \lambda \log \frac{p_\theta(y' \mid x)}{p_{\theta_t}(y' \mid x)} \right) \right]$$

where $y' \sim p_{\theta_t}(\cdot \mid x)$ is a synthetic response from the previous model and $y$ is the real human response.
This is equivalent to DPO where the "chosen" response is always a human response and the "rejected" response is always generated by the previous model version. No reward model is needed - the game structure provides the training signal.
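Because of this equivalence, one SPIN iteration reduces to ordinary DPO data construction. A minimal sketch - the generate helper for sampling from the previous checkpoint is an assumption, not shown here:
def build_spin_pairs(human_data: list[dict], previous_model, generate) -> list[dict]:
    """Construct DPO-style preference pairs for one SPIN iteration.

    chosen   = the real human response (always preferred)
    rejected = the previous checkpoint's response to the same prompt
    """
    pairs = []
    for example in human_data:
        synthetic = generate(previous_model, example["prompt"])  # y' ~ p_{theta_t}
        pairs.append({
            "prompt": example["prompt"],
            "chosen": example["response"],  # y: human response
            "rejected": synthetic,          # y': self-generated negative
        })
    return pairs
The resulting pairs feed directly into any DPO trainer; each iteration regenerates the rejected side with the newly trained checkpoint, so the negatives get harder as the model improves.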
OSS-Instruct - Grounded Generation
The Magicoder technique is straightforward but powerful. Instead of generating code problems from abstract instructions, seed the generation with real code:
- Sample a random code snippet from an open-source corpus (GitHub, etc.)
- Prompt the teacher model: "Here is a code snippet: {s}. Write a programming problem inspired by this code."
- Generate a solution to the problem
- Filter for quality (code execution, tests passing)
The grounding ensures diversity and realism. Real code snippets span unusual APIs, obscure algorithms, non-standard patterns - all the things that toy generated examples miss. Magicoder-S trained on 75K OSS-Instruct examples outperformed WizardCoder trained on 110K Evol-Instruct examples on HumanEval and MBPP.
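A minimal sketch of the seeded generation step, reusing the Anthropic client pattern from the pipelines below (snippet sampling and execution-based filtering are assumed, not shown):
from anthropic import Anthropic

client = Anthropic()

def oss_instruct_example(code_snippet: str) -> dict:
    """Generate one grounded (problem, solution) pair from a real code snippet."""
    problem = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content":
            f"Here is a code snippet:\n\n{code_snippet}\n\n"
            "Write a self-contained programming problem inspired by this code. "
            "Do not reference the snippet directly."}],
    ).content[0].text
    solution = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content":
            f"Solve this programming problem:\n\n{problem}"}],
    ).content[0].text
    # In production, filter here: execute the solution against generated
    # tests and discard pairs that fail.
    return {"problem": problem, "solution": solution, "seed_snippet": code_snippet}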
Code Examples
Building a Self-Instruct Pipeline
import random
import re
from anthropic import Anthropic
from rouge_score import rouge_scorer
client = Anthropic()
scorer_obj = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
SEED_TASKS = [
{
"instruction": "Explain the difference between supervised and unsupervised learning.",
"response": "Supervised learning uses labeled training data where each example has an input and a desired output. The model learns a mapping from inputs to outputs by minimizing prediction error. Examples include classification (spam detection) and regression (house price prediction). Unsupervised learning works with unlabeled data, discovering structure without predefined outputs. Examples include clustering (customer segmentation) and dimensionality reduction (PCA). The key distinction is whether you have ground truth labels to guide learning."
},
{
"instruction": "Write a Python function that checks if a number is prime.",
"response": "def is_prime(n):\n if n < 2:\n return False\n if n == 2:\n return True\n if n % 2 == 0:\n return False\n for i in range(3, int(n**0.5) + 1, 2):\n if n % i == 0:\n return False\n return True"
},
{
"instruction": "What are the main causes of the 2008 financial crisis?",
"response": "The 2008 financial crisis had several interlocking causes. First, the housing bubble: banks issued subprime mortgages to borrowers with poor credit, fueled by the belief that housing prices would keep rising. Second, securitization: these risky mortgages were bundled into mortgage-backed securities (MBS) and collateralized debt obligations (CDOs), spreading risk through the financial system while obscuring it. Third, excessive leverage: banks held thin capital cushions against these risky assets. Fourth, regulatory failure: deregulation allowed shadow banking to grow outside traditional oversight. When housing prices fell in 2007, defaults cascaded through the securitization chain, triggering the crisis."
}
]
def generate_new_instructions(seed_pool: list[dict], n_to_generate: int = 8) -> list[str]:
"""Generate new instructions using in-context examples from the seed pool."""
# Sample k examples from the pool
k = min(8, len(seed_pool))
examples = random.sample(seed_pool, k)
# Format them as in-context demonstrations
demonstrations = "\n\n".join([
f"Task {i+1}: {ex['instruction']}"
for i, ex in enumerate(examples)
])
prompt = f"""Here are some example tasks:
{demonstrations}
Generate {n_to_generate} new tasks that are diverse in topic, format, and difficulty.
Each task should be on a new line starting with "Task: ".
Do NOT repeat the example tasks. Generate entirely new ones covering different topics."""
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=2048,
messages=[{"role": "user", "content": prompt}]
)
text = response.content[0].text
# Parse out the generated instructions
instructions = []
for line in text.split("\n"):
line = line.strip()
if line.startswith("Task:"):
instruction = line[5:].strip()
if len(instruction) > 10: # minimum length filter
instructions.append(instruction)
return instructions
def generate_response(instruction: str) -> str:
"""Generate a response for a given instruction."""
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=1024,
messages=[{
"role": "user",
"content": instruction
}]
)
return response.content[0].text
def is_too_similar(new_instruction: str, existing_instructions: list[str], threshold: float = 0.7) -> bool:
"""Check if a new instruction is too similar to any existing one using ROUGE-L."""
for existing in existing_instructions:
scores = scorer_obj.score(new_instruction, existing)
if scores["rougeL"].fmeasure > threshold:
return True
return False
def quality_filter(instruction: str, response: str) -> bool:
"""Basic quality filters for instruction-response pairs."""
# Length filters
if len(instruction.split()) < 3:
return False
if len(response.split()) < 10:
return False
if len(response.split()) > 500:
return False
# Avoid instructions that reference images/visuals (model can't see them)
bad_keywords = ["image", "picture", "photo", "figure", "chart", "diagram", "table"]
if any(kw in instruction.lower() for kw in bad_keywords):
return False
# Avoid empty or refusal responses
refusal_phrases = ["i cannot", "i'm unable", "i don't have access", "as an ai"]
if any(phrase in response.lower() for phrase in refusal_phrases):
return False
return True
def run_self_instruct_pipeline(
seed_tasks: list[dict],
target_size: int = 500,
n_per_round: int = 8,
dedup_threshold: float = 0.7
) -> list[dict]:
"""Run the full Self-Instruct pipeline."""
pool = list(seed_tasks)
existing_instructions = [t["instruction"] for t in pool]
generated_dataset = []
print(f"Starting with {len(seed_tasks)} seed tasks. Target: {target_size} examples.")
round_num = 0
while len(generated_dataset) < target_size:
round_num += 1
print(f"Round {round_num}: Pool size={len(pool)}, Dataset size={len(generated_dataset)}")
# Step 1: Generate new instruction candidates
new_instructions = generate_new_instructions(pool, n_to_generate=n_per_round)
accepted = 0
for instruction in new_instructions:
# Step 2: Deduplication filter
if is_too_similar(instruction, existing_instructions, threshold=dedup_threshold):
continue
# Step 3: Generate response
response = generate_response(instruction)
# Step 4: Quality filter
if not quality_filter(instruction, response):
continue
# Step 5: Add to pool and dataset
new_example = {"instruction": instruction, "response": response}
pool.append(new_example)
existing_instructions.append(instruction)
generated_dataset.append(new_example)
accepted += 1
if len(generated_dataset) >= target_size:
break
print(f" Accepted {accepted}/{len(new_instructions)} new instructions this round.")
if accepted == 0:
print("No instructions accepted this round. Stopping to avoid infinite loop.")
break
print(f"Done. Generated {len(generated_dataset)} examples.")
return generated_dataset
Evol-Instruct: Evolutionary Instruction Mutation
import random
import re
from anthropic import Anthropic
client = Anthropic()
EVOLUTION_OPERATIONS = [
"add_constraints",
"deepen_complexity",
"concretize",
"increase_reasoning_steps",
"breadth_mutation"
]
EVOLUTION_PROMPTS = {
"add_constraints": """Rewrite the following instruction by adding 2-3 additional constraints or requirements that make it more challenging. Keep the core task the same but add restrictions.
Original instruction: {instruction}
Evolved instruction (with added constraints):""",
"deepen_complexity": """Rewrite the following instruction to require significantly more expert knowledge or technical depth to answer correctly. The evolved version should require deeper understanding of the subject matter.
Original instruction: {instruction}
Evolved instruction (requiring expert depth):""",
"concretize": """Rewrite the following instruction to replace abstract or vague concepts with specific, concrete scenarios, numbers, or examples. Make it a specific instance of the general case.
Original instruction: {instruction}
Evolved instruction (more concrete and specific):""",
"increase_reasoning_steps": """Rewrite the following instruction to require multi-step reasoning, analysis, or problem-solving. The evolved version should require the solver to work through multiple intermediate steps.
Original instruction: {instruction}
Evolved instruction (requiring multi-step reasoning):""",
"breadth_mutation": """Create a new instruction that is inspired by the topic and format of the following instruction but covers a related but different aspect of the subject matter.
Original instruction: {instruction}
New instruction (related but different):"""
}
def evolve_instruction(
instruction: str,
operation: str | None = None
) -> str:
"""Evolve a single instruction using one of the mutation operations."""
if operation is None:
operation = random.choice(EVOLUTION_OPERATIONS)
prompt_template = EVOLUTION_PROMPTS[operation]
prompt = prompt_template.format(instruction=instruction)
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=512,
messages=[{"role": "user", "content": prompt}]
)
evolved = response.content[0].text.strip()
# Remove any preamble the model might add
for prefix in ["Evolved instruction:", "New instruction:", "Rewritten:"]:
if evolved.startswith(prefix):
evolved = evolved[len(prefix):].strip()
return evolved
def generate_response_for_evolved(instruction: str) -> str:
"""Generate a high-quality response for an evolved instruction."""
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=1024,
messages=[{
"role": "user",
"content": f"Please provide a thorough and accurate response to the following:\n\n{instruction}"
}]
)
return response.content[0].text
def estimate_instruction_complexity(instruction: str) -> float:
"""Estimate instruction complexity on a 1-10 scale using LLM-as-judge."""
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=64,
messages=[{
"role": "user",
"content": f"""Rate the complexity of the following instruction on a scale of 1-10.
1 = trivial (hello world, basic facts), 10 = expert-level (PhD-level analysis, complex multi-step reasoning).
Respond with only a number.
Instruction: {instruction}
Complexity score:"""
}]
)
text = response.content[0].text.strip()
try:
score = float(re.search(r'\d+\.?\d*', text).group())
return min(max(score, 1.0), 10.0)
except (AttributeError, ValueError):
return 5.0 # default if parsing fails
def run_evol_instruct_pipeline(
seed_instructions: list[str],
n_evolution_rounds: int = 3,
min_complexity: float = 4.0
) -> list[dict]:
"""Run the Evol-Instruct pipeline."""
current_pool = list(seed_instructions)
all_examples = []
for round_num in range(n_evolution_rounds):
print(f"\nEvolution round {round_num + 1}/{n_evolution_rounds}")
next_pool = []
for i, instruction in enumerate(current_pool):
# Apply a random evolution operation
operation = random.choice(EVOLUTION_OPERATIONS)
evolved = evolve_instruction(instruction, operation)
# Check complexity
complexity = estimate_instruction_complexity(evolved)
if complexity >= min_complexity:
response = generate_response_for_evolved(evolved)
example = {
"instruction": evolved,
"response": response,
"source_instruction": instruction,
"operation": operation,
"complexity_score": complexity,
"evolution_round": round_num + 1
}
all_examples.append(example)
next_pool.append(evolved)
print(f" [{i+1}/{len(current_pool)}] Complexity: {complexity:.1f} | Op: {operation}")
else:
print(f" [{i+1}/{len(current_pool)}] REJECTED (complexity {complexity:.1f} < {min_complexity})")
current_pool = next_pool
print(f"Pool carried to next round: {len(current_pool)} instructions")
return all_examples
Rejection Sampling with LLM-as-Judge
import re
from dataclasses import dataclass
from anthropic import Anthropic
client = Anthropic()
@dataclass
class ScoredResponse:
response: str
score: float
reasoning: str
def judge_response(instruction: str, response: str) -> ScoredResponse:
"""Use LLM-as-judge to score a response on 1-10 scale."""
judge_prompt = f"""You are a strict quality evaluator for AI assistant responses.
Instruction: {instruction}
Response to evaluate:
{response}
Evaluate this response on a scale of 1-10 based on:
- Accuracy: Is the information correct?
- Completeness: Does it fully answer the instruction?
- Clarity: Is it well-organized and easy to understand?
- Helpfulness: Would a user find this genuinely useful?
Respond in this exact format:
SCORE: [number 1-10]
REASONING: [one sentence explaining your score]"""
result = client.messages.create(
model="claude-haiku-4-5",
max_tokens=256,
messages=[{"role": "user", "content": judge_prompt}]
)
text = result.content[0].text
# Parse score
score_match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", text)
score = float(score_match.group(1)) if score_match else 5.0
# Parse reasoning
reasoning_match = re.search(r"REASONING:\s*(.+)", text, re.DOTALL)
reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
return ScoredResponse(response=response, score=score, reasoning=reasoning)
def rejection_sampling_ft(
instructions: list[str],
k_candidates: int = 5,
min_score: float = 7.0
) -> list[dict]:
"""
Rejection sampling fine-tuning: generate k candidates per instruction,
keep only those scoring above min_score.
"""
dataset = []
for i, instruction in enumerate(instructions):
print(f"Processing {i+1}/{len(instructions)}: {instruction[:60]}...")
# Generate k candidate responses
candidates = []
for _ in range(k_candidates):
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=1024,
messages=[{"role": "user", "content": instruction}]
)
candidates.append(response.content[0].text)
# Score all candidates
scored = [judge_response(instruction, c) for c in candidates]
scored.sort(key=lambda x: x.score, reverse=True)
# Keep best response if it meets threshold
best = scored[0]
print(f" Best score: {best.score:.1f}/10 | Reasoning: {best.reasoning[:80]}")
if best.score >= min_score:
dataset.append({
"instruction": instruction,
"response": best.response,
"judge_score": best.score,
"judge_reasoning": best.reasoning,
"n_candidates_evaluated": k_candidates
})
print(f" ACCEPTED (score {best.score} >= threshold {min_score})")
else:
print(f" REJECTED (best score {best.score} < threshold {min_score})")
acceptance_rate = len(dataset) / len(instructions) * 100
print(f"\nAccepted {len(dataset)}/{len(instructions)} examples ({acceptance_rate:.1f}%)")
return dataset
Constitutional AI - Critique and Revision Loop
import random
from anthropic import Anthropic
client = Anthropic()
CONSTITUTION = [
"Choose the response that is most helpful while avoiding any harmful, dangerous, or unethical content.",
"Choose the response that is most honest - it should not make up facts, should acknowledge uncertainty, and should not mislead the user.",
"Choose the response that best respects user privacy and does not encourage sharing of personal information unnecessarily.",
"Choose the response that is most supportive of human autonomy - it should help users make their own informed decisions rather than being paternalistic.",
]
def generate_initial_response(instruction: str) -> str:
"""Generate an initial response that may be unfiltered."""
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=1024,
messages=[{"role": "user", "content": instruction}]
)
return response.content[0].text
def critique_response(instruction: str, response: str, principle: str) -> str:
"""Critique a response against a constitutional principle."""
critique_prompt = f"""Please read the following instruction and response, then identify any ways the response violates or could better align with this principle:
Principle: {principle}
Instruction: {instruction}
Response: {response}
Critique (identify specific issues, or say "No issues found" if the response already aligns well):"""
result = client.messages.create(
model="claude-haiku-4-5",
max_tokens=512,
messages=[{"role": "user", "content": critique_prompt}]
)
return result.content[0].text
def revise_response(instruction: str, response: str, critique: str, principle: str) -> str:
"""Revise a response based on a critique."""
revision_prompt = f"""Please revise the following response to better align with this principle: {principle}
Instruction: {instruction}
Original response: {response}
Issues identified: {critique}
Revised response (address the issues while remaining helpful):"""
result = client.messages.create(
model="claude-haiku-4-5",
max_tokens=1024,
messages=[{"role": "user", "content": revision_prompt}]
)
return result.content[0].text
def constitutional_ai_pipeline(
instructions: list[str],
n_revision_rounds: int = 2
) -> list[dict]:
"""Run the Constitutional AI self-critique pipeline."""
dataset = []
for instruction in instructions:
print(f"\nInstruction: {instruction[:80]}...")
# Step 1: Initial response
current_response = generate_initial_response(instruction)
# Step 2: Critique and revise for each principle
for round_num in range(n_revision_rounds):
# Sample a random principle to critique against
principle = random.choice(CONSTITUTION)
critique = critique_response(instruction, current_response, principle)
if "no issues found" not in critique.lower():
revised = revise_response(instruction, current_response, critique, principle)
print(f" Round {round_num+1}: Revised based on principle: '{principle[:60]}...'")
current_response = revised
else:
print(f" Round {round_num+1}: No issues found for selected principle")
dataset.append({
"instruction": instruction,
"response": current_response,
"n_revision_rounds": n_revision_rounds
})
return dataset
Production Engineering Notes
Degeneration: The Silent Killer of Synthetic Pipelines
After 10,000+ examples, Self-Instruct pipelines without strong deduplication converge to a narrow distribution. The ROUGE-L filter catches near-exact duplicates, but semantic duplicates (different wording, same concept) slip through. After generation completes, run a second deduplication pass using embedding similarity:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
def semantic_dedup(instructions: list[str], threshold: float = 0.85) -> list[str]:
"""Remove semantically duplicate instructions using embedding cosine similarity."""
embeddings = model.encode(instructions, show_progress_bar=True)
keep = [0] # always keep the first
for i in range(1, len(embeddings)):
kept_embeddings = embeddings[keep]
similarities = cosine_similarity([embeddings[i]], kept_embeddings)[0]
if similarities.max() < threshold:
keep.append(i)
print(f"Semantic dedup: {len(instructions)} -> {len(keep)} instructions")
return [instructions[i] for i in keep]
Diversity Metrics - Measuring What You Cannot See
A synthetic dataset can look large while being informationally small. Track these metrics throughout generation:
Vocabulary diversity: unique unigrams / total tokens. Values below 0.15 suggest over-repetition.
Task type distribution: classify each instruction by type (factual QA, instruction following, creative writing, code, reasoning). A healthy dataset has roughly even distribution. If 70% of your examples are factual QA, your model will overspecialize.
Length histogram: plot response length distribution. A sharp spike at any length (e.g., 95% of responses are 100-150 tokens) suggests the teacher model has a strong length prior that is bleeding into your data.
N-gram frequency analysis: if any 4-gram appears in more than 0.1% of responses, you have a style template problem. Common offenders: "As an AI language model," "I hope this helps," "Great question!"
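These checks are cheap to automate. A sketch of the first and last metrics - vocabulary diversity and repeated 4-grams - using only the standard library:
from collections import Counter

def vocabulary_diversity(responses: list[str]) -> float:
    """Unique unigrams / total tokens. Below ~0.15 suggests over-repetition."""
    tokens = [tok for r in responses for tok in r.lower().split()]
    return len(set(tokens)) / max(len(tokens), 1)

def frequent_four_grams(responses: list[str], max_share: float = 0.001) -> list[tuple[str, float]]:
    """Return 4-grams appearing in more than max_share of responses."""
    doc_freq: Counter = Counter()
    for r in responses:
        toks = r.lower().split()
        # count each 4-gram once per response (document frequency)
        doc_freq.update({" ".join(toks[i:i + 4]) for i in range(len(toks) - 3)})
    n = max(len(responses), 1)
    return [(g, c / n) for g, c in doc_freq.most_common(50) if c / n > max_share]
Anything that surfaces in frequent_four_grams is a candidate style template to strip or regenerate.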
Teacher Model Selection
The quality of your synthetic data is bounded by your teacher model. Using GPT-4o or Claude Opus as the teacher produces meaningfully better data than using GPT-3.5 or Claude Haiku, but costs 10-20x more. The tradeoff depends on your target model size:
- Training a 7B model: Claude Haiku or GPT-3.5 Turbo teacher is usually sufficient
- Training a 13B-70B model: Claude Sonnet or GPT-4o-mini recommended
- Training a 70B+ model: Claude Opus or GPT-4o for best quality ceiling
A useful heuristic: the teacher model should be at least 5-10x larger (in parameter terms) than the student model for effective knowledge distillation. When teacher and student are similar in scale, the synthetic data lacks the quality margin needed for meaningful improvement.
API Cost Management
Synthetic data generation at scale requires API budget planning. Rough estimates at current (2025) pricing:
| Pipeline | Examples | Estimated Cost |
|---|---|---|
| Self-Instruct (Haiku teacher) | 10,000 | $8-15 |
| Evol-Instruct (Haiku, 3 rounds) | 5,000 | $20-40 |
| Rejection Sampling (k=5, Haiku) | 2,000 | $15-30 |
| Constitutional AI (2 rounds) | 5,000 | $30-60 |
Use batching where possible. Anthropic's Batch API provides a 50% discount for non-real-time requests. For large pipelines (100k+ examples), the engineering effort of wiring up batch submission pays for itself within the first 20,000 examples.
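For reference, a sketch of batch submission - the request shape follows Anthropic's Batch API documentation at the time of writing, so verify against current docs before relying on it (the instructions list comes from your pipeline):
from anthropic import Anthropic

client = Anthropic()

# Each request carries a custom_id so results can be matched back to instructions.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"instr-{i}",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": instruction}],
            },
        }
        for i, instruction in enumerate(instructions)
    ]
)
print(batch.id, batch.processing_status)  # poll until ended, then fetch results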
Format Consistency
Instruction-tuned models are sensitive to prompt format. If your synthetic data mixes multiple formats (some using "User:", some using "Human:", some with no prefix), the model learns a noisy mixture that hurts inference performance. Enforce a single format template at data generation time:
CHAT_TEMPLATE = """<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>"""
Apply the same template that your base model was pre-trained with. For Llama 2 models, use the [INST] format; Llama 3 uses its own header-token format. For Qwen models, use ChatML. Mixing templates within a dataset is one of the most common and damaging production mistakes.
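Rather than hand-rolling templates, you can render training examples with the tokenizer's built-in chat template, which guarantees a match with the base model's expected format. A sketch using Hugging Face transformers (the model name is an illustrative choice):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # illustrative

def render_example(instruction: str, response: str) -> str:
    """Render one training example with the model's own chat template."""
    messages = [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": response},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)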
Common Mistakes
:::danger Dataset Collapse from Weak Deduplication Running Self-Instruct without aggressive deduplication produces datasets where 30-40% of examples are near-duplicates. The model learns to excel at the overlapping patterns and degrades on everything else. Always run BOTH lexical deduplication (ROUGE-L threshold 0.7) AND semantic deduplication (embedding cosine similarity threshold 0.85). Do this before computing your dataset size - what looks like 50,000 examples may be 30,000 unique ones. :::
:::danger Teacher Model Hallucinations Become Ground Truth The teacher model generates incorrect facts, wrong code, and plausible-sounding nonsense. When these become training examples, the student model learns to hallucinate with confidence. Always run domain-specific quality checks: execute generated code, verify factual claims against a knowledge base, spot-check random examples manually. A 2% hallucination rate in your training data creates a model that hallucinates far more frequently. :::
:::warning Complexity Collapse in Evol-Instruct Evolution operations do not always increase complexity. "Add constraints" applied to an already-constrained instruction sometimes produces a simpler instruction that just has more words. Always score complexity before and after evolution and discard evolved instructions that are not demonstrably harder. The complexity estimation prompt is cheap (Claude Haiku, 64 max tokens) relative to the cost of training on useless examples. :::
:::warning Constitutional Revision Loops Removing Helpful Content Constitutional AI revision can make models over-cautious. If the constitution emphasizes safety but not helpfulness, every revision moves the response toward refusal. Balance your constitution with explicit helpfulness principles: "Choose the response that is most useful and substantive, even when discussing difficult topics." Run evals on helpfulness benchmarks (MT-Bench, AlpacaEval) after constitutional fine-tuning - safety-only constitutions routinely drop helpfulness scores by 5-10%. :::
:::warning Forgetting to Filter Refusals from Training Data If your teacher model refuses some instructions ("I cannot help with that"), those refusals end up in your dataset unless you filter them. Training on refusals teaches the student model to refuse similar instructions, creating a dataset that hobbles your model's capabilities. Filter any response containing: "I cannot," "I'm unable to," "I don't have the ability," "As an AI," "I apologize, but." :::
Interview Q&A
Q: What is the core insight behind Self-Instruct, and why does it work?
Self-Instruct exploits a property of large language models that was underappreciated before 2022: a model trained on internet text has implicitly learned what instructions look like, what responses look like, and what makes a good instruction-response pair - even if it was never explicitly trained on this. When you prompt a capable LLM to generate new instructions, it draws on this implicit knowledge. The model is not guessing - it has seen millions of examples of instructions and responses in its pre-training corpus (tutorials, documentation, forum posts, Stack Overflow). Self-Instruct just provides the scaffolding to extract and formalize this knowledge. It works because the model's prior over "what a good instruction looks like" is already quite good; you are not asking it to generalize far outside its training distribution.
Q: Evol-Instruct improved over Self-Instruct on complex benchmarks. Why does instruction complexity matter so much for fine-tuning?
Instruction complexity directly shapes the reasoning patterns the model learns. Simple instructions ("what is the capital of France?") can be answered by pattern-matching against memorized facts - no reasoning required. A model fine-tuned predominantly on simple instructions learns to retrieve, not reason. Complex instructions ("Given these three economic indicators, analyze whether raising interest rates is likely to reduce inflation in this specific scenario") require multi-step reasoning, concept synthesis, and nuanced judgment. Models fine-tuned on complex instructions learn the underlying reasoning process, not just the surface pattern. This generalizes to novel complex instructions at inference time. The practical implication: a dataset of 10,000 complex instructions often produces better models than a dataset of 100,000 simple ones.
Q: How does Constitutional AI compare to RLHF with human feedback? When would you use each?
Constitutional AI replaces human preference labels with AI-generated critiques against a set of principles. The tradeoff is cost vs. alignment precision. RLHF with human feedback is more expensive (requires a trained reward model built from human pairwise comparisons) but can capture nuanced human preferences that are hard to express as constitutional principles. CAI is cheaper and faster to iterate but depends on the quality of your constitution - poorly written principles produce systematically miscalibrated models. Use CAI when: you have a clear, articulable set of principles, you need fast iteration, or budget is constrained. Use RLHF when: alignment with specific human preferences is critical, the domain is subtle enough that principles are hard to write, or you have existing preference data. In practice, Anthropic's production models use both: CAI generates the initial supervised learning data, then RL from AI feedback (RLAIF) over constitution-guided preferences further refines the model.
Q: What is SPIN (Self-Play Fine-Tuning) and what problem does it solve that DPO doesn't?
Standard DPO requires explicit negative examples - responses that are worse than the chosen responses. Building a high-quality negative set is expensive: you need either human-labeled preference pairs or a reward model to score multiple candidates per prompt. SPIN sidesteps this by using the model's previous version as the source of negatives. At each iteration, the current model generates synthetic responses, which become the "rejected" examples in DPO training (with human responses as "chosen"). The key insight is that the current model's outputs are always imperfect relative to human responses, so they are always valid negatives. As training progresses and the model improves, the negatives get harder, creating a natural curriculum. The theoretical guarantee is that the game between the model and its previous self converges when the model cannot distinguish its own outputs from human outputs - at which point it has learned to match the human response distribution.
Q: You have a 100K synthetic dataset but your fine-tuned model is not improving much beyond the base model. What are the most likely causes?
Three main candidates: First, dataset collapse - the 100K examples may be 20K unique examples after deduplication, not enough diversity to teach the model new behaviors. Run semantic deduplication and check the effective unique example count. Second, format mismatch - if the synthetic data format does not match the chat template used during pre-training, the model is spending capacity learning to handle the format rather than the content. Third, difficulty distribution - if most examples are easy (complexity score below 4/10), the model already knows how to handle them from pre-training. Check your complexity distribution; if it peaks at low complexity, the model is not learning anything new. A fourth, less common cause: teacher model ceiling - if you used a weak teacher (GPT-3.5 or similar) and your base model is a strong 70B, the teacher's responses are not much better than what the base model would generate anyway. There is no quality gradient for the student to learn from.
Q: How do you handle the problem of OSS-Instruct style grounding for domains where you don't have public code or text to sample from?
The grounding technique generalizes beyond code. The principle is: seed generation from real examples in your domain rather than generating in free-form. For legal tasks, seed from real (anonymized) contract clauses. For medical tasks, seed from PubMed abstracts. For financial tasks, seed from 10-K filing excerpts. For customer support, seed from product documentation. The key is that the seed material should be representative of the real distribution the model will encounter at inference time - diverse, realistic, and varied in style and complexity. When real examples are truly unavailable (new product, novel domain), generate synthetic seeds first with different generation parameters, then use those seeds for instruction generation. This two-stage seeding still outperforms un-seeded generation because it forces the generation process to anchor to specific concrete scenarios rather than abstract generalities.
