

LLM as Data Generator

The Annotation Bottleneck Nobody Talks About

It is early 2023. A fintech startup has built an impressive natural language interface for financial analysis. Users can ask questions in plain English and get portfolio breakdowns, risk assessments, and sector comparisons. The problem: when users ask unusual questions - edge cases involving derivatives, multi-leg options strategies, or tax-loss harvesting - the model fails. Not catastrophically, but with confident-sounding wrong answers that sometimes mislead users into bad decisions.

The team knows what the fix is: more training data covering complex financial reasoning. They have a working product with real users and real question logs. What they do not have is annotated answers to those complex questions - answers that require CFA-level knowledge to produce correctly. Hiring financial experts to annotate training data costs $150/hour. At 10 examples per hour, annotating 10,000 complex financial examples would cost $150,000 and take months.

Then an ML engineer on the team runs an experiment. She takes 200 complex financial questions and generates answers using Claude (claude-opus-4-6) with a detailed financial reasoning system prompt. She then pays a CFA to evaluate 50 of those answers against the rubric they'd use for human annotators. The CFA rates 87% of Claude's answers as "acceptable" or "excellent" - comparable to senior human annotators. The cost per answer: $0.15 instead of $15.

The cost reduction is 100x. The time reduction is immediate. She generates 10,000 complex financial training examples overnight at a cost of $1,500. The fine-tuned model stops hallucinating on complex financial edge cases. The annotation bottleneck that was going to require a $150,000 budget is solved for 1% of the cost.

This is the central premise of LLM-as-data-generator: use capable models to do the annotation work that was previously a human bottleneck, then use that data to train smaller, cheaper, specialized models for specific tasks.

The Teacher-Student Asymmetry

The economic logic behind LLM-as-data-generator rests on a fundamental asymmetry in inference costs.

The insight: it's much cheaper to run a small model at inference time than a large model. If you can transfer the large model's knowledge to a small model via synthetic training data, you get the best of both worlds - frontier-level task performance at small-model inference cost. You spend the frontier model budget once (data generation) instead of repeatedly (inference for every user request).
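The "spend once versus spend per request" argument can be made concrete with a break-even calculation. The sketch below is illustrative only: `break_even_requests` is a hypothetical helper, and every price except the $1,500 generation cost from the opening scenario is an assumed placeholder, not a measured figure.

```python
def break_even_requests(
    generation_cost: float,
    finetune_cost: float,
    large_cost_per_request: float,
    small_cost_per_request: float,
) -> float:
    """Requests needed before the one-time spend is paid back by cheaper inference."""
    saving_per_request = large_cost_per_request - small_cost_per_request
    if saving_per_request <= 0:
        raise ValueError("The small model must be cheaper per request.")
    return (generation_cost + finetune_cost) / saving_per_request


# All per-request prices and the fine-tuning cost are assumptions for illustration.
requests_needed = break_even_requests(
    generation_cost=1500.0,        # data generation cost from the scenario above
    finetune_cost=500.0,           # assumed one-time training cost
    large_cost_per_request=0.02,   # assumed frontier-model cost per request
    small_cost_per_request=0.001,  # assumed fine-tuned small-model cost per request
)
print(f"Break-even after roughly {requests_needed:,.0f} requests")
```

Below the break-even volume, calling the frontier model directly is cheaper; above it, the one-time data generation spend pays for itself.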

This is not just a cost argument. It's also a latency argument: a fine-tuned 7B model responding in 300ms versus a frontier model responding in 8 seconds changes what user experiences are possible. And it's a deployment argument: a fine-tuned model can run on smaller infrastructure, be deployed on-premises for data-sensitive applications, and be operated without an ongoing API dependency.

The Core Generation Pattern

The loop is important: you iterate. Your first generation run will reveal problems with your prompts - certain topics will be over-represented, certain response styles will leak through, certain difficulty levels will be missing. Each round of evaluation feeds back into improved generation prompts.
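One way to picture the loop is as a small driver that alternates generation, evaluation, and prompt revision. This is a minimal sketch under stated assumptions: `run_generation_rounds` and its three callables are hypothetical names, and what counts as an "issue" is left to whatever evaluation you plug in.

```python
from typing import Callable


def run_generation_rounds(
    generate: Callable[[str], list[dict]],
    evaluate: Callable[[list[dict]], list[str]],
    revise_prompt: Callable[[str, list[str]], str],
    initial_prompt: str,
    max_rounds: int = 3,
) -> tuple[list[dict], str]:
    """Generate, evaluate, and revise the prompt until no issues remain.

    Returns the last generated batch and the prompt that produced it.
    """
    prompt = initial_prompt
    batch: list[dict] = []
    for round_number in range(1, max_rounds + 1):
        batch = generate(prompt)
        issues = evaluate(batch)  # e.g. ["no advanced examples", "all answers are lists"]
        if not issues or round_number == max_rounds:
            break
        # Feed the evaluation findings back into the next round's prompt
        prompt = revise_prompt(prompt, issues)
    return batch, prompt
```

In practice `generate` would wrap an API call, `evaluate` would combine spot checks and automated scoring, and `revise_prompt` is often a human editing the prompt by hand.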

Generation Strategy 1: Direct Instruction Generation

The simplest approach: give the LLM a task description and ask it to generate diverse examples in one call.

import anthropic
import json
import re

client = anthropic.Anthropic()


def generate_instruction_batch(
    task_description: str,
    domain: str,
    num_examples: int = 20,
    temperature: float = 0.9,
) -> list[dict]:
    """
    Generate a batch of instruction-response pairs in a single API call.

    Args:
        task_description: What the model should learn to do
        domain: Domain context for the examples
        num_examples: How many examples to generate (max ~50 reliably)
        temperature: Higher = more diverse outputs (0.8-1.0 recommended)

    Returns:
        List of dicts with "instruction" and "response" keys
    """
    generation_prompt = f"""Generate {num_examples} diverse instruction-response pairs for training an AI model.

Task: {task_description}
Domain: {domain}

Requirements for diversity:
- Vary the complexity (simple lookup to multi-step reasoning)
- Vary the format (short direct answers, numbered steps, paragraphs)
- Vary the phrasing style (formal, casual, technical, conceptual)
- Include edge cases and unusual scenarios, not just typical ones
- Make each instruction genuinely distinct from the others

Quality requirements:
- Instructions should sound like real user questions
- Responses should be accurate, complete, and appropriately detailed
- Avoid generic opener phrases like "Certainly!" or "Great question!"
- Responses should be directly useful, not just informative

Format: Return a JSON array with objects having "instruction" and "response" fields.

[
  {{
    "instruction": "...",
    "response": "..."
  }}
]"""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8192,
        temperature=temperature,
        messages=[{"role": "user", "content": generation_prompt}]
    )

    text = response.content[0].text
    # Take everything from the first "[" to the last "]" so that surrounding
    # prose or markdown code fences around the array are ignored
    json_match = re.search(r'\[.*\]', text, re.DOTALL)
    if not json_match:
        return []

    try:
        examples = json.loads(json_match.group())
        # Validate structure
        return [
            ex for ex in examples
            if isinstance(ex, dict)
            and "instruction" in ex
            and "response" in ex
            and len(ex["instruction"]) > 10
            and len(ex["response"]) > 10
        ]
    except json.JSONDecodeError:
        return []


# Example usage
if __name__ == "__main__":
    examples = generate_instruction_batch(
        task_description="Answer questions about Python async/await programming with accurate, runnable code examples",
        domain="Python programming",
        num_examples=20,
        temperature=0.9
    )
    print(f"Generated {len(examples)} examples")
    for ex in examples[:3]:
        print(f"\nQ: {ex['instruction'][:100]}...")
        print(f"A: {ex['response'][:200]}...")

Generation Strategy 2: Seed-and-Expand

Start with a small set of human-written examples and expand them. This is the approach behind Self-Instruct (covered in Lesson 03). The key advantage: your human examples anchor the quality and style, while the LLM provides scale.

import anthropic
import json
import re

client = anthropic.Anthropic()

SEED_EXAMPLES = [
    {
        "instruction": "Summarize this earnings call transcript in 3 bullet points focusing on guidance, revenue, and key risks.",
        "response": (
            "• Revenue grew 23% YoY to $2.3B, driven by enterprise cloud segment (+41% YoY)\n"
            "• Q4 guidance raised 5% to $2.4-2.5B on strong pipeline momentum\n"
            "• Key risk: Macro headwinds affecting SMB segment; renewal rates down 3pp to 87%"
        )
    },
    {
        "instruction": "Extract all financial metrics with their YoY changes from this text.",
        "response": "Revenue: $2.3B (+23% YoY) | EBITDA: $450M, 19.6% margin (+2pp) | Free Cash Flow: $380M (+31%) | Net Revenue Retention: 118% (-4pp)"
    }
]


def expand_from_seeds(
    seeds: list[dict],
    num_variations_per_seed: int = 10,
    domain_context: str = "",
    include_complexity_progression: bool = True
) -> list[dict]:
    """
    Generate variations of seed examples with increasing complexity.

    The key is that seeds anchor quality and intent while the LLM
    provides diversity across topic subsets, phrasings, and difficulty levels.

    Args:
        seeds: Human-written seed examples to expand from
        num_variations_per_seed: How many variations to generate per seed
        domain_context: Additional domain guidance for the LLM
        include_complexity_progression: Whether to explicitly vary complexity

    Returns:
        List of generated instruction-response pairs
    """
    all_examples = []

    # Split the requested count into simpler / similar / harder buckets so the
    # guidance always adds up to num_variations_per_seed (3 / 4 / 3 for 10)
    n_simple = num_variations_per_seed // 3
    n_hard = num_variations_per_seed // 3
    n_similar = num_variations_per_seed - n_simple - n_hard

    for i, seed in enumerate(seeds):
        complexity_guidance = ""
        if include_complexity_progression:
            complexity_guidance = f"""
Vary complexity systematically:
- {n_simple} simpler variations (beginner-friendly, single concept)
- {n_similar} similar-complexity variations (different but comparable scenarios)
- {n_hard} harder variations (multi-step, edge cases, production considerations)"""

        expand_prompt = f"""Here is an example instruction-response pair:

Instruction: {seed['instruction']}
Response: {seed['response']}

{f'Domain context: {domain_context}' if domain_context else ''}

Generate {num_variations_per_seed} variations. Each variation should:
1. Cover a similar but distinct scenario in the same domain
2. Have a genuinely different instruction (not just a paraphrase)
3. Have an accurate, complete response appropriate to the instruction
{complexity_guidance}

Do NOT include the original example. Generate new, distinct ones.
Return as JSON array: [{{"instruction": "...", "response": "..."}}]"""

        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            temperature=0.9,
            messages=[{"role": "user", "content": expand_prompt}]
        )

        text = response.content[0].text
        json_match = re.search(r'\[.*\]', text, re.DOTALL)
        if json_match:
            try:
                # Keep only dict items so tagging below cannot fail
                variations = [
                    v for v in json.loads(json_match.group())
                    if isinstance(v, dict)
                ]
                # Tag with seed source for traceability
                for v in variations:
                    v["seed_source"] = i
                all_examples.extend(variations)
                print(f"Seed {i+1}: generated {len(variations)} variations")
            except json.JSONDecodeError:
                print(f"Seed {i+1}: JSON parse failed")

    return all_examples

Generation Strategy 3: Attribute-Controlled Generation

The most systematic approach: define an explicit attribute grid and generate examples for every combination. This guarantees coverage of all intended difficulty/topic/format combinations, rather than letting the LLM naturally cluster toward the center of the distribution.

import anthropic
import json
import re
from itertools import product

client = anthropic.Anthropic()


def generate_with_attribute_grid(
    task: str,
    attribute_grid: dict[str, list],
    examples_per_combo: int = 3,
    model: str = "claude-opus-4-6"
) -> list[dict]:
    """
    Generate examples covering a specified attribute grid systematically.

    This is the most reliable way to ensure dataset coverage. Instead of
    hoping the LLM will cover all combinations, you explicitly request
    each combination.

    Args:
        task: The task to generate examples for
        attribute_grid: Dict mapping attribute names to possible values
        examples_per_combo: How many examples per attribute combination
        model: Model to use for generation

    Returns:
        List of examples tagged with their attribute combination

    Example attribute_grid:
        {
            "difficulty": ["beginner", "intermediate", "advanced"],
            "topic": ["data structures", "algorithms", "async"],
            "response_format": ["code-only", "explanation+code", "step-by-step"],
        }
        → 27 combinations × 3 examples = 81 total examples
    """
    attr_names = list(attribute_grid.keys())
    attr_values = list(attribute_grid.values())
    combinations = list(product(*attr_values))

    all_examples = []
    total = len(combinations)

    print(f"Generating {total} attribute combinations × {examples_per_combo} examples = {total * examples_per_combo} total")

    for idx, combo in enumerate(combinations):
        attrs = dict(zip(attr_names, combo))
        attr_desc = "\n".join(f"- {k}: {v}" for k, v in attrs.items())

        prompt = f"""Generate {examples_per_combo} instruction-response pairs for: {task}

Required attributes for ALL examples in this batch:
{attr_desc}

Each example must clearly exhibit ALL of the listed attributes.
Vary the specific scenario within these constraints.
Return as JSON array: [{{"instruction": "...", "response": "..."}}]"""

        try:
            response = client.messages.create(
                model=model,
                max_tokens=3000,
                temperature=0.8,
                messages=[{"role": "user", "content": prompt}]
            )

            text = response.content[0].text
            json_match = re.search(r'\[.*\]', text, re.DOTALL)
            if json_match:
                examples = json.loads(json_match.group())
                for ex in examples:
                    ex["attributes"] = attrs
                all_examples.extend(examples)

        # Covers API errors as well as JSON parse failures
        except Exception as e:
            print(f"  Combo {idx+1}/{total} failed: {e}")
            continue

        if (idx + 1) % 10 == 0:
            print(f"  Progress: {idx+1}/{total} combinations completed")

    return all_examples


# Example: Python programming with full coverage
if __name__ == "__main__":
    examples = generate_with_attribute_grid(
        task="Python programming instruction-following",
        attribute_grid={
            "difficulty": ["beginner", "intermediate", "advanced"],
            "topic": ["data structures", "algorithms", "async programming", "error handling"],
            "response_format": ["explanation with code", "code only with comments", "step-by-step breakdown"],
        },
        examples_per_combo=2  # 36 combos × 2 = 72 targeted examples
    )
    print(f"\nGenerated {len(examples)} examples across all attribute combinations")
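Because failed calls and parse errors silently drop combinations, it can help to check coverage after a run. The following is a minimal sketch, assuming each example carries the `attributes` dict attached by the function above; `find_coverage_gaps` is a hypothetical helper name, not part of any library.

```python
from collections import Counter
from itertools import product


def find_coverage_gaps(
    examples: list[dict],
    attribute_grid: dict[str, list],
    expected_per_combo: int,
) -> dict[tuple, int]:
    """Return {combination: actual_count} for every under-filled combination."""
    attr_names = list(attribute_grid.keys())
    # Count examples per attribute combination, in grid key order
    counts = Counter(
        tuple(ex["attributes"].get(name) for name in attr_names)
        for ex in examples
        if isinstance(ex.get("attributes"), dict)
    )
    gaps = {}
    for combo in product(*attribute_grid.values()):
        if counts[combo] < expected_per_combo:
            gaps[combo] = counts[combo]
    return gaps
```

Any combination returned here can be re-requested in a follow-up pass instead of regenerating the whole grid.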

Diversity Maximization Techniques

Diversity is the most critical property of synthetic training data. Without it, the fine-tuned model overfits to the generator model's stylistic patterns - producing responses that technically resemble what you asked for but are all variations of the same surface pattern.

Temperature and Sampling

Temperature directly controls output diversity. For training data generation, use high temperature (0.8–1.0). For quality evaluation, use low temperature (0.0).

import anthropic
import random

client = anthropic.Anthropic()


def generate_diverse_outputs(
    prompt: str,
    n: int = 20,
    temperature_range: tuple[float, float] = (0.7, 1.0)
) -> list[str]:
    """
    Generate multiple diverse outputs using varied temperatures.

    Using a range rather than a fixed temperature ensures we sample
    from different parts of the probability distribution, producing
    outputs that vary in style, structure, and specificity.

    Args:
        prompt: Generation prompt
        n: Number of outputs to generate
        temperature_range: (min_temp, max_temp) to sample from

    Returns:
        List of generated text strings
    """
    outputs = []
    for i in range(n):
        # Sample temperature randomly within range
        temp = random.uniform(*temperature_range)

        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1024,
            temperature=temp,
            messages=[{"role": "user", "content": prompt}]
        )
        outputs.append(response.content[0].text)

    return outputs

Embedding-Based Semantic Deduplication

Surface-level deduplication (exact match, ROUGE-L) misses semantically equivalent instructions phrased differently. Embedding-based deduplication catches these:

import numpy as np


def semantic_deduplicate(
    examples: list[dict],
    similarity_threshold: float = 0.85,
    field: str = "instruction",
    batch_size: int = 128
) -> tuple[list[dict], list[dict]]:
    """
    Remove semantically duplicate examples using embedding similarity.

    Two instructions with cosine similarity >= threshold are considered
    duplicates; only the first one encountered is kept.

    Args:
        examples: List of example dicts
        similarity_threshold: Cosine similarity at or above which to mark as duplicate
        field: Which field to embed for comparison
        batch_size: Embedding batch size (tune based on memory)

    Returns:
        (kept_examples, removed_examples) tuple
    """
    try:
        from sentence_transformers import SentenceTransformer
    except ImportError:
        print("sentence-transformers not installed. Run: pip install sentence-transformers")
        return examples, []

    if not examples:
        return [], []

    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [ex[field] for ex in examples]

    print(f"Computing embeddings for {len(texts)} examples...")
    embeddings = model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        normalize_embeddings=True  # Enables dot product as cosine similarity
    )

    kept_indices = [0]
    kept_embeddings = [embeddings[0]]

    for i in range(1, len(examples)):
        # Compute similarity to all kept examples using vectorized dot product
        kept_matrix = np.array(kept_embeddings)
        similarities = np.dot(kept_matrix, embeddings[i])
        max_sim = similarities.max()

        if max_sim < similarity_threshold:
            kept_indices.append(i)
            kept_embeddings.append(embeddings[i])

    # Build the lookup set once rather than on every iteration
    kept_set = set(kept_indices)
    kept = [examples[i] for i in kept_indices]
    removed = [examples[i] for i in range(len(examples)) if i not in kept_set]

    print(f"Deduplication: {len(examples)} → {len(kept)} kept, {len(removed)} removed")
    print(f"(threshold={similarity_threshold})")

    return kept, removed
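To see the greedy selection step in isolation, the sketch below runs the same keep-first rule on hand-made 2D vectors instead of sentence embeddings. `greedy_dedup_indices` is a hypothetical helper written for this illustration; the toy vectors are invented and stand in for real embeddings.

```python
import numpy as np


def greedy_dedup_indices(embeddings: np.ndarray, threshold: float = 0.85) -> list[int]:
    """Keep an item only if it is below `threshold` similarity to every kept item."""
    # Normalize rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    kept: list[int] = []
    for i in range(len(unit)):
        # Compare against all kept rows at once
        if not kept or float(np.max(unit[kept] @ unit[i])) < threshold:
            kept.append(i)
    return kept


# Rows 0 and 1 point in nearly the same direction; row 2 is orthogonal to row 0.
toy = np.array([[1.0, 0.0], [0.99, 0.10], [0.0, 1.0]])
print(greedy_dedup_indices(toy))  # → [0, 2]
```

The result depends on input order: whichever near-duplicate appears first is the one that survives, so sorting by quality score before deduplicating is a reasonable design choice.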

Topic-Spread Generation

Even with high temperature, LLMs naturally cluster toward the center of their knowledge distribution. Explicit topic spreading forces coverage of the long tail:

import anthropic
import json
import re

client = anthropic.Anthropic()


def generate_topic_balanced_dataset(
    topics: list[str],
    task_template: str,
    examples_per_topic: int = 100,
    model: str = "claude-opus-4-6"
) -> list[dict]:
    """
    Generate examples balanced across topic areas.

    Without topic spreading, LLMs naturally generate more examples
    about common topics and fewer about specialized/rare ones.
    This function forces uniform coverage.

    Args:
        topics: List of topics to cover
        task_template: Template with {topic} placeholder
        examples_per_topic: How many examples per topic. All of them are
            requested in a single call, so large values may be cut off by
            max_tokens; compare the generated count against the target.
        model: Model to use for generation

    Returns:
        List of examples tagged with their topic
    """
    all_examples = []

    for topic in topics:
        task = task_template.format(topic=topic)

        prompt = f"""Generate {examples_per_topic} instruction-response training pairs.
Task: {task}
Topic focus: {topic}

Make ALL examples clearly related to {topic} specifically.
Vary difficulty and format across examples.
Return as JSON array: [{{"instruction": "...", "response": "..."}}]"""

        response = client.messages.create(
            model=model,
            max_tokens=8192,
            temperature=0.9,
            messages=[{"role": "user", "content": prompt}]
        )

        try:
            text = response.content[0].text
            json_match = re.search(r'\[.*\]', text, re.DOTALL)
            if json_match:
                topic_examples = json.loads(json_match.group())
                for ex in topic_examples:
                    ex["topic"] = topic
                all_examples.extend(topic_examples)
                print(f"Generated {len(topic_examples)} examples for topic: {topic}")
        except json.JSONDecodeError:
            print(f"Parse failed for topic: {topic}")

    return all_examples


# Domain-specific example
FINANCIAL_TOPICS = [
    "equity valuation methods",
    "fixed income and bond pricing",
    "options and derivatives pricing",
    "portfolio risk management",
    "financial statement analysis",
    "macroeconomic indicators and their market effects",
    "regulatory compliance and reporting",
    "algorithmic trading strategies",
]

if __name__ == "__main__":
    dataset = generate_topic_balanced_dataset(
        topics=FINANCIAL_TOPICS,
        task_template="Answer questions requiring {topic} expertise with precise financial reasoning",
        examples_per_topic=50
    )
    print(f"\nTotal dataset size: {len(dataset)} examples")
    print(f"Expected: {len(FINANCIAL_TOPICS) * 50} examples")

Quality vs. Quantity: The LIMA Lesson

LIMA (Less Is More for Alignment, 2023) changed the field's intuitions. The paper showed that a LLaMA 65B model fine-tuned on just 1,000 carefully curated examples was preferred by human raters over Alpaca 65B, which was trained on 52,000 automatically generated examples. This isn't an argument against large datasets - it's an argument that each example must earn its place.

The LLM-as-judge scoring pattern addresses this:

import anthropic
import json
import random
import re
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed

client = anthropic.Anthropic()

QUALITY_RUBRIC = """You are a training data quality evaluator. Score this instruction-response pair.

Task context: {task_description}

Instruction: {instruction}

Response: {response}

Rate each dimension 1-5:
1. accuracy: Is the response factually correct and technically sound?
2. helpfulness: Does it fully address the instruction?
3. naturalness: Does the instruction sound like something a real person would ask?
4. uniqueness: Is this a non-generic, non-formulaic example?
5. format_quality: Is the response well-formatted and appropriately detailed?

Respond with ONLY this JSON (no other text):
{{"accuracy": N, "helpfulness": N, "naturalness": N, "uniqueness": N, "format_quality": N, "overall": N}}

Where overall is your holistic assessment, not just the average."""


def score_example(
    example: dict,
    task_description: str,
    scorer_model: str = "claude-haiku-4-5-20251001"
) -> dict:
    """
    Score a single training example on quality dimensions.

    Uses claude-haiku (cheap) for high-throughput scoring.
    The overall score determines whether an example is included.

    Args:
        example: Dict with "instruction" and "response" keys
        task_description: Context for what the data should teach
        scorer_model: Model to use for scoring (cheap is fine here)

    Returns:
        Dict with dimension scores and overall score
    """
    prompt = QUALITY_RUBRIC.format(
        task_description=task_description,
        instruction=example.get("instruction", "")[:800],
        response=example.get("response", "")[:1200]
    )

    try:
        response = client.messages.create(
            model=scorer_model,
            max_tokens=200,
            temperature=0,  # Low temp for consistent scoring
            messages=[{"role": "user", "content": prompt}]
        )

        text = response.content[0].text.strip()
        json_match = re.search(r'\{[^}]+\}', text, re.DOTALL)
        if json_match:
            scores = json.loads(json_match.group())
            return scores
        else:
            return {"overall": 3.0, "parse_error": True}
    except Exception as e:
        return {"overall": 3.0, "error": str(e)}


def batch_score_and_filter(
    examples: list[dict],
    task_description: str,
    min_overall_score: float = 3.5,
    max_workers: int = 10,
    sample_rate: float = 1.0
) -> tuple[list[dict], list[dict]]:
    """
    Score examples in parallel and filter by minimum quality score.

    Args:
        examples: List of examples to evaluate
        task_description: What these examples should teach
        min_overall_score: Minimum overall score to keep (1-5 scale)
        max_workers: Parallel workers for API calls
        sample_rate: Score this fraction of examples (save cost by sampling)

    Returns:
        (accepted_examples, rejected_examples) tuple
    """
    # Optionally sample for cost control
    if sample_rate < 1.0:
        score_indices = set(random.sample(range(len(examples)), int(len(examples) * sample_rate)))
        unscored_examples = [ex for i, ex in enumerate(examples) if i not in score_indices]
        score_examples_list = [ex for i, ex in enumerate(examples) if i in score_indices]
    else:
        score_examples_list = examples
        unscored_examples = []

    print(f"Scoring {len(score_examples_list)} examples with {max_workers} workers...")

    scored_results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_example = {
            executor.submit(score_example, ex, task_description): ex
            for ex in score_examples_list
        }
        for future in as_completed(future_to_example):
            ex = future_to_example[future]
            try:
                scores = future.result()
                scored_results.append((ex, scores))
            except Exception as e:
                scored_results.append((ex, {"overall": 3.0, "error": str(e)}))

    # Filter scored examples
    accepted = []
    rejected = []
    scores_list = []

    for ex, scores in scored_results:
        overall = scores.get("overall", 0)
        scores_list.append(overall)
        ex["_quality_scores"] = scores
        if overall >= min_overall_score:
            accepted.append(ex)
        else:
            rejected.append(ex)

    # Unscored examples default to accepted (we didn't check them)
    accepted.extend(unscored_examples)

    print(f"\nQuality filter results:")
    if scores_list:
        print(f"  Score distribution: mean={np.mean(scores_list):.2f}, "
              f"p25={np.percentile(scores_list, 25):.2f}, "
              f"p75={np.percentile(scores_list, 75):.2f}")
    print(f"  Min threshold: {min_overall_score}")
    print(f"  Accepted: {len(accepted)} | Rejected: {len(rejected)}")

    return accepted, rejected

Large-Scale Async Generation Pipeline

For generating thousands of examples efficiently, you need async concurrency with rate limiting and incremental disk writes:

import asyncio
import anthropic
import json
import re
from datetime import datetime
from pathlib import Path
from typing import Optional


async def generate_single_async(
    client: anthropic.Anthropic,
    prompt: str,
    semaphore: asyncio.Semaphore,
    model: str = "claude-opus-4-6",
    max_tokens: int = 1024,
    temperature: float = 0.9
) -> Optional[dict]:
    """Generate a single example asynchronously with concurrency control."""
    async with semaphore:
        # The synchronous client blocks, so run each call in the default thread pool
        loop = asyncio.get_running_loop()
        try:
            response = await loop.run_in_executor(
                None,
                lambda: client.messages.create(
                    model=model,
                    max_tokens=max_tokens,
                    temperature=temperature,
                    messages=[{"role": "user", "content": prompt}]
                )
            )
            text = response.content[0].text
            json_match = re.search(r'\{.*\}', text, re.DOTALL)
            if json_match:
                return json.loads(json_match.group())
        except Exception as e:
            # Report the failure instead of dropping it silently
            print(f"Generation failed: {e}")
            return None
    return None


async def generate_large_dataset(
    task_description: str,
    total_examples: int = 10000,
    concurrency: int = 10,
    output_file: str = "synthetic_dataset.jsonl",
    model: str = "claude-opus-4-6"
) -> int:
    """
    Generate a large synthetic dataset with async concurrency.

    Streams results to disk incrementally in append mode, so an interrupted
    run keeps everything written so far. Resuming is not automatic: on
    restart, count the lines already in the file and lower total_examples
    accordingly, otherwise the run starts again from the first example.

    Args:
        task_description: What examples should teach
        total_examples: Target number of examples
        concurrency: Max concurrent API calls (stay within rate limits)
        output_file: Path to write JSONL output
        model: Model to use for generation

    Returns:
        Number of examples actually generated
    """
    client = anthropic.Anthropic()
    semaphore = asyncio.Semaphore(concurrency)
    generated_count = 0

    # Build diverse prompts to avoid clustering
    def make_prompt(i: int) -> str:
        # Rotate through different instruction styles
        styles = [
            "Return JSON with instruction and response fields. Make the instruction a question.",
            "Return JSON with instruction and response fields. Make the instruction a task ('Write a...', 'Implement...', 'Explain...')",
            "Return JSON with instruction and response fields. Make the instruction a debugging scenario.",
            "Return JSON with instruction and response fields. Make the instruction a comparison request.",
        ]
        style = styles[i % len(styles)]

        return f"""Generate one unique instruction-response training pair for: {task_description}

Style guidance: {style}
Position hint: example #{i+1} of {total_examples} - make it distinctly different from a typical example."""

    output_path = Path(output_file)

    with open(output_path, "a") as f:  # Append mode keeps earlier output on restart
        # Process in batches to manage memory
        batch_size = concurrency * 2
        for batch_start in range(0, total_examples, batch_size):
            batch_end = min(batch_start + batch_size, total_examples)
            batch_size_actual = batch_end - batch_start

            prompts = [make_prompt(batch_start + i) for i in range(batch_size_actual)]

            tasks = [
                generate_single_async(client, prompt, semaphore, model)
                for prompt in prompts
            ]

            results = await asyncio.gather(*tasks, return_exceptions=True)

            batch_accepted = 0
            for result in results:
                if isinstance(result, dict) and "instruction" in result and "response" in result:
                    result["generated_at"] = datetime.now().isoformat()
                    result["task"] = task_description[:100]
                    f.write(json.dumps(result) + "\n")
                    generated_count += 1
                    batch_accepted += 1

            print(f"Batch {batch_start//batch_size + 1}: "
                  f"{batch_accepted}/{batch_size_actual} accepted | "
                  f"Total: {generated_count}/{total_examples}")

    return generated_count


if __name__ == "__main__":
    count = asyncio.run(generate_large_dataset(
        task_description="Python programming assistance - questions ranging from syntax to system design",
        total_examples=5000,
        concurrency=10,
        output_file="python_dataset.jsonl"
    ))
    print(f"\nFinal count: {count} examples generated")
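Appending to the output file preserves earlier results, but the pipeline still needs to know how many examples are already on disk before it can skip them. One possible way to do that is sketched below; `count_existing_examples` is a hypothetical helper, and it assumes the JSONL layout written above (one object per line with "instruction" and "response" keys).

```python
import json
from pathlib import Path


def count_existing_examples(output_file: str) -> int:
    """Count well-formed examples already written to a JSONL file."""
    path = Path(output_file)
    if not path.exists():
        return 0

    count = 0
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Typically a partially written last line from an interrupted run
                continue
            if isinstance(record, dict) and "instruction" in record and "response" in record:
                count += 1
    return count
```

On restart, the remaining target would then be `total_examples - count_existing_examples(output_file)`, passed in as the new `total_examples`.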

Format Templates for Fine-Tuning Frameworks

Different fine-tuning frameworks expect different data formats. Always convert to the right format before training:

import json
from typing import Optional


def format_for_sft_chat(
    instruction: str,
    response: str,
    system_prompt: Optional[str] = None
) -> dict:
    """
    Format for Supervised Fine-Tuning in chat format.
    Compatible with: HuggingFace TRL, LLaMA-Factory, OpenAI fine-tuning.
    """
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": instruction})
    messages.append({"role": "assistant", "content": response})
    return {"messages": messages}


def format_for_alpaca(
    instruction: str,
    response: str,
    context: str = ""
) -> dict:
    """
    Format for Alpaca-style fine-tuning.
    Compatible with: LLaMA.cpp fine-tuning, many older frameworks.
    """
    return {
        "instruction": instruction,
        "input": context,
        "output": response,
    }


def format_for_dpo(
    instruction: str,
    chosen_response: str,
    rejected_response: str,
    system_prompt: Optional[str] = None
) -> dict:
    """
    Format for Direct Preference Optimization (DPO) training.
    Requires pairs of chosen (better) and rejected (worse) responses.
    """
    chosen_messages = []
    rejected_messages = []

    if system_prompt:
        chosen_messages.append({"role": "system", "content": system_prompt})
        rejected_messages.append({"role": "system", "content": system_prompt})

    chosen_messages.extend([
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": chosen_response}
    ])
    rejected_messages.extend([
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": rejected_response}
    ])

    return {
        "chosen": chosen_messages,
        "rejected": rejected_messages,
    }


def generate_preference_pair(
    instruction: str,
    client,
    strong_model: str = "claude-opus-4-6",
    weak_model: str = "claude-haiku-4-5-20251001"
) -> Optional[dict]:
    """
    Generate a DPO preference pair where the strong model response
    is 'chosen' and the weak model response is 'rejected'.

    This is a practical way to create preference data without
    human raters - using model quality as a proxy for human preference.

    Note: This approach has limitations (both models may be wrong,
    the quality gap may be too large or too small). Validate against
    human preferences before using at scale.
    """
    strong_response = client.messages.create(
        model=strong_model,
        max_tokens=1024,
        messages=[{"role": "user", "content": instruction}]
    ).content[0].text

    weak_response = client.messages.create(
        model=weak_model,
        max_tokens=1024,
        messages=[{"role": "user", "content": instruction}]
    ).content[0].text

    return format_for_dpo(
        instruction=instruction,
        chosen_response=strong_response,
        rejected_response=weak_response,
    )

Common Mistakes

:::danger Never use the same model to generate and evaluate
If you use Claude to generate data and Claude to score it, you're measuring Claude's self-agreement, not quality. A biased generator will fool a judge that shares its biases. Use a different model for scoring (e.g., generate with claude-opus-4-6, score with a different evaluator), or use rule-based checks, or use human spot-checks on a sample. Self-scoring is the most common mistake in synthetic data pipelines - it feels systematic but produces systematically biased quality gates.
:::
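Rule-based checks are the cheapest of those alternatives because they share none of the generator's biases. The sketch below shows what such checks might look like; `rule_based_issues`, its thresholds, and the opener list beyond "Certainly!" and "Great question!" are assumptions for illustration and should be tuned to the task.

```python
# Openers to reject. The first two come from the generation prompt earlier in
# this lesson; the others are assumed additions.
GENERIC_OPENERS = ("certainly!", "great question!", "sure!", "of course!")
CODE_FENCE = "`" * 3  # markdown code fence marker


def rule_based_issues(
    example: dict,
    min_response_chars: int = 40,    # assumed threshold
    max_response_chars: int = 6000,  # assumed threshold
) -> list[str]:
    """Return a list of rule violations; an empty list means the example passes."""
    instruction = str(example.get("instruction", "")).strip()
    response = str(example.get("response", "")).strip()

    issues = []
    if len(instruction) < 10:
        issues.append("instruction_too_short")
    if not (min_response_chars <= len(response) <= max_response_chars):
        issues.append("response_length_out_of_range")
    if response.lower().startswith(GENERIC_OPENERS):
        issues.append("generic_opener")
    if response.count(CODE_FENCE) % 2 == 1:
        # An odd number of fence markers usually means a truncated code block
        issues.append("unclosed_code_fence")
    return issues
```

Checks like these will not catch factual errors, so they complement rather than replace a separate judge model or human spot-checks.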

:::warning Low temperature does not mean high quality
temperature=0 generates deterministic, "safe" outputs - but in synthetic data generation, this creates near-identical examples across multiple calls. For training data, you want diversity. Use temperature=0.8–1.0 for generation. Reserve temperature=0 for the quality scoring step where consistency matters more than diversity.
:::

:::danger Synthetic data inherits and amplifies model biases
Claude and GPT-4 have biases toward certain response styles, formats, and perspectives. Models trained entirely on synthetic data often exhibit amplified versions of the generator's biases - more formal than human responses, more likely to add unnecessary caveats, more likely to use the same structural patterns. Always mix in some human-annotated data and evaluate trained models against real human preferences, not just against benchmark metrics.
:::

:::warning JSON parsing failures are common at scale - plan for them
LLMs don't always return valid JSON even when explicitly asked. At 10,000 generations, expect 3–8% parse failures. Always wrap parsing in try/except, log failures with the raw response text for debugging, and use robust extraction (find JSON delimiters with regex before parsing). Track your parse failure rate across generation runs - a sudden spike in failures often indicates the generation prompt has a problem or the model version changed.
:::
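The advice above could be packaged into a single extraction helper that tries a fenced block first, falls back to the outermost brackets, and logs the raw text when both fail. This is a sketch, not a definitive implementation: `extract_json_array` and the logger name `synthetic_data` are hypothetical.

```python
import json
import logging
import re
from typing import Optional

logger = logging.getLogger("synthetic_data")  # hypothetical logger name

# Matches a markdown code fence, optionally labelled "json"
FENCE_PATTERN = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def extract_json_array(text: str) -> Optional[list]:
    """Return the first parseable JSON array in `text`, or None on failure."""
    candidates = []

    fenced = FENCE_PATTERN.search(text)
    if fenced:
        candidates.append(fenced.group(1))

    # Fallback: everything from the first "[" to the last "]"
    start, end = text.find("["), text.rfind("]")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, list):
            return parsed

    # Keep the raw text (truncated) so failures can be debugged later
    logger.warning("JSON parse failed; raw response starts with: %r", text[:500])
    return None
```

Counting how often this returns `None` per run gives the parse failure rate the warning recommends tracking.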

Interview Q&A

Q: Why use LLMs to generate training data instead of human annotators?

The practical argument is cost, speed, and scalability. Human annotation of complex tasks costs $5–50 per example for domain expertise. At scale, this is prohibitive: 50,000 examples at $15 each is $750,000, plus weeks of coordination. LLM-generated data costs $0.001–0.01 per example - a 1,000–10,000x cost reduction.

Beyond cost: (1) Speed - generate 10,000 examples overnight vs. weeks for human annotation. (2) Iteration - when your format is wrong, regenerate instantly. (3) Coverage - systematically cover rare cases that humans rarely write about naturally. (4) Consistency - LLMs apply instructions consistently; human annotators drift in interpretation over long projects.

The caveats: LLM-generated data inherits generator biases, can't capture genuinely novel human insights, and requires validation for quality. Best approach: LLM generation for scale and coverage, human annotation for calibration and high-stakes examples. The LIMA paper (2023) supports keeping a small, carefully curated core: a 65B model fine-tuned on only 1,000 hand-curated examples was preferred by human raters over the same base model trained on 52,000 Alpaca examples.

Q: How do you maximize diversity in generated synthetic data?

Three primary techniques: (1) Temperature and sampling - use high temperature (0.8–1.0) during generation. Randomness in high-temperature sampling naturally produces diverse outputs. (2) Attribute-controlled generation - explicitly specify a grid of attributes (difficulty × topic × format) and generate examples for each combination. This guarantees systematic coverage rather than drift toward common examples. (3) Embedding-based deduplication - after generation, embed all examples with a sentence encoder and remove any with cosine similarity > 0.85 to an already-retained example. This catches semantic duplicates that surface-level deduplication misses.
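As a minimal sketch of technique (3), the greedy pass below keeps an example only if its cosine similarity to every already-retained example is at or below the 0.85 threshold. It assumes the embeddings were already computed by some sentence encoder (not shown) and uses plain Python lists; the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(examples, embeddings, threshold=0.85):
    """Greedy pass: retain an example only if it is not too similar to any kept one."""
    kept_examples, kept_embeddings = [], []
    for example, embedding in zip(examples, embeddings):
        if all(cosine_similarity(embedding, kept) <= threshold
               for kept in kept_embeddings):
            kept_examples.append(example)
            kept_embeddings.append(embedding)
    return kept_examples


# Toy 2-D vectors standing in for real sentence embeddings.
examples = [
    "What is a covered call?",
    "Explain a covered call.",
    "How does tax-loss harvesting work?",
]
embeddings = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]
print(deduplicate(examples, embeddings))
```

This pairwise loop is quadratic in the number of retained examples, so for large datasets an approximate nearest-neighbor index is the usual substitute.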

Beyond these three: topic spreading (create a topic tree and explicitly request examples from each leaf), negative seeding (after initial generation, look at underrepresented topics and generate specifically for gaps), and multi-system-prompt generation (use varied system prompts to elicit different reasoning and response styles).
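Attribute-controlled generation and varied system prompts can be combined in a small prompt-planning step. The sketch below is one possible layout: the attribute values, system prompts, and the `build_generation_specs` name are all hypothetical placeholders, and the actual model call is left out.

```python
import itertools
import random

# Hypothetical attribute axes; replace with the ones that matter for your task.
DIFFICULTIES = ["beginner", "intermediate", "expert"]
TOPICS = ["options strategies", "tax-loss harvesting", "sector comparison"]
FORMATS = ["short answer", "step-by-step explanation"]

# Hypothetical system prompts used to vary reasoning and response style.
SYSTEM_PROMPTS = [
    "You are a patient tutor.",
    "You are a terse senior analyst.",
]


def build_generation_specs(seed=0):
    """Return one generation spec per (difficulty, topic, format) combination."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    specs = []
    for difficulty, topic, fmt in itertools.product(DIFFICULTIES, TOPICS, FORMATS):
        specs.append({
            "system_prompt": rng.choice(SYSTEM_PROMPTS),
            "user_prompt": (
                f"Write one {difficulty}-level question about {topic} "
                f"and answer it as a {fmt}."
            ),
            "attributes": (difficulty, topic, fmt),
        })
    return specs


specs = build_generation_specs()
print(len(specs), "specs, e.g.:", specs[0]["user_prompt"])
```

Because every cell of the grid gets at least one spec, coverage gaps show up as missing attribute tuples rather than as a vague sense that the data "feels repetitive".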

Measure diversity: track entropy of topic distribution, average pairwise distance between examples in embedding space, and ratio of unique n-grams in the instruction corpus. A healthy dataset has high scores on all three.
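The three diversity measures can be computed with the standard library alone. This is a rough sketch under simple assumptions: topic labels are already assigned, tokens are whitespace-split, and embeddings are plain lists from whatever encoder you use. The function names are illustrative.

```python
import itertools
import math
from collections import Counter


def topic_entropy(topics):
    """Shannon entropy (in bits) of the topic label distribution."""
    counts = Counter(topics)
    total = len(topics)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def unique_ngram_ratio(texts, n=2):
    """Distinct n-grams divided by total n-grams across all texts."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def mean_pairwise_cosine_distance(embeddings):
    """Average of (1 - cosine similarity) over all pairs of embeddings."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    pairs = list(itertools.combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1 - cosine(a, b) for a, b in pairs) / len(pairs)
```

None of these numbers has a universal "good" threshold; they are most useful compared across generation runs or against a reference human-written set.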

Q: What is the quality-quantity tradeoff in synthetic data, and how do you navigate it?

The consensus has shifted decisively toward quality. LIMA (2023) showed that 1,000 curated examples can outperform roughly 52,000 noisier ones for instruction following. The IFD (Instruction-Following Difficulty) paper showed that selecting examples where the model struggles produces better results than random sampling.

The intuition: high-quality examples teach the model efficiently. Low-quality examples add noise that the model must learn to ignore - and often can't. Even a small number of systematically incorrect examples among 10,000 can cause the model to hallucinate on that topic class, because the error is learned with high confidence.

Optimization strategy: generate with high temperature (accepting 20–30% low quality), run LLM quality scoring on all examples using a cheap model (Haiku), set a minimum quality threshold (overall score ≥ 3.5/5), apply embedding deduplication to remaining high-quality examples, then spot-check 100–200 examples manually to calibrate automated scores. Starting from 100,000 generated examples, a typical pipeline retains 15,000–25,000 after quality filtering - smaller but dramatically more effective.

Q: How do you validate that synthetic training data actually improves model performance?

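Validation requires comparing against a baseline on a held-out human evaluation set. Process: (1) Create an independent human evaluation set (200–500 examples) NOT used for training - have domain experts annotate these. This is ground truth. (2) Train multiple models: zero-shot baseline, synthetic-only, human-only (if available), mixed. (3) Evaluate all models on the human evaluation set using both automated metrics and human preference ratings. (4) Statistical significance: check that differences exceed sampling noise. With N=200 and accuracy near 50%, the 95% margin of error on a single model's score is about ±7 percentage points, so only gaps of roughly 10 points between two independently scored models are reliably detectable. Paired comparisons on the same examples or a larger evaluation set are needed to resolve smaller differences.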
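A quick way to sanity-check how large a difference an evaluation set can resolve is the normal approximation for a proportion. The sketch below treats each model's score as an independent binomial proportion, which is a simplifying assumption (paired evaluation on the same examples is typically more sensitive); the function names are illustrative.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def margin_of_error(p, n, z=Z_95):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def min_detectable_difference(p, n, z=Z_95):
    """Smallest gap between two independent proportions, each measured on n
    examples and both near p, whose 95% interval excludes zero."""
    return z * math.sqrt(2 * p * (1 - p) / n)


for n in (200, 500, 2000):
    print(
        f"N={n}: single-model margin ±{margin_of_error(0.5, n):.3f}, "
        f"detectable gap ≈ {min_detectable_difference(0.5, n):.3f}"
    )
```

Using `p=0.5` is the worst case; accuracies closer to 0 or 1 give tighter intervals for the same N.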

Common failure modes to catch in validation: style overfitting (model learned generator's response style, not task performance), domain gaps (synthetic data covered common cases but missed edge cases), calibration errors (model is overconfident on topics where the generator hallucinated during data creation), and length bias (synthetic responses are uniformly verbose, training the model to be wordy when users want concise answers).

Q: How do you handle factual accuracy when generating domain-specific training data?

The core risk: frontier models hallucinate 5–20% of facts in domain-specific contexts, even when sounding confident. Training on hallucinated data amplifies those hallucinations in the student model.

Mitigation strategies: (1) Grounded generation - provide a corpus of verified facts (documents, textbooks) and instruct the LLM to generate questions and answers based only on the provided material. This shifts the LLM's role from knowledge source to reformulator, dramatically reducing hallucination risk. (2) Automated verification - for code, run it; for math, check the arithmetic; for SQL, execute against a test database. (3) Cross-model verification - generate with Model A, verify with Model B. (4) Human expert sampling - have domain experts review a random 5% sample. If the error rate exceeds your threshold (e.g., > 5%), fix the generation approach before using the data. (5) Confidence filtering - instruct the generator to include confidence scores and filter out examples where the model expresses uncertainty.
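For the SQL case of automated verification, one lightweight option is to execute each generated query against an in-memory SQLite database built from a test schema. This is a sketch under that assumption; the `sql_example_is_valid` helper and the `trades` schema are hypothetical, and passing this check only shows the query runs, not that it answers the question correctly.

```python
import sqlite3


def sql_example_is_valid(schema_sql, query_sql):
    """Return True if query_sql executes without error against the test schema."""
    conn = sqlite3.connect(":memory:")  # throwaway database, nothing persists
    try:
        conn.executescript(schema_sql)
        conn.execute(query_sql).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical test schema for a financial question-to-SQL dataset.
SCHEMA = "CREATE TABLE trades (id INTEGER PRIMARY KEY, ticker TEXT, qty INTEGER);"

generated = [
    "SELECT ticker, SUM(qty) FROM trades GROUP BY ticker",
    "SELECT price FROM trades",  # references a column that does not exist
]
verified = [q for q in generated if sql_example_is_valid(SCHEMA, q)]
print(verified)
```

Examples that fail execution can be dropped or sent back for regeneration, while the ones that pass still go through the human sampling step.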

© 2026 EngineersOfAI. All rights reserved.