

Self-Instruct: Bootstrapping Instruction Datasets from Scratch

The $600 Experiment That Changed Everything

It's late 2022. Yizhong Wang, a PhD student at the University of Washington, is staring at a gap. OpenAI has ChatGPT, whose instruction-following rests on large volumes of human-written demonstrations and preference rankings collected from an army of contractors. That kind of annotation effort is far beyond an academic budget. His team has a GPU cluster and a budget for maybe a few thousand human-written examples. The model they want to fine-tune - GPT-3 - already knows an enormous amount about the world. The bottleneck is not knowledge. The bottleneck is format: GPT-3 doesn't know how to respond helpfully to instructions rather than just predict the next token.

Someone on the team has an idea so obvious it seems it can't work: what if you use GPT-3 itself to generate the instructions? If GPT-3 is capable enough to follow many instructions already, surely it's capable enough to describe what an instruction looks like. You start with 175 hand-written diverse tasks - things like "Classify this email as spam or not spam," "Summarize the following article," "Write a Python function to sort a list." Then you ask GPT-3 to generate more tasks like these. You filter the bad ones. You ask GPT-3 to write responses to the good ones. You filter again. You fine-tune a model on the result.

The model that comes out is startlingly good at following instructions. It is not perfect - it fails on complex multi-step reasoning, makes occasional factual errors, and misunderstands ambiguous requests - but it follows diverse prompts and generalizes far beyond the original 175 seeds. The paper appears on arXiv in December 2022, and the field is genuinely surprised by how well it works.

In March 2023, Stanford researchers apply the same idea, using the 175 Self-Instruct seed tasks and OpenAI's text-davinci-003, to create Alpaca: 52,000 instruction-response pairs for less than $500 in API costs. In the authors' preliminary human evaluation, LLaMA-7B fine-tuned on this data behaves qualitatively similarly to text-davinci-003 on instruction following. The era of cheap, high-quality synthetic instruction data has begun, and it started with a question simple enough to seem naive: what if we just asked the model?

Why Self-Instruct Exists: The Pre-Synthetic Era Problem

Before Self-Instruct, instruction tuning required human-curated datasets. The two main options were both expensive in different ways:

Option 1: Human demonstrations plus RLHF - Hire contractors to write instructions, write responses, and rank multiple responses for quality. OpenAI's InstructGPT used this approach. It works extremely well - InstructGPT is the direct predecessor of ChatGPT. It also costs millions of dollars and months of time, which puts it out of reach for academia or any team without serious infrastructure funding.

Option 2: Collect natural instructions - Scrape user queries from websites, Stack Overflow, user forums, etc. This gives you a huge volume of real instructions at low cost. The problem: you also need responses, and the responses in the wild are often incomplete, wrong, contested, or missing entirely. Stack Overflow has good answers sometimes, but matching those answers to training quality is labor-intensive.

Self-Instruct took a different route: if you already have a capable base model, you can use it to generate its own training signal. The insight is that describing a task is easier than executing it. A model might struggle to do a complex task on the first try, but it can describe a complex task well enough that another model - or a fine-tuned version of itself - learns to do it. This is the core asymmetry Self-Instruct exploits.

The Self-Instruct Algorithm: Three Stages

The Self-Instruct process has three stages executed iteratively. The elegant part: the same model does all three jobs.

Stage 1: Task Generation

At each iteration, 8 tasks are sampled from the existing pool as in-context examples. The sampling ratio is important. The original paper draws 6 from the human-written seed tasks and 2 from the model-generated pool; the implementation in this lesson defaults to 6 generated and 2 seed, and both counts are parameters. Either way, the seed tasks anchor the pool to human-written quality. Without them, the pool drifts toward the LLM's natural generation tendencies.

These 8 tasks are inserted into a prompt template:

Come up with a series of tasks:
Task 1: {sample_task_1}
Task 2: {sample_task_2}
...
Task 8: {sample_task_8}
Task 9:

The model completes the prompt, generating tasks 9, 10, 11, and 12. Because the 8 sampled tasks are different on every call (random sampling from the pool), each generation call produces novel tasks. The model is doing in-context learning - it sees examples and generates more examples of the same type.

Stage 2: Instance Generation

Each generated task needs at least one instruction-input-output triple. The subtle part is that generation and classification tasks require different approaches:

Generation tasks (produce free-form output): Generate the input first, then the output. Example: "Summarize this article" - generate a realistic article first, then generate the summary.

Classification tasks (choose from finite options): Generate the label first, then generate an input that matches that label. Example: "Classify sentiment" - generate the label "positive", then generate a sentence that's clearly positive. This output-first approach prevents the model from collapsing to always predicting the majority class.

Detecting which type a task is also falls to the LLM, using a simple but effective prompt: "Is the following task a classification task or a generation task?"

Stage 3: Quality Filtering

Generated data is noisy. The filtering heuristics from the original paper:

  1. ROUGE-L deduplication: If the new task has ROUGE-L overlap above 0.7 with any existing task, discard it. ROUGE-L measures longest common subsequence - high overlap means near-duplicate.

  2. Length filtering: Discard tasks shorter than 3 words or longer than 150 words. Discard instances with empty inputs or outputs.

  3. Keyword blacklisting: Discard tasks containing image/audio/video keywords since the model operates on text only: "image", "photo", "audio", "video", "figure", "map", "diagram".

  4. Safety filtering: Simple classifier to catch harmful content.

After filtering, remaining tasks are added to the pool for the next iteration.

Complete Self-Instruct Implementation

Here is a fully working implementation using the Anthropic SDK:

import anthropic
import json
import random
import re
from dataclasses import dataclass
from typing import Literal

client = anthropic.Anthropic()


@dataclass
class Task:
    instruction: str
    input: str = ""
    output: str = ""
    task_type: Literal["generation", "classification"] = "generation"
    source: str = "generated"  # "seed" or "generated"


# --- Seed tasks: 8 diverse examples to bootstrap the process ---
SEED_TASKS: list[Task] = [
    Task(
        instruction="Classify the sentiment of the following text as positive, negative, or neutral.",
        input="I absolutely loved the movie! The acting was superb and the plot kept me engaged throughout.",
        output="positive",
        task_type="classification",
        source="seed"
    ),
    Task(
        instruction="Write a Python function that takes a list and returns the sum of all even numbers.",
        input="",
        output="""def sum_even(numbers):
    \"\"\"Return the sum of all even numbers in a list.\"\"\"
    return sum(x for x in numbers if x % 2 == 0)

# Example usage
print(sum_even([1, 2, 3, 4, 5, 6]))  # Output: 12""",
        task_type="generation",
        source="seed"
    ),
    Task(
        instruction="Summarize the following paragraph in one sentence.",
        input="The Amazon rainforest, often referred to as the 'lungs of the Earth,' produces 20% of the world's oxygen and hosts 10% of all known species. Deforestation, primarily driven by cattle ranching and soy farming, has already destroyed 17% of the original forest. Scientists warn that if deforestation exceeds 20-25%, the forest could reach a tipping point, transitioning from a carbon sink to a carbon source and accelerating climate change globally.",
        output="The Amazon rainforest faces an existential threat from deforestation that risks triggering a climate tipping point.",
        task_type="generation",
        source="seed"
    ),
    Task(
        instruction="Translate the following English sentence to French.",
        input="The weather is beautiful today, perfect for a walk in the park.",
        output="Le temps est magnifique aujourd'hui, parfait pour une promenade dans le parc.",
        task_type="generation",
        source="seed"
    ),
    Task(
        instruction="Identify the named entities (people, organizations, places) in the following text.",
        input="Apple CEO Tim Cook announced new products at their Cupertino headquarters, while Microsoft's Satya Nadella spoke at a conference in Seattle.",
        output="People: Tim Cook, Satya Nadella\nOrganizations: Apple, Microsoft\nPlaces: Cupertino, Seattle",
        task_type="generation",
        source="seed"
    ),
    Task(
        instruction="Explain the concept of recursion to a 10-year-old.",
        input="",
        output="Recursion is like looking at yourself in two mirrors facing each other - you see the same image repeating smaller and smaller. In programming, a function calls itself with a smaller version of the same problem, keeps going until it reaches a simple case it can answer directly, then builds the answer back up. Like solving a puzzle by solving a smaller version of the same puzzle.",
        task_type="generation",
        source="seed"
    ),
    Task(
        instruction="Given a customer support ticket, classify it as: billing, technical, shipping, or general.",
        input="My order shipped on Tuesday but tracking shows it hasn't moved in 3 days and I need it by Friday for an event.",
        output="shipping",
        task_type="classification",
        source="seed"
    ),
    Task(
        instruction="Write a SQL query to find the top 5 customers by total purchase amount in the last 30 days.",
        input="Table: orders (order_id, customer_id, amount, created_at)",
        output="""SELECT
    customer_id,
    SUM(amount) AS total_amount
FROM orders
WHERE created_at >= NOW() - INTERVAL '30 days'
GROUP BY customer_id
ORDER BY total_amount DESC
LIMIT 5;""",
        task_type="generation",
        source="seed"
    ),
]


def format_task_for_prompt(task: Task, index: int) -> str:
    """Format a task as a numbered entry in the generation prompt."""
    result = f"Task {index}: {task.instruction}"
    if task.input:
        # Show at most 80 characters of the input so the prompt stays short
        truncated_input = task.input[:80] + "..." if len(task.input) > 80 else task.input
        result += f"\n (Input example: {truncated_input})"
    return result


def stage1_generate_task_instructions(
    pool: list[Task],
    n_from_pool: int = 6,
    n_from_seeds: int = 2,
    n_generate: int = 4
) -> list[str]:
    """
    Stage 1: Generate new task instructions by few-shot prompting.

    Samples from both the growing pool (for diversity) and original
    seeds (to maintain quality anchor).
    """
    # Sample from generated pool and seeds
    pool_generated = [t for t in pool if t.source == "generated"]
    pool_seeds = [t for t in pool if t.source == "seed"]

    n_from_pool_actual = min(n_from_pool, len(pool_generated))
    n_from_seeds_actual = min(n_from_seeds, len(pool_seeds))

    sample_tasks = (
        random.sample(pool_generated, n_from_pool_actual) +
        random.sample(pool_seeds, n_from_seeds_actual)
    )

    # Fill remainder with any pool tasks if needed
    n_needed = (n_from_pool + n_from_seeds) - len(sample_tasks)
    if n_needed > 0:
        remaining = [t for t in pool if t not in sample_tasks]
        sample_tasks.extend(random.sample(remaining, min(n_needed, len(remaining))))

    random.shuffle(sample_tasks)

    prompt_lines = [
        "Come up with a series of diverse tasks. Each task should be a different type of NLP, reasoning, or knowledge task. Be creative and varied in both topic and format.\n"
    ]
    for i, task in enumerate(sample_tasks, 1):
        prompt_lines.append(format_task_for_prompt(task, i))

    # Ask for more tasks
    next_idx = len(sample_tasks) + 1
    for i in range(n_generate):
        prompt_lines.append(f"Task {next_idx + i}:")

    prompt = "\n".join(prompt_lines)

    # Use Haiku for this simple pattern-completion task - 60x cheaper than Opus
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=800,
        temperature=0.9,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse "Task N: ..." lines out of the completion
    text = response.content[0].text
    new_instructions = []
    for line in text.split("\n"):
        match = re.match(r"Task \d+:\s*(.+)", line.strip())
        if match:
            instruction = match.group(1).strip()
            if len(instruction.split()) >= 3:
                new_instructions.append(instruction)

    return new_instructions[:n_generate]


def stage1_classify_task_type(instruction: str) -> Literal["generation", "classification"]:
    """
    Stage 1b: Classify whether a task is generation or classification.

    Classification tasks require output-first instance generation to
    avoid label imbalance. Generation tasks use input-first.
    """
    prompt = f"""Is the following task a classification task (output is one of a finite set of labels/categories) or a generation task (output is free-form text, code, or explanation)?

Task: {instruction}

Answer with just one word: "classification" or "generation"."""

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )

    answer = response.content[0].text.strip().lower()
    return "classification" if "classification" in answer else "generation"


def stage2a_generate_instance_for_generation_task(
    instruction: str
) -> tuple[str, str]:
    """
    Stage 2 (Generation tasks): Generate input first, then output.

    For generation tasks like "Summarize this article", you first need
    a realistic article to summarize, then you generate the summary.
    """
    # Step 1: Generate realistic input (if needed)
    input_prompt = f"""For the following task, generate a realistic input example.
If no input is needed (e.g., creative writing from scratch), output exactly: NONE

Task: {instruction}

Input:"""

    input_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        temperature=0.8,
        messages=[{"role": "user", "content": input_prompt}]
    )
    generated_input = input_response.content[0].text.strip()
    if generated_input.upper() == "NONE" or len(generated_input) < 3:
        generated_input = ""

    # Step 2: Generate output based on input
    if generated_input:
        output_prompt = f"{instruction}\n\nInput: {generated_input}\n\nOutput:"
    else:
        output_prompt = f"{instruction}\n\nOutput:"

    output_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=600,
        temperature=0.7,
        messages=[{"role": "user", "content": output_prompt}]
    )
    generated_output = output_response.content[0].text.strip()

    return generated_input, generated_output


def stage2b_generate_instance_for_classification_task(
    instruction: str
) -> tuple[str, str]:
    """
    Stage 2 (Classification tasks): Generate output label first, then matching input.

    If you generate input first and then classify it, you risk:
    - Label imbalance (model defaults to most common class)
    - Circular reasoning (model picks input that seems typical)

    By generating the label first, you force balanced class representation
    and ensure the generated input clearly belongs to that class.
    """
    # Step 1: Generate a class label
    label_prompt = f"""For the following classification task, generate one possible class label or category.
Just output the label, nothing else.

Task: {instruction}

Label:"""

    label_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=30,
        temperature=0.7,
        messages=[{"role": "user", "content": label_prompt}]
    )
    generated_label = label_response.content[0].text.strip()

    # Step 2: Generate an input that matches this label
    input_prompt = f"""For the following classification task, generate an example input that should be classified as "{generated_label}".
The input should clearly and unambiguously belong to this class.

Task: {instruction}
Target class: {generated_label}

Input example:"""

    input_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        temperature=0.8,
        messages=[{"role": "user", "content": input_prompt}]
    )
    generated_input = input_response.content[0].text.strip()

    return generated_input, generated_label


def compute_rouge_l(text1: str, text2: str) -> float:
    """
    Compute ROUGE-L score (LCS-based) between two strings.

    Used for deduplication: if ROUGE-L > 0.7 with any existing task,
    the new task is considered a near-duplicate and rejected.

    Time complexity: O(m*n) where m,n are token counts.
    For typical instruction lengths (5-50 words), this is fast.
    """
    tokens1 = text1.lower().split()
    tokens2 = text2.lower().split()
    if not tokens1 or not tokens2:
        return 0.0

    # dp[i][j] = length of the longest common subsequence of tokens1[:i] and tokens2[:j]
    m, n = len(tokens1), len(tokens2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if tokens1[i-1] == tokens2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    lcs_len = dp[m][n]
    precision = lcs_len / n if n > 0 else 0
    recall = lcs_len / m if m > 0 else 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


BLACKLIST_KEYWORDS = {
    "image", "images", "figure", "figures", "picture", "pictures",
    "photo", "photos", "audio", "video", "videos", "music", "file",
    "files", "map", "maps", "diagram", "diagrams", "graph", "charts",
    "screenshot", "camera",
}

FAILURE_INDICATORS = [
    "i cannot", "i can't", "as an ai", "i'm unable",
    "i apologize but", "i'm not able", "this is not possible",
]


def stage3_passes_quality_filters(
    instruction: str,
    pool: list[Task],
    rouge_threshold: float = 0.7,
    min_words: int = 3,
    max_words: int = 150
) -> tuple[bool, str]:
    """
    Stage 3: Quality filtering for new task instructions.

    Returns (passes, reason_if_failed).

    Filters applied:
    1. Length filter (too short or too long)
    2. Keyword blacklist (requires unavailable modalities)
    3. Failure indicator check
    4. ROUGE-L deduplication against all existing tasks
    """
    words = instruction.split()

    # Length filter
    if len(words) < min_words:
        return False, f"too_short ({len(words)} words)"
    if len(words) > max_words:
        return False, f"too_long ({len(words)} words)"

    # Keyword blacklist - only reject referential usage ("this image"), not conceptual ("image processing")
    instruction_lower = instruction.lower()
    for kw in BLACKLIST_KEYWORDS:
        context_phrases = [f"this {kw}", f"the {kw}", f"given {kw}", f"following {kw}", f"attached {kw}"]
        if any(phrase in instruction_lower for phrase in context_phrases):
            return False, f"requires_modality:{kw}"

    # Check for failure indicators in the instruction itself
    for indicator in FAILURE_INDICATORS:
        if indicator in instruction_lower:
            return False, f"failure_indicator:{indicator}"

    # ROUGE-L deduplication against all existing pool tasks (most expensive check, so it runs last)
    for existing_task in pool:
        rouge = compute_rouge_l(instruction, existing_task.instruction)
        if rouge > rouge_threshold:
            return False, f"rouge_duplicate (score={rouge:.3f})"

    return True, "passed"


def run_self_instruct(
    n_iterations: int = 10,
    tasks_per_iteration: int = 4,
    output_path: str = "self_instruct_dataset.jsonl",
    verbose: bool = True
) -> list[Task]:
    """
    Run the complete Self-Instruct pipeline.

    Args:
        n_iterations: Number of generation rounds
        tasks_per_iteration: New task instructions to generate per round
        output_path: Where to save the JSONL dataset
        verbose: Print progress

    Returns:
        List of generated Task objects

    Cost estimate (Claude Haiku at 2025 pricing):
    - Up to 13 API calls per iteration at the defaults
      (1 task generation + 3 per accepted task: type classification, then 2 for the instance)
    - ~500 tokens per call average
    - 10 iterations x 13 calls = ~130 calls total
    - 130 calls x 500 tokens x $0.25/1M tokens = approximately $0.02 for up to 40 examples
    - Scales: 50K examples = roughly $10-15 using Haiku (see the cost analysis section)
    """
    pool: list[Task] = list(SEED_TASKS)
    all_generated: list[Task] = []
    stats = {
        "generated": 0,
        "filtered_quality": 0,
        "filtered_rouge": 0,
        "filtered_output": 0
    }

    if verbose:
        print("Starting Self-Instruct")
        print(f"Seed tasks: {len(SEED_TASKS)}")
        print(f"Iterations: {n_iterations} x {tasks_per_iteration} tasks = up to {n_iterations * tasks_per_iteration} new tasks\n")

    for iteration in range(n_iterations):
        if verbose:
            print(f"--- Iteration {iteration + 1}/{n_iterations} ---")

        # Stage 1: Generate new task instructions
        new_instructions = stage1_generate_task_instructions(pool, n_generate=tasks_per_iteration)

        for instruction in new_instructions:
            # Stage 3 (instruction-level filter - runs before expensive instance generation)
            passes, reason = stage3_passes_quality_filters(instruction, pool)
            if not passes:
                if "rouge_duplicate" in reason:
                    stats["filtered_rouge"] += 1
                else:
                    stats["filtered_quality"] += 1
                if verbose:
                    print(f" FILTERED [{reason[:40]}]: {instruction[:50]}...")
                continue

            if verbose:
                print(f" ACCEPTED: {instruction[:60]}...")

            # Stage 1b: Classify task type
            task_type = stage1_classify_task_type(instruction)

            # Stage 2: Generate instance using appropriate strategy
            if task_type == "classification":
                input_text, output_text = stage2b_generate_instance_for_classification_task(instruction)
                # A class label is often a single word, so only reject empty labels
                min_output_words = 1
            else:
                input_text, output_text = stage2a_generate_instance_for_generation_task(instruction)
                min_output_words = 2

            # Validate output is non-empty
            if not output_text or len(output_text.split()) < min_output_words:
                stats["filtered_output"] += 1
                if verbose:
                    print(f" FILTERED [empty output]: {instruction[:40]}...")
                continue

            new_task = Task(
                instruction=instruction,
                input=input_text,
                output=output_text,
                task_type=task_type,
                source="generated"
            )

            pool.append(new_task)
            all_generated.append(new_task)
            stats["generated"] += 1

    # Save dataset
    with open(output_path, "w", encoding="utf-8") as f:
        for task in all_generated:
            f.write(json.dumps({
                "instruction": task.instruction,
                "input": task.input,
                "output": task.output,
                "task_type": task.task_type,
            }) + "\n")

    if verbose:
        print("\nSelf-Instruct complete:")
        print(f" Generated: {stats['generated']}")
        print(f" Quality filtered: {stats['filtered_quality']}")
        print(f" ROUGE duplicates: {stats['filtered_rouge']}")
        print(f" Empty output: {stats['filtered_output']}")
        print(f" Saved to: {output_path}")

    return all_generated


if __name__ == "__main__":
    tasks = run_self_instruct(n_iterations=5, tasks_per_iteration=4)
    print(f"\nDataset: {len(tasks)} examples")

The Alpaca Format

Stanford Alpaca popularized a specific prompt format that became the standard for instruction-tuned models. Understanding it is necessary because fine-tuning frameworks often expect this format:

import json
from dataclasses import dataclass

import anthropic

ALPACA_WITH_INPUT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{response}"""

ALPACA_NO_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}"""


def format_alpaca_example(task: "Task", include_response: bool = True) -> str:
    """
    Convert a Task to Alpaca prompt format.

    Args:
        task: Task with instruction, input, and output
        include_response: False for inference (model generates the response)

    Returns:
        Formatted string for fine-tuning or inference
    """
    if task.input and task.input.strip():
        text = ALPACA_WITH_INPUT.format(
            instruction=task.instruction,
            input=task.input,
            response=task.output if include_response else ""
        )
    else:
        text = ALPACA_NO_INPUT.format(
            instruction=task.instruction,
            response=task.output if include_response else ""
        )

    if not include_response:
        text = text.rstrip()  # Remove trailing whitespace for inference

    return text


def convert_tasks_to_alpaca_json(tasks: list) -> list[dict]:
    """
    Convert task list to Alpaca-format JSON for fine-tuning.
    The JSON format is used by most fine-tuning frameworks.
    """
    return [
        {
            "instruction": task.instruction,
            "input": task.input,
            "output": task.output,
        }
        for task in tasks
    ]


@dataclass
class SimpleTask:
    """Minimal Task-like record with just the three Alpaca fields."""
    instruction: str
    input: str
    output: str


# Demonstrate generating a training example using claude-opus-4-6
# then formatting it in the Alpaca standard for fine-tuning
def generate_and_format_alpaca_example(topic: str) -> str:
    """Generate one high-quality example and format it for Alpaca fine-tuning."""
    client = anthropic.Anthropic()

    # Use Opus for the actual generation - quality matters here
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Generate a high-quality instruction-following example about {topic}.
Return JSON with exactly: {{"instruction": "...", "input": "...", "output": "..."}}
Use empty string for input if none is needed. Return only valid JSON."""
        }]
    )

    raw = response.content[0].text.strip()
    # Models sometimes wrap JSON in a Markdown code fence despite the instruction; unwrap it
    if raw.startswith("```"):
        raw = raw.strip("`").removeprefix("json").strip()
    data = json.loads(raw)

    task = SimpleTask(
        instruction=data["instruction"],
        input=data.get("input", ""),
        output=data["output"]
    )

    return format_alpaca_example(task, include_response=True)
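
To connect the two formats, here is a minimal sketch of turning the JSONL file written by the pipeline into a single Alpaca-style JSON array. The function name `jsonl_to_alpaca_json` and the file paths are hypothetical, and the sketch assumes each JSONL line carries at least `instruction` and `output` keys, as the pipeline above writes them.

```python
import json
from pathlib import Path


def jsonl_to_alpaca_json(jsonl_path: str, json_path: str) -> int:
    """Convert a JSONL file of task records into one Alpaca-style JSON array.

    Returns the number of records written.
    """
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            # Keep only the three Alpaca fields; extra keys such as task_type are dropped
            records.append({
                "instruction": row["instruction"],
                "input": row.get("input", ""),
                "output": row["output"],
            })
    Path(json_path).write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return len(records)
```

Check your fine-tuning framework's documentation for the exact field names it expects; many accept this three-key layout, but that is not guaranteed.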

Diversity Analysis: Detecting Topic Clustering

One of the most underappreciated problems with Self-Instruct is that generated tasks cluster around the same topics as the seed set. This analysis helps you identify and address clustering:

import numpy as np


def analyze_dataset_diversity(
    tasks: list,
    n_clusters: int = 20
) -> dict:
    """
    Cluster tasks by topic and measure coverage and balance.

    Args:
        tasks: List of Task objects to analyze
        n_clusters: How many topic clusters to fit

    Returns:
        Dict with diversity metrics including Gini coefficient.
        Gini of 0.0 means perfectly equal distribution across clusters.
        Higher values mean tasks are concentrated in fewer clusters
        (the maximum for k clusters is (k - 1) / k, approaching 1.0).
    """
    try:
        from sentence_transformers import SentenceTransformer
        from sklearn.cluster import KMeans
    except ImportError:
        print("Install: pip install sentence-transformers scikit-learn")
        return {}

    if not tasks:
        return {}
    # KMeans cannot fit more clusters than there are samples
    n_clusters = min(n_clusters, len(tasks))

    model = SentenceTransformer("all-MiniLM-L6-v2")
    instructions = [t.instruction for t in tasks]
    embeddings = model.encode(instructions, show_progress_bar=False)

    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    labels = kmeans.fit_predict(embeddings)

    unique, counts = np.unique(labels, return_counts=True)
    cluster_sizes = dict(zip(unique.tolist(), counts.tolist()))

    # Gini coefficient over cluster shares: 0 = perfectly equal, higher = more unequal
    count_arr = np.array(list(cluster_sizes.values()), dtype=float)
    count_arr /= count_arr.sum()
    count_arr = np.sort(count_arr)
    n = len(count_arr)
    gini = float((2 * np.sum(np.arange(1, n + 1) * count_arr) - (n + 1)) / n)

    largest_cluster_pct = max(counts) / len(tasks) * 100
    smallest_cluster_pct = min(counts) / len(tasks) * 100

    return {
        "n_tasks": len(tasks),
        "n_clusters": n_clusters,
        "gini_coefficient": round(gini, 4),
        "largest_cluster_pct": round(float(largest_cluster_pct), 2),
        "smallest_cluster_pct": round(float(smallest_cluster_pct), 4),
        "tasks_per_cluster_mean": round(len(tasks) / n_clusters, 1),
        "interpretation": (
            "Well-balanced" if gini < 0.3
            else "Moderately imbalanced" if gini < 0.5
            else "Severely imbalanced - consider topic-guided generation"
        )
    }


def identify_underrepresented_topics(
    tasks: list,
    n_clusters: int = 20,
    underrepresented_threshold_pct: float = 2.0
) -> list[str]:
    """
    Identify topic clusters that have fewer examples than expected.

    Returns representative instructions from underrepresented clusters
    to use as seeds for targeted generation rounds.
    """
    try:
        from sentence_transformers import SentenceTransformer
        from sklearn.cluster import KMeans
    except ImportError:
        return []

    if not tasks:
        return []
    # KMeans cannot fit more clusters than there are samples
    n_clusters = min(n_clusters, len(tasks))

    model = SentenceTransformer("all-MiniLM-L6-v2")
    instructions = [t.instruction for t in tasks]
    embeddings = model.encode(instructions, show_progress_bar=False)

    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    labels = kmeans.fit_predict(embeddings)

    unique, counts = np.unique(labels, return_counts=True)
    threshold = len(tasks) * underrepresented_threshold_pct / 100

    underrepresented = []
    for cluster_id, count in zip(unique, counts):
        if count < threshold:
            # Get the most central example from this cluster
            cluster_indices = np.where(labels == cluster_id)[0]
            cluster_embeddings = embeddings[cluster_indices]
            centroid = cluster_embeddings.mean(axis=0)
            distances = np.linalg.norm(cluster_embeddings - centroid, axis=1)
            most_central_idx = cluster_indices[distances.argmin()]
            underrepresented.append(tasks[most_central_idx].instruction)

    return underrepresented
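
To get a feel for what the Gini numbers mean, here is a small worked example of the same formula applied directly to cluster sizes, without any embedding model. The helper name `gini_from_counts` and the toy counts are illustrative assumptions, not part of the pipeline above.

```python
import numpy as np


def gini_from_counts(counts: list[int]) -> float:
    """Gini coefficient of cluster sizes: 0.0 = equal sizes, higher = more skewed."""
    shares = np.sort(np.asarray(counts, dtype=float))
    shares /= shares.sum()          # convert sizes to shares that sum to 1
    n = len(shares)
    ranks = np.arange(1, n + 1)     # 1-based rank of each share, smallest first
    return float((2 * np.sum(ranks * shares) - (n + 1)) / n)


balanced = gini_from_counts([25, 25, 25, 25])
skewed = gini_from_counts([1, 1, 1, 97])
print(f"balanced={balanced:.3f} skewed={skewed:.3f}")
```

Four equal clusters give 0.0, while putting 97% of tasks in one of four clusters gives 0.72. With k clusters the formula tops out at (k - 1) / k, so a value near 0.75 is already close to the worst case for k = 4.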

Limitations and the Evol-Instruct Evolution

Self-Instruct is powerful but has well-documented limitations worth understanding before deploying it:

Complexity ceiling: Self-Instruct cannot generate tasks that require capabilities beyond the teacher model. The student model is bounded by the teacher's ability to describe tasks correctly.

Semantic deduplication failure: ROUGE-L catches lexically similar tasks but misses semantically identical ones. "Summarize this article" and "Give me the main points of this text" have low ROUGE-L but are essentially the same task. Fix: use embedding-based deduplication.

Difficulty distribution skew: Generated tasks cluster around medium difficulty. Hard tasks (requiring expert knowledge, multi-step reasoning, synthesis of multiple concepts) are underrepresented because the model naturally generates tasks it can answer easily. Fix: Evol-Instruct (Lesson 04).

Bias inheritance: All biases in the teacher model transfer to the generated dataset. Fix: explicit diversity prompting, demographic auditing, mixing human examples.
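The semantic deduplication limitation above is easy to verify numerically. The sketch below re-implements ROUGE-L F1 over whitespace tokens in a compact, self-contained form (the name `rouge_l_f1` is an assumption for illustration) and compares a paraphrase pair with a near-copy pair.

```python
def rouge_l_f1(a: str, b: str) -> float:
    """ROUGE-L F1 over whitespace tokens, via longest common subsequence."""
    x, y = a.lower().split(), b.lower().split()
    if not x or not y:
        return 0.0
    # prev[j] holds the LCS length of the processed prefix of x and y[:j]
    prev = [0] * (len(y) + 1)
    for tok in x:
        curr = [0]
        for j, other in enumerate(y, 1):
            curr.append(prev[j - 1] + 1 if tok == other else max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(y), lcs / len(x)
    return 2 * precision * recall / (precision + recall)


paraphrase = rouge_l_f1("Summarize this article", "Give me the main points of this text")
near_copy = rouge_l_f1("Summarize this article", "Summarize this news article")
print(f"paraphrase={paraphrase:.2f} near_copy={near_copy:.2f}")
```

The paraphrase pair shares only the token "this" and scores about 0.18, far below the 0.7 threshold, so both instructions would be kept. The near-copy scores about 0.86 and would be rejected. ROUGE-L is doing its job on lexical overlap; it simply has no notion of meaning.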

Cost Analysis for Production Scale

Using Claude Haiku as the teacher model for a 50,000 example Self-Instruct dataset:

| Stage | Calls | Tokens (avg) | Cost |
|---|---|---|---|
| Stage 1: Task generation | 12,500 | 600 in / 200 out | $1.88 |
| Stage 1b: Task classification | 50,000 | 150 in / 5 out | $1.91 |
| Stage 2: Input generation | 50,000 | 100 in / 100 out | $2.50 |
| Stage 2: Output generation | 50,000 | 200 in / 300 out | $6.25 |
| Total | | | ~$12.54 |

That total is for 50,000 examples. The original Alpaca produced 52,000 examples for about $500 using GPT-3.5 in 2023 - modern Claude Haiku is dramatically cheaper for the same quality level. Using Opus for generation instead costs approximately 60x more (~$750), while using Haiku for generation and Opus only for quality-scoring a 10% sample costs approximately $50 total.
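
Since API prices change, it can help to keep the arithmetic in code and plug in current rates. The sketch below is a hypothetical estimator: `StageCost` and `estimate_cost` are names introduced here, the call counts and token averages come from the table above, and the prices are placeholders you should replace with your provider's current per-million-token rates.

```python
from dataclasses import dataclass


@dataclass
class StageCost:
    name: str
    calls: int
    tokens_in: int   # average input tokens per call
    tokens_out: int  # average output tokens per call


def estimate_cost(stages: list[StageCost], usd_per_m_in: float, usd_per_m_out: float) -> dict:
    """Return per-stage and total cost in USD for the given per-million-token prices."""
    costs = {}
    for s in stages:
        cost_in = s.calls * s.tokens_in * usd_per_m_in / 1_000_000
        cost_out = s.calls * s.tokens_out * usd_per_m_out / 1_000_000
        costs[s.name] = round(cost_in + cost_out, 2)
    costs["total"] = round(sum(costs.values()), 2)
    return costs


# Call counts and token averages are taken from the table above
STAGES = [
    StageCost("task_generation", 12_500, 600, 200),
    StageCost("task_classification", 50_000, 150, 5),
    StageCost("input_generation", 50_000, 100, 100),
    StageCost("output_generation", 50_000, 200, 300),
]

# PLACEHOLDER prices: a flat $0.25 per million tokens, the rate quoted in the pipeline docstring
print(estimate_cost(STAGES, usd_per_m_in=0.25, usd_per_m_out=0.25))
```

With the flat placeholder rate the total lands in the same low-double-digit range as the table. Output tokens are usually priced higher than input tokens, so passing separate rates will shift the result, mostly through the output-generation stage.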

Improving Self-Instruct: Embedding-Based Deduplication

ROUGE-L misses semantic duplicates. The fix is to run embedding-based deduplication after ROUGE-L filtering:

import json

import anthropic
import numpy as np


def embedding_based_dedup(
    candidate_instruction: str,
    existing_instructions: list[str],
    similarity_threshold: float = 0.90,
    embed_client=None,
) -> tuple[bool, float]:
    """
    Check if a candidate instruction is semantically similar to any existing one.

    Uses cosine similarity on sentence embeddings - catches paraphrases
    that ROUGE-L misses. Run after ROUGE-L as a second pass.

    Args:
        candidate_instruction: New instruction to check
        existing_instructions: All instructions already in the pool
        similarity_threshold: Cosine similarity above which to reject (0.0-1.0)
        embed_client: Embedding client (Voyage AI, OpenAI, etc.)

    Returns:
        Tuple of (is_duplicate, max_similarity_found)
    """
    if not existing_instructions:
        return False, 0.0

    try:
        import voyageai
    except ImportError:
        # Fall back to ROUGE-L only if voyageai not installed
        return False, 0.0

    if embed_client is None:
        embed_client = voyageai.Client()

    # Embed candidate and all existing
    all_texts = [candidate_instruction] + existing_instructions
    result = embed_client.embed(all_texts, model="voyage-3-lite")
    embeddings = np.array(result.embeddings)

    candidate_emb = embeddings[0]
    existing_embs = embeddings[1:]

    # Cosine similarity
    candidate_norm = candidate_emb / np.linalg.norm(candidate_emb)
    existing_norms = existing_embs / np.linalg.norm(existing_embs, axis=1, keepdims=True)
    similarities = existing_norms @ candidate_norm

    max_similarity = float(similarities.max())
    is_duplicate = max_similarity >= similarity_threshold

    return is_duplicate, max_similarity


def quality_score_with_opus(
    instruction: str,
    output: str,
    domain: str = "general"
) -> float:
    """
    Use claude-opus-4-6 for high-stakes quality verification of a generated example.

    Use this for:
    - Calibrating your Haiku quality scores (run on a 5% sample)
    - High-stakes domains where errors are costly
    - Examples that passed Haiku scoring but seem borderline

    Returns score from 0.0 (poor) to 1.0 (excellent).
    """
    client = anthropic.Anthropic()

    prompt = f"""Rate the quality of this synthetic training example for an AI system in the domain of {domain}.

INSTRUCTION: {instruction}

GENERATED OUTPUT: {output}

Rate on:
- Instruction clarity (is it unambiguous): 0-25 points
- Output accuracy (is it factually correct): 0-25 points
- Completeness (does it fully address the instruction): 0-25 points
- Training value (would a model learn something useful): 0-25 points

Return only: {{"total": <0-100>}}"""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )

    # Treat any unparseable reply as a score of 0.0
    try:
        data = json.loads(response.content[0].text)
        return float(data.get("total", 0)) / 100.0
    except Exception:
        return 0.0

:::tip Use Haiku for generation, spot-check with claude-opus-4-6 The three-stage Self-Instruct pipeline works well with claude-haiku-4-5-20251001 for most tasks - it's fast, cheap, and sufficiently capable. Use claude-opus-4-6 for final quality validation: generate with Haiku, then use Opus to rate 5-10% of examples. If Opus rates them highly, your Haiku-generated data is good. If it rates them poorly, fix your generation prompts before scaling up. The calibration cost (Opus on 500 examples) is approximately $5-10 and can save you thousands in wasted generation costs. :::

:::danger Seed task quality is the foundation - spend time here The 175 seed tasks in the original Self-Instruct were carefully hand-selected for diversity across task types, domains, and formats. If you start with a low-quality or narrow seed set, every subsequent generation will amplify those problems. Self-Instruct is a bootstrapping process: the better the bootstrap, the better the result. Budget most of your human effort on the seed tasks - they are the 1% that determines the quality of the 99%. For a domain-specific application, hire a domain expert for one day to write 50-100 diverse, high-quality seed tasks. That investment pays for itself in dataset quality. :::

:::warning ROUGE-L is a necessary but insufficient deduplication strategy ROUGE-L catches lexically near-identical tasks efficiently. It does not catch paraphrases or semantically equivalent tasks. "Summarize this article" and "Write a one-sentence summary of the following text" will pass ROUGE-L and both end up in your pool. For datasets larger than 10,000 examples, this produces meaningful hidden redundancy. Add embedding-based deduplication as a second pass: compute sentence embeddings, run approximate nearest-neighbor search (FAISS or similar), and reject candidates with cosine similarity above 0.90 to any existing task. The additional cost is approximately $2-5 per 50,000 examples using a cheap embedding model. :::
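
The second pass described in the warning above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: it assumes embeddings have already been computed, uses made-up 3-dimensional toy vectors, and replaces the approximate nearest-neighbor index (FAISS or similar) with a brute-force scan. The names `cosine` and `greedy_dedup` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_dedup(embeddings: list[list[float]], threshold: float = 0.90) -> list[int]:
    """Return indices to keep; drop any vector at or above `threshold` to one already kept."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy vectors (assumed, for illustration only)
toy = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # nearly parallel to the first vector
    [0.0, 1.0, 0.0],    # orthogonal to the first vector
]
print(greedy_dedup(toy))  # → [0, 2]
```

The brute-force scan is quadratic in pool size, which is why an approximate nearest-neighbor index becomes worthwhile for larger pools.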

From Self-Instruct to Modern Instruction Datasets

Self-Instruct established the paradigm. Modern instruction datasets have extended it in several directions:

| Method | Key Innovation | Dataset Scale | Model Quality |
|---|---|---|---|
| Self-Instruct (2023) | Bootstrap from 175 seeds | 52K (Alpaca) | GPT-3 level |
| Evol-Instruct (2023) | Evolve complexity systematically | 70K (WizardLM) | Beats GPT-3.5 on 52% of tasks |
| Orca (2023) | Include reasoning traces | 5M | Matches GPT-3.5 on reasoning |
| LIMA (2023) | Quality over quantity | 1K curated | Competitive with 50K random |
| OpenHermes (2024) | Mix multiple sources | 1M | Strong generalist capability |
| Magpie (2024) | Generate from model self-play | Variable | State of the art for SFT |

The evolution shows three clear trends: (1) quality beating quantity (LIMA), (2) reasoning process mattering as much as final answer (Orca), and (3) complexity and diversity being separable axes to optimize independently (Evol-Instruct on complexity, diverse seed selection on coverage).

:::info Self-Instruct's lasting legacy The Self-Instruct paper's most important contribution was not the specific algorithm - ROUGE-L deduplication and the three-stage loop. It was the proof of concept: that you can bootstrap an instruction-following capability from an almost negligible seed. Every modern synthetic data pipeline for instruction tuning is, in some sense, a refinement of Self-Instruct. Understanding the original algorithm is necessary for understanding why the refinements matter and what problems they solve. :::

Interview Q&A

Q: What problem does Self-Instruct solve, and why was it important when it was published?

Self-Instruct addressed the data scarcity problem in instruction tuning. Before it, creating instruction-following datasets required expensive human labeling at scale - the kind of resource available to OpenAI (with InstructGPT) but not to academic researchers or small companies. Self-Instruct showed you could bootstrap a large instruction dataset from just 175 human-written seed tasks using the model itself as a generator - for a few hundred dollars in API costs instead of millions of dollars. This democratized instruction tuning, enabling any team with a GPU and modest API budget to fine-tune their own instruction-following model. The Alpaca follow-up demonstrated this concretely: a $500 dataset producing a model competitive with GPT-3.5.

Q: Why does Self-Instruct use ROUGE-L for deduplication specifically, and what is its weakness?

ROUGE-L (Longest Common Subsequence) is fast to compute and captures surface-level lexical similarity well. If two instructions share most of the same words in roughly the same order, they're likely duplicates - and ROUGE-L catches this reliably. The weakness: it misses semantic duplicates that use different words. "Summarize this article" and "Give me the main idea of this text" have low ROUGE-L overlap but are essentially the same task. A more robust approach uses sentence embeddings and cosine similarity, catching semantic duplicates regardless of wording. In practice: use ROUGE-L as a first pass (fast, no model inference required), then embedding-based dedup as a second pass for the subset that passes ROUGE-L.
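
To make the weakness concrete, here is a minimal ROUGE-L sketch built on a longest-common-subsequence table. It assumes lowercase whitespace tokenization, which is simpler than what packaged ROUGE implementations do, so exact scores may differ from a library's. The function names are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 using lowercase whitespace tokens (a simplifying assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Paraphrase pair from the answer above: only "this" is shared
paraphrase = rouge_l_f1("Summarize this article", "Give me the main idea of this text")
# Lexical near-duplicate: four of five tokens shared, in order
near_dup = rouge_l_f1("Summarize the following article",
                      "Summarize the following article briefly")
print(round(paraphrase, 2), round(near_dup, 2))  # → 0.18 0.89
```

With a rejection threshold around 0.7, the near-duplicate is filtered while the paraphrase passes straight through, which is the gap the embedding pass is meant to close.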

Q: Why does Self-Instruct generate output first for classification tasks, and what problem does this solve?

If you generate the input first and then classify it, the model tends to always predict the most common class - label imbalance. It's easier for the model to generate a "typical" example, which tends to belong to the majority class. By generating the label first ("positive sentiment") and then generating an input that matches that label ("a sentence that's clearly positive"), you force balanced class representation. It's also a more constrained generation task - generating "a sentence expressing positive sentiment" is well-defined, while "classify this input" might collapse to majority-class prediction for whatever the model happens to generate.
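
A small sketch of output-first prompting follows. It is illustrative only: the prompt wording is a made-up template rather than the paper's exact one, and cycling through the labels round-robin is an assumption added here to make the class balance explicit. `output_first_prompts` is a hypothetical helper.

```python
from itertools import cycle, islice


def output_first_prompts(task: str, labels: list[str], n: int) -> list[tuple[str, str]]:
    """Build n (label, prompt) pairs, fixing the label before the input is generated."""
    pairs = []
    for label in islice(cycle(labels), n):
        # The label appears first, so the model must write an input that fits it
        prompt = (
            f"Task: {task}\n"
            f"Class label: {label}\n"
            "Input:"
        )
        pairs.append((label, prompt))
    return pairs


pairs = output_first_prompts(
    "Classify the sentiment of the sentence.",
    ["positive", "negative"],
    n=4,
)
print([label for label, _ in pairs])
# → ['positive', 'negative', 'positive', 'negative']
```

Each prompt ends at `Input:`, so the completion is an example conditioned on the chosen label rather than a label predicted for whatever input the model happened to produce.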

Q: What are the key limitations of Self-Instruct that Evol-Instruct was designed to address?

Self-Instruct generates diverse tasks but doesn't control complexity. Most generated tasks cluster around medium difficulty - the natural center of what the LLM produces when asked to "come up with a task." The difficulty distribution looks like a bell curve centered on medium. Evol-Instruct (Lesson 04) addresses this by explicitly evolving tasks to become more complex (Add Constraints, Deepen, Increase Reasoning Steps). This produces a right-skewed difficulty distribution with disproportionately more hard examples. Self-Instruct also doesn't ensure comprehensive coverage of a specific domain - it generates whatever the model naturally finds easy to describe. Evol-Instruct starting from domain-specific seeds addresses this by staying within the domain while increasing complexity.

Q: If you were building a Self-Instruct pipeline today, what would you change from the original paper?

Several improvements over the original 2023 design: (1) Replace ROUGE-L-only filtering with two-stage deduplication - a ROUGE-L first pass for speed, then an embedding cosine-similarity second pass to catch the semantic duplicates that ROUGE-L misses. (2) Add an automated quality scorer using a separate LLM (not the generator) to rate each example for instruction clarity and response accuracy. (3) Use stratified topic sampling for seeds: track which topic clusters are under-represented in the growing pool and oversample seeds from those clusters. (4) Add factual verification for knowledge-heavy tasks: for tasks with checkable answers, verify the generated response against authoritative sources. (5) Mix in Evol-Instruct after the initial Self-Instruct phase to push the difficulty distribution toward harder examples. (6) Use a capability-appropriate generator - claude-haiku-4-5-20251001 is fine for simple generation tasks and costs dramatically less than claude-opus-4-6, making the economics even more favorable.
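
Point (3) can be sketched as inverse-frequency sampling. This is one possible weighting scheme, not something the original paper prescribes: it assumes each pool item already carries a topic-cluster label, and the cluster names and helper functions below are hypothetical.

```python
import random
from collections import Counter


def cluster_weights(pool_clusters: list[str], all_clusters: list[str]) -> dict[str, float]:
    """Weight each cluster inversely to its count in the pool (add-one smoothing)."""
    counts = Counter(pool_clusters)
    raw = {c: 1.0 / (counts[c] + 1) for c in all_clusters}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}


def sample_seeds(
    seeds_by_cluster: dict[str, list[str]],
    pool_clusters: list[str],
    k: int,
    rng: random.Random,
) -> list[str]:
    """Draw k seeds, favoring clusters that are under-represented in the pool."""
    clusters = list(seeds_by_cluster)
    weights = cluster_weights(pool_clusters, clusters)
    chosen = rng.choices(clusters, weights=[weights[c] for c in clusters], k=k)
    return [rng.choice(seeds_by_cluster[c]) for c in chosen]


# Toy pool: heavy on "code", light on "math", no "writing" yet
pool = ["code"] * 9 + ["math"]
weights = cluster_weights(pool, ["code", "math", "writing"])
print(weights)  # "writing" gets the largest weight, "code" the smallest
```

Sampling in-context seeds with these weights nudges each generation round toward topics the pool is missing, instead of reinforcing whatever the model already produces most often.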

Q: How does the Alpaca result change how we think about the relationship between model size and model capability?

Alpaca revealed that much of what makes a model seem "intelligent" in conversation is instruction following and format compliance - not raw parameter count or knowledge. LLaMA-7B already had substantial knowledge of the world from pretraining on billions of tokens. The $500 Alpaca fine-tune taught it how to respond helpfully in a conversational format. This has two important implications: (1) Instruction tuning is very high leverage - a small amount of high-quality SFT data can unlock enormous latent capability that pretraining already built into the model. The model knew things; it just didn't know how to show it. (2) Conversational ability and knowledge are largely separable - you can assess a model's knowledge independently of its conversational style. This insight drove the field's understanding that the "alignment tax" (does RLHF reduce raw capability?) is really about optimizing for different output distributions, not about fundamental capability tradeoffs.

Q: How do you calculate the cost of running Self-Instruct at scale, and what model choices minimize cost without sacrificing quality?

Cost calculation: for each generated task, Self-Instruct makes approximately 4-6 API calls - task generation (shared across 4 tasks), task classification, input generation, output generation. For 50,000 tasks, that's approximately 200,000-300,000 API calls. Average token count per call ranges from 100 (classification) to 600 (task generation with 8 examples in context). At Claude Haiku pricing ($0.25/1M input tokens, $1.25/1M output tokens), 50,000 tasks costs approximately $10-15.
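
The arithmetic above can be captured in a small estimator. This is a sketch under stated assumptions: the per-stage call counts and token averages below are placeholders chosen for illustration, not measurements, and only the two per-million-token prices come from the text. `Stage` and `estimate_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    calls_per_task: float  # 0.25 means one call is shared across 4 tasks
    input_tokens: int      # assumed average prompt tokens per call
    output_tokens: int     # assumed average completion tokens per call


def estimate_cost(
    stages: list[Stage],
    num_tasks: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Total API cost in dollars for num_tasks generated tasks."""
    total_in = num_tasks * sum(s.calls_per_task * s.input_tokens for s in stages)
    total_out = num_tasks * sum(s.calls_per_task * s.output_tokens for s in stages)
    return (total_in / 1_000_000) * input_price_per_m + (total_out / 1_000_000) * output_price_per_m


# Placeholder workload: replace with averages measured on a small pilot run
stages = [
    Stage("task_generation", 0.25, 600, 200),
    Stage("classification", 1.0, 100, 5),
    Stage("input_generation", 1.0, 200, 50),
    Stage("output_generation", 1.0, 200, 80),
]
cost = estimate_cost(stages, num_tasks=50_000, input_price_per_m=0.25, output_price_per_m=1.25)
print(f"Estimated cost: ${cost:.2f}")
```

Output tokens are priced five times higher than input tokens here, so the estimate is most sensitive to the assumed completion lengths.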

Model choice: use Haiku for all three stages - task generation, classification, and instance generation. Haiku is sufficient because these are pattern-completion tasks (generate something like these examples) rather than complex reasoning tasks. Use Opus only for calibration: generate 500 examples with Haiku, rate them with Opus, compute Haiku-Opus agreement. If agreement exceeds 85%, Haiku generation quality is acceptable for your domain. If agreement is below 80%, switch task generation to Opus while keeping classification and instance generation on Haiku - this captures the highest-quality examples while keeping costs manageable (5-10x cost increase vs. all-Haiku, rather than 60x for all-Opus).
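
The calibration step can be sketched as follows. The text does not define "agreement" precisely, so this sketch assumes it means the fraction of examples on which both scorers make the same pass/fail decision at a shared threshold; the 0.7 threshold, the handling of the 80-85% band, and the function names are assumptions made for illustration.

```python
def agreement_rate(
    haiku_scores: list[float],
    opus_scores: list[float],
    pass_threshold: float = 0.7,
) -> float:
    """Fraction of examples where both scorers agree on pass/fail."""
    if len(haiku_scores) != len(opus_scores):
        raise ValueError("score lists must have the same length")
    if not haiku_scores:
        raise ValueError("need at least one scored example")
    matches = sum(
        (h >= pass_threshold) == (o >= pass_threshold)
        for h, o in zip(haiku_scores, opus_scores)
    )
    return matches / len(haiku_scores)


def recommend(agreement: float) -> str:
    """Map an agreement rate to the decision rule described above."""
    if agreement > 0.85:
        return "keep all-Haiku generation"
    if agreement < 0.80:
        return "switch task generation to Opus"
    # The 80-85% band is not specified in the text; flagged for manual review here
    return "borderline: inspect disagreements before deciding"


# Toy scores on a 0.0-1.0 scale (made up for illustration)
haiku = [0.9, 0.8, 0.3, 0.6]
opus = [0.95, 0.5, 0.2, 0.65]
rate = agreement_rate(haiku, opus)
print(rate, recommend(rate))  # → 0.75 switch task generation to Opus
```

In practice the two score lists would come from running both scorers over the same 500 calibration examples.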
