Evol-Instruct: Making Instructions Harder to Make Models Smarter
The Complexity Ceiling Nobody Fixed
It is Tuesday morning, 3 AM in a Seattle apartment, and a senior ML engineer at a mid-sized fintech startup is staring at benchmark results that make no sense. Her team spent six weeks producing a 50,000-example instruction-following dataset. They were methodical about it - diverse tasks covering document summarization, entity extraction, customer intent classification, regulatory question answering, and transaction anomaly explanation. They cleaned the data, deduplicated it, ran quality filters, and got it professionally annotated for correctness. The fine-tuned model tested beautifully on their internal eval harness. Leadership approved the deployment.
The model went live three weeks ago. It handles 80% of queries flawlessly. Users leave positive feedback on the simple requests. But the edge cases - the ones that show up when a compliance officer asks "Walk me through the regulatory capital requirements under Basel III Pillar 2 as they apply to our trading book positions, and explain whether our current tier-1 capital buffer provides adequate protection against a correlated stress scenario" - these produce responses that look correct at first glance but are confidently shallow. The model mentions tier-1 capital and Basel III. It gives a response that a non-expert would accept. But the compliance officer, who asked the question, can immediately see it missed the actual analysis. The model was never trained on examples that complex. It learned to produce plausible-sounding responses at the difficulty level it saw in training, and nothing more.
This is not a bug. It is a fundamental property of supervised fine-tuning. A model learns the distribution of its training data - not just the format and the topics, but the difficulty distribution. When your training data contains 45,000 easy-to-medium examples and 5,000 hard examples, the model learns to handle easy-to-medium questions well and hard questions poorly. The gradient signal from hard examples is small (they're rare) and often overwhelmed by the easy examples. You can scale this dataset to 500,000 examples and still hit the same ceiling if the difficulty distribution doesn't change.
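To make that arithmetic concrete, here is a minimal sketch of how many hard examples an average mini-batch contains for the 45,000/5,000 split above, and what happens when the dataset grows tenfold at the same ratio. The batch size of 32 and the function name are illustrative assumptions. The calculation treats batches as uniform random samples and says nothing about per-example gradient magnitudes.

```python
def expected_hard_per_batch(n_easy_medium: int, n_hard: int, batch_size: int = 32) -> float:
    """Expected number of hard examples in a uniformly sampled mini-batch."""
    hard_fraction = n_hard / (n_easy_medium + n_hard)
    return hard_fraction * batch_size


# The 50,000-example dataset from the scenario: 10% hard.
small = expected_hard_per_batch(45_000, 5_000)
# Ten times more data with the same difficulty mix: still 10% hard.
large = expected_hard_per_batch(450_000, 50_000)

print(f"50K dataset:  {small:.1f} hard examples per batch of 32")
print(f"500K dataset: {large:.1f} hard examples per batch of 32")
```

Both lines report the same share, which is the point of the paragraph above: more data at the same difficulty mix does not change what the model sees in a typical update.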
Evol-Instruct was designed specifically to break this ceiling. The paper introducing WizardLM (Xu et al., 2023) started from Alpaca's 52K instructions, evolved them over four rounds into roughly 250K instructions, and fine-tuned LLaMA-7B on a 70K sample of the result. WizardLM-7B outperformed ChatGPT (GPT-3.5-turbo) in human evaluation on 52.7% of test questions - a model with roughly 1/25th the parameters (taking the commonly cited but unconfirmed ~175B estimate for GPT-3.5), fine-tuned on systematically evolved synthetic data, beating a much larger closed model on complex instruction following. The key was not scale. The key was deliberately engineering the difficulty distribution of the training data.
Why Self-Instruct Wasn't Enough
Self-Instruct (covered in the previous lesson) solves the breadth problem. Given 175 seed tasks, it generates thousands of novel tasks covering many different topics and task types - summarization, classification, code generation, question answering, creative writing, and more. The diversity is genuine and valuable. But Self-Instruct doesn't control depth.
When you ask a language model to "come up with a new task," it generates tasks it can comfortably describe and answer. It gravitates toward tasks it has seen many examples of during pretraining. It doesn't naturally generate tasks that push the boundary of its own capability, because those tasks are hard to describe, hard to evaluate, and rare in the pretraining distribution. The result is a dataset with a bell-curve difficulty distribution centered around medium. As a rough illustration, easy tasks might make up about 30% of such a dataset and medium tasks about 60%. Genuinely hard tasks - ones requiring expert-level reasoning, multi-step analysis, or constraint satisfaction across multiple competing requirements - make up maybe 10%.
The consequence is exactly what the engineer in our opening scenario experienced. Models trained heavily on medium-difficulty examples get very good at medium-difficulty tasks. Hard examples are rare during training, so the model never develops the deep reasoning patterns those tasks require. The training loss is efficiently minimized on easy and medium tasks - hard examples contribute a small fraction of the total gradient signal. The model learns to produce plausible-sounding responses at medium complexity, and it does this even for hard questions because that's its mode.
Evol-Instruct flips this by systematically evolving instructions upward in complexity. Starting from a seed instruction, it applies one of five transformation operators to produce a harder version. It repeats this process across multiple rounds. The result is a dataset where hard examples are abundant and the difficulty distribution is deliberately skewed toward the upper tail. The fine-tuned model sees many hard examples, develops the reasoning patterns they require, and can actually engage with complex queries rather than producing eloquent-sounding placeholders.
The Five Evolution Operators
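Evol-Instruct, as presented in this lesson, uses five operators that transform an existing instruction into a harder or more diverse version. Four are "in-depth" operators (making the instruction harder), and one is an "in-breadth" operator (creating a new related task). The original paper also lists a fifth in-depth operator, complicating the input, which is not covered here.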
In-Depth Evolution: Making It Harder
Operator 1: Add Constraints
Add specific requirements, limitations, conditions, or edge cases that must all be satisfied simultaneously.
Original: "Write a sorting algorithm."
Evolved: "Write a sorting algorithm in Python that:
- Achieves O(n log n) worst-case time complexity
- Uses O(1) auxiliary space (in-place sort, no additional arrays)
- Handles arrays with up to 30% duplicate values without performance degradation
- Is stable (preserves relative order of equal elements)
- Includes unit tests using pytest for at least 5 edge cases:
empty array, single element, all-duplicates, reverse-sorted, and alternating high-low
Explain the algorithmic choices that make each requirement achievable."
The power of this operator: it forces multi-constraint satisfaction. Any reasonable implementation can sort. Sorting in-place is harder. Sorting in-place while being stable is harder still. Add performance requirements and test coverage and you've created an example that requires genuine algorithmic reasoning.
Operator 2: Deepen
Require more specialized expertise, mathematical proofs, theoretical foundations, or formal analysis.
Original: "Explain recursion."
Evolved: "Explain the mathematical induction principle that underlies the correctness
of recursive algorithms. Then:
1. Prove using induction that recursive Fibonacci has O(2^n) time complexity by
establishing the recurrence relation T(n) = T(n-1) + T(n-2) + O(1) and solving it
2. Prove that memoization reduces this to O(n) through amortized analysis
3. Analyze the trade-off between call stack depth (O(n) space) and heap memory in
iterative implementations, including the stack overflow threshold calculation
4. Identify production conditions under which tail-call optimization applies and
whether Python's CPython implementation supports it"
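The complexity claims in the evolved instruction can be checked empirically. The sketch below counts calls for plain and memoized recursive Fibonacci. The function names and the explicit call counters are illustrative assumptions, and counting calls is only a sanity check, not the inductive proof the instruction asks for.

```python
def fib_naive(n: int, counter: list[int]) -> int:
    """Plain recursion; counter[0] tracks how many calls are made."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_memo(n: int, cache: dict[int, int], counter: list[int]) -> int:
    """Memoized recursion; counter[0] tracks cache misses (real computations)."""
    if n in cache:
        return cache[n]
    counter[0] += 1
    if n < 2:
        cache[n] = n
    else:
        cache[n] = fib_memo(n - 1, cache, counter) + fib_memo(n - 2, cache, counter)
    return cache[n]


naive_calls, memo_calls = [0], [0]
naive_result = fib_naive(20, naive_calls)
memo_result = fib_memo(20, {}, memo_calls)

print(f"fib(20) = {naive_result} (naive), {memo_result} (memoized)")
print(f"naive calls:    {naive_calls[0]}")  # grows exponentially with n
print(f"memoized calls: {memo_calls[0]}")   # grows linearly with n
```

For n = 20 the naive version makes 21,891 calls while the memoized version computes each of the 21 distinct subproblems once. A strong response to the evolved instruction would explain this gap formally.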
Operator 3: Concretize
Replace abstract descriptions with specific, concrete scenarios and real numbers that require actual computation or analysis.
Original: "Explain machine learning overfitting."
Evolved: "A 5-layer MLP trained on 1,000 labeled samples achieves 98.2% training
accuracy but only 61.4% validation accuracy on a 200-sample validation set.
Training loss stopped decreasing after epoch 47; validation loss started increasing
at epoch 23. Walk through:
1. How to distinguish overfitting from train/validation distribution mismatch
using these specific learning curves
2. The exact regularization interventions to try, with numerical hyperparameter
ranges: L2 lambda values, dropout rates, early stopping patience windows
3. The learning curves you expect to see after each intervention if overfitting
is the actual cause
4. The stopping criterion: how do you know when you've found the right level of
regularization vs. just found a different overfitting regime?"
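One of the interventions named in the evolved instruction, early stopping with a patience window, can be sketched in a few lines. The loss values below are synthetic (made up to reach their minimum at epoch 23, loosely mirroring the scenario), and the function name and patience of 5 are illustrative assumptions.

```python
def early_stopping_epoch(val_losses: list[float], patience: int) -> tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)


# Synthetic curve: falls steadily through epoch 23, then rises through epoch 47.
falling = [1.0 - 0.02 * e for e in range(1, 24)]
rising = [0.54 + 0.01 * e for e in range(1, 25)]
val_losses = falling + rising

best, stopped = early_stopping_epoch(val_losses, patience=5)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
```

With these synthetic values the checkpoint to keep is epoch 23, and training halts at epoch 28 instead of running to epoch 47.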
Operator 4: Increase Reasoning Steps
Require more intermediate logical steps that depend on each other's outputs.
Original: "What is 12% of 350?"
Evolved: "A retail store is running three sequential promotions. Loyalty members
receive a 12% discount. Clearance items receive an additional 8% discount applied
to the post-loyalty price (not the original). State tax of 9.5% applies to the
final discounted price. A $5 coupon applies before tax but after both discounts.
A loyalty member buys a clearance item originally priced at $350.
- Show all intermediate calculations
- Explain why operation order matters for compound discounts
- Calculate the total discount percentage vs. original price
- Show why applying the $5 coupon before vs. after the percentage discounts
produces different final totals"
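A reference answer for the evolved instruction can be worked out mechanically, which is useful when spot-checking generated responses. The sketch below follows the order stated in the instruction (loyalty, clearance, coupon, then tax). The use of `Decimal`, rounding to cents only at the end, and the variable names are assumptions of this example.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_cents(amount: Decimal) -> Decimal:
    """Round a dollar amount to the nearest cent."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


original_price = Decimal("350")
loyalty, clearance = Decimal("0.12"), Decimal("0.08")
tax_rate, coupon = Decimal("0.095"), Decimal("5")

# Order stated in the instruction: loyalty -> clearance -> coupon -> tax.
after_loyalty = original_price * (1 - loyalty)       # 308.00
after_clearance = after_loyalty * (1 - clearance)    # 283.36
pre_tax = after_clearance - coupon                   # 278.36
total = to_cents(pre_tax * (1 + tax_rate))           # 304.80

# Effective pre-tax discount relative to the original price.
discount_pct = (original_price - pre_tax) / original_price * 100

# Counterfactual: take the $5 coupon off first, then apply both percentages.
coupon_first_pre_tax = (original_price - coupon) * (1 - loyalty) * (1 - clearance)

print(f"pre-tax price:        ${to_cents(pre_tax)}")
print(f"total with tax:       ${total}")
print(f"effective discount:   {discount_pct:.2f}%")
print(f"coupon-first pre-tax: ${to_cents(coupon_first_pre_tax)}")
```

The two percentage discounts compound to 19.04%, not 20%. Applying the coupon before the percentages shrinks its value from $5.00 to about $4.05, so the pre-tax price rises from $278.36 to $279.31.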
In-Breadth Evolution: Creating Variety
Operator 5: Mutate
Create a new, related but different task in the same domain - maintaining topic coverage while adding novel tasks that test different skills.
Original: "Write a sorting algorithm."
Mutated: "Design a data structure that maintains a sorted collection and supports:
- O(log n) amortized insertion, deletion, and point search
- O(k + log n) range queries returning all elements between values A and B
- O(1) minimum and maximum retrieval at any time
Justify why a balanced BST or skip list is preferred over maintaining a sorted
array with binary search for this combined operation set, with concrete worst-case
analysis for each operation on each data structure."
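For context, this is roughly what the sorted-array baseline named in the mutated instruction looks like. The sketch is built on Python's standard `bisect` module; the class and method names are hypothetical. It meets the search, range-query, and min/max bounds, but insertion is O(n) because the list must shift elements, which is the gap a balanced BST or skip list closes.

```python
import bisect


class SortedArray:
    """Baseline sorted collection backed by a plain Python list."""

    def __init__(self) -> None:
        self._items: list[int] = []

    def insert(self, value: int) -> None:
        # Binary search finds the slot in O(log n), but the list insert
        # shifts later elements, so the whole operation is O(n).
        bisect.insort(self._items, value)

    def contains(self, value: int) -> bool:
        i = bisect.bisect_left(self._items, value)  # O(log n)
        return i < len(self._items) and self._items[i] == value

    def range_query(self, low: int, high: int) -> list[int]:
        # O(log n) to locate both ends, O(k) to copy the k results.
        start = bisect.bisect_left(self._items, low)
        end = bisect.bisect_right(self._items, high)
        return self._items[start:end]

    def minimum(self) -> int:
        return self._items[0]   # O(1)

    def maximum(self) -> int:
        return self._items[-1]  # O(1)


s = SortedArray()
for v in [42, 7, 19, 7, 88, 3]:
    s.insert(v)

print(s.range_query(7, 42))
print(s.minimum(), s.maximum())
```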
[Diagram: all five operators applied to the same seed instruction]
Complete Evol-Instruct Implementation
The following implementation covers the full Evol-Instruct pipeline: operator application, response generation, elimination filtering, multi-round evolution, and output serialization. It uses claude-haiku-4-5-20251001 for evolution (cheap, fast, sufficient for instruction transformation). For response generation it uses claude-sonnet-4-6 in round 1 to control cost and claude-opus-4-6 from round 2 onward, where the harder evolved instructions need higher-quality responses.
import anthropic
import json
import random
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time
client = anthropic.Anthropic()


class EvolutionOperator(Enum):
    ADD_CONSTRAINTS = "add_constraints"
    DEEPEN = "deepen"
    CONCRETIZE = "concretize"
    INCREASE_REASONING = "increase_reasoning"
    MUTATE = "mutate"


@dataclass
class EvolvedInstruction:
    original: str
    evolved: str
    operator: EvolutionOperator
    evolution_round: int
    response: str = ""
    passed_filter: bool = False
    filter_reason: str = ""
    metadata: dict = field(default_factory=dict)
EVOLUTION_PROMPTS = {
EvolutionOperator.ADD_CONSTRAINTS: """You are an instruction rewriter.
Make the following instruction harder by adding specific constraints, requirements,
and conditions that must ALL be satisfied simultaneously.
Original instruction: {instruction}
Rewrite it to add:
- Format or language constraints (e.g., "must use Python 3.10+", "under 200 words")
- Performance or complexity requirements (e.g., "must run in O(n log n)")
- Scope limitations that narrow the solution space significantly
- Edge cases that must be explicitly handled in the implementation
- Verification requirements (tests, proofs, or examples)
The evolved instruction must still be a single, answerable task.
Output ONLY the rewritten instruction, no explanation.""",
EvolutionOperator.DEEPEN: """You are an instruction rewriter.
Make the following instruction require deeper expertise and more rigorous reasoning.
Original instruction: {instruction}
Rewrite it to require:
- Expert-level domain knowledge (proofs, formal definitions, mathematical foundations)
- Analysis of edge cases, failure modes, or theoretical limits
- Comparison with alternative approaches and rigorous justification for choices
- Formal complexity analysis or derivation of key properties
- Precise technical terminology and explicit reasoning chains
The evolved instruction should be answerable only by a genuine domain expert.
Output ONLY the rewritten instruction, no explanation.""",
EvolutionOperator.CONCRETIZE: """You are an instruction rewriter.
Make the following instruction more concrete by replacing vague descriptions
with precise real-world scenarios and specific numbers.
Original instruction: {instruction}
Rewrite it to:
- Replace generic examples with specific scenarios including concrete numbers and scale
- Make abstract operations into specific step-by-step processes with measurable outcomes
- Add realistic context (specific technologies, dataset sizes, system constraints)
- Require working with specific inputs and producing verifiable, calculable outputs
- Include decision criteria that can only be answered with the specific numbers given
Output ONLY the rewritten instruction, no explanation.""",
EvolutionOperator.INCREASE_REASONING: """You are an instruction rewriter.
Make the following instruction require significantly more chained reasoning steps.
Original instruction: {instruction}
Rewrite it to:
- Chain multiple sub-problems where each depends on the previous answer
- Require synthesis of concepts from multiple domains or abstraction levels
- Add intermediate verification or validation steps that must be shown
- Include conditional logic branches that must be reasoned through
- Require the answer to consider at least 5 distinct logical steps
The instruction should require 5+ distinct, sequentially-dependent reasoning steps.
Output ONLY the rewritten instruction, no explanation.""",
EvolutionOperator.MUTATE: """You are an instruction creator.
Create a NEW instruction in the same domain as the given instruction,
but testing meaningfully different skills and knowledge.
Original instruction (for domain context only): {instruction}
Create a NEW instruction (not a rewrite) that:
- Is in the same general domain or technical field
- Tests different specific skills, algorithms, or concepts
- Has similar or slightly higher complexity level
- Would require meaningfully different expertise to answer well
- Covers a gap or adjacent area in the domain
Output ONLY the new instruction, no explanation.""",
}
RESPONSE_GENERATION_PROMPT = """You are an expert providing a comprehensive, accurate,
technically rigorous response to the following instruction. Provide a thorough,
high-quality answer that demonstrates genuine domain expertise.
Instruction: {instruction}
Response:"""
FAILURE_INDICATORS = [
"i cannot", "i can't", "as an ai", "i'm unable to",
"i don't have the ability", "i apologize, but i cannot",
"this is not possible", "the rewritten instruction is the same",
"i'm sorry, but", "i must decline", "this request is",
"i cannot fulfill", "as a language model",
]
def apply_evolution_operator(
    instruction: str,
    operator: EvolutionOperator,
    round_num: int,
    model: str = "claude-haiku-4-5-20251001"
) -> Optional[EvolvedInstruction]:
    """
    Apply one evolution operator to produce a harder version of an instruction.

    Uses claude-haiku for evolution (cheap, fast - evolution is a text
    transformation task that doesn't require maximum quality).
    Uses claude-opus-4-6 for response generation (harder questions need
    higher-quality responses to produce useful training signal).
    """
    prompt = EVOLUTION_PROMPTS[operator].format(instruction=instruction)

    response = client.messages.create(
        model=model,
        max_tokens=600,
        temperature=0.8,  # Higher temp for creative evolution
        messages=[{"role": "user", "content": prompt}]
    )
    evolved_text = response.content[0].text.strip()

    # Clean up model artifacts that occasionally appear
    for prefix in [
        "Rewritten instruction:", "New instruction:",
        "Evolved instruction:", "Output:", "Here is",
        "Here's the", "The evolved instruction:"
    ]:
        if evolved_text.lower().startswith(prefix.lower()):
            evolved_text = evolved_text[len(prefix):].strip()

    return EvolvedInstruction(
        original=instruction,
        evolved=evolved_text,
        operator=operator,
        evolution_round=round_num,
        metadata={
            "evolution_model": model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        }
    )
def generate_response_for_evolved(
    instruction: str,
    model: str = "claude-opus-4-6"
) -> str:
    """
    Generate a high-quality response for an evolved instruction.

    Use a stronger model than evolution because evolved instructions are
    harder - a weaker model will produce surface-level responses that
    don't reflect the actual reasoning required. Bad responses to hard
    instructions are worse training data than no data at all.
    """
    prompt = RESPONSE_GENERATION_PROMPT.format(instruction=instruction)

    response = client.messages.create(
        model=model,
        max_tokens=2500,
        temperature=0.2,  # Low temp for reliable, factual responses
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
def passes_elimination_filter(
    original: str,
    evolved: str,
    response: str
) -> tuple[bool, str]:
    """
    Apply the elimination filter from the Evol-Instruct paper.

    Three failure modes cause rejection:
    1. Failure indicators in evolved instruction or response (model refused)
    2. Copy problem: evolved is too similar to or identical with original
    3. Response too short to be meaningful (model couldn't engage)

    Returns (passes: bool, reason: str).
    """
    evolved_lower = evolved.lower()
    response_lower = response.lower()

    # Failure indicators in the evolved instruction (evolution failed)
    for indicator in FAILURE_INDICATORS:
        if indicator in evolved_lower:
            return False, f"failure_in_evolution: contains '{indicator}'"

    # Copy problem: evolution had no meaningful effect
    if evolved.strip() == original.strip():
        return False, "copy_problem: evolved is identical to original"

    # Near-copy: evolved is only trivially different (small changes)
    words_orig = set(original.lower().split())
    words_evol = set(evolved.lower().split())
    if len(words_orig) > 0:
        overlap = len(words_orig & words_evol) / len(words_orig)
        if overlap > 0.90 and len(evolved) < len(original) * 1.15:
            return False, f"near_copy_problem: {overlap:.2f} word overlap ratio"

    # Length sanity check on evolved instruction
    if len(evolved.split()) < 10:
        return False, f"evolved_too_short: {len(evolved.split())} words"

    # Evolution that produces something longer than ~300 words is
    # often collapsed (incomprehensible constraint accumulation)
    if len(evolved.split()) > 300:
        return False, f"evolved_too_long: {len(evolved.split())} words (likely collapse)"

    # Length check on response: too short means the model couldn't engage
    if len(response.split()) < 30:
        return False, f"response_too_short: {len(response.split())} words"

    # Failure indicators in the response
    for indicator in FAILURE_INDICATORS:
        if indicator in response_lower:
            return False, f"failure_in_response: contains '{indicator}'"

    return True, "passed"
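The near-copy check is the least obvious part of the filter, so here it is in isolation. This is a standalone sketch that repeats the same overlap arithmetic as the filter above; the helper name `word_overlap_ratio` and the sample instructions are made up for illustration.

```python
def word_overlap_ratio(original: str, evolved: str) -> float:
    """Fraction of the original's distinct words that also appear in evolved."""
    words_orig = set(original.lower().split())
    words_evol = set(evolved.lower().split())
    if not words_orig:
        return 0.0
    return len(words_orig & words_evol) / len(words_orig)


original = "Write a function to flatten a nested list."
near_copy = "Now write a function to flatten a nested list."
evolved = ("Implement an iterative generator that flattens arbitrarily nested "
           "lists and tuples without recursion, and handles depth over 10,000.")

results = {}
for name, candidate in (("near_copy", near_copy), ("evolved", evolved)):
    ratio = word_overlap_ratio(original, candidate)
    # Same thresholds as the filter: high overlap AND barely any added length.
    flagged = ratio > 0.90 and len(candidate) < len(original) * 1.15
    results[name] = (ratio, flagged)
    print(f"{name:10s} overlap={ratio:.2f} rejected_as_near_copy={flagged}")
```

Because the split is on whitespace, punctuation stays attached to words (`list.` and `list,` count as different tokens). The ratio is therefore a coarse signal, not a precise similarity measure.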
Running the Full Multi-Round Pipeline
The multi-round evolution loop is where the difficulty compounding happens. Round 1 evolves seeds into medium-hard examples. Round 2 evolves those into hard examples. Round 3 evolves hard examples into very-hard examples. The key is to seed each subsequent round from the accepted examples of the previous round.
def run_evol_instruct(
    seed_instructions: list[str],
    n_evolution_rounds: int = 3,
    evolutions_per_instruction: int = 2,
    mutate_probability: float = 0.25,
    output_path: str = "evol_instruct_dataset.jsonl",
    verbose: bool = True
) -> list[dict]:
    """
    Run the complete Evol-Instruct pipeline.

    Each round evolves the ACCEPTED examples from the previous round,
    progressively building harder and harder instructions.

    Args:
        seed_instructions: Starting instruction set
        n_evolution_rounds: How many rounds to evolve (3-5 typical)
        evolutions_per_instruction: How many operators to apply per instruction
        mutate_probability: Probability of also applying MUTATE for breadth
        output_path: Output JSONL file for fine-tuning
        verbose: Print progress

    Strategy:
        - Round 1: seeds → medium-hard (difficulty bump: ~1.5x)
        - Round 2: medium-hard → hard (difficulty bump: ~2x)
        - Round 3: hard → very hard (difficulty bump: ~3x)
        - Mutate ~25% of instructions each round to maintain breadth

    Returns:
        List of all accepted (instruction, response) pairs across all rounds
    """
    all_examples = []
    current_pool = list(seed_instructions)
    stats = {
        "attempted": 0, "accepted": 0,
        "copy_problem": 0, "failure_indicator": 0,
        "short_response": 0, "too_long": 0,
    }
    in_depth_operators = [
        EvolutionOperator.ADD_CONSTRAINTS,
        EvolutionOperator.DEEPEN,
        EvolutionOperator.CONCRETIZE,
        EvolutionOperator.INCREASE_REASONING,
    ]

    if verbose:
        print(f"Evol-Instruct starting:")
        print(f" Seeds: {len(seed_instructions)}")
        print(f" Rounds: {n_evolution_rounds}")
        print(f" Operators per instruction: {evolutions_per_instruction}")
        print(f" Mutate probability: {mutate_probability:.0%}")
    for round_num in range(1, n_evolution_rounds + 1):
        if verbose:
            print(f"\n=== Round {round_num}/{n_evolution_rounds} "
                  f"({len(current_pool)} instructions) ===")

        next_round_pool = []

        for i, instruction in enumerate(current_pool):
            if verbose and i % 10 == 0:
                print(f" [{i+1}/{len(current_pool)}] "
                      f"{instruction[:55]}...")

            # Select in-depth evolution operators for this instruction
            k = min(evolutions_per_instruction, len(in_depth_operators))
            operators = random.sample(in_depth_operators, k)

            # Probabilistically add MUTATE for breadth
            if random.random() < mutate_probability:
                operators.append(EvolutionOperator.MUTATE)

            for op in operators:
                stats["attempted"] += 1

                # Stage 1: Evolve the instruction (cheap, fast model)
                evolved = apply_evolution_operator(
                    instruction, op, round_num
                )
                if not evolved:
                    continue

                # Stage 2: Generate response (expensive, strong model)
                # Only use Opus for rounds 2+ to control cost
                response_model = (
                    "claude-opus-4-6" if round_num >= 2
                    else "claude-sonnet-4-6"
                )
                response = generate_response_for_evolved(
                    evolved.evolved, model=response_model
                )
                evolved.response = response

                # Stage 3: Apply elimination filter
                passed, reason = passes_elimination_filter(
                    instruction, evolved.evolved, response
                )
                evolved.passed_filter = passed
                evolved.filter_reason = reason

                if passed:
                    example = {
                        "instruction": evolved.evolved,
                        "input": "",
                        "output": response,
                        "evolution_operator": op.value,
                        "evolution_round": round_num,
                        "original_instruction": instruction,
                        "metadata": evolved.metadata,
                    }
                    all_examples.append(example)
                    next_round_pool.append(evolved.evolved)
                    stats["accepted"] += 1
                    if verbose:
                        print(f" PASS [{op.value:20s}]: "
                              f"{evolved.evolved[:45]}...")
                else:
                    # Categorize the failure type for diagnostics
                    if "copy_problem" in reason:
                        stats["copy_problem"] += 1
                    elif "failure_in" in reason:
                        stats["failure_indicator"] += 1
                    elif "short" in reason:
                        stats["short_response"] += 1
                    elif "too_long" in reason:
                        stats["too_long"] += 1
                    if verbose:
                        print(f" FAIL [{reason[:38]}]")

                time.sleep(0.1)  # Rate limit headroom

        # Accepted examples from this round seed the next round
        current_pool = next_round_pool

        if verbose:
            acc_rate = stats["accepted"] / max(stats["attempted"], 1)
            print(f" Pool for next round: {len(next_round_pool)}")
            print(f" Running acceptance rate: {acc_rate:.0%}")
    # Serialize output
    with open(output_path, "w") as f:
        for example in all_examples:
            f.write(json.dumps(example) + "\n")

    if verbose:
        total = stats["attempted"]
        accepted = stats["accepted"]
        # Guard against division by zero when no evolutions were attempted
        accepted_pct = accepted / max(total, 1) * 100
        print(f"\nEvol-Instruct complete:")
        print(f" Total attempted: {total}")
        print(f" Accepted: {accepted} ({accepted_pct:.0f}%)")
        print(f" Copy problem: {stats['copy_problem']}")
        print(f" Failure indicators: {stats['failure_indicator']}")
        print(f" Short responses: {stats['short_response']}")
        print(f" Too long (collapse): {stats['too_long']}")
        print(f" Saved to: {output_path}")

    return all_examples
# Example: Evolve a minimal set of coding instructions
CODING_SEEDS = [
"Write a function to find the maximum element in a list.",
"Implement a stack data structure.",
"Write a function that checks if a string is a palindrome.",
"Create a function to count word frequencies in a text.",
"Implement a binary search algorithm.",
"Write a function to flatten a nested list.",
"Implement a simple LRU cache.",
"Write a function to validate a JSON schema.",
]
if __name__ == "__main__":
    results = run_evol_instruct(
        seed_instructions=CODING_SEEDS,
        n_evolution_rounds=3,
        evolutions_per_instruction=2,
        mutate_probability=0.25,
    )
    print(f"\nGenerated {len(results)} training examples "
          f"from {len(CODING_SEEDS)} seeds")
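Before launching a run it helps to estimate how the pool grows, because every attempt costs two API calls (one evolution, one response). The sketch below models the loop above in expectation. The 60% acceptance rate is a hypothetical placeholder, not a measured figure, and the function name is an assumption of this example.

```python
def expected_pool_sizes(
    n_seeds: int,
    rounds: int = 3,
    evolutions_per_instruction: int = 2,
    mutate_probability: float = 0.25,
    acceptance_rate: float = 0.6,  # hypothetical; measure this on your own runs
) -> list[tuple[int, float, float]]:
    """Return (round, expected attempts, expected accepted) for each round."""
    # Each instruction gets k in-depth operators plus MUTATE with probability p.
    attempts_per_instruction = evolutions_per_instruction + mutate_probability
    pool = float(n_seeds)
    rows = []
    for round_num in range(1, rounds + 1):
        attempts = pool * attempts_per_instruction
        accepted = attempts * acceptance_rate
        rows.append((round_num, attempts, accepted))
        pool = accepted  # accepted examples seed the next round
    return rows


rows = expected_pool_sizes(n_seeds=8)
for round_num, attempts, accepted in rows:
    print(f"round {round_num}: ~{attempts:.1f} attempts, ~{accepted:.1f} accepted")
```

Under these assumptions the pool grows by a factor of 2.25 × 0.6 = 1.35 per round. If the acceptance rate falls below about 44% (1 / 2.25), the pool shrinks each round instead of growing.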
WizardLM Results and Their Implications
The original WizardLM paper started from the 52K Alpaca instructions, ran four rounds of Evol-Instruct to produce roughly 250K evolved instructions, and fine-tuned LLaMA-7B on a 70K sample of them. The results fundamentally challenged a prevailing assumption: that performance scales primarily with parameter count.
| Model | Parameters | Training Data | Hard Task Win Rate |
|---|---|---|---|
| Alpaca-7B | 7B | 52K simple instructions (Self-Instruct) | 29.0% vs ChatGPT |
| Vicuna-13B | 13B | ShareGPT conversation data | 41.6% vs ChatGPT |
| WizardLM-7B | 7B | 70K evolved instructions (4 rounds) | 52.7% vs ChatGPT |
| WizardLM-13B | 13B | 70K evolved instructions (4 rounds) | 58.9% vs ChatGPT |
| ChatGPT (GPT-3.5-turbo) | ~175B est. | RLHF with human feedback | - (baseline) |
WizardLM-7B beat ChatGPT on more than half of questions in human evaluation - a model with roughly 1/25th the estimated parameter count, fine-tuned on synthetic data alone. The result suggests that, for specific capability domains, instruction complexity during fine-tuning can matter more than scale.
The practical implication: if you're fine-tuning a 7B model on simple instructions and finding it fails on complex user queries, the answer is almost certainly the difficulty distribution of your training data. The model cannot learn reasoning patterns it was never exposed to. Adding more simple examples won't fix this. Systematically evolving instructions toward higher complexity will.
Domain-Specific Evolution: WizardCoder
WizardCoder (Luo et al., 2023) applied Evol-Instruct specifically to code generation, achieving state-of-the-art HumanEval scores among open-source code models at the time. The key insight: domain-specific evolution operators outperform generic ones because they encode what makes problems in that domain genuinely harder.
import anthropic
import random
client = anthropic.Anthropic()
# Domain-specific operators for coding evolution
# These are more effective than generic operators for code tasks because
# they target the specific dimensions that make coding problems harder
CODING_EVOLUTION_OPERATORS = {
"add_functionality": """Rewrite this coding instruction to require
additional features, proper error handling, and type safety.
Original: {instruction}
Add requirements for:
- Input validation for at least 2 edge cases with specific error messages
- Full type hints following PEP 484 conventions
- A docstring following Google style with Args, Returns, Raises sections
- At least one additional helper function that improves modularity
- pytest unit tests covering the core logic and the error cases
Output ONLY the evolved coding instruction.""",
"add_algorithmic_complexity": """Rewrite this coding instruction to require
a more sophisticated algorithmic approach with complexity guarantees.
Original: {instruction}
Evolve it to require:
- A specific stated time complexity guarantee (e.g., O(n log n) worst-case)
- A specific stated space complexity constraint (e.g., O(1) extra space)
- Handling of at least 3 specified edge cases with correct behavior defined
- Performance analysis comment in the docstring with big-O justification
- A comparison to the naive approach and why the improved approach is necessary
Output ONLY the evolved coding instruction.""",
"add_production_concerns": """Rewrite this coding instruction to be
more realistic and production-oriented.
Original: {instruction}
Add production requirements:
- Thread-safety: the implementation must be safe for concurrent use
- Memory efficiency at scale: specify handling of 1M+ item inputs
- Proper logging using Python's logging module at appropriate levels
- Configuration via a dataclass with sensible defaults
- Public API designed for extensibility (open/closed principle)
- A brief usage example in the docstring showing real-world context
Output ONLY the evolved coding instruction.""",
"add_comprehensive_testing": """Rewrite this coding instruction to require
comprehensive, production-quality testing alongside the implementation.
Original: {instruction}
Require:
- Property-based tests using hypothesis for at least 2 invariant properties
- Integration tests demonstrating end-to-end behavior with realistic data
- Benchmark comparing performance to the naive implementation using timeit
- Coverage of all edge cases: empty input, single element, maximum feasible size
- A test that intentionally triggers each error condition and verifies
the error message and type
Output ONLY the evolved coding instruction.""",
}
def evolve_coding_instruction(
    instruction: str,
    operator_name: str = None,
    model: str = "claude-haiku-4-5-20251001"
) -> str:
    """
    Apply a domain-specific coding evolution operator.

    Domain-specific operators produce better results than generic operators
    because they know that for code problems, what makes them harder is:
    - Algorithmic constraints (not just "be more complex")
    - Production concerns (not just "add requirements")
    - Testing rigor (not just "verify your answer")

    Args:
        instruction: The coding instruction to evolve
        operator_name: Which operator to apply (random if None)
        model: Model for evolution (haiku is sufficient for transformations)

    Returns:
        Evolved instruction string
    """
    if operator_name is None:
        operator_name = random.choice(list(CODING_EVOLUTION_OPERATORS.keys()))

    prompt = CODING_EVOLUTION_OPERATORS[operator_name].format(
        instruction=instruction
    )
    response = client.messages.create(
        model=model,
        max_tokens=500,
        temperature=0.7,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
def generate_coding_response(
    instruction: str,
    model: str = "claude-opus-4-6"
) -> str:
    """
    Generate a production-quality coding response.

    For coding tasks, use the strongest model available.
    A wrong implementation to a hard coding problem is an actively
    harmful training example - the model learns incorrect patterns.
    """
    system = """You are a senior software engineer providing comprehensive,
correct, production-quality code responses. Every response must include:
1. Working, tested code with proper error handling and type hints
2. Clear comments explaining non-obvious algorithmic decisions
3. Time and space complexity analysis with justification
4. At least 2 concrete usage examples with expected outputs
5. A note on any important edge cases handled or explicitly excluded"""

    response = client.messages.create(
        model=model,
        max_tokens=3000,
        system=system,
        temperature=0.1,  # Very low temp - code must be correct
        messages=[{"role": "user", "content": instruction}]
    )
    return response.content[0].text.strip()
def validate_coding_response(response: str) -> bool:
    """
    Domain-specific quality check: verify response contains actual code.

    The generic elimination filter won't catch "long response with no code."
    """
    # Must contain at least one code block or inline Python
    if "```" in response and ("python" in response.lower() or
                              "def " in response or
                              "class " in response):
        return True

    # Fallback: contains Python keywords indicating actual implementation
    python_indicators = [
        "def ", "class ", "return ", "import ",
        "for ", "while ", "if __name__"
    ]
    return sum(1 for ind in python_indicators if ind in response) >= 3
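The keyword check above confirms that code is present, not that it is well-formed. A stricter variant, sketched here as one possible extension and not as part of the WizardCoder method, extracts fenced blocks and keeps only those that parse with Python's standard `ast` module. The helper name and regex are assumptions of this example. Parsing catches syntax errors only; it does not show that the code is correct.

```python
import ast
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly
# Captures the optional language tag and the body of each fenced block.
CODE_BLOCK_RE = re.compile(FENCE + r"([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_parsable_blocks(response: str) -> list[str]:
    """Return fenced blocks (untagged or tagged python/py) that parse as Python."""
    parsable = []
    for lang, body in CODE_BLOCK_RE.findall(response):
        if lang.lower() not in ("", "python", "py"):
            continue  # skip shell, JSON, and other non-Python blocks
        try:
            ast.parse(body)
        except SyntaxError:
            continue  # syntactically broken code is rejected
        parsable.append(body)
    return parsable


good = "Here you go:\n" + FENCE + "python\ndef f(x):\n    return x + 1\n" + FENCE
bad = "Here you go:\n" + FENCE + "python\ndef f(x)\n    return x + 1\n" + FENCE
shell = "Run:\n" + FENCE + "bash\npip install pytest\n" + FENCE

print(len(extract_parsable_blocks(good)))   # block parses
print(len(extract_parsable_blocks(bad)))    # missing colon, rejected
print(len(extract_parsable_blocks(shell)))  # not Python, skipped
```

A response could then be accepted only if `extract_parsable_blocks` returns at least one block. Running the extracted code or its tests in a sandbox would be a stronger check still.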
Measuring Complexity: Did Evolution Work?
After running Evol-Instruct, you need to verify that evolution actually produced harder instructions - not just longer ones. Longer is not harder. Three measurement approaches:
import numpy as np
def rate_instruction_difficulty_batch(
    instructions: list[str],
    model: str = "claude-haiku-4-5-20251001"
) -> list[int]:
    """
    Rate a batch of instructions on a 1-5 difficulty scale.

    1 = Very easy (lookup, definition, single-step)
    2 = Easy (basic application of one concept)
    3 = Medium (combining multiple concepts, some analysis)
    4 = Hard (expert knowledge, multi-step reasoning, synthesis)
    5 = Very hard (PhD-level, requires deep expertise across domains)

    Uses Haiku for cost efficiency - rating is a simple classification task.
    """
    ratings = []
    for instruction in instructions:
        prompt = f"""Rate the difficulty of this instruction on a 1-5 scale:
1 = Very easy (simple lookup, single-step, basic definition)
2 = Easy (applying one concept, straightforward analysis)
3 = Medium (combining concepts, multi-step, moderate analysis)
4 = Hard (expert knowledge + multi-step reasoning + synthesis)
5 = Very hard (PhD-level, deep expertise across multiple domains)
Instruction: {instruction}
Respond with ONLY a single digit (1, 2, 3, 4, or 5):"""

        response = client.messages.create(
            model=model,
            max_tokens=5,
            temperature=0,
            messages=[{"role": "user", "content": prompt}]
        )
        try:
            rating = int(response.content[0].text.strip()[0])
            ratings.append(max(1, min(5, rating)))
        except (ValueError, IndexError):
            ratings.append(3)  # Default to medium on parsing failure
    return ratings
def analyze_complexity_improvement(
    original_instructions: list[str],
    evolved_instructions: list[str],
    sample_size: int = 50
) -> dict:
    """
    Compare difficulty distributions before and after evolution.

    This tells you whether evolution is working (mean difficulty
    should increase by 0.5-1.5 points per round) and whether you're
    hitting the upper tail (hard_fraction should be above 40% after 3 rounds).

    Args:
        original_instructions: Seed instructions before evolution
        evolved_instructions: Instructions after N evolution rounds
        sample_size: How many to rate (LLM rating is expensive)

    Returns:
        Dict with difficulty statistics and improvement metrics
    """
    orig_sample = random.sample(
        original_instructions,
        min(sample_size, len(original_instructions))
    )
    evol_sample = random.sample(
        evolved_instructions,
        min(sample_size, len(evolved_instructions))
    )
    print(f"Rating {len(orig_sample)} original and {len(evol_sample)} "
          f"evolved instructions...")
    orig_ratings = rate_instruction_difficulty_batch(orig_sample)
    evol_ratings = rate_instruction_difficulty_batch(evol_sample)

    def distribution(ratings):
        # Percentage of ratings at each level 1-5
        return {i: ratings.count(i) / len(ratings) * 100
                for i in range(1, 6)}

    # "Hard" means a rating of 4 or 5
    orig_hard = sum(1 for r in orig_ratings if r >= 4) / len(orig_ratings)
    evol_hard = sum(1 for r in evol_ratings if r >= 4) / len(evol_ratings)
    return {
        "original": {
            "mean_difficulty": round(np.mean(orig_ratings), 2),
            "std_difficulty": round(np.std(orig_ratings), 2),
            "hard_fraction": round(orig_hard, 3),
            "distribution_pct": distribution(orig_ratings),
        },
        "evolved": {
            "mean_difficulty": round(np.mean(evol_ratings), 2),
            "std_difficulty": round(np.std(evol_ratings), 2),
            "hard_fraction": round(evol_hard, 3),
            "distribution_pct": distribution(evol_ratings),
        },
        "improvement": {
            "difficulty_gain": round(
                np.mean(evol_ratings) - np.mean(orig_ratings), 2
            ),
            "hard_fraction_increase": round(evol_hard - orig_hard, 3),
            # bool() converts numpy.bool_ so the dict is JSON-serializable
            "evolution_working": bool(
                np.mean(evol_ratings) > np.mean(orig_ratings) + 0.5
            ),
        }
    }
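Because longer is not harder, it can also help to look at individual seed/evolved pairs and flag the ones that grew in length without a higher rating. The sketch below assumes you already have paired ratings (for example from `rate_instruction_difficulty_batch`); the function name `find_padded_evolutions` and the 1.5x length threshold are illustrative assumptions, not values from the original method.

```python
def find_padded_evolutions(
    pairs: list[tuple[str, str]],
    orig_ratings: list[int],
    evol_ratings: list[int],
    length_ratio_threshold: float = 1.5,  # assumed cutoff; tune on your own data
) -> list[int]:
    """Return indices of (original, evolved) pairs that got longer but not harder."""
    if not (len(pairs) == len(orig_ratings) == len(evol_ratings)):
        raise ValueError("pairs and both rating lists must have the same length")

    flagged = []
    for i, (original, evolved) in enumerate(pairs):
        orig_words = max(len(original.split()), 1)  # avoid division by zero
        length_ratio = len(evolved.split()) / orig_words
        got_longer = length_ratio >= length_ratio_threshold
        got_harder = evol_ratings[i] > orig_ratings[i]
        if got_longer and not got_harder:
            flagged.append(i)
    return flagged
```

A high share of flagged pairs suggests the evolution prompts are rewarding verbosity, which is a reason to tighten the operator prompts before generating responses.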
The Production Combination: Self-Instruct + Evol-Instruct
In practice, the best synthetic datasets combine breadth from Self-Instruct with depth from Evol-Instruct. Self-Instruct gives you coverage across many task types and difficulty levels. Evol-Instruct gives you density at the upper tail where the model needs to develop genuine reasoning capability.
A target mix of 15% easy, 35% medium, and 50% hard examples is a starting point, not a law. If your production users consistently ask simple questions (customer support, FAQ answering), increase the easy fraction. If your users are domain experts asking complex analytical questions (legal research, medical diagnosis support, systems architecture), push the hard fraction higher toward 60–70%. Always calibrate the target difficulty distribution against the actual difficulty distribution of your production queries.
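One way to assemble such a mix from a rated pool is sketched below. It assumes each example carries a 1-5 difficulty rating and maps ratings 1-2 to easy, 3 to medium, and 4-5 to hard; that mapping, the function names, and the tuple layout are assumptions made for illustration.

```python
import random
from typing import Optional


def bucket_of(rating: int) -> str:
    """Map a 1-5 rating to a bucket (assumed mapping: 1-2 easy, 3 medium, 4-5 hard)."""
    if rating <= 2:
        return "easy"
    if rating == 3:
        return "medium"
    return "hard"


def mix_by_difficulty(
    rated_examples: list[tuple[str, int]],
    total: int,
    target: Optional[dict[str, float]] = None,
    seed: int = 0,
) -> list[tuple[str, int]]:
    """Sample up to `total` examples so bucket shares approximate `target`."""
    target = target or {"easy": 0.15, "medium": 0.35, "hard": 0.50}
    rng = random.Random(seed)

    buckets: dict[str, list[tuple[str, int]]] = {"easy": [], "medium": [], "hard": []}
    for example in rated_examples:
        buckets[bucket_of(example[1])].append(example)

    mixed = []
    for name, fraction in target.items():
        quota = round(total * fraction)
        pool = buckets[name]
        # If a bucket is too small, take everything it has rather than failing.
        mixed.extend(rng.sample(pool, min(quota, len(pool))))
    rng.shuffle(mixed)
    return mixed
```

When a bucket cannot fill its quota, the result comes back short. That shortfall is itself a useful signal: it shows which difficulty band needs more generation.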
Common Pitfalls and How to Avoid Them
:::danger Evolution Collapse After 4+ Rounds
After 4-5 evolution rounds, instructions often become pathologically complex - so loaded with competing constraints that no coherent answer exists, or so verbose (300+ words) that the response generation model produces a surface-level answer that ignores most requirements. Evolution collapse looks like this:

"Implement a distributed, thread-safe, O(1) LRU cache that supports TTL expiration, LFU fallback for write-heavy workloads, cross-datacenter replication with eventual consistency, automatic shard rebalancing, and can be configured via environment variables, supports hot-reload of configuration, produces Prometheus metrics, has a REST API for inspection, and includes a complete test suite with property-based tests and chaos testing."

No model will answer this correctly. Monitor instruction word count and cap at 200 words. Add an answerable check: ask the model "Is this instruction answerable with a complete, specific response? YES or NO" and reject NO answers.
:::
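A minimal sketch of that guard follows. To keep it runnable without an API key, the LLM call is passed in as a plain callable; `ask_llm`, `passes_collapse_guard`, and the prompt layout are hypothetical names chosen for this example. In practice `ask_llm` would wrap a call to a cheap model.

```python
from typing import Callable

MAX_WORDS = 200  # word cap from the guidance above

ANSWERABLE_PROMPT = (
    "Is this instruction answerable with a complete, specific response? "
    "YES or NO\n\nInstruction: {instruction}"
)


def passes_collapse_guard(
    instruction: str,
    ask_llm: Callable[[str], str],
    max_words: int = MAX_WORDS,
) -> bool:
    """Reject over-long instructions first, then ask the model if it is answerable."""
    if len(instruction.split()) > max_words:
        return False  # cheap check first: no LLM call is made for over-long text
    verdict = ask_llm(ANSWERABLE_PROMPT.format(instruction=instruction))
    return verdict.strip().upper().startswith("YES")
```

Running the word-count check before the model call means collapsed instructions cost nothing to reject.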
:::warning The Complexity-Quality Tradeoff
More complex is not automatically better training data. An instruction that requires 12 reasoning steps but has an ambiguous or underdetermined answer will produce a confidently-worded but arbitrary response. A wrong-but-confident response to a hard instruction is the most dangerous training data you can create: it teaches the model to produce plausible-sounding incorrect reasoning for hard questions. This is worse than not having hard examples at all.

Always generate responses with the strongest available model at low temperature. If you can verify ground truth (math problems, code execution, logical puzzles with known answers), do so. For questions where ground truth is subjective, use multi-judge quality scoring before including examples.
:::
:::tip Domain-Specific Seeds Produce the Best Results
Generic evolution from generic seeds produces a model that is generally better at complex tasks. Domain-specific evolution from domain-specific seeds produces a model that is dramatically better in that specific domain. If you're building a medical QA model, start with USMLE questions (already difficulty-stratified and medically accurate). If you're building a legal analysis model, start with bar exam questions. The evolution operators amplify what is already present in the seeds - if the seeds are domain-specific and well-calibrated, the evolved instructions will be hard examples of the exact task type your users actually need. Generic seeds produce complex but off-target examples.
:::
:::info Evolution Round Count: The Sweet Spot
The original WizardLM used 4 evolution rounds. WizardCoder found 3-5 rounds optimal for coding tasks. Beyond 5 rounds, the quality of evolved instructions typically degrades faster than the complexity increases. Use 3 rounds as the default. Add a 4th round if your difficulty analysis shows the hard fraction is still below 40% after 3 rounds. Stop at 5 rounds regardless of metrics. Track the elimination filter acceptance rate per round: if it drops below 25%, you have hit the limit of useful evolution for your seed set and continuing will produce more collapsed examples than valid ones.
:::
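These stopping rules can be written down as a small decision function. The sketch below is one reading of the guidance: it extends the "hard fraction below 40%" condition to the decision about a 5th round as well, which is an assumption, and the function and parameter names are hypothetical.

```python
def should_run_next_round(
    completed_rounds: int,
    hard_fraction: float,
    acceptance_rate: float,
    default_rounds: int = 3,
    max_rounds: int = 5,
    target_hard_fraction: float = 0.40,
    min_acceptance_rate: float = 0.25,
) -> bool:
    """Decide whether to run another evolution round.

    hard_fraction and acceptance_rate describe the most recently completed round.
    """
    if completed_rounds >= max_rounds:
        return False  # hard stop at 5 rounds regardless of metrics
    if completed_rounds > 0 and acceptance_rate < min_acceptance_rate:
        return False  # the filter is rejecting most output: the seed set is exhausted
    if completed_rounds < default_rounds:
        return True  # always run the default 3 rounds
    # Past the default: continue only while the hard tail is still underfilled
    return hard_fraction < target_hard_fraction
```

Checking the acceptance rate before the round count means a collapsing run stops early, even inside the default three rounds.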
Comparison: Evol-Instruct vs. Alternatives
| Approach | Breadth | Depth | Cost | Best For |
|---|---|---|---|---|
| Self-Instruct | High | Low-Medium | Low | General instruction following, diverse task coverage |
| Evol-Instruct | Medium | High | Medium | Domain-specific expert-level reasoning capability |
| Human annotation | High | Very High | Very High | Maximum quality when cost is not a constraint |
| Distillation (Orca) | Medium | High | High | Explicit reasoning traces, calibration transfer |
| Self-Play (SPIN) | Low | Medium | Medium | Iterative self-improvement from existing model |
Evol-Instruct occupies the cost-effective expert capability niche. It is significantly cheaper than human annotation or large-scale distillation, but it produces data with genuinely hard examples that neither Self-Instruct nor naive distillation achieves by default.
Interview Q&A
Q: What is the key intuition behind Evol-Instruct, and how does it differ from Self-Instruct?
Self-Instruct generates more diverse instructions - it is good for breadth, covering many different task types across a domain. Evol-Instruct generates more complex instructions - it is good for depth, filling the upper tail of the difficulty distribution. The fundamental insight is that a model's capability ceiling is determined by the hardest examples it sees during training. Self-Instruct datasets cluster naturally around medium difficulty because language models generate tasks they can comfortably describe and answer. The upper tail is systematically underrepresented. Evol-Instruct systematically fills that tail by taking existing instructions and repeatedly applying transformation operators that make them harder. WizardLM-7B beating ChatGPT on 52.7% of test questions with roughly 1/25th the parameters (assuming ChatGPT is at GPT-3's 175B scale, which OpenAI has not confirmed) is the empirical proof of concept: training data complexity distribution can overcome parameter count differences in specific capability domains.
Q: Why does WizardLM-7B beat ChatGPT on hard instruction following despite being much smaller?
Fine-tuning with Evol-Instruct data teaches explicit reasoning patterns for complex tasks that the base model wouldn't develop from pretraining alone. GPT-3.5 (ChatGPT) was trained to be capable across many tasks at many difficulty levels - the RLHF process doesn't specifically optimize for complex reasoning; it optimizes for human preference, which often rewards fluency over depth. WizardLM-7B was specifically optimized via its training data to handle complex, multi-step instructions. On hard tasks specifically, the specialized fine-tuning overcomes the parameter count difference because the smaller model learned how to reason through hard problems. It saw thousands of examples of complex instructions with corresponding high-quality responses, and it internalized the reasoning patterns those examples demonstrate. The larger model had implicit capability but was not specifically exercised on hard reasoning tasks. This is the general principle: domain-specific fine-tuning on precisely targeted data often beats parameter-scaled general training for specific capabilities.
Q: What are the five evolution operators in Evol-Instruct and when would you use each?
The four in-depth operators and their ideal use cases: (1) Add Constraints - best when the original task is open-ended and can be answered in many ways; it narrows the solution space while increasing complexity. Use this for instructions that could receive a valid 3-sentence answer or a valid 3-page analysis - add constraints that force the 3-page analysis. (2) Deepen - best for knowledge-heavy tasks where surface understanding is currently sufficient. Use this to require formal proofs, theoretical foundations, or expert-level domain knowledge that distinguishes senior practitioners from juniors. (3) Concretize - best when the original instruction is too vague, allowing the model to respond at any level of specificity. Concrete numbers and specific scenarios force the model to engage with the actual problem rather than give a generic answer. (4) Increase Reasoning Steps - best for tasks that currently have direct, single-step answers. Use this to require intermediate calculations, conditional branches, or synthesis across multiple concepts. The fifth operator, (5) Mutate, is for breadth, not depth - it creates a new related task in the same domain, testing different skills. Use mutate when you want variety within a domain without always increasing depth; typically at a 20-30% probability alongside in-depth operators.
Q: How would you detect and prevent evolution collapse?
Evolution collapse is when instructions become so complex that they are unanswerable. Three detection strategies: (1) Word count monitoring - if an evolved instruction exceeds 200 words, it has likely accumulated incompatible or incoherent constraints. Auto-reject above 250 words. (2) Answerable check - make a separate LLM call: "Is this instruction answerable with a complete, specific, correct response? YES or NO." Reject on NO. (3) Response quality monitoring - track the rate of failure indicators per round. If it exceeds 40%, you are past the point of useful evolution for your seed set. For prevention: cap evolution at 5 rounds, track the elimination filter acceptance rate per round (healthy is 60%+, declining to below 30% signals collapse), and use the difficulty rater to verify each round produces a measurable bump (target +0.5 to +1.0 on the 1-5 scale per round, not +2.0 in a single round which suggests random complexity rather than systematic hardening).
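The per-round thresholds in this answer can be folded into a simple health report. This is a sketch only: `round_health_warnings` is a hypothetical helper, and the three inputs are assumed to be computed elsewhere in your pipeline (filter statistics and the difficulty rater).

```python
def round_health_warnings(
    acceptance_rate: float,
    failure_indicator_rate: float,
    difficulty_gain: float,
) -> list[str]:
    """Return human-readable warnings for one evolution round (empty list = healthy)."""
    warnings = []
    if acceptance_rate < 0.30:
        warnings.append("acceptance rate below 30%: likely collapse")
    if failure_indicator_rate > 0.40:
        warnings.append("failure indicators above 40%: past useful evolution")
    if difficulty_gain < 0.5:
        warnings.append("difficulty gain below +0.5: evolution is not hardening")
    elif difficulty_gain >= 2.0:
        warnings.append("difficulty gain of +2.0 or more: possibly random complexity")
    return warnings
```

Logging this list after every round makes the point at which a run starts degrading easy to spot.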
Q: How would you adapt Evol-Instruct for a domain where factual accuracy is critical, such as medical question answering?
Standard Evol-Instruct does not verify factual accuracy - evolved responses for hard medical questions might be confidently wrong. Required adaptations: (1) Grounded response generation - generate responses using RAG over authoritative medical sources (UpToDate, PubMed, clinical guidelines from relevant specialty societies) rather than pure generation. The model reformulates retrieved evidence rather than fabricating. (2) Medical-specific evolution operators - "Add Constraints" becomes "add specific patient demographics, comorbidities, and contraindications"; "Deepen" becomes "require citing clinical guidelines with evidence level (Grade A/B/C) and patient-population applicability." (3) Expert sampling for validation - route 10% of evolved examples to board-certified physicians. Track error rate by difficulty level and operator type. If any combination shows above 5% error rate, modify that operator's prompt to be more conservative. (4) Authoritative seeds - start with USMLE Step 1/2/3 questions, which are already difficulty-stratified, medically reviewed, and cover core clinical reasoning. The evolution operators amplify medical complexity rather than introducing generic complexity. (5) Factual consistency check - use a separate model to verify evolved responses against a medical knowledge base and flag logical contradictions or drug interaction errors before including examples in training data.
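The expert-sampling step in (3) amounts to grouping reviewer verdicts by difficulty level and operator, then flagging any combination above the 5% error threshold. A minimal sketch follows, assuming reviews arrive as `(difficulty, operator, is_error)` tuples; the `min_reviews` floor is an added assumption to avoid flagging combinations on a handful of samples.

```python
from collections import defaultdict
from typing import Iterable


def flag_risky_combinations(
    reviews: Iterable[tuple[int, str, bool]],
    max_error_rate: float = 0.05,
    min_reviews: int = 20,  # assumed floor; not part of the guidance above
) -> dict[tuple[int, str], float]:
    """Return {(difficulty, operator): error_rate} for combinations above the threshold."""
    totals: dict[tuple[int, str], int] = defaultdict(int)
    errors: dict[tuple[int, str], int] = defaultdict(int)
    for difficulty, operator, is_error in reviews:
        key = (difficulty, operator)
        totals[key] += 1
        errors[key] += int(is_error)

    flagged = {}
    for key, count in totals.items():
        rate = errors[key] / count
        if count >= min_reviews and rate > max_error_rate:
            flagged[key] = rate
    return flagged
```

Each flagged key points at an operator prompt to make more conservative for that difficulty band.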
Q: What is the elimination filter in Evol-Instruct and why is it necessary?
The elimination filter removes examples where the evolution process failed in one of three ways: (1) Failure indicator rejection - the evolved instruction or generated response contains phrases like "I cannot," "as an AI," or "I'm unable to" - signals that the model refused the task rather than performing it. This produces unusable training examples. (2) Copy problem rejection - the evolved instruction is nearly identical to the original, meaning the evolution operator had no real effect. Training on near-duplicates wastes compute and inflates the dataset without adding value. (3) Short response rejection - if the response to an evolved instruction is very short (under 25-30 words), the model couldn't engage with the instruction. A short response to a supposedly hard instruction means either the instruction is still simple (evolution failed) or the response is an unhelpful refusal or deflection. The filter is necessary because naive evolution pipelines produce 15-30% garbage examples - instructions that became incoherent, responses that are refusals, and near-duplicates that look different superficially but test the same thing. Training on these degrades model quality. The filter ensures every included example provides genuine learning signal.
Cost Estimation and Budget Planning
Running Evol-Instruct at production scale requires upfront budget planning. The cost structure has two dominant components: evolution (cheap, uses Haiku) and response generation (expensive, uses Opus).
def estimate_evol_instruct_cost(
    n_seeds: int,
    n_rounds: int,
    evolutions_per_instruction: int,
    mutate_probability: float = 0.25,
    avg_instruction_tokens: int = 80,
    avg_response_tokens: int = 600,
    haiku_input_cost: float = 0.80,   # $/1M tokens
    haiku_output_cost: float = 4.00,  # $/1M tokens
    opus_input_cost: float = 15.00,   # $/1M tokens
    opus_output_cost: float = 75.00,  # $/1M tokens
) -> dict:
    """
    Estimate the API cost of running Evol-Instruct before committing.

    Run this before starting - Evol-Instruct at 10K seeds x 3 rounds
    can easily cost $2,000-$5,000 if you're not careful about model
    selection and batch sizing.

    Cost breakdown per round:
    - Evolution: Haiku call per instruction per operator (cheap)
    - Response generation: Opus call per accepted evolved instruction (expensive)
    - Quality rating: Haiku call per accepted evolved instruction (cheap)

    Returns:
        Dict with per-round and total cost estimates
    """
    # Expected pool size per round (assuming ~60% acceptance rate)
    acceptance_rate = 0.60
    ops_per_instruction = evolutions_per_instruction + mutate_probability
    costs = {}
    total_cost = 0.0
    current_pool = n_seeds
    for round_num in range(1, n_rounds + 1):
        n_evolutions = int(current_pool * ops_per_instruction)
        n_accepted = int(n_evolutions * acceptance_rate)

        # Evolution cost (Haiku): transform instruction → evolved instruction
        evolution_input = n_evolutions * avg_instruction_tokens / 1_000_000
        evolution_output = n_evolutions * 120 / 1_000_000  # ~120 tokens per evolved instruction
        evolution_cost = (
            evolution_input * haiku_input_cost +
            evolution_output * haiku_output_cost
        )

        # Response generation cost: evolved instruction → full response.
        # Sonnet for round 1, Opus for rounds 2+ (matches implementation).
        # The 0.4 factor is a rough Sonnet-to-Opus price ratio; adjust it
        # to current pricing before relying on the estimate.
        resp_input_cost = opus_input_cost if round_num >= 2 else opus_input_cost * 0.4
        resp_output_cost = opus_output_cost if round_num >= 2 else opus_output_cost * 0.4
        response_input = n_accepted * avg_instruction_tokens / 1_000_000
        response_output = n_accepted * avg_response_tokens / 1_000_000
        response_cost = (
            response_input * resp_input_cost +
            response_output * resp_output_cost
        )

        # Quality rating cost (Haiku): score each accepted example
        rating_input = n_accepted * (avg_instruction_tokens + 200) / 1_000_000
        rating_output = n_accepted * 5 / 1_000_000
        rating_cost = (
            rating_input * haiku_input_cost +
            rating_output * haiku_output_cost
        )

        round_cost = evolution_cost + response_cost + rating_cost
        total_cost += round_cost
        costs[f"round_{round_num}"] = {
            "pool_size": current_pool,
            "n_evolutions": n_evolutions,
            "n_accepted": n_accepted,
            "evolution_cost_usd": round(evolution_cost, 2),
            "response_cost_usd": round(response_cost, 2),
            "rating_cost_usd": round(rating_cost, 2),
            "round_total_usd": round(round_cost, 2),
        }
        current_pool = n_accepted  # Next round starts from accepted examples

    costs["total_usd"] = round(total_cost, 2)
    costs["total_examples"] = sum(
        costs[f"round_{r}"]["n_accepted"]
        for r in range(1, n_rounds + 1)
    )
    costs["cost_per_example_usd"] = round(
        total_cost / max(costs["total_examples"], 1), 4
    )
    return costs
# Example: estimate cost before committing
if __name__ == "__main__":
    estimate = estimate_evol_instruct_cost(
        n_seeds=500,
        n_rounds=3,
        evolutions_per_instruction=2,
    )
    print(f"Estimated total cost: ${estimate['total_usd']:.2f}")
    print(f"Expected examples: {estimate['total_examples']}")
    print(f"Cost per example: ${estimate['cost_per_example_usd']:.4f}")
    for round_key, data in estimate.items():
        if round_key.startswith("round_"):
            print(f"\n {round_key}:")
            print(f" Pool size: {data['pool_size']}")
            print(f" Evolutions: {data['n_evolutions']}")
            print(f" Accepted: {data['n_accepted']}")
            print(f" Round cost: ${data['round_total_usd']:.2f}")
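One property of this estimator is easy to miss: the pool grows each round. With 2 evolutions plus a 0.25 mutate probability per instruction and a 60% acceptance rate, each round multiplies the pool by about 2.25 × 0.60 = 1.35, so later rounds generate the most responses and carry most of the cost. The short sketch below isolates that arithmetic; `pool_sizes` is a hypothetical helper that mirrors the truncation used in the estimator.

```python
def pool_sizes(
    n_seeds: int,
    n_rounds: int,
    ops_per_instruction: float = 2.25,  # 2 evolutions + 0.25 mutate probability
    acceptance_rate: float = 0.60,      # same assumption as the estimator
) -> list[int]:
    """Return the number of accepted examples after each round."""
    sizes = []
    pool = n_seeds
    for _ in range(n_rounds):
        n_evolutions = int(pool * ops_per_instruction)
        pool = int(n_evolutions * acceptance_rate)  # accepted examples seed the next round
        sizes.append(pool)
    return sizes


if __name__ == "__main__":
    sizes = pool_sizes(500, 3)
    print(sizes, "total:", sum(sizes))
```

For 500 seeds and 3 rounds this gives 675, 910, and 1,228 accepted examples (2,813 in total), which is why lowering the acceptance rate or the evolutions per instruction is the most direct budget lever.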
Evol-Instruct in Context: The Bigger Dataset Picture
Evol-Instruct is one technique in a broader toolkit for synthetic dataset construction. Understanding where it fits - and where it does not fit - is as important as knowing how to implement it.
Evol-Instruct is the right tool when: (1) you have a seed instruction set that is good but too simple, (2) your model currently fails on the hard end of user queries but succeeds on simple ones, (3) the domain has a clear difficulty gradient (coding, math, logic, multi-step reasoning all have natural difficulty gradients), and (4) you have API budget for multi-round evolution with Opus-level response generation.
Evol-Instruct is not the right tool when: (1) your seed instructions are already at the target difficulty level and you need more examples, not harder ones, (2) the task requires ground truth verification that LLM-generated responses cannot provide (formal proofs, SQL queries against a specific schema, numerical answers to novel math problems), (3) your task doesn't have a meaningful difficulty gradient (translation quality is mostly about fluency and accuracy, not difficulty), or (4) you need factual reliability guarantees that LLM-generated responses cannot provide.
The production pattern that works: use Self-Instruct to establish broad coverage across the domain, use Evol-Instruct to fill the hard end of the difficulty distribution, use human expert review for a sample of the hardest evolved examples to catch systematic errors, and monitor the deployed model's performance by difficulty category. The difficulty category distribution of failing queries tells you exactly where to run more evolution.
:::tip Always Audit a Sample Before Training
Before committing a full evolved dataset to training, manually review 50-100 examples - specifically from the highest evolution rounds. Ask: Is this instruction actually answerable? Is the response genuinely correct? Is the reasoning (if present) actually valid? Automated filters catch obvious failures (refusals, duplicates, very short responses) but miss the most dangerous failure mode: confident, well-formatted, wrong responses to hard questions. Manual spot-checks catch this. Budget 30-60 minutes for a 50-example audit. It is cheap insurance against a training run that internalizes systematic errors.
:::
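Drawing the audit sample can be automated so that it always favors the latest rounds. The sketch below assumes each example is a dict with a `round` key recording the evolution round that produced it; that field name and the helper `sample_for_audit` are assumptions made for illustration.

```python
import random


def sample_for_audit(examples: list[dict], n: int = 50, seed: int = 0) -> list[dict]:
    """Pick n examples for manual review, taking the highest evolution rounds first."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    # sort() is stable, so the random order is preserved within each round
    shuffled.sort(key=lambda example: example["round"], reverse=True)
    return shuffled[:n]
```

Shuffling before the stable sort keeps the selection random within a round, so the audit does not repeatedly inspect whichever examples were generated first.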
The field continues to iterate on Evol-Instruct. WizardMath applied the technique to mathematical reasoning with domain-specific operators that understand mathematical difficulty (introduce new variables, add constraints that require additional lemmas, require proof by contradiction). WizardCoder built coding-specific operators tuned to what makes coding problems genuinely harder. The pattern is clear: start with the general technique, then specialize the operators to your domain. Generic operators produce generically harder instructions. Domain-specific operators produce instructions that are harder in exactly the ways your users' hardest queries are harder.
