Instruction Tuning
The Useless Genius
A researcher named David sits down with early-access GPT-3 in 2020. He types: "Summarize this article in three bullet points:" and pastes a 500-word news article. GPT-3 responds... by continuing the news article. It does not summarize. It just generates more news-article-like text.
David is not surprised. GPT-3 is a base language model. It was trained to predict the next token. If you give it a news article, the most probable continuation is more news article text - not a bulleted summary. The model is not "thinking about what you want." It is doing exactly what it was trained to do: continue the text pattern.
He tries adding examples to the prompt - few-shot learning: "Article: [article1]. Summary: [bullet1, bullet2, bullet3]. Article: [article2]. Summary:" Now GPT-3 produces a summary. The few-shot examples tell it what pattern to continue. It works - but it requires careful prompt engineering for every single task, and you are burning context length on examples.
This was the state of the art in 2020: base models that required careful prompting with examples to do anything useful. The insight that changed everything: what if instead of engineering prompts at inference time, you taught the model at training time to follow instructions of any kind?
Why This Exists: Base Models Are Bad at Following Instructions
Base LLMs are powerful but awkward to use. They predict the next token. They do not inherently understand that "Write a poem about Paris" means you want a poem, not a continuation of text that happens to mention Paris. They do not understand that "Answer briefly" means they should be concise. They do not know that "Translate to French" followed by English text means they should translate.
These are instructions. They require the model to understand the intent behind a request, map it to the appropriate behavior, and execute accordingly. A base model has not been trained to do this.
The solution: instruction tuning - fine-tuning on a massive, diverse collection of (instruction, response) pairs across many different task types. After instruction tuning, the model learns the meta-skill of "following instructions" - it learns that text in the form of a request should be responded to with the appropriate action.
The most important word is diverse. A model fine-tuned on only summarization instructions learns to summarize - it does not generalize to translation or coding. The key discovery of the FLAN paper was that diversity of tasks is what enables generalization to new tasks the model has never seen.
Historical Context: The Papers That Defined the Field
FLAN (Wei et al., 2021)
Fine-tuned Language Net (FLAN) was the first large-scale demonstration of instruction tuning. Wei et al. at Google took a 137B LaMDA model and fine-tuned it on a mixture of 62 NLP datasets reformulated as instructions. Key findings:
- FLAN outperforms zero-shot GPT-3 (175B) on 20 of the 25 datasets evaluated, despite having fewer parameters (137B)
- Task diversity is critical: models fine-tuned on more task types generalize better to held-out tasks
- Scale matters: instruction tuning improved held-out performance only for sufficiently large models; for the smaller models in the study it hurt performance on unseen tasks
The "aha moment": a model that has been instruction-tuned on summarization, translation, QA, classification, and reasoning generalizes to new instruction types it has never seen. It has learned the pattern of "instruction → response" as a general skill.
T0 (Sanh et al., 2021)
T0 (BigScience) fine-tuned T5-11B on the P3 dataset - 2,073 prompt templates across 170 NLP tasks. Key contribution: demonstrated that prompt diversity within tasks (multiple different ways of phrasing the same instruction) significantly improves zero-shot generalization.
InstructGPT (Ouyang et al., 2022)
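OpenAI's InstructGPT combined SFT with RLHF. The SFT phase used ~13,000 human-written instruction-response demonstrations. Key finding: human raters preferred outputs from the 1.3B InstructGPT model over outputs from the raw 175B GPT-3, even though it has over 100x fewer parameters. At 175B, InstructGPT outputs were preferred over GPT-3 outputs about 85% of the time, and about 71% of the time over few-shot-prompted GPT-3. Instruction tuning + RLHF let a model over 100x smaller beat the base model on human preference.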
Alpaca (Taori et al., 2023)
Stanford fine-tuned LLaMA-7B on 52,000 instruction-following examples generated by GPT-3.5 (text-davinci-003). Cost, as reported by the authors: under $500 in OpenAI API calls for data generation, plus under $100 of compute for fine-tuning. Result: a model that many users found comparable to early GPT-3.5 for simple instruction following. The key: cheap AI-generated instruction data, while imperfect, is sufficient to teach the instruction-following skill.
FLAN-T5 and FLAN-UL2 (Chung et al., 2022)
Chung et al. scaled instruction tuning to 1,836 fine-tuning tasks, explicitly including chain-of-thought examples. Results on BIG-Bench Hard improved dramatically with CoT data included. FLAN-UL2-20B outperformed GPT-3 175B on many benchmarks.
Task Diversity: The Core Principle
The single most important principle of instruction tuning: train on many different task types.
Why does task diversity enable generalization? The hypothesis: by seeing instruction-following as a pattern across many task types, the model internalizes a general-purpose "follow the instruction" capability rather than learning task-specific pattern matching. A model fine-tuned on 1,000 different task types develops a deeper understanding of what it means to respond to a request than a model fine-tuned on 1,000 examples of the same task.
The FLAN paper demonstrated this explicitly: its 62 datasets were grouped into 12 task clusters (each cluster is a set of related tasks), and models fine-tuned on more clusters performed better on held-out clusters. The benefit kept growing as clusters were added, with no sign of saturating in the ablation. This is the empirical foundation for "diversity matters."
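A held-out-cluster evaluation of this kind can be sketched in a few lines. The task names and cluster labels below are hypothetical placeholders, not FLAN's actual datasets; the only point is that every task belonging to a held-out cluster is kept out of training.

```python
# Hypothetical task -> cluster mapping, for illustration only.
TASK_CLUSTERS = {
    "sentiment_a": "sentiment",
    "sentiment_b": "sentiment",
    "nli_a": "nli",
    "nli_b": "nli",
    "translation_a": "translation",
    "summarization_a": "summarization",
}


def split_by_cluster(task_clusters, held_out):
    """Return (train_tasks, eval_tasks); held-out clusters never appear in training."""
    held_out = set(held_out)
    train = sorted(t for t, c in task_clusters.items() if c not in held_out)
    evaluation = sorted(t for t, c in task_clusters.items() if c in held_out)
    return train, evaluation


train_tasks, eval_tasks = split_by_cluster(TASK_CLUSTERS, held_out={"nli"})
print("train:", train_tasks)
print("eval: ", eval_tasks)
```

Splitting at the cluster level rather than the dataset level matters: holding out one NLI dataset while training on another would not test generalization to a new kind of task.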
Chain-of-Thought Data
One of the most impactful additions to instruction tuning datasets is chain-of-thought (CoT) data - examples where the response includes explicit reasoning steps before the final answer.
Without CoT:
Instruction: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many does he have total?
Response: 11
With CoT:
Instruction: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many does he have total?
Response: Roger starts with 5 tennis balls.
He buys 2 cans, each with 3 balls: 2 × 3 = 6 new balls.
Total: 5 + 6 = 11 tennis balls.
Wei et al. (2022) showed that prompting with chain-of-thought examples dramatically improves performance on reasoning tasks, especially for larger models. Chung et al. (2022) then showed that including CoT examples in the instruction tuning mixture carries this over to fine-tuning: the model learns not just the answer format but the reasoning process. In their ablations, fine-tuning with CoT data included clearly outperformed fine-tuning without it on step-by-step reasoning benchmarks such as BIG-Bench Hard.
The practical implication: for any instruction tuning dataset used for reasoning tasks, CoT examples are highly valuable. As a rough starting point (a rule of thumb, not a published result), make 20-30% of the examples CoT for tasks that benefit from multi-step reasoning.
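As a minimal sketch, the same problem can be rendered either as a direct-answer example or as a CoT example, and the share of CoT examples in a dataset can be tracked explicitly. The `build_example` and `cot_fraction` helpers and the `has_cot` field are assumptions made for this illustration, not part of any library.

```python
def build_example(question, steps, answer, with_cot):
    """Return an instruction-response pair, with or without reasoning steps."""
    if with_cot:
        response = "\n".join([*steps, f"The answer is {answer}."])
    else:
        response = str(answer)
    return {"instruction": question, "response": response, "has_cot": with_cot}


def cot_fraction(examples):
    """Share of examples whose response contains reasoning steps."""
    if not examples:
        return 0.0
    return sum(ex["has_cot"] for ex in examples) / len(examples)


question = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many does he have total?"
)
steps = [
    "Roger starts with 5 tennis balls.",
    "He buys 2 cans, each with 3 balls: 2 x 3 = 6 new balls.",
    "Total: 5 + 6 = 11 tennis balls.",
]
direct = build_example(question, steps, 11, with_cot=False)
cot = build_example(question, steps, 11, with_cot=True)
dataset = [cot, direct, direct, direct]
print(cot["response"])
print(f"CoT fraction: {cot_fraction(dataset):.2f}")
```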
Instruction Templates and Format
How you phrase instructions matters significantly. T0 (Sanh et al., 2021) showed that training on multiple different phrasings of the same instruction improves robustness. A model trained only on "Summarize this text:" will perform worse on "Provide a brief summary of the following article:" even though these are semantically identical.
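One simple way to get phrasing diversity is to keep several templates per task and pick one at random for each training example. The templates and the `render_with_random_template` helper below are illustrative assumptions, not the actual P3 prompts.

```python
import random

# Hypothetical paraphrases of the same summarization instruction.
SUMMARY_TEMPLATES = [
    "Summarize this text:\n{text}",
    "Provide a brief summary of the following article:\n{text}",
    "What are the main points of the passage below?\n{text}",
]


def render_with_random_template(text, templates, rng):
    """Pick one phrasing at random and fill in the input text."""
    template = rng.choice(templates)
    return template.format(text=text)


rng = random.Random(0)  # Seeded so the dataset build is reproducible
article = "Transformers replaced recurrence with self-attention."
prompts = [render_with_random_template(article, SUMMARY_TEMPLATES, rng) for _ in range(6)]
for p in prompts[:2]:
    print(p, end="\n\n")
```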
Common instruction formats:
Alpaca format:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{response}
ChatML format (used by many modern models):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
Llama-2-Chat format:
<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
{instruction} [/INST] {response} </s>
Template consistency is critical: if you fine-tune with one template and deploy with another, performance will degrade significantly. The model has learned the exact token pattern of the template. Using a slightly different format at inference time is like speaking broken French to a French speaker: they will understand some of it, but not reliably.
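A low-tech way to keep training and inference in sync is to render both from a single function, so the inference prompt is by construction a prefix of the training text. This is a minimal sketch using the Alpaca-style template; `render_prompt` and `render_training_text` are hypothetical helper names.

```python
ALPACA_PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def render_prompt(instruction):
    """Prompt used at inference time (everything up to the response)."""
    return ALPACA_PROMPT.format(instruction=instruction)


def render_training_text(instruction, response):
    """Training text = the exact inference prompt + the target response."""
    return render_prompt(instruction) + response


train_text = render_training_text("Name a prime number.", "2")
infer_prompt = render_prompt("Name a prime number.")

# If this ever fails, training and deployment have drifted apart.
assert train_text.startswith(infer_prompt)
print(repr(infer_prompt[-20:]))
```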
Open Source Instruction Datasets
Dolly 15K (Databricks, 2023): 15,000 human-written instruction examples across 8 task categories (open QA, closed QA, summarization, information extraction, classification, creative writing, general QA, brainstorming). Notably, all examples were written by real Databricks employees - not generated by AI.
OpenAssistant Conversations (LAION, 2023): about 161,000 human-written messages in 35 languages, organized into conversation trees, of which over 10,000 are fully annotated. Each tree has multiple branches with human quality ratings. One of the most comprehensive human-generated instruction datasets.
Orca (Mukherjee et al., 2023): about 5 million instruction examples with step-by-step explanations generated by ChatGPT (GPT-3.5), roughly 1 million of which also have GPT-4-generated explanations. Key innovation: instead of just (instruction, response) pairs, Orca includes system prompts that ask the teacher model to explain its reasoning. Smaller models fine-tuned on this data learned to mimic the teacher's reasoning process, not just its outputs.
ShareGPT: User-contributed ChatGPT conversations. Large quantity but noisy quality. Widely used for initial instruction tuning.
WizardLM (Xu et al., 2023): Uses an "Evol-Instruct" process - LLMs iteratively rewrite simple instructions to be harder and more complex. The result is a dataset with high complexity and diversity.
Code: Instruction Tuning with TRL
"""
Instruction tuning with TRL SFTTrainer.
Demonstrates:
1. Dataset formatting for multiple template styles
2. Supervised fine-tuning on the formatted text
3. Template-consistent generation
Note: this listing computes the loss on the full text (prompt and response);
it does not include prompt masking or a held-out evaluation split.
"""
import torch
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig
# ---- Dataset formatting functions ----
def format_alpaca(example):
    """Format example as Alpaca-style instruction-response pair."""
    if example.get("input") and example["input"].strip():
        # Instruction with additional context
        text = (
            "Below is an instruction that describes a task, paired with an input. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    else:
        text = (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return {"text": text}


def format_chatml(example, system_prompt="You are a helpful assistant."):
    """Format example as ChatML conversation."""
    text = (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{example['instruction']}<|im_end|>\n"
        f"<|im_start|>assistant\n{example['output']}<|im_end|>"
    )
    return {"text": text}


def create_diverse_dataset():
    """
    Create a small example instruction dataset with diverse tasks.
    In practice, load from OpenAssistant, Dolly, ShareGPT, etc.
    """
    examples = [
        # Classification
        {"instruction": "Classify the sentiment of the following review as positive, negative, or neutral: 'The battery life is great but the camera is disappointing.'", "output": "Negative. While the battery life is praised, the camera disappointment outweighs it, making the overall sentiment negative."},
        # Summarization
        {"instruction": "Summarize the following in one sentence: 'The transformer architecture, introduced in 2017, revolutionized natural language processing by replacing recurrent networks with self-attention mechanisms, enabling parallel training and capture of long-range dependencies in text.'", "output": "The transformer architecture revolutionized NLP by using self-attention instead of recurrence, enabling parallel training and better handling of long-range dependencies."},
        # Code generation (indentation inside the string matters: it is Python source)
        {"instruction": "Write a Python function that returns the nth Fibonacci number using dynamic programming.", "output": "```python\ndef fibonacci(n: int) -> int:\n    if n <= 1:\n        return n\n    dp = [0] * (n + 1)\n    dp[1] = 1\n    for i in range(2, n + 1):\n        dp[i] = dp[i-1] + dp[i-2]\n    return dp[n]\n```"},
        # Reasoning with chain of thought
        {"instruction": "A train travels 120 miles in 2 hours. At the same speed, how long will it take to travel 300 miles? Show your reasoning.", "output": "First, find the speed: 120 miles / 2 hours = 60 miles per hour.\n\nThen, find the time for 300 miles: 300 miles / 60 mph = 5 hours.\n\nThe train will take 5 hours to travel 300 miles."},
        # Question answering
        {"instruction": "What is the difference between a compiler and an interpreter?", "output": "A compiler translates the entire source code into machine code before execution, producing an executable file. An interpreter translates and executes code line by line at runtime without producing a separate file. Compilers generally produce faster execution but require an upfront compilation step; interpreters are more flexible and easier for debugging but typically slower at runtime. Python uses an interpreter; C uses a compiler."},
    ]
    return Dataset.from_list(examples)


# ---- Training setup ----
def train_instruction_model(
    base_model: str = "meta-llama/Llama-2-7b-hf",
    output_dir: str = "./instruction-tuned-model",
    template: str = "alpaca",
):
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"  # Pad right for training
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
        use_cache=False,  # The KV cache is not useful with gradient checkpointing
    )
    model.gradient_checkpointing_enable()
    # Prepare dataset
    # Note: the ChatML markers (<|im_start|>, <|im_end|>) are not special tokens
    # in the Llama-2 tokenizer. With template="chatml" they are tokenized as plain
    # text unless you add them as special tokens and resize the embeddings.
    dataset = create_diverse_dataset()
    format_fn = format_alpaca if template == "alpaca" else format_chatml
    dataset = dataset.map(format_fn)
    sft_config = SFTConfig(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        weight_decay=0.01,
        bf16=True,
        max_seq_length=2048,  # Some newer TRL releases name this max_length
        packing=False,  # True: pack multiple examples into one sequence (faster)
        logging_steps=10,
        save_steps=100,
        report_to="none",
    )
    trainer = SFTTrainer(
        model=model,
        args=sft_config,
        train_dataset=dataset,  # Uses the "text" column produced by format_fn
        processing_class=tokenizer,
    )
    trainer.train()
    trainer.save_model(output_dir)
    return trainer


# ---- Template-consistent inference ----
def generate_response(
    instruction: str,
    model,
    tokenizer,
    template: str = "alpaca",
    max_new_tokens: int = 512,
    temperature: float = 0.7,
):
    """
    Generate a response using the same template as training.
    """
    if template == "alpaca":
        prompt = (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            "### Response:\n"
        )
    else:  # chatml
        prompt = (
            "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
            f"<|im_start|>user\n{instruction}<|im_end|>\n"
            "<|im_start|>assistant\n"
        )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_length = inputs["input_ids"].shape[1]
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the new tokens
    response_ids = outputs[0][prompt_length:]
    return tokenizer.decode(response_ids, skip_special_tokens=True)
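Prompt masking, meaning the loss is computed only on response tokens, is a common refinement of the training setup above. The sketch below shows the core idea on plain token-id lists, independent of any trainer: prompt positions get the label -100, the value PyTorch's cross-entropy loss ignores by default. The `mask_prompt_labels` helper and the token ids are made up for illustration; how you enable masking in a given TRL version is a separate question to check against its documentation.

```python
IGNORE_INDEX = -100  # Default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(prompt_ids, response_ids):
    """Build input_ids and labels so only response tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


# Hypothetical token ids: three prompt tokens followed by two response tokens.
batch = mask_prompt_labels(prompt_ids=[101, 102, 103], response_ids=[7, 8])
print(batch["input_ids"])
print(batch["labels"])
```

Without masking, part of the gradient is spent on reproducing the instruction text itself, which matters most when prompts are long relative to responses.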
Production Engineering Notes
The Self-Instruct Approach
When you have a capable LLM (GPT-4, Claude) but limited human annotation budget, you can use it to generate instruction data for a smaller model. Self-Instruct (Wang et al., 2022) automated this: start with 175 human-written seed tasks, use GPT-3 to generate new instructions, filter for quality and diversity, use GPT-3 to generate responses.
Modern variants (Alpaca, WizardLM, Orca) generate 50K-5M examples this way. Quality is lower than human-written data but the scale advantage often compensates. The key quality filters: removing too-short instructions, deduplicating (keep a new instruction only if its ROUGE-L similarity with every existing instruction is below 0.7), and safety filtering.
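The length and deduplication filters can be sketched in pure Python. This version assumes ROUGE-L is computed as an F-measure over lowercased whitespace tokens, which is a simplification; a production pipeline would more likely use an established ROUGE implementation. The function names and the `min_words` cutoff are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F-measure over whitespace tokens (simplified)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_new_instructions(candidates, pool, threshold=0.7, min_words=3):
    """Keep candidates that are long enough and not near-duplicates."""
    kept = []
    for cand in candidates:
        if len(cand.split()) < min_words:
            continue  # Too short to be a useful instruction
        if all(rouge_l(cand, seen) < threshold for seen in pool + kept):
            kept.append(cand)
    return kept


pool = ["Write a poem about the sea."]
candidates = [
    "Write a poem about the ocean.",         # Near-duplicate of the pool entry
    "Translate this sentence into French.",  # New task
    "Hi",                                    # Too short
]
kept = filter_new_instructions(candidates, pool)
print(kept)
```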
Mixing Data for Multi-Task Capability
In practice, the best instruction-tuned models are trained on a carefully designed mixture:
| Category | Fraction |
|---|---|
| General instruction following | 30-40% |
| Code and math (high quality) | 20-30% |
| Reasoning and CoT | 15-20% |
| Domain-specific tasks | 10-15% |
| Safety and refusal examples | 5-10% |
The exact ratios depend on your use case. For a coding assistant, increase code and math fraction. For a reasoning assistant, increase CoT fraction.
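Turning a mixture table into a concrete training set is mostly bookkeeping. The sketch below samples from per-category pools according to target fractions; the weights are one arbitrary point inside the ranges above, and the pools are synthetic placeholders.

```python
import random
from collections import Counter

# One possible choice inside the ranges in the table; adjust for your use case.
MIXTURE = {
    "general": 0.35,
    "code_math": 0.25,
    "reasoning_cot": 0.20,
    "domain": 0.12,
    "safety": 0.08,
}


def sample_mixture(pools, weights, total, seed=0):
    """Draw about `total` examples, split across categories by `weights`."""
    rng = random.Random(seed)
    mixed = []
    for name, weight in weights.items():
        n = round(total * weight)
        pool = pools[name]
        # Sample without replacement when possible, otherwise upsample.
        picks = rng.sample(pool, n) if n <= len(pool) else rng.choices(pool, k=n)
        mixed.extend({"category": name, **ex} for ex in picks)
    rng.shuffle(mixed)
    return mixed


# Synthetic pools standing in for real datasets.
pools = {
    name: [{"instruction": f"{name}-{i}", "response": "..."} for i in range(50)]
    for name in MIXTURE
}
mixed = sample_mixture(pools, MIXTURE, total=100)
print(Counter(ex["category"] for ex in mixed))
```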
Evaluating Instruction Following Specifically
Standard benchmarks measure knowledge but not instruction following. Use:
- MT-Bench (Zheng et al., 2023): 80 multi-turn questions across 8 categories, graded by GPT-4 on a 1-10 scale
- IFEval (Zhou et al., 2023): verifiable instruction following - follow 25 specific types of instructions and check programmatically
- AlpacaEval: automated win rate against a reference model (text-davinci-003)
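The idea behind verifiable instruction following is that some constraints can be checked with ordinary code instead of a judge model. The checks below are written in that spirit; they are hypothetical examples, not the benchmark's actual implementation.

```python
import re


def check_max_words(response, limit):
    """Constraint: 'answer in at most N words'."""
    return len(response.split()) <= limit


def check_bullet_count(response, expected):
    """Constraint: 'use exactly N bullet points'."""
    bullets = [ln for ln in response.splitlines() if re.match(r"^\s*[-*]\s+", ln)]
    return len(bullets) == expected


def check_contains_keyword(response, keyword):
    """Constraint: 'mention the word X'."""
    return keyword.lower() in response.lower()


response = "- Fast\n- Cheap\n- Reliable"
results = {
    "max_10_words": check_max_words(response, 10),
    "exactly_3_bullets": check_bullet_count(response, 3),
    "mentions_cheap": check_contains_keyword(response, "cheap"),
}
print(results)
print(f"Pass rate: {sum(results.values()) / len(results):.0%}")
```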
Common Mistakes
Using instruction data from a single source or task type: A model fine-tuned on 100,000 examples of one task type (e.g., all summarization) does not become a general instruction follower - it becomes a very specialized summarizer. Task diversity is the mechanism by which instruction tuning generalizes. If your dataset has only one or two task types, you have not done instruction tuning - you have done task-specific fine-tuning.
Ignoring safety in the fine-tuning dataset: Instruction tuning data that lacks examples of appropriate refusals will produce a model that follows all instructions, including harmful ones. Include "I cannot help with that" examples for requests that involve clear harms, illegal activities, or violations of your usage policy. The proportion should be 5-10% of the dataset. More is not always better - too many refusals and the model becomes overly cautious and refuses benign requests.
AI-generated data quality degradation: When using LLM-generated instruction data (Alpaca-style), the quality is a function of the generating model's capabilities. GPT-4-generated data is significantly better than GPT-3.5-generated data. But even GPT-4-generated data has systematic failure modes: factual errors (especially for niche topics), bias from RLHF training (the generating model's aligned preferences), and repetitive stylistic patterns. Always sample and manually review 100+ generated examples before using them for training.
Tip - include failure analysis in your evaluation loop: After each training run, collect examples where the model fails on your target tasks. Categorize these failures: format errors, factual errors, instruction misinterpretation, length issues, safety issues. Then deliberately add training examples that address the most common failure categories. This iterative data-driven improvement cycle is more effective than just adding more random examples.
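A failure-analysis loop needs very little tooling to get started. As a sketch, assume each reviewed failure has been hand-labeled with one of the categories named above; the record format and the `summarize_failures` helper are assumptions for illustration.

```python
from collections import Counter

FAILURE_CATEGORIES = ("format", "factual", "misinterpretation", "length", "safety")


def summarize_failures(reviewed):
    """Rank failure categories by frequency, most common first."""
    counts = Counter(item["category"] for item in reviewed)
    unknown = set(counts) - set(FAILURE_CATEGORIES)
    if unknown:
        raise ValueError(f"Unknown failure categories: {sorted(unknown)}")
    return counts.most_common()


# Hand-labeled failures from one (hypothetical) review session.
reviewed = [
    {"prompt": "List three risks as bullets", "category": "format"},
    {"prompt": "Answer in one sentence", "category": "length"},
    {"prompt": "Return valid JSON", "category": "format"},
    {"prompt": "Who proved the theorem?", "category": "factual"},
    {"prompt": "Give a numbered list", "category": "format"},
    {"prompt": "Keep it under 50 words", "category": "length"},
]
ranking = summarize_failures(reviewed)
print(ranking)
```

The top of the ranking tells you which kind of training example to write next.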
Interview Q&A
Q1: What is the difference between instruction tuning and standard supervised fine-tuning?
Standard SFT adapts a model to a specific task - you train a classifier, a summarizer, or an extractor on labeled data. Instruction tuning is SFT on a diverse mixture of tasks, each expressed as a natural language instruction. The goal is not to improve performance on any single task but to develop the general capability of following instructions. The key property: a good instruction-tuned model generalizes to new task types it was not trained on, because it has learned the meta-skill of interpreting and executing instructions. Standard task-specific SFT does not generalize this way.
Q2: Why did the FLAN paper emphasize task diversity rather than dataset size?
Wei et al. (2021) grouped their 62 datasets into 12 task clusters and showed that adding more clusters to the fine-tuning mixture improved zero-shot generalization to held-out clusters. Adding more examples of tasks the model has already seen, by contrast, mostly improves those tasks. The intuition: task diversity forces the model to learn a general instruction-following capability rather than memorizing task-specific patterns. The model that has seen instructions for summarization, translation, classification, QA, and reasoning has learned something fundamental about "what it means to follow an instruction" - which transfers to new instruction types. A model trained on 10x more summarization examples just gets better at summarization.
Q3: What is chain-of-thought instruction tuning and when does it help?
Chain-of-thought (CoT) instruction tuning includes training examples where the response contains explicit intermediate reasoning steps before the final answer. "The problem is X. Step 1: [reasoning]. Step 2: [reasoning]. Therefore the answer is Y." Wei et al. (2022) showed that chain-of-thought prompting dramatically improves performance on multi-step reasoning tasks such as math word problems, and Chung et al. (2022) showed that including CoT examples in the instruction tuning mixture carries this benefit into fine-tuned models. It works because the model learns to "show its work" - generating intermediate steps makes it more likely to arrive at correct final answers. It helps most on tasks requiring sequential reasoning where the final answer cannot be computed in a single "step." It helps less on straightforward factual retrieval.
Q4: Alpaca cost $500 to create but produced a decent instruction-following model. What were its limitations?
Alpaca (52K GPT-3.5-generated examples + LLaMA-7B fine-tuning) was impressive for its cost but had clear limitations: (1) factual errors - GPT-3.5 makes mistakes on niche knowledge, and those mistakes are baked into the training data; (2) safety issues - GPT-3.5-generated data inherits its alignment properties (sometimes inconsistent refusals, sometimes over-refusals); (3) lack of complex reasoning - short, simple instruction examples do not teach multi-step reasoning; (4) narrow task coverage - the 52K examples do not cover the full diversity of real user requests. Alpaca demonstrated a proof of concept, but production-quality instruction tuning requires either human-annotated data (more expensive) or higher-quality AI-generated data with better filtering and quality control.
Q5: How do you evaluate whether instruction tuning worked well?
A multi-pronged approach. First, loss curves: validation loss should decrease and not diverge. Second, qualitative review: manually read 50-100 model outputs on diverse instructions and categorize failure modes. Third, automated benchmarks: MT-Bench (GPT-4 as judge, 1-10 score), AlpacaEval (win rate vs reference model), IFEval (verifiable instruction following). Fourth, regression testing: check MMLU, HellaSwag, or other general benchmarks to ensure instruction tuning did not cause catastrophic forgetting of base knowledge. Fifth, A/B testing against the base model or previous fine-tuned version on your specific use case. The most important signal is usually qualitative human evaluation - automated metrics often miss subtle quality differences.
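The regression-testing step can be automated as a simple comparison of benchmark scores before and after tuning. The scores below are placeholders, not real results, and the 0.01 tolerance is an arbitrary assumption; in practice you would choose it based on run-to-run noise.

```python
def find_regressions(base_scores, tuned_scores, tolerance=0.01):
    """Return benchmarks where the tuned model dropped by more than `tolerance`."""
    return {
        name: (base, tuned_scores[name])
        for name, base in base_scores.items()
        if name in tuned_scores and base - tuned_scores[name] > tolerance
    }


# Placeholder accuracies for illustration only.
base_scores = {"mmlu": 0.46, "hellaswag": 0.78}
tuned_scores = {"mmlu": 0.41, "hellaswag": 0.775}

regressions = find_regressions(base_scores, tuned_scores)
for name, (before, after) in regressions.items():
    print(f"REGRESSION on {name}: {before:.3f} -> {after:.3f}")
```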
Advanced: Scaling Instruction Tuning - What Works at Different Scales
As a rough guide (the size boundaries are approximate, and results depend on architecture and data quality), instruction tuning effectiveness changes with model scale:
Sub-1B models: Instruction tuning provides minimal benefit for very small models. These models lack the capacity to generalize across task types. They can learn a specific task through SFT but not the meta-skill of "follow instructions."
1B-7B models: Instruction tuning works but generalization to new task types is limited. A 7B model fine-tuned on 50 task types performs reasonably on those tasks but struggles with novel task types not represented in training. These models benefit most from extensive data diversity.
7B-70B models: The sweet spot for instruction tuning. Models in this range can genuinely learn the instruction-following meta-skill. FLAN-T5 11B, LLaMA-2-7B-Chat, Mistral-7B-Instruct - all demonstrate convincing cross-task generalization.
70B+ models: At this scale, instruction tuning has a multiplicative effect. The model's existing knowledge and reasoning capabilities are fully unlocked by instruction tuning. Very few instruction examples are needed (LIMA-style, 1,000-10,000 examples suffice). The primary challenge shifts from "teaching the model to follow instructions" to "teaching the model which behaviors are preferred" - which leads naturally to RLHF/DPO.
Instruction Tuning Checklist
Before launching an instruction tuning run, verify:
"""
Instruction tuning pre-flight checklist.
"""
def validate_instruction_dataset(dataset: list, tokenizer) -> dict:
    """
    Validate dataset quality before training.
    Returns diagnostic report.
    """
    issues = []
    stats = {
        "total_examples": len(dataset),
        "avg_instruction_tokens": 0,
        "avg_response_tokens": 0,
        "empty_responses": 0,
        "very_short_responses": 0,  # Less than 20 tokens
        "very_long_responses": 0,  # More than 1024 tokens
        "duplicate_instructions": 0,
    }
    instruction_token_counts = []
    response_token_counts = []
    seen_instructions = set()
    for example in dataset:
        # Check required fields
        if "instruction" not in example or "response" not in example:
            issues.append("Missing 'instruction' or 'response' field")
            continue
        instruction = example["instruction"].strip()
        response = example["response"].strip()
        # Empty checks
        if not response:
            stats["empty_responses"] += 1
            continue
        # Token counts
        inst_tokens = len(tokenizer(instruction)["input_ids"])
        resp_tokens = len(tokenizer(response)["input_ids"])
        instruction_token_counts.append(inst_tokens)
        response_token_counts.append(resp_tokens)
        if resp_tokens < 20:
            stats["very_short_responses"] += 1
        if resp_tokens > 1024:
            stats["very_long_responses"] += 1
        # Exact duplicate check
        if instruction in seen_instructions:
            stats["duplicate_instructions"] += 1
        seen_instructions.add(instruction)
    if instruction_token_counts:
        stats["avg_instruction_tokens"] = sum(instruction_token_counts) / len(instruction_token_counts)
        stats["avg_response_tokens"] = sum(response_token_counts) / len(response_token_counts)
    # Flag critical issues
    if stats["empty_responses"] > 0:
        issues.append(f"WARNING: {stats['empty_responses']} empty responses (filter these out)")
    if stats["duplicate_instructions"] > len(dataset) * 0.05:
        issues.append(f"WARNING: {stats['duplicate_instructions']} duplicate instructions ({stats['duplicate_instructions']/len(dataset):.1%})")
    if stats["very_short_responses"] > len(dataset) * 0.1:
        issues.append(f"WARNING: {stats['very_short_responses']} very short responses ({stats['very_short_responses']/len(dataset):.1%})")
    # Print report
    print("=== Dataset Validation Report ===")
    print(f"Total examples: {stats['total_examples']}")
    print(f"Avg instruction length: {stats['avg_instruction_tokens']:.0f} tokens")
    print(f"Avg response length: {stats['avg_response_tokens']:.0f} tokens")
    print(f"Issues found: {len(issues)}")
    for issue in issues:
        print(f"  - {issue}")
    return {"stats": stats, "issues": issues}


# Task diversity analysis - ensure you have multiple task types
import math
import re


def check_task_diversity(dataset: list) -> dict:
    """
    Quick check of task type distribution.
    Flag if any single task type dominates.
    """
    # The first matching pattern wins, so specific task types are listed before
    # broad ones ("coding" before "generation", "qa" last). These are rough
    # keyword heuristics, not a real task classifier.
    task_indicators = {
        "classification": r'\b(classify|categorize|label|sentiment|determine if)\b',
        "summarization": r'\b(summarize|summary|tldr|condense)\b',
        "translation": r'\b(translate|in french|in spanish|in german)\b',
        "coding": r'\b(code|function|implement|program|debug)\b',
        "math": r'\b(calculate|solve|compute|find|equation)\b',
        "generation": r'\b(write|generate|create|compose|draft)\b',
        "qa": r'\b(what|why|how|who|when|where|explain)\b',
    }
    counts = {k: 0 for k in task_indicators}
    counts["other"] = 0
    for example in dataset:
        instruction_lower = example.get("instruction", "").lower()
        matched = False
        for task, pattern in task_indicators.items():
            if re.search(pattern, instruction_lower):
                counts[task] += 1
                matched = True
                break
        if not matched:
            counts["other"] += 1
    total = len(dataset)
    if total == 0:
        print("WARNING: Empty dataset, nothing to analyze.")
        return counts
    print("\nTask type distribution:")
    for task, count in sorted(counts.items(), key=lambda x: -x[1]):
        pct = count / total * 100
        bar = "#" * int(pct / 2)
        print(f"  {task:15s}: {pct:5.1f}% {bar}")
    # Diversity score: entropy of task distribution, normalized to [0, 1]
    probs = [c / total for c in counts.values() if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(len(counts))
    diversity_score = entropy / max_entropy
    print(f"\nDiversity score: {diversity_score:.2f} (0=no diversity, 1=perfectly balanced)")
    if diversity_score < 0.5:
        print("WARNING: Low task diversity. Consider adding more task types.")
    return counts
The Self-Instruct expansion technique: When you have a small set of high-quality seed examples (e.g., 100 carefully written instruction-response pairs) but need more diversity, use Self-Instruct (Wang et al., 2022): prompt a large LLM with your seed examples and ask it to generate new, diverse instructions on different topics. Filter the generated instructions for quality and uniqueness (ROUGE-L deduplication: reject any instruction whose ROUGE-L similarity, a longest-common-subsequence overlap score, with an existing instruction is 0.7 or higher). This can expand 100 seed examples into 10,000+ diverse examples at minimal cost.
Instruction Quality vs Quantity
The LIMA paper (Zhou et al., 2023) made a provocative claim: 1,000 carefully curated examples can match the instruction-following quality of models trained on hundreds of thousands of examples. This reignited the quality-vs-quantity debate.
Approximate reported scores (the rows differ in base model and size, so this is not a controlled comparison):
| Dataset | Size | Model | MT-Bench Score |
|---|---|---|---|
| FLAN-v2 | 1,800 tasks | Flan-T5 11B | ~5.5 |
| Alpaca | 52K | LLaMA-7B | ~4.8 |
| WizardLM-Evol | 70K | LLaMA-13B | ~6.4 |
| OpenHermes 2.5 | 1M | Mistral-7B | ~7.0 |
| LIMA | 1K | LLaMA-65B | ~6.3 |
| Tulu 2 | 326K | LLaMA-2-70B | ~7.4 |
The pattern: quality matters most at small scale; quantity starts to win at large scale. With 1–10K examples, every example must be excellent. With 100K+ examples, a dataset in which 20–30% of the examples are merely "good enough" can still produce strong models, because the model sees enough coverage.
What Makes an Instruction "High Quality"
def score_instruction_quality(example: dict) -> dict[str, float]:
    """
    Multi-dimensional quality scoring for instruction-response pairs.
    Returns scores between 0 and 1.
    """
    instruction = example["instruction"]
    response = example["response"]
    scores = {}
    # 1. Instruction clarity - is it specific enough to have one correct interpretation?
    ambiguity_indicators = ["something", "anything", "somehow", "kind of", "sort of"]
    ambiguity_count = sum(1 for w in ambiguity_indicators if w in instruction.lower())
    scores["clarity"] = max(0.0, 1.0 - 0.2 * ambiguity_count)
    # 2. Response length appropriateness
    instruction_words = len(instruction.split())
    response_words = len(response.split())
    if instruction_words < 10 and response_words > 500:
        # Short question, very long answer - probably padded
        scores["length_appropriate"] = 0.5
    elif response_words < 20:
        # Possibly too terse
        scores["length_appropriate"] = 0.4
    elif 50 <= response_words <= 800:
        scores["length_appropriate"] = 1.0
    else:
        scores["length_appropriate"] = 0.7
    # 3. Format consistency - does response follow instruction's format request?
    wants_list = any(w in instruction.lower() for w in ["list", "enumerate", "bullet", "steps"])
    has_list = "1." in response or "- " in response or "* " in response
    if wants_list:
        scores["format_match"] = 1.0 if has_list else 0.3
    else:
        scores["format_match"] = 1.0
    # 4. Refusal check - response should not refuse reasonable requests
    refusal_phrases = [
        "i cannot", "i'm unable", "i won't", "as an ai", "i don't have opinions"
    ]
    has_refusal = any(p in response.lower() for p in refusal_phrases)
    scores["non_refusal"] = 0.2 if has_refusal else 1.0
    # 5. Response starts on topic
    first_sentence = response.split(".")[0].lower()
    instruction_keywords = set(instruction.lower().split()) - {"the", "a", "an", "is", "of"}
    keyword_overlap = len(instruction_keywords & set(first_sentence.split()))
    scores["topic_relevance"] = min(1.0, keyword_overlap / max(1, len(instruction_keywords) * 0.3))
    overall = sum(scores.values()) / len(scores)
    return {"overall": overall, **scores}


def filter_by_quality(
    dataset: list[dict],
    min_quality_score: float = 0.7,
) -> list[dict]:
    """Filter dataset to only include high-quality examples."""
    if not dataset:
        # Avoid dividing by zero in the summary line below
        print("Kept 0/0 examples")
        return []
    scored = [(ex, score_instruction_quality(ex)) for ex in dataset]
    filtered = [ex for ex, score in scored if score["overall"] >= min_quality_score]
    print(f"Kept {len(filtered)}/{len(dataset)} examples ({len(filtered)/len(dataset):.1%})")
    return filtered
Instruction Tuning for Specific Domains
General instruction tuning datasets (FLAN, ShareGPT, Orca) cover broad task categories. For domain-specific applications, you need domain-tailored instruction data.
Medical Instruction Tuning
# Example: Medical QA instruction template
MEDICAL_SYSTEM_PROMPT = """You are a medical information assistant. Provide accurate,
evidence-based medical information. Always recommend consulting a healthcare professional
for personal medical decisions. Cite relevant clinical guidelines when applicable."""
def format_medical_instruction(question: str, context: str = "") -> str:
    """Format medical Q&A with safety framing."""
    if context:
        return f"""Given the following clinical context:
{context}
Answer this medical question: {question}
Provide: (1) a direct answer, (2) relevant pathophysiology, (3) clinical considerations,
and (4) when to refer to a specialist."""
    return f"""Answer this medical question: {question}
Provide accurate, evidence-based information appropriate for a healthcare professional audience."""
# Domain-specific datasets for medical instruction tuning:
# - MedAlpaca (medical Alpaca format, 160K examples)
# - PMC-LLaMA dataset (PubMed Central papers)
# - MedQA (USMLE-style questions; about 12.7K in English, roughly 61K including the Chinese-language sets)
# - ClinicalBench (clinical reasoning evaluation)
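The system prompt and the formatted instruction still have to be combined into a single training record. The snippet below is a minimal sketch of one way to do that, assuming a chat-style `messages` schema written out as JSONL. The schema, the `to_chat_record` helper, the shortened system prompt, and the placeholder answer are all illustrative assumptions, not part of any specific training framework.

```python
import json

# Hypothetical stand-in for MEDICAL_SYSTEM_PROMPT (shortened for the example)
SYSTEM_PROMPT = "You are a medical information assistant."


def to_chat_record(system_prompt: str, instruction: str, response: str) -> dict:
    """Pack one example into a messages-style record (an assumed schema)."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": instruction.strip()},
            {"role": "assistant", "content": response.strip()},
        ]
    }


record = to_chat_record(
    SYSTEM_PROMPT,
    "Answer this medical question: What are common symptoms of iron-deficiency anemia?",
    "<clinician-reviewed answer goes here>",  # placeholder, not real medical content
)

# One JSON object per line is a common on-disk layout for instruction data
jsonl_line = json.dumps(record, ensure_ascii=False)
print(jsonl_line)
```

Keeping the system prompt inside every record means the model sees the same safety framing during training that it will see at inference time.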
Code Instruction Tuning
Code is a particularly effective domain for instruction tuning because quality is objectively measurable - you can run the code.
CODE_TASKS = [
    # Level 1: Simple generation
    "Write a Python function that {task}",
    "Implement {algorithm} in Python",
    # Level 2: Explanation
    "Explain what this code does:\n```python\n{code}\n```",
    "What is the time complexity of this function? Why?\n```python\n{code}\n```",
    # Level 3: Debugging
    "This code has a bug. Find and fix it:\n```python\n{buggy_code}\n```\nExpected: {expected}",
    # Level 4: Optimization
    "Optimize this code for performance:\n```python\n{code}\n```",
    # Level 5: Architecture
    "Design a {system} in Python. Include class structure, key methods, and explain design decisions.",
]
import random

def create_code_instruction_dataset(
    problems: list[dict],  # [{task, solution, explanation}], optionally buggy_version and bug_description
    include_buggy: bool = True,
) -> list[dict]:
    """Create diverse code instruction examples from problem set."""
    examples = []
    for problem in problems:
        # Basic generation instruction
        examples.append({
            "instruction": f"Write a Python function that {problem['task']}",
            "response": f"```python\n{problem['solution']}\n```\n\n{problem['explanation']}",
        })
        # Explanation instruction
        examples.append({
            "instruction": f"Explain what this Python code does:\n```python\n{problem['solution']}\n```",
            "response": problem['explanation'],
        })
        # Debug instruction: needs both the buggy code and a description of the bug
        if include_buggy and "buggy_version" in problem and "bug_description" in problem:
            examples.append({
                "instruction": (
                    f"This Python code has a bug. Find and fix it:\n"
                    f"```python\n{problem['buggy_version']}\n```\n"
                    f"Expected behavior: {problem['task']}"
                ),
                "response": (
                    f"The bug is: {problem['bug_description']}\n\n"
                    f"Fixed code:\n```python\n{problem['solution']}\n```"
                ),
            })
    random.shuffle(examples)
    return examples
Instruction tuning datasets for code (as of 2024):
- Code Alpaca (20K): early code instruction dataset, quality is mixed
- CodeFeedback (66K): multi-turn code conversations with execution feedback
- Magicoder-OSS-Instruct (75K): instruction data synthesized by an LLM from open-source code snippets used as seeds
- OpenCoder-LLM dataset (1B+ tokens): curriculum approach - basic to advanced
Fine-tuning on code data is also reported to improve reasoning ability on non-code tasks (Wei et al., 2022 - code training improves chain-of-thought).
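Because code quality can be checked by running the code, a dataset builder can drop examples whose solutions fail their own tests. The sketch below shows one possible form of that filter. It assumes each problem dict carries a hypothetical `tests` field of plain `assert` statements, which is not part of the problem schema used above. It also runs each candidate in a subprocess with a timeout, which is not a security sandbox, so untrusted code would need real isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution_code: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Return True if the solution followed by its tests exits with code 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,  # keep candidate output off our stdout
                timeout=timeout_s,    # treat infinite loops as failures
            )
        except subprocess.TimeoutExpired:
            return False
    return proc.returncode == 0


def filter_by_execution(problems: list[dict]) -> list[dict]:
    """Keep only problems whose 'solution' passes its 'tests' (assumed field)."""
    return [p for p in problems if passes_tests(p["solution"], p["tests"])]
```

A failed assertion raises in the child process and produces a non-zero exit code, so incorrect solutions are filtered out without any parsing of their output.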
Evaluating Instruction-Tuned Models
Instruction tuning is easy to do but hard to evaluate rigorously. The right evaluation framework depends on your use case.
MT-Bench - Multi-Turn Conversation Evaluation
MT-Bench (Zheng et al., 2023) uses GPT-4 as a judge to score model responses on 80 multi-turn questions across 8 categories: writing, roleplay, reasoning, math, coding, extraction, STEM, humanities. Scores range from 1–10.
import json
import re

import anthropic  # or openai

def evaluate_with_llm_judge(
    question: str,
    model_response: str,
    reference_answer: str | None = None,
    judge_model: str = "claude-3-5-sonnet-20241022",
) -> dict:
    """Use an LLM as a judge to score a model response."""
    client = anthropic.Anthropic()
    if reference_answer:
        prompt = f"""[Question]
{question}
[Reference Answer]
{reference_answer}
[Model Response]
{model_response}
Rate the model response on a scale of 1-10 based on:
- Correctness (does it answer the question accurately?)
- Completeness (does it cover all important aspects?)
- Clarity (is it well-written and easy to understand?)
Provide your rating as JSON: {{"score": X, "reasoning": "..."}}"""
    else:
        prompt = f"""[Question]
{question}
[Model Response]
{model_response}
Rate this response on a scale of 1-10 based on:
- Helpfulness (does it directly address what was asked?)
- Accuracy (is the information correct to the best of your knowledge?)
- Quality (is it well-written, clear, and appropriately detailed?)
Provide your rating as JSON: {{"score": X, "reasoning": "..."}}"""
    response = client.messages.create(
        model=judge_model,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.content[0].text
    try:
        result = json.loads(text)
    except json.JSONDecodeError:
        # Fallback: take the first number that appears in the reply
        match = re.search(r"\d+(?:\.\d+)?", text)
        if match is None:
            raise ValueError(f"No score found in judge reply: {text!r}")
        result = {"score": float(match.group()), "reasoning": text}
    return result
import random

def run_instruction_evaluation(
    model_generate_fn,  # Callable: (prompt: str) -> str
    eval_set: list[dict],  # [{"instruction": ..., "reference": ...}]
    sample_size: int = 50,
) -> dict:
    """Evaluate an instruction-tuned model on a sample of the eval set."""
    sample = random.sample(eval_set, min(sample_size, len(eval_set)))
    scores = []
    category_scores = {}
    for example in sample:
        response = model_generate_fn(example["instruction"])
        result = evaluate_with_llm_judge(
            question=example["instruction"],
            model_response=response,
            reference_answer=example.get("reference"),
        )
        scores.append(result["score"])
        category = example.get("category", "general")
        category_scores.setdefault(category, []).append(result["score"])
    if not scores:
        # Empty eval set: nothing to average
        return {"overall_score": None, "by_category": {}, "num_evaluated": 0}
    return {
        "overall_score": sum(scores) / len(scores),
        "by_category": {cat: sum(s) / len(s) for cat, s in category_scores.items()},
        "num_evaluated": len(scores),
    }
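Judge models often wrap the requested JSON in a sentence of preamble or a code fence, so strict `json.loads` on the whole reply can fail even when a valid score is present. The function below is a sketch of a more tolerant parser. The name `parse_judge_reply` and the choice to return `None` when no score is found are assumptions made for illustration.

```python
import json
import re


def parse_judge_reply(text: str) -> dict:
    """Extract {"score", "reasoning"} from a judge reply, tolerating extra prose."""
    # First try: the span from the first "{" to the last "}" in the reply
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            if isinstance(data, dict) and "score" in data:
                return {
                    "score": float(data["score"]),
                    "reasoning": str(data.get("reasoning", "")),
                }
        except (json.JSONDecodeError, TypeError, ValueError):
            pass  # fall through to the plain-text heuristic
    # Fallback: a number shortly after the word "score" or "rating"
    match = re.search(r"(?:score|rating)\D{0,10}(\d+(?:\.\d+)?)", text, flags=re.IGNORECASE)
    if match:
        return {"score": float(match.group(1)), "reasoning": text}
    # No usable score: let the caller decide whether to retry or skip
    return {"score": None, "reasoning": text}
```

Returning `None` instead of guessing keeps unparseable replies out of the average, at the cost of requiring the caller to filter them before aggregating.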
Winrate Evaluation
Head-to-head comparison against a baseline model (e.g., the base model before instruction tuning, or GPT-3.5):
import random

def compute_winrate(
    model_a_responses: list[str],
    model_b_responses: list[str],
    questions: list[str],
    judge_fn,  # LLM judge: (question, first, second) -> "first", "second", or "tie"
) -> float:
    """Compute win rate of model A vs model B (ties count toward the total)."""
    wins_a = 0
    wins_b = 0
    ties = 0
    for q, resp_a, resp_b in zip(questions, model_a_responses, model_b_responses):
        # Randomize order to reduce position bias
        a_is_first = random.random() < 0.5
        first, second = (resp_a, resp_b) if a_is_first else (resp_b, resp_a)
        verdict = judge_fn(q, first, second)  # Returns "first", "second", or "tie"
        if verdict == "tie":
            ties += 1
        elif (verdict == "first" and a_is_first) or (verdict == "second" and not a_is_first):
            wins_a += 1
        else:
            wins_b += 1
    total = wins_a + wins_b + ties
    # Avoid dividing by zero when no questions were supplied
    winrate = wins_a / total if total else 0.0
    print(f"Win rate (A vs B): {winrate:.1%} (A wins: {wins_a}, B wins: {wins_b}, Ties: {ties})")
    return winrate
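Randomizing the order averages out position bias across many questions, but any single verdict can still be driven by which response came first. A common alternative is to query the judge in both orders and count a win only when the two verdicts agree. The sketch below assumes the same `judge_fn(question, first, second)` contract as above; `debiased_verdict` is a hypothetical helper name.

```python
from typing import Callable

# judge_fn returns "first", "second", or "tie"
JudgeFn = Callable[[str, str, str], str]


def debiased_verdict(judge_fn: JudgeFn, question: str, resp_a: str, resp_b: str) -> str:
    """Judge both orderings; return "a", "b", or "tie" if the verdicts disagree."""
    forward = judge_fn(question, resp_a, resp_b)   # A shown first
    backward = judge_fn(question, resp_b, resp_a)  # B shown first
    if forward == "first" and backward == "second":
        return "a"  # A preferred regardless of position
    if forward == "second" and backward == "first":
        return "b"  # B preferred regardless of position
    return "tie"    # explicit tie, or the judge followed position rather than content


# Toy judges for illustration only: one purely position-biased, one content-based
def always_first(question: str, first: str, second: str) -> str:
    return "first"


def longer_wins(question: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

This doubles the number of judge calls, so it trades cost for verdicts that are less sensitive to ordering.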
Key Takeaways
Instruction tuning is the bridge between raw language model capability and practical utility. A pretrained model that can complete any text has latent knowledge of nearly every task - instruction tuning teaches it to use that knowledge when asked.
The core lesson from the research between FLAN (2021) and Llama 3 (2024): task diversity matters more than task quantity. A model that has seen 1,000 diverse tasks generalizes better than a model that has seen 100K examples of 5 tasks. The instruction format is essentially a learned interface - once the model understands the pattern "instruction → helpful response," it can apply that pattern to novel instructions at inference time.
For practitioners: the easiest wins come from (1) starting from an instruction-tuned model rather than a raw pretrained model, (2) collecting 500–2,000 high-quality domain-specific examples rather than tens of thousands of mediocre ones, and (3) evaluating with an LLM-as-judge rather than only perplexity, because perplexity does not capture instruction-following quality.
