Distillation Datasets: Capturing Frontier Model Knowledge
The Reasoning Gap Nobody Warned You About
The Microsoft research team has been staring at the same benchmark results for three days. It is early 2023, and they have done everything right. They fine-tuned LLaMA-13B on 52,000 Alpaca examples and 70,000 WizardLM examples. Both datasets were cleaned, deduplicated, and scored for quality. The resulting model follows instructions well. It passes internal vibes checks. It handles summarization, basic coding, and Q&A competently. Product managers are enthusiastic. The paper draft is already in review.
Then one researcher runs a systematic analysis on multi-step reasoning tasks - the kind where GPT-4's responses read like a thoughtful expert working through a problem out loud. On these tasks, the fine-tuned model falls apart in a specific, troubling way. It doesn't produce obviously wrong answers. It produces fluent, well-formatted, plausible-sounding wrong answers. Ask it to reason through a multi-step logic problem, and it confidently produces an answer with no visible reasoning. Ask it to explain its reasoning and it generates retrospective rationalization - a post-hoc story that sounds like reasoning but doesn't actually reflect how it arrived at the answer. The model has learned to mimic the output format of intelligent responses without learning the process that produces them.
The insight that changes everything comes from one researcher asking a deceptively simple question: when GPT-4 answers a hard question, what is it actually doing that makes the answer better? The team pulls up a series of GPT-4 responses to complex questions. The pattern is immediate. GPT-4 doesn't just produce answers. It reasons. It identifies what the question is actually asking. It notices what information is relevant. It works through intermediate steps. It sometimes catches an error it made three sentences earlier and corrects course. It acknowledges uncertainty when uncertainty is warranted. The chain of thought is not decorative - it is the mechanism. And when you train on final answers only, you teach a model to skip that mechanism and guess at what the answer should look like.
The Orca paper (Mukherjee et al., 2023) was the result of taking this observation seriously at scale. Instead of training on GPT-4's answers alone, train on the teacher's thinking: roughly 5 million ChatGPT responses and 1 million GPT-4 responses, generated with system prompts that force the model to make its intermediate reasoning explicit. The resulting Orca-13B reached parity with ChatGPT on complex reasoning benchmarks such as Big-Bench Hard, substantially outperformed same-size models trained on answer-only data, and produced reasoning that is visibly correct rather than merely fluent. The lesson: distillation is not just about capturing what a model knows. It is about capturing how it thinks.
Why Distillation Datasets Exist: The Inference Cost Problem
The fundamental economic reality of frontier models creates an unsustainable situation at scale. claude-opus-4-6 produces outputs of extraordinary quality, but at production volume the bill starts at roughly $15,000 per day, which works out to about $5.5 million to $11 million per year. For most products, this is prohibitive.
Distillation is the engineering solution: use the frontier model extensively during a one-time dataset generation phase, then train a smaller model on those examples. The smaller model can cost on the order of 100x less to serve.
The math works whenever two conditions hold: you will handle more than roughly 1 million queries total over the model's lifetime, and your task is specific enough that a fine-tuned small model can match frontier quality. The second condition is the critical one. General-purpose open-ended question answering is hard to specialize. Customer support for a specific product, legal document classification in a specific jurisdiction, code generation in a specific codebase, medical triage for a specific care setting - these are all specializable. The distillation dataset focuses the student model on exactly the task distribution it will encounter in production, and the small model becomes genuinely competitive with frontier models on that specific distribution.
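The first condition can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and every dollar figure in the example are assumptions chosen for the demonstration, not measured prices.

```python
def break_even_queries(
    one_time_cost_usd: float,
    teacher_cost_per_query_usd: float,
    student_cost_per_query_usd: float,
) -> float:
    """Number of queries after which serving the student is cheaper overall.

    one_time_cost_usd covers dataset generation plus fine-tuning.
    All inputs are hypothetical planning figures.
    """
    saving_per_query = teacher_cost_per_query_usd - student_cost_per_query_usd
    if saving_per_query <= 0:
        raise ValueError("student must be cheaper per query than the teacher")
    return one_time_cost_usd / saving_per_query


# Assumed figures: $20,000 one-time cost, $0.02 per teacher query,
# and a student that is 100x cheaper to serve.
queries = break_even_queries(20_000, 0.02, 0.0002)
print(f"Break-even after about {queries:,.0f} queries")
```

With these assumed inputs the break-even point lands at roughly one million queries. A cheaper dataset or a more expensive teacher moves it earlier.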
The Three Levels of Distillation
Not all distillation datasets capture the same depth of knowledge. The level you choose determines what capabilities actually transfer to the student model.
Level 1: Output Distillation
The simplest approach: ask the question, record the answer. No system prompt engineering. No reasoning elicitation. This is the Alpaca approach.
Input: "Explain why quicksort has O(n²) worst-case time complexity."
Output: "Quicksort degrades to O(n²) when the pivot is consistently the
smallest or largest element, creating n-1 partition levels each
requiring O(n) comparisons..."
What you capture: Output format, surface knowledge, tone.

What you miss: The reasoning that got there. Why this explanation angle was chosen. What the model considered and rejected. How certain the model is. The self-correction that made the final answer better than the first draft.
This approach produces models that are excellent at sounding like the teacher. On simple questions, they produce the right answer because the right answer is pattern-matchable from surface features. On complex questions, they produce a confidently-delivered wrong answer because they learned to produce plausible-looking outputs, not to reason toward correct ones.
Level 2: Reasoning Trace Distillation
Capture explicit intermediate reasoning steps by prompting the frontier model to make its thinking visible.
System: "Think step by step. Show all your reasoning. Identify what you're
uncertain about. Verify your answer makes sense."
Input: "Explain why quicksort has O(n²) worst-case complexity."
Output: "Let me think through what determines quicksort's time complexity.
The partition step always does O(n) work - that's fixed per level.
What determines total work is the number of recursive levels.
Best case: pivot splits the array into equal halves each time.
That gives log₂(n) levels, so total work = O(n log n).
Worst case: what if pivot is always the smallest element?
Each call produces one subarray of size n-1 and one of size 0.
We get n-1 recursive calls: work = n + (n-1) + ... + 2 + 1 = n(n+1)/2.
That's O(n²).
When does worst case occur? When the array is already sorted and
we always pick the first element as pivot. This is why randomized
pivot selection (or median-of-three) is standard in practice -
it makes worst-case probability exponentially small."
What you capture: The reasoning process, self-checking, uncertainty acknowledgment, how concepts connect.

Impact: Models trained on Level 2 data learn to reason, not just to recognize and reproduce answer patterns. The Orca paper is the best-known demonstration of this effect.
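In training-data terms, the difference between Level 1 and Level 2 is what goes into the assistant turn. A minimal sketch, assuming a chat-style `messages` record; the helper name and field layout are illustrative and not taken from the Orca paper.

```python
def make_training_record(
    question: str,
    final_answer: str,
    reasoning_trace: str = "",
    system_prompt: str = "",
) -> dict:
    """Build one chat-format record (hypothetical helper).

    With an empty reasoning_trace the target is answer-only (Level 1).
    With a trace, the target is the trace followed by the answer (Level 2).
    """
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})

    target = final_answer
    if reasoning_trace:
        target = f"{reasoning_trace}\n\n{final_answer}"
    messages.append({"role": "assistant", "content": target})
    return {"messages": messages}


level_1 = make_training_record(
    "Why is quicksort O(n^2) in the worst case?",
    "Because a consistently bad pivot produces n-1 levels of O(n) work.",
)
level_2 = make_training_record(
    "Why is quicksort O(n^2) in the worst case?",
    "Therefore the worst case is O(n^2).",
    reasoning_trace="Partitioning costs O(n) per level. A worst-case pivot gives n-1 levels.",
    system_prompt="Think step by step. Show all your reasoning.",
)
```

The student trained on `level_2` records sees the intermediate steps as part of the target it must reproduce, which is the whole point of Level 2.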
Level 3: Textbook Distillation
Generate not just answers but pedagogically structured explanations that teach from first principles:
TEXTBOOK_SYSTEM_PROMPT = """You are an expert teacher creating textbook-quality
explanations for an advanced programming course.
For each concept:
1. Explain from first principles - assume no prior knowledge of this specific concept
2. Work through a concrete, complete example with every intermediate value shown
3. Identify the 2-3 most common misconceptions about this concept and refute each
4. State the general principle that emerges from the specific example
5. Provide 2 practice problems with complete solutions and explanations
6. Explain precisely when to use this concept vs. the most common alternative"""
What you capture: Deep conceptual structure, pedagogical ordering, explicit misconception correction.

Best for: Small models (1B-7B parameters) where data quality matters more than quantity. The Phi-1 paper showed that a 1.3B-parameter model trained on roughly 7B tokens of textbook-quality data (filtered web code plus about 1B synthetic tokens) can outperform models more than 10x its size trained on far larger volumes of unfiltered internet code.
Production Distillation Pipeline
import anthropic
import json
import re
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional

client = anthropic.Anthropic()


@dataclass
class DistillationExample:
    """A single distillation training example with full metadata."""
    question: str
    system_prompt_key: str
    system_prompt: str
    full_response: str
    reasoning_trace: str
    final_answer: str
    source_model: str
    quality_score: float = 0.0
    metadata: dict = field(default_factory=dict)

# Diverse system prompts that elicit different reasoning styles.
# Based on Orca's approach of 16 system prompts - variety ensures the
# student model sees reasoning expressed in many ways, not just one.
DISTILLATION_SYSTEM_PROMPTS = {
    "chain_of_thought": (
        "You are a helpful AI assistant. When answering questions, think through "
        "your reasoning step by step before giving your final answer. Show all your "
        "work. Use phrases like 'First, I need to...', 'This tells me that...', "
        "'Therefore...' to make reasoning visible."
    ),
    "expert_explanation": (
        "You are an expert teacher. For every answer: (1) Start with the key concept "
        "or principle that governs the answer. (2) Explain each reasoning step with "
        "explicit justification for why that step follows from the previous. "
        "(3) Connect back to the original question. (4) State your conclusion clearly "
        "and explain what makes it correct. Make explanations detailed enough to teach from."
    ),
    "socratic": (
        "You are a wise tutor. When answering: Begin by identifying exactly what the "
        "question is asking and what information is needed. Note any assumptions you're "
        "making and why they're warranted. Explore the problem from multiple angles before "
        "settling on an approach. Point out potential pitfalls or edge cases. Verify your "
        "answer makes intuitive sense and explain why."
    ),
    "code_expert": (
        "You are a senior software engineer. When answering coding questions: "
        "(1) Understand the requirements and identify constraints or edge cases. "
        "(2) Design the algorithm before writing code - explain your approach. "
        "(3) Write clean, well-commented code with proper error handling. "
        "(4) Analyze time and space complexity with justification. "
        "(5) Show 2 test cases including one edge case."
    ),
    "scientific": (
        "You are a scientist explaining complex phenomena. Start with empirical "
        "observations or known facts. Apply first principles to explain the mechanism. "
        "Use precise analogies for abstract concepts. Distinguish clearly between "
        "well-established facts and inferences. Consider and explicitly refute at "
        "least one alternative explanation."
    ),
    "first_principles": (
        "You are an analytical thinker. Break every problem down to its fundamental "
        "components before building up to the answer. Start from what you know for "
        "certain, reason to what can be derived, and clearly mark any assumptions. "
        "Show the logical dependency chain from premises to conclusion."
    ),
    "calibrated": (
        "You are an AI assistant. When answering, always distinguish between: "
        "things you are certain about, things that are likely true but uncertain, "
        "and things where reasonable experts disagree. Provide your best answer "
        "while being explicit about confidence levels. Never present uncertain "
        "claims with more confidence than warranted."
    ),
    "detailed": (
        "You are an AI assistant. You will be given a task. You must generate "
        "a detailed and comprehensive answer that covers the topic thoroughly. "
        "Include relevant context, important nuance, and considerations that a "
        "complete answer requires. Prefer depth and accuracy over brevity."
    ),
}

def generate_distillation_example(
    question: str,
    system_prompt_key: str,
    model: str = "claude-opus-4-6",
    max_tokens: int = 2500,
) -> DistillationExample:
    """
    Generate one distillation example using the frontier teacher model.

    The system prompt is as important as the question. Different system
    prompts elicit qualitatively different reasoning structures from the
    same model on the same question - varying in depth, style, pedagogical
    approach, and what aspects of reasoning they make explicit.

    Args:
        question: The question to generate a reasoning trace for
        system_prompt_key: Which reasoning style to elicit
        model: Frontier teacher model (use the best available)
        max_tokens: Response length budget

    Returns:
        DistillationExample with full response, split reasoning, and metadata
    """
    system_prompt = DISTILLATION_SYSTEM_PROMPTS[system_prompt_key]
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system_prompt,
        messages=[{"role": "user", "content": question}]
    )
    full_response = response.content[0].text.strip()
    reasoning_trace, final_answer = extract_reasoning_and_answer(full_response)
    return DistillationExample(
        question=question,
        system_prompt_key=system_prompt_key,
        system_prompt=system_prompt,
        full_response=full_response,
        reasoning_trace=reasoning_trace,
        final_answer=final_answer,
        source_model=model,
        metadata={
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cost_usd": (
                response.usage.input_tokens * 15 / 1_000_000 +
                response.usage.output_tokens * 75 / 1_000_000
            ),  # Approximate Opus pricing
        }
    )

def extract_reasoning_and_answer(text: str) -> tuple[str, str]:
    """
    Split a response into reasoning trace and final answer.

    Heuristic: scan for conclusion markers past the first 45% of the
    response. If any are found, split at the last one. Otherwise treat
    the last paragraph as the answer.

    Args:
        text: Full model response

    Returns:
        (reasoning_trace, final_answer) - both may be empty strings
    """
    conclusion_markers = [
        "therefore,", "in conclusion,", "to summarize,", "in summary,",
        "the answer is", "thus,", "so, the", "the result is",
        "final answer:", "to conclude,", "ultimately,", "to wrap up,",
        "putting this together,", "to put it together,",
    ]
    lower_text = text.lower()
    last_marker_pos = -1
    for marker in conclusion_markers:
        pos = lower_text.rfind(marker)
        # Only count markers past the first 45% of the response, to avoid
        # splitting on a "therefore" that sits in the middle of the reasoning
        if pos > last_marker_pos and pos > len(text) * 0.45:
            last_marker_pos = pos
    if last_marker_pos > 0:
        reasoning = text[:last_marker_pos].strip()
        answer = text[last_marker_pos:].strip()
        return reasoning, answer
    # No clear marker - treat last paragraph as final answer
    paragraphs = [p.strip() for p in text.strip().split("\n\n") if p.strip()]
    if len(paragraphs) > 1:
        return "\n\n".join(paragraphs[:-1]), paragraphs[-1]
    # Single paragraph - no clean separation possible
    return "", text

def score_distillation_quality(
    example: DistillationExample,
    scorer_model: str = "claude-haiku-4-5-20251001"
) -> float:
    """
    Score a distillation example on reasoning quality dimensions.

    Distillation quality is different from SFT quality. The key
    dimensions for distillation are:
    - Reasoning visibility: Is the thinking process explicit and followable?
    - Reasoning correctness: Are the intermediate steps logically valid?
    - Answer correctness: Is the final answer accurate and complete?
    - Educational value: Would a student learn sound reasoning patterns?
    - Calibration: Does the response acknowledge appropriate uncertainty?

    Uses claude-haiku for cost efficiency - scoring is a simple evaluation
    task that does not require the full capability of Opus.

    Returns float score 0.0-10.0
    """
    reasoning_excerpt = (
        example.reasoning_trace[:700]
        if example.reasoning_trace
        else "No explicit reasoning trace extracted"
    )
    prompt = f"""Evaluate this as a training example for teaching a model to reason.
The goal is to transfer reasoning behavior, not just factual knowledge.

Question: {example.question[:500]}

Reasoning shown:
{reasoning_excerpt}

Final answer:
{example.final_answer[:500]}

Rate each dimension (0-2 points each):
1. Reasoning visibility: Are intermediate steps explicit, followable, and non-trivial?
2. Reasoning correctness: Are the logical steps actually valid (not just plausible-sounding)?
3. Answer correctness: Is the final answer accurate, complete, and well-grounded?
4. Educational value: Would a student learn sound reasoning patterns from this example?
5. Calibration: Does the response express appropriate certainty - neither overconfident nor evasive?

Respond with ONLY a single number 0-10 (total across all 5 dimensions):"""
    response = client.messages.create(
        model=scorer_model,
        max_tokens=10,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        score_text = response.content[0].text.strip()
    except (IndexError, AttributeError):
        return 5.0  # Default to neutral score when the reply has no text
    # Take the first number in the reply. Joining every digit instead
    # would turn a reply such as "8/10" into 810.
    match = re.search(r"\d+(?:\.\d+)?", score_text)
    if match is None:
        return 5.0  # Default to neutral score on parsing failure
    return min(10.0, max(0.0, float(match.group())))
The Complete Build Pipeline
def build_distillation_dataset(
    questions: list[str],
    model: str = "claude-opus-4-6",
    quality_threshold: float = 6.5,
    output_path: str = "distillation_dataset.jsonl",
    rotate_system_prompts: bool = True,
    verbose: bool = True,
) -> list[DistillationExample]:
    """
    Build a complete distillation dataset from a list of questions.

    Rotates through system prompts so that consecutive questions are
    answered in different reasoning styles. This is the Orca approach -
    system prompt diversity is what makes the dataset teach reasoning
    rather than just output format.

    Args:
        questions: Questions to generate reasoning traces for
        model: Frontier teacher model
        quality_threshold: Minimum quality score (0-10) to include
        output_path: JSONL output path for training
        rotate_system_prompts: Rotate through system prompts for diversity
        verbose: Print progress and cost tracking

    Returns:
        List of accepted DistillationExample objects
    """
    prompt_keys = list(DISTILLATION_SYSTEM_PROMPTS.keys())
    accepted = []
    rejected = 0
    total_cost = 0.0

    if verbose:
        print("Building distillation dataset:")
        print(f" Questions: {len(questions)}")
        print(f" Teacher model: {model}")
        print(f" Quality threshold: {quality_threshold}/10")

    with open(output_path, "w") as f:
        for i, question in enumerate(questions):
            # Rotate system prompts: each question gets a different
            # reasoning style, which is critical for dataset diversity
            system_key = (
                prompt_keys[i % len(prompt_keys)]
                if rotate_system_prompts
                else "chain_of_thought"
            )
            if verbose:
                print(f"\n[{i+1}/{len(questions)}] {question[:65]}...")
                print(f" System: {system_key}")
            try:
                example = generate_distillation_example(
                    question, system_key, model
                )
                cost = example.metadata.get("cost_usd", 0)
                total_cost += cost
                example.quality_score = score_distillation_quality(example)
                if example.quality_score >= quality_threshold:
                    accepted.append(example)
                    record = {
                        "instruction": example.question,
                        "system": example.system_prompt,
                        "reasoning_trace": example.reasoning_trace,
                        "output": example.full_response,
                        "quality_score": example.quality_score,
                        "system_prompt_key": example.system_prompt_key,
                        "source_model": example.source_model,
                        "metadata": example.metadata,
                    }
                    f.write(json.dumps(record) + "\n")
                    if verbose:
                        print(f" Score: {example.quality_score:.1f}/10 "
                              f"- ACCEPTED (cost: ${cost:.4f})")
                else:
                    rejected += 1
                    if verbose:
                        print(f" Score: {example.quality_score:.1f}/10 "
                              f"- REJECTED (below {quality_threshold})")
            except Exception as e:
                print(f" ERROR: {e}")
                rejected += 1
                time.sleep(1)  # Back off on errors

    if verbose:
        total = len(accepted) + rejected
        print("\nDistillation complete:")
        # max(total, 1) avoids a ZeroDivisionError on an empty question list
        print(f" Accepted: {len(accepted)} / {total} "
              f"({len(accepted)/max(total, 1)*100:.0f}%)")
        print(f" Rejected: {rejected}")
        print(f" Total cost: ${total_cost:.2f}")
        print(f" Cost per accepted example: "
              f"${total_cost/max(len(accepted),1):.4f}")
    return accepted
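The loop above sleeps for one second after any exception and then moves on to the next question. For long runs, retrying transient failures is often preferable to dropping the example. The following is a generic sketch, not part of the pipeline above: `call_with_backoff` and its parameters are hypothetical, and it retries on any exception, which in practice you would narrow to rate-limit and connection errors.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponential backoff plus jitter.

    Hypothetical helper: retries on any Exception and re-raises the last
    error once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")


# Demo with a function that fails twice, then succeeds (no real sleeping).
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


result = call_with_backoff(flaky, sleep=lambda seconds: None)
```

In the pipeline, the wrapped call would be something like `lambda: generate_distillation_example(question, system_key, model)`. The `sleep` parameter exists only so the helper can be tested without waiting.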
The Orca Approach: System Prompt Diversity at Scale
The Orca paper's most important design decision - and the one that most practitioners overlook - is using 16 different system prompts to elicit varied reasoning styles. The same question asked with different system prompts produces qualitatively different reasoning traces: different starting points, different intermediate steps, different ways of expressing uncertainty, different levels of detail. This variety is what teaches the student model robust reasoning, not just imitation of one particular reasoning style.
import anthropic
import json
import random
import time

client = anthropic.Anthropic()

# Reuses extract_reasoning_and_answer from the production pipeline above.

# Orca-style system prompts covering different reasoning personas and styles
# The original Orca paper used 16 prompts; this is a production-quality set
ORCA_STYLE_PROMPTS = [
    # Explicit step-by-step reasoning
    ("step_by_step",
     "You are a helpful AI assistant. Think step by step and justify each step "
     "before giving your final answer. Show all reasoning explicitly."),
    # Deep expert analysis
    ("expert_analysis",
     "You are an expert analyst. Provide a detailed analysis covering the topic "
     "comprehensively, including key mechanisms, relevant context, and implications."),
    # Accessible pedagogy
    ("accessible",
     "You are a helpful teacher. Explain clearly and accessibly, starting with "
     "a simple intuition before building to the precise answer."),
    # Engineering approach
    ("engineering",
     "You are a senior engineer. Approach problems systematically: identify "
     "constraints and edge cases first, then design a solution, then verify it."),
    # Scientific method
    ("scientific",
     "You are a scientist. Apply the scientific method: start from known facts, "
     "reason from first principles, distinguish observation from inference."),
    # Verification-first
    ("verified",
     "After answering, explicitly verify your answer by checking it against "
     "first principles or a simple test case. Note if anything seems off."),
    # Uncertainty-aware
    ("calibrated",
     "Provide the main answer, then explicitly state what you're uncertain about "
     "and why. Distinguish between well-established facts and inferences."),
    # Alternative perspectives
    ("multi_perspective",
     "Provide the main answer, then provide an alternative perspective or approach "
     "that leads to the same conclusion, or note if experts disagree and why."),
    # Key insight first
    ("key_insight",
     "Identify the key insight that makes this question answerable, state it "
     "clearly first, then elaborate with detail and examples."),
    # Structured decomposition
    ("decompose",
     "Start by breaking the question into its component parts. Answer each part "
     "systematically, then synthesize into a complete answer."),
]

def generate_orca_style_dataset(
    questions: list[str],
    model: str = "claude-opus-4-6",
    prompts_per_question: int = 1,
    max_examples: int = 50_000,
    output_path: str = "orca_dataset.jsonl",
) -> list[dict]:
    """
    Generate an Orca-style dataset with diverse system prompts.

    Key design choices from the Orca paper applied here:
    1. Same questions with different system prompts → diverse reasoning styles
    2. Source questions from challenging benchmarks, not just conversational data
    3. Use the strongest available model as teacher (Opus or GPT-4)
    4. Include the system prompt in training data so student learns when/how
       to apply different reasoning approaches based on context

    Args:
        questions: Source questions (from FLAN, MMLU, domain benchmarks, etc.)
        model: Teacher model for generation
        prompts_per_question: How many system prompts to apply per question
        max_examples: Cap total examples to control cost
        output_path: Output JSONL file

    Returns:
        List of training examples in messages format
    """
    examples = []
    total_budget = min(len(questions) * prompts_per_question, max_examples)
    print("Generating Orca-style dataset:")
    print(f" Source questions: {len(questions)}")
    print(f" Prompts per question: {prompts_per_question}")
    print(f" Target examples: {total_budget}")
    print(f" Estimated cost: ${total_budget * 0.04:.0f} "
          f"(at ~$0.04/example)")

    with open(output_path, "w") as f:
        for i, question in enumerate(questions):
            if len(examples) >= total_budget:
                break
            # Select system prompts for this question at random, without
            # repeats. Over many questions this represents every prompt
            # type roughly equally, but only in expectation.
            num_prompts = min(prompts_per_question, len(ORCA_STYLE_PROMPTS))
            selected = random.sample(ORCA_STYLE_PROMPTS, num_prompts)
            for prompt_name, system_prompt in selected:
                if len(examples) >= total_budget:
                    break
                try:
                    response = client.messages.create(
                        model=model,
                        max_tokens=2000,
                        system=system_prompt,
                        messages=[{"role": "user", "content": question}]
                    )
                    full_response = response.content[0].text.strip()
                    reasoning, answer = extract_reasoning_and_answer(full_response)
                    example = {
                        "messages": [
                            {"role": "system", "content": system_prompt},
                            {"role": "user", "content": question},
                            {"role": "assistant", "content": full_response},
                        ],
                        "reasoning_trace": reasoning,
                        "final_answer": answer,
                        "source_model": model,
                        "system_prompt_name": prompt_name,
                    }
                    examples.append(example)
                    f.write(json.dumps(example) + "\n")
                    if len(examples) % 500 == 0:
                        print(f" Generated: {len(examples)}/{total_budget}")
                except Exception as e:
                    print(f" Error on question {i}: {e}")
                    time.sleep(1)

    print(f"\nGenerated {len(examples)} examples")
    return examples
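Random sampling gives even prompt coverage only on average. If every prompt style must appear equally often, a deterministic rotation is one alternative. A minimal sketch, with a hypothetical helper name and a shortened list of prompt names used purely for illustration:

```python
from collections import Counter
from itertools import cycle, islice


def assign_prompts_round_robin(
    num_questions: int,
    prompt_names: list[str],
    prompts_per_question: int = 1,
) -> list[list[str]]:
    """Assign prompt names to questions by cycling through the list.

    Hypothetical helper: each question receives prompts_per_question
    consecutive names from the rotation, so overall usage counts differ
    by at most one and no question gets the same prompt twice
    (assuming prompt_names has no duplicates).
    """
    if not 1 <= prompts_per_question <= len(prompt_names):
        raise ValueError("prompts_per_question must be in 1..len(prompt_names)")
    rotation = cycle(prompt_names)
    return [
        list(islice(rotation, prompts_per_question))
        for _ in range(num_questions)
    ]


names = ["step_by_step", "accessible", "verified"]
assignment = assign_prompts_round_robin(7, names, prompts_per_question=2)
counts = Counter(name for group in assignment for name in group)
```

The trade-off is that a fixed rotation correlates prompt style with question order, so shuffling the questions first is advisable if they are grouped by topic.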
The Phi Approach: Textbook Quality Generation
Microsoft's Phi models (Gunasekar et al., 2023) took a fundamentally different path. Instead of distilling from specific questions, generate entire synthetic textbooks. The hypothesis: data quality matters more than quantity for small models, and textbook-structured content is the highest-quality format for learning conceptual material.
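Phi-1 (1.3B parameters) was trained on roughly 7B tokens of textbook-quality data: about 6B tokens of web code filtered for educational value, plus under 1B tokens of GPT-3.5-generated synthetic textbooks and a small set of synthetic exercises. It reached 50.6% pass@1 on HumanEval, competitive with or better than much larger models such as CodeLlama-7B (about 5x larger) and StarCoder-15B (about 11x larger). The result is strong empirical support for the quality-over-quantity hypothesis.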
import anthropic
import json
import os
import re
import time

client = anthropic.Anthropic()

TEXTBOOK_GENERATION_PROMPT = """Generate a comprehensive, textbook-quality explanation
of the following programming concept. This will appear in a definitive programming
reference - accuracy and pedagogical clarity are paramount.
Topic: {topic}
Structure your explanation precisely as follows:
## Definition
Provide a formal definition (precise, unambiguous) and an informal definition
(intuitive, accessible to a motivated beginner).
## Core Mechanism
Explain exactly how this works internally - not just what it does, but the
underlying mechanism that makes it work. Include the key invariants.
## Worked Example
Walk through a complete, concrete example. Show every intermediate value.
Do not skip steps. Use a realistic example, not a toy.
## Common Misconceptions
List 3 specific, common misconceptions. For each: state the misconception
precisely, explain why it seems reasonable, explain exactly why it is wrong.
## Practice Problems
Provide 2 exercises at different difficulty levels. For each: problem statement,
complete solution with every step explained, and explanation of why this
solution is correct.
## When To Use vs. Alternatives
State the precise conditions under which this concept is the right choice.
Name the 2 most common alternatives and give specific criteria for choosing
between them.
All code examples must be syntactically correct Python 3.10+."""
EXERCISE_GENERATION_PROMPT = """Generate {n} programming exercises about: {topic}
For each exercise, produce a complete JSON object with these fields:
- "title": short descriptive name
- "problem": unambiguous problem statement with input/output format specified
- "test_cases": array of 3 test cases, each with "input", "expected_output", "explanation"
(include at least 1 edge case)
- "hints": array of 3 progressive hints (first: approach direction; second: key step;
third: near-complete guidance)
- "solution": complete, working Python solution
- "solution_explanation": line-by-line walkthrough of the solution
- "complexity": object with "time" and "space" keys and justification for each
- "common_wrong_solutions": array of 2 objects, each with "approach" and "why_wrong"
Respond with ONLY a JSON array of {n} exercise objects. No other text."""
def generate_textbook_section(
    topic: str,
    model: str = "claude-opus-4-6"
) -> str:
    """
    Generate one textbook section for a programming concept.

    The textbook format captures more transferable knowledge than Q&A:
    - Explicit misconception correction (models the "teacher" catching errors)
    - Pedagogically ordered content (builds from known to unknown)
    - Internal consistency across examples (no contradictions to unlearn)
    - Built-in practice with solutions (teaches how to check understanding)

    Args:
        topic: Programming concept to explain
        model: Generation model (use strongest for content quality)

    Returns:
        Complete textbook section as markdown-formatted string
    """
    response = client.messages.create(
        model=model,
        max_tokens=4000,
        messages=[{
            "role": "user",
            "content": TEXTBOOK_GENERATION_PROMPT.format(topic=topic)
        }]
    )
    return response.content[0].text.strip()

def generate_exercises_for_topic(
    topic: str,
    n: int = 5,
    model: str = "claude-opus-4-6"
) -> list[dict]:
    """
    Generate structured practice exercises for a topic.

    Uses JSON output to ensure consistent structure for training.
    Falls back gracefully if JSON parsing fails.

    Args:
        topic: Programming concept to generate exercises for
        n: Number of exercises to generate
        model: Generation model

    Returns:
        List of exercise dicts (may be empty on parse failure)
    """
    prompt = EXERCISE_GENERATION_PROMPT.format(n=n, topic=topic)
    response = client.messages.create(
        model=model,
        max_tokens=5000,
        messages=[{"role": "user", "content": prompt}]
    )
    exercises_text = response.content[0].text
    # Extract JSON array from response
    json_match = re.search(r'\[[\s\S]*\]', exercises_text)
    if json_match:
        try:
            return json.loads(json_match.group())
        except json.JSONDecodeError as e:
            print(f" JSON parse error for {topic}: {e}")
            return []
    return []

def generate_textbook_curriculum(
    topics: list[str],
    exercises_per_topic: int = 5,
    model: str = "claude-opus-4-6",
    output_dir: str = "textbook_dataset/",
) -> dict:
    """
    Generate a complete coding curriculum in textbook format.

    This follows the Phi-1 approach, which combined ~1B tokens of
    GPT-3.5-generated textbook-quality Python content with filtered web
    code to train a strong 1.3B model.

    The key insight: a small model trained on a modest volume of textbook
    content can learn better than one trained on far more noisy internet
    code, because the signal-to-noise ratio is dramatically higher.

    Args:
        topics: Programming topics to cover
        exercises_per_topic: Practice exercises per topic
        model: Generation model
        output_dir: Directory for JSONL output files

    Returns:
        Statistics dict with section, exercise, word, and failure counts
    """
    os.makedirs(output_dir, exist_ok=True)
    stats = {
        "total_sections": 0,
        "total_exercises": 0,
        "total_words": 0,
        "failed_sections": 0,
        "failed_exercise_batches": 0,
    }
    sections_path = os.path.join(output_dir, "textbook_sections.jsonl")
    exercises_path = os.path.join(output_dir, "exercises.jsonl")

    with open(sections_path, "w") as sf, open(exercises_path, "w") as ef:
        for i, topic in enumerate(topics):
            print(f"\n[{i+1}/{len(topics)}] Generating: {topic}")
            # Generate textbook section
            try:
                section = generate_textbook_section(topic, model)
                record = {
                    "type": "textbook_section",
                    "topic": topic,
                    "content": section,
                    "source_model": model,
                    "word_count": len(section.split()),
                }
                sf.write(json.dumps(record) + "\n")
                stats["total_sections"] += 1
                stats["total_words"] += len(section.split())
                print(f" Section: {len(section.split())} words")
            except Exception as e:
                print(f" Section error: {e}")
                stats["failed_sections"] += 1
            # Generate exercises for this topic
            try:
                exercises = generate_exercises_for_topic(
                    topic, exercises_per_topic, model
                )
                for exercise in exercises:
                    exercise["topic"] = topic
                    exercise["source_model"] = model
                    ef.write(json.dumps(exercise) + "\n")
                    stats["total_exercises"] += 1
                print(f" Exercises: {len(exercises)}")
            except Exception as e:
                print(f" Exercise error: {e}")
                stats["failed_exercise_batches"] += 1
            time.sleep(0.5)  # Rate limit courtesy

    print("\nCurriculum generation complete:")
    print(f" Sections: {stats['total_sections']}")
    print(f" Exercises: {stats['total_exercises']}")
    print(f" Total words: {stats['total_words']:,}")
    print(f" Approx tokens: {stats['total_words'] * 1.3:,.0f}")
    return stats

# Python curriculum modeled on Phi-1's topic coverage
# These topics are where conceptual depth matters most
PYTHON_CURRICULUM_TOPICS = [
"variable scoping and closures in Python",
"decorators and function wrapping",
"generators, iterators, and lazy evaluation",
"context managers and the with statement",
"async/await coroutines and the event loop",
"metaclasses and class creation in Python",
"descriptors: __get__, __set__, __delete__",
"Python memory model and garbage collection",
"the Python data model and dunder methods",
"concurrent.futures, ThreadPoolExecutor, ProcessPoolExecutor",
"type hints, generics, and mypy",
"dataclasses: fields, post_init, frozen, and slots",
]
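Once a curriculum run finishes, it helps to read the output back and check coverage per topic before training on it. The following is a minimal sketch, assuming the `textbook_sections.jsonl` record schema written above (`topic`, `word_count`); the function name `summarize_sections` and the `min_words` threshold are hypothetical and not part of the original pipeline.

```python
import json
from collections import defaultdict


def summarize_sections(sections_path: str, min_words: int = 300) -> dict:
    """Summarize a textbook_sections.jsonl file.

    min_words is an illustrative threshold for flagging thin sections;
    tune it for your own curriculum.
    """
    words_by_topic: dict[str, int] = defaultdict(int)
    short_topics: list[str] = []
    with open(sections_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            words_by_topic[record["topic"]] += record["word_count"]
            if record["word_count"] < min_words:
                short_topics.append(record["topic"])
    total_words = sum(words_by_topic.values())
    return {
        "n_topics": len(words_by_topic),
        "total_words": total_words,
        # Same rough words-to-tokens factor used in the generator's printout
        "approx_tokens": round(total_words * 1.3),
        "short_topics": short_topics,
    }
```

Topics listed in `short_topics` are candidates for regeneration, since a very short section probably did not cover the concept in textbook depth.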
Multi-Teacher Distillation
A more sophisticated approach uses multiple frontier models as teachers. This captures diverse reasoning styles and reduces single-model bias - if your only teacher is Opus, the student learns Opus's particular way of reasoning. Multi-teacher distillation exposes the student to varied approaches.
import anthropic
from dataclasses import dataclass
from typing import Optional
client = anthropic.Anthropic()
@dataclass
class TeacherResponse:
"""A single teacher model's response to a question."""
model: str
response: str
reasoning: str
answer: str
# Heuristic confidence: stronger models get higher base confidence
estimated_confidence: float
def multi_teacher_generate(
question: str,
system_prompt: str,
models: Optional[list[str]] = None,
) -> list[TeacherResponse]:
"""
Get reasoning traces from multiple teacher models.
Using multiple Claude models captures different reasoning styles:
- Opus: deepest reasoning, best for complex questions
- Sonnet: balanced reasoning, fast and thorough
For production: can also include other providers if ToS permits.
Each model produces qualitatively different reasoning traces even
on the same question with the same system prompt.
Args:
question: The question to answer
system_prompt: Reasoning elicitation prompt
models: List of teacher model IDs
Returns:
List of TeacherResponse objects (one per model)
"""
if models is None:
models = ["claude-opus-4-6", "claude-sonnet-4-6"]
responses = []
for model in models:
try:
response = client.messages.create(
model=model,
max_tokens=1800,
system=system_prompt,
messages=[{"role": "user", "content": question}]
)
full_text = response.content[0].text.strip()
reasoning, answer = extract_reasoning_and_answer(full_text)
# Assign confidence heuristically based on model capability
# In production: use benchmark performance as proxy
confidence_map = {
"claude-opus-4-6": 0.95,
"claude-sonnet-4-6": 0.85,
}
confidence = confidence_map.get(model, 0.80)
responses.append(TeacherResponse(
model=model,
response=full_text,
reasoning=reasoning,
answer=answer,
estimated_confidence=confidence,
))
except Exception as e:
print(f" Error from teacher {model}: {e}")
return responses
def create_multi_teacher_example(
question: str,
system_prompt: str,
agreement_bonus: float = 0.05,
) -> Optional[dict]:
"""
Create a training example from multiple teacher responses.
Strategy:
- If teachers agree: high-confidence example (agreement bonus applied)
- If teachers disagree: confidence is reduced and the example is flagged as uncertain
(these are valuable for calibration training)
In both cases the training target is the highest-confidence teacher's response.
Teacher agreement is itself a quality signal. Questions where multiple
strong models agree are more likely to have reliable ground truth than
questions where they disagree. Use this as a quality filter.
Args:
question: Training question
system_prompt: System prompt used for generation
agreement_bonus: Added confidence when teachers agree
Returns:
Training example dict, or None if generation failed
"""
teacher_responses = multi_teacher_generate(question, system_prompt)
if not teacher_responses:
return None
# Check for agreement between teachers
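# Heuristic: do the first two teachers' final answers share significant content?
# (Only the first two responses are compared, even if more teachers are used.)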
# In production: use embedding cosine similarity (>0.85 = agreement)
if len(teacher_responses) >= 2:
answers = [r.answer.lower() for r in teacher_responses]
# Simple overlap heuristic - replace with embedding similarity in prod
words_0 = set(answers[0].split())
words_1 = set(answers[1].split())
if len(words_0) > 0 and len(words_1) > 0:
overlap = len(words_0 & words_1) / max(len(words_0), len(words_1))
agree = overlap > 0.40 # Rough threshold for semantic agreement
else:
agree = False
else:
agree = True # Single teacher, treat as agreed
# Select the highest-confidence response to use as training target
best = max(teacher_responses, key=lambda r: r.estimated_confidence)
# Apply the agreement bonus or a fixed disagreement penalty, clamped to [0, 1]
adjustment = agreement_bonus if agree else -0.05
confidence = min(1.0, max(0.0, best.estimated_confidence + adjustment))
return {
"instruction": question,
"system": system_prompt,
"output": best.response,
"reasoning_trace": best.reasoning,
"primary_teacher": best.model,
"n_teachers": len(teacher_responses),
"teachers_agree": agree,
"confidence": confidence,
"all_teachers": [
{"model": r.model, "answer_excerpt": r.answer[:150]}
for r in teacher_responses
],
}
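The agreement check above compares only the first two teachers. With three or more teachers, one option is to score every pair and report the fraction that agree. The sketch below is an illustration, not the pipeline's method: it swaps in Jaccard similarity on word sets (the code above divides by the larger set instead), and the names `jaccard` and `agreement_rate` are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity between two answers (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def agreement_rate(answers: list[str], threshold: float = 0.40) -> float:
    """Fraction of teacher pairs whose answers overlap above the threshold."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # single teacher: treated as agreed, as in the code above
    agreeing = sum(1 for a, b in pairs if jaccard(a, b) > threshold)
    return agreeing / len(pairs)
```

A rate near 1.0 suggests the question has a stable answer across teachers. Lexical overlap is a weak proxy for semantic agreement, so an embedding-based similarity would likely be a better choice for anything beyond a quick filter.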
Dataset Format Conversion
Training frameworks (HuggingFace TRL, LLaMA Factory, Axolotl) expect specific formats. Converting between them correctly is operationally critical - format errors cause silent training failures where the model trains on garbled data.
from typing import Optional
def to_chatml(example: DistillationExample) -> dict:
"""
Chat "messages" format (often loosely called ChatML) - compatible with
the OpenAI fine-tuning API, HuggingFace TRL SFTTrainer, and most modern
fine-tuning stacks.
Key decision: include the reasoning trace in the assistant turn.
This teaches the model to produce reasoning, not just answers.
If you only include the final answer, you get output distillation
(Level 1) even though you generated Level 2 data - wasted effort.
"""
messages = []
if example.system_prompt:
messages.append({
"role": "system",
"content": example.system_prompt
})
messages.append({
"role": "user",
"content": example.question
})
# Include full reasoning trace + answer in assistant turn
# This is the critical choice that makes this Level 2 distillation
if example.reasoning_trace:
assistant_content = (
f"{example.reasoning_trace}\n\n{example.final_answer}"
)
else:
assistant_content = example.full_response
messages.append({
"role": "assistant",
"content": assistant_content
})
return {"messages": messages}
def to_alpaca(example: DistillationExample) -> dict:
"""
Alpaca format - compatible with older fine-tuning frameworks.
Note: this format drops the system prompt, which means the model
won't learn to apply different reasoning styles contextually.
Use ChatML when possible.
"""
return {
"instruction": example.question,
"input": "",
"output": example.full_response,
}
def to_dpo_pair(
question: str,
chosen_response: str,
rejected_response: str,
system: Optional[str] = None,
) -> dict:
"""
DPO (Direct Preference Optimization) format.
For distillation DPO: chosen = stronger model's response,
rejected = weaker model's response on the same question.
This teaches the model to distinguish high-quality reasoning from
low-quality reasoning, not just to imitate high-quality responses.
DPO distillation is more data-efficient than SFT distillation for
teaching relative quality judgment. Use when you have 2+ teacher
models and want to explicitly teach quality discrimination.
"""
messages_base = []
if system:
messages_base.append({"role": "system", "content": system})
messages_base.append({"role": "user", "content": question})
chosen = messages_base + [
{"role": "assistant", "content": chosen_response}
]
rejected = messages_base + [
{"role": "assistant", "content": rejected_response}
]
return {"chosen": chosen, "rejected": rejected}
def export_dataset(
examples: list[DistillationExample],
output_path: str,
fmt: str = "chatml"
) -> None:
"""
Export a distillation dataset in the specified training format.
Args:
examples: Distillation examples to export
output_path: Output JSONL file path
fmt: Format string - "chatml" or "alpaca"
"""
format_fns = {
"chatml": to_chatml,
"alpaca": to_alpaca,
}
if fmt not in format_fns:
raise ValueError(
f"Unknown format '{fmt}'. "
f"Valid options: {list(format_fns.keys())}"
)
format_fn = format_fns[fmt]
with open(output_path, "w") as f:
for example in examples:
record = format_fn(example)
f.write(json.dumps(record) + "\n")
print(f"Exported {len(examples)} examples in {fmt} format → {output_path}")
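Because format errors tend to fail silently, a cheap structural check on the exported file can catch problems before a training run starts. This is a minimal sketch, assuming the `{"messages": [...]}` records produced by `to_chatml`; the function names and the specific rules (known roles, non-empty content, system message first, assistant turn last) are illustrative assumptions, and your training framework may enforce different ones.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_record(record: dict) -> list[str]:
    """Return human-readable problems found in one {"messages": [...]} record."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' is missing or empty"]
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role") if isinstance(msg, dict) else None
        content = msg.get("content") if isinstance(msg, dict) else None
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message is not first")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message is not an assistant turn")
    return problems


def validate_jsonl(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems; an empty dict means the file passed."""
    failures: dict[int, list[str]] = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                failures[line_no] = [f"invalid JSON: {e.msg}"]
                continue
            problems = validate_chat_record(record)
            if problems:
                failures[line_no] = problems
    return failures
```

Running `validate_jsonl` on the export path right after `export_dataset` gives a per-line report of anything that would otherwise train as garbled data.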
Comparison: Distillation Approaches
| Approach | Examples Needed | Cost per Example | Reasoning Transfer | Best For |
|---|---|---|---|---|
| Output distillation (Alpaca) | 50K-200K | $0.001-0.005 | None | Style + format |
| Reasoning trace (Orca) | 50K-5M | $0.03-0.10 | High | Complex reasoning tasks |
| Textbook (Phi) | 1K-10K topics | $0.50-2.00/topic | Very High | Small models, specific domains |
| Multi-teacher DPO | 20K-100K pairs | $0.05-0.20/pair | Medium-High | Quality discrimination |
| Self-play (SPIN) | 10K-50K | $0.01-0.04 | Medium | Iterative capability improvement |
The right choice depends on: what capability you need to transfer, how small the target model is (smaller models benefit more from textbook-quality data), and how much you can spend on data generation.
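To make the spending question concrete, the per-example ranges from the table can be turned into a rough budget range. The sketch below simply multiplies example counts by those ranges; the dictionary keys and the function name `estimate_budget` are hypothetical, and the dollar figures are the table's estimates rather than measured prices.

```python
# Cost ranges in USD, copied from the comparison table above.
# The textbook row is priced per topic, not per example.
COST_RANGE_USD = {
    "output": (0.001, 0.005),
    "reasoning_trace": (0.03, 0.10),
    "textbook_per_topic": (0.50, 2.00),
    "multi_teacher_dpo_per_pair": (0.05, 0.20),
    "self_play": (0.01, 0.04),
}


def estimate_budget(approach: str, n_units: int) -> tuple[float, float]:
    """Return (low, high) generation cost in USD for n_units examples/topics/pairs."""
    if approach not in COST_RANGE_USD:
        raise ValueError(f"Unknown approach '{approach}'")
    low, high = COST_RANGE_USD[approach]
    return (n_units * low, n_units * high)


if __name__ == "__main__":
    low, high = estimate_budget("reasoning_trace", 100_000)
    print(f"100K reasoning traces: ${low:,.0f} to ${high:,.0f}")
```

Under these figures, 100K reasoning-trace examples land in the low thousands to roughly ten thousand dollars, which is why filtering questions before generation usually pays for itself.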
Legal and Ethical Considerations
:::danger OpenAI Terms of Service - Competitive Use Prohibition OpenAI's Terms of Service explicitly prohibit using outputs from OpenAI models "to develop models that compete with OpenAI." This is a real legal risk for commercial products. The Orca paper generated controversy because it distilled GPT-4 outputs into a model that could be viewed as competing. Before building any distillation pipeline using third-party models: (1) Read the current ToS of every teacher model you plan to use. (2) Save a timestamped copy of the ToS with the date reviewed. (3) Get explicit legal sign-off for commercial deployments. Other providers, including Anthropic, publish their own terms, and these may contain similar restrictions on training competing models - check current policies at anthropic.com/usage-policy. This is not legal advice. :::
:::warning Model Cards Must Disclose Synthetic Training Data If your model is trained on distillation data from frontier models, document this in your model card. Users and downstream deployers have a right to know where the model's capabilities came from and what its limitations are. The EU AI Act (phasing in through 2026) increasingly requires training data disclosure for models above capability thresholds. Failure to disclose synthetic training data sources is both ethically questionable and increasingly a regulatory risk as AI transparency requirements become enforceable across major markets. :::
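:::tip Safe Models for Distillation Without ToS Risk Some open-weight models permit using their outputs to train other models, but the terms differ by model and version. Meta's Llama 3.1 and later licenses allow it with conditions (such as naming requirements for derived models), whereas the earlier Llama 2 and Llama 3 licenses restricted using outputs to improve other LLMs. Mistral and Qwen releases under Apache 2.0 place no restrictions on outputs, though some models from both vendors ship under other licenses with additional conditions. Models released under MIT are similarly permissive. Check the license of the exact checkpoint you plan to use. Choosing such teachers greatly reduces ToS risk. The quality gap vs. Opus or GPT-4 is real - but legal certainty has genuine commercial value, and the quality gap is often smaller than expected for specific domains. :::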
:::info The Memorization Problem in Distillation Frontier models sometimes reproduce training data verbatim in their outputs, particularly for common code snippets, well-known quotes, or frequently-requested content. If this verbatim text is copyrighted, your distillation dataset contains that copyrighted material. Mitigations: (1) Run n-gram deduplication between your generated outputs and known copyrighted sources. (2) Use exact-match detection against common code snippets, book excerpts, and lyrics. (3) Require reformulation rather than recitation in your system prompts. (4) Add a content compliance filter that flags outputs with very long exact matches against a reference corpus. This is an active area of legal uncertainty - document your mitigation steps. :::
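The exact-match mitigation can be approximated with a longest-shared-run check between each generated output and a reference text. The following is a minimal sketch using a word-level longest-common-substring computation; the function names and the default `max_run` of 30 words are assumptions, and the quadratic comparison is only practical for small reference sets (a production filter would more likely index n-grams).

```python
def longest_shared_run(output: str, reference: str) -> int:
    """Length, in words, of the longest exact word sequence shared by both texts."""
    a, b = output.lower().split(), reference.lower().split()
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous word in both texts
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def flag_verbatim(output: str, references: list[str], max_run: int = 30) -> bool:
    """True if the output shares a run of at least max_run words with any reference."""
    return any(longest_shared_run(output, ref) >= max_run for ref in references)
```

Flagged outputs can be dropped or regenerated with a prompt that asks for reformulation, and logging the flag rate gives you a record of the mitigation.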
Interview Q&A
Q: What is dataset-level knowledge distillation in LLMs and how does it differ from traditional model distillation?
Traditional knowledge distillation (Hinton et al., 2015) is white-box: the student mimics the teacher's softened output probability distribution (soft targets computed from temperature-scaled logits) at every token position. You need access to the teacher's full output distribution, which in practice means running the teacher yourself or using an API that exposes logits. The student is typically trained with a KL-divergence loss between student and teacher distributions, often combined with the standard cross-entropy loss on ground-truth labels.
Dataset-level distillation (Alpaca, Orca, Phi approach) is black-box: you generate training examples using the frontier model's API and fine-tune a student on those examples as an SFT (supervised fine-tuning) problem. No internal access required. You are transferring demonstrated behaviors rather than learned representations.
The tradeoffs: black-box dataset distillation captures surface behaviors (what the model says) but not the internal representation geometry (how the model organizes knowledge). For most production applications, behavioral transfer is sufficient, and API-only access is the only viable option for closed-source frontier models. White-box distillation can achieve tighter capability transfer when you have model weights, but most frontier models are closed-source.
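For contrast with the API-only approach, the white-box loss can be written out for a single token position. This is a minimal pure-Python sketch of the temperature-softened KL term from Hinton et al. (2015), including the T² scaling that paper recommends; the function names and the toy logits are illustrative, and a real implementation would use a tensor library over full sequences.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(
    teacher_logits: list[float],
    student_logits: list[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]   # toy logits over a 3-token vocabulary
    student = [2.0, 2.0, 0.5]
    print(f"KD loss: {kd_loss(teacher, student):.4f}")
```

The loss is zero only when the student's softened distribution matches the teacher's, which is information a black-box dataset never exposes: the SFT target is a single sampled sequence, not the full distribution.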
Q: What makes Orca fundamentally different from Alpaca if both use GPT outputs as training data?
Three critical differences, not just one. First, what is captured: Alpaca captures final answers (question → answer); Orca captures reasoning traces (question → step-by-step thinking → answer). This is the most important difference. The Orca paper's core claim is that training on reasoning traces teaches the model how to reason, while training on answers only teaches the model what to say. The main empirical evidence is Orca-13B reaching rough parity with ChatGPT on complex reasoning benchmarks such as BIG-Bench Hard while being far smaller (about 13x by parameter count, if ChatGPT is assumed to be at GPT-3's 175B scale, which OpenAI has not confirmed).
Second, scale: Orca used 5 million examples; Alpaca used 52K. Third, system prompt diversity: Orca used 16 different system prompts that elicited varied reasoning styles from the same questions. This produces a dataset where the student sees reasoning expressed in many ways, not just one format. The combination of these three differences explains why Orca-13B so dramatically outperforms same-size models trained on Alpaca-style data on reasoning-heavy benchmarks.
Q: What is the Phi approach to distillation and why did it achieve such strong results with small models?
Phi (Gunasekar et al., 2023) didn't distill from specific questions at all. Instead, it generated entire synthetic textbooks - pedagogically structured, internally consistent explanations of programming concepts, with worked examples, common misconceptions explicitly addressed, and practice problems with solutions. Phi-1 (1.3B parameters) was trained on roughly 7B tokens: about 6B tokens of quality-filtered web code, about 1B tokens of GPT-3.5-generated synthetic textbooks, and a smaller set of synthetic exercises used for fine-tuning. Its HumanEval score was competitive with, and in the reported numbers higher than, StarCoder-15B (11x larger) and CodeLlama-7B (5x larger).
The mechanism: for small models with limited capacity, data quality matters more than quantity. The model can only learn from what it is trained on - if the training data is noisy internet code (with syntax errors, contradictions, and irrelevant context), the model wastes capacity learning noise. Textbook-quality data has far less noise: explanations are mostly accurate, examples are mostly correct, and misconception corrections are explicit, although synthetic text can still contain errors and should be spot-checked. The small model's limited capacity is spent on signal rather than noise. Textbook structure also provides pedagogical ordering, which internet data lacks - the model sees concepts introduced before they're assumed, which makes generalization to new problems more reliable.
Q: How would you decide whether to use output distillation, reasoning trace distillation, or textbook distillation for a fine-tuning project?
The decision depends on three factors: target capability, target model size, and acceptable cost.
Use output distillation when: you need style transfer or format normalization, the task doesn't require multi-step reasoning (customer service templates, structured data extraction, classification), and cost is the binding constraint. This is the cheapest and fastest approach.
Use reasoning trace distillation when: your users ask questions requiring multi-step reasoning, logical deduction, or complex analysis, and you want the model to produce visible, followable reasoning rather than just correct-looking answers. The Orca approach works well for models 7B-70B where you need genuine reasoning capability. Budget roughly $0.03-0.10 per example for Opus-level reasoning traces.
Use textbook distillation when: your target model is small (1B-7B parameters) and you're working in a specific domain with teachable conceptual structure (programming, mathematics, science). The 1K-10K topic approach produces surprisingly strong results because quality dramatically dominates quantity for small model capacity. Budget more per example but fewer total examples - the ROI is higher per token of training data.
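In practice, combine all three: base SFT data from output distillation for format and style, reasoning traces for complex query types, and textbook sections for the conceptual foundations the model needs. Mix them in the training set with per-source sampling ratios or loss weights tuned against a held-out evaluation set, so that the cheap, high-volume output data does not drown out the smaller reasoning and textbook sources.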
Q: What legal risks come with distillation from frontier models, and how do you manage them in a production setting?
Three distinct legal risks require separate treatment.
First, ToS compliance: OpenAI explicitly prohibits training competing models on its outputs. Anthropic and others have their own policies, which may include similar restrictions. Mitigation: (1) Review and document the ToS of every teacher model used, with date. (2) Use models whose licenses permit training on their outputs (for example Llama 3.1 or later, and Mistral or Qwen releases under Apache 2.0) when commercial legal clarity is required, and verify the license of the specific checkpoint. (3) For closed-source teacher usage, get explicit legal sign-off before training any product model.
Second, copyright contamination: frontier models sometimes reproduce copyrighted material verbatim. Your dataset inherits this. Mitigation: run n-gram overlap detection against a reference corpus of known copyrighted works, flag exact matches above a word-count threshold (typically 30+ words), and require paraphrasing in your generation system prompts rather than recitation.
Third, emerging regulatory requirements: the EU AI Act requires training data disclosure for high-capability models. In the US, FTC guidance on AI transparency is evolving. Mitigation: maintain a training data card that documents teacher models, generation methods, quality filters, dates of generation, and any bias audits performed. This documentation is both ethically appropriate and increasingly legally required.
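A training data card does not need special tooling; a small structured record checked into the repository alongside the dataset is enough to start. The sketch below is one possible shape, covering the items listed above. The class name `TrainingDataCard`, its field names, and the placeholder values in the usage example are all hypothetical, and no regulation prescribes this particular schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TrainingDataCard:
    """Provenance record for one distillation dataset (illustrative schema)."""
    dataset_name: str
    teacher_models: list[str]
    generation_method: str
    generation_dates: str
    quality_filters: list[str] = field(default_factory=list)
    bias_audits: list[str] = field(default_factory=list)
    # Maps each teacher model to the date its ToS was last reviewed
    tos_reviewed_on: dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


if __name__ == "__main__":
    card = TrainingDataCard(
        dataset_name="example-reasoning-traces",      # placeholder
        teacher_models=["example-teacher-model"],     # placeholder
        generation_method="reasoning trace distillation",
        generation_dates="YYYY-MM-DD to YYYY-MM-DD",  # fill in real dates
        quality_filters=["n-gram overlap filter", "teacher agreement filter"],
    )
    print(card.to_json())
```

Writing the card at generation time, rather than reconstructing it later, keeps the dates and filter descriptions accurate.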
Q: How do you evaluate whether a distillation dataset is actually transferring reasoning capability versus just surface style?
Surface style transfer is easy to detect: the student model's outputs look like the teacher's (similar formatting, similar hedging language, similar response length). Reasoning transfer is harder and requires specific evaluation.
Three evaluation approaches: (1) Process evaluation: generate responses to complex multi-step problems and evaluate whether the intermediate reasoning steps are logically valid, not just whether the final answer is correct. A model with transferred reasoning capability shows correct process on problems it gets wrong (it's reasoning but making a factual error) rather than wrong process on problems it gets right (it's guessing correctly). (2) OOD generalization: test the model on out-of-distribution problems in the same domain. A model with genuine reasoning capability generalizes; a model with surface style transfer pattern-matches and fails on novel problem structures. (3) Step-deletion experiment: delete or truncate intermediate reasoning steps (for example, by forcing the model to answer from a shortened version of its own reasoning trace) and see if accuracy degrades. If it does, the model was relying on the reasoning (genuine transfer). If it doesn't, the reasoning was decorative and the model was pattern-matching to answers (surface transfer only). Combine these with standard benchmark scores: GSM8K and MATH stress multi-step mathematical reasoning, and HellaSwag was built with adversarial filtering to resist surface pattern matching on commonsense inference.
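The process-evaluation idea can be summarized as a four-way breakdown of graded responses. The sketch below assumes each response has already been graded, by a human or a judge model, into two booleans; the `process_valid` and `answer_correct` keys, the category labels, and the function names are hypothetical.

```python
from collections import Counter

LABELS = ["sound_and_correct", "sound_but_wrong", "lucky_guess", "unsound_and_wrong"]


def classify(process_valid: bool, answer_correct: bool) -> str:
    """Place one graded response into a process/outcome quadrant."""
    if process_valid and answer_correct:
        return "sound_and_correct"
    if process_valid:
        return "sound_but_wrong"      # reasoning held up, a fact or step slipped
    if answer_correct:
        return "lucky_guess"          # right answer without valid reasoning
    return "unsound_and_wrong"


def process_outcome_breakdown(results: list[dict]) -> dict[str, float]:
    """Fraction of responses in each quadrant; all zeros for an empty list."""
    counts = Counter(
        classify(r["process_valid"], r["answer_correct"]) for r in results
    )
    total = len(results)
    return {label: (counts[label] / total if total else 0.0) for label in LABELS}
```

A high `lucky_guess` share relative to `sound_but_wrong` is the signature of surface transfer: the student is landing on answers without the reasoning that should produce them.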
