Prompt Optimization and DSPy
The Prompt That Broke the Team
For six months, three engineers spent approximately 40% of their time on prompt engineering. Every time the product changed, the prompts had to change. Every time a new model was released, the prompts had to be re-tuned. When the team moved from GPT-4 to Claude, they had to rewrite everything from scratch - a two-week project.
The prompts had grown organically, layer by layer. Each new edge case added a new rule. Each complaint from users added a new instruction. The system prompt was now 3,000 tokens long, contradicted itself in two places, and nobody was entirely sure why some specific phrasing was there. Removing any piece might break something.
And yet: the system still got 72% accuracy on the test set. That was the number they'd optimized to. Hand-crafted, agonized over, 40% of three engineers' time - for 72%.
A researcher on the team read the DSPy paper. Over a weekend, they built a DSPy program equivalent to the hand-crafted pipeline. After compiling with 50 labeled examples, accuracy was 79%. With 200 examples: 84%.
The prompts DSPy generated were different from the hand-crafted ones in ways nobody expected. Some rules the team thought were critical weren't there at all. Other things - phrasing they'd never have thought of - made a significant difference. Manual prompt engineering was optimizing in a small neighborhood of the solution space. DSPy searched more broadly.
This is what automated prompt optimization enables: systematic search over the prompt space, guided by a metric, at a scale that humans can't match manually.
The Problem with Manual Prompt Engineering
Manual prompt engineering has fundamental limitations:
It's brittle: Prompts optimized for GPT-4 may not transfer to Claude. Prompts that work in March may not work in June when the model is updated.
It doesn't generalize: You're optimizing for the examples you've tested. Edge cases break.
It's non-compositional: When you have a 5-step pipeline, optimizing each prompt independently doesn't optimize the pipeline as a whole. Upstream changes affect downstream behavior.
It's human-bottlenecked: Your intuitions about what makes a good prompt may be wrong. The model knows what it responds to; you can only guess.
It's not measurable: "This prompt feels better" is not an engineering metric.
The Landscape of Prompt Optimization
APE: Automatic Prompt Engineer (Zhou et al., 2022)
Automatic Prompt Engineer (Zhou et al., "Large Language Models Are Human-Level Prompt Engineers," 2022) was one of the first systematic approaches to automated prompt optimization.
The algorithm:
- Given a set of (input, output) demonstrations, use an LLM to generate N candidate instruction prompts
- Evaluate each candidate instruction on a held-out test set
- Select the best-performing instruction
- Optionally: iteratively refine by generating variations of the best instruction
import re

import anthropic

client = anthropic.Anthropic()


class APEOptimizer:
    """
    Simplified APE (Automatic Prompt Engineer) implementation.
    Generates candidate prompts and selects the best-performing one.
    """

    def __init__(self, demonstrations: list[dict], test_set: list[dict]):
        """
        Args:
            demonstrations: list of {"input": str, "output": str} pairs for generation
            test_set: list of {"input": str, "expected_output": str} for evaluation
        """
        self.demonstrations = demonstrations
        self.test_set = test_set

    def generate_candidate_prompts(self, n_candidates: int = 5) -> list[str]:
        """Use an LLM to generate candidate instruction prompts."""
        demo_text = "\n\n".join([
            f"Input: {d['input']}\nOutput: {d['output']}"
            for d in self.demonstrations[:10]  # Use up to 10 demonstrations
        ])
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"""I have a task where given an input, I need to produce an output.
Here are some examples:
{demo_text}
Generate {n_candidates} different instruction prompts that would cause a language model
to correctly transform inputs to outputs like in the examples above.
Each prompt should be on its own line, numbered 1-{n_candidates}.
Focus on the underlying pattern - what is this task really doing?
The prompt should be placed BEFORE the input when calling the model."""
            }]
        )
        # Parse candidate prompts: strip list numbering, drop empty or very short lines
        prompts = []
        for line in message.content[0].text.strip().split("\n"):
            cleaned = re.sub(r"^\d+[.)]\s*", "", line).strip()
            if len(cleaned) > 20:
                prompts.append(cleaned)
        return prompts[:n_candidates]

    def evaluate_prompt(self, prompt: str) -> float:
        """Evaluate a prompt on the test set. Returns accuracy."""
        if not self.test_set:
            return 0.0
        correct = 0
        for item in self.test_set:
            message = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=100,
                temperature=0,
                messages=[{
                    "role": "user",
                    "content": f"{prompt}\n\nInput: {item['input']}"
                }]
            )
            output = message.content[0].text.strip().lower()
            expected = item["expected_output"].strip().lower()
            # Lenient match: correct if the expected label appears anywhere in the output
            if expected in output:
                correct += 1
        return correct / len(self.test_set)

    def optimize(self, n_candidates: int = 5) -> tuple[str, float]:
        """Run APE optimization. Returns (best_prompt, best_score)."""
        print(f"Generating {n_candidates} candidate prompts...")
        candidates = self.generate_candidate_prompts(n_candidates)
        if not candidates:
            raise ValueError("No candidate prompts could be parsed from the model response")
        print(f"Evaluating {len(candidates)} candidates on test set...")
        scores = []
        for i, prompt in enumerate(candidates):
            score = self.evaluate_prompt(prompt)
            scores.append((score, prompt))
            print(f"  Prompt {i+1}: {score:.2%} - '{prompt[:60]}...'")
        best_score, best_prompt = max(scores, key=lambda x: x[0])
        return best_prompt, best_score


# Example usage: sentiment classification
demonstrations = [
    {"input": "This product is absolutely amazing!", "output": "positive"},
    {"input": "Terrible quality, broke in one day.", "output": "negative"},
    {"input": "It's okay, does what it says.", "output": "neutral"},
    {"input": "Best purchase I've made this year!", "output": "positive"},
    {"input": "Disappointed with the build quality.", "output": "negative"},
]
test_set = [
    {"input": "Love it!", "expected_output": "positive"},
    {"input": "Waste of money.", "expected_output": "negative"},
    {"input": "Nothing special.", "expected_output": "neutral"},
]

optimizer = APEOptimizer(demonstrations, test_set)
best_prompt, best_score = optimizer.optimize(n_candidates=5)
print(f"\nBest prompt: '{best_prompt}'")
print(f"Best accuracy: {best_score:.2%}")
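The optional refinement step from the algorithm above is not part of `APEOptimizer`. One way it might be added is sketched below as a simple hill-climbing loop. The `score` and `propose_variations` callables are hypothetical stand-ins (in practice they could wrap `APEOptimizer.evaluate_prompt` and an LLM paraphrasing call); toy versions are used here so the loop runs without API access.

```python
from typing import Callable


def refine_best_prompt(
    initial_prompt: str,
    score: Callable[[str], float],
    propose_variations: Callable[[str], list[str]],
    n_rounds: int = 3,
) -> tuple[str, float]:
    """Hill-climb: ask for variations of the incumbent and keep any that score higher."""
    best_prompt, best_score = initial_prompt, score(initial_prompt)
    for _ in range(n_rounds):
        improved = False
        for candidate in propose_variations(best_prompt):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
                improved = True
        if not improved:
            break  # no variation beat the incumbent, so stop early
    return best_prompt, best_score


# Toy stand-ins (assumptions for illustration): longer prompts score higher, capped at 1.0
def toy_score(prompt: str) -> float:
    return min(len(prompt) / 50, 1.0)


def toy_variations(prompt: str) -> list[str]:
    return [prompt + " Be concise.", prompt + " Answer with one word only."]


prompt, value = refine_best_prompt("Classify the sentiment.", toy_score, toy_variations)
print(prompt, value)
```

If the refinement loop scores candidates on the same examples used for the final report, the reported accuracy will be optimistic, so a separate held-out set is worth keeping for the last measurement.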
OPRO: Optimization by Prompting (Yang et al., 2023)
OPRO (Yang et al., "Large Language Models as Optimizers," 2023, Google DeepMind) treats the LLM itself as an optimizer. Instead of using gradient descent, OPRO uses a "meta-prompt" that shows the LLM its previous prompt attempts along with their scores, and asks it to generate a better one.
class OPROOptimizer:
    """
    OPRO: Optimization by PROmpting.
    Uses an LLM to iteratively improve prompts by showing it previous attempts.
    """

    def __init__(self, task_description: str, test_set: list[dict]):
        self.task_description = task_description
        self.test_set = test_set
        self.history: list[tuple[str, float]] = []  # (prompt, score) pairs

    def generate_next_prompt(self) -> str:
        """Generate a new prompt based on optimization history."""
        if not self.history:
            # Start with a basic prompt
            return f"Complete the following task: {self.task_description}"
        # Build the meta-prompt from the 10 highest-scoring attempts, worst to best
        history_text = "\n".join([
            f"Prompt: {p}\nScore: {s:.2%}\n"
            for p, s in sorted(self.history, key=lambda x: x[1])[-10:]
        ])
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"""I'm trying to find the best instruction prompt for this task:
Task: {self.task_description}
Previous attempts (sorted by score, worst to best):
{history_text}
Based on what worked and what didn't, generate a new instruction prompt that should
score higher than all previous attempts. The prompt should:
- Be clear and specific
- Build on insights from the highest-scoring attempts
- Avoid patterns from the lowest-scoring attempts
Output only the new prompt, nothing else."""
            }]
        )
        return message.content[0].text.strip()

    def evaluate(self, prompt: str) -> float:
        """Evaluate a prompt on the test set."""
        if not self.test_set:
            return 0.0
        correct = 0
        for item in self.test_set:
            message = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=50,
                temperature=0,
                messages=[{
                    "role": "user",
                    "content": f"{prompt}\n\nInput: {item['input']}"
                }]
            )
            output = message.content[0].text.strip().lower()
            if item["expected_output"].lower() in output:
                correct += 1
        return correct / len(self.test_set)

    def optimize(self, n_iterations: int = 10) -> tuple[str, float]:
        """Run OPRO for n_iterations. Returns best (prompt, score)."""
        # Start below any real score so the first prompt is kept even if it scores 0
        best_prompt, best_score = "", -1.0
        for i in range(n_iterations):
            prompt = self.generate_next_prompt()
            score = self.evaluate(prompt)
            self.history.append((prompt, score))
            print(f"Iteration {i+1}: score={score:.2%}, prompt='{prompt[:60]}...'")
            if score > best_score:
                best_score = score
                best_prompt = prompt
        return best_prompt, best_score
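The history section of the meta-prompt is the core of OPRO, so it can help to look at it in isolation. The sketch below is a standalone helper, not part of the `OPROOptimizer` class above; `format_history` and the sample scores are hypothetical. It keeps the `top_k` highest-scoring attempts and lists them from worst to best, without calling any API.

```python
def format_history(history: list[tuple[str, float]], top_k: int = 10) -> str:
    """Render the top_k highest-scoring (prompt, score) pairs, worst first."""
    top = sorted(history, key=lambda pair: pair[1])[-top_k:]
    return "\n".join(f"Prompt: {p}\nScore: {s:.2%}\n" for p, s in top)


# Made-up attempts and scores, for illustration only
history = [
    ("Label the review.", 0.55),
    ("Classify sentiment as positive, negative, or neutral.", 0.80),
    ("What is this?", 0.30),
]
text = format_history(history, top_k=2)
print(text)
```

Putting the best attempt last places it closest to the instruction that asks for an improvement. Keeping only the top few also bounds the size of the meta-prompt as the history grows.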
DSPy: Programs Not Prompts
APE and OPRO optimize individual prompts. DSPy (Khattab et al., "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines," 2023, Stanford) takes a fundamentally different approach: it treats the entire LLM pipeline as a program with learnable parameters.
The key insight: in a multi-step LLM pipeline, manually optimizing each prompt in isolation is suboptimal. The best prompt for step 2 depends on what step 1 produces. DSPy optimizes the whole pipeline end-to-end.
Core DSPy Concepts
Signature: Declares the input/output interface of an LLM call.
import dspy


class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of a product review."""

    review: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")
Module: A composable unit that wraps one or more LLM calls.
class ClassifyReview(dspy.Module):
    def __init__(self):
        super().__init__()  # lets dspy.Module register the sub-modules below
        self.classify = dspy.Predict(SentimentClassifier)

    def forward(self, review: str) -> str:
        return self.classify(review=review).sentiment
Optimizer (called a teleprompter in earlier DSPy releases): Automatically tunes the instructions, the few-shot examples, or both, for each module.
Compiling: Running the optimizer to produce an optimized version of your program.
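To give a feel for what compiling does, here is a plain-Python sketch of the bootstrapping idea behind few-shot optimizers: run the program on training examples and keep the runs that pass the metric as demonstrations. This is a conceptual illustration only, not DSPy's implementation; `bootstrap_demos`, `toy_program`, and `label_matches` are hypothetical names, and the toy program stands in for an LLM call.

```python
from typing import Callable


def bootstrap_demos(
    program: Callable[[str], str],
    trainset: list[dict],
    metric: Callable[[dict, str], bool],
    max_demos: int = 4,
) -> list[dict]:
    """Collect up to max_demos (input, output) pairs from runs that pass the metric."""
    demos: list[dict] = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["input"])
        if metric(example, prediction):
            demos.append({"input": example["input"], "output": prediction})
    return demos


# Toy program standing in for an LLM call: a keyword rule
def toy_program(review: str) -> str:
    return "positive" if "love" in review.lower() else "negative"


def label_matches(example: dict, prediction: str) -> bool:
    return example["label"] == prediction


trainset = [
    {"input": "Love it!", "label": "positive"},
    {"input": "Great value.", "label": "positive"},  # the toy program gets this one wrong
    {"input": "Broke in a day.", "label": "negative"},
]
demos = bootstrap_demos(toy_program, trainset, label_matches)
print(demos)
```

Only the runs that passed the metric survive as demonstrations, which is why the quality of the metric matters as much as the quantity of training data.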
A Complete DSPy Example
import dspy
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate import Evaluate

# Configure the LM backend
# DSPy supports Claude, OpenAI, and other providers.
# Recent DSPy releases take a "provider/model" string via dspy.LM;
# older releases used dspy.Claude(...) with dspy.settings.configure(...).
lm = dspy.LM("anthropic/claude-sonnet-4-6", api_key="your-api-key")
dspy.configure(lm=lm)


# Step 1: Define Signatures
class ExtractClaims(dspy.Signature):
    """Extract factual claims from a news article."""

    article: str = dspy.InputField()
    claims: list[str] = dspy.OutputField(desc="List of distinct factual claims made in the article")


class VerifyClaim(dspy.Signature):
    """Verify whether a factual claim is supported by evidence."""

    claim: str = dspy.InputField()
    context: str = dspy.InputField(desc="Evidence or context to check against")
    is_supported: bool = dspy.OutputField()
    confidence: float = dspy.OutputField(desc="Confidence score from 0 to 1")
    reasoning: str = dspy.OutputField(desc="Brief explanation of verdict")


class GenerateSummary(dspy.Signature):
    """Summarize fact-checking results for an article."""

    article: str = dspy.InputField()
    verified_claims: list[dict] = dspy.InputField()
    summary: str = dspy.OutputField(desc="Executive summary of fact-check results")
    overall_accuracy: str = dspy.OutputField(desc="mostly accurate / mixed / mostly inaccurate")


# Step 2: Build the Pipeline as a DSPy Module
class FactCheckPipeline(dspy.Module):
    """Multi-step fact-checking pipeline using DSPy."""

    def __init__(self):
        super().__init__()
        # Each of these can be individually optimized by DSPy
        self.extract_claims = dspy.Predict(ExtractClaims)
        self.verify_claim = dspy.ChainOfThought(VerifyClaim)
        self.summarize = dspy.Predict(GenerateSummary)

    def forward(self, article: str, evidence_docs: list[str]) -> dspy.Prediction:
        # Step 1: Extract claims from the article
        extraction = self.extract_claims(article=article)
        claims = extraction.claims

        # Step 2: Verify each claim against evidence
        context = "\n\n".join(evidence_docs)
        verified = []
        for claim in claims[:5]:  # Limit for cost
            verification = self.verify_claim(claim=claim, context=context)
            verified.append({
                "claim": claim,
                "is_supported": verification.is_supported,
                "confidence": verification.confidence,
                "reasoning": verification.reasoning,
            })

        # Step 3: Generate overall summary
        summary_result = self.summarize(
            article=article,
            verified_claims=verified
        )
        # Return a dspy.Prediction (the conventional return type for forward)
        return dspy.Prediction(
            claims=verified,
            summary=summary_result.summary,
            overall_accuracy=summary_result.overall_accuracy,
        )


# Step 3: Define Training Data
trainset = [
    dspy.Example(
        article="The COVID vaccine is 95% effective according to Pfizer's trial data.",
        evidence_docs=["Pfizer's Phase 3 trial showed 95% efficacy against symptomatic COVID-19."],
        expected_accuracy="mostly accurate"
    ).with_inputs("article", "evidence_docs"),
    # ... more examples
]


# Step 4: Define Evaluation Metric
def accuracy_metric(example, prediction, trace=None) -> bool:
    """Check if the overall accuracy verdict matches expected."""
    return example.expected_accuracy in prediction.overall_accuracy.lower()


# Step 5: Compile (Optimize)
optimizer = BootstrapFewShot(
    metric=accuracy_metric,
    max_bootstrapped_demos=4,  # Max bootstrapped few-shot examples per module
    max_labeled_demos=8,       # Max labeled examples to use
    max_rounds=2,              # Bootstrapping rounds
)

# The key step: compile bootstraps few-shot demonstrations for every module,
# keeping only traces where the end-to-end metric passed
pipeline = FactCheckPipeline()
optimized_pipeline = optimizer.compile(
    student=pipeline,
    trainset=trainset,
)

# Step 6: Evaluate
# trainset is reused here only to keep the example short;
# in practice pass a held-out dev set that the optimizer never saw
evaluator = Evaluate(devset=trainset, metric=accuracy_metric, num_threads=4)
score = evaluator(optimized_pipeline)
# The return type differs across DSPy versions, so print it without assuming a 0-1 fraction
print(f"Optimized pipeline score: {score}")

# Step 7: Save and load
optimized_pipeline.save("optimized_fact_checker.json")
# load() restores state in place, so build the pipeline first:
# loaded = FactCheckPipeline()
# loaded.load("optimized_fact_checker.json")
DSPy Optimizers Compared
Evaluation-Driven Prompt Engineering
DSPy formalizes a practice that every serious prompt engineer should follow regardless of whether they use DSPy: evaluation-driven development.
The workflow:
import anthropic
from dataclasses import dataclass
from typing import Callable

client = anthropic.Anthropic()


@dataclass
class EvalResult:
    prompt: str
    score: float
    n_correct: int
    n_total: int
    failures: list[dict]


def evaluate_prompt(
    prompt_template: str,
    eval_set: list[dict],
    metric: Callable[[str, str], bool],
    model: str = "claude-sonnet-4-6",
    temperature: float = 0,
) -> EvalResult:
    """
    Evaluate a prompt template against a labeled eval set.

    Args:
        prompt_template: Prompt with {input} placeholder
        eval_set: list of {"input": str, "expected": str}
        metric: Function(predicted, expected) -> bool
        model: Claude model to use
        temperature: Sampling temperature

    Returns:
        EvalResult with score and failure analysis
    """
    correct = 0
    failures = []
    for item in eval_set:
        prompt = prompt_template.format(input=item["input"])
        message = client.messages.create(
            model=model,
            max_tokens=200,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}]
        )
        predicted = message.content[0].text.strip()
        expected = item["expected"]
        if metric(predicted, expected):
            correct += 1
        else:
            failures.append({
                "input": item["input"],
                "expected": expected,
                "predicted": predicted,
            })
    # Guard against an empty eval set
    score = correct / len(eval_set) if eval_set else 0.0
    return EvalResult(
        prompt=prompt_template,
        score=score,
        n_correct=correct,
        n_total=len(eval_set),
        failures=failures
    )


def analyze_failures(result: EvalResult) -> dict:
    """Analyze failure patterns to guide prompt improvement."""
    if not result.failures:
        return {"status": "no failures"}
    # Ask the LLM to analyze the failure patterns
    failures_text = "\n".join([
        f"Input: {f['input']}\nExpected: {f['expected']}\nGot: {f['predicted']}"
        for f in result.failures[:20]  # Analyze up to 20 failures
    ])
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Analyze these prediction failures and identify patterns:
{failures_text}
Identify:
1. The main error categories (2-4 categories max)
2. What's causing each category of error
3. Specific prompt changes that would fix each error type
Be concise and actionable."""
        }]
    )
    return {
        "failure_count": len(result.failures),
        "failure_rate": 1 - result.score,
        "analysis": message.content[0].text.strip(),
        "sample_failures": result.failures[:5],
    }
# Example: iterative prompt improvement
# Labels use the same three classes the improved prompt asks for
eval_set = [
    {"input": "Revenue declined 12% YoY but margins improved", "expected": "mixed"},
    {"input": "All KPIs exceeded targets this quarter", "expected": "positive"},
    {"input": "Flat growth in core business, strong performance in new segments", "expected": "mixed"},
    # ... more examples
]


def label_match(predicted: str, expected: str) -> bool:
    """Lenient match: the expected label appears somewhere in the prediction."""
    return expected.lower() in predicted.lower()


# Iteration 1
prompt_v1 = "Describe the financial performance in this statement: {input}"
result_v1 = evaluate_prompt(prompt_v1, eval_set, label_match)
print(f"v1 score: {result_v1.score:.2%}")

# Analyze failures and improve
analysis = analyze_failures(result_v1)
print(f"Failure analysis:\n{analysis.get('analysis', 'no failures')}")

# Iteration 2: improved based on analysis
prompt_v2 = """Analyze this financial statement and classify performance as:
- positive: overall good results
- negative: overall poor results
- mixed: some good, some bad aspects
Statement: {input}
Respond with ONLY the classification: positive, negative, or mixed."""
result_v2 = evaluate_prompt(prompt_v2, eval_set, label_match)
print(f"v2 score: {result_v2.score:.2%}")
Prompt Versioning and A/B Testing
import hashlib
import json
import random
from datetime import datetime, timezone

import anthropic

client = anthropic.Anthropic()


class PromptRegistry:
    """Version-controlled prompt registry with A/B testing support."""

    def __init__(self):
        self.prompts: dict[str, dict] = {}

    def register(self, name: str, template: str, description: str = "") -> str:
        """Register a prompt version and return its version key."""
        # Short content hash for identification only (not a security use of MD5)
        prompt_hash = hashlib.md5(template.encode()).hexdigest()[:8]
        # Count versions by exact name so "sentiment" and "sentiment_long" stay separate
        n_existing = sum(1 for v in self.prompts.values() if v["name"] == name)
        version_key = f"{name}_v{n_existing + 1}"
        self.prompts[version_key] = {
            "name": name,
            "template": template,
            "hash": prompt_hash,
            "description": description,
            "registered_at": datetime.now(timezone.utc).isoformat(),
            "eval_scores": {}
        }
        return version_key

    def record_eval_score(self, version_key: str, dataset: str, score: float):
        """Record evaluation results for a prompt version."""
        if version_key in self.prompts:
            self.prompts[version_key]["eval_scores"][dataset] = {
                "score": score,
                "evaluated_at": datetime.now(timezone.utc).isoformat()
            }

    def get_best_version(self, name: str, dataset: str) -> str | None:
        """Get the best-performing version for a given dataset."""
        candidates = {
            k: v for k, v in self.prompts.items()
            if v["name"] == name and dataset in v["eval_scores"]
        }
        if not candidates:
            return None
        return max(candidates, key=lambda k: candidates[k]["eval_scores"][dataset]["score"])

    def ab_test(
        self,
        version_a: str,
        version_b: str,
        test_input: str,
        traffic_split: float = 0.5
    ) -> tuple[str, str]:
        """Route traffic between two prompt versions. Returns (version_used, output)."""
        version = version_a if random.random() < traffic_split else version_b
        template = self.prompts[version]["template"]
        prompt = template.format(input=test_input)
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=200,
            temperature=0,
            messages=[{"role": "user", "content": prompt}]
        )
        return version, message.content[0].text.strip()


# Usage
registry = PromptRegistry()
v1 = registry.register(
    "sentiment",
    "What is the sentiment? {input}",
    "Initial naive version"
)
v2 = registry.register(
    "sentiment",
    "Classify sentiment as positive/negative/neutral. Return only the word.\n\nText: {input}",
    "Improved with format constraint"
)
registry.record_eval_score(v1, "product_reviews", 0.71)
registry.record_eval_score(v2, "product_reviews", 0.84)
best = registry.get_best_version("sentiment", "product_reviews")
print(f"Best version: {best}")

# A/B test the two versions
for i in range(5):
    version_used, output = registry.ab_test(v1, v2, "This product is fantastic!")
    print(f"Version: {version_used}, Output: {output}")
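Routing traffic is only half of an A/B test; the other half is deciding whether the observed difference is more than noise. A minimal sketch using a two-proportion z-test follows. The sample sizes of 100 per version are an assumption made for illustration (the scores above do not say how many examples were evaluated), and `two_proportion_z` is a hypothetical helper name.

```python
import math


def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int) -> float:
    """z statistic for the difference in accuracy between version B and version A."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err if std_err > 0 else 0.0


# Assumed counts: 71/100 correct for v1 and 84/100 correct for v2
z = two_proportion_z(71, 100, 84, 100)
print(f"z = {z:.2f}, significant at the 5% level: {abs(z) > 1.96}")
```

With smaller samples the same 13-point gap may not clear the threshold, which is one reason to record the number of evaluated examples next to every score.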
When to Use DSPy vs. Manual Prompting
| Situation | Recommendation |
|---|---|
| Prototyping, early exploration | Manual prompting - faster iteration |
| Single-step task, stable requirements | Manual prompting with eval set |
| Multi-step pipeline | DSPy - end-to-end optimization is key |
| Frequent model changes | DSPy - compile once per model |
| Fewer than 20 labeled examples | APE or OPRO - DSPy needs more data |
| 50+ labeled examples | DSPy - BootstrapFewShot |
| 200+ labeled examples | DSPy - MIPRO or BootstrapFewShotWithRandomSearch (BFRS) for best results |
| Production, high-stakes task | DSPy + eval-driven development |
Production Engineering Notes
1. Treat Your Eval Set as Critical Infrastructure
# Your eval set is as important as your test suite
# Version control it alongside your code
eval_set = [
    {"input": "...", "expected": "...", "category": "edge_case_ambiguous"},
    {"input": "...", "expected": "...", "category": "standard"},
    # ...
]
# Never remove examples from the eval set
# Only add - otherwise you lose regression detection
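The `category` field on each eval example pays off when scores are broken down per category, since an overall number can hide a regression in one slice. A minimal sketch follows; it assumes each scored example has been reduced to a row with `category` and `correct` keys, which is a hypothetical shape and not something produced by the code above.

```python
from collections import defaultdict


def accuracy_by_category(results: list[dict]) -> dict[str, float]:
    """Group per-example outcomes by category and compute accuracy for each."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for row in results:
        totals[row["category"]] += 1
        hits[row["category"]] += int(row["correct"])
    return {category: hits[category] / totals[category] for category in totals}


# Hypothetical outcomes: one row per eval example after scoring
results = [
    {"category": "standard", "correct": True},
    {"category": "standard", "correct": True},
    {"category": "edge_case_ambiguous", "correct": False},
    {"category": "edge_case_ambiguous", "correct": True},
]
per_category = accuracy_by_category(results)
print(per_category)
```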
2. Monitor Prompt Performance Over Time
Models drift. Even if you don't change your prompt, model updates can change performance:
from datetime import datetime, timezone
from typing import Callable


def track_prompt_performance(
    prompt: str,
    eval_set: list[dict],
    metric: Callable,
    run_id: str
) -> dict:
    # evaluate_prompt is the helper defined in the evaluation-driven section above
    result = evaluate_prompt(prompt, eval_set, metric)
    return {
        "run_id": run_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "score": result.score,
        "n_total": result.n_total,
    }

# Run weekly or after any model update
# Alert if score drops more than 5 percentage points
3. Keep Prompt Compilation Offline
DSPy compilation involves many LLM calls, because each bootstrapping round runs the full pipeline on training examples. Run compilation offline:
# Run this as a CI/CD job, not in your serving path
optimized = optimizer.compile(student=pipeline, trainset=train_data)
optimized.save("optimized_pipeline_v3.json")
# Serving: just load
serving_pipeline = FactCheckPipeline()
serving_pipeline.load("optimized_pipeline_v3.json")
Common Mistakes
:::danger Mistake 1: No Eval Set You cannot do prompt engineering without a labeled eval set. "It looks good" is not a metric. Before writing a single prompt, define your success metric and collect 50-200 labeled examples. :::
:::danger Mistake 2: Optimizing Against Your Eval Set If you manually inspect failures and tune the prompt to fix them, then evaluate on the same set - you're overfitting. Hold out 20% of examples as a true test set, and tune only on the remaining 80%. :::
:::warning Mistake 3: Using DSPy for Simple Tasks DSPy adds complexity and requires a labeled dataset to compile. For a single-step task with stable requirements, a well-crafted manual prompt with an eval set is often better. Use DSPy when the pipeline complexity justifies it. :::
:::warning Mistake 4: Not Monitoring After Deployment Model providers update their models. Your prompt's performance can change without you changing anything. Set up periodic eval runs and alert on score degradation. :::
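For Mistake 2, the hold-out split is easiest to keep honest when it is deterministic, so the same examples land in the test set on every run. A minimal sketch follows; `split_eval_set`, the seed value, and the placeholder examples are assumptions made for illustration.

```python
import random


def split_eval_set(
    examples: list[dict], test_fraction: float = 0.2, seed: int = 0
) -> tuple[list[dict], list[dict]]:
    """Shuffle with a fixed seed and return (dev_set, test_set)."""
    shuffled = examples[:]  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder examples standing in for a labeled eval set
examples = [{"input": f"example {i}", "expected": "..."} for i in range(100)]
dev_set, held_out = split_eval_set(examples)
print(len(dev_set), len(held_out))
```

Tune prompts against `dev_set` only, and run `held_out` sparingly, for example once per release candidate.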
Interview Q&A
Q1: What is the core problem that automated prompt optimization solves?
Manual prompt engineering is a narrow, human-guided search in a vast space. Humans are limited by their intuitions about what makes a good prompt - which are often wrong. Manual optimization is brittle (changes in model, task, or data break hand-crafted prompts), non-compositional (optimizing each prompt in a pipeline independently doesn't optimize the pipeline globally), and doesn't scale with pipeline complexity. Automated prompt optimization - APE, OPRO, DSPy - replaces intuition-driven search with systematic, metric-driven optimization, searching more broadly and finding prompts that humans wouldn't have written.
Q2: How does APE work and what are its limitations?
APE (Automatic Prompt Engineer, Zhou et al. 2022) uses an LLM to generate candidate instruction prompts from demonstrations, evaluates each on a test set, and selects the best. It can optionally iterate by generating variations of the top candidates. Limitations: (1) It optimizes a single instruction, not the whole pipeline; (2) It requires a test set for evaluation; (3) It doesn't optimize few-shot examples - only the instruction; (4) The quality of generated candidates depends on the quality of demonstrations provided; (5) It's computationally expensive - evaluating N candidates requires N × |test_set| LLM calls.
Q3: What is DSPy's key innovation over APE and manual prompting?
DSPy treats the entire LLM pipeline as a program where prompts and few-shot examples are learnable parameters, not fixed strings. The key innovations: (1) Signatures: declarative input/output contracts that separate task specification from implementation; (2) Modules: composable building blocks (Predict, ChainOfThought, ReAct) that can be individually optimized; (3) Compiling: end-to-end optimization of the whole pipeline - not just individual prompts - guided by a metric evaluated on labeled data; (4) Portability: the program itself is model-agnostic, so switching models means re-compiling rather than rewriting prompts by hand. This enables systematic improvement of complex pipelines without manual prompt tuning.
Q4: What is DSPy's BootstrapFewShot optimizer and how does it work?
BootstrapFewShot automatically builds few-shot demonstrations for each module in your DSPy program. It runs the unoptimized program on training examples, records the inputs and outputs of every module along the way, and keeps the traces from runs where the full pipeline succeeded (as measured by your metric). Those traces become the few-shot demonstrations in the prompts. This is "bootstrapping" - using successful pipeline runs to generate examples. Unlike manual few-shot selection, it: considers the full pipeline (a demonstration is kept only when the end-to-end result passes, not when a single module looks right), is data-driven (examples are filtered by the metric rather than by intuition), and scales automatically with more training data. It does not search over which subset of demonstrations works best; BootstrapFewShotWithRandomSearch adds that step.
Q5: How do you set up evaluation-driven prompt engineering for a production task?
The workflow: (1) Define a clear success metric - binary (correct/incorrect), score-based, or LLM-judged; (2) Collect 50-200 labeled examples - split 80/20 into development and test sets; (3) Write an initial baseline prompt; (4) Evaluate on the development set; (5) Analyze failures - categorize error types to understand what's wrong; (6) Improve the prompt based on failure analysis; (7) Repeat until dev set score exceeds threshold; (8) Evaluate on test set to confirm no overfitting; (9) Deploy with ongoing monitoring - run eval weekly and alert on regressions. Never evaluate on the same examples you used to debug the prompt.
Q6: When would you use OPRO instead of DSPy for prompt optimization?
OPRO (Yang et al. 2023) is better when: you have a single-step task (not a pipeline), you have fewer than 50 labeled examples (DSPy works better with more data), you want a simple optimization loop without the full DSPy framework, or you want to understand the optimization process at a mechanistic level. OPRO's key advantage: it's simple - you just show the model its previous attempts and scores and ask it to improve. DSPy is better when: you have a multi-step pipeline, you want to optimize few-shot examples AND instructions jointly (MIPRO), you have enough labeled data (50+), or you want the robustness that comes with formal module composition and compilation.
